Data preprocessing


Published on March 10, 2014

Author: dineshbabuspr



Data Preprocessing By S.Dinesh Babu II MCA

Definition
• Data preprocessing is a data mining technique that transforms raw data into an understandable format.
• Data in the real world is dirty: it is often incomplete, noisy, and inconsistent.

Measures of data quality: a multidimensional view
◦ Accuracy: are the values correct or wrong?
◦ Completeness: are values unrecorded or unavailable?
◦ Consistency: were some copies modified but not others? Are there dangling references?
◦ Timeliness: is the data updated on time?
◦ Believability: how far can the data be trusted to be correct?
◦ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization

Data Cleaning: Incomplete Data
• Data is not always available.
• Example: Age = " " (a blank value).
• Missing data may be due to:
◦ equipment malfunction
◦ data that was inconsistent with other recorded data and was therefore deleted
◦ data not entered because of a misunderstanding
◦ data not considered important at the time of entry
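As a minimal sketch of how incomplete records might be detected, the snippet below counts missing values per field. It assumes records are plain dictionaries and that `None` or a blank string marks a missing value; the function names, field names, and sample records are all hypothetical.

```python
def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_counts(records, fields):
    """Count, for each field, how many records have no usable value."""
    return {f: sum(is_missing(r.get(f)) for r in records) for f in fields}


# Hypothetical sample records: two of the three have no usable age.
records = [
    {"name": "A", "age": "34"},
    {"name": "B", "age": " "},
    {"name": "C", "age": None},
]
counts = missing_counts(records, ["name", "age"])
print(counts)
```

A count like this only locates the gaps. Whether to drop, ignore, or fill the affected records is a separate decision that depends on why the data is missing.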

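Noisy Data
• Noisy data is corrupted or meaningless data, such as unstructured values that a program cannot interpret correctly.
• It unnecessarily increases the amount of storage space required.
• Causes:
◦ hardware failure
◦ programming errors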

Data Cleaning as a Process
• Missing values, noise, and inconsistencies all contribute to inaccurate data.
• The first step in data cleaning as a process is discrepancy detection.
• Discrepancies can be caused by several factors, including:
◦ poorly designed data entry forms
◦ human error in data entry

The data should also be examined against the following rules:
◦ Unique rule: each value of the given attribute must differ from all other values of that attribute.
◦ Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.
◦ Null rule: specifies the use of blanks, question marks, special characters, or other strings that indicate the null condition.
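The three rules above can be sketched as simple checks over one attribute's values. This is an illustration only: the function names, the set of null markers, and the sample invoice numbers are assumptions, and the consecutive check assumes integer-valued attributes.

```python
def check_unique(values):
    """Unique rule: no value may appear more than once."""
    return len(set(values)) == len(values)


def check_consecutive(values):
    """Consecutive rule: every integer between the lowest and highest value is present."""
    present = set(values)
    return all(v in present for v in range(min(values), max(values) + 1))


def find_nulls(values, null_markers=("", "?", "N/A")):
    """Null rule: return the positions holding a declared null marker (markers are assumed)."""
    return [i for i, v in enumerate(values) if v is None or v in null_markers]


# Hypothetical invoice numbers: 104 is duplicated and 103 is absent.
invoice_ids = [101, 102, 104, 104]
print(check_unique(invoice_ids))
print(check_consecutive(invoice_ids))
print(find_nulls(["12", "?", "", "7"]))
```

Both rule checks fail for the sample invoice numbers, which is the kind of discrepancy this step is meant to surface.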

Data Integration
• Data integration is the merging of data from multiple data stores.
• Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set.
• This improves the accuracy and speed of the subsequent data mining process.
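A minimal sketch of integration, assuming each data store is a dictionary keyed by a shared record identifier. The `integrate` function, the two stores, and their contents are hypothetical; real integration also has to match differently named attributes and entities, which this sketch does not attempt.

```python
def integrate(source_a, source_b):
    """Merge two {key: record} stores and report conflicting attribute values."""
    merged, conflicts = {}, []
    for key in sorted(set(source_a) | set(source_b)):
        record = dict(source_a.get(key, {}))
        for attr, value in source_b.get(key, {}).items():
            if attr in record and record[attr] != value:
                # Same attribute, different values: keep the first and log the conflict.
                conflicts.append((key, attr, record[attr], value))
            else:
                record[attr] = value
        merged[key] = record
    return merged, conflicts


# Hypothetical stores that disagree on the city of customer 1.
crm = {1: {"name": "Asha", "city": "Chennai"}, 2: {"name": "Ravi"}}
billing = {1: {"city": "Madurai", "balance": 120}, 3: {"name": "Mala"}}
merged, conflicts = integrate(crm, billing)
print(merged)
print(conflicts)
```

Keeping the value from the first store is an arbitrary choice made for the sketch. The useful part is that each inconsistency is reported rather than silently overwritten.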

Data Reduction
• The goal is to obtain a reduced representation of the data set that is much smaller in volume.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as:
◦ parametric models
◦ nonparametric methods such as clustering, sampling, and the use of histograms
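Two of the nonparametric numerosity reduction methods named above, histograms and sampling, can be sketched with the standard library. The function names, the bucket count, and the price list are assumptions chosen for illustration, and the histogram assumes the values are not all identical.

```python
import random
from collections import Counter


def equal_width_histogram(values, n_buckets):
    """Replace raw values with a count per equal-width bucket."""
    low, high = min(values), max(values)
    width = (high - low) / n_buckets
    counts = Counter()
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - low) / width), n_buckets - 1)
        counts[idx] += 1
    return {(low + i * width, low + (i + 1) * width): counts[i]
            for i in range(n_buckets)}


def simple_random_sample(values, k, seed=0):
    """Draw k values without replacement (the seed is fixed only for repeatability)."""
    return random.Random(seed).sample(values, k)


# Hypothetical price data: 15 raw values.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 30]
histogram = equal_width_histogram(prices, 3)
sample = simple_random_sample(prices, 5)
print(histogram)
print(sample)
```

The histogram stores three bucket counts in place of fifteen values, and the sample stores five of them. Either one is a smaller stand-in for the original data.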

Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
◦ min-max normalization
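Min-max normalization maps a value v linearly from the attribute's original range [min, max] to a new range [new_min, new_max], using v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows; the function name, the default target range of [0, 1], and the income figures are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are identical; min-max scaling is undefined")
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical income values to be scaled into [0, 1].
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```

The smallest value maps to the lower bound and the largest to the upper bound. A value that arrives later and falls outside the original range would map outside the target range, a known limitation of this method.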

Data Discretization
• Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Discretization reduces data size
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepares the data for further analysis, e.g., classification
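Replacing values with interval labels can be sketched as follows, assuming the cut points have already been chosen. The `discretize` function, the cut points 20 and 50, and the label names are hypothetical; choosing good split points, whether top-down or bottom-up, is the harder problem and is not shown here.

```python
from bisect import bisect_right


def discretize(values, cut_points, labels):
    """Map each value to the label of the interval it falls into.

    cut_points must be sorted; a value equal to a cut point goes to the upper interval.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[bisect_right(cut_points, v)] for v in values]


# Hypothetical ages split into three intervals: below 20, 20 up to 50, 50 and over.
ages = [13, 22, 35, 46, 61, 70]
age_groups = discretize(ages, cut_points=[20, 50],
                        labels=["young", "middle-aged", "senior"])
print(age_groups)
```

Six distinct numeric values collapse to three labels, which is how discretization reduces data size and produces categories that a classifier can consume.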

• Three types of attributes
◦ Nominal: values from an unordered set, e.g., color, profession
◦ Ordinal: values from an ordered set, e.g., military or academic rank
◦ Numeric: quantitative values, e.g., integers or real numbers

