Data Preprocessing

By C. Kayathri
Student at ANJAC, Sivakasi

1
Data Preprocessing

 Definition
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Summary

2
Definition

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

3
Why Data Preprocessing?

Data in the real world is dirty:
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
○ e.g., occupation=“ ”
 noisy: containing errors or outliers
○ e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
○ e.g., Age=“42”, Birthday=“03/07/1997”
○ e.g., rating was “1, 2, 3”, now rating “A, B, C”
○ e.g., discrepancies between duplicate records

4
Why Is Data Preprocessing Important?

No quality data, no quality mining results!
 Quality decisions must be based on quality data
○ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
 A data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

5
Major Tasks in Data Preprocessing

 Data cleaning
 Data integration
 Data transformation
 Data reduction

6
Forms of Data Preprocessing

7
Data Cleaning

Importance
 “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—DCI survey

Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

8
Missing Data

Data is not always available
 e.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
 equipment malfunction
 data inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not considered important at the time of entry
 history or changes of the data not registered

Missing data may need to be inferred (see the sketch below).

9
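A minimal sketch of inferring a missing value, assuming Python with pandas; the table and column names are hypothetical stand-ins for the customer-income example above. Filling with the attribute mean is only one common choice.

import pandas as pd
import numpy as np

# Hypothetical sales data: customer B's income was never recorded.
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# One simple inference: fill the gap with the mean of the attribute.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)   # B's income becomes the mean of the other three values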
Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions

Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data

10
How to Handle Noisy Data?

Binning
 first sort the data and partition it into (equal-frequency) bins
 then smooth by bin means, bin medians, bin boundaries, etc.
Regression
 smooth by fitting the data to regression functions
Clustering
 detect and remove outliers (see the sketch below)
Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)

11
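A minimal sketch of clustering-based outlier detection, assuming one-dimensional toy data and two illustrative seed centroids: a tiny k-means is run, and values stranded in very small clusters fall outside the main groups, making them candidates for human inspection.

def kmeans_1d(points, centroids, iterations=10):
    # Tiny 1-D k-means: alternate assignment and centroid update.
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 180]
_, clusters = kmeans_1d(points, centroids=[10.0, 30.0])

# Values left in very small clusters are candidate outliers.
outliers = [p for c in clusters if len(c) <= 1 for p in c]
print(outliers)   # [180]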
Simple Discretization Methods: Binning

Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: a uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N (see the sketch below)
 The most straightforward method, but outliers may dominate the presentation
 Skewed data is not handled well

Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky

12
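A minimal sketch of both partitioning schemes in Python; the helper names are illustrative. equal_width_bins computes the interval edges from W = (B − A)/N, while equal_depth_bins splits the sorted values into bins of approximately equal size.

def equal_width_bins(values, n):
    # Equal-width partitioning: N intervals of width W = (B - A) / N.
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]   # the N+1 bin edges

def equal_depth_bins(values, n):
    # Equal-depth partitioning: sort, then give each bin roughly the
    # same number of samples (earlier bins absorb any remainder).
    s = sorted(values)
    size, rem = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        bins.append(s[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))   # [4.0, 14.0, 24.0, 34.0]
print(equal_depth_bins(prices, 3))   # the three bins used on the next slide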
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

13
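The two smoothing rules above are easy to reproduce; this Python sketch (helper names illustrative) regenerates the slide's numbers exactly.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[0:4], prices[4:8], prices[8:12]]   # the equi-depth bins above

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean (rounded, as on the slide).
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the nearer of the bin's min or max boundary.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]          # bins are already sorted
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]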
Regression

[Figure: points smoothed by fitting the regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1′]
14
Cluster Analysis

15
Data Integration

Data integration:
 combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-# (see the merge sketch below)
 integrate metadata from different sources
Entity identification problem:
 identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
 for the same real-world entity, attribute values from different sources differ
 possible reasons: different representations, different scales, e.g., metric vs. British units

16
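A minimal sketch of schema integration in pandas, assuming the cust-id ≡ cust-# correspondence from the slide; the tables and the other column names are hypothetical.

import pandas as pd

# Source A calls the key "cust-id"; source B calls the same attribute "cust-#".
a = pd.DataFrame({"cust-id": [1, 2], "name": ["Bill Clinton", "J. Doe"]})
b = pd.DataFrame({"cust-#": [1, 2], "balance": [100.0, 250.0]})

# Schema integration: rename B's key to match A's, then merge into one store.
b = b.rename(columns={"cust-#": "cust-id"})
merged = pd.merge(a, b, on="cust-id")
print(merged)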
Data Transformation

Smoothing: remove noise from the data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scale values to fall within a small, specified range (see the sketch below)
 min-max normalization
 z-score normalization
 normalization by decimal scaling

Attribute/feature construction
 new attributes constructed from the given ones

17
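A minimal sketch of the three normalization methods using their standard textbook formulas; the function names and example values are illustrative, and the decimal-scaling helper assumes the maximum absolute value is at least 1.

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Min-max: map v from [vmin, vmax] onto [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score: how many standard deviations v lies from the mean.
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # Decimal scaling: divide by 10**j for the smallest j that brings
    # the largest absolute value below 1 (assumes max_abs >= 1).
    j = len(str(int(max_abs)))
    return v / (10 ** j)

print(min_max(73600, vmin=12000, vmax=98600))   # ~0.711
print(z_score(73600, mean=54000, std=16000))    # 1.225
print(decimal_scaling(-986, max_abs=986))       # -0.986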
Data Reduction

Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies
 Data cube aggregation
 Dimensionality reduction — e.g., remove unimportant attributes
 Data compression
 Numerosity reduction — e.g., fit data into models (see the sampling sketch below)
 Discretization and concept hierarchy generation

18
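A minimal sketch of one simple numerosity-reduction strategy, random sampling; the sizes are illustrative. A small sample can stand in for the full data while producing almost the same analytical results.

import random

# Stand-in for a large data set (e.g., millions of numeric records).
full_data = list(range(1_000_000))

# Numerosity reduction: keep a random sample that is 0.1% of the volume.
sample = random.sample(full_data, 1_000)

# Sanity check: the sample mean closely tracks the full-data mean.
print(sum(sample) / len(sample))
print(sum(full_data) / len(full_data))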
Summary

Data preparation, or preprocessing, is a big issue for both data warehousing and data mining

Descriptive data summarization is needed for quality data preprocessing

Data preparation includes
 data cleaning and data integration
 data reduction and feature selection

Many methods have been developed, but data preprocessing is still an active area of research

19
Thank you!

20
