Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of interest
Incomplete (Missing) Data
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Clustering
detect and remove outliers
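The binning step above can be sketched as follows. This is a minimal illustration, assuming smoothing by bin means as the smoothing rule (the slide names only the sort-and-partition step). The function name `smooth_by_bin_means` and the sample prices are hypothetical.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-frequency bins, replace each value by its bin mean."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed = []
    start = 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([mean] * len(bin_values))
        start = end
    return smoothed


# Hypothetical prices, partitioned into three bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Other smoothing rules (bin medians, bin boundaries) would change only the line that computes `mean`.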
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
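As one possible illustration of metadata-driven discrepancy detection, the sketch below checks each record against a declared range and domain. The metadata layout, the attribute names, and the records are all hypothetical. Dependency and distribution checks are not covered.

```python
# Hypothetical metadata: a numeric range for `age`, an allowed domain for `state`.
METADATA = {
    "age": {"range": (0, 120)},
    "state": {"domain": {"IL", "NY", "CA"}},
}


def find_discrepancies(records, metadata):
    """Return (row index, attribute, value) for every value that violates the metadata."""
    problems = []
    for i, record in enumerate(records):
        for attr, rule in metadata.items():
            value = record.get(attr)
            if value is None:
                problems.append((i, attr, value))  # missing value
            elif "range" in rule and not (rule["range"][0] <= value <= rule["range"][1]):
                problems.append((i, attr, value))  # outside the declared range
            elif "domain" in rule and value not in rule["domain"]:
                problems.append((i, attr, value))  # not in the declared domain
    return problems


records = [
    {"age": 34, "state": "IL"},
    {"age": -5, "state": "NY"},
    {"age": 51, "state": "Illinois"},
]
print(find_discrepancies(records, METADATA))
```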
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources
are different
Possible reasons: different representations, different scales, e.g., metric
vs. British units
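A minimal sketch of schema integration and value-conflict resolution, assuming each source comes with a hand-written field mapping. The field names `cust-id` and `cust-#` come from the slide. The weight attributes, the mapping tables, and the records are hypothetical.

```python
LB_TO_KG = 0.45359237  # one international avoirdupois pound, in kilograms

# Hypothetical per-source mappings: source field -> (unified field, converter).
SOURCE_A_MAPPING = {
    "cust-id": ("cust_id", lambda v: v),
    "weight_kg": ("weight_kg", lambda v: v),
}
SOURCE_B_MAPPING = {
    "cust-#": ("cust_id", lambda v: v),
    "weight_lb": ("weight_kg", lambda v: round(v * LB_TO_KG, 2)),
}


def to_unified(record, mapping):
    """Rename fields and convert units so records from both sources are comparable."""
    return {new: convert(record[old]) for old, (new, convert) in mapping.items()}


a = to_unified({"cust-id": 17, "weight_kg": 2.0}, SOURCE_A_MAPPING)
b = to_unified({"cust-#": 17, "weight_lb": 4.4}, SOURCE_B_MAPPING)
print(a)
print(b)
```

Once both records share one schema and one unit, remaining differences in value can be treated as genuine conflicts rather than differences in representation.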
Handling Redundancy in Data Integration
Chi-Square Calculation: An Example
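Degrees of freedom: (r-1)(c-1), for a contingency table with r rows and c columns
Level of significance:
Chi-square statistic > critical value at the chosen level of significance: reject the null hypothesis (that the two attributes are independent)
Otherwise, do not reject the null hypothesis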
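The chi-square test above can be sketched as follows. The observed counts are hypothetical, and the function name `chi_square` is illustrative. The critical value 3.841 is the standard chi-square table entry for 1 degree of freedom at the 0.05 level.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of the two attributes.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table of observed counts.
observed = [[250, 200],
            [50, 1000]]
statistic, dof = chi_square(observed)

# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_VALUE = 3.841
reject_null = statistic > CRITICAL_VALUE
print(round(statistic, 2), dof, reject_null)
```

With these counts the statistic is far above the critical value, so the hypothesis that the two attributes are independent would be rejected.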
Correlation coefficient:
Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
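The question can be checked numerically. The sketch below computes the population covariance as E[AB] - E[A]E[B] and the Pearson correlation coefficient for the five value pairs listed above. The function names are illustrative.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: E[XY] - E[X]E[Y]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
print(covariance(stock_a, stock_b))   # positive, so the prices tend to rise together
print(correlation(stock_a, stock_b))
```

Here E[A] = 4, E[B] = 9.6, and E[AB] = 212/5 = 42.4, so the covariance is 42.4 - 4 × 9.6 = 4. A positive covariance indicates that the two stocks tend to rise together.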
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one or
more other attributes
E.g., purchase price of a product and the amount of sales
tax paid
Irrelevant attributes
Contain no information that is useful for the data mining
task at hand
E.g., a student's ID is often irrelevant to the task of predicting
the student's GPA
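One simple way to flag candidate redundant attributes is to look for pairs with a very high absolute correlation. The sketch below assumes numeric attributes and a linear relationship. The 0.95 threshold, the column names, and the data are hypothetical, and nonlinear redundancy would not be caught.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical data: sales tax is a fixed 8% of the purchase price.
prices = [10.0, 25.0, 40.0, 80.0]
columns = {
    "purchase_price": prices,
    "sales_tax": [0.08 * p for p in prices],
    "quantity": [3, 1, 4, 2],
}
print(redundant_pairs(columns))
```

Only one attribute from each flagged pair needs to be kept. Irrelevant attributes such as an ID cannot be found this way, since relevance depends on the mining task.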
Heuristic Search in Attribute Selection
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the
important information in a data set more effectively than the
original ones
Three general methodologies
Attribute extraction
Domain-specific
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
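Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation)
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that Max(|v'|) < 1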
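The three normalization methods can be sketched in a few lines. The function names are illustrative. The decimal-scaling sketch assumes j is searched over non-negative integers only, and the values -986 and 917 are a hypothetical input.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer that makes every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
scaled, j = decimal_scaling([-986, 917])            # hypothetical values
print(scaled, j)
```

Min-max normalization needs the attribute's minimum and maximum in advance, so a later value outside that range falls outside the target interval. Z-score normalization does not have this limitation.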
Discretization
Three types of attributes
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Numeric: quantitative values, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
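A minimal sketch of unsupervised discretization, assuming equal-width intervals as the splitting rule (one of several possible rules). The function name and the sample ages are hypothetical.

```python
def equal_width_labels(values, n_intervals):
    """Replace each value by the label of the equal-width interval that contains it."""
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    labels = []
    for v in values:
        if width > 0:
            # The maximum value would index one past the end; clamp it into the last interval.
            index = min(int((v - low) / width), n_intervals - 1)
        else:
            index = 0  # all values are identical
        left = low + index * width
        # Labels are written half-open, but the last interval also includes the maximum.
        labels.append(f"[{left:g}, {left + width:g})")
    return labels


ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]  # hypothetical values
print(equal_width_labels(ages, 3))
```

Nine distinct ages are reduced to three interval labels, which is the sense in which discretization reduces data size.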
Simple Discretization: Binning
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem; Remove redundancies; Detect
inconsistencies
Data reduction
Dimensionality reduction; Numerosity reduction; Data
compression
Data transformation and data discretization
Normalization; Concept hierarchy generation