Data Preprocessing: Why Preprocess the Data?
A well-accepted multi-dimensional view of data quality:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation (a small normalization sketch follows this list)
Data reduction
Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
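To make the transformation step concrete, here is a minimal min-max normalization sketch (the attribute values and target range below are illustrative assumptions, not from the text):

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max] (min-max normalization)."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Example: rescale a hypothetical income attribute to [0, 1].
incomes = [12000, 58000, 31000, 99000]
print(min_max_normalize(incomes))  # approximately [0. 0.529 0.218 1.]
```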
Importance
“Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values (a small filling sketch appears below)
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Dirty data has many sources, e.g., technology limitations, incomplete data, and inconsistent data
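As a small illustration of the first cleaning task, a sketch that fills a missing value with the attribute mean using pandas (the table and column names are made-up):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing income value.
df = pd.DataFrame({
    "age": [23, 41, 35, 29],
    "income": [48000, np.nan, 61000, 52000],
})

# One common strategy: replace the missing value with the attribute mean.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```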
Clustering: detect and remove outliers
Regression: smooth by fitting the data into regression functions
[Figure: data points with a fitted regression line y = x + 1]
Correlation coefficient (Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means, and $\sigma_A$, $\sigma_B$ are the respective standard deviations of A and B
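A minimal sketch of this computation (the two sample attributes are made-up):

```python
import numpy as np

def pearson_r(a, b):
    """Correlation coefficient r_{A,B}; ddof=1 matches the (n-1) term above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

a = [2.0, 4.0, 6.0, 8.0]
b = [1.1, 2.0, 2.9, 4.2]
print(pearson_r(a, b))          # close to +1: strongly positively correlated
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in gives the same value
```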
χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
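A minimal sketch of the test on a 2x2 contingency table with SciPy (the counts are made-up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = plays chess / does not,
# columns = likes science fiction / does not.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # large chi-square and tiny p-value: the attributes look related
print(expected)       # counts expected if the attributes were independent
```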
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
Data cube aggregation
Dimensionality reduction (e.g., remove unimportant attributes)
Data compression
Numerosity reduction (e.g., fit data into models)
Attribute subset selection: select a minimum set of features so that the resulting patterns are fewer and easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
[Figure: decision tree splitting on A4, then A1 and A6; the attributes appearing in the tree form the reduced attribute set]
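A hedged sketch of decision-tree induction used for attribute selection with scikit-learn (the synthetic data, tree depth, and selection rule are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 6))                          # six candidate attributes A1..A6
y = (X[:, 3] + 0.5 * X[:, 0] > 0.9).astype(int)   # only A4 and A1 actually matter

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep the attributes the tree actually splits on; drop the rest.
selected = [f"A{i + 1}" for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print(selected)  # expected to be dominated by A4 and A1
```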
Data Compression
String compression:
Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression:
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing the whole
[Figure: lossless compression recovers the original data; lossy compression yields only an approximation of the original data]
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Each input data vector is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance" or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance
[Figure: principal components Y1 and Y2 of data plotted in the (X1, X2) plane]
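A minimal PCA sketch with NumPy that keeps only the strongest component (the correlated 2-D data is made-up):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=300)])  # X2 tracks X1

Xc = X - X.mean(axis=0)                    # center the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eigvals)[::-1]          # sort by decreasing "significance"
top = eigvecs[:, order[:1]]                # keep k = 1 principal component

scores = Xc @ top                          # reduced representation: 300 x 1
approx = scores @ top.T + X.mean(axis=0)   # reconstruction from one component
print(np.abs(X - approx).max())            # small: one component captures most structure
```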
Data Reduction Method (1): Regression and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line
and are to be estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the
above
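A minimal least-squares sketch with NumPy (the sample points are made-up):

```python
import numpy as np

# Noisy samples of an underlying line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = np.polyfit(x, y, deg=1)  # least squares estimates of slope and intercept
print(w, b)                     # close to 2 and 1

# The fitted line (two numbers) can stand in for the raw data.
y_hat = w * x + b
```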
Log-linear models:
The multi-way table of joint probabilities is
approximated by a product of lower-order tables
Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\gamma_{ad}\,\delta_{bcd}$
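As the simplest instance of this idea, a sketch that approximates a joint table by the product of its one-way marginals, i.e., the fully independent log-linear model (the table values are made-up):

```python
import numpy as np

# Hypothetical joint probability table over two binary variables a and b.
joint = np.array([[0.30, 0.20],
                  [0.15, 0.35]])

p_a = joint.sum(axis=1)  # lower-order table p(a)
p_b = joint.sum(axis=0)  # lower-order table p(b)

approx = np.outer(p_a, p_b)          # product of lower-order tables
print(approx)
print(np.abs(joint - approx).max())  # approximation error under independence
```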
Data Reduction Method (2): Histograms
Divide data into buckets and store the average (or sum) for each bucket
[Figure: example histogram; axis values range from 20,000 to 100,000]
Max-diff: set bucket boundaries between each pair of adjacent values for the pairs having the β−1 largest differences
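A minimal equal-width bucketing sketch with NumPy (the data distribution and bucket count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=15, size=1000)

# Equal-width histogram: keep only per-bucket counts and means.
counts, edges = np.histogram(data, bins=10)
bucket_ids = np.digitize(data, edges[1:-1])
bucket_means = np.array([data[bucket_ids == i].mean() for i in range(10)])

# Reduced representation: 10 (count, mean) pairs instead of 1000 raw values.
for c, m in zip(counts, bucket_means):
    print(c, round(float(m), 1))
```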
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
Cluster analysis will be studied in depth in Chapter 7
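A hedged sketch of clustering-based reduction with scikit-learn, storing only a centroid and diameter per cluster (the blob data and cluster count are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Store (centroid, diameter) per cluster instead of all 300 points.
for k, center in enumerate(km.cluster_centers_):
    members = X[km.labels_ == k]
    diameter = 2 * np.linalg.norm(members - center, axis=1).max()
    print(k, center.round(2), round(float(diameter), 2))
```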
Sampling: Cluster or Stratified Sampling