Data Preprocessing:
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Mining: Concepts and Techniques
Data Cleaning
Binning: smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of bins
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
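The binning example above can be reproduced with a short sketch. The function names below are illustrative and not from the slides; the bin-boundaries variant assumes the usual convention of replacing each value with the closer of its bin's minimum and maximum.

```python
from statistics import mean

def equal_frequency_bins(sorted_values, depth):
    """Split an already-sorted list into consecutive bins of `depth` values each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace every value with the closer of its bin's min and max (assumed convention)."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted price data from the example
bins = equal_frequency_bins(prices, depth=3)
by_means = smooth_by_bin_means(bins)
by_boundaries = smooth_by_bin_boundaries(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin means gives the values listed above; smoothing by boundaries keeps each bin's extremes and snaps the interior values to the nearer one.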
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
There are a number of issues to consider during data integration. Schema integration
and object matching can be tricky.
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric vs.
British units
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple
databases
Object identification: The same attribute or object may
have different names in different databases
Χ2 (chi-square) test
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
The larger the Χ2 value, the more likely the variables are
related
The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
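A minimal sketch of the Χ2 computation for a contingency table follows. It assumes the expected count of a cell is (row total × column total) / grand total, the usual definition under independence. The 2×2 table is hypothetical and chosen only for illustration; it is not data from these slides.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

table = [[10, 20], [30, 40]]  # hypothetical observed counts
chi2 = chi_square(table)
print(round(chi2, 4))
```

Cells whose observed count lies far from the expected count add the largest terms to the sum, which is the point made above.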
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                      Play chess   Not play chess   Sum (row)
Like science fiction  250 (90)     200 (360)        450
Scatter plots showing the similarity from –1 to 1.
Data Cleaning
Data Integration
Data Reduction
Summary
Data Reduction Strategies
Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering.
Data Cube Aggregation
Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left,
the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales
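The quarter-to-year roll-up described in the caption can be sketched as follows. The sales amounts are hypothetical placeholders, not the values shown in Figure 2.13, and `aggregate_by_year` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical quarterly sales keyed by (year, quarter); NOT the figure's data.
quarterly_sales = {
    (2002, "Q1"): 100, (2002, "Q2"): 150, (2002, "Q3"): 120, (2002, "Q4"): 130,
    (2003, "Q1"): 110, (2003, "Q2"): 160, (2003, "Q3"): 140, (2003, "Q4"): 190,
    (2004, "Q1"): 200, (2004, "Q2"): 210, (2004, "Q3"): 190, (2004, "Q4"): 200,
}

def aggregate_by_year(sales):
    """Sum quarterly amounts into one total per year."""
    annual = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        annual[year] += amount
    return dict(annual)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

The aggregated table has one row per year instead of four, so it is smaller but still answers any query that only needs annual totals.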
The “best” (and “worst”) attributes are typically determined using tests of
statistical significance.
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
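As one illustration of such a transformation (a standard textbook case, not taken from these slides), a second-degree polynomial in a single variable becomes a multiple regression after a change of variables:

```latex
\begin{aligned}
Y &= b_0 + b_1 X + b_2 X^2 \\
\text{let } X_1 &= X,\quad X_2 = X^2 \\
Y &= b_0 + b_1 X_1 + b_2 X_2
\end{aligned}
```

The coefficients b0, b1, b2 can then be estimated with the same least-squares machinery used for the linear form.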
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
Useful for dimensionality reduction and data smoothing
Histogram Analysis
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR]
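The two schemes in the figure can be sketched with Python's standard `random` module, reading SRSWR as simple random sampling with replacement. The raw data (tuple identifiers 1 to 100), the sample size and the seed are assumptions made for illustration.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return rng.choices(data, k=n)

raw_data = list(range(1, 101))   # hypothetical tuple identifiers
rng = random.Random(0)           # fixed seed so the sketch is repeatable
without_repl = srswor(raw_data, 5, rng)
with_repl = srswr(raw_data, 5, rng)
print(without_repl, with_repl)
```

With SRSWOR the sample can never exceed the size of the data set; with SRSWR it can, because drawn tuples are placed back before the next draw.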
Sampling: Cluster or Stratified Sampling
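Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$
Ex. With $\min_A = 12{,}000$, $\max_A = 98{,}000$ and target range [0.0, 1.0], 73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization: $v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1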
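The three normalizations can be checked numerically with a small sketch. Function names are illustrative; the decimal-scaling input (-986 and 917) is a hypothetical pair of values, and the helper assumes a non-negative exponent j.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)   # example from the text
income_z = z_score(73_600, 54_000, 16_000)         # example from the text
scaled, j = decimal_scaling([-986, 917])           # hypothetical values
print(round(income_min_max, 3), round(income_z, 3), scaled, j)
```

Min-max needs the attribute's range in advance, z-score needs only the mean and standard deviation, and decimal scaling needs only the largest absolute value.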
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Reduce data size by discretization
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
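As a rough sketch of unsupervised discretization, the code below uses equal-width intervals, one simple choice among the methods this slide covers. The input values and the number of intervals are hypothetical, and `equal_width_discretize` is an illustrative name.

```python
def equal_width_discretize(values, k):
    """Assign each value the index and label of one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    indices = []
    for v in values:
        if width == 0:
            idx = 0                      # all values identical: single interval
        else:
            # The maximum value is clamped into the last interval
            idx = min(int((v - lo) / width), k - 1)
        indices.append(idx)
    labels = [f"[{lo + i * width}, {lo + (i + 1) * width}]" for i in indices]
    return indices, labels

values = [0, 5, 10, 15, 20, 30]          # hypothetical numeric attribute
indices, labels = equal_width_discretize(values, k=3)
print(indices)
print(labels)
```

Each original value is replaced by its interval label, so six distinct numbers collapse into three categories.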
In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.
Given the following data (in increasing order) for the attribute age: 13, 15, 16, 16,
19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Categorize using a bin depth of 3 and smooth by bin means. Illustrate your
steps. Comment on the effect of this technique for the given data.
What other methods are there for data smoothing?
How might you determine outliers in the data?
Normalize the values of the attributes to the range [-1,1].
Discuss the differences among the following attribute subset selection methods: stepwise
forward selection, stepwise backward elimination, and a combination of the two.