Data Preprocessing
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
June 7, 2014
Why preprocess the data? Real-world data is dirty:
Incomplete: lacking attribute values, e.g., occupation = ""
Noisy: containing errors or outliers, e.g., Salary = "-10"
Major Tasks in Data Preprocessing
Data cleaning
Data integration
Data reduction
Data transformation
Data discretization
Motivation: to better understand the data (its central tendency, variation, and spread)
Measuring the Central Tendency
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: the middle value of the sorted data; for grouped data, estimated by interpolation: $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
Mode: the value that occurs most frequently in the data
Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
Boxplot: the ends of the box are the quartiles (Q1 and Q3), the median is marked, whiskers extend to the minimum and maximum, and outliers are plotted individually.
Variance and standard deviation:
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
The standard deviation ($s$ or $\sigma$) is the square root of the variance.
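As a concrete check of these definitions, Python's standard `statistics` module computes the same quantities; the data list below is a made-up example.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]  # hypothetical attribute values

mean = statistics.mean(data)        # (1/n) * sum(x_i)
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value
s2 = statistics.variance(data)      # sample variance, 1/(n-1) factor
sigma2 = statistics.pvariance(data) # population variance, 1/N factor
```

Note that `variance` and `pvariance` differ exactly by the $\frac{1}{n-1}$ vs. $\frac{1}{N}$ factor above.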
Boxplot Analysis
Histogram Analysis
Scatter plot
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Missing Data
Data is not always available; missing values may be due to, e.g., equipment malfunction.
How to handle missing data:
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in automatically, e.g., with the attribute mean, or (smarter) with the attribute mean for all samples belonging to the same class.
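A minimal sketch of the "smarter" option, class-conditional mean imputation, assuming records are (class label, value) pairs with `None` marking a missing value; all data here is hypothetical.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [("yes", 40.0), ("yes", 60.0), ("yes", None),
           ("no", 20.0), ("no", None), ("no", 30.0)]

# Mean income per class, computed over the non-missing values only.
class_means = {}
for label in {lbl for lbl, _ in records}:
    vals = [v for lbl, v in records if lbl == label and v is not None]
    class_means[label] = mean(vals)

# Replace each missing value with the mean of its own class.
imputed = [(lbl, v if v is not None else class_means[lbl])
           for lbl, v in records]
```

Using the per-class mean rather than the global mean keeps the imputed value closer to similar tuples.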
Noisy Data
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
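The binning steps above can be sketched as follows: sort, partition into equal-frequency bins, then replace each value by its bin mean. The price list is an illustrative example, and the even split assumes the data size is divisible by the bin count.

```python
prices = [15, 4, 8, 9, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical values

# Step 1: sort the data.
data = sorted(prices)

# Step 2: partition into equal-frequency bins (assumes len divisible by n_bins).
n_bins = 3
size = len(data) // n_bins

# Step 3: smooth by bin means -- every value becomes its bin's mean.
smoothed = []
for b in range(n_bins):
    bin_vals = data[b * size:(b + 1) * size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))
```

Smoothing by bin medians or bin boundaries only changes the replacement value computed per bin.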
Equal-width partitioning: if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
Regression
[Figure: data points fitted by the regression line y = x + 1, with axes x and y]
Cluster Analysis
Data Integration
Data integration:
Combines data from multiple sources into a coherent
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different
scales, e.g., metric vs. British units
Redundant attributes may be detected by correlation analysis.
Correlation coefficient (Pearson's product-moment coefficient):
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}$
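A small sketch of the coefficient above; the $(n-1)\sigma_A\sigma_B$ denominator is folded into the two square-root terms, since the $(n-1)$ factors cancel.

```python
from math import sqrt

def pearson_r(a, b):
    """r_{A,B} = sum((a - mean_a)(b - mean_b)) / ((n-1) * sigma_a * sigma_b)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))   # sqrt of (n-1) * sigma_a^2
    sb = sqrt(sum((y - mb) ** 2 for y in b))   # sqrt of (n-1) * sigma_b^2
    return cov / (sa * sb)
```

Values near +1 indicate positive correlation (A and B may be redundant), near 0 independence, and near -1 negative correlation.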
χ² (chi-square) test:
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality
Example (contingency table; expected counts in parentheses):

                           Play chess   Not play chess   Sum (row)
Like science fiction         250 (90)        200 (360)         450
Not like science fiction      50 (210)      1000 (840)        1050
Sum (col.)                   300            1200              1500

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
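The χ² value for this table can be reproduced directly: each expected count is row_sum × col_sum / total, taken from the marginals.

```python
# Observed counts from the 2x2 contingency table above.
observed = [[250, 200], [50, 1000]]

row_sums = [sum(row) for row in observed]        # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                            # 1500

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected
# chi2 is approximately 507.93
```

The large value indicates the two attributes are strongly correlated in this group, though correlation still does not imply causality.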
Data Transformation
Normalization: scale the values to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction: new attributes constructed from the given ones
Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
Z-score normalization: $v' = \frac{v - \bar{A}}{\sigma_A}$; e.g., with mean 54,000 and standard deviation 16,000: $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
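A minimal sketch of the three normalizations; the income figures used in the example call (mean 54,000, standard deviation 16,000, min 12,000, max 98,000) are assumed example values.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    """Decimal scaling: divide by 10^j, with j chosen so max(|v'|) < 1."""
    return v / 10 ** j

print(z_score(73_600, 54_000, 16_000))   # 1.225
print(min_max(73_600, 12_000, 98_000))   # about 0.716
```

Min-max preserves the shape of the original distribution but is sensitive to outliers; z-score is preferable when the actual min and max are unknown or dominated by outliers.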
Lattice of cuboids for the dimensions time, item, location, supplier:
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
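The lattice enumerates every subset of the dimension set, which a short sketch makes concrete: with 4 dimensions there are 2^4 = 16 cuboids.

```python
from itertools import combinations

dims = ("time", "item", "location", "supplier")

# One cuboid per subset of dimensions, from the 0-D apex cuboid
# (empty subset) up to the 4-D base cuboid (all dimensions).
cuboids = [combo for k in range(len(dims) + 1)
           for combo in combinations(dims, k)]
print(len(cuboids))  # 16
```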
Backward elimination
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Example of decision tree induction:
[Figure: a decision tree that tests attribute A1 and assigns tuples to Class 1 or Class 2; attributes not appearing in the tree are dropped from the reduced set]
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and vary slowly with time
Data Compression
If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless; otherwise only an approximation of the original data can be reconstructed, and it is called lossy.
[Figure: Original Data compressed to Compressed Data; lossless decompression recovers the Original Data, lossy decompression yields Original Data Approximated]
Dimensionality Reduction: Wavelet Transformation
[Figure: example wavelet families, Haar-2 and Daubechies-4]
Principal Component Analysis
[Figure: data in the original axes X1, X2 with principal component axes Y1, Y2]
Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
Example: log-linear models obtain the value at a point in m-D space as the product of values on the appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Histograms: divide the data into buckets and store a summary (e.g., average or sum) for each bucket
Partitioning rules: equal-width, equal-frequency (equal-depth)
[Figure: example histogram over attribute values from 10,000 to 100,000]
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
[Figure: simple random sampling from the Raw Data]
[Figure: Cluster/Stratified Sample drawn from the Raw Data]
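Simple random sampling with and without replacement can be sketched with the standard library; the data and seed below are arbitrary. For cluster or stratified sampling, the tuples are first grouped and the same draw is applied within each group.

```python
import random

raw = list(range(1, 101))  # hypothetical raw data: 100 tuples
rng = random.Random(42)    # seeded only for reproducibility

# SRSWOR: simple random sample without replacement (no duplicates).
srswor = rng.sample(raw, 10)

# SRSWR: simple random sample with replacement (duplicates possible).
srswr = [rng.choice(raw) for _ in range(10)]
```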
Discretization
Discretization: divide the range of a continuous attribute into intervals; interval labels can then be used to replace actual data values
Entropy-Based Discretization
$\text{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization
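A sketch of the minimum-entropy boundary search described above, assuming the input is a list of (value, class label) pairs; each candidate boundary is scored by the size-weighted entropy of the two subsets it induces.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return the boundary (midpoint between adjacent sorted values)
    that minimizes the weighted entropy of the resulting two subsets."""
    pairs = sorted(pairs)
    n = len(pairs)
    best = None
    for i in range(1, n):
        left = [lbl for _, lbl in pairs[:i]]
        right = [lbl for _, lbl in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[0]:
            best = (info, boundary)
    return best[1]
```

Applying the same search recursively to each resulting interval yields a multi-interval discretization.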
Merge: Find the best neighboring intervals and merge them to form
larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Example of the 3-4-5 rule for the attribute profit:
Step 1: Min = -$351, Max = $4,700; low (5th percentile) = -$159, high (95th percentile) = $1,838
Step 2: msd = 1,000, so Low = -$1,000 and High = $2,000
Step 3: partition (-$1,000 ... $2,000) into 3 equal-width intervals: (-$1,000 ... 0], (0 ... $1,000], ($1,000 ... $2,000]
Step 4: adjust the boundaries to cover Min and Max, giving (-$400 ... 0], (0 ... $1,000], ($1,000 ... $2,000], ($2,000 ... $5,000], then refine each:
(-$400 ... 0]: (-$400 ... -$300], (-$300 ... -$200], (-$200 ... -$100], (-$100 ... 0]
(0 ... $1,000]: (0 ... $200], ($200 ... $400], ($400 ... $600], ($600 ... $800], ($800 ... $1,000]
($1,000 ... $2,000]: ($1,000 ... $1,200], ($1,200 ... $1,400], ($1,400 ... $1,600], ($1,600 ... $1,800], ($1,800 ... $2,000]
($2,000 ... $5,000]: ($2,000 ... $3,000], ($3,000 ... $4,000], ($4,000 ... $5,000]
Automatic concept hierarchy generation: order attributes from fewest to most distinct values, e.g., country < province_or_state < city < street.
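This heuristic can be sketched by sorting attributes on their distinct-value counts; the counts below are illustrative, with fewer distinct values placing an attribute higher in the hierarchy.

```python
# Illustrative distinct-value counts per attribute.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}

# Order top-down: fewest distinct values first.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(hierarchy))  # country < province_or_state < city < street
```

The heuristic can fail (e.g., weekday has fewer distinct values than month but sits lower in a time hierarchy), so the result usually needs expert review.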
Summary
Data preparation (preprocessing) is a big issue for both data warehousing and data mining; it includes data cleaning, integration, transformation, reduction, and discretization.
References
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. SIGMOD'02.
H. V. Jagadish et al. Special issue on data reduction techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
V. Raman and J. Hellerstein. Potter's Wheel: An interactive framework for data cleaning and transformation. VLDB'01.
Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.