Data Preprocessing - Data Cleaning
Data Preprocessing
Data cleaning
Data reduction
Summary
e.g., occupation="" (a missing value)
e.g., Salary=-10 (an erroneous value)
Data cleaning
Data integration
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Normalization
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
equipment malfunction
Noisy Data
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
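The binning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the sample `prices` list, and the choice to send ties to the lower boundary are all assumptions made for the example.

```python
from statistics import mean, median

def equal_frequency_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Assumption: a value exactly halfway between the two goes to the minimum.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative data only (hypothetical prices).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, depth=4)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means or medians flattens each bin to a single value, while smoothing by boundaries keeps the bin's extremes and pulls the interior values toward them.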
if A and B are the lowest and highest values of the attribute and the range is divided into N intervals of equal width, the width of each interval will be: W = (B - A)/N.
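As a rough sketch of how the width formula is applied, the following Python snippet assigns each value to one of N equal-width intervals. The function name, the sample data, and the decision to place the maximum value B in the last interval are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            index = 0  # all values are identical, so use a single bin
        else:
            # B itself would compute to index N, so clamp it to the last bin.
            index = min(int((v - a) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

# Illustrative data only.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(values, n_bins=3)

print(width)
print(bins)
```

Unlike equal-frequency binning, the intervals here have the same width but may hold very different numbers of values, so skewed data or outliers can leave some bins nearly empty.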
Regression
smooth by fitting the data to regression functions
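One way to picture regression smoothing is to fit a straight line by ordinary least squares and replace each observed value with the value on the line. The sketch below assumes a single predictor and a linear model; the helper names and the sample points (noisy values scattered around y = x + 1) are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each observed y with the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Illustrative data only: noisy samples near the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```

The fitted values lie exactly on one line, so the noise in the individual observations is removed at the cost of assuming that the linear model is appropriate.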
[Figure: Regression. Plot of the line y = x + 1 on x and y axes, with labels X1 and Y1.]
Clustering
detect and remove outliers
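A very simple way to see the idea is to group one-dimensional values into clusters and treat values that fall in very small clusters as outliers. The sketch below uses a gap-based grouping, which is only one of many possible clustering methods; the function names, the `max_gap` and `min_size` parameters, and the sample readings are assumptions for illustration.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters

def split_outliers(values, max_gap, min_size):
    """Values in clusters smaller than min_size are reported as outliers."""
    clusters = cluster_1d(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_size for v in c]
    outliers = [v for c in clusters if len(c) < min_size for v in c]
    return kept, outliers

# Illustrative data only.
readings = [10, 11, 12, 13, 30, 31, 32, 33, 95]
kept, outliers = split_outliers(readings, max_gap=5, min_size=2)

print(kept)
print(outliers)
```

Both thresholds are judgment calls that depend on the data, so the reported outliers are candidates rather than confirmed errors.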
[Figure: Cluster Analysis]
Combined computer and human inspection
detect suspicious values and check by human
(e.g., deal with possible outliers)
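The division of labour can be sketched as follows: the program only flags suspicious values, and a person decides what to do with them. The flagging rule used here (values more than 1.5 interquartile ranges outside the quartiles) is a common convention chosen for the example, not something the slides prescribe; the function name and the salary figures are hypothetical.

```python
from statistics import quantiles

def flag_for_review(values, k=1.5):
    """Return (index, value) pairs outside [Q1 - k*IQR, Q3 + k*IQR].

    Nothing is deleted; the result is a list for a human to check.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Illustrative data only (hypothetical salaries, in thousands).
salaries = [42, 45, 47, 50, 52, 55, 58, 60, -10, 400]

for index, value in flag_for_review(salaries):
    print(f"row {index}: value {value} needs human review")
```

Returning the row index along with the value lets the reviewer trace each flagged entry back to its source record before correcting or discarding it.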
Problems
3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i. Use smoothing by bin means and by bin boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.
ii. How might you determine the outliers?