ICS 2408 - Lecture 2 - Data Preprocessing
[Figure: lossy numerosity reduction – original data vs. approximated data]
Numerosity Reduction
Reduce the data volume by choosing alternative ‘smaller’
forms of data representation
Two types:
Parametric – a model is used to estimate the data; only the
model parameters are stored instead of the actual data, e.g.
regression
log-linear models
Nonparametric – store a reduced representation of the
data, e.g.
Histograms
Clustering
Sampling
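As a sketch of the parametric case, the snippet below fits a straight line y = a·x + b by least squares and keeps only the two parameters instead of all the points; the data values are purely illustrative.

```python
# Parametric numerosity reduction: fit y = a*x + b and store only (a, b)
# instead of every data point. Illustrative, exactly linear data.
xs = list(range(10))
ys = [3 * x + 7 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Least-squares slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a, b)  # the stored "model": two numbers in place of ten points
```

Any value can then be approximated on demand as `a * x + b`.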
Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum)
for each bucket
Use binning to approximate data distributions
Each bucket spans a range on the horizontal axis; the
height (or area) of the bucket gives the average frequency
of the values it represents
A bucket holding a single attribute-value/frequency pair is
a singleton bucket
Buckets usually denote continuous ranges of the given
attribute
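A minimal equal-width histogram sketch: the range is split into buckets and only per-bucket counts are stored. The price list is illustrative.

```python
# Equal-width histogram: store one count per bucket instead of raw values.
def equal_width_histogram(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so the maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 21, 21, 21, 25, 25, 30]
lo, width, counts = equal_width_histogram(prices, 3)
print(counts)  # one count per bucket; the counts sum to len(prices)
```

The twenty prices are reduced to three bucket counts plus the range information.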
Clustering
Partition the data set into clusters, and store only the
cluster representations.
Can be very effective if the data is naturally clustered, but
not if the data is "smeared"/spread out.
Clusterings can be hierarchical and stored in multi-
dimensional index tree structures.
There are many choices of clustering definitions and
clustering algorithms
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allows a mining algorithm to run in complexity that is
potentially sub-linear in the size of the data
Key principle: choose a representative subset of the data
Simple random sampling may perform very poorly in the
presence of skew
Adaptive sampling methods, e.g. stratified sampling, help
in that case.
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
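The three schemes above can be sketched with the standard library `random` module; the data set and the 10% sampling rate are illustrative.

```python
import random

rng = random.Random(42)
data = list(range(100))

# Simple random sampling WITHOUT replacement: each object picked at most once.
srswor = rng.sample(data, 10)

# Simple random sampling WITH replacement: an object may be picked again.
srswr = [rng.choice(data) for _ in range(10)]

# Stratified sampling: partition by some label, then draw the same
# percentage (here 10%) from each stratum -- robust to skewed strata sizes,
# since the small "rare" stratum is guaranteed representation.
strata = {"rare": list(range(10)), "common": list(range(10, 100))}
stratified = [v for label, items in strata.items()
              for v in rng.sample(items, max(1, len(items) // 10))]

print(len(srswor), len(srswr), len(stratified))
```

With plain random sampling, a 10-item sample could easily miss the "rare" stratum entirely; the stratified draw cannot.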
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior)
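A concept hierarchy for the age example can be a simple mapping from numbers to labels; the cut-off points below are illustrative, not standard.

```python
# Concept hierarchy sketch: replace low-level numeric ages with
# higher-level concepts. Thresholds (35, 60) are illustrative assumptions.
def age_concept(age):
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 41, 67, 30, 58]
labels = [age_concept(a) for a in ages]
print(labels)
```

The reduced data keeps far fewer distinct values (three labels instead of arbitrarily many ages), at the cost of precision.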
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
because some data mining algorithms only accept
categorical attributes.
Some techniques:
Binning methods – equal-width, equal-frequency
Histogram
Entropy-based methods
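The two binning methods can be contrasted in a few lines: equal-width splits the value range into k equal intervals, while equal-frequency puts (roughly) the same number of values in each bin. The values are illustrative.

```python
# Equal-width binning: k intervals of equal size over [min, max].
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency binning: roughly the same number of values per bin,
# assigned by rank in sorted order.
def equal_frequency_bins(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

Note how the two methods disagree: equal-width puts only two values in the first bin because of the wide range, while equal-frequency forces three values into every bin.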
References
J. Han and M. Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2000.
T. Dasu and T. Johnson. Exploratory Data Mining and
Data Cleaning. John Wiley & Sons, 2003.
V. Raman and J. Hellerstein. Potter's Wheel: An
Interactive Framework for Data Cleaning and
Transformation. VLDB 2001.
H.V. Jagadish et al. Special Issue on Data Reduction
Techniques. Bulletin of the Technical Committee on
Data Engineering, 20(4), December 1997.