UNIT-2
Data Pre-Processing
Data pre-processing describes any type of processing performed on raw data to prepare it
for another processing procedure. Commonly used as a preliminary data mining practice,
data pre-processing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user.
Data in the real world is dirty: it can be incomplete, noisy and inconsistent in form. Such
data need to be pre-processed in order to improve both the quality of the data and the
quality of the mining results.
If there is no quality data, there will be no quality mining results; quality decisions must
always be based on quality data.
If much irrelevant and redundant information is present, or the data are noisy and
unreliable, then knowledge discovery during the training phase is more difficult.
Data become dirty for several reasons, including different conventions between the time
the data was collected and the time it is analyzed, and human, hardware and software
problems.
The major tasks in data pre-processing are:
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
Data integration: combine data from multiple sources into a coherent store.
Data transformation: normalize and aggregate the data.
Data reduction: obtain a reduced representation of the data that is much smaller in volume
but produces the same or similar analytical results.
Data discretization: part of data reduction, but with particular importance, especially for
numerical data.
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very effective unless
the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values
per attribute
varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming
and may not be a reasonable task for large data sets with many missing values, especially
when the value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown,” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that they form
an interesting concept, since they all have a value in common — that of “Unknown.” Hence,
although this method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or the attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the given
tuple (see the sketch after this list): For example, if classifying customers according to
credit risk, replace the missing value with the average income value for customers in the
same credit-risk category as that of the given tuple.
(e) Using the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
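As a small illustration of strategy (d), the Python sketch below fills missing numeric values with the mean of the tuples in the same class; the table and column names (customers, risk, income) are made up for the example.

import pandas as pd
import numpy as np

# Hypothetical customer data: 'risk' is the class label, 'income' has missing values.
customers = pd.DataFrame({
    "risk":   ["low", "low", "high", "high", "low"],
    "income": [52000, np.nan, 31000, np.nan, 48000],
})

# Strategy (d): replace each missing income with the mean income of the
# customers that belong to the same credit-risk class.
class_mean = customers.groupby("risk")["income"].transform("mean")
customers["income"] = customers["income"].fillna(class_mean)
print(customers)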
1 Binning methods: Binning methods smooth a sorted data value by consulting its
"neighbourhood", that is, the values around it. The sorted values are distributed into a
number of 'buckets', or bins. Because binning methods consult the neighbourhood of values,
they perform local smoothing.
In this technique,
1. The data are first sorted.
2. The sorted values are then partitioned into a number of equi-depth bins.
3. Finally, one can smooth by bin means, smooth by bin medians, smooth by bin
boundaries, etc.
a. Smoothing by bin means: Each value in the bin is replaced by the mean
value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the bin
median.
c. Smoothing by boundaries: The min and max values of a bin are identified as
the bin boundaries. Each bin value is replaced by the closest boundary value.
Example: Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins of depth 4 (each bin contains four values):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, 9 and 15 in Bin 1 is 9. Therefore, each original value in
this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin boundaries. Each bin
value is then replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values for the data
tuples are (in
increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate
your steps.
Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means
with a bin
depth of 3.
• Step 1: Sort the data. (This step is not required here as the data are already sorted.)
• Step 2: Partition the data into equi-depth bins of depth 3:
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the
bin:
Bin 1: 14, 14, 14 Bin 2: 18, 18, 18 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26, 26, 26 Bin 6: 33, 33, 33
Bin 7: 35, 35, 35 Bin 8: 40, 40, 40 Bin 9: 56, 56, 56
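A minimal Python sketch of equi-depth binning with smoothing by bin means, matching the two examples above; the function name smooth_by_bin_means is illustrative.

# Partition sorted values into equi-depth bins and replace each value by its bin mean.
def smooth_by_bin_means(values, depth):
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), depth):
        bin_values = values[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, depth=4))   # [9.0, 9.0, 9.0, 9.0, 22.75, ..., 29.25]

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
print(smooth_by_bin_means(ages, depth=3))     # exact bin means, e.g. 14.67, 18.33, 21.0, ...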
2 Clustering: Outliers in the data may be detected by clustering, where similar values are
organized into groups, or ‘clusters’. Values that fall outside of the set of clusters may be
considered outliers.
3 Regression: smooth by fitting the data to a regression function.
Linear regression involves finding the best line to fit two variables, so that
one variable can be used to predict the other.
Field overloading: is a kind of source of errors that typically occurs when developers
compress new attribute definitions into unused portions of already defined attributes.
A unique rule says that each value of the given attribute must be different from all
other values of that attribute.
A consecutive rule says that there can be no missing values between the lowest
and highest values of the attribute, and that all values must also be unique.
A null rule specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition, and how such values should be handled.
Data scrubbing tools use simple domain knowledge (e.g., spell checking and parsing
techniques) to detect errors and make corrections in the data.
Data auditing tools analyze the data to discover rules and relationships, and to detect data
that violate such conditions.
2.4.1 Data Integration
Data integration combines data from multiple sources into a coherent store. There are a
number of issues to consider during data integration.
Issues:
Detecting and resolving data value conflicts: Attribute values from different
sources can be different due to different representations, different scales. E.g. metric
vs. British units
Redundancy is another important issue; some redundancies can be detected by correlation
analysis.
1. Correlation analysis
Given two attributes A and B, the correlation between them can be measured by the
correlation coefficient
r(A, B) = Σ (a − Ā)(b − B̄) / (n σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the mean values of A and B, and σ_A and σ_B
are their standard deviations.
If the result of the equation is > 0, then A and B are positively correlated, which means
the values of A increase as the values of B increase. The higher the value, the stronger the
correlation; a high value may indicate that one of the attributes is redundant and can be
removed.
If the result of the equation is = 0, then A and B are independent and there is no
correlation between them.
If the resulting value is < 0, then A and B are negatively correlated: the values of one
attribute increase as the values of the other decrease, so each attribute discourages the
other.
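As a quick illustration, the Python sketch below computes the correlation coefficient for two small made-up attributes A and B and interprets its sign as described above.

import numpy as np

# Correlation analysis between two hypothetical numeric attributes A and B.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 7.0, 11.0])

r = np.corrcoef(A, B)[0, 1]   # Pearson correlation coefficient
if r > 0:
    print(f"r = {r:.3f}: A and B are positively correlated (possible redundancy)")
elif r < 0:
    print(f"r = {r:.3f}: A and B are negatively correlated")
else:
    print("r = 0: A and B show no linear correlation")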
2.4.2 Data Transformation
Data transformation can involve the following:
Smoothing: which works to remove noise from the data
Generalization of the data: where low-level or "primitive" (raw) data are replaced
by higher-level concepts through the use of concept hierarchies. For example,
categorical attributes, like street, can be generalized to higher-level concepts, like city
or country.
Normalization: where the attribute data are scaled so as to fall within a small
specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
In normalization, the data are scaled so as to fall within a small, specified range. This is
useful for classification algorithms involving neural networks, and for distance-based
methods such as nearest-neighbour classification and clustering. There are three methods
for data normalization. They are:
1) min-max normalization
2) z-score normalization
3) normalization by decimal scaling
In min-max normalization, a value v of attribute A is normalized to v' by computing
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
where min_A and max_A are the minimum and maximum values of A, and [new_min_A,
new_max_A] is the new range.
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and
Rs. 100,000, and we want to map profit to the range [0, 1]. Using min-max normalization,
the value Rs. 20,000 for attribute profit is mapped to
(20,000 − 10,000) / (100,000 − 10,000) × (1 − 0) + 0 = 0.111.
Example:
Normalize the following group of data: 1000, 2000, 3000, 9000, using min-max
normalization with new_min = 0 and new_max = 1.
Solution:
Here new_max(A) = 1 and new_min(A) = 0, as given in the question.
max(A) = 9000, the maximum value among 1000, 2000, 3000, 9000.
min(A) = 1000, the minimum value among 1000, 2000, 3000, 9000.
Case 1: normalizing 1000
v = 1000; putting all values in the formula, v' = (1000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0
Case 2: normalizing 2000
v = 2000; putting all values in the formula, v' = (2000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0.125
Case 3: normalizing 3000
v = 3000; putting all values in the formula, v' = (3000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 0.25
Case 4: normalizing 9000
v = 9000; putting all values in the formula, v' = (9000 − 1000) × (1 − 0) / (9000 − 1000) + 0 = 1
Outcome:
Hence, the normalized values of 1000, 2000, 3000, 9000 are 0, 0.125, 0.25 and 1.
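The same calculation expressed as a short Python sketch; the function name min_max_normalize is illustrative.

# Min-max normalization of the worked example above.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max_normalize([1000, 2000, 3000, 9000]))   # [0.0, 0.125, 0.25, 1.0]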
In z-score (zero-mean) normalization, a value v of attribute A is normalized to v' by
computing
v' = (v − mean_A) / stand_dev_A
where mean_A and stand_dev_A are the mean and standard deviation of attribute A.
This method is useful when the minimum and maximum values of attribute A are unknown,
or when outliers dominate the min-max normalization.
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score
normalization, a value of 85,000 for P is transformed to
(85,000 − 60,000) / 10,000 = 2.5.
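A one-line Python check of this z-score example; the function name z_score is illustrative.

# z-score normalization: subtract the mean, divide by the standard deviation.
def z_score(v, mean, std_dev):
    return (v - mean) / std_dev

print(z_score(85000, mean=60000, std_dev=10000))   # 2.5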
In normalization by decimal scaling, the decimal point of the values of attribute A is moved;
the number of decimal points moved depends on the maximum absolute value of A. A value
v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Example 2:
Salary bonus    Formula             Normalized value after decimal scaling
400             400 / 1000          0.4
310             310 / 1000          0.31
Example 3:
Salary          Formula             Normalized value after decimal scaling
40,000          40,000 / 100,000    0.4
31,000          31,000 / 100,000    0.31
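A small Python sketch of decimal scaling that reproduces both tables; the function name decimal_scaling is illustrative and assumes at least one non-zero value.

import math

# Normalization by decimal scaling: divide by 10^j, where j is the smallest
# integer that makes every |v'| strictly less than 1.
def decimal_scaling(values):
    j = math.floor(math.log10(max(abs(v) for v in values))) + 1
    return [v / (10 ** j) for v in values]

print(decimal_scaling([400, 310]))        # [0.4, 0.31]
print(decimal_scaling([40000, 31000]))    # [0.4, 0.31]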
2.5 Data Reduction
These techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. Data
reduction includes:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant attributes
or dimensions are detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data
set size. Examples: wavelet transforms and principal components analysis.
4. Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models, clustering, sampling or
histograms.
5. Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data discretization is a
form of numerosity reduction that is very useful for the automatic generation of
concept hierarchies.
2.5.1 Data cube aggregation: Reduce the data to the concept level needed in the analysis.
Queries regarding aggregated information should be answered using data cube when
possible. Data cubes store multidimensional aggregated information. The following figure
shows a data cube for multidimensional analysis of sales data with respect to annual sales
per item type for each branch.
Each cell holds an aggregate data value, corresponding to a data point in multidimensional
space.
Data cubes provide fast access to pre computed, summarized data, thereby benefiting on-
line analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid, while
the cube at the highest level of abstraction is the apex cuboid. Data cubes created for
varying levels of abstraction are often referred to as cuboids, so that a "data cube" may
instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the
resulting data size.
The following database consists of sales per quarter for the years 1997-1999.
Suppose the analyst is interested in annual sales rather than sales per quarter. The above
data can be aggregated so that the resulting data summarize the total sales per year instead
of per quarter. The resulting data set is smaller in volume, without loss of the information
necessary for the analysis task.
2.5.2 Dimensionality Reduction
It is the process of reducing the number of random variables or attributes under
consideration
Two popular methods are 1) wavelet transforms and 2) principal component analysis, both
of which transform or project the original data onto a smaller space.
Attribute subset selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant
attributes. Heuristic methods of attribute subset selection are explained here:
Feature selection is a must for any data mining product. That is because, when you build a
data mining model, the dataset frequently contains more information than is needed to
build the model. For example, a dataset may contain 500 columns that describe
characteristics of customers, but perhaps only 50 of those columns are used to build a
particular model. If you keep the unneeded columns while building the model, more CPU
and memory are required during the training process, and more storage space is required
for the completed model.
The goal is to select a minimum set of features such that the probability distribution of the
different classes, given the values for those features, is as close as possible to the original
distribution given the values of all the features.
1. Step-wise forward selection: The procedure starts with an empty set of attributes.
The best of the original attributes is determined and added to the set. At each subsequent
iteration or step, the best of the remaining original attributes is added to the set.
2. Step-wise backward elimination: The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination forward selection and backward elimination: The step-wise forward
selection and backward elimination methods can be combined, where at each step one
selects the best attribute and removes the worst from among the remaining attributes.
If the mining algorithm itself is used to determine the attribute subset, the method is called
a wrapper approach; otherwise it is a filter approach. The wrapper approach generally leads
to greater accuracy, since it optimizes the evaluation measure of the algorithm while
removing attributes. A short sketch of step-wise forward selection is given below.
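A schematic Python sketch of step-wise forward selection; score_subset stands in for whatever evaluation measure is used (for example, the accuracy of the mining algorithm in a wrapper approach), and k is the desired number of attributes.

# Step-wise forward selection: start with an empty set and repeatedly add the
# attribute that most improves the score of the current subset.
def forward_selection(all_attributes, score_subset, k):
    selected = []
    remaining = list(all_attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score_subset(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: a made-up score that prefers subsets containing 'income' and 'age'.
toy_score = lambda subset: len(set(subset) & {"income", "age"})
print(forward_selection(["street", "income", "age", "phone"], toy_score, k=2))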
Data compression
Wavelet transforms
Principal components analysis.
Wavelet compression is a form of data compression well suited to image compression.
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector D, transforms it into a numerically different vector, D', of wavelet
coefficients.
1. The length, L, of the input data vector must be an integer power of two. This condition
can be met by padding the data vector with zeros, as necessary.
2. Each transform applies two functions: the first performs some data smoothing, such as a
sum or weighted average; the second performs a weighted difference, which brings out the
detailed features of the data.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of
length L/2.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets are of the desired length.
5. Selected values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data.
Wavelet coefficients larger than some user-specified threshold are retained; the remaining
coefficients are set to 0.
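A minimal NumPy sketch of one smoothing/difference step of the DWT, using the Haar wavelet as the pair of functions; the function name haar_dwt_step is illustrative.

import numpy as np

# One level of a Haar discrete wavelet transform: the input length must be a
# power of two (pad with zeros if necessary); the output is two halves of length L/2.
def haar_dwt_step(data):
    data = np.asarray(data, dtype=float)
    averages = (data[0::2] + data[1::2]) / np.sqrt(2)   # smoothed (weighted average) half
    details = (data[0::2] - data[1::2]) / np.sqrt(2)    # weighted difference half
    return averages, details

print(haar_dwt_step([2, 2, 0, 2, 3, 5, 4, 4]))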
• Given N data vectors of k dimensions, principal components analysis (PCA) finds c ≤ k
orthogonal vectors that can best be used to represent the data.
The principal components (new set of axes) give important information about variance.
Using the strongest components one can reconstruct a good approximation of the original
signal.
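A compact NumPy sketch of PCA via the eigen-decomposition of the covariance matrix, using a small made-up data matrix X; the function name pca is illustrative.

import numpy as np

# Project k-dimensional data onto its c strongest orthogonal components.
def pca(X, c):
    X_centered = X - X.mean(axis=0)             # centre each attribute
    cov = np.cov(X_centered, rowvar=False)      # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # strongest components first
    components = eigvecs[:, order[:c]]
    return X_centered @ components              # N x c reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, c=1))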
2.5.4 Numerosity Reduction
Data volume can be reduced by choosing alternative, smaller forms of data representation.
These techniques can be parametric or non-parametric.
Parametric methods assume the data fit some model, estimate the model parameters, and
store only the parameters instead of the actual data. Non-parametric methods store reduced
representations of the data, such as histograms, clusterings and samples.
In linear regression, the data are modeled to fit a straight line using
Y = α + β X, where α, β are coefficients
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the above.
Log-linear model: The multi-way table of joint probabilities is approximated by a
product of lower-order tables.
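A brief NumPy sketch of the parametric idea: fit the coefficients of Y = α + βX on a small made-up data set and keep only α and β instead of the raw points.

import numpy as np

# Linear regression as parametric numerosity reduction: store alpha and beta only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])

beta, alpha = np.polyfit(x, y, deg=1)       # y is modelled as alpha + beta * x
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
print("estimate for x = 6:", alpha + beta * 6)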
Histogram
Divide data into buckets and store average (sum) for each bucket
A bucket represents an attribute-value/frequency pair
It can be constructed optimally in one dimension using dynamic programming
It divides up the range of possible values in a data set into classes or groups. For
each group, a rectangle (bucket) is constructed with a base length equal to the range
of values in that specific group, and an area proportional to the number of
observations falling into that group.
The buckets are displayed in a horizontal axis while height of a bucket represents
the average frequency of the values.
Example:
The following data are a list of prices of commonly sold items. The numbers have been
sorted.
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30
The buckets can be determined based on partitioning rules such as the following:
Equi-width: the width of each bucket range is uniform.
Equi-depth (equi-height): each bucket contains roughly the same number of contiguous
data samples.
V-Optimal: the histogram with the least variance, considering all possible histograms for a
given number of buckets.
MaxDiff: bucket boundaries are placed between adjacent values having the largest
differences.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical. Histograms
are highly effective at approximating both sparse and dense data, as well as highly skewed
and uniform data.
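A small Python sketch that builds an equi-width histogram over the price list above; the bucket width of 5 dollars is an arbitrary choice for the illustration.

from collections import Counter

# Equi-width bucketing: each bucket covers a $5 range, and we store the count per bucket.
prices = [18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
          25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 5
buckets = Counter((p // width) * width for p in prices)
for start in sorted(buckets):
    print(f"{start}-{start + width - 1}: {buckets[start]} values")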
Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar" to one another and
“dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how
“close" the objects are in space, based on a distance function.
Quality of clusters measured by their diameter (max distance between any two objects in
the cluster) or centroid distance (avg. distance of each cluster object from its centroid)
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be
represented by a much smaller random sample (or subset) of the data. Suppose that a large
data set, D, contains N tuples. The possible samples of D include:
1. Simple random sample without replacement (SRSWOR) of size n: created by drawing n of
the N tuples from D (n < N), where every tuple has an equal chance of being sampled and,
once drawn, cannot be drawn again.
2. Simple random sample with replacement (SRSWR) of size n: similar to SRSWOR, except
that each time a tuple is drawn it is recorded and then placed back in D, so it may be drawn
again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters", then a
SRS of m clusters can be obtained, where m < M. For example, tuples in a database are
usually retrieved a page at a time, so that each page can be considered a cluster. A reduced
data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a
cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called “strata", a stratified
sample of D is generated by obtaining a SRS at each stratum. This helps to ensure a
representative sample, especially when the data are skewed. For example, a stratified
sample may be obtained from customer data, where stratum is created for each customer
age group. In this way, the age group having the smallest number of customers will be sure
to be represented.
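The Python sketch below illustrates SRSWOR, SRSWR and a stratified sample on a made-up data set D of N = 100 tuples; stratifying by odd versus even values is purely illustrative.

import random

D = list(range(1, 101))                      # a toy data set of N = 100 tuples

srswor = random.sample(D, k=10)              # SRSWOR: each tuple can be drawn at most once
srswr = random.choices(D, k=10)              # SRSWR: a drawn tuple is placed back and may repeat

# Stratified sample: split D into disjoint strata and take an SRS within each stratum.
strata = {"odd": [t for t in D if t % 2 == 1], "even": [t for t in D if t % 2 == 0]}
stratified = {name: random.sample(group, k=5) for name, group in strata.items()}

print(srswor)
print(srswr)
print(stratified)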
Advantages of sampling
1. An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, n, as opposed to N, the data set size. Hence,
sampling complexity is potentially sub-linear to the size of the data.
2. When applied to data reduction, sampling is most commonly used to estimate the
answer to an aggregate query.
Discretization:
Discretization techniques can be used to reduce the number of values for a given
continuous attribute, by dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values.
Concept Hierarchy
A concept hierarchy for a given numeric attribute defines a Discretization of the attribute.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts (such as numeric values for the attribute age) by higher-level concepts (such as
young, middle-aged, or senior).
Discrete Attributes
Discrete data have a finite number of values. They can be in numerical form and can also be
in categorical form. Discrete attributes in numerical form are quantitative attributes.
Examples of Discrete Data
Attribute Value
Profession Teacher, Businessman, Peon, etc.
Postal Code 42200, 42300 etc
Continuous Attributes
Continuous data technically have an infinite number of possible values. Continuous data
are of float type; there can be many values between 1 and 2. These attributes are
quantitative attributes.
Example of Continuous Attribute
Attribute Value
Height 5.4, 6.5, etc.
Weight 50.09, etc.
Continuous attributes are represented as real (typically floating-point) numbers.
There are five methods for numeric concept hierarchy generation. These include:
1. binning,
2. histogram analysis,
3. clustering analysis,
4. entropy-based discretization, and
5. segmentation by natural partitioning.
Segmentation by Natural Partitioning
Procedure: the 3-4-5 rule can be used to segment numeric data into relatively uniform,
natural-seeming intervals.
• If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the
range into 3 intervals (3 equi-width intervals for 3, 6 and 9, and 3 intervals grouped as
2-3-2 for 7).
• If it covers 2, 4 or 8 distinct values at the most significant digit, partition the range into 4
equi-width intervals.
• If it covers 1, 5 or 10 distinct values at the most significant digit, partition the range into 5
equi-width intervals.
Example:
Suppose that profits at different branches of a company for the year 1997 cover a wide
range, from -$351,976.00 to $4,700,896.50. A user wishes to have a concept hierarchy for
profit automatically generated.
Suppose that the data within the 5%-tile and 95%-tile are between -$159,876 and
$1,838,761. The results of applying the 3-4-5 rule are shown in following figure
Step 1: Based on the above information, the minimum and maximum values are: MIN = -
$351, 976.00, and MAX = $4, 700, 896.50. The low (5%-tile) and high (95%-tile) values to
be considered for the top or first level of segmentation are: LOW = -$159, 876, and HIGH =
$1, 838,761.
Step 2: Given LOW and HIGH, the most significant digit is at the million-dollar position (i.e.,
msd = 1,000,000). Rounding LOW down to the million-dollar digit, we get LOW' =
-$1,000,000; rounding HIGH up to the million-dollar digit, we get HIGH' = +$2,000,000.
Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e.,
(2,000,000 − (−1,000,000)) / 1,000,000 = 3, the segment is partitioned into 3 equi-width
sub-segments according to the 3-4-5 rule: (-$1,000,000 - $0], ($0 - $1,000,000], and
($1,000,000 - $2,000,000]. This represents the top tier of the hierarchy.
Step 4: We now examine the MIN and MAX values to see how they "fit" into the first-level
partitions. Since the first interval, (-$1,000,000 - $0], covers the MIN value, i.e., LOW' < MIN,
we can adjust the left boundary of this interval to make the interval smaller. The most
significant digit of MIN is at the hundred-thousand-dollar position. Rounding MIN down to
this position, we get MIN' = -$400,000.
Therefore, the first interval is redefined as (-$400,000 - 0]. Since the last interval,
($1,000,000-$2,000,000] does not cover the MAX value, i.e., MAX > HIGH’, we need to create
a new interval to cover it. Rounding up MAX at its most significant digit position, the new
interval is ($2,000,000 - $5,000,000]. Hence, the top most level of the hierarchy contains
four partitions, (-$400,000 - $0], ($0 - $1,000,000], ($1,000,000 - $2,000,000], and
($2,000,000 - $5,000,000].
Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to
form the next lower level of the hierarchy.
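A minimal Python sketch of Steps 2 and 3 of this example: rounding LOW and HIGH at the most significant digit and forming the top-level equi-width segments. It covers only the 3-segment case of the 3-4-5 rule, and the variable names are illustrative.

import math

# Top level of the 3-4-5 rule for the profit example (3-segment case only).
LOW, HIGH = -159876.0, 1838761.0                       # 5th and 95th percentile values

msd = 10 ** int(math.floor(math.log10(abs(HIGH))))     # most significant digit: 1,000,000
low_rounded = math.floor(LOW / msd) * msd              # -1,000,000
high_rounded = math.ceil(HIGH / msd) * msd             # +2,000,000

n_segments = int((high_rounded - low_rounded) / msd)   # 3 distinct values at the msd
width = (high_rounded - low_rounded) // n_segments
segments = [(low_rounded + i * width, low_rounded + (i + 1) * width)
            for i in range(n_segments)]
print(segments)   # [(-1000000, 0), (0, 1000000), (1000000, 2000000)]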
A concept hierarchy organizes the values of attributes or dimensions into gradual levels of
abstraction. Concept hierarchies are useful for mining at multiple levels of abstraction.