M-Unit-2 R16
UNIT-II
DATA PREPROCESSING
1. Preprocessing
Real-world databases are highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results, so the
data should be preprocessed before mining.
Data Preprocessing Techniques
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as
a data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive; they
may work together.
* Data transformations, such as normalization, may be applied.
Need for preprocessing
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
Incomplete data can occur for a number of reasons:
* Attributes of interest may not always be available.
* Relevant data may not be recorded due to misunderstanding, or because of equipment malfunctions.
* Data that were inconsistent with other recorded data may have been deleted.
* Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Noisy data (data having incorrect attribute values) can occur for a number of reasons:
* The data collection instruments used may be faulty.
* There may have been human or computer errors occurring at data entry.
* Errors in data transmission can also occur.
* There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. Some
attributes representing a given concept may have different names in different databases,
causing inconsistencies and redundancies.
Data transformation operations, such as normalization and aggregation, are additional
data preprocessing procedures that contribute toward the success of the mining process.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
2. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers
and correct inconsistencies in the data.
Missing Values
Many tuples have no recorded value for several attributes, such as customer income, so we may
need to fill in the missing values for these attributes.
The following methods are useful for performing missing values over several attributes:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or -∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value
for income.
5. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision
tree can be constructed to predict the missing values for income. (A small sketch of the
constant- and mean-fill strategies follows this list.)
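As a small illustration of strategies 3 and 4, the following sketch fills a missing income value with the attribute mean (or, alternatively, a global constant); the record layout is a made-up example, not taken from any particular data set.

# Minimal sketch: filling a missing 'income' value with the attribute mean
# (strategy 4) or a global constant (strategy 3). The records are hypothetical.

records = [
    {"name": "A", "income": 52000},
    {"name": "B", "income": None},      # missing value
    {"name": "C", "income": 60000},
]

# Strategy 4: attribute mean computed from the non-missing values.
known = [r["income"] for r in records if r["income"] is not None]
mean_income = sum(known) / len(known)

for r in records:
    if r["income"] is None:
        r["income"] = mean_income       # or a constant such as "unknown"

print(records)   # B's income becomes 56000.0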
Noisy Data
Noise is a random error or variance in a measured variable. Noise is removed using data
smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the
values around it. The sorted values are distributed into a number of "buckets," or "bins." Because
binning methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 3, 7, 14, 19, 23, 24, 31, 33, 38
Example 1: Partition into equal-frequency bins:
Bin 1: 3, 7, 14
Bin 2: 19, 23, 24
Bin 3: 31, 33, 38
In the above method, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
Smoothing by bin means:
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 3, 7, and 14 in Bin 1 is 8 [(3+7+14)/3].
Smoothing by bin boundaries:
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as
the bin boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be
equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following data using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
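The equal-frequency binning and smoothing steps used in the examples above can be expressed as a short sketch (one possible implementation, shown here on the Example 1 prices):

# Sketch of equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing Example 1 above.

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size."""
    data = sorted(values)
    return [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min or max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [3, 7, 14, 19, 23, 24, 31, 33, 38]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[3, 7, 14], [19, 23, 24], [31, 33, 38]]
print(smooth_by_means(bins))       # [[8.0, 8.0, 8.0], [22.0, 22.0, 22.0], [34.0, 34.0, 34.0]]
print(smooth_by_boundaries(bins))  # [[3, 3, 14], [19, 24, 24], [31, 31, 38]]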
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two attributes (or variables), so that one
attribute can be used to predict the other. Multiple linear regression is an extension of linear
regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups,
or "clusters." Intuitively, values that fall outside of the set of clusters may be
considered outliers.
3. Data Integration (χ² Correlation Test for Nominal Data)
For two nominal attributes, A and B, a correlation (redundancy) relationship can be discovered by a
χ² (chi-square) test:
χ² = Σi Σj (oij - eij)² / eij
where oij is the observed frequency of the joint event (Ai, Bj) and eij is the expected
frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / n
where n is the number of data tuples.
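Assuming the χ² formulation above, the statistic can be computed from a contingency table with a short sketch; the observed counts below are purely illustrative:

# Sketch of the chi-square test for two nominal attributes, using the
# formulas above. The contingency-table counts are illustrative only.

observed = [
    [250, 200],   # rows: values of attribute A
    [50, 1000],   # columns: values of attribute B
]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(len(observed)))
              for j in range(len(observed[0]))]

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n     # expected frequency
        chi_square += (o_ij - e_ij) ** 2 / e_ij

print(round(chi_square, 1))   # 507.9 for these counts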
4. Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results.
Why data reduction? A database or data warehouse may store terabytes of data, so complex data
analysis may take a very long time to run on the complete data set.
Data reduction strategies:
4.1. Data cube aggregation
4.2. Attribute subset selection
4.3. Numerosity reduction (e.g., fit the data into models)
4.4. Dimensionality reduction (data compression)
Attribute Subset Selection:
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.
Basic heuristic methods include the following:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes is added to the
set (a generic sketch appears after this list).
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure
where each internal node denotes a test on an attribute, each branch corresponds to an
outcome of the test, and each leaf node denotes a class prediction. At each node, the
algorithm chooses the "best" attribute to partition the data into individual classes. A tree is
constructed from the given data. All attributes that do not appear in the tree are assumed to be
irrelevant; the set of attributes appearing in the tree form the reduced subset of attributes. A
threshold measure is used as the stopping criterion.
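Stepwise forward selection, for instance, can be sketched generically as below. The attribute-scoring function is a placeholder supplied by the user (e.g., an information-gain measure); it is not fixed by the text:

# Generic sketch of stepwise forward selection. `score` is a user-supplied
# function rating how well a candidate attribute set predicts the class;
# it is a placeholder here.

def forward_selection(attributes, score, k):
    """Greedily grow a reduced attribute set of size k."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Pick the attribute whose addition gives the best score.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: prefer shorter attribute names (a stand-in for a real criterion).
attrs = ["age", "income", "student", "credit_rating"]
print(forward_selection(attrs, score=lambda s: -sum(len(a) for a in s), k=2))
# ['age', 'income']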
Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller forms of
data representation.
Techniques for numerosity reduction:
Parametric - Only the model parameters need to be stored, instead of the actual data
(e.g., regression and log-linear models).
Nonparametric - These methods store reduced representations of the data, including
histograms, clustering, and sampling.
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a
random variable Y (called a response variable) can be modeled as a linear
function of another random variable X (called a predictor variable), with the
equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients α and β
(called regression coefficients) specify the slope of the line and the Y-intercept,
respectively (a least-squares sketch appears after the parametric models below).
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a
response variable Y, to be modeled as a linear function of two or more predictor
variables.
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
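As a rough illustration of the linear regression case, the sketch below fits Y = αX + β by least squares; the (x, y) sample points are made up:

# Sketch of simple linear regression Y = alpha * X + beta fitted by least
# squares. The (x, y) sample points are made up for illustration.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope (alpha) and intercept (beta) from the normal equations.
alpha = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
beta = mean_y - alpha * mean_x

print(alpha, beta)   # roughly 2.0 and 0.09: storing just these two
                     # coefficients replaces the full data set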
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are
called singleton buckets.
Example: The following data are a list of prices of commonly sold items at AllElectronics. The
numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
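A simple equal-width bucketing of this price list can be sketched as follows; the choice of three buckets of width 10 is arbitrary:

# Sketch: equal-width histogram buckets for the price list above.
# Three buckets ($1-10, $11-20, $21-30) are an arbitrary choice.

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

buckets = {"1-10": 0, "11-20": 0, "21-30": 0}
for p in prices:
    if p <= 10:
        buckets["1-10"] += 1
    elif p <= 20:
        buckets["11-20"] += 1
    else:
        buckets["21-30"] += 1

print(buckets)   # {'1-10': 13, '11-20': 23, '21-30': 15}: each count is a bucket frequency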
2. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups, or clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters. Similarity is defined in terms of how close the objects are in space,
based on a distance function. The quality of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is an alternative
measure of cluster quality and is defined as the average distance of each cluster object from the
cluster centroid.
3. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random sample (or subset) of the data. Suppose that a large
data set, D, contains N tuples. One possible sample is a Simple Random Sample Without
Replacement (SRSWOR) of size n: this is created by drawing n of the N tuples from D
(n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely
to be sampled.
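An SRSWOR can be sketched with the standard library's random module; the tuple IDs below are illustrative:

import random

# Sketch of a simple random sample without replacement (SRSWOR): draw n of
# the N tuples so that every tuple has the same chance of being chosen.
# The tuple IDs are illustrative.

D = list(range(1, 101))                         # a data set of N = 100 tuple IDs
n = 10

srswor = random.sample(D, n)                    # sampling without replacement
srswr = [random.choice(D) for _ in range(n)]    # with replacement, for contrast

print(srswor)
print(srswr)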
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or "compressed" representation of the original data.
Dimension Reduction Types
Lossless - If the original data can be reconstructed from the compressed data without any
loss of information, then the data reduction is called lossless.
Lossy - If the original data can be reconstructed from the compressed data with loss of
information, then the data reduction is called lossy.
Effective methods of lossy dimensionality reduction:
a) Wavelet transforms
b) Principal components analysis.
a) Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector, transforms it to a numerically different vector of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n
measurements made on the tuple from n database attributes.
For example, all wavelet coefficients larger than some user-specified threshold can be
retained, and all other coefficients are set to 0. The resulting data representation is therefore very
sparse, so operations that can take advantage of data sparsity are computationally very fast if
performed in wavelet space.
The number next to a wavelet name is the number of vanishing moments of the wavelet;
this is a set of mathematical relationships that the coefficients must satisfy and is related to the
number of coefficients.
The general procedure for applying a discrete wavelet transform is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This
condition can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions
The first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the
detailed features of data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these
represent a smoothed or low-frequency version of the input data and the high-frequency
content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the
previous loop, until the resulting data sets obtained are of length 2.
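One way to sketch this procedure is with Haar-style functions, where the smoothing function is a pairwise average and the difference function is a pairwise difference (normalization constants omitted for readability); the input vector is illustrative:

# Sketch of the recursive DWT procedure above using Haar-style functions:
# a pairwise average (smoothing) and a pairwise difference (detail).
# Normalization constants are omitted; the input length is a power of 2.

def haar_dwt(x):
    coeffs = []
    while len(x) > 1:
        averages = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        details = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        coeffs = details + coeffs     # keep the high-frequency content
        x = averages                  # recurse on the smoothed version
    return x + coeffs                 # overall average followed by details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]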
b) Principal components analysis (PCA):
Principal components analysis searches for k n-dimensional orthogonal vectors that can best be
used to represent the data, where k <= n, so that the original data are projected onto a much
smaller space. For example, principal components Y1 and Y2 can be found for a set of data
originally mapped to the axes X1 and X2. This information helps identify groups or patterns
within the data. The sorted axes are such that the first axis shows the most variance among the
data, the second axis shows the next highest variance, and so on.
The size of the data can be reduced by eliminating the weaker components (a small numerical
sketch appears after the advantages below).
Advantage of PCA
PCA is computationally inexpensive
Multidimensional data of more than two dimensions can be handled by reducing
the problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis.
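A small numerical sketch of PCA, assuming NumPy is available: centre the data, take the eigenvectors of the covariance matrix as the new axes, and keep only the strongest component. The sample points are made up:

import numpy as np

# Sketch of PCA on a tiny 2-D data set (the points are made up): the
# eigenvectors of the covariance matrix give the new axes Y1, Y2, sorted
# so that the first axis carries the most variance.

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_centered = X - X.mean(axis=0)            # normalize to zero mean
cov = np.cov(X_centered, rowvar=False)     # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition

order = np.argsort(eigvals)[::-1]          # strongest component first
components = eigvecs[:, order]

# Project onto the first principal component only (2-D reduced to 1-D).
reduced = X_centered @ components[:, :1]
print(eigvals[order])                      # variance along Y1, then Y2
print(reduced.ravel())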
5. Data Transformation and Discretization
Data transformation is the process of converting data from one format or structure into
another format or structure.
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for data analysis at multiple
abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced
by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The
labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the
same attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
5.1. Normalization
a) Min-Max Normalization
Min-max normalization performs a linear transformation on the original data. Suppose that minA
and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps
a value, vi, of A to vi' in the range [new_minA, new_maxA] by computing
vi' = ((vi - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter
an "out-of-bounds" error if a future input case for normalization falls outside of the original
data range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard
deviation of A. A value, vi, of A is normalized to vi' by computing
vi' = (vi - Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Example: z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value
of $73,600 for income is transformed to
(73,600 - 54,000) / 16,000 = 1.225
c) Normalization by Decimal Scaling
Normalization by decimal scaling normalizes by moving the decimal point of the values of
attribute A: a value, vi, of A is normalized to vi' = vi / 10^j, where j is the smallest integer such
that max(|vi'|) < 1.
Example: Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
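The three normalization examples above can be reproduced with a short sketch that follows the formulas directly:

# Sketch reproducing the three normalization examples above.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))          # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))          # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917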
5.2. Data Discretization
a) Discretization by binning:
Binning is a top-down splitting technique based on a specified number of bins. For
example, attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in smoothing by
bin means or smoothing by bin medians, respectively. These techniques can be applied
recursively to the resulting partitions to generate concept hierarchies.
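As a simple illustration, equal-width discretization of a numeric attribute into interval labels can be sketched as follows; the age values and the bin width of 10 are illustrative:

# Sketch: equal-width discretization of a numeric attribute (age) into
# interval labels. The age values and the width of 10 are illustrative.

ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 46, 52, 70]
width = 10

def interval_label(v, width):
    low = (v // width) * width
    return f"{low}-{low + width - 1}"      # e.g. 25 -> "20-29"

discretized = [interval_label(a, width) for a in ages]
print(discretized)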