Data Preprocessing Techniques: 1.1 Why Preprocess The Data?
Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively
processed for the purpose of the user.
Incomplete, noisy, and inconsistent data are commonplace properties of large, real-world
databases and data warehouses. Incomplete data can occur for a number of reasons.
Attributes of interest may not always be available, such as customer information for sales
transaction data. Other data may not be included simply because it was not considered
important at the time of entry. Relevant data may not be recorded due to a misunderstanding,
or because of equipment malfunctions. Data that were inconsistent with other recorded data
may have been deleted. Furthermore, the recording of the history or modifications to the data
may have been overlooked. Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
Data can be noisy, having incorrect attribute values, owing to the following. The data
collection instruments used may be faulty. There may have been human or computer errors
occurring at data entry. Errors in data transmission can also occur. There may be technology
limitations, such as limited buffer size for coordinating synchronized data transfer and
consumption. Incorrect data may also result from inconsistencies in naming conventions or
data codes used. Duplicate tuples also require data cleaning.
1.2.1 Missing values
The following methods can be used to fill in missing values for an attribute:
1. Ignore the tuple: This is usually done when the class label is missing. This method is
not very effective, unless the tuple contains several attributes with missing values. It
is especially poor when the percentage of missing values per attribute varies
considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like "Unknown" or -∞. If missing values
are replaced by, say, "Unknown", then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common, that of
"Unknown". Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is $28,000. Use this value to replace the
missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit risk
category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined
with inference-based tools using a Bayesian formalism or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
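Strategies 4 and 5 above can be sketched briefly in code. The following is a minimal illustration using pandas; the column names (income, credit_risk) and the data values are assumptions made for the example, not the AllElectronics data set.
```python
import numpy as np
import pandas as pd

# Illustrative data: 'income' has missing values, 'credit_risk' is the class label.
df = pd.DataFrame({
    "income":      [28000.0, np.nan, 41000.0, np.nan, 19000.0, 52000.0],
    "credit_risk": ["low",   "low",  "high",  "high", "low",   "high"],
})

# Strategy 4: fill missing income with the overall attribute mean.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Strategy 5: fill missing income with the mean income of the tuple's own class.
class_means = df.groupby("credit_risk")["income"].transform("mean")
df["income_class_filled"] = df["income"].fillna(class_means)

print(df)
```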
1.2.2 Noisy data
“What is noise?" Noise is a random error or variance in a measured variable. Given a numeric
attribute such as, say, price, how can we “smooth" out the data to remove the noise? Let's
look at the following data smoothing techniques.
1. Binning methods: Binning methods smooth a sorted data value by consulting the
“neighborhood", or values around it. The sorted values are distributed into a number
of 'buckets', or bins. Figure 1.2 illustrates some binning techniques. In this example,
the data for price are first sorted and partitioned into equi-depth bins (of depth 3). In
smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each
original value in this bin is replaced by the value 9. Similarly, smoothing by bin
medians can be employed, in which each bin value is replaced by the bin median. In
smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equi-depth) bins of depth 3:
-Bin 1: 4, 8, 15
-Bin 2: 21, 21, 24
-Bin 3: 25, 28, 34
Smoothing by bin means:
-Bin 1: 9, 9, 9
-Bin 2: 22, 22, 22
-Bin 3: 29, 29, 29
Smoothing by bin boundaries:
-Bin 1: 4, 4, 15
-Bin 2: 21, 21, 24
-Bin 3: 25, 25, 34
2. Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the "best" line to fit two variables, so
that one variable can be used to predict the other. Multiple linear regression is an
extension of linear regression, where more than two variables are involved and the
data are fit to a multidimensional surface. Using regression to find a mathematical
equation to fit the data helps smooth out the noise.
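The equi-depth binning example above can be reproduced with a short sketch; it is a minimal illustration of smoothing by bin means and by bin boundaries, using the price values from the example.
```python
# Equi-depth binning (depth 3) of the sorted price data from the example above.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the
# bin's minimum and maximum.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```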
Another important issue in data integration is redundancy, which can often be detected by
correlation analysis of two attributes A and B. If the resulting value of the correlation measure
is greater than 1, then A and B are positively correlated. The higher the value, the more each
attribute implies the other. Hence, a high value may indicate that A (or B) could be removed as
a redundancy. If the resulting value is equal to 1, then A and B are independent and there is no
correlation between them. If the resulting value is less than 1, then A and B are negatively
correlated, meaning that each attribute discourages the other.
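A correlation measure consistent with this interpretation (values above, equal to, and below 1) is the ratio of the observed joint probability of A and B to the probability expected if they were independent:
corr(A, B) = P(A and B) / (P(A) * P(B))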
A third important issue in data integration is the detection and resolution of data value
conflicts. Careful integration of the data from multiple sources can help reduce and avoid
redundancies and inconsistencies in the resulting data set.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
4. Generalization of the data, where low level or 'primitive' (raw) data are replaced by higher
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher level concepts, like city or county. Similarly, values for
numeric attributes, like age, may be mapped to higher level concepts, like young, middle-
aged, and senior.
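As a small sketch of generalization through concept hierarchies, the mappings below (street to city, and the age cut-offs for young, middle-aged, and senior) are illustrative assumptions:
```python
# Illustrative concept hierarchy for a categorical attribute: street -> city.
street_to_city = {
    "5th Avenue": "New York",
    "Market Street": "San Francisco",
}

def generalize_age(age):
    """Map a numeric age to a higher-level concept; the cut-offs are illustrative."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

records = [("5th Avenue", 28), ("Market Street", 47), ("Market Street", 71)]
print([(street_to_city[street], generalize_age(age)) for street, age in records])
# [('New York', 'young'), ('San Francisco', 'middle-aged'), ('San Francisco', 'senior')]
```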
Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
In z-score normalization (or zero-mean normalization), the values for an attribute A are
normalized based on the mean and standard deviation of A. A value v of A is normalized to
v' by computing
v' = (v - meanA) / std_devA
where meanA and std_devA are the mean and standard deviation, respectively, of attribute
A. This method of normalization is useful when the actual minimum and maximum of
attribute A are unknown, or when there are outliers which dominate the min-max
normalization.
Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal places moved depends on the maximum absolute value of
A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
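The three normalization methods above can be sketched together; the values of attribute A below are illustrative assumptions:
```python
import math

values = [200.0, 300.0, 400.0, 600.0, 1000.0]  # illustrative values of attribute A

# Min-max normalization to a new range, here [0.0, 1.0].
new_min, new_max = 0.0, 1.0
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Z-score (zero-mean) normalization.
mean = sum(values) / len(values)
std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_score = [(v - mean) / std_dev for v in values]

# Decimal scaling: divide by 10^j, with j the smallest integer such that
# the largest absolute normalized value is below 1.
j = 0
while max(abs(v) for v in values) / 10 ** j >= 1:
    j += 1
decimal = [v / 10 ** j for v in values]

print(min_max, z_score, decimal, sep="\n")
```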
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same analytical
results.
Strategies for data reduction include the following.
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant or redundant attributes or
dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models or nonparametric methods such as clustering,
sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data
at multiple levels of abstraction, and are a powerful tool for data mining.
Dimensionality reduction reduces the data set size by removing irrelevant, weakly relevant, or
redundant attributes (or dimensions) from it. Typically, methods of attribute subset selection
are applied. The goal of
attribute subset selection is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original distribution
obtained using all attributes. Mining on a reduced set of attributes has an additional benefit. It
reduces the number of attributes appearing in the discovered patterns, helping to make the
patterns easier to understand.
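One common family of attribute subset selection methods is greedy (stepwise) forward selection. The sketch below is a minimal illustration under assumptions: the evaluate function and the per-attribute relevance scores are placeholders standing in for a real evaluation, such as a statistical significance test or the accuracy of a classifier built on the subset.
```python
def forward_select(attributes, evaluate, k):
    """Greedy stepwise forward selection: repeatedly add the attribute that
    most improves the evaluation score of the selected subset."""
    selected = []
    while len(selected) < k:
        best_attr, best_score = None, float("-inf")
        for attr in attributes:
            if attr in selected:
                continue
            score = evaluate(selected + [attr])
            if score > best_score:
                best_attr, best_score = attr, score
        selected.append(best_attr)
    return selected

# Placeholder relevance scores (illustrative only); a real evaluation would
# measure how well the subset preserves the class distribution.
relevance = {"age": 0.9, "income": 0.7, "street": 0.1, "phone": 0.05}
print(forward_select(list(relevance), lambda subset: sum(relevance[a] for a in subset), k=2))
# ['age', 'income']
```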
Wavelet transforms:-
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector D, transforms it to a numerically different vector, D’, of wavelet
coefficients. The two vectors are of the same length.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing
technique involving sines and cosines. There is only one DFT, yet there are several DWTs.
The general algorithm for a discrete wavelet transform is as follows:-
1. The length, L, of the input data vector must be an integer power of two. This
condition can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average. The second performs a weighted
difference.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of
length L/2. In general, these respectively represent a smoothed version of the input
data and its high-frequency content.
4. The two functions are recursively applied to the sets of data obtained in the previous
iteration, until the resulting data sets are of the desired length.
5. A selection of values from the data sets obtained in the above iterations are designated
the wavelet coefficients of the transformed data.
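A minimal sketch of this general algorithm, using the Haar wavelet (chosen here as the simplest case, an assumption rather than something specified above): the smoothing function is a scaled pairwise sum and the difference function a scaled pairwise difference, applied recursively to the smoothed half.
```python
import math

def haar_dwt(data):
    """Haar discrete wavelet transform of a vector whose length is a power of
    two (pad the input with zeros beforehand if necessary)."""
    n = len(data)
    if n == 1:
        return list(data)
    # Steps 2-3: pairwise smoothing (scaled sums) and weighted differences,
    # each producing a vector of length n/2.
    smooth = [(data[i] + data[i + 1]) / math.sqrt(2) for i in range(0, n, 2)]
    detail = [(data[i] - data[i + 1]) / math.sqrt(2) for i in range(0, n, 2)]
    # Step 4: recurse on the smoothed half; the detail (high-frequency)
    # coefficients are kept as wavelet coefficients.
    return haar_dwt(smooth) + detail

print(haar_dwt([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]))
```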
2. PCA computes N orthonormal vectors which provide a basis for the normalized input
data. These are unit vectors that each point in a direction perpendicular to the others.
These vectors are referred to as the principal components. The input data are a linear
combination of the principal components.
4. Since the components are sorted according to decreasing order of “significance", the
size of the data can be reduced by eliminating the weaker components, i.e., those with
low variance.
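A minimal sketch of these PCA steps with NumPy; the data matrix is randomly generated for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # illustrative data: 100 tuples, 3 attributes
X[:, 2] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make one attribute largely redundant

# Normalize the input data (zero mean per attribute).
Xc = X - X.mean(axis=0)

# Orthonormal principal components: eigenvectors of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by decreasing variance (eigenvalue) and keep the top k.
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]

# Project the data onto the k strongest components: the reduced representation.
X_reduced = Xc @ components
print(X_reduced.shape)  # (100, 2)
```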
CONCLUSION
Data preparation is an important issue for both data warehousing and data mining, as real-
world data tends to be incomplete, noisy, and inconsistent. Data preparation includes data
cleaning, data integration, data transformation, and data reduction.
Data cleaning routines can be used to fill in missing values, smooth noisy data,
identify outliers, and correct data inconsistencies. Data integration combines data from
multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict
detection, and the resolution of semantic heterogeneity contribute towards smooth data
integration. Data transformation routines convert the data into forms appropriate for mining.
For example, attribute data may be normalized so as to fall within a small range, such as 0.0
to 1.0. Data reduction techniques such as data cube aggregation, dimension reduction, data
compression, numerosity reduction, and discretization can be used to obtain a reduced
representation of the data while minimizing the loss of information content. Concept
hierarchies organize the values of attributes or dimensions into gradual levels of abstraction.
They are a form of discretization that is particularly useful in multilevel mining. Automatic
generation of concept hierarchies for categorical data may be based on the number of distinct
values of the attributes defining the hierarchy. For numeric data, techniques such as data
segmentation by partition rules, histogram analysis, and clustering analysis can be used.
Although several methods of data preparation have been developed, data preparation
remains an active area of research.