CHAPTER 1: DATA PREPROCESSING

2. Introduction
Data preprocessing is an often neglected but important step in the data
mining process, since real-world data tends to be messy, inconsistent, and
noisy. It is tempting to leap straight into data mining, but the data must first
be made ready. Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model; it is the first and crucial step
in building a machine learning model or carrying out a data mining activity. It
is important because noisy and unreliable data make knowledge discovery
during the training phase more difficult.

3. Learning Outcome
It contains the list of competencies that students should acquire during
the learning process.
• Explain why we need to preprocess data.
• Identify the different data preprocessing tasks.
• Identify and explain the different data cleaning methods.
• Identify and explain the different data transformation techniques and strategies.
• Identify and explain the different data discretization methods.

4. Learning Content
It contains readings, selected discussion questions, and sets of
activities that students can work on individually or in groups.
Data Preprocessing
1. Why do we preprocess data?
a. Data Quality
2. Types of Data Preprocessing
a. Data Cleaning
i. Missing Values
ii. Noisy Data
iii. Data Cleaning as a Process
b. Data Integration
i. Entity Identification Problem
c. Data Reduction
i. Dimensionality Reduction
ii. Numerosity Reduction
iii. Data Compression
d. Data Transformation
e. Data Discretization
Why do we preprocess data?
In a real-world scenario, data generally contains noise and missing values, and
may be in an unusable format that cannot be used directly for data mining activities.
Cleaning the data and making it suitable for knowledge discovery is therefore a required task.
Real-world data are generally:

o Incomplete. The data lacks attribute values, lacks certain attributes of
interest, or contains only aggregate data.
o Noisy. The data contains errors or outliers.
o Inconsistent. The data contains discrepancies in codes or names.

These problems can arise because of the following:

• Data entry, data transmission, or data collection problems
• Discrepancies in naming conventions
• Duplicate records
• Incomplete data
• Contradictions in the data
Data Quality

A well-accepted multidimensional view of data quality covers the
following characteristics:

• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
• Accessibility

If the data does not conform to these characteristics, decisions made from
the knowledge discovered from the data are jeopardized. "Garbage in, garbage
out" applies in data analytics just as in other areas of computing.

Types of Data Preprocessing

1. Data cleaning

Data cleaning is the process of cleaning the data in such a way that it can be
easily integrated. This includes filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.

1.1. Missing Values

Data is not always available. For example, many tuples have no recorded
value for several attributes, such as customer income in sales data.

Missing data may be due to:

• equipment malfunction
• values that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not considered important at the time of collection
• changes in the data format or contents of the database over time, along
with changes in the corresponding enterprise organization
Handling Missing Data

▪ Fill in missing values (attribute or class value):

1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with
missing values. It is especially poor when the percentage of missing
values per attribute varies considerably. By ignoring the tuple, we do
not make use of the remaining attribute values in the tuple, even
though such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible for a large data set with many
missing values.

3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
"Unknown" or −∞. If missing values are replaced by, say, "Unknown,"
then the mining program may mistakenly think that they form an
interesting concept, since they all have a value in common—that of
"Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value: Measures of central
tendency indicate the "middle" value of a data distribution. For normal
(symmetric) data distributions the mean can be used, while skewed
data distributions should employ the median. For example, suppose
that the data distribution regarding the income of customers is
symmetric and that the mean income is P23,000. Use this value to
replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple: For example, if classifying customers
according to credit risk, we may replace the missing value with the
mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is
skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction. For example, using
the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.

Methods 3 through 6 may bias the data, as the filled-in value may not
be accurate. Method 6 is, however, a common technique. Compared to the
other approaches, it uses the most information from the present data to
predict missing values. By considering the values of the other attributes in
its estimation of the missing value for income, there is a stronger likelihood
that the relationships between income and the other attributes are preserved.

It is important to remember that, in some situations, a missing value
does not imply an error in the data. For example, when applying for a credit
card, applicants might be asked to supply their driver's license number.
Applicants who do not have a driver's license can, of course, leave this field
blank. Ideally, forms should allow respondents to specify values such as
"not applicable." Software routines may also be used to uncover other null
values (e.g., "don't know," "?", or "none"). Ideally, each attribute will have
one or more rules regarding the null condition. The rules can specify whether
or not nulls are permitted and/or how such values should be handled or
transformed. Fields may also be intentionally left blank if they are to be
supplied at a later stage of the business process. Hence, although we should
do our best to clean the data after it is captured, the design of a proper
database and data entry procedure will help reduce the number of missing
values or errors in the first place.
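
To make the options above concrete, the following is a minimal Python sketch (not part of the original text) of strategies 1, 3, 4, and 5 using pandas. The small customer table and its column names are invented for illustration; strategy 6 would typically fit a regression or decision tree model in the same spirit.

import pandas as pd

# Hypothetical customer table; None marks missing values.
df = pd.DataFrame({
    "credit_risk": ["low", "high", "low", "high", "low", None],
    "income":      [23000.0, None, 25000.0, 40000.0, None, 30000.0],
    "city":        ["Manila", "Cebu", None, "Davao", "Manila", "Cebu"],
})

# Strategy 1 - ignore the tuple: drop rows whose class label is missing.
labeled = df.dropna(subset=["credit_risk"]).copy()

# Strategy 3 - global constant: replace missing city values with "Unknown".
labeled["city"] = labeled["city"].fillna("Unknown")

# Strategy 4 - central tendency: fill income with the overall median.
labeled["income_overall"] = labeled["income"].fillna(labeled["income"].median())

# Strategy 5 - class-wise central tendency: fill income with the median
# income of customers in the same credit-risk category.
labeled["income_by_class"] = (
    labeled.groupby("credit_risk")["income"]
           .transform(lambda s: s.fillna(s.median()))
)

print(labeled)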
1.2 Noisy Data

Noise is a random error or variance in a measured variable. Noisy
attribute values may be due to:

• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions

Handling Noisy Data

▪ Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
1. Sort the attribute values and partition them into bins;
2. Then smooth by bin means, bin medians, or bin boundaries.

In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8, and 15
in Bin 1 is 9; therefore, each original value in this bin is replaced by the
value 9. Similarly, smoothing by bin medians can be employed, in which
each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries, and each bin value is then replaced by the
closest boundary value. In general, the larger the width, the greater the
effect of the smoothing. A code sketch at the end of this list illustrates
smoothing by bin means and bin boundaries.
Sorted data for price (in pesos): 4, 8, 15, 21, 21, 24, 25, 28, 34.

Figure x. Binning methods for data smoothing.


▪ Regression
Data smoothing can also be done by regression, a technique that
conforms data values to a function. Linear regression involves finding the
“best” line to fit two attributes (or variables) so that one attribute can be
used to predict the other.
Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.

▪ Outlier analysis by clustering. Outliers may be detected by clustering, for
example, where similar values are organized into groups, or "clusters."
Intuitively, values that fall outside of the set of clusters may be considered
outliers.

▪ Combined computer and human inspection. Outliers may be identified
through a combination of computer and human inspection: the computer
detects suspicious values, which are then checked by a human.
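
The following minimal Python sketch (not from the original module) illustrates the binning example above, smoothing the sorted price data by bin means and by bin boundaries using a bin depth of 3.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted data for price (in pesos)
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of the
# bin's minimum and maximum.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]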
1.3 Data Cleaning as a Process

Discrepancy detection is the first step in the data cleaning process.
Discrepancies can be caused by a variety of factors, including poorly designed
data entry forms with several optional fields, human error in data entry,
deliberate errors (e.g., respondents who do not wish to share information
about themselves), and data decay (e.g., obsolete addresses). Discrepancies may
also arise from inconsistent data representations and inconsistent use of
codes. Other sources of discrepancies include errors in instrumentation
devices that record data and system errors. Errors can also occur when the
data are (inadequately) used for purposes other than originally intended. There
may also be inconsistencies due to data integration (e.g., where a given
attribute can have different names in different databases).

As a starting point in discrepancy detection, you may use any knowledge
you already have regarding properties of the data. Such knowledge, or
"data about data," is referred to as metadata. For example, what are the data
type and domain of each attribute? What are the acceptable values for each
attribute? Basic statistical descriptions can also be used to identify anomalies
and data trends. For example, find the mean, median, and mode values.
Are the data symmetric or skewed? What is the range of values? Do all values
fall within the expected range? What is the standard deviation of each attribute?
Values that are more than two standard deviations away from the mean for a
given attribute may be flagged as potential outliers. Are there any known
dependencies between attributes? In this step, you may write your own scripts
and/or use some of the tools discussed later. From this, you may
find noise, outliers, and unusual values that need investigation.
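
As an illustration, the following minimal Python sketch (not part of the original text) computes basic statistical descriptions for a hypothetical age attribute, flags values more than two standard deviations from the mean, and checks an assumed acceptable range taken from metadata.

import statistics

ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 120]  # 120 looks suspicious

mean = statistics.mean(ages)
std = statistics.stdev(ages)
print(f"mean={mean:.1f}, median={statistics.median(ages)}, stdev={std:.1f}")

# Flag values more than two standard deviations away from the mean.
outliers = [a for a in ages if abs(a - mean) > 2 * std]
print("potential outliers:", outliers)            # [120]

# Domain rules from metadata can be checked the same way, e.g. an
# acceptable range of 0-110 for age.
out_of_range = [a for a in ages if not (0 <= a <= 110)]
print("out of expected range:", out_of_range)     # [120]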

As a data analyst, you should be on the lookout for the inconsistent use
of codes and any inconsistent data representations (e.g., “2010/12/25” and
“25/12/2010” for date). Field overloading is another error source that typically
results when developers squeeze new attribute definitions into unused (bit)
portions of already defined attributes (e.g., an unused bit of an attribute that has
a value range that uses only, say, 31 out of 32 bits).
The data should also be examined using the following three rules:

• Unique rule says that each value of the given attribute must be different
from all other values for that attribute.
• Consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also
be unique (e.g., as in check numbers).
• Null rule specifies the use of blanks, question marks, special characters,
or other strings that may indicate the null condition (e.g., where a value
for a given attribute is not available), and how such values should be
handled. The null rule should specify how to record the null condition,
for example, such as to store zero for numeric attributes, a blank for
character attributes, or any other conventions that may be in use (e.g.,
entries like “don’t know” or “?” should be transformed to blank).
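
A minimal Python sketch (not from the original text) of how these three rules might be checked against a hypothetical column of check numbers, where "" and "?" stand for null markers:

raw = ["1001", "1002", "1002", "1004", "?", "1006", ""]
NULL_MARKERS = {"", "?", "don't know", "n/a"}

values = [v for v in raw if v not in NULL_MARKERS]
numbers = [int(v) for v in values]

# Unique rule: every non-null value must differ from all the others.
duplicates = {n for n in numbers if numbers.count(n) > 1}

# Consecutive rule: no gaps between the lowest and highest values.
missing = sorted(set(range(min(numbers), max(numbers) + 1)) - set(numbers))

# Null rule: count how many entries used a recognized null marker.
null_count = len(raw) - len(values)

print("duplicates:", duplicates)        # {1002}
print("missing in sequence:", missing)  # [1003, 1005]
print("null entries:", null_count)      # 2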

2. Data Integration

Data integration is the process of integrating/combining data from multiple
sources, which may include multiple databases, data cubes, or files. Its main
purpose is to combine data from multiple sources into a coherent data store.
Careful integration can help reduce and avoid redundancies and inconsistencies
in the resulting data set, which in turn can improve the accuracy and speed of
the subsequent data mining process.

2.1 Entity Identification Problem

This refers to the problem of how real-world entities from multiple data
sources can be matched up. There are a number of issues to consider during
data integration, including the following:

▪ Schema integration
o Integrate metadata from different sources.
o Attribute identification problem: the "same" attribute may have
different names in different data sources.

▪ Instance integration
o Integrate instances from different sources.
o For the same real-world entity, attribute values from different
sources may be different.
o Possible reasons: different representations, different styles,
different scales, errors.

Approach

▪ Identification
o Detecting corresponding tables from different sources is usually done manually.
o Detecting corresponding attributes from different sources may use
correlation analysis, e.g., A.cust-id ≡ B.cust-#.
o Detecting duplicate records from different sources involves
approximate matching of attribute values, e.g., 3.14283 ≡ 3.1,
Schwartz ≡ Schwarz (see the sketch below).

▪ Treatment
o Merge corresponding tables.
o Use attribute values as synonyms.
o Remove duplicate records once the data have been integrated into the warehouse.
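
The following minimal Python sketch (not part of the original module) shows one way approximate matching could be done for duplicate detection, using difflib from the standard library; the names and the 0.85 similarity threshold are illustrative choices only.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_a = ["Schwartz", "Dela Cruz", "Santos"]
source_b = ["Schwarz", "De la Cruz", "Reyes"]

for name_a in source_a:
    for name_b in source_b:
        score = similarity(name_a, name_b)
        if score >= 0.85:  # likely the same real-world entity
            print(f"{name_a!r} ~ {name_b!r} (similarity {score:.2f})")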

3. Data reduction

Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet closely maintains
the integrity of the original data. Data reduction strategies include dimensionality
reduction, numerosity reduction, and data compression.

3.1. Dimensionality reduction is the process of reducing the number
of random variables or attributes under consideration. Dimensionality
reduction methods include wavelet transforms and principal
components analysis.

o Data cube aggregation: applying roll-up, slice, or dice operations.
o Removing irrelevant attributes: attribute selection (filter and
wrapper methods), searching the attribute space.
o Principal component analysis (numeric attributes only):
searching for a lower-dimensional space that can best represent
the data (see the sketch below).
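
The following is a minimal sketch (not from the original text) of principal component analysis with scikit-learn on a small synthetic numeric matrix; the data, the scaling step, and the choice of two components are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 100 records with 5 correlated numeric attributes (synthetic data).
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # search for a 2-D space that best represents the data
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)                        # (100, 5) -> (100, 2)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))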

3.2. Numerosity reduction techniques replace the original data volume
by alternative, smaller forms of data representation, or reduce the
number of attribute values. These techniques may be parametric or
nonparametric. For parametric methods, a model is used to estimate
the data, so that typically only the data parameters need to be stored
instead of the actual data (outliers may also be stored). Regression
and log-linear models are examples. Nonparametric methods for
storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.
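
As a brief illustration (not part of the original text), the following sketch applies two nonparametric numerosity-reduction techniques with NumPy on synthetic data: a simple random sample without replacement, and an equal-width histogram that stores bucket counts instead of the raw values.

import numpy as np

rng = np.random.default_rng(42)
prices = rng.exponential(scale=500.0, size=10_000)    # 10,000 raw values

# Sampling: keep a 1% simple random sample without replacement.
sample = rng.choice(prices, size=100, replace=False)

# Histogram: represent the data by 10 equal-width buckets (counts + edges).
counts, edges = np.histogram(prices, bins=10)

print("raw values:", prices.size)
print("sample size:", sample.size)
print("histogram buckets:", counts.size, "counts:", counts.tolist())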

3.3. Data compression is the process in which transformations are
applied so as to obtain a reduced or "compressed" representation of
the original data. If the original data can be reconstructed from the
compressed data without any information loss, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation
of the original data, then the data reduction is called lossy. There
are several lossless algorithms for string compression; however,
they typically allow only limited data manipulation. Dimensionality
reduction and numerosity reduction techniques can also be
considered forms of data compression.
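
To illustrate the lossless/lossy distinction only (this example is not from the original text), the sketch below compresses a synthetic numeric array with zlib and reconstructs it exactly, then downcasts the same array to lower precision, from which only an approximation can be recovered.

import zlib
import numpy as np

data = np.tile(np.linspace(0.0, 1.0, 1000), 50)   # 50,000 float64 values

# Lossless: compress the raw bytes and reconstruct them exactly.
compressed = zlib.compress(data.tobytes())
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float64)
print("lossless identical:", np.array_equal(data, restored))   # True

# Lossy: store at lower precision; only an approximation is recovered.
lossy = data.astype(np.float32)
print("lossy max error:", float(np.max(np.abs(data - lossy.astype(np.float64)))))
print("bytes:", data.nbytes, "->", len(compressed), "(lossless) /", lossy.nbytes, "(lossy)")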

There are many other ways of organizing methods of data
reduction. The computational time spent on data reduction should
not outweigh or "erase" the time saved by mining on a reduced data
set. Some of these methods are the following:

• Discrete Wavelet Transform
• Principal Component Analysis
• Attribute Subset Selection
• Histograms
• Clustering
• Sampling
• Data Cube Aggregation
4. Data Transformation and Data Discretization

Data transformation is the preprocessing step that transforms the data into
a reliable form or shape appropriate for mining. It includes normalization and
aggregation. This makes the resulting mining process more efficient and the
discovered patterns easier to understand. Data discretization, a form of data
transformation, is also discussed here.

Strategies for data transformation include the following:

o Smoothing, which works to remove noise from the data. Techniques
include binning, regression, and clustering.

o Attribute construction (or feature construction), where new attributes
are constructed and added from the given set of attributes to help the
mining process.

o Aggregation, where summary or aggregation operations are applied to
the data. For example, daily sales data may be aggregated so as to
compute monthly and annual total amounts. This step is typically used
in constructing a data cube for data analysis at multiple abstraction
levels (e.g., the change in profit over consecutive years). A code sketch
at the end of this list illustrates aggregation together with normalization.

o Normalization, where the attribute data are scaled so as to fall within a
smaller range. It is used to make different records comparable, and it
attempts to give all attributes equal weight. Normalization is particularly
useful for classification algorithms involving neural networks or distance
measurements, such as nearest-neighbor classification and clustering
(see the sketch at the end of this list).
o Discretization, where the raw values of a numeric attribute (e.g., age)
are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual
labels (e.g., youth, adult, senior). The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept hierarchy for
the numeric attribute. More than one concept hierarchy can be defined
for the same attribute to accommodate the needs of various users.

o Concept hierarchy generation for nominal data, where attributes such
as street can be generalized to higher-level concepts, like city or country.
Many hierarchies for nominal attributes are implicit within the database
schema and can be automatically defined at the schema definition level.
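
The following minimal pandas sketch (not part of the original text) illustrates the aggregation and normalization strategies above on a small synthetic daily-sales table; the column names and the date range are invented for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "sales": rng.integers(1_000, 50_000, size=365).astype(float),
})

# Aggregation: roll daily sales up to monthly and annual totals.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
annual_total = daily["sales"].sum()

# Normalization: scale the monthly totals to a smaller range.
min_max = (monthly - monthly.min()) / (monthly.max() - monthly.min())   # values in [0, 1]
z_score = (monthly - monthly.mean()) / monthly.std()                    # mean 0, std 1

print(monthly.head(3))
print(min_max.round(3).head(3))
print(z_score.round(3).head(3))
print("annual total:", annual_total)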

5. Discretization

Data discretization converts a large number of data values into a smaller
set of values, so that data evaluation and data management become much
easier. It is part of data reduction, replacing numerical attributes with
nominal ones.

o Discretization by Binning

Binning is a top-down splitting technique based on a specified
number of bins. Binning methods are also used as discretization methods
for data reduction and concept hierarchy generation. For example,
attribute values can be discretized by applying equal-width or equal-
frequency binning, and then replacing each bin value by the bin mean or
median, as in smoothing by bin means or smoothing by bin medians,
respectively. These techniques can be applied recursively to the
resulting partitions to generate concept hierarchies. A code sketch at
the end of this section illustrates equal-width, equal-frequency, and
entropy-based discretization.

o Discretization by Histogram Analysis

Like binning, histogram analysis is an unsupervised discretization
technique because it does not use class information. A histogram
partitions the values of an attribute, A, into disjoint ranges called buckets
or bins. In an equal-width histogram, for example, the values are
partitioned into equal-size partitions or ranges. With an equal-frequency
histogram, the values are partitioned so that, ideally, each partition
contains the same number of data tuples. The histogram analysis
algorithm can be applied recursively to each partition in order to
automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has
been reached. A minimum interval size can also be used per level to
control the recursive procedure. This specifies the minimum width of a
partition, or the minimum number of values for each partition at each
level. Histograms can also be partitioned based on cluster analysis of
the data distribution, as described next.

o Discretization by Cluster, Decision Tree, and Correlation Analyses

Clustering, decision tree analysis, and correlation analysis can be
used for data discretization. We briefly study each of these approaches.
Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numeric attribute, A, by
partitioning the values of A into clusters or groups. Clustering takes the
distribution of A into consideration, as well as the closeness of data
points, and therefore is able to produce high-quality discretization
results.

Clustering can be used to generate a concept hierarchy for A by
following either a top-down splitting strategy or a bottom-up merging
strategy, where each cluster forms a node of the concept hierarchy. In
the former, each initial cluster or partition may be further decomposed
into several subclusters, forming a lower level of the hierarchy. In the
latter, clusters are formed by repeatedly grouping neighboring clusters
in order to form higher-level concepts.
Techniques to generate decision trees for classification can be
applied to discretization. Such techniques employ a top-down splitting
approach. Unlike the other methods mentioned so far, decision tree
approaches to discretization are supervised, that is, they make use of
class label information. For example, we may have a data set of patient
symptoms (the attributes) where each patient has an associated
diagnosis class label. Class distribution information is used in the
calculation and determination of split-points (data values for partitioning
an attribute range). Intuitively, the main idea is to select split-points so
that a given resulting partition contains as many tuples of the same class
as possible. Entropy is the most commonly used measure for this
purpose. To discretize a numeric attribute, A, the method selects the
value of A that has the minimum entropy as a split-point, and recursively
partitions the resulting intervals to arrive at a hierarchical discretization.
Such discretization forms a concept hierarchy for A.
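
The following minimal sketch (not from the original text) contrasts the approaches described in this section: unsupervised equal-width and equal-frequency binning with pandas, and supervised entropy-based splitting with a shallow scikit-learn decision tree. The ages and class labels are synthetic, and the tool choices are illustrative, not prescribed by the module.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
age = pd.Series(rng.integers(13, 71, size=200), name="age")
# Hypothetical class label: buying is more likely for ages 25-45.
buys = (((age >= 25) & (age <= 45)) & (rng.random(200) < 0.8)).astype(int)

# Unsupervised, equal-width: three intervals of equal width over the age range.
equal_width = pd.cut(age, bins=3)

# Unsupervised, equal-frequency: each interval holds roughly the same number of tuples.
equal_freq = pd.qcut(age, q=3)

# Supervised, entropy-based: a shallow decision tree picks split points for age
# using the class labels; its thresholds become the interval boundaries.
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=4, random_state=0)
tree.fit(age.to_frame(), buys)
split_points = sorted(t for t in tree.tree_.threshold if t != -2)   # -2 marks leaf nodes

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print("entropy-based split points:", [round(float(t), 1) for t in split_points])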

After the completion of these tasks, the data is ready for mining.

5. Recommended learning materials and resources for supplementary reading

• https://youtu.be/J61r--lv7-w
• https://youtu.be/2k6blvMVs4g
Assessment Task
1. In your own words, define and explain the concept of data cleaning.
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_____________________________________________________________

2. Using the following data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21,
22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given
data.

(b) Use smoothing by bin boundaries to smooth these data, using a bin depth
of 3. Illustrate your steps.

3. Discuss issues to consider during data integration.


References
Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques
(3rd ed.). Morgan Kaufmann.
https://t4tutorials.com/major-tasks-of-data-pre-processing/
