Chap 1 Data Preprocessing
2. Introduction
Data preprocessing is an often neglected but important step in the data
mining process, since real-world data tends to be messy, inconsistent, and
noisy. It is tempting to leap straight into data mining, but first we need to get
the data ready. Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model or data mining task. It is the
first and crucial step when building a machine learning model or carrying out a
data mining activity. It is important because if the data is noisy and unreliable,
knowledge discovery during the training phase becomes more difficult.
3. Learning Outcome
It contains the list of competencies that students should acquire during
the learning process.
• Explain why we need to preprocess data.
• Identify the different data preprocessing tasks.
• Identify and explain the different data cleaning methods.
• Identify and explain the different data transformation techniques.
• Identify and explain the different data reduction strategies.
• Identify and explain the different data discretization methods.
4. Learning Content
It contains readings, selection and discussion questions, and sets of
activities that students can work on individually or in groups.
Data Preprocessing
1. Why do we preprocess data?
a. Data Quality
2. Types of Data Preprocessing
a. Data Cleaning
i. Missing Values
ii. Noisy Data
iii. Data Cleaning as a Process
b. Data Integration
i. Entity Identification Problem
c. Data Reduction
i. Dimensionality Reduction
ii. Numerosity Reduction
iii. Data Compression
d. Data Transformation
e. Data Discretization
Why do we preprocess data?
In a real-world scenario, data generally contains noise and missing values, and
may be in a format that cannot be used directly for data mining activities. Cleaning
the data and making it suitable for knowledge discovery is therefore a required task.
Data quality is measured along the following dimensions:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
• Accessibility
If the data does not conform to these characteristics, the decisions made from
the knowledge discovered from the data are jeopardized. "Garbage in, garbage
out" applies in data analytics just as it does in other areas of computing.
1. Data Cleaning
Data cleaning is the process of cleaning the data so that it can be easily
integrated and analyzed. It includes filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies. Data may be dirty or
incomplete for several reasons:
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data were not considered important at the time of collection
• the data format or contents of the database changed over time along
with the corresponding enterprise organization
Handling Missing Data
There are several ways to handle missing values:
1. Ignore the tuple: This is usually done when the class label is missing. It
is not very effective unless the tuple contains several attributes with
missing values.
2. Fill in the missing value manually: This approach is time consuming and
may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”.
4. Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value: For normal (symmetric) data
distributions the mean can be used, while skewed data distributions
should employ the median.
5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple: For example, if classifying customers
according to credit risk, we may replace the missing value with the
mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is
skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction. For example, using
the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
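As an illustration of method 5 (not part of the original module), here is a minimal
Python sketch using pandas; the DataFrame, its income and credit_risk columns, and
the values are hypothetical:

    import pandas as pd

    # Hypothetical customer data with one missing income value.
    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "high"],
        "income":      [52000, 48000, 30000, None, 27000],
    })

    # Method 5: replace the missing income with the mean income of customers
    # in the same credit-risk category (use .median() if the class is skewed).
    df["income"] = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean())
    )
    print(df)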
▪ Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
1. Sort the attribute values and partition them into bins;
2. Then smooth by bin means, bin median, or bin boundaries.
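As an illustration (not part of the original text), the sketch below implements both
smoothing strategies in Python, assuming equal-depth (equal-frequency) bins; the
price values are made up:

    import numpy as np

    def smooth_by_bins(values, depth=3, method="means"):
        """Sort the values, partition them into equal-depth bins, then replace
        each value by its bin mean or by the nearest bin boundary."""
        data = sorted(values)
        smoothed = []
        for i in range(0, len(data), depth):
            bin_ = data[i:i + depth]
            if method == "means":
                smoothed.extend([round(float(np.mean(bin_)), 2)] * len(bin_))
            else:  # "boundaries": snap each value to the nearer of the bin's min/max
                lo, hi = bin_[0], bin_[-1]
                smoothed.extend([lo if v - lo <= hi - v else hi for v in bin_])
        return smoothed

    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical sorted prices
    print(smooth_by_bins(prices, method="means"))
    # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
    print(smooth_by_bins(prices, method="boundaries"))
    # [4, 4, 15, 21, 21, 24, 25, 25, 34]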
As a data analyst, you should be on the lookout for the inconsistent use
of codes and any inconsistent data representations (e.g., “2010/12/25” and
“25/12/2010” for date). Field overloading is another error source that typically
results when developers squeeze new attribute definitions into unused (bit)
portions of already defined attributes (e.g., an unused bit of an attribute that has
a value range that uses only, say, 31 out of 32 bits).
The data should also be examined using the following three rules:
• Unique rule says that each value of the given attribute must be different
from all other values for that attribute.
• Consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also
be unique (e.g., as in check numbers).
• Null rule specifies the use of blanks, question marks, special characters,
or other strings that may indicate the null condition (e.g., where a value
for a given attribute is not available), and how such values should be
handled. The null rule should specify how to record the null condition,
for example, such as to store zero for numeric attributes, a blank for
character attributes, or any other conventions that may be in use (e.g.,
entries like “don’t know” or “?” should be transformed to blank).
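As an illustrative sketch (the helper name, the null markers, and the check-number
values below are my own assumptions, not part of the text), the three rules could be
checked on a column like this:

    import pandas as pd

    def check_rules(series, null_markers=("", "?", "don't know")):
        """Check the unique, consecutive, and null rules for an identifier
        column such as check numbers."""
        # Null rule: treat the agreed-upon markers (or NaN) as the null condition.
        nulls = series.isin(null_markers) | series.isna()
        values = pd.to_numeric(series[~nulls], errors="coerce").dropna().astype(int)

        unique_ok = values.is_unique                             # unique rule
        expected = set(range(values.min(), values.max() + 1))
        consecutive_ok = unique_ok and set(values) == expected   # consecutive rule
        return {"nulls": int(nulls.sum()),
                "unique": unique_ok,
                "consecutive": consecutive_ok}

    check_numbers = pd.Series([1001, 1002, "?", 1004, 1005])
    print(check_rules(check_numbers))
    # {'nulls': 1, 'unique': True, 'consecutive': False}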
2. Data Integration
▪ Schema integration
o Integrate metadata from different sources
o Attribute identification problem: “same” attributes from multiple
data sources may have different names
▪ Instance integration
o Integrate instances from different sources
o For the same real-world entity, attribute values from different
sources may be different
o Possible reasons: different representations, different styles,
different scales, errors
Approach
▪ Identification
o Detect corresponding tables from different sources (typically a
manual task)
o Detect corresponding attributes from different sources (may use
correlation analysis)
e.g., A.cust-id ≡ B.cust-#
o Detect duplicate records from different sources (involves
approximate matching of attribute values)
e.g., 3.14283 ≡ 3.1, Schwartz ≡ Schwarz
▪ Treatment
o Merge corresponding tables
o Use attribute values as synonyms
o Remove duplicate records once the data have been integrated
into the data warehouse
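As a small illustration of these ideas (not from the original text; the attribute names
and values are hypothetical), approximate string matching and correlation analysis can
be sketched in Python as follows:

    import difflib
    import numpy as np

    # Approximate matching of attribute values from two sources,
    # e.g., deciding that "Schwartz" and "Schwarz" likely denote the same entity.
    similarity = difflib.SequenceMatcher(None, "Schwartz", "Schwarz").ratio()
    print(f"name similarity: {similarity:.2f}")  # about 0.93; compare to a chosen threshold

    # Correlation analysis to detect corresponding numeric attributes:
    # a correlation close to 1 suggests the two columns track the same quantity.
    a_total_spend = np.array([120.0, 340.5, 80.2, 510.9, 230.0])
    b_purchases   = np.array([118.0, 335.0, 84.0, 505.0, 228.5])
    print(f"correlation: {np.corrcoef(a_total_spend, b_purchases)[0, 1]:.3f}")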
3. Data Reduction
5. Discretization
o Discretization by Binning
After the completion of these tasks, the data is ready for mining.
2. Using the following data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21,
22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given
data.
(b) Use smoothing by bin boundaries to smooth these data, using a bin depth of
3. Illustrate your steps.
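To check your work, the smooth_by_bins sketch from the binning discussion above can
be run on the age data (again, this helper is an illustrative assumption, not a
prescribed solution):

    ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
            30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
    print(smooth_by_bins(ages, depth=3, method="means"))       # smoothing by bin means
    print(smooth_by_bins(ages, depth=3, method="boundaries"))  # smoothing by bin boundaries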