Data Preprocessing
Types of Attributes
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Discrete vs. Continuous Attributes
Discrete attribute
Has only a finite or countably infinite set of values
Ordinal
E.g., rankings (e.g., in the army or professions), grades
Interval
E.g., calendar dates, body temperatures
Why Preprocess Data?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
"Data cleaning is the number one problem in data warehousing"—DCI survey
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data in the Real World Is Dirty
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = " " (missing data)
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
Forms of data preprocessing
Portland Installing 200 Sensors to Improve Traffic Safety, Government Computer News, Matt Leonard, June 29, 2018
Portland, OR's Traffic Safety Sensor Project involves installing 200 CityIQ
sensor nodes in the city's "high crash network" to gather data on vehicle and
pedestrian traffic, in the hope that the data will help officials learn how to prevent
fatal accidents. The sensors will be installed on 30 Portland streets that account for
more than half of the city's traffic fatalities, even though they make up only
8% of the city's roadways. The nodes, which are attached to existing street
light poles, have more than 30 built-in sensors, including two cameras that
capture images of the roadway and sidewalk, and an array of environmental
sensors for measuring temperature, pressure, and humidity. Specialized chips
power vision analysis in the units and produce metadata featuring traffic and
pedestrian counts, along with time and location stamps. The metadata is
transmitted to a cloud platform, where it can be accessed via application
programming interfaces.
'You Cannot Be Serious': IBM Taps Emotions for Wimbledon Highlights, Reuters, June 26, 2018
IBM's Watson artificial intelligence (AI) platform is analyzing
players' emotions at the Wimbledon tennis tournament to
compile highlights in which players display a heightened
sense of emotion. In addition to recognizing emotions,
Watson is also analyzing crowd noise, players' movements,
and match data. IBM's Sam Seddon says the company is
using machine learning to pinpoint scenes after exciting play
when athletes show their emotions. "If you've got the visual
element from the player, and you know that it's a tight
pressure point in the match, then those are the points that
you are going to really target in on in the highlights
package," he says. Watson also is offering a Wimbledon
chatbot service via Facebook Messenger, which provides
fans with access to customized information on scores, news,
and players.
Missing Data
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
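The two mean-based strategies above can be sketched in a few lines of Python. This is a minimal illustration with made-up values (real projects would typically reach for a library such as pandas):

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    class_means = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels) if l == label and v is not None]
        class_means[label] = mean(observed)
    return [class_means[l] if v is None else v for v, l in zip(values, labels)]

income = [30, None, 50, 70, None]
labels = ["a", "a", "b", "b", "b"]
print(fill_with_mean(income))            # gaps filled with the overall mean, 50
print(fill_with_class_mean(income, labels))  # class "a" mean 30, class "b" mean 60
```

The class-conditional version is the "smarter" variant from the slide: it conditions the fill value on the class label instead of pooling all tuples.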
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to, e.g., technology limitations
Other data problems that require cleaning: incomplete data, inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then smooth by bin means or bin boundaries
Regression
smooth by fitting the data to regression functions
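Regression smoothing replaces each noisy value with the value predicted by a fitted function. A minimal sketch of simple linear (least-squares) regression, with made-up data points:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def smooth(xs, ys):
    """Replace each y with the fitted line's prediction at x."""
    a, b = fit_line(xs, ys)
    return [a * x + b for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]   # roughly y = x + 1 with noise
print(smooth(xs, ys))            # noisy values projected onto the fitted line
```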
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size (uniform grid)
If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A) / N
Equi-Depth Binning Method
• Data for price (in dollars): 24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
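The equi-depth smoothing above can be sketched in Python (bin depth of 4, as in the example; bin means are rounded to integers, and a value tied between the two bin boundaries goes to the upper boundary, which matches the worked example):

```python
def equi_depth_bins(sorted_data, depth):
    """Partition sorted data into bins of equal size (depth)."""
    return [sorted_data[i:i + depth] for i in range(0, len(sorted_data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo < hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 4)
print(smooth_by_means(bins))       # [[9]*4, [23]*4, [29]*4]
print(smooth_by_boundaries(bins))  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```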
Equi-Width Binning Method
• Data for price (in dollars): 24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-width) bins: Width = (34 - 4)/3 = 10
- Bin 1 [4-14): 4, 8, 9
- Bin 2 [14-24): 15, 21, 21
- Bin 3 [24-34]: 24, 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 19, 19, 19
- Bin 3: 28, 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9, 9
- Bin 2: 15, 21, 21
- Bin 3: 24, 24, 24, 24, 34, 34
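The equi-width variant can be sketched similarly; here the bucket index is computed from the value's offset within the overall range, and the top value is clamped into the last bin:

```python
def equi_width_bins(sorted_data, n_bins):
    """Partition sorted data into n_bins intervals of equal width."""
    lo, hi = sorted_data[0], sorted_data[-1]
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted_data:
        i = min(int((v - lo) / width), n_bins - 1)  # max value falls in last bin
        bins[i].append(v)
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_width_bins(prices, 3)
print(bins)  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]

means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(means)  # [[7, 7, 7], [19, 19, 19], [28, 28, 28, 28, 28, 28]]
```

Note that, unlike equi-depth binning, the bins now hold different numbers of values (3, 3, and 6).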
Assignment Question
Consider the following data for the attribute age: 20, 20, 21, 22, 22, 25, 25, 25, 36, 40, 45, 46, 52, 70, 13, 15, 16, 16, 19, 25, 30, 33, 33, 35, 35, 35, 35.
Data Smoothing Using Cluster Analysis
[Figure: a noisy point (X1, Y1) is smoothed to (X1, Y1') on the fitted line y = x + 1]
Handling Redundancy in Data Integration
Correlation Analysis (Numerical Data)
Pearson's product-moment correlation coefficient of two attributes p and q:

r_{p,q} = Σ(p - p̄)(q - q̄) / ((n - 1) σ_p σ_q) = (Σ(pq) - n p̄ q̄) / ((n - 1) σ_p σ_q)

where n is the number of tuples, p̄ and q̄ are the respective means, and σ_p and σ_q are the respective standard deviations of p and q.
If r_{p,q} > 0, p and q are positively correlated; if r_{p,q} = 0, they are uncorrelated; if r_{p,q} < 0, they are negatively correlated.
Correlation Analysis (Example)

Person   Height   Self-Esteem
1        68       4.1
2        71       4.6
3        62       3.8
4        75       4.4
5        58       3.2
6        60       3.1
7        67       3.8
8        68       4.1
9        71       4.3
10       69       3.7
11       68       3.5
12       67       3.2
13       63       3.7
14       62       3.3
15       60       3.4
16       63       4.0
17       65       4.1
18       67       3.8
19       63       3.4
20       61       3.6

August 5, 2019   Data Mining: Concepts and Techniques
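The example can be checked by computing r directly. For the 20 (height, self-esteem) pairs above, the coefficient comes out to roughly 0.73, a fairly strong positive correlation:

```python
import math

def pearson_r(p, q):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((x - mp) * (y - mq) for x, y in zip(p, q))
    sp = math.sqrt(sum((x - mp) ** 2 for x in p))
    sq = math.sqrt(sum((y - mq) ** 2 for y in q))
    return cov / (sp * sq)

height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]
print(round(pearson_r(height, esteem), 2))  # 0.73
```

Dividing by the raw sums of squared deviations is equivalent to the slide's formula, since the (n - 1) factors in the numerator and denominator cancel.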
Assignment Question

Car   Age (months)   Minimum stopping distance at 40 kph (meters)
A     9              28.4
B     15             29.3
C     24             37.6
D     30             36.2
E     38             36.5
F     46             35.3
G     53             36.2
H     60             44.1
I     64             44.8
J     76             47.2
Chi-Square Method (Categorical Data)
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test.
Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br.
The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.
Let (Ai, Bj) denote the event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj).

             Male   Female
Fiction      250    200
Non-fiction  50     1000
Chi-Square Method (Categorical Data)
Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as:

χ² = Σi Σj (oij - eij)² / eij

where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as

eij = count(A = ai) × count(B = bj) / N

where N is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B.
Cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
The larger the χ² value, the more likely the variables are related.
Example
Suppose that a group of 1,500 people was surveyed. The gender of each person was noted. Each person was polled as to whether their preferred type of reading material was fiction or nonfiction.
Thus, we have two attributes, gender and preferred reading.
The observed frequency (or count) of each possible joint event is summarized in the contingency table, where the numbers in parentheses are the expected frequencies (calculated based on the data distribution for both attributes):

             Male       Female
Fiction      250 (90)   200 (360)
Non-fiction  50 (210)   1000 (840)
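This example can be worked through in code. The expected counts come from row total × column total / N (e.g., 450 × 300 / 1500 = 90 for male fiction readers), and the resulting χ² for this table is about 507.9:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

#              Male  Female
table = [[250, 200],     # fiction
         [50, 1000]]     # non-fiction
print(round(chi_square(table), 1))  # 507.9
```

A χ² this large (far above the critical value for 1 degree of freedom) indicates that gender and preferred reading are strongly correlated in this group.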
Data Transformation
Normalization:
min-max normalization
z-score normalization
Attribute/feature construction
Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
Min-Max Normalization:
Performs a linear transformation on the original data.
Suppose minA and maxA are the minimum and maximum values of an attribute A.
Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA]:

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

Preserves the relationships among the original data values.

Z-score example: for an attribute with mean 54,000 and standard deviation 16,000, the value 73,600 normalizes to (73,600 - 54,000) / 16,000 = 1.225.
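Both normalizations can be sketched in a few lines. The income figures follow the z-score example above; the min-max call reuses the price data from the binning example:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Normalize v by the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

print(z_score(73600, 54000, 16000))   # 1.225
print(round(min_max(24, 4, 34), 3))   # price 24 on [4, 34] maps to 0.667
```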
Chapter 3: Data Preprocessing
Data Reduction Strategies
A data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter for the years 1997 to 1999. You are, however, interested in the annual sales (total per year), rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
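The quarterly-to-annual aggregation can be sketched as follows; the sales figures below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (year, quarter, sales) tuples for illustration.
quarterly = [
    (1997, 1, 224), (1997, 2, 408), (1997, 3, 350), (1997, 4, 586),
    (1998, 1, 310), (1998, 2, 395), (1998, 3, 420), (1998, 4, 610),
]

annual = defaultdict(int)
for year, quarter, sales in quarterly:
    annual[year] += sales

print(dict(annual))  # {1997: 1568, 1998: 1735}
```

The reduced representation (one row per year instead of four) is much smaller but answers the annual-sales question exactly.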
[Figure: decision-tree induction for attribute subset selection, with splits on attributes A1 and A6]
[Figure: original data vs. wavelet-approximated data]
The usefulness lies in the fact that the wavelet transformed data
can be truncated. A compressed approximation of the data can be
retained by storing only a small fraction of the strongest of the
wavelet coefficients.
For example, a 3-D data cube for sales with the dimensions item type, branch, and year must first be reduced to a 2-D cube, such as one with the dimensions item type and branch.
Example data (sorted prices):
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
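One way to summarize such a list is an equal-width histogram. A sketch with bucket width 10 over the price data above:

```python
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

WIDTH = 10
counts = {}
for p in prices:
    bucket = (p - 1) // WIDTH            # buckets 1-10, 11-20, 21-30
    lo = bucket * WIDTH + 1
    counts[(lo, lo + WIDTH - 1)] = counts.get((lo, lo + WIDTH - 1), 0) + 1

print(counts)  # {(1, 10): 13, (11, 20): 25, (21, 30): 14}
```

Three bucket counts now stand in for 52 individual values.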
They partition the objects into groups or clusters, so that objects within a
cluster are “similar" to one another and “dissimilar" to objects in other clusters.
Similarity is commonly defined in terms of how “close" the objects are in space,
based on a distance function.
Hierarchical Aggregation
Suppose that the tree contains 10,000 tuples with keys ranging from 1 to 9,999. The data in the tree can be approximated by an equi-depth histogram of 6 buckets for the key ranges 1 to 985, 986 to 3395, 3396 to 5410, 5411 to 8392, 8393 to 9543, and 9544 to 9999.
Each bucket contains roughly 10,000/6 items. Similarly, each bucket can be subdivided into smaller buckets, allowing for aggregate data at a finer level of detail.
Sampling
[Figure: drawing a sample from the raw data]
Sampling Example
With stratified sampling, a stratum can be created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
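A minimal sketch of stratified sampling. The customer records and group sizes are made up; sampling is proportional, with at least one record kept per stratum so the smallest group is always represented:

```python
import random

def stratified_sample(records, key, fraction, seed=0):
    """Draw a proportional sample from each stratum (at least one per stratum)."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

customers = ([("young", i) for i in range(20)] +
             [("middle-aged", i) for i in range(30)] +
             [("senior", i) for i in range(2)])
picked = stratified_sample(customers, key=lambda r: r[0], fraction=0.1)
print(len(picked))  # 2 young + 3 middle-aged + 1 senior = 6
```

A plain 10% simple random sample could easily miss the two seniors entirely; the per-stratum minimum guarantees they appear.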
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Binning
Histogram analysis
Clustering analysis
A minimum interval size can also be used per level to control the recursive procedure.
An attribute defining a high concept level will usually contain a smaller number of distinct values than an attribute defining a lower concept level.
The attribute with the most distinct values is placed at the lowest level of the hierarchy.
If a user were to specify only the attribute city for a hierarchy defining location, the system may automatically drag in all of the semantically related attributes to form a hierarchy.
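The distinct-value heuristic described above can be sketched directly. The location columns and values below are made up for illustration; fewer distinct values means a higher level in the hierarchy:

```python
def hierarchy_by_distinct_values(columns):
    """Order attributes from highest concept level (fewest distinct values) down."""
    return sorted(columns, key=lambda name: len(set(columns[name])))

# Hypothetical location data: country < state < city in distinct-value counts.
location = {
    "country": ["US", "US", "US", "CA", "CA", "US"],
    "state":   ["WA", "OR", "WA", "BC", "BC", "OR"],
    "city":    ["Seattle", "Portland", "Tacoma", "Vancouver", "Victoria", "Salem"],
}
print(hierarchy_by_distinct_values(location))  # ['country', 'state', 'city']
```

Here country has 2 distinct values, state has 3, and city has 6, so the generated hierarchy is country > state > city, matching the heuristic.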