
Data Preprocessing

1
Data Preprocessing

 Types of Attributes
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

2
Discrete vs. Continuous Attributes
 Discrete Attribute
  Has only a finite or countably infinite set of values
  E.g., zip codes, profession, or the set of words in a collection of documents
  Sometimes represented as integer variables
  Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
  Has real numbers as attribute values
  Examples: temperature, height, or weight
  Practically, real values can only be measured and represented using a finite number of digits
  Continuous attributes are typically represented as floating-point variables
3
Types of Attribute Values
 Nominal
 E.g., profession, ID numbers, eye color, zip codes

 Ordinal
 E.g., rankings (e.g., army, professions), grades,

height in {tall, medium, short}


 Binary
 E.g., medical test (positive vs. negative)

 Interval
 E.g., calendar dates, body temperatures

4
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

5
Why Preprocess Data
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics
 “Data cleaning is the number one problem in data warehousing”—
DCI survey
 Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse

6
Data in the Real World Is Dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 e.g., occupation=“ ” (missing data)

 noisy: containing noise, errors, or outliers


 e.g., Salary=“−10” (an error)

 inconsistent: containing discrepancies in codes or


names, e.g.,
 Age=“42” Birthday=“03/07/1997”

 Was rating “1,2,3”, now rating “A, B, C”

 discrepancy between duplicate records

7
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning

8
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data
9
Forms of data preprocessing

10
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

11
Portland Installing 200 Sensors to Improve Traffic
Safety, Government Computer News, Matt Leonard
June 29, 2018

Portland, OR's Traffic Safety Sensor Project involves installing 200 CityIQ
sensor nodes in the city's "high crash network" to gather data on vehicle and
pedestrian traffic, in the hope that it will help officials learn how to prevent fatal
accidents. The sensors will be installed on 30 Portland streets that account for
more than half of the city's traffic fatalities, even though they make up only
8% of the city's roadways. The nodes, which are attached to existing street
light poles, have more than 30 built-in sensors, including two cameras that
capture images of the roadway and sidewalk, and an array of environmental
sensors for measuring temperature, pressure, and humidity. Specialized chips
power vision analysis in the units and produce metadata featuring traffic and
pedestrian counts, along with time and location stamps. The metadata is
transmitted to a cloud platform, where it can be accessed via application
programming interfaces.

12
'You Cannot Be Serious': IBM Taps Emotions for
Wimbledon Highlights Reuters June 26, 2018
 IBM's Watson artificial intelligence (AI) platform is analyzing
players' emotions at the Wimbledon tennis tournament to
compile highlights in which players display a heightened
sense of emotion. In addition to recognizing emotions,
Watson is also analyzing crowd noise, players' movements,
and match data. IBM's Sam Seddon says the company is
using machine learning to pinpoint scenes after exciting play
when athletes show their emotions. "If you've got the visual
element from the player, and you know that it's a tight
pressure point in the match, then those are the points that
you are going to really target in on in the highlights
package," he says. Watson also is offering a Wimbledon
chatbot service via Facebook Messenger, which provides
fans with access to customized information on scores, news,
and players.

13
Missing Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 not register history or changes of the data
 Missing data may need to be inferred.

14
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of
missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-
based such as Bayesian formula or decision tree
15
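To make these options concrete, here is a minimal pandas sketch (the DataFrame and the column names income and class are hypothetical, not from the slides) showing tuple removal, constant fill, attribute-mean fill, and class-conditional mean fill:

```python
import pandas as pd

# Hypothetical customer data (not from the slides); None marks a missing income.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 42_000, None, 38_000],
})

# 1) Ignore the tuple: drop rows whose income is missing.
dropped = df.dropna(subset=["income"])

# 2) Fill with a global constant (here 0; a new category such as "unknown" also works).
const_filled = df.fillna({"income": 0})

# 3) Fill with the overall attribute mean.
mean_filled = df.assign(income=df["income"].fillna(df["income"].mean()))

# 4) Fill with the mean of samples belonging to the same class (smarter).
class_filled = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)

print(class_filled)
```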
Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

16
How to Handle Noisy Data?
 Binning method:
 first sort data and partition into (equi-depth) bins

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human

 Regression
 smooth by fitting the data into regression functions

17
Simple Discretization Methods: Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size:

uniform grid
 if A and B are the lowest and highest values of the

attribute, the width of intervals will be: W = (B-A)/N.


 The most straightforward

 But outliers may dominate presentation

 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:


 It divides the range into N intervals, each containing

approximately the same number of samples


 Good data scaling

 Managing categorical attributes can be tricky.

18
Equi-Depth Binning Method
• Data for price (in dollars): 24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
19
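The partitioning and smoothing above can be reproduced with a short Python sketch (standard library only; it assumes the number of values divides evenly into the bins):

```python
prices = [24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29]

def equi_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of equal frequency (count must divide evenly)."""
    data = sorted(values)
    depth = len(data) // n_bins
    return [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```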
Equi-Width Binning Method
• Data for price (in dollars): 24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-width) bins: Width = (34-4)/3= 10
- Bin 1 [4-14): 4, 8, 9
- Bin 2 [14-24): 15,21, 21
- Bin 3 [24-34]: 24,25,26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7,7,7
- Bin 2: 19, 19, 19
- Bin 3: 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9,9
- Bin 2: 15, 21, 21
- Bin 3: 24, 24, 24, 24,34,34
20
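For comparison, a similar sketch for equal-width partitioning, where the bin width is W = (34 - 4)/3 = 10 and values are assigned by range cutoffs rather than by count:

```python
prices = sorted([24, 25, 26, 34, 28, 4, 8, 9, 21, 21, 15, 29])

def equi_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) // width), n_bins - 1)   # the maximum value falls into the last bin
        bins[idx].append(v)
    return bins

for i, b in enumerate(equi_width_bins(prices, 3), start=1):
    print(f"Bin {i}: {b}")
# Bin 1: [4, 8, 9]
# Bin 2: [15, 21, 21]
# Bin 3: [24, 25, 26, 28, 29, 34]
```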
Assignment Question
Following data for the attribute age: 20, 20, 21, 22,
22, 25, 25, 25, 36, 40, 45, 46, 52, 70, 13, 15, 16,
16, 19, 25, 30, 33, 33, 35, 35, 35, 35.

(a) Use equi-depth and equi-width binning with smoothing by bin
means, modes, and boundaries to smooth the
above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for
the given data. (N for equi-width = 3)

21
Data Smoothing using Cluster Analysis

Cluster the data and use properties of the clusters to represent


the instances constituting those clusters.
22
Data Smoothing using Regression

[Figure: scatter plot of data points (x, y) with fitted regression line y = x + 1]

Data can be smoothed by fitting the data to a function, such as


with regression
23
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent
store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources

 Entity identification problem:


 Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from
different sources are different
 Possible reasons: different representations, different
scales, e.g., metric vs. British units
24
Assignment Question
 Discuss issues to consider during data integration.

25
Handling Redundancy in Data Integration

 Redundant data occur often when integration of multiple


databases
 Object identification: The same attribute or object may have
different names in different databases
 Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
 Redundant attributes may be able to be detected by correlation
analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality

26
Correlation Analysis (Numerical Data)

 Correlation coefficient (also called Pearson’s product moment


coefficient)

r_{p,q} = \frac{\sum (p - \bar{p})(q - \bar{q})}{(n-1)\,\sigma_p \sigma_q} = \frac{\sum (pq) - n\,\bar{p}\,\bar{q}}{(n-1)\,\sigma_p \sigma_q}

where n is the number of tuples, \bar{p} and \bar{q} are the respective means of p
and q, \sigma_p and \sigma_q are the respective standard deviations of p and q, and
\sum(pq) is the sum of the pq cross-product.
 If r_{p,q} > 0, p and q are positively correlated (p's values increase as q's do).
The higher the value, the stronger the correlation.
 r_{p,q} = 0: independent; r_{p,q} < 0: negatively correlated

27
Correlation Analysis (Numerical Data)

28
Correlation Analysis (Example)
Person Height Self Esteem
1 68 4.1
2 71 4.6
3 62 3.8
4 75 4.4
5 58 3.2
6 60 3.1
7 67 3.8
8 68 4.1
9 71 4.3
10 69 3.7
11 68 3.5
12 67 3.2
13 63 3.7
14 62 3.3
15 60 3.4
16 63 4.0
17 65 4.1
18 67 3.8
19 63 3.4

20 61 3.6

29
Example

30
Example

31
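As a quick check of the correlation formula, a short Python sketch computes r for the 20 (height, self-esteem) pairs listed above, using the sample standard deviation (n - 1) as in the formula; the result is about 0.73, a clear positive correlation:

```python
from statistics import mean, stdev

height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

def pearson_r(p, q):
    """r_{p,q} = sum((p - p_bar)(q - q_bar)) / ((n - 1) * sigma_p * sigma_q)."""
    n = len(p)
    p_bar, q_bar = mean(p), mean(q)
    cross = sum((x - p_bar) * (y - q_bar) for x, y in zip(p, q))
    return cross / ((n - 1) * stdev(p) * stdev(q))

# About 0.73: height and self-esteem are positively correlated in this sample.
print(round(pearson_r(height, esteem), 2))
```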
Assignment Question
 Car   Age (months)   Min. stopping distance at 40 kph (meters)

A 9 28.4
B 15 29.3
C 24 37.6
D 30 36.2
E 38 36.5
F 46 35.3
G 53 36.2
H 60 44.1
I 64 44.8
J 76 47.2

32
Chi Square Method (Categorical Data)
 For categorical (discrete) data, a correlation relationship between two attributes, A and
B, can be discovered by a χ² (chi-square) test.

 Suppose A has c distinct values, namely a_1, a_2, ..., a_c, and B has r distinct values, namely
b_1, b_2, ..., b_r.

 The data tuples described by A and B can be shown as a contingency table, with the c
values of A making up the columns and the r values of B making up the rows.
 Let (A_i, B_j) denote the joint event that attribute A takes on value a_i and attribute B takes on
value b_j, that is, (A = a_i, B = b_j).

Male Female
Fiction 250 200
Non Fiction 50 1000

33
Chi Square Method (Categorical Data)
Each and every possible (A_i, B_j) joint event has its own cell (or slot) in
the table. The χ² value (also known as the Pearson χ² statistic) is
computed as:

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

 where o_{ij} is the observed frequency (i.e., actual count) of the joint event (A_i, B_j)
 e_{ij} is the expected frequency of (A_i, B_j), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

 where N is the number of data tuples, count(A = a_i) is the number of tuples having value
a_i for A, and count(B = b_j) is the number of tuples having value b_j for B.
 Cells that contribute the most to the χ² value are those whose actual count is very
different from that expected.
 The larger the χ² value, the more likely the variables are related.
34
Example
 Suppose that a group of 1,500 people was surveyed.
 The gender of each person was noted. Each person was
polled as to whether their preferred type of reading
material was fiction or nonfiction.
 Thus, we have two attributes, gender and preferred
reading.
 The observed frequency (or count) of each possible joint
event is summarized in the contingency table where the
numbers in parentheses are the expected frequencies
(calculated based on the data distribution for both
attributes).



35
Example

Male Female
Fiction 250 200
Non Fiction 50 1000

the expected frequency for the cell (male, fiction) is

e_{(male,\,fiction)} = \frac{count(male) \times count(fiction)}{N} = \frac{300 \times 450}{1500} = 90


36
Chi-Square Calculation: An Example
                Male        Female        Sum (row)
Fiction         250 (90)    200 (360)     450
Non-fiction     50 (210)    1000 (840)    1050
Sum (col.)      300         1200          1500

 Χ2 (chi-square) calculation (numbers in parenthesis are expected


counts calculated based on the data distribution in the two categories)

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93



37
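A minimal sketch of the same calculation: expected counts are derived from the row and column totals, then summed into χ², reproducing the value above (up to rounding):

```python
# Contingency table from the survey: rows = fiction / non-fiction, columns = male / female.
observed = [[250, 200],
            [50, 1000]]

n = sum(sum(row) for row in observed)               # 1,500 surveyed people
row_totals = [sum(row) for row in observed]         # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]   # [300, 1200]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n       # e_ij = count(row_i) x count(col_j) / N
        chi2 += (o - e) ** 2 / e

# ~507.94 here; the slide's 507.93 comes from rounding each term first.
# Either way it is far above 10.828, the 0.001 cutoff at 1 degree of freedom.
print(round(chi2, 2))
```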
Chi Square Method (Categorical Data)
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

 The χ² statistic tests the hypothesis that A and B are
independent.
 The test is based on a significance level, with (r-1) × (c-1)
degrees of freedom. Here there is (2-1) × (2-1) = 1 degree of freedom, and the χ² value
needed to reject the hypothesis at the 0.001 significance level is 10.828.

 Since 507.93 > 10.828, the two attributes are (strongly) correlated
for the given group of people.



38
Example
 A year group in school chooses between drama
and history as below. Is there any difference
between boys' and girls' choices?
 Observed
Chose Chose Total
drama history
 Boys 43 55 98
 Girls 52 54 106
 Total 95 109 204



39
Correlation and Causality

 Correlation does not imply causality


 # of hospitals and # of car-theft in a city are
correlated
 Both are causally linked to the third variable:
population



40
Data Transformation
 A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
 Methods
 Smoothing: Remove noise from data

 Aggregation: Summarization, data cube construction

 Generalization: Concept hierarchy climbing

 Normalization: Scaled to fall within a small, specified range

 min-max normalization

 z-score normalization

 normalization by decimal scaling

 Attribute/feature construction

 New attributes constructed from the given ones



41
Transformation Techniques
 1. Smoothing, which works to remove the noise from data. Such techniques
include binning, clustering, and regression.

 2. Aggregation, where summary or aggregation operations are applied to the


data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts.

 3. Generalization of the data, where low level or ”primitive" (raw) data are
replaced by higher level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher level
concepts, like city or county. Similarly, values for numeric attributes, like age,
may be mapped to higher level concepts, like young, middle-aged, and
senior.

 4. Normalization, where the attribute data are scaled so as to fall within a


small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

 5. Attribute construction (or feature construction), where new attributes are


constructed and added from the given set of attributes to help the mining
process.
42
Normalization
 An attribute is normalized by scaling its values so that they fall within
a small specified range, such as 0 to 1.0.

 Normalization is particularly useful for classification algorithms


involving neural networks, or distance measurements such as
nearest-neighbor classification and clustering.

 If using the neural network back propagation algorithm for


classification mining, normalizing the input values for each attribute
measured in the training samples will help speed up the learning
phase.

 For distance-based methods, normalization helps prevent attributes


with initially large ranges (e.g., income) from outweighing attributes
with initially smaller ranges (e.g., binary attributes).



43
Methods of Normalization
 min-max normalization,
 z-score normalization,
 Normalization by decimal scaling.



44
Data Transformation:
Normalization

 Min-Max Normalization:
 Performs a linear transformation on the original data.
 Suppose minA and maxA are the minimum and
maximum values of an attribute A.
 Min-Max normalization maps a value v of A to v’ in the
range[new_minA ,new_maxA]
v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A
 Preserves the relationships among the original data



45
Example
 Suppose that the minimum and maximum
values for the attribute income are $12,000
and $98,000, respectively. We would like to
map income to the range [0.0,1.0]. By min-
max normalization, a value of $73,600 for
income is transformed to
v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

v' = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716



46
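A one-line Python version of min-max normalization, checked against the income example (min $12,000, max $98,000, target range [0.0, 1.0]):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```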
Z-Score Normalization (zero-mean normalization)

•The values for an attribute A are normalized based on


the mean and standard deviation of A. A value v of A is
normalized to v’ by computing
v' = \frac{v - \bar{A}}{\sigma_A}

where \bar{A} and \sigma_A are the mean and standard deviation, respectively, of
attribute A.



47
Example
 Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and
$16,000, respectively.
 Z-score normalization (μ: mean, σ: standard deviation):
 Ex. Let μ = 54,000, σ = 16,000. Then

v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225



48
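The same example in code, assuming the stated mean ($54,000) and standard deviation ($16,000) of income:

```python
def z_score(v, mean_a, std_a):
    """Normalize v by the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```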
Decimal scaling Normalization
 Normalization by decimal scaling:

v' = \frac{v}{10^j},  where j is the smallest integer such that \max(|v'|) < 1

• Suppose that the recorded values of A range from


-986 to 917.
• The maximum absolute value of A is 986.
• To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3) so that -986 normalizes
to -0.986.



49
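A small sketch of decimal scaling: j is increased until the largest absolute scaled value drops below 1, so the range [-986, 917] is divided by 10^3:

```python
def decimal_scaling(values):
    """Divide every value by 10^j, with j the smallest integer making max(|v'|) < 1.

    Assumes at least one non-zero value.
    """
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```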
Assignment

Use the two methods below to normalize the following


group of data:
202, 303, 404, 606, 1001

(a) min-max normalization by setting min = 0 and max = 1

(b) z-score normalization

50
Chapter 3: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration
 Data transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
51
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data
52
Data Reduction Strategies

 Why data reduction?


 A database/data warehouse may store

terabytes of data
 Complex data analysis/mining may take a

very long time to run on the complete data


set
 Data reduction: Obtain a reduced representation
of the data set that is much smaller in volume
but yet produce the same (or almost the same)
analytical results
53
Data reduction strategies
 Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
 Dimension reduction, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed
 Data compression, where encoding mechanisms are used to reduce the data
set size.
 Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need store only the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation, where raw data values
for attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction, and are
a powerful tool for data mining.
55
Data Cube Aggregation
Data cubes store multidimensional aggregated information

•Data cubes provide fast access to precomputed, summarized


data, thereby benefiting on-line analytical processing as well as
data mining.
•The cube created at the lowest level of abstraction is referred
to as the base cuboid. A cube for the highest level of
abstraction is the apex cuboid.
56
Data Cube Aggregation

 Imagine that you have collected the data for your analysis. These
data consist of the All Electronics sales per quarter, for the years
1997 to 1999. You are, however, interested in the annual sales
(total per year), rather than the total per quarter. Thus the data
can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter.

 The resulting data set is smaller in volume, without loss of


information necessary for the analysis task.



57
Sales data for a given branch of All Electronics for the years 1997
to 1999. In the data on the left, the sales are shown per quarter.
In the data on the right, the data are aggregated to provide the
annual sales.



58
Dimensionality Reduction

 Data sets for analysis may contain hundreds of attributes, many of


which may be irrelevant to the mining task, or redundant.

 if the task is to classify customers as to whether or not they are


likely to purchase a popular new CD at AllElectronics

 Attributes such as the customer's telephone number are likely to


be irrelevant, unlike attributes such as age or music taste.

 This can result in discovered patterns of poor quality.

 In addition, the added volume of irrelevant or redundant attributes


can slow down the mining process.



59
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):

 Select a minimum set of features such that the probability distribution of


different classes given the values for those features is as close as possible
to the original distribution given the values of all features

 reduces the number of attributes appearing in the discovered patterns, making them easier to understand

 Heuristic methods (due to exponential # of choices):

 step-wise forward selection

 step-wise backward elimination

 combining forward selection and backward elimination

 decision-tree induction



60
Attribute Subset Selection
 Step-wise forward selection : The procedure starts with an empty
set of attributes. The best of the original attributes is determined
and added to the set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the set.



61
Attribute Subset Selection
 Step-wise backward elimination: The procedure starts with the full
set of attributes. At each step, it removes the worst attribute
remaining in the set.

 The procedure may employ a threshold on the measure used to


determine when to stop the attribute selection process.



62
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: induced decision tree — root test A4?, whose branches lead to tests A1? and A6?,
each ending in leaves Class 1 and Class 2]

> Reduced attribute set: {A1, A4, A6}

63
Heuristic Feature Selection Methods
 There are 2^d possible sub-features (attribute subsets) of d features

 Several heuristic feature selection methods:

 Best single features under the feature independence assumption: choose


by significance tests.

 Best step-wise feature selection:

 The best single-feature is picked first

 Then the next best feature is chosen conditioned on the first, ...

 Step-wise feature elimination:

 Repeatedly eliminate the worst feature

 Best combined feature selection and elimination



64
Approaches
 If the mining task is classification, and the mining algorithm itself
is used to determine the attribute subset, then this is called a
wrapper approach; otherwise, it is a Filter approach.

 In general, the wrapper approach leads to greater accuracy since


it optimizes the evaluation measure of the algorithm while
removing attributes. However, it requires much more computation
than a Filter approach.



65
Data reduction strategies
 Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
 Dimension reduction, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed
 Data compression, where encoding mechanisms are used to reduce the data
set size.
 Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need store only the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation, where raw data values
for attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction, and are
a powerful tool for data mining.
66
Data Compression

[Figure: Original Data → (lossless compression) → Compressed Data;
Compressed Data → Original Data Approximated (lossy)]



67
Discrete Wavelet Transform (DWT)

 The discrete wavelet transform (DWT) is a linear signal processing


technique that, when applied to a data vector D, transforms it to a
numerically different vector, D′, of wavelet coefficients. The two
vectors are of the same length.

 The usefulness lies in the fact that the wavelet transformed data
can be truncated. A compressed approximation of the data can be
retained by storing only a small fraction of the strongest of the
wavelet coefficients.

 All wavelet coefficients larger than some user-specified threshold


can be retained. The remaining coefficients are set to 0.



68
More practical example

M  = [ 4  7  6  9 ]      M' = [  8.5  11.5  10.5  15  ]
     [ 6  9  3  6 ]           [  1.5   3.5  -1.5   0  ]
     [ 5  4  7  6 ]           [ -2.5  -0.5   0.5   3  ]
     [ 2  4  5  9 ]           [  0.5  -0.5   2.5   0  ]

Notice how the higher values (low frequency) are now positioned toward the top left
and the lower values (high frequency) are positioned toward the bottom right.
69
DFT Vs DWT

 The DWT achieves better lossy compression.

 If the same number of coefficients is retained for a DWT and a DFT of a given
data vector, the DWT version will provide a more accurate approximation of the
original data.

 Hence, for an equivalent approximation, the DWT requires less space than the DFT.

 Unlike DFT, wavelets are quite localized in space,


contributing to the conservation of local detail.



70
Principal component analysis
 Principal component analysis (PCA) is a mathematical procedure that
uses an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of uncorrelated
variables called principal components.

 The number of principal components is less than or equal to the number


of original variables.

 This transformation is defined in such a way that the first principal


component has as high a variance as possible (that is, accounts for as
much of the variability in the data as possible), and each succeeding
component in turn has the highest variance possible under the
constraint that it be orthogonal to (uncorrelated with) the preceding
components



73
Principal Component Analysis
 Given N data vectors from k-dimensions, find
c <= k orthogonal vectors that can be best
used to represent data
 The original data are thus projected onto a
much smaller space, resulting in data
compression.
 Works for numeric data only
 Used when the number of dimensions is large



74
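A brief sketch of PCA via SVD of the mean-centered data in NumPy (scikit-learn's PCA class gives the same projection); the 3-D points are made up for illustration:

```python
import numpy as np

# Hypothetical data: 6 observations of 3 possibly correlated numeric variables.
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

Xc = X - X.mean(axis=0)                   # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)           # fraction of variance per component
k = 2                                     # keep c <= k orthogonal directions
X_reduced = Xc @ Vt[:k].T                 # project onto the first k principal components

print(explained.round(3))
print(X_reduced.shape)                    # (6, 2): reduced representation of the data
```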
PCA Vs Wavelet
 PCA is computationally inexpensive, can be applied to
ordered and unordered attributes, and can handle sparse
data and skewed data. Multidimensional data of more than
two dimensions can be handled by reducing the problem to
two dimensions.

 For example, a 3-D data cube for sales with the dimensions
item type, branch, and year must first be reduced to a 2-D
cube, such as with the dimensions item type, and branch
year.

 In comparison with wavelet transforms for data


compression, PCA tends to be better at handling sparse
data, while wavelet transforms are more suitable for data of
high dimensionality.
78
Data reduction strategies
 Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
 Dimension reduction, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed
 Data compression, where encoding mechanisms are used to reduce the data
set size.
 Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need store only the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation, where raw data values
for attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction, and are
a powerful tool for data mining.
79
Data reduction strategies
 Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
 Dimension reduction, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed
 Data compression, where encoding mechanisms are used to reduce the data
set size.
 Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need store only the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation, where raw data values
for attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction, and are
a powerful tool for data mining.
80
Numerosity Reduction
 Can we reduce the data volume by choosing alternative “Smaller
form of data representation?”
 Parametric methods
 Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except
possible outliers)
 Log-linear models: Which estimate discrete multidimensional
probability distribution.
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling



81
Regression and Log-Linear Models
 Linear regression: Data are modeled to fit a straight line

 Often uses the least-square method to fit the line

 Random variable Y (response) can be modeled as a linear function of
another random variable X (predictor) with the equation

Y = \alpha + \beta X

 where \alpha and \beta are called regression coefficients.

 Multiple regression: allows a response variable Y to be modeled as a linear
function of a multidimensional feature vector, e.g., Y = b_0 + b_1 X_1 + b_2 X_2.

 Log-linear model: approximates discrete multidimensional probability


distributions



82
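A minimal numerosity-reduction sketch: fit Y = α + βX by least squares with np.polyfit and keep only the two coefficients instead of the raw points (the x and y arrays are made-up data):

```python
import numpy as np

# Made-up (x, y) measurements that roughly follow a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.0])

beta, alpha = np.polyfit(x, y, deg=1)   # least-squares fit of y = alpha + beta * x

# Store only the two parameters; approximate values can be reconstructed on demand.
y_hat = alpha + beta * x
print(round(alpha, 2), round(beta, 2))
print(np.round(y_hat, 2))
```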
Histograms
 Histograms use binning to approximate data distributions and
are a popular form of data reduction.

 A histogram for an attribute A partitions the data distribution of


A into disjoint subsets, or buckets. The buckets are displayed
on a horizontal axis, while the height (and area) of a bucket
typically reflects the average frequency of the values
represented by the bucket.

 If each bucket represents only a single attribute-


value/frequency pair, the buckets are called singleton buckets.



85
Histogram
 The following data are a list of prices of commonly sold items at All Electronics
(rounded to the nearest dollar). The numbers have been sorted.

 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18,
18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25,
25, 25, 28, 28, 30, 30, 30.



86
Techniques
 How are the buckets determined and attribute values partitioned?
 Equi Width : the width of each bucket range is uniform.

 Equi Depth(height) : the frequency of each bucket is constant.



87
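Both bucketing rules can be sketched for the sorted price list on the previous slide: equal-width buckets use fixed $10 ranges, while equal-depth buckets each hold a fixed number of values:

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]   # already sorted

# Equi-width buckets of width $10: 1-10, 11-20, 21-30
width_buckets = Counter((p - 1) // 10 for p in prices)
print(dict(width_buckets))   # {0: 13, 1: 25, 2: 14} -> 13, 25 and 14 values per range

# Equi-depth buckets: 4 buckets, each holding 52 / 4 = 13 values
depth = len(prices) // 4
depth_buckets = [prices[i * depth:(i + 1) * depth] for i in range(4)]
print([(b[0], b[-1], len(b)) for b in depth_buckets])   # (low, high, count) per bucket
```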
Clustering
 Clustering techniques consider data tuples as objects.

 They partition the objects into groups or clusters, so that objects within a
cluster are “similar" to one another and “dissimilar" to objects in other clusters.

 Similarity is commonly defined in terms of how “close" the objects are in space,
based on a distance function.

 The “quality" of a cluster may be represented by its diameter, the maximum


distance between any two objects in the cluster.

 Centroid distance is an alternative measure of cluster quality, and is defined as


the average distance of each cluster object from the cluster centroid.



88
Example
 A 2-D plot of customer data with respect to
customer locations in a city, where the centroid
of each cluster is shown with a “+". Three data
clusters are visible.



89
Hierarchical Reduction
Multi Dimensional Index Tree

 Multidimensional index trees are primarily used for


providing fast data access. They can also be used for
hierarchical data reduction, providing a multi resolution
clustering of the data.

 Hierarchical aggregation

 An index tree hierarchically divides a data set into


partitions by value range of some attributes

 Each partition can be considered as a bucket



90
Example
 It provides a hierarchy of clusterings of the data set, where each
cluster has a label that holds for the data contained in the cluster.

Suppose that the tree contains 10,000 tuples with keys ranging from 1 to 9,999.
The data in the tree can be approximated by an equi-depth histogram of 6 buckets
for the key ranges 1 to 985, 986 to 3395, 3396 to 5410, 5411 to 8392, 8393 to
9543, and 9544 to 9999.

Each bucket contains roughly 10,000/6 items. Similarly, each bucket is subdivided
into smaller buckets, allowing for aggregate data at a finer-detailed level
91
Sampling

 Sampling can be used as a data reduction technique


since it allows a large data set to be represented by a
much smaller random sample (or subset) of the data.

 Suppose that a large data set, D, contains N tuples.


Let's have a look at some possible samples for D.



92
Sampling

 Simple random sample without replacement (SRSWOR) of size n:

 This is created by drawing n of the N tuples from D (n < N),


where the probability of drawing any tuple in D is 1/N, i.e., all
tuples are equally likely.

 Simple random sample with replacement (SRSWR) of size n:

 This is similar to SRSWOR, except that each time a tuple is


drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn
again.



93
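Both schemes map directly onto Python's random module (random.sample draws without replacement, random.choices draws with replacement); D here is just a list of tuple IDs for illustration:

```python
import random

random.seed(42)                       # for a reproducible illustration
D = list(range(1, 101))               # pretend these are the N = 100 tuple IDs in D
n = 10

srswor = random.sample(D, n)          # without replacement: no tuple is drawn twice
srswr = random.choices(D, k=n)        # with replacement: a tuple may be drawn again

print(srswor)
print(srswr)
```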
Sampling

Raw Data
94
Sampling Example



95
Sampling
 Cluster sample: If the tuples in D are grouped into M mutually
disjoint “clusters", then a SRS of m clusters can be obtained, where
m < M. For example, tuples in a database are usually retrieved a
page at a time, so that each page can be considered a cluster. A
reduced data representation can be obtained by applying, say,
SRSWOR to the pages, resulting in a cluster sample of the tuples.



96
Sampling Example



97
Sampling
 Stratified sample: If D is divided into mutually disjoint parts called
“strata", a stratified sample of D is generated by obtaining a SRS
at each stratum. This helps to ensure a representative sample,
especially when the data are skewed. For example, a stratified
sample may be obtained from customer data, where a stratum is
created for each customer age group. In this way, the age group
having the smallest number of customers will be sure to be
represented.



98
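A minimal stratified-sampling sketch: hypothetical customer tuples are grouped by age group and an SRS is drawn within every stratum, so even the smallest group is represented:

```python
import random
from collections import defaultdict

random.seed(7)
# Hypothetical customers: (customer_id, age_group)
customers = [(i, random.choice(["<20", "20-39", "40-59", "60+"])) for i in range(1, 201)]

strata = defaultdict(list)
for cust in customers:
    strata[cust[1]].append(cust)      # partition D into mutually disjoint strata

fraction = 0.10                       # sample 10% within each stratum
stratified_sample = []
for group, members in strata.items():
    k = max(1, int(len(members) * fraction))   # at least one tuple per stratum
    stratified_sample.extend(random.sample(members, k))

print(len(stratified_sample), sorted({g for _, g in stratified_sample}))
```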
Sampling Example



99
Sampling Advantages

 An advantage of sampling for data reduction is that the cost of


obtaining a sample is proportional to the size of the sample, n, as
opposed to N, the data set size. Hence, sampling complexity is
potentially sub-linear to the size of the data.

 Other data reduction techniques can require at least one complete


pass through D. For a fixed sample size, sampling complexity
increases only linearly as the number of data dimensions, d,
increases



100
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data
101
Discretization
 Three types of attributes:
 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:
 divide the range of a continuous attribute into
intervals
 Some classification algorithms only accept categorical
attributes.
 Reduce data size by discretization

 Prepare for further analysis



102
Discretization and Concept Hierarchy
 Discretization
 reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
 Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).



103
Discretization and concept hierarchy
generation for numeric data

 Binning

 Histogram analysis

 Clustering analysis

 Segmentation by natural partitioning



104
Binning
 These methods are also forms of discretization.

 For example, attribute values can be discretized by


distributing the values into bins, and replacing each bin
value by the bin mean or median, as in smoothing by
bin means or smoothing by bin medians, respectively.

 These techniques can be applied recursively to the


resulting partitions in order to generate concept
hierarchies.



105
Histogram Analysis
 Histograms, as discussed can also be used for discretization.

 For instance, in an equi-width histogram, the values are


partitioned into equal-sized partitions or ranges (e.g., ($0-$100],
($100-$200], . . . , ($900-$1,000]).

 The histogram analysis algorithm can be applied recursively to


each partition in order to automatically generate a multilevel
concept hierarchy, with the procedure terminating once a pre-
specified number of concept levels has been reached.

 A minimum interval size can also be used per level to control the
recursive procedure.



106
Example
 A concept hierarchy for price



107
Cluster Analysis
 A clustering algorithm can be applied to partition data
into clusters or groups. Each cluster forms a node of a
concept hierarchy, where all nodes are at the same
conceptual level.

 Each cluster may be further decomposed into several


subclusters, forming a lower level of the hierarchy.

 Clusters may also be grouped together in order to form


a higher conceptual level of the hierarchy.



108
Segmentation by natural partitioning
3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
* If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals



109
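A sketch of the rule in Python, under the simplifying assumption that the interval endpoints are already rounded to the most significant digit (as in the example on the next slide):

```python
import math

def three_four_five(low, high):
    """Split (low, high] into 3, 4 or 5 equal-width 'natural' subintervals (3-4-5 rule)."""
    msd = 10 ** (len(str(int(max(abs(low), abs(high))))) - 1)   # place value of the most significant digit
    lo_d, hi_d = math.floor(low / msd), math.ceil(high / msd)
    distinct = hi_d - lo_d                  # distinct values covered at the most significant digit
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                   # 1, 5 or 10 distinct values
        parts = 5
    width = (hi_d - lo_d) * msd / parts
    return [(lo_d * msd + i * width, lo_d * msd + (i + 1) * width) for i in range(parts)]

print(three_four_five(-400_000, 0))           # 4 intervals, width $100,000
print(three_four_five(0, 1_000_000))          # 5 intervals, width $200,000
print(three_four_five(2_000_000, 5_000_000))  # 3 intervals, width $1,000,000
```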
Example

 The interval (-$400,000, $0] is partitioned into
 4 intervals: (-$400,000, -$300,000], (-$300,000, -$200,000], (-$200,000, -$100,000], (-$100,000, $0].
 The interval ($0, $1,000,000] is partitioned into
 5 intervals: ($0, $200,000], ($200,000, $400,000], ($400,000, $600,000], ($600,000, $800,000], ($800,000, $1,000,000].
 The interval ($1,000,000, $2,000,000] is partitioned into
 5 intervals: ($1,000,000, $1,200,000], ...
 The interval ($2,000,000, $5,000,000] is partitioned into
 3 intervals: ($2,000,000, $3,000,000], ...



110
Concept hierarchy generation for
categorical data
 Specification of a partial ordering of attributes explicitly at the schema
level by users or experts

 street, city, province or state, and country.

 A hierarchy can be defined by specifying the total ordering among


these attributes at the schema level, such as street < city < province
or state < country.

 Specification of a portion of a hierarchy by explicit data grouping

 to add some intermediate levels manually, such as defining explicitly

 {Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada,

 {British Columbia, prairies_Canada} ⊂ Western_Canada.



112
Concept hierarchy
 Specification of a set of attributes, but not of their partial ordering

 an attribute defining a high concept level will usually contain a smaller number
of distinct values than an attribute defining a lower concept level.

 Based on this observation, a concept hierarchy can be automatically generated


based on the number of distinct values per attribute in the given attribute set.

 The attribute with the most distinct values is placed at the lowest level of the
hierarchy.

 Specification of only a partial set of attributes

 If a user were to specify only the attribute city for a hierarchy defining
location, the system may automatically drag all of the above via semantically-
related attributes to form a hierarchy.



113
Specification of a set of attributes
Concept hierarchy can be automatically generated based on
the number of distinct values per attribute in the given
attribute set. The attribute with the most distinct values is
placed at the lowest level of the hierarchy.

country              15 distinct values
province_or_state    65 distinct values
city                 3,567 distinct values
street               674,339 distinct values
114
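This heuristic is easy to sketch: count the distinct values per attribute and order the hierarchy from fewest (top level) to most (bottom level), reusing the counts shown above:

```python
# (attribute, number of distinct values) as in the slide
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3_567,
    "street": 674_339,
}

# Fewest distinct values -> highest concept level; most distinct values -> lowest level.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```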
Chapter 3: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary



115
Summary

 Data preparation is a big issue for both warehousing


and mining
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 A lot of methods have been developed, but this is still an active
area of research
116

Thank you !!!
