CH 5
Following are examples of cases where the data analysis task is Classification −
› A bank loan officer wants to analyze the data in order to
know which customers (loan applicants) are risky and which
are safe.
› A marketing manager at a company needs to analyze
whether a customer with a given profile will buy a new
computer.
Following are the examples of cases where the data analysis
task is Prediction −
› Suppose the marketing manager needs to predict how
much a given customer will spend during a sale at his
company. In this example the task is to predict a
numeric value.
Classification
› predicts categorical class labels (discrete or
nominal)
› classifies data (constructs a model) based on
the training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
Prediction
› A prediction model predicts continuous-valued
functions (i.e., numeric values)
Typical applications
› Medical diagnosis
› Fraud detection
› Loan approval
› Target marketing
Model construction: describing a set of predetermined
classes
› Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
› The set of tuples used for model construction is the training set
› The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
› Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that
are correctly classified by the model
The test set is independent of the training set; otherwise
overfitting will occur
› If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Process (1): Model Construction
[Figure: the Training Data are fed to Classification Algorithms, which produce a Classifier (model).]
Process (2): Using the Model
[Figure: the Classifier is applied to Testing Data and then to Unseen Data, e.g., the tuple (Jeff, Professor, 4) with the query Tenured?]
Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
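A minimal sketch of the two processes, assuming scikit-learn and a numeric encoding of RANK (neither is prescribed by the slides):

```python
# Sketch of Process (1) and (2); library choice and the numeric
# encoding of RANK are assumptions of this example.
from sklearn.tree import DecisionTreeClassifier

# Toy training set from the tenure table above:
# RANK encoded as Assistant Prof=0, Associate Prof=1, Professor=2.
X_train = [[0, 2], [1, 7], [2, 5], [0, 7]]   # (rank, years)
y_train = ["no", "no", "yes", "yes"]         # TENURED

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # Process (1)

# Process (2): classify the unseen tuple (Jeff, Professor, 4 years).
print(clf.predict([[2, 4]]))                 # predicted TENURED label for Jeff
```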
Machine learning is building machines
that can adapt and learn from
experience without being explicitly
programmed.
Supervised learning (classification)
› Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
› New data is classified based on the training set
Unsupervised learning (clustering)
› The class labels of the training data are unknown
› Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
Data cleaning
› Preprocess data in order to reduce noise
and handle missing values
Relevance analysis (feature selection)
› Remove the irrelevant or redundant
attributes
Data transformation
› Generalize and/or normalize data
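As a small illustration of the transformation step, a sketch of min-max normalization (the function name and sample values are assumptions of this example):

```python
# Min-max normalization: rescale an attribute to a new range, here [0, 1].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max_normalize([73600, 12000, 98000]))  # toy incomes rescaled to [0, 1]
```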
Accuracy
› classifier accuracy: predicting class label
› predictor accuracy: guessing value of predicted
attributes
Speed
› time to construct the model (training time)
› time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: the ability to construct the classifier or
predictor efficiently given large amounts of data
Interpretability
› understanding and insight provided by the model
[Figure: a decision tree; each leaf node holds a class label, "No" or "Yes".]
“How are decision trees used for classification?”
› Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested
against the decision tree.
› A path is traced from the root to a leaf node, which holds
the class prediction for that tuple. Decision trees can easily
be converted to classification rules.
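A hedged sketch of this root-to-leaf tracing; the nested-dict tree below (the classic buys_computer tree) is assumed for illustration, not given on this slide:

```python
# Hypothetical decision tree for the buys_computer data, stored as nested
# dicts: internal nodes test one attribute; leaves hold the class prediction.
tree = {"attr": "age",
        "branches": {"<=30":   {"attr": "student",
                                "branches": {"no": "no", "yes": "yes"}},
                     "31...40": "yes",
                     ">40":    {"attr": "credit_rating",
                                "branches": {"fair": "yes", "excellent": "no"}}}}

def classify(node, tuple_x):
    """Trace a path from the root to a leaf; the leaf holds the prediction."""
    while isinstance(node, dict):                       # internal node
        node = node["branches"][tuple_x[node["attr"]]]  # follow the branch
    return node                                         # leaf = class label

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(tree, X))   # -> 'yes'
```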
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are
simple and fast.
Decision Tree Induction Algorithms:
› ID3
› C4.5
› CART
Basic algorithm (a greedy algorithm)
› Tree is constructed in a top-down recursive divide-and-
conquer manner
› At start, all the training examples are at the root
› Attributes are categorical (if continuous-valued, they are
discretized in advance)
› Examples are partitioned recursively based on selected
attributes
› Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
› All samples for a given node belong to the same class
› There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
› There are no samples left
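A compact sketch of this greedy loop; select_attribute is a hypothetical stand-in for the heuristic measure (e.g., information gain):

```python
# Top-down recursive divide-and-conquer induction, as described above.
from collections import Counter

def induce_tree(examples, attributes, select_attribute):
    """examples: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # stop: all samples in one class
        return labels[0]
    if not attributes:                        # stop: no attributes left ->
        return Counter(labels).most_common(1)[0][0]   # majority voting
    best = select_attribute(examples, attributes)     # heuristic choice
    node = {"attr": best, "branches": {}}
    for value in {x[best] for x, _ in examples}:      # partition recursively
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = induce_tree(subset, rest, select_attribute)
    return node
# (The "no samples left" stop applies when a branch receives an empty subset;
# here only observed values are iterated, so that case never arises.)
```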
An attribute selection measure is a heuristic for selecting the
splitting criterion that “best” separates a given data partition,
D, of class-labeled training tuples into individual classes.
Example: attribute selection by information gain on the buys_computer training data:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
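These gain values can be verified directly from the table; a sketch in pure Python (an assumption of this example):

```python
# Verify the information-gain values above from the buys_computer table.
from math import log2
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"),          ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"),      (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),          (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),         (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"),     (">40","medium","no","excellent","no"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    info_d = entropy([row[-1] for row in data])
    info_a = sum(len(part) / len(data) * entropy([r[-1] for r in part])
                 for v in set(r[col] for r in data)
                 for part in [[r for r in data if r[col] == v]])
    return info_d - info_a

for name, col in [("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(name, round(gain(col), 3))
# income 0.029, student 0.152 (the slide's 0.151 rounds intermediate
# terms), credit_rating 0.048
```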
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
› Sort the value A in increasing order
› Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
› The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
› D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
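A tiny sketch of the candidate-split enumeration (the function name and sample values are illustrative):

```python
# Candidate split points for a continuous attribute A are the midpoints of
# adjacent sorted values; the one minimizing expected info is then chosen.
def candidate_split_points(values):
    vals = sorted(set(values))                  # sort A in increasing order
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]   # (a_i + a_{i+1}) / 2

print(candidate_split_points([70, 90, 85, 75]))   # [72.5, 80.0, 87.5]
```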
Information gain measure is biased towards
attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to
overcome the problem (a normalization of information gain):
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$
› GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. $SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557$
› gain_ratio(income) = 0.029/1.557 = 0.019
The attribute with the maximum gain ratio is selected
as the splitting attribute
If a data set D contains examples from n classes, the gini index,
gini(D), is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where $p_j$ is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini
index $gini_A(D)$ is defined as
$gini_A(D) = \frac{|D_1|}{|D|}\,gini(D_1) + \frac{|D_2|}{|D|}\,gini(D_2)$
Reduction in impurity:
$\Delta gini(A) = gini(D) - gini_A(D)$
The attribute that provides the smallest $gini_A(D)$ (or, equivalently,
the largest reduction in impurity) is chosen to split the node (all
possible splitting points must be enumerated for each attribute)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 in D2: {high}:
$gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\,gini(D_1) + \frac{4}{14}\,gini(D_2) = 0.443$
This is the lowest Gini value among the binary splits on income
({low,high} gives 0.458 and {medium,high} gives 0.450), so splitting on
{low, medium} (equivalently, {high}) is best for income since it is the lowest
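A short sketch reproducing these Gini computations (pure Python; the class counts are read off the table above):

```python
# Reproduce the Gini computations above for buys_computer.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))        # gini(D) = 0.459

# Binary split on income into D1 = {low, medium} (10 tuples: 7 yes, 3 no)
# and D2 = {high} (4 tuples: 2 yes, 2 no); counts taken from the table.
g = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(g, 3))                   # gini for income in {low, medium} = 0.443
```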
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split
values
Can be modified for categorical attributes
The three measures, in general, return good results
but
› Information gain:
biased towards multivalued attributes
› Gain ratio:
tends to prefer unbalanced splits in which one
partition is much smaller than the others
› Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized
partitions and purity in both partitions
CHAID: a popular decision tree algorithm, measure based on χ2
test for independence
C-SEP: performs better than info. gain and gini index in certain
cases
G-statistics: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest
solution is preferred):
› The best tree is the one that requires the fewest number of bits
to both (1) encode the tree, and (2) encode the exceptions to
the tree
Multivariate splits (partition based on multiple variable
combinations)
› CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
› Most give good results; none is significantly superior to the others
Overfitting: An induced tree may overfit the training data
› Too many branches, some may reflect anomalies due to noise
or outliers
› Poor accuracy for unseen samples
Two approaches to avoid overfitting
› Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
› Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
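As one concrete postpruning realization (the slides are library-agnostic), a sketch using scikit-learn's cost-complexity pruning, with a held-out set deciding the "best pruned tree"; the dataset and parameters are illustrative:

```python
# Grow a full tree, then evaluate progressively pruned trees on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha corresponds to one tree in the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(path.ccp_alphas,
           key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                         .fit(X_tr, y_tr).score(X_val, y_val))
print("best alpha:", best)
# Prepruning analogue: set max_depth / min_samples_leaf thresholds up front.
```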
Allow for continuous-valued attributes
› Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
Handle missing attribute values
› Assign the most common value of the attribute
› Assign probability to each of the possible values
Attribute construction
› Create new attributes based on existing ones that
are sparsely represented
› This reduces fragmentation, repetition, and
replication
“What if D, the disk-resident training set of class-labeled
tuples, does not fit in memory? In other words, how
scalable is decision tree induction?”
In data mining applications, very large training sets of
millions of tuples are common.
Most often, the training data will not fit in memory!
Therefore, decision tree construction becomes
inefficient due to swapping of the training tuples in and
out of main and cache memories.
More scalable approaches, capable of handling
training data that are too large to fit in memory, are
required.
Scalable decision tree induction methods:
› RainForest
› BOAT
Separates the scalability aspects from the criteria that
determine the quality of the tree
Builds an AVC-list: AVC (Attribute, Value, Class_label)
AVC-set (of an attribute X )
› Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
AVC-group (of a node n )
› Set of AVC-sets of all predictor attributes at the node n
Training examples (the buys_computer table above) and the corresponding AVC-sets:

AVC-set on age:
age     buys_computer=yes  buys_computer=no
<=30    2                  3
31…40   4                  0
>40     3                  2

AVC-set on income:
income  buys_computer=yes  buys_computer=no
high    2                  2
medium  4                  2
low     3                  1
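A hedged sketch of building an AVC-set with pandas (the library choice is an assumption; any group-by/count mechanism works):

```python
# An AVC-set is just an (attribute value x class label) count table;
# pandas.crosstab computes it directly.
import pandas as pd

df = pd.DataFrame({
    "age": ["<=30"] * 2 + ["31...40"] + [">40"] * 3 + ["31...40", "<=30",
            "<=30", ">40", "<=30", "31...40", "31...40", ">40"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes", "no",
                      "yes", "yes", "yes", "yes", "yes", "no"],
})

print(pd.crosstab(df["age"], df["buys_computer"]))   # AVC-set on age
# counts: 31...40 -> 0 no / 4 yes; <=30 -> 3 no / 2 yes; >40 -> 2 no / 3 yes
```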
True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
True negatives(TN): These are the negative tuples that were
correctly labeled by the classifier. Let TN be the number of true
negatives.
False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys computer
= no for which the classifier predicted buys computer=yes). Let FP
be the number of false positives.
False negatives (FN): These are the positive tuples that were
incorrectly labeled as negative (e.g., tuples of class buys_computer
= yes for which the classifier predicted buys_computer = no). Let FN
be the number of false negatives.
Accuracy of a classifier M, acc(M): percentage of test set tuples
that are correctly classified by the model M: (TP+TN)/(P+N)
Error rate (misclassification rate) of M = 1 – acc(M)
Alternative accuracy measures (e.g., for cancer diagnosis)
Sensitivity is also referred to as the true positive (recognition)
rate (i.e., the proportion of positive tuples that are correctly
identified), while specificity is the true negative rate (i.e., the
proportion of negative tuples that are correctly identified)
› sensitivity = TP/P /* true positive recognition rate */
› specificity = TN/N /* true negative recognition rate */
› precision = TP/(TP+FP)
› Precision can be thought of as a measure of exactness (i.e., what
percentage of tuples labeled as positive are actually such),
whereas recall is a measure of completeness (what percentage
of positive tuples are labeled as such)
› recall=TP/(TP+FN)
F1 Score
The F1 score is used to measure a test's
accuracy. It is the harmonic mean of
precision and recall. The greater the F1
score, the better the performance of our
model.
$F_1 = \frac{2 \times precision \times recall}{precision + recall}$
$F_\beta = \frac{(1+\beta^2) \times precision \times recall}{\beta^2 \times precision + recall}$
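A hedged sketch computing these measures from raw counts (the function and the example counts are illustrative, not from the slides):

```python
# The evaluation measures above, computed from raw TP/TN/FP/FN counts.
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    p, n = tp + fn, tn + fp                    # actual positives / negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # = sensitivity = TP/P
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  1 - (tp + tn) / (p + n),
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision":   precision,
        "recall":      recall,
        "F1":          2 * precision * recall / (precision + recall),
        "F_beta":      (1 + beta ** 2) * precision * recall
                       / (beta ** 2 * precision + recall),
    }

print(classification_metrics(tp=90, tn=9560, fp=140, fn=210))  # made-up counts
```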
Measure predictor accuracy: measure how far off the predicted value is
from the actual known value
Loss function: measures the error between $y_i$ and the predicted value $y_i'$
› Absolute error: $|y_i - y_i'|$
› Squared error: $(y_i - y_i')^2$
Test error (generalization error): the average loss over the test set
› Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|$; Mean squared error: $\frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2$
› Relative absolute error: $\frac{\sum_{i=1}^{d}|y_i - y_i'|}{\sum_{i=1}^{d}|y_i - \bar{y}|}$; Relative squared error: $\frac{\sum_{i=1}^{d}(y_i - y_i')^2}{\sum_{i=1}^{d}(y_i - \bar{y})^2}$
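A small sketch computing the four test-error measures (the function name and toy data are assumptions of this example):

```python
# The four test-error measures above, for predictions y_hat vs. actual y.
def predictor_errors(y, y_hat):
    d = len(y)
    y_bar = sum(y) / d
    abs_err = [abs(a - p) for a, p in zip(y, y_hat)]
    sq_err  = [(a - p) ** 2 for a, p in zip(y, y_hat)]
    return {
        "MAE": sum(abs_err) / d,
        "MSE": sum(sq_err) / d,
        "relative_abs": sum(abs_err) / sum(abs(a - y_bar) for a in y),
        "relative_sq":  sum(sq_err)  / sum((a - y_bar) ** 2 for a in y),
    }

print(predictor_errors([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 3.0, 8.0]))  # toy data
```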
Linear regression model:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where $Y_i$ is the dependent (response) variable (e.g., CD+ count),
$X_i$ is the independent (explanatory) variable (e.g., years since
seroconversion), and $\varepsilon_i$ is the random error.
Population mean response: $E(Y|X) = \beta_0 + \beta_1 X_i$
Estimated (sample) regression line: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, with residual $\hat{\varepsilon}_i$.
[Figure: observed values scattered around the population line, marking one observed value, its random error $\varepsilon_i$, and an unsampled observation; a companion panel shows the estimated line with residual $\hat{\varepsilon}_i$.]
Scatterplot:
1. Plot of all $(X_i, Y_i)$ pairs
2. Suggests how well the model will fit
[Figure: scatterplot of the points, with X and Y axes running from 0 to 60.]
Thinking challenge: How would you draw a line through the
points? How do you determine which line 'fits best'?
[A sequence of figures shows candidate lines through the same scatterplot: slope changed with intercept unchanged; slope unchanged with intercept changed; and both slope and intercept changed.]
1. 'Best fit' means the differences between actual Y values
and predicted Y values are a minimum. But positive
differences offset negative ones. So square the errors!
$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
2. LS minimizes the sum of the squared
differences (errors) (SSE):
$\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
[Figure: four observations scattered around the fitted line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, with the residuals $\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_4$ drawn as vertical distances.]
Let us compare two lines through the points (1, 2), (2, 4),
(3, 1.5), and (4, 3.2); the second line is horizontal at y = 2.5.
Sum of squared differences (line 1) = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (line 2) = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99
The smaller the sum of squared differences,
the better the fit of the line to the data.
Prediction equation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Sample slope: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Sample Y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
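A minimal sketch of these least-squares formulas, reusing the four points from the line-comparison example above:

```python
# Compute the least-squares estimates from the formulas above.
def least_squares(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx              # sample slope
    b0 = y_bar - b1 * x_bar         # sample Y-intercept
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2, 4, 1.5, 3.2])
print(b0, b1)   # fitted line: y_hat = b0 + b1 * x
```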
Problem Statement
Last year, five randomly selected students took a math
aptitude test before they began their statistics course. The
Statistics Department has three questions.
What linear regression equation best predicts statistics
performance, based on math aptitude scores?
If a student made an 80 on the aptitude test, what grade
would we expect her to make in statistics?
The xi column shows scores
on the aptitude test;
similarly, the yi column shows
statistics grades.
The last two rows
show sums and mean
scores that we will use
to conduct the
regression analysis.
We need to compute the product of the deviation
scores, and we also need to compute the squares of
the deviation scores (the last two columns in the
table below).
Once we know the value of the regression slope
(b1), we can solve for the regression intercept (b0): $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
For a continuous-valued attribute, a Gaussian distribution with
mean $\mu$ and standard deviation $\sigma$ is typically assumed:
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and $P(x_k|C_i)$ is $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
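A one-function sketch of this density (the μ and σ values below are illustrative assumptions, not given on the slide):

```python
# Gaussian density used for continuous attributes in naive Bayes.
from math import sqrt, pi, exp

def g(x, mu, sigma):
    return (1 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical numbers: if ages in the 'yes' class have mu=38, sigma=12,
# then P(age=35 | buys_computer=yes) is estimated as:
print(g(35, 38, 12))
```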
Example (Naïve Bayes):
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
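A sketch completing the computation: multiplying the class-conditional probabilities (counts read off the table above) by these priors gives the naïve Bayes decision for X:

```python
# Finish the naive Bayes computation for X = (age<=30, income=medium,
# student=yes, credit_rating=fair), using counts from the table above.
p_yes, p_no = 9 / 14, 5 / 14   # the priors computed above

# Conditional probabilities P(attribute value | class), counted per class:
cond_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age<=30, medium, student, fair | yes
cond_no  = (3/5) * (2/5) * (1/5) * (2/5)   # the same attribute values | no

print(cond_yes * p_yes)   # ~0.028
print(cond_no * p_no)     # ~0.007 -> X is classified as buys_computer = 'yes'
```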