
CSI3010 Data Warehousing and Data Mining
Module-4: Classification

Dr Sunil Kumar P V

SCOPE
Classification: Overview
Introduction

• Bank loan application
• Buys a computer
• Best treatment for a patient
• In each case, the data analysis task is classification
• A model or classifier is constructed to predict
categorical labels, such as
• “safe” or “risky” for the loan application data
• “yes” or “no” for the marketing data
• “treatment A,” “treatment B,” or “treatment C” for the medical data
Classification vs Prediction

• Predict how much a given customer will spend during a sale
• The model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label
• This process is known as prediction (regression)
• The model is known as a predictor

Classification Step 1

Classification Step 2

Classification: Step 1 - Learning

• Learning step using the training set
• Training set: database tuples and their associated class labels
• A tuple X = (x1, x2, ..., xn) is an attribute vector; each training tuple is assumed to belong to a predefined class, given by its class label attribute (categorical, discrete, and unordered)
• Learning y = f(X), where y is the class label for tuple X
• Because the class label of each training tuple is provided, this step is also known as supervised learning
Classification: Step 2 - Prediction

• A test set is used, made up of test tuples and their associated class labels, which are not used to construct the classifier
• Accuracy of a classifier is the percentage of test set tuples that are correctly classified by the classifier

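To make this concrete, a minimal sketch of the accuracy computation in Python (the toy classifier and test tuples below are hypothetical, not from the slides):

```python
def accuracy(classifier, test_set):
    """Percentage of test tuples whose predicted label matches their true label."""
    correct = sum(1 for x, true_label in test_set if classifier(x) == true_label)
    return 100.0 * correct / len(test_set)

# Hypothetical usage: a toy rule-based classifier and three labelled test tuples
toy_classifier = lambda x: "yes" if x["income"] > 50000 else "no"
test_set = [({"income": 60000}, "yes"),
            ({"income": 30000}, "no"),
            ({"income": 55000}, "no")]
print(accuracy(toy_classifier, test_set))  # 2 of 3 correct -> 66.66...%
```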
Points to note

• Supervised vs unsupervised learning
• In prediction, there is no class label attribute, just the attribute to be predicted
• Prediction can be viewed as a mapping or function, y = f(X), where X is the input (e.g., a tuple describing a loan applicant), and the output y is a continuous or ordered value
• E.g., the predicted loan amount

Issues Regarding Classification
and Prediction
Data Preparation

• Prepare the data for classification and prediction to improve accuracy, efficiency, and scalability:
• Data cleaning to remove or reduce noise and handle missing values
• Relevance analysis to identify whether any two given attributes are statistically related (use correlation analysis)
• Attribute subset selection to remove irrelevant/redundant attributes
• Data transformation, e.g., normalization (see the sketch below)
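As a small illustration of the last point, a sketch of min-max normalization, one common normalization method (the income values below are made up):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a numeric attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical attribute values (e.g., income in thousands) before mining
incomes = [12, 30, 45, 98, 60]
print(min_max_normalize(incomes))  # [0.0, 0.209..., 0.383..., 1.0, 0.558...]
```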
Comparing Classification and Prediction Methods

• Accuracy
• Speed
• Robustness (performance given noisy data or
data with missing values)
• Scalability
• Interpretability (level of understanding and insight provided by the classifier/predictor)

Decision Tree Induction
Decision Tree Algorithm Basics

• Three algorithms, proposed in the 1980s and early 1990s:
• ID3 (Iterative Dichotomiser)
• C4.5 (a successor of ID3)
• Classification and Regression Trees (CART)
• The algorithm has three parameters: D, attribute_list, and Attribute_selection_method
• Refer to Figure 6.3, Page 293, Han and Kamber
2nd Ed Textbook
Decision Tree Algorithms

• D is a data partition, initially the entire training data set
• attribute_list is a list of attributes describing the tuples
• Attribute_selection_method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class
• Common choices are information gain and the Gini index (the Gini index yields binary splits only)
• We use information gain, as our trees may not always be binary (a simplified sketch of the induction procedure follows below)
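To make the three parameters concrete, here is a simplified recursive sketch in my own condensed form, not a transcription of Figure 6.3; the attribute selection method is passed in as a function, and the data representation is an assumption:

```python
from collections import Counter

def induce_tree(D, attribute_list, attribute_selection_method):
    """Simplified recursive decision-tree induction.

    D                          -- list of (attribute_dict, class_label) tuples
    attribute_list             -- attributes still available for splitting
    attribute_selection_method -- function(D, attribute_list) -> best attribute
    """
    labels = [label for _, label in D]
    if len(set(labels)) == 1:              # all tuples in the same class -> leaf
        return labels[0]
    if not attribute_list:                 # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    A = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != A]
    tree = {A: {}}
    for v in {x[A] for x, _ in D}:         # one branch per observed value of A
        Dv = [(x, label) for x, label in D if x[A] == v]
        tree[A][v] = induce_tree(Dv, remaining, attribute_selection_method)
    return tree
```

A gain-based selection method is sketched after the information-gain slides; plugging it in with a training set yields trees like the one built in the rest of this module.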
Entropy(H)

• H(V) = Σ_k P(v_k) log2 (1 / P(v_k)) = −Σ_k P(v_k) log2 P(v_k)
• Let X be a collection of training examples
• p+ is the proportion of positive examples in X
• p− is the proportion of negative examples in X
• H(X) = −p+ log2 p+ − p− log2 p−

Entropy Examples

• H(X ) = −p+ log2 p+ − p− log2 p−


• Examples (taking 0 log2 0 = 0):
• H(14+, 0−) = −14/14 log2 (14/14) − 0/14 log2 (0/14) = 0
• H(9+, 5−) = −9/14 log2 (9/14) − 5/14 log2 (5/14) = 0.94
• H(7+, 7−) = −7/14 log2 (7/14) − 7/14 log2 (7/14) = 1

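These values can be checked with a few lines of Python; a minimal sketch (the function name entropy is my own choice):

```python
import math

def entropy(pos, neg):
    """H of a collection with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is taken as 0
            p = count / total
            h -= p * math.log2(p)
    return h

print(entropy(14, 0))  # 0.0
print(entropy(9, 5))   # ~0.940
print(entropy(7, 7))   # 1.0
```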
Information Gain

• Estimates the expected reduction in entropy of the given data, caused by partitioning the examples based on an attribute
• The information gain Gain(X, A) of an attribute A, relative to a collection of examples X, is defined as

  Gain(X, A) = H(X) − Σ_{v ∈ Values(A)} (|X(v)| / |X|) H(X(v))

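A compact sketch of this definition in Python. It takes the class counts of the whole collection X and, for each value v of attribute A, the class counts of the subset X(v); the argument and function names are my own choices:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c) if total else 0.0

def info_gain(pos, neg, counts_by_value):
    """Gain(X, A) = H(X) - sum over values v of |X(v)|/|X| * H(X(v)).

    pos, neg        -- positive/negative counts in the whole collection X
    counts_by_value -- {value v of A: (positives in X(v), negatives in X(v))}
    """
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n)
                    for p, n in counts_by_value.values())
    return entropy(pos, neg) - remainder
```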
Decision Tree Demo

Entropy of Outlook

• Entropy of the dataset


• H(X ) = H(9+, 5−) =
−9/14 log2 (9/14) − 5/14 log2 (5/14) = 0.94
• To determine the entropy of Outlook
• Values of Outlook −→
Sunny , Overcast, Rainy

Entropy of Outlook

• H(Sunny) = H(2+, 3−) = −2/5 log2 (2/5) − 3/5 log2 (3/5) = 0.971
• H(Overcast) = H(4+, 0−) = −4/4 log2 (4/4) − 0/4 log2 (0/4) = 0
• H(Rainy) = H(3+, 2−) = −3/5 log2 (3/5) − 2/5 log2 (2/5) = 0.971

Information Gain of Outlook

• Gain(X, Outlook) = H(X) − Σ_{v ∈ {Sunny, Overcast, Rainy}} (|X(v)| / |X|) H(X(v))
• = H(X) − (5/14) H(Sunny) − (4/14) H(Overcast) − (5/14) H(Rainy)
• = 0.94 − (5/14) × 0.971 − (4/14) × 0 − (5/14) × 0.971 = 0.2464

Information Gain of Temperature

• Entropy of the dataset


• H(X ) = H(9+, 5−) =
−9/14 log2 (9/14) − 5/14 log2 (5/14) = 0.94
• Entropy of Temperature
• Values of Temperature −→ Hot, Mild, Cool
• H(Hot) = H(2+, 2−) =
−2/4 log2 (2/4) − 2/4 log2 (2/4) = 1
• H(Mild) = H(4+, 2−) =
−4/6 log2 (4/6) − 2/6 log2 (2/6) = 0.9183
• H(Cool) = H(3+, 1−) =
−3/4 log2 (3/4) − 1/4 log2 (1/4) = 0.8113
Information Gain of Temperature

• Gain(X, Temperature) = H(X) − Σ_{v ∈ {Hot, Mild, Cool}} (|X(v)| / |X|) H(X(v))
• = H(X) − (4/14) H(Hot) − (6/14) H(Mild) − (4/14) H(Cool)
• = 0.94 − (4/14) × 1 − (6/14) × 0.9183 − (4/14) × 0.8113 = 0.0289

Entropy of Humidity

• Entropy of the dataset


• H(X ) = H(9+, 5−) =
−9/14 log2 (9/14) − 5/14 log2 (5/14) = 0.94
• Entropy of Humidity
• Values of Humidity −→ High, Normal
• H(High) = H(3+, 4−) =
−3/7 log2 (3/7) − 4/7 log2 (4/7) = 0.9852
• H(Normal) = H(6+, 1−) =
−6/7 log2 (6/7) − 1/7 log2 (1/7) = 0.5916
Information Gain of Humidity

• Gain(X, Humidity) = H(X) − Σ_{v ∈ {High, Normal}} (|X(v)| / |X|) H(X(v))
• = H(X) − (7/14) H(High) − (7/14) H(Normal)
• = 0.94 − (7/14) × 0.9852 − (7/14) × 0.5916 = 0.1516

Entropy of Wind

• Entropy of the dataset


• H(X ) = H(9+, 5−) =
−9/14 log2 (9/14) − 5/14 log2 (5/14) = 0.94
• Entropy of Wind
• Values of Wind −→ Strong , Weak
• H(Strong ) = H(3+, 3−) =
−3/6 log2 (3/6) − 3/6 log2 (3/6) = 1
• H(Weak) = H(6+, 2−) =
−6/8 log2 (6/8) − 2/8 log2 (2/8) = 0.8113
Information Gain of Wind

• Gain(X, Wind) = H(X) − Σ_{v ∈ {Strong, Weak}} (|X(v)| / |X|) H(X(v))
• = H(X) − (6/14) H(Strong) − (8/14) H(Weak)
• = 0.94 − (6/14) × 1 − (8/14) × 0.8113 = 0.0478

Attribute with the Maximum Information Gain

• Gain(X , Outlook) = 0.2464


• Gain(X , Temperature) = 0.0289
• Gain(X , Humidity ) = 0.1516
• Gain(X , Wind) = 0.0478
• Attribute to be chosen is Outlook

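The root-node choice can be checked with the per-value (yes, no) counts tabulated on the preceding slides; a self-contained sketch (the small differences from the slide values, e.g. 0.2467 vs 0.2464, come only from the slides rounding intermediate entropies):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c) if total else 0.0

def info_gain(pos, neg, counts_by_value):
    total = pos + neg
    rem = sum((p + n) / total * entropy(p, n) for p, n in counts_by_value.values())
    return entropy(pos, neg) - rem

# (yes, no) counts per attribute value, read off the slides; 9+ / 5- overall
attributes = {
    "Outlook":     {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)},
    "Temperature": {"Hot": (2, 2), "Mild": (4, 2), "Cool": (3, 1)},
    "Humidity":    {"High": (3, 4), "Normal": (6, 1)},
    "Wind":        {"Strong": (3, 3), "Weak": (6, 2)},
}

for name, counts in attributes.items():
    print(name, round(info_gain(9, 5, counts), 4))
# Outlook 0.2467, Temperature 0.0292, Humidity 0.1518, Wind 0.0481
# -> Outlook has the maximum gain and becomes the root, as on the slide above
```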
The tree so far

Dataset corresponding to Outlook = Sunny

X (Outlook = Sunny , Temperature)

• Entropy of the dataset X(Sunny)
• H(Sunny) = H(2+, 3−) = −2/5 log2 (2/5) − 3/5 log2 (3/5) = 0.97
• Values of Temperature −→ Hot, Mild, Cool
• H(Hot) = H(0+, 2−) = 0
• H(Mild) = H(1+, 1−) = 1
• H(Cool) = H(1+, 0−) = 0
• Gain(X(Sunny), Temperature) = H(Sunny) − Σ_{v ∈ {Hot, Mild, Cool}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 2/5 × 0 − 2/5 × 1 − 1/5 × 0 = 0.570
X (Outlook = Sunny , Humidity )

• Entropy of the dataset X(Sunny)
• H(Sunny) = H(2+, 3−) = −2/5 log2 (2/5) − 3/5 log2 (3/5) = 0.97
• Values of Humidity −→ High, Normal
• H(High) = H(0+, 3−) = 0
• H(Normal) = H(2+, 0−) = 0
• Gain(X(Sunny), Humidity) = H(Sunny) − Σ_{v ∈ {High, Normal}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 3/5 × 0 − 2/5 × 0 = 0.97

X (Outlook = Sunny , Wind)

• Entropy of the dataset X(Sunny)
• H(Sunny) = H(2+, 3−) = −2/5 log2 (2/5) − 3/5 log2 (3/5) = 0.97
• Values of Wind −→ Strong, Weak
• H(Strong) = H(1+, 1−) = 1
• H(Weak) = H(1+, 2−) = 0.9183
• Gain(X(Sunny), Wind) = H(Sunny) − Σ_{v ∈ {Strong, Weak}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 2/5 × 1 − 3/5 × 0.9183 = 0.0192

Attribute with the Maximum Information Gain

• Gain(X (Sunny ), Temperature) = 0.570


• Gain(X (Sunny ), Humidity ) = 0.97
• Gain(X (Sunny ), Wind) = 0.0192
• Attribute to be chosen is Humidity

The tree so far

Dataset corresponding to Outlook = Rain

X (Outlook = Rain, Temperature)

• Entropy of the dataset X(Rain)
• H(Rain) = H(3+, 2−) = −3/5 log2 (3/5) − 2/5 log2 (2/5) = 0.97
• Values of Temperature −→ Hot, Mild, Cool
• H(Hot) = H(0+, 0−) = 0
• H(Mild) = H(2+, 1−) = 0.9183
• H(Cool) = H(1+, 1−) = 1
• Gain(X(Rain), Temperature) = H(Rain) − Σ_{v ∈ {Hot, Mild, Cool}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 0/5 × 0 − 3/5 × 0.9183 − 2/5 × 1 = 0.0192
X (Outlook = Rain, Humidity )

• Entropy of the dataset X(Rain)
• H(Rain) = H(3+, 2−) = −3/5 log2 (3/5) − 2/5 log2 (2/5) = 0.97
• Values of Humidity −→ High, Normal
• H(High) = H(1+, 1−) = 1
• H(Normal) = H(2+, 1−) = 0.9183
• Gain(X(Rain), Humidity) = H(Rain) − Σ_{v ∈ {High, Normal}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 2/5 × 1 − 3/5 × 0.9183 = 0.0192

X (Outlook = Rain, Wind)

• Entropy of the dataset X(Rain)
• H(Rain) = H(3+, 2−) = −3/5 log2 (3/5) − 2/5 log2 (2/5) = 0.97
• Values of Wind −→ Strong, Weak
• H(Strong) = H(0+, 2−) = 0
• H(Weak) = H(3+, 0−) = 0
• Gain(X(Rain), Wind) = H(Rain) − Σ_{v ∈ {Strong, Weak}} (|X(v)| / |X|) H(X(v))
• = 0.97 − 2/5 × 0 − 3/5 × 0 = 0.97

Attribute with the Maximum Information Gain

• Gain(X (Rain), Temperature) = 0.0192


• Gain(X (Rain), Humidity ) = 0.0192
• Gain(X (Rain), Wind) = 0.97
• Attribute to be chosen is Wind

The Final Decision Tree

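The tree that results from the computations above (Outlook at the root; the Sunny branch split on Humidity, the Overcast branch a pure "Yes" leaf, and the Rainy branch split on Wind) can be written down directly; a sketch using a nested dict and a small classify helper (this representation and the sample tuple are my own):

```python
# Internal nodes map an attribute name to {value: subtree}; leaves are class labels.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy":    {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, x):
    """Follow the tuple's attribute values down the tree until a leaf label is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))        # attribute tested at this node
        node = node[attribute][x[attribute]]
    return node

# Hypothetical test tuple
x = {"Outlook": "Rainy", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, x))  # -> "Yes"
```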
