
Classification

UNIT-III
• There are two forms of data analysis that can be used to extract
models describing important classes or to predict future data
trends. These two forms are as follows −
– Classification
– Prediction
• Classification models predict categorical class labels.
– For example, we can build a classification model to categorize bank loan
applications as either safe or risky.
• Prediction models predict continuous-valued functions.
– Example: a prediction model to predict the expenditures in dollars of potential
customers on computer equipment, given their income and occupation.
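The distinction can be illustrated with a short, hypothetical scikit-learn sketch (the library and the toy data are assumptions, not part of these notes): a classifier returns a categorical label, while a numeric predictor (regressor) returns a continuous value.

```python
# Minimal sketch (assumed toy data): classification vs. numeric prediction.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Loan applicants described by [income_in_thousands, years_employed]
X = [[25, 1], [40, 3], [60, 8], [90, 12], [30, 2], [75, 10]]

# Classification: predict a categorical class label ("risky" or "safe").
y_class = ["risky", "risky", "safe", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[55, 6]]))       # a class label, e.g. ['safe']

# Prediction (regression): predict a continuous value (expected spend in dollars).
y_spend = [300, 550, 1200, 2500, 400, 1800]
reg = DecisionTreeRegressor().fit(X, y_spend)
print(reg.predict([[55, 6]]))       # a dollar amount, not a class label
```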
What is classification?
The following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know
which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer
with a given profile and decide whether that customer will buy a new computer.
• In both of the above examples,
a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for the loan application data and
yes or no for the marketing data.
What is prediction?
• The following is an example of a case where the data analysis task
is prediction −
• Suppose the marketing manager needs to predict how much a
given customer will spend during a sale at his company. In this
example we are asked to predict a numeric value, so the data
analysis task is an example of numeric prediction. In this case,
a model or a predictor is constructed that predicts a
continuous-valued function, or ordered value.
• Note − Regression analysis is a statistical methodology that is
most often used for numeric prediction.
How Does Classification Work?
• With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The Data
Classification process includes two steps −
– Building the Classifier or Model
– Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database
tuples and their associated class labels.
• Each tuple that constitutes the training set is assumed to belong
to a predefined class, as determined by its class label. These tuples
can also be referred to as samples, objects or data points.
Using Classifier for Classification
• In this step, the classifier is used for classification. Here the test
data is used to estimate the accuracy of classification rules. The
classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
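As a concrete sketch of these two steps, the fragment below builds a classifier from labeled training tuples, estimates its accuracy on held-out test data, and then applies it to a new tuple; the toy loan data and the use of scikit-learn are assumptions made only for illustration.

```python
# Sketch of the two-step classification process (assumed toy loan data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 1], [40, 3], [60, 8], [90, 12], [30, 2], [75, 10], [50, 5], [35, 2]]
y = ["risky", "risky", "safe", "safe", "risky", "safe", "safe", "risky"]

# Step 1: build the classifier (learning phase) from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use test data to estimate accuracy; if acceptable, classify new tuples.
print("estimated accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new applicant:", model.predict([[45, 4]]))
```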
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
• Relevance Analysis − The database may also have irrelevant attributes.
Correlation analysis is used to determine whether any two given attributes are
related.
• Data Transformation and Reduction − The data can be transformed by any of the
following methods.
– Normalization − The data is transformed using normalization. Normalization involves
scaling all values of a given attribute so that they fall within a small specified
range. Normalization is used when neural networks or methods involving distance
measurements are used in the learning step (see the sketch after this list).
– Generalization − The data can also be transformed by generalizing it to a higher-level
concept. For this purpose we can use concept hierarchies.
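The following is a minimal sketch of min-max normalization (the sample attribute values and the target range [0, 1] are assumed for illustration):

```python
# Min-max normalization: rescale an attribute into a small specified range.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # e.g. taxable income in K
print(min_max_normalize(incomes))   # all values now fall within [0, 1]
```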
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing the methods of classification and
prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict
the class label correctly; accuracy of a predictor refers to how well
a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and
using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to
make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the
classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the level of understanding and insight
that the classifier or predictor provides.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Tree Induction
• Decision tree induction is a supervised learning method used in data
mining for classification and regression tasks.
• The decision tree represents the classification or regression model
as a tree structure. It separates a data set into smaller and smaller
subsets while the tree is incrementally developed.
• The final tree is a tree with decision nodes and leaf nodes.
• A decision node has at least two branches; the leaf nodes
show a classification or decision.
• The uppermost decision node in a tree, which corresponds to the
best predictor, is called the root node.
• Decision trees can deal with both categorical and numerical data.
• During tree construction, attribute selection measures are
used to select the attribute that best partitions the tuples into distinct
classes.
Some Characteristics
Decision Tree and Classification Task
Definition of Decision Tree

Definition 9.1: Decision Tree


Building Decision Tree
Illustration : BuildDT Algorithm

Training data (15 persons):

  Person   Gender   Height   Class
  1        F        1.6      S
  2        M        2.0      M
  3        F        1.9      M
  4        F        1.88     M
  5        F        1.7      S
  6        M        1.85     M
  7        F        1.6      S
  8        M        1.7      S
  9        M        2.2      T
  10       M        2.1      T
  11       F        1.8      M
  12       M        1.95     M
  13       F        1.9      M
  14       F        1.8      M
  15       F        1.75     S

Attributes:
  Gender = {Male (M), Female (F)}    // binary attribute
  Height = {1.5, …, 2.5}             // continuous attribute
  Class  = {Short (S), Medium (M), Tall (T)}

Given a person, we are to test to which class s/he belongs.
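As a rough sketch of what a BuildDT run could produce on this data, the fragment below fits a decision tree with scikit-learn (the library choice and the entropy criterion are assumptions; the notes do not prescribe either) after encoding Gender numerically, and prints the learned rules.

```python
# Sketch: fit a decision tree to the Gender/Height training data above.
from sklearn.tree import DecisionTreeClassifier, export_text

gender = {"F": 0, "M": 1}
rows = [("F", 1.6, "S"), ("M", 2.0, "M"), ("F", 1.9, "M"), ("F", 1.88, "M"),
        ("F", 1.7, "S"), ("M", 1.85, "M"), ("F", 1.6, "S"), ("M", 1.7, "S"),
        ("M", 2.2, "T"), ("M", 2.1, "T"), ("F", 1.8, "M"), ("M", 1.95, "M"),
        ("F", 1.9, "M"), ("F", 1.8, "M"), ("F", 1.75, "S")]
X = [[gender[g], h] for g, h, _ in rows]
y = [c for _, _, c in rows]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Gender", "Height"]))
print(tree.predict([[gender["F"], 1.78]]))   # classify a new person
```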
Benefits:
• It does not require any domain knowledge.
• The learning and classification steps of a decision
tree are simple and fast.
• Decision Tree is used to build classification and
regression models. It is used to create data models
that will predict class labels or values for the
decision-making process.
• The tree can easily be converted to classification rules.
• We can visualize the decisions, which makes the model easy to
understand; this is why decision trees are a popular data mining
technique.
Weakness
• Not suitable for predicting continuous attributes.
• Performs poorly with many classes and a small amount of data.
• Computationally expensive to train.
– At each node, each candidate splitting field must be sorted before its
best split can be found.
– In some algorithms, combinations of fields are used and a search must
be made for optimal combining weights.
– Pruning algorithms can also be expensive, since many candidate sub-
trees must be formed and compared.
• Does not handle non-rectangular regions well.
concept buy_computer
Example of a Decision Tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label):

  Tid   Refund   Marital Status   Taxable Income   Cheat
  1     Yes      Single           125K             No
  2     No       Married          100K             No
  3     No       Single           70K              No
  4     Yes      Married          120K             No
  5     No       Divorced         95K              Yes
  6     No       Married          60K              No
  7     Yes      Divorced         220K             No
  8     No       Single           85K              Yes
  9     No       Married          75K              No
  10    No       Single           90K              Yes

Model (decision tree) learned from the training data, with Refund as the first splitting attribute:

  Refund?
    Yes              -> NO
    No               -> MarSt?
                          Married           -> NO
                          Single, Divorced  -> TaxInc?
                                                 < 80K  -> NO
                                                 > 80K  -> YES
Another Example of Decision Tree
Using the same training data as above, a different tree starts with MarSt as the splitting attribute:

  MarSt?
    Married           -> NO
    Single, Divorced  -> Refund?
                           Yes  -> NO
                           No   -> TaxInc?
                                     < 80K  -> NO
                                     > 80K  -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Induction: a tree induction algorithm learns a decision tree model from the training set.

  Training Set:
  Tid   Attrib1   Attrib2   Attrib3   Class
  1     Yes       Large     125K      No
  2     No        Medium    100K      No
  3     No        Small     70K       No
  4     Yes       Medium    120K      No
  5     No        Large     95K       Yes
  6     No        Medium    60K       No
  7     Yes       Large     220K      No
  8     No        Small     85K       Yes
  9     No        Medium    75K       No
  10    No        Small     90K       Yes

Deduction: the learned model is then applied to the test set to assign the missing class labels.

  Test Set:
  Tid   Attrib1   Attrib2   Attrib3   Class
  11    No        Small     55K       ?
  12    Yes       Medium    80K       ?
  13    Yes       Large     110K      ?
  14    No        Small     95K       ?
  15    No        Large     67K       ?
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and, at each node, follow the branch that matches the test record:
1. Refund? The record has Refund = No, so follow the No branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so follow the Married branch, which leads to a leaf labeled NO.
3. Assign Cheat = "No" to the test record. (The TaxInc test on < 80K / > 80K is never reached for this record.)
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser)
in 1980.
• Later, he presented C4.5, the successor of ID3.
• ID3 and C4.5 adopt a greedy approach.
• In these algorithms there is no backtracking;
• the trees are constructed in a top-down, recursive,
divide-and-conquer manner.
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.

  CarType?  ->  Family | Sports | Luxury

• Binary split: divides the values into two subsets; we need to find the
optimal partitioning (see the sketch below), e.g.

  CarType in {Sports, Luxury} vs. {Family}     OR     CarType in {Family, Luxury} vs. {Sports}
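As a small illustration (not taken from the notes), the snippet below enumerates the 2^(k−1) − 1 candidate binary partitions of a nominal attribute with k distinct values:

```python
# Enumerate candidate binary splits of a nominal attribute (2**(k-1) - 1 of them).
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    first, rest = values[0], values[1:]
    parts = []
    # Fixing one element in the left subset avoids listing each partition twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:
                parts.append((left, right))
    return parts

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)    # 3 candidate partitions for k = 3
```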
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values.

  Size?  ->  Small | Medium | Large

• Binary split: divides the values into two subsets; we need to find the
optimal partitioning, e.g.

  Size in {Small, Medium} vs. {Large}     OR     Size in {Small} vs. {Medium, Large}

• What about the split Size in {Small, Large} vs. {Medium}? It groups
non-adjacent values and therefore violates the order property of the attribute.
Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing,
    equal-frequency bucketing (percentiles), or clustering
    (see the sketch below).
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive
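A minimal sketch of static discretization, contrasting equal-width and equal-frequency bucketing (the number of buckets and the input values, taken from the taxable income example, are assumptions):

```python
# Discretize a continuous attribute into ordinal buckets (3 buckets assumed).
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # bucket index 0..k-1 for each value, by position in the value range
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)   # by rank (percentiles)
    return bins

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(equal_width_bins(income, 3))       # buckets defined by value ranges
print(equal_frequency_bins(income, 3))   # buckets of roughly equal size
```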
Splitting Based on Continuous Attributes

  (i) Binary split:      Taxable Income > 80K?  ->  Yes | No
  (ii) Multi-way split:  Taxable Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K

Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
The basic decision tree induction algorithm (illustrated with the concept buy_computer) proceeds as follows:
• The tree starts as a single node, N, representing the training tuples
in D (step 1).
• If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3).
• Steps 4 and 5 are terminating conditions.
• The algorithm calls Attribute_selection_method to determine the
splitting criterion. The splitting criterion tells us which attribute to
test at node N by determining the "best" way to separate or
partition the tuples in D into individual classes (step 6).
• The splitting criterion also tells us which branches to grow from
node N with respect to the outcomes of the chosen test.
• The splitting criterion indicates the splitting attribute and may also
indicate either a split-point or a splitting subset. The splitting
criterion is determined so that, ideally, the resulting partitions at
each branch are as "pure" as possible.
• A partition is pure if all of the tuples in it belong to the same class.
• The node N is labeled with the splitting criterion,
which serves as a test at the node (step 7).
• A branch is grown from node N for each of the
outcomes of the splitting criterion. The tuples
in D are partitioned accordingly (steps 10 to 11).
• There are three possible scenarios. Let A be the
splitting attribute.
1. A is discrete-valued:
• In this case, the outcomes of the test at node N correspond directly to the known
values of A in the training set.
• A branch is created for each value aj of the attribute A and labeled with that value.
• There are as many branches as there are values of A in the training data.
2. A is continuous-valued:
• In this case, the test at node N has two possible outcomes, corresponding to the
conditions A <= split_point and A > split_point.
• The split_point is the split-point returned by Attribute_selection_method.
• In practice, the split-point is often taken as the midpoint of two known adjacent
values of A, and therefore may not be a preexisting value of A in the
training data.
• Two branches are grown from N and labeled A <= split_point and A > split_point.
• The tuples at node N are partitioned into two sub-tables, D1 and D2:
• D1 holds the subset of class-labeled tuples in D for which A <= split_point;
D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced:
• The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A.
• SA is returned by Attribute_selection_method as part of the splitting
criterion and is a subset of the known values of A.
• If a given tuple has value aj of A and aj belongs to SA, then the test at node N is
satisfied.
• Two branches are grown from N.
• The left branch out of N is labeled yes, so that D1 corresponds to the
subset of class-labeled tuples in D that satisfy the test.
• The right branch out of N is labeled no, so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.
• The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj of D (step 14).
TERMINATING CONDITIONS
• The recursive partitioning stops only when one of the following
terminating conditions is true:
1. All of the tuples in partition D (represented at node N)
belong to the same class (steps 2 and 3), or
2. There are no remaining attributes on which the tuples may be
further partitioned (step 4). In this case, majority voting is
employed (step 5): node N is converted into a leaf and
labeled with the most common class in D, or
3. There are no tuples for a given branch, i.e., a partition Dj is
empty (step 12). In this case, a leaf is created with the majority class
in D (step 13).
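The fragment below is a minimal, self-contained sketch of this recursive procedure. It is a simplified version of the algorithm described above, not the textbook pseudocode: it handles only discrete-valued attributes, takes the attribute-selection measure as a plug-in function (the one shown is a deliberately naive, hypothetical choice), and applies majority voting at the terminating conditions.

```python
# Simplified recursive decision tree induction for discrete-valued attributes.
from collections import Counter

def majority_class(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def generate_tree(rows, attributes, target, select_attribute):
    # Terminating condition 1: all tuples belong to the same class -> leaf.
    classes = {r[target] for r in rows}
    if len(classes) == 1:
        return classes.pop()
    # Terminating condition 2: no remaining attributes -> majority voting.
    if not attributes:
        return majority_class(rows, target)
    # Step 6: choose the "best" splitting attribute via the plug-in measure.
    best = select_attribute(rows, attributes, target)
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        # Terminating condition 3 (empty partition) cannot occur here, because
        # each branch value is taken from the tuples present at this node.
        node[best][value] = generate_tree(subset, remaining, target, select_attribute)
    return node

# Hypothetical selection strategy: pick the attribute with the most distinct values.
def most_values(rows, attributes, target):
    return max(attributes, key=lambda a: len({r[a] for r in rows}))

data = [{"Refund": "Yes", "MarSt": "Single",   "Cheat": "No"},
        {"Refund": "No",  "MarSt": "Married",  "Cheat": "No"},
        {"Refund": "No",  "MarSt": "Divorced", "Cheat": "Yes"},
        {"Refund": "No",  "MarSt": "Single",   "Cheat": "Yes"}]
print(generate_tree(data, ["Refund", "MarSt"], "Cheat", most_values))
```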
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:

  Own Car?      Yes:    C0: 6, C1: 4       No:     C0: 4, C1: 6
  Car Type?     Family: C0: 1, C1: 3       Sports: C0: 8, C1: 0       Luxury: C0: 1, C1: 7
  Student ID?   c1 .. c10: each C0: 1, C1: 0       c11 .. c20: each C0: 0, C1: 1

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are
preferred
• Need a measure of node impurity:
  C0: 5, C1: 5   ->  non-homogeneous, high degree of impurity
  C0: 9, C1: 1   ->  homogeneous, low degree of impurity
Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error
How to Find the Best Split
• Before splitting, compute the impurity M0 of the parent node from its class counts (N00 for C0, N01 for C1).
• For candidate attribute A, which splits the node into children N1 and N2, compute their impurities M1 and M2 and combine them into the weighted impurity M12 of the split.
• For candidate attribute B, which splits the node into children N3 and N4, compute M3 and M4 and combine them into M34.
• Compare Gain = M0 − M12 with Gain = M0 − M34 and choose the split with the higher gain.
Measure of Impurity: GINI
• Gini index for a given node t:

    GINI(t) = 1 − Σj [ p(j | t) ]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/nc) when records are equally distributed among all
classes, implying least interesting information.
– Minimum (0.0) when all records belong to one class, implying most
interesting information.

  C1: 0          C1: 1          C1: 2          C1: 3
  C2: 6          C2: 5          C2: 4          C2: 3
  Gini = 0.000   Gini = 0.278   Gini = 0.444   Gini = 0.500
Examples for computing GINI
    GINI(t) = 1 − Σj [ p(j | t) ]²

  C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

  C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                  Gini = 1 − (1/6)² − (5/6)² = 0.278

  C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                  Gini = 1 − (2/6)² − (4/6)² = 0.444
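A short sketch that reproduces these values (plain Python written for these notes, not taken from them):

```python
# Gini index of a node from its class counts: GINI(t) = 1 - sum_j p(j|t)^2
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```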
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the
split is computed as

    GINIsplit = Σ(i = 1..k) (ni / n) GINI(i)

  where ni = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
• Splits into two partitions.
• Effect of weighting partitions: larger and purer partitions are sought.

  Parent: C1 = 6, C2 = 6, Gini = 0.500

  Split on B?    Node N1 (Yes): C1 = 5, C2 = 2      Node N2 (No): C1 = 1, C2 = 4

  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
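A sketch that computes the weighted Gini of this binary split from the class counts in the example (the helper functions are written for these notes):

```python
# Weighted Gini of a split: GINI_split = sum_i (n_i / n) * GINI(i)
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini([6, 6]), 3))                   # parent: 0.5
print(round(gini_split([[5, 2], [1, 4]]), 3))   # children N1, N2: 0.371
```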
Categorical Attributes: Computing Gini Index
• For each distinct value, gather the counts for each class in the dataset.
• Use the count matrix to make decisions.

  Multi-way split:                    CarType:   Family   Sports   Luxury
                                      C1:        1        2        1
                                      C2:        4        1        1
                                      Gini = 0.393

  Two-way split                       CarType:   {Sports, Luxury}   {Family}
  (find best partition of values):    C1:        3                  1
                                      C2:        2                  4
                                      Gini = 0.400

                                      CarType:   {Family, Luxury}   {Sports}
                                      C1:        2                  2
                                      C2:        5                  1
                                      Gini = 0.419
Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value.
• Several choices for the splitting value:
– the number of possible splitting values = the number of distinct values.
• Each splitting value v has a count matrix associated with it:
– class counts in each of the two partitions, A < v and A ≥ v.
• Simple method to choose the best v:
– for each v, scan the database to gather the count matrix and compute its Gini index;
– computationally inefficient! Repetition of work.

  Example test: Taxable Income > 80K?  ->  Yes | No
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
– sort the attribute on its values,
– linearly scan these values, each time updating the count matrix and
computing the Gini index,
– choose the split position that has the least Gini index.

  Sorted values (Taxable Income) with class labels and candidate split positions:

  Cheat             No     No     No     Yes    Yes    Yes    No     No     No     No
  Taxable Income    60     70     75     85     90     95     100    120    125    220

  Split position    55     65     72     80     87     92     97     110    122    172    230
  Yes  (<=, >)      0, 3   0, 3   0, 3   0, 3   1, 2   2, 1   3, 0   3, 0   3, 0   3, 0   3, 0
  No   (<=, >)      0, 7   1, 6   2, 5   3, 4   3, 4   3, 4   3, 4   4, 3   5, 2   6, 1   7, 0
  Gini              0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

  The best split position is 97, with Gini = 0.300.
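A sketch of this scan, sorting once and then updating the class counts incrementally as each candidate split position is passed (the records are the ten training tuples of the running example):

```python
# Find the best binary split on a continuous attribute with one sorted scan.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    total = {"Yes": labels.count("Yes"), "No": labels.count("No")}
    left = {"Yes": 0, "No": 0}
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1                       # move one record to the left side
        split = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint candidate
        right = {c: total[c] - left[c] for c in total}
        n_left, n_right = i + 1, len(pairs) - i - 1
        g = (n_left * gini(list(left.values())) +
             n_right * gini(list(right.values()))) / len(pairs)
        if g < best[1]:
            best = (split, g)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # splitting near 97 gives Gini = 0.300
```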
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:

    Entropy(t) = − Σj p(j | t) log2 p(j | t)

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
  • Maximum (log2 nc) when records are equally distributed
    among all classes, implying least information.
  • Minimum (0.0) when all records belong to one class,
    implying most information.
– Entropy-based computations are similar to the GINI
index computations.
Examples for computing Entropy
    Entropy(t) = − Σj p(j | t) log2 p(j | t)

  C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Entropy = − 0 log2 0 − 1 log2 1 = − 0 − 0 = 0

  C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                  Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

  C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                  Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
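A sketch reproducing these entropy values (plain Python; the convention 0 · log 0 = 0 is handled explicitly):

```python
# Entropy of a node from its class counts: Entropy(t) = -sum_j p(j|t) log2 p(j|t)
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)  # treat 0*log(0) as 0

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92
```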
Splitting Based on INFO...
• Information Gain:

    GAINsplit = Entropy(p) − Σ(i = 1..k) (ni / n) Entropy(i)

  Parent node p is split into k partitions; ni is the number of records in partition i.

– Measures the reduction in entropy achieved because of the split. Choose
the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of
partitions, each being small but pure.
Splitting Based on INFO...
• Gain Ratio:

    GainRATIOsplit = GAINsplit / SplitINFO,   where   SplitINFO = − Σ(i = 1..k) (ni / n) log2 (ni / n)

  Parent node p is split into k partitions; ni is the number of records in partition i.

– Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
Higher-entropy partitioning (a large number of small partitions) is
penalized!
– Used in C4.5.
– Designed to overcome the disadvantage of Information Gain.
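A sketch of both measures built on an entropy helper (redefined here so the fragment stands alone); the example split is the Car Type? test from the earlier 'best split' slide, with its class counts:

```python
# Information gain and gain ratio of a candidate split, from class-count partitions.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    n = sum(parent_counts)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - children

def gain_ratio(parent_counts, partitions):
    n = sum(parent_counts)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return info_gain(parent_counts, partitions) / split_info

parent = [10, 10]                       # C0, C1 before splitting
car_type = [[1, 3], [8, 0], [1, 7]]     # Family, Sports, Luxury partitions
print(round(info_gain(parent, car_type), 3))
print(round(gain_ratio(parent, car_type), 3))
```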
Splitting Criteria based on Classification Error
• Classification error at a node t:

    Error(t) = 1 − max_i P(i | t)

• Measures the misclassification error made by a node.
  • Maximum (1 − 1/nc) when records are equally distributed among all
    classes, implying least interesting information.
  • Minimum (0.0) when all records belong to one class, implying most
    interesting information.
Examples for Computing Error
    Error(t) = 1 − max_i P(i | t)

  C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Error = 1 − max(0, 1) = 1 − 1 = 0

  C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                  Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

  C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                  Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison among Splitting Criteria
For a 2-class problem: Misclassification Error vs Gini

  Parent: C1 = 7, C2 = 3, Gini = 0.42

  Split on A?    Node N1 (Yes): C1 = 3, C2 = 0      Node N2 (No): C1 = 4, C2 = 3

  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

  The Gini index improves (0.42 -> 0.342), but the misclassification error stays
  at 0.3 for both the parent and the children, so error alone would not favor this split.
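A sketch that checks this comparison numerically, using the class counts from the example above:

```python
# Compare Gini vs. misclassification error before and after the A? split.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

def weighted(measure, partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * measure(p) for p in partitions)

parent, children = [7, 3], [[3, 0], [4, 3]]
print(round(gini(parent), 3), "->", round(weighted(gini, children), 3))    # 0.42 -> 0.343 (slide rounds to 0.342)
print(round(error(parent), 3), "->", round(weighted(error, children), 3))  # 0.3 -> 0.3 (no improvement)
```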
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records
belong to the same class

• Stop expanding a node when all the records


have similar attribute values

• Early termination (to be discussed later)


Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
