
Classification and Regression Trees (CART - III) : DR A. Ramesh


Classification and Regression Trees (CART – III)

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

Python demo for CART model -


• Visualizing Decision Tree
• Interpretation of CART model

2
Example

Problem Description-

Han, J., Pei, J. and Kamber, M., 2011. Data mining: concepts and
techniques. Elsevier.

3
Import Relevant Libraries and Loading Data File
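The code on this slide did not survive extraction. A minimal sketch of the loading step, assuming the data is the 14-tuple buys_computer table from the cited Han et al. text (its class counts match the later slides). The rows are typed in because the real file name is not shown, and the column names are assumptions.

```python
import pandas as pd

# Assumption: column names and inline rows stand in for the unnamed data file.
COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

df = pd.DataFrame(ROWS, columns=COLUMNS)
print(df.head())
print(df["buys_computer"].value_counts())
```

With a real file, `pd.read_csv(...)` would replace the inline rows.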

4
Methods used in Data Encoding

• LabelEncoder(): This class is used to normalize labels. It can also be used to transform non-numerical labels to numerical labels.
• fit_transform(): This method fits the label encoder and returns the encoded labels.
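The two calls above can be sketched on a single column. The sample values are illustrative only and assume lowercase labels. LabelEncoder numbers the sorted unique labels, which is why high, low and medium become 0, 1 and 2.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative income values (assumed lowercase, as in the Han et al. table).
income = ["high", "high", "medium", "low", "low", "medium"]

encoder = LabelEncoder()
codes = encoder.fit_transform(income)  # fit the encoder, then return the codes

print(list(encoder.classes_))  # sorted unique labels; position = code
print(codes.tolist())
```

Because sorting is case-sensitive, mixed-case labels such as "Youth" and "senior" would be numbered differently from their lowercase forms.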

5
Data Encoding Procedure

6
Data Encoding

7
Structuring Dataframe

• drop(): This method removes rows or columns, either by specifying label names and the corresponding axis, or by specifying index or column names directly.
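A short sketch of both uses of drop(). The tiny frame and the column name `age_n` are hypothetical, chosen only to show an original text column next to its encoded copy.

```python
import pandas as pd

# Hypothetical frame: a text column, its encoded copy, and the class.
df = pd.DataFrame({
    "age": ["youth", "senior"],
    "age_n": [2, 1],
    "buys_computer": ["no", "yes"],
})

# Remove a column by label; axis=1 selects columns (columns="age" also works).
encoded_only = df.drop("age", axis=1)

# Remove a row by index label; axis=0 is the default.
without_first = df.drop(0)

print(encoded_only)
print(without_first)
```

drop() returns a new DataFrame and leaves `df` unchanged unless `inplace=True` is passed.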

8
Independent and Dependent Variables Selection

9
Build the Decision Tree Model without Splitting
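The code for this slide is missing. One way to fit the tree on all 14 tuples might look like the sketch below. It assumes the Han et al. table and the integer codes from the coding scheme slide. The variable names and `random_state` are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

# Integer codes as listed on the coding scheme slide.
AGE = {"middle_aged": 0, "senior": 1, "youth": 2}
INCOME = {"high": 0, "low": 1, "medium": 2}
YES_NO = {"no": 0, "yes": 1}
CREDIT = {"excellent": 0, "fair": 1}

ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

# Independent variables X (four coded attributes) and dependent variable y.
X = [[AGE[a], INCOME[i], YES_NO[s], CREDIT[c]] for a, i, s, c, _ in ROWS]
y = [YES_NO[label] for *_, label in ROWS]

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

print("root feature index:", clf.tree_.feature[0])
print("root threshold:", clf.tree_.threshold[0])
print("root impurity:", round(clf.tree_.impurity[0], 3))
print("training accuracy:", clf.score(X, y))
```

scikit-learn splits on numeric thresholds of the codes rather than on subsets of categories, so a root test of age <= 0.5 corresponds to separating middle_aged (code 0) from youth and senior.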

10
Visualizing Decision Tree
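The plotting code is not shown on the slide. As a dependency-light sketch, `export_text` prints the fitted tree as indented rules. `sklearn.tree.plot_tree` would draw the same tree graphically if matplotlib is available. The data and codes below repeat the assumptions of the Han et al. table and the coding scheme slide.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

AGE = {"middle_aged": 0, "senior": 1, "youth": 2}
INCOME = {"high": 0, "low": 1, "medium": 2}
YES_NO = {"no": 0, "yes": 1}
CREDIT = {"excellent": 0, "fair": 1}

ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
X = [[AGE[a], INCOME[i], YES_NO[s], CREDIT[c]] for a, i, s, c, _ in ROWS]
y = [YES_NO[label] for *_, label in ROWS]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Text rendering of the tree; feature names are assumed column names.
report = export_text(clf, feature_names=["age", "income", "student", "credit_rating"])
print(report)
```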

11
Decision Tree Visualization

12
Interpretation of the CART Output

13
Calculation of Gini(D)

• We first use the following Equation for Gini index to compute the impurity
of D:
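The equation itself is missing from the extracted slide. Assuming the standard definition from the cited text, with p_i the fraction of tuples in class i and m classes, and using the 9 "yes" and 5 "no" tuples implied by the partition counts (7 + 2 and 3 + 2), it would read:

```latex
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2

\mathrm{Gini}(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
```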

14
Income Attribute

• Low, Medium, High
• Option 1: {Low, Medium}, {High}
• Option 2: {High, Medium}, {Low}
• Option 3: {High, Low}, {Medium}

15
Tuples in partition D1

• Low + Medium:

Class: buys computer    Low + Medium
Yes                     3 + 4 = 7
No                      1 + 2 = 3

16
Tuples in partition D2

• High:

Class: buys computer    High
Yes                     2
No                      2

17
Gini index for income attribute

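• The Gini index value computed based on this partitioning is

$$\mathrm{Gini}_{income \in \{low, medium\}}(D) = \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.443 = \mathrm{Gini}_{income \in \{high\}}(D)$$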

18
Gini index for income attribute

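• The Gini index value computed based on this partitioning is

$$\mathrm{Gini}_{income \in \{high, medium\}}(D) = \frac{10}{14}\left(1 - \left(\frac{6}{10}\right)^2 - \left(\frac{4}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right) = 0.45 = \mathrm{Gini}_{income \in \{low\}}(D)$$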

19
Gini index for income attribute

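• The Gini index value computed based on this partitioning is

$$\mathrm{Gini}_{income \in \{high, low\}}(D) = \frac{8}{14}\left(1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2\right) + \frac{6}{14}\left(1 - \left(\frac{2}{6}\right)^2 - \left(\frac{4}{6}\right)^2\right) = 0.458 = \mathrm{Gini}_{income \in \{medium\}}(D)$$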

20
Gini index for income attribute
• Gini income ∈{low, medium}
= 0.443 = Gini income ∈{high}
• Gini income ∈{high, medium}
= 0.45 = Gini income ∈{low}
• Gini income ∈{high, low}
= 0.458 = Gini income ∈{medium}
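The three income partitions can be checked with a few lines of plain Python. This is a sketch: the function names are illustrative, and the (yes, no) counts per income value are the ones read off the partition slides.

```python
def gini(counts):
    """Gini impurity from a sequence of per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(d1, d2):
    """Weighted Gini of a binary partition; d1 and d2 are (yes, no) counts."""
    n1, n2 = sum(d1), sum(d2)
    return (n1 * gini(d1) + n2 * gini(d2)) / (n1 + n2)


# (yes, no) counts of buys_computer per income value, from the slides.
INCOME = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}


def pooled(*values):
    """Add up the (yes, no) counts of several income values."""
    return tuple(sum(INCOME[v][k] for v in values) for k in (0, 1))


splits = {
    "{low, medium} | {high}": gini_split(pooled("low", "medium"), INCOME["high"]),
    "{high, medium} | {low}": gini_split(pooled("high", "medium"), INCOME["low"]),
    "{high, low} | {medium}": gini_split(pooled("high", "low"), INCOME["medium"]),
}
best = min(splits, key=splits.get)

for name, value in splits.items():
    print(f"{name}: {value:.3f}")
print("best income split:", best)
```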

21
Gini index for Age attribute

• The Gini index value computed based on this partitioning is


Gini Age ∈{Youth, middle_aged}
= 0.457 = Gini Age ∈{senior}
Gini Age ∈{Youth, Senior}
= 0.357 = Gini Age∈{middle_aged}
Gini Age ∈{senior, middle_aged}
= 0.393 = Gini Age ∈{Youth}

22
Gini index for student attribute

• The Gini index value computed based on this partitioning is


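$$\mathrm{Gini}_{student \in \{Yes, No\}}(D) = \frac{7}{14}\left(1 - \left(\frac{6}{7}\right)^2 - \left(\frac{1}{7}\right)^2\right) + \frac{7}{14}\left(1 - \left(\frac{3}{7}\right)^2 - \left(\frac{4}{7}\right)^2\right) = 0.367$$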

23
Gini index for credit_rating attribute

• The Gini index value computed based on this partitioning is


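$$\mathrm{Gini}_{credit\_rating \in \{fair, excellent\}}(D) = \frac{8}{14}\left(1 - \left(\frac{6}{8}\right)^2 - \left(\frac{2}{8}\right)^2\right) + \frac{6}{14}\left(1 - \left(\frac{3}{6}\right)^2 - \left(\frac{3}{6}\right)^2\right) = 0.428$$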

24
Choosing the root node
The attribute with the minimum Gini score is taken as the root node, i.e. Age (Gini Age ∈{Youth, Senior} = 0.357 = Gini Age ∈{middle_aged}).

Attribute        Gini score
Age              0.357
Income           0.443
Student          0.367
Credit_rating    0.428

Age
+-- Middle age
+-- Youth, senior: ???
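The root choice can be reproduced by trying every two-way grouping of each attribute's values and keeping the lowest weighted Gini. The sketch below assumes the 14-tuple table from the cited Han et al. text. The function names are illustrative.

```python
from itertools import combinations

COLUMNS = ["age", "income", "student", "credit_rating"]
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_binary_split(rows, col):
    """Return (weighted Gini, value subset) of the best two-way split on col."""
    values = sorted({r[col] for r in rows})
    best = None
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = [r[-1] for r in rows if r[col] in subset]
            right = [r[-1] for r in rows if r[col] not in subset]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, subset)
    return best


scores = {name: best_binary_split(ROWS, i) for i, name in enumerate(COLUMNS)}
root = min(scores, key=lambda name: scores[name][0])

for name, (score, subset) in scores.items():
    print(f"{name:14s} {score:.3f}  split off {set(subset)}")
print("root:", root)
```

Exact arithmetic gives about 0.4286 for credit_rating, so the last digit can differ from a truncated slide value.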

25
Gini index for different attributes for sample of 10
• After separating the 4 samples belonging to middle age, a total of 10 remain:

26
Gini index for different attributes for sample of 10

• $\mathrm{Gini}(D) = 1 - (5/10)^2 - (5/10)^2 = 0.5$
• $\mathrm{Gini}_{age} = 0.48$
• $\mathrm{Gini}_{credit\_rating} = 0.41$
• $\mathrm{Gini}_{student} = 0.32$
• $\mathrm{Gini}_{income} = 0.375$
• Take student as the next node, since it has the minimum Gini score.
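The figures for the 10 remaining tuples can be recomputed by filtering out the middle-aged rows first. As before, this is a sketch that assumes the Han et al. table. Function names are illustrative.

```python
from itertools import combinations

COLUMNS = ["age", "income", "student", "credit_rating"]
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_score(rows, col):
    """Lowest weighted Gini over all two-way groupings of column col."""
    values = sorted({r[col] for r in rows})
    scores = []
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = [r[-1] for r in rows if r[col] in subset]
            right = [r[-1] for r in rows if r[col] not in subset]
            scores.append((len(left) * gini(left) + len(right) * gini(right)) / len(rows))
    return min(scores)


# Keep only the youth and senior tuples (the middle-aged branch is set aside).
subset_rows = [r for r in ROWS if r[0] != "middle_aged"]
node_gini = gini([r[-1] for r in subset_rows])
scores = {name: best_split_score(subset_rows, i) for i, name in enumerate(COLUMNS)}
next_node = min(scores, key=scores.get)

print(len(subset_rows), round(node_gini, 3))
for name, score in scores.items():
    print(f"{name:14s} {score:.3f}")
print("next node:", next_node)
```

Unrounded, the credit_rating value is about 0.4167, so a truncated figure can differ in the last digit.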

27
Drawing cart

Age
+-- Middle age
+-- Youth, senior: Student
    +-- yes: ???
    +-- No: ???

28
For branch Student = No
• Omit the marked rows (data entries), i.e. those with Age = middle_aged or student = Yes.
• A total of 5 rows remain.

29
Gini index for different attributes For branch Student = No

• $\mathrm{Gini}(D) = 1 - (4/5)^2 - (1/5)^2 = 0.32$
• $\mathrm{Gini}_{age} = 0.2$
• $\mathrm{Gini}_{credit\_rating} = 0.267$
• $\mathrm{Gini}_{student} = 0.32$
• $\mathrm{Gini}_{income} = 0.267$
• Take age as the next node, since it has the minimum Gini score.

30
Drawing cart

Age
+-- Middle age
+-- Youth, senior: Student
    +-- yes: ???
    +-- No: Age
        +-- ???
        +-- ???

31
For branch Student = Yes
• Omit the marked rows (data entries), i.e. those with Age = middle_aged or student = No.
• A total of 5 rows remain.

32
Gini index for different attributes For branch Student = Yes

• $\mathrm{Gini}(D) = 1 - (4/5)^2 - (1/5)^2 = 0.32$
• $\mathrm{Gini}_{age} = 0.267$
• $\mathrm{Gini}_{credit\_rating} = 0.2$
• $\mathrm{Gini}_{student} = 0.32$
• $\mathrm{Gini}_{income} = 0.267$
• Take credit rating as the next node, since it has the minimum Gini score.
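Both student branches can be checked with the same routine. This is a sketch assuming the Han et al. table. An attribute with a single remaining value (student itself) cannot be split, so its score is taken to be Gini(D) of the branch, which matches the 0.32 listed for it.

```python
from itertools import combinations

COLUMNS = ["age", "income", "student", "credit_rating"]
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_score(rows, col):
    """Lowest weighted Gini over two-way groupings; Gini(D) if col is constant."""
    values = sorted({r[col] for r in rows})
    if len(values) < 2:  # nothing to split on, impurity stays at Gini(D)
        return gini([r[-1] for r in rows])
    scores = []
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = [r[-1] for r in rows if r[col] in subset]
            right = [r[-1] for r in rows if r[col] not in subset]
            scores.append((len(left) * gini(left) + len(right) * gini(right)) / len(rows))
    return min(scores)


results = {}
for student in ("no", "yes"):
    # Drop middle-aged tuples, then keep one student branch.
    branch = [r for r in ROWS if r[0] != "middle_aged" and r[2] == student]
    scores = {name: best_split_score(branch, i) for i, name in enumerate(COLUMNS)}
    results[student] = (len(branch), scores, min(scores, key=scores.get))
    print(f"student = {student}: {len(branch)} rows, next node {results[student][2]}")
    for name, score in scores.items():
        print(f"  {name:14s} {score:.3f}")
```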

33
Drawing cart

Age
+-- Middle age
+-- Youth, senior: Student
    +-- yes: Credit_rating
    |   +-- ???
    |   +-- ???
    +-- No: Age
        +-- ???
        +-- ???

34
Coding scheme
Age              Code
Youth            2
Middle Age       0
senior           1

Student          Code
Yes              1
No               0

Income           Code
High             0
Low              1
Medium           2

Credit rating    Code
Fair             1
Excellent        0

Buys computer    Class
Yes              1
No               0
35
Decision tree

• Repeat the splitting process until we obtain all the leaf nodes; the final output:

[Figure: final tree from the decision tree classifier. Annotations: values for the dependent variable; number of yes and No in independent variable; sample size. Branch labels: Youth, Senior / Middle_age; No / Yes; Excellent / Fair; Senior / Youth; Excellent / Fair; High, Low / Medium.]

36
Splitting Dataset

• train_test_split(): This function splits the dataset into training and testing subsets.
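A minimal sketch of the call on 14 placeholder samples. The 30% test share and `random_state=1` are assumptions, since the slide's actual arguments are not shown.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 14 one-feature samples with alternating class labels.
X = [[i] for i in range(14)]
y = [i % 2 for i in range(14)]

# Assumed settings: hold out 30% for testing, fix the shuffle for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(len(X_train), "training samples,", len(X_test), "testing samples")
```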

37
Build the Decision Tree Model

38
Evaluating the Model
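The build and evaluation code is not shown on these slides. One possible version, under the same assumptions as before (the Han et al. table, the codes from the coding scheme slide, a 30% test split, and `random_state=1`), is sketched below.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

AGE = {"middle_aged": 0, "senior": 1, "youth": 2}
INCOME = {"high": 0, "low": 1, "medium": 2}
YES_NO = {"no": 0, "yes": 1}
CREDIT = {"excellent": 0, "fair": 1}

ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
X = [[AGE[a], INCOME[i], YES_NO[s], CREDIT[c]] for a, i, s, c, _ in ROWS]
y = [YES_NO[label] for *_, label in ROWS]

# Assumed split settings; the slide's actual arguments are not shown.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit on the training part only, then predict the held-out tuples.
clf = DecisionTreeClassifier(criterion="gini", random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: actual no, yes

print("accuracy:", accuracy)
print(matrix)
```

With only 14 tuples the test set holds 5 of them, so the accuracy depends heavily on the random split.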

39
Visualizing Decision Tree

40
Decision Tree Visualization

41
Thank You

42
