Machine Learning With Python - Machine Learning Algorithms - Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions.
Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms.
The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcome.
An example of a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits, is shown below:
[Figure: binary decision tree that splits on Yes/No answers to classify a person as fit or unfit]
· Classification decision trees: In this kind of decision tree, the decision variable is categorical. The above decision tree is an example of a classification decision tree.
· Regression decision trees: In this kind of decision tree, the decision variable is continuous (a minimal scikit-learn sketch of both kinds follows).
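As a quick illustration of the two kinds, scikit-learn exposes a separate estimator for each; the following is a minimal sketch with toy data (all values here are assumed purely for illustration):
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy feature matrix, assumed for illustration: [age, exercises-daily flag]
X = [[25, 0], [40, 1], [30, 1], [55, 0]]
y_class = ['unfit', 'fit', 'fit', 'unfit']   # categorical target -> classification tree
y_reg = [55.0, 72.5, 68.0, 50.5]             # continuous target -> regression tree

DecisionTreeClassifier().fit(X, y_class)
DecisionTreeRegressor().fit(X, y_reg)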
Gini Index
It is the name of the cost function that is used to evaluate binary splits in the dataset and works with a categorical target variable such as "Success" or "Failure".
The higher the value of the Gini index, the lower the homogeneity: a perfect Gini index value is 0 and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the help of the following steps:
· Part 1: Calculating Gini score: First, calculate the Gini index for each sub-node using the formula 1 - (p^2 + q^2), where p^2 + q^2 is the sum of the squares of the probabilities of success and failure. Then calculate the Gini index for the split as the weighted average of the sub-node scores, weighted by sub-node size.
· Part 2: Splitting a dataset: It may be defined as separating a dataset into two lists of rows, given the index of an attribute and a split value for that attribute. After getting the two groups, right and left, from the dataset, we can calculate the value of the split using the Gini score calculated in the first part. The split value decides in which group each row will reside.
· Part 3: Evaluating all splits: After calculating the Gini score and splitting the dataset, the next part is the evaluation of all splits. For this purpose, we must check every value of each attribute as a candidate split, then find the best possible split by evaluating its cost. The best split will be used as a node in the decision tree (a from-scratch sketch of all three parts follows this list).
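Putting the three parts together, a minimal from-scratch sketch in Python might look like the following. It assumes each row is a list with the class label in the last column; all function and variable names here are our own, for illustration only:
def gini_index(groups, classes):
    # Part 1: weighted Gini index of a candidate split (two groups of rows)
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p                           # p^2 + q^2 in the 2-class case
        gini += (1.0 - score) * (size / n_instances)  # weight by group size
    return gini

def test_split(index, value, dataset):
    # Part 2: separate rows into left/right lists on one attribute's value
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def get_split(dataset):
    # Part 3: evaluate every (attribute, value) candidate and keep the best
    class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = None, None, float('inf'), None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
    return {'index': best_index, 'value': best_value, 'groups': best_groups}
Note that get_split exhaustively tries every attribute value as a split point, which is simple to follow but quadratic in the number of rows per node.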
Building a Tree
As we know, a tree has a root node and terminal nodes. After creating the root node, we can build the tree by following two parts: terminal node creation and recursive splitting.
While creating terminal nodes, the important point is to decide when to stop growing the tree. Two criteria are commonly used:
· Maximum Tree Depth: the maximum number of nodes in the tree after the root node. We must stop adding terminal nodes once the tree reaches the maximum depth.
· Minimum Node Records: the minimum number of training patterns that a given node is responsible for. We must stop adding terminal nodes once a node reaches this minimum number of records or falls below it.
A terminal node is used to make a final prediction.
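Continuing the from-scratch sketch above, a terminal node can simply predict the most common class among the rows it is responsible for (function name assumed):
def to_terminal(group):
    # A terminal node predicts the most common outcome among its rows
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)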
Recursive Splitting
Now that we understand when to create terminal nodes, we can start building the tree. Recursive splitting is a method to build the tree: once a node is created, we create its child nodes (nodes added to an existing node) recursively on each group of data generated by splitting the dataset, by calling the same function again and again.
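Continuing the same sketch, the recursive routine below creates child nodes from the groups produced by get_split, falling back to to_terminal when either stopping criterion is met (names and structure are our own, building on the helpers above):
def split(node, max_depth, min_size, depth):
    # Recursively grow child nodes, honoring the two stopping criteria
    left, right = node['groups']
    del node['groups']
    if not left or not right:
        # One side is empty: no useful split, so make a terminal from all rows
        node['left'] = node['right'] = to_terminal(left + right)
        return
    if depth >= max_depth:
        # Maximum tree depth reached
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    for side, group in (('left', left), ('right', right)):
        if len(group) <= min_size:
            node[side] = to_terminal(group)   # minimum node records reached
        else:
            node[side] = get_split(group)     # find the best split for this group
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
    # Create the root node, then grow the tree recursively
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root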
Prediction
After building a decision tree, we need to make predictions with it. Basically, prediction involves navigating the decision tree with a specifically provided row of data.
We can make a prediction with the help of a recursive function, as above: the same prediction routine is called again with the left or the right child node.
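In the same from-scratch sketch, the recursive prediction routine walks the tree with one row of data until it reaches a terminal node:
def predict(node, row):
    # Follow the branch matching this row's value for the node's split attribute
    if row[node['index']] < node['value']:
        branch = node['left']
    else:
        branch = node['right']
    if isinstance(branch, dict):
        return predict(branch, row)   # internal node: keep descending
    return branch                     # terminal node: its stored class is the prediction

# Example usage on toy data (values assumed): rows are [feature, label]
dataset = [[2.7, 0], [1.3, 0], [3.6, 0], [7.5, 1], [9.0, 1], [7.4, 1]]
tree = build_tree(dataset, max_depth=3, min_size=1)
print(predict(tree, [6.5, None]))  # expected: 0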
· The decision tree classifier prefers feature values to be categorical. If you want to use continuous values, they must be discretized prior to model building (one way to do this is sketched after this list).
· A statistical approach is used to place attributes at any node position, i.e. as the root node or an internal node.
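As one way to perform that discretization, scikit-learn's KBinsDiscretizer can bin continuous columns into ordinal categories; the following is a minimal sketch, with the toy data, bin count and binning strategy all assumed:
from sklearn.preprocessing import KBinsDiscretizer

# Toy continuous features (e.g. age and BMI), assumed for illustration
X = [[25, 26.6], [40, 28.1], [30, 31.2], [55, 22.4]]

# Bin each column into 4 equal-width ordinal categories
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X)
print(X_binned)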
Sample rows of the Pima Indians Diabetes dataset (index, pregnant, glucose, bp, skin, insulin, bmi, pedigree, age, label):
1  1  85  66  29   0  26.6  0.351  31  0
3  1  89  66  23  94  28.1  0.167  21  0
Now, split the dataset into features and target variable as follows:
X = pima[feature_cols] # Features
y = pima.label # Target variable
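The lines above assume the dataset has already been loaded into a pandas DataFrame named pima and that feature_cols is defined; the training step below also assumes a train/test split. A minimal sketch of those assumed steps (the file name, column names and split ratio are illustrative, not taken from the original):
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names assumed for the standard Pima Indians Diabetes CSV
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

feature_cols = col_names[:-1]   # every column except the label
X = pima[feature_cols]          # Features
y = pima.label                  # Target variable

# Hold out 30% of the rows for testing (split ratio and seed assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)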
Next, train the model with the help of the DecisionTreeClassifier class of sklearn as follows:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
At last, we need to make predictions. It can be done with the help of the following script:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
y_pred = clf.predict(X_test)
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Confusion Matrix:
[[116  30]
 [ 46  39]]
Classification Report:
precision recall f1-score support
Accuracy:
0.670995670995671
The above decision tree can be visualized with the help of the following code:
from io import StringIO
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Alternatively, the tree can be plotted directly with scikit-learn and matplotlib:
from sklearn import tree
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(25, 20))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'], filled=True)
plt.show()
Thank You