
Random Forest Algorithm

HOW IT WORKS & WHY IT'S SO EFFECTIVE
Overview

 Random forest is a supervised machine learning algorithm that is widely used for both regression and classification problems, and it produces good results most of the time even without hyperparameter tuning. It is one of the most widely used algorithms because of its simplicity. It builds a number of decision trees on different samples of the data and then takes the majority vote if it's a classification problem (or the average prediction if it's a regression problem).
Ensemble Techniques

 Suppose you want to purchase a house. Would you just walk into a neighbourhood and buy the very first house you see, or buy one purely on your broker's advice? It's highly unlikely.
 You would likely browse a few web portals, checking the area, number of bedrooms, facilities, price, etc. You would also probably ask your friends and colleagues for their opinion. In short, you wouldn't reach a conclusion directly, but would instead make a decision after considering the opinions of other people as well.
 Ensemble techniques work in a similar manner: they simply combine multiple models. A collection of models is used to make predictions rather than an individual model, and this increases the overall performance. Let's understand the 2 main ensemble methods in Machine Learning:
1. Bagging – Suppose we have a dataset, and we build several models on the same dataset and combine them. Will that be useful? Probably not: there is a high chance we'll get the same results, since we are giving every model the same input. So instead we use a technique called bootstrapping. In this, we create subsets of the original dataset by sampling with replacement. The size of each subset is the same as the size of the original set. Because we sample with replacement, there is a high chance that we provide different data points to each of our models (a small sketch of bootstrap sampling appears after this list).
2. Boosting – Suppose a data point has been incorrectly classified by your 1st model, and then by the next one (and probably by all the models). Will combining those predictions give better results? Of course not.
 The boosting technique is a sequential process, where each model tries to correct the errors of the previous model. The succeeding models are dependent on the previous model.
 It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. Examples include AdaBoost and XGBoost.
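A minimal sketch of bootstrapping with NumPy, assuming a tiny illustrative array; each bootstrap sample is the same size as the original data and is drawn with replacement:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # placeholder dataset of 10 rows

# Draw 3 bootstrap samples, each the same size as the original data
for i in range(3):
    idx = rng.integers(0, len(data), size=len(data))  # sample row indices with replacement
    print(f"bootstrap sample {i}:", data[idx])

Because sampling is done with replacement, each sample typically repeats some rows and misses others, which is exactly what gives every model a different view of the data.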
Random forest works on the bagging principle, so now let's dive into this topic and learn more about how random forest works.
What is the Random Forest Algorithm?

 Random Forest is a technique that uses ensemble learning: it combines many classifiers to provide solutions to complex problems.
 As the name suggests, a random forest consists of many decision trees. Rather than depending on one tree, it takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output. Don't worry if you haven't read about decision trees; that part is covered below.
Real Life Analogy

 Let's try to understand random forests with the help of an example. Suppose you have to go on a solo trip. You are not sure whether you want to go to a hill station or go somewhere for some adventure. So you ask your friends what they suggest: say friend 1 (F1) tells you to go to a hill station, since it's November already and this is a great time to have fun there, while friend 2 (F2) wants you to go for adventure.
 Similarly, all your friends give you suggestions on where you could go on a trip. In the end, you can either go to a place of your own choice or decide on the place suggested by most of your friends.
 Similarly, in a Random Forest we train a number of decision trees; the class that gets the most votes becomes the final result if it's a classification problem, and the average of the tree predictions is used if it's a regression problem.
Understanding Decision Trees

 To know how the random forest algorithm works, we need to understand Decision Trees, which is again a supervised machine learning algorithm used for classification as well as regression problems.
 Decision trees use a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. A tree starts with a root node and ends with decisions made at the leaves.
 It consists of 3 components: the root node, decision nodes, and leaf nodes. The node from which the population starts dividing is called the root node. The nodes we get after splitting the root node are called decision nodes, and a node where further splitting is not possible is called a leaf node.
 The question is: how do we know which feature should be the root node? A dataset can have hundreds of features, so how do we decide which feature becomes the root node?
 To answer this question, we need to understand something called the "Gini Index".
What is Gini Index?

 To select a feature to split on, we need to know how pure or impure that split will be. A pure sub-split means that the resulting node contains only "yes" or only "no" examples. Suppose this is our dataset.
 We will see what output we get after splitting, taking each feature as our root node.

When we take feature 1 as our root node, we get a pure split, whereas when we take feature 2, the split is not pure.
So how do we know how much impurity a particular node has? This can be understood with the help of the "Gini Index".
 We basically need to know the impurity of our dataset, and we'll take as the root node the feature that gives the lowest impurity, or in other words, the lowest Gini index.
Mathematically, the Gini index can be written as:

Gini = 1 − (P+² + P−²)

 where P+ is the probability of the positive class and P− is the probability of the negative class.
Let's understand this formula with the help of a toy dataset:

Let's take Loan Amount as our root node and try to split it.
Now we need to calculate the weighted Gini index, i.e., the total Gini index of this split. This can be calculated by:

Weighted Gini = (n_left / n) × Gini(left) + (n_right / n) × Gini(right)

Similarly, the algorithm will find the Gini index of all the possible splits and will choose for the root node the feature whose split gives the lowest Gini index.
The lowest Gini index means low impurity.
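As a quick check of these formulas, here is a small sketch that computes the Gini index of a node and the weighted Gini index of a split; the class counts are made-up illustrative numbers, not the toy dataset from the slide:

def gini(pos, neg):
    # Gini index of a node containing `pos` positive and `neg` negative samples
    total = pos + neg
    if total == 0:
        return 0.0
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - (p_pos ** 2 + p_neg ** 2)

def weighted_gini(left, right):
    # Weighted Gini of a split; `left` and `right` are (pos, neg) counts of the child nodes
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)

# Illustrative split: left child gets 3 positives / 1 negative, right child 0 / 4
print(gini(3, 1))                      # 0.375
print(weighted_gini((3, 1), (0, 4)))   # 0.1875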
Entropy
 Another metric, called "Entropy", is also used to measure the impurity of a split. The mathematical formula for entropy is:

Entropy = −(P+ log₂ P+) − (P− log₂ P−)

We usually use the Gini index because it is computationally efficient: it takes less time to compute since there is no logarithmic term in it, unlike entropy. Logarithmic calculations take some extra time, which is why many tree-based algorithms use the Gini index as their default splitting criterion.
One more important thing to note here is that if a node contains an equal number of both classes, the Gini index takes its maximum value (0.5), which means the node is maximally impure. You can understand this with the help of an example: suppose you have a group of friends, and you all take a vote on which movie to watch. You get 5 votes for 'Lucy' and 5 for 'Titanic'. Wouldn't it be harder to choose a movie now, since both movies have an equal number of votes? The node, like the vote, is as undecided as it can be.
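To illustrate, here is a small sketch comparing entropy and the Gini index on a perfectly balanced node (the 5-vs-5 vote above) and on a much purer one; the counts are just illustrative:

import math

def entropy(pos, neg):
    # Entropy of a node with `pos` positive and `neg` negative samples
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # skip empty classes, since log2(0) is undefined
            e -= p * math.log2(p)
    return e

def gini(pos, neg):
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

print(entropy(5, 5), gini(5, 5))   # 1.0, 0.5  -> maximum impurity for a 50/50 node
print(entropy(9, 1), gini(9, 1))   # ~0.47, 0.18 -> a much purer node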
Applying Decision trees in Random Forest Algorithm

 The main difference between the two is that Random Forest is a bagging method that trains each tree on a subset of the original dataset, and this property of Random Forest helps to overcome overfitting.
 Instead of building a single decision tree, Random Forest builds a number of decision trees, each on a different set of observations.
 One big advantage of this algorithm is that it can be used for classification as well as regression problems.


Steps involved in Random Forest Algorithm

Step 1 – We first make subsets of our original data. We do row sampling and feature sampling, which means we select rows with replacement and pick a random subset of columns to create subsets of the training dataset.
Step 2 – We build an individual decision tree for each subset we take.
Step 3 – Each decision tree gives an output.
Step 4 – The final output is decided by majority voting if it's a classification problem, and by averaging if it's a regression problem.
Before proceeding further, we need to know one more important thing: when we grow a decision tree to its full depth, we get low bias and high variance. In other words, the model will perform perfectly on the training dataset, but it will do poorly when a new data point comes into the picture. To tackle this high-variance situation, we use a random forest, where we combine many decision trees instead of depending on a single tree. This lowers the variance, and in this way we overcome the overfitting problem.
After completing step 1, we build a decision tree for each subset. In the above example, we have 3 decision trees.
How are these decision trees built from scratch?
To build a decision tree, we can use two splitting criteria:
1. Gini index
2. Entropy and information gain
 In Step 4, the process of combining the predictions of multiple trees is what we call aggregation (a small end-to-end sketch of these steps follows below).
 For classification, we use majority voting.
 For regression, we use averaging.
 With this, we understand what exactly bootstrap aggregation (bagging) is all about.
 Now we need to understand how it benefits us.
 It reduces the variance. This helps build robust models, which work well even on unseen data.
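Here is a minimal sketch of steps 1–4, assuming scikit-learn decision trees and a made-up toy dataset; note that a real random forest also picks a random feature subset at every split, which this simplified version approximates with max_features:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # toy data
rng = np.random.default_rng(0)
trees = []

# Steps 1-2: bootstrap a subset of rows for each tree and fit it
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))          # row sampling with replacement
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Steps 3-4: each tree gives an output, and the majority vote is the final prediction
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble accuracy on the training data:", (majority == y).mean())

For regression, the last step would simply average the tree outputs instead of taking a vote.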
OOB (Out-of-Bag) Evaluation

 We now know how bootstrapping works in random forests: rows are sampled with replacement (and a random subset of features is considered at each split) before each tree is trained. Since the sampling occurs with replacement, about one-third of the data remains unused for training a given tree; this data is referred to as the out-of-bag (OOB) samples. We can evaluate our model on these out-of-bag data points to estimate how it will perform on the test dataset. Let's see how we can use this OOB evaluation in Python. Let's import the required libraries:
Next, we'll separate X and y and train our model.

To get the OOB evaluation we need to set the parameter oob_score to True. We can see that the score we get from the OOB samples and the score on the test dataset are roughly the same. In this way, we can use these left-out samples to evaluate our model.
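The code for these steps appears as screenshots in the original slides; here is a minimal sketch of what it might look like, assuming scikit-learn and a placeholder dataset in place of the one used in the slides:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the dataset used in the slides
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# oob_score=True scores each training sample using only the trees
# that did NOT see that sample during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print("OOB score :", rf.oob_score_)
print("Test score:", rf.score(X_test, y_test))  # typically close to the OOB score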
Difference between Decision Tree and Random Forest

Hence, we can conclude that random forests are much more successful than single decision trees only if the individual trees are diverse and reasonably accurate.
Feature Importance Using Random Forest

 Feature Importance is a technique used to identify which


features (variables or columns) in your dataset have the most
influence on the model's predictions. Random Forest, being an
ensemble of decision trees, inherently provides a way to assess
feature importance, which helps in understanding which
features are critical in making decisions.

 In a Random Forest model, each individual decision tree splits


data based on features that lead to the best splits (usually
based on a metric like Gini impurity or Mean Squared Error).
The feature importance is calculated based on how often and
how effectively a feature is used for splitting the data across all
trees in the forest.
 The more frequently and effectively a feature is used to split
the data, the more important it is considered to be.
How is Feature Importance Calculated?
 There are a few common methods used to calculate feature
importance in Random Forests:

 Mean Decrease Impurity (Gini Importance):


 This is the default method used by Random Forests.
 For each feature, it sums up the decrease in impurity (such as Gini impurity
or entropy) that results from splits on that feature.
 The more a feature helps in reducing the impurity across all trees, the higher
its importance.

 Mean Decrease Accuracy (Permutation Importance):


 This method works by evaluating the decrease in model performance when
the values of a feature are randomly shuffled.
 If the model's accuracy drops significantly after shuffling a feature, it
indicates that the feature is important.
 This method is useful for evaluating feature importance on any machine
learning model, not just Random Forests.
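A brief sketch of both methods with scikit-learn on placeholder data; feature_importances_ gives the mean decrease in impurity, while permutation_importance gives the mean decrease in accuracy on held-out data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder dataset with a few informative features
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean Decrease Impurity (Gini importance), accumulated over the training splits
print("Gini importance:", rf.feature_importances_)

# Mean Decrease Accuracy (permutation importance), measured on the test set
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importance:", perm.importances_mean)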
The formula for calculating the feature importance is:

importance(feature j) = (sum of the importances of the nodes that split on feature j) / (sum of the importances of all nodes)

To understand this formula, let's first plot the decision tree for the above dataset. Here we have two columns, [0] and [1]. To calculate the feature importance of column [0], we need to find the nodes where the split happened on column [0].

In this dataset, we have only one node splitting on column [0] and one on column [1]. We take the importance of the nodes that split on column [0] and divide it by the sum of the importances of all the nodes.
To calculate the importance of a node we will use this formula:

node importance = (Nt / N) × [ impurity − (Nt(right) / Nt) × right impurity − (Nt(left) / Nt) × left impurity ]

 Let's calculate the importance of the 1st node in our decision tree.
 Our Nt is 5, N is 5, the impurity of that node is 0.48, Nt(right) is 4, the right impurity is 0.375, Nt(left) is 1, and the left impurity is 0. Putting all this information into the above formula we get:

(5/5) × [ 0.48 − (4/5) × 0.375 − (1/5) × 0 ] = 0.48 − 0.30 = 0.18
Similarly, we will calculate this for the 2nd node.
Now let's calculate the importance of features [0] and [1]. Calculated this way, the feature importance of feature [0] comes out to 0.625, and for feature [1] it is 0.375.
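A small sketch of the node-importance formula evaluated with the numbers given above for the 1st node; the formula is the standard weighted impurity decrease that scikit-learn uses internally:

def node_importance(n_t, n, impurity, n_right, right_impurity, n_left, left_impurity):
    # Weighted impurity decrease: (Nt/N) * [impurity - weighted child impurities]
    return (n_t / n) * (impurity
                        - (n_right / n_t) * right_impurity
                        - (n_left / n_t) * left_impurity)

# 1st node from the slide: Nt=5, N=5, impurity=0.48,
# right child: 4 samples, impurity 0.375; left child: 1 sample, impurity 0
print(node_importance(5, 5, 0.48, 4, 0.375, 1, 0.0))  # ≈ 0.18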
Advantages and Disadvantages of Random Forest

 One of the greatest benefits of the random forest algorithm is its flexibility: we can use it for regression as well as classification problems.
 It can be considered a handy algorithm because it produces good results even without hyperparameter tuning. Also, the parameters are pretty straightforward: they are easy to understand, and there are not that many of them.
 One of the biggest problems in machine learning is
Overfitting. We need to make a generalized model which can
get good results on the test data too. Random forest helps to
overcome this situation by combining many Decision Trees
which will eventually give us low bias and low variance.
 The main limitation of random forest is that a large number of trees can make the algorithm slow and ineffective for real-time predictions.
 In general, these algorithms are fast to train but quite slow to create predictions once they are trained.
 In most real-world applications, the random forest algorithm
is fast enough but there may be situations where run-time
performance is critical, and alternative approaches would be
more suitable.
Conclusion

 In this presentation, we looked at a very powerful machine learning algorithm. To summarize, we learned about decision trees and random forests, on what basis a tree splits its nodes, and how Random Forest helps us overcome overfitting. We also saw how Random Forest helps with feature selection.
THANK YOU
