
Random Forest Algorithm

HOW IT WORKS & WHY IT'S SO EFFECTIVE
Overview

 Random forest is a supervised machine learning algorithm that is widely used for both regression and classification problems, and it produces good results most of the time even without hyperparameter tuning. It is one of the most widely used algorithms because of its simplicity. It builds a number of decision trees on different samples of the data and then takes the majority vote if it's a classification problem (or the average prediction if it's a regression problem).
Ensemble Techniques

 Suppose you want to purchase a house. Would you just walk into a neighbourhood and buy the very first house you see, or buy one purely on your broker's advice? It's highly unlikely.
 You would likely browse a few web portals, checking the area, number of bedrooms, facilities, price, etc. You would also probably ask your friends and colleagues for their opinion. In short, you wouldn't reach a conclusion directly, but would instead make a decision after considering the opinions of other people as well.
 Ensemble techniques work in a similar manner: they simply combine multiple models. A collection of models is used to make predictions rather than an individual model, and this increases the overall performance. Let's understand the 2 main ensemble methods in Machine Learning:
1. Bagging – Suppose we have a dataset, and we build several models on the same dataset and combine them. Will that be useful? Probably not: there is a high chance we'll get the same results, since we are giving every model the same input. So instead we use a technique called bootstrapping. In this, we create subsets of the original dataset by sampling with replacement. The size of each subset is the same as the size of the original set. Because we sample with replacement, there is a high chance that we provide different data points to each of our models (a small sketch of bootstrap sampling appears after this list).
2. Boosting – Suppose a data point has been incorrectly classified by your 1st model, and then by the next one (and probably by all the models). Will combining those predictions give better results? Of course not.
 The boosting technique is a sequential process, where each model tries to correct the errors of the previous model. The succeeding models are dependent on the previous model.
 It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. Examples include AdaBoost and XGBoost.
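A minimal sketch of bootstrapping with NumPy, assuming a tiny illustrative array; each bootstrap sample is the same size as the original data and is drawn with replacement:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # placeholder dataset of 10 rows

# Draw 3 bootstrap samples, each the same size as the original data
for i in range(3):
    idx = rng.integers(0, len(data), size=len(data))  # sample row indices with replacement
    print(f"bootstrap sample {i}:", data[idx])

Because sampling is done with replacement, each sample typically repeats some rows and misses others, which is exactly what gives every model a different view of the data.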
Random forest works on the bagging principle, so now let's dive into this topic and learn more about how random forest works.
What is the Random Forest Algorithm?

 Random Forest is a technique that uses ensemble learning: it combines many classifiers to provide solutions to complex problems.
 As the name suggests, a random forest consists of many decision trees. Rather than depending on one tree, it takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output. Don't worry if you haven't read about decision trees; that part is covered below.
Real Life Analogy

 Let's try to understand random forests with the help of an example. Suppose you have to go on a solo trip. You are not sure whether you want to go to a hill station or go somewhere for some adventure. So you ask your friends what they suggest: say friend 1 (F1) tells you to go to a hill station, since it's November already and this is a great time to have fun there, while friend 2 (F2) wants you to go for adventure.
 Similarly, all your friends give you suggestions on where you could go on a trip. In the end, you can either go to a place of your own choice or decide on the place suggested by most of your friends.
 Similarly, in a Random Forest we train a number of decision trees; the class that gets the most votes becomes the final result if it's a classification problem, and the average of the tree predictions is used if it's a regression problem.
Understanding Decision Trees

 To know how the random forest algorithm works, we need to understand Decision Trees, which is again a supervised machine learning algorithm used for classification as well as regression problems.
 Decision trees use a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. A tree starts with a root node and ends with decisions made at the leaves.
 It consists of 3 components: the root node, decision nodes, and leaf nodes. The node from which the population starts dividing is called the root node. The nodes we get after splitting the root node are called decision nodes, and a node where further splitting is not possible is called a leaf node.
 The question is: how do we know which feature should be the root node? A dataset can have hundreds of features, so how do we decide which feature becomes the root node?
 To answer this question, we need to understand something called the "Gini Index".
What is Gini Index?

 To select a feature to split on, we need to know how pure or impure that split will be. A pure sub-split means that the resulting node contains only "yes" or only "no" examples. Suppose this is our dataset.
 We will see what output we get after splitting, taking each feature as our root node.

When we take feature 1 as our root node, we get a pure split, whereas when we take feature 2, the split is not pure.
So how do we know how much impurity a particular node has? This can be understood with the help of the "Gini Index".
 We basically need to know the impurity of our dataset, and we'll take as the root node the feature that gives the lowest impurity, or in other words, the lowest Gini index.
Mathematically, the Gini index can be written as:

Gini = 1 − (P+² + P−²)

 where P+ is the probability of the positive class and P− is the probability of the negative class.
Let's understand this formula with the help of a toy dataset:

Let's take Loan Amount as our root node and try to split it.
Now we need to calculate the weighted Gini index, i.e., the total Gini index of this split. This can be calculated by:

Weighted Gini = (n_left / n) × Gini(left) + (n_right / n) × Gini(right)

Similarly, the algorithm will find the Gini index of all the possible splits and will choose for the root node the feature whose split gives the lowest Gini index.
The lowest Gini index means low impurity.
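As a quick check of these formulas, here is a small sketch that computes the Gini index of a node and the weighted Gini index of a split; the class counts are made-up illustrative numbers, not the toy dataset from the slide:

def gini(pos, neg):
    # Gini index of a node containing `pos` positive and `neg` negative samples
    total = pos + neg
    if total == 0:
        return 0.0
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - (p_pos ** 2 + p_neg ** 2)

def weighted_gini(left, right):
    # Weighted Gini of a split; `left` and `right` are (pos, neg) counts of the child nodes
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)

# Illustrative split: left child gets 3 positives / 1 negative, right child 0 / 4
print(gini(3, 1))                      # 0.375
print(weighted_gini((3, 1), (0, 4)))   # 0.1875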
Entropy
 Another metric, called "Entropy", is also used to measure the impurity of a split. The mathematical formula for entropy is:

Entropy = −(P+ log₂ P+) − (P− log₂ P−)

We usually use the Gini index because it is computationally efficient: it takes less time to compute since there is no logarithmic term in it, unlike entropy. Logarithmic calculations take some extra time, which is why many tree-based algorithms use the Gini index as their default splitting criterion.
One more important thing to note here is that if a node contains an equal number of both classes, the Gini index takes its maximum value (0.5), which means the node is maximally impure. You can understand this with the help of an example: suppose you have a group of friends, and you all take a vote on which movie to watch. You get 5 votes for 'Lucy' and 5 for 'Titanic'. Wouldn't it be harder to choose a movie now, since both movies have an equal number of votes? The node, like the vote, is as undecided as it can be.
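To illustrate, here is a small sketch comparing entropy and the Gini index on a perfectly balanced node (the 5-vs-5 vote above) and on a much purer one; the counts are just illustrative:

import math

def entropy(pos, neg):
    # Entropy of a node with `pos` positive and `neg` negative samples
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # skip empty classes, since log2(0) is undefined
            e -= p * math.log2(p)
    return e

def gini(pos, neg):
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

print(entropy(5, 5), gini(5, 5))   # 1.0, 0.5  -> maximum impurity for a 50/50 node
print(entropy(9, 1), gini(9, 1))   # ~0.47, 0.18 -> a much purer node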
Applying Decision trees in Random Forest Algorithm

 The main difference between the two is that Random Forest is a bagging method that trains each tree on a subset of the original dataset, and this property of Random Forest helps to overcome overfitting.
 Instead of building a single decision tree, Random Forest builds a number of decision trees, each on a different set of observations.
 One big advantage of this algorithm is that it can be used for classification as well as regression problems.


Steps involved in Random Forest Algorithm

Step 1 – We first make subsets of our original data. We do row sampling and feature sampling, which means we select rows with replacement and pick a random subset of columns to create subsets of the training dataset.
Step 2 – We build an individual decision tree for each subset we take.
Step 3 – Each decision tree gives an output.
Step 4 – The final output is decided by majority voting if it's a classification problem, and by averaging if it's a regression problem.
Before proceeding further, we need to know one more important thing: when we grow a decision tree to its full depth, we get low bias and high variance. In other words, the model will perform perfectly on the training dataset, but it will do poorly when a new data point comes into the picture. To tackle this high-variance situation, we use a random forest, where we combine many decision trees instead of depending on a single tree. This lowers the variance, and in this way we overcome the overfitting problem.
After completing step 1, we build a decision tree for each subset. In the above example, we have 3 decision trees.
How are these decision trees built from scratch?
To build a decision tree, we can use two splitting criteria:
1. Gini index
2. Entropy and information gain
 In Step 4, the process of combining the predictions of multiple trees is what we call aggregation (a small end-to-end sketch of these steps follows below).
 For classification, we use majority voting.
 For regression, we use averaging.
 With this, we understand what exactly bootstrap aggregation (bagging) is all about.
 Now we need to understand how it benefits us.
 It reduces the variance. This helps build robust models, which work well even on unseen data.
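Here is a minimal sketch of steps 1–4, assuming scikit-learn decision trees and a made-up toy dataset; note that a real random forest also picks a random feature subset at every split, which this simplified version approximates with max_features:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # toy data
rng = np.random.default_rng(0)
trees = []

# Steps 1-2: bootstrap a subset of rows for each tree and fit it
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))          # row sampling with replacement
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Steps 3-4: each tree gives an output, and the majority vote is the final prediction
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble accuracy on the training data:", (majority == y).mean())

For regression, the last step would simply average the tree outputs instead of taking a vote.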
OOB (Out-of-Bag) Evaluation

 We now know how bootstrapping works in random forests: rows are sampled with replacement (and a random subset of features is considered at each split) before each tree is trained. Since the sampling occurs with replacement, about one-third of the data remains unused for training a given tree; this data is referred to as the out-of-bag (OOB) samples. We can evaluate our model on these out-of-bag data points to estimate how it will perform on the test dataset. Let's see how we can use this OOB evaluation in Python. Let's import the required libraries:
Next, we'll separate X and y and train our model.

To get the OOB evaluation we need to set the parameter oob_score to True. We can see that the score we get from the OOB samples and the score on the test dataset are roughly the same. In this way, we can use these left-out samples to evaluate our model.
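The code for these steps appears as screenshots in the original slides; here is a minimal sketch of what it might look like, assuming scikit-learn and a placeholder dataset in place of the one used in the slides:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the dataset used in the slides
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# oob_score=True scores each training sample using only the trees
# that did NOT see that sample during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print("OOB score :", rf.oob_score_)
print("Test score:", rf.score(X_test, y_test))  # typically close to the OOB score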
Difference between Decision Tree and Random Forest

Hence, we can conclude that random forests are much more successful than single decision trees only if the individual trees are diverse and reasonably accurate.
Feature Importance Using Random Forest

 Feature Importance is a technique used to identify which


features (variables or columns) in your dataset have the most
influence on the model's predictions. Random Forest, being an
ensemble of decision trees, inherently provides a way to assess
feature importance, which helps in understanding which
features are critical in making decisions.

 In a Random Forest model, each individual decision tree splits


data based on features that lead to the best splits (usually
based on a metric like Gini impurity or Mean Squared Error).
The feature importance is calculated based on how often and
how effectively a feature is used for splitting the data across all
trees in the forest.
 The more frequently and effectively a feature is used to split
the data, the more important it is considered to be.
How is Feature Importance Calculated?
 There are a few common methods used to calculate feature
importance in Random Forests:

 Mean Decrease Impurity (Gini Importance):


 This is the default method used by Random Forests.
 For each feature, it sums up the decrease in impurity (such as Gini impurity
or entropy) that results from splits on that feature.
 The more a feature helps in reducing the impurity across all trees, the higher
its importance.

 Mean Decrease Accuracy (Permutation Importance):


 This method works by evaluating the decrease in model performance when
the values of a feature are randomly shuffled.
 If the model's accuracy drops significantly after shuffling a feature, it
indicates that the feature is important.
 This method is useful for evaluating feature importance on any machine
learning model, not just Random Forests.
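A brief sketch of both methods with scikit-learn on placeholder data; feature_importances_ gives the mean decrease in impurity, while permutation_importance gives the mean decrease in accuracy on held-out data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder dataset with a few informative features
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean Decrease Impurity (Gini importance), accumulated over the training splits
print("Gini importance:", rf.feature_importances_)

# Mean Decrease Accuracy (permutation importance), measured on the test set
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importance:", perm.importances_mean)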
The formula for calculating the feature importance is:

importance(feature j) = (sum of the importances of the nodes that split on feature j) / (sum of the importances of all nodes)

To understand this formula, let's first plot the decision tree for the above dataset. Here we have two columns, [0] and [1]. To calculate the feature importance of column [0], we need to find the nodes where the split happened on column [0].

In this dataset, we have only one node splitting on column [0] and one on column [1]. We take the importance of the nodes that split on column [0] and divide it by the sum of the importances of all the nodes.
To calculate the importance of a node we will use this formula:

node importance = (Nt / N) × [ impurity − (Nt(right) / Nt) × right impurity − (Nt(left) / Nt) × left impurity ]

 Let's calculate the importance of the 1st node in our decision tree.
 Our Nt is 5, N is 5, the impurity of that node is 0.48, Nt(right) is 4, the right impurity is 0.375, Nt(left) is 1, and the left impurity is 0. Putting all this information into the above formula we get:

(5/5) × [ 0.48 − (4/5) × 0.375 − (1/5) × 0 ] = 0.48 − 0.30 = 0.18
Similarly, we will calculate this for the 2nd node.
Now let's calculate the importance of features [0] and [1]. Calculated this way, the feature importance of feature [0] comes out to 0.625, and for feature [1] it is 0.375.
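A small sketch of the node-importance formula evaluated with the numbers given above for the 1st node; the formula is the standard weighted impurity decrease that scikit-learn uses internally:

def node_importance(n_t, n, impurity, n_right, right_impurity, n_left, left_impurity):
    # Weighted impurity decrease: (Nt/N) * [impurity - weighted child impurities]
    return (n_t / n) * (impurity
                        - (n_right / n_t) * right_impurity
                        - (n_left / n_t) * left_impurity)

# 1st node from the slide: Nt=5, N=5, impurity=0.48,
# right child: 4 samples, impurity 0.375; left child: 1 sample, impurity 0
print(node_importance(5, 5, 0.48, 4, 0.375, 1, 0.0))  # ≈ 0.18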
Advantages and Disadvantages of Random Forest

 One of the greatest benefits of the random forest algorithm is its flexibility: we can use it for regression as well as classification problems.
 It can be considered a handy algorithm because it produces good results even without hyperparameter tuning. Also, the parameters are pretty straightforward: they are easy to understand, and there are not that many of them.
 One of the biggest problems in machine learning is
Overfitting. We need to make a generalized model which can
get good results on the test data too. Random forest helps to
overcome this situation by combining many Decision Trees
which will eventually give us low bias and low variance.
 The main limitation of random forest is that a large number of trees can make the algorithm slow and ineffective for real-time predictions.
 In general, these algorithms are fast to train but quite slow to create predictions once they are trained.
 In most real-world applications, the random forest algorithm
is fast enough but there may be situations where run-time
performance is critical, and alternative approaches would be
more suitable.
Conclusion

 In this presentation, we looked at a very powerful machine learning algorithm. To summarize, we learned about decision trees and random forests, on what basis a tree splits its nodes, and how Random Forest helps us overcome overfitting. We also saw how Random Forest helps with feature selection.
THANK YOU
