Random Forest Algorithm
Suppose you want to purchase a house. Would you just walk into a housing society and buy the very first house you see, or buy one purely on your broker's advice? It's highly unlikely.
You would likely browse a few web portals, checking the area, number of bedrooms, facilities, price, etc. You would also probably ask your friends and colleagues for their opinions. In short, you wouldn't jump directly to a conclusion, but would instead make a decision after considering the opinions of other people as well.
Ensemble techniques work in a similar manner: they combine multiple models. A collection of models, rather than an individual model, is used to make predictions, and this increases the overall performance. Let's understand the two main ensemble methods in Machine Learning:
1. Bagging – Suppose we have a dataset, we train several models on that same dataset, and we combine them. Will that be useful? Probably not: there is a high chance we'll get the same results, since we are giving every model the same input. So instead we use a technique called bootstrapping. Here, we create subsets of the original dataset by sampling with replacement, each subset being the same size as the original set. Because we sample with replacement, each model is very likely to see a different mix of data points (see the sketch after this list).
2. Boosting – Suppose a data point has been incorrectly classified by your 1st model, and then by the next one (probably by all the models). Will combining those predictions give better results? Of course not. Boosting is a sequential process in which each model tries to correct the errors of the previous model, so the succeeding models depend on the previous ones. It combines weak learners into a strong learner by building models sequentially, such that the final model has the highest accuracy. Examples include AdaBoost and XGBoost (see the sketch below).
Random forest works on the bagging principle. Now let's dive into this topic and learn more about how random forest works.
What is the Random Forest Algorithm?
A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data. To build each tree, the algorithm tries to find the Gini index of all the possible splits and chooses for the root node the feature that gives the lowest Gini index, where Gini = 1 − Σᵢ pᵢ² and pᵢ is the proportion of class i in the node.
The lowest Gini index means the lowest impurity.
Entropy
Another metric, called "Entropy", is also used to measure the impurity of a split. The mathematical formula for entropy is:
Entropy = −Σᵢ pᵢ log₂(pᵢ)
where pᵢ is the proportion of samples of class i in the node.
To understand this formula, let's first plot the decision tree for the above dataset.
Here we have two columns, [0] and [1]. To calculate the feature importance of column [0], we need to find those nodes where the split was made on this column [0].