Random Forest Algorithms - Comprehensive Guide With Examples
Introduction
Random Forest is a widely-used machine learning algorithm developed by Leo Breiman and Adele Cutler, which combines
the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it
handles both classification and regression problems. In this article, we will understand how random forest algorithm
works, how it differs from other algorithms and how to use it.
Table of contents
Introduction
What is Random Forest Algorithm?
Real-Life Analogy of Random Forest
Working of Random Forest Algorithm
Important Features of Random Forest
Difference Between Decision Tree and Random Forest
Important Hyperparameters in Random Forest
Coding in Python – Random Forest
Random Forest Algorithm Use Cases
Advantages and Disadvantages of Random Forest Algorithm
Conclusion
Frequently Asked Questions
What is Random Forest Algorithm?
The Random Forest Algorithm's widespread popularity stems from its user-friendly nature and adaptability, enabling it to
tackle both classification and regression problems effectively. The algorithm's strength lies in its ability to handle
complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
One of the most important features of the Random Forest Algorithm is that it can handle data sets containing both
continuous variables, as in regression, and categorical variables, as in classification. It performs well on both kinds of
tasks. In this tutorial, we will understand the working of random forest and implement it on a classification task.
Random forest is an ensemble technique, and ensemble methods come in two main flavors: bagging and boosting.
Bagging
It creates different training subsets from the training data by sampling with replacement, and the final output is based on
majority voting. Random Forest is an example.
Boosting
It combines weak learners into strong learners by creating sequential models such that the final model has the highest
accuracy. AdaBoost and XGBoost are examples.
As mentioned earlier, Random forest works on the Bagging principle. Now let’s dive in and understand bagging in detail.
Bagging
Bagging, also known as Bootstrap Aggregation, serves as the ensemble technique in the Random Forest algorithm. Here
are the steps involved in Bagging:
1. Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.
2. Bootstrap Sampling: Each model is then built from one of these samples, called Bootstrap Samples, which are drawn from
the original data with replacement. This process is known as row sampling.
3. Bootstrapping: This step of row sampling with replacement is referred to as bootstrapping.
4. Independent Model Training: Each model is trained independently on its corresponding Bootstrap Sample, and each
model produces its own result.
5. Majority Voting: The final output is determined by combining the results of all models through majority voting; the
outcome predicted by the most models is selected.
6. Aggregation: This step, which involves combining all the results and generating the final output based on majority
voting, is known as aggregation.
Now let's break this down with the help of the following figure. Bootstrap samples (Bootstrap sample 01, Bootstrap sample
02, and Bootstrap sample 03) are drawn from the actual data with replacement, which means each sample is very likely to
contain repeated rows rather than only unique ones. The models (Model 01, Model 02, and Model 03) obtained from these
bootstrap samples are trained independently, and each generates its own result, as shown. The Happy emoji appears in the
majority of the results compared to the Sad emoji, so by majority voting the final output is the Happy emoji.
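To make these steps concrete, here is a minimal sketch of bagging with three decision trees on a toy dataset generated with scikit-learn (the dataset, the number of models, and all settings are illustrative assumptions, not part of the original example):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification data standing in for the "actual data"
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)

models = []
for i in range(3):
    # Bootstrap sampling: draw rows with replacement (row sampling)
    idx = rng.integers(0, len(X), size=len(X))
    # Independent model training on each bootstrap sample
    models.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Aggregation: majority voting across the three models for every row
votes = np.array([m.predict(X) for m in models])     # shape (3, n_samples)
final_output = (votes.sum(axis=0) >= 2).astype(int)  # majority of the 0/1 votes
print(final_output[:10])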
Boosting
Boosting is another technique that uses the concept of ensemble learning. A boosting algorithm combines multiple simple
models (also known as weak learners or base estimators) to generate the final output by building the weak models in series.
There are several boosting algorithms; AdaBoost was the first really successful one, developed for binary classification.
AdaBoost, short for Adaptive Boosting, is a prevalent boosting technique that combines multiple "weak classifiers" into a
single "strong classifier." There are other boosting techniques as well.
For more, you can visit
4 Boosting Algorithms You Should Know – GBM, XGBoost, LightGBM & CatBoost
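As a quick illustration, the sketch below fits scikit-learn's AdaBoostClassifier, which builds shallow decision trees in sequence (the dataset and parameter values are assumptions for illustration only):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 50 weak learners (decision stumps by default) trained one after another,
# each focusing more on the examples the previous ones got wrong
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X, y)
print(booster.score(X, y))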
Working of Random Forest Algorithm
Step 1: In the Random Forest model, a subset of data points and a subset of features is selected for constructing each
decision tree. Simply put, n random records and m features are taken from a data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is obtained by Majority Voting for classification or Averaging for regression.
For example:
Consider the fruit basket as the data, as shown in the figure below. N samples are taken from the fruit basket, and an
individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure.
The final output is decided by majority voting: the majority of the decision trees predict apple rather than banana, so the
final output is apple.
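A minimal sketch of these four steps using scikit-learn's built-in Iris dataset (the dataset and parameter choices are assumptions, not taken from the fruit-basket example):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each of the 5 trees is grown on a bootstrap sample of the rows and considers
# a random subset of the features (max_features) at every split.
forest = RandomForestClassifier(n_estimators=5, max_features="sqrt", random_state=42)
forest.fit(X, y)

sample = X[:1]
# Step 3: each individual tree produces its own prediction
tree_votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
# Step 4: the forest's prediction is the majority vote over those trees
print("individual tree votes:", tree_votes)
print("forest prediction:   ", forest.predict(sample)[0])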
Difference Between Decision Tree and Random Forest
Decision Tree:
1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control.
2. A single decision tree is faster in computation.
3. When a data set with features is taken as input by a decision tree, it will formulate a set of rules to make predictions.
Random Forest:
1. Random forests are created from subsets of the data, and the final output is based on averaging or majority ranking;
hence the problem of overfitting is taken care of.
2. It is comparatively slower.
3. Random forest randomly selects observations, builds a decision tree, and takes the average result. It doesn't use any set
of formulas.
Thus, random forests are much more successful than single decision trees only if the individual trees are diverse and each
is acceptably accurate on its own.
Important Hyperparameters in Random Forest
Hyperparameters are used in random forests either to increase the predictive power of the model or to make it faster. The
main ones are listed below, followed by a short sketch of how they map to scikit-learn.
n_estimators: Number of trees the algorithm builds before averaging the predictions.
max_features: Maximum number of features the random forest considers when splitting a node.
min_samples_leaf: Minimum number of samples required to be at a leaf node.
criterion: How to split the node in each tree (Gini impurity, entropy, or log loss).
max_leaf_nodes: Maximum number of leaf nodes in each tree.
n_jobs: Tells the engine how many processors it is allowed to use. A value of 1 means it can use only one processor; a
value of -1 means there is no limit.
random_state: Controls the randomness of the sampling. The model will always produce the same results if it has a definite
value of random_state and is given the same hyperparameters and training data.
oob_score: OOB means out-of-bag. It is a random forest cross-validation method in which roughly one-third of the data is
not used to train each tree and is instead used to evaluate its performance. These samples are called out-of-bag samples.
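Here is the sketch showing how these hyperparameters are passed to scikit-learn's RandomForestClassifier; the specific values are illustrative assumptions only:
from sklearn.ensemble import RandomForestClassifier

rf_demo = RandomForestClassifier(
    n_estimators=100,     # number of trees built before aggregating predictions
    max_features="sqrt",  # max features considered when splitting a node
    min_samples_leaf=5,   # minimum samples required at a leaf node
    criterion="gini",     # split quality: "gini", "entropy", or "log_loss"
    max_leaf_nodes=50,    # maximum leaf nodes per tree
    n_jobs=-1,            # use all available processors
    random_state=42,      # makes results reproducible
    oob_score=True,       # evaluate on out-of-bag samples
)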
Coding in Python – Random Forest
Now let's implement random forest on a classification task, using scikit-learn on a heart disease dataset.
1. Importing the required libraries.
2. Loading the dataset.
Python Code:
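The embedded code cell for these first two steps is not reproduced in the extracted text; a minimal sketch, assuming the data is available locally as a CSV file (the file name heart_v2.csv is a placeholder):
# Importing the required libraries
import pandas as pd
import numpy as np

# Loading the dataset (file name is a placeholder; any CSV with a
# 'heart disease' target column works for the steps that follow)
df = pd.read_csv('heart_v2.csv')
df.head()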
3. Putting the feature variables into X and the target variable into y.
# Putting feature variable to X
X = df.drop('heart disease',axis=1)
# Putting response variable to y
y = df['heart disease']
4. Train-test split is performed.
# now let's split the data into train and test sets (70/30 split assumed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
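5. The step that creates classifier_rf is missing from the extracted text; a minimal sketch, assuming a forest of 100 shallow trees with out-of-bag scoring enabled:
from sklearn.ensemble import RandomForestClassifier

# Assumed settings: 100 trees, max depth 5, all processors, OOB evaluation on
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)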
%%time
classifier_rf.fit(X_train, y_train)
6. Let’s do hyperparameter tuning for Random Forest using GridSearchCV and fit the data.
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
'max_depth': [2,3,5,10,20],
'min_samples_leaf': [5,10,20,50,100,200],
'n_estimators': [10,25,30,50,100,200]
}
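The line constructing grid_search does not appear in the extracted code; a minimal sketch, assuming 5-fold cross-validation and accuracy as the scoring metric:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rf, param_grid=params,
                           cv=5, n_jobs=-1, scoring="accuracy", verbose=1)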
%%time
grid_search.fit(X_train, y_train)
grid_search.best_score_
rf_best = grid_search.best_estimator_
rf_best
From hyperparameter tuning, we can fetch the best estimator, as shown. The best set of parameters identified was
max_depth=5, min_samples_leaf=10, and n_estimators=10.
7. Let's visualize individual trees. The trees produced by estimators_[5] and estimators_[7] are different, which shows
that each tree in the forest is built independently of the others.
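The figures of those two trees are not reproduced here; the sketch below shows one way they could be plotted side by side for comparison (plotting choices are assumptions):
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Draw two individual trees from the tuned forest to compare their structure
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
plot_tree(rf_best.estimators_[5], feature_names=list(X.columns), filled=True, ax=axes[0])
plot_tree(rf_best.estimators_[7], feature_names=list(X.columns), filled=True, ax=axes[1])
plt.show()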
8. Now let's sort the features by their importance.
rf_best.feature_importances_
imp_df = pd.DataFrame({
"Varname": X_train.columns,
"Imp": rf_best.feature_importances_
})
imp_df.sort_values(by="Imp", ascending=False)
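To visualize these importances, one option is a simple horizontal bar chart (the plotting code is a sketch, not part of the original article):
import matplotlib.pyplot as plt

imp_sorted = imp_df.sort_values(by="Imp", ascending=True)
plt.barh(imp_sorted["Varname"], imp_sorted["Imp"])
plt.xlabel("Feature importance")
plt.show()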
Random Forest Algorithm Use Cases
For example, in the banking industry it can be used to predict which customers are likely to default on a loan.
Disadvantages
Random forest is considerably more complex than a single decision tree, where a decision can be traced by following the
path down the tree.
Training time is longer than for many other models due to this complexity: whenever the forest makes a prediction, every
decision tree has to generate an output for the given input data.
Conclusion
Random forest is a great choice if you want to build a model quickly and efficiently, and one of its best properties is
that it can handle missing values. It is a high-performing technique widely used across industries for its efficiency, and
it can handle binary, continuous, and categorical data. Overall, random forest is a fast, simple, flexible, and robust
model, though it has some limitations.
Key Takeaways
Random forest algorithm is an ensemble learning technique combining numerous classifiers to enhance a model’s
performance.
Random Forest is a supervised machine-learning algorithm made up of decision trees.
Random Forest is used for both classification and regression problems.
Regression is a type of supervised learning algorithm that learns a function to map from the input features to the target
variable. There are many different types of regression algorithms, such as linear regression, logistic regression, and
decision trees. Each type of regression algorithm makes different assumptions about the relationship between the
features and the target variable.