Module3-Ensemble Learning
Steps:
• Reserve a subset of the dataset as a validation set.
• Train the model on the training dataset.
• Evaluate the model's performance on the validation set. If the model
performs well on the validation set, proceed to the next step; otherwise,
check for issues.
Methods for cross validation
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation set approach: 50–50 Split
Let us consider naively splitting the data into a 50–50 split — that is, 50% of
the data is used for training the model (the training dataset) and the other
50% is withheld for testing the model after training (the testing dataset).
While we do have 50% to test our model on, 50% is still quite a bit that is
withheld from helping train the model.
The testing 50% portion may contain valuable information which our model
will miss out on if we use this split.
This may lead to higher bias, where our predictions or estimates are far from
the actual values.
It also tends to produce an underfitted model.
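A minimal sketch of the 50-50 split described above, using only the standard library. The function name `holdout_split`, the fixed seed, and the toy data are illustrative assumptions, not part of the original material.

```python
import random

def holdout_split(rows, test_fraction=0.5, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # toy dataset of 10 samples
train, test = holdout_split(data, test_fraction=0.5)
print(len(train), len(test))
```

Half of the samples never reach the model during training, which is the cost this approach pays for its simplicity.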
K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into K groups of
samples of equal size. These groups are called folds. In each round, the model
is trained on K-1 folds and the remaining fold is used as the test set. This
is a very popular CV approach because it is easy to understand and its
estimate is generally less biased than that of the other methods.
The steps for k-fold cross-validation are:
Split the input dataset into K groups
For each group:
• Take one group as the reserve or test data set.
• Use remaining groups as the training dataset
• Fit the model on the training set and evaluate the performance of the
model using the test set.
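The steps above can be sketched as follows. This is a simplified illustration: it does not shuffle the data, and the helper names (`k_fold_indices`, `cross_validate`) and the toy "predict the mean" model are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

folds = list(k_fold_indices(10, 5))
cv_score = cross_validate(list(range(10)), [2.0] * 10, 5, fit_mean, mse)
print(len(folds), cv_score)
```

Every sample is used for testing exactly once and for training K-1 times.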
Process Diagram: K-fold Cross-validation
Need for stratified k-fold
Stratified k-fold cross-validation
Suppose your data contains reviews of a cosmetic product used by both men and
women. If we split the data into training and test sets by random sampling, most
of the reviews from men may end up in the test set and be under-represented in
the training set. A model trained on a sample that does not represent the actual
population will not predict the test data accurately. Stratified k-fold addresses
this by keeping the class proportions in every fold the same as in the full dataset.
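One way to build stratified folds is to deal the samples of each class round-robin across the folds. The sketch below assumes a hypothetical label list ("F" and "M" standing in for the two reviewer groups); the function name is illustrative, and real libraries add shuffling and other options.

```python
from collections import defaultdict

def stratified_k_fold_indices(labels, k):
    """Return k test-index lists whose class proportions mirror the full data."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = ["F"] * 8 + ["M"] * 4  # hypothetical 2:1 class ratio
folds = stratified_k_fold_indices(labels, k=4)
for fold in folds:
    print([labels[i] for i in fold])
```

Each fold keeps the 2:1 ratio of the full dataset, so no group is missing from either the training or the test side.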
Boosting
Boosting is an ensemble learning method that combines a set of weak learners
into a strong learner to minimize training errors. In boosting, a random
sample of data is selected and a model is fitted to it; further models are
then trained sequentially, each one trying to compensate for the weaknesses
of its predecessor. With each iteration, the weak rules from the individual
classifiers are combined to form one strong prediction rule.
• We use boosting to combine weak learners with high bias. Boosting aims to
produce a model with a lower bias than that of the individual models.
• Boosting involves sequentially training weak learners. Each subsequent
learner corrects the errors of the previous learners in the sequence.
• A sample of data is first taken from the initial dataset. This sample is used to
train the first model, and the model makes its predictions. Each sample is
predicted either correctly or incorrectly. The samples that are wrongly
predicted are assigned larger weights and are reused for training the next
model. In this way, subsequent models can improve on the errors of previous
models.
• Boosting aggregates the results at each step, using weighted averaging.
Similarities Between Bagging and Boosting
• Both are ensemble techniques that obtain N learners from one base learner.
• Both generate several training data sets through random sampling.
• Both make the final decision by averaging the N learners (or by taking a
majority vote among them).
• Both bagging and boosting are good at reducing variance and offer
better stability.
Bagging | Boosting
The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
Its main aim is to decrease variance, not bias. | Its main aim is to decrease bias, not variance.
Each model receives equal weight. | Models are weighted according to their performance.
Each model is built independently. | Each model is built dependently, influenced by the models built before it.
The training data subsets are drawn from the whole training dataset using random row sampling with replacement. | Each new subset contains the elements that were misclassified by the preceding models.
It tries to solve the overfitting problem. | It tries to reduce bias.
If the classifier is unstable (high variance), apply bagging. | If the classifier is stable and simple (high bias), apply boosting.
The base classifiers are trained in parallel. | The base classifiers are trained sequentially.
Example: the random forest model uses bagging. | Example: AdaBoost uses the boosting technique.
Random Forest
Improving Model Accuracy with Stacking
We use stacking to improve the prediction accuracy of strong learners.
Stacking aims to create a single robust model from multiple heterogeneous
strong learners.
After multiple iterations, we will be able to create the right decision boundary with
the help of all the previous weak learners. As you can see the final model is able to
classify all the points correctly. This final model is known as a strong learner.
Step 1: Assigning Weights
The total error is nothing but the summation of all the sample weights of
misclassified data points.
Here in our dataset, let’s assume there is 1 wrong output out of 5, so our total
error will be 1/5, and the alpha (performance of the stump) will be:
alpha = (1/2) × ln((1 − Total Error) / Total Error) = (1/2) × ln(0.8 / 0.2) ≈ 0.69
A total error of 0 indicates a perfect stump, and 1 indicates a horrible stump. From
the graph above, we can see that when there is no misclassification, we have no error
(Total Error = 0), so the “amount of say” (alpha) will be a large positive number.
When the classifier predicts half right and half wrong, the Total Error is 0.5, and
the importance (amount of say) of the classifier will be 0.
If all the samples have been incorrectly classified, the error will be very high
(close to 1), and hence alpha will be a large negative number.
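The three cases above can be checked numerically. This sketch assumes the standard AdaBoost formula alpha = 0.5 × ln((1 − TE) / TE), which is consistent with the value 0.69 quoted for a total error of 1/5. The clipping constant `eps` is an added assumption that avoids division by zero at TE = 0 or TE = 1.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - TE) / TE), with TE clipped away from 0 and 1."""
    te = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - te) / te)

alpha_one_wrong = amount_of_say(1 / 5)   # one of five samples misclassified
alpha_coin_flip = amount_of_say(0.5)     # half right, half wrong
alpha_perfect = amount_of_say(0.0)       # no misclassification
alpha_all_wrong = amount_of_say(1.0)     # everything misclassified

print(round(alpha_one_wrong, 2), alpha_coin_flip, alpha_perfect, alpha_all_wrong)
```

A perfect stump gets a large positive alpha, a coin-flip stump gets 0, and a stump that is always wrong gets a large negative alpha.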
Step 4: Calculate TE and Performance
If identical weights are maintained for the subsequent model, the output will
mirror what was obtained in the initial model.
The wrong predictions will be given more weight, whereas the weights of the
correct predictions will be decreased. When we build our next model after
updating the weights, more preference will be given to the points with
higher weights.
In the weight update, new sample weight = old weight × e^(±alpha): the amount of
say (alpha) is taken as negative when the sample is correctly classified and as
positive when the sample is misclassified.
There are four correctly classified samples and 1 wrong. Here, the sample
weight of that datapoint is 1/5, and the amount of say/performance of the
stump of Gender is 0.69.
We know that the total sum of the sample weights must be equal to 1, but here if we sum
up all the new sample weights, we will get 0.8004. To bring this sum equal to 1, we will
normalize these weights by dividing all the weights by the total sum of updated weights,
which is 0.8004.
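The update and normalization can be sketched as below, assuming the usual AdaBoost rule of multiplying each weight by e^(+alpha) for a misclassified sample and e^(−alpha) for a correct one. The position of the misclassified sample is an assumption for illustration. With an unrounded alpha the updated weights sum to about 0.8, close to the 0.8004 quoted above, which presumably reflects rounded intermediate values.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Scale weights by e^(+alpha) if wrong, e^(-alpha) if correct, then normalize."""
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return updated, [w / total for w in updated]

weights = [1 / 5] * 5
misclassified = [False, False, False, True, False]  # assumed: 4th sample is wrong
alpha = 0.5 * math.log((1 - 1 / 5) / (1 / 5))        # about 0.69

updated, normalized = update_weights(weights, misclassified, alpha)
print(sum(updated), normalized)
```

After normalization the weights sum to 1 again, and the misclassified sample carries the largest share.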
Step 5: Decrease Errors
For this, we will remove the “sample weights” and “new sample
weights” columns and then, based on the “new sample
weights,” divide our data points into buckets.
Step 6: New Dataset
Now, the algorithm selects random numbers between 0 and 1.
Since incorrectly classified records have higher sample weights, the
probability of selecting those records is very high.
Suppose the 5 random numbers our algorithm draws are
0.38, 0.26, 0.98, 0.40, 0.55.
This comes out to be our new dataset, and we see the data point, which
was wrongly classified, has been selected 3 times because it has a
higher weight.
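The bucket lookup can be sketched as follows. The normalized weights used here (0.125 for each correct sample, 0.5 for the misclassified one) and the position of the misclassified sample (index 3) are assumptions; the original table is not reproduced in this text. Each weight defines a bucket on the interval from 0 to 1, and each random number selects the sample whose bucket contains it.

```python
from bisect import bisect_right
from itertools import accumulate

def resample_by_buckets(normalized_weights, random_numbers):
    """Map each random number in [0, 1] to the index of the bucket it falls in."""
    upper_bounds = list(accumulate(normalized_weights))  # cumulative bucket edges
    last = len(normalized_weights) - 1
    return [min(bisect_right(upper_bounds, r), last) for r in random_numbers]

normalized_weights = [0.125, 0.125, 0.125, 0.5, 0.125]  # assumed; index 3 was misclassified
draws = [0.38, 0.26, 0.98, 0.40, 0.55]                   # the numbers quoted in the text

selected = resample_by_buckets(normalized_weights, draws)
print(selected)  # → [3, 2, 4, 3, 3]
```

Under these assumptions the misclassified sample is drawn three times out of five, because its bucket covers half of the interval.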
Step 7: Repeat Previous Steps
• Assign equal weights to all the data points.
• Find the stump that does the best job classifying the new collection of
samples by finding their Gini Index and selecting the one with the lowest
Gini index.
• Calculate the “Amount of Say” and “Total error” to update the previous
sample weights.
• Normalize the new sample weights.
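Choosing the stump with the lowest Gini index can be sketched like this. The candidate features and their leaf contents are hypothetical, and the Gini index of a split is taken here as the size-weighted average of the Gini impurity of its two leaves.

```python
def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means the group is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_of_split(left_labels, right_labels):
    """Size-weighted average impurity of the two leaves of a stump."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n

def best_stump(candidate_splits):
    """Pick the feature whose split has the lowest Gini index."""
    return min(candidate_splits, key=lambda name: gini_of_split(*candidate_splits[name]))

# Hypothetical splits: feature name -> (labels in left leaf, labels in right leaf)
candidates = {
    "feature_a": (["yes", "yes", "no"], ["no", "yes"]),
    "feature_b": (["yes", "yes", "yes"], ["no", "no"]),
}
chosen = best_stump(candidates)
print(chosen)
```

Here `feature_b` separates the classes perfectly, so its Gini index is 0 and it is selected.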
Gradient boosting relies on the intuition that the best possible next model,
when combined with the previous models, minimizes the overall prediction
error. The key idea is to set the target outcomes for this next model so as
to minimize the error.
The working of the Gradient Boosting Algorithm can be divided into
three major elements:
• Optimizing the loss function
• Building a weak learner to make predictions
• Building an additive model of weak learners to minimize the
loss function
Using three features (customer age, purchase category, and purchase
weight), we want to predict the purchase amount.
Loss Function
The loss function that we use for a model depends on the type of problem we are
solving. The main requirement when selecting a loss function is that it should be
differentiable. Many standard loss functions are available, but one can also define
a custom loss function to suit the problem at hand.
• Calculates the error: Takes the predicted output of the model and compares it to the
ground truth (actual observed values). How it compares, i.e., calculates the difference,
varies from function to function.
• Guides model training: a model’s objective is to minimize the loss function. Throughout
training, the model continually updates its internal parameters to make the loss as
small as possible.
• Evaluation metric: By comparing the loss on training, validation, and test datasets, you
can assess your model’s ability to generalize and avoid overfitting.
The two most common loss functions are:
❑ Mean Squared Error (MSE): This popular loss function for regression measures the average
of the squared differences between predicted and actual values. Gradient boosting often
uses this variation of it:
❑ Cross-entropy: This function measures the difference between two probability
distributions, so it is commonly used for classification tasks where the target has
discrete categories.
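The two losses can be sketched in a few lines. The `half_squared_error` function shows one commonly used variant of squared error in gradient boosting, 0.5 × (observed − predicted)²; whether this is the exact variant the original slide showed is an assumption. The cross-entropy shown is the binary form, with probabilities clipped by a small `eps` to avoid log(0).

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def half_squared_error(y_true, y_pred):
    """Assumed variant: sum of 0.5 * (y - y_hat)^2; its gradient is -(y - y_hat)."""
    return sum(0.5 * (t - p) ** 2 for t, p in zip(y_true, y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])       # (1 + 4) / 2
half_se = half_squared_error([3.0, 5.0], [2.0, 7.0])   # 0.5 * (1 + 4)
bce_good = binary_cross_entropy([1, 0], [0.9, 0.2])    # confident and right
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.8])     # confident and wrong
print(mse, half_se, bce_good, bce_bad)
```

The factor of 0.5 is convenient because it cancels when the loss is differentiated, leaving the plain residual.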
Step 1: Make an initial prediction - Build a base model
Gradient boosting is an algorithm that gradually increases its accuracy. To start the
process, we need an initial guess or prediction. For regression with a squared-error
loss, the initial guess is the average of the target.
Every step of the way, we are searching for a value to find the minimum of the loss
function. In other words, we are looking for a value that makes the derivative
(gradient) of the loss function 0.
When we take the derivative of the squared-error loss for each observed value with
respect to the predicted value, sum the derivatives, set the sum to 0, and solve for
the prediction, we end up with the average of the target.
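A quick numerical check of this claim, assuming the squared-error loss 0.5 × (y − prediction)² and a small hypothetical set of target values: the summed gradient is zero when the prediction is the mean and non-zero elsewhere.

```python
def initial_prediction(targets):
    """Initial guess for squared-error loss: the mean of the targets."""
    return sum(targets) / len(targets)

def squared_loss_gradient_sum(targets, prediction):
    """Sum over samples of d/d(pred) [0.5 * (y - pred)^2] = -(y - pred)."""
    return sum(-(y - prediction) for y in targets)

targets = [120.0, 150.0, 180.0]  # hypothetical purchase amounts
f0 = initial_prediction(targets)

grad_at_mean = squared_loss_gradient_sum(targets, f0)
grad_elsewhere = squared_loss_gradient_sum(targets, f0 + 10)
print(f0, grad_at_mean, grad_elsewhere)
```

Other loss functions lead to a different initial guess, so the mean is specific to squared error.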
Step 2: Calculate the pseudo-residuals
Step 3: Build a weak learner
Next, we will build a decision tree (weak learner) that predicts the residuals
using the three features we have (age, category, purchase weight).
The learning rate in gradient boosting is simply a multiplier between 0
and 1 that scales the prediction of each weak learner. When we add
an arbitrary learning rate of 0.1 into the mix, our prediction becomes
152.75, not the perfect 123.45.
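The effect of the learning rate can be illustrated with a single update. The base prediction (156.0) and the tree's residual prediction (−32.55) below are assumptions, chosen only because they reproduce the two numbers quoted above; the original values are not given in this text.

```python
def boosted_prediction(base_prediction, tree_output, learning_rate):
    """New prediction = previous prediction + learning_rate * predicted residual."""
    return base_prediction + learning_rate * tree_output

base_prediction = 156.0   # assumed initial prediction (mean of the target)
tree_output = -32.55      # assumed residual predicted by the weak learner

unscaled = boosted_prediction(base_prediction, tree_output, learning_rate=1.0)
scaled = boosted_prediction(base_prediction, tree_output, learning_rate=0.1)
print(unscaled, scaled)
```

With a learning rate of 1.0 the prediction jumps straight to the observed value, which tends to overfit; with 0.1 each tree only nudges the prediction, and many trees are needed to close the gap.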
To improve the performance of F1, we could fit a weak learner h2 to the residuals
of F1 and create a new model F2:
F2(x) = F1(x) + h2(x)
This can be done for ‘m’ iterations, until the residuals have been minimized as much
as possible:
Fm(x) = Fm-1(x) + hm(x)
Here, the additive learners do not disturb the functions created in the previous steps.
Instead, they impart information of their own to bring down the errors.
Additive Model
We add the new trees one at a time in the model such that the pre-existing trees
remain unaltered. We tend to follow the gradient descent procedure in order
to minimize the loss while adding the trees.
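A compact sketch of the additive procedure for regression with squared-error loss, using one-feature decision stumps as weak learners. The data, the number of trees, and the learning rate are illustrative assumptions; production libraries use deeper trees and many refinements.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a single feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)  # earlier trees are never modified
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)
    return predict

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # hypothetical single feature
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]  # hypothetical target

model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

Each new stump is fitted to what the existing ensemble still gets wrong, and the final prediction is the sum of the base value and all scaled stump outputs.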
XGBoost
Extreme gradient boosting, or XGBoost, is an implementation of
gradient boosting that’s designed for computational speed and scale.
XGBoost leverages multiple cores on the CPU, allowing learning to
occur in parallel during training.
It is not feasible to check every combination of hyperparameters, since doing so greatly increases training time.
Random Search