Module3-Ensemble Learning

Uploaded by

JADEN JOSEPH
Copyright: © All Rights Reserved

Module 3: Ensemble Learning

Content of the Sub-Module

• Understanding Ensembles
• K-fold cross validation
• Boosting
• Stumping
• XGBoost
• Bagging, Subagging
• Random Forest
• Comparison with Boosting
• Different ways to combine classifiers
Example 1
Example 1: If you are planning to buy an air-conditioner, would you enter a
showroom and buy the air-conditioner that the salesperson shows you? The
answer is probably no. In this day and age, you are likely to ask your
friends, family, and colleagues for an opinion, do research on various
portals about different models, and visit a few review sites before making a
purchase decision. In a nutshell, you would not come to a conclusion
directly. Instead, you would try to make a more informed decision after
considering diverse opinions and reviews. Ensemble learning applies the same
principle.
Example 2
Example 2: Assume that you are developing an app for the travel industry. It is
obvious that before making the app public, you will want to get crucial
feedback on bugs and potential loopholes that are affecting the user
experience. What are your available options for obtaining critical feedback?
1) Soliciting opinions from your parents, spouse, or close friends. 2)
Asking your co-workers who travel regularly and then evaluating their
response. 3) Rolling out your travel and tourism app in beta to gather
feedback from non-biased audiences and the travel community.
Background: Ensemble Learning
Goal of Supervised Learning?
Minimize the probability of model prediction errors on future data.

Two Competing Methodologies

Build one really good model
• Traditional approach
Build many models and average the results
• Ensemble learning

An ensemble method is a machine learning technique that combines several base
models in order to produce one optimal predictive model.
To detect and guard against overfitting, we use a technique called
cross-validation.
Training, Validation and Testing Dataset
The model is neither trained nor tested on the entire dataset.
A model may perform perfectly on the training dataset yet fail on the test
dataset, i.e. in a real-world scenario.
This is a case of overfitting.
Hence there is a need for cross-validation.
CROSS-VALIDATION
Cross-validation (CV) is a technique for evaluating a model during training.
The idea is to train the model on one part of the dataset and then test it on
a held-out part (the validation set) to see how well it does.

Steps:
• Reserve a subset of the dataset as a validation set.
• Train the model on the remaining training dataset.
• Evaluate model performance on the validation set. If the model performs
well on the validation set, proceed to the next step; otherwise, check for
issues.
Methods for cross validation
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation set approach: 50–50 Split
Let us consider naively splitting the data into a 50–50 split — that is, 50% of
the data is used for training the model (the training dataset) and the other
50% is withheld for testing the model after training (the testing dataset).
While we do have 50% to test our model on, 50% is still quite a bit that is
withheld from helping train the model.
The testing 50% portion may contain valuable information which our model
will miss out on if we use this split.
This may lead to higher bias, where our predictions/estimates are far from the
actual values.
It also tends to produce an underfitted model.
K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K groups of
samples of (nearly) equal size. These groups are called folds. In each round,
the model is trained on K-1 folds, and the remaining fold is used as the test
set. This is a very popular CV approach because it is easy to understand, and
its estimate is typically less biased than that of a single train/test split.
The steps for K-fold cross-validation are:
Split the input dataset into K groups
For each group:
• Take one group as the reserve or test data set.
• Use the remaining groups as the training dataset.
• Fit the model on the training set and evaluate the performance of the
model using the test set.
Finally, average the K evaluation scores to obtain the overall estimate.
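The splitting step can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name `k_fold_indices` is hypothetical, and shuffling is omitted for clarity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # the reserved fold
        train = indices[:start] + indices[start + size:]   # the other k-1 folds
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every observation is used for both training and testing across the K rounds.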
Process Diagram: K-fold Cross-Validation
Need for stratified k fold
Stratified k-fold cross-validation
Suppose your data contains reviews for a cosmetic product used by both the male
and female population. When we perform random sampling to split the data into
train and test sets, there is a possibility that most of the data representing males is
not represented in training data but might end up in test data. When we train the
model on sample training data that is not a correct representation of the actual
population, the model will not predict the test data with good accuracy.

This technique is similar to k-fold cross-validation with some changes.

This approach is based on stratification, the process of arranging the data
so that each fold or group is a good representative of the complete dataset.
It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices, where the price of some
houses can be much higher than that of others. A stratified k-fold
cross-validation technique is useful in such situations.
Let’s consider the earlier example, a cosmetic product review dataset of
1000 customers, of which 60% are female and 40% are male. We want to
split the data into train and test sets in an 80:20 proportion. 80% of 1000
customers is 800, chosen in such a way that 480 reviews come from the female
population and 320 from the male population. In a similar fashion, 20% of
1000 customers (200) are chosen for the test data with the same
representation: 120 female and 80 male.
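The 60/40 example can be reproduced with a small sketch. The helper `stratified_split` is a hypothetical name, and the labels are synthetic stand-ins for the review data.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) that preserve each label's proportion."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_label.values():
        rng.shuffle(idx)                           # random choice within each stratum
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

labels = ["female"] * 600 + ["male"] * 400         # synthetic 60/40 population
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Sampling within each group separately is what keeps the 60:40 ratio in both the training and the test set.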
Leave-One-Out-Cross-Validation (LOOCV)
Imagine a situation where we have very few data points, say 10. Then the
50–50 split is even more unfavorable, as leaving out 50% means withholding 5
very influential data points.

Consider LOOCV as an alternative.
In LOOCV, we train the model on the entire dataset, excluding only one data
point.
We will iterate through the entire dataset such that each time we will be
excluding a different data point from the training set.
As such, the testing set will only contain that sole point which changes in each
iteration.
We will then fit the model on the training set and repeat. Essentially, if we
have a total of n data points, the model will be fitted on n-1 data points
each iteration.
Hence, for n samples, we get n different training sets and n test sets. It has
the following features:
• In this approach, the bias is minimal, as nearly all the data points are
used for training.
• The process is executed n times; hence execution time is high.
• This approach leads to high variation in testing the effectiveness of the
model, as each iteration checks against a single data point.
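A minimal sketch of the LOOCV loop follows. It assumes a deliberately trivial "model" that predicts the mean of its training points, and the data values are made up for illustration.

```python
from statistics import mean

def loocv_mse(values):
    """Leave-one-out estimate of squared error for a predict-the-mean model."""
    errors = []
    for i, held_out in enumerate(values):
        training = values[:i] + values[i + 1:]   # the other n - 1 points
        prediction = mean(training)              # "fit" the model on the training set
        errors.append((held_out - prediction) ** 2)
    return mean(errors), len(errors)

data = [2.0, 4.0, 6.0, 8.0]                      # hypothetical observations
mse, n_fits = loocv_mse(data)
print(n_fits, mse)
```

The model is fitted once per data point, which is why the execution time grows with n.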
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. That is, if
there are n data points in the original input dataset, then n-p data points
are used as the training dataset and the p data points as the validation
set. This process is repeated for every possible combination of p points, and
the average error is calculated to assess the effectiveness of the model.
Here p is chosen by the practitioner; with p = 1 the method reduces to LOOCV.
The disadvantage of this technique is that it becomes computationally
expensive as n and p grow, because the number of combinations rises rapidly.
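The cost can be made concrete by enumerating the splits. The sketch below assumes the exhaustive form of leave-p-out, in which every combination of p held-out points is used; `leave_p_out` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb

def leave_p_out(n_samples, p):
    """Yield (train_indices, validation_indices) for every choice of p held-out points."""
    all_idx = set(range(n_samples))
    for held_out in combinations(range(n_samples), p):
        yield sorted(all_idx - set(held_out)), list(held_out)

splits = list(leave_p_out(5, 2))
print(len(splits), comb(5, 2))   # number of splits for n = 5, p = 2
print(comb(100, 5))              # number of model fits needed for n = 100, p = 5
```

Even a modest dataset of 100 points with p = 5 already requires tens of millions of model fits.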
Holdout Method
This method is the simplest cross-validation technique of all. In this
method, we set aside a subset of the data, train the model on the rest of the
dataset, and use the held-out subset to obtain prediction results.
The error on the held-out subset indicates how well our model will perform
on unseen data. Although this approach is simple to perform, it
still suffers from high variance and sometimes produces misleading results.
The data split happens randomly, and we can’t be sure which data ends up in
the train and test buckets unless we fix the random seed (for example, by
specifying random_state). This can lead to high variance: every time the
split changes, the accuracy also changes.
Time Series Cross-Validation
For time-dependent data, it uses a series of temporally ordered training and
testing sets, preventing the use of future data for training.
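One common way to realize this is an expanding training window. The sketch below is one assumed scheme, not the only one, and the function name and parameters are illustrative.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(test_start))                          # everything before the test block
        test = list(range(test_start, test_start + test_size))   # the next block in time
        yield train, test

ts_splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in ts_splits:
    print("train:", train, "test:", test)
```

No index in a training set is ever later than an index in the matching test set, so future data never leaks into training.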
Shuffle-Split Cross-Validation
Randomly shuffles the data, and then splits it into training and testing sets
multiple times.
Group K-Fold Cross-Validation
Useful when the data contains groups, like multiple samples from the same
subject, and ensures all samples from a group are kept together in the same
fold.
Comparison of Cross-validation to
train/test split in Machine Learning
Train/test split: The input data is divided into two parts, a training set
and a test set, in a ratio such as 70:30 or 80:20. The resulting performance
estimate has high variance, which is one of its biggest disadvantages.
• Training Data: The training data is used to train the model, and the
dependent variable is known.
• Test Data: The test data is used to make predictions with the model
that has already been trained on the training data. It has the same features
as the training data but is not part of it.
Cross-validation: It is used to overcome the disadvantage of the train/test
split by splitting the dataset into several train/test splits and averaging
the results. It can be used to tune a model for the best performance. It
makes more efficient use of the data than a single train/test split, as every
observation is used for both training and testing.
Limitations of Cross-Validation
• Under ideal conditions, it provides a reliable estimate. But with
inconsistent data, the results may be misleading. This is one of the
disadvantages of cross-validation, as there is no certainty about the kind of
data encountered in machine learning.
• In predictive modeling, the data evolves over time, which can create
differences between the training and validation sets. For example, if
we create a model to predict stock market values and train it on
the previous 5 years of stock values, the actual values
for the next 5 years may be drastically different, so it is difficult to
expect correct output in such situations.
Applications of Cross-Validation
• This technique can be used to compare the performance of different
predictive modeling methods.
• It has great scope in the medical research field.
• It can also be used for the meta-analysis, as it is already being used by the
data scientists in the field of medical statistics.
Types of Ensemble Methods
Voting or averaging
Boosting
Bootstrap aggregation (Bagging)
Random Forest
Stacked generalization (Blending)
Voting (Averaging)
[Figure: majority voting]
Boosting
Boosting is an ensemble learning method that combines a set of weak learners
into a strong learner to minimize training errors. In boosting, a random
sample of data is selected, fitted with a model and then trained
sequentially—that is, each model tries to compensate for the weaknesses of
its predecessor. With each iteration, the weak rules from each individual
classifier are combined to form one, strong prediction rule.
• We use boosting for combining weak learners with high bias. Boosting aims to
produce a model with a lower bias than that of the individual models.
• Boosting involves sequentially training weak learners. Here, each subsequent
learner corrects the errors of previous learners in the sequence.
• A sample of data is first taken from the initial dataset. This sample is used to
train the first model, and the model makes its predictions. The samples can
either be correctly or incorrectly predicted. The samples that are wrongly
predicted are assigned more weight and are reused for training the next
model. In this way, subsequent models can improve on the errors of previous
models.
• Boosting aggregates the results at each step, using weighted averaging.

Weighted averaging gives each model a weight that depends on its
predictive power: the model with the highest predictive power receives the
most weight, because it is considered the most important.
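Weighted averaging can be sketched in a few lines. The predictions and weights below are hypothetical numbers chosen only to show the arithmetic.

```python
def weighted_average(predictions, weights):
    """Combine numeric predictions, giving stronger models more influence."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Hypothetical outputs of three models and weights reflecting their predictive power.
model_predictions = [10.0, 12.0, 20.0]
model_weights = [0.5, 0.3, 0.2]
combined = weighted_average(model_predictions, model_weights)
print(combined)
```

The combined value sits closest to the prediction of the most heavily weighted model.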
Boosting works with the following steps:
1. We sample m subsets from an initial training dataset.
2. Using the first subset, we train the first weak learner.
3. We test the trained weak learner using the training data. As a result of the
testing, some data points will be incorrectly predicted.
4. Each data point with the wrong prediction is sent into the second subset of
data, and this subset is updated.
5. Using this updated subset, we train and test the second weak learner.
6. We continue with the following subsets until the total number of subsets is
reached.
7. We now have the total prediction. The overall prediction has already been
aggregated at each step, so there is no need to calculate it.
Bootstrap Aggregation (Bagging)
Reducing Variance with Bagging
We use bagging for combining weak learners of high variance. Bagging aims to
produce a model with lower variance than the individual weak models. These
weak learners are homogeneous, meaning they are of the same type.
Bagging is also known as Bootstrap aggregating. It consists of two steps:
bootstrapping and aggregation.
Bootstrapping
Bootstrapping involves resampling subsets of data with replacement from an initial dataset. In other
words, subsets of data are taken from the initial dataset. These subsets of data are
called bootstrapped datasets or, simply, bootstraps. Resampled ‘with replacement’
means an individual data point can be sampled multiple times. Each bootstrap
dataset is used to train a weak learner.
Aggregating
The individual weak learners are trained independently from each other. Each learner
makes independent predictions. The results of those predictions are aggregated at
the end to get the overall prediction. The predictions are aggregated using either
max voting or averaging.
Max Voting
It is commonly used for classification problems and consists of taking the
mode of the predictions (the most frequent prediction). It is called voting
because, as in election voting, the premise is that ‘the majority rules’.
Each model makes a prediction. A prediction from each model counts as a
single ‘vote’. The most occurring ‘vote’ is chosen as the representative for
the combined model.
Averaging
It is generally used for regression problems. It involves taking the average of
the predictions. The resulting average is used as the overall prediction for
the combined model.
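Both aggregation rules are short enough to sketch directly. The function names and sample predictions are illustrative only.

```python
from collections import Counter
from statistics import mean

def max_vote(predictions):
    """Return the most frequent class label (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Return the arithmetic mean of numeric predictions."""
    return mean(predictions)

print(max_vote(["spam", "ham", "spam"]))   # classification: take the mode
print(average([3.0, 4.0, 5.0]))            # regression: take the mean
```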
The steps of bagging are as follows:
1. We have an initial training dataset containing n-number of instances.
2. We create m subsets of data from the training set. Each subset contains N
sample points drawn from the initial dataset with replacement. This means
that a specific data point can be sampled more than once.
3. For each subset of data, we train the corresponding weak learners
independently. These models are homogeneous, meaning that they are of
the same type.
4. Each model makes a prediction.
5. The predictions are aggregated into a single prediction. For this, either
max voting or averaging is used.
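The five steps can be sketched end to end with a toy weak learner. This is an illustrative sketch under stated assumptions: the weak learner is a one-feature threshold stump, the data are made up, and all names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold t (predict 1 when x > t) with the fewest training errors."""
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in points}):
        errors = sum((x > t) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def bagging_fit(points, n_models, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) samples with replacement.
        bootstrap = [rng.choice(points) for _ in points]
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Aggregate the individual stump predictions by max voting."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]

# Toy (feature, label) pairs: small values are class 0, large values class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
stumps = bagging_fit(data, n_models=25)
print(bagging_predict(stumps, 1.5), bagging_predict(stumps, 8.5))
```

Because each stump is trained independently of the others, the loop in `bagging_fit` could run in parallel.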
The main difference between bagging and boosting is the way the learners are
trained.
In bagging, the weak learners are trained in parallel. In boosting, they are
trained sequentially.

Similarities
• Both are ensemble techniques that obtain N learners from one base learner.
• Both generate several training data sets through random sampling.
• Both make the final decision by averaging the N learners (or by taking the
majority vote among them).
• Both are good at reducing variance and offer better stability.
Bagging | Boosting
The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
The main aim is to decrease variance, not bias. | The main aim is to decrease bias, not variance.
Each model receives equal weight. | Models are weighted according to their performance.
Each model is built independently. | Each new model depends on the performance of the models built before it.
Training subsets are drawn from the whole training dataset by random row sampling with replacement. | Each new subset contains the elements that were misclassified by preceding models.
It tries to solve the overfitting problem. | It tries to reduce bias.
If the classifier is unstable (high variance), apply bagging. | If the classifier is stable and simple (high bias), apply boosting.
The base classifiers are trained in parallel. | The base classifiers are trained sequentially.
Example: the random forest model. | Example: AdaBoost.
Random Forest
Improving Model Accuracy with Stacking
We use stacking to improve the prediction accuracy of strong learners.
Stacking aims to create a single robust model from multiple heterogeneous
strong learners.

Stacking differs from bagging and boosting in that:
• It combines strong learners.
• It combines heterogeneous models.
• It consists of creating a metamodel. A metamodel is a model trained on
a new dataset built from the outputs of the base models.
The steps of Stacking are as follows:
1. We use the initial training data to train m algorithms.
2. Using the output of each algorithm, we create a new training set.
3. Using the new training set, we create a meta-model algorithm.
4. Using the results of the meta-model, we make the final prediction. The
results are combined using weighted averaging.
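The steps above can be sketched with two small heterogeneous regressors and a very simple meta-model. Everything here is an assumption for illustration: the data are made up, the base models are a least-squares line and a 1-nearest-neighbour regressor, and the meta-model is a single blending weight fitted by least squares on held-out predictions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Base model 1: ordinary least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Base model 2: 1-nearest-neighbour regressor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_blend(a_preds, b_preds, ys):
    """Meta-model: the weight w minimising sum((y - (w*a + (1-w)*b))^2)."""
    num = sum((a - b) * (y - b) for a, b, y in zip(a_preds, b_preds, ys))
    den = sum((a - b) ** 2 for a, b in zip(a_preds, b_preds))
    return num / den if den else 0.5

# Hypothetical data, split into a part for the base models and a part for the meta-model.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]
base_x, base_y = xs[::2], ys[::2]
meta_x, meta_y = xs[1::2], ys[1::2]

line, nearest = fit_line(base_x, base_y), fit_nearest(base_x, base_y)
# The base-model outputs on the held-out part form the new training set.
w = fit_blend([line(x) for x in meta_x], [nearest(x) for x in meta_x], meta_y)

def stacked_predict(x):
    """Final prediction: weighted average of the two base models."""
    return w * line(x) + (1 - w) * nearest(x)

print(w, stacked_predict(2.5))
```

Fitting the weight on data the base models have not seen is a design choice that keeps the meta-model from simply rewarding whichever base model memorized the training set.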
Monitoring Ensemble Learning Models
Ensemble learning improves a model’s performance in mainly three ways:
• By reducing the variance of weak learners
• By reducing the bias of weak learners
• By improving the overall accuracy of strong learners

Bagging is used to reduce the variance of weak learners.
Boosting is used to reduce the bias of weak learners.
Stacking is used to improve the overall accuracy of strong learners.
AdaBoost
Adaptive boosting or AdaBoost: This method operates iteratively, identifying
misclassified data points and adjusting their weights to minimize the
training error. The model continues to optimize in a sequential fashion until
it yields the strongest predictor.
AdaBoost is a technique in Machine Learning used as an Ensemble Method. The
most common estimator used with AdaBoost is a decision tree with one
level, that is, a decision tree with only 1 split. These trees are also
called Decision Stumps.
Flowchart
The algorithm builds a model and gives equal weights to
all the data points. It then assigns higher weights to points that are wrongly
classified. All the points with higher weights are given more
importance in the next model. It keeps training models until a sufficiently
low error is reached.
Algorithm
1. Build a model and make predictions.
2. Assign higher weights to misclassified points.
3. Build the next model.
4. Repeat steps 2 and 3.
5. Make a final model using the weighted average of individual models.
AdaBoost

After multiple iterations, we will be able to create the right decision boundary with
the help of all the previous weak learners. The final model is then able to
classify all the points correctly. This final model is known as a strong learner.
Step 1: Assigning Weights

The formula to calculate the sample weights is:

$w_i = \frac{1}{N}$

where N is the total number of data points.

Here, since we have 5 data points, the sample weight assigned to each
will be 1/5.
Step 2: Classify the Samples
We start by seeing how well “Gender” classifies the samples and will see how
the variables (Age, Income) classify the samples.
We’ll create a decision stump for each of the features and then calculate
the Gini Index of each tree. The tree with the lowest Gini Index will be our
first stump.
Here in our dataset, let’s say Gender has the lowest Gini Index, so it will be our
first stump.
Step 3: Calculate the Influence
We’ll now calculate the “Amount of Say” or “Importance” or “Influence” for
this classifier in classifying the data points using this formula:

$\alpha = \frac{1}{2} \ln\left(\frac{1 - \text{Total Error}}{\text{Total Error}}\right)$

The total error is the sum of the sample weights of the
misclassified data points.
Here in our dataset, let’s assume there is 1 wrong output, so our total error will
be 1/5, and the alpha (performance of the stump) will be:

$\alpha = \frac{1}{2} \ln\left(\frac{1 - 1/5}{1/5}\right) = \frac{1}{2} \ln 4 \approx 0.69$

The total error ranges from 0 to 1: 0 indicates a perfect stump, and 1 indicates a
horrible stump. When there is no misclassification, we have no error (Total Error = 0),
so the “amount of say (alpha)” will be a large positive number.
When the classifier predicts half right and half wrong, the Total Error = 0.5, and
the importance (amount of say) of the classifier will be 0.
If all the samples have been incorrectly classified, the error will be very high
(close to 1), and hence our alpha value will be a large negative number.
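The amount of say for this example can be checked numerically. The sketch assumes the standard formula alpha = 0.5 * ln((1 - TE) / TE); the small `eps` clamp is an added safeguard against division by zero and is not part of the slides.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - TE) / TE); eps keeps TE away from exactly 0 or 1."""
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)

alpha = amount_of_say(1 / 5)          # one misclassified point out of five
print(round(alpha, 2))
print(amount_of_say(0.5))             # half right, half wrong
print(amount_of_say(0.9) < 0)         # mostly wrong: negative amount of say
```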
Step 4: Calculate TE and Performance
If identical weights are maintained for the subsequent model, the output will
mirror what was obtained in the initial model.
The wrong predictions will be given more weight, whereas the weights of the
correct predictions will be decreased. When we build our next model
after updating the weights, more preference will be given to the points with
higher weights. The update formula is:

$\text{new sample weight} = \text{old weight} \times e^{\pm \alpha}$

The exponent is negative (-alpha) when the sample is correctly classified.
The exponent is positive (+alpha) when the sample is misclassified.
There are four correctly classified samples and 1 wrong. Here, the sample
weight of each data point is 1/5, and the amount of say/performance of the
Gender stump is 0.69.
We know that the total sum of the sample weights must be equal to 1, but here if we sum
up all the new sample weights, we get about 0.8004 (using rounded intermediate
values). To bring this sum to 1, we normalize the weights by dividing each of
them by the total sum of updated weights, 0.8004.
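The update and normalization can be reproduced as follows. Which of the five samples is misclassified is not recoverable from the text, so the sketch assumes it is the fourth. Without rounding the intermediate values, the sum of the updated weights comes out at 0.8 rather than 0.8004.

```python
import math

alpha = 0.5 * math.log((1 - 0.2) / 0.2)              # amount of say for total error 1/5
old_weights = [0.2] * 5
misclassified = [False, False, False, True, False]   # assumption: the 4th sample is wrong

# Increase the weight of the misclassified sample, decrease the others.
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(old_weights, misclassified)]
total = sum(updated)
normalized = [w / total for w in updated]

print([round(w, 4) for w in updated], round(total, 4))
print([round(w, 4) for w in normalized])
```

After normalization the misclassified sample carries half of the total weight, and each correctly classified sample carries one eighth.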
Step 5: Decrease Errors

For this, we will remove the “sample weights” and “new sample
weights” columns and then, based on the “new sample
weights,” divide our data points into buckets.
Step 6: New Dataset

The algorithm now selects random numbers between 0 and 1.
Since incorrectly classified records have higher sample weights, the
probability of selecting those records is very high.
Suppose the 5 random numbers our algorithm draws are
0.38, 0.26, 0.98, 0.40, 0.55.

This gives our new dataset, and we see that the data point which
was wrongly classified has been selected 3 times because it has a
higher weight.
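The bucket lookup can be sketched with cumulative weights. The random draws are the ones quoted in the text; the normalized weights and the position of the misclassified sample (the fourth row) are assumptions, since the original table is not available.

```python
from bisect import bisect_right
from itertools import accumulate

# Assumed normalized weights: the 4th row is the misclassified one.
normalized_weights = [0.125, 0.125, 0.125, 0.5, 0.125]
# Upper edge of each bucket: 0.125, 0.25, 0.375, 0.875, 1.0
upper_bounds = list(accumulate(normalized_weights))

random_draws = [0.38, 0.26, 0.98, 0.40, 0.55]   # the draws quoted in the text
# Each draw selects the row whose bucket contains it (0-based row index).
selected_rows = [bisect_right(upper_bounds, r) for r in random_draws]
print(selected_rows)
```

Under these assumptions, three of the five draws fall into the wide bucket of the misclassified row, matching the count stated in the text.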
Step 7: Repeat Previous Steps
• Assign equal weights to all the data points.
• Find the stump that does the best job classifying the new collection of
samples by finding their Gini Index and selecting the one with the lowest
Gini index.
• Calculate the “Amount of Say” and “Total error” to update the previous
sample weights.
• Normalize the new sample weights.

Suppose, with respect to our dataset, we have constructed 3 decision trees
(DT1, DT2, DT3) in a sequential manner. If we send our test data now, it
will pass through all the decision trees, and finally, we will see which class
has the majority, and based on that, we will make predictions
for our test dataset.
Gradient Descent Algorithm
Gradient descent is one of the most commonly used iterative
optimization algorithms for training machine learning and deep learning
models. It helps in finding a local minimum of a function.
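A minimal sketch of the idea on a one-dimensional function follows. The function f(x) = (x - 3)^2, the learning rate, and the step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Each step moves x a fraction of the way toward the point where the gradient is zero, so the iterate approaches 3.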
Gradient Boosting
Gradient boosting: It works by sequentially adding predictors to an ensemble
with each one correcting for the errors of its predecessor. However, instead
of changing weights of data points like AdaBoost, the gradient
boosting trains on the residual errors of the previous predictor. The name,
gradient boosting, is used since it combines the gradient descent algorithm
and boosting method.

It relies on the intuition that the best possible next model, when combined
with previous models, minimizes the overall prediction error. The key
idea is to set the target outcomes for this next model in order to minimize
the error.
The working of the Gradient Boosting Algorithm can be divided into
three major elements:
• Optimizing the loss function
• Building a weak learner for predictions
• Developing an additive model of weak learners to minimize the
loss function
Using three features (customer age, purchase category, and purchase
weight), we want to predict the purchase amount.
Loss Function
The loss function that we use for a model depends on the type of algorithm that we are
using. The main requirement when selecting a loss function is that it should be
differentiable. There are many standard functions available, but one can define
one’s own loss function depending on the type of problem being tackled.
• Calculates the error: Takes the predicted output of the model and compares it to the
ground truth (actual observed values). How it compares, i.e., calculates the difference,
varies from function to function.
• Guides model training: A model’s objective is to minimize the loss function. Throughout
training, the model continually updates its internal parameters to
make the loss as small as possible.
• Evaluation metric: By comparing the loss on training, validation, and test datasets, you
can assess your model’s ability to generalize and avoid overfitting.
The two most common loss functions are:
❑ Mean Squared Error (MSE): This popular loss function for regression measures the mean
of the squared differences between predicted and actual values. Gradient boosting often
uses a variation of it, the half squared error, whose derivative with respect to the
prediction is simply the negative residual:
$L(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$
❑ Cross-entropy: This function measures the difference between two probability
distributions, so it is commonly used for classification tasks where the target has
discrete categories.
Step 1: Make an initial prediction (build a base model)

$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

Here, L is our loss function,
gamma is our predicted value, and
arg min means we have to find the predicted value (gamma) for which the loss
function is minimum.

Gradient boosting is an algorithm that gradually increases its accuracy. To start the
process, we need an initial guess or prediction. For the squared-error loss, the
initial guess is the average of the target.

Every step of the way, we are searching for a value that minimizes the loss
function. In other words, we are looking for a value that makes the derivative
(gradient) of the loss function 0.
When we take the derivative of the loss function for each observed value with
respect to the predicted value, sum the derivatives, and set the sum to 0, we
end up with the average of the target.
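The claim that the initial prediction is the average of the target can be written out as a short derivation. It assumes the half squared-error loss $L(y_i, \gamma) = \frac{1}{2}(y_i - \gamma)^2$ and n observations; other loss functions lead to a different initial value.

```latex
\begin{aligned}
F_0 &= \arg\min_{\gamma} \sum_{i=1}^{n} \tfrac{1}{2}\,(y_i - \gamma)^2 \\
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \tfrac{1}{2}\,(y_i - \gamma)^2
    &= -\sum_{i=1}^{n} (y_i - \gamma) = 0 \\
\gamma &= \frac{1}{n} \sum_{i=1}^{n} y_i
\end{aligned}
```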
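Step 2: Calculate the pseudo-residuals
The pseudo-residuals are the differences between the actual purchase amounts and
the current predictions: $r_i = y_i - F_0(x_i)$.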
Step 3: Build a weak learner
Next, we will build a decision tree (weak learner) that predicts the residuals
using the three features we have (age, category, purchase weight).
The learning rate in gradient boosting is simply a multiplier between 0
and 1 that scales the prediction of each weak learner. When we add
an arbitrary learning rate of 0.1 into the mix, our prediction becomes
152.75, not the perfect 123.45.

Let’s predict on the second row as well:

We run the row through the tree and get 146.08 as a prediction. We continue in
this fashion for all rows until we have four predictions for four rows:
152.75, 146.08, 174.945, 150.2. Let’s add them as a new column for now.

Next, we find the new pseudo-residuals by subtracting the new predictions
from the purchase amount. Let’s add them as a new column to the table.
Our new pseudo-residuals are smaller, which means our loss is going
down.
Step 4: Iterate
In the next steps, we iterate on step 3, i.e. build more weak learners. The only
thing to remember is that we have to keep adding the scaled prediction of each
new tree to the previous prediction to generate the next one.
The gradient boosting ensemble technique consists of three simple steps:
• An initial model F0 is defined to predict the target variable y. This model will be
associated with a residual (y – F0).
• A new model h1 is fit to the residuals from the previous step.
• Now, F0 and h1 are combined to give F1, the boosted version of F0. The mean
squared error from F1 will be lower than that from F0:

$F_1(x) = F_0(x) + h_1(x)$

To improve the performance of F1, we could model the residuals of F1 and
create a new model F2:

$F_2(x) = F_1(x) + h_2(x)$

This can be done for ‘m’ iterations, until the residuals have been minimized as much as
possible:

$F_m(x) = F_{m-1}(x) + h_m(x)$

Here, the additive learners do not disturb the functions created in the previous steps.
Instead, they impart information of their own to bring down the errors.
Additive Model
We add the new trees one at a time in the model such that the pre-existing trees
remain unaltered. We tend to follow the gradient descent procedure in order
to minimize the loss while adding the trees.
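The additive procedure can be sketched from scratch for the squared-error loss. This is a simplified illustration, not a production algorithm: the weak learner is a one-split regression stump on a single feature, the data are hypothetical, and all names are made up for the example.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Weak learner: one split on x, predicting the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:               # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.1):
    f0 = mean(ys)                                # initial prediction F0
    trees, history = [], []
    preds = [f0] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # pseudo-residuals for squared loss
        history.append(mean(r ** 2 for r in residuals))  # mean squared error before this tree
        tree = fit_stump(xs, residuals)
        trees.append(tree)                       # earlier trees stay unaltered
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    predict = lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)
    return predict, history

# Hypothetical one-feature data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
predict, history = gradient_boost(xs, ys)
print(round(history[0], 2), round(history[-1], 2))
print(round(predict(1), 2), round(predict(6), 2))
```

The recorded error never increases from one round to the next, because each stump is a least-squares fit to the current residuals and only a fraction of it is added.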
XGBoost
Extreme gradient boosting or XGBoost: XGBoost is an implementation of
gradient boosting that’s designed for computational speed and scale.
XGBoost leverages multiple cores on the CPU, allowing for learning to
occur in parallel during training.

XGBoost is an optimized distributed gradient boosting library designed for
efficient and scalable training of machine learning models.
XGBoost: Unique Features
• Regularization: XGBoost has an option to penalize complex models through both L1
(absolute values) and L2 (squares) regularization. Regularization helps in preventing
overfitting.
• Handling sparse data: Missing values or data processing steps like one-hot encoding
make data sparse. XGBoost incorporates a sparsity-aware split finding algorithm to
handle different types of sparsity patterns in the data.
• Weighted quantile sketch: Most existing tree-based algorithms can find the split
points when the data points are of equal weights (using a quantile sketch algorithm).
However, they are not equipped to handle weighted data. XGBoost has a distributed
weighted quantile sketch algorithm to effectively handle weighted data.
• Block structure for parallel learning: For faster computing, XGBoost can make use
of multiple cores on the CPU. This is possible because of a block structure in its
system design. Data is sorted and stored in in-memory units called blocks. Unlike
in other algorithms, this enables the data layout to be reused by subsequent iterations
instead of being computed again. This feature is also useful for steps like split
finding and column sub-sampling.
• Cache awareness: In XGBoost, non-continuous memory access is required to get the
gradient statistics by row index. Hence, XGBoost has been designed to make optimal
use of hardware. This is done by allocating internal buffers in each thread, where the
gradient statistics can be stored.
• Out-of-core computing: This feature optimizes the available disk space and
maximizes its usage when handling huge datasets that do not fit into memory.
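The regularization feature can be made concrete by writing out the regularized objective in its usual form. The symbols are not from the slides and are introduced here as assumptions: $l$ is the training loss, $f_k$ is the k-th tree, $T$ is its number of leaves, $w_j$ are its leaf weights, $\gamma$ penalizes the number of leaves, $\lambda$ is the L2 strength, and $\alpha$ is the L1 strength.

```latex
\text{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

Larger values of $\lambda$ or $\alpha$ shrink the leaf weights, which makes individual trees more conservative.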
Benefits of XGBoost
• High accuracy: XGBoost is known for its accuracy and has been shown to
outperform other machine learning algorithms in many predictive modeling
tasks.
• Scalability: XGBoost is highly scalable and can handle large datasets with
millions of rows and columns.
• Efficiency: XGBoost is designed to be computationally efficient and can
quickly train models on large datasets.
• Flexibility: XGBoost supports a variety of data types and objectives,
including regression, classification, and ranking problems.
• Regularization: XGBoost incorporates regularization techniques to avoid
overfitting and improve generalization performance.
• Interpretability: XGBoost provides feature importance scores that can help
users understand which features are most important for making predictions.
• Open-source: XGBoost is an open-source library that is widely used and
supported by the data science community.
Challenges of boosting
• Overfitting: There’s some dispute in the research around whether or not
boosting can help reduce overfitting or exacerbate it. We include it under
challenges because in the instances that it does occur, predictions cannot be
generalized to new datasets.
• Intense computation: Sequential training in boosting is hard to scale up.
Since each estimator is built on its predecessors, boosting models can be
computationally expensive, although XGBoost seeks to address scalability
issues seen in other types of boosting methods. Boosting algorithms can also
be slower to train than bagging, and their large number of parameters
can strongly influence the behavior of the model.
Applications of Boosting
• Healthcare: Boosting is used to lower errors in medical data predictions, such as
predicting cardiovascular risk factors and cancer patient survival rates. For
example, research shows that ensemble methods significantly improve the
accuracy in identifying patients who could benefit from preventive treatment of
cardiovascular disease, while avoiding unnecessary treatment of others.
Likewise, another study found that applying boosting to multiple genomics
platforms can improve the prediction of cancer survival time.
• IT: Gradient boosted regression trees are used in search engines for page
rankings, while the Viola-Jones boosting algorithm is used for image retrieval. As
noted by Cornell, boosted classifiers allow for the computations to be stopped
sooner when it’s clear in which way a prediction is headed. This means that a
search engine can stop the evaluation of lower ranked pages, while image
scanners will only consider images that actually contain the desired object.
• Finance: Boosting is used with deep learning models to automate critical tasks,
including fraud detection, pricing analysis, and more. For example, boosting
methods in credit card fraud detection and financial products pricing
analysis improve the accuracy of analyzing massive data sets to minimize
financial losses.
Benefits of Boosting
• Ease of Implementation: Boosting can be used with several
hyper-parameter tuning options to improve fitting. Little data preprocessing is
required, and boosting algorithms like XGBoost have built-in routines to handle
missing data. In Python, the scikit-learn library of ensemble methods (also
known as sklearn.ensemble) makes it easy to implement popular boosting
methods such as AdaBoost and gradient boosting. XGBoost is available as a
separate library (xgboost) with a scikit-learn-compatible interface.
• Reduction of bias: Boosting algorithms combine multiple weak learners in
a sequential method, iteratively improving upon observations. This
approach can help to reduce high bias, commonly seen in shallow decision
trees and logistic regression models.
• Computational Efficiency: Since boosting algorithms only select features
that increase their predictive power during training, they can help to reduce
dimensionality as well as increase computational efficiency.
Subagging
Subagging is a combination of ‘subsample’ and ‘bagging’. It rests on the fairly
obvious idea that you don’t need to produce samples that are the same size as
the original data.
If you make smaller datasets, then it makes sense to sample without
replacement. Otherwise the implementation is only very slightly
different from the bagging one, except that in NumPy you use
np.random.shuffle() to produce the samples.
It is common to use a dataset size that is half that of the original data, and the
results of this can often be comparable to a full bagging simulation.
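The sampling step described above can be sketched as follows. This is a minimal illustration, not the slides' own implementation: it uses the standard library's random.shuffle in the role the text gives to np.random.shuffle(), and the names subagging_indices and majority_vote are hypothetical helpers introduced here.

```python
import random
from collections import Counter


def subagging_indices(n_samples, n_models, fraction=0.5, seed=0):
    """Return one list of row indices per model, drawn WITHOUT replacement."""
    rng = random.Random(seed)
    subset_size = max(1, int(n_samples * fraction))
    subsets = []
    for _ in range(n_models):
        indices = list(range(n_samples))
        rng.shuffle(indices)                   # plays the role of np.random.shuffle()
        subsets.append(indices[:subset_size])  # a prefix of a shuffle has no repeats
    return subsets


def majority_vote(labels):
    """Combine the predictions of the individual models by plurality."""
    return Counter(labels).most_common(1)[0][0]


# Three half-size subsamples of a 10-row dataset (assumed sizes, for illustration).
subsets = subagging_indices(n_samples=10, n_models=3)
```

Each index list would be used to train one base model, and the models' predictions would then be combined, for example with majority_vote. Ordinary bagging differs only in drawing full-size samples with replacement.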
Hyperparameters
Hyperparameters are the parameters that are explicitly set by the user to
control the learning process. Examples include:
• Learning rate for training a neural network
• Train-test split ratio
• Batch Size
• Number of Epochs
• Branches in Decision Tree
• Number of clusters in Clustering Algorithm
• The k in kNN or K-Nearest Neighbour algorithm
Model Hyperparameters
Parameters that are explicitly defined by the user to control the learning
process. Some key points for model hyperparameters are as follows:
• These are usually defined manually by the machine learning engineer.
• One cannot know the exact best value for hyperparameters for the given
problem. The best value can be determined either by the rule of thumb or
by trial and error.
• Some examples of hyperparameters are the learning rate for training a
neural network and K in the KNN algorithm.
Categories of Hyperparameters
Broadly, hyperparameters can be divided into two categories:
• Hyperparameters for Optimization
• Hyperparameters for Specific Models
Hyperparameter for Optimization
The process of selecting the best hyperparameters to use is known as
hyperparameter tuning, and the tuning process is also known as
hyperparameter optimization. Optimization parameters are used for
optimizing the model.
Optimization parameters
• Learning Rate: The learning rate is the hyperparameter in optimization
algorithms that controls how much the model changes in response to the
estimated error each time the model's weights are updated. It is one of
the most important hyperparameters when building a neural network, since it sets
the step size of every weight update. Selecting a good learning rate is a
challenging task: if the learning rate is too small, it slows down the training
process. On the other hand, if the learning rate is too large, the updates may
overshoot and the model may fail to converge.
• Batch Size: To enhance the speed of the learning process, the training set is
divided into subsets, each of which is known as a batch. The batch size is the
number of training samples in each batch.
• Number of Epochs: An epoch is one complete pass of the training data through
the machine learning model. Training over epochs is an iterative learning
process. The number of epochs varies from model to model, and most models are
trained for more than one epoch. To determine the right number of epochs, the
validation error is taken into account. The number of epochs is increased as long
as the validation error keeps decreasing. If the validation error does not improve
for several consecutive epochs, this is a signal to stop increasing the number of
epochs.
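The stopping rule for the number of epochs can be sketched as below. This is an illustrative sketch only: choose_num_epochs and the patience parameter (how many non-improving epochs to tolerate) are names assumed here, and the validation errors are made-up numbers rather than results from a real model.

```python
def choose_num_epochs(val_errors, patience=2):
    """Return the 1-based epoch with the lowest validation error seen before
    `patience` consecutive epochs pass without improvement."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error stopped improving
    return best_epoch


# Hypothetical validation error recorded after each epoch.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]
stopped_at = choose_num_epochs(errors, patience=2)
```

With these numbers training stops after epoch 5 and epoch 3 is kept, so the later value 0.30 is never reached. A larger patience trades extra training time for a lower chance of stopping too early.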
Hyperparameters for specific models
Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:
Number of Hidden Units: Hidden units are the components that make up the
layers of processors between the input and output units in a neural
network.
It is important to specify the number of hidden units for the neural network.
A common rule of thumb is that it should be between the size of the input
layer and the size of the output layer. More specifically, one heuristic sets
the number of hidden units to 2/3 of the size of the input layer, plus the size
of the output layer. For example, with 30 inputs and 2 outputs this gives
2/3 × 30 + 2 = 22 hidden units.
For complex functions, more hidden units may be needed, but not so many
that the model overfits.
Number of Layers: A neural network is made up of vertically arranged
components, which are called layers. There are mainly input layers, hidden
layers, and output layers. A 3-layered neural network often gives better
performance than a 2-layered network. For a Convolutional Neural
network, a greater number of layers generally makes a better model.
Steps for hyperparameter tuning
• Select the right type of model.
• Review the list of parameters of the model and build the hyperparameter (HP) space.
• Find a method for searching the hyperparameter space.
• Apply a cross-validation scheme.
• Assess the model score to evaluate the model.
Influence on other parameters
Overall, hyperparameters influence the factors below when you are
designing your model. Please remember this.
Linear Model
What degree of polynomial features should we use?
Decision Tree
What is the maximum allowed depth?
What is the minimum number of samples required at a leaf node in
the decision tree?
Random forest
How many trees should we include?
Neural Network
How many neurons should we keep in a layer?
How many layers should we keep in the network?
Gradient Descent
What learning rate should we use?
Grid Search

When performing hyperparameter optimization, we first need to define
a parameter space or parameter grid, where we include a set of possible
hyperparameter values that can be used to build the model.
The grid search technique is then used to place these hyperparameters in a
matrix-like structure, and the model is trained on every combination of
hyperparameter values.
The model with the best performance is then selected.
Example:
For SVM:
C: [1, 5, 10]
Kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}

This grid has 3 × 4 = 12 combinations. For larger grids, checking every
combination quickly becomes impractical because training time grows with the
number of combinations.
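The SVM example can be turned into a small sketch of the search loop. This is an illustration under stated assumptions, not a real SVM experiment: grid_search is a hypothetical helper, and toy_score is a made-up stand-in for the cross-validated score that would normally come from training the model. In scikit-learn, GridSearchCV performs the same enumeration with cross-validation built in.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_score(params):
    # Hypothetical stand-in for a cross-validated accuracy; NOT a real SVM.
    kernel_score = {"linear": 0.70, "poly": 0.72, "rbf": 0.80, "sigmoid": 0.65}
    return kernel_score[params["kernel"]] - 0.01 * abs(params["C"] - 5)


svm_grid = {"C": [1, 5, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]}
best_params, best_score, n_evaluated = grid_search(svm_grid, toy_score)
```

The loop evaluates all 12 combinations. Adding a third hyperparameter with five values would raise that to 60, which is why the grid should be kept small.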
Random Search

While grid search looks at every possible combination of hyperparameters to
find the best model, random search only selects and tests a random
subset of hyperparameter combinations.
This technique randomly samples from a grid of hyperparameters instead of
conducting an exhaustive search.
We can specify the number of total runs the random search should try before
returning the best model.
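A minimal sketch of random search over a grid is shown below. The helper random_search, the n_iter argument (the number of total runs) and toy_score are assumptions made for illustration; toy_score is a made-up stand-in for a real validation score. scikit-learn offers RandomizedSearchCV for the same purpose.

```python
import random
from itertools import product


def random_search(param_grid, score_fn, n_iter, seed=0):
    """Score a random subset of the grid and return the best combination tried."""
    names = list(param_grid)
    combinations = [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]
    rng = random.Random(seed)
    # Sample without replacement so no combination is scored twice.
    tried = rng.sample(combinations, k=min(n_iter, len(combinations)))
    best = max(tried, key=score_fn)
    return best, score_fn(best), tried


def toy_score(params):
    # Hypothetical stand-in for a cross-validated accuracy; NOT a real SVM.
    kernel_score = {"linear": 0.70, "poly": 0.72, "rbf": 0.80, "sigmoid": 0.65}
    return kernel_score[params["kernel"]] - 0.01 * abs(params["C"] - 5)


svm_grid = {"C": [1, 5, 10], "kernel": ["linear", "poly", "rbf", "sigmoid"]}
best_params, best_score, tried = random_search(svm_grid, toy_score, n_iter=5)
```

Only 5 of the 12 combinations are scored here, so the result is the best of those 5 and not necessarily the best in the whole grid.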
When to use Random search vs Grid search CV
If you ever find yourself trying to choose between grid search and random
search, here are some pointers to help you decide which one to use:
• Use grid search if you already have an acceptable range of known
hyperparameter values that will perform well. Make sure to keep your
parameter space small, because grid search can be extremely
time-consuming.
• Use random search on a broad range of values if you don’t already have an
idea of the parameters that will perform well on your model. Random
search is faster than grid search and is usually the better choice when you
have a large parameter space.
• It is also a good idea to use both random search and grid search to get the
best possible results.
• You can use random search first with a large parameter space since it is
faster. Then, use the best hyperparameters found by random search to
narrow down the parameter grid, and feed a smaller range of values to grid
search.
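The two-stage approach in the last two points can be sketched for a single hyperparameter. Everything here is assumed for illustration: the hyperparameter is called C, the broad range is 0.01 to 100, and score is a made-up function standing in for a validation score.

```python
import random


def score(c):
    # Hypothetical validation score that peaks at C = 3.0.
    return -(c - 3.0) ** 2


# Stage 1: random search over a broad range of values.
rng = random.Random(0)
candidates = [rng.uniform(0.01, 100.0) for _ in range(20)]
coarse_best = max(candidates, key=score)

# Stage 2: grid search over a narrow range around the stage-1 result.
low, high = coarse_best / 2, coarse_best * 2
steps = 10
fine_grid = [low + i * (high - low) / steps for i in range(steps + 1)]
fine_grid.append(coarse_best)  # keep the stage-1 winner as a candidate
fine_best = max(fine_grid, key=score)
```

Because the stage-1 winner stays in the fine grid, the second stage can only match or improve on the first.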
Ways of Combining Classifiers
Points to address while combining classifiers
Q1: What nature of output information is to be combined?
The options are elementary outputs (the output of the classifier is a single
label), ranked outputs (the output of the classifier is a list of labels ranked
from most probable to least probable), and scored outputs (the classifier
assigns a degree of confidence to each class label).
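Each of the three output types calls for a different way of combining. The sketch below shows one common choice for each: a majority vote for elementary outputs, a Borda count for ranked outputs, and averaged scores for scored outputs. These particular rules and the function names are assumptions chosen for illustration; the slides do not prescribe them.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Elementary outputs: each classifier gives a single label."""
    return Counter(labels).most_common(1)[0][0]


def borda_count(rankings):
    """Ranked outputs: each classifier lists labels from most to least probable."""
    points = Counter()
    for ranking in rankings:
        for position, label in enumerate(ranking):
            points[label] += len(ranking) - 1 - position  # top rank earns most points
    return points.most_common(1)[0][0]


def average_scores(score_dicts):
    """Scored outputs: each classifier gives a confidence for every label."""
    totals = defaultdict(float)
    for scores in score_dicts:
        for label, value in scores.items():
            totals[label] += value
    return max(totals, key=lambda label: totals[label] / len(score_dicts))


voted = majority_vote(["cat", "dog", "cat"])
ranked = borda_count([["cat", "dog", "bird"],
                      ["dog", "cat", "bird"],
                      ["dog", "bird", "cat"]])
scored = average_scores([{"cat": 0.6, "dog": 0.4},
                         {"cat": 0.3, "dog": 0.7},
                         {"cat": 0.9, "dog": 0.1}])
```

Scored outputs carry the most information. In the last example one very confident classifier outweighs the other two's mixed opinions, which a plain vote cannot express.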

Q2: What types of classifiers are to be combined?
Fusion methods that rely on classifiers derived from the same classification
algorithm are called homogeneous combination approaches, or ensemble
methods (e.g. random forest uses decision trees). If the fusion panel contains
different classifiers, the methods are called heterogeneous combination
approaches, e.g. combining a neural network and a decision tree.
Q3: What combination rule is used to combine decisions?
The two main categories are deterministic versus probabilistic approaches.
Q4: What is the structure of this combination?
There are three main categories of topology: parallel, sequential, or
hybrid combinations. In parallel fusion, the base classifiers work
independently. They may be trained with inputs living in different feature
spaces, and the feature vectors may or may not be derived from the same
raw training examples. However, the output of a given classifier cannot
serve as input for another classifier. In sequential (serial) fusion,
elementary classifiers are stacked in a sequential way and the decision of
one classifier depends on a previous decision. Some class labels are
eliminated at each classification step until one class is left. Hybrid
hierarchical fusion consists of a mix of parallel and sequential
architectures.
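The difference between the parallel and sequential topologies can be sketched with toy classifiers. All classifiers, thresholds and labels below are hypothetical and exist only to show the data flow: in parallel_fusion no classifier sees another's output, while in sequential_fusion each stage receives the labels still allowed by the previous stage.

```python
from collections import Counter

LABELS = ["small", "medium", "large"]


def make_threshold_classifier(low, high):
    """Build a toy classifier that labels a number using two cut-offs."""
    def classify(x):
        return "small" if x < low else "medium" if x < high else "large"
    return classify


def parallel_fusion(classifiers, x):
    """Every classifier sees only x; their outputs are combined by vote."""
    votes = [classify(x) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


def rule_out_by_upper_cutoff(x, candidates):
    excluded = {"large"} if x < 100 else {"small", "medium"}
    return [label for label in candidates if label not in excluded]


def rule_out_by_lower_cutoff(x, candidates):
    excluded = {"small"} if x >= 10 else {"medium", "large"}
    return [label for label in candidates if label not in excluded]


def sequential_fusion(stages, x, labels):
    """Each stage eliminates labels from what the previous stage left."""
    candidates = list(labels)
    for stage in stages:
        if len(candidates) == 1:
            break  # one class left: decision made
        candidates = stage(x, candidates)
    return candidates


panel = [make_threshold_classifier(10, 100),
         make_threshold_classifier(20, 80),
         make_threshold_classifier(5, 200)]
parallel_result = parallel_fusion(panel, 90)
serial_result = sequential_fusion(
    [rule_out_by_upper_cutoff, rule_out_by_lower_cutoff], 50, LABELS)
```

A hybrid hierarchical design would mix the two, for example by using a parallel vote as one stage of a sequential chain.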
Approaches
[Figure: three panels showing a parallel combination of three classifiers, a
serial combination of three classifiers, and a hybrid hierarchical combination]
