Machine Learning With Ridge and Lasso Regression
Ivan Manov
Table of Contents
Abstract
1. Motivation
Abstract
In this course, we explore the ridge and lasso regression algorithms and explain their
regularization mechanics. We start by reviewing the concepts of regression
analysis and examining how overfitting and multicollinearity occur. Then, we
discover why regularization is a necessity for dealing with such problems properly.
Next, we learn the theory behind ridge and lasso regression and represent them
visually to gain a better understanding of how they work. To top it all off, we put the
theoretical knowledge into practice by applying ridge and lasso regression to a real-
world scenario in Python. We validate their performances by comparing them with
a linear regression without regularization and see which is better for the case.
These course notes give the theoretical essentials for understanding the ridge
and lasso regression basics and serve as a comprehensive guide to the video
materials. They are an additional resource that will help you grasp the topics and
methodologies under discussion.
1. Motivation
There are different types of regressions for data science and machine learning (a brief fitting sketch follows the list):
• Simple linear regression:
$y = \beta_0 + \beta_1 x$
• Multiple linear regression:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$
• Polynomial regression:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \cdots + \beta_n x_1^n$
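The sketch below fits the first and third of these model types on a small synthetic dataset with scikit-learn; the data, the coefficient values, and the polynomial degree are arbitrary choices made only for illustration.

```python
# A minimal sketch (synthetic data, arbitrary names) fitting a simple linear
# and a polynomial regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(100, 1))                       # a single feature
y = 2.0 + 1.5 * x[:, 0] + 0.3 * x[:, 0] ** 2 + rng.normal(0, 1, 100)

# Simple linear regression: y = b0 + b1 * x
linear = LinearRegression().fit(x, y)

# Polynomial regression: y = b0 + b1 * x + b2 * x^2
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)

print(linear.coef_, poly.coef_)
```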
While very similar to linear regression, ridge and lasso include an additional
component called a “penalty term” that prevents overfitting. Categorically, ridge and lasso
regressions are both regularization methods.
There are various approaches for dealing with overfitting depending on the
data at hand or your choice of technique. Regularization is one of them and it is
particularly effective when our data also suffers from multicollinearity.
2. What is Regularization?
• L-1 regularization applies an L-1 penalty equal to the absolute value of the
magnitude of the coefficients. It restricts the size of the coefficients,
making some of them equal to zero. Mathematically, the L-1 penalty term
is represented by the following formula:
$\sum_{j=1}^{m} |\beta_j|$
• L-2 regularization, on the other hand, adds an L-2 penalty equal to the
square of the magnitude of the coefficients. Here, all coefficients are
shrunk by the same factor. Their values become closer to zero, but they
are never actually zero. Mathematically, the L-2 penalty term is
represented by the following formula (a short numeric sketch comparing the two penalties follows the list):
$\sum_{j=1}^{m} \beta_j^2$
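To make the difference concrete, here is a minimal numeric sketch that computes both penalty terms for the same hypothetical coefficient vector:

```python
# A minimal sketch (hypothetical coefficient values) computing the two
# penalty terms described above for the same set of coefficients.
import numpy as np

beta = np.array([0.5, -1.2, 3.0, 0.0, -0.7])   # example coefficients

l1_penalty = np.sum(np.abs(beta))   # sum of absolute values (lasso-style)
l2_penalty = np.sum(beta ** 2)      # sum of squared values (ridge-style)

print(l1_penalty)   # 5.4
print(l2_penalty)   # 11.18
```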
Recall that a standard least-squares regression estimates its coefficients by minimizing the sum of squared errors:
$\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$
In ridge regression, we don’t want to minimize only the squared error, but
also the additional regularization penalty term, controlled by a tuning parameter.
This parameter determines how much bias we’ll add to the model and is most often
denoted with lambda:
$\lambda \sum_{j=1}^{m} \beta_j^2$
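Putting the two pieces together, the complete ridge objective implied by the formulas above is the sum of squared errors plus the weighted penalty:

$\min_{\beta} \; \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \lambda \sum_{j=1}^{m} \beta_j^2$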
The higher the values of lambda, the bigger the penalty is. If lambda equals
zero, the ridge regression basically represents a regular least-squares regression.
On the other hand, if lambda equals infinity, then all coefficients shrink to zero.
Therefore, the tuning parameter must be somewhere between zero and infinity. The
process of estimating the proper value is most often established with the help of a
technique called ‘cross-validation’. Applying an appropriate value for the tuning
parameter should increase the bias slightly while lowering the variance enough to prevent the model from overfitting the training data.
Figure 3: Linear regression
Figure 4: Ridge regression (linear regression with a penalty term)
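The following sketch illustrates this behaviour with scikit-learn's Ridge estimator, where the tuning parameter is exposed as alpha; the synthetic data and the grid of alpha values are arbitrary choices for the example.

```python
# A minimal sketch (synthetic data) of how larger tuning parameter values
# shrink the ridge coefficients; scikit-learn calls the parameter alpha.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
# A value close to zero behaves almost like ordinary least squares, while a
# very large value pushes every coefficient toward (but not exactly to) zero.
```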
Although its mechanics have been used in other scientific areas, the machine
learning application of lasso regression was introduced by the statistician Robert
Tibshirani in 1996. Much like ridge regression, lasso also incorporates a
regularization technique for dealing with overfitted data. The main difference is the
penalty term, which is minimized alongside the regression equation’s cost function.
In ridge regression, this is the sum of the coefficients’ squared magnitudes. In a
lasso, on the other hand, the penalty is represented by the sum of the coefficients’
absolute values. Thus, a lasso regression utilizes an L-1 regularization, whereas a
ridge uses the L-2:
$\lambda \sum_{j=1}^{m} |\beta_j|$
• $\lambda$ – a tuning parameter
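Analogously to the ridge case, the complete lasso objective combines the squared error with this L-1 penalty:

$\min_{\beta} \; \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \lambda \sum_{j=1}^{m} |\beta_j|$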
Conceptually, the two methods have the same goal – to increase the bias and
lower the variance in order to prevent overfitting. The major difference between the
two algorithms is that a ridge shrinks the coefficients, so they become closer to zero
but never actual zeroes, while a lasso can shrink them all the way to zero. What the
lasso regression does is decrease the values of the irrelevant parameters to zero, so
that they don’t participate in the equation. This way, our model only has variables
that are important for the predictions. Such a process is also known as ‘feature
selection’ as it excludes the irrelevant variables from the equation and leaves us with
a subset containing only the useful ones. A huge benefit of using a lasso regression
is that it’s very suitable when dealing with big datasets because it can easily lower
the variance in models with many features.
For example, a polynomial model such as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \cdots + \beta_n x_1^n$ can be reduced to $y = \beta_0 + \cdots + \beta_n x_1^n$ once the lasso shrinks the coefficients of the irrelevant terms to exactly zero.
In summary, there are two major differences between a ridge and a lasso
regression. The first lies in how they calculate the penalty term. The second is
that a lasso can perform feature selection, excluding the irrelevant
features from the prediction process, while a ridge is more applicable to smaller
datasets with fewer variables.
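The sketch below contrasts the two behaviours on synthetic data in which only the first three of six features matter; the data and the alpha values are arbitrary example choices.

```python
# A minimal sketch (synthetic data) contrasting the two behaviours: the ridge
# coefficients only shrink, while the lasso sets some of them exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
# only the first three features actually matter for the target
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.round(ridge.coef_, 3))   # all six coefficients remain non-zero (some are small)
print(np.round(lasso.coef_, 3))   # the irrelevant ones are exactly 0.0
```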
When creating a predictive model, we usually split the data into training and
testing parts. With cross-validation, on the other hand, we divide it into three –
training, testing, and validation.
Figure 3: Splitting the data into training, testing, and validation sets
To pick a good value for the tuning parameter, we need to perform cross-
validation on the training part of our data. So, we separate that set into different
parts that we’ll call “folds”.
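A minimal sketch of that split, assuming scikit-learn's train_test_split and KFold with arbitrary example data and seed values:

```python
# A minimal sketch (synthetic data) of the split described above: hold out a
# test set first, then cut the remaining training data into five folds.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=365)

kfold = KFold(n_splits=5, shuffle=True, random_state=365)
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(X_train), start=1):
    print(f"Fold {fold_number}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
```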
We use the first fold as a validation set, which leaves ‘k minus one’ folds for
training. With the data divided this way, we need to pick a
starting value for the tuning parameter – “lambda one”. Say we choose 0.1. Then,
we fit our model on the ‘k minus one’ training folds, using lambda one as the tuning
parameter, and establish values for the coefficients in the ridge regression equation.
Next, we use the obtained coefficients and the independent ‘X’ values from the
validation set to estimate the predicted y values for the validation data. With the
predicted y and the real y values in the validation fold, we can calculate the sum of
squares error.
Figure 5: Using Fold 1 as the validation set with a tuning parameter of 0.1
We repeat this operation, using each of the other folds as the validation set in turn.
Since we chose five folds, there are five such combinations and, consequently,
five different results for the sum of squares error. We then sum these
results to measure how well the model works with that lambda. After that, we repeat the same
procedure with different values for lambda – for instance 0.1, 0.2,
0.3, and so on up to 10, depending on the dataset size. The lambda leading to the lowest SSE is the correct
choice for our tuning parameter. Finally, we can fit the whole training set with the
lambda value in question.
Figure 6: Fitting the data with the best value for the tuning parameter
And this is how we choose the proper tuning parameter using K-Fold cross-
validation.
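The sketch below mirrors this procedure on synthetic data: for every candidate lambda it fits the ridge model on four folds, sums the squared errors on the held-out fold, and finally refits with the best lambda. The data and the lambda grid are arbitrary example choices.

```python
# A minimal sketch (synthetic data, an arbitrary grid of lambda values) of the
# procedure described above: for each candidate lambda, fit the ridge model on
# k-1 folds, predict on the held-out fold, and add up the squared errors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=150)

lambdas = np.arange(0.1, 10.1, 0.1)            # 0.1, 0.2, ... up to 10
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

total_sse = {}
for lam in lambdas:
    sse = 0.0
    for train_idx, val_idx in kfold.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        residuals = y[val_idx] - model.predict(X[val_idx])
        sse += np.sum(residuals ** 2)          # sum of squares error on the fold
    total_sse[lam] = sse

best_lambda = min(total_sse, key=total_sse.get)
final_model = Ridge(alpha=best_lambda).fit(X, y)   # refit on the whole training set
print(best_lambda)
```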
The RepeatedKFold function allows the implementation of the validator with the following parameters (a short usage sketch follows the list):
• n_splits – the number of folds
• n_repeats – the number of times the cross-validator is repeated
• random_state – controls the randomness of each repetition
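A minimal usage sketch, assuming scikit-learn's RidgeCV and LassoCV as the cross-validated estimators and an arbitrary synthetic dataset:

```python
# A minimal sketch of the validator mentioned above, plugged into scikit-learn's
# RidgeCV and LassoCV so that the tuning parameter is chosen by cross-validation.
# The data, the alpha grid, and the split sizes are arbitrary example choices.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

ridge_cv = RidgeCV(alphas=np.arange(0.1, 10.1, 0.1), cv=cv).fit(X, y)
lasso_cv = LassoCV(alphas=np.arange(0.1, 10.1, 0.1), cv=cv).fit(X, y)

print(ridge_cv.alpha_, lasso_cv.alpha_)   # the selected tuning parameters
```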
Please make note of the term ‘strong correlation’. The interpretation of the
coefficients can vary depending on the data we are working with. Coefficients and
metrics for data analysis may be standard across industries; however, their
significance usually differs depending on the specific case study. As a side note, you
should know that when you call ‘score’ on classifiers instead of regressions, the
method computes the accuracy score by default.
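A quick sketch of that distinction, using an arbitrary synthetic dataset: the regressor's score method returns the R-squared value, while the classifier's returns accuracy.

```python
# A minimal sketch of the side note above: calling .score() on a regressor
# returns the R-squared value, while on a classifier it returns accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y_reg = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)
y_clf = (y_reg > 0).astype(int)

print(Ridge(alpha=1.0).fit(X, y_reg).score(X, y_reg))        # R-squared
print(LogisticRegression().fit(X, y_clf).score(X, y_clf))    # accuracy
```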
The ‘mean squared error’ is another metric that helps us make a proper comparison and
validate the performance of the different algorithms. It takes the difference
between the predicted and the actual values, squares the result, and averages
these squared differences across the whole dataset.
The ‘root mean squared error’ is the square root of MSE. A huge advantage
here is that it is measured in the same units as the target variable, making it probably
the most easily interpreted statistic. While the ‘mean squared error’ is the average
of all the squared residuals, the ‘root mean squared error’ takes the square root of
that, which puts the metric back on the scale of the response variable. RMSE is very
commonly applied and is considered an excellent error metric for numerical
predictions. In general, the lower the value of the root mean squared error, the
better the model’s predictions.
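A minimal sketch with hypothetical predicted and actual values, showing how the two metrics relate:

```python
# A minimal sketch (hypothetical predictions) of the two error metrics: the MSE
# averages the squared residuals, and the RMSE is its square root, expressed in
# the same units as the target variable.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)   # mean of the squared residuals
rmse = np.sqrt(mse)                        # back in the target's own units

print(mse)    # 0.4375
print(rmse)   # ~0.661
```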
If you found this resource useful, check out our e-learning program. We have everything you need
to succeed in data science.
Learn the most sought-after data science skills from the best experts in the field! Earn a verifiable
certificate of achievement trusted by employers worldwide and future-proof your career.
Ivan Manov
Email: team@365datascience.com