MODULE 3
Introduction to Training Machine Learning Models
• In earlier chapters, Machine Learning (ML) algorithms were used largely as black boxes, without
knowing the details of how the models work internally. This approach works well in
many situations, but understanding the inner workings of ML models can be
beneficial.
Linear Regression
• Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent
features by fitting a linear equation to observed data.
• When there is only one independent feature, it is known as Simple Linear Regression,
and when there is more than one feature, it is known as Multiple Linear Regression.
• Similarly, when there is only one dependent variable, it is considered Univariate
Linear Regression, while when there is more than one dependent variable, it is
known as Multivariate Regression.
Definition: A linear model makes a prediction by simply computing a weighted sum of the
input features, plus a constant called the bias term (intercept term).
In linear regression, the goal is to find a relationship between the dependent variable (Y) and
one or more independent variables (X). This relationship is represented by a line, known as
the best-fit line, which can be used to predict Y from X. Linear regression involves learning a
function from the given data that minimizes the error between predicted and actual values.
To find the best-fit line, we need to determine the best values for θ1 and θ2. This is done using
the cost function, which measures how well the model predicts the actual values. In linear
regression, we commonly use the Mean Squared Error (MSE) as the cost function:
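With the best-fit line written as ŷ = θ1 + θ2·x (the convention implied above), the MSE over m training examples takes its standard form:

$$\mathrm{MSE}(\theta_1, \theta_2) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\theta_1 + \theta_2 x^{(i)} - y^{(i)}\right)^2$$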
The Normal Equation: Finding the Best Parameters for Linear Regression
To find the best parameters (θ) for a Linear Regression model, we can use a closed-form
mathematical formula called the Normal Equation. This equation directly computes the value
of θ that minimizes the cost function (usually the Mean Squared Error).
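In matrix form, with X the matrix of input features (including a column of 1s for the bias term) and y the vector of targets, the Normal Equation is:

$$\hat{\theta} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$

Steps 1 and 2 (generating the data and computing θ with the Normal Equation) are shown only as a minimal NumPy sketch below, consistent with the generating function y = 4 + 3x1 + Gaussian noise mentioned in step 3; the exact input range is an assumption:

import numpy as np

# 1. Generate linear-looking data: y = 4 + 3*x1 + Gaussian noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

# 2. Compute theta_best using the Normal Equation
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)  # should be close to [4, 3], up to the noise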
3. Check the Results: The original function we used to generate the data was
y = 4 + 3x1 + Gaussian noise.
4. Make Predictions:
X_new = np.array([[0], [2]])  # two new instances, x1 = 0 and x1 = 2
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 (bias feature) to each instance
y_predict = X_new_b.dot(theta_best)  # predict using the parameters from the Normal Equation
print(y_predict)
Gradient Descent
Gradient Descent is a powerful optimization algorithm used to find the minimum value of a
function. It's widely used in machine learning to minimize the cost function of models like
linear regression.
Conceptual Explanation
Imagine you are lost in the mountains on a foggy day. You want to get to the lowest point in
the valley, but you can only feel the slope of the ground under your feet. The best way to
reach the bottom is to keep moving downhill in the direction where the slope is the steepest.
This is how Gradient Descent works: it adjusts the parameters step by step to minimize the
cost function, similar to how you would move downhill to minimize your altitude.
An important parameter in Gradient Descent is the size of the steps, determined by the
learning rate hyperparameter. If the learning rate is too small, the algorithm will take very
small steps, and it will take a long time to reach the minimum.
If the learning rate is too high, the algorithm might overshoot the minimum, causing it to
diverge and fail to find the optimal solution.
Feature Scaling
The cost function has the shape of a bowl, but it can be an elongated bowl if the features have
very different scales. The figure below shows Gradient Descent on a training set where features 1 and 2
have the same scale (on the left), and on a training set where feature 1 has much smaller
values than feature 2 (on the right). When using Gradient Descent, you should therefore ensure that all
features have a similar scale (for example, by standardizing them), or training will take much longer to converge.
• To implement Gradient Descent, calculate the gradient of the cost function with respect to
each model parameter θj.
• That is, calculate how much the cost function will change if you change θj just a little bit.
This quantity is known as a partial derivative.
Instead of computing these partial derivatives individually, we can compute them all in one go. The
gradient vector, ∇θ MSE(θ), contains all the partial derivatives of the cost function, one per
model parameter; its closed form and the update rule are shown below.
The gradient vector points uphill (the direction of steepest ascent), so to minimize the cost
function we move in the opposite direction (downhill).
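For Linear Regression with the MSE cost, the gradient vector and the Gradient Descent update step take their standard forms (η is the learning rate):

$$\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\,\mathbf{X}^\top\left(\mathbf{X}\theta - \mathbf{y}\right), \qquad \theta^{(\text{next step})} = \theta - \eta\,\nabla_{\theta}\,\mathrm{MSE}(\theta)$$

A minimal Batch Gradient Descent sketch, assuming X_b and y are the bias-augmented training matrix and targets from the Normal Equation example above:

eta = 0.1  # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE over the full training set
    theta = theta - eta * gradients  # take one step downhill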
The figure below shows the first 10 steps of Gradient Descent using three different learning rates. The
dashed line represents the starting point.
• The main problem with Batch Gradient Descent is that it uses the entire training set to
compute the gradients at each step, which makes it very slow for large training sets
because a lot of data must be processed at each iteration.
• In Stochastic Gradient Descent (SGD), instead of using the whole training set, SGD
picks a random instance from the training set at each step and computes the gradients
based on that single instance.
• This makes SGD much faster because it deals with very little data at every iteration.
Only one instance needs to be in memory at a time, which allows training on huge
datasets.
Characteristics of SGD
• Due to its stochastic (random) nature, the cost function does not decrease smoothly.
Instead, it bounces up and down but generally decreases over time.
• Once the algorithm stops, the final parameter values are close to the optimal but not
exactly at the minimum.
• The randomness helps SGD jump out of local minima, increasing the chance of
finding the global minimum compared to Batch Gradient Descent.
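A minimal Stochastic Gradient Descent sketch with a simple learning schedule, again assuming the X_b and y arrays from the earlier examples (the schedule constants t0 and t1 are illustrative, not prescriptive):

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)  # gradually reduce the learning rate over time

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)  # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient based on a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients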
Polynomial Regression
Sometimes, data is more complex than a straight line can represent. A linear model can still be
used to fit nonlinear data by adding powers of each feature as new features. This technique is
called Polynomial Regression.
Example: Let’s generate some nonlinear data, based on a simple quadratic equation plus
some noise.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)  # add the square of each feature
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
Why Regularize?
• Prevent Overfitting: Regularization helps reduce overfitting by constraining the
model, making it less flexible and less likely to overfit the data.
• Degrees of Freedom: The fewer degrees of freedom a model has, the less likely it is to
fit the noise in the data.
1. Ridge Regression
• Ridge Regression, also known as Tikhonov regularization, adds a regularization term
to the Linear Regression cost function.
• The regularization term that is added (see the cost function below) forces the model to keep
the weights θ as small as possible.
• Purpose: To fit the data while keeping the model simple by having smaller weights.
The Ridge cost function is the Mean Squared Error (MSE) plus a term that penalizes large weights. The bias term θ0 is
not regularized.
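In its common form (α controls the regularization strength; some texts include an extra factor of 1/2), the Ridge Regression cost function is:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^{2}$$

The sum starts at i = 1, so the bias term θ0 is excluded.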
Elastic Net
• Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s regularization
terms, and you can control the mix ratio r.
• When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is
equivalent to Lasso Regression.
Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of
features is greater than the number of training instances or when several features are strongly
correlated.
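With mix ratio r and regularization strength α, the Elastic Net cost function is commonly written as:

$$J(\theta) = \mathrm{MSE}(\theta) + r\,\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \frac{1 - r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^{2}$$

In Scikit-Learn these regularized models can be used directly; a short sketch, where the α and r values are arbitrary and X, y stand for whatever training data is at hand:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge_reg = Ridge(alpha=1.0)  # L2 penalty
lasso_reg = Lasso(alpha=0.1)  # L1 penalty
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio corresponds to the mix ratio r
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])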
Early Stopping
Early stopping is a regularization technique used in iterative learning algorithms like Gradient
Descent to prevent overfitting. Instead of running the algorithm for a fixed number of
iterations or until the cost function converges, you monitor the model's performance on a
validation set and stop training when the validation error stops improving.
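A minimal sketch of early stopping with Scikit-Learn's SGDRegressor, assuming X_train, y_train, X_val, y_val have already been prepared (warm_start=True makes each call to fit() continue from where it left off instead of restarting):

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())  # one more epoch, continuing from previous weights
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:  # keep the best model seen on the validation set
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)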
Logistic Regression
Logistic Regression is used to estimate the probability that an instance belongs to a particular
class. If the estimated probability is greater than 50%, then the model predicts that the
instance belongs to that class (the positive class, or "1"); otherwise it predicts that it does not
(the negative class, or "0"). This makes it a binary classifier. (Unlike a plain linear model,
whose output is unbounded and can be pushed below 0 or above 1 by outliers, Logistic
Regression always outputs a value between 0 and 1.)
Estimating Probabilities
The Logistic Regression model computes a weighted sum of the input features plus a bias term,
and outputs the logistic (sigmoid) of the result.
Once the Logistic Regression model has estimated the probability P=hθ(x) that an instance x
belongs to the positive class, it can make its prediction ŷ easily
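In standard notation, the estimated probability and the resulting prediction are:

$$\hat{p} = h_{\theta}(\mathbf{x}) = \sigma\!\left(\theta^{\top}\mathbf{x}\right), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \ge 0.5 \end{cases}$$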
The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y=0).
This idea is captured by the cost function for a single training instance x, shown below.
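$$c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$$

This cost grows very large when the model estimates a probability close to 0 for a positive instance, or close to 1 for a negative instance.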
The cost function over the whole training set is simply the average cost over all training
instances. It can be written in a single expression called the log loss.
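In its usual form, the log loss is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\!\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - \hat{p}^{(i)}\right)\right]$$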
The partial derivative of the cost function with respect to the jth model parameter θj is given
by the following expression:
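$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma\!\left(\theta^{\top}\mathbf{x}^{(i)}\right) - y^{(i)}\right) x_j^{(i)}$$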
Softmax Regression
• The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers. This is called Softmax
Regression, or Multinomial Logistic Regression.
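In standard notation, Softmax Regression first computes a score sk(x) for each class k, then estimates each class probability with the softmax function; the predicted class is the one with the highest estimated probability:

$$s_k(\mathbf{x}) = \theta_k^{\top}\mathbf{x}, \qquad \hat{p}_k = \frac{\exp\!\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K}\exp\!\left(s_j(\mathbf{x})\right)}$$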
Hyperplane
A hyperplane is a decision boundary that separates a given set of data points having
different class labels. The SVM classifier separates data points using a hyperplane with the
maximum possible margin. This hyperplane is known as the maximum margin hyperplane,
and the linear classifier it defines is known as the maximum margin classifier.
Support Vectors
Support vectors are the sample data points that lie closest to the hyperplane. These points
determine the position of the separating line or hyperplane, since the margin is calculated from them.
Margin
A margin is the separation gap between the hyperplane and the closest data points on either side.
It is calculated as the perpendicular distance from the hyperplane to the support vectors (the closest
data points). In SVMs, we try to maximize this separation gap so that we get the maximum margin.
In SVMs, the main objective is to select a hyperplane with the maximum possible margin
between support vectors in the given dataset.
SVM searches for the maximum margin hyperplane in the following two-step process:
1. Generate hyperplanes that segregate the classes in the best possible way. There are
many hyperplanes that might classify the data, so we should look for the hyperplane
that represents the largest separation, or margin, between the two classes.
2. Choose the hyperplane so that the distance from it to the support vectors on each side
is maximized. If such a hyperplane exists, it is known as the maximum margin
hyperplane, and the linear classifier it defines is known as a maximum margin
classifier.
The following diagram illustrates the concept of maximum margin and maximum margin
hyperplane.
source: https://www.kaggle.com/code/prashant111/svm-classifier-tutorial#1.-Introduction-to-Support-Vector-Machines-
The objective function in Soft Margin Classification includes a penalty for misclassified
points. This penalty is proportional to the distance of the points from the correct side of the
margin. Mathematically, this is expressed by adding a term to the cost function that penalizes
errors, weighted by C.
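A minimal Scikit-Learn sketch of a soft-margin linear SVM classifier; the choice of the iris dataset, the two petal features, and C = 1 are illustrative assumptions rather than anything prescribed above:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # binary target: Iris-Virginica or not

svm_clf = Pipeline([
    ("scaler", StandardScaler()),  # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])

A smaller C allows more margin violations but gives a wider margin; a larger C penalizes violations more heavily.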
Linear SVM classifiers are efficient and work well in many cases, but some datasets are not
linearly separable. One way to handle these nonlinear datasets is to add more features, like
polynomial features, which can sometimes make the dataset linearly separable.
Example: Consider a dataset with one feature x1. This dataset is not linearly separable (as
shown in the left plot of the figure). However, if you add a second feature x2 = (x1)², the
resulting 2D dataset becomes linearly separable.
Consider a dataset for binary classification in which the data points are shaped as two
interleaving half circles.
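A sketch of this approach, assuming the half-circle data comes from Scikit-Learn's make_moons generator (the degree, C, and noise values are illustrative):

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)  # two interleaving half circles

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),  # add polynomial features explicitly
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)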
Polynomial Kernel
• Adding polynomial features is simple to implement and can work great with all sorts
of Machine Learning algorithms, but at a low polynomial degree it cannot deal with
very complex datasets, and with a high polynomial degree it creates a huge number of
features, making the model too slow.
• The kernel trick is a powerful technique used in Support Vector Machines (SVMs) to
handle nonlinear datasets without explicitly mapping the data to a higher-dimensional
space. Instead, it uses kernel functions to compute the similarity between data points
in this higher dimensional space directly, saving computational resources and
simplifying the process.
• The Gaussian Radial Basis Function (RBF), φγ(x, ℓ) = exp(–γ ∥x – ℓ∥²), is a bell-shaped
similarity function varying from 0 (very far away from the landmark ℓ) to 1 (at the
landmark); the example below uses γ = 0.3. Now we are ready to compute the new features.
• For example, let’s look at the instance x1 = –1: it is located at a distance of 1 from the
first landmark, and 2 from the second landmark.
• Therefore, its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30.
The plot on the right of Figure shows the transformed dataset (dropping the original
features). As you can see, it is now linearly separable.
The models are trained with different values of hyperparameters gamma (γ) and C. Increasing
gamma makes the bell-shape curve narrower (see the above left plot of Figure), and as a
result each instance’s range of influence is smaller: the decision boundary ends up being
more irregular, wiggling around individual instances. Conversely, a small gamma value
makes the bell-shaped curve wider, so instances have a larger range of influence, and the
decision boundary ends up smoother. So γ acts like a regularization hyperparameter: if your
model is overfitting, you should reduce it, and if it is underfitting, you should increase it.
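A sketch of kernelized SVM classifiers in Scikit-Learn; the hyperparameter values are illustrative, and X, y stand for a nonlinear dataset such as the make_moons data above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Polynomial kernel: behaves as if many polynomial features had been added,
# without actually creating them (the kernel trick)
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])

# Gaussian RBF kernel: gamma controls the width of the bell-shaped curve
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)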
SVM Regression
Support Vector Machine (SVM) regression is a versatile method that supports both linear and
nonlinear regression. The primary goal in SVM regression is to fit as many data points as
possible within a predefined margin while limiting margin violations (i.e., points outside the
margin).
• In SVM classification, the objective is to maximize the margin between classes.
However, in SVM regression, the aim is to fit as many data points as possible within a
margin (referred to as the "street").
• The width of this street is controlled by the hyperparameter 𝜖. Only points that fall
outside this margin affect the model. A larger ϵ results in a wider street, leading to
fewer points outside the margin, while a smaller 𝜖 results in a narrower street.
The figure below shows two linear SVM Regression models trained on some random linear data,
one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).
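A minimal sketch of those two models, assuming X, y hold some 1D linear training data:

from sklearn.svm import LinearSVR

svm_reg_wide = LinearSVR(epsilon=1.5)  # large margin (wide street)
svm_reg_narrow = LinearSVR(epsilon=0.5)  # small margin (narrow street)
svm_reg_wide.fit(X, y.ravel())
svm_reg_narrow.fit(X, y.ravel())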
• Linear SVM Regression: The model tries to find a linear function that fits within the
margin.
• For nonlinear regression tasks, kernelized SVM models are used. The kernel trick
allows the SVM to perform in a higher-dimensional space without explicitly
transforming the data.
The figure below shows SVM Regression on a random quadratic training set, using a 2nd-degree
polynomial kernel. There is little regularization on the left plot (i.e., a large C value), and
much more regularization on the right plot (i.e., a small C value).
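A sketch of the two kernelized models in that figure, assuming X, y hold some quadratic training data (C = 100 gives little regularization, C = 0.01 much more):

from sklearn.svm import SVR

svm_poly_reg1 = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)  # little regularization
svm_poly_reg2 = SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1)  # much more regularization
svm_poly_reg1.fit(X, y.ravel())
svm_poly_reg2.fit(X, y.ravel())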