MODULE 3
Introduction to Training Machine Learning Models
• In earlier chapters, Machine Learning (ML) algorithms were used largely as black boxes, without
knowing the details of how the models work internally. This approach works well in
many situations, but understanding the inner workings of ML models can be
beneficial.
Linear Regression
• Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent
features by fitting a linear equation to observed data.
• When there is only one independent feature, it is known as Simple Linear Regression,
and when there is more than one feature, it is known as Multiple Linear Regression.
• Similarly, when there is only one dependent variable, it is considered Univariate
Linear Regression, while when there is more than one dependent variable, it is
known as Multivariate Regression.
Definition: A linear model makes a prediction by simply computing a weighted sum of the
input features, plus a constant called the bias term (intercept term).
In linear regression, the goal is to find a relationship between the dependent variable (Y) and
one or more independent variables (X). This relationship is represented by a line, known as
the best-fit line, which can be used to predict Y from X. Linear regression involves learning a
function from the given data that minimizes the error between predicted and actual values.
To find the best-fit line, we need to determine the best values for θ1 and θ2. This is done using
the cost function, which measures how well the model predicts the actual values. In linear
regression, we commonly use the Mean Squared Error (MSE) as the cost function:
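With the best-fit line written as ŷ = θ1 + θ2·x (the convention implied above), the MSE over m training examples takes its standard form:

$$\mathrm{MSE}(\theta_1, \theta_2) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\theta_1 + \theta_2 x^{(i)} - y^{(i)}\right)^2$$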
The Normal Equation: Finding the Best Parameters for Linear Regression
To find the best parameters (θ) for a Linear Regression model, we can use a closed-form
mathematical formula called the Normal Equation. This equation directly computes the value
of θ that minimizes the cost function (usually the Mean Squared Error).
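In matrix form, with X the matrix of input features (including a column of 1s for the bias term) and y the vector of targets, the Normal Equation is:

$$\hat{\theta} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$

Steps 1 and 2 (generating the data and computing θ with the Normal Equation) are shown only as a minimal NumPy sketch below, consistent with the generating function y = 4 + 3x1 + Gaussian noise mentioned in step 3; the exact input range is an assumption:

import numpy as np

# 1. Generate linear-looking data: y = 4 + 3*x1 + Gaussian noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

# 2. Compute theta_best using the Normal Equation
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)  # should be close to [4, 3], up to the noise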
3. Check the Results: The original function we used to generate the data was
y = 4 + 3x1 + Gaussian noise.
4. Make Predictions:
X_new = np.array([[0], [2]])  # two new instances, x1 = 0 and x1 = 2
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 (bias feature) to each instance
y_predict = X_new_b.dot(theta_best)  # predict using the parameters from the Normal Equation
print(y_predict)
Gradient Descent
Gradient Descent is a powerful optimization algorithm used to find the minimum value of a
function. It's widely used in machine learning to minimize the cost function of models like
linear regression.
Conceptual Explanation
Imagine you are lost in the mountains on a foggy day. You want to get to the lowest point in
the valley, but you can only feel the slope of the ground under your feet. The best way to
reach the bottom is to keep moving downhill in the direction where the slope is the steepest.
This is how Gradient Descent works: it adjusts the parameters step by step to minimize the
cost function, similar to how you would move downhill to minimize your altitude.
An important parameter in Gradient Descent is the size of the steps, determined by the
learning rate hyperparameter. If the learning rate is too small, the algorithm will take very
small steps, and it will take a long time to reach the minimum.
If the learning rate is too high, the algorithm might overshoot the minimum, causing it to
diverge and fail to find the optimal solution.
Feature Scaling
The cost function has the shape of a bowl, but it can be an elongated bowl if the features have
very different scales. The figure below shows Gradient Descent on a training set where features 1 and 2
have the same scale (on the left), and on a training set where feature 1 has much smaller
values than feature 2 (on the right). When using Gradient Descent, you should therefore ensure that all
features have a similar scale (for example, by standardizing them), or training will take much longer to converge.
• To implement Gradient Descent, calculate the gradient of the cost function with respect to
each model parameter θj.
• That is, calculate how much the cost function will change if you change θj just a little bit.
This quantity is known as a partial derivative.
Instead of computing these partial derivatives individually, we can compute them all in one go. The
gradient vector, ∇θ MSE(θ), contains all the partial derivatives of the cost function, one per
model parameter; its closed form and the update rule are shown below.
The gradient vector points uphill (the direction of steepest ascent), so to minimize the cost
function we move in the opposite direction (downhill).
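For Linear Regression with the MSE cost, the gradient vector and the Gradient Descent update step take their standard forms (η is the learning rate):

$$\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\,\mathbf{X}^\top\left(\mathbf{X}\theta - \mathbf{y}\right), \qquad \theta^{(\text{next step})} = \theta - \eta\,\nabla_{\theta}\,\mathrm{MSE}(\theta)$$

A minimal Batch Gradient Descent sketch, assuming X_b and y are the bias-augmented training matrix and targets from the Normal Equation example above:

eta = 0.1  # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE over the full training set
    theta = theta - eta * gradients  # take one step downhill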
The figure below shows the first 10 steps of Gradient Descent using three different learning rates. The
dashed line represents the starting point.
• The main problem with Batch Gradient Descent is that it uses the entire training set to
compute the gradients at each step, which makes it very slow for large training sets
because a lot of data must be processed at each iteration.
• In Stochastic Gradient Descent (SGD), instead of using the whole training set, SGD
picks a random instance from the training set at each step and computes the gradients
based on that single instance.
• This makes SGD much faster because it deals with very little data at every iteration.
Only one instance needs to be in memory at a time, which allows training on huge
datasets.
Characteristics of SGD
• Due to its stochastic (random) nature, the cost function does not decrease smoothly.
Instead, it bounces up and down but generally decreases over time.
• Once the algorithm stops, the final parameter values are close to the optimal but not
exactly at the minimum.
• The randomness helps SGD jump out of local minima, increasing the chance of
finding the global minimum compared to Batch Gradient Descent.
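A minimal Stochastic Gradient Descent sketch with a simple learning schedule, again assuming the X_b and y arrays from the earlier examples (the schedule constants t0 and t1 are illustrative, not prescriptive):

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)  # gradually reduce the learning rate over time

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)  # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient based on a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients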
Polynomial Regression
Sometimes, data is more complex than a straight line can represent. A linear model can still be
used to fit nonlinear data by adding powers of each feature as new features. This technique is
called Polynomial Regression.
Example: Let’s generate some nonlinear data, based on a simple quadratic equation plus
some noise.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)  # add the square of each feature
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
Why Regularize?
• Prevent Overfitting: Regularization helps reduce overfitting by constraining the
model, making it less flexible and less likely to overfit the data.
• Degrees of Freedom: The fewer degrees of freedom a model has, the less likely it is to
fit the noise in the data.
1. Ridge Regression
• Ridge Regression, also known as Tikhonov regularization, adds a regularization term
to the Linear Regression cost function.
• The regularization term that is added (see the cost function below) forces the model to keep
the weights θ as small as possible.
• Purpose: To fit the data while keeping the model simple by having smaller weights.
The Ridge cost function is the Mean Squared Error (MSE) plus a term that penalizes large weights. The bias term θ0 is
not regularized.
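In its common form (α controls the regularization strength; some texts include an extra factor of 1/2), the Ridge Regression cost function is:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^{2}$$

The sum starts at i = 1, so the bias term θ0 is excluded.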
Elastic Net
• Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s regularization
terms, and you can control the mix ratio r.
• When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is
equivalent to Lasso Regression.
Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of
features is greater than the number of training instances or when several features are strongly
correlated.
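With mix ratio r and regularization strength α, the Elastic Net cost function is commonly written as:

$$J(\theta) = \mathrm{MSE}(\theta) + r\,\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \frac{1 - r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^{2}$$

In Scikit-Learn these regularized models can be used directly; a short sketch, where the α and r values are arbitrary and X, y stand for whatever training data is at hand:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge_reg = Ridge(alpha=1.0)  # L2 penalty
lasso_reg = Lasso(alpha=0.1)  # L1 penalty
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio corresponds to the mix ratio r
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])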
Early Stopping
Early stopping is a regularization technique used in iterative learning algorithms like Gradient
Descent to prevent overfitting. Instead of running the algorithm for a fixed number of
iterations or until the cost function converges, you monitor the model's performance on a
validation set and stop training when the validation error stops improving.
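A minimal sketch of early stopping with Scikit-Learn's SGDRegressor, assuming X_train, y_train, X_val, y_val have already been prepared (warm_start=True makes each call to fit() continue from where it left off instead of restarting):

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())  # one more epoch, continuing from previous weights
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:  # keep the best model seen on the validation set
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)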
Logistic Regression
Logistic Regression is used to estimate the probability that an instance belongs to a particular
class. If the estimated probability is greater than 50%, then the model predicts that the
instance belongs to that class (the positive class, or "1"); otherwise it predicts that it does not
(the negative class, or "0"). This makes it a binary classifier. (Unlike a plain linear model,
whose output is unbounded and can be pushed below 0 or above 1 by outliers, Logistic
Regression always outputs a value between 0 and 1.)
Estimating Probabilities
The Logistic Regression model computes a weighted sum of the input features plus a bias term,
and outputs the logistic (sigmoid) of the result.
Once the Logistic Regression model has estimated the probability P=hθ(x) that an instance x
belongs to the positive class, it can make its prediction ŷ easily
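In standard notation, the estimated probability and the resulting prediction are:

$$\hat{p} = h_{\theta}(\mathbf{x}) = \sigma\!\left(\theta^{\top}\mathbf{x}\right), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \ge 0.5 \end{cases}$$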
The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y=0).
This idea is captured by the cost function for a single training instance x, shown below.
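$$c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$$

This cost grows very large when the model estimates a probability close to 0 for a positive instance, or close to 1 for a negative instance.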
The cost function over the whole training set is simply the average cost over all training
instances. It can be written in a single expression called the log loss.
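In its usual form, the log loss is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\!\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - \hat{p}^{(i)}\right)\right]$$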
The partial derivative of the cost function with respect to the jth model parameter θj is given
by the following expression:
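$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma\!\left(\theta^{\top}\mathbf{x}^{(i)}\right) - y^{(i)}\right) x_j^{(i)}$$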
Softmax Regression
• The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers. This is called Softmax
Regression, or Multinomial Logistic Regression.
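In standard notation, Softmax Regression first computes a score sk(x) for each class k, then estimates each class probability with the softmax function; the predicted class is the one with the highest estimated probability:

$$s_k(\mathbf{x}) = \theta_k^{\top}\mathbf{x}, \qquad \hat{p}_k = \frac{\exp\!\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K}\exp\!\left(s_j(\mathbf{x})\right)}$$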
Hyperplane
A hyperplane is a decision boundary that separates a given set of data points having
different class labels. The SVM classifier separates data points using a hyperplane with the
maximum possible margin. This hyperplane is known as the maximum margin hyperplane,
and the linear classifier it defines is known as the maximum margin classifier.
Support Vectors
Support vectors are the sample data points that lie closest to the hyperplane. These points
determine the position of the separating line or hyperplane, since the margin is calculated from them.
Margin
A margin is the separation gap between the hyperplane and the closest data points on either side.
It is calculated as the perpendicular distance from the hyperplane to the support vectors (the closest
data points). In SVMs, we try to maximize this separation gap so that we get the maximum margin.
In SVMs, the main objective is to select a hyperplane with the maximum possible margin
between support vectors in the given dataset.
SVM searches for the maximum margin hyperplane in the following two-step process:
1. Generate hyperplanes that segregate the classes in the best possible way. There are
many hyperplanes that might classify the data, so we should look for the hyperplane
that represents the largest separation, or margin, between the two classes.
2. Choose the hyperplane so that the distance from it to the support vectors on each side
is maximized. If such a hyperplane exists, it is known as the maximum margin
hyperplane, and the linear classifier it defines is known as a maximum margin
classifier.
The following diagram illustrates the concept of maximum margin and maximum margin
hyperplane.
source: https://www.kaggle.com/code/prashant111/svm-classifier-tutorial#1.-Introduction-to-Support-Vector-Machines-
The objective function in Soft Margin Classification includes a penalty for misclassified
points. This penalty is proportional to the distance of the points from the correct side of the
margin. Mathematically, this is expressed by adding a term to the cost function that penalizes
errors, weighted by C.
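A minimal Scikit-Learn sketch of a soft-margin linear SVM classifier; the choice of the iris dataset, the two petal features, and C = 1 are illustrative assumptions rather than anything prescribed above:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # binary target: Iris-Virginica or not

svm_clf = Pipeline([
    ("scaler", StandardScaler()),  # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])

A smaller C allows more margin violations but gives a wider margin; a larger C penalizes violations more heavily.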
Linear SVM classifiers are efficient and work well in many cases, but some datasets are not
linearly separable. One way to handle these nonlinear datasets is to add more features, like
polynomial features, which can sometimes make the dataset linearly separable.
Example: Consider a dataset with one feature x1. This dataset is not linearly separable (as
shown in the left plot of the figure). However, if you add a second feature x2 = (x1)², the
resulting 2D dataset becomes linearly separable.
Consider a dataset for binary classification in which the data points are shaped as two
interleaving half circles.
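A sketch of this approach, assuming the half-circle data comes from Scikit-Learn's make_moons generator (the degree, C, and noise values are illustrative):

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)  # two interleaving half circles

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),  # add polynomial features explicitly
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)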
Polynomial Kernel
• Adding polynomial features is simple to implement and can work great with all sorts
of Machine Learning algorithms, but at a low polynomial degree it cannot deal with
very complex datasets, and with a high polynomial degree it creates a huge number of
features, making the model too slow.
• The kernel trick is a powerful technique used in Support Vector Machines (SVMs) to
handle nonlinear datasets without explicitly mapping the data to a higher-dimensional
space. Instead, it uses kernel functions to compute the similarity between data points
in this higher dimensional space directly, saving computational resources and
simplifying the process.
• The Gaussian Radial Basis Function (RBF), φγ(x, ℓ) = exp(–γ ∥x – ℓ∥²), is a bell-shaped
similarity function varying from 0 (very far away from the landmark ℓ) to 1 (at the
landmark); the example below uses γ = 0.3. Now we are ready to compute the new features.
• For example, let’s look at the instance x1 = –1: it is located at a distance of 1 from the
first landmark, and 2 from the second landmark.
• Therefore, its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30.
The plot on the right of Figure shows the transformed dataset (dropping the original
features). As you can see, it is now linearly separable.
The models are trained with different values of hyperparameters gamma (γ) and C. Increasing
gamma makes the bell-shape curve narrower (see the above left plot of Figure), and as a
result each instance’s range of influence is smaller: the decision boundary ends up being
more irregular, wiggling around individual instances. Conversely, a small gamma value
makes the bell-shaped curve wider, so instances have a larger range of influence, and the
decision boundary ends up smoother. So γ acts like a regularization hyperparameter: if your
model is overfitting, you should reduce it, and if it is underfitting, you should increase it.
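A sketch of kernelized SVM classifiers in Scikit-Learn; the hyperparameter values are illustrative, and X, y stand for a nonlinear dataset such as the make_moons data above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Polynomial kernel: behaves as if many polynomial features had been added,
# without actually creating them (the kernel trick)
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])

# Gaussian RBF kernel: gamma controls the width of the bell-shaped curve
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)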
SVM Regression
Support Vector Machine (SVM) regression is a versatile method that supports both linear and
nonlinear regression. The primary goal in SVM regression is to fit as many data points as
possible within a predefined margin while limiting margin violations (i.e., points outside the
margin).
• In SVM classification, the objective is to maximize the margin between classes.
However, in SVM regression, the aim is to fit as many data points as possible within a
margin (referred to as the "street").
• The width of this street is controlled by the hyperparameter 𝜖. Only points that fall
outside this margin affect the model. A larger ϵ results in a wider street, leading to
fewer points outside the margin, while a smaller 𝜖 results in a narrower street.
The figure below shows two linear SVM Regression models trained on some random linear data,
one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).
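A minimal sketch of those two models, assuming X, y hold some 1D linear training data:

from sklearn.svm import LinearSVR

svm_reg_wide = LinearSVR(epsilon=1.5)  # large margin (wide street)
svm_reg_narrow = LinearSVR(epsilon=0.5)  # small margin (narrow street)
svm_reg_wide.fit(X, y.ravel())
svm_reg_narrow.fit(X, y.ravel())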
• Linear SVM Regression: The model tries to find a linear function that fits within the
margin.
• For nonlinear regression tasks, kernelized SVM models are used. The kernel trick
allows the SVM to perform in a higher-dimensional space without explicitly
transforming the data.
The figure below shows SVM Regression on a random quadratic training set, using a 2nd-degree
polynomial kernel. There is little regularization on the left plot (i.e., a large C value), and
much more regularization on the right plot (i.e., a small C value).
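A sketch of the two kernelized models in that figure, assuming X, y hold some quadratic training data (C = 100 gives little regularization, C = 0.01 much more):

from sklearn.svm import SVR

svm_poly_reg1 = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)  # little regularization
svm_poly_reg2 = SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1)  # much more regularization
svm_poly_reg1.fit(X, y.ravel())
svm_poly_reg2.fit(X, y.ravel())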