CSC – 368

WEEK 06
GRADIENT DESCENT

Dr. Sadaf Hussain


Asst. Professor CS
Faculty of Computer Science
Lahore Garrison University
GRADIENT DESCENT

 Gradient descent is a mathematical technique that iteratively finds
the weights and bias that produce the model with the lowest loss.
It finds the best weights and bias by repeating the following
process for a user-defined number of iterations.
 Gradient descent is an optimization algorithm commonly used to
train machine learning models and neural networks. It trains
models by minimizing the error between predicted and actual
results.
INTRODUCTION TO GRADIENT DESCENT

 Recap of Linear Regression:
 Slope-intercept form: y = mx + b
 Line of best fit and mean squared error (MSE)

 Gradient Descent Overview:
 Iterative optimization algorithm
 Minimizes a convex function (cost function)
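As a quick refresher on the cost being minimized, here is a minimal sketch of mean squared error for the line y = mx + b (the data values are illustrative, not from the slides):

import numpy as np

def mse(m, b, x, y):
    """Mean squared error of the line y_hat = m*x + b against targets y."""
    y_hat = m * x + b
    return np.mean((y - y_hat) ** 2)

# Illustrative data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(mse(2.0, 0.0, x, y))  # near-fit line -> small error
print(mse(0.5, 0.0, x, y))  # poor-fit line -> large error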
THE GRADIENT DESCENT PROCESS

1. Initialization:
 Random starting point on the cost function

2. Gradient Calculation:
 Derivative of the cost function with respect to parameters (weights and
bias)
3. Parameter Update:
 Adjust parameters in the direction of steepest descent
 Learning rate determines the step size

4. Convergence:
 Iterative process until the minimum point is reached
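A minimal Python sketch tying the four steps together for simple linear regression (the synthetic data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)  # true line: y = 2x + 1, plus noise

m, b = 0.0, 0.0   # 1. Initialization (zeros here; a random start also works)
lr = 0.01         # learning rate (step size)

for _ in range(1000):
    y_hat = m * x + b
    dm = -2 * np.mean(x * (y - y_hat))  # 2. Gradient of MSE w.r.t. m
    db = -2 * np.mean(y - y_hat)        #    ... and w.r.t. b
    m -= lr * dm                        # 3. Step in the direction of
    b -= lr * db                        #    steepest descent
# 4. Convergence: after enough iterations, m ≈ 2 and b ≈ 1
print(m, b)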
MINIMIZING THE COST FUNCTION

 Goal:
 Minimize the difference between predicted and actual values

 Key Components:
 Direction (gradient)
 Step size (learning rate)

 Convergence:
 Local or global minimum
LEARNING RATE (ALSO REFERRED TO AS STEP SIZE OR ALPHA)
 It is the size of the steps that are taken to
reach the local minimum.
 This is typically a small value, and it is
evaluated and updated based on the
behavior of the cost function.
 A high learning rate results in larger steps but
risks overshooting the minimum.
 Conversely, a low learning rate takes small
steps.
 While small steps offer more precision, the
larger number of iterations compromises
overall efficiency, as it takes more time and
computation to reach the minimum.
EFFECTS OF LEARNING RATE ON GRADIENT DESCENT
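A minimal sketch of these effects on the toy cost J(w) = w², whose gradient is 2w (the learning-rate values are illustrative):

def descend(lr, w0=5.0, steps=20):
    """Run gradient descent on J(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w

print(descend(0.01))  # too small: converges slowly, still far from 0
print(descend(0.1))   # reasonable: approaches the minimum at w = 0
print(descend(1.1))   # too large: overshoots every step and diverges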
COST (OR LOSS) FUNCTION
 The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position.
 This improves the machine learning model's efficacy by providing feedback to the
model so that it can adjust the parameters to minimize the error and find the local
or global minimum.
 Gradient descent iterates continuously, moving along the direction of steepest
descent (the negative gradient) until the cost function is close to or at zero.
 At this point, the model will stop learning.
 Additionally, while the terms cost function and loss function are often treated as
synonymous, there is a slight difference between them.
 A loss function refers to the error of one training example, while a cost function
calculates the average error across an entire training set.
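A one-line illustration of that distinction (the prediction values are hypothetical):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

losses = (y_true - y_pred) ** 2  # loss: error of each individual example
cost = losses.mean()             # cost: average error over the training set
print(losses, cost)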
 Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer,
gauging its accuracy with each iteration of parameter updates.
 Until the function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible
error.
 Once machine learning models are optimized for accuracy, they can
be powerful tools for artificial intelligence (AI) and computer science
applications.
HOW DOES GRADIENT DESCENT WORK?
 The model begins training with randomized weights and biases near zero, and then repeats
the following steps:

1. Calculate the loss with the current weight and bias.

2. Determine the direction to move the weights and bias that reduce loss.

3. Move the weight and bias values a small amount in the direction that reduces loss.

4. Return to step one and repeat the process until the model can't reduce the loss any
further.
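Step 4, "until the model can't reduce the loss any further," is typically coded as an explicit stopping criterion. A self-contained sketch (the data, learning rate, and tolerance are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 2x + 1
w, b, lr, tol = 0.0, 0.0, 0.01, 1e-12
prev_loss = float("inf")

while True:
    y_hat = w * x + b
    loss = np.mean((y - y_hat) ** 2)         # step 1: loss at current w, b
    if prev_loss - loss < tol:               # step 4: no meaningful reduction
        break
    w -= lr * -2 * np.mean(x * (y - y_hat))  # steps 2-3: gradient step
    b -= lr * -2 * np.mean(y - y_hat)
    prev_loss = loss

print(w, b)  # approaches w = 2, b = 1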
ITERATIVE STEPS IN GRADIENT DESCENT
TYPES OF GRADIENT DESCENT

 Batch gradient descent
 Stochastic gradient descent
 Mini-batch gradient descent
BATCH GRADIENT DESCENT

 Batch gradient descent sums the error for each point in the training
set, updating the model only after all training examples have been
evaluated. This process is referred to as a training epoch.
 While this batching provides computational efficiency, it can still have
a long processing time for large training datasets, as it needs to
hold all of the data in memory.
 Batch gradient descent also usually produces a stable error gradient
and convergence, but sometimes that convergence point isn't ideal,
finding a local minimum rather than the global one.
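A sketch of one training epoch under this scheme, continuing the linear-regression example from above (the function and variable names are my own):

import numpy as np

def batch_gd_epoch(w, b, x, y, lr=0.01):
    """One epoch of batch gradient descent: a single update using ALL examples."""
    y_hat = w * x + b
    w -= lr * -2 * np.mean(x * (y - y_hat))  # gradient averaged over the full set
    b -= lr * -2 * np.mean(y - y_hat)
    return w, b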
STOCHASTIC GRADIENT DESCENT

 Stochastic gradient descent (SGD) updates the model's parameters
after each individual training example, rather than after a full
pass through the dataset.
 Since only one training example needs to be held at a time, it is
easy to store in memory.
 While these frequent updates can offer more detail and speed, they
can reduce computational efficiency compared to batch gradient
descent. The frequent updates also produce noisy gradients, but
this noise can be helpful for escaping a local minimum and finding
the global one.
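The same epoch written as SGD, with one update per example (a sketch under the same assumptions as the batch version):

import numpy as np

def sgd_epoch(w, b, x, y, lr=0.01):
    """One epoch of SGD: a parameter update after EACH individual example."""
    for i in np.random.permutation(len(x)):  # visit examples in random order
        y_hat = w * x[i] + b
        w -= lr * -2 * x[i] * (y[i] - y_hat)
        b -= lr * -2 * (y[i] - y_hat)
    return w, b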
MINI-BATCH GRADIENT DESCENT

 Mini-batch gradient descent combines concepts from both batch
gradient descent and stochastic gradient descent. It splits the
training dataset into small batches and performs an update on
each of those batches. This approach strikes a balance between the
computational efficiency of batch gradient descent and the speed of
stochastic gradient descent.
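And the mini-batch variant, one update per batch (the batch size of 32 is an illustrative choice):

import numpy as np

def minibatch_gd_epoch(w, b, x, y, lr=0.01, batch_size=32):
    """One epoch of mini-batch gradient descent: one update per small batch."""
    idx = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        y_hat = w * x[batch] + b
        w -= lr * -2 * np.mean(x[batch] * (y[batch] - y_hat))  # batch average
        b -= lr * -2 * np.mean(y[batch] - y_hat)
    return w, b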
CHALLENGES WITH GRADIENT DESCENT

 Local minima and saddle points
 Vanishing and exploding gradients
MATHEMATICAL FORMULATION OF GD
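In standard notation (θ for the parameters, α for the learning rate, J for the cost function), the update rule applied at every iteration is:

\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta} J(\theta_t)

For the linear-regression MSE cost used in the earlier slides, the gradient components work out to:

\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right),
\qquad
\frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)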
BIAS VARIANCE TRADE-OFF

 What is bias?
 Bias is the difference between the
average prediction of our model and the
correct value which we are trying to
predict. A model with high bias pays very
little attention to the training data and
oversimplifies the model, leading to
high error on training and test data.
BIAS VARIANCE TRADE-OFF

 What is variance?
 Variance is the variability of model prediction for a
given data point, or a value which tells us the spread of
our data. A model with high variance pays a lot of
attention to the training data and does not generalize to
data it hasn't seen before. As a result, such models
perform very well on training data but have high error
rates on test data.
 When a model has high variance, it is said to be
overfitting the data. Overfitting means fitting the
training set very accurately with a complex, high-order
curve, but it is not a good solution because the error
on unseen data is high. When training a model, the
variance should be kept low.
BIAS VARIANCE TRADE-OFF – MATHEMATICALLY
 Let the variable we are trying to predict be Y and the other covariates be X. We assume there is a
relationship between the two such that

Y = f(X) + e

 where e is the error term, normally distributed with a mean of 0.

 We will build a model \hat{f}(X) of f(X) using linear regression or any other modeling technique. The
expected squared error at a point x is:

Err(x) = E\left[ \left( Y - \hat{f}(x) \right)^2 \right]

 Err(x) can be further decomposed as

Err(x) = \left( E[\hat{f}(x)] - f(x) \right)^2 + E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right] + \sigma_e^2

 Err(x) is the sum of Bias², Variance, and the irreducible error \sigma_e^2.

BIAS AND VARIANCE USING BULLS-EYE DIAGRAM

 The center of the target is a model that
perfectly predicts the correct values. As
we move away from the bulls-eye,
our predictions get worse and worse.
We can repeat our process of model
building to get separate hits on the
target.
WHY IS THERE A BIAS VARIANCE TRADEOFF?
 If our model is too simple and has very
few parameters, then it may have high
bias and low variance. On the other
hand, if our model has a large number
of parameters, then it's going to have
high variance and low bias. So we need
to find the right balance without
overfitting or underfitting the data.
 This tradeoff in complexity is why there
is a tradeoff between bias and variance:
an algorithm can't be more complex
and less complex at the same time.
 Total Error
 To build a good model, we need to find a good
balance between bias and variance such that it
minimizes the total error.
ACHIEVING BIAS VARIANCE TRADEOFF

 Regularization
 Bagging
 Boosting
REGULARIZATION IN ML

 Regularization is a set of methods for reducing overfitting
in machine learning models. Typically, regularization
trades a marginal decrease in training accuracy for an
increase in generalizability. Common techniques include:
 Ridge Regularization (L2)
 Lasso (L1)
 Elastic Net
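A minimal sketch of the three methods using scikit-learn (the synthetic data and the alpha penalty strengths are illustrative choices, not values from the slides):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# True coefficients include zeros, so feature selection is visible
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Lasso and Elastic Net tend to drive some coefficients to exactly zero;
# Ridge only shrinks them toward zero.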
ROLE OF REGULARIZATION
 Complexity Control:
 Regularization helps control model complexity by preventing overfitting to training data, resulting in better generalization to new
data.
 Preventing Overfitting:
 One way to prevent overfitting is to use regularization, which penalizes large coefficients and constrains their magnitudes, thereby
preventing a model from becoming overly complex and memorizing the training data instead of learning its underlying patterns.
 Balancing Bias and Variance:
 Regularization can help balance the trade-off between model bias (underfitting) and model variance (overfitting) in machine
learning, which leads to improved performance.
 Feature Selection:
 Some regularization methods, such as L1 regularization (Lasso), promote sparse solutions that drive some feature coefficients to
zero. This automatically selects important features while excluding less important ones.
 Handling Multicollinearity:
 When features are highly correlated (multicollinearity), regularization can stabilize the model by reducing coefficient sensitivity to
small data changes.
 Generalization:
 Regularized models learn underlying patterns of data for better generalization to new data, instead of memorizing specific
examples.
RIDGE REGULARIZATION (L2)
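 Ridge regression adds a squared-magnitude (L2) penalty to the cost function; in the notation of the earlier slides (λ ≥ 0 is the regularization strength):

J(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j} w_j^2

 The penalty shrinks large coefficients toward zero without eliminating them entirely, unlike Lasso's L1 penalty.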
