ML - WEEK 06
GRADIENT DESCENT
1. Initialization:
Random starting point on the cost function
2. Gradient Calculation:
Derivative of the cost function with respect to parameters (weights and
bias)
3. Parameter Update:
Adjust parameters in the direction of steepest descent
Learning rate determines the step size
4. Convergence:
Iterative process until the minimum point is reached
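As a compact summary of these steps, the parameter update performed at every iteration can be written as (using generic symbols not defined elsewhere in these slides: θ for the parameters, J for the cost function, α for the learning rate):

\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta} J(\theta_t)

Each iteration moves the parameters one step of size α against the gradient, i.e. in the direction of steepest descent.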
MINIMIZING THE COST FUNCTION
Goal:
Minimize the difference between predicted and actual values
Key Components:
Direction (gradient)
Step size (learning rate)
Convergence:
Local or global minimum
LEARNING RATE (ALSO REFERRED TO AS STEP SIZE OR ALPHA)
The learning rate is the size of the steps taken to reach the local minimum.
It is typically a small value, and it is evaluated and updated based on the behavior of the cost function.
A high learning rate results in larger steps but risks overshooting the minimum.
Conversely, a low learning rate takes small steps.
While small steps have the advantage of more precision, the larger number of iterations compromises overall efficiency, since reaching the minimum takes more time and computation.
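As an illustrative sketch (not from the slides), the snippet below runs gradient descent on the simple cost function J(w) = w^2 with two learning rates; the cost function and alpha values are arbitrary choices meant only to show slow, steady convergence versus overshooting.

# Effect of the learning rate on gradient descent applied to J(w) = w**2,
# whose derivative is dJ/dw = 2*w. Values chosen purely for demonstration.

def gradient_descent(alpha, w=5.0, steps=10):
    """Run a few gradient descent steps and return the visited values of w."""
    history = [w]
    for _ in range(steps):
        grad = 2 * w          # gradient of the cost at the current position
        w = w - alpha * grad  # update: step of size alpha against the gradient
        history.append(w)
    return history

print("small alpha (0.1):", gradient_descent(alpha=0.1))  # steady convergence toward 0
print("large alpha (1.1):", gradient_descent(alpha=1.1))  # overshoots and diverges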
EFFECTS OF LEARNING RATE ON GRADIENT DESCENT
COST (OR LOSS) FUNCTION
The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position.
This improves the machine learning model's efficacy by providing feedback to the
model so that it can adjust the parameters to minimize the error and find the local
or global minimum.
It continuously iterates, moving along the direction of steepest descent (or the
negative gradient) until the cost function is close to or at zero.
At this point, the model will stop learning.
While the terms cost function and loss function are often used synonymously, there is a slight difference between them:
a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
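As a small sketch of this distinction (using squared error, one common choice, and made-up numbers), the loss is computed per example and the cost is the average loss over the training set:

# Loss vs. cost using squared error; the y values are made up for illustration.

y_actual    = [3.0, 5.0, 7.0, 9.0]   # actual targets
y_predicted = [2.5, 5.5, 6.0, 9.5]   # model predictions

def loss(y, y_hat):
    """Loss: squared error of a single training example."""
    return (y - y_hat) ** 2

# Cost: average loss across the entire training set (mean squared error).
cost = sum(loss(y, y_hat) for y, y_hat in zip(y_actual, y_predicted)) / len(y_actual)

print([loss(y, y_hat) for y, y_hat in zip(y_actual, y_predicted)])  # per-example losses
print(cost)                                                         # cost over the whole set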
Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer,
gauging its accuracy with each iteration of parameter updates.
Until the function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible
error.
Once machine learning models are optimized for accuracy, they can
be powerful tools for artificial intelligence (AI) and computer science
applications.
HOW DOES GRADIENT DESCENT WORK?
The model begins training with randomized weights and biases near zero, and then repeats the following steps:
1. Calculate the loss with the current weight and bias values.
2. Determine the direction to move the weights and bias that reduces loss.
3. Move the weight and bias values a small amount in the direction that reduces loss.
4. Return to step one and repeat the process until the model can't reduce the loss any further.
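To make these steps concrete, here is a minimal sketch (not taken from the slides) of this loop for a one-feature linear model y ≈ w*x + b trained with mean squared error; the data, learning rate, and stopping threshold are arbitrary assumptions for the example.

# Gradient descent loop for a one-feature linear model y_hat = w*x + b with MSE loss.
# Data and hyperparameters are made up for the demo.

x_data = [1.0, 2.0, 3.0, 4.0]
y_data = [3.0, 5.0, 7.0, 9.0]      # roughly y = 2x + 1

w, b = 0.0, 0.0                    # start with weight and bias near zero
alpha = 0.05                       # learning rate
prev_loss = float("inf")

for step in range(10_000):
    # 1. Calculate the loss with the current weight and bias.
    preds = [w * x + b for x in x_data]
    loss = sum((y - p) ** 2 for y, p in zip(y_data, preds)) / len(x_data)

    # 2. Determine the direction that reduces loss (gradients of MSE w.r.t. w and b).
    grad_w = sum(-2 * x * (y - p) for x, y, p in zip(x_data, y_data, preds)) / len(x_data)
    grad_b = sum(-2 * (y - p) for y, p in zip(y_data, preds)) / len(x_data)

    # 3. Move the weight and bias a small amount in that direction.
    w -= alpha * grad_w
    b -= alpha * grad_b

    # 4. Repeat until the loss can no longer be meaningfully reduced.
    if abs(prev_loss - loss) < 1e-9:
        break
    prev_loss = loss

print(w, b, loss)   # w and b should approach 2 and 1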
ITERATIVE STEPS IN GRADIENT DESCENT
TYPES OF GRADIENT DESCENT
Batch gradient descent sums the error for each point in a training set, updating the model only after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computational efficiency, it can still have a long processing time for large training datasets, as it needs to hold all of the data in memory.
Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point isn't the most ideal, finding a local minimum rather than the global one.
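As an illustrative sketch (reusing the same toy one-feature linear model and made-up data as above), one epoch of batch gradient descent accumulates the gradient over every training example and then applies a single parameter update:

# Batch gradient descent: the gradient is summed over every training example,
# and the parameters are updated once per epoch. Toy data for illustration only.

x_data = [1.0, 2.0, 3.0, 4.0]
y_data = [3.0, 5.0, 7.0, 9.0]

w, b, alpha = 0.0, 0.0, 0.05

for epoch in range(1000):
    grad_w, grad_b = 0.0, 0.0
    # Accumulate the gradient contribution of every example in the training set.
    for x, y in zip(x_data, y_data):
        error = (w * x + b) - y
        grad_w += 2 * error * x
        grad_b += 2 * error
    # One update per epoch, using the gradient averaged over the whole set.
    w -= alpha * grad_w / len(x_data)
    b -= alpha * grad_b / len(x_data)

print(w, b)   # approaches w = 2, b = 1

Stochastic gradient descent, named on the next slide, would instead move the update inside the inner loop, adjusting w and b after each individual example.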
STOCHASTIC GRADIENT DESCENT
BIAS VARIANCE TRADE-OFF
What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
A model with high bias pays very little attention to the training data and oversimplifies the model.
This always leads to high error on both training and test data.
What is variance?
Variance is the variability of a model's prediction for a given data point, or a value which tells us the spread of our data.
A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
As a result, such models perform very well on training data but have high error rates on test data.
When a model has high variance, it is said to be overfitting the data.
Overfitting means fitting the training set very accurately with a complex, high-order curve, but it is not the solution because the error on unseen data is high.
When training a model, the variance should be kept low.
BIAS VARIANCE TRADE-OFF – MATHEMATICALLY
Let the variable we are trying to predict be Y and the other covariates be X. We assume there is a relationship between the two such that

Y = f(X) + e

where e is the error term, normally distributed with a mean of 0.

We will build a model \hat{f}(X) of f(X) using linear regression or any other modeling technique. The expected squared error at a point x is then:

Err(x) = E[(Y - \hat{f}(x))^2]
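This expected error can be decomposed into the three standard components (the usual bias-variance decomposition, written with the notation defined above):

Err(x) = \left( E[\hat{f}(x)] - f(x) \right)^2 + E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right] + \sigma_e^2

Err(x) = Bias^2 + Variance + Irreducible Error

The irreducible error \sigma_e^2 comes from the noise term e and cannot be reduced by any model, so the trade-off is between the bias term and the variance term.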
Regularization
Bagging
Boosting
REGULARIZATION IN ML