ML 21-22 Sem
GROUP – B
Marks CO No.
Model Parameter:
Model parameters are the internal variables or coefficients that define the structure of the model
and are learned from the training data during the optimization process.
These parameters directly influence the predictions made by the model and are adjusted
through the learning process to minimize the error between the predicted and actual outcomes.
Examples of model parameters include the weights in a neural network, the coefficients in a
linear regression model, or the centroids in a K-means clustering algorithm.
Model parameters are intrinsic to the model and are typically learned automatically during
training.
Hyperparameter:
Hyperparameters are configuration settings external to the model that control the behavior and
performance of the learning algorithm.
They are not learned from the data but are instead specified before the learning process begins
and remain constant throughout training.
Hyperparameters affect the learning process itself, including the optimization algorithm,
regularization techniques, and other aspects of model training.
Examples of hyperparameters include the learning rate in gradient descent optimization, the
regularization parameter in ridge regression, or the number of layers in a neural network.
Hyperparameters need to be tuned and selected carefully to optimize the performance of the
model, often through techniques like grid search, random search, or Bayesian optimization.
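The distinction can be illustrated with a minimal sketch. It assumes a one-feature linear model trained by gradient descent on toy data; the data, the learning rate, and the number of epochs are illustrative choices, not values from the question paper. The learning rate and epoch count are hyperparameters fixed before training, while w and b are model parameters learned from the data.

```python
# Hyperparameters: chosen before training and held fixed (illustrative values).
learning_rate = 0.05
n_epochs = 2000

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Model parameters: start at arbitrary values and are learned from the data.
w, b = 0.0, 0.0

n = len(xs)
for _ in range(n_epochs):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Changing learning_rate or n_epochs changes how training proceeds, but neither value is ever updated by the training loop itself.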
Validation set
The validation set is a held-out portion of the data, usually smaller than the training set, that
you use to tune your machine learning algorithm. It contains input features and output labels
that are not used for training; instead, they are used to evaluate how well the algorithm
performs on data it has not seen. The validation set is used to compare different versions of
your algorithm, such as different hyperparameters, architectures, or regularization methods, and
to select the best one based on a metric such as accuracy, precision, or recall.
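As a rough sketch of how a validation set is used, the example below picks a regularization strength for a one-feature ridge regression. The data, the candidate values, and the helper names fit_ridge_1d and mse are all hypothetical, introduced only for illustration. Each candidate is fitted on the training set and scored on the validation set.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (single feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, roughly y = 3x with some noise.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.2, 5.9, 9.1, 11.8]
val_x, val_y = [1.5, 2.5, 3.5], [4.4, 7.5, 10.4]

# Candidate hyperparameter values to compare (illustrative).
candidates = [0.0, 0.1, 1.0, 10.0]

# Fit on the training set only; score each candidate on the validation set.
scores = {lam: mse(fit_ridge_1d(train_x, train_y, lam), val_x, val_y)
          for lam in candidates}
best_lam = min(scores, key=scores.get)
print("validation MSE per candidate:", scores)
print("selected regularization strength:", best_lam)
```

The validation data never enters fit_ridge_1d, so the score reflects performance on examples the fitted model has not seen.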
3. (a) Which Linear Regression training algorithm can you use if you have a
3 1,3
If you have a training set with millions of features, solving Ordinary Least Squares (OLS) in
closed form (the Normal Equation) may not be suitable, because it requires inverting a matrix
whose size grows with the number of features, which becomes computationally prohibitive, and
the fit may also overfit. In such cases, you can use the following linear regression training
algorithms, which scale better to high-dimensional data:
Gradient Descent:
Gradient Descent is an iterative optimization algorithm used to minimize the cost function (e.g.,
Mean Squared Error) by adjusting the model parameters iteratively.
It works well for large-scale datasets with many features because it updates the model
parameters based on the gradients of the cost function, rather than computing the inverse of a
large matrix, which can be computationally expensive.
Variants of Gradient Descent, such as Stochastic Gradient Descent (SGD), Mini-batch Gradient
Descent, and Adam, can be used to further improve efficiency and convergence speed.
Stochastic Gradient Descent (SGD):
SGD is a variant of Gradient Descent where the model parameters are updated using a single
training example (or a small subset, known as a mini-batch) at a time.
It is well-suited for large-scale datasets because it requires only a small portion of the dataset to
compute each parameter update, making it computationally efficient and scalable.
SGD can handle high-dimensional data efficiently and is commonly used in machine learning
frameworks for training linear regression models with large feature sets.
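The per-example update described above can be sketched as follows. This is a minimal illustration, assuming a squared-error loss, a constant learning rate, and toy noise-free data; the function name sgd_linear_regression and all numeric values are hypothetical.

```python
import random

def sgd_linear_regression(X, y, learning_rate=0.05, n_epochs=500, seed=0):
    """Fit y ~ w . x + b by updating on one shuffled example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            error = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            # Gradient of the squared error for this single example only.
            w = [wj - learning_rate * 2 * error * xj for wj, xj in zip(w, X[i])]
            b -= learning_rate * 2 * error
    return w, b

# Toy noise-free data from y = 1*x1 + 2*x2 + 3 (assumed for illustration).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [x1 + 2 * x2 + 3 for x1, x2 in X]

w, b = sgd_linear_regression(X, y)
print("weights:", [round(v, 3) for v in w], "bias:", round(b, 3))
```

Each update touches only one training example, so the cost of a step does not depend on the size of the dataset.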
Coordinate Descent:
Coordinate Descent is an optimization algorithm that updates one model parameter at a time
while holding others fixed.
It is particularly useful for sparse datasets with many zero-valued features, because each
coordinate update only needs the training examples in which that feature is non-zero.
Coordinate Descent can be parallelized and scaled to handle large feature sets efficiently,
making it suitable for linear regression with millions of features.
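A minimal sketch of the idea for plain least squares is shown below, assuming no intercept and no regularization. The function name coordinate_descent and the toy data are hypothetical. Each inner step solves exactly for one weight while the others stay fixed, and no matrix is inverted.

```python
def coordinate_descent(X, y, n_sweeps=100):
    """Least-squares fit (no intercept) by exact minimization over one weight at a time."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    residual = list(y)  # residual = y - Xw, and w starts at zero
    for _ in range(n_sweeps):
        for j in range(n_features):
            column = [X[i][j] for i in range(n_samples)]
            norm_sq = sum(v * v for v in column)
            if norm_sq == 0.0:
                continue  # an all-zero feature cannot change the fit
            # Best value of w[j] with every other weight held fixed.
            rho = sum(v * (r + v * w[j]) for v, r in zip(column, residual))
            new_wj = rho / norm_sq
            # Rows where this feature is zero are left unchanged by this update.
            residual = [r - v * (new_wj - w[j]) for v, r in zip(column, residual)]
            w[j] = new_wj
    return w

# Toy data from y = 2*x1 - 1*x2 (assumed for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]

w = coordinate_descent(X, y)
print("weights:", [round(v, 4) for v in w])
```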
No, Gradient Descent is not susceptible to being stuck in a local minimum when training a
Logistic Regression model.
Here's why:
Convex Cost Function: In Logistic Regression, the cost function (often the negative
log-likelihood or the cross-entropy loss) is convex. This means that it has no local minima other
than the global minimum. Therefore, Gradient Descent, being a first-order optimization
algorithm, will approach the global minimum regardless of the initial parameters, provided the
learning rate is not too large and it is run for long enough.
Smoothness of the Cost Function: The cost function in Logistic Regression is smooth and
continuous, without abrupt changes or local irregularities that could lead Gradient Descent to
get stuck in local minima. This smoothness ensures that Gradient Descent can navigate the
parameter space effectively towards the global minimum.
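The effect of convexity can be checked with a small experiment. The sketch below assumes a one-feature logistic model, a toy dataset that is deliberately not linearly separable (so that a finite minimum exists), and illustrative values for the learning rate and step count. Gradient Descent is started from two very different initial points.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, w, b, learning_rate=0.5, n_steps=5000):
    """Batch gradient descent on the mean cross-entropy loss of sigmoid(w*x + b)."""
    n = len(xs)
    for _ in range(n_steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Toy data that is not linearly separable (assumed for illustration).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

# Two very different starting points.
w_a, b_a = train_logistic(xs, ys, w=-5.0, b=5.0)
w_b, b_b = train_logistic(xs, ys, w=5.0, b=-5.0)
print(f"run A: w={w_a:.4f}, b={b_a:.4f}")
print(f"run B: w={w_b:.4f}, b={b_b:.4f}")
```

Both runs end at the same parameters, which is what one would expect when the only minimum is the global one.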
2 1,3
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions.
While training, the model learns these patterns in the dataset and applies them to test data for
prediction. While making predictions, a difference occurs between the values predicted by the
model and the actual (expected) values; the part of this difference that comes from the model's
simplifying assumptions is known as bias error, or error due to bias. It can be defined as the
inability of a machine learning algorithm such as Linear Regression to capture the true
relationship between the data points. Each algorithm begins with some amount of bias because
bias arises from assumptions in the model, which make the target function simpler to learn. A
model has either:
Low Bias: A low bias model will make fewer assumptions about the form of the target function.
High Bias: A model with a high bias makes more assumptions, and the model becomes unable to
capture the important features of our dataset. A high bias model also cannot perform well on
new data.
Generally, a linear algorithm has high bias, which also makes it fast to learn. The simpler the
algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm often has low
bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest
Neighbours and Support Vector Machines. Examples of algorithms with high bias are Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
VARIANCE:
Variance specifies how much the prediction would change if different training data were used.
In simple words, variance tells how much a random variable differs from its expected value.
Ideally, a model should not vary too much from one training dataset to another, which means
the algorithm should be good at capturing the underlying mapping between the input and
output variables. A model has either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in the
prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot from the training dataset and performs well on it,
but does not generalize well to unseen data. As a result, such a model gives good results
with the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to overfitting of
the model. A model with high variance has the below problems:
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance.
The trade-off between bias and variance is a fundamental concept in machine learning that
relates to the performance of a model. Let's break down this trade-off:
Bias:
Bias refers to the error introduced by approximating a real-world problem with a simplified
model.
A model with high bias tends to make strong assumptions about the underlying data
distribution and may not capture the true relationship between features and the target variable.
Examples of high-bias models include linear regression with too few features or decision trees
with limited depth.
Variance:
Variance refers to the model's sensitivity to fluctuations or noise in the training data.
A model with high variance is highly flexible and captures even the smallest fluctuations in the
training data, leading to a high sensitivity to noise.
Examples of high-variance models include deep neural networks with many layers or decision
trees with high depth.
When a model has high bias and low variance, it means that it makes strong assumptions about
the data and is relatively insensitive to changes in the training set.
This can result in underfitting, where the model fails to capture the underlying patterns in the
data.
Examples include a linear regression model that fits a straight line to data that is inherently
non-linear.
Conversely, when a model has low bias and high variance, it means that it is highly flexible and
captures fine details in the training data.
However, this flexibility can lead to overfitting, where the model learns the noise in the training
data rather than the underlying patterns.
As a result, the model may perform well on the training set but generalize poorly to unseen data.
Examples include complex neural networks or decision trees with high depth that capture noise
in the training data.
The goal in machine learning is to find the right balance between bias and variance, known as
the bias-variance trade-off:
Balanced Model:
A well-balanced model has a level of complexity that keeps the combined error from bias and
variance as low as possible.
It captures the underlying patterns in the data without being overly influenced by noise.
In summary, the trade-off between bias and variance is about finding the level of model
complexity at which the total of the errors due to bias and the errors due to variance is smallest,
ultimately leading to better generalization performance on unseen data.
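The trade-off can be made concrete with a small simulation. The sketch below rests on several assumptions made only for illustration: a true function f(x) = x^2, Gaussian noise, a fixed grid of training inputs, and two deliberately extreme models. One model always predicts the training mean (high bias, low variance); the other copies the label of the nearest training point (low bias, high variance). Squared bias and variance are estimated at a single test input over many resampled training sets.

```python
import random

def true_f(x):
    return x * x

def simulate(n_trials=500, noise_sd=0.3, x0=0.9, seed=0):
    """Estimate squared bias and variance of two simple models at the input x0."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(11)]  # training inputs 0.0, 0.1, ..., 1.0
    preds_const, preds_nn = [], []
    for _ in range(n_trials):
        # Draw a fresh noisy training set on the same inputs.
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in grid]
        # High-bias model: ignore x and predict the training mean.
        preds_const.append(sum(ys) / len(ys))
        # High-variance model: copy the label of the nearest training point.
        nearest = min(range(len(grid)), key=lambda i: abs(grid[i] - x0))
        preds_nn.append(ys[nearest])

    def bias_sq_and_var(preds):
        mean = sum(preds) / len(preds)
        var = sum((p - mean) ** 2 for p in preds) / len(preds)
        return (mean - true_f(x0)) ** 2, var

    return bias_sq_and_var(preds_const), bias_sq_and_var(preds_nn)

(const_bias_sq, const_var), (nn_bias_sq, nn_var) = simulate()
print(f"constant model:   bias^2={const_bias_sq:.4f}, variance={const_var:.4f}")
print(f"nearest neighbor: bias^2={nn_bias_sq:.4f}, variance={nn_var:.4f}")
```

The constant model shows the larger squared bias and the nearest-neighbor model the larger variance, matching the underfitting and overfitting cases described above.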
Algorithms.
A confusion matrix is a table that lays out the outcomes of a classifier's predictions against the
actual classes and helps visualize its performance.
We can obtain four different combinations from the predicted and actual values of a classifier:
True Positive: The number of times the model correctly predicts the positive class. You
predicted a positive value, and it is actually positive.
False Positive: The number of times the model wrongly predicts negative values as positive.
You predicted a positive value, and it is actually negative.
True Negative: The number of times the model correctly predicts the negative class. You
predicted a negative value, and it is actually negative.
False Negative: The number of times the model wrongly predicts positive values as negative.
You predicted a negative value, and it is actually positive.
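The four counts can be tallied directly from the actual and predicted labels. The sketch below assumes binary labels with 1 as the positive class; the function name confusion_counts and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels for eight examples.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```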
3 1,3
(b) What is a False Positive and False Negative and How Are They
Significant?
False Positives (FP):
False positives are cases where the model incorrectly predicts the positive class when the actual
label is negative.
For example, if the model incorrectly predicts that a healthy person has a disease, it's a false
positive.
False Negatives (FN):
False negatives are cases where the model incorrectly predicts the negative class when the actual
label is positive.
For example, if the model incorrectly predicts that a person with a disease does not have the
disease, it's a false negative.
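One way to quantify the two kinds of error is through the false positive rate and the false negative rate. The sketch below uses made-up screening counts, chosen only for illustration, and a hypothetical helper named error_rates.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    fnr = fn / (fn + tp)  # share of actual positives that were missed
    return fpr, fnr

# Hypothetical screening results: 100 patients with the disease, 900 without.
fpr, fnr = error_rates(tp=90, fp=45, tn=855, fn=10)
print(f"false positive rate: {fpr:.2%}")
print(f"false negative rate: {fnr:.2%}")
```

Which rate matters more depends on the application: in the disease example, a false negative leaves a sick person untreated, while a false positive leads to unnecessary follow-up for a healthy person.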
2 1,3
A feedforward neural network is one of the simplest types of artificial neural networks devised.
In this network, the information moves in only one direction—forward—from the input nodes,
through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the
network. Feedforward neural networks were the first type of artificial neural network devised
and, in their basic fully connected form, are simpler than architectures such as recurrent neural
networks and convolutional neural networks.
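The one-directional flow can be sketched as a forward pass through a small network with 2 inputs, 2 hidden nodes, and 2 outputs. The weights, biases, inputs, and the choice of a sigmoid activation below are hypothetical values picked for illustration; they are not taken from the question paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the incoming weights of node j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Information flows input -> hidden -> output, with no loops."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)

# Hypothetical weights, biases, and inputs, chosen only for illustration.
outputs = forward(
    inputs=[0.05, 0.10],
    hidden_w=[[0.15, 0.20], [0.25, 0.30]], hidden_b=[0.35, 0.35],
    output_w=[[0.40, 0.45], [0.50, 0.55]], output_b=[0.60, 0.60],
)
print([round(o, 4) for o in outputs])
```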
GROUP – C
7. (a) How would you define clustering? Can you name a few clustering
algorithms?
5 1,3,4
detection?
5 2,4
(c) Can you think of a use case where active learning would be useful?
5 2,4
5 3,4
(iii) Sensitivity
(iv) Specificity
(v) Precision
5 2,4
(i) Accuracy
5 2,4
(c) Design a Neural Network with 2 inputs in the input layer, 2 nodes in a
single hidden layer, and 2 outputs in the output layer. Calculate the
5 3,4
11. (a) Given six data points as (1,1), (2,1), (3,5), (4,3), (4,6), (6,4). Apply
these points.
5 2,4
(b) Find the number of clusters found in the above dendrogram. 5 2,4