2023 ML Assignment
Assignment - 1
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Correct Answers: a, c
Explanation: In (c), the amount of rainfall is a continuous variable, but we are predicting
whether there will be abnormally heavy rainfall next year or not, so it is a classification
task. Similarly, gender identification (a) has a discrete set of classes, so it is also a
classification task. In the other options the output variable is continuous, so those are
regression tasks.
2. A feature F1 can take one of the values A, B, C, D, E, F, and represents the grade of students
from a college. Which of the following statements is true in this case?
a. Feature F1 is an example of a nominal variable.
b. Feature F1 is an example of an ordinal variable.
c. It doesn’t belong to any of the above categories.
d. Both of these
Correct Answer: b
Explanation: Ordinal variables are variables whose categories have a natural order.
For example, grade A should be considered a higher grade than grade B.
_______________________________________________________________________
3. Suppose I have 10,000 emails in my mailbox, out of which 200 are spam. The spam
detection system flags 150 emails as spam, out of which 50 are actually spam. What are
the precision and recall of my spam detection system?
Correct Answer: a
Explanation:
We know that:
Precision = TP / (TP + FP) = 50 / 150 = 33.33%
Recall = TP / (TP + FN) = 50 / 200 = 25%
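The arithmetic can be checked with a short Python sketch using the counts from the question:

```python
# Sketch: precision and recall for the spam example above.
tp = 50            # spam mails correctly flagged as spam
fp = 150 - 50      # flagged as spam but actually legitimate
fn = 200 - 50      # spam mails the system missed

precision = tp / (tp + fp)   # 50 / 150 ≈ 0.333
recall = tp / (tp + fn)      # 50 / 200 = 0.25
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```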
___________________________________________________________________________
4. Which of the following statements describes what is most likely TRUE when the amount
of training data increases?
Correct Answer: a
Explanation: As the training set grows, a model of fixed capacity finds it harder to fit every
training example exactly, so the training error typically increases; at the same time the larger
sample is more representative of the underlying distribution, so the generalization (test) error
typically decreases.
___________________________________________________________________________
5. You trained a learning algorithm and plotted the learning curve, obtaining the following
figure.
a. High bias
b. High variance
c. Neither
Correct Answer: a
Explanation: In the plot, the training error increases with the training set size, and the true error
converges to around 0.4, which is quite high. Thus, we can say that the bias is high.
___________________________________________________________________________
6. I am the marketing consultant of a leading e-commerce website. I have been given the task
of building a system that recommends products to users based on their activity on Facebook.
I realize that user interests can be highly variable. Hence, I decide to
T2) Train separate models for each community to predict which product category (e.g.,
electronic gadgets, cosmetics, etc.) would be the most relevant to that community.
Correct Answer: b
a. High bias
b. Low bias
c. Low variance
d. High variance
e. Good performance on training data
f. Poor performance on test data
Correct Answers: b, d, e, f
Explanation: Overfitting is characterized by good performance on the training data, as the
model has essentially memorized the data. However, it leads to poor performance on the
test data because the model fails to generalize well. Overfitting is associated with low bias
and high variance, meaning the model is sensitive to noise or fluctuations in the training
data.
________________________________________________________________________
10. Which of the following statements about cross-validation in machine learning is/are true?
a. Cross-validation is used to evaluate a model's performance on the training data.
b. Cross-validation guarantees that a model will generalize well to unseen data.
c. Cross-validation is only applicable to classification problems and not regression
problems.
d. Cross-validation helps in estimating the model's performance on unseen data by
simulating the test phase.
Correct Answer: d
Explanation: Cross-validation is a technique used in machine learning to assess the
performance and generalization ability of a model. It involves dividing the available labeled
data into multiple subsets or folds. The model is trained on a portion of the data (training set)
and evaluated on the remaining portion (validation or test set). By repeating this process with
different partitions of the data, cross-validation provides an estimate of the model's
performance on unseen data.
___________________________________________________________________________
11. What does k-fold cross-validation involve in machine learning?
a. Splitting the dataset into k equal-sized training and test sets.
b. Splitting the dataset into k unequal-sized training and test sets.
c. Partitioning the dataset into k subsets, and iteratively using each subset as a
validation set while the remaining k-1 subsets are used for training.
d. Dividing the dataset into k subsets, where each subset represents a unique class label
for classification tasks.
Correct Answer: c
Explanation: K-fold cross-validation involves dividing the dataset into k subsets or folds. The
process then iterates k times, where each time, one of the k subsets is used as the validation set,
while the remaining k-1 subsets are used for training the model. This ensures that each subset
is used as the validation set exactly once, and the model is trained and evaluated k times, with
each fold serving as the validation set once.
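As a concrete illustration, here is a minimal k-fold sketch with scikit-learn (assuming scikit-learn is available; the dataset and classifier are illustrative stand-ins, not taken from the question):

```python
# Sketch: 5-fold cross-validation. cross_val_score trains the model on k-1
# folds and evaluates on the held-out fold, once per fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())  # one accuracy per fold, plus their average
```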
___________________________________________________________________________
12. What does the term "feature space" refer to in machine learning?
a. The space where the machine learning model is trained.
b. The space where the machine learning model is deployed.
c. The space which is formed by the input variables used in a machine learning model.
d. The space where the output predictions are made by a machine learning model.
Correct Answer: c
Explanation: The feature space in machine learning refers to the space formed by the input
variables or features used in a model. It represents the space where the data points reside.
___________________________________________________________________________
13. Which of the following statements is/are true regarding supervised and unsupervised
learning?
a. Supervised learning can handle both labeled and unlabeled data.
b. Unsupervised learning requires human experts to label the data.
c. Supervised learning can be used for regression and classification tasks.
d. Unsupervised learning aims to find hidden patterns in the data.
Correct Answers: c, d
Explanation:
Option “a” is incorrect. Supervised learning specifically requires labeled data, while
unsupervised learning deals with unlabeled data.
Option “b” is incorrect. Unsupervised learning does not require human experts to label the data;
it learns from the raw, unlabeled data.
Option “c” is correct. Supervised learning encompasses both regression, where the output
variable is continuous, and classification, where the output variable is categorical.
Option “d” is correct. Unsupervised learning aims to find hidden patterns, structures, or
relationships within the data without any prior knowledge of the output labels.
___________________________________________________________________________
14. One of the ways to mitigate overfitting is
a. By increasing the model complexity
b. By reducing the amount of training data
c. By adding more features to the model
d. By decreasing the model complexity
Correct Answer: d
Explanation: Overfitting occurs when a machine learning model performs well on the training
data but fails to generalize to new, unseen data. It usually happens when the model becomes
too complex and starts to memorize the training examples instead of learning the underlying
patterns. To mitigate overfitting, one of the effective approaches is to decrease the model
complexity.
__________________________________________________________________________
Correct Answer: a
Explanation: Any variable ‘A’ can take 2 values (i.e., 0 or 1).
For ‘N’ variables there are 2^N rows in the truth table.
And the output of each row in the truth table can independently be 0 or 1.
Hence, we have 2^(2^N) different Boolean functions with N variables.
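A brute-force sketch that confirms the count for N = 2 (2^(2^2) = 16 functions), using only the standard library:

```python
# Sketch: enumerate all Boolean functions of N variables. A function is one
# output bit per row of the 2**N-row truth table, so there are 2**(2**N).
from itertools import product

N = 2
rows = list(product([0, 1], repeat=N))               # 2**N truth-table rows
functions = list(product([0, 1], repeat=len(rows)))  # one output bit per row
print(len(rows), len(functions))                     # 4 16
```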
___________________________________________________________________________
Introduction to Machine Learning -IITKGP
Assignment - 2
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Data for Q. 1 to 3
The following dataset will be used to learn a decision tree for predicting whether a person is
happy (H) or sad (S), based on the color of their shoes, whether they wear a wig, and the
number of ears they have.
Color  Wig  Num. Ears  Emotion
G      Y    2          S
G      N    2          S
G      N    2          S
B      N    2          S
B      N    2          H
R      N    2          H
R      N    2          H
R      N    2          H
R      Y    3          H
Correct answer: a
Explanation:
To calculate the entropy of the target variable (Emotion) given the condition Wig = Y, we
need to compute the distribution of emotions within that subset of the dataset:

Color  Wig  Num. Ears  Emotion
G      Y    2          S
R      Y    3          H

Within this subset, we have 1 instance of "S" (sad) and 1 instance of "H" (happy), so the two
classes are equally likely.
To calculate the entropy, we use the formula: Entropy(X) = - Σ P(x) log2 P(x)
Since P(S) = P(H) = 0.5, substituting these values into the entropy formula gives
Entropy = - 0.5 log2(0.5) - 0.5 log2(0.5) = 1.
Correct answer: b
Explanation:
To calculate the entropy of the target variable (Emotion) given the condition Ears = 3, we
need to compute the distribution of emotions within that subset of the dataset.
Within this subset, we have 1 instance of "H" (happy) and 0 instances of "S" (sad).
Since P(S) = 0 and P(H) = 1, substituting these values into the entropy formula (using the
convention 0 log2 0 = 0) gives
Entropy = - 1 log2(1) - 0 log2(0) = 0.
Correct answer: a
Explanation:
To determine the attribute to choose as the root of the decision tree, we need to consider the
concept of information gain. Information gain measures the reduction in entropy or impurity
achieved by splitting the data based on a specific attribute.
We can calculate the information gain for each attribute by comparing the entropy before and
after the split. The attribute with the highest information gain will be chosen as the root of the
decision tree.
Let's calculate the information gain for each attribute (Color, Wig, and Num. Ears) based on
the given dataset. The entropy of Emotion before any split is
Entropy (Emotion) = - (4/9) log2 (4/9) - (5/9) log2 (5/9) ≈ 0.991

Color: the subsets are G (3 S), B (1 S, 1 H), and R (4 H), with entropies 0, 1, and 0.
Weighted entropy after the split = (3/9)(0) + (2/9)(1) + (4/9)(0) ≈ 0.222, so the gain
≈ 0.991 - 0.222 ≈ 0.769.

Wig: the subsets are Y (1 S, 1 H) and N (3 S, 4 H), with entropies 1 and ≈ 0.985.
Weighted entropy after the split = (2/9)(1) + (7/9)(0.985) ≈ 0.988, so the gain ≈ 0.003.

Num. Ears: the subsets are 2 (4 S, 4 H) and 3 (1 H), with entropies 1 and 0.
Weighted entropy after the split = (8/9)(1) + (1/9)(0) ≈ 0.889, so the gain ≈ 0.102.

Based on these calculations, the attribute with the highest information gain is Color,
with an information gain of approximately 0.769. Therefore, Color should be chosen
as the root of the decision tree.
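A minimal sketch (standard library only) that recomputes these information gains from the happy/sad table; each row is (color, wig, ears, emotion):

```python
# Sketch: entropy and information gain for the dataset above.
from collections import Counter
from math import log2

data = [("G","Y",2,"S"), ("G","N",2,"S"), ("G","N",2,"S"),
        ("B","N",2,"S"), ("B","N",2,"H"), ("R","N",2,"H"),
        ("R","N",2,"H"), ("R","N",2,"H"), ("R","Y",3,"H")]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def info_gain(col):
    gain = entropy([r[-1] for r in data])       # entropy before the split
    for v in {r[col] for r in data}:
        subset = [r[-1] for r in data if r[col] == v]
        gain -= len(subset)/len(data) * entropy(subset)
    return gain

for col, name in enumerate(["Color", "Wig", "Num. Ears"]):
    print(name, round(info_gain(col), 3))       # Color ≈ 0.769 is the largest
```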
Correct answer: c
Explanation:
In linear regression, the output variable, also known as the dependent variable or target
variable, is continuous. Linear regression is a supervised learning algorithm used to model
the relationship between a dependent variable and one or more independent variables.
The goal of linear regression is to find a linear relationship between the independent variables
and the continuous output variable. The linear regression model predicts a continuous value
as the output based on the input features.
X    Y
6    7
5    4
10   9
3    4

J(θ) = (1 / 2m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)^2

where m is the number of training examples and h_θ(x_i) is the value of the linear regression
hypothesis at point i. If θ = [1, 1], find J(θ).
a. 0
b. 1
c. 2
d. 0.5
Correct answer: b
Explanation:
J(θ) = (1 / 2m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)^2
For θ = [1, 1], the hypothesis function h_θ(x) becomes h_θ(x) = 1 + x.
Substituting the values from the training data into the MSE equation:
J(θ) = 1/(2 * 4) [ (7 - 7)^2 + (6 - 4)^2 + (11 - 9)^2 + (4 - 4)^2 ]
= 1/(8) [ 0 + 4 + 4 + 0 ]
= 1/(8) [ 8 ]
= 1
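The computation can be verified in a couple of lines of Python:

```python
# Sketch: J(theta) for theta = [1, 1] on the (X, Y) table above.
X = [6, 5, 10, 3]
Y = [7, 4, 9, 4]
theta0, theta1 = 1, 1
m = len(X)

J = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)
print(J)  # 1.0
```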
Correct answer: b
Explanation:
The ID3 algorithm uses a greedy strategy to make local decisions at each node based on the
information gain or other impurity measures. It recursively builds the decision tree by
selecting the attribute that provides the highest information gain or the most significant
reduction in impurity at each step. However, this greedy approach does not consider the
global optimum for the entire decision tree structure.
Due to the greedy nature of the algorithm, it is possible for ID3 to get stuck in suboptimal
solutions or make decisions that do not result in the most accurate or optimal tree. In some
cases, the ID3 algorithm may produce a decision tree that is a local optimum but not the
global optimum.
Explanation:
In reality, a classifier trained on less training data is more likely to overfit. Overfitting occurs
when a model learns the training data too well, capturing noise or irrelevant patterns that do
not generalize to unseen data. When the training dataset is smaller, the model has less
exposure to the variety of examples and may struggle to capture the true underlying patterns.
With a limited amount of training data, the model has a higher risk of memorizing specific
examples and idiosyncrasies of the training set, resulting in a biased and overfitted model.
The lack of diversity in the training data hampers the model's ability to generalize well to
new, unseen examples.
Correct answer: b
Explanation: We can see this from the bias-variance trade-off. When the hypothesis space is
small, the model is more biased with less variance. So with a small hypothesis space, it is less
likely that we find a hypothesis that fits the data very well, i.e., less likely to overfit.
Correct answer: c
Explanation: The single biggest problem with the suggestion of using a multiway split with
one branch for each distinct value of a real-valued input attribute is that it would likely result
in a decision tree that overfits the training data. By creating a branch for each distinct value,
the tree would become more complex, and it would have the potential to fit the training data
too closely, capturing noise or irrelevant patterns specific to the training set.
As a consequence of overfitting, the decision tree would likely score well on the training set
since it can perfectly match the training examples. However, when evaluated on a test set or
unseen data, the tree would struggle to generalize and perform poorly. Overfitting leads to
poor performance on new instances, indicating that the model has failed to learn the
underlying patterns and instead has become too specialized in the training data.
10. Which of the following statements about decision trees is/are true?
a. Decision trees can handle both categorical and numerical data.
b. Decision trees are resistant to overfitting.
c. Decision trees are not interpretable.
d. Decision trees are only suitable for binary classification problems.
Correct answer: a
Explanation: Decision trees can handle both categorical and numerical data as they partition
the data based on various conditions during the tree construction process. This allows
decision trees to be versatile in handling different types of data.
11. Which of the following techniques can be used to handle overfitting in decision trees?
a. Pruning
b. Increasing the tree depth
c. Decreasing the minimum number of samples required to split a node
d. Adding more features to the dataset
Correct answer: a
Explanation: Overfitting occurs when a decision tree captures noise or irrelevant patterns in
the training data, resulting in poor generalization to unseen data. Pruning is a technique used
to reduce overfitting by removing unnecessary branches and nodes from the tree. By contrast,
increasing the tree depth, lowering the minimum number of samples required to split a node,
or adding more features all make the tree more complex and tend to worsen overfitting.
12. Which of the following is a measure used for selecting the best split in decision trees?
a. Gini Index
b. Support Vector Machine
c. K-Means Clustering
d. Naive Bayes
Correct answer: a
Explanation: The Gini Index is a commonly used measure for selecting the best split in
decision trees. It quantifies the impurity or dissimilarity of a node's class distribution. The
split that minimizes the Gini Index is chosen as the optimal split.
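A minimal sketch of how the Gini index scores a candidate binary split; the labels here are illustrative, not taken from a particular question:

```python
# Sketch: Gini impurity of a node, and the weighted Gini of a split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["S", "S", "H", "H"]))          # 0.5: maximally impure binary node
print(split_gini(["S", "S"], ["H", "H"]))  # 0.0: a perfect split
```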
13. What is the purpose of the decision tree's root node in machine learning?
a. It represents the class labels of the training data.
b. It serves as the starting point for tree traversal during prediction.
c. It contains the feature values of the training data.
d. It determines the stopping criterion for tree construction.
Correct answer: b
Explanation: The root node of a decision tree serves as the starting point for tree traversal
during prediction. It represents the first decision based on a feature and directs the flow of the
decision tree based on the outcome of that decision. The root node does not contain class
labels or feature values but rather determines the initial split based on a selected criterion.
Correct answer: b
Explanation: Linear regression assumes a linear relationship between the independent
variables (features) and the dependent variable (target). It seeks to find the best-fitting line to
the data.
While linear regression is primarily used for regression tasks, it is not suitable for
classification tasks. Outliers can significantly impact linear regression models, and missing
values in the dataset require appropriate handling.
15. Which of the following techniques can be used to mitigate overfitting in machine
learning?
a. Regularization
b. Increasing the model complexity
c. Gathering more training data
d. Feature selection or dimensionality reduction
Correct answers: a, c, d
Explanation: Regularization techniques, such as L1 or L2 regularization, can help mitigate
overfitting by adding a penalty term to the model's objective function, discouraging
excessively large parameter values.
Gathering more training data can also reduce overfitting by providing a more representative
sample of the underlying data distribution.
Feature selection or dimensionality reduction techniques, such as selecting relevant features
or applying techniques like Principal Component Analysis (PCA), can help remove irrelevant
or redundant features, reducing the complexity of the model and mitigating overfitting.
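As an illustration of the first technique, here is a minimal L2-regularization sketch with scikit-learn; the synthetic dataset and the alpha value are illustrative assumptions:

```python
# Sketch: ridge (L2) regression vs. plain least squares on a small noisy set.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=0)
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The penalty shrinks the coefficients, trading a little bias for less variance.
print(abs(plain.coef_).max(), abs(ridge.coef_).max())
```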
Introduction to Machine Learning -IITKGP
Assignment - 3
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
a. Non-parametric, eager
b. Parametric, eager
c. Non-parametric, lazy
d. Parametric, lazy
Correct Answer: c
Explanation: KNN is non-parametric because it makes no assumption about the underlying
data distribution. It is a lazy learning technique because at training time it merely memorizes
the data; the distance computations are deferred to testing time.
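A minimal sketch of this lazy behavior with scikit-learn (synthetic data, illustrative settings):

```python
# Sketch: k-NN "training" only stores the data; distances are computed
# at prediction time.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # fast: just memorizes X, y
print(knn.predict(X[:5]))  # each query searches the stored training points
```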
Q2. You have been given the following 2 statements. Find out which of these options is/are
true in the case of k-NN.
(i) In case of a very large value of k, we may include points from other classes in the
neighborhood.
(ii) In case of a too small value of k, the algorithm is very sensitive to noise.
Correct Answer: c
a. True
b. False
Correct Answer: a
Explanation: The training phase of the algorithm consists only of storing the feature vectors
and class labels of the training samples.
In the testing phase, a test point is classified by assigning the label that is most frequent
among the k training samples nearest to that query point, hence the higher computation at
test time.
Q4. Suppose you are given the following images (1 represents the left image, 2 represents
the middle and 3 represents the right). Now your task is to find out the value of k in k-NN in
each of the images shown below. Here k1 is for 1st, k2 is for 2nd and k3 is for 3rd figure.
a. k1 > k2> k3
b. k1 < k2> k3
c. k1 < k2 < k3
d. None of these
Correct Answer: c
Q6. Suppose, you have given the following data where x and y are the 2 input variables and
Class is the dependent variable.
a. + Class
b. – Class
c. Can’t Say
d. None of these
Correct Answer: a
Explanation: All three nearest points are of the + class, so this point will be classified as + class.
Q7. What is the optimum number of principal components in the below figure?
a. 10
b. 20
c. 30
d. 40
Correct Answer: c
Explanation: We can see in the figure that 30 components capture nearly the maximum
variance while keeping the number of components low. Hence option ‘c’ is the right answer.
Q8. Suppose we are using dimensionality reduction as pre-processing technique, i.e, instead
of using all the features, we reduce the data to k dimensions with PCA. And then use these
PCA projections as our features. Which of the following statements is correct?
Correct Answer: b
Explanation: A higher value of ‘k’ leads to less smoothing of the decision boundary; it
preserves more characteristics of the data, and hence corresponds to less regularization.
Correct Answer: a
a. Cold start
b. Overspecialization
c. None of the above
Correct Answer: a
Explanation: For new users, we have very few transactions. So, it’s very difficult to find similar
users.
Q11. Consider the figures below. Which figure shows the most probable PCA component
directions for the data points?
a. A
b. B
c. C
d. D
Correct Answer: a
Explanation: [Follow the lecture slides]
PCA chooses the component directions so as to:
1. Maximize the total variance of the data.
2. Be mutually orthogonal, thereby minimizing correlation between components.
Q12. Suppose that you wish to reduce the number of dimensions of a given dataset to k
dimensions using PCA. Which of the following statements is correct?
Correct Answer: b
Q13. Suppose you are given 7 plots 1-7 (left to right) and you want to compare Pearson
correlation coefficients between variables of each plot. Which of the following is true?
1. 1<2<3<4
2. 1>2>3>4
3. 7<6<5<4
4. 7>6>5>4
a. 1 and 3
b. 2 and 3
c. 1 and 4
d. 2 and 4
Correct Answer: b
Correct Answer: b
Explanation: LDA produces at most c − 1 discriminant vectors, where c is the number of
classes.
Correct Answer: a
Explanation: Option b is not appropriate for collaborative filtering because it involves
predicting the expected sales volume (number of books sold) as a function of the average
rating of a book. Collaborative filtering focuses on user-item interactions and is concerned
with providing personalized recommendations rather than predicting sales volume from
average ratings.
************END************
Introduction to Machine Learning -IITKGP
Assignment - 4
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Q1.
Suppose that the system is now given a new mail to be classified as spam/ not-spam, what is
the probability that the mail will be classified as spam?
a. 0.89575
b. 0.10425
c. 0.00475
d. 0.09950
Correct Answer: b
Detailed Solution:
Let S = ‘Mails correctly marked spam by the system’, T = ‘Mails misclassified by the system’
(marked as spam when not spam, or marked as not spam when it is spam), and M = ‘Spam
mails’.
We are to find the probability of a mail being classified as spam, which happens either when
a spam mail is correctly classified as spam or when a non-spam mail is misclassified as spam.
d. 0.99525
Correct Answer: a
Q3. Given that a mail is classified as not spam, the probability of the mail actually being not
spam
a. 0.10425
b. 0.89575
c. 0.003
d. 0.997
Correct Answer: d
a. 0.90025
b. 0.09975
c. 0.8955
d. 0.1045
Correct Answer: b
= 0.09975
Correct Answer: b
Detailed Solution:
Naive Bayes Assumption is that all the features of a class are independent of each other
which is not the case in real life. Because of this assumption, the classifier is called Naive
Bayes Classifier.
Q6.
Consider the following dataset. a,b,c are the features and K is the class(1/0):
a b c K
1 0 1 1
1 1 1 1
0 1 1 0
1 1 0 0
1 0 1 0
0 0 0 1
Classify the test instance given below into class 1/0 using a Naive Bayes Classifier.
a b c K
0 0 1 ?
a. 0
b. 1
Correct Answer: b
Detailed Solution:
P(K=1 | a=0, b=0, c=1) ∝ P(K=1) ∗ P(a=0|K=1) ∗ P(b=0|K=1) ∗ P(c=1|K=1) [by the Naive
Bayes assumption; the denominator P(a=0, b=0, c=1) is the same for both classes, so it can
be ignored when comparing them]
= (3/6) ∗ (1/3) ∗ (2/3) ∗ (2/3) = 0.07407
P(K=0 | a=0, b=0, c=1) ∝ (3/6) ∗ (1/3) ∗ (1/3) ∗ (2/3) = 0.03703 < 0.07407
Hence, the test instance is classified as K = 1.
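A small sketch that reproduces these two scores directly from the six training rows (standard library only); each row is (a, b, c, K):

```python
# Sketch: unnormalized Naive Bayes scores for the test instance (0, 0, 1).
data = [(1,0,1,1), (1,1,1,1), (0,1,1,0), (1,1,0,0), (1,0,1,0), (0,0,0,1)]

def score(k, test):
    rows = [r for r in data if r[3] == k]
    prior = len(rows) / len(data)              # P(K = k)
    likelihood = 1.0
    for j, v in enumerate(test):               # P(feature_j = v | K = k)
        likelihood *= sum(r[j] == v for r in rows) / len(rows)
    return prior * likelihood

test = (0, 0, 1)
print(score(1, test), score(0, test))  # ≈ 0.0741 vs ≈ 0.0370, so predict K = 1
```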
c. 1/9
d. 8/9
Correct Answer: b
Detailed Solution:
P(K=0 | a=1, b=1)
= [P(K=0) ∗ P(a=1|K=0) ∗ P(b=1|K=0)] / [P(K=0) ∗ P(a=1|K=0) ∗ P(b=1|K=0) + P(K=1) ∗ P(a=1|K=1) ∗ P(b=1|K=1)]
= 2/3
Q8.
Answer Questions 8-10 with the data given below:
A patient goes to a doctor with symptoms S1, S2 and S3. The doctor suspects diseases D1 and
D2 and constructs a Bayesian network for the relation among the diseases and symptoms as
the following:
Detailed Solution:
From the figure, we can see that D1 and D2 are not dependent on any variable as they don’t
have any incoming directed edges. S1 has an incoming edge from D1, hence S1 depends on
D1. S2 has 2 incoming edges from D1 and D2, hence S2 depends on D1 and D2. S3 has an
incoming edge from D2, S3 depends on D2. Hence, (d) is the answer.
Q9. Suppose P(D1) = 0.4, P(D2) = 0.7 , P(S1|D1)=0.3 and P(S1| D1’)= 0.6. Find P(S1)
a. 0.12
b. 0.48
c. 0.36
d. 0.60
Correct Answer: b
Detailed Solution:
P(S1) = P(S1|D1) P(D1) + P(S1|D1’) P(D1’) = 0.3 ∗ 0.4 + 0.6 ∗ 0.6 = 0.12 + 0.36 = 0.48
a. D1
b. D2
c. D1 and D2
d. None
Correct Answer: b
Detailed Solution:
A variable X is conditionally independent of all other variables in the network given its
Markov blanket, which consists of:
● X’s parents
● X’s children
● Parents of X’s children
In the given diagram, the variable S3 has the single parent D2 and no children. Hence, the
correct answer is (b).
___________________________________________________________________________
Q11. Consider the following Bayes’ network:
Alarm1 means that the first alarm system rings, Alarm2 means that the second alarm system
rings, and Burglary means that a burglary is in progress. Now assume that:
P(Alarm1) = 0.1
P(Alarm2) = 0.2
P (Burglary | Alarm1, Alarm2) = 0.8
P (Burglary | Alarm1, ¬Alarm2) = 0.7
P (Burglary | ¬Alarm1, Alarm2) = 0.6
P (Burglary | ¬Alarm1, ¬Alarm2) = 0.5
Calculate P (Alarm2 | Burglary, Alarm1).
a. 0.78
b. 0.22
c. 0.50
d. 0.10
Correct Answer: b
Detailed Solution:
Since Alarm1 and Alarm2 are independent root nodes of the network,
P(Alarm2 | Burglary, Alarm1)
= P(Burglary | Alarm1, Alarm2) P(Alarm2) / [P(Burglary | Alarm1, Alarm2) P(Alarm2)
+ P(Burglary | Alarm1, ¬Alarm2) P(¬Alarm2)]
= (0.8 ∗ 0.2) / (0.8 ∗ 0.2 + 0.7 ∗ 0.8)
= 0.16 / 0.72 ≈ 0.22
Q12. The values of the conditional probabilities are given below. Find P(D).
Assume,
P(A) = 0.3
P(B) = 0.6
P(C|A) = 0.8
P(C|¬A) = 0.4
P(D|A, B) = 0.7
P(D|A, ¬B) = 0.8
P(D|¬A, B) = 0.1
P(D|¬A, ¬B) = 0.2
P(E|C) = 0.7
P(E|¬C) = 0.2
a. 0.68
b. 0.32
c. 0.50
d. 0.70
Correct Answer: b
Detailed Solution:
Since A and B are root nodes, they are independent, so
P(D) = P(D|A, B) P(A) P(B) + P(D|A, ¬B) P(A) P(¬B) + P(D|¬A, B) P(¬A) P(B)
+ P(D|¬A, ¬B) P(¬A) P(¬B)
= 0.7 ∗ 0.3 ∗ 0.6 + 0.8 ∗ 0.3 ∗ 0.4 + 0.1 ∗ 0.7 ∗ 0.6 + 0.2 ∗ 0.7 ∗ 0.4
= 0.126 + 0.096 + 0.042 + 0.056 = 0.32
Q13.
Answer Questions 13-14 with the data given below:
In an oral exam you have to solve exactly one problem, which might be one of three types, A,
B, or C, which will come up with probabilities 30%, 20%, and 50%, respectively. During your
preparation you have solved 9 of 10 problems of type A, 2 of 10 problems of type B, and 6 of
10 problems of type C.
What is the probability that you will solve the problem of the exam?
a. 0.61
b. 0.39
c. 0.50
d. 0.20
Correct Answer: a
Detailed Solution:
A: Problem of type A.
B: Problem of type B.
C: Problem of type C.
S: You solve the problem
P(A) = 0.30
P(B) = 0.20
P(C) = 0.50
P(S|A) = 9/10
P(S|B) = 2/10
P(S|C) = 6/10
P(S) = P(S|A) P(A) + P(S|B) P(B) + P (S|C) P(C)
= (9/10) * (0.30) + (2/10) * (0.20) + (6/10) * (0.50)
= 0.61
Q14. Given you have solved the problem, what is the probability that it was of type A?
a. 0.35
b. 0.50
c. 0.56
d. 0.44
Correct Answer: d
Detailed Solution:
A: Problem of type A.
S: You solve the problem
P(A) = 0.30
P(S|A) = 9/10
P(S) = 0.61
P(A|S) = P(S|A) ∗ P(A) / P(S) = (9/10)(0.30) / (0.61)
= 0.4426
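Both computations (total probability for Q13, then Bayes' rule for Q14) can be reproduced in a few lines of Python:

```python
# Sketch: P(S) by total probability, then P(A|S) by Bayes' rule.
priors = {"A": 0.30, "B": 0.20, "C": 0.50}
p_solve = {"A": 9/10, "B": 2/10, "C": 6/10}   # P(S | type)

p_s = sum(p_solve[t] * priors[t] for t in priors)   # P(S) = 0.61
p_a_given_s = p_solve["A"] * priors["A"] / p_s      # P(A|S) ≈ 0.4426
print(p_s, round(p_a_given_s, 4))
```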
Q15. Naive Bayes is a popular classification algorithm in machine learning. Which of the
following statements is/are true about Naive Bayes?
a. Naive Bayes assumes that all features are independent of each other, given the class.
b. It is particularly well-suited for text classification tasks, like spam detection.
c. Naive Bayes can handle missing values in the dataset without any special treatment.
d. It is a complex algorithm that requires a large amount of training data.
Correct Answers: a, b
Explanation:
a. Correct. Naive Bayes assumes that features are conditionally independent given the class.
This simplifying assumption allows the algorithm to estimate probabilities efficiently even
with a limited amount of data.
b. Correct. Naive Bayes is commonly used for text classification tasks, such as spam detection
and sentiment analysis, due to its ability to handle high-dimensional feature spaces.
c. Incorrect. Naive Bayes does not handle missing values naturally. Missing values need to be
handled before applying the algorithm.
d. Incorrect. Naive Bayes is actually known for its simplicity and ability to work well with
small amounts of training data. It is not considered complex and often provides good results
with relatively simple computations.
*******END*******
Introduction to Machine Learning -IITKGP
Assignment - 5
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
1. What would be the ideal complexity of the curve which can be used for separating the two
classes shown in the image below?
a. Linear
b. Quadratic
c. Cubic
d. insufficient data to draw a conclusion
Correct Answer: a
Explanation: The blue point in the red region is an outlier (most likely noise). The rest of
the data is linearly separable.
2. Suppose you are using a Linear SVM classifier with 2 class classification problem. Now you
have been given the following data in which some points are circled red that are representing
support vectors.
If you remove any one of the red points from the data, will the decision boundary change?
a. Yes
b. No
Correct Answer: a
Explanation: These three examples are positioned such that removing any one of them
introduces slack in the constraints. So, the decision boundary would completely change.
4. Which of the following statements accurately compares linear regression and logistic
regression?
a. Linear regression is used for classification tasks, while logistic regression is used for
regression tasks.
b. Linear regression models the relationship between input features and continuous
target variables, while logistic regression models the probability of binary outcomes.
c. Linear regression and logistic regression are identical in their mathematical
formulation and can be used interchangeably.
d. Linear regression and logistic regression both handle multi-class classification tasks
equally effectively.
Correct Answer: b
Explanation: Linear regression is employed to predict continuous numeric target variables
based on input features. It finds the best-fitting linear relationship between features and the
target variable. Logistic regression, on the other hand, is designed for binary classification
tasks where the goal is to estimate the probability that a given input belongs to a particular
class. It employs the logistic (sigmoid) function to map the linear combination of features to a
probability value between 0 and 1. Linear regression and logistic regression serve different
purposes and are not interchangeable due to their distinct objectives and mathematical
formulations.
5. After training an SVM, we can discard all examples which are not support vectors and can
still classify new examples?
a. True
b. False
Correct Answer: a
Explanation: Only the support vectors determine the decision boundary; the remaining
examples can be discarded without affecting how new examples are classified.
6. Suppose you are building a SVM model on data X. The data X can be error prone, which
means that you should not trust any specific data point too much. Now suppose that you want
to build a SVM model which has a quadratic kernel function (polynomial of degree 2) and
uses the slack penalty C as one of its hyperparameters.
What would happen when you use a very large value of C (C → infinity)?
a. We can still classify data correctly for given setting of hyper parameter C.
b. We can not classify data correctly for given setting of hyper parameter C
c. None of the above
Correct Answer: a
Explanation: For large values of C, the penalty for misclassifying points is very high, so
the decision boundary will perfectly separate the data if possible.
7. Following Question 6, what would happen when you use a very small value of C (C ≈ 0)?
a. Data will be correctly classified
b. Misclassification would happen
c. None of these
Correct Answer: b
Explanation: The classifier can maximize the margin between most of the points, while
misclassifying a few points, because the penalty is so low.
8. If g(z) is the sigmoid function, then its derivative with respect to z may be written in
terms of g(z) as
a. g(z)(1-g(z))
b. g(z)(1+g(z))
c. -g(z)(1+g(z))
d. g(z)(g(z)-1)
Correct Answer: a
Detailed Solution:
g′(z) = d/dz [1 / (1 + e^(−z))]
= e^(−z) / (1 + e^(−z))^2
= [1 / (1 + e^(−z))] ∗ [1 − 1 / (1 + e^(−z))]
= g(z)(1 − g(z))
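A quick numerical check of this identity, using a central finite difference:

```python
# Sketch: numerically confirming g'(z) = g(z) * (1 - g(z)).
from math import exp

def g(z):
    return 1 / (1 + exp(-z))

z, h = 0.7, 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)   # central finite difference
analytic = g(z) * (1 - g(z))
print(numeric, analytic)                     # both ≈ 0.2217
```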
9. In the linearly non-separable case, what effect does the C parameter have on the
SVM model?
Correct Answer: c
Explanation: A high value of the C parameter places more emphasis on the penalties arising
from points lying on the wrong sides of the margins. The optimizer therefore shrinks the
margin so that fewer such points are involved in deciding the decision boundary.
10. What type of kernel function is commonly used for non-linear classification tasks in
SVM?
a. Linear kernel
b. Polynomial kernel
c. Sigmoid kernel
d. Radial Basis Function (RBF) kernel
Correct Answer: d
Explanation: The Radial Basis Function (RBF) kernel is commonly used for non-linear
classification tasks in SVM. It introduces non-linearity by mapping data points into a high-
dimensional space, where a linear decision boundary corresponds to a non-linear decision
boundary in the original feature space. The RBF kernel is suitable for capturing complex
relationships and is widely used due to its effectiveness.
11. Which of the following statements is/are true about kernel in SVM?
Correct Answer: c
Explanation: Follow lecture notes
12. The soft-margin SVM is preferred over the hard-margin SVM when:
Correct Answer: b, c
Explanation: When the data has noise and overlapping points, there is a problem in drawing
a clear hyperplane without misclassifying.
a. H1
b. H2
c. H3
d. None of the above.
Correct Answer: c
To determine the maximum-margin hyperplane, you need to look for the hyperplane that has
the largest "margin" between it and the nearest data point. The margin is the perpendicular
distance between the hyperplane and the closest data point from either class.
H3 has the largest gap between itself and the nearest data point. That hyperplane would be the
maximum-margin hyperplane.
14. What is the primary advantage of Kernel SVM compared to traditional SVM with a linear
kernel?
Correct Answer: c
Explanation: The primary advantage of Kernel SVM is its ability to capture complex non-
linear relationships between data points through the use of kernel functions. While traditional
SVM with a linear kernel is limited to finding linear decision boundaries, Kernel SVM can
transform the data into higher-dimensional spaces where non-linear decision boundaries can
be effectively learned. This makes Kernel SVM suitable for a wide range of classification tasks
where linear separation is not sufficient.
15. What is the sigmoid function's role in logistic regression?
Correct Answer: d
Explanation: The sigmoid function, also known as the logistic function, plays a crucial role in
logistic regression. It transforms the linear combination of input features and corresponding
weights into a value between 0 and 1. This value represents the estimated probability that the
input belongs to a particular class. The sigmoid function's curve ensures that the output remains
within the probability range, making it suitable for binary classification.
************END************
Introduction to Machine Learning -IITKGP
Assignment - 6
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
1. Given below the neural network, find the appropriate weights for w0, w1, and w2 to represent
the AND function. Threshold function = {1, if output >0; 0 otherwise}. x0 and x1 are the inputs
and b1=1 is the bias.
Correct Answer: b
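The option texts are not reproduced above, but the AND behavior can be verified with a few lines of Python. The weights used here (w0 = -1.5 on the bias, w1 = w2 = 1) are one valid illustrative choice, not necessarily the weights in keyed option b:

```python
# Sketch: a single neuron computing AND under the given threshold
# (output 1 if the weighted sum > 0, else 0), with bias input b1 = 1.
# The weight values are an illustrative assumption.
w0, w1, w2 = -1.5, 1.0, 1.0

def neuron(x0, x1, b1=1):
    s = w0 * b1 + w1 * x0 + w2 * x1
    return 1 if s > 0 else 0

for x0 in (0, 1):
    for x1 in (0, 1):
        print(x0, x1, "->", neuron(x0, x1))  # output is 1 only for (1, 1)
```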
Correct Answer: a
a. Gradient descent
b. Bias
c. ReLU Activation Function
d. None
Correct Answer: c
Detailed Solution: An activation function such as ReLU gives a non-linearity to the
neural network.
4. Suppose you are to design a system where you want to perform word prediction also known
as language modeling. You are to take the output from the previous state and also the input at
each step to predict the next word. The inputs at each step are the words for which the next
words are to be predicted. Which of the following neural network would you use?
a. Multi-Layer Perceptron
b. Recurrent Neural Network
c. Convolutional Neural Network
d. Perceptron
Correct Answer: b
Detailed Solution:
A Recurrent Neural Network (RNN) is a type of neural network where the output from
the previous step is fed as input to the current step. Refer to the lecture notes for a
detailed explanation.
5. For a fully-connected deep network with one hidden layer, increasing the number of hidden
units should have what effect on bias and variance?
Correct Answer: a
Detailed Solution: Adding more hidden units should decrease bias and increase variance. In
general, more complicated models result in lower bias but larger variance, and adding
more hidden units certainly makes the model more complex.
6. You are given the task of predicting the price of a house given the various features of a house
such as number of rooms, area (sq ft), etc.
Correct Answer: c
Detailed Solution: The price of a house is a single value. Hence, one neuron is enough.
Correct Answer: b
Detailed Solution: Mean Squared Error finds the average squared difference between the
predicted value and the true value. Since there are no classes involved as in case of
classification tasks, Cross-Entropy Loss of any type doesn’t qualify to be a loss function.
8. A Convolutional Neural Network (CNN) is a Deep Neural Network that can extract various
abstract features from an input required for a given task. Given the operations performed
by a CNN on an input:
1) Max Pooling
2) Convolution Operation
3) Flatten
4) Forward propagation by Fully Connected Network
Identify the correct sequence from the options below:
a. 4,3,2,1
b. 2,1,3,4
c. 3,1,2,4
d. 4,2,1,3
Correct Answer: b
Detailed Solution:
Follow the lecture slides.
Correct Answer: a, c
Detailed Solution: Autoencoders perform dimensionality reduction and are unsupervised,
similar to PCA. The second option is true for the Variational Autoencoder, which is a
generative model, unlike conventional autoencoders. Autoencoders can have any form of
encoder and decoder.
10. In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer and 1
neuron in the output layer. What is the size of the weight matrices between hidden to output
layer and input to hidden layer?
a. [5 X 1], [8 X 5]
b. [8 X 5], [ 1 X 5]
c. [3 X 1], [3 X 3]
d. [3 X 3], [3 X 1]
Correct Answer: a
Explanation:
The weight matrix between the hidden layer (5 neurons) and the output layer (1 neuron) will
be of size [5 X 1].
The weight matrix between the input layer (8 neurons) and the hidden layer (5 neurons) will
be of size [8 X 5].
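A quick shape check in NumPy, treating examples as row vectors (one common convention):

```python
# Sketch: weight shapes for an 8-5-1 MLP with row-vector inputs.
import numpy as np

W_in_hidden = np.zeros((8, 5))   # input (8) -> hidden (5): [8 X 5]
W_hidden_out = np.zeros((5, 1))  # hidden (5) -> output (1): [5 X 1]

x = np.ones((1, 8))              # one example as a 1 x 8 row vector
h = x @ W_in_hidden              # shape (1, 5)
y = h @ W_hidden_out             # shape (1, 1)
print(h.shape, y.shape)
```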
11. If you increase the number of hidden layers in a Multi-Layer Perceptron, the classification
error of test data always decreases. True or False?
a. True
b. False
Correct Answer: b
Explanation: Increasing the number of hidden layers in a Multi-Layer Perceptron (MLP) doesn't
guarantee a decrease in classification error for test data. While adding more hidden layers can
potentially help the network learn more complex representations, it can also lead to overfitting, where
the model performs well on the training data but poorly on the test data. The optimal number of hidden
layers depends on the complexity of the problem, the amount of available data, and careful tuning of
various hyperparameters.
12. Which of the following represents the range of output values for a sigmoid function?
a. -1 to 1
b. -∞ to ∞
c. 0 to 1
d. 0 to ∞
Correct answer: c
Explanation: A sigmoid function, such as the logistic sigmoid function, maps input values to an output
range between 0 and 1. As the input values become larger, the output of the sigmoid function approaches
1, and as the input values become more negative, the output approaches 0. This property makes sigmoid
functions useful for tasks that involve binary classification or when you want to squash values into a
limited range.
Correct answer: b
Explanation: A single perceptron is not capable of directly computing the XOR function. The XOR
function is not linearly separable, which means that a single perceptron, which uses a linear decision
boundary, cannot accurately represent it. The XOR function's output is 1 when the number of input 1s
is odd, and the output is 0 when the number of input 1s is even. This behavior cannot be achieved with
just a single linear threshold. However, XOR can be computed using a multi-layer perceptron (a neural
network with at least one hidden layer), which can model more complex decision boundaries and
accurately represent non-linear relationships like the XOR function.
14. What are the steps for using a gradient descent algorithm?
1. Calculate error between the actual value and the predicted value
2. Repeat until you find the best weights of network
3. Pass an input through the network and get values from output layer
4. Initialize random values for weight and bias
5. Go to each neuron which contributes to the error and change its respective values to reduce
the error
a. 4,3,1,5,2
b. 1,2,3,4,5
c. 3,4,5,2,1
d. 2,3,4,5,1
Correct answer: a
Explanation:
Initialize random values for weight and bias: The process begins by initializing random values for the
weights and biases in the neural network. This is necessary to start the optimization process.
Pass an input through the network and get values from the output layer: The input data is propagated
through the network to obtain the predicted values at the output layer. This step is the forward pass and
helps to calculate the predicted output.
Calculate the error between the actual value and the predicted value: The calculated predicted values
are compared with the actual target values to compute the error or loss. This quantifies how far off the
predictions are from the true values.
Go to each neuron that contributes to the error and change its respective values to reduce the error: This
step involves backpropagation, where the gradients of the loss with respect to the network's parameters
(weights and biases) are computed. The weights are adjusted in a way that minimizes the error by using
gradient information.
Repeat until you find the best weights of the network: Steps 2 through 4 are repeated iteratively for a
certain number of epochs or until convergence criteria are met. The goal is to find the weights that
minimize the error and optimize the network's performance.
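A compact sketch of this loop for a one-feature linear model; the data, learning rate, and epoch count are illustrative assumptions:

```python
# Sketch: stochastic gradient descent for y = w*x + b with squared error.
import random

data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]   # illustrative (x, y) pairs
w, b = random.random(), random.random()        # step 4: random initialization
lr = 0.05

for epoch in range(500):                       # step 2: repeat until converged
    for x, y in data:
        pred = w * x + b                       # step 3: forward pass
        err = pred - y                         # step 1: error vs. the target
        w -= lr * err * x                      # step 5: adjust weights along
        b -= lr * err                          #         the error gradient
print(round(w, 2), round(b, 2))                # roughly w ≈ 2, b ≈ 1
```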
Correct answer: b
Explanation:
The backpropagation learning algorithm applied to a two-layer neural network, or any neural network,
does not guarantee to find the globally optimal solution but rather tends to find a locally optimal
solution.
a. always finds the globally optimal solution: This is incorrect. Neural networks can have complex
loss surfaces with many local minima, making it challenging for backpropagation to guarantee
the globally optimal solution.
b. finds a locally optimal solution which may be globally optimal: This is the most accurate
description. Backpropagation seeks to minimize the loss function by iteratively updating
weights using gradient descent. It converges to a local minimum that represents a good solution,
but it may also coincide with the globally optimal solution, especially in simpler cases.
c. never finds the globally optimal solution: This is not entirely accurate. While it's challenging
to guarantee finding the global optimum due to the complex nature of neural network loss
landscapes, it's still possible for the found local optimum to also be the global optimum,
especially in simpler settings.
d. finds a locally optimal solution which is never globally optimal: This is too strong a statement.
While a locally optimal solution may not always be globally optimal, it's not accurate to state
that it's "never" globally optimal.
***********END**********
Introduction to Machine Learning -IITKGP
Assignment - 7
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
[Only the NPTEL web capture of this assignment survives; the question statements and
figures are not recoverable from the capture. The accepted answers visible in it are
Q1: a and Q2: d.]
************END************