2023 ML Assignment
Assignment - 1
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Correct Answers: a, c
Explanation: In (c), the amount of rainfall is a continuous variable, but we are predicting
whether there will be abnormally heavy rainfall next year or not, so it is a classification
task. Similarly, gender identification (a) has a discrete set of classes, so it is also a
classification task. In the other options the output variable is continuous, so those are
regression tasks.
2. A feature F1 can take one of the values A, B, C, D, E, F, and represents the grade of students
from a college. Which of the following statements is true in this case?
a. Feature F1 is an example of a nominal variable.
b. Feature F1 is an example of an ordinal variable.
c. It doesn’t belong to any of the above categories.
d. Both of these
Correct Answer: b
Explanation: Ordinal variables are variables whose categories have a natural order.
For example, grade A should be considered a higher grade than grade B.
_______________________________________________________________________
3. Suppose I have 10,000 emails in my mailbox, out of which 200 are spam. The spam
detection system flags 150 emails as spam, out of which 50 are actually spam. What are
the precision and recall of my spam detection system?
Correct Answer: a
Explanation:
We know that:
Precision = TP / (TP + FP) = 50 / 150 = 33.33%
Recall = TP / (TP + FN) = 50 / 200 = 25%
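The arithmetic can be checked with a short Python sketch using the counts from the question:

```python
# Sketch: precision and recall for the spam example above.
tp = 50            # spam mails correctly flagged as spam
fp = 150 - 50      # flagged as spam but actually legitimate
fn = 200 - 50      # spam mails the system missed

precision = tp / (tp + fp)   # 50 / 150 ≈ 0.333
recall = tp / (tp + fn)      # 50 / 200 = 0.25
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```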
___________________________________________________________________________
4. Which of the following statements describes what is most likely TRUE when the amount
of training data increases?
Correct Answer: a
Explanation: As the training set grows, a model of fixed capacity finds it harder to fit every
training example exactly, so the training error typically increases; at the same time the larger
sample is more representative of the underlying distribution, so the generalization (test) error
typically decreases.
___________________________________________________________________________
5. You trained a learning algorithm and plotted the learning curve, obtaining the following
figure.
a. High bias
b. High variance
c. Neither
Correct Answer: a
Explanation: In the plot, the training error increases with the training set size, and the true error
converges to around 0.4, which is quite high. Thus, we can say that the bias is high.
___________________________________________________________________________
6. I am the marketing consultant of a leading e-commerce website. I have been given the task
of building a system that recommends products to users based on their activity on Facebook.
I realize that user interests can be highly variable. Hence, I decide to
T2) Train separate models for each community to predict which product category (e.g.,
electronic gadgets, cosmetics, etc.) would be the most relevant to that community.
Correct Answer: b
a. High bias
b. Low bias
c. Low variance
d. High variance
e. Good performance on training data
f. Poor performance on test data
Correct Answers: b, d, e, f
Explanation: Overfitting is characterized by good performance on the training data, as the
model has essentially memorized the data. However, it leads to poor performance on the
test data because the model fails to generalize well. Overfitting is associated with low bias
and high variance, meaning the model is sensitive to noise or fluctuations in the training
data.
________________________________________________________________________
10. Which of the following statements about cross-validation in machine learning is/are true?
a. Cross-validation is used to evaluate a model's performance on the training data.
b. Cross-validation guarantees that a model will generalize well to unseen data.
c. Cross-validation is only applicable to classification problems and not regression
problems.
d. Cross-validation helps in estimating the model's performance on unseen data by
simulating the test phase.
Correct Answer: d
Explanation: Cross-validation is a technique used in machine learning to assess the
performance and generalization ability of a model. It involves dividing the available labeled
data into multiple subsets or folds. The model is trained on a portion of the data (training set)
and evaluated on the remaining portion (validation or test set). By repeating this process with
different partitions of the data, cross-validation provides an estimate of the model's
performance on unseen data.
___________________________________________________________________________
11. What does k-fold cross-validation involve in machine learning?
a. Splitting the dataset into k equal-sized training and test sets.
b. Splitting the dataset into k unequal-sized training and test sets.
c. Partitioning the dataset into k subsets, and iteratively using each subset as a
validation set while the remaining k-1 subsets are used for training.
d. Dividing the dataset into k subsets, where each subset represents a unique class label
for classification tasks.
Correct Answer: c
Explanation: K-fold cross-validation involves dividing the dataset into k subsets or folds. The
process then iterates k times, where each time, one of the k subsets is used as the validation set,
while the remaining k-1 subsets are used for training the model. This ensures that each subset
is used as the validation set exactly once, and the model is trained and evaluated k times, with
each fold serving as the validation set once.
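As a concrete illustration, here is a minimal k-fold sketch with scikit-learn (assuming scikit-learn is available; the dataset and classifier are illustrative stand-ins, not taken from the question):

```python
# Sketch: 5-fold cross-validation. cross_val_score trains the model on k-1
# folds and evaluates on the held-out fold, once per fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())  # one accuracy per fold, plus their average
```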
___________________________________________________________________________
12. What does the term "feature space" refer to in machine learning?
a. The space where the machine learning model is trained.
b. The space where the machine learning model is deployed.
c. The space which is formed by the input variables used in a machine learning model.
d. The space where the output predictions are made by a machine learning model.
Correct Answer: c
Explanation: The feature space in machine learning refers to the space formed by the input
variables or features used in a model. It represents the space where the data points reside.
___________________________________________________________________________
13. Which of the following statements is/are true regarding supervised and unsupervised
learning?
a. Supervised learning can handle both labeled and unlabeled data.
b. Unsupervised learning requires human experts to label the data.
c. Supervised learning can be used for regression and classification tasks.
d. Unsupervised learning aims to find hidden patterns in the data.
Correct Answers: c, d
Explanation:
Option “a” is incorrect. Supervised learning specifically requires labeled data, while
unsupervised learning deals with unlabeled data.
Option “b” is incorrect. Unsupervised learning does not require human experts to label the data;
it learns from the raw, unlabeled data.
Option “c” is correct. Supervised learning encompasses both regression, where the output
variable is continuous, and classification, where the output variable is categorical.
Option “d” is correct. Unsupervised learning aims to find hidden patterns, structures, or
relationships within the data without any prior knowledge of the output labels.
___________________________________________________________________________
14. One of the ways to mitigate overfitting is
a. By increasing the model complexity
b. By reducing the amount of training data
c. By adding more features to the model
d. By decreasing the model complexity
Correct Answer: d
Explanation: Overfitting occurs when a machine learning model performs well on the training
data but fails to generalize to new, unseen data. It usually happens when the model becomes
too complex and starts to memorize the training examples instead of learning the underlying
patterns. To mitigate overfitting, one of the effective approaches is to decrease the model
complexity.
__________________________________________________________________________
Correct Answer: a
Explanation: Any variable ‘A’ can take 2 values (i.e., 0 or 1).
For ‘N’ variables there are 2^N rows in the truth table.
And the output of each row in the truth table can independently be 0 or 1.
Hence, we have 2^(2^N) different Boolean functions with N variables.
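A brute-force sketch that confirms the count for N = 2 (2^(2^2) = 16 functions), using only the standard library:

```python
# Sketch: enumerate all Boolean functions of N variables. A function is one
# output bit per row of the 2**N-row truth table, so there are 2**(2**N).
from itertools import product

N = 2
rows = list(product([0, 1], repeat=N))               # 2**N truth-table rows
functions = list(product([0, 1], repeat=len(rows)))  # one output bit per row
print(len(rows), len(functions))                     # 4 16
```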
___________________________________________________________________________
Introduction to Machine Learning -IITKGP
Assignment - 2
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Data for Q. 1 to 3
The following dataset will be used to learn a decision tree for predicting whether a person is
happy (H) or sad (S), based on the color of their shoes, whether they wear a wig, and the
number of ears they have.
Color  Wig  Num. Ears  Emotion
G      Y    2          S
G      N    2          S
G      N    2          S
B      N    2          S
B      N    2          H
R      N    2          H
R      N    2          H
R      N    2          H
R      Y    3          H
Correct answer: a
Explanation:
To calculate the entropy of the target variable (Emotion) given the condition Wig = Y, we
need to compute the distribution of emotions within that subset of the dataset:

Color  Wig  Num. Ears  Emotion
G      Y    2          S
R      Y    3          H

Within this subset, we have 1 instance of "S" (sad) and 1 instance of "H" (happy), so the two
classes are equally likely.
To calculate the entropy, we use the formula: Entropy(X) = - Σ P(x) log2 P(x)
Since P(S) = P(H) = 0.5, substituting these values into the entropy formula gives
Entropy = - 0.5 log2(0.5) - 0.5 log2(0.5) = 1.
Correct answer: b
Explanation:
To calculate the entropy of the target variable (Emotion) given the condition Ears = 3, we
need to compute the distribution of emotions within that subset of the dataset.
Within this subset, we have 1 instance of "H" (happy) and 0 instances of "S" (sad).
Since P(S) = 0 and P(H) = 1, substituting these values into the entropy formula (using the
convention 0 log2 0 = 0) gives
Entropy = - 1 log2(1) - 0 log2(0) = 0.
Correct answer: a
Explanation:
To determine the attribute to choose as the root of the decision tree, we need to consider the
concept of information gain. Information gain measures the reduction in entropy or impurity
achieved by splitting the data based on a specific attribute.
We can calculate the information gain for each attribute by comparing the entropy before and
after the split. The attribute with the highest information gain will be chosen as the root of the
decision tree.
Let's calculate the information gain for each attribute (Color, Wig, and Num. Ears) based on
the given dataset. The entropy of Emotion before any split is
Entropy (Emotion) = - (4/9) log2 (4/9) - (5/9) log2 (5/9) ≈ 0.991

Color: the subsets are G (3 S), B (1 S, 1 H), and R (4 H), with entropies 0, 1, and 0.
Weighted entropy after the split = (3/9)(0) + (2/9)(1) + (4/9)(0) ≈ 0.222, so the gain
≈ 0.991 - 0.222 ≈ 0.769.

Wig: the subsets are Y (1 S, 1 H) and N (3 S, 4 H), with entropies 1 and ≈ 0.985.
Weighted entropy after the split = (2/9)(1) + (7/9)(0.985) ≈ 0.988, so the gain ≈ 0.003.

Num. Ears: the subsets are 2 (4 S, 4 H) and 3 (1 H), with entropies 1 and 0.
Weighted entropy after the split = (8/9)(1) + (1/9)(0) ≈ 0.889, so the gain ≈ 0.102.

Based on these calculations, the attribute with the highest information gain is Color,
with an information gain of approximately 0.769. Therefore, Color should be chosen
as the root of the decision tree.
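A minimal sketch (standard library only) that recomputes these information gains from the happy/sad table; each row is (color, wig, ears, emotion):

```python
# Sketch: entropy and information gain for the dataset above.
from collections import Counter
from math import log2

data = [("G","Y",2,"S"), ("G","N",2,"S"), ("G","N",2,"S"),
        ("B","N",2,"S"), ("B","N",2,"H"), ("R","N",2,"H"),
        ("R","N",2,"H"), ("R","N",2,"H"), ("R","Y",3,"H")]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def info_gain(col):
    gain = entropy([r[-1] for r in data])       # entropy before the split
    for v in {r[col] for r in data}:
        subset = [r[-1] for r in data if r[col] == v]
        gain -= len(subset)/len(data) * entropy(subset)
    return gain

for col, name in enumerate(["Color", "Wig", "Num. Ears"]):
    print(name, round(info_gain(col), 3))       # Color ≈ 0.769 is the largest
```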
Correct answer: c
Explanation:
In linear regression, the output variable, also known as the dependent variable or target
variable, is continuous. Linear regression is a supervised learning algorithm used to model
the relationship between a dependent variable and one or more independent variables.
The goal of linear regression is to find a linear relationship between the independent variables
and the continuous output variable. The linear regression model predicts a continuous value
as the output based on the input features.
X    Y
6    7
5    4
10   9
3    4

J(θ) = (1 / 2m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)^2

where m is the number of training examples and h_θ(x_i) is the value of the linear regression
hypothesis at point i. If θ = [1, 1], find J(θ).
a. 0
b. 1
c. 2
d. 0.5
Correct answer: b
Explanation:
J(θ) = (1 / 2m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)^2
For θ = [1, 1], the hypothesis function h_θ(x) becomes h_θ(x) = 1 + x.
Substituting the values from the training data into the MSE equation:
J(θ) = 1/(2 * 4) [ (7 - 7)^2 + (6 - 4)^2 + (11 - 9)^2 + (4 - 4)^2 ]
= 1/(8) [ 0 + 4 + 4 + 0 ]
= 1/(8) [ 8 ]
= 1
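The computation can be verified in a couple of lines of Python:

```python
# Sketch: J(theta) for theta = [1, 1] on the (X, Y) table above.
X = [6, 5, 10, 3]
Y = [7, 4, 9, 4]
theta0, theta1 = 1, 1
m = len(X)

J = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)
print(J)  # 1.0
```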
Correct answer: b
Explanation:
The ID3 algorithm uses a greedy strategy to make local decisions at each node based on the
information gain or other impurity measures. It recursively builds the decision tree by
selecting the attribute that provides the highest information gain or the most significant
reduction in impurity at each step. However, this greedy approach does not consider the
global optimum for the entire decision tree structure.
Due to the greedy nature of the algorithm, it is possible for ID3 to get stuck in suboptimal
solutions or make decisions that do not result in the most accurate or optimal tree. In some
cases, the ID3 algorithm may produce a decision tree that is a local optimum but not the
global optimum.
Explanation:
In reality, a classifier trained on less training data is more likely to overfit. Overfitting occurs
when a model learns the training data too well, capturing noise or irrelevant patterns that do
not generalize to unseen data. When the training dataset is smaller, the model has less
exposure to the variety of examples and may struggle to capture the true underlying patterns.
With a limited amount of training data, the model has a higher risk of memorizing specific
examples and idiosyncrasies of the training set, resulting in a biased and overfitted model.
The lack of diversity in the training data hampers the model's ability to generalize well to
new, unseen examples.
Correct answer: b
Explanation: We can see this from the bias-variance trade-off. When the hypothesis space is
small, the model is more biased with less variance. So with a small hypothesis space, it is less
likely that we find a hypothesis that fits the data very well, i.e., less likely to overfit.
Correct answer: c
Explanation: The single biggest problem with the suggestion of using a multiway split with
one branch for each distinct value of a real-valued input attribute is that it would likely result
in a decision tree that overfits the training data. By creating a branch for each distinct value,
the tree would become more complex, and it would have the potential to fit the training data
too closely, capturing noise or irrelevant patterns specific to the training set.
As a consequence of overfitting, the decision tree would likely score well on the training set
since it can perfectly match the training examples. However, when evaluated on a test set or
unseen data, the tree would struggle to generalize and perform poorly. Overfitting leads to
poor performance on new instances, indicating that the model has failed to learn the
underlying patterns and instead has become too specialized in the training data.
10. Which of the following statements about decision trees is/are true?
a. Decision trees can handle both categorical and numerical data.
b. Decision trees are resistant to overfitting.
c. Decision trees are not interpretable.
d. Decision trees are only suitable for binary classification problems.
Correct answer: a
Explanation: Decision trees can handle both categorical and numerical data as they partition
the data based on various conditions during the tree construction process. This allows
decision trees to be versatile in handling different types of data.
11. Which of the following techniques can be used to handle overfitting in decision trees?
a. Pruning
b. Increasing the tree depth
c. Decreasing the minimum number of samples required to split a node
d. Adding more features to the dataset
Correct answer: a
Explanation: Overfitting occurs when a decision tree captures noise or irrelevant patterns in
the training data, resulting in poor generalization to unseen data. Pruning is a technique used
to reduce overfitting by removing unnecessary branches and nodes from the tree. By contrast,
increasing the tree depth, lowering the minimum number of samples required to split a node,
or adding more features all make the tree more complex and tend to worsen overfitting.
12. Which of the following is a measure used for selecting the best split in decision trees?
a. Gini Index
b. Support Vector Machine
c. K-Means Clustering
d. Naive Bayes
Correct answer: a
Explanation: The Gini Index is a commonly used measure for selecting the best split in
decision trees. It quantifies the impurity or dissimilarity of a node's class distribution. The
split that minimizes the Gini Index is chosen as the optimal split.
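A minimal sketch of how the Gini index scores a candidate binary split; the labels here are illustrative, not taken from a particular question:

```python
# Sketch: Gini impurity of a node, and the weighted Gini of a split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["S", "S", "H", "H"]))          # 0.5: maximally impure binary node
print(split_gini(["S", "S"], ["H", "H"]))  # 0.0: a perfect split
```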
13. What is the purpose of the decision tree's root node in machine learning?
a. It represents the class labels of the training data.
b. It serves as the starting point for tree traversal during prediction.
c. It contains the feature values of the training data.
d. It determines the stopping criterion for tree construction.
Correct answer: b
Explanation: The root node of a decision tree serves as the starting point for tree traversal
during prediction. It represents the first decision based on a feature and directs the flow of the
decision tree based on the outcome of that decision. The root node does not contain class
labels or feature values but rather determines the initial split based on a selected criterion.
Correct answer: b
Explanation: Linear regression assumes a linear relationship between the independent
variables (features) and the dependent variable (target). It seeks to find the best-fitting line to
the data.
While linear regression is primarily used for regression tasks, it is not suitable for
classification tasks. Outliers can significantly impact linear regression models, and missing
values in the dataset require appropriate handling.
15. Which of the following techniques can be used to mitigate overfitting in machine
learning?
a. Regularization
b. Increasing the model complexity
c. Gathering more training data
d. Feature selection or dimensionality reduction
Correct answers: a, c, d
Explanation: Regularization techniques, such as L1 or L2 regularization, can help mitigate
overfitting by adding a penalty term to the model's objective function, discouraging
excessively large parameter values.
Gathering more training data can also reduce overfitting by providing a more representative
sample of the underlying data distribution.
Feature selection or dimensionality reduction techniques, such as selecting relevant features
or applying techniques like Principal Component Analysis (PCA), can help remove irrelevant
or redundant features, reducing the complexity of the model and mitigating overfitting.
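As an illustration of the first technique, here is a minimal L2-regularization sketch with scikit-learn; the synthetic dataset and the alpha value are illustrative assumptions:

```python
# Sketch: ridge (L2) regression vs. plain least squares on a small noisy set.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=0)
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The penalty shrinks the coefficients, trading a little bias for less variance.
print(abs(plain.coef_).max(), abs(ridge.coef_).max())
```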
Introduction to Machine Learning -IITKGP
Assignment - 3
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
a. Non-parametric, eager
b. Parametric, eager
c. Non-parametric, lazy
d. Parametric, lazy
Correct Answer: c
Explanation: KNN is non-parametric because it makes no assumption about the underlying
data distribution. It is a lazy learning technique because at training time it merely memorizes
the data; the distance computations are deferred to testing time.
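A minimal sketch of this lazy behavior with scikit-learn (synthetic data, illustrative settings):

```python
# Sketch: k-NN "training" only stores the data; distances are computed
# at prediction time.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # fast: just memorizes X, y
print(knn.predict(X[:5]))  # each query searches the stored training points
```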
Q2. You have been given the following 2 statements. Find out which of these options is/are
true in the case of k-NN.
(i) In case of a very large value of k, we may include points from other classes in the
neighborhood.
(ii) In case of a too small value of k, the algorithm is very sensitive to noise.
Correct Answer: c
a. True
b. False
Correct Answer: a
Explanation: The training phase of the algorithm consists only of storing the feature vectors
and class labels of the training samples.
In the testing phase, a test point is classified by assigning the label that is most frequent
among the k training samples nearest to that query point, hence the higher computation at
test time.
Q4. Suppose you are given the following images (1 represents the left image, 2 represents
the middle and 3 represents the right). Now your task is to find out the value of k in k-NN in
each of the images shown below. Here k1 is for 1st, k2 is for 2nd and k3 is for 3rd figure.
a. k1 > k2> k3
b. k1 < k2> k3
c. k1 < k2 < k3
d. None of these
Correct Answer: c
Q6. Suppose, you have given the following data where x and y are the 2 input variables and
Class is the dependent variable.
a. + Class
b. – Class
c. Can’t Say
d. None of these
Correct Answer: a
Explanation: All three nearest points are of the + class, so this point will be classified as + class.
Q7. What is the optimum number of principal components in the below figure?
a. 10
b. 20
c. 30
d. 40
Correct Answer: c
Explanation: We can see in the figure that 30 components capture nearly the maximum
variance while keeping the number of components low. Hence option ‘c’ is the right answer.
Q8. Suppose we are using dimensionality reduction as pre-processing technique, i.e, instead
of using all the features, we reduce the data to k dimensions with PCA. And then use these
PCA projections as our features. Which of the following statements is correct?
Correct Answer: b
Explanation: A higher value of ‘k’ leads to less smoothing of the decision boundary; it
preserves more characteristics of the data, and hence corresponds to less regularization.
Correct Answer: a
a. Cold start
b. Overspecialization
c. None of the above
Correct Answer: a
Explanation: For new users, we have very few transactions. So, it’s very difficult to find similar
users.
Q11. Consider the figures below. Which figure shows the most probable PCA component
directions for the data points?
a. A
b. B
c. C
d. D
Correct Answer: a
Explanation: [Follow the lecture slides]
PCA chooses the component directions so as to:
1. Maximize the total variance of the data.
2. Be mutually orthogonal, thereby minimizing correlation between components.
Q12. Suppose that you wish to reduce the number of dimensions of a given dataset to k
dimensions using PCA. Which of the following statements is correct?
Correct Answer: b
Q13. Suppose you are given 7 plots 1-7 (left to right) and you want to compare Pearson
correlation coefficients between variables of each plot. Which of the following is true?
1. 1<2<3<4
2. 1>2>3>4
3. 7<6<5<4
4. 7>6>5>4
a. 1 and 3
b. 2 and 3
c. 1 and 4
d. 2 and 4
Correct Answer: b
Correct Answer: b
Explanation: LDA produces at most c − 1 discriminant vectors, where c is the number of
classes.
Correct Answer: a
Explanation: Option b is not appropriate for collaborative filtering because it involves
predicting the expected sales volume (number of books sold) as a function of the average
rating of a book. Collaborative filtering focuses on user-item interactions and is concerned
with providing personalized recommendations rather than predicting sales volume from
average ratings.
************END************
Introduction to Machine Learning -IITKGP
Assignment - 4
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
Q1.
Suppose that the system is now given a new mail to be classified as spam/ not-spam, what is
the probability that the mail will be classified as spam?
a. 0.89575
b. 0.10425
c. 0.00475
d. 0.09950
Correct Answer: b
Detailed Solution:
Let S = ‘Mails correctly marked spam by the system’, T = ‘Mails misclassified by the system’
(marked as spam when not spam, or marked as not spam when it is spam), and M = ‘Spam
mails’.
We are to find the probability of a mail being classified as spam, which happens either when
a spam mail is correctly classified as spam or when a non-spam mail is misclassified as spam.
d. 0.99525
Correct Answer: a
Q3. Given that a mail is classified as not spam, the probability of the mail actually being not
spam
a. 0.10425
b. 0.89575
c. 0.003
d. 0.997
Correct Answer: d
a. 0.90025
b. 0.09975
c. 0.8955
d. 0.1045
Correct Answer: b
= 0.09975
Correct Answer: b
Detailed Solution:
Naive Bayes Assumption is that all the features of a class are independent of each other
which is not the case in real life. Because of this assumption, the classifier is called Naive
Bayes Classifier.
Q6.
Consider the following dataset. a,b,c are the features and K is the class(1/0):
a b c K
1 0 1 1
1 1 1 1
0 1 1 0
1 1 0 0
1 0 1 0
0 0 0 1
Classify the test instance given below into class 1/0 using a Naive Bayes Classifier.
a b c K
0 0 1 ?
a. 0
b. 1
Correct Answer: b
Detailed Solution:
P(K=1 | a=0, b=0, c=1) ∝ P(K=1) ∗ P(a=0|K=1) ∗ P(b=0|K=1) ∗ P(c=1|K=1) [by the Naive
Bayes assumption; the denominator P(a=0, b=0, c=1) is the same for both classes, so it can
be ignored when comparing them]
= (3/6) ∗ (1/3) ∗ (2/3) ∗ (2/3) = 0.07407
P(K=0 | a=0, b=0, c=1) ∝ (3/6) ∗ (1/3) ∗ (1/3) ∗ (2/3) = 0.03703 < 0.07407
Hence, the test instance is classified as K = 1.
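A small sketch that reproduces these two scores directly from the six training rows (standard library only); each row is (a, b, c, K):

```python
# Sketch: unnormalized Naive Bayes scores for the test instance (0, 0, 1).
data = [(1,0,1,1), (1,1,1,1), (0,1,1,0), (1,1,0,0), (1,0,1,0), (0,0,0,1)]

def score(k, test):
    rows = [r for r in data if r[3] == k]
    prior = len(rows) / len(data)              # P(K = k)
    likelihood = 1.0
    for j, v in enumerate(test):               # P(feature_j = v | K = k)
        likelihood *= sum(r[j] == v for r in rows) / len(rows)
    return prior * likelihood

test = (0, 0, 1)
print(score(1, test), score(0, test))  # ≈ 0.0741 vs ≈ 0.0370, so predict K = 1
```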
c. 1/9
d. 8/9
Correct Answer: b
Detailed Solution:
P(K=0 | a=1, b=1)
= [P(K=0) ∗ P(a=1|K=0) ∗ P(b=1|K=0)] / [P(K=0) ∗ P(a=1|K=0) ∗ P(b=1|K=0) + P(K=1) ∗ P(a=1|K=1) ∗ P(b=1|K=1)]
= 2/3
Q8.
Answer Questions 8-10 with the data given below:
A patient goes to a doctor with symptoms S1, S2 and S3. The doctor suspects diseases D1 and
D2 and constructs a Bayesian network for the relation among the diseases and symptoms as
the following:
Detailed Solution:
From the figure, we can see that D1 and D2 are not dependent on any variable as they don’t
have any incoming directed edges. S1 has an incoming edge from D1, hence S1 depends on
D1. S2 has 2 incoming edges from D1 and D2, hence S2 depends on D1 and D2. S3 has an
incoming edge from D2, S3 depends on D2. Hence, (d) is the answer.
Q9. Suppose P(D1) = 0.4, P(D2) = 0.7 , P(S1|D1)=0.3 and P(S1| D1’)= 0.6. Find P(S1)
a. 0.12
b. 0.48
c. 0.36
d. 0.60
Correct Answer: b
Detailed Solution:
P(S1) = P(S1|D1) P(D1) + P(S1|D1’) P(D1’) = 0.3 ∗ 0.4 + 0.6 ∗ 0.6 = 0.12 + 0.36 = 0.48
a. D1
b. D2
c. D1 and D2
d. None
Correct Answer: b
Detailed Solution:
A variable X is conditionally independent of all other variables in the network given its
Markov blanket, which consists of:
● X’s parents
● X’s children
● Parents of X’s children
In the given diagram, the variable S3 has the single parent D2 and no children. Hence, the
correct answer is (b).
___________________________________________________________________________
Q11. Consider the following Bayes’ network:
Alarm1 means that the first alarm system rings, Alarm2 means that the second alarm system
rings, and Burglary means that a burglary is in progress. Now assume that:
P(Alarm1) = 0.1
P(Alarm2) = 0.2
P (Burglary | Alarm1, Alarm2) = 0.8
P (Burglary | Alarm1, ¬Alarm2) = 0.7
P (Burglary | ¬Alarm1, Alarm2) = 0.6
P (Burglary | ¬Alarm1, ¬Alarm2) = 0.5
Calculate P (Alarm2 | Burglary, Alarm1).
a. 0.78
b. 0.22
c. 0.50
d. 0.10
Correct Answer: b
Detailed Solution:
Since Alarm1 and Alarm2 are independent root nodes of the network,
P(Alarm2 | Burglary, Alarm1)
= P(Burglary | Alarm1, Alarm2) P(Alarm2) / [P(Burglary | Alarm1, Alarm2) P(Alarm2)
+ P(Burglary | Alarm1, ¬Alarm2) P(¬Alarm2)]
= (0.8 ∗ 0.2) / (0.8 ∗ 0.2 + 0.7 ∗ 0.8)
= 0.16 / 0.72 ≈ 0.22
Q12. The values of the conditional probabilities are given below. Find P(D).
Assume,
P(A) = 0.3
P(B) = 0.6
P(C|A) = 0.8
P(C|¬A) = 0.4
P(D|A, B) = 0.7
P(D|A, ¬B) = 0.8
P(D|¬A, B) = 0.1
P(D|¬A, ¬B) = 0.2
P(E|C) = 0.7
P(E|¬C) = 0.2
a. 0.68
b. 0.32
c. 0.50
d. 0.70
Correct Answer: b
Detailed Solution:
Since A and B are root nodes, they are independent, so
P(D) = P(D|A, B) P(A) P(B) + P(D|A, ¬B) P(A) P(¬B) + P(D|¬A, B) P(¬A) P(B)
+ P(D|¬A, ¬B) P(¬A) P(¬B)
= 0.7 ∗ 0.3 ∗ 0.6 + 0.8 ∗ 0.3 ∗ 0.4 + 0.1 ∗ 0.7 ∗ 0.6 + 0.2 ∗ 0.7 ∗ 0.4
= 0.126 + 0.096 + 0.042 + 0.056 = 0.32
Q13.
Answer Questions 13-14 with the data given below:
In an oral exam you have to solve exactly one problem, which might be one of three types, A,
B, or C, which will come up with probabilities 30%, 20%, and 50%, respectively. During your
preparation you have solved 9 of 10 problems of type A, 2 of 10 problems of type B, and 6 of
10 problems of type C.
What is the probability that you will solve the problem of the exam?
a. 0.61
b. 0.39
c. 0.50
d. 0.20
Correct Answer: a
Detailed Solution:
A: Problem of type A.
B: Problem of type B.
C: Problem of type C.
S: You solve the problem
P(A) = 0.30
P(B) = 0.20
P(C) = 0.50
P(S|A) = 9/10
P(S|B) = 2/10
P(S|C) = 6/10
P(S) = P(S|A) P(A) + P(S|B) P(B) + P (S|C) P(C)
= (9/10) * (0.30) + (2/10) * (0.20) + (6/10) * (0.50)
= 0.61
Q14. Given you have solved the problem, what is the probability that it was of type A?
a. 0.35
b. 0.50
c. 0.56
d. 0.44
Correct Answer: d
Detailed Solution:
A: Problem of type A.
S: You solve the problem
P(A) = 0.30
P(S|A) = 9/10
P(S) = 0.61
P(A|S) = P(S|A) ∗ P(A) / P(S) = (9/10)(0.30) / (0.61)
= 0.4426
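Both computations (total probability for Q13, then Bayes' rule for Q14) can be reproduced in a few lines of Python:

```python
# Sketch: P(S) by total probability, then P(A|S) by Bayes' rule.
priors = {"A": 0.30, "B": 0.20, "C": 0.50}
p_solve = {"A": 9/10, "B": 2/10, "C": 6/10}   # P(S | type)

p_s = sum(p_solve[t] * priors[t] for t in priors)   # P(S) = 0.61
p_a_given_s = p_solve["A"] * priors["A"] / p_s      # P(A|S) ≈ 0.4426
print(p_s, round(p_a_given_s, 4))
```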
Q15. Naive Bayes is a popular classification algorithm in machine learning. Which of the
following statements is/are true about Naive Bayes?
a. Naive Bayes assumes that all features are independent of each other, given the class.
b. It is particularly well-suited for text classification tasks, like spam detection.
c. Naive Bayes can handle missing values in the dataset without any special treatment.
d. It is a complex algorithm that requires a large amount of training data.
Correct Answers: a, b
Explanation:
a. Correct. Naive Bayes assumes that features are conditionally independent given the class.
This simplifying assumption allows the algorithm to estimate probabilities efficiently even
with a limited amount of data.
b. Correct. Naive Bayes is commonly used for text classification tasks, such as spam detection
and sentiment analysis, due to its ability to handle high-dimensional feature spaces.
c. Incorrect. Naive Bayes does not handle missing values naturally. Missing values need to be
handled before applying the algorithm.
d. Incorrect. Naive Bayes is actually known for its simplicity and ability to work well with
small amounts of training data. It is not considered complex and often provides good results
with relatively simple computations.
*******END*******
Introduction to Machine Learning -IITKGP
Assignment - 5
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
1. What would be the ideal complexity of the curve which can be used for separating the two
classes shown in the image below?
a. Linear
b. Quadratic
c. Cubic
d. insufficient data to draw a conclusion
Correct Answer: a
Explanation: The blue point in the red region is an outlier (most likely noise). The rest of
the data is linearly separable.
2. Suppose you are using a Linear SVM classifier with 2 class classification problem. Now you
have been given the following data in which some points are circled red that are representing
support vectors.
If you remove any one of the red points from the data, will the decision boundary change?
a. Yes
b. No
Correct Answer: a
Explanation: These three examples are positioned such that removing any one of them
introduces slack in the constraints. So, the decision boundary would completely change.
4. Which of the following statements accurately compares linear regression and logistic
regression?
a. Linear regression is used for classification tasks, while logistic regression is used for
regression tasks.
b. Linear regression models the relationship between input features and continuous
target variables, while logistic regression models the probability of binary outcomes.
c. Linear regression and logistic regression are identical in their mathematical
formulation and can be used interchangeably.
d. Linear regression and logistic regression both handle multi-class classification tasks
equally effectively.
Correct Answer: b
Explanation: Linear regression is employed to predict continuous numeric target variables
based on input features. It finds the best-fitting linear relationship between features and the
target variable. Logistic regression, on the other hand, is designed for binary classification
tasks where the goal is to estimate the probability that a given input belongs to a particular
class. It employs the logistic (sigmoid) function to map the linear combination of features to a
probability value between 0 and 1. Linear regression and logistic regression serve different
purposes and are not interchangeable due to their distinct objectives and mathematical
formulations.
5. After training an SVM, we can discard all examples which are not support vectors and can
still classify new examples?
a. True
b. False
Correct Answer: a
Explanation: Only the support vectors determine the decision boundary; the remaining
examples can be discarded without affecting how new examples are classified.
6. Suppose you are building a SVM model on data X. The data X can be error prone, which
means that you should not trust any specific data point too much. Now suppose that you want
to build a SVM model which has a quadratic kernel function (polynomial of degree 2) and
uses the slack penalty C as one of its hyperparameters.
What would happen when you use a very large value of C (C → infinity)?
a. We can still classify data correctly for given setting of hyper parameter C.
b. We can not classify data correctly for given setting of hyper parameter C
c. None of the above
Correct Answer: a
Explanation: For large values of C, the penalty for misclassifying points is very high, so
the decision boundary will perfectly separate the data if possible.
7. Following Question 6, what would happen when you use a very small value of C (C ≈ 0)?
a. Data will be correctly classified
b. Misclassification would happen
c. None of these
Correct Answer: b
Explanation: The classifier can maximize the margin between most of the points, while
misclassifying a few points, because the penalty is so low.
8. If g(z) is the sigmoid function, then its derivative with respect to z may be written in
terms of g(z) as
a. g(z)(1-g(z))
b. g(z)(1+g(z))
c. -g(z)(1+g(z))
d. g(z)(g(z)-1)
Correct Answer: a
Detailed Solution:
g′(z) = d/dz [1 / (1 + e^(−z))]
= e^(−z) / (1 + e^(−z))^2
= [1 / (1 + e^(−z))] ∗ [1 − 1 / (1 + e^(−z))]
= g(z)(1 − g(z))
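A quick numerical check of this identity, using a central finite difference:

```python
# Sketch: numerically confirming g'(z) = g(z) * (1 - g(z)).
from math import exp

def g(z):
    return 1 / (1 + exp(-z))

z, h = 0.7, 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)   # central finite difference
analytic = g(z) * (1 - g(z))
print(numeric, analytic)                     # both ≈ 0.2217
```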
9. In the linearly non-separable case, what effect does the C parameter have on the
SVM model?
Correct Answer: c
Explanation: A high value of the C parameter places more emphasis on the penalties arising
from points lying on the wrong sides of the margins. The optimizer therefore shrinks the
margin so that fewer such points are involved in deciding the decision boundary.
10. What type of kernel function is commonly used for non-linear classification tasks in
SVM?
a. Linear kernel
b. Polynomial kernel
c. Sigmoid kernel
d. Radial Basis Function (RBF) kernel
Correct Answer: d
Explanation: The Radial Basis Function (RBF) kernel is commonly used for non-linear
classification tasks in SVM. It introduces non-linearity by mapping data points into a high-
dimensional space, where a linear decision boundary corresponds to a non-linear decision
boundary in the original feature space. The RBF kernel is suitable for capturing complex
relationships and is widely used due to its effectiveness.
11. Which of the following statements is/are true about kernel in SVM?
Correct Answer: c
Explanation: Follow lecture notes
12. The soft-margin SVM is preferred over the hard-margin SVM when:
Correct Answer: b, c
Explanation: When the data has noise and overlapping points, there is a problem in drawing
a clear hyperplane without misclassifying.
a. H1
b. H2
c. H3
d. None of the above.
Correct Answer: c
To determine the maximum-margin hyperplane, you need to look for the hyperplane that has
the largest "margin" between it and the nearest data point. The margin is the perpendicular
distance between the hyperplane and the closest data point from either class.
H3 has the largest gap between itself and the nearest data point. That hyperplane would be the
maximum-margin hyperplane.
14. What is the primary advantage of Kernel SVM compared to traditional SVM with a linear
kernel?
Correct Answer: c
Explanation: The primary advantage of Kernel SVM is its ability to capture complex non-
linear relationships between data points through the use of kernel functions. While traditional
SVM with a linear kernel is limited to finding linear decision boundaries, Kernel SVM can
transform the data into higher-dimensional spaces where non-linear decision boundaries can
be effectively learned. This makes Kernel SVM suitable for a wide range of classification tasks
where linear separation is not sufficient.
15. What is the sigmoid function's role in logistic regression?
Correct Answer: d
Explanation: The sigmoid function, also known as the logistic function, plays a crucial role in
logistic regression. It transforms the linear combination of input features and corresponding
weights into a value between 0 and 1. This value represents the estimated probability that the
input belongs to a particular class. The sigmoid function's curve ensures that the output remains
within the probability range, making it suitable for binary classification.
************END************
Introduction to Machine Learning -IITKGP
Assignment - 6
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
1. Given below the neural network, find the appropriate weights for w0, w1, and w2 to represent
the AND function. Threshold function = {1, if output >0; 0 otherwise}. x0 and x1 are the inputs
and b1=1 is the bias.
Correct Answer: b
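The option texts are not reproduced above, but the AND behavior can be verified with a few lines of Python. The weights used here (w0 = -1.5 on the bias, w1 = w2 = 1) are one valid illustrative choice, not necessarily the weights in keyed option b:

```python
# Sketch: a single neuron computing AND under the given threshold
# (output 1 if the weighted sum > 0, else 0), with bias input b1 = 1.
# The weight values are an illustrative assumption.
w0, w1, w2 = -1.5, 1.0, 1.0

def neuron(x0, x1, b1=1):
    s = w0 * b1 + w1 * x0 + w2 * x1
    return 1 if s > 0 else 0

for x0 in (0, 1):
    for x1 in (0, 1):
        print(x0, x1, "->", neuron(x0, x1))  # output is 1 only for (1, 1)
```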
Correct Answer: a
a. Gradient descent
b. Bias
c. ReLU Activation Function
d. None
Correct Answer: c
Detailed Solution: An activation function such as ReLU gives a non-linearity to the
neural network.
4. Suppose you are to design a system where you want to perform word prediction also known
as language modeling. You are to take the output from the previous state and also the input at
each step to predict the next word. The inputs at each step are the words for which the next
words are to be predicted. Which of the following neural network would you use?
a. Multi-Layer Perceptron
b. Recurrent Neural Network
c. Convolutional Neural Network
d. Perceptron
Correct Answer: b
Detailed Solution:
A Recurrent Neural Network (RNN) is a type of neural network where the output from
the previous step is fed as input to the current step. Refer to the lecture notes for a
detailed explanation.
5. For a fully-connected deep network with one hidden layer, increasing the number of hidden
units should have what effect on bias and variance?
Correct Answer: a
Detailed Solution: Adding more hidden units should decrease bias and increase variance. In
general, more complicated models result in lower bias but larger variance, and adding
more hidden units certainly makes the model more complex.
6. You are given the task of predicting the price of a house given the various features of a house
such as number of rooms, area (sq ft), etc.
Correct Answer: c
Detailed Solution: The price of a house is a single value. Hence, one neuron is enough.
Correct Answer: b
Detailed Solution: Mean Squared Error finds the average squared difference between the
predicted value and the true value. Since there are no classes involved as in case of
classification tasks, Cross-Entropy Loss of any type doesn’t qualify to be a loss function.
8. A Convolutional Neural Network (CNN) is a Deep Neural Network that can extract various
abstract features from an input required for a given task. Given the operations performed
by a CNN on an input:
1) Max Pooling
2) Convolution Operation
3) Flatten
4) Forward propagation by Fully Connected Network
Identify the correct sequence from the options below:
a. 4,3,2,1
b. 2,1,3,4
c. 3,1,2,4
d. 4,2,1,3
Correct Answer: b
Detailed Solution:
Follow the lecture slides.
Correct Answer: a, c
Detailed Solution: Autoencoders perform dimensionality reduction and are unsupervised,
similar to PCA. The second option is true for the Variational Autoencoder, which is a
generative model, unlike conventional autoencoders. Autoencoders can have any form of
encoder and decoder.
10. In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer and 1
neuron in the output layer. What is the size of the weight matrices between hidden to output
layer and input to hidden layer?
a. [5 X 1], [8 X 5]
b. [8 X 5], [ 1 X 5]
c. [3 X 1], [3 X 3]
d. [3 X 3], [3 X 1]
Correct Answer: a
Explanation:
The weight matrix between the hidden layer (5 neurons) and the output layer (1 neuron) will
be of size [5 X 1].
The weight matrix between the input layer (8 neurons) and the hidden layer (5 neurons) will
be of size [8 X 5].
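A quick shape check in NumPy, treating examples as row vectors (one common convention):

```python
# Sketch: weight shapes for an 8-5-1 MLP with row-vector inputs.
import numpy as np

W_in_hidden = np.zeros((8, 5))   # input (8) -> hidden (5): [8 X 5]
W_hidden_out = np.zeros((5, 1))  # hidden (5) -> output (1): [5 X 1]

x = np.ones((1, 8))              # one example as a 1 x 8 row vector
h = x @ W_in_hidden              # shape (1, 5)
y = h @ W_hidden_out             # shape (1, 1)
print(h.shape, y.shape)
```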
11. If you increase the number of hidden layers in a Multi-Layer Perceptron, the classification
error of test data always decreases. True or False?
a. True
b. False
Correct Answer: b
Explanation: Increasing the number of hidden layers in a Multi-Layer Perceptron (MLP) doesn't
guarantee a decrease in classification error for test data. While adding more hidden layers can
potentially help the network learn more complex representations, it can also lead to overfitting, where
the model performs well on the training data but poorly on the test data. The optimal number of hidden
layers depends on the complexity of the problem, the amount of available data, and careful tuning of
various hyperparameters.
12. Which of the following represents the range of output values for a sigmoid function?
a. -1 to 1
b. -∞ to ∞
c. 0 to 1
d. 0 to ∞
Correct answer: c
Explanation: A sigmoid function, such as the logistic sigmoid function, maps input values to an output
range between 0 and 1. As the input values become larger, the output of the sigmoid function approaches
1, and as the input values become more negative, the output approaches 0. This property makes sigmoid
functions useful for tasks that involve binary classification or when you want to squash values into a
limited range.
Correct answer: b
Explanation: A single perceptron is not capable of directly computing the XOR function. The XOR
function is not linearly separable, which means that a single perceptron, which uses a linear decision
boundary, cannot accurately represent it. The XOR function's output is 1 when the number of input 1s
is odd, and the output is 0 when the number of input 1s is even. This behavior cannot be achieved with
just a single linear threshold. However, XOR can be computed using a multi-layer perceptron (a neural
network with at least one hidden layer), which can model more complex decision boundaries and
accurately represent non-linear relationships like the XOR function.
14. What are the steps for using a gradient descent algorithm?
1. Calculate error between the actual value and the predicted value
2. Repeat until you find the best weights of network
3. Pass an input through the network and get values from output layer
4. Initialize random values for weight and bias
5. Go to each neuron which contributes to the error and change its respective values to reduce
the error
a. 4,3,1,5,2
b. 1,2,3,4,5
c. 3,4,5,2,1
d. 2,3,4,5,1
Correct answer: a
Explanation:
Initialize random values for weight and bias: The process begins by initializing random values for the
weights and biases in the neural network. This is necessary to start the optimization process.
Pass an input through the network and get values from the output layer: The input data is propagated
through the network to obtain the predicted values at the output layer. This step is the forward pass and
helps to calculate the predicted output.
Calculate the error between the actual value and the predicted value: The calculated predicted values
are compared with the actual target values to compute the error or loss. This quantifies how far off the
predictions are from the true values.
Go to each neuron that contributes to the error and change its respective values to reduce the error: This
step involves backpropagation, where the gradients of the loss with respect to the network's parameters
(weights and biases) are computed. The weights are adjusted in a way that minimizes the error by using
gradient information.
Repeat until you find the best weights of the network: Steps 2 through 4 are repeated iteratively for a
certain number of epochs or until convergence criteria are met. The goal is to find the weights that
minimize the error and optimize the network's performance.
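A compact sketch of this loop for a one-feature linear model; the data, learning rate, and epoch count are illustrative assumptions:

```python
# Sketch: stochastic gradient descent for y = w*x + b with squared error.
import random

data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]   # illustrative (x, y) pairs
w, b = random.random(), random.random()        # step 4: random initialization
lr = 0.05

for epoch in range(500):                       # step 2: repeat until converged
    for x, y in data:
        pred = w * x + b                       # step 3: forward pass
        err = pred - y                         # step 1: error vs. the target
        w -= lr * err * x                      # step 5: adjust weights along
        b -= lr * err                          #         the error gradient
print(round(w, 2), round(b, 2))                # roughly w ≈ 2, b ≈ 1
```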
Correct answer: b
Explanation:
The backpropagation learning algorithm applied to a two-layer neural network, or any neural network,
does not guarantee to find the globally optimal solution but rather tends to find a locally optimal
solution.
a. always finds the globally optimal solution: This is incorrect. Neural networks can have complex
loss surfaces with many local minima, making it challenging for backpropagation to guarantee
the globally optimal solution.
b. finds a locally optimal solution which may be globally optimal: This is the most accurate
description. Backpropagation seeks to minimize the loss function by iteratively updating
weights using gradient descent. It converges to a local minimum that represents a good solution,
but it may also coincide with the globally optimal solution, especially in simpler cases.
c. never finds the globally optimal solution: This is not entirely accurate. While it's challenging
to guarantee finding the global optimum due to the complex nature of neural network loss
landscapes, it's still possible for the found local optimum to also be the global optimum,
especially in simpler settings.
d. finds a locally optimal solution which is never globally optimal: This is too strong a statement.
While a locally optimal solution may not always be globally optimal, it's not accurate to state
that it's "never" globally optimal.
***********END**********
Introduction to Machine Learning -IITKGP
Assignment - 7
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 15 Total mark: 2 * 15 = 30
[Only the NPTEL web capture of this assignment survives; the question statements and
figures are not recoverable from the capture. The accepted answers visible in it are
Q1: a and Q2: d.]
************END************