Introduction To Machine Learning
Assignment- Week 0
TYPE OF QUESTION: MCQ
Number of questions: 10 Total mark: 10 X 2 = 20
MCQ Question
_______________________________________________________________________
QUESTION 1:
Find the maxima and minima of the function $f(x) = x + \frac{1}{x}$.
A. -1,1
B. 1,-1
C. -2,2
D. 2,-2
Correct Answer: A.
Detailed Solution:
$f'(x) = 1 - \frac{1}{x^2}$, so $f'(x) = 0$ at $x = 1$ and $x = -1$.
$f''(x) = \frac{2}{x^3}$.
For $x = 1$, $f''(1) = \frac{2}{1^3} = 2 > 0$, so $x = 1$ is a point of minima for the function.
For $x = -1$, $f''(-1) = \frac{2}{(-1)^3} = -2 < 0$, so $x = -1$ is a point of maxima for the function.
_______________________________________________________________________
QUESTION 2:
Precision is defined as the fraction of relevant instances among the retrieved instances
and Recall is defined as the fraction of relevant instances that have been retrieved over
the total amount of relevant instances. A typical Information Retrieval system retrieves a
total of 20 documents for a particular query out of which only 5 are relevant. Find the
Precision and Recall of the system. Total set of relevant documents = 10.
A. 0.5,0.25
B. 0.25, 0.5
C. 0.5,0.5
D. 0.25,0.25
Correct Answer: B.
Detailed Solution: Precision = (relevant instances among retrieved instances / total no of
all retrieved instances) = 5/20 = 0.25
Recall = (relevant instances among retrieved instances / total no of relevant instances) =
5/10 = 0.5
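The arithmetic above can be checked with a minimal Python sketch. The function name `precision_recall` and its parameter names are illustrative choices, not part of the assignment.

```python
def precision_recall(relevant_retrieved, total_retrieved, total_relevant):
    """Return (precision, recall) from raw retrieval counts."""
    precision = relevant_retrieved / total_retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall

# 20 documents retrieved, 5 of them relevant, 10 relevant documents overall
precision, recall = precision_recall(5, 20, 10)
print(precision, recall)
```

The same two ratios apply to any retrieval or detection task once the three counts are known.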
_______________________________________________________________________
QUESTION 3:
Entropy associated with each possible data value is the negative logarithm of the
probability mass function for the value. Example Formula is:
$H(S) = -\sum_i p_i \log_2(p_i)$
Here, $H(S)$ denotes entropy, $i$ represents a class, and $p_i$ denotes the probability of that
class.
Given a list of 20 examples including 10 positive, 5 negative and 5 neutral examples. The
entropy of the dataset with respect to this classification is:
A. 3/2
B. 2
C. 5/2
D. 3
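Correct Answer: A
Detailed Solution: $H(S) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{4}\log_2\frac{1}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = \frac{3}{2}$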
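A short Python sketch of the same entropy calculation, working from class counts. The helper name `entropy` is an illustrative choice; it assumes base-2 logarithms, as in the formula given in the question.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts; empty classes are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 10 positive, 5 negative and 5 neutral examples
h = entropy([10, 5, 5])
print(h)
```

Skipping zero counts follows the usual convention that a class with probability 0 contributes nothing to the sum.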
_____________________________________________________________________
QUESTION 4:
Find the limit $\lim_{x \to 2} \frac{\sqrt{7+x} - 3}{x - 2}$
(Hint: Use L'Hôpital's rule)
A. 1/3
B. 1/6
C. 2/3
D. 5/6
Correct Answer: B
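Detailed Solution: Use L'Hôpital's rule. Both the numerator and the denominator tend to 0 as $x \to 2$, so differentiating each gives $\lim_{x \to 2} \frac{1}{2\sqrt{7+x}} = \frac{1}{2 \cdot 3} = \frac{1}{6}$.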
_____________________________________________________________________
QUESTION 5:
5 runners run a race. How many different ways can the top 3 finishers be selected, if we do not
care about the specific order of these top 3?
A. 5
B. 10
C. 20
D. 30
Correct Answer: B.
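Because the order of the top 3 does not matter, the count is the binomial coefficient C(5, 3). A minimal Python check, using the standard `math.comb` and `math.perm` functions (Python 3.8+), with the ordered count shown only for contrast:

```python
import math

# Unordered selections of 3 finishers from 5 runners: C(5, 3)
unordered = math.comb(5, 3)
# For contrast, ordered podiums (1st, 2nd, 3rd): P(5, 3)
ordered = math.perm(5, 3)
print(unordered, ordered)
```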
QUESTION 6:
A busy student must complete 3 problem sets before doing laundry. Each problem set requires
1 day with probability 2/3 and 2 days with probability 1/3. Let B be the number of days a busy
student delays laundry. What is E[B]?
Example: If the first problem set requires 1 day and the second and third problem sets each
requires 2 days, then the student delays for B = 5 days.
A. 2
B. 3
C. 4
D. 5
Correct Answer: C
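Detailed Solution: Let $B_i$ be the number of days required by problem set $i$. Then $E[B_i] = 1 \cdot \frac{2}{3} + 2 \cdot \frac{1}{3} = \frac{4}{3}$, and by linearity of expectation $E[B] = 3 \cdot \frac{4}{3} = 4$.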
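The expected delay can also be verified by brute-force enumeration of all 2^3 outcomes. The sketch below assumes the three problem sets are independent, which the question implies but does not state; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Each problem set takes 1 day (probability 2/3) or 2 days (probability 1/3)
durations = {1: Fraction(2, 3), 2: Fraction(1, 3)}

expected_days = Fraction(0)
for outcome in product(durations, repeat=3):
    # Probability of this particular sequence of durations
    prob = Fraction(1)
    for d in outcome:
        prob *= durations[d]
    expected_days += prob * sum(outcome)
print(expected_days)
```

Exact fractions are used so the result is not affected by floating-point rounding.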
_______________________________________________________________________
QUESTION 7 :
In a class, there are 15 students who like chocolate. 13 students like vanilla. 10 students
like neither. If there are 35 students in the class, how many students like chocolate and
vanilla?
A. 2
B. 12
C. 3
D. 20
Correct Answer: C.
Detailed Solution:
X: set of students who like chocolate
Y: set of students who like vanilla
$|X \cup Y| = |X| + |Y| - |X \cap Y|$
Since 10 of the 35 students like neither flavour, $|X \cup Y| = 35 - 10 = 25$.
$|X \cap Y| = 15 + 13 - 25 = 3$.
_______________________________________________________________________
QUESTION 8:
Suppose there is a sentence "let's play or not play". The bag-of-words representation
vector of the sentence is the count of each word in the sentence, which corresponds to:
Now suppose we have a query vector related to 'play', q1 = (0,1,0,0), and a query
vector related to 'let's', q2 = (1,0,0,0). Find the query nearest to the sentence vector (s).
Hint: Use the cosine similarity between two vectors.
A. q1
B. q2
Correct Answer: A
Detailed Solution: Compute cosine similarity of q1 and s. Then compute the cosine
similarity of q2 and s. The query with the higher cosine similarity with s is the nearest
query to the sentence.
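The comparison can be sketched in Python as follows. The sentence vector is not legible in the question, so the code assumes the vocabulary order ("let's", "play", "or", "not"), which is the order implied by q1 and q2 and gives s = (1, 2, 1, 1). The function name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Assumed vocabulary order: ("let's", "play", "or", "not")
s = (1, 2, 1, 1)   # counts for "let's play or not play"
q1 = (0, 1, 0, 0)  # query related to "play"
q2 = (1, 0, 0, 0)  # query related to "let's"

sim_q1 = cosine_similarity(q1, s)
sim_q2 = cosine_similarity(q2, s)
nearest = "q1" if sim_q1 > sim_q2 else "q2"
print(sim_q1, sim_q2, nearest)
```

Under that assumption the similarities are 2/√7 for q1 and 1/√7 for q2, so q1 is the nearer query.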
_______________________________________________________________________
QUESTION 9:
Let u be an n×1 vector, such that $u^T u = 1$. Let I be the n×n identity matrix. The n×n
matrix A is given by $(I - k u u^T)$, where k is a real constant. u itself is an eigenvector
of A, with eigenvalue −1. What is the value of k?
A. -2
B. -1
C. 2
D. 0
Correct Answer: C
Detailed Solution:
$(I - k u u^T)u = -1 \cdot u$
$u - k u (u^T u) = -u$
$2u = ku$ (note: $u^T u = 1$)
$k = 2$
_______________________________________________________________________
QUESTION 10:
Let $A_{m \times n}$ be a matrix of real numbers. The matrix $AA^T$ has an eigenvector x with
eigenvalue b. Then the eigenvector y of $A^T A$ which has eigenvalue b is equal to
A. $x^T A$
B. $A^T x$
C. x
D. Cannot be described in terms of x
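Correct Answer: B
Detailed Solution: Since x is an eigenvector of $AA^T$ with eigenvalue b, $AA^T x = bx$. Multiplying both sides on the left by $A^T$:
$A^T A A^T x = b A^T x$
$(A^T A)(A^T x) = b(A^T x)$
From the equation above, we observe that $A^T x$ is an eigenvector of the matrix $A^T A$
with the eigenvalue b. As y is also an eigenvector of the same matrix with the same
eigenvalue, $y = A^T x$.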
_______________________________________________________________________
END
NPTEL Online Certification Courses Indian
Institute of Technology Kharagpur
Assignment- Week 1
TYPE OF QUESTION: MCQ
Number of questions: 10 Total mark: 10 X 2 = 20
MCQ Question
_______________________________________________________________________
QUESTION 1:
Correct Answer: A
Detailed Solution : The number of classes in pneumonia detection is discrete, so it is a
classification task. In the other options, the output variable is continuous, so these are
regression tasks.
_______________________________________________________________________
QUESTION 2:
Which of the following is not a type of supervised learning?
A. Classification
B. Regression
C. Clustering
D. None of the above
Correct Answer: C. Clustering
Detailed Solution : Classification and Regression are both supervised learning methods
as they need class labels or target values for training, but Clustering doesn't need target
values.
_______________________________________________________________________
QUESTION 3:
Detailed Solution : Finding the shortest path is a graph theory based task, whereas
other options are completely suitable for machine learning.
_____________________________________________________________________
QUESTION 4:
Suppose I have 10,000 emails in my mailbox out of which 300 are spams. The spam detection
system detects 150 mails as spams, out of which 50 are actually spams. What is the precision
and recall of my spam detection system?
Correct Answer: C
Detailed Solution :
Precision $= \frac{T_p}{T_p + F_p} = \frac{50}{50 + 100} = 33.33\%$
Recall $= \frac{T_p}{T_p + F_n} = \frac{50}{50 + 250} \approx 16.67\%$
_______________________________________________________________________
QUESTION 5 :
Correct Answer: A, C
Detailed Solution: Option B is an unsupervised learning problem.
_______________________________________________________________________
QUESTION 6:
Aliens challenge you to a complex game that no human has seen before. They give you
time to learn the game and develop strategies before the final showdown. You choose to
use machine learning because an intelligent machine is your only hope. Which machine
learning paradigm should you choose for this?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Use a random number generator and hope for the best
_______________________________________________________________________
QUESTION 7:
How many Boolean functions are possible with 𝑁 features?
A. $2^{(2^N)}$
B. $2^N$
C. $N^2$
D. $4^N$
Correct Answer: A. $2^{(2^N)}$
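The count follows because N binary features give 2^N input rows, and a Boolean function assigns one output bit to each row. A small Python sketch confirms this by explicit enumeration for small N; the function name `count_boolean_functions` is illustrative, and the approach is only practical for very small N.

```python
from itertools import product

def count_boolean_functions(n):
    """Count distinct truth tables over n binary inputs by explicit enumeration."""
    inputs = list(product([0, 1], repeat=n))            # 2**n input rows
    tables = set(product([0, 1], repeat=len(inputs)))   # one output bit per row
    return len(tables)

counts = {n: count_boolean_functions(n) for n in range(1, 4)}
print(counts)
```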
QUESTION 8:
Detailed Solution : The validation dataset is used to tune the model's hyperparameters during
training.
_______________________________________________________________________
QUESTION 9:
Regarding bias and variance, which of the following statements are true? (Here ‘high’ and
‘low’ are relative to the ideal model.)
A. Models which overfit have a high bias.
B. Models which overfit have a low bias.
C. Models which underfit have a high variance.
D. Models which underfit have a low variance.
Correct Answer : B, D
QUESTION 10:
Which of the following is a categorical feature?
A. Height of a person
B. Price of petroleum
C. Mother tongue of a person
D. Amount of rainfall in a day
Correct Answer: C
Detailed Solution: Categorical variables represent types of data which may be divided
into groups. Mother tongue is a categorical feature. All other options are continuous.
______________________________________________________________________
*******END*******
Assignment- Week 2
TYPE OF QUESTION: MCQ
Number of questions: 10 Total mark: 10 X 2 = 20
MCQ Question
QUESTION 1:
In a binary classification problem, out of 30 data points 12 belong to class I and 18 belong
to class II. What is the entropy of the data set?
A. 0.97
B. 0
C. 1
D. 0.67
Answer: A. 0.97
Detailed Solution:
Entropy = - ((12/30)*log2(12/30)+(18/30)*log2(18/30)) = 0.97
__________________________________________________________________
QUESTION 2:
A. Low bias
B. High variance
C. Lack of smoothness of prediction surfaces
D. None of the above
Correct Answer: A, B, C
Detailed Solution: Decision tree classifiers have low bias and high variance. As decision
trees split the input space into rectangular spaces, the predictor surface or the decision
boundary lacks smoothness.
__________________________________________________________________
QUESTION 3:
Statement: Decision Tree is an unsupervised learning algorithm.
Reason: The splitting criterion uses only the features of the data to calculate their
respective measures.
Detailed Solution : Decision Tree is a supervised learning algorithm and the reason is
also false.
_______________________________________________________________
QUESTION 4:
In linear regression, our hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$, and the training data is given in the
table.
x    y
10   5
3    3
6    7
8    6
The cost function is $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$, where m is the number of training data points.
What is the value of $J(\theta)$ when $\theta = (1,1)$?
A. 0
B. 5.75
C. 4.75
D. 6.75
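The cost can be evaluated directly from the table. The sketch below assumes θ = (θ0, θ1) = (1, 1), so h(x) = 1 + x; the function name `cost` is illustrative. The predictions are 11, 4, 7 and 9, the squared errors are 36, 1, 0 and 9, and the result is 46 / (2 · 4) = 5.75, which matches option B.

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error J(theta) for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

xs = [10, 3, 6, 8]
ys = [5, 3, 7, 6]
j = cost(1, 1, xs, ys)
print(j)
```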
_________________________________________________________________
QUESTION 5:
Correct Answer: A. The training accuracy is high while the test accuracy is low.
Detailed Solution: The training accuracy is high while the test accuracy is low.
_________________________________________________________________
QUESTION 6:
Consider the following dataset. We want to build a decision tree classifier to detect
whether a tumor is malignant or not using several input features such as age, vaccination,
tumor size and tumor site. The target variable is “Malignant” and the other attributes are
input features.
Detailed Solution:
________________________________________________________________
QUESTION 8:
For the dataset in Question 7, what is the information gain of Vaccination (If entropy
measure is used to calculate information gain)?
A. 0.4763
B. 0.2102
C. 0.1134
D. 0.9355
________________________________________________________________
QUESTION 9:
Which of the following criteria is typically used for optimizing in linear regression?
A. Maximizing the number of points touched by the line
B. Minimizing the number of points touched by the line
C. Minimizing the sum of squared distance of the line from the points
D. Minimizing the maximum squared distance of a point from a line
Correct Answer: C. Minimizing the sum of squared distance of the line from the
points
Detailed Solution: In linear regression, the objective is to minimize the sum of squared
distance of the line from the points.
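A minimal sketch of this criterion for a single input variable, using the closed-form least-squares solution. It assumes "distance" means the vertical residual between the line and each point (ordinary least squares); the toy data and function names are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the line and the points."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
intercept, slope = fit_line(xs, ys)
best = sum_squared_residuals(intercept, slope, xs, ys)
print(intercept, slope, best)
```

Nudging either parameter away from the fitted values can only increase the sum of squared residuals, which is what makes the fitted line optimal under this criterion.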
________________________________________________________________
QUESTION 10:
Detailed Solution: The linear regression parameters can take any real number value.
________________________________________________________________
*****END*****
Assignment- Week 3
TYPE OF QUESTION: MCQ
Number of questions: 10 Total mark: 10 X 2 = 20
QUESTION 1:
Suppose, you have been given the following data where x1 and x2 are the 2 input
variables and Class is the dependent variable.
x1 x2 Class
-1 1 -
0 1 +
0 2 -
1 -1 -
1 0 +
1 2 +
2 2 -
2 3 +
What will be the class of a new data point x1=1 and x2=1 in 5-NN (k nearest neighbour
with k=5) using euclidean distance measure?
A. + Class
B. – Class
C. Cannot be determined
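The vote can be worked out with a short Python sketch. The function name `knn_predict` is illustrative, and the code assumes a simple majority vote with no distance weighting. For the query (1, 1), three points lie at distance 1 (all "+") and two at distance √2 (both "-"), so the 5 nearest neighbours vote 3 to 2 for the + class (option A).

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [
    ((-1, 1), "-"), ((0, 1), "+"), ((0, 2), "-"), ((1, -1), "-"),
    ((1, 0), "+"), ((1, 2), "+"), ((2, 2), "-"), ((2, 3), "+"),
]
predicted = knn_predict(train, (1, 1), k=5)
print(predicted)
```

There is no tie at the cutoff here: the fifth-nearest point is at distance √2 and the sixth at distance 2.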
QUESTION 2:
Imagine you are dealing with a 10 class classification problem. What is the maximum
number of discriminant vectors that can be produced by LDA?
A. 20
B. 14
C. 9
D. 10
Correct Answer: C. 9
Detailed Solution : LDA produces at most c − 1 discriminant vectors, c = no of classes
_______________________________________________________________________
QUESTION 3:
_______________________________________________________________________
QUESTION 4:
A. KNN algorithm does more computation on test time rather than train time.
B. KNN algorithm does lesser computation on test time rather than train time.
C. KNN algorithm does an equal amount of computation on test time and train time.
D. None of these.
Correct Answer: A. KNN algorithm does more computation on test time rather than
train time.
Detailed Solution : The training phase of the algorithm consists only of storing the feature
vectors and class labels of the training samples.
In the testing phase, a test point is classified by assigning the label which is the most
frequent among the k training samples nearest to that query point – hence higher
computation.
_______________________________________________________________________
QUESTION 5:
Which of the following necessitates feature reduction in machine learning?
1. Irrelevant and redundant features
2. Curse of dimensionality
3. Limited computational resources.
A. 1 only
B. 2 only
C. 1 and 2 only
D. 1, 2 and 3
Correct Answer: D. 1,2 and 3
_______________________________________________________________________
QUESTION 6:
When there is noise in data, which of the following options would improve the performance
of the k-NN algorithm?
Detailed Solution : Increasing the value of k reduces the effect of the noise and
improves the performance of the algorithm.
_______________________________________________________________________
QUESTION 7:
Find the value of the Pearson’s correlation coefficient of X and Y from the data in the
following table.
AGE (X) GLUCOSE (Y)
43 99
21 65
25 79
42 75
A. 0.47
B. 0.68
C. 1
D. 0.33
Correct Answer : B. 0.68
Detailed Solution : Pearson Coefficient $r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}$
Where X = [43,21,25,42], Y = [99,65,79,75], $\bar{X}$ = mean of $X_i$ values and $\bar{Y}$ = mean of
$Y_i$ values.
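The coefficient can be computed from the table with a few lines of Python. The function name `pearson_r` is illustrative. With these data the means are 32.75 and 79.5, the numerator is 332.5, and the two sums of squares are 388.75 and 611.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r([43, 21, 25, 42], [99, 65, 79, 75])
print(round(r, 2))
```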
_______________________________________________________________________
QUESTION 8:
A. Only 2
B. 1, 3 and 4
C. 1, 2 and 3
D. 2, 3 and 4
Correct Answer: D
Detailed Solution : PCA is an unsupervised learning algorithm, so 1 is wrong. Other
statements are true about PCA.
_______________________________________________________________________
QUESTION 9:
In user-based collaborative filtering based recommendation, the items are
recommended based on :
A. Similar users
B. Similar items
C. Both of the above
D. None of the above
______________________________________________________________________
QUESTION 10:
Identify whether the following statement is true or false?
“Linear Discriminant Analysis (LDA) is a supervised method”
A. TRUE
B. FALSE
_______________________________________________________________________
******END****
Introduction to Machine Learning
Assignment- Week 4
TYPE OF QUESTION: MCQ
Number of questions: 10 Total mark: 10 X 2 = 20
______________________________________________________________________
QUESTION 1:
A man is known to speak the truth 2 out of 3 times. He throws a die and reports that the
number obtained is 4. Find the probability that the number obtained is actually 4 :
A. 2/3
B. 3/4
C. 5/22
D. 2/7
$P(B|A) = \frac{2}{7}$
_________________________________________________________________
QUESTION 2:
Two cards are drawn at random from a deck of 52 cards without replacement. What is
the probability of drawing a 2 and an Ace in that order?
A. 4/51
B. 1/13
C. 4/256
D. 4/663
A : Drawing a 2
$P(AB) = \frac{1 \cdot 4}{13 \cdot 51} = \frac{4}{663}$
______________________________________________________________________
QUESTION 3:
Consider the following graphical model, mark which of the following pair of random
variables are independent given no evidence?
A. a,b
B. c,d
C. e,d
D. c,e
Correct Answer : A. a,b
Detailed Solution : Nodes a and b don’t have any predecessor nodes. As they don’t
have any common parent nodes, a and b are independent.
______________________________________________________________________
QUESTION 4:
Consider the following Bayesian network. The random variables given in the model are
modeled as discrete variables (Rain = R, Sprinkler = S and Wet Grass = W) and the
corresponding probability values are given below. (Note: (¬ X) represents complement
of X)
P(R) = 0.1
P(S) = 0.2
P(W | R, S) = 0.8
P(W | R, ¬ S) = 0.7
P(W | ¬ R, S) = 0.6
P(W | ¬ R, ¬ S) = 0.5
A. 1
B. 0.5
C. 0.22
D. 0.78
Detailed Solution : $P(S|W,R) = \frac{P(W,S,R)}{P(W,R)} = \frac{P(W,S,R)}{P(W,S,R) + P(W,\neg S,R)}$
$P(W,S,R) = P(W|S,R) \cdot P(R) \cdot P(S) = 0.8 \times 0.1 \times 0.2 = 0.016$
$P(W,\neg S,R) = P(W|\neg S,R) \cdot P(R) \cdot P(\neg S) = 0.7 \times 0.1 \times 0.8 = 0.056$
$P(S|W,R) = \frac{0.016}{0.016 + 0.056} = 0.22$
______________________________________________________________________
QUESTION 5:
What is the naive assumption in a Naive Bayes Classifier?
Correct Answer: B. All the features of a class are independent of each other
Detailed Solution: Naive Bayes Assumption is that all the features of a class are
independent of each other.
______________________________________________________________________
QUESTION 6:
A drug test (random variable T) has 1% false positives (i.e., 1% of those not taking
drugs show positive in the test), and 5% false negatives (i.e., 5% of those taking drugs
test negative). Suppose that 2% of those tested are taking drugs. Determine the
probability that somebody who tests positive is actually taking drugs (random variable
D).
A. 0.66
B. 0.34
C. 0.50
D. 0.91
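This is a direct application of Bayes' rule: P(D | T+) = P(T+ | D) P(D) / [P(T+ | D) P(D) + P(T+ | ¬D) P(¬D)]. A minimal Python sketch follows; the function name `posterior` and its parameter names are illustrative. With the numbers in the question it gives 95/144, about 0.66 (option A).

```python
from fractions import Fraction

def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# P(D) = 2%, P(T+ | D) = 1 - 5% = 95%, P(T+ | not D) = 1%
p = posterior(Fraction(2, 100), Fraction(95, 100), Fraction(1, 100))
print(p, float(p))
```

Even with a fairly accurate test, the low prior of 2% keeps the posterior well below the test's 95% sensitivity.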
QUESTION 7:
It is given that $P(A|B) = 2/3$ and $P(A|\bar{B}) = 1/4$. Compute the value of $P(B|A)$.
A. ½
B. ⅔
C. ¾
D. Not enough information.
Correct Solution : D. Not enough information.
Detailed Solution : There are 3 unknown probabilities 𝑃(𝐴), 𝑃(𝐵), 𝑃(𝐴𝐵) which can not
be computed from the 2 given probabilities. So, we don’t have enough information to
compute 𝑃(𝐵|𝐴).
______________________________________________________________________
QUESTION 8:
Consider the following Bayesian network, where F = having the flu and C = coughing:
A. 0.35, 0.23
B. 0.35,0.77
C. 0.24, 0.024
D. 0.5, 0.23
Detailed Solution :
______________________________________________________________________
QUESTION 9:
Bag I contains 4 white and 6 black balls while another Bag II contains 4 white and 3
black balls. One ball is drawn at random from one of the bags and it is found to be
black. Find the probability that it was drawn from Bag I.
A. 1/2
B. 2/3
C. 7/12
D. 9/23
______________________________________________________________________
QUESTION 10:
In a Bayesian network a node with only outgoing edge(s) represents
Detailed Solution : As there is no incoming edge for the node, the node is not
conditionally dependent on any other node.
______________________________________________________________________
************END*******
Course -Introduction to Machine Learning
Assignment- Week 5 (Logistic Regression, SVM, Kernel Function, Kernel
SVM)
TYPE OF QUESTION: MCQ/MSQ
Number of Question: 10 Total Marks:10x2 =20
__________________________________________________________________
Question 1:
What would be the ideal complexity of the curve which can be used for separating the
two classes shown in the image below?
A) Linear
B) Quadratic
C) Cubic
D) insufficient data to draw conclusion
Correct Answer: A
Detailed Solution: The blue point in the red region is an outlier. The rest of the data is
linearly separable.
__________________________________________________________________
Question 2:
Suppose you have a dataset with n=10 features and m=1000 examples. After training a
logistic regression classifier with gradient descent, you find that it has high training error
and does not achieve the desired performance on training and validation sets. Which of
the following might be promising steps to take?
1. Use SVM with a non-linear kernel function
2. Reduce the number of training examples
3. Create or add new polynomial features
A) 1, 2
B) 1, 3
C) 1, 2, 3
D) None
Correct Answer: B
Detailed Solution: As logistic regression did not perform well, it is highly likely that the
dataset is not linearly separable. SVM with a non-linear kernel works well for
non-linearly separable datasets. Creating new polynomial features will also help in
capturing the non-linearity in the dataset.
__________________________________________________________________
Question 3:
In logistic regression, we learn the conditional distribution p(y|x), where y is the class
label and x is a data point. If h(x) is the output of the logistic regression classifier for an
input x, then p(y|x) equals:
A. $h(x)^y (1 - h(x))^{(1-y)}$
B. $h(x)^y (1 + h(x))^{(1-y)}$
C. $h(x)^{1-y} (1 - h(x))^y$
D. $h(x)^y (1 + h(x))^{(1+y)}$
Correct Answer: A
Detailed Solution: Refer to the lecture.
__________________________________________________________________
Question 4:
Correct Answer: B
Detailed Solution: The output of binary class logistic regression lies in the range:
[0,1].
__________________________________________________________________
Question 5:
Correct Answer: A
Detailed Solution : Using only the support vector points, it is possible to classify new
examples.
__________________________________________________________________
Question 6:
Suppose you are dealing with a 3-class classification problem and you want to train a
SVM model on the data. For that you are using the One-vs-all method. How many
times do we need to train our SVM model in such a case?
A) 1
B) 2
C) 3
D) 4
Correct Answer: C
Detailed Solution: In a N-class classification problem, we have to train the SVM N
times in the one vs all method.
__________________________________________________________________
__________________________________________________________________
Question 7:
1. Kernel function can map low dimensional data to high dimensional space
2. It’s a similarity function
A) 1
B) 2
C) 1 and 2
D) None of these.
Correct Answer: C
Detailed Solution: Kernels are used in SVMs to map low dimensional data into high
dimensional feature space to classify non-linearly separable data. It also acts as a
similarity function.
_________________________________________________________________
Question 8:
If g(z) is the sigmoid function, then its derivative with respect to z may be written in
terms of g(z) as
A) g(z)(g(z)-1)
B) g(z)(1+g(z))
C) -g(z)(1+g(z))
D) g(z)(1-g(z))
Correct Answer: D
Detailed Answer:
$g'(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)(1 - g(z))$
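The identity can be checked numerically by comparing g(z)(1 − g(z)) against a central finite difference. This is a sanity check rather than a proof; the function names and the sample points are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Largest disagreement between the analytic and numeric derivatives
max_gap = max(abs(sigmoid_grad(z) - numeric_grad(sigmoid, z))
              for z in (-3.0, -1.0, 0.0, 0.5, 2.0))
print(max_gap)
```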
__________________________________________________________________
Question 9:
Below are the labelled instances of 2 classes and hand drawn decision boundaries for
logistic regression. Which of the following figures demonstrates overfitting of the
training data?
A) A
B) B
C) C
D) None of these
Correct Answer: C
Detailed Solution: In the third figure (C), the decision boundary is very complex and unlikely to
generalize to unseen data.
__________________________________________________________________
Question 10:
What do you conclude after seeing the visualization in the previous question (Question
9)?
C1. The training error in the first plot is higher as compared to the second and third
plot.
C2. The best model for this regression problem is the last (third) plot because it
has minimum training error (zero).
C3. Out of the 3 models, the second model is expected to perform best on
unseen data.
C4. All will perform similarly because we have not seen the test data.
A) C1 and C2
B) C1 and C3
C) C2 and C3
D) C4
Correct Answer: B
Detailed Solution: From the visualization, it is clear that the misclassified samples
are more in the plot A when compared to B and C. So, C1 is correct. In figure 3, the
training error is less due to complex boundaries. So, it is unlikely to generalize the
data well. Therefore, option C2 is wrong.
The first model is very simple and underfits the training data. The third model is very
complex and overfits the training data. The second model compared to these models
has less training error and is likely to perform well on unseen data. So, C3 is correct.
We can estimate the performance of the model on unseen data by observing the
nature of the decision boundary. Therefore, C4 is incorrect.
__________________________________________________________________
End
Course Name – Introduction To Machine Learning
Assignment – Week 6 (Neural Networks)
TYPE OF QUESTION: MCQ/MSQ
Question 1:
The neural network given below takes two binary valued inputs 𝑥1, 𝑥2 ∈ {0,1} and the
activation function is the binary threshold function ( ℎ(𝑥) = 1 𝑖𝑓 𝑥 > 0; 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 ). Which
of the following logical functions does it compute?
A) AND
B) OR
C) NAND
D) None of the above
Correct Answer. A
Detailed Solution: ℎ(𝑥) = 1 𝑖𝑓 (15𝑥1 + 10𝑥2 − 20) > 0; 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
if we write the truth table for ℎ(𝑥) , it will be:
𝑥1 𝑥2 ℎ(𝑥)
0 0 0
0 1 0
1 0 0
1 1 1
The truth table for ℎ(𝑥) is the same as the truth table for the AND logical function.
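The truth table can be reproduced with a few lines of Python. The weights 15 and 10 and the bias −20 are taken from the expression in the detailed solution; the function names are illustrative.

```python
def threshold(x):
    """Binary threshold activation: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0

def unit(x1, x2):
    # Weights 15 and 10 and bias -20, as in the detailed solution
    return threshold(15 * x1 + 10 * x2 - 20)

truth_table = {(x1, x2): unit(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(truth_table)
```

Only the input (1, 1) pushes the weighted sum above zero (15 + 10 − 20 = 5), which is exactly the behaviour of AND.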
_____________________________________________________________________________
Question 2:
A) I, II, III, IV
B) IV, III, II, I
C) III, I, II, IV
D) I, IV, III, II
Correct Answer: D
Detailed Solution: Refer to the lecture. D is the correct sequence.
_____________________________________________________________________________
Question 3:
Suppose you have inputs as x, y, and z with values -2, 5, and -4 respectively. You have a
neuron ‘q’ and neuron ‘f’ with functions:
q=x+y
f=q*z
A) (-3, 4, 4)
B) (4, 4, 3)
C) (-4, -4, 3)
D) (3, -4, -4)
Correct Answer: C
Detailed Solution: To calculate gradient, we should find out (df/dx), (df/dy) and (df/dz).
$\frac{df}{dx} = \frac{d}{dx}((x + y)z) = z \cdot 1 = z = -4$
$\frac{df}{dy} = \frac{d}{dy}((x + y)z) = z \cdot 1 = z = -4$
$\frac{df}{dz} = \frac{d}{dz}((x + y)z) = (x + y) = (-2 + 5) = 3$
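The same gradient can be obtained by a hand-written forward and backward pass through the two neurons, which is how backpropagation would treat this small graph. The function name `forward_backward` is illustrative.

```python
def forward_backward(x, y, z):
    """Forward pass for q = x + y, f = q * z, then gradients via the chain rule."""
    # Forward pass
    q = x + y
    f = q * z
    # Backward pass
    df_dq = z            # d(q * z)/dq
    df_dz = q            # d(q * z)/dz
    df_dx = df_dq * 1    # dq/dx = 1
    df_dy = df_dq * 1    # dq/dy = 1
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2, 5, -4)
print(f, grads)
```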
_____________________________________________________________________________
Question 4:
For a fully-connected neural network with one hidden layer, what effect should increasing the
number of hidden units have on bias and variance?
A. Decrease bias, increase variance
B. Increase bias, increase variance
C. Increase bias, decrease variance
D. No effect
Correct Answer: A
Detailed Solution: Adding more hidden units should decrease bias and increase variance. In
general, more complicated models will result in lower bias but higher variance, and adding
more hidden units certainly makes the model more complex.
_____________________________________________________________________________
Question 5:
Which of the following is true about model capacity (where model capacity means the ability
of a neural network to approximate complex functions)?
Correct Answer: A
Detailed Solution: As the number of hidden layers increases, the ability of the neural network to
model complex functions increases.
_____________________________________________________________________________
Question 6:
Correct Answer. B
Detailed Solution: The back-propagation algorithm finds a local optimal solution, which may
be a global optimal solution.
_____________________________________________________________________________
Question 7:
A) Gradient descent
B) Bias
C) Sigmoid Activation Function
D) None
Correct Answer: C
Detailed Solution: An activation function such as sigmoid gives non-linearity to the neural
network.
_____________________________________________________________________________
Question 8:
The network that involves backward links from outputs to the inputs and hidden layers is called:
A) Self-organizing Maps
B) Perceptron
C) Recurrent Neural Networks
D) Multi-Layered Perceptron
Correct Answer: C
Detailed Solution: Recurrent Neural Networks involve backward links from outputs to the
inputs and hidden layers.
_____________________________________________________________________________
Question 9:
A Convolutional Neural Network(CNN) is a Deep Neural Network which can extract various
abstract features from an input required for a given task. Given are the operations performed
by a CNN on an input:
1) Max Pooling
2) Convolution Operation
3) Flatten
4) Forward propagation by Fully Connected Network
Identify the correct sequence of operations performed from the options below:
A) 4,3,2,1
B) 2,1,3,4
C) 3,1,2,4
D) 4,2,1,3
Correct Answer: B
Detailed Solution: Follow the lecture slides.
_____________________________________________________________________________
Question 10:
In training a neural network, we notice that the loss does not decrease in the first few
epochs. What is the reason for this?
A) The learning Rate is low.
B) The Regularization Parameter is High.
C) Stuck at the Local Minima.
D) All of the above could be the reason.
Correct Answer: D
Detailed Solution: The problem can occur due to any one of the reasons above.
_____________________________________________________________________________
END
Course Name: Introduction to Machine Learning
Assignment – Week 7 (Computational Learning theory, PAC Learning, Sample
Complexity, VC Dimension, Ensemble Learning)
TYPE OF QUESTION: MCQ/MSQ
Question 1:
Which of the following options is / are correct regarding the benefits of ensemble model?
1. Better performance
2. More generalized model
3. Better interpretability
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1, 2 and 3
Correct Answer: C
Detailed Solution: 1 and 2 are the benefits of ensemble models. Option 3 is incorrect because
when we ensemble multiple models, we lose interpretability of the models.
____________________________________________________________________
Question 2:
In AdaBoost, we give more weight to points that have been misclassified in previous iterations.
Now suppose we introduce a limit or cap on the weight that any point can take (for example, a
restriction that prevents any point's weight from exceeding a value of 10). Which
among the following would be the effect of such a modification?
Correct Answer: B, C
Detailed Solution: Outliers tend to get misclassified. As the number of iterations increases,
the weight corresponding to outlier points can become very large resulting in subsequent
classifier models trying to classify the outlier points correctly. This generally has an adverse
effect on the overall classifier. Restricting the weights is one way of mitigating this problem.
However, this can also lower the performance of the classifier.
____________________________________________________________________
Question 3:
Question 4:
Considering the AdaBoost algorithm, which among the following statements is true?
A) In each stage, we try to train a classifier which makes accurate predictions on a subset
of the data points where the subset contains more of the data points which were
misclassified in earlier stages.
B) The weight assigned to an individual classifier depends upon the weighted sum error of
misclassified points for that classifier.
C) Both option A and B are true
D) None of them are true
Correct Answer: C
Detailed Solution: In each stage, the AdaBoost algorithm tries to train a classifier which makes
accurate predictions on a subset of the data points, where the subset contains more of the data
points which were misclassified in earlier stages. The weight assigned to an individual classifier
depends upon the weighted sum error of misclassified points for that classifier.
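As a minimal sketch of the second statement, the standard discrete AdaBoost rule sets the classifier weight to alpha = 0.5 * ln((1 - err) / err), where err is the weighted error. The function name classifier_weight and the sample weights below are illustrative assumptions.

```python
import math

def classifier_weight(sample_weights, misclassified):
    """Classifier weight alpha from the weighted error (discrete AdaBoost rule)."""
    total = sum(sample_weights)
    err = sum(w for w, m in zip(sample_weights, misclassified) if m) / total
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - err) / err)

weights = [0.25, 0.25, 0.25, 0.25]
# One of four equally weighted points is misclassified: err = 0.25, alpha > 0.
alpha_good = classifier_weight(weights, [True, False, False, False])
# Three of four are misclassified: err = 0.75, alpha < 0.
alpha_bad = classifier_weight(weights, [True, True, True, False])
print(alpha_good, alpha_bad)
```

A weighted error below 0.5 gives a positive alpha and an error above 0.5 gives a negative one, so a worse-than-chance classifier has its vote inverted.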
____________________________________________________________________
Question 5:
Correct Answer: A
Detailed Solution: Bagging decreases the variance of the classifier.
____________________________________________________________________
Question 6:
Suppose the VC dimension of a hypothesis space is 6. Which of the following are true?
Correct Answer: A, D
Detailed Solution: If the VC dimension of a hypothesis space is d:
● There exists at least one set of d points that can be shattered by the hypothesis space.
● No set of (d+1) points can be shattered by the hypothesis space.
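To make "shattered" concrete, here is a small sketch for a much simpler hypothesis space than the one in the question: closed intervals [a, b] on the real line, whose VC dimension is 2. The helper names are hypothetical, and the check only tests the specific point sets passed in.

```python
from itertools import product

def interval_realizes(points, labels):
    """True if some interval [a, b] labels exactly the positive points as 1."""
    positives = [p for p, y in zip(points, labels) if y == 1]
    if not positives:
        return True  # an interval placed away from every point labels all of them 0
    lo, hi = min(positives), max(positives)
    # The tightest interval around the positives must not contain a negative point.
    return all(not (lo <= p <= hi) for p, y in zip(points, labels) if y == 0)

def is_shattered(points):
    """A set is shattered if every one of its 2^n labelings is realizable."""
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(is_shattered([1.0, 2.0]))        # two points: all 4 labelings work
print(is_shattered([1.0, 2.0, 3.0]))   # labeling (1, 0, 1) is impossible
```

For intervals the labeling (1, 0, 1) fails on any three distinct points, which is why no set of d + 1 = 3 points can be shattered in this toy case.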
____________________________________________________________________
Question 7:
Correct Answer: B
Detailed Solution: Ensemble is a collection of a diverse set of learners to improve the stability
and the performance of the algorithm. So, the more diverse the models are, the better will be the
performance of the ensemble.
____________________________________________________________________
Question 8:
Correct Answer: C.
Detailed Solution: A decision tree does not aggregate the results of multiple trees, so it is not
an ensemble algorithm.
____________________________________________________________________
Question 9:
Suppose you have run AdaBoost on a training set for three boosting iterations. The results are
classifiers h1, h2, and h3, with coefficients α1 = 0.2, α2 = −0.3, and α3 = −0.2. For a given test
input x, you find that the classifiers' results are h1(x) = 1, h2(x) = 1, and h3(x) = −1. What is
the class returned by the AdaBoost ensemble classifier H on test example x?
A) 1
B) -1
Correct Answer: A
Detailed Solution:
The final output is H(x) = sign((α1*h1(x))+(α2*h2(x))+(α3*h3(x)))
H(x) = sign ((0.2*1) + (−0.3*1) + (−0.2* −1)) = sign(0.1) = 1.
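The same weighted vote can be checked with a few lines of Python. The function name adaboost_predict is hypothetical, and returning 1 when the score is exactly zero is an assumed tie-breaking convention.

```python
def adaboost_predict(alphas, votes):
    """Weighted vote of base classifiers with outputs in {-1, +1}."""
    score = sum(a * h for a, h in zip(alphas, votes))
    label = 1 if score >= 0 else -1   # assumed convention: a zero score maps to +1
    return label, score

label, score = adaboost_predict([0.2, -0.3, -0.2], [1, 1, -1])
print(label, score)
```

Note that h3 votes −1 but carries a negative coefficient, so its contribution to the score is positive.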
____________________________________________________________________
Question 10:
Generally, an ensemble method works better, if the individual base models have
____________? (Note: Individual models have accuracy greater than 50%)
A) Less correlation among predictions
B) High correlation among predictions
C) Correlation does not have an impact on the ensemble output
D) None of the above.
Correct Answer: A
Detailed Solution: A lower correlation among ensemble model members will increase the
error-correcting capability of the model. So it is preferred to use models with low correlations
when creating ensembles.
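A rough way to see this is to compare a majority vote over fully independent models with one over perfectly correlated models. The sketch below assumes an odd number of binary classifiers, each correct with the same probability p; the accuracy value 0.7 is an illustrative choice.

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """Accuracy of a majority vote over n independent models, each correct with probability p."""
    k_min = n_models // 2 + 1   # smallest number of correct votes that wins
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_min, n_models + 1))

p = 0.7
independent = majority_vote_accuracy(3, p)   # errors are uncorrelated
correlated = p                               # identical predictions: the vote adds nothing
print(independent, correlated)
```

With three independent models the vote is correct with probability 0.7^3 + 3 * 0.7^2 * 0.3 = 0.784, whereas three perfectly correlated models stay at 0.7.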
____________________________________________________________________
END
Course Name: Introduction to Machine Learning
Assignment – Week 8 (Clustering)
TYPE OF QUESTION: MCQ/MSQ
Correct Answer: A
Detailed Solution: The K-Means clustering algorithm may converge to a local minimum, which
coincides with the global minimum in some cases but not always. Different initial centroid
choices may produce different clustering results.
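The dependence on initialization can be seen on a toy example. The sketch below runs plain Lloyd iterations on made-up 1-D data with two different sets of initial centroids; the data, the function names and the initial centroids are all illustrative assumptions.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd iterations on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_labels = [min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
                      for x in points]
        if new_labels == labels:
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return labels, centroids

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum((x - centroids[l]) ** 2 for x, l in zip(points, labels))

points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
good = kmeans_1d(points, [1, 11, 21])   # one initial centroid per natural group
bad = kmeans_1d(points, [0, 1, 2])      # all initial centroids in the first group
print(good[0], sse(points, *good))
print(bad[0], sse(points, *bad))
```

Both runs stop because the assignments no longer change, but the second run settles on a clustering with a much larger sum of squared errors, i.e. a local minimum.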
_____________________________________________________________________________
Question 2:
Correct Answer: C
Detailed Solution: Both the conditions can act as possible termination conditions.
_____________________________________________________________________________
Question 3:
Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering
algorithm. After the first iteration, the clusters C1, C2 and C3 have the following observations:
C1: {(1,1), (4,4), (7,7)}
Correct Answer: A
Detailed Solution:
Finding centroid for data points in cluster C1 = ((1+4+7)/3, (1+4+7)/3) = (4, 4)
Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)
Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)
Hence, C1: (4,4), C2: (2,2), C3: (7,7)
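The centroid update can be reproduced as follows. Only C1 is listed in the question text here; the members of C2 and C3 are read off from the arithmetic in the solution and should be treated as assumptions.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

c1 = [(1, 1), (4, 4), (7, 7)]
c2 = [(0, 4), (4, 0)]   # assumed from the solution's ((0+4)/2, (4+0)/2)
c3 = [(5, 5), (9, 9)]   # assumed from the solution's ((5+9)/2, (5+9)/2)

for name, cluster in [("C1", c1), ("C2", c2), ("C3", c3)]:
    print(name, centroid(cluster))
```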
_____________________________________________________________________________
Question 4:
In single-link clustering, the similarity of two clusters is the similarity of their most similar
members. What is the time complexity of the single-link clustering algorithm? (Note: n is
the number of data points)
A) O(n^2)
B) O(n^2 log n)
C) O(n^3 log n)
D) O(n^3)
Correct Answer: A
Detailed Solution: Refer to the lecture.
_____________________________________________________________________________
Question 5:
Point   x        y
p1      0.4005   0.5306
p2      0.2148   0.3854
p3      0.3457   0.3156
p4      0.2652   0.1875
p5      0.0789   0.4139
p6      0.4548   0.3022
p1 p2 p3 p4 p5 p6
A)
B)
C)
D)
Correct Answer: A
Detailed Solution: For the single link or MIN version of hierarchical clustering, the proximity
of two clusters is defined to be the minimum of the distances between any two points in the
different clusters. For instance, the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram. As another example, the
distance between clusters {3, 6} and {2, 5} is given by
dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.
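The single-link distances quoted above can be recomputed from the point coordinates. This is a minimal sketch assuming Euclidean distance and that the two numeric columns in the question are the x and y coordinates; the function name single_link is hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

points = {
    1: (0.4005, 0.5306),
    2: (0.2148, 0.3854),
    3: (0.3457, 0.3156),
    4: (0.2652, 0.1875),
    5: (0.0789, 0.4139),
    6: (0.4548, 0.3022),
}

def single_link(cluster_a, cluster_b):
    """Single-link (MIN) proximity: smallest pairwise distance across two clusters."""
    return min(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)

print(round(single_link([3], [6]), 2))        # height at which points 3 and 6 merge
print(round(single_link([3, 6], [2, 5]), 4))  # distance between {3, 6} and {2, 5}
```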
_____________________________________________________________________________
Question 6:
Is it possible that assignment of observations to clusters does not change between successive
iterations of K-means?
A) Yes
B) No
C) Can’t say
D) None of these
Correct Answer: A
Detailed Solution: Once K-means has converged to a global or local minimum, the assignment of
data points to clusters no longer changes in successive iterations.
____________________________________________________________________
Question 7:
Which of the following is not a clustering approach?
A) Hierarchical
B) Partitioning
C) Bagging
D) Density-Based
Correct Answer: C
Detailed Solution: Bagging is not a clustering technique.
_____________________________________________________________________________
Question 8:
In which of the following cases will K-Means clustering fail to give good results?
A) Data points with outliers
B) Data points with round shapes
C) Data points with non-convex shapes
D) Data points with different densities
Correct Answer: A, C, D
Detailed Solution: K-Means clustering algorithm fails to give good results when the data contains
outliers, the density spread of data points across the data space is different and the data points
follow non-convex shapes.
_____________________________________________________________________________
Question 9:
Given, A = {0,1,2,5,6} and B = {0,2,3,4,5,7,9}, calculate Jaccard Index of these two sets.
A) 0.50
B) 0.25
C) 0.33
D) 0.41
Correct Answer: C
Detailed Solution: A ⋂ B = {0, 2, 5} and A ⋃ B = {0, 1, 2, 3, 4, 5, 6, 7, 9}, so the
Jaccard Index 𝐽(𝐴, 𝐵) = |𝐴⋂𝐵| / |𝐴⋃𝐵| = 3/9 = 0.33
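The same calculation with Python sets, as a quick check (the function name jaccard is illustrative):

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
print(sorted(A & B), sorted(A | B), round(jaccard(A, B), 2))
```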
_____________________________________________________________________________
Question 10:
Which of the following statements is/are not true about k−means clustering?
Correct Answer: B
_____________________________________________________________________________
END