SMAI End 2015
INSTRUCTIONS
2. Please start each question on a separate page, indicating the question and sub-part
numbers clearly.
3. When a question is ambiguous, state so. Make reasonable assumptions and write
them clearly before you answer the question.
4. What property of the sigmoid activation function is important for the backpropagation (BP) learning algorithm? What other activation function has this property? A multilayer network of linear nodes can create a nonlinear decision boundary – TRUE OR FALSE? Briefly justify your answers.
5. What are the basic assumptions about the distribution of the data in the k-Means clustering algorithm? What parameters, if any, does the user specify in this algorithm?
6. In decision tree learning, when is a node “pure”? The attributes nearer to the leaf
nodes in a decision tree are considered important for the given classification problem.
TRUE OR FALSE? Briefly justify your answer.
7. Suppose we want to build a classifier with data sets having multiple features /
attributes. Can you suggest two ways of handling data items with missing feature
values?
8. Of the two algorithms, k-Nearest Neighbour and k-Means, which one is supervised
and which is unsupervised? Why?
10. Assume that the feature vector $\vec{x}$ for a given class $\omega_i$ follows a Normal density, that the features are statistically independent but with unknown mean, and that each feature has the same variance $\sigma^2$ for the two classes. What is the shape of the decision boundary (give a sketch), and what is the impact of the prior probabilities on the location of the boundary?
11. How are Maximum Likelihood Estimate (MLE) and Maximum A Posteriori (MAP)
estimates related to each other and when is the MAP estimate equivalent to MLE?
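As a notational anchor, the two estimators are conventionally written as

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, p(D \mid \theta), \qquad \hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta).$$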
12. You are given a 1-D dataset D = {0, 1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5}. Assuming that the data come from a Gaussian density, compute the maximum likelihood estimate of the Gaussian's parameters (mean and variance – both biased and unbiased estimates, as applicable).
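For a quick numerical check, here is a minimal Python sketch; NumPy's ddof argument switches between the biased (divide by n) and unbiased (divide by n − 1) variance estimates:

    # Quick numerical check of the MLE for Question 12's dataset.
    import numpy as np

    D = np.array([0, 1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5], dtype=float)

    mu_mle = D.mean()             # MLE of the mean: the sample average
    var_biased = D.var(ddof=0)    # biased MLE of the variance
    var_unbiased = D.var(ddof=1)  # unbiased sample variance

    print(mu_mle, var_biased, var_unbiased)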
13. Give a Multilayer Neural Network solution for the XOR problem. Verify your solution. What is the role of the hidden layer in this problem?
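For reference, one well-known 2-2-1 weight assignment (an illustrative choice; many settings work) can be verified mechanically in a few lines of Python:

    # A 2-2-1 step-network solution to XOR. Hidden unit h1 computes OR,
    # h2 computes AND, and the output fires for "OR but not AND".
    import itertools

    def step(v):
        return 1 if v > 0 else 0

    def xor_net(x1, x2):
        h1 = step(x1 + x2 - 0.5)    # OR
        h2 = step(x1 + x2 - 1.5)    # AND
        return step(h1 - h2 - 0.5)  # OR and not AND

    for x1, x2 in itertools.product([0, 1], repeat=2):
        assert xor_net(x1, x2) == (x1 ^ x2)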
14. What is the average out-of-sample error if the number of support vectors is 10 and the number of training samples is 1000? What do you expect will happen to the out-of-sample error if the number of support vectors increases 10-fold?
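This question presumably appeals to the standard leave-one-out bound relating the expected number of support vectors to generalization error:

$$E[E_{\text{out}}] \;\le\; \frac{E[\#\,\text{support vectors}]}{N}.$$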
PART B: Answer any 4 out of 5 (4 × 20 = 80 Marks)
1. Suppose that there is a student who decides whether or not to go for classes on any
given day based on the weather, sleeping time, and whether there is an interesting
class to attend. The data collected from 13 days are as shown in the table. Build a
decision tree based on these observations, using the ID3 algorithm with entropy impurity (Remember: Entropy $E = -\sum_j P(\omega_j)\,\log_2 P(\omega_j)$).
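As a refresher, a minimal Python sketch of the entropy and information-gain computations ID3 performs at each node; the rows and attribute indices are placeholders, since the data table itself is not reproduced here:

    # Minimal sketch of the quantities ID3 needs at each node.
    import math
    from collections import Counter

    def entropy(labels):
        """E = -sum_j P(w_j) log2 P(w_j) over the class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(rows, labels, attr):
        """Parent entropy minus the weighted entropy of the splits."""
        splits = {}
        for row, label in zip(rows, labels):
            splits.setdefault(row[attr], []).append(label)
        weighted = sum(len(s) / len(labels) * entropy(s)
                       for s in splits.values())
        return entropy(labels) - weighted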
2. Derive the following k-Means clustering algorithm update equations from first principles using the Expectation-Maximization framework. Clearly indicate all the assumptions and show all the steps.
E-Step: $E[z_{ij}] = \dfrac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$

M-Step: $\mu_j = \dfrac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$
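A minimal sketch of these updates as "soft" k-Means with a shared, fixed variance σ²; the values of k, σ, the iteration count, and the initialization are illustrative assumptions:

    # Soft k-Means: alternate the E and M updates above until the means settle.
    import numpy as np

    def soft_kmeans(x, k=2, sigma=1.0, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=k, replace=False)  # initial means
        for _ in range(iters):
            # E-step: responsibilities E[z_ij] from Gaussian likelihoods
            z = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma**2))
            z /= z.sum(axis=1, keepdims=True)
            # M-step: means as responsibility-weighted averages
            mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
        return mu

    print(soft_kmeans(np.array([0., 1., 1., 2., 8., 9., 9., 10.])))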
3. Derive the following learning rule for the weights from the input layer to the hidden layer: $\Delta w_{ji} = \eta\, f'(net_j) \left[\sum_{k=1}^{c} w_{kj}\, \delta_k\right] x_i$. What is the specific form of the rule if the hyperbolic tangent function $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ is used as the activation function in the hidden nodes? Show the graph of this activation function and specify its domain and range.
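For reference, tanh and its derivative $f'(x) = 1 - \tanh^2(x)$, which the hidden-layer rule would plug in for $f'(net_j)$, can be plotted with a short Python snippet:

    # Plot tanh and its derivative over a representative interval.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-4, 4, 200)
    plt.plot(x, np.tanh(x), label="f(x) = tanh(x)")          # range (-1, 1)
    plt.plot(x, 1 - np.tanh(x) ** 2, label="f'(x) = 1 - tanh(x)^2")
    plt.axhline(0, color="gray", linewidth=0.5)
    plt.legend()
    plt.show()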
4. Is the solution vector found by the Perceptron Learning algorithm unique? Why or why not? Now, given a hyperplane g(x) = 0, where $g(x) = w^t x + w_0$, answer the following:
5. A. Derive the dual form of the objective function for computing the maximum
margin linear classifier for a linearly separable set of training samples. Show
all the steps.
B. From the above, write down the expressions for the weight vector and bias terms
of the linear discriminant solution.
C. Also explain how you can identify support vectors from the formulation derived
in Question A.
D. For a cubic kernel $K(x, y) = (x \cdot y)^3$ that maps x in 2-D input space to a 4-D feature space, what is the computational saving when computing inner products in the input space (2-D) instead of the feature space (4-D)?
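A quick numerical sanity check of the kernel trick; the explicit 4-D map φ below is an assumption (the usual monomial map for this kernel on 2-D inputs, with √3 scaling on the cross terms):

    # Verify that the kernel equals the explicit 4-D inner product.
    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([x1**3,
                         np.sqrt(3) * x1**2 * x2,
                         np.sqrt(3) * x1 * x2**2,
                         x2**3])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])

    kernel_value = np.dot(x, y) ** 3         # cube of a 2-D inner product
    feature_value = np.dot(phi(x), phi(y))   # explicit 4-D inner product

    assert np.isclose(kernel_value, feature_value)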