Last name (CAPITALS):          First name (CAPITALS):          Andrew User ID (CAPITALS) (without the @andrew.cmu.edu bit):

15-781 Final Exam, Fall 2001

- You must answer any nine questions out of the following twelve. Each question is worth 11 points.
- You must fill out your name and your andrew userid clearly and in block capital letters on the front page. You will be awarded 1 point for doing this correctly.
- If you answer more than 9 questions, your best 9 scores will be used to derive your total.
- Unless the question asks for explanation, no explanation is required for any answer. But you are welcome to provide explanation if you wish.

1 Bayes Nets Inference

(a) Kangaroos.

[Bayes net: K -> A]
P(K) = 2/3
P(A|K) = 1/2
P(A|~K) = 1/10

Half of all kangaroos in the zoo are angry, and 2/3 of the zoo is comprised of kangaroos. Only 1 in 10 of the other animals are angry. What's the probability that a randomly-chosen animal is an angry kangaroo?

(b) Stupidity.

[Bayes net: S -> C]
P(S) = 0.5
P(C|S) = 0.5
P(C|~S) = 0.2

Half of all people are stupid. If you're stupid then you're more likely to be confused. A randomly-chosen person is confused. What's the chance they're stupid?

(c) Potatoes.

[Bayes net: B -> T -> L]
P(B) = 1/2
P(T|B) = 1/2
P(T|~B) = 1/20
P(L|T) = 1/2
P(L|~T) = 1/10

Half of all potatoes are big. A big potato is more likely to be tall. A tall potato is more likely to be lovable. What's the probability that a big lovable potato is tall?

(d) Final part.

[Bayes net diagram and probability tables largely illegible in the scan; the network involves variables including W, F and S, with entries such as P(W|S) = 1/2.]

What's P(W ∧ F)?

2 Bayes Nets and HMMs

Let nbs(m) = the number of possible Bayes Network graph structures using m attributes. (Note that two networks with the same structure but different probabilities in their tables do not count as different structures.) Which of the following statements is true?

[The candidate statements about nbs(m) are illegible in the scan.]

The notation I<X, Y, Z> means X is conditionally independent of Z given Y. Assuming the conventional assumptions and notation of Hidden Markov Models, in which q_t denotes the hidden state at time t and O_t denotes the observation at time t, which of the following are true of all HMMs? Write "True" or "False" next to each statement.

(i)-(iii) [conditional-independence statements among the hidden states q at nearby time steps; the exact subscripts are illegible in the scan]
(iv)-(vi) [the corresponding conditional-independence statements among the observations O; the exact subscripts are illegible in the scan]

3 Regression

(a) Consider the following data with one input and one output.

[Figure: scatter plot of Y (output) against X (input)]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w_0 + w_1 x)?

(ii) What is the mean squared test set error of running linear regression on this data, assuming the rightmost three points are in the test set, and the others are in the training set?

(iii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

(b) Consider the following data with one input and one output.

[Data table and scatter plot: three datapoints with X = 1, 2, 3; the Y values are only partially legible in the scan.]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w_0 + w_1 x)? (Hint: by symmetry it is clear that the best fit to the three datapoints is a horizontal line.)

(ii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

(c) Suppose we plan to do regression with the following basis functions:

[The figure and piecewise definitions of the basis functions are illegible in the scan, as is everything between here and part (b) of the Gaussian-mixtures question below.]
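As a reference for parts (a)(iii) and (b)(ii) of Question 3, here is a minimal Python sketch of the mean squared training error and the LOOCV error for the model y = w_0 + w_1 x. The small dataset below is a placeholder (the exam's actual datapoints are in the figures above), so the printed numbers are illustrative only.

```python
import numpy as np

def fit_line(x, y):
    # Least-squares fit of y = w0 + w1 * x.
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(x, y, w):
    # Mean squared error of the fitted line on (x, y).
    return np.mean((y - (w[0] + w[1] * x)) ** 2)

# Placeholder data standing in for the exam's figure.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

w = fit_line(x, y)
print("training MSE:", mse(x, y, w))

# Leave-one-out cross-validation: refit without point k, then test on point k.
loo_errors = []
for k in range(len(x)):
    mask = np.arange(len(x)) != k
    w_k = fit_line(x[mask], y[mask])
    loo_errors.append((y[k] - (w_k[0] + w_k[1] * x[k])) ** 2)
print("LOOCV MSE:", np.mean(loo_errors))
```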
(b) Using the same notation and the same assumptions, sketch a mixture of three distinct Gaussians that is stuck in a suboptimal configuration (i.e. in which infinitely many more iterations of the EM algorithm would remain in essentially the same suboptimal configuration). (You must not give an answer in which two or more Gaussians all have the same mean vectors; we are looking for an answer in which all the Gaussians have distinct mean vectors.)

[Figure: blank axes, Y (output) against X (input), for sketching]

(c) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians in the following, new, dataset.

[Figure: scatter plot of Y (output) against X (input)]

(d) Now, suppose we ran k-means with k = 2 on this dataset. Show the rough locations of the centers of the two clusters in the configuration with globally minimal distortion.

[Figure: the same scatter plot, X (input) on the horizontal axis]

6 Regression algorithms

For each empty box in the following table, write "Y" if the statement at the top of the column applies to the regression algorithm. Write "N" if the statement does not apply.

Column 1: No matter what the training data is, the predicted output is guaranteed to be a continuous function of the input (i.e. there are no discontinuities in the prediction). If a predictor gives continuous but undifferentiable predictions then you should answer "Y".

Column 2: The cost of training on a dataset with R records is at least O(R^2): quadratic (or worse) in R. For iterative algorithms marked with (*) simply consider the cost of one iteration of the algorithm through the data.

Algorithm                                                          | Column 1 | Column 2
Linear Regression                                                  |          |
Quadratic Regression                                               |          |
Perceptrons with sigmoid activation functions (*)                  |          |
1-hidden-layer Neural Nets with sigmoid activation functions (*)   |          |
1-nearest neighbor                                                 |          |
10-nearest neighbor                                                |          |
Kernel Regression                                                  |          |
Locally Weighted Regression                                        |          |
Radial Basis Function Regression with 100 Gaussian basis functions |          |
Regression Trees                                                   |          |
Cascade correlation (with sigmoid activation functions)            |          |
Multilinear interpolation                                          |          |
MARS                                                               |          |

7 Hidden Markov Models

Warning: this is a question that will take a few minutes if you really understand HMMs, but could take hours if you don't.

Assume we are working with this HMM:

[HMM diagram: three states S_1, S_2, S_3, with "Start here with prob. 1" marking S_1. The transition probabilities a_ij = P(q_{t+1} = S_j | q_t = S_i) are given in the diagram but are only partially legible in the scan (legible fragments include values of 1/2 and 0).]

The observation probabilities b_i(k) = P(O_t = k | q_t = S_i) are:

b_1(X) = 1/2   b_1(Y) = 1/2   b_1(Z) = 0
b_2(X) = 1/2   b_2(Y) = 0     b_2(Z) = 1/2
b_3(X) = 0     b_3(Y) = 1/2   b_3(Z) = 1/2

Suppose we have observed this sequence: X Z X Y Y Z Y Z Z (in long-hand: O_1 = X, O_2 = Z, O_3 = X, O_4 = Y, O_5 = Y, O_6 = Z, O_7 = Y, O_8 = Z, O_9 = Z).

Fill in this table with α_t(i) values, remembering the definition

α_t(i) = P(O_1 ∧ O_2 ∧ ... ∧ O_t ∧ q_t = S_i).

So for example, α_3(2) = P(O_1 = X ∧ O_2 = Z ∧ O_3 = X ∧ q_3 = S_2).

 t | α_t(1) | α_t(2) | α_t(3)
 1 |        |        |
 2 |        |        |
 3 |        |        |
 4 |        |        |
 5 |        |        |
 6 |        |        |
 7 |        |        |
 8 |        |        |
 9 |        |        |
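As a reference for this question, here is a minimal sketch of the forward (α) recursion in Python. The emission probabilities are the ones listed above; the transition matrix A below is a placeholder (the exam's actual a_ij values are only partially legible), so the printed numbers illustrate the recursion rather than the exam's answer.

```python
import numpy as np

# Forward (alpha) recursion: alpha_t(i) = P(O_1, ..., O_t, q_t = S_i).
# Start state is S_1 with probability 1, as in the exam's diagram.
pi = np.array([1.0, 0.0, 0.0])

# Placeholder transition matrix A[i, j] = a_ij = P(q_{t+1} = S_j | q_t = S_i);
# the exam's actual values are partially illegible, so these are assumed for illustration.
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])

# Emission probabilities b_i(k) = P(O_t = k | q_t = S_i), as listed above.
symbols = {"X": 0, "Y": 1, "Z": 2}
B = np.array([[0.5, 0.5, 0.0],   # state S_1
              [0.5, 0.0, 0.5],   # state S_2
              [0.0, 0.5, 0.5]])  # state S_3

obs = [symbols[c] for c in "XZXYYZYZZ"]

alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(O_1)
print(1, alpha)
for t, o in enumerate(obs[1:], start=2):
    alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(O_t)
    print(t, alpha)
```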
8 Locally Weighted Regression

Here's an argument made by a misguided practitioner of Locally Weighted Regression:

Suppose you have a dataset with R_1 training points and another dataset with R_2 test points. You must predict the output for each of the test points. If you use a kernel function that decays to zero beyond a certain kernel width, then Locally Weighted Regression is computationally cheaper than regular linear regression. This is because with locally weighted regression you must do the following for each query point in the test set:

- Find all the points that have non-zero weight for this particular query.
- Do a linear regression with them (after having weighted their contribution to the regression appropriately).
- Predict the value of the query.

whereas with regular linear regression you must do the following for each query point:

- Take all the training set datapoints.
- Do an unweighted linear regression with them.
- Predict the value of the query.

The locally weighted regression frequently finds itself doing regression on only a tiny fraction of the datapoints because most have zero weight. So most of the local method's queries are cheap to answer. In contrast, regular regression must use every single training point in every single prediction and so does at least as much work, and usually more.

This argument has a serious error. Even if it is true that the kernel function causes almost all points to have no weight for each LWR query, the argument is wrong. What is the error?

9 Nearest neighbor and cross-validation

At some point during this question you may find it useful to use the fact that if U and V are two independent real-valued random variables then Var[aU + bV] = a^2 Var[U] + b^2 Var[V].

Suppose you have 10,000 datapoints {(x_k, y_k) : k = 1, 2, ..., 10000}. Your dataset has one input and one output. The kth datapoint is generated by the following recipe:

    x_k = k/10000
    y_k ~ N(0, 2^2)

So that y_k is all noise: drawn from a Gaussian with mean 0 and variance σ^2 = 4 (and standard deviation σ = 2). Note that its value is independent of all the other y values. You are considering two learning algorithms:

- Algorithm NN: 1-nearest neighbor.
- Algorithm Zero: [definition illegible in the scan].

(a) What is the expected Mean Squared Training Error for Algorithm NN?

(b) What is the expected Mean Squared Training Error for Algorithm Zero?

(c) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm NN?

(d) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm Zero?

10 Neural Nets

(a) Suppose we are learning a 1-hidden-layer neural net with a sign-function activation:

    Sign(z) = 1   if z > 0
    Sign(z) = -1  if z < 0

[The network diagram, the weight values (w_11, w_12, ...), and the remainder of this question, including part (b), are not legible in the scan.]
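To make the sign-activation setup concrete, here is a minimal sketch of the forward pass of a 1-hidden-layer network with this activation. The weights below are placeholders (the exam's actual diagram and w values are illegible in the scan), so the printed output is illustrative only.

```python
import numpy as np

def sign(z):
    # Sign activation as defined above: +1 for z > 0, -1 for z < 0 (z == 0 maps to -1 here).
    return np.where(z > 0, 1.0, -1.0)

# Placeholder weights for a net with two inputs and two hidden units;
# the exam's actual w_11, w_12, ... values are illegible, so these are assumptions.
W_hidden = np.array([[ 1.0, -1.0],    # weights into hidden unit 1
                     [ 0.5,  2.0]])   # weights into hidden unit 2
w_out = np.array([1.0, -1.0])         # weights from hidden units to the output

def predict(x):
    h = sign(W_hidden @ x)            # hidden activations: Sign of each weighted input sum
    return sign(w_out @ h)            # output: Sign of the weighted sum of hidden activations

print(predict(np.array([0.3, -0.7])))
```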
