Machine Learning Assignment
Total marks: 20
Weight: 20% of final grade
Due date: Monday, April 22, 2013, 23:59

Submission mode
Portable Document Format (PDF); ASCII file for Python code. Please package multiple files into a .zip or .tar archive. Put your name and student ID at the top of each document. E-mail your submission to christfried.webers@nicta.com.au

Formula Explanations
All formulas which you derive need to be explained unless you use very common mathematical facts. Picture yourself explaining your arguments to somebody who is just learning about your assignment. In other words, do not assume that the person marking your assignment knows all the background and can therefore follow formulas written down without any explanation. It is your task to convince the reader that you know what you are doing when you derive an argument.
Code quality
Python code should be well structured, use meaningful identifiers for variables and subroutines, and provide sufficient comments. Please refer to the examples given in the tutorials.
Code efficiency
An efficient implementation of an algorithm uses fast subroutines provided by the language or by additional libraries. For the purpose of implementing Machine Learning algorithms in this course, that means using the appropriate data structures provided by Python and by numpy/scipy (e.g. linear algebra routines and random number generators).
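For example (a minimal sketch with illustrative names), computing all pairwise squared distances with numpy broadcasting avoids slow nested Python loops:

    import numpy as np

    def pairwise_sq_dists(A, B):
        # Squared Euclidean distances between all rows of A and all rows of B,
        # using |a - b|^2 = |a|^2 - 2 a.b + |b|^2 and numpy broadcasting.
        return (np.sum(A**2, axis=1)[:, None]
                - 2.0 * (A @ B.T)
                + np.sum(B**2, axis=1)[None, :])

    rng = np.random.default_rng(0)        # numpy's random number generator
    X = rng.normal(size=(150, 4))
    D = pairwise_sq_dists(X, X)           # one vectorised call, no Python loops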
Late Penalty
20% per day overdue (a day starts at midnight!)
Cooperation
All assignments must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (please see the ANU policies on Academic Honesty and Plagiarism, http://academichonesty.anu.edu.au). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to broadly discuss ideas, approaches and techniques with a few other students, but not at a level of detail where specific solutions or implementation issues are described by anyone. If you choose to consult with other students, you must include the names of your discussion partners for each solution. If you have any questions on this, please ask the lecturer before you act.
Solutions
To be presented in the tutorials.

1 Probabilities

1.1
Prove that the following holds for the variance of the sum of two random variables X and Y:

$$\operatorname{var}[X + Y] = \operatorname{var}[X] + \operatorname{var}[Y] + 2 \operatorname{cov}[X, Y],$$

where $\operatorname{cov}[X, Y]$ is the covariance between X and Y.
For each step in your proof, provide a verbal explanation of why the transformation holds.
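Before writing the proof, it can be reassuring to check the identity numerically; a minimal sketch with correlated samples (all names illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=100_000)
    Y = 0.5 * X + rng.normal(size=100_000)   # correlated with X, so cov[X, Y] != 0

    lhs = np.var(X + Y)
    rhs = np.var(X) + np.var(Y) + 2.0 * np.cov(X, Y, ddof=0)[0, 1]
    print(lhs, rhs)   # the two values agree up to floating-point error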
1.2
My neighbour has two children. Let us assume that the gender of a child is a binary variable (in reality it is not!) following a Bernoulli distribution with parameter 1/2, and that the genders of the children are iid.
1. How many girls/boys will my neighbour most likely have?
2. Suppose I ask him whether he has any girls, and he says yes. What is the probability
that one child is a boy?
3. Suppose instead that I happen to see one of his children in the garden, and it is a
girl. What is the probability that the other child is a boy?
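Your analytical answers to 2. and 3. can be cross-checked by simulation; a rough Monte Carlo sketch (0 = boy, 1 = girl; identifiers illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    kids = rng.integers(0, 2, size=(n, 2))   # each row: two children, 0 = boy, 1 = girl

    # Question 2: condition on the family having at least one girl.
    has_girl = kids.max(axis=1) == 1
    p_q2 = (kids[has_girl].min(axis=1) == 0).mean()   # P(a boy | at least one girl)

    # Question 3: a uniformly chosen child is observed and happens to be a girl.
    seen = rng.integers(0, 2, size=n)                 # index of the child we see
    seen_is_girl = kids[np.arange(n), seen] == 1
    other = kids[np.arange(n), 1 - seen]
    p_q3 = (other[seen_is_girl] == 0).mean()          # P(other is a boy | saw a girl)

    print(p_q2, p_q3)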
1.3
1.4

Assume that the lifetime of a machine is modelled by an exponential distribution with density

$$p(x \mid \lambda) = \lambda e^{-\lambda x}, \qquad x \geq 0, \quad \lambda > 0.$$
1. Calculate the maximum likelihood estimator (MLE) $\hat{\lambda}$ for a number of observed iid lifetimes $X_1, \ldots, X_N$.
2. Derive the relation between the mean of $X_1, \ldots, X_N$ and the maximum likelihood estimator $\hat{\lambda}$.
3. Suppose we observe $X_1 = 5$, $X_2 = 4$, $X_3 = 3$ and $X_4 = 4$ as the lifetimes (in years) of four different machines. What is the MLE given this data?
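You can verify your closed-form answer by maximising the log-likelihood numerically rather than analytically; a sketch assuming scipy is available (identifiers illustrative):

    import numpy as np
    from scipy.optimize import minimize_scalar

    data = np.array([5.0, 4.0, 3.0, 4.0])    # the observed lifetimes in years

    def neg_log_likelihood(lam):
        # log p(x | lam) = log(lam) - lam * x, summed over the iid observations
        return -(len(data) * np.log(lam) - lam * data.sum())

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method='bounded')
    print(res.x)   # should match the analytical MLE you derived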
2 Decision Theory

2.1 (2/20) Lower Bound for the Correct Classification
Consider two nonnegative numbers a and b. Show that, if $a \leq b$, then $a \leq (ab)^{1/2}$ holds.
Now, consider a two-class classification problem where the decision regions were chosen to minimise the probability of misclassification. Use the above inequality to prove that

$$p(\text{mistake}) \leq \int \sqrt{p(x, C_1)\, p(x, C_2)}\, dx.$$
Please explain each step in your derivation. If possible, provide the intuition behind a
step.
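For intuition, the bound can be checked numerically on a toy 1-D problem: with the optimal decision regions, p(mistake) is the integral of the pointwise minimum of the two joint densities, while the right-hand side integrates their pointwise geometric mean. A sketch with illustrative parameters, assuming scipy:

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-10.0, 10.0, 20001)
    # Toy joint densities p(x, C_k) = p(x | C_k) p(C_k) for two classes.
    p1 = 0.5 * norm.pdf(x, loc=-1.0, scale=1.0)
    p2 = 0.5 * norm.pdf(x, loc=+1.0, scale=1.0)

    p_mistake = np.trapz(np.minimum(p1, p2), x)   # optimal-decision misclassification
    bound = np.trapz(np.sqrt(p1 * p2), x)         # right-hand side of the inequality
    print(p_mistake, bound)                       # observe p_mistake <= bound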
3 Dimensionality Reduction

3.1 (5/20) Projection with Fisher's Discriminant
Fisher's Linear Discriminant finds a projection of the input data in which two goals are combined: maximising the distance between the centres of the points belonging to each class and minimising the variance of the data in each class.
We will investigate Fisher's Linear Discriminant using the Iris Flower data set in order to project the data from the original 4 dimensions into a lower dimension $D' < 4$.
1. Given the set of input data X and class labels t of K different classes, calculate the within-class and between-class scatter matrices $S_W$ and $S_B$ (see lecture slides). (Note that scatter matrices are not estimates of covariance matrices because they differ by a scaling factor related to the number of data points.) Find the matrix W with columns equal to the $D'$ normalised eigenvectors of $S_W^{-1} S_B$ which are associated with the $D'$ largest eigenvalues.
2. For $D' = 2$:
(a) Report the two eigenvalues and eigenvectors found.
(b) Provide a plot of the projected data using different colours for each class.
(c) Discuss the ratio of the two eigenvalues found with respect to the task of classifying the data in the projected 2-dimensional space.
3. Calculate the criterion

$$J = \operatorname{Tr}\{s_W^{-1} s_B\}$$

(see lecture slides) for the projected data $y_n = W^T x_n$, where

$$s_W = \sum_{k=1}^{K} \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T, \qquad s_B = \sum_{k=1}^{K} N_k\, (\mu_k - \mu)(\mu_k - \mu)^T,$$

$$\mu = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad \mu_k = \frac{1}{N_k} \sum_{n \in C_k} y_n,$$

and where $C_k$ is the set of indices of the data points belonging to class k, and $N_k$ is the number of data points belonging to class k.
4. Project the Iris data into $D' = 2$ dimensions using the W found in 2. and report the criterion J for this projection. Using the original data of the Iris Flower data set, report the criterion for all 2-dimensional orthogonal projections onto a plane spanned by a pair of axes out of the 4 axes of the data set. Compare all criteria J found and discuss the results.
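A minimal numpy sketch of the eigen-decomposition step, assuming X is an N x 4 data array and t an N-vector of labels in {0, 1, 2} (identifiers illustrative, not a complete solution):

    import numpy as np

    def fisher_projection(X, t, n_dims=2):
        # Scatter matrices in the original input space (see lecture slides).
        mu = X.mean(axis=0)
        SW = np.zeros((X.shape[1], X.shape[1]))
        SB = np.zeros_like(SW)
        for k in np.unique(t):
            Xk = X[t == k]
            mu_k = Xk.mean(axis=0)
            SW += (Xk - mu_k).T @ (Xk - mu_k)
            SB += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
        # Eigen-decomposition of SW^{-1} SB (not symmetric, hence np.linalg.eig).
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
        order = np.argsort(eigvals.real)[::-1][:n_dims]
        W = eigvecs[:, order].real
        W = W / np.linalg.norm(W, axis=0)   # normalise the eigenvector columns
        return eigvals.real[order], W

    # Projection into n_dims dimensions: Y = X @ W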
In the next problems, we will use the Iris flower data set available via the course web site
at
http://sml.nicta.com.au/isml13/assignments/bezdekIris.data.txt
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher (1936) as an example of discriminant analysis. (The file bezdekIris.data.txt corrects two erroneous entries in the original data set.)
The file consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris
virginica and Iris versicolor).
Sepal Length   Sepal Width   Petal Length   Petal Width   Species
5.1            3.5           1.4            0.2           Iris-setosa
...            ...           ...            ...           ...
7.0            3.2           4.7            1.4           Iris-versicolor
...            ...           ...            ...           ...
6.3            3.3           6.0            2.5           Iris-virginica
...            ...           ...            ...           ...
The first four comma-separated entries in this data set are the input data $x \in \mathbb{R}^4$ as floating point numbers; the fifth entry is a class name from the ordered set {Iris-setosa, Iris-versicolor, Iris-virginica} which you should map to {0, 1, 2}.
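One way to load the file into numpy arrays, assuming the comma-separated format described above (file name as given on the course web site; other identifiers illustrative):

    import numpy as np

    label_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

    X_rows, t_rows = [], []
    with open('bezdekIris.data.txt') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) != 5:              # skip blank or malformed lines
                continue
            X_rows.append([float(v) for v in parts[:4]])
            t_rows.append(label_map[parts[4]])

    X = np.array(X_rows)    # shape (150, 4): the four measurements
    t = np.array(t_rows)    # shape (150,): class labels 0, 1, 2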
4.1
The k-Nearest Neighbours Algorithm (k-NN) classifies a new point in the input space by the most frequent class amongst its k nearest neighbours provided as training data. If more than one class is the most frequent, the class is decided randomly. The neighbourhood of a point in input space is given by a metric, usually the Euclidean metric.
1. Implement the k-NN algorithm using a Euclidean distance for the Iris Flower data set.
2. Apply 2-fold, 5-fold and 10-fold Cross Validation to select the best k for the k-NN
algorithm.
In case of several k having the same lowest error rate, we pick the largest one.
Explain why this is a good strategy.
3. Report the results for k = 2, 4, ..., 38, 40.
4. The optimal errors decrease with the fold number. Explain why.
5. The optimal k increases with the fold number. Explain why.
6. Provide the listing of your program together with the solution.
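A compact sketch of the classifier and the cross-validation loop (identifiers illustrative; for simplicity this version breaks ties by the smallest class label rather than randomly):

    import numpy as np

    def knn_predict(X_train, t_train, X_test, k):
        # Majority vote among the k nearest training points (Euclidean metric).
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d2, axis=1)[:, :k]
        votes = t_train[nearest]
        return np.array([np.bincount(row).argmax() for row in votes])

    def cv_error(X, t, k, n_folds, seed=0):
        # Mean misclassification rate over n_folds cross-validation folds.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_folds)
        errs = []
        for fold in folds:
            train = np.ones(len(X), dtype=bool)
            train[fold] = False
            pred = knn_predict(X[train], t[train], X[fold], k)
            errs.append((pred != t[fold]).mean())
        return float(np.mean(errs))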