
Leren — Homework 1

Wessel van Dam 12961922

Deadline: 12:00 November 4th, 2019

1 Introduction
This is the first week’s assignment for Leren. This assignment covers Chapters 1, 2, and 4 from Alpaydin. Please take note of the following:

• You are expected to hand in your solutions in LaTeX;


• This problem set is an individual assignment;
• The deadline for this assignment is 12:00 November 4th, 2019.

2 Math Preliminaries
Machine Learning uses many of the topics covered in Linear Algebra, Calculus and Bayesian Statistics. This first assignment is included to refresh your memory of some of these topics. Please show your steps for all exercises.

2.1 Linear Algebra


For this assignment, let
\[
A = \begin{pmatrix} 0 & 5 & -3 \\ 4 & -1 & 2 \\ -1 & 1 & -1 \end{pmatrix}, \qquad
A^{-1} = \begin{pmatrix} -1 & 2 & 7 \\ 2 & -3 & -12 \\ 3 & -5 & -20 \end{pmatrix}, \qquad
b = \begin{pmatrix} 2 \\ 1 \\ -1 \end{pmatrix}.
\]

(a) Compute $Ab$, $b^T A$ and $b^T A b$.


SOLUTION: First of all, thank you for sharing the .tex file of the assignment; it is highly appreciated! Second, I do not know how detailed my solutions have to be. I am a third-year Astronomy student at Leiden University, so I feel quite at home with Linear Algebra and might not show as many intermediate steps as required. Feedback is appreciated! Now, we have:
\[
Ab = \begin{pmatrix} 0 & 5 & -3 \\ 4 & -1 & 2 \\ -1 & 1 & -1 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 8 \\ 5 \\ 0 \end{pmatrix}
\]
\[
b^T A = \begin{pmatrix} 2 & 1 & -1 \end{pmatrix} \begin{pmatrix} 0 & 5 & -3 \\ 4 & -1 & 2 \\ -1 & 1 & -1 \end{pmatrix} = \begin{pmatrix} 5 & 8 & -3 \end{pmatrix}
\]
\[
b^T A b = \begin{pmatrix} 2 & 1 & -1 \end{pmatrix} \begin{pmatrix} 8 \\ 5 \\ 0 \end{pmatrix} = 16 + 5 - 0 = 21
\]
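As a quick sanity check (my own addition, not required by the assignment), these products can be verified numerically with NumPy:

```python
import numpy as np

# Matrices from the assignment
A = np.array([[ 0,  5, -3],
              [ 4, -1,  2],
              [-1,  1, -1]])
b = np.array([2, 1, -1])

print(A @ b)      # [8 5 0]
print(b @ A)      # [ 5  8 -3]
print(b @ A @ b)  # 21
```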

(b) Find $c$ such that $Ac = b$. HINT: use $A^{-1}$.
SOLUTION: We make use of the fact that the product of a matrix with its own inverse is the identity matrix:

\[
Ac = b \;\longrightarrow\; A^{-1} A c = A^{-1} b \;\longrightarrow\; c = A^{-1} b
\]
\[
c = \begin{pmatrix} -1 & 2 & 7 \\ 2 & -3 & -12 \\ 3 & -5 & -20 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ -1 \end{pmatrix} = \begin{pmatrix} -7 \\ 13 \\ 21 \end{pmatrix}
\]
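This result can be cross-checked numerically (again my own sketch; np.linalg.solve avoids forming the inverse explicitly):

```python
import numpy as np

A = np.array([[ 0,  5, -3],
              [ 4, -1,  2],
              [-1,  1, -1]])
A_inv = np.array([[-1,  2,   7],
                  [ 2, -3, -12],
                  [ 3, -5, -20]])
b = np.array([2, 1, -1])

print(A_inv @ b)              # [-7 13 21]
print(np.linalg.solve(A, b))  # same result, without using the inverse
```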

2.2 Calculus
Compute the (partial) derivative of the following functions with respect to $x$:
1. $f(x) = (3x^4 + 7x^2)^{-3}$
SOLUTION:
\[
\frac{\partial f}{\partial x} = -3(3x^4 + 7x^2)^{-4} \cdot \frac{\partial}{\partial x}(3x^4 + 7x^2) = -3(3x^4 + 7x^2)^{-4} \cdot (12x^3 + 14x)
\]
\[
\frac{\partial f}{\partial x} = -6x(6x^2 + 7)\left(x^2(3x^2 + 7)\right)^{-4} = -\frac{6(6x^2 + 7)}{x^7(3x^2 + 7)^4}
\]

2. $h(x) = x \ln(\sqrt{x})$
SOLUTION:
\[
\frac{\partial h}{\partial x} = x \cdot \frac{\partial}{\partial x}\left(\ln \sqrt{x}\right) + 1 \cdot \ln \sqrt{x} = x \cdot \frac{1}{\sqrt{x}} \cdot \frac{\partial}{\partial x}\sqrt{x} + \ln \sqrt{x} = \frac{x}{\sqrt{x}} \cdot \frac{1}{2\sqrt{x}} + \ln \sqrt{x} = \frac{1}{2} + \frac{1}{2}\ln x = \frac{1}{2}(1 + \ln x)
\]

3. $h(x, y) = \exp(2xy + y^x)$
SOLUTION:
\[
\frac{\partial h}{\partial x} = \exp(2xy + y^x) \cdot \frac{\partial}{\partial x}(2xy + y^x) = (2y + y^x \ln y)\exp(2xy + y^x)
\]
4. $k(x; y_1, \ldots, y_L) = \ln\left(\prod_{i=1}^{L} \exp\!\left((y_i - x)^2\right)\right)$
SOLUTION:
\[
\frac{\partial k}{\partial x} = \frac{\partial}{\partial x} \ln\left(\prod_{i=1}^{L} \exp\!\left((y_i - x)^2\right)\right) = \frac{\partial}{\partial x} \sum_{i=1}^{L} \ln\left(\exp\!\left((y_i - x)^2\right)\right) = \frac{\partial}{\partial x} \sum_{i=1}^{L} (y_i - x)^2
\]
\[
\frac{\partial k}{\partial x} = \sum_{i=1}^{L} \frac{\partial}{\partial x}(y_i - x)^2 = \sum_{i=1}^{L} -2(y_i - x) = 2Lx - 2\sum_{i=1}^{L} y_i
\]
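These derivatives can be double-checked symbolically with SymPy (an optional sketch of my own, not something the assignment asks for; problem 4 is checked for a concrete $L = 3$):

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)

# 1. f(x) = (3x^4 + 7x^2)^(-3)
f = (3*x**4 + 7*x**2)**(-3)
expected_f = -6*(6*x**2 + 7) / (x**7 * (3*x**2 + 7)**4)
print(sp.simplify(sp.diff(f, x) - expected_f))   # 0

# 2. h(x) = x * ln(sqrt(x))
h = x * sp.log(sp.sqrt(x))
print(sp.simplify(sp.diff(h, x) - (1 + sp.log(x)) / 2))  # 0

# 3. h(x, y) = exp(2xy + y^x), partial derivative w.r.t. x
h2 = sp.exp(2*x*y + y**x)
print(sp.simplify(sp.diff(h2, x) - (2*y + y**x*sp.log(y))*sp.exp(2*x*y + y**x)))  # 0

# 4. k(x; y_1, ..., y_L) checked for a concrete L = 3
ys = sp.symbols('y1 y2 y3', real=True)
k = sp.log(sp.Mul(*[sp.exp((yi - x)**2) for yi in ys]))
print(sp.simplify(sp.diff(k, x) - (2*3*x - 2*sum(ys))))  # 0, i.e. 2Lx - 2*sum(y_i)
```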

3 Machine Learning Applications


Alpaydin discusses various machine learning applications in Chapter 1. In these questions we explore the differences between regression and classification and between supervised and unsupervised Machine Learning problems.

(a) Explain in your own words the difference between a regression problem and a classification
problem.
SOLUTION: A classification problem is a problem in which the input is analysed by way of a discriminant, which sorts the input into distinct classes. As such, the output variable is discrete. By contrast, a regression problem refers to a problem in which input data is analysed to predict a continuous output variable.

(b) Explain in your own words the difference between supervised and unsupervised machine
learning.
SOLUTION: In supervised learning, the machine receives input data and is already given the correct output values in advance (by a so-called supervisor, hence the name). The machine’s purpose is then solely to find out why input x1 has output y1 and x2 has output y2. It uses this knowledge of inputs and corresponding outputs to discern which properties of the input indicate a specific output. In unsupervised learning, on the other hand, the machine does not have the output values, and relies solely on the input to find regularities and patterns.

(c) For each of the following problems, explain if the problem is a supervised or unsupervised Machine Learning problem. Moreover, for all supervised problems, explain if it is a regression or a classification task.

(1) In Radio Astronomy, many terabytes of data are observed on a daily basis. Hence, it is
no longer possible for a group of radio astronomers to inspect every observation. Instead,
Machine Learning may be used to automatically find the anomalies, i.e., observations
that are different from all other observations.
SOLUTION: This is a case of supervised learning. At first, a machine is given the
already inspected data to approximate the mapping function, i.e. observations that
are already classified on an anomaly/not anomaly basis are given to the machine.
Once the machine has approximated the actual ‘mapping function’ well enough, it is
given the new data. It is a classification task, since the machine merely has to find
the anomalies (so the classes are ‘anomaly’ and ‘not an anomaly’).
(2) Predicting the price of second-hand cars based on historical data from a used-car marketplace website.
SOLUTION: This, too, is a case of supervised learning: known inputs and outputs from a used-car marketplace website are used to approximate the mapping function, after which new second-hand cars are analysed. Since the price of a car is a continuous variable instead of a set of classes, this problem is a regression task.
(3) In medical imaging, an important task is the segmentation of healthy and unhealthy
tissue in an image. You are given a dataset that contains 1,000 images of tissue and for
each image five experts have annotated the unhealthy cells. The task of segmentation is
to make a prediction for each pixel in the input image whether it is healthy or unhealthy
tissue.
SOLUTION: This problem, too, deals with supervised learning, since here the experts act as the supervisors that have annotated unhealthy cells, and therefore have provided the output values for the 1,000 images that are the input. Obviously, this is a classification task, since each cell has to be classified as either healthy or unhealthy.
(4) Grouping/finding similar users on a music streaming service, like Spotify, based on a
dataset of users and the tracks each played in the last 24 hours.

SOLUTION: This is an unsupervised problem, since the machine is only given input (the tracks each user played in the last 24 hours) and has to find similar users without any output values to learn from.

4 Chapter 4: Parametric Methods


4.1 Bias/Variance
(a) We have some model g(x) that has a very high bias for some task we are trying to learn.
What would you expect the training and validation errors to be like?
SOLUTION: You would expect both to be large due to the very high bias. A high bias means that the model is too simple for the data and therefore underfits, causing large errors, especially on the validation set (since the model was fitted on the training data, the errors will likely be slightly smaller there).

(b) We also consider some other model, h(x), that has a very high variance for that same task. What would you expect the training and validation errors to be like in this case?
SOLUTION: You would expect the training error to be small, since the model was
fitted on those data points. However, when used on new, validation data, models with
a very high variance perform poorly because the random noise in the training data
is taken into account too much, a process known as overfitting. The predicted values
tend to be far from the actual points because of this overfitting.
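To make this concrete, here is a small illustrative sketch of my own (not part of the assignment): a too-simple and a too-flexible polynomial are fitted to the same noisy data, and the training and validation errors behave as described above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Toy data: a noisy sine curve, split into training and validation sets
x = np.sort(rng.uniform(0, 2 * np.pi, 40))
r = np.sin(x) + rng.normal(0, 0.2, x.size)
x_tr, r_tr, x_va, r_va = x[::2], r[::2], x[1::2], r[1::2]

def errors(degree):
    p = Polynomial.fit(x_tr, r_tr, degree)   # least-squares polynomial fit
    return (np.mean((p(x_tr) - r_tr) ** 2),  # training error
            np.mean((p(x_va) - r_va) ** 2))  # validation error

print("high bias (degree 1):     ", errors(1))   # typically both errors are large (underfitting)
print("high variance (degree 15):", errors(15))  # typically small training error, larger validation error (overfitting)
```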

(c) Suppose we use a massive dataset for training both models. Which one is likely to give a better performance? Will the model that performs worse overfit or underfit?
SOLUTION: The model with high variance is likely to give the better performance: the damage done by its variance shrinks as the dataset grows, since the estimator is consistent and its variance tends to zero for larger datasets, as learnt in the Bayesian Statistics for Machine Learning course. The model that performs worse is therefore the high-bias model, and it will underfit: it is too simple for the data, and adding more data cannot fix that.

5 Regression
In this exercise we will first derive a closed form solution for the parameters of a linear regression model using some calculus, and then we will do the same using tools from linear algebra. Assume a dataset $D = \{(x^1, r^1), \ldots, (x^N, r^N)\}$, where $x^i \in \mathbb{R}^d$, with $d$ being the dimensionality of $X$, and $r^i \in \mathbb{R}$. In Equation 2.14 in Alpaydin, the functional form for linear regression is given as:

\[
g(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + w_0 = w_0 + \sum_{j=1}^{d} w_j x_j .
\]

Note that the vector w defines the hypothesis function g(x) and represents the parameters
that should be learned from the data. The corresponding empirical error on the training set X
is given as:

\[
E(g|X) = \sum_{i=1}^{N} \left(r^i - g(x^i)\right)^2 = \sum_{i=1}^{N} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)^2 \qquad (1)
\]

Using the error on the entire dataset it is possible to determine the partial derivative with respect to each of the weights. The partial derivative for a single weight combines the contributions of all of the datapoints and thus gives the direction and magnitude of the correction required to get the optimal weight: the weight that minimizes the combined error over all datapoints.
Given this, answer the following questions:

(a) Compute $\partial E(g|X)/\partial w_0$ (the partial derivative of $E$ w.r.t. $w_0$)


SOLUTION:
\[
\frac{\partial E(g|X)}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_{i=1}^{N} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)^2 = \sum_{i=1}^{N} \frac{\partial}{\partial w_0} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)^2
\]

By way of the chain rule:

\[
\frac{\partial E(g|X)}{\partial w_0} = -2 \sum_{i=1}^{N} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)
\]

(b) Compute $\partial E(g|X)/\partial w_k$ with $k \geq 1$ (the partial derivative of $E$ w.r.t. one specific $w$, excluding $w_0$)
SOLUTION: We start in a similar way as above:
\[
\frac{\partial E(g|X)}{\partial w_k} = \frac{\partial}{\partial w_k} \sum_{i=1}^{N} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)^2 = \sum_{i=1}^{N} \frac{\partial}{\partial w_k} \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)^2
\]
\[
\frac{\partial E(g|X)}{\partial w_k} = -2 \sum_{i=1}^{N} x_k^i \left(r^i - \Big(w_0 + \sum_{j=1}^{d} w_j x_j^i\Big)\right)
\]
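As an optional sketch of my own (on synthetic data, not required by the assignment), both partial derivatives can be checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X_raw = rng.normal(size=(N, d))   # the x^i vectors, without the bias term
r = rng.normal(size=N)
w0, w = 0.5, rng.normal(size=d)

def E(w0, w):
    """Sum of squared errors, as in Equation (1)."""
    return np.sum((r - (w0 + X_raw @ w)) ** 2)

# Analytic gradients from (a) and (b)
resid = r - (w0 + X_raw @ w)
grad_w0 = -2 * np.sum(resid)
grad_wk = -2 * X_raw.T @ resid    # one entry per w_k, k = 1..d

# Finite-difference comparison
eps = 1e-6
print(np.isclose(grad_w0, (E(w0 + eps, w) - E(w0 - eps, w)) / (2 * eps)))
for k in range(d):
    dw = np.zeros(d); dw[k] = eps
    print(np.isclose(grad_wk[k], (E(w0, w + dw) - E(w0, w - dw)) / (2 * eps)))
```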

Calculating each of the weights separately can become a bit tedious for bigger applications.
However, the process can be simplified through vectorization by defining a new vector $\hat{x} = [1, x^T]^T = [1, x_1, x_2, \ldots, x_d]^T$:

\[
\hat{g}(\hat{x}) = \sum_{j=0}^{d} w_j \hat{x}_j = w^T \hat{x}
\]

And since...

\[
\hat{g}(\hat{x}) = \sum_{j=0}^{d} w_j \hat{x}_j = w_0 \cdot \hat{x}_0 + \sum_{j=1}^{d} w_j \hat{x}_j = w_0 + \sum_{j=1}^{d} w_j x_j = g(x)
\]

...we can obtain a prediction $\hat{r}^i = \hat{g}(\hat{x}^i) = w^T \hat{x}^i$; an inner product between our parameter vector $w$ and the data vector $\hat{x}^i$. Our error function would now look like this:

\[
E(w|X, r) = \sum_{i=1}^{N} (r^i - \hat{r}^i)^2 = \sum_{i=1}^{N} \left(r^i - \hat{g}(\hat{x}^i)\right)^2 = \sum_{i=1}^{N} \left(r^i - w^T \hat{x}^i\right)^2 = E(g|X)
\]

We can also obtain a prediction for the complete dataset using $\hat{r} = Xw$, where $X$ is the collection of all data vectors $x^i$, or:

\[
X = \begin{pmatrix}
1 & x_1^{(1)} & \cdots & x_d^{(1)} \\
1 & x_1^{(2)} & \cdots & x_d^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(N)} & \cdots & x_d^{(N)}
\end{pmatrix}
\]

Now that we have a way of making many predictions using a single matrix-vector product,
we can also define the sum of squared errors function as an inner product, i.e.:

\[
E(w|X, r) = (r - Xw)^T (r - Xw) \qquad (2)
\]

where $r = [r^1, \ldots, r^N]^T$.
Given this, answer the following questions:

(c) Show that $(r - Xw)^T (r - Xw)$ is equivalent to $\sum_{i=1}^{N} (r^i - w^T \hat{x}^i)^2$. Remember that $a^T a = \sum_{i=1}^{N} (a^i)^2$ if $a = [a^1, \ldots, a^N]^T$.
SOLUTION: The $i$-th component of $(r - Xw)$ is given by:
\[
(r - Xw)_i = r^i - \begin{pmatrix} 1 & x_1^i & x_2^i & \cdots & x_d^i \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} = r^i - (\hat{x}^i)^T w = r^i - w^T \hat{x}^i
\]
Then, we can use the given rule $a^T a = \sum_{i=1}^{N} (a^i)^2$ if $a = [a^1, \ldots, a^N]^T$, and we set $a = r - Xw$:
\[
(r - Xw)^T (r - Xw) = \sum_{i=1}^{N} \big((r - Xw)_i\big)^2 = \sum_{i=1}^{N} \left(r^i - w^T \hat{x}^i\right)^2
\]
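A quick numerical confirmation of this equivalence on random data (my own sketch, not required by the assignment):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 20, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # design matrix with bias column
r = rng.normal(size=N)
w = rng.normal(size=d + 1)

vectorized = (r - X @ w) @ (r - X @ w)                   # (r - Xw)^T (r - Xw)
summed = sum((r[i] - w @ X[i]) ** 2 for i in range(N))   # sum_i (r^i - w^T x_hat^i)^2
print(np.isclose(vectorized, summed))                    # True
```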

(d) Distribute the terms in Equation 2 (e.g., $a(b+c) = ab+ac$) and simplify as much as possible. Hint: The result of the inner product is a scalar (the error on the entire data set), so you may treat the equation, or parts of it, as a scalar function, when simplifying.
SOLUTION:
\[
\begin{aligned}
(r - Xw)^T (r - Xw) &= \left(r^T - (Xw)^T\right)(r - Xw) \\
&= r^T (r - Xw) - (Xw)^T (r - Xw) \\
&= r^T r - r^T X w - (Xw)^T r + (Xw)^T (Xw) \\
&= \|r\|^2 + \|Xw\|^2 - 2\, r^T X w
\end{aligned}
\]
where the last step uses that $r^T X w = (Xw)^T r$, since both are the same scalar.

(e) Compute $\partial E(w|X, r)/\partial w$ (the partial derivative of $E$ w.r.t. $w$). For this, you should use the rules listed on the Vector & Matrix Calculus sheet in the syllabus.
SOLUTION: The differentiation rules from the syllabus are:
\[
\frac{d}{dx}\left(b^T x\right) = \frac{d}{dx}\left(x^T b\right) = b, \qquad
\frac{d}{dx}\left(x^T x\right) = 2x, \qquad
\frac{d}{dx}\left(x^T A x\right) = \left(A + A^T\right) x
\]

Furthermore, it is worthwhile to note:

\[
(ABC \ldots)^T = \ldots C^T B^T A^T
\]

Working out the transposed terms and differentiating according to the rules, we get:

\[
\begin{aligned}
\frac{\partial E(w|X, r)}{\partial w} &= \frac{\partial}{\partial w}\left(r^T r - r^T X w - (Xw)^T r + (Xw)^T (Xw)\right) \\
&= -\frac{\partial}{\partial w}\left(r^T X w\right) - \frac{\partial}{\partial w}\left((Xw)^T r\right) + \frac{\partial}{\partial w}\left(w^T X^T X w\right) \\
&= -X^T r - X^T r + \left((X^T X)^T + X^T X\right) w
\end{aligned}
\]

Here we used that $r^T X w = (X^T r)^T w$ and $(Xw)^T r = w^T X^T r$, so both derivatives equal $X^T r$ by the first rule. Using the fact that
\[
(X^T X)^T = X^T (X^T)^T = X^T X,
\]
we see that:
\[
\frac{\partial E(w|X, r)}{\partial w} = -2 X^T r + 2 (X^T X) w
\]
(f) Set $\partial E(w|X, r)/\partial w$ equal to zero and solve for $w$.
SOLUTION:
\[
-2 X^T r + 2 (X^T X) w = 0 \;\longrightarrow\; (X^T X) w = X^T r \;\longrightarrow\; w = (X^T X)^{-1} X^T r
\]
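To see the closed-form solution in action, here is a small sketch of my own on synthetic data (in practice np.linalg.solve or lstsq is preferred over explicitly inverting $X^T X$):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # design matrix with bias column
true_w = np.array([1.0, -2.0, 0.5, 3.0])
r = X @ true_w + rng.normal(0, 0.1, N)                     # noisy targets

# Normal-equation solution from (f): w = (X^T X)^{-1} X^T r
w_hat = np.linalg.inv(X.T @ X) @ X.T @ r
print(w_hat)                                  # close to true_w

# Cross-check with NumPy's built-in least-squares solver
print(np.linalg.lstsq(X, r, rcond=None)[0])   # should agree with w_hat
```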

(g) Explain in your own words what the result of the equation in (f) is and what problem this solves.

SOLUTION: The above equation yields the best weights for the model, i.e. the weights that minimise the combined error over all data points. It gives a closed-form solution to the linear regression problem: the optimal parameters can be computed directly from the data with a single, fairly simple expression.
