Machine Learning Lecture 1
Iain Styles
7 October 2019
Introduction
• Objective functions;
• Over/underfitting;
• Regularisation;
• Model capacity;
• Bias vs variance;
• Cross-validation;
• Probabilistic reasoning.
Linear regression
We are given a dataset

\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\} = \{(x_i, y_i)\}_{i=1}^{N}   (1)

and model each observation as

y_i = f(\mathbf{w}, x_i) + e,   (2)
where e is a random number drawn from some continuous prob-
ability density function that depends on the particular properties
of the observation process. We will revisit the implications of this
when we consider regression from a probabilistic perspective.
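To make this data model concrete, the following sketch (not taken from these notes or the accompanying notebook) generates a toy dataset of the form y_i = f(w, x_i) + e. The straight-line choice of f, the Gaussian form of the noise, and all numerical values are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: f is a straight line and e is zero-mean Gaussian noise.
w_true = np.array([1.0, 2.0])                     # intercept and slope, chosen for illustration
x = np.linspace(0.0, 1.0, 20)                     # observation locations x_i
e = rng.normal(loc=0.0, scale=0.1, size=x.shape)  # the random term e in Equation (2)
y = w_true[0] + w_true[1] * x + e                 # noisy observations y_i = f(w, x_i) + e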
We will first consider a simple way to approach regression by
treating it as an optimisation problem in which the objective is to
find the value of w (denoted w∗ ) that minimises some "loss", or
objective function L(w).
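As a small illustration of what such an objective function can look like (the specific loss used later in these notes is the sum of squared residuals), here is a hedged sketch; the straight-line model and the data values are assumptions made purely for this example.

import numpy as np

def loss(w, x, y):
    # Sum of squared residuals between the observations y and the predictions
    # of an assumed straight-line model f(w, x) = w[0] + w[1] * x.
    residuals = y - (w[0] + w[1] * x)
    return np.sum(residuals ** 2)

x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x                            # noise-free data, for illustration only
print(loss(np.array([1.0, 2.0]), x, y))      # essentially zero: good parameter values
print(loss(np.array([0.0, 0.0]), x, y))      # much larger: poor parameter values

Finding w∗ then amounts to searching for the value of w at which this function is smallest.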
We restrict our attention to models that are linear in the parameters, written as a weighted sum of M basis functions φ_i(x):

f(\mathbf{w}, x) = w_0 \phi_0(x) + \cdots + w_{M-1} \phi_{M-1}(x) = \sum_{i=0}^{M-1} w_i \phi_i(x).   (7)

Evaluating the model at every data point and collecting the results into a vector, this can be written compactly as

\mathbf{f}(\mathbf{w}) = \Phi \mathbf{w},   (8)
where the dependency on x is now absorbed into the components of f, and the design matrix has components Φ_{ij} = φ_j(x_i). It is important to note the order of the indices in this definition: each row (indexed by i) corresponds to a single data point, whilst each column (indexed by j) corresponds to a basis function. As an example, for a simple quadratic model f(w, x) = w_0 + w_1 x + w_2 x^2 with basis functions x^0, x^1, x^2 = 1, x, x^2, we have
\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 \end{pmatrix}   (9)
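As a concrete companion to Equation (9), the sketch below builds the design matrix for a polynomial basis; the helper name design_matrix and the sample x values are not from the notes and are assumed for illustration.

import numpy as np

def design_matrix(x, M):
    # Phi[i, j] = phi_j(x_i) = x_i ** j for the polynomial basis 1, x, x^2, ...
    x = np.asarray(x, dtype=float)
    return np.stack([x ** j for j in range(M)], axis=1)

x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = design_matrix(x, M=3)
print(Phi)   # each row is [1, x_i, x_i^2]: rows index data points, columns index basis functions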
Having restricted ourselves to linear models, we can begin to
solve the optimisation problem posed by Equation (3). The residu-
als defined by Equation (4) can be written as
\mathbf{r} = \mathbf{y} - \Phi\mathbf{w},   (10)
so that the least-squares loss is L_LSE(w) = r^T r, with derivative

\frac{\partial L_{\mathrm{LSE}}(\mathbf{w})}{\partial \mathbf{w}} = -2\,\Phi^{\mathrm{T}} (\mathbf{y} - \Phi\mathbf{w}).   (12)
To understand how we obtain this result, let us break the calculation down step by step. We first compute the derivative of r with respect to the components of w. Since

r_i = y_i - \sum_j \Phi_{ij} w_j,   (13)

we have

\frac{\partial r_i}{\partial w_k} = -\Phi_{ik}.   (14)
We also note that, since L_LSE = r^T r = \sum_l r_l^2,

\frac{\partial L_{\mathrm{LSE}}}{\partial r_l} = 2 r_l.   (15)

The chain rule then gives

\frac{\partial L_{\mathrm{LSE}}}{\partial w_k} = \sum_l \frac{\partial L_{\mathrm{LSE}}}{\partial r_l} \times \frac{\partial r_l}{\partial w_k}   (16)

= -\sum_l 2 r_l \Phi_{lk}.   (17)

Collecting the components into vector form,

\frac{\partial L_{\mathrm{LSE}}}{\partial w_k} = \sum_l -2 r_l \Phi_{lk} = -2 \sum_l \Phi^{\mathrm{T}}_{kl} r_l \;\rightarrow\; \frac{\partial L_{\mathrm{LSE}}}{\partial \mathbf{w}} = -2\,\Phi^{\mathrm{T}} \mathbf{r} = -2\,\Phi^{\mathrm{T}} (\mathbf{y} - \Phi\mathbf{w}).   (18)
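As a sanity check on Equations (12) and (18), the following sketch (not part of the notes) compares the analytic gradient -2 Φ^T (y - Φw) with a central finite-difference approximation of the loss on randomly generated data; all sizes and values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3
x = np.linspace(0.0, 1.0, N)
Phi = np.stack([x ** j for j in range(M)], axis=1)    # polynomial design matrix
y = rng.normal(size=N)                                # arbitrary targets for the check
w = rng.normal(size=M)                                # arbitrary point at which to test

def L_LSE(w):
    r = y - Phi @ w
    return r @ r                                      # L_LSE = r^T r

analytic = -2.0 * Phi.T @ (y - Phi @ w)               # Equation (18)

eps = 1e-6
numeric = np.array([(L_LSE(w + eps * e) - L_LSE(w - eps * e)) / (2 * eps)
                    for e in np.eye(M)])              # central differences along each w_k

print(np.allclose(analytic, numeric, atol=1e-5))      # True if the derivation is correct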
Finally, we set the result to zero to obtain
\Phi^{\mathrm{T}} \mathbf{y} - \Phi^{\mathrm{T}}\Phi\,\mathbf{w}^* = 0.   (19)
This result is known as the normal equations: a set of simultaneous linear equations that we can solve for w^*. A naïve way to do this is to evaluate w^* = (Φ^T Φ)^{-1} Φ^T y, but numerical inversion of matrices can be troublesome, especially if the matrix is large (in this case, large M), and this is best avoided. It is therefore usual to solve the normal equations directly (e.g. using Gaussian elimination).
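The sketch below (again not from the notes) contrasts the naïve explicit inverse with solving the normal equations directly via np.linalg.solve, which is based on Gaussian elimination (LU factorisation); np.linalg.lstsq is shown as a further alternative that avoids forming Φ^T Φ altogether. The data-generating function and noise level are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 4
x = np.linspace(0.0, 1.0, N)
Phi = np.stack([x ** j for j in range(M)], axis=1)         # polynomial design matrix
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)  # illustrative noisy targets

# Naive route: explicit inversion of Phi^T Phi (best avoided).
w_inv = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

# Usual route: solve the normal equations Phi^T Phi w = Phi^T y directly.
w_solve = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Library route: least squares on Phi itself, without forming Phi^T Phi.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))

All three routes agree on this small, well-conditioned problem; the differences matter for large or ill-conditioned design matrices.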
This set of mathematical procedures comprises a method by which the parameters of some model can be learned from data. This is the
very core of what machine learning is about. Although we have
set the general form of the model, it is from the data that we learn
what its precise form is.
Let us work through a simple example. This can be found in the accompanying notebook, which can be accessed at https://colab.research.google.com/drive/1sHZqzkiDpLgJJmCOodGFo6D4NF9fCgIu
Reading