
Lecture 1: Regression

Iain Styles
7 October 2019

Introduction

[Figure 1: Some approaches to curve fitting. https://t.co/X76aDTJXI9]

Regression is the task of finding the relationship between one or
more numerical inputs (independent variables) and one or more
numerical outputs (dependent variables). The goal is to find a rela-
tionship (i.e. a mathematical function) that allows us to predict
the dependent variables from the independent variables, and specif-
ically to learn that relationship from data. As shown in Figure 1,
this can, in its simplest form, be described as “curve fitting”, but
it is a powerful and instructive problem to study because it is very
amenable to mathematical analysis, which will allow us to under-
stand a range of core issues that are important across the spectrum
of machine learning tasks, for example:

• Objective functions;

• Over/underfitting;

• Regularisation;

• Model capacity;

• Bias vs variance;

• Cross-validation;

• Probabilistic reasoning.

The desired outcomes of a regression are typically to (a) be able


to predict values of the dependent variables given values of the
independent variables; (b) derive estimates of the parameters that
define an underlying mathematical model. This second point is
sometimes overlooked but can be an important one: machine
learning is often framed as a problem only of prediction, but it can
also be used to estimate parameters. Some simple and common examples
include the estimation of chemical reaction rates, radioactive half-
lives, or vehicle speeds from measurements of their position.
Consider the simple dataset shown in Figure 2, in which we
have an independent variable x upon which y is dependent. In this
simple example there are two questions we can ask:

1. What is the value of y at x = 2.5 (or 3.2, or 1.9, etc.)?

2. Given that it seems reasonably obvious that y has a linear depen-
   dence on x, what are the parameters (intercept, gradient) of the
   underlying linear model?

How to answer these questions will be the focus of this part of
the course.

[Figure 2 (y = 3x + 2): A simple example of a dataset with
independent variable x and dependent variable y.]

Linear regression

To begin, let us consider a class of problems that are linear;
we will define what we mean by this shortly. In the first
instance, we will restrict ourselves to the simple case of data that
has one independent variable x and one dependent variable y. A
dataset then comprises a set of N pairs of data points

D = {(x_1, y_1), . . . , (x_N, y_N)} = {(x_i, y_i)}_{i=1}^N    (1)

We wish to model the relationship between x and y as a mathe-
matical function f(w, x) such that y_i ≈ f(w, x_i), with unknown
parameters w. Ideally, we will choose the form of f so that it ac-
curately describes the underlying data-generating process, so that
all observations of y can be predicted well by f. However,
this is complicated by the fact that observations and measurements
are nearly always subject to some form of error or noise, so that
there is a random component to the measurement. That is, even if
f fully describes the underlying process by which the data is gener-
ated, the observations are produced by a noisy process that can be
modelled by

y_i = f(w, x_i) + e    (2)
where e is a random number drawn from some continuous prob-
ability density function that depends on the particular properties
of the observation process. We will revisit the implications of this
when we consider regression from a probabilistic perspective.
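To make this concrete, here is a minimal Python sketch (not part of the original notes) that generates a small synthetic dataset of the kind shown in Figure 2, assuming the underlying process is the line y = 3x + 2 and that the noise e is zero-mean Gaussian; the true parameters, the noise distribution, and the sample size are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed underlying process, matching the line plotted in Figure 2.
    return 3.0 * x + 2.0

N = 20
x = rng.uniform(0.0, 4.0, size=N)    # independent variable
e = rng.normal(0.0, 1.0, size=N)     # noise term, assumed Gaussian here
y = f_true(x) + e                    # observations y_i = f(w, x_i) + e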
We will first consider a simple way to approach regression by
treating it as an optimisation problem in which the objective is to
find the value of w (denoted w∗ ) that minimises some "loss", or
objective function L(w).

w∗ = arg min_w L(w)    (3)
The intuition behind this is that L(w) should be designed to
capture the difference between the data and the predictions of the
model, and the optimisation will seek to minimise that difference.
One very common choice for L(w) is the least-squares error. Given
our dataset D and a modelling function f(w, x), we construct, for
each datapoint in D a residual error defined as

r_i(w) = y_i − f(w, x_i).    (4)

This is illustrated in Figure 3.
The least squares error (LSE) loss function is then defined in
terms of the residuals as

L_LSE(w) = ∑_{i=1}^N r_i^2 = r^T r    (5)

[Figure 3: The residuals (shown in red) are a measure of the
goodness-of-fit of a function (black line) to a set of data (blue
points).]

It is important to note here that L_LSE has no upper bound, but
it does have a finite lower bound because it is a strictly positive
quantity. It is therefore possible to minimise this quantity and,


following Equation (3), our goal is to find w that minimises the
loss:

w∗ = arg min_w L_LSE(w)    (6)
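Before specialising to linear models, the optimisation framing of Equations (3) to (6) can be illustrated numerically. The sketch below assumes a simple straight-line model f(w, x) = w_0 + w_1 x and hands the least-squares loss to scipy.optimize.minimize, an off-the-shelf numerical optimiser; the model choice, starting point, and use of SciPy are assumptions made only for this example.

import numpy as np
from scipy.optimize import minimize

def f(w, x):
    # An illustrative model: a straight line with parameters w = (w_0, w_1).
    return w[0] + w[1] * x

def lse_loss(w, x, y):
    # L_LSE(w) = sum_i r_i^2 with r_i = y_i - f(w, x_i)   (Equations (4)-(5))
    r = y - f(w, x)
    return np.sum(r ** 2)

# x, y: data such as those generated in the earlier sketch
w_init = np.zeros(2)                                # arbitrary starting point
result = minimize(lse_loss, w_init, args=(x, y))    # w* = argmin_w L_LSE(w)
w_numeric = result.x                                # should be close to (2, 3)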

Optimisation problems can be extremely difficult and we will,


for now, restrict ourselves to a specific case which can be analysed
using some relatively straightforward mathematics: models which
are linear. What we mean by this is not the familiar "y = mx + c"
example (although this is a linear problem) that you may have
seen before, but a more general class of models that are linear in
their unknown parameters. These will turn out to be surprisingly
powerful. Linear models take the form

f(w, x) = w_0 φ_0(x) + · · · + w_{M−1} φ_{M−1}(x) = ∑_{i=0}^{M−1} w_i φ_i(x).    (7)

Our function is a linear combination of a set of basis functions
{φ_i(x)}_{i=0}^{M−1} weighted by the free parameters {w_i}_{i=0}^{M−1}. Note that
the basis functions have no free parameters and depend solely on
the independent variable. The basis functions are a representation
of the problem and should be chosen carefully to be sufficiently ex-
pressive to capture the features in the data. A very common choice
of basis is the polynomials {x^i}_{i=0}^{M−1}, which we will use in the exam-
ples that follow.
For a finite set of data points D , we can rewrite Equation (7) in
matrix form by defining the matrix Φ with components Φ_{ij} = φ_j(x_i) in
terms of which

f(w) = Φw (8)
where the dependency on x is now absorbed into the compo-
nents of f. It is important to note the order of the indices in the def-
inition of Φ_{ij}: each row (indexed by i) corresponds to a single data
point, whilst each column corresponds to a basis function. As an
example for a simple quadratic model f(w, x) = w_0 + w_1 x + w_2 x^2
with basis functions {x^0, x^1, x^2} = {1, x, x^2}, we have

Φ = [ 1    x_1    x_1^2
      1    x_2    x_2^2
      ⋮     ⋮      ⋮
      1    x_N    x_N^2 ]    (9)
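To illustrate Equations (7) to (9), the design matrix for the polynomial basis {1, x, x^2} might be assembled as follows; the helper name design_matrix and the representation of the basis as a list of callables are choices made for this sketch, not something prescribed by the notes.

import numpy as np

# Polynomial basis functions: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = x^2.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

def design_matrix(x, basis):
    # Phi_{ij} = phi_j(x_i): one row per data point, one column per basis function.
    return np.column_stack([phi(x) for phi in basis])

Phi = design_matrix(x, basis)    # shape (N, M), here M = 3
# For a polynomial basis this is equivalent to np.vander(x, 3, increasing=True).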
Having restricted ourselves to linear models, we can begin to
solve the optimisation problem posed by Equation (3). The residu-
als defined by Equation (4) can be written as

r = y − Φw, (10)

and the loss function (Equation (6)) becomes

L_LSE(w) = (y − Φw)^T (y − Φw)    (11)
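Equations (10) and (11) translate directly into code; in the sketch below w is simply an arbitrary candidate parameter vector chosen for illustration.

w = np.array([2.0, 3.0, 0.0])    # an arbitrary candidate parameter vector
r = y - Phi @ w                  # r = y - Phi w       (Equation (10))
loss = r @ r                     # L_LSE(w) = r^T r    (Equation (11))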

We now take advantage of our observation that L_LSE has no
upper bound but does have a lower bound to solve Equation (3). To
minimise L_LSE we find the point at which its gradient with respect
to its free parameters is zero. Since L_LSE has no maximum, this
point must be the minimum. We therefore differentiate L_LSE with
respect to w and set the result to zero. We get

∂L_LSE(w)/∂w = −2 Φ^T (y − Φw).    (12)
To understand how we obtain this result, let us break the calcu-
lation down step by step. Noting that L_LSE(w) = r^T r, we begin by
computing the derivative of r with respect to the components of w.
First, note that
r_i = y_i − ∑_j Φ_{ij} w_j    (13)

Then, picking a particular component of w, say w_k, to differentiate
with respect to, we find that

∂r_i/∂w_k = −Φ_{ik}.    (14)

Now, we note that L_LSE = ∑_i r_i^2 and so

∂L_LSE/∂r_l = 2 r_l    (15)

Then, we apply the chain rule of differentiation:

∂L_LSE/∂w_k = ∑_l (∂L_LSE/∂r_l) × (∂r_l/∂w_k)    (16)
            = −∑_l 2 r_l Φ_{lk}    (17)

Since r is a column vector, we have to do some rearrangements to
rewrite this in matrix notation. This means we have to make sure
that i) r is the last term in the equation, and ii) that the matrix Φ is
in the correct orientation for the multiplication. Writing the result
as a vector ∂L_LSE/∂w with components ∂L_LSE/∂w_k, we have:

∂L_LSE/∂w_k = ∑_l −2 r_l Φ_{lk} = −2 ∑_l Φ^T_{kl} r_l  →  ∂L_LSE/∂w = −2 Φ^T r = −2 Φ^T (y − Φw).    (18)
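A useful sanity check on this derivation is to compare the analytic gradient of Equation (18) with a finite-difference approximation of the loss; the sketch below does this for the example data (the step size h and the tolerance are arbitrary illustrative choices).

def gradient(w, Phi, y):
    # Analytic gradient from Equations (12) and (18): -2 Phi^T (y - Phi w).
    return -2.0 * Phi.T @ (y - Phi @ w)

def numerical_gradient(w, Phi, y, h=1e-6):
    # Central finite differences, one component of w at a time.
    g = np.zeros_like(w)
    for k in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[k] += h
        w_minus[k] -= h
        loss_plus = (y - Phi @ w_plus) @ (y - Phi @ w_plus)
        loss_minus = (y - Phi @ w_minus) @ (y - Phi @ w_minus)
        g[k] = (loss_plus - loss_minus) / (2.0 * h)
    return g

print(np.allclose(gradient(w, Phi, y), numerical_gradient(w, Phi, y), atol=1e-3))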
Finally, we set the result to zero to obtain

Φ^T y − Φ^T Φ w∗ = 0    (19)
This result is known as the normal equations and is a set of
simultaneous linear equations that we can solve for w∗. A naïve
way to do this is to evaluate w∗ = (Φ^T Φ)^{−1} Φ^T y, but numerical
inversion of matrices can be troublesome, especially if the matrix is

large (in this case, large M) and this is best avoided. It is therefore
usual to solve the normal equations directly (e.g. using Gaussian
elimination).
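In NumPy, the contrast between the naïve route and a direct solve might look like the sketch below (an illustration, not the notes' own code); np.linalg.solve performs a direct solve of the normal equations via LU factorisation (i.e. Gaussian elimination), while np.linalg.lstsq, which works on Φ itself rather than Φ^T Φ, is another standard and numerically robust option.

A = Phi.T @ Phi                  # the M x M matrix in the normal equations
b = Phi.T @ y

w_naive = np.linalg.inv(A) @ b        # naive: explicit matrix inversion, best avoided
w_star = np.linalg.solve(A, b)        # direct solve of Phi^T Phi w* = Phi^T y
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares solve on Phi directly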
This set of mathematical procedures comprises a method by which
the parameters of some model can be learned from data. This is the
very core of what machine learning is about. Although we have
set the general form of the model, it is from the data that we learn
what its precise form is.
Let us work through a simple example. This can be found in the
accompanying notebook, which can be accessed at
https://colab.research.google.com/drive/1sHZqzkiDpLgJJmCOodGFo6D4NF9fCgIu

Reading

Sections 1.1 and 3.1 of Bishop, Pattern Recognition and Machine
Learning.
