
Lecture 1: Regression

Iain Styles
7 October 2019

Introduction

[Figure 1: Some approaches to curve fitting. https://t.co/X76aDTJXI9]

Regression is the task of finding the relationship between one or
more numerical inputs (independent variables) and one or more
numerical outputs (dependent variables). The goal is to find a rela-
tionship (i.e. a mathematical function) that allows us to predict
the dependent variables from the independent variables, and specif-
ically to learn that relationship from data. As shown in Figure 1,
this can, in its simplest form, be described as “curve fitting”, but
it is a powerful and instructive problem to study because it is very
amenable to mathematical analysis, which will allow us to under-
stand a range of core issues that are important across the spectrum
of machine learning tasks, for example:

• Objective functions;

• Over/underfitting;

• Regularisation;

• Model capacity;

• Bias vs variance;

• Cross-validation;

• Probabilistic reasoning.

The desired outcomes of a regression are typically to (a) be able


to predict values of the dependent variables given values of the
independent variables; (b) derive estimates of the parameters that
define an underlying mathematical model. This second point is
sometimes overlooked but can be an important one: machine
learning is often framed as a problem only of prediction, but it can
also be used to estimate parameters. Some simple and common examples
include the estimation of chemical reaction rates, radioactive half-
lives, or vehicle speeds from measurements of their position.
Consider the simple dataset shown in Figure 2, in which we
have an independent variable x upon which y is dependent. In this
simple example there are two questions we can ask:

1. What is the value of y at x = 2.5 (or 3.2, or 1.9, etc.)?

2. Given that it seems reasonably obvious that y has a linear depen-
   dence on x, what are the parameters (intercept, gradient) of the
   underlying linear model?

How to answer these questions will be the focus of this part of
the course.

[Figure 2 (y = 3x + 2): A simple example of a dataset with
independent variable x and dependent variable y.]

Linear regression

To begin, let us consider a class of problems that are linear;
we will define what we mean by this shortly. In the first
instance, we will restrict ourselves to the simple case of data that
has one independent variable x and one dependent variable y. A
dataset then comprises a set of N pairs of data points

D = {(x_1, y_1), . . . , (x_N, y_N)} = {(x_i, y_i)}_{i=1}^N    (1)

We wish to model the relationship between x and y as a mathe-
matical function f(w, x) such that y_i ≈ f(w, x_i), with unknown
parameters w. Ideally, we will choose the form of f so that it ac-
curately describes the underlying data-generating process, so that
all observations of y can be predicted well by f. However,
this is complicated by the fact that observations and measurements
are nearly always subject to some form of error or noise, so that
there is a random component to the measurement. That is, even if
f fully describes the underlying process by which the data is gener-
ated, the observations are produced by a noisy process that can be
modelled by

y_i = f(w, x_i) + e    (2)
where e is a random number drawn from some continuous prob-
ability density function that depends on the particular properties
of the observation process. We will revisit the implications of this
when we consider regression from a probabilistic perspective.
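To make this concrete, here is a minimal Python sketch (not part of the original notes) that generates a small synthetic dataset of the kind shown in Figure 2, assuming the underlying process is the line y = 3x + 2 and that the noise e is zero-mean Gaussian; the true parameters, the noise distribution, and the sample size are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed underlying process, matching the line plotted in Figure 2.
    return 3.0 * x + 2.0

N = 20
x = rng.uniform(0.0, 4.0, size=N)    # independent variable
e = rng.normal(0.0, 1.0, size=N)     # noise term, assumed Gaussian here
y = f_true(x) + e                    # observations y_i = f(w, x_i) + e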
We will first consider a simple way to approach regression by
treating it as an optimisation problem in which the objective is to
find the value of w (denoted w∗ ) that minimises some "loss", or
objective function L(w).

w∗ = arg min_w L(w)    (3)
The intuition behind this is that L(w) should be designed to
capture the difference between the data and the predictions of the
model, and the optimisation will seek to minimise that difference.
One very common choice for L(w) is the least-squares error. Given
our dataset D and a modelling function f(w, x), we construct, for
each datapoint in D a residual error defined as

r_i(w) = y_i − f(w, x_i).    (4)

This is illustrated in Figure 3.
The least squares error (LSE) loss function is then defined in
terms of the residuals as

L_LSE(w) = ∑_{i=1}^N r_i^2 = r^T r    (5)

[Figure 3: The residuals (shown in red) are a measure of the
goodness-of-fit of a function (black line) to a set of data (blue
points).]

It is important to note here that L_LSE has no upper bound, but
it does have a finite lower bound because it is a strictly positive
quantity. It is therefore possible to minimise this quantity and,


following Equation (3), our goal is to find w that minimises the
loss:

w∗ = arg min_w L_LSE(w)    (6)
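Before specialising to linear models, the optimisation framing of Equations (3) to (6) can be illustrated numerically. The sketch below assumes a simple straight-line model f(w, x) = w_0 + w_1 x and hands the least-squares loss to scipy.optimize.minimize, an off-the-shelf numerical optimiser; the model choice, starting point, and use of SciPy are assumptions made only for this example.

import numpy as np
from scipy.optimize import minimize

def f(w, x):
    # An illustrative model: a straight line with parameters w = (w_0, w_1).
    return w[0] + w[1] * x

def lse_loss(w, x, y):
    # L_LSE(w) = sum_i r_i^2 with r_i = y_i - f(w, x_i)   (Equations (4)-(5))
    r = y - f(w, x)
    return np.sum(r ** 2)

# x, y: data such as those generated in the earlier sketch
w_init = np.zeros(2)                                # arbitrary starting point
result = minimize(lse_loss, w_init, args=(x, y))    # w* = argmin_w L_LSE(w)
w_numeric = result.x                                # should be close to (2, 3)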

Optimisation problems can be extremely difficult and we will,


for now, restrict ourselves to a specific case which can be analysed
using some relatively straightforward mathematics: models which
are linear. What we mean by this is not the familiar "y = mx + c"
example (although this is a linear problem) that you may have
seen before, but a more general class of models that are linear in
their unknown parameters. These will turn out to be surprisingly
powerful. Linear models take the form

f(w, x) = w_0 φ_0(x) + · · · + w_{M−1} φ_{M−1}(x) = ∑_{i=0}^{M−1} w_i φ_i(x).    (7)

Our function is a linear combination of a set of basis functions
{φ_i(x)}_{i=0}^{M−1} weighted by the free parameters {w_i}_{i=0}^{M−1}. Note that
the basis functions have no free parameters and depend solely on
the independent variable. The basis functions are a representation
of the problem and should be chosen carefully to be sufficiently ex-
pressive to capture the features in the data. A very common choice
of basis is the polynomials {x^i}_{i=0}^{M−1}, which we will use in the exam-
ples that follow.
For a finite set of data points D , we can rewrite Equation (7) in
matrix form by defining the matrix Φ with components Φ_{ij} = φ_j(x_i) in
terms of which

f(w) = Φw (8)
where the dependency on x is now absorbed into the compo-
nents of f. It is important to note the order of the indices in the def-
inition of Φ_{ij}: each row (indexed by i) corresponds to a single data
point, whilst each column corresponds to a basis function. As an
example for a simple quadratic model f(w, x) = w_0 + w_1 x + w_2 x^2
with basis functions {x^0, x^1, x^2} = {1, x, x^2}, we have

Φ = [ 1    x_1    x_1^2
      1    x_2    x_2^2
      ⋮     ⋮      ⋮
      1    x_N    x_N^2 ]    (9)
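To illustrate Equations (7) to (9), the design matrix for the polynomial basis {1, x, x^2} might be assembled as follows; the helper name design_matrix and the representation of the basis as a list of callables are choices made for this sketch, not something prescribed by the notes.

import numpy as np

# Polynomial basis functions: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = x^2.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

def design_matrix(x, basis):
    # Phi_{ij} = phi_j(x_i): one row per data point, one column per basis function.
    return np.column_stack([phi(x) for phi in basis])

Phi = design_matrix(x, basis)    # shape (N, M), here M = 3
# For a polynomial basis this is equivalent to np.vander(x, 3, increasing=True).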
Having restricted ourselves to linear models, we can begin to
solve the optimisation problem posed by Equation (3). The residu-
als defined by Equation (4) can be written as

r = y − Φw, (10)

and the loss function (Equation (6)) becomes

L_LSE(w) = (y − Φw)^T (y − Φw)    (11)
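Equations (10) and (11) translate directly into code; in the sketch below w is simply an arbitrary candidate parameter vector chosen for illustration.

w = np.array([2.0, 3.0, 0.0])    # an arbitrary candidate parameter vector
r = y - Phi @ w                  # r = y - Phi w       (Equation (10))
loss = r @ r                     # L_LSE(w) = r^T r    (Equation (11))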

We now take advantage of our observation that L_LSE has no
upper bound but does have a lower bound to solve Equation (3). To
minimise L_LSE we find the point at which its gradient with respect
to its free parameters is zero. Since L_LSE has no maximum, this
point must be the minimum. We therefore differentiate L_LSE with
respect to w and set the result to zero. We get

∂L_LSE(w)/∂w = −2 Φ^T (y − Φw).    (12)
To understand how we obtain this result, let us break the calcu-
lation down step by step. Noting that L_LSE(w) = r^T r, we begin by
computing the derivative of r with respect to the components of w.
First, note that
r_i = y_i − ∑_j Φ_{ij} w_j    (13)

Then, picking a particular component of w, say w_k, to differentiate
with respect to, we find that

∂r_i/∂w_k = −Φ_{ik}.    (14)

Now, we note that L_LSE = ∑_i r_i^2 and so

∂L_LSE/∂r_l = 2 r_l    (15)

Then, we apply the chain rule of differentiation:

∂L_LSE/∂w_k = ∑_l (∂L_LSE/∂r_l) × (∂r_l/∂w_k)    (16)
            = −∑_l 2 r_l Φ_{lk}    (17)

Since r is a column vector, we have to do some rearrangements to
rewrite this in matrix notation. This means we have to make sure
that i) r is the last term in the equation, and ii) that the matrix Φ is
in the correct orientation for the multiplication. Writing the result
as a vector ∂L_LSE/∂w with components ∂L_LSE/∂w_k, we have:

∂L_LSE/∂w_k = ∑_l −2 r_l Φ_{lk} = −2 ∑_l Φ^T_{kl} r_l  →  ∂L_LSE/∂w = −2 Φ^T r = −2 Φ^T (y − Φw).    (18)
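A useful sanity check on this derivation is to compare the analytic gradient of Equation (18) with a finite-difference approximation of the loss; the sketch below does this for the example data (the step size h and the tolerance are arbitrary illustrative choices).

def gradient(w, Phi, y):
    # Analytic gradient from Equations (12) and (18): -2 Phi^T (y - Phi w).
    return -2.0 * Phi.T @ (y - Phi @ w)

def numerical_gradient(w, Phi, y, h=1e-6):
    # Central finite differences, one component of w at a time.
    g = np.zeros_like(w)
    for k in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[k] += h
        w_minus[k] -= h
        loss_plus = (y - Phi @ w_plus) @ (y - Phi @ w_plus)
        loss_minus = (y - Phi @ w_minus) @ (y - Phi @ w_minus)
        g[k] = (loss_plus - loss_minus) / (2.0 * h)
    return g

print(np.allclose(gradient(w, Phi, y), numerical_gradient(w, Phi, y), atol=1e-3))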
Finally, we set the result to zero to obtain

Φ^T y − Φ^T Φ w∗ = 0    (19)
This result is known as the normal equations and is a set of
simultaneous linear equations that we can solve for w∗. A naïve
way to do this is to evaluate w∗ = (Φ^T Φ)^{−1} Φ^T y, but numerical
inversion of matrices can be troublesome, especially if the matrix is

large (in this case, large M) and this is best avoided. It is therefore
usual to solve the normal equations directly (e.g. using Gaussian
elimination).
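In NumPy, the contrast between the naïve route and a direct solve might look like the sketch below (an illustration, not the notes' own code); np.linalg.solve performs a direct solve of the normal equations via LU factorisation (i.e. Gaussian elimination), while np.linalg.lstsq, which works on Φ itself rather than Φ^T Φ, is another standard and numerically robust option.

A = Phi.T @ Phi                  # the M x M matrix in the normal equations
b = Phi.T @ y

w_naive = np.linalg.inv(A) @ b        # naive: explicit matrix inversion, best avoided
w_star = np.linalg.solve(A, b)        # direct solve of Phi^T Phi w* = Phi^T y
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares solve on Phi directly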
This set of mathematical procedures comprises a method by which
the parameters of some model can be learned from data. This is the
very core of what machine learning is about. Although we have
set the general form of the model, it is from the data that we learn
what its precise form is.
Let us work through a simple example. This can be found in the
accompanying notebook, which can be accessed at
https://colab.research.google.com/drive/1sHZqzkiDpLgJJmCOodGFo6D4NF9fCgIu

Reading

Sections 1.1 and 3.1 of Bishop, Pattern Recognition and Machine
Learning.
