EE2211 Lecture 7
Learning
Thomas Yeo
thomas.yeo@nus.edu.sg
Regression Review
• Goal: Given feature(s) x, we want to predict target y
– x can be 1-D or more than 1-D
– y is 1-D
• Two types of input data
– Training set $\{(x_i, y_i)\}$, for $i = 1, \dots, N$
– Test set $\{(x_j, y_j)\}$, for $j = 1, \dots, M$
• Learning/Training
– Training set used to estimate regression coefficients
• Prediction/Testing/Evaluation
– Prediction performed on test set to evaluate performance
Regression Review: Linear Case
• x is 1-D & y is 1-D
• Linear relationship between x & y
• Illustration (4 training samples): [Figure: scatter plot with fitted line]
Regression Review: Polynomial
• x is 1-D (or more than 1-D) & y is 1-D
• Polynomial relationship between x & y
• Quadratic illustration (4 training samples, x is 1-D): [Figure: scatter plot with fitted quadratic]
Note on Training & Test Sets
• Linear is a special case of polynomial => use “P”
instead of “X” from now on
• Training/Learning (primal) on training set:
$\hat{w} = (P^\top P)^{-1} P^\top y$
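As a concrete illustration, here is a minimal NumPy sketch of the primal least-squares fit above; the data values and the quadratic feature choice are invented for illustration, not taken from the slides.

```python
# Minimal sketch of the primal least-squares solution w_hat = (P^T P)^{-1} P^T y.
# Data values and the quadratic feature map are illustrative assumptions.
import numpy as np

# Four training samples: 1-D feature x, 1-D target y
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([1.2, 0.1, 1.1, 3.9])

# Polynomial (quadratic) feature matrix P: columns are 1, x, x^2
P = np.column_stack([np.ones_like(x), x, x**2])

# Solve the least-squares problem; lstsq is more stable than an explicit inverse
w, *_ = np.linalg.lstsq(P, y, rcond=None)
print("estimated coefficients:", w)

# Predict at a new (hypothetical) test point
x_new = 1.5
print("prediction:", np.array([1.0, x_new, x_new**2]) @ w)
```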
Overfitting Example
[Figure: order-9 polynomial fit to 10 training points; the curve passes through every training point but oscillates between them => big prediction error on new data]
Underfitting Example
[Figure: order-1 polynomial fit to 10 data points generated from an order-2 polynomial; the line misses the curvature => poor fit even on the training set]
“Just Nice”
[Figure: order-2 polynomial fit that matches both training and test data well]
Overfitting & Underfitting
• Overfitting occurs when a model predicts the training data
well, but predicts new data (e.g., from the test set) poorly
• Reason 1
– Model is too complex for the data
– Previous example: Fit an order 9 polynomial to 10 data points
• Reason 2
– Too many features, but the number of training samples is too small
– Even a linear model can overfit, e.g., a linear model with 9 input
features (i.e., the augmented feature vector is 10-D) and 10 data points
in the training set => the data might not be enough to estimate 10 unknowns well
• Solutions
– Use simpler models (e.g., lower order polynomial); see the sketch below
– Use regularization (see next part of lecture)
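A minimal simulation of the pattern above, with an invented quadratic ground truth: the order-9 fit matches the 10 training points almost exactly but does much worse on held-out data than the order-2 fit.

```python
# Sketch of overfitting: order-9 vs order-2 polynomial fits to 10 noisy points.
# The ground-truth function and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
truth = lambda x: 1.0 + 2.0 * x - 1.5 * x**2      # assumed order-2 ground truth

x_train = np.linspace(-1.0, 1.0, 10)
y_train = truth(x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1.0, 1.0, 100)
y_test = truth(x_test) + 0.2 * rng.standard_normal(x_test.size)

for order in (2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=order)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```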
Overfitting & Underfitting
• Underfitting is the inability of the trained model to predict
the targets in the training set
• Reason 1
– Model is too simple for the data
– Previous example: Fit order 1 polynomial to 10 data points
that came from an order 2 polynomial
– Solution: Try more complex model
• Reason 2
– Features are not informative enough
– Solution: Try to develop more informative features
Overfitting / Underfitting Schematic
[Figure: training and test error versus model complexity; underfitting regime on the left (both errors high), overfitting regime on the right (training error low, test error high)]
Regularization
• Regularization is an umbrella term for methods that force the
learning algorithm to build less complex models.
• Motivation 1: Solve an ill-posed problem
– For example, estimate a 10th-order polynomial with just 5 data points
• Motivation 2: Reduce overfitting
• For example, in the previous lecture, we added $\lambda I$:
$\hat{w} = (P^\top P + \lambda I)^{-1} P^\top y$
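A minimal NumPy sketch of this regularized solution; the ill-posed setup below (11 polynomial coefficients, 5 data points) mirrors Motivation 1, and all data values are illustrative assumptions.

```python
# Sketch of L2-regularized least squares: w_hat = (P^T P + lambda*I)^{-1} P^T y.
import numpy as np

def ridge_fit(P, y, lam):
    """Solve (P^T P + lam*I) w = P^T y; avoids forming an explicit inverse."""
    d = P.shape[1]
    return np.linalg.solve(P.T @ P + lam * np.eye(d), P.T @ y)

# Ill-posed case: a 10th-order polynomial (11 unknowns) from just 5 data points.
# Without regularization, P^T P is singular; with lam > 0 the system is solvable.
x = np.linspace(-1.0, 1.0, 5)
y = 1.0 + 2.0 * x                                  # illustrative targets
P = np.vander(x, N=11, increasing=True)            # columns 1, x, ..., x^10
w = ridge_fit(P, y, lam=1.0)
print("coefficients:", np.round(w, 3))
```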
Regularization
• Consider the minimization from the previous slide
L2-Regularization
• $\hat{w} = \arg\min_{w} \|y - Pw\|_2^2 + \lambda \|w\|_2^2$
• Encourages $w$ to be small (also called shrinkage or
weight-decay) => constrains model complexity
• More generally, most machine learning algorithms can be
formulated as the following optimization problem:
$\min_{w} \; \mathrm{Loss}(w; \text{training set}) + \lambda \cdot \mathrm{Regularizer}(w)$
Regularization Example

Model           Training Set Fit   Test Set Fit
Order 9         Good               Bad
Order 9, λ = 1  Good               Good

[Figure: order-9 polynomial fits to the same data, without and with λ = 1 regularization]
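The table's qualitative pattern can be reproduced with a short simulation; the sinusoidal ground truth and noise level below are our own assumptions, not the lecture's data, and the printed train/test errors let you check the pattern.

```python
# Sketch reproducing the table: order-9 fit with lambda = 0 vs lambda = 1.
import numpy as np

rng = np.random.default_rng(1)
truth = lambda x: np.sin(np.pi * x)                # assumed ground truth

x_train = np.linspace(-1.0, 1.0, 10)
y_train = truth(x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-1.0, 1.0, 100)
y_test = truth(x_test) + 0.1 * rng.standard_normal(100)

P_train = np.vander(x_train, N=10, increasing=True)  # order-9 features
P_test = np.vander(x_test, N=10, increasing=True)

for lam in (0.0, 1.0):
    w = np.linalg.solve(P_train.T @ P_train + lam * np.eye(10),
                        P_train.T @ y_train)
    train_mse = np.mean((P_train @ w - y_train) ** 2)
    test_mse = np.mean((P_test @ w - y_test) ** 2)
    print(f"lambda = {lam}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```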
Bias versus Variance
• Suppose we are trying to predict the red target below:
[Figure: dartboard illustration with low-variance and high-variance panels]
• Low bias: blue predictions on average close to red target
• Low variance: low variability among predictions
• High variance: large variability among predictions
Bias + Variance Tradeoff
• Test error = Bias² + Variance + Irreducible noise
Bias + Variance Example
• Simulate data from an order 2 polynomial (+ noise)
• Randomly sample 10 training samples each time
• Fit with order 2 polynomial: low variance, low bias
• Fit with order 4 polynomial: high variance, low bias
• Order 2 achieves lower test error
[Figure: fits across random training sets; left panel: 4th order polynomials, right panel: 2nd order polynomials]
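A minimal sketch of this experiment, with an invented order-2 ground truth: over many random training sets, the order-4 fit shows visibly larger variance at a fixed query point.

```python
# Sketch of the bias/variance experiment: repeatedly draw 10 noisy samples from
# an order-2 polynomial, fit order-2 and order-4 models, and measure squared
# bias and variance of the prediction at one query point. All constants are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
truth = lambda x: 0.5 - 1.0 * x + 2.0 * x**2       # assumed order-2 ground truth
x_query = 0.3                                      # point at which we evaluate

for order in (2, 4):
    preds = []
    for _ in range(2000):                          # 2000 random training sets
        x = rng.uniform(-1.0, 1.0, size=10)
        y = truth(x) + 0.3 * rng.standard_normal(10)
        coeffs = np.polyfit(x, y, deg=order)
        preds.append(np.polyval(coeffs, x_query))
    preds = np.array(preds)
    bias2 = (preds.mean() - truth(x_query)) ** 2
    print(f"order {order}: bias^2 = {bias2:.5f}, variance = {preds.var():.5f}")
```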
Bias + Variance Tradeoff Theorem
[Equation slides: derivation of the bias-variance decomposition of expected test error; see the statement below]
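The equations on these slides did not survive extraction; a standard statement of the decomposition, assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and an estimator $\hat{f}_D$ trained on a random training set $D$, is:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```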
Summary
• Overfitting, underfitting & model complexity
– Overfitting: low error in training set, high error in test set
– Underfitting: high error in both training & test sets
– Overly complex models can overfit; overly simple models can underfit
• Regularization (e.g., L2 regularization)
– Solve “ill-posed” problem (e.g., more unknowns than data points)
– Reduce overfitting
• Bias-Variance Tradeoff
– Test error = Bias² + Variance + Irreducible noise
– Interpretation:
• Overly complex models can have high variance, low bias
• Overly simple models can have low variance, high bias
• Interpretation is not always true (see tutorial)
© Copyright EE, NUS. All Rights Reserved.