
Machine Learning 1 (CM50264)

Week 2, Lecture 2: Evaluating ML systems and linear regression

Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines and Xi Chen (for images and other content)
12 Oct. 2022
University of Bath, UK

1
Evaluating and diagnosing ML systems

How can your machine learning system fail?

• Underfitting
• Overfitting
• Bad data
• ···

How do you detect?

2
Underfitting & overfitting

• Complicated (non-linear) decision boundary
• Classes overlap
• How would you divide them?

Three possible outcomes:
• Underfitting (over-simplistic classifier)
• Overfitting (too complicated classifier)
• Balanced (does reasonable fitting on training data, generalizes well on test data)

3
Underfitting & overfitting

• Underfitting: logistic regression
• Balanced: tuned random forest (sklearn, min_impurity_decrease=0.008, n_estimators=512)
• Overfitting: badly tuned random forest (sklearn, with default parameters)

4
Underfitting and overfitting in regression

You can fit a polynomial that exactly fits the training data, but does not generalize.
On the other extreme, your model could be too weak to fit the training data.
5
Underfitting causes

• Weak model
• Bad fitting, e.g., if the optimization was not good enough (random forest example again)

6
Overfitting causes

• Powerful model → can model training-specific noise
  +
• Insufficient regularization
  (Regularization controls model complexity, so that the model doesn't fit noise)

• Maybe insufficient training data?

• How to detect?

7
Train and test set

• Model can't overfit on data it doesn't see!
• Split the data:
  • A train set, to fit the model
  • A test set, to verify performance
• Large gap between train/test accuracy indicates overfitting (usually)

Accuracy:
  Random Forest    Train    Test
  Underfitting     79.2%    79.2%
  Balanced         97.6%    95.0%
  Overfitting      99.6%    94.7%

8
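To make the train/test comparison concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset from make_classification (neither is part of the lecture); a large gap between the two scores is the overfitting signal described above.

```python
# Hypothetical sketch: detect overfitting by comparing train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A deliberately flexible model (default parameters) may overfit.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f}")
# A large train-test gap (e.g., 0.99 vs. 0.94) usually indicates overfitting.
```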
Hyperparameters I

• Parameters → fit to training data


• Hyperparameters → parameters that cannot be fit to training data

• Reasons for having hyperparameters:


• Avoiding overfitting, e.g. decision tree depth
• Controlling the amount of computation, e.g. ensemble size
• Hard to optimize (e.g., which degree of polynomial to fit?)
• Bayesian priors have hyperparameters associated with them

9
Hyperparameters II

• Can still fit hyperparameters . . .


(manually or by algorithm)

• . . . but not to the test set!


(this mistake can be found in countless research papers)

• Introduce a third set: validation set


• train – Give to algorithm
• validation – Objective of hyperparameter optimization
• test – To report final performance

10
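A minimal sketch of the three-way split, assuming scikit-learn's train_test_split and illustrative 60/20/20 proportions (the exact proportions are not specified in the slides):

```python
# Hypothetical sketch: carve out train / validation / test sets with two splits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# First split off 40% of the data, then halve it into validation and test (60/20/20 overall).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Fit on X_train, tune hyperparameters against X_val, report final performance on X_test.
```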
Measuring performance

• How do we decide on split percentages?


• Train large → Algorithm performs well
• Validation large → Hyperparameter optimization performs well
• Test large → Accurate performance estimate

• Good default: make validation and test as small as possible while still giving a reliable
  estimate, and use the rest for training
• ... but you might also shrink the train set due to computational cost
• "as small as possible" is hard to judge
11
k-fold cross validation

• Divide the train/validation data into 7 folds (here k = 7)
  • train: six parts
  • validation: one part
• Train for all 7 combinations and report the average performance on test
• Effectively, all samples in the train+val dataset get used for training and validation:
  artificially increases the train and val data size
• More robust estimate of model accuracy

• k-fold = k × slower! Typically 4 ≤ k ≤ 10

• Most extreme: jackknife resampling (leave-one-out cross-validation):
  validation sets of size 1; very slow
  • In practice mostly not done: time = money

12
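A minimal sketch of k-fold cross-validation, assuming scikit-learn's cross_val_score, a random forest, and synthetic data (all illustrative choices, not from the slides); k = 7 as in the example above:

```python
# Hypothetical sketch: estimate accuracy with 7-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# cross_val_score trains 7 models and reports the score on each held-out fold.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=7)
print(f"mean accuracy over 7 folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```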
Measure of performance: confusion matrix

• Makes sense for classification only
• Random forest on breast cancer:

                    Actual
                 False    True
  Predicted
     False         49        6
     True          14      159

• On diagonal means correct, off means wrong
• Can see which classes are confused
• An empty row is a problem
• May want to color code cells as a heat map!

13
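A minimal sketch of computing such a confusion matrix, assuming scikit-learn's breast-cancer dataset and a default random forest as stand-ins for the slide's example (the exact counts will differ):

```python
# Hypothetical sketch: confusion matrix for a random forest on breast-cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# scikit-learn reports rows = actual class, columns = predicted class
# (i.e., the transpose of the table layout used on the slide).
print(confusion_matrix(y_test, y_pred))
```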
Some terminology

                     Actual
              False                   True
  Predicted
     False    True Negative (TN)      False Negative (FN)
     True     False Positive (FP)     True Positive (TP)

14
Some more terminology

Loads of terms are used (ignore most of them):

  TP / (TP + FN)                      sensitivity, recall, hit rate, true positive rate
  TN / (TN + FP)                      specificity, true negative rate
  TP / (TP + FP)                      precision, positive predictive value
  (TP + TN) / (TP + TN + FP + FN)     accuracy
  2·TP / (2·TP + FP + FN)             F1 score

(many more...)

15
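A minimal sketch that evaluates these definitions on the counts from the earlier breast-cancer table (TN = 49, FN = 6, FP = 14, TP = 159):

```python
# Hypothetical sketch: recall, specificity, precision, accuracy and F1 from TP/TN/FP/FN.
TN, FN, FP, TP = 49, 6, 14, 159  # counts from the confusion matrix shown earlier

recall      = TP / (TP + FN)                   # sensitivity, hit rate, true positive rate
specificity = TN / (TN + FP)                   # true negative rate
precision   = TP / (TP + FP)                   # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)
print(recall, specificity, precision, accuracy, f1)
```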
Imbalanced data

• Imbalanced training set → makes training difficult
  • Cancer detection in CT scans: ≈ 0.1% of scans have cancer
  • 99.9% accuracy by predicting "no cancer" – meaningless!

• Often need to adjust training (e.g., oversampling)

• F1 score is better

• Balanced accuracy (see the sketch below):

$$
\frac{1}{|C|} \sum_{c \in C} \frac{|\{ y_i = c \wedge f_\theta(x_i) = c \}|}{|\{ y_i = c \}|}
$$

• $C$ = set of classes, of size $|C|$
• $(y_i, x_i)$ = data points
• $f_\theta(\cdot)$ = your classifier model

16
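A minimal sketch of the balanced-accuracy formula above, with a hypothetical imbalanced toy example; the helper name balanced_accuracy is illustrative (scikit-learn's balanced_accuracy_score computes the same quantity):

```python
# Hypothetical sketch: balanced accuracy = mean of per-class recalls.
import numpy as np

def balanced_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    # For each class c, fraction of class-c samples that the model labels as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)

y_true = np.array([0] * 990 + [1] * 10)   # heavily imbalanced labels
y_pred = np.zeros(1000, dtype=int)        # classifier that always predicts "no cancer"
print(balanced_accuracy(y_true, y_pred))  # 0.5, even though plain accuracy is 0.99
```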
Linear regression

17
Linear regression – 1

• Training data: $(x_i, y_i)_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

• Objective: fit a function $f(x) = \sum_{j=1}^{d} w_j x^{(j)} + b = w^\top x + b$, where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$

• We use the same trick that we used in Lecture 1:

  $f(x) = \begin{bmatrix} b & w^\top \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = \bar{w}^\top \bar{x}$, where $\bar{w}, \bar{x} \in \mathbb{R}^{d+1}$

• We will, however, write $f(x) = w^\top x$ and understand $x \in \mathbb{R}^{d+1}$ as the augmented
  feature vector (with its first element being 1, and $w$ encompassing both the weight
  vector and the bias term)

• Empirical risk minimization (ERM): $\displaystyle \min_{w} \; \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2$

18
Linear regression – 2

• A more compact formulation of ERM:

• Let $X \in \mathbb{R}^{n \times (d+1)}$ be such that its $i$-th row is $x_i^\top$:

$$
X = \begin{bmatrix} \text{---} \; x_1^\top \; \text{---} \\ \text{---} \; x_2^\top \; \text{---} \\ \vdots \\ \text{---} \; x_n^\top \; \text{---} \end{bmatrix}_{n \times (d+1)}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}, \quad
w = \begin{bmatrix} w_0 = b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{(d+1) \times 1}
$$

ERM can be expressed using the following equivalent matrix-vector notation:

$$
\min_{w} \; \frac{1}{2} \| Xw - y \|_2^2, \quad \text{where } \|z\|_2^2 = \sum_i z_i^2
$$

19
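A minimal NumPy sketch, with hypothetical random data, of building the augmented design matrix $X$ (first column of ones) and evaluating the ERM objective $J(w)$:

```python
# Hypothetical sketch: augmented design matrix and the objective J(w) = 0.5 * ||Xw - y||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
y = rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_raw])   # prepend a column of ones: x -> (1, x)
w = np.zeros(d + 1)                       # w[0] plays the role of the bias b

def J(w):
    r = X @ w - y
    return 0.5 * r @ r

print(J(w))
```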
Linear regression: direct solution – 1
Let $J(w) = \dfrac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 = \dfrac{1}{2} \| Xw - y \|_2^2$

Observations:

$$
\frac{\partial J(w)}{\partial w_t} = \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) \frac{\partial (w^\top x_i)}{\partial w_t} = \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) x_{it}, \quad \text{where } t = 0, 1, 2, \dots, d
$$

i.e., $\dfrac{\partial J(w)}{\partial w_t} = \left[ X^\top (Xw - y) \right]_t$

• $\nabla J(w) = \begin{bmatrix} \frac{\partial J(w)}{\partial w_0} \\ \frac{\partial J(w)}{\partial w_1} \\ \vdots \\ \frac{\partial J(w)}{\partial w_d} \end{bmatrix}_{(d+1) \times 1} = X^\top (Xw - y)$: called the gradient vector

20
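A minimal sketch, on hypothetical random data, that checks the closed-form gradient $X^\top(Xw - y)$ against a finite-difference approximation:

```python
# Hypothetical sketch: verify the gradient X^T (Xw - y) with central finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
w = rng.normal(size=d + 1)

J = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = X.T @ (X @ w - y)

eps = 1e-6
num_grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps) for e in np.eye(d + 1)])
print(np.max(np.abs(grad - num_grad)))   # should be tiny (finite-difference error only)
```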
Linear regression: direct solution – 2

For $s, t = 0, 1, 2, \dots, d$:

$$
\frac{\partial^2 J(w)}{\partial w_s \, \partial w_t} = \frac{\partial}{\partial w_s} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) x_{it} = \sum_{i=1}^{n} x_{it} \, x_{is} = \left[ X^\top X \right]_{t,s}
$$

• $\nabla^2 J(w) = X^\top X$: called the Hessian matrix ($(d+1) \times (d+1)$)

Exercise: Show that the Hessian matrix is positive semi-definite (PSD), i.e., for any
$u \in \mathbb{R}^{d+1}$, the quadratic form of the Hessian satisfies $u^\top \left( \nabla^2 J(w) \right) u \ge 0$

• A function that is twice differentiable and has a PSD Hessian is called a convex function.
• For a convex $J(w)$, any solution to $\nabla J(w) = 0$ minimizes the function.

21
Linear regression: direct solution – 3

Therefore, to find the optimal $w$ (which we will denote by $w^*$), we need to solve

$$
\nabla J(w) \big|_{w = w^*} = X^\top (Xw^* - y) = 0.
$$

$X^\top (Xw^* - y) = 0$: called the normal equation

$$
\implies \left( X^\top X \right) w^* = X^\top y
$$

This can be solved directly if $X^\top X$ is invertible, and the solution is given by

$$
w^* = \left( X^\top X \right)^{-1} X^\top y = X^\dagger y,
$$

where $X^\dagger = \left( X^\top X \right)^{-1} X^\top$ is called the pseudo-inverse of $X$.

• For the normal matrix $X^\top X \in \mathbb{R}^{(d+1) \times (d+1)}$ to be invertible, we need $n \ge d+1$
• This condition is necessary for the existence of the inverse, but not sufficient

22
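A minimal NumPy sketch, on hypothetical synthetic data, of solving the normal equation directly; np.linalg.lstsq is shown alongside as the numerically preferable least-squares route:

```python
# Hypothetical sketch: solve the normal equation (X^T X) w = X^T y with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_star = np.linalg.solve(X.T @ X, X.T @ y)       # direct solve of the normal equation
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solver (better conditioned)
print(w_star, w_lstsq)                           # both close to w_true
```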
Why ‘normal’ equation?

• Consider the range space of $X$, given by $\{ v \in \mathbb{R}^n : v = Xw \}$

• The algorithm seeks to project $y$ onto this space:

$$
v^\top (Xw^* - y) = (Xw)^\top (Xw^* - y) = w^\top \underbrace{X^\top (Xw^* - y)}_{=0} = 0
$$

That is, the residual error $e = Xw^* - y$ is orthogonal (normal) to the range space of $X$,
hence the name!

23
The direct solution may not always be available

• $X^\top X$ might not always be invertible (possible reasons include $n < d+1$ and/or
  linearly dependent feature vectors)

• A direct solution may not even exist!

• Matrix inversion (or computing the pseudo-inverse) is generally an expensive operation.
  It could be infeasible for large $d$ and/or $n$

Can we approximate the solution iteratively?

24
Iterative solution

A very simple iterative algorithm is gradient descent (GD):

$$
w(k+1) = w(k) - \eta_k \nabla J(w(k)) = w(k) - \eta_k X^\top \left( Xw(k) - y \right)
$$

• For $\eta_k$ small enough, GD converges to a global minimum for convex $J$ (and to a
  stationary point for non-convex $J$), subject to some additional technical requirements

• You will learn more about GD in Xi's lectures on optimization

25
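A minimal sketch of batch gradient descent for this objective, on hypothetical synthetic data; the constant step size and iteration count are illustrative choices, not values from the lecture:

```python
# Hypothetical sketch: batch gradient descent for the least-squares objective.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
eta = 1e-3                        # small constant step size (eta_k = eta for all k)
for k in range(5000):
    w -= eta * X.T @ (X @ w - y)  # w(k+1) = w(k) - eta * gradient
print(w)                          # approaches the normal-equation solution
```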
The intuition behind gradient-descent

[Figure: illustration of how GD works; image source: MathWorks]

26
The least mean square (LMS) algorithm

• Basically applies gradient descent, but only on one randomly chosen sample
  instead of the whole dataset

LMS algorithm (also known as the Widrow-Hoff algorithm, or just stochastic gradient descent):
• init $k \leftarrow 0$, $w \leftarrow w(0)$
• while not converged:
  – sample randomly a data point $(x_k, y_k)$
  – do weight update: $w \leftarrow w - \eta_k \left( x_k^\top w - y_k \right) x_k$
• return $w$

Result: The LMS algorithm recovers a solution of the normal equation if the step sizes
are chosen appropriately.

Advantage: The computational complexity of every update step is $n$ times smaller
than that of the batch version.

27
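A minimal sketch of the LMS / stochastic update on the same kind of hypothetical data; the decaying step-size schedule is one possible choice, not one prescribed in the lecture:

```python
# Hypothetical sketch: the LMS / stochastic-gradient update, one random sample per step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
for k in range(20000):
    i = rng.integers(n)                    # pick one data point at random
    eta_k = 1.0 / (100 + k)                # a decaying step size (one possible schedule)
    w -= eta_k * (X[i] @ w - y[i]) * X[i]  # single-sample version of the GD update
print(w)                                   # close to the normal-equation solution
```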
Generalized linear regression (GLR)

• Let $\phi : \mathbb{R}^d \mapsto \mathbb{R}^D$ be a feature transformation, where $D > d$

• Let our model be $f(x) = \sum_{j=1}^{D} w_j \phi^{(j)}(x) + b = w^\top \phi + b$

• Using the augmentation trick, we will write $f(x) = w^\top \phi$, where $w, \phi \in \mathbb{R}^{D+1}$

$$
\Phi = \begin{bmatrix} \text{---} \; \phi(x_1)^\top \; \text{---} \\ \text{---} \; \phi(x_2)^\top \; \text{---} \\ \vdots \\ \text{---} \; \phi(x_n)^\top \; \text{---} \end{bmatrix}_{n \times (D+1)}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}, \quad
w = \begin{bmatrix} w_0 = b \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}_{(D+1) \times 1}
$$

• ERM: $\displaystyle \min_{w} \; \frac{1}{2} \| \Phi w - y \|_2^2 \implies w^* = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top y$

28
GLR can learn powerful models – 1

• Consider binary classification in 2D, $x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}$

• A simple linear model $f(x) = w^\top x$ will only allow you to learn lines for fitting the data

• Using GLR, you can fit a polynomial, for instance (and more complicated functions!):

$$
\phi : x \mapsto \phi(x) = \begin{bmatrix} 1 \\ x^{(1)} \\ x^{(2)} \\ x^{(1)} \cdot x^{(2)} \\ \left( x^{(1)} \right)^2 \\ \left( x^{(2)} \right)^2 \end{bmatrix}
$$

29
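A minimal sketch of GLR with exactly this quadratic feature map, fit by least squares on hypothetical 2-D data:

```python
# Hypothetical sketch: generalized linear regression with the 2-D quadratic feature map above.
import numpy as np

def phi(x):
    # x = (x1, x2) -> (1, x1, x2, x1*x2, x1^2, x2^2)
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(300, 2))
y = X_raw[:, 0]**2 + X_raw[:, 1]**2 + 0.05 * rng.normal(size=300)  # a non-linear target

Phi = np.array([phi(x) for x in X_raw])           # n x (D+1) design matrix
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # solves min_w 0.5 * ||Phi w - y||^2
print(np.round(w_star, 2))                        # weights on x1^2 and x2^2 come out near 1
```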
GLR can learn powerful models – 2

• Interestingly, the feature map ϕ can even transform the features into an
  infinite-dimensional space (without actually ever computing ϕ explicitly)
• Leads to the so-called kernel linear regression algorithm (more about it later in
the course)
• The idea is to have a kernel function K such that K(x1 , x2 ) = ϕ(x1 )⊤ ϕ(x2 ), for
any x1 and x2
• K must be efficiently computable (bypasses the task of inner-product in a very
high-dimensional space)
• More about kernels in the context of SVMs
• Even state-of-the-art deep neural networks can be approximated using kernel
linear regression (neural tangent kernels)

30
Easy extension to vector-valued targets

• Target variable $y \in \mathbb{R}^m$, $m > 1$: fit $m$ linear models, one per output dimension

$$
\min_{W} \; \frac{1}{2} \| XW - Y \|_2^2 \implies W^* = \left( X^\top X \right)^{-1} X^\top Y
$$

$$
X = \begin{bmatrix} \text{---} \; x_1^\top \; \text{---} \\ \text{---} \; x_2^\top \; \text{---} \\ \vdots \\ \text{---} \; x_n^\top \; \text{---} \end{bmatrix}_{n \times (d+1)}, \quad
Y = \begin{bmatrix} \text{---} \; y_1^\top \; \text{---} \\ \text{---} \; y_2^\top \; \text{---} \\ \vdots \\ \text{---} \; y_n^\top \; \text{---} \end{bmatrix}_{n \times m}
$$

$$
W = \begin{bmatrix} w_0^{(1)} & w_0^{(2)} & \cdots & w_0^{(m)} \\ w_1^{(1)} & w_1^{(2)} & \cdots & w_1^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ w_d^{(1)} & w_d^{(2)} & \cdots & w_d^{(m)} \end{bmatrix}_{(d+1) \times m}
$$

31
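A minimal NumPy sketch, on hypothetical data, of the multi-output fit: the same normal equation, now with a matrix right-hand side.

```python
# Hypothetical sketch: fitting m outputs at once; each column of W is one linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # n x (d+1), augmented
W_true = rng.normal(size=(d + 1, m))                       # (d+1) x m
Y = X @ W_true + 0.1 * rng.normal(size=(n, m))             # n x m targets

W_star = np.linalg.solve(X.T @ X, X.T @ Y)                 # normal equation, matrix RHS
print(np.max(np.abs(W_star - W_true)))                     # small estimation error
```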
