
Practical Machine Learning


Lecture 3

Daniel Andrade
Check Updated Schedule of Presentations
• Check whether your name is listed in the schedule on the following pages.
• Check the day of your presentation and start preparing.
Schedule of Presentations (1/2)
• Lecture 4 (December 12th):
• Section 6.2 in “An Introduction to Statistical Learning”
Students in charge: m242663, m242232
Either as a team or separate content at Subsection “Comparing the Lasso and Ridge Regression”, page 245 (book page, not pdf page number)

• Lecture 5 (December 17th):


• "Statistical challenges of high-dimensional data”, pages 1-8 + Section 6.4 in “An Introduction to Statistical Learning”,
Student in charge: m242520
• Section 10.1 ~ 10.4 of “An Introduction to Statistical Learning”,
Student in charge: m244718

• Lecture 6 (December 19th):


• https://d2l.ai/chapter_optimization/index.html Section 12.1 ~ 12.6
Students in charge: m241613, m243893
Either as a team or separate content 12.1 ~ 12.3 and 12.4 ~ 12.6
• https://d2l.ai/chapter_optimization/index.html Section 12.7 ~ 12.11
Students in charge: m243893, m245772 (team)

• Lecture 7 (December 24th)


• “Probabilistic Machine Learning - Advanced Topics”, Section 3.1, 3.2
Student in charge: m241098
• "Probabilistic Machine Learning - Advanced Topics”, Section 5.1 KL-Divergence
Students in charge: m245209, m241465
Either as a team or separate content at Subsection 5.1.4
Schedule of Presentations (2/2)
• Lecture 8 (January 7)
"New Insights and Perspectives on the Natural Gradient Method", till page 23 + “Second-order optimization for neural networks.”, Chapter 2+3
Student in charge: m244200, m241594
Either as a team or separate content: "New Insights and Perspectives on the Natural Gradient Method” and “Second-order optimization for neural networks.”

• Lecture 9 (January 9)
https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html Section 11
Students in charge: m245073, m232259, m242619, m235482
Either as a team or separate content: 11.1, 11.2 and 11.3, 11.4 and 11.5, 11.6 and 11.7~11.9

• Lecture 10 (January 14)


"Bayesian Statistics and Modeling” + “Holes in Bayesian statistics”(some parts)
Students in charge: m244645, m243658, m241734
Either as a team or separate content: "Bayesian Statistics and Modeling” pages 1~8, pages 9~23, “Holes in Bayesian statistics”(some parts)

• Lecture 11 (January 16)


“Probabilistic Machine Learning - Advanced Topics”, Section 2.6.4 (Stationary distribution of a Markov chain) + Section 12.2, 12.3
Students in charge: m243000, m246692 (team)

• Lecture 12 (January 21)


“Probabilistic Machine Learning - Advanced Topics”, Markov chain Monte Carlo, Section 12.5 ~ 12.7
Students in charge: m240492, m244274
Either as a team or separate content: 12.5 and 12.6,12.7

• Lecture 13 (January 23)


“MCMC using Hamiltonian dynamics” pages 1~21
Students in charge: m245117, m246981
Either as a team or separate content: pages 1~12 and 13~21
Today's Topics
1. Loss Functions and Maximum Likelihood Methods for Estimating the Parameters θ of a Model fθ
2. Parameter Estimation with PyTorch
3. Hands-on Session: Try out what you learnt
Loss Functions and
Maximum Likelihood
Estimation
Recall from Lecture 1
Goal of Prediction
• For any covariate vector x ∈ ℝ^p, we want to minimize the error of wrong prediction (in expectation), i.e.:

Find f that minimizes  𝔼[ℓ(f(X), Y)] ,

where the expectation is with respect to p(y, x), i.e. the joint density of Y and X, and ℓ(ŷ, y) is some loss function, where ŷ is the prediction of our model, and y is the true value.
Recall from Lecture 1

Commonly used loss functions for evaluation

• A commonly used loss function for regression is the mean squared error (MSE) loss function ℓ(ŷ, y) = (ŷ − y)², for which we get

𝔼[(f(X) − Y)²] ,  where f : ℝ^p → ℝ.

• A commonly used loss function for classification is the 0-1 loss ℓ(ŷ, y) = I(ŷ ≠ y), for which we get

𝔼[I(argmax_{j ∈ {1,2,…,k}} fj(X) ≠ Y)] ,  where f : ℝ^p → ℝ^k and k is the number of classes

(where the fj are the logits that are passed to the softmax; recall p1(X), p2(X), …, pk(X) = softmax(f1(X), f2(X), …, fk(X))).
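As a quick numerical illustration of these two loss functions (an added example; the numbers are made up and not from the lecture code, and note that class indices are 0-based in PyTorch):

import torch

# MSE loss for a single regression prediction
y_hat, y = torch.tensor(2.5), torch.tensor(3.0)
mse = (y_hat - y) ** 2                                  # (ŷ − y)² = 0.25

# 0-1 loss for a single classification prediction from k = 3 logits
logits = torch.tensor([0.2, 1.5, -0.3])                 # f1(x), f2(x), f3(x)
y_class = torch.tensor(1)                               # true class (0-based index)
zero_one = (torch.argmax(logits) != y_class).float()    # I(argmax_j fj(x) ≠ y) = 0 here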
Parameter Estimation
• The model fθ has parameters θ ∈ ℝ^d that need to be set.
• Since d is large, we cannot specify them all manually and therefore try to estimate them from our data (*).
• Parameter estimation is also called training in machine learning.
• By far the most successful training method is gradient descent:

θ^(t+1) := θ^(t) − η ⋅ ∂/∂θ 𝔼[ℓ(f_{θ^(t)}(X), Y)] ,

where θ^(t) is the parameter θ at step t, and η is called the learning rate.
θ^(0) is set to some random value.
If η is small enough, then (in most situations) θ^(t) converges to a stationary point (hopefully to a good local minimum; if the objective function is convex, then to the global minimum).

(*) A few remaining parameters nevertheless need to be set manually; these parameters are often called hyper-parameters.
Expectations are estimated using data D
• In general, we do not know p(y, x).

• But assuming D = {(y1, x1), (y2, x2), …, (yn, xn)} are iid samples from p(y, x),
we have the following unbiased estimates:

𝔼[ℓ(fθ(X), Y)] ≈ (1/n) ∑_{i=1}^n ℓ(fθ(xi), yi) ,   and

∂/∂θ 𝔼[ℓ(fθ(X), Y)] ≈ (1/n) ∑_{i=1}^n ∂/∂θ ℓ(fθ(xi), yi) .
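To make these empirical estimates concrete, here is a minimal PyTorch sketch (added for illustration; the synthetic data, the linear model, and the learning rate are made up and are not part of the lecture code). It computes the empirical average loss over a data set D and performs one gradient descent step on θ:

import torch

# Hypothetical synthetic data set D = {(y1, x1), …, (yn, xn)} with p = 3 covariates
n, p = 100, 3
X = torch.randn(n, p)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(n)

# A simple linear model f_theta(x) = theta^T x with parameters theta, and learning rate eta
theta = torch.zeros(p, requires_grad=True)
eta = 0.1

# Empirical estimate of E[ loss(f_theta(X), Y) ]: average squared error over D
loss = torch.mean((X @ theta - y) ** 2)

# Empirical estimate of the gradient d/dtheta E[ loss(f_theta(X), Y) ]
loss.backward()

# One gradient descent step: theta^(t+1) = theta^(t) − eta * gradient
with torch.no_grad():
    theta -= eta * theta.grad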
Commonly used loss functions for training
• For regression, the MSE loss function ℓ(ŷ, y) = (ŷ − y)² can also be used for training.

• For classification, the 0-1 loss cannot be used, since its gradient with respect to θ is 0 almost everywhere. Instead, a popular surrogate loss is the cross-entropy (CE) loss (*):

ℓ(p, y) = − log p_y ,

where y ∈ {1, 2, …, k} is the true label and the vector p = (p1, p2, …, pk) contains in position j the predicted probability of class j.

(*) Strictly speaking, the CE loss, as defined e.g. in PyTorch, uses the logits as input.
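To make the footnote concrete (an added example, not from the original slides): torch.nn.CrossEntropyLoss takes the logits as input and applies the (log-)softmax internally, so it computes exactly −log p_y with p = softmax(logits). The logits and label below are made up:

import torch

# Hypothetical logits for one example with k = 3 classes, and true label y = 2 (0-based index)
logits = torch.tensor([[1.0, 0.5, 2.0]])
y = torch.tensor([2])

# CE loss as implemented in PyTorch: takes the logits as input
ce = torch.nn.CrossEntropyLoss()(logits, y)

# Equivalent manual computation: −log p_y with p = softmax(logits)
p = torch.softmax(logits, dim=1)
manual = -torch.log(p[0, y])

print(ce.item(), manual.item())   # the two values agree (up to floating point error)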
Gradient of 0-1 Loss with respect to θ is 0 almost everywhere

Simple Example Illustration (Assuming θ ∈ ℝ)

[Figure: plot of the 0-1 loss I(ŷθ ≠ y) as a function of θ; the loss is piecewise constant in θ, so its gradient is 0 wherever it is defined.]

Comment: Here I write ŷθ instead of ŷ to emphasize the dependence on θ.


Equivalence to Maximum Likelihood Estimation - Regression
• Using the likelihood function N(y | f(x), σ²), for any fixed σ², we have the following equivalence:

argmin_θ (1/n) ∑_{i=1}^n (fθ(xi) − yi)²  =  argmax_θ ∏_{i=1}^n N(yi | fθ(xi), σ²)

(left-hand side: minimizing the MSE loss; right-hand side: maximizing the likelihood of the data)

To verify this, transform the right-hand side by −log(⋅).
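Carrying out that verification here (an added step, not on the original slide):

−log ∏_{i=1}^n N(yi | fθ(xi), σ²) = ∑_{i=1}^n [ (1/(2σ²)) (fθ(xi) − yi)² + (1/2) log(2πσ²) ]
                                  = (1/(2σ²)) ∑_{i=1}^n (fθ(xi) − yi)² + const .

Since −log(⋅) is strictly decreasing, and positive scaling and additive constants do not change the minimizer, minimizing the MSE over θ is the same as maximizing the likelihood (for any fixed σ²).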

Equivalence to Maximum Likelihood Estimation - Classification

• Using the likelihood function Cat(y | p1(x), p2(x), …, pk(x)), we have the following equivalence:

argmin_θ (1/n) ∑_{i=1}^n −log pθ,yi(xi)  =  argmax_θ ∏_{i=1}^n pθ,yi(xi)

(left-hand side: minimizing the CE loss; right-hand side: maximizing the likelihood of the data)

where pθ,1(x), pθ,2(x), …, pθ,k(x) = softmax(fθ,1(x), fθ,2(x), …, fθ,k(x)), and the fθ,j(x) are the logits.

Recall that softmax(z1, z2, …, zk)_j := e^{zj} / ∑_{l=1}^k e^{zl} .
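The verification is analogous to the regression case (added note, not on the original slide): −log ∏_{i=1}^n pθ,yi(xi) = ∑_{i=1}^n −log pθ,yi(xi), which, up to the factor 1/n, is exactly the CE objective on the left.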
Parameter Estimation
with PyTorch
Parameter Estimation with PyTorch
• Vanilla Gradient Descent is available as torch.optim.SGD
• Don’t forget to call optimizer.zero_grad()

A minimal example (the standard training pattern in PyTorch):

loss_fn = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(simpleModel.parameters(), lr=LEARNING_RATE)

for t in range(EPOCHS):
    pred = simpleModel(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()    # clear "grad"
    loss.backward()          # calculate gradients and save in "grad"
    optimizer.step()         # one gradient descent step
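For reference, here is a self-contained version of this pattern that runs as-is (an added sketch; the model, data, and hyper-parameter values are placeholders for illustration, and the simpleModel used in the lecture code may differ):

import torch

LEARNING_RATE = 0.1
EPOCHS = 50

# Placeholder classification data: n = 100 examples, p = 4 features, k = 3 classes
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))

# A stand-in for simpleModel: a single linear layer producing the logits
simpleModel = torch.nn.Linear(4, 3)

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(simpleModel.parameters(), lr=LEARNING_RATE)

for t in range(EPOCHS):
    pred = simpleModel(X)      # forward pass: logits of shape (100, 3)
    loss = loss_fn(pred, y)    # CE loss between logits and true labels

    optimizer.zero_grad()      # clear "grad"
    loss.backward()            # calculate gradients and save in "grad"
    optimizer.step()           # one gradient descent step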
What is optimizer doing?
• In the case of a constant learning rate (torch.optim.SGD), it is easy to implement gradient descent manually.
• The code below is for educational purposes. In practice, you should always use an optimizer, since this leads to code that is (1) cleaner, (2) possibly faster, (3) easier to replace.

# clear "grad" (corresponds to optimizer.zero_grad())
for param in simpleModel.parameters():
    if param.grad is not None:
        param.grad = torch.zeros_like(param.grad)

loss.backward()    # calculate gradients and save in "grad"

# one gradient descent step (corresponds to optimizer.step())
with torch.no_grad():
    for param in simpleModel.parameters():
        param -= LEARNING_RATE * param.grad
First Assignment
First Assignment
• lecture3_linear_regression.py calculates an approximation to the least squares solution. Denote this approximation by β̃ ∈ ℝ^p (linear1.weight) and τ̃ ∈ ℝ (linear1.bias). In this assignment you are asked to compare the analytic solution β ∈ ℝ^p and τ ∈ ℝ to the approximation β̃, τ̃.

• Analytic solution (e.g. check out your "linear model" lecture notes):

(τ, β)^T = (Xb^T Xb)^{−1} Xb^T y ,  where Xb is the n×(p+1) matrix whose i-th row is (1, xi^T), i.e. Xb includes the all-ones column for the bias.

• Hand in the source code that calculates
  • the analytic solution (τ, β) using PyTorch,
  • the Euclidean distance between (τ, β) and (τ̃, β̃).

Hand in your source code via Moodle by December 24th (Tuesday), 23:55.
Please hand in a Python file with the following name: assignment1_STUDENTID.py, where "STUDENTID" is your student ID.
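A possible starting point for the analytic part (a sketch only; X and y below are random placeholders and must be replaced by the same data used in lecture3_linear_regression.py, and the comparison to β̃, τ̃ is still up to you):

import torch

# Placeholder data; use the data from lecture3_linear_regression.py in the assignment
n, p = 50, 3
X = torch.randn(n, p)
y = torch.randn(n)

# Xb ∈ ℝ^{n×(p+1)}: all-ones column for the bias, followed by the covariates
Xb = torch.cat([torch.ones(n, 1), X], dim=1)

# Analytic least squares solution (τ, β)^T = (Xb^T Xb)^{−1} Xb^T y
tau_beta = torch.linalg.solve(Xb.T @ Xb, Xb.T @ y)
tau, beta = tau_beta[0], tau_beta[1:]

Using torch.linalg.solve on the normal equations avoids forming the explicit inverse (Xb^T Xb)^{−1}, which is numerically more stable.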
Hands-on Part
Linear Regression - Exercise
• Execute the code in lecture3_linear_regression.py.
• What is happening in each line?
• Change EPOCHS to 100, and to 1000. What do you observe?
• Try EPOCHS = 100000. What do you observe?
• Finally, set EPOCHS = 10, and LEARNING_RATE = 0.01. What do you observe?
Logistic Regression - Exercise
• Execute the code in lecture3_logistic_regression.py.
• Compare “class LogisticRegressionModel” to
“class LinearRegressionModel” (from lecture3_linear_regression.py)
• Write down the formula that is used to calculate loss_fn(pred, y) in this
example (that is for the logistic regression with CE loss).
Computational Graph for Automatic Differentiation - Exercise
• Execute gradient_descent_from_scratch_exercise.py and think about what happens if "with torch.no_grad():" is not used.
Homework
Homework
• Read Chapter 4 of “An Introduction to Statistical Learning” (Second
Edition)
