practicalMachineLearning_lecture3
Daniel Andrade
Check Updated Schedule of Presentations
• Check whether your name is listed in the schedule of the following pages.
• Check the day of your presentation and start preparing.
Schedule of Presentations (1/2)
• Lecture 4 (December 12th):
• Section 6.2 in “An Introduction to Statistical Learning”
Students in charge: m242663, m242232
Either as a team, or split the content at Subsection “Comparing the Lasso and Ridge Regression”, page 245 (book page, not PDF page number)
• Lecture 9 (January 9th):
https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html Section 11
Students in charge: m245073, m232259, m242619, m235482
Either as a team or separate content: 11.1, 11.2 and 11.3, 11.4 and 11.5, 11.6 and 11.7~11.9
Recall from Lecture 1
• Goal: find the parameter θ that minimizes the expected loss

𝔼[ℓ(f_θ(X), Y)] ,

where the expectation is with respect to p(y, x), i.e. the joint density of Y and X, and ℓ(ŷ, y) is some loss function, where ŷ is the prediction of our model and y is the true value.
• Gradient descent update:

θ^(t+1) := θ^(t) − η ( ∂/∂θ 𝔼[ℓ(f_θ^(t)(X), Y)] ) ,

where θ^(t) is the parameter θ at step t, and η is called the learning rate.
θ^(0) is set to some random value.
If η is small enough, then (in most situations) θ^(t) converges to a stationary point. (Hopefully to a good local minimum. If the objective function is convex, then to a global minimum.)
(*) A few remaining parameters nevertheless need to be set manually; these parameters are often called hyper-parameters.
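To make the update rule concrete, here is a minimal sketch of vanilla gradient descent on a toy one-dimensional objective. The objective g(θ) = (θ − 3)², the starting value, the learning rate, and the number of steps are all arbitrary choices for this illustration:

import torch

# minimize g(theta) = (theta - 3)^2, whose unique (global) minimum is at theta = 3
theta = torch.tensor(0.0, requires_grad=True)   # theta^(0): some starting value
eta = 0.1                                       # learning rate (a hyper-parameter)

for t in range(100):
    loss = (theta - 3.0) ** 2      # objective evaluated at theta^(t)
    loss.backward()                # compute d loss / d theta
    with torch.no_grad():
        theta -= eta * theta.grad  # theta^(t+1) := theta^(t) - eta * gradient
    theta.grad.zero_()             # reset the gradient for the next step

print(theta.item())                # close to 3.0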
Expectations are estimated using data D
• In general, we do not know p(y, x).
• But assuming D = {(y1, x1), (y2, x2), …, (yn, xn)} are iid samples from p(y, x),
we have the following unbiased estimates:
𝔼[ℓ(f_θ(X), Y)] ≈ (1/n) ∑_{i=1}^{n} ℓ(f_θ(x_i), y_i) , and

∂/∂θ 𝔼[ℓ(f_θ(X), Y)] ≈ (1/n) ∑_{i=1}^{n} ∂/∂θ ℓ(f_θ(x_i), y_i) .
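As a small illustration of these estimates, the sketch below creates a hypothetical data set (the one-parameter model f_θ(x) = θ·x and all numbers are made up for the example) and lets autograd produce the empirical gradient estimate:

import torch

torch.manual_seed(0)

# hypothetical iid data set D = {(y_i, x_i)}, i = 1..n
n = 100
x = torch.randn(n)
y = 2.0 * x + 0.1 * torch.randn(n)              # noisy targets

theta = torch.tensor(0.5, requires_grad=True)   # toy model: f_theta(x) = theta * x

# empirical estimate of E[l(f_theta(X), Y)] with the squared loss
loss = ((theta * x - y) ** 2).mean()            # (1/n) sum_i l(f_theta(x_i), y_i)

# empirical estimate of d/dtheta E[l(f_theta(X), Y)]
loss.backward()
print(loss.item(), theta.grad.item())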
Commonly used loss functions for training
• For regression, the mean squared error (MSE) loss ℓ(ŷ, y) = (ŷ − y)^2 can also be used for training.
• For classification, the 0-1 loss cannot be used, since the gradient with respect to θ is 0 almost everywhere. Instead, a popular surrogate loss is the cross-entropy (CE) loss (*):

ℓ(p, y) = − log p_y ,

where y ∈ {1, 2, …, k} is the true label and the vector p = (p_1, p_2, …, p_k) contains in position j the predicted probability of class j.
(*) Strictly speaking, the CE loss, as defined e.g. in PyTorch, uses the logits as input.
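A small sketch of this definition, using made-up logits for a single example with k = 3 classes, and a comparison against PyTorch's logit-based CE loss:

import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 0.5, 2.0]])   # hypothetical logits, shape (1, k) with k = 3
y = torch.tensor([2])                 # true label (0-based here, as in PyTorch)

p = torch.softmax(z, dim=1)           # predicted class probabilities p = (p_1, ..., p_k)
ce_by_hand = -torch.log(p[0, y[0]])   # l(p, y) = -log p_y

ce_pytorch = F.cross_entropy(z, y)    # PyTorch's CE loss takes the logits z, not p
print(ce_by_hand.item(), ce_pytorch.item())   # the two values agree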
Gradient of 0-1 Loss with respect to θ is 0 almost everywhere
[Figure: the 0-1 loss I(ŷ_θ ≠ y) plotted as a function of θ; the curve is piecewise constant, so its gradient with respect to θ is 0 almost everywhere.]
Equivalence to Maximum Likelihood Estimation - Regression

argmin_θ (1/n) ∑_{i=1}^{n} (f_θ(x_i) − y_i)^2 = argmax_θ ∏_{i=1}^{n} N(y_i | f_θ(x_i), σ^2)

i.e. minimizing the MSE loss is equivalent to maximizing the likelihood of the data under a Gaussian model N(y | f_θ(x), σ^2) with fixed variance σ^2.
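One way to see this equivalence numerically: the Gaussian log-likelihood is a decreasing affine function of the MSE, so both sides have the same optimizer in θ. The sketch below checks this identity on made-up predictions and targets (all values, including σ² = 1, are arbitrary):

import math
import torch

torch.manual_seed(0)

pred = torch.randn(5)       # hypothetical predictions f_theta(x_i)
y = torch.randn(5)          # hypothetical targets y_i
sigma2 = 1.0                # fixed noise variance sigma^2
n = y.numel()

mse = ((pred - y) ** 2).mean()

# sum_i log N(y_i | f_theta(x_i), sigma^2)
log_lik = torch.distributions.Normal(pred, sigma2 ** 0.5).log_prob(y).sum()

# identity: log-likelihood = -(n/2) log(2*pi*sigma^2) - n/(2*sigma^2) * MSE
print(log_lik.item())
print((-(n / 2) * math.log(2 * math.pi * sigma2) - n / (2 * sigma2) * mse).item())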
Equivalence to Maximum Likelihood Estimation - Classification
• Using the likelihood function Cat(y | p_1(x), p_2(x), …, p_k(x)), we have the following equivalence

argmin_θ (1/n) ∑_{i=1}^{n} − log p_{θ,y_i}(x_i) = argmax_θ ∏_{i=1}^{n} p_{θ,y_i}(x_i)

(left-hand side: minimizing the CE loss; right-hand side: maximizing the likelihood of the data)

Recall that

softmax(z_1, z_2, …, z_k)_j := e^{z_j} / ∑_{l=1}^{k} e^{z_l}
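A short numerical sketch of this relation, with made-up logits and labels: the mean CE loss equals −(1/n) times the log of the product of the predicted probabilities of the true labels, so minimizing one maximizes the other.

import torch

torch.manual_seed(0)

z = torch.randn(4, 3)               # hypothetical logits for n = 4 examples, k = 3 classes
y = torch.tensor([0, 2, 1, 2])      # hypothetical true labels (0-based, as in PyTorch)

# softmax(z_1, ..., z_k)_j = e^{z_j} / sum_l e^{z_l}, applied row-wise
p = torch.exp(z) / torch.exp(z).sum(dim=1, keepdim=True)

p_true = p[torch.arange(4), y]          # p_{theta, y_i}(x_i) for each example
mean_ce = (-torch.log(p_true)).mean()   # (1/n) sum_i -log p_{theta, y_i}(x_i)
likelihood = p_true.prod()              # prod_i p_{theta, y_i}(x_i)

# monotone relation: mean CE = -(1/n) log(likelihood)
print(mean_ce.item(), (-likelihood.log() / 4).item())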
Parameter Estimation with PyTorch
• Vanilla Gradient Descent is available as torch.optim.SGD
• Don’t forget to call optimizer.zero_grad()
A minimal example:
import torch

loss_fn = torch.nn.CrossEntropyLoss()
lr = 0.1  # learning rate eta (value chosen arbitrarily here)
for t in range(EPOCHS):
    pred = simpleModel(X)    # forward pass: logits
    loss = loss_fn(pred, y)  # CE loss, averaged over the data
    # clear "grad" (corresponds to optimizer.zero_grad())
    for param in simpleModel.parameters():
        if param.grad is not None:
            param.grad = torch.zeros_like(param.grad)
    loss.backward()          # compute the gradients
    with torch.no_grad():    # vanilla gradient descent step
        for param in simpleModel.parameters():
            param -= lr * param.grad
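For comparison, a sketch of the same loop written with torch.optim.SGD and optimizer.zero_grad(), as mentioned above; it reuses the names simpleModel, X, y, EPOCHS from the example, which are assumed to be defined elsewhere, and the learning rate value is arbitrary:

optimizer = torch.optim.SGD(simpleModel.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for t in range(EPOCHS):
    pred = simpleModel(X)     # forward pass: logits
    loss = loss_fn(pred, y)
    optimizer.zero_grad()     # clear the accumulated gradients
    loss.backward()           # compute the gradients
    optimizer.step()          # theta^(t+1) := theta^(t) - eta * gradient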