Lecture 3: Logistic Regression & Regularization
DSA5103 Lecture 3
Yangjing Zhang
29-Aug-2023
NUS
Today’s content
Logistic regression
Classification
Binary classification:
We usually assign
  label = 0: normal state / negative class (e.g., not spam)
  label = 1: abnormal state / positive class (e.g., spam)
Classification
Multi-class classification:
• Iris flower (3 species: Setosa, Versicolor, Virginica)
• Optical character recognition
Data xi ∈ Rᵖ, yi ∈ {1, . . . , K}, i = 1, 2, . . . , n.
Linear regression for classification?
Data xi, yi ∈ {0, 1}
We fit f(x) = β0 + βᵀx and predict that a new input x̃ belongs to
class 1 if f(x̃) ≥ 0.5, or to class 0 if f(x̃) < 0.5.
Linear regression vs. logistic regression
Linear regression
• Data xi, yi ∈ R
• Fit f(x) = βᵀx + β0 = β̂ᵀx̂, where β̂ = [β0; β], x̂ = [1; x]
Logistic regression
• Data xi, yi ∈ {0, 1}
• Fit f(x) = g(β̂ᵀx̂), where g(z) = 1/(1 + e^{−z}) is the sigmoid/logistic function
  . 0 < g(z) < 1, and g is an increasing function
  . g(0) = 0.5
  . g(z) → 1 as z → +∞
  . g(z) → 0 as z → −∞
[Figure: the sigmoid function g(z) on z ∈ [−10, 10]]
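A minimal NumPy sketch of the sigmoid g(z) and the fitted model f(x) = g(β̂ᵀx̂); the coefficient and input values below are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# properties listed above
assert sigmoid(0.0) == 0.5
assert sigmoid(10.0) > 0.999 and sigmoid(-10.0) < 0.001

# f(x) = g(beta_hat^T x_hat) with x_hat = [1; x] and beta_hat = [beta0; beta]
beta_hat = np.array([-1.0, 2.0, 0.5])   # illustrative [beta0, beta1, beta2]
x_hat = np.array([1.0, 0.3, -0.2])      # x_hat = [1, x1, x2]
print(sigmoid(beta_hat @ x_hat))        # predicted probability of class 1, about 0.38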
Graph illustration
Probabilistic interpretation
f(x) = g(β̂ᵀx̂) is interpreted as the probability that x belongs to class 1, i.e., f(x) = p(y = 1 | x; β̂).
Predict y = 1 (x ∈ class 1) if f(x) ≥ 0.5, i.e., β̂ᵀx̂ ≥ 0
Predict y = 0 (x ∈ class 0) if f(x) < 0.5, i.e., β̂ᵀx̂ < 0
Example
Decision boundary
β0 + βᵀx = 0
• a point when p = 1
• a line when p = 2
• a plane when p = 3
• in general, a (p − 1)-dimensional hyperplane (affine subspace)
Feature expansion
−4 + x₁² + x₂² = 0
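For instance, the circular boundary −4 + x₁² + x₂² = 0 comes from a linear decision rule applied to expanded features. A minimal sketch in which the expansion map and the coefficients are chosen to reproduce exactly that boundary (illustrative, not necessarily the slides' parametrization).

import numpy as np

def expand(x):
    """Map (x1, x2) to the expanded features [x1, x2, x1^2, x2^2]."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2])

# coefficients realizing -4 + x1^2 + x2^2 = 0, a circle of radius 2
beta0, beta = -4.0, np.array([0.0, 0.0, 1.0, 1.0])

def predict(x):
    # class 1 if beta0 + beta^T expand(x) >= 0 (outside the circle), else class 0
    return int(beta0 + beta @ expand(x) >= 0)

print(predict((0.0, 0.0)))  # 0: inside the circle
print(predict((3.0, 0.0)))  # 1: outside the circle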
Maximum likelihood estimation
probability(xi ∈ class yi)
  = f(xi), if yi = 1 (i.e., p(yi = 1 | xi; β̂) = f(xi))
  = 1 − f(xi), if yi = 0 (i.e., p(yi = 0 | xi; β̂) = 1 − f(xi))
  = [f(xi)]^{yi} [1 − f(xi)]^{1−yi}
Cost function
Cost function
L(β̂) = − log( ∏_{i=1}^n [f(xi)]^{yi} [1 − f(xi)]^{1−yi} )
      = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
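A direct NumPy transcription of this cost; the helper names and the toy data are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta_hat, X_hat, y):
    """Negative log-likelihood  -sum_i [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ]."""
    f = sigmoid(X_hat @ beta_hat)   # f(x_i) for every row x_hat_i of X_hat
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

# toy data: rows are x_hat_i = [1, x_i1, x_i2]
X_hat = np.array([[1.0, 0.5, 1.2],
                  [1.0, -0.3, 0.8],
                  [1.0, 1.5, -0.4]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(3), X_hat, y))   # 3*log(2), about 2.079, when beta_hat = 0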
Understand the cost function
[Figure: the per-example cost as a function of f(x) ∈ [0, 1]]
Simplify the cost function
L(β̂) = L(β0, β) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]

Derivation*. Recall that f(x) = 1/(1 + e^{−β̂ᵀx̂}) and

L(β̂) = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
      = − Σ_{i=1}^n [ yi log( f(xi)/(1 − f(xi)) ) + log(1 − f(xi)) ]
Simplify the cost function
1. log( f(xi)/(1 − f(xi)) ) = log( (1/(1 + e^{−β̂ᵀx̂i})) / (1 − 1/(1 + e^{−β̂ᵀx̂i})) )
   = log( 1/(1 + e^{−β̂ᵀx̂i} − 1) ) = log( e^{β̂ᵀx̂i} ) = β̂ᵀx̂i

2. log(1 − f(xi)) = log( 1 − 1/(1 + e^{−β̂ᵀx̂i}) ) = log( (1 + e^{−β̂ᵀx̂i} − 1)/(1 + e^{−β̂ᵀx̂i}) )
   = log( 1/(1 + e^{β̂ᵀx̂i}) ) = − log(1 + e^{β̂ᵀx̂i})

Substituting 1. and 2. into L(β̂) gives the simplified form above.
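The simplified form is also convenient numerically. A sketch, using np.logaddexp(0, z) for log(1 + e^z) to avoid overflow; the toy data are the same illustrative values as in the earlier cost sketch.

import numpy as np

def cost_simplified(beta_hat, X_hat, y):
    """L = sum_i [ log(1 + e^{z_i}) - y_i * z_i ]  with  z_i = beta_hat^T x_hat_i."""
    z = X_hat @ beta_hat
    # np.logaddexp(0, z) evaluates log(1 + e^z) in a numerically stable way
    return np.sum(np.logaddexp(0.0, z) - y * z)

X_hat = np.array([[1.0, 0.5, 1.2],
                  [1.0, -0.3, 0.8],
                  [1.0, 1.5, -0.4]])
y = np.array([1.0, 0.0, 1.0])
print(cost_simplified(np.zeros(3), X_hat, y))   # 3*log(2), same value as before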
Gradient of the cost function
• Cost function
  L(β0, β1, . . . , βp) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]
• Gradient of the least-squares cost (linear regression, for comparison, with f(xi) = βᵀxi + β0):
  ∂L/∂β0 = Σ_{i=1}^n (βᵀxi + β0 − yi) = Σ_{i=1}^n (f(xi) − yi)
  ∂L/∂βj = Σ_{i=1}^n (βᵀxi + β0 − yi) xij = Σ_{i=1}^n (f(xi) − yi) xij,   for j = 1, 2, . . . , p
• Gradient of the logistic cost above (with f(xi) = 1/(1 + e^{−(β0 + βᵀxi)})):
  ∂L/∂β0 = Σ_{i=1}^n ( 1/(1 + e^{−(β0 + βᵀxi)}) − yi ) = Σ_{i=1}^n (f(xi) − yi)
  ∂L/∂βj = Σ_{i=1}^n ( 1/(1 + e^{−(β0 + βᵀxi)}) − yi ) xij = Σ_{i=1}^n (f(xi) − yi) xij,   for j = 1, 2, . . . , p
Both gradients take the same form Σ_{i=1}^n (f(xi) − yi) xij, with the respective f.
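One way to use these gradients is plain gradient descent on β̂ = [β0; β]; a minimal sketch with an illustrative step size, iteration count, and toy data (none of which come from the slides).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X_hat, y, step=0.1, n_iters=5000):
    """Gradient descent on L(beta_hat); X_hat has a leading column of ones."""
    beta_hat = np.zeros(X_hat.shape[1])
    for _ in range(n_iters):
        # gradient: sum_i (f(x_i) - y_i) * x_hat_i, stacked over components
        grad = X_hat.T @ (sigmoid(X_hat @ beta_hat) - y)
        beta_hat -= step * grad
    return beta_hat

# toy one-dimensional data with overlapping classes
x = np.array([0.5, 1.0, 2.5, 1.5, 3.0, 3.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X_hat = np.column_stack([np.ones_like(x), x])
beta_hat = fit_logistic(X_hat, y)
print(sigmoid(X_hat @ beta_hat).round(2))   # fitted probabilities f(x_i)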
Solution may not exist
When the two classes can be perfectly separated, the minimizer of the cost does not exist: L(β̂) keeps decreasing as the coefficients grow without bound.
[Figure: an illustrative one-dimensional data set]
Multi-class classification: one-vs-rest
Fit one binary classifier fk per class (class k vs. the rest) and predict the class attaining
max_{k ∈ {1, 2, . . . , K}} fk(x)
Multi-class classification: one-vs-rest
Say for a new input x̃, we have f1(x̃) = 0.8, f2(x̃) = 0.1, f3(x̃) = 0.6.
Then we say
x̃ belongs to class 1 with probability 80%
x̃ belongs to class 2 with probability 10%
x̃ belongs to class 3 with probability 60%
and we predict it belongs to class 1.
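A tiny sketch of this decision rule, reusing the scores from the example above.

# f_k(x_tilde) from the K = 3 one-vs-rest classifiers in the example
scores = {1: 0.8, 2: 0.1, 3: 0.6}
predicted_class = max(scores, key=scores.get)   # arg max_k f_k(x_tilde)
print(predicted_class)                          # 1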
Ridge/lasso regularization
Over-fitting
• Ridge regularization:
  λ‖β‖² = λ Σ_{j=1}^p βj²
Ridge regularized problems
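As a sketch of one such problem: the logistic cost from earlier with the ridge penalty λ‖β‖² added. Leaving the intercept β0 unpenalized is a common convention and an assumption of this sketch, not something stated on the slides.

import numpy as np

def ridge_logistic_cost_grad(beta_hat, X_hat, y, lam):
    """Ridge-penalized logistic cost and its gradient; beta_hat[0] is the (unpenalized) intercept."""
    z = X_hat @ beta_hat
    f = 1.0 / (1.0 + np.exp(-z))                   # f(x_i)
    pen = np.concatenate(([0.0], beta_hat[1:]))    # exclude beta0 from the penalty
    cost = np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.sum(pen ** 2)
    grad = X_hat.T @ (f - y) + 2.0 * lam * pen
    return cost, grad

X_hat = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]])
y = np.array([0.0, 1.0, 1.0])
print(ridge_logistic_cost_grad(np.zeros(2), X_hat, y, lam=1.0))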
Normal equation for ridge regularized linear regression
X = [x1ᵀ; x2ᵀ; . . . ; xnᵀ],   Y = [y1; y2; . . . ; yn]
For simplicity, we assume β0 = 0.
minimize_{β ∈ Rᵖ}  (1/2) Σ_{i=1}^n (βᵀxi − yi)² + λ Σ_{j=1}^p βj²  =  (1/2) ‖Xβ − Y‖² + λ‖β‖²
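Setting the gradient of (1/2)‖Xβ − Y‖² + λ‖β‖² to zero gives Xᵀ(Xβ − Y) + 2λβ = 0, i.e. β = (XᵀX + 2λI)⁻¹XᵀY under the scaling used above. A minimal sketch; the toy X and Y are illustrative.

import numpy as np

def ridge_solution(X, Y, lam):
    """Solve (X^T X + 2*lam*I) beta = X^T Y, the normal equation of
    (1/2)*||X beta - Y||^2 + lam*||beta||^2 with beta0 = 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(p), X.T @ Y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])
Y = np.array([1.0, 2.0, 2.5, 4.0])
print(ridge_solution(X, Y, lam=0.0))   # ordinary least squares when lambda = 0
print(ridge_solution(X, Y, lam=1.0))   # coefficients shrink toward zero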
Lasso regularized problems
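The lasso (Tibshirani, 1996) replaces the ridge penalty with λ‖β‖₁ = λ Σ_{j=1}^p |βj|, which can set coefficients exactly to zero. A minimal proximal-gradient (ISTA) sketch for (1/2)‖Xβ − Y‖² + λ‖β‖₁; this particular solver and the toy data are illustrative choices, not taken from the slides.

import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iters=5000):
    """Proximal gradient descent on (1/2)*||X beta - Y||^2 + lam*||beta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - Y)          # gradient of the least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# toy example: Y depends essentially on the first feature only
X = np.array([[1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, 0.0]])
Y = np.array([1.0, 2.1, 2.9, 4.2])
print(lasso_ista(X, Y, lam=0.01))   # both coefficients nonzero (close to least squares)
print(lasso_ista(X, Y, lam=2.0))    # second coefficient set exactly to 0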
References
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.