
Logistic regression and regularization

DSA5103 Lecture 3

Yangjing Zhang
29-Aug-2023
NUS
Today’s content

1. Logistic regression for classification
2. Ridge/lasso regularization
Logistic regression
Classification

Binary classification:

• Email: spam/not spam


• Patient: cancer/healthy
• Student: fail/pass

We usually assign

  label = 0: normal state/negative class, e.g., not spam
  label = 1: abnormal state/positive class, e.g., spam

However, the label assignment can be arbitrary:

  0 = not spam, 1 = spam   or   0 = spam, 1 = not spam

Data xi ∈ Rp , yi ∈ {0, 1}, i = 1, 2, . . . , n.

Classification

Multi-class classification:
• Iris flower (3 species: Setosa, Versicolor, Virginica)
• Optical character recognition
Data xi ∈ Rp , yi ∈ {1, . . . , K}, i = 1, 2, . . . , n.

Figure 1: Convert images of text to machine-readable format. (Image from the internet)
Linear regression for classification?

Data xi, yi ∈ {0, 1}.
We fit f(x) = β0 + β^T x and predict that a new input x̃ belongs to
class 1 if f(x̃) ≥ 0.5, and to class 0 if f(x̃) < 0.5.
Linear regression vs. logistic regression

Linear regression
• Data xi, yi ∈ R
• Fit f(x) = β^T x + β0 = β̂^T x̂,  β̂ = [β0; β],  x̂ = [1; x]

Logistic regression
• Data xi, yi ∈ {0, 1}
• Fit f(x) = g(β̂^T x̂), where g(z) = 1 / (1 + e^{-z}) is the sigmoid/logistic function
  ◦ 0 < g(z) < 1, and g is an increasing function
  ◦ g(0) = 0.5
  ◦ g(z) → 1 as z → +∞
  ◦ g(z) → 0 as z → −∞

[Figure: plot of the sigmoid function g(z) for z ∈ [−10, 10]]
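As a quick side illustration (not from the slides), a minimal NumPy sketch of the sigmoid function and the properties listed above:

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                         # 0.5
print(sigmoid(10.0), sigmoid(-10.0))        # close to 1 and close to 0
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # values increase with z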
Graph illustration

β0, β1 change the shape and location of the function g(β0 + β1 x); the curve crosses the value 0.5 at x = −β0/β1.

[Figure: sigmoid curves g(β0 + β1 x) for different values of β0, β1]
Probabilistic interpretation

In logistic regression, we interpret

  f(x) = probability(input x ∈ class 1) = p(y = 1 | x; β̂)

Then 1 − f(x) = probability(input x ∈ class 0) = p(y = 0 | x; β̂).
Logistic regression

  f(x) = g(β̂^T x̂),   g(z) = 1 / (1 + e^{-z})

  f(x) = probability(input x ∈ class 1) = p(y = 1 | x; β̂)

[Figure: the sigmoid function g(z) for z ∈ [−10, 10]]

Predict y = 1 (x ∈ class 1) if
• f(x) ≥ 0.5, i.e., β̂^T x̂ ≥ 0

Predict y = 0 (x ∈ class 0) if
• f(x) < 0.5, i.e., β̂^T x̂ < 0
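A minimal sketch of this prediction rule (NumPy assumed; the coefficient values are illustrative and match the two-feature example on the following slides), showing that thresholding f(x) at 0.5 is the same as checking the sign of β̂^T x̂:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, beta0, beta):
    """Predict y = 1 if f(x) = g(beta0 + beta^T x) >= 0.5, else y = 0."""
    score = beta0 + X @ beta                      # beta_hat^T x_hat
    return (sigmoid(score) >= 0.5).astype(int)    # equivalently: score >= 0

# illustrative coefficients (two features)
beta0, beta = -4.0, np.array([2.0, 1.0])
X = np.array([[3.0, 0.0],    # 2*3 + 0 = 6 >= 4  -> class 1
              [1.0, 1.0]])   # 2*1 + 1 = 3 <  4  -> class 0
print(predict(X, beta0, beta))  # [1 0]
# same decision as checking beta_hat^T x_hat >= 0 directly
assert np.array_equal(predict(X, beta0, beta), ((beta0 + X @ beta) >= 0).astype(int))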
Example

(1) (One feature)


Say β̂ = [β0 ; β1 ] = [−4; 2]. Then f (x) = g(β0 + β1 x).
Predict y = 1 if
β0 + β1 x = −4 + 2x ≥ 0, i.e., x ≥ 2

(2) (Two features)


Say β̂ = [β0 ; β1 ; β2 ] = [−4; 2; 1]. Then f (x) = g(β0 + β1 x1 + β2 x2 ).
Predict y = 1 if
β0 + β1 x1 + β2 x2 = −4 + 2x1 + x2 ≥ 0, i.e., 2x1 + x2 ≥ 4
Example

(3) (Three features)


Say β̂ = [β0 ; β1 ; β2 ; β3 ] = [−4; 2; 1; 2]. Then
f (x) = g(β0 + β1 x1 + β2 x2 + β3 x3 ).
Predict y = 1 if
β0 +β1 x1 +β2 x2 +β3 x3 = −4+2x1 +x2 +2x3 ≥ 0, i.e., 2x1 +x2 +2x3 ≥ 4

Decision boundary

The set of all x ∈ Rp such that

β0 + β T x = 0

is called the decision boundary between classes 0 and 1.


Logistic regression has a linear decision boundary; it is

• a point when p = 1
• a line when p = 2
• a plane when p = 3
• in general a (p − 1)-dimensional hyperplane
Feature expansion

Say β̂ = [β0; β1; . . . ; β5] = [−4; 0; 0; 1; 1; 0]. The decision boundary is

  −4 + x1² + x2² = 0,

a circle of radius 2 (no longer linear in the original features x1, x2).
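A small sketch of this feature-expansion example (NumPy assumed). The ordering of the expanded features below is an assumption made for illustration, chosen so that β3 and β4 multiply x1² and x2²:

import numpy as np

def expand(x1, x2):
    # assumed expanded feature vector: (x1, x2, x1^2, x2^2, x1*x2)
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

beta_hat = np.array([-4.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # [beta0; beta1; ...; beta5]

def score(x1, x2):
    """beta0 + beta^T phi(x) = -4 + x1^2 + x2^2 for these coefficients."""
    return beta_hat[0] + beta_hat[1:] @ expand(x1, x2)

print(score(2.0, 0.0))   # 0.0  -> on the decision boundary (circle of radius 2)
print(score(1.0, 1.0))   # -2.0 -> predicted class 0
print(score(2.0, 2.0))   # 4.0  -> predicted class 1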
Maximum likelihood estimation

• Data (xi , yi ), i = 1, 2, . . . , n, xi ∈ Rp , yi ∈ {0, 1}.


• The likelihood of a single training example (xi , yi ) is

  probability(xi ∈ class yi)
    = p(yi = 1 | xi; β̂) = f(xi),       if yi = 1
    = p(yi = 0 | xi; β̂) = 1 − f(xi),   if yi = 0
    = [f(xi)]^{yi} [1 − f(xi)]^{1−yi}

• Hope the likelihood is close to 1 for every training example (xi, yi)
Maximum likelihood estimation

• Assuming independence of the training samples, the likelihood is

    ∏_{i=1}^{n} [f(xi)]^{yi} [1 − f(xi)]^{1−yi}

• We want to find β̂ that maximizes the log-likelihood. Note that maximizing a (positive) function is the same as maximizing the log of the function (because log is monotone increasing).
Maximum likelihood estimation

If f(x) > 0 for all x,

  max_x f(x)  ⇐⇒  max_x log(f(x))

since log is monotone increasing, i.e., log(x) ≥ log(y) if x ≥ y > 0.

x* is a global maximizer of f(·)
  ⇐⇒ f(x*) ≥ f(x) for all x
  ⇐⇒ log(f(x*)) ≥ log(f(x)) for all x
  ⇐⇒ x* is a global maximizer of log(f(·))

Similar arguments work for local maximizers.

Note that the likelihood is always positive, so we can take its log.
Cost function

Cost function (we use natural logarithms in logistic regression):

  L(β̂) = − log( ∏_{i=1}^{n} [f(xi)]^{yi} [1 − f(xi)]^{1−yi} )

       = − Σ_{i=1}^{n} [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]

For a particular training example (xi, yi), the cost is

  − yi log(f(xi)) − (1 − yi) log(1 − f(xi))
    = − log(f(xi)),       if yi = 1
    = − log(1 − f(xi)),   if yi = 0
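A minimal sketch of this cost function (NumPy assumed), where f_values stands for the model outputs f(xi):

import numpy as np

def cost(f_values, y):
    """L(beta_hat) = -sum_i [ y_i*log(f(x_i)) + (1-y_i)*log(1-f(x_i)) ]."""
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(y * np.log(f_values) + (1.0 - y) * np.log(1.0 - f_values))

# per-example costs discussed on the next slides
print(cost([0.9], [1]))   # ~0.11, "good" scenario
print(cost([0.1], [1]))   # ~2.3,  "bad" scenario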
Understand the cost function

cost = − log(f (xi )), when yi = 1

• f (xi ) = 1, cost = 0 (“perfect” scenario ⇒ zero cost)


• f (xi ) = 0.9, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.1, cost = 2.3 (“bad” scenario ⇒ large cost)

[Figure: − log(f(xi)) as a function of f(xi) ∈ (0, 1]]
Understand the cost function

cost = − log(1 − f (xi )), when yi = 0

• f (xi ) = 0, cost = 0 (“perfect” scenario ⇒ zero cost)


• f (xi ) = 0.1, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.9, cost = 2.3 (“bad” scenario ⇒ large cost)

[Figure: − log(1 − f(xi)) as a function of f(xi) ∈ [0, 1)]
Simplify the cost function

  L(β̂) = L(β0, β) = Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ]

Derivation*. Recall that f(x) = 1 / (1 + e^{−β̂^T x̂}) and

  L(β̂) = − Σ_{i=1}^{n} [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]

       = − Σ_{i=1}^{n} [ yi log( f(xi) / (1 − f(xi)) ) + log(1 − f(xi)) ]

It remains to prove two equalities:

1. log( f(xi) / (1 − f(xi)) ) = β̂^T x̂i
2. log(1 − f(xi)) = − log(1 + e^{β̂^T x̂i})
Simplify the cost function

1. log( f(xi) / (1 − f(xi)) )
     = log( [1 / (1 + e^{−β̂^T x̂i})] / [1 − 1 / (1 + e^{−β̂^T x̂i})] )
     = log( 1 / (1 + e^{−β̂^T x̂i} − 1) )
     = log( e^{β̂^T x̂i} ) = β̂^T x̂i

2. log(1 − f(xi))
     = log( 1 − 1 / (1 + e^{−β̂^T x̂i}) )
     = log( (1 + e^{−β̂^T x̂i} − 1) / (1 + e^{−β̂^T x̂i}) )
     = log( 1 / (1 + e^{β̂^T x̂i}) )
     = − log( 1 + e^{β̂^T x̂i} )
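As a sanity check (not on the slides), a short NumPy sketch on random toy data confirming that the simplified form agrees with the original negative log-likelihood:

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
beta0, beta = 0.5, rng.normal(size=p)

z = beta0 + X @ beta                      # beta_hat^T x_hat_i for each i
f = 1.0 / (1.0 + np.exp(-z))              # f(x_i)

original = -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))
simplified = np.sum(np.logaddexp(0.0, z) - y * z)   # log(1 + e^z) computed stably

print(np.isclose(original, simplified))   # True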
Gradient of the cost function

• Cost function

  L(β0, β1, . . . , βp) = Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ],
  where β^T xi = β1 xi1 + β2 xi2 + · · · + βp xip

• Calculate

  ∂L/∂β0 = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi )       = Σ_{i=1}^{n} (f(xi) − yi)
  ∂L/∂β1 = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi ) xi1   = Σ_{i=1}^{n} (f(xi) − yi) xi1
  ∂L/∂β2 = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi ) xi2   = Σ_{i=1}^{n} (f(xi) − yi) xi2
  ...
  ∂L/∂βp = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi ) xip   = Σ_{i=1}^{n} (f(xi) − yi) xip
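These partial derivatives can be computed in vectorized form; a minimal sketch assuming NumPy, where the residual vector f(xi) − yi drives every component of the gradient:

import numpy as np

def gradient(beta0, beta, X, y):
    """Returns (dL/dbeta0, dL/dbeta) for the logistic regression cost."""
    f = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # f(x_i), i = 1..n
    r = f - y                                       # residuals f(x_i) - y_i
    grad_beta0 = np.sum(r)                          # sum_i (f(x_i) - y_i)
    grad_beta = X.T @ r                             # sum_i (f(x_i) - y_i) x_ij
    return grad_beta0, grad_beta

# example usage on toy data
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(gradient(0.0, np.zeros(2), X, y))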
Linear regression vs. logistic regression

Linear regression:

  L = (1/2) Σ_{i=1}^{n} (β^T xi + β0 − yi)²

  Gradient:
  ∂L/∂β0 = Σ_{i=1}^{n} (β^T xi + β0 − yi)       = Σ_{i=1}^{n} (f(xi) − yi)
  ∂L/∂βj = Σ_{i=1}^{n} (β^T xi + β0 − yi) xij   = Σ_{i=1}^{n} (f(xi) − yi) xij,    j = 1, 2, . . . , p

Logistic regression:

  L = Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ]

  Gradient:
  ∂L/∂β0 = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi )       = Σ_{i=1}^{n} (f(xi) − yi)
  ∂L/∂βj = Σ_{i=1}^{n} ( 1 / (1 + e^{−(β0 + β^T xi)}) − yi ) xij   = Σ_{i=1}^{n} (f(xi) − yi) xij,    j = 1, 2, . . . , p

In both cases the gradient takes the same form, with f(xi) = β^T xi + β0 for linear regression and f(xi) = g(β0 + β^T xi) for logistic regression.
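Since both gradients share the same form, the same gradient-descent loop fits either model; a sketch for logistic regression (NumPy assumed; the step size and iteration count are arbitrary choices for illustration):

import numpy as np

def fit_logistic(X, y, step=0.1, iters=5000):
    """Plain gradient descent on the logistic regression cost."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(iters):
        f = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        r = f - y                      # f(x_i) - y_i
        beta0 -= step * np.sum(r)      # dL/dbeta0
        beta -= step * (X.T @ r)       # dL/dbeta_j, j = 1..p
    return beta0, beta

# toy data: class 1 roughly when x1 + x2 is large
X = np.array([[0.0, 0.2], [0.5, 0.1], [2.0, 1.5], [1.8, 2.2]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta0, beta = fit_logistic(X, y)
f = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
print((f >= 0.5).astype(int))          # should recover [0 0 1 1]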
Solution may not exist

The solution (global minimizer) of the minimization problem

  minimize_{β0, β1, . . . , βp}  Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ]

may not exist. (Regularization will help resolve this issue.)

Example. n = 1, x1 = −1, y1 = 0. Then the cost function is

  L(β0, β1) = log(1 + e^{β0 − β1})

We can see that the infimum of L is 0; however, this value cannot be attained by any finite (β0, β1).

[Figure: L(0, β1) = log(1 + e^{−β1}) decreasing toward 0 as β1 increases]
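A tiny numerical illustration of this example (NumPy assumed): fixing β0 = 0, the cost L(0, β1) = log(1 + e^{−β1}) keeps decreasing toward 0 as β1 grows, so no finite minimizer attains the value 0:

import numpy as np

for b1 in [0.0, 1.0, 5.0, 10.0, 50.0]:
    # L(0, b1) = log(1 + exp(0 - b1)); log1p(x) = log(1 + x)
    print(b1, np.log1p(np.exp(-b1)))
# the cost approaches 0 but never reaches it for any finite beta1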
Multi-class classification: one-vs-rest

Idea: reduce multi-class classification to multiple binary classification problems.

Data: xi ∈ Rp, yi ∈ {1, 2, . . . , K}, i = 1, 2, . . . , n.

For each k ∈ {1, 2, . . . , K}:

1. Construct a new label ỹi = 1 if yi = k and ỹi = 0 otherwise
2. Learn a binary classifier fk with data (xi, ỹi)

The multi-class classifier predicts the class k that achieves the maximal value

  max_{k ∈ {1, 2, . . . , K}} fk(x)
Multi-class classification: one-vs-rest

  feature 1   feature 2   label y
      1           5          1
     0.8         4.5         1
     1.5         1.5         2
      4           4          3

Say for a new input x̃, we have f1(x̃) = 0.8, f2(x̃) = 0.1, f3(x̃) = 0.6.
Then we say
  x̃ belongs to class 1 with probability 80%
  x̃ belongs to class 2 with probability 10%
  x̃ belongs to class 3 with probability 60%
and we predict it belongs to class 1.
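A sketch of one-vs-rest on this toy table, assuming scikit-learn is available for the binary classifiers fk (any binary logistic regression fit, such as the gradient-descent loop sketched earlier, could be used in its place):

import numpy as np
from sklearn.linear_model import LogisticRegression  # binary classifier f_k

def one_vs_rest_fit(X, y, K):
    """Train one binary classifier per class k = 1..K (class k vs. rest)."""
    models = []
    for k in range(1, K + 1):
        y_tilde = (y == k).astype(int)                       # step 1: relabel
        models.append(LogisticRegression().fit(X, y_tilde))  # step 2: fit f_k
    return models

def one_vs_rest_predict(models, X_new):
    """Predict the class k whose f_k(x) is largest."""
    scores = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
    return scores.argmax(axis=1) + 1                         # classes are 1-indexed

# toy data from the table above: two features, three classes
X = np.array([[1, 5], [0.8, 4.5], [1.5, 1.5], [4, 4]], dtype=float)
y = np.array([1, 1, 2, 3])
models = one_vs_rest_fit(X, y, K=3)
print(one_vs_rest_predict(models, X))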
Ridge/lasso regularization
Over-fitting

Figure 2: under-fitted vs. good fit/just right vs. over-fitted. (Image from the internet)

• Under-fitting: a model is too simple and does not adequately capture the underlying structure of the data
• Over-fitting: a model is too complicated and contains more parameters than can be justified by the data; it does not generalize well from training data to test data
• Good fit: a model adequately learns the training data and generalizes well to test data
Ridge regularization

In linear/logistic regression, over-fitting occurs frequently. Regularization makes the model simpler and works well for most regression/classification problems.

• Ridge regularization:

  λ‖β‖² = λ Σ_{j=1}^{p} βj²

  λ: regularization parameter,  ‖β‖²: regularizer

• It is differentiable. It forces the βj's to be small
• Extreme case: if λ is a huge number, it will push all the βj's toward zero and the model will be naive
Ridge regularized problems

• Logistic regression + ridge regularization (gradient methods can be used; a solution exists)

  minimize_{β0, β1, . . . , βp}  Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ] + λ Σ_{j=1}^{p} βj²

• Linear regression + ridge regularization (apply either the normal equation or gradient methods)

  minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^{n} (β^T xi + β0 − yi)² + λ Σ_{j=1}^{p} βj²
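A sketch of how the ridge term enters the cost and gradient for logistic regression (NumPy assumed); following the formulas above, the penalty is applied to β1, . . . , βp but not to the intercept β0:

import numpy as np

def ridge_logistic_cost_grad(beta0, beta, X, y, lam):
    """Cost and gradient of ridge-regularized logistic regression."""
    z = beta0 + X @ beta
    f = 1.0 / (1.0 + np.exp(-z))
    # cost: sum_i [log(1 + e^{z_i}) - y_i z_i] + lam * sum_j beta_j^2
    cost = np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.sum(beta ** 2)
    r = f - y
    grad_beta0 = np.sum(r)                   # the intercept is not penalized
    grad_beta = X.T @ r + 2.0 * lam * beta   # ridge adds 2*lam*beta_j to each component
    return cost, grad_beta0, grad_beta

Plugging this cost/gradient pair into the gradient-descent loop sketched earlier gives a ridge-regularized fit.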
Normal equation for ridge regularized linear regression
   
Let X be the n × p matrix with rows x1^T, x2^T, . . . , xn^T, and Y = [y1; y2; . . . ; yn] ∈ Rn.

For simplicity, we assume β0 = 0 (note that the intercept should be zero if the data is standardized).

  minimize_{β ∈ Rp}  (1/2) Σ_{i=1}^{n} (β^T xi − yi)² + λ Σ_{j=1}^{p} βj²  =  (1/2) ‖Xβ − Y‖² + λ‖β‖²

Compute the gradient:

  Gradient = X^T(Xβ − Y) + 2λβ

Normal equation (setting the gradient to zero):

  (2λI + X^T X) β = X^T Y   ⇒   β = (2λI + X^T X)^{−1} X^T Y
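A direct sketch of this normal equation (NumPy assumed); solving the linear system is preferable to forming the inverse explicitly:

import numpy as np

def ridge_solution(X, Y, lam):
    """Solve (2*lam*I + X^T X) beta = X^T Y for beta."""
    p = X.shape[1]
    return np.linalg.solve(2.0 * lam * np.eye(p) + X.T @ X, X.T @ Y)

# toy example (beta0 = 0 assumed, as on the slide)
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 2.0])
print(ridge_solution(X, Y, lam=0.1))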
Lasso regularization

• Lasso (Least Absolute Shrinkage and Selection Operator) regularization:

  λ‖β‖₁ = λ Σ_{j=1}^{p} |βj|

• It is non-differentiable. It forces some βj's to be exactly zero
• It can be used for feature selection (model selection). It selects important features (removing non-informative or redundant features)
• When λ is larger, fewer features are selected
Lasso regularized problems

• Logistic regression + lasso regularization (gradient methods are no longer applicable)

  minimize_{β0, β1, . . . , βp}  Σ_{i=1}^{n} [ log(1 + e^{β0 + β^T xi}) − yi(β0 + β^T xi) ] + λ Σ_{j=1}^{p} |βj|

• Linear regression + lasso regularization (gradient methods are no longer applicable)

  minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^{n} (β^T xi + β0 − yi)² + λ Σ_{j=1}^{p} |βj|

• In the following, we always assume β0 = 0. Note that the intercept should be zero (β0 = 0) if the data is standardized.

Given the feature matrix X ∈ R^{n×p} and response vector Y ∈ Rn, the famous lasso problem [1] is

  minimize_{β ∈ Rp}  (1/2) ‖Xβ − Y‖² + λ‖β‖₁
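A hedged sketch of solving this lasso problem with scikit-learn's Lasso, assuming that library is available. Note that scikit-learn minimizes (1/(2n))‖Xβ − Y‖² + α‖β‖₁, so setting α = λ/n matches the objective above (the two differ only by an overall factor of 1/n):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]            # only 3 informative features
Y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 5.0
model = Lasso(alpha=lam / n, fit_intercept=False)   # beta0 = 0 assumed
model.fit(X, Y)
print(model.coef_)   # coefficients of non-informative features are driven exactly to zero

The printout illustrates the feature-selection effect described above: most coefficients are exactly zero, while the informative ones remain nonzero (slightly shrunk).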
References

[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
