Tutorial On Multivariate Logistic Regression
Javier R. Movellan
July 23, 2006
1 Motivation
The last few years have seen a resurgence of interest in the Machine Learning community in main-effect models, i.e., models without interaction effects. This is due to the emergence of learning algorithms, such as SVMs and ADABoost, that work well with very high dimensional representations, avoiding the need for modeling interaction effects. The goal of this tutorial is to introduce the most important main-effect regression models, show how they relate to each other, and describe common algorithms used for training them.
2 General Model
Our goal is to train a well defined model based on examples of input-output pairs. In our notation the inputs will be n-dimensional, the outputs will be c-dimensional, and our training sample will consist of m input-output pairs. For convenience we organize the example inputs as an m × n matrix x. The corresponding example outputs are organized as an m × c matrix y. Our goal is to predict the rows of y using the rows of x. The models under consideration produce an intermediary linear transformation of the inputs and make predictions about y based on that transformation. Let λ be an n × c matrix and let ŷ = xλ represent the intermediary linear transformation of the data. We will evaluate the optimality of ŷ, and thus of λ, using the following criterion
\Phi(\lambda) = -\sum_{j=1}^{m} w_j\, \rho_j(y, \hat{y}) + \frac{\alpha}{2} \sum_{k=1}^{c} \lambda_k^T \lambda_k    (1)
The terms w1, . . . , wm are positive weights that capture the relative importance of each of the m input-output pairs, and α is a positive constant. Informally, the first term can be seen as a negative log-likelihood function, capturing the degree of match between the data and the model, while the second term can be interpreted as a negative log prior over λ.
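To make the criterion concrete, here is a minimal numpy sketch of how (1) can be evaluated. It is an illustration rather than code from the original text; the function names and the particular choice of ρ are assumptions of the sketch.

import numpy as np

def phi(lam, x, y, w, alpha, rho):
    # lam: (n, c) parameter matrix; x: (m, n) inputs; y: (m, c) outputs
    # w: (m,) positive example weights; alpha: positive constant
    # rho: callable mapping (y_j, yhat_j) -> the scalar rho_j(y, yhat)
    yhat = x @ lam                                  # intermediary linear transformation
    data_term = sum(w[j] * rho(y[j], yhat[j]) for j in range(x.shape[0]))
    reg_term = 0.5 * alpha * np.sum(lam * lam)      # (alpha/2) sum_k lam_k^T lam_k
    return -data_term + reg_term

# Illustrative rho: the (unnormalized) Gaussian log-likelihood term used in the
# linear regression case discussed below.
rho_linear = lambda yj, yhatj: -0.5 * np.sum((yj - yhatj) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 2))
print(phi(np.zeros((3, 2)), x, y, w=np.ones(5), alpha=0.1, rho=rho_linear))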
A standard approach for minimizing Φ is the Newton-Raphson algorithm (See Appendix B),
which calls for computation of the gradient and the Hessian matrix (See Appendix A).
Let λi represent the ith column of λ. The gradient can be computed using the chain rule (See Appendix A)
\nabla_{\lambda_i} \frac{\alpha}{2} \sum_{k=1}^{c} \lambda_k^T \lambda_k = \alpha \lambda_i    (2)
\nabla_{\lambda_i} \sum_{j=1}^{m} w_j\, \rho_j(y, \hat{y}) = \nabla_{\lambda_i}\hat{y}_i\, \nabla_{\hat{y}_i}\rho\, \nabla_{\rho} \sum_{j=1}^{m} w_j\, \rho_j(y, \hat{y}) = x^T w \Psi_i    (3)
where w is a diagonal matrix with diagonal elements w1 , . . . , wm and
\Psi_i = \begin{pmatrix} \partial\rho_1/\partial\hat{y}_{1i} \\ \vdots \\ \partial\rho_m/\partial\hat{y}_{mi} \end{pmatrix}    (4)
Thus
∇λi Φ = −xT wΨi + αλi    (5)
The overall gradient vector follows ∇λ Φ = (∇λ1 Φ, . . . , ∇λc Φ)T. The chain rule can be used to compute the component matrices of the overall Hessian matrix. Let δij be the Kronecker delta function (0 valued unless i = j, in which case it takes value 1), let In be the n × n identity matrix, and let
\Psi'_{ij} = \begin{pmatrix}
\dfrac{\partial^2 \rho_1}{\partial \hat{y}_{1j}\,\partial \hat{y}_{1i}} & 0 & \cdots & 0 \\
0 & \dfrac{\partial^2 \rho_2}{\partial \hat{y}_{2j}\,\partial \hat{y}_{2i}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \dfrac{\partial^2 \rho_m}{\partial \hat{y}_{mj}\,\partial \hat{y}_{mi}}
\end{pmatrix}    (9)
Thus
∇λj ∇λi Φ = xT Ψ′ij w x + δij αIn    (10)
The overall Hessian matrix follows
\nabla_\lambda \nabla_\lambda \Phi = \begin{pmatrix}
\nabla_{\lambda_1}\nabla_{\lambda_1}\Phi & \cdots & \nabla_{\lambda_1}\nabla_{\lambda_c}\Phi \\
\vdots & \ddots & \vdots \\
\nabla_{\lambda_c}\nabla_{\lambda_1}\Phi & \cdots & \nabla_{\lambda_c}\nabla_{\lambda_c}\Phi
\end{pmatrix}    (11)
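The gradient and Hessian expressions above translate directly into code. The sketch below is an illustration rather than the author's implementation; drho and d2rho are assumed user-supplied callables returning the first and second derivatives of the ρj with respect to the entries of ŷj.

import numpy as np

def gradient(lam, x, y, w, alpha, drho):
    # drho(y, yhat) is assumed to return an (m, c) array with entries
    # d rho_j / d yhat_{ji}; its ith column is the vector Psi_i of (4).
    yhat = x @ lam
    psi = drho(y, yhat)
    return -x.T @ (w[:, None] * psi) + alpha * lam   # ith column is eq. (5)

def hessian_block(lam, x, y, w, alpha, d2rho, i, j):
    # d2rho(y, yhat, i, j) is assumed to return the m-vector with entries
    # d^2 rho_k / (d yhat_{kj} d yhat_{ki}), i.e., the diagonal of Psi'_{ij} in (9).
    n = x.shape[1]
    yhat = x @ lam
    psi2 = d2rho(y, yhat, i, j)
    block = x.T @ (psi2[:, None] * w[:, None] * x)   # x^T Psi'_{ij} w x
    if i == j:
        block = block + alpha * np.eye(n)            # delta_ij alpha I_n
    return block

# Example first derivative for the linear regression case of the next section,
# where rho_j = -(1/2) ||y_j - yhat_j||^2:
drho_linear = lambda y, yhat: y - yhat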
3 Linear Regression

In linear regression the ρ functions take the following form
\rho_j(y, \hat{y}) = -\frac{1}{2}\,(y_j - \hat{y}_j)^T (y_j - \hat{y}_j)    (12)
Thus
\Psi_i = \begin{pmatrix} \partial\rho_1/\partial\hat{y}_{1i} \\ \vdots \\ \partial\rho_m/\partial\hat{y}_{mi} \end{pmatrix} = \begin{pmatrix} y_{1i} - \hat{y}_{1i} \\ \vdots \\ y_{mi} - \hat{y}_{mi} \end{pmatrix}    (13)
so the matrix Ψ whose columns are Ψ1, . . . , Ψc equals y − xλ.
The gradient follows
∇λ Φ = −xT w(y − xλ) + αλ    (14)
Setting this expression to zero results in a linear equation in λ that can be solved analytically
λ̂ = (xT w x + αIn )−1 xT w y    (15)
where λ̂ represents the optimal value of λ.
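The closed-form solution (15) takes only a few lines to compute. Below is a minimal numpy sketch; the function name and the use of a linear solve instead of an explicit matrix inverse are choices of the sketch, not of the original text.

import numpy as np

def weighted_ridge(x, y, w, alpha):
    # Solve (x^T w x + alpha I_n) lam = x^T w y, i.e., eq. (15).
    n = x.shape[1]
    xtw = x.T * w                              # x^T w, using that w is diagonal
    a = xtw @ x + alpha * np.eye(n)
    b = xtw @ y
    return np.linalg.solve(a, b)               # preferable to forming an explicit inverse

# Example usage with arbitrary data.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = rng.normal(size=(20, 2))
lam_hat = weighted_ridge(x, y, w=np.ones(20), alpha=0.1)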
5 Multinomial Logistic Regression
Let
\rho_j(y, \hat{y}) = \sum_{k=1}^{c} y_{jk} \log h_k(\hat{y}_j)    (22)
where
h_k(\hat{y}_j) = \frac{e^{\hat{y}_{jk}}}{\sum_{i=1}^{c} e^{\hat{y}_{ji}}}    (23)
It follows that
Ψi = yi − hi (ŷ)    (26)
where yi is the ith column of y and hi (ŷ) is the m-dimensional vector with elements hi (ŷj ), j = 1, . . . , m, and
\Psi'_{ki} = \begin{pmatrix}
\hat{y}_{1k}(\delta_{ki} - \hat{y}_{1i}) & 0 & \cdots & 0 \\
0 & \hat{y}_{2k}(\delta_{ki} - \hat{y}_{2i}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat{y}_{mk}(\delta_{ki} - \hat{y}_{mi})
\end{pmatrix}    (27)
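The quantities in (23), (26), and (27) are easy to compute. The following sketch (function names mine) reads the entries ŷjk , ŷji in (27) as the corresponding predicted probabilities hk (ŷj ), hi (ŷj ).

import numpy as np

def softmax(yhat):
    # h_k(yhat_j) of eq. (23), computed row-wise; the max-shift is a standard
    # numerical-stability device and does not change the result.
    z = yhat - yhat.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def psi(y, yhat):
    # Matrix whose ith column is Psi_i = y_i - h_i(yhat) of eq. (26).
    return y - softmax(yhat)

def psi_prime_diag(yhat, k, i):
    # Diagonal of Psi'_{ki} in eq. (27), with the entries read as
    # h_k(yhat_j) (delta_{ki} - h_i(yhat_j)).
    h = softmax(yhat)
    delta = 1.0 if k == i else 0.0
    return h[:, k] * (delta - h[:, i])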
In the special case of two response alternatives the Newton-Raphson iteration takes a particularly simple form. Let λ(t) represent the n-dimensional vector of weights at iteration t, let y and ŷ(t) be the m-dimensional vectors containing the desired and predicted probabilities for the first response alternative, and let w̃(t) = ∇xλ ŷ be the m × m diagonal matrix with w̃ii = ŷi (t)(1 − ŷi (t)). Some simple algebra shows that the iteration can be cast as the solution to a weighted linear regression problem
λ(t + 1) = (xT w̃(t) x)−1 xT w̃(t) z(t)    (29)
where z(t) = xλ(t) + w̃(t)−1 (y − ŷ(t)).
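For the two-alternative case, iteration (29) can be sketched as follows. This is an illustration under the definitions above; the sigmoid parameterization of ŷ(t), the jitter terms, and the function name are assumptions of the sketch.

import numpy as np

def logistic_irls(x, y, n_iter=25, jitter=1e-10):
    # Iteratively reweighted least squares for the two-alternative case,
    # following eq. (29). x is (m, n); y is the (m,) vector of desired
    # probabilities for the first alternative.
    m, n = x.shape
    lam = np.zeros(n)
    for _ in range(n_iter):
        yhat = 1.0 / (1.0 + np.exp(-x @ lam))            # predicted probabilities
        w = yhat * (1.0 - yhat)                          # diagonal of w~(t)
        z = x @ lam + (y - yhat) / np.maximum(w, jitter) # working response z(t)
        xtw = x.T * w
        # jitter * I keeps the solve well posed when some weights vanish;
        # it is not part of eq. (29).
        lam = np.linalg.solve(xtw @ x + jitter * np.eye(n), xtw @ z)
    return lam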
6 Robust Regression
In robust regression the criterion function is of the form
\Phi(\lambda) = \frac{\alpha}{2}\,\lambda^T \lambda + \sum_{j=1}^{m} w_j\, \rho(y_j - \hat{y}_j)    (30)
where ŷ = xλ and ρ typically penalizes extreme deviations between y and ŷ less than the squared error function does. It is easy to show that in this case
∇λ Φ = −xT wΨ + αλ (31)
where Ψj = ρ′(yj − ŷj ), the derivative of the error function. This derivative is known as the influence function, for it controls the influence of each example on the final solution. Note that if ρ(yj − ŷj ) = ½(yj − ŷj )², then Ψ = y − ŷ. If ρ has no second derivative, a common procedure for finding an optimal λ uses the following heuristic. Note that (31) can be expressed as follows:
∇λ Φ = −xT w̃(y − ŷ) + αλ    (32)
where w̃ is an m × m diagonal matrix such that w̃jj = wj Ψj /(yj − ŷj ). Setting this gradient to zero, and disregarding the fact that w̃ is a function of λ, results in a linear weighted least squares problem whose solution we have already seen. In this iterative version of weighted least squares we start with an arbitrary value of λ and apply the following iteration:
λt+1 = (xT w̃t x + αIn )−1 xT w̃t y    (33)
where w̃t is the matrix of weights obtained using λt .
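The reweighting heuristic of (32) and (33) can be sketched as follows. The Huber-type influence function and the small-residual safeguard are illustrative choices, not prescriptions of the text.

import numpy as np

def robust_irls(x, y, w, alpha, drho, n_iter=25, eps=1e-8):
    # Reweighting heuristic of eqs. (32)-(33). drho is the influence function,
    # i.e., drho(r) = rho'(r) evaluated elementwise at the residuals r.
    m, n = x.shape
    lam = np.zeros(n)
    for _ in range(n_iter):
        r = y - x @ lam                            # residuals y_j - yhat_j
        r_safe = np.where(np.abs(r) < eps, eps, r) # avoid division by zero
        wt = w * drho(r_safe) / r_safe             # w~_jj = w_j Psi_j / (y_j - yhat_j)
        xtw = x.T * wt
        lam = np.linalg.solve(xtw @ x + alpha * np.eye(n), xtw @ y)  # eq. (33)
    return lam

# Illustrative influence function: derivative of a Huber-type rho.
huber_influence = lambda r, k=1.0: np.clip(r, -k, k)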
If ρ has a second derivative, then the Hessian matrix exists and has the following form
∇λ ∇λ Φ = xT Ψ′ w x + αIn    (34)
where Ψ′ is an m × m diagonal matrix such that Ψ′jj = ρ′′(yj − ŷj ), the second derivative of the error function. The Newton-Raphson algorithm calls for the following iteration
λt+1 = λt + (xT Ψ′ x)−1 xT Ψ    (35)
which can also be cast as a weighted least squares solution with weight matrix Ψ′ and desired response vector z = xλt + (Ψ′)−1 Ψ.
8 ADABoost and the Backfitting algorithm
Under construction.
A Vector Calculus
Definition Gradient Matrix: Let y = f (x) where y ∈ Rn , x ∈ Rm . The gradient of y with
respect to x, symbolized ∇x y, is an m × n matrix defined as follows
\nabla_x y = \begin{pmatrix}
\dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_n}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_1}{\partial x_m} & \cdots & \dfrac{\partial y_n}{\partial x_m}
\end{pmatrix}    (37)
For example, if y = xT a x + b x, where a is an m × m matrix and b is a 1 × m row vector, then
∇x y = bT + (a + aT )x    (38)
The chain rule takes the following form: if z = f (y) and y = g(x), then
∇x z = ∇x y ∇y z    (39)
For the Hessian (the gradient of the gradient) we have, for example, if ∇x y = ax + bT with a symmetric,
∇x ∇x y = ∇x (ax + bT ) = a    (42)
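The layout convention in (37) can be checked numerically; the following sketch (names illustrative) confirms that for a linear map y = ax the gradient matrix under this convention is aT.

import numpy as np

def gradient_matrix(f, x, eps=1e-6):
    # Finite-difference estimate of the m x n matrix in (37):
    # entry (i, j) approximates d y_j / d x_i.
    y0 = f(x)
    g = np.zeros((x.size, y0.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - y0) / eps
    return g

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # maps R^3 to R^2
x = np.array([0.5, -1.0, 2.0])
g = gradient_matrix(lambda v: a @ v, x)            # approximately equal to a.T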
B The Newton-Raphson Method

Our goal is to optimize a function f with respect to a vector x. Let xt represent the state of the algorithm at iteration t. We approximate the function f using the linear and quadratic terms of the Taylor expansion of f around xt .
\hat{f}_t(x) = f(x_t) + \nabla_x f(x_t)^T (x - x_t) + \frac{1}{2}\,(x - x_t)^T \left(\nabla_x \nabla_x f(x_t)\right)(x - x_t)    (43)
We then find the extremum of f̂t with respect to x and move directly to that extremum. To do so note that
∇x f̂t (x) = ∇x f (xt ) + (∇x ∇x f (xt )) (x − xt )    (44)
We let xt+1 be the value of x for which ∇x f̂t (x) = 0, i.e., xt+1 = xt − (∇x ∇x f (xt ))−1 ∇x f (xt ).
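As a concrete illustration (not part of the original text), the iteration can be written generically as follows, assuming callables that return the gradient and the Hessian of f.

import numpy as np

def newton_raphson(grad, hess, x0, n_iter=20):
    # Generic Newton-Raphson iteration: x_{t+1} is the point at which the
    # gradient of the local quadratic approximation (43) vanishes.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Example: minimize f(x) = (1/2) x^T a x - b^T x, whose optimum is a^{-1} b
# and is reached in a single Newton step.
a = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_opt = newton_raphson(lambda x: a @ x - b, lambda x: a, np.zeros(2))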
It is useful to compare the Newton-Raphson method with the standard method of gradient descent. The gradient descent iteration is defined as follows
xt+1 = xt − ε ∇x f (xt )
where ε is a small positive constant. Thus gradient descent can be seen as a Newton-Raphson method in which the Hessian matrix is approximated by (1/ε) In .
Consider now the case in which f (x) = Σi ri (x)² is a sum of squared residuals, and approximate each residual linearly around xt . Minimizing the resulting approximation f̂t (x) is a linear least squares problem with a well-known solution. If we let yi = (∇x ri (xt ))T xt − ri (xt ) and ui = ∇x ri (xt ), then
x_{t+1} = \left(\sum_{i=1}^{n} u_i u_i^T\right)^{-1} \left(\sum_{i=1}^{n} u_i y_i\right)    (50)
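A minimal sketch of the update in (50), assuming callables that return the residuals ri (xt ) and their gradients ui (names are illustrative):

import numpy as np

def gauss_newton_step(x_t, residuals, jacobian):
    # One update of eq. (50). residuals(x) returns (r_1(x), ..., r_n(x));
    # jacobian(x) returns the matrix whose ith row is u_i^T = (grad_x r_i(x))^T.
    r = residuals(x_t)
    u = jacobian(x_t)
    yv = u @ x_t - r                 # y_i = (grad_x r_i(x_t))^T x_t - r_i(x_t)
    a = u.T @ u                      # sum_i u_i u_i^T
    b = u.T @ yv                     # sum_i u_i y_i
    return np.linalg.solve(a, b)     # x_{t+1}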
History
• The first version of this document was written by Javier R. Movellan, May 2002. It was 7
pages long.
• The document was made open source under the GNU FDL license 1.1 as part of the Kolmogorov project on August 9, 2002.
• October 9, 2003. Javier R. Movellan changed the license to GFDL 1.2 and included an
endorsement section.