Linear Regression

1 Perspective 1: Maximum Likelihood Estimation
We will begin by making a series of assumptions about our training set. First, we assume the conditional density $p(y \mid x)$ is Gaussian:

$$\Pr(y \mid x) = \mathcal{N}(w^T x, \sigma^2)$$

Equivalently, we consider each output to be a linear combination of the weights with $x$, with some noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ added:

$$y = w^T x + \varepsilon$$
We will now consider our training set, which is composed of vectors $x_i \in \mathbb{R}^d$ and scalars $y_i$. Let the matrix of the $x_i$ be called $X$. In this lecture, we will explore two interpretations of our linear regression solution, first with $X$ row-major and then with $X$ column-major.
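To make the setup concrete, here is a minimal sketch (not part of the original notes) that generates synthetic data under these assumptions; the names `n`, `d`, `sigma`, and `w_true` are illustrative choices.

```python
import numpy as np

# A minimal sketch of the assumed data-generating process:
# y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5            # illustrative sizes and noise level

w_true = rng.normal(size=d)          # the (unknown) weight vector
X = rng.normal(size=(n, d))          # rows are the x_i (the row-major view)
y = X @ w_true + rng.normal(scale=sigma, size=n)
```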
We will now take the log-likelihood.
$$\log \Pr(\text{data} \mid w) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$$

The first term does not depend on $w$, so maximizing the log-likelihood is equivalent to the following minimization.

$$\underset{w}{\text{minimize}} \quad \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$$
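As a quick numerical check (not part of the original derivation), the sketch below verifies that the negative log-likelihood and the scaled squared error differ only by a constant in $w$, so their minimizers coincide; all names and data here are illustrative.

```python
import numpy as np

def neg_log_likelihood(w, X, y, sigma):
    """Negative Gaussian log-likelihood of the labels under weights w."""
    resid = y - X @ w
    const = len(y) * np.log(np.sqrt(2 * np.pi) * sigma)   # does not depend on w
    return const + np.sum(resid ** 2) / (2 * sigma ** 2)

def scaled_squared_error(w, X, y, sigma):
    """The part of the objective that actually depends on w."""
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

# For any two candidate weight vectors, the difference in negative
# log-likelihood equals the difference in scaled squared error.
rng = np.random.default_rng(1)
X, y, sigma = rng.normal(size=(50, 3)), rng.normal(size=50), 1.0
w1, w2 = rng.normal(size=3), rng.normal(size=3)
lhs = neg_log_likelihood(w1, X, y, sigma) - neg_log_likelihood(w2, X, y, sigma)
rhs = scaled_squared_error(w1, X, y, sigma) - scaled_squared_error(w2, X, y, sigma)
assert np.isclose(lhs, rhs)
```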
Let us express $\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$ in vector-matrix form.

$$\begin{bmatrix} y_1 - w^T x_1 \\ \vdots \\ y_n - w^T x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} w$$

Writing $y$ for the vector of labels and $A$ for the matrix whose rows are the $x_i^T$, our problem becomes

$$\underset{w}{\text{minimize}} \quad \frac{1}{2\sigma^2}\left\|y - Aw\right\|^2$$
We can ignore $\sigma^2$, since it is a positive constant that does not change the minimizer. Additionally, we will expand $\|y - Aw\|^2$.
$$\frac{1}{2}\|y - Aw\|^2 = \frac{1}{2}(y - Aw)^T(y - Aw)$$

$$= \frac{1}{2}\left(\|y\|^2 - 2(Aw)^T y + \|Aw\|^2\right)$$
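The expansion is easy to confirm numerically; the following short check (with placeholder random inputs) is an illustration, not part of the notes.

```python
import numpy as np

# Numerical check of ||y - Aw||^2 = ||y||^2 - 2(Aw)^T y + ||Aw||^2
# on random placeholder inputs.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
w = rng.normal(size=3)
y = rng.normal(size=50)

lhs = np.linalg.norm(y - A @ w) ** 2
rhs = np.linalg.norm(y) ** 2 - 2 * (A @ w) @ y + np.linalg.norm(A @ w) ** 2
assert np.isclose(lhs, rhs)
```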
Differentiating with respect to $w$ gives us the gradient and the Hessian, below.

$$\nabla_w = -A^T y + A^T A w$$

$$\nabla_w^2 = A^T A$$

Since the Hessian is positive semidefinite, any solution to $\nabla_w = 0$ will give us a minimum. So, let us now solve for $w$.

$$\nabla_w = 0$$
$$A^T A w = A^T y$$
$$w = (A^T A)^{-1} A^T y$$
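Assuming $A$ has full column rank (so that $A^T A$ is invertible), this closed form can be computed directly. The sketch below is a minimal NumPy version and compares it against `np.linalg.lstsq`; the data and names are illustrative.

```python
import numpy as np

def ols_closed_form(A, y):
    """Least-squares weights w = (A^T A)^{-1} A^T y.

    Uses np.linalg.solve instead of an explicit matrix inverse for numerical
    stability; assumes A has full column rank so A^T A is invertible.
    """
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)

w_closed = ols_closed_form(A, y)
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # library solver for comparison
assert np.allclose(w_closed, w_lstsq)

# At the solution, the gradient -A^T y + A^T A w vanishes.
assert np.allclose(-A.T @ y + A.T @ A @ w_closed, 0)
```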
2 Perspective 2: Linear Algebra

We have another interpretation, using more linear algebra intuition. First, we will rewrite $y \approx Aw$, ignoring noise.
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_d \\ | & | & & | \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}$$

$$Aw = \sum_{i=1}^{d} a_i w_i$$
This means that Aw is actually a linear combination of the column vectors in A,
weighted by the entries of the vector w. Thus, for any choice of w, we find an Aw
that exists in the column space of A. Recall that the column space of a matrix A
is the vector space spanned by its column vectors. This means that if $y$ exists in the column space of $A$, then there is an exact solution for $w$. Otherwise, we pick the $w$ for which $Aw$ is closest to $y$.
Consider the error vector, after picking the best w.
$$e = y - \hat{y} = y - Aw$$
If $y$ exists in the column space, the error vector will be $0$. Note that if $e$ has some component that isn't perpendicular to the column space of $A$, we can perturb $w$ so that the norm of $e$ decreases. Thus, at the best $w$, the error $e$ must be perpendicular to the column space of $A$. Since $e$ is orthogonal to every column of $A$, we obtain the following equalities.
$$A^T e = 0$$
$$A^T (y - Aw) = 0$$
$$A^T y - A^T A w = 0$$
$$A^T y = A^T A w$$
$$w = (A^T A)^{-1} A^T y$$
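The orthogonality condition is also easy to confirm numerically. The sketch below (with placeholder random data) fits least squares with `np.linalg.lstsq` and checks that $A^T e \approx 0$.

```python
import numpy as np

# After fitting, the residual e = y - A w should be orthogonal to every
# column of A, i.e. A^T e = 0 (up to floating-point error).
rng = np.random.default_rng(4)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)

w, *_ = np.linalg.lstsq(A, y, rcond=None)
e = y - A @ w
assert np.allclose(A.T @ e, 0, atol=1e-8)
```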
3 Variants

1. Ridge or L2 Regularization

$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|y - Aw\|^2 + \lambda\|w\|^2$$
It is easy to compute the gradient vector to find that the optimal solution is the
following.
$$w = (A^T A + \lambda I)^{-1} A^T y$$
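A minimal sketch of this closed form follows, with an illustrative regularization weight `lam` (the symbol $\lambda$ above); for any `lam` > 0, the matrix $A^T A + \lambda I$ is invertible even when $A$ is rank-deficient.

```python
import numpy as np

def ridge_closed_form(A, y, lam):
    """Ridge weights w = (A^T A + lam * I)^{-1} A^T y.

    For lam > 0 the matrix A^T A + lam * I is positive definite, hence
    invertible, even when A does not have full column rank.
    """
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

rng = np.random.default_rng(5)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w_ridge = ridge_closed_form(A, y, lam=0.1)   # lam is an illustrative choice
```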
2. Lasso or L1 Regularization
$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|y - Aw\|^2 + \lambda\|w\|_1$$
We cannot simply differentiate the lasso objective function, because the 1-norm term is not differentiable at $0$. However, the function is convex, so we can still reach the global minimum, for example with subgradient or proximal gradient methods.
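As an illustration of one such method (not necessarily the one used in this course), here is a minimal proximal gradient (ISTA) sketch; the names `lasso_ista` and `soft_threshold`, the step-size choice, and the data are assumptions for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iters=5000):
    """Minimize 0.5 * ||y - A w||^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(A.T @ A, 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ w - y)              # gradient of the smooth squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w_lasso = lasso_ista(A, y, lam=1.0)           # larger lam drives more entries to exactly 0
```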
Regularization primarily prevents overfitting; we will explore its effects further in Note 13.