Linear Regression

1 Perspective 1: Maximum Likelihood Estimation
We will begin by making a series of assumptions about our training set. First, we assume the conditional density $p(y \mid x)$ is Gaussian:

$$\Pr(y \mid x) = \mathcal{N}(w^T x, \sigma^2)$$

Equivalently, we consider each output to be a linear combination of the weights with $x$, with some noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ added:

$$y = w^T x + \varepsilon$$
We will now consider our training set, which is composed of vectors $x_i \in \mathbb{R}^d$ and scalars $y_i$. Let the matrix of the $x_i$ be called $X$. In this lecture, we will explore two interpretations of our linear regression solution, first with $X$ row-major and then with $X$ column-major.
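To make the setup concrete, here is a minimal sketch (not part of the original notes) that generates synthetic data under these assumptions; the names `n`, `d`, `sigma`, and `w_true` are illustrative choices.

```python
import numpy as np

# A minimal sketch of the assumed data-generating process:
# y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5            # illustrative sizes and noise level

w_true = rng.normal(size=d)          # the (unknown) weight vector
X = rng.normal(size=(n, d))          # rows are the x_i (the row-major view)
y = X @ w_true + rng.normal(scale=sigma, size=n)
```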
We will now take the log-likelihood.
$$\log \Pr(\text{data} \mid w) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$$

The first term does not depend on $w$, so maximizing the log-likelihood is equivalent to the following minimization.

$$\underset{w}{\text{minimize}} \quad \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$$
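As a quick numerical check (not part of the original derivation), the sketch below verifies that the negative log-likelihood and the scaled squared error differ only by a constant in $w$, so their minimizers coincide; all names and data here are illustrative.

```python
import numpy as np

def neg_log_likelihood(w, X, y, sigma):
    """Negative Gaussian log-likelihood of the labels under weights w."""
    resid = y - X @ w
    const = len(y) * np.log(np.sqrt(2 * np.pi) * sigma)   # does not depend on w
    return const + np.sum(resid ** 2) / (2 * sigma ** 2)

def scaled_squared_error(w, X, y, sigma):
    """The part of the objective that actually depends on w."""
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

# For any two candidate weight vectors, the difference in negative
# log-likelihood equals the difference in scaled squared error.
rng = np.random.default_rng(1)
X, y, sigma = rng.normal(size=(50, 3)), rng.normal(size=50), 1.0
w1, w2 = rng.normal(size=3), rng.normal(size=3)
lhs = neg_log_likelihood(w1, X, y, sigma) - neg_log_likelihood(w2, X, y, sigma)
rhs = scaled_squared_error(w1, X, y, sigma) - scaled_squared_error(w2, X, y, sigma)
assert np.isclose(lhs, rhs)
```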
Let us express $\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2$ in vector-matrix form.

$$\begin{bmatrix} y_1 - w^T x_1 \\ \vdots \\ y_n - w^T x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} w$$

Writing $y$ for the vector of labels and $A$ for the matrix whose rows are the $x_i^T$, our problem becomes

$$\underset{w}{\text{minimize}} \quad \frac{1}{2\sigma^2}\left\|y - Aw\right\|^2$$
We can ignore $\sigma^2$, since it is a positive constant that does not change the minimizer. Additionally, we will expand $\|y - Aw\|^2$.
$$\frac{1}{2}\|y - Aw\|^2 = \frac{1}{2}(y - Aw)^T(y - Aw)$$

$$= \frac{1}{2}\left(\|y\|^2 - 2(Aw)^T y + \|Aw\|^2\right)$$
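The expansion is easy to confirm numerically; the following short check (with placeholder random inputs) is an illustration, not part of the notes.

```python
import numpy as np

# Numerical check of ||y - Aw||^2 = ||y||^2 - 2(Aw)^T y + ||Aw||^2
# on random placeholder inputs.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
w = rng.normal(size=3)
y = rng.normal(size=50)

lhs = np.linalg.norm(y - A @ w) ** 2
rhs = np.linalg.norm(y) ** 2 - 2 * (A @ w) @ y + np.linalg.norm(A @ w) ** 2
assert np.isclose(lhs, rhs)
```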
Differentiating with respect to $w$ gives us the gradient and the Hessian, below.

$$\nabla_w = -A^T y + A^T A w$$

$$\nabla_w^2 = A^T A$$

Since the Hessian is positive semidefinite, any solution to $\nabla_w = 0$ will give us a minimum. So, let us now solve for $w$.

$$\nabla_w = 0$$
$$A^T A w = A^T y$$
$$w = (A^T A)^{-1} A^T y$$
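Assuming $A$ has full column rank (so that $A^T A$ is invertible), this closed form can be computed directly. The sketch below is a minimal NumPy version and compares it against `np.linalg.lstsq`; the data and names are illustrative.

```python
import numpy as np

def ols_closed_form(A, y):
    """Least-squares weights w = (A^T A)^{-1} A^T y.

    Uses np.linalg.solve instead of an explicit matrix inverse for numerical
    stability; assumes A has full column rank so A^T A is invertible.
    """
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)

w_closed = ols_closed_form(A, y)
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # library solver for comparison
assert np.allclose(w_closed, w_lstsq)

# At the solution, the gradient -A^T y + A^T A w vanishes.
assert np.allclose(-A.T @ y + A.T @ A @ w_closed, 0)
```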
2 Perspective 2: Linear Algebra

We have another interpretation, using more linear algebra intuition. First, we will rewrite $y \approx Aw$, ignoring noise.
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_d \\ | & | & & | \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}$$

$$Aw = \sum_{i=1}^{d} a_i w_i$$
This means that Aw is actually a linear combination of the column vectors in A,
weighted by the entries of the vector w. Thus, for any choice of w, we find an Aw
that exists in the column space of A. Recall that the column space of a matrix A
is the vector space spanned by its column vectors. This means that if $y$ exists in the column space of $A$, then there is an exact solution for $w$. Otherwise, we pick the $w$ for which $Aw$ is closest to $y$.
Consider the error vector, after picking the best w.
$$e = y - \hat{y} = y - Aw$$
If $y$ exists in the column space, the error vector will be $0$. Note that if $e$ has some component that isn't perpendicular to the column space of $A$, we can perturb $w$ so that the norm of $e$ decreases. Thus, at the best $w$, the error $e$ must be perpendicular to the column space of $A$. Since $e$ is orthogonal to every column of $A$, we obtain the following equalities.
$$A^T e = 0$$
$$A^T (y - Aw) = 0$$
$$A^T y - A^T A w = 0$$
$$A^T y = A^T A w$$
$$w = (A^T A)^{-1} A^T y$$
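The orthogonality condition is also easy to confirm numerically. The sketch below (with placeholder random data) fits least squares with `np.linalg.lstsq` and checks that $A^T e \approx 0$.

```python
import numpy as np

# After fitting, the residual e = y - A w should be orthogonal to every
# column of A, i.e. A^T e = 0 (up to floating-point error).
rng = np.random.default_rng(4)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)

w, *_ = np.linalg.lstsq(A, y, rcond=None)
e = y - A @ w
assert np.allclose(A.T @ e, 0, atol=1e-8)
```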
3 Variants

1. Ridge or L2 Regularization

$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|y - Aw\|^2 + \lambda\|w\|^2$$
It is easy to compute the gradient vector to find that the optimal solution is the
following.
$$w = (A^T A + \lambda I)^{-1} A^T y$$
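A minimal sketch of this closed form follows, with an illustrative regularization weight `lam` (the symbol $\lambda$ above); for any `lam` > 0, the matrix $A^T A + \lambda I$ is invertible even when $A$ is rank-deficient.

```python
import numpy as np

def ridge_closed_form(A, y, lam):
    """Ridge weights w = (A^T A + lam * I)^{-1} A^T y.

    For lam > 0 the matrix A^T A + lam * I is positive definite, hence
    invertible, even when A does not have full column rank.
    """
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

rng = np.random.default_rng(5)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w_ridge = ridge_closed_form(A, y, lam=0.1)   # lam is an illustrative choice
```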
2. Lasso or L1 Regularization
$$\underset{w}{\text{minimize}} \quad \frac{1}{2}\|y - Aw\|^2 + \lambda\|w\|_1$$
We cannot simply differentiate the lasso objective function, because the 1-norm term is not differentiable at $0$. However, the function is convex, so we can still reach the global minimum, for example with subgradient or proximal gradient methods.
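As an illustration of one such method (not necessarily the one used in this course), here is a minimal proximal gradient (ISTA) sketch; the names `lasso_ista` and `soft_threshold`, the step-size choice, and the data are assumptions for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iters=5000):
    """Minimize 0.5 * ||y - A w||^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(A.T @ A, 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ w - y)              # gradient of the smooth squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w_lasso = lasso_ista(A, y, lam=1.0)           # larger lam drives more entries to exactly 0
```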
Regularization primarily prevents overfitting; we will explore its effects further in Note 13.