Regression Analysis: Ordinary Least Squares
Suppose we model the relationship between two variables X and Y (say, GPA and salary) as

$$Y = w_0 + w_1 X.$$

If we let X denote the vector (1, X) and w the vector (w0, w1), this becomes

$$Y = X \cdot w,$$

where "·" represents the dot product. This form is why the method is called linear regression. As an exercise, verify that the function

$$f(w) = X \cdot w$$

is linear in w.
Solution. Remember that the term linear was used to describe the
“Expectation” operator. The two conditions we need to check are
f(w + v) = f(w) + f(v)
f(cw) = c f(w)
so that
$$\begin{aligned}
f(w + v) &= f((w_0, w_1) + (v_0, v_1)) \\
&= f((w_0 + v_0,\, w_1 + v_1)) \\
&= X \cdot (w_0 + v_0,\, w_1 + v_1) \\
&= (1, X) \cdot (w_0 + v_0,\, w_1 + v_1) && \text{(definition of } X\text{)} \\
&= (w_0 + v_0) + X (w_1 + v_1) && \text{(definition of dot product)} \\
&= (w_0 + X w_1) + (v_0 + X v_1) && \text{(rearranging)} \\
&= X \cdot w + X \cdot v \\
&= f(w) + f(v).
\end{aligned}$$

The second condition, f(cw) = c f(w), follows from an almost identical computation.
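As a quick numerical check of these two conditions (not part of the original text), here is a minimal sketch assuming a made-up predictor value X = 3.5, so that X = (1, 3.5):

```python
import numpy as np

# Hypothetical fixed predictor value, so X = (1, X) = (1, 3.5).
X = np.array([1.0, 3.5])

def f(w):
    """f(w) = X . w for the fixed vector X above."""
    return X @ w

u = np.array([2.0, -1.0])
v = np.array([0.5, 4.0])
c = 3.0

# Additivity: f(u + v) == f(u) + f(v)
assert np.isclose(f(u + v), f(u) + f(v))
# Homogeneity: f(c u) == c f(u)
assert np.isclose(f(c * u), c * f(u))
```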
Of course, if we collected a sample of n (GPA, Salary) pairs,

Sample = {(X1, Y1), (X2, Y2), . . . , (Xn, Yn)},

and plotted these points in the plane with "GPA" on the x-axis and "Salary" on the y-axis, the points would almost surely not fall on a perfect line. As a result, we introduce an error term e, so that
Y = X·w+e (5)
All of this hasn't yet told us how to predict our salaries 20 years from now using only our GPA. The following section gives a method for determining the best choice of w0 and w1 given some sample data. Using these values, we could plug in the vector (1, our GPA) for X in equation (5) and find a corresponding predicted salary Y (within some error e).
Suppose now that we want to fit the model

Y = X · w + e

to a set of observed data,

Data = {(x1, y1), (x2, y2), . . . , (xn, yn)}.

Each data point satisfies

yi = (1, xi) · (w0, w1) + ei,
and we are trying to find w0 and w1 to best fit the data. What do
we mean by “best fit”? The notion we use is to find w0 and w1 that
minimize the sum of squared errors $\sum_{i=1}^{n} e_i^2$. Rearranging the above equation for ei, we can rewrite this sum of squared errors as

$$E(w) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - x_i \cdot w)^2,$$
where the vector xi is shorthand for (1, xi ). As we can see above, the
error E is a function of w. In order to minimize the squared error,
we minimize the function E with respect to w. E is a function of
both w0 and w1 . In order to minimize E with respect to these values,
we need to take partial derivatives with respect to w0 and w1. Keeping track of all the indices in this derivation can be tricky, so the details are omitted. If we differentiate E with respect to w0 and w1, we eventually find that the minimizing w can be expressed in matrix form as
$$\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
= \left( \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
This can be written in the following concise form,

$$w^T = (D^T D)^{-1} D^T y,$$

where D is the matrix whose i-th row is the vector xi = (1, xi),

$$D = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix},$$

and y is the column vector made by stacking the observations yi,

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
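As a concrete illustration (not from the original text), the sketch below builds D and y from a few made-up (GPA, Salary) pairs, solves the normal equations for w, and reports the resulting sum of squared errors; all numbers are purely hypothetical.

```python
import numpy as np

# Made-up sample: GPAs and salaries (in thousands); purely illustrative.
x = np.array([2.8, 3.2, 3.6, 3.9])
y = np.array([48.0, 55.0, 61.0, 70.0])

# D has rows (1, x_i); y is the column vector of observations.
D = np.column_stack([np.ones_like(x), x])

# w^T = (D^T D)^{-1} D^T y  (using solve instead of an explicit inverse).
w = np.linalg.solve(D.T @ D, D.T @ y)
w0, w1 = w

# Sum of squared errors E(w) at the fitted w.
E = np.sum((y - D @ w) ** 2)
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}, E(w) = {E:.3f}")
```

In practice the same least-squares solution can be obtained more stably with np.linalg.lstsq(D, y, rcond=None).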
A sketch of the derivation using matrices is given in the following section for those who cringed at the sentence "Keeping track of all the indices in this derivation can be tricky, so the details are omitted." Some familiarity with linear algebra will also be helpful in going through the following derivation.
In matrix form, the sum of squared errors is E(w) = (Dw^T − y)^T(Dw^T − y), and its gradient with respect to w is

$$\nabla E = 2 D^T (D w^T - y).$$

Setting the gradient equal to zero and solving for w^T,

$$\begin{aligned}
2 D^T (D w^T - y) &= 0 \\
D^T D w^T - D^T y &= 0 \\
D^T D w^T &= D^T y \\
w^T &= (D^T D)^{-1} D^T y.
\end{aligned}$$
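Continuing the hypothetical example from above, we can check numerically that the gradient 2D^T(Dw^T − y) vanishes at the fitted w; this is only a sanity check of the derivation, not part of the original text.

```python
import numpy as np

# Reuse the made-up data and the fitted w from the earlier sketch.
x = np.array([2.8, 3.2, 3.6, 3.9])
y = np.array([48.0, 55.0, 61.0, 70.0])
D = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(D.T @ D, D.T @ y)

# Gradient of E at the fitted w; should be zero up to floating-point error.
grad = 2 * D.T @ (D @ w - y)
print(grad)
assert np.allclose(grad, 0.0, atol=1e-8)
```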
Returning to our original model

Y = w0 + w1 X + e,

we can plug in our GPA for X, and our optimal w0 and w1, to find the corresponding predicted salary Y, give or take some error e. Note that since we chose w0 and w1 to minimize the errors, it is likely that the corresponding error for our GPA and predicted salary is small (we assume that our (GPA, Salary) pair comes from the same "true" distribution as our samples).
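As a small follow-up to the earlier sketch, here is how such a prediction might look in code; the GPA of 3.4 is a hypothetical input, not a value from the text.

```python
import numpy as np

# Fit on the same made-up sample as before.
x = np.array([2.8, 3.2, 3.6, 3.9])
y = np.array([48.0, 55.0, 61.0, 70.0])
D = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.solve(D.T @ D, D.T @ y)

# Predict by plugging the vector (1, GPA) into Y = X . w.
gpa = 3.4
print(f"Predicted salary for GPA {gpa}: {w0 + w1 * gpa:.1f} (thousands)")
```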
Generalization
Our above example is a simplistic one, relying on the very naive as-
sumption that salary is determined solely by college GPA. In fact
there are many factors that influence someone's salary. For example, earnings could also be related to the salaries of the person's
parents, as students with more wealthy parents are likely to have
more opportunities than those who come from a less wealthy back-
ground. In this case, there are more predictors than just GPA. We
could extend the relationship to
Y = w0 + w1 X1 + w2 X2 + w3 X3 + e.

More generally, with d predictors X1, X2, . . . , Xd, the relationship becomes

Y = w0 + w1 X1 + w2 X2 + · · · + wd Xd + e,
or more concisely,
Y = X·w+e
where now X = (1, X1, . . . , Xd) and w = (w0, w1, . . . , wd). The least squares solution has exactly the same form as before,

$$w^T = (D^T D)^{-1} D^T y,$$

where the i-th row of D is now (1, xi1, . . . , xid).
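To illustrate that the same formula carries over, here is a sketch with two made-up predictors (GPA and, hypothetically, parental income); the data and the second predictor are assumptions for the example only.

```python
import numpy as np

# Made-up data with two predictors: GPA and (hypothetically) parental income.
gpa    = np.array([2.8, 3.2, 3.6, 3.9, 3.0])
income = np.array([40.0, 65.0, 55.0, 90.0, 70.0])
salary = np.array([48.0, 57.0, 60.0, 75.0, 58.0])

# D now has a column of ones plus one column per predictor.
D = np.column_stack([np.ones_like(gpa), gpa, income])

# Same normal-equations solution: w^T = (D^T D)^{-1} D^T y.
w = np.linalg.solve(D.T @ D, D.T @ salary)
print("w0, w1, w2 =", w)
```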
Correlation
Covariance
Suppose we have two random variables X and Y, not necessarily
independent, and we want to quantify their relationship with a num-
ber. This number should satisfy two basic requirements.
A natural quantity to consider is the expected product of the deviations (X − EX)(Y − EY), which defines the covariance:

$$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)] = E[XY] - E[X]\,E[Y].$$

Normalizing by the standard deviations of X and Y gives the correlation coefficient,

$$\rho_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}.$$
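A minimal simulation of these two formulas, using an assumed pair of dependent variables (Y = 2X + noise) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dependent pair: Y = 2X + noise.
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)

# Sample versions of Cov(X, Y) = E[XY] - E[X]E[Y] and rho = Cov / (sigma_x sigma_y).
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
rho = cov / (np.std(x) * np.std(y))

print(rho)                       # close to 2 / sqrt(5) ≈ 0.894
print(np.corrcoef(x, y)[0, 1])   # NumPy's estimate agrees
```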
Exercise 0.0.6. Verify that for given random variables X and Y, the correlation ρxy lies between −1 and 1.
Heuristic. The rigorous proof of this fact requires us to view X and Y as elements of an infinite-dimensional normed vector space and apply the Cauchy–Schwarz inequality to the quantity E[(X − EX)(Y − EY)]. Since we haven't mentioned any of these terms, we instead try to understand the result using a less fancy heuristic argument.
Given a random variable X, the first question we ask is, which choice of Y makes ρxy as large as possible? Intuitively, no Y should be more strongly (positively) related to X than X itself, so

$$\rho_{xy} \le \rho_{xx} = \frac{\mathrm{Cov}(X, X)}{\sigma_x \sigma_x} = \frac{\mathrm{Var}(X)}{\mathrm{Var}(X)} = 1.$$

Similarly, no Y should be more strongly negatively related to X than −X, so

$$\rho_{xy} \ge \rho_{x,-x} = \frac{\mathrm{Cov}(X, -X)}{\sigma_x \sigma_{-x}}
= \frac{E[X(-X)] - E[X]\,E[-X]}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(-X)}}
= \frac{-(E[X^2] - (EX)^2)}{\mathrm{Var}(X)} = \frac{-\mathrm{Var}(X)}{\mathrm{Var}(X)} = -1.$$
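A quick simulation of the two extreme cases (with an arbitrary assumed distribution for X) confirms the heuristic bounds:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

def corr(a, b):
    """Sample correlation: (E[ab] - E[a]E[b]) / (sigma_a sigma_b)."""
    return (np.mean(a * b) - np.mean(a) * np.mean(b)) / (np.std(a) * np.std(b))

print(corr(x, x))    # approximately  1
print(corr(x, -x))   # approximately -1
```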
Interpretation of Correlation
The correlation coefficient between two random variables X and
Y can be understood by plotting samples of X and Y in the plane.
Suppose we sample from the distribution on X and Y and get
Sample = {( X1 , Y1 ), . . . , ( Xn , Yn )}
Independence of X and Y implies ρxy = 0 (this is Proposition 6.6 below). The converse is not true. That is, ρxy = 0 does not necessarily imply that X and Y are independent.
In “Case 2” of the previous section, we hinted that even though
ρxy = 0 corresponded to X and Y having no observable relationship,
there could still be some underlying relationship between the random
variables, i.e. X and Y are still not independent. First let's prove Proposition 6.6: if X and Y are independent, then ρxy = 0.
Proof. Write f(X) = X − EX and g(Y) = Y − EY; since X and Y are independent, so are f(X) and g(Y). Then

$$\begin{aligned}
\rho_{xy} &= \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} \\
&= \frac{E[(X - EX)(Y - EY)]}{\sigma_x \sigma_y} \\
&= \frac{E[f(X)\, g(Y)]}{\sigma_x \sigma_y} \\
&= \frac{E[f(X)]\, E[g(Y)]}{\sigma_x \sigma_y} && \text{(independence of } f(X) \text{ and } g(Y)\text{)} \\
&= \frac{0 \cdot 0}{\sigma_x \sigma_y} && (E[f(X)] = E(X - EX) = 0) \\
&= 0.
\end{aligned}$$
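A short simulation illustrating the proposition, with an assumed pair of independent variables chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent by construction, so the correlation should be near zero.
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

rho = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.std(x) * np.std(y))
print(rho)   # close to 0, up to sampling noise
```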
Now let’s see an example where the converse does not hold. That
is, an example of two random variables X and Y such that ρxy = 0,
but X and Y are not independent.
Consider X distributed uniformly on {−1, 0, 1} and take Y = |X|. By the covariance formula,

$$\rho_{x,|x|} = \frac{E(X \cdot |X|) - E[X]\, E[|X|]}{\sigma_x \sigma_{|x|}}. \qquad (6)$$

Since X · |X| = X for X ∈ {−1, 0, 1},

$$X \cdot |X| \sim \mathrm{Uniform}\{-1, 0, 1\},$$

so

$$E(X \cdot |X|) = \tfrac{1}{3} \cdot (-1) + \tfrac{1}{3} \cdot (0) + \tfrac{1}{3} \cdot (1) = 0.$$

Similarly,

$$E[X] = \tfrac{1}{3} \cdot (-1) + \tfrac{1}{3} \cdot (0) + \tfrac{1}{3} \cdot (1) = 0.$$
Plugging these values into the numerator of expression (6), we get ρx,|x| = 0. Thus, the two random variables X and |X| are certainly not always equal and not independent, and yet they have correlation 0. It is important to keep in mind that zero correlation does not necessarily imply independence.
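A minimal simulation of this example, confirming near-zero sample correlation while exhibiting the obvious dependence between X and |X|:

```python
import numpy as np

rng = np.random.default_rng(3)

# X uniform on {-1, 0, 1}; Y = |X| is a function of X, hence certainly not independent of X.
x = rng.choice([-1, 0, 1], size=100_000)
y = np.abs(x)

rho = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.std(x) * np.std(y))
print(rho)   # close to 0

# Dependence: P(Y = 0 | X = 0) = 1, while P(Y = 0) = 1/3.
print(np.mean(y[x == 0] == 0), np.mean(y == 0))
```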