Econ 471 Notes 1
$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, \ldots, n \qquad (1.1)$$
where $\alpha$ and $\beta$ are unknown parameters which are the objects of estimation. What we
will call data are the $n$ realizations of $(X_i, Y_i)$. We are abusing notation a bit by using the
same letters to refer to random variables and their realizations.
$u_i$ is an unobserved random variable which represents the fact that the relationship between
$Y$ and $X$ is not exactly linear. We will momentarily assume that $u_i$ has expected
value zero. Note that if $u_i = 0$, then the relationship between $Y_i$ and $X_i$ would be exactly
linear, so it is the presence of $u_i$ that breaks the exact nature of the relationship. $Y$ is usually
referred to as the explained or dependent variable, and $X$ as the explanatory or independent
variable.
We will refer to $u_i$ as the error term, a terminology more appropriate in
the experimental sciences, where a cause $x$ (say the dose of a drug) is administered to
different subjects and then an effect $y$ is measured (say, body temperature). In this case $u_i$
might be a measurement error due to the erratic behavior of a measurement instrument (for
example, a thermometer). In a social science like economics, $u_i$ represents a broader notion
of ignorance: whatever is not observed (through ignorance, omission, etc.) that
affects $y$ besides $x$.
The first goal will be to find reasonable estimates for $\alpha$ and $\beta$ based solely on the data,
that is, $(X_i, Y_i)$, $i = 1, \ldots, n$.
Given estimates $\hat{\alpha}$ and $\hat{\beta}$, define
$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$$
Intuitively, we have replaced $\alpha$ and $\beta$ by their estimates, and treated $u_i$ as if the relationship
were exactly linear, i.e., as if $u_i$ were zero. This will be understood as an estimate of $Y_i$.
Then it is natural to define a notion of estimation error as follows:
$$e_i \equiv Y_i - \hat{Y}_i$$
Note that each pair of values $\hat{\alpha}$ and $\hat{\beta}$ corresponds to a line in the $(X, Y)$ plane. Hence different values of $\hat{\alpha}$ and $\hat{\beta}$ correspond
to different estimated lines, which implies that choosing particular values is equivalent to
choosing a specific line on the plane. For the $i$-th observation, the estimation error $e_i$ can
be seen graphically as the vertical distance between the points $(X_i, Y_i)$ and $(X_i, \hat{Y}_i)$, that
is, between $(X_i, Y_i)$ and the fitted line. So, intuitively, we want values of $\hat{\alpha}$ and $\hat{\beta}$ such that the
fitted line they induce passes as close as possible to all the points in the scatter, so the errors
are as small as possible.
Note that if we had only two observations, the problem has a very simple solution: it
reduces to finding the only two values of $\hat{\alpha}$ and $\hat{\beta}$ that make the estimation errors exactly equal
to zero. Graphically, this is possible since it is equivalent to finding the only straight
line that passes through the two observations available. Trivially, in this extreme case all
estimation errors will be zero.
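For example, with made-up numbers: if the two observations are $(X_1, Y_1) = (1, 1)$ and $(X_2, Y_2) = (3, 5)$, the only line through both points has slope $\hat{\beta} = (5 - 1)/(3 - 1) = 2$ and intercept $\hat{\alpha} = 1 - 2 \cdot 1 = -1$, and both estimation errors are exactly zero.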
The more realistic case appears when we have more than two observations, not all of
them lying on a single line. Obviously, a line cannot pass through more than two non-aligned
points, so we cannot make all errors equal to zero. So now the problem is to find
values of $\hat{\alpha}$ and $\hat{\beta}$ that determine a line that passes as close as possible to all the points,
so estimation errors are, in the aggregate, small. For this we need to introduce a criterion
of what we mean by the line being close to or far from the points. Let us define a penalty
function, which consists of adding all the squared estimation errors, so that positive and
negative errors matter alike. For any $\hat{\alpha}$ and $\hat{\beta}$, this will give us an idea of how large the
aggregate estimation error is:
$$SSR(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$
SSR stands for sum of squared residuals. Note that, given the observations $Y_i$ and $X_i$,
this is a function that depends on $\hat{\alpha}$ and $\hat{\beta}$; that is, different values of $\hat{\alpha}$ and $\hat{\beta}$ correspond
to different lines that pass through the data points, implying different estimation errors. It
is now natural to look for $\hat{\alpha}$ and $\hat{\beta}$ so as to make this aggregate error as small as possible.
The values of $\hat{\alpha}$ and $\hat{\beta}$ that minimize the sum of squared residuals are:
$$\hat{\beta} = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - n \bar{X}^2}$$
and
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$
To derive them, take the first order conditions for the minimization of $SSR(\hat{\alpha}, \hat{\beta})$:
$$\frac{\partial SSR(\hat{\alpha}, \hat{\beta})}{\partial \hat{\alpha}} = 0, \qquad \frac{\partial SSR(\hat{\alpha}, \hat{\beta})}{\partial \hat{\beta}} = 0$$
The first condition gives $-2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$, which, dividing by $-2$ and distributing the summation, becomes:
$$\sum Y_i = n \hat{\alpha} + \hat{\beta} \sum X_i \qquad (1.3)$$
The second condition gives:
$$\frac{\partial \sum e_i^2}{\partial \hat{\beta}} = -2 \sum X_i (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0 \qquad (1.4)$$
Dividing by $-2$ and distributing the summations:
$$\sum X_i Y_i = \hat{\alpha} \sum X_i + \hat{\beta} \sum X_i^2 \qquad (1.5)$$
Equations (1.3) and (1.5) form a system of two linear equations in the two unknowns $\hat{\alpha}$ and $\hat{\beta}$, known
as the normal equations.
Dividing (1.3) by $n$ and solving for $\hat{\alpha}$ we get:
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} \qquad (1.6)$$
Replacing in (1.5):
$$\sum X_i Y_i = (\bar{Y} - \hat{\beta} \bar{X}) \sum X_i + \hat{\beta} \sum X_i^2$$
$$\sum X_i Y_i = \bar{Y} \sum X_i - \hat{\beta} \bar{X} \sum X_i + \hat{\beta} \sum X_i^2$$
$$\sum X_i Y_i - \bar{Y} \sum X_i = \hat{\beta} \left( \sum X_i^2 - \bar{X} \sum X_i \right)$$
$$\hat{\beta} = \frac{\sum X_i Y_i - \bar{Y} \sum X_i}{\sum X_i^2 - \bar{X} \sum X_i}$$
Note that, since $\bar{Z} = \sum Z_i / n$ for any variable $Z$, we have $\sum Z_i = n \bar{Z}$. Replacing, we get:
$$\hat{\beta} = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - n \bar{X}^2} \qquad (1.7)$$
Evaluating the fitted line at $X = \bar{X}$:
$$\hat{Y}(\bar{X}) = \hat{\alpha} + \hat{\beta} \bar{X} = \bar{Y} - \hat{\beta} \bar{X} + \hat{\beta} \bar{X} = \bar{Y}$$
Then $\hat{Y}(\bar{X}) = \bar{Y}$; that is, the regression line estimated by the method of least squares
passes through the point of means.
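To fix ideas, here is a minimal Python sketch (NumPy only; the data and variable names are made up for illustration) that computes $\hat{\alpha}$ and $\hat{\beta}$ directly from (1.6) and (1.7) and checks that the fitted line passes through the point of means:

import numpy as np

# Made-up sample data; any (X, Y) pairs would do
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(X)

# OLS estimates from (1.7) and (1.6)
beta_hat = (np.sum(X * Y) - n * Y.mean() * X.mean()) / (np.sum(X**2) - n * X.mean()**2)
alpha_hat = Y.mean() - beta_hat * X.mean()

# The fitted line evaluated at the mean of X recovers the mean of Y
assert np.isclose(alpha_hat + beta_hat * X.mean(), Y.mean())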
$$r_{XY} = \frac{Cov(X, Y)}{S_X S_Y}$$
The following result establishes the relationship between $r_{XY}$ and $\hat{\beta}$.
In deviations notation (where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$):
$$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}
= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum x_i^2}}
= \frac{\sum x_i y_i \sqrt{\sum y_i^2}}{\sqrt{\sum x_i^2} \sqrt{\sum x_i^2} \sqrt{\sum y_i^2}}
= \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \sqrt{\sum y_i^2}} \cdot \frac{\sqrt{\sum y_i^2 / n}}{\sqrt{\sum x_i^2 / n}}
= r_{XY} \, \frac{S_Y}{S_X}$$
If $r = 0$ then $\hat{\beta} = 0$. Note that if both variables have the same sample variance, then
the correlation coefficient is equal to the regression coefficient $\hat{\beta}$. We can also see
that, unlike the correlation coefficient, $\hat{\beta}$ is not invariant to changes in scale or units
of measurement.
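A quick numerical check of this relationship, reusing X and Y from the Python sketch above:

x, y = X - X.mean(), Y - Y.mean()           # deviations from the means
beta_hat = np.sum(x * y) / np.sum(x**2)     # slope in deviations form
r = np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))
S_X, S_Y = np.sqrt(np.mean(x**2)), np.sqrt(np.mean(y**2))
assert np.isclose(beta_hat, r * S_Y / S_X)  # beta_hat = r * S_Y / S_X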
Summing $Y_i = \hat{Y}_i + e_i$ over the observations and dividing by $n$:
$$\frac{\sum \hat{Y}_i}{n} = \frac{\sum Y_i}{n}$$
since $\sum e_i = 0$ from the first order conditions. Then:
$$\bar{\hat{Y}} = \bar{Y}$$
Property 6: $\hat{\beta}$ is a linear function of the $Y_i$'s. That is, $\hat{\beta}$ can be written as $\hat{\beta} = \sum w_i Y_i$,
where the $w_i$'s are real numbers, not all of them equal to zero.
This does not have much intuitive meaning so far, but it will be useful for later
results.
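Though the weights are not written out at this point, one standard expression (in the deviations notation used above, and the form used implicitly in the proofs below) is
$$w_i = \frac{x_i}{\sum_j x_j^2}, \qquad \hat{\beta} = \sum_i w_i Y_i = \sum_i w_i y_i,$$
where the second equality follows because $\sum_i x_i = 0$.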
$$Y_i = \alpha + \beta X_i + u_i, \qquad i = 1, \ldots, n$$
The classical assumptions provide a basic probabilistic structure to study the linear
model. Most assumptions are of a pedagogical nature, and we will study later on how they
can be relaxed. Nevertheless, they provide a simple framework to explore the nature of the
least squares estimator.
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$
We will first explore the main properties of $\hat{\beta}$ in detail, and leave the analysis of $\hat{\alpha}$ as
an exercise. The starting conceptual point is to see that $\hat{\beta}$ depends explicitly on the $Y_i$'s
which, in turn, depend on the $u_i$'s, which are, by construction, random variables. Then $\hat{\beta}$ is
a random variable, and hence it makes sense to talk about its moments (mean and variance,
for example) and its distribution.
It is easy to verify that:
$$y_i = \beta x_i + u_i^*$$
where $u_i^* = u_i - \bar{u}$ and, according to the classical assumptions, $E(u_i^*) = 0$ and, consequently,
$E(y_i) = \beta x_i$. This is known as the classical two-variable linear model in deviations from
the means.
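To see this, average the model over the $n$ observations to get $\bar{Y} = \alpha + \beta \bar{X} + \bar{u}$, and subtract this from $Y_i = \alpha + \beta X_i + u_i$:
$$Y_i - \bar{Y} = \beta (X_i - \bar{X}) + (u_i - \bar{u}), \qquad \text{that is,} \qquad y_i = \beta x_i + u_i^*.$$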
To prove the result, from the linearity property of the previous section:
$$\hat{\beta} = \sum w_i y_i$$
$$E(\hat{\beta}) = \sum w_i E(y_i) \qquad \text{(the $w_i$'s are non-stochastic)}$$
$$= \sum w_i \beta x_i$$
$$= \beta \sum w_i x_i$$
$$= \beta \frac{\sum x_i^2}{\sum x_i^2}$$
$$= \beta$$
The variance of $\hat{\beta}$ is $\sigma^2 / \sum x_i^2$.

From the linearity property, $\hat{\beta} = \sum w_i Y_i$, then
$$V(\hat{\beta}) = V\left( \sum w_i Y_i \right)$$
Note that
$$V(Y_i) = V(\alpha + \beta X_i + u_i) = V(u_i) = \sigma^2$$
Then, since the $Y_i$'s are uncorrelated under the classical assumptions,
$$V(\hat{\beta}) = V\left( \sum w_i Y_i \right) = \sum w_i^2 V(Y_i) = \sigma^2 \sum w_i^2 = \sigma^2 \frac{\sum x_i^2}{\left( \sum x_i^2 \right)^2} = \frac{\sigma^2}{\sum x_i^2}$$
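As a sanity check, here is a small Monte Carlo sketch (our own illustration, with arbitrary parameter values) that simulates the model repeatedly and compares the sample mean and variance of $\hat{\beta}$ with $\beta$ and $\sigma^2 / \sum x_i^2$:

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 2.0, 1.5           # arbitrary "true" values
X = np.linspace(0.0, 10.0, 50)               # fixed regressors
x = X - X.mean()

beta_hats = []
for _ in range(10_000):
    u = rng.normal(0.0, sigma, size=X.size)  # classical errors
    Y = alpha + beta * X + u
    y = Y - Y.mean()
    beta_hats.append(np.sum(x * y) / np.sum(x**2))

print(np.mean(beta_hats), beta)                    # close to beta (unbiasedness)
print(np.var(beta_hats), sigma**2 / np.sum(x**2))  # close to sigma^2 / sum(x_i^2)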
Gauss-Markov theorem: under the classical assumptions, $\hat{\beta}$ has the smallest variance among all linear unbiased estimators of $\beta$; that is, for any other linear unbiased estimator $\tilde{\beta}$,
$$V(\tilde{\beta}) \geq V(\hat{\beta})$$
The proof of a more general version of this result will be postponed until Chapter 3.
Discussion: BLUE; best does not mean good; we would prefer a minimum variance unbiased estimator
(without the linearity restriction); linear is not an interesting class in itself, etc. If we drop any assumption,
the OLS estimator is no longer guaranteed to be BLUE. This justifies the use of OLS when all the
assumptions are correct.
Estimation of $\sigma^2$
So far we have concentrated the analysis on $\hat{\alpha}$ and $\hat{\beta}$. As an estimate for $\sigma^2$ we will propose:
$$S^2 = \frac{\sum e_i^2}{n - 2}$$
We will later show that $S^2$ provides an unbiased estimator for $\sigma^2$.
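In code this is immediate; a two-line sketch reusing X, Y, alpha_hat, and beta_hat from the first Python sketch above:

e = Y - (alpha_hat + beta_hat * X)    # least squares residuals
S2 = np.sum(e**2) / (len(X) - 2)      # note the n - 2 in the denominator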
$$Y_i - \bar{Y} = \hat{Y}_i - \bar{Y} + e_i$$
$$y_i = \hat{y}_i + e_i$$
using the notation defined before and noting that, from Property 4, $\bar{\hat{Y}} = \bar{Y}$. Taking the
square of both sides and summing over all the observations:
$$y_i^2 = (\hat{y}_i + e_i)^2 = \hat{y}_i^2 + e_i^2 + 2 \hat{y}_i e_i$$
$$\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2 + 2 \sum \hat{y}_i e_i$$
P
The next step is to show that yi ei = 0:
X X
yi ei = ( i )ei
+ X
X X
=
ei + Xi ei
= 0+0
from the first order conditions. Then we get the following important decomposition:
$$\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2$$
$$TSS = ESS + RSS$$
This is a key result indicating that, when we use the least squares method, the total
variability of the dependent variable (TSS) around its sample mean can be decomposed
as the sum of two factors. The first one corresponds to the variability of $\hat{Y}$ (ESS) and
represents the variability explained by the fitted model. The second term represents the
variability not explained by the model (RSS), associated with the error term.
For a given model, the best situation arises when the errors are all zero, in which case
the total variability (TSS) coincides with the explained variability (ESS). The worst case
corresponds to the situation in which the fitted model does not explain anything of the total
variability, in which case TSS coincides with RSS. From this observation, it is natural to
suggest the following goodness of fit measure, known as $R^2$, or coefficient of determination:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
It can be shown (we will do it in the exercises) that $R^2 = r^2$. Consequently, $0 \leq R^2 \leq 1$.
When $R^2 = 1$, $|r| = 1$, which corresponds to the case in which the relationship between
$Y$ and $X$ is exactly linear. On the other hand, $R^2 = 0$ is equivalent to $r = 0$, which
corresponds to the case in which $Y$ and $X$ are linearly unrelated. It is interesting to note
that $TSS$ does not depend on the estimated model, that is, it does not depend on $\hat{\alpha}$ or
$\hat{\beta}$. Then, if $\hat{\alpha}$ and $\hat{\beta}$ are chosen so as to minimize $SSR$, they automatically maximize
$R^2$. This implies that, for a given model, the least squares estimates maximize $R^2$.
The $R^2$ is, arguably, the most used and abused measure of the quality of a regression model.
A detailed analysis of the extent to which a high $R^2$ can be taken as representative of a
good model will be undertaken in Chapter 4.
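A sketch of the decomposition and of the identity $R^2 = r^2$, again reusing the objects from the first Python sketch above:

Y_hat = alpha_hat + beta_hat * X
e = Y - Y_hat

TSS = np.sum((Y - Y.mean())**2)
ESS = np.sum((Y_hat - Y.mean())**2)
RSS = np.sum(e**2)

R2 = ESS / TSS
r = np.corrcoef(X, Y)[0, 1]

assert np.isclose(TSS, ESS + RSS)  # the decomposition above
assert np.isclose(R2, r**2)        # R^2 equals the squared correlation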
Of course, the central concept behind this procedure lies in specifying what we mean
by very close or very far, given that $\hat{\beta}$ is a random variable. More specifically, we need
to know the distribution of $\hat{\beta}$ under the null hypothesis so we can define precisely the
notion of significantly different from zero. In this context such a statement is necessarily
probabilistic; that is, we will take as the rejection region a set of values that lie far away
from zero, or, a set of values that under the null hypothesis appear with very low probability.
The properties discussed in the previous section are informative about certain moments
of $\hat{\alpha}$ or $\hat{\beta}$ (for example, their means and variances) but they are not enough for the purpose
of knowing their distributions. Consequently, we need to introduce an additional assumption.
We will assume that $u_i$ is normally distributed, for $i = 1, \ldots, n$. Given that we have
already assumed that $u_i$ has zero mean and constant variance equal to $\sigma^2$, we have:
$$u_i \sim N(0, \sigma^2)$$
$$Y_i \sim N(\alpha + \beta X_i, \sigma^2)$$
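Since $\hat{\beta} = \sum w_i Y_i$ is a linear combination of normally distributed variables, it is itself normally distributed; combining this with the mean and variance derived in the previous section,
$$\hat{\beta} \sim N\!\left( \beta, \; \frac{\sigma^2}{\sum x_i^2} \right)$$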
If $\sigma^2$ were known, we could use this result to test simple hypotheses like:
$$H_0: \beta = \beta_0 \quad \text{vs.} \quad H_A: \beta \neq \beta_0$$
Subtracting from $\hat{\beta}$ its expected value under the null hypothesis and dividing by its standard deviation, we get:
$$z = \frac{\hat{\beta} - \beta_0}{\sigma / \sqrt{\sum x_i^2}} \sim N(0, 1)$$
Hence, if the null hypothesis is true, $z$ should take values that are small in absolute value, and
large values otherwise. As you should remember from a basic statistics course, this is accomplished
by defining a rejection region and an acceptance region as follows. The acceptance region
includes values that lie close to the one corresponding to the null hypothesis. Let $c < 1$ and let
$z_c$ be a number such that:
$$\Pr(-z_c \leq z \leq z_c) = 1 - c$$
Equivalently,
$$\Pr\left( \beta_0 - z_c \, \frac{\sigma}{\sqrt{\sum x_i^2}} \;\leq\; \hat{\beta} \;\leq\; \beta_0 + z_c \, \frac{\sigma}{\sqrt{\sum x_i^2}} \right) = 1 - c$$
so we accept the null hypothesis if the observed realization of $\hat{\beta}$ lies within this interval, and
reject it otherwise. The number $c$ is specified in advance and is usually a small number. It
is called the significance level of the test. Note that it gives the probability of rejecting the
null hypothesis when it is correct. Under the normality assumption, the value $z_c$ can be
easily obtained from a table of percentiles of the standard normal distribution.
As you should also remember from a basic statistics class, a similar logic can be applied
to construct a confidence interval for $\beta_0$. Note that:
$$\Pr\left( \hat{\beta} - z_c \, \frac{\sigma}{\sqrt{\sum x_i^2}} \;\leq\; \beta_0 \;\leq\; \hat{\beta} + z_c \, \frac{\sigma}{\sqrt{\sum x_i^2}} \right) = 1 - c$$
The practical problem with the previous procedures is that they require that we know
$\sigma^2$, which is usually not available. Instead, we can compute its estimated version $S^2$ and define
$t$ as:
$$t = \frac{\hat{\beta} - \beta}{S / \sqrt{\sum x_i^2}}$$
It can be shown that
$$t \sim t_{n-2}$$
that is, the $t$-statistic has the so-called $t$-distribution with $n - 2$ degrees of freedom.
Hence, when we use the estimated version of the variance, we obtain a different distribution
for the statistic used to test simple hypotheses and construct confidence intervals.
Consequently, applying once again the same logic, in order to test the null hypothesis
$H_0: \beta = \beta_0$ against $H_A: \beta \neq \beta_0$ we use the $t$-statistic:
$$t = \frac{\hat{\beta} - \beta_0}{S / \sqrt{\sum x_i^2}} \sim t_{n-2}$$
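One way this test and the corresponding confidence interval could be computed, reusing the data and estimates from the first Python sketch above (SciPy's stats.t.ppf gives the required percentile of the $t_{n-2}$ distribution):

from scipy import stats

n = len(X)
e = Y - (alpha_hat + beta_hat * X)
S = np.sqrt(np.sum(e**2) / (n - 2))        # estimate of sigma
se_beta = S / np.sqrt(np.sum((X - X.mean())**2))

beta_0 = 0.0                               # H0: beta = 0
t_stat = (beta_hat - beta_0) / se_beta
c = 0.05                                   # significance level
t_c = stats.t.ppf(1 - c / 2, df=n - 2)     # two-sided critical value

reject = abs(t_stat) > t_c                 # reject H0 if t falls in the rejection region
ci = (beta_hat - t_c * se_beta, beta_hat + t_c * se_beta)  # confidence interval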
$$F = t^2$$
We will leave the proof as an exercise.