Lecture 2-2: Simple Linear Regression (One Regressor)
Xinwei Ma
Department of Economics
UC San Diego
Spring 2021
We begin with cross-sectional analysis and will assume we have a random sample from
the population of interest.
We begin with the following premise, once we have a population in mind. There are two
variables, X and Y , and we would like to study “how Y varies with changes in X .”
1. How can we allow for factors other than X to affect Y ? There is never an exact relationship
between two variables (in interesting cases).
2. What is the functional relationship between Y and X ?
3. How can we be sure we are capturing a causal relationship between Y and X (as is so often the
goal)?
Y = β0 + β1 X + u
This equation defines the simple linear regression model (a.k.a. two-variable regression
model, or regression with one regressor).
The term “regression” has historical roots in the “regression to the mean” phenomenon.
Terminology:
Y                           X
dependent variable          independent variable
explained variable          explanatory variable
response variable           feature
predicted variable          predictor
regressand                  regressor
left-hand side variable     right-hand side variable
                            control variable
                            covariate
Remark: “dependent” and “independent” are used quite often. They should not be
confused with the notion of statistical independence.
Y = β0 + β1 X + u
This equation also addresses the functional form issue (in a simple way).
• β0 is the intercept parameter and β1 is the slope parameter. These describe a population, and
our ultimate goal is to estimate them.
The equation also addresses the causality issue. In this model, all other factors that
affect Y are in u. We want to know how Y changes when X changes, holding u fixed.
∆Y = β1 ∆X + ∆u = β1 ∆X    (when ∆u = 0).
• This equation provides an interpretation of β1 as a slope, provided that we can “hold all other
factors fixed” (∆u = 0).
Definition of the Simple Linear Regression Model
• Our model is
testScore = β0 + β1 classSize + u,
where u contains other factors such as textbook, teacher’s experience, school district, etc.
• If we can “hold u fixed,” then the effect of class size on test score is
∆testScore = β1 ∆classSize, when ∆u = 0.
Y = β0 + β1 X + u
This seems too easy! How can we hope, in general, to estimate the causal effect of X on Y
when we have assumed that all other factors affecting Y are unobserved and lumped into u?
The key is that the simple linear regression model is a population model. When it comes
to estimating β1 (and β0 ) using data, we must restrict the way u and X relate to each
other in the population.
Simplifying assumption: without loss of generality, the average, or expected value, of
u is zero in the population:
E[u] = 0.
The presence of β0 in
Y = β0 + β1 X + u
allows us to assume E[u] = 0. If the average of u is different from zero, we just adjust
the intercept, leaving the slope the same. If α0 = E[u] then we can write
Y = (β0 + α0) + β1 X + (u − α0),

where the new error term, u − α0, has mean zero.
The new intercept is β0 + α0 . Key point: the slope, β1 , has not changed.
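To see this numerically, here is a minimal simulation sketch (my own illustration, not from the slides; the values β0 = 1, β1 = 0.5, α0 = 2 and the use of numpy are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, alpha0 = 1.0, 0.5, 2.0        # hypothetical population values

X = rng.normal(size=n)
u = rng.normal(loc=alpha0, size=n)          # error term with E[u] = alpha0, not 0
Y = beta0 + beta1 * X + u

# Fit a line by least squares; polyfit returns (slope, intercept)
slope, intercept = np.polyfit(X, Y, deg=1)
print(intercept)   # close to beta0 + alpha0 = 3.0: the intercept absorbs E[u]
print(slope)       # close to beta1 = 0.5: the slope is unchanged
```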
Definition of the Simple Linear Regression Model
• What if better school districts attract more experienced teachers and have smaller classes?
EXAMPLE. Suppose u is “student ability” and X is “class size.” We need, for example,
E[studentAbi|classSize = 20] = E[studentAbi|classSize = 40]
so that the average student ability is the same across different class sizes. (A simulated check of this condition appears after these bullets.)
• What if students attending large classes are better prepared as they anticipate less in-class
interaction?
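To make the condition concrete, here is a small simulated check (my own sketch; the class sizes, sample size, and sorting pattern are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Random assignment: ability is mean-independent of class size
class_size = rng.choice([20, 40], size=n)
ability = rng.normal(size=n)
print([ability[class_size == c].mean() for c in (20, 40)])   # both near 0

# Sorting: higher-ability students end up in smaller classes -> assumption fails
ability_sorted = rng.normal(size=n) - 0.02 * class_size
print([ability_sorted[class_size == c].mean() for c in (20, 40)])
# the conditional means now differ across class sizes (about -0.4 vs -0.8)
```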
Given data on X and Y , how can we estimate the population parameters, β0 and β1 ?
Yi = β0 + β1 Xi + ui , i = 1, 2, · · · , n,
We observe Yi and Xi , but not the error term ui . (However, we know ui is there.)
Both of the following conditions are implied by the zero conditional mean assumption,
E[u|X ] = 0: the error has mean zero, E[u] = 0, and it is uncorrelated with the regressor,
E[Xu] = 0. Substituting u = Y − β0 − β1 X gives

0 = E[Y − β0 − β1 X ],    0 = E[X (Y − β0 − β1 X )].
These are the two conditions in the population that determine β0 and β1 . So we use
their sample analogues, which is a method of moments approach to estimation.
Recall that from the zero conditional mean assumption, E[u|X ] = 0, we obtained the two
population conditions

0 = E[Y − β0 − β1 X ],    0 = E[X (Y − β0 − β1 X )].

By the law of large numbers, we expect the sample analogues to be close to zero:
0 ≈ (1/n) Σᵢ₌₁ⁿ (Yi − β0 − β1 Xi),    0 ≈ (1/n) Σᵢ₌₁ⁿ Xi (Yi − β0 − β1 Xi).
They will not be exactly zero due to sample variation. Remember: we are dealing with a
(random) sample, not the population.
The two estimates, β̂0 and β̂1 , are defined by requiring the sample analogues to be
exactly zero:
0 = (1/n) Σᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi),    0 = (1/n) Σᵢ₌₁ⁿ Xi (Yi − β̂0 − β̂1 Xi).
Later, we will show that the two estimates, β̂0 and β̂1 , are close to their true population
values, β0 and β1 .
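As an illustration (my own sketch, with simulated data), the two sample conditions are linear in (β̂0, β̂1), so a computer can solve them directly as a 2×2 linear system:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=n)
Y = 1.0 + 0.5 * X + rng.normal(size=n)      # simulated data; beta0=1, beta1=0.5

# The two sample conditions, rearranged as a 2x2 linear system in (b0, b1):
#   b0           + b1 * mean(X)    = mean(Y)
#   b0 * mean(X) + b1 * mean(X**2) = mean(X * Y)
A = np.array([[1.0,      X.mean()],
              [X.mean(), (X**2).mean()]])
rhs = np.array([Y.mean(), (X * Y).mean()])
b0_hat, b1_hat = np.linalg.solve(A, rhs)

# At the estimates, both sample analogues are zero (up to rounding error)
resid = Y - b0_hat - b1_hat * X
print(resid.mean(), (X * resid).mean())
```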
Writing these two conditions in terms of sample averages:

0 = Ȳ − β̂0 − β̂1 X̄,    0 = (1/n) Σᵢ₌₁ⁿ Xi Yi − β̂0 X̄ − β̂1 (1/n) Σᵢ₌₁ⁿ Xi².

Solving the first equation for β̂0 and substituting into the second yields the OLS slope estimate

β̂1 = [ Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) ] / [ Σᵢ₌₁ⁿ (Xi − X̄)² ],

the sample covariance of Xi and Yi divided by the sample variance of Xi.
Deriving the Ordinary Least Squares Estimates
Do not lose the big picture. We started from the zero conditional mean assumption
E[u|X ] = 0,
and obtained the two population conditions
0 = E[Y − β0 − β1 X ], 0 = E[X (Y − β0 − β1 X )].
The previous formula for β̂1 is important. It shows us how to take the data we have and
compute the slope estimate. For reasons we will see, β̂1 is called the ordinary least
squares (OLS) slope estimate. We often refer to it as the slope estimate.
The slope estimate can be computed whenever the sample variance of Xi is not zero,
which only rules out the case where each Xi takes the same value in the sample.
• This makes sense, because we cannot hope to learn “how Y responds to changes in X ” if we
never observe any change in X .
Once we have β̂1 , we compute β̂0 = Ȳ − β̂1 X̄. This is the OLS intercept estimate.
The calculation is tedious even for small n (sample size). These days, one lets a
computer do the calculations.
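For example, a minimal numpy sketch (the data are made up for illustration) computes the OLS estimates from the formulas above and checks them against a built-in least-squares routine:

```python
import numpy as np

# Toy data, made up for illustration
X = np.array([10., 12., 15., 18., 20., 25.])
Y = np.array([700., 690., 670., 660., 640., 630.])

# OLS estimates straight from the formulas
b1_hat = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean())**2).sum()
b0_hat = Y.mean() - b1_hat * X.mean()

# Cross-check against numpy's built-in least-squares fit
slope, intercept = np.polyfit(X, Y, deg=1)
print(b1_hat, slope)        # identical up to floating-point error
print(b0_hat, intercept)
```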
Once we have the numbers β̂0 and β̂1 for a given data set, we write the regression line as
a function of X :
Ŷ = β̂0 + β̂1 X .
The regression line allows us to predict Y for any (sensible) value of X . It is also called
the sample regression function.
The intercept, β̂0 , is the predicted Y when X = 0. (The prediction is usually meaningless
if X = 0 is not possible. Consider the class size example.)
The slope, β̂1 , allows us to predict changes in Y for any (reasonable) change in X :
∆Ŷ = β̂1 ∆X .
\widehat{testScore} = 698.9 − 2.28 stuTeacherRatio

[Figure: scatter plot of district test scores against the student-teacher ratio with the fitted line; x-axis: Student-Teacher Ratio (10 to 30); y-axis: Test Score (620 to 720).]

• The intercept is meaningless. Literally, it says that the average score is predicted to be 698.9
in a district with no students.
• The slope suggests a 2.28-point decrease in test score for each one-unit increase in the
student-to-teacher ratio.
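As a quick illustration of using the estimated line (my own arithmetic; the ratio values 20 and 22 are made up):

```python
# Fitted line from the slides: testScore = 698.9 - 2.28 * stuTeacherRatio
b0_hat, b1_hat = 698.9, -2.28

ratio = 20                                  # hypothetical student-teacher ratio
print(b0_hat + b1_hat * ratio)              # predicted score: 653.3

# Predicted change when the ratio falls by 2 (e.g., from 22 to 20):
print(b1_hat * (-2))                        # +4.56 points
```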
\widehat{wage} = −0.90 + 0.54 education

[Figure: scatter plot of hourly wage against years of education with the fitted line; y-axis: Wage, dollars per hour (0 to 30).]

• The intercept is meaningless. Literally, it says that the average wage is predicted to be
−$0.90 for individuals who never attended school.
• The slope suggests a $0.54 increase in hourly wage for each additional year of education.
© Xinwei Ma 2021
x1ma@ucsd.edu