University of Amsterdam
Week 1. Lecture 1
February, 2024
1 / 65
Overview
2 / 65
1. Course structure
3 / 65
1.1. Schedule
4 / 65
Weekly schedule
5 / 65
Final grade
6 / 65
The team
7 / 65
Software
8 / 65
Our expectations and suggestions
9 / 65
1.2. Topics
10 / 65
Material
11 / 65
Exam material. Advice.
- Just by reading my slides, you will not learn all the material necessary for the exam.
- During the lectures we will not cover all topics and details discussed in the book.
- During the lectures we will (sometimes) not cover all topics and details discussed on the slides.
- Reading the book is essential to prepare for the exam.
- Every week you will get a list of sections from the textbook that is essential for understanding the material. This list will also overlap with the exam material.
12 / 65
Week-by-week topics
13 / 65
Course objectives
14 / 65
Prior Knowledge
15 / 65
2. What is Econometrics?
16 / 65
2.1. Definitions and distinctions
17 / 65
One of the thoughts on the definition
Ragnar Frisch (1895-1973), in the editorial for the first issue of Econometrica in 1933: "... econometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this theory has a definitely quantitative character. Nor should econometrics be taken as synonymous with the application of mathematics to economics. ... It is the unification of all three that is powerful. And it is this unification that constitutes econometrics."
18 / 65
Types of data
Until very recently, econometricians were mostly working with structured, low-dimensional datasets (especially in academia). This is gradually changing (as you have learned from the Statistical/Machine Learning course).
20 / 65
Figure: The two guys that make many Dutch econometricians proud.
21 / 65
3. Linear regression model
22 / 65
3.1. The setting
23 / 65
The problem
24 / 65
The model
In this course we will consider models that are linear in the parameters (or linear regression models) of the form:
$$y_i = \alpha + \sum_{k=1}^{K} \beta_k x_{k,i} + \varepsilon_i, \quad i = 1, \dots, n. \tag{1}$$
Here $n$ is the total size of our sample. Essentially, we try to explain $y_i$ (which is observed) using the observed quantities $(x_{1,i}, \dots, x_{K,i})$, and we leave ourselves some room for error because we cannot perfectly predict/explain $y_i$. The error term $\varepsilon_i$ captures the variation in $y_i$ that cannot be linearly explained using the observed explanatory variables.
25 / 65
Interpretation of the coefficients
$$\frac{\partial\, \mathrm{E}[y_i \mid (x_{1,i}, \dots, x_{K,i})]}{\partial x_{k,i}} = \beta_k, \quad k = 1, \dots, K. \tag{3}$$
26 / 65
Notation/Language/Jargon
Depending on the textbook considered (and when that textbook was written), different terminology is used for $y_i$ and $(x_{1,i}, \dots, x_{K,i})$:
- $y_i$ is usually referred to as the LHS (left-hand side) variable, the dependent variable, the regressand, or the explained variable. From a more general (statistical learning / machine learning) point of view it is the target variable.
- $(x_{1,i}, \dots, x_{K,i})$ are usually referred to as the RHS (right-hand side) variables, the independent variables, the covariates, the explanatory variables, or the regressors. In machine learning terminology these are the features.
In most cases, I will use the LHS/RHS terminology or the dependent variable/regressor terminology. I will avoid the notion of independent variables as it is too confusing.
27 / 65
3.2. The OLS estimator
28 / 65
Simple model
$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \dots, n. \tag{4}$$
How do we obtain good values of $(\alpha, \beta)$ if we have data $\{(y_i, x_i)\}_{i=1}^{n}$?
29 / 65
Ordinary Least-squares (OLS) objective function
The OLS estimator minimizes the sum of squared deviations
$$LS_n(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2,$$
that is,
$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta} LS_n(\alpha, \beta). \tag{6}$$
30 / 65
Derivatives
We look at the first partial derivatives of that objective function:
$$\frac{\partial LS_n(\alpha, \beta)}{\partial \alpha} = -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i),$$
$$\frac{\partial LS_n(\alpha, \beta)}{\partial \beta} = -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i).$$
31 / 65
Derivatives
Setting both derivatives to zero, the first equation implies that $\hat\alpha = \bar y - \hat\beta \bar x$. We can plug this expression into the second equation:
$$\sum_{i=1}^{n} x_i \big( (y_i - \bar y) - \hat\beta (x_i - \bar x) \big) = 0. \tag{7}$$
Such that:
$$\hat\beta = \frac{\sum_i x_i (y_i - \bar y)}{\sum_i x_i (x_i - \bar x)}. \tag{8}$$
Or equivalently:
$$\hat\beta = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}. \tag{9}$$
This basic equivalence will be proved during the tutorial.
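For concreteness, here is a minimal NumPy sketch (not part of the course software; the simulated data and all numbers are invented for illustration) that computes $\hat\alpha$ and $\hat\beta$ from the closed-form expressions (8) and (9):

```python
import numpy as np

# Invented data for illustration: price-like outcome, distance-like regressor
rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=200)
y = 150 - 3 * x + rng.normal(0, 10, size=200)

# Slope via equation (9), intercept via alpha_hat = ybar - beta_hat * xbar
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Equation (8) gives the same slope
beta_hat_alt = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
print(alpha_hat, beta_hat, np.isclose(beta_hat, beta_hat_alt))
```

In practice you would use a regression routine (e.g. in STATA during the PC-lab), but for the single-regressor case these closed-form expressions are all that is needed.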
32 / 65
Special case. Binary regressor
Consider the special case where $x_i = D_i$ is a binary variable. Assume that $n = n_0 + n_1$, where $n_1 = \sum_i D_i$ (i.e. the number of observations with $D_i = 1$). In that case:
$$\hat\beta = \frac{\sum_i D_i (y_i - \bar y)}{\sum_i D_i (D_i - \bar D)} = \frac{n_1 \bar y_1 - n_1 \bar y}{n_1 - n_1 \bar D} = \frac{\bar y_1 - (n_0/n) \bar y_0 - (n_1/n) \bar y_1}{1 - (n_1/n)} = \frac{(n - n_1) \bar y_1 - n_0 \bar y_0}{n - n_1} = \bar y_1 - \bar y_0.$$
Here $\bar y_1$ and $\bar y_0$ denote the sample means of $y_i$ in the groups with $D_i = 1$ and $D_i = 0$, respectively.
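A quick numerical check of this result, again a sketch on made-up simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.integers(0, 2, size=500)                 # binary regressor (0/1)
y = 100 + 35 * D + rng.normal(0, 20, size=500)   # invented outcome

# OLS slope on the dummy vs. the difference in group means
beta_hat = np.sum((D - D.mean()) * (y - y.mean())) / np.sum((D - D.mean()) ** 2)
diff_in_means = y[D == 1].mean() - y[D == 0].mean()
print(np.isclose(beta_hat, diff_in_means))       # True: beta_hat = ybar_1 - ybar_0
```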
33 / 65
4. Empirical illustration
34 / 65
The Dataset. Hotels Vienna.
We want to investigate the relationship between prices (per night) and the
distance of the hotel from the city center of Vienna using the linear model:
$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \dots, n. \tag{10}$$
35 / 65
Scatter plot
Figure: Scatter plot of hotel price (in EUR per night) against distance_km (distance to the city center in km) for the Vienna hotels dataset.
36 / 65
Empirical results. Hotel prices in Vienna. Binary variables.
We can see that, on average, hotels within a 2 km radius of the city center are 34.67 EUR per night more expensive than hotels outside of that radius.
37 / 65
Double check the algebra
38 / 65
Empirical results. Hotel prices in Vienna. Continuous
distance.
40 / 65
5.1. OLS useful properties
41 / 65
Some properties of OLS
Suppose we transform the data as $x_i^* = a_x + b_x x_i$ and $y_i^* = a_y + b_y y_i$ and run OLS of $y_i^*$ on $x_i^*$. Then $\hat\beta^* = \frac{b_y}{b_x}\,\hat\beta$, and
$$\hat\alpha^* = \bar y^* - \hat\beta^* \bar x^* = a_y - \frac{b_y a_x}{b_x}\,\hat\beta + b_y\,\hat\alpha. \tag{12}$$
Hence, if we add constants to all $x_i$ and/or $y_i$, then $\hat\beta$ does not change (location invariance). If we multiply all $x_i$ and/or $y_i$ by constants, then $\hat\beta$ scales up/down accordingly.
The estimate $\hat\alpha$ absorbs all location changes in $x_i$ and $y_i$ directly.
42 / 65
Example.
You are interested in measuring the effect of the distance from Schiphol airport on the amount of noise and pollution. Imagine you have data from multiple measurement stations, where $x_i$ is the distance in kilometres and $y_i$ is the daily noise measurement in dB.
If you now translate $x_i$ from kilometres to imperial miles, then the two $\hat\beta$ estimates should differ by a factor of ≈ 1.61.
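A small sketch illustrating this scale property (the noise-vs-distance numbers below are invented; only the factor between the two slopes matters):

```python
import numpy as np

rng = np.random.default_rng(2)
x_km = rng.uniform(0, 40, size=300)                    # distance in km (invented)
y_db = 70 - 0.8 * x_km + rng.normal(0, 3, size=300)    # noise level in dB (invented)

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

x_miles = x_km / 1.609344                 # same distances, measured in miles
b_km = ols_slope(x_km, y_db)
b_miles = ols_slope(x_miles, y_db)
print(b_miles / b_km)                     # ~1.609: the slope scales by 1 / b_x
```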
43 / 65
Empirical results. Hotel prices in Vienna. Continuous
distance in Miles.
The estimate of the constant ($\hat\alpha$) is the same, but the coefficient on distance ($\hat\beta$) is ≈ 1.61 times bigger. Make sure you understand why this is the case.
44 / 65
5.2. Model fit
45 / 65
Residual and fitted value
Consider the following generic decomposition of the data into the fitted
value and the residuals:
$$y_i = \hat y_i + \underbrace{y_i - \hat y_i}_{\hat e_i}. \tag{13}$$
Here $\hat y_i$ is the fitted value, and $\hat e_i$ is the residual (i.e. the remainder of $y_i$ that is not explained by $\hat y_i$). In our case:
$$\hat y_i = \hat\alpha + x_i \hat\beta. \tag{14}$$
46 / 65
Residual and fitted value. Visually.
47 / 65
Decomposition. Intuitive explanation.
i.e. the two sets of quantities have zero sample correlation. Alternatively, we can say that these two sets of quantities, i.e. $\{\hat y_i\}_{i=1}^{n}$ and $\{\hat e_i\}_{i=1}^{n}$ (as vectors), are orthogonal to each other.
48 / 65
Decomposition. Intuitive explanation.
Next week, we provide the necessary geometrical justification for the above
statements.
Make sure to refresh your knowledge of matrix algebra, vectors, and the
concepts of (orthogonal) projections.
49 / 65
Why $\sum_i \hat e_i \hat y_i = 0$?
50 / 65
We know that by the definition of the OLS estimator (so by construction):
$$\sum_{i=1}^{n} (y_i - \hat\alpha - \hat\beta x_i) = 0, \qquad \sum_{i=1}^{n} x_i (y_i - \hat\alpha - \hat\beta x_i) = 0.$$
Since $\hat e_i = y_i - \hat\alpha - \hat\beta x_i$ and $\hat y_i = \hat\alpha + \hat\beta x_i$, we have $\sum_i \hat e_i \hat y_i = \hat\alpha \sum_i \hat e_i + \hat\beta \sum_i x_i \hat e_i = 0 + 0$. Hence, also:
$$\sum_i \hat e_i \hat y_i = 0. \tag{16}$$
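Both conditions are easy to verify numerically; a sketch on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)      # arbitrary simulated data

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
y_fit = alpha_hat + beta_hat * x              # fitted values
e_hat = y - y_fit                             # residuals

print(np.isclose(e_hat.sum(), 0.0))           # residuals sum to zero
print(np.isclose(np.sum(e_hat * y_fit), 0.0)) # residuals orthogonal to fitted values
```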
51 / 65
5.3. The $R^2$
52 / 65
The SST = SSE + SSR decomposition
We want to understand how well the linear model and the corresponding OLS estimator are able to explain the variation of $y_i$ in the data, or alternatively how well we can fit the data for $y_i$ given the data for $x_i$.
$$\sum_i \hat e_i = \sum_{i=1}^{n} (y_i - \hat\alpha - \hat\beta x_i) = 0. \tag{18}$$
Note that:
$$\sum_i \hat e_i (\hat y_i - \bar{\hat y}) = \sum_i \hat e_i \hat y_i - \frac{1}{n} \sum_i \hat e_i \sum_i \hat y_i = 0 - 0 = 0. \tag{22}$$
54 / 65
Writing $y_i - \bar y = (\hat y_i - \bar{\hat y}) + \hat e_i$ (note that $\bar y = \bar{\hat y}$ because of (18)), squaring and summing over $i$, the cross term vanishes by (22). From here:
$$SST \equiv \sum_i (y_i - \bar y)^2 = \sum_i (\hat y_i - \bar{\hat y})^2 + \sum_i \hat e_i^2 \tag{23}$$
$$= SSE + SSR. \tag{24}$$
55 / 65
Towards $R^2$
The SST = SSE + SSR decomposition is useful to describe the amount of variation in $y_i$ that we are able to explain by $x_i$, but it is not a scale-free measure. In particular, if we multiply all $y_i$ by 100, then all three measures increase by a factor of $100^2$. This is inconvenient.
$$R^2 \equiv \frac{SSE}{SST} \in [0, 1]. \tag{26}$$
The unexplained (or residual) part is then always given by $1 - R^2 = SSR/SST$. Note that the OLS estimator maximizes $R^2$ by construction: it minimizes SSR, while SST does not depend on $(\hat\alpha, \hat\beta)$.
56 / 65
Why do we call it an $R^2$?
Notice that
$$\begin{aligned} SSE &= \sum_i (\hat y_i - \bar{\hat y})^2 = \sum_i \big( \hat\alpha + \hat\beta x_i - (\hat\alpha + \hat\beta \bar x) \big)^2 \\ &= \hat\beta^2 \sum_i (x_i - \bar x)^2 = \frac{\big( \sum_i (x_i - \bar x)(y_i - \bar y) \big)^2}{\sum_i (x_i - \bar x)^2}. \end{aligned}$$
57 / 65
From the above it follows that:
$$R^2 = \frac{SSE}{SST} = \frac{\big( \sum_i (x_i - \bar x)(y_i - \bar y) \big)^2}{\sum_i (x_i - \bar x)^2 \, \sum_i (y_i - \bar y)^2} = \left( \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2}\,\sqrt{\sum_i (y_i - \bar y)^2}} \right)^2.$$
The term inside the $(\cdot)^2$ is the sample correlation between $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$. Hence the name $R^2$.
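A numerical check of this identity (a sketch on invented simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 20, size=250)
y = 130 - 2.5 * x + rng.normal(0, 15, size=250)   # invented data

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
y_fit = alpha_hat + beta_hat * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_fit - y_fit.mean()) ** 2)         # explained sum of squares
r = np.corrcoef(x, y)[0, 1]                       # sample correlation
print(np.isclose(sse / sst, r ** 2))              # True: R^2 = squared correlation
```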
58 / 65
The Vienna hotels example.
Model        R²
Binary       0.21
Continuous   0.08

The model with the single binary variable (which was derived from the continuous distance variable) gives the better fit. Next week, we will see that the two models can be combined effectively.
59 / 65
5.4. The role of a constant term (self-study)
60 / 65
Why include a constant α in your model?
So far we have assumed that the model always includes a constant term (or an intercept) α, and we construct the estimators $(\hat\alpha, \hat\beta)$. But why?
61 / 65
Why include constant α in your model?
OLS without a constant term has many drawbacks and only one advantage.
Drawbacks:
- It can no longer be related to the correlation coefficient between $\{x_i\}$ and $\{y_i\}$.
- It is not invariant to the transformations $x \to x + \text{constant}$ and $y \to y + \text{constant}$. So, for example, it matters whether we measure temperature in degrees Celsius or in Kelvin.
- If the true model (more about this concept later) contains a non-zero α, then our estimator $\hat\beta$ is biased and inconsistent (more about these aspects in the next two weeks); see the simulation sketch after this list.
- The interpretation of $R^2$ as the squared correlation coefficient is also lost.
- You lose the reference group if your regressor is a binary variable $D_i \in \{0, 1\}$.
The only benefit is a lower variance of the estimator (efficiency) if the true intercept in your model is exactly 0. So is it worth it? Usually not.
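A small simulation sketch of the third drawback (the true intercept and slope below are invented for illustration): when the true intercept is non-zero, the slope estimated without a constant is systematically off.

```python
import numpy as np

rng = np.random.default_rng(5)
true_alpha, true_beta, n = 5.0, 2.0, 1000         # invented true parameters
slopes_no_const, slopes_with_const = [], []

for _ in range(500):
    x = rng.uniform(1, 10, size=n)
    y = true_alpha + true_beta * x + rng.normal(0, 1, size=n)
    slopes_no_const.append(np.sum(x * y) / np.sum(x ** 2))   # OLS without intercept
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    slopes_with_const.append(b)                              # OLS with intercept

print(np.mean(slopes_no_const))    # noticeably above 2.0: biased when alpha != 0
print(np.mean(slopes_with_const))  # close to 2.0
```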
62 / 65
6. Summary
63 / 65
Summary today
In this lecture
- We introduced the course.
- We introduced the meaning of econometrics in the narrow sense.
- We introduced the Ordinary Least Squares estimator for a single variable.
- We discussed how to measure the regression fit using $R^2$.
64 / 65
The remainder of this week
- Tutorial. You will refresh your knowledge of OLS algebra with one regressor, and investigate how to measure the fit of the OLS regression.
- PC-lab. Introduction to STATA. Basic manipulations with the data.
- Lecture (Friday). The meaning of a linear regression (population) model and different motivations. Finite-sample properties of the estimators.
65 / 65