
R300 Advanced Econometric Methods

Oleg I. Kitov
oik22@cam.ac.uk

Faculty of Economics and Selwyn College

Michaelmas Term 2021


Intermezzo 8

The main slide structure is blue, as above.

Sometimes we will be forced to deviate or digress temporarily.


This deviation is highlighted by the use of red slides, like this one.

These are stand-alone slides.


Application/Illustration 9

Sometimes we will work out special cases of an argument in some detail.


This deviation is highlighted by the use of yellow(ish) slides, like this one.

These are not stand-alone slides.


They build on the main set of slides and will continue to use their notation
and definitions.
Additional material 10

There are also slides like this one.

They provide some further detail and/or more results.

These will (mostly) not be covered in class and are not examinable.
ESTIMATION IN PARAMETRIC PROBLEMS

2 / 318
Reading

Evaluation of estimators:
Casella and Berger, Chapters 7 and 10
Hansen I, Chapter 6
Asymptotics for the sample mean:
Goldberger, Chapter 9
Hansen I, Chapters 7 and 8
Maximum likelihood:
Davidson and MacKinnon, Chapter 8
Hansen I, Chapter 10
Wooldridge, Chapter 13
Linear regression:
Goldberger, Chapters 14–16
Hansen II, Chapters 2–5

3 / 318
Estimation

We are interested in saying something about a population based on a sample

x1 , . . . , xn

from it. The xi may be scalars or vectors.


Want to learn some feature of the population, say θ; the estimand.
For that we use an estimator

θn = θn (x1 , . . . , xn );

this is just a function of the sample (it can be any function).


Some questions:

How do we evaluate estimators, i.e., what is a good estimator?


How do we construct good estimators?
Does there exist a best estimator and, if so, is it unique?

This inferential aim is different from a descriptive data analysis that gives
means, variances, correlations, regression coefficients, and so on.

4 / 318
The parametric framework

A way to formalize sampling is to see x1 , . . . , xn as a draw from an (n-variate)


probability (mass or density) function, g.

We begin with the random sampling and the parametric framework.

The sample is a random sample if


1 the xi are independent across i, so that g(x1 , . . . , xn ) = ∏_{i=1}^n fi(xi)
for probability functions f1 , . . . , fn ; and
2 all xi are identically distributed, so that fi = f for some f and all i.

We say that the xi are i.i.d.

The parametric framework says that f = fθ is known up to parameter θ


which is finite dimensional (and so is a vector, in general).
That is, we know the class
{fθ : θ ∈ Θ},
but not the particular θ that generated the data.

5 / 318
We know the whole probability distribution once we know the parameter θ.
We may calculate Pθ(xi ∈ A) = ∫_A fθ(x) dx for any set A. For example,

Fθ (x) = Pθ (xi ∈ (−∞, x]) = Pθ (xi ≤ x)

for any x (the cumulative distribution function).

We know all raw and centered moments; for example, the mean and variance

Eθ(xi) = ∫ x fθ(x) dx,    varθ(xi) = ∫ (x − Eθ(xi))(x − Eθ(xi))′ fθ(x) dx,

and so on.
We know Eθ (ϕ(xi )) for any chosen function ϕ and so also parameters ψ
defined through
Eθ (ϕ(xi ; ψ)) = 0,
(which we call moment conditions). Obvious example is ψ = Eθ (xi ), which
has ϕ(xi ; ψ) = xi − ψ.
For univariate xi the τ th-quantile is qτ = inf q {q : Fθ (q) ≥ τ }, for τ ∈ (0, 1).
It is a solution to the moment condition Eθ ({xi ≤ ψ} − τ ) = 0 and so has
ϕ(xi ; ψ) = {xi ≤ ψ} − τ .

6 / 318
Examples

Let x1 , . . . , xn be a sequence of zeros and ones with Pθ (xi = 1) = θ. Then


xi is Bernoulli with mass function

fθ (x) = θx (1 − θ)1−x , θ ∈ (0, 1),

for x ∈ {0, 1}.


One possible (and sensible) estimator of θ would be the sample frequency of
ones, i.e., xn = n⁻¹ ∑_{i=1}^n xi.

Another simple example has x1 , . . . , xn representing the number of arrivals


per unit of time. The Poisson distribution has

fθ(x) = Pθ(xi = x) = θ^x e^{−θ}/x!,   θ > 0,

for x ∈ N.
θ is the arrival rate, i.e., the expected number of arrivals per time unit.
A sensible estimator of θ is again the sample mean.

7 / 318
We have data on the number of births per hour over a 24 hour period in
Addenbrooke’s.
Fitting a Poisson model to such data we estimate the number of births per
hour by the sample mean, here 1.875 births/hour.
Given an estimate of θ we can estimate the mass function.

8 / 318
The hospital data also tell us that, of the 44 babies, 18 were boys and 26
were girls.
The maximum likelihood estimator of the probability of giving birth to a boy
is 18/44 = .409.
The estimator is a random variable.
Using arguments to be developed later we can test whether there is a gender
bias at Addenbrooke’s.
The standard error on our estimate is √((18/44) × (26/44)/44) = .074, which
gives us the value

(.409 − .500)/.074 = −1.23
for a test statistic which is (asymptotically) standard normal under the null
of no gender bias.
Using a Neyman-Pearson argument (see later) we cannot reject the absence
of gender bias (at conventional significance levels).
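The arithmetic behind this test is easy to reproduce. A minimal Python sketch, using only the counts quoted above (so the numbers are from the slide, not new data), is:

import math

boys, births = 18, 44
p_hat = boys / births                               # maximum-likelihood estimate of P(boy)
se = math.sqrt(p_hat * (1 - p_hat) / births)        # standard error of the sample frequency
z = (p_hat - 0.5) / se                              # test statistic under the null of no gender bias
print(round(p_hat, 3), round(se, 3), round(z, 2))   # 0.409, 0.074, -1.23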

9 / 318
A continuous example with two parameters is the normal distribution.
The univariate standard-normal density is
φ(x) = (1/√(2π)) e^{−x²/2};

it has mean zero and variance one. The corresponding distribution function
is Φ(x) = ∫_{−∞}^x φ(u) du.

The normal distribution is a location/scale family:


If zi ∼ N (0, 1), then
xi = µ + σzi ∼ N (µ, σ 2 ).

Its cumulative distribution function is

P(xi ≤ x) = P(zi ≤ (x − µ)/σ) = Φ((x − µ)/σ)

and its density function is

(1/σ) φ((x − µ)/σ) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

Obvious estimators for µ, σ 2 would be the sample mean and sample variance.
10 / 318
Less obvious is when
x∗i ∼ N (µ, σ 2 )
but we observe
xi = x∗i if x∗i ≥ 0,   and   xi = 0 if x∗i < 0.
This is a censored normal variable.

How would we estimate the parameters here?

Some obvious candidates would be


the sample mean and variance of the xi ;
the sample mean and variance of the positive xi .

These turn out not to be very attractive and should not be used.

We will construct a better estimator later on.

11 / 318
As a final example, suppose that xi ∼ χ2θ .
The Chi-squared distribution with (integer) θ degrees of freedom has density

fθ(x) = x^{θ/2−1} e^{−x/2} / (2^{θ/2} Γ(θ/2)),

where Γ(θ) = ∫_0^∞ x^{θ−1} e^{−x} dx denotes the Gamma function at θ.
We note without proof that
Eθ(xi) = θ,
varθ(xi) = 2θ,
Eθ(xi^p) = 2^p Γ(p + θ/2)/Γ(θ/2).

It follows that the sample mean is an obvious candidate estimator of θ.


But there is also information in higher-order moments so the sample mean
may be inefficient (it is here but it need not be, a priori; see the Poisson
example).

12 / 318
Change of variable

Let x be a random variable with density f . Let y = ϕ(x) for an invertible


function ϕ.
The density of y is

f(ϕ⁻¹(y)) |det(∂ϕ⁻¹(y)/∂y′)|.

Easiest to see in the univariate case:


If ϕ is increasing,

P (y ≤ a) = P (ϕ(x) ≤ a) = P (x ≤ ϕ−1 (a)),

and differentiation with respect to a gives

f(ϕ⁻¹(a)) ∂ϕ⁻¹(y)/∂y |_{y=a}

by the chain rule.


If ϕ is decreasing, P(y ≤ a) = 1 − P(x ≤ ϕ⁻¹(a)), and differentiation gives
−f(ϕ⁻¹(a)) (∂ϕ⁻¹(y)/∂y |_{y=a}).
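A minimal simulation sketch of the change-of-variables formula, for the illustrative (assumed, not from the slides) transformation y = exp(x) with x ∼ N(0, 1), so that ϕ⁻¹(y) = log(y):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.exp(rng.standard_normal(100_000))            # y = exp(x), x standard normal

a = 2.0                                             # evaluation point
implied = stats.norm.pdf(np.log(a)) * (1 / a)       # f(phi^{-1}(a)) * |d phi^{-1}(a)/da|
empirical = np.mean(np.abs(y - a) < 0.05) / 0.10    # crude density estimate from a small window
print(implied, empirical)                           # both close to the lognormal density at 2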

13 / 318
Characteristic function

Let x be a continuous univariate random variable with density f . Then

ϕ(t) = E(e^{ιtx}) = ∫ e^{ιtx} f(x) dx

is its characteristic function. (Here, ι is the imaginary unit, i.e., ι2 = −1)


So, ϕ is the Fourier transform of f .
Like f , ϕ completely characterizes the random variable.
f can be recovered from ϕ through the inverse Fourier transform

f(x) = (1/2π) ∫ e^{−ιtx} ϕ(t) dt.

Further, raw moments equal


E(x^p) = ι^{−p} ∂^p ϕ(t)/∂t^p |_{t=0}.

For multivariate x, ϕ(t) = E(e^{ιt′x}) for a vector t of conformable dimension.

14 / 318
An example is the standard normal case. Here,

f(x) = e^{−x²/2}/√(2π),    ϕ(t) = e^{−t²/2}.

We have, using the definition of the cosine function,

ϕ(t) = (1/√(2π)) ∫_{−∞}^{+∞} e^{ιtx} e^{−x²/2} dx = (2/√(2π)) ∫_0^{+∞} ½(e^{ιtx} + e^{−ιtx}) e^{−x²/2} dx
     = (2/√(2π)) ∫_0^{+∞} cos(tx) e^{−x²/2} dx.

Next,

ϕ′(t) = (2/√(2π)) ∫_0^{+∞} ∂cos(tx)/∂t · e^{−x²/2} dx
      = −(2/√(2π)) ∫_0^{+∞} sin(tx) x e^{−x²/2} dx
      = (2/√(2π)) [e^{−x²/2} sin(tx)]_0^{+∞} − (2/√(2π)) ∫_0^{+∞} t cos(tx) e^{−x²/2} dx
      = −t ϕ(t).

This implies that ϕ ∝ e^{−t²/2}. But because ∫ f(x) dx = 1 we must have that
ϕ(0) = 1 so that, indeed, ϕ(t) = e^{−t²/2}.
15 / 318
To see that

φ(x) = (1/2π) ∫_{−∞}^{+∞} e^{−ιtx} e^{−t²/2} dt,

we can use the same calculations.

Moreover, note that

(1/2π) ∫_{−∞}^{+∞} e^{−ιtx} e^{−t²/2} dt = (1/π) ∫_0^{+∞} cos(tx) e^{−t²/2} dt

by the same argument as before. We have already computed the last integral.

Moreover,

(1/π) ∫_0^{+∞} cos(tx) e^{−t²/2} dt = (1/π) √(π/2) e^{−x²/2} = (1/√(2π)) e^{−x²/2} = φ(x),

as claimed.

16 / 318
Squared standard-normal variable

Let z ∼ N (0, 1). Then the density of x = z 2 at a > 0 is


(φ(√a) + φ(−√a))/(2√a) = (1/(2√a)) ((1/√(2π)) e^{−a/2} + (1/√(2π)) e^{−a/2}) = (1/(√(2π)√a)) e^{−a/2}.

This is the density of a χ²1 random variable. Indeed, use that Γ(1/2) = √π
to rewrite the density as

(1/(√(2π)√a)) e^{−a/2} = a^{1/2−1} e^{−a/2}/(2^{1/2} √π) = a^{1/2−1} e^{−a/2}/(2^{1/2} Γ(1/2)),

which coincides with the definition given above.

17 / 318
Sum of squared independent standard-normal variables

The characteristic function of a χ2p -variable is

ϕp (t) = (1 − 2ιt)−p/2 .

So, if
zi ∼ N (0, 1),
then zi2 ∼ χ21 has ϕ1 (t) = (1 − 2ιt)−1/2 .
The characteristic function of ∑_{i=1}^n zi² is (by independence) equal to

∏_{i=1}^n ϕ1(t) = ((1 − 2ιt)^{−1/2})^n = (1 − 2ιt)^{−n/2} = ϕn(t).

Hence,

∑_{i=1}^n zi² ∼ χ²n.
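A minimal simulation check (not part of the slides): the sum of n squared independent standard normals has the same quantiles as a chi-squared with n degrees of freedom. The choices of n, the number of replications, and the seed are arbitrary.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 200_000
s = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)      # sum of n squared normals

for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(s, q).round(3), stats.chi2.ppf(q, df=n).round(3))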

18 / 318
Sum of independent normal variables

The characteristic function of a N (µ, σ 2 ) variable is


ϕµ,σ²(t) = e^{ιtµ − σ²t²/2}.

So, if
zi ∼ N(0, 1)
are independent the characteristic function of ∑_{i=1}^n zi is

∏_{i=1}^n ϕ0,1(t) = (e^{−t²/2})^n = e^{−n t²/2} = ϕ0,n(t),

i.e., ∑_{i=1}^n zi ∼ N(0, n).
By the location/scale properties of the normal we then have that

z n ∼ N (0, n−1 )

and
xn = µ + σ z n ∼ N (µ, σ 2 /n).

19 / 318
Motivating best unbiasedness

Sampling distributions of several estimators.


Blue is better than red.

[Figure: sampling densities of several estimators around the estimand θ; the blue density is more tightly concentrated around θ than the red one.]

20 / 318
Best unbiased estimator

An estimator θn is best unbiased if Eθ (θn ) = θ and

varθ (θn ) ≤ varθ (θ∗ )

for any other unbiased estimator θ∗ .


Here and later, the inequality is to be interpreted in the matrix sense: A ≥ 0
means that matrix A is positive semi-definite, i.e., x0 A x ≥ 0 for any real
non-zero vector x.

A lower bound on the variance can be found.


This is called an efficiency bound; here: Cramér-Rao bound.

Very often such an estimator will not exist.


If it exists, it is unique.

21 / 318
Non-existence of bias

The Cauchy distribution with location µ and scale γ has the symmetric
density
1 / (πγ (1 + ((x − µ)/γ)²)).

It has no moments.
For example, with µ = 0 and γ = 1 we have
E(|x|) = lim_{M→∞} 2 ∫_0^M (1/π) x/(1 + x²) dx = lim_{M→∞} log(1 + M²)/π = +∞.

So it is not useful to estimate the location parameter µ via the sample mean.
A sensible estimator would be the sample median, which is well defined in
spite of the non-existence of moments.

So, if an estimator is Cauchy distributed its bias does not exist.

An example where this happens is with ratios of normal variates, as these


are Cauchy.
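A hedged simulation sketch of this point: the ratio of two independent centred normals is Cauchy, so its running sample mean never settles down, while the sample median does. The sample sizes and seed below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000) / rng.normal(0, 1, 100_000)    # Cauchy draws as a ratio of normals

for n in (1_000, 10_000, 100_000):
    print(n, np.mean(x[:n]).round(2), np.median(x[:n]).round(4))
# the running mean wanders as n grows; the median stays near the location parameter 0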

22 / 318
Ratio of normals

Take independent scalar normal variates x ∼ N (0, σ12 ) and y ∼ N (0, σ22 ).
Consider the transformation (x, y) → (u, v) = (x/y, y). The Jacobian of the
transformation is v and so the density of u is
fσ1,σ2(u) = ∫_{−∞}^{+∞} (φ((uv)/σ1)/σ1) (φ(v/σ2)/σ2) |v| dv.

This is (using that φ(u) = e^{−u²/2}/√(2π) and that φ(u) = φ(−u) for all u)

fσ1,σ2(u) = (1/(πσ1σ2)) ∫_0^{+∞} e^{−½ v²((1/σ2)² + (u/σ1)²)} v dv
          = (1/(πσ1σ2)) ∫_0^{+∞} e^{−½ v²(1 + u²(σ2/σ1)²)/σ2²} v dv
          = (1/(πσ1σ2)) [−e^{−½ v²(1 + u²(σ2/σ1)²)/σ2²} / ((1 + u²(σ2/σ1)²)/σ2²)]_0^{+∞}
          = 1 / (π (σ1/σ2) (1 + (u/(σ1/σ2))²)),

which is a Cauchy distribution with location zero and scale σ1 /σ2 .

23 / 318
Fisher information

Let

∂ log fθ(x)/∂θ

be the score.
The score has mean zero:

Eθ(∂ log fθ(xi)/∂θ) = ∫ ∂ log fθ(x)/∂θ · fθ(x) dx = ∫ ∂fθ(x)/∂θ dx = 0

(where the last step follows from fθ integrating to one).


The variance of the score,

Iθ = varθ(∂ log fθ(xi)/∂θ),
is called the Fisher information.
When the score is a vector the information is a (variance-covariance) matrix.

24 / 318
Information inequality

Theorem 1 (Cramér-Rao bound)


Under regularity conditions,

varθ (θn ) ≥ Iθ−1 /n

for any unbiased estimator θn of θ.

More information reduces the variance bound.

The bound shrinks like n−1 .

From the proof (to follow) we have that θn attains the bound if and only if
n Iθ (θn − θ) = ∑_{i=1}^n ∂ log fθ(xi)/∂θ |_θ

25 / 318
Proof (for the scalar case).
Differentiating the zero-bias condition
Eθ(θn − θ) = ∫ ··· ∫ (θn(x1 , . . . , xn) − θ) ∏_i fθ(xi) dx1 . . . dxn = 0,

gives

∫ ··· ∫ { (θn(x1 , . . . , xn) − θ) ∂(∏_i fθ(xi))/∂θ − ∏_i fθ(xi) } dx1 . . . dxn = 0.

Because densities integrate to one and an identity below we can re-write as

∫ ··· ∫ (θn(x1 , . . . , xn) − θ) (∑_i ∂ log fθ(xi)/∂θ) ∏_i fθ(xi) dx1 . . . dxn = 1.

But this is just

Eθ((θn − θ) ∑_i ∂ log fθ(xi)/∂θ) = 1.

Furthermore, this is a covariance (as both terms have zero mean), and so

1 = covθ(θn , ∑_i ∂ log fθ(xi)/∂θ)² ≤ varθ(θn) × varθ(∑_i ∂ log fθ(xi)/∂θ)

(by Cauchy-Schwarz). The result then follows from

varθ(∑_i ∂ log fθ(xi)/∂θ) = n Iθ.

26 / 318
Proof Annex: Identity used in Step 2.
Above we used the following:

∑_{i=1}^n ∂ log fθ(xi)/∂θ = ∑_{i=1}^n (1/fθ(xi)) ∂fθ(xi)/∂θ
 = ∑_{i=1}^n (∏_{j≠i} fθ(xj) / ∏_j fθ(xj)) ∂fθ(xi)/∂θ
 = (∑_{i=1}^n ∏_{j≠i} fθ(xj) ∂fθ(xi)/∂θ) / ∏_j fθ(xj)
 = (∂(∏_i fθ(xi))/∂θ) / ∏_j fθ(xj)

(using the chain rule on the differentiation of a product), so that we obtain

∂(∏_{i=1}^n fθ(xi))/∂θ = (∑_{i=1}^n ∂ log fθ(xi)/∂θ) (∏_{j=1}^n fθ(xj)),

the integral of which is an expectation.

27 / 318
Cauchy-Schwarz inequality

Theorem 2 (Cauchy-Schwarz)
For scalar random variables xi and yi

E(xi yi )2 ≤ E(x2i ) E(yi2 ).

Equally, for a sample of size n,


(n⁻¹ ∑_{i=1}^n xi yi)² ≤ (n⁻¹ ∑_{i=1}^n xi²) (n⁻¹ ∑_{i=1}^n yi²).

28 / 318
Information equality

A useful result is the following alternative characterization of the information.

Theorem 3 (Information equality)

varθ(∂ log fθ(xi)/∂θ) = Iθ = −Eθ(∂² log fθ(xi)/∂θ∂θ′).

We will need this later when establishing optimality of maximum likelihood.

29 / 318
Proof.
Differentiating ∫ ∂ log fθ(x)/∂θ · fθ(x) dx = 0 under the integral sign gives

∫ ∂² log fθ(x)/∂θ∂θ′ · fθ(x) dx + ∫ ∂ log fθ(x)/∂θ · ∂fθ(x)/∂θ′ dx = 0.

Because

∂ log fθ(x)/∂θ = (1/fθ(x)) ∂fθ(x)/∂θ,

we have ∂fθ(x)/∂θ = ∂ log fθ(x)/∂θ · fθ(x) and so we obtain

Eθ(∂² log fθ(xi)/∂θ∂θ′) + Eθ(∂ log fθ(xi)/∂θ · ∂ log fθ(xi)/∂θ′) = 0.

Re-arrangement then yields the result.

30 / 318
If it exists, the best unbiased estimator is unique

Theorem 4 (Uniqueness)
If θnA and θnB are such that

Eθ (θnA ) = Eθ (θnB ) = θ, varθ (θnA ) = varθ (θnB ) = Iθ−1 /n,

then θnA = θnB .

31 / 318
Proof (for the scalar case).
Define a third estimator θnC through the linear combination

θnC = λ θnA + (1 − λ) θnB ,    λ ∈ (0, 1).

Then Eθ (θnC ) = λ Eθ (θnA ) + (1 − λ) Eθ (θnB ) = θ, so θnC is also unbiased, and

varθ (θnC ) = λ2 varθ (θnA ) + (1 − λ)2 varθ (θnB ) + 2 λ(1 − λ) covθ (θnA , θnB ).

Now, varθ (θnA ) = varθ (θnB ) = Iθ−1 /n by efficiency and

|covθ (θnA , θnB )| ≤ stdθ (θnA ) stdθ (θnB ) = Iθ−1 /n

by Cauchy-Schwarz. Thus,

varθ (θnC ) ≤ Iθ−1 /n.

The inequality cannot be strict because θnA and θnB are best-unbiased. So
we must have that |corrθ (θnA , θnB )| = 1 which happens iff

θnA = a + b θnB

for constants a, b. Now we have that b = 1 as varθ (θnA ) = varθ (θnB ) and
a = 0 as Eθ (θnA ) = Eθ (θnB ).

32 / 318
Bernoulli

With binary data we have

log fθ (x) = x log(θ) + (1 − x) log(1 − θ),

so that
∂ log fθ(x)/∂θ = x/θ − (1 − x)/(1 − θ) = (x − θ)/(θ(1 − θ)),

∂² log fθ(x)/∂θ² = −(θ(1 − θ) + (x − θ)(1 − 2θ))/(θ²(1 − θ)²) = −(x − θ)²/(θ²(1 − θ)²).

Clearly,

Eθ(∂ log fθ(xi)/∂θ) = Eθ(xi − θ)/(θ(1 − θ)) = (Pθ(xi = 1) − θ)/(θ(1 − θ)) = 0.

Further note that, here,

(∂ log fθ(x)/∂θ)² = −∂² log fθ(x)/∂θ²,
and so the same holds on taking expectations. This immediately verifies the
information equality.
33 / 318
Note that
Eθ (x2i ) = Eθ (xi )
when xi ∈ {0, 1}.
So,
varθ (xi ) = Eθ ((xi − θ)2 ) = Eθ (x2i − 2xi θ + θ2 ) = θ(1 − θ)

The information thus is

Iθ = varθ(xi)/(θ²(1 − θ)²) = 1/(θ(1 − θ)).

The efficiency bound for θ is

θ(1 − θ)/n.

Note that this is a concave function in θ (and so is maximized at θ = 1/2).

The sample-mean theorem immediately implies that xn is the best unbiased


estimator of θ.
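A minimal Monte Carlo sketch of these Bernoulli results (simulated data; θ, n, and the seed are arbitrary): the sample mean is unbiased, its variance matches the bound θ(1 − θ)/n, and the score-variance and negative-Hessian expressions for the information agree.

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 50, 100_000
x = rng.binomial(1, theta, size=(reps, n))
xbar = x.mean(axis=1)

print(xbar.mean().round(4), theta)                           # unbiasedness
print(xbar.var().round(5), theta * (1 - theta) / n)          # variance vs Cramer-Rao bound

score = (x - theta) / (theta * (1 - theta))                  # per-observation score
hess = -((x - theta) ** 2) / (theta * (1 - theta)) ** 2      # per-observation Hessian
print(score.var().round(3), (-hess.mean()).round(3), 1 / (theta * (1 - theta)))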

34 / 318
Sample-mean theorem
Theorem 5 (Sample-mean theorem)
Let xn be the mean of a random sample x1 , . . . , xn from a distribution with
finite mean and variance µ, σ 2 . Then

E(xn ) = µ, var(xn ) = σ 2 /n,

no matter the distribution of the xi .

Proof.
By linearity of the expectations operator in the first step and by random
sampling in the second step,
E(xn) = E((1/n) ∑_{i=1}^n xi) = (1/n) ∑_{i=1}^n E(xi) = µ.

Next,

var(xn) = var((1/n) ∑_{i=1}^n xi) = var(∑_{i=1}^n xi)/n² = ∑_{i=1}^n var(xi)/n² = σ²/n,

again by random sampling.


35 / 318
Poisson

Here we have
log fθ (x) = x log(θ) − θ + constant.
So,
∂ log fθ(x)/∂θ = x/θ − 1,    ∂² log fθ(x)/∂θ² = −x/θ².

We note the mean/variance equality of a Poisson distribution:

Eθ(xi) = ∑_{x=0}^∞ x e^{−θ}θ^x/x! = ∑_{x=1}^∞ e^{−θ}θ^x/(x−1)! = ∑_{x=0}^∞ e^{−θ}θ^{x+1}/x! = θ ∑_{x=0}^∞ e^{−θ}θ^x/x!

(because ∑_{x=0}^∞ e^{−θ}θ^x/x! = ∑_{x=0}^∞ fθ(x) = 1), and similarly,

Eθ(xi²) = ∑_{x=0}^∞ x² e^{−θ}θ^x/x! = θ ∑_{x=0}^∞ (x + 1) e^{−θ}θ^x/x! = θ² + θ,

so that varθ(xi) = Eθ(xi²) − Eθ(xi)² = θ.


Then Iθ = 1/θ and θ/n is the efficiency bound.
It is again immediate (by the sample-mean theorem) that xn will be best
unbiased.
36 / 318
Normal distribution

Here,
log fθ(x) = −½ (log σ² + (x − µ)²/σ²) + constant.

So,

∂ log fθ(x)/∂µ = (x − µ)/σ²,
∂ log fθ(x)/∂σ² = −½ (1/σ² − (x − µ)²/σ⁴),

and

∂² log fθ(x)/∂θ∂θ′ = − [ 1/σ²         (x − µ)/σ⁴
                         (x − µ)/σ⁴   −1/(2σ⁴) + (x − µ)²/σ⁶ ].

The information now is the (diagonal) matrix

Iθ = −Eθ(∂² log fθ(xi)/∂θ∂θ′) = [ 1/σ²   0
                                  0      1/(2σ⁴) ],

so that the efficiency bounds for µ and σ 2 are σ 2 /n and 2σ 4 /n, respectively.

37 / 318
The sample mean is again best unbiased for µ.

An unbiased estimator of σ 2 is
(1/(n − 1)) ∑_{i=1}^n (xi − xn)².

However, it does not hit the efficiency bound (see below).


In fact, as

∑_{i=1}^n ∂ log fθ(x)/∂σ² = −½ (n/σ² − ∑_{i=1}^n (xi − µ)²/σ⁴) = (n/(2σ⁴)) (∑_{i=1}^n (xi − µ)²/n − σ²)

depends on µ we cannot have proportionality of the sampling error of any
estimator when µ is unknown; the best unbiased estimator is n⁻¹ ∑_i (xi − µ)²,
which is infeasible.
It follows that the efficiency bound is not attainable for σ 2 . Moreover, a best
unbiased estimator of σ 2 does not exist.

38 / 318
First start with the obvious estimator of σ 2 that is
σ̂n² = (1/n) ∑_{i=1}^n (xi − xn)².

This estimator is biased:

E(n⁻¹ ∑_{i=1}^n (xi − xn)²) = E((xi − xn)²)

= E(((xi − µ) − (xn − µ))2 )


= E((xi − µ)2 ) − 2E((xi − µ)(xn − µ)) + E((xn − µ)2 )
= varθ (xi ) − 2covθ (xi , xn ) + varθ (xn )
= σ 2 − 2σ 2 /n + σ 2 /n
= σ 2 − σ 2 /n
n−1 2
= σ .
n
The bias arises from estimating the population mean by the sample mean.

The estimator xn has a variance, σ 2 /n, and covaries with each datapoint xi ,
with covariance σ 2 /n.
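A minimal Monte Carlo sketch of this bias calculation (simulated normal data; µ, σ, n, and the seed are arbitrary): the divide-by-n estimator has expectation (n − 1)/n · σ², while the divide-by-(n − 1) version is unbiased.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

s2_hat = x.var(axis=1, ddof=0)      # divides by n
s2_tilde = x.var(axis=1, ddof=1)    # divides by n - 1

print(s2_hat.mean().round(3), (n - 1) / n * sigma**2)   # approx. 3.6
print(s2_tilde.mean().round(3), sigma**2)               # approx. 4.0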

39 / 318
An unbiased estimator is therefore
σ̃² = (n/(n − 1)) σ̂² = (1/(n − 1)) ∑_{i=1}^n (xi − xn)²;

the change in the denominator is called a degrees-of-freedom correction.


We can show (see below) that

(n − 1) σ̃²/σ² = ∑_{i=1}^n (xi − xn)²/σ² ∼ χ²n−1,

and we know the variance of a χ²n−1 is 2(n − 1).
Hence,

var(σ̃²) = (σ²/(n − 1))² var((n − 1) σ̃²/σ²) = 2σ⁴/(n − 1),

which exceeds the efficiency bound 2σ⁴/n.

40 / 318
Sampling distribution of normal variance

First,

(n − 1) σ̃²/σ² = ∑_{i=1}^n (xi − xn)²/σ²
 = ∑_{i=1}^n ((xi − µ) − (xn − µ))²/σ² = ∑_{i=1}^n ((xi − µ)/σ)² − ∑_{i=1}^n ((xn − µ)/σ)²/n · n
 = ∑_{i=1}^n ((xi − µ)/σ)² − n ((xn − µ)/σ)² = ∑_{i=1}^n ((xi − µ)/σ)² − ((xn − µ)/(σ/√n))².

The right-hand side terms are χ2n and χ21 , respectively. The characteristic
function of a χ2p is (1 − 2ιt)−p/2 .
Second, xn and σ̃ 2 are independent by Basu’s theorem.
Third, the characteristic function of the sum of independent variables is the
product of their characteristic functions, so (n − 1) σ̃ 2 /σ 2 has characteristic
function
(1 − 2ιt)−n/2 (1 − 2ιt)1/2 = (1 − 2ιt)−(n−1)/2 ,
so it is χ2n−1 .

41 / 318
Tobit

x∗i ∼ N (µ, σ 2 ).
The data are top-coded at c, i.e.,
xi = x∗i if x∗i < c,   and   xi = c if x∗i ≥ c.

The density is

fθ(x) = ((1/σ) φ((x − µ)/σ))^{x<c} × (1 − Φ((c − µ)/σ))^{x=c}.

Let us focus on the mean parameter µ here. So we assume that σ is known.


Note that
∂ log fθ(x)/∂µ = {x < c} (x − µ)/σ² + {x = c} (φ((c − µ)/σ)/σ)/(1 − Φ((c − µ)/σ));
both coded and non-coded observations will contribute to the likelihood.
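A hedged sketch of maximum likelihood for this top-coded model, with σ treated as known as on the slide. The simulated data, the starting value, and the use of scipy.optimize.minimize are illustrative choices, not part of the slides.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
mu, sigma, c, n = 1.0, 2.0, 2.5, 5_000
x = np.minimum(rng.normal(mu, sigma, n), c)          # observed, top-coded data

def neg_loglik(params):
    m = params[0]                                    # mean parameter (sigma known)
    uncens = x < c                                   # indicator {x < c}
    ll = stats.norm.logpdf(x[uncens], loc=m, scale=sigma).sum()
    ll += (~uncens).sum() * np.log(stats.norm.sf((c - m) / sigma))
    return -ll

res = optimize.minimize(neg_loglik, x0=[0.0], method="BFGS")
print(res.x.round(3), x.mean().round(3))             # MLE near mu = 1.0; the raw mean is biased down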

42 / 318
The probability of not being top-coded and of being top-coded are

Φ((c − µ)/σ),    1 − Φ((c − µ)/σ),

respectively.
Further,

fθ(x | x < c) = (1/σ) φ((x − µ)/σ)/Φ((c − µ)/σ);

so that the deviation of the mean (from µ) of this truncated distribution is

∫_{−∞}^c ((x − µ)/σ) φ((x − µ)/σ) dx / Φ((c − µ)/σ) = −σ² ∫_{−∞}^c ∂(φ((x − µ)/σ)/σ)/∂x dx / Φ((c − µ)/σ) = −σ φ((c − µ)/σ)/Φ((c − µ)/σ).

Using these results it is immediate that

Eθ(∂ log fθ(xi)/∂µ) = 0.

43 / 318
After some more calculus, ∂² log fθ(x)/∂µ² is found to be

−{x < c} (1/σ²) − {x = c} (1/σ²) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ))) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ)) − (c − µ)/σ).

The information on µ then becomes

(1/σ²) Φ((c − µ)/σ) + (1/σ²) φ((c − µ)/σ) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ)) − (c − µ)/σ).

The mean of the underlying random variable is µ.

The mean of the coded data is c.
The mean of the non-coded data is

µ − σ φ((c − µ)/σ)/Φ((c − µ)/σ).

The marginal mean is

(µ − σ φ((c − µ)/σ)/Φ((c − µ)/σ)) Φ((c − µ)/σ) + c (1 − Φ((c − µ)/σ)).

44 / 318
Probit

Again take x∗i ∼ N (µ, σ 2 ).


Now only observe
xi = 1 if x∗i ≥ 0,   and   xi = 0 if x∗i < 0,
which is Bernoulli.
The probability of success and failure are

Φ (µ/σ) , 1 − Φ(µ/σ),

respectively.
These probabilities depend on µ, σ only through the ratio θ = µ/σ, implying
a scale indeterminacy; we can only learn θ.
The mass function becomes

fθ(x) = Φ(θ)^x × (1 − Φ(θ))^{1−x}

(Could further just focus on success probability p = Φ(θ) but this would not
extend to the model with covariates.)

45 / 318
Then
∂ log fθ(x)/∂θ = (x − Φ(θ)) φ(θ)/(Φ(θ)(1 − Φ(θ))),

which has mean zero and variance

φ(θ)²/(Φ(θ)(1 − Φ(θ))),

so the efficiency bound for θ becomes

(1/n) Φ(θ)(1 − Φ(θ))/φ(θ)².

A sensible way to estimate θ would be to first estimate the success probability


p = Φ(θ) by the sample mean xn and then construct

θn = Φ−1 (xn ).

This estimator is not unbiased (an unbiased estimator of θ does not exist
here) but it will hit the efficiency bound in large samples.

46 / 318
Regularity conditions for Cramér-Rao bound

Derivation of best unbiased estimator above required regularity conditions:

Differentiability of the density/mass function,


Conditions for interchanging order of differentiation and integration.

An example where this fails is

xi ∼ (continuous) uniform[0, θ],

that is,
fθ(x) = {0 ≤ x ≤ θ}/θ.
Nonetheless, a best unbiased estimator exists.

This follows from the Lehmann-Scheffé theorem, which builds on complete


sufficient statistics.

Skip the remainder of this section

47 / 318
Sufficiency

A statistic γn = γ(x1 , . . . , xn ) is sufficient for θ if

fθ (x1 , . . . , xn |γn ) = f (x1 , . . . , xn |γn ),

i.e., the conditional distribution does not depend on θ.

An obvious example is the Bernoulli distribution, where Pθ (xi = 1) = θ.


Here,

fθ(x1 , . . . , xn) = θ^{∑_{i=1}^n xi} (1 − θ)^{n − ∑_{i=1}^n xi},

so a sufficient statistic will be ∑_{i=1}^n xi, the number of successes in the sample.
Indeed, ∑_{i=1}^n xi is binomial with

fθ(∑_{i=1}^n xi) = [n!/((∑_{i=1}^n xi)! (n − ∑_{i=1}^n xi)!)] θ^{∑_{i=1}^n xi} (1 − θ)^{n − ∑_{i=1}^n xi},

and so

fθ(x1 , . . . , xn | ∑_{i=1}^n xi) = (∑_{i=1}^n xi)! (n − ∑_{i=1}^n xi)!/n!

is free of θ.

48 / 318
Sufficiency of the sample mean for a normal population

As another illustration, take xi ∼ N (θ, σ 2 ) with σ 2 known.


Then the sample mean xn is sufficient for the unknown population mean θ.
We have

fθ(x1 , . . . , xn) = ∏_{i=1}^n (1/σ) φ((xi − θ)/σ)
 = (2πσ²)^{−n/2} e^{−½ ∑_{i=1}^n (xi − θ)²/σ²}
 = (2πσ²)^{−n/2} e^{−½ (∑_{i=1}^n (xi − xn)² + n(xn − θ)²)/σ²}
 = (2πσ²)^{−(n−1)/2} e^{−½ ∑_{i=1}^n (xi − xn)²/σ²} × (2πσ²)^{−1/2} e^{−½ n(xn − θ)²/σ²}

and

fθ(xn) = (1/(σ/√n)) φ((xn − θ)/(σ/√n)) = n^{1/2} (2πσ²)^{−1/2} e^{−½ n(xn − θ)²/σ²}.

49 / 318
It follows that

fθ(x1 , . . . , xn | xn) = fθ(x1 , . . . , xn)/fθ(xn) = n^{−1/2} (2πσ²)^{−(n−1)/2} e^{−½ ∑_{i=1}^n (xi − xn)²/σ²},
which does not depend on θ.

When σ 2 is unknown a sufficient statistic for both µ and σ 2 is the pair


xn ,    (1/(n − 1)) ∑_{i=1}^n (xi − xn)²,

i.e., the sample mean and sample variance.

50 / 318
Improved estimation based on sufficiency

Theorem 6 (Rao-Blackwell theorem)


Let θ∗ satisfy Eθ (θ∗ ) = θ and let γn be sufficient for θ. Define the estimator

θn = E(θ∗ |γn )

(which is a function of the data through γn only). Then θn is unbiased and

varθ (θn ) ≤ varθ (θ∗ )

holds.

Proof.
Unbiasedness of θn follows from iterating expectations on θ∗ .
Next, by the law of total variance,

varθ (θ∗ ) = var(E(θ∗ |γn )) + Eθ (var(θ∗ |γn )) = varθ (θn ) + non-negative term,

and so varθ (θ∗ ) ≥ varθ (θn ).


Finally, θn is a statistic (and so computable from data) by sufficiency.

51 / 318
Rao-Blackwellization for Bernoulli

A simple unbiased estimator of θ is θ∗ = x1 ; its variance is θ(1 − θ).


Define

θn = E(x1 | ∑_{i=1}^n xi) = Pθ(x1 = 1 | ∑_{i=1}^n xi).

Note that

Pθ(x1 = 1 | ∑_{i=1}^n xi = x) = Pθ(∑_{i=1}^n xi = x | x1 = 1) Pθ(x1 = 1) / Pθ(∑_{i=1}^n xi = x)
 = Pθ(∑_{j≠1} xj = x − 1 | x1 = 1) Pθ(x1 = 1) / Pθ(∑_{i=1}^n xi = x)
 = Pθ(x1 = 1) Pθ(∑_{j≠1} xj = x − 1) / Pθ(∑_{i=1}^n xi = x)
 = θ [(n − 1)!/((n − x)!(x − 1)!)] θ^{x−1}(1 − θ)^{n−x} / ([n!/((n − x)!x!)] θ^x (1 − θ)^{n−x})
 = (n − 1)! x! / (n! (x − 1)!) = x/n.

Thus, θn = n⁻¹ ∑_{i=1}^n xi = xn, which has variance θ(1 − θ)/n (and is, in fact,
best unbiased).
52 / 318
Completeness

A statistic γn is complete (for fθ ) if it holds that

if Eθ (ϕ(γn )) = 0 for all θ, then Pθ (ϕ(γn ) = 0) = 1 for all θ,

for all ϕ for which the expectation exists.

To clarify take xi ∼ N (θ, σ 2 ). Consider the statistic x2 − x1 . We have

Eθ (x2 − x1 ) = θ − θ = 0, for all θ.

However, x2 − x1 ∼ N (0, 2σ 2 ), and so

Pθ (x2 − x1 = 0) = 0 for all θ.

So, this statistic is not complete.

53 / 318
Completeness in the normal problem

A complete statistic here is xn = n⁻¹ ∑_{i=1}^n xi.
We look for a function ϕ such that Eθ(ϕ(xn)) = 0 for all θ.
We have

Eθ(ϕ(xn)) = ∫_{−∞}^{+∞} ϕ(x) (1/(σ/√n)) φ((x − θ)/(σ/√n)) dx
 = (2πσ²/n)^{−1/2} e^{−(n/2)(θ/σ)²} ∫_{−∞}^{+∞} ϕ(x) e^{−(n/2)(x/σ)²} e^{n(θ/σ²)x} dx
 = (2πσ²/n)^{−1/2} e^{−(n/2)(θ/σ)²} L(ϕ(x) e^{−(n/2)(x/σ)²}),

for L(g(x)) the (two-sided) Laplace transform of g(x).


The Laplace transform L(g(x)) cannot be zero unless g(x) is zero (almost
everywhere). As the exponential function is non-zero it must be that ϕ(x) =
0 (almost everywhere), as claimed.

54 / 318
Completeness in the Bernoulli problem
Remember that, if Pθ(xi = 1) = θ for θ ∈ (0, 1), then γn = ∑_{i=1}^n xi is
Binomial with parameters (n, θ).
So, if

Eθ(ϕ(γn)) = ∑_{γ=0}^n ϕ(γ) [n!/(γ!(n − γ)!)] θ^γ (1 − θ)^{n−γ} = (1 − θ)^n ∑_{γ=0}^n ϕ(γ) [n!/(γ!(n − γ)!)] (θ/(1 − θ))^γ = 0

for all θ ∈ (0, 1) then the following polynomial in λ = θ/(1 − θ),

∑_{γ=0}^n cγ λ^γ = 0,    cγ = ϕ(γ) n!/(γ!(n − γ)!),

must be zero for all θ ∈ (0, 1).


But the latter can only hold if cγ = 0 for all γ, and so ϕ(γ) = 0 must hold
for all γ ∈ {0, 1, . . . , n}.
Hence,
Pθ (ϕ(γn ) = 0) = 1
follows.

55 / 318
Best unbiased estimation under sufficiency

Theorem 7 (Lehmann-Scheffé theorem)


Let γn be a complete sufficient statistic for θ and consider θn = ϕ(γn ) for
some function ϕ. If Eθ (θn ) = θ then

varθ (θn ) ≤ var(θ∗ )

where θ∗ is any unbiased estimator; i.e., θn is the best unbiased estimator.

Proof.
By the Rao-Blackwell result, under sufficiency, any efficient estimator must
be a function of γn only; so, θn = ϕ(γn ). Then, by assumption, Eθ (θn ) = θ.
It is enough to show that ϕ is unique. Suppose there exists another ψ such
that Eθ (ψ(γn )) = θ. Then, by unbiasedness of both estimators,

Eθ (ϕ(γn ) − ψ(γn )) = 0.

But, by completeness, this implies that Pθ (ϕ(γn ) = ψ(γn )) = 1 (a.e.).

56 / 318
Bernoulli

We have shown above that


γn = ∑_{i=1}^n xi

is both a complete and sufficient statistic for θ.


An unbiased estimator based on it is the sample mean
xn = n⁻¹ ∑_{i=1}^n xi = γn/n.

This confirms the Cramér-Rao result for Bernoulli that xn is best unbiased.

57 / 318
Estimating the maximum of a uniform distribution

Recall

fθ(x) = {0 ≤ x ≤ θ}/θ.

Easy to see that the maximum-likelihood estimator here is maxi(xi).
This estimator is biased.
For all x ∈ [0, θ],

Pθ(maxi(xi) ≤ x) = Pθ(x1 ≤ x, x2 ≤ x, . . . , xn ≤ x) = (x/θ)^n.

Further,

Eθ(maxi(xi)) = ∫_0^θ (1 − (x/θ)^n) dx = (n/(n + 1)) θ.

(The first step holds for any non-negative random variable z ∈ [0, b], say;
integrate by parts to see that

∫_0^b (1 − F(z)) dz = (1 − F(z)) z |_0^b + ∫_0^b z f(z) dz = E(z),

as claimed.)

58 / 318
It follows that

θn = ((n + 1)/n) maxi(xi)

is unbiased.
Remains to show that γn = maxi(xi) is a complete sufficient statistic for θ.
We already know that Pθ(γn ≤ γ) = (γ/θ)^n and so its density is

n γ^{n−1}/θ^n

for γ ∈ [0, θ] (and zero elsewhere).
Hence,

Eθ(ϕ(γn)) = ∫_0^θ ϕ(γ) n γ^{n−1}/θ^n dγ = (n/θ^n) ∫_0^θ ϕ(γ) γ^{n−1} dγ =: (n/θ^n) Q(θ).

Note that, by Leibniz’ rule,

∂Q(θ)/∂θ = ϕ(θ) θ^{n−1}.
So, if Eθ (ϕ(γn )) = 0 for all θ then Q(θ) = 0 must hold for all θ, but then
its derivative must be zero and so ϕ(θ) = 0 must hold. So, γn is indeed
complete.

59 / 318
To see sufficiency we look at the ratio of the density of the data,

∏_{i=1}^n {xi ≤ θ}/θ = {γn ≤ θ}/θ^n,

and the density of the sample maximum,

n γn^{n−1} {γn ≤ θ}/θ^n

(from above).
As this ratio is γn^{1−n}/n it is free from θ and so γn is indeed sufficient.

Working out the first two moments of γn using its density from above gives

θ²/(n(n + 2))

as the variance of the unbiased estimator θn.

Note that this variance shrinks like n−2 , which is faster than the parametric
rate of n−1 .
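A minimal simulation sketch of these claims (θ, n, and the seed are arbitrary): the bias-corrected maximum (n + 1)/n · maxi(xi) is unbiased and its variance matches θ²/(n(n + 2)).

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 20, 200_000
x = rng.uniform(0, theta, size=(reps, n))
est = (n + 1) / n * x.max(axis=1)

print(est.mean().round(4), theta)                       # unbiasedness
print(est.var().round(6), theta**2 / (n * (n + 2)))     # variance vs the formula above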
60 / 318
Efficiency bound for biased estimators

In many cases an (best) unbiased estimator will not exist. So we need to


widen our search to allow for bias.
First generalize Cramér-Rao bound to case where θn is biased, i.e.,

Eθ (θn ) = θ + bn (θ).

Following the same steps as before gives the efficiency bound

varθ(θn) ≥ Iθ⁻¹ (1 + b′n(θ))²/n.

Quite generally, b′n(θ) = O(n⁻¹), and so

varθ(θn) ≥ Iθ⁻¹/n + O(n⁻²);

the bias vanishes faster than the standard deviation.


From an asymptotic perspective, this paves the way for best asymptotically
unbiased estimators.

61 / 318
Asymptotics (for the univariate sample mean)

Asymptotic analysis is an approximation to the finite-sample behavior of an


estimator based on what happens when n becomes large.

While exact small-sample results are few and ad hoc, large-sample analysis
is well established and widely applicable.

The behavior of the sample mean as n → ∞ brings us a long way.

This is so because almost all estimators you will ever look at behave, as
n → ∞, like a sample mean.

Such estimators are called asymptotically linear; we can always represent


them as

(θn − θ) = n⁻¹ ∑_i ϕθ(xi) + op(n^{−1/2})

for some function ϕθ for which Eθ (ϕθ (xi )) = 0 and varθ (ϕθ (xi )) < ∞. We
will see many examples.

Slide 25 gives the influence function for the best unbiased estimator (when
it exists).
62 / 318
Orders of magnitude (deterministic sequences)

Let h and g be two functions (and h(x) > 0 for large x).
We say that g(x) = O(h(x)) if and only if there exists a positive number b
and a real number x̄ such that

|g(x)| ≤ b h(x) for all x ≥ x̄.

That is,

lim sup_{x→∞} g(x)/h(x) < ∞;

h(x) grows at least as fast as g(x).
We say that g(x) = o(h(x)) if and only if for every positive number b there
exists a real number x̄ such that

|g(x)| ≤ b h(x) for all x ≥ x̄.

That is,

lim_{x→∞} g(x)/h(x) = 0;

h(x) grows faster than g(x).

63 / 318
Orders of magnitude (random sequences)

Let {xn } be a sequence of random variables and let {an } be a deterministic


sequence of numbers.
Consider the limit behavior as n → ∞.
We say that xn = Op(an) if and only if for every δ there exists a finite number
ε and an n̄ such that

P(|xn/an| > ε) < δ for all n ≥ n̄.

That is, |xn/an| is stochastically bounded.

We say that xn = op(an) if and only if for every δ and finite number ε there
exists an n̄ such that

P(|xn/an| > ε) < δ for all n ≥ n̄.

That is,

lim_{n→∞} P(|xn/an| > ε) = 0

for every ε > 0.
64 / 318
Convergence in probability

We say that xn converges in probability to x if, for every ε > 0,

lim_{n→∞} P(|xn − x| > ε) = 0,

i.e., if xn − x = op(1).
We write xn →p x and call x the probability limit of the sequence {xn}.

65 / 318
Theorem 8 ((weak) law of large numbers)
Suppose that µ = E(xi) exists. For any ε > 0 and δ > 0, there exists an n̄
such that

P(|xn − µ| > ε) < δ,    for all n > n̄.

That is, xn →p µ as n → ∞. Equivalently, xn − µ = op(1).

Proof.
Suppose that σ² exists. Then (by Chebychev’s inequality)

P(|xn − µ| > ε) = P((xn − µ)² > ε²) ≤ E((xn − µ)²)/ε² = n⁻¹ σ²/ε².

Taking limits gives the result.

Note that we immediately get the same result for any transformation ϕ(xi)
provided that E(|ϕ(xi)|) < ∞. That is,

n⁻¹ ∑_i ϕ(xi) →p E(ϕ(xi))

as n → ∞.

66 / 318
The plots below give deciles of xn as a function of n for three populations:
normal (σ² exists), Student (µ exists but σ² does not), and Cauchy (µ does not exist).

[Figure: deciles of xn plotted against n for the three distributions.]
67 / 318
Consistency

p
An estimator θn is consistent for an estimand θ if θn → θ.

The mean squared error of θn is

Eθ ((θn − θ)2 ) = (Eθ (θn − θ))2 + varθ (θn ) = bn (θ)2 + varθ (θn );

so a sufficient condition for consistency is that both bias and variance vanish
as n → ∞.

68 / 318
Uniform convergence

A more general situation has ϕθ (xi ) indexed by θ ∈ Θ (continuous on Θ


compact).

A pointwise convergence result (i.e., for any fixed θ ∈ Θ) follows from above:

P(|n⁻¹ ∑_i ϕθ(xi) − E(ϕθ(xi))| > ε) < δ,    for all n > n̄θ,

A uniform result is as follows.


Theorem 9 (Uniform law of large numbers)
Suppose that |ϕθ(x)| ≤ γ(x) and E(|γ(xi)|) < ∞. Then, for all θ ∈ Θ,

P(|n⁻¹ ∑_i ϕθ(xi) − E(ϕθ(xi))| > ε) < δ,    for all n > n̄,

with n̄ independent of θ.

We write

sup_{θ∈Θ} |n⁻¹ ∑_i ϕθ(xi) − E(ϕθ(xi))| →p 0
as n → ∞.
69 / 318
To appreciate the difference between pointwise and uniform convergence take
a simple non-stochastic example:

ϕθ (xi ) = nθe−nθ

for θ ∈ Θ = [0, 1]. This function is continuous in θ.

For any fixed θ,


nθe−nθ → 0
as n → ∞. (because the exponential term vanishes more quickly than the
linear term grows.)

However, at θ = n−1 the function equals e−1 for any n. Hence,

sup_{θ∈Θ} |nθe^{−nθ}| ↛ 0

as n → ∞.

Note that (in general), uniform convergence implies pointwise convergence.

70 / 318
Continuous-mapping theorem

Theorem 10 (Continuous-mapping theorem)


Suppose that xn →p x.
Then

ϕ(xn) →p ϕ(x)
for (non-stochastic) continuous functions ϕ.

71 / 318
Convergence in distribution

Let {xn} be a sequence of random variables with distribution {Fn} and let
x ∼ F.
We say that xn →d x if

Fn (a) → F (a) as n → ∞

at all continuity points a of F .

We call F the limit distribution of {xn }.

If xn →d x it is stochastically bounded, i.e., xn = Op(1).

72 / 318
The central limit theorem

Theorem 11 (Lindeberg-Lévy central limit theorem)


Suppose that xi ∼ i.i.d. (µ, σ 2 ). Then
(xn − µ)/(σ/√n) →d N(0, 1)
as n → ∞.

This means that the sample distribution of the standardized sample mean
approaches the standard-normal distribution.

In practice, this means that


xn ∼a N(µ, σ²/n),

where the a can be interpreted as either ‘asymptotically’ or ‘approximately’.

Observe that this result holds for any distribution, as long as µ, σ 2 exist.

73 / 318
The plots below concern the standardized sample mean of samples of
Bernoulli random variables.
Observe how the histogram approaches the standard-normal density as n
grows.
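A minimal sketch of the simulation behind the (omitted) histograms, comparing quantiles of the standardized Bernoulli sample mean with standard-normal quantiles; the success probability, sample sizes, and seed are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, reps = 0.3, 100_000
for n in (10, 100, 1000):
    xbar = rng.binomial(n, theta, size=reps) / n               # sample means of n Bernoulli draws
    z = (xbar - theta) / np.sqrt(theta * (1 - theta) / n)      # standardized sample mean
    print(n, np.quantile(z, [0.05, 0.95]).round(2), stats.norm.ppf([0.05, 0.95]).round(2))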

74 / 318
Proof.
Let ϕx (t) = E(eιtx ) be the characteristic function of x.
Then
z = (xn − µ)/(σ/√n) = ∑_{i=1}^n (1/√n) (xi − µ)/σ = ∑_{i=1}^n zi/√n (say),

has characteristic function

ϕz(t) = E(e^{ιt ∑_i zi/√n}) = ∏_i E(e^{ι(t/√n)zi}) = ϕzi(t/√n)^n,

where we used random sampling.

Now, as ϕzi(0) = 1, ϕ′zi(0) = 0, and ϕ″zi(0) = −1 we have

ϕzi(t/√n) = ϕzi(0) + ϕ′zi(0) t/√n + ϕ″zi(0) t²/(2n) + o(t²/n) = 1 − t²/(2n) + o(t²/n)

as n → ∞, and so

lim_{n→∞} ϕz(t) = lim_{n→∞} (1 + (−t²/2)/n)^n = e^{−t²/2} (= ϕ of the standard normal)
by definition of the exponential function.

75 / 318
Slutzky’s theorem

Theorem 12 (Slutzky’s theorem)


Suppose that xn →p c (a constant) and yn →d y (a random variable). Then
(i) xn + yn →d c + y; and
(ii) xn yn →d c y.

76 / 318
Take xi ∼ N(µ, σ²). Best ‘estimator’ of σ² is n⁻¹ ∑_i (xi − µ)².

As an example of (i),

σ̂² = n⁻¹ ∑_i (xi − xn)² = n⁻¹ ∑_i (xi − µ)² − (xn − µ)².

As (xn − µ) →p 0 and (a − µ)² is continuous in a we have (xn − µ)² →p 0.
Hence,

σ̂² = n⁻¹ ∑_i (xi − µ)² + op(1).

In fact,

(xn − µ)² = (Op(1/√n))² = Op(1/n) = op(1/√n),

and so

√n(σ̂² − σ²) = (1/√n) ∑_i ((xi − µ)² − σ²) + op(1).

Hence, σ̂² and n⁻¹ ∑_i (xi − µ)² are asymptotically equivalent; their limit
distribution is √n(σ̂² − σ²) →d N(0, 2σ⁴). This is the same limit distribution
as that of (the unbiased) σ̃².

77 / 318
As an example of (ii),

(xn − µ)/(σ̂/√n) = (σ/σ̂) (xn − µ)/(σ/√n) = (1 + op(1)) (xn − µ)/(σ/√n) = (xn − µ)/(σ/√n) + op(1) →d N(0, 1).

Now given an asymptotically-linear estimator,

(θn − θ) = n⁻¹ ∑_i ϕθ(xi) + op(n^{−1/2})

where Eθ(ϕθ(xi)) = 0 and varθ(ϕθ(xi)) < ∞ our results immediately yield
that
(a) θn − θ = Op(n^{−1/2}); and
(b) √n(θn − θ) ∼a N(0, varθ(ϕθ(xi))).
We call varθ (ϕθ (xi )) the asymptotic variance.

78 / 318
Mean-value theorem

Let ϕ be a differentiable function on an interval [x̲, x̄]. Then, for any
(x1, x2) ∈ [x̲, x̄]² there always exists an x∗ ∈ [x̲, x̄] (not necessarily unique)
so that

ϕ(x2) − ϕ(x1) = ∂ϕ(x∗)/∂x · (x2 − x1).

[Figure: a curve ϕ(x) with the secant through x1 and x2 parallel to the tangent at x∗.]
79 / 318
Asymptotics for smooth transformations

Theorem 13 (Delta method)


If √n(θn − θ) →d N(0, σ²), then

√n(ϕ(θn) − ϕ(θ)) →d N(0, (∂ϕ(θ)/∂θ)² σ²)

for continuously-differentiable ϕ.

Proof.
A mean-value expansion gives

ϕ(θn) − ϕ(θ) = ∂ϕ(θ∗)/∂θ · (θn − θ).

The continuous-mapping theorem yields

∂ϕ(θ∗)/∂θ →p ∂ϕ(θ)/∂θ.

Slutzky’s theorem gives

√n(ϕ(θn) − ϕ(θ)) = ∂ϕ(θ)/∂θ · √n(θn − θ) + op(1) →d N(0, (∂ϕ(θ)/∂θ)² σ²).
80 / 318
The multivariate case

Now suppose that xi is a vector with mean µ and variance Σ.

The multivariate central limit theorem reads


√n Σ^{−1/2} (xn − µ) →d N(0, I),

where I is the identity matrix of conformable dimension. Here, the limit


distribution is a multivariate standard normal.
The Delta method extends as follows. Suppose √n(θn − θ) →d N(0, Σ). Then

√n(ϕ(θn) − ϕ(θ)) →d N(0, ΓΣΓ′)

for

Γ = ∂ϕ(θ)/∂θ′
the Jacobian matrix.

81 / 318
A nonsingular symmetric matrix A (such as a variance matrix) has eigendecomposition

A = V DV −1

where D is a diagonal matrix of eigenvalues and V is the matrix of associated


eigenvectors.

The inverse is
A−1 = V D−1 V −1 .
A matrix square root is
A1/2 = V D1/2 V −1 .

Note that

A−1/2 AA−1/2 = (V D−1/2 V −1 )(V DV −1 )(V D−1/2 V −1 ) = I.

So, for example, if √n(θn − θ) →d N(0, Σ) for an m × m nonsingular variance
Σ, then
(i) √n Σ^{−1/2}(θn − θ) →d N(0, Im); and
(ii) n(θn − θ)′ Σ⁻¹ (θn − θ) →d χ²m.

82 / 318
The multivariate normal distribution

If x ∼ N(µ, Σ) its density is

(1/√((2π)^{dim x} det(Σ))) e^{−½ (x − µ)′Σ⁻¹(x − µ)}.

Any subset of x is again normal. All conditional distributions are again


normal.
Partition x = (x1′, x2′)′ and write

(x1 , x2)′ ∼ N((µ1 , µ2)′, [Σ11  Σ12; Σ21  Σ22]).

The marginal distribution of x1 is normal, x1 ∼ N (µ1 , Σ11 ).


The conditional distribution of x1 given x2 is

N(µ1 + Σ12 Σ22⁻¹ (x2 − µ2), Σ11 − Σ12 Σ22⁻¹ Σ21).

83 / 318
The bivariate normal distribution

The above is particularly tractable in the bivariate case, where x1 and x2 are
both scalars.
Write

(x1 , x2)′ ∼ N((µ1 , µ2)′, [σ1²  ρσ1σ2; ρσ1σ2  σ2²])

for ρ the correlation between x1 and x2.
Here,

x1 | x2 ∼ N(µ1 + ρ (σ1/σ2)(x2 − µ2), (1 − ρ²) σ1²).

Note that

E(x1 | x2) = µ1 + ρ (σ1/σ2)(x2 − µ2) = (µ1 − ρ (σ1/σ2) µ2) + ρ (σ1/σ2) x2

is linear in x2.
Also, var(x1 | x2) is a constant (i.e., not a function of x2).

84 / 318
Best asymptotically unbiased estimation

We say that θn is best asymptotically unbiased for θ if


√n(θn − θ) →d N(0, Iθ⁻¹),

so it achieves the Cramér-Rao bound in large samples.

It exists under weak regularity conditions.

A coherent way of finding it is through the method of maximum likelihood.

85 / 318
The likelihood function

The maximum-likelihood estimator is


θ̂ = arg max_{θ∈Θ} ∏_{i=1}^n fθ(xi) = arg max_{θ∈Θ} ∑_{i=1}^n log fθ(xi).

The likelihood function, ∏_{i=1}^n fθ(xi), represents the density of the sample
when sampling from fθ .

In the discrete case, it is the probability of observing the actual sample, when
sampling from fθ .

Maximize this probability as a function of θ.

Intuitively attractive. Pretty much what anyone without any prior statistical
knowledge would do.

86 / 318
Maximization program

Let

Ln(θ) = ∑_i log fθ(xi)

be the log-likelihood function.

The first-order condition is that


∂Ln(θ)/∂θ = ∑_i ∂ log fθ(xi)/∂θ = 0;

this is the score equation.

The second-order condition for a maximum is that


∂²Ln(θ)/∂θ∂θ′ = ∑_i ∂² log fθ(xi)/∂θ∂θ′ < 0;

the Hessian matrix is negative definite.

Note how these derivatives relate to the Cramér-Rao bound.

87 / 318
Numerical maximization: Newton-Raphson

Often we need to tackle the maximization problem numerically.


Newton-Raphson is a popular algorithm for finding the roots of the score
equation.
Want to solve ϕ(x) = 0. Let x0 be an initial guess. For a new guess x1 we
have
(ϕ(x1) − ϕ(x0))/(x1 − x0) ≈ ∂ϕ(x)/∂x |_{x=x0} = ϕ′(x0).
So,
ϕ(x0 ) + (x1 − x0 ) ϕ0 (x0 ) ≈ ϕ(x1 ).
We want that ϕ(x1 ) = 0. Solving for x1 yields

x1 = x0 − ϕ(x0 )/ϕ0 (x0 )

as our new guess.


In practice, when maximizing a function whose derivative is ϕ, we start at
x0 and then evaluate ϕ at x1. If the function does not improve at x1 we
re-evaluate at x1′ = x0 − h(x0 − x1) for h ∈ (0, 1) a step size.
We then iterate this procedure until no further improvement (up to some
specified tolerance level) is found.
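A hedged sketch of Newton-Raphson applied to the probit score equation in the scalar model Pθ(xi = 1) = Φ(θ) discussed earlier. The simulated data, starting value, and numerical derivative are illustrative choices; the closed form Φ⁻¹(xn) (via invariance, below) provides a check.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.binomial(1, stats.norm.cdf(0.4), size=2_000)     # simulated Bernoulli data

def score(t):        # d log-likelihood / d theta
    p = stats.norm.cdf(t)
    return np.sum((x - p) * stats.norm.pdf(t) / (p * (1 - p)))

def score_prime(t):  # numerical (forward-difference) derivative of the score
    h = 1e-6
    return (score(t + h) - score(t)) / h

t = 0.0                                                  # initial guess
for _ in range(25):
    step = score(t) / score_prime(t)
    t = t - step                                         # Newton-Raphson update
    if abs(step) < 1e-8:
        break

print(round(t, 4), round(float(stats.norm.ppf(x.mean())), 4))   # the two should coincide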
88 / 318
Bernoulli

When xi ∈ {0, 1} with probability θ ∈ (0, 1) we have


Ln(θ) = log(∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi}) = ∑_{i=1}^n xi log θ + (1 − xi) log(1 − θ).

So, solving
∂Ln(θ)/∂θ = ∑_{i=1}^n (xi − θ)/(θ(1 − θ)) = n (xn − θ)/(θ(1 − θ)) = 0

for θ yields θ̂ = xn as the unique solution. This is a global maximum as


∂²Ln(θ)/∂θ² = −∑_{i=1}^n (xi − θ)²/(θ²(1 − θ)²) < 0

for all θ ∈ (0, 1).


This estimator is best unbiased and so also best asymptotically unbiased.

89 / 318
Invariance

Maximum likelihood is invariant to one-to-one parametrizations, β = β(θ).

If Ln (θ) is the log-likelihood and L∗n (β) is the reparametrized log-likelihood,


then
β̂ = arg max L∗n (β) = β(arg max Ln (θ)) = β(θ̂).
β θ

This is an interesting property.

A consequence of invariance is that maximum likelihood will not be unbiased,


in general.

If θ̂ is unbiased and the transformation β(θ) is nonlinear, then β̂ = β(θ̂) will


be biased, in general, by Jensen’s inequality.

90 / 318
Jensen’s inequality
A univariate function ϕ is concave if

ϕ(λx + (1 − λ)x0 ) ≥ λ ϕ(x) + (1 − λ)ϕ(x0 )

and convex if

ϕ(λx + (1 − λ)x0 ) ≤ λ ϕ(x) + (1 − λ)ϕ(x0 )

for all λ ∈ [0, 1].

Theorem 14 (Jensen’s inequality)


If ϕ is concave, E(ϕ(xi )) ≤ ϕ(E(xi )). If ϕ is convex, E(ϕ(xi )) ≥ ϕ(E(xi )).

Proof.
Take ϕ concave. Let ψ be the tangent line at E(xi ); i.e., ψ(x) = a + bx for
constants a, b such that ϕ(E(xi )) = ψ(E(xi )).
By concavity ϕ(x) ≤ ψ(x) for any x. Hence, using linearity of the tangent,

E(ϕ(xi )) ≤ E(ψ(xi )) = ψ(E(xi )) = ϕ(E(xi )).

91 / 318
Probit

The simplest probit model from above had

Pθ (xi = 1) = Φ(θ).

The score equation is


∑_{i=1}^n (xi − Φ(θ)) φ(θ)/(Φ(θ)(1 − Φ(θ))) = 0

and the efficiency bound was

(1/n) Φ(θ)(1 − Φ(θ))/φ(θ)².

Finding θ̂ by solving the score equation requires numerical optimization.


Notice that the success probability β = Φ(θ) is a one-to-one transformation
of θ. The likelihood for β is the ordinary Bernoulli likelihood, with maximizer
β̂ = xn .
It follows that θ̂ = Φ−1 (xn ).

92 / 318
Further, as

√n(β̂ − β) →d N(0, β(1 − β))

by the central limit theorem,

∂Φ⁻¹(β)/∂β = 1/φ(θ),

and β = Φ(θ), the Delta method gives

√n(θ̂ − θ) →d N(0, Φ(θ)(1 − Φ(θ))/φ(θ)²).

The asymptotic variance is indeed the Cramér-Rao bound.
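A minimal sketch of the estimator and its delta-method standard error in this simple probit model; the true θ, the sample size, and the seed are simulated illustrative choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, n = 0.25, 5_000
x = rng.binomial(1, stats.norm.cdf(theta), size=n)

beta_hat = x.mean()                                   # estimate of Phi(theta)
theta_hat = stats.norm.ppf(beta_hat)                  # invariance: theta_hat = Phi^{-1}(xbar)

# delta method: var(theta_hat) is approx. Phi(1 - Phi)/phi^2 / n, evaluated at the estimate
se = np.sqrt(beta_hat * (1 - beta_hat) / n) / stats.norm.pdf(theta_hat)
print(theta_hat.round(3), se.round(3))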

93 / 318
The maximum-likelihood estimator may fail to exist in small samples.

In the probit model this happens when complete separation is possible.


In essence, this means we can classify outcomes exactly.
In our model, where
Pθ (xi = 1) = β = Φ(θ),
a data set consisting of only successes (ones) will have β̂ = 1, and so θ̂ = +∞.
If β ∈ (0, 1) this problem will not occur in large samples.

In an extended model with explanatory variables perfect separation would


happen, for example, when all successes can be assigned to one covariate and
all failures to another.
This problem may, in principle, persist in large samples.

94 / 318
Why does maximizing the likelihood work: Identification

Note that the expected log-likelihood


L(θ∗) = Eθ(Ln(θ∗)) = ∑_{i=1}^n Eθ(log fθ∗(xi)) = n Eθ(log fθ∗(xi))

is maximized at θ.
Indeed,

Eθ(Ln(θ∗) − Ln(θ)) = n Eθ(log(fθ∗(xi)/fθ(xi))) ≤ n log Eθ(fθ∗(xi)/fθ(xi)) = 0,

using Jensen’s inequality and the fact that Eθ(fθ∗(xi)/fθ(xi)) = ∫ fθ∗(x) dx = 1.
Crudely put, L(θ) is the log-likelihood function we would use if we would
have an infinitely-large sample.
(Point) identification means that, in that case, we would be able to learn θ;
so
θ = arg max_{θ∗∈Θ} L(θ∗),

and is unique.
95 / 318
Identification may fail (we will give an example below).

Global identification: θ is the unique maximizer of L on Θ.

Local identification: θ is the unique maximizer of L in some neighborhood


around θ.

Local identification is

∂²L(θ)/∂θ∂θ′ < 0.

Note that, as

∂²L(θ)/∂θ∂θ′ = n Eθ(∂² log fθ(xi)/∂θ∂θ′) = −nIθ,
this is equivalent to the information matrix being positive definite and, hence,
of full rank.

Local identification can be tested.

96 / 318
Why does maximizing the likelihood work: Argmax theorem

By a uniform law of large numbers,


sup_{θ∈Θ} |n⁻¹ Ln(θ) − n⁻¹ L(θ)| →p 0,    as n → ∞,

provided fθ is continuous, |log fθ (x)| < b(x) so that E(b(xi )) < ∞, and Θ is
closed and bounded (compact).

Then, if θ is identified as the unique global maximizer of L(θ), we will have


that
arg max_{θ∈Θ} Ln(θ) →p arg max_{θ∈Θ} L(θ),

but this is just

θ̂ →p θ,    as n → ∞,
which is consistency.

97 / 318
A uniform ε-band around L(θ) and the corresponding interval [θmin , θmax ] in
which θ̂ must lie.

[Figure: L(θ) with a uniform ε-band around it and the implied interval [θmin, θmax] around θ.]

As n → ∞, the ε-band tightens and so the interval [θmin , θmax ] shrinks to a


point. By identification this point must be θ. As θ̂ ∈ [θmin , θmax ] it must be
that θ̂ converges to θ.

98 / 318
Regarding uniform convergence, consider the probit model as an example.
There,

log fθ(y|x) = y log Φ(x′θ) + (1 − y) log Φ(−x′θ).

We have, by a mean-value expansion, that

log Φ(xθ) = log Φ(0) + xθ φ(xθ∗)/Φ(xθ∗)

and 0 < φ(u)/Φ(u) ≤ c |1 + u| for some finite c (visual inspection will help to see
this). Consequently,

|log Φ(xθ)| ≤ |log Φ(0)| + |xθ| φ(xθ∗)/Φ(xθ∗) ≤ |log(2)| + c |1 + xθ∗| |x||θ|
            ≤ |log(2)| + c |x||θ| + c |x|²|θ|²,

so an integrable upper bound on |log fθ(y|x)| exists provided E(xi²) < ∞.

99 / 318
Asymptotic normality

By definition
∑_{i=1}^n ∂ log fθ(xi)/∂θ |_{θ̂} = 0.

A mean-value expansion around the true θ gives


∑_{i=1}^n ∂ log fθ(xi)/∂θ |_{θ̂} = ∑_{i=1}^n ∂ log fθ(xi)/∂θ |_θ + (∑_{i=1}^n ∂² log fθ(xi)/∂θ∂θ′ |_{θ∗}) (θ̂ − θ) = 0,

where θ∗ is some vector that (elementwise) lies between θ̂ and θ.


Inversion of this equation gives the sampling-error representation
(θ̂ − θ) = −(∑_{i=1}^n ∂² log fθ(xi)/∂θ∂θ′ |_{θ∗})⁻¹ ∑_{i=1}^n ∂ log fθ(xi)/∂θ |_θ.

100 / 318
Now, by invoking a uniform law of large numbers together with consistency,
(1/n) ∑_{i=1}^n ∂² log fθ(xi)/∂θ∂θ′ |_{θ∗} →p Eθ(∂² log fθ(xi)/∂θ∂θ′ |_θ) = −Iθ.

Also, by the central limit theorem,

(1/√n) ∑_{i=1}^n ∂ log fθ(xi)/∂θ |_θ →d N(0, Iθ).

Then, by the continuous-mapping theorem and Slutzky’s theorem, we get
the influence-function representation

√n(θ̂ − θ) = (1/√n) ∑_{i=1}^n Iθ⁻¹ ∂ log fθ(xi)/∂θ |_θ + op(1)

and we have the following result (note we use the information equality here).

Theorem 15 (Optimality of maximum likelihood)


Under regularity conditions,
√n(θ̂ − θ) →d N(0, Iθ⁻¹).

101 / 318
Variance estimation

The information matrix—and, hence, the asymptotic variance of maximum


likelihood—can be estimated.
There are two obvious choices.
The first follows from its definition as the variance of the score:
(1/n) ∑_{i=1}^n ∂ log fθ(xi)/∂θ · ∂ log fθ(xi)/∂θ′ |_{θ̂}.

The second follows from the information equality:


−(1/n) ∑_{i=1}^n ∂² log fθ(xi)/∂θ∂θ′ |_{θ̂}.

In both cases, a uniform law of large numbers can be used to show consistency.
The square root of the diagonal entries (of the inverse) give (estimated)
standard errors on the maximum-likelihood estimator (after dividing through

by n) and so can serve to assess its precision. They will equally serve us in
testing later on.

102 / 318
Labor-force participation

Consider the decision of married women to participate to labor market, yi .

Individuals make decisions based on their own situation/characteristics, xi .


PSID has data on a variety of characteristics (age, education, number of
children, and so on).

Standard Bernoulli is too simple to capture this dependence on observable


characteristics.

A (possible) specification for a conditional model would be

pi = P (yi = 1|xi ) = Φ(x0i β).

Here, choice probabilities are heterogenous in characteristics.

We can derive an econometric model from a specification of an economic


model for the women’s decision problem:

yi = 1 ⇔ u(xi , εi ) ≥ 0;

u(xi , εi ) is i’s utility from working; εi is not observed to the econometrician.


Our specification has u(xi , εi ) = x0i β + εi for εi ∼ N (0, 1) independent of xi .
103 / 318
104 / 318
What are the parameters of interest in the probit model?
Take xi scalar continuous for a moment.
The marginal effect is

∂E(yi|xi)/∂xi = ∂Φ(xiβ)/∂xi = β φ(xiβ).
This is nonlinear and heterogenous.
Can look at the distribution of this marginal effect (in xi ), and its functionals.
For example, the mean
 
θ = E(∂E(yi|xi)/∂xi) = β E(φ(xiβ)).

The maximum-likelihood estimator is


θ̂ = n⁻¹ ∑_{i=1}^n β̂ φ(xi β̂).

To obtain a standard error, use the Delta method.


We can also look at other functionals of the distribution of the marginal
effects.
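A hedged sketch of the plug-in average marginal effect. The data are simulated, the model has a single covariate and no constant for brevity, and the probit fit below uses scipy.optimize.minimize purely for illustration; any maximum-likelihood routine would do.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
n, beta = 2_000, 0.8
x = rng.normal(size=n)                                    # scalar covariate
y = (x * beta + rng.normal(size=n) > 0).astype(float)     # probit data

def neg_loglik(b):
    p = np.clip(stats.norm.cdf(x * b[0]), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = optimize.minimize(neg_loglik, x0=[0.0], method="BFGS").x[0]
ame_hat = np.mean(beta_hat * stats.norm.pdf(x * beta_hat))   # n^{-1} sum_i beta_hat * phi(x_i beta_hat)
print(round(beta_hat, 2), round(ame_hat, 3))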
105 / 318
106 / 318
Classical linear regression

The classical linear regression model is an extension of the location/scale


model from above in that it adds regressors.
Data on outcome yi and a (column) vector of regressors (or covariates or
explanatory variables) xi .
The model is
yi |xi ∼ N (x0i β, σ 2 ),
and θ = (β 0 , σ 2 )0 . Equivalently (and more commonly) we can write the model
as
yi = x0i β + εi , εi |xi ∼ N (0, σ 2 ).
Unless explicitly stated otherwise the first covariate is taken to be a constant
term.
This is a simple model for analyzing how the distribution of yi changes with
xi . Here, only impact is through the mean:

E(yi |xi ) = x0i β.

107 / 318
Often convenient to look at this model in matrix form.
We have a set of n equations with k regressors, as in

[ y1 ]   [ x1,1  x1,2  · · ·  x1,k ] [ β1 ]   [ ε1 ]
[ y2 ]   [ x2,1  x2,2  · · ·  x2,k ] [ β2 ]   [ ε2 ]
[ ⋮  ] = [  ⋮     ⋮            ⋮   ] [ ⋮  ] + [ ⋮  ] ,
[ yn ]   [ xn,1  xn,2  · · ·  xn,k ] [ βk ]   [ εn ]

which we write as

y = Xβ + ε, ε ∼ N (0, σ 2 I).

The log-likelihood (conditional on the regressors) then is

−(n/2) log σ² − ½ ∑_i (yi − xi′β)²/σ² = −(n/2) log σ² − (y − Xβ)′(y − Xβ)/(2σ²).

108 / 318
The score equation for β is
X′(y − Xβ)/σ² = 0.
It has the unique solution

β̂ = (X 0 X)−1 X 0 y

(independent of σ 2 ) provided that the inverse of the matrix X 0 X exists.


This is known as the ordinary least-squares estimator.

So, β̂ is uniquely defined if X has maximal rank. This is the well-known


‘no-multicollinearity condition’. No column of X can be written as a linear
combination of the other columns.
A simple counter-example is the dummy-variable trap, where the regressors
would be a constant and a collection of dummies for events whose union
happens with probability one.

109 / 318
Say, xi = (1, di , 1 − di )0 where di is a binary indicator.

Then xi,1 = xi,2 + xi,3 for all i and the rank condition fails.

The model
yi = β1 + di β2 + (1 − di )β3 + εi
is observationally-equivalent to the three-parameter/two-regressor model

yi = (β1 + β3 ) + di (β2 − β3 ) + εi = α1 + di α2 + εi .

We can only learn the reduced-form parameters (α1 , α2). The identified set
for β is
{β ∈ R³ : β1 + β3 = α1 , β2 − β3 = α2}.
For example, given β3 , we can back out (β1 , β2 ) but, without this knowledge,
we can only say things such as β1 − β2 = α1 − α2 .

110 / 318
As another example of identification failure, suppose that we do not observe
yi in the data but, instead, observe variables y̲i ≤ ȳi for which we know that

y̲i ≤ yi ≤ ȳi

(income data in social security records, for example, is often bracketed in


this way). Here we cannot even compute the value of the likelihood.
Then the conditional mean is only restricted by

E(y̲i|xi) ≤ xi′β ≤ E(ȳi|xi)

(where we use E(yi |xi ) = x0i β).


An implication is that

E(xi y̲i) ≤ E(xi xi′) β ≤ E(xi ȳi).

We can estimate all β compatible with this moment inequality by the set
[β̂low, β̂high], with

β̂low = (X′X)⁻¹X′y̲,    β̂high = (X′X)⁻¹X′ȳ
in obvious notation.

111 / 318
We will write
ŷ = X β̂, ε̂ = y − X β̂,
for fitted values and residuals, respectively.
We have the decomposition
y = ŷ + ε̂,
where the fitted values and residuals are uncorrelated, i.e., ŷ 0 ε̂ = 0. Indeed,
the score equation at β̂ equals
X 0 ε̂
= 0,
σ2
so we can say that β̂ gives us that linear combination of the regressors for
which residuals and regressors are exactly uncorrelated.
An implication is that

y 0 y = (ŷ + ε̂)0 (ŷ + ε̂) = ŷ 0 ŷ + ε̂0 ε̂,

and so the uncentered R2


Ru² = ŷ′ŷ/(y′y) = 1 − ε̂′ε̂/(y′y) ∈ [0, 1]
gives a relative contribution of the variation in fitted values to the variation
in observed outcomes.
112 / 318
More popular to report is the (centered) R2 , which looks at deviations from
the mean, as
R² = ESS/TSS = 1 − SSR/TSS,

where the total sum of squares decomposes as

TSS = ∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n (ŷi − ȳ)² + ∑_{i=1}^n ε̂i² = ESS + SSR,

into the explained sum of squares and the sum of squared residuals (note that
the mean of the ŷi equals ȳ because ∑_i ε̂i = 0).
The intuition is that we want to compare the improvement in fit of a model
with regressors to a model without regressors.
Such a desire for fit comes from the use of the regression model to form linear
predictions.
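
A short sketch (simulated data, assumed column of ones in X) computing both versions of R²:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat, e_hat = X @ beta_hat, y - X @ beta_hat

    R2_uncentered = (y_hat @ y_hat) / (y @ y)      # = 1 - e'e / y'y
    TSS = np.sum((y - y.mean()) ** 2)
    R2_centered = 1.0 - (e_hat @ e_hat) / TSS      # = ESS / TSS when X contains a constant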

113 / 318
The vector y is a point in Rn . The column space of the n × k matrix X is
the subspace of linear combinations

X = {a ∈ Rn : a = Xb for some vector b}.

That is, X is the vector space spanned by the columns of X. If rank X = k


the columns of X are basis vectors for X .
The orthogonal projection of y onto X is the solution to

  min_{a∈X} ‖y − a‖ = min_{b∈Rᵏ} ‖y − Xb‖ = min_{b∈Rᵏ} √((y − Xb)'(y − Xb))

and equals
ŷ = X β̂ = X(X 0 X)−1 X 0 y = P X y.
The deviation of y from its projection is

ε̂ = y − ŷ = y − P X y = (I − P X )y = M X y,

and lives in the orthogonal complement X ⊥ , so ŷ 0 ε̂ = 0. The projection


matrices

P X = X(X 0 X)−1 X 0 , M X = I − P X = I − X(X 0 X)−1 X 0 ,

will prove convenient. Note that P X = P 0X and P 2X = P X (and the same


for M X ) and that P X M X = 0.
114 / 318
Least squares projection in a three-dimensional space:

[Figure: y is projected onto the plane spanned by x1 and x2 ; the projection is ŷ = x1 β̂1 + x2 β̂2 and the residual ε̂ is orthogonal to that plane.]

115 / 318
Partition X = (X 1 , X 2 ) so that

y = X 1 β1 + X 2 β2 + ε

and, hence,

  [ β̂1 ]   [ X1'X1  X1'X2 ]⁻¹ [ X1'y ]
  [ β̂2 ] = [ X2'X1  X2'X2 ]   [ X2'y ] .

Some algebra using formulae for partitioned matrix inversion shows that

β̂1 = (X 01 M X 2 X 1 )−1 (X 01 M X 2 y)

(and likewise for β̂2 ).


M X 2 y is the residual vector of a regression of y on X 2 .
M X 2 X 1 is the residual matrix of a regression of the columns of X 1 on X 2 .
These residuals are uncorrelated with X 2 . Moreover, M X 2 y is y, and
M X 2 X 1 is X 1 , respectively, after their linear dependence on X 2 has been
filtered out. β̂1 is the slope coefficient in a regression of these residuals on
each other.
This gives (multiple) least squares its partialling-out interpretation. The
result is known as the Frisch-Waugh-Lovell theorem.
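
A numerical check of the partialling-out result on simulated data (a sketch, not part of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    X1 = rng.normal(size=(n, 2))
    X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    X = np.hstack([X1, X2])
    y = X @ rng.normal(size=5) + rng.normal(size=n)

    def ols(A, b):
        return np.linalg.solve(A.T @ A, A.T @ b)

    beta_full = ols(X, y)[:2]                                # coefficients on X1 in the full regression

    M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)   # residual-maker for X2
    beta_fwl = ols(M2 @ X1, M2 @ y)                          # regression of residuals on residuals

    print(np.allclose(beta_full, beta_fwl))                  # True: Frisch-Waugh-Lovell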

116 / 318
The estimator β̂ is (conditionally) unbiased,

E(β̂|X) = E((X 0 X)−1 X 0 y|X) = (X 0 X)−1 X 0 E(y|X) = β

(because E(y|X) = Xβ). Its variance is

var(β̂|X) = E((β̂ − β)(β̂ − β)0 |X)


= E((X 0 X)−1 X 0 εε0 X(X 0 X)−1 |X)
= (X 0 X)−1 X 0 E(εε0 |X)X(X 0 X)−1
= (X 0 X)−1 X 0 σ 2 I n X(X 0 X)−1
= σ 2 (X 0 X)−1 .

In fact, its exact (conditional) distribution is normal,

β̂ ∼ N (β, σ 2 (X 0 X)−1 ),

because, for any conformable non-stochastic matrix A, Aε ∼ N (0, σ 2 AA0 ),


and thus also for A = (X 0 X)−1 X 0 .
It is also best unbiased, as β̂ − β = (X 0 X)−1 X 0 ε is proportional to the
score equation at the truth (X 0 ε/σ 2 ), with factor of proportionality equal to
σ 2 (X 0 X)−1 .

117 / 318
The score equation for σ² is

  −n/σ² + (y − Xβ)'(y − Xβ)/σ⁴ = 0,
which, given β̂, has solution
  σ̂² = ε̂'ε̂/n.
We already know that this estimator is biased; an unbiased version would be
  σ̃² = ε̂'ε̂/(n − k).
Indeed,

  E(ε̂'ε̂/n | X) = E(y'M_X y|X)/n = E(ε'M_X ε|X)/n = σ² tr(M_X)/n = σ² (n − k)/n
because

tr(M X ) = tr(I n − P X ) = tr(I n ) − tr(P X ) = n − k;

using that tr(P X ) = tr(X(X 0 X)−1 X 0 ) = tr(X 0 X(X 0 X)−1 ) = tr(Ik ) = k.


Finally,
(n − k) σ̃ 2 /σ 2 ∼ χ2n−k .
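
A small Monte Carlo check of these claims (a sketch with an arbitrary design; normal errors as in the model):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, sigma2 = 50, 3, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    beta = np.array([1.0, 2.0, -1.0])

    ssr = []
    for _ in range(5000):
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
        ssr.append(e @ e)
    ssr = np.array(ssr)

    print(ssr.mean() / n)         # approx sigma2 * (n - k)/n : sigma_hat^2 is biased
    print(ssr.mean() / (n - k))   # approx sigma2            : sigma_tilde^2 is unbiased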
118 / 318
Now consider the behavior of the estimators as n → ∞. We let Σ = E(xi x0i ).
First,
β̂ − β = (X 0 X)−1 X 0 ε,
and
  X'X/n →p Σ,   X'ε/n →p 0,

so that β̂ →p β by the continuous-mapping theorem.
Next, we have

  √n(β̂ − β) = √n (X'X)⁻¹ X'ε = (X'X/n)⁻¹ (X'ε/√n),

and

  X'ε/√n →d N (0, σ² Σ).

So,

  √n(β̂ − β) = n⁻¹ᐟ² ∑i Σ⁻¹ xi εi + op (1) →d N (0, σ² Σ⁻¹).

The influence function here is Σ⁻¹ xi εi .

119 / 318
For the variance estimator,

  σ̂² = ε̂'ε̂/n
     = (y − X β̂)'(y − X β̂)/n
     = (X(β − β̂) + ε)'(X(β − β̂) + ε)/n
     = ε'ε/n + (β̂ − β)'(X'X)(β̂ − β)/n + 2(β − β̂)'X'ε/n
     = ε'ε/n + op (n−1/2),

because ‖β̂ − β‖ = Op (n−1/2), X'X = Op (n) and X'ε = op (n).

Therefore,

  √n(σ̂² − σ²) = (ε'ε − E(ε'ε))/√n + op (1) = n−1/2 ∑i (εi² − E(εi²)) + op (1) →d N (0, 2σ⁴)

(recall that, under normality, E(ε4i ) = 3σ 4 .)


This estimator is best asymptotically unbiased (and so is σ̃ 2 ).

120 / 318
Production and cost function

Suppose a firm creates output according to Cobb-Douglas technology. The


associated cost function is linear in logs. The regressors are a constant term,
(the log of) total output, and the log of the price of inputs (labor, capital,
and so on).

The coefficients on the input prices are elasticities.

121 / 318
Poisson regression

The linear regression model will typically be inappropriate when data are
not continuous.
An example is yi ∈ N, i.e., count data.
Patent-application data fits this framework.
A Poisson regression model has (conditional) mass function

  µi^yi e^(−µi) / yi ! ,   µi = exp(xi'β).

Remember that µi = exp(xi'β) is the conditional mean of the outcome variable.

122 / 318
The log-likelihood (up to a constant) is

  ∑i (yi xi'β − exp(xi'β)) + constant.

The score equation is

  ∑i xi (yi − exp(xi'β)) = 0,

and the Hessian matrix is

  − ∑i (xi xi') exp(xi'β) < 0.

The maximum-likelihood estimator of β is not unbiased.
It is best asymptotically unbiased, however, with limit distribution

  √n(β̂ − β) →d N (0, E(xi xi' µi )⁻¹).
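
A minimal sketch of solving this score equation by Newton's method on simulated data (the design and starting value are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

    beta = np.zeros(2)                       # starting value
    for _ in range(25):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)               # sum_i x_i (y_i - mu_i)
        hessian = -(X * mu[:, None]).T @ X   # -sum_i x_i x_i' mu_i
        beta = beta - np.linalg.solve(hessian, score)

    # estimated variance of beta_hat, (sum_i x_i x_i' mu_i)^{-1}
    avar = np.linalg.inv((X * np.exp(X @ beta)[:, None]).T @ X)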

123 / 318
Patent applications (innovation) and R&D spending

The # of patents applied for on R&D spending, stratified by sector.

124 / 318
We can test whether the impact of R&D spending on innovation is different
across sectors.

125 / 318
Examples of maximum likelihood

Arellano, M. and C. Meghir (1992). Female labour supply and on-the-job search: An empirical
model estimated using complementary data sets. Review of Economic Studies 59, 537–557.

Blundell, R., P.-A. Chiappori, T. Magnac, and C. Meghir (2007). Collective labour supply:
Heterogeneity and non-participation. Review of Economic Studies 74, 417–445.

Heckman, J. J. (1974). Shadow prices, market wages, and labor supply. Econometrica 42, 679–
694.

Magnac, T. (1991). Segmented or competitive labor markets. Econometrica 59, 165–187.

Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold
Zurcher. Econometrica 55, 999–1033.

Tobin, J. (1958). Estimation of relationships from limited dependent variables. Econometrica


26, 24–36.

126 / 318
Bayesian estimation

Skip the remainder of this section

Before seeing the data you have beliefs about θ. Suppose we can summarize
those beliefs into a distribution function π(θ) on Θ, the prior.
Upon seeing the data we can evaluate the distribution of the sample,
∏i fθ (xi ), at any θ ∈ Θ.

When confronted with the data we may alter our beliefs about θ. Bayes rule
gives the posterior as

  π(θ|x1 , . . . , xn ) = ∏i fθ (xi ) π(θ) / ∫Θ ∏i fu (xi ) π(u) du.

When the prior is a proper distribution, so is the posterior.


The updated beliefs summarized in the posterior distribution can be used to
construct a point estimator if desired. An example is
  ∫Θ θ π(θ|x1 , . . . , xn ) dθ,

the posterior mean.


Other natural choices would be the posterior median and mode.
127 / 318
Normal example

Take
xi ∼ N (θ, σ 2 )
(with σ² known for simplicity), so that

  fθ (x1 , . . . , xn ) = ∏i (1/σ) φ((xi − θ)/σ)
                       = (2πσ²)^(−n/2) exp( −(1/2) [ ∑i (xi − x̄n )²/σ² + n(θ − x̄n )²/σ² ] ).

Suppose that

  π(θ) = (1/τ) φ((θ − µ)/τ) = (2πτ²)^(−1/2) exp( −(1/2) (θ − µ)²/τ² ),

i.e., our prior belief is that θ ∼ N (µ, τ²).
A calculation gives the posterior as normal, i.e.,

  θ|(x1 , . . . , xn ) ∼ N ( (τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ ,  (τ² σ²/n)/(τ² + σ²/n) ).
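
A minimal sketch of this conjugate update on simulated data (arbitrary prior and true values):

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true, sigma2 = 1.5, 4.0            # sigma^2 treated as known
    mu, tau2 = 0.0, 1.0                      # prior: theta ~ N(mu, tau2)
    n = 25
    x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

    s2n = sigma2 / n
    w = tau2 / (tau2 + s2n)                  # weight on the sample mean
    post_mean = w * x.mean() + (1 - w) * mu
    post_var = tau2 * s2n / (tau2 + s2n)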

128 / 318
The posterior mean is the point estimator

  (τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ,

which is a weighted average of the sample mean x̄n (the Frequentist estimator)
and the prior mean µ.

Note that this estimator is not unbiased.

In fact, Bayesian posterior means are never unbiased.

As n → ∞, the relative contribution of the prior vanishes and

  θ|(x1 , . . . , xn ) ∼a N ( x̄n , σ²/n ),


which is the Frequentist asymptotic approximation.

129 / 318
Bernstein-von Mises theorem

The similarity between the Bayesian posterior and the Frequentist


asymptotic-distribution approximation in the above example holds much
more generally.

This is the Bernstein-von Mises result.

It states that, in an appropriate metric (known as the total-variation norm),


the difference between π(θ|x1 , . . . , xn ) and

N (θ̂, Iθ−1 /n)

converges to zero in probability as n → ∞.

One implication is that both procedures are asymptotically equivalent.

130 / 318
James-Stein estimation

Remember the posterior mean in our example was

  (τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ
    = (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x̄n + (((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) µ.

For example, when µ = 0 (for notational simplicity) we have

  (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x̄n = (1 − (σ²/n)/(τ² + σ²/n)) x̄n .

The term in brackets lies in (0, 1). So, this estimator is downward biased.
The bias is introduced by the shrinkage of xn towards the prior mean of zero.

A multivariate version would have x ∼ N (θ, (σ 2 /n) I m ) and θ ∼ N (0, τ 2 I m )


with shrinkage estimator

  (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x.

131 / 318
The James-Stein estimator (assuming that σ² is known and that m ≥ 2) is

  (1 − σ² (m − 2)/‖x‖²) x.

While this estimator is biased, we have

  E(‖x − θ‖²) > E(‖(1 − σ² (m − 2)/‖x‖²) x − θ‖²)

as soon as m > 2.
So, in terms of estimation risk (as measured by expected squared loss), the
James-Stein estimator dominates the Frequentist sample mean estimator x.
The key is that shrinkage reduces variance. Indeed, taking the infeasible
estimator for simplicity,

  var( (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x ) = (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) τ² Im
                                             = (τ² − σ²/n) Im + o(n⁻¹).
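
A simulation sketch (not from the slides) of the risk comparison, taking the variance of each component of x to be σ² (i.e., absorbing the 1/n factor); all values are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    m, sigma2, reps = 10, 1.0, 20000          # x ~ N(theta, sigma2 * I_m)
    theta = rng.normal(size=m)

    mse_x, mse_js = 0.0, 0.0
    for _ in range(reps):
        x = theta + rng.normal(scale=np.sqrt(sigma2), size=m)
        js = (1.0 - sigma2 * (m - 2) / (x @ x)) * x     # James-Stein shrinkage towards zero
        mse_x += np.sum((x - theta) ** 2) / reps
        mse_js += np.sum((js - theta) ** 2) / reps

    print(mse_x, mse_js)      # the James-Stein risk is smaller when m > 2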

132 / 318
TESTING IN PARAMETRIC PROBLEMS

133 / 318
Reading

General discussion:
Casella and Berger, Chapter 8
Hansen I, Chapter 13 and 14

Testing in the likelihood framework:


Davidson and MacKinnon, Chapter 13
Hansen II, Chapter 9

Classical linear regression model:


Goldberger, Chapters 19–21

134 / 318
Simple hypothesis and likelihood ratio

Suppose we wish to test the simple null H0 : θ = θ0 against the simple


alternative H1 : θ = θ1 .
The data distribution is completely specified under both null and alternative.
Write

  `n (θ) = e^{Ln (θ)} = ∏i fθ (xi )

for the likelihood and define the likelihood ratio as

  `n (θ0 ) / `n (θ1 ).

If H0 is false we would expect `n (θ0 )/`n (θ1 ) to be small.

135 / 318
[Figure: a likelihood curve `n (θ) over θ ∈ [0, 10], with the values `n (θ0 ) and `n (θ1 ) marked at θ0 and θ1 .]

136 / 318
A decision rule based on the likelihood ratio is to
Reject the null in favor of the alternative when

  `n (θ0 )/`n (θ1 ) < c,

Accept the null when

  `n (θ0 )/`n (θ1 ) ≥ c,

for a chosen value c.
We might wrongfully reject the null. This is called a type-I error.
The significance level or size of the test is

  Pθ0 ( `n (θ0 )/`n (θ1 ) < c ).

We might wrongfully accept the null. This is called a type-II error.
The power of the test is

  Pθ1 ( `n (θ0 )/`n (θ1 ) < c ).

137 / 318
Normal

Suppose that
xi ∼ N (θ, σ 2 )
for known σ 2 .
(From before; see Slide 49) the density of the data is

  (2πσ²)^(−n/2) exp( −(1/2) ∑i (xi − x̄n )²/σ² ) exp( −(1/2) n(θ − x̄n )²/σ² ).

The likelihood ratio thus is

  exp( −(1/2) n(θ0 − x̄n )²/σ² ) / exp( −(1/2) n(θ1 − x̄n )²/σ² )
    = exp( −(1/2) (n/σ²) [ (x̄n − θ0 )² − (x̄n − θ1 )² ] )
    = exp( ((θ0 − θ1 )/(σ/√n)) [ (x̄n − θ0 )/(σ/√n) + (1/2) (θ0 − θ1 )/(σ/√n) ] ).

If θ0 < θ1 the likelihood ratio is no greater than c when

  (x̄n − θ0 )/(σ/√n) ≥ c∗ ,
for some c∗ .

138 / 318
So a level α test is obtained on choosing c∗ so that

  Pθ0 ( `n (θ0 )/`n (θ1 ) < c ) = Pθ0 ( (x̄n − θ0 )/(σ/√n) ≥ c∗ ) = 1 − Φ(c∗ ) = α,

which requires that

  c∗ = Φ⁻¹(1 − α) ≡ zα ,

the (1 − α)th quantile of the standard-normal distribution. These values are
tabulated.
Then the decision rule we obtain is that, if

  (x̄n − θ0 )/(σ/√n) ≥ zα ,
we reject the null in favor of the alternative.

139 / 318
The standard-normal distribution

140 / 318
The power of the test is

  Pθ1 ( (x̄n − θ0 )/(σ/√n) ≥ zα ) = Pθ1 ( (x̄n − θ1 )/(σ/√n) + (θ1 − θ0 )/(σ/√n) ≥ zα )
                                 = 1 − Φ( zα − (θ1 − θ0 )/(σ/√n) ).

Note that the power increases when

the difference θ1 − θ0 (> 0 here) increases; and

the sample size n increases.

The latter observation implies that

  Pθ1 ( (x̄n − θ0 )/(σ/√n) ≥ zα ) → 1 as n → ∞,

i.e., if the null is false this will be spotted with probability approaching one.
This is called consistency of a test.

Note that if, instead, θ0 > θ1 , the decision rule becomes that, if

  (x̄n − θ0 )/(σ/√n) ≤ −zα ,

we reject the null in favor of the alternative.
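
A sketch of this test and its power function using scipy's normal distribution (parameter values are made up):

    import numpy as np
    from scipy.stats import norm

    sigma, n, alpha, theta0 = 1.0, 50, 0.05, 0.0
    z_alpha = norm.ppf(1 - alpha)                       # critical value

    rng = np.random.default_rng(0)
    x = rng.normal(0.3, sigma, size=n)                  # generated under some theta > theta0
    reject = (x.mean() - theta0) / (sigma / np.sqrt(n)) >= z_alpha

    def power(theta1):
        # P_theta1( (xbar - theta0)/(sigma/sqrt(n)) >= z_alpha )
        return 1 - norm.cdf(z_alpha - (theta1 - theta0) / (sigma / np.sqrt(n)))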

141 / 318
Now take the reverse situation where

xi ∼ N (µ, θ)

and µ is known.
Wish to test H0 : θ = θ0 against H1 : θ = θ1 .
The likelihood ratio is
1 θ1 −θ0 Pn (xi −µ)2
−2
(θ1 /θ0 )n/2 e θ1 i=1 θ0 .

If θ0 < θ1 this is small when


n
X (xi − µ)2
i=1
θ0

is large (and vice versa). Now, under the null, this statistic is χ2n and so
n
!
X (xi − µ)2 2
Pθ0 ≥ χn,α = α,
i=1
θ0

where χ2n,α is the (1 − α)th quantile of the χ2n distribution. The power is the
probability that a χ2n is greater than χ2n,α (θ0 /θ1 ).

142 / 318
The χ2 -distribution

143 / 318
Exponential

Take xi to be exponentially distributed, i.e.,

e−x/θ
fθ (x) = , x ≥ 0, θ > 0.
θ

Note that the likelihood is


n
1 Y −xi /θ 1
e = n e−nxn /θ ,
θn i=1 θ

and so the likelihood-ratio statistic for simple null and alternative equals
1 −nxn /θ0
ne
 n θ −θ
θ0 θ1 −nxn θ1 θ 0
1 −nxn /θ1 = e 0 1 .
ne
θ1
θ0

If θ1 > θ0 this statistic is small when nxn is large. Now,


n
X
nxn = xi ∼ Gamma(n, θ)
i=1

(or Erlang(n, 1/θ) as n is an integer) so size is easily controlled for any n.


144 / 318
Most powerful test

Theorem 16 (Neyman-Pearson lemma)


When both the null and alternative hypothesis are simple, the likelihood
ratio test that rejects the null when
`n (θ0 )
<c
`n (θ1 )
for a constant c such that
 
`n (θ0 )
Pθ 0 <c =α
`n (θ1 )
is the most powerful test among all level-α tests.

145 / 318
Composite alternatives and the power function

Now test the simple null H0 : θ = θ0 against a composite alternative H1 :


θ ∈ Θ1 , where Θ1 ⊂ Θ.
We can generalize the likelihood ratio to
`n (θ0 )
supθ1 ∈Θ1 `n (θ1 )

The data distribution is no longer fully specified under the alternative; there
are many possible alternatives.
A test is uniformly most powerful if it is most powerful against all θ1 ∈ Θ1 .

Power is now a function; i.e.,


 
`n (θ0 )
β(θ) = Pθ <c
supθ1 ∈Θ1 `n (θ1 )

146 / 318
Unbiased tests

A level-α test is unbiased if


β(θ) ≥ α
for all θ ∈ Θ1 .
The null is more likely to be rejected when it is false than when it is true.

Unbiasedness is clearly desirable.

We could consider looking for the uniformly most powerful unbiased test.

147 / 318
Normal (One-sided)

Again take xi ∼ N (θ, σ 2 ) for known σ 2 .


Now test H0 : θ = θ0 against H1 : θ > θ0 .
The set of alternatives is thus Θ1 = {θ ∈ Θ : θ > θ0 }. This is a one-sided
alternative.
Clearly,

θ̂1 = arg max `n (θ1 ) = xn {xn > θ0 } + θ0 {xn ≤ θ0 }.


θ1 ∈Θ1

Then,
n(xn −θ0 )2
−1
`n (θ0 ) e 2 σ2
= n(xn −θ̂1 )2
`n (θ̂1 ) e
−1
2 σ2

(xn −θ0 )2 −(xn −θ0 )2 {xn ≤θ0 }−(xn −xn )2 {xn >θ0 }
−n
=e 2 σ2

(xn −θ0 )2 xn −θ0


n o
−1 √ >0
=e 2 σ 2 /n σ/ n

which is no greater than some constant c if


n o
xn −θ
√0 xn −θ
√ 0 > 0 ≥ c∗ .
σ/ n σ/ n
148 / 318
The random variable

z {z > 0}, z ∼ N (0, 1)

is truncated standard normal with cumulative distribution function


 
1
P (z < c∗ |z > 0) = 2 Φ(c∗ ) − .
2

So, noting that only positive values for c∗ make sense,

P (z {z > 0} ≤ c∗ ) = P (z ≤ c∗ |z > 0) P (z > 0) + P (z ≤ 0)


  
1 ∗ 1 1
= 2 Φ(c ) − +
2 2 2
= Φ(c∗ ),

and, therefore, the size of our test can be set to α ∈ (0, 1) by setting

c∗ = Φ−1 (1 − α) = zα .

149 / 318
We get the decision rule
Reject H0 : θ = θ0 in favor of H1 : θ > θ0 if
 
xn − θ0 xn − θ0
√ √ > 0 ≥ zα ;
σ/ n σ/ n

Accept H0 : θ = θ0 if
 
xn − θ 0 xn − θ0
√ √ >0 < zα .
σ/ n σ/ n

With zα > 0 we can just look at


Reject H0 : θ = θ0 in favor of H1 : θ > θ0 if
xn − θ0
√ ≥ zα ;
σ/ n

Accept H0 : θ = θ0 if
xn − θ0
√ < zα .
σ/ n

This test is uniformly most powerful.

150 / 318
This conclusion follows from the fact that the decision rule is the same as
for the simple alternative θ = θ1 from above, and that test was the most
powerful for any θ1 > θ0 .
We have
   
xn − θ0 xn − θ θ0 − θ
Pθ √ ≥ zα = Pθ √ ≥ zα + √
σ/ n σ/ n σ/ n
so the power function is
 
θ0 − θ
1 − Φ zα + √ .
σ/ n
This test is consistent.
β(θ) is presented graphically below for a setting where θ0 = 0 and σ = 1,
with α = .05.

151 / 318
[Figure: power function β(θ) of the one-sided test (θ0 = 0, σ = 1, α = .05), showing size α at θ0 and power β(θ1 ) at an alternative θ1 .]

152 / 318
Normal (Two-sided)

Continue to work with xi ∼ N (θ, σ 2 ) for known σ 2 .


Now test H0 : θ = θ0 against H1 : θ 6= θ0 .
The set of alternatives is thus Θ1 = {θ ∈ Θ : θ 6= θ0 } = Θ\{θ0 }. This is a
two-sided alternative.
The likelihood-ratio is simply
2
xn −θ0

−1 √
e 2 σ/ n ,
which is no greater than some constant c if
xn − θ 0
√ ≥ c∗ .
σ/ n

153 / 318
So,
   
xn − θ 0 xn − θ 0
Pθ0 √ ≥ c∗ = 1 − Pθ 0 −c∗ < √ ≤ c∗
σ/ n σ/ n
which is simply

1−(Φ (c∗ ) − Φ (−c∗ )) = 1−(1−Φ(−c∗ )−Φ(−c∗ )) = 2Φ(−c∗ ) = 2(1−Φ(c∗ )).

Equalizing this probability to α and inverting toward c∗ yields

c∗ = Φ−1 (1 − α/2) = zα/2 ,

giving the decision rule:


Reject H0 : θ = θ0 in favor of H1 : θ 6= θ0 if
xn − θ0
√ ≥ zα/2 ,
σ/ n
and accept the null if not.
Note that we reject if either
xn − θ 0 xn − θ0
√ < −zα/2 or √ > zα/2 ;
σ/ n σ/ n

each of these events has probability α/2 under the null.


154 / 318
This test is not uniformly most powerful. In fact, for two-sided alternatives,
such tests cannot exist.
The one-sided tests with size α are better on their respective sides of the
null:

[Figure: power functions illustrating that the one-sided tests dominate the two-sided test on their respective sides of the null.]

155 / 318
The two-sided test is unbiased and consistent.
Below are the power functions for two sample sizes.

[Figure: power function of the two-sided test for two sample sizes; power increases with n.]

156 / 318
Normal (Two-sided; variance unknown)

Again
xi ∼ N (µ, σ 2 )
but now with both µ, σ 2 unknown.
Consider the hypothesis

H 0 : µ = µ0 , H1 : µ 6= µ0 .

The likelihood is Pn 2
1 −1 i=1 (xi −µ)
e 2 σ2 .
(2πσ 2 )n/2
The unconstrained maximizers are
n
X
µ̂ = xn , σ̂ 2 = n−1 (xi − xn )2 ,
i=1

while, when µ = µ0 , maximizing with respect to σ 2 only yields


n
X
σ̌ 2 = n−1 (xi − µ0 )2 .
i=1

157 / 318
The likelihood ratio is simply
n/2 −n/2
σ̂ 2 (xn − µ0 )2
 
= 1+ .
σ̌ 2 σ̂ 2
This statistic is smaller than some critical value if and only if
 2   2
xn − µ0 n xn − µ0
√ = √
σ̂/ n n−1 σ̃/ n

exceeds some other critical value; where, recall, σ̃ 2 = σ̂ 2 n/(n − 1).


But,
r p
x n − µ0 σ x n − µ0 xn − µ0 σ̃ 2 xn − µ0 (n − 1) (σ̃ 2 /σ 2 )
√ = √ = √ = √ √ .
σ̃/ n σ̃ σ/ n σ/ n σ2 σ/ n n−1
We know that
x n − µ0 σ̃ 2
√ ∼ N (0, 1), (n − 1) ∼ χ2n−1 ;
σ/ n σ2
and so the ratio follows a t distribution with n − 1 degrees of freedom.

158 / 318
The statistic

  (x̄n − µ0 )/(σ̃/√n) ∼ tn−1

is commonly called the t-statistic.
Exact inference is thus possible on choosing critical values from Student’s t
distribution with n − 1 degrees of freedom.
As n grows, tn−1 approaches the standard normal. So large-sample theory
justifies the use of zα/2 as a critical value.
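
A minimal sketch of the two-sided t-test with unknown variance (simulated data; scipy provides the tn−1 quantile):

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 2.0, size=30)
    mu0, alpha = 0.0, 0.05

    n = x.size
    sigma_tilde = x.std(ddof=1)                           # square root of the unbiased variance
    t_stat = (x.mean() - mu0) / (sigma_tilde / np.sqrt(n))
    reject = abs(t_stat) >= t.ppf(1 - alpha / 2, df=n - 1)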

159 / 318
Student’s t distribution

160 / 318
General composite hypothesis

The more general case has both composite null and alternative, as in

H0 : θ ∈ Θ0 , H1 : θ ∈ Θ1 ,

where Θ0 and Θ1 are subsets of the parameter space Θ.


An obvious generalization of the likelihood ratio would be
supθ0 ∈Θ0 `n (θ0 )
.
supθ1 ∈Θ1 `n (θ1 )

The statistic used above for only the alternative composite is a special case.
Much more common is to work with a likelihood ratio statistic defined as
supθ0 ∈Θ0 `n (θ0 )
;
supθ∈Θ `n (θ)

note that the denominator features the full parameter space. This is often
much easier to work with.

161 / 318
Connection to maximum likelihood

By definition
sup `n (θ) = `n (θ̂),
θ∈Θ

where θ̂ is the (unconstrained) maximum-likelihood estimator.


Likewise, we can think of

θ̌ = arg max `n (θ)


θ∈Θ0

as the constrained maximum-likelihood estimator obtained on enforcing the


null.
The likelihood ratio is then simply

`n (θ̌)
.
`n (θ̂)

162 / 318
Normal (Composite)

xi ∼ N (θ, σ 2 ) for known σ 2 .


Now test H0 : θ ≤ 0 against H1 : θ > 0.
Here,

arg max `n (θ) = xn {xn ≤ 0}, arg max `n (θ) = xn {xn > 0},
θ0 ∈Θ0 θ1 ∈Θ1

and, also,
arg max `n (θ) = xn .
θ∈Θ

So,
supθ0 ∈Θ0 `n (θ0 )
 2  
x√ x√ x√
−1 n sign(xn ) −1 n n
=e 2 σ/ n =e 2 σ/ n σ/ n
supθ1 ∈Θ1 `n (θ1 )
for sign(x) = {x > 0} − {x ≤ 0}, while

supθ0 ∈Θ0 `n (θ0 )


 2  2 n o
x√ x√ x√
−1 n {xn >0} −1 n n >0
=e 2 σ/ n =e 2 σ/ n σ/ n .
supθ∈Θ `n (θ)

163 / 318
The latter likelihood ratio is smaller than c when

  (x̄n /(σ/√n)) { x̄n /(σ/√n) > 0 } > c∗
for some c∗ . Note that only positive c∗ make sense, otherwise we will never
reject.
For any fixed θ, let
xn − θ
zθ = √ .
σ/ n
Then
 n o     
Pθ σ/x√
n
n
x√
n
σ/ n
> 0 > c ∗ = Pθ z θ > c ∗ − θ√
σ/ n
= 1 − Φ c∗ − θ√
σ/ n

This function is monotone increasing on Θ0 = (−∞, 0]. The size of the test
is  n o 

supθ∈Θ0 Pθ σ/x√
n
n
x√
n
σ/ n
> 0 > c = 1 − Φ(c∗ ) = α

so that size control yields the critical value c∗ = Φ−1 (1 − α) = zα .

164 / 318
The former likelihood ratio is small when either
xn xn
0< √ and c∗ < √
σ/ n σ/ n
or when
xn xn
√ < 0 and c∗ < √
σ/ n σ/ n

For any θ ∈ Θ0 the probability of this happening is


   
Pθ z θ > c∗ − σ/θ√n , z θ > − σ/θ√n + Pθ z θ > c∗ − θ√
σ/ n
, z θ < − σ/θ√n .

For c∗ > 0 this equals


   
Pθ z θ > c ∗ − θ√
σ/ n
= 1 − Φ c∗ − θ√
σ/ n

while for c∗ ≤ 0 this equals


     
Pθ z θ > − σ/θ√n + Pθ c∗ − θ√
σ/ n
< z θ ≤ − σ/θ√n = 1 − Φ c∗ − θ√
σ/ n
.

In either case the supremum over Θ0 is achieved at θ = 0 for which we find


that c∗ = zα yields size control. Again, for any reasonable size the critical
value is positive.

165 / 318
Likelihood-ratio test

Now consider a general setting where θ is a k-dimensional vector and

H0 : r(θ) = 0, H1 : r(θ) 6= 0,

for a continuously-differentiable m-dimensional function r.


We will denote the m × k Jacobian matrix by R(θ).
Exact size control is difficult in general.
A general approach that is asymptotically valid is the decision rule
Reject the null if !
`n (θ̌)
−2 log > χ2m,α ;
`n (θ̂)
Accept the null if !
`n (θ̌)
−2 log ≤ χ2m,α .
`n (θ̂)

Note that  
−2 log `n (θ̌)/`n (θ̂) = 2(Ln (θ̂) − Ln (θ̌)).

166 / 318
Asymptotic distribution under the null

The validity of the test procedure comes from the following theorem.

Theorem 17 (Limit distribution of the Likelihood-ratio statistic)


Under the null,
d
2(Ln (θ̂) − Ln (θ̌)) → χ2m
as n → ∞.

167 / 318
Proof.
We work under the null. A Taylor expansion gives
n
Ln (θ̌) − Ln (θ̂) = − (θ̌ − θ̂)0 Iθ (θ̌ − θ̂) + op (1).
2
It can be shown that (under the null)
n
√ 1 X ∂ log fθ (xi )
n(θ̂ − θ̌) = Iθ−1 R0 (RIθ−1 R0 )−1 RIθ−1 √ + op (1),
n i=1 ∂θ θ

where R = R(θ). Plugging this into the expansion gives 2(Ln (θ̂) − Ln (θ̌)) as
n
!0 n
!
1 X ∂ log fθ (xi ) 1 X ∂ log fθ (xi )
RIθ−1 √ (RIθ−1 R0 )−1 RIθ−1 √
n i=1 ∂θ θ n i=1 ∂θ θ

(up to op (1) terms). But, as


n
1 X ∂ log fθ d
RIθ−1 √ → N (0, RIθ−1 R0 ),
n i=1 ∂θ θ

this quadratric form is asymptotically χ2m .

168 / 318
Analysis of the constrained estimator

Completing the proof requires finding the asymptotic distribution of

θ̌ = arg max Ln (θ).


θ:r(θ)=0

This estimator maximizes the Lagrangian problem

Ln (θ) + λ0 r(θ).

The first-order conditions are


∂Ln (θ)
+ λ̌0 R(θ̌) = 0, r(θ̌) = 0.
∂θ θ̌

We can Taylor expand

∂Ln (θ) ∂Ln (θ) ∂ 2 Ln (θ)


= + (θ̌ − θ) + op (1)
∂θ θ̌ ∂θ θ ∂θ∂θ0 θ
n
X ∂ log fθ (xi )
= − nIθ (θ̌ − θ) + op (1),
i=1
∂θ θ

and r(θ̌) = r(θ) + R (θ̌ − θ) + op (1) = R (θ̌ − θ) + op (1) (enforcing the null
r(θ) = 0).
169 / 318
Plugging the expansions into the first-order conditions and re-arranging
yields the system of equations
Pn ∂ log fθ (xi ) !
−nIθ R0
  
θ̌ − θ
n−1/2 = −n−1/2 i=1 ∂θ
θ
R 0 λ̌ 0

(up to op (1) terms).


A block-inversion formula shows that
 −1
−nIθ R
R0 0

equals

−n−1 Iθ−1 + n−1 Iθ−1 R0 (RIθ−1 R0 )−1 RIθ−1 Iθ−1 R0 (RIθ−1 R0 )−1
 
.
(RIθ−1 R0 )−1 RIθ−1 n (RIθ−1 R0 )−1

170 / 318
Then we obtain
n
√ 1 X ∂ log fθ (xi )
n(θ̌ − θ) = (Iθ−1 − Iθ−1 R0 (RIθ−1 R0 )−1 RIθ−1 ) √ + op (1),
n i=1 ∂θ θ

which implies that


n
√ 1 X ∂ log fθ (xi )
n(θ̂ − θ̌) = Iθ−1 R0 (RIθ−1 R0 )−1 RIθ−1 ) √ + op (1)
n i=1 ∂θ θ

under the null.

For future reference we also note that


n
λ̌ 1 X ∂ log fθ (xi ) d
√ = −(RIθ−1 R0 )−1 RIθ−1 √ +op (1) → N (0, (RIθ−1 R0 )−1 ).
n n i=1 ∂θ θ

171 / 318
χ2 -statistic

The derivation of the limit distribution of the likelihood-ratio statistic shows


that
d
n (θ̌ − θ̂)0 Iθ (θ̌ − θ̂) → χ2m
under the null.
Let Iˇθ be a consistent estimator of the information under the null. Obvious
choices are 2
∂ 2 log fθ (xi )
− n1 ∂ ∂θ∂θ
Ln (θ)
= − n1 n
P
0 i=1 ∂θ∂θ 0
θ̌ θ̌
and
Pn  ∂ log fθ (xi ) 0   P  P 0
1 ∂ log fθ (xi ) 1 n ∂ log fθ (xi ) 1 n ∂ log fθ (xi )
n i=1 ∂θ ∂θ
− n i=1 ∂θ n i=1 ∂θ
θ̌ θ̌ θ̌ θ̌

note that recentering of the score is needed here as θ̌ does not maximize the
unconstrained likelihood problem, in general.

172 / 318
Slutzky’s theorem gives us the following result.

Theorem 18 (Limit distribution of the χ2 -statistic)


Under the null,
d
n (θ̌ − θ̂)0 Iˇθ (θ̌ − θ̂) → χ2m
as n → ∞.

This result gives us an alternative, but asymptotically equivalent, testing


procedure.
The intuition behind a test based on this result is to look at a distance
between the constrained and the unconstrained estimators which, under the
null, should be small.

173 / 318
Score statistic

The analysis of the constrained estimator implies the following result.

Theorem 19 (Limit distribution of the Score statistic)


Under the null,
0  ˇ−1 
∂Ln (θ) Iθ ∂Ln (θ) d
→ χ2m ,
∂θ θ̌ n ∂θ θ̌
as n → ∞.

This statistic is also known as the Lagrange-multiplier statistic as it can be


written as
R(θ̌)Iˇθ−1 R(θ̌)0
 
λ̌0 λ̌,
n
where λ̌ is the Lagrangian multiplier for the constraint r(θ) = 0.
One interpretation for this is that, if the null is true, the constraint should
be ineffective, aside from sampling error, so λ̌ should be small.
Another interpretation is that, under the null, the unconstrained score should
be close to zero at θ̌.

174 / 318
Wald statistic

Rather than evaluating some distance between θ̌ and θ̂ as in the χ2 -statistic

n (θ̌ − θ̂)0 Iˇθ (θ̌ − θ̂),

we may look at a distance of r(θ̂) from zero (the null). Because we have that

r(θ̂) = r(θ̂) − r(θ̌) = R (θ̂ − θ̌) + op (1)

under the null, we equally have the following.

Theorem 20 (Limit distribution of the Wald statistic)


Under the null,
d
n r(θ̂)0 (R(θ̌)Iˇθ−1 R(θ̌)0 )−1 r(θ̂) → χ2m ,
as n → ∞.

175 / 318
The Wald statistic can equally be derived without reference to a constrained
estimation problem.
Because
√ d
n(θ̂ − θ) → N (0, Iθ−1 ) and r(θ̂) = R (θ̂ − θ) + op (1),

under the null, the Delta method gives


√ d
n r(θ̂) → N (0, RIθ−1 R0 ),

and so also
Theorem 21 (Limit distribution of the Wald statistic (cont’d))
Under the null,
d
n r(θ̂)0 (R(θ̂)Iˆθ−1 R(θ̂)0 )−1 r(θ̂) → χ2m ,
as n → ∞.
Here it makes sense to use an unconstrained estimator of the information.

176 / 318
Notes

All test statistics can be used in the same way to perform (asymptotically)
valid inference.
In small samples they can lead to different test conclusions.
The likelihood-ratio statistic is attractive because

It does not require an estimator of Iθ ;


It is invariant with respect to one-to-one transformations.

The second point is important as it implies that the test conclusion is the
same no matter how the null is formulated.
The score statistic is attractive because it requires estimation only under the
null, which is often easier.
In the likelihood context there is no strong argument in favor of the Wald
statistic. In fact it is not likelihood based. Its power lies in that it can be
applied more generally.

177 / 318
Exponential

The exponential distribution is

e−x/θ
fθ (x) = .
θ
Its mean is θ.
We set up several tests for the null H0 : θ = θ0 against θ 6= θ0 .
First note that
n
X
Ln (θ) = − (xi /θ + log θ) = −nxn /θ − n log θ.
i=1

Hence,

∂Ln (θ) ∂ 2 Ln (θ)


= (n/θ)(xn /θ − 1), = −(n/θ2 )(2xn /θ − 1).
∂θ ∂θ2
Therefore, θ̂ = xn and Iθ−1 /n = θ2 /n.

178 / 318
The likelihood-ratio statistic is

  −2(Ln (θ0 ) − Ln (θ̂)) = 2n ( x̄n /θ0 − 1 − log(x̄n /θ0 ) ).

The score statistic is

  n (x̄n /θ0 − 1)² = (x̄n − θ0 )²/(θ0²/n).

The χ²-statistic and the Wald statistic are

  (x̄n − θ0 )²/(θ0²/n),   (x̄n − θ0 )²/(x̄n²/n),

respectively. The latter is again the square of the usual t-statistic, which
should not be surprising here.
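
A sketch computing these statistics on simulated exponential data and comparing them to the χ²₁ critical value (the data-generating values are arbitrary):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    theta0 = 2.0
    x = rng.exponential(scale=2.5, size=200)      # true mean 2.5, null value 2.0
    n, xbar = x.size, x.mean()

    lr    = 2 * n * (xbar / theta0 - 1 - np.log(xbar / theta0))   # likelihood ratio
    score = (xbar - theta0) ** 2 / (theta0 ** 2 / n)              # score / chi^2 statistic
    wald  = (xbar - theta0) ** 2 / (xbar ** 2 / n)                # Wald
    crit  = chi2.ppf(0.95, df=1)
    print(lr > crit, score > crit, wald > crit)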

179 / 318
Classical linear regression

Recall the setup


yi |xi ∼ N (x0i β, σ 2 )
or, in matrix notation,

y = Xβ + ε, ε ∼ N (0, σ 2 I).

The log-likelihood (up to a constant) is

n (y − Xβ)0 (y − Xβ)
Ln (β, σ 2 ) = − log σ 2 − .
2 2σ 2

Let SSRβ = (y − Xβ)0 (y − Xβ). Then the profiled log-likelihood for β is


n −n/2
Ln (β) = − log (SSRβ ) = log(SSRβ )
2
(again up to a constant), and
−n/2
`n (β) ∝ SSRβ .

180 / 318
We consider a set of m linear restrictions on β. We express the null hypothesis
as
Rβ = r,
where R is an m × k matrix and r and is an m-vector.

The m restrictions are non-redundant, so rank R = m.

The unconstrained estimator solves

min SSRβ = min(y − Xβ)0 (y − Xβ)


β β

and equals
β̂ = (X 0 X)−1 Xy,
as before.

The constrained estimator solves the Lagrangian problem


1
min SSRβ − λ0 (Rβ − r).
β 2

181 / 318
The first-order conditions are

X 0 (y − Xβ) − R0 λ = 0, Rβ − r = 0.

Re-arranging the first condition gives

(X 0 X)β = X 0 y − R0 λ

and so

β̌ = (X 0 X)−1 X 0 y − (X 0 X)−1 R0 λ = β̂ − (X 0 X)−1 R0 λ.

Further, pre-multiplying by R and enforcing that Rβ̌ = r gives

Rβ̌ = Rβ̂ − R(X 0 X)−1 R0 λ = r,

which we solve for λ to obtain

λ̌ = (R(X 0 X)−1 R0 )−1 (Rβ̂ − r).

We then find that

β̌ = β̂ − (X 0 X)−1 R0 (R(X 0 X)−1 R0 )−1 (Rβ̂ − r).

182 / 318
The likelihood ratio statistic is
!−n/2
SSRβ̌
,
SSRβ̂

which is small when the ratio in brackets is large. Now,


SSRβ̌ SSRβ̌ − SSRβ̂
−1=
SSRβ̂ SSRβ̂

where
SSRβ̂ = ε̂0 ε̂ = ε0 M X ε
and, using that y = X β̂ + ε̂ to simplify SSRβ̌ = (y − X β̌)0 (y − X β̌) to

SSRβ̌ = ε0 M X ε + (β̂ − β̌)0 (X 0 X)(β̂ − β̌).

Hence,
SSRβ̌ − SSRβ̂ (β̂ − β̌)0 (X 0 X)(β̂ − β̌)
=
SSRβ̂ ε0 M X ε
(Rβ̂ − r)0 (R(X 0 X)−1 R0 )−1 (Rβ̂ − r)
= .
ε0 M X ε

183 / 318
Note that, under the null,

Rβ̂ − r ∼ N (0, σ 2 R(X 0 X)−1 R0 )

such that

  (SSRβ̌ − SSRβ̂ )/σ² ∼ χ²m .

We also know that

  SSRβ̂ /σ² = (n − k) σ̃²/σ² ∼ χ²n−k .

Lastly, both terms are independent because they are functions of β̂ and ε̂,
respectively. These variables are jointly normal and independent, as the
covariance is

E((β̂ − β)ε̂0 |X) = E (X 0 X)−1 X 0 εε0 M X |X = σ 2 (X 0 X)−1 X 0 M X = 0




(using that M X X = 0)
Therefore,

  ((n − k)/m) (SSRβ̌ − SSRβ̂ )/SSRβ̂ ∼ Fm,n−k ,
where F is Snedecor’s F distribution.
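
A sketch of this F-statistic for a set of linear restrictions on simulated data (the design and the particular R and r are made up):

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(0)
    n, k = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)

    R, r = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]), np.zeros(2)   # null: both slopes zero
    m = R.shape[0]

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e_hat = y - X @ beta_hat
    d = R @ beta_hat - r

    F_stat = (d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / m) / (e_hat @ e_hat / (n - k))
    p_value = 1 - f.cdf(F_stat, m, n - k)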

184 / 318
Snedecor’s F distribution

185 / 318
A particular F test

A popular restriction in a regression model that includes a constant term is


that all slopes are zero.
Under the null we only estimate a constant term, i.e., β̌ = y n , and so we
have
Xn
SSRβ̌ = (yi − y n )2 = T SS.
i=1

It follows that the F-statistic can be written as

  ((n − k)/m) (SSRβ̌ − SSRβ̂ )/SSRβ̂ = ((n − k)/m) (TSS − SSRβ̂ )/SSRβ̂ = ((n − k)/m) R²/(1 − R²)

with R² = 1 − SSRβ̂ /TSS the (centered) coefficient of determination of the
unrestricted model.

186 / 318
F versus t

When we test only the restriction βκ = βκ0 the F -statistic is


!2
(β̂κ − βκ,0 )([(X 0 X)−1 ]κ,κ )−1 (β̂κ − βκ,0 ) β̂κ − βκ,0
= .
σ̃ 2
p
σ̃ 2 [(X 0 X)−1 ]κ,κ

This is the square of the usual t-statistic

β̂κ − βκ,0
p .
σ̃ [(X 0 X)−1 ]κ,κ
2

So the square of a tn−k random variable is F1,n−k distributed.

187 / 318
Joint hypothesis versus multiple single hypotheses

To jointly test the k restrictions β = β0 we use the F -statistic

1 (β̂ − β0 )0 (X 0 X)(β̂ − β0 )
.
k σ̃ 2

This is not the mean of the t-statistics for the k individual hypotheses that
βκ = βκ,0 . The individual t-statistics are correlated.

Jointly testing hypothesis gives acceptance regions that are ellipsoids. The
union of acceptance regions of multiple individual tests is a hypercube.

Multiple testing problems need size corrections which, in turn, lead to low
power.

The family-wise error rate is

  P( ∪κ { |β̂κ − βκ,0 | / √(σ̃² [(X'X)⁻¹]κ,κ ) > tn−k,α/2 } )
    ≤ ∑κ P( |β̂κ − βκ,0 | / √(σ̃² [(X'X)⁻¹]κ,κ ) > tn−k,α/2 ) = k × α.

To keep the family-wise error rate below α we need to test each of the k
individual hypotheses at significance level α/k.

188 / 318
p-values

If we follow the Neyman-Pearson decision rule we either accept or reject the


null.

We may also look at the p-value of a test statistic.


Consider a test procedure where we reject the null when the statistic ψn is
large.
If the statistic ψn takes on value ψ in the data the p-value is

sup Pθ (ψn > ψ) .


θ∈Θ0

This is the probability of observing a value of the test statistic greater than
ψ if the null holds.
Small p-values suggest the null is likely to be false.

The p-value gives a cut-off of significance levels for which a Neyman-Pearson


decision rule would accept/reject...

But the p-value is informative in its own right and need not lead to a decision
about the null. This is Fisher’s view.
189 / 318
Inverting test statistics

As an alternative to a point estimator, testing procedures can give rise to


interval estimators.

Suppose we test H0 : θ = θ0 using a decision rule of the form

Accept H0 if ψn (θ0 ) ≤ c

for some critical value c.

Then the set


Θ̂ = {θ ∈ Θ : ψn (θ) ≤ c}
constitutes an interval estimator.

If the original test has size α then

Pθ0 (θ0 ∈ Θ̂) = 1 − α.

The interval estimator is also called a (1 − α) confidence set.

190 / 318
Normal

Suppose xi ∼ N (θ, σ 2 ).
Consider H0 : θ = θ0 and H1 : θ 6= θ0 .
The likelihood ratio decision rule goes in favor of the null if
xn − θ0
√ ≤ tn−1,α/2 .
σ̃/ n

This means that, for any θ in the interval


 
σ̃ σ̃
Θ̂ = xn − √ tn−1,α/2 , xn + √ tn−1,α/2
n n
the null would be accepted.
Θ̂ is an interval estimator of θ0 .
We have  
Pθ0 θ0 ∈ Θ̂ = 1 − α;

and so Θ̂ is a (1 − α) confidence interval for θ0 .

191 / 318
Now,
xi ∼ N (µ, θ)
and, say,
H 0 : θ = θ0 , H1 : θ > θ0 .

Under the null, Pn


i=1 (xi − xn )2
∼ χ2n−1 ,
θ0
and we would accept the null if the sample variance
n
1 X
θ̃ = (xi − xn )2
n − 1 i=1

satisfies θ̃ ≤ (n − 1)θ0 χ2n−1,α


The corresponding interval estimator thus is
h 
(n − 1) θ̃/χ2n−1,α , +∞

and has coverage probability 1 − α.

192 / 318
Bayesian credible sets

Given a Bayesian posterior π(θ|x1 , . . . , xn ) and a region R of its support, we


may calculate
R
P (θ ∈ R|x1 , . . . , xn ) = {θ ∈ R} π(θ|x1 , . . . , xn ) dθ.

This is a credible probability for the credible set R.

Credible regions can be formed in many ways.

For scalar θ we could, for example, take the interval [qα/2 , q1−α/2 ], where qτ
is the τ quantile of the posterior distribution.

193 / 318
Return to the example where xi ∼ N (θ, σ 2 ) (with σ 2 known) and we have
prior information θ ∼ N (µ, τ 2 ).
Here, the posterior was
N m, v 2


for mean and variance


τ2 σ 2 /n τ 2 σ 2 /n
m= x n + µ, v2 = .
τ 2 + σ 2 /n τ 2 + σ 2 /n τ 2 + σ 2 /n
So,
θ−m
∼ N (0, 1)
v
and a 1 − α credible interval is

[m − zα/2 v ; m − zα/2 v].

We can compute the Frequentist coverage probability of this credible set.

194 / 318
The Frequentist framework has xn ∼ N (θ, σ 2 /n) (here θ is fixed).

The posterior depends on the data only through its mean,

1 δ σ 2 /n
m= xn + µ, δ= .
1+δ 1+δ τ2

A calculation shows that



Pθ m − zα/2 v ≤ θ ≤ m + zα/2 v

equals
√ √
   
θ−µ θ−µ
Φ 1 + δ zα/2 + δ √ − Φ − 1 + δ zα/2 + δ √
σ/ n σ/ n

which is different from Φ(zα/2 ) − Φ(−zα/2 ) = 1 − α.

195 / 318
Stratifying regressions

log wages are (approximately) normal.


Suppose different means but common variance for males and females.

196 / 318
Common variance is unrealistic and can be relaxed.
(This will lead us to semiparametric problems; considered below.)

197 / 318
Add homogenous impact of experience.

198 / 318
Stratify impact of experience by gender.

199 / 318
Test the equality of the regression lines.

Test the equality of the intercept and slopes separately.

200 / 318
SEMIPARAMETRIC PROBLEMS: (GENERALIZED) METHOD OF
MOMENTS

201 / 318
Reading

Asymptotic theory:
Arellano, Appendix A
Hansen II, Chapter 13
Hayashi, Chapter 7
Wooldridge, Chapter 12
Linear instrumental variables:
Hansen II, Chapter 12
Hayashi, Chapter 3
Wooldridge, Chapters 5 and 8
Optimality in conditional moment problems:
Arellano, Appendix B

202 / 318
Linear model

Recall,
yi = x0i θ + εi .

Before we had imposed εi |xi ∼ N (0, σ 2 ). but suppose that we only require
that
E(εi |xi ) = 0.

We no longer assume that yi |xi ∼ N (x0i θ, σ 2 ) and so we cannot write down


the likelihood.
For example, var(εi |xi ) is unknown and may depend on xi .
All the information we have is contained in conditional moment condition

E(εi |xi ) = Eθ (yi − x0i θ|xi ) = 0.

This is a semiparametric problem:


The model has a parametric part, the conditional mean, and a nonparametric
part, the distribution of εi |xi .

203 / 318
Iterating expectations shows that

Eθ (xi (yi − x0i θ)) = 0

and the analogy principle suggest estimating θ by the solving the empirical
moment
Xn
n−1 xi (yi − x0i θ) = 0.
i=1

This gives the ordinary least-squares estimator,


n
!−1 n
!
−1
X 0 −1
X
n xi xi n xi yi ,
i=1 i=1
P 0
as unique solution provided i xi xi has maximal rank.
So, for learning θ here, normality of the errors (and knowledge thereof) is
not needed.
The errors can be heteroskedastic, skewed, and so on.

204 / 318
But is ordinary least squares still the best estimator of θ?
Aside from
Eθ (xi (yi − x0i θ)) = 0
we equally have that

Eθ ((xi ⊗ xi )(yi − x0i θ)) = 0,


Eθ ((xi ⊗ xi ⊗ xi )(yi − x0i θ)) = 0,
..
.
Eθ ((xi ⊗ xi ⊗ · · · ⊗ xi )(yi − x0i θ)) = 0,

and, indeed, that


Eθ (ψ(xi )(yi − x0i θ)) = 0
for any vector function ψ.
How do we optimally exploit all this information?

205 / 318
Semiparametric efficiency

In a semiparametric model the distribution of the data is no longer known


up to a small number of parameters.
The model has a parametric part (θ) and a nonparametric part (say F ).
Often (i.e., in these slides), the primary interest lies in the parametric part,
θ and all available information on θ is formulated in terms of (conditional)
moment conditions.
A general approach to estimation is GMM.
Can be devised to hit the semiparametric efficiency bound.
Intuitively, this bound is
−1
sup Iθ,F ;
F

that is, the largest of the Cramér-Rao bounds in the parametric submodels
contained in our semiparametric setting.
In the linear regression model from above this would be the Cramér-Rao
bound under the least-favorable distribution for εi |xi that satisfies mean
independence.

206 / 318
Method of moments

Suppose all we know is that

Eθ (ϕ(xi ; θ)) = 0

for some known function ϕ.

A unique solution will generally not exist when dim ϕ < dim θ. We say θ is
underidentified.

Suppose, for now, that dim ϕ = dim θ. We call this the just-identified case.

A method of moment estimator is a solution to


n
X
n−1 ϕ(xi ; θ) = 0.
i=1

The intuition is the analogy principle and similar to the argmax argument.

207 / 318
Identification

The argmax result requires that

Eθ (ϕ(xi ; θ∗ )) 6= 0

for any θ∗ 6= θ.
This is global identification.

In contrast, local identification means there is a neighborhood around θ in


which it is the unique solution.
A sufficient condition for this is that the Jacobian matrix
 
∂ϕ(xi ; θ)

∂θ0
is full rank.

When ϕ is linear in θ local and global identification are the same.

208 / 318
Limit distribution

Let θ̂ satisfy
n
X
n−1 ϕ(xi ; θ̂) = 0.
i=1

We can use a similar argument as used for maximum likelihood to derive its
behavior as n → ∞.
Under smoothness conditions an expansion gives
n n n
X X X ∂ϕ(xi ; θ)
n−1 ϕ(xi ; θ̂) = n−1 ϕ(xi ; θ) + n−1 (θ̂ − θ).
i=1 i=1 i=1
∂θ0 θ∗

Re-arrangement gives
n
!−1 n
√ 1 X ∂ϕ(xi ; θ) 1 X
n(θ̂ − θ) = − √ ϕ(xi ; θ).
n i=1 ∂θ0 θ∗ n i=1

209 / 318
Under a dominance condition we have
n  
1 X ∂ϕ(xi ; θ) p ∂ϕ(xi ; θ)
− → −Eθ = −Γθ (say).
n i=1 ∂θ0 θ∗ ∂θ0

Also, ϕ(xi ; θ) is i.i.d. with zero mean. So we have


n
1 X d
√ ϕ(xi ; θ) → N (0, Ωθ )
n i=1

provided that the asymptotic variance

Ωθ = varθ (ϕ(xi ; θ)) = Eθ (ϕ(xi ; θ)ϕ(xi ; θ)0 )

exists.
Combined with Slutzky’s theorem we get the following result.

Theorem 22 (Limit distribution of MM estimator)


Under regularity conditions,
√ d
n(θ̂ − θ) → N (0, Γ−1 −0
θ Ωθ Γθ ),

as n → ∞.

210 / 318
Linear model

Our model is
yi = x0i θ + εi , Eθ (xi εi ) = 0.
Here, ϕ(xi ; θ) = xi (yi − x0i θ), which gives the least-squares estimator.
Further,
Ωθ = E(ε2i xi x0i ), Γθ = −E(xi x0i ).

The asymptotic variance is

E(xi x0i )−1 E(ε2i xi x0i ) E(xi x0i )−1 .

The variance would simplify if we additionally have that var(εi |xi ) = σ 2 .


This is an assumption of homoskedasticity.
Then (by iterating expectations)

E(ε2i xi x0i ) = σ 2 E(xi x0i ),

so that the asymptotic variance would be

σ 2 E(xi x0i )−1 .

211 / 318
We estimate the asymptotic variance as

  ( n⁻¹ ∑i xi xi' )⁻¹ ( n⁻¹ ∑i ε̂i² xi xi' ) ( n⁻¹ ∑i xi xi' )⁻¹,

where ε̂i = yi − xi'θ̂ are the residuals from the least-squares regression.
Under homoskedasticity we can use

  ( n⁻¹ ∑i ε̂i² ) ( n⁻¹ ∑i xi xi' )⁻¹

(could also apply the usual degrees-of-freedom correction to the first term).

Note that least squares is no longer normally distributed for small n because
the errors need no longer be normal.

Consequently, the exact distribution of the usual t and F statistics is unknown.
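
A sketch of both variance estimators on simulated heteroskedastic data (the error design is made up); dividing the slides' expressions by n gives variance estimates for θ̂ itself:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    u = rng.normal(size=n) * (1 + np.abs(X[:, 1]))        # heteroskedastic errors
    y = X @ np.array([1.0, 2.0]) + u

    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ theta_hat

    XtX_inv = np.linalg.inv(X.T @ X)
    V_robust = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv   # sandwich estimate of var(theta_hat)
    V_homosk = (e @ e / n) * XtX_inv                                  # valid only under homoskedasticity
    se_robust = np.sqrt(np.diag(V_robust))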

212 / 318
Exponential regression

Nonlinear conditional-mean models can be handled in the same way.


For example,
0
Eθ (yi |xi ) = exi θ = µi (say)
implies the moment condition
0
Eθ (xi (yi − µi )) = Eθ (xi εi ) = Eθ (xi (yi − exi θ )) = 0

(among others) and so the estimator that sets


n
X 0
n−1 xi (yi − exi θ ) = 0.
i=1

This equals the score equation for Poisson (see Slides 122–123).
Sometimes called the pseudo Poisson estimator.
However, the maximum-likelihood standard errors do not apply because the
information equality does not hold here:

Ωθ = E(xi x0i ε2i ) 6= E(xi x0i µi ) = −Γθ .

213 / 318
Pseudo Poisson : gravity equation

214 / 318
215 / 318
Maximum likelihood

The maximum-likelihood estimator is a method-of-moment estimator.


The moment condition is
 
∂ log fθ (xi )
Eθ =0
∂θ
and is always just identified.
Here,
Ωθ = −Γθ
holds by the information equality.
When the distribution of the data is misspecified (so the sample is not drawn
from fθ ) the score equation is biased and maximum likelihood inconsistent,
in general.
This makes semiparametric alternatives attractive.

216 / 318
Extremum estimators

An Extremum (or M-) estimator is generic terminology for estimators that


maximize an objective function, i.e,

arg max Qn (θ),


θ
P
where Qn (θ) = i q(xi ; θ) need not be a likelihood function.
(Nonlinear) least-squares, for example, has
n
X
Qn (θ) = − (yi − ϕ(xi ; θ))2 ,
i=1

where Eθ (yi |xi ) = ϕ(xi ; θ) (e.g., probit, logit, poisson, etc.).


If Qn is differentiable, the extremum estimator is a GMM estimator, with
moment conditions  
∂q(xi ; θ)
Eθ = 0.
∂θ

217 / 318
Rank estimator

An example of an M-estimator that is not a GMM estimator is the maximizer


of
Xn X
yi {x0i θ > x0j θ} + yj {x0i θ < x0j θ}.
i=1 i<j

The objective function is a U-process of order two.


The intuition is that, if
E(yi |xi ) = G(x0i θ)
is monotonic, then

E(yi |xi ) > E(yj |xj ) ⇒ x0i θ > x0j θ


.
E(yi |xi ) < E(yj |xj ) ⇒ x0i θ < x0j θ

However, this objective function is not differentiable in θ.


In fact, the summands in the objective function are not independent. We
need a different argument to establish the limit behavior of this estimator.

218 / 318
Quantile regression

Another example of an M-estimator that has a non-smooth objective function


is linear quantile regression.
Take an unconditional setting where xi has continuous (strictly increasing,
for simplicity) distribution F . Let

% = med(xi ) = F −1 (1/2).

We have
% = arg min E(|xi − ρ|).
ρ

Indeed,

  E(|xi − ρ|) = ∫ |x − ρ| dF (x) = ∫_{−∞}^{ρ} (ρ − x) dF (x) + ∫_{ρ}^{+∞} (x − ρ) dF (x).

Using Leibniz’s rule,

  ∂E(|xi − ρ|)/∂ρ = ∫_{−∞}^{ρ} dF (x) − ∫_{ρ}^{+∞} dF (x) = F (ρ) − (1 − F (ρ)) = 0
has unique solution ρ = %.
The sample analog is n⁻¹ ∑i |xi − ρ| and is not differentiable.

219 / 318
An alternative representation of the median follows from
1
F (%) = ,
2
as  
1
E {x ≤ %} − = 0,
2
which is a moment condition.
This suggest as estimator an (approximate) solution to the empirical moment
n
X 1
n−1 {xi ≤ ρ} − = 0.
i=1
2

The solution, say %̂, has ‘nice’ asymptotic properties,


 
a 1 1/4
%̂ − % ∼ N 0, ,
n f (%)2
but showing this requires different machinery than the one discussed here.

220 / 318
Prediction

You wish to predict yi based on xi .


The best predictor depends on how you quantify errors, i.e., the loss function.
If p(xi ) is the predictor,
E((yi − p(xi ))2 )
is the expected squared loss.
Under this loss specification the best predictor p minimizes

E((yi − p(xi ))2 ) = E(((yi − E(yi |xi )) − (p(xi ) − E(yi |xi )))2 )
= E((yi − E(yi |xi ))2 ) + E((p(xi ) − E(yi |xi ))2 )
= E(var(yi |xi )) + E((p(xi ) − E(yi |xi ))2 )
≥ E(var(yi |xi )).

The unique solution is p(xi ) = E(yi |xi ).

221 / 318
Linear prediction

A linear predictor is a linear function of xi , i.e., x0i β for any vector β.


The best linear predictor under expected squared loss uses the coefficients

arg min E((yi − x0i β)2 ).


β

They solve
E(xi (yi − x0i β)) = 0.
(uniquely if E(xi x0i ) has full rank) and equal

β = E(xi x0i )−1 E(xi yi ).

This is the population ordinary least-squares coefficient. By very definition,


xi and εi = yi − x0i β are uncorrelated.
Consequently, we can always write

yi = x0i β + εi

for some vector β such that E(xi εi ) = 0.


We call this the linear projection of yi on xi and write it as E ∗ (yi |xi ) = x0i β.
This does not mean that E(yi |xi ) = x0i β.
222 / 318
Endogeneity in a linear model

Again consider

yi = x0i θ + εi but now we allow that E(xi ε) 6= 0.

Note that
E ∗ (yi |xi ) 6= x0i θ,
so θ is not a regression coefficient;
E(yi |xi ) = x0i θ + E(εi |xi ) 6= x0i θ,
so
∂E(yi |xi )
6= θ.
∂xi

223 / 318
Omitted variables

Say we have
yi = αi + x0i θ + ηi , E(ηi |xi , αi ) = 0.

Say an agricultural (log-linearized) Cobb-Douglas production function.


yi is output;
xi are observable inputs ;
αi is soil quality;
ηi is rainfall.

Farmer observes (αi , xi ). We only observe xi . In general, xi , αi are not


independent.
Estimating
yi = x0i θ + (αi + ηi ) = x0i θ + εi
via least-squares suffers from endogeneity bias.
The problem is that αi is not observed in data. Otherwise, can just include
it in xi .

224 / 318
Measurement error

Suppose that
yi = wi0 θ + i , E(i |wi ) = 0
but (together with yi ) we only observe a noisy version of wi , say

xi = wi + ηi ,

for measurement error ηi .


Then

yi = wi0 θ + i = (xi − ηi )0 θ + i = x0i θ + (i − ηi0 θ) = x0i θ + εi .

Suppose, for simplicity, that E(ηi i ) = 0 and E(wi ηi ) = 0. Then

E(xi εi ) = E(xi (i − ηi0 θ)) = −E(xi ηi0 ) θ = −E(ηi ηi0 ) θ 6= 0.

A least-squares regression would estimate the population quantity

E(xi x0i )−1 E(xi yi ) = θ + E(xi x0i )−1 E(xi εi ) = θ − E(xi x0i )−1 E(ηi ηi0 ) θ.

225 / 318
Simultaneity

Temporary deviation from notational conventions to analyze market model

di = αd − θd pi + ui
si = αs + θs pi + vi

where di , si , pi are demand, supply, and price, respectively.

226 / 318
We do not observe supply and demand for any given price.
Collected data is on quantity traded and transaction price, (qi , pi ).

227 / 318
Data comes from markets in equilibrium.
So, we solve
si = di
for the equilibrium price to get
αd − αs ui − vi
pi = + .
θd + θs θd + θs
This gives traded quantity as
αd θs + αs θd θs ui + θd vi
qi = + .
θd + θs θd + θs
(With E(ui vi ) = 0) the population regression slope of qi on pi equals

  (σu²/(σu² + σv²)) θs − (σv²/(σu² + σv²)) θd ,

for σu2 = E(u2i ) and σv2 = E(vi2 ).


Least-squares estimates a weighted average of supply and demand elasticities.

228 / 318
229 / 318
To see the problem in terms of endogeneity, focus on the estimation of the
demand curve.
Then, collecting equations from above,
αd − αs ui − vi
di = αd − θd pi + ui , pi = + .
θd + θs θd + θs

Clearly,
σu2
  
ui − v i
E(pi ui ) = E ui = 6= 0,
θd + θs θd + θs

as the errors in both equations are correlated.

The same happens for the supply curve, as


αd − αs ui − vi
si = αs + θs pi + vi , pi = + .
θd + θs θd + θs
and
σv2
  
ui − vi
E(pi vi ) = E vi =− 6= 0.
θd + θs θd + θs

230 / 318
Linear instrumental-variable problem

Now suppose we have

yi = x0i θ + εi , E(zi εi ) = 0

for instrumental variables zi (with dim zi = dim xi ).


This gives us the moment conditions

Eθ (zi (yi − x0i θ)) = 0.

An instrument is

valid if E(zi εi ) = 0; and


relevant if E(zi x0i ) is full rank.

We then obtain the instrumental-variable estimator


  θ̂ = ( n⁻¹ ∑i zi xi' )⁻¹ ( n⁻¹ ∑i zi yi ).
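
A simulated just-identified example (made-up coefficients) contrasting the instrumental-variable and least-squares estimators:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    z = rng.normal(size=n)                         # instrument
    eta = rng.normal(size=n)
    eps = 0.5 * eta + rng.normal(size=n)           # endogeneity: eps correlates with x through eta
    x = 1.0 + 0.8 * z + eta
    y = 2.0 + 1.5 * x + eps

    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])

    theta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # (Z'X)^{-1} Z'y, consistent
    theta_ls = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent for the structural slope here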

231 / 318
It is useful to proceed in matrix notation:

y = Xθ + ε

and we set to zero the sample covariance of the errors and instruments. The
solution is
θ̂ = (Z 0 X)−1 (Z 0 y).

Note that this gives least squares when regressors instrument for themselves.

We need at least as many instruments as we have covariates.


To motivate the sequel suppose dim zi > dim xi . Then the dim zi equations

Z 0 (y − Xθ) = 0

involve dim xi < dim zi unknowns. Then (generically) these equations do


not have a solution (for finite n).

The method-of-moment idea fails to provide us with an estimator when we


have overidentification.

232 / 318
Resolving simultaneity with instrumental variables

Return to the estimation of a demand curve but now suppose that

di = αd − θd pi + ui
.
si = αs + θs pi + πzi + vi

where E(zi ui ) = 0.
zi shifts supply (relevance) but not demand (exclusion).

We now have the triangular system of equations

di = αd − θd pi + ui
αd − αs π ui − vi .
pi = − zi +
θd + θs θd + θs θd + θs
Further, by relevance and exclusion,

cov(di , zi ) = cov(αd − θd pi + ui , zi ) = −θd cov(pi , zi ),

and so
cov(di , zi )
−θd = .
cov(pi , zi )

233 / 318
234 / 318
Measurement error

Suppose again that


yi = wi θ + i ,
but that wi is measured with error, say as xi = wi + ηi . Then

yi = xi θ + (i − ηi θ)

and a regression of yi on xi does not deliver a consistent estimator of θ.

Suppose that we have an additional noisy measurement of wi ,

zi = wi + ζi .

If E(ζi ηi ) = 0 and E(ζi i ) = 0 we can estimate θ by instrumental variables.

We have
2
E(zi xi ) = E((wi + ζi ) (wi + ηi )) = σw , E(zi (i − ηi θ)) = 0,

so zi is relevant and valid.

235 / 318
Generalized method of moments

In overidentified problems (where we have more equations than unknowns)


we cannot satisfy all empirical moment conditions,
n
X
ĝ(θ) = n−1 ϕ(xi ; θ) = 0,
i=1

exactly.
The solution is to minimize the quadratic form

ĝ(θ)0 A ĝ(θ).

for some (positive semi-definite) weight matrix A.


This is the generalized method of moments.

Intuitively, we minimize the distance kĝ(θ) − 0kA .

236 / 318
Reduction in moments

With

  Ĝ(θ) = ∂ĝ(θ)/∂θ' = n⁻¹ ∑i ∂ϕ(xi ; θ)/∂θ',
the first-order condition to the GMM problem is

Ĝ(θ)0 A ĝ(θ) = 0.

This is a set of dim θ linear combinations of the dim ϕ original moments.


Linear combination is determined by weight matrix A (which we may choose).
So different A give different estimators.
The optimal weight matrix turns out to be

A = Ω−1
θ

(or a consistent estimator thereof).

237 / 318
Limit distribution
p
Combine the convergence result Ĝ(θ∗ ) → Γθ for any consistent θ∗ with the
expansion
n n n
X X X ∂ϕ(xi ; θ)
n−1 ϕ(xi ; θ̂) = n−1 ϕ(xi ; θ) + n−1 (θ̂ − θ)
i=1 i=1 i=1
∂θ0 θ∗

to see that
1X
(θ̂ − θ) = −(Γ0θ A Γθ )−1 Γ0θ A ϕ(xi ; θ) + op (n−1/2 ).
n i
1 X d
Then, with √ ϕ(xi ; θ) → N (0, Ωθ ),
n i
we get the result.

Theorem 23 (Limit distribution of GMM estimator)


Under regularity conditions,
√ d
n(θ̂ − θ) → N (0, (Γ0θ A Γθ )−1 (Γ0θ AΩθ A0 Γθ )(Γ0θ A Γθ )−1 )

as n → ∞.

238 / 318
Optimal weighting

Theorem 24 (Semiparametric efficiency)


The efficiency bound (for a given set of moment conditions) is

(Γ0θ Ω−1
θ Γθ )
−1
.

We can establish this by showing that the difference

(Γ0θ A Γθ )−1 (Γ0θ AΩθ A0 Γθ )(Γ0θ A Γθ )−1 − (Γ0θ Ω−1 Γθ )−1

is a positive semi-definite matrix.

The bound is achieved if


A = Ω−1
θ

(up to a scale) or if we use a consistent estimator.

Note that in this case we have a generalized information equality.

239 / 318
Proof.
Let

  C = (Γθ'A Γθ )⁻¹ Γθ'A Ωθ^{1/2},   D = Ωθ^{−1/2} Γθ .

Then

  (Γθ'A Γθ )⁻¹ (Γθ'A Ωθ A'Γθ )(Γθ'A Γθ )⁻¹ − (Γθ'Ωθ⁻¹ Γθ )⁻¹

can be written as
CC 0 − CD(D0 D)−1 D0 C 0 .
But this is
CMD C 0 ≥ 0, MD = Im − D(D0 D)−1 D0 .

The inequality follows because MD is an orthogonal projection matrix, and


so all its eigenvalues are zero or one. Hence, it is positive semi-definite.

To see that the eigenvalues of an orthogonal projector P are all zero or one,
let λ 6= 0 be an eigenvalue of P . Then P x = λx for some x 6= 0. Because P
is idempotent we must also have that P 2 x = P x = λP x = λ2 x. Therefore it
must hold that
λx = λ2 x,
which can only be true if λ ∈ {0, 1}.

240 / 318
χ2 problem

To illustrate the efficiency gain of combining moments suppose that xi ∼ χ2θ .


We know that
Eθ (xi − θ) = 0
so we could estimate θ by the sample mean.

But varθ (xi ) = 2θ so also have the moment condition Eθ ((xi − θ)2 − 2θ) = 0.
Let  
xi − θ
ϕ(xi ; θ) = .
(xi − θ)2 − 2θ
Then  
1 4
Ωθ = Eθ (ϕ(xi ; θ) ϕ(xi ; θ)0 ) = 2θ .
4 6(θ + 4)

241 / 318
So,
3(θ+4)
 
1 −1
Ω−1
θ =
2
1 .
θ(3θ + 4) −1 4

The Jacobian of the moment conditions is simply −(1, 2)0 and so we find that
the asymptotic variance equals
θ + 4/3
2θ .
θ+2

If we would just use one of the moments the asymptotic variance would be

2θ, and 3θ(θ + 4),

respectively. Both are larger.

Note that Ωθ depends on θ. So the optimal GMM estimator will generally


be a two-step estimator (see below).

Estimation of the weight matrix introduces additional sampling noise that


leads to bias and affects the coverage of confidence intervals/size and power
of tests.

242 / 318
Poisson

When xi is Poisson we similarly have

Eθ (xi − θ) = 0, Eθ ((xi − θ)2 − θ) = 0

by the mean/variance equality.


Here,
   
1 1 1 1 + 2θ −1
Ωθ = θ , Ω−1
θ = .
1 1 + 2θ 2θ2 −1 1

But as Γ0θ = −(1, 1) we get


1
Γ0θ Ω−1
θ Γθ = .
θ
So the asymptotic variance is the same as for the simple estimator xn based
on the first moment condition only.
We knew we should have reached this conclusion here because xn is the
maximum-likelihood estimator and is best unbiased.

243 / 318
Two-step GMM

We can estimate Ωθ = Eθ (ϕ(xi ; θ) ϕ(xi ; θ)0 ) by


  Ω̂θ̂ = n⁻¹ ∑i ϕ(xi ; θ̂) ϕ(xi ; θ̂)',

where we use a first-step GMM estimator θ̂ (constructed using a feasible A).


We then re-estimate θ by

  θ̂̂ = arg minθ ĝ(θ)' Ω̂θ̂⁻¹ ĝ(θ).

In principle, this two-step procedure can be iterated.


Could also consider continuously-updated GMM:

  arg minθ ĝ(θ)' Ω̂θ⁻¹ ĝ(θ),   Ω̂θ = n⁻¹ ∑i ϕ(xi ; θ) ϕ(xi ; θ)'.

This is computationally more challenging; first-order condition features extra


terms (use MCMC).
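
A sketch of the two-step procedure for the earlier χ²θ two-moment example (scipy's scalar minimizer; the bounds are an arbitrary implementation choice):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.chisquare(df=5, size=2000)          # true theta = 5

    def phi(theta):
        # moments E(x - theta) = 0 and E((x - theta)^2 - 2 theta) = 0, stacked per observation
        return np.column_stack([x - theta, (x - theta) ** 2 - 2 * theta])

    def objective(theta, W):
        g = phi(theta).mean(axis=0)
        return g @ W @ g

    step1 = minimize_scalar(objective, args=(np.eye(2),), bounds=(0.1, 50), method="bounded")
    Omega_hat = np.cov(phi(step1.x), rowvar=False)                 # estimated moment variance
    step2 = minimize_scalar(objective, args=(np.linalg.inv(Omega_hat),),
                            bounds=(0.1, 50), method="bounded")
    print(step1.x, step2.x)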

244 / 318
Examples of (nonlinear) method of moments

Avery, R. B., L. P. Hansen, and V. J. Hotz (1983). Multiperiod probit models and orthogonality
condition estimation. International Economic Review 24, 21–35.

Becker, G. S., M. Grossman, and K. M. Murphy (1994). An empirical analysis of cigarette


addiction. American Economic Review 84, 396–418.

Berry, S. T. (1994). Estimating discrete-choice models of product differentiation. RAND Journal


of Economics 25, 242–262.

Goldberg, P. K. and F. Verboven (2001). The evolution of price dispersion in the European car
market. Review of Economic Studies 68, 811–848.

Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of


nonlinear rational expectations models. Econometrica 50, 1269–1286.

Pakes, A. (1986). Patents as options: Some estimates of the value of holding European patent
stocks. Econometrica 54, 755–784.

245 / 318
Two-stage least squares

In the linear instrumental-variable problem (with more instruments than


covariates) we minimize

(y − Xθ)0 ZA Z 0 (y − Xθ).

The first-order condition is (X 0 Z)A Z 0 (y − Xθ) = 0 and the solution is thus

θ̂ = (X 0 ZAZ 0 X)−1 (X 0 ZAZ 0 y).

Under homoskedasticity,

Ωθ = E(ε2i zi zi0 ) = σε2 E(zi zi0 )

so that the optimal weight matrix is simply σ̂ε2 (Z 0 Z)−1 .


The efficient estimator is a one-step estimator and takes the form

θ̂ = (X 0 Z(Z 0 Z)−1 Z 0 X)−1 (X 0 Z(Z 0 Z)−1 Z 0 y) = (X 0 P Z X)−1 (X 0 P Z y).

This is the two-stage least squares estimator.
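
A sketch of the estimator with more instruments than regressors (simulated data, arbitrary coefficients):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # dim z > dim x
    eta = rng.normal(size=n)
    eps = 0.6 * eta + rng.normal(size=n)
    x = Z @ np.array([0.5, 1.0, -0.5, 0.8]) + eta
    X = np.column_stack([np.ones(n), x])
    y = X @ np.array([2.0, 1.5]) + eps

    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)                # first-stage fitted values P_Z X
    theta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)       # (X'P_Z X)^{-1} X'P_Z y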

246 / 318
To understand 2SLS recall the model

yi = x0i θ + εi , E(zi εi ) = 0

and note that we can always use E ∗ (xi |zi ) = zi0 π to decompose the covariates
as
xi = zi0 π + ηi = x̃i + ηi (say).
By the validity of zi as instrument we have E(zi εi ) = 0 and so we know that

E(xi εi ) = E(x̃i εi ) + E(ηi εi ) = E(ηi εi ),

i.e., ηi is the endogenous part of xi . Also, by virtue of the linear projection,


E(zi ηi ) = 0 and so
E(x̃i ηi ) = 0.
It follows that, in

yi = x0i θ + εi = x̃0i θ + (εi + ηi0 θ) = x̃0i θ + i (say).

the covariates x̃i and error i are uncorrelated.


In practice, x̃i is unknown. Replacing it with an estimator gives 2SLS:
Estimate x̃i by x̂i , the fitted values from a linear regression of xi on zi ;
Estimate θ by regressing yi on x̂i .

247 / 318
Replacing population projection with sample projection introduces bias.

We have
π̂ = (Z 0 Z)−1 Z 0 X = π + (Z 0 Z)−1 Z 0 η
and so
X̂ = X̃ + P Z η.
The second term correlates with ε and so

E(x̂i εi) ≠ 0,

which introduces bias.

The bias vanishes as n → ∞, however, so the estimator remains consistent.

248 / 318
Precision of instrumental variables

Instrumental-variable estimators are always more variable than least squares.


Suppose we only have one covariate xi , one instrument zi , and homoskedastic
errors.
The usual first-order approximation to the least-squares estimator is

θ̂ − θ ∼a N(0, n⁻¹ σε²/σx²)

(under exogeneity).
The same first-order approximation to the instrumental-variable estimator is

θ̂ − θ ∼a N(0, n⁻¹ σε²/(ρxz² σx²)),
where ρxz is the correlation between xi and zi .
The intuition is that xi is (in terms of relevance/fit) its own best instrument.
The instrument is said to be weak when ρxz is small.
In this case the first-order approximation becomes poor.

251 / 318
Weak instruments

Take the simple univariate problem, where we only have one covariate xi and
m instruments zi (treat these as fixed), and suppose we have homoskedastic
errors.
We can approximate the mean squared error of 2SLS to second order to get

(1/n) (σε²/ση²)/τ + (m/n)² (ρ σε/(τ ση))² + o(n⁻²),

where the first term is the VARIANCE and the second the SQUARED BIAS, and where

nτ = π′Z′Zπ/ση² = R²/(1 − R²)

is the concentration parameter.
R² here denotes the (uncentered) population R-squared of the first-stage regression.
This relates directly to the first-stage F-statistic.
When τ is small, most of the variation in xi comes from ηi rather than from zi.
252 / 318
Sampling distribution of two-stage least squares as a function of the value of
the concentration parameter (simulation details omitted).

Figure: The effect of τ (sampling densities of 2SLS shown for τ = 500, 200, 50, 10, 1).
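A sketch of the kind of simulation behind such a figure (my own design, not the one used for the slide): with a single instrument the just-identified 2SLS estimator is the IV ratio, and the first-stage coefficient controls the concentration parameter.

```python
import numpy as np

def tsls_draws(n, pi, rho, n_rep=2000, seed=0):
    """Sampling distribution of just-identified 2SLS for a first stage x = pi*z + eta."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_rep)
    for r in range(n_rep):
        z = rng.normal(size=n)
        eta = rng.normal(size=n)
        eps = rho * eta + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        x = pi * z + eta
        y = 1.0 * x + eps                      # true coefficient is 1
        draws[r] = (z @ y) / (z @ x)           # IV estimator
    return draws

for pi in (1.0, 0.3, 0.05):                    # strong, moderate, and weak first stage
    d = tsls_draws(n=200, pi=pi, rho=0.5)
    print(pi, round(np.median(d), 3), round(np.std(d), 3))
```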

253 / 318
Many instruments

Note also how


(1/n) (σε²/ση²)/τ + (m/n)² (ρ σε/(τ ση))² + o(n⁻²)

(variance plus squared bias) depends on the number of instruments (m).

More instruments decrease the relative contribution of ηi to xi.

But the fitted values x̃i = zi′π have to be estimated.

Under regularity conditions,

x̂i − x̃i = Op(√(m/n)).

The noise in the fitted values grows with m.

254 / 318
Sampling distribution of two-stage least squares as a function of the number
of instruments (simulation details omitted).

Figure: sampling densities of least squares and of 2SLS with m = 5, 10, 25, 75, 150 instruments.

255 / 318
Control-function interpretation

Let
e = MZX
be the residuals from the least-squares regression of X on Z (i.e., from the
first stage).
Then 2SLS can be written as

θ̂ = (X 0 P Z X)−1 (X 0 P Z y) = (X 0 M e X)−1 (X 0 M e y).

Indeed,

M e X = M e (P Z X + M Z X) = (I − P e )P Z X + M e e = P Z X.

So 2SLS can equally be performed in the following two steps:


Estimate ηi by ei , the residuals from a linear regression of xi on zi ;
Estimate θ by regressing yi on xi and ei .
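A quick numerical check (simulated data, my own code) that this control-function two-step reproduces 2SLS:

```python
import numpy as np

def ols(y, X):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
n = 4000
Z = rng.normal(size=(n, 3))
eta = rng.normal(size=n)
x = Z @ np.array([1.0, 0.5, -0.5]) + eta
y = 1.5 * x + 0.7 * eta + rng.normal(size=n)

X = x[:, None]
Xhat = Z @ ols(X, Z)                                   # first-stage fitted values
theta_2sls = ols(y, Xhat)[0]                           # 2SLS
e = X - Xhat                                           # first-stage residuals
theta_cf = ols(y, np.column_stack([X, e]))[0]          # control-function regression
print(theta_2sls, theta_cf)                            # identical up to rounding
```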

256 / 318
This view on 2SLS gives us a way to test the null of exogeneity.
Work through the simple model with

yi = xi θ + εi
xi = zi π + ηi

where the errors are jointly normal.


Let ei be the residual from the first stage.
Then 2SLS solves the empirical moments
∑i (xi, ei)′ (yi − xi θ − ei γ) = 0

for θ, γ.
As ei = xi − zi π̂ = ηi − zi (π̂ − π) we can write this as (evaluating at true
parameter values)
∑i [ (xi, ηi)′ ui + (xi, ηi)′ zi γ (π̂ − π) − (0, zi(π̂ − π))′ (εi − ηi γ) − (0, 1)′ zi² γ (π̂ − π)² ]

for ui = yi − xi θ − ηi γ = εi − ηi γ (which does not correlate with xi or ηi ).

257 / 318
Because E(zi ηi) = 0 and E(zi εi) = 0, and because ‖π̂ − π‖² = Op(n⁻¹), this
behaves like (as n → ∞ and scaled by n⁻¹)

n⁻¹ ∑i (xi, ηi)′ ui + n⁻¹ ∑i (πγ, 0)′ zi ηi,

where we have used that π̂ − π = n⁻¹ ∑_{i=1}^n zi ηi /E(zi²) + op(n^{−1/2}). The first
term is standard, it also showed up when ηi was directly observed. The
second term is present because we have replaced ηi by an estimator ei ; this
introduces additional noise that has to be accounted for.
The variance-covariance matrix of the above random variable is
Ωθ = [ σx²σu² + γ²π²σz²ση²   ση²σu² ; ση²σu²   ση²σu² ]
   = σu² [ σx²   ση² ; ση²   ση² ] + γ² [ (σx² − ση²)ση²   0 ; 0   0 ].

The limit of the Jacobian of the moment conditions is


Γθ = [ σx²   ση² ; ση²   ση² ],   Γθ⁻¹ = (1/(ση²(σx² − ση²))) [ ση²   −ση² ; −ση²   σx² ].

258 / 318
The asymptotic variance of the estimator,

Γθ⁻¹ Ωθ Γθ⁻¹,

then equals

σu² Γθ⁻¹ + γ² (ση²/(σx² − ση²)) [ 1   −1 ; −1   1 ].

Under the null of exogeneity (γ = 0), this is just σu² Γθ⁻¹ and so the usual
least-squares standard error will be consistent.
Hence, the reported t-statistic is valid for testing exogeneity.

For our estimator of θ we do need a correction to the usual least-squares


standard error as we want to allow that γ ≠ 0.

259 / 318
Bias correction with many moments

For a fixed weight matrix A, the bias in the GMM objective function is

Eθ ĝ(θ)0 Aĝ(θ) = tr(AΩθ )/n.




The bias shrinks with n but grows (typically linearly) with dim ϕ.

A bias-corrected GMM estimator minimizes


ĝ(θ)′ A ĝ(θ) − tr(AΩ̂θ)/n = (1/n²) ∑i ∑_{j≠i} ϕ(xi; θ)′ A ϕ(xj; θ).
The continuously-updated estimator from above has a similar bias-correction
interpretation.

For 2SLS we have A = (Z 0 Z)−1 and so the bias-corrected objective function


equals

(1/n²) ∑i ∑_{j≠i} (yi − xi′θ) pij (yj − xj′θ)

for pij = (P Z)ij = zi′(Z′Z)⁻¹ zj.

263 / 318
Its minimizer is the jackknife instrumental-variable estimator
 −1   !−1 !
XX 0
X X X 0
X
 xi pij xj  xi pij yj =
 x̌i xi x̌i yi ,
i j6=i i j6=i i i

where X X
x̌i = xj pji = xj zj0 (Z 0 Z)−1 zi = Π̂−i zi .
j6=i j6=i

Recall that the first-stage equation is of the form

xi = Πzi + ηi .

Here, Π̂−i is a leave-one-out estimator of the first-stage coefficient matrix


and x̌i is the associated fitted value.

Recall that bias in (feasible) 2SLS arose from the fact that Π̂ is a function of
ηi and ηi correlates with εi (See Slide 248). By construction the leave-one-out
fitted values do not depend on ηi .
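A minimal numpy sketch of this jackknife estimator (my own implementation via the full hat matrix, fine for moderate n; the simulated design is illustrative):

```python
import numpy as np

def jive(y, X, Z):
    """Jackknife IV: leave-one-out first-stage fitted values as instruments."""
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)            # hat matrix P_Z
    h = np.diag(P)
    X_check = P @ X - h[:, None] * X                 # x_check_i = sum_{j != i} p_ij x_j
    return np.linalg.solve(X_check.T @ X, X_check.T @ y)

rng = np.random.default_rng(7)
n = 500
Z = rng.normal(size=(n, 3))
eta = rng.normal(size=n)
x = Z @ np.array([0.3, 0.3, 0.3]) + eta
y = 1.0 * x + 0.7 * eta + rng.normal(size=n)
print(jive(y, x[:, None], Z))                        # close to the true value 1
```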

264 / 318
Multiplicative models with endogeneity

As an example of nonlinear instrumental-variable estimation, suppose that

yi = ϕ(xi ; θ) εi , E(εi |zi ) = 1.

We have (conditional) moment condition


 
Eθ( yi/ϕ(xi; θ) − 1 | zi ) = 0

and so many unconditional moment conditions; for example

Eθ( zi (yi/ϕ(xi; θ) − 1) ) = Eθ( (zi/ϕ(xi; θ)) (yi − ϕ(xi; θ)) ) = 0.

An example is an exponential model.

265 / 318
Additional reading on instrumental variables

Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental variable


estimators. Econometrica 62, 657–681.

Bound, J., D. A. Jaeger, and R. M. Baker (1995). Problems with instrumental variables estimation
when the correlation between the instruments and the endogenous explanatory variable is weak.
Journal of the American Statistical Association 90, 443–450.

Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments.
Econometrica 65, 557–586.

Stock, J. H. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In
Andrews, D. W. K. and J. H. Stock (Editors), Identification and Inference for Econometric Models:
Essays in Honor of Thomas Rothenberg, Chapter 5, 80—108 (Cambridge UP, Cambridge, UK).

266 / 318
Likelihood-ratio type test statistic

Now consider testing the m-dimensional constraint that r(θ) = 0.


Let

θ̌ˇ = arg min_{θ: r(θ)=0} ĝ(θ)′ Ω̂θ̂⁻¹ ĝ(θ);

the efficient GMM estimator under the constraint.

Theorem 25 (Limit distribution of LR-type statistic)


Under the null,

n ĝ(θ̌ˇ)′ Ω̂θ̂⁻¹ ĝ(θ̌ˇ) − n ĝ(θ̂ˆ)′ Ω̂θ̂⁻¹ ĝ(θ̂ˆ) →d χ²m

as n → ∞.

This result requires optimal weighting.


Note that we use the same weight matrix throughout.

267 / 318
Score type test statistic

Similarly, we can look at whether the first-order condition of the unconstrained
problem,

Ĝ(θ)′ Ω̂θ̂⁻¹ ĝ(θ) = 0,

is far from zero when evaluated at the constrained estimator θ̌ˇ.

Theorem 26 (Limit distribution of LM-type statistic)


Under the null,

n ĝ(θ̌ˇ)′ Ω̂θ̂⁻¹ Ĝ(θ̌ˇ) (Ĝ(θ̌ˇ)′ Ω̂θ̂⁻¹ Ĝ(θ̌ˇ))⁻¹ Ĝ(θ̌ˇ)′ Ω̂θ̂⁻¹ ĝ(θ̌ˇ) →d χ²m

as n → ∞.

This result requires optimal weighting.

268 / 318
Wald test statistic

The Wald statistic works without reference to the constrained problem.


Under optimal weighting we would have the following, where R is again the
Jacobian matrix of the constraint vector r.

Theorem 27 (Limit distribution of the Wald statistic)


Under the null,

n r(θ̂ˆ)′ ( R (Ĝ(θ̂ˆ)′ Ω̂θ̂⁻¹ Ĝ(θ̂ˆ))⁻¹ R′ )⁻¹ r(θ̂ˆ) →d χ²m,

as n → ∞.

More generally, when using estimator θ̂ computed using weight matrix A it


equals

n r(θ̂)0 (R((Ĝ(θ̂)0 A Ĝ(θ̂))−1 (Ĝ(θ̂)0 AΩ̂θ̂ A0 Ĝ(θ̂))(Ĝ(θ̂)0 A Ĝ(θ̂))−1 )−1 R0 )−1 r(θ̂).

269 / 318
J-statistic

Note that
n ĝ(θ̂ˆ)′ Ω̂θ̂⁻¹ ĝ(θ̂ˆ) →d χ²_{dim ϕ − dim θ}
if all moments hold.

So we can test the specification.

Only possible when we have overidentification, i.e., when

dim ϕ − dim θ > 0

(Otherwise the test statistic is exactly zero).

If the J-statistic is large relative to the quantiles of the χ²-distribution, at
least some of the moment conditions are likely to be invalid.

This does not tell us which moments are troublesome.

270 / 318
We can test subset of the moments as well.

Partition the moments using ϕ(x; θ) = (ϕ1(x; θ)′, ϕ2(x; θ)′)′.

Also partition

Ω̂θ = [ (Ω̂θ)11   (Ω̂θ)12 ; (Ω̂θ)21   (Ω̂θ)22 ].

Want to test
Eθ (ϕ2 (xi ; θ)) = 0
assuming that Eθ (ϕ1 (xi ; θ)) = 0.

If dim ϕ1 ≥ dim θ we can compute

θ̌ˇ = arg minθ ĝ1(θ)′ ((Ω̂θ̂)11)⁻¹ ĝ1(θ),

where ĝ1(θ) = n⁻¹ ∑i ϕ1(xi; θ).

We can also compute the estimator using all moment conditions, i.e., the
usual
θ̂ˆ = arg min ĝ(θ)0 Ω̂−1
θ̂
ĝ(θ).
θ

271 / 318
We then have the following simple result.

Theorem 28 (Testing moment validity)


If all moment conditions hold,

n ĝ(θ̂ˆ)′ Ω̂θ̂⁻¹ ĝ(θ̂ˆ) − n ĝ1(θ̌ˇ)′ ((Ω̂θ̂)11)⁻¹ ĝ1(θ̌ˇ) →d χ²_{dim ϕ − dim ϕ1}

as n → ∞.

Note that we use the same weight matrix in both terms.


This ensures (in small samples) that the test statistic is non-negative.

272 / 318
Testing instrument validity

In the linear model


y = Xθ + ε
with homoskedastic errors, the optimally-weighted GMM estimator is 2SLS
and the objective function (scaled up by n and evaluated at its minimizer)
equals

ε̂′ P Z ε̂ / σ̂²,

where ε̂ are the 2SLS residuals.
This statistic is known as Sargan’s statistic.
Note that P Z ε̂ are the fitted values of a regression of the 2SLS residuals on
the instruments.
Moreover, as σ̂² = ε̂′ε̂/n we can equivalently write

n ε̂′ P Z ε̂ / ε̂′ε̂ = n ESS/TSS = n R².

Invalid instruments can be detected by looking at correlation between the


residuals and the instruments.
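A minimal sketch of the statistic as n·R² (my own functions; data simulated with valid instruments for illustration):

```python
import numpy as np

def tsls(y, X, Z):
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

def sargan(y, X, Z):
    """Sargan statistic: n times the uncentered R^2 of the 2SLS residuals on Z."""
    e = y - X @ tsls(y, X, Z)
    e_fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)      # P_Z e
    return len(y) * (e @ e_fit) / (e @ e)              # compare with chi2(dim z - dim x)

rng = np.random.default_rng(8)
n = 2000
Z = rng.normal(size=(n, 3))
eta = rng.normal(size=n)
x = Z @ np.array([1.0, 0.5, 0.5]) + eta
y = 2.0 * x + 0.6 * eta + rng.normal(size=n)
print(sargan(y, x[:, None], Z))                        # two degrees of freedom here
```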

273 / 318
Optimal moment conditions in conditional models

Now suppose that we know

Eθ (ϕ(xi ; θ)|zi ) = 0

(a.s.)

This yields an infinite amount of unconditional moments.

We look for the optimal moment conditions, i.e., the function ψ in

Eθ (ψ(zi ) ϕ(xi ; θ)) = 0

for which the asymptotic variance of the resulting GMM estimator is minimal.

The optimal instrument turns out to be


ψ(zi) = −Eθ( ∂ϕ(xi; θ)/∂θ′ | zi )′ Eθ( ϕ(xi; θ) ϕ(xi; θ)′ | zi )⁻¹ = −Γθ(zi)′ Ωθ(zi)⁻¹.

Note that dim ψ = dim θ.

274 / 318
Notice that, now,

Ωθ = varθ (ψ(zi ) ϕ(xi ; θ)) = Eθ Γθ (zi )0 Ωθ (zi )−1 Γθ (zi ) ,




and

Γθ = Eθ( ψ(zi) ∂ϕ(xi; θ)/∂θ′ ) = −Eθ( Γθ(zi)′ Ωθ(zi)⁻¹ Γθ(zi) ),

(use iterated expectations) such that

Ωθ = −Γθ .

Hence, the generic sandwich-form asymptotic variance becomes

avarθ(θ̂) = (Γθ′ Ωθ⁻¹ Γθ)⁻¹ = Ωθ⁻¹;

that is,

√n (θ̂ − θ) →d N(0, Ωθ⁻¹).

This is the semiparametric efficiency bound.

275 / 318
Proof.
Let
gi = ψ(zi ) ϕ(xi ; θ), hi = Γ0θ A φ(zi ) ϕ(xi ; θ)
for arbitrary alternative weight matrix A and instrument vector φ.
The asymptotic variances of the associated GMM estimators are

Eθ (gi gi0 )−1 , Eθ (hi gi0 )−1 Eθ (hi h0i )Eθ (gi h0i )−1 ,

respectively.
Rewriting gives

Eθ (hi gi0 )−1 Eθ (hi h0i ) Eθ (gi h0i )−1 −Eθ (gi gi0 )−1 = Eθ (hi gi0 )−1 Eθ (vi vi0 )Eθ (gi h0i )−1

for
vi = hi − gi0 γ, γ = Eθ (gi gi0 )−1 Eθ (gi h0i ).

This difference is positive semi-definite because E(vi vi0 ) ≥ 0.

276 / 318
Linear model

With
yi = x0i θ + εi , Eθ (εi |xi ) = 0
we have
Eθ (yi − x0i θ | xi ) = 0.
Here,

Γθ(xi) = Eθ( ∂(yi − xi′θ)/∂θ′ | xi ) = −xi′,   Ωθ(xi) = Eθ(εi² | xi) = σi² (say).

So,

ψ(xi) = −Γθ(xi)′ Ωθ(xi)⁻¹ = xi/σi²,

and the optimal estimator solves the empirical moment condition

n⁻¹ ∑_{i=1}^n xi (yi − xi′θ)/σi² = 0.

Observation i gets less weight if σi2 is higher. This is weighted least squares.

277 / 318
Write

V = diag(σ1², . . . , σn²).

Then the optimal estimator is

θ̂ = (X′V⁻¹X)⁻¹(X′V⁻¹y).

Under homoskedasticity, i.e., when σi² = σ² for all i, this reduces to the simple

θ̂ = (X′X)⁻¹(X′y),

which is ordinary least squares.

This is the Gauss-Markov theorem.
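A minimal numpy sketch of the weighted least-squares formula above, treating the σi² as known (simulated data; names are mine):

```python
import numpy as np

def wls(y, X, sigma2):
    """Weighted least squares: (X'V^{-1}X)^{-1} X'V^{-1}y with V = diag(sigma2)."""
    Xw = X / sigma2[:, None]                       # rows of X scaled by 1/sigma_i^2
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(9)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = np.exp(X[:, 1])                           # known conditional variances
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)
print(wls(y, X, sigma2))                           # close to (1, 2)
```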

278 / 318
Exponential model

We have

yi = exp(xi′θ) εi,   Eθ(εi | xi) = 1,

and so

Eθ( yi − exp(xi′θ) | xi ) = 0.

Here,

Γθ(xi) = −exp(xi′θ) xi′,   Ωθ(xi) = σi² (say).

The optimal empirical moment condition thus is

n⁻¹ ∑_{i=1}^n xi exp(xi′θ) (yi − exp(xi′θ))/σi² = 0.

With Poisson data, for example, σi² = exp(xi′θ) and the estimating equation is

n⁻¹ ∑_{i=1}^n xi (yi − exp(xi′θ)) = 0.

With homoskedastic errors, σi² = σ² (exp(xi′θ))² and we solve

n⁻¹ ∑_{i=1}^n xi (yi − exp(xi′θ))/exp(xi′θ) = 0.
279 / 318
Instrumental-variable model

Now if
Eθ (yi − x0i θ| zi ) = 0
we obtain

Γθ (zi )0 = −E(xi | zi ), Ωθ (zi ) = E(ε2i |zi ) = σi2 (say).

So, we solve
n⁻¹ ∑_{i=1}^n E(xi | zi) (yi − xi′θ)/σi² = 0.

Here, the reduced form is nonlinear, in general.


This is also true under homoskedasticity. So two-stage least squares is not
optimal, in general.
Two-stage least squares approximates E(xi |zi ) by the linear projection
E ∗ (xi |zi ).

280 / 318
Linear model for panel data

Suppose now that we have repeated measurements, as in

yit = x0it θ + εit , t = 1, . . . , T.

For each i we have a set of T equations.


This fits our framework on stacking observations for each i and writing

yi = x0i θ + εi ;

here, e.g., yi = (yi1 , . . . , yiT )0 .


Suppose that
εit = αi + uit .
We may have that αi and xi are correlated. Then E(εi |xi ) 6= 0.
However, with ∆ the first-differencing operator, we have the T − 1 equations

∆yit = ∆x0it θ + ∆εit = ∆x0it θ + ∆uit

that are free of αi .


A sufficient condition for estimation is E(uit |xi , αi ) = 0.

281 / 318
Define the (T − 1) × T first-differencing matrix D as

D = [ −1   1   0   ···   0   0 ;
       0  −1   1   ···   0   0 ;
       ⋮                  ⋮   ;
       0   0   0   ···  −1   1 ].

We have the conditional moment conditions

Eθ [D(yi − x0i θ)|xi1 , . . . , xiT ] = 0.

The first-differenced least-squares estimator solves


∑_{i=1}^n xi D′D (yi − xi′θ) = 0.

This is pooled least-squares on first-differenced data. It is inefficient as the


∆uit are correlated.

282 / 318
Suppose that ui ∼ (0, σ 2 IT ). Then Dui ∼ (0, σ 2 DD0 ).

The optimal unconditional (empirical) moments are


∑_{i=1}^n xi D′(DD′)⁻¹D (yi − xi′θ) = 0.

This yields a generalized least-squares estimator. It is a pooled least-squares


estimator on demeaned data.

A calculation gives
M = D′(DD′)⁻¹D = IT − ιT ιT′/T,
where ιT is a vector of ones.

The matrix M transforms data into deviations from within-group means. For
example, M yi = yi − y i .
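A quick numerical check of this identity (my own snippet, with T = 4):

```python
import numpy as np

T = 4
D = np.zeros((T - 1, T))
for t in range(T - 1):
    D[t, t], D[t, t + 1] = -1.0, 1.0                 # rows of the first-differencing matrix

M = D.T @ np.linalg.solve(D @ D.T, D)                # D'(DD')^{-1} D
M_within = np.eye(T) - np.ones((T, T)) / T           # I_T - iota iota'/T
print(np.allclose(M, M_within))                      # True
```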

283 / 318
Feedback

The above estimator requires that uit is uncorrelated with xi1 , . . . , xiT . This
rules out dynamics and, more generally, feedback.

A simple model where the problem arises is the (first-order) autoregression

yit = yit−1 θ + αi + uit ,

where the initial value yi0 is taken as observed.

First-differencing (and, equivalently, demeaning) sweeps out αi ,

∆yit = ∆yit−1 θ + ∆uit ,

but (taking uit to be homoskedastic and serially uncorrelated for simplicity)

E(∆yit−1 ∆uit) = E(∆uit−1 ∆uit) = −E(uit−1²) = −σ² ≠ 0,

and so introduces a new endogeneity problem.

284 / 318
An assumption of sequential exogeneity, i.e., E(uit |yi0 , . . . , yit−1 , αi ) = 0 is
enough to obtain a GMM estimator.

It implies (sequential) conditional moments

Eθ (∆yit − ∆yit−1 θ| yi0 , . . . , yit−2 ) = 0.

The conventional GMM estimator uses the linear moment conditions


  
yit−2
yit−3  
E  .  ∆yit − ∆yit−1 θ  = 0
  
.
 .  
yi0

(for all t = 2, . . . , T ).

285 / 318
DEALING WITH (WEAK) DEPENDENCE

286 / 318
Reading

Hansen II, Chapters 14 and 15


Hayashi, Chapter 6

287 / 318
Stationary

Random sampling may be too strong a requirement:


Time series data;
Interactions and other network data;
Snowball sampling (and so on).

Consider a scalar sequence {xi }.


{xi } is (strictly) stationary if, for any h ≥ 0, the distribution of (xi , . . . , xi+h )
does not depend on i.
An implication (if the moments exist) is weak stationarity: the mean E(xi )
and covariance E(xi xi+h ) − E(xi )E(xi+h ) do not depend on i.

The techniques discussed so far can be adapted to stationary data provided


they are weakly dependent.

288 / 318
Dependence and mixing

Weak dependence is a requirement that the overall behavior of {xi } is not


driven by the realizations of the initial random variables (or any of the other
variables later on).
Blocks of data (xi , . . . , xi+j ) and (xi+j+h , . . . , xi+j+h+k ) separated by h units
become independent as h grows.
One way to formalize weak dependence is through mixing.
For h ≥ 1 define the mixing coefficients

αh = sup |P (A ∩ B) − P (A) P (B)|,


A∈A,B∈B

where (somewhat crudely stated) the sets A and B cover all events involving
x−∞ , . . . , xi−1 , xi and xi+h , xi+h+1 , . . . , x+∞ , respectively. Note how these
sets depend on h.
The process {xi } is strongly mixing (or alpha mixing) if αh → 0 as h → ∞.
Note that independent data has αh = 0 for any h.

289 / 318
Consistency of the sample mean

Now let
n
X
xn = n−1 xi
i=1

be the sample mean.

Theorem 29 (Law of large numbers)


Suppose that {xi } is stationary and mixing, and that µ = E(xi ) exists.
Then
p
xn → µ
as n → ∞.

This is a substantial generalization of our earlier law of large numbers for


random samples.
Note that this also implies that the continuous-mapping theorem generalizes
in the same way.

290 / 318
Central limit theorem

Theorem 30 (Central limit theorem)


Suppose that {xi } is stationary and mixing with mixing coefficient satisfying

∑_{h=1}^∞ αh^{δ/(2+δ)} < +∞,

that E(xi) = µ exists, and that E(‖xi‖^{2+δ}) < ∞ for some δ > 0. Then

√n Σ^{−1/2} (xn − µ) →d N(0, I)

for

Σ = lim_{n→∞} ( Σ0 + ∑_{h=1}^{n−1} ((n − h)/n) (Σh + Σh′) ) = Σ0 + ∑_{h=1}^∞ (Σh + Σh′) = ∑_{h=−∞}^{+∞} Σh < ∞,

where Σh = E((xi − µ)(xi+h − µ)′).

Many special cases of this theorem are available with (complicated) low-level
conditions for specific processes.
291 / 318
The summability of the covariances (i.e., the fact that Σ is finite) follows
from the restriction on the mixing coefficients. A sufficient condition for
summability is that Σh → 0 faster than 1/h → 0.
The variance formula follows from (again for the scalar case)
var( ∑_{i=1}^n xi ) = ∑_{i=1}^n ∑_{j=1}^n cov(xi, xj)
  = ∑_{i=1}^n ( cov(xi, xi) + ∑_{h=1}^{i−1} cov(xi, xi−h) + ∑_{h=1}^{n−i} cov(xi, xi+h) )
  = ∑_{i=1}^n ( Σ0 + ∑_{h=1}^{i−1} Σ−h + ∑_{h=1}^{n−i} Σh )
  = nΣ0 + (n − 1)(Σ1 + Σ−1) + · · · + (Σn−1 + Σ−(n−1))
  = n ( Σ0 + ∑_{h=1}^{n−1} ((n − h)/n) (Σh + Σ−h) ).

We have Σ−h = Σh′.

292 / 318
A (truncated) estimator of the long-run variance Σ can be constructed as
Σ̂ = Σ̂0 + ∑_{h=1}^{κ−1} ((κ − h)/κ) (Σ̂h + Σ̂h′)

for chosen integer κ ≤ n and

Σ̂h = (1/(n − h)) ∑_{i=1}^{n−h} (xi − xn)(xi+h − xn)′.

The resulting estimator is typically referred to as a HAC estimator.


The truncation at κ lags is needed because Σ̂h becomes increasingly noisy as
a function of h (for given n). (Consistency requires that κ → ∞ with n, but
not too fast.)
Σ̂ is also called the Newey-West variance estimator. It can be interpreted
as a ‘kernel’ estimator with a triangular kernel. Importantly, and unlike most
other such ‘kernel’ estimators, it is ensured to be positive semi-definite.
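A minimal sketch of this estimator for a scalar series (my own helper, numpy only):

```python
import numpy as np

def newey_west(x, kappa):
    """Truncated (Bartlett-weighted) long-run variance estimator for a scalar series."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    sigma = xc @ xc / n                              # Sigma_0
    for h in range(1, kappa):
        gamma_h = (xc[:-h] @ xc[h:]) / (n - h)       # Sigma_h
        sigma += 2 * (kappa - h) / kappa * gamma_h   # (kappa - h)/kappa times (Sigma_h + Sigma_h')
    return sigma

print(newey_west(np.random.default_rng(0).normal(size=2000), kappa=10))  # about 1 for i.i.d. data
```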

293 / 318
Autoregression

Suppose that

xi = α + ρ xi−1 + εi , εi ∼ i.i.d. N (0, σ 2 ).

We impose that |ρ| < 1.


We have E(xi ) = α + ρ E(xi−1 ) and so
µ = E(xi) = α/(1 − ρ).

Also, var(xi) = ρ² var(xi−1) + σ² and, hence,

Σ0 = σ²/(1 − ρ²).
The univariate stationary distribution, therefore, is xi ∼ N (µ, Σ0 ). The
covariances are proportional to Σ0 :

Σh = ρh Σ0 ;

for example,

Σ1 = cov(xi , xi−1 ) = cov(α + ρ xi−1 + εi , xi−1 ) = ρ cov(xi−1 , xi−1 ) = ρΣ0 .


294 / 318
Σh = ρh Σ0 shrinks at a geometric rate. The long-run variance is well-defined
and equals

Σ = Σ0 + 2 ∑_{h=1}^∞ ρ^h Σ0 = ((1 + ρ)/(1 − ρ)) Σ0.

Here, the particularly simple structure of Σ suggests the simple alternative


HAC estimator
((1 + ρ̂)/(1 − ρ̂)) Σ̂0,

where

ρ̂ = ∑_{i=2}^n xi−1 (xi − xn) / ∑_{i=2}^n (xi − xn)²,   Σ̂0 = σ̂²/(1 − ρ̂²),

and

σ̂² = ∑_{i=2}^n ((xi − xn) − ρ̂ (xi−1 − xn))² / (n − 1).

When ρ = 0 {xi } is i.i.d. and Σ = Σ0 but, for example, when ρ = .5 {xi } is


dependent and Σ = 3Σ0 .
Ignoring the non-zero covariances can lead to large underestimation of the
actual variability of the series.
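A simulation sketch of this point (my own code; for simplicity I use the sample variance as the estimate of Σ0):

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho, sigma = 100_000, 0.5, 1.0
x = np.empty(n)
x[0] = rng.normal(scale=sigma / np.sqrt(1 - rho ** 2))   # start from the stationary law
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.normal(scale=sigma)

xc = x - x.mean()
rho_hat = (x[:-1] @ xc[1:]) / (xc[1:] @ xc[1:])
Sigma0_hat = xc @ xc / n
print((1 + rho_hat) / (1 - rho_hat) * Sigma0_hat)        # long-run variance, about 4
print(Sigma0_hat)                                        # Sigma_0 alone, about 4/3
```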

295 / 318
Moving average

Another example has

xi = µ + εi + β εi−1 , εi ∼ i.i.d. N (0, σ 2 ).

Here, E(xi ) = µ is immediate. Further,

Σ0 = (1 + β 2 ) σ 2 and Σ1 = β σ 2 .

However,
Σh = 0, |h| > 1,
so the dependence is short-lived and vanishes abruptly beyond the first-order
autocovariance.
It follows that
Σ = Σ0 + Σ−1 + Σ1 = (1 + β)2 σ 2 .

A combination of both examples gives

xi = α + ρ xi−1 + εi + β εi−1 , εi ∼ i.i.d. N (0, σ 2 ).

Extensions to higher-order are immediate. This gives a parsimonious way to


modelling dependence.

296 / 318
Limit distribution of GMM

In practice the above is important for getting correct standard errors with
serially dependent data.
The two-step GMM estimator solves

min ĝ(θ)0 Ω̂−1


θ̂
ĝ(θ).
θ

where, now, Ω̂θ̂ is a HAC estimator of the long-run covariance matrix of the
moment condition
Xn
ĝ(θ) = n−1 ϕ(xi ; θ).
i=1

The same robust estimator needs to be used when constructing test statistics.
The remainder of the argument for GMM carries over without modification.

297 / 318
Linear model with correlated errors

Consider
yi = x0i θ + εi
with E(εi |x1 , . . . , xn ) = 0. Then, as before,
√n (θ̂ − θ) = ( n⁻¹ ∑_{i=1}^n xi xi′ )⁻¹ ( n^{−1/2} ∑_{i=1}^n xi εi ).

Now,

n^{−1/2} ∑_{i=1}^n xi εi →d N(0, Ω),   Ω = ∑_{h=−∞}^{+∞} E( (εi εi+h)(xi xi+h′) ).

This covariance allows for both heteroskedasticity and autocorrelation in the


errors.
Under homoskedasticity, E(εi εi+h |x1 , . . . , xn ) = E(εi εi+h ), and the formula
simplifies to
Ω = ∑_{h=−∞}^{+∞} E(εi εi+h) E(xi xi+h′).

If, in addition, we also have E(εi εi+h) = 0 for h ≠ 0, then Ω = σ² E(xi xi′).
298 / 318
Autoregression

A simple dynamic model is

yi = ρyi−1 + εi , E(εi |y1 , . . . , yi−1 ) = 0.

Here the regressor (the lagged outcome) is not strictly exogenous but only
weakly exogenous.

Nonetheless,
E(yi−1 εi ) = Eρ (yi−1 (yi − ρyi−1 )) = 0
and so least-squares continues to be consistent and asymptotically normal
(under the usual regularity conditions).

As

yi = ∑_{h=0}^∞ ρ^h εi−h,

weak exogeneity implies that the εi are uncorrelated.

299 / 318
Autoregression with MA errors

An extension would be

yi = ρyi−1 + εi , εi = ηi + θηi−1 ,

where ηi ∼ i.i.d. (0, σ 2 ).

Now, least-squares is not consistent. Indeed,

E(yi−1 εi ) = θ σ 2 ,

so the usual moment condition is no longer valid.

However, lack of higher-order correlation in the error does imply that

E(yi−h εi ) = 0, for all h ≥ 2,

opening the way for an instrumental-variable approach.

The extension to higher-order MA processes is immediate.

Note that this approach does not work for autoregressive errors.

300 / 318
Intertemporal CAPM

A consumer chooses consumption stream {xi } to maximize her expected


(discounted) utility stream

E( ∑_{h=0}^∞ α^h u(xi+h; β) | zi ),

where u is a well-behaved utility function and zi is the information set at


baseline i.

Optimality of the consumption path implies that, for all i, the Euler equation

αi u0 (xi ; β) dxi = αi+1 u0 (xi+1 ; β) r dxi

holds. Here, r is the asset return.

Hence,
E ( α r u0 (xi+1 ; β)/u0 (xi ; β) − 1| zi ) = 0
is a valid conditional moment condition for α, β.

301 / 318
NONPARAMETRIC PROBLEMS: CONDITIONAL-MEAN FUNCTIONS

302 / 318
Reading

Hansen II, Chapters 19 and 20


Li and Racine, Chapters 1 and 2
Horowitz, Appendix

303 / 318
Nonparametric specification

Let yi and xi be i.i.d. univariate random variables.

Suppose that
yi = m(xi ) + εi , E(εi |xi ) = 0.

We want to estimate m without imposing a functional form.

If xi takes values v1 , . . . , vk for finite k < n this is a parametric problem:


Regress yi on k dummy variables di,κ = {xi = vκ }, i.e.,
yi = ∑_{κ=1}^k di,κ βκ + εi;

then βκ = m(vκ). Equivalently, for a fixed value x ∈ {v1, . . . , vk},

m̂(x) = ∑_{i=1}^n ωi yi,   ωi = {xi = x} / ∑_{j=1}^n {xj = x};

this is the slope of a regression of yi on di,x , or sample mean in the subsample


with xi = x.
304 / 318
Estimation based on binning

When xi is continuous the probability that it takes on any given value is


zero.
We could estimate m(x) by the weighted average
m̂h(x) = ∑_{i=1}^n ωi yi,   ωi = {|xi − x| ≤ h} / ∑_j {|xj − x| ≤ h},

where h is some chosen positive number, the bandwidth.


This makes sense if we believe m is smooth, so that m(x) does not change
too fast when x changes little.
A small choice for the bandwidth h defines a small neighborhood and so
decreases bias. But it also increases variance as there will be fewer observations
‘close’ to x.

305 / 318
Kernel functions

Binning yields a non-smooth estimator of m(x) (as a function of x). Which


may not be attractive.
A (second-order) kernel function is any (symmetric) non-negative and bounded
function k for which ∫ k(u) du = 1, ∫ u k(u) du = 0, and ∫ u² k(u) du < ∞.
Note that any probability density function with finite second moments can
be made to satisfy these requirements.
Commonly-used examples are
Uniform: (1/2) {|u| ≤ 1}
Triangular: (1 − |u|) {|u| ≤ 1}
Epanechnikov: (3/4) (1 − u²) {|u| ≤ 1}
Gaussian: (1/√(2π)) exp(−u²/2)

306 / 318
A locally-constant estimator

A kernel estimator of m(x) is


m̂h(x) = ∑_{i=1}^n ωi yi,   ωi = k((xi − x)/h) / ∑_j k((xj − x)/h).

This is the Nadaraya-Watson estimator.


The binning estimator is the special case that uses the uniform kernel.
Smooth choices for the kernel k deliver smooth estimators of m.
In practice the choice of h is far more important to the behavior of m̂h than
is the choice of k.
Note that m̂h (x) solves the weighted least squares problem
min ∑_{i=1}^n ωi (yi − α)²

with respect to α.
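A minimal numpy sketch of the Nadaraya-Watson estimator with a Gaussian kernel (simulated data; names are mine):

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """Locally-constant kernel regression estimate of m(x) on a grid of points."""
    k = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)   # Gaussian kernel weights
    return (k * y).sum(axis=1) / k.sum(axis=1)

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=500)
y = np.sin(2 * x) + 0.3 * rng.normal(size=500)
print(nadaraya_watson(np.linspace(-2, 2, 9), x, y, h=0.2))         # roughly sin(2x) on the grid
```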

307 / 318
The above optimization problem is equivalent to minimizing
n⁻¹ ∑_{i=1}^n k((xi − x)/h) (yi − α)²/h.

As n grows this sample average converges to its expectation, which equals

E( k((xi − x)/h) (yi − α)²/h ) = E( k((xi − x)/h) E((yi − α)² | xi)/h ).

Let f be the density of xi and g(xi) = E((yi − α)² | xi) f(xi). A change of
variable to u = (xi − x)/h and a second-order expansion around u = 0 show
the expectation to equal

∫ k((xi − x)/h) E((yi − α)² | xi) f(xi)/h dxi = ∫ k(u) g(x + hu) du
  = ∫ k(u) ( g(x) + hu g′(x) + (1/2) h² u² g″(u*) ) du,

where u* lies between u and zero.

Using the properties of the kernel function and assuming that |g″|∞ < ∞,

E( k((xi − x)/h) (yi − α)²/h ) = E( (yi − α)² | xi = x ) f(x) + O(h²).

Consistency (as n ↑ ∞) requires that h → 0.
n↑∞
Consistency requires that h → 0. 308 / 318
Can also show that the variance is proportional to (nh)−1 . So, if f is equally
well behaved, and f (x) > 0,
∑_{i=1}^n ωi (yi − α)² = ∑_{i=1}^n k((xi − x)/h) (yi − α)² / ∑_{i=1}^n k((xi − x)/h) →p E( (yi − α)² | xi = x )

if n → ∞ provided h → 0 and nh → ∞.
The solution of this limit problem is, of course,

α = m(x),

justifying the Nadaraya-Watson estimator.


We have implicitly derived the nonparametric kernel density estimator

f̂h(x) = (1/(nh)) ∑_{i=1}^n k((xi − x)/h)

for f (x).
The conditions on h relative to n represent the bias/variance trade-off.

309 / 318
As we need h → 0 the estimator m̂(x) will converge at a slower rate than
n−1/2 (the parametric rate).

We have
√(nh) ( m̂h(x) − m(x) − h² b(x) ∫ u² k(u) du ) →d N( 0, (σ²(x)/f(x)) ∫ k(u)² du ),

where we let

b(x) = m″(x)/2 + f′(x) m′(x)/f(x),   σ²(x) = var(yi | xi = x) = E(εi² | xi = x),
be first-order bias and variance, respectively.

The convergence rate can be no faster than n−2/5 , which happens when bias
and standard deviation shrink at the same rate.

With the bias being O(h2 ) and the variance O((nh)−1 ) bias vanishes if we
choose h such that nh5 → 0.

This is called undersmoothing; it makes bias small relative to standard error.

Alternatives are bias correction or the use of higher-order kernels. Needed


to perform valid inference.
310 / 318
A locally-linear estimator

Rather than a (weighted) regression on a constant alone we may equally fit


a linear approximation to m at x.

This amounts to estimating m(x) by the intercept in


min_{α,β} ∑_{i=1}^n ωi (yi − α − (xi − x)β)².

Although it has the same asymptotic behavior, such an estimator tends to


perform better than the standard kernel estimator.

In principle, there is no reason to stop at linearity.

Local polynomial regressions, where we add powers of (xi − x) as regressors,


are common practice.

311 / 318
Bandwidth choice

A bandwidth that is ‘good’ (in an overall sense but not at a point) minimizes

∫ E( (m̂h(x) − m(x))² ) f(x) dx,

the integrated (with respect to f ) mean squared error.


This measure is unknown but can be estimated (up to a constant) by
n⁻¹ ∑_{i=1}^n (yi − m̌h(xi))²,

where m̌h(xi) is the leave-one-out estimator; for Nadaraya-Watson it equals

m̌h(xi) = ∑_{j≠i} k((xj − xi)/h) yj / ∑_{j≠i} k((xj − xi)/h)

for example.
(Least-squares) cross-validation selects h by minh n⁻¹ ∑_{i=1}^n (yi − m̌h(xi))².
This can also be used to select between different estimators (pick the one
with lowest IMSE).
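A sketch of least-squares cross-validation using leave-one-out Nadaraya-Watson fits (my own code; the grid of candidate bandwidths is arbitrary):

```python
import numpy as np

def loo_cv_criterion(x, y, h):
    """Leave-one-out prediction error of the Nadaraya-Watson fit at bandwidth h."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(k, 0.0)                      # leave observation i out
    m_loo = (k * y).sum(axis=1) / k.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

def cross_validate(x, y, bandwidths):
    scores = [loo_cv_criterion(x, y, h) for h in bandwidths]
    return bandwidths[int(np.argmin(scores))]

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=400)
y = np.sin(2 * x) + 0.3 * rng.normal(size=400)
print(cross_validate(x, y, np.linspace(0.05, 1.0, 20)))
```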
312 / 318
Curse of dimensionality

Kernel estimators extend easily to the case where the conditioning variable
is the vector xi = (xi,1 , . . . , xi,κ )0 .

It suffices to redefine k to be multivariate. A simple choice would be a kernel


of the form

k((xi,1 − x1)/h1) × · · · × k((xi,κ − xκ)/hκ);
a product kernel.

The main problem with multivariate regressors is that the variance of the
estimator now becomes inverse-proportional to n(h1 × · · · × hκ ). This implies
that the convergence rate decreases with κ. This is known as the curse of
dimensionality.

There is a middle-ground between nonparametric and parametric that aims


to tackle this issue.

313 / 318
Matching estimators

We wish to infer the average effect of a treatment on an outcome.


If treatment is randomly assigned,

θ = E(yi |di = 1) − E(yi |di = 0)

is the average treatment effect.


Estimate the effect by a least-squares regression on a constant and treatment
indicator.
Now suppose that treatment is only random conditional on a set of control
variables xi .
Then,
θ = ∫ ( E(yi | di = 1, xi = x) − E(yi | di = 0, xi = x) ) f(x) dx.

Let m1 (x) = E(yi |di = 1, xi = x) and m0 (x) = E(yi |di = 0, xi = x). Then
n⁻¹ ∑_{i=1}^n ( m̂1,h(xi) − m̂0,h(xi) )

is a nonparametric matching estimator of θ.


314 / 318
Matching on the propensity score

Potential problem is limited overlap.

Well-known result says that matching on propensity score,

m(x) = P (di = 1|xi = x),

is equivalent.

(Can estimate m nonparametrically and proceed as before but) a convenient


alternative follows from observation that
E(yi di |xi = x) = E(yi |di = 1, xi = x) m(x),
E(yi (1 − di )|xi = x) = E(yi |di = 0, xi = x) (1 − m(x)),

so that we can write


 
θ = E( yi di/m(xi) − yi (1 − di)/(1 − m(xi)) ).

Still important to have the propensity vary over entire (0, 1) for identification.
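A sketch of this inverse-propensity-weighting formula, treating the propensity score as known (in practice it would be estimated, e.g. nonparametrically; the simulated design is mine):

```python
import numpy as np

def ipw_ate(y, d, m):
    """Average treatment effect via inverse propensity weighting."""
    return np.mean(y * d / m - y * (1 - d) / (1 - m))

rng = np.random.default_rng(6)
n = 20_000
x = rng.uniform(size=n)
m = 0.2 + 0.6 * x                                # true propensity score, bounded away from 0 and 1
d = (rng.uniform(size=n) < m).astype(float)
y = 1.0 * d + 2.0 * x + rng.normal(size=n)       # treatment effect equal to 1
print(ipw_ate(y, d, m))                          # close to 1.0
```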

315 / 318
Regression discontinuity

Now suppose treatment is assigned according to



di = 0 if xi < c,   di = 1 if xi ≥ c,

where the (continuous) running variable xi cannot be manipulated and c is


a known cut-off value.

Identifying assumption is that people around the cut-off are comparable. Can
then identify the local treatment effect

θ = lim_{x↓c} E(yi | di = 1, xi = x) − lim_{x↑c} E(yi | di = 0, xi = x)

at the cut-off.

Natural is to fit separate nonparametric regressions to the left and right of


cut-off. Using a rectangular kernel, for example, we use observations in the
regions [c − h, c] and [c, c + h] only.

Locally-linear estimators are preferable here as they have better properties


at the boundary.
316 / 318
Semiparametric binary choice

Consider the binary-choice model

yi = {x0i θ ≥ εi }, εi ∼ i.i.d. F,

where, now F , is unknown.

A semiparametric maximum-likelihood estimator is the maximizer of


∑_{i=1}^n yi log(F̌(xi′θ)) + (1 − yi) log(1 − F̌(xi′θ))

(where we ignore issues of trimming) for


F̌(xi′θ) = ∑_{j≠i} k((xj′θ − xi′θ)/h) yj / ∑_{j≠i} k((xj′θ − xi′θ)/h).

317 / 318
Examples of program evaluation

Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: evidence from social
security administrative records. American Economic Review 80, 313–336.

Angrist, J. D. and V. Lavy (1999). Using Maimonides’ rule to estimate the effect of class size
on scholastic achievement. Quarterly Journal of Economics 114, 533–575.

Card, D. and A. B. Krueger (1994). Minimum wages and employment: A case study of the
fast-food industry in New Jersey and Pennsylvania. American Economic Review 84, 772–793.

Dehejia, R. H. and S. Wahba (1999). Causal effects in nonexperimental studies: Re-evaluating


the evaluation of training programs. Journal of the American Statistical Association 94, 1053–
1062.

Dehejia, R. H. and S. Wahba (2002). Propensity score-matching methods for nonexperimental
causal studies. Review of Economics and Statistics 84, 151–161.

Dell, M. (2010). The persistent effects of Peru’s mining Mita. Econometrica 78, 1863–1903.

Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an econometric evaluation


estimator: Evidence from evaluating a job training programme. Review of Economic Studies 64,
605–654.

Lalonde, R. J. (1986). Evaluating the econometric evaluations of training programs with exper-
imental data. American Economic Review 76, 604–620.

318 / 318
