R300 Advanced Econometrics Methods Lecture Slides
Oleg I. Kitov
oik22@cam.ac.uk
These will (mostly) not be covered in class and are not examinable.
ESTIMATION IN PARAMETRIC PROBLEMS
2 / 318
Reading
Evaluation of estimators:
Casella and Berger, Chapters 7 and 10
Hansen I, Chapter 6
Asymptotics for the sample mean:
Goldberger, Chapter 9
Hansen I, Chapters 7 and 8
Maximum likelihood:
Davidson and MacKinnon, Chapter 8
Hansen I, Chapter 10
Wooldridge, Chapter 13
Linear regression:
Goldberger, Chapters 14–16
Hansen II, Chapters 2–5
3 / 318
Estimation
x1 , . . . , xn
θn = θn (x1 , . . . , xn );
This inferential aim is different from a descriptive data analysis that gives
means, variances, correlations, regression coefficients, and so on.
4 / 318
The parametric framework
5 / 318
We know the whole probability distribution once we know the parameter θ.
We may calculate Pθ(xi ∈ A) = ∫_A fθ(x) dx for any set A. For example,
We know all raw and centered moments; for example, the mean and variance
and so on.
We know Eθ (ϕ(xi )) for any chosen function ϕ and so also parameters ψ
defined through
Eθ (ϕ(xi ; ψ)) = 0,
(which we call moment conditions). Obvious example is ψ = Eθ (xi ), which
has ϕ(xi ; ψ) = xi − ψ.
For univariate xi the τ th-quantile is qτ = inf q {q : Fθ (q) ≥ τ }, for τ ∈ (0, 1).
It is a solution to the moment condition Eθ ({xi ≤ ψ} − τ ) = 0 and so has
ϕ(xi ; ψ) = {xi ≤ ψ} − τ .
6 / 318
Examples
fθ(x) = Pθ(xi = x) = θ^x e^{−θ}/x!,   θ > 0,
for x ∈ N.
θ is the arrival rate, i.e., the expected number of arrivals per time unit.
A sensible estimator of θ is again the sample mean.
7 / 318
We have data on the number of births per hour over a 24 hour period in
Addenbrooke’s.
Fitting a Poisson model to such data we estimate the number of births per
hour by the sample mean, here 1.875 births/hour.
Given an estimate of θ we can estimate the mass function.
8 / 318
The hospital data also tell us that, of the 44 babies, 18 were boys and 26
were girls.
The maximum likelihood estimator of the probability of giving birth to a boy
is 18/44 = .409.
The estimator is a random variable.
Using arguments to be developed later we can test whether there is a gender
bias at Addenbrooke’s.
The standard error on our estimate is √((18/44) × (26/44)/44) = .074, which
gives us the value
(.409 − .500)/.074 = −1.23
for a test statistic which is (asymptotically) standard normal under the null
of no gender bias.
Using a Neyman-Pearson argument (see later) we cannot reject the absence
of gender bias (at conventional significance levels).
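A minimal numerical sketch of this calculation (assuming Python with numpy and scipy are available; the counts are those reported above):

```python
# Sketch of the gender-bias test: ML estimate, standard error, z-statistic, p-value.
import math
from scipy.stats import norm

boys, n = 18, 44
p_hat = boys / n                              # ML estimate of P(boy) = .409
se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error = .074
z = (p_hat - 0.5) / se                        # test statistic = -1.23
p_value = 2 * (1 - norm.cdf(abs(z)))          # two-sided asymptotic p-value
print(round(p_hat, 3), round(se, 3), round(z, 2), round(p_value, 3))
```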
9 / 318
A continuous example with two parameters is the normal distribution.
The univariate standard-normal density is
φ(x) = (1/√(2π)) e^{−x²/2};
it has mean zero and variance one. The corresponding distribution function is
Φ(x) = ∫_{−∞}^{x} φ(u) du.
Obvious estimators for µ, σ 2 would be the sample mean and sample variance.
10 / 318
Less obvious is when
x∗i ∼ N (µ, σ 2 )
but we observe
xi = x∗i if x∗i ≥ 0, and xi = 0 if x∗i < 0.
This is a censored normal variable.
These turn out not to be very attractive and should not be used.
11 / 318
As a final example, suppose that xi ∼ χ2θ .
The Chi-squared distribution with (integer) θ degrees of freedom has density
fθ(x) = x^{θ/2−1} e^{−x/2} / (2^{θ/2} Γ(θ/2)),
where Γ(θ) = ∫_0^∞ x^{θ−1} e^{−x} dx denotes the Gamma function at θ.
We note without proof that
Eθ (xi ) = θ,
varθ (xi ) = 2θ,
Eθ (xpi ) = 2p Γ(p + θ/2)/Γ(θ/2).
12 / 318
Change of variable
∂ϕ−1 (y)
f (ϕ−1 (a))
∂y y=a
13 / 318
Characteristic function
For multivariate x, ϕ(t) = E(e^{ιt′x}) for a vector t of conformable dimension.
14 / 318
An example is the standard normal case. Here,
f(x) = e^{−x²/2}/√(2π),   ϕ(t) = e^{−t²/2}.
We have, using the definition of the cosine function,
ϕ(t) = (1/√(2π)) ∫_{−∞}^{+∞} e^{ιtx} e^{−x²/2} dx = (2/√(2π)) ∫_0^{+∞} ½ (e^{ιtx} + e^{−ιtx}) e^{−x²/2} dx
     = (2/√(2π)) ∫_0^{+∞} cos(tx) e^{−x²/2} dx.
Next,
ϕ′(t) = (2/√(2π)) ∫_0^{+∞} (∂cos(tx)/∂t) e^{−x²/2} dx
      = −(2/√(2π)) ∫_0^{+∞} sin(tx) x e^{−x²/2} dx
      = (2/√(2π)) [e^{−x²/2} sin(tx)]_0^{+∞} − (2/√(2π)) ∫_0^{+∞} t cos(tx) e^{−x²/2} dx
      = −t ϕ(t).
This implies that ϕ ∝ e^{−t²/2}. But because ∫ f(x) dx = 1 we must have that
ϕ(0) = 1 so that, indeed, ϕ = e^{−t²/2}.
15 / 318
To see that
φ(x) = (1/(2π)) ∫_{−∞}^{+∞} e^{−ιtx} e^{−t²/2} dt,
we can use the same calculations. By the same argument as before,
(1/(2π)) ∫_{−∞}^{+∞} e^{−ιtx} e^{−t²/2} dt = (1/π) ∫_0^{+∞} cos(tx) e^{−t²/2} dt,
and we have already computed the last integral. It equals
(1/√(2π)) e^{−x²/2} = φ(x),
as claimed.
16 / 318
Squared standard-normal variable
17 / 318
Sum of squared independent standard-normal variables
ϕp (t) = (1 − 2ιt)−p/2 .
So, if
zi ∼ N (0, 1),
then zi2 ∼ χ21 has ϕ1 (t) = (1 − 2ιt)−1/2 .
The characteristic function of Σ_{i=1}^n zi² is (by independence) equal to
∏_{i=1}^n ϕ1(t) = ((1 − 2ιt)^{−1/2})^n = (1 − 2ιt)^{−n/2} = ϕn(t).
Hence,
Σ_{i=1}^n zi² ∼ χ²n.
18 / 318
Sum of independent normal variables
So, if
zi ∼ N(0, 1)
are independent, the characteristic function of Σ_{i=1}^n zi is
∏_{i=1}^n ϕ_{0,1}(t) = (e^{−t²/2})^n = e^{−n t²/2} = ϕ_{0,n}(t),
i.e., Σ_{i=1}^n zi ∼ N(0, n).
By the location/scale properties of the normal we then have that
z̄n ∼ N(0, n⁻¹)
and
x̄n = µ + σ z̄n ∼ N(µ, σ²/n).
19 / 318
Motivating best unbiasedness
(Figure: sampling densities of two candidate estimators, centered at θ and at θ∗.)
20 / 318
Best unbiased estimator
21 / 318
Non-existence of bias
The Cauchy distribution with location µ and scale γ has the symmetric
density
1 / (πγ (1 + ((x − µ)/γ)²)).
It has no moments.
For example, with µ = 0 and γ = 1 we have
E(|x|) = lim_{M→∞} 2 ∫_0^M (1/π) x/(1 + x²) dx = lim_{M→∞} log(1 + M²)/π = +∞.
So it is not useful to estimate the location parameter µ via the sample mean.
A sensible estimator would be the sample median, which is well defined in
spite of the non-existence of moments.
22 / 318
Ratio of normals
Take independent scalar normal variates x ∼ N (0, σ12 ) and y ∼ N (0, σ22 ).
Consider the transformation (x, y) → (u, v) = (x/y, y). The Jacobian of the
transformation is v and so the density of u is
fσ1,σ2(u) = ∫_{−∞}^{+∞} (φ(uv/σ1)/σ1) (φ(v/σ2)/σ2) |v| dv.
This is (using that φ(u) = e^{−u²/2}/√(2π) and that φ(u) = φ(−u) for all u)
fσ1,σ2(u) = (1/(πσ1σ2)) ∫_0^{+∞} e^{−½ v²((1/σ2)² + (u/σ1)²)} v dv
          = (1/(πσ1σ2)) ∫_0^{+∞} e^{−½ v²(1 + u²(σ2/σ1)²)/σ2²} v dv
          = (1/(πσ1σ2)) [ −e^{−½ v²(1 + u²(σ2/σ1)²)/σ2²} / ((1 + u²(σ2/σ1)²)/σ2²) ]_0^{+∞}
          = 1 / (π (σ1/σ2) (1 + (u/(σ1/σ2))²)),
23 / 318
Fisher information
Let ∂ log fθ(x)/∂θ be the score.
The score has mean zero:
Eθ(∂ log fθ(xi)/∂θ) = ∫ (∂ log fθ(x)/∂θ) fθ(x) dx = ∫ ∂fθ(x)/∂θ dx = 0
24 / 318
Information inequality
From the proof (to follow) we have that θn attains the bound if and only if
n Iθ (θn − θ) = Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ.
25 / 318
Proof (for the scalar case).
Differentiating the zero-bias condition
Eθ(θn − θ) = ∫···∫ (θn(x1, ..., xn) − θ) ∏_i fθ(xi) dx1 ··· dxn = 0,
gives
∫···∫ { (θn(x1, ..., xn) − θ) ∂[∏_i fθ(xi)]/∂θ − ∏_i fθ(xi) } dx1 ··· dxn = 0.
26 / 318
Proof Annex: Identity used in Step 2.
Above we used the following:
Σ_{i=1}^n ∂ log fθ(xi)/∂θ = Σ_{i=1}^n (1/fθ(xi)) ∂fθ(xi)/∂θ
                          = Σ_{i=1}^n (∏_{j≠i} fθ(xj) / ∏_j fθ(xj)) ∂fθ(xi)/∂θ
                          = (Σ_{i=1}^n ∏_{j≠i} fθ(xj) ∂fθ(xi)/∂θ) / ∏_j fθ(xj) = (∂[∏_i fθ(xi)]/∂θ) / ∏_j fθ(xj).
27 / 318
Cauchy-Schwarz inequality
Theorem 2 (Cauchy-Schwarz)
For scalar random variables xi and yi
28 / 318
Information equality
varθ(∂ log fθ(xi)/∂θ) = Iθ = −Eθ(∂² log fθ(xi)/∂θ∂θ′).
29 / 318
Proof.
Differentiating
∫ (∂ log fθ(x)/∂θ) fθ(x) dx = 0
under the integral sign gives
∫ (∂² log fθ(x)/∂θ∂θ′) fθ(x) dx + ∫ (∂ log fθ(x)/∂θ)(∂fθ(x)/∂θ′) dx = 0.
Because
∂ log fθ(x)/∂θ = (1/fθ(x)) ∂fθ(x)/∂θ,
we have ∂fθ(x)/∂θ = (∂ log fθ(x)/∂θ) fθ(x) and so we obtain
Eθ(∂² log fθ(xi)/∂θ∂θ′) + Eθ((∂ log fθ(xi)/∂θ)(∂ log fθ(xi)/∂θ′)) = 0.
30 / 318
If it exists, the best unbiased estimator is unique
Theorem 4 (Uniqueness)
If θnA and θnB are such that
31 / 318
Proof (for the scalar case).
Define a third estimator θnC through the linear combination θnC = λ θnA + (1 − λ) θnB for λ ∈ (0, 1). It is unbiased and has
varθ(θnC) = λ² varθ(θnA) + (1 − λ)² varθ(θnB) + 2 λ(1 − λ) covθ(θnA, θnB)
          ≤ λ² varθ(θnA) + (1 − λ)² varθ(θnB) + 2 λ(1 − λ) √(varθ(θnA) varθ(θnB))
by Cauchy-Schwarz. Thus, varθ(θnC) ≤ varθ(θnA).
The inequality cannot be strict because θnA and θnB are best unbiased. So
we must have that |corrθ(θnA, θnB)| = 1, which happens iff
θnA = a + b θnB
for constants a, b. Now we have that b = 1 as varθ(θnA) = varθ(θnB) and
a = 0 as Eθ(θnA) = Eθ(θnB).
32 / 318
Bernoulli
Here fθ(x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}, so that
∂ log fθ(x)/∂θ = x/θ − (1 − x)/(1 − θ) = (x − θ)/(θ(1 − θ)),
∂² log fθ(x)/∂θ² = −(θ(1 − θ) + (x − θ)(1 − 2θ))/(θ²(1 − θ)²) = −(x − θ)²/(θ²(1 − θ)²).
Clearly,
Eθ(∂ log fθ(xi)/∂θ) = Eθ(xi − θ)/(θ(1 − θ)) = (Pθ(xi = 1) − θ)/(θ(1 − θ)) = 0.
Further note that, here,
(∂ log fθ(x)/∂θ)² = −∂² log fθ(x)/∂θ²,
and so the same holds on taking expectations. This immediately verifies the
information equality.
33 / 318
Note that
Eθ (x2i ) = Eθ (xi )
when xi ∈ {0, 1}.
So,
varθ (xi ) = Eθ ((xi − θ)2 ) = Eθ (x2i − 2xi θ + θ2 ) = θ(1 − θ)
34 / 318
Sample-mean theorem
Theorem 5 (Sample-mean theorem)
Let xn be the mean of a random sample x1 , . . . , xn from a distribution with
finite mean and variance µ, σ 2 . Then
Proof.
By linearity of the expectations operator in the first step and by random
sampling in the second step,
E(x̄n) = E((1/n) Σ_{i=1}^n xi) = (1/n) Σ_{i=1}^n E(xi) = µ.
Next,
var(x̄n) = var((1/n) Σ_{i=1}^n xi) = var(Σ_{i=1}^n xi)/n² = Σ_{i=1}^n var(xi)/n² = σ²/n.
Here we have
log fθ (x) = x log(θ) − θ + constant.
So,
∂ log fθ(x)/∂θ = x/θ − 1,   ∂² log fθ(x)/∂θ² = −x/θ².
We note the mean/variance equality of a Poisson distribution:
Eθ(xi) = Σ_{x=0}^∞ x e^{−θ}θ^x/x! = Σ_{x=1}^∞ e^{−θ}θ^x/(x − 1)! = Σ_{x=0}^∞ e^{−θ}θ^{x+1}/x! = θ Σ_{x=0}^∞ e^{−θ}θ^x/x! = θ
(because Σ_{x=0}^∞ e^{−θ}θ^x/x! = Σ_{x=0}^∞ fθ(x) = 1), and similarly,
Eθ(xi²) = Σ_{x=0}^∞ x² e^{−θ}θ^x/x! = θ Σ_{x=0}^∞ (x + 1) e^{−θ}θ^x/x! = θ² + θ,
Here,
log fθ(x) = −(1/2)(log σ² + (x − µ)²/σ²) + constant.
So,
∂ log fθ(x)/∂µ = (x − µ)/σ²,
∂ log fθ(x)/∂σ² = −(1/2)(1/σ² − (x − µ)²/σ⁴),
and
∂² log fθ(x)/∂θ∂θ′ = −[ 1/σ²          (x − µ)/σ⁴
                        (x − µ)/σ⁴    −1/(2σ⁴) + (x − µ)²/σ⁶ ],
so that the efficiency bounds for µ and σ² are σ²/n and 2σ⁴/n, respectively.
37 / 318
The sample mean is again best unbiased for µ.
An unbiased estimator of σ² is
(1/(n − 1)) Σ_{i=1}^n (xi − x̄n)².
38 / 318
First start with the obvious estimator of σ 2 that is
σ̂n² = (1/n) Σ_{i=1}^n (xi − x̄n)².
The estimator xn has a variance, σ 2 /n, and covaries with each datapoint xi ,
with covariance σ 2 /n.
39 / 318
An unbiased estimator is therefore
σ̃² = (n/(n − 1)) σ̂² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄n)²;
40 / 318
Sampling distribution of normal variance
First,
(n − 1) σ̃²/σ² = Σ_{i=1}^n (xi − x̄n)²/σ²
             = Σ_{i=1}^n ((xi − µ) − (x̄n − µ))²/σ²
             = Σ_{i=1}^n ((xi − µ)/σ)² − n ((x̄n − µ)/σ)²
             = Σ_{i=1}^n ((xi − µ)/σ)² − ((x̄n − µ)/(σ/√n))².
The right-hand side terms are χ²n and χ²1, respectively. The characteristic
function of a χ²p is (1 − 2ιt)^{−p/2}.
Second, xn and σ̃ 2 are independent by Basu’s theorem.
Third, the characteristic function of the sum of independent variables is the
product of their characteristic functions, so (n − 1) σ̃ 2 /σ 2 has characteristic
function
(1 − 2ιt)−n/2 (1 − 2ιt)1/2 = (1 − 2ιt)−(n−1)/2 ,
so it is χ2n−1 .
41 / 318
Tobit
x∗i ∼ N (µ, σ 2 ).
The data are top-coded at c, i.e.,
xi = x∗i if x∗i < c, and xi = c if x∗i ≥ c.
The density is
fθ(x) = ((1/σ) φ((x − µ)/σ))^{x<c} × (1 − Φ((c − µ)/σ))^{x=c}.
42 / 318
The probabilities of not being top-coded and of being top-coded are
Φ((c − µ)/σ)   and   1 − Φ((c − µ)/σ),
respectively.
Further,
fθ(x | x < c) = (1/σ) φ((x − µ)/σ) / Φ((c − µ)/σ);
so that the deviation of the mean (from µ) of this truncated distribution is
∫_{−∞}^c ((x − µ)/σ) φ((x − µ)/σ) dx / Φ((c − µ)/σ) = −σ² ∫_{−∞}^c ∂(φ((x − µ)/σ)/σ)/∂x dx / Φ((c − µ)/σ) = −σ φ((c − µ)/σ)/Φ((c − µ)/σ).
43 / 318
After some more calculus, ∂² log fθ(x)/∂µ² is found to be
−{x < c} (1/σ²) − {x = c} (1/σ²) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ))) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ)) − (c − µ)/σ).
The information on µ then becomes
(1/σ²) Φ((c − µ)/σ) + (1/σ²) φ((c − µ)/σ) (φ((c − µ)/σ)/(1 − Φ((c − µ)/σ)) − (c − µ)/σ).
44 / 318
Probit
Φ (µ/σ) , 1 − Φ(µ/σ),
respectively.
These probabilities depend on µ, σ only through the ratio θ = µ/σ, implying
a scale indeterminacy; we can only learn θ.
The mass function becomes fθ(x) = Φ(θ)^x (1 − Φ(θ))^{1−x} for x ∈ {0, 1}.
(Could further just focus on success probability p = Φ(θ) but this would not
extend to the model with covariates.)
45 / 318
Then
∂ log fθ(x)/∂θ = (x − Φ(θ)) φ(θ)/(Φ(θ)(1 − Φ(θ))),
which has mean zero and variance
φ(θ)²/(Φ(θ)(1 − Φ(θ))),
so the efficiency bound for θ becomes
(1/n) Φ(θ)(1 − Φ(θ))/φ(θ)².
The maximum-likelihood estimator is
θn = Φ⁻¹(x̄n).
This estimator is not unbiased (an unbiased estimator of θ does not exist
here) but it will hit the efficiency bound in large samples.
46 / 318
Regularity conditions for Cramér-Rao bound
that is,
fθ(x) = {0 ≤ x ≤ θ}/θ.
Nonetheless, a best unbiased estimator exists.
47 / 318
Sufficiency
and so
fθ(x1, ..., xn | Σ_{i=1}^n xi) = (Σ_{i=1}^n xi)! (n − Σ_{i=1}^n xi)! / n!
is free of θ.
48 / 318
Sufficiency of the sample mean for a normal population
49 / 318
It follows that
fθ(x1, ..., xn | x̄n) = fθ(x1, ..., xn)/fθ(x̄n) = (n^{−1/2}/(2πσ²)^{(n−1)/2}) e^{−½ Σ_{i=1}^n (xi − x̄n)²/σ²},
which does not depend on θ.
50 / 318
Improved estimation based on sufficiency
θn = E(θ∗ |γn )
holds.
Proof.
Unbiasedness of θn follows from iterating expectations on θ∗ .
Next, by the law of total variance,
varθ (θ∗ ) = var(E(θ∗ |γn )) + Eθ (var(θ∗ |γn )) = varθ (θn ) + non-negative term,
51 / 318
Rao-Blackwellization for Bernoulli
Note that
Pθ(x1 = 1 | Σ_{i=1}^n xi = x) = Pθ(Σ_{i=1}^n xi = x | x1 = 1) Pθ(x1 = 1) / Pθ(Σ_{i=1}^n xi = x)
                             = Pθ(Σ_{j≠1} xj = x − 1 | x1 = 1) Pθ(x1 = 1) / Pθ(Σ_{i=1}^n xi = x)
                             = Pθ(x1 = 1) Pθ(Σ_{j≠1} xj = x − 1) / Pθ(Σ_{i=1}^n xi = x)
                             = θ · ((n−1)!/((n−x)!(x−1)!)) θ^{x−1} (1 − θ)^{n−x} / ((n!/((n−x)!x!)) θ^x (1 − θ)^{n−x})
                             = ((n − 1)! x!) / (n! (x − 1)!) = x/n.
Thus, θn = n⁻¹ Σ_{i=1}^n xi = x̄n, which has variance θ(1 − θ)/n (and is, in fact,
best unbiased).
52 / 318
Completeness
53 / 318
Completeness in the normal problem
A complete statistic here is x̄n = n⁻¹ Σ_{i=1}^n xi.
We look for a function ϕ such that Eθ(ϕ(x̄n)) = 0 for all θ.
We have
Eθ(ϕ(x̄n)) = ∫_{−∞}^{+∞} ϕ(x) (1/(σ/√n)) φ((x − θ)/(σ/√n)) dx
           = (1/√(2πσ²/n)) e^{−(n/2)(θ/σ)²} ∫_{−∞}^{+∞} ϕ(x) e^{−(n/2)(x/σ)²} e^{n(θ/σ²)x} dx
           = (1/√(2πσ²/n)) e^{−(n/2)(θ/σ)²} L[ϕ(x) e^{−(n/2)(x/σ)²}],
54 / 318
Completeness in the Bernoulli problem
Remember that, if Pθ(xi = 1) = θ for θ ∈ (0, 1), then γn = Σ_{i=1}^n xi is
Binomial with parameters (n, θ).
So, if
Eθ(ϕ(γn)) = Σ_{γ=0}^n ϕ(γ) (n!/(γ!(n−γ)!)) θ^γ (1 − θ)^{n−γ} = (1 − θ)^n Σ_{γ=0}^n ϕ(γ) (n!/(γ!(n−γ)!)) (θ/(1 − θ))^γ = 0
55 / 318
Best unbiased estimation under sufficiency
Proof.
By the Rao-Blackwell result, under sufficiency, any efficient estimator must
be a function of γn only; so, θn = ϕ(γn ). Then, by assumption, Eθ (θn ) = θ.
It is enough to show that ϕ is unique. Suppose there exists another ψ such
that Eθ (ψ(γn )) = θ. Then, by unbiasedness of both estimators,
Eθ (ϕ(γn ) − ψ(γn )) = 0.
56 / 318
Bernoulli
This confirms the Cramér-Rao result for Bernoulli that xn is best unbiased.
57 / 318
Estimating the maximum of a uniform distribution
Recall
fθ(x) = {0 ≤ x ≤ θ}/θ.
Easy to see that the maximum-likelihood estimator here is maxi (xi ).
This estimator is biased.
For all x ∈ [0, θ],
Pθ(max_i(xi) ≤ x) = Pθ(x1 ≤ x, x2 ≤ x, ..., xn ≤ x) = (x/θ)^n.
Further,
Eθ(max_i(xi)) = ∫_0^θ (1 − (x/θ)^n) dx = (n/(n + 1)) θ.
(The first step holds for any non-negative random variable z ∈ [0, b], say;
integrate by parts to see that
∫_0^b (1 − F(z)) dz = (1 − F(z)) z |_0^b + ∫_0^b z f(z) dz = E(z),
as claimed.)
58 / 318
It follows that
θn = ((n + 1)/n) max_i(xi)
is unbiased.
Remains to show that γn = max_i(xi) is a complete sufficient statistic for θ.
We already know that Pθ(γn ≤ γ) = (γ/θ)^n and so its density is
n γ^{n−1}/θ^n
for γ ∈ [0, θ] (and zero elsewhere).
Hence,
Eθ(ϕ(γn)) = ∫_0^θ ϕ(γ) n γ^{n−1}/θ^n dγ = (n/θ^n) ∫_0^θ ϕ(γ) γ^{n−1} dγ =: (n/θ^n) Q(θ).
59 / 318
To see sufficiency we look at the ratio of the density of the data,
∏_{i=1}^n {xi ≤ θ}/θ = {γn ≤ θ}/θ^n,
to the density of γn.
Working out the first two moments of γn using its density from above gives
θ²/(n(n + 2))
as the variance of the unbiased estimator θn.
Note that this variance shrinks like n⁻², which is faster than the parametric
rate of n⁻¹.
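A small simulation sketch (θ and n are hypothetical values chosen for illustration) checking the unbiasedness and the fast n⁻² variance rate numerically:

```python
# Check that (n+1)/n * max_i(x_i) is unbiased for theta and that its variance
# is close to theta^2 / (n (n+2)).
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 50, 100_000
x = rng.uniform(0, theta, size=(reps, n))
theta_n = (n + 1) / n * x.max(axis=1)             # bias-corrected maximum
print(theta_n.mean())                             # close to theta = 1
print(theta_n.var(), theta**2 / (n * (n + 2)))    # simulated vs. theoretical variance
```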
60 / 318
Efficiency bound for biased estimators
Eθ (θn ) = θ + bn (θ).
61 / 318
Asymptotics (for the univariate sample mean)
While exact small-sample results are few and ad hoc, large-sample analysis
is well established and widely applicable.
This is so because almost all estimators you will ever look at behave, as
n → ∞, like a sample mean:
θn − θ = n⁻¹ Σ_{i=1}^n ϕθ(xi) + op(n^{−1/2})
for some function ϕθ for which Eθ(ϕθ(xi)) = 0 and varθ(ϕθ(xi)) < ∞. We
will see many examples.
will see many examples.
Slide 25 gives the influence function for the best unbiased estimator (when
it exists).
62 / 318
Orders of magnitude (deterministic sequences)
Let h and g be two functions (and h(x) > 0 for large x).
We say that g(x) = O(h(x)) if and only if there exists a positive number b
and a real number x such that
That is,
g(x)
lim sup < ∞;
x→∞ h(x)
h(x) grows at least as fast as g(x).
We say that g(x) = o(h(x)) if and only if for every positive number b there
exists a real number x such that
That is,
g(x)
lim = 0;
x→∞ h(x)
h(x) grows faster than g(x).
63 / 318
Orders of magnitude (random sequences)
i.e., if xn − x = op (1).
p
We write xn → x and call x the probability limit of the sequence {xn }.
65 / 318
Theorem 8 ((weak) law of large numbers)
Suppose that µ = E(xi) exists. For any ε > 0 and δ > 0, there exists an n̄
such that
P(|x̄n − µ| > ε) < δ,   for all n > n̄.
That is, x̄n →p µ as n → ∞. Equivalently, x̄n − µ = op(1).
Proof.
Suppose that σ² exists. Then (by Chebychev's inequality)
P(|x̄n − µ| > ε) = P((x̄n − µ)² > ε²) ≤ E((x̄n − µ)²)/ε² = n⁻¹ σ²/ε².
Taking limits gives the result.
Note that we immediately get the same result for any transformation ϕ(xi)
provided that E(|ϕ(xi)|) < ∞. That is,
n⁻¹ Σ_i ϕ(xi) →p E(ϕ(xi))
as n → ∞.
as n → ∞.
66 / 318
The below plots give deciles of xn as a function of n.
(Figure, three panels: deciles of x̄n against n for normal data (σ² exists), Student data (µ exists but σ² does not), and Cauchy data (µ does not exist).)
67 / 318
Consistency
p
An estimator θn is consistent for an estimand θ if θn → θ.
Eθ ((θn − θ)2 ) = (Eθ (θn − θ))2 + varθ (θn ) = bn (θ)2 + varθ (θn );
so a sufficient condition for consistency is that both bias and variance vanish
as n → ∞.
68 / 318
Uniform convergence
A pointwise convergence result (i.e., for any fixed θ ∈ Θ) follows from above:
with n independent of θ.
We write
sup_{θ∈Θ} | n⁻¹ Σ_i ϕθ(xi) − E(ϕθ(xi)) | →p 0
as n → ∞.
69 / 318
To appreciate the difference between pointwise and uniform convergence take
a simple non-stochastic example:
ϕθ (xi ) = nθe−nθ
sup|nθe−nθ | 9 0
θ∈Θ
as n → ∞.
70 / 318
Continuous-mapping theorem
71 / 318
Convergence in distribution
Let {xn } be a sequence of random variables with distribution {Fn } and let
x∼F
d
We say that xn → x if
Fn (a) → F (a) as n → ∞
d
If xn → x it is stochastically bounded, i.e., xn = Op (1).
72 / 318
The central limit theorem
This means that the sample distribution of the standardized sample mean
approaches the standard-normal distribution.
Observe that this result holds for any distribution, as long as µ, σ 2 exist.
73 / 318
The plots below concern the standardized sample mean of samples of
Bernoulli random variables.
Observe how the histogram approaches the standard-normal density as n
grows.
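A simulation sketch (hypothetical θ and sample sizes) of the same experiment; the standardized mean's moments match those of the standard normal as n grows (histograms would show the same convergence of shape):

```python
# Standardized sample mean of Bernoulli(theta) draws for increasing n.
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 0.3, 50_000
for n in (5, 25, 100, 1000):
    x = rng.binomial(1, theta, size=(reps, n))
    xbar = x.mean(axis=1)
    z = (xbar - theta) / np.sqrt(theta * (1 - theta) / n)   # standardized sample mean
    print(n, round(z.mean(), 3), round(z.std(), 3))         # approaches (0, 1)
```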
74 / 318
Proof.
Let ϕx (t) = E(eιtx ) be the characteristic function of x.
Then
z = (x̄n − µ)/(σ/√n) = Σ_{i=1}^n (1/√n)(xi − µ)/σ = Σ_{i=1}^n zi/√n (say),
has characteristic function
ϕz(t) = E(e^{ιt Σ_i zi/√n}) = ∏_i E(e^{ι(t/√n)zi}) = ϕ_{zi}(t/√n)^n.
Expanding,
ϕ_{zi}(t/√n) = ϕ_{zi}(0) + ϕ′_{zi}(0) t/√n + ϕ″_{zi}(0) t²/(2n) + o(t²/n) = 1 − t²/(2n) + o(t²/n)
as n → ∞, and so
lim_{n→∞} ϕz(t) = lim_{n→∞} (1 − (t²/2)/n)^n = e^{−t²/2} (= ϕ of the standard normal)
by definition of the exponential function.
75 / 318
Slutzky’s theorem
76 / 318
Take xi ∼ N(µ, σ²). Best ‘estimator’ of σ² is n⁻¹ Σ_i (xi − µ)².
As an example of (i),
σ̂² = n⁻¹ Σ_i (xi − x̄n)² = n⁻¹ Σ_i (xi − µ)² − (x̄n − µ)².
As (x̄n − µ) →p 0 and (a − µ)² is continuous in a we have (x̄n − µ)² →p 0.
Hence,
σ̂² = n⁻¹ Σ_i (xi − µ)² + op(1).
In fact,
(x̄n − µ)² = (Op(1/√n))² = Op(1/n) = op(1/√n),
and so
√n(σ̂² − σ²) = n^{−1/2} Σ_i ((xi − µ)² − σ²) + op(1).
Hence, σ̂² and n⁻¹ Σ_i (xi − µ)² are asymptotically equivalent; their limit
distribution is √n(σ̂² − σ²) →d N(0, 2σ⁴). This is the same limit distribution
as that of (the unbiased) σ̃².
77 / 318
As an example of (ii),
(x̄n − µ)/(σ̂/√n) = (σ/σ̂)(x̄n − µ)/(σ/√n) = (1 + op(1))(x̄n − µ)/(σ/√n) = (x̄n − µ)/(σ/√n) + op(1) →d N(0, 1).
More generally, when an estimator satisfies
θn − θ = n⁻¹ Σ_i ϕθ(xi) + op(n^{−1/2}),
where Eθ(ϕθ(xi)) = 0 and varθ(ϕθ(xi)) < ∞, our results immediately yield
that
(a) θn − θ = Op(n^{−1/2}); and
(b) √n(θn − θ) ∼a N(0, varθ(ϕθ(xi))).
We call varθ(ϕθ(xi)) the asymptotic variance.
78 / 318
Mean-value theorem
(Figure: illustration of the mean-value theorem for a function ϕ(x) on an interval [x1, x2] with interior point x∗.)
79 / 318
Asymptotics for smooth transformations
for continuously-differentiable ϕ.
Proof.
A mean-value expansion gives
ϕ(θn) − ϕ(θ) = (∂ϕ(θ∗)/∂θ)(θn − θ).
The continuous-mapping theorem yields
∂ϕ(θ∗)/∂θ →p ∂ϕ(θ)/∂θ.
Slutzky's theorem gives
√n(ϕ(θn) − ϕ(θ)) = (∂ϕ(θ)/∂θ) √n(θn − θ) + op(1) →d N(0, (∂ϕ(θ)/∂θ)² σ²).
80 / 318
The multivariate case
for
Γ = ∂ϕ(θ)/∂θ′
the Jacobian matrix.
81 / 318
A nonsingular matrix A has eigendecomposition
A = V DV −1
The inverse is
A−1 = V D−1 V −1 .
A matrix square root is
A1/2 = V D1/2 V −1 .
So, for example, if √n(θn − θ) →d N(0, Σ) for an m × m nonsingular variance
Σ, then
(i) √n Σ^{−1/2}(θn − θ) →d N(0, I_m); and
(ii) n(θn − θ)′ Σ⁻¹ (θn − θ) →d χ²m.
82 / 318
The multivariate normal distribution
N(µ1 + Σ12 Σ22⁻¹ (x2 − µ2), Σ11 − Σ12 Σ22⁻¹ Σ21).
83 / 318
The bivariate normal distribution
The above is particularly tractable in the bivariate case, where x1 and x2 are
both scalars.
Write
(x1, x2)′ ∼ N((µ1, µ2)′, [ σ1²  ρσ1σ2 ; ρσ1σ2  σ2² ])
for ρ the correlation between x1 and x2.
Here,
x1 | x2 ∼ N(µ1 + ρ(σ1/σ2)(x2 − µ2), (1 − ρ²) σ1²).
Note that
E(x1 | x2) = µ1 + ρ(σ1/σ2)(x2 − µ2) = (µ1 − ρ(σ1/σ2)µ2) + ρ(σ1/σ2) x2
is linear in x2.
Also, var(x1 | x2) is a constant (i.e., not a function of x2).
84 / 318
Best asymptotically unbiased estimation
85 / 318
The likelihood function
The likelihood function, ∏_{i=1}^n fθ(xi), represents the density of the sample
when sampling from fθ.
In the discrete case, it is the probability of observing the actual sample, when
sampling from fθ .
Intuitively attractive. Pretty much what anyone without any prior statistical
knowledge would do.
86 / 318
Maximization program
Let
Ln(θ) = Σ_i log fθ(xi)
87 / 318
Numerical maximization: Newton-Raphson
So, solving
∂Ln(θ)/∂θ = Σ_{i=1}^n (xi − θ)/(θ(1 − θ)) = n (x̄n − θ)/(θ(1 − θ)) = 0
gives θ̂ = x̄n.
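When no closed form is available, the maximization can be done numerically. A minimal Newton-Raphson sketch (the Poisson example and the data are assumptions made for illustration; its closed-form answer, the sample mean, lets us check the iterations):

```python
# Newton-Raphson for a scalar log-likelihood: theta <- theta - score/hessian.
import numpy as np

def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)      # scalar Newton step
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

x = np.array([2, 0, 3, 1, 1, 4, 2])               # hypothetical count data
score = lambda t: np.sum(x / t - 1)               # d/dtheta of the Poisson log-likelihood
hessian = lambda t: np.sum(-x / t**2)             # second derivative
print(newton_raphson(score, hessian, theta0=1.0), x.mean())   # both equal x-bar
```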
89 / 318
Invariance
90 / 318
Jensen’s inequality
A univariate function ϕ is concave if
and convex if
Proof.
Take ϕ concave. Let ψ be the tangent line at E(xi ); i.e., ψ(x) = a + bx for
constants a, b such that ϕ(E(xi )) = ψ(E(xi )).
By concavity ϕ(x) ≤ ψ(x) for any x. Hence, using linearity of the tangent,
91 / 318
Probit
Pθ (xi = 1) = Φ(θ).
92 / 318
Further, as
√n(β̂ − β) →d N(0, β(1 − β))
by the central limit theorem, and
∂Φ⁻¹(β)/∂β = 1/φ(θ),
the delta method gives √n(θ̂ − θ) →d N(0, Φ(θ)(1 − Φ(θ))/φ(θ)²), the efficiency bound from before.
93 / 318
The maximum-likelihood estimator may fail to exist in small samples.
94 / 318
Why does maximizing the likelihood work: Identification
is maximized at θ.
Indeed,
Eθ(Ln(θ∗) − Ln(θ)) = n Eθ(log(fθ∗(xi)/fθ(xi))) ≤ n log Eθ(fθ∗(xi)/fθ(xi)) = 0,
using Jensen's inequality and the fact that Eθ(fθ∗(xi)/fθ(xi)) = ∫ fθ∗(x) dx = 1.
Crudely put, L(θ) is the log-likelihood function we would use if we would
have an infinitely-large sample.
(Point) identification means that, in that case, we would be able to learn θ;
so
θ = arg max L(θ∗ ),
θ∗ ∈Θ
and is unique.
95 / 318
Identification may fail (we will give an example below).
Local identification is
∂²L(θ)/∂θ∂θ′ < 0.
Note that, as
∂²L(θ)/∂θ∂θ′ = n Eθ(∂² log fθ(xi)/∂θ∂θ′) = −n Iθ,
this is equivalent to the information matrix being positive definite and, hence,
of full rank.
96 / 318
Why does maximizing the likelihood work: Argmax theorem
provided fθ is continuous, |log fθ (x)| < b(x) so that E(b(xi )) < ∞, and Θ is
closed and bounded (compact).
97 / 318
A uniform ε-band around L(θ) and the corresponding interval [θmin , θmax ] in
which θ̂ must lie.
98 / 318
Regarding uniform convergence, consider the probit model as an example.
There,
log fθ(y | x) = y log Φ(xθ) + (1 − y) log Φ(−xθ).
We have, by a mean-value expansion, that
log Φ(xθ) = log Φ(0) + xθ φ(xθ∗)/Φ(xθ∗),
and 0 < φ(u)/Φ(u) ≤ c |1 + u| for some finite c (visual inspection will help to see
this). Consequently,
|log Φ(xθ)| ≤ |log Φ(0)| + (φ(xθ∗)/Φ(xθ∗)) |xθ| ≤ |log(2)| + c |1 + xθ∗| |x||θ|
           ≤ |log(2)| + c |x||θ| + c |x|²|θ|².
99 / 318
Asymptotic normality
By definition,
Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_{θ̂} = 0.
100 / 318
Now, by invoking a uniform law of large numbers together with consistency,
n⁻¹ Σ_{i=1}^n ∂² log fθ(xi)/∂θ∂θ′ |_{θ∗} →p Eθ(∂² log fθ(xi)/∂θ∂θ′ |_θ) = −Iθ,
and we have the following result (note we use the information equality here).
101 / 318
Variance estimation
In both cases, a uniform law of large numbers can be used to show consistency.
The square root of the diagonal entries (of the inverse) give (estimated)
standard errors on the maximum-likelihood estimator (after dividing through
√
by n) and so can serve to assess its precision. They will equally serve us in
testing later on.
102 / 318
Labor-force participation
yi = 1 ⇔ u(xi , εi ) ≥ 0;
107 / 318
Often convenient to look at this model in matrix form.
We have a set of n equations with k regressors, as in
which we write as
y = Xβ + ε, ε ∼ N (0, σ 2 I).
108 / 318
The score equation for β is
X′(y − Xβ)/σ² = 0.
It has the unique solution
β̂ = (X′X)⁻¹X′y
109 / 318
Say, xi = (1, di , 1 − di )0 where di is a binary indicator.
Then xi,1 = xi,2 + xi,3 for all i and the rank condition fails.
The model
yi = β1 + di β2 + (1 − di )β3 + εi
is observationally-equivalent to the three-parameter/two-regressor model
yi = (β1 + β3 ) + di (β2 − β3 ) + εi = α1 + di α2 + εi .
We can only learn the reduced-form parameters (α1, α2). The identified set
for β is
{β ∈ ℝ³ : β1 + β3 = α1, β2 − β3 = α2}.
For example, given β3 , we can back out (β1 , β2 ) but, without this knowledge,
we can only say things such as β1 − β2 = α1 − α2 .
110 / 318
As another example of identification failure, suppose that we do not observe
yi in the data but, instead, observe bounds y̲i ≤ ȳi for which we know that
y̲i ≤ yi ≤ ȳi.
We can estimate all β compatible with this moment inequality by the set
[β̲̂, β̄̂], with
β̲̂ = (X′X)⁻¹X′y̲,   β̄̂ = (X′X)⁻¹X′ȳ
in obvious notation.
111 / 318
We will write
ŷ = X β̂, ε̂ = y − X β̂,
for fitted values and residuals, respectively.
We have the decomposition
y = ŷ + ε̂,
where the fitted values and residuals are uncorrelated, i.e., ŷ′ε̂ = 0. Indeed,
the score equation at β̂ equals
X′ε̂/σ² = 0,
so we can say that β̂ gives us that linear combination of the regressors for
which residuals and regressors are exactly uncorrelated.
An implication is that
113 / 318
The vector y is a point in Rn . The column space of the n × k matrix X is
the subspace of linear combinations
and equals
ŷ = X β̂ = X(X 0 X)−1 X 0 y = P X y.
The deviation of y from its projection is
ε̂ = y − ŷ = y − P X y = (I − P X )y = M X y,
(Figure: the projection picture; y is decomposed into ŷ = x1 β̂1 + x2 β̂2 in the column space of X = (x1, x2) and the orthogonal residual ε̂.)
115 / 318
Partition X = (X 1 , X 2 ) so that
y = X 1 β1 + X 2 β2 + ε
and, hence,
(β̂1′, β̂2′)′ = [ X1′X1  X1′X2 ; X2′X1  X2′X2 ]⁻¹ (X1′y, X2′y)′.
Some algebra using formulae for partitioned matrix inversion shows that
β̂1 = (X1′ M_{X2} X1)⁻¹ (X1′ M_{X2} y)
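A small simulation sketch (hypothetical design, not from the slides) verifying this partitioned-regression (Frisch-Waugh-Lovell) formula numerically:

```python
# beta_1 from the full regression equals the coefficient after partialling X_2 out.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X1, X2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 3))
y = X1 @ np.array([1.0, -0.5]) + X2 @ np.array([0.2, 0.2, 0.2]) + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)     # full least-squares fit

P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)        # projector on the column space of X2
M2 = np.eye(n) - P2                               # annihilator M_{X_2}
beta1_fwl = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)
print(beta_full[:2], beta1_fwl)                   # identical up to rounding
```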
116 / 318
The estimator β̂ is (conditionally) unbiased,
β̂ ∼ N (β, σ 2 (X 0 X)−1 ),
117 / 318
The score equation for σ² is
−n/(2σ²) + (y − Xβ)′(y − Xβ)/(2σ⁴) = 0,
which, given β̂, has solution
σ̂² = ε̂′ε̂/n.
We already know that this estimator is biased; an unbiased version would be
σ̃² = ε̂′ε̂/(n − k).
Indeed,
E(ε̂′ε̂/n | X) = E(y′M_X y | X)/n = E(ε′M_X ε | X)/n = σ² tr(M_X)/n = σ² (n − k)/n
because
119 / 318
For the variance estimator,
σ̂² = ε̂′ε̂/n
   = (y − X β̂)′(y − X β̂)/n
   = (X(β − β̂) + ε)′(X(β − β̂) + ε)/n
   = ε′ε/n + (β̂ − β)′(X′X)(β̂ − β)/n + 2(β̂ − β)′X′ε/n
   = ε′ε/n + op(n^{−1/2}),
because ‖β̂ − β‖ = Op(n^{−1/2}), X′X = Op(n) and X′ε = op(n).
Therefore,
√n(σ̂² − σ²) = (ε′ε − E(ε′ε))/√n + op(1) = Σ_{i=1}^n (εi² − E(εi²))/√n + op(1) →d N(0, 2σ⁴)
120 / 318
Production and cost function
121 / 318
Poisson regression
The linear regression model will typically be inappropriate when data are
not continuous.
An example is yi ∈ N, i.e., count data.
Patent-application data fits this framework.
A Poisson regression model has (conditional) mass function
µi^{yi} e^{−µi}/yi!,   µi = e^{xi′β}.
Remember that µi = e^{xi′β} is the conditional mean of the outcome variable.
122 / 318
The log-likelihood (up to a constant) is
Σ_{i=1}^n (yi xi′β − e^{xi′β}).
123 / 318
Patent applications (innovation) and R&D spending
124 / 318
We can test whether the impact of R&D spending on innovation is different
across sectors.
125 / 318
Examples of maximum likelihood
Arellano, M. and C. Meghir (1992). Female labour supply and on-the-job search: An empirical
model estimated using complementary data sets. Review of Economic Studies 59, 537–557.
Blundell, R., P.-A. Chiappori, T. Magnac, and C. Meghir (2007). Collective labour supply:
Heterogeneity and non-participation. Review of Economic Studies 74, 417–445.
Heckman, J. J. (1974). Shadow prices, market wages, and labor supply. Econometrica 42, 679–
694.
Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold
Zurcher. Econometrica 55, 999–1033.
126 / 318
Bayesian estimation
Before seeing the data you have beliefs about θ. Suppose we can summarize
those beliefs into a distribution function π(θ) on Θ, the prior.
Upon seeing the data we can evaluate the distribution of the sample,
∏_{i=1}^n fθ(xi), at any θ ∈ Θ.
When confronted with the data we may alter our beliefs about θ. Bayes rule
gives the posterior as
π(θ | x1, ..., xn) = ∏_{i=1}^n fθ(xi) π(θ) / ∫_Θ ∏_{i=1}^n fu(xi) π(u) du.
Take
xi ∼ N(θ, σ²)
(with σ² known for simplicity), so that
fθ(x1, ..., xn) = ∏_{i=1}^n (1/σ) φ((xi − θ)/σ) = (2πσ²)^{−n/2} e^{−½ (Σ_{i=1}^n (xi − x̄n)²/σ² + n(θ − x̄n)²/σ²)}.
Suppose that
π(θ) = (1/τ) φ((θ − µ)/τ) = (2πτ²)^{−1/2} e^{−½ (θ − µ)²/τ²}.
Then
θ | (x1, ..., xn) ∼ N( (τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ, (τ² σ²/n)/(τ² + σ²/n) ).
128 / 318
The posterior mean is the point estimator
(τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ,
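A small numerical sketch (hypothetical prior and simulated data) of these posterior formulas:

```python
# Normal-normal model: posterior mean is a precision-weighted average of x-bar and mu.
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2, mu = 1.0, 0.5, 0.0                  # known variance, prior N(mu, tau2)
theta_true, n = 0.8, 40
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

xbar = x.mean()
w = tau2 / (tau2 + sigma2 / n)                    # weight on the data
post_mean = w * xbar + (1 - w) * mu
post_var = (tau2 * sigma2 / n) / (tau2 + sigma2 / n)
print(xbar, post_mean, post_var)                  # the posterior mean shrinks x-bar toward mu
```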
129 / 318
Bernstein-von Mises theorem
130 / 318
James-Stein estimation
(τ²/(τ² + σ²/n)) x̄n + ((σ²/n)/(τ² + σ²/n)) µ = (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x̄n + (((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) µ.
With µ = 0 this is
(1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x̄n = (1 − (σ²/n)/(τ² + σ²/n)) x̄n.
The term in brackets lies in (0, 1). So, this estimator is downward biased.
The bias is introduced by the shrinkage of x̄n towards the prior mean of zero.
Applied to a vector of means x, the same shrinkage factor gives
(1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x.
131 / 318
The James-Stein estimator (assuming that σ 2 is known and that m ≥ 2) is
(1 − σ²(m − 2)/‖x‖²) x.
While this estimator is biased, we have
E(‖x − 0‖²) > E(‖(1 − σ²(m − 2)/‖x‖²) x − 0‖²)
as soon as m > 2.
So, in terms of estimation risk (as measured by expected squared loss), the
James-Stein estimator dominates the Frequentist sample mean estimator x.
The key is that shrinkage reduces variance. Indeed, taking the infeasible
estimator for simplicity,
var((1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) x) = (1 − ((σ²/τ²)/n)/(1 + (σ²/τ²)/n)) τ² I_m = τ² I_m − (σ²/n) I_m + o(n⁻¹).
132 / 318
TESTING IN PARAMETRIC PROBLEMS
133 / 318
Reading
General discussion:
Casella and Berger, Chapter 8
Hansen I, Chapter 13 and 14
134 / 318
Simple hypothesis and likelihood ratio
135 / 318
(Figure: a likelihood curve ℓn(θ), with the values ℓn(θ0) and ℓn(θ1) marked at θ0 and θ1.)
136 / 318
A decision rule based on the likelihood ratio is to
reject the null in favor of the alternative when
ℓn(θ0)/ℓn(θ1) < c,
and accept the null when
ℓn(θ0)/ℓn(θ1) ≥ c,
for a chosen value c.
We might wrongfully reject the null. This is called a type-I error.
The significance level or size of the test is
Pθ0(ℓn(θ0)/ℓn(θ1) < c).
We might wrongfully accept the null. This is called a type-II error.
The power of the test is
Pθ1(ℓn(θ0)/ℓn(θ1) < c).
137 / 318
Normal
Suppose that
xi ∼ N (θ, σ 2 )
for known σ 2 .
(From before; see Slide 49) the density of the data is
(2πσ²)^{−n/2} e^{−½ Σ_{i=1}^n (xi − x̄n)²/σ²} e^{−½ n(θ − x̄n)²/σ²}.
The likelihood ratio thus is
e^{−½ n(θ0 − x̄n)²/σ²} / e^{−½ n(θ1 − x̄n)²/σ²} = e^{−½ (n/σ²)((x̄n − θ0)² − (x̄n − θ1)²)} = e^{((θ0 − θ1)/(σ/√n)) ((x̄n − θ0)/(σ/√n) + ½ (θ0 − θ1)/(σ/√n))}.
If θ0 < θ1 the likelihood ratio is no greater than c when
(x̄n − θ0)/(σ/√n) ≥ c∗,
for some c∗.
138 / 318
So a level α test is obtained on choosing c∗ so that
Pθ0(ℓn(θ0)/ℓn(θ1) < c) = Pθ0((x̄n − θ0)/(σ/√n) ≥ c∗) = 1 − Φ(c∗) = α,
which requires that
c∗ = Φ⁻¹(1 − α) ≡ zα,
the (1 − α)th quantile of the standard-normal distribution. These values are
tabulated.
Then the decision rule we obtain is that, if
(x̄n − θ0)/(σ/√n) ≥ zα,
we reject the null in favor of the alternative.
139 / 318
The standard-normal distribution
140 / 318
The power of the test is
Pθ1((x̄n − θ0)/(σ/√n) ≥ zα) = Pθ1((x̄n − θ1)/(σ/√n) + (θ1 − θ0)/(σ/√n) ≥ zα) = 1 − Φ(zα − (θ1 − θ0)/(σ/√n)).
Note that the power increases when θ1 − θ0 grows, when n grows, and when σ shrinks.
Note that if, instead, θ0 > θ1, the decision rule becomes that, if
(x̄n − θ0)/(σ/√n) ≤ −zα,
we reject the null in favor of the alternative.
141 / 318
Now take the reverse situation where
xi ∼ N (µ, θ)
and µ is known.
Wish to test H0 : θ = θ0 against H1 : θ = θ1 .
The likelihood ratio is
(θ1/θ0)^{n/2} e^{−½ ((θ1 − θ0)/(θ1 θ0)) Σ_{i=1}^n (xi − µ)²}.
If θ0 < θ1 this is small when Σ_{i=1}^n (xi − µ)²/θ0 is large (and vice versa). Now, under the null, this statistic is χ²n and so
Pθ0(Σ_{i=1}^n (xi − µ)²/θ0 ≥ χ²_{n,α}) = α,
where χ²_{n,α} is the (1 − α)th quantile of the χ²n distribution. The power is the
probability that a χ²n is greater than χ²_{n,α} (θ0/θ1).
142 / 318
The χ2 -distribution
143 / 318
Exponential
fθ(x) = e^{−x/θ}/θ,   x ≥ 0, θ > 0,
and so the likelihood-ratio statistic for simple null and alternative equals
(θ0⁻ⁿ e^{−n x̄n/θ0}) / (θ1⁻ⁿ e^{−n x̄n/θ1}) = (θ1/θ0)ⁿ e^{−n x̄n (θ1 − θ0)/(θ0 θ1)}.
145 / 318
Composite alternatives and the power function
The data distribution is no longer fully specified under the alternative; there
are many possible alternatives.
A test is uniformly most powerful if it is most powerful against all θ1 ∈ Θ1 .
146 / 318
Unbiased tests
We could consider looking for the uniformly most powerful unbiased test.
147 / 318
Normal (One-sided)
Then,
ℓn(θ0)/ℓn(θ̂1) = e^{−½ n(x̄n − θ0)²/σ²} / e^{−½ n(x̄n − θ̂1)²/σ²}
             = e^{−(n/(2σ²)) [(x̄n − θ0)² − (x̄n − θ0)² {x̄n ≤ θ0} − (x̄n − x̄n)² {x̄n > θ0}]}
and, therefore, the size of our test can be set to α ∈ (0, 1) by setting
c∗ = Φ⁻¹(1 − α) = zα.
149 / 318
We get the decision rule
Reject H0: θ = θ0 in favor of H1: θ > θ0 if
((x̄n − θ0)/(σ/√n)) {(x̄n − θ0)/(σ/√n) > 0} ≥ zα;
Accept H0: θ = θ0 if
((x̄n − θ0)/(σ/√n)) {(x̄n − θ0)/(σ/√n) > 0} < zα;
equivalently, accept H0: θ = θ0 if
(x̄n − θ0)/(σ/√n) < zα.
150 / 318
This conclusion follows from the fact that the decision rule is the same as
for the simple alternative θ = θ1 from above, and that test was the most
powerful for any θ1 > θ0 .
We have
Pθ((x̄n − θ0)/(σ/√n) ≥ zα) = Pθ((x̄n − θ)/(σ/√n) ≥ zα + (θ0 − θ)/(σ/√n)),
so the power function is
1 − Φ(zα + (θ0 − θ)/(σ/√n)).
This test is consistent.
β(θ) is presented graphically below for a setting where θ0 = 0 and σ = 1,
with α = .05.
151 / 318
(Figure: power function β(θ) of the one-sided test; it equals α at θ0, and at an alternative θ1 > θ0 the power is β(θ1) and the type-II error is 1 − β(θ1).)
152 / 318
Normal (Two-sided)
153 / 318
So,
Pθ0(|x̄n − θ0|/(σ/√n) ≥ c∗) = 1 − Pθ0(−c∗ < (x̄n − θ0)/(σ/√n) ≤ c∗),
which is simply 2(1 − Φ(c∗)); setting this equal to α gives c∗ = z_{α/2}.
(Figure: power function β(θ) of the two-sided test; it equals α at θ0 and increases as θ moves away from θ0 in either direction.)
155 / 318
The two-sided test is unbiased and consistent.
Below are the power functions for two sample sizes.
(Figure: power functions β(θ) of the two-sided test for two sample sizes.)
156 / 318
Normal (Two-sided; variance unknown)
Again
xi ∼ N (µ, σ 2 )
but now with both µ, σ 2 unknown.
Consider the hypothesis
H 0 : µ = µ0 , H1 : µ 6= µ0 .
The likelihood is
(2πσ²)^{−n/2} e^{−½ Σ_{i=1}^n (xi − µ)²/σ²}.
The unconstrained maximizers are
µ̂ = x̄n,   σ̂² = n⁻¹ Σ_{i=1}^n (xi − x̄n)²,
157 / 318
The likelihood ratio is simply
(σ̂²/σ̌²)^{n/2} = (1 + (x̄n − µ0)²/σ̂²)^{−n/2}.
This statistic is smaller than some critical value if and only if
((x̄n − µ0)/(σ̂/√n))² = (n/(n − 1)) ((x̄n − µ0)/(σ̃/√n))²
exceeds a corresponding critical value.
158 / 318
The statistic
(x̄n − µ0)/(σ̃/√n) ∼ t_{n−1}
is commonly called the t-statistic.
Exact inference is thus possible on choosing critical values from Student’s t
distribution with n − 1 degrees of freedom.
As n grows, tn−1 approaches the standard normal. So large-sample theory
justifies the use of zα/2 as a critical value.
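A minimal sketch (simulated sample and a hypothetical µ0) of the exact t-test just described:

```python
# Two-sided t-test of H0: mu = mu0 with unknown variance.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
x, mu0 = rng.normal(0.2, 1.0, size=30), 0.0
n = x.size
sigma_tilde = x.std(ddof=1)                       # square root of the unbiased variance
t_stat = (x.mean() - mu0) / (sigma_tilde / np.sqrt(n))
crit = t.ppf(1 - 0.05 / 2, df=n - 1)              # two-sided 5% critical value from t_{n-1}
print(t_stat, crit, abs(t_stat) > crit)
```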
159 / 318
Student’s t distribution
160 / 318
General composite hypothesis
The more general case has both composite null and alternative, as in
H0 : θ ∈ Θ0 , H1 : θ ∈ Θ1 ,
The statistic used above for only the alternative composite is a special case.
Much more common is to work with a likelihood ratio statistic defined as
sup_{θ0∈Θ0} ℓn(θ0) / sup_{θ∈Θ} ℓn(θ);
note that the denominator features the full parameter space. This is often
much easier to work with.
161 / 318
Connection to maximum likelihood
By definition,
sup_{θ∈Θ} ℓn(θ) = ℓn(θ̂),
the likelihood evaluated at the (unconstrained) maximum-likelihood estimator, so the ratio becomes
ℓn(θ̌)/ℓn(θ̂),
where θ̌ is the maximizer of the likelihood over the null set Θ0.
162 / 318
Normal (Composite)
arg max_{θ0∈Θ0} ℓn(θ0) = x̄n {x̄n ≤ 0},   arg max_{θ1∈Θ1} ℓn(θ1) = x̄n {x̄n > 0},
and, also,
arg max_{θ∈Θ} ℓn(θ) = x̄n.
So,
sup_{θ0∈Θ0} ℓn(θ0) / sup_{θ1∈Θ1} ℓn(θ1) = e^{−½ (x̄n/(σ/√n))² sign(x̄n)} = e^{−½ (x̄n/(σ/√n)) |x̄n/(σ/√n)|}
for sign(x) = {x > 0} − {x ≤ 0}, while
163 / 318
The latter likelihood ratio is smaller than c when
((x̄n/(σ/√n)) {x̄n/(σ/√n) > 0}) > c∗
for some c∗. Note that only positive c∗ make sense, otherwise we will never
reject.
For any fixed θ, let
zθ = (x̄n − θ)/(σ/√n).
Then
Pθ( ((x̄n/(σ/√n)) {x̄n/(σ/√n) > 0}) > c∗ ) = Pθ( zθ > c∗ − θ/(σ/√n) ) = 1 − Φ(c∗ − θ/(σ/√n)).
This function is monotone increasing on Θ0 = (−∞, 0]. The size of the test
is
sup_{θ∈Θ0} Pθ( ((x̄n/(σ/√n)) {x̄n/(σ/√n) > 0}) > c∗ ) = 1 − Φ(c∗) = α
164 / 318
The former likelihood ratio is small when either
xn xn
0< √ and c∗ < √
σ/ n σ/ n
or when
xn xn
√ < 0 and c∗ < √
σ/ n σ/ n
165 / 318
Likelihood-ratio test
H0 : r(θ) = 0, H1 : r(θ) 6= 0,
Note that
−2 log `n (θ̌)/`n (θ̂) = 2(Ln (θ̂) − Ln (θ̌)).
166 / 318
Asymptotic distribution under the null
The validity of the test procedure comes from the following theorem.
167 / 318
Proof.
We work under the null. A Taylor expansion gives
Ln(θ̌) − Ln(θ̂) = −(n/2)(θ̌ − θ̂)′ Iθ (θ̌ − θ̂) + op(1).
It can be shown that (under the null)
√n(θ̂ − θ̌) = Iθ⁻¹ R′(R Iθ⁻¹ R′)⁻¹ R Iθ⁻¹ n^{−1/2} Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ + op(1),
where R = R(θ). Plugging this into the expansion gives 2(Ln(θ̂) − Ln(θ̌)) as
(R Iθ⁻¹ n^{−1/2} Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ)′ (R Iθ⁻¹ R′)⁻¹ (R Iθ⁻¹ n^{−1/2} Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ)
168 / 318
Analysis of the constrained estimator
Ln (θ) + λ0 r(θ).
and r(θ̌) = r(θ) + R (θ̌ − θ) + op (1) = R (θ̌ − θ) + op (1) (enforcing the null
r(θ) = 0).
169 / 318
Plugging the expansions into the first-order conditions and re-arranging
yields the system of equations
[ −nIθ  R′ ] [ θ̌ − θ ]   [ −Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ ]
[   R    0 ] [   λ̌   ] = [              0               ].
The inverse of the matrix on the left-hand side equals
[ −n⁻¹Iθ⁻¹ + n⁻¹Iθ⁻¹R′(RIθ⁻¹R′)⁻¹RIθ⁻¹    Iθ⁻¹R′(RIθ⁻¹R′)⁻¹ ]
[ (RIθ⁻¹R′)⁻¹RIθ⁻¹                         n(RIθ⁻¹R′)⁻¹     ].
170 / 318
Then we obtain
√n(θ̌ − θ) = (Iθ⁻¹ − Iθ⁻¹R′(RIθ⁻¹R′)⁻¹RIθ⁻¹) n^{−1/2} Σ_{i=1}^n ∂ log fθ(xi)/∂θ |_θ + op(1),
171 / 318
χ2 -statistic
note that recentering of the score is needed here as θ̌ does not maximize the
unconstrained likelihood problem, in general.
172 / 318
Slutzky’s theorem gives us the following result.
173 / 318
Score statistic
174 / 318
Wald statistic
we may look at a distance of r(θ̂) from zero (the null). Because we have that
175 / 318
The Wald statistic can equally be derived without reference to a constrained
estimation problem.
Because
√ d
n(θ̂ − θ) → N (0, Iθ−1 ) and r(θ̂) = R (θ̂ − θ) + op (1),
and so also
Theorem 21 (Limit distribution of the Wald statistic (cont’d))
Under the null,
d
n r(θ̂)0 (R(θ̂)Iˆθ−1 R(θ̂)0 )−1 r(θ̂) → χ2m ,
as n → ∞.
Here it makes sense to use an unconstrained estimator of the information.
176 / 318
Notes
All test statistics can be used in the same way to perform (asymptotically)
valid inference.
In small samples they can lead to different test conclusions.
The likelihood-ratio statistic is attractive because
The second point is important as it implies that the test conclusion is the
same no matter how the null is formulated.
The score statistic is attractive because it requires estimation only under the
null, which is often easier.
In the likelihood context there is no strong argument in favor of the Wald
statistic. In fact it is not likelihood based. Its power lies in that it can be
applied more generally.
177 / 318
Exponential
fθ(x) = e^{−x/θ}/θ.
Its mean is θ.
We set up several tests for the null H0: θ = θ0 against θ ≠ θ0.
First note that
Ln(θ) = −Σ_{i=1}^n (xi/θ + log θ) = −n x̄n/θ − n log θ.
Hence, the maximum-likelihood estimator is θ̂ = x̄n.
178 / 318
The likelihood-ratio statistic is
−2(Ln(θ0) − Ln(θ̂)) = 2n (x̄n/θ0 − 1 − log(x̄n/θ0)).
The score and Wald statistics are
(x̄n − θ0)²/(θ0²/n)   and   (x̄n − θ0)²/(x̄n²/n),
respectively. The latter is again the usual t-statistic, which should not be
surprising here.
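A numerical sketch (simulated exponential data; θ0 and the sample are assumptions made for illustration) of the three statistics:

```python
# Likelihood-ratio, score, and Wald statistics for H0: theta = theta0 in the exponential model.
import numpy as np

rng = np.random.default_rng(0)
theta0, n = 1.0, 200
x = rng.exponential(scale=1.2, size=n)            # true theta = 1.2
xbar = x.mean()                                   # maximum-likelihood estimator

lr = 2 * n * (xbar / theta0 - 1 - np.log(xbar / theta0))
score = (xbar - theta0) ** 2 / (theta0 ** 2 / n)
wald = (xbar - theta0) ** 2 / (xbar ** 2 / n)
print(lr, score, wald)                            # all asymptotically chi-squared(1) under H0
```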
179 / 318
Classical linear regression
y = Xβ + ε, ε ∼ N (0, σ 2 I).
n (y − Xβ)0 (y − Xβ)
Ln (β, σ 2 ) = − log σ 2 − .
2 2σ 2
180 / 318
We consider a set of m linear restrictions on β. We express the null hypothesis
as
Rβ = r,
where R is an m × k matrix and r is an m-vector.
The unconstrained maximizer is the least-squares estimator and equals
β̂ = (X′X)⁻¹X′y,
as before.
181 / 318
The first-order conditions are
X 0 (y − Xβ) − R0 λ = 0, Rβ − r = 0.
(X 0 X)β = X 0 y − R0 λ
and so
182 / 318
The likelihood ratio statistic is
(SSR_β̌ / SSR_β̂)^{−n/2},
where
SSR_β̂ = ε̂′ε̂ = ε′M_X ε
and, using that y = X β̂ + ε̂ to simplify SSR_β̌ = (y − X β̌)′(y − X β̌), we obtain
(SSR_β̌ − SSR_β̂)/SSR_β̂ = (β̂ − β̌)′(X′X)(β̂ − β̌)/(ε′M_X ε)
                      = (Rβ̂ − r)′(R(X′X)⁻¹R′)⁻¹(Rβ̂ − r)/(ε′M_X ε).
183 / 318
Note that, under the null,
Rβ̂ − r = R(X′X)⁻¹X′ε ∼ N(0, σ² R(X′X)⁻¹R′),
such that
(SSR_β̌ − SSR_β̂)/σ² ∼ χ²m.
We also know that
SSR_β̂/σ² = (n − k) σ̃²/σ² ∼ χ²_{n−k}.
Lastly, both terms are independent because they are functions of β̂ and ε̂,
respectively. These variables are jointly normal and independent, as the
covariance is
E((β̂ − β) ε̂′ | X) = (X′X)⁻¹X′ E(εε′ | X) M_X = σ² (X′X)⁻¹X′M_X = 0
(using that M_X X = 0).
Therefore,
((n − k)/m) (SSR_β̌ − SSR_β̂)/SSR_β̂ ∼ F_{m,n−k},
where F is Snedecor's F distribution.
184 / 318
Snedecor’s F distribution
185 / 318
A particular F test
186 / 318
F versus t
(β̂κ − βκ,0) / √(σ̃² [(X′X)⁻¹]κκ).
187 / 318
Joint hypothesis versus multiple single hypotheses
(1/k) (β̂ − β0)′(X′X)(β̂ − β0)/σ̃².
This is not the mean of the t-statistics for the k individual hypotheses that
βκ = βκ,0 . The individual t-statistics are correlated.
Jointly testing hypotheses gives acceptance regions that are ellipsoids. The
union of acceptance regions of multiple individual tests is a hypercube.
Multiple testing problems need size corrections which, in turn, lead to low
power.
To keep the family-wise error rate below α we need to test each of the k individual
hypotheses at significance level α/k.
188 / 318
p-values
This is the probability of observing a value of the test statistic greater than
ψ if the null holds.
Small p-values suggest the null is likely to be false.
But the p-value is informative in its own right and need not lead to a decision
about the null. This is Fisher’s view.
189 / 318
Inverting test statistics
Accept H0 if ψn (θ0 ) ≤ c
190 / 318
Normal
Suppose xi ∼ N (θ, σ 2 ).
Consider H0 : θ = θ0 and H1 : θ 6= θ0 .
The likelihood ratio decision rule goes in favor of the null if
|x̄n − θ0|/(σ̃/√n) ≤ t_{n−1,α/2}.
191 / 318
Now,
xi ∼ N (µ, θ)
and, say,
H 0 : θ = θ0 , H1 : θ > θ0 .
192 / 318
Bayesian credible sets
For scalar θ we could, for example, take the interval [qα/2 , q1−α/2 ], where qτ
is the τ quantile of the posterior distribution.
193 / 318
Return to the example where xi ∼ N (θ, σ 2 ) (with σ 2 known) and we have
prior information θ ∼ N (µ, τ 2 ).
Here, the posterior was
N m, v 2
194 / 318
The Frequentist framework has x̄n ∼ N(θ, σ²/n) (here θ is fixed). With
m = (1/(1 + δ)) x̄n + (δ/(1 + δ)) µ,   δ = (σ²/n)/τ²,
the Frequentist coverage of the credible interval equals
Φ(√(1 + δ) zα/2 + δ (θ − µ)/(σ/√n)) − Φ(−√(1 + δ) zα/2 + δ (θ − µ)/(σ/√n)).
195 / 318
Stratifying regressions
196 / 318
Common variance is unrealistic and can be relaxed.
(This will lead us to semiparametric problems; considered below.)
197 / 318
Add homogenous impact of experience.
198 / 318
Stratify impact of experience by gender.
199 / 318
Test the equality of the regression lines.
200 / 318
SEMIPARAMETRIC PROBLEMS: (GENERALIZED) METHOD OF
MOMENTS
201 / 318
Reading
Asymptotic theory:
Arellano, Appendix A
Hansen II, Chapter 13
Hayashi, Chapter 7
Wooldridge, Chapter 12
Linear instrumental variables:
Hansen II, Chapter 12
Hayashi, Chapter 3
Wooldridge, Chapters 5 and 8
Optimality in conditional moment problems:
Arellano, Appendix B
202 / 318
Linear model
Recall,
yi = x0i θ + εi .
Before we had imposed εi |xi ∼ N (0, σ 2 ). but suppose that we only require
that
E(εi |xi ) = 0.
203 / 318
Iterating expectations shows that Eθ(xi(yi − xi′θ)) = 0, and the analogy
principle suggests estimating θ by solving the empirical moment
n⁻¹ Σ_{i=1}^n xi (yi − xi′θ) = 0.
204 / 318
But is ordinary least squares still the best estimator of θ?
Aside from
Eθ (xi (yi − x0i θ)) = 0
we equally have that
205 / 318
Semiparametric efficiency
that is, the largest of the Cramér-Rao bounds in the parametric submodels
contained in our semiparametric setting.
In the linear regression model from above this would be the Cramér-Rao
bound under the least-favorable distribution for εi |xi that satisfies mean
independence.
206 / 318
Method of moments
Eθ (ϕ(xi ; θ)) = 0
A unique solution will generally not exist when dim ϕ < dim θ. We say θ is
underidentified.
Suppose, for now, that dim ϕ = dim θ. We call this the just-identified case.
The intuition is the analogy principle and similar to the argmax argument.
207 / 318
Identification
Eθ (ϕ(xi ; θ∗ )) 6= 0
for any θ∗ 6= θ.
This is global identification.
208 / 318
Limit distribution
Let θ̂ satisfy
n⁻¹ Σ_{i=1}^n ϕ(xi; θ̂) = 0.
We can use a similar argument as used for maximum likelihood to derive its
behavior as n → ∞.
Under smoothness conditions an expansion gives
n⁻¹ Σ_{i=1}^n ϕ(xi; θ̂) = n⁻¹ Σ_{i=1}^n ϕ(xi; θ) + n⁻¹ Σ_{i=1}^n ∂ϕ(xi; θ)/∂θ′ |_{θ∗} (θ̂ − θ).
Re-arrangement gives
√n(θ̂ − θ) = −(n⁻¹ Σ_{i=1}^n ∂ϕ(xi; θ)/∂θ′ |_{θ∗})⁻¹ n^{−1/2} Σ_{i=1}^n ϕ(xi; θ).
209 / 318
Under a dominance condition we have
−n⁻¹ Σ_{i=1}^n ∂ϕ(xi; θ)/∂θ′ |_{θ∗} →p −Eθ(∂ϕ(xi; θ)/∂θ′) = −Γθ (say).
exists.
Combined with Slutzky’s theorem we get the following result.
as n → ∞.
210 / 318
Linear model
Our model is
yi = x0i θ + εi , Eθ (xi εi ) = 0.
Here, ϕ(xi ; θ) = xi (yi − x0i θ), which gives the least-squares estimator.
Further,
Ωθ = E(ε2i xi x0i ), Γθ = −E(xi x0i ).
211 / 318
We estimate the asymptotic variance as
(n⁻¹ Σ_{i=1}^n xi xi′)⁻¹ (n⁻¹ Σ_{i=1}^n ε̂i² xi xi′) (n⁻¹ Σ_{i=1}^n xi xi′)⁻¹,
where ε̂i = yi − xi′θ̂ are the residuals from the least-squares regression.
Under homoskedasticity we can use
(n⁻¹ Σ_{i=1}^n ε̂i²) (n⁻¹ Σ_{i=1}^n xi xi′)⁻¹
(could also apply the usual degrees-of-freedom correction to the first term).
Note that least squares is no longer normally distributed for small n because
the errors need no longer be normal.
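A minimal sketch (simulated heteroskedastic data) of this "sandwich" variance estimator and the robust standard errors it implies:

```python
# Heteroskedasticity-robust standard errors for least squares.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))     # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + eps

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ theta_hat                                  # residuals
bread = np.linalg.inv(X.T @ X / n)
meat = (X * (e ** 2)[:, None]).T @ X / n
avar = bread @ meat @ bread                            # robust asymptotic variance
print(np.sqrt(np.diag(avar) / n))                      # robust standard errors
```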
212 / 318
Exponential regression
This equals the score equation for Poisson (see Slides 122–123).
Sometimes called the pseudo Poisson estimator.
However, the maximum-likelihood standard errors do not apply because the
information equality does not hold here:
213 / 318
Pseudo Poisson : gravity equation
214 / 318
215 / 318
Maximum likelihood
216 / 318
Extremum estimators
217 / 318
Rank estimator
218 / 318
Quantile regression
ϱ = med(xi) = F⁻¹(1/2).
We have
ϱ = arg min_ρ E(|xi − ρ|).
Indeed,
E(|xi − ρ|) = ∫ |x − ρ| dF(x) = ∫_{−∞}^ρ (ρ − x) dF(x) + ∫_ρ^{+∞} (x − ρ) dF(x).
219 / 318
An alternative representation of the median follows from
F(ϱ) = 1/2,
as
E({xi ≤ ϱ} − 1/2) = 0,
which is a moment condition.
This suggests as estimator an (approximate) solution to the empirical moment
n⁻¹ Σ_{i=1}^n ({xi ≤ ρ} − 1/2) = 0.
220 / 318
Prediction
E((yi − p(xi ))2 ) = E(((yi − E(yi |xi )) − (p(xi ) − E(yi |xi )))2 )
= E((yi − E(yi |xi ))2 ) + E((p(xi ) − E(yi |xi ))2 )
= E(var(yi |xi )) + E((p(xi ) − E(yi |xi ))2 )
≥ E(var(yi |xi )).
221 / 318
Linear prediction
They solve
E(xi (yi − xi′β)) = 0
(uniquely if E(xi xi′) has full rank) and equal β = E(xi xi′)⁻¹ E(xi yi). By construction we can then always write
yi = xi′β + εi
with E(xi εi) = 0.
Again consider the model
yi = xi′θ + εi,
now without assuming that εi is uncorrelated with xi. Note that
E∗(yi | xi) ≠ xi′θ,
so θ is not a regression coefficient;
E(yi | xi) = xi′θ + E(εi | xi) ≠ xi′θ,
so
∂E(yi | xi)/∂xi ≠ θ.
223 / 318
Omitted variables
Say we have
yi = αi + xi′θ + ηi,   E(ηi | xi, αi) = 0.
224 / 318
Measurement error
Suppose that
yi = wi′θ + εi,   E(εi | wi) = 0,
but (together with yi ) we only observe a noisy version of wi , say
xi = wi + ηi ,
E(xi x0i )−1 E(xi yi ) = θ + E(xi x0i )−1 E(xi εi ) = θ − E(xi x0i )−1 E(ηi ηi0 ) θ.
225 / 318
Simultaneity
di = αd − θd pi + ui
si = αs + θs pi + vi
226 / 318
We do not observe supply and demand for any given price.
Collected data is on quantity traded and transaction price, (qi , pi ).
227 / 318
Data comes from markets in equilibrium.
So, we solve
si = di
for the equilibrium price to get
pi = (αd − αs)/(θd + θs) + (ui − vi)/(θd + θs).
This gives traded quantity as
qi = (αd θs + αs θd)/(θd + θs) + (θs ui + θd vi)/(θd + θs).
(With E(ui vi) = 0) the population regression slope of qi on pi equals
(σu²/(σu² + σv²)) θs − (σv²/(σu² + σv²)) θd,
228 / 318
229 / 318
To see the problem in terms of endogeneity, focus on the estimation of the
demand curve.
Then, collecting equations from above,
di = αd − θd pi + ui,   pi = (αd − αs)/(θd + θs) + (ui − vi)/(θd + θs).
Clearly,
E(pi ui) = E(((ui − vi)/(θd + θs)) ui) = σu²/(θd + θs) ≠ 0,
230 / 318
Linear instrumental-variable problem
yi = x0i θ + εi , E(zi εi ) = 0
An instrument is
231 / 318
It is useful to proceed in matrix notation:
y = Xθ + ε,
and we set to zero the sample covariance of the errors and instruments,
Z′(y − Xθ) = 0.
The solution is
θ̂ = (Z′X)⁻¹(Z′y).
Note that this gives least squares when regressors instrument for themselves.
Resolving simultaneity with instrumental variables
di = αd − θd pi + ui
si = αs + θs pi + π zi + vi,
where E(zi ui) = 0.
zi shifts supply (relevance) but not demand (exclusion).
Solving for the equilibrium gives
di = αd − θd pi + ui,
pi = (αd − αs)/(θd + θs) − (π/(θd + θs)) zi + (ui − vi)/(θd + θs).
Further, by relevance and exclusion, cov(pi, zi) ≠ 0 and cov(di, zi) = −θd cov(pi, zi), and so
−θd = cov(di, zi)/cov(pi, zi).
233 / 318
234 / 318
Measurement error
yi = xi θ + (εi − ηi θ),
and suppose we also observe a second noisy measurement of wi,
zi = wi + ζi.
We have
E(zi xi) = E((wi + ζi)(wi + ηi)) = σw²,   E(zi(εi − ηi θ)) = 0,
Generalized method of moments
When dim ϕ > dim θ we cannot, in general, solve the empirical moment conditions exactly.
The solution is to minimize the quadratic form
ĝ(θ)′ A ĝ(θ).
236 / 318
Reduction in moments
With
Ĝ(θ) = ∂ĝ(θ)/∂θ′ = n⁻¹ Σ_{i=1}^n ∂ϕ(xi; θ)/∂θ′,
the first-order condition to the GMM problem is
Ĝ(θ)′ A ĝ(θ) = 0.
The optimal choice of weight matrix (derived below) is A = Ωθ⁻¹.
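A minimal sketch (simulated overidentified linear IV model, hypothetical parameter values) of the two-step procedure this suggests: a first step with a simple weight matrix, then re-weighting with the inverse of the estimated Ωθ:

```python
# Two-step GMM for an overidentified linear instrumental-variable model.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # 3 instruments, 2 parameters
eta, eps = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
x = Z[:, 1] + 0.5 * Z[:, 2] + eta
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + eps

def gmm(A):
    # minimizer of g(b)'A g(b) with g(b) = Z'(y - Xb)/n, which is linear in b
    ZX, Zy = Z.T @ X / n, Z.T @ y / n
    return np.linalg.solve(ZX.T @ A @ ZX, ZX.T @ A @ Zy)

b1 = gmm(np.linalg.inv(Z.T @ Z / n))                          # first step (2SLS weighting)
u = y - X @ b1
omega = (Z * (u ** 2)[:, None]).T @ Z / n                     # estimate of Omega_theta
b2 = gmm(np.linalg.inv(omega))                                # efficient two-step GMM
print(b1, b2)
```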
237 / 318
Limit distribution
Combine the convergence result Ĝ(θ∗) →p Γθ for any consistent θ∗ with the
expansion
n⁻¹ Σ_{i=1}^n ϕ(xi; θ̂) = n⁻¹ Σ_{i=1}^n ϕ(xi; θ) + n⁻¹ Σ_{i=1}^n ∂ϕ(xi; θ)/∂θ′ |_{θ∗} (θ̂ − θ)
to see that
(θ̂ − θ) = −(Γθ′ A Γθ)⁻¹ Γθ′ A n⁻¹ Σ_i ϕ(xi; θ) + op(n^{−1/2}).
Then, with
n^{−1/2} Σ_i ϕ(xi; θ) →d N(0, Ωθ),
we get the result.
as n → ∞.
238 / 318
Optimal weighting
(Γθ′ Ωθ⁻¹ Γθ)⁻¹.
239 / 318
Proof.
Let
C = (Γθ′ A Γθ)⁻¹ Γθ′ A Ωθ^{1/2},   D = Ωθ^{−1/2} Γθ.
Then
(Γθ′ A Γθ)⁻¹ (Γθ′ A Ωθ A′ Γθ)(Γθ′ A Γθ)⁻¹ − (Γθ′ Ωθ⁻¹ Γθ)⁻¹
can be written as
C C′ − C D (D′D)⁻¹ D′ C′.
But this is
C M_D C′ ≥ 0,   M_D = I_m − D(D′D)⁻¹D′.
To see that the eigenvalues of an orthogonal projector P are all zero or one,
let λ 6= 0 be an eigenvalue of P . Then P x = λx for some x 6= 0. Because P
is idempotent we must also have that P 2 x = P x = λP x = λ2 x. Therefore it
must hold that
λx = λ2 x,
which can only be true if λ ∈ {0, 1}.
240 / 318
χ2 problem
But varθ (xi ) = 2θ so also have the moment condition Eθ ((xi − θ)2 − 2θ) = 0.
Let
xi − θ
ϕ(xi ; θ) = .
(xi − θ)2 − 2θ
Then
1 4
Ωθ = Eθ (ϕ(xi ; θ) ϕ(xi ; θ)0 ) = 2θ .
4 6(θ + 4)
241 / 318
So,
3(θ+4)
1 −1
Ω−1
θ =
2
1 .
θ(3θ + 4) −1 4
The Jacobian of the moment conditions is simply −(1, 2)0 and so we find that
the asymptotic variance equals
θ + 4/3
2θ .
θ+2
If we would just use one of the moments the asymptotic variance would be
242 / 318
Poisson
243 / 318
Two-step GMM
244 / 318
Examples of (nonlinear) method of moments
Avery, R. B., L. P. Hansen, and V. J. Hotz (1983). Multiperiod probit models and orthogonality
condition estimation. International Economic Review 24, 21–35.
Goldberg, P. K. and F. Verboven (2001). The evolution of price dispersion in the European car
market. Review of Economic Studies 68, 811–848.
Pakes, A. (1986). Patents as options: Some estimates of the value of holding European patent
stocks. Econometrica 54, 755–784.
245 / 318
Two-stage least squares
(y − Xθ)0 ZA Z 0 (y − Xθ).
Under homoskedasticity,
246 / 318
To understand 2SLS recall the model
yi = x0i θ + εi , E(zi εi ) = 0
and note that we can always use E ∗ (xi |zi ) = zi0 π to decompose the covariates
as
xi = zi0 π + ηi = x̃i + ηi (say).
By the validity of zi as instrument we have E(zi εi ) = 0 and so we know that
247 / 318
Replacing population projection with sample projection introduces bias.
We have
π̂ = (Z 0 Z)−1 Z 0 X = π + (Z 0 Z)−1 Z 0 η
and so
X̂ = X̃ + P Z η.
The second term correlates with ε and so
E(x̂i εi ) 6= 0,
248 / 318
249 / 318
250 / 318
Precision of instrumental variables
θ̂ − θ ∼a N(0, n⁻¹ σε²/σx²)
(under exogeneity).
The same first-order approximation to the instrumental-variable estimator is
θ̂ − θ ∼a N(0, n⁻¹ σε²/(ρxz² σx²)),
where ρxz is the correlation between xi and zi .
The intuition is that xi is (in terms of relevance/fit) its own best instrument.
The instrument is said to be weak when ρxz is small.
In this case the first-order approximation becomes poor.
251 / 318
Weak instruments
Take the simple univariate problem, where we only have one covariate xi and
m instruments zi (treat these as fixed), and suppose we have homoskedastic
errors.
We can approximate the mean squared error of 2SLS to second order to get
(1/n)(σε²/ση²)/τ  +  ((m/n)(ρ σε/ση)/τ)²  +  o(n⁻²),
where the first term is the variance and the second the squared bias, and where
nτ = π′Z′Zπ/ση² = R̄²/(1 − R̄²)
is the concentration parameter.
R̄² is the (uncentered) population R² of the first-stage regression.
When τ is small most of the variation on xi comes from ηi , and not from zi .
252 / 318
Sampling distribution of two-stage least squares as a function of the value of
the concentration parameter (simulation details omitted).
(Figure: the effect of τ; sampling distributions of 2SLS for τ = 1, 10, 50, 200, 500.)
Many instruments
254 / 318
Sampling distribution of two-stage least squares as a function of the number
of instruments (simulation details omitted).
(Figure: the effect of the number of instruments; sampling distributions of least squares and of 2SLS with m = 5, 10, 25, 75, 150 instruments.)
Control-function interpretation
Let
e = MZX
be the residuals from the least-squares regression of X on Z (i.e., from the
first stage).
Then 2SLS can be written as
Indeed,
M e X = M e (P Z X + M Z X) = (I − P e )P Z X + M e e = P Z X.
256 / 318
This view on 2SLS gives us a way to test the null of exogeneity.
Work through the simple model with
yi = xi θ + εi
xi = zi π + ηi
for θ, γ.
As ei = xi − zi π̂ = ηi − zi (π̂ − π) we can write this as (evaluating at true
parameter values)
X xi
xi
0
0
ui + zi γ (π̂ − π) + εi + zi2 (π̂ − π)2
ηi ηi zi (π̂ − π) 1
i
257 / 318
Because E(zi ηi ) = 0 and E(zi εi ) = 0, and because kπ̂ − πk2 = Op (n−1 ), this
behaves like (as n → ∞ and scaled by n−1 )
−1
X xi −1
X πγ
n ui + n zi ηi ,
ηi 0
i i
258 / 318
The asymptotic variance of the estimator,
Γ−1 −1
θ Ωθ Γθ ,
then equals
ση2
1 −1
σu2 Γθ−1 + γ 2 .
σx − ση2
2 −1 1
259 / 318
260 / 318
261 / 318
262 / 318
Bias correction with many moments
For a fixed weight matrix A, the bias in the GMM objective function is
The bias shrinks with n but grows (typically linearly) with dim ϕ.
263 / 318
Its minimizer is the jackknife instrumental-variable estimator
(Σ_i Σ_{j≠i} xi pij xj′)⁻¹ (Σ_i Σ_{j≠i} xi pij yj) = (Σ_i x̌i xi′)⁻¹ (Σ_i x̌i yi),
where
x̌i = Σ_{j≠i} xj pji = Σ_{j≠i} xj zj′(Z′Z)⁻¹ zi = Π̂_{−i} zi.
xi = Πzi + ηi .
Recall that bias in (feasible) 2SLS arose from the fact that Π̂ is a function of
ηi and ηi correlates with εi (See Slide 248). By construction the leave-one-out
fitted values do not depend on ηi .
264 / 318
Multiplicative models with endogeneity
265 / 318
Additional reading on instrumental variables
Bound, J., D. A. Jaeger, and R. M. Baker (1995). Problems with instrumental variables estimation
when the correlation between the instruments and the endogeneous explanatory variable is weak.
Journal of the American Statistical Association 90, 443–450.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments.
Econometrica 65, 557–586.
Stock, J. H. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In
Andrews, D. W. K. and J. H. Stock (Editors), Identification and Inference for Econometric Models:
Essays in Honor of Thomas Rothenberg, Chapter 5, 80—108 (Cambridge UP, Cambridge, UK).
266 / 318
Likelihood-ratio type test statistic
as n → ∞.
267 / 318
Score type test statistic
$$n\,\hat g(\check{\check\theta})'\,\hat\Omega_{\hat\theta}^{-1}\,\hat G(\check{\check\theta})\,\bigl(\hat G(\check{\check\theta})'\,\hat\Omega_{\hat\theta}^{-1}\,\hat G(\check{\check\theta})\bigr)^{-1}\hat G(\check{\check\theta})'\,\hat\Omega_{\hat\theta}^{-1}\,\hat g(\check{\check\theta}) \;\xrightarrow{d}\; \chi^2_m$$
as n → ∞.
268 / 318
Wald test statistic
$$n\,r(\hat\theta)'\Bigl(R\,\bigl(\hat G(\hat\theta)' A\,\hat G(\hat\theta)\bigr)^{-1}\bigl(\hat G(\hat\theta)' A\,\hat\Omega_{\hat\theta}\,A'\,\hat G(\hat\theta)\bigr)\bigl(\hat G(\hat\theta)' A\,\hat G(\hat\theta)\bigr)^{-1} R'\Bigr)^{-1} r(\hat\theta) \;\xrightarrow{d}\; \chi^2_{\dim r}$$
as n → ∞.
269 / 318
J-statistic
Note that
$$n\,\hat g(\hat{\hat\theta})'\,\hat\Omega_{\hat\theta}^{-1}\,\hat g(\hat{\hat\theta}) \;\xrightarrow{d}\; \chi^2_{\dim\varphi - \dim\theta}$$
if all moments hold.
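As an illustration, here is a minimal sketch of two-step GMM and the J-statistic in a linear instrumental-variable model with moment function ϕ(x; θ) = z(y − xθ); the simulated design, with four instruments and a single parameter, is hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, m = 400, 4                                # dim(phi) = 4, dim(theta) = 1
Z = rng.standard_normal((n, m))
eta = rng.standard_normal(n)
eps = 0.5 * eta + rng.standard_normal(n)
x = Z @ np.full(m, 0.5) + eta
y = 1.0 * x + eps                            # true theta = 1; all moments are valid here

def gmm_theta(W):
    # Linear-IV GMM with weight matrix W: theta = (x'Z W Z'x)^{-1} x'Z W Z'y.
    Zx, Zy = Z.T @ x, Z.T @ y
    return (Zx @ W @ Zy) / (Zx @ W @ Zx)

theta1 = gmm_theta(np.eye(m))                          # step 1: identity weight
u = y - x * theta1
Omega = (Z * u[:, None]).T @ (Z * u[:, None]) / n      # estimate of E(z z' (y - x theta)^2)
theta2 = gmm_theta(np.linalg.inv(Omega))               # step 2: efficient weight

gbar = Z.T @ (y - x * theta2) / n
J = n * gbar @ np.linalg.solve(Omega, gbar)            # chi-squared with dim(phi) - dim(theta) dof
print(theta2, J, 1 - stats.chi2.cdf(J, df=m - 1))
```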
270 / 318
We can test a subset of the moments as well.
Partition the moments as ϕ(x; θ) = (ϕ1(x; θ)′, ϕ2(x; θ)′)′.
Also partition
$$\hat\Omega_\theta = \begin{pmatrix} (\hat\Omega_\theta)_{11} & (\hat\Omega_\theta)_{12} \\ (\hat\Omega_\theta)_{21} & (\hat\Omega_\theta)_{22} \end{pmatrix}.$$
We want to test
$$E_\theta(\varphi_2(x_i;\theta)) = 0$$
assuming that Eθ(ϕ1(xi; θ)) = 0.
We can also compute the estimator using all moment conditions, i.e., the usual
$$\hat{\hat\theta} = \arg\min_\theta\ \hat g(\theta)'\,\hat\Omega_{\hat\theta}^{-1}\,\hat g(\theta).$$
271 / 318
We then have the following simple result.
$$n\,\hat g(\hat{\hat\theta})'\,\hat\Omega_{\hat\theta}^{-1}\,\hat g(\hat{\hat\theta}) \;-\; n\,\hat g_1(\check{\check\theta})'\,(\hat\Omega_{\hat\theta})_{11}^{-1}\,\hat g_1(\check{\check\theta}) \;\xrightarrow{d}\; \chi^2_{\dim\varphi - \dim\varphi_1}$$
as n → ∞.
272 / 318
Testing instrument validity
$$n\,\frac{\hat\varepsilon'\,P_Z\,\hat\varepsilon}{\hat\varepsilon'\hat\varepsilon} = n\,\frac{ESS}{TSS} = n\,R^2.$$
273 / 318
Optimal moment conditions in conditional models
Given a conditional moment condition
$$E_\theta(\varphi(x_i;\theta)\,|\,z_i) = 0 \quad \text{(a.s.)},$$
we look for the unconditional moments (i.e., the instrument functions ψ(zi)) for which the asymptotic variance of the resulting GMM estimator is minimal.
274 / 318
Notice that, now,
$$\Gamma_\theta = E_\theta\!\left(\psi(z_i)\,\frac{\partial\varphi(x_i;\theta)}{\partial\theta'}\right) = -E_\theta\!\left(\Gamma_\theta(z_i)'\,\Omega_\theta(z_i)^{-1}\,\Gamma_\theta(z_i)\right)$$
and
$$\Omega_\theta = -\Gamma_\theta;$$
that is,
$$\sqrt{n}\,(\hat\theta - \theta) \;\xrightarrow{d}\; N\bigl(0,\ \Omega_\theta^{-1}\bigr).$$
275 / 318
Proof.
Let
$$g_i = \psi(z_i)\,\varphi(x_i;\theta), \qquad h_i = \Gamma_\theta'\,A\,\phi(z_i)\,\varphi(x_i;\theta)$$
for an arbitrary alternative weight matrix A and instrument vector φ.
The asymptotic variances of the associated GMM estimators are
$$E_\theta(g_i g_i')^{-1}, \qquad E_\theta(h_i g_i')^{-1}\,E_\theta(h_i h_i')\,E_\theta(g_i h_i')^{-1},$$
respectively.
Rewriting gives
$$E_\theta(h_i g_i')^{-1}\,E_\theta(h_i h_i')\,E_\theta(g_i h_i')^{-1} - E_\theta(g_i g_i')^{-1} = E_\theta(h_i g_i')^{-1}\,E_\theta(v_i v_i')\,E_\theta(g_i h_i')^{-1}$$
for
$$v_i = h_i - \gamma' g_i, \qquad \gamma = E_\theta(g_i g_i')^{-1}\,E_\theta(g_i h_i').$$
Since Eθ(vi vi′) is positive semi-definite, so is the difference in asymptotic variances; hence the moments based on ψ(zi) yield the smallest asymptotic variance.
276 / 318
Linear model
With
$$y_i = x_i'\theta + \varepsilon_i, \qquad E_\theta(\varepsilon_i\,|\,x_i) = 0,$$
we have
$$E_\theta(y_i - x_i'\theta\,|\,x_i) = 0.$$
Here,
$$\Gamma_\theta(x_i) = E_\theta\!\left(\frac{\partial(y_i - x_i'\theta)}{\partial\theta'}\,\Big|\,x_i\right) = -x_i', \qquad \Omega_\theta(x_i) = E_\theta(\varepsilon_i^2\,|\,x_i) = \sigma_i^2 \ \text{(say)}.$$
So,
$$\psi(x_i) = -\Gamma_\theta(x_i)'\,\Omega_\theta(x_i)^{-1} = \frac{x_i}{\sigma_i^2},$$
and the optimal estimator solves the empirical moment condition
$$n^{-1}\sum_{i=1}^n \frac{x_i\,(y_i - x_i'\theta)}{\sigma_i^2} = 0.$$
Observation i gets less weight if σi2 is higher. This is weighted least squares.
277 / 318
If we write
$$V = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2),$$
then the optimal estimator is
$$\hat\theta = (X' V^{-1} X)^{-1}(X' V^{-1} y).$$
Under homoskedasticity, i.e., when σi² = σ² for all i, this reduces to the simple least-squares estimator
$$\hat\theta = (X' X)^{-1}(X' y).$$
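A minimal sketch of the weighted least-squares formula, assuming the skedastic function σ²ᵢ is known; the design below (variance proportional to the square of the regressor) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0.5, 2.0, n)])
sigma2 = X[:, 1] ** 2                                   # assumed known skedastic function
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.standard_normal(n)

w = 1.0 / sigma2                                        # diagonal of V^{-1}
theta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_wls, theta_ols)                             # both consistent; WLS has smaller variance
```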
278 / 318
Exponential model
We have
$$y_i = e^{x_i'\theta}\,\varepsilon_i, \qquad E_\theta(\varepsilon_i\,|\,x_i) = 1,$$
and so
$$E_\theta(y_i - e^{x_i'\theta}\,|\,x_i) = 0.$$
Here,
$$\Gamma_\theta(x_i) = -e^{x_i'\theta}\,x_i', \qquad \Omega_\theta(x_i) = \sigma_i^2 \ \text{(say)}.$$
The optimal empirical moment condition thus is
$$n^{-1}\sum_{i=1}^n \frac{x_i\,e^{x_i'\theta}\,(y_i - e^{x_i'\theta})}{\sigma_i^2} = 0.$$
With Poisson data, for example, σi² = e^{xi′θ} and the estimating equation is
$$n^{-1}\sum_{i=1}^n x_i\,(y_i - e^{x_i'\theta}) = 0.$$
With homoskedastic errors, σi² = σ²(e^{xi′θ})², and we solve
$$n^{-1}\sum_{i=1}^n \frac{x_i\,(y_i - e^{x_i'\theta})}{e^{x_i'\theta}} = 0.$$
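The Poisson estimating equation is easily solved by Newton iterations on the sample moment. A minimal sketch on simulated data (design and starting value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
theta_true = np.array([0.2, 0.5])
y = rng.poisson(np.exp(X @ theta_true))

theta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ theta)
    score = X.T @ (y - mu) / n               # n^{-1} sum_i x_i (y_i - exp(x_i' theta))
    jacobian = -(X * mu[:, None]).T @ X / n  # derivative of the sample moment
    step = np.linalg.solve(jacobian, score)
    theta = theta - step                     # Newton update
    if np.max(np.abs(step)) < 1e-10:
        break
print(theta)                                 # close to theta_true in large samples
```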
279 / 318
Instrumental-variable model
Now if
$$E_\theta(y_i - x_i'\theta\,|\,z_i) = 0,$$
we obtain
$$\Gamma_\theta(z_i) = -E(x_i'\,|\,z_i), \qquad \Omega_\theta(z_i) = E_\theta\bigl((y_i - x_i'\theta)^2\,|\,z_i\bigr) = \sigma_i^2 \ \text{(say)}.$$
So, we solve
$$n^{-1}\sum_{i=1}^n \frac{E(x_i\,|\,z_i)\,(y_i - x_i'\theta)}{\sigma_i^2} = 0.$$
280 / 318
Linear model for panel data
yi = x0i θ + εi ;
281 / 318
Define the (T − 1) × T first-differencing matrix D as
$$D = \begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}.$$
282 / 318
Suppose that ui ∼ (0, σ²IT). Then Dui ∼ (0, σ²DD′).
A calculation gives
$$M = D'(DD')^{-1}D = I_T - \frac{\iota_T\,\iota_T'}{T},$$
where ιT is a T-vector of ones.
The matrix M transforms data into deviations from within-group means. For example, M yi = yi − ȳi.
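A small numerical check, for an arbitrary T, that D′(DD′)⁻¹D is indeed the within-group demeaning matrix:

```python
import numpy as np

T = 5
D = np.zeros((T - 1, T))
for t in range(T - 1):                       # row t of the first-differencing matrix: (..., -1, 1, ...)
    D[t, t], D[t, t + 1] = -1.0, 1.0

M_from_D = D.T @ np.linalg.solve(D @ D.T, D)
M_within = np.eye(T) - np.ones((T, T)) / T
print(np.allclose(M_from_D, M_within))       # True

y_i = np.array([3.0, 5.0, 4.0, 6.0, 7.0])
print(M_within @ y_i, y_i - y_i.mean())      # deviations from the within-group mean
```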
283 / 318
Feedback
The above estimator requires that uit is uncorrelated with xi1 , . . . , xiT . This
rules out dynamics and, more generally, feedback.
284 / 318
An assumption of sequential exogeneity, i.e., E(uit | yi0, . . . , yit−1, αi) = 0 for all t = 2, . . . , T, is enough to obtain a GMM estimator.
285 / 318
DEALING WITH (WEAK) DEPENDENCE
286 / 318
Reading
287 / 318
Stationarity
288 / 318
Dependence and mixing
Dependence between observations h periods apart can be measured by the mixing coefficients
$$\alpha_h = \sup_{A,\,B}\ \bigl|\,P(A \cap B) - P(A)\,P(B)\,\bigr|,$$
where (somewhat crudely stated) the sets A and B cover all events involving x−∞, . . . , xi−1, xi and xi+h, xi+h+1, . . . , x+∞, respectively. Note how these sets depend on h.
The process {xi} is strongly mixing (or alpha mixing) if αh → 0 as h → ∞.
Note that independent data have αh = 0 for all h.
289 / 318
Consistency of the sample mean
Now let
$$\bar x_n = n^{-1}\sum_{i=1}^n x_i.$$
290 / 318
Central limit theorem
Suppose, in addition, that E(xi) = µ exists and that E(‖xi‖^{2+δ}) < ∞ for some δ > 0. Then
$$\sqrt{n}\;\Sigma^{-1/2}(\bar x_n - \mu) \;\xrightarrow{d}\; N(0, I)$$
for
$$\Sigma = \lim_{n\to\infty}\left(\Sigma_0 + \sum_{h=1}^{n-1}\frac{n-h}{n}\,(\Sigma_h + \Sigma_h')\right) = \Sigma_0 + \sum_{h=1}^{\infty}(\Sigma_h + \Sigma_h') = \sum_{h=-\infty}^{+\infty}\Sigma_h < \infty,$$
where Σh = E((xi − µ)(xi+h − µ)′) is the autocovariance at lag h.
Many special cases of this theorem are available with (complicated) low-level
conditions for specific processes.
291 / 318
The summability of the covariances (i.e., the fact that Σ is finite) follows
from the restriction on the mixing coefficients. A sufficient condition for
summability is that Σh → 0 faster than 1/h → 0.
The variance formula follows from (again for the scalar case)
$$\operatorname{var}\!\left(\sum_{i=1}^n x_i\right)
= \sum_{i=1}^n\sum_{j=1}^n \operatorname{cov}(x_i, x_j)
= \sum_{i=1}^n\left(\operatorname{cov}(x_i, x_i) + \sum_{h=1}^{i-1}\operatorname{cov}(x_i, x_{i-h}) + \sum_{h=1}^{n-i}\operatorname{cov}(x_i, x_{i+h})\right)
= \sum_{i=1}^n\left(\Sigma_0 + \sum_{h=1}^{i-1}\Sigma_{-h} + \sum_{h=1}^{n-i}\Sigma_h\right).$$
292 / 318
A (truncated) estimator of the long-run variance Σ can be constructed as
$$\hat\Sigma = \hat\Sigma_0 + \sum_{h=1}^{\kappa-1}\frac{\kappa - h}{\kappa}\,(\hat\Sigma_h + \hat\Sigma_h').$$
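A minimal sketch of this truncated (Bartlett-weighted, Newey–West type) estimator for a scalar series; the bandwidth κ and the AR(1) test process are assumed for illustration only:

```python
import numpy as np

def long_run_variance(x, kappa):
    # Sigma_0 plus Bartlett-weighted sample autocovariances up to lag kappa - 1.
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    sigma = xc @ xc / n
    for h in range(1, kappa):
        sigma += 2.0 * (kappa - h) / kappa * (xc[h:] @ xc[:-h] / n)
    return sigma

rng = np.random.default_rng(5)
rho, n = 0.5, 100_000
eps = rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0]
for i in range(1, n):                        # AR(1): long-run variance is sigma^2 / (1 - rho)^2 = 4
    x[i] = rho * x[i - 1] + eps[i]
print(long_run_variance(x, kappa=50))        # should be close to 4
```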
293 / 318
Autoregression
Suppose that
$$x_i - \mu = \rho\,(x_{i-1} - \mu) + \varepsilon_i, \qquad |\rho| < 1, \qquad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2).$$
Then
$$\Sigma_0 = \frac{\sigma^2}{1 - \rho^2}.$$
The univariate stationary distribution, therefore, is xi ∼ N(µ, Σ0). The covariances are proportional to Σ0:
$$\Sigma_h = \rho^h\,\Sigma_0;$$
for example, Σ1 = ρΣ0.
295 / 318
Moving average
Suppose now that
$$x_i - \mu = \varepsilon_i + \beta\,\varepsilon_{i-1}, \qquad \varepsilon_i \sim \text{white noise } (0, \sigma^2).$$
Then
$$\Sigma_0 = (1 + \beta^2)\,\sigma^2 \quad\text{and}\quad \Sigma_1 = \beta\,\sigma^2.$$
However,
$$\Sigma_h = 0, \qquad |h| > 1,$$
so the dependence is short-lived and vanishes abruptly beyond the first-order autocovariance.
It follows that
$$\Sigma = \Sigma_0 + \Sigma_{-1} + \Sigma_1 = (1 + \beta)^2\,\sigma^2.$$
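A quick numerical check of these MA(1) formulas (β and σ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma, n = 0.4, 1.5, 500_000
eps = sigma * rng.standard_normal(n + 1)
x = eps[1:] + beta * eps[:-1]                      # MA(1)

xc = x - x.mean()
cov0 = xc @ xc / n
cov1 = xc[1:] @ xc[:-1] / n
print(cov0, (1 + beta**2) * sigma**2)              # Sigma_0
print(cov1, beta * sigma**2)                       # Sigma_1
print(cov0 + 2 * cov1, (1 + beta)**2 * sigma**2)   # long-run variance
```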
296 / 318
Limit distribution of GMM
In practice the above is important for getting correct standard errors with
serially dependent data.
The two-step GMM estimator solves
$$\hat{\hat\theta} = \arg\min_\theta\ \hat g(\theta)'\,\hat\Omega_{\hat\theta}^{-1}\,\hat g(\theta),$$
where, now, Ω̂θ̂ is a HAC estimator of the long-run covariance matrix of the moment condition
$$\hat g(\theta) = n^{-1}\sum_{i=1}^n \varphi(x_i;\theta).$$
The same robust estimator needs to be used when constructing test statistics.
The remainder of the argument for GMM carries over without modification.
297 / 318
Linear model with correlated errors
Consider
$$y_i = x_i'\theta + \varepsilon_i$$
with E(εi | x1, . . . , xn) = 0. Then, as before,
$$\sqrt{n}\,(\hat\theta - \theta) = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\,\varepsilon_i\right).$$
Now,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\,\varepsilon_i \;\xrightarrow{d}\; N(0, \Omega), \qquad \Omega = \sum_{h=-\infty}^{+\infty} E\bigl((\varepsilon_i\,\varepsilon_{i+h})\,(x_i\,x_{i+h}')\bigr).$$
If, in addition, we also have E(εi εi+h) = 0 for h ≠ 0, then Ω = σ²E(xi xi′).
298 / 318
Autoregression
Here the regressor (the lagged outcome) is not strictly exogenous but only
weakly exogenous.
Nonetheless,
E(yi−1 εi ) = Eρ (yi−1 (yi − ρyi−1 )) = 0
and so least-squares continues to be consistent and asymptotically normal
(under the usual regularity conditions).
As
$$y_i = \sum_{h=0}^{\infty}\rho^h\,\varepsilon_{i-h},$$
299 / 318
Autoregression with MA errors
An extension would be
$$y_i = \rho\,y_{i-1} + \varepsilon_i, \qquad \varepsilon_i = \eta_i + \theta\,\eta_{i-1},$$
in which case
$$E(y_{i-1}\,\varepsilon_i) = \theta\,\sigma^2,$$
so least squares is inconsistent; instrumenting yi−1 with the further lag yi−2, which is uncorrelated with εi, restores consistency.
Note that this approach does not work for autoregressive errors.
300 / 318
Intertemporal CAPM
The agent maximizes the discounted expected utility
$$E\Bigl(\,\sum_{h=0}^{\infty}\alpha^h\,u(x_{i+h};\beta)\,\Big|\,z_i\Bigr).$$
Optimality of the consumption path implies that, for all i, the Euler equation
$$E\bigl(\alpha\,r\,u'(x_{i+1};\beta)\,\big|\,z_i\bigr) = u'(x_i;\beta)$$
holds. Hence,
$$E\Bigl(\alpha\,r\,\frac{u'(x_{i+1};\beta)}{u'(x_i;\beta)} - 1\,\Big|\,z_i\Bigr) = 0$$
is a valid conditional moment condition for α, β.
301 / 318
NONPARAMETRIC PROBLEMS: CONDITIONAL-MEAN FUNCTIONS
302 / 318
Reading
303 / 318
Nonparametric specification
Suppose that
yi = m(xi ) + εi , E(εi |xi ) = 0.
305 / 318
Kernel functions
306 / 318
A locally-constant estimator
The estimator m̂h(x) minimizes the kernel-weighted least-squares criterion
$$\sum_{i=1}^n \omega_i\,(y_i - \alpha)^2, \qquad \omega_i = \frac{k\!\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^n k\!\left(\frac{x_j - x}{h}\right)},$$
with respect to α.
307 / 318
The above optimization problem is equivalent to minimizing
$$\frac{1}{n}\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)\frac{(y_i - \alpha)^2}{h}.$$
Each term in this sum satisfies
$$E\!\left(k\!\left(\frac{x_i - x}{h}\right)\frac{(y_i - \alpha)^2}{h}\right) = E\bigl((y_i - \alpha)^2\,\big|\,x_i = x\bigr)\,f(x) + O(h^2).$$
Consistency (as n ↑ ∞) requires that h → 0.
308 / 318
Can also show that the variance is proportional to (nh)−1. So, if f is equally well behaved, and f(x) > 0,
$$\sum_{i=1}^n \omega_i\,(y_i - \alpha)^2 = \frac{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)(y_i - \alpha)^2}{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)} \;\xrightarrow{p}\; E\bigl((y_i - \alpha)^2\,\big|\,x_i = x\bigr)$$
as n → ∞, provided h → 0 and nh → ∞.
The solution of this limit problem is, of course,
$$\alpha = m(x);$$
the denominator of the weights, scaled by (nh)−1, is the corresponding kernel estimator for f(x).
The conditions on h relative to n represent the bias/variance trade-off.
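A minimal sketch of the locally-constant (Nadaraya–Watson) estimator with a Gaussian kernel; the bandwidth and the simulated regression function are fixed by hand here:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    # Locally-constant fit: kernel-weighted average of y around x0.
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return (w @ y) / w.sum()

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

grid = np.linspace(-1.5, 1.5, 7)
m_hat = np.array([nadaraya_watson(x0, x, y, h=0.2) for x0 in grid])
print(np.c_[grid, m_hat, np.sin(grid)])      # estimate versus the true m(x) = sin(x)
```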
309 / 318
As we need h → 0 the estimator m̂(x) will converge at a slower rate than
n−1/2 (the parametric rate).
We have
$$\sqrt{nh}\left(\hat m_h(x) - m(x) - h^2\,b(x)\int u^2\,k(u)\,du\right) \;\xrightarrow{d}\; N\!\left(0,\ \frac{\sigma^2(x)}{f(x)}\int k(u)^2\,du\right),$$
where we let
$$b(x) = \frac{m''(x)}{2} + \frac{f'(x)}{f(x)}\,m'(x), \qquad \sigma^2(x) = \operatorname{var}(y_i\,|\,x_i = x) = E(\varepsilon_i^2\,|\,x_i = x),$$
be first-order bias and variance, respectively.
The convergence rate can be no faster than n−2/5 , which happens when bias
and standard deviation shrink at the same rate.
With the bias being O(h²) and the variance O((nh)−1), the standardized bias vanishes if we choose h such that nh⁵ → 0.
311 / 318
Bandwidth choice
A bandwidth that is ‘good’ (in an overall sense but not at a point) minimizes, for example, the integrated mean squared error
$$\int E\bigl[(\hat m_h(x) - m(x))^2\bigr]\,f(x)\,dx.$$
(Least-squares) cross-validation selects h by
$$\min_h\ n^{-1}\sum_{i=1}^n\bigl(y_i - \check m_h(x_i)\bigr)^2,$$
where m̌h denotes the leave-one-out version of the estimator.
This can also be used to select between different estimators (pick the one
with lowest IMSE).
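A sketch of least-squares cross-validation for the bandwidth, using the leave-one-out version of the locally-constant estimator; the candidate grid is arbitrary:

```python
import numpy as np

def loo_cv_score(x, y, h):
    # Leave-one-out squared prediction error of the Nadaraya-Watson estimator with bandwidth h.
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(W, 0.0)                 # drop the own observation
    m_loo = (W @ y) / W.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

rng = np.random.default_rng(8)
n = 400
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

bandwidths = [0.05, 0.1, 0.2, 0.4, 0.8]
scores = [loo_cv_score(x, y, h) for h in bandwidths]
print("selected h:", bandwidths[int(np.argmin(scores))])
```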
312 / 318
Curse of dimensionality
Kernel estimators extend easily to the case where the conditioning variable
is the vector xi = (xi,1 , . . . , xi,κ )0 .
The main problem with multivariate regressors is that the variance of the estimator now becomes inversely proportional to n(h1 × · · · × hκ). This implies that the convergence rate decreases with κ. This is known as the curse of dimensionality.
313 / 318
Matching estimators
Let m1 (x) = E(yi |di = 1, xi = x) and m0 (x) = E(yi |di = 0, xi = x). Then
$$n^{-1}\sum_{i=1}^n\bigl(\hat m_{1,h}(x_i) - \hat m_{0,h}(x_i)\bigr)$$
is equivalent.
It is still important for identification that the propensity score varies over the whole of (0, 1).
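A minimal sketch of this regression-adjustment estimator, with m1 and m0 fitted by locally-constant regressions on the treated and control subsamples; the data-generating process (treatment effect 0.5 + x) is hypothetical:

```python
import numpy as np

def kernel_fit(x0, x, y, h):
    # Locally-constant (Nadaraya-Watson) estimate of E(y | x = x0).
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return (w @ y) / w.sum()

rng = np.random.default_rng(9)
n, h = 1000, 0.25
x = rng.uniform(0, 1, n)
p = 0.2 + 0.6 * x                                        # propensity score, strictly inside (0, 1)
d = (rng.uniform(size=n) < p).astype(float)
y = 1.0 + x + d * (0.5 + x) + 0.3 * rng.standard_normal(n)

m1 = np.array([kernel_fit(xi, x[d == 1], y[d == 1], h) for xi in x])
m0 = np.array([kernel_fit(xi, x[d == 0], y[d == 0], h) for xi in x])
print(np.mean(m1 - m0), np.mean(0.5 + x))                # estimate versus the true average effect
```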
315 / 318
Regression discontinuity
The identifying assumption is that people around the cut-off are comparable. We can then identify the local treatment effect at the cut-off.
$$y_i = \mathbf{1}\{x_i'\theta \ge \varepsilon_i\}, \qquad \varepsilon_i \sim \text{i.i.d. } F,$$
317 / 318
Examples of program evaluation
Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: evidence from social
security administrative records. American Economic Review 80, 313–336.
Angrist, J. D. and V. Lavy (1999). Using Maimonides’ rule to estimate the effect of class size
on scholastic achievement. Quarterly Journal of Economics 114, 533–575.
Card, D. and A. B. Krueger (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review 84, 772–793.
Dell, M. (2010). The persistent effects of Peru’s mining Mita. Econometrica 78, 1863–1903.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76, 604–620.
318 / 318