$$\sum_{i} q_i \log\!\left(\frac{q_i}{p_i}\right).$$
This idea is readily generalized to discuss the information content of density functions,
as discussed below.
Let $g$ be the density function of the random variable $z$ and $f$ be another density function. Define the Kullback-Leibler Information Criterion (KLIC) of $g$ relative to $f$ as
$$\mathrm{II}(g:f) = \int_{\mathbb{R}} \log\!\left(\frac{g(\zeta)}{f(\zeta)}\right) g(\zeta)\, d\zeta.$$
When $f$ is used to describe $z$, the value $\mathrm{II}(g:f)$ is the expected surprise resulting from learning that $g$ is in fact the true density of $z$. The following result shows that the KLIC of $g$ relative to $f$ is non-negative.
Theorem 9.1 Let $g$ be the density function of the random variable $z$ and $f$ be another density function. Then $\mathrm{II}(g:f) \geq 0$, with the equality holding if, and only if, $g = f$ almost everywhere (i.e., $g = f$ except on a set with Lebesgue measure zero).
Proof: Using the fact that $\log(1+x) \leq x$ for all $x > -1$, with equality if and only if $x = 0$, we have
$$\log\!\left(\frac{g}{f}\right) = -\log\!\left(1 + \frac{f-g}{g}\right) \geq -\frac{f-g}{g} = 1 - \frac{f}{g}.$$
It follows that
$$\int \log\!\left(\frac{g(\zeta)}{f(\zeta)}\right) g(\zeta)\, d\zeta \;\geq\; \int \left(1 - \frac{f(\zeta)}{g(\zeta)}\right) g(\zeta)\, d\zeta = 0.$$
Clearly, if $g = f$ almost everywhere, $\mathrm{II}(g:f) = 0$. Conversely, given $\mathrm{II}(g:f) = 0$, suppose without loss of generality that $g = f$ except that $g > f$ on a set $B$ that has non-zero Lebesgue measure. Then,
$$\int_{\mathbb{R}} \log\!\left(\frac{g(\zeta)}{f(\zeta)}\right) g(\zeta)\, d\zeta = \int_{B} \log\!\left(\frac{g(\zeta)}{f(\zeta)}\right) g(\zeta)\, d\zeta > \int_{B} (\log 1)\, g(\zeta)\, d\zeta = 0,$$
contradicting $\mathrm{II}(g:f) = 0$. Thus, $g$ must be the same as $f$ almost everywhere. $\Box$
Note, however, that the KLIC is not a metric: it is not symmetric in general, i.e., $\mathrm{II}(g:f) \neq \mathrm{II}(f:g)$, and it does not obey the triangle inequality; see Exercise 9.1. Hence, the KLIC is only a crude measure of the closeness between $f$ and $g$.
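As a rough numerical illustration of these properties, the KLIC between two densities can be approximated by numerical integration. The minimal sketch below (Python with NumPy/SciPy; the normal densities and the integration range are illustrative choices, not part of the development above) shows the non-negativity and the asymmetry of $\mathrm{II}(g:f)$.

```python
# Numerical illustration of II(g:f) = \int log(g/f) g dz for two normal densities.
# The chosen densities are illustrative only.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def klic(g_pdf, f_pdf, lo=-30.0, hi=30.0):
    """Approximate II(g:f) by numerical integration on [lo, hi]."""
    def integrand(z):
        gz, fz = g_pdf(z), f_pdf(z)
        return gz * np.log(gz / fz) if gz > 0 else 0.0
    val, _ = quad(integrand, lo, hi, limit=200)
    return val

g = norm(loc=0.0, scale=1.0).pdf      # "true" density g
f = norm(loc=1.0, scale=2.0).pdf      # postulated density f

print(klic(g, f))   # > 0: expected surprise from using f instead of g
print(klic(f, g))   # differs from II(g:f): the KLIC is not symmetric
print(klic(g, g))   # ~ 0: II(g:f) = 0 iff g = f almost everywhere
```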
Let $\{z_t\}$ be a sequence of random vectors, and write $z^{t} = (z_1, \ldots, z_t)$. The joint density of $z^{T}$ can be factorized as
$$g^{T}(z^{T}) = g^{1}(z_1)\prod_{t=2}^{T} \frac{g^{t}(z^{t})}{g^{t-1}(z^{t-1})} = g(z_1)\prod_{t=2}^{T} g_t(z_t \,|\, z^{t-1}),$$
where $g^{t}$ denotes the joint density of the $t$ random variables $z_1, \ldots, z_t$, and $g_t$ is the density function of $z_t$ conditional on the past information $z_1, \ldots, z_{t-1}$. The joint density function
$g^{T}$ is the random mechanism governing the behavior of $z^{T}$ and will be referred to as the data generation process (DGP) of $z^{T}$.
As $g^{T}$ is unknown, we may postulate a conditional density function $f_t(z_t \,|\, z^{t-1};\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^{k}$, and approximate $g^{T}(z^{T})$ by
$$f^{T}(z^{T};\theta) = f(z_1)\prod_{t=2}^{T} f_t(z_t \,|\, z^{t-1};\theta).$$
The function $f^{T}$ is referred to as the quasi-likelihood function, in the sense that $f^{T}$ need not agree with $g^{T}$. For notational convenience, we treat the unconditional density $g(z_1)$ (or $f(z_1)$) as a conditional density and write $g^{T}$ (or $f^{T}$) as the product of all conditional density functions. Clearly, the postulated density $f^{T}$ would be useful if it is close to the DGP $g^{T}$. It is therefore natural to consider minimizing the KLIC of $g^{T}$ relative to $f^{T}$:
$$\mathrm{II}(g^{T}:f^{T};\theta) = \int_{\mathbb{R}^{T}} \log\!\left(\frac{g^{T}(\zeta^{T})}{f^{T}(\zeta^{T};\theta)}\right) g^{T}(\zeta^{T})\, d\zeta^{T}. \tag{9.1}$$
This amounts to minimizing the surprise level resulting from specifying an $f^{T}$ for the DGP $g^{T}$. As $g^{T}$ does not involve $\theta$, minimizing the KLIC (9.1) with respect to $\theta$ is equivalent to maximizing
$$\int_{\mathbb{R}^{T}} \log\!\big(f^{T}(\zeta^{T};\theta)\big)\, g^{T}(\zeta^{T})\, d\zeta^{T} = \mathrm{IE}\big[\log f^{T}(z^{T};\theta)\big],$$
where $\mathrm{IE}$ is the expectation operator with respect to the DGP $g^{T}(z^{T})$. This is, in turn, equivalent to maximizing the average of $\log f_t$:
$$\bar{L}_T(\theta) = \frac{1}{T}\,\mathrm{IE}\big[\log f^{T}(z^{T};\theta)\big] = \frac{1}{T}\,\mathrm{IE}\!\left[\sum_{t=1}^{T} \log f_t(z_t \,|\, z^{t-1};\theta)\right]. \tag{9.2}$$
Let $\theta^{*}$ be the maximizer of $\bar{L}_T(\theta)$, i.e., the KLIC minimizer. Then $\theta^{*} = \theta_o$, the true parameter, when $\{f_t\}$ is specified correctly in its entirety for $\{z_t\}$.
Maximizing $\bar{L}_T$ is, however, not a readily solvable problem because the objective function (9.2) involves the expectation operator and hence depends on the unknown DGP $g^{T}$. It is therefore natural to consider maximizing the sample counterpart of $\bar{L}_T(\theta)$:
$$L_T(z^{T};\theta) := \frac{1}{T}\sum_{t=1}^{T} \log f_t(z_t \,|\, z^{t-1};\theta), \tag{9.3}$$
which is known as the quasi-log-likelihood function. The maximizer of $L_T(z^{T};\theta)$, denoted $\tilde{\theta}_T$, is known as the quasi-maximum likelihood estimator (QMLE) of $\theta$. The prefix "quasi" is used to indicate that this solution may be obtained from a misspecified log-likelihood function. When $\{f_t\}$ is specified correctly in its entirety for $\{z_t\}$, the QMLE is understood as the MLE, as in standard statistics textbooks.
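In practice the QMLE rarely has a closed form, so one maximizes (9.3) numerically (equivalently, minimizes $-L_T$). A minimal sketch follows (Python with SciPy; the Gaussian location-scale model, the simulated data, and the function names are illustrative assumptions, not taken from the text).

```python
# Sketch: computing a QMLE by numerically maximizing the sample
# quasi-log-likelihood L_T (equivalently, minimizing -L_T).
# The Gaussian location-scale model below is only an illustrative choice.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
z = rng.standard_t(df=5, size=500)           # the DGP need not be normal

def neg_quasi_loglik(theta, z):
    mu, log_sigma = theta                    # parameterize sigma > 0 via its log
    sigma2 = np.exp(2.0 * log_sigma)
    # average log f_t(z_t; theta) under a (possibly misspecified) normal model
    ll = -0.5 * np.log(2 * np.pi * sigma2) - (z - mu) ** 2 / (2 * sigma2)
    return -ll.mean()

res = minimize(neg_quasi_loglik, x0=np.array([0.0, 0.0]), args=(z,), method="BFGS")
theta_qmle = res.x                           # QMLE of (mu, log sigma)
print(theta_qmle, res.fun)
```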
Specifying a complete probability model for $z^{T}$ may be a formidable task in practice because it involves too many random variables ($T$ random vectors $z_t$, each of which may contain many random variables). Instead, econometricians are typically interested in modeling a variable of interest, say $y_t$, conditional on a set of pre-determined variables, say $x_t$, where $x_t$ includes some elements of $w_t$ and $z^{t-1}$. This is a relatively simple job because only the conditional behavior of $y_t$ needs to be considered. As $w_t$ are not explicitly modeled, the conditional density $g_t(y_t \,|\, x_t)$ provides only a partial description of $\{z_t\}$. We then find a quasi-likelihood function $f_t(y_t \,|\, x_t;\theta)$ to approximate $g_t(y_t \,|\, x_t)$. Analogous to (9.1), the resulting average KLIC of $g_t$ relative to $f_t$ is
$$\mathrm{II}_T(\{g_t : f_t\};\theta) := \frac{1}{T}\sum_{t=1}^{T} \mathrm{II}(g_t : f_t;\theta). \tag{9.4}$$
Let $y^{T} = (y_1, \ldots, y_T)$ and $x^{T} = (x_1, \ldots, x_T)$. Minimizing $\mathrm{II}_T(\{g_t : f_t\};\theta)$ in (9.4) is thus equivalent to maximizing
$$\bar{L}_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[\log f_t(y_t \,|\, x_t;\theta)\big]; \tag{9.5}$$
cf. (9.2). The maximizer of (9.5) is again denoted as $\theta^{*}$, and $\theta^{*} = \theta_o$ when $\{f_t\}$ is correctly specified for $\{y_t \,|\, x_t\}$.
As before, $\bar{L}_T(\theta)$ is not directly observable, so we instead maximize its sample counterpart:
$$L_T(y^{T}, x^{T};\theta) := \frac{1}{T}\sum_{t=1}^{T} \log f_t(y_t \,|\, x_t;\theta), \tag{9.6}$$
which will also be referred to as the quasi-log-likelihood function; cf. (9.3). The resulting solution is the QMLE $\tilde{\theta}_T$. When $\{f_t\}$ is correctly specified for $\{y_t \,|\, x_t\}$, the QMLE $\tilde{\theta}_T$ is also understood as the usual MLE.
As a special case, one may concentrate on a particular conditional attribute of $y_t$ and postulate a specification $\mu_t(x_t;\beta)$ for this attribute. A leading example is the following
specification of conditional normality with $\mu_t(x_t;\beta)$ as the specification of its mean:
$$y_t \,|\, x_t \sim N\big(\mu_t(x_t;\beta),\ \sigma^2\big);$$
note that the conditional variance is not explicitly modeled. Setting $\theta = (\beta'\ \sigma^2)'$, it is easy to see that the maximizer of the quasi-log-likelihood function $T^{-1}\sum_{t=1}^{T} \log f_t(y_t \,|\, x_t;\theta)$
is also the solution to
$$\min_{\beta}\ \frac{1}{T}\sum_{t=1}^{T}\big[y_t - \mu_t(x_t;\beta)\big]'\big[y_t - \mu_t(x_t;\beta)\big].$$
The resulting QMLE of $\beta$ is thus the NLS estimator. Therefore, the NLS estimator can be viewed as a QMLE under the assumption of conditional normality with conditional homoskedasticity. We say that $\{\mu_t\}$ is correctly specified for the conditional mean $\mathrm{IE}(y_t \,|\, x_t)$ if there exists a $\beta_o$ such that $\mu_t(x_t;\beta_o) = \mathrm{IE}(y_t \,|\, x_t)$. A more flexible specification, such as
$$y_t \,|\, x_t \sim N\big(\mu_t(x_t;\beta),\ h(x_t;\alpha)\big),$$
would allow us to characterize the conditional variance as well.
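To illustrate the point that the Gaussian QMLE of $\beta$ under conditional homoskedasticity coincides with the NLS estimator, a minimal sketch follows (Python with SciPy; the exponential mean function and the simulated data are assumptions chosen only for illustration).

```python
# Sketch: under y_t | x_t ~ N(mu_t(x_t; b), s2), maximizing the quasi-log-likelihood
# over b gives the same estimate as nonlinear least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=400)
y = np.exp(0.5 * x) + rng.normal(scale=0.3, size=400)   # illustrative DGP

def mu(x, b):                       # assumed mean specification mu_t(x_t; b)
    return np.exp(b * x)

def nls_obj(b, y, x):               # (1/T) sum [y_t - mu_t(x_t; b)]^2
    return np.mean((y - mu(x, b[0])) ** 2)

def neg_qll(theta, y, x):           # negative Gaussian quasi-log-likelihood in (b, log s)
    b, log_s = theta
    s2 = np.exp(2 * log_s)
    r = y - mu(x, b)
    return 0.5 * np.log(2 * np.pi * s2) + np.mean(r ** 2) / (2 * s2)

b_nls = minimize(nls_obj, x0=[0.1], args=(y, x)).x[0]
b_qml = minimize(neg_qll, x0=[0.1, 0.0], args=(y, x)).x[0]
print(b_nls, b_qml)                 # the two estimates of b coincide numerically
```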
9.2 Asymptotic Properties of the QMLE

The quasi-log-likelihood function is, in general, a nonlinear function of $\theta$, so that the QMLE must be computed numerically using a nonlinear optimization algorithm. Given that maximizing $L_T$ is equivalent to minimizing $-L_T$, the algorithms discussed in Section 8.2.2 are readily applied. We shall not repeat these methods here but proceed to the discussion of the asymptotic properties of the QMLE. For our subsequent analysis, we always assume that the specified quasi-log-likelihood function is twice continuously differentiable on the compact parameter space $\Theta$ with probability one and that integration and differentiation can be interchanged. Moreover, we maintain the following identification condition.
[ID-2] There exists a unique $\theta^{*} \in \Theta$ that maximizes $\bar{L}_T(\theta)$.

Under [ID-2] and a suitable weak uniform law of large numbers (WULLN), the QMLE $\tilde{\theta}_T$ is weakly consistent for $\theta^{*}$. The mean-value expansion of $\nabla L_T(z^{T};\tilde{\theta}_T)$ about $\theta^{*}$ is
$$\nabla L_T(z^{T};\tilde{\theta}_T) = \nabla L_T(z^{T};\theta^{*}) + \nabla^2 L_T(z^{T};\theta_T^{\dagger})(\tilde{\theta}_T - \theta^{*}), \tag{9.7}$$
where $\theta_T^{\dagger}$ lies between $\tilde{\theta}_T$ and $\theta^{*}$. The QMLE $\tilde{\theta}_T$ solves the first order condition $\nabla L_T(z^{T};\theta) = 0$. Then, as long as $\nabla^2 L_T(z^{T};\theta_T^{\dagger})$ is invertible with probability one, (9.7) can be written as
$$\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) = -\big[\nabla^2 L_T(z^{T};\theta_T^{\dagger})\big]^{-1}\sqrt{T}\,\nabla L_T(z^{T};\theta^{*}).$$
Note that the invertibility of $\nabla^2 L_T(z^{T};\theta_T^{\dagger})$ amounts to requiring the quasi-log-likelihood function $L_T$ to be locally quadratic at $\theta^{*}$. Let $\mathrm{IE}[\nabla^2 L_T(z^{T};\theta)] = H_T(\theta)$ be the expected Hessian matrix. When $\nabla^2 L_T(z^{T};\theta)$ obeys a WULLN, we have
$$\nabla^2 L_T(z^{T};\theta) - H_T(\theta) \xrightarrow{\ \mathrm{IP}\ } 0,$$
uniformly in $\theta$. As $\tilde{\theta}_T$ is weakly consistent for $\theta^{*}$, so is $\theta_T^{\dagger}$. The assumed WULLN then implies that
$$\nabla^2 L_T(z^{T};\theta_T^{\dagger}) - H_T(\theta^{*}) \xrightarrow{\ \mathrm{IP}\ } 0.$$
It follows that
$$\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) = -H_T(\theta^{*})^{-1}\sqrt{T}\,\nabla L_T(z^{T};\theta^{*}) + o_{\mathrm{IP}}(1). \tag{9.8}$$
This shows that the asymptotic distribution of $\sqrt{T}\,(\tilde{\theta}_T - \theta^{*})$ is essentially determined by the asymptotic distribution of the normalized score $\sqrt{T}\,\nabla L_T(z^{T};\theta^{*})$.
Let $B_T$ denote the variance-covariance matrix of the normalized score $\sqrt{T}\,\nabla L_T(z^{T};\theta)$:
$$B_T(\theta) = \mathrm{var}\big(\sqrt{T}\,\nabla L_T(z^{T};\theta)\big),$$
which will also be referred to as the information matrix. Then, provided that $\nabla\log f_t(z_t\,|\,z^{t-1};\theta)$ obeys a CLT, we have
$$B_T(\theta^{*})^{-1/2}\sqrt{T}\,\Big(\nabla L_T(z^{T};\theta^{*}) - \mathrm{IE}\big[\nabla L_T(z^{T};\theta^{*})\big]\Big) \xrightarrow{\ D\ } N(0, I_k). \tag{9.9}$$
When differentiation and integration can be interchanged,
$$\mathrm{IE}\big[\nabla L_T(z^{T};\theta)\big] = \nabla\,\mathrm{IE}\big[L_T(z^{T};\theta)\big] = \nabla\bar{L}_T(\theta),$$
where the right-hand side is the first order derivative of (9.2). As $\theta^{*}$ is the KLIC minimizer, $\nabla\bar{L}_T(\theta^{*}) = 0$, so that $\mathrm{IE}[\nabla L_T(z^{T};\theta^{*})] = 0$. Combining (9.8) and (9.9), we obtain
$$\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) = -H_T(\theta^{*})^{-1} B_T(\theta^{*})^{1/2}\Big[B_T(\theta^{*})^{-1/2}\sqrt{T}\,\nabla L_T(z^{T};\theta^{*})\Big] + o_{\mathrm{IP}}(1),$$
which has an asymptotic normal distribution. This immediately leads to the following result.
Theorem 9.2 When (9.7), (9.8) and (9.9) hold,
$$C_T(\theta^{*})^{-1/2}\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) \xrightarrow{\ D\ } N(0, I_k),$$
where
$$C_T(\theta^{*}) = H_T(\theta^{*})^{-1} B_T(\theta^{*}) H_T(\theta^{*})^{-1},$$
with $H_T(\theta^{*}) = \mathrm{IE}[\nabla^2 L_T(z^{T};\theta^{*})]$ and $B_T(\theta^{*}) = \mathrm{var}\big(\sqrt{T}\,\nabla L_T(z^{T};\theta^{*})\big)$.
Remark: For the specification of $\{y_t \,|\, x_t\}$, the QMLE is obtained from the quasi-log-likelihood function $L_T(y^{T}, x^{T};\theta)$, and its asymptotic normality holds similarly as in Theorem 9.2. That is,
$$C_T(\theta^{*})^{-1/2}\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) \xrightarrow{\ D\ } N(0, I_k),$$
where $C_T(\theta^{*}) = H_T(\theta^{*})^{-1} B_T(\theta^{*}) H_T(\theta^{*})^{-1}$ with $H_T(\theta^{*}) = \mathrm{IE}[\nabla^2 L_T(y^{T}, x^{T};\theta^{*})]$ and $B_T(\theta^{*}) = \mathrm{var}\big(\sqrt{T}\,\nabla L_T(y^{T}, x^{T};\theta^{*})\big)$.
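In applied work the covariance matrix in Theorem 9.2 is typically estimated by plugging the QMLE into sample analogues of $H_T$ and $B_T$, yielding the familiar "sandwich" form $\tilde{H}_T^{-1}\tilde{B}_T\tilde{H}_T^{-1}$. The sketch below is one possible way to do this numerically (Python; the helper names and the finite-difference derivatives are illustrative assumptions rather than a prescribed implementation).

```python
# Sketch: sandwich estimator H^{-1} B H^{-1} of the QMLE's asymptotic covariance,
# using numerical derivatives of per-observation quasi-log-likelihoods.
import numpy as np

def num_grad(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros(theta.size)
    for j in range(theta.size):
        e = np.zeros(theta.size); e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def qmle_sandwich(loglik_obs, theta_hat):
    """loglik_obs(theta): per-observation log f_t values, shape (T,).
    Returns the estimated covariance matrix of theta_hat."""
    T, k = loglik_obs(theta_hat).size, theta_hat.size
    # scores s_t(theta_hat), one row per observation
    S = np.array([num_grad(lambda th, i=i: loglik_obs(th)[i], theta_hat)
                  for i in range(T)])
    B = S.T @ S / T                                        # estimate of B_T
    mean_score = lambda th: num_grad(lambda u: loglik_obs(u).mean(), th)
    H = np.array([(mean_score(theta_hat + ej) - mean_score(theta_hat - ej)) / 2e-4
                  for ej in (1e-4 * np.eye(k))]).T         # estimate of H_T
    Hinv = np.linalg.inv(H)
    return Hinv @ B @ Hinv / T                             # var(theta_hat) estimate
```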
9.3 Information Matrix Equality

A useful result in the quasi-maximum likelihood theory is the information matrix equality. This equality shows that, under certain conditions, the information matrix $B_T(\theta)$ is the same as the negative of the expected Hessian matrix $H_T(\theta)$, so that the covariance matrix $C_T(\theta)$ can be simplified. In such a case, the estimation of $C_T(\theta)$ is much simpler.
Define the score functions $s_t(z^{t};\theta) = \nabla\log f_t(z_t \,|\, z^{t-1};\theta)$. The average of these scores is
$$\bar{s}_T(z^{T};\theta) = \nabla L_T(z^{T};\theta) = \frac{1}{T}\sum_{t=1}^{T} s_t(z^{t};\theta).$$
Clearly, $s_t(z^{t};\theta)\, f_t(z_t \,|\, z^{t-1};\theta) = \nabla f_t(z_t \,|\, z^{t-1};\theta)$. It follows that
$$\begin{aligned}
\bar{s}_T(z^{T};\theta)\, f^{T}(z^{T};\theta) &= \left[\frac{1}{T}\sum_{t=1}^{T} s_t(z^{t};\theta)\right]\left[\prod_{\tau=1}^{T} f_\tau(z_\tau \,|\, z^{\tau-1};\theta)\right] \\
&= \frac{1}{T}\sum_{t=1}^{T} \nabla f_t(z_t \,|\, z^{t-1};\theta)\prod_{\tau\neq t} f_\tau(z_\tau \,|\, z^{\tau-1};\theta) \\
&= \frac{1}{T}\,\nabla f^{T}(z^{T};\theta).
\end{aligned}$$
As $\int f^{T}(z^{T};\theta)\, dz^{T} = 1$, its derivative must be zero. By permitting the interchange of differentiation and integration we have
$$0 = \frac{1}{T}\int \nabla f^{T}(z^{T};\theta)\, dz^{T} = \int \bar{s}_T(z^{T};\theta)\, f^{T}(z^{T};\theta)\, dz^{T},$$
and
$$0 = \frac{1}{T}\int \nabla^2 f^{T}(z^{T};\theta)\, dz^{T} = \int \Big[\nabla\bar{s}_T(z^{T};\theta) + T\,\bar{s}_T(z^{T};\theta)\,\bar{s}_T(z^{T};\theta)'\Big] f^{T}(z^{T};\theta)\, dz^{T}.$$
These equalities are simply expectations taken with respect to $f^{T}(z^{T};\theta)$; they need not hold when the integrator is not $f^{T}(z^{T};\theta)$.
If $\{f_t\}$ is correctly specified in its entirety for $\{z_t\}$, the results above imply $\mathrm{IE}[\bar{s}_T(z^{T};\theta_o)] = 0$ and
$$\mathrm{IE}\big[\nabla\bar{s}_T(z^{T};\theta_o)\big] + T\,\mathrm{IE}\big[\bar{s}_T(z^{T};\theta_o)\,\bar{s}_T(z^{T};\theta_o)'\big] = 0,$$
where the expectations are taken with respect to the true density $g^{T}(z^{T}) = f^{T}(z^{T};\theta_o)$. This is the information matrix equality stated below.
Theorem 9.3 Suppose that there exists a $\theta_o$ such that $f_t(z_t \,|\, z^{t-1};\theta_o) = g_t(z_t \,|\, z^{t-1})$. Then,
$$H_T(\theta_o) + B_T(\theta_o) = 0,$$
where $H_T(\theta_o) = \mathrm{IE}[\nabla^2 L_T(z^{T};\theta_o)]$ and $B_T(\theta_o) = \mathrm{var}\big(\sqrt{T}\,\nabla L_T(z^{T};\theta_o)\big)$.
When this equality holds, the covariance matrix $C_T$ in Theorem 9.2 simplifies to
$$C_T(\theta_o) = B_T(\theta_o)^{-1} = -H_T(\theta_o)^{-1}.$$
That is, the QMLE achieves the Cramér-Rao lower bound asymptotically.
On the other hand, when $f^{T}$ is not a correct specification for $z^{T}$, the score function is not related to the true density $g^{T}$. Hence, there is no guarantee that the mean score is zero, i.e.,
$$\mathrm{IE}\big[\bar{s}_T(z^{T};\theta)\big] = \int \bar{s}_T(z^{T};\theta)\, g^{T}(z^{T})\, dz^{T} \neq 0,$$
even when this expectation is evaluated at $\theta^{*}$. Similarly, there is no guarantee that $\mathrm{IE}[\nabla\bar{s}_T(z^{T};\theta) + T\,\bar{s}_T(z^{T};\theta)\bar{s}_T(z^{T};\theta)'] = 0$, even when it is evaluated at $\theta^{*}$.
For the specification of $\{y_t \,|\, x_t\}$, we have $\nabla f_t(y_t \,|\, x_t;\theta) = s_t(y_t, x_t;\theta)\, f_t(y_t \,|\, x_t;\theta)$ and $\int f_t(y_t \,|\, x_t;\theta)\, dy_t = 1$. Similarly to the above,
$$\int s_t(y_t, x_t;\theta)\, f_t(y_t \,|\, x_t;\theta)\, dy_t = 0,$$
and
$$\int \Big[\nabla s_t(y_t, x_t;\theta) + s_t(y_t, x_t;\theta)\, s_t(y_t, x_t;\theta)'\Big] f_t(y_t \,|\, x_t;\theta)\, dy_t = 0.$$
If $\{f_t\}$ is correctly specified for $\{y_t \,|\, x_t\}$, we have $\mathrm{IE}[s_t(y_t, x_t;\theta_o) \,|\, x_t] = 0$. Then, by the law of iterated expectations, $\mathrm{IE}[s_t(y_t, x_t;\theta_o)] = 0$, so that the mean score is still zero under correct specification. Moreover,
$$\mathrm{IE}\big[\nabla s_t(y_t, x_t;\theta_o) \,|\, x_t\big] + \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)' \,|\, x_t\big] = 0,$$
which implies
$$\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[\nabla s_t(y_t, x_t;\theta_o)\big] + \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)'\big] \\
&\quad = H_T(\theta_o) + \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)'\big] = 0.
\end{aligned}$$
The equality above is not necessarily equivalent to the information matrix equality,
however.
To see this, consider the specifications $\{f_t(y_t \,|\, x_t;\theta)\}$, which are correct for $\{y_t \,|\, x_t\}$. These specifications are said to have dynamic misspecification if they are not correctly specified for $\{y_t \,|\, w_t, z^{t-1}\}$. That is, there does not exist any $\theta_o$ such that $f_t(y_t \,|\, x_t;\theta_o) = g_t(y_t \,|\, w_t, z^{t-1})$. Thus, the information contained in $w_t$ and $z^{t-1}$ cannot be fully represented by $x_t$. On the other hand, when dynamic misspecification is absent, it is easily seen that
$$\mathrm{IE}\big[s_t(y_t, x_t;\theta_o) \,|\, x_t\big] = \mathrm{IE}\big[s_t(y_t, x_t;\theta_o) \,|\, w_t, z^{t-1}\big]. \tag{9.10}$$
It is then easily verified that
$$\begin{aligned}
B_T(\theta_o) &= \frac{1}{T}\,\mathrm{IE}\!\left[\left(\sum_{t=1}^{T} s_t(y_t, x_t;\theta_o)\right)\left(\sum_{t=1}^{T} s_t(y_t, x_t;\theta_o)\right)'\right] \\
&= \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)'\big] \\
&\qquad + \frac{1}{T}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^{T} \mathrm{IE}\big[s_{t-\tau}(y_{t-\tau}, x_{t-\tau};\theta_o)\, s_t(y_t, x_t;\theta_o)'\big] \\
&\qquad + \frac{1}{T}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_{t-\tau}(y_{t-\tau}, x_{t-\tau};\theta_o)'\big] \\
&= \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)'\big],
\end{aligned}$$
where the last equality holds by (9.10) and the law of iterated expectations. When there is dynamic misspecification, the last equality fails because the covariances of the scores do not vanish. This shows that the average of the individual information matrices, i.e., $T^{-1}\sum_{t=1}^{T}\mathrm{var}\big(s_t(y_t, x_t;\theta_o)\big)$, need not be the information matrix.
Theorem 9.4 Suppose that there exists a $\theta_o$ such that $f_t(y_t \,|\, x_t;\theta_o) = g_t(y_t \,|\, x_t)$ and there is no dynamic misspecification. Then,
$$H_T(\theta_o) + B_T(\theta_o) = 0,$$
where $H_T(\theta_o) = T^{-1}\sum_{t=1}^{T} \mathrm{IE}[\nabla s_t(y_t, x_t;\theta_o)]$ and
$$B_T(\theta_o) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IE}\big[s_t(y_t, x_t;\theta_o)\, s_t(y_t, x_t;\theta_o)'\big].$$
When Theorem 9.4 holds, the covariance matrix needed to normalize $\sqrt{T}\,(\tilde{\theta}_T - \theta_o)$ again simplifies to $B_T(\theta_o)^{-1} = -H_T(\theta_o)^{-1}$, and the QMLE achieves the Cramér-Rao lower bound asymptotically.
Example 9.5 Consider the following specification: $y_t \,|\, x_t \sim N(x_t'\beta,\ \sigma^2)$ for all $t$. Let $\theta = (\beta'\ \sigma^2)'$; then
$$\log f(y_t \,|\, x_t;\theta) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(y_t - x_t'\beta)^2}{2\sigma^2},$$
and
$$L_T(y^{T}, x^{T};\theta) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{T}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{2\sigma^2}.$$
Straightforward calculation yields
$$\nabla L_T(y^{T}, x^{T};\theta) = \frac{1}{T}\sum_{t=1}^{T}
\begin{bmatrix} \dfrac{x_t(y_t - x_t'\beta)}{\sigma^2} \\[1.5ex] -\dfrac{1}{2\sigma^2} + \dfrac{(y_t - x_t'\beta)^2}{2(\sigma^2)^2} \end{bmatrix},$$
$$\nabla^2 L_T(y^{T}, x^{T};\theta) = \frac{1}{T}\sum_{t=1}^{T}
\begin{bmatrix} -\dfrac{x_t x_t'}{\sigma^2} & -\dfrac{x_t(y_t - x_t'\beta)}{(\sigma^2)^2} \\[1.5ex] -\dfrac{(y_t - x_t'\beta)x_t'}{(\sigma^2)^2} & \dfrac{1}{2(\sigma^2)^2} - \dfrac{(y_t - x_t'\beta)^2}{(\sigma^2)^3} \end{bmatrix}.$$
Setting $\nabla L_T(y^{T}, x^{T};\theta) = 0$ we can solve for $\theta$ to obtain the QMLE. It is easily verified that the QMLE of $\beta$ is the OLS estimator $\tilde{\beta}_T$ and that the QMLE of $\sigma^2$ is the average of the squared OLS residuals:
$$\tilde{\sigma}^2_T = T^{-1}\sum_{t=1}^{T}(y_t - x_t'\tilde{\beta}_T)^2.$$
If the specification above is correct for $\{y_t \,|\, x_t\}$, there exists $\theta_o = (\beta_o'\ \sigma^2_o)'$ such that the conditional distribution of $y_t$ given $x_t$ is $N(x_t'\beta_o,\ \sigma^2_o)$. Taking expectations with respect to the true distribution function, we have
$$\mathrm{IE}\big[x_t(y_t - x_t'\beta)\big] = \mathrm{IE}(x_t x_t')(\beta_o - \beta),$$
which is zero when evaluated at $\beta = \beta_o$. Similarly,
$$\begin{aligned}
\mathrm{IE}\big[(y_t - x_t'\beta)^2\big] &= \mathrm{IE}\big[(y_t - x_t'\beta_o + x_t'\beta_o - x_t'\beta)^2\big] \\
&= \mathrm{IE}\big[(y_t - x_t'\beta_o)^2\big] + \mathrm{IE}\big[(x_t'\beta_o - x_t'\beta)^2\big] \\
&= \sigma^2_o + \mathrm{IE}\big[(x_t'\beta_o - x_t'\beta)^2\big],
\end{aligned}$$
where the second term on the right-hand side is zero if it is evaluated at $\beta = \beta_o$. These results together show that
$$H_T(\theta) = \mathrm{IE}\big[\nabla^2 L_T(\theta)\big] = \frac{1}{T}\sum_{t=1}^{T}
\begin{bmatrix} -\dfrac{\mathrm{IE}(x_t x_t')}{\sigma^2} & -\dfrac{\mathrm{IE}(x_t x_t')(\beta_o - \beta)}{(\sigma^2)^2} \\[1.5ex] -\dfrac{(\beta_o - \beta)'\,\mathrm{IE}(x_t x_t')}{(\sigma^2)^2} & \dfrac{1}{2(\sigma^2)^2} - \dfrac{\sigma^2_o + \mathrm{IE}[(x_t'(\beta_o - \beta))^2]}{(\sigma^2)^3} \end{bmatrix}.$$
When this matrix is evaluated at $\theta_o = (\beta_o'\ \sigma^2_o)'$,
$$H_T(\theta_o) = \frac{1}{T}\sum_{t=1}^{T}
\begin{bmatrix} -\dfrac{\mathrm{IE}(x_t x_t')}{\sigma^2_o} & 0 \\[1.5ex] 0 & -\dfrac{1}{2(\sigma^2_o)^2} \end{bmatrix}.$$
If there is no dynamic misspecification,
$$B_T(\theta) = \frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}
\begin{bmatrix} \dfrac{(y_t - x_t'\beta)^2 x_t x_t'}{(\sigma^2)^2} & -\dfrac{x_t(y_t - x_t'\beta)}{2(\sigma^2)^2} + \dfrac{x_t(y_t - x_t'\beta)^3}{2(\sigma^2)^3} \\[1.5ex] -\dfrac{(y_t - x_t'\beta)x_t'}{2(\sigma^2)^2} + \dfrac{(y_t - x_t'\beta)^3 x_t'}{2(\sigma^2)^3} & \dfrac{1}{4(\sigma^2)^2} - \dfrac{(y_t - x_t'\beta)^2}{2(\sigma^2)^3} + \dfrac{(y_t - x_t'\beta)^4}{4(\sigma^2)^4} \end{bmatrix}.$$
Given that $y_t$ is conditionally normally distributed, its conditional third and fourth central moments are zero and $3(\sigma^2_o)^2$, respectively. It can then be verified that
$$\mathrm{IE}\big[(y_t - x_t'\beta)^3\big] = 3\sigma^2_o\,\mathrm{IE}\big[x_t'(\beta_o - \beta)\big] + \mathrm{IE}\big[(x_t'(\beta_o - \beta))^3\big], \tag{9.11}$$
which is zero when evaluated at $\beta = \beta_o$, and that
$$\mathrm{IE}\big[(y_t - x_t'\beta)^4\big] = 3(\sigma^2_o)^2 + 6\sigma^2_o\,\mathrm{IE}\big[(x_t'(\beta_o - \beta))^2\big] + \mathrm{IE}\big[(x_t'(\beta_o - \beta))^4\big], \tag{9.12}$$
which is $3(\sigma^2_o)^2$ when evaluated at $\beta = \beta_o$; see Exercise 9.2. Consequently,
$$B_T(\theta_o) = \frac{1}{T}\sum_{t=1}^{T}
\begin{bmatrix} \dfrac{\mathrm{IE}(x_t x_t')}{\sigma^2_o} & 0 \\[1.5ex] 0 & \dfrac{1}{2(\sigma^2_o)^2} \end{bmatrix}.$$
This shows that the information matrix equality holds.
A typical consistent estimator of $H_T(\theta_o)$ is
$$\tilde{H}_T(\tilde{\theta}_T) =
\begin{bmatrix} -\dfrac{\sum_{t=1}^{T} x_t x_t'}{T\tilde{\sigma}^2_T} & 0 \\[1.5ex] 0 & -\dfrac{1}{2(\tilde{\sigma}^2_T)^2} \end{bmatrix}.$$
Due to the information matrix equality, a consistent estimator of $B_T(\theta_o)$ is $\tilde{B}_T(\tilde{\theta}_T) = -\tilde{H}_T(\tilde{\theta}_T)$. It can be seen that the upper-left block of $-\tilde{H}_T(\tilde{\theta}_T)^{-1}$, namely $\tilde{\sigma}^2_T(X'X/T)^{-1}$, is the standard estimator for the asymptotic covariance matrix of $\tilde{\beta}_T$. On the other hand, when there is dynamic misspecification, $B_T(\theta)$ is not the same as given above, so that the information matrix equality fails; see Exercise 9.3. $\Box$
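Under correct specification the two estimators above should agree up to sampling error. A minimal simulation sketch follows (Python; the simulated homoskedastic Gaussian data are an illustrative assumption): it forms the analytic $\tilde{H}_T(\tilde{\theta}_T)$ and the outer-product estimate of $B_T$ from Example 9.5 and checks that $\tilde{B}_T \approx -\tilde{H}_T$.

```python
# Sketch: checking the information matrix equality H_T + B_T = 0 numerically
# for the linear model with conditionally normal, homoskedastic errors.
import numpy as np

rng = np.random.default_rng(2)
T, k = 5000, 3
X = rng.normal(size=(T, k))
beta_o, sigma2_o = np.array([1.0, -0.5, 0.2]), 0.5
y = X @ beta_o + rng.normal(scale=np.sqrt(sigma2_o), size=T)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)       # QMLE of beta = OLS
e = y - X @ b_hat
s2_hat = e @ e / T                              # QMLE of sigma^2

# analytic Hessian estimate and score outer-product estimate from Example 9.5
H = np.zeros((k + 1, k + 1))
H[:k, :k] = -(X.T @ X) / (T * s2_hat)
H[k, k] = -1.0 / (2 * s2_hat ** 2)

scores = np.hstack([X * (e / s2_hat)[:, None],
                    (-0.5 / s2_hat + e ** 2 / (2 * s2_hat ** 2))[:, None]])
B = scores.T @ scores / T

print(np.max(np.abs(B + H)))   # close to zero (up to sampling error)
```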
The information matrix equality may also fail in different circumstances. The example below shows that, even when the specification for $\mathrm{IE}(y_t \,|\, x_t)$ is correct and there is no dynamic misspecification, the information matrix equality may still fail to hold if there is misspecification of other conditional moments, such as neglected conditional heteroskedasticity.
Example 9.6 Consider the specification as in Example 9.5: $y_t \,|\, x_t \sim N(x_t'\beta,\ \sigma^2)$ for all $t$. Suppose that the DGP is
$$y_t \,|\, x_t \sim N\big(x_t'\beta_o,\ h(x_t'\alpha_o)\big),$$
where the conditional variance $h(x_t'\alpha_o)$ varies with $x_t$. Then, this specification includes a correct specification for the conditional mean but is itself incorrect for $\{y_t \,|\, x_t\}$ because it ignores conditional heteroskedasticity. From Example 9.5 we see that the upper-left block of $H_T(\theta)$ is
$$-\frac{1}{\sigma^2}\,\frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}(x_t x_t'),$$
and that the corresponding block of $B_T(\theta)$ is
$$\frac{1}{(\sigma^2)^2}\,\frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}\big[(y_t - x_t'\beta)^2 x_t x_t'\big].$$
As the specification is correct for the conditional mean but not for the conditional variance, the KLIC minimizer is $\theta^{*} = (\beta_o'\ (\sigma^{*})^2)'$. Evaluating the upper-left block of $H_T(\theta)$ at $\theta^{*}$ yields
$$-\frac{1}{(\sigma^{*})^2}\,\frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}(x_t x_t'),$$
which differs from the corresponding submatrix of $B_T(\theta)$ evaluated at $\theta^{*}$:
$$\frac{1}{[(\sigma^{*})^2]^2}\,\frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}\big[h(x_t'\alpha_o)\, x_t x_t'\big].$$
Thus, the information matrix equality breaks down, even though the conditional mean specification is correct.
A consistent estimator of $H_T(\theta^{*})$ is
$$\tilde{H}_T(\tilde{\theta}_T) =
\begin{bmatrix} -\dfrac{\sum_{t=1}^{T} x_t x_t'}{T\tilde{\sigma}^2_T} & 0 \\[1.5ex] 0 & -\dfrac{1}{2(\tilde{\sigma}^2_T)^2} \end{bmatrix},$$
yet a consistent estimator of $B_T(\theta^{*})$ is
$$\tilde{B}_T(\tilde{\theta}_T) =
\begin{bmatrix} \dfrac{\sum_{t=1}^{T} e_t^2\, x_t x_t'}{T(\tilde{\sigma}^2_T)^2} & 0 \\[1.5ex] 0 & -\dfrac{1}{4(\tilde{\sigma}^2_T)^2} + \dfrac{\sum_{t=1}^{T} e_t^4}{4T(\tilde{\sigma}^2_T)^4} \end{bmatrix};$$
see Exercise 9.4. Due to the block diagonality of $\tilde{H}_T(\tilde{\theta}_T)$ and $\tilde{B}_T(\tilde{\theta}_T)$, it is easy to verify that the upper-left block of $\tilde{H}_T(\tilde{\theta}_T)^{-1}\tilde{B}_T(\tilde{\theta}_T)\tilde{H}_T(\tilde{\theta}_T)^{-1}$ is
$$\left(\frac{1}{T}\sum_{t=1}^{T} x_t x_t'\right)^{-1}\left(\frac{1}{T}\sum_{t=1}^{T} e_t^2\, x_t x_t'\right)\left(\frac{1}{T}\sum_{t=1}^{T} x_t x_t'\right)^{-1}.$$
This is precisely the Eicker-White estimator (6.10) for the covariance matrix of the OLS estimator $\tilde{\beta}_T$, as shown in Section 6.3.1. $\Box$
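A minimal computational sketch of this estimator follows (Python; the simulated heteroskedastic data are an illustrative assumption): the $\beta$-block of the sandwich form reproduces the Eicker-White heteroskedasticity-consistent covariance estimator.

```python
# Sketch: Eicker-White covariance for the OLS/QMLE of beta as the beta-block
# of the sandwich H^{-1} B H^{-1}, under neglected conditional heteroskedasticity.
import numpy as np

rng = np.random.default_rng(3)
T = 2000
x = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_o = np.array([1.0, 2.0])
u = rng.normal(size=T) * np.exp(0.5 * x[:, 1])        # variance depends on x_t
y = x @ beta_o + u

b_hat = np.linalg.solve(x.T @ x, x.T @ y)
e = y - x @ b_hat

Sxx = x.T @ x / T
Sxex = (x * (e ** 2)[:, None]).T @ x / T
# Eicker-White estimate of avar(sqrt(T)(b_hat - beta*)):
V_white = np.linalg.inv(Sxx) @ Sxex @ np.linalg.inv(Sxx)
se_white = np.sqrt(np.diag(V_white) / T)              # robust standard errors
print(se_white)
```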
9.4 Hypothesis Testing

In this section we discuss three classical large sample tests (the Wald, LM, and likelihood ratio tests), the Hausman (1978) test, and the information matrix test of White (1982, 1987) for the null hypothesis $R\theta^{*} = r$, where $R$ is a $q \times k$ selection matrix of full row rank and $r$ is a $q \times 1$ vector of constants.

9.4.1 Wald Test

From (9.8),
$$\sqrt{T}\,R(\tilde{\theta}_T - \theta^{*}) = -R H_T(\theta^{*})^{-1} B_T(\theta^{*})^{1/2}\Big[B_T(\theta^{*})^{-1/2}\sqrt{T}\,\nabla L_T(z^{T};\theta^{*})\Big] + o_{\mathrm{IP}}(1).$$
This shows that the asymptotic covariance matrix of $\sqrt{T}\,R(\tilde{\theta}_T - \theta^{*})$ is $R C_T(\theta^{*}) R' = R H_T(\theta^{*})^{-1} B_T(\theta^{*}) H_T(\theta^{*})^{-1} R'$. Under the null hypothesis, $\sqrt{T}\,R(\tilde{\theta}_T - \theta^{*}) = \sqrt{T}\,(R\tilde{\theta}_T - r)$, so that
$$\big[R C_T(\theta^{*}) R'\big]^{-1/2}\sqrt{T}\,(R\tilde{\theta}_T - r) \xrightarrow{\ D\ } N(0, I_q). \tag{9.13}$$
This is the key distribution result for the Wald test.
For notational simplicity, we let $\tilde{H}_T$ denote a consistent estimator for $H_T(\theta^{*})$ and $\tilde{B}_T$ a consistent estimator for $B_T(\theta^{*})$, both evaluated at the QMLE $\tilde{\theta}_T$. It follows that a consistent estimator for $C_T(\theta^{*})$ is
$$\tilde{C}_T = \tilde{H}_T^{-1}\tilde{B}_T\tilde{H}_T^{-1}.$$
Substituting $\tilde{C}_T$ for $C_T(\theta^{*})$ in (9.13) we have
$$\big[R\tilde{C}_T R'\big]^{-1/2}\sqrt{T}\,(R\tilde{\theta}_T - r) \xrightarrow{\ D\ } N(0, I_q). \tag{9.14}$$
The Wald test statistic is the inner product of the left-hand side of (9.14):
$$W_T = T\,(R\tilde{\theta}_T - r)'\big(R\tilde{C}_T R'\big)^{-1}(R\tilde{\theta}_T - r). \tag{9.15}$$
The limiting distribution of the Wald test now follows from (9.14) and the continuous mapping theorem.

Theorem 9.7 Suppose that Theorem 9.2 holds for the QMLE $\tilde{\theta}_T$. Then under the null hypothesis,
$$W_T \xrightarrow{\ D\ } \chi^2(q),$$
where $W_T$ is defined in (9.15) and $q$ is the number of rows of $R$.
Example 9.8 Consider the quasi-log-likelihood function specified in Example 9.5. We write $\theta = (\beta'\ \sigma^2)'$ and $\beta = (b_1'\ b_2')'$, where $b_1$ is $(k-s)\times 1$ and $b_2$ is $s\times 1$. We are interested in the null hypothesis that $b_2 = R\theta^{*} = 0$, where $R = [0\ \ R_1]$ is $s\times(k+1)$ and $R_1 = [0\ \ I_s]$ is $s\times k$. The Wald test can be computed according to (9.15):
$$W_T = T\,\tilde{b}_{2,T}'\big(R\tilde{C}_T R'\big)^{-1}\tilde{b}_{2,T},$$
where $\tilde{b}_{2,T} = R\tilde{\theta}_T$ is the estimator of $b_2$.
As shown in Example 9.5, when the information matrix equality holds, $\tilde{C}_T = -\tilde{H}_T^{-1}$ is block diagonal, so that
$$R\tilde{C}_T R' = -R\tilde{H}_T^{-1} R' = \tilde{\sigma}^2_T\, R_1(X'X/T)^{-1} R_1'.$$
The Wald test then becomes
$$W_T = T\,\tilde{b}_{2,T}'\big[R_1(X'X/T)^{-1} R_1'\big]^{-1}\tilde{b}_{2,T}\big/\tilde{\sigma}^2_T.$$
In this case, the Wald test is just $s$ times the standard $F$ statistic, which is readily available from most econometric packages. $\Box$
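A minimal sketch of this computation follows (Python; the simulated data, and the use of the homoskedastic covariance implied by the information matrix equality, are illustrative assumptions).

```python
# Sketch: Wald test of b_2 = 0 in y_t = x_{1t}'b_1 + x_{2t}'b_2 + e_t, using the
# covariance implied by the information matrix equality (homoskedastic case).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
T, s = 500, 2
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, s))                    # regressors under test
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)   # DGP satisfies b_2 = 0

b_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_hat
s2 = e @ e / T

R1 = np.hstack([np.zeros((s, X1.shape[1])), np.eye(s)])   # selects b_2 within beta
RCR = s2 * R1 @ np.linalg.inv(X.T @ X / T) @ R1.T         # R C_T R' under IM equality
b2_hat = R1 @ b_hat
W = T * b2_hat @ np.linalg.solve(RCR, b2_hat)
print(W, 1 - chi2.cdf(W, df=s))                 # statistic and asymptotic p-value
```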
9.4.2 Lagrange Multiplier Test

Consider now the problem of maximizing $L_T(\theta)$ subject to the constraint $R\theta = r$. The Lagrangian is
$$L_T(\theta) + \lambda'(R\theta - r),$$
where $\lambda$ is the vector of Lagrange multipliers. The maximizers of the Lagrangian are denoted as $\ddot{\theta}_T$ and $\ddot{\lambda}_T$, where $\ddot{\theta}_T$ is the constrained QMLE of $\theta$. Analogous to Section 6.4.2, the LM test under the QML framework also checks whether $\ddot{\lambda}_T$ is sufficiently close to zero.
First note that $\ddot{\theta}_T$ and $\ddot{\lambda}_T$ satisfy the saddle-point condition:
$$\nabla L_T(\ddot{\theta}_T) + R'\ddot{\lambda}_T = 0.$$
The mean-value expansion of $\nabla L_T(\ddot{\theta}_T)$ about $\theta^{*}$ yields
$$\nabla L_T(\theta^{*}) + \nabla^2 L_T(\theta_T^{\dagger})(\ddot{\theta}_T - \theta^{*}) + R'\ddot{\lambda}_T = 0,$$
where $\theta_T^{\dagger}$ is the mean value between $\ddot{\theta}_T$ and $\theta^{*}$. Recall from (9.8) that
$$\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) = -H_T(\theta^{*})^{-1}\sqrt{T}\,\nabla L_T(\theta^{*}) + o_{\mathrm{IP}}(1).$$
Hence,
$$-H_T(\theta^{*})\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) + \nabla^2 L_T(\theta_T^{\dagger})\sqrt{T}\,(\ddot{\theta}_T - \theta^{*}) + \sqrt{T}\,R'\ddot{\lambda}_T = o_{\mathrm{IP}}(1).$$
Using the WULLN result $\nabla^2 L_T(\theta_T^{\dagger}) - H_T(\theta^{*}) \xrightarrow{\ \mathrm{IP}\ } 0$, we obtain
$$\sqrt{T}\,(\ddot{\theta}_T - \theta^{*}) = \sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) - H_T(\theta^{*})^{-1} R'\sqrt{T}\,\ddot{\lambda}_T + o_{\mathrm{IP}}(1). \tag{9.16}$$
This establishes a relationship between the constrained and unconstrained QMLEs. Pre-multiplying both sides of (9.16) by $R$ and noting that, under the null hypothesis, the constrained estimator satisfies $R(\ddot{\theta}_T - \theta^{*}) = 0$, we have
$$\sqrt{T}\,\ddot{\lambda}_T = \big[R H_T(\theta^{*})^{-1} R'\big]^{-1} R\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) + o_{\mathrm{IP}}(1), \tag{9.17}$$
which relates the Lagrange multiplier and the unconstrained QMLE $\tilde{\theta}_T$. When Theorem 9.2 holds for the normalized $\tilde{\theta}_T$, we obtain the following asymptotic normality result for the normalized Lagrange multiplier:
$$\Lambda_T(\theta^{*})^{-1/2}\sqrt{T}\,\ddot{\lambda}_T = \Lambda_T(\theta^{*})^{-1/2}\big[R H_T(\theta^{*})^{-1} R'\big]^{-1} R\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) \xrightarrow{\ D\ } N(0, I_q), \tag{9.18}$$
where
$$\Lambda_T(\theta^{*}) = \big[R H_T(\theta^{*})^{-1} R'\big]^{-1} R C_T(\theta^{*}) R'\big[R H_T(\theta^{*})^{-1} R'\big]^{-1}.$$
Let $\ddot{H}_T$ and $\ddot{C}_T$ denote consistent estimators for $H_T(\theta^{*})$ and $C_T(\theta^{*})$, respectively, based on the constrained QMLE $\ddot{\theta}_T$. Then,
$$\ddot{\Lambda}_T = \big(R\ddot{H}_T^{-1} R'\big)^{-1} R\ddot{C}_T R'\big(R\ddot{H}_T^{-1} R'\big)^{-1}$$
is consistent for $\Lambda_T(\theta^{*})$, and
$$\ddot{\Lambda}_T^{-1/2}\sqrt{T}\,\ddot{\lambda}_T = \ddot{\Lambda}_T^{-1/2}\big(R\ddot{H}_T^{-1} R'\big)^{-1} R\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) \xrightarrow{\ D\ } N(0, I_q). \tag{9.19}$$
The LM test statistic is the inner product of the left-hand side of (9.19):
$$LM_T = T\,\ddot{\lambda}_T'\ddot{\Lambda}_T^{-1}\ddot{\lambda}_T = T\,\ddot{\lambda}_T' R\ddot{H}_T^{-1} R'\big(R\ddot{C}_T R'\big)^{-1} R\ddot{H}_T^{-1} R'\ddot{\lambda}_T. \tag{9.20}$$
The limiting distribution of the LM test now follows easily from (9.19) and the continuous mapping theorem.

Theorem 9.9 Suppose that Theorem 9.2 holds for the QMLE $\tilde{\theta}_T$. Then under the null hypothesis,
$$LM_T \xrightarrow{\ D\ } \chi^2(q),$$
where $LM_T$ is defined in (9.20) and $q$ is the number of rows of $R$.
Remark: When the information matrix equality holds, the LM statistic (9.20) becomes
$$LM_T = -T\,\ddot{\lambda}_T' R\ddot{H}_T^{-1} R'\ddot{\lambda}_T = -T\,\nabla L_T(\ddot{\theta}_T)'\ddot{H}_T^{-1}\nabla L_T(\ddot{\theta}_T),$$
which mainly involves the average of the scores, $\nabla L_T(\ddot{\theta}_T)$. The LM test thus checks whether the average of the scores is sufficiently close to zero, and hence it is also known as the score test.
Example 9.10 Consider the quasi-log-likelihood function specified in Example 9.5. We write $\theta = (\beta'\ \sigma^2)'$ and $\beta = (b_1'\ b_2')'$, where $b_1$ is $(k-s)\times 1$ and $b_2$ is $s\times 1$. We are interested in the null hypothesis that $b_2 = R\theta^{*} = 0$, where $R = [0\ \ R_1]$ is $s\times(k+1)$ and $R_1 = [0\ \ I_s]$ is $s\times k$. From the saddle-point condition,
$$\nabla L_T(\ddot{\theta}_T) = -R'\ddot{\lambda}_T,$$
which can be partitioned as
$$\nabla L_T(\ddot{\theta}_T) =
\begin{bmatrix} \nabla_{b_1} L_T(\ddot{\theta}_T) \\ \nabla_{b_2} L_T(\ddot{\theta}_T) \\ \nabla_{\sigma^2} L_T(\ddot{\theta}_T) \end{bmatrix} =
\begin{bmatrix} 0 \\ \nabla_{b_2} L_T(\ddot{\theta}_T) \\ 0 \end{bmatrix} = -R'\ddot{\lambda}_T,$$
because $\ddot{\theta}_T$ maximizes $L_T$ with respect to $b_1$ and $\sigma^2$, subject only to the constraint on $b_2$.
Partitioning $x_t$ accordingly as $(x_{1t}'\ x_{2t}')'$, we have
$$\nabla_{b_2} L_T(\ddot{\theta}_T) = \frac{1}{T\ddot{\sigma}^2_T}\sum_{t=1}^{T} x_{2t}\,\ddot{\varepsilon}_t = X_2'\ddot{\varepsilon}\big/(T\ddot{\sigma}^2_T),
$$
where $\ddot{\varepsilon}$ is the vector of constrained residuals $\ddot{\varepsilon}_t$ (the OLS residuals from regressing $y_t$ on $x_{1t}$) and $\ddot{\sigma}^2_T = \sum_{t=1}^{T}\ddot{\varepsilon}_t^2/T$. The LM test can be computed according to (9.20):
$$LM_T = T
\begin{bmatrix} 0 \\ X_2'\ddot{\varepsilon}/(T\ddot{\sigma}^2_T) \\ 0 \end{bmatrix}'
\ddot{H}_T^{-1} R'\big(R\ddot{C}_T R'\big)^{-1} R\ddot{H}_T^{-1}
\begin{bmatrix} 0 \\ X_2'\ddot{\varepsilon}/(T\ddot{\sigma}^2_T) \\ 0 \end{bmatrix},$$
which converges in distribution to $\chi^2(s)$ under the null hypothesis. Note that we do not have to evaluate the complete score vector for computing the LM test; only the subvector of the score that corresponds to the constraint matters.
When the information matrix equality holds, the LM statistic has a simpler form:
$$LM_T = T\,\big[0'\ \ \ddot{\varepsilon}'X_2/T\big](X'X/T)^{-1}\big[0'\ \ \ddot{\varepsilon}'X_2/T\big]'\big/\ddot{\sigma}^2_T = T\,\frac{\ddot{\varepsilon}'X(X'X)^{-1}X'\ddot{\varepsilon}}{\ddot{\varepsilon}'\ddot{\varepsilon}} = T R^2,$$
where $R^2$ is the non-centered coefficient of determination obtained from the auxiliary regression of the constrained residuals $\ddot{\varepsilon}_t$ on $x_{1t}$ and $x_{2t}$. $\Box$
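The $TR^2$ form is easy to compute: estimate the constrained model, then regress its residuals on all the regressors. A minimal sketch follows (Python; the simulated data are an illustrative assumption).

```python
# Sketch: LM (score) test of b_2 = 0 computed as T * (uncentered R^2) from the
# auxiliary regression of the constrained residuals on x_{1t} and x_{2t}.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T, s = 500, 2
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, s))
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)      # b_2 = 0 in the DGP

# constrained estimation: regress y on X1 only
e_con = y - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ y)

# auxiliary regression of the constrained residuals on all regressors
X = np.hstack([X1, X2])
fitted = X @ np.linalg.solve(X.T @ X, X.T @ e_con)
R2_uncentered = (fitted @ fitted) / (e_con @ e_con)

LM = T * R2_uncentered
print(LM, 1 - chi2.cdf(LM, df=s))
```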
Example 9.11 (Breusch-Pagan) Suppose that the specification is
$$y_t \,|\, x_t, \zeta_t \sim N\big(x_t'\beta,\ h(\zeta_t'\alpha)\big),$$
where $h:\mathbb{R}\to(0,\infty)$ is a differentiable function, and $\zeta_t'\alpha = \alpha_0 + \sum_{i=1}^{p}\zeta_{ti}\alpha_i$. The null hypothesis is conditional homoskedasticity, i.e., $\alpha_1 = \cdots = \alpha_p = 0$, so that $h(\alpha_0) = \sigma^2_0$. Breusch and Pagan (1979) derived the LM test for this hypothesis under the assumption that the information matrix equality holds. This test is now usually referred to as the Breusch-Pagan test.
Note that the constrained specification is $y_t \,|\, x_t, \zeta_t \sim N(x_t'\beta,\ \sigma^2)$, where $\sigma^2 = h(\alpha_0)$. This leads to the standard linear regression model without heteroskedasticity. The constrained QMLEs for $\beta$ and $\sigma^2$ are, respectively, the OLS estimators $\hat{\beta}_T$ and $\hat{\sigma}^2_T = \sum_{t=1}^{T} e_t^2/T$, where $e_t$ are the OLS residuals. The score vector corresponding to $\alpha$ is:
$$\nabla_{\alpha} L_T(y^{T}, x^{T}, \zeta^{T};\theta) = \frac{1}{T}\sum_{t=1}^{T}\left[\frac{h'(\zeta_t'\alpha)\,\zeta_t}{2h(\zeta_t'\alpha)}\right]\left[\frac{(y_t - x_t'\beta)^2}{h(\zeta_t'\alpha)} - 1\right],$$
where $h'$ denotes the derivative of $h$. Under the null hypothesis, $h'(\alpha_0)$ is just a constant, say $c$. The score vector above evaluated at the constrained QMLEs is
$$\nabla_{\alpha} L_T(y^{T}, x^{T}, \zeta^{T};\ddot{\theta}_T) = \frac{c}{T}\sum_{t=1}^{T}\left[\frac{\zeta_t}{2\hat{\sigma}^2_T}\right]\left[\frac{e_t^2}{\hat{\sigma}^2_T} - 1\right].$$
The $(p+1)\times(p+1)$ block of the Hessian matrix corresponding to $\alpha$ is
$$\frac{1}{T}\sum_{t=1}^{T}\left\{\left[-\frac{(y_t - x_t'\beta)^2}{h(\zeta_t'\alpha)^3} + \frac{1}{2h(\zeta_t'\alpha)^2}\right]\big[h'(\zeta_t'\alpha)\big]^2 + \left[\frac{(y_t - x_t'\beta)^2}{2h(\zeta_t'\alpha)^2} - \frac{1}{2h(\zeta_t'\alpha)}\right]h''(\zeta_t'\alpha)\right\}\zeta_t\zeta_t'.$$
Evaluating the expectation of this block at $\theta^{*} = (\beta_o'\ \ \alpha_0\ \ 0\cdots 0)'$, where $(\sigma^{*})^2 = h(\alpha_0)$, we have
$$-\left[\frac{c^2}{2[(\sigma^{*})^2]^2}\right]\left[\frac{1}{T}\sum_{t=1}^{T}\mathrm{IE}(\zeta_t\zeta_t')\right],$$
which, apart from the constant $c$, can be estimated by
$$-\left[\frac{c^2}{2(\hat{\sigma}^2_T)^2}\right]\left[\frac{1}{T}\sum_{t=1}^{T}\zeta_t\zeta_t'\right].$$
The LM test is now readily derived from the results above when the information matrix
equality holds.
Setting $d_t = e_t^2/\hat{\sigma}^2_T - 1$, the LM statistic is
$$LM_T = \left(\sum_{t=1}^{T} d_t\zeta_t'\right)\left(\sum_{t=1}^{T}\zeta_t\zeta_t'\right)^{-1}\left(\sum_{t=1}^{T}\zeta_t d_t\right)\bigg/2 = d'Z(Z'Z)^{-1}Z'd/2 \xrightarrow{\ D\ } \chi^2(p),$$
where $d$ is $T\times 1$ with $t$-th element $d_t$, and $Z$ is the $T\times(p+1)$ matrix with $t$-th row $\zeta_t'$. It can be seen that the numerator of the LM statistic is the (centered) regression sum of squares (RSS) from regressing $d_t$ on $\zeta_t$. This shows that the Breusch-Pagan test can also be computed by running an auxiliary regression and using the resulting RSS/2 as the statistic. Intuitively, this amounts to checking whether the variables in $\zeta_t$ are capable of explaining the square of the (standardized) OLS residuals. It is also interesting to see that the value of $c$ and the functional form of $h$ do not matter in deriving the
statistic. The latter feature makes the Breusch-Pagan test a general test for conditional
heteroskedasticity.
Koenker (1981) noted that under conditional normality, $\sum_{t=1}^{T} d_t^2/T \xrightarrow{\ \mathrm{IP}\ } 2$. Thus, a test that is more robust to non-normality and asymptotically equivalent to the Breusch-Pagan test is obtained by replacing the denominator 2 with $\sum_{t=1}^{T} d_t^2/T$. This robust version can be expressed as
$$LM_T = T\big[d'Z(Z'Z)^{-1}Z'd\big/d'd\big] = T R^2,$$
where $R^2$ is the (centered) $R^2$ from regressing $d_t$ on $\zeta_t$. This is also equivalent to the centered $R^2$ from regressing $e_t^2$ on $\zeta_t$. $\Box$
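A minimal sketch of both versions of the test follows (Python; the simulated data and the choice $\zeta_t = (1,\ x_t)'$ are illustrative assumptions).

```python
# Sketch: Breusch-Pagan test (RSS/2 form) and Koenker's robust TR^2 version,
# both computed from an auxiliary regression of d_t = e_t^2/s2 - 1 on zeta_t.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
T = 800
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T) * np.sqrt(np.exp(0.8 * x))  # heteroskedastic

X = np.column_stack([np.ones(T), x])
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)       # OLS residuals
s2 = e @ e / T
d = e ** 2 / s2 - 1.0

Z = np.column_stack([np.ones(T), x])                # zeta_t: constant plus x_t
d_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ d)       # fitted values from aux. regression

ess = (d_hat - d.mean()) @ (d_hat - d.mean())       # regression sum of squares
tss = (d - d.mean()) @ (d - d.mean())
BP = ess / 2.0                                      # Breusch-Pagan: RSS/2
Koenker = T * ess / tss                             # robust version: T * R^2
p = 1                                               # slope terms in zeta_t
print(BP, 1 - chi2.cdf(BP, p), Koenker, 1 - chi2.cdf(Koenker, p))
```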
Remarks:

1. To compute the Breusch-Pagan test, one must specify a vector $\zeta_t$ that determines the conditional variance. Here, $\zeta_t$ may contain some or all of the variables in $x_t$. If $\zeta_t$ is chosen to include all elements of $x_t$, their squares and pairwise products, the resulting $TR^2$ is also the White (1980) test for (conditional) heteroskedasticity of unknown form. The White test can also be interpreted as an information matrix test, discussed below.

2. The Breusch-Pagan test is obtained under the condition that the information matrix equality holds. We have seen that the information matrix equality may fail when there is dynamic misspecification. Thus, the Breusch-Pagan test is not valid when, e.g., the errors are serially correlated.
Example 9.12 (Breusch-Godfrey) Given the specification $y_t \,|\, x_t \sim N(x_t'\beta,\ \sigma^2)$, suppose that one would like to check if the errors are serially correlated. Consider first the AR(1) errors: $y_t - x_t'\beta = \rho(y_{t-1} - x_{t-1}'\beta) + u_t$ with $|\rho| < 1$ and $\{u_t\}$ a white noise. The null hypothesis is $\rho = 0$; under the alternative, the specification is
$$y_t \,|\, x_t, y_{t-1}, x_{t-1} \sim N\big(x_t'\beta + \rho(y_{t-1} - x_{t-1}'\beta),\ \sigma^2_u\big).$$
The constrained specification is just the standard linear regression model $y_t = x_t'\beta + e_t$. Testing the null hypothesis that $\rho = 0$ can be carried out by regressing the OLS residuals $\hat{e}_t$ on $x_t$ and $\hat{e}_{t-1}$ and computing $TR^2$ from this auxiliary regression. This is precisely the Breusch (1978) and Godfrey (1978) test for AR(1) errors with the limiting $\chi^2(1)$ distribution.
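A minimal sketch of the AR(1) version of this test follows (Python; the simulated AR(1)-error data are an illustrative assumption).

```python
# Sketch: Breusch-Godfrey LM test for AR(1) errors, computed as T * R^2 from
# the auxiliary regression of OLS residuals on x_t and the lagged residual.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
T, rho = 600, 0.4
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                       # AR(1) errors
    u[t] = rho * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(T), x])
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)      # OLS residuals

# auxiliary regression of e_t on x_t and e_{t-1} (drop the first observation)
Z = np.column_stack([X[1:], e[:-1]])
e1 = e[1:]
fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e1)
R2 = ((fit - e1.mean()) @ (fit - e1.mean())) / ((e1 - e1.mean()) @ (e1 - e1.mean()))

BG = (T - 1) * R2
print(BG, 1 - chi2.cdf(BG, df=1))           # limiting chi^2(1) under the null
```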
The test above can be extended straightforwardly to check for AR(p) errors. By regressing $\hat{e}_t$ on $x_t$ and $\hat{e}_{t-1}, \ldots, \hat{e}_{t-p}$, the resulting $TR^2$ is the LM test when the information matrix equality holds and has a limiting $\chi^2(p)$ distribution. Such tests are known as the Breusch-Godfrey test. Moreover, if the specification is $y_t - x_t'\beta = u_t + \alpha u_{t-1}$, i.e., the errors follow an MA(1) process, we can write
$$y_t \,|\, x_t, u_{t-1} \sim N\big(x_t'\beta + \alpha u_{t-1},\ \sigma^2_u\big).$$
The null hypothesis is $\alpha = 0$, and an LM test can be constructed in the same manner. $\Box$

9.4.3 Likelihood Ratio Test

The likelihood ratio (LR) test is based on the difference between the unconstrained and constrained maxima of the quasi-log-likelihood function:
$$LR_T = 2T\big[L_T(\tilde{\theta}_T) - L_T(\ddot{\theta}_T)\big]. \tag{9.21}$$
Utilizing the relationship between the constrained and unconstrained QMLEs (9.16) and the relationship between the Lagrange multiplier and the unconstrained QMLE (9.17), we can write
$$\begin{aligned}
\sqrt{T}\,(\tilde{\theta}_T - \ddot{\theta}_T) &= H_T(\theta^{*})^{-1} R'\sqrt{T}\,\ddot{\lambda}_T + o_{\mathrm{IP}}(1) \\
&= H_T(\theta^{*})^{-1} R'\big[R H_T(\theta^{*})^{-1} R'\big]^{-1} R\sqrt{T}\,(\tilde{\theta}_T - \theta^{*}) + o_{\mathrm{IP}}(1).
\end{aligned} \tag{9.22}$$
By a Taylor expansion of $L_T(\ddot{\theta}_T)$ about $\tilde{\theta}_T$,
$$\begin{aligned}
2T\big[L_T(\tilde{\theta}_T) - L_T(\ddot{\theta}_T)\big]
&= -2T\,\nabla L_T(\tilde{\theta}_T)'(\ddot{\theta}_T - \tilde{\theta}_T) - T\,(\ddot{\theta}_T - \tilde{\theta}_T)'\nabla^2 L_T(\tilde{\theta}_T)(\ddot{\theta}_T - \tilde{\theta}_T) + o_{\mathrm{IP}}(1) \\
&= -T\,(\tilde{\theta}_T - \ddot{\theta}_T)' H_T(\theta^{*})(\tilde{\theta}_T - \ddot{\theta}_T) + o_{\mathrm{IP}}(1) \\
&= -T\,(\tilde{\theta}_T - \theta^{*})' R'\big[R H_T(\theta^{*})^{-1} R'\big]^{-1} R(\tilde{\theta}_T - \theta^{*}) + o_{\mathrm{IP}}(1),
\end{aligned}$$
where the second equality follows because $\nabla L_T(\tilde{\theta}_T) = 0$. It can be seen that the right-hand side is essentially the Wald statistic with $-R H_T(\theta^{*})^{-1} R'$ as the normalizing variance-covariance matrix. This leads to the following distribution result for the LR test.
Theorem 9.13 Suppose that Theorem 9.2 for the QMLE $\tilde{\theta}_T$ and the information matrix equality both hold. Then under the null hypothesis,
$$LR_T \xrightarrow{\ D\ } \chi^2(q),$$
where $LR_T$ is defined in (9.21) and $q$ is the number of rows of $R$.
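A minimal sketch of the LR statistic (9.21) for the linear Gaussian model follows (Python; the simulated data are an illustrative assumption). Under homoskedastic normal errors the information matrix equality holds, so Theorem 9.13 applies.

```python
# Sketch: LR statistic 2T[L_T(unconstrained) - L_T(constrained)] for testing
# b_2 = 0 in a linear model with a Gaussian quasi-likelihood.
import numpy as np
from scipy.stats import chi2

def max_gaussian_loglik(y, X):
    """Maximized average Gaussian quasi-log-likelihood of y regressed on X."""
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    s2 = e @ e / y.size
    return -0.5 * (np.log(2 * np.pi * s2) + 1.0)

rng = np.random.default_rng(8)
T, s = 500, 2
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, s))
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)   # b_2 = 0 in the DGP

LR = 2 * T * (max_gaussian_loglik(y, np.hstack([X1, X2]))
              - max_gaussian_loglik(y, X1))
print(LR, 1 - chi2.cdf(LR, df=s))
```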
Theorem 9.13 differs from Theorem 9.7 and Theorem 9.9 in that it also requires the validity of the information matrix equality. When the information matrix equality does not hold, $-R H_T(\theta^{*})^{-1} R'$ is no longer a proper normalizing matrix for $\sqrt{T}\,(R\tilde{\theta}_T - r)$, so the LR statistic need not have a limiting $\chi^2$ distribution.
)?
9.4 In Example 9.6, what is B
T
(
)? Show that
B
T
(
T
) =
T
t=1
e
2
t
xtx
t
T(
2
T
)
2
0
0
1
4(
2
T
)
2
+
T
t=1
e
4
t
T 4(
2
T
)
4
).
9.5 Consider the specification $y_t \,|\, x_t \sim N(x_t'\beta,\ h(\zeta_t'\alpha))$. What conditions would ensure that $H_T(\theta^{*})$ and $B_T(\theta^{*})$ satisfy the information matrix equality?

9.6 Consider the specification $y_t \,|\, x_t, y_{t-1} \sim N(\gamma y_{t-1} + x_t'\beta,\ \sigma^2)$ and the AR(1) errors:
$$y_t - \gamma y_{t-1} - x_t'\beta = \rho\,(y_{t-1} - \gamma y_{t-2} - x_{t-1}'\beta) + u_t,$$
with $|\rho| < 1$ and $\{u_t\}$ a white noise. Derive the LM test for the null hypothesis $\rho = 0$ and show that its square root is Durbin's $h$ test; see Section 4.3.3.
References
Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.

Breusch, T. S. (1978). Testing for autocorrelation in dynamic linear models, Australian Economic Papers, 17, 334-355.

Breusch, T. S. and A. R. Pagan (1979). A simple test for heteroscedasticity and random coefficient variation, Econometrica, 47, 1287-1294.

Engle, Robert F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation, Econometrica, 50, 987-1007.

Godfrey, L. G. (1978). Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables, Econometrica, 46, 1293-1301.

Godfrey, L. G. (1988). Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches, New York: Cambridge University Press.

Hamilton, James D. (1994). Time Series Analysis, Princeton: Princeton University Press.

Hausman, Jerry A. (1978). Specification tests in econometrics, Econometrica, 46, 1251-1272.

Koenker, Roger (1981). A note on studentizing a test for heteroscedasticity, Journal of Econometrics, 17, 107-112.

White, Halbert (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica, 48, 817-838.

White, Halbert (1982). Maximum likelihood estimation of misspecified models, Econometrica, 50, 1-25.

White, Halbert (1994). Estimation, Inference, and Specification Analysis, New York: Cambridge University Press.