Econometric Theory Lecture Notes
INTRODUCTION
1.1. MATRIX NOTATION AND SOME USEFUL RESULTS.
A $(p \times 1)$ vector $x$ will always be a column vector. That is,
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}.
\]
The inner product of two vectors, denoted by $\langle \cdot\,,\cdot \rangle$, is defined as
\[
\langle x, y \rangle = \sum_{i=1}^{p} x_i y_i .
\]
A $(p \times q)$ matrix is written as
\[
A = (a_1, \ldots, a_q) = (a_{ij})_{i=1,\ldots,p;\ j=1,\ldots,q}.
\]
Its transpose, denoted $A'$, is $A' = (a_{ji})_{i=1,\ldots,p;\ j=1,\ldots,q}$, and
\[
(A+B)' = A' + B', \qquad (AB)' = B'A'.
\]
If the matrix $A$ is square, that is $p = q$, and $|A| \neq 0$, then the inverse $A^{-1}$ exists and is defined in such a way that $AA^{-1} = A^{-1}A = I_p$. Moreover, $(AB)^{-1} = B^{-1}A^{-1}$.
Definition 1.1. Let $A$ be a $(p \times q)$ matrix. We define its rank as the maximum number of linearly independent columns (rows). If $|A| \neq 0$, then $\mathrm{rank}(A) = p$.

Definition 1.2. Let $A$ be a $(p \times p)$ matrix. A quadratic form is defined as
\[
x'Ax = \sum_{i,j=1}^{p} x_i a_{ij} x_j .
\]
For a suitable nonsingular matrix $B$,
\[
BAB' = \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix},
\]
that is, $A = B^{-1}\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}(B')^{-1}$.
For a nonsingular matrix $A = A(\theta)$ depending on a scalar parameter $\theta$,
\[
\frac{\partial}{\partial\theta}\ln|A| = \mathrm{tr}\Big( A^{-1}\frac{\partial A}{\partial\theta} \Big)
\qquad\text{and}\qquad
\frac{\partial}{\partial\theta}A^{-1} = -A^{-1}\frac{\partial A}{\partial\theta}A^{-1},
\]
\[
\frac{\partial^2}{\partial\theta^2}\ln|A| = \mathrm{tr}\Big( A^{-1}\frac{\partial^2 A}{\partial\theta^2} - A^{-1}\frac{\partial A}{\partial\theta}A^{-1}\frac{\partial A}{\partial\theta} \Big),
\]
\[
\frac{\partial^2}{\partial\theta^2}A^{-1} = 2A^{-1}\frac{\partial A}{\partial\theta}A^{-1}\frac{\partial A}{\partial\theta}A^{-1} - A^{-1}\frac{\partial^2 A}{\partial\theta^2}A^{-1}.
\]
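As a quick numerical sanity check of the first identity, the following sketch (not part of the original notes; the choice $A(\theta) = I + \theta M$ with $M$ positive semidefinite is an arbitrary illustrative assumption) compares the finite-difference derivative of $\ln|A(\theta)|$ with $\mathrm{tr}(A^{-1}\partial A/\partial\theta)$.

```python
import numpy as np

# Numerical check of d/dtheta ln|A(theta)| = tr(A^{-1} dA/dtheta)
# for the illustrative choice A(theta) = I + theta * M, M p.s.d.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
M = S @ S.T                                    # p.s.d., so A(theta) stays p.d.
A = lambda th: np.eye(4) + th * M              # A(theta); dA/dtheta = M

th0, h = 0.3, 1e-6
lhs = (np.log(np.linalg.det(A(th0 + h)))
       - np.log(np.linalg.det(A(th0 - h)))) / (2 * h)   # central difference
rhs = np.trace(np.linalg.solve(A(th0), M))               # tr(A^{-1} dA/dtheta)
print(lhs, rhs)   # the two numbers agree up to discretization error
```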
1.1.2. Inequalities.

Cauchy-Schwarz inequality:
\[
\Big( \sum_{i=1}^{n} x_i y_i \Big)^2 \leq \Big( \sum_{i=1}^{n} x_i^2 \Big)\Big( \sum_{i=1}^{n} y_i^2 \Big).
\]
Hölder's inequality:
\[
\sum_{i=1}^{n} x_i y_i \leq \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p}\Big( \sum_{i=1}^{n} |y_i|^q \Big)^{1/q}, \quad\text{where } \frac{1}{p} + \frac{1}{q} = 1.
\]
or equivalently $\Pr\{\lim_{n\to\infty} X_n = X\} = 1$. This type of convergence is denoted as $X_n \overset{a.s.}{\to} X$.

Definition 2.3. (Convergence in r-th mean) We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in r-th mean to $X$ if $\lim_{n\to\infty} E|X_n - X|^r = 0$. This type of convergence will be denoted by $X_n \overset{r\text{-th}}{\to} X$.

Remark 2.1. For convergence in r-th mean we need that $E|X_n|^r < \infty$ and $E|X|^r < \infty$.
Example 2.1. Let $\{X_n\}_{n\in\mathbb{N}}$ be such that
\[
X_n = \begin{cases} \mu & \text{with probability } 1 - \tfrac{1}{n} \\ \mu + 1 & \text{with probability } \tfrac{1}{n}. \end{cases}
\]
Then $X_n \overset{p}{\to} \mu$. Indeed, for any $0 < \varepsilon < 1$, $\Pr\{|X_n - \mu| > \varepsilon\} = \Pr\{X_n = \mu + 1\} = 1/n \to 0$.
Example 2.2. Let $\{X_n\}_{n\in\mathbb{N}}$ be defined as
\[
X_n = \begin{cases} 0 & \text{with probability } 1 - \tfrac{1}{n^2} \\ n & \text{with probability } \tfrac{1}{n^2}. \end{cases}
\]
Then $X_n \overset{a.s.}{\to} 0$. Indeed,
\[
\Pr\{|X_m - X| < \varepsilon,\ \text{all } m \ge n\} = 1 - \Pr\{|X_m| \ge \varepsilon,\ \text{for some } m \ge n\}.
\]
So, it suffices to prove that the second term on the right of the last displayed expression converges to zero. But that term is
\[
\Pr\Big\{ \bigcup_{m \ge n} |X_m| \ge \varepsilon \Big\} \le \sum_{m=n}^{\infty}\Pr\{|X_m| \ge \varepsilon\} = \sum_{m=n}^{\infty}\frac{1}{m^2} < \varepsilon
\]
for all $n$ large enough, because $\sum_{m=1}^{\infty} m^{-2} < \infty$. From here we conclude.

The last type of convergence is convergence in distribution.
Also, we know that $X_n \overset{p}{\to} X$, that is, for all $n \ge n_0$,
\[
\Pr\{|X_n - X| > \delta\} < \frac{\varepsilon}{2} \quad\text{or}\quad \Pr\{|X_n - X| \le \delta\} \ge 1 - \frac{\varepsilon}{2}.
\]
Thus,
\begin{align*}
\Pr\{|X_n - X| \le \delta\} &= \Pr\{|X_n - X| \le \delta,\ X \in S\} + \Pr\{|X_n - X| \le \delta,\ X \notin S\} \\
&\le \Pr\{|X_n - X| \le \delta,\ X \in S\} + \frac{\varepsilon}{2},
\end{align*}
which implies that
\[
1 - \frac{\varepsilon}{2} \le \Pr\{|X_n - X| \le \delta\} \le \Pr\{|X_n - X| \le \delta,\ X \in S\} + \frac{\varepsilon}{2}
\]
and hence that $1 - \varepsilon \le \Pr\{|X_n - X| \le \delta,\ X \in S\}$.
But by continuity of $g(\cdot)$,
\[
\Pr\{|X_n - X| \le \delta,\ X \in S\} \le \Pr\{|g(X_n) - g(X)| \le \varepsilon,\ X \in S\} \le \Pr\{|g(X_n) - g(X)| \le \varepsilon\}
\]
and hence
\[
1 - \varepsilon \le \Pr\{|g(X_n) - g(X)| \le \varepsilon\}.
\]
This concludes the proof.

Example 2.3. $X_n \overset{p}{\to} c \neq 0 \Longrightarrow 1/X_n \overset{p}{\to} 1/c$.

Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of random variables.

Definition 2.5. We say that $\{X_n\}_{n\in\mathbb{N}}$ converges in probability to $X$, a $(k \times 1)$ vector, if for all $i = 1, 2, \ldots, k$, $X_{ni} \overset{p}{\to} X_i$.
2.1. WEAK LAW OF LARGE NUMBERS.

Theorem 2.4. (Khintchine) Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of iid random variables such that $EX_i = \mu$. Then,
\[
\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} \mu .
\]

Denote by $Y_n$ the left side of the last displayed expression. Then, by Theorem 2.1, see also Theorem 2.2, it suffices to show that $E\big( \frac{1}{n}\sum_{i=1}^{n}\widetilde{Y}_i \big)^2 \to 0$. But because the random variables $X_i$ are uncorrelated,
\[
E\Big( \frac{1}{n}\sum_{i=1}^{n}\widetilde{Y}_i \Big)^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 \to 0 .
\]
This concludes the proof.
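A short simulation illustrating Khintchine's theorem (purely illustrative; the exponential distribution, the sample sizes and the tolerance 0.1 are arbitrary assumptions): the fraction of samples whose mean is far from $\mu$ shrinks as $n$ grows.

```python
import numpy as np

# Illustration of the WLLN: sample means of iid draws (exponential with
# mean mu = 2, an arbitrary choice) concentrate around mu as n grows.
rng = np.random.default_rng(1)
mu = 2.0
for n in (10, 100, 10_000):
    x = rng.exponential(scale=mu, size=(1000, n))   # 1000 replications of size n
    xbar = x.mean(axis=1)
    # fraction of replications with |X_bar - mu| > 0.1 shrinks with n
    print(n, np.mean(np.abs(xbar - mu) > 0.1))
```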
Remark 2.4. If for all $i$, $\mu_i = \mu$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} \mu$.

Theorem 2.6. Let $\{X_i\}_{i\in\mathbb{Z}}$ be a sequence of random variables where $EX_i = \mu_i$, $\mathrm{Var}\,X_i = \sigma_i^2$ and $\mathrm{Cov}(X_i, X_j) = \rho_{|i-j|}\sigma_i\sigma_j$. If $n^{-2}\sum_{i=1}^{n}\sigma_i^2 \to 0$ as $n\to\infty$ and $\sum_{i=1}^{\infty}|\rho_i| < \infty$, then
\[
\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i \overset{p}{\to} 0 .
\]

Remark 2.5. From Theorem 2.5 we can observe that there is a trade-off between the heterogeneity of the sequence of random variables $\{X_i\}$ and the moment conditions on the sequence. In particular, we observe that in Theorem 2.4 we only need the first moment to be finite, whereas in Theorem 2.5 we need second moments, although $\sigma_i^2$ may increase to infinity, but not too quickly.

Theorem 2.7. Assume that $X_n \overset{p}{\to} 0$ and $Y_n \overset{d}{\to} Y$. Then, $X_n + Y_n \overset{d}{\to} Y$.
2.2. CENTRAL LIMIT THEOREMS.

Central limit theorems (CLT) deal with the convergence in distribution of (normalized) sums of random variables, for instance $\sum_{i=1}^{n} X_i$.

Theorem 2.8. (Lindeberg-Levy) Let $\{X_i\}_{i\in\mathbb{N}}$ be iid random variables with mean $\mu$ and finite variance $\sigma^2$. Then,
\[
\frac{1}{n^{1/2}}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma} \overset{d}{\to} N(0,1).
\]
( Bn )2 E jXi ij .
Then,
n Z n
1 X 2 2
X
jz i j dFi (z) Bn E jXi ij
Bn2 jz i j> Bn
i=1 i=1
1.
and so
n Z
X
2 2 2
max j jt i j dFi (t) + Bn2
j n jt
i=1 i j> Bn
Example 2.4. Let $\{X_i\}_{i\in\mathbb{N}}$ be a sequence of iid random variables. Then, by Theorem 2.8, we know that
\[
Z_n = \frac{1}{n^{1/2}}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma} \overset{d}{\to} N(0,1)
\]
and then we conclude that $Z_n^2 \overset{d}{\to} \chi_1^2$.
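The following sketch illustrates Example 2.4 by simulation (the uniform distribution, the sample size and the number of replications are arbitrary assumptions made only to have something to run): the standardized mean behaves like $N(0,1)$ and its square like $\chi^2_1$.

```python
import numpy as np

# Illustration of the Lindeberg-Levy CLT and of Z_n^2 -> chi^2_1.
# Uniform(0,1) draws are an arbitrary choice; mu = 0.5, sigma^2 = 1/12.
rng = np.random.default_rng(2)
n, reps = 500, 20_000
x = rng.uniform(size=(reps, n))
zn = np.sqrt(n) * (x.mean(axis=1) - 0.5) / np.sqrt(1 / 12)
print(np.mean(zn > 1.645))      # close to 0.05, the N(0,1) upper 5% tail
print(np.mean(zn**2 > 3.841))   # close to 0.05, the chi^2_1 upper 5% tail
```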
The next theorem tells us that all we need is to show a CLT for scalar sequences of random variables.

Theorem 2.12. (Cramer-Wold) Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of random variables. Then,
\[
X_n \overset{d}{\to} X \quad\text{iff}\quad \lambda' X_n \overset{d}{\to} \lambda' X \quad \forall \lambda \neq 0 .
\]

Theorem 2.13. (Cramer) Let $\{X_n\}_{n\in\mathbb{N}}$ be a sequence of $(k \times 1)$ vectors of random variables such that $X_n = A_n Z_n$. Suppose that $A_n \overset{p}{\to} A$ (p.s.d.) and $Z_n \overset{d}{\to} N(\mu, \Sigma)$; then $A_n Z_n \overset{d}{\to} N(A\mu, A\Sigma A')$.
Once the model has been set up, the question is how we can make inferences on $\beta$ based on the data $z_t = (y_t, x_t')'$. The general way to do it is by computing a function of $\beta$ and $z_t$, called the objective function, and seeing which value of $\beta$ minimizes such a function. That is,
\[
(4)\qquad \widehat{\beta} = \arg\min_{\beta} Q(\beta; z) = \arg\min_{\beta} Q(\beta).
\]

Proof. By definition,
\begin{align*}
\mathrm{cov}\big(\widehat{\beta}\big) &= E\Big[ \big(\widehat{\beta} - \beta\big)\big(\widehat{\beta} - \beta\big)' \Big] \\
&= E\Big[ (X'X)^{-1}X'UU'X(X'X)^{-1} \Big] \\
&= (X'X)^{-1}X'E(UU')X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{align*}

Thus, Propositions 3.1 and 3.2 tell us that the LSE is an unbiased estimator of $\beta$ with a variance-covariance matrix equal to $\sigma^2(X'X)^{-1}$. Suppose now that we have another unbiased linear estimator, say $\widetilde{\beta} = CY$. Then, how good is $\widetilde{\beta}$ compared to $\widehat{\beta}$?
Estimator of $\sigma^2$.

Up to now we have focused on $\widehat{\beta}$. But the regression model (2) also contains $\sigma^2$. The question is how we can estimate it and what its statistical properties are.

Proof. By definition, $E\big(\widehat{U}'\widehat{U}\big)$ is
\begin{align*}
E\big( U'MU \big) &= E\,\mathrm{tr}\big( U'MU \big) = E\,\mathrm{tr}\big( MUU' \big) \\
&= \mathrm{tr}\,M\,E\big( UU' \big) = \sigma^2\,\mathrm{tr}\,M .
\end{align*}
Now we conclude, since $\mathrm{tr}\,M = \mathrm{tr}(I_T) - \mathrm{tr}\big( (X'X)^{-1}X'X \big) = T - k$.

From Lemma 3.5, we conclude that an unbiased estimator of $\sigma^2$ is
\[
(9)\qquad \widetilde{\sigma}_u^2 = \frac{1}{T-k}\sum_{i=1}^{T}\widehat{u}_i^2 .
\]

Footnote 1: If $Z_1$ and $Z_2$ are independent normally distributed, then $g(Z_1)$ and $h(Z_2)$ are also independent, for any functions $g(\cdot)$ and $h(\cdot)$.
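As an illustration of the degrees-of-freedom correction in (9), the sketch below (the design matrix, $\beta$ and $\sigma^2$ are arbitrary illustrative assumptions) computes the LSE and $\widetilde{\sigma}^2$ across replications and checks that its mean is close to $\sigma^2$.

```python
import numpy as np

# LSE and the unbiased variance estimator of (9).
# Design, beta and sigma^2 are arbitrary illustrative choices.
rng = np.random.default_rng(3)
T, k, sigma2 = 50, 3, 2.0
X = rng.standard_normal((T, k))
beta = np.array([1.0, -0.5, 0.25])

est = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=T)
    b = np.linalg.solve(X.T @ X, X.T @ y)      # LSE
    resid = y - X @ b
    est.append(resid @ resid / (T - k))        # sigma_tilde^2
print(np.mean(est))   # close to sigma2 = 2.0, illustrating unbiasedness
```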
\[
= \sigma^2 E\big[ (X'X)^{-1} \big],
\]

Proposition 3.9. Under A1, A3 and the assumption that $u_i$ is iid, we have that $\widehat{\sigma}^2 \overset{p}{\to} \sigma^2$.

Proof. Because $\widehat{u}_i = u_i - \big(\widehat{\beta} - \beta\big)'x_i$, (8) equals
\[
(14)\qquad \frac{1}{T}\sum_{i=1}^{T} u_i^2 - \big(\widehat{\beta} - \beta\big)'\frac{1}{T}\sum_{i=1}^{T} x_i u_i .
\]
Consider the first term of (14). Because $u_i^2$ is an iid random variable with finite first moment, e.g. $\sigma^2$, by Theorem 2.4 we conclude that
\[
\frac{1}{T}\sum_{i=1}^{T} u_i^2 \overset{p}{\to} \sigma^2 .
\]
The second term of (14) converges in probability to $0$, because $\widehat{\beta} - \beta \overset{p}{\to} 0$ and, as shown in (13), $\frac{1}{T}\sum_{i=1}^{T} x_i u_i \overset{p}{\to} 0$. Thus, by Theorem 2.3,
\[
\frac{1}{T}\sum_{i=1}^{T} u_i^2 - \big(\widehat{\beta} - \beta\big)'\frac{1}{T}\sum_{i=1}^{T} x_i u_i \overset{p}{\to} \sigma^2 .
\]

Because both $u_i$ and $x_i$ are iid, $\lambda' z_i$ is also zero-mean iid with finite variance. By independence of $u_i$ and $x_i$, the first moment of $\lambda' z_i$
and we conclude the proof of (15), and thus part (a), by Theorem 2.13.

Next, part (b). By definition of $\widehat{u}_i$, we have that the left side is
\[
(16)\qquad \frac{1}{T^{1/2}}\sum_{i=1}^{n}\big( u_i^2 - \sigma^2 \big) - \big(\widehat{\beta} - \beta\big)'\frac{1}{T^{1/2}}\sum_{i=1}^{n} x_i x_i'\big(\widehat{\beta} - \beta\big).
\]
But in part (a) we have already shown that $T^{1/2}\big(\widehat{\beta} - \beta\big)$ converges in distribution, and in Theorem 3.8 that $\frac{1}{T}\sum_{i=1}^{n} x_i x_i' \overset{p}{\to} \Sigma_{xx}$. Hence, we conclude that the second term of (16) converges to zero in probability, because the product is a continuous function and $T^{-1/2} \to 0$. Then
\[
T^{1/2}\big( \widehat{\sigma}^2 - \sigma^2 \big) \overset{d}{\to} N\Big( 0, E\big( u_i^2 - \sigma^2 \big)^2 \Big)
\]
by Theorem 2.7.

Remark 3.6. (i) Assumption A5 is not necessary for the results to hold. Moreover, in the case of stochastic regressors and/or nonnormality of the errors, all our arguments will be asymptotic ones.
(ii) If $Eu_i^{4+\delta} < C$ for some arbitrary $\delta > 0$, we could have dropped the condition of iid random variables for $u_i$. In this case, we would use Corollary 2.10 instead of Theorem 2.8.
(iii) If $u_i$ were normally distributed, then $T^{1/2}\big( \widehat{\sigma}^2 - \sigma^2 \big) \overset{d}{\to} N\big( 0, 2\sigma^4 \big)$.
$A$ is a p.s.d. matrix because it is the covariance matrix of the vector $(x_i', z_i')'$. Choosing the vector
\[
a = \begin{pmatrix} I \\ -\Sigma_{zz}^{-1}\Sigma_{zx} \end{pmatrix},
\]
we obtain
\[
0 \le a'Aa = \Sigma_{xx} - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx},
\]
which completes the proof.
Restricted Least Squares.

Suppose that there are $m \le k$ linearly independent linear constraints on the parameters $\beta$ in (2). That is, for known matrices $R$ $(m \times k)$ and $r$ $(m \times 1)$,
\[
R\beta = r, \qquad \mathrm{rank}(R) = m .
\]
How can we estimate $\beta$ subject to $R\beta = r$? To that end, we use the Lagrange-multiplier principle. Let
\[
g(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) - \lambda'(R\beta - r).
\]

\[
\frac{\partial}{\partial\lambda}g(\beta, \lambda) = 0 \quad\Longrightarrow\quad (18)\qquad 0 = R\widetilde{\beta} - r .
\]
Multiplying the right side of (17) by $R(X'X)^{-1}$, we obtain
\[
0 = -2R(X'X)^{-1}X'Y + 2R\widetilde{\beta} - R(X'X)^{-1}R'\widetilde{\lambda},
\]
which, because $\widehat{\beta} = (X'X)^{-1}X'Y$ and $R\widetilde{\beta} = r$, gives $\widetilde{\lambda} = 2\big[ R(X'X)^{-1}R' \big]^{-1}\big( r - R\widehat{\beta} \big)$. Replacing the value of $\widetilde{\lambda}$ into (17), we have
\[
0 = -2X'Y + 2X'X\widetilde{\beta} - 2R'\big[ R(X'X)^{-1}R' \big]^{-1}\big( r - R\widehat{\beta} \big).
\]
\[
(20)\qquad H_0: \beta_1 = 1 \quad\text{vs.}\quad H_1: \beta_1 \neq 1 .
\]
Recall that $\widehat{\beta} - \beta \approx AN\big( 0, \sigma^2(X'X)^{-1} \big)$.

Case 1.

By Theorem 3.6, we know that
\[
\widehat{\beta} - \beta \sim N\big( 0, \sigma^2(X'X)^{-1} \big),
\]
so that
\[
\widehat{\beta}_1 - 1 = (1, 0, \ldots, 0)\big( \widehat{\beta} - \beta \big) \sim N\Big( 0, \sigma^2\big[ (X'X)^{-1} \big]_{1,1} \Big).
\]
Hence, a test at the 5% significance level will be to reject if $|t| > t_{T-k,.025}$, where $t_{T-k,.025}$ is such that $\Pr\{t > t_{T-k,.025}\} = .025$.

Case 2.

When the finite sample distribution is not available, we have to rely on asymptotic results. We know by Theorem 3.10 that
\[
T^{1/2}\big( \widehat{\beta} - \beta \big) \overset{d}{\to} N\big( 0, \sigma^2\Sigma_{xx}^{-1} \big),
\]
that is, $T^{1/2}\big( \widehat{\beta} - \beta \big) \approx AN\big( 0, \sigma^2\Sigma_{xx}^{-1} \big)$.

Proposition 3.12. Assuming A1 and A3, $t \overset{d}{\to} N(0,1)$.

Proof. The proof is a simple application of Theorems 3.10, 2.3 and 2.13.
Remark 3.8. More general hypothesis testing can be done proceeding similarly; however, we will defer it until Section 8.

3.3. VIOLATIONS OF ASSUMPTIONS.

When establishing the statistical properties of the LSE, we have made use of Assumptions A1-A6. Inspecting these assumptions, we can infer that some of them seem to be more crucial or important than others. For instance, we have always assumed so far that
(a) $Eu_t = 0$;
(b) $E(x_t u_t) = 0$, or $x_t$ and $u_t$ independent;
(c) $u_t$ iid.
The question, then, is what would happen to the properties of the LSE if one or more of the previous assumptions are violated. We will answer this question one by one, although (c) will be examined in Section 5.
3.3.1. (a) $Eu_t \neq 0$.

Assume that $Eu_t = \mu \neq 0$. Then, we have the following proposition.

Proposition 3.13. Assume A1-A4, except that now $Eu_t = \mu \neq 0$. Then the LSE $\widehat{\beta}$ is biased (inconsistent) unless $X'\mu = 0$ ($\mathrm{plim}\,X'\mu/T = 0$).

Proof. If $E\widehat{\beta}$ exists, e.g. $X$ is fixed or $E\big[ \|(X'X)^{-1}\|\,\|X'U\| \big] < \infty$, then by the law of iterated expectations we have that
\begin{align*}
E\big( \widehat{\beta} - \beta \big) &= E\Big[ (X'X)^{-1}X'E(U \mid X) \Big] \\
&= E\Big[ (X'X)^{-1}X'\mu \Big].
\end{align*}
But $\mathrm{rank}(X'X) = k$, so $(X'X)^{-1}$ exists. Then,
\[
E\big( \widehat{\beta} - \beta \big) = 0 \quad\text{iff}\quad X'\mu = 0 .
\]
Now, if the $x_t$ are stochastic, then
\[
(21)\qquad \mathrm{plim}\big( \widehat{\beta} - \beta \big) = \mathrm{plim}\left[ \left( \frac{X'X}{T} \right)^{-1}\frac{X'U}{T} \right].
\]
By Theorem 3.10, we have that
\[
\mathrm{plim}\left( \frac{X'X}{T} \right)^{-1} = \Sigma_{xx}^{-1} \neq 0,
\]
whereas
\[
\mathrm{plim}\,\frac{1}{T}\sum_{i=1}^{T} x_i u_i = \mathrm{plim}\,\frac{1}{T}\sum_{i=1}^{T} x_i(u_i - \mu) + \mathrm{plim}\,\frac{1}{T}\sum_{i=1}^{T} x_i\mu .
\]
So, by Theorem 2.3, we conclude that $\widehat{\beta}$ is consistent iff $\mathrm{plim}\,T^{-1}\sum_{t=1}^{T} x_t u_t = 0$, because $\Sigma_{xx}^{-1} \neq 0$. So $\widehat{\beta}$ is inconsistent, because $E(x_t)\mu \neq 0$.

A further result, without proof, is the following.

Theorem 3.14. If $Eu_t = \mu \neq 0$, then either $\widehat{\beta}$ or $\widehat{\sigma}^2$ (or both) must be biased. In addition, as long as $\lim_{T\to\infty}(\mu'\mu)/T \neq 0$, then either $\widehat{\beta}$ or $\widehat{\sigma}^2$ (or both) must be inconsistent.
There is a special case: when the regression contains a constant, e.g.
\[
y_t = \alpha + \beta' x_t + u_t .
\]

Theorem 3.15. Consider the latter regression model. Assume A1-A4 except that $Eu_t = \mu \neq 0$. Then the LSE of $\beta$ is unbiased (consistent).

Proof. Left as an exercise.

Thus, a nonzero mean of the error term has no serious consequences, as the slope parameters can always be consistently estimated.
3.3.2. (b) $x_i$ and $u_i$ are not uncorrelated.

We shall assume that $x_i$ is stochastic. Examples are quite common, perhaps the leading one being the simultaneous equation model; see Section 7. Another example is the transformed Box-Cox model
\[
\frac{y_t^{\lambda} - 1}{\lambda} = \beta_0' x_t + u_t ,
\]
where the parameters are $(\lambda, \beta_0')'$. The most immediate consequence is that the LSE of $\beta$ is no longer consistent, as the following proposition shows.

Proposition 3.16. Assume A1 and A3, except that $x_t$ and $u_t$ are correlated. Then, the LSE of $\beta$ is inconsistent.

Proof. By definition,
\[
(22)\qquad \widehat{\beta} - \beta = \left( \frac{X'X}{T} \right)^{-1}\frac{X'U}{T} .
\]
As we have already seen, the first factor on the right of (22) converges in probability to $\Sigma_{xx}^{-1}$. Next, by Theorem 2.4, the second factor on the right of (22) is
\[
\frac{X'U}{T} = \frac{1}{T}\sum_{i=1}^{T} x_i u_i = \frac{1}{T}\sum_{i=1}^{T} w_i \overset{p}{\to} \mu ,
\]
since $x_i$ and $u_i$ are iid, which implies that $w_i$ is too, and $E(x_i u_i) = \mu \neq 0$. Thus, by Theorem 2.3, we obtain that
\[
\mathrm{plim}\big( \widehat{\beta} - \beta \big) = \Sigma_{xx}^{-1}\mu \neq 0 .
\]
So the LSE is inconsistent.

Remark 3.9. The estimator of $\sigma^2$ in (8) is inconsistent as well, since
\[
\frac{1}{T}\sum_{t=1}^{T}\widehat{u}_t^2 = \frac{U'MU}{T} \overset{p}{\to} \sigma^2 - \mu'\Sigma_{xx}^{-1}\mu \neq \sigma^2 .
\]
The question is: what can we do? Suppose that we have a set of $k$ variables $z_t$ such that (a) they are uncorrelated with $u_t$ and (b) they are correlated with $x_t$, that is, $E(z_t x_t') = \Sigma_{zx} \neq 0$. Consider the estimator of $\beta$
\[
(23)\qquad \widetilde{\beta} = \big( Z'X \big)^{-1}Z'Y
\]
(compare with the estimator given in Theorem 3.11). This estimator is called the INSTRUMENTAL VARIABLE ESTIMATOR (IVE). Its statistical properties are given in the next theorem.

Theorem 3.17. Under the conditions of Proposition 3.16, let $z_t$ be a $(k \times 1)$ vector of iid random variables such that $E(z_t x_t') = \Sigma_{zx} \neq 0$ and $E(z_t u_t) = 0$. Then,
(a) $\widetilde{\beta} \overset{p}{\to} \beta$;
(b) $T^{1/2}\big( \widetilde{\beta} - \beta \big) \overset{d}{\to} N\big( 0, \sigma^2\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1} \big)$, where $\Sigma_{zz} = E(z_t z_t')$.
Proof. We begin with (a). By definition, (23) gives
\[
(24)\qquad \widetilde{\beta} - \beta = \left( \frac{Z'X}{T} \right)^{-1}\frac{Z'U}{T} .
\]
Now, by Theorem 2.4,
\[
\frac{Z'X}{T} = \frac{1}{T}\sum_{t=1}^{T} z_t x_t' \overset{p}{\to} \Sigma_{zx}
\]
because $w_t = z_t x_t'$ is an iid sequence of random variables with finite first moment $\Sigma_{zx}$, whereas Theorem 2.3 implies, because $\Sigma_{zx} \neq 0$, that
\[
\left( \frac{Z'X}{T} \right)^{-1} \overset{p}{\to} \Sigma_{zx}^{-1} .
\]
On the other hand, the second factor of expression (24) is
\[
\frac{1}{T}\sum_{t=1}^{T} z_t u_t \overset{p}{\to} 0
\]
by Theorem 2.4, because $E(z_t u_t) = 0$. That concludes the proof of part (a).

Next we show (b). By definition,
\[
(25)\qquad T^{1/2}\big( \widetilde{\beta} - \beta \big) = \left( \frac{Z'X}{T} \right)^{-1}\frac{1}{T^{1/2}}\sum_{t=1}^{T} z_t u_t .
\]
By Theorem 2.13, it suffices to show that the first factor on the right of (25) converges in probability to a well defined limit, whereas the second factor on the right of (25) satisfies the CLT. In view of part (a) and Theorem 2.12, it suffices to show the convergence of
\[
\frac{1}{T^{1/2}}\sum_{t=1}^{T}\lambda' z_t u_t , \qquad \lambda \neq 0 .
\]
But $w_t = \lambda' z_t u_t$ is iid with zero mean and $E\big( w_t^2 \big) = E\big( \lambda' z_t u_t u_t z_t'\lambda \big) = \lambda'E(z_t z_t')\lambda\,E\big( u_t^2 \big) = \sigma^2\lambda'\Sigma_{zz}\lambda$. So the second factor on the right of (25) converges in distribution to $N\big( 0, \sigma^2\Sigma_{zz} \big)$, and because $\big( Z'X/T \big)^{-1} \overset{p}{\to} \Sigma_{zx}^{-1}$, by Theorem 2.13 we conclude the proof of part (b).
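A small simulation sketch of Theorem 3.17 (the data-generating process, including how the endogeneity and the instrument are constructed, is entirely an illustrative assumption): the LSE drifts away from $\beta$ while the IVE (23) does not.

```python
import numpy as np

# LSE vs IVE (23) when E(x_t u_t) != 0; scalar regressor and one instrument.
rng = np.random.default_rng(4)
T, beta = 5000, 1.0
z = rng.standard_normal(T)                 # instrument, independent of u
e = rng.standard_normal(T)
u = rng.standard_normal(T) + 0.8 * e       # error
x = 0.7 * z + e                            # regressor correlated with u via e
y = beta * x + u

b_ls = (x @ y) / (x @ x)                   # LSE: inconsistent here
b_iv = (z @ y) / (z @ x)                   # IVE: consistent
print(b_ls, b_iv)                          # b_ls sits above 1, b_iv near 1
```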
Remark 3.10. (i) From the result of the above theorem, we notice that the asymptotic variance of the IVE depends very much on the correlation between the regressors and the instruments: the higher this correlation, the smaller the variance.
(ii) If the LSE were consistent and asymptotically normal, then $\sigma^2(X'X)^{-1}$ would be smaller than $\sigma^2\Sigma_{zx}^{-1}\Sigma_{zz}\Sigma_{xz}^{-1}$, as the correlation between the regressors and themselves is maximal, e.g. 1. See also the proof of Theorem 3.11.

The intuition behind the LSE is that we try to find the value in $S(X)$ closest to $Y$. What is that point? It is just the projection of $Y$ onto $S(X)$. Recall that the LSE minimizes $(Y - X\beta)'(Y - X\beta)$. Thus, one can view this as splitting the space into $S(X)$ and $S^{\perp}(X)$, on which $\widehat{U}$ lies. Because
\[
y_t = \beta' x_t + u_t ,
\]
we can regard $y_t$ as the sum (in the vector sense) of $\beta' x_t$ and $u_t$. To say that $u_t$ and $x_t$ are uncorrelated is to say that $u_t \in S^{\perp}(X)$. Now, if $u_t$ is not uncorrelated with $x_t$, it means in geometrical terms that $u_t \notin S^{\perp}(X)$.

Thus, because $x_t \not\perp u_t$, the projection of $y_t$ onto $S(X)$ will be different from $\beta' x_t$, e.g. its value depends on $u_t$. The reason is that what the LSE does is to find the closest point on $S(X)$ that explains $y_t$, and if $u_t$ carries information on $x_t$, then we end up estimating
\[
\beta' x_t + E[u_t \mid x_t] = (\beta + \gamma)' x_t .
\]
So, what does the IVE do? Basically the following. Obtain the space spanned by $Z$, say $S(Z)$, and then minimize the distance from $Y$ to $S(X)$ that lies in $S(Z)$. In mathematical terms:

1st) Regress $X$ on $Z$, e.g. $Z(Z'Z)^{-1}Z'X = P_Z X$, where $P_Z$ is the matrix that projects onto $S(Z)$.

2nd) Then find $\widetilde{\beta}$ that minimizes the distance between $Y$ and $P_Z X\beta$,
\begin{align*}
\widetilde{\beta} &= \big( X'P_Z X \big)^{-1}X'P_Z Y \\
&= \Big( X'Z\big(Z'Z\big)^{-1}Z'X \Big)^{-1}X'Z\big(Z'Z\big)^{-1}Z'Y ,
\end{align*}
and the corresponding fitted values would be used as the instrument. Thus, in our particular framework, the optimal instrument will be
\[
\widetilde{Z} = Z\big( Z'Z \big)^{-1}Z'X
\]
and so the IVE becomes
\begin{align*}
\widetilde{\beta} &= \Big( \widetilde{Z}'X \Big)^{-1}\widetilde{Z}'Y \\
&= \Big( X'Z\big(Z'Z\big)^{-1}Z'X \Big)^{-1}X'Z\big(Z'Z\big)^{-1}Z'Y \\
(26)\quad &= \big( X'P_Z X \big)^{-1}X'P_Z Y .
\end{align*}
This estimator is called GIVE (generalized instrumental variable estimator). But is it the best? The next lemma will give us the answer.
Lemma 3.18. Consider the linear regression model $Y = X\beta + U$. Assume A1 to A4 except that $E(x_t u_t) \neq 0$. Consider two sets of instruments $Z_1$ and $Z_2$ such that $S(Z_1) \subset S(Z_2)$. Then,
\[
\widetilde{\beta}_2 = \big( X'P_{Z_2}X \big)^{-1}X'P_{Z_2}Y
\]
has an asymptotic variance-covariance matrix not bigger than that of
\[
\widetilde{\beta}_1 = \big( X'P_{Z_1}X \big)^{-1}X'P_{Z_1}Y ,
\]
where $P_{Z_i} = Z_i\big( Z_i'Z_i \big)^{-1}Z_i'$ for $i = 1, 2$.

Proof. By Theorem 3.17, we know that
\[
T^{1/2}\big( \widetilde{\beta}_i - \beta \big) \overset{d}{\to} N\Big( 0, \sigma^2\,\mathrm{plim}\big( T^{-1}X'P_{Z_i}X \big)^{-1} \Big), \qquad i = 1, 2 .
\]
Thus, it suffices to show that $\big( X'P_{Z_2}X \big)^{-1} \le \big( X'P_{Z_1}X \big)^{-1}$, or that
\[
0 \le X'\big( P_{Z_2} - P_{Z_1} \big)X .
\]
Let us examine the matrix $P_{Z_2} - P_{Z_1}$ first. We already know that if a matrix is idempotent then it is p.s.d., which is the case for $P_{Z_2} - P_{Z_1}$. Why? Because $P_{Z_i}$ is a projection matrix, and hence idempotent, we have that
\[
\big( P_{Z_2} - P_{Z_1} \big)\big( P_{Z_2} - P_{Z_1} \big) = P_{Z_2} + P_{Z_1} - P_{Z_2}P_{Z_1} - P_{Z_1}P_{Z_2} .
\]
But $S(Z_1) \subset S(Z_2)$, which implies that $P_{Z_1}P_{Z_2} = P_{Z_2}P_{Z_1} = P_{Z_1}$. Thus,
\[
\big( P_{Z_2} - P_{Z_1} \big)\big( P_{Z_2} - P_{Z_1} \big) = P_{Z_2} - P_{Z_1},
\]
which concludes the proof.
\[
\mathrm{Cov}\big( \widetilde{\theta} \big) \ge I^{-1}(\theta),
\]

Remark 4.2. Notice that the value $\widehat{\theta}$ which maximizes $\ell(\theta; y)$ is the same as that which maximizes $L(\theta; y)$.

Since the observations are iid, the objective function for the MLE is
\[
Q(\theta) = \ell(\theta; y) = \sum_{t=1}^{T}\ell(\theta; y_t),
\]
where $\theta = (\beta', \sigma^2)'$. Now, taking logs in (31), we have that
\[
\ell(\theta; y, x) = C - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\big( y_t - \beta' x_t \big)^2 ,
\]
So, the method is very simple and quite appealing. However, the method becomes very intractable when the dimension of the vector $\theta$ becomes greater than two, as one can imagine. In addition, the method does not work very well if the objective function has a deep trough near the minimum.

Then, for a specific value of $\beta_1$, say $\widetilde{\beta}_1$, $\widetilde{\beta}_1 x_{t1}$ and $x_{t2} + \widetilde{\beta}_1 x_{t3}$ are known (or can be taken as observed variables), so that $\beta_2$ now enters in a linear form. Thus, we can employ LS techniques to obtain the value of $\beta_2$ which minimizes
\[
(33)\qquad \sum_{t=1}^{T}\Big( y_t - \widetilde{\beta}_1 x_{t1} - \beta_2\big( x_{t2} + \widetilde{\beta}_1 x_{t3} \big) \Big)^2 ,
\]
obtaining $\widetilde{\beta}_2^{1}$. So, we can iterate the procedure back and forth until the prescribed degree of accuracy is reached. That is, we fix $\widetilde{\beta}_2$ and obtain an estimate of $\beta_1$, say $\widehat{\beta}_1^{1}$. With this guess of $\beta_1$, we obtain a new estimate of $\beta_2$, $\widehat{\beta}_2^{1}$, via the minimization of (33), and so on.

For model (b) we have similar issues. That is, for fixed $\beta_3$, say $\widetilde{\beta}_3$, $\beta_1$ and $\beta_2$ can be obtained by LS techniques in
\[
\sum_{t=1}^{T}\Big( y_t - \beta_1 x_{t1} - \beta_2\big( x_{t2} - \widetilde{\beta}_3 \big)^{-1} \Big)^2 ,
\]
where our regressors are $x_{t1}$ and $\big( x_{t2} - \widetilde{\beta}_3 \big)^{-1}$. Then, given values of $\beta_1$ and $\beta_2$, say $\widetilde{\beta}_1$ and $\widetilde{\beta}_2$ respectively, the estimate of $\beta_3$ can be obtained by a grid search approach in
\[
\widetilde{Q}_T(\beta_3) = \sum_{t=1}^{T}\Big( y_t - \widetilde{\beta}_1 x_{t1} - \widetilde{\beta}_2\big( x_{t2} - \beta_3 \big)^{-1} \Big)^2 .
\]
Then, as in model (a), the procedure is iterated until the desired level of accuracy is reached.
4.3.3. NEWTON-RAPHSON.

The Newton-Raphson approach is based on quadratic approximations of the objective function $Q_T(\theta)$. Suppose that $Q_T(\theta)$ has two continuous derivatives. Applying a Taylor expansion up to its second term around a point $\theta^*$, say, we obtain
\[
Q_T(\theta) = Q_T(\theta^*) + \frac{\partial}{\partial\theta'}Q_T(\theta^*)(\theta - \theta^*) + \frac{1}{2}(\theta - \theta^*)'\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big)(\theta - \theta^*),
\]
where $\overline{\theta} = \lambda\theta + (1-\lambda)\theta^*$ and $\lambda \in [0,1]$. Now, taking derivatives with respect to $\theta$ on both sides of the last displayed equation and evaluating at $\widehat{\theta}$ (the value which minimizes $Q_T(\theta)$), we obtain
\[
0 = q\big(\widehat{\theta}\big) = q(\theta^*) + \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big)\big(\widehat{\theta} - \theta^*\big),
\]
which implies that
\[
(34)\qquad \widehat{\theta} = \theta^* - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big) \right)^{-1}q(\theta^*).
\]
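A minimal sketch of the Newton-Raphson iteration (34) for a generic smooth objective (the quartic test objective, the starting value and the stopping rule below are illustrative assumptions, not part of the notes).

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-6, max_iter=50):
    """Iterate theta <- theta - H(theta)^{-1} q(theta), cf. (34)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Illustrative objective Q(theta) = sum_i (theta_i - c_i)^4 (arbitrary choice).
c = np.array([1.0, -2.0])
grad = lambda th: 4 * (th - c) ** 3
hess = lambda th: np.diag(12 * (th - c) ** 2)
print(newton_raphson(grad, hess, theta0=[3.0, 3.0]))   # converges towards c
```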
4.3.4. GAUSS-NEWTON.

This method was designed to obtain the nonlinear least squares estimator, that is, to minimize
\[
(35)\qquad Q_T(\theta) = \sum_{t=1}^{T} u_t^2(\theta), \qquad u_t(\theta) = y_t - f(x_t; \theta).
\]
Here
\[
\frac{\partial}{\partial\theta}Q_T(\theta) = 2\sum_{t=1}^{T} u_t(\theta)\frac{\partial}{\partial\theta}u_t(\theta)
\]
and
\[
(36)\qquad \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\theta) = 2\sum_{t=1}^{T}\left( \frac{\partial}{\partial\theta}u_t(\theta)\frac{\partial}{\partial\theta'}u_t(\theta) + u_t(\theta)\frac{\partial^2}{\partial\theta\,\partial\theta'}u_t(\theta) \right).
\]
The second term on the right of (36) is zero in expectation, or at least negligible for $T$ large enough, compared to the first term on the right of (36). Thus, we can base our iteration procedure on
\[
(37)\qquad \theta^{i+1} = \theta^{i} - \left( \sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}\frac{\partial u_t(\theta^i)}{\partial\theta'} \right)^{-1}\sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}u_t(\theta^i).
\]
Writing $w_t = \partial f(x_t; \theta^i)/\partial\theta = -\partial u_t(\theta^i)/\partial\theta$, this is
\[
\theta^{i+1} = \theta^{i} + \left( \sum_{t=1}^{T} w_t w_t' \right)^{-1}\sum_{t=1}^{T} u_t\big(\theta^i\big)w_t .
\]
Equivalently, define the auxiliary dependent variable
\[
y_t^{*} = u_t\big(\theta^i\big) + w_t'\theta^i .
\]
Then $\theta^{i+1}$ is the LSE in the regression of $y_t^{*}$ on $w_t$, i.e.
\begin{align*}
\theta^{i+1} &= \left( \sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}\frac{\partial u_t(\theta^i)}{\partial\theta'} \right)^{-1}\sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}\left( \frac{\partial u_t(\theta^i)}{\partial\theta'}\theta^i - u_t\big(\theta^i\big) \right) \\
&= \theta^{i} - \left( \sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}\frac{\partial u_t(\theta^i)}{\partial\theta'} \right)^{-1}\sum_{t=1}^{T}\frac{\partial u_t(\theta^i)}{\partial\theta}u_t\big(\theta^i\big).
\end{align*}
The weakness of this algorithm is the same as that of the Newton-Raphson, and the procedure to correct for it is exactly the same. The difference between Newton-Raphson and Gauss-Newton is that the former requires second derivatives, whereas the latter only requires the first ones.
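A sketch of the Gauss-Newton step (37) for a scalar nonlinear regression (the model $f(x; \theta) = \exp(\theta x)$, the starting value and the simulated data are assumptions made only for illustration).

```python
import numpy as np

# Gauss-Newton iteration (37) for u_t(theta) = y_t - f(x_t; theta),
# illustrated with f(x; theta) = exp(theta * x), an arbitrary example model.
rng = np.random.default_rng(5)
T, theta_true = 200, 0.5
x = rng.uniform(-1, 1, size=T)
y = np.exp(theta_true * x) + 0.1 * rng.standard_normal(T)

theta = 0.0                                   # starting value
for _ in range(20):
    u = y - np.exp(theta * x)                 # u_t(theta)
    du = -x * np.exp(theta * x)               # du_t/dtheta
    step = (du @ u) / (du @ du)               # scalar version of (37)
    theta = theta - step
print(theta)    # close to theta_true = 0.5
```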
Consider the linear regression model
\[
y_t = \beta' x_t + u_t , \qquad t = 1, \ldots, T ,
\]
where $\theta = (\beta', \sigma^2)'$, and the log-likelihood
\[
\ell(\theta; y, x) = C - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\big( y_t - \beta' x_t \big)^2 .
\]
The first order conditions are
\[
(38)\qquad \frac{\partial}{\partial\beta}\ell(\theta; y, x) = \frac{1}{\widehat{\sigma}^2}\sum_{t=1}^{T}\big( y_t - \widehat{\beta}' x_t \big)x_t = 0,
\]
\[
(39)\qquad \frac{\partial}{\partial\sigma^2}\ell(\theta; y, x) = -\frac{T}{2\widehat{\sigma}^2} + \frac{1}{2\widehat{\sigma}^4}\sum_{t=1}^{T}\big( y_t - \widehat{\beta}' x_t \big)^2 = 0 .
\]
From (38) we obtain that the MLE of $\beta$ is $\widehat{\beta} = \big( \sum_{t=1}^{T} x_t x_t' \big)^{-1}\sum_{t=1}^{T} x_t y_t$, and plugging $\widehat{\beta}$ back into (39) we obtain that the MLE of $\sigma^2$ is $\widehat{\sigma}^2 = T^{-1}\sum_{t=1}^{T}\big( y_t - \widehat{\beta}' x_t \big)^2$.

This coincides with the estimator given in Theorem 3.6. So, not only is the LSE BLUE, but it is also Cramér-Rao efficient when the errors are Gaussian. Observe that since $u_t$ is normally distributed, $Eu_t^4 = 3\sigma^4$, and $\widehat{\beta}$ and $\widehat{\sigma}^2$ are independent.

If $x_t$ were stochastic, we would have the same conclusions, and
\[
T^{1/2}\begin{pmatrix} \widehat{\beta} - \beta \\ \widehat{\sigma}^2 - \sigma^2 \end{pmatrix} \overset{d}{\to} N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2\Sigma_{xx}^{-1} & 0 \\ 0 & 2\sigma^4 \end{pmatrix} \right).
\]
In the present context we obtain $I(\theta)$ by taking the probability limit of $T^{-1}$ times the Hessian matrix; that is, the Cramér-Rao bound becomes
\[
\left( -\,\mathrm{p\,lim}\,\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(\theta; y, x) \right)^{-1} = \mathrm{diag}\big( \sigma^2\Sigma_{xx}^{-1},\ 2\sigma^4 \big).
\]

Remark 4.4. In general $\widehat{\beta}$ and $\widehat{\sigma}^2$ are not independent, for instance if the distribution of $u_t$ is not symmetric.
and
\[
(42)\qquad T^{-1}Q_T(\theta_0) < Q(\theta_0) + \varepsilon/2 .
\]
On the other hand, by definition of $\widehat{\theta}$, we know that
\[
(43)\qquad Q_T\big(\widehat{\theta}\big) \le Q_T(\theta_0).
\]
So, combining (43) and (41) yields
\[
Q\big(\widehat{\theta}\big) < T^{-1}Q_T(\theta_0) + \varepsilon/2 ,
\]
and adding the last displayed inequality to (42), we obtain
\[
Q\big(\widehat{\theta}\big) + T^{-1}Q_T(\theta_0) < Q(\theta_0) + T^{-1}Q_T(\theta_0) + \varepsilon ,
\]
which implies that
\[
Q\big(\widehat{\theta}\big) < Q(\theta_0) + \varepsilon .
\]
So, we have shown that $A_T$ implies that $Q\big(\widehat{\theta}\big) - Q(\theta_0) < \varepsilon$, and hence, together with (40), it implies that $\widehat{\theta} \in N$. Hence, we have shown that
\[
\Pr\{A_T\} \le \Pr\big\{ \widehat{\theta} \in N \big\}.
\]
But moreover, Condition C implies that $\Pr\{A_T\} \to 1$, and thus
\[
\Pr\big\{ \widehat{\theta} \in N \big\} \to 1 .
\]
That is,
\[
\Pr\big\{ \big| \widehat{\theta} - \theta_0 \big| \ge \delta \big\} \to 0
\]
for any arbitrary $\delta > 0$. Observe that $N$ is a neighbourhood of $\theta_0$, that is, $N = \{\theta : |\theta - \theta_0| < \delta\}$ for arbitrary $\delta > 0$.

What we should do now is try to give more primitive and easier-to-check conditions on $y_t$, $x_t$ and the function $Q_T(\theta)$ under which Conditions A, B and C hold true. As far as Conditions A and B are concerned, they are pretty easy to check, so we do not need to give any further conditions. The only condition which is a bit more problematic is Condition C. Therefore, we are going to give a set of conditions which will guarantee or imply Condition C. This will be done in the form of a lemma.
Lemma 5.2. Let $g(z; \theta)$ be a function of $z$ and $\theta$, where $\theta \in \Theta$. Assume that
(i) $\Theta$ is a compact set in $\mathbb{R}^k$;
(ii) $g(z; \theta)$ is continuous in $\theta$ for all $z$;
(iii) $E\big( g(z; \theta) \big) = 0$;
(iv) $z_1, \ldots, z_T$ are iid such that $\sup_{\theta\in\Theta}|g(z_t; \theta)| \le L(z_t)$ with $EL(z_t) < \infty$.
Then $\sup_{\theta\in\Theta}\big| T^{-1}\sum_{t=1}^{T}g(z_t; \theta) \big| \overset{p}{\to} 0$.

Proof. We are going to use assumptions (i) and (ii) heavily, together with the fact that these two assumptions imply that $g(z; \theta)$ is uniformly continuous in $\theta$. Because $\Theta$ is a compact set, there exists a partition of $\Theta$, say $\Theta_1, \ldots, \Theta_n$, such that
\[
\Theta = \cup_{i=1}^{n}\Theta_i \quad\text{and}\quad \Theta_i \cap \Theta_j = \emptyset \ \text{ for all } i \neq j .
\]
Also, we can choose the partition such that for all $\theta^1, \theta^2 \in \Theta_i$, $i = 1, \ldots, n$,
\[
(44)\qquad \big| \theta^1 - \theta^2 \big| < \eta
\]
for any arbitrarily small $\eta > 0$.

Let $\theta^1, \ldots, \theta^n$ be a sequence of $\theta$'s such that $\theta^i \in \Theta_i$. We need to show that, for any arbitrary $\varepsilon > 0$,
\[
\lim_{T\to\infty}\Pr\left\{ \sup_{\theta\in\Theta}\left| \frac{1}{T}\sum_{t=1}^{T} g(z_t; \theta) \right| > \varepsilon \right\} = 0 .
\]
So, using the last displayed inequality, we have that the right side of (45) is bounded by
\[
(46)\qquad \sum_{i=1}^{n}\Pr\left\{ \left| \frac{1}{T}\sum_{t=1}^{T} g\big(z_t; \theta^i\big) \right| > \varepsilon/2 \right\} + \sum_{i=1}^{n}\Pr\left\{ \sup_{\theta\in\Theta_i}\left| \frac{1}{T}\sum_{t=1}^{T}\big( g(z_t; \theta) - g\big(z_t; \theta^i\big) \big) \right| > \varepsilon/2 \right\}.
\]
From here, to complete the proof it suffices to show that both terms of (46) converge to zero.

We begin by showing that the second term on the right of (46) converges to zero. By Markov's inequality and the fact that the $z_t$ are iid, that term is bounded by
\[
\frac{2}{\varepsilon}\sum_{i=1}^{n}E\sup_{\theta\in\Theta_i}\big| g(z_t; \theta) - g\big(z_t; \theta^i\big) \big| .
\]
But for all $z$, the continuity of $g(z; \theta)$ and (44) imply that $\sup_{\theta\in\Theta_i}\big| g(z; \theta) - g\big(z; \theta^i\big) \big| \to 0$, whereas by condition (iv) we obtain that
\[
\sup_{\theta\in\Theta_i}\big| g(z_t; \theta) - g\big(z_t; \theta^i\big) \big| \le 2\sup_{\theta\in\Theta}\big| g(z_t; \theta) \big| \le 2L(z_t),
\]
which completes the proof that the second term on the right of (46) converges to zero.

To finish the proof, it suffices to show that the first term of (46) also converges to zero. But this is the case because the $z_t$ are iid, which implies that $g\big(z_t; \theta^i\big)$ is also an iid sequence of random variables with finite first moments. So Khintchine's (or Kolmogorov's) theorem, together with $Eg\big(z_t; \theta^i\big) = 0$, implies that the last expression converges to zero.
The next lemma specializes the above result to nonlinear least squares. Consider
\[
y_t = f(x_t; \theta) + u_t , \qquad t = 1, \ldots, T .
\]

Lemma 5.3. Assume that $y_t$ is scalar, $\theta_0$ is a $(k \times 1)$ vector of unknown parameters, $u_t$ is a sequence of iid $(0, \sigma^2)$ random variables and $x_t$ is deterministic. In addition, assume that
(A) $(\partial/\partial\theta)f(x; \theta)$ is a continuous function of $\theta \in \Theta$ for all $x$.
(B) $f(x; \theta)$ is continuous in $\theta$ for all $x$ (uniformly). That is,
\[
\big| f\big(x; \theta^1\big) - f\big(x; \theta^2\big) \big| < \varepsilon
\]
whenever $\big| \theta^1 - \theta^2 \big| < \eta$, for all $\theta^1, \theta^2 \in \Theta$ and for all $x$.
(C) Uniformly in $\theta^1, \theta^2 \in \Theta$,
\[
\frac{1}{T}\sum_{t=1}^{T} f\big(x_t; \theta^1\big)f\big(x_t; \theta^2\big) \to \psi\big(x; \theta^1, \theta^2\big).
\]
(D) If $\theta \neq \theta_0$,
\[
\frac{1}{T}\sum_{t=1}^{T}\big| f(x_t; \theta_0) - f(x_t; \theta) \big|^2 \to \phi(x; \theta_0, \theta) > 0 .
\]
as we now show. The arguments to be used are similar to those of the preceding lemma. First, it is obvious that
\[
(48)\qquad \frac{1}{T}\sum_{t=1}^{T} f(x_t; \theta_0)\,u_t(\theta_0) \overset{p}{\to} 0,
\]

where $\overline{\theta}$ is an intermediate point between $\widehat{\theta}$ and $\theta_0$. Remember that in general we had
\[
\widehat{\theta} - \theta_0 = -\left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big) \right)^{-1}\frac{\partial}{\partial\theta}Q_T(\theta_0).
\]
So far we have given sufficient conditions under which $\widehat{\beta} - \beta_0 \overset{p}{\to} 0$ or $\widehat{\theta} - \theta_0 \overset{p}{\to} 0$. So, what about the limiting distribution? We will focus on nonlinear least squares, but once again, you can guess that the results and arguments apply to the MLE as well.

Lemma 5.4. Assume (A)-(D) of Lemma 2 and, in addition,
i. $\theta_0$ is an interior point of $\Theta$.
ii. Uniformly for all $\theta \in N(\theta_0)$, a neighbourhood of $\theta_0$,
\[
\frac{1}{T}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f(x_t; \theta)\frac{\partial}{\partial\theta'}f(x_t; \theta) \to \Phi(\theta) > 0 .
\]
We begin with the second factor on the right of (51). By standard differentiation,
\[
\frac{1}{T^{1/2}}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}\big( y_t - f(x_t; \theta_0) \big)^2 = -\frac{2}{T^{1/2}}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f(x_t; \theta_0)\,u_t \overset{d}{\to} N\big( 0, 4\sigma^2 D \big)
\]
by iii). Next, we examine the first factor on the right of (51). First,
\[
(52)\qquad \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}\big( y_t - f\big(x_t; \widetilde{\theta}\big) \big)^2 = \frac{2}{T}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f\big(x_t; \widetilde{\theta}\big)\frac{\partial}{\partial\theta'}f\big(x_t; \widetilde{\theta}\big) - \frac{2}{T}\sum_{t=1}^{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}f\big(x_t; \widetilde{\theta}\big)\,u_t\big(\widetilde{\theta}\big).
\]
Because $\widetilde{\theta}$ is an intermediate point between $\theta_0$ and $\widehat{\theta}$, it follows that $\widetilde{\theta} \overset{p}{\to} \theta_0$. Moreover, since the convergence is uniform in $\theta$ by ii), we have that
\[
\frac{2}{T}\sum_{t=1}^{T}\frac{\partial}{\partial\theta}f\big(x_t; \widetilde{\theta}\big)\frac{\partial}{\partial\theta'}f\big(x_t; \widetilde{\theta}\big) \overset{p}{\to} 2\Phi(\theta_0) > 0 .
\]
On the other hand, iii) implies that, uniformly in $\theta \in N(\theta_0)$,
\[
\frac{1}{T^2}\sum_{t=1}^{T}\left( \frac{\partial^2 f(x_t; \theta)}{\partial\theta\,\partial\theta'} \right)^2 \to 0 .
\]
case, the LSE is $T^{1/2}$-consistent, although less efficient than the MLE, assuming that the probability distribution function of the errors $u_t$ is known. The question is the following. Suppose that we implement an iterative procedure to obtain the MLE and we start the algorithm using the LSE. Then, what are the properties of the estimator after one iteration? Such estimators are known as TWO-STEP estimators.

Theorem 5.5. Let $\{y_t\}_{t=1}^{T}$ be a sequence of iid random variables with probability density $f(y; \theta)$. Assume that $\theta^1$ is a preliminary $T^{1/2}$-consistent estimator of $\theta$. Denote by $Q_T(\theta)$ the likelihood function; then
\[
(53)\qquad \theta^2 = \theta^1 - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}q\big(\theta^1\big)
\]
is Cramér-Rao efficient.
Proof. Subtracting $\theta_0$ from both sides of (53), we have that
\begin{align*}
\theta^2 - \theta_0 &= \theta^1 - \theta_0 - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}q\big(\theta^1\big) \\
&= \theta^1 - \theta_0 - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\big( q\big(\theta^1\big) - q(\theta_0) \big) - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}q(\theta_0).
\end{align*}
We have already mentioned that
\[
-\left( \frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\frac{1}{T^{1/2}}\,q(\theta_0) \overset{d}{\to} N\left( 0, \left( -\,\mathrm{p\,lim}\,\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\theta_0) \right)^{-1} \right) \equiv N\big( 0, I^{-1}(\theta_0) \big).
\]
So, by Theorem 2.7, it suffices to show that
\[
T^{1/2}\left( \theta^1 - \theta_0 - \left( \frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\big( q\big(\theta^1\big) - q(\theta_0) \big) \right) \overset{p}{\to} 0 .
\]
By the mean value theorem, the left side of the last displayed expression is
\begin{align*}
& T^{1/2}\big( \theta^1 - \theta_0 \big) - \left( \frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big)\,T^{1/2}\big( \theta^1 - \theta_0 \big) \\
&= \left[ I - \left( \frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big) \right]T^{1/2}\big( \theta^1 - \theta_0 \big),
\end{align*}
where $\overline{\theta}$ is an intermediate point between $\theta^1$ and $\theta_0$. But, by assumption, $T^{1/2}\big( \theta^1 - \theta_0 \big)$ converges in distribution, so to complete the proof it suffices to show that
\[
I - \left( \frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big) \right)^{-1}\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big) \overset{p}{\to} 0 .
\]
However, since $\theta^1 \overset{p}{\to} \theta_0$, it follows that $\overline{\theta}$ is also consistent. So, using the fact that the function $\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T(\theta)$ is continuous, $\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\theta^1\big)$ and $\frac{1}{T}\frac{\partial^2}{\partial\theta\,\partial\theta'}Q_T\big(\overline{\theta}\big)$ converge to the same matrix, which is positive definite, and therefore, by Theorem 2.3, the left side of the last displayed expression converges to zero in probability, which completes the proof of the theorem.
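A sketch of the two-step idea in (53): start from a root-$T$ consistent but inefficient estimator and take one Newton step on the log-likelihood. The iid logistic location model used below is an assumption chosen only because its MLE has no closed form; the starting estimator is the sample mean.

```python
import numpy as np

# Two-step (one Newton iteration) estimator in the spirit of (53).
# Assumed model: y_t iid logistic with unknown location theta0.
rng = np.random.default_rng(6)
theta0_true = 1.0
y = theta0_true + rng.logistic(size=2000)

def score_and_hess(theta):
    s = 1.0 / (1.0 + np.exp(-(theta - y)))     # logistic cdf at theta - y_t
    q = np.sum(1.0 - 2.0 * s)                  # score of the location model
    h = -2.0 * np.sum(s * (1.0 - s))           # Hessian (negative)
    return q, h

theta1 = y.mean()                              # preliminary root-T consistent estimator
q, h = score_and_hess(theta1)
theta2 = theta1 - q / h                        # one Newton step, cf. (53)
print(theta1, theta2)   # theta2 is first-order equivalent to the MLE
```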
But why do we equate it to zero? The reason comes from the assumption that, if the model is correctly specified,
\[
E(x_t u_t) = 0, \quad E\Big( \frac{\partial}{\partial\theta}f(x_t; \theta)\,u_t \Big) = 0, \quad E(z_t u_t) = 0 \quad\text{or}\quad E\big( q_t(\theta_0) \big) = 0 .
\]
So, what we have done is simply to find the value $\widehat{\theta}$ that equates the sample moment to the (theoretical) population moment, which is, without loss of generality, equal to zero. For instance, if you have $\{y_t\}_{t=1}^{T}$ and you wish to estimate its mean, one estimator is the sample mean
\[
(54)\qquad \widehat{m} = \frac{1}{T}\sum_{t=1}^{T} y_t .
\]

Remark 5.4. From the above arguments we can see why the LSE is inconsistent if $E(x_t u_t) = \mu \neq 0$. Indeed, from the definition of the LSE,
\begin{align*}
0 &= \frac{1}{T}\sum_{t=1}^{T} x_t\big( y_t - \widehat{\beta}' x_t \big) \\
&= \frac{1}{T}\sum_{t=1}^{T} x_t u_t + \frac{1}{T}\sum_{t=1}^{T} x_t x_t'\big( \beta - \widehat{\beta} \big).
\end{align*}
Denoting $w_t = x_t u_t$, with $Ew_t = \mu$, it follows that for the latter expression to hold we need $\mathrm{plim}\big( \widehat{\beta} - \beta \big) \neq 0$. More specifically, we actually obtain that
\[
\mathrm{plim}\big( \widehat{\beta} - \beta \big) = \Sigma_{xx}^{-1}\mu .
\]
The general idea of the Generalized Method of Moments (GMM) is as follows. Suppose we have some data $\{z_t\}_{t=1}^{T}$ and we wish to estimate $\theta_0$, which satisfies the moment conditions
\[
E\,\psi(z_t; \theta_0) = 0,
\]
where $\theta$ is a $(p \times 1)$ column vector and $\psi(\cdot)$ is a $(q \times 1)$ vector of equations such that $q \ge p$. So, it seems natural that our estimator $\widehat{\theta}$ of $\theta_0$ would be that for which the sample moments equal zero,
\[
(56)\qquad \frac{1}{T}\sum_{t=1}^{T}\psi\big( z_t; \widehat{\theta} \big) = 0 .
\]
However, if $q > p$, in general there is no solution to the set of equations given in (56). In fact, we can find many solutions $\widehat{\theta}$, each of them consistent. Comparing our discussion to that of the IVE, what we would like to do is to combine all the possible solutions by looking at
\[
\widehat{\theta} = \widehat{\theta}_{GMM} = \arg\min_{\theta\in\Theta}\left( \frac{1}{T}\sum_{t=1}^{T}\psi(z_t; \theta) \right)'A_T\left( \frac{1}{T}\sum_{t=1}^{T}\psi(z_t; \theta) \right),
\]
where $A_T$ is some positive definite matrix such that $\mathrm{plim}\,A_T = A > 0$. This is the estimator explored in Hansen's (1982) Econometrica paper.

The asymptotic properties of $\widehat{\theta}$ are: (i) it is consistent; (ii) $T^{1/2}\big( \widehat{\theta} - \theta_0 \big) \overset{d}{\to} N(0, \Omega)$, where $\Omega$ depends on the matrix $A_T$, i.e. on $A$.

It can be shown that, under some suitable regularity conditions, the best choice of $A$ is given by
\[
\big[ E\,\psi(z_t; \theta_0)\psi'(z_t; \theta_0) \big]^{-1},
\]
or, operationally,
\[
(57)\qquad A_T^{-1} = \frac{1}{T}\sum_{t=1}^{T}\psi\big( z_t; \widetilde{\theta} \big)\psi'\big( z_t; \widetilde{\theta} \big),
\]
in which case
\[
\Omega_0 = \left( E\Big( \frac{\partial}{\partial\theta}\psi'(z_t; \theta_0) \Big)\big[ E\,\psi(z_t; \theta_0)\psi'(z_t; \theta_0) \big]^{-1}E\Big( \frac{\partial}{\partial\theta'}\psi(z_t; \theta_0) \Big) \right)^{-1}.
\]
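A sketch of two-step GMM with linear IV moments $\psi(z_t; \beta) = (y_t - x_t'\beta)w_t$: a first step with $A_T = (T^{-1}\sum w_t w_t')^{-1}$ and a second step with the weight of (57) built from first-step residuals. The simulated design (one endogenous regressor, two instruments, heteroscedastic errors) is an assumption made only to have something to run; for linear moments the minimizer has the closed form used below.

```python
import numpy as np

# Two-step GMM with linear IV moments psi(z_t; beta) = (y_t - x_t'beta) * w_t.
rng = np.random.default_rng(7)
T, beta_true = 2000, 1.0
w = rng.standard_normal((T, 2))                     # instruments (q = 2 > p = 1)
e = rng.standard_normal(T)
u = (0.8 * e + rng.standard_normal(T)) * (1.0 + 0.5 * np.abs(w[:, 0]))
x = w @ np.array([0.6, 0.4]) + e                    # endogenous regressor
y = beta_true * x + u
X, Y = x[:, None], y

def gmm(A):
    # closed-form minimizer of the quadratic GMM criterion with weight A
    G = w.T @ X / T
    g = w.T @ Y / T
    return np.linalg.solve(G.T @ A @ G, G.T @ A @ g)

A1 = np.linalg.inv(w.T @ w / T)                     # first-step weight
b1 = gmm(A1)
res = Y - X @ b1
S = (w * res[:, None]).T @ (w * res[:, None]) / T   # estimate of E[psi psi'], cf. (57)
b2 = gmm(np.linalg.inv(S))                          # efficient two-step GMM
print(b1.ravel(), b2.ravel())                       # both near beta_true = 1
```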
Example 5.1. Consider the IVE. The criterion function was
\[
\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right)'\left( \frac{1}{T}\sum_{t=1}^{T} w_t w_t' \right)^{-1}\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right).
\]
So, in this example $\psi(z_t; \beta) = (y_t - x_t'\beta)w_t$ and $A_T = \big( T^{-1}\sum_{t=1}^{T} w_t w_t' \big)^{-1}$, which converges in probability to $A = \big( \mathrm{plim}\,T^{-1}\sum_{t=1}^{T} w_t w_t' \big)^{-1} = \big( Ew_t w_t' \big)^{-1}$.

Obviously, instead of $A_T$ we could have used a general matrix, say $B_T$, and estimated $\beta$ by
\[
\widetilde{\beta} = \arg\min_{\beta}\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right)'B_T\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right),
\]
which is
\[
\widetilde{\beta} = \big( X'WB_T W'X \big)^{-1}X'WB_T W'Y .
\]
Observe that if the dimension of $w_t$, $q$, is equal to $p$, the dimension of the parameter $\beta$ (which is that of $x_t$), then
\[
\widetilde{\beta} = \big( W'X \big)^{-1}W'Y ,
\]
that is, the estimator becomes independent of the choice of the weighting matrix $A_T$ ($B_T$). So, it is only when $q > p$ that the estimator is not independent of the choice of the matrix $A_T$. That is, the efficiency of the estimator depends on $A_T$.

Following the general result, the question is: what is the best choice of $A_T$? In our particular example $\psi(z_t; \theta_0) = (y_t - x_t'\beta)w_t$, so
\[
E\,\psi(z_t; \theta_0)\psi'(z_t; \theta_0) = E\big( u_t^2 w_t w_t' \big) = \sigma^2 E\big( w_t w_t' \big),
\]
which is estimated by $\sigma^2 T^{-1}\sum_{t=1}^{T} w_t w_t'$. Hence, as we know, the best choice is
\[
A_T = \left( \frac{1}{T}\sum_{t=1}^{T} w_t w_t' \right)^{-1}.
\]

Remark 5.5. Note that multiplicative constants do not affect $\widetilde{\beta}$, because
\[
\widetilde{\beta} = \arg\min_{\beta}\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right)'B_T\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right) = \arg\min_{\beta}\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right)'\sigma^{-2}B_T\left( \frac{1}{T}\sum_{t=1}^{T}\big( y_t - x_t'\beta \big)w_t \right).
\]
We can expect that the efficiency of our estimator $\widehat{\theta}$ will depend very much on the choice of $g(\cdot)$ in (59). Under suitable regularity conditions, the best choice is
\[
g(x_t) = E\Big( \frac{\partial}{\partial\theta}\psi(y_t, x_t; \theta) \,\Big|\, x_t \Big),
\]
since the lower bound for an estimator $\widehat{\theta}$ given in (60) involves
\[
E\Big( \frac{\partial}{\partial\theta}\psi(y_t, x_t; \theta)\frac{\partial}{\partial\theta'}\psi(y_t, x_t; \theta) \Big).
\]

Remark 5.6. The variance-covariance matrix of $\widehat{\theta}$ is
\[
\left( E\Big( g(x_t)\frac{\partial}{\partial\theta'}\psi(y_t, x_t; \theta) \Big) \right)^{-1}E\big( g(x_t)g'(x_t) \big)\left( E\Big( g(x_t)\frac{\partial}{\partial\theta}\psi(y_t, x_t; \theta) \Big) \right)^{-1}.
\]
Finding the optimal $g(x_t)$ in (60) is not easy, and no explicit functional form exists in general, although work has been done in Econometrica by Robinson (1987) and Newey (1992).
\begin{align*}
&= (X'X)^{-1}X'E(UU')X(X'X)^{-1} \\
&= (X'X)^{-1}X'\Omega X(X'X)^{-1},
\end{align*}
which completes the proof.

Thus, the first thing we observe is the fact that the usual formula $\sigma^2(X'X)^{-1}$ no longer holds. The latter is only true if $\Omega = \sigma^2 I_T$. Next, once we have established that the basic properties of the LSE are not affected by allowing $EUU' \neq \sigma^2 I_T$, one question is how good the LSE is. We already know, by Theorem 3.11, that under some regularity conditions $\widehat{\beta}^{LSE}$ is BLUE. However, to obtain that result it was assumed that $\Omega = \sigma^2 I_T$, which is not the case now.

Consider the model (61). We know by Theorem 3.11 that the BLUE of $\beta$ is the LSE when $EUU' = \sigma^2 I_T$. So one idea to obtain the BLUE is to first transform the model in (61) to restore the ideal conditions. To that end, consider $\Omega^{-1/2}$. If we premultiply both sides of (61) by $\Omega^{-1/2}$, we

say is that $\mathrm{Var}\big( \widehat{\beta}^{LSE} \big) - \mathrm{Var}\big( \widetilde{\beta} \big) \ge 0$. Indeed, from (62), $\mathrm{Var}\big( \widetilde{\beta} \big) = \big( X'\Omega^{-1}X \big)^{-1}$, so
\[
\mathrm{Var}\big( \widehat{\beta}^{LSE} \big) - \mathrm{Var}\big( \widetilde{\beta} \big) = (X'X)^{-1}X'\Omega X(X'X)^{-1} - \big( X'\Omega^{-1}X \big)^{-1} \ge 0 .
\]
Next we examine the asymptotic properties of the LSE.

Proposition 6.2. Assume A1-A3, except that $E(UU') = \Omega$ and
\[
\lim_{T\to\infty}\frac{X'\Omega X}{T} < \infty ;
\]
then $\widehat{\beta}^{LSE}$ is consistent.

Proof. Because $\widehat{\beta}^{LSE} - \beta = (X'X/T)^{-1}(X'U)/T$, we have that $E[X'U] = 0$ and, by assumption,
\[
E\left( \frac{X'U}{T} \right)\left( \frac{X'U}{T} \right)' = \frac{1}{T^2}E\big( X'UU'X \big) = \frac{1}{T^2}X'\Omega X \to 0 .
\]
So, $\widehat{\beta}^{LSE} \overset{2nd}{\to} \beta$; then use Theorem 2.2 to conclude.

The statistical properties of $\widetilde{\beta}$ do not need to be examined since, as we have argued above, $\widetilde{\beta}$ is simply the LSE on the transformed model (62), and thus it will be consistent, asymptotically normal and efficient in the Gauss-Markov sense.

So, we can summarize our findings about $\widehat{\beta}^{LSE}$ as:
(1) UNBIASED;
(2) INEFFICIENT COMPARED TO THE (U)GLS;
(3) $\sigma^2(X'X)^{-1}$ is not the true variance-covariance matrix, which is instead $(X'X)^{-1}(X'\Omega X)(X'X)^{-1}$.
6.0.1. $\Omega$ UNKNOWN.

It is quite unrealistic to pretend that we know $\Omega$. So, the question is, what can we do? Although for the computation of the LSE we do not need to know $\Omega$, if our aim is to perform hypothesis testing we will need to estimate $\Omega$, as the variance-covariance of the LSE depends on $\Omega$. As it stands, the matrix $\Omega$ has far too many parameters, i.e. $\Omega$ has $T(T+1)/2$ distinct elements. Thus, it seems unlikely that we can estimate the components of $\Omega$ with only $T$ observations, which is much smaller than the number of parameters.

So the standard procedure is to assume that $\Omega = \Omega(\eta)$, where $\eta = (\eta_1, \ldots, \eta_m)'$. Let $\widehat{\eta}$ be an estimator of $\eta$; then we compute
\[
\widehat{\Omega} = \Omega\big( \widehat{\eta} \big).
\]

Theorem 6.4. A sufficient condition for $\widetilde{\beta}$ and $\widehat{\beta}$ to have the same asymptotic distribution is
\[
\text{(a)}\quad \mathrm{plim}\,\frac{X'\big( \widehat{\Omega}^{-1} - \Omega^{-1} \big)X}{T} = 0, \qquad
\text{(b)}\quad \mathrm{plim}\,\frac{X'\big( \widehat{\Omega}^{-1} - \Omega^{-1} \big)U}{T^{1/2}} = 0 .
\]

Proof. Theorem 2.7 implies that it suffices to show that
\[
(65)\qquad T^{1/2}\big( \widetilde{\beta} - \beta \big) - T^{1/2}\big( \widehat{\beta} - \beta \big) \overset{p}{\to} 0 .
\]
The first term on the right of (66) converges to zero in probability, because the first factor converges in probability to a finite limit and the second factor converges to zero in probability by assumption. By assumption, both factors of the second term on the right of (66) converge to zero in probability, and thus by Theorem 2.3 so does their product. Finally, the third term on the right of (66) also converges to zero in probability, because its first factor converges to zero in probability, whereas the second one converges in distribution to a normal random variable. Thus, we conclude that the third term converges to zero in probability, and again by Theorem 2.3 we conclude (65).

Remark 6.1. (1) The assumptions of Theorem 6.4 are satisfied almost always, although they need to be checked.
(2) For the properties of $\widehat{\beta}$, we must rely on asymptotic approximations. Also notice that now $\widehat{\beta}$ is no longer a linear estimator.
(3) Finally, an interpretation of the (U)GLS of $\beta$ is that $\widetilde{\beta}$ minimizes the objective function
\[
Q(\beta) = (Y - X\beta)'\,\Omega^{-1}(Y - X\beta).
\]

We are now going to look at two scenarios for $\Omega = \Omega(\eta)$. In the first one we are going to assume that $\Omega$ is diagonal but with distinct components, whereas in the second one the elements off the main diagonal of $\Omega$ are assumed to be different from zero but the diagonal elements are the same.
6.1. HETEROSCEDASTICITY.

Heteroscedasticity appears when the variance of $u_t$ varies across observations; in other words, we have that
\[
Eu_t^2 = \sigma_t^2 \neq \sigma^2 , \qquad t = 1, \ldots, T .
\]
A standard situation is cross-section data, e.g. observations for households or firms at some particular period. We may think that the errors depend on $x_t$. In a consumption-income relationship, one can expect that the expected consumption of food would be much the same for those with low income as for those with high income.

Thus, it is reasonable to expect that when $x_t$ is large, $y_t - E(y_t \mid x_t)$ will be larger. This effect might be captured by assuming that $u_t$ is drawn from a distribution with a different variance. When $\sigma_t^2 = \sigma^2$ for all $t$, we say that the errors are homoscedastic.

6.1.1. Estimation of $\beta$ under heteroscedasticity.

Our model is given by
\[
y_t = \beta' x_t + u_t , \qquad t = 1, 2, \ldots, T ,
\]
where $Eu_t = 0$, $Eu_t^2 = \sigma_t^2$ and $Eu_t u_s = 0$ if $t \neq s$. In this case, $\Omega = \mathrm{diag}\big( \sigma_1^2, \sigma_2^2, \ldots, \sigma_T^2 \big)$. If $\Omega$ were known, we could compute the UGLS
\[
\widetilde{\beta} = \big( X^{*\prime}X^{*} \big)^{-1}X^{*\prime}Y^{*},
\]
where $X^{*} = \Omega^{-1/2}X$ and $Y^{*} = \Omega^{-1/2}Y$; in our case, the $t$-th row of $X^{*}$ is $x_t'/\sigma_t$.

Below, $\psi_t$ denotes the vector containing the distinct elements (the lower triangle) of the matrix $x_t x_t'$.
Moreover, consider
\[
\overline{\psi}_T = \frac{1}{T}\sum_{t=1}^{T}E(\psi_t), \qquad \text{with elements } s = 1, 2, \ldots, K(K+1)/2 ,
\]
\[
B_T = \frac{1}{T}\sum_{t=1}^{T}E\Big( \big( u_t^2 - \sigma^2 \big)^2\big( \psi_t - \overline{\psi}_T \big)\big( \psi_t - \overline{\psi}_T \big)' \Big),
\]
so that the statistic can be written as
\[
D_T\big( \widehat{\beta}, \widehat{\sigma}^2 \big) = \frac{1}{T}\sum_{t=1}^{T}\psi_t\big( \widehat{u}_t^2 - \widehat{\sigma}^2 \big).
\]
Similarly,
\[
\widehat{B}_T = \frac{1}{T}\sum_{t=1}^{T}\big( \widehat{u}_t^2 - \widehat{\sigma}^2 \big)^2\big( \psi_t - \widehat{\psi}_T \big)\big( \psi_t - \widehat{\psi}_T \big)',
\]
where $\widehat{\psi}_T = T^{-1}\sum_{t=1}^{T}\psi_t$. Then, we can test for heteroscedasticity using
\[
(69)\qquad W_h = T\,D_T\big( \widehat{\beta}, \widehat{\sigma}^2 \big)'\,\widehat{B}_T^{-1}D_T\big( \widehat{\beta}, \widehat{\sigma}^2 \big) \simeq \chi^2_{K(K+1)/2} .
\]

Remark 6.3. The test given in (69) is not only a test for heteroscedasticity, as a rejection might also be due to incorrect specification of the regression model.
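The sketch below illustrates the idea behind (69) with the familiar operational $T R^2$ version of a White-type check (a simplification, not the exact studentized statistic of the notes): regress the squared LS residuals on the distinct elements of $x_t x_t'$. The simulated design is an assumption for illustration only.

```python
import numpy as np

# Operational White-type check: T * R^2 from regressing squared LS residuals
# on the lower-triangle elements of x_t x_t'. Simplified illustration of (69).
rng = np.random.default_rng(8)
T = 400
x = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
u = rng.standard_normal(T) * (1.0 + 1.5 * np.abs(x[:, 1]))   # heteroscedastic
y = x @ np.array([1.0, 0.5, -0.5]) + u

b = np.linalg.lstsq(x, y, rcond=None)[0]
u2 = (y - x @ b) ** 2

# regressors psi_t: constant, levels, squares and cross-products of x_t
idx = np.tril_indices(x.shape[1])
psi = np.column_stack([x[:, i] * x[:, j] for i, j in zip(*idx)])
g = np.linalg.lstsq(psi, u2, rcond=None)[0]
fit = psi @ g
r2 = 1.0 - np.sum((u2 - fit) ** 2) / np.sum((u2 - u2.mean()) ** 2)
print(T * r2)   # compare with a chi^2 critical value, d.f. = K(K+1)/2 - 1
```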
6.2. AUTOCORRELATION.

We now examine the linear regression model
\[
y_t = \beta' x_t + u_t , \qquad t = 1, \ldots, T ,
\]
where now $E(u_t u_s) \neq 0$ for some $t \neq s$.

With AR(1) errors, the (Gaussian) likelihood factors as
\[
\frac{\big( 1 - \rho^2 \big)^{1/2}}{\big( 2\pi\sigma_{\varepsilon}^2 \big)^{T/2}}\exp\left\{ -\frac{1-\rho^2}{2\sigma_{\varepsilon}^2}\big( y_1 - \beta' x_1 \big)^2 \right\}\exp\left\{ -\frac{1}{2\sigma_{\varepsilon}^2}\sum_{t=2}^{T}\big( (y_t - \rho y_{t-1}) - \beta'(x_t - \rho x_{t-1}) \big)^2 \right\},
\]
e.g. the information matrix is block diagonal, that is, $\widehat{\beta}$ and $\widehat{\theta}$ are asymptotically independent, as was the case with AR errors. The consequence is that we can implement a step-wise algorithm without losing any efficiency, i.e. we regress $\varepsilon_t$ on $\frac{\partial}{\partial\beta}\varepsilon_t$ and then $\varepsilon_t$ on $\frac{\partial}{\partial\theta}\varepsilon_t$.

Consistent estimators of $\beta$ and $\theta$ can be obtained via the LSE of $\beta$ and
\[
\widehat{\theta} = \frac{1 - \big( 1 - 4\widehat{\rho}_1^2 \big)^{1/2}}{2\widehat{\rho}_1}
\]
respectively, where the latter follows because $\rho_1 = \theta/\big( 1 + \theta^2 \big)$, with $\widehat{\rho}_1$ given in (71). Then, from these two initial consistent estimators, a two-step estimator will yield asymptotically efficient estimators.

So, the test will be to reject if $\big| T^{1/2}\widehat{\rho} \big| > 1.96$ (5% significance level). Although this is a possibility, perhaps the best known and most used test is due to Durbin and Watson (1950). The D-W test is based on the statistic
\[
d = \frac{\sum_{t=2}^{T}\big( \widehat{u}_t - \widehat{u}_{t-1} \big)^2}{\sum_{t=2}^{T}\widehat{u}_{t-1}^2} .
\]
The finite sample distribution of $d$ depends on $X$, although we can use asymptotic approximations. First notice that $d \in [0, 4]$. Also,
\[
d \simeq 2\big( 1 - \widehat{\rho} \big).
\]
Thus, $d \approx AN(2, 4/T)$, e.g. $T^{1/2}(d - 2)/2 \approx AN(0,1)$.

But the contribution of Durbin-Watson was the finite sample behaviour of $d$. Although its exact sample behaviour is not known (or not tractable), they provided two bounds, $d_L$ and $d_U$, and a one-sided test. Their test works as follows:
reject $H_0: \rho = 0$ vs. $H_1: \rho > 0$ if $d < d_L$;
accept $H_0: \rho = 0$ vs. $H_1: \rho > 0$ if $d > d_U$;
the test is inconclusive if $d \in (d_L, d_U)$.
One word of caution: in order to implement this test we need a constant term among the regressors $X$.
Portmanteau Test.

Denoting by $\widehat{\rho}_r$ the $r$-th sample autocorrelation of the residuals, Box and Pierce proposed
\[
Q = T\sum_{r=1}^{P}\widehat{\rho}_r^2 \simeq \chi^2_P ,
\]
and Box-Pierce-Ljung, with better finite sample behaviour,
\[
Q^{*} = T(T+2)\sum_{r=1}^{P}\big( T - r \big)^{-1}\widehat{\rho}_r^2 \simeq \chi^2_P .
\]
These tests are designed to detect any departure from randomness indicated by the first $P$ autocorrelations of the errors.
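The sketch below computes the Durbin-Watson and portmanteau statistics from LS residuals (the AR(1)-error design, the sample size and the choice $P = 5$ are illustrative assumptions).

```python
import numpy as np

# Durbin-Watson and Box-Pierce / Ljung-Box statistics from LS residuals.
rng = np.random.default_rng(9)
T, rho = 300, 0.6
x = np.column_stack([np.ones(T), rng.standard_normal(T)])
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.standard_normal()
y = x @ np.array([1.0, 0.5]) + u

b = np.linalg.lstsq(x, y, rcond=None)[0]
uh = y - x @ b

dw = np.sum(np.diff(uh) ** 2) / np.sum(uh ** 2)              # d statistic
print(dw, 2 * (1 - np.corrcoef(uh[1:], uh[:-1])[0, 1]))      # d ~ 2(1 - rho_hat)

P = 5
rhos = np.array([np.sum(uh[r:] * uh[:-r]) / np.sum(uh ** 2) for r in range(1, P + 1)])
Q_bp = T * np.sum(rhos ** 2)                                        # Box-Pierce
Q_lb = T * (T + 2) * np.sum(rhos ** 2 / (T - np.arange(1, P + 1)))  # Ljung-Box
print(Q_bp, Q_lb)   # compare with chi^2_P critical values
```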
7. DYNAMIC MODELS

The models we have examined are static. That is, given data $z_t = (y_t, x_t')'$, $t \in \mathbb{Z}$, we have that $E[y_t \mid x_s, s = 0, \pm 1, \ldots] = \beta' x_t$. However, with time series data we may have that not only $x_t$ influences $y_t$ but also past values of $x_t$ and/or $y_t$.

So, we may have in mind that
\[
E[y_t \mid x_s, s = 0, \pm 1, \ldots] = \sum_{j=0}^{\infty}\beta_j x_{t-j},
\]
or more generally $E[y_t \mid x_s, s = 0, \pm 1, \ldots] = \sum_{j=-\infty}^{\infty}\beta_j x_{t-j}$, implying
\[
y_t = \sum_{j=-\infty}^{\infty}\beta_j x_{t-j} + u_t .
\]
These are known as distributed lag models, where $u_t \sim ARMA(p, q)$.

If the lag polynomial is a ratio of finite polynomials whose denominator $\sum_{j=0}^{r_2}\delta_j z^j$ has all its roots outside the unit circle, e.g. $|z| > 1$, then (i) if $u_t$ is uncorrelated, the model is known as a rational distributed lag model, and (ii) if $u_t \sim ARMA(p, q)$, it is known as a transfer model.

Definition 7.1. (a) TOTAL MULTIPLIER. The effect in the long run of a permanent increase of one unit in the level of $x_t$:
\[
\beta = \sum_{j=0}^{\infty}\beta_j .
\]

Its estimation is not different from what we have already seen. If the $u_t$ are heteroscedastic or autocorrelated, then we can implement a GLS type of estimator, as (73) is just a standard regression model. One possible problem is that $(X'X)$ might be near singular, making the computation quite difficult.
Consider the adaptive expectations (Koyck) model,
\[
y_t = \beta x_{t+1}^{*} + u_t \ \Longrightarrow\ y_t = \frac{\beta(1-\lambda)}{1-\lambda L}x_t + u_t , \qquad x_{t+1}^{*} = x_t^{*} + (1-\lambda)\big( x_t - x_t^{*} \big), \quad 0 < \lambda < 1,
\]
and perform the LSE. However, the LSE is inconsistent since $E(y_{t-1}v_t) \neq 0$, as $v_t = u_t - \lambda u_{t-1}$. So what can we do? As usual, we employ the IVE. Thus, we should find instruments for $y_{t-1}$ and $x_t$. Because $x_t$ is not correlated with $v_t$, all we need is an instrument for $y_{t-1}$. Now, if $x_t$ is related to $y_t$, then $x_{t-1}$ and $y_{t-1}$ will be related as well. Thus, one possible instrument for $y_{t-1}$ is $x_{t-1}$.

This method is not efficient, though, since not all the available information (the particular structure) of the model has been used. How can we exploit it?
One problem is that we are only able to observe $x_1, \ldots, x_T$. Thus, what are the possible solutions? As
\[
\sum_{j=0}^{\infty}\beta\lambda^j x_{t-j} = \sum_{j=0}^{t-1}\beta\lambda^j x_{t-j} + \lambda^t\alpha , \qquad \alpha = \sum_{j=0}^{\infty}\beta\lambda^j x_{-j},
\]
we can minimize
\[
S(\lambda, \alpha) = \sum_{t=1}^{T}\Big( y_t - \sum_{j=0}^{t-1}\beta\lambda^j x_{t-j} - \alpha\lambda^t \Big)^2 .
\]
Again, as we did with the Koyck model, if the errors were correlated then $E[y_{t-1}u_t] \neq 0$ and the LSE would be inconsistent. In this case, we need to use the IVE, with $x_{t-1}$ as an instrument for $y_{t-1}$.
On the other hand, a more efficient estimator could be obtained if one is willing to use all the available information. Once again, we will focus mainly on the case where $u_t \sim AR(1)$, e.g.
\[
u_t = \rho u_{t-1} + \varepsilon_t .
\]
As we did with Cochrane-Orcutt (the motivation of the method), we multiply the model by $(1 - \rho L)$ in order to eliminate the correlation structure of the error term, e.g.
\begin{align*}
(1 - \rho L)y_t &= \alpha(1 - \rho L)y_{t-1} + \beta(1 - \rho L)x_t + \varepsilon_t \\
y_t - \rho y_{t-1} &= \alpha\big( y_{t-1} - \rho y_{t-2} \big) + \beta\big( x_t - \rho x_{t-1} \big) + \varepsilon_t .
\end{align*}
Thus, an appropriate method will be based on
\[
\widehat{\theta} = \arg\min_{\theta = (\alpha,\beta,\rho)'}\sum_{t=3}^{T}\varepsilon_t^2(\alpha, \beta, \rho),
\]
where
\begin{align*}
\frac{\partial}{\partial\alpha}\varepsilon_t &= -\big( y_{t-1} - \rho y_{t-2} \big) = -z_{t1} \\
\frac{\partial}{\partial\beta}\varepsilon_t &= -\big( x_t - \rho x_{t-1} \big) = -z_{t2} \\
\frac{\partial}{\partial\rho}\varepsilon_t &= -\big( y_{t-1} - \alpha y_{t-2} - \beta x_{t-1} \big) = -z_{t3} .
\end{align*}
So, we can implement a Gauss-Newton iteration algorithm, i.e.
\[
\widehat{\theta}^{i+1} = \widehat{\theta}^{i} + \left[ \sum_{t=3}^{T} z_t z_t' \right]^{-1}\sum_{t=3}^{T} z_t\varepsilon_t ,
\]
where $\varepsilon_t$ and $z_{tj}$, $j = 1, 2, 3$, are evaluated at $\widehat{\theta}^{i}$. It can be shown that
\[
T^{1/2}\big( \widehat{\theta} - \theta \big) \overset{d}{\to} N\big( 0, \sigma^2 V^{-1} \big),
\]
where $V$ is the $3 \times 3$ matrix built from $q_y^2$, $q_x^2$, $q_{xy}$ and $\big( 1 - \rho^2 \big)^{-1}$, with
\[
q_y^2 = \mathrm{p\,lim}\,\frac{1}{T}\sum_{t=3}^{T}\Big( \frac{\partial\varepsilon_t}{\partial\alpha} \Big)^2, \qquad
q_x^2 = \mathrm{p\,lim}\,\frac{1}{T}\sum_{t=3}^{T}\Big( \frac{\partial\varepsilon_t}{\partial\beta} \Big)^2, \qquad
q_{xy} = \mathrm{p\,lim}\,\frac{1}{T}\sum_{t=3}^{T}\frac{\partial\varepsilon_t}{\partial\alpha}\frac{\partial\varepsilon_t}{\partial\beta} .
\]
A two-step procedure can be implemented, starting with the IVE of $\alpha$ and $\beta$ and with $\widehat{\rho}$ as the LSE of $\widehat{u}_t$ on $\widehat{u}_{t-1}$, obtaining a fully efficient estimator.

Remark 7.1. In this model a step-wise algorithm will not be efficient for estimating $\theta$, although it will converge to the true value. The reason is that the asymptotic variance-covariance matrix is not block diagonal, e.g. the estimators of $(\alpha, \beta)'$ are not independent of that of $\rho$.
7.4. HATANAKA'S TWO-STEP PROCEDURE.

(Residual adjusted Aitken estimator.)

Hatanaka's device is a procedure which is asymptotically as efficient as the MLE of $\alpha$, $\beta$ and $\rho$. As a by-product, we can conclude, or see why, the Cochrane-Orcutt (step-wise) method is not efficient.

STEP 1: Regress $y_t$ on $y_{t-1}$ and $x_t$ using the IVE, with $x_{t-1}$ as an instrument for $y_{t-1}$, e.g. obtain $\widehat{\alpha}$ and $\widehat{\beta}$. Then perform the LSE of $\widehat{u}_t$ on $\widehat{u}_{t-1}$ to obtain $\widehat{\rho}$.

STEP 2: Regress $y_t - \widehat{\rho}y_{t-1}$ on $y_{t-1} - \widehat{\rho}y_{t-2}$, $x_t - \widehat{\rho}x_{t-1}$ and $\widehat{u}_{t-1}$.

The key difference is the inclusion of $\widehat{u}_{t-1}$. The LSE here will be asymptotically efficient, noticing that the efficient estimator of $\rho$ is $\widehat{\rho} + \rho^{+}$, where $\rho^{+}$ is the coefficient associated with the regressor $\widehat{u}_{t-1}$.

If the errors are MA(1) instead of AR(1), one proceeds similarly, e.g.
\begin{align*}
y_t &= \alpha y_{t-1} + \beta x_t + u_t \\
u_t &= \varepsilon_t + \theta\varepsilon_{t-1} .
\end{align*}
Assuming that $y_1$ is fixed and $\varepsilon_1 = 0$, the MLE is equivalent to the minimization of the conditional sum of squares, CSS,
\[
\sum_{t=2}^{T}\varepsilon_t^2 .
\]
For a Gauss-Newton iteration algorithm, $\varepsilon_t = y_t - \alpha y_{t-1} - \beta x_t - \theta\varepsilon_{t-1}$ and
\begin{align*}
\frac{\partial\varepsilon_t}{\partial\alpha} &= -y_{t-1} - \theta\frac{\partial\varepsilon_{t-1}}{\partial\alpha} \\
\frac{\partial\varepsilon_t}{\partial\beta} &= -x_t - \theta\frac{\partial\varepsilon_{t-1}}{\partial\beta} \\
\frac{\partial\varepsilon_t}{\partial\theta} &= -\varepsilon_{t-1} - \theta\frac{\partial\varepsilon_{t-1}}{\partial\theta} .
\end{align*}
7.6. CAUSALITY.

Let $y$ and $x$ be two scalar variables. We say that $x$ causes $y$ if, in some sense, $x$ helps to predict $y$, that is,
\[
E\big( y_t - E[y_t \mid y_{t-1}, \ldots] \big)^2 > E\big( y_t - E[y_t \mid x_{t-1}, \ldots, y_{t-1}, \ldots] \big)^2 .
\]
If $x_t$ were in the information set, then we would say that there is instantaneous causality. There is feedback if $y_t$ also helps to predict $x_t$.

7.6.1. GRANGER'S TEST.

Consider the bivariate representation
\[
\begin{pmatrix} \pi_{11}(L) & \pi_{12}(L) \\ \pi_{21}(L) & \pi_{22}(L) \end{pmatrix}\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix},
\]
where the $\pi_{ij}(L)$ are polynomials in $L$. To test whether $x$ causes $y$ is equivalent to testing whether $\pi_{21}(L) = 0$ or not. This is sometimes known as a direct test.
8. SYSTEMS OF EQUATIONS

So far, we have studied models where only one equation was given, e.g. multiple regression models. Very frequently, we face models or problems that involve the specification, estimation and inference of more than one equation. We are now going to study this issue, e.g. systems of equations.

8.1. MULTIVARIATE REGRESSION.

In some sense, multivariate regression models are not much different from the models we have already examined. The specification is as follows:
\[
(75)\qquad y_t = Bx_t + \varepsilon_t , \qquad t = 1, \ldots, T ,
\]
where $y_t$ is an $(N \times 1)$ vector, $x_t$ is a $(K \times 1)$ vector and $B$ is the $(N \times K)$ matrix of coefficients in which we are interested. So, the $\ell$-th row of $B$ stands for the coefficients of the $\ell$-th equation.

The multivariate least squares estimator of $B$ is given by
\[
\widehat{B}' = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1}\sum_{t=1}^{T} x_t y_t' ,
\]
or $\widehat{B}' = (X'X)^{-1}X'Y$, where $Y' = (y_1, \ldots, y_T)_{N\times T}$ and $X' = (x_1, \ldots, x_T)_{K\times T}$.

8.1.1. Motivation.

Denote the $t$-th observation of the first equation of (75) by
\[
y_{t1} = b_1'x_t + \varepsilon_{t1} \qquad (\text{the 1st row of } B \text{ is } b_1'),
\]
whereas the $(T \times 1)$ vector of observations of the 1st equation is
\[
y_1 = Xb_1 + \varepsilon_1 .
\]
Thus, we can write (75) as
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} X & & O \\ & \ddots & \\ O & & X \end{pmatrix}\begin{pmatrix} b_1 \\ \vdots \\ b_N \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix},
\]
or, in matrix notation,
\[
(76)\qquad y^{*} = (I \otimes X)b^{*} + \varepsilon^{*},
\]
where $y^{*} = (y_1', \ldots, y_N')'$, $b^{*} = (b_1', \ldots, b_N')'$ and $\varepsilon^{*} = (\varepsilon_1', \ldots, \varepsilon_N')'$. So, there is not much difference from what we have seen already. That is, we have the standard linear regression model with $y^{*}$ and $b^{*}$ as our dependent variable and vector of parameters respectively. Thus, the LSE becomes
\begin{align*}
\widehat{b}^{*} &= \big( I \otimes X'X \big)^{-1}\big( I \otimes X' \big)y^{*} \\
&= \big( I \otimes (X'X)^{-1}X' \big)y^{*} \\
&= \big( I \otimes (X'X)^{-1}X' \big)\mathrm{vec}(Y) \\
&= \mathrm{vec}\big( (X'X)^{-1}X'Y\,I_{N} \big).
\end{align*}
8.1.2. Properties.

Its properties follow straightforwardly from the properties of the LSE in a multiple regression model.
(i) It is unbiased (or consistent), and (ii) its variance (asymptotic variance) is equal to
\[
\Sigma \otimes \big( X'X \big)^{-1} \qquad\Big( \Sigma \otimes \mathrm{p\,lim}\Big( \frac{X'X}{T} \Big)^{-1} \Big).
\]
Hypotheses of the form $H_0: w'\beta^{*} = r$ vs. $H_1: w'\beta^{*} \neq r$ can then be tested in the usual way.

8.2. SURE.

Suppose that we have $N$ (regression) equations
\[
y_i = X_i\beta_i + \varepsilon_i , \qquad i = 1, \ldots, N ,
\]
e.g. $(\varepsilon_{t1}, \varepsilon_{t2}, \ldots, \varepsilon_{tN})'$ is an $(N \times 1)$ column vector with covariance matrix $\Sigma = (w_{ij})_{i,j=1,\ldots,N}$. Thus, we have
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} X_1 & & O \\ & \ddots & \\ O & & X_N \end{pmatrix}\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_N \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix},
\]
that is,
\[
Y = X\beta + \varepsilon , \qquad V = E\varepsilon\varepsilon' = \Sigma \otimes I .
\]
Then, the SURE estimator can be written block-wise, with $(w^{ij}) = \Sigma^{-1}$, as
\[
\widehat{b} = \begin{pmatrix} w^{11}X_1'X_1 & w^{12}X_1'X_2 & \cdots & w^{1N}X_1'X_N \\ \vdots & & & \vdots \\ w^{N1}X_N'X_1 & w^{N2}X_N'X_2 & \cdots & w^{NN}X_N'X_N \end{pmatrix}^{-1}\begin{pmatrix} \sum_{j=1}^{N}w^{1j}X_1'y_j \\ \vdots \\ \sum_{j=1}^{N}w^{Nj}X_N'y_j \end{pmatrix}.
\]
This estimator is more efficient than the LSE equation by equation. But there are two cases where the LSE is fully efficient, namely: (i) $w_{ij} = 0$ for all $i \neq j$, and (ii) $X_i = X_j$ for all $i \neq j$.

The derivation of the SURE can be seen by observing the variance-covariance matrix of $\varepsilon = (\varepsilon_1', \ldots, \varepsilon_N')'$:
\[
E\big( \varepsilon\varepsilon' \big) = \begin{pmatrix} w_{11}I_T & w_{12}I_T & \cdots & w_{1N}I_T \\ & w_{22}I_T & \cdots & w_{2N}I_T \\ & & \ddots & \vdots \\ & & & w_{NN}I_T \end{pmatrix} = \big( \Sigma \otimes I_T \big).
\]
So, to obtain $\widehat{\beta}$ we will use GLS, since the variance-covariance matrix of $\varepsilon$ is not diagonal. That is, we will minimize the objective function
\[
(Y - X\beta)'\big( \Sigma^{-1} \otimes I \big)(Y - X\beta),
\]
whose solution, written out block by block, is exactly the expression for $\widehat{b}$ displayed above.

When $X_1 = X_2 = \cdots = X_N = X$, the estimator collapses to
\[
\widehat{\beta} = \Big( (I \otimes X')\big( \Sigma^{-1} \otimes I \big)(I \otimes X) \Big)^{-1}(I \otimes X')\big( \Sigma^{-1} \otimes I \big)Y = \big( I \otimes (X'X)^{-1}X' \big)Y = \begin{pmatrix} (X'X)^{-1}X'y_1 \\ \vdots \\ (X'X)^{-1}X'y_N \end{pmatrix},
\]
i.e. the LSE equation by equation. When $\Sigma$ is unknown, we replace $w^{ij}$ by estimates $\widehat{w}^{ij}$ obtained from the LSE residuals, giving the feasible SURE
\[
\widehat{\beta} = \begin{pmatrix} \widehat{w}^{11}X_1'X_1 & \cdots & \widehat{w}^{1N}X_1'X_N \\ \vdots & & \vdots \\ \widehat{w}^{N1}X_N'X_1 & \cdots & \widehat{w}^{NN}X_N'X_N \end{pmatrix}^{-1}\begin{pmatrix} \sum_{j=1}^{N}\widehat{w}^{1j}X_1'y_j \\ \vdots \\ \sum_{j=1}^{N}\widehat{w}^{Nj}X_N'y_j \end{pmatrix}.
\]
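A feasible SURE sketch for $N = 2$ equations (the simulated design and error covariance are illustrative assumptions): equation-by-equation LSE, estimate $\Sigma$ from the residuals, then GLS on the stacked system with weight $\widehat{\Sigma}^{-1} \otimes I_T$.

```python
import numpy as np

# Feasible SURE / GLS on a stacked 2-equation system.
rng = np.random.default_rng(10)
T = 500
X1 = np.column_stack([np.ones(T), rng.standard_normal(T)])
X2 = np.column_stack([np.ones(T), rng.standard_normal(T)])
Sig = np.array([[1.0, 0.7], [0.7, 1.0]])
E = rng.multivariate_normal(np.zeros(2), Sig, size=T)
y1 = X1 @ np.array([1.0, 0.5]) + E[:, 0]
y2 = X2 @ np.array([-1.0, 2.0]) + E[:, 1]

# step 1: LSE equation by equation and residual covariance
b1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
b2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
R = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
W = np.linalg.inv(R.T @ R / T)                    # (w^{ij}) = Sigma_hat^{-1}

# step 2: GLS on the stacked system (block-diagonal X, weight W kron I_T)
Xs = np.zeros((2 * T, 4)); Xs[:T, :2] = X1; Xs[T:, 2:] = X2
ys = np.concatenate([y1, y2])
Om_inv = np.kron(W, np.eye(T))
b_sure = np.linalg.solve(Xs.T @ Om_inv @ Xs, Xs.T @ Om_inv @ ys)
print(b1, b2, b_sure)
```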
8.3.1. IDENTIFICATION.

A simultaneous equation system can be written in general as
\[
(77)\qquad Ay_t + Bx_t = u_t , \qquad\text{or, stacked,}\qquad YA' + XB' = U ,
\]
where $Y$ is $(T \times N)$ and $A$ is $(N \times N)$.

From the order condition, we need at least two constraints in the 1st equation. Assume that two coefficients of the 1st equation are zero. Then, consider the matrix formed by the columns of the system whose first element is zero. If such a matrix has rank 2 (here $N = 3$), then the 1st equation is identified. Otherwise it is not.
8.4. ESTIMATION OF A SIMULTANEOUS EQUATION MODEL.

The first thing to notice is that the LSE of $a_{i1}$ and $b_{i1}$ is inconsistent. This is expected from the fact that the right-hand side contains endogenous variables, i.e. the $y_{ti}$'s, and one would expect those variables to be correlated with the error term $u_{t1}$. In fact, the LSE is inconsistent except in one special case: when the system is recursive, that is, when $A$ is lower triangular and $\Sigma$ is diagonal.

If the LSE is inconsistent, how can we estimate the parameters? In the standard linear regression model the method was the IVE. But what instruments should we use? Recall that the efficiency of the IVE depends on the correlation between the regressors and the instruments: the higher the correlation, the better the IVE.

Let us go back to the equation of interest, e.g. (80). In matrix form,
\[
y_1 = Z_1\delta_1 + u_1 = Y_1 a_1 + X_1 b_1 + u_1 ,
\]
where $Z_1 = (Y_1, X_1)$, $Y_1$ is the matrix corresponding to $y_{ti}$, $i = 2, \ldots, m$, and $X_1$ that corresponding to the included $x_{ti}$. All we need to do is to find the best instruments for $Y_1$. Obviously there is no need to instrument $X_1$, as it is not correlated with $u_1$. Thus, all we want is to find an instrument $W_1$ as correlated as possible with $Y_1$.

To that end, consider the reduced form equation (78), that is,
\[
Y = X\Pi' + V .
\]
Recall that the $i$-th column of $Y$ contains the $(T \times 1)$ observations on $y_i$. Based on the above equation, if we want to obtain a set of instruments $W_1$, because $X$ and $V$ are uncorrelated, one candidate is the best predictor of $Y$, which is $X\widehat{\Pi}'$, where $\widehat{\Pi}'$ is the LSE of $\Pi'$. Because we do not want all of $\widehat{Y} = X\widehat{\Pi}'$, but only the columns corresponding to $Y_1$, and
\[
\widehat{Y} = X\big( X'X \big)^{-1}X'Y ,
\]
the instruments for $Y_1$ are $\widehat{Y}_1 = X(X'X)^{-1}X'Y_1$. Observe that $\widehat{Y}_1$ and $u_1$ are asymptotically uncorrelated because $\widehat{\Pi} \overset{p}{\to} \Pi$, so $\widehat{Y}_1 \to X\Pi_1'$ and $X \perp u$. So the IVE becomes
\begin{align*}
\widehat{\delta}_1 &= \left( \begin{pmatrix} \widehat{Y}_1' \\ X_1' \end{pmatrix}(Y_1, X_1) \right)^{-1}\begin{pmatrix} \widehat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix}
= \begin{pmatrix} \widehat{Y}_1'Y_1 & \widehat{Y}_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} \widehat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix}
= \begin{pmatrix} \widehat{Y}_1'\widehat{Y}_1 & \widehat{Y}_1'X_1 \\ X_1'\widehat{Y}_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} \widehat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix},
\end{align*}
since
\[
\widehat{Y}_1'X_1 = Y_1'X\big( X'X \big)^{-1}X'X_1 = Y_1'X_1
\]
and
\[
\widehat{Y}_1'\widehat{Y}_1 = Y_1'X\big( X'X \big)^{-1}X'X\big( X'X \big)^{-1}X'Y_1 = Y_1'X\big( X'X \big)^{-1}X'Y_1 = Y_1'\widehat{Y}_1 = \widehat{Y}_1'Y_1 .
\]
This implies that the LSE of $y_1$ on $\widehat{Y}_1$ and $X_1$ yields the same estimator. It can be shown that the asymptotic distribution of this estimator $\widehat{\delta}_1$ is
\[
T^{1/2}\big( \widehat{\delta}_1 - \delta_1 \big) \overset{d}{\to} N\big( 0, \sigma_1^2\,\Delta_1^{-1} \big),
\]
where $\sigma_1^2 = Eu_{t1}^2$ and
\[
\Delta_1 = \mathrm{p\,lim}\,\frac{1}{T}\begin{pmatrix} \widehat{Y}_1'Y_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}
= \begin{pmatrix} \Pi_1 Q\,\Pi_1' & \Pi_1 Q_1 \\ Q_1'\Pi_1' & Q_{11} \end{pmatrix},
\]
with
\[
\Pi_1 Q = \mathrm{p\,lim}\,\frac{1}{T}Y_1'X, \qquad \Pi_1 Q_1 = \mathrm{p\,lim}\,\frac{1}{T}Y_1'X_1 ,
\]
\[
Q = \mathrm{p\,lim}\,\frac{X'X}{T}, \qquad Q_1 = \mathrm{p\,lim}\,\frac{X'X_1}{T}, \qquad Q_{11} = \mathrm{p\,lim}\,\frac{X_1'X_1}{T}.
\]
This is only for one equation. If what we want is the whole system, then we can do the same equation by equation. We have argued before that the way to estimate the parameters of the 1st equation,
\[
y_1 = Z_1\delta_1 + u_1 ,
\]
was via LS estimation in
\[
y_1 = \widehat{Z}_1\delta_1 + u_1 ,
\]
where $\widehat{Z}_1 = \big( \widehat{Y}_1, X_1 \big)$. Then, for the whole system,
\[
Y = Z\delta + U
\]
with
\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad Z = \begin{pmatrix} Z_1 & & O \\ & \ddots & \\ O & & Z_N \end{pmatrix}, \qquad U = \begin{pmatrix} u_1 \\ \vdots \\ u_N \end{pmatrix},
\]
and the elements of $\Sigma$ are estimated by
\[
\widehat{w}_{ij} = \frac{1}{T}\sum_{t=1}^{T}\widehat{u}_{it}\widehat{u}_{jt} ,
\]
and the same will be applied to the other equations. This estimator is called Indirect Least Squares (ILS) and is consistent by the consistency of $\widehat{\Pi}$ and Theorem 2.3. However, when the system is overidentified there will be more than one solution, with implications for efficiency.

Example 8.2. Suppose that $a_1$ is a scalar and $\beta_{12}$ and $\beta_{22}$ are $(1 \times K_2)$, e.g. there are $K_2$ exogenous variables excluded from the 1st equation; then, clearly, we have $K_2$ equations and 1 coefficient, implying more than one solution. We should mention that all the solutions are consistent, e.g. they converge in probability to $a_1$. So, when the system is overidentified the ILS is not efficient, as it appears that some of the information given in the system has not been taken into account. Compare with the IVE when the number of instruments exceeds the number of variables to instrument.

There is a special situation where all three estimators become the same. If the system is (just) identified, and thus a unique ILS exists, then
\[
ILS = 2SLS = 3SLS .
\]
Also, 2SLS = 3SLS if
(i) $\Sigma = \sigma^2 I$, or $\Sigma = \mathrm{diag}\big( \sigma_1^2, \ldots, \sigma_N^2 \big)$ and there are no cross-equation restrictions, or
(ii) each equation is just identified.
9. HYPOTHESIS TESTING

There are three procedures available: (i) the Wald (W), (ii) the Lagrange Multiplier (LM) and (iii) the Likelihood Ratio (LR) tests.

The purpose is, based on a sample $\{z_t\}_{t=1}^{T}$, to know if its mean or variance, or maybe the conditional expectation, equals some specific value. For example, we would like to know if $\beta_1 = 0$ or $R\beta = r$ in the model
\[
y_t = \beta' x_t + u_t .
\]
To estimate the parameters we looked at
\[
\widehat{\theta} = \arg\min_{\theta\in\Theta} Q(\theta),
\]
which satisfies the first order condition
\[
(82)\qquad \frac{\partial}{\partial\theta}Q\big( \widehat{\theta} \big) = 0 .
\]

9.0.1. Wald (W).

Based on $\widehat{\theta}$, the idea is to decide whether the constraints on $\theta$ hold true at $\widehat{\theta}$.

Example 9.1. If $H_0: \theta_1 = 0$, then the Wald test tries to decide if $\widehat{\theta}_1 \simeq 0$.

9.0.2. Lagrange multiplier (LM).

Because $\widehat{\theta}$ satisfies (82), the idea of this test is to decide whether the FOC evaluated under the null holds true. That is, if $\widetilde{\theta}$ is the estimator using the restrictions, then we look at
\[
\frac{\partial}{\partial\theta}Q\big( \widetilde{\theta} \big) \simeq 0 .
\]

Example 9.2. Consider $\theta = (\theta_1, \theta_2')'$. If $H_0: \theta_1 = 0$, then the LM test tries to decide if
\[
\frac{\partial}{\partial\theta}Q\big( 0, \widetilde{\theta}_2 \big) \simeq 0, \qquad \widetilde{\theta} = \big( 0, \widetilde{\theta}_2' \big)'.
\]

9.0.3. Likelihood Ratio Test (LR).

This test tries to decide whether the ratio between the minimum of $Q(\theta)$ with and without the constraints is 1, that is,
\[
Q^{-1}\big( \widetilde{\theta} \big)Q\big( \widehat{\theta} \big) \simeq 1 .
\]

we get a test for which all we need is to estimate the model under the null, which simplifies the computations, as under $H_0$ the model is linear:
\[
y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t .
\]

9.0.4. The Wald Test.

Let $Q(\theta)$ be the objective function. The W test is based on how far $\widehat{\theta}$ is from the null. Consider $H_0: h(\theta_0) = 0$; then we wish to know if $h\big( \widehat{\theta} \big)$ is statistically different from zero. The form of the W test for this very general hypothesis is
\[
W = T^{1/2}h\big( \widehat{\theta} \big)'\left( \frac{\partial}{\partial\theta'}h\big( \widehat{\theta} \big)\,\widehat{\mathrm{Asyvar}}\big( \widehat{\theta} \big)\,\frac{\partial}{\partial\theta}h\big( \widehat{\theta} \big)' \right)^{-1}T^{1/2}h\big( \widehat{\theta} \big),
\]
which is $\chi^2(r)$, where $r$ is the dimension of the vector $h(\theta)$. The restricted estimator is
\[
\widetilde{\theta} = \arg\min_{\theta\in\Theta,\ \text{s.t. } h(\theta)=0} Q(\theta).
\]
The LM test is sometimes called the score test. What is the form of the test? Similar to testing for $\theta = 0$, where what we did was to compute
\[
\widehat{\theta}'\big( \widehat{\mathrm{asyvar}}\big( \widehat{\theta} \big) \big)^{-1}\widehat{\theta}.
\]

9.0.6. Properties.

The asymptotic properties of the W, LR and LM tests are:
(a) $LR \overset{d}{\to} \chi^2(s)$, $W \overset{d}{\to} \chi^2(s)$ and $LM \overset{d}{\to} \chi^2(s)$, where $s$ is the number of constraints, and
(b) they are consistent. That is, if the null is not true, the tests reject with probability approaching 1 as $T \nearrow \infty$.

Example 9.4. Consider the following linear regression model
\[
y_t = \beta' x_t + u_t .
\]
The LM principle is based on the score at the restricted estimator,
\[
\frac{\partial}{\partial\beta}Q\big( \widetilde{\beta} \big) = -2X'\big( Y - X\widetilde{\beta} \big),
\]
while the W principle gives
\[
W = \begin{cases} \big( R\widehat{\beta} - r \big)'\big[ R(X'X)^{-1}R' \big]^{-1}\big( R\widehat{\beta} - r \big)/\sigma^2 & \text{if } \sigma^2 \text{ known} \\[4pt] \big( R\widehat{\beta} - r \big)'\big[ R(X'X)^{-1}R' \big]^{-1}\big( R\widehat{\beta} - r \big)/\widehat{\sigma}^2 & \text{if } \sigma^2 \text{ unknown}, \end{cases}
\]
where $\widetilde{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\widetilde{u}_t^2$ and $\widehat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\widehat{u}_t^2$. Finally, the LR statistic would be $LR = \log\widetilde{\sigma}^2 - \log\widehat{\sigma}^2$.
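The sketch below computes the $\sigma^2$-unknown version of the Wald statistic above for $H_0: R\beta = r$ (the design, the true parameters and the particular restriction $\beta_2 = 0$ are illustrative assumptions).

```python
import numpy as np

# Wald statistic for H0: R beta = r in the linear regression model,
# using sigma_hat^2 = (1/T) * sum of squared LS residuals.
rng = np.random.default_rng(11)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
beta = np.array([1.0, 0.0, 0.5])
y = X @ beta + rng.standard_normal(T)

b = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ b) ** 2) / T                 # sigma_hat^2 as in the notes
R = np.array([[0.0, 1.0, 0.0]])                   # H0: beta_2 = 0 (true here)
r = np.array([0.0])
V = R @ np.linalg.inv(X.T @ X) @ R.T
W = (R @ b - r) @ np.linalg.solve(V, R @ b - r) / s2
print(W)   # approximately chi^2_1 under H0
```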
in that they look stationary around the mean. The question then is: which model should I choose, or where does the data come from?

Discriminating between TSP and DSP was first addressed by Nelson & Plosser (1982). The testing procedure is a non-nested one. Their approach was to nest the models and then test for unit roots. That is, they introduced the artificial equation
\[
y_t = \alpha + \beta t + \rho y_{t-1} + \varepsilon_t , \qquad t = 1, \ldots, T ,
\]
so that
\[
y_t - y_{t-1} = \alpha + \beta t + (\rho - 1)y_{t-1} + \varepsilon_t = \beta_0 + \beta_1 t + (\rho - 1)y_{t-1} + \varepsilon_t .
\]

Remark 10.2. Similarly, you might have started using the specification
\[
y_t = \beta_0 + \beta_1 t + u_t , \qquad u_t = \rho u_{t-1} + \varepsilon_t , \qquad t = 1, \ldots, T ,
\]
and then employed a Cochrane-Orcutt type transformation.

Their hypothesis test is
\[
H_0: \rho = 1 \ \text{and}\ \beta_1 = 0, \qquad H_1: \text{negation of the null}.
\]
If $H_0$ is rejected, it implies that the data belongs to the TSP class, whereas if $H_0$ is not rejected, then it belongs to the DSP class. So how can we test $H_0$, and more importantly, what are the properties of such tests?

The relevance of unit roots in economics comes from the observation that a shock to the economy will have a permanent effect, i.e. a change in monetary policy will have a permanent effect on output. On the contrary, if the data were TSP, stationary around a time trend, then the effect is only transitory.

We shall begin by considering the AR(1) model
\[
(84)\qquad x_t = \rho x_{t-1} + \varepsilon_t , \qquad t = 1, \ldots, T ,
\]
where $\varepsilon_t$ is iid, and testing $H_0: \rho = 1$ against $H_1: \rho < 1$.

We already know that if $|\rho| < 1$, then the LSE, that is (71), satisfies the CLT
\[
(85)\qquad T^{1/2}\big( \widehat{\rho} - \rho \big) \overset{d}{\to} N\big( 0, 1 - \rho^2 \big).
\]
But what if $\rho = 1$? The first issue that we observe from (85) is that $1 - \rho^2 = 0$, so that the asymptotic variance is zero. So it seems that the theory that works fine for $|\rho| < 1$ will not work for $\rho = 1$. This was examined by Dickey & Fuller (1974) (Fuller, 1976). They showed that, when $\rho = 1$,
\[
T\big( \widehat{\rho} - \rho \big) \overset{d}{\to} \text{a nonstandard distribution}.
\]
The first point to mention is that we need to normalize the LSE of $\rho$ by $T$ instead of $T^{1/2}$ to obtain a proper limiting distribution, which is
\[
T\big( \widehat{\rho} - 1 \big) \overset{d}{\to} \frac{\tfrac{1}{2}\big( B^2(1) - 1 \big)}{\int_0^1 B^2(r)\,dr} = \frac{\int_0^1 B(r)\,dB(r)}{\int_0^1 B^2(r)\,dr},
\]
if $\varepsilon_t$ is iid in (84), where $B(r)$ is the standard Brownian motion, that is, for fixed $r$, $B(r)$ is distributed as $N(0, r)$, and the random variables $B(r_4) - B(r_3)$ and
$B(r_2) - B(r_1)$ are independent for all $0 < r_1 < r_2 < r_3 < r_4 \le 1$. Moreover, Phillips (1987, 1988) showed that
\[
t_{\widehat{\rho}} = \frac{\widehat{\rho} - 1}{SE(\widehat{\rho})} \overset{d}{\to} \frac{\int_0^1 B(r)\,dB(r)}{\big( \int_0^1 B^2(r)\,dr \big)^{1/2}} = \frac{\tfrac{1}{2}\big( B^2(1) - 1 \big)}{\big( \int_0^1 B^2(r)\,dr \big)^{1/2}} .
\]
Fortunately, Dickey & Fuller also tabulated this distribution. However, one key requirement for its validity is that the true value of $\alpha$ is $0$.

If $\alpha \neq 0$, we then have that
\[
T^{3/2}\big( \widehat{\rho} - 1 \big) \overset{d}{\to} N\Big( 0, \frac{12\sigma^2}{\alpha^2} \Big)
\]
and hence $t_{\widehat{\rho}} \overset{d}{\to} N(0,1)$. So, once again we see that, contrary to the situation where the regressors are stationary, when the data have unit roots, changing the model implies that the distribution also changes!!
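A Monte Carlo sketch of the no-drift case (sample size and number of replications are arbitrary assumptions): under $\rho = 1$ the statistic $T(\widehat{\rho} - 1)$ is clearly skewed and non-normal, in line with the Dickey-Fuller limit above.

```python
import numpy as np

# Monte Carlo for the Dickey-Fuller setting: distribution of T*(rho_hat - 1)
# when the data are a driftless random walk (rho = 1).
rng = np.random.default_rng(12)
T, reps = 500, 5000
stats = np.empty(reps)
for i in range(reps):
    x = np.cumsum(rng.standard_normal(T))          # random walk
    num = np.sum(x[:-1] * np.diff(x))              # sum x_{t-1} * (x_t - x_{t-1})
    den = np.sum(x[:-1] ** 2)
    stats[i] = T * (num / den)                     # T * (rho_hat - 1)
print(np.quantile(stats, [0.01, 0.05, 0.5]))       # heavily left-skewed quantiles
```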
What would happen if the model includes a time trend? That is,
\[
x_t = \alpha + \beta t + \rho x_{t-1} + \varepsilon_t .
\]
We would like to test $\rho = 1$, $\beta = 0$ with $\alpha \neq 0$ in general (in this sense the augmented model is regarded as a solution to the dependence of the previous test on whether $\alpha = 0$ or $\alpha \neq 0$). The statistic $t_{\widehat{\rho}}$ again converges in distribution to a nonstandard limit, a functional of the Brownian motion $B(r)$ involving $B(1)$, $\int_0^1 B(r)\,dr$, $\int_0^1\big( r - \tfrac{1}{2} \big)B(r)\,dr$ and $\int_0^1\big( r - \tfrac{1}{2} \big)dB(r)$.
10.1. COINTEGRATION.

Consider $x_t$ and $y_t$, two scalar $I(d)$ sequences. Then, we say that $x_t$ and $y_t$ are cointegrated if there exists a vector $\alpha$ such that
\[
\alpha_1 x_t + \alpha_2 y_t = z_t
\]
is an integrated process of order $d - b$, where $b > 0$ (written $CI(d, b)$). Note that if $\alpha_1 x_t + \alpha_2 y_t = z_t$ is $I(d-b)$, then also
\[
x_t + \widetilde{\alpha}y_t = \widetilde{z}_t \sim I(d-b).
\]
The vector $\alpha = (\alpha_1, \alpha_2)'$ is not identifiable, so we normalize, assuming $\alpha_1 = 1$ say.

Assume that $d = 1$. Suppose that we have $n$ variables, say $x_t$, each of which is $I(1)$, e.g. for all $i = 1, \ldots, n$, $x_{ti} \simeq VARMA$. Also,
\[
\Delta x_t = C(L)\varepsilon_t , \qquad\text{where } C(1) < \infty .
\]
Moreover, assume that it can be written as
\[
A^{+}(L)\Delta x_t = D(L)\varepsilon_t ,
\]
where both $A^{+}(L)$ and $D(L)$ are finite polynomials and $A^{+}(L)^{-1}D(L) = C(L)$. Then, writing $A(L) = A^{+}(L)\Delta$, we have the VARMA representation
\[
A(L)x_t = D(L)\varepsilon_t , \qquad A(L) = A(1) + A^{*}(L)\Delta
\]
(this is like a Taylor expansion around $L = 1$). Assume that $A(1)$ has rank $s < n$, i.e. $A(1) = \gamma\alpha'$ with $\gamma$ $(n \times s)$ and $\alpha'$ $(s \times n)$. Then
\[
A(L)x_t = \gamma\alpha' x_t + A^{*}(L)\Delta x_t = \gamma z_t + A^{*}(L)\Delta x_t = D(L)\varepsilon_t ,
\]
where $z_t = \alpha' x_t$. In the bivariate case, the cointegrating vector is normalized as $\alpha' = (1, -\beta)$.
10.1.1. ESTIMATION.

A two-step procedure can be implemented.

STEP 1: The cointegrating vector is estimated via the LSE in the model
\[
y_t = \beta x_t + z_t .
\]
Here $T\big( \widehat{\beta} - \beta \big)$ converges in distribution to a random variable.

STEP 2: Compute the LSE residuals, $\widehat{z}_{t-1} = y_{t-1} - \widehat{\beta}x_{t-1}$, and use them as a proxy variable for $z_{t-1}$. Then perform LSE in the error-correction equation, e.g.
\[
\Delta y_t = \gamma\Delta x_t + \delta\widehat{z}_{t-1} + v_t .
\]
The LSE of $\gamma$ and $\delta$ is asymptotically normal. The reason, basically, is that all the variables involved in the regression model are $I(0)$.

However, it is worth mentioning that $\widehat{\beta}$ has a lot of small sample bias, and this bias feeds into the second step, e.g. into the regression of $\Delta y_t$ on $\Delta x_t$ and $\widehat{z}_{t-1}$.

One then tests whether $\gamma = 0$ vs. $\gamma < 0$, e.g. using the $t$-ratio of $\widehat{\gamma}$. However, the tables provided by Dickey-Fuller are no longer valid, since $\widehat{u}_t$ is not observed data. The appropriate tables are in Engle and Yoo (1987), J.o.E.
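A sketch of the two-step procedure above (the simulated bivariate I(1) system, the cointegrating coefficient and the AR(1) deviation are illustrative assumptions): step 1 estimates the cointegrating regression, step 2 runs the error-correction regression on the lagged residuals.

```python
import numpy as np

# Two-step (Engle-Granger style) estimation: cointegrating regression,
# then error-correction regression using the lagged residuals.
rng = np.random.default_rng(13)
T, beta = 500, 2.0
x = np.cumsum(rng.standard_normal(T))              # x_t is I(1)
z = np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + rng.standard_normal()  # stationary deviation
y = beta * x + z                                   # y and x are cointegrated

# step 1: cointegrating regression (LSE without constant, for simplicity)
b_hat = np.sum(x * y) / np.sum(x * x)
z_hat = y - b_hat * x

# step 2: error-correction regression of dy on dx and z_hat_{t-1}
dy, dx = np.diff(y), np.diff(x)
Z = np.column_stack([dx, z_hat[:-1]])
coef = np.linalg.lstsq(Z, dy, rcond=None)[0]
print(b_hat, coef)    # b_hat near beta; second coefficient is the EC loading
```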
One of the ideas, from an economic perspective, is that cointegration gives us the long-run relationship, or equilibrium path, between two (or more) variables. In particular, if we consider the general distributed lag model
\[
y_t = \alpha_1 y_{t-1} + \cdots + \alpha_r y_{t-r} + \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_s x_{t-s} + \varepsilon_t ,
\]
we can obtain the short and long run behaviour as follows. Consider
\[
\Delta y_t = (\alpha - 1)y_{t-1} + \sum_{j=1}^{r-1}\alpha_j^{+}\Delta y_{t-j} + \beta x_t + \sum_{j=0}^{s-1}\beta_j^{+}\Delta x_{t-j} + \varepsilon_t ,
\]
where
\[
\alpha = \sum_{j=1}^{r}\alpha_j , \qquad \alpha_j^{+} = -\sum_{\ell=j+1}^{r}\alpha_{\ell} , \qquad \beta_j^{+} = -\sum_{\ell=j+1}^{s}\beta_{\ell} , \qquad \beta = \sum_{j=0}^{s}\beta_j ,
\]
or
\begin{align*}
\Delta y_t &= \sum_{j=1}^{r-1}\alpha_j^{+}\Delta y_{t-j} + \sum_{j=0}^{s-1}\beta_j^{+}\Delta x_{t-j} + (\alpha - 1)y_{t-1} + \beta x_t + \varepsilon_t \\
&= \sum_{j=1}^{r-1}\alpha_j^{+}\Delta y_{t-j} + \sum_{j=0}^{s-1}\beta_j^{+}\Delta x_{t-j} + (\alpha - 1)\Big( y_{t-1} - \frac{\beta}{(1-\alpha)}x_{t-1} \Big) + \varepsilon_t ,
\end{align*}