Autoregressive Models

1.1 Introduction
AR(p) models for univariate time series are Markov processes with dependence of higher order than lag-1
in the univariate state space. Linear, Gaussian models represent a practically important class of models for
time series analysis and decomposition of processes into components, and also an important class of models
within which to explore and understand the ideas of higher-order dependence in Markov processes. Refer
to the *** Support notes on Linear Processes, AR models etc *** on the course web site (mainly the first several
pages of chapter 2 of that note), and Chapters 9 and 15 of West & Harrison, Bayesian Forecasting and
Dynamic Models.
1.2

The AR(p) model is
$$y_t = \sum_{j=1}^{p} \phi_j\, y_{t-j} + \epsilon_t \qquad \text{where} \qquad \epsilon_t \sim N(0, v).$$
1.3

In terms of the backshift operator $B$ (with $B^j y_t = y_{t-j}$), the model is
$$y_t = \sum_{j=1}^{p} \phi_j B^j y_t + \epsilon_t, \qquad \text{i.e.,} \qquad \Phi(B)y_t = \epsilon_t.$$
Characteristic polynomial: $\Phi(u) = 1 - \phi_1 u - \phi_2 u^2 - \cdots - \phi_p u^p$. This is a polynomial of order $p$
defined on $|u| < 1$.
1.4
As in the AR(1) case, iterative substitution of $y_{t-1}$, then $y_{t-2}$, and so on in the right hand side of the model
equation represents $y_t$ as a linear function of more distant past values and of $\epsilon_t, \epsilon_{t-1}, \ldots$. Formally, assuming the
inversion can be performed,
$$y_t = \Phi(B)^{-1}\epsilon_t = \Psi(B)\epsilon_t,$$
or
$$y_t = \epsilon_t + \psi_1\epsilon_{t-1} + \psi_2\epsilon_{t-2} + \cdots$$
1.5 Stationarity

A stationary AR(p) process is such that this inversion exists. The moving average weights $\psi_j$ must decay to
zero eventually, otherwise the linear combination of past innovations will explode. If the representation
exists, then evidently:

- $E(y_t) = 0$ for all $t$,
- $V(y_t) = v\sum_{j=0}^{\infty}\psi_j^2$ for all $t$, and this shows that the weights must decay rapidly with $j$,
- $\mathrm{Cov}(y_t, y_{t-k}) = \gamma(k)$ is some function of the $\psi_j$ weights, but depends only on lag $k$ and not on $t$.

These moments and all other properties are determined by the values of $p$, $\phi$, $v$.
1.6 Characteristic Roots

Write $\Phi(u) = \prod_{j=1}^{p}(1 - \alpha_j u)$ where the $\alpha_j$ are the reciprocals of the roots of the characteristic polynomial,
i.e., $\Phi(\alpha_j^{-1}) = 0$ for each $j = 1, \ldots, p$. Then
$$\Psi(u) = \Phi(u)^{-1} = \prod_{j=1}^{p}(1 - \alpha_j u)^{-1}$$
exists on $|u| < 1$ if, and only if, $|\alpha_j| < 1$ for each $j = 1, \ldots, p$.
1.7 AR(2) Examples

When $p = 2$, $\Phi(u) = 1 - \phi_1 u - \phi_2 u^2 = (1 - \alpha_1 u)(1 - \alpha_2 u)$ may have either two distinct real roots or
one pair of complex conjugate roots. Note that $\phi_1 = \alpha_1 + \alpha_2$ and $\phi_2 = -\alpha_1\alpha_2$. If the roots are real, the
autocorrelations in the process are a composition of those of two AR(1) processes with parameters given by
the two real roots, and in fact the AR(2) model can be represented as the sum of two such AR processes, as
we shall see below in discussion of model decompositions.

If the roots are complex, $\alpha_1 = \bar\alpha_2 = r\exp(i\omega)$, the AR(2) process is quasi-periodic. Sample trajectories have the appearance
of noisy damped cosine waves of fixed wavelength $2\pi/\omega$. The noise is random variation in amplitude
and phase of the waveform through time, injected by the innovations sequence. The damping is induced by
the modulus $r$. A low modulus rapidly damps the waveform between time points, prior to the injection of
the next innovation. A very persistent waveform, closer to a sinusoidal form, is generated in cases of $r$ close
to unity.
Model decomposition theory below shows how all AR(p) models can be decomposed, and hence understood both theoretically and from a quantitative, practical viewpoint, in terms of basic AR(1) and (almost)
AR(2) processes.
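To make the quasi-periodic case concrete, here is a minimal simulation sketch in Python (not part of the original notes; the values of $r$, $\omega$ and $v$ are arbitrary assumptions for illustration). It builds $\phi_1 = 2r\cos\omega$ and $\phi_2 = -r^2$ from an assumed complex root pair and simulates the resulting AR(2) series.

```python
import numpy as np

# Assumed illustrative values: modulus r close to 1 gives a persistent, near-sinusoidal
# waveform; the frequency omega fixes the wavelength 2*pi/omega (12 time points here).
r, omega, v = 0.95, 2 * np.pi / 12, 1.0
phi1, phi2 = 2 * r * np.cos(omega), -r**2   # AR(2) coefficients implied by roots r*exp(+/- i*omega)

rng = np.random.default_rng(0)
n = 300
y = np.zeros(n)
eps = rng.normal(0.0, np.sqrt(v), size=n)
for t in range(2, n):
    # y_t = phi1 * y_{t-1} + phi2 * y_{t-2} + eps_t
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

print("phi1, phi2 =", round(phi1, 3), round(phi2, 3), "; wavelength =", 2 * np.pi / omega)
# A plot of y shows noisy, damped cosine waves of wavelength about 12, with randomly
# varying amplitude and phase injected by the innovations.
```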
1.8

Sometimes we fit higher-order models and then interpret part of the structure as arising from a
lower-order model with correlated innovations, i.e., ARMA processes.
For example, suppose $p = 3$ and we have one real root $a$ and a pair of complex conjugate roots $\alpha, \bar\alpha =
r\exp(\pm i\omega)$, so that $\Phi(u) = (1 - au)(1 - \alpha u)(1 - \bar\alpha u)$. Suppose that $a$ is fairly small so that $(1 - au)^{-1} \approx
1 + au + a^2u^2$ for $|u| < 1$. Then we can rewrite the AR model $\Phi(B)y_t = \epsilon_t$ as
$$(1 - \alpha B)(1 - \bar\alpha B)y_t \approx (1 + aB + a^2B^2)\epsilon_t$$
or
$$y_t \approx \phi_1'y_{t-1} + \phi_2'y_{t-2} + \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2}$$
where $\phi_1' = 2r\cos(\omega)$, $\phi_2' = -r^2$, $\theta_1 = a$ and $\theta_2 = a^2$.

- $y_t$ looks like an ARMA(2,2) process: an AR(2) process in which the innovations are correlated and
themselves have a second-order structure.
- Obvious extensions to ARMA(p, q) models are of interest.
- In this example, with $a$ quite small, the second moving average term $\theta_2\epsilon_{t-2}$ may be negligible and the process looks approximately like an ARMA(2,1) process.
- Very often, higher-order AR(p) models have underlying, lower-order AR structure that is somewhat
obscured by measurement or timing errors that induce correlation between the innovations; this structure can
be recovered through the device of fitting a higher-order model and then using this idea of partial
inversion, at least at an exploratory level.
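As an illustrative numerical check of this partial inversion (my own sketch, not from the notes; the root values $a$, $r$, $\omega$ are assumptions), the fragment below compares the $\psi$-weights of the exact AR(3) with those of the approximating ARMA(2,2); they agree closely when $a$ is small.

```python
import numpy as np

def psi_weights(phi, theta, k=15):
    """MA(infinity) weights of an ARMA model: AR coefficients phi, MA coefficients theta."""
    p, q = len(phi), len(theta)
    psi = np.zeros(k)
    psi[0] = 1.0
    for j in range(1, k):
        acc = theta[j - 1] if j <= q else 0.0
        for i in range(1, min(j, p) + 1):
            acc += phi[i - 1] * psi[j - i]
        psi[j] = acc
    return psi

# Assumed roots: small real root a, complex pair r*exp(+/- i*omega)
a, r, omega = 0.15, 0.9, 2 * np.pi / 8
# Exact AR(3): Phi(u) = (1 - a u)(1 - 2 r cos(omega) u + r^2 u^2); np.polymul uses
# descending powers, so reverse and negate to read off phi_1, phi_2, phi_3.
poly = np.polymul([-a, 1.0], [r**2, -2 * r * np.cos(omega), 1.0])
phi_exact = -poly[-2::-1]
# Approximating ARMA(2,2): AR part from the complex pair, MA part 1 + a B + a^2 B^2
phi_arma, theta_arma = [2 * r * np.cos(omega), -r**2], [a, a**2]

print(np.round(psi_weights(phi_exact, []), 3))
print(np.round(psi_weights(phi_arma, theta_arma), 3))
```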
1.9
Refer back to Homework 3, Question 5, where you fitted AR(p) models to the SOI series and explored
inference in the reference analysis.
Under this model, conditioning on the set of initial values $y_1, \ldots, y_p$ and assuming we observe $n >
2p$ consecutive values, we have a model that has the form of a linear regression, and reference Bayesian
inference is standard. With $y_{(p+1):n} = (y_{p+1}, y_{p+2}, \ldots, y_n)'$ and $(n-p) \times p$ autoregressive design matrix
$$H = \begin{pmatrix} y_p & y_{p-1} & \cdots & y_1 \\ y_{p+1} & y_p & \cdots & y_2 \\ \vdots & \vdots & \ddots & \vdots \\ y_{n-1} & y_{n-2} & \cdots & y_{n-p} \end{pmatrix},$$
the model is
$$y_{(p+1):n} = H\phi + \epsilon_{(p+1):n},$$
with
$$B = H'H \qquad\text{and}\qquad b = B^{-1}H'y_{(p+1):n}.$$
Here $b$ is the conditional MLE, LSE and reference posterior mean and mode for $\phi$, and $Q(b)$ is the
residual sum of squares from the conditional regression analysis.
This is all (trivially) coded in the existing little Matlab functions. They also provide for posterior simulation (direct sampling from this conditional reference posterior) and simulation of predictive distributions (sampling the future). Again, refer back to Homework 3, Question 5 and the analysis support code used
there.
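The Matlab support code itself is not reproduced here; the following Python sketch is an illustrative stand-in (the function name and defaults are my own) that forms $H$, $b$, $B$ and $Q(b)$ and then samples the conditional reference posterior, under the usual reference prior $p(\phi, v) \propto 1/v$ so that $v^{-1}|y_{1:n} \sim Ga((n-2p)/2, Q(b)/2)$ and $(\phi|v, y_{1:n}) \sim N(b, vB^{-1})$.

```python
import numpy as np

def ar_reference_posterior(y, p, n_samples=1000, seed=1):
    """Conditional reference analysis of AR(p), treating y_1,...,y_p as fixed initial values."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    n = len(y)
    # Design matrix H: rows are x_{t-1}' = (y_{t-1}, ..., y_{t-p}) for responses y_{p+1},...,y_n
    H = np.column_stack([y[p - 1 - j : n - 1 - j] for j in range(p)])
    yresp = y[p:]
    B = H.T @ H
    b = np.linalg.solve(B, H.T @ yresp)          # conditional MLE / LSE / posterior mode
    Q = np.sum((yresp - H @ b) ** 2)             # residual sum of squares Q(b)
    nu = n - 2 * p                               # residual degrees of freedom
    v = 1.0 / rng.gamma(nu / 2.0, 2.0 / Q, size=n_samples)      # v^{-1} ~ Ga(nu/2, Q/2)
    z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(B), size=n_samples)
    phi = b + z * np.sqrt(v)[:, None]            # (phi | v) ~ N(b, v B^{-1})
    return b, Q, phi, v

# e.g. for the SOI series with p = 8:  b, Q, phi_draws, v_draws = ar_reference_posterior(soi, 8)
```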
1.10
Introduce the $p$-dimensional state vector $x_t = (y_t, y_{t-1}, \ldots, y_{t-p+1})'$ for all $t$. Then the AR(p) model may
be re-expressed as
$$y_t = F'x_t, \qquad x_t = Gx_{t-1} + F\epsilon_t$$
where
$$F = (1, 0, 0, \ldots, 0)' \qquad\text{and}\qquad G = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}.$$
Here $G$ is the state evolution, or transition, matrix in the extended state space representation. Note that this
maps the state from one to $p$ dimensions and so converts the $p$th order Markovian dependence to a first order
dependence.

This representation is one reason for the notational use of $y_t$ for data, since now $x_t$ is the $p$-vector
state variable.

An easy extension to a latent AR process (an HMM with the underlying hidden state being a higher-order
model) is given by the AR(p)-in-noise extension in which $y_t = F'x_t + \nu_t$, with $\nu_t$ an additional observation noise term.
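For later sketches it is handy to have the state-space matrices in code; this small helper (illustrative Python, not from the notes) builds $F$ and $G$ from a coefficient vector $\phi$ and checks the eigenvalue moduli for the quasi-periodic AR(2) of Section 1.7.

```python
import numpy as np

def ar_state_space(phi):
    """Companion-form state-space matrices F, G for an AR(p) model with coefficients phi."""
    phi = np.asarray(phi, float)
    p = len(phi)
    F = np.zeros(p); F[0] = 1.0          # F = (1, 0, ..., 0)'
    G = np.zeros((p, p))
    G[0, :] = phi                        # first row carries the AR coefficients
    G[1:, :-1] = np.eye(p - 1)           # shifted identity moves the state down one lag
    return F, G

# Example: the quasi-periodic AR(2) of Section 1.7 with r = 0.95, omega = 2*pi/12
F, G = ar_state_space([2 * 0.95 * np.cos(2 * np.pi / 12), -0.95**2])
print(np.abs(np.linalg.eigvals(G)))      # both eigenvalue moduli equal r = 0.95
```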
1.11
Insight into the dependence structure is generated by inspection of the forecast function $f_t(k) = E(y_{t+k}|y_{1:t})$
of the process as a function of the look-ahead horizon $k = 1, 2, \ldots$. It is easily seen that $E(x_{t+k}|y_{1:t}) =
G^kx_t$, so that $f_t(k) = F'G^kx_t$.

Now $G$ is a square matrix. Assume that $G$ has distinct, non-zero eigenvalues - this will be almost surely
the case when $\phi$ is derived from a model fit to real data. Then $G$ has an eigendecomposition $G = E\Lambda E^{-1}$
where the $p \times p$ eigenvector matrix $E$ has columns that are the eigenvectors corresponding to the elements
of the diagonal matrix $\Lambda$. That is, $Ge_j = e_j\lambda_j$ where $E = [e_1, \ldots, e_p]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. The
eigenvalues and eigenvectors can be real or complex valued. Since $G$ is real valued, any complex eigenvalues
must occur in conjugate pairs. Then $G^k = E\Lambda^kE^{-1}$ and so
$$f_t(k) = \sum_{j=1}^{p} c_{t,j}\lambda_j^k$$
for some constants $c_{t,j}$.
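A short illustrative computation (assumed example values; not from the notes) evaluates $f_t(k)$ both directly as $F'G^kx_t$ and through the eigendecomposition, confirming the form $\sum_j c_{t,j}\lambda_j^k$.

```python
import numpy as np

def forecast_function(phi, x_t, k_max=20):
    """f_t(k) = F' G^k x_t for k = 1..k_max, computed via the eigendecomposition of G."""
    phi, x_t = np.asarray(phi, float), np.asarray(x_t, float)
    p = len(phi)
    F = np.eye(p)[0]                                   # F = (1, 0, ..., 0)'
    G = np.vstack([phi, np.eye(p)[:-1]])               # companion-form evolution matrix
    lam, E = np.linalg.eig(G)                          # G = E diag(lam) E^{-1}
    c = (E.T @ F) * np.linalg.solve(E, x_t)            # c_{t,j} = (E'F)_j (E^{-1} x_t)_j
    ks = np.arange(1, k_max + 1)
    f = np.real(c @ (lam[:, None] ** ks[None, :]))     # f_t(k) = sum_j c_{t,j} lam_j^k
    # The direct evaluation F' G^k x_t agrees with the eigen-form:
    assert np.allclose(f, [F @ np.linalg.matrix_power(G, k) @ x_t for k in ks])
    return f

print(np.round(forecast_function([1.64, -0.9], x_t=[1.0, 0.5]), 3))   # damped oscillation
```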
1.12 Autocorrelations

In the state space representation, $E(x_t) = 0$ and $V(x_t) = S$ where $S$ satisfies $S = GSG' + U$ and $U$ is the
$p \times p$ matrix of zeros except for $U_{1,1} = v$. We can see the form of the a.c.f. of $y_t$ easily. Since $y_t = x_t'F$ we
have
$$\gamma(k) = E(y_{t+k}y_t) = F'E(x_{t+k}x_t')F.$$
Now $x_{t+k} = G^kx_t + $ terms involving $\epsilon_{t+1}, \ldots, \epsilon_{t+k}$. Hence $E(x_{t+k}x_t') = G^kS$ and so
$$\gamma(k) = F'G^kSF = \sum_{j=1}^{p} g_j\lambda_j^k$$
for some constants $g_j$ that depend on $(\phi, v)$. As a result, the autocorrelations $\rho(k) = \gamma(k)/\gamma(0)$ have the same form as
a function of lag $k$; that is, precisely the same form as the forecast function. Autocorrelations of AR(p)
processes are a mixture of damped AR(1)-like terms that decay exponentially (real eigenvalues) and may oscillate (real negative eigenvalues), and damped AR(2)-like cosine forms (complex conjugate pairs of eigenvalues).
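For a numerical illustration (Python/scipy sketch; the example $\phi$ and $v$ are assumptions), $S$ can be obtained from the discrete Lyapunov equation $S = GSG' + U$ and the autocovariances then read off as $\gamma(k) = F'G^kSF$.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def ar_autocovariances(phi, v=1.0, k_max=20):
    """gamma(k) = F' G^k S F for k = 0..k_max, with S solving S = G S G' + U."""
    phi = np.asarray(phi, float)
    p = len(phi)
    F = np.eye(p)[0]
    G = np.vstack([phi, np.eye(p)[:-1]])
    U = np.zeros((p, p)); U[0, 0] = v                   # innovation variance enters U_{1,1}
    S = solve_discrete_lyapunov(G, U)                   # stationary state variance matrix
    return np.array([F @ np.linalg.matrix_power(G, k) @ S @ F for k in range(k_max + 1)])

g = ar_autocovariances([2 * 0.9 * np.cos(2 * np.pi / 8), -0.81])
print(np.round(g / g[0], 3))     # rho(k): damped cosine of wavelength 8, damping rate 0.9
```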
1.13
The eigentheory is completed with an explicit representation of $y_t$ in terms of underlying (latent, but identifiable) components of AR(1) and/or AR(2) (actually ARMA(2,1)) forms.

Define a transformed $p$-vector state variable $z_t = E^{-1}x_t$. Also set $F_o = E'F$ and $F_e = E^{-1}F$.
Then $y_t = F_o'z_t$ and $z_t = \Lambda z_{t-1} + F_e\epsilon_t$ for all $t$.

- Elements of $z_t = (z_{t,1}, z_{t,2}, \ldots, z_{t,p})'$ are individual AR(1) processes, with $z_{t,j}$ having AR coefficient
$\lambda_j$. They are correlated since they are driven by the same innovations.
- Real eigenvalues lead to real components: $z_{t,j} \sim AR(1|(\lambda_j, v_j))$.
- Complex eigenvalues $r\exp(\pm i\omega)$ lead to pairs of complex conjugate AR(1) processes. The linear
combination of two such processes must be real, and this leads to a real component that has the
structure of an ARMA(2,1) component in which the AR(2) part is damped by $r > 0$, and quasi-periodic with fixed frequency $\omega$, i.e., wavelength $2\pi/\omega$, but time-dependent amplitude and phase.

See the Time Series support notes, especially the discussion of AR(2) decompositions, and also Chapters
9 (especially) and 15 of West & Harrison, Bayesian Forecasting and Dynamic Models, for further details.
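An illustrative decomposition sketch in Python (the course Matlab decomposition code is not reproduced; this is my own minimal version, assuming distinct eigenvalues): it forms the per-eigenvalue terms $(E'F)_j(E^{-1}x_t)_j$ and combines conjugate pairs into real quasi-periodic components, so that the components sum to $y_t$.

```python
import numpy as np

def ar_decomposition(phi, y):
    """Decompose an AR(p) series into latent components tied to the eigenvalues of G:
    real eigenvalues give AR(1)-like components, conjugate pairs give quasi-periodic ones.
    Returns (eigenvalues, components); component rows correspond to t = p, ..., n."""
    phi, y = np.asarray(phi, float), np.asarray(y, float)
    p, n = len(phi), len(y)
    F = np.eye(p)[0]
    G = np.vstack([phi, np.eye(p)[:-1]])
    lam, E = np.linalg.eig(G)
    # rows of X are the state vectors x_t' = (y_t, y_{t-1}, ..., y_{t-p+1})
    X = np.column_stack([y[p - 1 - j : n - j] for j in range(p)])
    D = (E.T @ F) * np.linalg.solve(E, X.T).T          # D[t, j] = (E'F)_j (E^{-1} x_t)_j
    comps, used = [], np.zeros(p, dtype=bool)
    for j in range(p):
        if used[j]:
            continue
        if abs(lam[j].imag) < 1e-8:                    # real eigenvalue: one real component
            comps.append(D[:, j].real); used[j] = True
        else:                                          # conjugate pair: one real, quasi-periodic component
            k = int(np.argmin(np.abs(lam - np.conj(lam[j]))))
            comps.append((D[:, j] + D[:, k]).real); used[j] = used[k] = True
    return lam, np.column_stack(comps)                 # component columns sum to y_t (up to rounding)
```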
2.1
Investigation of the details of how posterior distributions for $(\phi, v)$ are sequentially updated as new data
arrive is useful for, among other things, defining a framework for extension to time-varying parameter models.

We can recast the model in the following regression form. With the $p$-dimensional state vector $x_t =
(y_t, y_{t-1}, \ldots, y_{t-p+1})'$ for all $t$, we have
$$y_t = x_{t-1}'\phi + \epsilon_t$$
and, under the regression analysis as already described, the posterior for $(\phi, v)$ has the conjugate
normal-inverse gamma form (see 1.9). We need a change of notation, and now write the posterior as
$$(\phi|v, y_{1:t}) \sim N(m_t, vM_t), \qquad (v^{-1}|y_{1:t}) \sim Ga(n_t/2, n_ts_t/2).$$
This conditional posterior is valid for all times $t$, and so the defining quantities $\{m_t, M_t, n_t, s_t\}$ are naturally
related as time $t$ varies. In particular, the time $t$ update involves the mapping from their values at $t-1$ (based on data $y_{1:(t-1)}$) to their values at $t$, representing the additional information provided by the time $t$
observation $y_t$.
The following key theory defines the sequential updating, and is quite general.
Suppose $p(\phi, v|y_{1:(t-1)})$ has the above normal-inverse gamma form with defining parameters
$\{m_{t-1}, M_{t-1}, n_{t-1}, s_{t-1}\}$. Then:

- The one-step ahead forecast distribution conditional on $v$ is
$$(y_t|y_{1:(t-1)}, v) \sim N(x_{t-1}'m_{t-1}, q_tv)$$
with $q_t = 1 + x_{t-1}'M_{t-1}x_{t-1}$.
- The time $t$ posterior distribution is $p(\phi, v|y_{1:t}) = p(\phi|v, y_{1:t})p(v|y_{1:t})$ and is normal-inverse gamma
with parameters $\{m_t, M_t, n_t, s_t\}$ that are computed as follows:
$$m_t = m_{t-1} + A_te_t,$$
$$M_t = M_{t-1} - A_tA_t'q_t,$$
$$n_t = n_{t-1} + 1,$$
$$s_t = (n_{t-1}s_{t-1} + e_t^2/q_t)/(n_{t-1} + 1),$$
with
$e_t = y_t - x_{t-1}'m_{t-1}$, the one-step ahead forecast error, and
$A_t = M_{t-1}x_{t-1}/q_t$, the adaptive coefficient $p$-vector.

These results flow from standard normal theory and Bayes' theorem (see Multivariate Normal Theory notes).
Some comments and alternative expressions are of interest:

- The update for $m_t$ is a predictor/corrector form: the prior or predicted value for $\phi$, namely
$m_{t-1}$, is corrected by the weighted forecast error. A large forecast error implies a large correction,
and vice-versa.
- The forms derived directly from Bayes' theorem are
$$m_t = M_t(M_{t-1}^{-1}m_{t-1} + x_{t-1}y_t) \qquad\text{and}\qquad M_t^{-1} = M_{t-1}^{-1} + x_{t-1}x_{t-1}'.$$
These are standard formulae, though the new, alternative representations above are both computationally more
efficient and numerically more stable as no matrix inversions are required.

Some additional practical points related to initialization:
- Since the updating results hold true for all $t$, it is clear that we can now consider analysis based on any
initial prior of the normal-inverse gamma form at $t = p$. We need to consider the initial point as $t = p$
since the required regression vector $x_t$ is available only for $t \ge p$. That is, analysis can be initialized
at any specified values of $\{m_p, M_p, n_p, s_p\}$ prior to implementing the sequential form of the analysis
beginning with the first observation at $t = p + 1$.
- An alternative initialization involves fitting the reference posterior distribution based on an initial set
of $q > p$ observations, and then beginning the sequential analysis at a first time point $t = q + 1$ with
$\{m_q, M_q, n_q, s_q\}$ defined by the reference posterior.
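A compact sequential-updating sketch (an illustrative Python reimplementation of the recursions above, not the course code; the initialization values in the example are assumptions). It uses the predictor/corrector forms, so no matrix inversions are needed.

```python
import numpy as np

def ar_sequential_update(y, p, m0, M0, n0, s0):
    """Sequential conjugate updating for AR(p): starts from the prior {m0, M0, n0, s0}
    at t = p and processes observations y_{p+1}, ..., y_n one at a time."""
    y = np.asarray(y, float)
    m, M = np.asarray(m0, float).copy(), np.asarray(M0, float).copy()
    n, s = float(n0), float(s0)
    for t in range(p, len(y)):                 # 0-based index t is time point t+1
        x = y[t - 1::-1][:p]                   # x_{t-1} = (y_{t-1}, ..., y_{t-p})'
        e = y[t] - x @ m                       # one-step ahead forecast error e_t
        q = 1.0 + x @ M @ x                    # q_t
        A = M @ x / q                          # adaptive coefficient vector A_t
        m = m + A * e                          # predictor/corrector update of m_t
        M = M - np.outer(A, A) * q
        s = (n * s + e**2 / q) / (n + 1.0)
        n = n + 1.0
    return m, M, n, s

# Example (assumed vague initialization at t = p):
# m, M, n, s = ar_sequential_update(y, p=4, m0=np.zeros(4), M0=10*np.eye(4), n0=1, s0=1.0)
```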
2.1.1
For reference, the latter equation and the alternative form of $M_t$ are related to a key and broadly useful matrix
identity, in this case - dropping the time indices for generality and clarity - simply
$$(M^{-1} + xx')^{-1} = M - Mxx'M/(1 + x'Mx)$$
for any positive definite and symmetric $p \times p$ matrix $M$ and $p$-vector $x$.

A more general version involves a $p \times q$ matrix $X$ and a $q \times q$ positive definite symmetric matrix $V$,
when the identity is
$$(M^{-1} + XV^{-1}X')^{-1} = M - MX(V + X'MX)^{-1}X'M.$$
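A quick numerical check of the first identity (illustrative only; the random $M$ and $x$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p)); M = A @ A.T + p * np.eye(p)   # a positive definite symmetric M
x = rng.normal(size=p)
lhs = np.linalg.inv(np.linalg.inv(M) + np.outer(x, x))
rhs = M - np.outer(M @ x, M @ x) / (1.0 + x @ M @ x)        # M x x' M = (Mx)(Mx)' since M is symmetric
print(np.allclose(lhs, rhs))                                # True
```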
2.2
One most useful side product of the sequential analysis is an easy evaluation of the model likelihood: the
joint density of observations unconditional on model parameters. Simply note that, at each time $t$, we obtain
the one-step ahead forecast, or predictive, density $p(y_t|y_{1:(t-1)})$. This is the univariate T p.d.f. with $n_{t-1}$ degrees of freedom,
$$p(y_t|y_{1:(t-1)}) = \frac{\Gamma((n_{t-1}+1)/2)}{\Gamma(n_{t-1}/2)}\,\left(\pi n_{t-1}s_{t-1}q_t\right)^{-1/2}\left(1 + \frac{e_t^2}{n_{t-1}s_{t-1}q_t}\right)^{-(n_{t-1}+1)/2},$$
i.e., a Student-t density with location $x_{t-1}'m_{t-1}$ and scale factor $s_{t-1}q_t$.

Then, the joint p.d.f. of any set of observations $y_{k:n}$ is given by composition as
$$p(y_{k:n}|y_{1:(k-1)}) = \prod_{t=k}^{n} p(y_t|y_{1:(t-1)}).$$
For example, if we start with $t = p+1$ and some initial values of parameters $\{m_p, M_p, n_p, s_p\}$ as discussed
above, then $k = p+1$ and the above expression defines the joint density of the data from then on.
The value of the joint density function is also named the marginal likelihood of the model. For example,
we may rerun the analysis at different values of the model order $p$, and then $p(y_{1:n})$ is actually $p(y_{1:n}|p)$, the
value of the data density conditional on that value of $p$. As we change $p$ and rerun the analysis, this maps
out the likelihood function over $p$ for assessment of model order. A prior distribution over $p$ might then be
used to convert to a posterior distribution for model order, for example.
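Extending the sequential sketch above, the log marginal likelihood can be accumulated from the one-step T densities and mapped out over candidate orders $p$; this is an illustrative Python fragment (the default initialization is an assumption), using scipy.stats.t for the predictive density.

```python
import numpy as np
from scipy.stats import t as student_t

def ar_log_marginal_likelihood(y, p, n0=1.0, s0=1.0, M_scale=10.0):
    """Accumulate log p(y_{(p+1):n} | y_{1:p}, p) from one-step T predictive densities."""
    y = np.asarray(y, float)
    m, M, n, s = np.zeros(p), M_scale * np.eye(p), float(n0), float(s0)
    logml = 0.0
    for tt in range(p, len(y)):
        x = y[tt - 1::-1][:p]
        e = y[tt] - x @ m
        q = 1.0 + x @ M @ x
        # p(y_t | y_{1:(t-1)}): T with n d.o.f., location x'm, scale sqrt(s*q)
        logml += student_t.logpdf(y[tt], df=n, loc=x @ m, scale=np.sqrt(s * q))
        A = M @ x / q
        m, M = m + A * e, M - np.outer(A, A) * q
        s = (n * s + e**2 / q) / (n + 1.0)
        n += 1.0
    return logml

# e.g. map the marginal likelihood over model order:
# print({p: round(ar_log_marginal_likelihood(y, p), 1) for p in range(1, 9)})
```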
2.3 TVAR Models
Some non-stationary phenomena can be regarded as locally stationary in the sense that a standard model,
such as an AR(p) model, provides an adequate and useful functional form for the data generating process,
but its adequacy relies on permitting the defining parameters to take different values over time. This simple
concept is in fact quite powerful in some applications, and in fact underlies very widely used methods of
local smoothing in time series in many areas of social and natural sciences, engineering signal processing,
and short-term forecasting in business and economics. The comprehensive framework of dynamic state
space models (West and Harrison, 1997) demonstrates the broad scope of applicability of the concept, as
well as various model classes.
A core class of such non-stationary models is the class of Time-Varying Autoregressive Models (TVAR),
a rather simple but nevertheless very useful extension of AR models.
2.3.1
The key concept is simply to allow model parameters to vary randomly in time, according to a stochastic
process model. We will examine just random walk models in detail, but the concept is more general. The
AR(p) model $y_t \sim AR(p|(\phi, v))$ is extended to the TVAR(p) model in which, at any time $t$,
$$y_t = x_{t-1}'\phi_t + \epsilon_t = \sum_{j=1}^{p} \phi_{t,j}y_{t-j} + \epsilon_t$$
where
$$\phi_t = (\phi_{t,1}, \phi_{t,2}, \ldots, \phi_{t,p})'$$
is the (now) time-varying AR parameter vector. Any model for time variation might be considered. A
simple random walk model has several points of recommendation. Such a model has the form
$$\phi_t = \phi_{t-1} + \omega_t$$
where $\omega_t$ is a sequence of zero-mean $p$-vector variates with $\omega_t \perp \omega_s$ for $t \ne s$. The $\omega_t$ variates constitute
a sequence of random shocks to the model that define the temporal evolution of the AR parameters.
This random walk parameter evolution is simple and has the attraction that it is neutral with respect to
directions of change in parameters, and allows them to wander freely over time, so permitting substantial
change in model form over long periods.
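To see what such parameter drift looks like, here is a small simulation sketch (all values assumed, for illustration only) of a TVAR(1) whose single coefficient follows a random walk:

```python
import numpy as np

rng = np.random.default_rng(3)
n, v, w = 500, 1.0, 1e-4                 # innovation variance v, evolution variance w (assumed)
phi = np.zeros(n); y = np.zeros(n)
phi[0] = 0.5
for t in range(1, n):
    phi[t] = phi[t - 1] + rng.normal(0.0, np.sqrt(w))       # random walk on the AR coefficient
    y[t] = phi[t] * y[t - 1] + rng.normal(0.0, np.sqrt(v))  # TVAR(1) observation equation
# phi drifts slowly (total drift has s.d. about sqrt(n*w) = 0.22 here), so the local
# autocorrelation structure of y changes gradually across the series.
```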
2.3.2
Suppose that, at time $t-1$, information on the current AR parameter is summarized via $(\phi_{t-1}|v, y_{1:(t-1)}) \sim
N(m_{t-1}, vM_{t-1})$, while $(v^{-1}|y_{1:(t-1)}) \sim Ga(n_{t-1}/2, n_{t-1}s_{t-1}/2)$, just as in the static parameter model.
Then:

- Assume that $\phi_t = \phi_{t-1} + \omega_t$ where
$$(\omega_t|v, y_{1:(t-1)}) \sim N(0, vW_t),$$
independently of $\omega_{t-k}$ for $k \ge 1$. Note that we include $v$ as a constant factor in the variance of $\omega_t$ for
consistency throughout.
- It easily follows that the evolution of the parameter to time $t$ results in a (prior) distribution for the
changed value that is just
$$(\phi_t|v, y_{1:(t-1)}) \sim N(m_{t-1}, vM_{t|t-1})$$
where $M_{t|t-1} = M_{t-1} + W_t$.
- Observing $y_t$, $(\phi_t|v, y_{1:t}) \sim N(m_t, vM_t)$ where the update equations are precisely as in the AR
model, but now the initial distribution has a variance matrix component $M_{t|t-1}$ in place of $M_{t-1}$.
- The update for $p(v|y_{1:t})$ is also essentially as in the AR model, but again with the small change that
$M_{t|t-1}$ replaces $M_{t-1}$.

In summary, the sequential learning proceeds as follows:

1. The time $t-1$ posterior is parameterized by $\{m_{t-1}, M_{t-1}, n_{t-1}, s_{t-1}\}$;
2. The one-step ahead prior at $t-1$ is
$$(\phi_t|v, y_{1:(t-1)}) \sim N(m_{t-1}, vM_{t|t-1}), \qquad (v^{-1}|y_{1:(t-1)}) \sim Ga(n_{t-1}/2, n_{t-1}s_{t-1}/2);$$
3. The one-step ahead forecast distribution conditional on $v$ is
$$(y_t|y_{1:(t-1)}, v) \sim N(x_{t-1}'m_{t-1}, q_tv)$$
with $q_t = 1 + x_{t-1}'M_{t|t-1}x_{t-1}$.
4. The time $t$ posterior distribution is $p(\phi_t, v|y_{1:t}) = p(\phi_t|v, y_{1:t})p(v|y_{1:t})$ and is normal-inverse gamma
with parameters $\{m_t, M_t, n_t, s_t\}$ that are computed as follows:
$$m_t = m_{t-1} + A_te_t,$$
$$M_t = M_{t|t-1} - A_tA_t'q_t,$$
$$n_t = n_{t-1} + 1,$$
$$s_t = (n_{t-1}s_{t-1} + e_t^2/q_t)/(n_{t-1} + 1),$$
with
$e_t = y_t - x_{t-1}'m_{t-1}$, the one-step ahead forecast error, and
$A_t = M_{t|t-1}x_{t-1}/q_t$, the adaptive coefficient $p$-vector.
Again, these results flow from standard normal theory and Bayes' theorem (see Multivariate Normal Theory
notes). Also, again exactly as in the static AR case, important alternative representations of $\{m_t, M_t\}$ are
the forms derived directly from Bayes' theorem, namely
$$m_t = M_t(M_{t|t-1}^{-1}m_{t-1} + x_{t-1}y_t) \qquad\text{and}\qquad M_t^{-1} = M_{t|t-1}^{-1} + x_{t-1}x_{t-1}'.$$

Note that the AR(p) model is, of course, recovered as the special case in which $W_t = 0$ so that
$M_{t|t-1} = M_{t-1}$. Otherwise, the model now allows for parameter change through non-zero $W_t$ matrices.
Controlling the degree of change is very often desirable, in order that the model parametrization changes
smoothly, consistent with scientific context. Within a Gaussian framework, this involves care in specifying
or controlling the estimation of these variance matrices.
2.3.3
Between time points $t-1$ and $t$, the increase in variance from $M_{t-1}$ to $M_{t|t-1} = M_{t-1} + W_t$ reflects
a decreased precision, i.e., a loss of information. This concept of information loss is key to a parsimonious
and efficient method of structuring parameter change models: the notion of information discounting. One
specific application of the more general concept of variance matrix discounting (West and Harrison, 1997,
chapter 6) to structure and specify such random-change models - in fact, the simplest such approach - is to
assume a constant rate of loss of information about parameters over time. For a fixed, specified discount
factor $\delta$ with $0 < \delta < 1$, specify
$$W_t = M_{t-1}(\delta^{-1} - 1).$$
That is, the loss of information $W_t$ is a (usually rather small) fraction of the existing information $M_{t-1}$.
For instance, $\delta = 0.99$ implies a per-period loss of information of about 1%, whereas $\delta = 0.9$ leads to
information attrition at a rate of about 11% per period.

One key implication is that, simply,
$$M_{t|t-1} = M_{t-1}/\delta,$$
the discount factor inducing a (usually small) inflation in the elements of the variance matrix between time
points. The discount factor can be specified, and the analysis repeated with differing values of $\delta$. The theory of
sequential/compositional computation of the model likelihood applies directly to provide for assessment of
different values of $\delta$ along with the model order $p$.
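A discount-based TVAR filtering sketch (illustrative Python under the assumptions above, not the course code): it is identical to the static AR recursion except that $M_{t|t-1} = M_{t-1}/\delta$ replaces $M_{t-1}$, and it accumulates the log marginal likelihood for joint assessment of $\delta$ and $p$.

```python
import numpy as np
from scipy.stats import t as student_t

def tvar_discount_filter(y, p, delta, n0=1.0, s0=1.0, M_scale=10.0):
    """Sequential TVAR(p) updating with a single discount factor delta in (0, 1];
    delta = 1 recovers the static AR(p) analysis. Returns the trajectory of
    posterior means E(phi_t | y_{1:t}) and the accumulated log marginal likelihood."""
    y = np.asarray(y, float)
    m, M, n, s = np.zeros(p), M_scale * np.eye(p), float(n0), float(s0)
    logml, path = 0.0, []
    for tt in range(p, len(y)):
        x = y[tt - 1::-1][:p]
        R = M / delta                        # evolved variance M_{t|t-1} = M_{t-1}/delta
        e = y[tt] - x @ m
        q = 1.0 + x @ R @ x
        logml += student_t.logpdf(y[tt], df=n, loc=x @ m, scale=np.sqrt(s * q))
        A = R @ x / q
        m, M = m + A * e, R - np.outer(A, A) * q
        s = (n * s + e**2 / q) / (n + 1.0)
        n += 1.0
        path.append(m.copy())
    return np.array(path), logml

# e.g. assess discount factors for fixed p:
# for d in (1.0, 0.99, 0.97, 0.95): print(d, round(tvar_discount_filter(y, 4, d)[1], 1))
```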