
Financial Econometrics Notes

Kevin Sheppard
University of Oxford

November 14, 2012


This version: 11:51, November 14, 2012

©2005-2012 Kevin Sheppard


Contents

1 Probability, Random Variables and Expectations 1


1.1 Axiomatic Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Univariate Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Multivariate Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 Expectations and Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2 Estimation, Inference and Hypothesis Testing 63


2.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2 Convergence and Limits for Random Variables . . . . . . . . . . . . . . . . . . . . . 74
2.3 Properties of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Distribution Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.5 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.6 The Bootstrap and Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
2.7 Inference on Financial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

3 Analysis of Cross-Sectional Data 141


3.1 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.2 Functional Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.4 Assessing Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.5 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.6 Small Sample Properties of OLS estimators . . . . . . . . . . . . . . . . . . . . . . 158
3.7 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
3.8 Small Sample Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.9 Large Sample Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.10 Large Sample Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.11 Large Sample Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.12 Violations of the Large Sample Assumptions . . . . . . . . . . . . . . . . . . . . . . 189
3.13 Model Selection and Specification Checking . . . . . . . . . . . . . . . . . . . . . . 203

3.14 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221


3.A Selected Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

4 Analysis of a Single Time Series 235


4.1 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
4.2 Stationarity, Ergodicity and the Information Set . . . . . . . . . . . . . . . . . . . . . 236
4.3 ARMA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.4 Difference Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
4.5 Data and Initial Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
4.6 Autocorrelations and Partial Autocorrelations . . . . . . . . . . . . . . . . . . . . . . 256
4.7 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
4.8 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
4.9 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
4.10 Nonstationary Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
4.11 Nonlinear Models for Time-Series Analysis . . . . . . . . . . . . . . . . . . . . . . . 291
4.12 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
4.A Computing Autocovariance and Autocorrelations . . . . . . . . . . . . . . . . . . . . 306

5 Analysis of Multiple Time Series 331


5.1 Vector Autoregressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
5.2 Companion Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
5.3 Empirical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
5.4 VAR forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
5.5 Estimation and Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
5.6 Granger causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
5.7 Impulse Response Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
5.8 Cointegration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
5.9 Cross-sectional Regression with Time-series Data . . . . . . . . . . . . . . . . . . . 371
5.A Cointegration in a trivariate VAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

6 Generalized Method Of Moments (GMM) 389


6.1 Classical Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
6.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
6.3 General Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
6.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
6.5 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
6.6 Covariance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.7 Special Cases of GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

6.8 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418


6.9 Parameter Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
6.10 Two-Stage Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
6.11 Weak Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
6.12 Considerations for using GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426

7 Univariate Volatility Modeling 429


7.1 Why does volatility change? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.2 ARCH Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
7.3 Forecasting Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
7.4 Realized Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
7.5 Implied Volatility and VIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
7.A Kurtosis of an ARCH(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
7.B Kurtosis of a GARCH(1,1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

8 Value-at-Risk, Expected Shortfall and Density Forecasting 493


8.1 Defining Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
8.2 Value-at-Risk (VaR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
8.3 Expected Shortfall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
8.4 Density Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
8.5 Coherent Risk Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522

9 Multivariate Volatility, Dependence and Copulas 531


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
9.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
9.3 Simple Models of Multivariate Volatility . . . . . . . . . . . . . . . . . . . . . . . . . 535
9.4 Multivariate ARCH Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
9.5 Realized Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
9.6 Measuring Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
9.7 Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
9.A Bootstrap Standard Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
List of Figures

1.1 Set Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.2 Bernoulli Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Normal PDF and CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Poisson and χ² distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Bernoulli Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6 Joint and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.7 Joint distribution of the FTSE 100 and S&P 500 . . . . . . . . . . . . . . . . . . . . 34
1.8 Simulation and Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.9 Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.1 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75


2.2 Consistency and Central Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.3 Central Limit Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.4 Data Generating Process and Asymptotic Covariance of Estimators . . . . . . . . . . 99
2.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.6 Standard Normal CDF and Empirical CDF . . . . . . . . . . . . . . . . . . . . . . . 119
2.7 CRSP Value Weighted Market (VWM) Excess Returns . . . . . . . . . . . . . . . . . 126

3.1 Rejection regions of a t₁₀ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166


3.2 Bivariate F distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
3.3 Rejection region of an F5,30 distribution . . . . . . . . . . . . . . . . . . . . . . . . 171
3.4 Location of the three test statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
3.5 Effect of correlation on the variance of β̂ IV . . . . . . . . . . . . . . . . . . . . . . 196
3.6 Gains of using GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
3.7 Neglected Nonlinearity and Residual Plots . . . . . . . . . . . . . . . . . . . . . . . 209
3.8 Rolling Parameter Estimates in the 4-Factor Model . . . . . . . . . . . . . . . . . . . 212
3.9 Recursive Parameter Estimates in the 4-Factor Model . . . . . . . . . . . . . . . . . 213
3.10 Influential Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.11 Correct and Incorrect use of “Robust” Estimators . . . . . . . . . . . . . . . . . . . . 220
3.12 Weights of an S&P 500 Tracking Portfolio . . . . . . . . . . . . . . . . . . . . . . . . 222

4.1 Dynamics of linear difference equations . . . . . . . . . . . . . . . . . . . . . . . . . 254


4.2 Stationarity of an AR(2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

4.3 VWM and Default Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257


4.4 ACF and PACF for ARMA Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 262
4.5 ACF and PACF for ARMA Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 263
4.6 Autocorrelations and Partial Autocorrelations for the VWM and the Default Spread . . . 267
4.7 M1, M1 growth, and the ACF and PACF of M1 growth . . . . . . . . . . . . . . . . . . 281
4.8 Time Trend Models of GDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
4.9 Unit Root Analysis of ln CPI and the Default Spread . . . . . . . . . . . . . . . . . 290
4.10 Ideal Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
4.11 Actual Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
4.12 Cyclical Component of U.S. Real GDP . . . . . . . . . . . . . . . . . . . . . . . . . 301
4.13 Markov Switching Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
4.14 Self Exciting Threshold Autoregression Processes . . . . . . . . . . . . . . . . . . . 307
4.15 Exercise 4.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327

5.1 Comparing forecasts from a VAR(1) and an AR(1) . . . . . . . . . . . . . . . . . . . 343


5.2 ACF and CCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.3 Impulse Response Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
5.4 Cointegration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
5.5 Detrended CAY Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.6 Impulse Response of Level-Slope-Curvature . . . . . . . . . . . . . . . . . . . . . . 385

6.1 2-Step GMM Objective Function Surface . . . . . . . . . . . . . . . . . . . . . . . . 403

7.1 Returns of the S&P 500 and IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437


7.2 Squared returns of the S&P 500 and IBM . . . . . . . . . . . . . . . . . . . . . . . . 438
7.3 Absolute returns of the S&P 500 and IBM . . . . . . . . . . . . . . . . . . . . . . . . 445
7.4 News impact curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
7.5 Various estimated densities for the S&P 500 . . . . . . . . . . . . . . . . . . . . . . 458
7.6 Effect of distribution on volatility estimates . . . . . . . . . . . . . . . . . . . . . . . 460
7.7 ACF and PACF of S&P 500 squared returns . . . . . . . . . . . . . . . . . . . . . . 462
7.8 ACF and PACF of IBM squared returns . . . . . . . . . . . . . . . . . . . . . . . . . 463
7.9 Realized Variance and sampling frequency . . . . . . . . . . . . . . . . . . . . . . . 473
7.10 RV^{AC1} and sampling frequency . . . . . . . . . . . . . . . . . . . . . . . . . . 474
7.11 Volatility Signature Plot for SPDR RV . . . . . . . . . . . . . . . . . . . . . . . . . 475
7.12 Black-Scholes Implied Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
7.13 VIX and alternative measures of volatility . . . . . . . . . . . . . . . . . . . . . . . . 483

8.1 Graphical representation of Value-at-Risk . . . . . . . . . . . . . . . . . . . . . . . . 495


8.2 Estimated % VaR for the S&P 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
8.3 S&P 500 Returns and a Parametric Density . . . . . . . . . . . . . . . . . . . . . . . 507
8.4 Empirical and Smoothed empirical CDF . . . . . . . . . . . . . . . . . . . . . . . . 515
8.5 Naïve and Correct Density Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . 517

8.6 Fan plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518


8.7 QQ plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
8.8 Kolmogorov-Smirnov plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
8.9 Returns, Historical Simulation VaR and Normal GARCH VaR. . . . . . . . . . . . . . 529

9.1 Lag weights in RiskMetrics methodologies . . . . . . . . . . . . . . . . . . . . . . . 538


9.2 Rolling Window Correlation Measures . . . . . . . . . . . . . . . . . . . . . . . . . 545
9.3 Observable and Principal Component Correlation Measures . . . . . . . . . . . . . . 546
9.4 Volatility from Multivariate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
9.5 Small Cap - Large Cap Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
9.6 Small Cap - Long Government Bond Correlation . . . . . . . . . . . . . . . . . . . . 559
9.7 Large Cap - Bond Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
9.8 Symmetric and Asymmetric Dependence . . . . . . . . . . . . . . . . . . . . . . . . 566
9.9 Rolling Dependence Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
9.10 Exceedance Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
9.11 Copula Distributions and Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
9.12 Copula Densities with Standard Normal Margins . . . . . . . . . . . . . . . . . . . . 583
9.13 S&P 500 - FTSE 100 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
9.14 S&P 500 and FTSE 100 Exceedance Correlations . . . . . . . . . . . . . . . . . . . 592
List of Tables

1.1 Monte Carlo and Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . 56

2.1 Parameter Values of Mixed Normals . . . . . . . . . . . . . . . . . . . . . . . . . . 100


2.2 Outcome matrix for a hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3 Inference on the Market Premium . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.4 Inference on the Market Premium . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.5 Comparing the Variance of the NASDAQ and S&P 100 . . . . . . . . . . . . . . . . . 127
2.6 Comparing the Variance of the NASDAQ and S&P 100 . . . . . . . . . . . . . . . . . 129
2.7 Wald, LR and LM Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

3.1 Fama-French Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144


3.2 Descriptive Statistics of the Fama-French Data Set . . . . . . . . . . . . . . . . . . . 145
3.3 Regression Coefficient on the Fama-French Data Set . . . . . . . . . . . . . . . . . . 151
3.4 Centered and Uncentered R² and R̄² . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.5 Centered and Uncentered R² and R̄² with Regressor Changes . . . . . . . . . . . . . 156
3.6 t-stats for the Big-High Portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.7 Likelihood Ratio Tests on the Big-High Portfolio . . . . . . . . . . . . . . . . . . . . 175
3.8 Comparison of Small- and Large- Sample t -Statistics . . . . . . . . . . . . . . . . . 189
3.9 Comparison of Small- and Large- Sample Wald, LR and LM Statistic . . . . . . . . . . 190
3.10 OLS and GLS Parameter Estimates and t -stats . . . . . . . . . . . . . . . . . . . . 203

4.1 Estimates from Time-Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . 257


4.2 ACF and PACF for ARMA processes . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.3 Seasonal Model Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
4.4 Unit Root Analysis of ln CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

5.1 Parameter estimates from Campbell’s VAR . . . . . . . . . . . . . . . . . . . . . . . 341


5.2 AIC and SBIC in Campbell’s VAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
5.3 Granger Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
5.4 Johansen Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
5.5 Unit Root Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

6.1 Parameter Estimates from a Consumption-Based Asset Pricing Model . . . . . . . . . 402



6.2 Stochastic Volatility Model Parameter Estimates . . . . . . . . . . . . . . . . . . . . 404


6.3 Effect of Covariance Estimator on GMM Estimates . . . . . . . . . . . . . . . . . . . 412
6.4 Stochastic Volatility Model Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 413
6.5 Tests of a Linear Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
6.6 Fama-MacBeth Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

7.1 Summary statistics for the S&P 500 and IBM . . . . . . . . . . . . . . . . . . . . . . 443
7.2 Parameter estimates from ARCH-family models . . . . . . . . . . . . . . . . . . . . 444
7.3 Bollerslev-Wooldridge Covariance estimates . . . . . . . . . . . . . . . . . . . . . . 454
7.4 GARCH-in-mean estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
7.5 Model selection for the S&P 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
7.6 Model selection for IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464

8.1 Estimated model parameters and quantiles . . . . . . . . . . . . . . . . . . . . . . . 503


8.2 Unconditional VaR of the S&P 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . 506

9.1 Principal Component Analysis of the S&P 500 . . . . . . . . . . . . . . . . . . . . . 541


9.2 Correlation Measures for the S&P 500 . . . . . . . . . . . . . . . . . . . . . . . . . 544
9.3 CCC GARCH Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
9.4 Multivariate GARCH Model Estimates . . . . . . . . . . . . . . . . . . . . . . . . . 556
9.5 Refresh-time sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
9.6 Dependence Measures for Weekly FTSE and S&P 500 Returns . . . . . . . . . . . . 568
9.7 Copula Tail Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
9.8 Unconditional Copula Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
9.9 Conditional Copula Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
Chapter 1

Probability, Random Variables and


Expectations

Note: The primary reference for these notes is Mittelhammer (1999). Other treatments of
probability theory include Gallant (1997), Casella & Berger (2001) and Grimmett & Stirzaker
(2001).

This chapter provides an overview of probability theory as it applies to both discrete and continuous random variables. The material covered in this chapter serves as a foundation of the econometric sequence and is useful throughout financial economics. The chapter begins with a discussion of the axiomatic foundations of probability theory, and then proceeds to describe properties of univariate random variables. Attention then turns to multivariate random variables and important differences from standard univariate random variables. Finally, the chapter discusses the expectations operator and moments.

1.1 Axiomatic Probability

Probability theory is derived from a small set of axioms – a minimal set of essential assump-
tions. A deep understanding of axiomatic probability theory is not essential to financial
econometrics or to the use of probability and statistics in general, although understanding
these core concepts does provide additional insight.
The first concept in probability theory is the sample space, which is an abstract concept
containing primitive probability events.

Definition 1.1 (Sample Space). The sample space is a set, Ω, that contains all possible out-
comes.

Example 1.2. Suppose interest is in a standard 6-sided die. The sample space is 1-dot,
2-dots, . . ., 6-dots.

Example 1.3. Suppose interest is in a standard 52-card deck. The sample space is then A♣,
2♣, 3♣, . . . , J ♣, Q ♣, K ♣, A♦, . . . , K ♦, A♥, . . . , K ♥, A♠, . . . , K ♠.

Example 1.4. Suppose interest is in the logarithmic stock return, defined as rt = ln Pt −


ln Pt −1 , then the sample space is R, the real line.

The next item of interest is an event.

Definition 1.5 (Event). An event, ω, is a subset of the sample space Ω.

Events are typically any subsets of the sample space Ω (including the entire sample space), and the set of all events is known as the event space.

Definition 1.6 (Event Space). The set of all events in the sample space Ω is called the event
space, and is denoted F.

Event spaces are a somewhat more difficult concept. For finite event spaces, the event
space is usually the power set of the outcomes – that is, the set of all possible unique sets
that can be constructed from the elements. When variables can take infinitely many out-
comes, then a more nuanced definition is needed, although one natural one is to consider
is the set of all (small) intervals (so that each interval has infinitely many points in it).

Example 1.7. Suppose interest lies in the outcome of a coin flip. Then the sample space is
{H , T } and the event space is {∅, {H } , {T } , {H , T }} where ∅ is the empty set.

The first two axioms of probability state that all probabilities are non-negative and pro-
vide a normalization.

Axiom 1.8. For any event ω ∈ F,


Pr (ω) ≥ 0. (1.1)

Axiom 1.9. The probability of all events in the sample space Ω is unity, i.e.

Pr (Ω) = 1. (1.2)

The second axiom is a normalization that states that the probability of the entire sample
space is 1 and ensures that the sample space must contain all events that may occur. Pr (·)
is a set function – that is, Pr (ω) returns the probability, a number between 0 and 1,
of observing an event ω.
Before proceeding, it is useful to refresh three concepts from set theory.

Definition 1.10 (Set Union). Let A and B be two sets, then the union is defined A ∪ B =
{x : x ∈ A or x ∈ B }.

A union of two sets contains all elements that are in either set.

Definition 1.11 (Set Intersection). Let A and B be two sets, then the intersection is defined
A ∩ B = {x : x ∈ A and x ∈ B }.

Figure 1.1: The four set definitions shown in R². The upper left panel shows a set and its complement. The upper right shows two disjoint sets. The lower left shows the intersection of two sets (darkened region) and the lower right shows the union of two sets (darkened region). In all diagrams, the outer box represents the entire space.

The intersection contains only the elements that are in both sets.

Definition 1.12 (Set Complement). Let A be a set, then the complement set, denoted $A^c$, is defined as $A^c = \{x : x \notin A\}$.

The complement of a set contains all elements which are not contained in the set.

Definition 1.13 (Disjoint Sets). Let A and B be sets, then A and B are disjoint if and only if
A ∩ B = ∅.

Figure 1.1 provides a graphical representation of the four set operations in a 2-dimensional
space.
The third and final axiom states that probability is additive when sets have no overlap.

Axiom 1.14. Let {A_i}, i = 1, 2, . . . be a countably infinite¹ (or finite) set of disjoint events. Then

$$\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr\left(A_i\right). \qquad (1.3)$$

¹Definition 1.15. A set S is countably infinite if there exists a bijective (one-to-one) function from the elements of S to the natural numbers N = {1, 2, . . .}.

Assembling a sample space, event space and a probability measure into a set produces
what is known as a probability space. Throughout the course, and in virtually all statistics,
a complete probability space is assumed (typically without explicitly stating this assump-
tion).2

Definition 1.16 (Probability Space). A probability space is denoted using the tuple (Ω, F, Pr)
where Ω is the sample space, F is the event space and Pr is the probability set function
which has domain ω ∈ F.

The three axioms of modern probability are very powerful, and a large number of the-
orems can be proven using only these axioms. A few simple examples are provided, and
selected proofs appear in the Appendix.

Theorem 1.17. Let A be an event in the sample space Ω, and let A c be the complement of A
so that Ω = A ∪ A c . Then Pr (A) = 1 − Pr (A c ).

Since A and A^c are disjoint, and by definition A^c is everything not in A, the probabilities of the two must sum to unity.

Theorem 1.18. Let A and B be events in the sample space Ω. Then Pr (A∪B )= Pr (A)+Pr (B )−
Pr (A ∩ B ).

This theorem shows that for any two sets, the probability of the union of the two sets is equal to the sum of the probabilities of the two sets minus the probability of their intersection.

1.1.1 Conditional Probability


Conditional probability extends the basic concepts of probability to the case where interest
lies in the probability of one event conditional on the occurrence of another event.

Definition 1.19 (Conditional Probability). Let A and B be two events in the sample space Ω. If Pr (B) ≠ 0, then the conditional probability of the event A, given event B, is given by

$$\Pr\left(A|B\right) = \frac{\Pr\left(A \cap B\right)}{\Pr\left(B\right)}. \qquad (1.4)$$

The definition of conditional probability is intuitive. The probability of observing an


event in set A, given an event in set B has occurred is the probability of observing an event
in the intersection of the two sets normalized by the probability of observing an event in
set B .
²A probability space is complete if and only if for any B ∈ F with Pr (B) = 0 and any A ⊂ B, A ∈ F. This essentially ensures that probability can be assigned to any event.

Example 1.20. In the example of rolling a die, suppose A = {1, 3, 5} is the event that the outcome is odd and B = {1, 2, 3} is the event that the outcome of the roll is less than 4. Then the conditional probability of A given B is

$$\Pr\left(A|B\right) = \frac{\Pr\left(\{1,3\}\right)}{\Pr\left(\{1,2,3\}\right)} = \frac{\frac{2}{6}}{\frac{3}{6}} = \frac{2}{3}$$

since the intersection of A and B is {1, 3}.

The axioms can be restated in terms of conditional probability, where the sample space
is now the events in the set B .

1.1.2 Independence
Independence is an important concept that is frequently encountered. In essence, inde-
pendence means that any information about an event occurring in one set has no infor-
mation about whether an event occurs in another set.

Definition 1.21. Let A and B be two events in the sample space Ω. Then A and B are independent if and only if

$$\Pr\left(A \cap B\right) = \Pr\left(A\right)\Pr\left(B\right). \qquad (1.5)$$

The notation A ⊥⊥ B is commonly used to indicate that A and B are independent.

One immediate implication of the definition of independence is that when A and B are independent, the conditional probability of one given the other is the same as the unconditional probability – Pr (A|B) = Pr (A).


1.1.3 Bayes Rule


Bayes rule is frequently encountered in both statistics (known as Bayesian statistics) and
in financial models where agents learn about their environment. Bayes rule follows as a
corollary to a theorem that states that the total probability of a set A equals the sum of the conditional probabilities of A given each member of a collection of disjoint sets B_i which span the sample space, weighted by the probabilities Pr (B_i).

Theorem 1.22. Let B_i, i = 1, 2, . . . be a countably infinite (or finite) partition of the sample space Ω so that B_j ∩ B_k = ∅ for j ≠ k and ⋃_{i=1}^∞ B_i = Ω. Let Pr (B_i) > 0 for all i, then for any set A,

$$\Pr\left(A\right) = \sum_{i=1}^{\infty} \Pr\left(A|B_i\right)\Pr\left(B_i\right). \qquad (1.6)$$

Bayes rule is a restatement of the previous theorem, and states that the probability of observing an event in B_j, given that an event in A is observed, can be related to the conditional probability of A given B_j.

Corollary 1.23 (Bayes Rule). Let B_i, i = 1, 2, . . . be a countably infinite (or finite) partition of the sample space Ω so that B_j ∩ B_k = ∅ for j ≠ k and ⋃_{i=1}^∞ B_i = Ω. Let Pr (B_i) > 0 for all i, then for any set A where Pr (A) > 0,

$$\Pr\left(B_j|A\right) = \frac{\Pr\left(A|B_j\right)\Pr\left(B_j\right)}{\sum_{i=1}^{\infty}\Pr\left(A|B_i\right)\Pr\left(B_i\right)} = \frac{\Pr\left(A|B_j\right)\Pr\left(B_j\right)}{\Pr\left(A\right)}.$$

An immediate consequence of the definition of conditional probability is that Pr (A ∩ B) = Pr (A|B) Pr (B), which is referred to as the multiplication rule. Also notice that the "order" is arbitrary, so that the rule can also be given as Pr (A ∩ B) = Pr (B|A) Pr (A). Combining these two (as long as Pr (A) > 0),

$$\Pr\left(A|B\right)\Pr\left(B\right) = \Pr\left(B|A\right)\Pr\left(A\right)
\;\Rightarrow\; \Pr\left(B|A\right) = \frac{\Pr\left(A|B\right)\Pr\left(B\right)}{\Pr\left(A\right)}. \qquad (1.7)$$

Example 1.24. Suppose a family has 2 children and one is a boy, and that the probability of having a child of either sex is equal and independent across children. What is the probability that they have 2 boys?

Before learning that one child is a boy, there are 4 equally probable possibilities: {B, B}, {B, G}, {G, B} and {G, G}. Using Bayes rule,

$$\Pr\left(\{B,B\}|B \geq 1\right) = \frac{\Pr\left(B \geq 1|\{B,B\}\right)\times\Pr\left(\{B,B\}\right)}{\sum_{S\in\{\{B,B\},\{B,G\},\{G,B\},\{G,G\}\}}\Pr\left(B \geq 1|S\right)\Pr\left(S\right)}
= \frac{1\times\frac{1}{4}}{1\times\frac{1}{4}+1\times\frac{1}{4}+1\times\frac{1}{4}+0\times\frac{1}{4}}
= \frac{1}{3}$$

so that knowing one child is a boy increases the probability of 2 boys from 1/4 to 1/3. Note that

$$\sum_{S\in\{\{B,B\},\{B,G\},\{G,B\},\{G,G\}\}}\Pr\left(B \geq 1|S\right)\Pr\left(S\right) = \Pr\left(B \geq 1\right).$$
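The posterior probability above can be verified by brute-force enumeration. The following sketch (illustrative code, not part of the original notes) lists the four equally likely families and applies the definition of conditional probability directly.

```python
# A minimal check of Example 1.24 by enumerating the four equally likely families.
from fractions import Fraction

families = [("B", "B"), ("B", "G"), ("G", "B"), ("G", "G")]
prior = Fraction(1, 4)  # each configuration is equally likely a priori

# Pr(at least one boy) and Pr(two boys, which implies at least one boy)
pr_at_least_one_boy = sum(prior for f in families if "B" in f)
pr_two_boys = sum(prior for f in families if f == ("B", "B"))

# Conditional probability: Pr({B,B} | at least one boy)
posterior = pr_two_boys / pr_at_least_one_boy
print(posterior)  # 1/3
```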

Example 1.25. The famous Monty Hall Let's Make a Deal television program provides an example of Bayes rule. In the game show, there were three prizes, a large one (e.g. a car) and two uninteresting ones (duds). The prizes were hidden behind doors numbered 1, 2 and 3. Ex ante, the contestant has no information about which door has the large prize, and so the initial probabilities are all 1/3. During the negotiations with the host, it is revealed that one of the non-selected doors does not contain the large prize. The host then gives the contestant the chance to switch to the door they didn't choose. For example, suppose the contestant chose door 1 initially, and that the host revealed that the large prize is not behind door 3. The contestant then has the chance to switch to door 2 or to stay with door 1. In this example, B is the event where the contestant chooses the door which hides the large prize, and A is the event that the large prize is not behind door 3.

Initially there are three equally likely outcomes (from the contestant's point of view), where D indicates dud and L indicates large, and the order corresponds to the door number:

{D, D, L}, {D, L, D}, {L, D, D}

The contestant has a 1/3 chance of having the large prize behind door 1. The host will never remove the large prize, and so applying Bayes rule we have

$$\Pr\left(L=2|H=3,S=1\right) = \frac{\Pr\left(H=3|S=1,L=2\right)\times\Pr\left(L=2|S=1\right)}{\sum_{i=1}^{3}\Pr\left(H=3|S=1,L=i\right)\times\Pr\left(L=i|S=1\right)}
= \frac{1\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+1\times\frac{1}{3}+0\times\frac{1}{3}}
= \frac{\frac{1}{3}}{\frac{1}{2}}
= \frac{2}{3},$$
where H is the door the host reveals, S is the initial door selected, and L is the door containing the large prize. This shows that the probability the large prize is behind door 2, given that the player initially selected door 1 and the host revealed door 3, can be computed using Bayes rule.

Pr (H = 3|S = 1, L = 2) is the probability that the host shows door 3 given the contestant selected door 1 and the large prize is behind door 2, which always happens. Pr (L = 2|S = 1) is the probability that the large prize is behind door 2 given the contestant selected door 1, which is 1/3. Pr (H = 3|S = 1, L = 1) is the probability that the host reveals door 3 given that door 1 was selected and contained the large prize, which is 1/2, and Pr (H = 3|S = 1, L = 3) is the probability that the host reveals door 3 given door 3 contains the prize, which never happens.

Bayes rule shows that it is optimal to switch doors since when the host opens a door, it reveals information about the location of the large prize. Essentially, the two doors not selected have probability 2/3 before the doors are opened, and opening one assigns all probability to the door not opened.
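The switching argument can also be checked by simulation. The sketch below is illustrative only and assumes the standard rules used in the text: the host always opens a non-selected door hiding a dud, and the prize location and the initial pick are uniformly distributed.

```python
# Monte Carlo check of the Monty Hall calculation: switching wins about 2/3 of the time.
import random

def play(switch, rng):
    prize = rng.randrange(3)                     # door hiding the large prize
    choice = rng.randrange(3)                    # contestant's initial pick
    # the host opens a door that is neither the pick nor the prize
    host = rng.choice([d for d in range(3) if d not in (choice, prize)])
    if switch:
        # switch to the remaining unopened door
        choice = next(d for d in range(3) if d not in (choice, host))
    return choice == prize

rng = random.Random(0)
n = 100_000
print(sum(play(True, rng) for _ in range(n)) / n)    # approximately 2/3
print(sum(play(False, rng) for _ in range(n)) / n)   # approximately 1/3
```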

1.2 Univariate Random Variables

Studying the behavior of random variables, and more importantly functions of random
variables (i.e. statistics) is essential for both the theory and practice of financial economet-

rics. This section covers univariate random variables, and the discussion of multivariate
random variables is reserved for a later section.
The previous discussion of probability is set based and so includes objects which cannot
be described as random variables, which are a limited (but highly useful) sub-class of all
objects which can be described using probability theory. The primary characteristic of a
random variable is that it takes values on the real line.

Definition 1.26 (Random Variable). Let (Ω, F, P) be a probability space. If X : Ω → R is a real-valued function having as its domain elements of Ω, then X is called a random variable.

A random variable is essentially a function which takes ω ∈ Ω as an input and produces a value x ∈ R, where R is the symbol for the real line. Random variables come in one of
three forms: discrete, continuous and mixed. Random variables which mix discrete and
continuous distributions are generally less important in financial economics and so here
the focus is on discrete and continuous random variables.

Definition 1.27 (Discrete Random Variable). A random variable is called discrete if its range
consists of a countable (possibly infinite) number of elements.
While discrete random variables are less useful than continuous random variables, they
are still very common.

Example 1.28. A random variable which takes on values in {0, 1} is known as a Bernoulli
random variable, and is the simplest non-degenerate random variable (see Section 1.2.3.1).3
Bernoulli random variables are often used to model “success” or “failure”, where success is
loosely defined – a large negative return, the existence of a bull market or a corporate de-
fault.

The distinguishing characteristic of a discrete random variable is not that it takes only
finitely many values, but that the values it takes are distinct in that it is possible to fit a small
interval around each point.

Example 1.29. Poisson random variables take values in{0, 1, 2, 3, . . .} (an infinite range),
and are commonly used to model hazard rates (i.e. the number of occurrences of an event
in an interval). They are especially useful in modeling trading activity (see Section 1.2.3.2).

1.2.1 Mass, Density and Distribution Functions


Discrete random variables are characterized by a probability mass function (pmf) which
gives the probability of observing a particular value of the random variable.

Definition 1.30 (Probability Mass Function). The probability mass function, f, for a discrete random variable X is defined as f (x) = Pr (x) for all x ∈ R(X), and f (x) = 0 for all x ∉ R(X), where R(X) is the range of X (i.e. the values for which X is defined).
³A degenerate random variable always takes the same value, and so is not meaningfully random.

Figure 1.2: These four charts show examples of Bernoulli random variables using returns
on the FTSE 100 and S&P 500. In the top two, a success was defined as a positive return. In
the bottom two, a success was a return above -1% (weekly) or -4% (monthly).

Example 1.31. The probability mass function of a Bernoulli random variable takes the form

$$f(x;p) = p^x (1-p)^{1-x}$$

where p ∈ [0, 1] is the probability of success.

Figure 1.2 contains a few examples of Bernoulli pmfs using data from the FTSE 100 and
S&P 500 over the period 1984–2012. Both weekly returns, using Friday to Friday prices and
monthly returns were constructed. Log returns were used (rt = ln Pt /Pt −1 ) in both ex-


amples. Two of the pmfs defined success as the return being positive. The other two define
the probability of success as a return larger than -1% (weekly) or larger than -4% (monthly).
These show that the probability of a positive return is much larger for monthly horizons
than for weekly.
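A hedged sketch of how such Bernoulli variables can be constructed in practice is given below. The return series is simulated purely for illustration (it is not the FTSE 100 or S&P 500 data used in the figure); the estimated probability of success is simply the sample mean of the indicator.

```python
# Sketch: turning a (hypothetical) return series into Bernoulli success indicators.
import numpy as np

rng = np.random.default_rng(0)
weekly_returns = rng.normal(loc=0.001, scale=0.02, size=1500)   # simulated log returns

success_positive = (weekly_returns > 0.0).astype(int)      # success: positive return
success_above_1pct = (weekly_returns > -0.01).astype(int)  # success: return above -1%

# the sample mean of a Bernoulli indicator estimates p
print(success_positive.mean(), success_above_1pct.mean())
```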

Example 1.32. The probability mass function of a Poisson random variable is

$$f(x;\lambda) = \frac{\lambda^x}{x!}\exp\left(-\lambda\right)$$

where λ ∈ [0, ∞) determines the intensity of arrival (the average value of the random variable).
The pmf of the Poisson distribution can be evaluated for every value of x ≥ 0, which is the support of a Poisson random variable. Figure 1.4 shows the empirical distribution, tabulated using a histogram, of the time elapsed for .1% of the daily volume to trade in the S&P 500 tracking ETF SPY on May 31, 2012. This data series is a good candidate for modeling using a Poisson distribution.
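The Poisson pmf is simple to evaluate directly. The short sketch below (illustrative only, with an arbitrary intensity) compares a hand-coded version of the formula with the pmf from scipy.stats.

```python
# Evaluate the Poisson pmf f(x; lambda) = lambda**x / x! * exp(-lambda)
from math import exp, factorial

from scipy import stats

lam = 3.5                      # illustrative intensity
for x in range(6):
    manual = lam**x / factorial(x) * exp(-lam)
    library = stats.poisson.pmf(x, lam)
    print(x, round(manual, 6), round(library, 6))   # the two columns agree
```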
Continuous random variables, on the other hand, take a continuum of values – techni-
cally an uncountable infinity of values.
Definition 1.33 (Continuous Random Variable). A random variable is called continuous if its range is uncountably infinite and there exists a non-negative-valued function f (x) defined for all x ∈ (−∞, ∞) such that for any event B ⊂ R(X), $\Pr\left(B\right) = \int_{x\in B} f(x)\,\mathrm{d}x$ and f (x) = 0 for all x ∉ R(X), where R(X) is the range of X (i.e. the values for which X is defined).
The pmf of a discrete random variable is replaced with the probability density function
(pdf) for continuous random variables. This change in naming reflects that the probability
of a single point of a continuous random variable is 0, although the probability of observing
a value inside an arbitrarily small interval in R (X ) is not.
Definition 1.34 (Probability Density Function). For a continuous random variable, the func-
tion f is called the probability density function (pdf).
Before providing some examples of pdfs, it is useful to characterize the properties that
any pdf should have.
Definition 1.35 (Continuous Density Function Characterization). A function f : R → R is a member of the class of continuous density functions if and only if f (x) ≥ 0 for all x ∈ (−∞, ∞) and $\int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1$.
There are two essential properties. First, that the function is non-negative, which follows from the axiomatic definition of probability, and second, that the function integrates to 1, so that the total probability under the function is 1. This may seem like a limitation, but it is only a normalization since any integrable function can always be rescaled so that it integrates to 1.
Example 1.36. A simple continuous random variable can be defined on [0, 1] using the probability density function

$$f(x) = 12\left(x - \frac{1}{2}\right)^2$$

and figure 1.3 contains a plot of the pdf.


This simple pdf has peaks near 0 and 1 and a trough at 1/2. More realistic pdfs allow for
values in (−∞, ∞), such as in the density of a normal random variable.
Example 1.37. The pdf of a normal random variable with parameters µ and σ² is given by

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad (1.8)$$

and N(µ, σ²) is used as a shorthand notation. When µ = 0 and σ² = 1, the distribution is known as a standard normal. Figure 1.3 contains a plot of the standard normal pdf along with two other parameterizations.
For large values of x (in the absolute sense), the pdf takes very small values, and peaks
at x = 0 with a value of 0.3989. The shape of the normal distribution is that of a bell (and
is occasionally referred to as a bell curve).
A closely related function to the pdf is the cumulative distribution function, which re-
turns the total probability of observing a value of the random variable less than its input.
Definition 1.38 (Cumulative Distribution Function). The cumulative distribution function
(cdf) for a random variable X is defined as F (c ) = Pr (x ≤ c ) for all c ∈ (−∞, ∞).
Cumulative distribution functions are available for both discrete and continuous ran-
dom variables, and particularly simple for discrete random variables.
Definition 1.39 (Discrete CDF). When X is a discrete random variable, the cdf is

$$F(x) = \sum_{s \leq x} f(s) \qquad (1.9)$$

for x ∈ (−∞, ∞).


Example 1.40. The cdf of a Bernoulli is

$$F(x;p) = \begin{cases} 0 & \text{if } x < 0 \\ p & \text{if } 0 \leq x < 1 \\ 1 & \text{if } x \geq 1 \end{cases}.$$

The Bernoulli cdf is simple since it only takes 2 values. The cdf of a Poisson random variable is also just the sum of the probability mass function for all values less than or equal to the function's argument.
Example 1.41. The cdf of a Poisson(λ) random variable is given by

$$F(x;\lambda) = \exp\left(-\lambda\right)\sum_{i=0}^{\lfloor x\rfloor}\frac{\lambda^i}{i!}, \qquad x \geq 0,$$

where ⌊·⌋ returns the largest integer smaller than the input (the floor operator).

Continuous cdfs operate much like discrete cdfs, only the summation is replaced by an
integral since there are a continuum of values possible for X .

Definition 1.42 (Continuous CDF). When X is a continuous random variable, the cdf is

$$F(x) = \int_{-\infty}^{x} f(s)\,\mathrm{d}s \qquad (1.10)$$

for x ∈ (−∞, ∞).

The integral simply computes the total area under the pdf starting from −∞ up to x.
Example 1.43. The cdf of the random variable with pdf given by 12 (x − 1/2)² is

$$F(x) = 4x^3 - 6x^2 + 3x,$$

and figure 1.3 contains a plot of this cdf.

This cdf is simply the integral of the pdf, and checking shows that F(0) = 0, F(1/2) = 1/2 (since it is symmetric around 1/2) and F(1) = 1, which must be 1 since the random variable is only defined on [0, 1].

Example 1.44. The cdf of a normally distributed random variable with parameters µ and σ² is given by

$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(s-\mu)^2}{2\sigma^2}\right)\mathrm{d}s. \qquad (1.11)$$

Figure 1.3 contains a plot of the standard normal cdf along with two other parameterizations.

In the case of a standard normal random variable, the cdf is not available in closed form,
and so when computed using a computer (i.e. in Excel or MATLAB), fast, accurate numeric
approximations based on polynomial expansions are used (Abramowitz & Stegun 1964).
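As an illustration of such numerical evaluation (a sketch, not the approximation used by any particular package), the standard normal cdf can be written in terms of the error function, F(x) = 1/2 + 1/2 erf(x/√2), which is available in the Python standard library.

```python
# Normal cdf via the error function: F(x; mu, sigma^2) = 0.5 + 0.5 * erf((x - mu) / (sigma * sqrt(2)))
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """cdf of a N(mu, sigma^2) random variable evaluated at x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

print(norm_cdf(0.0))                      # 0.5
print(norm_cdf(1.96))                     # approximately 0.975
print(norm_cdf(1.0, mu=1.0, sigma=2.0))   # 0.5, since x equals the mean
```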
The pdf can be similarly derived from the cdf as long as the cdf is continuously differentiable. At points where the cdf is not continuously differentiable, the pdf is defined to take the value 0.⁴

Theorem 1.45 (Relationship between CDF and pdf). Let f (x) and F(x) represent the pdf and cdf of a continuous random variable X, respectively. The density function for X can be defined as $f(x) = \frac{\partial F(x)}{\partial x}$ whenever f (x) is continuous, and f (x) = 0 elsewhere.
⁴Formally a pdf does not have to exist for a random variable, although a cdf always does. In practice, this is a technical point and distributions which have this property are rarely encountered in financial economics.
Figure 1.3: The top panels show the pdf for the density f (x) = 12 (x − 1/2)² and its associated cdf. The bottom left panel shows the probability density function for normal distributions with different combinations of µ and σ². The bottom right panel shows the cdf for the same parameterizations.

Example 1.46. Taking the derivative of the cdf in the running example,

$$\frac{\partial F(x)}{\partial x} = 12x^2 - 12x + 3 = 12\left(x^2 - x + \frac{1}{4}\right) = 12\left(x - \frac{1}{2}\right)^2.$$
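The differentiation in Example 1.46 can be verified symbolically. The sketch below is illustrative and uses sympy to differentiate the cdf and confirm that the result matches the pdf 12(x − 1/2)² and integrates to one.

```python
# Verify that d/dx [4x^3 - 6x^2 + 3x] = 12(x - 1/2)^2 on [0, 1]
import sympy as sp

x = sp.symbols("x")
F = 4 * x**3 - 6 * x**2 + 3 * x    # cdf of the running example
f = sp.diff(F, x)                  # pdf obtained by differentiating the cdf

print(sp.factor(f))                # 3*(2*x - 1)**2, which equals 12*(x - 1/2)**2
print(sp.integrate(f, (x, 0, 1)))  # 1, so the density integrates to one
```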

1.2.2 Quantile Functions


The quantile function is closely related to the cdf – and in many important cases, the quan-
tile function is the inverse (function) of the cdf. Before defining quantile functions, it is
necessary to define a quantile.

Definition 1.47 (Quantile). Any number q satisfying Pr (x ≤ q ) = α and Pr (x ≥ q ) = 1 − α


is known as the α-quantile of X and is denoted qα .

A quantile is just the point on the cdf where the total probability that a random variable
is smaller is α and the probability that the random variable takes a larger value is 1 − α.
The definition of the quantile does not necessarily require the quantile to be unique – non-
unique quantiles are encountered when pdfs have regions of 0 probability (or equivalently
cdfs are discontinuous). Quantiles are unique for random variables which have continu-
ously differentiable cdfs. One common modification of the quantile definition is to select
the smallest number which satisfies the two conditions – this ensures that quantiles are
unique.
The function which returns the quantile is known as the quantile function.

Definition 1.48 (Quantile Function). Let X be a continuous random variable with cdf F (x ).
The quantile function for X is defined as G (α) = q where Pr (x ≤ q ) = α and Pr (x > q ) =
1 − α. When F (x ) is one-to-one (and hence X is strictly continuous) then G (α) = F −1 (α).

Quantile functions are generally set-valued when quantiles are not unique, although in
the common case where the pdf does not contain any regions of 0 probability, the quantile
function is simply the inverse of the cdf.

Example 1.49. The cdf of an exponential random variable is

$$F(x;\lambda) = 1 - \exp\left(-\frac{x}{\lambda}\right)$$

for x ≥ 0 and λ > 0. Since f (x; λ) > 0 for x > 0, the quantile function is

$$F^{-1}(\alpha;\lambda) = -\lambda \ln\left(1 - \alpha\right).$$

The quantile function plays an important role in simulation of random variables. In


particular, if u ∼ U (0, 1)5 , then x = F −1 (u) is distributed F . For example, when u is a
standard uniform (U (0, 1)), and F −1 (α) is the quantile function of an exponential random
variable, then x = F −1 (u; λ) follows an exponential (λ) distribution.

Theorem 1.50 (Probability Integral Transform). Let U be a standard uniform random variable and F_X(x) be a continuous, increasing cdf. Then Pr (F⁻¹(U) < x) = F_X(x), and so F⁻¹(U) is distributed F.
⁵The mathematical notation ∼ is read "distributed as". For example, x ∼ U(0, 1) indicates that x is distributed as a standard uniform random variable.

Proof. Let U be a standard uniform random variable, and for an x ∈ R(X),

$$\Pr\left(U \leq F(x)\right) = F(x),$$

which follows from the definition of a standard uniform. Then

$$\Pr\left(U \leq F(x)\right) = \Pr\left(F^{-1}(U) \leq F^{-1}\left(F(x)\right)\right) = \Pr\left(F^{-1}(U) \leq x\right) = \Pr\left(X \leq x\right).$$

The key identity is that Pr (F⁻¹(U) ≤ x) = Pr (X ≤ x), which shows that the distribution of F⁻¹(U) is F by definition of the cdf. The right panel of figure 1.8 shows the relationship between the cdf of a standard normal and the associated quantile function. Applying F(X) produces a uniform U through the cdf and applying F⁻¹(U) produces X through the quantile function.
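A minimal sketch of the probability integral transform in practice: draw standard uniforms and push them through the exponential quantile function from Example 1.49. The parameter value is arbitrary; the sample mean of the simulated draws should be close to λ.

```python
# Simulate exponential(lambda) draws via the probability integral transform:
# if u ~ U(0,1), then x = F^{-1}(u; lambda) = -lambda * ln(1 - u) is exponential.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                # illustrative parameter
u = rng.uniform(size=100_000)            # standard uniform draws
x = -lam * np.log(1.0 - u)               # exponential quantile function applied to u

print(x.mean())                          # close to lam, the mean of this exponential
print(np.mean(x <= -lam * np.log(0.5)))  # close to 0.5: half the draws fall below the median
```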

1.2.3 Common Univariate Distributions

Discrete

1.2.3.1 Bernoulli

A Bernoulli random variable is a discrete random variable which takes one of two values,
0 or 1. It is often used to model success or failure, where success is loosely defined. For
example, a success may be the event that a trade was profitable net of costs, or the event
that stock market volatility as measured by VIX was greater than 40%. The Bernoulli distri-
bution depends on a single parameter p which determines the probability of success.

Parameters

p ∈ [0, 1]

Support

x ∈ {0, 1}

Probability Mass Function

$f(x;p) = p^x (1-p)^{1-x}$, p ≥ 0

Figure 1.4: The left panel shows a histogram of the elapsed time in seconds required for
.1% of the daily volume being traded to occur for SPY on May 31, 2012. The right panel
shows both the fitted scaled χ 2 distribution and the raw data (mirrored below) for 5-minute
“realized variance” estimates for SPY on May 31, 2012.

Moments

Mean p
Variance p (1 − p )

1.2.3.2 Poisson

A Poisson random variable is a discrete random variable taking values in {0, 1, . . .}. The
Poisson depends on a single parameter λ (known as the intensity). Poisson random vari-
ables are often used to model counts of events during some interval, for example the num-
ber of trades executed over a 5-minute window.

Parameters

λ≥0

Support

x ∈ {0, 1, . . .}

Probability Mass Function

$f(x;\lambda) = \frac{\lambda^x}{x!}\exp\left(-\lambda\right)$

Moments

Mean λ
Variance λ

Continuous

1.2.3.3 Normal (Gaussian)

The normal is the most important univariate distribution in financial economics. It is the familiar "bell-shaped" distribution, and is used heavily in hypothesis testing and in modeling (net) asset returns (e.g. $r_t = \ln P_t - \ln P_{t-1}$ or $r_t = \frac{P_t - P_{t-1}}{P_{t-1}}$).

Parameters

µ ∈ (−∞, ∞) , σ2 ≥ 0

Support

x ∈ (−∞, ∞)

Probability Density Function


 
$f\left(x;\mu,\sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Cumulative Distribution Function


 
$F\left(x;\mu,\sigma^2\right) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{1}{\sqrt{2}}\frac{x-\mu}{\sigma}\right)$ where erf is the error function.⁶

⁶The error function does not have a closed form and so is a definite integral of the form

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x}\exp\left(-s^2\right)\mathrm{d}s.$$

Figure 1.5: Weekly and monthly densities for the FTSE 100 and S&P 500. All panels plot the pdf of a normal and a standardized Student's t using parameter estimates obtained by maximum likelihood estimation (see Chapter 2). The points below 0 on the y-axis show the actual returns observed during this period.

Moments

Mean µ
Variance σ2
Median µ
Skewness 0
Kurtosis 3

Notes

The normal with mean µ and variance σ² is written N(µ, σ²). A normally distributed random variable with µ = 0 and σ² = 1 is known as a standard normal. Figure 1.5 shows the fitted normal distribution to the FTSE 100 and S&P 500 using both weekly and monthly returns for the period 1984–2012. Below each figure is a plot of the raw data.

1.2.3.4 Log-Normal

Log-normal random variables are closely related to normals. If X is log-normal, then Y =


ln (X ) is normal. Like the normal, the log-normal family depends on two parameters, µ and
σ2 , although unlike the normal these parameters do not correspond to the mean and vari-
ance. Log-normal random variables are commonly used to model gross returns, Pt +1 /Pt
(although it is often simpler to model rt = ln Pt − ln Pt −1 which is normally distributed).

Parameters

µ ∈ (−∞, ∞) , σ2 ≥ 0

Support

x ∈ [0, ∞)

Probability Density Function


 
$f\left(x;\mu,\sigma^2\right) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$

Cumulative Distribution Function

Since Y = ln(X) ∼ N(µ, σ²), the cdf is the same as the normal, only using ln x in place of x.

Moments
 
Mean      exp(µ + σ²/2)
Median    exp(µ)
Variance  [exp(σ²) − 1] exp(2µ + σ²)

1.2.3.5 χ² (Chi-square)

χ²_ν random variables depend on a single parameter ν known as the degree-of-freedom. They are commonly encountered when testing hypotheses, although they are also used to model continuous variables which are non-negative, such as variances. χ²_ν random variables are closely related to standard normal random variables and are defined as the sum of ν independent standard normal random variables which have been squared. Suppose Z₁, . . . , Z_ν are standard normally distributed and independent, then $x = \sum_{i=1}^{\nu} z_i^2$ follows a χ²_ν.⁷

Parameters

ν ∈ [0, ∞)

Support

x ∈ [0, ∞)

Probability Density Function


$f(x;\nu) = \frac{1}{2^{\nu/2}\Gamma\left(\frac{\nu}{2}\right)}\,x^{\frac{\nu-2}{2}}\exp\left(-\frac{x}{2}\right)$, ν ∈ {1, 2, . . .}, where Γ(a) is the Gamma function.

Cumulative Distribution Function

$F(x;\nu) = \frac{1}{\Gamma\left(\frac{\nu}{2}\right)}\,\gamma\left(\frac{\nu}{2},\frac{x}{2}\right)$ where γ(a, b) is the lower incomplete gamma function.

Moments

Mean ν
Variance 2ν
⁷In general, if Z₁, . . . , Z_n are i.i.d. standard normal, and $y = \sum_{i=1}^{n} w_i z_i^2$, then y ∼ χ²_ν where $\nu = \sum_{i=1}^{n} w_i$. This extends the previous definition to allow for non-integer values of ν.

Notes

Figure 1.4 shows a χ² pdf which was used to fit some simple estimators of the 5-minute variance of the S&P 500 from May 31, 2012. These were computed by summing the squared 1-minute returns within each 5-minute interval (all using log prices). 5-minute variance estimators are important in high-frequency trading and other (slower) algorithmic trading.
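A hedged sketch of the realized-variance construction described above, using simulated 1-minute log prices rather than the SPY data from the figure: each 5-minute estimate is the sum of the squared 1-minute log returns within that interval.

```python
# 5-minute realized variance: sum of squared 1-minute log returns within each block.
# The price path is simulated for illustration; the notes use SPY on May 31, 2012.
import numpy as np

rng = np.random.default_rng(0)
n_minutes = 390                                          # one 6.5-hour trading day
log_prices = np.log(100.0) + np.cumsum(rng.normal(0.0, 0.0005, n_minutes + 1))

returns_1min = np.diff(log_prices)                       # 390 one-minute log returns
rv_5min = (returns_1min**2).reshape(-1, 5).sum(axis=1)   # 78 five-minute RV estimates

print(rv_5min.shape)     # (78,)
print(rv_5min.mean())    # average 5-minute realized variance
```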

1.2.3.6 Student’s t and standardized Student’s t

Student’s t random variables are also commonly encountered in hypothesis testing and,
like χν2 random variables, are closely related to standard normals. Student’s t random vari-
ables depend on a single parameter, ν, and can be constructed from two other independent
random variables. If Z a standard normal, W a χν2 and Z ⊥ ⊥ W , then x = z / wν follows
p

a Student’s t distribution. Student’s t are similar to normals except that they are heavier
tailed, although as ν → ∞ a Student’s t converges to a standard normal.

Support

x ∈ (−∞, ∞)

Probability Density Function

    f(x; ν) = (Γ((ν+1)/2) / (√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}

where Γ(a) is the Gamma function.

Moments

Mean      0, ν > 1
Median    0
Variance  ν/(ν − 2), ν > 2
Skewness  0, ν > 3
Kurtosis  3(ν − 2)/(ν − 4), ν > 4

Notes

When ν = 1, a Student’s t is known as a Cauchy random variable.


The standardized Student’s t extends the usual Student’s t in two directions. First, it
removes the variance’s dependence on ν so that the scale of the random variable can be
established separately from the degree of freedom parameter. Second, it explicitly adds
location and scale parameters so that if Y is a Student’s t random variable with degree of
freedom ν, then

    x = µ + σ √((ν − 2)/ν) y

follows a standardized Student's t distribution (ν > 2 is required). The standardized Student's t is commonly used to model heavy tailed return distributions such as stock market indices.
Figure 1.5 shows the fitted (by maximum likelihood) standardized t distribution to the FTSE 100 and S&P 500 using both weekly and monthly returns from the period 1984–2012. The typical degree of freedom parameter was around 4, indicating that the (unconditional) distributions are heavy tailed with a large kurtosis.
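A minimal Python sketch of such a fit (assuming SciPy is available and using simulated returns in place of the index data) estimates (ν, loc, scale) by maximum likelihood with scipy.stats.t.fit and then converts the scale into the σ of the standardized parameterization above, since the variance of loc + scale·Y is scale²·ν/(ν − 2).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
returns = 0.02 * rng.standard_t(df=5, size=1500)   # hypothetical weekly returns

nu, loc, scale = stats.t.fit(returns)              # MLE of the location-scale t
mu = loc
sigma = scale * np.sqrt(nu / (nu - 2))             # standard deviation implied by (nu, scale)
print(nu, mu, sigma)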

1.2.3.7 Uniform

The continuous uniform is commonly encountered in certain test statistics, especially those
involving testing whether assumed densities are appropriate for a particular series. Uni-
form random variables, when combined with quantile functions, are also useful for simu-
lating random variables.

Parameters

a , b the end points of the interval, where a < b

Support

x ∈ [a , b ]

Probability Density Function

    f(x) = 1/(b − a)

Cumulative Distribution Function

    F(x) = (x − a)/(b − a) for a ≤ x ≤ b, F(x) = 0 for x < a and F(x) = 1 for x > b

Moments

    Mean      (a + b)/2
    Median    (a + b)/2
    Variance  (b − a)²/12
    Skewness  0
    Kurtosis  9/5

Notes

A standard uniform has a = 0 and b = 1. When X ∼ F, then F(X) ∼ U(0, 1).
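This relationship underlies simulation by inverse transform: if U ∼ U(0, 1), then F⁻¹(U) has distribution F. A minimal Python sketch (assuming NumPy and SciPy are available) generates standard normal random variables this way.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # standard uniform draws
x = stats.norm.ppf(u)           # apply the normal quantile (inverse cdf) function
print(x.mean(), x.std())        # approximately 0 and 1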



1.3 Multivariate Random Variables


While univariate random variables are very important in financial economics, most appli-
cation require the use multivariate random variables. Multivariate random variables allow
relationship between two or more random quantities to be modeled and studied. For ex-
ample, the joint distribution of equity and bond returns is important for many investors.
Throughout this section, the multivariate random variable is assumed to have n com-
ponents,  
X1
 X2 
X = . 
 
.
 . 
Xn
which are arranged into a column vector. The definition of a multivariate random variable
is virtually identical to that of a univariate random variable, only mapping ω ∈ Ω to the
n -dimensional space Rn .

Definition 1.51 (Multivariate Random Variable). Let (Ω, F, P) be a probability space. If X: Ω → Rⁿ is a real-valued vector function having its domain the elements of Ω, then X: Ω → Rⁿ is called a (multivariate) n-dimensional random variable.

Multivariate random variables, like univariate random variables, are technically func-
tions of events in the underlying probability space X (ω), although the function argument
ω (the event) is usually suppressed.
Multivariate random variables can be either discrete or continuous. Discrete multivari-
ate random variables are fairly uncommon in financial economics and so the remainder
of the chapter focuses exclusively on the continuous case. The characterization of what
makes a multivariate random variable continuous is also virtually identical to that in the
univariate case.

Definition 1.52 (Continuous Multivariate Random Variable). A multivariate random variable is said to be continuous if its range is uncountably infinite and if there exists a non-negative valued function f(x1, ..., xn) defined for all (x1, ..., xn) ∈ Rⁿ such that for any event B ⊂ R(X),

    Pr(B) = ∫ ··· ∫_{(x1,...,xn)∈B} f(x1, ..., xn) dx1 ... dxn    (1.12)

and f(x1, ..., xn) = 0 for all (x1, ..., xn) ∉ R(X).

Multivariate random variables, at least when continuous, are often described by their
probability density function.

Definition 1.53 (Continuous Density Function Characterization). A function f : Rn → R is


a member of the class of multivariate continuous density functions if and only if f (x1 , . . . , xn ) ≥

0 for all x ∈ Rⁿ and

    ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, ..., xn) dx1 ... dxn = 1.    (1.13)
Definition 1.54 (Multivariate Probability Density Function). The function f(x1, ..., xn) is called a multivariate probability density function (pdf).
A multivariate density, like a univariate density, is a function which is everywhere non-
negative and that integrates to unity. Figure 1.7 shows the fitted joint probability density func-
tion to weekly returns on the FTSE 100 and S&P 500 (assuming that returns are normally
distributed). Two views are presented – one shows the 3-dimensional plot of the pdf and
the other shows the iso-probability contours of the pdf. The figure also contains a scatter
plot of the raw weekly data for comparison. All parameters were estimated using maximum
likelihood.
Example 1.55. Suppose X is a bivariate random variable; then the function f(x1, x2) = (3/2)(x1² + x2²) defined on [0, 1] × [0, 1] is a valid probability density function.

Example 1.56. Suppose X is a bivariate standard normal random variable. Then the probability density function of X is

    f(x1, x2) = (1/(2π)) exp( −(x1² + x2²)/2 ).

The multivariate cumulative distribution function is virtually identical to that in the


univariate case, and measures the total probability between −∞ (for each element of X)
and some point.
Definition 1.57 (Multivariate Cumulative Distribution Function). The joint cumulative dis-
tribution function of an n -dimensional random variable X is defined by

F (x1 , . . . , xn ) = Pr (X i ≤ xi , i = 1, . . . , n )

for all (x1 , . . . , xn ) ∈ Rn , and is given by


    F(x1, ..., xn) = ∫_{−∞}^{xn} ··· ∫_{−∞}^{x1} f(s1, ..., sn) ds1 ... dsn.    (1.14)

Example 1.58. Suppose X is a bivariate random variable with probability density function f(x1, x2) = (3/2)(x1² + x2²) defined on [0, 1] × [0, 1]. The associated cdf is

    F(x1, x2) = (x1³x2 + x1x2³)/2.
Figure 1.6 shows the joint cdf of the density in the previous example. As was the case for
univariate random variables, the probability density function can be determined by differ-
entiating the cumulative distribution function – only in the multivariate case, a derivative
is needed for each component.

Theorem 1.59 (Relationship between CDF and PDF). Let f(x1, ..., xn) and F(x1, ..., xn) represent the pdf and cdf of an n-dimensional continuous random variable X, respectively. The density function for X can be defined as f(x1, ..., xn) = ∂ⁿF(x)/(∂x1 ∂x2 ... ∂xn) whenever f(x1, ..., xn) is continuous, and f(x1, ..., xn) = 0 elsewhere.

Example 1.60. Suppose X is a bivariate random variable with cumulative distribution function F(x1, x2) = (x1³x2 + x1x2³)/2. The probability density function can be determined using

    f(x1, x2) = ∂²F(x1, x2)/(∂x1 ∂x2)
              = (1/2) ∂(3x1²x2 + x2³)/∂x2
              = (3/2)(x1² + x2²).
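A minimal Python sketch (assuming SymPy is available) reproduces this calculation symbolically by differentiating the cdf once with respect to each argument.

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
F = (x1 ** 3 * x2 + x1 * x2 ** 3) / 2
f = sp.expand(sp.diff(F, x1, x2))   # differentiate with respect to x1, then x2
print(f)                            # 3*x1**2/2 + 3*x2**2/2, i.e. (3/2)(x1^2 + x2^2)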

1.3.1 Marginal Densities and Distributions

The marginal distribution is the first concept unique to multivariate random variables.
Marginal densities and distribution functions summarize the information in a subset, usu-
ally a single component of X by averaging over all possible values of the components of X
which are not being marginalized. This involves integrating out the variables which are not
of interest. First, consider the bivariate case.

Definition 1.61 (Bivariate Marginal Probability Density Function). Let X be a bivariate ran-
dom variable comprised of X 1 and X 2 . The marginal distribution of X 1 is given by
    f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2.    (1.15)

The marginal density of X 1 is a density function where X 2 has been integrated out. This
integration is simply a form of averaging – varying x2 according to the probability associ-
ated with each value of x2 . The marginal is only a function of x1 . Both probability density
functions and cumulative distribution functions have marginal versions.

Example 1.62. Suppose X is a bivariate random variable with probability density function f(x1, x2) = (3/2)(x1² + x2²) defined on [0, 1] × [0, 1]. The marginal probability density function for X1 is

    f1(x1) = (3/2)(x1² + 1/3),

and by symmetry the marginal probability density function of X2 is

    f2(x2) = (3/2)(x2² + 1/3).
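A minimal Python sketch (assuming SymPy is available) computes this marginal by integrating x2 out of the joint density.

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = sp.Rational(3, 2) * (x1 ** 2 + x2 ** 2)   # joint pdf on [0, 1] x [0, 1]
f1 = sp.integrate(f, (x2, 0, 1))              # integrate x2 out
print(f1)                                     # 3*x1**2/2 + 1/2, i.e. (3/2)(x1^2 + 1/3)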

Example 1.63. Suppose X is a bivariate random variable with probability density function f(x1, x2) = 6x1x2² defined on [0, 1] × [0, 1]. The marginal probability density functions for X1 and X2 are

    f1(x1) = 2x1 and f2(x2) = 3x2².
Example 1.64. Suppose X is bivariate normal with parameters µ = [µ1 µ2]' and

    Σ = [ σ1²  σ12
          σ12  σ2² ],

then the marginal pdf of X1 is N(µ1, σ1²), and the marginal pdf of X2 is N(µ2, σ2²).

Figure 1.7 shows the fitted marginal distributions to weekly returns on the FTSE 100 and S&P 500 assuming that returns are normally distributed. Marginal pdfs can be transformed into marginal cdfs through integration.
Definition 1.65 (Bivariate Marginal Cumulative Distribution Function). The cumulative
marginal distribution function of X 1 in bivariate random variable X is defined by

F1 (x1 ) = Pr (X 1 ≤ x1 )

for all x1 ∈ R, and is given by

    F1(x1) = ∫_{−∞}^{x1} f1(s1) ds1.
The general j -dimensional marginal distribution partitions the n -dimensional random
variable X into two blocks, and constructs the marginal distribution for the first j by inte-
grating out (averaging over) the remaining n − j components of X . In the definition, both
X 1 and X 2 are vectors.
Definition 1.66 (Marginal Probability Density Function). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1' X2']'. The marginal probability density function for X1 is given by

    f_{1,...,j}(x1, ..., xj) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, ..., xn) dx_{j+1} ... dx_n.    (1.16)

The marginal cumulative distribution function is related to the marginal probability


density function in the same manner as the joint probability density function is related to
the cumulative distribution function. It also has the same interpretation.
Definition 1.67 (Marginal Cumulative Distribution Function). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1' X2']'. The marginal cumulative distribution function for X1 is given by

    F_{1,...,j}(x1, ..., xj) = ∫_{−∞}^{x1} ··· ∫_{−∞}^{xj} f_{1,...,j}(s1, ..., sj) ds1 ... dsj.    (1.17)

1.3.2 Conditional Distributions

Marginal distributions provide the tools needed to model the distribution of a subset of
the components of a random variable while averaging over the other components. Condi-
tional densities and distributions, on the other hand, consider a subset of the components of a random variable conditional on observing a specific value of the remaining components.
In practice, the vast majority of modeling makes use of conditional information where the
interest is in understanding the distribution of a random variable conditional on the ob-
served values of some other random variables. For example, consider the problem of mod-
eling the expected return of an individual stock. Usually other information such as the book
value of assets, earnings and return on equity are all available, and can be conditioned on
to model the conditional distribution of the stock’s return.
First, consider the bivariate case.

Definition 1.68 (Bivariate Conditional Probability Density Function). Let X be a bivariate random variable comprised of X1 and X2. The conditional probability density function for X1 given that X2 ∈ B, where B is an event with Pr(X2 ∈ B) > 0, is

    f(x1|X2 ∈ B) = ∫_B f(x1, x2) dx2 / ∫_B f2(x2) dx2.    (1.18)

When B is an elementary event (e.g. a single point), so that Pr(X2 = x2) = 0 and f2(x2) > 0, then

    f(x1|X2 = x2) = f(x1, x2)/f2(x2).    (1.19)

Conditional density functions differ slightly depending on whether the conditioning


variable is restricted to a set or a point. When the conditioning variable is specified to be a
set where Pr (X 2 ∈ B ) > 0, then the conditional density is simply the joint probability of X 1
and X 2 ∈ B divided by the marginal probability of X 2 ∈ B . When the conditioning variable
is restricted to a point, the conditional density is simply the ratio of the joint pdf divided by
the marginal pdf of X2.

Example 1.69. Suppose X is a bivariate random variable with probability density function f(x1, x2) = (3/2)(x1² + x2²) defined on [0, 1] × [0, 1]. The conditional probability density function of X1 given X2 ∈ [1/2, 1] is

    f(x1|X2 ∈ [1/2, 1]) = (1/11)(12x1² + 7),

the conditional probability density function of X1 given X2 ∈ [0, 1/2] is

    f(x1|X2 ∈ [0, 1/2]) = (1/5)(12x1² + 1),

and the conditional probability density function of X1 given X2 = x2 is

    f(x1|X2 = x2) = (x1² + x2²)/(x2² + 1/3).

Figure 1.6 shows the joint pdf along with both types of conditional densities. The upper right panel shows the conditional density for X2 ∈ [0.25, 0.5]. The highlighted region contains the components of the joint pdf which are averaged to produce the conditional density. The lower left panel also shows the pdf along with three (non-normalized) conditional densities of the form f(x1|x2). The lower right panel shows these three densities correctly normalized.
The previous example shows that, in general, the conditional probability density func-
tion differs as the region used changes.
Example 1.70. Suppose X is bivariate normal with mean µ = [µ1 µ2]' and covariance

    Σ = [ σ1²  σ12
          σ12  σ2² ],

then the conditional distribution of X1 given X2 = x2 is N( µ1 + (σ12/σ2²)(x2 − µ2), σ1² − σ12²/σ2² ).
Marginal distributions and conditional distributions are related in a number of ways.
One obvious way is that f x1 |X 2 ∈ R (X 2 ) = f1 (x1 ) – that is, the conditional probability


of X 1 given that X 2 is in its range is simply the marginal pdf of X 1 . This holds since inte-
grating over all values of x2 is essentially not conditioning on anything (which is known as
the unconditional, and a marginal density could, in principle, be called the unconditional
density since it averages across all values of the other variable).
The general definition allows for an n -dimensional random vector where the condition-
ing variable has dimension between 1 and j < n .
Definition 1.71 (Conditional Probability Density Function). Let f(x1, ..., xn) be the joint density function for an n-dimensional random variable X = [X1 ... Xn]' and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1' X2']'. The conditional probability density function for X1 given that X2 ∈ B is given by

    f(x1, ..., xj|X2 ∈ B) = ∫_{(x_{j+1},...,x_n)∈B} f(x1, ..., xn) dx_n ... dx_{j+1} / ∫_{(x_{j+1},...,x_n)∈B} f_{j+1,...,n}(x_{j+1}, ..., x_n) dx_n ... dx_{j+1},    (1.20)

and when B is an elementary event (denoted x2) and if f_{j+1,...,n}(x2) > 0,

    f(x1, ..., xj|X2 = x2) = f(x1, ..., xj, x2) / f_{j+1,...,n}(x2).    (1.21)

In general the simplified notation f(x1, ..., xj|x2) will be used to represent f(x1, ..., xj|X2 = x2).

[Figure 1.6 panels: Bivariate CDF; Conditional Probability; Conditional Densities; Normalized Conditional Densities]

Figure 1.6: These four panels show four views of a distribution defined on [0, 1] × [0, 1]. The upper left panel shows the joint cdf. The upper right shows the pdf along with the portion of the pdf used to construct a conditional distribution f(x1|x2 ∈ [0.25, 0.5]). The line shows the actual correctly scaled conditional distribution, which is only a function of x1 and which has been plotted at E[X2|X2 ∈ [0.25, 0.5]]. The lower left panel also shows the pdf along with three non-normalized conditional densities. The bottom right panel shows the correctly normalized conditional densities.

1.3.3 Independence
When variables are independent, there is a special relationship between the joint proba-
bility density function and the marginal density functions – the joint must be the product
of each marginal.

Theorem 1.72 (Independence of Random Variables). The random variables X1, ..., Xn with joint density function f(x1, ..., xn) are independent if and only if

    f(x1, ..., xn) = Π_{i=1}^n fi(xi)    (1.22)

where fi(xi) is the marginal distribution of Xi.

The intuition behind this result follows from the fact that when the components of a
random variable are independent, any change in one component has no information for
the others. In other words, both marginals and conditionals must be the same.

Example 1.73. Let X be a bivariate random variable with probability density function f(x1, x2) = 4x1x2 on [0, 1] × [0, 1]; then X1 and X2 are independent. This can be verified since

    f1(x1) = 2x1 and f2(x2) = 2x2,

so that the joint is the product of the two marginal densities.

Independence is a very strong concept, and it carries over from random variables to
functions of random variables as long as each function involves one random variable.8

Theorem 1.74 (Independence of Functions of Independent Random Variables). Let X 1 and


X 2 be independent random variables and define y1 = Y1 (x1 ) and y2 = Y2 (x2 ), then the ran-
dom variables Y1 and Y2 are independent.

Independence is often combined with an assumption that the marginal distribution is


the same to simplify the analysis of collections of random data.

Definition 1.75 (Independent, Identically Distributed). Let {Xi} be a sequence of random variables. If the marginal distribution for Xi is the same for all i and Xi ⊥⊥ Xj for all i ≠ j, then {Xi} is said to be an independent, identically distributed (i.i.d.) sequence.

1.3.4 Bayes Rule


Bayes rule is used both in financial economics and econometrics. In financial economics,
it is often used to model agents' learning, and in econometrics it is used to make inference
⁸This can be generalized to the full multivariate case where X is an n-dimensional random variable and the first j components are independent from the last n − j components, defining y1 = Y1(x1, ..., xj) and y2 = Y2(x_{j+1}, ..., x_n).

about unknown parameters given observed data (a branch known as Bayesian economet-
rics). Bayes rule follows directly from the definition of a conditional density so that the
joint can be factored into a conditional and a marginal. Suppose X is a bivariate random
variable, then

    f(x1, x2) = f(x1|x2) f2(x2)
              = f(x2|x1) f1(x1).


The joint can be factored two ways, and equating the two factorizations produces Bayes
rule.

Definition 1.76 (Bivariate Bayes Rule). Let X be a bivariate random variable with components X1 and X2. Then

    f(x1|x2) = f(x2|x1) f1(x1) / f2(x2).    (1.23)

Bayes rule states that the probability of observing X1 given a value of X2 is equal to the joint probability of the two random variables divided by the marginal probability of observing X2. Bayes rule is normally applied where there is a belief about X1 (f1(x1), called a prior), and the conditional distribution of X2 given X1 is a known density (f(x2|x1), called the likelihood), which combine to form an updated belief about X1 (f(x1|x2), called the posterior). The marginal density of X2 is not important when using Bayes rule since the numerator is still proportional to the conditional density of X1 given X2, because f2(x2) is just some value, and so it is common to express the posterior as

    f(x1|x2) ∝ f(x2|x1) f1(x1),

where ∝ is read “is proportional to”.

Example 1.77. Suppose interest lies in the probability that a firm goes bankrupt, which can be modeled as a Bernoulli distribution. The parameter p is unknown but, given a value of p, the likelihood that a firm goes bankrupt is

    f(x|p) = p^x (1 − p)^{1−x}.

Since p is unknown, a prior for the bankruptcy rate can be specified. Suppose the prior for p follows a Beta(α, β) distribution, which has pdf

    f(p) = p^{α−1} (1 − p)^{β−1} / B(α, β)

where B(a, b) is the Beta function that acts as a normalizing constant.⁹ The Beta distribution has support on [0, 1] and nests the standard uniform as a special case. The expected value of a random variable with a Beta(α, β) distribution is α/(α + β) and the variance is αβ/((α + β)²(α + β + 1)), where α > 0 and β > 0.
Using Bayes rule,

    f(p|x) ∝ p^x (1 − p)^{1−x} × p^{α−1} (1 − p)^{β−1} / B(α, β)
           = p^{α−1+x} (1 − p)^{β−x} / B(α, β).

Note that this isn't a density since it has the wrong normalizing constant. However, the component of the density which contains p, p^{α−1+x}(1 − p)^{β−x} (known as the kernel), is the same as in a Beta distribution, only with different parameters. Thus the posterior, f(p|x), is Beta(α + x, β + 1 − x). Since the posterior is in the same family as the prior, it could be combined with another observation (and the Bernoulli likelihood) to produce an updated posterior. When a Bayesian problem has this property, the prior density is called conjugate to the likelihood.
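A minimal Python sketch of this conjugate update (using hypothetical prior parameters and a single observed bankruptcy indicator) moves from Beta(α, β) to Beta(α + x, β + 1 − x).

alpha, beta = 2.0, 20.0                        # hypothetical prior Beta(alpha, beta)
x = 1                                          # observed bankruptcy indicator
alpha_post, beta_post = alpha + x, beta + 1 - x
prior_mean = alpha / (alpha + beta)
post_mean = alpha_post / (alpha_post + beta_post)
print(prior_mean, post_mean)                   # the posterior mean moves toward x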

Example 1.78. Suppose M is a random variable representing the score on the midterm, and interest lies in the final course grade, C. The prior for C is normal with mean µ and variance σ², and the distribution of M given C is conditionally normal with mean C and variance τ². Bayes rule can be used to make inference on the final course grade given the midterm grade.

    f(c|m) ∝ f(m|c) fC(c)
            ∝ (1/√(2πτ²)) exp( −(m − c)²/(2τ²) ) × (1/√(2πσ²)) exp( −(c − µ)²/(2σ²) )
            = K exp( −(1/2) [ (m − c)²/τ² + (c − µ)²/σ² ] )
            = K exp( −(1/2) [ c²/τ² + c²/σ² − 2cm/τ² − 2cµ/σ² + m²/τ² + µ²/σ² ] )
            = K exp( −(1/2) [ c²(1/τ² + 1/σ²) − 2c(m/τ² + µ/σ²) + (m²/τ² + µ²/σ²) ] )

⁹The Beta function is given by the definite integral B(a, b) = ∫₀¹ s^{a−1} (1 − s)^{b−1} ds.

This (non-normalized) density can be shown to have the kernel of a normal by completing the square,¹⁰

    f(c|m) ∝ exp( −(1/2) (1/τ² + 1/σ²) ( c − (m/τ² + µ/σ²)/(1/τ² + 1/σ²) )² ).

This is the kernel of a normal density with mean

    (m/τ² + µ/σ²) / (1/τ² + 1/σ²),

and variance

    (1/τ² + 1/σ²)⁻¹.
The mean is a weighted average of the prior mean, µ, and the midterm score, m, where the weights are determined by the inverse variances of the prior and conditional distributions. Since the weights are proportional to the inverse of the variance, a small variance leads to a relatively large weight. If τ² = σ², then the posterior mean is simply the average of the prior mean and the midterm score. The variance of the posterior depends on both the variance of the prior and the conditional variance of the data. The posterior variance is always smaller than the smaller of σ² and τ². Like the Bernoulli-Beta combination in the previous problem, the normal distribution is a conjugate prior when the conditional density is normal.
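A minimal Python sketch of this posterior calculation (with purely hypothetical values for the prior mean and variance and for the midterm score) computes the precision-weighted mean and the posterior variance.

mu, sigma2 = 70.0, 100.0     # hypothetical prior: C ~ N(mu, sigma2)
m, tau2 = 82.0, 25.0         # hypothetical midterm score and conditional variance
post_prec = 1.0 / tau2 + 1.0 / sigma2
post_mean = (m / tau2 + mu / sigma2) / post_prec   # precision-weighted average
post_var = 1.0 / post_prec
print(post_mean, post_var)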

1.3.5 Common Multivariate Distributions

1.3.5.1 Multivariate Normal

Like the univariate normal, the multivariate normal depends on two parameters: µ, an n by 1 vector of means, and Σ, an n by n positive semi-definite matrix of covariances. The multivariate normal is closed to both marginalization and conditioning – in other words, if X is multivariate normal, then all marginal distributions of X are normal, and so are all conditional distributions of X1 given X2 for any partitioning.

Parameters

µ ∈ Rn , Σ a positive semi-definite matrix

¹⁰Suppose a quadratic in x has the form ax² + bx + c. Then

    ax² + bx + c = a(x − d)² + e

where d = −b/(2a) and e = c − b²/(4a).



[Figure 1.7 panels: Weekly FTSE and S&P 500 Returns; Marginal Densities; Bivariate Normal PDF; Contour of Bivariate Normal PDF]

Figure 1.7: These four figures show different views of the weekly returns of the FTSE 100 and the S&P 500. The top left contains a scatter plot of the raw data. The top right shows the marginal distributions from a fitted bivariate normal distribution (estimated by maximum likelihood). The bottom two panels show two representations of the joint probability density function.

Support

x ∈ Rn

Probability Density Function

    f(x; µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)'Σ⁻¹(x − µ) )

Cumulative Distribution Function

Can be expressed as a series of n univariate normal cdfs using repeated conditioning.

Moments

Mean µ
Median µ
Variance Σ
Skewness 0
Kurtosis 3

Marginal Distribution

The marginal distribution for the first j components is

    f_{X1,...,Xj}(x1, ..., xj) = (2π)^{−j/2} |Σ11|^{−1/2} exp( −(1/2)(x1 − µ1)'Σ11⁻¹(x1 − µ1) ),

where it is assumed that the marginal distribution is that of the first j random variables¹¹, µ = [µ1' µ2']' where µ1 corresponds to the first j entries, and

    Σ = [ Σ11   Σ12
          Σ12'  Σ22 ].

In other words, the distribution of [X1, ..., Xj]' is N(µ1, Σ11). Moreover, the marginal distribution of a single element of X is N(µi, σi²) where µi is the ith element of µ and σi² is the ith diagonal element of Σ.

Conditional Distribution

The conditional probability of X 1 given X 2 = x2 is

    N( µ1 + β'(x2 − µ2), Σ11 − β'Σ22β )

where β = Σ22⁻¹Σ12'.

¹¹Any two variables can be reordered in a multivariate normal by swapping their means and reordering the covariance matrix by swapping the corresponding rows and columns.

When X is a bivariate normal random variable,

    [x1, x2]' ∼ N( [µ1, µ2]', [ σ1²  σ12 ; σ12  σ2² ] ),

the conditional distribution is

    X1|X2 = x2 ∼ N( µ1 + (σ12/σ2²)(x2 − µ2), σ1² − σ12²/σ2² ),

where the variance can be seen to always be positive since σ1²σ2² ≥ σ12² by the Cauchy-Schwarz inequality.
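A minimal Python sketch of this conditional distribution (with hypothetical values for the means, variances and covariance) computes the conditional mean and variance of X1 given an observed x2.

mu1, mu2 = 0.001, 0.001
sigma1_sq, sigma2_sq, sigma12 = 0.0005, 0.0006, 0.0004   # hypothetical parameters
x2 = -0.02                                               # observed value of X2
cond_mean = mu1 + sigma12 / sigma2_sq * (x2 - mu2)
cond_var = sigma1_sq - sigma12 ** 2 / sigma2_sq
print(cond_mean, cond_var)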

Notes

A standard multivariate normal has µ = 0 and Σ = In . When the covariance between


elements i and j equals zero (so that σi j = 0), they are independent. For the normal,
a covariance (or correlation) of 0 implies independence. This is not true of most other
multivariate random variables.

1.4 Expectations and Moments


Expectations and moments are (non-random) functions of random variables that are use-
ful in both understanding properties of random variables – e.g. when comparing the dis-
persion between two distributions – and when estimating parameters using a technique
known as the method of moments (see Chapter 2).

1.4.1 Expectations

The expectation is the value, on average, of a random variable (or function of a random
variable). Unlike common English language usage, where one’s expectation is not well de-
fined (e.g. could be the mean or the mode, another measure of the tendency of a random
variable), the expectation in a probabilistic sense always averages over the possible values
weighting by the probability of observing each value. The form of an expectation in the
discrete case is particularly simple.

Definition 1.79 (Expectation of a Discrete Random Variable). The expectation of a discrete random variable, defined E[X] = Σ_{x∈R(X)} x f(x), exists if and only if Σ_{x∈R(X)} |x| f(x) < ∞.

When the range of X is finite then the expectation always exists. When the range is infi-
nite, such as when a random variable takes on values in the range 0, 1, 2, . . ., the probability

mass function must be sufficiently small for large values of the random variable in order
for the expectation to exist.12 Expectations of continuous random variables are virtually
identical, only replacing the sum with an integral.

Definition 1.80 (Expectation of a Continuous Random Variable). The expectation of a continuous random variable, defined E[X] = ∫_{−∞}^{∞} x f(x) dx, exists if and only if ∫_{−∞}^{∞} |x| f(x) dx < ∞.

Existence of an expectation is a somewhat difficult concept. For continuous random


variables, expectations may not exist if the probability of observing an arbitrarily large
value (in the absolute sense) is very high. For example, in a Student’s t distribution when
the degree of freedom parameter ν is 1 (also known as a Cauchy distribution), the proba-
bility of observing a value with size |x | is proportional to x −1 for large x (in other words,
f (x ) ∝ c x −1 ) so that when we compute x f (x ), we simply have c for large x . The range is
unbounded, and so the integral of a constant, even if very small, will not converge, and so
the expectation does not exist. On the other hand, when a random variable is bounded, it’s
expectation always exists.

Theorem 1.81 (Expectation Existence for Bounded Random Variables). If |x | < c for all
x ∈ R (X ), then E [X ] exists.

The expectation operator, E [·] is generally defined for arbitrary functions of a random
variable, g (x ). In practice, g (x ) takes many forms – x , x 2 , x p for some p , exp (x ) or some-
thing more complicated. Discrete and continuous expectations are closely related. Figure
1.8 shows a standard normal along with a discrete approximation where each bin has a
width of 0.20 and the height is based on the pdf value at the mid-point of the bin. Treating
the normal as a discrete distribution based on this approximation would provide reason-
able approximations to the correct (integral) expectations.
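A minimal Python sketch of this discrete approximation (assuming SciPy is available) approximates E[X²] for a standard normal, whose true value is 1, using bins of width 0.20 and the pdf value at each bin mid-point.

import numpy as np
from scipy import stats

width = 0.20
mids = np.arange(-5 + width / 2, 5, width)   # bin mid-points covering [-5, 5]
weights = stats.norm.pdf(mids) * width       # approximate bin probabilities
print(np.sum(mids ** 2 * weights))           # roughly 1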

Definition 1.82 (Expectation of a Function of a Random Variable). The expectation of a random variable defined as a function of X, Y = g(X), is E[Y] = E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx, which exists if and only if ∫_{−∞}^{∞} |g(x)| f(x) dx < ∞.

When g (x ) is either concave or convex, Jensen’s inequality provides a relationship be-


tween the expected value of the function and the function of the expected value of the
underlying random variable.

Theorem 1.83 (Jensen’s Inequality). If g (·) is a continuous convex function on an open in-
terval containing the range of X , then E [g (X )] ≥ g (E [X ]). Similarly, if g (·) is a continuous
concave function on an open interval containing the range of X , then E [g (X )] ≤ g (E [X ]).
12
Non-existence of an expectation simply means that the sum converges to ±∞ or oscillates. The use of
the |x | in the definition of existence is to rule out both the −∞ and the oscillating cases.


Figure 1.8: The left panel shows a standard normal and a discrete approximation. Discrete
approximations are useful for approximating integrals in expectations. The right panel
shows the relationship between the quantile function and the cdf.

The inequalities become strict if the functions are strictly convex (or concave) as long
as X is not degenerate.13 Jensen’s inequality is common in economic applications. For ex-
ample, standard utility functions (U (·)) are assumed to be concave which reflects the idea
that marginal utility (U 0 (·)) is decreasing in consumption (or wealth). Applying Jensen’s in-
equality shows that if consumption is random, then E [U (c )] < U (E [c ]) – in other words,
the economic agent is worse off when facing uncertain consumption. Convex functions are
also commonly encountered, for example in option pricing or in (production) cost func-
tions. The expectations operator has a number of simple and useful properties:

13
A degenerate random variable has probability 1 on a single point, and so is not meaningfully random.

• If c is a constant, then E[c] = c. This property follows since the expectation is an integral against a probability density which integrates to unity.

• If c is a constant, then E[cX] = cE[X]. This property follows directly from passing the constant out of the integral in the definition of the expectation operator.

• The expectation of the sum is the sum of the expectations,

      E[ Σ_{i=1}^k gi(X) ] = Σ_{i=1}^k E[gi(X)].

  This property follows directly from the distributive property of multiplication.

• If a is a constant, then E[a + X] = a + E[X]. This property also follows from the distributive property of multiplication.

• E[f(X)] = f(E[X]) when f(x) is affine (i.e. f(x) = a + bx where a and b are constants). For general non-linear functions, it is usually the case that E[f(X)] ≠ f(E[X]) when X is non-degenerate.

• E[X^p] ≠ E[X]^p except when p = 1, when X is non-degenerate.

These rules are used throughout financial economics when studying random variables
and functions of random variables.
The expectation of a function of a multivariate random variable is similarly defined,
only integrating across all dimensions.

Definition 1.84 (Expectation of a Multivariate Random Variable). Let (X1, X2, ..., Xn) be a continuously distributed n-dimensional multivariate random variable with joint density function f(x1, x2, ..., xn). The expectation of Y = g(X1, X2, ..., Xn) is defined as

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x1, x2, ..., xn) f(x1, x2, ..., xn) dx1 dx2 ... dxn.    (1.24)

It is straightforward to see that the rule that the expectation of the sum is the sum of the expectations carries over to multivariate random variables, and so

    E[ Σ_{i=1}^n gi(X1, ..., Xk) ] = Σ_{i=1}^n E[gi(X1, ..., Xk)].

Additionally, taking gi(X1, ..., Xk) = Xi, we have E[ Σ_{i=1}^k Xi ] = Σ_{i=1}^k E[Xi].

1.4.2 Moments

Moments are simply expectations of particular functions of a random variable, typically


g (x ) = x s for s = 1, 2, . . ., and are often used to compare distributions or to estimate
parameters.

Definition 1.85 (Noncentral Moment). The rth noncentral moment of a continuous random variable X is defined as

    µ′r ≡ E[X^r] = ∫_{−∞}^{∞} x^r f(x) dx    (1.25)

for r = 1, 2, . . ..

The first non-central moment is simply the average, or mean, of the random variable.

Definition 1.86 (Mean). The first non-central moment of a random variable X is called the
mean of X and is denoted µ.

Central moments are similarly defined, only centered around the mean.

Definition 1.87 (Central Moment). The rth central moment of a random variable X is defined as

    µr ≡ E[(X − µ)^r] = ∫_{−∞}^{∞} (x − µ)^r f(x) dx    (1.26)

for r = 2, 3, . . ..

Aside from the first moment, most generic use of “moment” will be referring to central
moments. Moments may not exist if a distribution is sufficiently heavy tailed. However, if
the r th moment exists, then any moment of lower order must also exist.

Theorem 1.88 (Lesser Moment Existence). If µ′r exists for some r, then µ′s exists for s ≤ r. Moreover, for any r, µ′r exists if and only if µr exists.

Central moments are used to describe a distribution since they are invariant to changes
in the mean. The second central moment is known as the variance.
Definition 1.89 (Variance). The second central moment of a random variable X, E[(X − µ)²], is called the variance and is denoted σ² or, equivalently, V[X].

The variance operator (V [·]) also has a number of useful properties.



• If c is a constant, then V [c ] = 0.

• If c is a constant, then V [c X ] = c 2 V [X ].

• If a is a constant, then V [a + X ] = V [X ].

• The variance of the sum is the sum of the variances plus twice all of the covariancesᵃ,

      V[ Σ_{i=1}^n Xi ] = Σ_{i=1}^n V[Xi] + 2 Σ_{j=1}^n Σ_{k=j+1}^n Cov[Xj, Xk]

ᵃSee Section 1.4.7 for more on covariances.
The variance is a measure of dispersion, although the square root of the variance, known
as the standard deviation, is typically more useful.14
Definition 1.90 (Standard Deviation). The square root of the variance is known as the standard deviation and is denoted σ or, equivalently, std(X).
The standard deviation is a more meaningful measure than the variance since its units
are the same as the mean (and random variable). For example, suppose X is the return
on the stock market next year, and that the mean of X is 8% and the standard deviation is
20% (the variance is .04). The mean and standard deviation are both measured as percent-
age change in investment, and so can be directly compared, such as in the Sharpe ratio
(Sharpe 1994). Applying the properties of the expectation operator and variance operator,
it is possible to define a studentized (or standardized) random variable.
Definition 1.91 (Studentization). Let X be a random variable with mean µ and variance σ²; then

    Z = (X − µ)/σ    (1.27)

is a studentized version of X (also known as standardized). Z has mean 0 and variance 1.
Standard deviation also provides a bound on the probability which can lie in the tail of
a distribution, as shown in Chebyshev’s inequality.
Theorem 1.92 (Chebyshev's Inequality). Pr(|X − µ| ≥ kσ) ≤ 1/k² for k > 0.
Chebyshev's inequality is useful in a number of contexts. One of the most useful is in establishing that when an estimator has a variance that tends to 0 as the sample size increases, virtually all of the probability must eventually be concentrated around a particular point.
14
The standard deviation is occasionally confused for the standard error. While both are square roots of
variances, the standard deviation refers to deviation in a random variable while standard error is reserved for
parameter estimators.

The third central moment does not have a specific name, although it is called the skewness when standardized by the variance raised to the power 3/2.

Definition 1.93 (Skewness). The third central moment, standardized by the second central moment raised to the power 3/2,

    µ3 / (σ²)^{3/2} = E[(X − E[X])³] / (E[(X − E[X])²])^{3/2} = E[Z³]    (1.28)

is defined as the skewness, where Z is a studentized version of X.

The skewness is a general measure of asymmetry, and is 0 for a symmetric distribution


(assuming the third moment exists). The normalized fourth central moment is known as
the kurtosis.

Definition 1.94 (Kurtosis). The fourth central moment, standardized by the second central moment squared,

    µ4 / (σ²)² = E[(X − E[X])⁴] / (E[(X − E[X])²])² = E[Z⁴]    (1.29)

is defined as the kurtosis and is denoted κ, where Z is a studentized version of X.

The kurtosis is a measure of the chance of observing a large (in absolute terms) value, and is often given as excess kurtosis.

Definition 1.95 (Excess Kurtosis). The kurtosis of a random variable minus the kurtosis of
a normal random variable, κ − 3, is known as excess kurtosis.

Random variables with a positive excess kurtosis are often referred to as heavy tailed.
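A minimal Python sketch (assuming NumPy and SciPy are available and using simulated heavy-tailed returns as hypothetical data) computes the sample skewness E[Z³] and excess kurtosis E[Z⁴] − 3.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
returns = 0.02 * rng.standard_t(df=4, size=5000)     # hypothetical heavy-tailed returns

z = (returns - returns.mean()) / returns.std()       # studentized returns
print((z ** 3).mean())                               # sample skewness
print((z ** 4).mean() - 3)                           # sample excess kurtosis
print(stats.skew(returns), stats.kurtosis(returns))  # the same quantities via SciPy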

1.4.3 Related Measures


While moments are useful in describing the properties of a random variable, other measures are
also commonly encountered. The median is an alternative measure of central tendency.

Definition 1.96 (Median). Any number m satisfying Pr (X ≤ m ) = 0.5 and Pr (X ≥ m) = 0.5


is known as the median of X .

The median measures the point where 50% of the distribution lies on either side (it may
not be unique), and is just a particular quantile. The median has a few advantages over the
mean, and in particular it is less affected by outliers (e.g. the difference between mean and
median income) and it always exists (the mean doesn’t exist for very heavy tailed distribu-
tions).

The interquartile range uses quartiles¹⁵ to provide an alternative measure of dispersion to the standard deviation.

Definition 1.97 (Interquartile Range). The value q.75 − q.25 is known as the interquartile
range.

The mode complements the mean and median as a measure of central tendency. A mode is simply a maximum of a density.

Definition 1.98 (Mode). Let X be a random variable with density function f(x). Any point c where f(x) attains a maximum is known as a mode.

Distributions can be unimodal or multimodal.

Definition 1.99 (Unimodal Distribution). Any random variable which has a single, unique
mode is called unimodal.

Note that modes in a multimodal distribution do not necessarily have to have equal
probability.

Definition 1.100 (Multimodal Distribution). Any random variable which has more than one mode is called multimodal.

Figure 1.9 shows a number of distributions. The distributions depicted in the top panels are all unimodal. The distributions in the bottom panels are mixtures of normals, meaning that with probability p the random variable comes from one normal, and with probability 1 − p it is drawn from the other. Both mixtures of normals are multimodal.

1.4.4 Multivariate Moments


Other moment definitions are only meaningful when studying 2 or more random variables
(or an n -dimensional random variable). When applied to a vector or matrix, the expecta-
tions operator applies element-by-element. For example, if X is an n -dimensional random
variable,

    E[X] = E[ [X1, X2, ..., Xn]' ] = [ E[X1], E[X2], ..., E[Xn] ]'.    (1.30)

Covariance is a measure which captures the tendency of two variables to move together
in a linear sense.
¹⁵m-tiles include terciles (3), quartiles (4), quintiles (5), deciles (10) and percentiles (100). In all cases the bin ends are ((i − 1)/m, i/m) where m is the number of bins and i = 1, 2, ..., m.

[Figure 1.9 panels: Std. Normal; χ² densities with ν = 1, 3, 5; 50-50 Mixture Normal; 30-70 Mixture Normal]

Figure 1.9: These four figures show two unimodal (upper panels) and two multimodal
(lower panels) distributions. The upper left is a standard normal density. The upper right
shows three χ 2 densities for ν = 1, 3 and 5. The lower panels contain mixture distributions
of 2 normals – the left is a 50-50 mixture of N (−1, 1) and N (1, 1) and the right is a 30-70
mixture of N (−2, 1) and N (1, 1).

Definition 1.101 (Covariance). The covariance between two random variables X and Y is
defined

Cov [X , Y ] = σX Y = E [(X − E [X ]) (Y − E [Y ])] . (1.31)

Covariance can be alternatively defined using the joint product moment and the prod-
uct of the means.
Theorem 1.102 (Alternative Covariance). The covariance between two random variables X
and Y can be equivalently defined

σX Y = E [X Y ] − E [X ] E [Y ] . (1.32)

Rearranging the covariance expression shows that a lack of covariance is enough to ensure that the expectation of a product is the product of the expectations.
Theorem 1.103 (Zero Covariance and Expectation of Product). If X and Y have σX Y = 0,
then E [X Y ] = E [X ] E [Y ].
The previous result follows directly from the definition of covariance since σX Y = E [X Y ]−
E [X ] E [Y ]. In financial economics, this result is often applied to products of random vari-
ables so that the mean of the product can be directly determined by knowledge of the mean
of each variable and the covariance between the two. For example, when studying con-
sumption based asset pricing, it is common to encounter terms involving the expected
value of consumption growth times the pricing kernel (or stochastic discount factor) – in
many cases the full joint distribution of the two is intractable although the mean and co-
variance of the two random variables can be determined.
The Cauchy-Schwartz inequality is a version of the triangle inequality and states that
the expectation of the squared product is less than the product of the squares.
Theorem 1.104 (Cauchy-Schwarz Inequality). E[(XY)²] ≤ E[X²]E[Y²].

Example 1.105. When X is an n -dimensional random variable, it is useful to assemble the


variances and covariances into a covariance matrix.
Definition 1.106 (Covariance Matrix). The covariance matrix of an n-dimensional random variable X is defined as

    Cov[X] = Σ = E[ (X − E[X])(X − E[X])' ] = [ σ1²  σ12  ...  σ1n
                                                σ12  σ2²  ...  σ2n
                                                ...  ...  ...  ...
                                                σ1n  σ2n  ...  σn² ]

where the ith diagonal element contains the variance of Xi (σi²) and the element in position (i, j) contains the covariance between Xi and Xj (σij).


When X is composed of two sub-vectors, a block form of the covariance matrix is often
convenient.

Definition 1.107 (Block Covariance Matrix). Suppose X1 is an n1-dimensional random variable and X2 is an n2-dimensional random variable. The block covariance matrix of X = [X1' X2']' is

    Σ = [ Σ11   Σ12
          Σ12'  Σ22 ]    (1.33)

where Σ11 is the n1 by n1 covariance of X1, Σ22 is the n2 by n2 covariance of X2 and Σ12 is the n1 by n2 covariance matrix between X1 and X2, with element (i, j) equal to Cov[X_{1,i}, X_{2,j}].

A standardized version of covariance is often used to produce a scale free measure.

Definition 1.108 (Correlation). The correlation between two random variables X and Y is defined as

    Corr[X, Y] = ρXY = σXY/(σX σY).    (1.34)

Additionally, the correlation is always in the interval [−1, 1], which follows from the
Cauchy-Schwartz inequality.

Theorem 1.109. If X and Y are independent random variables, then ρX Y = 0 as long as σ2X
and σ2Y exist.

It is important to note that the converse of this statement is not true – that is, a lack of
correlation does not imply that two variables are independent. In general, a correlation of
0 only implies independence when the variables are multivariate normal.

Example 1.110. Suppose X and Y have ρX Y = 0, then X and Y are not necessarily inde-
pendent. Suppose X is a discrete uniform random variable taking values in {−1, 0, 1} and
Y = X², then σ²X = 2/3, σ²Y = 2/9 and σXY = 0. While X and Y are uncorrelated, they are
clearly not independent, since when the random variable Y takes the value 1, X must take
either the value −1 or 1.
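A minimal Python sketch (assuming NumPy is available) verifies this numerically: the sample covariance between X and Y = X² is essentially zero, yet the distribution of Y clearly changes with X.

import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1, 0, 1], size=100_000)   # discrete uniform on {-1, 0, 1}
y = x ** 2
print(np.cov(x, y)[0, 1])                  # approximately 0
print(y[x == 0].mean(), y[x != 0].mean())  # 0 versus 1, so Y depends on X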

The corresponding correlation matrix can be assembled. Note that a correlation matrix
has 1s on the diagonal and values bounded by [−1, 1] on the off diagonal positions.

Definition 1.111 (Correlation Matrix). The correlation matrix of an n-dimensional random variable X is defined as

    (Σ ⊙ In)^{−1/2} Σ (Σ ⊙ In)^{−1/2}    (1.35)

where ⊙ is the element-by-element (Hadamard) product, and where the (i, j)th element has the form σ_{Xi Xj}/(σ_{Xi} σ_{Xj}) when i ≠ j and 1 when i = j.

1.4.5 Conditional Expectations

One of the most useful forms of expectation is the conditional expectation, which is an expectation computed using conditional densities in place of joint or marginal densities. Conditional expectations essentially treat one of the variables (in a bivariate random variable) as constant.

Definition 1.112 (Bivariate Conditional Expectation). Let X be a continuous bivariate random variable comprised of X1 and X2. The conditional expectation of X1 given X2 = x2 is

    E[ g(X1)|X2 = x2 ] = ∫_{−∞}^{∞} g(x1) f(x1|x2) dx1    (1.36)

where f(x1|x2) is the conditional probability density function of X1 given X2.¹⁶




 
In many cases, it is useful to avoid specifying a specific value for X 2 in which case E X 1 |X 1
 
will be used. Note that E X 1 |X 2 will typically be a function of the random variable X 2 .

Example 1.113. Suppose X is a bivariate normal distribution with components X1 and X2, µ = [µ1 µ2]' and

    Σ = [ σ1²  σ12
          σ12  σ2² ],

then E[X1|X2 = x2] = µ1 + (σ12/σ2²)(x2 − µ2). This follows from the conditional density of a bivariate random variable.

The law of iterated expectations uses conditional expectations to show that the condi-
tioning does not affect the final result of taking expectations – in other words, the order of
taking expectations, does not matter.

Theorem 1.114 (Bivariate Law of Iterated Expectations). Let X be a continuous bivariate random variable comprised of X1 and X2. Then E[ E[g(X1)|X2] ] = E[g(X1)].

The law of iterated expectations follows from basic properties of an integral since the
order of integration does not matter as long as all integrals are taken.

Example 1.115. Suppose X is a bivariate normal distribution with components X1 and X2, µ = [µ1 µ2]' and

    Σ = [ σ1²  σ12
          σ12  σ2² ],

then E[X1] = µ1 and

    E[ E[X1|X2] ] = E[ µ1 + (σ12/σ2²)(X2 − µ2) ]
                  = µ1 + (σ12/σ2²)(E[X2] − µ2)
                  = µ1 + (σ12/σ2²)(µ2 − µ2)
                  = µ1.

¹⁶A conditional expectation can also be defined in a natural way for functions of X1 given X2 ∈ B where Pr(X2 ∈ B) > 0.

When using conditional expectations, any random variable conditioned on is essentially non-random (in the conditional expectation), and so E[ E[X1X2|X2] ] = E[ X2 E[X1|X2] ]. This is a very useful tool when combined with the law of iterated expectations when E[X1|X2] is a known function of X2.

Example 1.116. Suppose X is a bivariate normal distribution with components X1 and X2, µ = 0 and

    Σ = [ σ1²  σ12
          σ12  σ2² ],

then

    E[X1X2] = E[ E[X1X2|X2] ]
            = E[ X2 E[X1|X2] ]
            = E[ X2 (σ12/σ2²) X2 ]
            = (σ12/σ2²) E[X2²]
            = (σ12/σ2²) σ2²
            = σ12.

One particularly useful application of conditional expectations occurs when the conditional expectation is known and constant, so that E[X1|X2] = c.

Example 1.117. Suppose X is a bivariate random variable composed of X1 and X2 and that E[X1|X2] = c. Then E[X1] = c since

    E[X1] = E[ E[X1|X2] ]
          = E[c]
          = c.

Conditional expectations can be taken for general n -dimensional random variables,



and the law of iterated expectations holds as well.


Definition 1.118 (Conditional Expectation). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1' X2']'. The conditional expectation of X1 given X2 = x2 is

    E[ g(X1)|X2 = x2 ] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x1, ..., xj) f(x1, ..., xj|x2) dxj ... dx1    (1.37)

where f(x1, ..., xj|x2) is the conditional probability density function of X1 given X2 = x2.
The law of iterated expectations also holds for arbitrary partitions.
Theorem 1.119 (Law of Iterated Expectations). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1' X2']'. Then E[ E[g(X1)|X2] ] = E[g(X1)]. The law of iterated expectations is also known as the law of total expectations.


Full multivariate conditional expectations are extremely common in time series. For
example, when using daily data, there are over 30,000 observations of the Dow Jones In-
dustrial Average available to model. Attempting to model the full joint distribution would
be a formidable task. On the other hand, modeling the conditional expectation (or condi-
tional mean) of the final observation, conditioning on those observations in the past, is far
simpler.
Example 1.120. Suppose {Xt} is a sequence of random variables where Xt comes after Xt−j for j ≥ 1. The conditional expectation of Xt given its past is

    E[ Xt | Xt−1, Xt−2, . . . ].

Example 1.121. Let {εt} be a sequence of independent, identically distributed random variables with mean 0 and variance σ² < ∞. Define X0 = 0 and Xt = Xt−1 + εt. Then Xt is a random walk and E[Xt|Xt−1] = Xt−1, which follows since Xt−1 is a function of εt−1, εt−2, . . . and is therefore independent of εt.
This leads naturally to the definition of a martingale, which is an important concept in financial economics related to efficient markets.
Definition 1.122 (Martingale). If E[ Xt+j | Xt−1, Xt−2, . . . ] = Xt−1 for all j ≥ 0 and E[|Xt|] < ∞, both holding for all t, then {Xt} is a martingale. Similarly, if E[ Xt+j − Xt−1 | Xt−1, Xt−2, . . . ] = 0 for all j ≥ 0 and E[|Xt|] < ∞, both holding for all t, then {Xt} is a martingale.

1.4.6 Conditional Moments


All moments can be defined in a conditional form simply by replacing the integral using
the probability density function with an integral using the conditional probability density

function. For example, the (unconditional) mean becomes the conditional mean, and the
variance becomes a conditional variance.

Definition 1.123 (Conditional Variance). The variance of a random variable X conditional on another random variable Y is

    V[X|Y] = E[ (X − E[X|Y])² | Y ]    (1.38)
           = E[X²|Y] − E[X|Y]².

The two definitions of conditional variance are identical to those of the (unconditional) variance where the expectation operators have been replaced with conditional expectations. Conditioning can be used to compute higher-order moments as well.

Definition 1.124 (Conditional Moment). The rth central moment of a random variable X conditional on another random variable Y is defined as

    µr ≡ E[ (X − E[X|Y])^r | Y ]    (1.39)

for r = 2, 3, . . ..

Combining the conditional expectation and the conditional variance, leads to the law
of total variance.

Theorem 1.125. The variance of a random variable X can be decomposed into the variance
of the conditional expectation plus the expectation of the conditional variance,

    V[X] = V[ E[X|Y] ] + E[ V[X|Y] ].    (1.40)

The law of total variance shows that the total variance of a variable can be decomposed
into the variability of the conditional mean plus the average of the conditional variance.
This is a useful decomposition for time-series.
Independence can also be defined conditionally.

Definition 1.126 (Conditional Independence). Two random variables X1 and X2 are conditionally independent, conditional on Y, if

    f(x1, x2|y) = f1(x1|y) f2(x2|y).

Random variables that are conditionally independent are not necessarily unconditionally independent. However, knowledge of the conditioning variable is sufficient to ensure that the portions of the underlying random variables which cannot be explained by the conditioning variable become independent.

Example 1.127. Suppose X is a trivariate normal random variable with mean 0 and covariance

    Σ = [ σ1²  0    0
          0    σ2²  0
          0    0    σ3² ]

and define Y1 = X1 + X3 and Y2 = X2 + X3. Then Y1 and Y2 are correlated bivariate normal with mean 0 and covariance

    ΣY = [ σ1² + σ3²   σ3²
           σ3²         σ2² + σ3² ],

but the joint distribution of Y1 and Y2 given X3 is bivariate normal with mean 0 and

    Σ_{Y|X3} = [ σ1²  0
                 0    σ2² ]

and so Y1 and Y2 are independent conditional on X3.


Other properties of unconditionally independent random variables continue to hold
for conditionally independent random variables. For example, when X 1 and X 2 are inde-
pendent conditional on X 3 , then the conditional covariance between X 1 and X 2 is 0 (as
is the conditional correlation), and E E X 1 X 2 |X 3 = E E X 1 |X 3 E X 2 |X 3 – that is, the
       

conditional expectation of the product is the product of the conditional expectations.

1.4.7 Vector and Matrix Forms


Finally, some useful results for linear combinations of random variables. These are partic-
ularly useful in finance since portfolios are often of interest where the underlying random
variables are the individual assets and the combination vector is simply the vector of port-
folio weights.
Theorem 1.128. Let Y = Σ_{i=1}^n ci Xi where ci, i = 1, ..., n, are constants. Then E[Y] = Σ_{i=1}^n ci E[Xi]. In matrix notation, Y = c'X where c is an n by 1 vector and E[Y] = c'E[X].

The variance of the sum is the weighted sum of the variances plus twice all of the weighted covariances.

Theorem 1.129. Let Y = Σ_{i=1}^n ci Xi where ci are constants. Then

    V[Y] = Σ_{i=1}^n ci² V[Xi] + 2 Σ_{j=1}^n Σ_{k=j+1}^n cj ck Cov[Xj, Xk]    (1.41)

or equivalently

    σ²Y = Σ_{i=1}^n ci² σ²_{Xi} + 2 Σ_{j=1}^n Σ_{k=j+1}^n cj ck σ_{Xj Xk}.

This result can be equivalently expressed in vector-matrix notation.

Theorem 1.130. Let c be an n by 1 vector and let X be an n-dimensional random variable with covariance Σ. Define Y = c′x. The variance of Y is σ²_Y = c′Cov[X]c = c′Σc.

Note that the result holds when c is replaced by a matrix C.

Theorem 1.131. Let C be an n by m matrix and let X be an n-dimensional random variable with mean µ_X and covariance Σ_X. Define Y = C′x. The expected value of Y is E[Y] = µ_Y = C′E[X] = C′µ_X and the covariance of Y is Σ_Y = C′Cov[X]C = C′Σ_X C.
Definition 1.132 (Multivariate Studentization). Let X be an n-dimensional random variable with mean µ and covariance Σ, then

    Z = Σ^{-1/2}(x − µ)                  (1.42)

is a studentized version of X where Σ^{1/2} is a matrix square root such as the Cholesky factor or one based on the spectral decomposition of Σ. Z has mean 0 and covariance equal to the identity matrix I_n.
The final result for vectors relates quadratic forms of normals (inner products) to χ² distributed random variables.

Theorem 1.133 (Quadratic Forms of Normals). Let X be an n-dimensional normal random variable with mean 0 and identity covariance I_n. Then x′x = Σ_{i=1}^n xi² ∼ χ²_n.

Combining this result with studentization, when X is a general n-dimensional normal random variable with mean µ and covariance Σ,

    (x − µ)′(Σ^{-1/2})′ Σ^{-1/2}(x − µ) = (x − µ)′Σ^{-1}(x − µ) ∼ χ²_n.

1.4.8 Monte Carlo and Numerical Integration


Expectations of functions of continuous random variables are integrals against the under-
lying pdf. In some cases these integrals are analytically tractable, although in many situa-
tions integrals cannot be analytically computed and so numerical techniques are needed
to compute expected values and moments.
Monte Carlo is one method to approximate an integral. Monte Carlo utilizes simulated
draws from the underlying distribution and averaging to approximate integrals.
Definition 1.134 (Monte Carlo Integration). Suppose X ∼ F(θ) and that it is possible to simulate a series {xi} from F(θ). The Monte Carlo expectation of a function g(x) is defined

    n^{-1} Σ_{i=1}^n g(xi).

Moreover, as long as E[|g(x)|] < ∞, lim_{n→∞} n^{-1} Σ_{i=1}^n g(xi) = E[g(x)].

The intuition behind this result follows from the properties of {xi}. Since these are i.i.d. draws from F(θ), they will, on average, tend to appear in any interval B ∈ R(X) in proportion to the probability Pr(X ∈ B). In essence, the simulated values coarsely mimic the discrete approximation shown in Figure 1.8.
While Monte Carlo integration is a general technique, there are some important limitations. First, if the function g(x) takes large values in regions where Pr(X ∈ B) is small, it may require a very large number of draws to accurately approximate E[g(x)] since, by construction, there are unlikely to be many points in B. In practice the behavior of h(x) = g(x)f(x) plays an important role in determining the appropriate sample size.¹⁷ Second, while Monte Carlo integration is technically valid for random variables with any number of dimensions, in practice it is usually only reliable when the dimension is small (typically 3 or fewer), especially when the range is unbounded (R(X) ∈ Rⁿ). When the dimension of X is large, many simulated draws are needed to visit the corners of the (joint) pdf, and if 1,000 draws are sufficient for a unidimensional problem, 1,000ⁿ may be needed to achieve the same accuracy when X has n dimensions.
Alternatively the function to be integrated can be approximated using a polygon with an easy-to-compute area, such as the rectangles approximating the normal pdf shown in Figure 1.8. The quality of the approximation will depend on the resolution of the grid used. Suppose u and l are the upper and lower bounds of the integral, respectively, and that the region can be split into m intervals l = b0 < b1 < . . . < b_{m−1} < b_m = u. Then the integral of a function h(·) is

    ∫_l^u h(x) dx = Σ_{i=1}^m ∫_{b_{i−1}}^{b_i} h(x) dx.

In practice, l and u may be infinite, in which case some cut-off point is required. In general, the cut-off should be chosen so that the vast majority of the probability lies between l and u (∫_l^u f(x) dx ≈ 1).
This decomposition is combined with a rule for approximating the area under h between b_{i−1} and b_i. The simplest is the rectangle method, which uses a rectangle with a height equal to the value of the function at the mid-point.

Definition 1.135 (Rectangle Method). The rectangle rule approximates the area under the curve with a rectangle and is given by

    ∫_l^u h(x) dx ≈ h((u + l)/2) (u − l).

The rectangle rule would be exact if the function were piece-wise flat. The trapezoid rule improves the approximation by replacing the function at the midpoint with the average value of the function at the endpoints, and would be exact for any piece-wise linear function (including piece-wise flat functions).

¹⁷Monte Carlo integrals can also be seen as estimators, and in many cases standard inference can be used to determine the accuracy of the integral. See Chapter 2 for more details on inference and constructing confidence intervals.

Definition 1.136 (Trapezoid Method). The trapezoid rule approximates the area under the curve with a trapezoid and is given by

    ∫_l^u h(x) dx ≈ ((h(u) + h(l))/2) (u − l).
The final method is known as Simpson’s rule, which is based on using a quadratic approximation to the underlying function. It is exact when the underlying function is piece-wise linear or quadratic.

Definition 1.137 (Simpson’s Rule). Simpson’s Rule uses an approximation that would be exact if the underlying function were quadratic, and is given by

    ∫_l^u h(x) dx ≈ ((u − l)/6) [h(u) + 4h((u + l)/2) + h(l)].

Example 1.138. Consider the problem of computing the expected payoff of an option. The payoff of a call option is given by

    c = max(s1 − k, 0)

where k is the strike price, s1 is the stock price at expiration and s0 is the current stock price. Suppose returns are normally distributed with mean µ = .08 and standard deviation σ = .20. In this problem, g(r) = (s0 exp(r) − k) I_{[s0 exp(r)>k]} where I_{[·]} is a binary indicator function which takes the value 1 when the argument is true, and

    f(r) = (1/√(2πσ²)) exp(−(r − µ)²/(2σ²)).

Combined, the function to be integrated is

    ∫_{−∞}^{∞} h(r) dr = ∫_{−∞}^{∞} g(r) f(r) dr
                       = ∫_{−∞}^{∞} (s0 exp(r) − k) I_{[s0 exp(r)>k]} (1/√(2πσ²)) exp(−(r − µ)²/(2σ²)) dr.

s0 = k = 50 was used in all results.


All four methods were applied to the problem. The number of bins and the range of integration were varied for the analytical approximations. The number of bins ranged across {10, 20, 50, 1000} and the integration range spanned {±3σ, ±4σ, ±6σ, ±10σ}. In all cases the bins were uniformly spaced along the integration range. Monte Carlo integration was applied with m ∈ {100, 1000}.

All things equal, increasing the number of bins increases the accuracy of the approximation. In this example, 50 appears to be sufficient. However, having a range which is too small produces values which differ from the correct value of 7.33. The sophistication of the method also improves the accuracy, especially when the number of nodes is small. The Monte Carlo results are also close, on average. However, the standard deviation is large, about 5%, even when 1,000 draws are used, and so large errors would be commonly encountered; many more points are needed for an accurate approximation.

Rectangle Method
Bins ±3σ ±4σ ±6σ ±10σ
10 7.19 7.43 7.58 8.50
20 7.13 7.35 7.39 7.50
50 7.12 7.33 7.34 7.36
1000 7.11 7.32 7.33 7.33

Trapezoid Method
Bins ±3σ ±4σ ±6σ ±10σ
10 6.96 7.11 6.86 5.53
20 7.08 7.27 7.22 7.01
50 7.11 7.31 7.31 7.28
1000 7.11 7.32 7.33 7.33

Simpson’s Rule
Bins ±3σ ±4σ ±6σ ±10σ
10 7.11 7.32 7.34 7.51
20 7.11 7.32 7.33 7.34
50 7.11 7.32 7.33 7.33
1000 7.11 7.32 7.33 7.33

Monte Carlo
Draws (m ) 100 1000
Mean 7.34 7.33
Std. Dev. 0.88 0.28

Table 1.1: Computed values for the expected payoff of the option, where the correct value is 7.33. The top three panels use approximations to the function which have simple-to-compute areas. The bottom panel shows the average and standard deviation from a Monte Carlo integration where the number of points varies and 10,000 simulations were used.
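
The values in Table 1.1 can be approximately reproduced with a few lines of code. The sketch below applies the bin-by-bin Simpson’s rule on a ±4σ grid with 50 bins and a 1,000-draw Monte Carlo to the payoff in Example 1.138; the seed and grid choices are arbitrary, so the Monte Carlo figure will differ slightly from the tabulated average.

    # Expected call payoff: h(r) = g(r) f(r) with g(r) = max(s0*exp(r) - k, 0)
    # and f(r) the N(mu, sigma^2) density; true value is approximately 7.33
    import numpy as np

    s0, k, mu, sigma = 50.0, 50.0, 0.08, 0.20

    def h(r):
        payoff = np.maximum(s0 * np.exp(r) - k, 0.0)
        density = np.exp(-(r - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
        return payoff * density

    # Composite Simpson's rule applied bin by bin over 50 bins spanning mu +/- 4 sigma
    bins = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 51)
    l, u = bins[:-1], bins[1:]
    simpson = np.sum((u - l) / 6 * (h(l) + 4 * h((u + l) / 2) + h(u)))

    # Monte Carlo using 1,000 simulated returns
    rng = np.random.default_rng(0)
    r = rng.normal(mu, sigma, 1000)
    mc = np.mean(np.maximum(s0 * np.exp(r) - k, 0.0))

    print(simpson, mc)    # both close to the correct value of 7.33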

Exercises
Exercise 1.1. Prove that E[a + bX] = a + bE[X] when X is a continuous random variable.

Exercise 1.2. Prove that V[a + bX] = b²V[X] when X is a continuous random variable.

Exercise 1.3. Prove that Cov[a + bX, c + dY] = bd Cov[X, Y] when X and Y are continuous random variables.

Exercise 1.4. Prove that V[a + bX + cY] = b²V[X] + c²V[Y] + 2bc Cov[X, Y] when X and Y are continuous random variables.

Exercise 1.5. Suppose {Xi} is an i.i.d. sequence of random variables. Show that V[X̄] = V[n^{-1} Σ_{i=1}^n Xi] = n^{-1}σ² where σ² is V[X1].

Exercise 1.6. Prove that Corr[a + bX, c + dY] = Corr[X, Y].

Exercise 1.7. Suppose {Xi} is a sequence of random variables where, for all i, V[Xi] = σ², Cov[Xi, Xi−1] = θ and Cov[Xi, Xi−j] = 0 for j > 1. What is V[X̄]?

Exercise 1.8. Prove that E[a + bX|Y] = a + bE[X|Y] when X and Y are continuous random variables.

Exercise 1.9. Suppose that E[X|Y] = Y² where Y is normally distributed with mean µ and variance σ². What is E[a + bX]?

Exercise 1.10. Suppose E[X|Y = y] = a + by and V[X|Y = y] = c + dy² where Y is normally distributed with mean µ and variance σ². What is V[X]?

Exercise 1.11. Show that the law of total variance holds for V[X1] when X is a bivariate normal with mean µ = [µ1 µ2]′ and covariance

    Σ = [ σ1²   σ12
          σ12   σ2² ].

Exercise 1.12. Sixty percent (60%) of all traders hired by a large financial firm are rated as
performing satisfactorily or better in their first year review. Of these, 90% earned a first in
financial econometrics. Of the traders who were rated as unsatisfactory, only 20% earned
a first in financial econometrics.

i. What is the probability that a trader is rated as satisfactory or better given they re-
ceived a first in financial econometrics?

ii. What is the probability that a trader is rated as unsatisfactory given they received a
first in financial econometrics?

iii. Is financial econometrics a useful indicator of trader performance? Why or why not?

Exercise 1.13. Large financial firms use automated screening to detect rogue trades – those
that exceed risk limits. One of your former colleagues has introduced a new statistical test
using the trading data that, given that a trader has exceeded her risk limit, detects this with
probability 98%. It also only indicates false positives – that is non-rogue trades that are
flagged as rogue – 1% of the time.

i. Assuming 99% of trades are legitimate, what is the probability that a detected trade is
rogue? Explain the intuition behind this result.

ii. Is this a useful test? Why or why not?

iii. How low would the false positive rate have to be to have a 98% chance that a detected
trade was actually rogue?

Exercise 1.14. Your corporate finance professor uses a few jokes to add levity to his lectures.
Each week he tells 3 different jokes. However, he is also very busy, and so forgets week to
week which jokes were used.

i. Assuming he has 12 jokes, what is the probability of 1 repeat across 2 consecutive


weeks?

ii. What is the probability of hearing 2 of the same jokes in consecutive weeks?

iii. What is the probability that all 3 jokes are the same?

iv. Assuming the term is 8 weeks long, and that your professor has 96 jokes, what is the probability that there is no repetition across the term? Note: he remembers the jokes he gives in a particular lecture, and only forgets across lectures.

v. How many jokes would your professor need to know to have a 99% chance of not
repeating any in the term?

Exercise 1.15. A hedge fund company manages three distinct funds. In any given month,
the probability that the return is positive is shown in the following table:
Pr(r1,t > 0) = .55      Pr(r1,t > 0 ∪ r2,t > 0) = .82
Pr(r2,t > 0) = .60      Pr(r1,t > 0 ∪ r3,t > 0) = .7525
Pr(r3,t > 0) = .45      Pr(r2,t > 0 ∪ r3,t > 0) = .78
Pr(r2,t > 0 ∩ r3,t > 0 | r1,t > 0) = .20


i. Are the events of “positive returns” pairwise independent?

ii. Are the events of “positive returns” independent?

iii. What is the probability that funds 1 and 2 have positive returns, given that fund 3 has
a positive return?

iv. What is the probability that at least one fund will have a positive return in any given month?

Exercise 1.16. Suppose the probabilities of three events, A, B and C are as depicted in the
following diagram:

[Venn diagram of events A, B and C with region probabilities .15, .10, .15, .10, .05, .05 and .175.]

i. Are the three events pairwise independent?

ii. Are the three events independent?

iii. What is Pr (A ∩ B )?

iv. What is Pr(A ∩ B|C)?

v. What is Pr(C|A ∩ B)?

vi. What is Pr(C|A ∪ B)?

Exercise 1.17. At a small high-frequency hedge fund, two competing algorithms produce
trades. Algorithm α produces 80 trades per second and 5% lose money. Algorithm β pro-
duces 20 trades per second but only 1% lose money. Given the last trade lost money, what
is the probability it was produced by algorithm β ?

Exercise 1.18. Suppose f (x , y ) = 2 − x − y where x ∈ [0, 1] and y ∈ [0, 1].

i. What is Pr (X > .75 ∩ Y > .75)?

ii. What is Pr (X + Y > 1.5)?

iii. Show formally whether X and Y are independent.

iv. What is Pr(Y < .5|X = x)?




Exercise 1.19. Suppose f (x , y ) = x y for x ∈ [0, 1] and y ∈ [0, 2].

i. What is the joint cdf?



ii. What is Pr (X < 0.5 ∩ Y < 1)?

iii. What is the marginal cdf of X ? What is Pr (X < 0.5)?

iv. What is the marginal density of X ?

v. Are X and Y independent?

Exercise 1.20. Suppose F(x) = 1 − p^{x+1} for x ∈ {0, 1, 2, . . .} and p ∈ (0, 1).

i. Find the pmf.

ii. Verify that the pmf is valid.

iii. What is Pr (X ≤ 8) if p = .75?

iv. What is Pr (X ≤ 1) given X ≤ 8?

Exercise 1.21. A firm producing widgets has a production function q(L) = L^{0.5} where L is the amount of labor. Sales prices fluctuate randomly and can be $10 (20%), $20 (50%) or $30 (30%). Labor prices also vary and can be $1 (40%), $2 (30%) or $3 (30%). The firm always maximizes profits after seeing both sales prices and labor prices.

i. What is the distribution of possible profits?

ii. What is the probability that the firm makes at least $100?

iii. Given the firm makes a profit of $100, what is the probability that the profit is over
$200?

Exercise 1.22. A fund manager tells you that her fund has non-linear returns as a function of the market and that its return is r_{i,t} = 0.02 + 2r_{m,t} − 0.5r²_{m,t} where r_{i,t} is the return on the fund and r_{m,t} is the return on the market.

i. She tells you her expectation of the market return this year is 10%, and that her fund will have an expected return of 22%. Can this be?

ii. At what market return variance would the expected return on the fund be negative?

Exercise 1.23. For the following densities, find the mean (if it exists), variance (if it exists),
median and mode, and indicate whether the density is symmetric.

i. f(x) = 3x² for x ∈ [0, 1]

ii. f(x) = 2x^{−3} for x ∈ [1, ∞)

iii. f(x) = [π(1 + x²)]^{−1} for x ∈ (−∞, ∞)

iv. f(x) = C(4, x) (.2)^x (.8)^{4−x} for x ∈ {0, 1, 2, 3, 4}, where C(4, x) is the binomial coefficient

Exercise 1.24. The daily price of a stock has an average value of £2. Then Pr(X > 10) < .2 where X denotes the price of the stock. True or false?

Exercise 1.25. An investor can invest in stocks or bonds which have expected returns and covariances as

    µ = [ .10        Σ = [ .04     −.003
          .03 ] ,          −.003    .0009 ]

where stocks are the first component.

i. Suppose the investor has £1,000 to invest, and splits the investment evenly. What
is the expected return, standard deviation, variance and Sharpe Ratio (µ/σ) for the
investment?

ii. Now suppose the investor seeks to maximize her expected utility where her utility is defined in terms of her portfolio return, U(r) = E[r] − .01V[r]. How much should she invest in each asset?

Exercise 1.26. Suppose f(x) = (1 − p)^x p for x ∈ {0, 1, . . .} and p ∈ (0, 1]. Show that a random variable from the distribution is “memoryless” in the sense that Pr(X ≥ s + r|X ≥ r) = Pr(X ≥ s). In other words, the probability of surviving s or more periods is the same whether starting at 0 or after having survived r periods.

Exercise 1.27. Your Economics professor offers to play a game with you. You pay £1,000 to play and your Economics professor will flip a fair coin and pay you 2^x where x is the number of tries required for the coin to show heads.

i. What is the pmf of X ?

ii. What is the expected payout from this game?

Exercise 1.28. Consider the roll of a fair pair of dice where a roll of a 7 or 11 pays 2x and
anything else pays −x where x is the amount bet. Is this game fair?

Exercise 1.29. Suppose the joint density function of X and Y is given by f(x, y) = (1/2) x exp(−xy) where x ∈ [3, 5] and y ∈ (0, ∞).

i. Give the form of E[Y|X = x].


 

ii. Graph the conditional expectation curve.

Exercise 1.30. Suppose a fund manager has $10,000 of yours under management and tells
you that the expected value of your portfolio in two years time is $30,000 and that with
probability 75% your investment will be worth at least $40,000 in two years time.

i. Do you believe her?



ii. Next, suppose she tells you that the standard deviation of your portfolio value is 2,000.
Assuming this is true (as is the expected value), what is the most you can say about
the probability your portfolio value falls between $20,000 and $40,000 in two years
time?

Exercise 1.31. Suppose the joint probability density function of two random variables is given by f(x, y) = (2/5)(3x + 2y) where x ∈ [0, 1] and y ∈ [0, 1].

i. What is the marginal probability density function of X?

ii. What is E[X|Y = y]? Are X and Y independent? (Hint: What must the form of E[X|Y] be when they are independent?)

Exercise 1.32. Let Y be distributed χ²_{15}.

i. What is Pr(y > 27.488)?

ii. What is Pr(6.262 ≤ y ≤ 27.488)?

iii. Find C where Pr(y ≥ C) = α for α ∈ {0.01, 0.05, 0.10}.

Next, suppose Z is distributed χ²_5 and is independent of Y.

iv. Find C where Pr(y + z ≥ C) = α for α ∈ {0.01, 0.05, 0.10}.

Exercise 1.33. Suppose X is a bivariate random variable with parameters

    µ = [ 5        Σ = [ 2    −1
          8 ] ,          −1    3 ] .

i. What is E[X1|X2]?

ii. What is V[X1|X2]?

iii. Show (numerically) that the law of total variance holds for X2.

Exercise 1.34. Suppose y ∼ N (5, 36) and x ∼ N (4, 25) where X and Y are independent.

i. What is Pr (y > 10)?

ii. What is Pr (−10 < y < 10)?

iii. What is Pr (x − y > 0)?

iv. Find C where Pr (x − y > C ) = α for α ∈ {0.10, 0.05, 0.01}?


Chapter 2

Estimation, Inference and Hypothesis Testing

Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger (2001). This
text may be challenging if new to this topic and Ch. 7 – 10 of Wackerly, Mendenhall &
Scheaffer (2001) may be useful as an introduction.
This chapter provides an overview of estimation, distribution theory, inference
and hypothesis testing. Testing an economic or financial theory is a multi-step
process. First, any unknown parameters must be estimated. Next, the distribu-
tion of the estimator must be determined. Finally, formal hypothesis tests must
be conducted to examine whether the data are consistent with the theory. This
chapter is intentionally “generic” by design and focuses on the case where the data
are independent and identically distributed. Properties of specific models will be
studied in detail in the chapters on linear regression, time series and univariate
volatility modeling.

Three steps must be completed to test the implications of an economic theory:


• Estimate unknown parameters

• Determine the distribution of the estimator

• Conduct hypothesis tests to examine whether the data are compatible with a theo-
retical model
This chapter covers each of these steps with a focus on the case where the data is indepen-
dent and identically distributed (i.i.d.). The heterogeneous but independent case will be
covered in the chapter on linear regression and the dependent case will be covered in the
chapters on time series.

2.1 Estimation
Once a model has been specified and hypotheses postulated, the first step is to estimate
the parameters of the model. Many methods are available to accomplish this task. These

include parametric, semi-parametric, semi-nonparametric and nonparametric estimators


and a variety of estimation methods often classified as M-, R- and L-estimators.1
Parametric models are tightly parameterized and have desirable statistical properties
when their specification is correct, such as providing consistent estimates with small vari-
ances. Nonparametric estimators are more flexible and avoid making strong assumptions
about the relationship between variables. This allows nonparametric estimators to capture
a wide range of relationships but comes at the cost of precision. In many cases, nonpara-
metric estimators are said to have a slower rate of convergence than similar parametric
estimators. The practical consequence of the rate is that nonparametric estimators are de-
sirable when there is a proliferation of data and the relationships between variables may
be difficult to postulate a priori. In situations where less data is available, or when an eco-
nomic model proffers a relationship among variables, parametric estimators are generally
preferable.
Semi-parametric and semi-nonparametric estimators bridge the gap between fully para-
metric estimators and nonparametric estimators. Their difference lies in “how paramet-
ric” the model and estimator are. Estimators which postulate parametric relationships be-
tween variables but estimate the underlying distribution of errors flexibly are semi-parametric.
Estimators which take a stand on the distribution of the errors but allow for flexible rela-
tionships between variables are semi-nonparametric. This chapter focuses exclusively on
parametric models and estimators. This choice is more reflective of common practice than
a critique of parametric and nonparametric methods.
The other important characterization of estimators is whether they are members of the
M-, L- or R-estimator classes.² M-estimators (also known as extremum estimators) always involve maximizing or minimizing some objective function. M-estimators are the most common in financial econometrics and include maximum likelihood, regression, classical minimum distance and both the classical and the generalized method of moments. L-estimators, also known as linear estimators, are a class where the estimator can be expressed as a linear function of ordered data. Members of this family can always be written as

    Σ_{i=1}^n wi yi

for some set of weights {wi} where the data, yi, are ordered such that y_{j−1} ≤ y_j for j = 2, 3, . . . , n. This class of estimators obviously includes the sample mean by setting wi = 1/n for all i, and it also includes the median by setting wi = 0 for all i except w_j = 1 where j = (n + 1)/2 (n is odd) or w_j = w_{j+1} = 1/2 where j = n/2 (n is even). R-estimators exploit

¹There is another important dimension in the categorization of estimators: Bayesian or frequentist. Bayesian estimators make use of Bayes rule to perform inference about unknown quantities – parameters – conditioning on the observed data. Frequentist estimators rely on randomness averaging out across observations. Frequentist methods are dominant in financial econometrics although the use of Bayesian methods has been recently increasing.
²Many estimators are members of more than one class. For example, the median is a member of all three.

the rank of the data. Common examples of R-estimators include the minimum, maximum
and Spearman’s rank correlation, which is the usual correlation estimator on the ranks of
the data rather than on the data themselves. Rank statistics are often robust to outliers and
non-linearities.

2.1.1 M-Estimators

The use of M-estimators is pervasive in financial econometrics. Three common types of


M-estimators include the method of moments, both classical and generalized, maximum
likelihood and classical minimum distance.

2.1.2 Maximum Likelihood

Maximum likelihood uses the distribution of the data to estimate any unknown parame-
ters by finding the values which make the data as likely as possible to have been observed
– in other words, by maximizing the likelihood. Maximum likelihood estimation begins by
specifying the joint distribution, f (y; θ ), of the observable data, y = {y1 , y2 , . . . , yn }, as a
function of a k by 1 vector θ which contains all parameters. Note that this is the joint den-
sity, and so it includes both the information in the marginal distributions of yi and infor-
mation relating the marginals to one another.3 Maximum likelihood estimation “reverses”
the likelihood to express the probability of θ in terms of the observed y, L (θ ; y) = f (y; θ ).
The maximum likelihood estimator, θ̂ , is defined as the solution to

    θ̂ = argmax_θ L(θ; y)                  (2.1)

where argmax is used in place of max to indicate that the maximum may not be unique – it
could be set valued – and to indicate that the global maximum is required.4 Since L (θ ; y) is
strictly positive, the log of the likelihood can be used to estimate θ .5 The log-likelihood is
defined as l (θ ; y) = ln L (θ ; y). In most situations the maximum likelihood estimator (MLE)
can be found by solving the k by 1 score vector,
³Formally the relationship between the marginals is known as the copula. Copulas and their use in financial econometrics will be explored in the second term.
⁴Many likelihoods have more than one maximum (i.e. local maxima). The maximum likelihood estimator is always defined as the global maximum.
⁵Note that the log transformation is strictly increasing and globally concave. If z* is the maximum of g(z), and thus

    ∂g(z)/∂z |_{z=z*} = 0,

then z* must also be the maximum of ln(g(z)) since

    ∂ln(g(z))/∂z |_{z=z*} = g′(z)/g(z) |_{z=z*} = 0/g(z*) = 0,

which follows since g(z) > 0 for any value of z.



    ∂l(θ; y)/∂θ = 0,
although a score-based solution does not work when θ is constrained and θ̂ lies on the
boundary of the parameter space or when the permissible range of values for yi depends on
θ . The first problem is common enough that it is worth keeping in mind. It is particularly
common when working with variances which must be (weakly) positive by construction.
The second issue is fairly rare in financial econometrics.

2.1.2.1 Maximum Likelihood Estimation of a Poisson Model

Realizations from a Poisson process are non-negative and discrete. The Poisson is com-
mon in ultra high-frequency econometrics where the usual assumption that prices lie in
a continuous space is implausible. For example, trade prices of US equities evolve on a grid of prices typically separated by $0.01. Suppose yi ~ i.i.d. Poisson(λ). The pdf of a single observation is

    f(yi; λ) = exp(−λ)λ^{yi} / yi!                  (2.2)

and since the data are independent and identically distributed (i.i.d.), the joint likelihood is simply the product of the n individual likelihoods,

    f(y; λ) = L(λ; y) = Π_{i=1}^n exp(−λ)λ^{yi} / yi!.

The log-likelihood is
    l(λ; y) = Σ_{i=1}^n [−λ + yi ln(λ) − ln(yi!)]                  (2.3)

which can be further simplified to

    l(λ; y) = −nλ + ln(λ) Σ_{i=1}^n yi − Σ_{j=1}^n ln(yj!).

The first derivative is

    ∂l(λ; y)/∂λ = −n + λ^{-1} Σ_{i=1}^n yi.                  (2.4)

The MLE is found by setting the derivative to 0 and solving,


    −n + λ̂^{-1} Σ_{i=1}^n yi = 0
    λ̂^{-1} Σ_{i=1}^n yi = n
    Σ_{i=1}^n yi = nλ̂
    λ̂ = n^{-1} Σ_{i=1}^n yi.

Thus the maximum likelihood estimator in a Poisson is the sample mean.
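
This result is easy to confirm numerically. A minimal sketch, simulating Poisson data with an arbitrary λ = 3.5 and comparing the sample mean with a direct numerical maximization of the log-likelihood:

    # The Poisson MLE is the sample mean; check against a numerical maximizer
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    y = rng.poisson(3.5, 1000)                       # simulated data, lambda = 3.5

    def neg_loglik(lam):
        # -l(lambda; y) up to the constant sum(ln(y_i!)), which does not affect the maximizer
        return -(-len(y) * lam + np.log(lam) * y.sum())

    res = minimize_scalar(neg_loglik, bounds=(0.01, 20), method="bounded")
    print(y.mean(), res.x)                           # the two estimates agree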

2.1.2.2 Maximum Likelihood Estimation of a Normal (Gaussian) Model

Suppose yi is assumed to be i.i.d. normally distributed with mean µ and variance σ2 . The
pdf of a normal is
    f(yi; θ) = (1/√(2πσ²)) exp(−(yi − µ)²/(2σ²))                  (2.5)

where θ = (µ, σ²)′. The joint likelihood is the product of the n individual likelihoods,

    f(y; θ) = L(θ; y) = Π_{i=1}^n (1/√(2πσ²)) exp(−(yi − µ)²/(2σ²)).

Taking logs,

    l(θ; y) = Σ_{i=1}^n [−(1/2) ln(2π) − (1/2) ln(σ²) − (yi − µ)²/(2σ²)]                  (2.6)

which can be simplified to

    l(θ; y) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/2) Σ_{i=1}^n (yi − µ)²/σ².

Taking the derivative with respect to the parameters θ = (µ, σ²)′,

    ∂l(θ; y)/∂µ = Σ_{i=1}^n (yi − µ)/σ²                  (2.7)

    ∂l(θ; y)/∂σ² = −n/(2σ²) + (1/2) Σ_{i=1}^n (yi − µ)²/σ⁴.                  (2.8)

Setting these equal to zero, the first condition can be directly solved by multiplying both
sides by σ̂2 , assumed positive, and the estimator for µ is the sample average.
    Σ_{i=1}^n (yi − µ̂)/σ̂² = 0
    σ̂² Σ_{i=1}^n (yi − µ̂)/σ̂² = σ̂² · 0
    Σ_{i=1}^n yi − nµ̂ = 0
    nµ̂ = Σ_{i=1}^n yi
    µ̂ = n^{-1} Σ_{i=1}^n yi

Plugging this value into the second score and setting it equal to 0, the ML estimator of σ² is

    −n/(2σ̂²) + (1/2) Σ_{i=1}^n (yi − µ̂)²/σ̂⁴ = 0
    2σ̂⁴ [−n/(2σ̂²) + (1/2) Σ_{i=1}^n (yi − µ̂)²/σ̂⁴] = 2σ̂⁴ · 0
    −nσ̂² + Σ_{i=1}^n (yi − µ̂)² = 0
    σ̂² = n^{-1} Σ_{i=1}^n (yi − µ̂)².

2.1.3 Conditional Maximum Likelihood

Interest often lies in the distribution of a random variable conditional on one or more ob-
served values, where the distribution of the observed values is not of interest. When this
occurs, it is natural to use conditional maximum likelihood. Suppose interest lies in model-
ing a random variable Y conditional on one or more variables X. The likelihood for a single observation is fi(yi|xi), and when the Yi are conditionally i.i.d., then

    L(θ; y|X) = Π_{i=1}^n f(yi|xi),

and the log-likelihood is

    l(θ; y|X) = Σ_{i=1}^n ln f(yi|xi).
The conditional likelihood is not usually sufficient to estimate parameters since the relationship between Y and X has not been specified. Conditional maximum likelihood specifies the model parameters conditionally on xi. For example, in a conditional normal, yi|xi ∼ N(µi, σ²) where µi = g(β, xi) is some function which links parameters and conditioning variables. In many applications a linear relationship is assumed so that

    yi = β′xi + εi
       = Σ_{j=1}^k βj xi,j + εi
       = µi + εi.

Other relationships are possible, including functions g(β′xi) which limit the range of β′xi, such as exp(β′xi) (positive numbers), the normal cdf (Φ(β′xi)) or the logistic function, exp(β′xi)/(1 + exp(β′xi)) (both limit the range to (0, 1)).


 

2.1.3.1 Example: Conditional Bernoulli

Suppose Yi and X i are Bernoulli random variables where the conditional distribution of Yi
given X i is
yi |xi ∼ Bernoulli (θ0 + θ1 xi )
so that the conditional probability of observing a success (yi = 1) is pi = θ0 + θ1 xi . The
conditional likelihood is

    L(θ; y|x) = Π_{i=1}^n (θ0 + θ1xi)^{yi} (1 − (θ0 + θ1xi))^{1−yi},

the conditional log-likelihood is

    l(θ; y|x) = Σ_{i=1}^n yi ln(θ0 + θ1xi) + (1 − yi) ln(1 − (θ0 + θ1xi)),

and the maximum likelihood estimator can be found by differentiation.

    ∂l(θ̂; y|x)/∂θ̂0 = Σ_{i=1}^n [yi/(θ̂0 + θ̂1xi) − (1 − yi)/(1 − θ̂0 − θ̂1xi)] = 0
    ∂l(θ̂; y|x)/∂θ̂1 = Σ_{i=1}^n [xiyi/(θ̂0 + θ̂1xi) − xi(1 − yi)/(1 − θ̂0 − θ̂1xi)] = 0.
Define n_x = Σ_{i=1}^n xi, n_y = Σ_{i=1}^n yi and n_{xy} = Σ_{i=1}^n xiyi. Using the fact that xi is also Bernoulli, the second score can be solved,

    0 = Σ_{i=1}^n xi [yi/(θ̂0 + θ̂1) − (1 − yi)/(1 − θ̂0 − θ̂1)]
      = n_{xy}/(θ̂0 + θ̂1) − (n_x − n_{xy})/(1 − θ̂0 − θ̂1)
      = n_{xy}(1 − (θ̂0 + θ̂1)) − (n_x − n_{xy})(θ̂0 + θ̂1)
      = n_{xy} − n_{xy}(θ̂0 + θ̂1) − n_x(θ̂0 + θ̂1) + n_{xy}(θ̂0 + θ̂1)

so that θ̂0 + θ̂1 = n_{xy}/n_x. The first score can then also be rewritten as

    0 = Σ_{i=1}^n [yi/(θ̂0 + θ̂1xi) − (1 − yi)/(1 − θ̂0 − θ̂1xi)]
      = Σ_{i=1}^n [yi(1 − xi)/θ̂0 + yixi/(θ̂0 + θ̂1) − (1 − yi)(1 − xi)/(1 − θ̂0) − (1 − yi)xi/(1 − θ̂0 − θ̂1)]
      = Σ_{i=1}^n [yi(1 − xi)/θ̂0 − (1 − yi)(1 − xi)/(1 − θ̂0)] + {Σ_{i=1}^n [xiyi/(θ̂0 + θ̂1) − xi(1 − yi)/(1 − θ̂0 − θ̂1)]}
      = (n_y − n_{xy})/θ̂0 − (n − n_y − n_x + n_{xy})/(1 − θ̂0) + {0}
      = n_y − n_{xy} − θ̂0n_y + θ̂0n_{xy} − θ̂0n + θ̂0n_y + θ̂0n_x − θ̂0n_{xy}

so that θ̂0 = (n_y − n_{xy})/(n − n_x), and hence θ̂1 = n_{xy}/n_x − (n_y − n_{xy})/(n − n_x). The “0” in the previous derivation follows from noting that the quantity in {·} is equivalent to the second score and so is 0 at the MLE. If Xi were not a Bernoulli random variable, then it would not be possible to analytically solve this problem. In these cases, numerical methods are needed.⁶
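
A quick simulation check of these closed-form estimators, assuming illustrative true values θ0 = 0.3 and θ1 = 0.4:

    # Verify the conditional Bernoulli MLEs on simulated data
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.binomial(1, 0.5, n)                  # Bernoulli conditioning variable
    y = rng.binomial(1, 0.3 + 0.4 * x)           # success probability theta0 + theta1*x

    n_x, n_y, n_xy = x.sum(), y.sum(), (x * y).sum()
    theta0_hat = (n_y - n_xy) / (n - n_x)
    theta1_hat = n_xy / n_x - theta0_hat
    print(theta0_hat, theta1_hat)                # close to 0.3 and 0.4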

2.1.3.2 Example: Conditional Normal

Suppose µi = β xi where Yi given X i is conditionally normal. Assuming that Yi are condi-


tionally i.i.d., the likelihood and log-likelihood are

    L(θ; y|x) = Π_{i=1}^n (1/√(2πσ²)) exp(−(yi − βxi)²/(2σ²))

    l(θ; y|x) = −(1/2) Σ_{i=1}^n [ln(2π) + ln(σ²) + (yi − βxi)²/σ²].

⁶When Xi is not Bernoulli, it is also usually necessary to use a function to ensure pi, the conditional probability, is in [0, 1]. Two common choices are the normal cdf and the logistic function.
The scores of the likelihood are

    ∂l(θ; y|x)/∂β = Σ_{i=1}^n xi(yi − β̂xi)/σ̂² = 0
    ∂l(θ; y|x)/∂σ² = −(1/2) Σ_{i=1}^n [1/σ̂² − (yi − β̂xi)²/(σ̂²)²] = 0.

After multiplying both sides of the first score by σ̂², and both sides of the second score by −2σ̂⁴, solving the scores is straightforward, and so

    β̂ = Σ_{i=1}^n xiyi / Σ_{j=1}^n xj²
    σ̂² = n^{-1} Σ_{i=1}^n (yi − β̂xi)².
2.1.3.3 Example: Conditional Poisson

Suppose Yi, conditional on Xi, is i.i.d. distributed Poisson(λi) where λi = exp(θxi). The likelihood and log-likelihood are

    L(θ; y|x) = Π_{i=1}^n exp(−λi)λi^{yi} / yi!

    l(θ; y|x) = Σ_{i=1}^n [−exp(θxi) + yi(θxi) − ln(yi!)].

The score of the likelihood is

    ∂l(θ; y|x)/∂θ = Σ_{i=1}^n [−xi exp(θ̂xi) + xiyi] = 0.

This score cannot be analytically solved and so a numerical optimizer must be used to find the solution. It is possible, however, to show the score has conditional expectation 0 since E[Yi|Xi] = λi:

    E[∂l(θ; y|x)/∂θ |X] = E[Σ_{i=1}^n −xi exp(θxi) + xiyi |X]
                        = Σ_{i=1}^n E[−xi exp(θxi)|X] + E[xiyi|X]
                        = Σ_{i=1}^n −xiλi + xiE[yi|X]
                        = Σ_{i=1}^n −xiλi + xiλi = 0.

2.1.4 Method of Moments


Method of moments, often referred to as the classical method of moments to differenti-
ate it from the generalized method of moments (GMM, chapter 6) uses the data to match
noncentral moments.

Definition 2.1 (Noncentral Moment). The rth noncentral moment is defined

    µ′_r ≡ E[X^r]                  (2.9)

for r = 1, 2, . . ..

Central moments are similarly defined, only centered around the mean.

Definition 2.2 (Central Moment). The rth central moment is defined

    µ_r ≡ E[(X − µ′_1)^r]                  (2.10)

for r = 2, 3, . . . where the 1st central moment is defined to be equal to the 1st noncentral moment.

Since E[xi^r] is not known, any estimator based on it is infeasible. The obvious solution is to use the sample analogue to estimate its value, and the feasible method of moments estimator is

    µ̂′_r = n^{-1} Σ_{i=1}^n xi^r,                  (2.11)

the sample average of the data raised to the rth power. While the classical method of moments was originally specified using noncentral moments, the central moments are usually the quantities of interest. The central moments can be directly estimated,

    µ̂_r = n^{-1} Σ_{i=1}^n (xi − µ̂1)^r,                  (2.12)

which is simple to implement by first estimating the mean (µ̂1) and then estimating the remaining central moments. An alternative is to expand the noncentral moment in terms of

central moments. For example, the second noncentral moment can be expanded in terms
of the first two central moments,

    µ′_2 = µ_2 + µ_1²

which is the usual identity that states that the expectation of a random variable squared, E[xi²], is equal to the variance, µ_2 = σ², plus the mean squared, µ_1². Likewise, it is easy to show that

    µ′_3 = µ_3 + 3µ_2µ_1 + µ_1³

directly by expanding E[(X − µ_1)³] and solving for µ′_3. To understand that the method of moments is in the class of M-estimators, note that the expression in eq. (2.12) is the first order condition of a simple quadratic form,

    argmin_{µ,µ_2,...,µ_k} (n^{-1} Σ_{i=1}^n xi − µ)² + Σ_{j=2}^k (n^{-1} Σ_{i=1}^n (xi − µ)^j − µ_j)²,                  (2.13)

and since the number of unknown parameters is identical to the number of equations, the solution is exact.⁷

2.1.4.1 Method of Moments Estimation of the Mean and Variance

The classical method of moments estimator for the mean and variance for a set of i.i.d. data {yi}_{i=1}^n where E[Yi] = µ and E[(Yi − µ)²] = σ² is given by estimating the first two noncentral moments and then solving for σ²,

    µ̂ = n^{-1} Σ_{i=1}^n yi
    σ̂² + µ̂² = n^{-1} Σ_{i=1}^n yi²

and thus the variance estimator is σ̂² = n^{-1} Σ_{i=1}^n yi² − µ̂². Following some algebra, it is simple to show that the central moment estimator could be used equivalently, and so σ̂² = n^{-1} Σ_{i=1}^n (yi − µ̂)².
2.1.4.2 Method of Moments Estimation of the Range of a Uniform

Consider a set of realizations of a random variable with a uniform density over [0, θ], and so yi ~ i.i.d. U(0, θ). The expectation of yi is E[Yi] = θ/2, and so the method of moments estimator for the upper bound is

    θ̂ = 2n^{-1} Σ_{i=1}^n yi.

⁷Note that µ_1, the mean, is generally denoted with the subscript suppressed as µ.

2.1.5 Classical Minimum Distance


A third – and less frequently encountered – type of M-estimator is classical minimum dis-
tance (CMD) which is also known as minimum χ 2 in some circumstances. CMD differs
from MLE and the method of moments in that it is an estimator that operates using initial
parameter estimates produced by another estimator rather than on the data directly. CMD
is most common when a simple MLE or moment-based estimator is available that can es-
timate a model without some economically motivated constraints on the parameters. This
initial estimator, ψ̂ is then used to estimate the parameters of the model, θ , by minimizing
a quadratic function of the form

    θ̂ = argmin_θ (ψ̂ − g(θ))′ W (ψ̂ − g(θ))                  (2.14)

where W is a positive definite weighting matrix. When W is chosen as the covariance of ψ̂,
the CMD estimator becomes the minimum-χ 2 estimator since outer products of standard-
ized normals are χ 2 random variables.

2.2 Convergence and Limits for Random Variables


Before turning to properties of estimators, it is useful to discuss some common measures
of convergence for sequences. Before turning to the alternative definitions which are ap-
propriate for random variables, recall the definition of a limit of a non-stochastic sequence.

Definition 2.3 (Limit). Let {xn} be a non-stochastic sequence. If there exists N such that for every n > N, |xn − x| < ε ∀ε > 0, then x is called the limit of xn. When this occurs, xn → x or lim_{n→∞} xn = x.

A limit is a point which a sequence will approach and, eventually, always remain near. It isn’t necessary that the limit is ever attained, only that for any choice of ε > 0, xn will eventually always be less than ε away from its limit.
Limits of random variables come in many forms. The first type of convergence is both the weakest and most abstract.

Definition 2.4 (Convergence in Distribution). Let {Yn} be a sequence of random variables and let {Fn} be the associated sequence of cdfs. If there exists a cdf F where Fn(y) → F(y) for all y where F is continuous, then F is the limiting cdf of {Yn}. Let Y be a random variable with cdf F, then Yn converges in distribution to Y ∼ F, written Yn →d Y ∼ F, or simply Yn →d F.

[Figure 2.1: This figure shows a sequence of cdfs {Fi} (F4, F5, F10, F100) that converge to the cdf of a standard normal.]

Convergence in distribution means that the limiting cdf of a sequence of random variables is the same as the convergent random variable. This is a very weak form of convergence since all it requires is that the distributions are the same. For example, suppose {Xn} is an i.i.d. sequence of standard normal random variables, and Y is a standard normal random variable. Xn trivially converges in distribution to Y (Xn →d Y) even though Y is completely independent of {Xn} – the limiting cdf of Xn is merely the same as the cdf of Y. Despite the weakness of convergence in distribution, it is an essential notion of convergence that is used to perform inference on estimated parameters.
Figure 2.1 shows an example of a sequence of random variables which converge in distribution. The sequence is

    Xn = √n (n^{-1} Σ_{i=1}^n Yi − 1)/√2

where Yi are i.i.d. χ²_1 random variables. This is a studentized average since the variance of the average is 2/n and the mean is 1. By the time n = 100, F100 is nearly indistinguishable from the standard normal cdf.
Convergence in distribution is preserved through functions.

Theorem 2.5 (Continuous Mapping Theorem). Let Xn →d X and let the random variable g(X) be defined by a function g(x) that is continuous everywhere except possibly on a set with zero probability. Then g(Xn) →d g(X).

The continuous mapping theorem is useful since it allows functions of sequences of random variables to be studied. For example, in hypothesis testing it is common to use quadratic forms of normals, and when appropriately standardized, quadratic forms of normally distributed random variables follow a χ² distribution.
quadratic forms of normals, and when appropriately standardized, quadratic forms of nor-
mally distributed random variables follow a χ 2 distribution.
The next form of convergence is stronger than convergence in distribution since the
limit is to a specific target, not just a cdf.

Definition 2.6 (Convergence in Probability). The sequence of random variables {Xn} converges in probability to X if and only if

    lim_{n→∞} Pr(|X_{i,n} − X_i| < ε) = 1 ∀ε > 0, ∀i.

When this holds, Xn →p X or equivalently plim Xn = X (or plim Xn − X = 0) where plim is the probability limit.

Note that X can be either a random variable or a constant (degenerate random variable). For example, if Xn = n^{-1} + Z where Z is a normally distributed random variable, then Xn →p Z. Convergence in probability requires virtually all of the probability mass of Xn to lie near X. This is a very weak form of convergence since it is possible that a small amount of probability can be arbitrarily far away from X. Suppose a scalar random sequence {Xn} takes the value 0 with probability 1 − 1/n and n with probability 1/n. Then {Xn} →p 0 although E[Xn] = 1 for all n.
Convergence in probability, however, is strong enough that it is useful when studying random variables and functions of random variables.
Theorem 2.7. Let Xn →p X and let the random variable g(X) be defined by a function g(x) that is continuous everywhere except possibly on a set with zero probability. Then g(Xn) →p g(X) (or equivalently plim g(Xn) = g(X)).

This theorem has some simple, useful forms. Suppose the k-dimensional vector Xn →p X, the conformable vector Yn →p Y and C is a conformable constant matrix, then

• plim CXn = CX

• plim Σ_{i=1}^k X_{i,n} = Σ_{i=1}^k plim X_{i,n} – the plim of the sum is the sum of the plims

• plim Π_{i=1}^k X_{i,n} = Π_{i=1}^k plim X_{i,n} – the plim of the product is the product of the plims

• plim Yn Xn = YX

• When Yn is a square matrix and Y is nonsingular, then Yn^{-1} →p Y^{-1} – the inverse function is continuous and so the plim of the inverse is the inverse of the plim

• When Yn is a square matrix and Y is nonsingular, then Yn^{-1} Xn →p Y^{-1} X.

These properties are very different from those of the expectations operator. In particular, the plim operator passes through functions, which allows for broad application. For example,

    E[1/X] ≠ 1/E[X]

whenever X is a non-degenerate random variable. However, if Xn →p X, then

    plim 1/Xn = 1/plim Xn = 1/X.
Alternative definitions of convergence strengthen convergence in probability. In partic-
ular, convergence in mean square requires that the expected squared deviation must be
zero. This requires that E [X n ] = X and V [X n ] = 0.

Definition 2.8 (Convergence in Mean Square). The sequence of random variables {Xn} converges in mean square to X if and only if

    lim_{n→∞} E[(X_{i,n} − X_i)²] = 0, ∀i.

When this holds, Xn →m.s. X.

Mean square convergence is strong enough to ensure that, when the limit is a random X, then E[Xn] → E[X] and V[Xn] → V[X] – these relationships do not necessarily hold when only Xn →p X.

Theorem 2.9 (Convergence in mean square implies consistency). If Xn →m.s. X then Xn →p X.

This result follows directly from Chebyshev’s inequality. A final, and very strong, measure
of convergence for random variables is known as almost sure convergence.

Definition 2.10 (Almost sure convergence). The sequence of random variables {Xn} converges almost surely to X if and only if

    Pr(lim_{n→∞} X_{i,n} − X_i = 0) = 1, ∀i.

When this holds, Xn →a.s. X.

Almost sure convergence requires all probability to be on the limit point. This is a stronger condition than either convergence in probability or convergence in mean square, both of which allow for some probability to be (relatively) far from the limit point.

Theorem 2.11 (Almost sure convergence implications). If Xn →a.s. X then Xn →m.s. X and Xn →p X.

Random variables which converge almost surely to a limit are asymptotically degener-
ate on that limit.
The Slutsky theorem combines variables which converge in distribution with variables
which converge in probability to show that the joint limit of functions behaves as expected.
Theorem 2.12 (Slutsky Theorem). Let Xn →d X and let Yn →p C, a constant, then for conformable X and C,

1. Xn + Yn →d X + C

2. Yn Xn →d CX

3. Yn^{-1} Xn →d C^{-1} X as long as C is non-singular.

This theorem is at the core of hypothesis testing where estimated parameters are often
asymptotically normal and an estimated parameter covariance, which converges in prob-
ability to the true covariance, is used to studentize the parameters.

2.3 Properties of Estimators


The first step in assessing the performance of an economic model is the estimation of the
parameters. There are a number of desirable properties estimators may possess.

2.3.1 Bias and Consistency

A natural question to ask about an estimator is whether, on average, it will be equal to the
population value of the parameter estimated. Any discrepancy between the expected value
of an estimator and the population parameter is known as bias.

Definition 2.13 (Bias). The bias of an estimator, θ̂, is defined

    B[θ̂] = E[θ̂] − θ_0                  (2.15)

where θ_0 is used to denote the population (or “true”) value of the parameter.

When an estimator has a bias of 0 it is said to be unbiased. Unfortunately many esti-


mators are not unbiased. Consistency is a closely related concept that measures whether
a parameter will be far from the population value in large samples.

Definition 2.14 (Consistency). An estimator θ̂_n is said to be consistent if plim θ̂_n = θ_0. The explicit dependence of the estimator on the sample size is used to clarify that these form a sequence, {θ̂_n}_{n=1}^∞.

Consistency requires an estimator to exhibit two features as the sample size becomes large.
First, any bias must be shrinking. Second, the distribution of θ̂ around θ 0 must be shrink-
ing in such a way that virtually all of the probability mass is arbitrarily close to θ 0 . Behind
consistency are a set of theorems known as laws of large numbers. Laws of large numbers provide conditions where an average will converge to its expectation. The simplest is the Kolmogorov Strong Law of Large Numbers, which is applicable to i.i.d. data.⁸

Theorem 2.15 (Kolmogorov Strong Law of Large Numbers). Let {yi} be a sequence of i.i.d. random variables with µ ≡ E[yi] and define ȳ_n = n^{-1} Σ_{i=1}^n yi. Then

    ȳ_n →a.s. µ                  (2.16)

if and only if E[|yi|] < ∞.

In the case of i.i.d. data the only requirement for consistency is that the expectation exists, and so a law of large numbers will apply to an average of i.i.d. data whenever its expectation exists. For example, Monte Carlo integration uses i.i.d. draws and so the Kolmogorov LLN is sufficient to ensure that Monte Carlo integrals converge to their expected values.
The variance of an estimator is the same as any other variance, V[θ̂] = E[(θ̂ − E[θ̂])²], although it is worth noting that the variance is defined as the variation around its expectation, E[θ̂], not the population value of the parameters, θ_0. Mean square error measures this alternative form of variation around the population value of the parameter.

Definition 2.16 (Mean Square Error). The mean square error of an estimator θ̂, denoted MSE(θ̂), is defined

    MSE(θ̂) = E[(θ̂ − θ_0)²].                  (2.17)

It can be equivalently expressed as the bias squared plus the variance, MSE(θ̂) = B[θ̂]² + V[θ̂].

When the bias and variance of an estimator both converge to zero, then θ̂_n →m.s. θ_0.

2.3.1.1 Bias and Consistency of the Method of Moment Estimators

The method of moments estimators of the mean and variance are defined as
    µ̂ = n^{-1} Σ_{i=1}^n yi
    σ̂² = n^{-1} Σ_{i=1}^n (yi − µ̂)².

⁸A law of large numbers is strong if the convergence is almost sure. It is weak if the convergence is in probability.

When the data are i.i.d. with finite mean µ and variance σ2 , the mean estimator is un-
biased while the variance is biased by an amount that becomes small as the sample size
increases. The mean is unbiased since

" n
#
X
E [µ̂] = E n −1 yi
i =1
n
X
=n −1
E [yi ]
i =1
X n
= n −1 µ
i =1

= n −1 n µ

The variance estimator is biased since

" n
#
X 2
E σ̂ = E n −1 (yi − µ̂)
 2

i =1
" n
!#
X
=E n −1
yi − n µ̂
2 2

i =1
n
!
X
= n −1 E yi − n E µ̂
 2  2

i =1
n !
σ2
X 
= n −1 µ2 + σ 2 − n µ2 +
n
i =1

= n −1 nµ2 + n σ2 − n µ2 − σ2


σ2
 
=n −1 2
nσ − n
n
n −1 2
= σ
n

where the sample mean is equal to the population mean plus an error that is decreasing in n,

    µ̂² = (µ + n^{-1} Σ_{i=1}^n εi)²
       = µ² + 2µn^{-1} Σ_{i=1}^n εi + (n^{-1} Σ_{i=1}^n εi)²,

and so its square has expectation

    E[µ̂²] = E[µ² + 2µn^{-1} Σ_{i=1}^n εi + (n^{-1} Σ_{i=1}^n εi)²]
          = µ² + 2µn^{-1} E[Σ_{i=1}^n εi] + n^{-2} E[(Σ_{i=1}^n εi)²]
          = µ² + σ²/n.

2.3.2 Asymptotic Normality


While unbiasedness and consistency are highly desirable properties of any estimator, alone
these do not provide a method to perform inference. The primary tool in econometrics for
inference is the central limit theorem (CLT). CLTs exist for a wide range of possible data
characteristics that include i.i.d., heterogeneous and dependent cases. The Lindberg-Lévy
CLT, which is applicable to i.i.d. data, is the simplest.

Theorem 2.17 (Lindberg-Lévy). Let {yi} be a sequence of i.i.d. random scalars with µ ≡ E[Yi] and σ² ≡ V[Yi] < ∞. If σ² > 0, then

    (ȳ_n − µ)/σ̄_n = √n (ȳ_n − µ)/σ →d N(0, 1)                  (2.18)

where ȳ_n = n^{-1} Σ_{i=1}^n yi and σ̄_n = √(σ²/n).

Lindberg-Lévy states that as long as i.i.d. data have 2 moments – a mean and variance –
the sample mean will be asymptotically normal. It can further be seen to show that other
moments of i.i.d. random variables, such as the variance, will be asymptotically normal
as long as two times the power of the moment exists. In other words, an estimator of the
rth moment will be asymptotically normal as long as the 2rth moment exists – at least in
i.i.d. data. Figure 2.2 contains density plots of the sample average of n independent χ²_1 random variables for n = 5, 10, 50 and 100.⁹ The top panel contains the density of the unscaled estimates. The bottom panel contains the density plot of the correctly scaled terms, √n(µ̂ − 1)/√2 where µ̂ is the sample average. In the top panel the densities are collapsing. This is evidence of consistency since the asymptotic distribution of µ̂ is collapsing on 1. The bottom panel demonstrates the operation of a CLT since the appropriately standardized means all have similar dispersion and are increasingly normal.

[Figure 2.2 (Consistency and Central Limits): These two panels illustrate the difference between consistency and the correctly scaled estimators. The sample mean was computed 1,000 times using 5, 10, 50 and 100 i.i.d. χ² data points. The top panel contains a kernel density plot of the estimates of the mean. The density when n = 100 is much tighter than when n = 5 or n = 10 since the estimates are not scaled. The bottom panel plots √n(µ̂ − 1)/√2, the standardized version for which a CLT applies. All scaled densities have similar dispersion although it is clear that the asymptotic approximation of the CLT is not particularly accurate when n = 5 or n = 10 (due to the right skew in the χ²_1 data).]

⁹The mean and variance of a χ²_ν random variable are ν and 2ν, respectively.
Central limit theorems exist for a wide variety of other data generating processes including processes which are independent but not identically distributed (i.n.i.d.) or processes which are dependent, such as time-series data. As the data become more heterogeneous,
whether through dependence or by having different variance or distributions, more re-
strictions are needed on certain characteristics of the data to ensure that averages will be
asymptotically normal. The Lindberg-Feller CLT allows for heteroskedasticity (different
variances) and/or different marginal distributions.
Theorem 2.18 (Lindberg-Feller). Let {yi} be a sequence of independent random scalars with µi ≡ E[yi] and 0 < σi² ≡ V[yi] < ∞ where yi ∼ Fi, i = 1, 2, . . .. Then

    √n (ȳ_n − µ̄_n)/σ̄_n →d N(0, 1)                  (2.19)

and

    lim_{n→∞} max_{1≤i≤n} n^{-1} σi²/σ̄_n² = 0                  (2.20)

if and only if, for every ε > 0,

    lim_{n→∞} σ̄_n^{-2} n^{-1} Σ_{i=1}^n ∫_{(z−µi)²>εnσ̄_n²} (z − µi)² dFi(z) = 0                  (2.21)

where µ̄_n = n^{-1} Σ_{i=1}^n µi and σ̄_n² = n^{-1} Σ_{i=1}^n σi².
The Lindberg-Feller CLT relaxes the requirement that the marginal distributions are identical in the Lindberg-Lévy CLT at the cost of a technical condition. The final condition, known as a Lindberg condition, essentially requires that no random variable is so heavy tailed that it dominates the others when averaged. In practice this can be a concern when the variables have a wide range of variances (σi²). Many macroeconomic data series exhibit a large decrease in the variance of their shocks after 1984, a phenomenon referred to as the great moderation. The statistical consequence of this decrease is that averages that use data both before and after 1984 may not be well approximated by a CLT and caution is warranted when using asymptotic approximations. This phenomenon is also present in equity returns where some periods – for example the technology “bubble” from 1997-2002 – have substantially higher volatility than periods before or after. These large persistent changes in the characteristics of the data have negative consequences on the quality of CLT approximations and large data samples are often needed.

2.3.2.1 What good is a CLT?

Central limit theorems are the basis of most inference in econometrics, although their for-
mal justification is only asymptotic and hence only guaranteed to be valid for an arbitrarily
large data set. Reconciling these two statements is an important step in the evolution of an
econometrician.
Central limit theorems should be seen as approximations, and as an approximation they can be accurate or arbitrarily poor.

[Figure 2.3 (Central Limit Approximations): These two plots illustrate how a CLT can provide a good approximation, even in small samples (top panel), or a bad approximation even for moderately large samples (bottom panel). The top panel contains a kernel density plot of the standardized sample mean of n = 10 Poisson random variables (λ = 5) over 10,000 Monte Carlo simulations. Here the finite sample distribution and the asymptotic distribution overlay one another. The bottom panel contains the conditional ML estimates of ρ from the AR(1) yi = ρyi−1 + εi where εi is i.i.d. standard normal using 100 data points and 10,000 replications. While ρ̂ is asymptotically normal, the quality of the approximation when n = 100 is poor.]

For example, when a series of random variables are

i.i.d. , thin-tailed and not skewed, the distribution of the sample mean computed using
as few as 10 observations may be very well approximated using a central limit theorem.
On the other hand, the approximation of a central limit theorem for the estimate of the
autoregressive parameter, ρ, in

yi = ρ yi −1 + εi (2.22)
may be poor even for hundreds of data points when ρ is close to one (but smaller). Figure
2.3 contains kernel density plots of the sample means computed from a set of 10 i.i.d. draws
from a Poisson distribution with λ = 5 in the top panel and the estimated autoregressive
2.3 Properties of Estimators 85

parameter from the autoregression in eq. (2.22) with ρ = .995 in the bottom. Each figure
also contains the pdf of an appropriately scaled normal. The CLT for the sample means of
the Poisson random variables is virtually indistinguishable from the actual distribution. On
the other hand, the CLT approximation for ρ̂ is very poor being based on 100 data points –
10× more than in the i.i.d. uniform example. The difference arises because the data in the
AR(1) example are not independent. With ρ = 0.995 data are highly dependent and more
data is required for averages to be well behaved so that the CLT approximation is accurate.
Unfortunately there are no hard and fast rules as to when a CLT will be a good approx-
imation. In general, the more dependent and the more heterogeneous a series, the worse
the approximation for a fixed number of observations. Simulations (Monte Carlo) are a
useful tool to investigate the validity of a CLT since they allow the finite sample distribu-
tion to be tabulated and compared to the asymptotic distribution.

2.3.3 Efficiency
A final concept, efficiency, is useful for ranking consistent asymptotically normal (CAN)
estimators that have the same rate of convergence.10

Definition 2.19 (Relative Efficiency). Let θ̂ n and θ̃ n be two n -consistent asymptotically

normal estimators for θ 0 . If the asymptotic variance of θ̂ n , written avar θ̂ n is less than
the asymptotic variance of θ̃ n , and so
   
avar θ̂ n < avar θ̃ n (2.23)

then θ̂ n is said to be relatively efficient to θ̃ n .11


 
Note that when θ is a vector, avar θ̂ n will be a covariance matrix. Inequality for ma-
trices A and B is interpreted to mean that if A < B then B − A is positive semi-definite, and
so all of the variances of the inefficient estimator must be (weakly) larger than those of the
efficient estimator.

Definition 2.20 (Asymptotically Efficient Estimator). Let θ̂ n and θ̃ n be two n -consistent
asymptotically normal estimators for θ 0 . If
   
avar θ̂ n < avar θ̃ n (2.24)
10
In any consistent estimator the asymptotic distribution of θ̂ − θ 0 is degenerate. In order to make infer-
ence, the difference between the estimate and the population parameters must 
be scaled by a function of the
√ √ 
number of data points. For most estimators this rate is n, and so n θ̂ − θ 0 will have an asymptotically
 
normal distribution. In the general case, the scaled difference can be written as n δ θ̂ − θ 0 where n δ is
known as the rate.
√  
11
The asymptotic variance of a n-consistent estimator, written avar θ̂ n is defined as
h√  i
limn→∞ V n θ̂ n − θ 0 .
86 Estimation, Inference and Hypothesis Testing

for any choice of θ̃ n then θ̂ n is said to be the efficient estimator of θ .

One of the important features of efficiency comparisons is that they are only meaningful

if both estimators are asymptotically normal, and hence consistent, at the same rate – n
in the usual case. It is trivial to produce an estimator that has a smaller variance but is
inconsistent. For example, if an estimator for a scalar unknown is θ̂ = 7 then it has no
variance: it will always be 7. However, unless θ0 = 7 it will also be biased. Mean square error
is a more appropriate method to compare estimators where one or more may be biased,
since it accounts for the total variation, not just the variance.12

2.4 Distribution Theory

Most distributional theory follows from a central limit theorem applied to the moment con-
ditions or to the score of the log-likelihood. While the moment condition or score are not
the object of interest – θ is – a simple expansion can be used to establish the asymptotic
distribution of the estimated parameters.

2.4.1 Method of Moments

Distribution theory for classical method of moments estimators is the most straightfor-
ward. Further, Maximum Likelihood can be considered a special case and so the method
of moments is a natural starting point.13 The method of moments estimator is defined as

n
X
µ̂ = n −1
xi
i =1
X n
2
µ̂2 = n −1 (xi − µ̂)
i =1
..
.
n
X k
µ̂k = n −1 (xi − µ̂)
i =1

√  
d
12
Some consistent asymptotically normal estimators have an asymptotic bias and so n θ̃ n − θ →
   0 
N (B, Σ). Asymptotic MSE defined as E n θ̂ n − θ 0 θ̂ n − θ 0 = BB0 + Σ provides a method to compare
estimators using their asymptotic properties.
13
While the class of method of moments estimators and maximum likelihood estimators contains a sub-
stantial overlap, there are method of moments estimators that cannot be replicated as a score condition of
any likelihood since the likelihood is required to integrate to 1.
2.4 Distribution Theory 87

To understand the distribution theory for the method of moments estimator, begin by re-
formulating the estimator as the solution of a set of k equations evaluated using the pop-
ulations values of µ, µ2 , . . .

n
X
n −1
xi − µ = 0
i =1
n
X
n −1 (xi − µ)2 − µ2 = 0
i =1
..
.
n
X
n −1
(xi − µ)k − µk = 0
i =1

Define g 1i = xi − µ and g j i = (xi − µ) j − µ j , j = 2, . . . , k , and the vector gi as


 
g 1i
 g 2i 
gi =  . (2.25)
 
..
 . 
gki
Using this definition, the method of moments estimator can be seen as the solution to
n
X
n −1
gi = 0.
i =1

Consistency of the method of moments estimator relies on a law of large numbers hold-
Pn Pn j
ing for n −1 i =1 (x i − µ) for j = 2, . . . , k . If x i is an i.i.d. sequence and as
−1
h i =1 xi and
i n p
long as E |xn − µ| j exists, then n −1 ni=1 (xn − µ) j → µ j .14 An alternative, and more re-
P
h i
strictive approach is to assume that E (xn − µ)2 j = µ2 j exists, and so
" n
#
X
E n −1 (xi − µ) j = µj (2.26)
i =1
" n
#  h
X i h i2 
j 2j j
V n −1
(xi − µ) =n −1
E (xi − µ) − E (xi − µ) (2.27)
i =1
 
= n −1 µ2 j − µ2j ,

Pn m .s .
and so n −1 i =1 (xi − µ) j → µ j which implies consistency.
Pn a .s .
Technically, n −1 i =1 (xi − µ) j → µ j by the Kolmogorov law of large numbers, but since a.s. conver-
14

gence implies convergence in probability, the original statement is also true.


88 Estimation, Inference and Hypothesis Testing

The asymptotic normality of parameters estimated using the method of moments fol-
lows from the asymptotic normality of

n
! n
√ X
− 12
X
n n −1
gi =n gi , (2.28)
i =1 i =1

an assumption. This requires the elements of gn to be sufficiently well behaved so that aver-
ages are asymptotically normally distribution. For example, when xi is i.i.d., the Lindberg-
Lévy CLT would require xi to have 2k moments when estimating k parameters. When esti-
mating the mean, 2 moments are required (i.e. the variance is finite). To estimate the mean
and the variance using i.i.d. data, 4 moments are required for the estimators to follow a CLT.
As long as the moment conditions are differentiable in the actual parameters of interest θ –
for example the mean and the variance – a mean value expansion can be used to establish
the asymptotic normality of these parameters.15

n n n
∂ gi (θ )
X X X  
n −1
gi (θ̂ ) = n −1
gi (θ 0 ) + n −1
θ̂ − θ 0 (2.30)
i =1 i =1 i =1
∂ θ 0 θ =θ̄
X n   
=n −1
gi (θ 0 ) + Gn θ̄ θ̂ − θ 0
i =1

Pn
where θ̄ is a vector that lies between θ̂ and θ 0 , element-by-element. Note that n −1 i =1 gi (θ̂ ) =
0 by construction and so

n
X   
n −1
gi (θ 0 ) + Gn θ̄ θ̂ − θ 0 = 0
i =1
   n
X
Gn θ̄ θ̂ − θ 0 = −n −1 gi (θ 0 )
i =1
   −1 n
X
θ̂ − θ 0 = −Gn θ̄ n −1 gi (θ 0 )
i =1

15
The mean value expansion is defined in the following theorem.

Theorem 2.21 (Mean Value Theorem). Let s : Rk → R be defined on a convex set Θ ⊂ Rk . Further, let s be
continuously differentiable on Θ with k by 1 gradient

∂ s (θ )
 
∇s θ̂ ≡ . (2.29)
∂ θ θ =θ̂

Then for any points θ and θ 0 there exists θ̄ lying on the segment between θ and θ 0 such that s (θ ) = s (θ 0 ) +
 0
∇s θ̄ (θ − θ 0 ).
2.4 Distribution Theory 89

n
√    −1 √ X
n θ̂ − θ 0 = −Gn θ̄ nn −1
gi (θ 0 )
i =1
√    −1 √
n θ̂ − θ 0 = −Gn θ̄ n gn (θ 0 )

where gn (θ 0 ) = n −1 ni=1 gi (θ 0 ) is the average of the moment conditions. Thus the nor-
P

malized difference between the estimated and the population values of the parameters,
√  √
   −1 
n θ̂ − θ 0 is equal to a scaled −Gn θ̄ n gn (θ 0 ) that has an

random variable
√ d
asymptotic normal distribution. By assumption n gn (θ 0 ) → N (0, Σ) and so
√  
d
 −1 
n θ̂ − θ 0 → N 0, G−1 Σ G0 (2.31)
 
where Gn θ̄ has been replaced with its limit as n → ∞, G.

∂ gn (θ )

G = plimn→∞ (2.32)
∂ θ 0 θ =θ 0
n
∂ gi (θ )
X
= plimn→∞ n −1
∂ θ0 i =1 θ =θ 0

p p
Since θ̂ is a consistent estimator, θ̂ → θ 0 and so θ̄ → θ 0 since it is between θ̂ and θ 0 . This
form of asymptotic covariance is known as a “sandwich” covariance estimator.

2.4.1.1 Inference on the Mean and Variance

To estimate the mean and variance by the method of moments, two moment conditions
are needed,

n
X
n −1
xi = µ̂
i =1
n
X
n −1
(xi − µ̂)2 = σ̂2
i =1

To derive the asymptotic distribution, begin by forming gi ,


" #
xi − µ
gi =
(xi − µ)2 − σ2

Note that gi is mean 0 and a function of a single xi so that gi is also i.i.d.. The covariance of
gi is given by
90 Estimation, Inference and Hypothesis Testing

"" # #
xi − µ h i
Σ = E gi g0i = E xi − µ (xi − µ)2 − σ2
 
(2.33)
(xi − µ)2 − σ2
   
(xi − µ) 2
(xi − µ) (xi − µ) − σ 2 2

= E
    2 
(xi − µ) (xi − µ)2 − σ2 (xi − µ)2 − σ2

" #
(xi − µ)2 (xi − µ)3 − σ2 (xi − µ)
=E
(xi − µ)3 − σ2 (xi − µ) (xi − µ)4 − 2σ2 (xi − µ)2 + σ4
" #
σ2 µ3
=
µ3 µ4 − σ 4

and the Jacobian is

n
∂ gi (θ )
X
G = plimn →∞ n−1

i =1
∂ θ 0 θ =θ 0
n
" #
X −1 0
= plimn →∞ n −1 .
−2 (xi − µ) −1
i =1

Pn
Since plimn →∞ n −1 i =1 (xi − µ) = plimn →∞ x̄n − µ = 0,
" #
−1 0
G= .
0 −1
0
Thus, the asymptotic distribution of the method of moments estimator of θ = µ, σ2 is
" # " #! " # " #!
√ µ̂ µ d 0 σ2 µ3
n − →N ,
σ̂2 σ2 0 µ3 µ4 − σ 4
0
since G = −I2 and so G−1 Σ G−1 = −I2 Σ (−I2 ) = Σ.

2.4.2 Maximum Likelihood

The steps to deriving the asymptotic distribution of ML estimators are similar to those for
method of moments estimators where the score of the likelihood takes the place of the
moment conditions. The maximum likelihood estimator is defined as the maximum of the
log-likelihood of the data with respect to the parameters,

θ̂ = argmax l (θ ; y). (2.34)


θ
2.4 Distribution Theory 91

When the data are i.i.d., the log-likelihood can be factored into n log-likelihoods, one for
each observation16 ,
n
X
l (θ ; y) = l i (θ ; yi ) . (2.35)
i =1

It is useful to work with the average log-likelihood directly, and so define


n
l¯n (θ ; y) = n −1
X
l i (θ ; yi ) . (2.36)
i =1

The intuition behind the asymptotic distribution follows from the use of the average. Un-
der some regularity conditions, l¯n (θ ; y) converges uniformly in θ to E l (θ ; yi ) . However,
 

since the average log-likelihood is becoming a good approximation for the expectation of
the log-likelihood, the value of θ that maximizes the log-likelihood of the data and its ex-
pectation will be very close for n sufficiently large. As a result,whenever the log-likelihood
is differentiable and the range of yi does not depend on any of the parameters in θ ,
 
∂ ¯n (θ ; yi )
l
E =0 (2.37)
∂θ


θ =θ 0

where θ 0 are the parameters of the data generating process. This follows since


∂ f (y;θ 0 )
∂ l¯n (θ 0 ; y) ∂θ
Z Z
θ =θ 0
f (y; θ 0 ) dy = f (y; θ 0 ) dy (2.38)
∂θ f (y; θ 0 )

Sy Sy
θ =θ 0
∂ f (y; θ 0 )
Z
= dy
Sy ∂θ
θ =θ 0


Z
= f (y; θ ) dy

∂ θ Sy
θ =θ 0

= 1
∂θ
=0

where Sy denotes the support of y. The scores of the average log-likelihood are

16
Even when the data are not i.i.d., the log-likelihood can be factored into n log-likelihoods using condi-
tional distributions for y2 , . . . , yi and the marginal distribution of y1 ,
N
X
l (θ ; y) = l i θ ; yi |yi −1 , . . . , y1 + l 1 (θ ; y1 ) .


n =2
92 Estimation, Inference and Hypothesis Testing

n
∂ l¯n (θ ; yi ) X ∂ l i (θ ; yi )
=n −1
(2.39)
∂θ ∂θ
i =1

and when yi is i.i.d. the scores will be i.i.d., and so the average scores will follow a law of
large numbers for θ close to θ 0 . Thus
n
∂ l i (θ ; yi ) ∂ l (θ ; Yi )
 
a .s .
X
−1
n →E (2.40)
∂θ ∂θ
i =1

As a result, the population value of θ , θ 0 , will also asymptotically solve the first order con-
dition. The average scores are also the basis of the asymptotic normality of maximum like-
lihood estimators. Under some further regularity conditions, the average scores will follow
a central limit theorem, and so

n
!
√ √ ∂ l (θ ; yi )
d
n ∇θ l¯ (θ 0 ) ≡ n
X
n −1 → N (0, J ) . (2.41)

∂θ

i =1

θ =θ 0

Taking a mean value expansion around θ 0 ,

√   √ √   
n ∇θ l¯ θ̂ = n∇θ l¯ (θ 0 ) + n ∇θ θ 0 l¯ θ̄ θ̂ − θ 0
√ √   
0 = n∇θ l¯ (θ 0 ) + n ∇θ θ 0 l¯ θ̄ θ̂ − θ 0
√    √
− n ∇θ θ 0 l¯ θ̄ θ̂ − θ 0 = n∇θ l¯ (θ 0 )
√   h  i−1 √
¯
n θ̂ − θ 0 = −∇θ θ 0 l θ̄ n∇θ l (θ 0 )

where

n
¯
  X ∂ 2
l (θ ; y i )
∇θ θ 0 l θ̄ ≡ n −1
(2.42)

0
∂ θ∂ θ i =1 θ =θ̄

and where θ̄ is a vector whose elements lie between θ̂ and θ 0 . Since θ̂ is a consistent
p
estimator of θ 0 , θ̄ → θ 0 and so functions of θ̄ will converge to their value at θ 0 , and the
asymptotic distribution of the maximum likelihood estimator is

√  
d
n θ̂ − θ 0 → N 0, I −1 J I −1

(2.43)

where
" #
∂ 2 l (θ ; yi )

I = −E (2.44)
∂ θ ∂ θ 0 θ =θ 0
2.4 Distribution Theory 93

and " #
∂ l (θ ; yi ) ∂ l (θ ; yi )

J =E (2.45)
∂θ ∂ θ 0 θ =θ 0

The asymptotic covariance matrix can be further simplified using the information matrix
p
equality which states that I − J → 0 and so
√  
d
n θ̂ − θ 0 → N 0, I −1

(2.46)

or equivalently
√  
d
n θ̂ − θ 0 → N 0, J −1 .

(2.47)

The information matrix equality follows from taking the derivative of the expected score,

∂ 2 l (θ 0 ; y) 1 ∂ 2 f (y; θ 0 ) 1 ∂ f (y; θ 0 ) ∂ f (y; θ 0 )


= − (2.48)
∂ θ∂ θ 0
f (y; θ ) ∂ θ ∂ θ 0
f (y; θ )2 ∂θ ∂ θ0
∂ 2 l (θ 0 ; y) ∂ l (θ 0 ; y) ∂ l (θ 0 ; y) 1 ∂ 2 f (y; θ 0 )
+ =
∂ θ∂ θ0 ∂θ ∂ θ0 f (y; θ ) ∂ θ ∂ θ 0

and so, when the model is correctly specified,

∂ 2 l (θ 0 ; y) ∂ l (θ 0 ; y) ∂ l (θ 0 ; y) 1 ∂ 2 f (y; θ 0 )
  Z
E + = 0 f (y; θ )d y
∂ θ∂ θ0 ∂θ ∂ θ0 Sy f (y; θ ) ∂ θ ∂ θ

∂ 2 f (y; θ 0 )
Z
= dy
Sy ∂ θ∂ θ0
∂2
Z
= f (y; θ 0 )d y
∂ θ ∂ θ 0 Sy
∂2
= 1
∂ θ∂ θ0
= 0.

and

∂ 2 l (θ 0 ; y) ∂ l (θ 0 ; y) ∂ l (θ 0 ; y)
   
E = −E .
∂ θ∂ θ0 ∂θ ∂ θ0
A related concept, and one which applies to ML estimators when the information matrix
equality holds – at least asymptotically – is the Cramér-Rao lower bound.

Theorem 2.22 (Cramér-Rao Inequality). Let f (y; θ ) be the joint density of y where θ is a k
dimensional parameter vector. Let θ̂ be a consistent estimator of θ with finite covariance.
94 Estimation, Inference and Hypothesis Testing

Under some regularity condition on f (·)


 
avar θ̂ ≥ I −1 (θ ) (2.49)

where " #
∂ 2 ln f (Yi ; θ )

I(θ ) = −E . (2.50)
∂ θ ∂ θ 0 θ =θ 0

The important implication of the Cramér-Rao theorem is that maximum likelihood estima-
tors, which are generally consistent, are asymptotically efficient.17 This guarantee makes a
strong case for using the maximum likelihood when available.

2.4.2.1 Inference in a Poisson MLE

Recall that the log-likelihood in a Poisson MLE is


n yi
X X
l (λ; y) = −nλ + ln(λ) yi − ln(i )
i =1 i =1

and that the first order condition is


n
∂ l (λ; y) X
= −n + λ−1 yi .
∂λ
i =1
Pn
The MLE was previously shown to be λ̂ = n −1
i =1 yi . To compute the variance, take the
expectation of the negative of the second derivative,

∂ 2 l (λ; yi )
= −λ−2 yi
∂λ 2

and so

∂ 2 l (λ; yi )
 
I = −E =
 −2 
−E −λ yi
∂ λ2
= λ−2 E [yi ]
 

= λ−2 λ
 

λ
 
=
λ2
= λ−1
17
The Cramér-Rao bound also applied in finite samples when θ̂ is unbiased. While most maximum likeli-
hood estimators are biased in finite samples, there are important cases where estimators are unbiased for any
sample size and so the Cramér-Rao theorem will apply in finite samples. Linear regression is an important
case where the Cramér-Rao theorem applies in finite samples (under some strong assumptions).
2.4 Distribution Theory 95

√  
d
and so n λ̂ − λ0 → N (0, λ) since I −1 = λ.
Alternatively the covariance of the scores could be used to compute the parameter co-
variance,
 
yi 2
J = V −1 +
λ
1
= V [yi ]
λ2
= λ−1 .

I = J and so the IME holds when the data are Poisson distributed. If the data were not
Poisson distributed, then it would not normally be the case that E [yi ] = V [yi ] = λ, and so
I and J would not (generally) be equal.

2.4.2.2 Inference in the Normal (Gaussian) MLE

Recall that the MLE estimators of the mean and variance are

n
X
µ̂ = n −1
yi
i =1
X n
2
σ̂2 = n −1 (yi − µ̂)
i =1

and that the log-likelihood is


n
n n 1 X (yi − µ)2
l (θ ; y) = − ln(2π) − ln(σ ) −
2
.
2 2 2 σ2
i =1
0
Taking the derivative with respect to the parameter vector, θ = µ, σ2 ,

n
∂ l (θ ; y) X (yi − µ)
=
∂µ σ2
i =1
n
∂ l (θ ; y) n 1 X (yi − µ)2
= − + .
∂ σ2 2σ2 2 σ4
i =1

The second derivatives are

n
∂ 2 l (θ ; y) X 1
=−
∂ µ∂ µ σ2
i =1
96 Estimation, Inference and Hypothesis Testing

n
∂ 2 l (θ ; y) X (yi − µ)
= −
∂ µ∂ σ2 σ4
i =1
n
∂ l (θ ; y)
2
n 2 X (yi − µ)2
= − .
∂ σ2 ∂ σ2 2σ4 2 σ6
i =1

The first does not depend on data and so no expectation is needed. The other two have
expectations,

∂ 2 l (θ ; yi ) (yi − µ)
   
E =E −
∂ µ∂ σ2 σ4
(E [yi ] − µ)
=−
σ4
µ−µ
=−
σ4
=0

and

" #
∂ 2 l (θ ; yi ) 2 (yi − µ)2
 
1
E =E −
∂ σ2 ∂ σ2 2σ4 2 σ6
h i
1 E (yi − µ)2

= −
2σ4 σ6
1 σ 2
= − 6
2σ 4 σ
1 1
= − 4
2σ 4 σ
1
=− 4

Putting these together, the expected Hessian can be formed,

 " #
∂ 2 l (θ ; yi ) 1

− σ2 0
E 0 =
∂ θ∂ θ 0 − 2σ1 4

and so the asymptotic covariance is

−1 " #−1


∂ 2 l (θ ; yi ) 1

0
I −1 = −E = σ2
∂ θ∂ θ0 0 1
2σ4
2.4 Distribution Theory 97

" #
σ2 0
=
0 2σ4

The asymptotic distribution is then


" # " #! " # " #!
√ µ̂ µ d 0 σ2 0
n − →N ,
σ̂ 2
σ 2
0 0 2σ4
Note that this is different from the asymptotic variance for the method of moments estima-
tor of the mean and the variance. This is because the data have been assumed to come from
a normal distribution and so the MLE is correctly specified. As a result µ3 = 0 (the normal
is symmetric) and the IME holds. In general the IME does not hold and so the asymptotic
covariance may take a different form which depends on the moments of the data as in eq.
(2.33).

2.4.3 Quasi Maximum Likelihood

While maximum likelihood is an appealing estimation approach, it has one important draw-
back: knowledge of f (y; θ ). In practice the density assumed in maximum likelihood esti-
mation, f (y; θ ), is misspecified for the actual density of y, g (y). This case has been widely
studied and estimators where the distribution is misspecified are known as quasi-maximum
likelihood (QML) estimators. Unfortunately QML estimators generally lose all of the fea-
tures that make maximum likelihood estimators so appealing: they are generally inconsis-
tent for the parameters of interest, the information matrix equality does not hold and they
do not achieve the Cramér-Rao lower bound.
First, consider the expected score from a QML estimator,

∂ l (θ 0 ; y) ∂ l (θ 0 ; y)
  Z
Eg = g (y) dy (2.51)
∂θ Sy ∂θ
∂ l (θ 0 ; y) f (y; θ 0 )
Z
= g (y) dy
Sy ∂θ f (y; θ 0 )
∂ l (θ 0 ; y) g (y)
Z
= f (y; θ 0 ) dy
Sy ∂θ f (y; θ 0 )
∂ l (θ 0 ; y)
Z
= h (y) f (y; θ 0 ) dy
Sy ∂θ

which shows that the QML estimator can be seen as a weighted average with respect to the
density assumed. However these weights depend on the data, and so it will no longer be the
case that the expectation of the score at θ 0 will necessarily be 0. Instead QML estimators
generally converge to another value of θ , θ ∗ , that depends on both f (·) and g (·) and is
known as the pseudo-true value of θ .
98 Estimation, Inference and Hypothesis Testing

The other important consideration when using QML to estimate parameters is that the
Information Matrix Equality (IME) no longer holds, and so “sandwich” covariance estima-
tors must be used and likelihood ratio statistics will not have standard χ 2 distributions.
An alternative interpretation of QML estimators is that of method of moments estimators
where the scores of l (θ ; y) are used to choose the moments. With this interpretation, the
distribution theory of the method of moments estimator will apply as long as the scores,
evaluated at the pseudo-true parameters, follow a CLT.

2.4.3.1 The Effect of the Data Distribution on Estimated Parameters

Figure 2.4 contains three distributions (left column) and the asymptotic covariance of the
mean and the variance estimators, illustrated through joint confidence ellipses contain-
ing 80, 95 and 99% probability the true value is within their bounds (right column).18 The
ellipses were all derived from the asymptotic covariance of µ̂ and σ̂2 where the data are
i.i.d. and distributed according to a mixture of normals distribution where
(
µ1 + σ1 z i with probability p
yi =
µ2 + σ2 z i with probability 1 − p

where z is a standard normal. A mixture of normals is constructed from mixing draws from
a finite set of normals with possibly different means and/or variances, and can take a wide
variety of shapes. All of the variables were constructed so that E [yi ] = 0 and V [yi ] = 1.
This requires

p µ1 + (1 − p )µ2 = 0
and

p µ21 + σ12 + (1 − p ) µ22 + σ22 = 1.


 

The values used to produce the figures are listed in table 2.1. The first set is simply a stan-
dard normal since p = 1. The second is known as a contaminated normal and is com-
posed of a frequently occurring (95% of the time) mean-zero normal with variance slightly
smaller than 1 (.8), contaminated by a rare but high variance (4.8) mean-zero normal. This
produces heavy tails but does not result in a skewed distribution. The final example uses
different means and variance to produce a right (positively) skewed distribution.
The confidence ellipses illustrated in figure 2.4 are all derived from estimators produced
assuming that the data are normal, but using the “sandwich” version of the covariance,
I −1 J I −1 . The top panel illustrates the correctly specified maximum likelihood estimator.
Here the confidence ellipse is symmetric about its center. This illustrates that the param-
18
The ellipses are centered at (0,0) since the population value of the parameters has been subtracted. Also
√ that even though the confidence ellipse for σ̂ extended into the negative space, these must be divided
2
note
by n and re-centered at the estimated value when used.
2.4 Distribution Theory 99

Data Generating Process and Asymptotic Covariance of Estimators


Standard Normal Standard Normal CI
4 99%
0.4 90%
80%
2
0.3

σ2
0
0.2

−2
0.1

−4
−4 −2 0 2 4 −3 −2 −1 0 1 2 3
µ
Contaminated Normal Contaminated Normal CI
6
0.4
4

0.3 2
σ2

0
0.2
−2
0.1 −4

−6
−5 0 5 −3 −2 −1 0 1 2 3
µ
Mixture of Normals Mixture of Normals CI
4
0.4

2
0.3
σ2

0
0.2
−2
0.1
−4
−4 −2 0 2 4 −3 −2 −1 0 1 2 3
µ

Figure 2.4: The six subplots illustrate how the data generating process, not the assumed
model, determine the asymptotic covariance of parameter estimates. In each panel the
data generating process was a mixture of normals, yi = µ1 + σ1 z i with probability p and
yi = µ2 + σ2 z i with probability 1 − p where the parameters were chosen so that E [yi ] = 0
and V [yi ] = 1. By varying p , µ1 , σ1 , µ2 and σ2 , a wide variety of distributions can be created
including standard normal (top panels), a heavy tailed distribution known as a contami-
nated normal (middle panels) and a skewed distribution (bottom panels).
100 Estimation, Inference and Hypothesis Testing

p µ1 σ12 µ2 σ22
Standard Normal 1 0 1 0 1
Contaminated Normal .95 0 .8 0 4.8
Right Skewed Mixture .05 2 .5 -.1 .8

Table 2.1: Parameter values used in the mixtures of normals illustrated in figure 2.4.

eters are uncorrelated – and hence independent, since they are asymptotically normal –
and that they have different variances. The middle panel has a similar shape but is elon-
gated on the variance axis (x). This illustrates that the asymptotic variance of σ̂2 is affected
by the heavy tails of the data (large 4th moment) of the contaminated normal. The final
confidence ellipse is rotated which reflects that the mean and variance estimators are no
longer asymptotically independent. These final two cases are examples of QML; the esti-
mator is derived assuming a normal distribution but the data are not. In these examples,
the estimators are still consistent but have different covariances.19

2.4.4 The Delta Method


Some theories make predictions about functions of parameters rather than on the param-
eters directly. One common example in finance is the Sharpe ratio, S , defined
 
E r − rf
S=q   (2.52)
V r − rf

where r is the return on a risky asset and r f is the risk-free rate – and so r − r f is the excess
return on the risky asset. While the quantities in both the numerator and the denominator
are standard statistics, the mean and the standard deviation, the ratio is not.
The delta method can be used to compute the covariance of functions of asymptotically
normal parameter estimates.
√ d
 
Definition 2.23 (Delta method). Let n (θ̂ −θ 0 ) → N 0, G−1 Σ (G0 )−1 where Σ is a positive
definite covariance matrix. Further, suppose that d(θ ) is a m by 1 continuously differen-
tiable vector function of θ from Rk → Rm . Then,
√ d
 h −1 i 
n (d(θ̂ ) − d(θ 0 )) → N 0, D(θ 0 ) G−1 Σ G0 D(θ 0 )0

where
∂ d (θ )

D (θ 0 ) = . (2.53)
∂ θ 0 θ =θ 0
19
While these examples are consistent, it is not generally the case that the parameters estimated using a
misspecified likelihood (QML) are consistent for the quantities of interest.
2.4 Distribution Theory 101

2.4.4.1 Variance of the Sharpe Ratio

The Sharpe ratio is estimated by “plugging in” the usual estimators of the mean and the
variance,

µ̂
Ŝ = √ .
σ̂2
In this case d (θ 0 ) is a scalar function of two parameters, and so

µ
d (θ 0 ) = √ 2
σ
and
 
1 −µ
D (θ 0 ) =
σ 2σ3
Recall that the asymptotic distribution of the estimated mean and variance is
" # " #! " # " #!
√ µ̂ µ d 0 σ2 µ3
n − →N , .
σ̂2 σ2 0 µ3 µ4 − σ 4

The asymptotic distribution of the Sharpe ratio can be constructed by combining the asymp-
0
totic distribution of θ̂ = µ̂, σ̂2 with the D (θ 0 ), and so
" # 0 !
√  σ2 µ3


d 1 −µ 1 −µ
n Ŝ − S → N 0,
σ 2σ3 µ3 µ4 − σ 4 σ 2σ3

which can be simplified to


!
√  
d µµ3 µ2 µ4 − σ4
n Ŝ − S → N 0, 1 − 4 + .
σ 4σ6

The asymptotic variance can be rearranged to provide some insight into the sources of
uncertainty,

√ 
 

d 1 2
n Ŝ − S → N 0, 1 − S × s k + S (κ − 1) ,
4
where s k is the skewness and κ is the kurtosis. This shows that the variance of the Sharpe
ratio will be higher when the data is negatively skewed or when the data has a large kurtosis
(heavy tails), both empirical regularities of asset pricing data. If asset returns were normally
distributed, and so s k = 0 and κ = 3, the expression of the asymptotic variance simplifies
to
h√  i S2
V n Ŝ − S = 1 + , (2.54)
2
102 Estimation, Inference and Hypothesis Testing

an expression commonly given for the variance of the Sharpe ratio. As this example illus-
trates, the expression in eq. (2.54) is only correct if the skewness is 0 and returns have a
kurtosis of 3 – something that would only be expected if returns are normal.

2.4.5 Estimating Covariances


The presentation of the asymptotic theory in this chapter does not provide a method to
implement hypothesis tests since all of the distributions depend on the covariance of the
scores and the expected second derivative or Jacobian in the method of moments. Feasi-
ble testing requires estimates of these. The usual method to estimate the covariance uses
“plug-in” estimators. Recall that in the notation of the method of moments,
n
!
1
X
Σ ≡ avar n − 2 gi (θ 0 ) (2.55)
i =1

or in the notation of maximum likelihood,


" #
∂ l (θ ; Yi ) ∂ l (θ ; Yi )

J ≡E . (2.56)
∂θ ∂ θ 0 θ =θ 0
When the data are i.i.d., the scores or moment conditions should be i.i.d., and so the
variance of the average is the average of the variance. The “plug-in” estimator for Σ uses
the moment conditions evaluated at θ̂ , and so the covariance estimator for method of mo-
ments applications with i.i.d. data is
n
X    0
Σ̂ = n −1
gi θ̂ gi θ̂ (2.57)
i =1

which is simply the average outer-product of the moment


  condition. The estimator of Σ in
the maximum likelihood is identical replacing gi θ̂ with ∂ l (θ ; yi ) /∂ θ evaluated at θ̂ ,

n
X ∂ l (θ ; yi ) ∂ l (θ ; y )
i

Jˆ = n −1 . (2.58)
∂θ ∂ θ0

i =1 θ =θ̂

The “plug-in” estimator for the second derivative of the log-likelihood or the Jacobian
of the moment conditions is similarly defined,
n

X ∂ g (θ )

Ĝ = n −1
(2.59)

0
∂θ i =1 θ =θ̂

or for maximum likelihood estimators



n
X ∂ l (θ ; yi )
2
Î = n −1 − . (2.60)
∂ θ∂ θ0

i =1 θ =θ̂
2.4 Distribution Theory 103

2.4.6 Estimating Covariances with Dependent Data


The estimators in eq. (2.57) and eq. (2.58) are only appropriate when the moment condi-
tions or scores are not correlated across i .20 If the moment conditions or scores are corre-
lated across observations the covariance estimator (but not the Jacobian estimator) must
be changed to account for the dependence. Since Σ is defined as the variance of a sum it is
necessary to account for both the sum of the variances plus all of the covariances.

n
!
− 21
X
Σ ≡ avar n gi (θ 0 ) (2.61)
i =1
 
n
X n−1 X
X n
= lim n −1  E gi (θ 0 ) gi (θ 0 )0 + E g j (θ 0 ) g j −i (θ 0 )0 + g j −i (θ 0 ) g j (θ 0 ) 
   
n →∞
i =1 i =1 j =i +1

This expression depends on both the usual covariance of the moment conditions and on
the covariance between the scores. When using i.i.d. data the second term vanishes since
the moment conditions must be uncorrelated and so cross-products must have expecta-
tion 0.
If the moment conditions are correlated across i then covariance estimator must be
adjusted to account for this. The obvious solution is estimate the expectations of the cross
terms in eq. (2.57) with their sample analogues, which would result in the covariance esti-
mator

 
Xn    0 X n−1 X
n     0    0 
Σ̂DEP = n −1  gi θ̂ gi θ̂ + g j θ̂ g j −i θ̂ + g j −i θ̂ g j θ̂ .
i =1 i =1 j =i +1
(2.62)
Pn  Pn 0 Pn
This estimator is always zero since Σ̂DEP = n i =1 gi
−1
i =1 gi and i =1 gi = 0, and so
Σ̂DEP cannot be used in practice. One solution is to truncate the maximum lag to be some-
21

thing less than n −1 (usually much less than n −1), although the truncated estimator is not
guaranteed to be positive definite. A better solution is to combine truncation with a weight-
ing function (known as a kernel) to construct an estimator which will consistently estimate
the covariance and is guaranteed to be positive definite. The most common covariance es-
timator of this type is the Newey & West (1987) covariance estimator. Covariance estimators
20
Since i.i.d. implies no correlation, the i.i.d. case is trivially covered.
21
The scalar version of Σ̂DEP may be easier to understand. If g i is a scalar, then
  
X n   n
X −1 Xn    
σ̂DEP
2
= n −1  g i2 θ̂ + 2  g j θ̂ g j −i θ̂  .
i =1 i =1 j =i +1

The first term is the usual variance estimator and the second term is the sum of the (n − 1) covariance esti-
mators. The more complicated expression in eq. (2.62) arises since order matters when multiplying vectors.
104 Estimation, Inference and Hypothesis Testing

for dependent data will be examined in more detail in the chapters on time-series data.

2.5 Hypothesis Testing


Econometrics models are estimated in order to test hypotheses, for example, whether a
financial theory is supported by data or to determine if a model with estimated parameters
can outperform a naïve forecast. Formal hypothesis testing begins by specifying the null
hypothesis.

Definition 2.24 (Null Hypothesis). The null hypothesis, denoted H0 , is a statement about
the population values of some parameters to be tested. The null hypothesis is also known
as the maintained hypothesis.

The null defines the condition on the population parameters that is to be tested. A null
can be either simple, for example H0 : µ = 0, or complex, which allows for testing of multi-
ple hypotheses. For example, it is common to test whether data exhibit any predictability
using a regression model

yi = θ1 + θ2 x2,i + θ3 x3,i + εi , (2.63)


and a composite null, H0 : θ2 = 0 ∩ θ3 = 0, often abbreviated H0 : θ2 = θ3 = 0.22
Null hypotheses cannot be accepted; the data can either lead to rejection of the null or
a failure to reject the null. Neither option is “accepting the null”. The inability to accept the
null arises since there are important cases where the data are not consistent with either the
null or its testing complement, the alternative hypothesis.

Definition 2.25 (Alternative Hypothesis). The alternative hypothesis, denoted H1 , is a com-


plementary hypothesis to the null and determines the range of values of the population
parameter that should lead to rejection of the null.

The alternative hypothesis specifies the population values of parameters for which the null
should be rejected. In most situations the alternative is the natural complement to the null
in the sense that the null and alternative are exclusive of each other but inclusive of the
range of the population parameter. For example, when testing whether a random variable
has mean 0, the null is H0 : µ = 0 and the usual alternative is H1 : µ 6= 0.
In certain circumstances, usually motivated by theoretical considerations, one-sided
alternatives are desirable. One-sided alternatives only reject for population parameter val-
ues on one side of zero and so test using one-sided alternatives may not reject even if both
the null and alternative are false. Noting that a risk premium must be positive (if it exists),
the null hypothesis of H0 : µ = 0 should be tested against the alternative H1 : µ > 0. This
alternative indicates the null should only be rejected if there is compelling evidence that
22
∩, the intersection operator, is used since the null requires both statements to be true.
2.5 Hypothesis Testing 105

the mean is positive. These hypotheses further specify that data consistent with large neg-
ative values of µ should not lead to rejection. Focusing the alternative often leads to an
increased probability to rejecting a false null. This occurs since the alternative is directed
(positive values for µ), and less evidence is required to be convinced that the null is not
valid.
Like null hypotheses, alternatives can be composite. The usual alternative to the null
H0 : θ2 = 0 ∩ θ3 = 0 is H1 : θ2 6= 0 ∪ θ3 6= 0 and so the null should be rejected when-
ever any of the statements in the null are false – in other words if either or both θ2 6= 0
or θ3 6= 0. Alternatives can also be formulated as lists of exclusive outcomes.23 When ex-
amining the relative precision of forecasting models, it is common to test the null that the
forecast performance is equal against a composite alternative that the forecasting perfor-
mance is superior for model A or that the forecasting performance is superior for model B .
If δ is defined as the average forecast performance difference, then the null is H0 : δ = 0
and the composite alternatives are H1A : δ > 0 and H1B : δ < 0, which indicate superior
performance of models A and B, respectively.
Once the null and the alternative have been formulated, a hypothesis test is used to
determine whether the data support the alternative.

Definition 2.26 (Hypothesis Test). A hypothesis test is a rule that specifies which values to
reject H0 in favor of H1 .

Hypothesis testing requires a test statistic, for example an appropriately standardized


mean, and a critical value. The null is rejected when the test statistic is larger than the
critical value.

Definition 2.27 (Critical Value). The critical value for an α-sized test, denoted Cα , is the
value where a test statistic, T , indicates rejection of the null hypothesis when the null is
true.

The region where the test statistic is outside of the critical value is known as the rejection
region.

Definition 2.28 (Rejection Region). The rejection region is the region where T > Cα .

An important event occurs when the null is correct but the hypothesis is rejected. This
is known as a Type I error.

Definition 2.29 (Type I Error). A Type I error is the event that the null is rejected when the
null is true.

A closely related concept is the size of the test. The size controls how often Type I errors
should occur.
23
The ∪ symbol indicates the union of the two alternatives.
106 Estimation, Inference and Hypothesis Testing

Decision
Do not reject H0 Reject H0

H0 Correct Type I Error

Truth
(Size)

H1 Type II Error Correct


(Power)

Table 2.2: Outcome matrix for a hypothesis test. The diagonal elements are both correct
decisions. The off diagonal elements represent Type I error, when the null is rejected but is
valid, and Type II error, when the null is not rejected and the alternative is true.

Definition 2.30 (Size). The size or level of a test, denoted α, is the probability of rejecting
the null when the null is true. The size is also the probability of a Type I error.
Typical sizes include 1%, 5% and 10%, although ideally the selected size should reflect the
decision makers preferences over incorrectly rejecting the null. When the opposite occurs,
the null is not rejected when the alternative is true, a Type II error is made.
Definition 2.31 (Type II Error). A Type II error is the event that the null is not rejected when
the alternative is true.
Type II errors are closely related to the power of a test.
Definition 2.32 (Power). The power of the test is the probability of rejecting the null when
the alternative is true. The power is equivalently defined as 1 minus the probability of a
Type II error.
The two error types, size and power are summarized in table 2.2.
A perfect test would have unit power against any alternative. In other words, whenever
the alternative is true it would reject immediately. Practically the power of a test is a func-
tion of both the sample size and the distance between the population value of a parameter
and its value under the null. A test is said to be consistent if the power of the test goes to
1 as n → ∞ whenever the population value is in the alternative. Consistency is an impor-
tant characteristic of a test, but it is usually considered more important to have correct size
rather than to have high power. Because power can always be increased by distorting the
size, and it is useful to consider a related measure known as the size-adjusted power. The
size-adjusted power examines the power of a test in excess of size. Since a test should reject
at size even when the null is true, it is useful to examine the percentage of times it will reject
in excess of the percentage it should reject.
One useful tool for presenting results of test statistics is the p-value, or simply the p-val.
Definition 2.33 (P-value). The p-value is largest size (α) where the null hypothesis cannot
be rejected. The p-value can be equivalently defined as the smallest size where the null
hypothesis can be rejected.
2.5 Hypothesis Testing 107

The primary advantage of a p-value is that it immediately demonstrates which test sizes
would lead to rejection: anything above the p-value. It also improves on the common prac-
tice of reporting the test statistic alone since p-values can be interpreted without knowl-
edge of the distribution of the test statistic. A related representation is the confidence in-
terval for a parameter.

Definition 2.34 (Confidence Interval). A confidence interval for a scalar parameter is the
range of values, θ0 ∈ (C α , C α ) where the null H0 : θ = θ0 cannot be rejected for a size of α.

The formal definition of a confidence interval is not usually sufficient to uniquely identify
√ d
the confidence interval. Suppose that a n (θ̂ − θ0 ) → N (0, σ2 ). The common 95% confi-
dence interval is (θ̂ − 1.96σ2 , θ̂ + 1.96σ2 ). This set is known as the symmetric confidence
interval and is formally defined as points (C α , C α ) where Pr (θ0 ) ∈ (C α , C α ) = 1 − α and
C α − θ = θ − C α ) . An alternative, but still valid, confidence interval can be defined as
(−∞, θ̂ + 1.645σ2 ). This would also contain the true value with probability 95%. In gen-
eral, symmetric confidence intervals should be used, especially for asymptotically normal
parameter estimates. In rare cases where symmetric confidence intervals are not appro-
priate, other options for defining a confidence interval include shortest interval, so that the
confidence interval is defined as values (C α , C α ) where Pr (θ0 ) ∈ (C α , C α ) = 1 − α subject to
C α − C α chosen to be as small as possible, or symmetric in probability, so that the confi-
dence interval satisfies Pr (θ0 ) ∈ (C α , θ̂ ) = Pr (θ0 ) ∈ (θ̂ , C α ) = 1/2 − α/2. When constructing
confidence internals for parameters that are asymptotically normal, these three definitions
coincide.

2.5.0.1 Size and Power of a Test of the Mean with Normal Data

Suppose n i.i.d. normal random variables have unknown mean µ but known variance σ2
and so the sample mean, ȳ = n −1 ni=1 yi , is then distributed N (µ, σ2 /N ). When testing a
P

null that H0 : µ = µ0 against an alternative H1 : µ 6= µ0 , the size of the test is the probability
that the null is rejected when it is true. Since the distribution
 under the null is N (µ
 0 , σ /N )
2

and the size can be set to α by selecting points where Pr µ̂ ∈ C α , C α |µ = µ0 = 1 − α.


Since the distribution is normal, one natural choice is to select the points symmetrically
so that C α = µ0 + √σN Φ−1 α/2 and C α = µ0 + √σN Φ−1 1 − α/2 where Φ (·) is the cdf of a
 

standard normal.
The power of the test is defined as the probability the null is rejected when the alter-
native is true. This probability will depend on the population mean, µ1 , the sample size,
the test size and mean specified by the null hypothesis. When testing using an α-sized test,
rejection will occur when µ̂ < µ0 + √σN Φ−1 α/2 or µ̂ > µ0 + √σN Φ−1 1 − α/2 . Since under
 

the alternative µ̂ is N µ1 , σ2 , these probabilities will be




µ0 + √σN Φ−1 α/2 − µ1


 ! !
C α − µ1
Φ σ =Φ σ
√ √
N N
108 Estimation, Inference and Hypothesis Testing

and
µ0 + √σ Φ−1 1 − α/2 − µ1
 ! !
N C α − µ1
1−Φ =1−Φ .
√σ √σ
N N

The total probability that the null is rejected is known as the power function,
! !
C α − µ1 C α − µ1
Power (µ0 , µ1 , σ, α, N ) = Φ +1−Φ .
√σ √σ
N N

A graphical illustration of the power is presented in figure 2.5. The null hypothesis is
H0 : µ = 0 and the alternative distribution was drawn at µ1 = .25. The variance σ2 = 1,
n = 5 and the size was set to 5%. The highlighted regions indicate the power: the area
under the alternative distribution, and hence the probability, which is outside of the critical
values. The bottom panel illustrates the power curve for the same parameters allowing n
to range from 5 to 1,000. When n is small, the power is low even for alternatives far from
the null. As n grows the power increases and when n = 1, 000, the power of the test is close
to unity for alternatives greater than 0.1.

2.5.1 Statistical and Economic Significance

While testing can reject hypotheses and provide meaningful p-values, statistical signifi-
cance is different from economic significance. Economic significance requires a more de-
tailed look at the data than a simple hypothesis test. Establishing the statistical significance
of a parameter is the first, and easy, step. The more difficult step is to determine whether
the effect is economically important. Consider a simple regression model

yi = θ1 + θ2 x2,i + θ3 x3,i + εi (2.64)

and suppose that the estimates of both θ2 and θ3 are statistically different from zero. This
can happen for a variety of reasons, including having an economically small impact accom-
panied with a very large sample. To assess the relative contributions other statistics such
as the percentage of the variation that can be explained by either variable alone and/or the
range and variability of the x s.
The other important aspect of economic significance is that rejection of a hypothesis,
while formally as a “yes” or “no” question, should be treated in a more continuous manner.
The p-value is a useful tool in this regard that can provide a deeper insight into the strength
of the rejection. A p-val of .00001 is not the same as a p-value of .09999 even though a 10%
test would reject for either.

2.5.2 Specifying Hypotheses

Formalized in terms of θ , a null hypothesis is


2.5 Hypothesis Testing 109

Power
Rejection Region and Power
Power
Null Distribution
0.8 Alt. Distribution
Crit. Val

0.6

0.4

0.2

0
−1.5 −1 −0.5 0 0.5 1 1.5
Power Curve
1

0.8

0.6

0.4
N=5
0.2 N=10
N=100
N=1000
0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
µ1

Figure 2.5: The top panel illustrates the power. The distribution of the mean under the
null and alternative hypotheses were derived under that assumption that the data are
i.i.d. normal with means µ0 = 0 and µ1 = .25, variance σ2 = 1, n = 5 and α = .05. The
bottom panel illustrates the power function, in terms of the alternative mean, for the same
parameters when n = 5, 10, 100 and 1,000.

H0 : R(θ ) = 0 (2.65)

where R(·) is a function from Rk to Rm , m ≤ k , where m represents the number of hy-


potheses in a composite null. While this specification of hypotheses is very flexible, testing
non-linear hypotheses raises some subtle but important technicalities and further discus-
sion will be reserved for later. Initially, a subset of all hypotheses, those in the linear equality
restriction (LER) class, which can be specified as

H0 : Rθ − r = 0 (2.66)

will be examined where R is a m by k matrix and r is a m by 1 vector. All hypotheses in the


LER class can be written as weighted sums of model parameters,
110 Estimation, Inference and Hypothesis Testing

R11 θ1 + R12 θ2 . . . + R1k θk = r1


 
 R21 θ1 + R22 θ2 . . . + R2k θk = r2 
(2.67)
 
 .. 
 . 
Rm 1 θ1 + Rm 2 θ2 . . . + Rm k θk = ri .
Each linear hypothesis is represented as a row in the above set of equations. Linear equality
constraints can be used to test parameter restrictions on θ = (θ1 , θ2 , θ3 , θ4 )0 such as

θ1 = 0 (2.68)
3θ2 + θ3 = 1
4
X
θj = 0
j =1

θ1 = θ2 = θ3 = 0.

For example, the hypotheses in eq. (2.68) can be described in terms of R and r as

H0 R r
h i
θ1 = 0 1 0 0 0 0

h i
3θ2 + θ3 = 1 0 3 1 0 1

Pk h i
j =1 θ j = 0 1 1 1 1 0

 
1 0 0 0 h i0
θ1 = θ2 = θ3 = 0  0 1 0 0  0 0 0
 
0 0 1 0

When using linear equality constraints, alternatives are generally formulated as H1 :


Rθ − r 6= 0. Once both the null the alternative hypotheses have been postulated, it is
necessary to determine whether the data are consistent with the null hypothesis using one
of the many tests.

2.5.3 The Classical Tests


Three classes of statistics will be described to test hypotheses: Wald, Lagrange Multiplier
and Likelihood Ratio. Wald tests are perhaps the most intuitive: they directly test whether
Rθ̂ − r, the value under the null, is close to zero by exploiting the asymptotic normality
2.5 Hypothesis Testing 111

of the estimated parameters. Lagrange Multiplier tests incorporate the constraint into the
estimation problem using a Lagrangian. If the constraint has a small effect on value of
objective function, the Lagrange multipliers, often described as the shadow price of the
constraint in economic applications, should be close to zero. The magnitude of the scores
form the basis of the LM test statistic. Finally, likelihood ratios test whether the data are
less likely under the null than they are under the alternative. If these restrictions are not
statistically meaningful, this ratio should be close to one since the difference in the log-
likelihoods should be small.

2.5.4 Wald Tests

Wald test statistics are possibly the most natural method to test a hypothesis, and are often
the simplest to compute since only the unrestricted model must be estimated. Wald tests
directly exploit the asymptotic normality of the estimated parameters to form test statistics
with asymptotic χm 2
distributions. Recall that a χν2 random variable is defined to be the sum
of ν independent standard normals squared, νi =1 z i2 where z i ∼ N (0, 1). Also recall that if
P i.i.d.

z is a m dimension normal vector with mean µ and covariance Σ,

z ∼ N (µ, Σ) (2.69)
then the standardized version of z can be constructed as

1
Σ− 2 (z − µ) ∼ N (0, I). (2.70)
− 12 PM
Defining w = Σ (z − µ) ∼ N (0, I), it is easy to see that w0 w = m =1 w m ∼ χm . In
2 2

the usual case, the method of moments estimator, which nests ML and QML estimators as
special cases, is asymptotically normal
√  
d
 
−1 0
n θ̂ − θ 0 → N 0, G Σ G
−1
. (2.71)

If null hypothesis, H0 : Rθ = r is true, it follows directly that


√  
d
 0 
n Rθ̂ − r → N 0, RG−1 Σ G−1 R0 . (2.72)

This allows a test statistic to be formed


 0   −1  
−1 0 0
W = n Rθ̂ − r RG Σ G
−1
R Rθ̂ − r (2.73)

which is the sum of the squares of m random variables, each asymptotically uncorrelated
standard normal and so W is asymptotically χm2
distributed. A hypothesis test with size α
can be conducted by comparing W against Cα = F −1 (1 − α) where F (·) is the cdf of a χm2
.
If W ≥ Cα then the null is rejected.
There is one problem with the definition of W in eq. (2.73): it is infeasible since it
112 Estimation, Inference and Hypothesis Testing

depends on G and Σ which are unknown. The usual practice is to replace the unknown
elements of the covariance matrix with consistent estimates to compute a feasible Wald
statistic,
 0   0 −1  
W = n Rθ̂ − r RĜ Σ̂ Ĝ
−1 −1
R0
Rθ̂ − r . (2.74)

which has the same asymptotic distribution as the infeasible Wald.

2.5.4.1 t -tests

A t -test is a special case of a Wald and is applicable to tests involving a single hypothesis.
Suppose the null is

H0 : Rθ − r = 0
where R is 1 by k , and so
√  
d 0
n Rθ̂ − r → N (0, RG−1 Σ G−1 R0 ).

The studentized version can be formed by subtracting the mean and dividing by the stan-
dard deviation,
√  
n Rθ̂ − r d
t =p 0 0
→ N (0, 1). (2.75)
RG Σ (G ) R
−1 −1

and the test statistic can be compared to the critical values from a standard normal to con-
duct a hypothesis test. t -tests have an important advantage over the broader class of Wald
tests – they can be used to test one-sided null hypotheses. A one-sided hypothesis takes
the form H0 : Rθ ≥ r or H0 : Rθ ≤ r which are contrasted with one-sided alternatives of
H1 : Rθ < r or H1 : Rθ > r , respectively. When using a one-sided test, rejection occurs
when R − r is statistically different from zero and when Rθ < r or Rθ > r as specified by
the alternative.
t -tests are also used in commonly encountered test statistic, the t -stat, a test of the null
that a parameter is 0 against an alternative that it is not. The t -stat is popular because most
models are written in such a way that if a parameter θ = 0 then it will have no impact.

Definition 2.35 (t -stat). The t -stat of a parameter θ j is the t -test value of the null H0 : θ j =
0 against a two-sided alternative H1 : θ j 6= 0.

θ̂ j
t -stat ≡ (2.76)
σθ̂
where
2.5 Hypothesis Testing 113

s
e j G−1 Σ (G−1 )0 e0j
σθ̂ = (2.77)
n
and where e j is a vector of 0s with 1 in the jth position.
Note that the t -stat is identical to the expression in eq. (2.75) when R = e j and r = 0.
R = e j corresponds to a hypothesis test involving only element j of θ and r = 0 indicates
that the null is θ j = 0.
A closely related measure is the standard error of a parameter. Standard errors are es-
sentially standard deviations – square-roots of variance – except that the expression “stan-
dard error” is applied when describing the estimation error of a parameter while “standard
deviation” is used when describing the variation in the data or population.

Definition 2.36 (Standard Error). The standard error of a parameter θ is the square root of
the parameter’s variance,   q
s.e. θ̂ = σθ̂2 (2.78)

where 0
e j G−1 Σ G−1 e0j
σθ̂2 = (2.79)
n
and where e j is a vector of 0s with 1 in the jth position.

2.5.5 Likelihood Ratio Tests


Likelihood ratio tests examine how “likely” the data are under the null and the alternative.
If the hypothesis is valid then the data should be (approximately) equally likely under each.
The LR test statistic is defined as
    
L R = −2 l θ̃ ; y − l θ̂ ; y (2.80)

where θ̃ is defined

θ̃ = argmax l (θ ; y) (2.81)
θ

subject to Rθ − r = 0

and θ̂ is the unconstrained estimator,

θ̂ = argmax l (θ ; y). (2.82)


θ

d
Under the null H0 : Rθ − r = 0, the L R → χm 2
. The intuition behind the asymptotic
distribution of the LR can be seen in a second order Taylor expansion around parameters
114 Estimation, Inference and Hypothesis Testing

estimated under the null, θ̃ .

 0 ∂ l (y; θ̂ ) 1 √  0 1 ∂ 2 l (y; θ̂ ) √  
l (y; θ̃ ) = l (y; θ̂ ) + θ̃ − θ̂ + n θ̃ − θ̂ n θ̃ − θ̂ + R 3 (2.83)
∂θ 2 n ∂ θ∂ θ0

where R 3 is a remainder term that is vanishing as n → ∞. Since θ̂ is an unconstrained


estimator of θ 0 ,

∂ l (y; θ̂ )
=0
∂θ
and

!
  √  0 1 ∂ l (y; θ̂ )
2 √  
−2 l (y; θ̃ ) − l (y; θ̂ ) ≈ n θ̃ − θ̂ − n θ̃ − θ̂ (2.84)
n ∂ θ∂ θ0

Under some mild regularity conditions, when the MLE is correctly specified

1 ∂ 2 l (y; θ̂ ) p ∂ l (y; θ 0 )
 2 
− → −E = I,
n ∂ θ∂ θ0 ∂ θ∂ θ0
and
√  
d
n θ̃ − θ̂ → N (0, I −1 ).

Thus,

√  0 1 ∂ 2 l (y; θ̂ ) √  
d
n θ̃ − θ̂ 0 n θ̂ − θ̂ → χm
2
(2.85)
n ∂ θ∂ θ
 
d
and so 2 l (y; θ̂ ) − l (y; θ̂ ) → χm 2
. The only difficultly remaining is that the distribution of
this quadratic form is a χm 2
an not a χk2 since k is the dimension of the parameter vector.
While formally establishing this is tedious, the intuition follows from the number of restric-
tions. If θ̃ were unrestricted then it must be the case that θ̃ = θ̂ since θ̂ is defined as the
unrestricted estimators. Applying a single restriction leave k − 1 free parameters in θ̃ and
thus it should be close to θ̂ except for this one restriction.
When models are correctly specified LR tests are very powerful against point alterna-
tives (e.g. H0 : θ = θ 0 against H1 : θ = θ 1 ). Another important advantage of the LR is
that the covariance of the parameters does not need to be estimated. In many problems
accurate parameter covariances may be difficult to estimate, and imprecise covariance es-
timators have negative consequence for test statistics, such as size distortions where a 5%
test will reject substantially more than 5% of the time when the null is true.
It is also important to note that the likelihood ratio does not have an asymptotic χm 2
2.5 Hypothesis Testing 115

when the assumed likelihood f (y; θ ) is misspecified. When this occurs the information
matrix equality fails to hold and the asymptotic distribution of the LR is known as a mixture
of χ 2 distribution. In practice, the assumed error distribution is often misspecified and so
it is important that the distributional assumptions used to estimate θ are verified prior to
using likelihood ratio tests.
Likelihood ratio tests are not available for method of moments estimators since no dis-
tribution function is assumed.24

2.5.6 Lagrange Multiplier, Score and Rao Tests


Lagrange Multiplier (LM), Score and Rao test are all the same statistic. While Lagrange
Multiplier test may be the most appropriate description, describing the tests as score tests
illustrates the simplicity of the test’s construction. Score tests exploit the first order condi-
tion to test whether a null hypothesis is compatible with the data. Using the unconstrained
estimator of θ , θ̂ , the scores must be zero,

∂ l (θ ; y)

= 0. (2.86)
∂ θ θ =θ̂
The score test examines whether the scores are “close” to zero – in a statistically mean-
ingful way – when evaluated using the parameters estimated subject to the null restriction,
θ̃ . Define
  ∂ l (θ ; y )
i i
si θ̃ = (2.87)
∂θ
θ =θ̃

as the ith score, evaluated at the restricted estimator. If the null hypothesis is true, then
²⁴It is possible to construct a likelihood ratio-type statistic for method of moments estimators. Define

g_n(θ) = n⁻¹ Σ_{i=1}^n g_i(θ)

to be the average moment conditions evaluated at a parameter θ. The likelihood ratio-type statistic for
method of moments estimators is defined as

LM = n g_n(θ̃)′ Σ̂⁻¹ g_n(θ̃) − n g_n(θ̂)′ Σ̂⁻¹ g_n(θ̂) = n g_n(θ̃)′ Σ̂⁻¹ g_n(θ̃)

where the simplification is possible since g_n(θ̂) = 0 and where

Σ̂ = n⁻¹ Σ_{i=1}^n g_i(θ̂) g_i(θ̂)′

is the sample covariance of the moment conditions evaluated at the unrestricted parameter estimates. This
test statistic only differs from the LM test statistic in eq. (2.90) via the choice of the covariance estimator, and
it should be similar in performance to the adjusted LM test statistic in eq. (2.92).
√n (n⁻¹ Σ_{i=1}^n s_i(θ̃)) →d N(0, Σ).    (2.88)

This forms the basis of the score test, which is computed as

LM = n s̄(θ̃)′ Σ⁻¹ s̄(θ̃)    (2.89)

where s̄(θ̃) = n⁻¹ Σ_{i=1}^n s_i(θ̃). While this version is not feasible since it depends on Σ, the
standard practice is to replace Σ with a consistent estimator and to compute the feasible
score test,

LM = n s̄(θ̃)′ Σ̂⁻¹ s̄(θ̃)    (2.90)

where the estimator of Σ depends on the assumptions made about the scores. In the case
where the scores are i.i.d. (usually because the data are i.i.d.),
Σ̂ = n⁻¹ Σ_{i=1}^n s_i(θ̃) s_i(θ̃)′    (2.91)

is a consistent estimator since E[s_i(θ̃)] = 0 if the null is true. In practice a more powerful
version of the LM test can be formed by subtracting the mean from the covariance
estimator and using

Σ̃ = n⁻¹ Σ_{i=1}^n (s_i(θ̃) − s̄(θ̃))(s_i(θ̃) − s̄(θ̃))′    (2.92)
i =1

which must be smaller (in the matrix sense) than Σ̂, although asymptotically, if the null is
true, these two estimators will converge to the same limit. Like the Wald and the LR, the LM
follows an asymptotic χ²_m distribution, and an LM test statistic will be rejected if LM > C_α
where C_α is the 1 − α quantile of a χ²_m distribution.
Score tests can be used with method of moments estimators by simply replacing the
score of the likelihood with the moment conditions evaluated at the restricted parameter,

s_i(θ̃) = g_i(θ̃),

and then evaluating eq. (2.90) or (2.92).
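The construction of the feasible score test translates directly into code. The following is a minimal Python/NumPy sketch, assuming scores is an n by k array whose ith row contains s_i(θ̃), the score of observation i evaluated at the restricted estimate; the resulting statistic is compared with the 1 − α quantile of a χ²_m distribution, where m is the number of restrictions (e.g. scipy.stats.chi2.ppf(0.95, m)).

import numpy as np

def lm_test(scores, demean=False):
    """Score (LM) test from an n-by-k array of scores evaluated at the
    restricted estimate. demean=True uses the adjusted covariance in
    eq. (2.92); demean=False uses the estimator in eq. (2.91)."""
    n, k = scores.shape
    s_bar = scores.mean(axis=0)                      # average score, a k-vector
    dev = scores - s_bar if demean else scores       # optionally remove the mean
    S_hat = dev.T @ dev / n                          # covariance estimator of the scores
    return n * s_bar @ np.linalg.solve(S_hat, s_bar) # n * s_bar' S^{-1} s_bar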

2.5.7 Comparing and Choosing the Tests

All three of the classic tests, the Wald, likelihood ratio and Lagrange multiplier have the
same limiting asymptotic distribution. In addition to all being asymptotically distributed
as a χ²_m, they are all asymptotically equivalent in the sense that they all have an identical
asymptotic distribution – in other words, the χ²_m that each limits to is the same. As a result, there
is no asymptotic argument that one should be favored over the others.
The simplest justifications for choosing one over the others are practical considerations.
Wald tests require estimation under the alternative – the unrestricted model – and require an
estimate of the asymptotic covariance of the parameters. LM tests require estimation under
the null – the restricted model – and require an estimate of the asymptotic covariance
of the scores evaluated at the restricted parameters. LR tests require both forms to be estimated
but do not require any covariance estimates. On the other hand, Wald and LM tests
can easily be made robust to many forms of misspecification by using the “sandwich” covariance
estimator, G⁻¹Σ(G′)⁻¹ for moment-based estimators or I⁻¹J I⁻¹ for QML estimators.
LR tests cannot be easily corrected and instead will have a non-standard distribution.
Models which are substantially easier to estimate under the null or alternative lead to
a natural choice. If a model is easy to estimate in its restricted form but not in its unrestricted form,
LM tests are good choices. If estimation under the alternative is simpler, then Wald tests
are reasonable. If they are equally simple to estimate, and the distributional assumptions
used in ML estimation are plausible, LR tests are likely the best choice. Empirically a re-
lationship exists where W ≈ L R ≥ L M . LM is often smaller, and hence less likely to
reject the null, since it estimates the covariance of the scores under the null. When the null
may be restrictive, the scores will generally have higher variances when evaluated using
the restricted parameters. The larger variances will lower the value of L M since the score
covariance is inverted in the statistic. A simple method to correct this is to use the adjusted
LM test based on the modified covariance estimator in eq. (2.92).

2.6 The Bootstrap and Monte Carlo

The bootstrap is an alternative technique for estimating parameter covariances and con-
ducting inference. The name bootstrap is derived from the expression “to pick yourself up
by your bootstraps” – a seemingly impossible task. The bootstrap, when initially proposed,
was treated as an equally impossible feat, although it is now widely accepted as a valid, and
in some cases, preferred method to plug-in type covariance estimation. The bootstrap is a
simulation technique and is similar to Monte Carlo. However, unlike Monte Carlo, which
requires a complete data-generating process, the bootstrap makes use of the observed data
to simulate the data – hence the similarity to the original turn-of-phrase.
Monte Carlo is an integration technique that uses simulation to approximate the underlying
distribution of the data. Suppose Y_i ~ i.i.d. F(θ) where F is some distribution, and
that interest is in E[g(Y)]. Further suppose it is possible to simulate from F(θ) so that
a sample {y_i} can be constructed. Then

n⁻¹ Σ_{i=1}^n g(y_i) →p E[g(Y)]
as long as this expectation exists since the simulated data are i.i.d. by construction.
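As a small illustration, the sketch below approximates E[g(Y)] for Y ~ N(0, 1) and g(y) = y² by simulation; the Monte Carlo average converges to the true value of 1 as the number of simulated draws grows. The choice of distribution and g is only for illustration.

import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    y = rng.standard_normal(n)        # i.i.d. draws from F = N(0, 1)
    print(n, np.mean(y ** 2))         # n^{-1} sum g(y_i) -> E[Y^2] = 1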
The observed data can be used to compute the empirical cdf.

Definition 2.37 (Empirical CDF). The empirical cdf is defined as

F̂(c) = n⁻¹ Σ_{i=1}^n I_{[y_i < c]}.

As long as F̂ is close to F, the empirical cdf can be used to simulate random variables
which should be approximately distributed F, and simulated data from the empirical
cdf should have similar statistical properties (mean, variance, etc.) as data simulated from
the true population cdf. The empirical cdf is a coarse step function, and so only values
which have been observed can be simulated; simulating from the empirical cdf of
the data is therefore identical to re-sampling the original data. In other words, the observed data can
be directly used to simulate from the underlying (unknown) cdf.
Figure 2.6 shows the population cdf for a standard normal and two empirical cdfs, one
estimated using n = 20 observations and the other using n = 1, 000. The coarse empirical
cdf highlights the stair-like features of the empirical cdf estimate which restrict random
numbers generated using the empirical cdf to coincide with the data used to compute the
empirical cdf.
The bootstrap can be used for a variety of purposes. The most common is to estimate
the covariance matrix of estimated parameters. This is an alternative to the usual plug-in
type estimator, and is simple to implement when the estimator is available in closed form.

Algorithm 2.38 (i.i.d. Nonparametric Bootstrap Covariance).

1. Generate a set of n uniform integers {j_i}_{i=1}^n on [1, 2, . . . , n].

2. Construct a simulated sample {y_{j_i}}.

3. Estimate the parameters of interest using {y_{j_i}}, and denote the estimate θ̃_b.

4. Repeat steps 1 through 3 a total of B times.

5. Estimate the variance of θ̂ using

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̂)(θ̃_b − θ̂)′

or alternatively

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̄)(θ̃_b − θ̄)′

where θ̄ = B⁻¹ Σ_{b=1}^B θ̃_b is the average of the bootstrap estimates.
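A minimal Python/NumPy sketch of Algorithm 2.38 is given below; the estimator used here (the method of moments estimators of the mean and variance) is only an example, and any closed-form estimator can be substituted.

import numpy as np

def mom_estimator(y):
    """Method of moments estimators of the mean and variance."""
    mu = y.mean()
    return np.array([mu, np.mean((y - mu) ** 2)])

def bootstrap_cov(y, estimator, B=10_000, seed=0):
    """i.i.d. nonparametric bootstrap estimate of the covariance of estimator(y)."""
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    theta_hat = estimator(y)
    theta_b = np.empty((B, theta_hat.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # step 1: uniform integers on 1, ..., n
        theta_b[b] = estimator(y[idx])       # steps 2-3: re-estimate on the resample
    dev = theta_b - theta_hat                # step 5: deviations from the full-sample estimate
    return dev.T @ dev / B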
[Figure 2.6: Standard Normal CDF and Empirical CDFs for n = 20 and 1,000 — plot of the normal cdf and the two empirical cdfs, F̂(X) against X]

Figure 2.6: These three lines represent the population cdf of a standard normal, and two
empirical cdfs constructed from simulated data. The very coarse empirical cdf is based on
20 observations and clearly highlights the step-nature of empirical cdfs. The other empirical
cdf, which is based on 1,000 observations, appears smoother but is still a step function.

The variance estimator that comes from this algorithm cannot be directly compared
to the asymptotic covariance estimator since the bootstrap covariance is converging to 0.

Normalizing the bootstrap covariance estimate by n will allow comparisons and direct
application of the test statistics based on the asymptotic covariance. Note that when using
a conditional model, the vector [y_i x_i′]′ should be jointly bootstrapped. Aside from this small
modification to step 2, the remainder of the procedure remains valid.
The nonparametric bootstrap is closely related to the residual bootstrap, at least when
it is possible to appropriately define a residual. For example, when Y_i|X_i ~ N(β′x_i, σ²),
the residual can be defined as ε̂_i = y_i − β̂′x_i. Alternatively, if Y_i|X_i ~ Scaled-χ²_ν exp(β′x_i),
then ε̂_i = y_i / √(β̂′x_i). The residual bootstrap can be used whenever it is possible to express
y_i = g(θ, ε_i, x_i) for some known function g.
Algorithm 2.39 (i.i.d. Residual Bootstrap Covariance).
1. Generate a set of n uniform integers {j_i}_{i=1}^n on [1, 2, . . . , n].

2. Construct a simulated sample {ε̂_{j_i}, x_{j_i}} and define ỹ_i = g(θ̂, ε̃_i, x̃_i) where ε̃_i = ε̂_{j_i}
and x̃_i = x_{j_i}.²⁵

3. Estimate the parameters of interest using {ỹ_i, x̃_i}, and denote the estimate θ̃_b.

4. Repeat steps 1 through 3 a total of B times.

5. Estimate the variance of θ̂ using

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̂)(θ̃_b − θ̂)′

or alternatively

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̄)(θ̃_b − θ̄)′

where θ̄ = B⁻¹ Σ_{b=1}^B θ̃_b is the average of the bootstrap estimates.

It is important to emphasize that the bootstrap is not, generally, a better estimator of


parameter covariance than standard plug-in estimators.26 Asymptotically both are con-
sistent and can be used equivalently. Additionally, i.i.d. bootstraps can only be applied to
(conditionally) i.i.d. data and using an inappropriate bootstrap will produce an inconsis-
tent estimator. When data have dependence it is necessary to use an alternative bootstrap
scheme.
When the interest lies in confidence intervals, an alternative procedure that directly
uses the empirical quantiles of the bootstrap parameter estimates can be constructed (known
as the percentile method).

Algorithm 2.40 (i.i.d. Nonparametric Bootstrap Confidence Interval).

1. Generate a set of n uniform integers {j_i}_{i=1}^n on [1, 2, . . . , n].

2. Construct a simulated sample {y_{j_i}}.

3. Estimate the parameters of interest using {y_{j_i}}, and denote the estimate θ̃_b.

4. Repeat steps 1 through 3 a total of B times.

5. Estimate the 1 − α confidence interval of θ̂_k using

[ q_{α/2}{θ̃_k}, q_{1−α/2}{θ̃_k} ]

25
In some models, it is possible to use independent indices on ε̂ and x, such as in a linear regression when
the data are conditionally homoskedastic (See chapter 3). In general it is not possible to explicitly break the
link between εi and xi , and so these should usually be resampled using the same indices.
26
There are some problem-dependent bootstraps that are more accurate than plug-in estimators in an
asymptotic sense. These are rarely encountered in financial economic applications.
where q_α{θ̃_k} is the empirical α quantile of the bootstrap estimates. 1-sided lower
confidence intervals can be constructed as

[ R̲(θ_k), q_{1−α}{θ̃_k} ]

and 1-sided upper confidence intervals can be constructed as

[ q_α{θ̃_k}, R̄(θ_k) ]

where R̲(θ_k) and R̄(θ_k) are the lower and upper extremes of the range of θ_k (possibly ±∞).
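Given the array of bootstrap estimates produced by the covariance sketch above, the percentile method is a single call to an empirical quantile function. A minimal sketch, assuming theta_b is a B by p array of bootstrap estimates and k indexes the parameter of interest:

import numpy as np

def percentile_ci(theta_b, alpha=0.05, k=0):
    """Two-sided 1-alpha percentile confidence interval for parameter k."""
    lower = np.quantile(theta_b[:, k], alpha / 2)
    upper = np.quantile(theta_b[:, k], 1 - alpha / 2)
    return lower, upper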

The percentile method can also be used directly to compute P-values of test statistics.
This requires enforcing the null hypothesis on the data and so is somewhat more involved.
For example, suppose the null hypothesis is E [yi ] = 0. This can be enforced by replacing
the original data with ỹi = yi − ȳ in step 2 of the algorithm.

Algorithm 2.41 (i.i.d. Nonparametric Bootstrap P-value).

1. Generate a set of n uniform integers {j_i}_{i=1}^n on [1, 2, . . . , n].

2. Construct a simulated sample using data where the null hypothesis is true, {ỹ_{j_i}}.

3. Compute the test statistic of interest using {ỹ_{j_i}}, and denote the statistic T(θ̃_b).

4. Repeat steps 1 through 3 a total of B times.

5. Compute the bootstrap P-value using

P̂-val = B⁻¹ Σ_{b=1}^B I_{[T(θ̂) ≤ T(θ̃_b)]}

for 1-sided tests where the rejection region is for large values (e.g. a Wald test). When
using 2-sided tests, compute the bootstrap P-value using

P̂-val = B⁻¹ Σ_{b=1}^B I_{[|T(θ̂)| ≤ |T(θ̃_b)|]}

The test statistic may depend on a covariance matrix. When this is the case, the co-
variance matrix is usually estimated from the bootstrapped data using a plug-in method.
Alternatively, it is possible to use any other consistent estimator (when the null is true) of
the asymptotic covariance, such as one based on an initial (separate) bootstrap.
When models are maximum likelihood based, so that a complete model for the data
is specified, it is possible to apply a parametric form of the bootstrap to estimate covariance
matrices. This procedure is virtually identical to standard Monte Carlo except that
the initial estimate θ̂ is used in the simulation.

Algorithm 2.42 (i.i.d. Parametric Bootstrap Covariance (Monte Carlo)).

1. Simulate a set of n i.i.d. draws {ỹ_i} from F(θ̂).

2. Estimate the parameters of interest using {ỹ_i}, and denote the estimates θ̃_b.

3. Repeat steps 1 through 2 a total of B times.

4. Estimate the variance of θ̂ using

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̂)(θ̃_b − θ̂)′

or alternatively

V̂[θ̂] = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̄)(θ̃_b − θ̄)′

where θ̄ = B⁻¹ Σ_{b=1}^B θ̃_b.

When models use conditional maximum likelihood, it is possible to use the parametric
bootstrap as part of a two-step procedure. First, apply a nonparametric bootstrap to the
conditioning data {x_i}, and then, using the bootstrapped conditioning data, simulate
Y_i ~ F(θ̂ | x̃_i). This is closely related to the residual bootstrap, only the assumed parametric
distribution F is used in place of the data-derived residuals.

2.7 Inference on Financial Data


Inference will be covered in greater detail in conjunction with specific estimators and mod-
els, such as linear regression or ARCH models. These examples examine relatively simple
hypotheses to illustrate the steps required in testing hypotheses.

2.7.1 Testing the Market Premium


Testing the market premium is a cottage industry. While current research is more interested
in predicting the market premium, testing whether the market premium is significantly
different from zero is a natural application of the tools introduced in this chapter. Let λ
denote the market premium and let σ2 be the variance of the return. Since the market is a
traded asset it must be the case that the premium for holding market risk is the same as the
mean of the market return. Monthly data for the Value Weighted Market (V W M ) and the
risk-free rate (R f ) was available between January 1927 and June 2008. Data for the V W M
was drawn from CRSP and data for the risk-free rate was available from Ken French’s data
library. Excess returns on the market are defined as the return to holding the market minus
the risk free rate, V W M ie = V W M i − R fi . The excess returns along with a kernel density
plot are presented in figure 2.7. Excess returns are both negatively skewed and heavy tailed
– October 1987 is 5 standard deviations from the mean.
The mean and variance can be computed using the method of moments as detailed in
section 2.1.4, and the covariance of the mean and the variance can be computed using the
estimators described in section 2.4.1. The estimates were calculated according to
" #  −1 Pn e

λ̂ n i =1 V W M i
=  −1 Pn  2 
σ̂ 2
n i =1 V W M i
e
− λ̂

and, defining ε̂i = V W M ie − λ̂, the covariance of the moment conditions was estimated
by
" Pn Pn  #
i =1 ε̂i ε̂ ε̂ σ̂
2 2 2
i −
Σ̂ = n −1 Pn Pi =1
n
i
2 .
i =1 ε̂i ε̂i − σ̂ ε̂ σ̂2
2 2 2

i =1 i −

Since the plim of the Jacobian is −I₂, the parameter covariance is also Σ̂. Combining
these two results with a Central Limit Theorem (assumed to hold), the asymptotic distribution is

√n (θ̂ − θ) →d N(0, Σ)

where θ = [λ, σ²]′. These produce the results in the first two rows of table 2.3.

These estimates can also be used to make inference on the standard deviation, σ = √σ²,
and the Sharpe ratio, S = λ/σ. The derivation of the asymptotic distribution of the Sharpe
ratio was presented in 2.4.4.1 and the asymptotic distribution of the standard deviation can
be determined in a similar manner where d(θ) = √σ² and so

D(θ) = ∂d(θ)/∂θ′ = [ 0   1/(2√σ²) ].

Combining this expression with the asymptotic distribution for the estimated mean and
variance, the asymptotic distribution of the standard deviation estimate is

√n (σ̂ − σ) →d N(0, (µ₄ − σ⁴)/(4σ²))

which was computed by dividing the [2,2] element of the parameter covariance by 4σ̂².
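The method of moments estimates and the delta method standard errors in this example can be computed with a short routine. The following is an illustrative sketch, assuming excess is a NumPy vector of excess market returns; the derivative vectors for σ and the Sharpe ratio follow from d(θ) = √σ² and d(θ) = λ/√σ².

import numpy as np

def market_premium_stats(excess):
    n = excess.shape[0]
    lam = excess.mean()                                   # market premium estimate
    eps = excess - lam
    sigma2 = np.mean(eps ** 2)                            # variance of excess returns
    # covariance of the moment conditions; since the Jacobian is -I2 this is
    # also the asymptotic covariance of the mean and variance estimators
    g = np.column_stack([eps, eps ** 2 - sigma2])
    Sigma = g.T @ g / n
    D_sigma = np.array([0.0, 0.5 / np.sqrt(sigma2)])                  # d(theta) = sqrt(sigma2)
    D_sharpe = np.array([1 / np.sqrt(sigma2), -0.5 * lam / sigma2 ** 1.5])  # d(theta) = lam/sigma
    se = lambda D: np.sqrt(D @ Sigma @ D / n)
    return {"lambda": (lam, np.sqrt(Sigma[0, 0] / n)),
            "sigma2": (sigma2, np.sqrt(Sigma[1, 1] / n)),
            "sigma": (np.sqrt(sigma2), se(D_sigma)),
            "sharpe": (lam / np.sqrt(sigma2), se(D_sharpe))}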
2.7.1.1 Bootstrap Implementation

The bootstrap can be used to estimate parameter covariance, construct confidence inter-
vals – either used the estimated covariance or the percentile method, and to tabulate the
P-value of a test statistic. Estimating the parameter covariance is simple – the data is re-
sampled to create a simulated sample with n observations and the mean and variance are
estimated. This is repeated 10,000 times and the parameter covariance is estimated using

Σ̂ = B⁻¹ Σ_{b=1}^B ( [µ̃_b ; σ̃²_b] − [µ̂ ; σ̂²] ) ( [µ̃_b ; σ̃²_b] − [µ̂ ; σ̂²] )′
  = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̂)(θ̃_b − θ̂)′.

The percentile method can be used to construct confidence intervals for the parame-
ters as estimated and for functions of parameters such as the Sharpe ratio. Constructing
the confidence intervals for a function of the parameters requires constructing the function
of the estimated parameters using each simulated sample and then computing the confi-
dence interval using the empirical quantile of these estimates. Finally, the test P-value for
the statistic for the null H0 : λ = 0 can be computed directly by transforming the returns
so that they have mean 0 using r̃_i = r_i − r̄. The P-value can be tabulated using

P̂-val = B⁻¹ Σ_{b=1}^B I_{[r̄ ≤ r̃_b]}

where r̃_b is the average of the transformed returns in bootstrap replication b. Table 2.4 contains the bootstrap
standard errors, confidence intervals based on the percentile method and the bootstrap
P-value for testing whether the mean return is 0. The standard errors are virtually identical
to those estimated using the plug-in method, and the confidence intervals are similar to
θ̂_k ± 1.96 s.e.(θ̂_k). The null that the average return is 0 is also strongly rejected.
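The bootstrap P-value for H₀: λ = 0 can be tabulated in a few lines. A minimal sketch, again assuming excess holds the excess market returns:

import numpy as np

def bootstrap_pvalue(excess, B=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = excess.shape[0]
    r_bar = excess.mean()
    demeaned = excess - r_bar                    # impose the null of a zero mean
    count = 0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        count += r_bar <= demeaned[idx].mean()   # I[r_bar <= bootstrap average]
    return count / B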

2.7.2 Is the NASDAQ Riskier than the S&P 100?


A second application examines the riskiness of the NASDAQ and the S&P 100. Both of these
indices are value-weighted and contain 100 companies. The NASDAQ 100 contains only
companies that trade on the NASDAQ while the S&P 100 contains large companies that
trade on either the NYSE or the NASDAQ.
The null hypothesis is that the variances are the same, H₀: σ²_SP = σ²_ND, and the alternative
is that the variance of the NASDAQ is larger, H₁: σ²_ND > σ²_SP.²⁷ The null and alternative
can be reformulated as a test that δ = σ²_ND − σ²_SP is equal to zero against an alternative that

²⁷It may also be interesting to test against a two-sided alternative that the variances are unequal, H₁: σ²_ND ≠ σ²_SP.
Parameter   Estimate   Standard Error   t-stat
λ           0.627      0.173            3.613
σ²          29.41      2.957            9.946
σ           5.423      0.545            9.946
λ/σ         0.116      0.032            3.600

Table 2.3: Parameter estimates and standard errors for the market premium (λ), the variance
of the excess return (σ²), the standard deviation of the excess return (σ) and the
Sharpe ratio (λ/σ). Estimates and variances were computed using the method of moments.
The standard errors for σ and λ/σ were computed using the delta method.

                                          Bootstrap Confidence Interval
Parameter   Estimate   Standard Error   Lower    Upper
λ           0.627      0.174            0.284    0.961
σ²          29.41      2.964            24.04    35.70
σ           5.423      0.547            4.903    5.975
λ/σ         0.116      0.032            0.052    0.179

H₀: λ = 0
P-value   3.00 × 10⁻⁴

Table 2.4: Parameter estimates, bootstrap standard errors and confidence intervals (based
on the percentile method) for the market premium (λ), the variance of the excess return
(σ²), the standard deviation of the excess return (σ) and the Sharpe ratio (λ/σ). Estimates
were computed using the method of moments. The standard errors for σ and λ/σ were
computed using the delta method using the bootstrap covariance estimator.

it is greater than zero. The estimation of the parameters can be formulated as a method of
moments problem,

[ µ̂_SP ; σ̂²_SP ; µ̂_ND ; σ̂²_ND ] = n⁻¹ Σ_{i=1}^n [ r_{SP,i} ; (r_{SP,i} − µ̂_SP)² ; r_{ND,i} ; (r_{ND,i} − µ̂_ND)² ]

Inference can be performed by forming the moment vector using the estimated parame-
ters, gi ,
[Figure 2.7: CRSP Value Weighted Market (VWM) Excess Returns — top panel: CRSP VWM excess returns over time; bottom panel: CRSP VWM excess return density]

Figure 2.7: These two plots contain the returns on the VWM (top panel) in excess of the
risk free rate and a kernel estimate of the density (bottom panel). While the mode of the
density (highest peak) appears to be clearly positive, excess returns exhibit strong negative
skew and are heavy tailed.

 
g_i = [ r_{SP,i} − µ_SP ; (r_{SP,i} − µ_SP)² − σ²_SP ; r_{ND,i} − µ_ND ; (r_{ND,i} − µ_ND)² − σ²_ND ]

and recalling that the asymptotic distribution is given by

√n (θ̂ − θ) →d N(0, G⁻¹ Σ (G′)⁻¹).

Using the set of moment conditions,


Daily Data
Parameter   Estimate   Std. Error/Correlation
µ_SP        9.06        3.462  -0.274   0.767  -0.093
σ_SP        17.32      -0.274   0.709  -0.135   0.528
µ_ND        9.73        0.767  -0.135   4.246  -0.074
σ_ND        21.24      -0.093   0.528  -0.074   0.443

Test Statistics
δ   0.60    σ̂_δ   0.09    t-stat   6.98

Monthly Data
Parameter   Estimate   Std. Error/Correlation
µ_SP        8.61        3.022  -0.387   0.825  -0.410
σ_SP        15.11      -0.387   1.029  -0.387   0.773
µ_ND        9.06        0.825  -0.387   4.608  -0.418
σ_ND        23.04      -0.410   0.773  -0.418   1.527

Test Statistics
δ   25.22   σ̂_δ   4.20    t-stat   6.01

Table 2.5: Estimates, standard errors and correlation matrices for the S&P 100 and NASDAQ
100. The top panel uses daily return data between January 3, 1983 and December 31,
2007 (6,307 days) to estimate the parameter values in the leftmost column. The rightmost
4 columns contain the parameter standard errors (diagonal elements) and the parameter
correlations (off-diagonal elements). The bottom panel contains estimates, standard errors
and correlations from monthly data between January 1983 and December 2007 (300
months). Parameter and covariance estimates have been annualized. The test statistics
(and related quantities) were performed and reported on the original (non-annualized)
values.

 
G = plim_{n→∞} n⁻¹ Σ_{i=1}^n [ −1                     0     0                      0
                               −2(r_{SP,i} − µ_SP)   −1     0                      0
                                0                     0    −1                      0
                                0                     0    −2(r_{ND,i} − µ_ND)    −1 ]
  = −I₄.

Σ can be estimated using the moment conditions evaluated at the estimated parameters, g_i(θ̂),

Σ̂ = n⁻¹ Σ_{i=1}^n g_i(θ̂) g_i(θ̂)′.

Noting that the (2,2) element of Σ is the variance of σ̂S2 P , the (4,4) element of Σ is the vari-
ance of σ̂N2 D and the (2,4) element is the covariance of the two, the variance of δ̂ = σ̂N2 D −
σ̂S2 P can be computed as the sum of the variances minus two times the covariance, Σ[2,2] +
Σ[4,4] − 2Σ[2,4] . Finally a one-sided t -test can be performed to test the null.
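The whole procedure fits in a few lines of code. The following is an illustrative sketch, assuming r_sp and r_nd are equal-length vectors of (non-annualized) returns on the two indices:

import numpy as np
from scipy import stats

def variance_difference_test(r_sp, r_nd):
    """Test of delta = sigma2_ND - sigma2_SP = 0 against a larger NASDAQ variance."""
    n = r_sp.shape[0]
    e_sp, e_nd = r_sp - r_sp.mean(), r_nd - r_nd.mean()
    s2_sp, s2_nd = np.mean(e_sp ** 2), np.mean(e_nd ** 2)
    # moment conditions evaluated at the estimates (n by 4)
    g = np.column_stack([e_sp, e_sp ** 2 - s2_sp, e_nd, e_nd ** 2 - s2_nd])
    Sigma = g.T @ g / n
    delta = s2_nd - s2_sp
    # Var(delta_hat): Sigma[2,2] + Sigma[4,4] - 2 Sigma[2,4] in 1-based indexing
    avar = Sigma[1, 1] + Sigma[3, 3] - 2 * Sigma[1, 3]
    t_stat = delta / np.sqrt(avar / n)
    pvalue = 1 - stats.norm.cdf(t_stat)      # one-sided alternative
    return delta, t_stat, pvalue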
Data was taken from Yahoo! finance between January 1983 and December 2008 at both
the daily and monthly frequencies. Parameter estimates are presented in table 2.5. The
table also contains the parameter standard errors – the square root of the asymptotic covariance
divided by the number of observations, √(Σ[i,i]/n) – along the diagonal and the
parameter correlations – Σ[i,j] / √(Σ[i,i] Σ[j,j]) – in the off-diagonal positions. The top panel
contains results for daily data while the bottom contains results for monthly data. In both
panels 100× returns were used.
All parameter estimates are reported in annualized form, which requires multiplying
daily (monthly) mean estimates by 252 (12), and daily (monthly) volatility estimates by
√252 (√12). Additionally, the delta method was used to adjust the standard errors on the
volatility estimates since the actual parameter estimates were the means and variances.
Thus, the reported parameter variance-covariance matrix has the form

D(θ̂) Σ̂ D(θ̂)′,   where D(θ̂) = diag( 252, √252/(2σ_SP), 252, √252/(2σ_ND) ).

In both cases δ is positive with a t-stat greater than 6, indicating a strong rejection of the
null in favor of the alternative. Since this was a one-sided test, the 95% critical value would
be 1.645 (Φ⁻¹(0.95)).
This test could also have been implemented using an LM test, which requires estimating
the two mean parameters but restricting the variances to be equal. Once θ̃ is estimated, the
LM test statistic is computed as

LM = n g_n(θ̃)′ Σ̂⁻¹ g_n(θ̃)

where

g_n(θ̃) = n⁻¹ Σ_{i=1}^n g_i(θ̃)

and where µ̃_SP = µ̂_SP, µ̃_ND = µ̂_ND (unchanged) and σ̃²_SP = σ̃²_ND = (σ̂²_SP + σ̂²_ND)/2.

Daily Data
Parameter   Estimate   Bootstrap Std. Error/Correlation
µ_SP        9.06        3.471  -0.276   0.767  -0.097
σ_SP        17.32      -0.276   0.705  -0.139   0.528
µ_ND        9.73        0.767  -0.139   4.244  -0.079
σ_ND        21.24      -0.097   0.528  -0.079   0.441

Monthly Data
Parameter   Estimate   Bootstrap Std. Error/Correlation
µ_SP        8.61        3.040  -0.386   0.833  -0.417
σ_SP        15.11      -0.386   1.024  -0.389   0.769
µ_ND        9.06        0.833  -0.389   4.604  -0.431
σ_ND        23.04      -0.417   0.769  -0.431   1.513

Table 2.6: Estimates and bootstrap standard errors and correlation matrices for the S&P
100 and NASDAQ 100. The top panel uses daily return data between January 3, 1983 and
December 31, 2007 (6,307 days) to estimate the parameter values in the leftmost column.
The rightmost 4 columns contain the bootstrap standard errors (diagonal elements) and
the correlations (off-diagonal elements). The bottom panel contains estimates, bootstrap
standard errors and correlations from monthly data between January 1983 and December
2007 (300 months). All parameter and covariance estimates have been annualized.

2.7.2.1 Bootstrap Covariance Estimation

The bootstrap is an alternative to the plug-in covariance estimators. The bootstrap was
implemented using 10,000 resamples where the data were assumed to be i.i.d.. In each
bootstrap resample, the full 4 by 1 vector of parameters was computed. These were com-
bined to estimate the parameter covariance using
Σ̂ = B⁻¹ Σ_{b=1}^B (θ̃_b − θ̂)(θ̃_b − θ̂)′.

Table 2.6 contains the bootstrap standard errors and correlations. Like the results in table 2.5, the
parameter estimates and covariance have been annualized, and volatility rather than variance
is reported. The covariance estimates are virtually indistinguishable from those computed
using the plug-in estimator. This highlights that the bootstrap is not (generally) a
better estimator, but is merely an alternative.²⁸

²⁸In this particular application, the bootstrap and the plug-in estimators are identical as B → ∞ for fixed
n. This is not generally the case.
2.7.3 Testing Factor Exposure

Suppose excess returns were conditionally normal with mean µi = β 0 xi and constant vari-
ance σ2 . This type of model is commonly used to explain cross-sectional variation in re-
turns, and when the conditioning variables include only the market variable, the model is
known as the Capital Asset Pricing Model (CAP-M, Lintner (1965), Sharpe (1964)). Multi-
factor models allow for additional conditioning variables such as the size and value factors
(Fama & French 1992, 1993, Ross 1976). The size factor is the return on a portfolio which is
long small cap stocks and short large cap stocks. The value factor is the return on a port-
folio that is long high book-to-market stocks (value) and short low book-to-market stocks
(growth).

This example estimates a 3 factor model where the conditional mean of excess returns
on individual assets is modeled as a linear function of the excess return to the market, the
size factor and the value factor. This leads to a model of the form

r_i − r^f_i = β₀ + β₁(r_{m,i} − r^f_i) + β₂ r_{s,i} + β₃ r_{v,i} + ε_i
r^e_i = β′x_i + ε_i

where r^f_i is the risk-free rate (short term government rate), r_{m,i} is the return to the market
portfolio, r_{s,i} is the return to the size portfolio and r_{v,i} is the return to the value portfolio. ε_i
is a residual which is assumed to have a N(0, σ²) distribution.


Factor models can be formulated as a conditional maximum likelihood problem,

l(r|X; θ) = −(1/2) Σ_{i=1}^n { ln(2π) + ln(σ²) + (r_i − β′x_i)²/σ² }

where θ = [β′ σ²]′. The MLE can be found using the first order conditions, which are


∂l(r; θ)/∂β = (1/σ̂²) Σ_{i=1}^n x_i(r_i − β̂′x_i) = 0
   ⇒ β̂ = ( Σ_{i=1}^n x_i x_i′ )⁻¹ Σ_{i=1}^n x_i r_i

∂l(r; θ)/∂σ² = −(1/2) Σ_{i=1}^n ( 1/σ̂² − (r_i − β̂′x_i)²/σ̂⁴ ) = 0
   ⇒ σ̂² = n⁻¹ Σ_{i=1}^n (r_i − β̂′x_i)²
The vector of scores is

∂l(r_i|x_i; θ)/∂θ = [ x_iε_i/σ² ; −1/(2σ²) + ε_i²/(2σ⁴) ]
                  = [ 1/σ²  0 ; 0  1/(2σ⁴) ] [ x_iε_i ; σ² − ε_i² ]
                  = S [ x_iε_i ; σ² − ε_i² ]

where ε_i = r_i − β′x_i. The second form will be used to simplify estimating the parameter
covariance. The Hessian is

∂²l(r_i|x_i; θ)/∂θ∂θ′ = [ −x_ix_i′/σ²   −x_iε_i/σ⁴ ; −x_i′ε_i/σ⁴   1/(2σ⁴) − ε_i²/σ⁶ ],

and the information matrix is

I = −E[ −x_ix_i′/σ²   −x_iε_i/σ⁴ ; −x_i′ε_i/σ⁴   1/(2σ⁴) − ε_i²/σ⁶ ]
  = [ E[x_ix_i′]/σ²   E[x_i E[ε_i|X]]/σ⁴ ; E[x_i′ E[ε_i|X]]/σ⁴   1/(2σ⁴) ]
  = [ E[x_ix_i′]/σ²   0 ; 0   1/(2σ⁴) ].

The covariance of the scores is

J = E[ S [ ε_i²x_ix_i′   σ²x_iε_i − x_iε_i³ ; σ²x_i′ε_i − x_i′ε_i³   (σ² − ε_i²)² ] S ]
  = S [ E[E[ε_i²|X] x_ix_i′]   E[σ²x_i E[ε_i|X] − x_i E[ε_i³|X]] ; E[σ²x_i′ E[ε_i|X] − x_i′ E[ε_i³|X]]   E[(σ² − ε_i²)²] ] S
  = S [ σ² E[x_ix_i′]   0 ; 0   2σ⁴ ] S
  = [ E[x_ix_i′]/σ²   0 ; 0   1/(2σ⁴) ].
The estimators of the covariance matrices are

Ĵ = n⁻¹ Σ_{i=1}^n [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ] [ x_iε̂_i ; σ̂² − ε̂_i² ] [ x_i′ε̂_i   σ̂² − ε̂_i² ] [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ]
  = n⁻¹ Σ_{i=1}^n [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ] [ ε̂_i²x_ix_i′   σ̂²x_iε̂_i − x_iε̂_i³ ; σ̂²x_i′ε̂_i − x_i′ε̂_i³   (σ̂² − ε̂_i²)² ] [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ]
and

Î = −1 × n⁻¹ Σ_{i=1}^n [ −x_ix_i′/σ̂²   −x_iε̂_i/σ̂⁴ ; −x_i′ε̂_i/σ̂⁴   1/(2σ̂⁴) − ε̂_i²/σ̂⁶ ]
  = −1 × n⁻¹ Σ_{i=1}^n [ −x_ix_i′/σ̂²   0 ; 0   1/(2σ̂⁴) − σ̂²/σ̂⁶ ]
  = −1 × n⁻¹ Σ_{i=1}^n [ −x_ix_i′/σ̂²   0 ; 0   −1/(2σ̂⁴) ]
  = n⁻¹ Σ_{i=1}^n [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ] [ x_ix_i′  0 ; 0  1 ]

Note that the off-diagonal term in Ĵ, σ̂²x_i′ε̂_i − x_i′ε̂_i³, is not necessarily 0 when the data may
be conditionally skewed. Combined, the QMLE parameter covariance estimator is then

Î⁻¹ Ĵ Î⁻¹ = ( n⁻¹ Σ_{i=1}^n [ x_ix_i′  0 ; 0  1 ] )⁻¹ [ n⁻¹ Σ_{i=1}^n [ ε̂_i²x_ix_i′   σ̂²x_iε̂_i − x_iε̂_i³ ; σ̂²x_i′ε̂_i − x_i′ε̂_i³   (σ̂² − ε̂_i²)² ] ] ( n⁻¹ Σ_{i=1}^n [ x_ix_i′  0 ; 0  1 ] )⁻¹

where the identical scaling terms have been canceled. Additionally, when returns are con-
ditionally normal,
plim Ĵ = plim n⁻¹ Σ_{i=1}^n [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ] [ ε̂_i²x_ix_i′   σ̂²x_iε̂_i − x_iε̂_i³ ; σ̂²x_i′ε̂_i − x_i′ε̂_i³   (σ̂² − ε̂_i²)² ] [ 1/σ̂²  0 ; 0  1/(2σ̂⁴) ]
       = [ 1/σ²  0 ; 0  1/(2σ⁴) ] [ σ² E[x_ix_i′]  0 ; 0  2σ⁴ ] [ 1/σ²  0 ; 0  1/(2σ⁴) ]
       = [ E[x_ix_i′]/σ²   0 ; 0   1/(2σ⁴) ]

and

plim Î = plim n⁻¹ Σ_{i=1}^n [ x_ix_i′/σ̂²   0 ; 0   1/(2σ̂⁴) ]
       = [ E[x_ix_i′]/σ²   0 ; 0   1/(2σ⁴) ],

and so the IME (information matrix equality), plim Ĵ − Î = 0, will hold when returns are conditionally
normal. Moreover, when returns are not normal, all of the terms in Ĵ will typically differ from the limits
above and so the IME will not generally hold.

2.7.3.1 Data and Implementation

Three assets are used to illustrate hypothesis testing: ExxonMobil (XOM), Google (GOOG)
and the SPDR Gold Trust ETF (GLD). The data used to construct the individual equity returns
were downloaded from Yahoo! Finance and span the period September 2, 2002 until
September 1, 2012.29 The market portfolio is the CRSP value-weighted market, which is
a composite based on all listed US equities. The size and value factors were constructed
using portfolio sorts and are made available by Ken French. All returns were scaled by 100.

2.7.3.2 Wald tests

Wald tests make use of the parameters and estimated covariance to assess the evidence
against the null. When testing whether the size and value factor are relevant for an asset,
the null is H0 : β2 = β3 = 0. This problem can be set up as a Wald test using
" # " #
0 0 1 0 0 0
R= ,r=
0 0 0 1 0 0

and  0 h i−1  
W = n Rθ̂ − r RÎ −1 J Î −1 R0 Rθ̂ − r .

The Wald test has an asymptotic χ22 distribution since the null imposes 2 restrictions.
t -stats can similarly be computed for individual parameters

√ β̂ j
tj = n  
s.e. β̂ j
 
where s.e. β̂ j is the square of the jth diagonal element of the parameter covariance matrix.
Table 2.7 contains the parameter estimates from the models, t-stats for the coefficients and
the Wald test statistics for the null H₀: β₂ = β₃ = 0. The t-stats and the Wald tests were
implemented using both the sandwich covariance estimator (QMLE) and the maximum
likelihood covariance estimator. The two sets of test statistics differ in magnitude since the
assumption of normality is violated in the data, and so only the QMLE-based test statistics
should be considered reliable.
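The following Python sketch illustrates the QMLE-based Wald test for this null; it is a minimal implementation assuming y holds the excess asset returns and X is an n by 4 regressor matrix with a constant in the first column followed by the excess market, size and value factors.

import numpy as np
from scipy import stats

def wald_size_value(y, X):
    """Wald test of H0: beta_2 = beta_3 = 0 using the sandwich (QMLE) covariance."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    eps = y - X @ beta
    sigma2 = np.mean(eps ** 2)
    # scores for theta = (beta', sigma2)'
    s_beta = X * eps[:, None] / sigma2
    s_sig = -0.5 / sigma2 + 0.5 * eps ** 2 / sigma2 ** 2
    scores = np.column_stack([s_beta, s_sig])
    J = scores.T @ scores / n                      # outer-product estimator
    info = np.zeros((k + 1, k + 1))                # information matrix estimator
    info[:k, :k] = X.T @ X / (n * sigma2)
    info[k, k] = 0.5 / sigma2 ** 2
    avar = np.linalg.inv(info) @ J @ np.linalg.inv(info)   # sandwich covariance
    R = np.zeros((2, k + 1))
    R[0, 2], R[1, 3] = 1.0, 1.0                    # select beta_2 and beta_3
    Rtheta = R @ np.append(beta, sigma2)
    W = n * Rtheta @ np.linalg.solve(R @ avar @ R.T, Rtheta)
    return W, 1 - stats.chi2.cdf(W, 2)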

2.7.3.3 Likelihood Ratio tests

Likelihood ratio tests are simple to implement when parameters are estimated using MLE.
The likelihood ratio test statistic is

LR = −2( l(r|X; θ̃) − l(r|X; θ̂) )

where θ̃ is the null-restricted estimator of the parameters. The likelihood ratio has an
asymptotic χ²₂ distribution since there are two restrictions. Table 2.7 contains the likelihood
ratio test statistics for the null H₀: β₂ = β₃ = 0. Caution is needed when interpreting
likelihood ratio test statistics since the asymptotic distribution is only valid when the
model is correctly specified – in this case, when returns are conditionally normal, which is
not plausible.

²⁹Google and the SPDR Gold Trust ETF both started trading after the initial sample date. In both cases, all
available data was used.

2.7.3.4 Lagrange Multiplier tests

Lagrange Multiplier tests are somewhat more involved in this problem. The key to com-
puting the LM test statistic is to estimate the score using the restricted parameters,
" #
1
σ2 i i
x ε̃
s̃i = ε̃2 ,
− 2σ̃1 2 + 2σ̃i 4

0
h 0 i0
where ε̃i = ri − β̃ xi and θ̃ = β̃ σ̃2 is the vector of parameters estimated when the null
is imposed. The LM test statistic is then

L M = n s̃S̃−1 s̃

where
n
X n
X
s̃ = n −1
s̃i , and S̃ = n −1
s̃i s̃0i .
i =1 i =1

The improved version of the LM can be computed by replacing S̃ with a covariance estima-
tor based on the scores from the unrestricted estimates,
n
X
Ŝ = n −1 ŝi ŝ0i .
i =1

Table 2.7 contains the LM test statistics for the null H0 : β2 = β3 = 0 using the two co-
variance estimators. LM test statistics are naturally robust to violations of the assumed
normality since Ŝ and S̃ are directly estimated from the scores and not based on properties
of the assumed normal distribution.

2.7.3.5 Discussion of Test Statistics

Table 2.7 contains all test statistics for the three series. The test statistics based on the MLE
and QMLE parameter covariances differ substantially in all three series, and importantly,
the conclusions also differ for the SPDR Gold Trust ETF. The difference between the two sets
of results arises since the assumption that returns are conditionally normal with constant
variance is not supported in the data. The MLE-based Wald test and the LR test (which is
implicitly MLE-based) have very similar magnitudes for all three series. The QMLE-based
Wald test statistics are also always larger than the LM-based test statistics which reflects
the difference of estimating the covariance under the null or under the alternative.
ExxonMobil
Parameter   Estimate    t (MLE)            t (QMLE)
β₀           0.016       0.774 (0.439)      0.774 (0.439)     Wald (MLE)    251.21 (<0.001)
β₁           0.991       60.36 (<0.001)     33.07 (<0.001)    Wald (QMLE)    88.00 (<0.001)
β₂          -0.536      −15.13 (<0.001)     −9.24 (<0.001)    LR            239.82 (<0.001)
β₃          -0.231       −6.09 (<0.001)     −3.90 (<0.001)    LM (S̃)         53.49 (<0.001)
                                                              LM (Ŝ)         54.63 (<0.001)

Google
Parameter   Estimate    t (MLE)            t (QMLE)
β₀           0.063       1.59 (0.112)       1.60 (0.111)      Wald (MLE)     18.80 (<0.001)
β₁           0.960       30.06 (<0.001)     23.74 (<0.001)    Wald (QMLE)    10.34 (0.006)
β₂          -0.034      −0.489 (0.625)     −0.433 (0.665)     LR             18.75 (<0.001)
β₃          -0.312       −4.34 (<0.001)     −3.21 (0.001)     LM (S̃)         10.27 (0.006)
                                                              LM (Ŝ)         10.32 (0.006)

SPDR Gold Trust ETF
Parameter   Estimate    t (MLE)            t (QMLE)
β₀           0.057       1.93 (0.054)       1.93 (0.054)      Wald (MLE)     12.76 (0.002)
β₁           0.130       5.46 (<0.001)      2.84 (0.004)      Wald (QMLE)     5.16 (0.076)
β₂          -0.037      −0.733 (0.464)     −0.407 (0.684)     LR             12.74 (0.002)
β₃          -0.191       −3.56 (<0.001)     −2.26 (0.024)     LM (S̃)          5.07 (0.079)
                                                              LM (Ŝ)          5.08 (0.079)

Table 2.7: Parameter estimates, t-statistics (both MLE and QMLE-based), and tests of the
exclusion restriction that the size and value factors have no effect (H₀: β₂ = β₃ = 0) on the
returns of the ExxonMobil, Google and SPDR Gold Trust ETF. P-values are reported in parentheses.

Exercises
Exercise 2.1. The distribution of a discrete random variable X depends on a discretely val-
ued parameter θ ∈ {1, 2, 3} according to
x    f(x|θ = 1)   f(x|θ = 2)   f(x|θ = 3)
1    1/2          1/3          0
2    1/3          1/4          0
3    1/6          1/3          1/6
4    0            1/12         1/12
5    0            0            3/4

Find the MLE of θ if one value from X has been observed. Note: The MLE is a function that
returns an estimate of θ given the data that has been observed. In the case where both the
observed data and the parameter are discrete, a “function” will take the form of a table.

Exercise 2.2. Let X 1 , . . . , X n be an i.i.d. sample from a gamma(α,β ) distribution. The den-
sity of a gamma(α,β ) is

f(x; α, β) = x^{α−1} exp(−x/β) / (Γ(α) β^α)

where Γ (z ) is the gamma-function evaluated at z . Find the MLE of β assuming α is known.

Exercise 2.3. Let X 1 , . . . , X n be an i.i.d. sample from the pdf

f(x|θ) = θ / x^{θ+1},   1 ≤ x < ∞,  θ > 1
i. What is the MLE of θ ?

ii. What is E[X j ]?

iii. How can the previous answer be used to compute a method of moments estimator of
θ?

Exercise 2.4. Let X 1 , . . . , X n be an i.i.d. sample from the pdf

f(x|θ) = 1/θ,   0 ≤ x ≤ θ,  θ > 0
i. What is the MLE of θ ? [This is tricky]

ii. What is the method of moments Estimator of θ ?

iii. Compute the bias and variance of each estimator.


138 Estimation, Inference and Hypothesis Testing

Exercise 2.5. Let X 1 , . . . , X n be an i.i.d. random sample from the pdf

f (x |θ ) = θ x θ −1 , 0 ≤ x ≤ 1, 0 < θ < ∞

i. What is the MLE of θ ?

ii. What is the variance of the MLE.

iii. Show that the MLE is consistent.

Exercise 2.6. Let X_1, . . . , X_n be an i.i.d. sample from a Bernoulli(p).

i. Show that X̄ achieves the Cramér-Rao lower bound.

ii. What do you conclude about using X̄ to estimate p ?

Exercise 2.7. Suppose you witness a coin being flipped 100 times with 56 heads and 44
tails. Is there evidence that this coin is unfair?

Exercise 2.8. Let X_1, . . . , X_n be an i.i.d. sample with mean µ and variance σ².

i. Show that X̃ = Σ_{i=1}^n w_i X_i is unbiased if and only if Σ_{i=1}^n w_i = 1.

ii. Show that the variance of X̃ is minimized if w_i = 1/n for i = 1, 2, . . . , n.

Exercise 2.9. Suppose {X_i} is an i.i.d. sequence of normal random variables with unknown mean µ
and known variance σ².

i. Derive the power function of a 2-sided t -test of the null H0 : µ = 0 against an alter-
native H1 : µ 6= 0? The power function should have two arguments, the mean under
the alternative, µ1 and the number of observations n.

ii. Sketch the power function for n = 1, 4, 16, 64, 100.

iii. What does this tell you about the power as n → ∞ for µ 6= 0?

Exercise 2.10. Let X_1 and X_2 be independent and drawn from a Uniform(θ, θ + 1) distribution
with θ unknown. Consider two test statistics,

T1 : Reject if X 1 > .95

and
T2 : Reject if X 1 + X 2 > C

i. What is the size of T1 ?

ii. What value must C take so that the size of T2 is equal to T1


2.7 Inference on Financial Data 139

iii. Sketch the power curves of the two tests as a function of θ . Which is more powerful?

Exercise 2.11. Suppose {yi } are a set of transaction counts (trade counts) over 5-minute
intervals which are believed to be i.i.d. distributed from a Poisson with parameter λ. Recall
the probability density function of a Poisson is

f(y_i; λ) = λ^{y_i} e^{−λ} / y_i!

i. What is the log-likelihood for this problem?

ii. What is the MLE of λ?

iii. What is the variance of the MLE?

iv. Suppose that λ̂ = 202.4 and that the sample size was 200. Construct a 95% confidence
interval for λ.

v. Use a t -test to test the null H0 : λ = 200 against H1 : λ 6= 200 with a size of 5%

vi. Use a likelihood ratio to test the same null with a size of 5%.

vii. What happens if the assumption of i.i.d. data is correct but that the data does not
follow a Poisson distribution?

Upper tail probabilities for a standard normal z
Cut-off c    Pr(z > c)
1.282        10%
1.645        5%
1.96         2.5%
2.32         1%

5% upper tail cut-offs for χ²_q
Degrees of Freedom q    Cut-Off
1                       3.84
2                       5.99
199                     232.9
200                     234.0
Chapter 3

Analysis of Cross-Sectional Data

Note: The primary reference text for these notes is Hayashi (2000). Other comprehensive
treatments are available in Greene (2007) and Davidson & MacKinnon (2003).

Linear regression is the foundation of modern econometrics. While the impor-


tance of linear regression in financial econometrics has diminished in recent
years, it is still widely employed. More importantly, the theory behind least
squares estimators is useful in broader contexts and many results of this chap-
ter are special cases of more general estimators presented in subsequent chap-
ters. This chapter covers model specification, estimation, inference, under both
the classical assumptions and using asymptotic analysis, and model selection.

Linear regression is the most basic tool of any econometrician and is widely used through-
out finance and economics. Linear regression’s success is owed to two key features: the
availability of simple, closed form estimators and the ease and directness of interpretation.
However, despite superficial simplicity, the concepts discussed in this chapter will reappear
in the chapters on time series, panel data, Generalized Method of Moments (GMM), event
studies and volatility modeling.

3.1 Model Description

Linear regression expresses a dependent variable as a linear function of independent vari-


ables, possibly random, and an error.

yi = β1 x1,i + β2 x2,i + . . . + βk xk ,i + εi , (3.1)

where yi is known as the regressand, dependent variable or simply the left-hand-side vari-
able. The k variables, x1,i , . . . , xk ,i are known as the regressors, independent variables or
right-hand-side variables. β1 , β2 , . . ., βk are the regression coefficients, εi is known as the
innovation, shock or error and i = 1, 2, . . . , n index the observation. While this representa-
tion clarifies the relationship between yi and the x s, matrix notation will generally be used
to compactly describe models:

[ y_1        [ x_11  x_12  ...  x_1k     [ β_1        [ ε_1
  y_2          x_21  x_22  ...  x_2k       β_2          ε_2
  ...     =    ...   ...        ...        ...     +    ...                    (3.2)
  y_n ]        x_n1  x_n2  ...  x_nk ]     β_k ]        ε_n ]

y = Xβ + ε (3.3)
where X is an n by k matrix, β is a k by 1 vector, and both y and ε are n by 1 vectors.
Two vector notations will occasionally be used: row,

y_1 = x_1 β + ε_1
y_2 = x_2 β + ε_2
⋮                                                                              (3.4)
y_n = x_n β + ε_n
and column,

y = β1 x1 + β2 x2 + . . . + βk xk + ε. (3.5)
Linear regression allows coefficients to be interpreted all things being equal. Specifi-
cally, the effect of a change in one variable can be examined without changing the others.
Regression analysis also allows for models which contain all of the information relevant
for determining yi whether it is directly of interest or not. This feature provides the mecha-
nism to interpret the coefficient on a regressor as the unique effect of that regressor (under
certain conditions), a feature that makes linear regression very attractive.

3.1.1 What is a model?

What constitutes a model is a difficult question to answer. One view of a model is that of
the data generating process (DGP). For instance, if a model postulates

yi = β1 xi + εi
one interpretation is that the regressand, yi , is exactly determined by xi and some random
shock. An alternative view, one that I espouse, holds that xi is the only relevant variable
available to the econometrician that explains variation in yi . Everything else that deter-
mines yi cannot be measured and, in the usual case, cannot be placed into a framework
which would allow the researcher to formulate a model.
Consider monthly returns on the S&P 500, a value weighted index of 500 large firms in
3.1 Model Description 143

the United States. Equity holdings and returns are generated by individuals based on their
beliefs and preferences. If one were to take a (literal) data generating process view of the
return on this index, data on the preferences and beliefs of individual investors would need
to be collected and formulated into a model for returns. This would be a daunting task to
undertake, depending on the generality of the belief and preference structures.
On the other hand, a model can be built to explain the variation in the market based on
observable quantities (such as oil price changes, macroeconomic news announcements,
etc.) without explicitly collecting information on beliefs and preferences. In a model of
this type, explanatory variables can be viewed as inputs individuals consider when form-
ing their beliefs and, subject to their preferences, taking actions which ultimately affect the
price of the S&P 500. The model allows the relationships between the regressand and re-
gressors to be explored and is meaningful even though the model is not plausibly the data
generating process.
In the context of time-series data, models often postulate that the past values of a series
are useful in predicting future values. Again, suppose that the data were monthly returns
on the S&P 500 and, rather than using contemporaneous explanatory variables, past re-
turns are used to explain present and future returns. Treated as a DGP, this model implies
that average returns in the near future would be influenced by returns in the immediate
past. Alternatively, if the model is taken as an approximation, one interpretation postulates that changes in
beliefs or other variables that influence holdings of assets change slowly (possibly in an
unobservable manner). These slowly changing “factors” produce returns which are pre-
dictable. Of course, there are other interpretations but these should come from finance
theory rather than data. The model as a proxy interpretation is additionally useful as it al-
lows models to be specified which are only loosely coupled with theory but that capture
interesting features of a theoretical model.
Careful consideration of what defines a model is an important step in the development
of an econometrician, and one should always consider which assumptions and beliefs are
needed to justify a specification.

3.1.2 Example: Cross-section regression of returns on factors

The concepts of linear regression will be explored in the context of a cross-section regres-
sion of returns on a set of factors thought to capture systematic risk. Cross sectional regres-
sions in financial econometrics date back at least to the Capital Asset Pricing Model (CAPM,
Markowitz (1959), Sharpe (1964) and Lintner (1965)), a model formulated as a regression of
individual asset’s excess returns on the excess return of the market. More general specifica-
tions with multiple regressors are motivated by the Intertemporal CAPM (ICAPM, Merton
(1973)) and Arbitrage Pricing Theory (APT, Ross (1976)).
The basic model postulates that excess returns are linearly related to a set of systematic
risk factors. The factors can be returns on other assets, such as the market portfolio, or any
other variable related to intertemporal hedging demands, such as interest rates, shocks to
Variable Description
VWM Returns on a value-weighted portfolio of all NYSE, AMEX and NASDAQ
stocks
SM B Returns on the Small minus Big factor, a zero investment portfolio that
is long small market capitalization firms and short big caps.
HML Returns on the High minus Low factor, a zero investment portfolio that
is long high BE/ME firms and short low BE/ME firms.
UMD Returns on the Up minus Down factor (also known as the Momentum factor),
a zero investment portfolio that is long firms with returns in the top
30% over the past 12 months and short firms with returns in the bottom 30%.
SL Returns on a portfolio of small cap and low BE/ME firms.
SM Returns on a portfolio of small cap and medium BE/ME firms.
SH Returns on a portfolio of small cap and high BE/ME firms.
BL Returns on a portfolio of big cap and low BE/ME firms.
BM Returns on a portfolio of big cap and medium BE/ME firms.
BH Returns on a portfolio of big cap and high BE/ME firms.
RF Risk free rate (Rate on a 3 month T-bill).
D AT E Date in format YYYYMM.

Table 3.1: Variable description for the data available in the Fama-French data-set used
throughout this chapter.

inflation or consumption growth.

r_i − r^f_i = f_i β + ε_i

or more compactly,

r^e_i = f_i β + ε_i

where r^e_i = r_i − r^f_i is the excess return on the asset and f_i = [f_{1,i}, . . . , f_{k,i}] are returns on
factors that explain systematic variation.
Linear factors models have been used in countless studies, the most well known by
Fama and French (Fama & French (1992) and Fama & French (1993)) who use returns on
specially constructed portfolios as factors to capture specific types of risk. The Fama-French
data set is available in Excel (ff.xls) or MATLAB (ff.mat) formats and contains the vari-
ables listed in table 3.1.
All data, except the interest rates, are from the CRSP database and were available monthly
from January 1927 until June 2008. Returns are calculated as 100 times the logarithmic
price difference (100 (ln(p_i) − ln(p_{i−1}))). Portfolios were constructed by sorting the firms
into categories based on market capitalization, Book Equity to Market Equity (BE/ME), or
past returns over the previous year. For further details on the construction of portfolios,
Portfolio Mean Std. Dev Skewness Kurtosis


V WMe 0.63 5.43 0.22 10.89
SM B 0.23 3.35 2.21 25.26
HML 0.41 3.58 1.91 19.02
UMD 0.78 4.67 -2.98 31.17
BH e 0.91 7.26 1.72 21.95
BM e 0.67 5.81 1.41 20.85
B Le 0.59 5.40 -0.08 8.30
SH e 1.19 8.34 2.30 25.68
SM e 0.99 7.12 1.39 18.29
SLe 0.70 7.84 1.05 13.56

Table 3.2: Descriptive statistics of the six portfolios that will be used throughout this chap-
ter. The data consist of monthly observations from January 1927 until June 2008 (n = 978).

see Fama & French (1993) or Ken French’s website:

http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.

A general model for the BH portfolio can be specified

BH_i − RF_i = β₁ + β₂(VWM_i − RF_i) + β₃ SMB_i + β₄ HML_i + β₅ UMD_i + ε_i

or, in terms of the excess returns,

BH^e_i = β₁ + β₂ VWM^e_i + β₃ SMB_i + β₄ HML_i + β₅ UMD_i + ε_i.


The coefficients in the model can be interpreted as the effect of a change in one variable
holding the other variables constant. For example, β3 captures the effect of a change in the
S M Bi risk factor holding V W M ie , H M L i and U M Di constant. Table 3.2 contains some
descriptive statistics of the factors and the six portfolios included in this data set.

3.2 Functional Form


A linear relationship is fairly specific and, in some cases, restrictive. It is important to dis-
tinguish specifications which can be examined in the framework of a linear regression from
those which cannot. Linear regressions require two key features of any model: each term
on the right hand side must have only one coefficient that enters multiplicatively and the
error must enter additively.1 Most specifications satisfying these two requirements can be
1
A third but obvious requirement is that neither yi nor any of the x j ,i may be latent (unobservable), j =
1, 2, . . . , k , i = 1, 2, . . . , n.
treated using the tools of linear regression.2 Other forms of “nonlinearities” are permissi-
ble. Any regressor or the regressand can be nonlinear transformations of the original ob-
served data.
Double log (also known as log-log) specifications, where both the regressor and the re-
gressands are log transformations of the original (positive) data, are common.

ln yi = β1 + β2 ln xi + εi .
In the parlance of a linear regression, the model is specified

ỹi = β1 + β2 x̃i + εi
where ỹi = ln(yi ) and x̃i = ln(xi ). The usefulness of the double log specification can be
illustrated by a Cobb-Douglas production function subject to a multiplicative shock

Y_i = β₁ K_i^{β₂} L_i^{β₃} ε_i.
Using the production function directly, it is not obvious that, given values for output (Yi ),
capital (K i ) and labor (L i ) of firm i , the model is consistent with a linear regression. How-
ever, taking logs,

ln Yi = ln β1 + β2 ln K i + β3 ln L i + ln εi
the model can be reformulated as a linear regression on the transformed data. Other forms,
such as semi-log (either log-lin, where the regressand is logged but the regressors are un-
changed, or lin-log, the opposite) are often useful to describe certain relationships.
Linear regression does, however, rule out specifications which may be of interest. Linear
regression is not an appropriate framework to examine a model of the form y_i = β₁ x_{1,i}^{β₂} +
β₃ x_{2,i}^{β₄} + ε_i. Fortunately, more general frameworks, such as generalized method of moments
(GMM) or maximum likelihood estimation (MLE), topics of subsequent chapters, can be
applied.
Two other transformations of the original data, dummy variables and interactions, can
be used to generate nonlinear (in regressors) specifications. A dummy variable is a special
class of regressor that takes the value 0 or 1. In finance, dummy variables (or dummies) are
used to model calendar effects, leverage (where the magnitude of a coefficient depends
on the sign of the regressor), or group-specific effects. Variable interactions parameterize
nonlinearities into a model through products of regressors. Common interactions include
powers of regressors (x_{1,i}², x_{1,i}³, . . .), cross-products of regressors (x_{1,i} x_{2,i}) and interactions
between regressors and dummy variables. Considering the range of nonlinear transforma-
tion, linear regression is surprisingly general despite the restriction of parameter linearity.
The use of nonlinear transformations also changes the interpretation of the regression
2
There are further requirements on the data, both the regressors and the regressand, to ensure that esti-
mators of the unknown parameters are reasonable, but these are treated in subsequent sections.
coefficients. If only unmodified regressors are included,

yi = xi β + εi
and ∂y_i/∂x_{k,i} = β_k. Suppose a specification includes both x_k and x_k² as regressors,

y_i = β₁ x_i + β₂ x_i² + ε_i

In this specification, ∂y_i/∂x_i = β₁ + 2β₂ x_i and the level of the variable enters its partial effect.
Similarly, in a simple double log model

ln y_i = β₁ ln x_i + ε_i,

and

β₁ = ∂ ln y_i / ∂ ln x_i = (∂y/y) / (∂x/x) = %∆y / %∆x
Thus, β1 corresponds to the elasticity of yi with respect to xi . In general, the coefficient on


a variable in levels corresponds to the effect of a one unit changes in that variable while
the coefficient on a variable in logs corresponds to the effect of a one percent change. For
example, in a semi-log model where the regressor is in logs but the regressand is in levels,

yi = β1 ln xi + εi ,

β1 will correspond to a unit change in yi for a % change in xi . Finally, in the case of dis-
crete regressors, where there is no differential interpretation of coefficients, β represents
the effect of a whole unit change, such as a dummy going from 0 to 1.

3.2.1 Example: Dummy variables and interactions in cross section regression

Two calendar effects, the January and the December effects, have been widely studied in
finance. Simply put, the December effect hypothesizes that returns in December are un-
usually low due to tax-induced portfolio rebalancing, mostly to realize losses, while the
January effect stipulates returns are abnormally high as investors return to the market.
To model excess returns on a portfolio (B Hie ) as a function of the excess market return
(V W M ie ), a constant, and the January and December effects, a model can be specified

B Hie = β1 + β2 V W M ie + β3 I1i + β4 I12i + εi

where I1i takes the value 1 if the return was generated in January and I12i does the same for
December. The model can be reparameterized into three cases:
B Hie = (β1 + β3 ) + β2 V W M ie + εi January


B Hie = (β1 + β4 ) + β2 V W M ie + εi December
B Hie = β1 + β2 V W M ie + εi Otherwise

Similarly dummy interactions can be used to produce models with both different intercepts
and different slopes in January and December,

B Hie = β1 + β2 V W M ie + β3 I1i + β4 I12i + β5 I1i V W M ie + β6 I12i V W M ie + εi .

If excess returns on a portfolio were nonlinearly related to returns on the market, a simple
model can be specified

B Hie = β1 + β2 V W M ie + β3 (V W M ie )2 + β4 (V W M ie )3 + εi .
Dittmar (2002) proposed a similar model to explain the cross-sectional dispersion of ex-
pected returns.
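To make this concrete, the following sketch (not part of the original text; the data and variable names are simulated and purely illustrative) constructs January and December dummies and their interactions with the market factor, then estimates the interaction specification by least squares.

import numpy as np

# Simulated monthly data: portfolio excess returns (bh), market excess returns (vwm),
# and the calendar month of each observation.
rng = np.random.default_rng(0)
n = 240
month = np.tile(np.arange(1, 13), n // 12)
vwm = rng.standard_normal(n)
bh = 0.1 + 1.0 * vwm + rng.standard_normal(n)

jan = (month == 1).astype(float)    # I1: January dummy
dec = (month == 12).astype(float)   # I12: December dummy

# Constant, market, dummies, and dummy-market interactions
X = np.column_stack([np.ones(n), vwm, jan, dec, jan * vwm, dec * vwm])
beta = np.linalg.lstsq(X, bh, rcond=None)[0]
print(beta)   # [beta1, ..., beta6] in the interaction specification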

3.3 Estimation
Linear regression is also known as ordinary least squares (OLS) or simply least squares, a
moniker derived from the method of estimating the unknown coefficients. Least squares
minimizes the squared distance between the fit line (or plane if there are multiple regres-
sors) and the regressand. The parameters are estimated as the solution to
minβ (y − Xβ )0 (y − Xβ ) = minβ ∑ni=1 (yi − xi β )2 . (3.6)

First order conditions of this optimization problem are


−2X0 (y − Xβ ) = −2(X0 y − X0 Xβ ) = −2 ∑ni=1 x0i (yi − xi β ) = 0 (3.7)

and rearranging, the least squares estimator for β can be defined.

Definition 3.1 (OLS Estimator). The ordinary least squares estimator, denoted β̂ , is defined

β̂ = (X0 X)−1 X0 y. (3.8)

Clearly this estimator is only reasonable if X0 X is invertible, which is equivalent to the condition that rank(X) = k . This requirement states that no column of X can be exactly expressed as a combination of the k − 1 remaining columns and that the number of observations is at least as large as the number of regressors (n ≥ k ). This is a weak condition
and is trivial to verify in most econometric software packages: using a less than full rank
matrix will generate a warning or error.
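As a concrete illustration (a sketch with simulated data, not from the original text), the estimator in eq. (3.8) and the rank condition can be computed directly.

import numpy as np

rng = np.random.default_rng(42)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])   # constant + 2 regressors
y = X @ np.array([0.5, 1.0, -0.3]) + rng.standard_normal(n)

assert np.linalg.matrix_rank(X) == k          # rank(X) = k, so X'X is invertible

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                  # residual variance estimator defined below
print(beta_hat, s2)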
Dummy variables create one further issue worthy of special attention. Suppose dummy
variables corresponding to the 4 quarters of the year, I1i , . . . , I4i , are constructed from a
quarterly data set of portfolio returns. Consider a simple model with a constant and all 4
dummies

ri = β1 + β2 I1i + β3 I2i + β4 I3i + β5 I4i + εi .

It is not possible to estimate this model with all 4 dummy variables and the constant
because the constant is a perfect linear combination of the dummy variables and so the
regressor matrix would be rank deficient. The solution is to exclude either the constant
or one of the dummy variables. It makes no difference in estimation which is excluded,
although the interpretation of the coefficients changes. In the case where the constant is
excluded, the coefficients on the dummy variables are directly interpretable as quarterly
average returns. If one of the dummy variables is excluded, for example the first quarter
dummy variable, the interpretation changes. In this parameterization,

ri = β1 + β2 I2i + β3 I3i + β4 I4i + εi ,

β1 is the average return in Q1, while β1 + β j is the average return in Q j .


It is also important that any regressor, other than the constant, be nonconstant. Suppose a
regression included the number of years since public floatation but the data set only con-
tained assets that have been trading for exactly 10 years. Including this regressor and a
constant results in perfect collinearity, but, more importantly, without variability in a re-
gressor it is impossible to determine whether changes in the regressor (years since float) result in a change in the regressand or whether the effect is simply constant across all assets. The role that variability of regressors plays will be revisited when studying the statistical
properties of β̂ .
The second derivative matrix of the minimization,

2X0 X,

ensures that the solution must be a minimum as long as X0 X is positive definite. Again,
positive definiteness of this matrix is equivalent to rank(X) = k .
Once the regression coefficients have been estimated, it is useful to define the fit values,
ŷ = Xβ̂ and sample residuals ε̂ = y − ŷ = y − Xβ̂ . Rewriting the first order condition in
terms of the explanatory variables and the residuals provides insight into the numerical
properties of the residuals. An equivalent first order condition to eq. (3.7) is
150 Analysis of Cross-Sectional Data

X0 ε̂ = 0. (3.9)
This set of linear equations is commonly referred to as the normal equations or orthogonal-
ity conditions. This set of conditions requires that ε̂ is outside the span of the columns of X.
Moreover, considering the columns of X separately, X0j ε̂ = 0 for all j = 1, 2, . . . , k . When a
column contains a constant (an intercept in the model specification), ι 0 ε̂ = 0 (∑ni=1 ε̂i = 0), and the mean of the residuals will be exactly zero.3


The OLS estimator of the residual variance, σ̂2 , can be defined.4
Definition 3.2 (OLS Variance Estimator). The OLS residual variance estimator, denoted σ̂2 ,
is defined
σ̂2 = ε̂0 ε̂/(n − k ) (3.10)

Definition 3.3 (Standard Error of the Regression). The standard error of the regression is
defined as
σ̂ = √σ̂2 (3.11)
The least squares estimator has two final noteworthy properties. First, nonsingular
transformations of the x ’s and non-zero scalar transformations of the y ’s have determin-
istic effects on the estimated regression coefficients. Suppose A is a k by k nonsingular
matrix and c is a non-zero scalar. The coefficients of a regression of c yi on xi A are

β̃ = [(XA)0 (XA)]−1 (XA)0 (c y) (3.12)
= c (A0 X0 XA)−1 A0 X0 y
= c A−1 (X0 X)−1 A0−1 A0 X0 y
= c A−1 (X0 X)−1 X0 y
= c A−1 β̂ .

Second, as long as the model contains a constant, the regression coefficients on all
terms except the intercept are unaffected by adding an arbitrary constant to either the regressors or the regressand. Consider transforming the standard specification,

yi = β1 + β2 x2,i + . . . + βk xk ,i + εi
to

ỹi = β1 + β2 x̃2,i + . . . + βk x̃k ,i + εi


3
ι is an n by 1 vector of 1s.
4
The choice of n − k in the denominator will be made clear once the properties of this estimator have been examined.
Portfolio Constant V WMe SM B HML UMD σ


BH e -0.06 1.08 0.02 0.80 -0.04 1.29
BM e -0.02 0.99 -0.12 0.32 -0.03 1.25
B Le 0.09 1.02 -0.10 -0.24 -0.02 0.78
SH e 0.05 1.02 0.93 0.77 -0.04 0.76
SM e 0.06 0.98 0.82 0.31 0.01 1.07
SLe -0.10 1.08 1.05 -0.19 -0.06 1.24
Table 3.3: Estimated regression coefficients from the model ri^p = β1 + β2 V W M ie + β3 S M Bi + β4 H M L i + β5 U M Di + εi , where ri^p is the excess return on one of the six size and BE/ME sorted portfolios. The final column contains the standard error of the regression.

where ỹi = yi + c y and x̃ j ,i = x j ,i + c x j . This model is identical to

yi = β̃1 + β2 x2,i + . . . + βk xk ,i + εi
where β̃1 = β1 − c y + β2 c x2 + . . . + βk c xk .
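Both properties are easy to verify numerically; the sketch below (simulated data, illustrative only) checks that regressing c y on XA returns c A−1 β̂ and that adding constants to the regressand and regressors only moves the intercept.

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([0.5, 1.0, -0.3]) + rng.standard_normal(n)
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Property 1: regressing c*y on X @ A yields c * inv(A) @ beta
c, A = 2.5, rng.standard_normal((k, k))       # a random A is nonsingular with probability 1
beta_trans = np.linalg.lstsq(X @ A, c * y, rcond=None)[0]
print(np.allclose(beta_trans, c * np.linalg.solve(A, beta)))

# Property 2: shifting y and the non-constant regressors changes only the intercept
X_shift = X.copy()
X_shift[:, 1:] += np.array([0.7, -1.2])
beta_shift = np.linalg.lstsq(X_shift, y + 3.0, rcond=None)[0]
print(np.allclose(beta_shift[1:], beta[1:]))  # slopes unchanged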

3.3.1 Estimation of Cross-Section regressions of returns on factors

Table 3.3 contains the estimated regression coefficients as well as the standard error of the
regression for the 6 portfolios in the Fama-French data set in a specification including all
four factors and a constant. There has been a substantial decrease in the magnitude of the
standard error of the regression relative to the standard deviation of the original data. The
next section will formalize how this reduction is interpreted.

3.4 Assessing Fit

Once the parameters have been estimated, the next step is to determine whether or not the
model fits the data. The minimized sum of squared errors, the objective of the optimiza-
tion, is an obvious choice to assess fit. However, there is an important drawback to using
to using the sum of squared errors: changes in the scale of yi alter the minimized sum of
squared errors without changing the fit. In order to devise a scale free metric, it is neces-
sary to distinguish between the portions of y which can be explained by X from those which
cannot.
Two matrices, the projection matrix, PX and the annihilator matrix, MX , are useful when
decomposing the regressand into the explained component and the residual.

Definition 3.4 (Projection Matrix). The projection matrix, a symmetric idempotent ma-
trix that produces the projection of a variable onto the space spanned by X, denoted PX , is
152 Analysis of Cross-Sectional Data

defined

PX = X(X0 X)−1 X0 (3.13)

Definition 3.5 (Annihilator Matrix). The annihilator matrix, a symmetric idempotent ma-
trix that produces the projection of a variable onto the null space of X0 , denoted MX , is
defined

MX = In − X(X0 X)−1 X0 . (3.14)

These two matrices have some desirable properties. Both the fit value of y (ŷ) and the
estimated errors, ε̂, can be simply expressed in terms of these matrices as ŷ = PX y and
ε̂ = MX y respectively. These matrices are also idempotent: PX PX = PX and MX MX = MX
and orthogonal: PX MX = 0. The projection matrix returns the portion of y that lies in the
linear space spanned by X, while the annihilator matrix returns the portion of y which lies
in the null space of X0 . In essence, MX annihilates any portion of y which is explainable by
X leaving only the residuals.
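These properties are simple to confirm numerically; the following sketch (simulated data, not from the original text) builds PX and MX and checks that ŷ = PX y, ε̂ = MX y, and that the matrices are idempotent and orthogonal.

import numpy as np

rng = np.random.default_rng(7)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T                  # projection matrix P_X
M = np.eye(n) - P                      # annihilator matrix M_X
beta = XtX_inv @ X.T @ y

print(np.allclose(P @ y, X @ beta))        # fitted values
print(np.allclose(M @ y, y - X @ beta))    # residuals
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P @ M, 0))                           # orthogonal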
Decomposing y using the projection and annihilator matrices,

y = PX y + MX y
which follows since PX + MX = In . The squared observations can be decomposed

y0 y = (PX y + MX y)0 (PX y + MX y)


= y0 PX PX y + y0 PX MX y + y0 MX PX y + y0 MX MX y
= y0 PX y + 0 + 0 + y0 MX y
= y0 PX y + y0 MX y

noting that PX and MX are idempotent and PX MX = 0n . These three quantities are often
referred to as5

y0 y = ∑ni=1 yi2 Uncentered Total Sum of Squares (TSSU ) (3.15)

5
There is no consensus about the names of these quantities. In some texts, the component capturing the
fit portion is known as the Regression Sum of Squares (RSS) while in others it is known as the Explained Sum
of Squares (ESS), while the portion attributable to the errors is known as the Sum of Squared Errors (SSE),
the Sum of Squared Residuals (SSR), the Residual Sum of Squares (RSS) or the Error Sum of Squares (ESS).
The choice to use SSE and RSS in this text was to ensure the reader that SSE must be the component of the
squared observations relating to the error variation.
y0 PX y = ∑ni=1 (xi β̂ )2 Uncentered Regression Sum of Squares (RSSU ) (3.16)
y0 MX y = ∑ni=1 (yi − xi β̂ )2 Uncentered Sum of Squared Errors (SSEU ). (3.17)

Dividing through by y0 y

y0 PX y/y0 y + y0 MX y/y0 y = 1

or

RSSU /TSSU + SSEU /TSSU = 1.
This identity expresses the scale-free total variation in y that is captured by X (y0 PX y)
and that which is not (y0 MX y). The portion of the total variation explained by X is known as
the uncentered R2 (R2U ),
Definition 3.6 (Uncentered R 2 (R2U )). The uncentered R2 , which is used in models that do
not include an intercept, is defined

R2U = RSSU /TSSU = 1 − SSEU /TSSU (3.18)
While this measure is scale free it suffers from one shortcoming. Suppose a constant
is added to y, so that the TSSU changes to (y + c )0 (y + c ). The identity still holds and so
(y + c )0 (y + c ) must increase (for a sufficiently large c ). In turn, one of the right-hand side
variables must also grow larger. In the usual case where the model contains a constant, the
increase will occur in the RSSU (y0 PX y), and as c becomes arbitrarily large, uncentered R2
will asymptote to one. To overcome this limitation, a centered measure can be constructed
which depends on deviations from the mean rather than on levels.
Let ỹ = y − ȳ = Mι y where Mι = In − ι(ι 0 ι)−1 ι 0 is the matrix which subtracts the mean from a vector of data. Then

y0 Mι PX Mι y + y0 Mι MX Mι y = y0 Mι y
y0 Mι PX Mι y/y0 Mι y + y0 Mι MX Mι y/y0 Mι y = 1

or more compactly

ỹ0 PX ỹ/ỹ0 ỹ + ỹ0 MX ỹ/ỹ0 ỹ = 1.
Centered R2 (R2C ) is defined analogously to the uncentered version, replacing the uncentered sums of squares with their centered counterparts.
Definition 3.7 (Centered R2 (R2C )). The centered R2 , used in models that include an intercept, is defined

R2C = RSSC /TSSC = 1 − SSEC /TSSC (3.19)

where

y0 Mι y = ∑ni=1 (yi − ȳ )2 Centered Total Sum of Squares (TSSC ) (3.20)
y0 Mι PX Mι y = ∑ni=1 (xi β̂ − x̄β̂ )2 Centered Regression Sum of Squares (RSSC ) (3.21)
y0 Mι MX Mι y = ∑ni=1 (yi − xi β̂ )2 Centered Sum of Squared Errors (SSEC ). (3.22)

and where x̄ = n−1 ∑ni=1 xi .
The expressions R2 , SSE, RSS and TSS should be assumed to correspond to the cen-
tered version unless further qualified. With two versions of R2 available that generally dif-
fer, which should be used? Centered should be used if the model is centered (contains a
constant) and uncentered should be used when it does not. Failing to choose the correct R2
can lead to incorrect conclusions about the fit of the model and mixing the definitions can
lead to a nonsensical R2 that falls outside of [0, 1]. For instance, computing R2 using the
centered version when the model does not contain a constant often results in a negative
value when

R2 = 1 − SSEC /TSSC .
Most software will return centered R2 and caution is warranted if a model is fit without a
constant.
R2 does have some caveats. First, adding an additional regressor will always (weakly)
increase the R2 since the sum of squared errors cannot increase by the inclusion of an ad-
ditional regressor. This renders R2 useless in discriminating between two models where
one is nested within the other. One solution to this problem is to use the degree of freedom
adjusted R2 .
 
Definition 3.8 (Adjusted R2 (R̄2 )). The adjusted R2 , which adjusts for the number of estimated parameters, is defined

R̄2 = 1 − (SSE/(n − k ))/(TSS/(n − 1)) = 1 − (SSE/TSS) (n − 1)/(n − k ). (3.23)

R̄2 will increase if the reduction in the SSE is large enough to compensate for a loss of 1 degree of freedom, captured by the n − k term.

Regressors R2U R2C R̄2U R̄2C


V WMe 0.8141 0.8161 0.8141 0.8161
V W M e ,SM B 0.8144 0.8163 0.8143 0.8161
V W Me,HM L 0.9684 0.9641 0.9683 0.9641
V W M e ,SM B, H M L 0.9685 0.9641 0.9684 0.9640
V W M e , SM B, H M L,U M D 0.9691 0.9665 0.9690 0.9664
1, V W M e 0.8146 0.8116 0.8144 0.8114
1, V W M e , S M B 0.9686 0.9681 0.9685 0.9680
1, V W M e , S M B , H M L 0.9687 0.9682 0.9686 0.9681
1, V W M e , S M B , H M L , U M D 0.9692 0.9687 0.9690 0.9685
Table 3.4: Centered and uncentered R2 and R̄2 from a variety of factor models. Bold indicates the correct version (centered or uncentered) for that model. R2 is monotonically increasing in larger models, while R̄2 is not.

However, if the SSE does not change, R̄2 will decrease. R̄2 is preferable to R2 for comparing models, although the topic of model selection will be more formally considered at the end of this chapter. R̄2 , like R2 , should be constructed from the appropriate versions of the RSS, SSE and TSS (either centered or uncentered).
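The sketch below (simulated data, illustrative only) computes the uncentered, centered, and adjusted measures and shows how the centered R2 can become negative when the model omits a constant.

import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.standard_normal(n)
y = 5.0 + 0.5 * x + rng.standard_normal(n)    # large intercept, modest slope

def fit_measures(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    n_obs, k = X.shape
    sse = e @ e
    tss_u = y @ y                              # uncentered TSS
    tss_c = ((y - y.mean()) ** 2).sum()        # centered TSS
    r2_u = 1 - sse / tss_u
    r2_c = 1 - sse / tss_c
    r2_adj = 1 - (sse / (n_obs - k)) / (tss_c / (n_obs - 1))
    return r2_u, r2_c, r2_adj

print(fit_measures(np.column_stack([np.ones(n), x]), y))  # with a constant
print(fit_measures(x[:, None], y))   # no constant: centered R2 is typically negative here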
Second, R2 is not invariant to changes in the regressand. A frequent mistake is to use R2
to compare the fit from two models with different regressands, for instance yi and ln(yi ).
These numbers are incomparable and this type of comparison must be avoided. Moreover,
R2 is even sensitive to more benign transformations. Suppose a simple model is postulated,

yi = β1 + β2 xi + εi ,
and a model logically consistent with the original model,

yi − xi = β1 + (β2 − 1)xi + εi ,
is estimated. The R2 s from these models will generally differ. For example, suppose the
original coefficient on xi was 1. Subtracting xi will reduce the explanatory power of xi to
0, rendering it useless and resulting in an R2 of 0 irrespective of the R2 in the original model.

3.4.1 Example: R2 and R̄2 in Cross-Sectional Factor models
To illustrate the use of R2 , and the problems with its use, consider a model for B H e which can depend on one or more risk factors.
The R2 values in Table 3.4 show two things. First, the excess return on the market port-
folio alone can explain 80% of the variation in excess returns on the big-high portfolio.
Second, the H M L factor appears to have additional explanatory power on top of the mar-
ket, evidenced by increases in R2 from 0.80 to 0.96. The centered and uncentered R2 are
Regressand Regressors R2U R2C R̄2U R̄2C


BH e V WMe 0.7472 0.7598 0.7472 0.7598
BH e 1, V W M e 0.7554 0.7504 0.7552 0.7502
10 + B H e V WMe 0.3179 0.8875 0.3179 0.8875
10 + B H e 1, V W M e 0.9109 0.7504 0.9108 0.7502
100 + B H e V WMe 0.0168 2.4829 0.0168 2.4829
100 + B H e 1, V W M e 0.9983 0.7504 0.9983 0.7502
BH e 1, V W M e 0.7554 0.7504 0.7552 0.7502
BH e − V W M e 1, V W M e 0.1652 0.1625 0.1643 0.1616
Table 3.5: Centered and uncentered R2 and R̄2 from models with regressor or regressand changes. Using the wrong R2 can lead to nonsensical values (outside of 0 and 1) or a false sense of fit (R2 near one). Some values are larger than 1 because they were computed using RSSC /TSSC . Had 1 − SSEC /TSSC been used, the values would be negative because RSSC /TSSC + SSEC /TSSC = 1. The bottom two lines examine the effect of subtracting a regressor before fitting the model: the R2 decreases sharply. This should not be viewed as problematic since models with different regressands cannot be compared using R2 .

very similar because the intercept in the model is near zero. Instead, suppose that the de-
pendent variable is changed to 10 + B H e or 100 + B H e and attention is restricted to the
CAPM. Using the incorrect definition for R2 can lead to nonsensical (negative) and mislead-
ing (artificially near 1) values. Finally, Table 3.5 also illustrates the problem of changing the regressand by replacing B Hie with B Hie − V W M ie . The R2 decreases from a respectable 0.80 to only 0.10, even though the interpretation of the model remains unchanged.

3.5 Assumptions

Thus far, all of the derivations and identities presented are purely numerical. They do not
indicate whether β̂ is a reasonable way to estimate β . It is necessary to make some as-
sumptions about the innovations and the regressors to provide a statistical interpretation
of β̂ . Two broad classes of assumptions can be used to analyze the behavior of β̂ : the
classical framework (also known as small sample or finite sample) and asymptotic analysis
(also known as large sample).
Neither method is ideal. The small sample framework is precise in that the exact distributions of estimators and test statistics are known. This precision comes at the cost of many
restrictive assumptions – assumptions not usually plausible in financial applications. On
the other hand, asymptotic analysis requires few restrictive assumptions and is broadly ap-
plicable to financial data, although the results are only exact if the number of observations
is infinite. Asymptotic analysis is still useful for examining the behavior in finite samples
3.5 Assumptions 157

when the sample size is large enough for the asymptotic distribution to approximate the
finite-sample distribution reasonably well.
This leads to the most important question of asymptotic analysis: How large does n
need to be before the approximation is reasonable? Unfortunately, the answer to this ques-
tion is “It depends”. In simple cases, where residuals are independent and identically dis-
tributed, as few as 30 observations may be sufficient for the asymptotic distribution to be
a good approximation to the finite-sample distribution. In more complex cases, anywhere
from 100 to 1,000 may be needed, while in the extreme cases, where the data is heterogeneous and highly dependent, an asymptotic approximation may be poor with more than
1,000,000 observations.
The properties of β̂ will be examined under both sets of assumptions. While the small
sample results are not generally applicable, it is important to understand these results as
the lingua franca of econometrics, as well as the limitations of tests based on the classi-
cal assumptions, and to be able to detect when a test statistic may not have the intended
asymptotic distribution. Six assumptions are required to examine the finite-sample distri-
bution of β̂ and establish the optimality of the OLS procedure (although many properties only require a subset).

Assumption 3.9 (Linearity). yi = xi β + εi

This assumption states the obvious condition necessary for least squares to be a reason-
able method to estimate the β . It further imposes a less obvious condition, that xi must
be observed and measured without error. Many applications in financial econometrics in-
clude latent variables. Linear regression is not applicable in these cases and a more sophis-
ticated estimator is required. In other applications, the true value of xk ,i is not observed and
a noisy proxy must be used, so that x̃k ,i = xk ,i + νk ,i where νk ,i is an error uncorrelated with
xk ,i . When this occurs, ordinary least squares estimators are misleading and a modified
procedure (two-stage least squares (2SLS) or instrumental variable regression (IV)) must
be used.

Assumption 3.10 (Conditional Mean). E[εi |X] = 0, i = 1, 2, . . . , n

This assumption states that the mean of each εi is zero given any xk ,i , any function of
any xk ,i or combinations of these. It is stronger than the assumption used in the asymp-
totic analysis and is not valid in many applications (e.g. time-series data). When the re-
gressand and regressor consist of time-series data, this assumption may be violated and
E[εi |xi + j ] 6= 0 for some j . This assumption also implies that the correct form of xk ,i enters
the regression, that E[εi ] = 0 (through a simple application of the law of iterated expec-
tations), and that the innovations are uncorrelated with the regressors, so that E[εi 0 x j ,i ] =
0, i 0 = 1, 2, . . . , n , i = 1, 2, . . . , n , j = 1, 2, . . . , k .

Assumption 3.11 (Rank). The rank of X is k with probability 1.



This assumption is needed to ensure that β̂ is identified and can be estimated. In prac-
tice, it requires that no regressor is perfectly collinear with the others, that the number
of observations is at least as large as the number of regressors (n ≥ k ) and that variables
other than a constant have non-zero variance.

Assumption 3.12 (Conditional Homoskedasticity). V[εi |X] = σ2

Homoskedasticity is rooted in homo (same) and skedannumi (scattering) and in mod-


ern English means that the residuals have identical variances. This assumption is required
to establish the optimality of the OLS estimator and it specifically rules out the case where
the variance of an innovation is a function of a regressor.

Assumption 3.13 (Conditional Correlation). E[εi ε j |X] = 0, i = 1, 2, . . . , n, j = i + 1, . . . , n

Assuming the residuals are conditionally uncorrelated is convenient when coupled with
the homoskedasticity assumption: the covariance of the residuals will be σ2 In . Like ho-
moskedasticity, this assumption is needed for establishing the optimality of the least squares
estimator.

Assumption 3.14 (Conditional Normality). ε|X ∼ N (0, Σ)

Assuming a specific distribution is very restrictive – results based on this assumption will only be correct if the errors are actually normal – but this assumption allows for precise
statements about the finite-sample distribution of β̂ and test statistics. This assumption,
when combined with assumptions 3.12 and 3.13, provides a simple distribution for the
innovations: εi |X ∼ N (0, σ2 ).

3.6 Small Sample Properties of OLS estimators


Using these assumptions, many useful properties of β̂ can be derived. Recall that β̂ =
(X0 X)−1 X0 y.

Theorem 3.15 (Bias of β̂ ). Under assumptions 3.9 - 3.11

E[β̂ |X] = β . (3.24)

While unbiasedness is a desirable property, it is not particularly meaningful without


further qualification. For instance, an estimator which is unbiased, but does not increase
in precision as the sample size increases is generally not desirable. Fortunately, β̂ is not
only unbiased, it has a variance that goes to zero.

Theorem 3.16 (Variance of β̂ ). Under assumptions 3.9 - 3.13

V[β̂ |X] = σ2 (X0 X)−1 . (3.25)



Under the conditions necessary for unbiasedness for β̂ , plus assumptions about ho-
moskedasticity and the conditional correlation of the residuals, the form of the variance is
simple. Consistency follows since
(X0 X)−1 = n−1 (∑ni=1 x0i xi /n )−1 (3.26)
≈ n−1 E[x0i xi ]−1
will be declining as the sample size increases.
However, β̂ has an even stronger property under the same assumptions. It is BLUE:
Best Linear Unbiased Estimator. Best, in this context, means that it has the lowest variance
among all other linear unbiased estimators. While this is a strong result, a few words of
caution are needed to properly interpret this result. The class of Linear Unbiased Estima-
tors (LUEs) is small in the universe of all unbiased estimators. Saying OLS is the “best” is
akin to a one-armed boxer claiming to be the best one-arm boxer. While possibly true, she
probably would not stand a chance against a two-armed opponent.

Theorem 3.17 (Gauss-Markov Theorem). Under assumptions 3.9-3.13, β̂ is the minimum


variance estimator among all linear unbiased estimators. That is V[β̃ |X] - V[β̂ |X] is positive
semi-definite where β̃ = Cy, E[β̃ ] = β and C 6= (X0 X)−1 X0 .

Letting β̃ be any other linear, unbiased estimator of β , it must have a larger covariance.
However, many estimators, including most maximum likelihood estimators, are nonlinear
and so are not necessarily less efficient. Finally, making use of the normality assumption,
it is possible to determine the conditional distribution of β̂ .

Theorem 3.18 (Distribution of β̂ ). Under assumptions 3.9 – 3.14,

β̂ |X ∼ N (β , σ2 (X0 X)−1 ) (3.27)


Theorem 3.18 should not be surprising. β̂ is a linear combination of (jointly) normally
distributed random variables and thus is also normally distributed. Normality is also use-
ful for establishing the relationship between the estimated residuals ε̂ and the estimated
parameters β̂ .

Theorem 3.19 (Conditional Independence of ε̂ and β̂ ). Under assumptions 3.9 - 3.14, ε̂ is


independent of β̂ , conditional on X.

One implication of this theorem is that Cov(ε̂i , β̂ j |X) = 0 i = 1, 2, . . . , n, j = 1, 2, . . . , k .


As a result, functions of ε̂ will be independent of functions of β̂ , a property useful in deriv-
ing distributions of test statistics that depend on both. Finally, in the small sample setup,
the exact distribution of the sample error variance estimator, σ̂2 = ε̂0 ε̂/(n − k ), can be
derived.
Theorem 3.20 (Distribution of σ̂2 ).

(n − k ) σ̂2 /σ2 ∼ χ2n−k

where σ̂2 = y0 MX y/(n − k ) = ε̂0 ε̂/(n − k ).

Since ε̂i is a normal random variable, once it is standardized and squared, it should be
a χ12 . The change in the divisor from n to n − k reflects the loss in degrees of freedom due
to the k estimated parameters.

3.7 Maximum Likelihood

Once the assumption that the innovations are conditionally normal has been made, con-
ditional maximum likelihood is an obvious method to estimate the unknown parameters
(β , σ2 ). Conditioning on X, and assuming the innovations are normal, homoskedastic, and
conditionally uncorrelated, the likelihood is given by

f (y|X; β , σ2 ) = (2πσ2 )−n /2 exp(−(y − Xβ )0 (y − Xβ )/(2σ2 )) (3.28)
and, taking logs, the log likelihood

l (β , σ2 ; y|X) = −(n /2) log(2π) − (n /2) log(σ2 ) − (y − Xβ )0 (y − Xβ )/(2σ2 ). (3.29)
Recall that the logarithm is a monotonic, strictly increasing transformation, and the ex-
tremum points of the log-likelihood and the likelihood will occur at the same parameters.
Maximizing the likelihood with respect to the unknown parameters, there are k + 1 first
order conditions

∂ l (β , σ2 ; y|X)/∂ β = X0 (y − Xβ̂ )/σ2 = 0 (3.30)
∂ l (β , σ2 ; y|X)/∂ σ2 = −n /(2σ̂2 ) + (y − Xβ̂ )0 (y − Xβ̂ )/(2σ̂4 ) = 0. (3.31)

The first set of conditions is identical to the first order conditions of the least squares esti-
mator ignoring the scaling by σ2 , assumed to be greater than 0. The solution is

β̂ MLE = (X0 X)−1 X0 y (3.32)
σ̂2 MLE = n −1 (y − Xβ̂ )0 (y − Xβ̂ ) = n −1 ε̂0 ε̂. (3.33)
The regression coefficients are identical under maximum likelihood and OLS, although the
divisor in σ̂2 and σ̂2 MLE differ.
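A quick numerical comparison (a sketch with simulated data, not from the original text) shows that maximizing the normal log-likelihood reproduces the least squares coefficients while the two variance estimators differ only in their divisors.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([0.2, 1.0, -0.5]) + rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_ols
s2_ols = e @ e / (n - k)                       # divisor n - k

def neg_loglik(params):                        # negative of eq. (3.29), parameterized in log(sigma^2)
    beta, log_s2 = params[:k], params[k]
    resid = y - X @ beta
    return 0.5 * n * (np.log(2 * np.pi) + log_s2) + 0.5 * resid @ resid / np.exp(log_s2)

res = minimize(neg_loglik, np.zeros(k + 1), method="BFGS")
beta_mle, s2_mle = res.x[:k], np.exp(res.x[k])
print(np.allclose(beta_mle, beta_ols, atol=1e-4))   # same coefficients
print(s2_mle, e @ e / n, s2_ols)                     # MLE variance uses divisor n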
It is important to note that the derivation of the OLS estimator does not require an as-
sumption of normality. Moreover, the unbiasedness, variance, and BLUE properties do not
rely on conditional normality of residuals. However, if the innovations are homoskedastic,
uncorrelated and normal, the results of the Gauss-Markov theorem can be strengthened
using the Cramer-Rao lower bound.

Theorem 3.21 (Cramer-Rao Inequality). Let f (z; θ ) be the joint density of z where θ is a k
dimensional parameter vector. Let θ̂ be an unbiased estimator of θ 0 with finite covariance.
Under some regularity condition on f (·)

V[θ̂ ] ≥ I −1 (θ 0 )

where " #
∂ 2 ln f (z; θ )

I = −E (3.34)
∂ θ ∂ θ 0 θ =θ 0
and " #
∂ ln f (z; θ ) ∂ ln f (z; θ )

J =E (3.35)
∂θ ∂ θ0
θ =θ 0

and, under some additional regularity conditions,

I(θ 0 ) = J (θ 0 ).
The last part of this theorem is the information matrix equality (IME) and when a model is
correctly specified in its entirety, the expected covariance of the scores is equal to the negative of the expected Hessian.6 The IME will be revisited in later chapters. The second order
conditions,

∂ 2 l (β , σ2 ; y|X)/∂ β ∂ β 0 = −X0 X/σ2 (3.36)
∂ 2 l (β , σ2 ; y|X)/∂ β ∂ σ2 = −X0 (y − Xβ )/σ4 (3.37)
∂ 2 l (β , σ2 ; y|X)/∂ (σ2 )2 = n /(2σ4 ) − (y − Xβ )0 (y − Xβ )/σ6 (3.38)

are needed to find the lower bound for the covariance of the estimators of β and σ2 . Taking

6
There are quite a few regularity conditions for the IME to hold, but discussion of these is beyond the
scope of this course. Interested readers should see White (1996) for a thorough discussion.
expectations of the second derivatives,

E[∂ 2 l (β , σ2 ; y|X)/∂ β ∂ β 0 ] = −X0 X/σ2 (3.39)
E[∂ 2 l (β , σ2 ; y|X)/∂ β ∂ σ2 ] = 0 (3.40)
E[∂ 2 l (β , σ2 ; y|X)/∂ (σ2 )2 ] = −n /(2σ4 ) (3.41)
and so the lower bound for the variance of β̂ MLE = β̂ is σ2 (X0 X)−1 . Theorem 3.16 shows
that σ2 (X0 X)−1 is also the variance of the OLS estimator β̂ and so the Gauss-Markov theo-
rem can be strengthened in the case of conditionally homoskedastic, uncorrelated normal
residuals.
Theorem 3.22 (Best Unbiased Estimator). Under assumptions 3.9 - 3.14, β̂ MLE = β̂ is the
best unbiased estimator of β .

The difference between this theorem and the Gauss-Markov theorem is subtle but im-
portant. The class of estimators is no longer restricted to include only linear estimators and
so this result is both broad and powerful: MLE (or OLS) is an ideal estimator under these
assumptions (in the sense that no other unbiased estimator, linear or not, has a lower vari-
ance). This result does not extend to the variance estimator since E[σ̂2 MLE ] = ((n − k )/n )σ2 6= σ2 , and so the optimality of σ̂2 MLE cannot be established using the Cramer-Rao theorem.

3.8 Small Sample Hypothesis Testing

Most regressions are estimated to test implications of economic or finance theory. Hypoth-
esis testing is the mechanism used to determine whether data and theory are congruent.
Formalized in terms of β , the null hypothesis (also known as the maintained hypothesis)
is formulated as

H0 : R(β ) − r = 0 (3.42)

where R(·) is a function from Rk to Rm , m ≤ k and r is an m by 1 vector. Initially, a subset


of all hypotheses, those in the linear equality hypotheses class, formulated

H0 : Rβ − r = 0 (3.43)

will be examined where R is a m by k matrix. In subsequent chapters, more general test


specifications including nonlinear restrictions on the parameters will be considered. All
hypotheses in this class can be written as weighted sums of the regression coefficients,
R11 β1 + R12 β2 + . . . + R1k βk = r1
R21 β1 + R22 β2 + . . . + R2k βk = r2
...
Rm1 β1 + Rm2 β2 + . . . + Rmk βk = rm
Each constraint is represented as a row in the above set of equations. Linear equality con-
straints can be used to test parameter restrictions such as

β1 = 0 (3.44)
3β2 + β3 = 1
∑kj=1 βj = 0

β1 = β2 = β3 = 0.

For instance, if the unrestricted model was

yi = β1 + β2 x2,i + β3 x3,i + β4 x4,i + β5 x5,i + εi


the hypotheses in eq. (3.44) can be described in terms of R and r as

H0                    R                                     r
β1 = 0                [1 0 0 0 0]                           0
3β2 + β3 = 1          [0 3 1 0 0]                           1
∑kj=1 βj = 0          [0 1 1 1 1]                           0
β1 = β2 = β3 = 0      [1 0 0 0 0; 0 1 0 0 0; 0 0 1 0 0]     [0; 0; 0]

When using linear equality constraints, alternatives are specified as H1 : Rβ − r 6= 0.


Once both the null and the alternative hypotheses have been postulated, it is necessary to
discern whether the data are consistent with the null hypothesis. Three classes of statistics
will be described to test these hypotheses: Wald, Lagrange Multiplier and Likelihood Ratio.
Wald tests are perhaps the most intuitive: they directly test whether Rβ − r is close to zero.
Lagrange Multiplier tests incorporate the constraint into the least squares problem using a Lagrangian. If the constraint has a small effect on the minimized sum of squares, the Lagrange multipliers, often described as the shadow price of the constraint in economic applications, should be close to zero. The magnitude of these multipliers forms the basis of the LM test
statistic. Finally, likelihood ratios test whether the data are less likely under the null than
they are under the alternative. If the null hypothesis is not restrictive this ratio should be
close to one and the difference in the log-likelihoods should be small.

3.8.1 t -tests
T-tests can be used to test a single hypothesis involving one or more coefficients,

H0 : Rβ = r

where R is a 1 by k vector and r is a scalar. Recall from theorem 3.18, β̂ −β ∼ N (0, σ2 (X0 X)−1 ).
Under the null, R(β̂ − β ) = Rβ̂ − Rβ = Rβ̂ − r and applying the properties of normal ran-
dom variables,
Rβ̂ − r ∼ N (0, σ2 R(X0 X)−1 R0 ).

A simple test can be constructed

z = (Rβ̂ − r)/√(σ2 R(X0 X)−1 R0 ), (3.45)
where z ∼ N (0, 1). To perform a test with size α, the value of z can be compared to the
critical values of the standard normal and rejected if |z | > Cα where Cα is the 1 − α quantile
of a standard normal. However, z is an infeasible statistic since it depends on an unknown
quantity, σ2 . The natural solution is to replace the unknown parameter with an estimate. Dividing z by √(s 2 /σ2 ) and simplifying,

t = z /√(s 2 /σ2 ) (3.46)
= [(Rβ̂ − r)/√(σ2 R(X0 X)−1 R0 )] / √(s 2 /σ2 )
= (Rβ̂ − r)/√(s 2 R(X0 X)−1 R0 ).
Note that the denominator satisfies (n − k )s 2 /σ2 ∼ χ2n−k , and so t is the ratio of a standard normal to the square root of a χ2ν normalized by its degrees of freedom. As long as the standard normal in the numerator and the χ2ν are independent, this ratio will have a Student’s t distribution.

Definition 3.23 (Student’s t distribution). Let z ∼ N (0, 1) (standard normal) and let w ∼
χ2ν where z and w are independent. Then

z /√(w /ν) ∼ t ν . (3.47)

The independence of β̂ and s 2 – which is only a function of ε̂ – follows from 3.19, and
so t has a Student’s t distribution.

Theorem 3.24 (t -test). Under assumptions 3.9 - 3.14,

(Rβ̂ − r)/√(s 2 R(X0 X)−1 R0 ) ∼ t n−k . (3.48)

As ν → ∞, the Student’s t distribution converges to a standard normal. As a practical matter, when ν > 30, the t distribution is close to a normal. While any single linear restriction can be tested with a t -test, the expression t -stat has become synonymous with a
specific null hypothesis.

Definition 3.25 (t -stat). The t -stat of a coefficient, βk , is the t -test value of a test of the
null H0 : βk = 0 against the alternative H1 : βk 6= 0, and is computed

β̂k /√(s 2 [(X0 X)−1 ]k k ) (3.49)

where [(X0 X)−1 ]k k is the k th diagonal element of (X0 X)−1 .

The previous examples were all two-sided; the null would be rejected if the parameters
differed in either direction from the null hypothesis. T-tests are also unique among these
three main classes of test statistics in that they can easily be applied against both one-sided
alternatives and two-sided alternatives.7
However, there is often a good argument to test a one-sided alternative. For instance, in
tests of the market premium, theory indicates that it must be positive to induce investment.
Thus, when testing the null hypothesis that a risk premium is zero, a two-sided alternative
could reject in cases which are not theoretically interesting. More importantly, a one-sided
alternative, when appropriate, will have more power than a two-sided alternative since the
direction information in the null hypothesis can be used to tighten confidence intervals.
The two types of tests involving a one-sided hypothesis are upper tail tests which test nulls
of the form H0 : Rβ ≤ r against alternatives of the form H1 : Rβ > r , and lower tail tests
which test H0 : Rβ ≥ r against H1 : Rβ < r .
Figure 3.1 contains the rejection regions of a t 10 distribution. The dark gray region cor-
responds to the rejection region of a two-sided alternative to the null H0 : β = β 0 for a
10% test. The light gray region, combined with the upper dark gray region corresponds to
7
Wald, LM and LR tests can be implemented against one-sided alternatives with considerably more effort.
the rejection region of a one-sided upper tail test, and so a test statistic between 1.372 and
1.812 would be rejected using a one-sided alternative but not with a two-sided one.

Figure 3.1: Rejection region for a t -test of the nulls H0 : β = β 0 (two-sided) and H0 : β ≤
β 0 . The two-sided rejection region is indicated by dark gray while the one-sided (upper)
rejection region includes both the light and dark gray areas in the right tail.

Algorithm 3.26 (t -test). 1. Estimate β̂ using least squares.


2. Compute s 2 = (n − k )−1 ∑ni=1 ε̂2i and s 2 (X0 X)−1 .

3. Construct the restriction matrix, R, and the value of the restriction, r from the null
hypothesis.

4. Compute t = (Rβ̂ − r)/√(s 2 R(X0 X)−1 R0 ).

5. Compare t to the critical value, Cα , of the t n−k distribution for a test with size α. In the case of a two-tailed test, reject the null hypothesis if |t | exceeds the 1 − α/2 quantile of the t n−k distribution. In the case of a one-sided upper-tail test, reject if t exceeds the 1 − α quantile, and in the case of a one-sided lower-tail test, reject if t is below the α quantile.
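A minimal sketch of these steps (simulated data; the hypothetical restriction is H0 : β2 = 0, with scipy supplying the t quantiles) follows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 250, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([0.1, 0.4, 0.0]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
s2 = e @ e / (n - k)

R = np.array([0.0, 1.0, 0.0])                        # H0: beta_2 = 0
r = 0.0
t_stat = (R @ beta - r) / np.sqrt(s2 * (R @ XtX_inv @ R))
alpha = 0.10
crit = stats.t.ppf(1 - alpha / 2, df=n - k)          # two-sided critical value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))
print(t_stat, crit, abs(t_stat) > crit, p_value)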
3.8.2 Wald Tests


Wald test directly examines the distance between Rβ and r. Intuitively, if the null hypoth-
esis is true, then Rβ − r ≈ 0. In the small sample framework, the distribution of Rβ − r
follows directly from the properties of normal random variables. Specifically,

Rβ̂ − r ∼ N (0, σ2 R(X0 X)−1 R0 )


Thus, to test the null H0 : Rβ − r = 0 against the alternative H1 : Rβ − r 6= 0, a test statistic
can be based on
WInfeasible = (Rβ̂ − r)0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r)/σ2 (3.50)

which has a χ2m distribution.8 However, this statistic depends on an unknown quantity, σ2 , and to operationalize W , σ2 must be replaced with an estimate, s 2 .

W = [(Rβ̂ − r)0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r)/m ] / s 2 = (σ2 /s 2 ) [(Rβ̂ − r)0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r)/m ] / σ2 (3.51)
The replacement of σ2 with s 2 has an effect on the distribution of the estimator which follows from the definition of an F distribution.

Definition 3.27 (F distribution). Let z 1 ∼ χν21 and let z 2 ∼ χν22 where z 1 and z 2 are inde-
pendent. Then
(z 1 /ν1 )/(z 2 /ν2 ) ∼ Fν1 ,ν2 (3.52)

The conclusion that W has an Fm,n −k distribution follows from the independence of β̂
and ε̂, which in turn implies the independence of β̂ and s 2 .

Theorem 3.28 (Wald test). Under assumptions 3.9 - 3.14,


[(Rβ̂ − r)0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r)/m ] / s 2 ∼ Fm ,n−k (3.53)

Analogous to the t ν distribution, an Fν1 ,ν2 distribution converges to a scaled χ 2 in large


samples (χν21 /ν1 as ν2 → ∞). Figure 3.2 contains failure to reject (FTR) regions for some
hypothetical Wald tests. The shape of the region depends crucially on the correlation be-
tween the hypotheses being tested. For instance, panel (a) corresponds to testing a joint
hypothesis where the tests are independent and have the same variance. In this case, the
  
8
The distribution can be derived noting that [R(X0 X)−1 R0 ]−1/2 (Rβ̂ − r) ∼ N (0, [Im 0; 0 0]), where the
matrix square root makes use of a generalized inverse. A more complete discussion of reduced rank normals
and generalized inverses is beyond the scope of this course.
Figure 3.2: Bivariate plot of an F distribution. The four panels contain the failure-to-reject
regions corresponding to 20, 10 and 1% tests. Panel (a) contains the region for uncorrelated
tests. Panel (b) contains the region for tests with the same variance but a correlation of 0.5.
Panel (c) contains the region for tests with a correlation of -.8 and panel (d) contains the
region for tests with a correlation of 0.5 but with variances of 2 and 0.5 (The test with a
variance of 2 is along the x-axis).

FTR region is a circle. Panel (d) shows the FTR region for highly correlated tests where one
restriction has a larger variance.

Once W has been computed, the test statistic should be compared to the critical value
of an Fm,n −k and rejected if the test statistic is larger. Figure 3.3 contains the pdf of an F5,30
distribution. Any W > 2.049 would lead to rejection of the null hypothesis using a 10%
test.

The Wald test has a more common expression in terms of the SSE from both the re-
stricted and unrestricted models. Specifically,
W = ((SSE R − SSEU )/m ) / (SSEU /(n − k )) = ((SSE R − SSEU )/m ) / s 2 . (3.54)

where SSE R is the sum of squared errors of the restricted model.9 The restricted model is
the original model with the null hypothesis imposed. For example, to test the null H0 : β2 =
β3 = 0 against an alternative that H1 : β2 6= 0 or β3 6= 0 in a bivariate regression,

yi = β1 + β2 x1,i + β3 x2,i + εi (3.55)


the restricted model imposes the null,

yi = β1 + 0x1,i + 0x2,i + εi
= β1 + εi .

The restricted SSE, SSE R is computed using the residuals from this model while the un-
restricted SSE, SSEU , is computed from the general model that includes both x variables
(eq. (3.55)). While Wald tests usually only require the unrestricted model to be estimated,
the difference of the SSEs is useful because it can be computed from the output of any
standard regression package. Moreover, any linear regression subject to linear restrictions
can be estimated using OLS on a modified specification where the constraint is directly
imposed. Consider the set of restrictions, R, in an augmented matrix with r

[R r]
By transforming this matrix into row-echelon form,
 
[Im R̃ r̃]
a set of m restrictions can be derived. This also provides a direct method to check whether
a set of constraints is logically consistent and feasible or if it contains any redundant re-
strictions.

Theorem 3.29 (Restriction Consistency and Redundancy). If [Im R̃ r̃] is [R r] in reduced echelon form, then a set of restrictions is logically consistent if rank(R̃) = rank([Im R̃ r̃]). Additionally, if rank(R̃) = rank([Im R̃ r̃]) = m, then there are no redundant restrictions.
To compute a Wald test using the restricted and unrestricted sums of squared errors:
1. Estimate the unrestricted model yi = xi β + εi , and the restricted model, ỹi = x̃i β + εi .
2. Compute SSE R = ∑ni=1 ε̃2i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted regression, and SSEU = ∑ni=1 ε̂2i where ε̂i = yi − xi β̂ are the residuals from the unrestricted regression.

9
The SSE should be the result of minimizing the squared errors. The centered versions should be used if a constant is included and the uncentered versions if no constant is included.

3. Compute W = ((SSE R − SSEU )/m ) / (SSEU /(n − k )).

4. Compare W to the critical value, Cα , of the Fm ,n−k distribution at size α. Reject the null
hypothesis if W > Cα .
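The steps above can be sketched as follows (simulated data; the hypothetical null sets the last two coefficients to zero).

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, k, m = 300, 4, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([0.1, 0.8, 0.0, 0.0]) + rng.standard_normal(n)

def sse(Z, y):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

sse_u = sse(X, y)                 # unrestricted model
sse_r = sse(X[:, :k - m], y)      # restricted model with the null imposed

W = ((sse_r - sse_u) / m) / (sse_u / (n - k))
p_value = 1 - stats.f.cdf(W, m, n - k)
print(W, p_value)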

Finally, in the same sense that the t -stat is a test of the null H0 : βk = 0 against the
alternative H1 : βk 6= 0, the F -stat of a regression tests whether all coefficients are zero
(except the intercept) against an alternative that at least one is non-zero.

Definition 3.30 (F -stat of a Regression). The F -stat of a regression is the value of a Wald
test that all coefficients are zero except the coefficient on the constant (if one is included).
Specifically, if the unrestricted model is

yi = β1 + β2 x2,i + . . . + βk xk ,i + εi ,

the F -stat is the value of a Wald test of the null H0 : β2 = β3 = . . . = βk = 0 against the
alternative H1 : β j 6= 0, for j = 2, . . . , k and corresponds to a test based on the restricted
regression
yi = β1 + εi .

3.8.3 Example: T and Wald Tests in Cross-Sectional Factor models


Returning to the factor regression example, the t -stats in the 4-factor model can be com-
puted

t j = β̂ j /√(s 2 [(X0 X)−1 ] j j ).

For example, consider a regression of B H e on the set of four factors and a constant,

B Hie = β1 + β2 V W M ie + β3S M Bi + β4 H M L i + β5U M Di + εi


The fit coefficients, t -stats and p-values are contained in Table 3.6.

Definition 3.31 (P-value). The p-value is the smallest size (α) where the null hypothesis may
be rejected. The p-value can be equivalently defined as the largest size where the null hy-
pothesis cannot be rejected.

P-values have the advantage that they are independent of the distribution of the test
statistic. For example, when using a 2-sided t -test, the p-value of a test statistic t is 2(1 −
Ft ν (|t |)) where Ft ν (·) is the CDF of a t -distribution with ν degrees of freedom. In a Wald test, the p-value is 1 − FFν1 ,ν2 (W ) where FFν1 ,ν2 (·) is the CDF of an Fν1 ,ν2 distribution.
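For example, these p-values can be computed with scipy's distribution functions; the sketch below uses the t -stat of 1.440 (973 degrees of freedom) and the Wald statistic of 2.049 (with 2 and 973 degrees of freedom) reported in Table 3.6 and approximately recovers the tabulated p-values.

from scipy import stats

t_stat, df = 1.440, 973
p_t = 2 * (1 - stats.t.cdf(abs(t_stat), df=df))    # two-sided t-test p-value, about 0.150

W, m = 2.049, 2
p_w = 1 - stats.f.cdf(W, m, df)                    # Wald test p-value, about 0.129
print(round(p_t, 3), round(p_w, 3))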
The critical value, Cα , for a 2-sided 10% t test with 973 degrees of freedom (n − 5) is
1.645, and so if |t | > Cα the null hypothesis should be rejected, and the results indicate
Figure 3.3: Rejection region for a F5,30 distribution when using a test with a size of 10%. If
the null hypothesis is true, the test statistic should be relatively small (would be 0 if exactly
true). Large test statistics lead to rejection of the null hypothesis. In this example, a test
statistic with a value greater than 2.049 would lead to a rejection of the null at the 10%
level.

that the null hypothesis that the coefficients on the constant and S M B are zero cannot
be rejected the 10% level. The p-values indicate the null that the constant was 0 could be
rejected at an α of 14% but not one of 13%.

Table 3.6 also contains the Wald test statistics and p-values for a variety of hypothe-
ses, some economically interesting, such as the set of restrictions that the 4 factor model
reduces to the CAPM, β j = 0, j = 1, 3, . . . , 5. Only one regression, the completely unre-
stricted regression, was needed to compute all of the test statistics using Wald tests,

W = (Rβ̂ − r)0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r)/s 2

where R and r depend on the null being tested. For example, to test whether a strict CAPM
was consistent with the observed data,
t Tests
Parameter β̂ √(σ̂2 [(X0 X)−1 ] j j ) t -stat p-val
Constant -0.064 0.043 -1.484 0.138
V WMe 1.077 0.008 127.216 0.000
SM B 0.019 0.013 1.440 0.150
H M L 0.803 0.013 63.736 0.000
U M D -0.040 0.010 -3.948 0.000

Wald Tests
Null Alternative W M p-val
β j = 0, j = 1, . . . , 5 β j 6= 0, j = 1, . . . , 5 6116 5 0.000
β j = 0, j = 1, 3, 4, 5 β j 6= 0, j = 1, 3, 4, 5 1223.1 4 0.000
β j = 0, j = 1, 5 β j 6= 0, j = 1, 5 11.14 2 0.000
β j = 0, j = 1, 3 β j 6= 0, j = 1, 3 2.049 2 0.129
β5 = 0 β5 6= 0 15.59 1 0.000

Table 3.6: The upper panel contains t -stats and p-values for the regression of Big-High
excess returns on the 4 factors and a constant. The lower panel contains test statistics and
p-values for Wald tests of the reported null hypothesis. Both sets of tests were computed
using the small sample assumptions and may be misleading since the residuals are both
non-normal and heteroskedastic.

R = [1 0 0 0 0; 0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1] and r = [0; 0; 0; 0].
All of the null hypotheses save one are strongly rejected with p-values of 0 to three dec-
imal places. The sole exception is H0 : β1 = β3 = 0, which produced a Wald test statistic
of 2.05. The 5% critical value of an F2,973 is 3.005, and so the null hypothesis would not be
rejected at the 5% level. The p-value indicates that the test would be rejected at the 13%
level but not at the 12% level. One further peculiarity appears in the table. The Wald test
statistic for the null H0 : β5 = 0 is exactly the square of the t -test statistic for the same
null. This should not be surprising since W = t 2 when testing a single linear hypothesis.
Moreover, if z ∼ t ν , then z 2 ∼ F1,ν . This can be seen by inspecting the square of a t ν and
applying the definition of an F1,ν -distribution.

3.8.4 Likelihood Ratio Tests

Likelihood Ratio (LR) test are based on the relative probability of observing the data if the
null is valid to the probability of observing the data under the alternative. The test statistic
3.8 Small Sample Hypothesis Testing 173

is defined
L R = −2 ln[ (maxβ ,σ2 f (y|X; β , σ2 ) subject to Rβ = r) / (maxβ ,σ2 f (y|X; β , σ2 )) ] (3.56)

Letting β̂ R denote the constrained estimate of β , this test statistic can be reformulated

L R = −2 ln( f (y|X; β̂ R , σ̂R2 ) / f (y|X; β̂ , σ̂2 ) ) (3.57)
= −2[l (β̂ R , σ̂R2 ; y|X) − l (β̂ , σ̂2 ; y|X)]
= 2[l (β̂ , σ̂2 ; y|X) − l (β̂ R , σ̂R2 ; y|X)]

In the case of the normal log likelihood, L R can be further simplified to10

L R = −2 ln( f (y|X; β̂ R , σ̂R2 ) / f (y|X; β̂ , σ̂2 ) )
= −2 ln[ (2πσ̂R2 )−n /2 exp(−(y − Xβ̂ R )0 (y − Xβ̂ R )/(2σ̂R2 )) / ( (2πσ̂2 )−n /2 exp(−(y − Xβ̂ )0 (y − Xβ̂ )/(2σ̂2 )) ) ]
= −2 ln( (σ̂R2 )−n /2 /(σ̂2 )−n /2 )
= −2 ln( (σ̂R2 /σ̂2 )−n /2 )
= n [ln(σ̂R2 ) − ln(σ̂2 )]
= n [ln(SSE R ) − ln(SSEU )]
Finally, the distribution of the L R statistic can be determined by noting that

L R = n ln(SSE R /SSEU ) = n ln(σ̂R2 /σ̂U2 ) (3.58)

and that

((n − k )/m )[exp(L R /n ) − 1] = W. (3.59)
The transformation between W and L R is monotonic so the transformed statistic has the same distribution as W , an Fm,n−k .

10
Note that σ̂R2 and σ̂2 use n rather than a degree-of-freedom adjustment since they are MLE estimators.

Algorithm 3.32 (Small Sample LR Test). 1. Estimate the unrestricted model yi = xi β + εi , and the restricted model, ỹi = x̃i β + εi .

2. Compute SSE R = ∑ni=1 ε̃2i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted regression, and SSEU = ∑ni=1 ε̂2i where ε̂i = yi − xi β̂ are the residuals from the unrestricted regression.

3. Compute L R = n ln(SSE R /SSEU ).

4. Compute W = ((n − k )/m )[exp(L R /n ) − 1].

5. Compare W to the critical value, Cα , of the Fm ,n−k distribution at size α. Reject the null
hypothesis if W > Cα .
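The algorithm can be sketched in a few lines (simulated data; the hypothetical null again drops the last two regressors). The transformed statistic coincides with the Wald statistic computed from the same two regressions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, k, m = 300, 4, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([0.1, 0.8, 0.0, 0.0]) + rng.standard_normal(n)

def sse(Z, y):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

sse_u, sse_r = sse(X, y), sse(X[:, :k - m], y)

LR = n * np.log(sse_r / sse_u)
W = ((n - k) / m) * (np.exp(LR / n) - 1)           # monotone transformation of LR
print(LR, W, 1 - stats.f.cdf(W, m, n - k))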

3.8.5 Example: LR Tests in Cross-Sectional Factor models

LR tests require estimating the model under both the null and the alternative. In all ex-
amples here, the alternative is the unrestricted model with four factors while the restricted
models (where the null is imposed) vary. The simplest restricted model corresponds to the
most restrictive null, H0 : β j = 0, j = 1, . . . , 5, and is specified

yi = εi .
To compute the likelihood ratio, the conditional mean and variance must be estimated.
In this simple specification, the conditional mean is ŷR = 0 (since there are no parameters) and the conditional variance is estimated using the MLE given this mean, σ̂R2 = y0 y/n (the average of the squared regressands). The mean under the alternative is ŷU = Xβ̂ and the variance is estimated using σ̂U2 = (y − Xβ̂ )0 (y − Xβ̂ )/n . Once these quantities have been computed,
the L R test statistic is calculated

L R = n ln(σ̂R2 /σ̂U2 ) (3.60)

where the identity σ̂R2 /σ̂U2 = SSE R /SSEU has been applied. Finally, L R is transformed by ((n − k )/m )[exp(L R /n ) − 1] to produce the test statistic, which is numerically identical to W . This can be seen by comparing the values in Table 3.7 to those in Table 3.6.

3.8.6 Lagrange Multiplier Tests

Consider minimizing the sum of squared errors subject to a linear hypothesis.


LR Tests
Null Alternative LR M p-val
β j = 0, j = 1, . . . , 5 β j 6= 0, j = 1, . . . , 5 6116 5 0.000
β j = 0, j = 1, 3, 4, 5 β j 6= 0, j = 1, 3, 4, 5 1223.1 4 0.000
β j = 0, j = 1, 5 β j 6= 0, j = 1, 5 11.14 2 0.000
β j = 0, j = 1, 3 β j 6= 0, j = 1, 3 2.049 2 0.129
β5 = 0 β5 6= 0 15.59 1 0.000

LM Tests
Null Alternative LM M p-val
β j = 0, j = 1, . . . , 5 β j 6= 0, j = 1, . . . , 5 189.6 5 0.000
β j = 0, j = 1, 3, 4, 5 β j 6= 0, j = 1, 3, 4, 5 203.7 4 0.000
β j = 0, j = 1, 5 β j 6= 0, j = 1, 5 10.91 2 0.000
β j = 0, j = 1, 3 β j 6= 0, j = 1, 3 2.045 2 0.130
β5 = 0 β5 6= 0 15.36 1 0.000

Table 3.7: The upper panel contains test statistics and p-values using LR tests for using a
regression of excess returns on the big-high portfolio on the 4 factors and a constant. In all
cases the null was tested against the alternative listed. The lower panel contains test statis-
tics and p-values for LM tests of same tests. Note that the LM test statistics are uniformly
smaller than the LR test statistics which reflects that the variance in a LM test is computed
from the model estimated under the null, a value that must be larger than the estimate of
the variance under the alternative which is used in both the Wald and LR tests. Both sets
of tests were computed using the small sample assumptions and may be misleading since
the residuals are non-normal and heteroskedastic.

minβ (y − Xβ )0 (y − Xβ ) subject to Rβ − r = 0.

This problem can be formulated in terms of a Lagrangian,

L(β , λ) = (y − Xβ )0 (y − Xβ ) + (Rβ − r)0 λ

and the problem is


 
maxλ minβ L(β , λ)

The first order conditions correspond to a saddle point,

∂ L/∂ β = −2X0 (y − Xβ ) + R0 λ = 0
∂ L/∂ λ = Rβ − r = 0

pre-multiplying the top FOC by R(X0 X)−1 (which does not change the value, since it is 0),

2R(X0 X)−1 (X0 X)β − 2R(X0 X)−1 X0 y + R(X0 X)−1 R0 λ = 0


⇒ 2Rβ − 2Rβ̂ + R(X0 X)−1 R0 λ = 0

where β̂ is the usual OLS estimator. Solving,


λ̃ = 2[R(X0 X)−1 R0 ]−1 (Rβ̂ − r) (3.61)
β̃ = β̂ − (X0 X)−1 R0 [R(X0 X)−1 R0 ]−1 (Rβ̂ − r) (3.62)

These two solutions provide some insight into the statistical properties of the estima-
tors. β̃ , the constrained regression estimator, is a function of the OLS estimator, β̂ , and a
step in the direction of the constraint. The size of the change is influenced by the distance
between the unconstrained estimates and the constraint (Rβ̂ − r). If the unconstrained
estimator happened to exactly satisfy the constraint, there would be no step.11
The Lagrange multipliers, λ̃, are weighted functions of the unconstrained estimates, β̂ ,
and will be near zero if the constraint is nearly satisfied (Rβ̂ − r ≈ 0). In microeconomics,
Lagrange multipliers are known as shadow prices since they measure the change in the objective function that would occur if the constraint were relaxed a small amount. Note
that β̂ is the only source of randomness in λ̃ (like β̃ ), and so λ̃ is a linear combination of
normal random variables and will also follow a normal distribution. These two properties
combine to provide a mechanism for testing whether the restrictions imposed by the null
are consistent with the data. The distribution of λ̃ can be directly computed and a test
statistic can be formed.
There is another method to derive the LM test statistic that is motivated by the alterna-
tive name of LM tests: Score tests. Returning to the first order conditions and plugging in
the parameters,

R0 λ = 2X0 (y − Xβ̃ )
R0 λ = 2X0 ε̃

where β̃ is the constrained estimate of β and ε̃ are the corresponding estimated errors
(ε̃ = y − Xβ̃ ). Thus R0 λ has the same distribution as 2X0 ε̃. However, under the small sam-
ple assumptions, ε̃ are linear combinations of normal random variables and so are also
11 Even if the constraint is valid, the constraint will never be exactly satisfied.

normal,

2X0 ε̃ ∼ N (0, 4σ2 X0 X)


and

X0 ε̃ ∼ N (0, σ2 X0 X). (3.63)


A test statistic that these are simultaneously zero can be constructed in the same manner
as a Wald test:

LMInfeasible = ε̃0 X(X0 X)−1 X0 ε̃ / σ2 .     (3.64)

However, like a Wald test this statistic is not feasible because σ2 is unknown. Using the
same substitution, the LM test statistic is given by

LM = ε̃0 X(X0 X)−1 X0 ε̃ / s̃2     (3.65)
and has a Fm ,n−k +m distribution where s̃ 2 is the estimated error variance from the con-
strained regression. This is a different estimator than was used in constructing a Wald test
statistic, where the variance was computed from the unconstrained model. Both estimates
are consistent under the null. However, since SSE R ≥ SSEU , s̃ 2 is likely to be larger than s 2 .12
LM tests are usually implemented using a more convenient – but equivalent – form,

LM = [(SSER − SSEU )/m] / [SSER /(n − k + m)].     (3.66)

To use the Lagrange Multiplier principle to conduct a test:

Algorithm 3.33 (Small Sample LM Test). 1. Estimate the unrestricted model yi = xi β + εi , and the restricted model, ỹi = x̃i β + εi .

2. Compute SSER = Σni=1 ε̃2i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted regression, and SSEU = Σni=1 ε̂2i where ε̂i = yi − xi β̂ are the residuals from the unrestricted regression.

3. Compute LM = [(SSER − SSEU )/m] / [SSER /(n − k + m)].

4. Compare LM to the critical value, Cα , of the Fm,n−k+m distribution at size α. Reject the null hypothesis if LM > Cα .

Alternatively,
12 Note that since the degree-of-freedom adjustment in the two estimators is different, the magnitude of the estimated variance is not directly proportional to SSER and SSEU .

Algorithm 3.34 (Small Sample LM Test). 1. Estimate the restricted model, ỹi = x̃i β + εi .

2. Compute LM = [ε̃0 X(X0 X)−1 X0 ε̃ / m] / s2 where X is the n by k matrix of regressors from the unconstrained model and s2 = Σni=1 ε̃2i /(n − k + m).

3. Compare L M to the critical value, Cα , of the Fm,n −k +m distribution at size α. Reject the
null hypothesis if L M > Cα .
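A minimal sketch of Algorithm 3.33 is given below, assuming NumPy and SciPy are available and that the unrestricted regressor matrix X (including a constant), the regressand y, the restricted regressor matrix X_r and the number of restrictions m have already been constructed; all names are illustrative rather than part of the text.

# Sketch: small sample LM test from restricted and unrestricted SSEs (Algorithm 3.33)
import numpy as np
from scipy import stats

def small_sample_lm(y, X, X_r, m):
    """LM = [(SSE_R - SSE_U)/m] / [SSE_R/(n - k + m)], compared to an F(m, n-k+m)."""
    n, k = X.shape
    e_u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]      # unrestricted residuals
    e_r = y - X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]  # restricted residuals
    sse_u, sse_r = e_u @ e_u, e_r @ e_r
    lm = ((sse_r - sse_u) / m) / (sse_r / (n - k + m))
    pval = 1 - stats.f.cdf(lm, m, n - k + m)
    return lm, pval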

3.8.7 Example: LM Tests in Cross-Sectional Factor models

Table 3.7 also contains values from LM tests. LM tests have slightly different distributions than the Wald and LR tests and do not produce numerically identical results. While the Wald and
LR tests require estimation of the unrestricted model (estimation under the alternative),
LM tests only require estimation of the restricted model (estimation under the null). For
example, in testing the null H0 : β1 = β5 = 0 (that the U M D factor has no explanatory
power and that the intercept is 0), the restricted model is estimated from

B Hie = γ1 V W M ie + γ2S M Bi + γ3 H M L i + εi .

The two conditions, that β1 = 0 and that β5 = 0 are imposed by excluding these regressors.
Once the restricted regression is fit, the residuals estimated under the null, ε̃i = yi − xi β̃
are computed and the LM test is calculated from

LM = ε̃0 X(X0 X)−1 X0 ε̃ / s2

where X is the set of explanatory variables from the unrestricted regression (in this case, xi = [1 V W M ie S M Bi H M L i U M Di ]). Examining table 3.7, the LM test statistics are con-
siderably smaller than the Wald test statistics. This difference arises since the variance used
in computing the LM test statistic, σ̃2 , is estimated under the null. For instance, in the most
restricted case (H0 = β j = 0, j = 1, . . . , k ), the variance is estimated by y0 y/N (since k = 0
in this model) which is very different from the variance estimated under the alternative
(which is used by both the Wald and LR). Despite the differences in the test statistics, the
p-values in the table would result in the same inference. For the one hypothesis that is not
completely rejected, the p-value of the LM test is slightly larger than that of the LR (or W).
However, .130 and .129 should never make a qualitative difference (nor should .101 and
.099, even when using a 10% test). These results highlight a general feature of LM tests:
test statistics based on the LM-principle are smaller than Likelihood Ratios and Wald tests,
and so less likely to reject.

[Figure 3.4: Location of the three test statistics]

Figure 3.4: Graphical representation of the three major classes of tests. The Wald test mea-
sures the magnitude of the constraint, Rβ − r , at the OLS parameter estimate, β̂ . The
LM test measures the magnitude of the score at the restricted estimator (β̃ ) while the LR
test measures the difference between the SSE at the restricted estimator and the SSE at
the unrestricted estimator. Note: Only the location of the test statistic, not their relative
magnitudes, can be determined from this illustration.

3.8.8 Comparing the Wald, LR and LM Tests

With three tests available to test the same hypothesis, which is the correct one? In the small
sample framework, the Wald is the obvious choice because W ≈ L R and W is larger than
L M . However, the LM has a slightly different distribution, so it is impossible to make an
absolute statement. The choice among these three tests reduces to user preference and
ease of computation. Since computing SSEU and SSE R is simple, the Wald test is likely the
simplest to implement.
These results are no longer true when nonlinear restrictions and/or nonlinear models
are estimated. Further discussion of the factors affecting the choice between the Wald, LR
and LM tests will be reserved until then. Figure 3.4 contains a graphical representation of
the three test statistics in the context of a simple regression, yi = β xi + εi .13 The Wald
13 The magnitudes of the lines are not to scale, so the magnitude of the test statistics cannot be determined from the picture.

test measures the magnitude of the constraint R β − r at the unconstrained estimator β̂ .


The LR test measures how much the sum of squared errors has changed between β̂ and β̃ .
Finally, the LM test measures the magnitude of the gradient, X0 (y − Xβ ) at the constrained
estimator β̃ .

3.9 Large Sample Assumption


While the small sample assumptions allow the exact distribution of the OLS estimator and
test statistics to be derived, these assumptions are not realistic in applications using finan-
cial data. Asset returns are non-normal (both skewed and leptokurtic), heteroskedastic,
and correlated. The large sample framework allows for inference on β without making
strong assumptions about the distribution or error covariance structure. However, the gen-
erality of the large sample framework comes at the loss of the ability to say anything exact
about the estimates in finite samples.
Four new assumptions are needed to analyze the asymptotic behavior of the OLS esti-
mators.
Assumption 3.35 (Stationary Ergodicity). {(xi , εi )} is a strictly stationary and ergodic se-
quence.
This is a technical assumption needed for consistency and asymptotic normality. It im-
plies two properties about the joint density of {(xi , εi )}: the joint distribution of {(xi , εi )}
and {(xi + j , εi + j )} depends on the time between observations ( j ) and not the observation
index (i ) and that averages will converge to their expected value (as long as they exist).
There are a number of alternative assumptions that could be used in place of this assumption, although this assumption is broad enough to allow for i.i.d., i.n.i.d. (independent but not identically distributed, including heteroskedasticity), and some n.i.n.i.d. data, although it
does rule out some important cases. Specifically, the regressors cannot be trending or oth-
erwise depend on the observation index, an important property of some economic time
series such as the level of a market index or aggregate consumption. Stationarity will be
considered more carefully in the time-series chapters.
Assumption 3.36 (Rank). E[x0i xi ] = ΣXX is nonsingular and finite.
This assumption, like assumption 3.11, is needed to ensure identification.
Assumption 3.37 (Martingale Difference). {x0i εi , Fi } is a martingale difference sequence,
E[(x j,i εi )2 ] < ∞, j = 1, 2, . . . , k , i = 1, 2, . . .

and S = V[n−1/2 X0 ε] is finite and non-singular.
A martingale difference sequence has the property that its mean is unpredictable using the
information contained in the information set (Fi ).

Definition 3.38 (Martingale Difference Sequence). Let {zi } be a vector stochastic process
and Fi be the information set corresponding to observation i containing all information
available when observation i was collected except zi . {zi , Fi } is a martingale difference
sequence if
E[zi |Fi ] = 0

In the context of the linear regression model, it states that the current score is not pre-
dictable by any of the previous scores, that the mean of the scores is zero (E[X0i εi ] = 0),
and there is no other variable in Fi which can predict the scores. This assumption is suf-
ficient to ensure that n −1/2 X0 ε will follow a Central Limit Theorem, and it plays a role in
consistency of the estimator. A m.d.s. is a fairly general construct and does not exclude
using time-series regressors as long as they are predetermined, meaning that they do not
depend on the process generating εi . For instance, in the CAPM, the return on the market
portfolio can be thought of as being determined independently of the idiosyncratic shock
affecting individual assets.

Assumption 3.39 (Moment Existence). E[x4j,i ] < ∞, i = 1, 2, . . ., j = 1, 2, . . . , k and E[ε2i ] = σ2 < ∞, i = 1, 2, . . ..

This final assumption requires that the fourth moment of any regressor exists and the
variance of the errors is finite. This assumption is needed to derive a consistent estimator
of the parameter covariance.

3.10 Large Sample Properties


These assumptions lead to two key theorems about the asymptotic behavior of β̂ : it is con-
sistent and asymptotically normally distributed. First, some new notation is needed. Let
β̂ n = (X0 X/n)−1 (X0 y/n)     (3.67)
be the regression coefficient using n realizations from the stochastic process {xi , εi }.

Theorem 3.40 (Consistency of β̂ ). Under assumptions 3.9 and 3.35 - 3.37,

β̂ n →p β

Consistency is a weak property of the OLS estimator, but it is important. This result relies crucially on the implication of assumption 3.37 that n−1 X0 ε →p 0, and under the same assumptions, the OLS estimator is also asymptotically normal.

Theorem 3.41 (Asymptotic Normality of β̂ ). Under assumptions 3.9 and 3.35 - 3.37
√n (β̂ n − β ) →d N (0, Σ−1XX S Σ−1XX )     (3.68)

where ΣXX = E[x0i xi ] and S = V[n −1/2 X0 ε]

Asymptotic normality provides the basis for hypothesis tests on β . However, using only
theorem 3.41, tests are not feasible since ΣXX and S are unknown, and so must be estimated.

Theorem 3.42 (Consistency of OLS Parameter Covariance Estimator). Under assumptions 3.9 and 3.35 - 3.39,

Σ̂XX = n−1 X0 X →p ΣXX
Ŝ = n−1 Σni=1 ε̂2i x0i xi →p S
  = n−1 X0 ÊX

and

Σ̂−1XX Ŝ Σ̂−1XX →p Σ−1XX S Σ−1XX

where Ê = diag(ε̂21 , . . . , ε̂2n ) is a matrix with the estimated residuals squared along its diago-
nal.

Combining these theorems, the OLS estimator is consistent, asymptotically normal and
the asymptotic variance can be consistently estimated. These three properties provide the
tools necessary to conduct hypothesis tests in the asymptotic framework. The usual esti-
mator for the variance of the residuals is also consistent for the variance of the innovations
under the same conditions.

Theorem 3.43 (Consistency of OLS Variance Estimator). Under assumptions 3.9 and 3.35 -
3.39 ,
σ̂2n = n−1 ε̂0 ε̂ →p σ2

Further, if homoskedasticity is assumed, then the parameter covariance estimator can


be simplified.

Theorem 3.44 (Homoskedastic Errors). Under assumptions 3.9, 3.12, 3.13 and 3.35 - 3.39,
√n (β̂ n − β ) →d N (0, σ2 Σ−1XX )

Combining the result of this theorem with that of theorems 3.42 and 3.43, a consistent estimator of σ2 Σ−1XX is given by σ̂2n Σ̂−1XX .

3.11 Large Sample Hypothesis Testing


All three tests, Wald, LR, and LM have large sample equivalents that exploit the asymptotic
normality of the estimated parameters. While these tests are only asymptotically exact,

the use of the asymptotic distribution is justified as an approximation to the finite-sample


distribution, although the quality of the CLT approximation depends on how well behaved
the data are.

3.11.1 Wald Tests

Recall from Theorem 3.41,


√n (β̂ n − β ) →d N (0, Σ−1XX S Σ−1XX ).     (3.69)
Applying the properties of a normal random variable, if z ∼ N (µ, Σ) then c0 z ∼ N (c0 µ, c0 Σc), and if w ∼ N (µ, σ2 ) then (w − µ)2 /σ2 ∼ χ21 . Using these two properties, a test of the null

H0 : Rβ − r = 0

against the alternative

H1 : Rβ − r ≠ 0

can be constructed.
Following from Theorem 3.41, if H0 : Rβ − r = 0 is true, then
√n (Rβ̂ n − r) →d N (0, RΣ−1XX S Σ−1XX R0 )     (3.70)

and

Γ−1/2 √n (Rβ̂ n − r) →d N (0, Ik )     (3.71)

where Γ = RΣ−1XX S Σ−1XX R0 . Under the null that H0 : Rβ − r = 0,

n (Rβ̂ n − r)0 [RΣ−1XX S Σ−1XX R0 ]−1 (Rβ̂ n − r) →d χ2m     (3.72)

where m is the rank(R). This estimator is not feasible since Γ is not known and must be
estimated. Fortunately, Γ can be consistently estimated by applying the results of Theorem
3.42

Σ̂XX = n−1 X0 X,     Ŝ = n−1 Σni=1 ε̂2i x0i xi

and so

Γ̂ = Σ̂−1XX Ŝ Σ̂−1XX .

The feasible Wald statistic is defined

W = n (Rβ̂ n − r)0 [RΣ̂−1XX Ŝ Σ̂−1XX R0 ]−1 (Rβ̂ n − r) →d χ2m .     (3.73)

Test statistic values can be compared to the critical value Cα from a χ2m at the α-significance level and the null is rejected if W is greater than Cα . The asymptotic t -test (which has a normal distribution) is defined analogously,

t = √n (Rβ̂ n − r) / √(RΓ̂ R0 ) →d N (0, 1),     (3.74)

where R is a 1 by k vector. Typically R is a vector with 1 in its jth element, producing the statistic

t = √n β̂ j / √([Γ̂ ] j j ) →d N (0, 1)

where [Γ̂ ] j j is the jth diagonal element of Γ̂ .



The n term in the Wald statistic (or √n in the t -test) may appear strange at first, although these terms are present in the classical tests as well. Recall that the t -stat (null H0 : β j = 0) from the classical framework with homoskedastic data is given by

t1 = β̂ j / √(σ̂2 [(X0 X)−1 ] j j ).

The t -stat in the asymptotic framework is

t2 = √n β̂ j / √(σ̂2 [Σ̂−1XX ] j j ).

If t1 is multiplied and divided by √n , then

t1 = √n β̂ j / (√n √(σ̂2 [(X0 X)−1 ] j j )) = √n β̂ j / √(σ̂2 [(X0 X/n)−1 ] j j ) = √n β̂ j / √(σ̂2 [Σ̂−1XX ] j j ) = t2 ,

and these two statistics have the same value since X0 X differs from Σ̂XX by a factor of n .

Algorithm 3.45 (Large Sample Wald Test). 1. Estimate the unrestricted model yi = xi β + εi .

2. Estimate the parameter covariance using Σ̂−1XX Ŝ Σ̂−1XX where

Σ̂XX = n−1 Σni=1 x0i xi ,     Ŝ = n−1 Σni=1 ε̂2i x0i xi

3. Construct the restriction matrix, R, and the value of the restriction, r, from the null hypothesis.

4. Compute W = n (Rβ̂ n − r)0 [RΣ̂−1XX Ŝ Σ̂−1XX R0 ]−1 (Rβ̂ n − r).

5. Reject the null if W > Cα where Cα is the critical value from a χ2m using a size of α.

3.11.2 Lagrange Multiplier Tests

Recall that the first order conditions of the constrained estimation problem require

R0 λ̂ = 2X0 ε̃
where ε̃ are the residuals estimated under the null H0 : Rβ − r = 0. The LM test exam-
ines whether λ is close to zero. In the large sample framework, λ̂, like β̂ , is asymptotically
normal and R0 λ̂ will only be close to 0 if λ̂ ≈ 0. The asymptotic version of the LM test
can be compactly expressed if s̃ is defined as the average score of the restricted estimator,
s̃ = n −1 X0 ε̃. In this notation,

LM = n s̃0 S−1 s̃ →d χ2m .     (3.75)
If the model is correctly specified, n−1 X0 ε̃, which is a k by 1 vector with jth element n−1 Σni=1 x j,i ε̃i , should be a mean-zero vector with asymptotic variance S by assumption 3.35. Thus, √n (n−1 X0 ε̃) →d N (0, S) implies

√n S−1/2 s̃ →d N (0, [ Im 0 ; 0 0 ])     (3.76)

and so n s̃0 S−1 s̃ →d χ2m . This version is infeasible and the feasible version of the LM test must be used,

LM = n s̃0 S̃−1 s̃ →d χ2m .     (3.77)
where S̃ = n−1 Σni=1 ε̃2i x0i xi is the estimator of the asymptotic variance computed under the null. This means that S̃ is computed using the residuals from the restricted regression, ε̃,
and that it will differ from the usual estimator Ŝ which is computed using residuals from
the unrestricted regression, ε̂. Under the null, both S̃ and Ŝ are consistent estimators for S
and using one or the other has no asymptotic effect.
If the residuals are homoskedastic, the LM test can also be expressed in terms of the R2 of the unrestricted model when testing a null that the coefficients on all explanatory variables except the intercept are zero. Suppose the regression fit was

yi = β0 + β1 x1,i + β2 x2,i + . . . + βk xk,i .

To test H0 : β1 = β2 = . . . = βk = 0 (where the excluded β0 corresponds to the constant),

LM = n R2 →d χ2k     (3.78)
is equivalent to the test statistic in eq. (3.77). This expression is useful as a simple tool to
jointly test whether the explanatory variables in a regression appear to explain any varia-
tion in the dependent variable. If the residuals are heteroskedastic, the n R2 form of the LM test does not have a standard distribution and should not be used.

Algorithm 3.46 (Large Sample LM Test). 1. Form the unrestricted model, yi = xi β + εi .

2. Impose the null on the unrestricted model and estimate the restricted model, ỹi = x̃i β +
εi .

3. Compute the residuals from the restricted regression, ε̃i = ỹi − x̃i β̃ .

4. Construct the score using the residuals from the restricted regression, s̃i = xi ε̃i , where xi are the regressors from the unrestricted model.

5. Estimate the average score and the covariance of the score,

s̃ = n−1 Σni=1 s̃i ,     S̃ = n−1 Σni=1 s̃0i s̃i     (3.79)

6. Compute the LM test statistic as LM = n s̃ S̃−1 s̃0 .

7. Reject the null if LM > Cα where Cα is the critical value from a χ2m using a size of α.

3.11.3 Likelihood Ratio Tests


One major difference between small sample hypothesis testing and large sample hypoth-
esis testing is the omission of assumption 3.14. Without this assumption, the distribution
of the errors is left unspecified. Based on the ease of implementing the Wald and LM tests in their asymptotic framework, it may be tempting to think the likelihood ratio is also asymptotically valid. This is not the case. The technical details are complicated, but the proof relies crucially on the Information Matrix Equality holding. When the data are heteroskedastic or not normal, the IME will generally not hold and the distribution of LR tests will be nonstandard.14
There is, however, a feasible likelihood-ratio like test available. The motivation for this
test will be clarified in the GMM chapter. For now, the functional form will be given with
only minimal explanation,
14 In this case, the LR will converge to a weighted mixture of m independent χ21 random variables where the weights are not 1. The resulting distribution is not a χ2m .

LR = n s̃0 S−1 s̃ →d χ2m ,     (3.80)
where s̃ = n −1 X0 ε̃ is the average score vector when the estimator is computed under the
null. This statistic is similar to the LM test statistic, although there are two differences. First,
one term has been left out of this expression, and the formal definition of the asymptotic
LR is

LR = n (s̃0 S−1 s̃ − ŝ0 S−1 ŝ) →d χ2m     (3.81)
where ŝ = n −1 X0 ε̂ are the average scores from the unrestricted estimator. Recall from the
first-order conditions of OLS (eq. (3.7)) that ŝ = 0 and the second term in the general
expression of the L R will always be zero. The second difference between L R and L M exists
only in the feasible versions. The feasible version of the LR is given by

d
L R = n s̃0 Ŝ−1 s̃ → χm
2
. (3.82)
where Ŝ is estimated using the scores of the unrestricted model (under the alternative),
n
1X 2 0
Ŝ −1
= ε̂i xi xi . (3.83)
n
i =1

The feasible LM, n s̃0 S̃−1 s̃, uses a covariance estimator (S̃) based on the scores from the re-
stricted model, s̃.
In models with heteroskedasticity it is impossible to determine a priori whether the
LM or the LR test statistic will be larger, although folk wisdom states that LR test statistics
are larger than LM test statistics (and hence the LR will be more powerful). If the data
are homoskedastic, and homoskedastic estimators of Ŝ and S̃ are used (σ̂2 (X0 X/n )−1 and
σ̃2 (X0 X/n )−1 , respectively), then it must be the case that L M < L R . This follows since OLS
minimizes the sum of squared errors, and so σ̂2 must be smaller than σ̃2 , and so the LR
can be guaranteed to have more power in this case since the LR and LM have the same
asymptotic distribution.

Algorithm 3.47 (Large Sample LR Test). 1. Estimate the unrestricted model yi = xi β +


εi .

2. Impose the null on the unrestricted model and estimate the restricted model, ỹi = x̃i β +
εi .

3. Compute the residuals from the restricted regression, ε̃i = ỹi − x̃i β̃ , and from the un-
restricted regression, ε̂i = yi − xi β̂ .

4. Construct the score from both models, s̃i = xi ε̃i and ŝi = xi ε̂i , where in both cases xi
are the regressors from the unrestricted model.
5. Estimate the average score and the covariance of the score,

s̃ = n−1 Σni=1 s̃i ,     Ŝ = n−1 Σni=1 ŝ0i ŝi     (3.84)

6. Compute the LR test statistic as LR = n s̃ Ŝ−1 s̃0 .

7. Reject the null if LR > Cα where Cα is the critical value from a χ2m using a size of α.

3.11.4 Revisiting the Wald, LM and LR tests

The previous test results, all based on the restrictive small sample assumptions, can now be
revisited using the test statistics that allow for heteroskedasticity. Tables 3.8 and 3.9 contain
the values for t -tests, Wald tests, LM tests and LR tests using the large sample versions
of these test statistics as well as the values previously computed using the homoskedastic
small sample framework.
There is a clear direction in the changes of the test statistics: most are smaller, some
substantially. Examining table 3.8, 4 out of 5 of the t -stats have decreased. Since the esti-
mator of β̂ is the same in both the small sample and the large sample frameworks, all of
the difference is attributable to changes in the standard errors, which typically increased
by 50%. When t -stats differ dramatically under the two covariance estimators, the likely
cause is heteroskedasticity.
Table 3.9 shows that the Wald, LR and LM test statistics also changed by large amounts.15
Using the heteroskedasticity robust covariance estimator, the Wald statistics decreased by
up to a factor of 2 and the robust LM test statistics decreased by up to 5 times. The LR
test statistic values were generally larger than those of the corresponding Wald or LM test
statistics. The relationship between the robust versions of the Wald and LR statistics is not
clear, and for models that are grossly misspecified, the Wald and LR test statistics are sub-
stantially larger than their LM counterparts. However, when the value of the test statistics
is smaller, the three are virtually identical, and inference made using of any of these three
tests is the same. All nulls except H0 : β1 = β3 = 0 would be rejected using standard sizes
(5-10%).
These changes should serve as a warning against conducting inference using covariance
estimates based on homoskedasticity. In most applications to financial time-series, het-
eroskedasticity robust covariance estimators (and often HAC (Heteroskedasticity and Au-
tocorrelation Consistent), which will be defined in the time-series chapter) are automati-
cally applied without testing for heteroskedasticity.
15 The statistics based on the small-sample assumptions have Fm,t−k or Fm,t−k+m distributions while the statistics based on the large-sample assumptions have χ2m distributions, and so the values of the small sample statistics must be multiplied by m to be compared to the large sample statistics.

t Tests
Homoskedasticity Heteroskedasticity
Parameter β̂ S.E. t -stat p-val S.E. t -stat p-val
Constant -0.064 0.043 -1.484 0.138 0.043 -1.518 0.129
V WMe 1.077 0.008 127.216 0.000 0.008 97.013 0.000
SM B 0.019 0.013 1.440 0.150 0.013 0.771 0.440
HML 0.803 0.013 63.736 0.000 0.013 43.853 0.000
UMD -0.040 0.010 -3.948 0.000 0.010 -2.824 0.005

Table 3.8: Comparing small and large sample t -stats. The small sample statistics, in the left
panel of the table, overstate the precision of the estimates. The heteroskedasticity robust
standard errors are larger for 4 out of 5 parameters and one variable which was significant
at the 15% level is insignificant.

3.12 Violations of the Large Sample Assumptions

The large sample assumptions are just that: assumptions. While this set of assumptions is
far more general than the finite-sample setup, they may be violated in a number of ways.
This section examines the consequences of certain violations of the large sample assump-
tions.

3.12.1 Omitted and Extraneous Variables

Suppose that the model is linear but misspecified and a subset of the relevant regressors
are excluded. The model can be specified

yi = β 1 x1,i + β 2 x2,i + εi (3.85)

where x1,i is 1 by k1 vector of included regressors and x2,i is a 1 by k2 vector of excluded but
relevant regressors. Omitting x2,i from the fit model, the least squares estimator is
β̂ 1n = (X01 X1 /n)−1 (X01 y/n).     (3.86)
Using the asymptotic framework, the estimator can be shown to have a general form of
bias.

Theorem 3.48 (Misspecified Regression). Under assumptions 3.9 and 3.35 - 3.37, if X can be partitioned [X1 X2 ] where X1 corresponds to included variables while X2 corresponds

Wald Tests
Small Sample Large Sample
Null M W p-val W p-val
β j = 0, j = 1, . . . , 5 5 6116 0.000 16303 0.000
β j = 0, j = 1, 3, 4, 5 4 1223.1 0.000 2345.3 0.000
β j = 0, j = 1, 5 2 11.14 0.000 13.59 0.001
β j = 0, j = 1, 3 2 2.049 0.129 2.852 0.240
β5 = 0 1 15.59 0.000 7.97 0.005

LR Tests
Small Sample Large Sample
Null M L R p-val L R p-val
β j = 0, j = 1, . . . , 5 5 6116 0.000 16303 0.000
β j = 0, j = 1, 3, 4, 5 4 1223.1 0.000 10211.4 0.000
β j = 0, j = 1, 5 2 11.14 0.000 13.99 0.001
β j = 0, j = 1, 3 2 2.049 0.129 3.088 0.213
β5 = 0 1 15.59 0.000 8.28 0.004

LM Tests
Small Sample Large Sample
Null M L M p-val L M p-val
β j = 0, j = 1, . . . , 5 5 190 0.000 106 0.000
β j = 0, j = 1, 3, 4, 5 4 231.2 0.000 170.5 0.000
β j = 0, j = 1, 5 2 10.91 0.000 14.18 0.001
β j = 0, j = 1, 3 2 2.045 0.130 3.012 0.222
β5 = 0 1 15.36 0.000 8.38 0.004

Table 3.9: Comparing large- and small-sample Wald, LM and LR test statistics. The large-
sample test statistics are smaller than their small-sample counterparts, a result of the
heteroskedasticity present in the data. While the main conclusions are unaffected by the
choice of covariance estimator, this will not always be the case.

to excluded variables with non-zero coefficients, then

β̂ 1n →p β 1 + Σ−1X1X1 ΣX1X2 β 2     (3.87)
β̂ 1n →p β 1 + δβ 2

where

ΣXX = [ ΣX1X1  ΣX1X2 ; Σ0X1X2  ΣX2X2 ]
The bias term, δβ 2 , is composed of two elements. The first, δ, is a matrix of regression coefficients

where the jth column is the probability limit of the least squares estimator in the regression

X2 j = X1 δ j + ν,
where X2 j is the jth column of X2 . The second component of the bias term is the original
regression coefficients. As should be expected, larger coefficients on omitted variables lead
to larger bias.
β̂ 1n →p β 1 under one of three conditions:

1. δ̂n →p 0

2. β 2 = 0

3. The product δ̂n β 2 →p 0.
β 2 has been assumed to be non-zero (if β 2 = 0 the model is correctly specified). δn →p 0 only if the regression coefficients of X2 on X1 are zero, which requires the omitted and included regressors to be uncorrelated (X2 lies in the null space of X1 ). This assumption
should be considered implausible in most applications and β̂ 1n will be biased, although
certain classes of regressors, such as dummy variables, are mutually orthogonal and can be
safely omitted.16 Finally, if both δ and β 2 are non-zero, the product could be zero, although
without a very peculiar specification and a careful selection of regressors, this possibility
should be considered unlikely.
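The bias expression in Theorem 3.48 is easy to verify in a small simulation. The sketch below (NumPy; the parameter values and seed are illustrative) generates correlated regressors, omits one, and compares the short-regression estimate to β 1 + δβ 2.

# Sketch: omitted variable bias, beta_1_hat converges to beta_1 + delta * beta_2
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.standard_normal(n)
x2 = 0.6 * x1 + rng.standard_normal(n)            # included and omitted regressors correlate
y = 1.0 * x1 + 0.5 * x2 + rng.standard_normal(n)  # beta_1 = 1, beta_2 = 0.5

b1_short = (x1 @ y) / (x1 @ x1)                   # regression omitting x2
delta = (x1 @ x2) / (x1 @ x1)                     # coefficient from regressing x2 on x1
print(b1_short, 1.0 + delta * 0.5)                # the two values should be close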
Alternatively, consider the case where some irrelevant variables are included. The cor-
rect model specification is

yi = x1,i β 1 + εi

and the model estimated is

yi = x1,i β 1 + x2,i β 2 + εi

As long as the assumptions of the asymptotic framework are satisfied, the least squares
estimator is consistent under theorem 3.40 and
" # " #
p β1 β1
β̂ n → =
β2 0

If the errors are homoskedastic, the variance of n(β̂ n − β ) is σ2 Σ−1 XX where X = [X1 X2 ].
The variance of β̂ 1n is the upper left k1 by k1 block of σ ΣXX . Using the partitioned inverse,
2 −1

16 Safely in terms of consistency of estimated parameters. Omitting variables will cause the estimated variance to be inconsistent.

" #
Σ−1
X1 X1 + Σ X1 X1 Σ X1 X2 M 1 Σ X1 X2 Σ X1 X1
−1 0 −1
−Σ−1X1 X1 ΣX1 X2 M1
Σ−1
XX =
M 1 Σ X1 X2 Σ X1 X1
0 −1
ΣX2 X2 + ΣX2 X2 Σ0X1 X2 M2 ΣX1 X2 Σ−1
−1 −1
X2 X2

where
X02 MX1 X2
M1 = lim
n→∞ n
X01 MX2 X1
M2 = lim
n →∞ n

and so the upper left block of the variance, Σ−1 X1 X1 + ΣX1 X1 ΣX1 X2 M1 ΣX1 X2 ΣX1 X1 , must be larger
−1 0 −1

than Σ−1X1 X1 because the second term is a quadratic form and M1 is positive semi-definite.
17

Noting that σ̂ is consistent under both the correct specification and the expanded speci-
2

fication, the cost of including extraneous regressors is an increase in the asymptotic vari-
ance.
In finite samples, there is a bias-variance tradeoff. Fewer regressors included in a model
leads to more precise estimates, while including more variables leads to less bias, and
when relevant variables are omitted σ̂2 will be larger and the estimated parameter vari-
ance, σ̂2 (X0 X)−1 must be larger.
Asymptotically only bias remains as it is of higher order than variance (scaling β̂ n − β by √n , the bias is exploding while the variance is constant), and so when the sample size is large and estimates are precise, a larger model should be preferred to a smaller model.
In cases where the sample size is small, there is a justification for omitting a variable to
enhance the precision of those remaining, particularly when the effect of the omitted vari-
able is not of interest or when the excluded variable is highly correlated with one or more
included variables.

3.12.2 Errors Correlated with Regressors

Bias can arise from sources other than omitted variables. Consider the case where X is
measured with noise and define x̃i = xi + ηi where x̃i is a noisy proxy for xi , the “true”
(unobserved) regressor, and ηi is an i.i.d. mean 0 noise process which is independent of X
and ε with finite second moments Σηη . The OLS estimator,

β̂ n = (X̃0 X̃/n)−1 (X̃0 y/n)     (3.88)
    = ((X + η)0 (X + η)/n)−1 ((X + η)0 y/n)     (3.89)
17 Both M1 and M2 are covariance matrices of the residuals of regressions of x2 on x1 and x1 on x2 respectively.

    = (X0 X/n + X0 η/n + η0 X/n + η0 η/n)−1 ((X + η)0 y/n)     (3.90)
    = (X0 X/n + X0 η/n + η0 X/n + η0 η/n)−1 (X0 y/n + η0 y/n)     (3.91)

will be biased downward. To understand the source of the bias, consider the behavior,
under the asymptotic assumptions, of

X0 X/n →p ΣXX
X0 η/n →p 0
η0 η/n →p Σηη
X0 y/n →p ΣXX β
η0 y/n →p 0

so

(X0 X/n + X0 η/n + η0 X/n + η0 η/n)−1 →p (ΣXX + Σηη )−1

and

β̂ n →p (ΣXX + Σηη )−1 ΣXX β .

If Σηη ≠ 0, then β̂ n does not converge in probability to β and the estimator is inconsistent.
The OLS estimator is also biased in the case where n−1 X0 ε does not converge in probability to 0k , which arises in situations with endogeneity. In these cases, xi and εi are simultaneously determined and correlated. This correlation results in a biased estimator since β̂ n →p β + Σ−1XX ΣXε where ΣXε is the probability limit of n−1 X0 ε. The classic example of endogeneity is simultaneous equation

models although many situations exist where the innovation may be correlated with one
or more regressors; omitted variables can be considered a special case of endogeneity by
reformulating the model.
The solution to this problem is to find an instrument, zi , which is correlated with the
endogenous variable, xi , but uncorrelated with εi . Intuitively, the endogenous portions
of xi can be annihilated by regressing xi on zi and using the fit values. This procedure is
known as instrumental variable (IV) regression in the case where the number of zi variables
is the same as the number of xi variables and two-stage least squares (2SLS) when the size
of zi is larger than k .
Define zi as a vector of exogenous variables where zi may contain any of the variables
in xi which are exogenous. However, all endogenous variables – those correlated with the

error – must be excluded.


First, a few assumptions must be reformulated.

Assumption 3.49 (IV Stationary Ergodicity). {(zi , xi , εi )} is a strictly stationary and ergodic
sequence.

Assumption 3.50 (IV Rank). E[z0i xi ] = ΣZX is nonsingular and finite.

Assumption 3.51 (IV Martingale Difference). {z0i εi , Fi } is a martingale difference sequence,


E[(z j,i εi )2 ] < ∞, j = 1, 2, . . . , k , i = 1, 2, . . .

and S = V[n−1/2 Z0 ε] is finite and non-singular.

Assumption 3.52 (IV Moment Existence). E[x4j,i ] < ∞ and E[z4j,i ] < ∞, j = 1, 2, . . . , k , i = 1, 2, . . . and E[ε2i ] = σ2 < ∞, i = 1, 2, . . ..

These four assumptions are nearly identical to the four used to establish the asymptotic
normality of the OLS estimator. The IV estimator is defined
β̂ IVn = (Z0 X/n)−1 (Z0 y/n)     (3.92)
where the subscript n indicates the number of observations used in the IV estimator. The asymptotic properties are easy to establish and are essentially identical to those of
the OLS estimator.

Theorem 3.53 (Consistency of the IV Estimator). Under assumptions 3.9 and 3.49-3.51, the
IV estimator is consistent,
β̂ IVn →p β

and asymptotically normal

√n (β̂ IVn − β ) →d N (0, Σ−1ZX S̈ Σ0−1ZX )     (3.93)

where ΣZX = E[x0i zi ] and S̈ = V[n −1/2 Z0 ε].

Additionally, consistent estimators are available for the components of the asymptotic
variance.

Theorem 3.54 (Asymptotic Normality of the IV Estimator). Under assumptions 3.9 and 3.49
- 3.52,
Σ̂ZX = n−1 Z0 X →p ΣZX     (3.94)

S̈ˆ = n−1 Σni=1 ε̂2i z0i zi →p S̈     (3.95)

and

Σ̂−1ZX S̈ˆ Σ̂0−1ZX →p Σ−1ZX S̈ Σ0−1ZX     (3.96)

The asymptotic variance can be easily computed from

Σ̂−1ZX S̈ˆ Σ̂0−1ZX = n (Z0 X)−1 (Σni=1 ε̂2i z0i zi ) (X0 Z)−1     (3.97)
             = n (Z0 X)−1 (Z0 ÊZ) (X0 Z)−1

where Ê = diag(ε̂21 , . . . , ε̂2n ) is a matrix with the estimated residuals squared along its diago-
nal.
IV estimators have one further complication beyond those of OLS. Assumption 3.50 requires the rank of Z0 X to be full (k ), and so zi must be correlated with xi . Moreover, since
the asymptotic variance depends on Σ−1 ZX , even variables with non-zero correlation may
produce imprecise estimates, especially if the correlation is low. Instruments must be care-
fully chosen, although substantially deeper treatment is beyond the scope of this course.
Fortunately, IV estimators are infrequently needed in financial econometrics.

3.12.3 Monte Carlo: The effect of instrument correlation


While IV estimators are not often needed with financial data18 , the problem of endogene-
ity is severe and it is important to be aware of the consequences and pitfalls of using IV
estimators.19 To understand this problem, consider a simple Monte Carlo. The regressor
(xi ), the instrument (z i ) and the error are all drawn from a multivariate normal with the
covariance matrix,
    
(xi , z i , εi )0 ∼ N ( 0, [ 1 ρx z ρx ε ; ρx z 1 0 ; ρx ε 0 1 ] ).
Throughout the experiment, ρ x ε = 0.4 and ρ x z is varied from 0 to .9. 200 data points were
generated from

yi = β1 xi + εi
where β1 = 1. It is straightforward to show that E[β̂ ] = 1 + ρ x ε and that β̂ IVn →p 1 as long as ρ x z ≠ 0. 10,000 replications were generated and the IV estimators were computed

β̂nIV = (Z0 X)−1 (Z0 y).


18 IV estimators are most common in corporate finance when examining executive compensation and company performance.
19 The intuition behind IV estimators is generally applicable to 2SLS.
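A condensed version of this Monte Carlo can be written in a few lines of NumPy; the seed and the use of summary statistics instead of kernel densities are implementation choices of the sketch, not part of the experiment above.

# Sketch: distribution of the IV estimator as instrument strength (rho_xz) varies
import numpy as np

rng = np.random.default_rng(2)
n, reps, rho_xe = 200, 10_000, 0.4

for rho_xz in (0.2, 0.4, 0.6, 0.8):
    cov = np.array([[1.0, rho_xz, rho_xe],
                    [rho_xz, 1.0, 0.0],
                    [rho_xe, 0.0, 1.0]])
    draws = rng.multivariate_normal(np.zeros(3), cov, size=(reps, n))
    x, z, e = draws[..., 0], draws[..., 1], draws[..., 2]
    y = x + e                                      # beta_1 = 1
    beta_iv = (z * y).sum(axis=1) / (z * x).sum(axis=1)   # (Z'X)^{-1} Z'y for each replication
    print(rho_xz, beta_iv.mean(), beta_iv.std())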

[Figure 3.5: Effect of correlation on the variance of β̂ IV, kernel densities for ρ = .2, .4, .6 and .8]

Figure 3.5: Kernel density of the instrumental variable estimator β̂nIV with varying degrees
of correlation between the endogenous variable and the instrument. Increasing the cor-
relation between the instrument and the endogenous variable leads to a large decrease in
the variance of the estimated parameter (β = 1). When the correlation is small (.2), the
distribution has a large variance and is non-normal.

Figure 3.5 contains kernel density plots of the instrumental variable estimator for ρ x z
of .2, .4, .6 and .8. When the correlation between the instrument and x is low, the distri-
bution is dispersed (exhibiting a large variance). As the correlation increases, the variance
decreases and the distribution becomes increasingly normal. This experiment highlights two fundamental problems with IV estimators: they have large variance when no “good instruments” – highly correlated with xi but uncorrelated with εi – are available and the finite-sample distribution of IV estimators may be poorly approximated by a normal.

3.12.4 Heteroskedasticity

Assumption 3.35 does not require data to be homoskedastic, which is useful since het-
eroskedasticity is the rule rather than the exception in financial data. If the data are ho-
moskedastic, the asymptotic covariance of β̂ can be consistently estimated by

Ŝ = σ̂2 (X0 X/n)−1
Heteroskedastic errors require the use of a more complicated covariance estimator, and
the asymptotic variance can be consistently estimated using

Σ̂−1XX Ŝ Σ̂−1XX = (X0 X/n)−1 (Σni=1 ε̂2i x0i xi /n) (X0 X/n)−1     (3.98)
            = n (X0 X)−1 (Σni=1 ε̂2i x0i xi ) (X0 X)−1
            = n (X0 X)−1 (X0 ÊX) (X0 X)−1

where Ê = diag(ε̂21 , . . . , ε̂2n ) is a matrix with the estimated residuals squared along its diago-
nal.
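A sketch of the estimator in eq. (3.98) and the implied robust standard errors is below, assuming NumPy and illustrative names; it is a minimal illustration rather than a full implementation.

# Sketch: White heteroskedasticity robust covariance and standard errors (eq. 3.98)
import numpy as np

def white_cov(y, X):
    n = X.shape[0]
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    XX_inv = np.linalg.inv(X.T @ X)
    meat = (X * e[:, None] ** 2).T @ X            # X' diag(e^2) X
    avar = n * XX_inv @ meat @ XX_inv             # asymptotic covariance of sqrt(n)(b - beta)
    se = np.sqrt(np.diag(avar) / n)               # standard errors for b itself
    return b, avar, se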
Faced with two covariance estimators, one which is consistent under minimal assumptions and one which requires an additional, often implausible assumption, it may be tempting to rely exclusively on the robust estimator. This covariance estimator is known as the White heteroskedasticity consistent covariance estimator and standard errors computed using eq. (3.98) are called heteroskedasticity robust standard errors or White standard errors (White 1980). Using a heteroskedasticity consistent estimator when not needed (homoskedastic data) results in test statistics that have worse small sample properties. For example, tests are more likely to have size distortions and so using 5% critical values may lead
to rejection of the null 10% or more of the time when the null is true. On the other hand,
using an inconsistent estimator of the parameter covariance – assuming homoskedasticity
when the data are not – produces tests with size distortions, even asymptotically.
White (1980) also provides a simple and direct test of whether a heteroskedasticity robust
covariance estimator is needed. Each term in the heteroskedasticity consistent estimator
takes the form

ε2i x0i xi = [ ε2i x21,i       ε2i x1,i x2,i    . . .   ε2i x1,i xk,i
             ε2i x1,i x2,i    ε2i x22,i        . . .   ε2i x2,i xk,i
             ...              ...              . . .   ...
             ε2i x1,i xk,i    ε2i x2,i xk,i    . . .   ε2i x2k,i ] ,

and so, if E[ε2i x j,i xl,i ] = E[ε2i ]E[x j,i xl,i ] for all j and l , then the heteroskedasticity ro-
bust and the standard estimator will both consistently estimate the asymptotic variance of
β̂ . White’s test is formulated as a regression of squared estimated residuals on all unique
squares and cross products of xi . Suppose the original regression specification is

yi = β1 + β2 x1,i + β3 x2,i + εi .

White’s test uses an auxiliary regression of ε̂2i on the squares and cross-produces of all
2 2
regressors, {1, x1,i , x2,i , x1,i , x2,i , x1,i x2,i }:

ε̂2i = δ1 + δ2 x1,i + δ3 x2,i + δ4 x1,i


2
+ δ5 x2,i
2
+ δ6 x1,i x2,i + ηi . (3.99)

The null hypothesis tested is H0 : δ j = 0, j > 1, and the test statistic can be com-
puted using n R2 where the centered R2 is from the model in eq. (3.99). Recall that n R2 is an
LM test of the null that all coefficients except the intercept are zero and has an asymptotic χ2ν distribution where ν is the number of restrictions – the same as the number of regressors exclud-
ing the constant. If the null is rejected, a heteroskedasticity robust covariance estimator is
required.

Algorithm 3.55 (White’s Test). 1. Fit the model yi = xi β + εi

2. Construct the fit residuals ε̂i = yi − xi β̂

3. Construct the auxiliary regressors zi where the k (k + 1)/2 elements of zi are computed
from xi ,o xi ,p for o = 1, 2, . . . , k , p = o , o + 1, . . . , k .

4. Estimate the auxiliary regression ε̂2i = zi γ + ηi

5. Compute White’s Test statistic as nR 2 where the R 2 is from the auxiliary regression and
compare to the critical value at size α from a χk2(k +1)/2−1 .
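A sketch of Algorithm 3.55 follows; it assumes NumPy/SciPy, that X includes a constant column (so the degrees of freedom are k(k+1)/2 − 1), and that the auxiliary regressors have full rank. The names are illustrative.

# Sketch: White's test for heteroskedasticity (Algorithm 3.55)
import numpy as np
from scipy import stats

def white_test(y, X):
    n, k = X.shape
    e2 = (y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2
    # unique squares and cross products x_{i,o} x_{i,p}, o <= p
    Z = np.column_stack([X[:, o] * X[:, p] for o in range(k) for p in range(o, k)])
    g = np.linalg.lstsq(Z, e2, rcond=None)[0]
    u = e2 - Z @ g
    r2 = 1 - (u @ u) / ((e2 - e2.mean()) @ (e2 - e2.mean()))  # centered R^2
    stat = n * r2
    df = k * (k + 1) // 2 - 1
    return stat, 1 - stats.chi2.cdf(stat, df)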

3.12.5 Example: White’s test on the FF data

White’s heteroskedasticity test is implemented using the estimated residuals, ε̂i = yi −x0i β̂ ,
by regressing the estimated residuals squared on all unique cross products of the regres-
sors. The primary model fit is

B Hie = β1 + β2 V W M ie + β3S M Bi + β4 H M L i + β5U M Di + εi .

and the auxiliary model is specified

ε̂2i = δ1 + δ2 V W M ie + δ3 S M Bi + δ4 H M L i + δ5 U M Di + δ6 (V W M ie )2 + δ7 V W M ie S M Bi
    + δ8 V W M ie H M L i + δ9 V W M ie U M Di + δ10 S M Bi2 + δ11 S M Bi H M L i
    + δ12 S M Bi U M Di + δ13 H M L i2 + δ14 H M L i U M Di + δ15 U M Di2 + ηi

Estimating this regression produces an R2 of 11.1% and n R2 = 105.3, which has an


asymptotic χ214 distribution (14 regressors, excluding the constant). The p-value of this test
statistic is 0.000 and so the null of homoskedasticity is strongly rejected.

3.12.6 Generalized Least Squares


An alternative to modeling heteroskedastic data is to transform the data so that it is ho-
moskedastic using generalized least squares (GLS). GLS extends OLS to allow for arbitrary
weighting matrices. The GLS estimator of β is defined
β̂ GLS = (X0 W−1 X)−1 X0 W−1 y,     (3.100)
for some positive definite matrix W. Without any further assumptions or restrictions on W, β̂ GLS is unbiased under the same conditions as β̂ , and the variance of β̂ GLS can be easily shown to be

(X0 W−1 X)−1 (X0 W−1 VW−1 X)(X0 W−1 X)−1


where V is the n by n covariance matrix of ε.
The full value of GLS is only realized when W is wisely chosen. Suppose that the data
are heteroskedastic but not serially correlated,20 and so

y = Xβ + ε (3.101)
where V[εi |X] = σi2 and therefore heteroskedastic. Further, assume σi2 is known. Return-
ing to the small sample assumptions, choosing W ∝ V(ε|X)21 , the GLS estimator will be
efficient.

Assumption 3.56 (Error Covariance). V = V[ε|X]

Setting W = V, the GLS estimator is BLUE.


Theorem 3.57 (Variance of β̂ GLS ). Under assumptions 3.9 - 3.11 and 3.56,

V[β̂ GLS |X] = (X0 V−1 X)−1

and V[β̂ GLS |X] ≤ V[β̃ |X] where β̃ = Cy is any other linear unbiased estimator with E[β̃ ] = β

To understand the intuition behind this result, note that the GLS estimator can be ex-
pressed as an OLS estimator using transformed data. Returning to the model in eq. (3.101),
and pre-multiplying by W−1/2 ,

W−1/2 y = W−1/2 Xβ + W−1/2 ε
ỹ = X̃β + ε̃

and so
20 Serial correlation is ruled out by assumption 3.37.
21 ∝ is the mathematical symbol for “proportional to”.

β̂ = (X̃0 X̃)−1 X̃0 ỹ
  = (X0 W−1/2 W−1/2 X)−1 X0 W−1/2 W−1/2 y
  = (X0 W−1 X)−1 X0 W−1 y
  = β̂ GLS .

In the original model, W = V[ε|X], and so V[W−1/2 ε|X] = W−1/2 WW−1/2 = In . ε̃ is homoskedastic and uncorrelated and the transformed model satisfies the assumptions of the Gauss-Markov theorem (theorem 3.17).
This result is only directly applicable under the small sample assumptions and then only
if V[ε|X] is known a priori. In practice, neither is true: data are not congruent with the small
sample assumptions and V[ε|X] is never known. The feasible GLS (FGLS) estimator solves
these two issues, although the efficiency gains of FGLS have only asymptotic justification.
Suppose that V[εi |X] = ω1 + ω2 x1,i + . . . + ωk +1 xk,i where the ω j are unknown. The FGLS
procedure provides a method to estimate these parameters and implement a feasible GLS
estimator.
The FGLS procedure is described in the following algorithm.
Algorithm 3.58 (GLS Estimation). 1. Estimate β̂ using OLS.

2. Using the estimated residuals, ε̂ = y − Xβ̂ , estimate an auxiliary model by regressing the squared residuals on the variables of the variance model.

3. Using the estimated variance model parameters ω̂, produce a fit variance matrix, V̂.

4. Compute ỹ = V̂−1/2 y and X̃ = V̂−1/2 X and compute β̂ FGLS using the OLS estimator on the transformed regressors and regressand.
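A sketch of Algorithm 3.58 where the conditional variance is modeled as a linear function of the regressors is given below; flooring the fit variances at a small positive number is an implementation choice of the sketch, not part of the algorithm, and all names are illustrative.

# Sketch: feasible GLS with a linear model for the conditional variance (Algorithm 3.58)
import numpy as np

def fgls(y, X):
    # Step 1: OLS
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b_ols) ** 2
    # Step 2: auxiliary regression of squared residuals on the variance model (here: X itself)
    omega = np.linalg.lstsq(X, e2, rcond=None)[0]
    # Step 3: fit variances (floored to avoid zero or negative weights)
    v_hat = np.maximum(X @ omega, 1e-6)
    # Step 4: transform the data and re-estimate by OLS
    w = 1.0 / np.sqrt(v_hat)
    b_fgls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
    return b_ols, b_fgls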
Hypothesis testing can be performed on β̂ FGLS using the standard test statistics with the FGLS variance estimator,

σ̃2 (X0 V̂−1 X)−1 = σ̃2 (X̃0 X̃)−1

where σ̃2 is the sample variance of the FGLS regression errors (ε̃ = ỹ − X̃β̂ FGLS ).
While FGLS is only formally asymptotically justified, FGLS estimates are often much
more precise in finite samples, especially if the data are very heteroskedastic. The largest gains occur when some observations have dramatically larger variance than others. The
OLS estimator gives these observations too much weight, inefficiently exploiting the infor-
mation in the remaining observations. FGLS, even when estimated with a diagonal weight-
ing matrix that may be slightly misspecified, can produce substantially more precise esti-
mates.22
22 If the model for the conditional variance of εi is misspecified in an application of FGLS, the resulting estimator is not asymptotically efficient and a heteroskedasticity robust covariance estimator is required.

3.12.6.1 Monte Carlo: A simple GLS

A simple Monte Carlo was designed to demonstrate the gains of GLS. The observed data
are generated according to

yi = xi + xiα εi
where xi is i.i.d. U(0,1) and εi is standard normal. α takes the values of 0.8, 1.6, 2.8 and
4. When α is low the data are approximately homoskedastic. As α increases the data are
increasingly heteroskedastic and the probability of producing a few residuals with small
variances increases. The OLS and (infeasible) GLS estimators were fit to the data and figure
3.6 contains kernel density plots of β̂ and β̂ G LS .
When α is small, the OLS and GLS parameter estimates have a similar variance, indi-
cated by the similarity in distribution. As α increases, the GLS estimator becomes very
precise, which is due to GLS's reweighting of the data by the inverse of its variance. In effect, observations with the smallest errors become very influential in determining β̂ . This is the general principle behind GLS: let the data points which are most precise about the
unknown parameters have the most influence.

3.12.7 Example: GLS in the Factor model

Even if it is unreasonable to assume that the entire covariance structure of the residuals
can be correctly specified in the auxiliary regression, GLS estimates are often much more
precise than OLS estimates. Consider the regression of B H e on the 4 factors and a constant.
The OLS estimates are identical to those previously presented and the GLS estimates will
be computed using the estimated variances from White’s test. Define

V̂ = diag(σ̂21 , σ̂22 , . . . , σ̂2n )




where σ̂i2 is the fit value from the auxiliary regression in White’s test that included only
the squares of the explanatory variables. Coefficients were estimated by regressing ỹ on X̃
where

ỹ = V̂−1/2 y
X̃ = V̂−1/2 X
and β̂ GLS = (X̃0 X̃)−1 X̃0 ỹ. ε̂GLS = y − Xβ̂ GLS are computed from the original data using the
GLS estimate of β , and the variance of the GLS estimator can be computed using

(X̃0 X̃)−1 (X̃0 Ẽ X̃)(X̃0 X̃)−1 .


[Figure 3.6: Gains of using GLS – kernel densities of the OLS and GLS estimators for α = 0.8, 1.6, 2.8 and 4.0]

Figure 3.6: The four plots show the gains to using the GLS estimator on heteroskedastic
data. The data were generated according to yi = xi + xiα εi where xi is i.i.d. uniform and εi
is standard normal. For large α, the GLS estimator is substantially more efficient than the
OLS estimator. However, the intuition behind the result is not that high variance residuals
have been down-weighted, but that low variance residuals, some with very low variances,
have been up-weighted to produce a very precise fit.

where Ẽ is a diagonal matrix with the estimated residuals squared, (ε̂GLSi )2 , from the GLS procedure along its diagonal. Table 3.10 contains the estimated parameters, t -stats and p-
values using both the OLS and the GLS estimates. The GLS estimation procedure appears
to provide more precise estimates and inference. The difference in precision is particularly
large for S M B .

OLS GLS
Variable β̂ t -stat p-val β̂ t -stat p-val
Constant -0.064 -1.518 0.129 -0.698 -3.138 0.002
V WMe 1.077 97.013 0.000 0.986 82.691 0.000
SM B 0.019 0.771 0.440 0.095 16.793 0.000
HML 0.803 43.853 0.000 1.144 34.880 0.000
UMD -0.040 -2.824 0.005 0.157 20.248 0.000

Table 3.10: OLS and GLS parameter estimates and t -stats. t -stats indicate that the GLS
parameter estimates are more precise.

3.13 Model Selection and Specification Checking

Econometric problems often begin with a variable whose dynamics are of interest and a
relatively large set of candidate explanatory variables. The process by which the set of re-
gressors is reduced is known as model selection or building.
Model building inevitably reduces to balancing two competing considerations: congru-
ence and parsimony. A congruent model is one that captures all of the variation in the data
explained by the regressors. Obviously, including all of the regressors and all functions
of the regressors should produce a congruent model. However, this is also an infeasible
procedure since there are infinitely many functions of even a single regressor. Parsimony
dictates that the model should be as simple as possible and so models with fewer regres-
sors are favored. The ideal model is the parsimonious congruent model that contains all
variables necessary to explain the variation in the regressand and nothing else.
Model selection is as much a black art as science and some lessons can only be taught
through experience. One principle that should be universally applied when selecting a
model is to rely on economic theory and, failing that, common sense. The simplest method
to select a poorly performing model is to try any and all variables, a process known as data
snooping that is capable of producing a model with an arbitrarily high R2 even if there is
no relationship between the regressand and the regressors.
There are a few variable selection methods which can be examined for their properties.
These include

• General to Specific modeling (GtS)

• Specific to General modeling (StG)

• Information criteria (IC)



3.13.1 Model Building

3.13.1.1 General to Specific

General to specific (GtS) model building begins by estimating the largest model that can be jus-
tified by economic theory (and common sense). This model is then pared down to produce
the smallest model that remains congruent with the data. The simplest version of GtS be-
gins with the complete model. If any coefficients have p-values (of t -stats) less than some
significance level α (usually 5 or 10%), the least significant regressor is dropped from the
regression. Using the remaining regressors, the procedure is repeated until all coefficients
are statistically significant, always dropping the least significant regressor.
One drawback to this simple procedure is that variables which are correlated but rele-
vant are often dropped. This is due to a problem known as multicollinearity: individual t -stats will be small but joint significance tests that all coefficients are simultaneously zero
will strongly reject. This suggests using joint hypothesis tests to pare the general model
down to the specific one. While theoretically attractive, the scope of possible joint hypothesis tests is vast even in a small model, and so using joint tests is impractical.
GtS suffers from two additional issues. First, it will include an irrelevant variable with
positive probability (asymptotically) but will never exclude a relevant variable. Second, test
statistics do not have standard distributions when they are used sequentially (as is the case
with any sequential model building procedure). The only viable solution to the second
problem is to fit a single model, make variable inclusions and exclusion choices, and live
with the result. This practice is not typically followed and most econometricians use an
iterative procedure despite the problems of sequential testing.

3.13.1.2 Specific to General

Specific to General (StG) model building begins by estimating the smallest model, usually
including only a constant. Variables are then added sequentially based on maximum t -
stat until there is no excluded variable with a significant t -stat at some predetermined α
(again, usually 5 or 10%). StG suffers from the same issues as GtS. First, it will asymptotically include all relevant variables and some irrelevant ones, and second, tests done sequentially do not have correct asymptotic size. Choosing between StG and GtS is mainly user pref-
erence, although they rarely select the same model. One argument in favor of using a GtS
approach is that the variance is consistently estimated in the first step of the general speci-
fication while the variance estimated in the first step of an StG selection is too large. This leads StG processes to have t -stats that are smaller than GtS t -stats and so StG generally
selects a smaller model than GtS.

3.13.1.3 Information Criteria

A third method of model selection uses Information Criteria (IC). Information Criteria re-
ward the model for producing smaller SSE while punishing it for the inclusion of additional
regressors. The two most frequently used are the Akaike Information Criterion (AIC) and
Schwartz Information Criterion (SIC) or Bayesian Information Criterion (BIC).23 Most In-
formation Criteria are of the form

−2l + P
where l is the log-likelihood value at the parameter estimates and P is a penalty term. In
the case of least squares, where the log-likelihood is not known (or needed), IC’s take the
form

ln σ̂2 + P
where the penalty term is divided by n .

Definition 3.59 (Akaike Information Criterion (AIC)). For likelihood-based models the AIC
is defined
AI C = −2l + 2k (3.102)
and in its least squares application,

AIC = ln σ̂2 + 2k/n     (3.103)
Definition 3.60 (Schwartz/Bayesian Information Criterion (S/BIC)). For likelihood-based
models the BIC (SIC) is defined

B I C = −2l + k ln n (3.104)

and in its least squares applications,

BIC = ln σ̂2 + k ln n/n     (3.105)
The obvious difference between these two IC is that the AIC has a constant penalty term
while the BIC has a penalty term that increases with the number of observations. The ef-
fect of the sharper penalty in the S/BIC is that for larger data sizes, the marginal increase
in the likelihood (or decrease in the variance) must be greater. This distinction is subtle
but important: using the BIC to select from a finite set of regressors leads to the correct
model being chosen while the AIC asymptotically selects a model that includes irrelevant
regressors.
23 The BIC and SIC are the same. BIC is probably the most common name but SIC or S/BIC are also frequently encountered.

Using an IC to select a model is similar to either a GtS or StG search. For example, to use
an StG selection method, begin with the smallest model (usually a constant) and compute
the IC for this model. Next, consider all possible univariate regressions. If any reduce the
IC, extend the specification to include the variable that produced the smallest IC. Now, be-
ginning from the selected univariate regression, estimate all bivariate regressions. Again, if
any decrease the IC, choose the one which produces the smallest value. Repeat this proce-
dure until the marginal contribution to the IC of adding any additional variable is positive
(i.e. when comparing an L and L + 1 variable regression, including an additional variable
increases the IC).
As an alternative, if the number of regressors is sufficiently small (less than 20) it is pos-
sible to try every possible combination and choose the combination with the smallest IC. This
requires 2^L regressions where L is the number of available regressors (2^20 is about 1,000,000).
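To make the search concrete, the following minimal Python/NumPy sketch implements an IC-based stepwise search of the kind described above; the function names (bic, stepwise_bic) and the simulated data are purely illustrative, and the sketch assumes the least squares BIC form in equation (3.105).

import numpy as np

def bic(y, X):
    # ln(sigma-hat^2) + k ln(n)/n for a least squares fit of y on X
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return np.log(sigma2) + k * np.log(n) / n

def stepwise_bic(y, candidates):
    # Specific-to-General search: start from a constant, add the regressor that
    # lowers the BIC the most, and stop when no addition lowers it
    n = y.shape[0]
    X = np.ones((n, 1))
    remaining = list(range(candidates.shape[1]))
    selected = []
    current = bic(y, X)
    improved = True
    while improved and remaining:
        improved = False
        trial = [(bic(y, np.column_stack([X, candidates[:, j]])), j) for j in remaining]
        best_ic, best_j = min(trial)
        if best_ic < current:                  # extend the model only if the IC falls
            X = np.column_stack([X, candidates[:, best_j]])
            selected.append(best_j)
            remaining.remove(best_j)
            current, improved = best_ic, True
    return selected, current

# Simulated example: only the first two of five candidate regressors matter
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 5))
y = 1.0 + Z[:, 0] - 0.5 * Z[:, 1] + rng.standard_normal(500)
print(stepwise_bic(y, Z))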

3.13.2 Lasso, Forward Stagewise Regression and LARS


Lasso (least absolute shrinkage and selection operator), Forward Stagewise Regression, and
LARS (Least Angle Regression), are relatively new methods for selecting models (Efron,
Hastie, Johnstone & Tibshirani 2004, Tibshirani 1996). The lasso adds an additional con-
straint to the least squares problem that limits the magnitude of the regression coefficients,
which produces an interpretable model since regressors with little explanatory power
will have coefficients exactly equal to 0 (and hence are excluded).

Definition 3.61 (Lasso). The Lasso estimator with tuning parameter ω is defined as the
solution to

min_β (y − Xβ)′(y − Xβ) subject to Σ_{j=1}^k |β_j| < ω (3.106)

Forward Stagewise Regression begins from a model with no regressors and then uses
an iterative method to build the regression in small steps by expanding the regression co-
efficients in increments small enough that the coefficient paths are effectively continuous.

Algorithm 3.62 (Forward Stagewise Regression). The Forward Stagewise Regression (FSR)
estimator is defined as the sample paths of β̂ defined by

1. Begin with β̂(0) = 0, and errors ε(0) = y

2. Compute the correlations of the residual at iteration i with the regressors, c(i) = Corr[X, ε(i)]

3. Define j to be the index of the largest element of |c(i)| (the absolute value of the correlations), and update the coefficients where β̂_j(i+1) = β̂_j(i) + η · sign(c_j) and β̂_l(i+1) = β̂_l(i) for l ≠ j, where η is a small number (much smaller than |c_j|).24

24 η should be larger than some small value to ensure the algorithm completes in finitely many steps, but should always be weakly smaller than |c_j|.



4. Compute ε(i+1) = y − Xβ̂(i+1)

5. Repeat steps 2 – 4 until all correlations are 0 (if ε(i) = 0 then all correlations are 0 by
definition).
The coefficients of FSR are determined by taking a small step in the direction of the
highest correlation between the regressors and the current error, and so the algorithm will
always take a step in the direction of the regressor that has the most (local) explanatory
power over the regressand. The final stage FSR coefficients will be equal to the OLS esti-
mates as long as the number of regressors under consideration is smaller than the number
of observations.
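A minimal Python/NumPy sketch of the FSR recursion follows; it assumes y and the columns of X have been demeaned and standardized so that correlations reduce to scaled inner products, and the step size eta and the iteration cap are illustrative choices, not values from the text.

import numpy as np

def forward_stagewise(X, y, eta=0.01, n_iter=5000):
    # Repeatedly take a small step (eta) on the coefficient of the regressor
    # most correlated with the current residual
    n, k = X.shape
    beta = np.zeros(k)
    resid = y.copy()                         # epsilon^(0) = y
    path = [beta.copy()]
    for _ in range(n_iter):
        corr = X.T @ resid / n               # proportional to Corr[X, eps] for standardized data
        j = np.argmax(np.abs(corr))
        if np.abs(corr[j]) < 1e-8:           # stop when all correlations are (numerically) zero
            break
        beta[j] += eta * np.sign(corr[j])    # small step in the direction of sign(c_j)
        resid = y - X @ beta
        path.append(beta.copy())
    return beta, np.array(path)

# Example on standardized simulated data
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
X = (X - X.mean(0)) / X.std(0)
y = 2.0 * X[:, 0] - X[:, 2] + rng.standard_normal(200)
y = y - y.mean()
beta_fsr, path = forward_stagewise(X, y)
print(beta_fsr)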
Algorithm 3.63 (Least Angle Regression). The Least Angle Regression (LARS) estimator is
defined as the sample paths of β̂ defined by

1. Begin with β̂(0) = 0, and errors ε(0) = ỹ

2. Compute the correlations of the residual at stage i with the regressors, c(i) = Corr[X̃(i), ε(i)], and define j to be the index of the largest element of |c(i)| (the absolute value of the correlations).

3. Define the active set of regressors X̃(1) = x̃_j.

4. Move β̂(1) = β̂_j towards the least squares estimate of regressing ε(0) on X̃(1) until the correlation between ε(1) = ỹ − X̃(1)β̂(1) and some other x̃_k is equal to the correlation between ε(1) and x̃_j.

5. Add x̃_k to the active set of regressors so X̃(2) = [x̃_j, x̃_k].

6. Move β̂(2) = [β̂_j β̂_k] towards the least squares estimate of regressing ε(1) on X̃(2) until the correlation between ε(2) = ỹ − X̃(2)β̂(2) and some other x̃_l is equal to the correlation between ε(2) and X̃(2).

7. Repeat steps 5 – 6, adding regressors to the active set, until all regressors have been added or n steps have been taken, whichever occurs first.

where

ỹ = (y − ȳ)/σ̂_y (3.107)

and

x̃_i = (x_i − x̄_i)/σ̂_{x_i} (3.108)

are studentized versions of the original data.25


25 LARS can be implemented on non-studentized data by replacing the correlation with c(i) = X(i)′ε(i).

The algorithm of LARS describes the statistical justification for the procedure – variables
are added as soon as they have the largest correlation. Once the active set contains 2 or
more regressors, the maximum correlation between the error and all regressors will be the
same since regression coefficients are expanded in a manner that keeps the correlation
identical between the error and any regressors in the active set. Efron et al. (2004) proposes
a new algorithm that allows the entire path of Lasso, FSR and LARS estimates to be quickly
computed for a large number of potential regressors.
These models are deeply related, as shown in Efron et al. (2004) and Hastie et al. (2007). All
three can be used for model selection once a stopping rule (FSR, LARS) or the summation
constraint (ω, Lasso) has been selected. The usual method to choose these values is a
cross-validation experiment where the procedure is fit on some portion of the data (say 75%)
and then the forecast performance is assessed on the remaining data. For applications
using cross-sectional data, the in-sample and out-of-sample data can be randomly chosen
and the experiment repeated to provide guidance on a reasonable stopping rule or ω.
In addition, the usual standard errors and t -stats are no longer correct since these es-
timators are constrained versions of least squares, and Tibshirani (1996) proposes a boot-
strap method that can be used to compute standard errors and make inference on Lasso
estimators.26
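As an illustration of choosing the constraint by cross-validation, the sketch below assumes scikit-learn is available and uses its lars_path and Lasso routines on simulated data; the single 75%/25% split mirrors the suggestion above, although in practice the split would be repeated.

import numpy as np
from sklearn.linear_model import lars_path, Lasso

rng = np.random.default_rng(0)
n, k = 400, 10
X = rng.standard_normal((n, k))
beta = np.zeros(k); beta[:3] = [1.5, -1.0, 0.5]        # only 3 relevant regressors
y = X @ beta + rng.standard_normal(n)

# Entire LARS/lasso coefficient path (the Efron et al. 2004 algorithm)
alphas, active, coefs = lars_path(X, y, method="lasso")

# Simple 75%/25% cross-validation experiment over the path's penalty values
split = int(0.75 * n)
Xin, yin, Xout, yout = X[:split], y[:split], X[split:], y[split:]
mse = []
for a in alphas[:-1]:                                  # skip the final unpenalized endpoint
    fit = Lasso(alpha=a).fit(Xin, yin)
    mse.append(np.mean((yout - fit.predict(Xout)) ** 2))
best = alphas[:-1][int(np.argmin(mse))]
print("selected penalty:", best)
print("selected coefficients:", Lasso(alpha=best).fit(X, y).coef_.round(2))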

3.13.3 Specification Checking


Once a model has been selected, the final step is to examine the specification, where a
number of issues may arise. For example, a model may have neglected some nonlinear
features in the data, a few outliers may be determining the parameter estimates, or the
data may be heteroskedastic. Residuals form the basis of most specification checks, although
the first step in assessing model fit is always to plot the residuals. A simple residual plot
often reveals problems with a model, such as large (and generally influential) residuals or
correlation among the residuals in time-series applications.

3.13.3.1 Residual Plots and Non-linearity

Plot, plot, plot. Plots of both data and residuals, while not perfect, are effective methods
to detect many problems. Most data analysis should include a plot of the initial unfiltered
data where large observations or missing data are easily detected. Once a model has been
estimated the residuals should be plotted, usually by sorting them against the ordered re-
gressors when using cross-sectional data or against time (the observation index) in time-
series applications.
To see the benefits of plotting residuals, suppose the data were generated by yi = xi + xi2 + εi
where xi and εi are i.i.d. standard normal, but an affine specification, yi = β1 + β2 xi + εi,
was fit.

26 The standard errors subsequent to a selection procedure using GtS, StG or IC are also not correct since tests have been repeated. In this regard the bootstrap procedure should be more accurate since it accounts for the variation due to the selection, something not usually done in traditional model selection procedures.

Neglected Nonlinearity and Residual Plots
(Panels: Data and Fit Line; Error plot (sorted by xi))

Figure 3.7: The top panel contains data generated according to yi = xi + xi2 + εi and a fit
from a model yi = β1 + β2 xi + εi . The nonlinearity should be obvious, but is even clearer
in the ordered (by xi ) residual plot where a distinct “U” shape can be seen (bottom panel).

Figure 3.7 contains plots of the data and fit line (top panel) and errors
(bottom panel). It is obvious from the data and fit line that the model is misspecified and
the residual plot makes this clear. Residuals should have no discernible pattern in their
mean when plotted against any variable (or function of variables) in the data set.
One statistical test for detecting neglected nonlinearity is Ramsey’s RESET test. Suppose
some model is fit

yi = xi β + εi

and one desires to test whether there is a neglected nonlinearity present. The RESET test
uses powers of the fit data, ŷi as additional regressors to test whether there is evidence of
nonlinearity in the data.

Definition 3.64 (Ramsey’s RESET Test). The RESET test is a test of the null the null H0 :
210 Analysis of Cross-Sectional Data

γ1 = . . . = γR = 0 in an auxiliary regression,

yi = xi β + γ1 ŷi2 + γ2 ŷi3 + . . . + γR ŷiR −1 εi

where ŷi are the fit values of yi generated in the initial regression. The test statistic has an
asymptotic χR2 distribution.

R is typically 1 or 2 since higher powers may produce numerical problems, imprecise


estimates and size distortions. The biggest difficulty of using a RESET test is that rejection
of the null is not informative about the changes needed to the original specification.
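A sketch of the RESET test with R = 2, assuming homoskedastic errors so that the simple Wald covariance applies (Python/NumPy, with scipy used only for the χ² p-value); the quadratic DGP matches the example in figure 3.7.

import numpy as np
from scipy import stats

def reset_test(y, X, R=2):
    # Add powers 2,...,R+1 of the fitted values and test whether their
    # coefficients are jointly zero (asymptotic chi^2_R)
    n = y.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ beta
    Z = np.column_stack([X] + [yhat ** p for p in range(2, R + 2)])
    gamma = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ gamma
    sigma2 = resid @ resid / n
    V = sigma2 * np.linalg.inv(Z.T @ Z)       # homoskedastic parameter covariance
    g = gamma[-R:]                            # coefficients on the added powers
    Vg = V[-R:, -R:]
    W = g @ np.linalg.solve(Vg, g)            # Wald statistic
    return W, 1 - stats.chi2.cdf(W, R)

# Example: quadratic DGP fit with an affine model, as in figure 3.7
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = x + x ** 2 + rng.standard_normal(500)
X = np.column_stack([np.ones(500), x])
print(reset_test(y, X))                       # large statistic, tiny p-value: nonlinearity detected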

3.13.3.2 Parameter Stability

Parameter instability is a common problem in actual data. For example, recent evidence
suggests that the market β in a CAPM may differ across up and down markets (Ang et al.
2006). A model fit assuming the strict CAPM would be misspecified since the parameters
are not constant.
There is a simple procedure to test for parameter stability if the point where the param-
eters change is known. The test is implemented by including a dummy interaction for any parameter
that may change and testing whether the coefficients on the dummy variables are zero.
Returning to the CAPM example, the standard specification is

r_i^e = β1 + β2 (r_i^M − r_i^f) + εi

where r_i^M is the return on the market, r_i^f is the return on the risk-free asset and r_i^e is the
excess return on the dependent asset. To test whether the slope is different when (r_i^M − r_i^f) < 0,
define a dummy Ii = I[(r_i^M − r_i^f) < 0] and perform a standard test of the null H0 : β3 = 0
in the regression

r_i^e = β1 + β2 (r_i^M − r_i^f) + β3 Ii (r_i^M − r_i^f) + εi.
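The dummy-interaction test can be sketched as follows on simulated excess returns (Python/NumPy); the parameter values in the simulated DGP are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
mkt = rng.standard_normal(n) * 0.05                 # simulated market excess return
down = (mkt < 0).astype(float)                      # I[(r^M - r^f) < 0]
re = 0.001 + 1.0 * mkt + 0.3 * down * mkt + rng.standard_normal(n) * 0.02

X = np.column_stack([np.ones(n), mkt, down * mkt])  # constant, market, interaction
beta = np.linalg.lstsq(X, re, rcond=None)[0]
resid = re - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
V = sigma2 * np.linalg.inv(X.T @ X)
t_beta3 = beta[2] / np.sqrt(V[2, 2])                # test H0: beta_3 = 0
print("beta estimates:", beta.round(3), " t-stat on interaction:", round(t_beta3, 2))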

If the break point is not known a priori, it is necessary to test whether there is a break in
the parameter at any point in the sample. This test can be implemented by testing at every
point and then examining the largest test statistic. While this is a valid procedure, the dis-
tribution of the largest test statistic is no longer χ² and so inference based on standard tests
(and their corresponding distributions) will be misleading. This type of problem is known
as a nuisance parameter problem since, if the null hypothesis (that there is no break)
is correct, the value of the regression coefficient after the break is not defined. In the example
above, if there is no break then β3 is not identified (and is a nuisance). Treatment of the
issues surrounding nuisance parameters is beyond the scope of this course, but interested
readers should see Andrews & Ploberger (1994).

3.13.3.3 Rolling and Recursive Parameter Estimates

Rolling and recursive parameter estimates are useful tools for detecting parameter insta-
bility in cross-section regression of time-series data (e.g. asset returns). Rolling regression
estimates use a fixed length sample of data to estimate β and then “roll” the sampling win-
dow to produce a sequence of estimates.
Definition 3.65 (m -sample Rolling Regression Estimates). The m-sample rolling regres-
sion estimates are defined as the sequence
 −1
j +m −1
X
β̂ j =  x0i xi  x0i yi (3.109)
i=j

for j = 1, 2, . . . , n − m + 1.
The rolling window length should be large enough so that parameter estimates in each
window are reasonably well approximated by a CLT but not so long as to smooth out any
variation in β . A 60-month window is a common choice in applications using monthly asset
price data, and window lengths ranging between 3 months and 2 years are common when
using daily data. The rolling regression coefficients can be visually inspected for evidence
of instability, and approximate confidence intervals (based on an assumption of parameter
stability) can be constructed by estimating the parameter covariance on the full sample of
n observations and then scaling by n /m so that the estimated covariance is appropriate for
a sample of m observations. The parameter covariance can alternatively be estimated by
averaging the n − m + 1 covariance estimates corresponding to each sample, Σ̂_XX,j^{−1} Ŝ_j Σ̂_XX,j^{−1},
where

Σ̂_XX,j = m^{−1} Σ_{i=j}^{j+m−1} x_i′ x_i (3.110)

and

Ŝ_j = m^{−1} Σ_{i=j}^{j+m−1} ε̂²_{i,j} x_i′ x_i (3.111)

where ε̂_{i,j} = y_i − x_i β̂_j. If the parameters are stable, these two methods for estimating the
parameter covariance should produce similar confidence intervals.
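A minimal sketch of the m-sample rolling estimator in equation (3.109), with approximate bands from the rescaled full-sample covariance (Python/NumPy); the 60-observation window and the simulated data are illustrative.

import numpy as np

def rolling_ols(y, X, m=60):
    # Return the (n-m+1) x k matrix of m-sample rolling OLS estimates
    n, k = X.shape
    betas = np.empty((n - m + 1, k))
    for j in range(n - m + 1):
        betas[j] = np.linalg.lstsq(X[j:j + m], y[j:j + m], rcond=None)[0]
    return betas

rng = np.random.default_rng(0)
n = 600
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([0.1, 1.0]) + rng.standard_normal(n)

betas = rolling_ols(y, X, m=60)

# Approximate bands: full-sample covariance scaled by n/m (assumes stable parameters)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_full
V_full = (resid @ resid / (n - 2)) * np.linalg.inv(X.T @ X)
se_window = np.sqrt(np.diag(V_full) * n / 60)
print(betas[:3].round(2), "\n+/- 1.96 *", se_window.round(3))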
60-month rolling regressions of the B H portfolio in the 4-factor model are presented
in figure 3.8 where approximate confidence intervals were computed using the re-scaled
full-sample parameter covariance estimate. While these confidence intervals cannot di-
rectly be used to test for parameter instability, the estimated loadings on the market,
SMB and HML vary more than their intervals indicate they should were the parameters
stable.

Rolling Parameter Estimates in the 4-Factor Model
(Panels: VWMe, SMB, HML, UMD; x-axis: 1930–2000)

Figure 3.8: 60-month rolling parameter estimates from the model BH_i^e = β1 + β2 VWM_i^e +
β3 SMB_i + β4 HML_i + β5 UMD_i + εi. Approximate confidence intervals were constructed
by scaling the full sample parameter covariance. These rolling estimates indicate that the
market loading of the Big-High portfolio varied substantially at the beginning of the sam-
ple, and that the loadings on both SMB and HML may be time-varying.

Recursive Parameter Estimates in the 4-Factor Model
(Panels: VWMe, SMB, HML, UMD; x-axis: 1930–2000)

Figure 3.9: Recursive parameter estimates from the model BH_i^e = β1 + β2 VWM_i^e +
β3 SMB_i + β4 HML_i + β5 UMD_i + εi. Approximate confidence intervals were constructed
by scaling the full sample parameter covariance. While less compelling than the rolling
window estimates, these recursive estimates indicate that the loading on the market and
on HML may not be constant throughout the sample.

An alternative to rolling regressions is recursive estimation, a stability check that uses


an expanding window of observations to estimate β .

Definition 3.66 (Recursive Regression Estimates). Recursive regression estimates are de-
fined as the sequence
 −1
j
X
β̂ j =  x0i xi  x0i yi (3.112)
i =1

for j = l , 2, . . . , n where l > k is the smallest window used.

Approximate confidence intervals can be computed either by re-scaling the full-sample


parameter covariance or by directly estimating the parameter covariance in each recursive
sample. Documenting evidence of parameter instability using recursive estimates is often
more difficult than with rolling estimates, as demonstrated in figure 3.9.

3.13.3.4 Normality

Normality may be a concern if the validity of the small sample assumptions is important.
The standard method to test for normality of estimated residuals is the Jarque-Bera (JB) test
which is based on two higher order moments (skewness and kurtosis) and tests whether
they are consistent with those of a normal distribution. In the normal, the skewness is 0 (it
is symmetric) and the kurtosis is 3. Let ε̂i be the estimated residuals. Skewness and kurtosis
are defined
Pn
n −1 i =1 ε̂3i
sˆk = 3
(σ̂2 ) 2
Pn
n −1
i =1 ε̂4i
κ̂ =
(σ̂2 )2
The JB test is computed
 
n 1
JB = s k + (κ − 3)2
2
6 4
and is distributed χ22 . If s k ≈ 0 and κ ≈ 3, then the J B should be small and normality
should not be rejected. To use the JB test, compute J B and compare it to Cα where Cα is
the critical value from a χ22 . If J B > Cα , reject the null of normality.
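The JB statistic translates directly into code; the following Python/NumPy sketch uses scipy only for the χ²_2 p-value.

import numpy as np
from scipy import stats

def jarque_bera(resid):
    # Compute the Jarque-Bera statistic and its chi^2_2 p-value
    n = resid.shape[0]
    e = resid - resid.mean()
    sigma2 = e @ e / n
    sk = (e ** 3).mean() / sigma2 ** 1.5      # skewness
    kappa = (e ** 4).mean() / sigma2 ** 2     # kurtosis
    jb = n / 6 * (sk ** 2 + 0.25 * (kappa - 3) ** 2)
    return jb, 1 - stats.chi2.cdf(jb, 2)

rng = np.random.default_rng(0)
print(jarque_bera(rng.standard_normal(1000)))   # normal errors: should not reject
print(jarque_bera(rng.standard_t(4, 1000)))     # fat tails: large JB, small p-value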

3.13.3.5 Heteroskedasticity

Heteroskedasticity is a problem if neglected. See section 3.12.4.



3.13.3.6 Influential Observations

Influential observations are those which have a large effect on the estimated parameters.
Data, particularly data other than asset price data, often contain errors.27 These errors,
whether a measurement problem or a typo, tend to make β̂ unreliable. One method to
assess whether any observation has an undue effect on the sample is to compute the "hat"
values,

h_i = x_i (X′X)^{−1} x_i′.

This vector (which is the diagonal of P_X) summarizes the influence of each observation on
the estimated parameters and is known as the influence function. Ideally, these values should be
similar across observations and no single observation should dominate.
Consider a simple specification where yi = xi + εi where xi and εi are i.i.d. standard
normal. In this case the influence function is well behaved. Now suppose one xi is erro-
neously increased by 100. In this case, the influence function shows that the contaminated
observation (assume it is xn ) has a large impact on the parameter estimates. Figure 3.10
contains four panels. The two left panels show the original data (top) and the data with
the error (bottom) while the two right panels contain the influence functions. The influ-
ence function for the non-contaminated data is well behaved and each observation has
less than 10% influence. In the contaminated data, one observation (the big outlier), has
an influence greater than 98%.
Plotting the data would have picked up this problem immediately. However, it may be
difficult to determine whether an observation is influential when using multiple regressors
because the regressors for an observation may be “large” in many dimensions.
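A short sketch of computing the hat values (Python/NumPy); the contaminated-observation example mirrors the discussion above.

import numpy as np

def hat_values(X):
    # Diagonal of P_X = X (X'X)^{-1} X', one influence measure per observation
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
x_bad = x.copy()
x_bad[-1] += 100                            # contaminate one observation

for data in (x, x_bad):
    X = np.column_stack([np.ones(100), data])
    h = hat_values(X)
    print("max hat value:", h.max().round(3), " (sum equals k =", round(h.sum(), 1), ")")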

3.13.4 Improving estimation in the presence of outliers

Data may contain outliers for many reasons: someone entered an incorrect price on an
electronic exchange, a computer glitch multiplied all data by some large constant or a
CEO provided an answer out-of-line with other answers due to misunderstanding a survey
question. The standard least squares estimator is non-robust in the sense that large ob-
servations can have a potentially unbounded effect on the estimated parameters. A num-
ber of techniques have been developed to produce “robust” regression estimates that use
weighted least squares to restrict the influence of any observation.
For clarity of exposition, consider the problem of estimating the mean using data that
may be contaminated with a small number of large errors. The usual estimator will be
heavily influenced by these outliers, and if outliers occur with any regularity in the data
(suppose, for example, 1% of data is contaminated), the effect of outliers can result in an
estimator that is biased and in some cases inconsistent. The simplest method to robustly
27 And even some asset price data, such as TAQ prices.

Influential Observations
(Left panels: Actual Data; right panels: Influence Function; top row: clean data, bottom row: contaminated data)

Figure 3.10: The two left panels contain realizations from the data generating process yi =
xi + εi where a single xi has been contaminated (bottom left panel). The two right panels
contain the influence functions of the xi . If all data points were uniformly influential, the
distribution of the influence function should be close to uniform (as is the case in the top
right panel). In the bottom right panel, it is clear that the entire fit is being driven by a single
xi which has an influence greater than .98.

estimate the mean is to use an α-trimmed mean where α represents a quantile of the em-
pirical distribution of the data.

Definition 3.67 (α-Trimmed Mean). The α-quantile trimmed mean is

μ̂_α = (Σ_{i=1}^n y_i I_[C_L ≤ y_i ≤ C_U]) / n* (3.113)

where n* = n(1 − α) = Σ_{i=1}^n I_[C_L ≤ y_i ≤ C_U] is the number of observations used in the trimmed
mean.28

Usually α is chosen to be between .90 and .99. To use an α-trimmed mean estimator, first
compute C_L, the α/2-quantile, and C_U, the 1 − α/2-quantile, of y. Using these values,
compute the trimmed mean as in equation (3.113).
A closely related estimator to the trimmed mean is the Windsorized mean. The sole dif-
ference between an α-trimmed mean and a Windsorized mean is the method for address-
ing the outliers. Rather than dropping extreme observations below C_L and above C_U, a Wind-
sorized mean truncates the data at these points.

Definition 3.68 (Windsorized mean). Let yi∗ denote a transformed version of yi ,

yi∗ = max(min(yi , CU ), C L )

where C L and CU are the α/2 and 1 − α/2 quantiles of y . The Windsorized mean is defined
μ̂_W = (Σ_{i=1}^n y_i*) / n. (3.114)
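Both estimators can be sketched in a few lines of Python/NumPy; here α is treated as the total fraction affected, with C_L and C_U the α/2 and 1 − α/2 quantiles, an assumption since the text states the convention loosely.

import numpy as np

def trimmed_mean(y, alpha=0.05):
    # alpha-trimmed mean: drop observations outside [C_L, C_U]
    c_l, c_u = np.quantile(y, [alpha / 2, 1 - alpha / 2])
    keep = (y >= c_l) & (y <= c_u)
    return y[keep].mean()

def windsorized_mean(y, alpha=0.05):
    # Windsorized mean: censor observations at C_L and C_U rather than dropping them
    c_l, c_u = np.quantile(y, [alpha / 2, 1 - alpha / 2])
    return np.clip(y, c_l, c_u).mean()

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
y[:10] += 50                                  # contaminate 1% of the sample
print(y.mean(), trimmed_mean(y), windsorized_mean(y))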
While the α-trimmed mean and the Windsorized mean are “robust” to outliers, they are
not robust to other assumptions about the data. For example, both mean estimators are
biased unless the distribution is symmetric, although “robust” estimators are often em-
ployed as an ad-hoc test that results based on the standard mean estimator are not being
driven by outliers.
Both of these estimators are in the family of linear estimators (L-estimators). Members
of this family can always be written as

μ̂ = Σ_{i=1}^n w_i y_i

for some set of weights w_i where the data, y_i, are ordered such that y_{j−1} ≤ y_j for j =
2, 3, . . . , n. This class of estimators obviously includes the sample mean by setting w_i = 1/n
for all i, and it also includes the median by setting w_i = 0 for all i except w_m = 1 where m =
(n + 1)/2 (n is odd) or w_m = w_{m+1} = 1/2 where m = n/2 (n is even). The trimmed mean
estimator can be constructed by setting w_i = 0 if i ≤ s or i > n − s and w_i = 1/(n − 2s) otherwise,
where s = nα is assumed to be an integer. The Windsorized mean sets w_i = 0 if i ≤ s or
i > n − s, w_i = (s + 1)/n if i = s + 1 or i = n − s, and w_i = 1/n otherwise. Examining the
weights of the α-trimmed mean and the Windsorized mean, the primary difference
is in the weights w_{s+1} and w_{n−s}. In the trimmed mean, the weights on these observations
are the same as the weights on the data between these points. In the Windsorized mean
estimator, the weights on these observations are (s + 1)/n, reflecting the censoring that occurs at
these observations.
28 This assumes that nα is an integer. If this is not the case, the second expression is still valid.

3.13.4.1 Robust regression based estimators

Like the mean estimator, the least squares estimator is not “robust” to outliers. To under-
stand the relationship between L-estimators and linear regression, consider decomposing
each observation into its mean and an additive error,

μ̂ = Σ_{i=1}^n w_i y_i = Σ_{i=1}^n w_i (μ + ε_i) = Σ_{i=1}^n w_i μ + Σ_{i=1}^n w_i ε_i

From this decomposition a number of properties can be discerned. First, in order for μ̂
to be unbiased it must be the case that Σ_{i=1}^n w_i = 1 and Σ_{i=1}^n E[w_i ε_i] = 0. All of the linear
estimators satisfy the first condition, although the second will depend crucially on the dis-
tribution of the errors. If the distribution of the errors is symmetric then the Windsorized
mean, the α-trimmed mean or even median are unbiased estimators of the mean. How-
ever, if the error distribution is not symmetric, then these estimators are likely to be biased.
Unlike the usual case where E[wi εi ] = wi E[εi ], the weights are functions of the errors and
the expectation of the product is not the product of the expectations.
Second, weights on the observations (yi ) are the same as weights on the errors, εi . This
follows from noticing that if y j ≤ y j +1 then it must be the case that ε j ≤ ε j +1 .
Robust estimators in linear regression models require a two-step or iterative procedure.
The difference between robust mean estimators and robust regression arises since if yi has
a relationship to a set of explanatory variables xi , then orderings based on yi will not be the
same as orderings based on the residuals, εi . For example, consider the simple regression

yi = β xi + εi .

Assuming β > 0, the largest yi are those which correspond to either the largest xi or the largest εi. Simple
trimming estimators will not only trim large errors but will also trim yi that have large val-
ues of xi . The left panels of figure 3.11 illustrate the effects of Windsorization and trimming
on the raw data. In both cases, the regression coefficient is asymptotically biased (as indi-
cated by the dotted line), since trimming the raw data results in an error that is correlated
with the regressor. For example, observations with the largest xi values and with positive
εi more likely to be trimmed. Similarly, observations for the smallest xi values and with
negative εi are more likely to be trimmed. The result of the trimming is that the remaining
εi are negatively correlated with the remaining xi .
To avoid this issue, a two-step or iterative procedure is needed. The first step is used

to produce a preliminary estimate of β̂ . Usually OLS is used in this step although occa-
sionally some weighted least squares estimator may be used. Estimated residuals can be
constructed from the preliminary estimate of β (ε̂i = yi − xi β̂ ), and the trimming or Wind-
sorizing is done on these preliminary residuals. In the case of α-trimming, observations
with the largest errors (in absolute value) are dropped and the α-trimmed regression is es-
timated using only the observations with C L < ε̂i < CU .
Windsorized regression also uses the first step regression to estimate ε̂, but, rather than
dropping observations, errors larger than C_U are set to C_U and errors smaller than C_L are
set to C_L. Using these modified errors,

ε̂*_i = max(min(ε̂_i, C_U), C_L),

a transformed set of dependent variables is created, y*_i = x_i β̂ + ε̂*_i. The Windsorized re-
gression coefficients are then estimated by regressing y*_i on x_i. Correct application of α-
trimming and Windsorization are illustrated in the bottom two panels of figure 3.11. In
the α-trimming examples, observations marked with an x were trimmed, and in the Wind-
sorization example, observations marked with a • were reduced from their original value
to either CU or C L . It should be noted that both of these estimators are unbiased although
this result relies crucially on the symmetry of the errors.
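A sketch of the two-step procedures (Python/NumPy): trimming or Windsorizing is applied to first-step OLS residuals rather than to the raw data, as required above; the quantile level and the simulated outliers are illustrative.

import numpy as np

def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def robust_regression(y, X, alpha=0.05, method="trim"):
    # Two-step trimmed or Windsorized regression using first-step OLS residuals
    beta0 = ols(y, X)                               # step 1: preliminary OLS estimate
    e = y - X @ beta0
    c_l, c_u = np.quantile(e, [alpha / 2, 1 - alpha / 2])
    if method == "trim":
        keep = (e > c_l) & (e < c_u)                # drop observations with extreme residuals
        return ols(y[keep], X[keep])
    e_star = np.clip(e, c_l, c_u)                   # censor extreme residuals
    y_star = X @ beta0 + e_star                     # transformed dependent variable
    return ols(y_star, X)

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([0.0, 1.0]) + rng.standard_normal(n)
y[:5] += 25                                         # a few gross outliers
print(ols(y, X), robust_regression(y, X, method="trim"), robust_regression(y, X, method="windsor"))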
In addition to the two-step procedure illustrated above, an iterative estimator can be de-
fined by starting with some initial estimate of β̂, denoted β̂(1), and then trimming (or Wind-
sorizing) the data to estimate a second set of coefficients, β̂(2). Using β̂(2) and the original
data, a different set of estimated residuals can be computed, ε̂_i = y_i − x_i β̂(2), and trimmed
(or Windsorized). Using the new set of trimmed observations, a new set of coefficients, β̂(3),
can be estimated. This procedure can be repeated until it converges, for example until
max|β̂(i) − β̂(i−1)| falls below a small tolerance.29

Both α-trimmed and Windsorized regression are special cases of a broader class of “ro-
bust” regression estimators. Many of these robust regression estimators can be imple-
mented using an iterative procedure known as Iteratively Re-weighted Least Squares (IR-
WLS) and, unlike trimmed or Windsorized least squares, are guaranteed to converge. For
more on these estimators, see Huber (2004) or Rousseeuw & Leroy (2003).

3.13.4.2 Ad-hoc “Robust” Estimators

In the academic finance literature it is not uncommon to see papers that use Windsoriza-
tion (or trimming) as a check that the findings are not being driven by a small fraction of
outlying data. This is usually done by directly Windsorizing the dependent variable and/or
the regressors. While there is no theoretical basis for these ad-hoc estimators, they are a
useful tool to ensure that results and parameter estimates are valid for “typical” observa-
tions as well as for the full sample. However, if this is the goal, other methods, such as vi-
29 These iterative procedures may not converge due to cycles in {β̂(j)}.

Correct and incorrect use of "robust" estimators
(Panels: Incorrect Trimming, Incorrect Windsorization, Correct Trimming, Correct Windsorization)

Figure 3.11: These four panels illustrate correct and incorrect α-trimming (left) and Wind-
sorization (right). In both cases, the DGP was yi = xi +εi where xi and εi were independent
standard normals. Windsorization or trimming based on the raw data lead to asymptotic
bias as illustrated by the dotted lines.

sual inspection of residuals or of residuals sorted by explanatory variables, are equally valid
and often more useful in detecting problems in a regression.

3.13.4.3 Inference on “Robust” Estimators

It may be tempting to use OLS or White heteroskedasticity robust standard errors in “ro-
bust” regressions. These regressions (and most L-estimators) appear similar to standard
least-squares estimators. However, there is an additional term in the asymptotic covari-
ance of the estimated regression coefficients since the trimming or Windsorization point
must be estimated. This term is related to the precision of the trimming point, and is closely

related to the uncertainty affecting the estimation of a quantile. Fortunately bootstrapping


can be used (under some mild conditions) to quickly estimate the covariance of the estimated
regression coefficients (see the Multivariate Time Series chapter).

3.14 Projection
Least squares has one further justification: it is the best linear predictor of a dependent
variable where best is interpreted to mean that it minimizes the mean square error (MSE).
Suppose f (x) is a function of only x and not y . Mean squared error is defined

E[(y − f (x))2 ].
Assuming that it is permissible to differentiate under the expectations operator, the solu-
tion to this problem is

E[y − f (x)] = 0,
and, using the law of iterated expectations,

f (x) = E[y |x]


is the solution. However, if f (x) is restricted to include only linear functions of x, the prob-
lem simplifies to choosing β to minimize the MSE,

E[(y − xβ )2 ]
and differentiating under the expectations (again, when possible),

E[x0 (y − xβ )] = 0

and β̂ = E[x0 x]−1 E[x0 y]. In the case where x contains a constant, this allows the best linear
predictor to be expressed in terms of the covariance matrix of y and x̃ where the˜indicates
the constant has been excluded (i.e. x = [1 x̃]), and so

β̂ = Σ_XX^{−1} Σ_Xy

where the covariance matrix of [y x̃] can be partitioned

Cov([y x̃]) = [ Σ_XX  Σ_Xy ; Σ_Xy′  Σ_yy ]

Recall from assumptions 3.35 that {xi , εi } is a stationary and ergodic sequence and from
assumption 3.36 that it has finite second moments and is of full rank. These two assump-
tions are sufficient to justify the OLS estimator as the best linear predictor of y . Further, the

Weights of an S&P 500 tracking portfolio
(Bar chart of portfolio weights for the 30 individual stocks, ordered alphabetically by ticker from AA to XOM)
Figure 3.12: Plot of the optimal tracking portfolio weights. The optimal tracking portfolio
is long all assets and no weight is greater than 11%.

OLS estimator can be used to make predictions for out of sample data. Suppose yn+1 was
an out-of-sample data point. Using the OLS procedure, the best predictor of yn+1 (again,
in the MSE sense), denoted ŷn +1 is xn +1 β̂ .
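The equivalence between the covariance-based expression and OLS with a constant can be checked numerically; the following Python/NumPy sketch uses simulated data.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x_tilde = rng.standard_normal((n, 2))
y = 0.5 + x_tilde @ np.array([1.0, -2.0]) + rng.standard_normal(n)

# OLS with a constant
X = np.column_stack([np.ones(n), x_tilde])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Best linear predictor slopes from the covariance matrix of [y, x_tilde]
S = np.cov(np.column_stack([y, x_tilde]), rowvar=False)
Sigma_XX, Sigma_Xy = S[1:, 1:], S[1:, 0]
slopes = np.linalg.solve(Sigma_XX, Sigma_Xy)
intercept = y.mean() - x_tilde.mean(0) @ slopes

print(beta_ols.round(3))                   # intercept and slopes from OLS
print(np.r_[intercept, slopes].round(3))   # matches the covariance-based expression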

3.14.1 Tracking Error Minimization

Consider the problem of setting up a portfolio that would generate returns as close as possi-
ble to the return on some index, for example the FTSE 100. One option would be to buy the
entire index portfolio and replicate it perfectly. For other indices, such as the Wilshire
5000, which consists of many small and illiquid stocks, this is impossible and a tracking
portfolio consisting of many fewer stocks must be created. One method to create the track-

ing portfolios is to find the best linear predictor of the index using a set of individual shares.
Let xi be the returns on a set of assets and let yi be the return on the index. The tracking
error problem is to minimize

E[(yi − xi w)2 ]

where w is a vector of portfolio weights. This is the same problem as the best linear predic-
tor and ŵ = (X0 X)−1 X0 y.
Data between January 2, 2003 and December 31, 2007 was used, a total of 1258 trad-
ing days. The regression specification is simple: the return on the S&P is regressed on the
returns on the individual issues,
r_i^{SP500} = Σ_{j=1}^{30} w_j r_{i,j} + ε_i

where the portfolios are ordered alphabetically (not that this matters). The portfolio weights
(which need not sum to 1) are presented in figure 3.12. All stocks have positive weight and
the maximum weight is 11%. More importantly, this portfolio tracks the return of the S&P
to within 2.5% per year, a large reduction from the 13.5% per year volatility of the S&P over
this period.
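Since the original return data are not reproduced here, the following Python/NumPy sketch illustrates the tracking regression on simulated asset and index returns with an assumed one-factor structure; only the mechanics, ŵ = (X′X)^{−1}X′y, carry over.

import numpy as np

rng = np.random.default_rng(0)
n_days, n_assets = 1258, 30
common = rng.standard_normal(n_days) * 0.01                         # simulated common factor
assets = common[:, None] + rng.standard_normal((n_days, n_assets)) * 0.01
index = assets.mean(axis=1) + rng.standard_normal(n_days) * 0.001   # index built from the assets

# Tracking weights: best linear predictor of the index return
w = np.linalg.lstsq(assets, index, rcond=None)[0]
tracking_error = index - assets @ w

ann = np.sqrt(252)
print("weights sum to:", round(w.sum(), 3))
print("annualized index vol:   ", round(index.std() * ann, 4))
print("annualized tracking vol:", round(tracking_error.std() * ann, 4))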
While the regression estimates provide the solution to the unconditional tracking error
problem, this estimator ignores two important considerations: how should stocks be se-
lected and how conditioning information (such as time-varying covariance) can be used.
The first issue, which stocks to choose, is difficult and is typically motivated by the cost of
trading and liquidity. The second issue will be re-examined in the context of Multivariate
GARCH and related models in a later chapter.

3.A Selected Proofs

Theorem 3.15.

E[β̂|X] = E[(X′X)^{−1} X′y|X]
= E[(X′X)^{−1} X′Xβ + (X′X)^{−1} X′ε|X]
= β + E[(X′X)^{−1} X′ε|X]
= β + (X′X)^{−1} X′ E[ε|X]
= β



Theorem 3.16.

V[β̂|X] = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])′|X]
= E[(β̂ − β)(β̂ − β)′|X]
= E[(X′X)^{−1} X′εε′X (X′X)^{−1}|X]
= (X′X)^{−1} X′ E[εε′|X] X (X′X)^{−1}
= σ² (X′X)^{−1} X′ I_n X (X′X)^{−1}
= σ² (X′X)^{−1} X′X (X′X)^{−1}
= σ² (X′X)^{−1}

Theorem 3.17. Without loss of generality C = (X′X)^{−1} X′ + D′ where D′ must satisfy D′X = 0
and E[D′ε|X] = 0 since

E[β̃|X] = E[Cy|X]
= E[((X′X)^{−1} X′ + D′)(Xβ + ε)|X]
= β + D′Xβ + E[D′ε|X]

and by assumption Cy is unbiased and so E[Cy|X] = β.

V[β̃|X] = E[((X′X)^{−1} X′ + D′) εε′ (D + X(X′X)^{−1})|X]
= E[(X′X)^{−1} X′εε′X (X′X)^{−1}|X] + E[D′εε′D|X] + E[D′εε′X (X′X)^{−1}|X] + E[(X′X)^{−1} X′εε′D|X]
= σ²(X′X)^{−1} + σ²D′D + σ²D′X(X′X)^{−1} + σ²(X′X)^{−1} X′D
= V[β̂|X] + σ²D′D + 0 + 0
= V[β̂|X] + σ²D′D

and so the variance of β̃ is equal to the variance of β̂ plus a positive semi-definite matrix,
and so

V[β̃|X] − V[β̂|X] = σ²D′D ≥ 0

where the inequality is strict whenever D ≠ 0.

Theorem 3.18.

β̂ = β + (X′X)^{−1} X′ε

and so β̂ is a linear function of the normal random variables ε, and so it must be normal. Ap-
plying the results of Theorems 3.15 and 3.16 completes the proof.

Theorem 3.19.

β̂ − β = (X′X)^{−1} X′ε and ε̂ = y − X(X′X)^{−1} X′y = M_X y = M_X ε, and so

E[(β̂ − β) ε̂′|X] = E[(X′X)^{−1} X′εε′M_X|X]
= (X′X)^{−1} X′ E[εε′|X] M_X
= σ²(X′X)^{−1} X′M_X
= σ²(X′X)^{−1} (M_X X)′
= σ²(X′X)^{−1} 0
= 0

since M_X X = 0 by construction. β̂ and ε̂ are jointly normally distributed since both are
linear functions of ε, and since they are uncorrelated they are independent.30
Theorem 3.20. σ̂² = ε̂′ε̂/(n − k) and so (n − k)σ̂² = ε̂′ε̂. ε̂ = M_X ε, so (n − k)σ̂² = ε′M_X′M_X ε
= ε′M_X ε since M_X is idempotent (and hence symmetric), and (n − k)σ̂²/σ² = (ε/σ)′ M_X (ε/σ) = z′M_X z
where z is an n by 1 multivariate normal vector with covariance I_n. Finally, applying the
result in Lemma 3.69, z′M_X z ∼ Σ_{i=1}^n λ_i χ²_{1,i} where {λ_i}, i = 1, 2, . . . , n are the eigenvalues
of M_X and χ²_{1,i}, i = 1, 2, . . . , n are independent χ²_1 random variables. Finally note that M_X is
a rank n − k idempotent matrix, so it must have n − k eigenvalues equal to 1, λ_i = 1 for
i = 1, 2, . . . , n − k, and k eigenvalues equal to 0, λ_i = 0 for i = n − k + 1, . . . , n, and so the
distribution is a χ²_{n−k}.

Lemma 3.69 (Quadratic Forms of Multivariate Normals). Suppose z ∼ N(0, Σ) where Σ is
an n by n positive semi-definite matrix, and let W be an n by n positive semi-definite matrix,
then

z′Wz ∼ N₂(0, Σ; W) ≡ Σ_{i=1}^n λ_i χ²_{1,i}

where λ_i are the eigenvalues of Σ^{1/2} W Σ^{1/2} and N₂(·) is known as a type-2 normal.

This lemma is a special case of Baldessari (1967) as presented in White (Lemma 8.2, 1996).

Theorem 3.22. The OLS estimator is the BUE estimator since it is unbiased by Theorem
3.15 and it achieves the Cramer-Rao lower bound (Theorem 3.21).
30 Zero correlation is, in general, insufficient to establish that two random variables are independent. However, when two random variables are jointly normally distributed, they are independent if and only if they are uncorrelated.

Theorem 3.24. Follows directly from the definition of a Student’s t by applying Theorems
3.18, 3.19, and 3.16.

Theorem 3.28. Follows directly from the definition of a Fν1 ,ν2 by applying Theorems 3.18,
3.19, and 3.16.

Theorem 3.40.

β̂_n − β = (X′X)^{−1} X′ε
= (Σ_{i=1}^n x_i′x_i)^{−1} Σ_{i=1}^n x_i′ε_i
= (n^{−1} Σ_{i=1}^n x_i′x_i)^{−1} (n^{−1} Σ_{i=1}^n x_i′ε_i)

Since E[x_i′x_i] is positive definite by Assumption 3.36, and {x_i} is stationary and ergodic by
Assumption 3.35, then n^{−1} Σ_{i=1}^n x_i′x_i will be positive definite for n sufficiently large, and so β̂_n
exists. Applying the Ergodic Theorem (Theorem 3.70), n^{−1} Σ_{i=1}^n x_i′x_i →a.s. Σ_XX and
n^{−1} Σ_{i=1}^n x_i′ε_i →a.s. 0, and by the Continuous Mapping Theorem (Theorem 3.71) combined with
the continuity of the matrix inverse function, (n^{−1} Σ_{i=1}^n x_i′x_i)^{−1} →a.s. Σ_XX^{−1}, and so

β̂_n − β = (n^{−1} Σ_{i=1}^n x_i′x_i)^{−1} (n^{−1} Σ_{i=1}^n x_i′ε_i) →a.s. Σ_XX^{−1} · 0 = 0.

Finally, almost sure convergence implies convergence in probability and so β̂_n − β →p 0 or
β̂_n →p β.

Theorem 3.70 (Ergodic Theorem). If {z_t} is ergodic and its rth moment, μ_r, is finite, then
T^{−1} Σ_{t=1}^T z_t^r →a.s. μ_r.

Theorem 3.71 (Continuous Mapping Theorem). Given g : R^k → R^l, and any sequence of
random k by 1 vectors {z_n} such that z_n →a.s. z where z is k by 1, if g is continuous at z, then
g(z_n) →a.s. g(z).

Theorem 3.41. See White (Theorem 5.25, 2000).



Theorem 3.43.

ε̂′ε̂/n = (y − Xβ̂_n)′(y − Xβ̂_n)/n
= (y − Xβ̂_n + Xβ − Xβ)′(y − Xβ̂_n + Xβ − Xβ)/n
= (y − Xβ + X(β − β̂_n))′(y − Xβ + X(β − β̂_n))/n
= (ε + X(β − β̂_n))′(ε + X(β − β̂_n))/n
= ε′ε/n + 2(β − β̂_n)′X′ε/n + (β − β̂_n)′X′X(β − β̂_n)/n

By the Ergodic Theorem and the existence of E[ε²_i] (Assumption 3.39), the first term converges
to σ². The second term

(β − β̂_n)′X′ε/n = (β − β̂_n)′ (n^{−1} Σ_{i=1}^n x_i′ε_i) →p 0′ 0 = 0

since β̂_n is consistent and E[x_i ε_i] = 0 combined with the Ergodic Theorem. The final term

(β − β̂_n)′X′X(β − β̂_n)/n = (β − β̂_n)′ (X′X/n) (β − β̂_n) →p 0′ Σ_XX 0 = 0

and so the variance estimator is consistent.

Theorem 3.48.

β̂_1n = (X_1′X_1/n)^{−1} (X_1′y/n)
= (X_1′X_1/n)^{−1} (X_1′(X_1β_1 + X_2β_2 + ε)/n)
= (X_1′X_1/n)^{−1}(X_1′X_1/n)β_1 + (X_1′X_1/n)^{−1}(X_1′X_2/n)β_2 + (X_1′X_1/n)^{−1}(X_1′ε/n)
→p β_1 + Σ_X1X1^{−1} Σ_X1X2 β_2 + Σ_X1X1^{−1} 0
= β_1 + Σ_X1X1^{−1} Σ_X1X2 β_2

where (X_1′X_1/n)^{−1} →p Σ_X1X1^{−1} and X_1′X_2/n →p Σ_X1X2 by the Ergodic and Continuous Mapping
Theorems (Theorems 3.70 and 3.71). Finally note that

(X_1′X_1/n)^{−1} (X_1′X_2/n) = (X_1′X_1/n)^{−1} [X_1′x_{2,1}/n  X_1′x_{2,2}/n  . . .  X_1′x_{2,k_2}/n]
= [δ̂_1n  δ̂_2n  . . .  δ̂_{k_2,n}]

where δ_j is the regression coefficient in x_{2,j} = X_1 δ_j + η_j.

Theorem 3.53. See White (Theorem 6.3, 2000).

Theorem 3.54. See White (Theorem 6.4, 2000).

Theorem 3.57. By Assumption 3.56,

V^{−1/2} y = V^{−1/2} Xβ + V^{−1/2} ε

and V[V^{−1/2} ε] = σ² I_n, uncorrelated and homoskedastic, and so Theorem 3.17 can be ap-
plied.

Exercises
Exercise 3.1. Imagine you have been given the task of evaluating the relationship between
the return on a mutual fund and the number of years its manager has been a professional.
You have a panel data set which covers all of the mutual funds' returns in the years 1970-2005.
Consider the regression
ri ,t = α + β experi ,t + εi ,t
where ri t is the return on fund i in year t and experi t is the number of years the fund man-
ager has held her job in year t . The initial estimates of β and α are computed by stacking
all of the observations into a vector and running a single OLS regression (across all funds
and all time periods).
i. What test statistic would you use to determine whether experience has a positive ef-
fect?

ii. What are the null and alternative hypotheses for the above test?

iii. What does it mean to make a type I error in the above test? What does it mean to make
a type II error in the above test?

iv. Suppose that experience has no effect on returns but that unlucky managers get fired
and thus do not gain experience. Is this a problem for the above test? If so, can you
comment on its likely effect?

v. Could the estimated β̂ ever be valid if mutual funds had different risk exposures? If
so, why? If not, why not?

vi. If mutual funds do have different risk exposures, could you write down a model which
may be better suited to testing the effect of managerial experience than the initial
simple specification? If it makes your life easier, you can assume there are only 2
mutual funds and 1 risk factor to control for.
Exercise 3.2. Consider the linear regression

yt = β x t + εt

i. Derive the least squares estimator. What assumptions are you making in the deriva-
tion of the estimator?

ii. Under the classical assumptions, derive the variance of the estimator β̂ .
iii. Suppose the errors εt have an AR(1) structure where εt = ρεt−1 + ηt, where ηt →d
N(0, 1) and |ρ| < 1. What is the variance of β̂ now?

iv. Now suppose that the errors have the same AR(1) structure but the x t variables are
i.i.d.. What is the variance of β̂ now?

v. Finally, suppose the linear regression is now

yt = α + β x t + εt

where εt has an AR(1) structure and that x t is i.i.d.. What is the covariance of [α β ]0 ?

Exercise 3.3. Consider the simple regression model yi = β x1,i + εi where the random error
terms are i.i.d. with mean zero and variance σ2 and are uncorrelated with the x1,i .
i. Show that the OLS estimator of β is consistent.

ii. Is the previously derived OLS estimator of β still consistent if yi = α+β x1,i +εi ? Show
why or why not.

iii. Now suppose the data generating process is

yi = β1 x1,i + β2 x2,i + εi

Derive the OLS estimators of β1 and β2 .

iv. Derive the asymptotic covariance of this estimator using the method of moments ap-
proach.

(a) What are the moments conditions?


(b) What is the Jacobian?
(c) What does the Jacobian limit to? What does this require?
(d) What is the covariance of the moment conditions. Be as general as possible.

In all of the above, clearly state any additional assumptions needed.


Exercise 3.4. Let Ŝ be the sample covariance matrix of z = [y X], where X does not include
a constant
Ŝ = n^{−1} Σ_{i=1}^n (z_i − z̄)′(z_i − z̄)

Ŝ = [ ŝ_yy  ŝ_xy′ ; ŝ_xy  Ŝ_xx ]

and suppose n, the sample size, is known (Ŝ is the sample covariance estimator). Under the
small sample assumptions (including homoskedasticity and normality if needed), describe
one method, using only Ŝ, X̄ (the 1 by k − 1 sample mean of the matrix X, column-by-
column), ȳ and n , to
i. Estimate β̂1 , . . . , β̂k from a model

yi = β1 + β2 x2,i + . . . + βk xk ,i + εi

ii. Estimate s , the standard error of the regression

iii. Test H0 : β j = 0, j = 2, . . . , k

Exercise 3.5. Consider the regression model

yi = β1 + β2 xi + εi

where the random error terms are i.i.d. with mean zero and variance σ2 and are uncorre-
lated with the xi . Also assume that xi is i.i.d. with mean µ x and variance σ2x , both finite.

i. Using scalar notation, derive the OLS estimators of β1 and β2 .

ii. Show these estimators are consistent. Are any further assumptions needed?

iii. Show that the matrix expression for the estimator of the regression parameters, β̂ =
−1
(X0 X) X0 y, is identical to the estimators derived using scalar notation.

Exercise 3.6. Let xm β be the best linear projection of ym . Let εm be the prediction error.

i. What is the variance of a projected y ?

ii. What is the variance if the β s are estimated using regressors that do not include ob-
servation m (and hence not xm or εm )? Hint: You can use any assumptions in the
notes, just be clear what you are assuming

Exercise 3.7. Are Wald tests of linear restrictions in a linear regression invariant to linear
reparameterizations? Hint: Let F be an invertible matrix. Parameterize W in the case where
H0 : Rβ − r = 0 and H0 : F(Rβ − r) = FRβ − Fr = 0.

i. Are they the same?

ii. Show that n · R2 has an asymptotic χk2−1 distribution under the classical assumptions
when the model estimated is

yi = β1 + β2 x2,i + . . . + βk xk ,i + εi

Hint: What does the distribution of c/ν converge to as ν → ∞ when c ∼ χ²_ν?

Exercise 3.8. Suppose an unrestricted model is

yi = β1 + β2 x1,i + β3 x2,i + β4 x3,i + εi

i. Sketch the steps required to test a null H0 : β2 = β3 = 0 in the large sample framework
using a Wald test and a LM test.

ii. Sketch the steps required to test a null H0 : β2 + β3 + β4 = 1 in the small sample
framework using a Wald test, a t test, a LR test and a LM test.

In the above questions be clear what the null and alternative are, which regressions must
be estimated, how to compute any numbers that are needed and the distribution of the
test statistic.

Exercise 3.9. Let yi and xi conform to the small sample assumptions and let yi = β1 +
β2 xi + εi . Define another estimator

ȳH − ȳL
β̆2 =
x̄H − x̄ L

where x̄H is the average value of xi given xi > median (x), and ȳH is the average value of yi
for i such that xi > median(x). x̄_L is the average value of xi given xi ≤ median(x), and ȳL is
the average value of yi for i such that xi ≤ median(x) (both x̄ and ȳ depend on the order of
xi, and not yi). For example suppose the xi were ordered such that x1 < x2 < x3 < . . . < xn
and n is even. Then,

x̄_L = (2/n) Σ_{i=1}^{n/2} x_i

and

x̄_H = (2/n) Σ_{i=n/2+1}^{n} x_i

i. Is β̆2 unbiased, conditional on X?

ii. Is β̆2 consistent? Are any additional assumptions needed beyond those of the small
sample framework?

iii. What is the variance of β̆2 , conditional on X?

Exercise 3.10. Suppose


yi = β1 + β2 xi + εi
and that a variable z_i is available where V[z_i] = σ²_z > 0, Corr(x_i, z_i) = ρ ≠ 0 and E[ε_i|z] = 0,
i = 1, . . . , n. Further suppose the other assumptions of the small sample framework hold.
Rather than the usual OLS estimator,

β̈_2 = (Σ_{i=1}^n (z_i − z̄) y_i) / (Σ_{i=1}^n (z_i − z̄) x_i)

is used.

i. Is β̈2 a reasonable estimator for β2 ?

ii. What is the variance of β̈2 , conditional on x and z?

iii. What does the variance limit to (i.e. not conditioning on x and z)?

iv. How is this estimator related to OLS, and what happens to its variance when OLS is
used (Hint: What is Corr (xi , xi )?)

Exercise 3.11. Let {yi }ni=1 and {xi }ni=1 conform to the small sample assumptions and let
yi = β1 + β2 xi + εi . Define the estimator

ȳH − ȳL
β̆2 =
x̄H − x̄ L

where x̄H is the average value of xi given xi > median (x), and ȳH is the average value of yi
for i such that xi > median (x). x̄ L is the average value of xi given xi ≤ median (x), and ȳL is
the average value of yi for i such that xi ≤ median (x) (both x̄ and ȳ depend on the order of
xi , and not yi ). For example suppose the xi were ordered such that x1 < x2 < x3 < . . . < xn
and n is even. Then,
x̄_L = (2/n) Σ_{i=1}^{n/2} x_i

and

x̄_H = (2/n) Σ_{i=n/2+1}^{n} x_i

i. Is β̆2 unbiased, conditional on X?

ii. Is β̆2 consistent? Are any additional assumptions needed beyond those of the small
sample framework?

iii. What is the variance of β̆2 , conditional on X?


Next consider the estimator

β̈2 =

where ȳ and x̄ are sample averages of {yi} and {xi}, respectively.

iv. Is β̈2 unbiased, conditional on X?

v. Is β̈2 consistent? Are any additional assumptions needed beyond those of the small
sample framework?

vi. What is the variance of β̈2 , conditional on X?

Exercise 3.12. Suppose an unrestricted model is

yi = β1 + β2 x1,i + β3 x2,i + β4 x3,i + εi

i. Discuss which features of estimators each of the three major tests, Wald, Likelihood
Ratio, and Lagrange Multiplier, utilize in testing.

ii. Sketch the steps required to test a null H0 : β2 = β3 = 0 in the large sample framework
using Wald, LM and LR tests.

iii. What are type I & II errors?

iv. What is the size of a test?

v. What is the power of a test?

vi. What influences the power of a test?

vii. What is the most you can say about the relative power of a Wald, LM and LR test of
the same null?

Exercise 3.13. Consider the regression model

yi = β1 + β2 xi + εi

where the random error terms are i.i.d. with mean zero and variance σ2 and are uncorre-
lated with the xi . Also assume that xi is i.i.d. with mean µ x and variance σ2x , both finite.

i. Using scalar notation, derive the OLS estimators of β1 and β2 .

ii. Show that these estimators are consistent. Are any further assumptions needed?

iii. Show that the matrix expression for the estimator of the regression parameters, β̂ =
−1
(X0 X) X0 y, is identical to the estimators derived using scalar notation.

iv. Suppose instead


yi = γ1 + γ2 (xi − x̄ ) + εi
was fit to the data. How are the estimates of the γs related to the β s?

v. What can you say about the relationship between the t -statistics of the γs and the β s?

vi. How would you test for heteroskedasticity in the regression?

vii. Since the errors are i.i.d. there is no need to use White’s covariance estimator for this
regression. What are the consequences of using White’s covariance estimator if it is
not needed?
Chapter 4

Analysis of a Single Time Series

Note: The primary reference for these notes is Enders (2004). An alternative and more
technical treatment can be found in Hamilton (1994).

Most data used in financial econometrics occur sequentially through time. Inter-
est rates, asset returns, and foreign exchange rates are all time series. This chapter
introduces time-series econometrics and focuses primarily on linear models, al-
though some common non-linear models are described in the final section. The
analysis of time-series data begins by defining two key concepts in the analysis of
time series: stationarity and ergodicity. The chapter next turns to Autoregressive
Moving Average models (ARMA) and covers the structure of these models, station-
arity conditions, model selection, estimation, inference and forecasting. The chapter
concludes by examining nonstationary time series.

4.1 Stochastic Processes

A stochastic process is an arbitrary sequence of random data, and is denoted

{yt } (4.1)

where {·} is used to indicate that the y s form a sequence. The simplest non-trivial stochas-
tic process specifies that yt ∼ D i.i.d. for some distribution D, for example the normal.
simple stochastic process is the random walk,

yt = yt −1 + εt

where εt is an i.i.d. process.



4.2 Stationarity, Ergodicity and the Information Set


Stationarity is a probabilistically meaningful measure of regularity. This regularity can be
exploited to estimate unknown parameters and characterize the dependence between ob-
servations across time. If the data generating process frequently changed in an unpre-
dictable manner, constructing a meaningful model would be difficult or impossible.
Stationarity exists in two forms, strict stationarity and covariance (also known as weak)
stationarity. Covariance stationarity is important when modeling the mean of a process,
although strict stationarity is useful in more complicated settings, such as non-linear mod-
els.

Definition 4.1 (Strict Stationarity). A stochastic process {yt } is strictly stationary if the joint
distribution of {yt, yt+1, . . . , yt+h} depends only on h and not on t.

Strict stationarity requires that the joint distribution of a stochastic process does not de-
pend on time and so the only factor affecting the relationship between two observations
is the gap between them. Strict stationarity is weaker than i.i.d. since the process may be
dependent but it is nonetheless a strong assumption and implausible for many time series,
including both financial and macroeconomic data.
Covariance stationarity, on the other hand, only imposes restrictions on the first two
moments of a stochastic process.

Definition 4.2 (Covariance Stationarity). A stochastic process {yt} is covariance stationary
if

E[yt] = μ for t = 1, 2, . . . (4.2)
V[yt] = σ² < ∞ for t = 1, 2, . . .
E[(yt − μ)(yt−s − μ)] = γs for t = 1, 2, . . . , s = 1, 2, . . . , t − 1.

Covariance stationarity requires that both the unconditional mean and unconditional vari-
ance are finite and do not change with time. Note that covariance stationarity only applies
to unconditional moments and not conditional moments, and so a covariance process may
have a varying conditional mean (i.e. be predictable).
These two types of stationarity are related although neither nests the other. If a pro-
cess is strictly stationary and has finite second moments, then it is covariance stationary.
If a process is covariance stationary and the joint distribution of the studentized residuals
(demeaned and standardized by their standard deviation) does not depend on time, then
the process is strictly stationary. However, one type can occur without the other, both can
occur or neither may be applicable to a particular time series. For example, if a process
has higher order moments which depend on time (e.g. time-varying kurtosis), it may be
covariance stationary but not strictly stationary. Alternatively, a sequence of i.i.d. Student’s
t random variables with 2 degrees of freedom is strictly stationary but not covariance sta-
tionary since the variance of a t 2 is infinite.

γs = E[(yt − μ)(yt−s − μ)] is the covariance of yt with itself at a different point in time,
known as the sth autocovariance. γ0 is the lag-0 autocovariance, the same quantity as the
long-run variance of yt (i.e. γ0 = V[yt]).1

Definition 4.3 (Autocovariance). The autocovariance of a covariance stationary scalar pro-
cess {yt} is defined

γs = E[(yt − μ)(yt−s − μ)] (4.3)

where μ = E[yt]. Note that γ0 = E[(yt − μ)(yt − μ)] = V[yt].

Ergodicity is another important concept in the analysis of time series, and is one form
of asymptotic independence.

Definition 4.4 (Ergodicity). Let {yt} be a stationary sequence. {yt} is ergodic if for any two
bounded functions f : R^k → R and g : R^l → R,

lim_{j→∞} |E[f(yt, . . . , yt+k) g(yt+j, . . . , yt+l+j)]| = |E[f(yt, . . . , yt+k)]| |E[g(yt+j, . . . , yt+l+j)]| (4.4)
In essence, if an ergodic stochastic process is sampled at two points far apart in time,
these samples will be independent. The ergodic theorem provides a practical application
of ergodicity.

Theorem 4.5 (Ergodic Theorem). If {yt} is ergodic and its rth moment μr is finite, then
T^{−1} Σ_{t=1}^T yt^r →p μr.

The ergodic theorem states that averages will converge to their expectation provided
the expectation exists. The intuition for this result follows from the definition of ergodicity
since samples far apart in time are (effectively) independent, and so errors average out across
time.

Not all series are ergodic. Let yt = η + εt where η ∼ N(0, 1), εt i.i.d.∼ N(0, 1) and η and εt
are independent for any t. Note that η is drawn only once (not every t). Clearly, E[yt] = 0.
However, T^{−1} ∑_{t=1}^{T} yt →p η ≠ 0, and so even though the average converges it does not
converge to E[yt] since the effect of the initial draw of η is present in every observation of
{yt}.
The third important building block of time-series models is white noise. White noise
generalizes i.i.d. noise and allows for dependence in a series as long as three conditions are
satisfied: the series is mean zero, uncorrelated and has finite second moments.

¹ The term long-run variance is used to distinguish V[yt] from the innovation variance, V[εt], also known
as the short-run variance.

Definition 4.6 (White Noise). A process {εt} is known as white noise if

E[εt] = 0 for t = 1, 2, . . .   (4.5)
V[εt] = σ² < ∞ for t = 1, 2, . . .
E[εt εt−j] = Cov(εt, εt−j) = 0 for t = 1, 2, . . . , j ≠ 0.

An i.i.d. series with finite second moments is trivially white noise, but other important
processes, such as residuals following an ARCH (Autoregressive Conditional Heteroskedas-
ticity) process, may also be white noise although not independent since white noise only
requires linear independence.2 A white noise process is also covariance stationary since it
satisfies all three conditions: the mean, variance and autocovariances are all finite and do
not depend on time.
The final important concepts are conditional expectation and the information set. The
information set at time t is denoted Ft and contains all time t measurable events³, and so
the information set includes realizations of all variables which have occurred on or before
t. For example, the information set for January 3, 2008 contains all stock returns up to
and including those which occurred on January 3. It also includes everything else known
at this time such as interest rates, foreign exchange rates or the scores of recent football
games. Expectations will often be made conditional on the time-t information set,
expressed E[yt+h|Ft], or in abbreviated form as Et[yt+h]. The conditioning information set
matters when taking expectations and E[yt+h], Et[yt+h] and Et+h[yt+h] are not the same.
Conditional variance is similarly defined, V[yt+h|Ft] = Vt[yt+h] = Et[(yt+h − Et[yt+h])²].

4.3 ARMA Models


Autoregressive moving average (ARMA) processes form the core of time-series analysis.
The ARMA class can be decomposed into two smaller classes, autoregressive (AR) pro-
cesses and moving average (MA) processes.

4.3.1 Moving Average Processes

The 1st order moving average, written MA(1), is the simplest non-degenerate time-series
process,

yt = φ0 + θ1 εt −1 + εt
where φ0 and θ1 are parameters and εt is a white noise series. This process stipulates that
the current value of yt depends on both a new shock and the previous shock. For example,
if θ1 is negative, the current realization will “bounce back” from the previous shock.

² Residuals generated from an ARCH process have dependence in conditional variances but not in the mean.
³ A measurable event is any event that can have probability assigned to it at time t. In general this includes
any observed variable but can also include time-t beliefs about latent (unobserved) variables such as volatility
or the final revision of the current quarter’s GDP.

Definition 4.7 (First Order Moving Average Process). A first order Moving Average process
(MA(1)) has dynamics which follow

yt = φ0 + θ1 εt −1 + εt (4.6)

where εt is a white noise process with the additional property that E t −1 [εt ] = 0.

It is simple to derive both the conditional and unconditional means in this process. The
conditional mean is

E t −1 [yt ] = E t −1 [φ0 + θ1 εt −1 + εt ] (4.7)


= φ0 + θ1 E t −1 [εt −1 ] + E t −1 [εt ]
= φ0 + θ1 εt −1 + 0
= φ0 + θ1 εt −1

where E t −1 [εt ] = 0 follows by assumption that the shock is unpredictable using the time-
t − 1 information set, and since εt −1 is in the time-t − 1 information set (εt −1 ∈ Ft −1 ), it
passes through the time-t − 1 conditional expectation. The unconditional mean is

E [yt ] = E [φ0 + θ1 εt −1 + εt ] (4.8)


= φ0 + θ1 E [εt −1 ] + E [εt ]
= φ0 + θ1 0 + 0
= φ0 .

Comparing these two results, the unconditional mean of yt , E [yt ], is φ0 while the condi-
tional mean E t −1 [yt ] = φ0 + θ1 εt −1 . This difference reflects the persistence of the previous
shock in the current period. The variances can be similarly derived,

V[yt] = E[(φ0 + θ1 εt−1 + εt − E[φ0 + θ1 εt−1 + εt])²]   (4.9)
= E[(φ0 + θ1 εt−1 + εt − φ0)²]
= E[(θ1 εt−1 + εt)²]
= θ1² E[ε²t−1] + E[ε²t] + 2θ1 E[εt−1 εt]
= θ1² σ² + σ² + 0
= σ²(1 + θ1²)


where E[εt−1 εt] = 0 follows from the white noise assumption. The conditional variance is
Vt−1[yt] = Et−1[(φ0 + θ1 εt−1 + εt − Et−1[φ0 + θ1 εt−1 + εt])²]   (4.10)
= Et−1[(φ0 + θ1 εt−1 + εt − φ0 − θ1 εt−1)²]
= Et−1[ε²t]
= σ²t

where σ²t is the conditional variance of εt. White noise does not have to be homoskedastic,
although if εt is homoskedastic then Vt−1[yt] = E[σ²t] = σ². Like the mean, the unconditional
variance and the conditional variance are different. The unconditional variance is
unambiguously larger than the average conditional variance, which reflects the extra variability
introduced by the moving average term.
Finally, the autocovariances can be derived

E[(yt − E[yt])(yt−1 − E[yt−1])] = E[(φ0 + θ1 εt−1 + εt − φ0)(φ0 + θ1 εt−2 + εt−1 − φ0)]   (4.11)
= E[θ1 ε²t−1 + θ1 εt εt−2 + εt εt−1 + θ1² εt−1 εt−2]
= θ1 E[ε²t−1] + θ1 E[εt εt−2] + E[εt εt−1] + θ1² E[εt−1 εt−2]
= θ1 σ² + 0 + 0 + 0
= θ1 σ²

E[(yt − E[yt])(yt−2 − E[yt−2])] = E[(φ0 + θ1 εt−1 + εt − φ0)(φ0 + θ1 εt−3 + εt−2 − φ0)]   (4.12)
= E[(θ1 εt−1 + εt)(θ1 εt−3 + εt−2)]
= E[θ1 εt−1 εt−2 + θ1 εt−3 εt + εt εt−2 + θ1² εt−1 εt−3]
= θ1 E[εt−1 εt−2] + θ1 E[εt−3 εt] + E[εt εt−2] + θ1² E[εt−1 εt−3]
= 0 + 0 + 0 + 0
= 0

By inspection of eq. (4.12) it follows that γs = E[(yt − E[yt])(yt−s − E[yt−s])] = 0 for s ≥ 2.

The MA(1) can be generalized into the class of MA(Q ) processes by including additional
lagged errors.

Definition 4.8 (Moving Average Process of Order Q). A Moving Average process of order Q,
abbreviated MA(Q), has dynamics which follow

yt = φ0 + ∑_{q=1}^{Q} θq εt−q + εt   (4.13)

where εt is a white noise series with the additional property that Et−1[εt] = 0.

The following properties hold in higher order moving averages:

• E[yt] = φ0

• V[yt] = (1 + ∑_{q=1}^{Q} θq²) σ²

• E[(yt − E[yt])(yt−s − E[yt−s])] = σ² ∑_{i=0}^{Q−s} θi θi+s for s ≤ Q where θ0 = 1

• E[(yt − E[yt])(yt−s − E[yt−s])] = 0 for s > Q
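To make these moment results concrete, the following short simulation sketch (added here for illustration, not part of the original notes; it assumes Python with NumPy is available) generates a long MA(1) sample and compares the sample mean, variance and autocovariances with the theoretical values φ0, (1 + θ1²)σ², θ1σ² and 0.

```python
# Sketch: simulate an MA(1) and compare sample moments with the theory above:
# E[y] = phi0, V[y] = (1 + theta1^2)*sigma^2, gamma_1 = theta1*sigma^2, gamma_s = 0 for s >= 2.
import numpy as np

rng = np.random.default_rng(0)
T = 1_000_000
phi0, theta1, sigma = 0.5, 0.4, 1.0

eps = rng.normal(0.0, sigma, T + 1)       # white noise shocks
y = phi0 + theta1 * eps[:-1] + eps[1:]    # y_t = phi0 + theta1*eps_{t-1} + eps_t

ydm = y - y.mean()                        # demeaned series
def gamma(s):
    return np.mean(ydm[s:] * ydm[:T - s])

print("mean    :", y.mean(), " theory:", phi0)
print("variance:", y.var(), " theory:", (1 + theta1**2) * sigma**2)
print("gamma_1 :", gamma(1), " theory:", theta1 * sigma**2)
print("gamma_2 :", gamma(2), " theory:", 0.0)
```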

4.3.2 Autoregressive Processes

The other subclass of ARMA processes is the autoregressive process.

Definition 4.9 (First Order Autoregressive Process). A first order autoregressive process,
abbreviated AR(1), has dynamics which follow

yt = φ0 + φ1 yt −1 + εt (4.14)

where εt is a white noise process with the additional property that E t −1 [εt ] = 0.

Unlike the MA(1) process, y appears on both sides of the equation. However, this is only a
convenience and the process can be recursively substituted to provide an expression that
depends only on the errors, εt and an initial condition.

yt = φ0 + φ1 yt−1 + εt
yt = φ0 + φ1(φ0 + φ1 yt−2 + εt−1) + εt
yt = φ0 + φ1 φ0 + φ1² yt−2 + εt + φ1 εt−1
yt = φ0 + φ1 φ0 + φ1²(φ0 + φ1 yt−3 + εt−2) + εt + φ1 εt−1
yt = φ0 + φ1 φ0 + φ1² φ0 + φ1³ yt−3 + εt + φ1 εt−1 + φ1² εt−2
...
yt = ∑_{i=0}^{t−1} φ1^i φ0 + ∑_{i=0}^{t−1} φ1^i εt−i + φ1^t y0

Using backward substitution, an AR(1) can be expressed as an MA(t). In many cases the
initial condition is unimportant and the AR process can be assumed to have begun long
ago in the past. As long as |φ1| < 1, lim_{t→∞} φ1^t y0 = 0 and the effect of an initial condition
will be small. Using the “infinite history” version of an AR(1), and assuming |φ1 | < 1, the
solution simplifies to

yt = φ0 + φ1 yt−1 + εt
yt = ∑_{i=0}^{∞} φ1^i φ0 + ∑_{i=0}^{∞} φ1^i εt−i
yt = φ0/(1 − φ1) + ∑_{i=0}^{∞} φ1^i εt−i   (4.15)

where the identity ∑_{i=0}^{∞} φ1^i = (1 − φ1)^{−1} is used in the final solution. This expression of
an AR process is known as an MA(∞) representation and it is useful for deriving standard
properties.
The unconditional mean of an AR(1) is

" ∞
#
φ0 X
E [yt ] = E + φ1i εt −i (4.16)
1 − φ1
i =0

φ0 X
= + φ1i E [εt −i ]
1 − φ1
i =0

φ0 X
= + φ1i 0
1 − φ1
i =0
φ0
= .
1 − φ1

The unconditional mean can be alternatively derived by noting that, as long as {yt} is covariance
stationary, E[yt] = E[yt−1] = µ, and so

E[yt] = E[φ0 + φ1 yt−1 + εt]   (4.17)
E[yt] = φ0 + φ1 E[yt−1] + E[εt]
µ = φ0 + φ1 µ + 0
µ − φ1 µ = φ0
µ(1 − φ1) = φ0
E[yt] = φ0/(1 − φ1)

The Ft −1 -conditional expectation is

Et−1[yt] = Et−1[φ0 + φ1 yt−1 + εt]   (4.18)
= φ0 + φ1 Et−1[yt−1] + Et−1[εt]
= φ0 + φ1 yt−1 + 0
= φ0 + φ1 yt−1

since yt −1 ∈ Ft −1 . The unconditional and conditional variances are

V[yt] = E[(yt − E[yt])²]   (4.19)
= E[(φ0/(1 − φ1) + ∑_{i=0}^{∞} φ1^i εt−i − φ0/(1 − φ1))²]
= E[(∑_{i=0}^{∞} φ1^i εt−i)²]
= E[∑_{i=0}^{∞} φ1^{2i} ε²t−i + ∑_{i=0}^{∞} ∑_{j=0, i≠j}^{∞} φ1^{i+j} εt−i εt−j]
= E[∑_{i=0}^{∞} φ1^{2i} ε²t−i] + E[∑_{i=0}^{∞} ∑_{j=0, i≠j}^{∞} φ1^{i+j} εt−i εt−j]
= ∑_{i=0}^{∞} φ1^{2i} E[ε²t−i] + ∑_{i=0}^{∞} ∑_{j=0, i≠j}^{∞} φ1^{i+j} E[εt−i εt−j]
= ∑_{i=0}^{∞} φ1^{2i} σ² + ∑_{i=0}^{∞} ∑_{j=0, i≠j}^{∞} φ1^{i+j} · 0
= σ² / (1 − φ1²)

where the expression for the unconditional variance uses the identity that ∑_{i=0}^{∞} φ1^{2i} = 1/(1 − φ1²)
and E[εt−i εt−j] = 0 follows from the white noise assumption. Again, assuming covariance
stationarity so that V[yt] = V[yt−1], the variance can be directly computed,

V[yt] = V[φ0 + φ1 yt−1 + εt]   (4.20)
V[yt] = V[φ0] + V[φ1 yt−1] + V[εt] + 2 Cov[φ1 yt−1, εt]
V[yt] = 0 + φ1² V[yt−1] + σ² + 2 · 0
V[yt] = φ1² V[yt] + σ²
V[yt] − φ1² V[yt] = σ²
V[yt](1 − φ1²) = σ²
V[yt] = σ² / (1 − φ1²)

where Cov [yt −1 , εt ] = 0 follows from the white noise assumption since yt −1 is a function of
εt −1 , εt −2 , . . .. The conditional variance is

Vt−1[yt] = Et−1[(φ1 yt−1 + εt − φ1 yt−1)²]   (4.21)
= Et−1[ε²t]
= σ²t

Again, the unconditional variance is uniformly larger than the average conditional variance
(E[σ²t] = σ²) and the variance explodes as |φ1| approaches 1 or −1. Finally, the autocovariances
can be derived,

" ∞
!
φ0 X φ0
E (yt − E[yt ])(yt −s − E[yt −s ]) = E + φ1i εt −i −
 
(4.22)
1 − φ1 1 − φ1
i =0

!#
φ0 X φ0
× + φ1i εt −s −i − (4.23)
1 − φ1 1 − φ1
" ∞ !i =0 ∞ !#
X X
=E φ1i εt −i φ1i εt −s −i
i =0 i =0
" s −1 ∞
! ∞
!#
X X X
=E φ1i εt −i + φ1i εt −i φ1i εt −s −i
i =0 i =s i =0
" s −1 ∞
! ∞
!#
X X X
=E φ1i εt −i + φ1s φ1i εt −s −i φ1i εt −s −i
i =0 i =0 i =0
" s −1
! ∞
!
X X
=E φ1i εt −i φ1i εt −s −i
i =0 i =0

! ∞
!#
X X
+ φ1s φ1i εt −s −i φ1i εt −s −i (4.24)
i =0 i =0
" s −1
! ∞
!#
X X
=E φ1i εt −i φ1i εt −s −i
i =0 i =0
" ∞
! ∞
!#
X X
+E φ1s φ1i εt −s −i φ1i εt −s −i (4.25)
i =0 i =0
" ∞
! ∞
!#
X X
= 0 + φ1s E φ1i εt −s −i φ1i εt −s −i
i =0 i =0
4.3 ARMA Models 245

= 0 + φ1s V [yt −s ]
σ2
= φ1s
1 − φ12
An alternative approach to deriving the autocovariance is to note that yt − µ = ∑_{i=0}^{s−1} φ1^i εt−i +
φ1^s (yt−s − µ) where µ = E[yt] = E[yt−s]. Using this identity, the autocovariance can be derived

E[(yt − E[yt])(yt−s − E[yt−s])] = E[(∑_{i=0}^{s−1} φ1^i εt−i + φ1^s (yt−s − µ))(yt−s − µ)]   (4.26)
= E[(∑_{i=0}^{s−1} φ1^i εt−i)(yt−s − µ) + φ1^s (yt−s − µ)(yt−s − µ)]
= E[(∑_{i=0}^{s−1} φ1^i εt−i)(yt−s − µ)] + E[φ1^s (yt−s − µ)(yt−s − µ)]
= 0 + φ1^s E[(yt−s − µ)(yt−s − µ)]
= φ1^s V[yt−s]
= φ1^s σ² / (1 − φ1²)

where the white noise assumption is used to ensure that E[εt−u(yt−s − µ)] = 0 when u < s.
The AR(1) can be extended to the AR(P ) class by including additional lags of yt .

Definition 4.10 (Autoregressive Process of Order P). An Autoregressive process of order P
(AR(P)) has dynamics which follow

yt = φ0 + ∑_{p=1}^{P} φp yt−p + εt   (4.27)

where εt is a white noise series with the additional property that Et−1[εt] = 0.

Some of the more useful properties of a general AR process are:

• E[yt] = φ0 / (1 − ∑_{p=1}^{P} φp)

• V[yt] = σ² / (1 − ∑_{p=1}^{P} φp ρp) where ρp is the pth autocorrelation

• V[yt] is infinite if ∑_{p=1}^{P} φp ≥ 1

• E[(yt − E[yt])(yt−s − E[yt−s])] ≠ 0 for any s (in general, although certain parameterizations may produce some 0 autocovariances)


These four properties point to some important regularities of AR processes. First, the mean
is only finite if ∑_{p=1}^{P} φp < 1. Second, the autocovariances are (generally) not zero, unlike
MA processes which have γs = 0 for |s| > Q. This difference in the behavior of the autocovariances
plays an important role in model building. Explicit expressions for the variance
and autocovariance of higher order AR processes can be found in appendix 4.A.
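These properties can also be checked numerically. The sketch below (an illustration added here, not from the original text; Python with NumPy assumed) simulates a stationary AR(1) and compares the sample mean, variance and autocovariances with φ0/(1 − φ1), σ²/(1 − φ1²) and φ1^s σ²/(1 − φ1²).

```python
# Sketch: simulate y_t = phi0 + phi1*y_{t-1} + eps_t and compare sample
# moments with the stationary AR(1) results derived above.
import numpy as np

rng = np.random.default_rng(1)
T, burn = 500_000, 1_000
phi0, phi1, sigma = 0.2, 0.9, 1.0

eps = rng.normal(0.0, sigma, T + burn)
y = np.empty(T + burn)
y[0] = phi0 / (1 - phi1)                  # start at the unconditional mean
for t in range(1, T + burn):
    y[t] = phi0 + phi1 * y[t - 1] + eps[t]
y = y[burn:]                              # discard the burn-in sample

ydm = y - y.mean()
def gamma(s):
    return np.mean(ydm[s:] * ydm[:len(ydm) - s])

v_theory = sigma**2 / (1 - phi1**2)
print("mean    :", y.mean(), " theory:", phi0 / (1 - phi1))
print("variance:", y.var(), " theory:", v_theory)
for s in (1, 2, 5):
    print(f"gamma_{s}:", gamma(s), " theory:", phi1**s * v_theory)
```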

4.3.3 Autoregressive Moving Average Processes

Putting these two processes together yields the complete class of ARMA processes.

Definition 4.11 (Autoregressive-Moving Average Process). An Autoregressive Moving Average
process with orders P and Q (ARMA(P, Q)) has dynamics which follow

yt = φ0 + ∑_{p=1}^{P} φp yt−p + ∑_{q=1}^{Q} θq εt−q + εt   (4.28)

where εt is a white noise process with the additional property that Et−1[εt] = 0.

Again, consider the simplest ARMA(1,1) process that includes a constant term,

yt = φ0 + φ1 yt −1 + θ1 εt −1 + εt

To derive the properties of this model it is useful to convert the ARMA(1,1) into its infinite
lag representation using recursive substitution,

yt = φ0 + φ1 yt−1 + θ1 εt−1 + εt   (4.29)
yt = φ0 + φ1(φ0 + φ1 yt−2 + θ1 εt−2 + εt−1) + θ1 εt−1 + εt
yt = φ0 + φ1 φ0 + φ1² yt−2 + φ1 θ1 εt−2 + φ1 εt−1 + θ1 εt−1 + εt
yt = φ0 + φ1 φ0 + φ1²(φ0 + φ1 yt−3 + θ1 εt−3 + εt−2) + φ1 θ1 εt−2 + φ1 εt−1 + θ1 εt−1 + εt
yt = φ0 + φ1 φ0 + φ1² φ0 + φ1³ yt−3 + φ1² θ1 εt−3 + φ1² εt−2 + φ1 θ1 εt−2 + φ1 εt−1 + θ1 εt−1 + εt
...
yt = ∑_{i=0}^{∞} φ1^i φ0 + εt + ∑_{i=0}^{∞} φ1^i (φ1 + θ1) εt−i−1
yt = φ0/(1 − φ1) + εt + ∑_{i=0}^{∞} φ1^i (φ1 + θ1) εt−i−1.

Using the infinite lag representation, the unconditional and conditional means can be
computed,

" ∞
#
φ0 X
E [yt ] = E + εt + φ1i (φ1 + θ1 ) εt −i −1 (4.30)
1 − φ1
i =0

φ0 X
= + E [εt ] + φ1i (φ1 + θ1 ) E [εt −i −1 ]
1 − φ1
i =0

φ0 X
= +0+ φ1i (φ1 + θ1 ) 0
1 − φ1
i =0
φ0
=
1 − φ1

and

E t −1 [yt ] = E t −1 [φ0 + φ1 yt −1 + θ1 εt −1 + εt ] (4.31)


= φ0 + φ1 E t −1 [yt −1 ] + θ1 E t −1 [εt −1 ] + E t −1 [εt ]
= φ0 + φ1 yt −1 + θ1 εt −1 + 0
= φ0 + φ1 yt −1 + θ1 εt −1

Since yt −1 and εt −1 are in the time-t − 1 information set, these variables pass through
the conditional expectation. The unconditional variance can be tediously derived (see ap-
pendix 4.A.3 for the complete derivation)

V[yt] = σ² (1 + 2φ1 θ1 + θ1²) / (1 − φ1²)   (4.32)

The conditional variance is identical to that in the AR(1) or MA(1), Vt −1 [yt ] = σ2t , and, if εt
is homoskedastic, Vt −1 [yt ] = σ2 .
The unconditional mean of an ARMA is the same as that of an AR since the moving average
terms, which are all mean zero, make no contribution. The variance is more complicated,
and it may be larger or smaller than that of an AR(1) with the same autoregressive parameter (φ1).
The variance will only be smaller if φ1 and θ1 have opposite signs and 2|φ1 θ1| > θ1². Deriving
the autocovariance is straightforward but tedious and is presented in appendix 4.A.
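A quick numerical check of eq. (4.32) is given below (a sketch, not part of the original notes; Python with NumPy assumed); it also prints the variance of an AR(1) with the same φ1 for comparison.

```python
# Sketch: verify the ARMA(1,1) unconditional variance in eq. (4.32) by simulation,
# V[y] = sigma^2 * (1 + 2*phi1*theta1 + theta1^2) / (1 - phi1^2).
import numpy as np

rng = np.random.default_rng(2)
T, burn = 500_000, 1_000
phi0, phi1, theta1, sigma = 0.1, 0.7, -0.3, 1.0

eps = rng.normal(0.0, sigma, T + burn)
y = np.zeros(T + burn)
for t in range(1, T + burn):
    y[t] = phi0 + phi1 * y[t - 1] + theta1 * eps[t - 1] + eps[t]
y = y[burn:]

v_theory = sigma**2 * (1 + 2 * phi1 * theta1 + theta1**2) / (1 - phi1**2)
print("sample variance :", y.var())
print("theoretical     :", v_theory)
print("AR(1) comparison:", sigma**2 / (1 - phi1**2))
```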

4.4 Difference Equations

Before turning to the analysis of the stationarity conditions for ARMA processes, it is useful
to develop an understanding of the stability conditions in a setting without random shocks.

Definition 4.12 (Linear Difference Equation). An equation of the form

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P + x t . (4.33)

is known as a Pth order linear difference equation where the series {x t } is known as the
driving process.
Linear difference equations nest ARMA processes, which can be seen by setting xt equal to
the shock plus the moving average component of the ARMA process,

x t = θ1 εt −1 + θ2 εt −2 + . . . + θQ εt −Q + εt .
Stability conditions depend crucially on the solution to the linear difference equation.
Definition 4.13 (Solution). A solution to a linear difference equation expresses the linear
difference equation

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P + x t . (4.34)

as a function of only {xi}_{i=1}^{t}, a constant and, when yt has a finite history, an initial value y0.
Consider a first order linear difference equation

yt = φ0 + φ1 yt −1 + x t .
Starting from an initial value y0 ,

y1 = φ0 + φ1 y0 + x1

y2 = φ0 + φ1(φ0 + φ1 y0 + x1) + x2
= φ0 + φ1 φ0 + φ1² y0 + x2 + φ1 x1

y3 = φ0 + φ1 y2 + x3
= φ0 + φ1(φ0 + φ1 φ0 + φ1² y0 + φ1 x1 + x2) + x3
= φ0 + φ1 φ0 + φ1² φ0 + φ1³ y0 + x3 + φ1 x2 + φ1² x1

Continuing these iterations, a pattern emerges:


yt = φ1^t y0 + ∑_{i=0}^{t−1} φ1^i φ0 + ∑_{i=0}^{t−1} φ1^i xt−i   (4.35)

This is a solution since it expresses yt as a function of only {x t }, y0 and constants. When no


initial condition is given (or the series is assumed to be infinite), the solution can be found
by solving backward

yt = φ0 + φ1 yt−1 + xt
yt−1 = φ0 + φ1 yt−2 + xt−1 ⇒
yt = φ0 + φ1(φ0 + φ1 yt−2 + xt−1) + xt
= φ0 + φ1 φ0 + φ1² yt−2 + xt + φ1 xt−1
yt−2 = φ0 + φ1 yt−3 + xt−2 ⇒
yt = φ0 + φ1 φ0 + φ1²(φ0 + φ1 yt−3 + xt−2) + xt + φ1 xt−1
= φ0 + φ1 φ0 + φ1² φ0 + φ1³ yt−3 + xt + φ1 xt−1 + φ1² xt−2

which leads to the approximate solution

yt = ∑_{i=0}^{s−1} φ1^i φ0 + ∑_{i=0}^{s−1} φ1^i xt−i + φ1^s yt−s.

To understand the behavior of this solution, it is necessary to take limits. If |φ1| < 1,
lim_{s→∞} φ1^s yt−s goes to zero (as long as yt−s is bounded) and the solution simplifies to

yt = φ0 ∑_{i=0}^{∞} φ1^i + ∑_{i=0}^{∞} φ1^i xt−i.   (4.36)

Noting that, as long as |φ1| < 1, ∑_{i=0}^{∞} φ1^i = 1/(1 − φ1),

yt = φ0/(1 − φ1) + ∑_{i=0}^{∞} φ1^i xt−i   (4.37)

is the solution to this problem with an infinite history. The solution concept is important
because it clarifies the relationship between observations in the distant past and the cur-
rent observation, and if lims →∞ φ1s yt −s does not converge to zero then observations arbi-
trarily far in the past have influence on the value of y today.
When |φ1 | > 1 then this system is said to be nonconvergent since φ1t diverges as t grows
large and values in the past are not only important, they will dominate when determining
the current value. In the special case where φ1 = 1,

yt = φ0 t + ∑_{i=0}^{∞} xt−i,

which is a random walk when {x t } is a white noise process, and the influence of a single
x t never diminishes. Direct substitution can be used to find the solution of higher order
linear difference equations at the cost of more tedium. A simpler alternative is to focus on
the core component of a linear difference equation, the linear homogenous equation.
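Before turning to the homogeneous equation, the behavior described above is easy to reproduce by direct iteration. The sketch below (added for illustration, not in the original; plain Python) applies a single unit shock at t = 1 and no other shocks to a first order difference equation for several values of φ1, mirroring the experiment plotted in figure 4.1.

```python
# Sketch: iterate y_t = phi1 * y_{t-1} + x_t with a single unit shock at t = 1
# (x_1 = 1, all other x_t = 0) to see convergence, divergence and a random-walk case.
def impulse_path(phi1, horizon=10):
    y = [0.0]
    for t in range(1, horizon + 1):
        x_t = 1.0 if t == 1 else 0.0
        y.append(phi1 * y[-1] + x_t)
    return y[1:]

for phi1 in (0.9, -0.5, 1.0, 1.1):
    path = ", ".join(f"{v: .3f}" for v in impulse_path(phi1))
    print(f"phi1 = {phi1:+.1f}: {path}")
```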

4.4.1 Homogeneous Difference Equations


When the number of lags grows large (3 or greater), solving linear difference equations by
substitution is tedious. The key to understanding linear difference equations is the study
of the homogeneous portion of the equation. In the general linear difference equation,

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P + x t
the homogenous portion is defined as the terms involving only y ,

yt = φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P . (4.38)
The intuition behind studying this portion of the system is that, given the sequence of {x t },
all of the dynamics and the stability of the system are determined by the relationship be-
tween contemporaneous yt and its lagged values which allows the determination of the
parameter values where the system is stable. Again, consider the homogeneous portions
of the simple 1st order system,

yt = φ1 yt −1 + x t
which has homogenous portion

yt = φ1 yt −1 .
To find solutions to this equation, one can try trial and error: one obvious solution is 0 since
0 = φ · 0. It is easy to show that

yt = φ1^t y0

is also a solution, which can be verified by examining the solution to the linear difference equation in eq. (4.35).
But then so is any solution of the form c φ1^t for an arbitrary constant c. How?

yt = c φ1^t
yt−1 = c φ1^{t−1}

and

yt = φ1 yt−1

Putting these two together shows that

yt = φ1 yt−1
c φ1^t = φ1 yt−1
c φ1^t = φ1 c φ1^{t−1}
c φ1^t = c φ1^t

and there are many solutions. However, from these it is possible to discern when the solu-
tion will converge to zero and when it will explode:

• If |φ1 | < 1 the system converges to 0. If φ1 is also negative, the solution oscillates,
while if φ1 is greater than 0, the solution decays exponentially.

• If |φ1 | > 1 the system diverges, again oscillating if negative and growing exponentially
if positive.

• If φ1 = 1, the system is stable and all values are solutions. For example 1 = 1 · 1,
2 = 1 · 2, etc.

• If φ1 = −1, the system is metastable. The values, in absolute terms, are unchanged,
but it oscillates between + and -.

These categories will play important roles in examining the dynamics of larger equations
since they determine how past shocks will affect current values of yt . When the order is
greater than 1, there is an easier approach to examining stability of the system. Consider
the second order linear difference system,

yt = φ0 + φ1 yt −1 + φ2 yt −2 + x t
and again focus on the homogeneous portion,

yt = φ1 yt −1 + φ2 yt −2 .
This equation can be rewritten

yt − φ1 yt −1 − φ2 yt −2 = 0
so any solution of the form

c z^t − φ1 c z^{t−1} − φ2 c z^{t−2} = 0   (4.39)
c z^{t−2} (z² − φ1 z − φ2) = 0

will solve this equation.⁴ Dividing through by c z^{t−2}, this is equivalent to

z² − φ1 z − φ2 = 0   (4.40)

and the solutions to this quadratic polynomial are given by the quadratic formula,
⁴ The solution can only be defined up to a constant, c, since the right-hand side is 0. Thus, multiplying
through by a constant, the solution is still valid.

c1, c2 = (φ1 ± √(φ1² + 4φ2)) / 2   (4.41)
The roots of the equation, c1 and c2 , play the same role as φ1 in the 1st order case.5 If
|c1 | < 1 and |c2 | < 1, the system is convergent. With two roots both smaller than 1 there are
three interesting cases:
Case 1: Both roots are real and positive. In this case, the system will exponentially
dampen and not oscillate.

Case 2: Both roots are imaginary (of the form c + di where i = √−1) and distinct, or
real and at least one negative. In this case, the absolute value of the roots (also called the
modulus, defined as √(c² + d²) for an imaginary number c + di) is less than 1, and so the
system will be convergent but oscillate.


Case 3: Real but the same. This occurs when φ1² + 4φ2 = 0. Since there is only one root,
the system is convergent if it is less than 1 in absolute value, which requires that |φ1| < 2.
If one or both roots are greater than 1 in absolute terms, the system is divergent.

4.4.2 Lag Operators


Before proceeding to higher order models, it is necessary to define the lag operator. Lag
operations are a particularly useful tool in the analysis of time series and are nearly self
descriptive.6
Definition 4.14 (Lag Operator). The lag operator is denoted L and is defined as the operator
that has the following properties:

L yt = yt −1
L 2 yt = yt −2
L i yt = yt −i
L (L (yt )) = L (yt −1 ) = yt −2 = L 2 yt
(1 − L − L 2 )yt = yt − L yt − L 2 yt = yt − yt −1 − yt −2

The last equation above is particularly useful when studying autoregressive processes. One
additional property of the lag operator is that the lag of a constant is just the constant, i.e.
Lc = c .

4.4.3 Higher Order Linear Homogenous Equations


Stability analysis can be applied to higher order systems by forming the characteristic equa-
tion and finding the characteristic roots.
⁵ In the first order case, yt = φ1 yt−1, so yt − φ1 yt−1 = 0. The solution has the property that z^t − φ1 z^{t−1} = 0
so z − φ1 = 0, which has the single solution c = φ1.
⁶ In some texts, the lag operator is known as the backshift operator, and L is replaced with B.

Definition 4.15 (Characteristic Equation). Let yt follow a Pth order linear difference equa-
tion
yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P + x t (4.42)
which can be rewritten as

yt − φ1 yt −1 − φ2 yt −2 − . . . − φP yt −P = φ0 + x t (4.43)
(1 − φ1 L − φ2 L 2 − . . . − φP L P )yt = φ0 + x t

The characteristic equation of this process is

z P − φ1 z P −1 − φ2 z P −2 − . . . − φP −1 z − φP = 0 (4.44)

The characteristic roots are the solutions to this equation and most econometric pack-
ages will return the roots of the characteristic polynomial when an ARMA model is esti-
mated.

Definition 4.16 (Characteristic Root). Let

z P − φ1 z P −1 − φ2 z P −2 − . . . − φP −1 z − φP = 0 (4.45)

be the characteristic polynomial associated with a P th order linear difference equation. The
P characteristic roots, c1 , c2 , . . . , cP are defined as the solution to this polynomial

(z − c1 )(z − c2 ) . . . (z − cP ) = 0 (4.46)

The conditions for stability are the same for higher order systems as they were for first and
second order systems: all roots cp , p = 1, 2, . . . , P must satisfy |cp | < 1 (again, if complex,
| · | means modulus). If any |cp| > 1 the system is divergent. If one or more |cp| = 1 and
none are larger, the system will exhibit unit root (random walk) behavior.
These results are the key to understanding important properties of linear time-series
models which turn out to be stationary if the corresponding linear homogeneous system
is convergent, i.e. |cp | < 1, p = 1, 2, . . . , P .

4.4.4 Example: Characteristic Roots and Stability


Consider 6 linear difference equations, their characteristic equation, and the roots:

• yt = 0.9yt −1 + x t

– Characteristic Equation: z-0.9=0


– Characteristic Root: z=0.9

• yt = −0.5yt −1 + x t

[Figure 4.1 appears here. Panels: yt = 0.9yt−1; yt = −0.5yt−1; yt = 0.5yt−1 + 0.4yt−2; yt = 0.64yt−1 − 0.1024yt−2; yt = −0.5yt−1 − 0.4yt−2; yt = 1.6yt−1 − 0.5yt−2.]

Figure 4.1: These six plots correspond to the dynamics of the six linear homogeneous systems
described in the text. All processes received a unit shock at t = 1 (x1 = 1) and no other
shocks (xj = 0, j ≠ 1). Pay close attention to the roots of the characteristic polynomial and
the behavior of the system (exponential decay, oscillation and/or explosion).

– Characteristic Equation: z+0.5=0


– Characteristic Root: z=-0.5

• yt = 0.5yt −1 + 0.4yt −2 + x t

– Characteristic Equation: z 2 − 0.5z − 0.4 = 0


– Characteristic Roots: z = 0.93, −.43

• yt = 0.64yt −1 − 0.1024yt −2 + x t

– Characteristic Equation: z 2 − 0.64z + 0.1024 = 0


– Characteristic Roots: z = 0.32, 0.32 (identical)

• yt = −0.5yt −1 − 0.4yt −2 + x t

– Characteristic Equation: z 2 + 0.5z + 0.4 = 0


– Characteristic Roots (Modulus): z = −0.25 + 0.58i (0.63), −0.25 − 0.58i (0.63)

• yt = 1.6yt −1 − 0.5yt −2 + x t

– Characteristic Equation: z 2 − 1.6z + 0.5 = 0


– Characteristic Roots: z = 1.17, 0.42

The plots in figure 4.1 show the effect of a unit (1) shock at t = 1 to the 6 linear difference
systems above (all other shocks are 0). The value of the root makes a dramatic difference
in the observed behavior of the series.
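These roots can be verified numerically. The sketch below (not part of the original notes; Python with NumPy assumed) forms the characteristic polynomial z^P − φ1 z^{P−1} − . . . − φP for each of the six examples, computes its roots with numpy.roots and reports their moduli.

```python
# Sketch: compute characteristic roots of z^P - phi1*z^(P-1) - ... - phiP = 0
# for the six example difference equations and report their moduli.
import numpy as np

examples = {
    "y = 0.9 y(-1)":                 [0.9],
    "y = -0.5 y(-1)":                [-0.5],
    "y = 0.5 y(-1) + 0.4 y(-2)":     [0.5, 0.4],
    "y = 0.64 y(-1) - 0.1024 y(-2)": [0.64, -0.1024],
    "y = -0.5 y(-1) - 0.4 y(-2)":    [-0.5, -0.4],
    "y = 1.6 y(-1) - 0.5 y(-2)":     [1.6, -0.5],
}

for name, phis in examples.items():
    # polynomial coefficients [1, -phi1, ..., -phiP] in decreasing powers of z
    coeffs = np.concatenate(([1.0], -np.asarray(phis)))
    roots = np.roots(coeffs)
    mods = np.abs(roots)
    verdict = "convergent" if mods.max() < 1 else "divergent"
    print(f"{name:32s} roots={np.round(roots, 3)} |roots|={np.round(mods, 3)} -> {verdict}")
```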

4.4.5 Stationarity of ARMA models


Stationarity conditions for ARMA processes can be determined using the results for the
convergence of linear difference equations. First, note that any ARMA process can be writ-
ten using a lag polynomial

yt = φ0 + φ1 yt −1 + . . . + φP yt −P + θ1 εt −1 + . . . + θQ εt −Q + εt
yt − φ1 yt −1 − . . . − φP yt −P = φ0 + θ1 εt −1 + . . . + θQ εt −Q + εt
(1 − φ1 L − φ2 L 2 − . . . − φP L P )yt = φ0 + (1 + θ1 L + θ2 L 2 + . . . + θQ L Q )εt

This is a linear difference equation, and the stability conditions depend on the roots of the
characteristic polynomial

z P − φ1 z P −1 − φ2 z P −2 − . . . − φP −1 z − φP
An ARMA process driven by a white noise shock will be covariance stationary as long
as the characteristic roots are less than one in modulus. In the simple AR(1) case, this cor-
responds to |z 1 | < 1. In the AR(2) case, the region is triangular with a curved bottom and
corresponds to the points (z 1 , z 2 ) = (−2, −1), (1, 0), (2, −2) (see figure 4.2). For higher order
models, stability must be checked by numerically solving the characteristic equation.
The other particularly interesting point is that all MA processes driven by covariance
stationary shocks are stationary since the homogeneous portions of a MA process has no
root and thus cannot diverge.

4.5 Data and Initial Estimates


Two series will be used throughout the stationary time-series analysis section: returns on
the value weighted market and the spread between the average interest rates on portfolios

[Figure 4.2 appears here: the stationary region of an AR(2), plotted against φ1 and φ2, with sub-regions for real and imaginary roots.]

Figure 4.2: The triangular region corresponds to the values of the parameters in the AR(2)
yt = φ1 yt−1 + φ2 yt−2 + εt. The dark region corresponds to real roots and the light region
corresponds to imaginary roots.

of Aaa-rated and Baa-rated corporate bonds, commonly known as the default spread or
default premium. The VWM returns were taken from CRSP and are available from January
1927 through July 2008 and the bond yields are available from Moody’s via FRED II and are
available from January 1919 until July 2008. Both series are monthly.
Figure 4.3 contains plots of the two series. Table 4.1 contains parameter estimates for
an AR(1), an MA(1) and an ARMA(1,1) for each series. The default spread exhibits a large
autoregressive coefficient (.97) that is highly significant, but it also contains a significant
moving average term and in an ARMA(1,1) both parameters are significant. The market
portfolio exhibits some evidence of predictability although it is much less persistent than
the default spread.7

4.6 Autocorrelations and Partial Autocorrelations


Autoregressive processes, moving average processes and ARMA processes all exhibit different
patterns in their autocorrelations and partial autocorrelations. These differences can
be exploited to select a parsimonious model from the general class of ARMA processes.

⁷ For information on estimating an ARMA in MATLAB, see the MATLAB supplement to this course.

[Figure 4.3 appears here: monthly time-series plots of the VWM returns (top) and the default spread (bottom).]

Figure 4.3: Plots of the returns on the VWM and the default spread, the spread between the
yield of a portfolio of Baa-rated bonds and the yield of a portfolio of Aaa-rated bonds.

                        VWM                                     Baa-Aaa
             φ̂0      φ̂1      θ̂1      σ̂           φ̂0      φ̂1      θ̂1      σ̂
AR(1)       0.284   0.115           5.415        0.026   0.978           0.149
           (0.108) (0.052)                      (0.284) (0.000)
MA(1)       0.320           0.115   5.415        1.189           0.897   0.400
           (0.096)         (0.042)              (0.000)         (0.000)
ARMA(1,1)   0.308   0.039   0.077   5.417        0.036   0.969   0.202   0.146
           (0.137) (0.870) (0.724)              (0.209) (0.000) (0.004)

Table 4.1: Parameter estimates and p-values from an AR(1), MA(1) and ARMA(1,1) for the
VWM and Baa-Aaa spread.


4.6.1 Autocorrelations and the Autocorrelation Function


Autocorrelations are to autocovariances as correlations are to covariances. That is, the sth
autocorrelation is the sth autocovariance divided by the square root of the product of the variances of yt and
yt−s, and when a process is covariance stationary, V[yt] = V[yt−s], and so √(V[yt]V[yt−s]) = V[yt].

Definition 4.17 (Autocorrelation). The autocorrelation of a covariance stationary scalar
process is defined

ρs = γs / γ0 = E[(yt − E[yt])(yt−s − E[yt−s])] / V[yt]   (4.47)

where γs is the sth autocovariance.

The autocorrelation function (ACF) relates the lag length (s ) and the parameters of the
model to the autocorrelation.

Definition 4.18 (Autocorrelation Function). The autocorrelation function (ACF), ρ(s ), is a


function of the population parameters that defines the relationship between the autocor-
relations of a process and lag length.

The variance of a covariance stationary AR(1) is σ²(1 − φ1²)^{−1} and the sth autocovariance
is φ1^s σ²(1 − φ1²)^{−1}, and so the ACF is

ρ(s) = φ1^s σ²(1 − φ1²)^{−1} / [σ²(1 − φ1²)^{−1}] = φ1^s.   (4.48)
Deriving ACFs of ARMA processes is a straightforward, albeit tedious, task. Further details
on deriving the ACF of stationary ARMA processes are presented in appendix 4.A.

4.6.2 Partial Autocorrelations and the Partial Autocorrelation Function


Partial autocorrelations are similar to autocorrelations with one important difference: the
sth partial autocorrelation still relates yt and yt −s but it eliminates the effects of yt −1 , yt −2 ,
. . ., yt −(s −1) .

Definition 4.19 (Partial Autocorrelation). The sth partial autocorrelation (ϕs ) is defined as
the population value of the regression coefficient on φs in

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φs −1 yt −(s −1) + φs yt −s + εt .

Like the autocorrelation function, the partial autocorrelation function (PACF) relates
the partial autocorrelation to population parameters and lag length.

Definition 4.20 (Partial Autocorrelation Function). The partial autocorrelation function
(PACF), ϕ(s), defines the relationship between the partial autocorrelations of a process and
lag length.

The partial autocorrelations are directly interpretable as population regression coeffi-


cients. The sth partial autocorrelations can be computed using s +1 autocorrelations. Recall
that the population values of φ1 , φ2 , . . ., φs in

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φs −1 yt −(s −1) + φs yt −s + εt
can be defined in terms of the covariance between yt , yt −1 , yt −2 , . . ., yt −s . Let Γ denote this
covariance matrix,
 
Γ =
[ γ0     γ1     γ2     γ3     . . .  γs−1   γs    ]
[ γ1     γ0     γ1     γ2     . . .  γs−2   γs−1  ]
[ γ2     γ1     γ0     γ1     . . .  γs−3   γs−2  ]
[  .      .      .      .     . . .   .      .    ]
[ γs−1   γs−2   γs−3   γs−4   . . .  γ0     γ1    ]
[ γs     γs−1   γs−2   γs−3   . . .  γ1     γ0    ]

The matrix Γ is known as a Toeplitz matrix, which reflects the special symmetry it exhibits
which follows from stationarity, and so E[(yt − µ)(yt−s − µ)] = γs = γ−s = E[(yt − µ)(yt+s − µ)].
Γ can be decomposed in terms of γ0 (the long-run variance) and the matrix of autocorrelations,

Γ = γ0 ×
[ 1      ρ1     ρ2     ρ3     . . .  ρs−1   ρs    ]
[ ρ1     1      ρ1     ρ2     . . .  ρs−2   ρs−1  ]
[ ρ2     ρ1     1      ρ1     . . .  ρs−3   ρs−2  ]
[  .      .      .      .     . . .   .      .    ]
[ ρs−1   ρs−2   ρs−3   ρs−4   . . .  1      ρ1    ]
[ ρs     ρs−1   ρs−2   ρs−3   . . .  ρ1     1     ]
directly by applying the definition of an autocorrelation. The population regression parameters
can be computed by partitioning Γ into four blocks: γ0, the long-run variance of yt;
Γ01 = Γ′10, the vector of covariances between yt and yt−1, yt−2, . . . , yt−s; and Γ11, the
covariance matrix of yt−1, yt−2, . . . , yt−s,

Γ = [ γ0   Γ01 ]  = γ0 [ 1    R01 ]
    [ Γ10  Γ11 ]        [ R10  R11 ]

where the R are vectors or matrices of autocorrelations. Using this formulation, the population
regression parameters φ = [φ1, φ2, . . . , φs]′ are defined as

φ = Γ11^{−1} Γ10 = (γ0 R11)^{−1} γ0 R10 = R11^{−1} R10.   (4.49)

The sth partial autocorrelation (ϕs) is the sth element in φ (when Γ is s by s), e′s R11^{−1} R10, where
es is an s by 1 vector of zeros with one in the sth position.
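Eq. (4.49) translates directly into a small numerical routine. The sketch below (an illustration added here, not from the original; Python with NumPy and SciPy assumed) computes the first five partial autocorrelations of an AR(1) with φ1 = 0.9 from its theoretical autocorrelations ρs = φ1^s; the result is approximately (0.9, 0, 0, 0, 0), as expected.

```python
# Sketch: compute partial autocorrelations from a set of autocorrelations
# using phi = R11^{-1} R10 from eq. (4.49); the s-th PACF is the last element of phi.
import numpy as np
from scipy.linalg import toeplitz

def pacf_from_acf(rho, max_lag):
    """rho[s] = autocorrelation at lag s, with rho[0] = 1."""
    out = []
    for s in range(1, max_lag + 1):
        R11 = toeplitz(rho[:s])        # correlations among y_{t-1}, ..., y_{t-s}
        R10 = rho[1:s + 1]             # correlations between y_t and the lags
        phi = np.linalg.solve(R11, R10)
        out.append(phi[-1])            # s-th partial autocorrelation
    return np.array(out)

phi1 = 0.9
rho = phi1 ** np.arange(11)            # theoretical ACF of an AR(1)
print(np.round(pacf_from_acf(rho, 5), 4))
```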
For example, in a stationary AR(1) model, yt = φ1 yt −1 + εt , the PACF is

ϕ(s) = φ1^{|s|}   for s = 0, 1, −1
     = 0           otherwise

That ϕ0 = φ1^0 = 1 is obvious: the correlation of a variable with itself is 1. The first partial
autocorrelation is defined as the population parameter of φ1 in the regression yt = φ0 +
φ1 yt −1 + εt . Since the data generating process is an AR(1), ϕ1 = φ1 , the autoregressive
parameter. The second partial autocorrelation is defined as the population value of φ2 in
the regression
yt = φ0 + φ1 yt−1 + φ2 yt−2 + εt.
Since the DGP is an AR(1), once yt −1 is included, yt −2 has no effect on yt and the population
value of both φ2 and the second partial autocorrelation, ϕ2 , is 0. This argument holds for
any higher order partial autocorrelation.
Note that the first partial autocorrelation and the first autocorrelation are both φ1 in

yt = φ0 + φ1 yt −1 + εt ,

and at the second (and higher) lag these differ. The autocorrelation at s = 2 is the popula-
tion value of φ2 in the regression

yt = φ0 + φ2 yt −2 + ε
while the second partial autocorrelation is the population value of φ2 in the regression

yt = φ0 + φ1 yt−1 + φ2 yt−2 + ε.

If the DGP were an AR(1), the second autocorrelation would be ρ2 = φ1² while the second
partial autocorrelation would be ϕ2 = 0.

4.6.2.1 Examples of ACFs and PACFs

The key to understanding the value of ACFs and PACFs lies in the distinct behavior the
autocorrelations and partial autocorrelations of AR and MA processes exhibit.

• AR(P)

– ACF dies exponentially (may oscillate, referred to as sinusoidally)


– PACF is zero beyond P

• MA(Q)

– ACF is zero beyond Q

– PACF dies exponentially (may oscillate, referred to as sinusoidally)

Process       ACF                                       PACF

White Noise   All 0                                     All 0
AR(1)         ρs = φ1^s                                 Non-zero at lag 1, 0 thereafter
AR(P)         Decays toward zero exponentially          Non-zero through lag P, 0 thereafter
MA(1)         ρ1 ≠ 0, ρs = 0, s > 1                     Decays toward zero exponentially
MA(Q)         Non-zero through lag Q, 0 thereafter      Decays toward zero exponentially
ARMA(P,Q)     Exponential decay                         Exponential decay

Table 4.2: Behavior of the ACF and PACF for various members of the ARMA family.


Table 4.2 provides a summary of the ACF and PACF behavior of ARMA models and this
difference forms the basis of the Box-Jenkins model selection strategy.

4.6.3 Sample Autocorrelations and Partial Autocorrelations


Sample autocorrelations are computed using sample analogues of the population moments
in the definition of an autocorrelation. Define y*t = yt − ȳ to be the demeaned series where
ȳ = T^{−1} ∑_{t=1}^{T} yt. The sth sample autocorrelation is defined

ρ̂s = ∑_{t=s+1}^{T} y*t y*t−s / ∑_{t=1}^{T} (y*t)²   (4.50)

although in small samples, a corrected version

ρ̂s = [ (∑_{t=s+1}^{T} y*t y*t−s) / (T − s) ] / [ (∑_{t=1}^{T} (y*t)²) / T ]   (4.51)

or

ρ̂s = ∑_{t=s+1}^{T} y*t y*t−s / √( ∑_{t=s+1}^{T} (y*t)² ∑_{t=1}^{T−s} (y*t)² )   (4.52)

may be more accurate.
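A minimal implementation of eq. (4.50) is shown below (a sketch, not part of the original notes; Python with NumPy assumed); the small-sample corrections in eqs. (4.51) and (4.52) only change the normalization.

```python
# Sketch: the basic sample autocorrelation estimator in eq. (4.50),
# rho_hat_s = sum_{t=s+1}^{T} y*_t y*_{t-s} / sum_{t=1}^{T} (y*_t)^2.
import numpy as np

def sample_acf(y, max_lag):
    y = np.asarray(y, dtype=float)
    ystar = y - y.mean()                      # demeaned series
    denom = np.sum(ystar**2)
    return np.array([np.sum(ystar[s:] * ystar[:len(y) - s]) / denom
                     for s in range(1, max_lag + 1)])

rng = np.random.default_rng(3)
e = rng.standard_normal(1_000)
y = np.empty_like(e)
y[0] = 0.0
for t in range(1, len(e)):                    # AR(1) with phi1 = 0.5
    y[t] = 0.5 * y[t - 1] + e[t]
print(np.round(sample_acf(y, 5), 3))          # close to 0.5, 0.25, 0.125, ...
```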


Definition 4.21 (Sample Autocorrelogram). A plot of the sample autocorrelations against
the lag index is known as a sample autocorrelogram.
Inference on estimated autocorrelation coefficients depends on the null hypothesis tested
and whether the data are homoskedastic. The most common assumptions are that the data
are homoskedastic and that all of the autocorrelations are zero. In other words, yt − E[yt]
is a white noise process. Under the null H0 : ρs = 0, s ≠ 0, inference can be made noting
that V[ρ̂s] = T^{−1} using a standard t-test,

[Figure 4.4 appears here: ACF (left column) and PACF (right column) for white noise, an AR(1) with φ1 = 0.9, an AR(1) with φ1 = −0.9 and an MA(1) with θ1 = 0.8.]

Figure 4.4: Autocorrelation function and partial autocorrelation function for 4 processes.
Note the difference between how the ACF and PACF respond in AR and MA models.

ρ̂s / √V[ρ̂s] = ρ̂s / √(T^{−1}) = T^{1/2} ρ̂s →d N(0, 1).   (4.53)

An alternative null hypothesis is that the autocorrelations on lags s and above are zero
but that the autocorrelations on lags 1, 2, . . . , s − 1 are unrestricted, H0 : ρ j = 0, j ≥ s .
Under this null, and again assuming homoskedasticity,

[Figure 4.5 appears here: ACF (left column) and PACF (right column) for an MA(1) with θ1 = −0.8, an ARMA(1,1) with φ1 = 0.9, θ1 = −0.8, and a random walk.]

Figure 4.5: Autocorrelation function and partial autocorrelation function for 3 processes,
an MA(1), an ARMA(1,1) and a random walk. Note the difference between how the ACF
and PACF respond in AR and MA models.

V[ρ̂s] = T^{−1}                              for s = 1   (4.54)
       = T^{−1} (1 + 2 ∑_{j=1}^{s−1} ρ̂j²)   for s > 1

If the null is H0 : ρs = 0 with no further restrictions on the other autocorrelations, the


variance of the sth autocorrelation is (assuming homoskedasticity)


V[ρ̂s] = T^{−1} (1 + 2 ∑_{j=1, j≠s}^{∞} ρ̂j²)   (4.55)

which is infeasible. The usual practice is to truncate the variance estimator at some finite
lag L where L is a function of the sample size, often assumed that L ∝ T^{1/3} (if L is not an
integer, rounding to the nearest one).⁸
Once the assumption of homoskedasticity is relaxed inference becomes more compli-
cated. First consider the most restrictive null H0 : ρs = 0, s 6= 0. If {yt } is a heteroskedastic
white noise process (plus possibly a non-zero mean), inference can be made using White’s
heteroskedasticity robust covariance estimator (see chapter 3) so that

V[ρ̂s] = T^{−1} (T^{−1} ∑_{t=1}^{T} y*t−s²)^{−1} (T^{−1} ∑_{t=1}^{T} y*t² y*t−s²) (T^{−1} ∑_{t=1}^{T} y*t−s²)^{−1}   (4.56)
       = ∑_{t=s+1}^{T} y*t² y*t−s² / (∑_{t=s+1}^{T} y*t−s²)².

This covariance estimator is identical to White’s covariance estimator for the regression

yt = ρs yt −s + εt
since under the null that ρs = 0, yt = εt .
To test one of the more complicated nulls a Heteroskedasticity Autocorrelation Consis-
tent (HAC) covariance estimator is required, the most common of which is the Newey-West
covariance estimator.

Definition 4.22 (Newey-West Variance Estimator). Let zt be a series that may be autocorrelated
and define z*t = zt − z̄ where z̄ = T^{−1} ∑_{t=1}^{T} zt. The L-lag Newey-West variance
estimator for the variance of z̄ is

σ̂²NW = T^{−1} ∑_{t=1}^{T} z*t² + 2 ∑_{l=1}^{L} wl T^{−1} ∑_{t=l+1}^{T} z*t z*t−l   (4.57)
      = γ̂0 + 2 ∑_{l=1}^{L} wl γ̂l

where γ̂l = T^{−1} ∑_{t=l+1}^{T} z*t z*t−l and wl = (L + 1 − l)/(L + 1).
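The definition translates directly into code. The sketch below (added for illustration, not in the original; Python with NumPy assumed) computes the L-lag Newey-West variance of a demeaned series using the Bartlett weights wl = (L + 1 − l)/(L + 1).

```python
# Sketch: L-lag Newey-West variance estimator from eq. (4.57),
# sigma2_NW = gamma_hat_0 + 2 * sum_l w_l * gamma_hat_l with w_l = (L+1-l)/(L+1).
import numpy as np

def newey_west_variance(z, L):
    z = np.asarray(z, dtype=float)
    T = len(z)
    zstar = z - z.mean()
    def gamma(l):
        return np.sum(zstar[l:] * zstar[:T - l]) / T
    sigma2 = gamma(0)
    for l in range(1, L + 1):
        w = (L + 1 - l) / (L + 1)             # Bartlett kernel weight
        sigma2 += 2 * w * gamma(l)
    return sigma2

rng = np.random.default_rng(4)
e = rng.standard_normal(2_000)
z = e[1:] + 0.8 * e[:-1]                      # autocorrelated MA(1) series
L = int(round(len(z) ** (1 / 3)))             # L proportional to T^(1/3)
print(newey_west_variance(z, L))              # population long-run variance is (1 + 0.8)^2 = 3.24
```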

The Newey-West estimator has two important properties. First, it is always greater than
0. This is a desirable property of any variance estimator. Second, as long as L → ∞,
σ̂²NW →p V[yt]. The only remaining choice is which value to choose for L. Unfortunately
this is problem dependent and it is important to use as small a value for L as the data will
permit. Newey-West estimators tend to perform poorly in small samples and are worse,
often substantially, than simpler estimators such as White’s heteroskedasticity consistent
covariance estimator. This said, they also work in situations where White’s estimator fails:
when a sequence is autocorrelated, White’s estimator is not consistent.⁹ Long-run variance
estimators are covered in more detail in the Multivariate Time Series chapter (chapter 5).

⁸ The choice of L ∝ T^{1/3} is motivated by asymptotic theory where T^{1/3} has been shown to be the optimal rate
in the sense that it minimizes the asymptotic mean square error of the variance estimator.
When used in a regression, the Newey-West estimator extends White’s covariance estimator
to allow {yt−s εt} to be both heteroskedastic and autocorrelated, setting z*t = y*t y*t−s,

V[ρ̂s] = T^{−1} (T^{−1} ∑_{t=s+1}^{T} y*t−s²)^{−1}
        × (T^{−1} ∑_{t=s+1}^{T} y*t² y*t−s² + 2 ∑_{j=1}^{L} wj T^{−1} ∑_{t=s+j+1}^{T} y*t y*t−s y*t−j y*t−s−j)
        × (T^{−1} ∑_{t=s+1}^{T} y*t−s²)^{−1}   (4.58)
      = [∑_{t=s+1}^{T} y*t² y*t−s² + 2 ∑_{j=1}^{L} wj ∑_{t=s+j+1}^{T} y*t y*t−s y*t−j y*t−s−j] / (∑_{t=s+1}^{T} y*t−s²)².

Note that only the center term has been changed and that L must diverge for this estimator
to be consistent – even if {yt} follows an MA process, and the efficient choice sets L ∝ T^{1/3}.
Tests that multiple autocorrelations are simultaneously zero can also be conducted.
The standard method to test that s autocorrelations are zero, H0 : ρ1 = ρ2 = . . . = ρs = 0,
is the Ljung-Box Q statistic.

Definition 4.23 (Ljung-Box Q statistic). The Ljung-Box Q statistic, or simply Q statistic,
tests the null that the first s autocorrelations are all zero against an alternative that at least
one is non-zero: H0 : ρk = 0 for k = 1, 2, . . . , s versus H1 : ρk ≠ 0 for some k = 1, 2, . . . , s. The
test statistic is defined

Q = T(T + 2) ∑_{k=1}^{s} ρ̂k² / (T − k)   (4.59)

and Q has a standard χ²s distribution.
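The Q statistic is simple to compute directly from the sample autocorrelations. The sketch below (not part of the original notes; Python with NumPy and SciPy assumed) returns the statistic and its asymptotic χ² p-value.

```python
# Sketch: Ljung-Box Q statistic from eq. (4.59),
# Q = T*(T+2) * sum_{k=1}^{s} rho_hat_k^2 / (T - k), compared to a chi^2_s distribution.
import numpy as np
from scipy import stats

def ljung_box_q(y, s):
    y = np.asarray(y, dtype=float)
    T = len(y)
    ystar = y - y.mean()
    denom = np.sum(ystar**2)
    rho = np.array([np.sum(ystar[k:] * ystar[:T - k]) / denom for k in range(1, s + 1)])
    q = T * (T + 2) * np.sum(rho**2 / (T - np.arange(1, s + 1)))
    pval = 1 - stats.chi2.cdf(q, df=s)
    return q, pval

rng = np.random.default_rng(5)
print(ljung_box_q(rng.standard_normal(500), s=10))   # white noise: large p-value expected
```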

The Q statistic is only valid under an assumption of homoskedasticity so caution is war-


ranted when using it with financial data. A heteroskedasticity robust version of the Q -stat
can be formed using an LM test.

Definition 4.24 (LM test for serial correlation). Under the null, E[y*t y*t−j] = 0 for 1 ≤
j ≤ s. The LM test for serial correlation is constructed by defining the score vector
st = y*t [y*t−1 y*t−2 . . . y*t−s]′,

LM = T s̄′ Ŝ^{−1} s̄ →d χ²s   (4.60)

where s̄ = T^{−1} ∑_{t=1}^{T} st and Ŝ = T^{−1} ∑_{t=1}^{T} st s′t.¹⁰

⁹ The Newey-West estimator nests White’s covariance estimator as a special case by choosing L = 0.

Like the Ljung-Box Q statistic, this test has an asymptotic χs2 distribution with the added
advantage of being heteroskedasticity robust.
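A direct implementation of the LM test is sketched below (added here for illustration, not from the original; Python with NumPy and SciPy assumed, and the scores are built only for observations where all s lags are available).

```python
# Sketch: LM test for serial correlation, LM = T * sbar' * S_hat^{-1} * sbar,
# where s_t = y*_t * [y*_{t-1}, ..., y*_{t-s}]' and S_hat = T^{-1} * sum_t s_t s_t'.
import numpy as np
from scipy import stats

def lm_serial_correlation(y, s):
    y = np.asarray(y, dtype=float)
    ystar = y - y.mean()
    # build scores for t = s+1, ..., T (earlier observations lack the full lag set)
    lags = np.column_stack([ystar[s - j:len(ystar) - j] for j in range(1, s + 1)])
    scores = ystar[s:, None] * lags
    T = scores.shape[0]
    sbar = scores.mean(axis=0)
    S_hat = scores.T @ scores / T
    lm = T * sbar @ np.linalg.solve(S_hat, sbar)
    return lm, 1 - stats.chi2.cdf(lm, df=s)

rng = np.random.default_rng(6)
print(lm_serial_correlation(rng.standard_normal(1_000), s=5))   # white noise: large p-value expected
```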
Partial autocorrelations can be estimated using regressions,

yt = φ0 + φ1 yt−1 + φ2 yt−2 + . . . + φs yt−s + εt

where ϕ̂s = φ̂s. To test whether a partial autocorrelation is zero, the variance of φ̂s, under
the null and assuming homoskedasticity, is approximately T^{−1} for any s, and so a standard
t-test can be used,

T^{1/2} φ̂s →d N(0, 1).   (4.61)
If homoskedasticity cannot be assumed, White’s covariance estimator can be used to con-
trol for heteroskedasticity.

Definition 4.25 (Sample Partial Autocorrelogram). A plot of the sample partial autocorrelations
against the lag index is known as a sample partial autocorrelogram.

4.6.3.1 Example: Autocorrelation, partial autocorrelation and Q Statistic

Figure 4.6 contains plots of the first 20 autocorrelations and partial autocorrelations of the
VWM market returns and the default spread. The market appears to have a small amount
of persistence and appears to be more consistent with a moving average than an autore-
gression. The default spread is highly persistent and appears to be a good candidate for
an AR(1) since the autocorrelations decay slowly and the partial autocorrelations drop off
dramatically after one lag, although an ARMA(1,1) cannot be ruled out.

4.6.4 Model Selection: The Box-Jenkins Methodology


The Box and Jenkins methodology is the most common approach for time-series model
selection. It consists of two stages:

• Identification: Visual inspection of the series, the autocorrelations and the partial
autocorrelations.

• Estimation: By relating the sample autocorrelations and partial autocorrelations to
the ACF and PACF of ARMA models, candidate models are identified. These candidates
are estimated and the residuals are tested for neglected dynamics using the
residual autocorrelations, partial autocorrelations and Q statistics or LM-tests for
serial correlation. If dynamics are detected in the residuals, a new model is specified
and the procedure is repeated.

¹⁰ Refer to chapters 2 and 3 for more on LM-tests.

[Figure 4.6 appears here: the first 20 autocorrelations (left) and partial autocorrelations (right) of the VWM (top) and the default spread (bottom).]

Figure 4.6: These four pictures plot the first 20 autocorrelations (left) and partial autocorrelations
(right) of the VWM (top) and the Baa-Aaa spread (bottom). Approximate standard
errors, assuming homoskedasticity, are in parentheses.


The Box-Jenkins procedure relies on two principles: parsimony and invertibility.

Definition 4.26 (Parsimony). Parsimony is a property of a model where the specification
with the fewest parameters capable of capturing the dynamics of a time series is preferred
to other representations equally capable of capturing the same dynamics.

Parsimony is an intuitive principle and using the smallest model has other benefits, partic-
ularly when forecasting. One consequence of the parsimony principle is that parameters
which are not needed are excluded. For example, if the data generating process were an
AR(1), selecting an AR(2) would adequately describe the process. The parsimony princi-
ple indicates the AR(1) should be preferred to an AR(2) since both are equally capable of
capturing the dynamics in the data. Further, recall that an AR(1) can be reformulated as an
MA(T ) where θs = φ1s . Both the AR(1) and MA(T ) are capable of capturing the dynamics
in the data if the DGP is an AR(1), although the number of parameters in each is very dif-
ferent. The parsimony principle provides guidance on selecting the AR(1) over the MA(T )
since it contains (many) fewer parameters yet provides an equivalent description of the
relationship between current and past values of the data.

Definition 4.27 (Invertibility). A moving average is invertible if it can be written as a finite


or convergent autoregression. Invertibility requires the roots of

(1 − θ1 z − θ2 z 2 − . . . − θQ z Q ) = 0

to be greater than one in modulus (absolute value).

Invertibility is a technical requirement stemming from the use of the autocorrelogram


and partial autocorrelogram to choose the model, and it plays an important role in achiev-
ing unique identification of the MA component of a model. For example, the ACF and PACF
of

yt = 2εt −1 + εt
and
yt = .5εt −1 + εt
are identical. The first autocorrelation is θ1/(1 + θ1²), and so in the first specification ρ1 =
2/(1 + 2²) = .4 and in the second ρ1 = .5/(1 + .5²) = .4 while all other autocorrelations are
zero. The partial autocorrelations are similarly identical – partial autocorrelations are functions
of autocorrelations – and so the two processes are indistinguishable. Invertibility rules out the
first of these two models since the root of 1 − 2z = 0 is 1/2 < 1.
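The identification problem is easy to verify numerically. The short sketch below (not part of the original notes; plain Python) shows that θ1 = 2 and θ1 = 0.5 imply the same first autocorrelation, while only the second parameterization has a lag-polynomial root greater than one in modulus.

```python
# Sketch: theta1 = 2 and theta1 = 0.5 imply identical MA(1) autocorrelations,
# rho_1 = theta1 / (1 + theta1^2); only |theta1| < 1 gives an invertible MA(1).
for theta1 in (2.0, 0.5):
    rho1 = theta1 / (1 + theta1**2)
    root_mod = 1 / abs(theta1)              # modulus of the MA(1) lag-polynomial root
    invertible = root_mod > 1
    print(f"theta1={theta1}: rho1={rho1:.2f}, |root|={root_mod:.2f}, invertible={invertible}")
```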
Information criteria such as the AIC or S/BIC can also be used to choose a model. Recall
the definitions of the AIC and BIC:

Definition 4.28 (Akaike Information Criterion). The Akaike Information Criterion (AIC) is

AIC = ln σ̂² + 2k/T   (4.62)

where σ̂² is the estimated variance of the regression error and k is the number of parameters
in the model.

Definition 4.29 (Schwartz/Bayesian Information Criterion). The Schwartz Information Criterion
(SIC), also known as the Bayesian Information Criterion (BIC), is

BIC = ln σ̂² + k ln T / T   (4.63)

where σ̂² is the estimated variance of the regression error and k is the number of parameters
in the model.

ICs are often applied by estimating the largest model which is thought to correctly capture
the dynamics and then dropping lags until the AIC or S/BIC fail to decrease. Specific-to-
General (StG) and General-to-Specific (GtS) are also applicable to time-series modeling
and suffer from the same issues as those described in chapter 3, section ??.

4.7 Estimation

ARMA models are typically estimated using maximum likelihood (ML) estimation assum-
ing that the errors are normal, using either conditional maximum likelihood, where the
likelihood of yt given yt −1 , yt −2 , . . . is used, or exact maximum likelihood where the joint
distribution of [y1 , y2 , . . . , yt −1 , yt ] is used.

4.7.1 Conditional Maximum Likelihood

Conditional maximum likelihood uses the distribution of yt given yt −1 , yt −2 , . . . to estimate


the parameters of an ARMA. The data are assumed to be conditionally normal, and so the
likelihood is

f(yt|yt−1, yt−2, . . . ; φ, θ, σ²) = (2πσ²)^{−1/2} exp(−ε²t / (2σ²))   (4.64)
= (2πσ²)^{−1/2} exp(−(yt − φ0 − ∑_{i=1}^{P} φi yt−i − ∑_{j=1}^{Q} θj εt−j)² / (2σ²))

Since the {εt} series is assumed to be a white noise process, the joint likelihood is simply
the product of the individual likelihoods,

f(yt|yt−1, yt−2 . . . ; φ, θ, σ²) = ∏_{t=1}^{T} (2πσ²)^{−1/2} exp(−ε²t / (2σ²))   (4.65)

and the conditional log-likelihood is



l(φ, θ, σ²; yt|yt−1, yt−2 . . .) = −(1/2) ∑_{t=1}^{T} [ ln 2π + ln σ² + ε²t/σ² ].   (4.66)

Recall that the first-order condition for the mean parameters from a normal log-likelihood
does not depend on σ2 and that given the parameters in the mean equation, the maximum
likelihood estimate of the variance is
σ̂² = T^{−1} ∑_{t=1}^{T} (yt − φ0 − φ1 yt−1 − . . . − φP yt−P − θ1 εt−1 − . . . − θQ εt−Q)²   (4.67)
    = T^{−1} ∑_{t=1}^{T} ε²t.   (4.68)

This allows the variance to be concentrated out of the log-likelihood so that it becomes

l(yt|yt−1, yt−2 . . . ; φ, θ, σ²) = −(1/2) ∑_{t=1}^{T} [ ln 2π + ln(T^{−1} ∑_{t=1}^{T} ε²t) + ε²t / (T^{−1} ∑_{t=1}^{T} ε²t) ]   (4.69)
= −(T/2) ln 2π − (T/2) ln(T^{−1} ∑_{t=1}^{T} ε²t) − (T/2) ∑_{t=1}^{T} ε²t / ∑_{t=1}^{T} ε²t
= −(T/2) ln 2π − (T/2) ln(T^{−1} ∑_{t=1}^{T} ε²t) − T/2
= −(T/2) ln 2π − T/2 − (T/2) ln σ̂².

Eliminating terms that do not depend on model parameters shows that maximizing the
likelihood is equivalent to minimizing the error variance,

max_{φ,θ,σ²} l(yt|yt−1, yt−2 . . . ; φ, θ, σ²) = −(T/2) ln σ̂².   (4.70)

where ε̂t = yt − φ0 − φ1 yt−1 − . . . − φP yt−P − θ1 ε̂t−1 − . . . − θQ ε̂t−Q, and so estimation
using conditional maximum likelihood is equivalent to least squares, although unlike linear
regression the objective is nonlinear due to the moving average terms and so a nonlinear
maximization algorithm is required. If the model does not include moving average terms
(Q = 0), then the conditional maximum likelihood estimates of an AR(P) are identical to the
least squares estimates from the regression

yt = φ0 + φ1 yt −1 + φ2 yt −2 + . . . + φP yt −P + εt . (4.71)
Conditional maximum likelihood estimation of ARMA models requires either backcast
values or truncation since some of the observations have low indices (e.g. y1 ) which de-
pend on observations not in the sample (e.g. y0 , y−1 , ε0 , ε−1 , etc.). Truncation is the most
common and the likelihood is only computed for t = P + 1, . . . , T , and initial values of εt
are set to 0. When using backcasts, missing values of y can be initialized at the long-run
average, ȳ = T^{−1} ∑_{t=1}^{T} yt, and the initial values of εt are set to their unconditional expectation,
0. Using unconditional values works well when data are not overly persistent and T
is not too small. The likelihood can then be recursively computed where the estimated errors
ε̂t are used in the moving average terms,

ε̂t = yt − φ0 − φ1 yt −1 − . . . − φP yt −P − θ1 ε̂t −1 − . . . − θQ ε̂t −Q , (4.72)


where backcast values are used if any index is less than or equal to 0. The estimated resid-
uals are then plugged into the conditional log-likelihood (eq. (4.69)) and the log-likelihood
value is computed. The numerical maximizer will search for values of φ and θ that pro-
duce the largest log-likelihood. Once the likelihood optimizing values have been found,
the maximum likelihood estimate of the variance is computed using

σ̂² = T^{−1} ∑_{t=1}^{T} (yt − φ̂0 − φ̂1 yt−1 − . . . − φ̂P yt−P − θ̂1 ε̂t−1 − . . . − θ̂Q ε̂t−Q)²   (4.73)

or the truncated version which sums from P + 1 to T .
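The recursion in eq. (4.72) and the concentrated objective in eq. (4.70) are straightforward to code. The sketch below (an illustration added here, not from the original notes; Python with NumPy and SciPy assumed) evaluates the truncated conditional log-likelihood of an ARMA(1,1), with the initial error set to 0, and maximizes it with a derivative-free optimizer.

```python
# Sketch: conditional (truncated) ML for an ARMA(1,1).  Residuals are built
# recursively as in eq. (4.72) with eps_0 = 0, and the concentrated
# log-likelihood from eq. (4.69) is maximized over (phi0, phi1, theta1).
import numpy as np
from scipy.optimize import minimize

def negative_loglik(params, y):
    phi0, phi1, theta1 = params
    T = len(y)
    eps = np.zeros(T)
    for t in range(1, T):                      # truncate: start at t = 1
        eps[t] = y[t] - phi0 - phi1 * y[t - 1] - theta1 * eps[t - 1]
    sigma2 = np.mean(eps[1:] ** 2)             # concentrated variance estimate
    loglik = -0.5 * (T - 1) * (np.log(2 * np.pi) + np.log(sigma2) + 1)
    return -loglik

rng = np.random.default_rng(7)
T = 2_000
e = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):                          # simulate ARMA(1,1): phi1 = 0.7, theta1 = 0.3
    y[t] = 0.1 + 0.7 * y[t - 1] + 0.3 * e[t - 1] + e[t]

res = minimize(negative_loglik, x0=np.array([0.0, 0.5, 0.0]), args=(y,), method="Nelder-Mead")
print("estimates (phi0, phi1, theta1):", np.round(res.x, 3))
```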

4.7.2 Exact Maximum Likelihood


Exact maximum likelihood directly utilizes the autocorrelation function of an ARMA(P,Q)
to compute the correlation matrix of all of the y data, which allows the joint likelihood to
be evaluated. Define
y = [yt yt−1 yt−2 . . . y2 y1]′

and let Γ be the T by T covariance matrix of y. The joint likelihood of y is given by

f(y|φ, θ, σ²) = (2π)^{−T/2} |Γ|^{−1/2} exp(−y′Γ^{−1}y / 2).   (4.74)
The log-likelihood is

l(φ, θ, σ²; y) = −(T/2) ln(2π) − (1/2) ln |Γ| − (1/2) y′Γ⁻¹y    (4.75)
where Γ is a matrix of autocovariances,
 
Γ = [ γ0     γ1     γ2     γ3     . . .  γT−1   γT
      γ1     γ0     γ1     γ2     . . .  γT−2   γT−1
      γ2     γ1     γ0     γ1     . . .  γT−3   γT−2
      . . .  . . .  . . .  . . .  . . .  . . .  . . .
      γT−1   γT−2   γT−3   γT−4   . . .  γ0     γ1
      γT     γT−1   γT−2   γT−3   . . .  γ1     γ0  ]
whose elements are determined by the model parameters (excluding the constant), φ, θ and σ². A nonlinear maximization algorithm can be used to search for the vector of parameters that maximizes this log-likelihood. Exact maximum likelihood is generally believed to be more precise than conditional maximum likelihood and does not require backcasts of data or errors.
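A minimal sketch of how the exact likelihood can be evaluated numerically, assuming the autocovariances are obtained from statsmodels' arma_acovf (whose lag-polynomial sign convention may differ from the one used above) and that the data have already been demeaned; the parameter values are purely illustrative.

# Sketch: exact Gaussian log-likelihood of an ARMA(1,1) via its autocovariances.
import numpy as np
from scipy.linalg import toeplitz
from statsmodels.tsa.arima_process import arma_acovf

def exact_loglik(y, phi1, theta1, sigma2):
    T = len(y)
    # statsmodels convention: lag polynomials include the leading 1
    gamma = arma_acovf(np.r_[1, -phi1], np.r_[1, theta1], nobs=T, sigma2=sigma2)
    Gamma = toeplitz(gamma)                  # T x T autocovariance matrix
    sign, logdet = np.linalg.slogdet(Gamma)
    quad = y @ np.linalg.solve(Gamma, y)     # y' Gamma^{-1} y
    return -0.5 * (T * np.log(2 * np.pi) + logdet + quad)

y = np.random.standard_normal(250)           # replace with demeaned data
print(exact_loglik(y, phi1=0.5, theta1=0.2, sigma2=1.0))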

4.8 Inference

Inference on ARMA parameters from stationary time series is a standard application of


maximum likelihood theory. Define ψ = [φ θ σ²]′ as the parameter vector. Recall from chapter 2 that maximum likelihood estimates are asymptotically normal,

√T (ψ̂ − ψ) →d N(0, I⁻¹)    (4.76)
where

I = −E[ ∂²l(y; ψ)/∂ψ∂ψ′ ]

and ∂²l(y; ψ)/∂ψ∂ψ′ is the second derivative matrix of the log-likelihood (or Hessian). In practice I is not known and it must be replaced with a consistent estimate,

Î = −T⁻¹ Σ_{t=1}^{T} ∂²l(yt ; ψ̂)/∂ψ∂ψ′ .
Wald and t -tests on the parameter estimates can be computed using the elements of I, or
likelihood ratio tests can be used by imposing the null on the model and comparing the
log-likelihood values of the constrained and unconstrained estimators.
One important assumption in the above distribution theory is that the estimator is a
maximum likelihood estimator; this requires the likelihood to be correctly specified, or,
in other words, for the data to be homoskedastic and normally distributed. This is gener-

ally an implausible assumption when using financial data and a modification of the above
theory is needed. When one likelihood is specified for the data but they actually have a
different distribution the estimator is known as a Quasi Maximum Likelihood estimator
(QML). QML estimators, like ML estimators, are asymptotically normal under mild regu-
larity conditions on the data but with a different asymptotic covariance matrix,
√T (ψ̂ − ψ) →d N(0, I⁻¹J I⁻¹)    (4.77)
where

J = E[ (∂l(y; ψ)/∂ψ)(∂l(y; ψ)/∂ψ′) ] .

J must also be estimated and the usual estimator is

Ĵ = T⁻¹ Σ_{t=1}^{T} (∂l(yt ; ψ)/∂ψ)(∂l(yt ; ψ)/∂ψ′)

where ∂l(yt ; ψ)/∂ψ is the score of the log-likelihood. I⁻¹J I⁻¹ is known as a sandwich covariance estimator, of which White's covariance estimator is the familiar example.
A sandwich covariance estimator is needed when the model for the data is not completely specified or is misspecified, and it accounts for the failure of the information matrix equality (IME) to hold (see chapters 2 and 3). As was the case in linear regression, a sufficient condition for the IME to fail in ARMA estimation is heteroskedastic residuals. Considering the prevalence of conditional heteroskedasticity in financial data, this is nearly a given.
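A minimal numerical sketch of the sandwich estimator, assuming a user-supplied function returning the vector of per-observation log-likelihoods (the names below, including loglik_obs, are hypothetical; the normal example at the end is only there to make the sketch runnable).

# Sketch: QML (sandwich) covariance from numerical scores and a numerical Hessian.
import numpy as np

def numerical_scores(loglik_obs, psi, y, h=1e-5):
    k = len(psi)
    scores = np.zeros((len(y), k))
    for i in range(k):
        up, dn = psi.copy(), psi.copy()
        up[i] += h
        dn[i] -= h
        scores[:, i] = (loglik_obs(up, y) - loglik_obs(dn, y)) / (2 * h)
    return scores

def qml_covariance(loglik_obs, psi_hat, y, h=1e-5):
    T, k = len(y), len(psi_hat)
    s = numerical_scores(loglik_obs, psi_hat, y, h)
    J_hat = s.T @ s / T                      # outer product of scores
    I_hat = np.zeros((k, k))                 # minus the derivative of the mean score
    for i in range(k):
        up, dn = psi_hat.copy(), psi_hat.copy()
        up[i] += h
        dn[i] -= h
        I_hat[:, i] = -(numerical_scores(loglik_obs, up, y, h).mean(0)
                        - numerical_scores(loglik_obs, dn, y, h).mean(0)) / (2 * h)
    sandwich = np.linalg.solve(I_hat, J_hat) @ np.linalg.inv(I_hat)
    return sandwich / T                      # estimated covariance of psi_hat

# runnable example: normal log-likelihood with psi = [mu, sigma2]
def loglik_obs(psi, y):
    mu, sigma2 = psi
    return -0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

y = np.random.standard_normal(500)
print(qml_covariance(loglik_obs, np.array([y.mean(), y.var()]), y))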

4.9 Forecasting

Forecasting is a common objective of many time-series models. The objective of a forecast


is to minimize a loss function.

Definition 4.30 (Loss Function). A loss function is a function of the observed data, yt+h , and the time-t constructed forecast, ŷt+h|t , L(yt+h , ŷt+h|t ), that has the three following properties:

• Property 1: The loss of any forecast is non-negative, so L (yt +h , ŷt +h |t ) ≥ 0.

• Property 2: There exists a point, yt∗+h , known as the optimal forecast, where the loss
function takes the value 0. That is L (yt +h , yt∗+h ) = 0.

• Property 3: The loss is non-decreasing away from yt∗+h . That is if ytB+h > ytA+h > yt∗+h ,
then L (yt +h , ytB+h ) > L (yt +h , ytA+h ) > L (yt +h , yt∗+h ). Similarly, if ytD+h < ytC+h < yt∗+h , then
L (yt +h , ytD+h ) > L (yt +h , ytC+h ) > L (yt +h , yt∗+h ).

The most common loss function is Mean Square Error (MSE) which chooses the forecast
to minimize

E[L (yt +h , ŷt +h |t )] = E[(yt +h − ŷt +h |t )2 ] (4.78)


where ŷt +h |t is the time-t forecast of yt +h . Notice that this is just the optimal projection
problem and the optimal forecast is the conditional mean, yt∗+h |t = E t [yt +h ] (See chapter
3). It is simple to verify that this loss function satisfies the properties of a loss function.
Property 1 holds by inspection and property 2 occurs when yt +h = ŷt∗+h |t . Property 3 follows
from the quadratic form. MSE is far and away the most common loss function but others,
such as Mean Absolute Deviation (MAD), Quad-Quad and Linex are used both in practice
and in the academic literature. The MAD loss function will be revisited in chapter 6 (Value-
at-Risk). The Advanced Financial Econometrics elective will go into more detail on non-
MSE loss functions.
The remainder of this section will focus exclusively on forecasts that minimize the MSE
loss function. Fortunately, in this case forecasting from ARMA models is an easy exercise.
For simplicity consider the AR(1) process,

yt = φ0 + φ1 yt −1 + εt .
Since the optimal forecast is the conditional mean, all that is needed is to compute E t [yt +h ]
for any h . When h = 1,

yt +1 = φ0 + φ1 yt + εt +1
so the conditional expectation is

E t [yt +1 ] = E t [φ0 + φ1 yt + εt +1 ] (4.79)


= φ0 + φ1 E t [yt ] + E t [εt +1 ]
= φ0 + φ1 yt + 0
= φ0 + φ1 yt

which follows since yt is in the time-t information set (Ft ) and E t [εt +1 ] = 0 by assump-
tion.11 The optimal forecast for h = 2 is given by E t [yt +2 ],

E t [yt +2 ] = E t [φ0 + φ1 yt +1 + εt +2 ]
= φ0 + φ1 E t [yt +1 ] + E t [εt +1 ]
= φ0 + φ1 (φ0 + φ1 yt ) + 0
= φ0 + φ1 φ0 + φ12 yt
11 This requires a slightly stronger assumption than that εt is a white noise process.

which follows by substituting in the expression derived in eq. (4.79) for E t [yt +1 ]. The opti-
mal forecast for any arbitrary h uses the recursion

E t [yt +h ] = φ0 + φ1 E t [yt +h −1 ] (4.80)


and it is easily shown that Et [yt+h ] = φ0 Σ_{i=0}^{h−1} φ1^i + φ1^h yt . If |φ1 | < 1 then, as h → ∞, the forecast Et [yt+h ] converges to φ0 /(1 − φ1 ), the unconditional expectation of yt .
In other words, for forecasts in the distant future there is no information about the location
of yt +h other than it will return to its unconditional mean. This is not surprising since yt is
covariance stationary when |φ1 | < 1.
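A short sketch of the recursion in eq. (4.80), with purely illustrative parameter values, is given below.

# Sketch: h-step ahead AR(1) forecasts via E_t[y_{t+h}] = phi0 + phi1 * E_t[y_{t+h-1}].
import numpy as np

def ar1_forecast(y_t, phi0, phi1, h):
    forecasts = np.empty(h)
    prev = y_t
    for i in range(h):
        prev = phi0 + phi1 * prev
        forecasts[i] = prev
    return forecasts

print(ar1_forecast(y_t=2.0, phi0=0.5, phi1=0.9, h=10))
print(0.5 / (1 - 0.9))   # unconditional mean, the limit of the forecasts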
Next consider forecasts from an MA(2),

yt = φ0 + θ1 εt −1 + θ2 εt −2 + εt .

The one-step-ahead forecast is given by

E t [yt +1 ] = E t [φ0 + θ1 εt + θ2 εt −1 + εt +1 ]
= φ0 + θ1 E t [εt ] + θ2 E t [εt −1 ] + E t [εt +1 ]
= φ0 + θ1 εt + θ2 εt −1 + 0

which follows since εt and εt −1 are in the Ft information set and E t [εt +1 ] = 0 by assump-
tion. In practice the one step ahead forecast would be given by

E t [yt +1 ] = φ̂0 + θ̂1 ε̂t + θ̂2 ε̂t −1

where both the unknown parameters and the unknown residuals would be replaced with
their estimates.12 The 2-step ahead forecast is given by

E t [yt +2 ] = E t [φ0 + θ1 εt +1 + θ2 εt + εt +2 ]
= φ0 + θ1 E t [εt +1 ] + θ2 E t [εt ] + E t [εt +2 ]
= φ0 + θ1 0 + θ2 εt + 0
= φ0 + θ2 εt .

The 3 or higher step forecast can be easily seen to be φ0 . Since all future residuals have zero
expectation they cannot affect long horizon forecasts. Like the AR(1) forecast, the MA(2)
forecast is mean reverting. Recall the unconditional expectation of an MA(Q) process is
simply φ0 . For any h > Q the forecast of yt +h is just this value, φ0 .
Finally consider the 1 to 3-step ahead forecasts from an ARMA(2,2),
12 The residuals are a natural by-product of the parameter estimation stage.

yt = φ0 + φ1 yt −1 + φ2 yt −2 + θ1 εt −1 + θ2 εt −2 + εt .

Conditioning on the information set Ft , the expectation of yt +1 is

E t [yt +1 ] = E t [φ0 + φ1 yt + φ2 yt −1 + θ1 εt + θ2 εt −1 + εt +1 ]
= E t [φ0 ] + E t [φ1 yt ] + E t [φ2 yt −1 ] + E t [θ1 εt ] + E t [θ2 εt −1 ] + E t [εt +1 ].

Noting that all of the elements are in Ft except εt +1 , which has conditional expectation 0,

E t [yt +1 ] = φ0 + φ1 yt + φ2 yt −1 + θ1 εt + θ2 εt −1

Note that in practice, the parameters and errors will all be replaced by their estimates (i.e.
φ̂1 and ε̂t ). The 2-step ahead forecast is given by

E t [yt +2 ] = E t [φ0 + φ1 yt +1 + φ2 yt + θ1 εt +1 + θ2 εt + εt +2 ]
= E t [φ0 ] + E t [φ1 yt +1 ] + E t [φ2 yt ] + θ1 E t [εt +1 ] + θ2 εt + E t [εt +2 ]
= φ0 + φ1 E t [yt +1 ] + φ2 yt + θ1 E t [εt +1 ] + θ2 εt + E t [εt +2 ]
= φ0 + φ1 (φ0 + φ1 yt + φ2 yt −1 + θ1 εt + θ2 εt −1 ) + φ2 yt + θ1 0 + θ2 εt + 0
= φ0 + φ1 φ0 + φ12 yt + φ1 φ2 yt −1 + φ1 θ1 εt + φ1 θ2 εt −1 + φ2 yt + θ2 εt
= φ0 + φ1 φ0 + (φ12 + φ2 )yt + φ1 φ2 yt −1 + (φ1 θ1 + θ2 )εt + φ1 θ2 εt −1 .

In this case, there are three terms which are not known at time t . By assumption E t [εt +2 ] =
E t [εt +1 ] = 0 and E t [yt +1 ] has been computed above, so

E t [yt +2 ] = φ0 + φ1 φ0 + (φ12 + φ2 )yt + φ1 φ2 yt −1 + (φ1 θ1 + θ2 )εt + φ1 θ2 εt −1

In a similar manner,

E t [yt +3 ] = φ0 + φ1 E t [yt +2 ] + φ2 E t [yt +1 ] + θ1 E t [εt +2 ] + θ2 E t [εt +1 ] + E t [εt +3 ]


E t [yt +3 ] = φ0 + φ1 E t [yt +2 ] + φ2 E t [yt +1 ] + 0 + 0 + 0

which is easily solved by plugging in the previously computed values for E t [yt +2 ] and E t [yt +1 ].
This pattern can be continued by iterating forward to produce the forecast for an arbitrary
h.
Two things are worth noting from this discussion:

• If there is no AR component, all forecasts for h > Q will be φ0 .



• For large h , the optimal forecast converges to the unconditional expectation given by

lim_{h→∞} Et [yt+h ] = φ0 /(1 − φ1 − φ2 − . . . − φP )    (4.81)
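The same iteration applies to any ARMA(P,Q): replace unknown future values with their forecasts and set future errors to zero. A sketch with purely illustrative parameters and data follows.

# Sketch: iterated multi-step forecasts from an ARMA(P,Q).
# phi = [phi0, phi1, ..., phiP], theta = [theta1, ..., thetaQ]; values illustrative.
import numpy as np

def arma_forecast(y, eps, phi, theta, h):
    y_ext = list(y)            # observed data, most recent last
    e_ext = list(eps)          # estimated residuals, most recent last
    P, Q = len(phi) - 1, len(theta)
    for _ in range(h):
        fc = phi[0]
        fc += sum(phi[p] * y_ext[-p] for p in range(1, P + 1))
        fc += sum(theta[q - 1] * e_ext[-q] for q in range(1, Q + 1))
        y_ext.append(fc)
        e_ext.append(0.0)      # E_t[eps_{t+i}] = 0
    return np.array(y_ext[-h:])

y = [0.3, 0.5, 0.1]            # last observations (illustrative)
eps = [0.05, -0.02, 0.01]      # last estimated residuals (illustrative)
print(arma_forecast(y, eps, phi=[0.1, 0.6, 0.2], theta=[0.3, 0.1], h=3))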

4.9.1 Forecast Evaluation


Forecast evaluation is an extensive topic and these notes only cover two simple yet impor-
tant tests: Mincer-Zarnowitz regressions and Diebold-Mariano tests.

4.9.1.1 Mincer-Zarnowitz Regressions

Mincer-Zarnowitz regressions (henceforth MZ) are used to test for the optimality of the
forecast and are implemented with a standard regression. If a forecast is correct, it should
be the case that a regression of the realized value on its forecast and a constant should
produce coefficients of 1 and 0 respectively.
Definition 4.31 (Mincer-Zarnowitz Regression). A Mincer-Zarnowitz (MZ) regression is a regression of the realized value of the predicted variable, yt+h , on its forecast, ŷt+h|t , and a constant,
yt+h = β1 + β2 ŷt+h|t + ηt .    (4.82)
If the forecast is optimal, the coefficients in the MZ regression should be consistent with
β1 = 0 and β2 = 1.
For example, let ŷt +h |t be the h -step ahead forecast of y constructed at time t . Then
running the regression

yt +h = β1 + β2 ŷt +h |t + νt
should produce estimates close to 0 and 1. Testing is straightforward and can be done with
any standard test (Wald, LR or LM). An augmented MZ regression can be constructed by
adding time-t measurable variables to the original MZ regression.
Definition 4.32 (Augmented Mincer-Zarnowitz Regression). An Augmented Mincer-Zarnowitz regression is a regression of the realized value of the predicted variable, yt+h , on its forecast, ŷt+h|t , a constant and any other time-t measurable variables, xt = [x1t x2t . . . xKt ],

yt+h = β1 + β2 ŷt+h|t + β3 x1t + . . . + βK+2 xKt + ηt .    (4.83)

If the forecast is optimal, the coefficients in the MZ regression should be consistent with
β1 = β3 = . . . = βK +2 = 0 and β2 = 1.
It is crucial that the additional variables are time-t measurable and are in Ft . Again,
any standard test statistic can be used to test the null H0 : β2 = 1 ∩ β1 = β3 = . . . = βK+2 = 0 against the alternative H1 : β2 ≠ 1 ∪ βj ≠ 0, j = 1, 3, 4, . . . , K + 1, K + 2.
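A sketch of an MZ regression using statsmodels (the data are simulated purely for illustration, and the joint test string assumes the formula interface).

# Sketch: Mincer-Zarnowitz regression with a joint Wald test of beta1 = 0, beta2 = 1.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
forecast = rng.standard_normal(200)
realized = forecast + 0.5 * rng.standard_normal(200)    # illustrative data

df = pd.DataFrame({"realized": realized, "forecast": forecast})
res = smf.ols("realized ~ forecast", data=df).fit()
print(res.params)
# joint test of an optimal forecast: intercept 0 and slope 1
print(res.f_test("Intercept = 0, forecast = 1"))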

4.9.1.2 Diebold-Mariano Tests

A Diebold-Mariano test, in contrast to an MZ regression, examines the relative perfor-


mance of two forecasts. Under MSE, the loss function is given by L (yt +h , ŷt +h |t ) = (yt +h −
ŷt +h |t )2 . Let A and B index the forecasts from two models ŷtA+h |t and ŷtB+h |t , respectively.
The losses from each can be defined as l tA = (yt +h − ŷtA+h |t )2 and l tB = (yt +h − ŷtB+h |t )2 . If the
models were equally good (or bad), one would expect l¯A ≈ l¯B where l¯ is the average loss. If
model A is better, meaning it has a lower expected loss E[L (yt +h , ŷtA+h |t )] < E[L (yt +h , ŷtB+h |t )],
then, on average, it should be the case that l¯A < l¯B . Alternatively, if model B were better
it should be the case that l¯B < l¯A . The DM test exploits this to construct a simple t -test of
equal predictive ability.

Definition 4.33 (Diebold-Mariano Test). Define dt = lt^A − lt^B . The Diebold-Mariano test is a test of equal predictive accuracy and is constructed as

DM = d̄ / √V[d̄]

where M (for modeling) is the number of observations used in the model building and estimation, R (for reserve) is the number of observations held back for model evaluation and d̄ = R⁻¹ Σ_{t=M+1}^{M+R} dt . Under the null that E[L(yt+h , ŷt+h|t^A )] = E[L(yt+h , ŷt+h|t^B )], and under some regularity conditions on {dt }, DM →d N(0, 1). V[dt ] is the long-run variance of dt and must be computed using a HAC covariance estimator.

If the models are equally accurate, one would expect that E[d t ] = 0 which forms the null
of the DM test, H0 : E[d t ] = 0. To test the null, a standard t -stat is used although the test
has two alternatives: H1A : E[d t ] < 0 and H1B : E[d t ] > 0 which correspond to superiority of
model A and B , respectively. D M is asymptotically normally distributed. Large negative
values (less than -2) indicate model A produces less loss on average and hence is superior,
while large positive values indicate the opposite. Values close to zero indicate neither is
statistically superior.
In Diebold-Mariano tests the variance must be estimated using a Heteroskedasticity Au-
tocorrelation Consistent variance estimator.

Definition 4.34 (Heteroskedasticity Autocorrelation Consistent Covariance Estimator). Co-


variance estimators which are robust to both ignored autocorrelation in residuals and to
heteroskedasticity are known as heteroskedasticity autocorrelation consistent (HAC) co-
variance. The most common example of a HAC estimator is the Newey-West (or Bartlett)
covariance estimator.

The typical variance estimator cannot be used in DM tests and a kernel estimator must be
substituted (e.g. Newey-West).

Despite all of these complications, implementing a DM test is very easy. The first step
is to compute the series of losses, {l tA } and {l tB }, for both forecasts. Next compute d t =
l tA − l tB . Finally, regress d t on a constant and use Newey-West errors,

d t = β1 + εt .
The t -stat on β1 is the DM test statistic and can be compared to critical values of a normal
distribution.
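A sketch of this procedure using simulated losses and statsmodels' HAC (Newey-West) covariance option; everything here is illustrative.

# Sketch: Diebold-Mariano test as a regression of d_t on a constant with Newey-West errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
loss_a = rng.standard_normal(250) ** 2            # (y - yhat_A)^2, illustrative
loss_b = (1.1 * rng.standard_normal(250)) ** 2    # (y - yhat_B)^2, illustrative
d = loss_a - loss_b

res = sm.OLS(d, np.ones_like(d)).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
dm_stat = res.tvalues[0]
print(dm_stat)   # compare to standard normal critical values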

4.10 Nonstationary Time Series


Nonstationary time series present some particular difficulties and standard inference often
fails when a process depends explicitly on t . Nonstationarities can be classified into one
of four categories:

• Seasonalities

• Deterministic Trends (also known as Time Trends)

• Unit Roots (also known as Stochastic Trends)

• Structural Breaks

Each type has a unique feature. Seasonalities are technically a form of deterministic trend,
although their analysis is sufficiently similar to stationary time series that little is lost in
treating a seasonal time series as if it were stationary. Processes with deterministic trends
have unconditional means which depend on time while unit roots processes have uncon-
ditional variances that grow over time. Structural breaks are an encompassing class which
may result in either or both the mean and variance exhibiting time dependence.

4.10.1 Seasonality, Diurnality and Hebdomadality


Seasonality, diurnality and hebdomadality are pervasive in economic time series. While
many data series have been seasonally adjusted to remove seasonalities, particularly US
macroeconomic series, there are many time-series where no seasonally adjusted version
is available. Ignoring seasonalities is detrimental to the precision of parameter estimates and forecasts, and model estimation and selection are often more precise when both seasonal and nonseasonal dynamics are modeled simultaneously.

Definition 4.35 (Seasonality). Data are said to be seasonal if they exhibit a non-constant
deterministic pattern with an annual frequency.

Definition 4.36 (Hebdomadality). Data which exhibit day-of-week deterministic effects


are said to be hebdomadal.

Definition 4.37 (Diurnality). Data which exhibit intra-daily deterministic effects are said
to be diurnal.

Seasonal data are non-stationary, although seasonally de-trended data (usually referred
to as deseasonalized data) may be stationary. Seasonality is common in macroeconomic
time series, diurnality is pervasive in ultra-high frequency data (tick data) and hebdomadal-
ity is often believed to be a feature of asset prices. Seasonality is, technically, a form of non-
stationarity since the mean of a process exhibits explicit dependence on t through the sea-
sonal component, and the Box-Jenkins methodology is not directly applicable. However, a slight change in time scale, where the seasonal pattern is directly modeled along with any nonseasonal dynamics, produces a residual series which is stationary and so the Box-Jenkins
methodology may be applied.
For example, consider a seasonal quarterly time series. Seasonal dynamics may occur
at lags 4, 8, 12, 16, . . ., while nonseasonal dynamics can occur at any lag 1, 2, 3, 4, . . .. Note
that multiples of 4 appear in both lists and so the identification of the seasonal and non-
seasonal dynamics may be difficult (although separate identification makes little practical
difference).
The standard practice when working with seasonal data is to conduct model selection
over two sets of lags by choosing a maximum lag to capture the seasonal dynamics and
by choosing a maximum lag to capture nonseasonal ones. Returning to the example of a
seasonal quarterly time series, a model may need to examine up to 4 lags to capture non-
seasonal dynamics and up to 4 lags of the seasonal component, and if the seasonal component is annual, these four seasonal lags correspond to regressors at t − 4, t − 8, t − 12, and t − 16.

4.10.1.1 Example: Seasonality

Most U.S. data series are available seasonally adjusted, something that is not true for data
from many areas of the world, including the Euro zone. This example makes use of monthly
data on the U.S. money supply, M1, a measure of the money supply that includes all coins,
currency held by the public, travelers’ checks, checking account balances, NOW accounts,
automatic transfer service accounts, and balances in credit unions.
Figure 4.7 contains a plot of monthly M1, the growth of M1 (log differences), and
the sample autocorrelogram and sample partial autocorrelogram of M1. These figures
show evidence of an annual seasonality (lags 12, 24 and 36), and applying the Box-Jenkins
methodology, the seasonality appears to be a seasonal AR, or possibly a seasonal ARMA.
The short run dynamics oscillate and appear consistent with an autoregression since the
partial autocorrelations are fairly flat (aside from the seasonal component). Three specifications which may be appropriate to model the process were fit: a 12-month seasonal AR, a 12-month seasonal MA and a 12-month seasonal ARMA, all combined with an AR(1) to model the short-run dynamics. Results are reported in table 4.3.

M1, M1 Growth, and the ACF and PACF of M1 Growth
(Panels: M1; M1 Growth; Autocorrelation of M1 Growth; Partial Autocorrelation of M1 Growth)

Figure 4.7: Plot of the money supply (M1), M1 growth (log differences), and the sample autocorrelogram and sample partial autocorrelogram of M1 growth. There is a clear seasonal pattern at 12 months which appears consistent with a seasonal ARMA(1,1).

4.10.2 Deterministic Trends

The simplest form of nonstationarity is a deterministic trend. Models with deterministic


time trends can be decomposed into three components:

yt = deterministic trend + stationary component + noise    (4.84)

where {yt } would be stationary if the trend were absent. The two most common forms of
time trends are polynomial (linear, quadratic, etc) and exponential. Processes with poly-
nomial time trends can be expressed

yt = φ0 + δ1 t + δ2 t² + . . . + δS t^S + stationary component + noise,

Modeling seasonalities in M1 growth

yt = φ0 + φ1 yt−1 + φ12 yt−12 + θ12 εt−12 + εt

   φ̂0         φ̂1         φ̂12        θ̂12        SIC
   0.000      −0.014     0.984      −0.640     −9.989
   (0.245)    (0.000)    (0.000)    (0.000)
   0.001      −0.011     0.873                 −9.792
   (0.059)    (0.000)    (0.000)
   0.004      −0.008                0.653      −9.008
   (0.002)    (0.000)               (0.000)

Table 4.3: Estimated parameters, p-values and SIC for three models with seasonalities. The SIC prefers the larger specification with both seasonal AR and MA terms. Moreover, correctly modeling the seasonalities frees the AR(1) term to model the oscillating short-run dynamics (notice the significant negative coefficient).

and linear time trend models are the most common,

yt = φ0 + δ1 t + stationary component + noise.

For example, consider a linear time trend model with an MA(1) stationary component,

yt = φ0 + δ1 t + θ1 εt −1 + εt

The long-run behavior of this process is dominated by the time trend, although it may still
exhibit persistent fluctuations around δ1 t .
Exponential trends appear as linear or polynomial trends in the log of the dependent
variable, for example

ln yt = φ0 + δ1 t + stationary component + noise.

The trend is the permanent component of a nonstationary time series, and so any two observations are permanently affected by the trend line irrespective of the number of observa-
tions between them. The class of deterministic trend models can be reduced to a stationary
process by detrending.

Definition 4.38 (Trend Stationary). A stochastic process, {yt } is trend stationary if there
exists a nontrivial function g (t , δ) such that {yt − g (t , δ)} is stationary.

Detrended data may be strictly or covariance stationary (or both).



Time trend models of GDP
(Panels: GDP; ε̂ from GDPt = µ + δ1 t + δ2 t² + εt ; ln GDP; ε̂ from ln GDPt = µ + δ1 t + εt )

Figure 4.8: Two time trend models are presented, one on the levels of GDP and one on the natural log. Note that the detrended residuals are still highly persistent. This is a likely sign of a unit root.

4.10.2.1 Modeling the time trend in GDP

U.S. GDP data was taken from FRED II from Q1 1947 until Q2 2008. To illustrate the
use of a time trend, consider two simple models for the level of GDP. The first models the
level as a quadratic function of time while the second models the natural log of GDP in an
exponential trend model.
G D Pt = φ0 + δ1 t + δ2 t 2 + εt
and
ln G D Pt = φ0 + δ1 t + εt .

Figure 4.8 presents the time series of GDP, the log of GDP and errors from two models
that include trends. Neither time trend appears to remove the extreme persistence in GDP

which may indicate the process contains a unit root.

4.10.3 Unit Roots

Unit root processes are generalizations of the classic random walk. A process is said to
have a unit root if the distributed lag polynomial can be factored so that one of the roots is
exactly one.

Definition 4.39 (Unit Root). A stochastic process, {yt }, is said to contain a unit root if

(1 − φ1 L − φ2 L 2 − . . . − φP L P )yt = φ0 + (1 − θ1 L − θ2 L 2 − . . . − θQ L Q )εt (4.85)

can be factored

(1 − L )(1 − φ̃1 L − φ̃2 L 2 − . . . − φ̃P −1 L P −1 )yt = φ0 + (1 − θ1 L − θ2 L 2 − . . . − θQ L Q )εt . (4.86)

The simplest example of a unit root process is a random walk.

Definition 4.40 (Random Walk). A stochastic process {yt } is known as a random walk if

yt = yt −1 + εt (4.87)

where εt is a white noise process with the additional property that E t −1 [εt ] = 0.

The basic properties of a random walk are simple to derive. First, a random walk is a
martingale since E t [yt +h ] = yt for any h .13 The variance of a random walk can be deduced
from

V[yt ] = E[(yt − y0 )2 ] (4.88)


= E[(εt + yt −1 − y0 )2 ]
= E[(εt + εt −1 + yt −2 − y0 )2 ]
= E[(εt + εt −1 + . . . + ε1 )2 ]
= E[ε2t + ε2t −1 + . . . + ε21 ]
= t σ2

and this relationship holds for any time index, and so V[ys ] = s σ2 . The sth autocovariance
(γs ) of a unit root process is given by

13 Since the effect of an innovation never declines in a unit root process, it is not reasonable to consider the infinite past as in a stationary AR(1).

γs = E[(yt − y0 )(yt−s − y0 )] = E[(εt + εt−1 + . . . + ε1 )(εt−s + εt−s−1 + . . . + ε1 )]    (4.89)
   = E[ε²t−s + ε²t−s−1 + . . . + ε²1 ]
   = (t − s )σ²

and the sth autocorrelation is then

ρs = (t − s )/t    (4.90)
which tends to 1 for large t and fixed s . This is a useful property of a random walk process
(and any unit root process): The autocorrelations will be virtually constant at 1 with only
a small decline at large lags. Building from the simple unit root, one can define a unit root
plus drift model,

yt = δ + yt−1 + εt

which can be equivalently expressed

yt = δt + Σ_{i=1}^{t} εi + y0

and so the random walk plus drift process consists of both a deterministic trend and a random walk. Alternatively, a random walk model can be augmented with stationary noise so that

yt = Σ_{i=1}^{t} εi + ηt

which leads to the general class of random walk models plus stationary noise processes

yt = Σ_{i=1}^{t} εi + Σ_{j=1}^{t−1} θj ηt−j + ηt
   = Σ_{i=1}^{t} εi + Θ(L)ηt

where Θ(L)ηt = Σ_{j=1}^{t−1} θj ηt−j + ηt is a compact expression for a lag polynomial in θ. Since Θ(L)ηt can include any covariance stationary process, this class should be considered general. More importantly, this process has two components: a permanent one, Σ_{i=1}^{t} εi , and a transitory one, Θ(L)ηt . The permanent component behaves similarly to a deterministic time trend, although unlike the deterministic trend model, the permanent component of this specification depends on random increments. For this reason it is known as a stochastic trend.
Like the deterministic model, where the process can be detrended, a process with a

unit root can be stochastically detrended, or differenced, ∆yt = yt − yt −1 . Differencing a


random walk produces a stationary series,

t
X t −1
X
yt − yt −1 = εi + Θ(L )ηt − εi + Θ(L )ηt −1
i =1 i =1

∆yt = εt + (1 − L )Θ(L )ηt

Over-differencing occurs when the difference operator is applied to a stationary series.


While over-differencing cannot create a unit root, it does have negative consequences such
as increasing the variance of the residual and reducing the magnitude of possibly impor-
tant dynamics. Finally, unit root processes are often known as I(1) processes.

Definition 4.41 (Integrated Process of Order 1). A stochastic process {yt } is integrated of
order 1, written I (1), if {yt } is non-covariance-stationary and if {∆yt } is covariance station-
ary. Note: A process that is already covariance stationary is said to be I (0).

The expression integrated is derived from the presence of Σ_{i=1}^{t} εi in a unit root process, where the sum operator is the discrete version of an integrator.

4.10.4 Difference or Detrend?


Detrending removes nonstationarities from deterministically trending series while differ-
encing removes stochastic trends from unit roots. What happens if the wrong type of de-
trending is used? The unit root case is simple, and since the trend is stochastic, no amount
of detrending can eliminate the permanent component. Only knowledge of the stochastic
trend at an earlier point in time can transform the series to be stationary.
Differencing a stationary series produces another series which is stationary but with a
larger variance than a detrended series.

yt = δt + εt
∆yt = δ + εt − εt −1

while the properly detrended series would be

yt − δt = εt
If εt is a white noise process, the variance of the differenced series is twice that of the de-
trended series with a large negative MA component. The parsimony principle dictates
that the correctly detrended series should be preferred even though differencing is a vi-
able method of transforming a nonstationary series to be stationary. Higher orders of time
trends can be eliminated by re-differencing at the cost of even higher variance.

4.10.5 Testing for Unit Roots: The Dickey-Fuller Test and the Augmented DF
Test

Dickey-Fuller tests (DF), and their generalization to augmented Dickey-Fuller tests (ADF),
are the standard test for unit roots. Consider the case of a simple random walk,

yt = yt −1 + εt

so that
∆yt = εt .

Dickey and Fuller noted that if the null of a unit root were true, then

yt = φ1 yt −1 + εt

can be transformed into


∆yt = γyt −1 + εt

where γ = φ − 1 and a test could be conducted for the null H0 : γ = 0 against an alternative
H1 : γ < 0. This test is equivalent to testing whether φ = 1 in the original model. γ̂ can be
estimated using a simple regression of ∆yt on yt −1 , and the t -stat can be computed in the
usual way. If the distribution of γ̂ were standard normal (under the null), this would be a
very simple test. Unfortunately, it is non-standard since, under the null, yt −1 is a unit root
and the variance is growing rapidly as the number of observations increases. The solution
to this problem is to use the Dickey-Fuller distribution rather than the standard normal to
make inference on the t -stat of γ̂.
Dickey and Fuller considered three separate specifications for their test,

∆yt = γyt −1 + εt (4.91)


∆yt = φ0 + γyt −1 + εt
∆yt = φ0 + δ1 t + γyt −1 + εt

which correspond to a unit root, a unit root with a linear time trend, and a unit root with
a quadratic time trend. The null and alternative hypotheses are the same: H0 : γ = 0,
H1 : γ < 0 (one-sided alternative), and the null that yt contains a unit root will be rejected
if γ̂ is sufficiently negative, which is equivalent to φ̂ being significantly less than 1 in the
original specification.
Unit root testing is further complicated since the inclusion of deterministic regressor(s)
affects the asymptotic distribution. For example, if T = 200, the critical values of a Dickey-
Fuller distribution are

        No trend   Linear    Quadratic
 10%     -1.66     -2.56      -3.13
  5%     -1.99     -2.87      -3.42
  1%     -2.63     -3.49      -3.99

The Augmented Dickey-Fuller (ADF) test generalized the DF to allow for short-run dy-
namics in the differenced dependent variable. The ADF is a DF regression augmented with
lags of the differenced dependent variable to capture short-term fluctuations around the
stochastic trend,

∆yt = γyt−1 + Σ_{p=1}^{P} φp ∆yt−p + εt    (4.92)
∆yt = φ0 + γyt−1 + Σ_{p=1}^{P} φp ∆yt−p + εt
∆yt = φ0 + δ1 t + γyt−1 + Σ_{p=1}^{P} φp ∆yt−p + εt

Neither the null and alternative hypotheses nor the critical values are changed by the
inclusion of lagged dependent variables. The intuition behind this result stems from the
observation that the ∆yt −p are “less integrated” than yt and so are asymptotically less in-
formative.
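A sketch of the three ADF specifications using statsmodels' adfuller on a simulated random walk (the regression option names, e.g. "n" for no deterministic terms, may differ across statsmodels versions; older releases use "nc").

# Sketch: ADF tests for the three specifications in eq. (4.92).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(500))   # random walk, so the null should not reject

for reg in ("n", "c", "ct"):              # none, constant, constant + trend
    stat, pval, lags, nobs, crit, _ = adfuller(y, regression=reg, autolag="AIC")
    print(reg, round(stat, 3), round(pval, 3), lags)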

4.10.6 Higher Orders of Integration


In some situations, integrated variables are not just I(1) but have a higher order of integration. For example, the log consumer price index (ln C P I ) is often found to be I(2) (inte-
grated of order 2) and so double differencing is required to transform the original data to a
stationary series. As a consequence, both the level of ln C P I and its difference (inflation)
contain unit roots.

Definition 4.42 (Integrated Process of Order d ). A stochastic process {yt } is integrated of


order d , written I (d ), if {(1 − L )d yt } is a covariance stationary ARMA process.

Testing for higher orders of integration is simple: repeat the DF or ADF test on the dif-
ferenced data. Suppose that it is not possible to reject the null that the level of a variable,
yt , is integrated and so the data should be differenced (∆yt ). If the differenced data rejects
a unit root, the testing procedure is complete and the series is consistent with an I(1) pro-
cess. If the differenced data contains evidence of a unit root, the data should be double
differenced (∆2 yt ) and the test repeated. The null of a unit root should be rejected on the

double-differenced data since no economic data are thought to be I(3), and so if the null
cannot be rejected on double-differenced data, careful checking for omitted deterministic
trends or other serious problems in the data is warranted.

4.10.6.1 Power of Unit Root tests

Recall that the power of a test is 1 minus the probability of a Type II error, or simply the prob-
ability that the null is rejected when the null is false. In the case of a unit root, power is
the ability of a test to reject the null that the process contains a unit root when the largest
characteristic root is less than 1. Many economic time-series have roots close to 1 and so
it is important to maximize the power of a unit root test so that models have the correct
order of integration.
DF and ADF tests are known to be very sensitive to misspecification and, in particular,
have very low power if the ADF specification is not flexible enough to account for factors
other than the stochastic trend. Omitted deterministic time trends or insufficient lags of
the differenced dependent variable both lower the power by increasing the variance of the
residual. This works analogously to the classic regression testing problem of having a low
power test when the residual variance is too large due to omitted variables.
A few recommendations can be made regarding unit root tests:
• Use a loose model selection criterion (e.g. AIC) to choose the lag length of the included differenced dependent variables.

• Including extraneous deterministic regressors lowers power, but failing to include rel-
evant deterministic regressors produces a test with no power, even asymptotically,
and so be conservative when excluding deterministic regressors.

• More powerful tests than the ADF are available. Specifically, DF-GLS of Elliott, Rothen-
berg & Stock (1996) is increasingly available and it has maximum power against cer-
tain alternatives.

• Trends tend to be obvious and so always plot both the data and the differenced data.

• Use a general-to-specific search to perform unit root testing. Start from a model
which should be too large. If the unit root is rejected, one can be confident that there
is not a unit root since this is a low power test. If a unit root cannot be rejected, re-
duce the model by removing insignificant deterministic components first since these
lower power without affecting the t -stat. If all regressors are significant, and the null
still cannot be rejected, then conclude that the data contains a unit root.

4.10.7 Example: Unit root testing


Two series will be examined for unit roots: the default spread and the log U.S. consumer price index. The ln C P I , which measures the consumer price index less energy and food costs

Unit Root Analysis of ln C P I and the Default Spread
(Panels: ln C P I ; ∆ ln C P I (Annualized Inflation); ∆² ln C P I (Change in Annualized Inflation); Default Spread (Baa-Aaa))

Figure 4.9: These four panels plot the log consumer price index (ln C P I ), ∆ ln C P I , ∆² ln C P I and the default spread. Both ∆² ln C P I and the default spread reject the null of a unit root.

(also known as core inflation), has been taken from FRED, consists of monthly data and
covers the period between August 1968 and August 2008. Figure 4.9 contains plots of both
series as well as the first and second differences of ln C P I .
ln C P I is trending and the spread does not have an obvious time trend. However, de-
terministic trends should be over-specified and so the initial model for ln C P I will include
both a constant and a time-trend and the model for the spread will include a constant. Lag
length was automatically selected using the BIC.
Results of the unit root tests are presented in table 4.4. Based on this output, the spread rejects a unit root at the 5% level but the ln C P I cannot. The next step is to difference the ln C P I to produce ∆ ln C P I . Rerunning the ADF test on the differenced CPI (inflation) and including either a constant or no deterministic trend, the null of a unit root still cannot be rejected. Further differencing the data, ∆² ln C P I t = ∆ ln C P I t − ∆ ln C P I t−1 , strongly rejects, and so ln C P I appears to be I(2). The final row of the table indicates the number of lags used in the ADF and was selected using the BIC with a maximum of 12 lags for ln C P I or 36 lags for the spread (3 years).

                ln C P I   ln C P I   ln C P I   ∆ ln C P I   ∆ ln C P I   ∆² ln C P I   Default Sp.   Default Sp.
t-stat           -2.119     -1.541      1.491      -2.029       -0.977       -13.535        -3.130        -1.666
p-val             0.543      0.517      0.965       0.285        0.296         0.000         0.026         0.091
Deterministic     Linear     Const.     None        Const.       None          None          Const.        None
# lags                4          4          4           3            3             2            15            15

Table 4.4: ADF results for tests that ln C P I and the default spread have unit roots. The null of a unit root cannot be rejected in ln C P I , nor can the null that ∆ ln C P I contains a unit root, and so CPI appears to be an I(2) process. The default spread rejects the null of a unit root although it is clearly highly persistent.

4.11 Nonlinear Models for Time-Series Analysis

While this chapter has exclusively focused on linear time-series processes, many non-linear
time-series processes have been found to parsimoniously describe important dynamics in
financial data. Two which have proven particularly useful in the analysis of financial data
are Markov Switching Autoregressions (MSAR) and Threshold Autoregressions (TAR), es-
pecially the subclass of Self-Exciting Threshold Autoregressions (SETAR).14

4.12 Filters

The ultimate goal of most time-series modeling is to forecast a time-series in its entirety,
which requires a model for both permanent and transitory components. In some situa-
tions, it may be desirable to focus on either the short-run dynamics or the long-run dynam-
ics exclusively, for example in technical analysis where prices are believed to be long-run
unpredictable but may have some short- or medium-run predictability. Linear filters are a
class of functions which can be used to “extract” a stationary cyclic component from a time-
series which contains both short-run dynamics and a permanent component. Generically,
a filter for a time series {yt } is defined as

xt = Σ_{i=−∞}^{∞} wi yt−i    (4.93)

14 There are many nonlinear models frequently used in financial econometrics for modeling quantities other than the conditional mean. For example, both the ARCH (conditional volatility) and CaViaR (conditional Value-at-Risk) models are nonlinear in the original data.

where x t is the filtered time-series or filter output. In most applications, it is desirable to


assign a label to x t , either a trend, written τt , or a cyclic component, c t .
Filters are further categorized into causal and non-causal, where causal filters are re-
stricted to depend on only the past and present of yt , and so as a class are defined through

xt = Σ_{i=0}^{∞} wi yt−i .    (4.94)

Causal filters are more useful in isolating trends from cyclical behavior for forecasting pur-
poses while non-causal filters are more useful for historical analysis of macroeconomic and
financial data.

4.12.1 Frequency, High- and Low-Pass Filters

This text has exclusively dealt with time series in the time domain – that is, understanding
dynamics and building models based on the time distance between points. An alternative
strategy for describing a time series is in terms of frequencies and the magnitude of the
cycle at a given frequency. For example, suppose a time series has a cycle that repeats
every 4 periods. This series could be equivalently described as having a cycle that occurs
with a frequency of 1 in 4, or .25. A frequency description is relatively compact – it is only
necessary to describe a process from frequencies of 0 to 0.5, the latter of which would be a
cycle with a period of 2.15
The idea behind filtering is to choose a set of frequencies and then to isolate the cycles
which occur within the selected frequency range. Filters that eliminate high frequency
cycles are known as low-pass filters, while filters that eliminate low frequency cycles are
known as high-pass filters. Moreover, high- and low-pass filters are related in such a way that if {wi } is a set of weights corresponding to a high-pass filter, then v0 = 1 − w0 , vi = −wi for i ≠ 0 is a set of weights corresponding to a low-pass filter. This relationship forms an identity since {vi + wi } must correspond to an all-pass filter: Σ_{i=−∞}^{∞} (vi + wi )yt−i = yt for any set of weights.


The goal of a filter is to select a particular frequency range and nothing else. The gain function describes the amount of attenuation which occurs at a given frequency.16 A gain of 1 at a particular frequency means any signal at that frequency is passed through unmodified while a gain of 0 means that the signal at that frequency is eliminated from the filtered data.
15 The frequency 1/2 is known as the Nyquist frequency since it is not possible to measure any cyclic behavior at frequencies above 1/2 since these would have a cycle of 1 period and so would appear constant.
16 The gain function for any filter of the form xt = Σ_{i=−∞}^{∞} wi yt−i can be computed as

G(f ) = | Σ_{k=−∞}^{∞} wk exp(−i k 2πf ) |

where i = √−1.

Ideal Filters
(Panels: All Pass; Low Pass; High Pass; Band Pass)

Figure 4.10: These four panels depict the gain functions from a set of ideal filters. The all-pass filter allows all frequencies through. The low-pass filter cuts off at 1/10. The high-pass cuts off below 1/6 and the band-pass filter cuts off below 1/32 and above 1/6.

Figure 4.10 contains a graphical representation of the gain function for a set of ideal filters. The four panels show an all-pass (all frequencies unmodified), a low-pass filter with a cutoff frequency of 1/10, a high-pass with a cutoff frequency of 1/6, and a band-pass filter with cutoff frequencies of 1/6 and 1/32.17 In practice, only the all-pass filter (which corresponds to a filter with weights w0 = 1, wi = 0 for i ≠ 0) can be constructed using a finite sum, and so applied filtering must make trade-offs.

17 Band-pass filters are simply the combination of two low-pass filters. Specifically, if {wi } is a set of weights from a low-pass filter with a cutoff frequency of f1 and {vi } is a set of weights from a low-pass filter with a cutoff frequency of f2 , f2 > f1 , then {vi − wi } is a set of weights which corresponds to a band-pass filter with cutoffs at f1 and f2 .

4.12.2 Moving Average and Exponentially Weighted Moving Average (EWMA)


Moving averages are the simplest filters and are often used in technical analysis. Moving
averages can be constructed as both causal and non-causal filters.

Definition 4.43 (Causal Moving Average). A causal moving average (MA) is a function which takes the form

τt = (1/n) Σ_{i=1}^{n} yt−i+1 .

Definition 4.44 (Centered (Non-Causal) Moving Average). A centered moving average (MA) is a function which takes the form

τt = 1/(2n + 1) Σ_{i=−n}^{n} yt−i+1 .

Note that the centered MA is an average over 2n + 1 data points.


Moving averages are low-pass filters since their weights add up to 1. In other words,
the moving average would contain the permanent component of {yt } and so would have
the same order of integration. The cyclic component, ct = yt − τt , would have a lower order of integration than yt . Since moving averages are low-pass filters, the difference of
two moving averages must be a band-pass filter. Figure 4.11 contains the gain function
from the difference between a 20-day and 50-day moving average which is commonly used
in technical analysis.
Exponentially Weighted Moving Averages (EWMA) are a close cousin of the MA which
place greater weight on recent data than on past data.

Definition 4.45 (Exponentially Weighted Moving Average). An exponentially weighted moving average (EWMA) is a function which takes the form

τt = (1 − λ) Σ_{i=0}^{∞} λ^i yt−i

for some λ ∈ (0, 1).

The name EWMA is derived from the exponential decay in the weights, and EWMAs can be
equivalently expressed (up to an initial condition) as

τt = (1 − λ)yt + λτt−1 .
Like MAs, EWMAs are low-pass filters since the weights sum to 1.
EWMAs are commonly used in financial application as simple volatility filters, where
the dependent variable is chosen to be the squared return. The difference between two
EWMAs is often referred to as a Moving Average Convergence Divergence (MACD) filter in
technical analysis. MACDs are indexed by two numbers, a fast period and a slow period,

where the number of data points in the MACD can be converted to λ as λ = (n − 1)/(n + 1), and so a MACD(12,26) is the difference between two EWMAs with parameters .846 and .926. Figure 4.11 contains the gain function from a MACD(12,26) (the difference between two EWMAs), which is similar to, albeit smoother than, the gain function from the difference of a 20-day and a 50-day MA.
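A sketch of the EWMA recursion and of a MACD-style filter constructed as the difference of two EWMAs, using the (n − 1)/(n + 1) mapping above; the data are simulated purely for illustration.

# Sketch: EWMA recursion tau_t = (1 - lam) y_t + lam tau_{t-1} and a MACD filter.
import numpy as np

def ewma(y, lam):
    tau = np.empty_like(y, dtype=float)
    tau[0] = y[0]                           # initial condition
    for t in range(1, len(y)):
        tau[t] = (1 - lam) * y[t] + lam * tau[t - 1]
    return tau

def macd(y, fast=12, slow=26):
    lam_fast = (fast - 1) / (fast + 1)
    lam_slow = (slow - 1) / (slow + 1)
    return ewma(y, lam_fast) - ewma(y, lam_slow)

y = np.cumsum(np.random.standard_normal(300))   # illustrative price-like series
print(macd(y)[-5:])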

4.12.3 Hodrick-Prescott Filter

The Hodrick & Prescott (1997) (HP) filter is constructed by balancing the fit of the trend to the data against the smoothness of the trend. The HP filter is defined as the solution to

min_{τt} Σ_{t=1}^{T} (yt − τt )² + λ Σ_{t=2}^{T−1} ( (τt+1 − τt ) − (τt − τt−1 ) )²

where (τt+1 − τt ) − (τt − τt−1 ) can be equivalently expressed as the second difference of τt , ∆²τt . λ is a smoothness parameter: if λ = 0 then the solution to the problem is τt = yt ∀t , and as λ → ∞, the “cost” of variation in {τt } becomes arbitrarily high and τt = β0 + β1 t where β0 and β1 are the least squares fit of a linear trend model to the data.
It is simple to show that the solution to this problem must have

y = Γτ

where Γ is a band-diagonal matrix (all omitted elements are 0) of the form

Γ = [ 1+λ    −2λ     λ
      −2λ    1+5λ   −4λ     λ
       λ     −4λ    1+6λ   −4λ     λ
              λ     −4λ    1+6λ   −4λ     λ
                     . . .  . . .  . . .  . . .  . . .
                     λ     −4λ    1+6λ   −4λ     λ
                            λ     −4λ    1+6λ   −4λ     λ
                                   λ     −4λ    1+5λ   −2λ
                                          λ     −2λ    1+λ  ]

The solution to this set of T equations in T unknowns is

τ = Γ⁻¹y.

The cyclic component is defined as ct = yt − τt .
Hodrick & Prescott (1997) recommend values of 100 for annual data, 1600 for quarterly
data and 14400 for monthly data. The HP filter is non-causal and so is not appropriate for
prediction. The gain function of the cyclic component of the HP filter with λ = 1600 is il-
lustrated in figure 4.11. While the filter attempts to eliminate components with a frequency

below ten years of quarterly data (1/40), there is some gain until about 1/50 and the gain is not unity until approximately 1/25.
unity until approximately 25 .

4.12.4 Baxter-King Filter

Baxter & King (1999) consider the problem of designing a filter to be close to the ideal filter
subject to using a finite number of points.18 They further argue that extracting the cyclic
component requires the use of both a high-pass and a low-pass filter – the high-pass fil-
ter is to cutoff the most persistent components while the low-pass filter is used to elimi-
nate high-frequency noise. The BK filter is defined by a triple, two period lengths (inverse
frequencies) and the number of points used to construct the filter (k ), and is written as
B K k (p , q ) where p < q are the cutoff frequencies.
Baxter and King suggest using a band-pass filter with cutoffs at 1/32 and 1/6 for quarterly
data. The final choice for their approximate ideal filter is the number of nodes, for which
they suggest 12. The number of points has two effects. First, the BK filter cannot be used in
the first and last k points. Second, a higher number of nodes will produce a more accurate
approximation to the ideal filter.
Implementing the BK filter is simple. Baxter and King show that the optimal weights for
a low-pass filter at particular frequency f , satisfy

w̃0 = 2f (4.95)
sin(2i πf )
w̃i = , i = 1, 2, . . . , k (4.96)
iπ  
Xk
θ = [2k + 1]−1 1 − w̃i  (4.97)
i =−k

wi = w̃i + θ , i = 0, 1, . . . , k (4.98)
wi = w−i . (4.99)

The BK filter is constructed as the difference between two low-pass filters, and so

τt = Σ_{i=−k}^{k} wi yt−i    (4.100)

ct = Σ_{i=−k}^{k} (vi − wi ) yt−i    (4.101)

18 Ideal filters, except for the trivial all-pass, require an infinite number of points to implement, and so are infeasible in applications.

Actual Filters
(Panels: HP Filter (Cyclic); B K 12 (6, 32) Filter (Cyclic); First Difference; MA(50) - MA(20) Difference; EWMA (λ = 0.94); MACD(12,26))

Figure 4.11: These six panels contain the standard HP filter, the B K 12 (6, 32) filter, the first difference filter, an EWMA with λ = .94, a MACD(12,26) and the difference between a 20-day and a 50-day moving average. The gain functions in the right hand column have been normalized so that the maximum weight is 1. This is equivalent to scaling all of the filter weights by a constant, and so is simply a change in the variance of the filter output.

where {wi } and {vi } are both weights from low-pass filters, where the period used to construct {wi } is longer than the period used to construct {vi }. The gain function of the B K 12 (6, 32) is illustrated in the upper right panel of figure 4.11. The approximation is reasonable, with near unit gain between 1/32 and 1/6 and little gain outside.
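A sketch that constructs the B K 12 (6, 32) weights from eqs. (4.95)-(4.99) and applies them; the series is simulated purely for illustration.

# Sketch: Baxter-King band-pass filter as the difference of two low-pass filters.
import numpy as np

def lowpass_weights(f, k):
    w = np.zeros(2 * k + 1)                 # indices -k, ..., 0, ..., k
    w[k] = 2 * f                            # w_0 = 2f
    i = np.arange(1, k + 1)
    w[k + 1:] = np.sin(2 * np.pi * f * i) / (np.pi * i)
    w[:k] = w[k + 1:][::-1]                 # symmetry: w_i = w_{-i}
    theta = (1 - w.sum()) / (2 * k + 1)     # normalization in eq. (4.97)
    return w + theta

def bk_filter(y, p=6, q=32, k=12):
    v = lowpass_weights(1 / p, k)           # higher cutoff frequency
    w = lowpass_weights(1 / q, k)           # lower cutoff frequency
    band = v - w
    c = np.full(len(y), np.nan)             # first and last k points are lost
    for t in range(k, len(y) - k):
        c[t] = band @ y[t - k:t + k + 1]
    return c                                # cyclic component

y = np.cumsum(np.random.standard_normal(300))
cycle = bk_filter(y)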

4.12.5 First Difference

Another very simple filter to separate a “trend” from a “cyclic” component is the first difference. Note that if yt is an I(1) series, then the first difference which contains the “cyclic” component, ct = (1/2)∆yt , is an I(0) series and so the first difference is a causal filter. The “trend” is measured using an MA(2), τt = (1/2)(yt + yt−1 ), so that yt = ct + τt . The FD filter is not sharp – it allows most frequencies to enter the cyclic component – and so is not recommended in practice.

4.12.6 Beveridge-Nelson Decomposition

The Beveridge & Nelson (1981) decomposition extends the first order difference decom-
position to include any predictable component in the future trend as part of the current
trend. The idea behind the BN decomposition is simple: if the predictable part of the long-
run component places the long-run component above its current value, then the cyclic
component should be negative. Similarly, if the predictable part of the long-run compo-
nent expects that the long run component should trend lower then the cyclic component
should be positive. Formally the BN decomposition if defined as

τt = lim_{h→∞} ŷt+h|t − hµ    (4.102)
ct = yt − τt

where µ is the drift in the trend, if any. The trend can be equivalently expressed as the
current level of yt plus the expected increments minus the drift,

τt = yt + lim_{h→∞} Σ_{i=1}^{h} ( E[∆ŷt+i|t ] − µ )    (4.103)

where µ is the unconditional expectation of the increments to yt , E[∆ ŷt + j |t ]. The trend
component contains the persistent component and so the filter applied must be a low-pass
filter, while the cyclic component is stationary and so must be the output of a high-pass
filter. The gain of the filter applied when using the BN decomposition depends crucially
on the forecasting model for the short-run component.
Suppose {yt } is an I(1) series which has both a permanent and transitive component.
Since {yt } is I(1), ∆yt must be I(0) and so can be described by a stationary ARMA(P,Q) pro-
cess. For simplicity, suppose that ∆yt follows an MA(3) so that

∆yt = φ0 + θ1 εt−1 + θ2 εt−2 + θ3 εt−3 + εt

In this model µ = φ0 , and the h -step ahead forecast is given by

∆ ŷt +1|t = µ + θ1 εt + θ2 εt −1 + θ3 εt −2
∆ ŷt +2|t = µ + θ2 εt + θ3 εt −1
∆ ŷt +3|t = µ + θ3 εt
∆ ŷt +h |t = µ h ≥ 4,

and so
τt = yt + (θ1 + θ2 + θ3 ) εt + (θ2 + θ3 ) εt −1 + θ3 εt −2
and

c t = − (θ1 + θ2 + θ3 ) εt − (θ2 + θ3 ) εt −1 − θ3 εt −2 .
Alternatively, suppose that ∆yt follows an AR(1) so that

∆yt = φ0 + φ1 ∆yt −1 + εt .
This model can be equivalently defined in terms of deviations around the long-run mean,
∆ ỹt = ∆yt − φ0 /(1 − φ1 ), as

∆yt = φ0 + φ1 ∆yt−1 + εt
∆yt = (1 − φ1 ) φ0/(1 − φ1 ) + φ1 ∆yt−1 + εt
∆yt = φ0/(1 − φ1 ) − φ1 φ0/(1 − φ1 ) + φ1 ∆yt−1 + εt
∆yt − φ0/(1 − φ1 ) = φ1 ( ∆yt−1 − φ0/(1 − φ1 ) ) + εt
∆ỹt = φ1 ∆ỹt−1 + εt .

In this transformed model, µ = 0 which simplifies finding the expression for the trend. The h-step ahead forecast of ∆ỹt is given by

Et [∆ỹt+h ] = φ1^h ∆ỹt

and so

τt = yt + lim_{h→∞} Σ_{i=1}^{h} Et [∆ỹt+i ]
   = yt + lim_{h→∞} Σ_{i=1}^{h} φ1^i ∆ỹt
   = yt + ∆ỹt lim_{h→∞} Σ_{i=1}^{h} φ1^i
   = yt + ∆ỹt φ1/(1 − φ1 )

which follows since lim_{h→∞} Σ_{i=1}^{h} φ1^i = −1 + lim_{h→∞} Σ_{i=0}^{h} φ1^i = 1/(1 − φ1 ) − 1 = φ1/(1 − φ1 ).

The main criticism of the Beveridge-Nelson decomposition is that the trend and the cyclic component are perfectly (negatively) correlated.
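A sketch of the BN decomposition when the differences follow an AR(1) estimated by least squares; the data and names are illustrative.

# Sketch: Beveridge-Nelson decomposition with an AR(1) for Delta y_t,
# using tau_t = y_t + (phi1 / (1 - phi1)) * (Delta y_t - mu).
import numpy as np

def bn_ar1(y):
    dy = np.diff(y)
    # OLS fit of dy_t = phi0 + phi1 dy_{t-1} + e_t
    X = np.column_stack([np.ones(len(dy) - 1), dy[:-1]])
    phi0, phi1 = np.linalg.lstsq(X, dy[1:], rcond=None)[0]
    mu = phi0 / (1 - phi1)                  # long-run mean of the differences
    trend = y[1:] + (phi1 / (1 - phi1)) * (dy - mu)
    cycle = y[1:] - trend
    return trend, cycle

y = np.cumsum(0.2 + np.random.standard_normal(400))   # illustrative I(1) series
trend, cycle = bn_ar1(y)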

4.12.7 Extracting the cyclic components from Real US GDP


To illustrate the filters, the cyclic component was extracted from log real US GDP data taken
from the Federal Reserve Economics Database. Data was available from 1947 Q1 until Q2
2009. Figure 4.12 contains the cyclical component extracted using 4 methods. The top
panel contains the standard HP filter with λ = 1600. The middle panel contains B K 12 (6, 32)
(solid) and B K 12 (1, 32) (dashed) filters, the latter of which is a high-pass filter since the faster frequency is 1. Note that the first and last 12 points of the cyclic component are set to
0. The bottom panel contains the cyclic component extracted using a Beveridge-Nelson
decomposition based on an AR(1) fit to GDP growth. For the BN decomposition, the first
2 points are zero which reflects the loss of data due to the first difference and the fitting of
the AR(1) to the first difference.19
The HP filter and the B K 12 (1, 32) are remarkably similar with a correlation of over 99%.
The correlation between the B K 12 (6, 32) and the HP filter was 96%, the difference being
in the lack of a high-frequency component. The cyclic component from the BN decom-
position has a small negative correlation with the other three filters, although choosing a
different model for GDP growth would change the decomposition.

4.12.8 Markov Switching Autoregression


Markov switching autoregression, introduced into econometrics in Hamilton (1989), uses
a composite model which evolves according to both an autoregression and a latent state
which determines the value of the autoregressive parameters. In financial applications
using low-frequency asset returns, a MSAR that allows the mean and the variance to be
state-dependent has been found to outperform linear models (Perez-Quiros & Timmer-
mann 2000).

Definition 4.46 (Markov Switching Autoregression). A k -state Markov switching autore-


gression (MSAR) is a stochastic process which has dynamics that evolve through both a
Markovian state process and an autoregression where the autoregressive parameters are
state dependent. The states, labeled 1, 2, . . . , k , are denoted st and follow a k -state latent
Markov Chain with transition matrix P,
 
P = [ p11   p12   . . .  p1k
      p21   p22   . . .  p2k
      . . .  . . .  . . .  . . .
      pk1   pk2   . . .  pkk ]    (4.104)
19 The AR(1) was chosen from a model selection search of AR models with an order up to 8 using the SBIC.

Cyclical Component of U.S. Real GDP
(Panels: Hodrick-Prescott (λ = 1600); Baxter-King (1, 32) and (6, 32); Beveridge-Nelson (AR(1)))

Figure 4.12: The top panel contains the filtered cyclic component from a HP filter with λ = 1600. The middle panel contains the cyclic component from B K 12 (6, 32) (solid) and B K 12 (1, 32) (dashed) filters. The bottom panel contains the cyclic component from a Beveridge-Nelson decomposition based on an AR(1) model for GDP growth rates.

where pij = Pr(st+1 = i | st = j). Note that the columns must sum to 1 since Σ_{i=1}^{k} Pr(st+1 = i | st = j) = 1. Data are generated according to a Pth order autoregression,

yt = φ0(st ) + φ1(st ) yt −1 + . . . + φP(st ) yt −p + σ(st ) εt (4.105)

where φ(st) = [φ0(st) φ1(st) . . . φP(st)]′ are the state-dependent autoregressive parameters, σ(st) is
the state-dependent standard deviation and εt ∼ i.i.d. N(0, 1).20 The unconditional state prob-
abilities (Pr (st = i )), known as the ergodic probabilities, are denoted π = [π1 π2 . . . πk ]′
and are the solution to
π = Pπ. (4.106)
The ergodic probabilities can also be computed as the normalized eigenvector of P corre-
sponding to the only unit eigenvalue.
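
A short sketch (hypothetical values for P) of computing the ergodic probabilities as the normalized eigenvector associated with the unit eigenvalue:

import numpy as np

# Transition matrix with columns summing to one, p_ij = Pr(s_{t+1} = i | s_t = j)
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])

vals, vecs = np.linalg.eig(P)
unit = np.argmin(np.abs(vals - 1.0))      # locate the (only) unit eigenvalue
pi = np.real(vecs[:, unit])
pi = pi / pi.sum()                        # normalize so the probabilities sum to 1

print(pi)                                 # ergodic probabilities
print(np.allclose(P @ pi, pi))            # check that pi = P pi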
Rather than attempting to derive properties of an MSAR, consider a simple specification
with two states, no autoregressive terms, and where only the mean of the process varies21
yt = φH + εt   if st = H                                                     (4.107)
yt = φL + εt   if st = L
where the two states are indexed by H (high) and L (low). The transition matrix is

    [ pHH  pHL ]   [ pHH      1 − pLL ]
P = [ pLH  pLL ] = [ 1 − pHH  pLL     ]                                      (4.108)
and the unconditional probabilities of being in the high and low state, πH and πL , respec-
tively, are

πH = (1 − pLL)/(2 − pHH − pLL)                                               (4.109)
πL = (1 − pHH)/(2 − pHH − pLL).                                              (4.110)

This simple model is useful for understanding the data generation in a Markov Switch-
ing process:
1. At t = 0 an initial state, s0 , is chosen according to the ergodic (unconditional) proba-
bilities. With probability πH , s0 = H and with probability πL = 1 − πH , s0 = L .

2. The state probabilities evolve independently from the observed data according to a
Markov Chain. If s0 = H, then s1 = H with probability pHH (the probability that st+1 = H given
st = H) and s1 = L with probability pLH = 1 − pHH. If s0 = L, then s1 = H with probability
pHL = 1 − pLL and s1 = L with probability pLL.
20 The assumption that εt ∼ i.i.d. N(0, 1) can be easily relaxed to include other i.i.d. processes for the innovations.
21
See Hamilton (1994, chapter 22) or Krolzig (1997) for further information on implementing MSAR models.

3. Once the state at t = 1 is known, the value of y1 is chosen according to


y1 = φH + ε1   if s1 = H
y1 = φL + ε1   if s1 = L.

4. Steps 2 and 3 are repeated for t = 2, 3, . . . , T , to produce a time series of Markov


Switching data.
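
The four steps translate directly into a simulation. The following Python sketch (function name and parameter values hypothetical) generates data from the two-state switching-mean model:

import numpy as np

def simulate_ms2(T, mu_h=4.0, mu_l=-2.0, sigma_h=1.0, sigma_l=1.0,
                 p_hh=0.9, p_ll=0.9, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: draw s_0 from the ergodic probabilities
    pi_h = (1 - p_ll) / (2 - p_hh - p_ll)
    s = np.empty(T, dtype=int)                 # 1 = high state, 0 = low state
    s[0] = rng.random() < pi_h
    # Step 2: evolve the chain independently of the observed data
    for t in range(1, T):
        stay = p_hh if s[t - 1] == 1 else p_ll
        s[t] = s[t - 1] if rng.random() < stay else 1 - s[t - 1]
    # Steps 3-4: generate y_t given the state
    mu = np.where(s == 1, mu_h, mu_l)
    sigma = np.where(s == 1, sigma_h, sigma_l)
    return mu + sigma * rng.standard_normal(T), s

y, s = simulate_ms2(100)
print(y[:5], s[:5])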

4.12.8.1 Markov Switching Examples

Using the 2-state Markov Switching Autoregression described above, 4 systems were sim-
ulated for 100 observations.

• Pure mixture

– µH = 4, µL = −2, V[εt ] = 1 in both states


– pH H = .5 = pL L
– πH = πL = .5
– Remark: This is a “pure” mixture model where the probability of each state does
not depend on the past. This occurs because the probability of going from high
to high is the same as the probability of going from low to high, 0.5.

• Two persistent States

– µH = 4, µL = −2, V[εt ] = 1 in both states


– pH H = .9 = pL L so the average duration of each state is 10 periods.
– πH = πL = .5
– Remark: Unlike the first parameterization this is not a simple mixture. Condi-
tional on the current state being H , there is a 90% chance that the next state will
remain H .

• One persistent state, one transitory state

– µH = 4, µL = −2, V[εt ] = 1 if st = H and V[εt ] = 2 if st = L


– pH H = .9, pL L = .5
– πH = .83, πL = .16
– Remark: This type of model is consistent with quarterly data on U.S. GDP where
booms (H ) typically last 10 quarters while recessions die quickly, typically in 2
quarters.

• Mixture with different variances



– µH = 4, µL = −2, V[εt ] = 1 if st = H and V[εt ] = 16 if st = L


– pH H = .5 = pL L
– πH = πL = .5
– Remark: This is another “pure” mixture model but the variances differ between
the states. One nice feature of mixture models (MSAR is a member of the family
of mixture models) is that the unconditional distribution of the data may be non-
normal even though the shocks are conditionally normally distributed.22

Figure 4.13 contains plots of 100 data points generated from each of these processes. The
first (MSAR(1)) produces a mixture with modes at −2 and 4, each with equal probability, and
the states (top panel, bottom right) are i.i.d. The second process produced a similar un-
conditional distribution but the state evolution is very different. Each state is very persis-
tent and, conditional on the state being high or low, it was likely to remain the same. The
third process had one very persistent state and one with much less persistence. This pro-
duced a large skew in the unconditional distribution since the state where µ = −2 was
visited less frequently than the state with µ = 4. The final process (MSAR(4)) has state dy-
namics similar to the first but produces a very different unconditional distribution. The
difference occurs since the variance depends on the state of the Markov process.

4.12.9 Threshold Autoregression and Self Exciting Threshold Autoregression

A second class of nonlinear models that have gained considerable traction in financial ap-
plications are Threshold Autoregressions (TAR), and in particular, the subfamily of Self Ex-
citing Threshold Autoregressions (SETAR).23

Definition 4.47 (Threshold Autoregression). A threshold autoregression is a Pth Order au-


toregressive process with state-dependent parameters where the state is determined by the
lagged level of an exogenous variable x t −k for some k ≥ 1.

yt = φ0(st ) + φ1(st ) yt −1 + . . . + φP(st ) yt −p + σ(st ) εt (4.111)

Let −∞ = x0 < x1 < x2 < . . . < xN < xN+1 = ∞ be a partition of x into N + 1 distinct bins.


st = j if x t −k ∈ (x j , x j +1 ).

Self exciting threshold autoregressions, introduced in Tong (1978), are similarly defined.
The only change is in the threshold variable; rather than relying on an exogenous variable
to determine the state, the state in SETARs is determined by lagged values of the dependent
variable.
22 Mixtures of finitely many normals, each with different means and variances, can be used to approximate
many non-normal distributions.
23
See Fan & Yao (2005) for a comprehensive treatment of non-linear time-series models.

Markov Switching Processes

[Figure 4.13 appears here: four panels, MSAR(1) through MSAR(4), each containing the simulated series, a kernel density estimate of the unconditional density, and the time series of state values.]

Figure 4.13: The four panels of this figure contain simulated data generated by the 4 Markov
switching processes described in the text. In each panel, the large subpanel contains the
generated data, the top right subpanel contains a kernel density estimate of the uncondi-
tional density and the bottom right subpanel contains the time series of the state values
(high points correspond to the high state).

Definition 4.48 (Self Exciting Threshold Autoregression). A self exciting threshold autore-
gression is a Pth Order autoregressive process with state-dependent parameters where the
state is determined by the lagged level of the dependent variable, yt −k for some k ≥ 1.

yt = φ0(st ) + φ1(st ) yt −1 + . . . + φP(st ) yt −p + σ(st ) εt (4.112)

Let −∞ = y0 < y1 < y2 < . . . < yN < yN+1 = ∞ be a partition of y into N + 1 distinct bins.

st = j if yt−k ∈ (yj , yj+1 ).

The primary application of SETAR models in finance has been to exchange rates which
often exhibit a behavior that is difficult to model with standard ARMA models: many FX
rates exhibit random-walk-like behavior in a range yet remain within the band longer than
would be consistent with a simple random walk. A symmetric SETAR is a parsimonious
model that can describe this behavior, and is parameterized

yt = yt −1 + εt if C − δ < yt < C + δ (4.113)


yt = C (1 − φ) + φ yt −1 + εt if yt < C − δ or yt > C + δ

where C is the “target” exchange rate. The first equation is a standard random walk, and
when yt is within the target band it follows a random walk. The second equation is only
relevant when yt is outside of its target band and ensures that yt is mean reverting towards
C as long as |φ| < 1.24 φ is usually assumed to lie between 0 and 1 which produces a
smooth mean reversion back towards the band.
To illustrate the behavior of this process and to highlight the differences between it and
a random walk, 200 data points were generated with different values of φ using standard
normal innovations. The mean was set to 100 and δ = 5 was used, so that yt follows a
random walk when between 95 and 105. The lag value of the threshold variable (k) was set
to one. Four values for φ were used: 0, 0.5, 0.9 and 1. The extreme cases represent a process
which is immediately mean reverting (φ = 0), in which case as soon as yt leaves the target
band it is immediately returned to C , and a process that is a pure random walk (φ = 1) since
yt = yt −1 + εt for any value of yt −1 . The two interior cases represent smooth reversion back
to the band; when φ = .5 the reversion is quick and when φ = .9 the reversion is slow.
When φ is close to 1 it is very difficult to differentiate a band SETAR from a pure random
walk, which is one explanation for the poor performance of unit root tests: the tests
often fail to reject a unit root despite clear economic theory predicting that a time series
should be mean reverting.
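
The following Python sketch (hypothetical function and parameter names) simulates the band SETAR used in this illustration, with the regime determined by the lagged value (k = 1):

import numpy as np

def simulate_band_setar(T, C=100.0, delta=5.0, phi=0.9, seed=0):
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    y[0] = C
    for t in range(1, T):
        eps = rng.standard_normal()
        if C - delta < y[t - 1] < C + delta:
            y[t] = y[t - 1] + eps                        # inside the band: random walk
        else:
            y[t] = C * (1 - phi) + phi * y[t - 1] + eps  # outside: mean reversion to C
    return y

for phi in (0.0, 0.5, 0.9, 1.0):
    y = simulate_band_setar(200, phi=phi, seed=42)
    print(phi, round(float(y.min()), 2), round(float(y.max()), 2))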

4.A Computing Autocovariance and Autocorrelations

This appendix covers the derivation of the ACF for the MA(1), MA(Q), AR(1), AR(2), AR(3)
and ARMA(1,1). Throughout this appendix, {εt } is assumed to be a white noise process and
the process parameters are always assumed to be consistent with covariance stationarity.
All models are assumed to be mean 0, an assumption made without loss of generality since
autocovariances are defined using demeaned time series,

24 Recall that the mean of an AR(1), yt = φ0 + φ1 yt−1 + εt , is φ0/(1 − φ1), where φ0 = C(1 − φ) and φ1 = φ in this
SETAR.

Self Exciting Threshold Autoregression Processes

[Figure 4.14 appears here: four panels showing simulated SETAR data for φ = 0, φ = 0.5, φ = 0.9 and φ = 1, plotted over 200 observations.]

Figure 4.14: The four panels of this figure contain simulated data generated by a SETAR with
different values of φ. When φ = 0 the process is immediately returned to its unconditional
mean C = 100. Larger values of φ increase the amount of time spent outside of the “target
band” (95–105) and when φ = 1, the process is a pure random walk.

γs = E[(yt − µ)(yt−s − µ)]

where µ = E[yt ]. Recall that the autocorrelation is simply the ratio of the sth autocovariance to
the variance,

ρs = γs / γ0 .

This appendix presents two methods for deriving the autocorrelations of ARMA pro-
cesses: backward substitution and the Yule-Walker equations, a set of k equations with k
unknowns where γ0 , γ1 , . . . , γk −1 are the solution.

4.A.1 Yule-Walker

The Yule-Walker equations are a linear system of max(P, Q) + 1 equations (in an ARMA(P,Q))
whose solution is the long-run variance and the first k − 1 autocovariances.
The Yule-Walker equations are formed by equating the definition of an autocovari-
ance with an expansion produced by substituting for the contemporaneous value of yt . For
example, suppose yt follows an AR(2) process,

yt = φ1 yt −1 + φ2 yt −2 + εt
The variance must satisfy

E[yt yt ] = E[yt (φ1 yt −1 + φ2 yt −2 + εt )] (4.114)


E[yt2 ] = E[φ1 yt yt −1 + φ2 yt yt −2 + yt εt ]
V[yt ] = φ1 E[yt yt −1 ] + φ2 E[yt yt −2 ] + E[yt εt ].

In the final equation above, terms of the form E[yt yt −s ] are replaced by their population
values, γs and E[yt εt ] is replaced with its population value, σ2 .

V[yt ] = φ1 E[yt yt−1 ] + φ2 E[yt yt−2 ] + E[yt εt ]                          (4.115)

becomes

γ0 = φ1 γ1 + φ2 γ2 + σ2 (4.116)

and so the long run variance is a function of the first two autocovariances, the model pa-
rameters, and the innovation variance. This can be repeated for the first autocovariance,

E[yt yt −1 ] = φ1 E[yt −1 yt −1 ] + φ2 E[yt −1 yt −2 ] + E[yt −1 εt ]

becomes

γ1 = φ1 γ0 + φ2 γ1 , (4.117)

and for the second autocovariance,

E[yt yt−2 ] = φ1 E[yt−2 yt−1 ] + φ2 E[yt−2 yt−2 ] + E[yt−2 εt ]

becomes

γ2 = φ1 γ1 + φ2 γ0 . (4.118)

Together eqs. (4.116), (4.117) and (4.118) form a system of three equations with three unknowns.
The Yule-Walker method relies heavily on covariance stationarity, and so E[yt yt−j ] =
E[yt−h yt−h−j ] for any h. This property of covariance stationary processes was repeatedly
used in producing the Yule-Walker equations since E[yt yt ] = E[yt−1 yt−1 ] =
E[yt−2 yt−2 ] = γ0 and E[yt yt−1 ] = E[yt−1 yt−2 ] = γ1 .
The Yule-Walker method will be demonstrated for a number of models, starting from a
simple MA(1) and working up to an ARMA(1,1).

4.A.2 MA(1)

The MA(1) is the simplest model to work with.

yt = θ1 εt −1 + εt

The Yule-Walker equations are

E[yt yt ] = E[θ1 εt −1 yt ] + E[εt yt ] (4.119)


E[yt yt −1 ] = E[θ1 εt −1 yt −1 ] + E[εt yt −1 ]
E[yt yt −2 ] = E[θ1 εt −1 yt −2 ] + E[εt yt −2 ]

γ0 = θ12 σ2 + σ2 (4.120)
γ1 = θ1 σ 2

γ2 = 0

Additionally, both γs and ρs , s ≥ 2 are 0 by the white noise property of the residuals,
and so the autocorrelations are

ρ1 = θ1 σ² / (θ1² σ² + σ²) = θ1 / (1 + θ1²),
ρ2 = 0.
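
These results are easy to verify by simulation; the sketch below (parameter values hypothetical) compares the sample autocorrelations of a simulated MA(1) with θ1/(1 + θ1²):

import numpy as np

rng = np.random.default_rng(0)
theta1, T = 0.7, 200_000
eps = rng.standard_normal(T + 1)
y = theta1 * eps[:-1] + eps[1:]                  # y_t = theta1*eps_{t-1} + eps_t

rho1_theory = theta1 / (1 + theta1 ** 2)
rho1_sample = np.corrcoef(y[1:], y[:-1])[0, 1]
rho2_sample = np.corrcoef(y[2:], y[:-2])[0, 1]   # should be close to 0
print(rho1_theory, rho1_sample, rho2_sample)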

4.A.2.1 MA(Q)

The Yule-Walker equations can be constructed and solved for any MA(Q ), and the structure
of the autocovariance is simple to detect by constructing a subset of the full system.

E[yt yt ] = E[θ1 εt−1 yt ] + E[θ2 εt−2 yt ] + E[θ3 εt−3 yt ] + . . . + E[θQ εt−Q yt ] + E[εt yt ]               (4.121)
       γ0 = θ1² σ² + θ2² σ² + θ3² σ² + . . . + θQ² σ² + σ²
          = σ² (1 + θ1² + θ2² + θ3² + . . . + θQ² )
E[yt yt−1 ] = E[θ1 εt−1 yt−1 ] + E[θ2 εt−2 yt−1 ] + E[θ3 εt−3 yt−1 ] + . . . + E[θQ εt−Q yt−1 ] + E[εt yt−1 ]   (4.122)
       γ1 = θ1 σ² + θ1 θ2 σ² + θ2 θ3 σ² + . . . + θQ−1 θQ σ²
          = σ² (θ1 + θ1 θ2 + θ2 θ3 + . . . + θQ−1 θQ )
E[yt yt−2 ] = E[θ1 εt−1 yt−2 ] + E[θ2 εt−2 yt−2 ] + E[θ3 εt−3 yt−2 ] + . . . + E[θQ εt−Q yt−2 ] + E[εt yt−2 ]   (4.123)
       γ2 = θ2 σ² + θ1 θ3 σ² + θ2 θ4 σ² + . . . + θQ−2 θQ σ²
          = σ² (θ2 + θ1 θ3 + θ2 θ4 + . . . + θQ−2 θQ )

The pattern that emerges shows

γs = θs σ² + σ² Σ_{i=1}^{Q−s} θi θi+s = σ² (θs + Σ_{i=1}^{Q−s} θi θi+s ),

and so γs is a sum of Q − s + 1 terms. The autocorrelations are

ρ1 = (θ1 + Σ_{i=1}^{Q−1} θi θi+1 ) / (1 + Σ_{i=1}^{Q} θi² )                  (4.124)
ρ2 = (θ2 + Σ_{i=1}^{Q−2} θi θi+2 ) / (1 + Σ_{i=1}^{Q} θi² )
 ⋮
ρQ = θQ / (1 + Σ_{i=1}^{Q} θi² )
ρQ+s = 0,   s ≥ 1
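
A small helper (hypothetical, not from the notes) computes these theoretical MA(Q) autocorrelations directly from a vector of θ coefficients:

import numpy as np

def ma_acf(theta, nlags=10):
    # rho_s = (theta_s + sum_i theta_i theta_{i+s}) / (1 + sum_i theta_i^2), and rho_s = 0 for s > Q
    coefs = np.r_[1.0, np.asarray(theta, dtype=float)]    # [1, theta_1, ..., theta_Q]
    gamma0 = coefs @ coefs                                # proportional to the variance
    rho = np.empty(nlags + 1)
    rho[0] = 1.0
    for s in range(1, nlags + 1):
        rho[s] = coefs[:-s] @ coefs[s:] / gamma0 if s <= len(theta) else 0.0
    return rho

print(ma_acf([0.6, 0.3, 0.1], nlags=5))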

4.A.2.2 AR(1)

The Yule-Walker method requires max(P, Q) + 1 equations to compute the autocovariances
for an ARMA(P,Q) process, and in an AR(1), two are required (the third is included to
establish this point).
yt = φ1 yt −1 + εt

E[yt yt ] = E[φ1 yt −1 yt ] + E[εt yt ] (4.125)


E[yt yt −1 ] = E[φ1 yt −1 yt −1 ] + E[εt yt −1 ]
E[yt yt −2 ] = E[φ1 yt −1 yt −2 ] + E[εt yt −2 ]

These equations can be rewritten in terms of the autocovariances, model parameters and
σ2 by taking expectation and noting that E[εt yt ] = σ2 since yt = εt + φ1 εt −1 + φ12 εt −2 + . . .
and E[εt yt − j ] = 0, j > 0 since {εt } is a white noise process.

γ0 = φ1 γ1 + σ2 (4.126)
γ1 = φ1 γ0
γ2 = φ1 γ1

The third is redundant since γ2 is fully determined by γ1 and φ1 , and higher autocovari-
ances are similarly redundant since γs = φ1 γs −1 for any s . The first two equations can be
solved for γ0 and γ1 ,

γ0 = φ1 γ1 + σ²
γ1 = φ1 γ0  ⇒
γ0 = φ1² γ0 + σ²
γ0 − φ1² γ0 = σ²
γ0 (1 − φ1² ) = σ²
γ0 = σ² / (1 − φ1² )

and

γ1 = φ1 γ0
γ0 = σ² / (1 − φ1² )  ⇒
γ1 = φ1 σ² / (1 − φ1² )

The remaining autocovariances can be computed using the recursion γs = φ1 γs−1 , and so

γs = φ1^s σ² / (1 − φ1² ).

Finally, the autocorrelations can be computed as ratios of autocovariances,

ρ1 = γ1 / γ0 = [φ1 σ² / (1 − φ1² )] / [σ² / (1 − φ1² )] = φ1
ρs = γs / γ0 = [φ1^s σ² / (1 − φ1² )] / [σ² / (1 − φ1² )] = φ1^s .

4.A.2.3 AR(2)

The autocorrelations in an AR(2)

yt = φ1 yt −1 + φ2 yt −2 + εt

can be similarly computed using the max(P, Q ) + 1 equation Yule-Walker system,

E[yt yt ] = φ1 E[yt−1 yt ] + φ2 E[yt−2 yt ] + E[εt yt ]                       (4.127)


E[yt yt −1 ] = φ1 E[yt −1 yt −1 ] + φ2 E[yt −2 yt −1 ] + E[εt yt −1 ]
E[yt yt −2 ] = φ1 E[yt −1 yt −2 ] + φ2 E[yt −2 yt −2 ] + E[εt yt −2 ]

and then replacing expectations with their population counterparts, γ0 ,γ1 , γ2 and σ2 .

γ0 = φ1 γ1 + φ2 γ2 + σ2 (4.128)
γ1 = φ1 γ0 + φ2 γ1
γ2 = φ1 γ1 + φ2 γ0

Further, it must be the case that γs = φ1 γs −1 + φ2 γs −2 for s ≥ 2. To solve this system of


equations, divide the autocovariance equations by γ0 , the long run variance. Omitting the
first equation, the system reduces to two equations in two unknowns,

ρ 1 = φ1 ρ 0 + φ2 ρ 1
ρ 2 = φ1 ρ 1 + φ 2 ρ 0

since ρ0 = γ0 /γ0 = 1.

ρ 1 = φ1 + φ2 ρ 1
ρ 2 = φ1 ρ 1 + φ2

Solving this system,

ρ1 = φ1 + φ2 ρ1
ρ1 − φ2 ρ1 = φ1
ρ1 (1 − φ2 ) = φ1
ρ1 = φ1 / (1 − φ2 )

and

ρ2 = φ1 ρ1 + φ2
   = φ1 · φ1 / (1 − φ2 ) + φ2
   = (φ1 φ1 + (1 − φ2 ) φ2 ) / (1 − φ2 )
   = (φ1² + φ2 − φ2² ) / (1 − φ2 )

Since ρs = φ1 ρs −1 + φ2 ρs −2 , these first two autocorrelations are sufficient to compute


the other autocorrelations,

ρ3 = φ1 ρ2 + φ2 ρ1
   = φ1 (φ1² + φ2 − φ2² ) / (1 − φ2 ) + φ2 φ1 / (1 − φ2 )

and the long run variance of yt ,

γ0 = φ1 γ1 + φ2 γ2 + σ²
γ0 − φ1 γ1 − φ2 γ2 = σ²
γ0 (1 − φ1 ρ1 − φ2 ρ2 ) = σ²
γ0 = σ² / (1 − φ1 ρ1 − φ2 ρ2 )

The final solution is computed by substituting for ρ1 and ρ2 ,

γ0 = σ² / (1 − φ1 · φ1/(1 − φ2 ) − φ2 · (φ1² + φ2 − φ2² )/(1 − φ2 ))
   = [(1 − φ2 ) / (1 + φ2 )] · σ² / [(φ1 + φ2 − 1)(φ2 − φ1 − 1)]
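
The Yule-Walker system can also be solved numerically, which provides a check on the closed forms above (parameter values hypothetical):

import numpy as np

phi1, phi2, sigma2 = 0.5, 0.3, 1.0

# Equations (4.116)-(4.118) written as a linear system in (gamma_0, gamma_1, gamma_2)
A = np.array([[1.0,   -phi1,       -phi2],
              [-phi1,  1.0 - phi2,  0.0],
              [-phi2, -phi1,        1.0]])
b = np.array([sigma2, 0.0, 0.0])
gamma0, gamma1, gamma2 = np.linalg.solve(A, b)

rho1 = phi1 / (1 - phi2)
rho2 = (phi1 ** 2 + phi2 - phi2 ** 2) / (1 - phi2)
gamma0_closed = (1 - phi2) / (1 + phi2) * sigma2 / ((phi1 + phi2 - 1) * (phi2 - phi1 - 1))
print(gamma1 / gamma0, rho1)           # both approximately 0.714
print(gamma2 / gamma0, rho2)           # both approximately 0.657
print(gamma0, gamma0_closed)           # both approximately 2.244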

4.A.2.4 AR(3)

Begin by constructing the Yule-Walker equations,

E[yt yt ] = φ1 E[yt −1 yt ] + φ2 E[yt −2 yt ] + φ3 E[yt −3 yt ] + E[εt yt ]


E[yt yt −1 ] = φ1 E[yt −1 yt −1 ] + φ2 E[yt −2 yt −1 ] + φ3 E[yt −3 yt −1 ] + E[εt yt −1 ]
E[yt yt −2 ] = φ1 E[yt −1 yt −2 ] + φ2 E[yt −2 yt −2 ] + φ3 E[yt −3 yt −2 ] + E[εt yt −2 ]
E[yt yt−3 ] = φ1 E[yt−1 yt−3 ] + φ2 E[yt−2 yt−3 ] + φ3 E[yt−3 yt−3 ] + E[εt yt−3 ].

Replacing the expectations with their population values, γ0 , γ1 , . . . and σ2 , the Yule-Walker
equations can be rewritten

γ0 = φ1 γ1 + φ2 γ2 + φ3 γ3 + σ2 (4.129)
γ1 = φ1 γ0 + φ2 γ1 + φ3 γ2
γ2 = φ1 γ1 + φ2 γ0 + φ3 γ1
γ3 = φ1 γ2 + φ2 γ1 + φ3 γ0

and the recursive relationship γs = φ1 γs −1 + φ2 γs −2 + φ3 γs −3 can be observed for s ≥ 3.


Omitting the first condition and dividing by γ0 ,

ρ 1 = φ1 ρ 0 + φ 2 ρ 1 + φ 3 ρ 2
ρ 2 = φ1 ρ 1 + φ 2 ρ 0 + φ 3 ρ 1
ρ 3 = φ1 ρ 2 + φ 2 ρ 1 + φ 3 ρ 0 .

leaving three equations in three unknowns since ρ0 = γ0 /γ0 = 1.

ρ 1 = φ 1 + φ 2 ρ 1 + φ3 ρ 2
ρ 2 = φ 1 ρ 1 + φ2 + φ3 ρ 1
ρ 3 = φ1 ρ 2 + φ2 ρ 1 + φ3

Following some tedious algebra, the solution to this system is

ρ1 = (φ1 + φ2 φ3 ) / (1 − φ2 − φ1 φ3 − φ3² )
ρ2 = (φ2 + φ1² + φ3 φ1 − φ2² ) / (1 − φ2 − φ1 φ3 − φ3² )
ρ3 = (φ3 + φ1³ + φ1² φ3 + φ1 φ2² + 2φ1 φ2 + φ2² φ3 − φ2 φ3 − φ1 φ3² − φ3³ ) / (1 − φ2 − φ1 φ3 − φ3² ).

Finally, the unconditional variance can be computed using the first three autocorrela-
tions,

γ0 = φ1 γ1 + φ2 γ2 + φ3 γ3 + σ²
γ0 − φ1 γ1 − φ2 γ2 − φ3 γ3 = σ²
γ0 (1 − φ1 ρ1 − φ2 ρ2 − φ3 ρ3 ) = σ²
γ0 = σ² / (1 − φ1 ρ1 − φ2 ρ2 − φ3 ρ3 )
γ0 = σ² (1 − φ2 − φ1 φ3 − φ3² ) / [(1 − φ2 − φ3 − φ1 )(1 + φ2 + φ3 φ1 − φ3² )(1 + φ3 + φ1 − φ2 )]


4.A.2.5 ARMA(1,1)

Deriving the autocovariances and autocorrelations of an ARMA process is slightly more


difficult than for a pure AR or MA process. An ARMA(1,1) is specified as

yt = φ1 yt −1 + θ1 εt −1 + εt
and since P = Q = 1, the Yule-Walker system requires two equations, noting that the third
or higher autocovariance is a trivial function of the first two autocovariances.

E[yt yt ] = E[φ1 yt −1 yt ] + E[θ1 εt −1 yt ] + E[εt yt ] (4.130)


E[yt yt −1 ] = E[φ1 yt −1 yt −1 ] + E[θ1 εt −1 yt −1 ] + E[εt yt −1 ]

The presence of the E[θ1 εt−1 yt ] term in the first equation complicates solving this system
since εt−1 appears in yt directly through θ1 εt−1 and indirectly through φ1 yt−1 . The
non-zero relationships can be determined by recursively substituting yt until it consists of
only εt , εt−1 and yt−2 (since yt−2 is uncorrelated with εt−1 by the WN assumption).

yt = φ1 yt −1 + θ1 εt −1 + εt (4.131)

= φ1 (φ1 yt −2 + θ1 εt −2 + εt −1 ) + θ1 εt −1 + εt
= φ12 yt −2 + φ1 θ1 εt −2 + φ1 εt −1 + θ1 εt −1 + εt
= φ12 yt −2 + φ1 θ1 εt −2 + (φ1 + θ1 )εt −1 + εt

and so E[θ1 εt −1 yt ] = θ1 (φ1 + θ1 )σ2 and the Yule-Walker equations can be expressed using
the population moments and model parameters.

γ0 = φ1 γ1 + θ1 (φ1 + θ1 )σ2 + σ2
γ1 = φ1 γ0 + θ1 σ2

These are two equations in two unknowns which can be solved,

γ0 = φ1 γ1 + θ1 (φ1 + θ1 ) σ² + σ²
   = φ1 (φ1 γ0 + θ1 σ² ) + θ1 (φ1 + θ1 ) σ² + σ²
   = φ1² γ0 + φ1 θ1 σ² + θ1 (φ1 + θ1 ) σ² + σ²
γ0 − φ1² γ0 = σ² (φ1 θ1 + φ1 θ1 + θ1² + 1)
γ0 = σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )

γ1 = φ1 γ0 + θ1 σ²
   = φ1 [σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )] + θ1 σ²
   = φ1 [σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )] + (1 − φ1² ) θ1 σ² / (1 − φ1² )
   = [σ² (φ1 + φ1 θ1² + 2 φ1² θ1 ) + (θ1 − θ1 φ1² ) σ² ] / (1 − φ1² )
   = σ² (φ1 + φ1 θ1² + 2 φ1² θ1 + θ1 − φ1² θ1 ) / (1 − φ1² )
   = σ² (φ1² θ1 + φ1 θ1² + φ1 + θ1 ) / (1 − φ1² )
   = σ² (φ1 + θ1 )(φ1 θ1 + 1) / (1 − φ1² )

and so the 1st autocorrelation is

ρ1 = [σ² (φ1 + θ1 )(φ1 θ1 + 1) / (1 − φ1² )] / [σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )] = (φ1 + θ1 )(φ1 θ1 + 1) / (1 + θ1² + 2 φ1 θ1 ).

Returning to the next Yule-Walker equation,

E[yt yt −2 ] = E[φ1 yt −1 yt −2 ] + E[θ1 εt −1 yt −2 ] + E[εt yt −2 ]

and so γ2 = φ1 γ1 and, dividing both sides by γ0 , ρ2 = φ1 ρ1 . Higher order autocovariances
and autocorrelations follow γs = φ1 γs−1 and ρs = φ1 ρs−1 respectively, and so ρs = φ1^{s−1} ρ1 ,
s ≥ 2.
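
As with the pure AR and MA cases, these expressions are easily verified by simulation (parameter values hypothetical):

import numpy as np

rng = np.random.default_rng(0)
phi1, theta1, T = 0.7, 0.4, 200_000
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi1 * y[t - 1] + theta1 * eps[t - 1] + eps[t]

rho1_theory = (phi1 + theta1) * (phi1 * theta1 + 1) / (1 + theta1 ** 2 + 2 * phi1 * theta1)
print(rho1_theory, np.corrcoef(y[1:], y[:-1])[0, 1])
print(phi1 * rho1_theory, np.corrcoef(y[2:], y[:-2])[0, 1])   # rho_2 = phi_1 * rho_1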

4.A.3 Backward Substitution

Backward substitution is a direct but tedious method to derive the ACF and long run vari-
ance.

4.A.3.1 AR(1)

The AR(1) process,


yt = φ1 yt −1 + εt
is stationary if |φ1 | < 1 and {εt } is white noise. To compute the autocovariances and au-
tocorrelations using backward substitution, yt = φ1 yt −1 + εt must be transformed into a
pure MA process by recursive substitution,

yt = φ1 yt −1 + εt (4.132)
= φ1 (φ1 yt −2 + εt −1 ) + εt
= φ12 yt −2 + φ1 εt −1 + εt
= φ12 (φ1 yt −3 + εt −2 ) + φ1 εt −1 + εt
= φ13 yt −3 + φ12 εt −2 + φ1 εt −1 + εt
= εt + φ1 εt −1 + φ12 εt −2 + φ13 εt −3 + . . .
yt = Σ_{i=0}^{∞} φ1^i εt−i .

The variance is the expectation of the square,

γ0 = V[yt ] = E[yt² ]                                                        (4.133)
   = E[(Σ_{i=0}^{∞} φ1^i εt−i )² ]
   = E[(εt + φ1 εt−1 + φ1² εt−2 + φ1³ εt−3 + . . .)² ]
   = E[Σ_{i=0}^{∞} φ1^{2i} ε²t−i + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j εt−i εt−j ]
   = E[Σ_{i=0}^{∞} φ1^{2i} ε²t−i ] + E[Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j εt−i εt−j ]
   = Σ_{i=0}^{∞} φ1^{2i} E[ε²t−i ] + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j E[εt−i εt−j ]
   = Σ_{i=0}^{∞} φ1^{2i} σ² + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j · 0
   = Σ_{i=0}^{∞} φ1^{2i} σ²
   = σ² / (1 − φ1² )

The difficult step in the derivation is splitting up the εt−i into those that are matched to
their own lag (ε²t−i ) and those which are not (εt−i εt−j , i ≠ j). The remainder of the derivation
follows from the assumption that {εt } is a white noise process, and so E[ε²t−i ] = σ² and
E[εt−i εt−j ] = 0, i ≠ j. Finally, the identity that lim(n→∞) Σ_{i=0}^{n} φ1^{2i} = lim(n→∞) Σ_{i=0}^{n} (φ1² )^i = 1/(1 − φ1² )
for |φ1 | < 1 was used to simplify the expression.

The 1st autocovariance can be computed using the same steps on the MA(∞) represen-
tation,

γ1 = E[yt yt−1 ]                                                             (4.134)
   = E[(Σ_{i=0}^{∞} φ1^i εt−i )(Σ_{i=1}^{∞} φ1^{i−1} εt−i )]
   = E[(εt + φ1 εt−1 + φ1² εt−2 + φ1³ εt−3 + . . .)(εt−1 + φ1 εt−2 + φ1² εt−3 + φ1³ εt−4 + . . .)]
   = E[Σ_{i=0}^{∞} φ1^{2i+1} ε²t−1−i + Σ_{i=0}^{∞} Σ_{j=1, j≠i}^{∞} φ1^i φ1^{j−1} εt−i εt−j ]
   = E[φ1 Σ_{i=0}^{∞} φ1^{2i} ε²t−1−i ] + E[Σ_{i=0}^{∞} Σ_{j=1, j≠i}^{∞} φ1^i φ1^{j−1} εt−i εt−j ]
   = φ1 Σ_{i=0}^{∞} φ1^{2i} E[ε²t−1−i ] + Σ_{i=0}^{∞} Σ_{j=1, j≠i}^{∞} φ1^i φ1^{j−1} E[εt−i εt−j ]
   = φ1 Σ_{i=0}^{∞} φ1^{2i} σ² + Σ_{i=0}^{∞} Σ_{j=1, j≠i}^{∞} φ1^i φ1^{j−1} · 0
   = φ1 (Σ_{i=0}^{∞} φ1^{2i} σ² )
   = φ1 σ² / (1 − φ1² )
   = φ1 γ0

and the sth autocovariance can be similarly determined.

γs = E[yt yt−s ]                                                             (4.135)
   = E[(Σ_{i=0}^{∞} φ1^i εt−i )(Σ_{i=s}^{∞} φ1^{i−s} εt−i )]
   = E[Σ_{i=0}^{∞} φ1^{2i+s} ε²t−s−i + Σ_{i=0}^{∞} Σ_{j=s, j≠i}^{∞} φ1^i φ1^{j−s} εt−i εt−j ]
   = E[φ1^s Σ_{i=0}^{∞} φ1^{2i} ε²t−s−i ] + E[Σ_{i=0}^{∞} Σ_{j=s, j≠i}^{∞} φ1^i φ1^{j−s} εt−i εt−j ]
   = φ1^s Σ_{i=0}^{∞} φ1^{2i} σ² + Σ_{i=0}^{∞} Σ_{j=s, j≠i}^{∞} φ1^i φ1^{j−s} · 0
   = φ1^s (Σ_{i=0}^{∞} φ1^{2i} σ² )
   = φ1^s γ0

Finally, the autocorrelations can be computed from ratios of autocovariances, ρ1 = γ1 /γ0 = φ1
and ρs = γs /γ0 = φ1^s .

4.A.3.2 MA(1)

The MA(1) model is the simplest non-degenerate time-series model considered in this
course,
yt = θ1 εt −1 + εt

and the derivation of its autocorrelation function is trivial since no backward substitution
is required. The variance is

γ0 = V[yt ] = E[yt² ]                                                        (4.136)
   = E[(θ1 εt−1 + εt )² ]
   = E[θ1² ε²t−1 + 2 θ1 εt εt−1 + ε²t ]
   = E[θ1² ε²t−1 ] + E[2 θ1 εt εt−1 ] + E[ε²t ]
   = θ1² σ² + 0 + σ²
   = σ² (1 + θ1² )

and the 1st autocovariance is

γ1 = E[yt yt −1 ] (4.137)
= E[(θ1 εt −1 + εt )(θ1 εt −2 + εt −1 )]
= E[θ12 εt −1 εt −2 + θ1 ε2t −1 + θ1 εt εt −2 + εt εt −1 ]
= E[θ12 εt −1 εt −2 ] + E[θ1 ε2t −1 ] + E[θ1 εt εt −2 ] + E[εt εt −1 ]
= 0 + θ1 σ2 + 0 + 0
= θ1 σ2

The 2nd (and higher) autocovariance is

γ2 = E[yt yt −2 ] (4.138)
= E[(θ1 εt −1 + εt )(θ1 εt −3 + εt −2 )]
= E[θ12 εt −1 εt −3 + θ1 εt −1 εt −2 + θ1 εt εt −3 + εt εt −2 ]
= E[θ12 εt −1 εt −3 ] + E[θ1 εt −1 εt −2 ] + E[θ1 εt εt −3 ] + E[εt εt −2 ]
=0+0+0+0
=0

and the autocorrelations are ρ1 = θ1 /(1 + θ12 ), ρs = 0, s ≥ 2.

4.A.3.3 ARMA(1,1)

An ARMA(1,1) process,

yt = φ1 yt −1 + θ1 εt −1 + εt
is stationary if |φ1 | < 1 and {εt } is white noise. The derivation of the variance and autocovariances
is more tedious than for the AR(1) process. It should be noted that the derivation is
longer and more complex than solving the Yule-Walker equations.
Begin by computing the MA(∞) representation,

yt = φ1 yt−1 + θ1 εt−1 + εt                                                  (4.139)
yt = φ1 (φ1 yt−2 + θ1 εt−2 + εt−1 ) + θ1 εt−1 + εt
yt = φ1² yt−2 + φ1 θ1 εt−2 + φ1 εt−1 + θ1 εt−1 + εt
yt = φ1² (φ1 yt−3 + θ1 εt−3 + εt−2 ) + φ1 θ1 εt−2 + (φ1 + θ1 ) εt−1 + εt
yt = φ1³ yt−3 + φ1² θ1 εt−3 + φ1² εt−2 + φ1 θ1 εt−2 + (φ1 + θ1 ) εt−1 + εt
yt = φ1³ (φ1 yt−4 + θ1 εt−4 + εt−3 ) + φ1² θ1 εt−3 + φ1 (φ1 + θ1 ) εt−2 + (φ1 + θ1 ) εt−1 + εt
yt = φ1⁴ yt−4 + φ1³ θ1 εt−4 + φ1³ εt−3 + φ1² θ1 εt−3 + φ1 (φ1 + θ1 ) εt−2 + (φ1 + θ1 ) εt−1 + εt
yt = φ1⁴ yt−4 + φ1³ θ1 εt−4 + φ1² (φ1 + θ1 ) εt−3 + φ1 (φ1 + θ1 ) εt−2 + (φ1 + θ1 ) εt−1 + εt
yt = εt + (φ1 + θ1 ) εt−1 + φ1 (φ1 + θ1 ) εt−2 + φ1² (φ1 + θ1 ) εt−3 + . . .
yt = εt + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i

The primary issue is that the backward substitution form, unlike in the AR(1) case, is not
completely symmetric. Specifically, εt has a different weight than the other shocks and
does not follow the same pattern.

γ0 = V[yt ] = E[yt² ]                                                        (4.140)
   = E[(εt + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )² ]
   = E[(εt + (φ1 + θ1 ) εt−1 + φ1 (φ1 + θ1 ) εt−2 + φ1² (φ1 + θ1 ) εt−3 + . . .)² ]
   = E[ε²t + 2 εt Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i + (Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )² ]
   = E[ε²t ] + E[2 εt Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i ] + E[(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )² ]
   = σ² + 0 + E[(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )² ]
   = σ² + E[Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² ε²t−1−i + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j (φ1 + θ1 )² εt−1−i εt−1−j ]
   = σ² + Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² E[ε²t−1−i ] + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j (φ1 + θ1 )² E[εt−1−i εt−1−j ]
   = σ² + Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² σ² + Σ_{i=0}^{∞} Σ_{j=0, j≠i}^{∞} φ1^i φ1^j (φ1 + θ1 )² · 0
   = σ² + Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² σ²
   = σ² + (φ1 + θ1 )² σ² / (1 − φ1² )
   = σ² [1 − φ1² + (φ1 + θ1 )² ] / (1 − φ1² )
   = σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )

The difficult step in this derivation is in aligning the εt−i since {εt } is a white noise
process. The autocovariance derivation is fairly involved (and presented in full detail).

γ1 = E[yt yt−1 ]                                                             (4.141)
   = E[(εt + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )(εt−1 + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−2−i )]
   = E[(εt + (φ1 + θ1 ) εt−1 + φ1 (φ1 + θ1 ) εt−2 + φ1² (φ1 + θ1 ) εt−3 + . . .) ×
       (εt−1 + (φ1 + θ1 ) εt−2 + φ1 (φ1 + θ1 ) εt−3 + φ1² (φ1 + θ1 ) εt−4 + . . .)]
   = E[εt εt−1 + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt εt−2−i + Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1 εt−1−i
       + (Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−2−i )]
   = E[εt εt−1 ] + E[Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt εt−2−i ] + E[Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1 εt−1−i ]
       + E[(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−2−i )]
   = 0 + 0 + (φ1 + θ1 ) σ² + E[(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1−i )(Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−2−i )]
   = (φ1 + θ1 ) σ² + E[Σ_{i=0}^{∞} φ1^{2i+1} (φ1 + θ1 )² ε²t−2−i + Σ_{i=0}^{∞} Σ_{j=0, i≠j+1}^{∞} φ1^i φ1^j (φ1 + θ1 )² εt−1−i εt−2−j ]
   = (φ1 + θ1 ) σ² + E[φ1 Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² ε²t−2−i ] + E[Σ_{i=0}^{∞} Σ_{j=0, i≠j+1}^{∞} φ1^i φ1^j (φ1 + θ1 )² εt−1−i εt−2−j ]
   = (φ1 + θ1 ) σ² + φ1 Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² E[ε²t−2−i ] + 0
   = (φ1 + θ1 ) σ² + φ1 Σ_{i=0}^{∞} φ1^{2i} (φ1 + θ1 )² σ²
   = (φ1 + θ1 ) σ² + φ1 (φ1 + θ1 )² σ² / (1 − φ1² )
   = σ² [(1 − φ1² )(φ1 + θ1 ) + φ1 (φ1 + θ1 )² ] / (1 − φ1² )
   = σ² (φ1 + θ1 − φ1³ − φ1² θ1 + φ1³ + 2 φ1² θ1 + φ1 θ1² ) / (1 − φ1² )
   = σ² (φ1 + θ1 + φ1² θ1 + φ1 θ1² ) / (1 − φ1² )
   = σ² (φ1 + θ1 )(φ1 θ1 + 1) / (1 − φ1² )

The most difficult step in this derivation is in showing that E[Σ_{i=0}^{∞} φ1^i (φ1 + θ1 ) εt−1 εt−1−i ] =
σ² (φ1 + θ1 ) since there is only one εt−1−i which is aligned to εt−1 (i.e. when i = 0), and so the
autocorrelations may be derived,
autocorrelations may be derived,
ρ1 = [σ² (φ1 + θ1 )(φ1 θ1 + 1) / (1 − φ1² )] / [σ² (1 + θ1² + 2 φ1 θ1 ) / (1 − φ1² )]            (4.142)
   = (φ1 + θ1 )(φ1 θ1 + 1) / (1 + θ1² + 2 φ1 θ1 )

and the remaining autocorrelations can be computed using the recursion, ρs = φ1 ρs −1 ,


s ≥ 2.

Exercises
Exercise 4.1. Answer the following questions:

i. Under what conditions on the parameters and errors are the following processes co-
variance stationary?

(a) yt = φ0 + εt
(b) yt = φ0 + φ1 yt −1 + εt
(c) yt = φ0 + θ1 εt −1 + εt
(d) yt = φ0 + φ1 yt −1 + φ2 yt −2 + εt
(e) yt = φ0 + φ2 yt −2 + εt
(f) yt = φ0 + φ1 yt −1 + θ1 εt −1 + εt

ii. Is the sum of two white noise processes, νt = εt + ηt , necessarily a white noise pro-
cess? If so, verify that the properties of a white noise are satisfied. If not, show why and
describe any further assumptions required for the sum to be a white noise process.

Exercise 4.2. Consider an AR(1)

yt = φ0 + φ1 yt −1 + εt

i. What is a minimal set of assumptions sufficient to ensure {yt } is covariance stationary


if {εt } is an i.i.d. sequence?

ii. What are the values of the following quantities?

(a) E[yt +1 ]
(b) E t [yt +1 ]
(c) V[yt +1 ]
(d) Vt [yt +1 ]
(e) ρ−1
(f) ρ2

Exercise 4.3. Consider an MA(1)

yt = φ0 + θ1 εt −1 + εt

i. What is a minimal set of assumptions sufficient to ensure {yt } is covariance stationary


if {εt } is an i.i.d. sequence?

ii. What are the values of the following quantities?



(a) E[yt +1 ]
(b) E t [yt +1 ]
(c) V[yt +1 ]
(d) Vt [yt +1 ]
(e) ρ−1
(f ) ρ2

iii. Suppose you were trying to differentiate between an AR(1) and an MA(1) but could
not estimate any regressions. What would you do?

Exercise 4.4. Consider an MA(2)

yt = µ + θ1 εt −1 + θ2 εt −2 + εt

i. What is a minimal set of assumptions sufficient to ensure {yt } is covariance stationary


if {εt } is an i.i.d. sequence?

ii. What are the values of the following quantities?

(a) E[yt +1 ]
(b) E t [yt +1 ]
(c) V[yt +1 ]
(d) Vt [yt +1 ]
(e) ρ−1
(f) ρ2
(g) ρ3

Exercise 4.5. Answer the following questions:

i. For each of the following processes, find E t [yt +1 ]. You can assume {εt } is a mean zero
i.i.d. sequence.

(a) yt = φ0 + φ1 yt −1 + εt
(b) yt = φ0 + θ1 εt −1 + εt
(c) yt = φ0 + φ1 yt −1 + φ2 yt −2 + εt
(d) yt = φ0 + φ2 yt −2 + εt
(e) yt = φ0 + φ1 yt −1 + θ1 εt −1 + εt

ii. For (a), (c) and (e), derive the h -step ahead forecast, E t [yt +h ]. What is the long run
behavior of the forecast in each case?

iii. The forecast error variance is defined as E[(yt +h − E t [yt +h ])2 ]. Find an explicit expres-
sion for the forecast error variance for (a) and (c).

Exercise 4.6. Answer the following questions:

i. What are the characteristic equations for the following systems?

(a) yt = 1 + .6yt −1 + x t
(b) yt = 1 + .8yt −2 + x t
(c) yt = 1 + .6yt −1 + .3yt −2 + x t
(d) yt = 1 + 1.2yt −1 + .2yt −2 + x t
(e) yt = 1 + 1.4yt −1 + .24yt −2 + x t
(f) yt = 1 − .8yt −1 + .2yt −2 + x t

ii. Compute the roots of each characteristic equation. Which are convergent? Which are
explosive? Are any stable or metastable?

Exercise 4.7. Suppose that yt follows a random walk so that ∆yt = yt − yt−1 is stationary.

i. Is yt − yt−j for any j ≥ 2 stationary?

ii. If it is and {εt } is an i.i.d. sequence of standard normals, what is the distribution of
yt − yt − j ?

iii. What is the joint distribution of yt − yt − j and yt −h − yt − j −h (Note: The derivation for
an arbitrary h is challenging)?
Note: If it helps in this problem, consider the case where j = 2 and h = 1.

Exercise 4.8. Outline the steps needed to perform a unit root test on a time series of FX
rates. Be sure to detail any important considerations that may affect the test.

Exercise 4.9. Answer the following questions:

i. How are the autocorrelations and partial autocorrelations useful in building a model?

ii. Suppose you observe the three sets of ACF/PACF in figure 4.15. What ARMA specifica-
tion would you expect in each case. Note: Dashed line indicates the 95% confidence
interval for a test that the autocorrelation or partial autocorrelation is 0.

iii. Describe the three methods of model selection discussed in class: general-to-specific,
specific-to-general and the use of information criteria (Schwartz/Bayesian Informa-
tion Criteria and/or Akaike Information Criteria). When might each be preferred to
the others?

Autocorrelation and Partial Autocorrelation function

[Figure 4.15 appears here: three rows of panels, (a), (b) and (c), each showing the ACF (left) and PACF (right) at lags 1 through 12.]

Figure 4.15: The ACF and PACF of three stochastic processes. Use these to answer question
4.9.

iv. Describe the Wald, Lagrange Multiplier (Score) and Likelihood ratio tests. What as-
pect of a model does each test? What are the strengths and weaknesses of each?

Exercise 4.10. Answer the following questions about forecast errors.

i. Let yt = φ0 + φ1 yt −1 + εt with the usual assumptions on {εt }. Derive an explicit


expression for the 1-step and 2-step ahead forecast errors, e t +h |t = yt +h − ŷt +h |t where
ŷt +h |t is the MSE optimal forecast where h = 1 or h = 2 (what is the MSE optimal
forecast?).

ii. What is the autocorrelation function of a time-series of forecast errors {e t +h |t }, h = 1


or h = 2. (Hint: Use the formula you derived above)

iii. Can you generalize the above to a generic h ? (In other words, leave the solution as a
function of h ).

iv. How could you test whether the forecast has excess dependence using an ARMA model?

Exercise 4.11. Answer the following questions.

i. Outline the steps needed to determine whether a time series {yt } contains a unit root.
Be certain to discuss the important considerations at each step, if any.

ii. If yt follows a pure random walk driven by white noise innovations then ∆yt = yt − yt−1
is stationary.

(a) Is yt − yt−j for any j ≥ 2 stationary?


(b) If it is and {εt } is an i.i.d. sequence of standard normals, what is the distribution
of yt − yt − j ?
(c) What is the joint distribution of yt − yt − j and yt −h − yt − j −h ?

iii. Let yt = φ0 + φ1 yt −1 + εt where {εt } is a WN process.

(a) Derive an explicit expression for the 1-step and 2-step ahead forecast errors,
e t +h |t = yt +h − ŷt +h |t where ŷt +h |t is the MSE optimal forecast where h = 1 or
h = 2.
(b) What is the autocorrelation function of a time-series of forecast errors {e t +h |t }
for h = 1 and h = 2?
(c) Generalize the above to a generic h ? (In other words, leave the solution as a func-
tion of h ).
(d) How could you test whether the forecast has excess dependence using an ARMA
model?

Exercise 4.12. Suppose


yt = φ0 + φ1 yt −1 + θ1 εt −1 + εt
where {εt } is a white noise process.

i. Precisely describe the two types of stationarity.

ii. Why is stationarity a useful property?

iii. What conditions on the model parameters are needed for {yt } to be covariance sta-
tionary?

iv. Describe the Box-Jenkins methodology for model selection.


Now suppose that φ1 = 1 and that εt is homoskedastic.

v. What is E t [yt +1 ]?

vi. What is E t [yt +2 ]?

vii. What can you say about E t [yt +h ] for h > 2?

viii. What is Vt [yt +1 ]?

ix. What is Vt [yt +2 ]?

x. What is the first autocorrelation, ρ1 ?


Chapter 5

Analysis of Multiple Time Series

Note: The primary references for these notes are chapters 5 and 6 in Enders (2004). An alter-
native, but more technical treatment can be found in chapters 10-11 and 18-19 in Hamilton
(1994).
Multivariate time-series analysis extends many of the ideas of univariate time-
series analysis to systems of equations. The primary model in multivariate time-
series analysis is the vector autoregression (VAR), a direct and natural extension
of the univariate autoregression. Most results that apply to univariate time-series
can be directly ported to multivariate time-series with a slight change in notation
and the use of linear algebra. The chapter examines both stationary and nonsta-
tionary vector processes through VAR analysis, cointegration and spurious regres-
sion. This chapter discusses properties of vector time-series models, estimation
and identification as well as Granger causality and Impulse Response Functions.
The chapter concludes by examining the contemporaneous relationship between
two or more time-series in the framework of cointegration, spurious regression
and cross-sectional regression of stationary time-series.

In many situations, analyzing a time-series in isolation is reasonable; in other cases uni-


variate analysis may be limiting. For example, Campbell (1996) links financially interesting
variables, including stock returns and the default premium, in a multivariate system that
allows shocks to one variable to propagate to the others. The vector autoregression is the
mechanism that is used to link multiple stationary time-series variables together. When
variables contain unit roots, a different type of analysis, cointegration, is needed. This
chapter covers these two topics building on many results from the analysis of univariate
time-series.

5.1 Vector Autoregressions


Vector autoregressions are remarkably similar to univariate autoregressions; so similar that
the intuition behind most results carries over by simply replacing scalars with matrices and

scalar operations with matrix operations.

5.1.1 Definition
The definition of a vector autoregression is nearly identical to that of a univariate autore-
gression.

Definition 5.1 (Vector Autoregression of Order P). A Pth order vector autoregression , writ-
ten VAR(P), is a process that evolves according to

yt = Φ0 + Φ1 yt −1 + Φ2 yt −2 + . . . + ΦP yt −P + εt (5.1)

where yt is a k by 1 vector stochastic process, Φ0 is a k by 1 vector of intercept parameters,


Φ j , j = 1, . . . , P are k by k parameter matrices and εt is a vector white noise process with
the additional assumption that E t −1 [εt ] = 0.

Simply replacing the vectors and matrices with scalars will produce the definition of an
AR(P). A vector white noise process has the same useful properties as a univariate white
noise process; it is mean zero, has finite covariance and is uncorrelated with its past al-
though the elements of a vector white noise process are not required to be contemporane-
ously uncorrelated.

Definition 5.2 (Vector White Noise Process). A k by 1 vector valued stochastic process,
{εt }is said to be a vector white noise if

E[εt ] = 0k                                                                  (5.2)
E[εt ε′t−s ] = 0k×k ,   s ≠ 0
E[εt ε′t ] = Σ

where Σ is a finite positive definite matrix.

The simplest VAR is a first-order bivariate specification which can be equivalently ex-
pressed as

yt = Φ0 + Φ1 yt−1 + εt ,

[ y1,t ]   [ φ1,0 ]   [ φ11,1  φ12,1 ] [ y1,t−1 ]   [ ε1,t ]
[ y2,t ] = [ φ2,0 ] + [ φ21,1  φ22,1 ] [ y2,t−1 ] + [ ε2,t ] ,

y1,t = φ1,0 + φ11,1 y1,t −1 + φ12,1 y2,t −1 + ε1,t


y2,t = φ2,0 + φ21,1 y1,t −1 + φ22,1 y2,t −1 + ε2,t .

It is clear that each element of yt is a function of each element of yt −1 , although certain


parameterizations of Φ1 may remove the dependence. Treated as individual time-series,
deriving the properties of VARs is an exercise in tedium. However, a few tools from linear
algebra make working with VARs hardly more difficult than autoregressions.

5.1.2 Properties of a VAR(1)

The properties of the VAR(1) are fairly simple to study. More importantly, section 5.2 shows
that all VAR(P)s can be rewritten as a VAR(1), and so the general case requires no additional
effort than the first order VAR.

5.1.2.1 Stationarity

A VAR(1), driven by vector white noise shocks,

yt = Φ0 + Φ1 yt −1 + εt

is covariance stationary if the eigenvalues of Φ1 are less than 1 in modulus.1 In the univari-
ate case, this is equivalent to the condition |φ1 | < 1. Assuming the eigenvalues of Φ1 are
less than one in absolute value, backward substitution can be used to show that

yt = Σ_{i=0}^{∞} Φ1^i Φ0 + Σ_{i=0}^{∞} Φ1^i εt−i                             (5.3)

which, applying Theorem 5.4, is equivalent to



yt = (Ik − Φ1 )^{-1} Φ0 + Σ_{i=0}^{∞} Φ1^i εt−i                              (5.4)

where the eigenvalue condition ensures that Φi1 will converge to zero as i grows large.
1
The definition of an eigenvalue is:

Definition 5.3 (Eigenvalue). λ is an eigenvalue of a square matrix A if and only if |A − λIn | = 0 where | · |
denotes determinant.

The crucial properties of eigenvalues for applications to VARs are given in the following theorem:

Theorem 5.4 (Matrix Power). Let A be an n by n matrix. Then the following statements are equivalent
• A^m → 0 as m → ∞.
• All eigenvalues of A, λi , i = 1, 2, . . . , n, are less than 1 in modulus (|λi | < 1).
• The series Σ_{i=0}^{m} A^i = In + A + A² + . . . + A^m → (In − A)^{-1} as m → ∞.

Note: Replacing A with a scalar a produces many familiar results: a^m → 0 as m → ∞ (property 1) and
Σ_{i=0}^{m} a^i → (1 − a)^{-1} as m → ∞ (property 3) as long as |a| < 1 (property 2).

5.1.2.2 Mean

Taking expectations of yt using the backward substitution form yields

E[yt ] = E[(Ik − Φ1 )^{-1} Φ0 ] + E[Σ_{i=0}^{∞} Φ1^i εt−i ]                  (5.5)
       = (Ik − Φ1 )^{-1} Φ0 + Σ_{i=0}^{∞} Φ1^i E[εt−i ]
       = (Ik − Φ1 )^{-1} Φ0 + Σ_{i=0}^{∞} Φ1^i 0
       = (Ik − Φ1 )^{-1} Φ0

This result is similar to that of a univariate AR(1) which has a mean of (1 − φ1 )−1 φ0 . The
eigenvalues play an important role in determining the mean. If an eigenvalue of Φ1 is close
to one, (Ik − Φ1 )−1 will contain large values and the unconditional mean will be large. Sim-
ilarly, if Φ1 = 0, then the mean is Φ0 since {yt } is composed of white noise and a constant.

5.1.2.3 Variance

Before deriving the variance of a VAR(1), it is often useful to express a VAR in deviation form.
Define µ = E[yt ] to be the unconditional expectation of y (and assume it is finite). The
deviations form of a VAR(P)

yt = Φ0 + Φ1 yt −1 + Φ2 yt −2 + . . . + ΦP yt −P + εt
is given by

yt − µ = Φ1 (yt −1 − µ) + Φ2 (yt −2 − µ) + . . . + ΦP (yt −P − µ) + εt (5.6)


ỹt = Φ1 ỹt −1 + Φ2 ỹt −2 + . . . + ΦP ỹt −P + εt

and in the case of a VAR(1),


ỹt = Σ_{i=0}^{∞} Φ1^i εt−i                                                   (5.7)

The deviations form is simply a translation of the VAR from its original mean, µ, to a mean
of 0. The advantage of the deviations form is that all dynamics and the shocks are identical,
and so can be used in deriving the long-run covariance, autocovariances and in forecasting.

Using the backward substitution form of a VAR(1), the long run covariance can be derived
as

E[(yt − µ)(yt − µ)′ ] = E[ỹt ỹ′t ] = E[(Σ_{i=0}^{∞} Φ1^i εt−i )(Σ_{i=0}^{∞} Φ1^i εt−i )′ ]       (5.8)
                      = E[Σ_{i=0}^{∞} Φ1^i εt−i ε′t−i (Φ1^i )′ ]   (since εt is WN)
                      = Σ_{i=0}^{∞} Φ1^i E[εt−i ε′t−i ] (Φ1^i )′
                      = Σ_{i=0}^{∞} Φ1^i Σ (Φ1^i )′

vec(E[(yt − µ)(yt − µ)′ ]) = (Ik² − Φ1 ⊗ Φ1 )^{-1} vec(Σ)

 

where µ = (Ik −Φ1 )−1 Φ0 . Compared to the long-run variance of a univariate autoregression,
σ2 /(1 − φ12 ), the similarities are less obvious. The differences arise from the noncommu-
tative nature of matrices (AB 6= BA in general). The final line makes use of the vec (vector)
operator to re-express the covariance. The vec operator and a Kronecker product stack the

elements of a matrix product into a single column.2


Once again the eigenvalues of Φ1 play an important role. If any are close to 1, the variance
will be large since the eigenvalues fundamentally determine the persistence of shocks:
as was the case in scalar autoregressions, higher persistence leads to larger variances.
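
The vec expression translates directly into code. The following sketch (with hypothetical Φ1 and Σ) solves for the long-run covariance and verifies that it satisfies Γ0 = Φ1 Γ0 Φ1′ + Σ:

import numpy as np

Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
k = Phi1.shape[0]

vec_Sigma = Sigma.reshape(-1, 1, order="F")                      # column-stacking (vec)
vec_Gamma0 = np.linalg.solve(np.eye(k * k) - np.kron(Phi1, Phi1), vec_Sigma)
Gamma0 = vec_Gamma0.reshape(k, k, order="F")

print(Gamma0)
print(np.allclose(Gamma0, Phi1 @ Gamma0 @ Phi1.T + Sigma))       # long-run covariance check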

5.1.2.4 Autocovariance

The autocovariances of a vector valued stochastic process are defined

Definition 5.8 (Autocovariance). The autocovariance matrices of k by 1 valued vector co-


variance stationary stochastic process {yt } are defined

Γ s = E[(yt − µ)(yt −s − µ)0 ] (5.10)

and
Γ −s = E[(yt − µ)(yt +s − µ)0 ] (5.11)
where µ = E[yt ] = E[yt − j ] = E[yt + j ].

These present the first significant deviation from the univariate time-series analysis in
chapter 4. Instead of being symmetric around t , they are symmetric in their transpose.
Specifically,
2
The vec of a matrix A is defined:

Definition 5.5 (vec). Let A = [a i j ] be an m by n matrix. The vec operator (also known as the stack operator)
is defined

vec A = [a′1 a′2 . . . a′n ]′                                                (5.9)
where a j is the jth column of the matrix A.

The Kronecker Product is defined:

Definition 5.6 (Kronecker Product). Let A = [a i j ] be an m by n matrix, and let B = [bi j ] be a k by l matrix.
The Kronecker product is defined
 
a 11 B a 12 B . . . a 1n B
 a 21 B a 22 B . . . a 2n B 
A⊗B= .. .. .. ..
 

 . . . . 
am 1B am 2B . . . am n B

and has dimension mk by nl .

It can be shown that:

Theorem 5.7 (Kronecker and vec of a product). Let A, B and C be conformable matrices as needed. Then

vec(ABC) = (C′ ⊗ A) vec B




Γs ≠ Γ−s

but it is the case that3

Γs = Γ′−s .
In contrast, the autocovariances of stationary scalar processes satisfy γs = γ−s . Computing
the autocovariances is also easily accomplished using the backward substitution form,

Γs = E[(yt − µ)(yt−s − µ)′ ] = E[(Σ_{i=0}^{∞} Φ1^i εt−i )(Σ_{i=0}^{∞} Φ1^i εt−s−i )′ ]           (5.12)
   = E[(Σ_{i=0}^{s−1} Φ1^i εt−i )(Σ_{i=0}^{∞} Φ1^i εt−s−i )′ ]
     + E[(Φ1^s Σ_{i=0}^{∞} Φ1^i εt−s−i )(Σ_{i=0}^{∞} Φ1^i εt−s−i )′ ]                            (5.13)
   = 0 + Φ1^s E[(Σ_{i=0}^{∞} Φ1^i εt−s−i )(Σ_{i=0}^{∞} Φ1^i εt−s−i )′ ]
   = Φ1^s V[yt ]

and

Γ−s = E[(yt − µ)(yt+s − µ)′ ] = E[(Σ_{i=0}^{∞} Φ1^i εt−i )(Σ_{i=0}^{∞} Φ1^i εt+s−i )′ ]          (5.14)
    = E[(Σ_{i=0}^{∞} Φ1^i εt−i )(Φ1^s Σ_{i=0}^{∞} Φ1^i εt−i )′ ]
      + E[(Σ_{i=0}^{∞} Φ1^i εt−i )(Σ_{i=0}^{s−1} Φ1^i εt+s−i )′ ]                                (5.15)
    = E[(Σ_{i=0}^{∞} Φ1^i εt−i ε′t−i (Φ1^i )′ )(Φ1^s )′ ] + 0
    = E[Σ_{i=0}^{∞} Φ1^i εt−i ε′t−i (Φ1^i )′ ] (Φ1^s )′
    = V[yt ] (Φ1^s )′

3 This follows directly from the property of a transpose that if A and B are compatible matrices, (AB)′ = B′ A′.

where V[yt ] is the symmetric covariance matrix of the VAR. Like most properties of a VAR,
this result is similar to the autocovariance function of an AR(1): γs = φ1s σ2 /(1 − φ12 ) =
φ1s V[yt ].

5.2 Companion Form


Once the properties of a VAR(1) have been studied, one surprising and useful result is that
any stationary VAR(P) can be rewritten as a VAR(1). Suppose {yt } follows a VAR(P) process,

yt = Φ0 + Φ1 yt −1 + Φ2 yt −2 + . . . + ΦP yt −P + εt .
By subtracting the mean and stacking P of yt into a large column vector denoted zt , a
VAR(P) can be transformed into a VAR(1) by constructing the companion form.
Definition 5.9 (Companion Form of a VAR(P)). Let yt follow a VAR(P) given by

yt = Φ0 + Φ1 yt −1 + Φ2 yt −2 + . . . + ΦP yt −P + εt
where εt is a vector white noise process and µ = (I − Σ_{p=1}^{P} Φp )^{-1} Φ0 = E[yt ] is finite. The
companion form is given by

zt = Υ zt−1 + ξt                                                             (5.16)

where

zt = [(yt − µ)′  (yt−1 − µ)′  . . .  (yt−P+1 − µ)′ ]′ ,                       (5.17)

    [ Φ1  Φ2  Φ3  · · ·  ΦP−1  ΦP ]
    [ Ik  0   0   · · ·  0     0  ]
Υ = [ 0   Ik  0   · · ·  0     0  ]                                          (5.18)
    [ ⋮   ⋮   ⋮     ⋱    ⋮     ⋮  ]
    [ 0   0   0   · · ·  Ik    0  ]

and

ξt = [ε′t  0′  . . .  0′ ]′ .                                                (5.19)
This is known as the companion form and allows the statistical properties of any VAR(P)
to be directly computed using only the results of a VAR(1) noting that

             [ Σ  0  · · ·  0 ]
E[ξt ξ′t ] = [ 0  0  · · ·  0 ]
             [ ⋮  ⋮    ⋱    ⋮ ]
             [ 0  0  · · ·  0 ] .

Using this form, it can be determined that a VAR(P) is covariance stationary if all of the
eigenvalues of Υ - there are k P of them - are less than one in absolute value (modulus if
complex).4
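
A short helper (hypothetical, not from the notes) builds the companion matrix and checks the kP eigenvalues of a VAR(2):

import numpy as np

def companion(Phi_list):
    # Stack [Phi_1 ... Phi_P] on top of [I, 0] blocks to form Upsilon
    k, P = Phi_list[0].shape[0], len(Phi_list)
    top = np.hstack(Phi_list)
    if P == 1:
        return top
    bottom = np.hstack([np.eye(k * (P - 1)), np.zeros((k * (P - 1), k))])
    return np.vstack([top, bottom])

Phi1 = np.array([[0.4, 0.1], [0.0, 0.3]])
Phi2 = np.array([[0.2, 0.0], [0.1, 0.2]])
Upsilon = companion([Phi1, Phi2])
eig = np.linalg.eigvals(Upsilon)
print(np.abs(eig))                           # kP = 4 eigenvalues
print(np.all(np.abs(eig) < 1))               # True => this VAR(2) is covariance stationary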

5.3 Empirical Examples

Throughout this chapter two examples from the finance literature will be used.

5.3.1 Example: The interaction of stock and bond returns

Stocks and long term bonds are often thought to hedge one another. VARs provide a simple
method to determine whether their returns are linked through time. Consider the VAR(1)

[ VWMt  ]   [ φ01 ]   [ φ11,1  φ12,1 ] [ VWMt−1  ]   [ ε1,t ]
[ 10YRt ] = [ φ02 ] + [ φ21,1  φ22,1 ] [ 10YRt−1 ] + [ ε2,t ]

which implies a model for stock returns

V W M t = φ01 + φ11,1 V W M t −1 + φ12,1 10Y R t −1 + ε1,t

and a model for long bond returns

10Y R t = φ01 + φ21,1 V W M t −1 + φ22,1 10Y R t −1 + ε2,t .

Since these models do not share any parameters, they can be estimated separately using
OLS. Using annualized return data for the VWM from CRSP and the 10-year constant matu-
rity treasury yield from FRED covering the period May 1953 until December 2008, a VAR(1)
was estimated.5

4
Companion form is also useful when working with univariate AR(P) models. An AR(P) can be reexpressed
using its companion VAR(1) which allows properties such as the long-run variance and autocovariances to
be easily computed.
5
The yield is first converted to prices and then returns are computed as the log difference in consecutive
prices.

[ VWMt  ]   [ 9.733 ]   [  0.097   0.301 ] [ VWMt−1  ]   [ ε1,t ]
[ 10YRt ] = [ 1.058 ] + [ −0.095   0.299 ] [ 10YRt−1 ] + [ ε2,t ]
             (0.000)      (0.104) (0.000)
             (0.000)      (0.000) (0.000)

where the p-values are reported in parentheses below the corresponding coefficients. A few things are worth noting.
Stock returns are not predictable with their own lag but do appear to be predictable using
lagged bond returns: positive bond returns lead to positive future returns in stocks. In
contrast, positive returns in equities result in negative returns for future bond holdings.
The long-run mean can be computed as
" # " #!−1 " # " #
1 0 0.097 0.301 9.733 10.795
− = .
0 1 −0.095 0.299 1.058 0.046

These values are similar to the sample means of 10.801 and 0.056.

5.3.2 Example: Campbell’s VAR


Campbell (1996) builds a theoretical model for asset prices where economically meaning-
ful variables evolve according to a VAR. Campbell’s model included stock returns, real la-
bor income growth, the term premium, the relative t-bill rate and the dividend yield. The
VWM series from CRSP is used for equity returns. Real labor income is computed as the
log change in income from labor minus the log change in core inflation and both series
are from FRED. The term premium is the difference between the yield on a 10-year con-
stant maturity bond and the 3-month t-bill rate. Both series are from FRED. The relative
t-bill rate is the current yield on a 1-month t-bill minus the average yield over the past 12
months and the data is available on Ken French’s web site. The dividend yield was com-
puted as the difference in returns on the VWM with and without dividends; both series are
available from CRSP.
Using a VAR(1) specification, the model can be described

[ VWMt  ]             [ VWMt−1  ]   [ ε1,t ]
[ LBRt  ]             [ LBRt−1  ]   [ ε2,t ]
[ RTBt  ] = Φ0 + Φ1   [ RTBt−1  ] + [ ε3,t ] .
[ TERMt ]             [ TERMt−1 ]   [ ε4,t ]
[ DIVt  ]             [ DIVt−1  ]   [ ε5,t ]
Two sets of parameters are presented in table 5.1. The top panel contains estimates using
non-scaled data. This produces some very large (in magnitude, not statistical significance)
estimates which are the result of two variables having very different scales. The bottom
panel contains estimates from data which have been standardized by dividing each series
by its standard deviation. This makes the magnitude of all coefficients approximately com-

Raw Data
V W M t −1 L B R t −1 RT Bt −1 T E R M t −1 D I Vt −1
V W Mt 0.073 0.668 −0.050 −0.000 0.183
(0.155) (0.001) (0.061) (0.844) (0.845)
L B Rt 0.002 −0.164 0.002 0.000 −0.060
(0.717) (0.115) (0.606) (0.139) (0.701)
RT Bt 0.130 0.010 0.703 −0.010 0.137
(0.106) (0.974) (0.000) (0.002) (0.938)
T E R Mt −0.824 −2.888 0.069 0.960 4.028
(0.084) (0.143) (0.803) (0.000) (0.660)
D I Vt 0.001 −0.000 −0.001 −0.000 −0.045
(0.612) (0.989) (0.392) (0.380) (0.108)

Standardized Series
V W M t −1 L B R t −1 RT Bt −1 T E R M t −1 D I Vt −1
V W Mt 0.073 0.112 −0.113 −0.011 0.007
(0.155) (0.001) (0.061) (0.844) (0.845)
L B Rt 0.012 −0.164 0.027 0.065 −0.013
(0.717) (0.115) (0.606) (0.139) (0.701)
RT Bt 0.057 0.001 0.703 −0.119 0.002
(0.106) (0.974) (0.000) (0.002) (0.938)
T E R Mt −0.029 −0.017 0.006 0.960 0.005
(0.084) (0.143) (0.803) (0.000) (0.660)
D I Vt 0.024 −0.000 −0.043 −0.043 −0.045
(0.612) (0.989) (0.392) (0.380) (0.108)

Table 5.1: Parameter estimates from Campbell’s VAR. The top panel contains estimates us-
ing unscaled data while the bottom panel contains estimates from data which have been
standardized to have unit variance. While the magnitudes of many coefficients change, the
p-vals and the eigenvalues of these two parameter matrices are identical, and the parame-
ters are roughly comparable since the series have the same variance.

parable. Despite this transformation and very different parameter estimates, the p-vals re-
main unchanged. This shouldn’t be surprising since OLS t -stats are invariant to scalings
of this type. One less obvious feature of the two sets of estimates is that the eigenvalues of
the two parameter matrices are identical and so both sets of parameter estimates indicate
the same persistence.

5.4 VAR forecasting

Once again, the behavior of a VAR(P) is identical to that of an AR(P). Recall that the h -step
ahead forecast, ŷt +h |t from an AR(1) is given by

Et [yt+h ] = Σ_{j=0}^{h−1} φ1^j φ0 + φ1^h yt .

The h-step ahead forecast of a VAR(1), ŷt+h|t , is

Et [yt+h ] = Σ_{j=0}^{h−1} Φ1^j Φ0 + Φ1^h yt

Forecasts from higher order VARs can be constructed by direct forward recursion beginning
at h = 1, although they are often simpler to compute using the deviations form of the VAR
since it includes no intercept,

ỹt = Φ1 ỹt −1 + Φ2 ỹt −2 + . . . + ΦP ỹt −P + εt .


Using the deviations form h -step ahead forecasts from a VAR(P) can be computed using
the recurrence

E t [ỹt +h ] = Φ1 E t [ỹt +h −1 ] + Φ2 E t [ỹt +h −2 ] + . . . + ΦP E t [ỹt +h −P ].


starting at E t [ỹt +1 ]. Once the forecast of E t [ỹt +h ] has been computed, the h -step ahead
forecast of yt +h is constructed by adding the long run mean, E t [yt +h ] = µ + E t [ỹt +h ].
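
The recursion can be implemented in a few lines; the sketch below (function and variable names hypothetical) produces iterated h-step forecasts from a VAR(P) using the deviations form, illustrated with the stock-bond VAR(1) estimates:

import numpy as np

def var_forecast(y_hist, Phi0, Phi_list, h):
    # y_hist: list of the most recent observations, ordered oldest to newest
    k, P = Phi_list[0].shape[0], len(Phi_list)
    mu = np.linalg.solve(np.eye(k) - sum(Phi_list), Phi0)      # long-run mean
    dev = [np.asarray(y) - mu for y in y_hist[::-1][:P]]       # deviations, newest first
    for _ in range(h):
        new = sum(Phi_list[i] @ dev[i] for i in range(P))
        dev = [new] + dev[:P - 1]
    return mu + dev[0]

Phi0 = np.array([9.733, 1.058])
Phi1 = np.array([[0.097, 0.301], [-0.095, 0.299]])
y_T = [np.array([8.0, 1.0])]                                   # hypothetical last observation
print(var_forecast(y_T, Phi0, [Phi1], h=1))
print(var_forecast(y_T, Phi0, [Phi1], h=60))                   # converges to the long-run mean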

5.4.1 Example: The interaction of stock and bond returns

When two series are related in time, univariate forecasts may not adequately capture the
feedback between the two and are generally misspecified if the series belong in a VAR. To
illustrate the differences, recursively estimated 1-step ahead forecasts were produced from
the stock-bond VAR,
" # " # " #" # " #
V W Mt 9.733 0.097 0.301 V W M t −1 ε1,t
= + +
10Y R t 1.058 −0.095 0.299 10Y R t −1 ε2,t

and a simple AR(1) for each series. The data set contains a total of 620 observations. Be-
ginning at observation 381 and continuing until observation 620, the models (the VAR and
the two ARs) were estimated using an expanding window of data and 1-step ahead forecasts
were computed.6 Figure 5.1 contains a graphical representation of the differences between
the AR(1)s and the VAR(1). The forecasts for the market are substantially different while the
forecasts for the 10-year bond return are not. The changes (or lack thereof) are simply a
function of the model specification: the return on the 10-year bond has predictive power
for both series. The VAR(1) is a better model for stock returns than an AR(1), although it is
not meaningfully better for bond returns.
6
Recursive forecasts computed using an expanding window use data from t = 1 to R to estimate any
model parameters and to produce a forecast of R +1. The sample then grows by one observation and data
from 1 to R + 1 are used to estimate model parameters and a forecast of R + 2 is computed. This pattern
continues until the end of the sample. An alternative is to use rolling windows where both the end point and
the start point move through time so that the distance between the two is constant.
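
As a rough illustration of the expanding-window scheme described in the footnote, the sketch below (NumPy only; the function name is hypothetical and an AR(1) fit by OLS stands in for the general model) produces recursive 1-step ahead forecasts.

    import numpy as np

    def expanding_window_forecasts(y, R):
        """Recursive (expanding window) 1-step ahead AR(1) forecasts.

        y : (T,) series; R : size of the initial estimation window.
        Returns forecasts of y[R], y[R+1], ..., y[T-1].
        """
        forecasts = []
        for end in range(R, len(y)):
            sample = y[:end]                                   # data from 1 to R, R+1, ...
            X = np.column_stack([np.ones(end - 1), sample[:-1]])
            phi0, phi1 = np.linalg.lstsq(X, sample[1:], rcond=None)[0]
            forecasts.append(phi0 + phi1 * sample[-1])         # forecast of the next observation
        return np.array(forecasts)

    # Simulated AR(1) data purely for illustration
    rng = np.random.default_rng(0)
    y = np.zeros(620)
    for t in range(1, 620):
        y[t] = 0.5 * y[t - 1] + rng.standard_normal()
    print(expanding_window_forecasts(y, R=380)[:5])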
[Figure: "The Importance of VARs in Forecasting" – two panels plotting AR(1) and VAR(1)
1-month-ahead forecasts of the VWM returns (top) and of 10-year bond returns (bottom),
1987–2006.]

Figure 5.1: The figure contains 1-step ahead forecasts from a VAR(1) and an AR(1) for both
the value-weighted market returns (top) and the return on a 10-year bond. These two pic-
tures indicate that the return on the long bond has substantial predictive power for equity
returns while the opposite is not true.

5.5 Estimation and Identification

Estimation and identification is the first significant break from directly applying the lessons
learned from univariate modeling to multivariate models. In addition to the autocorrelation
function (ACF) and the partial autocorrelation function (PACF), vector stochastic processes
also have cross-correlation functions (CCFs) and partial cross-correlation functions (PCCFs).

Definition 5.10 (Cross-correlation). The sth cross correlations between two covariance sta-
tionary series {x t } and {yt } are defined

ρ_{xy,s} = E[(x_t − µ_x)(y_{t−s} − µ_y)] / √(V[x_t] V[y_t])                (5.20)

and

ρ_{yx,s} = E[(y_t − µ_y)(x_{t−s} − µ_x)] / √(V[x_t] V[y_t])                (5.21)

where the order of the indices indicates which variable is measured using contemporane-
ous values and which variable is lagged, E[yt ] = µ y and E[x t ] = µ x .

It should be obvious that, unlike autocorrelations, cross-correlations are not symmetric –
the order, xy or yx, matters. Partial cross-correlations are defined in a similar manner: the
partial cross-correlation is the correlation between x_t and y_{t−s} controlling for y_{t−1}, . . . , y_{t−(s−1)}.

Definition 5.11 (Partial Cross-correlation). The partial cross-correlations between two co-
variance stationary series {x t } and {yt } are defined as the population values of the coeffi-
cients ϕ x y ,s in

x t = φ0 + φ1 yt −1 + . . . + φs −1 yt −(s −1) + ϕ x y ,s yt −s + ε x ,t (5.22)

and ϕ y x ,s in
yt = φ0 + φ1 x t −1 + . . . + φs −1 x t −(s −1) + ϕ y x ,s x t −s + ε y ,t (5.23)
where the order of the indices indicates which variable is measured using contemporane-
ous values and which variable is lagged.
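
As a quick illustration, sample cross-correlations can be computed directly from the definition; the sketch below (NumPy only; the function name is hypothetical) estimates ρ̂_{yx,s} for the nontrivial VAR(1) used in figure 5.2.

    import numpy as np

    def cross_correlation(x, y, s):
        """Sample cross-correlation between x_t and y_{t-s}, s >= 0."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        xd, yd = x - x.mean(), y - y.mean()
        # E[(x_t - mu_x)(y_{t-s} - mu_y)] / sqrt(V[x_t] V[y_t])
        cov = np.mean(xd[s:] * yd[:len(y) - s]) if s > 0 else np.mean(xd * yd)
        return cov / np.sqrt(x.var() * y.var())

    # Simulate the nontrivial bivariate VAR(1) from the text
    T = 10_000
    eps = np.random.randn(T, 2)
    z = np.zeros((T, 2))
    Phi = np.array([[0.5, 0.4], [0.4, 0.5]])
    for t in range(1, T):
        z[t] = Phi @ z[t - 1] + eps[t]
    y, x = z[:, 0], z[:, 1]
    print([round(cross_correlation(y, x, s), 2) for s in range(1, 5)])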

Figure 5.2 contains the CCF (cross-correlation function) and PCCF (partial cross-correlation
function) of two first order VARs with identical persistence. The top panel contains the
functions for
" # " #" # " #
yt .5 .4 yt −1 ε1,t
= +
xt .4 .5 x t −1 ε2,t

while the bottom contains the functions for a trivial VAR

    [ y_t ]   [ .9   0 ] [ y_{t-1} ]   [ ε_{1,t} ]
    [ x_t ] = [  0  .9 ] [ x_{t-1} ] + [ ε_{2,t} ]

which is just two AR(1)s in a system. The nontrivial VAR(1) exhibits dependence between
both series while the AR-in-disguise shows no dependence between yt and x t − j , j > 0.
With these new tools, it would seem that Box-Jenkins could be directly applied to vector
processes, and while technically possible, exploiting the ACF, PACF, CCF and PCCF to de-
termine what type of model is appropriate is difficult. For specifications larger than a bi-
variate VAR, there are simply too many interactions.
The usual solution is to take a hands-off approach as advocated by Sims (1980). The
VAR specification should include all variables which theory indicates are relevant and a lag
length should be chosen which has a high likelihood of capturing all of the dynamics. Once
these values have been set, either a general-to-specific search can be conducted or an in-
formation criterion can be used to select the appropriate lag length. In the VAR case, the
Akaike IC, Hannan & Quinn (1979) IC and the Schwarz/Bayes IC are given by
[Figure: "ACF and CCF for two VAR(1)s" – four panels showing, for lags 1–20, the VAR(1)
ACF (y on lagged y), the VAR(1) CCF (y on lagged x), the diagonal VAR(1) ACF (y on lagged y)
and the diagonal VAR(1) CCF (y on lagged x).]

Figure 5.2: The top panel contains the ACF and CCF for a nontrivial VAR process where
contemporaneous values depend on both series. The bottom contains the ACF and CCF
for a trivial VAR which is simply composed of two AR(1)s.

AIC:  ln|Σ̂(P)| + k² P (2/T)

HQC:  ln|Σ̂(P)| + k² P (2 ln ln T / T)

SBIC: ln|Σ̂(P)| + k² P (ln T / T)

where Σ̂(P ) is the covariance of the residuals using P lags and | · | indicates determinant.7
The lag length should be chosen to minimize one of these criteria, and the SBIC will always
choose a (weakly) smaller model than the HQC which in turn will select a (weakly) smaller

7
ln |Σ̂| is, up to an additive constant, the Gaussian log-likelihood divided by T, and these information crite-
ria are all special cases of the usual information criteria for log-likelihood models which take the form L + P_IC
where P_IC is the penalty which depends on the number of estimated parameters in the model.
model than the AIC. Ivanov & Kilian (2005) recommend the AIC for monthly models and
the HQC for quarterly models, unless the sample size is less than 120 quarters in which
case the SBIC is preferred. Their recommendation is based on the accuracy of the impulse
response function, and so may not be ideal in other applications such as forecasting.
To use a general-to-specific approach, a simple likelihood ratio test can be computed
as
 
(T − P_2 k²)( ln|Σ̂(P_1)| − ln|Σ̂(P_2)| )  ∼^A  χ²_{(P_2−P_1)k²}
where P1 is the number of lags in the restricted (smaller) model, P2 is the number of lags in
the unrestricted (larger) model and k is the dimension of yt . Since model 1 is a restricted
version of model 2, its variance is larger which ensures this statistic is positive. The −P2 k 2
term in the log-likelihood is a degree of freedom correction that generally improves small-
sample performance. Ivanov & Kilian (2005) recommended against using sequential like-
lihood ratio testing for selecting lag length.
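
To make the mechanics concrete, the following sketch (NumPy and SciPy; function names are illustrative, not from the original text) fits a VAR(P) equation-by-equation with OLS, computes ln|Σ̂(P)| and the three information criteria, and forms the likelihood ratio statistic for P_1 against P_2 lags. For simplicity each fit uses its own effective sample of T − P observations; in practice a common estimation sample should be used when comparing lag lengths.

    import numpy as np
    from scipy import stats

    def var_logdet(y, P):
        """ln|Sigma_hat(P)| from an OLS-estimated VAR(P) with an intercept."""
        T, k = y.shape
        Y = y[P:]
        X = np.hstack([np.ones((T - P, 1))] + [y[P - i:T - i] for i in range(1, P + 1)])
        resid = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
        Sigma = resid.T @ resid / (T - P)
        return np.linalg.slogdet(Sigma)[1], T - P, k

    def information_criteria(y, P):
        logdet, T, k = var_logdet(y, P)
        aic  = logdet + k**2 * P * 2 / T
        hqc  = logdet + k**2 * P * 2 * np.log(np.log(T)) / T
        sbic = logdet + k**2 * P * np.log(T) / T
        return aic, hqc, sbic

    def lr_test(y, P1, P2):
        """LR test of P1 (restricted) against P2 (unrestricted) lags."""
        ld1, _, k = var_logdet(y, P1)
        ld2, T, _ = var_logdet(y, P2)
        stat = (T - P2 * k**2) * (ld1 - ld2)
        pval = 1 - stats.chi2.cdf(stat, (P2 - P1) * k**2)
        return stat, pval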

5.5.1 Example: Campbell’s VAR

A lag length selection procedure was conducted using Campbell’s VAR. The results are con-
tained in table 5.2. This table contains both the AIC and SBIC values for lags 0 through 12
as well as likelihood ratio test results for testing l lags against l + 1. Note that the LR and
P-val corresponding to lag l is a test of the null of l lags against an alternative of l + 1 lags.
Using the AIC, 12 lags would be selected since it produces the smallest value. If the initial
lag length was less than 12, 7 lags would be selected. The HQC and SBIC both choose 3 lags
in a specific-to-general search and 12 in a general-to-specific search. Using the likelihood
ratio, a general-to-specific procedure chooses 12 lags while a specific-to-general procedure
chooses 3. The test statistic for a null of H0 : P = 11 against an alternative that H1 : P = 12
has a p-val of 0.
One final specification search was conducted. Rather than beginning at the largest lag
and working down one at a time, a “global search” which evaluates models using every combi-
nation of lags up to 12 was conducted. This required fitting 4096 VARs, which only requires
a few seconds on a modern computer.8 For each possible combination of lags, the AIC and
the SBIC were computed. Using this methodology, the AIC search selected lags 1-4, 6, 10
and 12 while the SBIC selected a smaller model with only lags 1, 3 and 12 - the values of
these lags indicate that there may be a seasonality in the data. Search procedures of this
type are computationally viable for checking up to about 20 lags.

8
For a maximum lag length of L, 2^L models must be estimated.
Lag Length    AIC     HQC     BIC      LR      P-val

    0         6.39    5.83    5.49    1798     0.00
    1         3.28    2.79    2.56    205.3    0.00
    2         2.98    2.57    2.45    1396     0.00
    3         0.34    0.00    0.00    39.87    0.03
    4         0.35    0.08    0.19    29.91    0.23
    5         0.37    0.17    0.40    130.7    0.00
    6         0.15    0.03    0.37    44.50    0.01
    7         0.13    0.08    0.53    19.06    0.79
    8         0.17    0.19    0.75    31.03    0.19
    9         0.16    0.26    0.94    19.37    0.78
   10         0.19    0.36    1.15    27.33    0.34
   11         0.19    0.43    1.34    79.26    0.00
   12         0.00    0.31    1.33     N/A      N/A

Table 5.2: Normalized values for the AIC and SBIC in Campbell’s VAR. The AIC chooses 12
lags while the SBIC chooses only 3. A general-to-specific search would stop at 12 lags since
the likelihood ratio test of 12 lags against 11 rejects with a p-value of 0. If the initial number
of lags was less than 12, the GtS procedure would choose 6 lags. Note that the LR and P-val
corresponding to lag l is a test of the null of l lags against an alternative of l + 1 lags.

5.6 Granger causality

Granger causality (GC, also known as prima facie causality) is the first concept exclusive
to vector analysis. GC is the standard method to determine whether one variable is useful
in predicting another, and evidence of Granger causality is a good indicator that a VAR,
rather than a univariate model, is needed.

5.6.1 Definition

Granger causality is defined in the negative.

Definition 5.12 (Granger causality). A scalar random variable {x_t} is said to not Granger
cause {y_t} if
E[y_t | x_{t−1}, y_{t−1}, x_{t−2}, y_{t−2}, . . .] = E[y_t | y_{t−1}, y_{t−2}, . . .].9 That is, {x_t} does not Granger cause
{y_t} if the forecast of y_t is the same whether conditioned on past values of x_t or not.

Granger causality can be simply illustrated in a bivariate VAR.


9
Technically, this definition is for Granger causality in the mean. Other definitions exist for Granger causal-
ity in the variance (replace conditional expectation with conditional variance) and in distribution (replace con-
ditional expectation with conditional distribution).
" # " #" # " #" # " #


xt φ11,1 φ12,1 x t −1 φ11,2 φ12,2 x t −2 ε1,t
= + +
yt φ21,1 φ22,1 yt −1 φ21,2 φ22,2 yt −2 ε2,t

In this model, if φ21,1 = φ21,2 = 0 then {x t } does not Granger cause {yt }. If this is the case,
it may be tempting to model yt using

y_t = φ_{22,1} y_{t−1} + φ_{22,2} y_{t−2} + ε_{2,t}


However, this is generally not appropriate; ε_{1,t} and ε_{2,t} can be contemporaneously correlated. If it happens to be
the case that {x t } does not Granger cause {yt } and ε1,t and ε2,t have no contemporaneous
correlation, then yt is said to be weakly exogenous, and yt can be modeled completely in-
dependently of x t . Finally it is worth noting that {x t } not Granger causing {yt } says nothing
about whether {yt } Granger causes {x t }.
One important limitation of GC is that it doesn’t account for indirect effects. For ex-
ample, suppose x t and yt are both Granger caused by z t . When this is the case, x t will
usually Granger cause yt even when it has no effect once z t has been conditioned on, and
so E [yt |yt −1 , z t −1 , x t −1 , . . .] = E [yt |yt −1 , z t −1 , . . .] but E [yt |yt −1 , x t −1 , . . .] 6= E [yt |yt −1 , . . .].

5.6.2 Testing
Testing for Granger causality in a VAR(P) is usually conducted using likelihood ratio tests.
In this specification,

yt = Φ0 + Φ1 yt −1 + Φ2 yt −2 + . . . + ΦP yt −P + εt ,
{y j ,t } does not Granger cause {yi ,t } if φi j ,1 = φi j ,2 = . . . = φi j ,P = 0. The likelihood ratio
test can be computed
 
(T − (P k² − k))( ln|Σ̂_r| − ln|Σ̂_u| )  ∼^A  χ²_P

where Σr is the estimated residual covariance when the null of no Granger causation is
imposed (H0 : φi j ,1 = φi j ,2 = . . . = φi j ,P = 0) and Σu is the estimated covariance in the
unrestricted VAR(P). If there is no Granger causation in a VAR, it is probably not a good idea
to use one.10
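
The following sketch (NumPy and SciPy; function names are illustrative and the restricted model is estimated by simply dropping the lags of the excluded variable from the relevant equation) implements this likelihood ratio test for the null that series `cause` does not Granger cause series `effect` in a VAR(P).

    import numpy as np
    from scipy import stats

    def granger_lr_test(y, P, cause, effect):
        """LR test that series `cause` does not Granger cause series `effect`."""
        T, k = y.shape
        Y = y[P:]
        lags = np.hstack([y[P - i:T - i] for i in range(1, P + 1)])   # (T-P, k*P)
        X = np.hstack([np.ones((T - P, 1)), lags])

        # Unrestricted residuals, equation-by-equation OLS
        resid_u = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]

        # Restricted: drop all lags of `cause` from the equation for `effect`
        keep = [0] + [1 + (i - 1) * k + j for i in range(1, P + 1)
                      for j in range(k) if j != cause]
        Xr = X[:, keep]
        resid_r = resid_u.copy()
        resid_r[:, effect] = Y[:, effect] - Xr @ np.linalg.lstsq(Xr, Y[:, effect], rcond=None)[0]

        Sr = resid_r.T @ resid_r / (T - P)
        Su = resid_u.T @ resid_u / (T - P)
        # Effective sample T - P is used in place of T; a sketch, not an exact replication
        stat = ((T - P) - (P * k**2 - k)) * (np.linalg.slogdet(Sr)[1] - np.linalg.slogdet(Su)[1])
        pval = 1 - stats.chi2.cdf(stat, P)
        return stat, pval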

5.6.3 Example: Campbell’s VAR


Campbell’s VAR will be used to illustrate testing for Granger causality. Table 5.3 contains
the results of Granger causality tests from a VAR which included lags 1, 3 and 12 (as chosen
by the “global search” SBIC method) for the 5 series in Campbell's VAR.
10
The multiplier in the test is a degree of freedom adjusted factor. There are T data points and there are
P k 2 − k parameters in the restricted model.
                VWM           LBR           RTB           TERM          DIV
Exclusion    Stat  P-val   Stat  P-val   Stat  P-val   Stat  P-val   Stat  P-val
VWM            –     –     0.05  0.83    0.06  0.80    2.07  0.15    2.33  0.13
LBR          1.54  0.21      –     –     3.64  0.06    0.03  0.86    0.44  0.51
RTB         12.05  0.00    2.87  0.09      –     –    49.00  0.00    2.88  0.09
TERM        13.55  0.00    5.98  0.01   43.93  0.00      –     –     0.57  0.45
DIV          0.16  0.69    0.72  0.40    5.55  0.02    0.06  0.80      –     –
All          5.60  0.23    8.83  0.07   72.29  0.00   56.68  0.00    9.06  0.06

Table 5.3: Tests of Granger causality. This table contains tests where the variable on the left
hand side is excluded from the regression for the variable along the top. Since the null is
no GC, rejection indicates a relationship between past values of the variable on the left and
contemporaneous values of variables on the top.

Tests of a variable causing itself have been omitted as these aren't particularly informative in the multivariate
context. The table tests whether the variables in the left-hand column Granger cause the
variables along the top row. Since the null is no Granger causation, it can be seen from the
table that every variable causes at least one other variable: each row contains at least one
p-val indicating significance at standard test sizes (5 or 10%). It can also be seen, by examining
the p-values column by column, that every variable is caused by at least one other.

5.7 Impulse Response Function


The second concept new to multivariate time-series analysis is the impulse response func-
tion. In the univariate world, the ACF was sufficient to understand how shocks decay.
When analyzing vector data, this is no longer the case. A shock to one series has an im-
mediate effect on that series but it can also affect the other variables in a system which, in
turn, can feed back into the original variable. After a few iterations of this cycle, it can be
difficult to determine how a shock propagates even in a simple bivariate VAR(1).

5.7.1 Defined
Definition 5.13 (Impulse Response Function). The impulse response function of yi , an el-
ement of y, with respect to a shock in ε j , an element of ε, for any j and i , is defined as the
change in yi t +s , s ≥ 0 for a unit shock in ε j ,t .
This definition is somewhat difficult to parse and the impulse response function can
be clearly illustrated through a vector moving average (VMA).11 As long as yt is covariance
stationary it must have a VMA representation,
11
Recall that a stationary AR(P) can also be transformed into a MA(∞). Transforming a stationary VAR(P)
into a VMA(∞) is the multivariate time-series analogue.
350 Analysis of Multiple Time Series

y_t = µ + ε_t + Ξ_1 ε_{t−1} + Ξ_2 ε_{t−2} + . . .

Using this VMA, the impulse response of y_i with respect to a shock in ε_j is simply {1, Ξ_{1[ii]}, Ξ_{2[ii]}, Ξ_{3[ii]}, . . .}
if i = j and {0, Ξ_{1[ij]}, Ξ_{2[ij]}, Ξ_{3[ij]}, . . .} otherwise. The difficult part is computing Ξ_l , l ≥ 1.
In the simple VAR(1) model this is easy since

yt = (Ik − Φ1 )−1 Φ0 + εt + Φ1 εt −1 + Φ21 εt −2 + . . . .


However, in more complicated models, whether higher order VARs or VARMAs, determin-
ing the MA(∞) form can be tedious. One surprisingly simple, yet correct, method to com-
pute the elements of {Ξ_j} is to simulate the effect of a unit shock to ε_{j,t}. Suppose the model
is a VAR in deviations form12,

yt − µ = Φ1 (yt −1 − µ) + Φ2 (yt −2 − µ) + . . . + ΦP (yt −P − µ) + εt .


The impulse responses can be computed by “shocking” ε_t by 1 unit and stepping the
process forward. To use this procedure, set y_{t−1} = y_{t−2} = . . . = y_{t−P} = 0 and then begin the
simulation by setting ε_{j,t} = 1. The 0th impulse will obviously be e_j = [0_{j−1} 1 0_{k−j}]', a vector
with a 1 in the jth position and zeros everywhere else. The first impulse will be

Ξ_1 = Φ_1 e_j ,

the second will be

Ξ_2 = Φ_1^2 e_j + Φ_2 e_j

and the third is

Ξ_3 = Φ_1^3 e_j + Φ_1 Φ_2 e_j + Φ_2 Φ_1 e_j + Φ_3 e_j .

This can be continued to compute any Ξ_j .
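
A minimal sketch of this simulation approach (NumPy only; the function name is illustrative) steps a unit shock through the VAR recursion to build the sequence {Ξ_l e_j}.

    import numpy as np

    def impulse_response(Phi, j, horizon):
        """Responses to a unit shock in error j of a VAR(P).

        Phi     : list of P (k, k) coefficient matrices [Phi_1, ..., Phi_P]
        Returns an array of shape (horizon + 1, k); row l is Xi_l e_j.
        """
        P, k = len(Phi), Phi[0].shape[0]
        e_j = np.zeros(k)
        e_j[j] = 1.0                       # unit shock at time t
        responses = [e_j]                  # Xi_0 e_j = e_j
        for _ in range(horizon):
            # Xi_l e_j = Phi_1 (Xi_{l-1} e_j) + ... + Phi_P (Xi_{l-P} e_j)
            prev = responses[::-1][:P]     # most recent responses first
            responses.append(sum(Phi[i] @ prev[i] for i in range(len(prev))))
        return np.array(responses)

    # The stock-bond VAR(1) from the earlier example
    Phi1 = np.array([[0.097, 0.301], [-0.095, 0.299]])
    print(impulse_response([Phi1], j=1, horizon=5))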

5.7.2 Correlated Shocks and non-unit Variance


The previous discussion has made use of unit shocks, e j which represent a change of 1 in
jth error. This presents two problems: actual errors do not have unit variances and are of-
ten correlated. The solution to these problems is to use non-standardized residuals and/or
correlated residuals. Suppose that the residuals in a VAR have a covariance of Σ. To simu-
late the effect of a shock to element j , e j can be replaced with ẽ j = Σ1/2 e j and the impulses
can be computed using the procedure previously outlined.
This change has two effects. First, every series will generally have an instantaneous re-
action to any shock when the errors are correlated. Second, the choice of matrix square
12
Since the VAR is in deviations form, this formula can be used with any covariance stationary VAR. If the
model is not covariance stationary, the impulse response function can still be computed although the un-
conditional mean must be replaced with a conditional one.
root, Σ^{1/2}, matters. There are two matrix square roots: the Choleski and the spectral de-
composition. The Choleski square root is a lower triangular matrix which imposes an ordering
on the shocks. Shocking element j (using e_j) has an effect on every series j, . . . , k but not
on 1, . . . , j − 1. In contrast, the spectral matrix square root is symmetric and a shock to the
jth error will generally affect every series instantaneously. Unfortunately there is no right
choice. If there is a natural ordering in a VAR where shocks to one series can be reasoned
to have no contemporaneous effect on the other series, then the Choleski is the correct
choice. However, in many situations there is little theoretical guidance and the spectral
decomposition is the natural choice.
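
A short sketch of the two square roots (NumPy only; the covariance matrix is made up purely for illustration):

    import numpy as np

    Sigma = np.array([[1.0, 0.4],
                      [0.4, 2.0]])          # hypothetical residual covariance

    # Choleski square root: lower triangular, imposes an ordering on the shocks
    chol = np.linalg.cholesky(Sigma)

    # Spectral (symmetric) square root: Q diag(sqrt(lambda)) Q'
    lam, Q = np.linalg.eigh(Sigma)
    spec = Q @ np.diag(np.sqrt(lam)) @ Q.T

    e_2 = np.array([0.0, 1.0])              # unit shock to the second error
    print(chol @ e_2)   # no instantaneous effect on series 1
    print(spec @ e_2)   # generally moves both series instantaneously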

5.7.3 Example: Impulse Response in Campbell’s VAR


Campbell’s VAR will be used to illustrate impulse response functions. Figure 5.3 contains
the impulse responses of the relative T-bill rate to shocks in the in the four other variables:
equity returns, labor income growth, the term premium and the dividend rate. The dotted
lines represent 2 standard deviation confidence intervals. The relative T-bill rate increases
subsequent to positive shocks in any variable which indicates that the economy is improv-
ing and there are inflationary pressures driving up the short end of the yield curve.

5.7.4 Confidence Intervals


Impulse response functions, like the parameters of the VAR, are estimated quantities and
subject to statistical variation. Confidence bands can be constructed to determine whether
an impulse response is large in a statistically meaningful sense. Since the parameters of
the VAR are asymptotically normal (as long as it is stationary and the innovations are white
noise), the impulse responses will also be asymptotically normal by applying the δ-method.
The derivation of the covariance of the impulse response function is tedious and has no
intuitive value. Interested readers can refer to 11.7 in Hamilton (1994). Instead, two com-
putational methods to construct confidence bands for impulse response functions will be
described: Monte Carlo and using a procedure known as the bootstrap.

5.7.4.1 Monte Carlo Confidence Intervals

Monte Carlo confidence intervals come in two forms, one that directly simulates Φ̂i from
its asymptotic distribution and one that simulates the VAR and draws Φ̂i as the result of
estimating the unknown parameters in the simulated VAR. The direct sampling method is
simple:

1. Compute Φ̂ from the initial data and estimate the covariance matrix Λ̂ in the asymptotic
   distribution √T(Φ̂ − Φ_0) ∼^A N(0, Λ).13
13
This is an abuse of notation. Φ is a matrix and the vec operator is needed to transform it into a vector.
Interested readers should see 11.7 in Hamilton (1994) for details on the correct form.
[Figure: "Impulse Response Function" – four panels plotting the response of RTB to VWM,
RTB to LBRG, RTB to TERM and RTB to DIV over periods 1–12.]

Figure 5.3: Impulse response functions for 12 steps of the response of the relative T-bill
rate to equity returns, labor income growth, the term premium rate and the dividend yield.
The dotted lines represent 2 standard deviation (in each direction) confidence intervals.
All values have been scaled by 1,000.

2. Using Φ̂ and Λ̂, generate simulated values Φ̃_b from the asymptotic distribution as
   Λ̂^{1/2} ε + Φ̂ where ε ∼ N(0, I). These are i.i.d. draws from a N(Φ̂, Λ̂) distribution.

3. Using Φ̃b , compute the impulse responses {Ξ̂ j ,b } where b = 1, 2, . . . , B . Save these
values.

4. Return to step 2 and compute a total of B impulse responses. Typically B is between
   100 and 1000.

5. For each impulse response for each horizon, sort the responses. The 5th and 95th per-
centile of this distribution are the confidence intervals.

The second Monte Carlo method differs only in the method used to compute Φ̃b .
1. Compute Φ̂ from the initial data and estimate the residual covariance Σ̂.

2. Using Φ̂ and Σ̂, simulate a time-series {ỹ_t} with as many observations as the original
   data. These can be computed directly using the forward recursion

   ỹ_t = Φ̂_0 + Φ̂_1 ỹ_{t−1} + . . . + Φ̂_P ỹ_{t−P} + Σ̂^{1/2} ε_t

   where ε_t ∼ N(0, I_k) are i.i.d. multivariate standard normally distributed.

3. Using {ỹt }, estimate Φ̃b using a VAR.

4. Using Φ̃b , compute the impulse responses {Ξ̃ j ,b } where b = 1, 2, . . . , B . Save these
values.

5. Return to step 2 and compute a total of B impulse responses. Typically B is between
   100 and 1000.

6. For each impulse response for each horizon, sort the impulse responses. The 5th and
95th percentile of this distribution are the confidence intervals.

Of these two methods, the former should be preferred as the assumption of i.i.d. normal
errors in the latter may be unrealistic. This is particularly true for financial data. The final
method, which uses a procedure known as the bootstrap, combines the ease of the second
with the robustness of the first.

5.7.4.2 Bootstrap Confidence Intervals

The bootstrap is a computational tool which has become popular in recent years primar-
ily due to the significant increase in the computing power of typical PCs. Its name is de-
rived from the expression “pulling oneself up by the bootstraps”, a seemingly impossible
feat. The idea is simple: if the residuals are realizations of the actual error process, one can
use them directly to simulate this distribution rather than making an arbitrary assumption
about the error distribution (e.g. i.i.d. normal). The procedure is essentially identical to the
second Monte Carlo procedure outlined above:

1. Compute Φ̂ from the initial data and estimate the residuals ε̂t .

2. Using ε̂t , compute a new series of residuals ε̃t by sampling, with replacement, from
the original residuals. The new series of residuals can be described

{ε̂u 1 , ε̂u 2 , . . . , ε̂u T }

where u i are i.i.d. discrete uniform random variables taking the values 1, 2, . . . , T . In
essence, the new set of residuals is just the old set of residuals reordered with some
duplication and omission.14

3. Using Φ̂ and {ε̂u 1 , ε̂u 2 , . . . , ε̂u T }, simulate a time-series {ỹt } with as many observations
as the original data. These can be computed directly using the VAR

ỹ_t = Φ̂_0 + Φ̂_1 ỹ_{t−1} + . . . + Φ̂_P ỹ_{t−P} + ε̂_{u_t}

4. Using {ỹt }, compute estimates of Φ̆b from a VAR.

5. Using Φ̆b , compute the impulse responses {Ξ̆ j ,b } where b = 1, 2, . . . , B . Save these
values.

6. Return to step 2 and compute a total of B impulse responses. Typically B is between
   100 and 1000.

7. For each impulse response for each horizon, sort the impulse responses. The 5th and
95th percentile of this distribution are the confidence intervals.

The bootstrap has many uses in econometrics. Interested readers can find more applica-
tions in Efron & Tibshirani (1998).

5.8 Cointegration
Many economic time-series have two properties that make standard VAR analysis unsuit-
able: they contain one or more unit roots and most equilibrium models specify that devia-
tions between key variables, either in levels or ratios, are transitory. Before formally defin-
ing cointegration, consider the case where two important economic variables that contain
unit roots, consumption and income, had no long-run relationship. If this were true, the
values of these variables would grow arbitrarily far apart given enough time. Clearly this
is unlikely to occur and so there must be some long-run relationship between these two
time-series. Alternatively, consider the relationship between the spot and futures price of
oil. Standard finance theory dictates that the futures price, f_t, is a conditionally unbiased
estimate of the spot price in period t + 1, st +1 (E t [st +1 ] = f t ). Additionally, today’s spot
price is also an unbiased estimate of tomorrow’s spot price (E t [st +1 ] = st ). However, both
of these price series contain unit roots. Combining these two identities reveals a cointe-
grating relationship: st − f t should be stationary even if the spot and future prices contain
unit roots.15
It is also important to note how cointegration is different from stationary VAR analy-
sis. In stationary time-series, whether scalar or when the multiple processes are linked
through a VAR, the process is self-equilibrating; given enough time, a process will revert
14
This is one version of the bootstrap and is appropriate for homoskedastic data. If the data are het-
eroskedastic, some form of block bootstrap is needed.
15
This assumes the horizon is short.
to its unconditional mean. In a VAR, both the individual series and linear combinations
of the series are stationary. The behavior of cointegrated processes is meaningfully differ-
ent. Treated in isolation, each process contains a unit root and has shocks with permanent
impact. However, when combined with another series, a cointegrated pair will show a ten-
dency to revert towards one another. In other words, a cointegrated pair is mean reverting
to a stochastic trend.
Cointegration and error correction provide the tools to analyze temporary deviations
from long-run equilibria. In a nutshell, cointegrated time-series may show temporary de-
viations from a long-run trend but are ultimately mean reverting to this trend. It may also
be useful to relate cointegration to what has been studied thus far: cointegration is to VARs
as unit roots are to stationary time-series.

5.8.1 Definition
Recall that an integrated process is defined as a process which is not stationary in levels but
is stationary in differences. When this is the case, yt is said to be I(1) and ∆yt = yt − yt −1
is I(0). Cointegration builds on this structure by defining relationships across series which
transform I(1) series into I(0) series.

Definition 5.14 (Bivariate Cointegration). Let {x t } and {yt } be two I(1) series. These series
are said to be cointegrated if there exists a vector β with both elements non-zero such that

β 0 [x t yt ]0 = β1 x t − β2 yt ∼ I (0) (5.24)

Put another way, there exists a nontrivial linear combination of x t and yt which is sta-
tionary. This feature, when present, is a powerful link in the analysis of nonstationary
data. When treated individually, the data are extremely persistent; however there is a com-
bination of the data which is well behaved. Moreover, in many cases this relationship
takes a meaningful form such as y_t − x_t. Note that cointegrating relationships are only
defined up to a non-zero scalar multiple. For example if x_t − β y_t is a cointegrating relationship, then
2x_t − 2β y_t = 2(x_t − β y_t) is also a cointegrating relationship. The standard practice is to
choose one variable to normalize the vector. For example, if β_1 x_t − β_2 y_t was a cointegrating
relationship, one normalized version would be x_t − (β_2/β_1) y_t = x_t − β̃ y_t.
The definition in the general case is similar, albeit slightly more intimidating.

Definition 5.15 (Cointegration). A set of k variables yt are said to be cointegrated if at least


2 series are I (1) and there exists a non-zero, reduced rank k by k matrix π such that

πyt ∼ I (0). (5.25)

The non-zero requirement is obvious: if π = 0 then πy_t = 0 and is trivially I(0). The second
requirement, that π has reduced rank, is not. This technical requirement is necessary since
whenever π is full rank and πy_t ∼ I(0), it must be the case that y_t is also I(0).
[Figure: "Nonstationary and Stationary VAR(1)s" – four panels of simulated paths labeled
Cointegration (Φ_11), Independent Unit Roots (Φ_12), Persistent, Stationary (Φ_21) and
Anti-persistent, Stationary (Φ_22), each plotted for 100 observations.]

Figure 5.4: A plot of four time-series that all begin at the same initial value and use
the same shocks. All data were generated by y_t = Φ_{ij} y_{t−1} + ε_t where Φ_{ij} varies.

However, in order for variables to be cointegrated they must also be integrated. Thus, if
the matrix is full rank, there is no possibility for the common unit roots to cancel and it
must have the same order of integration before and after the multiplication by π. Finally,
the requirement that at least 2 of the series are I (1) rules out the degenerate case where all
components of yt are I (0), and allows yt to contain both I (0) and I (1) random variables.
For example, suppose x_t and y_t are cointegrated and x_t − β y_t is stationary. One choice
for π is

        [ 1  −β ]
    π = [ 1  −β ]

To begin developing a feel for cointegration, examine the plots in figure 5.4. These four
plots correspond to two nonstationary processes and two stationary processes all begin-
ning at the same point and all using the same shocks. These plots contain data from a simulated
VAR(1) with different parameters, Φ_{ij},

y_t = Φ_{ij} y_{t−1} + ε_t

    Φ_11 = [ .8  .2 ]          Φ_12 = [ 1  0 ]
           [ .2  .8 ]                 [ 0  1 ]
           λ_i = 1, 0.6               λ_i = 1, 1

    Φ_21 = [ .7  .2 ]          Φ_22 = [ −.3   .3 ]
           [ .2  .7 ]                 [  .1  −.2 ]
           λ_i = 0.9, 0.5             λ_i = −0.43, −0.06
where λ_i are the eigenvalues of the parameter matrices. Note that the eigenvalues of the
nonstationary processes contain the value 1 while the eigenvalues for the stationary pro-
cesses are all less than 1 (in absolute value). Also, note that the cointegrated process has
only one eigenvalue which is unity while the independent unit root process has two. Higher
dimension cointegrated systems may contain between 1 and k − 1 unit eigenvalues. The
number of unit eigenvalues indicates the number of unit root “drivers” in a system of equa-
tions. The picture presents evidence of another issue in cointegration analysis: it can be
very difficult to tell when two series are cointegrated, a feature in common with unit root
testing of scalar processes.

5.8.2 Error Correction Models (ECM)

The Granger representation theorem provides a key insight into understanding cointegrat-
ing relationships. Granger demonstrated that if a system is cointegrated then there exists
an error correction model, and if there is an error correction model then the system must be
cointegrated. The error correction model is a form which governs short-run deviations from
the trend (a stochastic trend or unit root). The simplest ECM is given by
" # " #" # " #
∆x t π11 π12 x t −1 ε1,t
= + (5.26)
∆yt π21 π22 yt −1 ε2,t

which states that changes in x_t and y_t are related to the levels of x_t and y_t through the
cointegrating matrix (π). However, since x_t and y_t are cointegrated, there exists β such
that x_t − β y_t is I(0). Substituting this relationship into equation 5.26, the model can be rewritten

    [ Δx_t ]   [ α_1 ]            [ x_{t-1} ]   [ ε_{1,t} ]
    [ Δy_t ] = [ α_2 ] [ 1  −β ]  [ y_{t-1} ] + [ ε_{2,t} ] .                  (5.27)

The short-run dynamics take the forms



∆x t = α1 (x t −1 − β yt −1 ) + ε1,t (5.28)
and
∆yt = α2 (x t −1 − β yt −1 ) + ε2,t . (5.29)
The important elements of this ECM can be clearly labeled: x t −1 − β yt −1 is the deviation
from the long-run trend (also known as the equilibrium correction term) and α1 and α2 are
the speed of adjustment parameters. ECMs impose one restriction on the αs: they cannot
both be 0 (if they were, π would also be 0). In its general form, an ECM can be augmented to
allow past short-run deviations to also influence present short-run deviations or to include
deterministic trends. In vector form, the generalized ECM is

Δy_t = π_0 + π y_{t−1} + π_1 Δy_{t−1} + π_2 Δy_{t−2} + . . . + π_P Δy_{t−P} + ε_t

where πyt −1 captures the cointegrating relationship, π0 represents a linear time trend in
the original data (levels) and π j ∆yt − j , j = 1, 2, . . . , P capture short-run dynamics around
the stochastic trend.

5.8.2.1 The Mechanics of the ECM

It may not be obvious how a cointegrated VAR is transformed into an ECM. Consider a
simple cointegrated bivariate VAR(1)
" # " #" # " #
xt .8 .2 x t −1 ε1,t
= +
yt .2 .8 yt −1 ε2,t

To transform this VAR to an ECM, begin by subtracting [x_{t−1} y_{t−1}]' from both sides:

    [ x_t ]   [ x_{t-1} ]   [ .8  .2 ] [ x_{t-1} ]   [ x_{t-1} ]   [ ε_{1,t} ]
    [ y_t ] − [ y_{t-1} ] = [ .2  .8 ] [ y_{t-1} ] − [ y_{t-1} ] + [ ε_{2,t} ]          (5.30)

    [ Δx_t ]   ( [ .8  .2 ]   [ 1  0 ] ) [ x_{t-1} ]   [ ε_{1,t} ]
    [ Δy_t ] = ( [ .2  .8 ] − [ 0  1 ] ) [ y_{t-1} ] + [ ε_{2,t} ]

    [ Δx_t ]   [ −.2   .2 ] [ x_{t-1} ]   [ ε_{1,t} ]
    [ Δy_t ] = [  .2  −.2 ] [ y_{t-1} ] + [ ε_{2,t} ]

    [ Δx_t ]   [ −.2 ]           [ x_{t-1} ]   [ ε_{1,t} ]
    [ Δy_t ] = [  .2 ] [ 1  −1 ] [ y_{t-1} ] + [ ε_{2,t} ]

In this example, the speed of adjustment parameters are -.2 for ∆x t and .2 for ∆yt and the
normalized (on x t ) cointegrating relationship is [1 − 1]. In the general multivariate case,
a cointegrated VAR(P) can be turned into an ECM by recursive substitution. Consider a
cointegrated VAR(3),
yt = Φ1 yt −1 + Φ2 yt −2 + Φ3 yt −3 + εt

This system will be cointegrated if at least one but fewer than k eigenvalues of π = Φ1 +
Φ2 + Φ3 − Ik are not zero. To begin the transformation, add and subtract Φ3 yt −2 to the right
side

yt = Φ1 yt −1 + Φ2 yt −2 + Φ3 yt −2 − Φ3 yt −2 + Φ3 yt −3 + εt
= Φ1 yt −1 + Φ2 yt −2 + Φ3 yt −2 − Φ3 ∆yt −2 + εt
= Φ1 yt −1 + (Φ2 + Φ3 )yt −2 − Φ3 ∆yt −2 + εt

then add and subtract (Φ2 + Φ3 )yt −1 to the right side,

yt = Φ1 yt −1 + (Φ2 + Φ3 )yt −1 − (Φ2 + Φ3 )yt −1 + (Φ2 + Φ3 )yt −2 − Φ3 ∆yt −2 + εt


= Φ1 yt −1 + (Φ2 + Φ3 )yt −1 − (Φ2 + Φ3 )∆yt −1 − Φ3 ∆yt −2 + εt
= (Φ1 + Φ2 + Φ3 )yt −1 − (Φ2 + Φ3 )∆yt −1 − Φ3 ∆yt −2 + εt .

Finally, subtract yt −1 from both sides,

yt − yt −1 = (Φ1 + Φ2 + Φ3 )yt −1 − yt −1 − (Φ2 + Φ3 )∆yt −1 − Φ3 ∆yt −2 + εt


∆yt = (Φ1 + Φ2 + Φ3 − Ik )yt −1 − (Φ2 + Φ3 )∆yt −1 − Φ3 ∆yt −2 + εt .

The final step is to relabel the above equation in terms of π notation,

yt − yt −1 = (Φ1 + Φ2 + Φ3 − Ik )yt −1 − (Φ2 + Φ3 )∆yt −1 − Φ3 ∆yt −2 + εt (5.31)


∆yt = πyt −1 + π1 ∆yt −1 + π2 ∆yt −2 + εt .

which is equivalent to

∆yt = αβ 0 yt −1 + π1 ∆yt −1 + π2 ∆yt −2 + εt . (5.32)

where α contains the speed of adjustment parameters and β contains the cointegrating
vectors. This recursion can be used to transform any cointegrated VAR(P)

y_t = Φ_1 y_{t−1} + Φ_2 y_{t−2} + . . . + Φ_P y_{t−P} + ε_t

into its ECM form



∆yt = πyt −1 + π1 ∆yt −1 + π2 ∆yt −2 + . . . + πP −1 ∆yt −P +1 + εt


using the identities π = −I_k + Σ_{i=1}^{P} Φ_i and π_p = −Σ_{i=p+1}^{P} Φ_i .
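
These identities are straightforward to apply directly; the sketch below (NumPy only; function names are illustrative) converts a list of VAR(P) coefficient matrices into the ECM matrices and checks the bivariate example above.

    import numpy as np

    def var_to_ecm(Phi):
        """Convert VAR(P) matrices [Phi_1, ..., Phi_P] to ECM form (pi, [pi_1, ..., pi_{P-1}])."""
        P, k = len(Phi), Phi[0].shape[0]
        pi = -np.eye(k) + sum(Phi)                   # pi = -I_k + sum_i Phi_i
        pis = [-sum(Phi[p:]) for p in range(1, P)]   # pi_p = -sum_{i=p+1}^P Phi_i
        return pi, pis

    # The cointegrated VAR(1) example from the text
    Phi1 = np.array([[0.8, 0.2], [0.2, 0.8]])
    pi, _ = var_to_ecm([Phi1])
    print(pi)        # [[-0.2, 0.2], [0.2, -0.2]], a rank 1 matrix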

5.8.2.2 Cointegrating Vectors

The key to understanding cointegration in systems with 3 or more variables is to note that
the matrix which governs the cointegrating relationship, π, can always be decomposed into
two matrices,
π = αβ'

where α and β are both k by r matrices where r is the number of cointegrating relation-
ships. For example, suppose the parameter matrix in an ECM was

        [  0.3    0.2   −0.36 ]
    π = [  0.2    0.5   −0.35 ]
        [ −0.3   −0.3    0.39 ]

The eigenvalues of this matrix are .9758, .2142 and 0. The 0 eigenvalue of π indicates
there are two cointegrating relationships since the number of cointegrating relationships
is rank(π). Since there are two cointegrating relationships, β can be specified as
 
        [  1     0  ]
    β = [  0     1  ]
        [ β_1   β_2 ]

and α has 6 unknown parameters. αβ' can be combined to produce

        [ α_11   α_12   α_11 β_1 + α_12 β_2 ]
    π = [ α_21   α_22   α_21 β_1 + α_22 β_2 ]
        [ α_31   α_32   α_31 β_1 + α_32 β_2 ]
and α can be trivially solved using the left block of π. Once α is known, any two of the three
remaining elements can be used to solve for β_1 and β_2. Appendix A contains a detailed
illustration of solving a trivariate cointegrated VAR for the speed of adjustment coefficients
and the cointegrating vectors.
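
As a numerical illustration, the decomposition can be recovered directly from the π shown above; the sketch below (NumPy only) does exactly what the text describes, and the values in the final comments follow from that π rather than from the original text.

    import numpy as np

    pi = np.array([[ 0.3,  0.2, -0.36],
                   [ 0.2,  0.5, -0.35],
                   [-0.3, -0.3,  0.39]])

    r = 2                                   # rank(pi): two cointegrating relationships
    alpha = pi[:, :r]                       # with beta normalized as [I_r ; b]',
                                            # the left block of pi equals alpha
    # The last column satisfies pi[:, 2] = alpha @ b, so solve for b = [beta_1, beta_2]
    b = np.linalg.lstsq(alpha, pi[:, 2], rcond=None)[0]
    beta = np.vstack([np.eye(r), b])
    print(np.round(alpha, 3))               # speed of adjustment parameters
    print(np.round(beta, 3))                # cointegrating vectors: [[1,0],[0,1],[-1,-0.3]]
    print(np.round(alpha @ beta.T - pi, 3)) # ~0, confirming pi = alpha beta'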

5.8.3 Rank and the number of unit roots

The rank of π is the same as the number of cointegrating vectors since π = αβ 0 and so if π
has rank r , then α and β must both have r linearly independent columns. α contains the
speed of adjustment parameters and β contains the cointegrating vectors. Note that since
there are r cointegrating vectors there are m = k − r distinct unit roots in the system. This
5.8 Cointegration 361

relationship holds since when there are k variables, and m distinct unit roots, then there
are r distinct linear combinations of the series which will be stationary (except in special
circumstances).
Consider a trivariate cointegrated system driven by either one or two unit roots. Denote
the underlying unit root processes as w1,t and w2,t . When there is a single unit root driving
all three variables, the system can be expressed

y1,t = κ1 w1,t + ε1,t


y2,t = κ2 w1,t + ε2,t
y3,t = κ3 w1,t + ε3,t

where ε j ,t is a covariance stationary error (or I(0), but not necessarily white noise).
In this system there are two linearly independent cointegrating vectors. First consider
normalizing the coefficient on y_{1,t} to be 1, so that the equilibrium relationship y_{1,t} −
β_1 y_{2,t} − β_2 y_{3,t} must satisfy

κ1 = β1 κ2 + β2 κ3
to ensure that the unit roots are not present. This equation does not have a unique solution
since there are two unknown parameters. One solution is to further restrict β1 = 0 so that
the unique solution is β2 = κ1 /κ3 and an equilibrium relationship is y1,t − (κ1 /κ3 )y3,t . This
is a cointegration relationship since

y_{1,t} − (κ_1/κ_3) y_{3,t} = κ_1 w_{1,t} + ε_{1,t} − (κ_1/κ_3) κ_3 w_{1,t} − (κ_1/κ_3) ε_{3,t} = ε_{1,t} − (κ_1/κ_3) ε_{3,t}
Alternatively one could normalize the coefficient on y2,t and so the equilibrium relation-
ship y2,t − β1 y1,t − β2 y3,t would require

κ2 = β1 κ1 + β2 κ3
which again is not identified since there are 2 unknowns and 1 equation. To solve assume
β1 = 0 and so the solution is β2 = κ2/κ3, which is a cointegrating relationship since

y_{2,t} − (κ_2/κ_3) y_{3,t} = κ_2 w_{1,t} + ε_{2,t} − (κ_2/κ_3) κ_3 w_{1,t} − (κ_2/κ_3) ε_{3,t} = ε_{2,t} − (κ_2/κ_3) ε_{3,t}
These solutions are the only two needed since any other definition of the equilibrium
would be a linear combination of these two. For example, suppose you choose next to try
and normalize on y_{1,t} to define an equilibrium of the form y_{1,t} − β_1 y_{2,t} − β_2 y_{3,t}, and impose
that β_2 = 0 to solve so that β_1 = κ_1/κ_2, producing the equilibrium condition

y_{1,t} − (κ_1/κ_2) y_{2,t} .
This equilibrium is already implied by the first two,
y_{1,t} − (κ_1/κ_3) y_{3,t}   and   y_{2,t} − (κ_2/κ_3) y_{3,t}

and can be seen to be redundant since

y_{1,t} − (κ_1/κ_2) y_{2,t} = ( y_{1,t} − (κ_1/κ_3) y_{3,t} ) − (κ_1/κ_2) ( y_{2,t} − (κ_2/κ_3) y_{3,t} )
In this system of three variables and 1 common unit root the set of cointegrating vectors
can be expressed as
 
        [    1         0     ]
    β = [    0         1     ]
        [ κ_1/κ_3   κ_2/κ_3  ]

since with only 1 unit root and three series, there are two non-redundant linear combina-
tions of the underlying variables which will be stationary.
Next consider a trivariate system driven by two unit roots,

y1,t = κ11 w1,t + κ12 w2,t + ε1,t


y2,t = κ21 w1,t + κ22 w2,t + ε2,t
y3,t = κ31 w1,t + κ32 w2,t + ε3,t

where the errors ε j ,t are again covariance stationary. By normalizing the coefficient on y1,t
to be 1, it must be the case the weights in the equilibrium condition, y1,t − β1 y2,t − β2 y3,t ,
must satisfy

κ11 = β1 κ21 + β2 κ31 (5.33)


κ12 = β1 κ22 + β2 κ32 (5.34)

in order to eliminate both unit roots. This system of 2 equations in 2 unknowns has the
solution
" # " #−1 " #
β1 κ21 κ31 κ11
= .
β2 κ22 κ32 κ12

This solution is unique after the initial normalization and there are no other cointegrating
vectors, and so
 
        [ 1                                               ]
    β = [ (κ_11 κ_32 − κ_12 κ_31) / (κ_21 κ_32 − κ_22 κ_31) ]
        [ (κ_12 κ_21 − κ_11 κ_22) / (κ_21 κ_32 − κ_22 κ_31) ]
The same line of reasoning extends to k -variate systems driven by m unit roots, and r
cointegrating vectors can be constructed by normalizing on the first r elements of y one
at a time. In the general case

yt = Kwt + εt

where K is a k by m matrix, wt a m by 1 set of unit root processes, and εt is a k by 1 vector


of covariance stationary errors. The cointegrating vectors in this system can be expressed
" #
Ir
β= (5.35)
β̃

where Ir is an r -dimensional identity matrix and β̃ is a m by r matrix of loadings which


can be found by solving the set of equations

β̃ = (K_2')^{-1} K_1'                                                        (5.36)

where K1 is the first r rows of K (r by m ) and K2 is the bottom m rows of K (m by m ). In the


trivariate example driven by one unit root,
" #
κ1
K1 = and K2 = κ3
κ2

and in the trivariate system driven by two unit roots,


" #
κ21 κ22
K1 = [κ11 κ12 ] and K2 = .
κ31 κ32

Applying eqs. (5.35) and (5.36) will produce the previously derived set of cointegrating vec-
tors. Note that when r = 0 the system contains k unit roots and so is not cointegrated
(in general): in the trivariate example, eliminating the unit roots would require solving
three equations with only two free weights. Similarly, when r = k there are no unit roots
since any linear combination of the series must be stationary.

5.8.3.1 Relationship to Common Features and common trends

Cointegration is a special case of a broader concept known as common features. In the case
of cointegration, both series have a common stochastic trend (or common unit root). Other
examples of common features which have been examined are common heteroskedasticity,
defined as x t and yt are heteroskedastic but there exists a combination, x t − β yt , which
is not, and common nonlinearities which are defined in an analogous manner (replacing
heteroskedasticity with nonlinearity). When modeling multiple time series, you should
always consider whether the aspects you are interested in may be common.

5.8.4 Testing

Testing for cointegration shares one important feature with its scalar counterpart (unit root
testing): it can be complicated. Two methods will be presented, the original Engle-Granger
2-step procedure and the more sophisticated Johansen methodology. The Engle-Granger
method is generally only applicable if there are two variables or the cointegrating relation-
ship is known (e.g. an accounting identity where the left-hand side has to add up to the
right-hand side). The Johansen methodology is substantially more general and can be used
to examine complex systems with many variables and more than one cointegrating rela-
tionship.

5.8.4.1 Johansen Methodology

The Johansen methodology is the dominant technique to determine whether a system of


I (1) variables is cointegrated, and if so, the number of cointegrating relationships. Recall
that one of the requirements for a set of integrated variables to be cointegrated is that π
has reduced rank.

Δy_t = π y_{t−1} + π_1 Δy_{t−1} + . . . + π_P Δy_{t−P} + ε_t


and so the number of non-zero eigenvalues of π is between 1 and k − 1. If the number of
non-zero eigenvalues was k , the system would be stationary. If no non-zero eigenvalues
were present, the system would contain k unit roots. The Johansen framework for coin-
tegration analysis uses the magnitude of these eigenvalues to directly test for cointegration.
Additionally, the Johansen methodology allows the number of cointegrating relationships
to be determined from the data directly, a key feature missing from the Engle-Granger two-
step procedure.
The Johansen methodology makes use of two statistics, the trace statistic (λtrace ) and the
maximum eigenvalue statistic (λmax ). Both statistics test functions of the estimated eigen-
values of π but have different null and alternative hypotheses. The trace statistic tests the
null that the number of cointegrating relationships is less than or equal to r against an al-
ternative that the number is greater than r . Define λ̂i , i = 1, 2, . . . , k to be the complex
modulus of the eigenvalues of π̂1 and let them be ordered such that λ1 > λ2 > . . . > λk .16
The trace statistic is defined

λ_trace(r) = −T Σ_{i=r+1}^{k} ln(1 − λ̂_i) .

There are k trace statistics. The trace test is applied sequentially, and the number of
cointegrating relationships is determined by proceeding through the test statistics until the
null cannot be rejected. The first trace statistic, λ_trace(0) = −T Σ_{i=1}^{k} ln(1 − λ̂_i), tests the null


16
The complex modulus is defined as |λ_i| = |a + b i| = √(a² + b²).
of no cointegrating relationships (e.g. k unit roots) against an alternative that the number
of cointegrating relationships is 1 or more. For example, if there were no cointegrating rela-
tionships, each of the eigenvalues would be close to zero and λtrace (0) ≈ 0 since every unit
root “driver” corresponds to a zero eigenvalue in π. When the series are cointegrated, π
will have one or more non-zero eigenvalues.
Like unit root tests, cointegration tests have nonstandard distributions that depend on
the included deterministic terms, if any. Fortunately, most software packages return the
appropriate critical values for the length of the time-series analyzed and any included de-
terministic regressors.
The maximum eigenvalue test examines the null that the number of cointegrating re-
lationships is r against the alternative that the number is r + 1. The maximum eigenvalue
statistic is defined
λmax (r, r + 1) = −T ln(1 − λ̂r +1 )
Intuitively, if there are r + 1 cointegrating relationships, then the r + 1th ordered eigenvalue
should be different from zero and the value of λmax (r, r + 1) should be large. On the other
hand, if there are only r cointegrating relationships, the r + 1th eigenvalue should be close
to zero and the statistic will be small. Again, the distribution is nonstandard but most
statistical packages provide appropriate critical values for the number of observations and
the included deterministic regressors.
The steps to implement the Johansen procedure are:
Step 1: Plot the data series being analyzed and perform univariate unit root testing. A set of
variables can only be cointegrated if they are all integrated. If the series are trending, either
linearly or quadratically, make note of this and remember to include deterministic terms
when estimating the ECM.
Step 2: The second stage is lag length selection. Select the lag length using one of the pro-
cedures outlined in the VAR lag length selection section (General-to-Specific, AIC or SBIC).
For example, to use the General-to-Specific approach, first select a maximum lag length L
and then, starting with l = L , test l lags against l − 1 use a likelihood ratio test,

L R = (T − l · k 2 )(ln |Σl −1 | − ln |Σl |) ∼ χk2 .


Repeat the test decreasing the number of lags (l ) by one each iteration until the LR rejects
the null that the smaller model is appropriate.
Step 3: Estimate the selected ECM,

Δy_t = π y_{t−1} + π_1 Δy_{t−1} + . . . + π_{P−1} Δy_{t−P+1} + ε_t


and determine the rank of π where P is the lag length previously selected. If the levels of
the series appear to be trending, then the model in differences should include a constant
and

Δy_t = π_0 + π y_{t−1} + π_1 Δy_{t−1} + . . . + π_{P−1} Δy_{t−P+1} + ε_t


366 Analysis of Multiple Time Series

should be estimated. Using the λtrace and λmax tests, determine the cointegrating rank of the
system. It is important to check that the residuals are only weakly correlated (so that there are
no important omitted variables), not excessively heteroskedastic (which would affect the size
and power of the procedure), and approximately Gaussian.
Step 4: Analyze the normalized cointegrating vectors to determine whether these conform
to implications of finance theory. Hypothesis tests on the cointegrating vector can also be
performed to examine whether the long-run relationships conform to a particular theory.
Step 5: The final step of the procedure is to assess the adequacy of the model by plotting and
analyzing the residuals. This step should be the final task in the analysis of any time-series
data, not just the Johansen methodology. If the residuals do not resemble white noise, the
model should be reconsidered. If the residuals are stationary but autocorrelated, more lags
may be necessary. If the residuals are I(1), the system may not be cointegrated.
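
As a practical note, the trace and maximum eigenvalue statistics do not need to be computed by hand; assuming statsmodels' coint_johansen function is available (its lr1/lr2 attributes hold the trace and max-eigenvalue statistics and cvt/cvm their critical values), a minimal sketch of Step 3 is:

    import numpy as np
    from statsmodels.tsa.vector_ar.vecm import coint_johansen

    def johansen_summary(y, det_order=0, k_ar_diff=2):
        """Trace and maximum eigenvalue statistics for a k-variate system.

        y          : (T, k) array of I(1) series in levels
        det_order  : -1 no deterministic terms, 0 constant, 1 linear trend
        k_ar_diff  : number of lagged differences included in the ECM
        """
        res = coint_johansen(y, det_order, k_ar_diff)
        for r in range(y.shape[1]):
            print(f"H0: r = {r}  trace = {res.lr1[r]:8.2f} (5% cv {res.cvt[r, 1]:6.2f})"
                  f"  max-eig = {res.lr2[r]:8.2f} (5% cv {res.cvm[r, 1]:6.2f})")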

5.8.4.2 Example: Consumption Aggregate Wealth

To illustrate cointegration and error correction, three series which have played an impor-
tant role in the revival of the CCAPM in recent years will be examined. These three series
are consumption (c ), asset prices (a ) and labor income (y ). The data were made available
by Martin Lettau on his web site,
http://faculty.haas.berkeley.edu/lettau/data_cay.html
and contain quarterly data from 1952:1 until 2009:1.
The Johansen methodology begins by examining the original data for unit roots. Since
it has been clearly established that all series have unit roots, this will be skipped. The next
step tests eigenvalues of π in the error correction model

Δy_t = π_0 + π y_{t−1} + π_1 Δy_{t−1} + π_2 Δy_{t−2} + . . . + π_P Δy_{t−P} + ε_t .

using λtrace and λmax tests. Table 5.4 contains the results of the two tests. These tests are
applied sequentially. However, note that all of the p-vals for the null r = 0 indicate no
significance at conventional levels (5-10%), and so the system appears to contain k unit
roots.17 The Johansen methodology leads to a different conclusion than the Engle-Granger
methodology: there is no evidence these three series are cointegrated. This seems counter-
intuitive, but testing alone cannot provide a reason why this has occurred; only theory can.

5.8.4.3 Single Cointegrating Vector: Engle-Granger Methodology

The Engle-Granger method exploits the key feature of any cointegrated system where there
is a single cointegrating relationship – when data are cointegrated, a linear combination of
17
Had the first null been rejected, the testing would have continued until a null could not be rejected. The
first null not rejected would indicate the cointegrating rank of the system. If all null hypotheses are rejected,
then the original system appears stationary, and a further analysis of the I(1) classification of the original data
is warranted.
Trace Test
Null     Alternative    λ_trace    Crit. Val.    P-val
r = 0    r ≥ 1           16.77       29.79       0.65
r = 1    r ≥ 2            7.40       15.49       0.53
r = 2    r = 3            1.86        3.841      0.17

Max Test
Null     Alternative    λ_max      Crit. Val.    P-val
r = 0    r = 1            9.37       21.13       0.80
r = 1    r = 2            5.53       14.26       0.67
r = 2    r = 3            1.86        3.841      0.17

Table 5.4: Results of testing using the Johansen methodology. Unlike the Engle-Granger
procedure, no evidence of cointegration is found using either test statistic.

the series can be constructed that is stationary. If they are not, any linear combination
will remain I(1). When there are two variables, the Engle-Granger methodology begins by
specifying the cross-section regression

yt = β x t + εt

where β̂ can be estimated using OLS. It may be necessary to include a constant and

yt = β1 + β x t + εt

can be estimated instead if the residuals from the first regression are not mean 0. Once the
coefficients have been estimated, the model residuals, ε̂t , can be tested for the presence of
a unit root. If x t and yt were both I(1) and ε̂t is I(0), the series are cointegrated. The proce-
dure concludes by using ε̂t to estimate the error correction model to estimate parameters
which may be of interest (e.g. the speed of convergence parameters).
Step 1: Begin by analyzing x t and yt in isolation to ensure that they are both integrated. You
should plot the data and perform ADF tests. Remember, variables can only be cointegrated
if they are integrated.
Step 2: Estimate the long-run relationship by fitting

yt = β1 + β2 x t + εt

using OLS and computing the estimated residuals {ε̂t }. Use an ADF test (or DF-GLS for
more power) and test H0 : γ = 0 against H1 : γ < 0 in the regression

Δε̂_t = γ ε̂_{t−1} + δ_1 Δε̂_{t−1} + . . . + δ_P Δε̂_{t−P} + η_t .


It may be necessary to include deterministic trends. Fortunately, standard procedures for


testing time-series for unit roots can be used to examine if this series contains a unit root.
If the null is rejected and ε̂t is stationary, then x t and yt appear to be cointegrated. Alter-
natively, if ε̂t still contains a unit root, the series are not cointegrated.
Step 3: If a cointegrating relationship is found, specify and estimate the error correction
model
" # " # " # " # " # " #
∆x t π01 α1 (yt − β x t ) ∆x t −1 ∆x t −P η1,t
= + + π1 + . . . + πP +
∆yt π02 α2 (yt − β x t ) ∆yt −1 ∆yt −P η2,t

Note that this specification is not linear in its parameters. Both equations have interactions
between the α and β parameters and so OLS cannot be used. Engle and Granger noted that
the term involving β can be replaced with ε̂t = (yt − β x t ),

" # " # " # " # " # " #


∆x t π01 α1 ε̂t ∆x t −1 ∆x t −P η1,t
= + + π1 + . . . + πP + ,
∆yt π02 α2 ε̂t ∆yt −1 ∆yt −P η2,t

and so parameters of these specifications can be estimated using OLS.


Step 4: The final step is to assess the model adequacy and test hypotheses about α1 and
α2 . Standard diagnostic checks including plotting the residuals and examining the ACF
should be used to examine model adequacy. Impulse response functions for the short-run
deviations can be examined to assess the effect of a shock on the deviation of the series
from the long term trend.
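
A minimal sketch of the two-step procedure (using NumPy and statsmodels; function names are illustrative, the coint function supplies Engle-Granger critical values, and lagged differences and deterministic terms are omitted from the ECM for brevity):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.tsa.stattools import coint

    def engle_granger(y, x):
        """Two-step Engle-Granger procedure for a pair of I(1) series."""
        # Steps 1-2: long-run regression y_t = b1 + b2 x_t + e_t and residual unit root test
        X = sm.add_constant(x)
        beta = sm.OLS(y, X).fit().params
        ehat = y - X @ beta
        # coint() uses Engle-Granger critical values; an ordinary ADF table applied
        # to ehat would be too liberal because beta was estimated
        stat, pval, _ = coint(y, x)
        # Step 3: error correction regressions using the lagged residual
        dy, dx, e_lag = np.diff(y), np.diff(x), ehat[:-1]
        ecm_y = sm.OLS(dy, sm.add_constant(e_lag)).fit()
        ecm_x = sm.OLS(dx, sm.add_constant(e_lag)).fit()
        return {"beta": beta, "eg_stat": stat, "eg_pval": pval,
                "alpha_y": ecm_y.params[1], "alpha_x": ecm_x.params[1]}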

5.8.4.4 Cointegration in Consumption, Asset Prices and Income

The Engle-Granger procedure begins by performing unit root tests on the individual series
and examining the data. Table 5.5 and figure 5.5 contain these tests and graphs. The null of
a unit root cannot be rejected in any of the three series and all have time-detrended errors
which appear to be nonstationary.
The next step is to specify the cointegrating regression

c t = β1 + β2 a t + β3 yt + εt

and to estimate the long-run relationship using OLS. The estimated cointegrating vector
from the first stage OLS was [1 − 0.170 − 0.713] which corresponds to a long-run relation-
ship of c t − 0.994 − 0.170a t − 0.713yt . Finally, the residuals were tested for the presence
of a unit root. The results of this test are in the final line of table 5.5 and indicate a strong
rejection of a unit root in the errors. Based on the Engle-Granger procedure, these three
series appear to be cointegrated.
Unit Root Testing on c, a and y

Series     T-stat    P-val    ADF Lags
c          -1.79     0.39        6
a          -1.20     0.68        3
y          -1.66     0.45        1
ε̂_t        -2.91     0.00        2

Table 5.5: Unit root test results. The top three lines contain the results of ADF tests for unit
roots in the three components of c a y : Consumption, Asset Prices and Aggregate Wealth.
None of these series reject the null of a unit root. The final line contains the results of a unit
root test on the estimated residuals where the null is strongly rejected indicating that there
is a cointegrating relationship between the three. The lags column reports the number of
lags used in the ADF procedure as automatically selected using the AIC.

5.8.5 Spurious Regression and Balance

When two related I(1) variables are regressed on one another, the cointegrating relationship
dominates and the regression coefficients can be directly interpreted as the cointegrating
vectors. However, when two unrelated I(1) variables are regressed on one another, the re-
gression coefficient is no longer consistent. For example, let x t and yt be independent
random walk processes.

x t = x t −1 + ηt
and

yt = yt −1 + νt
In the regression

x t = β yt + εt
β̂ is not consistent for 0 despite the independence of x t and yt .
Models that include independent I(1) processes are known as spurious regressions.
When a regression is spurious, the estimated β̂ can take any value and typically has a
t -stat which indicates significance at conventional levels. The solution to this problem is
simple: whenever regressing one I(1) variable on another, always check that the
regression residuals are I(0) and not I(1) – in other words, use the Engle-Granger procedure
as a diagnostic check.
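
A small simulation (a sketch; the sample size and replication count are arbitrary choices) illustrates the problem: even though the two random walks are unrelated, the t-stats are routinely "significant".

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_sim = 250, 1000
t_stats = np.empty(n_sim)
for i in range(n_sim):
    # two independent random walks
    x = np.cumsum(rng.standard_normal(T))
    y = np.cumsum(rng.standard_normal(T))
    X = np.column_stack([np.ones(T), y])
    b = np.linalg.lstsq(X, x, rcond=None)[0]
    e = x - X @ b
    s2 = e @ e / (T - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    t_stats[i] = b[1] / se
# The fraction of |t| > 1.96 is far above 5% despite x and y being unrelated
print(np.mean(np.abs(t_stats) > 1.96))
```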
Balance is another important concept when working with data that contain both stationary and
integrated series. An equation is said to be balanced if all variables have the same order
of integration. The usual case occurs when a stationary variable (I(0)) is related to other
stationary variables. However, other situations may arise, and it is useful to consider the

[Figure 5.5 appears here: "Analysis of cay". Top panel: the original series (logs) – Consumption, Asset Prices and Labor Income – 1954–2004. Bottom panel: the estimated error ε̂t , 1954–2004.]

Figure 5.5: The top panel contains plots of time detrended residuals from regressions of
consumption, asset prices and labor income on a linear time trend. These series may
contain unit roots and are clearly highly persistent. The bottom panel contains a plot of
εt = c t − 0.994 − 0.170a t − 0.713yt which is commonly known as the c a y scaling factor
(pronounced consumption-aggregate wealth). The null of a unit root is rejected using the
Engle-Granger procedure for this series indicating that the three original series are cointe-
grated.

four possibilities:

• I(0) on I(0): The usual case. Standard asymptotic arguments apply. See section 5.9
for more issues in cross-section regression using time-series data.

• I(1) on I(0): This regression is unbalanced. An I(0) variable can never explain the long-
run variation in an I(1) variable. The usual solution is to difference the I(1) variable and
then examine whether the short-run dynamics in the differenced I(1) can be explained by
the I(0).

• I(1) on I(1): One of two outcomes: cointegration or spurious regression.



• I(0) on I(1): This regression is unbalanced. An I(1) variable can never explain the vari-
ation in an I(0) variable, and unbalanced regressions are not meaningful in explain-
ing economic phenomena. Unlike spurious regressions, the t-stat still has a standard
asymptotic distribution, although caution is needed as small sample properties can
be very poor. This is a common problem in finance where a stationary variable, re-
turns on the market, is regressed on a very persistent "predictor" (such as the default
premium or dividend yield).

5.9 Cross-sectional Regression with Time-series Data


Finance often requires cross-sectional regressions to be estimated using data that occur se-
quentially. Chapter 1 used n to index the observations in order to avoid any issues specific
to running regressions with time-series data,

yn = β1 xn1 + β2 xn2 + . . . + βk xn k + εn , (5.37)


The observation index above can be replaced with t to indicate that the data used in the
regression are from a time-series,

yt = β1 x t 1 + β2 x t 2 + . . . + βk x t k + εt . (5.38)
Also recall the five assumptions used in establishing the asymptotic distribution of the pa-
rameter estimates (recast with time-series indices):

Assumption 5.16 (Linearity). yt = xt β + εt

Assumption (Stationary Ergodicity). {(xt , εt )} is a strictly stationary and ergodic sequence.

Assumption (Rank). E[x′t xt ] = ΣXX is non-singular and finite.

Assumption (Martingale Difference). {x′t εt , Ft −1 } is a martingale difference sequence, E[(x j ,t εt )2 ] < ∞, j = 1, 2, . . . , k , t = 1, 2, . . . and S = V[T −1/2 X′ε] is finite and non-singular.

Assumption (Moment Existence). E[x 4j ,t ] < ∞, j = 1, 2, . . . , k , t = 1, 2, . . . and E[ε2t ] = σ2 < ∞, t = 1, 2, . . ..

The key assumption often violated in applications using time-series data is the martingale
difference assumption, that the scores from the linear regression, x′t εt , are a martingale
difference sequence with respect to the time t − 1 information set, Ft −1 . When the scores
are not a MDS, it is usually the case that the errors from the model, εt , can be predicted by
variables in Ft −1 , often their own lagged values. The MDS assumption featured prominently
in the two theorems below about the asymptotic distribution of β̂ and consistent estimators
of its covariance.

Theorem 5.17. Under the five assumptions above,

\[ \sqrt{T}\,(\hat{\beta}_T - \beta) \stackrel{d}{\rightarrow} N\big(0,\ \Sigma_{XX}^{-1} S \Sigma_{XX}^{-1}\big) \tag{5.39} \]

where ΣXX = E[x′t xt ] and S = V[T −1/2 X′ε].



Under the five assumptions above,

\[ \hat{\Sigma}_{XX} = T^{-1}X'X \stackrel{p}{\rightarrow} \Sigma_{XX}, \qquad \hat{S} = T^{-1}\sum_{t=1}^{T} e_t^2\, x_t' x_t = T^{-1}X'\hat{E}X \stackrel{p}{\rightarrow} S \]

and

\[ \hat{\Sigma}_{XX}^{-1}\hat{S}\hat{\Sigma}_{XX}^{-1} \stackrel{p}{\rightarrow} \Sigma_{XX}^{-1} S \Sigma_{XX}^{-1} \]

where Ê = diag(ε̂21 , . . . , ε̂2T ) is a matrix with the squared estimated residuals along the diag-
onal.

The major change when the assumption of martingale difference scores is relaxed is that a
more complicated covariance estimator is required to estimate the variance of β̂ . A mod-
ification of White's covariance estimator is needed to pick up the increase in the long-run
variance due to the predictability of the scores (xt εt ). In essence, the correlation in
the scores reduces the amount of "unique" information present in the data. The standard
covariance estimator assumes that the scores are uncorrelated with their past and so each
contributes its full share to the precision of β̂ .
Heteroskedasticity Autocorrelation Consistent (HAC) covariance estimators address this
issue. Before turning attention to the general case, consider the simple differences that
arise in the estimation of the unconditional mean (a regression on a constant) when the
errors are a white noise process and when the errors follow a MA(1).

5.9.1 Estimating the mean with time-series errors

To understand why a Newey-West estimator may be needed, consider estimating the mean
in two different setups: the first, the standard case, where the shock, {εt }, is assumed to be a
white noise process with variance σ2 , and the second where the shock follows an MA(1).

5.9.1.1 White Noise Errors

Define the data generating process for yt ,

yt = µ + εt
where {εt } is a white noise process. It’s trivial to show that

E[yt ] = µ and V[yt ] = σ2


directly from the white noise assumption. Define the sample mean in the usual way,

\[ \hat{\mu} = T^{-1}\sum_{t=1}^{T} y_t \]

The sample mean is unbiased,


\[ E[\hat{\mu}] = E\Big[T^{-1}\sum_{t=1}^{T} y_t\Big] = T^{-1}\sum_{t=1}^{T} E[y_t] = T^{-1}\sum_{t=1}^{T}\mu = \mu. \]

The variance of the mean estimator exploits the white noise property which ensures E[εi εj ] = 0
whenever i ≠ j .

\begin{align*}
V[\hat{\mu}] &= E\Big[\Big(T^{-1}\sum_{t=1}^{T} y_t - \mu\Big)^2\Big] \\
&= E\Big[\Big(T^{-1}\sum_{t=1}^{T}\varepsilon_t\Big)^2\Big] \\
&= E\Big[T^{-2}\Big(\sum_{t=1}^{T}\varepsilon_t^2 + \sum_{r=1}^{T}\sum_{s=1,r\neq s}^{T}\varepsilon_r\varepsilon_s\Big)\Big] \\
&= T^{-2}\sum_{t=1}^{T}E[\varepsilon_t^2] + T^{-2}\sum_{r=1}^{T}\sum_{s=1,r\neq s}^{T}E[\varepsilon_r\varepsilon_s] \\
&= T^{-2}\sum_{t=1}^{T}\sigma^2 + T^{-2}\sum_{r=1}^{T}\sum_{s=1,r\neq s}^{T}0 \\
&= T^{-2}T\sigma^2 \\
&= \frac{\sigma^2}{T},
\end{align*}

and so V[µ̂] = σ2 /T – the standard result.

5.9.1.2 MA(1) errors

Consider a modification of the original model where the error process ({ηt }) is a mean zero
MA(1) constructed from white noise shocks ({εt }).

ηt = θ εt −1 + εt
The properties of the error can be easily derived. The mean is 0,
E[ηt ] = E[θ εt −1 + εt ] = θ E[εt −1 ] + E[εt ] = θ 0 + 0 = 0

and the variance depends on the MA parameter,

\begin{align*}
V[\eta_t] &= E[(\theta\varepsilon_{t-1} + \varepsilon_t)^2] \\
&= E[\theta^2\varepsilon_{t-1}^2 + 2\theta\varepsilon_t\varepsilon_{t-1} + \varepsilon_t^2] \\
&= E[\theta^2\varepsilon_{t-1}^2] + 2\theta E[\varepsilon_t\varepsilon_{t-1}] + E[\varepsilon_t^2] \\
&= \theta^2\sigma^2 + 2\theta\cdot 0 + \sigma^2 \\
&= \sigma^2(1 + \theta^2).
\end{align*}

The DGP for yt has the same form,

yt = µ + ηt

and the mean and variance of yt are

E[yt ] = µ and V[yt ] = V[ηt ] = σ2 (1 + θ 2 ).

These follow from the derivations in chapter 4 for the MA(1) model. More importantly, the
usual mean estimator is unbiased.

\[ \hat{\mu} = T^{-1}\sum_{t=1}^{T} y_t \]

\[ E[\hat{\mu}] = E\Big[T^{-1}\sum_{t=1}^{T} y_t\Big] = T^{-1}\sum_{t=1}^{T}E[y_t] = T^{-1}\sum_{t=1}^{T}\mu = \mu, \]

although its variance is different. The difference is due to the fact that ηt is autocorrelated
and so E[ηt ηt −1 ] 6= 0.

\begin{align*}
V[\hat{\mu}] &= E\Big[\Big(T^{-1}\sum_{t=1}^{T}y_t - \mu\Big)^2\Big] \\
&= E\Big[\Big(T^{-1}\sum_{t=1}^{T}\eta_t\Big)^2\Big] \\
&= E\Big[T^{-2}\Big(\sum_{t=1}^{T}\eta_t^2 + 2\sum_{t=1}^{T-1}\eta_t\eta_{t+1} + 2\sum_{t=1}^{T-2}\eta_t\eta_{t+2} + \ldots + 2\sum_{t=1}^{2}\eta_t\eta_{t+T-2} + 2\sum_{t=1}^{1}\eta_t\eta_{t+T-1}\Big)\Big] \\
&= T^{-2}\sum_{t=1}^{T}E[\eta_t^2] + 2T^{-2}\sum_{t=1}^{T-1}E[\eta_t\eta_{t+1}] + 2T^{-2}\sum_{t=1}^{T-2}E[\eta_t\eta_{t+2}] + \ldots + 2T^{-2}\sum_{t=1}^{2}E[\eta_t\eta_{t+T-2}] + 2T^{-2}\sum_{t=1}^{1}E[\eta_t\eta_{t+T-1}] \\
&= T^{-2}\sum_{t=1}^{T}\gamma_0 + 2T^{-2}\sum_{t=1}^{T-1}\gamma_1 + 2T^{-2}\sum_{t=1}^{T-2}\gamma_2 + \ldots + 2T^{-2}\sum_{t=1}^{2}\gamma_{T-2} + 2T^{-2}\sum_{t=1}^{1}\gamma_{T-1}
\end{align*}

where γ0 = E[η2t ] = V[ηt ] and γs = E[ηt ηt −s ]. The two which are non-zero in this specifi-
cation are γ0 and γ1 .

γ1 = E[ηt ηt −1 ]
= E[(θ εt −1 + εt )(θ εt −2 + εt −1 )]
= E[θ 2 εt −1 εt −2 + θ ε2t −1 + θ εt εt −2 + εt εt −1 ]
= θ 2 E[εt −1 εt −2 ] + θ E[ε2t −1 ] + θ E[εt εt −2 ] + E[εt εt −1 ]
= θ 2 0 + θ σ2 + θ 0 + 0
= θ σ2

since γs = 0, s ≥ 2 in a MA(1). Returning to the variance of µ̂,

\begin{align*}
V[\hat{\mu}] &= T^{-2}\sum_{t=1}^{T}\gamma_0 + 2T^{-2}\sum_{t=1}^{T-1}\gamma_1 \\
&= T^{-2}T\gamma_0 + 2T^{-2}(T-1)\gamma_1 \\
&\approx \frac{\gamma_0 + 2\gamma_1}{T}
\end{align*}
and so when the errors are autocorrelated, the usual mean estimator will have a different
variance, one which reflects the dependence in the errors, and so it is not the case that

\[ V[\hat{\mu}] = \frac{\gamma_0}{T}. \]
This simple illustration captures the basic idea behind the Newey-West covariance es-
timator, which is defined,

\[ \hat{\sigma}^2_{NW} = \hat{\gamma}_0 + 2\sum_{l=1}^{L}\Big(1 - \frac{l}{L+1}\Big)\hat{\gamma}_l . \]

When L = 1, the only weight is 2(1 − 1/2) = 1 and σ̂2NW = γ̂0 + γ̂1 , which is different from the
variance in the MA(1) error example. However, as L increases, the weight on γ̂1 converges to
2 since lim_{L→∞} 1 − 1/(L + 1) = 1, and the Newey-West covariance estimator will, asymptotically,
include the important terms from the covariance, γ0 + 2γ1 , with the correct weights. What
happens when we use σ̂2NW instead of the usual variance estimator? As L grows large,

\[ \hat{\sigma}^2_{NW} \rightarrow \gamma_0 + 2\gamma_1 \]

and the variance of the estimated mean can be estimated using σ̂2NW ,

\[ V[\hat{\mu}] = \frac{\gamma_0 + 2\gamma_1}{T} \approx \frac{\hat{\sigma}^2_{NW}}{T}. \]
As a general principle, the variance of a sum is not the sum of the variances – the two are
only equal when the errors are uncorrelated. Using a HAC covariance estimator allows for
time-series dependence and leads to correct inference as long as L grows with the sample
size.18
It is tempting to estimate γ̂0 and γ̂1 and to use the natural estimator σ̂2HAC = γ̂0 + 2γ̂1 .
Unfortunately this estimator is not guaranteed to be positive, while the Newey-West esti-
mator, γ̂0 + γ̂1 (when L = 1), is always (weakly) positive. More generally, for any choice of
L , the Newey-West covariance estimator, σ̂2NW , is guaranteed to be positive while the un-
weighted estimator, σ̂2HAC = γ̂0 + 2γ̂1 + 2γ̂2 + . . . + 2γ̂L , is not. This ensures that the variance
estimator passes a minimal sanity check.
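
As a sketch, the Newey-West long-run variance of a sample mean can be computed as follows (the function and variable names are illustrative):

```python
import numpy as np

def newey_west_variance(y, L):
    """Newey-West estimate of the long-run variance of the sample mean of y."""
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    e = y - y.mean()
    # gamma_hat_l = T^{-1} sum_{t=l+1}^{T} e_t e_{t-l}
    gammas = np.array([e[l:] @ e[:T - l] / T for l in range(L + 1)])
    weights = 1.0 - np.arange(1, L + 1) / (L + 1)
    sigma2_nw = gammas[0] + 2.0 * weights @ gammas[1:]
    return sigma2_nw

# The variance of the sample mean is then estimated by sigma2_nw / T.
```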

5.9.2 Estimating the variance of β̂ when the errors are autocorrelated

There are two solutions to working with cross-section data that have autocorrelated errors.
The direct method is to change a cross-sectional model in to a time-series model which
includes both contemporaneous effects of xt as well as lagged values of yt and possibly
xt . This approach will need to include enough lags so that the errors are white noise. If
this can be accomplished, White’s heteroskedasticity (but not autocorrelation) consistent
covariance estimator can be used and the problem has been solved. The second approach
modifies the covariance estimator to explicitly capture the dependence in the data.
18 Allowing L to grow at the rate T^{1/3} has been shown to be optimal in a certain sense related to testing.

The key to White’s estimator of S,

\[ \hat{S} = T^{-1}\sum_{t=1}^{T} e_t^2\, x_t' x_t \]

is that it explicitly captures the dependence between the e t2 and x0t xt . Heteroskedasticity
Autocorrelation Consistent estimators work similarly by capturing both the dependence
between the e t2 and x0t xt (heteroskedasticity) and the dependence between the xt e t and
xt − j e t − j (autocorrelation). A HAC estimator for a linear regression can be defined as an
estimator of the form

\begin{align*}
\hat{S}_{NW} &= T^{-1}\Bigg(\sum_{t=1}^{T} e_t^2\, x_t'x_t + \sum_{l=1}^{L} w_l\Big(\sum_{s=l+1}^{T} e_s e_{s-l}\, x_s'x_{s-l} + \sum_{q=l+1}^{T} e_{q-l}e_q\, x_{q-l}'x_q\Big)\Bigg) \tag{5.40} \\
&= \hat{\Gamma}_0 + \sum_{l=1}^{L} w_l\big(\hat{\Gamma}_l + \hat{\Gamma}_{-l}\big) \\
&= \hat{\Gamma}_0 + \sum_{l=1}^{L} w_l\big(\hat{\Gamma}_l + \hat{\Gamma}_l'\big)
\end{align*}

where {wl } are a set of weights. The Newey-West estimator is produced by setting wl =
1 − l/(L + 1). Other estimators can be computed using different weighting schemes.
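
A sketch of the full sandwich estimator for regression coefficients follows (X is assumed to be the T by k regressor matrix and e the OLS residuals; the function name is illustrative):

```python
import numpy as np

def hac_covariance(X, e, L):
    """Newey-West (HAC) covariance of the OLS estimator b_hat = (X'X)^{-1} X'y."""
    T, k = X.shape
    xe = X * e[:, None]                     # rows are the scores x_t' e_t
    S = xe.T @ xe / T                       # Gamma_0
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1)
        Gamma_l = xe[l:].T @ xe[:-l] / T
        S += w * (Gamma_l + Gamma_l.T)
    Sigma_XX_inv = np.linalg.inv(X.T @ X / T)
    # Asymptotic covariance of sqrt(T)(b_hat - b); divide by T for b_hat itself
    return Sigma_XX_inv @ S @ Sigma_XX_inv
```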

5.A Cointegration in a trivariate VAR

This section details how to

• Determine whether a trivariate VAR is cointegrated

• Determine the number of cointegrating vectors in a cointegrated system

• Decompose the π matrix in to α, the adjustment coefficient, and β , the cointegrating


vectors.

5.A.1 Stationary VAR

Consider the VAR(1):

\[ \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} .9 & -.4 & .2 \\ .2 & .8 & -.3 \\ .5 & .2 & .1 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

An easy method to determine the stationarity of this VAR is to compute the eigenvalues of
the parameter matrix. If the eigenvalues are all less than one in modulus, the VAR(1) is
stationary. These values are 0.97, 0.62, and 0.2. Since these are all less than one, the model
is stationary. An alternative method is to transform the model into an ECM,

\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \left(\begin{bmatrix} .9 & -.4 & .2 \\ .2 & .8 & -.3 \\ .5 & .2 & .1 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]
\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \begin{bmatrix} -.1 & -.4 & .2 \\ .2 & -.2 & -.3 \\ .5 & .2 & -.9 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]
\[ \Delta \mathbf{w}_t = \pi\mathbf{w}_{t-1} + \varepsilon_t \]

where wt is a vector composed of x t , yt and z t . The rank of the parameter matrix π can be
determined by transforming it into row-echelon form.

\[ \begin{bmatrix} -0.1 & -0.4 & 0.2 \\ 0.2 & -0.2 & -0.3 \\ 0.5 & 0.2 & -0.9 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 4 & -2 \\ 0.2 & -0.2 & -0.3 \\ 0.5 & 0.2 & -0.9 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 4 & -2 \\ 0 & -1 & 0.1 \\ 0 & -1.8 & 0.1 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 4 & -2 \\ 0 & 1 & -0.1 \\ 0 & -1.8 & 0.1 \end{bmatrix} \]
\[ \Rightarrow \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -0.1 \\ 0 & 0 & -0.08 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -0.1 \\ 0 & 0 & 1 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

Since the π matrix is full rank, the system must be stationary.
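
Both checks are simple to carry out numerically. The following sketch (using numpy, with the values from this example) computes the eigenvalues of the parameter matrix and the rank of π:

```python
import numpy as np

# Sketch: check stationarity of the VAR(1) and the rank of pi
Phi = np.array([[0.9, -0.4, 0.2],
                [0.2,  0.8, -0.3],
                [0.5,  0.2,  0.1]])
eigenvalues = np.linalg.eigvals(Phi)
print(np.abs(eigenvalues))          # all moduli less than one => stationary

pi = Phi - np.eye(3)                # ECM parameter matrix
print(np.linalg.matrix_rank(pi))    # 3 (full rank) => stationary system
```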

5.A.2 Independent Unit Roots

This example is trivial,

\[ \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

The eigenvalues are trivially 1 and the ECM is given by

\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \left(\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]
\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

and the rank of π is obviously 0, so these are three independent unit roots.

5.A.3 Cointegrated with 1 cointegrating relationship

Consider the VAR(1):

\[ \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ -0.16 & 1.08 & 0.08 \\ 0.36 & -0.18 & 0.82 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

the eigenvalues of the parameter matrix are 1, 1 and .7. The ECM form of this model is

\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \left(\begin{bmatrix} 0.8 & 0.1 & 0.1 \\ -0.16 & 1.08 & 0.08 \\ 0.36 & -0.18 & 0.82 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]
\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \begin{bmatrix} -0.2 & 0.1 & 0.1 \\ -0.16 & 0.08 & 0.08 \\ 0.36 & -0.18 & -0.18 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

and the eigenvalues of π are 0, 0 and -.3 indicating it has rank one. Remember, in a cointe-
grated system, the number of cointegrating vectors is the rank of π. In this example, there
is one cointegrating vector, which can be solved for by transforming π into row-echelon
form,

\[ \begin{bmatrix} -0.2 & 0.1 & 0.1 \\ -0.16 & 0.08 & 0.08 \\ 0.36 & -0.18 & -0.18 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & -0.5 & -0.5 \\ -0.16 & 0.08 & 0.08 \\ 0.36 & -0.18 & -0.18 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & -0.5 & -0.5 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]

So β = [1 −0.5 −0.5]′ and α can be solved for by noting that

\[ \pi = \alpha\beta' = \begin{bmatrix} \alpha_1 & -\tfrac{1}{2}\alpha_1 & -\tfrac{1}{2}\alpha_1 \\ \alpha_2 & -\tfrac{1}{2}\alpha_2 & -\tfrac{1}{2}\alpha_2 \\ \alpha_3 & -\tfrac{1}{2}\alpha_3 & -\tfrac{1}{2}\alpha_3 \end{bmatrix} \]

and so α = [−.2 −.16 0.36]′ is the first column of π.

5.A.4 Cointegrated with 2 cointegrating relationships


Consider the VAR(1):

\[ \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} 0.3 & 0.4 & 0.3 \\ 0.1 & 0.5 & 0.4 \\ 0.2 & 0.2 & 0.6 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

the eigenvalues of the parameter matrix are 1, .2+.1i and .2-.1i . The ECM form of this
model is

\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \left(\begin{bmatrix} 0.3 & 0.4 & 0.3 \\ 0.1 & 0.5 & 0.4 \\ 0.2 & 0.2 & 0.6 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\right)\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]
\[ \begin{bmatrix} \Delta x_t \\ \Delta y_t \\ \Delta z_t \end{bmatrix} = \begin{bmatrix} -0.7 & 0.4 & 0.3 \\ 0.1 & -0.5 & 0.4 \\ 0.2 & 0.2 & -0.4 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \varepsilon_{3,t} \end{bmatrix} \]

and the eigenvalues of π are 0, −0.8 + 0.1i and −0.8 − 0.1i indicating it has rank two (note
that two of the eigenvalues are complex). Remember, in a cointegrated system, the num-
ber of cointegrating vectors is the rank of π. In this example, there are two cointegrating
vectors, which can be solved for by transforming π into row-echelon form,

\[ \begin{bmatrix} -0.7 & 0.4 & 0.3 \\ 0.1 & -0.5 & 0.4 \\ 0.2 & 0.2 & -0.4 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & -0.57143 & -0.42857 \\ 0.1 & -0.5 & 0.4 \\ 0.2 & 0.2 & -0.4 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & -0.57143 & -0.42857 \\ 0 & -0.44286 & 0.44286 \\ 0 & 0.31429 & -0.31429 \end{bmatrix} \]
\[ \Rightarrow \begin{bmatrix} 1 & -0.57143 & -0.42857 \\ 0 & 1 & -1 \\ 0 & 0.31429 & -0.31429 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{bmatrix} \]

β is the transpose of the first two rows of the row-echelon form,

\[ \beta = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix} \]
and α can be solved for by noting that

\[ \pi = \alpha\beta' = \begin{bmatrix} \alpha_{11} & \alpha_{12} & -\alpha_{11}-\alpha_{12} \\ \alpha_{21} & \alpha_{22} & -\alpha_{21}-\alpha_{22} \\ \alpha_{31} & \alpha_{32} & -\alpha_{31}-\alpha_{32} \end{bmatrix} \]

and so

\[ \alpha = \begin{bmatrix} -0.7 & 0.4 \\ 0.1 & -0.5 \\ 0.2 & 0.2 \end{bmatrix} \]

is the first two columns of π.
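
The decomposition can also be verified numerically. The sketch below (the normalization of β is taken from the text; everything else is an illustrative choice) recovers the rank of π and the α that corresponds to this β:

```python
import numpy as np

Phi = np.array([[0.3, 0.4, 0.3],
                [0.1, 0.5, 0.4],
                [0.2, 0.2, 0.6]])
pi = Phi - np.eye(3)
print(np.linalg.matrix_rank(pi))     # 2 cointegrating vectors

# With beta normalized as in the text, pi = alpha beta', so
# alpha = pi beta (beta' beta)^{-1}
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, -1.0]])
alpha = pi @ beta @ np.linalg.inv(beta.T @ beta)
print(alpha)                          # equals the first two columns of pi
```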



Exercises

Exercise 5.1. Consider the estimation of a mean where the errors are a white noise process.

i. Show that the usual mean estimator is unbiased and derive its variance without as-
suming the errors are i.i.d.
Now suppose error process follows an MA(1) so that εt = νt + θ1 νt −1 where νt is a
WN process.

ii. Show that the usual mean estimator is still unbiased and derive the variance of the
mean.
Suppose that {η1,t } and {η2,t } are two sequences of uncorrelated i.i.d. standard nor-
mal random variables.

x t = η1,t + θ11 η1,t −1 + θ12 η2,t −1


yt = η2,t + θ21 η1,t −1 + θ22 η2,t −1

iii. What are E t [x t +1 ] and E t [x t +2 ]?

iv. Define the autocovariance matrix of a vector process.

v. Compute the autocovariance matrix Γ j for j = 0, ±1.

Exercise 5.2. Consider an AR(1)

i. What are the two types of stationarity? Provide precise definitions.

ii. Which of the following bivariate Vector Autoregressions are stationary? If they are not
stationary are they cointegrated, independent unit roots or explosive? Assume
" #
ε1t i.i.d.
∼ N (0, I2 )
ε2t

Recall that the eigenvalues of a 2×2 non-triangular matrix

\[ \pi = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix} \]

can be solved for using the two-equation, two-unknown system λ1 + λ2 = π11 + π22 and
λ1 λ2 = π11 π22 − π12 π21 .

(a)
\[ \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} 1.4 & .4 \\ -.6 & .4 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]

(b)
\[ \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]

(c)
\[ \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} .8 & 0 \\ .2 & .4 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]

iii. What are spurious regression and balance?

iv. Why is spurious regression a problem?

v. Briefly outline the steps needed to test for a spurious regression in

yt = β1 + β2 x t + εt .

Exercise 5.3. Consider the AR(2)

yt = φ1 yt −1 + φ2 yt −2 + εt

i. Rewrite the model with ∆yt on the left-hand side and yt −1 and ∆yt −1 on the right-
hand side.

ii. What restrictions are needed on φ1 and φ2 for this model to collapse to an AR(1) in
the first differences?

iii. When the model collapses, what does this tell you about yt ?
Consider the VAR(1)

x t = x t −1 + ε1,t
yt = β x t −1 + ε2,t

where {εt } is a vector white noise process.


i. Are x t and yt cointegrated?

ii. Write this model in error correction form.


Consider the VAR(1)

\[ \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} 0.4 & 0.3 \\ 0.8 & 0.6 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{bmatrix} \]

where {εt } is a vector white noise process.

i. How can you verify that x t and yt are cointegrated?

ii. Write this model in error correction form.

iii. Compute the speed of adjustment coefficient α and the cointegrating vector β where
the β on x t is normalized to 1.

Exercise 5.4. Data on interest rates on US government debt was collected for 3-month
(3M O ) T-bills, and 3-year (3Y R ) and 10-year (10Y R ) bonds from 1957 until 2009. Using
these three series, the following variables were defined

Level 3M O
Slope 10Y R − 3M O
Curvature (10Y R − 3Y R ) − (3Y R − 3M O )

i. In terms of VAR analysis, does it matter whether the original data or the level-slope-
curvature model is fit? Hint: Think about reparameterizations between the two.
Granger Causality analysis was performed on this set and the p-vals were

                     Level_{t-1}   Slope_{t-1}   Curvature_{t-1}
Level_t                 0.000         0.244            0.000
Slope_t                 0.000         0.000            0.000
Curvature_t             0.000         0.000            0.000
All (excl. self)        0.000         0.000            0.000

ii. Interpret this table.

iii. When constructing impulse response graphs the selection of the covariance of the
shocks is important. Outline the alternatives and describe situations when each may
be preferable.

iv. Figure 5.6 contains the impulse response curves for this model. Interpret the graph.
Also comment on why the impulse responses can all be significantly different from 0
in light of the Granger Causality table.

v. Why are some of the “0” lag impulses 0 while other aren’t?

Exercise 5.5. Answer the following questions:



[Figure 5.6 appears here: "Level-Slope-Curvature Impulse Response". A 3×3 grid of impulse responses over horizons 0–20: rows are the Level, Slope and Curvature responses; columns are Level, Slope and Curvature shocks.]

Figure 5.6: Impulse response functions and 95% confidence intervals for the level-slope-
curvature exercise.

i. Consider the AR(2)


yt = φ1 yt −1 + φ2 yt −2 + εt

(a) Rewrite the model with ∆yt on the left-hand side and yt −1 and ∆yt −1 on the right-
hand side.
(b) What restrictions are needed on φ1 and φ2 for this model to collapse to an AR(1)
in the first differences?
(c) When the model collapses, what does this tell you about yt ?

ii. Consider the VAR(1)

x t = x t −1 + ε1,t
yt = β x t −1 + ε2,t

where {εt } is a vector white noise process.

(a) Are x t and yt cointegrated?


(b) Write this model in error correction form.

iii. Consider the VAR(1)

\[ \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} 0.625 & -0.3125 \\ -0.75 & 0.375 \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{bmatrix} \]

where {εt } is a vector white noise process.

(a) How can you verify that x t and yt are cointegrated?


(b) Write this model in error correction form.
(c) Compute the speed of adjustment coefficient α and the cointegrating vector β
where the β on x t is normalized to 1.

Exercise 5.6. Consider the estimation of a mean where the errors are a white noise process.

i. Show that the usual mean estimator is unbiased and derive its variance without as-
suming the errors are i.i.d.

Now suppose error process follows an AR(1) so that εt = ρεt −1 + νt where {νt } is a
WN process.

ii. Show that the usual mean estimator is still unbiased and derive the variance of the
sample mean.

iii. What is Granger Causality and how is it useful in Vector Autoregression analysis? Be
as specific as possible.

Suppose that {η1,t } and {η2,t } are two sequences of uncorrelated i.i.d. standard nor-
mal random variables.

x t = η1,t + θ11 η1,t −1 + θ12 η2,t −1


yt = η2,t + θ21 η1,t −1 + θ22 η2,t −1

iv. Define the autocovariance matrix of a vector process.

v. Compute the autocovariance matrix Γ j for j = 0, ±1.

vi. The AIC, HQC and SBIC were computed for a bivariate VAR with lag length ranging
from 0 to 12 and are in the table below. Which model is selected by each?

Lag Length AIC HQC SBIC


0 2.1916 2.1968 2.2057
1 0.9495 0.9805 1.0339
2 0.9486 1.0054 1.1032
3 0.9716 1.0542 1.1965
4 0.9950 1.1033 1.2900
5 1.0192 1.1532 1.3843
6 1.0417 1.2015 1.4768
7 1.0671 1.2526 1.5722
8 1.0898 1.3010 1.6649
9 1.1115 1.3483 1.7564
10 1.1331 1.3956 1.8478
11 1.1562 1.4442 1.9406
12 1.1790 1.4926 2.0331
Chapter 6

Generalized Method Of Moments (GMM)

Note: The primary reference text for these notes is Hall (2005). Alternative, but less compre-
hensive, treatments can be found in chapter 14 of Hamilton (1994) or some sections of chap-
ter 4 of Greene (2007). For an excellent perspective of GMM from a finance point of view, see
chapters 10, 11 and 13 in Cochrane (2001).
Generalized Method of Moments is a broadly applicable parameter es-
timation strategy which nests the classic method of moments, linear re-
gression and maximum likelihood. This chapter discusses the specification of
moment conditions – the building blocks of GMM estimation – estimation,
inference and specification testing. These ideas are illustrated through
three examples: estimation of a consumption asset pricing model, linear
factor models and stochastic volatility.
Generalized Method of Moments (GMM) is an estimation procedure that allows eco-
nomic models to be specified while avoiding often unwanted or unnecessary assumptions,
such as specifying a particular distribution for the errors. This lack of structure means
GMM is widely applicable, although this generality comes at the cost of a number of is-
sues, the most important of which is questionable small sample performance. This chapter
introduces the GMM estimation strategy and discusses specification, estimation, inference and
testing.

6.1 Classical Method of Moments


The classical method of moments, or simply method of moments, uses sample moments
to estimate unknown parameters. For example, suppose a set of T observations, y1 , . . . , yT
are i.i.d. Poisson with intensity parameter λ. Since E[yt ] = λ, a natural method to estimate
the unknown parameter is to use the sample average,

\[ \hat{\lambda} = T^{-1}\sum_{t=1}^{T} y_t \tag{6.1} \]

which converges to λ as the sample size grows large. In the case of Poisson data, the mean
is not the only moment which depends on λ, and so it is possible to use other moments to
learn about the intensity. For example, the variance V[yt ] = λ also depends on λ and so
E[yt2 ] = λ2 + λ. This can be used to estimate λ since

\[ \lambda + \lambda^2 = E\Big[T^{-1}\sum_{t=1}^{T} y_t^2\Big] \tag{6.2} \]

and, using the quadratic formula, an estimate of λ can be constructed as

\[ \hat{\lambda} = \frac{-1 + \sqrt{1 + 4\overline{y^2}}}{2} \tag{6.3} \]

where \(\overline{y^2} = T^{-1}\sum_{t=1}^{T} y_t^2\). Other estimators for λ could similarly be constructed by com-
puting higher order moments of yt .1 These estimators are method of moments estimators
since they use sample moments to estimate the parameter of interest. Generalized Method
of Moments (GMM) extends the classical setup in two important ways. The first is to for-
mally treat the problem of having two or more moment conditions which have informa-
tion about unknown parameters. GMM allows estimation and inference in systems of Q
equations with P unknowns, P ≤ Q . The second important generalization of GMM is that
quantities other than sample moments can be used to estimate the parameters. GMM ex-
ploits laws of large numbers and central limit theorems to establish regularity conditions
for many different “moment conditions” that may or may not actually be moments. These
two changes produce a class of estimators that is broadly applicable. Section 6.7 shows
that the classical method of moments, ordinary least squares and maximum likelihood are
all special cases of GMM.
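
The two Poisson estimators are easily compared in a small simulation (a sketch; the intensity and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=10_000)

# Method of moments using the first moment: E[y_t] = lambda
lam_hat_1 = y.mean()

# Method of moments using the second moment: E[y_t^2] = lambda + lambda^2
y2_bar = np.mean(y**2)
lam_hat_2 = (-1.0 + np.sqrt(1.0 + 4.0 * y2_bar)) / 2.0

print(lam_hat_1, lam_hat_2)   # both converge to 3.0 as T grows
```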

6.2 Examples
Three examples will be used throughout this chapter. The first is a simple consumption
asset pricing model. The second is the estimation of linear asset pricing models and the
final is the estimation of a stochastic volatility model.

6.2.1 Consumption Asset Pricing

GMM was originally designed as a solution to a classic problem in asset pricing: how can
a consumption based model be estimated without making strong assumptions on the dis-
tribution of returns? This example is based on Hansen & Singleton (1982), a model which
builds on Lucas (1978).

1 The quadratic formula has two solutions. It is simple to verify that the other solution, \((-1 - \sqrt{1 + 4\overline{y^2}})/2\), is negative and so cannot be the intensity of a Poisson process.

The classic consumption based asset pricing model assumes that a representative agent
maximizes the conditional expectation of their lifetime discounted utility,

\[ E_t\Big[\sum_{i=0}^{\infty}\beta^i U(c_{t+i})\Big] \tag{6.4} \]

where β is the discount rate (rate of time preference) and U (·) is a strictly concave utility
function. Agents allocate assets between N risky assets and face the budget constraint

\[ c_t + \sum_{j=1}^{N} p_{j,t}\,q_{j,t} = \sum_{j=1}^{N} R_{j,t}\,q_{j,t-m_j} + w_t \tag{6.5} \]

where c t is consumption, p j ,t and q j ,t are price and quantity of asset j , j = 1, 2, . . . , N , R j ,t


is the time t payoff of holding asset j purchased in period t − m j , q j ,t −m j is the amount
purchased in period t −m j and w t is real labor income. The budget constraint requires that
consumption plus asset purchases (LHS) is equal to portfolio wealth plus labor income.
Solving this model produces a standard Euler equation,
\[ p_{j,t}U'(c_t) = \beta^{m_j}E_t\big[R_{j,t+m_j}U'(c_{t+m_j})\big] \tag{6.6} \]

which is true for all assets and all time periods. This Euler equation states that the utility
foregone by purchasing an asset at p j ,t must equal the discounted expected utility gained
from holding that asset in period t + m j . The key insight of Hansen & Singleton (1982) is
that this simple condition has many testable implications, mainly that

\[ E_t\bigg[\beta^{m_j}\frac{R_{j,t+m_j}}{p_{j,t}}\frac{U'(c_{t+m_j})}{U'(c_t)} - 1\bigg] = 0 \tag{6.7} \]

Note that R_{j,t+m_j}/p_{j,t} is the gross rate of return for asset j (1 plus the net rate of return). Since
the Euler equation holds for all time horizons, it is simplest to reduce it to a one-period
problem. Defining r_{j,t+1} to be the net rate of return one period ahead for asset j ,

\[ E_t\bigg[\beta\big(1 + r_{j,t+1}\big)\frac{U'(c_{t+1})}{U'(c_t)} - 1\bigg] = 0 \tag{6.8} \]
which provides a simple testable implication of this model. This condition must be true for
any asset j which provides a large number of testable implications by replacing the returns
of one series with those of another. Moreover, the initial expectation is conditional which
produces further implications for the model. Not only is the Euler equation required to
have mean zero, it must be uncorrelated with any time t instrument z t , and so it must also
be the case that

\[ E\bigg[\bigg(\beta\big(1 + r_{j,t+1}\big)\frac{U'(c_{t+1})}{U'(c_t)} - 1\bigg)z_t\bigg] = 0. \tag{6.9} \]

The use of conditioning information can be used to construct a huge number of testable
restrictions. This model is completed by specifying the utility function to be CRRA,

\[ U(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma} \tag{6.10} \]
\[ U'(c_t) = c_t^{-\gamma} \tag{6.11} \]

where γ is the coefficient of relative risk aversion. With this substitution, the testable im-
plications are

\[ E\bigg[\bigg(\beta\big(1 + r_{j,t+1}\big)\Big(\frac{c_{t+1}}{c_t}\Big)^{-\gamma} - 1\bigg)z_t\bigg] = 0 \tag{6.12} \]

where z t is any t available instrument (including a constant, which will produce an un-
conditional restriction).

6.2.2 Linear Factor Models


Linear factor models are widely popular in finance due to their ease of estimation using the
Fama & MacBeth (1973) methodology and the Shanken (1992) correction. However, Fama-
MacBeth, even with the correction, has a number of problems; the most important is that
the assumptions underlying the Shanken correction are not valid for heteroskedastic asset
pricing models and so the modified standard errors are not consistent. GMM provides
a simple method to estimate linear asset pricing models and to make correct inference
under weaker conditions than those needed to derive the Shanken correction. Consider
the estimation of the parameters of the CAPM using two assets. This model contains three
parameters: the two β s, measuring the risk sensitivity, and λm , the market price of risk.
These three parameters are estimated using four equations,

\[
\begin{aligned}
r^e_{1t} &= \beta_1 r^e_{mt} + \varepsilon_{1t} \\
r^e_{2t} &= \beta_2 r^e_{mt} + \varepsilon_{2t} \\
r^e_{1t} &= \beta_1\lambda_m + \eta_{1t} \\
r^e_{2t} &= \beta_2\lambda_m + \eta_{2t}
\end{aligned} \tag{6.13}
\]

where r^e_{j,t} is the excess return to asset j , r^e_{m,t} is the excess return to the market and ε_{j,t} and
η_{j,t} are errors.
These equations should look familiar; they are the Fama-Macbeth equations. The first
two – the “time-series” regressions – are initially estimated using OLS to find the values for
β j , j = 1, 2 and the last two – the “cross-section” regression – are estimated conditioning
on the first stage β s to estimate the price of risk. The Fama-MacBeth estimation procedure

can be used to generate a set of equations that should have expectation zero at the correct
parameters. The first two come from the initial regressions (see chapter 3),

\[
\begin{aligned}
(r^e_{1t} - \beta_1 r^e_{mt})\,r^e_{mt} &= 0 \\
(r^e_{2t} - \beta_2 r^e_{mt})\,r^e_{mt} &= 0
\end{aligned} \tag{6.14}
\]

and the last two come from the second stage regressions

\[
\begin{aligned}
r^e_{1t} - \beta_1\lambda_m &= 0 \\
r^e_{2t} - \beta_2\lambda_m &= 0
\end{aligned} \tag{6.15}
\]

This set of equations consists of three unknowns and four equations and so cannot be directly
estimated using least squares. One of the main advantages of GMM is that it allows esti-
mation in systems where the number of unknowns is smaller than the number of moment
conditions, and to test whether the moment conditions hold (all conditions not signifi-
cantly different from 0).

6.2.3 Stochastic Volatility Models

Stochastic volatility is an alternative framework to ARCH for modeling conditional het-


eroskedasticity. The primary difference between the two is the inclusion of 2 (or more)
shocks in stochastic volatility models. The inclusion of the additional shock makes stan-
dard likelihood-based methods, like those used to estimate ARCH-family models, infeasi-
ble. GMM was one of the first methods used to estimate these models. GMM estimators
employ a set of population moment conditions to determine the unknown parameters of
the models. The simplest stochastic volatility model is known as the log-normal SV model,

\[ r_t = \sigma_t\varepsilon_t \tag{6.16} \]
\[ \ln\sigma_t^2 = \omega + \rho\big(\ln\sigma_{t-1}^2 - \omega\big) + \sigma_\eta\eta_t \tag{6.17} \]

where (εt , ηt ) are i.i.d. standard normal, (εt , ηt )′ ∼ N(0, I2 ). The first equation specifies the distri-
bution of returns as heteroskedastic normal. The second equation specifies the dynamics
of the log of volatility as an AR(1). The parameter vector is (ω, ρ, ση )′. The application of
GMM will use functions of rt to identify the parameters of the model. Because this model
is so simple, it is straightforward to derive the following relationships:

\[
\begin{aligned}
E[|r_t|] &= \sqrt{\tfrac{2}{\pi}}\,E[\sigma_t] \\
E[r_t^2] &= E[\sigma_t^2] \\
E[|r_t^3|] &= 2\sqrt{\tfrac{2}{\pi}}\,E[\sigma_t^3] \\
E[r_t^4] &= 3\,E[\sigma_t^4] \\
E[|r_t r_{t-j}|] &= \tfrac{2}{\pi}\,E[\sigma_t\sigma_{t-j}] \\
E[r_t^2 r_{t-j}^2] &= E[\sigma_t^2\sigma_{t-j}^2]
\end{aligned} \tag{6.18}
\]

where

\[
\begin{aligned}
E[\sigma_t^m] &= \exp\bigg(m\,\frac{\omega}{2} + m^2\,\frac{\sigma_\eta^2/(1-\rho^2)}{8}\bigg) \\
E[\sigma_t^m\sigma_{t-j}^n] &= E[\sigma_t^m]\,E[\sigma_t^n]\,\exp\bigg((mn)\,\rho^j\,\frac{\sigma_\eta^2/(1-\rho^2)}{4}\bigg).
\end{aligned} \tag{6.19}
\]

These conditions provide a large set of moments to determine the three unknown param-
eters. GMM seamlessly allows 3 or more moment conditions to be used in the estimation
of the unknowns.

6.3 General Specification

The three examples show how a model – economic or statistical – can be turned into a set of
moment conditions that have zero expectation, at least if the model is correctly specified.
All GMM specifications are constructed this way. Derivation of GMM begins by defining
the population moment condition.

Definition 6.1 (Population Moment Condition). Let wt be a vector of random variables, θ 0


be a p by 1 vector of parameters, and g(·) be a q by 1 vector valued function. The population
moment condition is defined
E[g(wt , θ 0 )] = 0 (6.20)

It is possible that g(·) could change over time and so could be replaced with gt (·). For clarity
of exposition the more general case will not be considered.

Definition 6.2 (Sample Moment Condition). The sample moment condition is derived from
the average population moment condition,
\[ g_T(\mathbf{w},\theta) = T^{-1}\sum_{t=1}^{T} g(\mathbf{w}_t,\theta). \tag{6.21} \]

The gT notation dates back to the original paper of Hansen (1982) and is widely used to
differentiate population and sample moment conditions. Also note that the sample mo-
ment condition suppresses the t in w. The GMM estimator is defined as the value of θ that
minimizes

QT (θ ) = gT (w, θ )0 WT gT (w, θ ). (6.22)

Thus the GMM estimator is defined as

\[ \hat{\theta} = \underset{\theta}{\operatorname{argmin}}\ Q_T(\theta) \tag{6.23} \]

where WT is a q by q positive semi-definite matrix. WT may (and generally will) depend


on the data but it is required to converge in probability to a positive definite matrix for
the estimator to be well defined. In order to operationalize the GMM estimator, q , the
number of moments, will be required to be greater than or equal to p , the number of unknown
parameters.

6.3.1 Identification and Overidentification


GMM specifications fall in to three categories: underidentified, just-identified and overi-
dentified. Underidentified models are those where the number of non-redundant moment
conditions is less than the number of parameters. The consequence of this is obvious: the
problem will have many solutions. Just-identified specification have q = p while overi-
dentified GMM specifications have q > p . The role of just- and overidentification will
be reexamined in the context of estimation and inference. In most applications of GMM
it is sufficient to count the number of moment equations and the number of parameters
when determining whether the model is just- or overidentified. The exception to this rule
arises if some moment conditions are linear combination of other moment conditions – in
other words are redundant – which is similar to including a perfectly co-linear variable in
a regression.

6.3.1.1 Example: Consumption Asset Pricing Model

In the consumption asset pricing model, the population moment condition is given by

\[ g(\mathbf{w}_t,\theta_0) = \bigg(\beta_0\big(1 + r_{t+1}\big)\Big(\frac{c_{t+1}}{c_t}\Big)^{-\gamma_0} - 1\bigg)\otimes\mathbf{z}_t \tag{6.24} \]

where θ0 = (β0 , γ0 )′, wt = (c_{t+1}, c_t , r′_{t+1}, z′_t )′ and ⊗ denotes the Kronecker product.2 Note

2 Definition 6.3 (Kronecker Product). Let A = [a_{ij}] be an m by n matrix, and let B = [b_{ij}] be a k by l matrix.

that both rt +1 and zt are column vectors and so if there are n assets and k instruments,
then the dimension of g(·)(number of moment conditions) is q = nk by 1 and the number
of parameters is p = 2. Systems with nk ≥ 2 will be identified as long as some technical
conditions are met regarding instrument validity (see section 6.11).

6.3.1.2 Example: Linear Factor Models

In the linear factor models, the population moment conditions are given by

\[ g(\mathbf{w}_t,\theta_0) = \begin{bmatrix} (\mathbf{r}_t - \beta\mathbf{f}_t)\otimes\mathbf{f}_t \\ \mathbf{r}_t - \beta\lambda \end{bmatrix} \tag{6.27} \]

where θ 0 = (vec(β )0 , λ0 )0 and wt = (r0t , f0t )0 where rt is n by 1 and ft is k by 1.3 These moments
can be decomposed into two blocks. The top block contains the moment conditions neces-
sary to estimate the β s. This block can be further decomposed into n blocks of k moment
conditions, one for each factor. The first of these n blocks is

\[
\begin{bmatrix}
(r_{1t} - \beta_{11}f_{1t} - \beta_{12}f_{2t} - \ldots - \beta_{1K}f_{Kt})f_{1t} \\
(r_{1t} - \beta_{11}f_{1t} - \beta_{12}f_{2t} - \ldots - \beta_{1K}f_{Kt})f_{2t} \\
\vdots \\
(r_{1t} - \beta_{11}f_{1t} - \beta_{12}f_{2t} - \ldots - \beta_{1K}f_{Kt})f_{Kt}
\end{bmatrix}
=
\begin{bmatrix}
\varepsilon_{1t}f_{1t} \\ \varepsilon_{1t}f_{2t} \\ \vdots \\ \varepsilon_{1t}f_{Kt}
\end{bmatrix}. \tag{6.29}
\]
Each equation in (6.29) should be recognized as the first order condition for estimating the
The Kronecker product is defined

\[ A\otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \ldots & a_{1n}B \\ a_{21}B & a_{22}B & \ldots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \ldots & a_{mn}B \end{bmatrix} \tag{6.25} \]

and has dimension mk by nl . If a and b are column vectors with length m and k respectively, then

\[ a\otimes b = \begin{bmatrix} a_1 b \\ a_2 b \\ \vdots \\ a_m b \end{bmatrix}. \tag{6.26} \]

3 The vec operator stacks the columns of a matrix into a column vector.

Definition 6.4 (vec). Let A = [a_{ij}] be an m by n matrix. Then the vec operator (also known as the stack
operator) is defined

\[ \operatorname{vec} A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix} \tag{6.28} \]

and vec(A) is mn by 1.

slope coefficients in a linear regression. The second block has the form

\[
\begin{bmatrix}
r_{1t} - \beta_{11}\lambda_1 - \beta_{12}\lambda_2 - \ldots - \beta_{1K}\lambda_K \\
r_{2t} - \beta_{21}\lambda_1 - \beta_{22}\lambda_2 - \ldots - \beta_{2K}\lambda_K \\
\vdots \\
r_{Nt} - \beta_{N1}\lambda_1 - \beta_{N2}\lambda_2 - \ldots - \beta_{NK}\lambda_K
\end{bmatrix} \tag{6.30}
\]

where λ j is the risk premium on the jth factor. These moment conditions are derived from
the relationship that the average return on an asset should be the sum of its risk exposure
times the premium for that exposure.
The number of moment conditions (and the length of g(·)) is q = nk + n . The number
of parameters is p = n k (from β ) + k (from λ), and so the number of overidentifying
restrictions is the number of equations in g(·) minus the number of parameters, (nk + n ) −
(nk + k ) = n − k , the same number of restrictions used when testing asset pricing models
in a two-stage Fama-MacBeth regressions.

6.3.1.3 Example: Stochastic Volatility Model

Many moment conditions are available to use in the stochastic volatility model. It is clear
that at least 3 conditions are necessary to identify the 3 parameters and that the upper
bound on the number of moment conditions is larger than the amount of data available.
For clarity of exposition, only 5 and 8 moment conditions will be used, where the 8 are a
superset of the 5. These 5 are:

\[
g(\mathbf{w}_t,\theta_0) =
\begin{bmatrix}
|r_t| - \sqrt{\tfrac{2}{\pi}}\exp\Big(\tfrac{\omega}{2} + \tfrac{\sigma_\eta^2/(1-\rho^2)}{8}\Big) \\[4pt]
r_t^2 - \exp\Big(\omega + \tfrac{\sigma_\eta^2/(1-\rho^2)}{2}\Big) \\[4pt]
r_t^4 - 3\exp\Big(2\omega + 2\,\tfrac{\sigma_\eta^2}{1-\rho^2}\Big) \\[4pt]
|r_t r_{t-1}| - \tfrac{2}{\pi}\exp\Big(\tfrac{\omega}{2} + \tfrac{\sigma_\eta^2/(1-\rho^2)}{8}\Big)^2\exp\Big(\rho\,\tfrac{\sigma_\eta^2/(1-\rho^2)}{4}\Big) \\[4pt]
r_t^2 r_{t-2}^2 - \exp\Big(\omega + \tfrac{\sigma_\eta^2/(1-\rho^2)}{2}\Big)^2\exp\Big(\rho^2\,\tfrac{\sigma_\eta^2}{1-\rho^2}\Big)
\end{bmatrix} \tag{6.31}
\]

These moment conditions can be easily verified from 6.18 and 6.19. The 8 moment-condition
estimation extends the 5 moment-condition estimation with
These moment conditions can be easily verified from 6.18 and 6.19. The 8 moment-condition
estimation extends the 5 moment-condition estimation with
Moment conditions from 6.31 
 2
ση

 q 
|rt3 | − 2 2 exp 3 ω + 9 1−ρ 
 2 
 π 2 8 
 
   2   
2 2
 
ση ση
g(wt , θ 0 ) =  (6.32)
 
 |rt rt −3 | − π2 exp  ω2 + 1−ρ  exp ρ 3 1−ρ2 
2 
8 4

 
 
   2 
 ση2 
 rt2 rt2−4 − exp ω + 1−ρ2  exp ρ 4 ση 2
2
   

2 1−ρ

The moments that use lags are all staggered to improve identification of ρ.
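
As a sketch, the 5-moment-condition version of g_T can be written as code (the function name and parameter ordering are illustrative assumptions):

```python
import numpy as np

def sv_moments(params, r):
    """Sample moment conditions g_T for the log-normal SV model (5 moments)."""
    omega, rho, sigma_eta = params
    v = sigma_eta**2 / (1.0 - rho**2)                            # Var[ln sigma_t^2]
    e_sig = lambda m: np.exp(m * omega / 2.0 + m**2 * v / 8.0)   # E[sigma_t^m]
    g1 = np.abs(r) - np.sqrt(2.0 / np.pi) * e_sig(1)
    g2 = r**2 - e_sig(2)
    g3 = r**4 - 3.0 * e_sig(4)
    g4 = np.abs(r[1:] * r[:-1]) - (2.0 / np.pi) * e_sig(1)**2 * np.exp(rho * v / 4.0)
    g5 = r[2:]**2 * r[:-2]**2 - e_sig(2)**2 * np.exp(rho**2 * v)
    return np.array([g1.mean(), g2.mean(), g3.mean(), g4.mean(), g5.mean()])
```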

6.4 Estimation

Estimation of GMM is seemingly simple but in practice fraught with difficulties and user
choices. From the definitions of the GMM estimator,

QT (θ ) = gT (w, θ )0 WT gT (w, θ ) (6.33)


\[ \hat{\theta} = \underset{\theta}{\operatorname{argmin}}\ Q_T(\theta) \tag{6.34} \]

Differentiation can be used to find the solution, θ̂ , which solves

\[ 2G_T(\mathbf{w},\hat{\theta})'W_T\, g_T(\mathbf{w},\hat{\theta}) = 0 \tag{6.35} \]

where GT (w, θ ) is the q by p Jacobian of the moment conditions with respect to θ′,

\[ G_T(\mathbf{w},\theta) = \frac{\partial g_T(\mathbf{w},\theta)}{\partial\theta'} = T^{-1}\sum_{t=1}^{T}\frac{\partial g(\mathbf{w}_t,\theta)}{\partial\theta'}. \tag{6.36} \]

GT (w, θ ) is a matrix of derivatives with q rows and p columns where each row contains
the derivative of one of the moment conditions with respect to all p parameters and each
column contains the derivative of the q moment conditions with respect to a single pa-
rameter.
The seeming simplicity of the calculus obscures two important points. First, the so-
lution in eq. (6.35) does not generally emit an analytical solution and so numerical opti-
mization must be used. Second, QT (·) is generally not a convex function in θ with a unique
minimum, and so local minima are possible. The solution to the latter problem is to try
multiple starting values and clever initial choices for starting values whenever available.

Note that WT has not been specified other than requiring that this weighting matrix be
positive definite. The choice of the weighting matrix is an additional complication of using
GMM. Theory dictates that the best choice of the weighting matrix must satisfy WT →p S−1
where

\[ S = \operatorname{avar}\big\{\sqrt{T}\,g_T(\mathbf{w}_t,\theta_0)\big\} \tag{6.37} \]

and where avar indicates asymptotic variance. That is, the best choice of weighting is the
inverse of the covariance of the moment conditions. Unfortunately the covariance of the
moment conditions generally depends on the unknown parameter vector, θ 0 . The usual
solution is to use multi-step estimation. In the first step, a simple choice for WT ,which
does not depend on θ (often Iq the identity matrix), is used to estimate θ̂ . The second uses
the first-step estimate of θ̂ to estimate Ŝ. A more formal discussion of the estimation of S
will come later. For now, assume that a consistent estimation method is being used so that
p p
Ŝ → S and so WT = Ŝ−1 → S−1 .
The three main methods used to implement GMM are the classic 2-step estimation, K -
step estimation where the estimation only ends after some convergence criteria is met and
continuous updating estimation.

6.4.1 2-step Estimator

Two-step estimation is the standard procedure for estimating parameters using GMM. First-
step estimates are constructed using a preliminary weighting matrix W̃, often the identity
matrix, and θ̂ 1 solves the initial optimization problem

2GT (w, θ̂ 1 )0 W̃gT (w, θ̂ 1 ) = 0. (6.38)

The second step uses an estimated Ŝ based on the first-step estimates θ̂ 1 . For example, if
the moments are a martingale difference sequence with finite covariance,

\[ \hat{S}(\mathbf{w},\hat{\theta}_1) = T^{-1}\sum_{t=1}^{T} g(\mathbf{w}_t,\hat{\theta}_1)\,g(\mathbf{w}_t,\hat{\theta}_1)' \tag{6.39} \]

is a consistent estimator of the asymptotic variance of gT (·), and the second-step estimates,
θ̂ 2 , minimizes

QT (θ ) = gT (w, θ )0 Ŝ−1 (θ 1 )gT (w, θ ). (6.40)

which has first order condition

2GT (w, θ̂ 2 )0 Ŝ−1 (θ 1 )gT (w, θ̂ 2 ) = 0. (6.41)



Two-step estimation relies on the consistency of the first-step estimates, θ̂1 →p θ0 ,
which is generally needed for Ŝ →p S.
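
A compact sketch of the two-step estimator follows. It assumes a user-supplied moment function g(theta, data) returning a T by q matrix of g(wt , θ), uses scipy's minimize for the numerical optimization, and estimates S under the martingale difference assumption of eq. (6.39); all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, g, data, W):
    gbar = g(theta, data).mean(axis=0)        # q-vector of sample moment conditions
    return gbar @ W @ gbar

def two_step_gmm(g, data, theta0):
    q = g(theta0, data).shape[1]
    # Step 1: identity weighting matrix
    step1 = minimize(gmm_objective, theta0, args=(g, data, np.eye(q)))
    # Estimate S assuming the moments are a martingale difference sequence
    G = g(step1.x, data)
    S = G.T @ G / G.shape[0]
    # Step 2: efficient weighting matrix W = S^{-1}
    step2 = minimize(gmm_objective, step1.x, args=(g, data, np.linalg.inv(S)))
    return step2.x
```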

6.4.2 k -step Estimator

The k -step estimation strategy extends the two-step estimator in an obvious way: if two-
steps are better than one, k may be better than two. The k -step procedure picks up where
the 2-step procedure left off and continues iterating between θ̂ and Ŝ using the most re-
cent values θ̂ available when computing the covariance of the moment conditions. The
procedure terminates when some stopping criteria is satisfied. For example if

max |θ̂ k − θ̂ k −1 | < ε (6.42)


for some small value ε, the iterations would stop and θ̂ = θ̂ k . The stopping criteria should
depend on the values of θ . For example, if these values are close to 1, then 1 × 10−4 may
be a good choice for a stopping criteria. If the values are larger or smaller, the stopping
criteria should be adjusted accordingly. The k -step and the 2-step estimator are asymp-
totically equivalent, although, the k -step procedure is thought to have better small sample
properties than the 2-step estimator, particularly when it converges.

6.4.3 Continuously Updating Estimator (CUE)

The final, and most complicated, type of estimation, is the continuously updating estima-
tor. Instead of iterating between estimation of θ and S, this estimator parametrizes S as a
function of θ . In the problem, θ̂ C is found as the minimum of

QTC (θ ) = gT (w, θ )0 S(w, θ )−1 gT (w, θ ) (6.43)


The first order condition of this problem is not the same as in the original problem since θ
appears in three terms. However, the estimates are still first-order asymptotically equiva-
lent to the two-step estimate (and hence the k -step as well), and if the continuously updat-
ing estimator converges, it is generally regarded to have the best small sample properties
among these methods.4 There are two caveats to using the continuously updating estima-
tor. First, it is necessary to ensure that gT (w, θ ) is close to zero and that the minimum is not
being determined by a large covariance, since a large S(w, θ ) will make QTC (θ ) small for any
value of the sample moment conditions gT (w, θ ). The second warning when using the con-
tinuously updating estimator is to make sure that S(w, θ ) is not singular. If the weighting
matrix is singular, there are values of θ which satisfy the first order condition but which are not
consistent. The continuously updating estimator is usually implemented using the k -step
4 The continuously updating estimator is more efficient in the second-order sense than the 2- or k -step
estimators, which improves finite sample properties.

estimator to find starting values. Once the k -step has converged, switch to the continu-
ously updating estimator until it also converges.

6.4.4 Improving the first step (when it matters)


There are two important caveats to the first-step choice of weighting matrix. The first is
simple: if the problem is just identified, then the choice of weighting matrix does not matter
and only one step is needed. To understand this, consider the first-order condition which
defines θ̂ ,

2GT (w, θ̂ )0 WT gT (w, θ̂ ) = 0. (6.44)


If the number of moment conditions is the same as the number of parameters, the solution
must have

gT (w, θ̂ ) = 0. (6.45)
as long as WT is positive definite and GT (w, θ̂ ) has full rank (a necessary condition). How-
ever, if this is true, then

2GT (w, θ̂ )0 W̃T gT (w, θ̂ ) = 0 (6.46)


for any other positive definite W̃T whether it is the identity matrix, the asymptotic variance
of the moment conditions, or something else.
The other important concern when choosing the initial weighting matrix is to not over-
weight high-variance moments and underweight low variance ones. Reasonable first-step
estimates improve the estimation of Ŝ which in turn provide more accurate second-step es-
timates. The second (and later) steps automatically correct for the amount of variability in
the moment conditions. One fairly robust starting value is to use a diagonal matrix with the
inverse of the variances of the moment conditions on the diagonal. This requires knowl-
edge about θ to implement and an initial estimate or a good guess can be used. Asymptot-
ically it makes no difference, although careful weighing in first-step estimation improves
the performance of the 2-step estimator.

6.4.5 Example: Consumption Asset Pricing Model


The consumption asset pricing model example will be used to illustrate estimation. The
data set consists of two return series, the value-weighted market portfolio and the equally-
weighted market portfolio, V W M and E W M respectively. Models were fit to each return
series separately. Real consumption data was available from Q1 1947 until Q4 2009 and
downloaded from FRED (PCECC96). Five instruments (zt ) will be used, a constant (1),
contemporaneous and lagged consumption growth (c t /c t −1 and c t −1 /c t −2 ) and contem-
poraneous and lagged gross returns on the VWM (pt /pt −1 and pt −1 /pt −2 ). Using these five

instruments, the model is overidentified since there are only 2 unknowns and five moment
conditions,

\[
g(\mathbf{w}_t,\theta_0) =
\begin{bmatrix}
\beta(1+r_{t+1})\big(\tfrac{c_{t+1}}{c_t}\big)^{-\gamma} - 1 \\[4pt]
\Big(\beta(1+r_{t+1})\big(\tfrac{c_{t+1}}{c_t}\big)^{-\gamma} - 1\Big)\tfrac{c_t}{c_{t-1}} \\[4pt]
\Big(\beta(1+r_{t+1})\big(\tfrac{c_{t+1}}{c_t}\big)^{-\gamma} - 1\Big)\tfrac{c_{t-1}}{c_{t-2}} \\[4pt]
\Big(\beta(1+r_{t+1})\big(\tfrac{c_{t+1}}{c_t}\big)^{-\gamma} - 1\Big)\tfrac{p_t}{p_{t-1}} \\[4pt]
\Big(\beta(1+r_{t+1})\big(\tfrac{c_{t+1}}{c_t}\big)^{-\gamma} - 1\Big)\tfrac{p_{t-1}}{p_{t-2}}
\end{bmatrix} \tag{6.47}
\]

where rt +1 is the return on either the VWM or the EWM. Table 6.1 contains parameter esti-
mates using the 4 methods outlined above for each asset.

VWM EWM
Method β̂ γ̂ β̂ γ̂
Initial weighting matrix : I5
1-Step 0.977 0.352 0.953 2.199
2-Step 0.975 0.499 0.965 1.025
k -Step 0.975 0.497 0.966 0.939
Continuous 0.976 0.502 0.966 0.936

Initial weighting matrix: (z0 z)−1


1-Step 0.975 0.587 0.955 1.978
2-Step 0.975 0.496 0.966 1.004
k -Step 0.975 0.497 0.966 0.939
Continuous 0.976 0.502 0.966 0.936

Table 6.1: Parameter estimates from the consumption asset pricing model using both the
VWM and the EWM to construct the moment conditions. The top block corresponds to us-
ing an identity matrix for starting values while the bottom block of four correspond to using
(z0 z)−1 in the first step. The first-step estimates seem to be better behaved and closer to the
2- and K -step estimates when (z0 z)−1 is used in the first step. The K -step and continuously
updating estimators both converged and so produce the same estimates irrespective of the
1-step weighting matrix.

The parameters estimates were broadly similar across the different estimators. The typ-
ical discount rate is very low (β close to 1) and the risk aversion parameter appears to be
between 0.5 and 2.
One aspect of the estimation of this model is that γ is not well identified. Figure 6.1 con-
tain surface and contour plots of the objective function as a function of β and γ for both

[Figure 6.1 appears here: "Contours of Q in the Consumption Model". Surface and contour plots of the GMM objective Q as a function of β and γ; top panels: 2-step estimator, bottom panels: continuously updating estimator.]

Figure 6.1: This figure contains a plot of the GMM objective function using the 2-step es-
timator (top panels) and the CUE (bottom panels). The objective is very steep along the β
axis but nearly flat along the γ axis. This indicates that γ is not well identified.

the two-step estimator and the CUE. It is obvious in both pictures that the objective func-
tion is steep along the β -axis but very flat along the γ-axis. This means that γ is not well
identified and many values will result in nearly the same objective function value. These
results demonstrate how difficult GMM can be in even a simple 2-parameter model. Sig-
nificant care should always be taken to ensure that the objective function has been globally
minimized.

6.4.6 Example: Stochastic Volatility Model

The stochastic volatility model was fit using both 5 and 8 moment conditions to the returns
on the FTSE 100 from January 1, 2000 until December 31, 2009, a total of 2,525 trading days.
The results of the estimation are in table 6.2. The parameters differ substantially between
the two methods. The 5-moment estimates indicate relatively low persistence of volatility

with substantial variability. The 8-moment estimates all indicate that volatility is extremely
persistent with ρ close to 1. All estimates used a weighting matrix computed with a Newey-West
covariance with 16 lags (≈ 1.2T^{1/3}). A non-trivial covariance matrix is needed in this prob-
lem as the moment conditions should be persistent in the presence of stochastic volatility,
unlike in the consumption asset pricing model which should, if correctly specified, have
martingale errors.
In all cases the initial weighting matrix was specified to be an identity matrix, although
in estimation problems such as this where the moment condition can be decomposed into
g(wt , θ ) = f(wt ) − h(θ ) a simple expression for the covariance can be derived by noting
that, if the model is well specified, E[g(wt , θ )] = 0 and thus h(θ ) = E[f(wt )]. Using this
relationship the covariance of f(wt ) can be computed replacing h(θ ) with the sample mean
of f(wt ).
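
A sketch of this shortcut (f_vals is assumed to be the T by q matrix of f(wt ); the Newey-West weights from chapter 5 are applied to the demeaned series):

```python
import numpy as np

def moment_covariance(f_vals, L):
    """Newey-West covariance of f(w_t), using the sample mean in place of h(theta)."""
    e = f_vals - f_vals.mean(axis=0)          # demean: replaces h(theta) by E[f(w_t)]
    T = e.shape[0]
    S = e.T @ e / T
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1)
        G = e[l:].T @ e[:-l] / T
        S += w * (G + G.T)
    return S
```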

Method ω̂ ρ̂ σˆη
5 moment conditions
1-Step 0.004 1.000 0.005
2-Step -0.046 0.865 0.491
k -Step -0.046 0.865 0.491
Continuous -0.046 0.865 0.491

8 moment conditions
1-Step 0.060 1.000 0.005
2-Step -0.061 1.000 0.005
k -Step -0.061 1.000 0.004
Continuous -0.061 1.000 0.004

Table 6.2: Parameter estimates from the stochastic volatility model using both the 5- and 8-
moment condition specifications on the returns from the FTSE from January 1, 2000 until
December 31, 2009.

6.5 Asymptotic Properties


The GMM estimator is consistent and asymptotically normal under fairly weak, albeit tech-
nical, assumptions. Rather than list 7-10 (depending on which setup is being used) hard
to interpret assumptions, it is more useful to understand why the GMM estimator is con-
sistent and asymptotically normal. The key to developing this intuition comes from un-
derstanding that the moment conditions used to define the estimator, gT (w, θ ), are simple
averages which should have mean 0 when the population moment condition is true.
In order for the estimates to be reasonable, g(wt , θ ) needs to be well behaved. One scenario
where this occurs is when g(wt , θ ) is a stationary, ergodic sequence with a few finite moments. If
this is true, only a few additional assumptions are needed to ensure θ̂ will be consistent

and asymptotically normal. Specifically, WT must be positive definite and the system must
be identified. Positive definiteness of WT is required to ensure that QT (θ ) can only be mini-
mized at one value – θ 0 . If WT were positive semi-definite or indefinite, many values would
minimize the objective function. Identification is trickier, but generally requires that there
is enough variation in the moment conditions to uniquely determine all of the parameters.
Put more technically, the rank of G = plimGT (w, θ 0 ) must be weakly larger than the number
of parameters in the model. Identification will be discussed in more detail in section 6.11.
If these technical conditions are true, then the GMM estimator has standard properties.

6.5.1 Consistency

The estimator is consistent under relatively weak conditions. Formal consistency argu-
ments involve showing that QT(θ) is suitably close to E[QT(θ)] in large samples so that the minimum of the sample objective function is close to the minimum of the population objective function. The most important requirement – and often the most difficult to verify – is that the parameters are uniquely identified, which is equivalent to saying that there
is only one value θ 0 for which E [g (wt , θ )] = 0. If this condition is true, and some more
technical conditions hold, then

\[
\hat{\theta} - \theta_0 \xrightarrow{p} 0 \tag{6.48}
\]

The important point of this result is that the estimator is consistent for any choice of WT, not just WT →p S−1, since whenever WT is positive definite and the parameters are uniquely identified, QT(θ) can only be minimized when E[g(w, θ)] = 0, which is at θ0.

6.5.2 Asymptotic Normality of GMM Estimators

The GMM estimator is also asymptotically normal, although the form of the asymptotic
covariance depends on how the parameters are estimated. Asymptotic normality of GMM
estimators follows from taking a mean-value (similar to a Taylor) expansion of the moment
conditions around the true parameter θ0,

\[
0 = G_T\left(\mathbf{w},\hat{\theta}\right)'W_T\, g\left(\mathbf{w},\hat{\theta}\right) \approx G'W g\left(\mathbf{w},\theta_0\right) + G'W\frac{\partial g\left(\mathbf{w},\ddot{\theta}\right)}{\partial \theta'}\left(\hat{\theta}-\theta_0\right) \tag{6.49}
\]
\[
\approx G'W g\left(\mathbf{w},\theta_0\right) + G'WG\left(\hat{\theta}-\theta_0\right)
\]
\[
G'WG\left(\hat{\theta}-\theta_0\right) \approx -G'W g\left(\mathbf{w},\theta_0\right) \tag{6.50}
\]
\[
\sqrt{T}\left(\hat{\theta}-\theta_0\right) \approx -\left(G'WG\right)^{-1}G'W\left[\sqrt{T}\, g\left(\mathbf{w},\theta_0\right)\right]
\]

where G ≡ plim GT(w, θ0) and W ≡ plim WT. The first line uses the score condition on the left-hand side and the right-hand side contains the first-order Taylor expansion of the first-order condition. The second line uses the definition G = ∂g(w, θ)/∂θ' evaluated at some point θ̈ between θ̂ and θ0 (element-by-element), and the last line scales the estimator by √T.
This expansion shows that the asymptotic normality of GMM estimators is derived directly
from the normality of the moment conditions evaluated at the true parameter – moment
conditions which are averages and so may, subject to some regularity conditions, follow a
CLT.
The asymptotic variance of the parameters can be computed by computing the variance of the final line of the expansion in eqs. (6.49)–(6.50).

\[
V\left[\sqrt{T}\left(\hat{\theta}-\theta_0\right)\right] = \left(G'WG\right)^{-1}G'W\, V\left[\sqrt{T}\, g\left(\mathbf{w},\theta_0\right)\right]W'G\left(G'WG\right)^{-1} = \left(G'WG\right)^{-1}G'WSW'G\left(G'WG\right)^{-1}
\]

Using the asymptotic variance, the asymptotic distribution of the GMM estimator is
\[
\sqrt{T}\left(\hat{\theta}-\theta_0\right) \stackrel{d}{\rightarrow} N\left(0,\; \left(G'WG\right)^{-1}G'WSWG\left(G'WG\right)^{-1}\right) \tag{6.51}
\]

If one were to use single-step estimation with an identity weighting matrix, the asymptotic covariance would be (G'G)−1G'SG(G'G)−1. This format may look familiar: it is the White heteroskedasticity-robust standard error formula when G = X are the regressors and G'SG = X'EX, where E is a diagonal matrix composed of the squared regression errors.

6.5.2.1 Asymptotic Normality, efficient W

This form of the covariance simplifies when the efficient choice of W = S−1 is used,

\[
\begin{aligned}
V\left[\sqrt{T}\left(\hat{\theta}-\theta_0\right)\right] &= \left(G'S^{-1}G\right)^{-1}G'S^{-1}SS^{-1}G\left(G'S^{-1}G\right)^{-1} \\
&= \left(G'S^{-1}G\right)^{-1}G'S^{-1}G\left(G'S^{-1}G\right)^{-1} \\
&= \left(G'S^{-1}G\right)^{-1}
\end{aligned}
\]

and the asymptotic distribution is


\[
\sqrt{T}\left(\hat{\theta}-\theta_0\right) \stackrel{d}{\rightarrow} N\left(0,\; \left(G'S^{-1}G\right)^{-1}\right) \tag{6.52}
\]

Using the long-run variance of the moment conditions produces an asymptotic covariance
which is not only simpler than the generic form, but is also smaller (in the matrix sense).
This can be verified since
\[
\begin{aligned}
&\left(G'WG\right)^{-1}G'WSWG\left(G'WG\right)^{-1} - \left(G'S^{-1}G\right)^{-1} = \\
&\quad \left(G'WG\right)^{-1}\left[G'WSWG - \left(G'WG\right)\left(G'S^{-1}G\right)^{-1}\left(G'WG\right)\right]\left(G'WG\right)^{-1} = \\
&\quad \left(G'WG\right)^{-1}G'WS^{\frac{1}{2}}\left[I_q - S^{-\frac{1}{2}}G\left(G'S^{-1}G\right)^{-1}G'S^{-\frac{1}{2}}\right]S^{\frac{1}{2}}WG\left(G'WG\right)^{-1} \\
&\quad = A'\left[I_q - X\left(X'X\right)^{-1}X'\right]A
\end{aligned}
\]

where A = S^{1/2}WG(G'WG)−1 and X = S^{−1/2}G. This is a quadratic form where the inner matrix is idempotent – and hence positive semi-definite – and so the difference must be weakly positive. In most cases the efficient weighting matrix should be used, although there are applications where an alternative choice of covariance matrix must be made due to practical considerations (many moment conditions) or for testing specific hypotheses.

6.5.3 Asymptotic Normality of the estimated moments, gT (w, θ̂ )

Not only are the parameters asymptotically normal, but the estimated moment conditions
are as well. The asymptotic normality of the moment conditions allows for testing the spec-
ification of the model by examining whether the sample moments are sufficiently close to
0. If the efficient weighting matrix is used (W = S−1 ),
\[
\sqrt{T}\, W_T^{1/2}\, g_T(\mathbf{w},\hat{\theta}) \stackrel{d}{\rightarrow} N\left(0,\; I_q - W^{1/2}G\left(G'WG\right)^{-1}G'W^{1/2}\right) \tag{6.53}
\]

The variance appears complicated but has a simple intuition. If the true parameter vector were known, W_T^{1/2} ĝT(w, θ) would be asymptotically normal with an identity covariance matrix. The second term is a result of having to estimate an unknown parameter vector. Essentially, one degree of freedom is lost for every parameter estimated and the covariance matrix of the estimated moments has q − p degrees of freedom remaining. Replacing W with the efficient choice of weighting matrix (S−1), the asymptotic variance of √T ĝT(w, θ̂) can be equivalently written as S − G(G'S−1G)−1G' by pre- and post-multiplying the variance in eq. (6.53) by S1/2. In cases where the model is just-identified, q = p, gT(w, θ̂) = 0, and the asymptotic variance is degenerate (0).
If some other weighting matrix is used, one where WT does not converge in probability to S−1, then the asymptotic distribution of the moment conditions is more complicated.

\[
\sqrt{T}\, W_T^{1/2}\, g_T(\mathbf{w},\hat{\theta}) \stackrel{d}{\rightarrow} N\left(0,\; N W^{1/2} S W^{1/2} N'\right) \tag{6.54}
\]

where N = Iq − W1/2G[G'WG]−1G'W1/2. If an alternative weighting matrix is used, the estimated moments are still asymptotically normal but with a different, larger variance. To see how the efficient form of the covariance matrix is nested in this inefficient form, replace W = S−1 and note that since N is idempotent, N = N' and NN = N.

6.6 Covariance Estimation


Estimation of the long run (asymptotic) covariance matrix of the moment conditions is
important and often has significant impact on tests of either the model or individual coef-
ficients. Recall the definition of the long-run covariance,
\[
S = \operatorname{avar}\left\{\sqrt{T}\, g_T\left(\mathbf{w}_t,\theta_0\right)\right\}.
\]

S is the covariance of √T times the average gT(wt, θ0) = T−1 ΣTt=1 g(wt, θ0), and the variance of an average includes all autocovariance terms. The simplest estimator one could construct to capture all of the autocovariances is

\[
\hat{S} = \hat{\Gamma}_0 + \sum_{i=1}^{T-1}\left(\hat{\Gamma}_i + \hat{\Gamma}_i'\right) \tag{6.55}
\]

where

\[
\hat{\Gamma}_i = T^{-1}\sum_{t=i+1}^{T} g\left(\mathbf{w}_t,\hat{\theta}\right)g\left(\mathbf{w}_{t-i},\hat{\theta}\right)'.
\]

While this estimator is the natural sample analogue of the population long-run covariance,
it is not guaranteed to be positive semi-definite and so is not useful in practice. A number of alternatives have been
developed which can capture autocorrelation in the moment conditions and are guaran-
teed to be positive definite.

6.6.1 Serially uncorrelated moments

If the moments are serially uncorrelated then the usual covariance estimator can be used.
Moreover, E[gT (w, θ )] = 0 and so S can be consistently estimated using

\[
\hat{S} = T^{-1}\sum_{t=1}^{T} g\left(\mathbf{w}_t,\hat{\theta}\right)g\left(\mathbf{w}_t,\hat{\theta}\right)', \tag{6.56}
\]

This estimator does not explicitly remove the mean of the moment conditions. In practice it may be important to remove the mean of the moment conditions when the problem is over-identified (q > p); this issue is discussed further in section 6.6.5.

6.6.2 Newey-West

The Newey & West (1987) covariance estimator solves the problem of positive definiteness
of an autocovariance robust covariance estimator by weighting the autocovariances. This
produces a heteroskedasticity, autocovariance consistent (HAC) covariance estimator that
is guaranteed to be positive definite. The Newey-West estimator computes the long-run

covariance as if the moment process was a vector moving average (VMA), and uses the
sample autocovariances, and is defined (for a maximum lag l )

\[
\hat{S}^{NW} = \hat{\Gamma}_0 + \sum_{i=1}^{l}\frac{l+1-i}{l+1}\left(\hat{\Gamma}_i + \hat{\Gamma}_i'\right) \tag{6.57}
\]

The number of lags, l , is problem dependent, and in general must grow with the sample
size to ensure consistency when the moment conditions are dependent. The optimal rate
of lag growth has l = cT^{1/3}, where c is a problem-specific constant.
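
The estimator in eq. (6.57) is straightforward to compute directly. A minimal sketch (not from the original text; the lag choice and simulated moments are illustrative):

    import numpy as np

    def newey_west(g, lags):
        """Newey-West (Bartlett kernel) long-run covariance of the columns of g.

        g is a T x q array of (demeaned) moment conditions; lags is the maximum
        lag l in eq. (6.57)."""
        g = np.asarray(g, dtype=float)
        T = g.shape[0]
        s = g.T @ g / T                            # Gamma_0
        for i in range(1, lags + 1):
            gamma_i = g[i:].T @ g[:-i] / T         # Gamma_i
            s += (lags + 1 - i) / (lags + 1) * (gamma_i + gamma_i.T)
        return s

    # Example with persistent simulated moments, so autocovariances matter
    rng = np.random.default_rng(0)
    T = 2500
    g = np.zeros((T, 2))
    e = rng.standard_normal((T, 2))
    for t in range(1, T):
        g[t] = 0.5 * g[t - 1] + e[t]
    lags = int(np.ceil(1.2 * T ** (1 / 3)))        # the cT^{1/3} rule with c = 1.2
    print(newey_west(g - g.mean(0), lags))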

6.6.3 Vector Autoregressive

While the Newey-West covariance estimator is derived from a VMA, a Vector Autoregres-
sion (VAR)-based estimator can also be used. The VAR-based long-run covariance estima-
tors have significant advantages when the moments are highly persistent. Construction of
the VAR HAC estimator is simple and is derived directly from a VAR. Suppose the moment
conditions, gt follow a VAR(r ), and so

gt = Φ0 + Φ1 gt −1 + Φ2 gt −2 + . . . + Φr gt −r + ηt . (6.58)
The long-run covariance of gt can be computed from knowledge of Φj, j = 1, 2, . . . , r and the covariance of ηt. Moreover, if the assumption of VAR(r) dynamics is correct, ηt is a white noise process and its covariance can be consistently estimated by

\[
\hat{S}_\eta = (T-r)^{-1}\sum_{t=r+1}^{T}\hat{\eta}_t\hat{\eta}_t'. \tag{6.59}
\]

The long run covariance is then estimated using


\[
\hat{S}^{AR} = \left(I - \hat{\Phi}_1 - \hat{\Phi}_2 - \ldots - \hat{\Phi}_r\right)^{-1}\hat{S}_\eta\left[\left(I - \hat{\Phi}_1 - \hat{\Phi}_2 - \ldots - \hat{\Phi}_r\right)^{-1}\right]'. \tag{6.60}
\]

The primary advantage of the VAR based estimator over the NW is that the number of pa-
rameters needing to be estimated is often much, much smaller. If the process is well de-
scribed by a VAR, r may be as small as 1 while a Newey-West estimator may require many lags to adequately capture the dependence in the moment conditions. den Haan & Levin (2000) show that the VAR procedure can be consistent if the number of lags grows as the
sample size grows so that the VAR can approximate the long-run covariance of any covari-
ance stationary process. They recommend choosing the lag length using BIC in two steps:
first choosing the lag length of own lags, and then choosing the number of lags of other
moments.

6.6.4 Pre-whitening and Recoloring


The Newey-West and VAR long-run covariance estimators can be combined in a procedure
known as pre-whitening and recoloring. This combination exploits the VAR to capture the
persistence in the moments and uses the Newey-West HAC to capture any neglected se-
rial dependence in the residuals. The advantage of this procedure over Newey-West or VAR
HAC covariance estimators is that PWRC is parsimonious while allowing complex depen-
dence in the moments.
A low order VAR (usually 1st ) is fit to the moment conditions,

gt = Φ0 + Φ1 gt −1 + ηt (6.61)
and the covariance of the residuals, η̂t is estimated using a Newey-West estimator, prefer-
ably with a small number of lags,

\[
\hat{S}^{NW}_\eta = \hat{\Xi}_0 + \sum_{i=1}^{l}\frac{l-i+1}{l+1}\left(\hat{\Xi}_i + \hat{\Xi}_i'\right) \tag{6.62}
\]

where
\[
\hat{\Xi}_i = T^{-1}\sum_{t=i+1}^{T}\hat{\eta}_t\hat{\eta}_{t-i}'. \tag{6.63}
\]

The long run covariance is computed by combining the VAR parameters with the Newey-
West covariance of the residuals,
\[
\hat{S}^{PWRC} = \left(I - \hat{\Phi}_1\right)^{-1}\hat{S}^{NW}_\eta\left[\left(I - \hat{\Phi}_1\right)^{-1}\right]', \tag{6.64}
\]

or, if a higher order VAR was used,


 −1  −1 0
r
X r
X
ŜP W R C = I − Φ̂ j  ŜN W 
 I− Φ̂ j   (6.65)

η
j =1 j =1

where r is the order of the VAR.
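
A minimal sketch of pre-whitening and recoloring with a first-order VAR, as in eq. (6.64) (self-contained; function names, lag choices and simulated data are illustrative, not from the original text):

    import numpy as np

    def newey_west(u, lags):
        """Bartlett-weighted long-run covariance of the columns of u (T x q)."""
        T = u.shape[0]
        s = u.T @ u / T
        for i in range(1, lags + 1):
            gamma_i = u[i:].T @ u[:-i] / T
            s += (lags + 1 - i) / (lags + 1) * (gamma_i + gamma_i.T)
        return s

    def prewhiten_recolor(g, nw_lags=2):
        """VAR(1) pre-whitened, Newey-West recolored long-run covariance of g (T x q)."""
        g = np.asarray(g, dtype=float)
        T, q = g.shape
        X = np.column_stack([np.ones(T - 1), g[:-1]])     # intercept and g_{t-1}
        coef, *_ = np.linalg.lstsq(X, g[1:], rcond=None)  # (q+1) x q least squares
        phi1 = coef[1:].T                                 # VAR(1) coefficient matrix
        resid = g[1:] - X @ coef                          # VAR residuals
        s_eta = newey_west(resid, nw_lags)                # short NW on the residuals
        recolor = np.linalg.inv(np.eye(q) - phi1)
        return recolor @ s_eta @ recolor.T                # eq. (6.64)

    # Illustrative use with highly persistent simulated moments
    rng = np.random.default_rng(0)
    T = 2500
    g = np.zeros((T, 2))
    e = rng.standard_normal((T, 2))
    for t in range(1, T):
        g[t] = 0.8 * g[t - 1] + e[t]
    print(prewhiten_recolor(g - g.mean(0)))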

6.6.5 To demean or not to demean?


One important issue when computing asymptotic variances is whether the sample mo-
ments should be demeaned before estimating the long-run covariance. If the population
moment conditions are valid, then E[gt (wt , θ )] = 0 and the covariance can be computed
from {gt (wt , θ̂ )} without removing the mean. If the population moment conditions are
not valid, then E[gt(wt, θ)] ≠ 0 and any covariance matrices estimated from the sample
moments will be inconsistent. The intuition behind the inconsistency is simple. Suppose

that E[gt(wt, θ)] ≠ 0 for all θ ∈ Θ, the parameter space, and that the moments are a vec-
tor martingale process. Using the “raw” moments to estimate the covariance produces an
inconsistent estimator since
\[
\hat{S} = T^{-1}\sum_{t=1}^{T} g\left(\mathbf{w}_t,\hat{\theta}\right)g\left(\mathbf{w}_t,\hat{\theta}\right)' \stackrel{p}{\rightarrow} S + \mu\mu' \tag{6.66}
\]

where S is the covariance of the moment conditions and µ is the expectation of the moment
conditions evaluated at the probability limit of the first-step estimator, θ̂ 1 .
One way to remove the inconsistency is to demean the moment conditions prior to es-
timating the long-run covariance, so that gt(wt, θ̂) is replaced by g̃t(wt, θ̂) = gt(wt, θ̂) − T−1 ΣTt=1 gt(wt, θ̂) when computing the long-run covariance. Note that demeaning is not

free since removing the mean, when the population moment condition is valid, reduces the
variation in gt (·) and in turn the precision of Ŝ. As a general rule, the mean should be re-
moved except in cases where the sample length is small relative to the number of moments.
In practice, subtracting the mean from the estimated moment conditions is important for
testing models using J -tests and for estimating the parameters in 2- or k -step procedures.
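
In code the adjustment is a one-line change; a minimal sketch (illustrative simulated moments, not from the original text):

    import numpy as np

    rng = np.random.default_rng(0)
    g = rng.standard_normal((1000, 3)) + 0.1     # moments with a possibly non-zero mean
    T = g.shape[0]

    # "Raw" outer-product estimator: inconsistent if E[g_t] != 0 (eq. 6.66)
    s_raw = g.T @ g / T

    # Demeaned estimator: replace g_t with g_t minus the sample mean
    g_tilde = g - g.mean(axis=0)
    s_demeaned = g_tilde.T @ g_tilde / T
    print(np.diag(s_raw), np.diag(s_demeaned))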

6.6.6 Example: Consumption Asset Pricing Model


Returning to the consumption asset pricing example, 11 different estimators were used
to estimate the long run variance after using the parameters estimated in the first step of
the GMM estimator. These estimators include the standard estimator and both the Newey-
West and the VAR estimator using 1 to 5 lags. In this example, the well identified parameter,
β is barely affected but the poorly identified γ shows some variation when the covariance
estimator changes. In this example it is reasonable to use the simple covariance estimator
because, if the model is well specified, the moments must be serially uncorrelated. If they
are serially correlated then the investor’s marginal utility is predictable and so the model is
misspecified. It is generally good practice to impose any theoretically sound restrictions on the covariance estimator (such as a lack of serial correlation in this example, or at most
some finite order moving average).

6.6.7 Example: Stochastic Volatility Model


Unlike the previous example, efficient estimation in the stochastic volatility model example
requires the use of a HAC covariance estimator. The stochastic volatility estimator uses
unconditional moments which are serially correlated whenever the data has time-varying
volatility. For example, the moment condition based on E[|rt|] is autocorrelated since E[|rt rt−j|] ≠ E[|rt|]2 (see eq. (6.19)). All of the parameter estimates in table 6.2 were computed using a Newey-West covariance estimator with 12 lags, which was chosen using the cT^{1/3} rule with c = 1.2. Rather than use actual data to investigate the value of various HAC
estimators, consider a simple Monte Carlo where the DGP is

Lags      Newey-West            Autoregressive
          β̂        γ̂           β̂        γ̂
0         0.975    0.499
1         0.979    0.173        0.982    -0.082
2         0.979    0.160        0.978     0.204
3         0.979    0.200        0.977     0.399
4         0.978    0.257        0.976     0.493
5         0.978    0.276        0.976     0.453

Table 6.3: The choice of variance estimator can have an effect on the estimated parame-
ters in a 2-step GMM procedure. The estimate of the discount rate is fairly stable, but the
estimate of the coefficient of risk aversion changes as the long-run covariance estimator
varies. Note that the NW estimation with 0 lags is the just the usual covariance estimator.

\[
r_t = \sigma_t\varepsilon_t \tag{6.67}
\]
\[
\ln\sigma_t^2 = -7.36 + 0.9\left(\ln\sigma_{t-1}^2 - (-7.36)\right) + 0.363\,\eta_t
\]


which corresponds to an annualized volatility of 22%. Both shocks were standard normal.
1000 replications with T = 1000 and 2500 were conducted and the covariance matrix was
estimated using 4 different estimators: a misspecified covariance assuming the moments
are uncorrelated, a HAC using 1.2T 1/3 , a VAR estimator where the lag length is automati-
cally chosen by the SIC, and an “infeasible” estimate computed using a Newey-West esti-
mator computed from an auxiliary run of 10 million simulated data points. The first-step
estimator was estimated using an identity matrix.
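
A minimal simulation sketch of this DGP (not from the original text; the centering of the log-variance around −7.36 follows the reconstruction of eq. (6.67) above):

    import numpy as np

    def simulate_sv(T, omega=-7.36, rho=0.9, sigma_eta=0.363, seed=0):
        """Simulate the stochastic volatility DGP in eq. (6.67): the log variance
        mean-reverts to omega with persistence rho; both shocks are N(0,1)."""
        rng = np.random.default_rng(seed)
        eps = rng.standard_normal(T)
        eta = rng.standard_normal(T)
        ln_sig2 = np.empty(T)
        ln_sig2[0] = omega                      # start at the unconditional mean
        for t in range(1, T):
            ln_sig2[t] = omega + rho * (ln_sig2[t - 1] - omega) + sigma_eta * eta[t]
        r = np.exp(ln_sig2 / 2) * eps
        return r, ln_sig2

    r, _ = simulate_sv(2500)
    print("sample annualized volatility:", np.sqrt(252) * r.std())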
The results of this small simulation study are presented in table 6.4. This table contains
a lot of information, much of which is contradictory. It highlights the difficulties in actu-
ally making the correct choices when implementing a GMM estimator. For example, the
bias of the 8 moment estimator is generally at least as large as the bias from the 5 moment
estimator, although the root mean square error is generally better. This highlights the gen-
eral bias-variance trade-off that is made when using more moments: more moments leads
to less variance but more bias. The only clear gains evident from the table come from moving from 5 to 8 moment conditions and from using the infeasible covariance estimator. The additional moments contained information about both the
persistence ρ and the volatility of volatility σ.

6.7 Special Cases of GMM


GMM can be viewed as a unifying class which nests many common estimators. Estimators used in
frequentist econometrics can be classified into one of three types: M-estimators (maxi-

5 moment conditions 8 moment conditions


Bias
T =1000 ω ρ σ ω ρ σ
Inefficient -0.000 -0.024 -0.031 0.001 -0.009 -0.023
Serial Uncorr. 0.013 0.004 -0.119 0.013 0.042 -0.188
Newey-West -0.033 -0.030 -0.064 -0.064 -0.009 -0.086
VAR -0.035 -0.038 -0.058 -0.094 -0.042 -0.050
Infeasible -0.002 -0.023 -0.047 -0.001 -0.019 -0.015

T=2500
Inefficient 0.021 -0.017 -0.036 0.021 -0.010 -0.005
Serial Uncorr. 0.027 -0.008 -0.073 0.030 0.027 -0.118
Newey-West -0.001 -0.034 -0.027 -0.022 -0.018 -0.029
VAR 0.002 -0.041 -0.014 -0.035 -0.027 -0.017
Infeasible 0.020 -0.027 -0.011 0.020 -0.015 0.012

RMSE
T=1000
Inefficient 0.121 0.127 0.212 0.121 0.073 0.152
Serial Uncorr. 0.126 0.108 0.240 0.128 0.081 0.250
Newey-West 0.131 0.139 0.217 0.141 0.082 0.170
VAR 0.130 0.147 0.218 0.159 0.132 0.152
Infeasible 0.123 0.129 0.230 0.128 0.116 0.148

T=2500
Inefficient 0.075 0.106 0.194 0.075 0.055 0.114
Serial Uncorr. 0.079 0.095 0.201 0.082 0.065 0.180
Newey-West 0.074 0.102 0.182 0.080 0.057 0.094
VAR 0.072 0.103 0.174 0.085 0.062 0.093
Infeasible 0.075 0.098 0.185 0.076 0.054 0.100

Table 6.4: Results from the Monte Carlo experiment on the SV model. Two data lengths
(T = 1000 and T = 2500) and two sets of moments were used. The table shows how
difficult it can be to find reliable rules for improving finite sample performance. The only
clean gains come from increasing the sample size and/or number of moments.

mum), R-estimators (rank), and L-estimators (linear combination). Most estimators pre-
sented in this course, including OLS and MLE, are in the class of M-estimators. All M-class
estimators are the solution to some extremum problem such as minimizing the sum of
squares or maximizing the log likelihood.
In contrast, all R-estimators make use of rank statistics. The most common examples
include the minimum, the maximum and rank correlation, a statistic computed by cal-
culating the usual correlation on the rank of the data rather than on the data itself. R-
estimators are robust to certain issues and are particularly useful in analyzing nonlinear
relationships. L-estimators are defined as any linear combination of rank estimators. The
classical example of an L-estimator is a trimmed mean, which is similar to the usual mean
estimator except some fraction of the data in each tail is eliminated, for example the top
and bottom 1%. L-estimators are often substantially more robust than their M-estimator
counterparts and often only make small sacrifices in efficiency. Despite the potential ad-
vantages of L-estimators, strong assumptions are needed to justify their use and difficulties
in deriving theoretical properties limit their practical application.
GMM is obviously an M-estimator since it is the result of a minimization and any esti-
mator nested in GMM must also be an M-estimator and most M-estimators are nested in
GMM. The most important exception is a subclass of estimators known as classical mini-
mum distance (CMD). CMD estimators minimize the distance between a restricted param-
eter space and an initial set of estimates. The final parameter estimates generally include fewer parameters than the initial estimates or satisfy non-linear restrictions. CMD estimators are
not widely used in financial econometrics, although they occasionally allow for feasible
solutions to otherwise infeasible problems – usually because direct estimation of the pa-
rameters in the restricted parameter space is difficult or impossible using nonlinear opti-
mizers.

6.7.1 Classical Method of Moments

The obvious example of GMM is classical method of moments. Consider using MM to es-
timate the parameters of a normal distribution. The two estimators are

\[
\mu = T^{-1}\sum_{t=1}^{T} y_t \tag{6.68}
\]
\[
\sigma^2 = T^{-1}\sum_{t=1}^{T}\left(y_t - \mu\right)^2 \tag{6.69}
\]

which can be transformed into moment conditions

\[
g_T(\mathbf{w},\theta) = \begin{bmatrix} T^{-1}\sum_{t=1}^{T}\left(y_t - \mu\right) \\ T^{-1}\sum_{t=1}^{T}\left(y_t - \mu\right)^2 - \sigma^2 \end{bmatrix} \tag{6.70}
\]

which obviously have the same solutions. If the data are i.i.d., then, defining ε̂t = yt − µ̂, S
can be consistently estimated by

\[
\begin{aligned}
\hat{S} &= T^{-1}\sum_{t=1}^{T}\left[g_t\left(w_t,\hat{\theta}\right)g_t\left(w_t,\hat{\theta}\right)'\right] \tag{6.71}\\
&= T^{-1}\sum_{t=1}^{T}\begin{bmatrix} \hat{\varepsilon}_t^2 & \hat{\varepsilon}_t\left(\hat{\varepsilon}_t^2-\sigma^2\right) \\ \hat{\varepsilon}_t\left(\hat{\varepsilon}_t^2-\sigma^2\right) & \left(\hat{\varepsilon}_t^2-\sigma^2\right)^2\end{bmatrix} \\
&= T^{-1}\begin{bmatrix} \sum_{t=1}^{T}\hat{\varepsilon}_t^2 & \sum_{t=1}^{T}\hat{\varepsilon}_t^3 \\ \sum_{t=1}^{T}\hat{\varepsilon}_t^3 & \sum_{t=1}^{T}\hat{\varepsilon}_t^4 - 2\sigma^2\hat{\varepsilon}_t^2 + \sigma^4 \end{bmatrix} \quad \text{since } \sum_{t=1}^{T}\hat{\varepsilon}_t = 0 \\
E[\hat{S}] &\approx \begin{bmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{bmatrix} \quad \text{if normal}
\end{aligned}
\]

Note that the last line is exactly the variance of the mean and variance if the covariance
was estimated assuming normal maximum likelihood. Similarly, GT can be consistently
estimated by

\[
\begin{aligned}
\hat{G} &= T^{-1}\begin{bmatrix} \dfrac{\partial \sum_{t=1}^{T}\left(y_t-\mu\right)}{\partial \mu} & \dfrac{\partial \sum_{t=1}^{T}\left[\left(y_t-\mu\right)^2-\sigma^2\right]}{\partial \mu} \\[1.2em] \dfrac{\partial \sum_{t=1}^{T}\left(y_t-\mu\right)}{\partial \sigma^2} & \dfrac{\partial \sum_{t=1}^{T}\left[\left(y_t-\mu\right)^2-\sigma^2\right]}{\partial \sigma^2} \end{bmatrix}_{\theta=\hat{\theta}} \tag{6.72}\\
&= T^{-1}\begin{bmatrix} \sum_{t=1}^{T}-1 & -2\sum_{t=1}^{T}\hat{\varepsilon}_t \\ 0 & \sum_{t=1}^{T}-1 \end{bmatrix} \\
&= T^{-1}\begin{bmatrix} \sum_{t=1}^{T}-1 & 0 \\ 0 & \sum_{t=1}^{T}-1 \end{bmatrix} \quad \text{by } \sum_{t=1}^{T}\hat{\varepsilon}_t = 0 \\
&= T^{-1}\begin{bmatrix} -T & 0 \\ 0 & -T \end{bmatrix} \\
&= -I_2
\end{aligned}
\]

and since the model is just-identified (as many moment conditions as parameters), \(\left(\hat{G}_T'\hat{S}^{-1}\hat{G}_T\right)^{-1} = \hat{G}_T^{-1}\hat{S}\left(\hat{G}_T'\right)^{-1} = \hat{S}\), the usual covariance estimator for the mean and variance in the method of moments problem.
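
A minimal numerical sketch of this example (simulated data, not from the original text) that computes the method of moments estimates, Ŝ, and the implied parameter covariance Ŝ/T:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=0.5, scale=2.0, size=5000)
    T = y.shape[0]

    # Method of moments estimates of the mean and variance (eqs. 6.68-6.69)
    mu = y.mean()
    sigma2 = ((y - mu) ** 2).mean()

    # Stacked moment conditions at the estimates, g_t = [eps_t, eps_t^2 - sigma2]
    eps = y - mu
    g = np.column_stack([eps, eps ** 2 - sigma2])

    # S-hat from the outer product of the moments; since G-hat = -I_2, avar = S / T
    S_hat = g.T @ g / T
    param_cov = S_hat / T
    print("mu, sigma2:", mu, sigma2)
    print("std errors:", np.sqrt(np.diag(param_cov)))
    print("normal benchmark:", np.sqrt(sigma2 / T), np.sqrt(2 * sigma2 ** 2 / T))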

6.7.2 OLS

OLS (and other least squares estimators, such as WLS) can also be viewed as a special case
of GMM by using the orthogonality conditions as the moments.

gT (w, θ ) = T −1 X0 (y − Xβ ) = T −1 X0 ε (6.73)
and the solution is obviously given by

β̂ = (X0 X)−1 X0 y. (6.74)

If the data are from a stationary martingale difference sequence, then S can be consis-
tently estimated by

\[
\hat{S} = T^{-1}\sum_{t=1}^{T}\mathbf{x}_t'\hat{\varepsilon}_t\hat{\varepsilon}_t\mathbf{x}_t = T^{-1}\sum_{t=1}^{T}\hat{\varepsilon}_t^2\mathbf{x}_t'\mathbf{x}_t \tag{6.75}
\]

and GT can be estimated by

\[
\hat{G} = T^{-1}\frac{\partial X'\left(\mathbf{y}-X\beta\right)}{\partial \beta'} = -T^{-1}X'X \tag{6.76}
\]

Combining these two, the covariance of the OLS estimator is


\[
\left(-T^{-1}X'X\right)^{-1}\left(T^{-1}\sum_{t=1}^{T}\hat{\varepsilon}_t^2\mathbf{x}_t'\mathbf{x}_t\right)\left(-T^{-1}X'X\right)^{-1} = \hat{\Sigma}_{XX}^{-1}\hat{S}\hat{\Sigma}_{XX}^{-1} \tag{6.77}
\]

which is the White heteroskedasticity robust covariance estimator.
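
A minimal sketch (simulated heteroskedastic data, not from the original text) computing OLS from the moment conditions and the White covariance in eq. (6.77):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1000
    X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
    beta0 = np.array([1.0, 0.5, -0.25])
    e = rng.standard_normal(T) * np.sqrt(0.5 + X[:, 1] ** 2)   # heteroskedastic errors
    y = X @ beta0 + e

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)               # eq. (6.74)
    resid = y - X @ beta_hat
    S_hat = (X * resid[:, None] ** 2).T @ X / T                # T^{-1} sum e_t^2 x_t' x_t
    Sxx = X.T @ X / T
    avar = np.linalg.inv(Sxx) @ S_hat @ np.linalg.inv(Sxx)     # asymptotic covariance
    print("beta:", beta_hat)
    print("White std errors:", np.sqrt(np.diag(avar) / T))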

6.7.3 MLE and Quasi-MLE


GMM also nests maximum likelihood and quasi-MLE (QMLE) estimators. An estimator is
said to be a QMLE if one distribution is assumed, for example the normal, when the data are generated by some other distribution, for example a standardized Student's t. Most ARCH-
type estimators are treated as QMLE since normal maximum likelihood is often used when
the standardized residuals are clearly not normal, exhibiting both skewness and excess kur-
tosis. The most important consequence of QMLE is that the information matrix inequality
is generally not valid and robust standard errors must be used. To formulate the (Q)MLE
problem, the moment conditions are simply the average scores,
\[
g_T(\mathbf{w},\theta) = T^{-1}\sum_{t=1}^{T}\nabla_\theta\, l\left(w_t,\theta\right) \tag{6.78}
\]

where l (·) is the log-likelihood. If the scores are a martingale difference sequence, S can be
consistently estimated by

\[
\hat{S} = T^{-1}\sum_{t=1}^{T}\nabla_\theta\, l\left(w_t,\theta\right)\nabla_{\theta'}\, l\left(w_t,\theta\right) \tag{6.79}
\]

and GT can be estimated by

\[
\hat{G} = T^{-1}\sum_{t=1}^{T}\nabla_{\theta\theta'}\, l\left(w_t,\theta\right). \tag{6.80}
\]

However, in terms of expressions common to MLE estimation,

\[
\begin{aligned}
E[\hat{S}] &= E\left[T^{-1}\sum_{t=1}^{T}\nabla_\theta\, l\left(w_t,\theta\right)\nabla_{\theta'}\, l\left(w_t,\theta\right)\right] \tag{6.81}\\
&= T^{-1}\sum_{t=1}^{T}E\left[\nabla_\theta\, l\left(w_t,\theta\right)\nabla_{\theta'}\, l\left(w_t,\theta\right)\right] \\
&= T^{-1}\,T\,E\left[\nabla_\theta\, l\left(w_t,\theta\right)\nabla_{\theta'}\, l\left(w_t,\theta\right)\right] \\
&= E\left[\nabla_\theta\, l\left(w_t,\theta\right)\nabla_{\theta'}\, l\left(w_t,\theta\right)\right] \\
&= \mathcal{J}
\end{aligned}
\]

and

\[
\begin{aligned}
E[\hat{G}] &= E\left[T^{-1}\sum_{t=1}^{T}\nabla_{\theta\theta'}\, l\left(w_t,\theta\right)\right] \tag{6.82}\\
&= T^{-1}\sum_{t=1}^{T}E\left[\nabla_{\theta\theta'}\, l\left(w_t,\theta\right)\right] \\
&= T^{-1}\,T\,E\left[\nabla_{\theta\theta'}\, l\left(w_t,\theta\right)\right] \\
&= E\left[\nabla_{\theta\theta'}\, l\left(w_t,\theta\right)\right] \\
&= -\mathcal{I}
\end{aligned}
\]
The GMM covariance is Ĝ−1Ŝ(Ĝ−1)′ which, in terms of MLE notation, is I−1J I−1. If the information matrix equality is valid (I = J), this simplifies to I−1, the usual variance in MLE. However, when the assumptions of the MLE procedure are not valid, the robust form of the covariance estimator, I−1J I−1 = Ĝ−1Ŝ(Ĝ−1)′, should be used, and failure to do so can result in tests with incorrect size.
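
A minimal sketch of the sandwich (robust) covariance for a normal QMLE of a mean and variance when the data are actually Student's t (analytic scores and information for this simple model; not from the original text):

    import numpy as np

    rng = np.random.default_rng(0)
    T, nu = 5000, 5.0
    # t-distributed data scaled to unit variance, so the normal likelihood is misspecified
    y = rng.standard_t(nu, size=T) / np.sqrt(nu / (nu - 2))

    # Normal QMLE of (mu, sigma2) is the sample mean and variance
    mu = y.mean()
    sigma2 = ((y - mu) ** 2).mean()
    e = y - mu

    # Scores of the normal log-likelihood, one row per observation
    scores = np.column_stack([e / sigma2,
                              -0.5 / sigma2 + 0.5 * e ** 2 / sigma2 ** 2])
    J = scores.T @ scores / T                      # outer product of scores
    # Information matrix (negative expected Hessian) of the normal log-likelihood
    info = np.array([[1.0 / sigma2, 0.0],
                     [0.0, 0.5 / sigma2 ** 2]])
    info_inv = np.linalg.inv(info)
    robust = info_inv @ J @ info_inv / T           # sandwich covariance of (mu, sigma2)
    naive = info_inv / T                           # valid only if the model is correct
    print("robust s.e.:", np.sqrt(np.diag(robust)))
    print("naive  s.e.:", np.sqrt(np.diag(naive)))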

6.8 Diagnostics
The estimation of a GMM model begins by specifying the population moment conditions
which, if correct, have mean 0. This is an assumption and is often a hypothesis of inter-
est. For example, in the consumption asset pricing model the discounted returns should
have conditional mean 0 and deviations from 0 indicate that the model is misspecified.
The standard method to test whether the moment conditions are valid is known as the J test and is
defined

\[
J = T\, g_T(\mathbf{w},\hat{\theta})'\hat{S}^{-1}g_T(\mathbf{w},\hat{\theta}) \tag{6.83}
\]
\[
= T\, Q_T\left(\hat{\theta}\right) \tag{6.84}
\]

which is T times the minimized GMM objective function where Ŝ is a consistent estimator
of the long-run covariance of the moment conditions. The distribution of J is χ²q−p, where
q −p measures the degree of overidentification. The distribution of the test follows directly
from the asymptotic normality of the estimated moment conditions (see section 6.5.3). It is
important to note that the standard J test requires the use of a multi-step estimator which
uses an efficient weighting matrix (WT →p S−1).
In cases where an efficient estimator of WT is not available, an inefficient test can
be computed using

\[
J^{W_T} = T\, g_T(\mathbf{w},\hat{\theta})'W_T^{1/2}\left[\left(I_q - W^{1/2}G\left(G'WG\right)^{-1}G'W^{1/2}\right)W^{1/2}SW^{1/2}\left(I_q - W^{1/2}G\left(G'WG\right)^{-1}G'W^{1/2}\right)'\right]^{-1}W_T^{1/2}g_T(\mathbf{w},\hat{\theta}) \tag{6.85}
\]

which follows directly from the asymptotic normality of the estimated moment conditions even when the weighting matrix is sub-optimally chosen. J^{W_T}, like J, is distributed χ²q−p, although it is not T times the first-step GMM objective. Note that the inverse in eq. (6.85) is of a reduced-rank matrix and must be computed using a Moore-Penrose generalized inverse.
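
A minimal sketch of the efficient J test in eqs. (6.83)–(6.84), taking an estimated moment matrix as given (illustrative inputs, not from the original text):

    import numpy as np
    from scipy import stats

    def j_test(g, n_params):
        """J test of over-identifying restrictions with an efficient weighting matrix.

        g is a T x q array of moment conditions evaluated at the final estimates;
        the long-run covariance is estimated here with the simple outer product,
        appropriate when the moments are serially uncorrelated."""
        g = np.asarray(g, dtype=float)
        T, q = g.shape
        g_bar = g.mean(axis=0)
        S_hat = (g - g_bar).T @ (g - g_bar) / T
        stat = T * g_bar @ np.linalg.solve(S_hat, g_bar)
        dof = q - n_params
        return stat, stats.chi2.sf(stat, dof)

    # Illustration: 3 moments, 1 parameter, correctly specified (mean-zero moments)
    rng = np.random.default_rng(0)
    g = rng.standard_normal((1000, 3))
    print(j_test(g, n_params=1))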

6.8.1 Example: Linear Factor Models

The CAPM will be used to examine the use of diagnostic tests. The CAPM was estimated
on the 25 Fama-French 5 by 5 sort on size and BE/ME using data from 1926 until 2010. The
moments in this specification can be described

\[
g_t\left(\mathbf{w}_t,\theta\right) = \begin{bmatrix} \left(r_t^e - \beta f_t\right)\otimes f_t \\ r_t^e - \beta\lambda \end{bmatrix} \tag{6.86}
\]

                    CAPM                     3 Factor
Method         J ∼ χ²₂₄     p-val       J ∼ χ²₂₂     p-val
2-Step         98.0         0.000       93.3         0.000
k-Step         98.0         0.000       92.9         0.000
Continuous     98.0         0.000       79.5         0.000
2-step NW      110.4        0.000       108.5        0.000
2-step VAR     103.7        0.000       107.8        0.000

Table 6.5: Values of the J test using different estimation strategies. All of the tests agree,
although the continuously updating version is substantially smaller in the 3 factor model
(but highly significant since distributed χ²₂₂).

where ft is the excess return on the market portfolio and ret is a 25 by 1 vector of excess
returns on the FF portfolios. There are 50 moment conditions and 26 unknown parame-
ters so this system is overidentified and the J statistic is χ²₂₄ distributed. The J-statistics
were computed for the four estimation strategies previously described, the inefficient 1-
step test, 2-step, K -step and continuously updating. The values of these statistics, con-
tained in table 6.5, indicate the CAPM is overwhelmingly rejected. While only the simple
covariance estimator was used to estimate the long run covariance, all of the moments are
portfolio returns and this choice seems reasonable considering the lack of predictability of
monthly returns. The model was then extended to include the size and value factors, which resulted in 100 moment equations and 78 (75 βs + 3 risk premia) parameters, and so the J statistic is distributed as a χ²₂₂.

6.9 Parameter Inference

6.9.1 The delta method and nonlinear hypotheses

Thus far, all hypothesis tests encountered have been linear, and so can be expressed H0 :
Rθ − r = 0 where R is an M by P matrix of linear restrictions and r is an M by 1 vector of
constants. While linear restrictions are the most common type of null hypothesis, some
interesting problems require tests of nonlinear restrictions.
Define R(θ) to be an M by 1 vector-valued function. From this nonlinear function, a nonlinear hypothesis can be specified H0 : R(θ) = 0. To test this hypothesis, the distribution of R(θ) needs to be determined (as always, under the null). The delta method can be used to simplify finding this distribution in cases where √T(θ̂ − θ0) is asymptotically normal as
long as R(θ ) is a continuously differentiable function of θ at θ 0 .

Definition 6.5 (Delta method). Let √T(θ̂ − θ0) →d N(0, Σ) where Σ is a positive definite covariance matrix. Further, suppose that R(θ) is a continuously differentiable function of θ from Rp → Rm, m ≤ p. Then,

\[
\sqrt{T}\left(R(\hat{\theta}) - R(\theta_0)\right) \stackrel{d}{\rightarrow} N\left(0,\; \frac{\partial R(\theta_0)}{\partial \theta'}\Sigma\frac{\partial R(\theta_0)'}{\partial \theta}\right) \tag{6.87}
\]
∂ θ0 ∂θ

This result is easy to relate to the class of linear restrictions, R(θ ) = Rθ − r. In this class,

\[
\frac{\partial R(\theta_0)}{\partial \theta'} = R \tag{6.88}
\]

and the distribution under the null is

\[
\sqrt{T}\left(R\hat{\theta} - R\theta_0\right) \stackrel{d}{\rightarrow} N\left(0,\; R\Sigma R'\right). \tag{6.89}
\]
Once the distribution of the nonlinear function R(θ) has been determined, using the delta method to conduct a nonlinear hypothesis test is straightforward with one big catch. The null hypothesis is H0 : R(θ0) = 0 and a Wald test can be calculated as

\[
W = T\, R(\hat{\theta})'\left[\frac{\partial R(\theta_0)}{\partial \theta'}\Sigma\frac{\partial R(\theta_0)'}{\partial \theta}\right]^{-1}R(\hat{\theta}). \tag{6.90}
\]
The distribution of the Wald test is determined by the rank of ∂R(θ)/∂θ' evaluated under H0. In some simple cases the rank is obvious. For example, in the linear hypothesis testing framework, the rank of ∂R(θ)/∂θ' is simply the rank of the matrix R. In a test of the hypothesis H0 : θ1θ2 − 1 = 0,

\[
\frac{\partial R(\theta)}{\partial \theta'} = \begin{bmatrix} \theta_2 & \theta_1 \end{bmatrix} \tag{6.91}
\]

assuming there are two parameters in θ, and the rank of ∂R(θ)/∂θ' must be one if the null is true since both parameters must be non-zero to have a product of 1. The distribution of a Wald test of this null is a χ²₁. However, consider a test of the null H0 : θ1θ2 = 0. The Jacobian of this function is identical but the slight change in the null has large consequences. For this null to be true, one of three things must occur: θ1 = 0 and θ2 ≠ 0, θ1 ≠ 0 and θ2 = 0, or θ1 = 0 and θ2 = 0. In the first two cases, the rank of ∂R(θ)/∂θ' is 1. However, in the last case the rank is 0. When the rank of the Jacobian can take multiple values depending on the value of the true parameter, the distribution under the null is nonstandard and none of the standard tests are directly applicable.
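
A minimal sketch of a delta-method Wald test of H0 : θ1θ2 − 1 = 0, taking the point estimates and their estimated covariance as given (illustrative numbers, not from the original text):

    import numpy as np
    from scipy import stats

    # Assumed inputs: parameter estimates and the estimated covariance of theta_hat
    theta_hat = np.array([1.1, 0.95])
    cov_theta = np.array([[0.010, 0.002],
                          [0.002, 0.008]])

    # Nonlinear restriction R(theta) = theta_1 * theta_2 - 1 and its Jacobian
    R = np.array([theta_hat[0] * theta_hat[1] - 1.0])
    J = np.array([[theta_hat[1], theta_hat[0]]])   # dR/dtheta' at theta_hat

    V = J @ cov_theta @ J.T                        # delta-method variance of R(theta_hat)
    W = float(R @ np.linalg.solve(V, R))           # Wald statistic, chi^2 with 1 dof
    print("W =", W, "p-value =", stats.chi2.sf(W, df=1))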

6.9.2 Wald Tests

Wald tests in GMM are essentially identical to Wald tests in OLS; W is T times the standardized, summed and squared deviations from the null. If the efficient choice of WT is used,

\[
\sqrt{T}\left(\hat{\theta}-\theta_0\right) \stackrel{d}{\rightarrow} N\left(0,\; \left(G'S^{-1}G\right)^{-1}\right) \tag{6.92}
\]

and a Wald test of the (linear) null H0 : Rθ − r = 0 is computed as

\[
W = T\left(R\hat{\theta}-r\right)'\left[R\left(G'S^{-1}G\right)^{-1}R'\right]^{-1}\left(R\hat{\theta}-r\right) \stackrel{d}{\rightarrow} \chi^2_m \tag{6.93}
\]

where m is the rank of R. Nonlinear hypotheses can be tested in an analogous manner using the delta method. When using the delta method, m is the rank of ∂R(θ0)/∂θ'. If an inefficient choice of WT is used,

\[
\sqrt{T}\left(\hat{\theta}-\theta_0\right) \stackrel{d}{\rightarrow} N\left(0,\; \left(G'WG\right)^{-1}G'WSWG\left(G'WG\right)^{-1}\right) \tag{6.94}
\]

and

\[
W = T\left(R\hat{\theta}-r\right)'V^{-1}\left(R\hat{\theta}-r\right) \stackrel{d}{\rightarrow} \chi^2_m \tag{6.95}
\]

where V = R(G'WG)−1G'WSWG(G'WG)−1R'.
T-tests and t-stats are also valid and can be computed in the usual manner for single hypotheses,

\[
t = \frac{R\hat{\theta}-r}{\sqrt{V}} \stackrel{d}{\rightarrow} N(0,1) \tag{6.96}
\]

where the form of V depends on whether an efficient or inefficient choice of WT was used. In the case of the t-stat of a single parameter,

\[
t = \frac{\hat{\theta}_i}{\sqrt{V_{[ii]}}} \stackrel{d}{\rightarrow} N(0,1) \tag{6.97}
\]

where V[ii] indicates the element in the ith diagonal position.

6.9.3 Likelihood Ratio (LR) Tests

Likelihood Ratio-like tests, despite GMM making no distributional assumptions, are avail-
able. Let θ̂ indicate the unrestricted parameter estimate and let θ̃ indicate the solution
of

\[
\tilde{\theta} = \underset{\theta}{\arg\min}\; Q_T(\theta) \quad \text{subject to } R\theta - r = 0 \tag{6.98}
\]

where QT(θ) = gT(w, θ)'Ŝ−1gT(w, θ) and Ŝ is an estimate of the long-run covariance of the moment conditions computed from the unrestricted model (using θ̂). An LR-like test statistic can be formed

\[
LR = T\left(g_T(\mathbf{w},\tilde{\theta})'\hat{S}^{-1}g_T(\mathbf{w},\tilde{\theta}) - g_T(\mathbf{w},\hat{\theta})'\hat{S}^{-1}g_T(\mathbf{w},\hat{\theta})\right) \stackrel{d}{\rightarrow} \chi^2_m \tag{6.99}
\]

Implementation of this test has one crucial aspect. The covariance matrix of the moments
used in the second-step estimation must be the same for the two models. Using different
covariance estimates can produce a statistic which is not χ 2 .
The likelihood ratio-like test has one significant advantage: it is invariant to equiva-
lent reparameterization of either the moment conditions or the restriction (if nonlinear)
while the Wald test is not. The intuition behind this result is simple; LR-like tests will be
constructed using the same values of QT for any equivalent reparameterization and so the
numerical value of the test statistic will be unchanged.

6.9.4 LM Tests

LM tests are also available and are the result of solving the Lagrangian

\[
\tilde{\theta} = \underset{\theta}{\arg\min}\; Q_T(\theta) - \lambda'\left(R\theta - r\right) \tag{6.100}
\]

In the GMM context, LM tests examine how much larger the restricted moment condi-
tions are than their unrestricted counterparts. The derivation is messy and computation is
harder than either Wald or LR, but the form of the LM test statistic is

\[
LM = T\, g_T(\mathbf{w},\tilde{\theta})'S^{-1}G\left(G'S^{-1}G\right)^{-1}G'S^{-1}g_T(\mathbf{w},\tilde{\theta}) \stackrel{d}{\rightarrow} \chi^2_m \tag{6.101}
\]
The primary advantage of the LM test is that it only requires estimation under the null
which can, in some circumstances, be much simpler than estimation under the alternative.
You should note that the number of moment conditions must be the same in the restricted
model as in the unrestricted model.

6.10 Two-Stage Estimation

Many common problems involve the estimation of parameters in stages. The most com-
mon example in finance are Fama-MacBeth regressions(Fama & MacBeth 1973) which use
two sets of regressions to estimate the factor loadings and risk premia. Another example
is models which first fit conditional variances and then, conditional on the conditional
variances, estimate conditional correlations. To understand the impact of first-stage es-
timation error on second-stage parameters, it is necessary to introduce some additional
notation to distinguish the first-stage moment conditions from the second stage moment
conditions. Let g1T(w, ψ) = T−1 ΣTt=1 g1(wt, ψ) and g2T(w, ψ, θ) = T−1 ΣTt=1 g2(wt, ψ, θ)

be the first- and second-stage moment conditions. The first-stage moment conditions only
depend on a subset of the parameters,ψ, and the second-stage moment conditions depend
on both ψ and θ . The first-stage moments will be used to estimate ψ and the second-
stage moments will treat ψ̂ as known when estimating θ . Assuming that both stages are
just-identified, which is the usual scenario, then

\[
\sqrt{T}\begin{bmatrix} \hat{\psi} - \psi \\ \hat{\theta} - \theta_0 \end{bmatrix} \stackrel{d}{\rightarrow} N\left(\begin{bmatrix}0\\0\end{bmatrix},\; G^{-1}S\left(G^{-1}\right)'\right)
\]
\[
G = \begin{bmatrix} G_{1\psi} & 0 \\ G_{2\psi} & G_{2\theta}\end{bmatrix}, \qquad G_{1\psi} = \frac{\partial g_{1T}}{\partial \psi'},\quad G_{2\psi} = \frac{\partial g_{2T}}{\partial \psi'},\quad G_{2\theta} = \frac{\partial g_{2T}}{\partial \theta'}
\]
\[
S = \operatorname{avar}\left\{\left[\sqrt{T}\, g_{1T}',\; \sqrt{T}\, g_{2T}'\right]'\right\}
\]

Application of the partitioned inverse shows that the asymptotic variance of the first-stage parameters is identical to the usual expression, and so √T(ψ̂ − ψ) →d N(0, G1ψ−1 Sψψ (G1ψ−1)′), where Sψψ is the upper block of S which corresponds to the g1 moments. The distribution
of the second-stage parameters differs from what would be found if the estimation of ψ
was ignored, and so

\[
\begin{aligned}
\sqrt{T}\left(\hat{\theta}-\theta\right) &\stackrel{d}{\rightarrow} N\left(0,\; G_{2\theta}^{-1}\left[-G_{2\psi}G_{1\psi}^{-1},\, I\right]S\left[-G_{2\psi}G_{1\psi}^{-1},\, I\right]'\left(G_{2\theta}^{-1}\right)'\right) \tag{6.102}\\
&= N\left(0,\; G_{2\theta}^{-1}\left[S_{\theta\theta} - G_{2\psi}G_{1\psi}^{-1}S_{\psi\theta} - S_{\theta\psi}\left(G_{1\psi}^{-1}\right)'G_{2\psi}' + G_{2\psi}G_{1\psi}^{-1}S_{\psi\psi}\left(G_{1\psi}^{-1}\right)'G_{2\psi}'\right]\left(G_{2\theta}^{-1}\right)'\right).
\end{aligned}
\]

The intuition for this form comes from considering an expansion of the second-stage moments first around the second-stage parameters, and then accounting for the additional variation due to the first-stage parameter estimates. Expanding the second-stage moments around the true second-stage parameters,

\[
\sqrt{T}\left(\hat{\theta}-\theta\right) \approx -G_{2\theta}^{-1}\sqrt{T}\, g_{2T}\left(\mathbf{w},\hat{\psi},\theta_0\right).
\]
If ψ were known, then this would be sufficient to construct the asymptotic variance.
When ψ is estimated, it is necessary to expand the final term around the first-stage param-
eters, and so
\[
\sqrt{T}\left(\hat{\theta}-\theta\right) \approx -G_{2\theta}^{-1}\left[\sqrt{T}\, g_{2T}\left(\mathbf{w},\psi_0,\theta_0\right) + G_{2\psi}\sqrt{T}\left(\hat{\psi}-\psi\right)\right]
\]

which shows that the error in the estimation of ψ appears in the estimation error of θ .

Finally, using the relationship √T(ψ̂ − ψ) ≈ −G1ψ−1√T g1T(w, ψ0), the expression can be completed, and

\[
\begin{aligned}
\sqrt{T}\left(\hat{\theta}-\theta\right) &\approx -G_{2\theta}^{-1}\left[\sqrt{T}\, g_{2T}\left(\mathbf{w},\psi_0,\theta_0\right) - G_{2\psi}G_{1\psi}^{-1}\sqrt{T}\, g_{1T}\left(\mathbf{w},\psi_0\right)\right] \\
&= -G_{2\theta}^{-1}\left[-G_{2\psi}G_{1\psi}^{-1},\, I\right]\sqrt{T}\begin{bmatrix} g_{1T}\left(\mathbf{w},\psi_0\right) \\ g_{2T}\left(\mathbf{w},\psi_0,\theta_0\right)\end{bmatrix}.
\end{aligned}
\]


Squaring this expression and replacing the outer-product of the moment conditions with
the asymptotic variance produces eq. (6.102).

6.10.1 Example: Fama-MacBeth Regression


Fama-MacBeth regression is a two-step estimation procedure where the first step is just-
identified and the second is over-identified. The first-stage moments are used to estimate
the factor loadings (β s) and the second-stage moments are used to estimate the risk pre-
mia. In an application to n portfolios and k factors there are q1 = nk moments in the
first-stage,

g1t (wt θ ) = (rt − β ft ) ⊗ ft


which are used to estimate n k parameters. The second stage uses k moments to estimate
k risk premia using

g2t (wt θ ) = β 0 (rt − β λ) .


It is necessary to account for the uncertainty in the estimation of β when constructing
confidence intervals for λ. Correct inference can be made by estimating the components of
eq. (6.102),
\[
\hat{G}_{1\beta} = T^{-1}\sum_{t=1}^{T} I_n \otimes f_t f_t',
\]
\[
\hat{G}_{2\beta} = T^{-1}\sum_{t=1}^{T} \left(r_t - \hat{\beta}\hat{\lambda}\right)'\otimes I_k - \hat{\beta}'\otimes\hat{\lambda}',
\]
\[
\hat{G}_{2\lambda} = T^{-1}\sum_{t=1}^{T}\hat{\beta}'\hat{\beta},
\]
\[
\hat{S} = T^{-1}\sum_{t=1}^{T}\begin{bmatrix}\left(r_t - \hat{\beta}f_t\right)\otimes f_t \\ \hat{\beta}'\left(r_t - \hat{\beta}\hat{\lambda}\right)\end{bmatrix}\begin{bmatrix}\left(r_t - \hat{\beta}f_t\right)\otimes f_t \\ \hat{\beta}'\left(r_t - \hat{\beta}\hat{\lambda}\right)\end{bmatrix}'.
\]
These expressions were applied to the 25 Fama-French size and book-to-market sorted

                  CAPM                                     Fama-French 3 Factor
           Correct          OLS - White             Correct          OLS - White
       λ̂     s.e.   t-stat   s.e.   t-stat      λ̂      s.e.   t-stat   s.e.   t-stat
VWMe  7.987  2.322  3.440    0.643  12.417     6.508   2.103  3.095    0.812  8.013
SMB     –                                      2.843   1.579  1.800    1.651  1.722
HML     –                                      3.226   1.687  1.912    2.000  1.613

Table 6.6: Annual risk premia, correct and OLS - White standard errors from the CAPM
and the Fama-French 3 factor model.

portfolios. Table 6.6 contains the standard errors and t-stats which are computed using
both incorrect inference – White standard errors which come from a standard OLS of the
mean excess return on the β s – and the consistent standard errors which are computed
using the expressions above. The standard error and t-stats for the excess return on the
market change substantially when the parameter estimation error in β is included.
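
A minimal sketch of the two-pass point estimates that underlie this example, with simulated data and a single factor (the standard-error correction in eq. (6.102) is not implemented here; names and parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 600, 25
    f = rng.standard_normal(T) * 0.04 + 0.006           # single simulated factor
    beta_true = rng.uniform(0.5, 1.5, n)
    r = np.outer(f, beta_true) + rng.standard_normal((T, n)) * 0.02

    # First pass: time-series regressions of each portfolio on the factor
    # (no intercept, matching g_1t = (r_t - beta f_t) x f_t)
    beta_hat = (r.T @ f) / (f @ f)

    # Second pass: cross-sectional regression of average returns on the betas,
    # matching g_2t = beta'(r_t - beta lambda)
    r_bar = r.mean(axis=0)
    lam_hat = (beta_hat @ r_bar) / (beta_hat @ beta_hat)
    print("risk premium estimate:", lam_hat, "factor mean:", f.mean())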

6.11 Weak Identification

The topic of weak identification has been a unifying theme in recent work on GMM and
related estimators. Three types of identification have previously been described: underi-
dentified, just-identified and overidentified. Weak identification bridges the gap between
just-identified and underidentified models. In essence, a model is weakly identified if it is
identified in a finite sample, but the amount of information available to estimate the pa-
rameters does not increase with the sample. This is a difficult concept, so consider it in the
context of the two models presented in this chapter.
In the consumption asset pricing model, the moment conditions are all derived from
\[
\left(\beta\left(1 + r_{j,t+1}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} - 1\right)z_t. \tag{6.103}
\]

Weak identification can appear in at least two places in these moment conditions. First, assume that c_{t+1}/c_t ≈ 1. If it were exactly 1, then γ would be unidentified. In practice con-
sumption is very smooth and so the variation in this ratio from 1 is small. If the variation
is decreasing over time, this problem would be weakly identified. Alternatively suppose
that the instrument used, z t , is not related to future marginal utilities or returns at all. For
example, suppose zt is a simulated random variable. In this case,

\[
E\left[\left(\beta\left(1 + r_{j,t+1}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} - 1\right)z_t\right] = E\left[\beta\left(1 + r_{j,t+1}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} - 1\right]E\left[z_t\right] = 0 \tag{6.104}
\]
for any values of the parameters and so the moment condition is always satisfied. The

choice of instrument matters a great deal and should be made in the context of economic
and financial theories.
In the example of the linear factor models, weak identification can occur if a factor
which is not important for any of the included portfolios is used in the model. Consider
the moment conditions,
\[
g\left(\mathbf{w}_t,\theta_0\right) = \begin{bmatrix}\left(r_t - \beta f_t\right)\otimes f_t \\ r_t - \beta\lambda\end{bmatrix}. \tag{6.105}
\]

If one of the factors is totally unrelated to asset returns and has no explanatory power,
all β s corresponding to that factor will limit to 0. However, if this occurs then the second
set of moment conditions will be valid for any choice of λi ; λi is weakly identified. Weak
identification will make most inference nonstandard and so the limiting distributions of
most statistics are substantially more complicated. Unfortunately there are few easy fixes
for this problem and common sense and economic theory must be used when examining
many problems using GMM.

6.12 Considerations for using GMM


This chapter has provided an introduction to GMM. However, before applying GMM to every
econometric problem, there are a few issues which should be considered.

6.12.1 The degree of overidentification


Overidentification is beneficial since it allows models to be tested in a simple manner us-
ing the J test. However, like most things in econometrics, there are trade-offs when decid-
ing how overidentified a model should be. Increasing the degree of overidentification by
adding extra moments but not adding more parameters can lead to substantial small sam-
ple bias and poorly behaving tests. Adding extra moments also increases the dimension of
the estimated long run covariance matrix, Ŝ which can lead to size distortion in hypothesis
tests. Ultimately, the number of moment conditions should be traded off against the sam-
ple size. For example, in a linear factor model with n portfolios and k factors there are n − k overidentifying restrictions and nk + k parameters. If testing the CAPM with monthly data
back to WWII (approx 700 monthly observations), the total number of moments should
be kept under 150. If using quarterly data (approx 250 quarters), the number of moment
conditions should be substantially smaller.

6.12.2 Estimation of the long run covariance


Estimation of the long run covariance is one of the most difficult issues when implement-
ing GMM. Best practice is to use the simplest estimator consistent with the data or

theoretical restrictions which is usually the estimator with the smallest parameter count.
If the moments can be reasonably assumed to be a martingale difference series then a sim-
ple outer-product based estimator is sufficient. HAC estimators should be avoided if the
moments are not autocorrelated (or cross-correlated). If the moments are persistent with
geometrically decaying autocorrelation, a simple VAR(1) model may be enough.

Exercises
Exercise 6.1. Suppose you were interested in testing a multi-factor model with 4 factors
and excess returns on 10 portfolios.

i. How many moment conditions are there?

ii. What are the moment conditions needed to estimate this model?

iii. How would you test whether the model correctly prices all assets. What are you really
testing?

iv. What are the requirements for identification?

v. What happens if you include a factor that is not relevant to the returns of any series?

Exercise 6.2. Suppose you were interested in estimating the CAPM with (potentially) non-
zero αs on the excess returns of two portfolios, r1e and r2e .

i. Describe the moment equations you would use to estimate the 4 parameters.

ii. Is this problem underidentified, just-identified, or overidentified?

iii. Describe how you would conduct a joint test of the null H0 : α1 = α2 = 0 against an
alternative that at least one was non-zero using a Wald test.

iv. Describe how you would conduct a joint test of the null H0 : α1 = α2 = 0 against an
alternative that at least one was non-zero using a LR-like test.

v. Describe how you would conduct a joint test of the null H0 : α1 = α2 = 0 against an
alternative that at least one was non-zero using an LM test.

In all of the questions involving tests, you should explain all steps from parameter estima-
tion to the final rejection decision.
Chapter 7

Univariate Volatility Modeling

Note: The primary references for these notes are chapters 10 and 11 in Taylor (2005). Alter-
native, but less comprehensive, treatments can be found in chapter 21 of Hamilton (1994) or
chapter 4 of Enders (2004). Many of the original articles can be found in Engle (1995).
Engle (1982) introduced the ARCH model and, in doing so, modern financial
econometrics. Since then, measuring and modeling conditional volatility has be-
come the cornerstone of the field. Models used for analyzing conditional volatil-
ity can be extended to capture a variety of related phenomena including duration
analysis, Value-at-Risk, Expected Shortfall and density forecasting. This chapter
begins by examining the meaning of “volatility” – it has many meanings – before turning at-
tention to the ARCH-family of models. The chapter proceeds through estimation,
inference, model selection, forecasting and diagnostic testing. The chapter con-
cludes by covering a relatively new method for measuring volatility using ultra-
high-frequency data, realized volatility, and a market-based measure of volatility,
implied volatility.

Volatility measurement and modeling is the foundation of financial econometrics. This


chapter begins by introducing volatility as a meaningful concept and then describes a widely
employed framework for volatility analysis: the ARCH model. The chapter describes the
important members of the ARCH family, some of their properties, estimation, inference
and model selection. Attention then turns to a new tool in the measurement and modeling of volatility, realized volatility, before concluding with a discussion of
option-based implied volatility.

7.1 Why does volatility change?


Time-varying volatility is a pervasive empirical regularity of financial time series, so much
so that it is difficult to find an asset return series which does not exhibit time-varying volatil-
ity. This chapter focuses on providing a statistical description of the time-variation of volatil-
ity, but does not go into depth on why volatility varies over time. A number of explanations

have been proffered to explain this phenomenon, although treated individually, none are
completely satisfactory.

• News Announcements: The arrival of unanticipated news (or “news surprises”) forces
agents to update beliefs. These new beliefs trigger portfolio rebalancing and high
periods of volatility correspond to agents dynamically solving for new asset prices.
While certain classes of assets have been shown to react to surprises, in particular
government bonds and foreign exchange, many appear to be unaffected by even large
surprises (see, inter alia Engle & Li (1998) and Andersen, Bollerslev, Diebold & Vega
(2007)). Additionally, news-induced periods of high volatility are generally short, of-
ten on the magnitude of 5 to 30-minutes and the apparent resolution of uncertainty
is far too quick to explain the time-variation of volatility seen in asset prices.

• Leverage: When firms are financed using both debt and equity, only the equity will
reflect the volatility of the firm’s cash flows. However, as the price of equity falls, a
smaller quantity must reflect the same volatility of the firm’s cash flows and so nega-
tive returns should lead to increases in equity volatility. The leverage effect is perva-
sive in equity returns, especially in broad equity indices, although alone it is insuffi-
cient to explain the time variation of volatility (Bekaert & Wu 2000, Christie 1982).

• Volatility Feedback: Volatility feedback is motivated by a model where the volatility


of an asset is priced. When the price of an asset falls, the volatility must increase
to reflect the increased expected return (in the future) of this asset, and an increase
in volatility requires an even lower price to generate a sufficient return to compen-
sate an investor for holding a volatile asset. There is evidence that this explanation is
empirically supported although it cannot explain the totality of the time-variation of
volatility (Bekaert & Wu 2000).

• Illiquidity: Short run spells of illiquidity may produce time varying volatility even
when shocks are i.i.d.. Intuitively, if the market is oversold (bought), a small nega-
tive (positive) shock will cause a small decrease (increase) in demand. However, since
there are few participants willing to buy (sell), this shock has a large effect on prices.
Liquidity runs tend to last from 20 minutes to a few days and cannot explain the long cycles present in volatility.

• State Uncertainty: Asset prices are important instruments that allow agents to express
beliefs about the state of the economy. When the state is uncertain, slight changes in
beliefs may cause large shifts in portfolio holdings which in turn feedback into beliefs
about the state. This feedback loop can generate time-varying volatility and should
have the largest effect when the economy is transitioning between periods of growth
and contraction.

The actual cause of the time-variation in volatility is likely a combination of these and others not listed here.

7.1.1 What is volatility?

Volatility comes in many shapes and forms. To be precise when discussing volatility, it is
important to be clear what is meant when the term “volatility” used.
Volatility Volatility is simply the standard deviation. Volatility is often preferred to variance
as it is measured in the same units as the original data. For example, when using returns,
the volatility is also in returns, and a volatility of 5% indicates that ±5% is a meaningful
quantity.
Realized Volatility Realized volatility has historically been used to denote a measure of the
volatility over some arbitrary period of time,

\[
\hat{\sigma} = \sqrt{T^{-1}\sum_{t=1}^{T}\left(r_t - \hat{\mu}\right)^2} \tag{7.1}
\]

but is now used to describe a volatility measure constructed using ultra-high-frequency


(UHF) data (also known as tick data). See section 7.4 for details.
Conditional Volatility Conditional volatility is the expected volatility at some future time t +
h based on all available information up to time t (Ft ). The one-period ahead conditional
volatility is denoted E t [σt +1 ].
Implied Volatility Implied volatility is the volatility that will correctly price an option. The
Black-Scholes pricing formula relates the price of a European call option to the current
price of the underlying, the strike, the risk-free rate, the time-to-maturity and the volatility,

B S (St , K , r, t , σt ) = C t (7.2)

where C is the price of the call. The implied volatility is the value which solves the Black-
Scholes taking the option and underlying prices, the strike, the risk-free and the time-to-
maturity as given,

σ̂t (St , K , r, t , C ) (7.3)

Annualized Volatility When volatility is measured over an interval other than a year, such
as a day, week or month, it can always be scaled to reflect the volatility of the asset over a
year. For example, if σ denotes the daily volatility of an asset and there are 252 trading days in a year, the annualized volatility is √252 σ. Annualized volatility is a useful measure that
removes the sampling interval from reported volatilities.
Variance All of the uses of volatility can be replaced with variance and most of this chapter
is dedicated to modeling conditional variance, denoted E t [σ2t +h ].

7.2 ARCH Models

In financial econometrics, an arch is not an architectural feature of a building; it is a fun-


damental tool for analyzing the time-variation of conditional variance. The success of the
ARCH (AutoRegressive Conditional Heteroskedasticity) family of models can be attributed
to three features: ARCH processes are essentially ARMA models and many of the tools of
linear time-series analysis can be directly applied, ARCH-family models are easy to esti-
mate and many parsimonious models are capable of providing good descriptions of the
dynamics of asset volatility.

7.2.1 The ARCH model

The complete ARCH(P) model (Engle 1982) relates the current level of volatility to the past
P squared shocks.

Definition 7.1 (Pth Order Autoregressive Conditional Heteroskedasticity (ARCH)). A Pth or-
der ARCH process is given by

\[
\begin{aligned}
r_t &= \mu_t + \varepsilon_t \tag{7.4}\\
\sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \ldots + \alpha_P\varepsilon_{t-P}^2\\
\varepsilon_t &= \sigma_t e_t\\
e_t &\stackrel{i.i.d.}{\sim} N(0,1).
\end{aligned}
\]

where µt can be any adapted model for the conditional mean.1

The key feature of this model is that the variance of the shock, εt , is time varying and
depends on the past P shocks, εt −1 , εt −2 , . . . , εt −P through their squares. σ2t is the time t −1
conditional variance and it is in the time t − 1 information set Ft −1 . This can be verified
by noting that all of the right-hand side variables that determine σ2t are known at time
t − 1. The model for the conditional mean can include own lags, shocks (in a MA model)
or exogenous variables such as the default spread or term premium. In practice, the model
for the conditional mean should be general enough to capture the dynamics present in the
data. For many financial time series, particularly when measured over short intervals - one
day to one week - a constant mean, sometimes assumed to be 0, is often sufficient.
An alternative way to describe an ARCH(P) model is

1
A model is adapted if everything required to model the mean at time t is known at time t − 1. Standard
examples of adapted mean processes include a constant mean or anything in the family of ARMA processes
or any exogenous regressors known at time t − 1.

\[
\begin{aligned}
r_t|\mathcal{F}_{t-1} &\sim N\left(\mu_t,\sigma_t^2\right) \tag{7.5}\\
\sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \ldots + \alpha_P\varepsilon_{t-P}^2\\
\varepsilon_t &= r_t - \mu_t
\end{aligned}
\]

which is read “rt given the information set at time t − 1 is conditionally normal with mean
µt and variance σ2t ”. 2
The conditional variance, denoted σ2t , is

\[
E_{t-1}\left[\varepsilon_t^2\right] = E_{t-1}\left[e_t^2\sigma_t^2\right] = \sigma_t^2 E_{t-1}\left[e_t^2\right] = \sigma_t^2 \tag{7.6}
\]
and the unconditional variance, denoted σ̄2 , is

\[
E\left[\varepsilon_{t+1}^2\right] = \bar{\sigma}^2. \tag{7.7}
\]
The first interesting property of the ARCH(P) model is the unconditional variance. Assum-
ing the unconditional variance exists, σ̄2 = E[σ2t ] can be derived from

\[
\begin{aligned}
E\left[\sigma_t^2\right] &= E\left[\omega + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \ldots + \alpha_P\varepsilon_{t-P}^2\right] \tag{7.8}\\
&= \omega + \alpha_1 E\left[\varepsilon_{t-1}^2\right] + \alpha_2 E\left[\varepsilon_{t-2}^2\right] + \ldots + \alpha_P E\left[\varepsilon_{t-P}^2\right]\\
&= \omega + \alpha_1 E\left[\sigma_{t-1}^2\right]E\left[e_{t-1}^2\right] + \alpha_2 E\left[\sigma_{t-2}^2\right]E\left[e_{t-2}^2\right] + \ldots + \alpha_P E\left[\sigma_{t-P}^2\right]E\left[e_{t-P}^2\right]\\
&= \omega + \alpha_1 E\left[\sigma_{t-1}^2\right] + \alpha_2 E\left[\sigma_{t-2}^2\right] + \ldots + \alpha_P E\left[\sigma_{t-P}^2\right]\\
E\left[\sigma_t^2\right] - \alpha_1 E\left[\sigma_{t-1}^2\right] - \ldots - \alpha_P E\left[\sigma_{t-P}^2\right] &= \omega\\
E\left[\sigma_t^2\right]\left(1 - \alpha_1 - \alpha_2 - \ldots - \alpha_P\right) &= \omega\\
\bar{\sigma}^2 &= \frac{\omega}{1 - \alpha_1 - \alpha_2 - \ldots - \alpha_P}.
\end{aligned}
\]

This derivation makes use of a number of properties of ARCH family models. First, the
definition of the shock ε2t ≡ e t2 σ2t is used to separate the i.i.d. normal innovation (e t ) from
the conditional variance (σ2t ). e t and σ2t are independent since σ2t depends on εt −1 , εt −2 , . . . , εt −P
(and in turn e t −1 , e t −2 , . . . , e t −P ) while e t is an i.i.d. draw at time t . Using these two prop-
erties, the derivation follows by noting that the unconditional expectation of σ2t − j is the
same in any time period (E[σ2t ] = E[σ2t −p ] = σ̄2 ) and is assumed to exist. Inspection of the
final line in the derivation reveals the condition needed to ensure that the unconditional
expectation is finite: 1 − α1 − α2 − . . . − αP > 0. As was the case in an AR model, as the
persistence (as measured by α1 , α2 , . . .) increases towards a unit root, the process explodes.
2 It is implausible for the unconditional mean return of a risky asset to be zero. However, when using daily equity data, the squared mean is typically less than 1% of the variance and there are few ramifications for setting this value to 0 (µ²/σ² < 0.01). Other assets, such as electricity prices, have non-trivial predictability
and so an appropriate model for the conditional mean should be specified.

7.2.1.1 Stationarity

An ARCH(P) model is covariance stationary as long as the model for the conditional mean corresponds to a stationary process³ and $1 - \alpha_1 - \alpha_2 - \ldots - \alpha_P > 0$.⁴ ARCH models have the property that $E[\varepsilon_t^2] = \bar\sigma^2 = \omega/(1 - \alpha_1 - \alpha_2 - \ldots - \alpha_P)$ since

$$E[\varepsilon_t^2] = E[e_t^2\sigma_t^2] = E[e_t^2]E[\sigma_t^2] = 1\cdot E[\sigma_t^2] = E[\sigma_t^2], \tag{7.9}$$

which exploits the independence of $e_t$ from $\sigma_t^2$ and the assumption that $e_t$ is a mean zero process with unit variance and so $E[e_t^2] = 1$.
One crucial requirement of any covariance stationary ARCH process is that the parameters of the variance evolution, $\alpha_1, \alpha_2, \ldots, \alpha_P$, must all be positive.⁵ The intuition behind this requirement is that if one of the $\alpha$s were negative, eventually a shock would be sufficiently large to produce a negative conditional variance, an undesirable feature. Finally, it is also necessary that $\omega > 0$ to ensure covariance stationarity.
To aid in developing intuition about ARCH-family models, consider a simple ARCH(1) with a constant mean of 0,

$$r_t = \varepsilon_t \tag{7.10}$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1).$$
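To make the recursion concrete, the following sketch simulates an ARCH(1) of exactly this form; the function name and the parameter values ($\omega = 0.1$, $\alpha_1 = 0.5$) are illustrative choices, not estimates from the text.

```python
import numpy as np

def simulate_arch1(omega, alpha1, T, burn=500, seed=0):
    """Simulate an ARCH(1): r_t = eps_t, sigma2_t = omega + alpha1*eps_{t-1}^2."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(T + burn)
    sigma2 = np.empty(T + burn)
    eps = np.empty(T + burn)
    sigma2[0] = omega / (1.0 - alpha1)      # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * e[0]
    for t in range(1, T + burn):
        sigma2[t] = omega + alpha1 * eps[t - 1] ** 2
        eps[t] = np.sqrt(sigma2[t]) * e[t]
    return eps[burn:], sigma2[burn:]        # discard the burn-in

eps, sigma2 = simulate_arch1(omega=0.1, alpha1=0.5, T=10_000)
print(eps.var(), 0.1 / (1 - 0.5))           # sample variance vs. unconditional variance
```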

While the conditional variance in an ARCH process appears different from anything previously encountered, it can be equivalently expressed as an AR(1) for $\varepsilon_t^2$. This transformation allows many properties of ARCH residuals to be directly derived by applying the results of chapter 2. By adding $\varepsilon_t^2 - \sigma_t^2$ to both sides of the volatility equation,

$$\begin{aligned}
\sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 \\
\sigma_t^2 + \varepsilon_t^2 - \sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \sigma_t^2(e_t^2 - 1) \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \nu_t,
\end{aligned} \tag{7.11}$$
an ARCH(1) process can be shown to be an AR(1). The error term, $\nu_t$, represents the volatility surprise, $\varepsilon_t^2 - \sigma_t^2$, which can be decomposed as $\sigma_t^2(e_t^2-1)$ and is a mean 0 white noise process since $e_t$ is i.i.d. and $E[e_t^2] = 1$. Using the AR representation, the autocovariances of $\varepsilon_t^2$ are simple to derive. First note that $\varepsilon_t^2 - \bar\sigma^2 = \sum_{i=0}^{\infty}\alpha_1^i\nu_{t-i}$. The first autocovariance can be expressed as

$$\begin{aligned}
E\left[\left(\varepsilon_t^2 - \bar\sigma^2\right)\left(\varepsilon_{t-1}^2 - \bar\sigma^2\right)\right]
&= E\left[\left(\sum_{i=0}^{\infty}\alpha_1^i\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= E\left[\left(\nu_t + \sum_{i=1}^{\infty}\alpha_1^i\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= E\left[\left(\nu_t + \alpha_1\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= E\left[\nu_t\left(\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)\right] + E\left[\left(\alpha_1\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= \sum_{i=1}^{\infty}\alpha_1^{i-1}E\left[\nu_t\nu_{t-i}\right] + E\left[\alpha_1\left(\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= \sum_{i=1}^{\infty}\alpha_1^{i-1}\cdot 0 + E\left[\alpha_1\left(\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)\left(\sum_{j=1}^{\infty}\alpha_1^{j-1}\nu_{t-j}\right)\right] \\
&= \alpha_1 E\left[\left(\sum_{i=1}^{\infty}\alpha_1^{i-1}\nu_{t-i}\right)^2\right] \\
&= \alpha_1 E\left[\left(\sum_{i=0}^{\infty}\alpha_1^{i}\nu_{t-1-i}\right)^2\right] \\
&= \alpha_1\left(\sum_{i=0}^{\infty}\alpha_1^{2i}E\left[\nu_{t-1-i}^2\right] + 2\sum_{j=0}^{\infty}\sum_{k=j+1}^{\infty}\alpha_1^{j+k}E\left[\nu_{t-1-j}\nu_{t-1-k}\right]\right) \\
&= \alpha_1\left(\sum_{i=0}^{\infty}\alpha_1^{2i}V\left[\nu_{t-1-i}\right] + 2\sum_{j=0}^{\infty}\sum_{k=j+1}^{\infty}\alpha_1^{j+k}\cdot 0\right) \\
&= \alpha_1\sum_{i=0}^{\infty}\alpha_1^{2i}V\left[\nu_{t-1-i}\right] \\
&= \alpha_1 V\left[\varepsilon_{t-1}^2\right]
\end{aligned} \tag{7.12}$$

where $V[\varepsilon_{t-1}^2] = V[\varepsilon_t^2]$ is the variance of the squared innovations.⁶ By repeated substitution, the $s$th autocovariance, $E[(\varepsilon_t^2 - \bar\sigma^2)(\varepsilon_{t-s}^2 - \bar\sigma^2)]$, can be shown to be $\alpha_1^s V[\varepsilon_t^2]$, so the autocovariances of an ARCH(1) are identical to those of an AR(1).

³For example, a constant or a covariance stationary ARMA process.
⁴When $\sum_{i=1}^{P}\alpha_i > 1$, an ARCH(P) may still be strictly stationary although it cannot be covariance stationary since it has infinite variance.
⁵Since each $\alpha_j \geq 0$, the roots of the characteristic polynomial associated with $\alpha_1, \alpha_2, \ldots, \alpha_P$ will be less than 1 if and only if $\sum_{p=1}^{P}\alpha_p < 1$.
⁶For the time being, assume this is finite.

7.2.1.2 Autocorrelations

Using the autocovariances, the autocorrelations are

$$\operatorname{Corr}\left(\varepsilon_t^2, \varepsilon_{t-s}^2\right) = \frac{\alpha_1^s V\left[\varepsilon_t^2\right]}{V\left[\varepsilon_t^2\right]} = \alpha_1^s. \tag{7.13}$$

Further, the relationship between the $s$th autocorrelation of an ARCH process and an AR process holds for ARCH processes with other orders. The autocorrelations of an ARCH(P) are identical to those of an AR(P) process with $\{\phi_1, \phi_2, \ldots, \phi_P\} = \{\alpha_1, \alpha_2, \ldots, \alpha_P\}$. One interesting aspect of ARCH(P) processes (and any covariance stationary ARCH-family model) is that the autocorrelations must be positive. If one autocorrelation were negative, eventually a shock would be sufficiently large to force the conditional variance negative and so the process would be ill-defined. In practice it is often better to examine the absolute values ($\operatorname{Corr}\left(|\varepsilon_t|, |\varepsilon_{t-s}|\right)$) rather than the squares since financial returns frequently have outliers that are exacerbated when squared.
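A quick numerical check of eq. (7.13) compares the sample autocorrelations of $\varepsilon_t^2$ from a simulated ARCH(1) with the theoretical values $\alpha_1^s$. The sketch below reuses the hypothetical simulate_arch1 function from the earlier example and computes the sample ACF directly rather than with a statistics library.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations of x at lags 1, ..., max_lag."""
    x = np.asarray(x) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[lag:], x[:-lag]) / denom for lag in range(1, max_lag + 1)])

eps, _ = simulate_arch1(omega=0.1, alpha1=0.5, T=100_000)
acf_sq = sample_acf(eps ** 2, max_lag=5)
theory = 0.5 ** np.arange(1, 6)
print(np.round(acf_sq, 3))   # close to the theoretical values below
print(np.round(theory, 3))   # 0.5, 0.25, 0.125, ...
```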

7.2.1.3 Kurtosis

The second interesting property of ARCH models is that the kurtosis of the shocks ($\varepsilon_t$) is strictly greater than the kurtosis of a normal. This may seem strange since all of the shocks $\varepsilon_t = \sigma_t e_t$ are normal by assumption. However, an ARCH model is a variance-mixture of normals which must produce a kurtosis greater than three. An intuitive proof is simple,

$$\kappa = \frac{E\left[\varepsilon_t^4\right]}{E\left[\varepsilon_t^2\right]^2} = \frac{E\left[E_{t-1}\left[\varepsilon_t^4\right]\right]}{E\left[E_{t-1}\left[e_t^2\sigma_t^2\right]\right]^2} = \frac{E\left[E_{t-1}\left[e_t^4\right]\sigma_t^4\right]}{E\left[E_{t-1}\left[e_t^2\right]\sigma_t^2\right]^2} = \frac{E\left[3\sigma_t^4\right]}{E\left[\sigma_t^2\right]^2} = 3\frac{E\left[\sigma_t^4\right]}{E\left[\sigma_t^2\right]^2} \geq 3. \tag{7.14}$$

The key steps in this derivation are that $\varepsilon_t^4 = e_t^4\sigma_t^4$ and that $E_{t-1}[e_t^4] = 3$ since $e_t$ is a standard normal. The final conclusion that $E[\sigma_t^4]/E[\sigma_t^2]^2 \geq 1$ follows from noting that $V\left[\varepsilon_t^2\right] = E\left[\varepsilon_t^4\right] - E\left[\varepsilon_t^2\right]^2 \geq 0$ and so it must be the case that $E\left[\varepsilon_t^4\right] \geq E\left[\varepsilon_t^2\right]^2$, or $\frac{E[\varepsilon_t^4]}{E[\varepsilon_t^2]^2} \geq 1$. The kurtosis, $\kappa$, of an ARCH(1) can be shown to be

$$\kappa = \frac{3(1-\alpha_1^2)}{(1-3\alpha_1^2)} > 3, \tag{7.15}$$

which is greater than 3 since $1 - 3\alpha_1^2 < 1 - \alpha_1^2$ and so $(1-\alpha_1^2)/(1-3\alpha_1^2) > 1$. The formal derivation of the kurtosis is tedious and is presented in Appendix 7.A.

[Figure 7.1 (Returns of the S&P 500 and IBM): Plots of S&P 500 and IBM returns (scaled by 100) from 2001 until 2010. The bulges in the return plots are graphical evidence of time-varying volatility.]
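As a numerical sanity check, the sample kurtosis of a long simulated ARCH(1) series can be compared to eq. (7.15). The sketch below again assumes the simulate_arch1 function from the earlier example and requires $3\alpha_1^2 < 1$ so that the fourth moment exists.

```python
import numpy as np

alpha1, omega = 0.5, 0.1
eps, _ = simulate_arch1(omega=omega, alpha1=alpha1, T=500_000)
sample_kurt = np.mean(eps ** 4) / np.mean(eps ** 2) ** 2
theory_kurt = 3 * (1 - alpha1 ** 2) / (1 - 3 * alpha1 ** 2)
print(round(sample_kurt, 2), round(theory_kurt, 2))   # both well above 3
```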

7.2.2 The GARCH model


The ARCH model has been deemed a sufficient contribution to economics to warrant a Nobel prize. Unfortunately, like most models, it has problems. ARCH models typically require 5-8 lags of the squared shock to adequately model conditional variance. The Generalized ARCH (GARCH) process, introduced by Bollerslev (1986), improves the original specification by adding lagged conditional variance, which acts as a smoothing term. GARCH models typically fit as well as a high-order ARCH yet remain parsimonious.

Definition 7.2 (Generalized Autoregressive Conditional Heteroskedasticity (GARCH) process). A GARCH(P,Q) process is defined as

$$r_t = \mu_t + \varepsilon_t \tag{7.16}$$
$$\sigma_t^2 = \omega + \sum_{p=1}^{P}\alpha_p\varepsilon_{t-p}^2 + \sum_{q=1}^{Q}\beta_q\sigma_{t-q}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

where $\mu_t$ can be any adapted model for the conditional mean.

[Figure 7.2 (Squared Returns of the S&P 500 and IBM): Plots of the squared returns of the S&P 500 Index and IBM. Time-variation in the squared returns is evidence of ARCH.]


The GARCH(P,Q) model builds on the ARCH(P) model by including Q lags of the conditional variance, $\sigma_{t-1}^2, \sigma_{t-2}^2, \ldots, \sigma_{t-Q}^2$. Rather than focusing on the general specification with all of its complications, consider a simpler GARCH(1,1) model where the conditional mean is assumed to be zero,

$$r_t = \varepsilon_t \tag{7.17}$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

In this specification, the future variance will be an average of the current shock, $\varepsilon_{t-1}^2$, and the current variance, $\sigma_{t-1}^2$, plus a constant. The effect of the lagged variance is to produce a model which is actually an ARCH($\infty$) in disguise. Begin by backward substituting,

$$\begin{aligned}
\sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 \\
&= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1(\omega + \alpha_1\varepsilon_{t-2}^2 + \beta_1\sigma_{t-2}^2) \\
&= \omega + \beta_1\omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\alpha_1\varepsilon_{t-2}^2 + \beta_1^2\sigma_{t-2}^2 \\
&= \omega + \beta_1\omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\alpha_1\varepsilon_{t-2}^2 + \beta_1^2(\omega + \alpha_1\varepsilon_{t-3}^2 + \beta_1\sigma_{t-3}^2) \\
&= \omega + \beta_1\omega + \beta_1^2\omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\alpha_1\varepsilon_{t-2}^2 + \beta_1^2\alpha_1\varepsilon_{t-3}^2 + \beta_1^3\sigma_{t-3}^2 \\
&= \sum_{i=0}^{\infty}\beta_1^i\omega + \sum_{i=0}^{\infty}\beta_1^i\alpha_1\varepsilon_{t-i-1}^2,
\end{aligned} \tag{7.18}$$

and the ARCH($\infty$) representation can be derived.⁷ It can be seen that the conditional variance in period $t$ is a constant, $\sum_{i=0}^{\infty}\beta_1^i\omega$, and a weighted average of past squared innovations with weights $\alpha_1, \beta_1\alpha_1, \beta_1^2\alpha_1, \beta_1^3\alpha_1, \ldots$.
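The recursion above is exactly how GARCH variances are computed in practice: given parameters and a series of shocks, the conditional variance is filtered forward one observation at a time. Below is a minimal sketch of such a filter (the function and variable names are illustrative).

```python
import numpy as np

def garch11_filter(eps, omega, alpha1, beta1, sigma2_0=None):
    """Compute sigma2_t = omega + alpha1*eps_{t-1}^2 + beta1*sigma2_{t-1} recursively."""
    T = eps.shape[0]
    sigma2 = np.empty(T)
    # A common choice is to initialize at the unconditional variance.
    sigma2[0] = omega / (1 - alpha1 - beta1) if sigma2_0 is None else sigma2_0
    for t in range(1, T):
        sigma2[t] = omega + alpha1 * eps[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return sigma2
```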


As was the case in the ARCH(P) model, the coefficients of a GARCH model must also be restricted to ensure the conditional variances are uniformly positive. In a GARCH(1,1), these restrictions are $\omega > 0$, $\alpha_1 \geq 0$ and $\beta_1 \geq 0$. In a GARCH(P,1) model the restrictions change to $\alpha_p \geq 0$, $p = 1, 2, \ldots, P$, with the same restrictions on $\omega$ and $\beta_1$. However, in a complete GARCH(P,Q) model the parameter restrictions are difficult to derive. For example, in a GARCH(2,2), one of the two $\beta$'s ($\beta_2$) can be slightly negative while ensuring that all conditional variances are positive. See Nelson & Cao (1992) for further details.
As was the case in the ARCH(1) model, the GARCH(1,1) model can be transformed into a standard time series model for $\varepsilon_t^2$,

$$\begin{aligned}
\sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 \\
\sigma_t^2 + \varepsilon_t^2 - \sigma_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 - \beta_1\varepsilon_{t-1}^2 + \beta_1\varepsilon_{t-1}^2 + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\varepsilon_{t-1}^2 - \beta_1(\varepsilon_{t-1}^2 - \sigma_{t-1}^2) + \varepsilon_t^2 - \sigma_t^2 \\
\varepsilon_t^2 &= \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\varepsilon_{t-1}^2 - \beta_1\nu_{t-1} + \nu_t \\
\varepsilon_t^2 &= \omega + (\alpha_1+\beta_1)\varepsilon_{t-1}^2 - \beta_1\nu_{t-1} + \nu_t
\end{aligned} \tag{7.19}$$

by adding $\varepsilon_t^2 - \sigma_t^2$ to both sides. However, unlike the ARCH(1) process, which can be transformed into an AR(1), the GARCH(1,1) is transformed into an ARMA(1,1) where $\nu_t = \varepsilon_t^2 - \sigma_t^2$ is the volatility surprise. In the general GARCH(P,Q), the ARMA representation takes the form of an ARMA(max(P,Q),Q),

$$\varepsilon_t^2 = \omega + \sum_{i=1}^{\max(P,Q)}(\alpha_i+\beta_i)\varepsilon_{t-i}^2 - \sum_{q=1}^{Q}\beta_q\nu_{t-q} + \nu_t. \tag{7.20}$$

⁷Since the model is assumed to be stationary, it must be the case that $0 \leq \beta_1 < 1$ and so $\lim_{j\to\infty}\beta_1^j\sigma_{t-j}^2 = 0$.

Using the same derivation as in the ARCH(1) model, the unconditional variance can be shown to be

$$\begin{aligned}
E[\sigma_t^2] &= \omega + \alpha_1 E[\varepsilon_{t-1}^2] + \beta_1 E[\sigma_{t-1}^2] \\
\bar\sigma^2 &= \omega + \alpha_1\bar\sigma^2 + \beta_1\bar\sigma^2 \\
\bar\sigma^2 - \alpha_1\bar\sigma^2 - \beta_1\bar\sigma^2 &= \omega \\
\bar\sigma^2 &= \frac{\omega}{1-\alpha_1-\beta_1}.
\end{aligned} \tag{7.21}$$

Inspection of the ARMA model leads to an alternative derivation of $\bar\sigma^2$ since the AR coefficient is $\alpha_1+\beta_1$, the intercept is $\omega$, and the unconditional mean of an ARMA(1,1) is the intercept divided by one minus the AR coefficient, $\omega/(1-\alpha_1-\beta_1)$. In a general GARCH(P,Q) the unconditional variance is

$$\bar\sigma^2 = \frac{\omega}{1 - \sum_{p=1}^{P}\alpha_p - \sum_{q=1}^{Q}\beta_q}. \tag{7.22}$$

As was the case in the ARCH(1) model, the requirements for stationarity are that $1-\alpha_1-\beta_1 > 0$ and $\alpha_1 \geq 0$, $\beta_1 \geq 0$ and $\omega > 0$.
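A small helper makes the stationarity check and the unconditional variance in eq. (7.22) concrete; the function below is an illustrative sketch, not part of any particular package (the parameter values in the example are the GARCH(1,1) estimates for the S&P 500 reported later in table 7.2).

```python
import numpy as np

def garch_unconditional_variance(omega, alphas, betas):
    """Unconditional variance of a GARCH(P,Q); raise if not covariance stationary."""
    if omega <= 0 or np.any(np.asarray(alphas) < 0) or np.any(np.asarray(betas) < 0):
        raise ValueError("omega must be positive and alphas/betas non-negative")
    persistence = np.sum(alphas) + np.sum(betas)
    if persistence >= 1:
        raise ValueError("not covariance stationary: persistence >= 1")
    return omega / (1 - persistence)

print(garch_unconditional_variance(0.012, [0.079], [0.913]))
```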
The ARMA(1,1) form can be used directly to solve for the autocovariances. Recall the definition of a mean zero ARMA(1,1),

$$y_t = \phi y_{t-1} + \theta\varepsilon_{t-1} + \varepsilon_t. \tag{7.23}$$

The 1st autocovariance can be computed as

$$\begin{aligned}
E[y_t y_{t-1}] &= E[(\phi y_{t-1} + \theta\varepsilon_{t-1} + \varepsilon_t)y_{t-1}] \\
&= E[\phi y_{t-1}^2] + E[\theta\varepsilon_{t-1}^2] \\
&= \phi V[y_{t-1}] + \theta V[\varepsilon_{t-1}] \\
\gamma_1 &= \phi V[y_{t-1}] + \theta V[\varepsilon_{t-1}]
\end{aligned} \tag{7.24}$$

and the $s$th autocovariance is $\gamma_s = \phi^{s-1}\gamma_1$. In the notation of a GARCH(1,1) model, $\phi = \alpha_1+\beta_1$, $\theta = -\beta_1$, $y_{t-1}$ is $\varepsilon_{t-1}^2$ and $\eta_{t-1}$ is $\sigma_{t-1}^2 - \varepsilon_{t-1}^2$. Thus, $V[\varepsilon_{t-1}^2]$ and $V[\sigma_t^2 - \varepsilon_t^2]$ must be solved for. However, this is tedious and is presented in the appendix. The key to understanding the autocovariance (and autocorrelation) of a GARCH is to use the ARMA mapping. First note that $E[\sigma_t^2 - \varepsilon_t^2] = 0$, so $V[\sigma_t^2 - \varepsilon_t^2]$ is simply $E[(\sigma_t^2 - \varepsilon_t^2)^2]$. This can be expanded to $E[\varepsilon_t^4] - 2E[\varepsilon_t^2\sigma_t^2] + E[\sigma_t^4]$, which can be shown to be $2E[\sigma_t^4]$. The only remaining step is to complete the tedious derivation of the expectation of these fourth powers, which is presented in Appendix 7.B.

7.2.2.1 Kurtosis

The kurtosis can be shown to be

$$\kappa = \frac{3(1+\alpha_1+\beta_1)(1-\alpha_1-\beta_1)}{1 - 2\alpha_1\beta_1 - 3\alpha_1^2 - \beta_1^2} > 3. \tag{7.25}$$

Once again, the kurtosis is greater than that of a normal despite the innovations, $e_t$, all having normal distributions. The formal derivation is presented in Appendix 7.B.

7.2.3 The EGARCH model

The Exponential GARCH (EGARCH) model represents a major shift from the ARCH and GARCH models (Nelson 1991). Rather than model the variance directly, EGARCH models the natural logarithm of the variance, and so no parameter restrictions are required to ensure that the conditional variance is positive.

Definition 7.3 (Exponential Generalized Autoregressive Conditional Heteroskedasticity (EGARCH) process). An EGARCH(P,O,Q) process is defined

$$r_t = \mu_t + \varepsilon_t \tag{7.26}$$
$$\ln(\sigma_t^2) = \omega + \sum_{p=1}^{P}\alpha_p\left(\left|\frac{\varepsilon_{t-p}}{\sigma_{t-p}}\right| - \sqrt{\frac{2}{\pi}}\right) + \sum_{o=1}^{O}\gamma_o\frac{\varepsilon_{t-o}}{\sigma_{t-o}} + \sum_{q=1}^{Q}\beta_q\ln(\sigma_{t-q}^2)$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

where $\mu_t$ can be any adapted model for the conditional mean. In the original parameterization of Nelson (1991), P and O were assumed to be equal.

Rather than working with the complete specification, consider a simpler version, an EGARCH(1,1,1) with a constant mean,

$$r_t = \mu + \varepsilon_t \tag{7.27}$$
$$\ln(\sigma_t^2) = \omega + \alpha_1\left(\left|\frac{\varepsilon_{t-1}}{\sigma_{t-1}}\right| - \sqrt{\frac{2}{\pi}}\right) + \gamma_1\frac{\varepsilon_{t-1}}{\sigma_{t-1}} + \beta_1\ln(\sigma_{t-1}^2)$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

which shows that the log variance is a constant plus three terms. The first term, $\left|\frac{\varepsilon_{t-1}}{\sigma_{t-1}}\right| - \sqrt{\frac{2}{\pi}} = |e_{t-1}| - \sqrt{\frac{2}{\pi}}$, is just the absolute value of a normal random variable, $e_{t-1}$, minus its expectation, $\sqrt{2/\pi}$, and so it is a mean zero shock. The second term, $e_{t-1}$, is also a mean zero shock and the last term is the lagged log variance. The two shocks behave differently (the $e_{t-1}$ terms): the first produces a symmetric rise in the log variance while the second creates an asymmetric effect. $\gamma_1$ is typically estimated to be less than zero and volatility rises more subsequent to negative shocks than to positive ones. In the usual case where $\gamma_1 < 0$, the magnitude of the shock can be decomposed by conditioning on the sign of $e_{t-1}$,
$$\text{Shock coefficient} = \begin{cases} \alpha_1 + \gamma_1 & \text{when } e_{t-1} < 0 \\ \alpha_1 - \gamma_1 & \text{when } e_{t-1} > 0 \end{cases} \tag{7.28}$$

Since both shocks are mean zero and the current log variance is linearly related to past log variance through $\beta_1$, the EGARCH(1,1,1) model is an AR model.

EGARCH models often provide superior fits when compared to standard GARCH models. The presence of the asymmetric term is largely responsible for the superior fit since many asset return series have been found to exhibit a "leverage" effect, and the use of standardized shocks ($e_{t-1}$) in the evolution of the log-variance tends to dampen the effect of large shocks.
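As a sketch of how the EGARCH(1,1,1) recursion can be evaluated in practice (illustrative code, not from the text), the log variance is filtered forward and then exponentiated:

```python
import numpy as np

def egarch111_filter(eps, omega, alpha1, gamma1, beta1):
    """Filter ln(sigma2_t) = omega + alpha1*(|e|-sqrt(2/pi)) + gamma1*e + beta1*ln(sigma2_{t-1})."""
    T = eps.shape[0]
    log_s2 = np.empty(T)
    log_s2[0] = omega / (1 - beta1)   # unconditional mean of the log variance
    for t in range(1, T):
        e_lag = eps[t - 1] / np.exp(0.5 * log_s2[t - 1])   # lagged standardized shock
        log_s2[t] = (omega + alpha1 * (abs(e_lag) - np.sqrt(2 / np.pi))
                     + gamma1 * e_lag + beta1 * log_s2[t - 1])
    return np.exp(log_s2)
```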

7.2.3.1 The S&P 500 and IBM

The application of GARCH models will be demonstrated using daily returns on both the
S&P 500 and IBM from January 1, 2001 until December 31, 2010. All data were taken from
Yahoo! finance and returns are scaled by 100. The returns are plotted in figure 7.1, the
squared returns are plotted in figure 7.2 and the absolute values of the returns are plot-
ted in figure 7.3. The plots of the squared returns and the absolute values of the returns
are useful graphical diagnostics for detecting ARCH. If the residuals are conditionally het-
eroskedastic, both plots should produce evidence of dynamics in the transformed returns.
The absolute value plot is often more useful since the squared returns are often noisy and
the dynamics in the data may be obscured by a small number of outliers.
Summary statistics are presented in table 7.1 and estimates from an ARCH(5), a GARCH(1,1) and an EGARCH(1,1,1) are presented in table 7.2.

Summary Statistics

                    S&P 500       IBM
Ann. Mean            -0.202     6.702
Ann. Volatility      21.840    26.966
Skewness             -0.124     0.311
Kurtosis             11.194     9.111

Table 7.1: Summary statistics for the S&P 500 and IBM. Means and volatilities are reported in annualized terms using 100 × returns, while skewness and kurtosis are estimates of the daily magnitude.

The summary statistics are typical of financial data: both series are heavy-tailed (leptokurtotic), and the S&P 500 returns are negatively skewed.

Definition 7.4 (Leptokurtosis). A random variable $x_t$ is said to be leptokurtotic if its kurtosis,

$$\kappa = \frac{E[(x_t - E[x_t])^4]}{E[(x_t - E[x_t])^2]^2},$$

is greater than that of a normal ($\kappa > 3$). Leptokurtotic variables are also known as "heavy tailed" or "fat tailed".

Definition 7.5 (Platykurtosis). A random variable $x_t$ is said to be platykurtotic if its kurtosis,

$$\kappa = \frac{E[(x_t - E[x_t])^4]}{E[(x_t - E[x_t])^2]^2},$$

is less than that of a normal ($\kappa < 3$). Platykurtotic variables are also known as "thin tailed".

Table 7.2 contains estimates from an ARCH(5), a GARCH(1,1) and an EGARCH(1,1,1)


model. All estimates were computed using maximum likelihood assuming the innova-
tions are conditionally normally distributed. Examining the table, there is strong evidence
of time varying variances since most p-values are near 0. The highest log-likelihood (a
measure of fit) is produced by the EGARCH model in both series. This is likely due to the
EGARCH’s inclusion of asymmetries, a feature excluded from both the ARCH and GARCH
models.

7.2.4 Alternative Specifications

Many extensions to the basic ARCH model have been introduced to capture important em-
pirical features. This section outlines three of the most useful extensions in the ARCH-
family.

S&P 500

ARCH(5)
            ω        α1       α2       α3       α4       α5       LL
Coeff.    0.364    0.040    0.162    0.216    0.190    0.202    -3796
         (0.000)  (0.063)  (0.000)  (0.000)  (0.000)  (0.000)

GARCH(1,1)
            ω        α1       β1       LL
Coeff.    0.012    0.079    0.913    -3722
         (0.023)  (0.000)  (0.000)

EGARCH(1,1,1)
            ω        α1       γ1       β1       LL
Coeff.    0.003    0.094   −0.119    0.985    -3669
         (0.138)  (0.000)  (0.000)  (0.000)

IBM

ARCH(5)
            ω        α1       α2       α3       α4       α5       LL
Coeff.    0.639    0.245    0.152    0.100    0.222    0.163    -4513
         (0.000)  (0.000)  (0.000)  (0.001)  (0.000)  (0.000)

GARCH(1,1)
            ω        α1       β1       LL
Coeff.    0.048    0.103    0.880    -4448
         (0.183)  (0.073)  (0.000)

EGARCH(1,1,1)
            ω        α1       γ1       β1       LL
Coeff.    0.012    0.105   −0.073    0.988    -4404
         (0.054)  (0.018)  (0.000)  (0.000)

Table 7.2: Parameter estimates, p-values and log-likelihoods from ARCH(5), GARCH(1,1) and EGARCH(1,1,1) models for the S&P 500 and IBM. These parameter values are typical of models estimated on daily data. The persistence of conditional variance, as measured by the sum of the αs in the ARCH(5), α1 + β1 in the GARCH(1,1) and β1 in the EGARCH(1,1,1), is high in all models. The log-likelihoods indicate the EGARCH model is preferred for both return series.

[Figure 7.3 (Absolute Returns of the S&P 500 and IBM): Plots of the absolute returns of the S&P 500 and IBM. Plots of the absolute value are often more useful in detecting ARCH as they are less noisy than squared returns yet still show changes in conditional volatility.]

7.2.4.1 GJR-GARCH

The GJR-GARCH model was named after the authors who introduced it, Glosten, Jagan-
nathan & Runkle (1993). It extends the standard GARCH(P,Q) to include asymmetric terms
that capture an important phenomenon in the conditional variance of equities: the propen-
sity for the volatility to rise more subsequent to large negative shocks than to large positive
shocks (known as the “leverage effect”).

Definition 7.6 (GJR-Generalized Autoregressive Conditional Heteroskedasticity (GJR-GARCH) process). A GJR-GARCH(P,O,Q) process is defined as

$$r_t = \mu_t + \varepsilon_t \tag{7.29}$$
$$\sigma_t^2 = \omega + \sum_{p=1}^{P}\alpha_p\varepsilon_{t-p}^2 + \sum_{o=1}^{O}\gamma_o\varepsilon_{t-o}^2 I_{[\varepsilon_{t-o}<0]} + \sum_{q=1}^{Q}\beta_q\sigma_{t-q}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

where $\mu_t$ can be any adapted model for the conditional mean and $I_{[\varepsilon_{t-o}<0]}$ is an indicator function that takes the value 1 if $\varepsilon_{t-o} < 0$ and 0 otherwise.

The parameters of the GJR-GARCH, like those of the standard GARCH model, must be restricted to ensure that the fit variances are always positive. This set is difficult to describe for a complete GJR-GARCH(P,O,Q) model although it is simple for a GJR-GARCH(1,1,1). The dynamics in a GJR-GARCH(1,1,1) evolve according to

$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2 + \gamma_1\varepsilon_{t-1}^2 I_{[\varepsilon_{t-1}<0]} + \beta_1\sigma_{t-1}^2, \tag{7.30}$$

and it must be the case that $\omega > 0$, $\alpha_1 \geq 0$, $\alpha_1 + \gamma_1 \geq 0$ and $\beta_1 \geq 0$. If the innovations are conditionally normal, a GJR-GARCH model will be covariance stationary as long as the parameter restrictions are satisfied and $\alpha_1 + \frac{1}{2}\gamma_1 + \beta_1 < 1$.
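A sketch of the GJR-GARCH(1,1,1) recursion, written in the same illustrative style as the earlier GARCH filter:

```python
import numpy as np

def gjr_garch111_filter(eps, omega, alpha1, gamma1, beta1):
    """sigma2_t = omega + alpha1*eps^2 + gamma1*eps^2*1{eps<0} + beta1*sigma2, all lagged once."""
    T = eps.shape[0]
    sigma2 = np.empty(T)
    # Unconditional variance when e_t is symmetric (e.g. normal).
    sigma2[0] = omega / (1 - alpha1 - 0.5 * gamma1 - beta1)
    for t in range(1, T):
        neg = 1.0 if eps[t - 1] < 0 else 0.0
        sigma2[t] = (omega + alpha1 * eps[t - 1] ** 2
                     + gamma1 * neg * eps[t - 1] ** 2 + beta1 * sigma2[t - 1])
    return sigma2
```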

7.2.4.2 AVGARCH/TARCH/ZARCH

The Threshold ARCH (TARCH) model (also known as AVGARCH and ZARCH) makes one fundamental change to the GJR-GARCH model (Taylor 1986, Zakoian 1994). Rather than modeling the variance directly using squared innovations, a TARCH model parameterizes the conditional standard deviation as a function of the lagged absolute values of the shocks. It also captures asymmetries using an asymmetric term in a manner similar to the asymmetry in the GJR-GARCH model.

Definition 7.7 (Threshold Autoregressive Conditional Heteroskedasticity (TARCH) process). A TARCH(P,O,Q) process is defined as

$$r_t = \mu_t + \varepsilon_t \tag{7.31}$$
$$\sigma_t = \omega + \sum_{p=1}^{P}\alpha_p|\varepsilon_{t-p}| + \sum_{o=1}^{O}\gamma_o|\varepsilon_{t-o}|I_{[\varepsilon_{t-o}<0]} + \sum_{q=1}^{Q}\beta_q\sigma_{t-q}$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

where $\mu_t$ can be any adapted model for the conditional mean. TARCH models are also known as ZARCH due to Zakoian (1994) or AVGARCH when no asymmetric terms are included ($O = 0$, Taylor (1986)).

Below is an example of a TARCH(1,1,1) model,

$$\sigma_t = \omega + \alpha_1|\varepsilon_{t-1}| + \gamma_1|\varepsilon_{t-1}|I_{[\varepsilon_{t-1}<0]} + \beta_1\sigma_{t-1}, \qquad \alpha_1 + \gamma_1 \geq 0, \tag{7.32}$$

where $I_{[\varepsilon_{t-1}<0]}$ is an indicator variable which takes the value 1 if $\varepsilon_{t-1} < 0$. Models of the conditional standard deviation often outperform models that directly parameterize the conditional variance, and the gains arise since the absolute shocks are less responsive than the squared shocks, an empirically relevant feature.

7.2.4.3 APARCH

The third model extends the TARCH and GJR-GARCH models by directly parameterizing
the non-linearity in the conditional variance. Where the GJR-GARCH model uses 2 and the
TARCH model uses 1, the Asymmetric Power ARCH (APARCH) of Ding, Engle & Granger
(1993) parameterizes this value directly (using δ). This form provides greater flexibility in
modeling the memory of volatility while remaining parsimonious.

Definition 7.8 (Asymmetric Power Autoregressive Conditional Heteroskedasticity (APARCH) process). An APARCH(P,O,Q) process is defined as

$$r_t = \mu_t + \varepsilon_t \tag{7.33}$$
$$\sigma_t^{\delta} = \omega + \sum_{j=1}^{\max(P,O)}\alpha_j\left(|\varepsilon_{t-j}| + \gamma_j\varepsilon_{t-j}\right)^{\delta} + \sum_{q=1}^{Q}\beta_q\sigma_{t-q}^{\delta}$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

where $\mu_t$ can be any adapted model for the conditional mean. In this specification it must be the case that $P \geq O$. When $P > O$, $\gamma_j = 0$ if $j > O$. To ensure the conditional variances are non-negative, it is necessary that $\omega > 0$, $\alpha_k \geq 0$ and $-1 \leq \gamma_j \leq 1$.

It is not obvious that the APARCH model nests the GJR-GARCH and TARCH models as special cases. To examine how an APARCH nests a GJR-GARCH, consider an APARCH(1,1,1) model,

$$\sigma_t^{\delta} = \omega + \alpha_1\left(|\varepsilon_{t-1}| + \gamma_1\varepsilon_{t-1}\right)^{\delta} + \beta_1\sigma_{t-1}^{\delta}. \tag{7.34}$$

Suppose $\delta = 2$, then

$$\begin{aligned}
\sigma_t^2 &= \omega + \alpha_1\left(|\varepsilon_{t-1}| + \gamma_1\varepsilon_{t-1}\right)^2 + \beta_1\sigma_{t-1}^2 \\
&= \omega + \alpha_1|\varepsilon_{t-1}|^2 + 2\alpha_1\gamma_1\varepsilon_{t-1}|\varepsilon_{t-1}| + \alpha_1\gamma_1^2\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 \\
&= \omega + \alpha_1\varepsilon_{t-1}^2 + \alpha_1\gamma_1^2\varepsilon_{t-1}^2 + 2\alpha_1\gamma_1\varepsilon_{t-1}^2\operatorname{sign}(\varepsilon_{t-1}) + \beta_1\sigma_{t-1}^2
\end{aligned} \tag{7.35}$$

where $\operatorname{sign}(\cdot)$ is a function that returns 1 if its argument is positive and -1 if its argument is negative. Consider the total effect of $\varepsilon_{t-1}^2$ as it depends on the sign of $\varepsilon_{t-1}$,
$$\text{Shock coefficient} = \begin{cases} \alpha_1 + \alpha_1\gamma_1^2 + 2\alpha_1\gamma_1 & \text{when } \varepsilon_t > 0 \\ \alpha_1 + \alpha_1\gamma_1^2 - 2\alpha_1\gamma_1 & \text{when } \varepsilon_t < 0 \end{cases} \tag{7.36}$$

[Figure 7.4 (News Impact Curves; panels: S&P 500 News Impact Curve, IBM News Impact Curve; models: ARCH(1), GARCH(1,1), GJR-GARCH(1,1,1), TARCH(1,1,1), APARCH(1,1,1)): News impact curves for returns on both the S&P 500 and IBM. While the ARCH and GARCH curves are symmetric, the others show substantial asymmetries to negative news. Additionally, the fit APARCH models chose δ̂ ≈ 1 and so the NIC of the APARCH and the TARCH models appear similar.]

$\gamma_1$ is usually estimated to be less than zero, which corresponds to the typical "leverage effect" in GJR-GARCH models.⁸ The relationship between a TARCH model and an APARCH model works analogously by setting $\delta = 1$. The APARCH model also nests the ARCH(P), GARCH(P,Q) and AVGARCH(P,Q) models as special cases.

⁸The explicit relationship between an APARCH and a GJR-GARCH can be derived when $\delta = 2$ by solving a system of two equations in two unknowns where eq. (7.36) is equated with the effect in a GJR-GARCH model.

7.2.5 The News Impact Curve

With a wide range of volatility models, each with a different specification for the dynamics
of conditional variances, it can be difficult to determine the precise effect of a shock to
the conditional variance. News impact curves which measure the effect of a shock in the
current period on the conditional variance in the subsequent period facilitate comparison
between models.

Definition 7.9 (News Impact Curve (NIC)). The news impact curve of an ARCH-family model is defined as the difference between the variance with a shock $e_t$ and the variance with no shock ($e_t = 0$). To ensure that the NIC does not depend on the level of variance, the variance in all previous periods is set to the unconditional expectation of the variance, $\bar\sigma^2$,

$$n(e_t) = \sigma_{t+1}^2\left(e_t|\sigma_t^2 = \bar\sigma^2\right) \tag{7.37}$$
$$NIC(e_t) = n(e_t) - n(0) \tag{7.38}$$

To facilitate comparing both linear and non-linear specifications (e.g. EGARCH), NICs are normalized by setting the variance in the current period to the unconditional variance.
News impact curves for ARCH and GARCH models are simply the terms which involve $\varepsilon_t^2$.

GARCH(1,1)

$$n(e_t) = \omega + \alpha_1\bar\sigma^2 e_t^2 + \beta_1\bar\sigma^2 \tag{7.39}$$
$$NIC(e_t) = \alpha_1\bar\sigma^2 e_t^2 \tag{7.40}$$

The news impact curve can be fairly complicated if a model is not linear in $\varepsilon_t^2$, as this example from a TARCH(1,1,1) shows,

$$\sigma_t = \omega + \alpha_1|\varepsilon_t| + \gamma_1|\varepsilon_t|I_{[\varepsilon_t<0]} + \beta_1\sigma_{t-1} \tag{7.41}$$
$$n(e_t) = \omega^2 + 2\omega(\alpha_1+\gamma_1 I_{[\varepsilon_t<0]})|\varepsilon_t| + 2\beta_1(\alpha_1+\gamma_1 I_{[\varepsilon_t<0]})|\varepsilon_t|\bar\sigma + \beta_1^2\bar\sigma^2 + 2\omega\beta_1\bar\sigma + (\alpha_1+\gamma_1 I_{[\varepsilon_t<0]})^2\varepsilon_t^2 \tag{7.42}$$
$$NIC(e_t) = (\alpha_1+\gamma_1 I_{[\varepsilon_t<0]})^2\varepsilon_t^2 + (2\omega + 2\beta_1\bar\sigma)(\alpha_1+\gamma_1 I_{[\varepsilon_t<0]})|\varepsilon_t| \tag{7.43}$$

While deriving explicit expressions for NICs can be tedious, practical implementation only requires computing the conditional variance for a shock of 0 ($n(0)$) and for a set of shocks between -3 and 3 ($n(z)$ for $z \in (-3,3)$). The difference between the conditional variance with a shock and the conditional variance without a shock is the NIC.
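This numerical recipe is easy to code. The sketch below computes the NIC of the GARCH(1,1) in eqs. (7.39)-(7.40) on a grid of shocks; the same pattern works for any one-step variance recursion (the parameter values are the S&P 500 GARCH(1,1) estimates from table 7.2, used here purely as an illustration).

```python
import numpy as np

def nic_garch11(omega, alpha1, beta1, shocks=np.linspace(-3, 3, 121)):
    """News impact curve of a GARCH(1,1): n(e) - n(0) with the lagged variance at sigma_bar^2."""
    sigma_bar2 = omega / (1 - alpha1 - beta1)
    def n(e):
        eps = np.sqrt(sigma_bar2) * e                 # shock scaled by the unconditional std. dev.
        return omega + alpha1 * eps ** 2 + beta1 * sigma_bar2
    return shocks, n(shocks) - n(0.0)

grid, nic = nic_garch11(omega=0.012, alpha1=0.079, beta1=0.913)
```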

7.2.5.1 The S&P 500 and IBM

Figure 7.4 contains plots of the news impact curves for both the S&P 500 and IBM. When
the models include asymmetries, the news impact curves are asymmetric and show a much
larger response to negative shocks than to positive shocks, although the asymmetry is stronger
in the volatility of the returns of the S&P 500 than it is in the volatility of IBM’s returns.

7.2.6 Estimation

Consider a simple GARCH(1,1) specification,

$$r_t = \mu_t + \varepsilon_t \tag{7.44}$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

Since the errors are assumed to be conditionally i.i.d. normal⁹, maximum likelihood is a natural choice to estimate the unknown parameters, $\theta$, which contain both the mean and variance parameters. The normal likelihood for $T$ independent variables is

$$f(\mathbf{r};\theta) = \prod_{t=1}^{T}\left(2\pi\sigma_t^2\right)^{-\frac{1}{2}}\exp\left(-\frac{(r_t-\mu_t)^2}{2\sigma_t^2}\right) \tag{7.45}$$

and the normal log-likelihood function is

$$l(\theta;\mathbf{r}) = \sum_{t=1}^{T}-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_t^2) - \frac{(r_t-\mu_t)^2}{2\sigma_t^2}. \tag{7.46}$$

If the mean is set to 0, the log-likelihood simplifies to

$$l(\theta;\mathbf{r}) = \sum_{t=1}^{T}-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_t^2) - \frac{r_t^2}{2\sigma_t^2} \tag{7.47}$$

and is maximized by solving the first order conditions,

$$\frac{\partial l(\theta;\mathbf{r})}{\partial\sigma_t^2} = \sum_{t=1}^{T}-\frac{1}{2\sigma_t^2} + \frac{r_t^2}{2\sigma_t^4} = 0, \tag{7.48}$$

which can be rewritten to provide some insight into the estimation of ARCH models,

⁹The use of conditional is to denote the dependence on $\sigma_t^2$, which is in $\mathcal{F}_{t-1}$.
7.2 ARCH Models 451

$$\frac{\partial l(\theta;\mathbf{r})}{\partial\sigma_t^2} = \frac{1}{2}\sum_{t=1}^{T}\frac{1}{\sigma_t^2}\left(\frac{r_t^2}{\sigma_t^2} - 1\right). \tag{7.49}$$

This expression clarifies that the parameters of the volatility are chosen to make $\left(\frac{r_t^2}{\sigma_t^2}-1\right)$ as close to zero as possible.¹⁰ These first order conditions are not complete since $\omega$, $\alpha_1$ and $\beta_1$, not $\sigma_t^2$, are the parameters of a GARCH(1,1) model and

$$\frac{\partial l(\theta;\mathbf{r})}{\partial\theta_i} = \frac{\partial l(\theta;\mathbf{r})}{\partial\sigma_t^2}\frac{\partial\sigma_t^2}{\partial\theta_i}. \tag{7.50}$$

The derivatives take a recursive form not previously encountered,

$$\begin{aligned}
\frac{\partial\sigma_t^2}{\partial\omega} &= 1 + \beta_1\frac{\partial\sigma_{t-1}^2}{\partial\omega} \\
\frac{\partial\sigma_t^2}{\partial\alpha_1} &= \varepsilon_{t-1}^2 + \beta_1\frac{\partial\sigma_{t-1}^2}{\partial\alpha_1} \\
\frac{\partial\sigma_t^2}{\partial\beta_1} &= \sigma_{t-1}^2 + \beta_1\frac{\partial\sigma_{t-1}^2}{\partial\beta_1},
\end{aligned} \tag{7.51}$$

although the recursion in the first order condition for $\omega$ can be removed by noting that

$$\frac{\partial\sigma_t^2}{\partial\omega} = 1 + \beta_1\frac{\partial\sigma_{t-1}^2}{\partial\omega} \approx \frac{1}{1-\beta_1}. \tag{7.52}$$

Eqs. (7.50) - (7.52) provide the necessary formulas to implement the scores of the log-likelihood although they are not needed to estimate a GARCH model.¹¹
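To make the estimation step concrete, here is a minimal sketch of quasi-maximum likelihood estimation of a zero-mean GARCH(1,1) using the log-likelihood in eq. (7.47) and a numerical optimizer; the starting values, the crude penalty for invalid parameters, and the use of scipy are illustrative choices, not a prescription from the text.

```python
import numpy as np
from scipy.optimize import minimize

def garch11_neg_loglik(params, r):
    """Negative normal log-likelihood of a zero-mean GARCH(1,1)."""
    omega, alpha1, beta1 = params
    if omega <= 0 or alpha1 < 0 or beta1 < 0 or alpha1 + beta1 >= 1:
        return 1e10                           # crude penalty for invalid parameters
    T = r.shape[0]
    sigma2 = np.empty(T)
    sigma2[0] = r.var()                       # initialize at the sample variance
    for t in range(1, T):
        sigma2[t] = omega + alpha1 * r[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + r ** 2 / sigma2)

# r is a numpy array of (de-meaned) returns, e.g. scaled by 100:
# result = minimize(garch11_neg_loglik, x0=np.array([0.05, 0.10, 0.85]),
#                   args=(r,), method="Nelder-Mead")
# omega_hat, alpha_hat, beta_hat = result.x
```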
The use of the normal likelihood has one strong justification: estimates produced by maximizing the log-likelihood of a normal are strongly consistent. Strong consistency is a property of an estimator that ensures parameter estimates converge to the true parameters even if the wrong conditional distribution is assumed. For example, in a standard GARCH(1,1), the parameter estimates would still converge to their true values if estimated with the normal likelihood as long as the volatility model was correctly specified. The intuition behind this result comes from the generalized error,

$$\left(\frac{\varepsilon_t^2}{\sigma_t^2} - 1\right). \tag{7.53}$$

Whenever $\sigma_t^2 = E_{t-1}[\varepsilon_t^2]$,

$$E\left[\frac{\varepsilon_t^2}{\sigma_t^2} - 1\right] = E\left[\frac{E_{t-1}[\varepsilon_t^2]}{\sigma_t^2} - 1\right] = E\left[\frac{\sigma_t^2}{\sigma_t^2} - 1\right] = 0. \tag{7.54}$$

Thus, as long as the GARCH model nests the true DGP, the parameters will be chosen to make the conditional expectation of the generalized error 0; these parameters correspond to those of the original DGP even if the conditional distribution is misspecified.¹² This is a very special property of the normal distribution and is not found in any other common distribution.

¹⁰If $E_{t-1}\left[\frac{r_t^2}{\sigma_t^2} - 1\right] = 0$, and so the volatility is correctly specified, then the scores of the log-likelihood have expectation zero since
$$E\left[\frac{1}{\sigma_t^2}\left(\frac{r_t^2}{\sigma_t^2}-1\right)\right] = E\left[E_{t-1}\left[\frac{1}{\sigma_t^2}\left(\frac{r_t^2}{\sigma_t^2}-1\right)\right]\right] = E\left[\frac{1}{\sigma_t^2}E_{t-1}\left[\frac{r_t^2}{\sigma_t^2}-1\right]\right] = E\left[\frac{1}{\sigma_t^2}(0)\right] = 0.$$
¹¹MATLAB and many other econometric packages are capable of estimating the derivatives using a numerical approximation that only requires the log-likelihood. Numerical derivatives use the definition of a derivative, $f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$, to approximate the derivative using $f'(x) \approx \frac{f(x+h)-f(x)}{h}$ for some small $h$.

7.2.7 Inference

Under some regularity conditions, parameters estimated using maximum likelihood have been shown to be asymptotically normally distributed,

$$\sqrt{T}(\hat\theta - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}) \tag{7.55}$$

where

$$\mathcal{I} = -E\left[\frac{\partial^2 l(\theta_0; r_t)}{\partial\theta\partial\theta'}\right] \tag{7.56}$$

is the negative of the expected Hessian. The Hessian measures how much curvature there is in the log-likelihood at the optimum, just like the second derivative measures the rate-of-change in the rate-of-change of the function in a standard calculus problem. To estimate $\mathcal{I}$, the sample analogue employing the time-series of Hessian matrices computed at $\hat\theta$ is used,

$$\hat{\mathcal{I}} = -T^{-1}\sum_{t=1}^{T}\frac{\partial^2 l(\hat\theta; r_t)}{\partial\theta\partial\theta'}. \tag{7.57}$$

¹²An assumption that a GARCH specification nests the DGP is extremely strong and almost certainly wrong. However, this property of the normal provides a justification even though the standardized residuals of most asset return series are leptokurtotic and skewed.

The chapter 1 notes show that the Information Matrix Equality (IME) generally holds for MLE problems, so

$$\mathcal{I} = \mathcal{J} \tag{7.58}$$

where

$$\mathcal{J} = E\left[\frac{\partial l(r_t;\theta_0)}{\partial\theta}\frac{\partial l(r_t;\theta_0)}{\partial\theta'}\right] \tag{7.59}$$

is the covariance of the scores, which measures how much information there is in the data to pin down the parameters. A large score variance indicates that small parameter changes have a large impact and so the parameters are precisely estimated. The estimator of $\mathcal{J}$ is the sample analogue using the scores evaluated at the estimated parameters,

$$\hat{\mathcal{J}} = T^{-1}\sum_{t=1}^{T}\frac{\partial l(\hat\theta; r_t)}{\partial\theta}\frac{\partial l(\hat\theta; r_t)}{\partial\theta'}. \tag{7.60}$$
The conditions for the IME to hold require that the parameter estimates are maximum likelihood estimates, which in turn requires both the likelihood used in estimation to be correct as well as the specification for the conditional variance. When one specification is used for estimation (e.g. normal) but the data follow a different conditional distribution, these estimators are known as Quasi Maximum Likelihood Estimators (QMLE) and the IME generally fails to hold. Under some regularity conditions, the estimated parameters are still asymptotically normal but with a different covariance,

$$\sqrt{T}(\hat\theta - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}\mathcal{J}\mathcal{I}^{-1}). \tag{7.61}$$

If the IME were valid, $\mathcal{I} = \mathcal{J}$ and so this covariance would simplify to the usual MLE variance estimator.

In most applications of ARCH models, the conditional distribution of shocks is decidedly not normal, exhibiting both excess kurtosis and skewness. Bollerslev & Wooldridge (1992) were the first to show that the IME does not generally hold for GARCH models when the distribution is misspecified, and the "sandwich" form

$$\hat{\mathcal{I}}^{-1}\hat{\mathcal{J}}\hat{\mathcal{I}}^{-1} \tag{7.62}$$

of the covariance estimator is often referred to as the Bollerslev-Wooldridge covariance matrix or simply a robust covariance matrix. Standard Wald tests can be used to test hypotheses of interest, such as whether an asymmetric term is statistically significant, although likelihood ratio tests are not reliable since they do not have the usual $\chi_m^2$ distribution.
7.2.7.1 The S&P 500 and IBM

To demonstrate the different covariance estimators, TARCH(1,1,1) models were estimated for both the S&P 500 and the IBM data. Table 7.3 contains the estimated parameters and t-stats using both the MLE covariance matrix and the Bollerslev-Wooldridge covariance matrix. There is little change in the S&P model but the symmetric term in the IBM model changes from highly significant to insignificant at 5%.

S&P 500
                 ω        α1       γ1       β1
Coeff.         0.017    0.000    0.123    0.937
Std. VCV        6.07     0.00     9.66   108.36
Robust VCV      3.85     0.00     7.55    82.79

IBM
                 ω        α1       γ1       β1
Coeff.         0.020    0.022    0.080    0.938
Std. VCV        3.93     1.94     7.80    75.86
Robust VCV      1.64     0.93     3.54    30.98

Table 7.3: Estimates from a TARCH(1,1,1) for the S&P 500 and IBM. The Bollerslev-Wooldridge VCV makes a difference in the significance of the symmetric term (α1) in the IBM model.

7.2.7.2 Independence of the mean and variance parameters

One important but not obvious issue when using MLE assuming conditionally normal errors - or QMLE when conditional normality is wrongly assumed - is that inference on the parameters of the ARCH model is still correct as long as the model for the mean is general enough to nest the true form. As a result, it is safe to estimate the mean and variance parameters separately without correcting the covariance matrix of the estimated parameters.¹³ This surprising feature of QMLE estimators employing a normal log-likelihood comes from the cross-partial derivative of the log-likelihood with respect to the mean and variance parameters,

$$l(\theta; r_t) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_t^2) - \frac{(r_t-\mu_t)^2}{2\sigma_t^2}. \tag{7.63}$$

¹³The estimated covariance for the mean should use a White covariance estimator. If the mean parameters are of particular interest, it may be more efficient to jointly estimate the parameters of the mean and volatility equations as a form of GLS (see Chapter 3).

The first order condition is

$$\frac{\partial l(\theta;\mathbf{r})}{\partial\mu_t}\frac{\partial\mu_t}{\partial\phi} = -\sum_{t=1}^{T}\frac{(r_t-\mu_t)}{\sigma_t^2}\frac{\partial\mu_t}{\partial\phi} \tag{7.64}$$

and the second order condition is

$$\frac{\partial^2 l(\theta;\mathbf{r})}{\partial\mu_t\partial\sigma_t^2}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi} = \sum_{t=1}^{T}\frac{(r_t-\mu_t)}{\sigma_t^4}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi} \tag{7.65}$$

where $\phi$ is a parameter of the conditional mean and $\psi$ is a parameter of the conditional variance. For example, in a simple ARCH(1) model with a constant mean,

$$r_t = \mu + \varepsilon_t \tag{7.66}$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1),$$

$\phi = \mu$ and $\psi$ can be either $\omega$ or $\alpha_1$. Taking expectations of the cross-partial,

$$\begin{aligned}
E\left[\frac{\partial^2 l(\theta;\mathbf{r})}{\partial\mu_t\partial\sigma_t^2}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi}\right]
&= E\left[\sum_{t=1}^{T}\frac{r_t-\mu_t}{\sigma_t^4}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi}\right] \\
&= E\left[E_{t-1}\left[\sum_{t=1}^{T}\frac{r_t-\mu_t}{\sigma_t^4}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi}\right]\right] \\
&= E\left[\sum_{t=1}^{T}\frac{E_{t-1}\left[r_t-\mu_t\right]}{\sigma_t^4}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi}\right] \\
&= E\left[\sum_{t=1}^{T}\frac{0}{\sigma_t^4}\frac{\partial\mu_t}{\partial\phi}\frac{\partial\sigma_t^2}{\partial\psi}\right] \\
&= 0,
\end{aligned} \tag{7.67}$$

it can be seen that the expectation of the cross derivative is 0. The intuition behind this result is also simple: if the mean model is correct for the conditional expectation of $r_t$, the term $r_t - \mu_t$ has conditional expectation 0 and knowledge of the variance is not needed; this is a similar argument to the validity of least squares estimation when the errors are heteroskedastic.

7.2.8 GARCH-in-Mean

The GARCH-in-mean model (GIM) makes a significant change to the role of time-varying volatility by explicitly relating the level of volatility to the expected return (Engle, Lilien & Robins 1987). A simple GIM model can be specified as

$$r_t = \mu + \delta\sigma_t^2 + \varepsilon_t \tag{7.68}$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2$$
$$\varepsilon_t = \sigma_t e_t$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$

although virtually any ARCH-family model could be used for the conditional variance. The obvious difference between the GIM and a standard GARCH(1,1) is that the variance appears in the mean of the return. Note that the shock driving the changes in variance is not the mean return but still $\varepsilon_{t-1}^2$, and so the ARCH portion of a GIM is unaffected. Other forms of the GIM model have been employed where the conditional standard deviation or the log of the conditional variance is used in the mean equation¹⁴,

$$r_t = \mu + \delta\sigma_t + \varepsilon_t \tag{7.69}$$

or

$$r_t = \mu + \delta\ln(\sigma_t^2) + \varepsilon_t. \tag{7.70}$$

Because the variance appears in the mean equation for $r_t$, the mean and variance parameters cannot be separately estimated. Despite the apparent feedback, processes that follow a GIM will be stationary as long as the variance process is stationary. This result follows from noting that the conditional variance ($\sigma_t^2$) in the conditional mean does not feed back into the conditional variance process.
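A sketch simulating the GIM in eq. (7.68) makes this structure clear: the variance recursion is untouched, and the fitted variance simply enters the mean (the parameter values are illustrative, loosely based on the S&P 500 estimates reported in table 7.4).

```python
import numpy as np

def simulate_gim(mu, delta, omega, alpha1, beta1, T=5_000, seed=0):
    """Simulate r_t = mu + delta*sigma2_t + eps_t with a GARCH(1,1) variance."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(T)
    sigma2 = np.empty(T)
    r = np.empty(T)
    eps_lag = 0.0
    sigma2[0] = omega / (1 - alpha1 - beta1)
    for t in range(T):
        if t > 0:
            sigma2[t] = omega + alpha1 * eps_lag ** 2 + beta1 * sigma2[t - 1]
        eps = np.sqrt(sigma2[t]) * e[t]
        r[t] = mu + delta * sigma2[t] + eps      # the variance enters the conditional mean
        eps_lag = eps                            # the ARCH recursion uses eps, not r
    return r, sigma2

r, s2 = simulate_gim(mu=0.031, delta=0.014, omega=0.012, alpha1=0.078, beta1=0.913)
```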

7.2.8.1 The S&P 500

Asset pricing theorizes that there is a risk-return trade off and GARCH-in-mean models
provide a natural method to test whether this is the case. Using the S&P 500 data, three GIM
models were estimated (one for each transformation of the variance in the mean equation)
and the results are presented in table 7.4. Based on these estimates, there does appear to
be a trade off between mean and variance and higher variances produce higher expected
means, although the magnitude is economically small.

¹⁴The model for the conditional mean can be extended to include ARMA terms or any other predetermined regressor.

S&P Garch-in-Mean Estimates

Mean Specification      µ        δ        ω        α1       β1      Log Lik.
σt                    0.002    0.047    0.012    0.078    0.913    -3719.5
                     (0.694)  (0.034)  (0.020)  (0.000)  (0.000)
σ²t                   0.031    0.014    0.012    0.078    0.913    -3719.7
                     (0.203)  (0.457)  (0.019)  (0.000)  (0.000)
ln(σt)                0.053    0.027    0.012    0.078    0.913    -3719.5
                     (0.020)  (0.418)  (0.027)  (0.000)  (0.000)

Table 7.4: GARCH-in-mean estimates for the S&P 500 series. δ, the parameter which measures the GIM effect, is the most interesting parameter and it is significant in both the log variance specification and the variance specification. The GARCH model estimated was a standard GARCH(1,1). P-values are in parentheses.

7.2.9 Alternative Distributional Assumptions

Despite the strengths of the assumption that the errors are conditionally normal (estimation is simple and the parameters are strongly consistent for the true parameters), GARCH models can be specified and estimated with alternative distributional assumptions. The motivation for using something other than the normal distribution is two-fold. First, a better approximation to the conditional distribution of the standardized returns may improve the precision of the volatility process parameter estimates and, in the case of MLE, the estimates will be fully efficient. Second, GARCH models are often used in situations where the choice of the density matters such as Value-at-Risk and option pricing.
Three distributions stand out among the myriad that have been used to estimate the parameters of GARCH processes. The first is a standardized Student's t (standardized to have unit variance for any value of $\nu$, see Bollerslev (1987)) with $\nu$ degrees of freedom,

Standardized Student's t

$$f(\varepsilon_t;\nu,\sigma_t^2) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\pi(\nu-2)}}\frac{1}{\sigma_t}\frac{1}{\left(1 + \frac{\varepsilon_t^2}{\sigma_t^2(\nu-2)}\right)^{\frac{\nu+1}{2}}} \tag{7.71}$$

where $\Gamma(\cdot)$ is the gamma function.¹⁵ This distribution is always fat-tailed and produces a better fit than the normal for most asset return series. This distribution is only well defined if $\nu > 2$ since the variance of a Student's t with $\nu \leq 2$ is infinite. The second is the generalized error distribution (GED, see Nelson (1991)),

Generalized Error Distribution

$$f(\varepsilon_t;\nu,\sigma_t^2) = \frac{\nu\exp\left(-\frac{1}{2}\left|\frac{\varepsilon_t}{\sigma_t\lambda}\right|^{\nu}\right)}{\sigma_t\lambda\, 2^{\frac{\nu+1}{\nu}}\,\Gamma\left(\frac{1}{\nu}\right)} \tag{7.72}$$

$$\lambda = \left(\frac{2^{-\frac{2}{\nu}}\Gamma\left(\frac{1}{\nu}\right)}{\Gamma\left(\frac{3}{\nu}\right)}\right)^{\frac{1}{2}} \tag{7.73}$$

which nests the normal when $\nu = 2$. The GED is fat-tailed when $\nu < 2$ and thin-tailed when $\nu > 2$. In order for this distribution to be used for estimating GARCH parameters, it is necessary that $\nu \geq 1$ since the variance is infinite when $\nu < 1$. The third useful distribution, introduced in Hansen (1994), extends the standardized Student's t to allow for skewness of returns.

¹⁵The standardized Student's t differs from the usual Student's t so that it is necessary to scale data by $\sqrt{\frac{\nu}{\nu-2}}$ if using functions (such as the CDF) for the regular Student's t distribution.

[Figure 7.5 (Density of standardized residuals for the S&P 500; panels: Student's t (ν = 11), GED (ν = 1.5), Skewed t (ν = 11, λ = −0.12), Empirical (Kernel)): The four panels of this figure contain the estimated density for the S&P 500 and the density implied by the distributions: Student's t, GED, Hansen's skewed t and a kernel density plot of the standardized residuals, ê_t = ε_t/σ̂_t, along with the PDF of a normal (dashed line) for comparison. The shape parameters in the Student's t, GED and skewed t, ν and λ, were jointly estimated with the variance parameters.]

Hansen’s skewed t
7.2 ARCH Models 459

   2 −(ν+1)/2
b εt +σt a
 bc 1+ εt < −a /b

 1

ν−2 σt (1−λ)
f (εt ; ν, λ) =   2 −(ν+1)/2 (7.74)
b εt +σt a
 bc 1+ εt ≥ −a /b
1


ν−2 σt (1+λ)

where
ν−2
 
a = 4λc ,
ν−1
b = 1 + 3λ2 − a 2 ,

and
ν+1
Γ

c =p 2
ν
.
π(ν − 2)Γ 2

The two shape parameters, ν and λ, control the kurtosis and the skewness, respectively.
These distributions may be better approximations to the true distribution since they
allow for kurtosis greater than that of the normal, an important empirical fact, and, in the
case of the skewed t , skewness in the standardized returns. Chapter 8 will return to these
distributions in the context of Value-at-Risk and density forecasting.

7.2.9.1 Alternative Distribution in Practice

To explore the role of alternative distributional assumptions in the estimation of GARCH


models, a TARCH(1,1,1) was fit to the S&P 500 returns using the conditional normal, the
Student’s t , the GED and Hansen’s skewed t . Figure 7.5 contains the empirical density
(constructed with a kernel) and the fit density of the three distributions. Note that the
shape parameters, ν and λ, were jointly estimated with the conditional variance param-
eters. Figure 7.6 shows plots of the estimated conditional variance for both the S&P 500
and IBM using three distributional assumptions. The most important aspect of this figure
is that the fit variances are virtually identical. This is a common finding when estimating models under alternative distributional assumptions: there is often little meaningful difference in the fit conditional variances or estimated parameters.

7.2.10 Model Building

Since ARCH and GARCH models are similar to AR and ARMA models, the Box-Jenkins
methodology is a natural way to approach the problem. The first step is to analyze the sample ACF and PACF of the squared returns or, if the model for the conditional mean is non-trivial, the sample ACF and PACF of the estimated residuals, $\hat\varepsilon_t$, for evidence of heteroskedasticity. Figures 7.7 and 7.8 contain the ACF and PACF for the squared returns of

[Figure 7.6 (Conditional Variance and Distributional Assumptions; panels: S&P 500 Annualized Volatility (TARCH(1,1,1)), IBM Annualized Volatility (TARCH(1,1,1)); lines: Normal, Student's t, GED): The choice of the distribution for the standardized innovation makes little difference to the fit variances or the estimated parameters in most models. The alternative distributions are more useful in application to Value-at-Risk and density forecasting in which case the choice of density may make a large difference.]

the S&P 500 and IBM respectively. The models used in selecting the final model are repro-
duced in tables 7.5 and 7.6 respectively. Both selections began with a simple GARCH(1,1).
The next step was to check if more lags were needed for either the squared innovation or the
lagged variance by fitting a GARCH(2,1) and a GARCH(1,2) to each series. Neither of these
meaningfully improved the fit and a GARCH(1,1) was assumed to be sufficient to capture
the symmetric dynamics.
The next step in model building is to examine whether the data exhibit any evidence
of asymmetries using a GJR-GARCH(1,1,1). The asymmetry term was significant and so
other forms of the GJR model were explored. All were found to provide little improve-
ment in the fit. Once a GJR-GARCH(1,1,1) model was decided upon, a TARCH(1,1,1) was
fit to test whether evolution in variances or standard deviations was more appropriate.
Both series preferred the TARCH to the GJR-GARCH (compare the log-likelihoods), and
the TARCH(1,1,1) was selected. In comparing alternative specifications, an EGARCH was

S&P 500 Model Building


ω α1 α2 γ1 γ2 β1 β2 LL
GARCH(1,1) 0.012 0.078 0.913 -3721.9
(.556) (.000) (.000)
GARCH(2,1) 0.018 0.000 0.102 0.886 -3709.2
(.169) (.999) (.000) (.000)
GARCH(1,2) 0.012 0.078 0.913 0.000 -3721.9
(.600) (.000) (.000) (.999)
GJR-GARCH(1,1,1) 0.013 0.000 0.126 0.926 -3667.1
(.493) (.999) (.000) (.000)
GJR-GARCH(1,2,1) 0.013 0.000 0.126 0.000 0.926 -3667.1
(.004) (.999) (.000) (.999) (.000)
TARCH(1,1,1) 0.017 0.000 0.123 0.937 -3668.8
(.000) (.999) (.000) (.000)
EGARCH(1,1,1) 0.003 0.094 −0.119 0.985 -3669.4
(.981) (.000) (.000) (.000)
EGARCH(2,1,1)** 0.004 −0.150 0.273 −0.125 0.981 -3652.5
(.973) (.000) (.000) (.000) (.000)
EGARCH(1,2,1) 0.003 0.090 −0.188 0.073 0.986 -3666.8
(.988) (.000) (.000) (.174) (.000)
EGARCH(1,1,2) 0.003 0.090 −0.110 1.059 −0.073 -3669.3
(.991) (.545) (.701) (.000) (.938)

Table 7.5: The models estimated in selecting a final model for the conditional variance of
the S&P 500 Index. ** indicates the selected model.

fit and found to also provide a good description of the data. In both cases the EGARCH was
expanded to include more lags of the shocks or lagged log volatility. For S&P 500 data, a
model with 2 symmetric terms, an EGARCH(2,1,1), was selected. For IBM return data, the
EGARCH(1,1,1) model also improved on the TARCH(1,1,1), and no further extensions to the EGARCH(1,1,1) were significant. The final step, if needed, would be to consider any
alternative distributions for standardized errors.

7.2.10.1 Testing for (G)ARCH

Although conditional heteroskedasticity can often be identified by graphical inspection, a formal test of conditional homoskedasticity is also useful. The standard method to test for ARCH is to use the ARCH-LM test, which is implemented as a regression of squared residuals on lagged squared residuals and directly exploits the AR representation of an ARCH process (Engle 1982). The test is computed by estimating

$$\hat\varepsilon_t^2 = \phi_0 + \phi_1\hat\varepsilon_{t-1}^2 + \ldots + \phi_P\hat\varepsilon_{t-P}^2 + \eta_t \tag{7.75}$$

and then computing a test statistic as $T$ times the $R^2$ of the regression ($LM = T\times R^2$); it is asymptotically distributed $\chi_P^2$, where $\hat\varepsilon_t$ are residuals from a conditional mean model. The null hypothesis is $H_0: \phi_1 = \ldots = \phi_P = 0$, which corresponds to no persistence in the conditional variance.
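The ARCH-LM test takes only a few lines of code. The sketch below implements it with plain numpy least squares (function names are illustrative); the p-value uses the χ²_P distribution from scipy.

```python
import numpy as np
from scipy import stats

def arch_lm_test(resid, P=5):
    """ARCH-LM test: regress resid^2 on P of its own lags and compute LM = T * R^2."""
    e2 = np.asarray(resid) ** 2
    y = e2[P:]
    X = np.column_stack([np.ones_like(y)] + [e2[P - p:-p] for p in range(1, P + 1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    lm = y.shape[0] * r2
    return lm, 1 - stats.chi2.cdf(lm, df=P)
```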

[Figure 7.7 (ACF and PACF of squared returns of the S&P 500; panels: Squared Residuals ACF/PACF, Standardized Squared Residuals ACF/PACF): ACF and PACF of the squared returns for the S&P 500. The bottom two panels contain the ACF and PACF of ê²_t = ε̂²_t/σ̂²_t. The top panels show persistence in both the ACF and PACF which indicate an ARMA model is needed (hence a GARCH model) while the ACF and PACF of the standardized residuals appear to be compatible with an assumption of white noise.]

7.3 Forecasting Volatility

Forecasting conditional variances with GARCH models ranges from trivial for simple ARCH and GARCH specifications to difficult for non-linear specifications. Consider the simple ARCH(1) process,

$$\varepsilon_t = \sigma_t e_t \tag{7.76}$$
$$e_t \overset{\text{i.i.d.}}{\sim} N(0,1)$$
$$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2$$

[Figure 7.8 (ACF and PACF of squared returns of IBM; panels: Squared Residuals ACF/PACF, Standardized Squared Residuals ACF/PACF): ACF and PACF of the squared returns for IBM. The bottom two panels contain the ACF and PACF of ê²_t = ε̂²_t/σ̂²_t. The top panels show persistence in both the ACF and PACF which indicate an ARMA model is needed (hence a GARCH model) while the ACF and PACF of the standardized residuals appear to be compatible with an assumption of white noise. Compared to the S&P 500 ACF and PACF, IBM appears to have weaker volatility dynamics.]

Iterating forward, $\sigma_{t+1}^2 = \omega + \alpha_1\varepsilon_t^2$, and taking conditional expectations, $E_t[\sigma_{t+1}^2] = E_t[\omega + \alpha_1\varepsilon_t^2] = \omega + \alpha_1\varepsilon_t^2$ since all of these quantities are known at time $t$. This is a property common to all ARCH-family models¹⁶: forecasts of $\sigma_{t+1}^2$ are always known at time $t$.

The 2-step ahead forecast follows from the law of iterated expectations,

$$E_t[\sigma_{t+2}^2] = E_t[\omega + \alpha_1\varepsilon_{t+1}^2] = \omega + \alpha_1 E_t[\varepsilon_{t+1}^2] = \omega + \alpha_1(\omega + \alpha_1\varepsilon_t^2) = \omega + \alpha_1\omega + \alpha_1^2\varepsilon_t^2. \tag{7.77}$$

¹⁶Not only is this property common to all ARCH-family members, it is the defining characteristic of an ARCH model.

IBM Model Building


ω α1 α2 γ1 γ2 β1 β2 LL
GARCH(1,1) 0.048 0.104 0.879 -4446.2
(.705) (.015) (.000)
GARCH(2,1) 0.048 0.104 0.000 0.879 -4446.2
(.733) (.282) (.999) (.000)
GARCH(1,2) 0.054 0.118 0.714 0.149 -4445.6
(.627) (.000) (.000) (.734)
GJR-GARCH(1,1,1) 0.036 0.021 0.105 0.912 -4417.2
(.759) (.950) (.000) (.000)
GJR-GARCH(1,2,1) 0.036 0.021 0.105 0.000 0.912 -4417.2
(.836) (.965) (.192) (.999) (.000)
TARCH(1,1,1) 0.020 0.022 0.080 0.938 -4407.6
(.838) (.950) (.000) (.000)
EGARCH(1,1,1)** 0.010 0.104 −0.072 0.988 -4403.7
(.903) (.000) (.000) (.000)
EGARCH(2,1,1) 0.009 0.158 −0.060 −0.070 0.989 -4402.8
(.905) (.000) (.877) (.000) (.000)
EGARCH(1,2,1) 0.010 0.103 −0.079 0.008 0.989 -4403.7
(.925) (.000) (.000) (.990) (.000)
EGARCH(1,1,2) 0.011 0.118 −0.082 0.830 0.157 -4403.4
(.905) (.000) (.000) (.000) (.669)

Table 7.6: The models estimated in selecting a final model for the conditional variance of
IBM. ** indicates the selected model.

A generic expression for an $h$-step ahead forecast can be constructed by repeated substitution and is given by

$$E_t[\sigma_{t+h}^2] = \sum_{i=0}^{h-1}\alpha_1^i\omega + \alpha_1^h\varepsilon_t^2. \tag{7.78}$$

This form should look familiar since it is the multi-step forecasting formula for an AR(1). This should not be surprising since an ARCH(1) is an AR(1).
Forecasts from GARCH(1,1) models can be derived in a similar fashion,

$$\begin{aligned}
E_t[\sigma_{t+1}^2] &= E_t[\omega + \alpha_1\varepsilon_t^2 + \beta_1\sigma_t^2] \\
&= \omega + \alpha_1\varepsilon_t^2 + \beta_1\sigma_t^2 \\
E_t[\sigma_{t+2}^2] &= E_t[\omega + \alpha_1\varepsilon_{t+1}^2 + \beta_1\sigma_{t+1}^2] \\
&= \omega + \alpha_1 E_t[\varepsilon_{t+1}^2] + \beta_1 E_t[\sigma_{t+1}^2] \\
&= \omega + \alpha_1 E_t[e_{t+1}^2\sigma_{t+1}^2] + \beta_1 E_t[\sigma_{t+1}^2] \\
&= \omega + \alpha_1 E_t[e_{t+1}^2]E_t[\sigma_{t+1}^2] + \beta_1 E_t[\sigma_{t+1}^2] \\
&= \omega + \alpha_1\cdot 1\cdot E_t[\sigma_{t+1}^2] + \beta_1 E_t[\sigma_{t+1}^2] \\
&= \omega + \alpha_1 E_t[\sigma_{t+1}^2] + \beta_1 E_t[\sigma_{t+1}^2] \\
&= \omega + (\alpha_1+\beta_1)E_t[\sigma_{t+1}^2]
\end{aligned} \tag{7.79}$$

and substituting the one-step ahead forecast, $E_t[\sigma_{t+1}^2]$, produces

$$\begin{aligned}
E_t[\sigma_{t+2}^2] &= \omega + (\alpha_1+\beta_1)(\omega + \alpha_1\varepsilon_t^2 + \beta_1\sigma_t^2) \\
&= \omega + (\alpha_1+\beta_1)\omega + (\alpha_1+\beta_1)\alpha_1\varepsilon_t^2 + (\alpha_1+\beta_1)\beta_1\sigma_t^2.
\end{aligned} \tag{7.80}$$

Note that $E_t[\sigma_{t+3}^2] = \omega + (\alpha_1+\beta_1)E_t[\sigma_{t+2}^2]$, and so

$$\begin{aligned}
E_t[\sigma_{t+3}^2] &= \omega + (\alpha_1+\beta_1)\left(\omega + (\alpha_1+\beta_1)\omega + (\alpha_1+\beta_1)\alpha_1\varepsilon_t^2 + (\alpha_1+\beta_1)\beta_1\sigma_t^2\right) \\
&= \omega + (\alpha_1+\beta_1)\omega + (\alpha_1+\beta_1)^2\omega + (\alpha_1+\beta_1)^2\alpha_1\varepsilon_t^2 + (\alpha_1+\beta_1)^2\beta_1\sigma_t^2.
\end{aligned} \tag{7.81}$$

Continuing in this manner produces a pattern which can be compactly expressed

$$E_t[\sigma_{t+h}^2] = \sum_{i=0}^{h-1}(\alpha_1+\beta_1)^i\omega + (\alpha_1+\beta_1)^{h-1}(\alpha_1\varepsilon_t^2 + \beta_1\sigma_t^2). \tag{7.82}$$
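Eq. (7.82) translates directly into code; the sketch below computes the full term structure of GARCH(1,1) variance forecasts (illustrative code, using the same notation as the formula).

```python
import numpy as np

def garch11_forecast(eps_t, sigma2_t, omega, alpha1, beta1, h_max=22):
    """E_t[sigma2_{t+h}] for h = 1, ..., h_max from a GARCH(1,1), following eq. (7.82)."""
    persistence = alpha1 + beta1
    forecasts = np.empty(h_max)
    forecasts[0] = omega + alpha1 * eps_t ** 2 + beta1 * sigma2_t   # one-step forecast
    for h in range(1, h_max):
        forecasts[h] = omega + persistence * forecasts[h - 1]
    return forecasts
```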

Despite similarities to ARCH and GARCH models, forecasts from GJR-GARCH are less simple since the presence of the asymmetric term results in the probability that $e_{t-1} < 0$ appearing in the forecasting formula. If the standardized residuals were normal (or any other symmetric distribution), then this probability would be $\frac{1}{2}$. If the density is unknown, this quantity would need to be estimated from the standardized residuals.

In the GJR-GARCH model, the one-step ahead forecast is just

$$E_t[\sigma_{t+1}^2] = \omega + \alpha_1\varepsilon_t^2 + \gamma_1\varepsilon_t^2 I_{[\varepsilon_t<0]} + \beta_1\sigma_t^2. \tag{7.83}$$

The two-step ahead forecast can be computed following

$$E_t[\sigma_{t+2}^2] = \omega + \alpha_1 E_t[\varepsilon_{t+1}^2] + \gamma_1 E_t[\varepsilon_{t+1}^2 I_{[\varepsilon_{t+1}<0]}] + \beta_1 E_t[\sigma_{t+1}^2] \tag{7.84}$$
$$= \omega + \alpha_1 E_t[\sigma_{t+1}^2] + \gamma_1 E_t[\varepsilon_{t+1}^2|\varepsilon_{t+1}<0] + \beta_1 E_t[\sigma_{t+1}^2] \tag{7.85}$$

and if normality is assumed, $E_t[\varepsilon_{t+1}^2|\varepsilon_{t+1}<0] = \Pr(\varepsilon_{t+1}<0)E[\sigma_{t+1}^2] = 0.5 E[\sigma_{t+1}^2]$ since the probability that $\varepsilon_{t+1} < 0$ is independent of $E_t[\sigma_{t+1}^2]$.
Multi-step forecasts from other models in the ARCH-family, particularly those which are not linear combinations of $\varepsilon_t^2$, are nontrivial and generally do not have simple recursive formulas. For example, consider forecasting the variance from the simplest nonlinear ARCH-family member, a TARCH(1,0,0) model,

$$\sigma_t = \omega + \alpha_1|\varepsilon_{t-1}|. \tag{7.86}$$

As is always the case, the 1-step ahead forecast is known at time $t$,

$$\begin{aligned}
E_t[\sigma_{t+1}^2] &= E_t[(\omega + \alpha_1|\varepsilon_t|)^2] \\
&= E_t[\omega^2 + 2\omega\alpha_1|\varepsilon_t| + \alpha_1^2\varepsilon_t^2] \\
&= \omega^2 + 2\omega\alpha_1 E_t[|\varepsilon_t|] + \alpha_1^2 E_t[\varepsilon_t^2] \\
&= \omega^2 + 2\omega\alpha_1|\varepsilon_t| + \alpha_1^2\varepsilon_t^2.
\end{aligned} \tag{7.87}$$

The 2-step ahead forecast is more complicated, and is given by

$$\begin{aligned}
E_t[\sigma_{t+2}^2] &= E_t[(\omega + \alpha_1|\varepsilon_{t+1}|)^2] \\
&= E_t[\omega^2 + 2\omega\alpha_1|\varepsilon_{t+1}| + \alpha_1^2\varepsilon_{t+1}^2] \\
&= \omega^2 + 2\omega\alpha_1 E_t[|\varepsilon_{t+1}|] + \alpha_1^2 E_t[\varepsilon_{t+1}^2] \\
&= \omega^2 + 2\omega\alpha_1 E_t[|e_{t+1}|\sigma_{t+1}] + \alpha_1^2 E_t[e_{t+1}^2\sigma_{t+1}^2] \\
&= \omega^2 + 2\omega\alpha_1 E_t[|e_{t+1}|]E_t[\sigma_{t+1}] + \alpha_1^2 E_t[e_{t+1}^2]E_t[\sigma_{t+1}^2] \\
&= \omega^2 + 2\omega\alpha_1 E_t[|e_{t+1}|](\omega + \alpha_1|\varepsilon_t|) + \alpha_1^2\cdot 1\cdot(\omega^2 + 2\omega\alpha_1|\varepsilon_t| + \alpha_1^2\varepsilon_t^2).
\end{aligned} \tag{7.88}$$

The issues in multi-step ahead forecasting arise because the forecast depends on more than just $E_t[e_{t+h}^2] \equiv 1$. In the above example, the forecast depends on both $E_t[e_{t+1}^2] = 1$ and $E_t[|e_{t+1}|]$. When returns are normal $E_t[|e_{t+1}|] = \sqrt{\frac{2}{\pi}}$, but if the driving innovations have a different distribution, this expectation will differ. The final form, assuming the conditional distribution is normal, is

$$E_t[\sigma_{t+2}^2] = \omega^2 + 2\omega\alpha_1\sqrt{\frac{2}{\pi}}(\omega + \alpha_1|\varepsilon_t|) + \alpha_1^2(\omega^2 + 2\omega\alpha_1|\varepsilon_t| + \alpha_1^2\varepsilon_t^2). \tag{7.89}$$
The difficulty in multi-step forecasting using "nonlinear" GARCH models follows directly from Jensen's inequality. In the case of TARCH,

$$E_t[\sigma_{t+h}]^2 \neq E_t[\sigma_{t+h}^2] \tag{7.90}$$

and in the general case

$$E_t[\sigma_{t+h}^{\delta}]^{\frac{2}{\delta}} \neq E_t[\sigma_{t+h}^2]. \tag{7.91}$$

7.3.1 Evaluating Volatility Forecasts

Evaluation of volatility forecasts is similar to the evaluation of forecasts from conditional mean models with one caveat. In standard time series models, once time $t+h$ has arrived, the value of the variable being forecast is known. However, in volatility models the value of $\sigma_{t+h}^2$ is always unknown and it must be replaced with a proxy. The standard choice is to use the squared return, $r_t^2$. This is reasonable if the squared conditional mean is small relative to the variance, a reasonable assumption for short-horizon problems (daily or possibly weekly returns). If using longer horizon measurements of returns, such as monthly, squared residuals ($\varepsilon_t^2$) produced from a model for the conditional mean can be used instead. An alternative, and likely better, choice is to use the realized variance, $RV_t^{(m)}$, to proxy for the unobserved volatility (see section 7.4). Once a choice of proxy has been made, Generalized Mincer-Zarnowitz regressions can be used to assess forecast optimality,

$$r_{t+h}^2 - \hat\sigma_{t+h|t}^2 = \gamma_0 + \gamma_1\hat\sigma_{t+h|t}^2 + \gamma_2 z_{1t} + \ldots + \gamma_{K+1}z_{Kt} + \eta_t \tag{7.92}$$

where $z_{jt}$ are any instruments known at time $t$. Common choices for $z_{jt}$ include $r_t^2$, $|r_t|$, $r_t$ or indicator variables for the sign of the lagged return. The GMZ regression in equation 7.92 has a heteroskedastic variance and a better estimator, GMZ-GLS, can be constructed as

rt2+h − σ̂2t +h |t 1 z 1t zK t
= γ0 + γ1 1 + γ2 + . . . + γK +1 2 + νt (7.93)
σ̂2t +h |t σ̂2t +h |t σ̂t +h |t
2
σ̂t +h |t
rt2+h 1 z 1t zK t
− 1 = γ0 + γ1 1 + γ2 + . . . + γK +1 2 + νt (7.94)
σ̂2t +h |t σ̂2t +h |t σ̂t +h |t
2
σ̂t +h |t

by dividing both sized by the time t forecast, σ̂2t +h |t where νt = ηt /σ̂2t +h |t . Equation 7.94
shows that the GMZ-GLS is a regression of the generalized error from a normal likelihood.
If one were to use the realized variance as the proxy, the GMZ and GMZ-GLS regressions become

$RV_{t+h} - \hat\sigma^2_{t+h|t} = \gamma_0 + \gamma_1\hat\sigma^2_{t+h|t} + \gamma_2 z_{1t} + \ldots + \gamma_{K+1} z_{Kt} + \eta_t$  (7.95)

and

$$\frac{RV_{t+h} - \hat\sigma^2_{t+h|t}}{\hat\sigma^2_{t+h|t}} = \gamma_0\frac{1}{\hat\sigma^2_{t+h|t}} + \gamma_1\frac{\hat\sigma^2_{t+h|t}}{\hat\sigma^2_{t+h|t}} + \gamma_2\frac{z_{1t}}{\hat\sigma^2_{t+h|t}} + \ldots + \gamma_{K+1}\frac{z_{Kt}}{\hat\sigma^2_{t+h|t}} + \frac{\eta_t}{\hat\sigma^2_{t+h|t}}. \tag{7.96}$$
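A hedged sketch of how the GMZ and GMZ-GLS regressions might be run is given below; the variance forecast and squared-return series are simulated placeholders rather than output from an estimated model.

```python
# A sketch of the Mincer-Zarnowitz (GMZ) regression in eq. 7.92 and its GLS version
# in eq. 7.93, using simulated data in place of real forecasts and returns.
import numpy as np

rng = np.random.default_rng(1)
T = 1000
sigma2_fcst = 0.5 + 0.3 * rng.chisquare(3, T)          # hypothetical variance forecasts
r2 = sigma2_fcst * rng.standard_normal(T)**2           # squared returns with the forecast as true variance

# GMZ: regress the forecast error on a constant and the forecast (H0: all gammas = 0)
y = r2 - sigma2_fcst
X = np.column_stack([np.ones(T), sigma2_fcst])
gamma_gmz = np.linalg.lstsq(X, y, rcond=None)[0]

# GMZ-GLS: divide both sides by the forecast (eq. 7.93)
gamma_gls = np.linalg.lstsq(X / sigma2_fcst[:, None], y / sigma2_fcst, rcond=None)[0]
print(gamma_gmz, gamma_gls)                            # both should be near zero here
```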

Diebold-Mariano tests can also be used to assess the relative performance of two models. To perform a DM test, a loss function must be specified. Two obvious choices for the loss function are MSE,

$\left(r^2_{t+h} - \hat\sigma^2_{t+h|t}\right)^2$  (7.97)

and QML-loss (which is simply the kernel of the normal log-likelihood),

$\ln(\hat\sigma^2_{t+h|t}) + \dfrac{r^2_{t+h}}{\hat\sigma^2_{t+h|t}}.$  (7.98)

The DM statistic is a t-test of the null $H_0: E[\delta_t] = 0$ where

$\delta_t = \left(r^2_{t+h} - \hat\sigma^2_{A,t+h|t}\right)^2 - \left(r^2_{t+h} - \hat\sigma^2_{B,t+h|t}\right)^2$  (7.99)

in the case of the MSE loss or

$\delta_t = \left(\ln(\hat\sigma^2_{A,t+h|t}) + \dfrac{r^2_{t+h}}{\hat\sigma^2_{A,t+h|t}}\right) - \left(\ln(\hat\sigma^2_{B,t+h|t}) + \dfrac{r^2_{t+h}}{\hat\sigma^2_{B,t+h|t}}\right)$  (7.100)

in the case of the QML-loss. Statistically significant positive values of $\bar\delta = R^{-1}\sum_{r=1}^{R}\delta_r$ indicate that B is a better model than A while negative values indicate the opposite (recall $R$ is used to denote the number of out-of-sample observations used to compute the DM statistic). The QML-loss should be generally preferred as it is a “heteroskedasticity corrected” version of the MSE. For more on evaluation of volatility forecasts using MZ regressions see Patton & Sheppard (2009).
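The following sketch illustrates a Diebold-Mariano comparison using the QML loss in equation 7.98; the two forecast series are simulated stand-ins, and for multi-step forecasts the denominator would normally use a HAC (Newey-West) long-run variance rather than the simple standard error used here.

```python
# A sketch of a Diebold-Mariano comparison of two volatility forecasts using the
# QML loss in eq. 7.98; the forecast series here are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
R = 500                                                  # out-of-sample observations
true_var = 0.5 + 0.3 * rng.chisquare(3, R)
r2 = true_var * rng.standard_normal(R)**2
fcst_A = true_var * np.exp(0.1 * rng.standard_normal(R))  # noisier forecast
fcst_B = true_var                                          # perfect forecast

loss_A = np.log(fcst_A) + r2 / fcst_A
loss_B = np.log(fcst_B) + r2 / fcst_B
delta = loss_A - loss_B

# t-test of H0: E[delta] = 0; a positive mean favours model B
dm_stat = delta.mean() / (delta.std(ddof=1) / np.sqrt(R))
print(dm_stat, 2 * (1 - stats.norm.cdf(abs(dm_stat))))
```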

7.4 Realized Variance

Realized variance is a relatively new tool for measuring and modeling the conditional vari-
ance of asset returns that is novel for not requiring a model to measure the volatility, unlike
ARCH models. Realized variance is a nonparametric estimator of the variance that is com-
puted using ultra high-frequency data.17
Consider a log-price process, pt , driven by a standard Wiener process with a constant
mean and variance,

dpt = µ dt + σ dWt
where the coefficients have been normalized such that the return during one day is the
difference between p at two consecutive integers (i.e. p1 − p0 is the first day’s return). For
the S&P 500 index, µ ≈ .00031 and σ ≈ .0125, which correspond to 8% and 20% for the
annualized mean and volatility, respectively.
Realized variance is constructed by frequently sampling pt throughout the trading day.
Suppose that prices on day t were sampled on a regular grid of m + 1 points, 0, 1, . . . , m
and let pi ,t denote the ith observation of the log price. The m -sample realized variance is
defined
17
RV was invented somewhere between 1972 and 1997. However, its introduction to modern econometrics
clearly dates to the late 1990s (Andersen & Bollerslev 1998, Andersen, Bollerslev, Diebold & Labys 2003, Barn-
dorff-Nielsen & Shephard 2004).

$RV_t^{(m)} = \sum_{i=1}^{m}\left(p_{i,t} - p_{i-1,t}\right)^2 = \sum_{i=1}^{m} r^2_{i,t}.$  (7.101)

Since the price process is a standard Brownian motion, each return is an i.i.d. normal with mean $\mu/m$, variance $\sigma^2/m$ (and so a volatility of $\sigma/\sqrt{m}$). First, consider the expectation of $RV_t^{(m)}$,

$E\left[RV_t^{(m)}\right] = E\left[\sum_{i=1}^{m} r^2_{i,t}\right] = E\left[\sum_{i=1}^{m}\left(\frac{\mu}{m} + \frac{\sigma}{\sqrt{m}}\varepsilon_{i,t}\right)^2\right]$  (7.102)

where $\varepsilon_{i,t}$ is a standard normal.

$$
\begin{aligned}
E\left[RV_t^{(m)}\right] &= E\left[\sum_{i=1}^{m}\left(\frac{\mu}{m} + \frac{\sigma}{\sqrt{m}}\varepsilon_{i,t}\right)^2\right] \\
&= E\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2} + 2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t} + \frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] \\
&= E\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2}\right] + E\left[\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t}\right] + E\left[\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] \\
&= \frac{\mu^2}{m} + 2\frac{\mu\sigma}{m^{\frac{3}{2}}}\sum_{i=1}^{m}E\left[\varepsilon_{i,t}\right] + \frac{\sigma^2}{m}\sum_{i=1}^{m}E\left[\varepsilon^2_{i,t}\right] \\
&= \frac{\mu^2}{m} + 2\frac{\mu\sigma}{m^{\frac{3}{2}}}\sum_{i=1}^{m}0 + \frac{\sigma^2}{m}\sum_{i=1}^{m}1 \\
&= \frac{\mu^2}{m} + \frac{\sigma^2}{m}m \\
&= \frac{\mu^2}{m} + \sigma^2
\end{aligned}
\tag{7.103}
$$

The expected value is nearly $\sigma^2$, the variance, and, as $m\to\infty$, $\lim_{m\to\infty} E\left[RV_t^{(m)}\right] = \sigma^2$. The variance of $RV_t^{(m)}$ can be similarly derived,

$$
\begin{aligned}
V\left[RV_t^{(m)}\right] &= V\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2} + 2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t} + \frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] \\
&= V\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2}\right] + V\left[\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t}\right] + V\left[\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] + 2\,\mathrm{Cov}\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2},\,\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t}\right] \\
&\quad + 2\,\mathrm{Cov}\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2},\,\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] + 2\,\mathrm{Cov}\left[\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t},\,\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right].
\end{aligned}
\tag{7.104}
$$

Working through these 6 terms, it can be determined that

$$V\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2}\right] = \mathrm{Cov}\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2},\,\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t}\right] = \mathrm{Cov}\left[\sum_{i=1}^{m}\frac{\mu^2}{m^2},\,\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] = 0$$

since $\mu^2/m^2$ is a constant, and that the remaining covariance term also has expectation 0 since $\varepsilon_{i,t}$ are i.i.d. standard normal and so have a skewness of 0,

$$\mathrm{Cov}\left[\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t},\,\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] = 0$$

The other two terms can be shown to be (also left as exercises)

$$V\left[\sum_{i=1}^{m}2\frac{\mu\sigma}{m^{\frac{3}{2}}}\varepsilon_{i,t}\right] = 4\frac{\mu^2\sigma^2}{m^2}$$

$$V\left[\sum_{i=1}^{m}\frac{\sigma^2}{m}\varepsilon^2_{i,t}\right] = 2\frac{\sigma^4}{m}$$

and so

$V\left[RV_t^{(m)}\right] = 4\dfrac{\mu^2\sigma^2}{m^2} + 2\dfrac{\sigma^4}{m}.$  (7.105)

The variance is decreasing as m → ∞ and RVt (m) is a consistent estimator for σ2 .
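A short simulation sketch of $RV_t^{(m)}$ under the constant drift, constant volatility assumption follows; the drift and volatility are the approximate S&P 500 values quoted above and the sampling frequency corresponds to 1-minute returns.

```python
# A minimal sketch of the m-sample realized variance in eq. 7.101 computed from a
# simulated intraday log-price path.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, m = 0.00031, 0.0125, 390                       # one-minute sampling
returns = mu / m + sigma / np.sqrt(m) * rng.standard_normal(m)
log_prices = np.concatenate([[0.0], np.cumsum(returns)])  # p_{0,t}, ..., p_{m,t}

rv = np.sum(np.diff(log_prices)**2)                       # RV_t^(m)
print(rv, sigma**2)                                       # RV should be close to sigma^2
```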


In the more realistic case of a price process with time-varying drift and stochastic volatility,

$dp_t = \mu_t\, dt + \sigma_t\, dW_t,$

$RV_t^{(m)}$ is a consistent estimator of the integrated variance,

$\lim_{m\to\infty} RV_t^{(m)} \stackrel{p}{\to} \int_t^{t+1}\sigma^2_s\, ds.$  (7.106)

If the price process contains jumps, $RV_t^{(m)}$ is still a consistent estimator although its limit is the quadratic variation rather than the integrated variance, and so

$\lim_{m\to\infty} RV_t^{(m)} \stackrel{p}{\to} \int_t^{t+1}\sigma^2_s\, ds + \sum_{t\le s\le t+1}\Delta J_s^2.$  (7.107)

where $\Delta J_s^2$ measures the contribution of jumps to the total variation. Similar results hold if the price process exhibits leverage (instantaneous correlation between the price and the variance). The conditions for $RV_t^{(m)}$ to be a reasonable method to estimate the integrated variance on day $t$ are essentially that the price process, $p_t$, is arbitrage-free and that the efficient price is observable. Data seem to support the first condition but the second is clearly violated.
There are always practical limits to m. This said, the maximum m is always much higher
than 1 – a close-to-close return – and is typically somewhere between 13 (30-minute returns
on the NYSE) and 390 (1-minute return on the NYSE), and so even when RVt (m) is not con-
sistent, it is still a better proxy, often substantially, for the latent variance on day t than rt2
(the “1-sample realized variance”). The signal-to-noise ratio (which measures the ratio of
useful information to pure noise) is approximately 1 for RV but is between .05 and .1 for
rt2 . In other words, RV is 10-20 times more precise than squared daily returns. In terms of a
linear regression, the difference between RV and daily squared returns is similar to fitting a
regression with an R 2 of about 50% and one with an R 2 of 5% (Andersen & Bollerslev 1998).

7.4.1 Modeling RV

If RV can reasonably be treated as observable, then it can be modeled using standard time series tools such as ARMA models. This has been the approach pursued thus far although there are issues in treating the RV as error free. The observability of the variance through RV is questionable and if there are errors in RV (which there almost certainly are), these will lead to an errors-in-variables problem in ARMA models and bias in estimated coefficients (see chapter 4). In particular, Corsi (2009) has advocated the use of a heterogeneous autoregression (HAR) where the current realized volatility depends on the realized volatility in the previous day, the average realized variance over the previous week, and the average realized variance of the previous month (22 days). HARs have been used in both levels

$RV_t = \phi_0 + \phi_1 RV_{t-1} + \phi_5 \overline{RV}_{t-5} + \phi_{22}\overline{RV}_{t-22} + \varepsilon_t$  (7.108)

where $\overline{RV}_{t-5} = \frac{1}{5}\sum_{i=1}^{5}RV_{t-i}$ and $\overline{RV}_{t-22} = \frac{1}{22}\sum_{i=1}^{22}RV_{t-i}$ (suppressing the $(m)$ terms), or in logs,

$\ln RV_t = \phi_0 + \phi_1 \ln RV_{t-1} + \phi_5 \ln\overline{RV}_{t-5} + \phi_{22}\ln\overline{RV}_{t-22} + \varepsilon_t.$  (7.109)

HARs allow for highly persistent volatility (through the 22-day moving average) and short term dynamics (through the 1-day and 5-day terms).
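A minimal sketch of estimating the HAR in equation 7.108 by OLS follows; the realized variance series is simulated for illustration and the helper `trailing_mean` is a hypothetical convenience function, not part of any particular package.

```python
# A sketch of estimating the HAR in eq. 7.108 by OLS; rv stands in for an array of
# daily realized variances (simulated here for illustration only).
import numpy as np

rng = np.random.default_rng(4)
rv = np.exp(0.5 * rng.standard_normal(1000))             # placeholder RV series

def trailing_mean(x, k):
    """Mean of x over k consecutive observations ending at each index."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    return (c[k:] - c[:-k]) / k

y = rv[22:]                                              # RV_t
x1 = rv[21:-1]                                           # RV_{t-1}
x5 = trailing_mean(rv, 5)[17:-1]                         # mean of RV_{t-1},...,RV_{t-5}
x22 = trailing_mean(rv, 22)[:-1]                         # mean of RV_{t-1},...,RV_{t-22}

X = np.column_stack([np.ones_like(y), x1, x5, x22])
phi = np.linalg.lstsq(X, y, rcond=None)[0]               # (phi_0, phi_1, phi_5, phi_22)
print(phi)
```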
An alternative strategy is to apply ARCH-family models to the realized variance. ARCH-family models can be interpreted as a multiplicative error model for any non-negative process, not just squared returns (Engle 2002a).18

18 ARCH-family models have, for example, been successfully applied to both durations (time between trades) and hazards (number of trades in an interval of time), two non-negative processes.

To use RV in an ARCH model, define $\tilde{r}_t = \mathrm{sign}(r_t)\sqrt{RV_t}$ where $\mathrm{sign}(r_t)$ takes the value 1 if the return was positive in period $t$ and -1 if it was negative, and so $\tilde{r}_t$ is the signed square root of the realized variance on day $t$. Any ARCH-family model can be applied to these transformed realized variances to estimate a model for forecasting the quadratic variation. For example, consider the variance
evolution in a GJR-GARCH(1,1,1) using the transformed data,

σ2t = ω + α1 r̃t2−1 + γ1 r̃t2−1 I[r̃t −1 <0] + β1 σ2t −1 (7.110)

which, in terms of realized variance, is equivalently expressed

σ2t = ω + α1 RVt −1 + γ1 RVt −1 I[rt −1 <0] + β1 σ2t −1 . (7.111)

Maximum likelihood estimation, assuming normally distributed errors, can be used to


estimate the parameters of this model. This procedure solves the errors-in-variables prob-
lem present when RV is treated as observable and makes modeling RV simple using stan-
dard estimators. Inference and model building is identical to that of ARCH processes.

7.4.2 Problems with RV

Realized variance suffers from a number of problems. The most pronounced is that ob-
served prices are contaminated by noise since they are only observed at the bid and the
ask. Consider a simple model of bid-ask bounce where returns are computed as the log
difference in observed prices composed of the true (unobserved) efficient prices, pi∗,t , con-
taminated by an independent mean zero shock, νi ,t .

pi ,t = pi∗,t + νi ,t

The ith observed return, ri ,t can be decomposed into the actual (unobserved) return ri∗,t
and an independent noise term ηi ,t .

$$
\begin{aligned}
p_{i,t} - p_{i-1,t} &= \left(p^*_{i,t} + \nu_{i,t}\right) - \left(p^*_{i-1,t} + \nu_{i-1,t}\right) \\
&= \left(p^*_{i,t} - p^*_{i-1,t}\right) + \left(\nu_{i,t} - \nu_{i-1,t}\right) \\
r_{i,t} &= r^*_{i,t} + \eta_{i,t}
\end{aligned}
\tag{7.112}
$$

where $\eta_{i,t} = \nu_{i,t} - \nu_{i-1,t}$ is an MA(1) shock.


Computing the RV from returns contaminated by noise has an unambiguous effect on realized variance; RV is biased upward.

$$
\begin{aligned}
RV_t^{(m)} &= \sum_{i=1}^{m} r^2_{i,t} \\
&= \sum_{i=1}^{m}\left(r^*_{i,t} + \eta_{i,t}\right)^2 \\
&= \sum_{i=1}^{m} r^{*2}_{i,t} + 2 r^*_{i,t}\eta_{i,t} + \eta^2_{i,t} \\
&\approx \widehat{RV}_t + m\tau^2
\end{aligned}
\tag{7.113}
$$

where $\tau^2$ is the variance of $\eta_{i,t}$ and $\widehat{RV}_t$ is the realized variance computed using the efficient returns. The bias is increasing in the number of samples ($m$), and may be large for stocks with large bid-ask spreads.

Figure 7.9: The four panels of this figure contain the Realized Variance, for sampling frequencies of 15 seconds and 1, 5 and 30 minutes, for every day the market was open from January 3, 2006 until July 31, 2008. The estimated RV have been transformed into annualized volatility ($\sqrt{252\cdot RV_t^{(m)}}$). While these plots appear superficially similar, the 1- and 5-minute RV are the most precise and the 15-second RV is biased upward.
Figure 7.10: The four panels of this figure contain a noise-robust version of Realized Variance, $RV^{AC_1}$, for sampling frequencies of 15 seconds and 1, 5 and 30 minutes, for every day the market was open from January 3, 2006 until July 31, 2008, transformed into annualized volatility. The 15-second $RV^{AC_1}$ is much better behaved than the 15-second RV although it still exhibits some strange behavior. In particular the negative spikes are likely due to errors in the data, a common occurrence in the TAQ database.

There are a number of solutions to this problem. The simplest “solution” is to avoid the issue by sampling prices relatively infrequently, which usually means limiting samples to intervals between 1 and 30 minutes (see Bandi & Russell (2008)). Another simple, but again not ideal, method is to filter the data using an MA(1). Transaction data contain a strong negative MA component due to bid-ask bounce, and so RV computed using the errors ($\hat\varepsilon_{i,t}$) from a model,

$r_{i,t} = \theta\varepsilon_{i-1,t} + \varepsilon_{i,t}$  (7.114)

will eliminate much of the bias. A better method to remove the bias is to use an estimator
known as $RV^{AC_1}$ which is similar to a Newey-West estimator,

$RV^{AC_1\,(m)}_t = \sum_{i=1}^{m} r^2_{i,t} + 2\sum_{i=2}^{m} r_{i,t}r_{i-1,t}.$  (7.115)

Figure 7.11: Volatility signature plots for the SPDR RV (top panel) and $RV^{AC_1}$ (bottom panel), with sampling intervals ranging from 5 seconds to 30 minutes. The signature plot for the RV shows a clear trend; based on visual inspection, it would be difficult to justify sampling more frequently than 3 minutes. Unlike the volatility signature plot of the RV, the signature plot of $RV^{AC_1}$ does not monotonically increase with the sampling frequency, and the range of the values is considerably smaller than in the RV signature plot. The decreases for the highest frequencies may be due to errors in the data which are more likely to show up when prices are sampled frequently.

In the case of a constant drift, constant volatility Brownian motion subject to bid-ask bounce,
this estimator can be shown to be unbiased, although it is not (quite) consistent. A more
general class of estimators that use a kernel structure and which are consistent has been
studied in Barndorff-Nielsen, Hansen, Lunde & Shephard (2008).19
19
The Newey-West estimator is a special case of a broad class of estimators known as kernel estimators.
They all share the property that they are weighted sums where the weights are determined by a kernel func-
tion.
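The sketch below illustrates the upward bias from bid-ask bounce and the effect of the $RV^{AC_1}$ correction in equation 7.115 using simulated efficient returns contaminated by noise; the noise standard deviation is a made-up value.

```python
# A sketch of the noise-robust RV^{AC1} estimator in eq. 7.115, applied to returns
# contaminated by simulated bid-ask bounce (eq. 7.112).
import numpy as np

rng = np.random.default_rng(5)
m, sigma, tau = 390, 0.0125, 0.0005
r_star = sigma / np.sqrt(m) * rng.standard_normal(m)     # efficient returns
nu = tau * rng.standard_normal(m + 1)                    # microstructure noise in prices
r = r_star + np.diff(nu)                                 # observed returns

rv = np.sum(r**2)                                        # plain RV, biased upward by m*tau^2
rv_ac1 = np.sum(r**2) + 2 * np.sum(r[1:] * r[:-1])       # RV^{AC1}
print(rv, rv_ac1, sigma**2)
```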

Another problem for realized variance is that prices are not available at regular intervals. This fortunately has a simple solution: last price interpolation. Last price interpolation sets the price at time $t$ to the last observed price $p_\tau$ where $\tau$ is the largest observed time index such that $\tau\le t$. Linearly interpolated prices set the time-$t$ price to $p_t = w p_{\tau_1} + (1-w)p_{\tau_2}$ where $\tau_1$ is the time subscript of the last observed price before $t$ and $\tau_2$ is the time subscript of the first price after time $t$, and the interpolation weight is $w = (\tau_2 - t)/(\tau_2 - \tau_1)$. When a transaction occurs at $t$, $\tau_1 = \tau_2$ and no interpolation is needed. Using a linearly interpolated price, which averages the two closest observed prices weighted by their relative distance in time from $t$, (theoretically) produces a downward bias in RV.
Finally, most markets do not operate 24 hours a day and so the RV cannot be computed when markets are closed. The standard procedure is to use high-frequency returns when available and then use the close-to-open return squared to supplement this estimate,

$RV_t^{(m)} = r^2_{CtO,t} + \sum_{i=1}^{m}\left(p_{i,t} - p_{i-1,t}\right)^2 = r^2_{CtO,t} + \sum_{i=1}^{m} r^2_{i,t}.$  (7.116)

where $r_{CtO,t}$ is the return between the close on day $t-1$ and the market open on day $t$.
Since the overnight return is not measured frequently, the resulting RV must be treated as a random variable (and not an observable). An alternative method to handle the overnight return has been proposed in Hansen & Lunde (2005) and Hansen & Lunde (2006) which weighs the overnight squared return by $\lambda_1$ and the daily realized variance by $\lambda_2$ to produce an estimator with a lower mean-square error,

$\widetilde{RV}_t^{(m)} = \lambda_1 r^2_{CtO,t} + \lambda_2 RV_t^{(m)}.$

7.4.3 Realized Variance of the S&P 500

Returns on S&P 500 Depository Receipts, known as SPiDeRs (AMEX:SPY), will be used to illustrate the gains and pitfalls of RV. Price data were taken from TAQ and include every transaction from January 3, 2006 until July 31, 2008, a total of 649 days. SPDRs track the S&P 500 and are among the most liquid assets in the U.S. market with an average volume of 60 million shares per day. During the 2.5 year sample there were over 100,000 trades on an average day – more than 4 per second when the market is open. TAQ data contain errors and so the data were filtered by matching the daily high and low from the CRSP database. Any prices out of these high-low bands or outside of the usual trading hours of 9.30 – 16.00 were discarded.
The primary tool for examining different Realized Variance estimators is the volatility signature plot.

Definition 7.10 (Volatility Signature Plot). The volatility signature plot displays the time-series average of Realized Variance,

$\overline{RV}^{(m)} = T^{-1}\sum_{t=1}^{T} RV_t^{(m)},$

as a function of the number of samples, $m$. An equivalent representation displays the amount of time, whether in calendar time or tick time (number of trades between observations), along the X-axis.

Figures 7.9 and 7.10 contain plots of the annualized volatility constructed from the RV and $RV^{AC_1}$ for each day in the sample, where estimates have been transformed into annualized volatility to facilitate comparison. Figure 7.9 shows that the 15-second RV is somewhat larger than the RV sampled at 1, 5 or 30 minutes and that the 1- and 5-minute RV are less noisy than the 30-minute RV. These plots provide some evidence that sampling more frequently than 30 minutes may be desirable. Comparing figure 7.10 to figure 7.9, there is a reduction in the scale of the 15-second $RV^{AC_1}$ relative to the 15-second RV. The 15-second RV is heavily influenced by the noise in the data (bid-ask bounce) while the $RV^{AC_1}$ is less affected. The negative values in the 15-second $RV^{AC_1}$ are likely due to errors in the recorded data which may contain a typo or a trade reported at the wrong time. This is a common occurrence in the TAQ database and frequent sampling increases the chances that one of the sampled prices is erroneous.
Figure 7.11 contains the volatility signature plots for RV and $RV^{AC_1}$. These are presented in the original levels of RV to avoid averaging nonlinear transformations of random variables. The dashed horizontal line depicts the usual variance estimator computed from daily returns. There is a striking difference between the two panels. The RV volatility signature plot diverges when sampling more frequently than 3 minutes while the $RV^{AC_1}$ plot is non-monotone, probably due to errors in the price data. $RV^{AC_1}$ appears to allow sampling every 20-30 seconds, or nearly 5 times more often than RV. This is a common finding when comparing $RV^{AC_1}$ to RV across a range of equity data.

7.5 Implied Volatility and VIX


Implied volatility differs from other measures in that it is both market-based and forward-
looking. Implied volatility was originally conceived as the “solution” to the Black-Scholes
options pricing formula when everything except the volatility is known. Recall that the
Black-Scholes formula is derived by assuming that stock prices follow a geometric Brown-
ian motion plus drift,

dSt = µSt dt + σSt dWt (7.117)


where $S_t$ is the time $t$ stock price, $\mu$ is the drift, $\sigma$ is the (constant) volatility and $dW_t$ is a
Wiener process. Under some additional assumptions sufficient to ensure no arbitrage, the

price of a call option can be shown to be

$C(T,K) = S\Phi(d_1) - Ke^{-rT}\Phi(d_2)$  (7.118)

where

$d_1 = \dfrac{\ln\left(S/K\right) + \left(r + \sigma^2/2\right)T}{\sigma\sqrt{T}}$  (7.119)

$d_2 = \dfrac{\ln\left(S/K\right) + \left(r - \sigma^2/2\right)T}{\sigma\sqrt{T}}.$  (7.120)

and where $K$ is the strike price, $T$ is the time to maturity – usually reported in years – $r$ is the risk-free interest rate and $\Phi(\cdot)$ is the normal CDF. Given the stock price $S_t$, time to maturity $T$, the strike price $K$, the risk free rate, $r$, and the volatility, $\sigma$, the price of a call option can be directly calculated. Moreover, since the price of a call option is monotonic in the volatility, the formula can be inverted to express the volatility required to match a known call price. In this case, the implied volatility is

$\sigma^{Implied}_t = g\left(C_t(T,K), S_t, K, T, r\right)$  (7.121)

which is the expected volatility between $t$ and $T$ under the risk-neutral measure (which is the same as under the physical measure when volatility is constant).20
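A minimal sketch of numerically inverting the Black-Scholes formula to obtain the implied volatility in equation 7.121 is shown below; the option price used in the example is made up, while the spot, maturity and rate match the values reported in Figure 7.12.

```python
# A sketch of backing out the Black-Scholes implied volatility in eq. 7.121 by
# numerically inverting the call-price formula.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def bs_call(S, K, T, r, sigma):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def implied_vol(C, S, K, T, r):
    # The call price is monotone in sigma, so a bracketing root-finder suffices
    return brentq(lambda s: bs_call(S, K, T, r, s) - C, 1e-6, 5.0)

# Hypothetical option price of 4.50 for a strike of 85
print(implied_vol(C=4.50, S=83.20, K=85.0, T=28 / 260, r=0.0058))
```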

7.5.0.1 The smile

Under normal conditions, the Black-Scholes implied volatility exhibits a “smile” (higher IV for options away from the money than at the money) or “smirk” (higher IV for out-of-the-money puts). The “smile” or “smirk” arises since actual returns are heavy tailed (“smile”) and skewed (“smirk”) although BSIV is derived under the assumption that prices follow a geometric Brownian motion, and so log returns are assumed to be normal. These patterns reflect the misspecification in the BSIV assumptions. Figure 7.12 contains the BSIV for mini S&P 500 puts on January 23, 2009 which exhibits the typical smile pattern. The x-axis is expressed in terms of moneyness, and so 0 indicates the current price, negative values indicate strikes below the current price (out-of-the-money puts) and positive values the opposite.

7.5.1 Model-Free Volatility


B-S implied volatility suffers from a number of issues:
• Derived under constant volatility: The returns on most asset prices exhibit condi-
tional heteroskedasticity, and time-variation in the volatility of returns will generate
heavy tails which increases the chance of seeing a large stock price move.
20
The value of the function can be determined by numerically inverting the B-S pricing formula.
Figure 7.12: Plot of the Black-Scholes implied volatility “smile” on January 23, 2009 based on mini S&P 500 puts ($S_t = 83.20$, $T = 28/260$, $r = 0.58\%$), with moneyness on the x-axis.

• Leverage effects are ruled out: Leverage, or negative correlation between the price and volatility of a stock, can generate negative skewness. This would produce a larger probability of seeing a large downward movement in a stock price than the B-S model allows.

• No jumps: Jumps are also an empirical fact of most asset prices. Jumps, like time-varying volatility, increase the chance of seeing a large return.

The consequences of these limits are that, contrary to what the model underlying the B-S implies, B-S implied volatilities will not be constant across strike prices for a fixed maturity. Normally B-S implied volatilities follow a “smile”, where options at-the-money (strike close to current price) imply a lower volatility than options deep out of the money (strike larger than current price) and options deep in the money (strike lower than current price). At other times, the pattern of implied volatilities can take the shape of a “frown” – at the money implying a higher volatility than in- or out-of-the-money – or a “smirk”, where the implied volatility of out-of-the-money options is higher than at- or in-the-money options. The various shapes the implied volatility curve can produce are all signs of misspecification of the B-S model. In particular they point to the three missing features described above.

Model-free implied volatility has been developed as an alternative to B-S implied volatility by Britten-Jones & Neuberger (2000) with an important extension to jump processes and empirical implementation details provided by Jiang & Tian (2005). An important result highlighting the usefulness of options prices for computing expectations under the risk-neutral measure can be found in Breeden & Litzenberger (1978). Suppose that the risk-neutral measure $Q$ exists and is unique. Then, under the risk-neutral measure, it must be the case that

$\dfrac{dS_t}{S_t} = \sigma(t,\cdot)\, dW_t$  (7.122)

is a martingale where $\sigma(t,\cdot)$ is a (possibly) time-varying volatility process that may depend on the stock price or other state variables. From this relationship, the price of a call option can be computed as

$C(t,K) = E^Q\left[(S_t - K)^+\right]$  (7.123)

for $t>0$, $K>0$ where the function $(x)^+ = \max(x,0)$. Thus

$C(t,K) = \int_K^\infty (S_t - K)\,\phi_t(S_t)\, dS_t$  (7.124)

where $\phi_t(\cdot)$ is the risk-neutral density. Differentiating with respect to $K$,

$\dfrac{\partial C(t,K)}{\partial K} = -\int_K^\infty \phi_t(S_t)\, dS_t$  (7.125)

and differentiating this expression again with respect to $K$ (note $K$ in the lower integral bound),

$\dfrac{\partial^2 C(t,K)}{\partial K^2} = \phi_t(K)$  (7.126)
and so the risk-neutral density can be recovered from an options pricing function. This result provides a basis for non-parametrically estimating the risk-neutral density from observed options prices (see, for example, Aït-Sahalia & Lo (1998)). Another consequence of this result is that the expected (under $Q$) variation in a stock price over the interval $[t_1,t_2]$ can be recovered from

$E^Q\left[\displaystyle\int_{t_1}^{t_2}\left(\frac{dS_t}{S_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C(t_2,K) - C(t_1,K)}{K^2}\, dK.$  (7.127)

This expression cannot be directly implemented to recover the expected volatility since it would require a continuum of strike prices.
Equation 7.127 assumes that the risk free rate is 0. When it is not, a similar result can be derived using the forward price

$E^F\left[\displaystyle\int_{t_1}^{t_2}\left(\frac{dF_t}{F_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C^F(t_2,K) - C^F(t_1,K)}{K^2}\, dK$  (7.128)

where $F$ is the forward probability measure – that is, the probability measure where the forward price is a martingale – and $C^F(\cdot,\cdot)$ is used to denote that this option is defined on the forward price. Additionally, when $t_1$ is 0, as is usually the case, the expression simplifies to

$E^F\left[\displaystyle\int_{0}^{t}\left(\frac{dF_t}{F_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C^F(t,K) - (F_0 - K)^+}{K^2}\, dK.$  (7.129)

A number of important caveats are needed for employing this relationship to compute MFIV from option prices:

• Spot rather than forward prices. Because spot prices are usually used rather than forwards, the dependent variable needs to be redefined. If interest rates are non-stochastic, then define $B(0,T)$ to be the price of a bond today that pays \$1 at time $T$. Thus, $F_0 = S_0/B(0,T)$ is the forward price and $C^F(T,K) = C(T,K)/B(0,T)$ is the forward option price. With the assumption of non-stochastic interest rates, the model-free implied volatility can be expressed

$E^F\left[\displaystyle\int_0^t\left(\frac{dS_t}{S_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C(t,K)/B(0,T) - \left(S_0/B(0,T) - K\right)^+}{K^2}\, dK$  (7.130)

or equivalently, using a change of variables, as

$E^F\left[\displaystyle\int_0^t\left(\frac{dS_t}{S_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C(t,K/B(0,T)) - (S_0 - K)^+}{K^2}\, dK.$  (7.131)

• Discretization. Because only finitely many options prices are available, the integral must be approximated using a discrete grid (a short numerical sketch follows this list). Thus the approximation

$E^F\left[\displaystyle\int_0^t\left(\frac{dS_t}{S_t}\right)^2\right] = 2\displaystyle\int_0^\infty \frac{C(t,K/B(0,T)) - (S_0 - K)^+}{K^2}\, dK$  (7.132)

$\approx \displaystyle\sum_{m=1}^{M}\left[g(T,K_m) + g(T,K_{m-1})\right](K_m - K_{m-1})$  (7.133)

is used where

$g(T,K) = \dfrac{C(t,K/B(0,T)) - (S_0 - K)^+}{K^2}.$  (7.134)

If the option tree is rich, this should not pose a significant issue. For option trees on individual firms, more study (for example using a Monte Carlo) may be needed to ascertain whether the MFIV is a good estimate of the volatility under the forward measure.

• Maximum and minimum strike prices. That the integral cannot be implemented from 0 to $\infty$ produces a downward bias in the implied volatility, where the bias captures the missed volatility in the upper and lower tails. For rich options trees such as the S&P 500, this should not be a major issue.
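The sketch below illustrates the discretization in equation 7.133 on a hypothetical grid of Black-Scholes call prices with a zero interest rate, so that $B(0,T)=1$ and the recovered quantity should be close to $\sigma^2 T$; it is a toy check under these assumptions, not an implementation of any exchange's methodology.

```python
# A sketch of the discretized model-free implied variance in eq. 7.133, using a
# hypothetical grid of Black-Scholes call prices (r = 0, so B(0,T) = 1).
import numpy as np
from scipy.stats import norm

S0, r, T, sigma = 100.0, 0.0, 28 / 260, 0.25

def bs_call(S, K, T, r, s):
    d1 = (np.log(S / K) + (r + 0.5 * s**2) * T) / (s * np.sqrt(T))
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d1 - s * np.sqrt(T))

K = np.linspace(40.0, 250.0, 500)                                    # strike grid
g = (bs_call(S0, K, T, r, sigma) - np.maximum(S0 - K, 0.0)) / K**2   # eq. 7.134
mfiv = np.sum((g[1:] + g[:-1]) * np.diff(K))                         # eq. 7.133
print(mfiv, sigma**2 * T)                                            # should be close
```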

7.5.2 VIX

The VIX – Volatility Index – is a volatility measure produced by the Chicago Board Options Exchange (CBOE). It is computed using a “model-free” like estimator which uses both call and put prices.21 The VIX is computed according to

$\sigma^2 = \dfrac{2}{T}\displaystyle\sum_{i=1}^{N} \frac{\Delta K_i}{K_i^2} e^{rT} Q(K_i) - \frac{1}{T}\left(\frac{F_0}{K_0} - 1\right)^2$  (7.135)

where $T$ is the time to expiration of the options used, $F_0$ is the forward price which is computed from index option prices, $K_i$ is the strike of the $i$th out-of-the-money option, $\Delta K_i = (K_{i+1} - K_{i-1})/2$ is half of the distance of the interval surrounding the option with a strike price of $K_i$, $K_0$ is the strike of the option immediately below the forward level, $F_0$, $r$ is the risk free rate and $Q(K_i)$ is the mid-point of the bid and ask for the call or put used at strike $K_i$. The forward index price is extracted using put-call parity as $F_0 = K_0 + e^{rT}(C_0 - P_0)$ where $K_0$ is the strike price where the price difference between put and call is smallest, and $C_0$ and $P_0$ are the respective call and put prices at this node.
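A simplified sketch of the single-maturity calculation in equation 7.135 follows; the strike grid, mid quotes, forward level and $K_0$ are made-up numbers, and the averaging of the put and call at $K_0$ used in the official CBOE procedure is ignored.

```python
# A hedged sketch of the VIX-style variance in eq. 7.135 for a single maturity;
# the inputs below are illustrative numbers, not CBOE data.
import numpy as np

T, r = 30 / 365, 0.0058
strikes = np.array([80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 115.0, 120.0])
quotes = np.array([0.15, 0.35, 0.80, 1.90, 4.10, 1.70, 0.65, 0.25, 0.10])  # OTM mids
F0, K0 = 99.5, 95.0                                     # forward level and strike just below it

dK = np.empty_like(strikes)                             # half the distance between adjacent strikes
dK[1:-1] = (strikes[2:] - strikes[:-2]) / 2
dK[0] = strikes[1] - strikes[0]
dK[-1] = strikes[-1] - strikes[-2]

sigma2 = (2 / T) * np.sum(dK / strikes**2 * np.exp(r * T) * quotes) - (1 / T) * (F0 / K0 - 1) ** 2
print(np.sqrt(sigma2))                                  # annualized volatility implied by the grid
```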
The VIX uses options at two time horizons that attempt to bracket the 30-days (for ex-
ample 15- and 45-days) with the caveat that options with 8 or fewer days to expiration are
not used. This is to avoid pricing anomalies since the very short term options market is not
liquid. Immediately subsequent to moving the time horizon, the time to expiration will
not bracket 30-days (for example, after a move the times to expiration may be 37 and 65
days). More details on the implementation of the VIX can be found in the CBOE whitepa-
per (CBOE 2003).

7.5.3 Empirical Relationships

The daily VIX series from January 1, 1999 until December 31, 2008 is plotted in figure 7.13 against an estimated “backward looking” volatility, as estimated by a TARCH(1,1,1) (top panel), and a 22-day forward-looking moving average computed as

$\sigma^{MA}_t = \sqrt{\dfrac{252}{22}\displaystyle\sum_{i=0}^{21} r^2_{t+i}}.$

21 The VIX is based exclusively on out-of-the-money prices, so calls are used for strikes above the current price and puts are used for strikes below the current price.
Figure 7.13: Plots of the VIX against a TARCH-based estimate of the volatility (top panel) and a 22-day forward moving average (bottom panel). The VIX is consistently above both measures reflecting the presence of a risk premium that compensates for time-varying volatility and/or jumps in the market return.

The VIX is consistently, but not uniformly, higher than the other two series. This highlights
both a feature and a drawback of using a measure of the volatility computed under the
risk-neutral measure: it will contain a (possibly) time-varying risk premium. This risk pre-
mium will capture the propensity of the volatility to change (volatility of volatility) and any
compensated jump risk.

7.A Kurtosis of an ARCH(1)

The necessary steps to derive the kurtosis of an ARCH(1) process are

$$
\begin{aligned}
E[\varepsilon_t^4] &= E[E_{t-1}[\varepsilon_t^4]] \\
&= E[3(\omega + \alpha_1\varepsilon_{t-1}^2)^2] \\
&= 3E[(\omega + \alpha_1\varepsilon_{t-1}^2)^2] \\
&= 3E[\omega^2 + 2\omega\alpha_1\varepsilon_{t-1}^2 + \alpha_1^2\varepsilon_{t-1}^4] \\
&= 3\left(\omega^2 + 2\omega\alpha_1 E[\varepsilon_{t-1}^2] + \alpha_1^2 E[\varepsilon_{t-1}^4]\right) \\
&= 3\omega^2 + 6\omega\alpha_1 E[\varepsilon_{t-1}^2] + 3\alpha_1^2 E[\varepsilon_{t-1}^4].
\end{aligned}
\tag{7.136}
$$

Using $\mu_4$ to represent the expectation of the fourth power of $\varepsilon_t$ ($\mu_4 = E[\varepsilon_t^4]$),

$$
\begin{aligned}
E[\varepsilon_t^4] - 3\alpha_1^2 E[\varepsilon_{t-1}^4] &= 3\omega^2 + 6\omega\alpha_1 E[\varepsilon_{t-1}^2] \\
\mu_4 - 3\alpha_1^2\mu_4 &= 3\omega^2 + 6\omega\alpha_1\bar\sigma^2 \\
\mu_4(1 - 3\alpha_1^2) &= 3\omega^2 + 6\omega\alpha_1\bar\sigma^2 \\
\mu_4 &= \frac{3\omega^2 + 6\omega\alpha_1\bar\sigma^2}{1 - 3\alpha_1^2} \\
\mu_4 &= \frac{3\omega^2 + 6\omega\alpha_1\frac{\omega}{1-\alpha_1}}{1 - 3\alpha_1^2} \\
\mu_4 &= \frac{3\omega^2\left(1 + 2\frac{\alpha_1}{1-\alpha_1}\right)}{1 - 3\alpha_1^2} \\
\mu_4 &= \frac{3\omega^2(1 + \alpha_1)}{(1 - 3\alpha_1^2)(1 - \alpha_1)}.
\end{aligned}
\tag{7.137}
$$

This derivation makes use of the same principles as the intuitive proof and the identity that $\bar\sigma^2 = \omega/(1-\alpha_1)$. The final form highlights two important issues: first, $\mu_4$ (and thus the kurtosis) is only finite if $1 - 3\alpha_1^2 > 0$, which requires that $\alpha_1 < \sqrt{\frac{1}{3}} \approx .577$, and second, the kurtosis, $\kappa = \frac{E[\varepsilon_t^4]}{E[\varepsilon_t^2]^2} = \frac{\mu_4}{\bar\sigma^4}$, is always greater than 3 since

$$
\begin{aligned}
\kappa &= \frac{E[\varepsilon_t^4]}{E[\varepsilon_t^2]^2} \\
&= \frac{\dfrac{3\omega^2(1+\alpha_1)}{(1-3\alpha_1^2)(1-\alpha_1)}}{\dfrac{\omega^2}{(1-\alpha_1)^2}} \\
&= \frac{3(1-\alpha_1)(1+\alpha_1)}{1-3\alpha_1^2} \\
&= \frac{3(1-\alpha_1^2)}{1-3\alpha_1^2} > 3.
\end{aligned}
\tag{7.138}
$$

Finally, the variance of $\varepsilon_t^2$ can be computed noting that for any variable $y$, $V[y] = E[y^2] - E[y]^2$, and so

$$
\begin{aligned}
V[\varepsilon_t^2] &= E[\varepsilon_t^4] - E[\varepsilon_t^2]^2 \\
&= \frac{3\omega^2(1+\alpha_1)}{(1-3\alpha_1^2)(1-\alpha_1)} - \frac{\omega^2}{(1-\alpha_1)^2} \\
&= \frac{3\omega^2(1+\alpha_1)(1-\alpha_1)^2 - \omega^2(1-3\alpha_1^2)(1-\alpha_1)}{(1-3\alpha_1^2)(1-\alpha_1)(1-\alpha_1)^2} \\
&= \frac{3\omega^2(1+\alpha_1)(1-\alpha_1) - \omega^2(1-3\alpha_1^2)}{(1-3\alpha_1^2)(1-\alpha_1)^2} \\
&= \frac{3\omega^2(1-\alpha_1^2) - \omega^2(1-3\alpha_1^2)}{(1-3\alpha_1^2)(1-\alpha_1)^2} \\
&= \frac{3\omega^2\left[(1-\alpha_1^2) - (\frac{1}{3}-\alpha_1^2)\right]}{(1-3\alpha_1^2)(1-\alpha_1)^2} \\
&= \frac{2\omega^2}{(1-3\alpha_1^2)(1-\alpha_1)^2} \\
&= 2\left(\frac{\omega}{1-\alpha_1}\right)^2\frac{1}{1-3\alpha_1^2} \\
&= \frac{2\bar\sigma^4}{1-3\alpha_1^2}.
\end{aligned}
\tag{7.139}
$$

The variance of the squared returns depends on the unconditional level of the variance, σ̄2
and the innovation term (α1 ) squared.

7.B Kurtosis of a GARCH(1,1)

First note that $E[\sigma_t^2 - \varepsilon_t^2] = 0$ so $V[\sigma_t^2 - \varepsilon_t^2]$ is just $E[(\sigma_t^2 - \varepsilon_t^2)^2]$. This can be expanded to $E[\varepsilon_t^4] - 2E[\varepsilon_t^2\sigma_t^2] + E[\sigma_t^4]$ which can be shown to be $2E[\sigma_t^4]$ since

$$
\begin{aligned}
E[\varepsilon_t^4] &= E[E_{t-1}[e_t^4\sigma_t^4]] \\
&= E[E_{t-1}[e_t^4]\sigma_t^4] \\
&= E[3\sigma_t^4] \\
&= 3E[\sigma_t^4]
\end{aligned}
\tag{7.140}
$$

and

$$
\begin{aligned}
E[\varepsilon_t^2\sigma_t^2] &= E[E_{t-1}[e_t^2\sigma_t^2]\sigma_t^2] \\
&= E[\sigma_t^2\sigma_t^2] \\
&= E[\sigma_t^4]
\end{aligned}
\tag{7.141}
$$

so

$$
\begin{aligned}
E[\varepsilon_t^4] - 2E[\varepsilon_t^2\sigma_t^2] + E[\sigma_t^4] &= 3E[\sigma_t^4] - 2E[\sigma_t^4] + E[\sigma_t^4] \\
&= 2E[\sigma_t^4].
\end{aligned}
\tag{7.142}
$$

The only remaining step is to complete the tedious derivation of the expectation of this
fourth power,

$$
\begin{aligned}
E[\sigma_t^4] &= E[(\sigma_t^2)^2] \\
&= E[(\omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2)^2] \\
&= E[\omega^2 + 2\omega\alpha_1\varepsilon_{t-1}^2 + 2\omega\beta_1\sigma_{t-1}^2 + 2\alpha_1\beta_1\varepsilon_{t-1}^2\sigma_{t-1}^2 + \alpha_1^2\varepsilon_{t-1}^4 + \beta_1^2\sigma_{t-1}^4] \\
&= \omega^2 + 2\omega\alpha_1 E[\varepsilon_{t-1}^2] + 2\omega\beta_1 E[\sigma_{t-1}^2] + 2\alpha_1\beta_1 E[\varepsilon_{t-1}^2\sigma_{t-1}^2] + \alpha_1^2 E[\varepsilon_{t-1}^4] + \beta_1^2 E[\sigma_{t-1}^4]
\end{aligned}
\tag{7.143}
$$

Noting that

• $E[\varepsilon_{t-1}^2] = E[E_{t-2}[\varepsilon_{t-1}^2]] = E[E_{t-2}[e_{t-1}^2\sigma_{t-1}^2]] = E[\sigma_{t-1}^2 E_{t-2}[e_{t-1}^2]] = E[\sigma_{t-1}^2] = \bar\sigma^2$

• $E[\varepsilon_{t-1}^2\sigma_{t-1}^2] = E[E_{t-2}[\varepsilon_{t-1}^2]\sigma_{t-1}^2] = E[E_{t-2}[e_{t-1}^2\sigma_{t-1}^2]\sigma_{t-1}^2] = E[E_{t-2}[e_{t-1}^2]\sigma_{t-1}^2\sigma_{t-1}^2] = E[\sigma_{t-1}^4]$

• $E[\varepsilon_{t-1}^4] = E[E_{t-2}[\varepsilon_{t-1}^4]] = E[E_{t-2}[e_{t-1}^4\sigma_{t-1}^4]] = 3E[\sigma_{t-1}^4]$

the final expression for $E[\sigma_t^4]$ can be arrived at

$$
\begin{aligned}
E[\sigma_t^4] &= \omega^2 + 2\omega\alpha_1 E[\varepsilon_{t-1}^2] + 2\omega\beta_1 E[\sigma_{t-1}^2] + 2\alpha_1\beta_1 E[\varepsilon_{t-1}^2\sigma_{t-1}^2] + \alpha_1^2 E[\varepsilon_{t-1}^4] + \beta_1^2 E[\sigma_{t-1}^4] \\
&= \omega^2 + 2\omega\alpha_1\bar\sigma^2 + 2\omega\beta_1\bar\sigma^2 + 2\alpha_1\beta_1 E[\sigma_{t-1}^4] + 3\alpha_1^2 E[\sigma_{t-1}^4] + \beta_1^2 E[\sigma_{t-1}^4].
\end{aligned}
\tag{7.144}
$$

$E[\sigma_t^4]$ can be solved for (replacing $E[\sigma_t^4]$ with $\mu_4$),

$$
\begin{aligned}
\mu_4 &= \omega^2 + 2\omega\alpha_1\bar\sigma^2 + 2\omega\beta_1\bar\sigma^2 + 2\alpha_1\beta_1\mu_4 + 3\alpha_1^2\mu_4 + \beta_1^2\mu_4 \\
\mu_4 - 2\alpha_1\beta_1\mu_4 - 3\alpha_1^2\mu_4 - \beta_1^2\mu_4 &= \omega^2 + 2\omega\alpha_1\bar\sigma^2 + 2\omega\beta_1\bar\sigma^2 \\
\mu_4(1 - 2\alpha_1\beta_1 - 3\alpha_1^2 - \beta_1^2) &= \omega^2 + 2\omega\alpha_1\bar\sigma^2 + 2\omega\beta_1\bar\sigma^2 \\
\mu_4 &= \frac{\omega^2 + 2\omega\alpha_1\bar\sigma^2 + 2\omega\beta_1\bar\sigma^2}{1 - 2\alpha_1\beta_1 - 3\alpha_1^2 - \beta_1^2}
\end{aligned}
\tag{7.145}
$$

finally substituting $\bar\sigma^2 = \omega/(1 - \alpha_1 - \beta_1)$ and returning to the original derivation,

$E[\varepsilon_t^4] = \dfrac{3\omega^2(1 + \alpha_1 + \beta_1)}{(1 - \alpha_1 - \beta_1)(1 - 2\alpha_1\beta_1 - 3\alpha_1^2 - \beta_1^2)}$  (7.146)

and the kurtosis, $\kappa = \frac{E[\varepsilon_t^4]}{E[\varepsilon_t^2]^2} = \frac{3\mu_4}{\bar\sigma^4}$, which simplifies to

$\kappa = \dfrac{3(1 + \alpha_1 + \beta_1)(1 - \alpha_1 - \beta_1)}{1 - 2\alpha_1\beta_1 - 3\alpha_1^2 - \beta_1^2} > 3.$  (7.147)

Exercises
Exercise 7.1. Suppose we model log-prices at time $t$, written $p_t$, as an ARCH(1) process

$p_t|\mathcal{F}_{t-1} \sim N(p_{t-1}, \sigma_t^2),$

where $\mathcal{F}_t$ denotes the information up to and including time $t$ and

$\sigma_t^2 = \alpha + \beta(p_{t-1} - p_{t-2})^2.$

i. Is $p_t$ a martingale?

ii. What is $E\left[\sigma_t^2\right]$?

iii. Calculate $\mathrm{Cov}\left[(p_t - p_{t-1})^2, (p_{t-s} - p_{t-1-s})^2\right]$ for $s > 0$.

iv. Comment on the importance of this result from a practical perspective.

v. How do you use a likelihood function to estimate an ARCH model?

vi. How can the ARCH(1) model be generalized to be more empirically realistic models
of the innovations in price processes in financial economics?

vii. In the ARCH(1) case, what can you find out about the properties of

pt +s |Ft −1 ,

for s > 0, i.e. the multistep ahead forecast of prices?

viii. Why are Bollerslev-Wooldridge standard errors important when testing coefficients
in ARCH models?

Exercise 7.2. Derive explicit relationships between the parameters of an APARCH(1,1,1)


and

i. ARCH(1)

ii. GARCH(1,1)

iii. AVGARCH(1,1)

iv. TARCH(1,1,1)

v. GJR-GARCH(1,1,1)

Exercise 7.3. Consider the following GJR-GARCH process,

rt = ρrt −1 + εt
εt = σt e t
σ2t = ω + αε2t −1 + γε2t −1 I[εt −1 <0] + β σ2t −1
e t ∼ N (0, 1)
i.i.d.

where E t [·] = E[·|Ft ] is the time t conditional expectation and Vt [·] = V[·|Ft ] is the time t
conditional variance.

i. What conditions are necessary for this process to be covariance stationary?


Assume these conditions hold in the remaining questions. Note: If you cannot answer
one or more of these questions for an arbitrary γ, you can assume that γ = 0 and
receive partial credit.

ii. What is E[rt +1 ]?

iii. What is E t [rt +1 ]?

iv. What is V[rt +1 ]?

v. What is Vt [rt +1 ]?

vi. What is Vt [rt +2 ]?

Exercise 7.4. Let rt follow a GARCH process

rt = σt e t
σ2t = ω + αrt2−1 + β σ2t −1
e t ∼ N (0, 1)
i.i.d.

i. What are the values of the following quantities?

(a) E[rt +1 ]
(b) E t [rt +1 ]
(c) V[rt +1 ]
(d) Vt [rt +1 ]
(e) ρ1

ii. What is E[(rt2 − σ̄2 )(rt2−1 − σ̄2 )]

iii. Describe the h -step ahead forecast from this model.



Exercise 7.5. Let rt follow a ARCH process

rt = σt e t
σ2t = ω + α1 rt2−1
e t ∼ N (0, 1)
i.i.d.

i. What are the values of the following quantities?

(a) E[rt +1 ]
(b) E t [rt +1 ]
(c) V[rt +1 ]
(d) Vt [rt +1 ]
(e) ρ1

ii. What is E[(rt2 − σ̄2 )(rt2−1 − σ̄2 )] Hint: Think about the AR duality.

iii. Describe the h -step ahead forecast from this model.

Exercise 7.6. Consider an EGARCH(1,1,1) model:

$\ln\sigma_t^2 = \omega + \alpha_1\left(|e_{t-1}| - \sqrt{\dfrac{2}{\pi}}\right) + \gamma_1 e_{t-1} + \beta_1\ln\sigma_{t-1}^2$

where $e_t \stackrel{i.i.d.}{\sim} N(0,1)$.

i. What are the important conditions for stationarity in this process?

ii. What is the one-step-ahead forecast of σ2t (E t σ2t +1 )?


 

iii. What is the most you can say about the two-step-ahead forecast of σ2t (E t σ2t +2 )?
 

Exercise 7.7. Answer the following questions:

i. Describe three fundamentally different procedures to estimate the volatility over some interval. What are the strengths and weaknesses of each?

ii. Why is Realized Variance useful when evaluating a volatility model?

iii. What considerations are important when computing Realized Variance?

iv. Why does the Black-Scholes implied volatility vary across strikes?

Exercise 7.8. Consider a general volatility specification for an asset return $r_t$:

$r_t|\mathcal{F}_{t-1} \sim N(0, \sigma_t^2)$

and let $e_t \equiv \dfrac{r_t}{\sigma_t}$, so $e_t|\mathcal{F}_{t-1} \stackrel{i.i.d.}{\sim} N(0,1)$.

i. Find the conditional kurtosis of the returns:

$\mathrm{Kurt}_{t-1}[r_t] \equiv \dfrac{E_{t-1}\left[(r_t - E_{t-1}[r_t])^4\right]}{(V_{t-1}[r_t])^2}$

ii. Show that if $V\left[\sigma_t^2\right] > 0$, then the unconditional kurtosis of the returns,

$\mathrm{Kurt}[r_t] \equiv \dfrac{E\left[(r_t - E[r_t])^4\right]}{(V[r_t])^2}$

is greater than 3.

iii. Find the conditional skewness of the returns:

$\mathrm{Skew}_{t-1}[r_t] \equiv \dfrac{E_{t-1}\left[(r_t - E_{t-1}[r_t])^3\right]}{(V_{t-1}[r_t])^{3/2}}$

iv. Find the unconditional skewness of the returns:

$\mathrm{Skew}[r_t] \equiv \dfrac{E\left[(r_t - E[r_t])^3\right]}{(V[r_t])^{3/2}}$

Exercise 7.9. Answer the following questions:

i. Describe three fundamentally different procedures to estimate the volatility over some
interval. What are the strengths and weaknesses of each?

ii. Why does the Black-Scholes implied volatility vary across strikes?

iii. Consider the following GJR-GARCH process,

rt = µ + ρrt −1 + εt
εt = σt e t
σ2t = ω + αε2t −1 + γε2t −1 I[εt −1 <0] + β σ2t −1
e t ∼ N (0, 1)
i.i.d.

where E t [·] = E[·|Ft ] is the time t conditional expectation and Vt [·] = V[·|Ft ] is the
time t conditional variance.

(a) What conditions are necessary for this process to be covariance stationary?

Assume these conditions hold in the remaining questions.


(b) What is E[rt +1 ]?
(c) What is E t [rt +1 ]?
(d) What is E t [rt +2 ]?
(e) What is V[rt +1 ]?
(f) What is Vt [rt +1 ]?
(g) What is Vt [rt +2 ]?

Exercise 7.10. Answer the following questions about variance estimation.

i. What is Realized Variance?

ii. How is Realized Variance estimated?

iii. Describe two models which are appropriate for modeling Realized Variance.

iv. What is an Exponential Weighted Moving Average (EWMA)?

v. Suppose an ARCH model for the conditional variance of daily returns was fit

rt +1 = µ + σt +1 e t +1
σ2t +1 = ω + α1 ε2t + α2 ε2t −1
e t ∼ N (0, 1)
i.i.d.

What are the forecasts for t + 1, t + 2 and t + 3 given the current (time t ) information
set?

vi. Suppose an EWMA was used instead for the model of conditional variance with smoothing parameter λ = .94. What are the forecasts for t + 1, t + 2 and t + 3 given the current (time t) information set?

vii. Compare the ARCH(2) and EWMA forecasts when the forecast horizon is large (e.g. $E_t\left[\sigma^2_{t+h}\right]$ for large h).
 

viii. What is VIX?


Chapter 8

Value-at-Risk, Expected Shortfall and Density


Forecasting

Note: The primary reference for these notes is Gourieroux & Jasiak (2009), although it is fairly
technical. An alternative and less technical textbook treatment can be found in Christof-
fersen (2003) while a comprehensive and technical treatment can be found in McNeil, Frey
& Embrechts (2005).
The American Heritage dictionary, Fourth Edition, defines risk as “the possibility of suf-
fering harm or loss; danger”. In finance, harm or loss has a specific meaning: decreases in
the value of a portfolio. This chapter provides an overview of three methods used to as-
sess the riskiness of a portfolio: Value-at-Risk (VaR), Expected Shortfall, and modeling the
entire density of a return.

8.1 Defining Risk

Portfolios are exposed to many classes of risk. These six categories represent an overview
of the risk factors which may affect a portfolio.

8.1.1 The Types of risk

Market Risk contains all uncertainty about the future price of an asset. For example, changes
in the share price of IBM due to earnings news represent market risk.

Liquidity Risk complements market risk by measuring the extra loss involved if a position
must be rapidly changed. For example, if a fund wished to sell 20,000,000 shares of
IBM on a single day (typical daily volume 10,000,000), this sale would be expected to
have a substantial effect on the price. Liquidity risk is distinct from market risk since
it represents a transitory distortion due to transaction pressure.

Credit Risk, also known as default risk, covers cases where a 2nd party is unable to pay
per previously agreed to terms. Holders of corporate bonds are exposed to credit risk
since the bond issuer may not be able to make some or all of the scheduled coupon
payments.

Counterparty Risk generalizes credit risk to instruments other than bonds and represents the chance that the counterparty to a transaction, for example the underwriter of an option contract, will be unable to complete the transaction at expiration. Counterparty risk was a major factor in the financial crisis of 2008 where the protection offered in Credit Default Swaps (CDS) was not available when the underlying assets defaulted.

Model Risk represents an econometric form of risk which captures uncertainty over the
correct form of the model used to assess the price of an asset and/or the asset’s risk-
iness. Model risk is particularly important when prices of assets are primarily deter-
mined by a model rather than in a liquid market, as was the case in the Mortgage
Backed Securities (MBS) market in 2007.

Estimation Risk captures an aspect of risk that is present whenever econometric models are used to manage risk since all models contain estimated parameters. Moreover, estimation risk is distinct from model risk since it is present even if a model is correctly specified. In many practical applications, parameter estimation error can result in a
substantial misstatement of risk. Model and estimation risk are always present and
are generally substitutes – parsimonious models are increasingly likely to be misspec-
ified but have less parameter estimation uncertainty.

This chapter deals exclusively with market risk. Liquidity, credit risk and counterparty risk
all require special treatment beyond the scope of this course.

8.2 Value-at-Risk (VaR)


The most common reported measure of risk is Value-at-Risk (VaR). The VaR of a portfolio
is the amount risked over some period of time with a fixed probability. VaR provides a
more sensible measure of the risk of the portfolio than variance since it focuses on losses,
although VaR is not without its own issues. These will be discussed in more detail in the
context of coherent risk measures (section 8.5).

8.2.1 Defined
The VaR of a portfolio measures the value (in £, $, €,¥, etc.) which an investor would lose
with some small probability, usually between 1 and 10%, over a selected period of time.
Because the VaR represents a hypothetical loss, it is usually a positive number.

Figure 8.1: A graphical representation of Value-at-Risk. The VaR is represented by the magnitude of the horizontal bar and measures the distance between the value of the portfolio in the current period and its α-quantile. In this example, α = 5% and returns are N(.001, .015²).

Definition 8.1 (Value-at-Risk). The α Value-at-Risk (V a R ) of a portfolio is defined as the


largest number such that the probability that the loss in portfolio value over some period
of time is greater than the V a R is α,

P r (R t < −V a R ) = α (8.1)

where R t = Wt − Wt −1 is the change in the value of the portfolio, Wt and the time span
depends on the application (e.g. one day or two weeks).

For example, if an investor had a portfolio value of £10,000,000 and had a daily portfolio return which was N(.001, .015²) (annualized mean of 25%, volatility of 23.8%), the daily α Value-at-Risk of this portfolio, using α = 5%, would be

£10,000,000 × (−.001 − .015 Φ⁻¹(.05)) = £236,728.04


where Φ(·) is the CDF of a standard normal (and so Φ−1 (·) is the inverse CDF). This expres-
sion may appear backward; it is not. The negative sign on the mean indicates that increases

in the mean decrease the VaR and the negative sign on the standard deviation term indi-
cates that increases in the volatility raise the VaR since for α < .5, Φ−1 (α) < 0. It is often
more useful to express Value-at-Risk as a percentage of the portfolio value – e.g. 1.5% –
rather than in units of currency since it removes the size of the portfolio from the reported
number.
Definition 8.2 (Percentage Value-at-Risk). The α percentage Value-at-Risk (%V a R ) of a
portfolio is defined as the largest return such that the probability that the return on the
portfolio over some period of time is less than -%V a R is α,

P r (rt < −%V a R ) = α (8.2)

where rt is the percentage return on the portfolio. %V a R can be equivalently defined as


%V a R = V a R /Wt −1 .
Since percentage VaR and VaR only differ by the current value of the portfolio, the re-
mainder of the chapter will focus on percentage VaR in place of VaR.

8.2.1.1 The relationship between VaR and quantiles

Understanding that VaR and quantiles are fundamentally related provides a key insight into
computing VaR. If r represents the return on a portfolio , the α-VaR is −1 × qα (r ) where
qα (r ) is the α-quantile of the portfolio’s return. In most cases α is chosen to be some small
quantile – 1, 5 or 10% – and so qα (r ) is a negative number, and VaR should generally be
positive.1

8.2.2 Conditional Value-at-Risk


Most applications of VaR are used to control for risk over short horizons and require a con-
ditional Value-at-Risk estimate that employs information up to time t to produce a VaR for
some time period t + h .
Definition 8.3 (Conditional Value-at-Risk). The conditional α Value-at-Risk is defined as

P r (rt +1 < −V a R t +1|t |Ft ) = α (8.3)

where $r_{t+1} = \frac{W_{t+1} - W_t}{W_t}$ is the time $t+1$ return on a portfolio. Since $t$ is an arbitrary measure of time, $t+1$ also refers to an arbitrary unit of time (day, two-weeks, 5 years, etc.)
Most conditional models for VaR forecast the density directly, although some only at-
tempt to estimate the required quantile of the time t + 1 return distribution. Five standard
methods will be presented in the order of the restrictiveness of the assumptions needed to
justify the method, from strongest to weakest.
1
If the VaR is negative, either the portfolio has no risk, the portfolio manager has unbelievable skill or most
likely the model used to compute the VaR is badly misspecified.

8.2.2.1 RiskMetrics©

The RiskMetrics group has produced a surprisingly simple yet robust method for producing
conditional VaR. The basic structure of the RiskMetrics model is a restricted GARCH(1,1),
where α + β = 1 and ω = 0, is used to model the conditional variance,

σ2t +1 = (1 − λ)rt2 + λσ2t . (8.4)

where rt is the (percentage) return on the portfolio in period t . In the RiskMetrics specifi-
cation σ2t +1 follows an exponentially weighted moving average which places weight λ j (1−λ)
on rt2− j .2 This model includes no explicit mean model for returns and is only applicable to
assets with returns that are close to zero or when the time horizon is short (e.g. one day to
one month). The VaR is derived from the α-quantile of a normal distribution,

V a R t +1 = −σt +1 Φ−1 (α) (8.5)


where Φ−1 (·) is the inverse normal CDF. The attractiveness of the RiskMetrics model is that
there are no parameters to estimate; λ is fixed at .94 for daily data (.97 for monthly).3 Addi-
tionally, this model can be trivially extended to portfolios using a vector-matrix switch by
replacing the squared return with the outer product of a vector of returns, rt r0t , and σ2t +1
with a matrix, Σt +1 . The disadvantages of the procedure are that the parameters aren’t es-
timated (which was also an advantage), it cannot be modified to incorporate a leverage
effect, and the VaR follows a random walk since λ + (1 − λ) = 1.
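A minimal sketch of the RiskMetrics recursion in equation 8.4 and the resulting VaR in equation 8.5 follows; the return series is simulated and the initialization with the sample variance is a convenience assumption.

```python
# A sketch of the RiskMetrics EWMA recursion (eq. 8.4) and the one-step VaR (eq. 8.5),
# applied to a simulated return series.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
returns = 0.01 * rng.standard_normal(1000)              # placeholder portfolio returns
lam, alpha = 0.94, 0.05

sigma2 = np.empty(len(returns) + 1)
sigma2[0] = returns.var()                               # initialize with the sample variance
for t in range(len(returns)):
    sigma2[t + 1] = (1 - lam) * returns[t] ** 2 + lam * sigma2[t]

var_forecast = -np.sqrt(sigma2[-1]) * norm.ppf(alpha)   # 5% one-step-ahead VaR (as a return)
print(var_forecast)
```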

8.2.2.2 Parametric ARCH Models

Fully parametric ARCH-family models provide a natural method to compute VaR. For sim-
plicity, only a constant mean GARCH(1,1) will be described, although the mean could be
described using other time-series models and the variance evolution could be specified as
any ARCH-family member.4

rt +1 = µ + εt +1
σ2t +1 = ω + γ1 ε2t + β1 σ2t
εt +1 = σt +1 e t +1
e t +1 ∼ f (0, 1)
i.i.d.

2
An EWMA is similar to a traditional moving average although the EWMA places relatively more weight
on recent observations than on observation in the distant past.
3
The suggested coefficient for λ are based on a large study of the RiskMetrics model across different asset
classes.
4
The use of α1 in ARCH models has been avoided to avoid confusion with the α in the VaR.

where f (0, 1) is used to indicate that the distribution of innovations need not be normal
but must have mean 0 and variance 1. For example, f could be a standardized Student’s
t with ν degrees of freedom or Hansen’s skewed t with degree of freedom parameter ν
and asymmetry parameter λ. The parameters of the model are estimated using maximum
likelihood and the time t conditional VaR is

V a R t +1 = −µ̂ − σ̂t +1 Fα−1


where Fα−1 is the α-quantile of the distribution of e t +1 . Fully parametric ARCH models pro-
vide substantial flexibility for modeling the conditional mean and variance as well as spec-
ifying a distribution for the standardized errors. The limitations of this procedure are that
implementations require knowledge of a density family which includes f – if the distri-
bution is misspecified then the quantile used will be wrong – and that the residuals must
come from a location-scale family. The second limitation imposes that all of the dynamics
of returns can be summarized by a time-varying mean and variance, and so higher order
moments must be time invariant.

8.2.2.3 Semiparametric ARCH Models/Filtered Historical Simulation

Semiparametric estimation mixes parametric ARCH models with nonparametric estima-


tors of the distribution.5 Again, consider a constant mean GARCH(1,1) model

rt +1 = µ + εt +1
σ2t +1 = ω + γ1 ε2t + β1 σ2t
εt +1 = σt +1 e t +1
e t +1 ∼ g (0, 1)
i.i.d.

where g (0, 1) is an unknown distribution with mean zero and variance 1.


When g (·) is unknown standard maximum likelihood estimation is not available. Recall
that assuming a normal distribution for the standardized residuals, even if misspecified,
produces estimates which are strongly consistent, and so ω, γ1 and β1 will converge to their
true values for most any g (·). The model can be estimated using QMLE by assuming that
the errors are normally distributed and then the Value-at-Risk for the α-quantile can be
computed

$VaR_{t+1}(\alpha) = -\hat\mu - \hat\sigma_{t+1}\hat{G}^{-1}_\alpha$  (8.6)

where $\hat{G}^{-1}_\alpha$ is the empirical $\alpha$-quantile of $e_{t+1} = \frac{\varepsilon_{t+1}}{\sigma_{t+1}}$. To estimate this quantile, define $\hat{e}_{t+1} = \frac{\hat\varepsilon_{t+1}}{\hat\sigma_{t+1}}$ and order the errors such that

5 This is only one example of a semiparametric estimator. Any semiparametric estimator has elements of both a parametric estimator and a nonparametric estimator.

ê1 < ê2 < . . . < ên −1 < ên .


where n replaces T to indicate the residuals are no longer time ordered. Ĝα−1 = êbαnc or
Ĝα−1 = êdαne where bx c and dx e denote the floor (largest integer smaller than) and ceiling
(smallest integer larger than) of x .6 In other words, the estimate of G −1 is the α-quantile of
the empirical distribution of ê t +1 which corresponds to the αn ordered ên .
Semiparametric ARCH models provide one clear advantage over their parametric ARCH
cousins; the quantile, and hence the VaR, will be consistent under weaker conditions since
the density of the standardized residuals does not have to be assumed. The primary disad-
vantage of the semiparametric approach is that Ĝα−1 may be poorly estimated – especially if
α is very small (e.g. 1%). Semiparametric ARCH models also share the limitation that their
use is only justified if returns are generated by some location-scale distribution.
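The final quantile step of filtered historical simulation is sketched below; the standardized residuals, mean and variance forecast are placeholders standing in for the output of a QMLE-estimated GARCH model.

```python
# A sketch of the filtered historical simulation VaR in eq. 8.6; e_hat stands in for
# standardized residuals produced by a QMLE-estimated GARCH model.
import numpy as np

rng = np.random.default_rng(7)
e_hat = rng.standard_t(df=6, size=2000) / np.sqrt(6 / 4)   # fat-tailed, unit-variance residuals
mu_hat, sigma_fcst, alpha = 0.0005, 0.015, 0.05            # illustrative forecast values

g_inv = np.quantile(e_hat, alpha)                          # empirical alpha-quantile of e_hat
var_fhs = -mu_hat - sigma_fcst * g_inv                     # eq. 8.6
print(var_fhs)
```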

8.2.2.4 Cornish-Fisher Approximation

The Cornish-Fisher approximation splits the difference between a fully parametric model
and a semi parametric model. The setup is identical to that of the semiparametric model

rt +1 = µ + εt +1
σ2t +1 = ω + γε2t + β σ2t
εt +1 = σt +1 e t +1
e t +1 ∼ g (0, 1)
i.i.d.

where $g(\cdot)$ is again an unknown distribution. The unknown parameters are estimated by quasi-maximum likelihood assuming conditional normality to produce standardized residuals, $\hat{e}_{t+1} = \frac{\hat\varepsilon_{t+1}}{\hat\sigma_{t+1}}$. The Cornish-Fisher approximation is a Taylor-series like expansion of the α-VaR around the α-VaR of a normal and is given by

$VaR_{t+1} = -\mu - \sigma_{t+1} F^{-1}_{CF}(\alpha)$  (8.7)

$F^{-1}_{CF}(\alpha) \equiv \Phi^{-1}(\alpha) + \dfrac{\varsigma}{6}\left[\left(\Phi^{-1}(\alpha)\right)^2 - 1\right] + \dfrac{\kappa - 3}{24}\left[\left(\Phi^{-1}(\alpha)\right)^3 - 3\Phi^{-1}(\alpha)\right] - \dfrac{\varsigma^2}{36}\left[2\left(\Phi^{-1}(\alpha)\right)^3 - 5\Phi^{-1}(\alpha)\right]$  (8.8)
where $\varsigma$ and $\kappa$ are the skewness and kurtosis of $\hat{e}_{t+1}$, respectively. From the expression for $F^{-1}_{CF}(\alpha)$, negative skewness and excess kurtosis ($\kappa > 3$, the kurtosis of a normal) decrease the estimated quantile and increase the VaR.

6 When estimating a quantile from discrete data and not smoothing, the quantile is “set valued” and defined as any point between $\hat{e}_{\lfloor\alpha n\rfloor}$ and $\hat{e}_{\lceil\alpha n\rceil}$, inclusive.

The Cornish-Fisher approximation shares
the strength of the semiparametric distribution in that it can be accurate without a para-
metric assumption. However, unlike the semi-parametric estimator, Cornish-Fisher esti-
mators are not necessarily consistent which may be a drawback. Additionally, estimates of
higher order moments of standardized residuals may be problematic or, in extreme cases,
the moments may not even exist.
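A small sketch of the Cornish-Fisher quantile in equation 8.8 follows; the skewness and kurtosis values are illustrative rather than estimates.

```python
# A sketch of the Cornish-Fisher quantile (eq. 8.8) and the implied VaR (eq. 8.7)
# for given skewness and kurtosis of the standardized residuals.
import numpy as np
from scipy.stats import norm

def cornish_fisher_quantile(alpha, skew, kurt):
    z = norm.ppf(alpha)
    return (z + skew / 6 * (z**2 - 1)
              + (kurt - 3) / 24 * (z**3 - 3 * z)
              - skew**2 / 36 * (2 * z**3 - 5 * z))

mu_hat, sigma_fcst = 0.0005, 0.015                      # illustrative forecast values
q = cornish_fisher_quantile(alpha=0.05, skew=-0.4, kurt=5.0)
print(-mu_hat - sigma_fcst * q)                         # the 5% VaR as a return
```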

8.2.2.5 Conditional Autoregressive Value-at-Risk (CaViaR)

Engle & Manganelli (2004) developed ARCH-like models to directly estimate the condi-
tional Value-at-Risk using quantile regression. Like the variance in a GARCH model, the
α-quantile of the return distribution, F^{-1}_{α,t+1}, is modeled as a weighted average of the previous quantile, a constant, and a ‘shock’. The shock can take many forms, although a “HIT”,
defined as an exceedance of the previous Value-at-Risk, is the most natural.

HIT_{t+1} = I[r_{t+1} < F^{-1}_{t+1}] − α    (8.9)

where r_{t+1} is the (percentage) return and F^{-1}_{t+1} is the time-t α-quantile of this distribution.
Defining qt +1 as the time t + 1 α-quantile of returns, the evolution in a standard CaViaR
model is defined by

qt +1 = ω + γH I Tt + β qt . (8.10)
Other forms which have been examined are the symmetric absolute value,

qt +1 = ω + γ|rt | + β qt . (8.11)
the asymmetric absolute value,

qt +1 = ω + γ1 |rt | + γ2 |rt |I[rt <0] + β qt (8.12)


the indirect GARCH,

q_{t+1} = (ω + γ r²_t + β q²_t)^{1/2}    (8.13)
or combinations of these. The parameters can be estimated by minimizing the “tick” loss
function

argmin_θ  T^{-1} Σ_{t=1}^T  α(r_t − q_t)(1 − I[r_t < q_t]) + (1 − α)(q_t − r_t) I[r_t < q_t]    (8.14)
    = argmin_θ  T^{-1} Σ_{t=1}^T  α(r_t − q_t) + (q_t − r_t) I[r_t < q_t]

where I[rt <qt ] is an indicator variable which is 1 if rt < qt . Estimation of the parameters in
this problem is tricky since the objective function may have many flat spots and is non-
differentiable. Derivative free methods, such as simplex methods or genetic algorithms,
can be used to overcome these issues. The VaR in a CaViaR framework is then

V a R_{t+1} = −q_{t+1} = −F̂^{-1}_{t+1}    (8.15)

Because a CaViaR model does not specify a distribution of returns or any moments, its
use is justified under much weaker assumptions than other VaR estimators. Additionally,
its parametric form provides reasonable convergence of the unknown parameters. The
main drawbacks of the CaViaR modeling strategy are that it may produce out-of-order quantiles (i.e. a 5% VaR that is less than the 10% VaR) and that estimation of the model parameters is challenging.
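The sketch below illustrates the estimation idea: build the symmetric absolute value recursion of eq. (8.11), evaluate the tick loss of eq. (8.14) and minimize it with a derivative-free method. The simulated return series, the 250-observation initialization of q and the choice of Nelder-Mead are assumptions made only for this example.

    import numpy as np
    from scipy.optimize import minimize

    def sav_caviar_quantiles(theta, r, alpha):
        # Symmetric absolute value recursion, eq. (8.11): q_{t+1} = w + g|r_t| + b q_t
        omega, gamma, beta = theta
        q = np.empty_like(r)
        q[0] = np.quantile(r[:250], alpha)          # initialize at an empirical quantile
        for t in range(1, r.shape[0]):
            q[t] = omega + gamma * abs(r[t - 1]) + beta * q[t - 1]
        return q

    def tick_loss(theta, r, alpha):
        # Objective from eq. (8.14)
        q = sav_caviar_quantiles(theta, r, alpha)
        return np.mean(alpha * (r - q) + (q - r) * (r < q))

    rng = np.random.default_rng(2)
    r = rng.standard_normal(2000)                   # hypothetical return series
    res = minimize(tick_loss, np.array([0.0, 0.0, 0.9]), args=(r, 0.05),
                   method="Nelder-Mead")            # derivative-free, copes with flat spots
    q_hat = sav_caviar_quantiles(res.x, r, 0.05)
    print(res.x, -q_hat[-1])                        # estimates and the final VaR = -q_t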

8.2.2.6 Weighted Historical Simulation

Weighted historical simulation applies weights to returns where the weight given to recent
data is larger than the weight given to returns further in the past. The estimator is nonparametric in that no assumptions about either the distribution or the dynamics of returns are made.
Weights are assigned using an exponentially declining function. Assuming returns are available for i = 1, . . . , t, the weight given to data point i is

w_i = λ^{t−i}(1 − λ)/(1 − λ^t),    i = 1, 2, . . . , t


Typical values for λ range from .99 to .995. When λ = .99, 99% of the weight occurs in the most recent 450 data points – .995 changes this to the most recent 900 data points. Smaller values of λ will make the Value-at-Risk estimator more “local” while larger values result in a weighting that approaches an equal weighting.
The weighted cumulative CDF is then

Ĝ(r) = Σ_{i=1}^t w_i I[r_i ≤ r].

Clearly the CDF evaluated at a return as large as the largest is 1 since the weights sum to 1, and the CDF evaluated at a return smaller than the smallest is 0. The VaR is then computed as the solution to

V a R_{t+1|t} = min { r : Ĝ(r) ≥ α }

which chooses the smallest value of r where the weighted cumulative distribution is just
as large as α.
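A minimal sketch of weighted historical simulation under the assumptions above; the return series is hypothetical, and the VaR is reported as a positive reserve (the negative of the smallest return whose weighted CDF reaches α).

    import numpy as np

    def weighted_hs_var(r, lam=0.99, alpha=0.05):
        # Exponentially declining weights w_i = lam^(t-i)(1-lam)/(1-lam^t), then the
        # smallest return whose weighted CDF reaches alpha.
        t = r.shape[0]
        i = np.arange(1, t + 1)
        w = lam ** (t - i) * (1 - lam) / (1 - lam ** t)    # weights sum to one
        order = np.argsort(r)
        cum_w = np.cumsum(w[order])                        # weighted CDF at sorted returns
        r_alpha = r[order][np.searchsorted(cum_w, alpha)]
        return -r_alpha                                    # VaR as a positive reserve

    rng = np.random.default_rng(3)
    r = 0.01 * rng.standard_normal(1500)                   # hypothetical daily returns
    print(weighted_hs_var(r, lam=0.995, alpha=0.05))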

[Figure 8.2: Fit percentage Value-at-Risk using α = 5%. Three panels covering 2000–2009: % VaR using RiskMetrics, % VaR using TARCH(1,1,1) with Skew t errors, and % VaR using Asymmetric CaViaR.]

Figure 8.2: The figure contains the estimated % VaR for the S&P 500 using data from 1999
until the end of 2009. While these three models are clearly distinct, the estimated VaRs are
remarkably similar.

8.2.2.7 Example: Conditional Value-at-Risk for the S&P 500

The concepts of VaR will be illustrated using S&P 500 returns from January 1, 1999 until December 31, 2009, the same data used in the univariate volatility chapter. A number of

Model Parameters
TARCH(1,1,1)
σt +1 = ω + γ1 |rt | + γ2 |rt |I[rt <0] + β σt
ω γ1 γ2 β ν λ
Normal 0.016 0.000 0.120 0.939
Student’s t 0.015 0.000 0.121 0.939 12.885
Skew t 0.016 0.000 0.125 0.937 13.823 -0.114

CaViaR
qt +1 = ω + γ1 |rt | + γ2 |rt |I[rt <0] + β qt
ω γ1 γ2 β
Asym CaViaR -0.027 0.028 -0.191 0.954

Estimated Quantiles from Parametric and Semi-parametric TARCH models


Semiparam. Normal Stud. t Skew t CF
1% -3.222 -2.326 -2.439 -2.578 -4.654
5% -1.823 -1.645 -1.629 -1.695 -1.734
10% -1.284 -1.282 -1.242 -1.274 -0.834

Table 8.1: Estimated model parameters and quantiles. The choice of distribution for the
standardized shocks makes little difference in the parameters of the TARCH process, and
so the fit conditional variances are virtually identical. The only difference in the VaRs from
these three specifications comes from the estimates of the quantiles of the standardized
returns (bottom panel).

models have been estimated which produce similar VaR estimates. Specifically, the GARCH models, whether estimated using a normal likelihood, a Student's t, a semiparametric or a Cornish-Fisher approximation, all produce very similar fits and generally only differ in the quantile estimated. Table 8.1 reports parameter estimates from these models. Only one set of TARCH parameters is reported since they were similar in all three models; ν̂ ≈ 12 in both the standardized Student's t and the skewed t, indicating that the standardized residuals are leptokurtotic, and λ̂ ≈ −.1, from the skewed t, indicates little skewness. The CaViaR estimates indicate little change in the conditional quantile for a symmetric shock (other than mean reversion), a large decrease when the return is negative, and that the conditional quantile is persistent.

The table also contains estimated quantiles using the parametric, semiparametric and
Cornish-Fisher estimators. Since the fit conditional variances were similar, the only mean-
ingful difference in the VaRs comes from the differences in the estimated quantiles.

8.2.3 Unconditional Value at Risk

While the conditional VaR is often the object of interest, there may be situations which call
for the unconditional VaR (also known as marginal VaR). Unconditional VaR expands the
set of choices from the conditional to include ones which do not make use of conditioning
information to estimate the VaR directly from the unmodified returns.

8.2.3.1 Parametric Estimation

The simplest form of VaR specifies a parametric model for the unconditional distribution
of returns and derives the VaR from the α-quantile of this distribution. For example, if
rt ∼ N (µ, σ2 ), the α-VaR is

V a R = −µ − σΦ−1 (α) (8.16)


and the parameters can be directly estimated using Maximum likelihood with the usual
estimators,

µ̂ = T^{-1} Σ_{t=1}^T r_t        σ̂² = T^{-1} Σ_{t=1}^T (r_t − µ̂)²

In a general parametric VaR model, some distribution for returns which depends on a set of unknown parameters θ is assumed, r_t ∼ F(θ), and the parameters are estimated by maximum likelihood. The VaR is then −F_α^{-1}, where F_α^{-1} is the α-quantile of the estimated distribution. The advantages and disadvantages of parametric unconditional VaR are identical to those of parametric conditional VaR. The models are parsimonious and the parameter estimates are precise, yet finding a specification which necessarily includes the true distribution is difficult (or impossible).

8.2.3.2 Nonparametric Estimation/Historical Simulation

At the other end of the spectrum is a pure nonparametric estimate of the unconditional
VaR. As was the case in the semiparametric conditional VaR, the first step is to sort the
returns such that

r1 < r2 < . . . < rn−1 < rn


where n = T is used to denote an ordering not based on time. The VaR is estimated using r_{⌊αn⌋} or alternatively r_{⌈αn⌉} or an average of the two, where ⌊x⌋ and ⌈x⌉ denote the floor (largest integer not greater than x) and ceiling (smallest integer not less than x), respectively. In other words, the estimate of the VaR is the α-quantile of the empirical distribution of {r_t},

V a R = −Ĝα−1 (8.17)

where Ĝ_α^{-1} is the estimated quantile. This follows since the empirical CDF is defined as

G(r) = T^{-1} Σ_{t=1}^T I[r_t ≤ r]

where I[r_t ≤ r] is an indicator function that takes the value 1 if r_t is less than or equal to r, and so this function counts the fraction of returns which are no larger than any value r.
Historical simulation estimates are rough and a single new data point may produce very
different VaR estimates. Smoothing the estimated quantile using a kernel density generally
improves the precision of the estimate when compared to one calculated directly on the
sorted returns. This is particularly true if the sample is small. See section 8.4.2 for more
details.
The advantage of nonparametric estimates of VaR is that they are generally consistent under very weak conditions and that they are trivial to compute. The disadvantage is that the quantile can be poorly estimated – or equivalently that very large samples are needed for estimated quantiles to be accurate – particularly for 1% VaRs (or smaller).
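The sketch below contrasts the two unconditional estimators on hypothetical fat-tailed returns: the normal parametric VaR of eq. (8.16) and the historical simulation quantile. All names and the simulated data are illustrative.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    r = rng.standard_t(df=4, size=2500) / np.sqrt(2.0)     # hypothetical fat-tailed returns
    alpha = 0.01

    # Parametric: assume r ~ N(mu, sigma^2) and apply eq. (8.16)
    mu_hat, sigma_hat = r.mean(), r.std()
    var_parametric = -mu_hat - sigma_hat * stats.norm.ppf(alpha)

    # Nonparametric (historical simulation): the alpha-quantile of the sorted returns
    var_hs = -np.sort(r)[max(int(np.floor(alpha * r.shape[0])) - 1, 0)]

    print(var_parametric, var_hs)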

8.2.3.3 Parametric Monte Carlo

Parametric Monte Carlo is meaningfully different from either straight parametric or non-
parametric estimation of the density. Rather than fit a model to the returns directly, para-
metric Monte Carlo fits a parsimonious conditional model which is then used to simulate
the unconditional distribution. For example, suppose that returns followed an AR(1) with
GARCH(1,1) errors and normal innovations,

r_{t+1} = φ_0 + φ_1 r_t + ε_{t+1}
σ²_{t+1} = ω + γ ε²_t + β σ²_t
ε_{t+1} = σ_{t+1} e_{t+1}
e_{t+1} ∼ N(0, 1)   i.i.d.

Parametric Monte Carlo is implemented by first estimating the parameters of the model,
θ̂ = [φ̂0 , φ̂1 , ω̂, γ̂, β̂ ]0 , and then simulating the process for a long period of time (generally
much longer than the actual number of data points available). The VaR from this model is
the α-quantile of the simulated data r̃t .

V a R = −G̃ˆα−1 (8.18)

where G̃ˆα−1 is the empirical α-quantile of the simulated data, {r̃t }. Generally the amount
of simulated data should be sufficient that no smoothing is needed so that the empirical
quantile is an accurate estimate of the quantile of the unconditional distribution. The ad-
vantage of this procedure is that it efficiently makes use of conditioning information which

Unconditional Value-at-Risk
HS Normal Stud. t Skew t CF
1% VaR 2.832 3.211 3.897 4.156 5.701
5% VaR 1.550 2.271 2.005 2.111 2.104
10% VaR 1.088 1.770 1.387 1.448 1.044

Table 8.2: Unconditional VaR of S&P 500 returns estimated assuming returns are Normal,
Student’s t or skewed t , using a Cornish-Fisher transformation or using a nonparametric
quantile estimator. While the 5% and 10% VaR are similar, the estimators of the 1% VaR
differ.

is ignored in either parametric or nonparametric estimators of unconditional VaR and that rich families of unconditional distributions can be generated from parsimonious conditional models. The obvious drawback of this procedure is that an incorrect conditional specification leads to an inconsistent estimate of the unconditional VaR.
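A minimal sketch of parametric Monte Carlo under the AR(1)-GARCH(1,1) specification above. The parameter vector theta_hat is hypothetical; in practice it would be replaced by the maximum likelihood estimates.

    import numpy as np

    def simulate_ar1_garch11(theta, t_sim, burn=1000, seed=0):
        # Simulate the AR(1)-GARCH(1,1) model above and return the post-burn-in sample
        phi0, phi1, omega, gamma, beta = theta
        rng = np.random.default_rng(seed)
        e = rng.standard_normal(t_sim + burn)
        sigma2 = omega / (1 - gamma - beta)         # start at the unconditional variance
        r_prev = 0.0
        out = np.empty(t_sim + burn)
        for t in range(t_sim + burn):
            eps = np.sqrt(sigma2) * e[t]
            out[t] = phi0 + phi1 * r_prev + eps
            sigma2 = omega + gamma * eps ** 2 + beta * sigma2
            r_prev = out[t]
        return out[burn:]

    theta_hat = [0.03, 0.05, 0.02, 0.08, 0.90]      # hypothetical [phi0, phi1, omega, gamma, beta]
    r_sim = simulate_ar1_garch11(theta_hat, t_sim=100_000)
    print(-np.quantile(r_sim, 0.01))                # eq. (8.18): VaR = -G~_alpha^{-1}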

8.2.3.4 Example: Unconditional Value-at-Risk for the S&P 500

Using the S&P 500 data, 3 parametric models, a normal, a Student’s t and a skewed t ,
a Cornish-Fisher estimator based on the studentized residuals (ê t = (rt − µ̂)/σ̂) and a
nonparametric estimator were used to estimate the unconditional VaR. The estimates are
largely similar although some differences can be seen at the 1% VaR.

8.2.4 Evaluating VaR models


Evaluating the performance of VaR models is not fundamentally different from the evalu-
ation of either ARMA or GARCH models. The key insight of VaR evaluation comes from the
loss function for VaR errors,
Σ_{t=1}^T α(r_t − F_t^{-1})(1 − I[r_t < F_t^{-1}]) + (1 − α)(F_t^{-1} − r_t) I[r_t < F_t^{-1}]    (8.19)

where r_t is the return in period t and F_t^{-1} is the α-quantile of the return distribution in period t. The generalized error can be directly computed from this loss function by differentiating with respect to V a R, and is

ge_t = I[r_t < F_t^{-1}] − α    (8.20)


which is the time-t “HIT” (HIT_t).⁷ When there is a VaR exceedance, HIT_t = 1 − α and when there is no exceedance, HIT_t = −α. If the model is correct, then a fraction α of the HITs
⁷The generalized error extends the concept of an error in a linear regression or linear time-series model to nonlinear estimators. Suppose a loss function is specified as L(y_{t+1}, ŷ_{t+1|t}); then the generalized error is the derivative of the loss function with respect to the second argument, that is

ge_t = ∂L(y_{t+1}, ŷ_{t+1|t}) / ∂ŷ_{t+1|t}    (8.21)

where it is assumed that the loss function is differentiable at this point.

[Figure 8.3: Unconditional distribution of the S&P 500 – nonparametric, skew t, and normal density estimates plotted against the S&P 500 returns.]

Figure 8.3: Plot of the S&P 500 returns as well as a parametric density using Hansen’s skewed
t and a nonparametric density estimator constructed using a kernel.

should equal (1 − α) and a fraction (1 − α) of the HITs should equal −α, so that the mean of HIT_t is

α(1 − α) + (1 − α)(−α) = 0.

Moreover, when the VaR is conditional on time-t information, E_t[HIT_{t+1}] = 0, which follows from the properties of optimal forecasts (see chapter 4). A test that the conditional expectation is zero can be performed using a generalized Mincer-Zarnowitz (GMZ) regression of HIT_{t+1|t} on any time-t available variable. For example, the estimated quantile F^{-1}_{t+1|t} for t + 1 could be included (which is in the time-t information set) as well as lagged HITs to form a regression,

HIT_{t+1|t} = γ_0 + γ_1 F^{-1}_{t+1|t} + γ_2 HIT_t + γ_3 HIT_{t−1} + . . . + γ_K HIT_{t−K+2} + η_t

If the model is correctly specified, all of the coefficients should be zero and the null H0 : γ = 0 can be tested against the alternative H1 : γ_j ≠ 0 for some j.

8.2.4.1 Likelihood Evaluation

While the generalized errors can be tested in the GMZ framework, VaR forecast evaluation can be improved by noting that HIT_t is a Bernoulli random variable which takes the value 1 − α with probability α and the value −α with probability 1 − α. Defining H̃IT_t = I[r_t < F_t^{-1}], this modified HIT is exactly a Bernoulli(α), and a more powerful test can be constructed using a likelihood ratio test. Under the null that the model is correctly specified, the likelihood function of a series of H̃ITs is

f(H̃IT; p) = Π_{t=1}^T p^{H̃IT_t} (1 − p)^{1−H̃IT_t}

and the log-likelihood is

l(p; H̃IT) = Σ_{t=1}^T H̃IT_t ln(p) + (1 − H̃IT_t) ln(1 − p).

If the model is correctly specified, p = α and a likelihood ratio test can be performed as

LR = 2(l(p̂; H̃IT) − l(p = α; H̃IT))    (8.22)

where p̂ = T^{-1} Σ_{t=1}^T H̃IT_t is the maximum likelihood estimator of p under the alternative. The test has a single restriction and so has an asymptotic χ²_1 distribution.
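A minimal sketch of the unconditional likelihood ratio test in eq. (8.22). The returns and VaR forecasts are simulated from a correctly specified model so the test should not reject; the sign convention assumes the VaR is reported as a positive number, so an exceedance is r_t < −VaR_t.

    import numpy as np
    from scipy import stats

    def unconditional_var_lr(returns, var_forecasts, alpha):
        # LR test of eq. (8.22): HIT~_t = I[r_t < -VaR_t] should be Bernoulli(alpha)
        hit = (returns < -var_forecasts).astype(float)
        p_hat = hit.mean()
        loglik = lambda p: np.sum(hit * np.log(p) + (1 - hit) * np.log(1 - p))
        lr = 2 * (loglik(p_hat) - loglik(alpha))
        return lr, stats.chi2.sf(lr, df=1)          # asymptotically chi-squared, 1 dof

    rng = np.random.default_rng(5)
    r = rng.standard_normal(1000)                   # hypothetical returns
    var_fcast = -stats.norm.ppf(0.05) * np.ones(1000)   # correct 5% VaR for N(0,1)
    print(unconditional_var_lr(r, var_fcast, alpha=0.05))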
The likelihood-based test for unconditionally correct VaR can be extended to condition-
ally correct VaR by examining the sequential dependence of H I T s. This testing strategy
uses properties of a Markov chain of Bernoulli random variables. A Markov chain is a mod-
eling device which resembles ARMA models yet is more general since it can handle random
variables which take on a finite number of values – such as a H I T . A simple 1st order bi-
nary valued Markov chain produces Bernoulli random variables which are not necessarily
independent. It is characterized by a transition matrix which contains the probability that
the state stays the same. In a 1st order binary valued Markov chain, the transition matrix is
given by " # " #
p00 p01 p00 1 − p00
= ,
p10 p11 1 − p11 p11
where p_ij is the probability that the next observation takes value j given that this observation has value i. For example, p_10 is the probability that the next observation is not a HIT given that the current observation is a HIT. In a correctly specified model, the probability of a HIT in the current period should not depend on whether the previous period was a HIT or not. In other words, the sequence {HIT_t} is i.i.d., so that p_00 = 1 − α and p_11 = α when the model is conditionally correct.

Define the following quantities,

n_00 = Σ_{t=1}^{T−1} (1 − H̃IT_t)(1 − H̃IT_{t+1})
n_10 = Σ_{t=1}^{T−1} H̃IT_t (1 − H̃IT_{t+1})
n_01 = Σ_{t=1}^{T−1} (1 − H̃IT_t) H̃IT_{t+1}
n_11 = Σ_{t=1}^{T−1} H̃IT_t H̃IT_{t+1}

where n_ij counts the number of times H̃IT_t = i is followed by H̃IT_{t+1} = j.
The log-likelihood for the sequence of H̃ITs is

l(p; H̃IT) = n_00 ln(p_00) + n_01 ln(1 − p_00) + n_11 ln(p_11) + n_10 ln(1 − p_11)

where p11 is the probability of two sequential H I T s and p00 is the probability of two sequen-
tial periods without a H I T . The null is H0 : p11 = 1 − p00 = α. The maximum likelihood
estimates of p00 and p11 are

p̂_00 = n_00 / (n_00 + n_01)
p̂_11 = n_11 / (n_11 + n_10)

and the hypothesis can be tested using the likelihood ratio

LR = 2(l(p̂_00, p̂_11; H̃IT) − l(p_00 = 1 − α, p_11 = α; H̃IT))    (8.23)

and is asymptotically χ²_2 distributed.
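A sketch of the independence test in eq. (8.23), computed directly from the n_ij counts; scipy's xlogy is used so that zero counts contribute zero to the log-likelihood. The simulated HIT sequence is i.i.d. Bernoulli(α) by construction, so the test should not reject.

    import numpy as np
    from scipy import stats
    from scipy.special import xlogy

    def independence_lr(hit, alpha):
        # LR test of eq. (8.23) from a first-order Markov chain for HIT~_t
        h0, h1 = hit[:-1], hit[1:]
        n00 = np.sum((1 - h0) * (1 - h1))
        n01 = np.sum((1 - h0) * h1)
        n10 = np.sum(h0 * (1 - h1))
        n11 = np.sum(h0 * h1)
        loglik = lambda p00, p11: (xlogy(n00, p00) + xlogy(n01, 1 - p00)
                                   + xlogy(n11, p11) + xlogy(n10, 1 - p11))
        p00_hat, p11_hat = n00 / (n00 + n01), n11 / (n11 + n10)
        lr = 2 * (loglik(p00_hat, p11_hat) - loglik(1 - alpha, alpha))
        return lr, stats.chi2.sf(lr, df=2)          # asymptotically chi-squared, 2 dof

    rng = np.random.default_rng(6)
    hit = (rng.uniform(size=2000) < 0.05).astype(float)   # i.i.d. Bernoulli(0.05) HITs
    print(independence_lr(hit, alpha=0.05))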


This framework can be extended to include conditioning information by specifying a probit or logit for H̃IT_t using any time-t available information. For example, a specification test could be constructed using K lags of HIT, a constant and the forecast quantile as

H̃IT_{t+1|t} = γ_0 + γ_1 F_{t+1|t} + γ_2 H̃IT_t + γ_3 H̃IT_{t−1} + . . . + γ_K H̃IT_{t−K+1}.

To implement this in a Bernoulli log-likelihood, it is necessary to ensure that

0 ≤ γ_0 + γ_1 F_{t+1|t} + γ_2 H̃IT_t + γ_3 H̃IT_{t−1} + . . . + γ_K H̃IT_{t−K+1} ≤ 1.

This is generally accomplished using one of two transformations, the normal CDF (Φ(z)) which produces a probit or the logistic function (e^z/(1+e^z)) which produces a logit. Generally the choice between these two makes little difference. If x_t = [1  F_{t+1|t}  H̃IT_t  H̃IT_{t−1} . . . H̃IT_{t−K+1}], the model for H̃IT is

H̃IT_{t+1|t} = Φ(x_t γ)

where the normal CDF is used to map from (−∞, ∞) to (0,1), and so the model is a conditional probability model. The log-likelihood is then

l(γ; H̃IT, x) = Σ_{t=1}^T H̃IT_t ln(Φ(x_t γ)) + (1 − H̃IT_t) ln(1 − Φ(x_t γ)).    (8.24)

The likelihood ratio for testing the null H0 : γ_0 = Φ^{-1}(α), γ_j = 0 for all j = 1, 2, . . . , K against the alternative H1 : γ_0 ≠ Φ^{-1}(α) or γ_j ≠ 0 for some j = 1, 2, . . . , K can be computed as

LR = 2( l(γ̂; H̃IT) − l(γ_0; H̃IT) )    (8.25)

where γ_0 is the parameter vector under the null and γ̂ is the estimator under the alternative (i.e. the unrestricted estimator from the probit).

8.2.5 Relative Comparisons

Diebold-Mariano tests can be used to rank VaR forecasts against each other in a manner analogous to their use in ranking conditional mean or conditional variance forecasts (Diebold & Mariano 1995). If L(r_{t+1}, V a R_{t+1|t}) is a loss function defined over VaR, then a Diebold-Mariano test statistic can be computed as

DM = d̄ / √(V̂[d̄])    (8.26)
where

d_t = L(r_{t+1}, V a R^A_{t+1|t}) − L(r_{t+1}, V a R^B_{t+1|t}),

V a R^A and V a R^B are the Values-at-Risk from models A and B respectively, d̄ = R^{-1} Σ_{t=M+1}^{M+R} d_t, M (for modeling) is the number of observations used in the model building and estimation, R (for reserve) is the number of observations held back for model evaluation, and V̂[d̄] is the long-run variance of d_t, which requires the use of a HAC covariance estimator (e.g. Newey-West). Recall that DM has an asymptotic normal distribution and that the test has the null H0 : E[d_t] = 0 and the composite alternatives H1^A : E[d_t] < 0 and H1^B : E[d_t] > 0. Large negative values (less than −2) indicate model A is superior while

large positive values indicate the opposite; values close to zero indicate neither forecast
outperforms the other.
Ideally the loss function, L (·), should reflect the user’s loss over VaR forecast errors. In
some circumstances there is not an obvious choice. When this occurs, a reasonable choice
for the loss function is the VaR optimization objective,

L (rt +1 , V a R t +1|t ) = α(rt +1 −V a R t +1|t )(1− I[rt +1 <V a Rt +1|t ] )+(1−α)(V a R t +1|t − rt +1 )I[rt +1 <V a Rt +1|t ]
(8.27)
which has the same interpretation in a VaR model as the mean square error (MSE) loss
function has in conditional mean model evaluation or the Q-like loss function has for com-
paring volatility models.
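A minimal sketch of a Diebold-Mariano comparison using the tick loss of eq. (8.27) evaluated at the forecast quantile q = −VaR, with a Bartlett-kernel (Newey-West) long-run variance. The two VaR series, the lag length and the simulated returns are hypothetical choices made only for the example.

    import numpy as np

    def tick_loss(r, var, alpha):
        # Loss of eq. (8.27) evaluated at the forecast quantile q = -VaR
        q = -var
        exceed = r < q
        return alpha * (r - q) * (~exceed) + (1 - alpha) * (q - r) * exceed

    def diebold_mariano(r, var_a, var_b, alpha, lags=10):
        # DM statistic of eq. (8.26) with a Bartlett (Newey-West) long-run variance
        d = tick_loss(r, var_a, alpha) - tick_loss(r, var_b, alpha)
        R = d.shape[0]
        d_bar = d.mean()
        u = d - d_bar
        lrv = np.sum(u * u) / R
        for l in range(1, lags + 1):
            lrv += 2 * (1 - l / (lags + 1)) * np.sum(u[l:] * u[:-l]) / R
        return d_bar / np.sqrt(lrv / R)

    rng = np.random.default_rng(7)
    r = rng.standard_normal(1000)                   # hypothetical evaluation-period returns
    var_a = 1.645 * np.ones(1000)                   # roughly correct 5% VaR for N(0,1)
    var_b = 1.40 * np.ones(1000)                    # an understated reserve
    print(diebold_mariano(r, var_a, var_b, alpha=0.05))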

8.3 Expected Shortfall


Expected shortfall – also known as tail VaR – combines aspects of the VaR methodology
with more information about the distribution of returns in the tail.8
Definition 8.4 (Expected Shortfall). Expected Shortfall (ES) is defined as the expected value
of the portfolio loss given a Value-at-Risk exceedance has occurred. The unconditional
Expected Shortfall is defined
 
ES = E[ (W_1 − W_0)/W_0 | (W_1 − W_0)/W_0 < −V a R ]    (8.28)
   = E[ r_{t+1} | r_{t+1} < −V a R ]

where Wt , t = 0, 1, is the value of the assets in the portfolio and 1 and 0 measure an arbi-
trary length of time (e.g. one day or two weeks).9
The conditional, and generally more useful, Expected Shortfall is similarly defined.
Definition 8.5 (Conditional Expected Shortfall). Conditional Expected Shortfall is defined

E S_{t+1} = E_t[ r_{t+1} | r_{t+1} < −V a R_{t+1} ].    (8.29)

where r_{t+1} is the return on a portfolio at time t + 1. Since t is an arbitrary measure of time, t + 1 also refers to an arbitrary unit of time (day, two weeks, 5 years, etc.)
⁸Expected Shortfall is a special case of a broader class of statistics known as exceedance measures. Exceedance measures all describe a common statistical relationship conditional on one or more variables being in their tails. Expected Shortfall is an exceedance mean. Other exceedance measures which have been studied include exceedance variance, V[x|x < q_α], exceedance correlation, Corr(x, y|x < q_{α,x}, y < q_{α,y}), and exceedance β, Cov(x, y|x < q_{α,x}, y < q_{α,y}) / (V[x|x < q_{α,x}] V[y|y < q_{α,y}])^{1/2}, where q_{α,·} is the α-quantile of the distribution of x or y.
⁹Just like VaR, Expected Shortfall can be equivalently defined in terms of returns or in terms of wealth. For consistency with the VaR discussion, Expected Shortfall is presented in terms of the return.

Because computation of Expected Shortfall requires both a quantile and an expectation, it is generally computed from density models, either parametric or semi-parametric, rather than from simpler and more direct specifications.
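As a small illustration, the sketch below computes the 5% Expected Shortfall of a normal both from the closed form E[r | r < µ + σΦ^{-1}(α)] = µ − σφ(Φ^{-1}(α))/α and by averaging simulated returns beyond the VaR. The normal choice is only for the example; any estimated density could be used in the same way.

    import numpy as np
    from scipy import stats

    alpha, mu, sigma = 0.05, 0.0, 1.0

    # Closed form for a normal lower tail
    z = stats.norm.ppf(alpha)
    es_normal = mu - sigma * stats.norm.pdf(z) / alpha

    # Simulation: average the draws that fall beyond the (simulated) VaR
    rng = np.random.default_rng(8)
    r_sim = mu + sigma * rng.standard_normal(1_000_000)
    var = -np.quantile(r_sim, alpha)
    es_sim = r_sim[r_sim < -var].mean()

    print(es_normal, es_sim)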

8.3.1 Evaluating Expected Shortfall models


Expected Shortfall models can be evaluated using standard techniques since Expected Short-
fall is a conditional mean,

E t [E St +1 ] = E t [rt +1 |rt +1 < −V a R t +1 ].


A generalized Mincer-Zarnowitz regression can be used to test whether this mean is zero. Let I[r_{t+1} < −V a R_{t+1|t}] indicate that the portfolio return was less than the negative of the VaR. The GMZ regression for testing Expected Shortfall is

(E St +1|t − rt +1 )I[rt +1 <−V a Rt +1|t ] = xt γ (8.30)


where xt , as always, is any set of time t measurable instruments. The natural choices for
xt include a constant and E St +1|t , the forecast expected shortfall. Any other time-t mea-
surable regressor that captures some important characteristic of the tail, such as recent volatility (Σ_{i=0}^{τ} r²_{t−i}) or VaR (V a R_{t−i}), may also be useful in evaluating Expected Shortfall
models. If the Expected Shortfall model is correct, the null that none of the regressors are
useful in predicting the difference, H0 : γ = 0, should not be rejected. If the left-hand side
term – Expected Shortfall “surprise” – in eq. (8.30) is predictable, then the model can be
improved.
Despite the simplicity of the GMZ regression framework for evaluating Expected Shortfall, evaluation is difficult owing to a lack of data on the exceedance mean; Expected Shortfall can only be measured when there is a VaR exceedance, and so 4 years of data would only produce about 50 observations where this was true (when α = 5%). The lack of data about the tail makes evaluating Expected Shortfall models difficult and can lead to a failure to reject in many cases even when using badly misspecified Expected Shortfall models.

8.4 Density Forecasting


VaR (a quantile) provides a narrow view into the riskiness of an asset. More importantly,
VaR may not adequately describe the types of risk an arbitrary forecast consumer may care
about. The same cannot be said for a density forecast which summarizes everything there is
to know about the riskiness of the asset. Density forecasts also nest both VaR and Expected
Shortfall as special cases.
In light of this relationship, it may be tempting to bypass VaR or Expected Shortfall fore-
casting and move directly to density forecasts. Unfortunately density forecasting also suf-
fers from a number of issues, including:

• The density contains all of the information about the random variable being studied,
and so a flexible form is generally needed. The cost of this flexibility is increased pa-
rameter estimation error which can be magnified when computing the expectation
of nonlinear functions of a forecast asset price density (e.g. pricing an option).

• Multi-step density forecasts are difficult (often impossible) to compute since densities do not time aggregate, except in special cases which are usually too simple to be of any interest. This contrasts with standard results for ARMA and ARCH models.

• Unless the user has preferences over the entire distribution, density forecasts inefficiently utilize information.

8.4.1 Density Forecasts from ARCH models

Producing density forecasts from ARCH models is virtually identical to producing VaR
forecasts from ARCH models. For simplicity, only a model with a constant mean and GARCH(1,1)
variances will be used, although the mean and variance can be modeled using richer, more
sophisticated processes.

r_{t+1} = µ + ε_{t+1}
σ²_{t+1} = ω + γ_1 ε²_t + β_1 σ²_t
ε_{t+1} = σ_{t+1} e_{t+1}
e_{t+1} ∼ g(0, 1)   i.i.d.

where g (0, 1) is used to indicate that the distribution of innovations need not be normal but
must have mean 0 and variance 1. Standard choices for g (·) include the standardized Stu-
dent’s t , the generalized error distribution, and Hansen’s skew t . The 1-step ahead density
forecast is

f̂_{t+1|t} =^d g(µ̂, σ̂²_{t+1|t})    (8.31)

where f(·) is the distribution of returns. This follows directly from the original model since r_{t+1} = µ + σ_{t+1} e_{t+1} and e_{t+1} ∼ g(0, 1) i.i.d.

8.4.2 Semiparametric Density forecasting

Semiparametric density forecasting is also similar to its VaR counterpart. The model begins
by assuming that innovations are generated according to some unknown distribution g (·),

r_{t+1} = µ + ε_{t+1}
σ²_{t+1} = ω + γ_1 ε²_t + β_1 σ²_t
ε_{t+1} = σ_{t+1} e_{t+1}
e_{t+1} ∼ g(0, 1)   i.i.d.

and estimates of σ̂2t are computed assuming that the innovations are conditionally nor-
mal. The justification for this choice follows from the strong consistency of the variance
parameter estimates even when the innovations are not normal. Using the estimated vari-
ances, standardized innovations are computed as ê_t = ε̂_t/σ̂_t. The final step is to compute the density. The simplest method to accomplish this is to compute the empirical CDF as

G(e) = T^{-1} Σ_{t=1}^T I[ê_t < e]    (8.32)

which simply counts the fraction of standardized residuals smaller than e. This method is trivial
but has some limitations. First, the PDF does not exist since G (·) is not differentiable. This
makes some applications difficult, although a histogram provides a simple, if imprecise,
method to work around the non-differentiability of the empirical CDF. Second, the CDF is
jagged and is generally an inefficient estimator, particularly in the tails.
An alternative, and more efficient estimator, can be constructed using a kernel to smooth
the density. A kernel density is a local average of the number of ê t in a small neighborhood
of e. The more observations in this neighborhood, the higher the probability in the region, and the larger
the value of the kernel density. The kernel density estimator is defined

g(e) = (T h)^{-1} Σ_{t=1}^T K( (ê_t − e)/h )    (8.33)

where K (·) can be one of many kernels - the choice of which usually makes little difference
– and the normal

K(x) = (1/√(2π)) exp(−x²/2)    (8.34)

or the Epanechnikov,

K(x) = (3/4)(1 − x²)  if −1 ≤ x ≤ 1,  and 0 otherwise,    (8.35)

are the most common. The choice of the bandwidth (h) is more important and is usually set to Silverman's bandwidth, h = 1.06 σ T^{−1/5}, where σ is the standard deviation of ê_t. However, larger or smaller bandwidths can be used to produce smoother or rougher densities, respectively, and the size of the bandwidth represents a bias-variance tradeoff – a

[Figure 8.4: Empirical and smoothed CDF of the S&P 500 – cumulative probability of standardized S&P 500 returns, empirical and smoothed estimates.]

Figure 8.4: The rough empirical and smoothed empirical CDF for standardized returns of
the S&P 500 in 2009 (standardized by a TARCH(1,1,1)).

small bandwidth has little bias but is very jagged (high variance), while a large bandwidth
produces an estimate with substantial bias but very smooth (low variance). If the CDF is
needed, g (e ) can be integrated using numerical techniques such as a trapezoidal approxi-
mation to the Riemann integral.
Finally the density forecast can be constructed by scaling the distribution G by σt +1|t
and adding the mean. Figure 8.4 contains a plot of the smooth and non-smoothed CDF
of TARCH(1,1,1)-standardized S&P 500 returns in 2009. The empirical CDF is jagged and
there are some large gaps in the observed returns.
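A minimal sketch of the Gaussian kernel estimator in eq. (8.33) with Silverman's bandwidth, applied to hypothetical standardized residuals; the CDF is recovered here with a crude Riemann sum rather than the trapezoidal rule mentioned above.

    import numpy as np

    def kernel_density(e_hat, grid, h=None):
        # Gaussian kernel estimator of eq. (8.33) with Silverman's bandwidth
        if h is None:
            h = 1.06 * e_hat.std() * e_hat.shape[0] ** (-1 / 5)
        z = (e_hat[None, :] - grid[:, None]) / h    # (grid point x observation) matrix
        k = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
        return k.mean(axis=1) / h

    rng = np.random.default_rng(9)
    e_hat = rng.standard_normal(500)                # hypothetical standardized residuals
    grid = np.linspace(-4, 4, 201)
    g = kernel_density(e_hat, grid)
    G = np.cumsum(g) * (grid[1] - grid[0])          # crude numerical CDF
    print(g.max(), G[-1])                           # peak density and total mass (close to 1)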

8.4.3 Multi-step density forecasting and the fan plot

Multi-step ahead density forecasting is not trivial. For example, consider a simple GARCH(1,1)
model with normal innovations,

r_{t+1} = µ + ε_{t+1}
σ²_{t+1} = ω + γ_1 ε²_t + β_1 σ²_t
ε_{t+1} = σ_{t+1} e_{t+1}
e_{t+1} ∼ N(0, 1)   i.i.d.

The 1-step ahead density forecast of returns is

rt +1 |Ft ∼ N (µ, σ2t +1|t ). (8.36)


Since innovations are conditionally normal and E_t[σ²_{t+2|t}] is simple to compute, it is tempting to construct a 2-step ahead forecast using a normal,

rt +2 |Ft ∼ N (µ, σ2t +2|t ). (8.37)

This forecast is not correct since the 2-step ahead distribution is a variance-mixture of normals and so is itself non-normal. The reason for the difference is that σ²_{t+2|t}, unlike σ²_{t+1|t}, is a random variable and the uncertainty must be integrated out to determine the distribution of r_{t+2}. The correct form of the 2-step ahead density forecast is

r_{t+2}|F_t ∼ ∫_{−∞}^{∞} φ(µ, σ²(e_{t+1})_{t+2|t+1}) φ(e_{t+1}) de_{t+1}.

where φ(·) is a normal probability density function and σ2 (e t +1 )t +2|t +1 reflects the explicit
dependence of σ2t +2|t +1 on e t +1 . While this expression is fairly complicated, a simpler way to
view it is as a mixture of normal random variables where the probability of getting a specific
normal depends on w (e ),
r_{t+2}|F_t ∼ ∫_{−∞}^{∞} w(e) f(µ, σ(e_{t+1})_{t+2|t+1}) de.

Unless w (e ) is constant, the resulting distribution will not be a normal. The top panel in
figure 8.5 contains the naïve 10-step ahead forecast and the correct 10-step ahead forecast
for a simple GARCH(1,1) process,

r_{t+1} = ε_{t+1}
σ²_{t+1} = .02 + .2 ε²_t + .78 σ²_t
ε_{t+1} = σ_{t+1} e_{t+1}
e_{t+1} ∼ N(0, 1)   i.i.d.

where ε²_t = σ²_t = σ̄² = 1 and hence E_t[σ²_{t+h}] = 1 for all h. The bottom panel contains the plot of the density of a cumulative 10-day return (the sum of the 10 1-day returns). In this case the naïve model assumes that

r_{t+h}|F_t ∼ N(µ, σ²_{t+h|t})

[Figure 8.5: Multi-step density forecasts – top panel: 10-step ahead density; bottom panel: cumulative 10-step ahead density, correct vs. naïve.]

Figure 8.5: Naïve and correct 10-step ahead density forecasts from a simulated GARCH(1,1)
model. The correct density forecasts have substantially fatter tails then the naïve forecast
as evidenced by the central peak and cross-over of the density in the tails.

for h = 1, 2, . . . , 10. The correct forecast has heavier tails than the naïve forecast which can
be verified by checking that the solid line is above the dashed line for large deviations.
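The difference between the naïve and correct multi-step densities is easy to see by simulation. The sketch below draws h-step returns from the GARCH(1,1) used in Figure 8.5 and compares the simulated tail frequency with the normal tail probability the naïve forecast would imply; the function name and number of paths are choices made only for the example.

    import numpy as np
    from scipy import stats

    def simulate_h_step(h, n_paths, omega=0.02, gamma=0.2, beta=0.78, seed=10):
        # Draw r_{t+h} from the GARCH(1,1) above, starting from sigma^2_t = eps^2_t = 1
        rng = np.random.default_rng(seed)
        sigma2 = np.ones(n_paths)
        eps2 = np.ones(n_paths)
        for _ in range(h):
            sigma2 = omega + gamma * eps2 + beta * sigma2
            r = np.sqrt(sigma2) * rng.standard_normal(n_paths)
            eps2 = r ** 2
        return r                                    # the final-step return on each path

    r10 = simulate_h_step(h=10, n_paths=200_000)
    # The naive forecast treats r_{t+10} as N(0,1); the simulated draws have fatter tails
    print(np.mean(np.abs(r10) > 3), 2 * stats.norm.sf(3.0))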

8.4.3.1 Fan Plots

Fan plots are a graphical method to convey information about future changes in uncer-
tainty. Their use has been popularized by the Bank of England and they are a good way
to “wow” an audience unfamiliar with this type of plot. Figure 8.6 contains a simple ex-
ample of a fan plot which contains the density of an irregular ARMA(6,2) which begins at
0 and has i.i.d. standard normal increments. Darker regions indicate higher probability
while progressively lighter regions indicate less likely events.

[Figure 8.6: A fan plot of a standard random walk – forecast density plotted against steps ahead (1–19).]

Figure 8.6: Future density of an ARMA(6,2) beginning at 0 with i.i.d. standard normal in-
crements. Darker regions indicate higher probability while progressively lighter regions
indicate less likely events.

8.4.4 Quantile-Quantile (QQ) plots

Quantile-Quantile, or QQ, plots provide an informal but simple method to assess the fit of
a density or a density forecast. Suppose a set of standardized residuals ê t are assumed to
have a distribution F . The QQ plot is generated by first ordering the standardized residuals,

ê1 < ê2 < . . . < ên −1 < ên

and then plotting the ordered residual ê_j against its hypothetical value if the correct distribution were F, which is the inverse CDF evaluated at j/(T + 1), F^{-1}( j/(T + 1) ). This informal
assessment of a distribution will be extended into a formal statistic in the Kolmogorov-
Smirnov test. Figure 8.7 contains 4 QQ plots for the raw S&P 500 returns against a normal,
a t 8 , a t 3 and a GED with ν = 1.5. The normal, t 8 and GED appear to be badly misspecified
in the tails – as evidenced through deviations from the 45° line – while the t_3 appears to
be a good approximation (the MLE estimate of the Student’s t degree of freedom, ν, was
approximately 3.1).

[Figure 8.7: QQ plots of S&P 500 returns – four panels: Normal; Student's t, ν = 8; GED, ν = 1.5; Student's t, ν = 3.]

Figure 8.7: QQ plots of the raw S&P 500 returns against a normal, a t 8 , a t 3 and a GED with
ν = 1.5. Points along the 45° line indicate a good distributional fit.

8.4.5 Evaluating Density Forecasts

All density evaluation strategies are derived from a basic property of random variables: if x ∼ F, then u ≡ F(x) ∼ U(0, 1). That is, for any random variable x, the CDF evaluated at x has a Uniform distribution over [0, 1]. The converse of this result is also true: if u ∼ U(0, 1), then F^{-1}(u) = x ∼ F.¹⁰

Theorem 8.6 (Probability Integral Transform). Let a random variable X have a continuous,
increasing CDF FX (x ) and define Y = FX (X ). Then Y is uniformly distributed and Pr(Y ≤
y ) = y , 0 < y < 1.
¹⁰The latter result can be used as the basis of a random number generator. To generate a random number with a CDF of F, first generate a uniform, u, and then compute the inverse CDF at u to produce a random number from F, y = F^{-1}(u). If the inverse CDF is not available in closed form, monotonicity of the CDF allows for quick, precise numerical inversion.

Proof (Theorem 8.6). For any y ∈ (0, 1), Y = F_X(X), and so

FY (y ) = Pr(Y ≤ y ) = Pr(FX (X ) ≤ y )
= Pr(FX−1 (FX (X )) ≤ FX−1 (y )) Since FX−1 is increasing
= Pr(X ≤ FX−1 (y )) Invertible since strictly increasing
= FX (FX−1 (y )) Definition of FX
=y

The proof shows that Pr(FX (X ) ≤ y ) = y and so this must be a uniform distribution (by
definition).
The Kolmogorov-Smirnov (KS) test exploits this property of residuals from the correct
distribution to test whether a set of observed data are compatible with a specified distri-
bution F . The test statistic is calculated by first computing the probability integral trans-
formed residuals û t = F (ê t ) from the standardized residuals and then sorting them

u 1 < u 2 < . . . < u n−1 < u n .


The KS test is computed as

KS = max_τ  Σ_{j=1}^{τ} ( I[u_j < τ/T] − 1/T )    (8.38)
   = max_τ ( Σ_{j=1}^{τ} I[u_j < τ/T] − τ/T )

The test finds the point where the distance between the observed cumulative distribution and that of a uniform is largest. The object being maximized is simply the number of u_j less than τ/T minus the expected number of observations which should be less than τ/T. Since the probability integral transformed residuals should be U(0,1) when the model is correct, the number of probability integral transformed residuals expected to be less than τ/T is τ.
KS test is nonstandard but many software packages contain the critical values. Alterna-
tively, simulating the distribution is trivial and precise estimates of the critical values can
be computed in seconds using only uniform pseudo-random numbers.
The KS test has a graphical interpretation as a QQ plot of the probability integral trans-
formed residuals against a uniform. Figure 8.8 contains a representation of the KS test
using data from two series: the first is standard normal and the second is a standardized Student's t_4. 95% confidence bands are denoted with dotted lines. The data from both series were assumed to be standard normal and the t_4 just rejects the null (as evidenced by the cumulative distribution touching the confidence band).

[Figure 8.8: A KS test of normal and standardized t_4 when the data are normal – cumulative probability plotted against observation number, with the normal series, the t with ν = 4, the 45° line and 95% bands.]

Figure 8.8: A simulated KS test with a normal and a t 4 . The t 4 crosses the confidence bound-
ary indicating a rejection of this specification. A good density forecast should have a cu-
mulative distribution close to the 45° line.

8.4.5.1 Parameter Estimation Error and the KS Test

The critical values supplied by most packages do not account for parameter estimation
error and KS tests with estimated parameters are generally less likely to reject than if the
parameters are known. For example, if a sample of 1000 random variables are i.i.d. standard
normal and the mean and variance are known to be 0 and 1, the 90, 95 and 99% CVs for
the KS test are 0.0387, 0.0428, and 0.0512. If the parameters are not known and must be
estimated, the 90, 95 and 99% CVs are reduced to 0.0263, 0.0285, 0.0331. Thus, a desired size of 10% (corresponding to a critical value of 90%) has an actual size closer to 0.1% and the test will not reject the null in many instances where it should.
The solution to this problem is simple. Since the KS test requires knowledge of the entire distribution, it is simple to simulate a sample with length T, to estimate the parameters and to compute the KS test on the simulated standardized residuals (where the residuals are standardized using the estimated parameters). Repeat this procedure B times (B > 1000, possibly larger) and then compute the empirical 90, 95 or 99% quantiles from KS_b, b = 1, 2, . . . , B. These quantiles are the correct values to use under the null while accounting for parameter estimation uncertainty.
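A minimal sketch of the simulation procedure just described: the KS statistic is computed in its standard empirical-CDF form (rather than literally as eq. (8.38)), and the critical values are simulated with the mean and variance re-estimated in every replication. The sample size, number of replications and the normal null are assumptions for the example only.

    import numpy as np
    from scipy import stats

    def ks_stat(u):
        # Standard empirical-CDF form of the KS statistic against a uniform
        u = np.sort(u)
        t = u.shape[0]
        i = np.arange(1, t + 1)
        return np.max(np.maximum(i / t - u, u - (i - 1) / t))

    def simulated_ks_cvs(t, b=1000, seed=11):
        # Simulate critical values with the mean and variance re-estimated each time
        rng = np.random.default_rng(seed)
        ks = np.empty(b)
        for j in range(b):
            x = rng.standard_normal(t)
            e = (x - x.mean()) / x.std()            # standardize with estimated parameters
            ks[j] = ks_stat(stats.norm.cdf(e))      # PIT under the assumed normal
        return np.quantile(ks, [0.90, 0.95, 0.99])

    print(simulated_ks_cvs(t=1000))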

8.4.5.2 Evaluating conditional density forecasts

In a direct analogue to the unconditional case, if x_t|F_{t−1} ∼ F, then û_t ≡ F(x̂_t)|F_{t−1} ∼ i.i.d. U(0, 1). That is, probability integral transformed residuals are conditionally i.i.d. uniform
on [0, 1]. While this condition is simple and easy to interpret, direct implementation of a
test is not. The Berkowitz (2001) test works around this by re-transforming the probability
integral transformed residuals into normals using the inverse Normal CDF . Specifically if
û t = Ft |t −1 (ê t ) are the residuals standardized by their forecast distributions, the Berkowitz
test computes ŷ_t = Φ^{-1}(û_t) = Φ^{-1}(F_{t|t−1}(ê_t)), which have the property, under the null of a correct specification, that ŷ_t ∼ i.i.d. N(0, 1), an i.i.d. sequence of standard normal random variables.
Berkowitz proposes using a regression model to test the y_t for i.i.d. N(0, 1). The test is implemented by estimating the parameters of

y_t = φ_0 + φ_1 y_{t−1} + η_t

via maximum likelihood. The Berkowitz test is computed using the likelihood ratio test

LR = 2(l(θ̂; y) − l(θ_0; y)) ∼ χ²_3    (8.39)

where θ_0 are the parameters under the null, corresponding to parameter values of φ_0 = φ_1 = 0 and σ² = 1 (3 restrictions). In other words, under the null the y_t are independent normal random variables with a variance of 1.
random variables with a variance of 1. As is always the case in tests of conditional models,
the regression model can be augmented to include any time t − 1 available instrument and
a more general specification is

yt = xt γ + ηt

where xt may contains a constant, lagged yt or anything else relevant for evaluating a den-
sity forecast. In the general specification, the null is H0 : γ = 0, σ2 = 1 and the alternative is
the unrestricted estimate from the alternative specification. The likelihood ratio test statis-
tic in the case would have a χK2 +1 distribution where K is the number of elements in xt (the
+1 comes from the restriction that σ2 = 1).
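A minimal sketch of the Berkowitz test: the probability integral transforms are mapped to normals, an AR(1) with free mean and variance is estimated by (conditional) maximum likelihood, and the LR statistic of eq. (8.39) is compared to a χ²_3. The log-variance parameterization, the use of Nelder-Mead and the simulated uniforms are choices made for the example only.

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize

    def berkowitz_test(u_hat):
        # y_t = Phi^{-1}(u_t) should be i.i.d. N(0,1); the alternative is an AR(1) with
        # free mean and variance, giving the 3-restriction LR test of eq. (8.39)
        y = stats.norm.ppf(u_hat)
        y0, y1 = y[:-1], y[1:]

        def neg_loglik(theta):
            phi0, phi1, log_s2 = theta
            resid = y1 - phi0 - phi1 * y0
            s2 = np.exp(log_s2)
            return 0.5 * np.sum(np.log(2 * np.pi * s2) + resid ** 2 / s2)

        res = minimize(neg_loglik, np.zeros(3), method="Nelder-Mead")
        lr = 2 * (neg_loglik(np.zeros(3)) - res.fun)    # null: phi0 = phi1 = 0, sigma^2 = 1
        return lr, stats.chi2.sf(lr, df=3)

    rng = np.random.default_rng(12)
    u_hat = rng.uniform(size=1000)                  # hypothetical probability integral transforms
    print(berkowitz_test(u_hat))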

8.5 Coherent Risk Measures

With multiple measures of risk available, which should be chosen: variance, VaR, or Expected Shortfall? Recent research into risk measures has identified a number of desirable
properties for measures of risk. Let ρ be a generic measure of risk that maps the riskiness
of a portfolio to an amount of required reserves to cover losses that regularly occur and let
P , P1 and P2 be portfolios of assets.

Drift Invariance The required reserves for portfolio P satisfy

ρ(P + c) = ρ(P) − c

That is, adding a constant return c to P decreases the required reserves by that amount.

Homogeneity The required reserves are linearly homogeneous,

ρ(λP ) = λρ(P ) for any λ > 0 (8.40)

The homogeneity property states that the required reserves of two portfolios with the same relative holdings of assets depend linearly on the scale – doubling the size of
a portfolio while not altering its relative composition generates twice the risk, and
requires twice the reserves to cover regular losses.

Monotonicity If P1 first order stochastically dominates P2 , the required reserves for P1 must
be less than those of P2
ρ(P1 ) ≤ ρ(P2 ) (8.41)
If P1 FOSD P2 then the value of portfolio P1 will be larger than the value of portfolio P2 in every state of the world, and so P1 must be less risky.

Subadditivity The required reserves for the combination of two portfolios are no greater than the sum of the required reserves for each treated separately,

ρ(P1 + P2 ) ≤ ρ(P1 ) + ρ(P2 ) (8.42)

Definition 8.7 (Coherent Risk Measure). Any risk measure which satisfies these four prop-
erties is known as coherent.
Coherency seems like a good thing for a risk measure. The first three conditions are indis-
putable. For example, in the third, if P1 FOSD P2 , then P1 will always have a higher return
and must be less risky. The last is somewhat controversial.
Theorem 8.8 (Value-at-Risk is not Coherent). Value-at-Risk is not coherent since it fails the subadditivity criterion. It is possible to have a VaR which is superadditive, where the Value-at-Risk of the combined portfolio is greater than the sum of the Values-at-Risk of the two portfolios.
Examples of the superadditivity of VaR usually require the portfolio to depend nonlin-
early on some assets (i.e. hold derivatives). Expected Shortfall, on the other hand, is a
coherent measure of risk.
Theorem 8.9 (Expected Shortfall is Coherent). Expected shortfall is a coherent risk measure.
The coherency of Expected Shortfall is fairly straightforward to show in many cases (for
example, if the returns are jointly normally distributed) although a general proof is diffi-
cult and provides little intuition. However, that Expected Shortfall is coherent and VaR is

not does not make Expected Shortfall a better choice. VaR has a number of advantages
for measuring risk since it only requires the modeling of a quantile of the return distribu-
tion, VaR always exists and is finite and there are many widely tested methodologies for
estimating VaR. Expected Shortfall requires an estimate of the mean in the tail which is
substantially harder than simply estimating the VaR and may not exist in some cases. Additionally, in most realistic cases, increases in the Expected Shortfall will be accompanied by increases in the VaR and they will both broadly agree about the risk of the portfolio.

Exercises

Exercise 8.1. Precisely answer the following questions

i. What is VaR?

ii. What is expected shortfall?

iii. Describe two methods to estimate the VaR of a portfolio? Compare the strengths and
weaknesses of these two approaches.

iv. Suppose two bankers provide you with VaR forecasts (which are different) and you
can get data on the actual portfolio returns. How could you test for superiority? What
is meant by better forecast in this situation?

Exercise 8.2. The figure below plots the daily returns on IBM from 1 January 2007 to 31
December 2007 (251 trading days), along with 5% Value-at-Risk (VaR) forecasts from two
models. The first model (denoted “HS”) uses ‘historical simulation’ with a 250-day window
of data. The second model uses a GARCH(1,1) model, assuming that daily returns have a
constant conditional mean, and are conditionally Normally distributed (denoted ‘Normal-
GARCH’ in the figure).
[Figure: Daily returns on IBM in 2007 (in percent), with 5% VaR forecasts from the HS and Normal-GARCH models, January–December 2007.]

i. Briefly describe one other model for VaR forecasting, and discuss its pros and cons
relative to the ‘historical simulation’ model and the Normal-GARCH model.

ii. For each of the two VaR forecasts in the figure, a sequence of ‘hit’ variables was con-
structed:
 
Hit_t^{HS} = 1{ r_t ≤ V̂aR_t^{HS} }

Hit_t^{GARCH} = 1{ r_t ≤ V̂aR_t^{GARCH} }

where 1{r_t ≤ a} = 1 if r_t ≤ a, and 0 if r_t > a,

and the following regression was run (standard errors are in parentheses below the
parameter estimates):

H i t tH S = 0.0956 + u t
(0.0186)

H i t tG AR C H = 0.0438 + u t
(0.0129)

(a) How can we use the above regression output to test the accuracy of the VaR fore-
casts from these two models?

(b) What do the tests tell us?

iii. Another set of regressions was also run (standard errors are in parentheses below the
parameter estimates):

Hit_t^{HS}    = 0.1018 − 0.0601 Hit_{t−1}^{HS} + u_t
                (0.0196)  (0.0634)

Hit_t^{GARCH} = 0.0418 + 0.0491 Hit_{t−1}^{GARCH} + u_t
                (0.0133)  (0.0634)

A joint test that the intercept is 0.05 and the slope coefficient is zero yielded a chi-
squared statistic of 6.9679 for the first regression, and 0.8113 for the second regres-
sion.

(a) Why are these regressions potentially useful?

(b) What do the results tell us? (The 95% critical values for a chi-squared variable

with q degrees of freedom are given below:)

q 95% critical value


1 3.84
2 5.99
3 7.81
4 9.49
5 11.07
10 18.31
25 37.65
249 286.81
250 287.88
251 288.96

Exercise 8.3. Figure 8.9 plots the daily returns from 1 January 2008 to 31 December 2008
(252 trading days), along with 5% Value-at-Risk (VaR) forecasts from two models. The first
model (denoted “HS”) uses ‘historical simulation’ with a 250-day window of data. The sec-
ond model uses a GARCH(1,1) model, assuming that daily returns have a constant condi-
tional mean, and are conditionally Normally distributed (denoted ‘Normal-GARCH’ in the
figure).

i. Briefly describe one other model for VaR forecasting, and discuss its pros and cons
relative to the ‘historical simulation’ model and the Normal-GARCH model.

ii. For each of the two VaR forecasts in the figure, a sequence of ‘hit’ variables was con-
structed:
 
Hit_t^{HS} = 1{ r_t ≤ V̂aR_t^{HS} }

Hit_t^{GARCH} = 1{ r_t ≤ V̂aR_t^{GARCH} }

where 1{r_t ≤ a} = 1 if r_t ≤ a, and 0 if r_t > a,

and the following regression was run (standard errors are in parentheses below the
parameter estimates):

H i t tH S = 0.0555 + u t
(0.0144)

H i t tG AR C H = 0.0277 + u t
(0.0103)

(a) How can we use the above regression output to test the accuracy of the VaR fore-
casts from these two models?

(b) What do the tests tell us?

iii. Another set of regressions was also run (standard errors are in parentheses below the
parameter estimates):

Hit_t^{HS}    = 0.0462 + 0.1845 Hit_{t−1}^{HS} + u_t
                (0.0136)  (0.1176)

Hit_t^{GARCH} = 0.0285 − 0.0285 Hit_{t−1}^{GARCH} + u_t
                (0.0106)  (0.0106)

A joint test that the intercept is 0.05 and the slope coefficient is zero yielded a chi-
squared statistic of 8.330 for the first regression, and 4.668 for the second regression.

(a) Why are these regressions potentially useful?


(b) What do the results tell us? (The 95% critical values for a chi-squared variable
with q degrees of freedom are given below:)

iv. Comment on the similarities and/or differences between what you found in (b) and
(c).

q 95% critical value


1 3.84
2 5.99
3 7.81
4 9.49
5 11.07
10 18.31
25 37.65
249 286.81
250 287.88
251 288.96

Exercise 8.4. Answer the following question:

i. Assume that X is distributed according to some distribution F, and that F is contin-


uous and strictly increasing. Define U ≡ F(X). Show that U ∼ Uniform(0, 1).

ii. Assume that V ∼ Uniform(0, 1), and that G is some continuous and strictly increasing distribution function. If we define Y ≡ G^{-1}(V), show that Y ∼ G.

For the next two parts, consider the problem of forecasting the time taken for the
price of a particular asset (Pt ) to reach some threshold (P ∗ ). Denote the time (in days)

[Figure 8.9: Returns and VaRs – returns, HS forecast VaR and Normal GARCH forecast VaR over the 252 trading days of 2008.]

Figure 8.9: Returns, Historical Simulation VaR and Normal GARCH VaR.

taken for the asset to reach the threshold as Z t . Assume that the true distribution of
Z t is Exponential with parameter β ∈ (0, ∞) :

Z_t ∼ Exponential(β)

so F(z; β) = 1 − exp{−βz} for z ≥ 0, and F(z; β) = 0 for z < 0.

Now consider a forecaster who gets the distribution correct, but the parameter wrong. Denote her distribution forecast as F̂(z) = Exponential(β̂).

iii. If we define U ≡ F̂(Z), show that Pr[U ≤ u] = 1 − (1 − u)^{β/β̂} for u ∈ (0, 1), and interpret.

iv. Now think about the case where β̂ is an estimate of β, such that β̂ →p β as n → ∞. Show that Pr[U ≤ u] →p u as n → ∞, and interpret.
Chapter 9

Multivariate Volatility, Dependence and


Copulas

Multivariate modeling is in many ways similar to modeling the volatility


of a single asset. The primary challenges which arise in the multivariate
problem are ensuring that the forecast covariance is positive definite and
limiting the number of parameters which need to be estimated as the num-
ber of assets becomes large. This chapter covers standard “simple” multi-
variate models, multivariate ARCH models and realized covariance. Atten-
tion then turns to measures of dependence which go beyond simple linear
correlation, and the chapter concludes with an introduction to a general
framework for modeling multivariate returns which use copulas.

9.1 Introduction
Multivariate volatility or covariance modeling is a crucial ingredient in modern portfolio management. It is useful for a number of important tasks including:
• Portfolio Construction - Classic Markowitz (1959) portfolio construction requires an
estimate of the covariance of returns, along with the expected returns of the assets,
to determine the optimal portfolio weights. The Markowitz problem finds the portfolio with the minimum variance subject to achieving a required mean. Alternatively,
the Markowitz problem can be formulated as maximizing the expected mean of the
portfolio given a constraint on the volatility of the portfolio.

• Portfolio Sensitivity Analysis - Many portfolios are constructed using other objectives
than the “pure” Markowitz problem. For example, fund managers may be selecting
firms based on beliefs about fundamental imbalances between the firm and its com-
petitors. Return covariance is useful in these portfolios for studying the portfolio sen-
sitivity to adding additional or liquidating existing positions, especially when multi-
ple investment opportunities exist which have similar risk-return characteristics.

• Value-at-Risk - Portfolio Value-at-Risk often begins with the covariance of the assets
held in the portfolio. Naive V a R uses a constant value for the lower α-quantile mul-
tiplied by the standard deviation of the portfolio. More sophisticated risk measures
examine the joint tail behavior – that is, the probability that two or more assets have
large, usually negative returns in the same period. Copulas are a natural method for
studying the extreme behavior of asset returns.

• Credit Pricing - Many credit products are written on a basket of bonds, and so the
correlation between the payouts of the underlying bonds is crucial for determining
the value of the portfolio.

• Correlation Trading - Recent financial innovation has produced contracts where cor-
relation can be directly traded. The traded correlation is formally equicorrelation (See
9.3.5) and measuring and predicting correlation is a directly profitable strategy, at
least if it can be done well.

This chapter begins with an overview of simple, static estimators for covariance which are widely used. The chapter then turns to dynamic models for conditional covariance based
on the ARCH framework. The third topic is realized covariance which exploits ultra-high
frequency data in the same manner as realized variance. The chapter concludes with an ex-
amination of non-linear dependence and copulas, a recent introduction to financial econo-
metrics which allows for the flexible construction of multivariate models.

9.2 Preliminaries
Most volatility models are built using either returns, which is appropriate if the time hori-
zon is small and/or the conditional mean is small relative to the conditional volatility, or
demeaned returns when using longer time-spans or if working with series with a non-trivial
mean (e.g. electricity prices). The k by 1 vector of returns is denoted rt , and the corre-
sponding demeaned returns are εt = rt − µt where µt ≡ E t −1 [rt ].
The conditional covariance, E_{t−1}[ε_t ε_t'] ≡ Σ_t, is assumed to be a k by k positive definite matrix. Some models will make use of “devolatilized” residuals defined as u_{i,t} = ε_{i,t}/σ_{i,t}, i = 1, 2, . . . , k, or in matrix notation u_t = ε_t ⊘ σ_t where ⊘ denotes Hadamard division (element-by-element). Multivariate standardized residuals, which are both “devolatilized” and “decorrelated”, are defined e_t = Σ_t^{−1/2} ε_t so that E_{t−1}[e_t e_t'] = I_k. Some models explicitly parameterize the conditional correlation, E_{t−1}[u_t u_t'] ≡ R_t = Σ_t ⊘ (σ_t σ_t'), or equivalently R_t = D_t^{−1} Σ_t D_t^{−1} where

D_t = diag(σ_{1,t}, σ_{2,t}, . . . , σ_{k,t})

and so Σt = Dt Rt Dt .
Some models utilize a factor structure to reduce the dimension of the estimation prob-
lem. The p by 1 vector of factors is denoted ft and the factor returns are assumed to be
mean 0 (or demeaned if the assumption of conditional mean 0 is inappropriate). The con-
ditional covariance of the factors is denoted Σ_t^f ≡ E_{t−1}[f_t f_t'].

This chapter focuses exclusively on models capable of predicting the time-t covariance us-
ing information in Ft −1 . Multi-step forecasting is possible from most models in this chapter
by direct recursion. Alternatively, direct forecasting techniques can be used to mix higher
frequency data (e.g. daily) with longer forecast horizons (e.g. 2-week or monthly).

9.2.1 Synchronization
Synchronization is an important concern when working with a cross-section of asset re-
turns, and non-synchronous returns can arise for a number of reasons:

• Closing hour differences – Closing hour differences are more important when work-
ing with assets that trade in different markets. The NYSE closes at either 20:00 or
21:00 GMT, depending on whether the U.S. east coast is using EDT or EST. The Lon-
don Stock Exchange closes at 15:30 or 16:30 GMT, depending on whether the U.K. is using
GMT or BST. This difference means that changes in U.S. equity prices that occur after
the LSE has closed will not appear in U.K. equities until the next trading day.
Even within the same geographic region markets have different trading hours. Com-
mon U.S. equities trade from 9:30 until 16:00 EDT/EST time. U.S. government bond
futures are traded using open outcry from 7:20 a.m. to 14:00. Light Sweet Crude fu-
tures trade 9:00 - 14:30 in an open outcry session. All of these assets also trade elec-
tronically; Light sweet crude trades electronically 18:00 - 17:15 (all but 45 minutes per
day) from Sunday to Friday.

• Market closures due to public holidays – Different markets are closed for different
holidays which is a particular problem when working with international asset returns.

• Delays in opening/closing – Assets that trade on the same exchange may be subject
to opening or closing delays. For example, the gap between the first and last stock
to open in the S&P 500 is typically 15 minutes. The closing spread is similar. These
small differences lead to problems measuring the covariance when using intra-daily
returns.

• Illiquidity/Stale Prices - Some assets trade more than others. The most liquid stock in
the S&P 500 has a daily volume that is typically at least 100 times larger than the least
liquid. Illiquidity is problematic when measuring covariance using intra-daily data.1
1
On February 26, 2010, the most liquid S&P 500 company was Bank of America (BAC) which had a volume
of 96,352,600. The least liquid S&P 500 company was the Washington Post (WPO) which had a volume of
21,000. IMS Healthcare (RX) did not trade since it was acquired by another company.

There are three solutions to the problem of non-synchronous data. The first is to use rela-
tively long time-spans. When using daily data, NYSE and LSE data are typically only simul-
taneously open for 2 hours out of 6.5 (30%). If using multi-day returns, the lack of common
opening hours is less problematic since developments in U.S. equities on one day will show
up in prices changes in London on the next day. For example, when using 2-day returns, it
is as if 8.5 out of the 13 trading hours are synchronous (65%). When using 5 day returns it is
as if 28 out of 32.5 hours are synchronized (86%). The downside of using aggregate returns
is the loss of data which results in inefficient estimators as well as difficulty in capturing
recent changes.
The second solution is to use synchronized prices (also known as pseudo-closing prices).
Synchronized prices are collected when all markets are simultaneously open. For exam-
ple, if using prices of NYSE and LSE listed firms, a natural sampling time would be 1 hour
before the LSE closes, which typically corresponds to 10:30 Eastern time. Daily returns
constructed from these prices should capture all of the covariance between these assets.
The downside of using synchronized prices is that many markets have no common hours,
which is an especially acute problem when measuring the covariance of a global portfolio.
The third solution is to synchronize otherwise non-synchronous returns using a vector
moving average (Burns, Engle & Mezrich 1998). Suppose returns were ordered in such a
way that the first to close was in position 1, the second to close was in position 2, and so on
until the last to close was in position k. With this ordering, the return on day t + 1 for asset i
may be correlated with the return on day t for asset j whenever j > i, while the return
on day t + 1 should not be correlated with the day t return on asset j when j ≤ i.
For example, consider returns from assets that trade on the Australian Stock Exchange
(UTC 0:00 - 6:10), the London Stock Exchange (UTC 8:00 - 16:30), NYSE (UTC 14:30 - 21:30)
and Tokyo Stock Exchange (UTC 18:00 - 0:00 (+1 day)). The ASX closes before any of the
others open, and so the contemporaneous correlation with the LSE, NYSE and TSE should
pick up all of the correlation between Australian equities and the rest of the world on day
t . The LSE opens second and so innovations in the LSE on day t may be correlated with
changes on the ASX on t + 1. Similarly innovations in New York after UTC 16:30 will show
up in day t + 1 in the ASX and LSE. Finally news which comes out when the TSE is open will
show up in the day t + 1 return in the 3 other markets. This leads to a triangular structure
in a vector moving average,

$$
\begin{bmatrix} r_t^{ASX} \\ r_t^{LSE} \\ r_t^{NYSE} \\ r_t^{TSE} \end{bmatrix} =
\begin{bmatrix} 0 & \theta_{12} & \theta_{13} & \theta_{14} \\ 0 & 0 & \theta_{23} & \theta_{24} \\ 0 & 0 & 0 & \theta_{34} \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \varepsilon_{t-1}^{ASX} \\ \varepsilon_{t-1}^{LSE} \\ \varepsilon_{t-1}^{NYSE} \\ \varepsilon_{t-1}^{TSE} \end{bmatrix} +
\begin{bmatrix} \varepsilon_{t}^{ASX} \\ \varepsilon_{t}^{LSE} \\ \varepsilon_{t}^{NYSE} \\ \varepsilon_{t}^{TSE} \end{bmatrix} \qquad (9.1)
$$

The recursive structure of this system makes estimation simple since r_t^{TSE} = ε_t^{TSE}, and so
the model for r_t^{NYSE} is an MA(1)-X. Given estimates of ε_t^{NYSE}, the model for r_t^{LSE} is also an
MA(1)-X.
In vector form this adjustment model is

rt = Θεt −1 + εt
where rt is the k by 1 vector of nonsynchronous returns. Synchronized returns, r̂t are con-
structed using the VMA parameters as

r̂t = (Ik + Θ) εt .
The role of Θ is to capture any component of the return of asset j which appears in the return of
asset i when the market where i trades closes later than the market where j trades.
In essence this procedure "brings forward" the fraction of the return which has not yet occurred
by the close of the market where asset j trades. Finally, the conditional covariance of ε_t
is Σ_t, and so the covariance of the synchronized returns is E_{t-1}[r̂_t r̂_t'] = (I_k + Θ) Σ_t (I_k + Θ)'.
Implementing this adjustment requires fitting the conditional covariance to the residual
from the VMA, εt , rather than to returns directly.

9.3 Simple Models of Multivariate Volatility


Many simple models which do not require complicated parameter estimation are widely
used as benchmarks.

9.3.1 Moving Average Covariance

The n-period moving average is the simplest covariance estimator.

Definition 9.1 (n -period Moving Average Covariance). The n -period moving average co-
variance is defined
$$
\Sigma_t = n^{-1}\sum_{i=1}^{n} \varepsilon_{t-i}\varepsilon_{t-i}' \qquad (9.2)
$$

When returns are measured daily, common choices for n include 22 (monthly), 66 (quar-
terly), and 252 (annual). When returns are measured monthly, common choices for n are
12 (annual) and 60. Moving average covariances are often imprecise measures since they
simultaneously give too little weight to recent observations while giving too much to observations
in the distant past.
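To make the estimator concrete, a minimal Python/NumPy sketch is given below; the function name and the array eps of demeaned returns are illustrative and not part of the original text.

import numpy as np

def moving_average_covariance(eps, n):
    # eps: (T, k) array of demeaned returns; n: window length (e.g. 22 for daily data)
    T, k = eps.shape
    sigma = np.full((T, k, k), np.nan)
    for t in range(n, T):
        window = eps[t - n:t]                # eps_{t-n}, ..., eps_{t-1}
        sigma[t] = window.T @ window / n     # n^{-1} times the sum of outer products
    return sigma

# Example: monthly (22-day) moving average covariance of simulated daily returns
eps = np.random.standard_normal((500, 3)) * 0.01
Sigma = moving_average_covariance(eps, 22)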

9.3.2 Exponentially Weighted Moving Average Covariance

Exponentially weighted moving averages (EWMA) provide an alternative to moving average


covariance estimators which allow for more weight on recent information. EWMAs have
been popularized in the volatility literature by RiskMetrics, which was introduced in the
univariate context in chapter 8.

Definition 9.2 (Exponentially Weighted Moving Average Covariance). The EWMA covari-
ance is defined recursively as

$$
\Sigma_t = (1-\lambda)\varepsilon_{t-1}\varepsilon_{t-1}' + \lambda\Sigma_{t-1} \qquad (9.3)
$$

for λ ∈ (0, 1). EWMA covariance is equivalently defined through the infinite moving average

$$
\Sigma_t = (1-\lambda)\sum_{i=1}^{\infty}\lambda^{i-1}\varepsilon_{t-i}\varepsilon_{t-i}'. \qquad (9.4)
$$

Implementation of an EWMA covariance estimator requires an initial value for Σ1 , which


can be set to the average covariance over the first m days for some m > k or could be set to
the full-sample covariance. The single parameter, λ, is usually set to .94 for daily data and
.97 for monthly data based on recommendations from RiskMetrics (J.P.Morgan/Reuters
1996).
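A short sketch of the recursion in eq. (9.3), initializing Σ₁ with the full-sample covariance (one of the two options mentioned above); the names below are illustrative.

import numpy as np

def ewma_covariance(eps, lam=0.94):
    # Sigma_t = (1 - lam) * eps_{t-1} eps_{t-1}' + lam * Sigma_{t-1}
    T, k = eps.shape
    sigma = np.zeros((T, k, k))
    sigma[0] = np.cov(eps, rowvar=False)     # initial value for Sigma_1
    for t in range(1, T):
        sigma[t] = (1 - lam) * np.outer(eps[t - 1], eps[t - 1]) + lam * sigma[t - 1]
    return sigma

Sigma = ewma_covariance(np.random.standard_normal((500, 3)) * 0.01, lam=0.94)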

Definition 9.3 (RiskMetrics 1994 Covariance). The RiskMetrics 1994 Covariance is com-
puted as an EWMA with λ = .94 for daily data or λ = .97 for monthly.

The RiskMetrics EWMA estimator is formally known as RM1994, and has been sur-
passed by RM2006 which uses a long memory model for volatility. Long memory requires
that the weights on past returns decay hyperbolically (w ∝ i^{-α}, α > 0) rather than exponentially
(w ∝ λ^i). The new methodology extends the 1994 methodology by computing
the volatility as a weighted sum of EWMAs (eq. 9.5, line 1) rather than a single EWMA (eq.
9.3).
The RM2006 covariance estimator is computed as the average of m EWMA covariances.

Definition 9.4 (RiskMetrics 2006 Covariance). The RiskMetrics 2006 Covariance is com-
puted as
$$
\Sigma_t = \sum_{i=1}^{m} w_i\,\Sigma_{i,t} \qquad (9.5)
$$
$$
\Sigma_{i,t} = (1-\lambda_i)\,\varepsilon_{t-1}\varepsilon_{t-1}' + \lambda_i\Sigma_{i,t-1}
$$
$$
w_i = \frac{1}{C}\left(1 - \frac{\ln(\tau_i)}{\ln(\tau_0)}\right), \qquad
\lambda_i = \exp\left(-\frac{1}{\tau_i}\right), \qquad
\tau_i = \tau_1\rho^{i-1},\; i = 1, 2, \ldots, m
$$

where C is a normalization constant which ensures that Σ_{i=1}^{m} w_i = 1.

The 2006 methodology uses a 3-parameter model which includes a logarithmic decay
factor, τ_0 (1560), a lower cut-off, τ_1 (4), and an upper cutoff τ_max (512)² [suggested values in
parentheses], using eq. (9.5) (Zumbach 2007). One additional parameter, ρ, is required to
operationalize the model, and RiskMetrics suggests √2.
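A sketch of the RM2006 recursion using the suggested values τ₀ = 1560, τ₁ = 4, τ_max = 512 and ρ = √2, with m computed as in footnote 2; all names are illustrative.

import numpy as np

def rm2006_covariance(eps, tau0=1560.0, tau1=4.0, tau_max=512.0, rho=np.sqrt(2.0)):
    T, k = eps.shape
    m = int(round(np.log(tau_max / tau1) / np.log(rho)))   # m = ln(tau_max/tau_1)/ln(rho)
    tau = tau1 * rho ** np.arange(m)                       # tau_i = tau_1 * rho^(i-1)
    lam = np.exp(-1.0 / tau)                               # lambda_i
    w = 1.0 - np.log(tau) / np.log(tau0)
    w /= w.sum()                                           # C normalizes the weights to sum to 1
    comp = np.tile(np.cov(eps, rowvar=False), (m, 1, 1))   # one EWMA component per tau_i
    sigma = np.zeros((T, k, k))
    sigma[0] = np.tensordot(w, comp, axes=1)
    for t in range(1, T):
        outer = np.outer(eps[t - 1], eps[t - 1])
        comp = (1 - lam)[:, None, None] * outer + lam[:, None, None] * comp
        sigma[t] = np.tensordot(w, comp, axes=1)           # weighted sum of the m EWMAs
    return sigma

Sigma = rm2006_covariance(np.random.standard_normal((500, 3)) * 0.01)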
Both RiskMetrics covariance estimators can be expressed as Σ_t = Σ_{i=1}^{∞} γ_i ε_{t-i}ε_{t-i}' for a
set of weights {γ_i}. Figure 9.1 contains a plot of the weights for the 120 most recent observations
from both the RM1994 and RM2006 estimators. The new methodology has both
higher weight on recent data, and higher weight on data in the distant past. One method
to compare the two models is to consider how many periods it takes for 99% of the weight
to have been accumulated, i.e. the smallest n such that Σ_{i=0}^{n} γ_i ≥ .99. For the RM1994 methodology, this happens
in 75 days – the RM2006 methodology requires 619 days to achieve the same target.
The first 75 weights in the RM2006 estimator contain 83% of the weight, and so 1/6 of the
total weight depends on returns more than 6 months in the past.

9.3.3 Observable Factor Covariance

The n -period factor model assumes that returns are generated by a strict factor structure
and is closely related to the CAP-M (Black 1972, Lintner 1965, Sharpe 1964), the intertempo-
ral CAP-M (Merton 1973) and Arbitrage Pricing Theory (Roll 1977). Moving average factor
covariance estimators can be viewed as restricted versions of the standard moving average
covariance estimator where all covariance is attributed to common exposure to a set
of factors. The model postulates that the return on the ith asset is generated by a set of p
observable factors with returns f_t, a p by 1 set of asset-specific factor loadings, β_i, and an
idiosyncratic shock η_{i,t},

ε_{i,t} = f_t'β_i + η_{i,t}.

The k by 1 vector of returns can be compactly described as

εt = β ft + ηt

where β is a k by p matrix of factor loadings and η_t is a k by 1 vector of idiosyncratic shocks.
The shocks are assumed to be white noise, cross-sectionally uncorrelated (E_{t-1}[η_{i,t} η_{j,t}] = 0)
and uncorrelated with the factors.

Definition 9.5 (n -period Factor Covariance). The n -period factor covariance is defined as
$$
\Sigma_t = \beta\,\Sigma_t^{f}\,\beta' + \Omega_t \qquad (9.6)
$$

² τ_max does not directly appear in the equations for the RM2006 framework, but is implicitly included since
m = ln(τ_max/τ_1)/ln ρ.

[Figure 9.1 about here: Weights in RiskMetrics Estimators – weight (y-axis) against lag (x-axis) for the RM1994 and RM2006 estimators.]

Figure 9.1: These two lines show the weights on lagged outer-product of returns εt ε0t


in the 1994 and 2006 versions of the RiskMetrics methodology. The 2006 version features
more weight on recent volatility and more weight on volatility in the distant past relative to
the 1994 methodology.

where Σ_t^f = n^{-1} Σ_{i=1}^{n} f_{t-i}f_{t-i}' is the n-period moving covariance of the factors,

$$
\beta_t = \left(\sum_{i=1}^{n} \mathbf{f}_{t-i}\mathbf{f}_{t-i}'\right)^{-1}\sum_{i=1}^{n} \mathbf{f}_{t-i}\varepsilon_{t-i}'
$$

is the p by k matrix of factor loadings and Ω_t is a diagonal matrix with ω²_{j,t} = n^{-1} Σ_{i=1}^{n} η²_{j,t-i}
in the jth diagonal position where η_{i,t} = ε_{i,t} − f_t'β_i are the regression residuals.
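A sketch of Definition 9.5 in Python, estimating the loadings by OLS over the most recent n observations; the arrays eps (T by k demeaned returns) and f (T by p factor returns) and the function name are illustrative assumptions.

import numpy as np

def factor_covariance(eps, f, n):
    E, F = eps[-n:], f[-n:]                     # last n observations
    sigma_f = F.T @ F / n                       # n-period factor covariance
    beta = np.linalg.solve(F.T @ F, F.T @ E)    # p by k factor loadings
    eta = E - F @ beta                          # regression residuals
    omega = np.diag((eta ** 2).mean(axis=0))    # diagonal idiosyncratic variances
    return beta.T @ sigma_f @ beta + omega      # beta' Sigma_f beta + Omega

eps = np.random.standard_normal((1000, 10)) * 0.01
f = np.random.standard_normal((1000, 3)) * 0.01
Sigma = factor_covariance(eps, f, 252)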

While the moving average factor covariance is a restricted version of the standard moving
average covariance estimator, it does have one important advantage: factor covariance es-
timators are always positive definite as long as the number of periods used to estimate the
factor covariance and factor loadings is larger than the number of factors (n > p ). The
standard moving average covariance requires the number of periods used in the estimator
to be larger than the number of assets to be positive definite (n > k ). This difference makes
the factor structure suitable for large portfolios.
The factor covariance estimator can also be easily extended to allow for different asset
classes which may have exposure to different factors by restricting coefficients on unre-
lated factors to be zero. For example, suppose a portfolio consisted of equity and credit
products, and that a total of 5 factors were needed to model the covariance – 1 common
to all assets, 2 specific to equities and 2 specific to bonds. The factor covariance would be
a 5 by 5 matrix, but the factor loadings for any asset would only have three non-zero coefficients,
the common and equity factors if the asset was an equity or the common and
the credit factors if the asset was a bond. Zero restrictions in the factor loadings allow for
parsimonious models to be built to model complex portfolios, even in cases where many
factors are needed to span the range of assets in a portfolio.

9.3.4 Principal Component Covariance


Principal component analysis (PCA) is a statistical technique which can be used to decompose
a T by k matrix Y into a T by k set of orthogonal factors, F, and a k by k set of normalized
weights (or factor loadings), β. Formally the principal component problem can be defined as

$$
\min_{\beta, F}\; (kT)^{-1}\sum_{i=1}^{k}\sum_{t=1}^{T}\left(y_{i,t} - \mathbf{f}_t\beta_i\right)^2 \quad \text{subject to } \beta'\beta = \mathbf{I}_k \qquad (9.7)
$$

where f_t is a 1 by k vector of common factors and β_i is a k by 1 vector of factor loadings. The
solution to the principal component problem is given by the eigenvalue decomposition of
the outer product of Y, Ω = Y'Y = Σ_{t=1}^{T} y_t y_t'.

Definition 9.6 (Orthonormal Matrix). A k -dimensional orthonormal matrix U satisfies U0 U =


Ik , and so U0 = U−1 .
Definition 9.7 (Eigenvalue). The eigenvalues of a real, symmetric k by k matrix A
are the k solutions to
|λIk − A| = 0 (9.8)
where | · | is the determinant.
Definition 9.8 (Eigenvector). A k by 1 vector u is an eigenvector corresponding to an
eigenvalue λ of a real, symmetric k by k matrix A if

Au = λu (9.9)

Theorem 9.9 (Spectral Decomposition Theorem). A real, symmetric matrix A can be fac-
tored into A = UΛU0 where U is an orthonormal matrix (U0 = U−1 ) containing the eigen-
vectors of A in its columns and Λ is a diagonal matrix with the eigenvalues λ1 , λ2 ,. . .,λk of A
along its diagonal.
Since Y'Y = Ω is real and symmetric with eigenvalues Λ = diag(λ_i)_{i=1,...,k}, the factors can
be determined using

$$
\begin{aligned}
\mathbf{Y}'\mathbf{Y} &= \mathbf{U}\Lambda\mathbf{U}' \\
\mathbf{U}'\mathbf{Y}'\mathbf{Y}\mathbf{U} &= \mathbf{U}'\mathbf{U}\Lambda\mathbf{U}'\mathbf{U} \\
(\mathbf{Y}\mathbf{U})'(\mathbf{Y}\mathbf{U}) &= \Lambda \quad \text{since } \mathbf{U}' = \mathbf{U}^{-1} \\
\mathbf{F}'\mathbf{F} &= \Lambda.
\end{aligned}
$$

F = YU is the T by k matrix of factors and β = U' is the k by k matrix of factor loadings.
Additionally Fβ = FU' = YUU' = Y.³
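A small sketch of this decomposition using NumPy's symmetric eigenvalue routine; the function and variable names are illustrative.

import numpy as np

def principal_components(Y):
    omega = Y.T @ Y                         # outer product of Y
    lam, U = np.linalg.eigh(omega)          # eigenvalues/eigenvectors, ascending order
    order = np.argsort(lam)[::-1]           # re-order from largest to smallest eigenvalue
    lam, U = lam[order], U[:, order]
    F = Y @ U                               # T by k orthogonal factors, F'F = diag(lam)
    beta = U.T                              # k by k factor loadings
    return F, beta, lam

Y = np.random.standard_normal((1000, 5))
F, beta, lam = principal_components(Y)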
Covariance estimation based on Principal Component Analysis (PCA) is virtually identical
to observable factor covariance estimation. The sole difference is that the factors are
estimated directly from the returns and so covariance estimators using PCA do not require
the common factors to be observable.

Definition 9.10 (n-period Principal Component Covariance). The n-period principal component
covariance is defined as

$$
\Sigma_t = \beta_t'\,\Sigma_t^{f}\,\beta_t + \Omega_t \qquad (9.10)
$$

where Σ_t^f = n^{-1} Σ_{i=1}^{n} f_{t-i}f_{t-i}' is the n-period moving covariance of the first p principal component
factors, β̂_t is the p by k matrix of principal component loadings corresponding to the
first p factors, and Ω_t is a diagonal matrix with ω²_{j,t} = n^{-1} Σ_{i=1}^{n} η²_{j,t-i} on the jth diagonal
where η_{i,t} = r_{i,t} − f_t'β_{i,t} are the residuals from a p-factor principal component analysis.

Selecting the number of factors to use, p , is one of the more difficult aspects of imple-
menting a PCA covariance. One method specifies a fixed number of factors based on expe-
rience with the data or empirical regularities, for example selecting 3 factors when working
with equity returns. An alternative is to select the number of factors by minimizing an in-
formation criteria such as those proposed in Bai & Ng (2002),

k +T
 
kT
I C (p ) = ln(V (p , f̂ )) + p
p
ln
kT k +T
where

$$
\begin{aligned}
V(p, \hat{\mathbf{f}}^{p}) &= (kT)^{-1}\sum_{i=1}^{k}\sum_{t=1}^{T}\eta_{i,t}^{2} \qquad (9.11)\\
&= (kT)^{-1}\sum_{i=1}^{k}\sum_{t=1}^{T}\left(r_{i,t} - \beta_i^{p}\mathbf{f}_t^{p}\right)^2 \qquad (9.12)
\end{aligned}
$$

where β_i^p are the p factor loadings for asset i, and f_t^p are the first p factors.
³ The factors and factor loadings are only identified up to a scaling by ±1.

Principal Component Analysis of the S&P 500

k = 194 1 2 3 4 5 6 7 8 9 10
Partial R 2 0.263 0.039 0.031 0.023 0.019 0.016 0.014 0.013 0.012 0.011
Cumulative R 2 0.263 0.302 0.333 0.356 0.375 0.391 0.405 0.418 0.430 0.441

Table 9.1: Percentage of variance explained by the first 10 eigenvalues of the outer product
matrix of S&P 500 returns. Returns on an asset were only included if an asset was in the S&P
500 for the entire sample (k is the number of assets which meet this criterion). The second line
contains the cumulative R2 of a p -factor model for the first 10 factors.

9.3.4.1 Interpreting the components

One nice feature of PCA is that the factors can be easily interpreted in terms of their con-
tribution to total variance using R2 . This interpretation is possible since the factors are
orthogonal, and so the R2 of a model including p < k factors is the sum of the R2 of the p
factors. Suppose the eigenvalues were ordered from largest to smallest and so λ1 ≥ λ2 ≥
. . . ≥ λk and that the factors associated with eigenvalue i has been ordered such that it
appears in column i of F. The R2 associated with factor i is then

$$
\frac{\lambda_i}{\lambda_1 + \lambda_2 + \ldots + \lambda_k}
$$

and the cumulative R² of including p < k factors is

$$
\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_p}{\lambda_1 + \lambda_2 + \ldots + \lambda_k}.
$$
Cumulative R2 is often used to select a subset of the k factors for model building. For
example, in equity return data, it is not uncommon for 3–5 factors to explain 30% of the
total variation in a large panel of equity returns.

9.3.4.2 Alternative methods

Principal components can also be computed using either the covariance matrix of Y or the correlation
matrix of Y. Using the covariance matrix is equivalent to building a model with an
intercept,

yi ,t = αi + ft β i (9.13)
which differs from the principal components extracted from the outer product which is
equivalent to the model

yi ,t = ft β i . (9.14)

When working with asset return data, the difference between principal components ex-
tracted from the outer product and the covariance is negligible except in certain markets
(e.g. electricity markets) or when the using returns covering long time spans (e.g. one
month or more).
Principal components can also be extracted from the sample correlation matrix of Y
which is equivalent to the model

$$
\frac{y_{i,t} - \bar{y}_i}{\hat{\sigma}_i} = \mathbf{f}_t\beta_i \qquad (9.15)
$$

where ȳ_i = T^{-1} Σ_{t=1}^{T} y_{i,t} is the mean of y_i and σ̂_i is the sample standard deviation of y_i.
PCA is usually run on the correlation matrix when a subset of the series in Y have variances
which are much larger than the others. In cases where the variances differ greatly, principal
components extracted from the outer product or covariance will focus on the high variance
data series since fitting these produces the largest decrease in residual variance and thus
the largest increases in R2 .

9.3.5 Equicorrelation

Equicorrelation, like factor models, is a restricted covariance estimator. The equicorrela-


tion estimator assumes that the covariance between any two assets can be expressed as
ρσi σ j where σi and σ j are the volatilities of assets i and j , respectively. The correlation
parameter is not indexed by i or j , and so it is common to all assets. This estimator is
clearly mis-specified whenever k > 2, and is generally only appropriate for assets where
the majority of the pairwise correlations are positive and the correlations are fairly homo-
geneous.4

Definition 9.11 (n -period Moving Average Equicorrelation Covariance). The n -period mov-
ing average equicorrelation covariance is defined as

$$
\Sigma_t = \begin{bmatrix}
\sigma_{1,t}^2 & \rho_t\sigma_{1,t}\sigma_{2,t} & \rho_t\sigma_{1,t}\sigma_{3,t} & \ldots & \rho_t\sigma_{1,t}\sigma_{k,t} \\
\rho_t\sigma_{1,t}\sigma_{2,t} & \sigma_{2,t}^2 & \rho_t\sigma_{2,t}\sigma_{3,t} & \ldots & \rho_t\sigma_{2,t}\sigma_{k,t} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_t\sigma_{1,t}\sigma_{k,t} & \rho_t\sigma_{2,t}\sigma_{k,t} & \rho_t\sigma_{3,t}\sigma_{k,t} & \ldots & \sigma_{k,t}^2
\end{bmatrix} \qquad (9.16)
$$

where σ²_{j,t} = n^{-1} Σ_{i=1}^{n} ε²_{j,t-i} and ρ_t is estimated using one of the estimators below.

The equicorrelation can be estimated using either a moment-based estimator or a
maximum-likelihood estimator. Define ε_{p,t} as the equally weighted portfolio return. It

⁴ The positivity constraint is needed to ensure that the covariance is positive definite, which requires ρ ∈
(−1/(k − 1), 1), and so for k moderately large, the lower bound is effectively 0.

is straightforward to see that

$$
\begin{aligned}
\mathrm{E}[\varepsilon_{p,t}^2] &= k^{-2}\sum_{j=1}^{k}\sigma_{j,t}^2 + 2k^{-2}\sum_{o=1}^{k}\sum_{q=o+1}^{k}\rho\,\sigma_{o,t}\sigma_{q,t} \qquad (9.17)\\
&= k^{-2}\sum_{j=1}^{k}\sigma_{j,t}^2 + 2\rho k^{-2}\sum_{o=1}^{k}\sum_{q=o+1}^{k}\sigma_{o,t}\sigma_{q,t}
\end{aligned}
$$

if the correlations among all of the pairs of assets were identical. The moment-based estimator
replaces population values with estimates,

$$
\sigma_{j,t}^2 = n^{-1}\sum_{i=1}^{n}\varepsilon_{j,t-i}^2, \qquad j = 1, 2, \ldots, k, p,
$$

and the equicorrelation is estimated using

$$
\rho_t = \frac{\sigma_{p,t}^2 - k^{-2}\sum_{j=1}^{k}\sigma_{j,t}^2}{2k^{-2}\sum_{o=1}^{k}\sum_{q=o+1}^{k}\sigma_{o,t}\sigma_{q,t}}.
$$
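A sketch of the moment-based estimator applied to the most recent n observations; the function and array names are illustrative.

import numpy as np

def equicorrelation_moment(eps, n):
    E = eps[-n:]
    k = E.shape[1]
    sigma = np.sqrt((E ** 2).mean(axis=0))        # sigma_{j,t}, j = 1, ..., k
    sigma2_p = (E.mean(axis=1) ** 2).mean()       # sigma^2_{p,t} of the equally weighted portfolio
    cross = sum(sigma[o] * sigma[q]
                for o in range(k) for q in range(o + 1, k))
    return (sigma2_p - k ** -2 * (sigma ** 2).sum()) / (2 * k ** -2 * cross)

rho = equicorrelation_moment(np.random.standard_normal((500, 10)) * 0.01, 252)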

Alternatively maximum likelihood assuming returns are multivariate Gaussian can be


used to estimate the equicorrelation using standardized residuals u j ,t = ε j ,t /σ j ,t . The
estimator for ρ can be found by maximizing the likelihood

$$
\begin{aligned}
L(\rho_t; \mathbf{u}) &= -\frac{1}{2}\sum_{i=1}^{n}\left[k\ln 2\pi + \ln|\mathbf{R}_t| + \mathbf{u}_{t-i}'\mathbf{R}_t^{-1}\mathbf{u}_{t-i}\right] \qquad (9.18)\\
&= -\frac{1}{2}\sum_{i=1}^{n}\Bigg[k\ln 2\pi + \ln\left\{(1-\rho_t)^{k-1}\left(1+(k-1)\rho_t\right)\right\} \\
&\qquad\qquad + \frac{1}{(1-\rho_t)}\left(\sum_{j=1}^{k}u_{j,t-i}^2 - \frac{\rho_t}{1+(k-1)\rho_t}\left(\sum_{q=1}^{k}u_{q,t-i}\right)^2\right)\Bigg]
\end{aligned}
$$

where ut is a k by 1 vector of standardized residuals and Rt is a correlation matrix with all


non-diagonal elements equal to ρ. This likelihood is computationally similar to a univariate
likelihood for any k and so maximization is very fast even when k is large.⁵

⁵ The computation speed of the likelihood can be increased by pre-computing Σ_{j=1}^{k} u²_{j,t-i} and
Σ_{q=1}^{k} u_{q,t-i}.

Correlation Measures for the S&P 500


Equicorrelation 1-Factor R 2 (S&P 500) 3-Factor R 2 (Fama-French)
0.255 0.236 0.267

Table 9.2: Full sample correlation measures for the S&P 500. Returns on an asset were only
included if an asset was in the S&P 500 for the entire sample (as in Table 9.1). The 1-factor R²
was from a model that used the return on the S&P 500 and the 3-factor R² was from a model
that used the 3 Fama-French factors.

9.3.6 Application: S&P 500

The S&P 500 was used to illustrate some of the similarities among the simple covariance
estimators. Daily data on all constituents of the S&P 500 was downloaded from CRSP from
January 1, 1999 until December 31, 2008. Principal component analysis was conducted on
the subset of the returns which was available for each day in the sample. Table 9.1 contains
the number of assets which meet this criterion (k) and both the partial and cumulative R²
for the first 10 principal components. The first component explains a substantial amount of the variation –
26% – and the first five combined explain 37% of the cross-sectional variation. If returns
did not follow a factor structure each principal component would be expected to explain
approximately 0.5% of the variation. Table 9.2 contains the full-sample equicorrelation, the 1-
factor R² using the S&P 500 as the observable factor and the 3-factor R² using the 3 Fama-
French factors. The average correlation and the 1-factor fit are similar to those from the 1-factor
PCA model, although the 3 Fama-French factors do not appear to work as well as the 3
factors estimated from the data. This is likely due to the lack of cross-sectional variation
with respect to size in the S&P 500 when compared to all assets in CRSP.
Figure 9.2 contains a plot of the 252-day moving average equicorrelation and 1- and
3-factor PCA R2 . Periods of high volatility, such as the end of the dot-com bubble and late
2008, appear to also have high correlation. The three lines broadly agree about the changes
and only differ in level. Figure 9.3 contains plots of the R2 from the 1-factor PCA and the
1-factor model which uses the S&P500 return (top panel) and the 3-factor PCA and the 3
Fama-French factors (bottom panel). The dynamics in all series are similar with the largest
differences in the levels, where PCA fits the cross-section of data better than the observable
models.

9.4 Multivariate ARCH Models

9.4.1 Vector GARCH (vec)

Vector GARCH was the first multivariate ARCH specification (Bollerslev, Engle & Wooldridge
1988), and is the natural extension of the standard GARCH model. The model is defined

[Figure 9.2 about here: 252-day Rolling Window Correlation Measures for the S&P 500 – equicorrelation, PCA 1-factor R² and PCA 3-factor R².]

Figure 9.2: Three views into the average correlation in the S&P 500. The PCA measures are
the R2 of models with 1 and 3 factors. Each estimate was computed using a 252-day rolling
window and is plotted against the center of the rolling window. All three measures roughly
agree about the changes in the average correlation.

using the vec of the conditional covariance.

Definition 9.12 (Vector GARCH). The covariance in a vector GARCH model (vec) evolves
according to

$$
\begin{aligned}
\text{vec}(\Sigma_t) &= \text{vec}(\mathbf{C}) + \mathbf{A}\,\text{vec}\!\left(\varepsilon_{t-1}\varepsilon_{t-1}'\right) + \mathbf{B}\,\text{vec}(\Sigma_{t-1}) \qquad (9.19)\\
&= \text{vec}(\mathbf{C}) + \mathbf{A}\,\text{vec}\!\left(\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)'\right) + \mathbf{B}\,\text{vec}(\Sigma_{t-1}) \qquad (9.20)
\end{aligned}
$$

where C is a k by k positive definite matrix and both A and B are k² by k² parameter matrices.
In the second line, Σ_{t-1}^{1/2} is a matrix square root and {e_t} is a sequence of i.i.d. random
variables with mean 0 and covariance I_k, such as a standard multivariate normal.

See eq. 5.9 for the definition of the vec operator. The vec allows each cross-product to
influence each covariance term. To understand the richness of the specification, consider
the evolution of the conditional covariance in a bivariate model,

[Figure 9.3 about here: Observable and Principal Component Correlation Measures for the S&P 500 – top panel: PCA 1-factor and 1-factor (S&P 500) R²; bottom panel: PCA 3-factor and 3-factor (Fama-French) R².]

Figure 9.3: The top panel plots the R2 for 1-factor PCA and an observable factor model
which uses the return on the S&P 500 as the observable factor. The bottom contains the
same for 3-factor PCA and the Fama-French 3-factor model. Each estimate was computed
using a 252-day rolling window and is plotted against the center of the rolling window.

         
$$
\begin{bmatrix} \sigma_{11,t} \\ \sigma_{12,t} \\ \sigma_{12,t} \\ \sigma_{22,t} \end{bmatrix} =
\begin{bmatrix} c_{11} \\ c_{12} \\ c_{12} \\ c_{22} \end{bmatrix} +
\begin{bmatrix} a_{11} & a_{12} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{42} & a_{44} \end{bmatrix}
\begin{bmatrix} \varepsilon_{1,t-1}^2 \\ \varepsilon_{1,t-1}\varepsilon_{2,t-1} \\ \varepsilon_{1,t-1}\varepsilon_{2,t-1} \\ \varepsilon_{2,t-1}^2 \end{bmatrix} +
\begin{bmatrix} b_{11} & b_{12} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{32} & b_{33} \\ b_{41} & b_{42} & b_{42} & b_{44} \end{bmatrix}
\begin{bmatrix} \sigma_{11,t-1} \\ \sigma_{12,t-1} \\ \sigma_{12,t-1} \\ \sigma_{22,t-1} \end{bmatrix}.
$$

The vec operator stacks the elements of the covariance matrix and the outer products of
returns. The evolution of the conditional variance of the first asset,

σ_{11,t} = c_{11} + a_{11}ε²_{1,t-1} + 2a_{12}ε_{1,t-1}ε_{2,t-1} + a_{13}ε²_{2,t-1} + b_{11}σ_{11,t-1} + 2b_{12}σ_{12,t-1} + b_{13}σ_{22,t-1},

depends on all of the past squared returns and cross-products. In practice it is very difficult
to use the vector GARCH model since finding general conditions on A and B which will
ensure that Σt is positive definite is difficult.
The diagonal vec model has been more successful, primarily because it is relatively
straightforward to find conditions which ensure that the conditional covariance is positive
semi-definite. The diagonal vec model restricts A and B to diagonal matrices which
means that the elements of Σ_t evolve according to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C} + \tilde{\mathbf{A}}\odot\left(\varepsilon_{t-1}\varepsilon_{t-1}'\right) + \tilde{\mathbf{B}}\odot\Sigma_{t-1} \qquad (9.21)\\
&= \mathbf{C} + \tilde{\mathbf{A}}\odot\left(\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)'\right) + \tilde{\mathbf{B}}\odot\Sigma_{t-1} \qquad (9.22)
\end{aligned}
$$

where Ã and B̃ are symmetric parameter matrices and ⊙ is the Hadamard product.⁶ All elements
of Σ_t evolve according to a GARCH(1,1)-like specification, so that

σ_{ij,t} = c_{ij} + ã_{ij}ε_{i,t-1}ε_{j,t-1} + b̃_{ij}σ_{ij,t-1}.

The diagonal vec still requires restrictions on the parameters to ensure that the conditional
covariance is positive definite.
context of Matrix GARCH (see section 9.4.3).

9.4.2 BEKK GARCH


The BEKK (Baba, Engle, Kraft and Kroner) directly addresses the difficulties in finding con-
straints on the parameters of a vec specification (Engle & Kroner 1995). The primary insight
of the BEKK is that quadratic forms are positive semi-definite and the sum of a positive
semi-definite matrix and a positive definite matrix is positive definite.

Definition 9.15 (BEKK GARCH). The covariance in a BEKK GARCH(1,1) model evolves ac-
cording to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C}\mathbf{C}' + \mathbf{A}\varepsilon_{t-1}\varepsilon_{t-1}'\mathbf{A}' + \mathbf{B}\Sigma_{t-1}\mathbf{B}' \qquad (9.23)\\
&= \mathbf{C}\mathbf{C}' + \mathbf{A}\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)'\mathbf{A}' + \mathbf{B}\Sigma_{t-1}\mathbf{B}' \qquad (9.24)
\end{aligned}
$$

where C is a k by k lower triangular matrix and A and B are k by k parameter matrices.


6

Definition 9.13 (Hadamard Product). Let A and B be matrices with the same size. The Hadamard product of
A and B denoted A B is the matrix with ijth element a i j bi j .

Definition 9.14 (Hadamard Quotient). Let A and B be matrices with the same size. The Hadamard quotient
of A and B denoted A B is the matrix with ijth element a i j /bi j .

Using the vec operator, the BEKK can be seen as a restricted version of the vec speci-
fication where A ⊗ A and B ⊗ B control the response to recent news and the smoothing,
respectively,

$$
\begin{aligned}
\text{vec}(\Sigma_t) &= \text{vec}\!\left(\mathbf{C}\mathbf{C}'\right) + \mathbf{A}\otimes\mathbf{A}\,\text{vec}\!\left(\varepsilon_{t-1}\varepsilon_{t-1}'\right) + \mathbf{B}\otimes\mathbf{B}\,\text{vec}(\Sigma_{t-1}) \qquad (9.25)\\
&= \text{vec}\!\left(\mathbf{C}\mathbf{C}'\right) + \mathbf{A}\otimes\mathbf{A}\,\text{vec}\!\left(\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)'\right) + \mathbf{B}\otimes\mathbf{B}\,\text{vec}(\Sigma_{t-1}). \qquad (9.26)
\end{aligned}
$$

The elements of Σ_t generally depend on all cross-products. For example, consider a bivariate
BEKK,

$$
\begin{aligned}
\begin{bmatrix} \sigma_{11,t} & \sigma_{12,t} \\ \sigma_{12,t} & \sigma_{22,t} \end{bmatrix} &=
\begin{bmatrix} c_{11} & 0 \\ c_{21} & c_{22} \end{bmatrix}
\begin{bmatrix} c_{11} & 0 \\ c_{21} & c_{22} \end{bmatrix}' \qquad (9.27)\\
&\quad + \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} \varepsilon_{1,t-1}^2 & \varepsilon_{1,t-1}\varepsilon_{2,t-1} \\ \varepsilon_{1,t-1}\varepsilon_{2,t-1} & \varepsilon_{2,t-1}^2 \end{bmatrix}
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}'\\
&\quad + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\begin{bmatrix} \sigma_{11,t-1} & \sigma_{12,t-1} \\ \sigma_{12,t-1} & \sigma_{22,t-1} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}'
\end{aligned}
$$

The conditional variance of the first asset is given by

$$
\sigma_{11,t} = c_{11}^2 + a_{11}^2\varepsilon_{1,t-1}^2 + 2a_{11}a_{12}\varepsilon_{1,t-1}\varepsilon_{2,t-1} + a_{12}^2\varepsilon_{2,t-1}^2 + b_{11}^2\sigma_{11,t-1} + 2b_{11}b_{12}\sigma_{12,t-1} + b_{12}^2\sigma_{22,t-1}.
$$

The other conditional variance and the conditional covariance have similar forms that de-
pend on both squared returns and the cross-product of returns.
Estimation of full BEKK models rapidly becomes difficult as the number of assets grows
since the number of parameters in the model is (5k² + k)/2, and so the full BEKK is usually only appropriate
for k ≤ 5. The diagonal BEKK partially addresses the proliferation of parameters
by restricting A and B to be diagonal matrices,
Definition 9.16 (Diagonal BEKK GARCH). The covariance in a diagonal BEKK-GARCH(1,1)
model evolves according to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C}\mathbf{C}' + \tilde{\mathbf{A}}\varepsilon_{t-1}\varepsilon_{t-1}'\tilde{\mathbf{A}}' + \tilde{\mathbf{B}}\Sigma_{t-1}\tilde{\mathbf{B}}' \qquad (9.28)\\
&= \mathbf{C}\mathbf{C}' + \tilde{\mathbf{A}}\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)'\tilde{\mathbf{A}}' + \tilde{\mathbf{B}}\Sigma_{t-1}\tilde{\mathbf{B}}' \qquad (9.29)
\end{aligned}
$$

where C is a k by k lower triangular matrix and Ã and B̃ are diagonal parameter matrices.
The conditional covariances in a diagonal BEKK evolve according to

σi j ,t = c˜i j + a i a j εi ,t −1 ε j ,t −1 + bi b j σi j ,t −1 (9.30)
where c˜i j is the ijth element of CC0 . This covariance evolution is similar to the diagonal vec
specification except that the parameters are shared between different series. The scalar
BEKK further restricts the parameter matrices to be common across all assets, and is a par-
ticularly simple (and restrictive) model.

Definition 9.17 (Scalar BEKK GARCH). The covariance in a scalar BEKK-GARCH(1,1) model
evolves according to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C}\mathbf{C}' + a^2\varepsilon_{t-1}\varepsilon_{t-1}' + b^2\Sigma_{t-1} \qquad (9.31)\\
&= \mathbf{C}\mathbf{C}' + a^2\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)' + b^2\Sigma_{t-1} \qquad (9.32)
\end{aligned}
$$

where C is a k by k lower triangular matrix and a and b are scalar parameters.

The scalar BEKK has one further advantage: it can easily be covariance targeted. Covariance
targeting replaces the intercept (CC') with a consistent estimator, (1 − a² − b²)Σ, where
Σ is the long-run variance of the data. Σ is usually estimated using the outer product of
returns, and so Σ̂ = T^{-1} Σ_{t=1}^{T} ε_t ε_t'. The conditional covariance is then estimated conditioning
on the unconditional covariance of returns,

$$
\Sigma_t = (1 - a^2 - b^2)\hat{\Sigma} + a^2\varepsilon_{t-1}\varepsilon_{t-1}' + b^2\Sigma_{t-1} \qquad (9.33)
$$

and so only a and b remain to be estimated using maximum likelihood. Scalar BEKK
models can be used in large portfolios (k > 50), unlike models without targeting.
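A sketch of the covariance-targeted scalar BEKK recursion in eq. (9.33) for fixed values of a and b; in practice a and b would be chosen by maximizing the Gaussian likelihood discussed in section 9.4.8. The names are illustrative.

import numpy as np

def scalar_bekk_targeted(eps, a, b):
    T, k = eps.shape
    sigma_bar = eps.T @ eps / T               # long-run covariance (outer-product estimator)
    sigma = np.zeros((T, k, k))
    sigma[0] = sigma_bar
    for t in range(1, T):
        sigma[t] = ((1 - a ** 2 - b ** 2) * sigma_bar
                    + a ** 2 * np.outer(eps[t - 1], eps[t - 1])
                    + b ** 2 * sigma[t - 1])
    return sigma

Sigma = scalar_bekk_targeted(np.random.standard_normal((500, 4)) * 0.01, a=0.2, b=0.97)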

9.4.3 Matrix GARCH (M-GARCH)


Matrix GARCH (Ding & Engle 2001) contains a set of parameterizations which include the
diagonal vec and an alternative parameterization of the diagonal BEKK.

Definition 9.18 (Matrix GARCH). The covariance in a Matrix GARCH(1,1) model evolves
according to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C}\mathbf{C}' + \mathbf{A}\mathbf{A}'\odot\varepsilon_{t-1}\varepsilon_{t-1}' + \mathbf{B}\mathbf{B}'\odot\Sigma_{t-1} \qquad (9.34)\\
&= \mathbf{C}\mathbf{C}' + \mathbf{A}\mathbf{A}'\odot\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)' + \mathbf{B}\mathbf{B}'\odot\Sigma_{t-1} \qquad (9.35)
\end{aligned}
$$

where C, A and B are lower triangular matrices.

Ding & Engle (2001) show that if U and V are positive semi-definite matrices, then U ⊙ V
is also, which, when combined with the quadratic forms in the model, ensures that Σ_t will
be positive definite as long as C has full rank. They also propose a diagonal Matrix GARCH
specification which is equivalent to the diagonal BEKK.

Definition 9.19 (Diagonal Matrix GARCH). The covariance in a diagonal Matrix GARCH(1,1)
model evolves according to

$$
\begin{aligned}
\Sigma_t &= \mathbf{C}\mathbf{C}' + \mathbf{a}\mathbf{a}'\odot\varepsilon_{t-1}\varepsilon_{t-1}' + \mathbf{b}\mathbf{b}'\odot\Sigma_{t-1} \qquad (9.36)\\
&= \mathbf{C}\mathbf{C}' + \mathbf{a}\mathbf{a}'\odot\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\Sigma_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)' + \mathbf{b}\mathbf{b}'\odot\Sigma_{t-1}, \qquad (9.37)
\end{aligned}
$$

where C is a lower triangular matrix and a and b are k by 1 parameter vectors.

The scalar version of the Matrix GARCH is identical to the scalar BEKK.

9.4.4 Constant Conditional Correlation (CCC) GARCH


Constant Conditional Correlation GARCH (Bollerslev 1990) uses a different approach to
that of the vec, BEKK, and Matrix GARCH. CCC GARCH decomposes the conditional co-
variance into k conditional variances and the conditional correlation, which is assumed to
be constant,

Σt = Dt RDt . (9.38)
D_t is a diagonal matrix with the conditional standard deviation of the ith asset in its ith diagonal
position,

$$
\mathbf{D}_t = \begin{bmatrix}
\sigma_{1,t} & 0 & 0 & \ldots & 0 \\
0 & \sigma_{2,t} & 0 & \ldots & 0 \\
0 & 0 & \sigma_{3,t} & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \sigma_{k,t}
\end{bmatrix} \qquad (9.39)
$$

where σ_{i,t} = √σ_{ii,t}. The conditional variances are typically modeled using standard GARCH(1,1)
models,

$$
\begin{aligned}
\sigma_{ii,t} &= \omega_i + \alpha_i r_{i,t-1}^2 + \beta_i\sigma_{ii,t-1} \qquad (9.40)\\
&= \omega_i + \alpha_i\sigma_{ii,t-1}u_{i,t-1}^2 + \beta_i\sigma_{ii,t-1}
\end{aligned}
$$

where u_{i,t-1} is the ith element of u_t = R^{1/2}e_t where {e_t} is a sequence of i.i.d. random variables
with mean 0 and covariance I_k, such as a standard multivariate normal. Other specifications,
such as TARCH or EGARCH, can also be used. It is also possible to model the
conditional variances using a different model for each asset, a distinct advantage over vec and
related models. The conditional correlation is constant,

 
$$
\mathbf{R} = \begin{bmatrix}
1 & \rho_{12} & \rho_{13} & \ldots & \rho_{1k} \\
\rho_{12} & 1 & \rho_{23} & \ldots & \rho_{2k} \\
\rho_{13} & \rho_{23} & 1 & \ldots & \rho_{3k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{1k} & \rho_{2k} & \rho_{3k} & \ldots & 1
\end{bmatrix}. \qquad (9.41)
$$
The conditional covariance matrix is computed from the conditional standard deviations and
the conditional correlation, and so all of the dynamics in the conditional covariance are
attributable to changes in the conditional variances.
 
$$
\Sigma_t = \begin{bmatrix}
\sigma_{11,t} & \rho_{12}\sigma_{1,t}\sigma_{2,t} & \rho_{13}\sigma_{1,t}\sigma_{3,t} & \ldots & \rho_{1k}\sigma_{1,t}\sigma_{k,t} \\
\rho_{12}\sigma_{1,t}\sigma_{2,t} & \sigma_{22,t} & \rho_{23}\sigma_{2,t}\sigma_{3,t} & \ldots & \rho_{2k}\sigma_{2,t}\sigma_{k,t} \\
\rho_{13}\sigma_{1,t}\sigma_{3,t} & \rho_{23}\sigma_{2,t}\sigma_{3,t} & \sigma_{33,t} & \ldots & \rho_{3k}\sigma_{3,t}\sigma_{k,t} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{1k}\sigma_{1,t}\sigma_{k,t} & \rho_{2k}\sigma_{2,t}\sigma_{k,t} & \rho_{3k}\sigma_{3,t}\sigma_{k,t} & \ldots & \sigma_{kk,t}
\end{bmatrix} \qquad (9.42)
$$
Bollerslev (1990) shows that the CCC GARCH model can be estimated in two steps. The
first fits k conditional variance models (e.g. GARCH) and produces the vector of standardized
residuals u_t where u_{i,t} = ε_{i,t}/√σ̂_{ii,t}. The second step estimates the constant conditional
correlation using the standard correlation estimator on the standardized residuals.
Definition 9.20 (Constant Conditional Correlation GARCH). The covariance in a constant
conditional correlation GARCH model evolves according to
 
$$
\Sigma_t = \begin{bmatrix}
\sigma_{11,t} & \rho_{12}\sigma_{1,t}\sigma_{2,t} & \rho_{13}\sigma_{1,t}\sigma_{3,t} & \ldots & \rho_{1k}\sigma_{1,t}\sigma_{k,t} \\
\rho_{12}\sigma_{1,t}\sigma_{2,t} & \sigma_{22,t} & \rho_{23}\sigma_{2,t}\sigma_{3,t} & \ldots & \rho_{2k}\sigma_{2,t}\sigma_{k,t} \\
\rho_{13}\sigma_{1,t}\sigma_{3,t} & \rho_{23}\sigma_{2,t}\sigma_{3,t} & \sigma_{33,t} & \ldots & \rho_{3k}\sigma_{3,t}\sigma_{k,t} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{1k}\sigma_{1,t}\sigma_{k,t} & \rho_{2k}\sigma_{2,t}\sigma_{k,t} & \rho_{3k}\sigma_{3,t}\sigma_{k,t} & \ldots & \sigma_{kk,t}
\end{bmatrix} \qquad (9.43)
$$

where σi i ,t , i = 1, 2, . . . , k evolves according to some univariate GARCH process on asset i ,


usually a GARCH(1,1).

9.4.5 Dynamic Conditional Correlation (DCC)


Dynamic Conditional Correlation extends CCC GARCH by introducing simple, scalar BEKK-
like dynamics to the conditional correlations, and so R in the CCC is replaced with Rt in the
DCC (Engle 2002b)
Definition 9.21 (Dynamic Conditional Correlation GARCH). The covariance in a dynamic
conditional correlation GARCH model evolves according to

Σt = Dt Rt Dt . (9.44)

where

$$
\mathbf{R}_t = \mathbf{Q}_t^{*}\,\mathbf{Q}_t\,\mathbf{Q}_t^{*}, \qquad (9.45)
$$
$$
\begin{aligned}
\mathbf{Q}_t &= (1 - a - b)\,\mathbf{R} + a\,\mathbf{u}_{t-1}\mathbf{u}_{t-1}' + b\,\mathbf{Q}_{t-1}, \qquad (9.46)\\
&= (1 - a - b)\,\mathbf{R} + a\left(\mathbf{R}_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)\left(\mathbf{R}_{t-1}^{\frac{1}{2}}\mathbf{e}_{t-1}\right)' + b\,\mathbf{Q}_{t-1}, \qquad (9.47)
\end{aligned}
$$
$$
\mathbf{Q}_t^{*} = \left(\mathbf{Q}_t\odot\mathbf{I}_k\right)^{-\frac{1}{2}} \qquad (9.48)
$$

u_t is the k by 1 vector of standardized returns (u_{i,t} = ε_{i,t}/√σ̂_{ii,t}) and ⊙ denotes Hadamard
multiplication (element-by-element). {e_t} are a sequence of i.i.d. innovations with mean
0 and covariance I_k, such as a standard multivariate normal or possibly a heavy tailed distribution.
D_t is a diagonal matrix with the conditional standard deviation of asset i on the
ith diagonal position. The conditional variances, σ_{ii,t}, i = 1, 2, . . . , k, evolve according to
some univariate GARCH process for asset i, usually a GARCH(1,1), and are identical to eq.
9.40.

Eqs. 9.45 and 9.48 are needed to ensure that Rt is a correlation matrix with diagonal el-
ements equal to 1. The Qt process is parameterized in a similar manner to the variance
targeting BEKK (eq. 9.33) which allows for three step estimation. The first two steps are
identical to those of the CCC GARCH model. The third plugs in the estimate of the corre-
lation into eq. 9.46 to estimate the parameters which govern the conditional correlation
dynamics, a and b .
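A sketch of the third step, the correlation recursion in eqs. (9.45)-(9.48), taking the standardized residuals u from the first two steps and values of a and b as given; the names are illustrative.

import numpy as np

def dcc_correlations(u, a, b):
    T, k = u.shape
    R_bar = np.corrcoef(u, rowvar=False)      # correlation targeting intercept
    Q = np.zeros((T, k, k))
    R = np.zeros((T, k, k))
    Q[0] = R_bar
    R[0] = R_bar
    for t in range(1, T):
        Q[t] = (1 - a - b) * R_bar + a * np.outer(u[t - 1], u[t - 1]) + b * Q[t - 1]
        q = np.sqrt(np.diag(Q[t]))
        R[t] = Q[t] / np.outer(q, q)          # rescale so R_t has a unit diagonal
    return R

R = dcc_correlations(np.random.standard_normal((500, 3)), a=0.02, b=0.97)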

9.4.6 Orthogonal GARCH (OGARCH)

The principal components of a T by k matrix of returns ε are defined as F = εU where U
is the matrix of eigenvectors of the outer product of ε. Orthogonal GARCH uses the first
p principal components to model the conditional covariance by assuming that the factors
are conditionally uncorrelated.⁷

Definition 9.22 (Orthogonal GARCH). The covariance in an orthogonal GARCH (OGARCH)


model evolves according to
f
Σt = β Σt β 0 + Ω (9.49)
where β is the k by p matrix of factor loadings corresponding to the p factors with the

⁷ Principal components use the unconditional covariance of returns and so only guarantee that the factors
are unconditionally uncorrelated.

highest total R2 . The conditional covariance of the factors is assumed diagonal,


 
$$
\Sigma_t^{f} = \begin{bmatrix}
\psi_{1,t}^2 & 0 & 0 & \ldots & 0 \\
0 & \psi_{2,t}^2 & 0 & \ldots & 0 \\
0 & 0 & \psi_{3,t}^2 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \psi_{p,t}^2
\end{bmatrix}, \qquad (9.50)
$$

and the conditional variance of each factor follows a GARCH(1,1) process (other models are
possible),

$$
\begin{aligned}
\psi_{i,t}^2 &= \varphi_i + \alpha_i f_{i,t-1}^2 + \beta_i\psi_{i,t-1}^2 \qquad (9.51)\\
&= \varphi_i + \alpha_i\psi_{i,t-1}^2 e_{i,t-1}^2 + \beta_i\psi_{i,t-1}^2 \qquad (9.52)
\end{aligned}
$$

where {et } are a sequence of i.i.d. innovations with mean 0 and covariance Ik , such as a
standard multivariate normal or possibly a heavy tailed distribution.
The conditional covariance of the residuals is assumed to be constant and diagonal,
 
$$
\Omega = \begin{bmatrix}
\omega_1^2 & 0 & 0 & \ldots & 0 \\
0 & \omega_2^2 & 0 & \ldots & 0 \\
0 & 0 & \omega_3^2 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \omega_k^2
\end{bmatrix}, \qquad (9.53)
$$

where each variance is estimated using the residuals from a p-factor model,

$$
\omega_i^2 = \sum_{t=1}^{T}\eta_{i,t}^2 = \sum_{t=1}^{T}\left(\varepsilon_{i,t} - \mathbf{f}_t\beta_i\right)^2. \qquad (9.54)
$$

Variants of the standard OGARCH model include parameterizations where the number
of factors is equal to the number of assets, and so Ω = 0, and a specification which replaces
Ω with Ωt where each ω2i ,t follows a univariate GARCH process.

9.4.7 Conditional Asymmetries

Standard multivariate ARCH models are symmetric in the sense that the news impact curves
are symmetric for εt and −εt since they only depend on the outer product of returns. Most
models can be modified in a simple way to allow for conditional asymmetries in covariance
which may be important when modeling equity returns. Define ζ_t = ε_t ⊙ I_{[ε_t<0]} where I_{[ε_t<0]}
is a k by 1 vector of indicator variables where the ith position is 1 if r_{i,t} < 0. An asymmetric
BEKK model can be constructed as

Σt = CC0 + Aεt −1 ε0t −1 A0 + Gζt −1 ζ0t −1 G0 + BΣt −1 B0 (9.55)

where G is a k by k matrix of parameters which control the covariance response to "bad"
news, and when k = 1 this model reduces to a GJR-GARCH(1,1,1) model for variance.
Diagonal and scalar BEKK models can be similarly adapted.
An asymmetric version of Matrix GARCH can be constructed in a similar manner,

$$
\Sigma_t = \mathbf{C}\mathbf{C}' + \mathbf{A}\mathbf{A}'\odot\varepsilon_{t-1}\varepsilon_{t-1}' + \mathbf{G}\mathbf{G}'\odot\zeta_{t-1}\zeta_{t-1}' + \mathbf{B}\mathbf{B}'\odot\Sigma_{t-1} \qquad (9.56)
$$

where G is a lower triangular parameter matrix. The dynamics of the covariances in the
asymmetric Matrix GARCH process are given by

σ_{ij,t} = c̃_{ij} + ã_{ij}r_{i,t-1}r_{j,t-1} + g̃_{ij}r_{i,t-1}r_{j,t-1}I_{i,t-1}I_{j,t-1} + b̃_{ij}σ_{ij,t-1}

where c˜i j is the ijth element of CC0 and ã i j , g˜i j and b̃i j are similarly defined. All conditional
variances follow GJR-GARCH(1,1,1) models and covariances evolve using similar dynamics
only being driven by cross products of returns. The asymmetric term has a slightly differ-
ent interpretation for covariance since it is only relevant when both indicators are 1 which
only occurs if both markets experience "bad" news (negative returns). An asymmetric DCC
model has been proposed in Cappiello, Engle & Sheppard (2006).

9.4.8 Fitting Multivariate GARCH Models

Returns are typically assumed to be conditionally multivariate normal, and so model pa-
rameters are typically estimated by maximizing the corresponding likelihood function,

$$
f(\varepsilon_t; \theta) = (2\pi)^{-\frac{k}{2}}|\Sigma_t|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\varepsilon_t'\Sigma_t^{-1}\varepsilon_t\right) \qquad (9.57)
$$
where θ contains the collection of parameters in the model. Estimation is, in principle, a
simple problem. In practice parameter estimation is only straight-forward when the num-
ber of assets, k is relatively small (less than 10) or when the model is tightly parameterized
(e.g. scalar BEKK). The problems in estimation arise for two reasons. First, the likelihood is
relatively flat and so finding its maximum value is difficult for optimization software. Sec-
ond, the computational cost of calculating the likelihood is increasing in the number of
unknown parameters and typically grows at rate k³ in multivariate ARCH models.
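A sketch of the log of the likelihood in eq. (9.57) summed over the sample, given a T by k by k array of fitted conditional covariances; an optimizer would maximize this over the parameters that generate the covariances. All names are illustrative.

import numpy as np

def gaussian_loglik(eps, sigma):
    # eps: (T, k) residuals; sigma: (T, k, k) fitted conditional covariances
    T, k = eps.shape
    ll = 0.0
    for t in range(T):
        _, logdet = np.linalg.slogdet(sigma[t])
        quad = eps[t] @ np.linalg.solve(sigma[t], eps[t])
        ll += -0.5 * (k * np.log(2 * np.pi) + logdet + quad)
    return ll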
A number of models have been designed to use multi-stage estimation to avoid these
problems, including:

• Covariance Targeting BEKK : The intercept is concentrated out using the sample co-
variance of returns, and so only the parameters governing the dynamics of the con-
ditional covariance need to be estimated using numerical methods.

CCC GARCH Correlation


FSLCX OAKMX WHOSX
FSLCX 1.000 0.775 -0.169
OAKMX 0.775 1.000 -0.163
WHOSX -0.169 -0.163 1.000

Table 9.3: Conditional correlation from a CCC GARCH model for three mutual funds spanning
small cap stocks (FSLCX), large cap stocks (OAKMX), and long government bond returns
(WHOSX).

• Constant Conditional Correlation: Fitting a CCC GARCH involves fitting k univariate


GARCH models and then using a closed form estimator for the constant conditional
correlation.

• Dynamic Conditional Correlation: Fitting a DCC GARCH combines the first stage of
the CCC GARCH with correlation targeting similar to the covariance targeting BEKK.

• Orthogonal GARCH: Orthogonal GARCH only involves fitting p ≤ k univariate GARCH


models and uses a closed form estimator for the idiosyncratic variance.

9.4.9 Application: Mutual Fund Returns


Three mutual funds were used to illustrate the differences (and similarities) of the dynamic
covariance models. The three funds were:

• Oakmark I (OAKMX) - A broad large cap fund

• Fidelity Small Cap Stock (FSLCX) - A broad small cap fund which seeks to invest in
firms with capitalizations similar to those in the Russell 2000 or S&P 600.

• Wasatch-Hoisington US Treasury (WHOSX) - A fund which invests at least 90% of total


assets in U.S. Treasury securities and can vary the average duration of assets held from
1 to 25 years, depending on market conditions.

These funds are used to capture interesting aspects of the broad set of investment oppor-
tunities. All data was taken from CRSP's database from January 1, 1999 until December 31,
2008. Table 9.3 contains the estimated correlation from the CCC-GARCH model where
each volatility series was assumed to follow a standard GARCH(1,1). This shows that the
correlation between the two equity funds is large and positive, while the correlations between
the equity funds and the treasury fund are negative on average. Table 9.4 contains the param-
eters of the dynamics for the DCC, scalar BEKK, an asymmetric scalar BEKK, Matrix GARCH
and an asymmetric version of the Matrix GARCH. The parameters in the DCC are typical
of DCC models – the parameters sum to nearly 1 and α is smaller than is typically found

Multivariate GARCH Model Estimates


α γ β
DCC 0.025 – 0.970
(6.84) (207)
Scalar vec 0.046 – 0.950
(11.6) (203)
Asymmetric Scalar vec 0.043 0.009 0.948
(11.6) (2.37) (175)

AA0 GG0 BB0


0.058 0.060 0.033 – – – 0.931 0.930 0.945
(8.68) (9.33) (8.68) (118) (130) (177)
Matrix GARCH 0.060 0.064 0.035 – – – 0.930 0.929 0.944
(9.33) (8.75) (8.60) (130) (129) (180)
0.033 0.035 0.035 – – – 0.945 0.944 0.959
(8.68) (8.60) (5.72) (177) (180) (130)

0.034 0.032 0.035 0.055 0.060 −0.007 0.926 0.927 0.942


(6.01) (6.57) (8.51) (4.89) (5.56) (−4.06) (80.3) (94.6) (131)
Asymmetric Matrix GARCH 0.032 0.029 0.033 0.060 0.070 −0.007 0.927 0.929 0.943
(6.57) (6.37) (8.44) (5.56) (5.79) (−4.48) (94.6) (107) (156)
0.035 0.033 0.036 −0.007 −0.007 0.001 0.942 0.943 0.958
(8.51) (8.44) (6.30) (−4.06) (−4.48) (2.30) (131) (156) (137)

Table 9.4: Parameter estimates (t -stats in parenthesis) from a selection of multivariate


ARCH models for three mutual funds spanning small cap stocks (FSLCX), large cap stocks
(OAKMX), and long government bond returns (WHOSX). The top panel contains results for
DCC, scalar v e c and asymmetric scalar v e c . The bottom panel contains estimation results
for Matrix GARCH and asymmetric version of the Matrix GARCH model, which shows large
differences in asymmetries between equity and bond returns.

in univariate models. This indicates that correlation is very persistent but probably moves
slower than volatility. The parameters in the scalar BEKK and asymmetric scalar BEKK are
similar to what one would typically find in a univariate model, although the asymmetry is
weak. The Matrix GARCH parameters are fairly homogeneous although the treasury fund
is less responsive to news (smaller coefficient in AA0 ). The most interesting finding in this
table is in the asymmetric Matrix GARCH model where the response to “bad” news is very
different between the equity funds and the treasury fund. This heterogeneity is likely
the source of the small asymmetry parameter in the asymmetric scalar BEKK.

Figure 9.4 plots the annualized volatility for these series from 4 models: the CCC (stan-
dard GARCH(1,1)), the two RiskMetrics methodologies, and the asymmetric scalar BEKK.
All volatilities broadly agree which may be surprising given the differences in the models.
Figures 9.5, 9.6 and 9.7 plot the correlations as fit from 6 different models. Aside from the
CCC GARCH fit (which is constant), all models broadly agree about the correlation dynam-
ics in these series.

[Figure 9.4 about here: Mutual Fund Volatility – annualized volatility of the Large Cap, Small Cap and Long Government Bond funds from the CCC/DCC, RM1994, RM2006 and asymmetric scalar vec models.]

Figure 9.4: The three panels plot the volatility for the three mutual funds spanning small
caps, large caps and long government bond returns. The volatilities from all of the models
are qualitatively similar and the only visible differences occur when volatility is falling.

9.5 Realized Covariance

Realized covariance estimates the integrated covariance over some period, usually a day.
Suppose prices followed a k -variate continuous time diffusion,

[Figure 9.5 about here: Small Cap - Large Cap Correlation – conditional correlation models (CCC, DCC), RiskMetrics (RM1994, RM2006) and vec models (symmetric and asymmetric scalar vec).]

Figure 9.5: The three graphs plot the fit correlation from 6 models. The conditional corre-
lation estimates are broadly similar, aside from the CCC GARCH which assumes that cor-
relation is constant.

d pt = µt dt + Ωt d Wt

where µt is the instantaneous drift, Σt = Ωt Ω0t is the instantaneous covariance, and d Wt is


a k -variate Brownian motion. Realized covariance estimates

[Figure 9.6 about here: Small Cap - Long Government Bond Correlation – conditional correlation models (CCC, DCC), RiskMetrics (RM1994, RM2006) and vec models.]

Figure 9.6: The three graphs plot the fit correlation from 6 models. The conditional corre-
lation estimates are broadly similar, aside from the CCC GARCH which assumes that cor-
relation is constant.

$$
\int_0^1 \Sigma_s\,ds
$$

where the bounds 0 and 1 represent the (arbitrary) interval over which the realized covari-

[Figure 9.7 about here: Large Cap - Long Government Bond Correlation – conditional correlation models (CCC, DCC), RiskMetrics (RM1994, RM2006) and vec models.]

Figure 9.7: The three graphs plot the fit correlation from 6 models. The conditional corre-
lation estimates are broadly similar, aside from the CCC GARCH which assumes that cor-
relation is constant.

ance is computed.⁸
Realized covariance is computed using the outer-product of high-frequency returns.

⁸ In the presence of jumps, realized covariance estimates the quadratic covariation, which is the integrated
covariance plus the outer product of the jumps,
$$
\int_0^1 \Sigma_s\,ds + \sum_{0\leq s\leq 1}\Delta p_s\,\Delta p_s',
$$
where Δp_s are the jumps.

Definition 9.23 (Realized Covariance). The realized covariance is defined

$$
RC_t = \sum_{i=1}^{m} \mathbf{r}_{i,t}\mathbf{r}_{i,t}' = \sum_{i=1}^{m}\left(\mathbf{p}_{i,t} - \mathbf{p}_{i-1,t}\right)\left(\mathbf{p}_{i,t} - \mathbf{p}_{i-1,t}\right)', \qquad (9.58)
$$

where r_{i,t} is the ith return on day t.
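A minimal sketch of eq. (9.58) from a single day of synchronized log prices (an (m+1) by k array, so that the m rows of first differences are the intra-daily returns); the names are illustrative.

import numpy as np

def realized_covariance(log_prices):
    r = np.diff(log_prices, axis=0)    # m by k intra-daily returns
    return r.T @ r                     # sum of outer products r_{i,t} r_{i,t}'

# Example: 78 five-minute returns on two assets
p = np.cumsum(0.001 * np.random.standard_normal((79, 2)), axis=0)
RC = realized_covariance(p)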

In principle prices should be sampled as frequently as possible to maximize the precision


of the realized covariance estimator. In practice this is not possible since:

• Prices, especially transaction prices (trades), are contaminated by noise (e.g. bid-ask
bounce).

• Prices are not perfectly synchronized. For example, asset i might trade at 10:00:00
while the most recent trade of asset j might have occurred at 9:59:50. The lack of
synchronization will bias the covariance between the two assets toward 0.

The standard method to address these two concerns is to sample relatively infrequently,
for example every 5 minutes. An improved method is to use a modified realized covariance
estimator which uses subsampling. Suppose returns were computed every minute, but
that microstructure concerns (noise and synchronization) do not allow for sampling more
frequently than every 10 minutes. The subsampled realized covariance uses all 10-minute
returns, not just non-overlapping ones, to estimate the covariance.

Definition 9.24 (Subsampled Realized Covariance). The subsampled realized covariance


estimator is defined

$$
\begin{aligned}
RC_t^{SS} &= \frac{m}{n(m-n+1)}\sum_{i=1}^{m-n+1}\sum_{j=1}^{n}\mathbf{r}_{i+j-1,t}\,\mathbf{r}_{i+j-1,t}' \qquad (9.59)\\
&= \frac{1}{n}\sum_{j=1}^{n}\frac{m}{m-n+1}\sum_{i=1}^{m-n+1}\mathbf{r}_{i+j-1,t}\,\mathbf{r}_{i+j-1,t}'\\
&= \frac{1}{n}\sum_{j=1}^{n}\widetilde{RC}_{j,t},
\end{aligned}
$$

where there are m high-frequency returns available and the selected sampling time is based
on n returns.



For example, suppose data was available from 9:30:00 to 16:00:00, and that prices were
sampled every minute. The standard realized covariance would compute returns using
prices at 9:30:00, 9:40:00, 9:50:00, . . .. The subsampled realized covariance would compute
returns using all 10 minute windows, e.g. 9:30:00 and 9:40:00, 9:31:00 and 9:41:00, 9:32:00
and 9:42:00, and so on. In this example m , the number of returns available over a 6.5 hour
day is 390 and n , the number of returns used in the desired sampled window, is 10.
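One way to implement the subsampled estimator is as the average of the realized covariances computed on each of the n offset grids (the last line of eq. 9.59). The sketch below assumes synchronized one-minute log prices; the function and variable names are illustrative.

import numpy as np

def subsampled_realized_covariance(log_prices, n):
    # log_prices: (m+1, k) synchronized log prices; n: periods per sampling interval (e.g. 10)
    k = log_prices.shape[1]
    rc = np.zeros((k, k))
    for j in range(n):
        grid = log_prices[j::n]        # prices on the j-th offset grid
        r = np.diff(grid, axis=0)      # n-period returns on this grid
        rc += r.T @ r                  # realized covariance on grid j
    return rc / n                      # average over the n grids

p = np.cumsum(0.0005 * np.random.standard_normal((391, 2)), axis=0)   # 390 one-minute returns
RC_ss = subsampled_realized_covariance(p, 10)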
Barndorff-Nielsen, Hansen, Lunde & Shephard (2011) recently proposed an alternative
method to compute the realized covariance known as a realized kernel. It is superficially
similar to realized covariance except that realized kernels use a weighting function similar
to that in a Newey & West (1987) covariance estimator.
Definition 9.25 (Realized Kernel). The realized kernel is defined as

$$
RK_t = \Gamma_0 + \sum_{i=1}^{H} K\!\left(\frac{i}{H+1}\right)\left(\Gamma_i + \Gamma_i'\right) \qquad (9.60)
$$
$$
\Gamma_j = \sum_{i=j+1}^{\tilde{m}} \tilde{\mathbf{r}}_{i,t}\,\tilde{\mathbf{r}}_{i-j,t}'
$$

where r̃ are refresh time returns, m̃ is the number of refresh time returns, K(·) is a kernel
weighting function and H is a parameter which controls the bandwidth.
Refresh time returns are needed to ensure that prices are not overly stale, and are computed
by sampling all prices using last-price interpolation only when all assets have
traded. For example, consider the transactions in table 9.5 which contains a hypothetical
series of trade times for MSFT and IBM where Ø indicates a trade with the time stamp
indicated on the left. A Ø in the refresh column indicates that this time stamp is a refresh
time. The final two columns indicate the time stamp of the price which would be used for
MSFT and IBM when computing the refresh-time return.
The recommended kernel is Parzen's kernel,

$$
K(x) = \begin{cases}
1 - 6x^2 + 6x^3 & 0 \leq x < \tfrac{1}{2} \\
2(1-x)^3 & \tfrac{1}{2} \leq x < 1 \\
0 & x \geq 1
\end{cases} \qquad (9.61)
$$

Selection of the bandwidth parameter, H, is an important choice when implementing realized
kernels, although a discussion of the choices needed to correctly determine the bandwidth
is beyond the scope of these notes. See Barndorff-Nielsen et al. (2008) and Barndorff-Nielsen,
Hansen, Lunde & Shephard (2011) for detailed discussions.
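A sketch of eqs. (9.60)-(9.61) for a given bandwidth H, taking refresh-time returns as given; bandwidth selection and refresh-time construction are omitted (see the references above), and the names are illustrative.

import numpy as np

def parzen(x):
    if x < 0.5:
        return 1 - 6 * x ** 2 + 6 * x ** 3
    if x < 1:
        return 2 * (1 - x) ** 3
    return 0.0

def realized_kernel(r, H):
    # r: (m_tilde, k) refresh-time returns; H: bandwidth
    def gamma(j):                       # Gamma_j = sum_i r_i r_{i-j}'
        return sum(np.outer(r[i], r[i - j]) for i in range(j, len(r)))
    rk = gamma(0)
    for i in range(1, H + 1):
        g = gamma(i)
        rk += parzen(i / (H + 1)) * (g + g.T)
    return rk

RK = realized_kernel(0.001 * np.random.standard_normal((390, 2)), H=10)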

9.5.1 Realized Correlation and Beta


Realized Correlation is the realized analogue of the usual correlation estimator, only de-
fined in terms of realized covariance.

Trade Time MSFT IBM Refresh MSFT Time IBM Time


9:30:00 Ø Ø Ø 9:30:00 9:30:00
9:30:01 Ø Ø Ø 9:30:01 9:30:01
9:30:02
9:30:03 Ø
9:30:04 Ø
9:30:05 Ø Ø 9:30:04 9:30:05
9:30:06 Ø
9:30:07 Ø
9:30:08 Ø Ø 9:30:08 9:30:07

Table 9.5: This table illustrates the concept of refresh-time sampling. Prices are sampled
every time all assets have traded, using last-price interpolation. Refresh-time sampling
usually eliminates some of the data, as with the 9:30:03 trade of MSFT, and produces
some sampling points where the prices are not perfectly synchronized, as with the 9:30:08
refresh-time price which uses a MSFT price from 9:30:08 and an IBM price from 9:30:07.

Definition 9.26 (Realized Correlation). The realized correlation between two series is defined
\[
RCorr_{ij} = \frac{RC_{ij}}{\sqrt{RC_{ii}\, RC_{jj}}}
\]
where RC_{ij} is the realized covariance between assets i and j and RC_{ii} and RC_{jj} are the realized variances of assets i and j, respectively.
Realized Betas are similarly defined, only using the definition of β (which is a function of the covariance).

Definition 9.27 (Realized Beta). Suppose RC_t is a k+1 by k+1 realized covariance matrix for an asset and a set of observable factors where the asset is in position 1, so that the realized covariance can be partitioned
\[
RC = \begin{bmatrix} RV_i & RC_{fi}' \\ RC_{fi} & RC_{ff} \end{bmatrix}
\]
where RV_i is the realized variance of the asset being studied, RC_{fi} is the k by 1 vector of realized covariances between the asset and the factors, and RC_{ff} is the k by k realized covariance of the factors. The Realized Beta is defined
\[
R\beta = RC_{ff}^{-1} RC_{fi}.
\]

In the usual case where there is only one factor, usually the market, the realized beta is the ratio of the realized covariance between the asset and the market to the realized variance of the market. Realized Betas are similar to other realized measures in that they are model free and, as long as prices can be sampled frequently and have little market microstructure noise, are accurate measures of the current exposure to changes in the market.
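As a concrete illustration, the partitioned calculation takes only a few lines. The sketch below assumes rc is a (k+1) by (k+1) realized covariance matrix with the asset of interest in position 0 (position 1 in the text's numbering); the function name and the numbers in the example are hypothetical.

import numpy as np

def realized_beta(rc):
    # Realized beta from a (k+1) x (k+1) realized covariance with the asset first
    rc_fi = rc[1:, 0]        # k x 1 realized covariances between the factors and the asset
    rc_ff = rc[1:, 1:]       # k x k realized covariance of the factors
    return np.linalg.solve(rc_ff, rc_fi)

# One-factor example: beta is the asset-market realized covariance over the market RV
rc = np.array([[2.0, 1.2],
               [1.2, 1.5]])
print(realized_beta(rc))     # [0.8] = 1.2 / 1.5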

9.5.2 Modeling Realized Covariance

Modeling realized covariance and realized kernels can be accomplished by modifying stan-
dard multivariate ARCH models. The basic assumption is that the mean of the realized
covariance, conditional on the time t − 1 information, is Σt ,

\[
RC_t \mid \mathcal{F}_{t-1} \sim F(\Sigma_t, \upsilon)
\]
where F(·,·) is some distribution with conditional mean Σ_t which may depend on other parameters unrelated to the mean which are contained in υ. This assumption implies that the realized covariance is driven by a matrix-valued shock which has conditional expectation I_k, and so
\[
RC_t = \Sigma_t^{\frac{1}{2}}\, \Xi\, \Sigma_t^{\frac{1}{2}}
\]
where Ξ ∼ i.i.d. F(I, υ̃) and υ̃ is used to denote that these parameters are related to the original parameters although will generally be different. This assumption is identical to the one made when modeling realized variance as a non-negative process with a multiplicative error (MEM) where it is assumed that RV_t = σ_t² ξ_t = σ_t ξ_t σ_t where ξ_t ∼ i.i.d. F(1, υ).

With this assumption most multivariate ARCH models can be used. Consider the standard BEKK model,
\[
\Sigma_t = CC' + A r_{t-1} r_{t-1}' A' + B \Sigma_{t-1} B'.
\]
The BEKK can be viewed as a multiplicative error model and used for realized covariance by specifying the dynamics as
\[
\Sigma_t = CC' + A\, RC_{t-1}\, A' + B \Sigma_{t-1} B'.
\]
Other ARCH models can be similarly adapted by replacing the outer product of returns by
the realized covariance or realized kernel. Estimation is not more difficult than the estima-
tion of a multivariate ARCH model since parameters can be estimated using the Wishart
likelihood. See Noureldin et al. (2012) for details.
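To fix ideas, a heavily simplified scalar version of these dynamics can be filtered as below, with scalar parameters a and b replacing the matrices A and B and covariance targeting replacing CC'. This is only a sketch of the recursion, not the Wishart-likelihood estimator of Noureldin et al. (2012); the parameter values and the simulated "realized covariances" are illustrative.

import numpy as np

def filter_sigma(rc, a, b):
    # Sigma_t = (1 - a - b) * mean(RC) + a * RC_{t-1} + b * Sigma_{t-1}
    T, k, _ = rc.shape
    rc_bar = rc.mean(axis=0)
    sigma = np.empty_like(rc)
    sigma[0] = rc_bar
    for t in range(1, T):
        sigma[t] = (1 - a - b) * rc_bar + a * rc[t - 1] + b * sigma[t - 1]
    return sigma

# Example with simulated daily realized covariances for two assets
rng = np.random.default_rng(1)
x = rng.standard_normal((500, 10, 2))
rc = np.einsum('tmi,tmj->tij', x, x) / 10
sigma = filter_sigma(rc, a=0.3, b=0.65)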

9.6 Measuring Dependence

Covariance modeling is only the first step to understanding risk in portfolios since covariance (and correlation) is only one measure of dependence, and is often found lacking in many applications.

9.6.1 Linear Dependence

Linear or Pearson correlation is the most common measure of dependence.

Definition 9.28 (Linear (Pearson) Correlation). The linear (Pearson) correlation between two random variables X and Y is
\[
\rho = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{V}[X]\,\mathrm{V}[Y]}}. \tag{9.62}
\]

The sample correlation is estimated using
\[
\hat{\rho} = \frac{\sum_{t=1}^{T}(x_t - \hat{\mu}_x)(y_t - \hat{\mu}_y)}{\sqrt{\sum_{s=1}^{T}(x_s - \hat{\mu}_x)^2 \sum_{s=1}^{T}(y_s - \hat{\mu}_y)^2}}. \tag{9.63}
\]
where \hat{\mu}_x and \hat{\mu}_y are the sample means of x_t and y_t.


Linear correlation measures the strength of the linear relationship between standardized versions of X and Y. Correlation is invariant to increasing affine transformations of X and/or Y (e.g. a + bX with b > 0). It is not, however, invariant to non-linear transformations, even when the non-linear transformation is order preserving (e.g. the log of a non-negative random variable). Linear correlation is also insufficient to characterize the dependence between two random variables, except in the special case where X and Y follow a bivariate normal distribution. Moreover, two distributions can have the same correlation yet exhibit very different characteristics with respect to the amount of diversification available.

9.6.2 Non-linear Dependence

A number of measures have been designed to overcome some of the shortcomings of correlation as a measure of risk. These are broadly classified as measures of non-linear dependence.

9.6.2.1 Rank Correlation

Rank correlation, also known as Spearman correlation, is an alternative measure of dependence which can assess the strength of a relationship and is robust to certain non-linear transformations. Suppose X and Y are random variables, X ∼ N(0,1) and Y ≡ X^λ where λ is odd. If λ = 1 then Y = X and the linear correlation is 1. If λ = 3 the correlation is .77. If λ = 5 then the correlation is only .48, despite Y being a function of only X. As λ becomes increasingly large the correlation becomes arbitrarily small despite the one-to-one relationship between X and Y. Rank correlation is robust to non-linear transformations and so will return a correlation of 1 between X and Y for any power λ.

[Figure 9.8: Simulated Returns with Symmetric and Asymmetric Dependence. The top panel (Symmetric Dependence) and bottom panel (Asymmetric Dependence) each plot Individual Equity returns against Market returns.]

Figure 9.8: These graphs illustrate simulated returns from a CAP-M where the market has a t_6 distribution with the same variance as the S&P 500 and the idiosyncratic shock is normal with the same variance as the average idiosyncratic shock in the S&P 500 constituents. The asymmetric dependence was introduced by making the idiosyncratic error heteroskedastic by defining its variance to be σ_ε exp(10 r_m I_{[r_m<0]}), and so the idiosyncratic component has a smaller variance when the market return is negative than it does when the market return is positive.

Definition 9.29 (Rank (Spearman) Correlation). The rank (Spearman) correlation between

two random variables X and Y is
\[
\rho_s(X,Y) = \mathrm{Corr}\left(F_X(X), F_Y(Y)\right) = \frac{\mathrm{Cov}\left[F_X(X), F_Y(Y)\right]}{\sqrt{\mathrm{V}\left[F_X(X)\right]\, \mathrm{V}\left[F_Y(Y)\right]}} = 12\, \mathrm{Cov}\left[F_X(X), F_Y(Y)\right] \tag{9.64}
\]
where the final identity uses the fact that the variance of a uniform(0,1) is 1/12.

The rank correlation measures the correlation between the probability integral trans-
forms of X and Y . The use of the probability integral transform means that rank correlation
is preserved under strictly increasing transformations (decreasing monotonic changes the
sign), and so ρs (X , Y ) = ρs (T1 (X ), T2 (Y )) when T1 and T2 are any strictly increasing func-

tions.
The sample analogue of the Spearman correlation makes use of the empirical ranks of
the observed data. Define r x ,i to be the rank of xi , where a rank of 1 corresponds to the
smallest value, a rank of n corresponds to the largest value, where any ties are all assigned
the average value of the ranks associated with the values in the tied group. Define r y ,i in an
identical fashion on yi . The sample rank correlation between X and Y is computed as the
sample correlation of the ranks,

\[
\hat{\rho}_s = \frac{\sum_{i=1}^{n}\left(\frac{r_{x,i}}{n+1} - \frac{1}{2}\right)\left(\frac{r_{y,i}}{n+1} - \frac{1}{2}\right)}{\sqrt{\sum_{i=1}^{n}\left(\frac{r_{x,i}}{n+1} - \frac{1}{2}\right)^2}\sqrt{\sum_{j=1}^{n}\left(\frac{r_{y,j}}{n+1} - \frac{1}{2}\right)^2}}
= \frac{\sum_{i=1}^{n}\left(r_{x,i} - \frac{n+1}{2}\right)\left(r_{y,i} - \frac{n+1}{2}\right)}{\sqrt{\sum_{i=1}^{n}\left(r_{x,i} - \frac{n+1}{2}\right)^2}\sqrt{\sum_{j=1}^{n}\left(r_{y,j} - \frac{n+1}{2}\right)^2}}
\]
where r_{x,i}/(n+1) is the empirical quantile of x_i. These two formulations are identical since the arguments in the second are linear transformations of the arguments in the first, and linear correlation is robust to linear transformations.
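A direct implementation of the sample rank correlation simply computes average-tie ranks and then applies the linear correlation estimator to them. The sketch below uses scipy.stats.rankdata for the ranks; the power-transformation example mirrors the discussion above.

import numpy as np
from scipy.stats import rankdata

def rank_correlation(x, y):
    # Sample Spearman correlation: linear correlation of the (average-tie) ranks
    r_x = rankdata(x)
    r_y = rankdata(y)
    return np.corrcoef(r_x, r_y)[0, 1]

# Rank correlation is invariant to the monotonic transformation x -> x**5
x = np.random.standard_normal(1000)
y = x**5
print(np.corrcoef(x, y)[0, 1])   # well below 1
print(rank_correlation(x, y))    # 1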

9.6.2.2 Kendall’s τ

Kendall’s τ is another measure of non-linear dependence which is based on the idea of


concordance. Concordance is defined with respect to differences in the signs of pairs of
random variables.

Definition 9.30 (Concordant Pair). The pairs of random variables (xi , yi ) and (x j , y j ) are
concordant if sgn(xi − x j ) = sgn(yi − y j ) where sgn(·) is the sign function which returns
-1 for negative values, 0 for zero, and +1 for positive values (equivalently defined if (xi −
x j )(yi − y j ) > 0).

If a pair is not concordant then it is discordant.

Definition 9.31 (Kendall's τ). Kendall's τ is defined
\[
\tau = \Pr\left(\mathrm{sgn}(x_i - x_j) = \mathrm{sgn}(y_i - y_j)\right) - \Pr\left(\mathrm{sgn}(x_i - x_j) \ne \mathrm{sgn}(y_i - y_j)\right) \tag{9.65}
\]

The estimator of Kendall's τ uses the obvious sample analogues to the probabilities in the definition. Define n_c = \sum_{i=1}^{n}\sum_{j=i+1}^{n} I_{[\mathrm{sgn}(x_i - x_j) = \mathrm{sgn}(y_i - y_j)]} as the count of the concordant pairs and n_d = \frac{1}{2}n(n-1) - n_c as the count of discordant pairs. The estimator of τ is
\[
\hat{\tau} = \frac{n_c - n_d}{\frac{1}{2}n(n-1)} \tag{9.66}
\]
\[
= \frac{n_c}{\frac{1}{2}n(n-1)} - \frac{n_d}{\frac{1}{2}n(n-1)} \tag{9.67}
\]
\[
= \widehat{\Pr}\left(\mathrm{sgn}(x_i - x_j) = \mathrm{sgn}(y_i - y_j)\right) - \widehat{\Pr}\left(\mathrm{sgn}(x_i - x_j) \ne \mathrm{sgn}(y_i - y_j)\right) \tag{9.68}
\]

[Figure 9.9: Rolling Dependence Measures for the S&P 500 and FTSE, showing the rolling Linear (Pearson) correlation, Rank (Spearman) correlation and Kendall's τ over 1986-2008.]

Figure 9.9: Plot of rolling linear correlation, rank correlation and Kendall’s τ between
weekly returns on the S&P 500 and the FTSE estimated using 252-day moving windows.
The measures broadly agree about the changes in dependence but not the level.

Dependence Measures for Weekly FTSE and S&P 500 Returns

              Linear (Pearson)   Rank (Spearman)   Kendall's τ
                   0.660              0.593           0.429
                  (0.028)            (0.027)         (0.022)

Table 9.6: Linear and rank correlation and Kendall's τ (bootstrap std. errors in parentheses) for weekly returns for the S&P 500 and FTSE 100. Weekly returns were used to minimize problems with non-synchronous closings in these two markets.

where \widehat{\Pr} denotes the empirical probability. Kendall's τ measures the difference between the probability a pair is concordant, n_c/(n(n-1)/2), and the probability a pair is discordant, n_d/(n(n-1)/2). Since τ is the difference between two probabilities it must fall in [−1, 1] where −1 indicates that all pairs are discordant, 1 indicates that all pairs are concordant, and τ is increasing as the concordance between the pairs increases. Like rank correlation, Kendall's τ is also invariant to increasing transformations since a pair that was concordant before the transformation (e.g. x_i > x_j and y_i > y_j) must also be concordant after a strictly increasing transformation (e.g. T_1(x_i) > T_1(x_j) and T_2(y_i) > T_2(y_j)).
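The estimator in (9.66) can be computed by brute force over all pairs. The O(n^2) sketch below is only intended to make the concordant/discordant counting explicit; scipy.stats.kendalltau provides a faster implementation.

import numpy as np

def kendall_tau(x, y):
    # Kendall's tau from the concordant and discordant pair counts, eq. (9.66)
    n = len(x)
    n_c = 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.sign(x[i] - x[j]) == np.sign(y[i] - y[j]):
                n_c += 1                      # concordant pair
    n_pairs = n * (n - 1) / 2
    n_d = n_pairs - n_c                       # remaining pairs are discordant
    return (n_c - n_d) / n_pairs

x = np.random.standard_normal(200)
y = 0.5 * x + np.random.standard_normal(200)
print(kendall_tau(x, y))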

[Figure 9.10: Exceedance Correlation for the S&P 500 and FTSE, plotting the Negative and Positive Exceedance Correlation against the quantile with a 95% bootstrap confidence interval.]

Figure 9.10: Plot of the exceedance correlations with 95% bootstrap confidence intervals
for weekly returns on the S&P 500 and FTSE (each series was divided by its sample standard
deviation). There is substantial asymmetry between positive and negative exceedance cor-
relation.

9.6.2.3 Exceedance Correlations and Betas

Exceedance correlation, like expected shortfall, is one of many exceedance measures which
can be constructed by computing expected values conditional on exceeding some thresh-
old. Exceedance correlation measures the correlation between two variables conditional
on both variables taking values in either the upper tail or lower tail.

Definition 9.32 (Exceedance Correlation). The exceedance correlation at level κ is defined as
\[
\rho^{+}(\kappa) = \mathrm{Corr}\left(x, y \mid x > \kappa, y > \kappa\right) \tag{9.69}
\]
\[
\rho^{-}(\kappa) = \mathrm{Corr}\left(x, y \mid x < -\kappa, y < -\kappa\right) \tag{9.70}
\]

Exceedance correlations are computed using the standard (linear) correlation estimator on the subset of data where both x > κ and y > κ (positive) or x < −κ and y < −κ (negative). Exceedance correlations can also be defined using series-specific cutoff points such as κ_x and κ_y, which are often used if the series do not have the same variance, and are often set using quantiles of x and y (e.g. the 10% quantile of each). Alternatively exceedance correlations can be computed with data transformed to have unit variance. Sample exceedance correlations are computed as

\[
\hat{\rho}^{+}(\kappa) = \frac{\hat{\sigma}^{+}_{xy}(\kappa)}{\hat{\sigma}^{+}_{x}(\kappa)\, \hat{\sigma}^{+}_{y}(\kappa)}, \qquad \hat{\rho}^{-}(\kappa) = \frac{\hat{\sigma}^{-}_{xy}(\kappa)}{\hat{\sigma}^{-}_{x}(\kappa)\, \hat{\sigma}^{-}_{y}(\kappa)} \tag{9.71}
\]

where
\[
\hat{\sigma}^{+}_{xy}(\kappa) = \frac{\sum_{t=1}^{T}\left(x_t - \hat{\mu}^{+}_x(\kappa)\right)\left(y_t - \hat{\mu}^{+}_y(\kappa)\right) I_{[x_t>\kappa \,\cap\, y_t>\kappa]}}{T^{+}_{\kappa}}
\]
\[
\hat{\sigma}^{-}_{xy}(\kappa) = \frac{\sum_{t=1}^{T}\left(x_t - \hat{\mu}^{-}_x(\kappa)\right)\left(y_t - \hat{\mu}^{-}_y(\kappa)\right) I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}}{T^{-}_{\kappa}}
\]
\[
\hat{\mu}^{+}_x(\kappa) = \frac{\sum_{t=1}^{T} x_t I_{[x_t>\kappa \,\cap\, y_t>\kappa]}}{T^{+}_{\kappa}}, \qquad \hat{\sigma}^{2+}_{x}(\kappa) = \frac{\sum_{t=1}^{T}\left(x_t - \hat{\mu}^{+}_x(\kappa)\right)^2 I_{[x_t>\kappa \,\cap\, y_t>\kappa]}}{T^{+}_{\kappa}}
\]
\[
\hat{\mu}^{-}_x(\kappa) = \frac{\sum_{t=1}^{T} x_t I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}}{T^{-}_{\kappa}}, \qquad \hat{\sigma}^{2-}_{x}(\kappa) = \frac{\sum_{t=1}^{T}\left(x_t - \hat{\mu}^{-}_x(\kappa)\right)^2 I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}}{T^{-}_{\kappa}}
\]
\[
T^{+}_{\kappa} = \sum_{t=1}^{T} I_{[x_t>\kappa \,\cap\, y_t>\kappa]}, \qquad T^{-}_{\kappa} = \sum_{t=1}^{T} I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}
\]

where the quantities for y are similarly defined. Exceedance correlation can only be estimated if the region where x < −κ and y < −κ is populated with data, and it is possible for some assets that this region will be empty (e.g. if the assets have strong negative dependence such as with equity and bond returns).
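A sketch of the sample exceedance correlations at a single threshold κ is given below; it simply applies the linear correlation estimator to the observations in the joint upper or joint lower tail. The simulated standardized returns and the guard against nearly empty regions are illustrative assumptions.

import numpy as np

def exceedance_correlation(x, y, kappa):
    # Sample positive and negative exceedance correlations at threshold kappa
    upper = (x > kappa) & (y > kappa)
    lower = (x < -kappa) & (y < -kappa)
    rho_plus = np.corrcoef(x[upper], y[upper])[0, 1] if upper.sum() > 1 else np.nan
    rho_minus = np.corrcoef(x[lower], y[lower])[0, 1] if lower.sum() > 1 else np.nan
    return rho_plus, rho_minus

rng = np.random.default_rng(2)
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=5000)
print(exceedance_correlation(z[:, 0], z[:, 1], kappa=1.0))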
Inference can be conducted using the bootstrap or using analytical methods. Hong,
Tu & Zhou (2007) show that inference on exceedance correlations can be conducted by
viewing these estimators as method of moments estimators. Define the standardized ex-
ceedance residuals as,

\[
\tilde{x}^{+}_t(\kappa) = \frac{x_t - \mu^{+}_x(\kappa)}{\sigma^{+}_x(\kappa)}, \qquad
\tilde{x}^{-}_t(\kappa) = \frac{x_t - \mu^{-}_x(\kappa)}{\sigma^{-}_x(\kappa)}, \qquad
\tilde{y}^{+}_t(\kappa) = \frac{y_t - \mu^{+}_y(\kappa)}{\sigma^{+}_y(\kappa)}, \qquad
\tilde{y}^{-}_t(\kappa) = \frac{y_t - \mu^{-}_y(\kappa)}{\sigma^{-}_y(\kappa)}.
\]

These form the basis of the moment conditions,
\[
\frac{T}{T^{+}_{\kappa}}\left(\tilde{x}^{+}_t(\kappa)\, \tilde{y}^{+}_t(\kappa) - \rho^{+}(\kappa)\right) I_{[x_t>\kappa \,\cap\, y_t>\kappa]} \tag{9.72}
\]
\[
\frac{T}{T^{-}_{\kappa}}\left(\tilde{x}^{-}_t(\kappa)\, \tilde{y}^{-}_t(\kappa) - \rho^{-}(\kappa)\right) I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}. \tag{9.73}
\]

Inference on a vector of exceedance correlation can be conducted by stacking the moment


conditions and using a HAC covariance estimator such as the Newey & West (1987) estima-
tor. Suppose κ was a vector of thresholds κ1 , κ2 , . . . , κn , then
\[
\sqrt{T}\begin{pmatrix} \hat{\rho}^{+}(\kappa) - \rho^{+}(\kappa) \\ \hat{\rho}^{-}(\kappa) - \rho^{-}(\kappa) \end{pmatrix} \overset{d}{\to} N(0, \Omega)
\]

Ω can be estimated using the moment conditions,

\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{l=1}^{L} w_l\left(\hat{\Gamma}_l + \hat{\Gamma}_l'\right) \tag{9.74}
\]
where w_l = 1 − l/(L+1),
\[
\hat{\Gamma}_j = \sum_{t=j+1}^{T} \xi_t\, \xi_{t-j}'
\]

and
\[
\xi_t = T\begin{bmatrix}
\frac{1}{T^{+}_{\kappa_1}}\left(\tilde{x}^{+}_t(\kappa_1)\tilde{y}^{+}_t(\kappa_1) - \rho^{+}(\kappa_1)\right) I_{[x_t>\kappa_1 \,\cap\, y_t>\kappa_1]} \\
\vdots \\
\frac{1}{T^{+}_{\kappa_n}}\left(\tilde{x}^{+}_t(\kappa_n)\tilde{y}^{+}_t(\kappa_n) - \rho^{+}(\kappa_n)\right) I_{[x_t>\kappa_n \,\cap\, y_t>\kappa_n]} \\
\frac{1}{T^{-}_{\kappa_1}}\left(\tilde{x}^{-}_t(\kappa_1)\tilde{y}^{-}_t(\kappa_1) - \rho^{-}(\kappa_1)\right) I_{[x_t<-\kappa_1 \,\cap\, y_t<-\kappa_1]} \\
\vdots \\
\frac{1}{T^{-}_{\kappa_n}}\left(\tilde{x}^{-}_t(\kappa_n)\tilde{y}^{-}_t(\kappa_n) - \rho^{-}(\kappa_n)\right) I_{[x_t<-\kappa_n \,\cap\, y_t<-\kappa_n]}
\end{bmatrix}.
\]
n

Exceedance beta is similarly defined, only using the ratio of an exceedance covariance
to an exceedance variance.

Definition 9.33 (Exceedance Beta). The exceedance beta at level κ is defined as
\[
\beta^{+}(\kappa) = \frac{\mathrm{Cov}\left[x, y \mid x > \kappa, y > \kappa\right]}{\mathrm{V}\left[x \mid x > \kappa, y > \kappa\right]} = \frac{\sigma^{+}_y(\kappa)}{\sigma^{+}_x(\kappa)}\, \rho^{+}(\kappa) \tag{9.75}
\]
\[
\beta^{-}(\kappa) = \frac{\mathrm{Cov}\left[x, y \mid x < -\kappa, y < -\kappa\right]}{\mathrm{V}\left[x \mid x < -\kappa, y < -\kappa\right]} = \frac{\sigma^{-}_y(\kappa)}{\sigma^{-}_x(\kappa)}\, \rho^{-}(\kappa) \tag{9.76}
\]

Sample exceedance betas are computed using the sample analogues,
\[
\hat{\beta}^{+}(\kappa) = \frac{\hat{\sigma}^{+}_{xy}(\kappa)}{\hat{\sigma}^{2+}_{x}(\kappa)}, \quad \text{and} \quad \hat{\beta}^{-}(\kappa) = \frac{\hat{\sigma}^{-}_{xy}(\kappa)}{\hat{\sigma}^{2-}_{x}(\kappa)}, \tag{9.77}
\]

and inference can be conducted in an analogous manner to exceedance correlations using a HAC estimator and the moment conditions
\[
\frac{T}{T^{+}_{\kappa}}\left(\frac{\sigma^{+}_y(\kappa)}{\sigma^{+}_x(\kappa)}\, \tilde{x}^{+}_t(\kappa)\, \tilde{y}^{+}_t(\kappa) - \beta^{+}(\kappa)\right) I_{[x_t>\kappa \,\cap\, y_t>\kappa]} \tag{9.78}
\]
\[
\frac{T}{T^{-}_{\kappa}}\left(\frac{\sigma^{-}_y(\kappa)}{\sigma^{-}_x(\kappa)}\, \tilde{x}^{-}_t(\kappa)\, \tilde{y}^{-}_t(\kappa) - \beta^{-}(\kappa)\right) I_{[x_t<-\kappa \,\cap\, y_t<-\kappa]}. \tag{9.79}
\]

9.6.3 Application: Dependence between the S&P 500 and the FTSE 100

Daily data was downloaded from Yahoo! Finance for the entire history of both the S&P
500 and the FTSE 100. Table 9.6 contains the three correlations and standard errors com-
puted using the bootstrap where weekly returns were used to avoid issues with return synchronization (overlapping returns were used in all applications). The linear correla-
tion is the largest, followed by the rank and Kendall’s τ. Figure 9.9 plots these same three
measures only using 252-day moving averages. The three measures broadly agree about
the changes in dependence. Figure 9.10 plots the negative and positive exceedance cor-
relation for these two assets and 95% confidence intervals computed using the bootstrap.
The exceedance thresholds were chosen using quantiles of each series where negative cor-
responds to thresholds less than or equal to 50% and positive includes thresholds greater
than or equal to 50%. The correlation between these markets differs substantially depend-
ing on the sign of the returns.

9.6.4 Application: Asymmetric Dependence from Simple Models

Asymmetric dependence can be generated from simple models. The simulated data in both panels in figure 9.8 was from a standard CAP-M calibrated to match a typical S&P 500 stock. The market return was simulated from a standardized t_6 with the same variance as the S&P 500 in the past 10 years and the idiosyncratic variance was similarly calibrated to the cross-section of idiosyncratic variances. The simulated data in the top panel was computed as
\[
r_{i,t} = r_{m,t} + \varepsilon_{i,t}
\]
where ε_{i,t} was i.i.d. normal with the same variance as the average variance in the idiosyncratic return from the S&P 500 cross-section. The bottom panel was generated according to
\[
r_{i,t} = r_{m,t} + z_{i,t}\, \varepsilon_{i,t}
\]
where z_{i,t} = exp(10 r_{m,t} I_{[r_{m,t}<0]}) introduced heteroskedasticity so that the idiosyncratic variance is lower on days where the market is down. This simple change introduces asymmetric dependence between positive and negative returns.

9.7 Copulas

Copulas are a relatively new tool in financial econometrics which are useful in risk man-
agement, credit pricing and derivatives. Copulas allow the dependence between assets to
be separated from the marginal distribution of each asset. Recall that a k -variate random
variable X has a distribution function F (x1 , x2 , . . . , xk ) which maps from the domain of X
to [0,1]. The distribution function contains all of the information about the probability
of observing different values of X, and while there are many distribution functions, most are fairly symmetric and inflexible. For example, the multivariate Student's t requires all margins to have the same degree-of-freedom parameter, which means that the chance of seeing relatively large returns must be the same for all assets. While this assumption may be reasonable for modeling equity index returns, extremely heavy tails are not realistic in a model which contains equity index returns and bond returns or foreign exchange returns, since the latter two are substantially thinner tailed. Copulas provide a flexible mechanism where the marginal distributions can be modeled separately from the dependence structure and provide a substantially richer framework than working within the set of known (joint) distribution functions.
Recall the definition of the marginal distribution of X 1 .

Definition 9.34 (Marginal Density). Let X = (X 1 , X 2 , . . . , X k ) be a k -variate random variable


with joint density f X (X ). The marginal density of X i is defined
\[
f_i(x_i) = \int_{S(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_k)} f_X(x)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_k,
\]

where S (·) is the support of its argument.

The marginal density contains only information about the probability of observing val-
ues of X i and only X i since all other random variables have been integrated out. For ex-
ample, if X was a bivariate random variable with continuous support, then the marginal
density of X_1 is
\[
f_1(x_1) = \int_{-\infty}^{\infty} f_X(x_1, x_2)\, dx_2.
\]

The marginal distribution,
\[
F_1(x_1) = \int_{-\infty}^{x_1} f_1(s)\, ds,
\]

contains all of the information about the probability of observing values of X 1 , and impor-
tantly FX 1 (x1 ) ∼ U (0, 1). Since this is true for both X 1 and X 2 , u 1 = FX 1 (x1 ) and u 2 = FX 2 (x2 )
must contain only information about the dependence between the two random variables.
The distribution which describes the dependence is known as a copula, and so applications
built with copulas allow information in marginal distributions to be cleanly separated from
the dependence between random variables. This decomposition provides a flexible frame-
work for constructing precise models of both marginal distributions and dependence.

9.7.1 Basic Theory


A copula is a distribution function for a random variable where each margin is uniform
[0, 1].

Definition 9.35 (Copula). A k -dimensional copula is a distribution function on [0, 1]k with
standard uniform marginal distributions, and is denoted C (u 1 , u 2 , . . . , u k ).

All copulas satisfy 4 key properties.

• C (u 1 , u 2 , . . . , u k ) is increasing in each component u i .

• C (0, . . . , u j , . . . , 0) = 0.

• C (1, . . . , u j , . . . , 1) = u j .

• For all u ≤ v where the inequality holds on a point-by-point basis, the probability of the hypercube bounded by the corners u and v is non-negative.

The key insight which has led to the popularity of copulas in finance comes from Sklar's theorem (Sklar 1959).

Theorem 9.36 (Sklar’s Theorem). Let F be a k -variate joint distribution with marginal dis-
tributions F1 ,F2 ,. . .,Fk . Then there exists a copula C : [0, 1]k → [0, 1] such that for all x1 ,x2 ,. . .,xk ,

\[
F(x_1, x_2, \ldots, x_k) = C\left(F_1(x_1), F_2(x_2), \ldots, F_k(x_k)\right) = C(u_1, u_2, \ldots, u_k).
\]

Additionally, if the margins are continuous then C is unique.

Sklar’s theorem is useful in a number of ways. First, it ensures that the copula is unique
whenever the margins are continuous, which is usually the case in financial applications.
Second, it shows how any distribution function with known margins can be transformed
into a copula. Suppose F (x1 , x2 , . . . , xk ) is a known distribution function, and that the

marginal distribution function of the ith variable is denoted Fi (·). Further assume that the
marginal distribution function is invertible, and denote the inverse as Fi −1 (·). The copula
implicitly defined by F is

\[
C(u_1, u_2, \ldots, u_k) = F\left(F_1^{-1}(u_1), F_2^{-1}(u_2), \ldots, F_k^{-1}(u_k)\right).
\]
This relationship allows for any distribution function to be used as the basis for a copula,
and appears in the definition of the Gaussian and the Student’s t copulas.
Copulas are distribution functions for k -variate uniforms, and like all distribution func-
tions they may (or may not) have an associated density. When the copula density exists it
can be derived by differentiating the distribution with respect to each random variable,

\[
c(u_1, u_2, \ldots, u_k) = \frac{\partial^k C(u_1, u_2, \ldots, u_k)}{\partial u_1\, \partial u_2 \ldots \partial u_k}. \tag{9.80}
\]

This is identical to the relationship between any k -variate distribution F and its associated
density f .

9.7.2 Tail Dependence


One final measure of dependence, tail dependence, can be useful in understanding risks in portfolios. Tail dependence is more of a theoretical construction than something that would generally be estimated (although it is possible to estimate tail dependence).
Definition 9.37 (Tail Dependence). The upper and lower tail dependence, τ_U and τ_L respectively, are defined as the probability of an extreme event,
\[
\tau_U = \lim_{u\to 1^-} \Pr\left[X > F_X^{-1}(u) \mid Y > F_Y^{-1}(u)\right] \tag{9.81}
\]
\[
\tau_L = \lim_{u\to 0^+} \Pr\left[X < F_X^{-1}(u) \mid Y < F_Y^{-1}(u)\right] \tag{9.82}
\]
where the limits are taken from below for τ_U and from above for τ_L.
Tail dependence measures the probability that X takes an extreme value given Y takes an extreme value. The performance of assets used as hedges or for portfolio diversification is particularly important when the asset being hedged has had a particularly bad day, characterized by an extreme return in its lower tail.
Tail dependence takes a particularly simple form when working with copulas, and is defined

\[
\tau_L = \lim_{u\to 0^+} \frac{C(u,u)}{u} \tag{9.83}
\]
\[
\tau_U = \lim_{u\to 1^-} \frac{1 - 2u + C(u,u)}{1-u} \tag{9.84}
\]

The coefficient of tail dependence is always in [0, 1] since it is a probability. When τ_U (τ_L) is 0, then the two series are upper (lower) tail independent. Otherwise the series are tail dependent, where higher values indicate more dependence in extreme events.

9.7.3 Copulas

A large number of copulas have been produced. Some, such as the Gaussian, are implicitly
defined from general distributions. Others have been produced only for uniform random
variables. In all expressions for the copulas, u i ∼ U (0, 1) are uniform random variables.

9.7.3.1 Independence Copula

The simplest copula is the independence copula, and given by the product of the inputs.

Definition 9.38 (Independence Copula). The independence copula is

\[
C(u_1, u_2, \ldots, u_k) = \prod_{i=1}^{k} u_i \tag{9.85}
\]

The independence copula has no parameters.

9.7.3.2 Comonotonicity Copula

The copula with the most dependence is known as the comonotonicity copula.

Definition 9.39 (Comonotonicity Copula). The comonotonicity copula is

C (u 1 , u 2 , . . . , u k ) = min (u 1 , u 2 , . . . , u k ) (9.86)

The dependence in this copula is perfect. The comonotonicity copula does not have an associated copula density.

9.7.3.3 Gaussian Copula

The Gaussian (normal) copula is implicitly defined in terms of the k -variate Gaussian dis-
tribution, Φk (·), and the univariate Gaussian distribution, Φ (·).

Definition 9.40 (Gaussian Copula). The Gaussian copula is

\[
C(u_1, u_2, \ldots, u_k) = \Phi_k\left(\Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_k)\right) \tag{9.87}
\]

where Φ−1 (·) is the inverse of the Gaussian distribution function.



Recall that if u is a uniform random variable then Φ^{-1}(u) will have a standard normal distribution. This transformation allows the Gaussian copula density to be implicitly defined using the inverse distribution function. The Gaussian copula density is
\[
c(u_1, u_2, \ldots, u_k) = \frac{(2\pi)^{-\frac{k}{2}}\, |R|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\eta' R^{-1} \eta\right)}{\phi\left(\Phi^{-1}(u_1)\right) \ldots \phi\left(\Phi^{-1}(u_k)\right)} \tag{9.88}
\]

where η = Φ−1 (u) is a k by 1 vector where ηi = Φ−1 (u i ), R is a correlation matrix and


φ (·) is the normal pdf. The extra terms in the denominator are present in all implicitly
defined copulas since the joint density is the product of the marginal densities and the
copula density.

\[
f_1(x_1) \ldots f_k(x_k)\, c(u_1, \ldots, u_k) = f(x_1, x_2, \ldots, x_k)
\]
\[
c(u_1, \ldots, u_k) = \frac{f(x_1, x_2, \ldots, x_k)}{f_1(x_1) \ldots f_k(x_k)}
\]
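In the bivariate case this ratio can be evaluated directly; the sketch below (assuming scipy for the normal quantile, density and joint density, and a single correlation parameter ρ) computes the Gaussian copula density of (9.88).

import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_density(u1, u2, rho):
    # Bivariate Gaussian copula density: joint normal pdf over the product of the margins
    eta = norm.ppf(np.column_stack([u1, u2]))            # eta_i = Phi^{-1}(u_i)
    R = np.array([[1.0, rho], [rho, 1.0]])
    joint = multivariate_normal(mean=[0.0, 0.0], cov=R).pdf(eta)
    margins = norm.pdf(eta).prod(axis=1)
    return joint / margins

print(gaussian_copula_density(np.array([0.95]), np.array([0.99]), rho=0.5))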

9.7.3.4 Student’s t Copula

The Student’s t copula is also implicitly defined in an identical manner to the Gaussian
copula.
Definition 9.41 (Student's t Copula). The Student's t copula is
\[
C(u_1, u_2, \ldots, u_k) = t_{k,\nu}\left(t_\nu^{-1}(u_1), t_\nu^{-1}(u_2), \ldots, t_\nu^{-1}(u_k)\right) \tag{9.89}
\]

where t k ,ν (·) is the k -variate Student’s t distribution function with ν degrees of freedom
and t ν−1 is the inverse of the univariate Student’s t distribution function with ν degrees of
freedom.
Note that while a Student's t is superficially similar to a normal, variables which have a multivariate t_ν distribution are often substantially more dependent, at least when ν is small (3 – 8). A multivariate Student's t is a multivariate normal divided by the square root of a single, common, independent χ²_ν random variable standardized to have mean 1. When ν is small, the chance of seeing a small value in the denominator is large, and since this divisor is common, all series will tend to take relatively large values simultaneously.

9.7.3.5 Clayton Copula

The Clayton copula exhibits asymmetric dependence for most parameter values. The lower tail is more dependent than the upper tail and so it may be appropriate for financial assets such as equity returns.

Definition 9.42 (Clayton Copula). The Clayton copula is
\[
C(u_1, u_2) = \left(u_1^{-\theta} + u_2^{-\theta} - 1\right)^{-1/\theta}, \qquad \theta > 0 \tag{9.90}
\]

The Clayton copula limits to the independence copula when θ → 0. The copula density can be found by differentiating the copula with respect to u_1 and u_2, and so is
\[
c(u_1, u_2) = (\theta + 1)\, u_1^{-\theta-1} u_2^{-\theta-1} \left(u_1^{-\theta} + u_2^{-\theta} - 1\right)^{-1/\theta - 2}.
\]

9.7.3.6 Gumbel and Rotated Gumbel Copula

The Gumbel copula exhibits asymmetric dependence in the upper tail rather than the lower
tail.

Definition 9.43 (Gumbel Copula). The Gumbel copula is
\[
C(u_1, u_2) = \exp\left(-\left[(-\ln u_1)^{\theta} + (-\ln u_2)^{\theta}\right]^{1/\theta}\right), \qquad \theta \ge 1 \tag{9.91}
\]

The Gumbel copula exhibits upper tail dependence which increases as θ grows larger,
and limits to the independence copula when θ → 1. Because upper tail dependence is
relatively rare among financial assets, a “rotated” version of the Gumbel may be more ap-
propriate.

Definition 9.44 (Rotated (Survival) Copula). Let C (u 1 , u 2 ) be a bivariate copula. The ro-
tated version9 of the copula is given by

C R (u 1 , u 2 ) = u 1 + u 2 − 1 + C (1 − u 1 , 1 − u 2 ) .

Using this definition allows the rotated Gumbel copula to be constructed which will
have lower tail dependence rather than the upper tail dependence found in the usual Gum-
bel copula.

Definition 9.45 (Rotated Gumbel Copula). The rotated Gumbel copula is
\[
C(u_1, u_2) = u_1 + u_2 - 1 + \exp\left(-\left[(-\ln(1-u_1))^{\theta} + (-\ln(1-u_2))^{\theta}\right]^{1/\theta}\right), \qquad \theta \ge 1 \tag{9.92}
\]

The rotated Gumbel is the Gumbel copula using 1 − u 1 and 1 − u 2 as its arguments. The
extra terms are needed to satisfy the 4 properties of a copula. The rotated Gumbel copula
density is tedious to compute, but is presented here.
The rotated Gumbel copula density is

9 The rotated copula is commonly known as the survival copula, since rather than computing the probability of observing values smaller than (u_1, u_2), it computes the probability of seeing values larger than (u_1, u_2).

\[
c(u_1, u_2) = \frac{\exp\left(-\left[(-\ln(1-u_1))^{\theta} + (-\ln(1-u_2))^{\theta}\right]^{1/\theta}\right)\left[(-\ln(1-u_1))(-\ln(1-u_2))\right]^{\theta-1}}{(1-u_1)(1-u_2)\left[(-\ln(1-u_1))^{\theta} + (-\ln(1-u_2))^{\theta}\right]^{2-1/\theta}}
\]
\[
\times \left(\left[(-\ln(1-u_1))^{\theta} + (-\ln(1-u_2))^{\theta}\right]^{1/\theta} + \theta - 1\right).
\]

This copula density is identical to the Gumbel copula density only using 1 − u 1 and
1 − u 2 as its arguments. This is the “rotation” since values of the original Gumbel near 0,
where the dependence is low, are near 1 after the rotation, where the dependence is higher.

9.7.3.7 Joe-Clayton Copula

The Joe-Clayton copula allows for asymmetric dependence in both tails.

Definition 9.46 (Joe-Clayton Copula). The Joe-Clayton copula is
\[
C(u_1, u_2) = 1 - \left(1 - \left[\left(1 - (1-u_1)^{\theta_U}\right)^{-\theta_L} + \left(1 - (1-u_2)^{\theta_U}\right)^{-\theta_L} - 1\right]^{-1/\theta_L}\right)^{1/\theta_U} \tag{9.93}
\]
where the two parameters, θ_L and θ_U, are directly related to lower and upper tail dependence through
\[
\theta_L = -\frac{1}{\log_2 \tau_L}, \qquad \theta_U = \frac{1}{\log_2\left(2 - \tau_U\right)}
\]
where both coefficients of tail dependence satisfy 0 < τ_i < 1, i = L, U.


The copula density is a straightforward, although tedious, derivation. The Joe-Clayton
copula is not symmetric, even when the same values for τL and τU are used. This may be
acceptable, but if symmetry is preferred a symmetric copula can be constructed by aver-
aging a copula with its rotated counterpart.
Definition 9.47 (Symmetrized Copula). Let C(u_1, u_2) be an asymmetric bivariate copula. The symmetrized version of the copula is given by
\[
C^S(u_1, u_2) = \frac{1}{2}\left(C(u_1, u_2) + C^R(u_1, u_2)\right) = \frac{1}{2}\left(C(u_1, u_2) + C(1-u_1, 1-u_2) + u_1 + u_2 - 1\right) \tag{9.94}
\]
where C^R is the rotated version of C.

If C(u_1, u_2) is already symmetric, then C(u_1, u_2) = C^R(u_1, u_2) and so C^S(u_1, u_2) must also be symmetric. The copula density, assuming it exists, is simple for symmetrized copulas since
\[
c^S(u_1, u_2) = \frac{1}{2}\left(c(u_1, u_2) + c(1-u_1, 1-u_2)\right)
\]
which follows since the derivative of the rotated copula with respect to u_1 and u_2 only depends on the term involving C(1-u_1, 1-u_2).

Copula                 τ_L                 τ_U                 Notes
Gaussian               0                   0                   |ρ| < 1
Student's t            2 t_{ν+1}(w)        2 t_{ν+1}(w)        w = −√(ν+1) √(1−ρ)/√(1+ρ)
Clayton                2^{−1/θ}            0
Gumbel                 0                   2 − 2^{1/θ}         Rotated swaps τ_L and τ_U
Symmetrized Gumbel     1 − 2^{(1−θ)/θ}     1 − 2^{(1−θ)/θ}
Joe-Clayton            2^{−1/θ_L}          2 − 2^{1/θ_U}       Also Symmetrized JC

Table 9.7: The relationship between parameter values and tail dependence for the copulas in section 9.7.3. t_{ν+1}(·) is the CDF of a univariate Student's t distribution with ν + 1 degrees of freedom.

9.7.4 Tail Dependence in Copulas

The copulas presented in the previous section all have different functional forms, and so will lead to different distributions. One simple method to compare the different forms is through the tail dependence. Table 9.7 shows the relationship between the tail dependence in the different copulas and their parameters. The Gaussian has no tail dependence except in the extreme case when |ρ| = 1, in which case tail dependence is 1 in both tails. Other copulas, such as the Clayton and Gumbel, have asymmetric tail dependence.
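The mappings in table 9.7 are straightforward to evaluate. The sketch below computes the implied tail dependence for a few of the copulas; the function names and example parameter values are illustrative.

import numpy as np
from scipy.stats import t

def clayton_tail(theta):
    return 2 ** (-1 / theta), 0.0                      # (tau_L, tau_U)

def gumbel_tail(theta):
    return 0.0, 2 - 2 ** (1 / theta)

def student_t_tail(rho, nu):
    w = -np.sqrt(nu + 1) * np.sqrt((1 - rho) / (1 + rho))
    tau = 2 * t.cdf(w, df=nu + 1)                      # symmetric in both tails
    return tau, tau

print(clayton_tail(1.07))       # lower tail dependence only
print(gumbel_tail(1.67))        # upper tail dependence only
print(student_t_tail(0.59, 8))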

9.7.5 Visualizing Copulas

Copulas are defined on the unit hyper-cube (or unit square when there are two variables) and so one obvious method to inspect the difference between two is to plot the distribution function or the density on its default domain. This visualization method does not facilitate inspecting the tail dependence since the interesting dependence occurs in the small squares [0, 0.05] × [0, 0.05] and [0.95, 1] × [0.95, 1] which correspond to the lower and upper 5% of each margin. A superior method to visually inspect copulas is to compare the joint densities where the marginal distribution of each series is a standard normal. This visualization technique ensures that all differences are attributable to the copula, while spreading the interesting aspects of the copula over a wider range and so allowing more detail about the dependence to be seen.
Figure 9.11 contains plots of 4 copulas. The top two panels contain the independence copula and the comonotonicity copula as distributions on [0, 1] × [0, 1] where curves are isoprobability lines. In distribution space, increasing dependence appears as “L” shapes, while independence appears as a semi-circle. The bottom two figures contain the normal
copula distribution and the Gaussian copula density using normal margins, where in both
cases the correlation is ρ = 0.5. The dependence in the Gaussian copula is heavier than in
the independence copula – a special case of the Gaussian copula when ρ = 0 – but lighter

than in the comonotonicity copula. The density has both a Gaussian copula and Gaussian
margins and so depicts a bivariate normal. The density function shows the dependence
between the two series in a more transparent manner and so is usually preferable.10
Figure 9.12 contains plots of 4 copulas depicted as densities where the margin of each series is standard normal. The upper left panel contains the Clayton density which has strong lower tail dependence (θ = 1.5). The upper right contains the symmetrized Joe-Clayton where τ_L = τ_U = 0.5 which has both upper and lower tail dependence. The bottom two panels contain the rotated Gumbel and symmetrized Gumbel where θ = 1.5. The rotated Gumbel is similar to the Clayton copula although it is not identical.

9.7.6 Estimation of Copula models

Copula models are estimated using maximum likelihood, which is natural since they specify a complete distribution for the data. As long as the copula density exists, and the parameters of the margins are distinct from the parameters of the copula (which is almost always the case), the likelihood of a k-variate random variable Y can be written as
\[
f(\mathbf{y}_t; \theta, \psi) = f_1(y_{1,t}; \theta_1)\, f_2(y_{2,t}; \theta_2) \ldots f_k(y_{k,t}; \theta_k)\, c(u_{1,t}, u_{2,t}, \ldots, u_{k,t}; \psi)
\]
where u_{j,t} = F_j(y_{j,t}; θ_j) are the probability integral transformed data, θ_j are the parameters specific to marginal model j and ψ are the parameters of the copula. The log likelihood is then the sum of the marginal log likelihoods and the copula log likelihood,
\[
l(\theta, \psi; \mathbf{y}) = \ln f_1(y_1; \theta_1) + \ln f_2(y_2; \theta_2) + \ldots + \ln f_k(y_k; \theta_k) + \ln c(u_1, u_2, \ldots, u_k; \psi).
\]

This decomposition allows for consistent estimation of the parameters in two steps:

1. For each margin j , estimate θ j using quasi maximum likelihood as the solution to

\[
\underset{\theta_j}{\mathrm{argmax}} \sum_{t=1}^{T} \ln f_j\left(y_{j,t}; \theta_j\right)
\]

When fitting models using copulas it is also important to verify that the marginal
models are adequate using the diagnostics for univariate densities described in chap-
ter 8.

2. Using the probability integral transformed residuals evaluated at the estimated val-

10 Some copulas do not have a copula density and in these cases the copula distribution is the only method to visually compare copulas.

[Figure 9.11: Copula Distributions and Densities. Panels: Independence; Comonotonicity; Gaussian, ρ = 0.59 (distribution space); Gaussian, ρ = 0.59 (density with normal margins).]

Figure 9.11: The top left panel contains the independence copula. The top right panel contains the comonotonicity copula which has perfect dependence. The bottom panels contain the Gaussian copula, where the left depicts the copula in distribution space ([0, 1] × [0, 1]) and the right shows the copula as a density with standard normal margins. The parameter values were estimated in the application to the S&P 500 and FTSE 100 returns.

 
ues, û_{j,t} = F_j(y_{j,t}; θ̂_j), estimate the parameters of the copula as
\[
\underset{\psi}{\mathrm{argmax}} \sum_{t=1}^{T} \ln c\left(\hat{u}_{1,t}, \hat{u}_{2,t}, \ldots, \hat{u}_{k,t}; \psi\right)
\]

This two step procedure is not efficient in the sense that the parameter estimates are con-
sistent but have higher variance than would be the case if all parameters were simultane-
ously estimated. In practice the reduction in precision is typically small, and one alterna-
tive is to use the two-step estimator as a starting value for an estimator which simultane-
ously estimates all of the parameters. Standard errors can be computed from the two-step

[Figure 9.12: Copula Densities with Standard Normal Margins. Panels: Clayton, θ = 1.07; Symmetrized Joe-Clayton, τ_L = 0.59, τ_U = 0.06; Rotated Gumbel, θ = 1.67; Symmetrized Gumbel, θ = 1.68.]

Figure 9.12: These four panels all depict copulas as densities using standard normal margins. All differences in appearance can be attributed to the differences in the copulas. The top left panel contains the Clayton copula density. The top right contains the symmetrized Joe-Clayton. The bottom panels contain the rotated Gumbel which has lower tail dependence and the symmetrized Gumbel. The parameter values were estimated in the application to the S&P 500 and FTSE 100 returns.

estimation procedure by treating it as a two-step GMM problem where the scores of the
marginal log likelihoods and the copula are the moment conditions (see section 6.10 for a discussion).
An alternative estimation procedure uses nonparametric models for the margins. Nonparametric margins are typically employed when characterizing the distribution of the margins is not particularly important, and so the first step can be replaced through the use of the empirical CDF. The empirical CDF estimates are û_{j,t} = rank(y_{j,t})/(T+1) where rank is the ordinal rank of observation t among the T observations. The empirical CDF is uniform by construction. Using the empirical CDF is not generally appropriate when the data have time-series dependence (e.g. volatility clustering) or when forecasting is an important consideration.
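A compact sketch of the two-step procedure with nonparametric margins is given below: the first step replaces the marginal models with the empirical CDF, and the second step maximizes the Clayton copula log likelihood (the density was given in section 9.7.3.5) over θ. The use of scipy's minimize_scalar, the bounds on θ and the simulated data are illustrative assumptions.

import numpy as np
from scipy.stats import rankdata
from scipy.optimize import minimize_scalar

def empirical_pit(y):
    # Step 1 with nonparametric margins: u_t = rank(y_t) / (T + 1)
    return rankdata(y) / (len(y) + 1)

def clayton_loglik(theta, u1, u2):
    # Clayton copula log likelihood (density from section 9.7.3.5)
    c = (theta + 1) * (u1 * u2) ** (-theta - 1) \
        * (u1 ** -theta + u2 ** -theta - 1) ** (-1 / theta - 2)
    return np.sum(np.log(c))

def fit_clayton(y1, y2):
    # Step 2: maximize the copula log likelihood in the transformed data
    u1, u2 = empirical_pit(y1), empirical_pit(y2)
    res = minimize_scalar(lambda th: -clayton_loglik(th, u1, u2),
                          bounds=(1e-4, 20.0), method='bounded')
    return res.x

rng = np.random.default_rng(3)
common = rng.standard_normal(2000)
y1 = common + rng.standard_normal(2000)
y2 = common + rng.standard_normal(2000)
print(fit_clayton(y1, y2))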

9.7.7 Application: Dependence between the S&P 500 and the FTSE 100

The use of copulas will be illustrated using the S&P 500 and FTSE 100 data. Returns were
computed using 5-trading days to mitigate issues with non-synchronous trading. The first
example will use the empirical CDF to estimate the probability integral transformed resid-
uals so that the model will focus on the unconditional dependence between the two se-
ries. The upper left panel of figure 9.13 contains a scatter plot of the ECDF transformed
residuals. The residuals tend to cluster around the 45° line indicating positive dependence (correlation). There are also obvious clusters near (0, 0) and (1, 1) indicating dependence in the extremes. The normal, Student's t, Clayton, rotated Gumbel, symmetrized Gumbel and symmetrized Joe-Clayton were all estimated. Parameter estimates and copula log-likelihoods are reported in table 9.8. The symmetrized Joe-Clayton fits the data the best, followed by the
symmetrized Gumbel and then the rotated Gumbel. The Clayton and the Gaussian both
appear to fit the data substantially worse than the others. In the Joe-Clayton, both tails
appear to have some dependence, although unsurprisingly the lower tail is substantially
more dependent.
Copulas can also be used with dynamic models. Using a constant copula with dynamic
models for the margins is similar to using a CCC-GARCH model for modeling conditional
covariance. A dynamic distribution model was built using TARCH(1,1,1) distributions for
each index return where the innovations are modeled using Hansen’s Skew t . The same set
of copulas were estimated using the conditionally transformed residuals u_{i,t} = F(y_{i,t}; σ²_t, ν, λ), where σ²_t is the conditional variance, ν is the degree of freedom, and λ controls the skewness. Parameter estimates are reported in table 9.9. The top panel reports the parameter estimates from the TARCH model. Both series have persistent volatility although the leverage effect is stronger in the S&P 500 than it is in the FTSE 100. Standardized residuals in the S&P 500 were also heavier tailed with a degree of freedom of 8 versus 12 for the FTSE 100, and both were negatively skewed. The parameter estimates from the copulas all indicate less dependence than in the model built using the empirical CDF. This is a common finding and is due to the synchronization of volatility between the two markets. Coordinated periods of high volatility lead to large returns in both series at the same time, even when the standardized shock is only moderately large. Mixing periods of high and low volatility across markets tends to increase unconditional dependence in the same way that mixing periods of high and low volatility leads to heavy tails in the same market. The difference in the dependence shows up in the parameter values in the copulas, which are uniformly lower than in their unconditional counterparts, and through the reduction in the range of
log-likelihoods relative to the Gaussian.
Figure 9.13 contains some diagnostic plots related to fitting the conditional copula. The
top right panel contains the scatter plot of the probability integral transformed residuals

Dependence Measures for Weekly FTSE and S&P 500 Returns

Copula θ1 θ2 Log Lik.


Gaussian 0.619 305.9
Clayton 1.165 285.3
Rotated Gumbel 1.741 331.7
Symmetrized Gumbel 1.775 342.1
Symmetrized Joe-Clayton 0.606 0.177 346.2

Table 9.8: Parameter estimates for the unconditional copula between weekly returns on the
S&P 500 and the FTSE 100. Marginal distributions were estimated using empirical CDFs.
For the Gaussian copula, θ1 is the correlation, and in the Joe-Clayton θ1 is τL and θ2 is τU .
The final column reports the log likelihood from the copula density.

Conditional Copula Estimates for Weekly FTSE and S&P 500 Returns

Index α1 γ1 β1 ν λ
S&P 500 0.003 0.259 0.843 8.247 -0.182
FTSE 100 0.059 0.129 0.846 12.328 -0.152

Copula θ1 θ2 Log Lik.


Gaussian 0.586 267.2
Clayton 1.068 239.9
Rotated Gumbel 1.667 279.7
Symmetrized Gumbel 1.679 284.4
Symmetrized Joe-Clayton 0.586 0.057 284.3

Table 9.9: Parameter estimates for the conditional copula between weekly returns on the
S&P 500 and the FTSE 100. Marginal distributions were estimated using a TARCH(1,1,1)
with Hansen’s Skew t error. Parameter estimates from the marginal models are reported in
the top panel. The bottom panel contains parameter estimates from copulas fit using the
conditionally probability integral transformed residuals. For the Gaussian copula, θ1 is the
correlation, and in the Joe-Clayton θ1 is τL and θ2 is τU . The final column reports the log
likelihood from the copula density.

from the TARCH model. While these appear similar to the plot from the empirical CDF, the
amount of clustering near (0, 0) and (1, 1) is slightly lower. The bottom left panel contains a
QQ plot of the actual returns against the expected returns using the degree of freedom and
skewness parameters estimated on the two indices. These curves are straight except for the
most extreme observations, indicating an acceptable fit. The bottom right plot contains the
annualized volatility series for the two assets where the coordination in volatility cycles is
apparent. It also appears the coordination in volatility cycles has strengthened post-2000.

[Figure 9.13: S&P 500 - FTSE 100 Diagnostics. Panels: probability integral transforms with Empirical Margins and with Skew t Margins (FTSE 100 against S&P 500); a Skew t QQ plot (Actual against Expected); and Annualized Volatility of the S&P 500 and FTSE 100.]

Figure 9.13: These four panels contain diagnostics for fitting copulas to weekly returns on the S&P 500 and FTSE 100. The top two panels are scatter plots of the probability integral transformed residuals. The left contains the PIT computed using the empirical CDF and so depicts the unconditional dependence. The right contains the PIT from the TARCH(1,1,1) with Skew t errors. The bottom left contains a QQ plot of the data against the expected value from a Skew t. The bottom right plot contains the fitted annualized volatility for the two indices.

9.7.8 Dynamic Copulas

This chapter has focused on static copulas of the form C(u_1, u_2; θ). It is possible to construct dynamic copulas where dependence parameters change through time, which leads
to a conditional copula of the form C (u 1 , u 2 ; θ t ) . This was first done in Patton (2006) in an
application to exchange rates. The primary difficulty in specifying dynamic copula models
is in determining the form of the “shock”. In ARCH-type volatility models rt2 is the natural
shock since it has the correct conditional expectation. In most copula models there isn’t a
single, unique equivalent. Creal, Koopman & Lucas (2012) have recently developed a gen-

eral framework which can be used to construct a natural shock even in complex models,
and have applied this in the context of copulas.
DCC can also be used as a dynamic Gaussian copula where the first step is modified
from fitting the conditional variance to fitting the conditional distribution. Probability in-
tegral transformed residuals from the modified first step can then be transformed to be
Gaussian, which in turn can be used in the second step of the DCC estimator. The com-
bined model has flexible marginal distributions and a Gaussian copula.

9.A Bootstrap Standard Errors


The Bootstrap is a powerful tool which has a variety of uses, although it is primarily used
for computing standard errors as an alternative to “plug-in” estimators used in most infer-
ence. Moreover, in some applications, expressions for asymptotic standard errors cannot
be directly computed and so the bootstrap is the only viable method to make inference.
This appendix provides a very brief introduction to computing bootstrap standard errors.
The idea behind the bootstrap is very simple. If {rt } is a sample of T data points from
some unknown joint distribution F , then {rt } can be used to simulate (via re-sampling)
from the unknown distribution F . The name bootstrap comes from the expression “To
pull yourself up by your bootstraps”, a seemingly impossible task, much like simulation from an unknown distribution.
There are many bootstraps available and different bootstraps are appropriate for differ-
ent types of data. Bootstrap methods can be classified as parametric or non-parametric.
Parametric bootstraps make use of residuals as the basis of the re-sampling. Nonparamet-
ric bootstraps make use of the raw data. In many applications both types of bootstraps
are available and the choice between the two is similar to the choice between parametric
and non-parametric estimators: parametric estimators are precise but may be mislead-
ing if the model is misspecified while non-parametric estimators are consistent although
may require large amounts of data to be reliable. This appendix describes three bootstraps
and one method to compute standard errors using a nonparametric bootstrap method.
Comprehensive treatments of the bootstrap can be found in Efron & Tibshirani (1998) and
Chernick (2008).
The simplest form of the bootstrap is the i.i.d. bootstrap, which is applicable when the data are i.i.d.

Algorithm 9.48 (IID Bootstrap).

1. Draw T indices τ_i = ⌈T u_i⌉ where u_i ∼ i.i.d. U(0,1).

2. Construct an artificial time series using the indices {τ_i}_{i=1}^{T},
   y_{τ_1} y_{τ_2} . . . y_{τ_T}.

3. Repeat steps 1–2 a total of B times.



In most applications with financial data an assumption of i.i.d. errors is not plausible and a bootstrap appropriate for dependent data is necessary. The two most common bootstraps for dependent data are the block bootstrap and the stationary bootstrap (Politis & Romano 1994). The block bootstrap is based on the idea of drawing blocks of data which are sufficiently long so that the blocks are approximately i.i.d.
Algorithm 9.49 (Block Bootstrap).

1. Draw τ_1 = ⌈T u⌉ where u ∼ i.i.d. U(0,1).

2. For i = 2, . . . , T, if i mod m ≠ 0, τ_i = τ_{i−1} + 1, where wrapping is used so that if τ_{i−1} = T then τ_i = 1. If i mod m = 0, then τ_i = ⌈T u⌉ where u ∼ i.i.d. U(0,1).

3. Construct an artificial time series using the indices {τ_i}_{i=1}^{T}.

4. Repeat steps 1–3 a total of B times.


The stationary bootstrap is closely related to the block bootstrap. The only difference is
that it uses blocks with lengths that are exponentially distributed with an average length of
m.
Algorithm 9.50 (Stationary Bootstrap).

1. Draw τ_1 = ⌈T u⌉ where u ∼ i.i.d. U(0,1).

2. For i = 2, . . . , T, draw a standard uniform v ∼ i.i.d. U(0,1). If v > 1/m, τ_i = τ_{i−1} + 1, where wrapping is used so that if τ_{i−1} = T then τ_i = 1. If v ≤ 1/m, τ_i = ⌈T u⌉ where u ∼ i.i.d. U(0,1).

3. Construct an artificial time series using the indices {τ_i}_{i=1}^{T}.

4. Repeat steps 1–3 a total of B times.


These bootstraps dictate how to re-sample the data. The re-sampled data are then used to
make inference on statistics of interest.
Algorithm 9.51 (Bootstrap Parameter Covariance Estimation).

1. Begin by computing the statistic of interest θ̂ using the original sample.

2. Using a bootstrap appropriate for the dependence in the data, estimate the statistic of
interest on the B artificial samples, and denote these estimates as θ̃ j , j = 1, 2, . . . , B

3. Construct confidence intervals using:

   (a) (Inference using standard deviation) Estimate the variance of θ̂ − θ_0 as
   \[
   B^{-1}\sum_{b=1}^{B}\left(\tilde{\theta}_b - \hat{\theta}\right)^2
   \]
   (b) (Inference using symmetric quantiles) Construct bootstrap errors as η_b = θ̃_b − θ̂, and construct the 1 − α confidence interval (θ̂ ± q̄_{1−α/2}) using the 1 − α/2 quantile of |η_b|, denoted q̄_{1−α/2}.
   (c) (Inference using asymmetric quantiles) Construct bootstrap errors as η_b = θ̃_b − θ̂, and construct the 1 − α confidence interval (θ̂ + q_{α/2}, θ̂ + q_{1−α/2}) using the α/2 and 1 − α/2 quantiles of η_b, denoted q_{α/2} and q_{1−α/2}, respectively.

The bootstrap confidence intervals in this chapter were all computed using this algorithm and a stationary bootstrap with m ∝ √T.
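A sketch combining algorithm 9.50 with step 3(a) of algorithm 9.51 for the standard error of a linear correlation is given below. The block-length rule m = ⌈√T⌉, B = 1000 and the function names are illustrative choices rather than part of the algorithms themselves.

import numpy as np

def stationary_bootstrap_indices(T, m, rng):
    # One draw of indices from the stationary bootstrap (algorithm 9.50)
    tau = np.empty(T, dtype=int)
    tau[0] = rng.integers(T)
    for i in range(1, T):
        if rng.random() > 1 / m:
            tau[i] = (tau[i - 1] + 1) % T        # continue the block, wrapping at T
        else:
            tau[i] = rng.integers(T)             # start a new block
    return tau

def bootstrap_std_error(x, y, B=1000, seed=0):
    # Bootstrap standard error of the linear correlation (algorithm 9.51, step 3(a))
    rng = np.random.default_rng(seed)
    T = len(x)
    m = int(np.ceil(np.sqrt(T)))
    theta_hat = np.corrcoef(x, y)[0, 1]
    theta_tilde = np.empty(B)
    for b in range(B):
        idx = stationary_bootstrap_indices(T, m, rng)
        theta_tilde[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.sqrt(np.mean((theta_tilde - theta_hat) ** 2))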
Warning: The bootstrap is broadly applicable in cases where parameters are asymptot-
ically normal such as in regression with stationary data. They are either not appropriate
or require special attention in many situations, e.g. unit roots, and so before computing
bootstrap standard errors, it is useful to verify that the bootstrap will lead to correct infer-
ence. In cases where the bootstrap fails, a more general statistical technique, subsampling, can usually be used to make correct inference.

Exercises

Exercise 9.1. Answer the following questions about covariance modeling

i. Describe the similarities between the RiskMetrics 1994 and RiskMetrics 2006 method-
ologies.

ii. Describe two multivariate GARCH models. What are the strengths and weaknesses of these models?

iii. Other than linear correlation, describe two other measures of dependence.

iv. What is Realized Covariance?

v. What are the important considerations when using Realized Covariance?

Exercise 9.2. Answer the following questions.

i. Briefly outline two applications in finance where a multivariate volatility model would
be useful.

ii. Describe two of the main problems faced in multivariate volatility modeling, using
two different models to illustrate these problems.

iii. Recall that, for a bivariate application, the BEKK model for a time-varying conditional
covariance matrix is:
\[
\begin{bmatrix} \sigma_{11,t} & \sigma_{12,t} \\ \sigma_{12,t} & \sigma_{22,t} \end{bmatrix} \equiv \Sigma_t = CC' + A\varepsilon_{t-1}\varepsilon_{t-1}' A' + B\Sigma_{t-1} B'
\]

where C is a lower triangular matrix, and ε'_t ≡ [ε_{1,t}, ε_{2,t}] is the vector of residuals. Using the result that vec(QRS) = (S' ⊗ Q) vec(R), where ⊗ is the Kronecker product, re-write the BEKK model for vec(Σ_t) rather than Σ_t.

iv. Estimating this model on two-day returns on the S&P 500 index and the FTSE 100
index over the period 4 April 1984 to 30 December 2008, we find:
\[
\hat{C} = \begin{bmatrix} 0.15 & 0 \\ 0.19 & 0.20 \end{bmatrix}, \quad \hat{B} = \begin{bmatrix} 0.97 & -0.01 \\ -0.01 & 0.92 \end{bmatrix}, \quad \hat{A} = \begin{bmatrix} 0.25 & 0.03 \\ 0.05 & 0.32 \end{bmatrix}
\]

Using your answer from (c), compute the (1, 1) element of the coefficient matrix on
vec (Σt −1 ) .

Exercise 9.3. Answer the following questions.



i. For a set of two asset returns, recall that the BEKK model for a time-varying condi-
tional covariance matrix is:
\[
\begin{bmatrix} h_{11,t} & h_{12,t} \\ h_{12,t} & h_{22,t} \end{bmatrix} \equiv \Sigma_t = CC' + B\Sigma_{t-1} B' + A\varepsilon_{t-1}\varepsilon_{t-1}' A'
\]

where C is a lower triangular matrix, and ε'_t ≡ [ε_{1t}, ε_{2t}] is the vector of residuals.

ii. Describe two of the main problems faced in multivariate volatility modeling, and how
the BEKK model overcomes or does not overcome these problems.

iii. Using the result that vec(QRS) = (S' ⊗ Q) vec(R), where ⊗ is the Kronecker product, re-write the BEKK model for vec(Σ_t) rather than Σ_t.

iv. Estimating this model on two-day returns on the S&P 500 index and the FTSE 100
index over the period 4 April 1984 to 30 December 2008, we find:
\[
\hat{C} = \begin{bmatrix} 0.15 & 0 \\ 0.19 & 0.20 \end{bmatrix}, \quad \hat{B} = \begin{bmatrix} 0.97 & -0.01 \\ -0.01 & 0.92 \end{bmatrix}, \quad \hat{A} = \begin{bmatrix} 0.25 & 0.03 \\ 0.05 & 0.32 \end{bmatrix}
\]

Using your answer from (b), compute the estimated intercept vector in the vec(Σt )
representation of the BEKK model. (Hint: this vector is 4 × 1.)

v. Computing “exceedance correlations” on the two-day returns on the S&P 500 index
and the FTSE 100 index, we obtain Figure 9.14. Describe what exceedance correla-
tions are, and what feature(s) of the data they are designed to measure.

vi. What does the figure tell us about the dependence between returns on the S&P 500
index and returns on the FTSE 100 index?

Exercise 9.4. Answer the following questions about covariance modeling:

i. Describe the RiskMetrics 1994 methodology for modeling the conditional covariance.

ii. How does the RiskMetrics 2006 methodology differ from the 1994 methodology for
modeling the conditional covariance?

iii. Describe one multivariate GARCH model. What are the strengths and weaknesses of
the model?

iv. How is the 5% portfolio VaR computed when using the RiskMetrics 1994 methodol-
ogy?

v. Other than linear correlation, describe two other measures of dependence.



[Figure 9.14: Exceedance correlation between the S&P 500 and FTSE 100, plotted against the cut-off quantile (q), for 1984-1989 and 2000-2008.]

Figure 9.14: Exceedance correlations between two-day returns on the S&P 500 index and
the FTSE 100 index. Line with circles uses data from April 1984 to December 1989; line with
stars uses data from January 2000 to December 2008.

vi. What is Realized Covariance?

vii. What are the important considerations when using Realized Covariance?
Bibliography

Abramowitz, M. & Stegun, I. (1964), Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables, Dover Publications.

Aït-Sahalia, Y. & Lo, A. W. (1998), ‘Nonparametric estimation of state-price densities implicit in financial asset
prices’, Journal of Finance 53(2), 499–547.

Andersen, T. G. & Bollerslev, T. (1998), ‘Answering the skeptics: Yes, standard volatility models do provide
accurate forecasts’, International Economic Review 39(4), 885–905.

Andersen, T. G., Bollerslev, T., Diebold, F. X. & Labys, P. (2003), ‘Modeling and Forecasting Realized Volatility’,
Econometrica 71(1), 3–29.

Andersen, T. G., Bollerslev, T., Diebold, F. X. & Vega, C. (2007), ‘Real-time price discovery in global stock, bond
and foreign exchange markets’, Journal of International Economics 73(2), 251 – 277.

Andrews, D. W. K. & Ploberger, W. (1994), ‘Optimal tests when a nuisance parameter is present only under the
alternative’, Econometrica 62(6), 1383–1414.

Ang, A., Chen, J. & Xing, Y. (2006), ‘Downside Risk’, Review of Financial Studies 19(4), 1191–1239.

Bai, J. & Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica
70(1), 191–221.

Baldessari, B. (1967), ‘The distribution of a quadratic form of normal random variables’, The Annals of Math-
ematical Statistics 38(6), 1700–1704.

Bandi, F. & Russell, J. (2008), ‘Microstructure noise, realized variance, and optimal sampling’, The Review of
Economic Studies 75(2), 339–369.

Barndorff-Nielsen, O. E. & Shephard, N. (2004), ‘Econometric analysis of realised covariation: high frequency
based covariance, regression and correlation in financial economics’, Econometrica 73(4), 885–925.

Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A. & Shephard, N. (2008), ‘Designing realized kernels to mea-
sure the ex post variation of equity prices in the presence of noise’, Econometrica 76(6), 1481–1536.

Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A. & Shephard, N. (2011), ‘Multivariate realised kernels: consistent positive semi-definite estimators of the covariation of equity prices with noise and non-synchronous trading’, Journal of Econometrics 162, 149–169.

Baxter, M. & King, R. G. (1999), ‘Measuring Business Cycles: Approximate Band-Pass Filters For Economic
Time Series’, The Review of Economics and Statistics 81(4), 575–593.

Bekaert, G. & Wu, G. (2000), ‘Asymmetric volatility and risk in equity markets’, Review of Financial Studies
13(1), 1–42.

Berkowitz, J. (2001), ‘Testing density forecasts, with applications to risk management’, Journal of Business and
Economic Statistics 19, 465–474.

Beveridge, S. & Nelson, C. (1981), ‘A new approach to decomposition of economic time series into permanent
and transitory components with particular attention to measurement of the ’business cycle’’, Journal of
Monetary Economics 7(2), 151–174.

Black, F. (1972), ‘Capital market equilibrium with restricted borrowing’, The Journal of Business 45(3), 444–
455.

Bollerslev, T. (1986), ‘Generalized autoregressive conditional heteroskedasticity’, Journal of Econometrics 31(3), 307–327.

Bollerslev, T. (1987), ‘A conditionally heteroskedastic time series model for security prices and rates of return
data’, Review of Economics and Statistics 69(3), 542–547.

Bollerslev, T. (1990), ‘Modeling the coherence in short run nominal exchange rates: A multivariate generalized
ARCH model’, Review of Economics and Statistics 72(3), 498–505.

Bollerslev, T., Engle, R. F. & Wooldridge, J. M. (1988), ‘A capital asset pricing model with time-varying covari-
ances’, Journal of Political Economy 96(1), 116–131.

Bollerslev, T. & Wooldridge, J. M. (1992), ‘Quasi-maximum likelihood estimation and inference in dynamic
models with time-varying covariances’, Econometric Reviews 11(2), 143–172.

Breeden, D. T. & Litzenberger, R. H. (1978), ‘Prices of state contingent claims implicit in option prices’, Journal
of Business 51, 621–651.

Britten-Jones, M. & Neuberger, A. (2000), ‘Option prices, implied price processes, and stochastic volatility’,
Journal of Finance 55(2), 839–866.

Burns, P., Engle, R. F. & Mezrich, J. (1998), ‘Correlations and volatilities of asynchronous data’, Journal of
Derivatives 5(4), 7–18.

Campbell, J. Y. (1996), ‘Understanding risk and return’, Journal of Political Economy 104, 298–345.

Cappiello, L., Engle, R. F. & Sheppard, K. (2006), ‘Asymmetric Dynamics in the Correlations of Global Equity
and Bond Returns’, Journal of Financial Econometrics 4(4), 537–572.

Casella, G. & Berger, R. L. (2001), Statistical Inference (Hardcover), 2 edn, Duxbury.

CBOE (2003), VIX: CBOE Volatility Index, Technical report, Chicago Board Options Exchange.
URL: http://www.cboe.com/micro/vix/vixwhite.pdf

Chernick, M. R. (2008), Bootstrap Methods: A Guide for Practitioners and Researchers, Wiley Series in Proba-
bility and Statistics, John Wiley & Sons Inc, Hoboken, New Jersey.

Christie, A. (1982), ‘The stochastic behavior of common stock variances: Value, leverage and interest rate
effects’, Journal of Financial Economics 10(4), 407–432.

Christoffersen, P. F. (2003), Elements of Financial Risk Management, Academic Press Inc., London.

Cochrane, J. H. (2001), Asset Pricing, Princeton University Press, Princeton, N. J.

Corsi, F. (2009), ‘A Simple Approximate Long-Memory Model of Realized Volatility’, Journal of Financial
Econometrics 7(2), 174–196.

Creal, D. D., Koopman, S. J. & Lucas, A. (2012), ‘Generalized autoregressive score models with applications’,
Journal of Applied Econometrics. Forthcoming.

Davidson, R. & MacKinnon, J. G. (2003), Econometric Theory and Methods, Oxford University Press.

den Haan, W. & Levin, A. T. (2000), Robust covariance estimation with data-dependent var prewhitening or-
der, Technical report, University of California – San Diego.

Diebold, F. X. & Mariano, R. S. (1995), ‘Comparing predictive accuracy’, Journal of Business & Economic Statis-
tics 13(3), 253–263.

Ding, Z. & Engle, R. (2001), ‘Large scale conditional matrix modeling, estimation and testing’, Academia Eco-
nomic Papers 29(2), 157–184.

Ding, Z., Engle, R. F. & Granger, C. W. J. (1993), ‘A long memory property of stock market returns and a new
model’, Journal of Empirical Finance 1(1), 83–106.

Dittmar, R. F. (2002), ‘Nonlinear pricing kernels, kurtosis preference, and the cross-section of equity returns’,
Journal of Finance 57(1), 369–403.

Efron, B., Hastie, T., Johnstone, L. & Tibshirani, R. (2004), ‘Least angle regression’, Annals of Statistics 32, 407–
499.

Efron, B. & Tibshirani, R. J. (1998), An introduction to the bootstrap, Chapman & Hall, Boca Raton; London.

Elliott, G., Rothenberg, T. & Stock, J. (1996), ‘Efficient tests for an autoregressive unit root’, Econometrica
64, 813–836.

Enders, W. (2004), Applied econometric time series, 2nd edn, J. Wiley, Hoboken, NJ.

Engle, R. (1995), ARCH: selected readings, Oxford University Press, USA.

Engle, R. (2002a), ‘New frontiers for ARCH models’, Journal of Applied Econometrics 17(5), 425–446.

Engle, R. F. (1982), ‘Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation’, Econometrica 50(4), 987–1008.

Engle, R. F. (2002b), ‘Dynamic conditional correlation - a simple class of multivariate GARCH models’, Journal
of Business and Economic Statistics 20(3), 339–350.

Engle, R. F. & Kroner, K. F. (1995), ‘Multivariate simultaneous generalized ARCH’, Econometric Theory
11(1), 122–150.

Engle, R. F. & Li, L. (1998), Macroeconomic announcements and volatility of treasury futures. UCSD Working
Paper No. 97-27.

Engle, R. F., Lilien, D. M. & Robins, R. P. (1987), ‘Estimating time-varying risk premia in the term structure:
The ARCH-M model’, Econometrica 55(2), 391–407.

Engle, R. F. & Manganelli, S. (2004), ‘Caviar: conditional autoregressive value at risk by regression quantiles’,
Journal of Business & Economic Statistics 22, 367–381.

Fama, E. F. & French, K. R. (1992), ‘The cross-section of expected stock returns’, Journal of Finance 47, 427–465.

Fama, E. F. & French, K. R. (1993), ‘Common risk factors in the returns on stocks and bonds’, Journal of Finan-
cial Economics 33, 3–56.

Fama, E. & MacBeth, J. D. (1973), ‘Risk, return, and equilibrium: Empirical tests’, The Journal of Political
Economy 81(3), 607–636.

Fan, J. & Yao, Q. (2005), Nonlinear Time Series: Nonparametric and Parametric Methods, Springer Series in
Statistics, Springer.

Gallant, A. R. (1997), An Introduction to Econometric Theory: Measure-Theoretic Probability and Statistics with
Applications to Economics, Princeton University Press.

Glosten, L., Jagannathan, R. & Runkle, D. (1993), ‘On the relationship between the expected value and the
volatility of the nominal excess return on stocks’, Journal of Finance 48(5), 1779–1801.

Gourieroux, C. & Jasiak, J. (2009), Value at risk, in Y. Aït-Sahalia & L. P. Hansen, eds, ‘Handbook of Financial Econometrics’, Elsevier Science. Forthcoming.

Greene, W. H. (2007), Econometric Analysis, 6 edn, Prentice Hall.

Grimmett, G. & Stirzaker, D. (2001), Probability and Random Processes, Oxford University Press.

Hall, A. R. (2005), Generalized Method of Moments, Oxford University Press, Oxford.

Hamilton, J. (1989), ‘A New Approach to Economic Analysis of Nonstationary Time Series’, Econometrica
57(2), 357–384.

Hamilton, J. D. (1994), Time series analysis, Princeton University Press, Princeton, N.J.

Hannan, E. J. & Quinn, B. G. (1979), ‘The determination of the order of an autoregression’, Journal of the Royal
Statistical Society. Series B (Methodological) 41(2), 190–195.
URL: http://www.jstor.org/stable/2985032

Hansen, B. E. (1994), ‘Autoregressive conditional density estimation’, International Economic Review 35(3), 705–730.

Hansen, L. P. (1982), ‘Large sample properties of generalized method of moments estimators’, Econometrica
50(4), 1029–1054.

Hansen, L. P. & Singleton, K. J. (1982), ‘Generalized instrumental variables estimation of nonlinear rational
expectations models’, Econometrica 50(5), 1269–1286.

Hansen, P. R. & Lunde, A. (2005), ‘A Realized Variance for the Whole Day Based on Intermittent High-
Frequency Data’, Journal of Financial Econometrics 3(4), 525–554.
URL: http://jfec.oxfordjournals.org/cgi/content/abstract/3/4/525

Hansen, P. R. & Lunde, A. (2006), ‘Realized variance and market microstructure noise’, Journal of Business and
Economic Statistics 24, 127–161.

Hastie, T., Taylor, J., Tibshirani, R. & Walther, G. (2007), ‘Forward stagewise regression and the monotone
lasso’, Electronic Journal of Statistics 1(1), 1 – 29.

Hayashi, F. (2000), Econometrics, Princeton University Press.

Hodrick, R. J. & Prescott, E. C. (1997), ‘Postwar U.S. Business Cycles: An Empirical Investigation’, Journal of
Money, Credit and Banking 29(1), 1–16.

Hong, Y., Tu, J. & Zhou, G. (2007), ‘Asymmetries in Stock Returns: Statistical Tests and Economic Evaluation’,
Rev. Financ. Stud. 20(5), 1547–1581.

Huber, P. J. (2004), Robust Statistics, John Wiley & Sons Inc, Hoboken, New Jersey.

Ivanov, V. & Kilian, L. (2005), ‘A practitioner’s guide to lag order selection for var impulse response analysis’,
Studies in Nonlinear Dynamics & Econometrics 9(1), 1219–1253.
URL: http://www.bepress.com/snde/vol9/iss1/art2

Jiang, G. J. & Tian, Y. S. (2005), ‘The model-free implied volatility and its information content’, Review of Fi-
nancial Studies 18(4), 1305–1342.

J.P.Morgan/Reuters (1996), Riskmetrics - technical document, Technical report, Morgan Guaranty Trust
Company. Fourth Edition.

Krolzig, H.-M. (1997), Markov-Switching Vector Autoregressions: Modelling, Statistical Inference, and Appli-
cation to Business Cycle Analysis, Lecture Notes in Economics and Mathematical Systems, Springer-Verlag,
Berlin.

Lintner, J. (1965), ‘The valuation of risk assets and the selection of risky investments in stock portfolios and
capital budgets’, The Review of Economics and Statistics 47(1), 13–37.

Lucas, R. E. (1978), ‘Asset prices in an exchange economy’, Econometrica 46(6), 1429–1445.

Markowitz, H. (1959), Portfolio Selection: Efficient Diversification of Investments, John Wiley.

McNeil, A. J., Frey, R. & Embrechts, P. (2005), Quantitative Risk Management : Concepts, Techniques, and Tools,
Princeton University Press, Woodstock, Oxfordshire.

Merton, R. C. (1973), ‘An intertemporal capital asset pricing model’, Econometrica 41, 867–887.

Mittelhammer, R. C. (1999), Mathematical Statistics for Economics and Business, Springer.

Nelson, D. B. (1991), ‘Conditional heteroskedasticity in asset returns: A new approach’, Econometrica 59(2), 347–370.

Nelson, D. & Cao, C. (1992), ‘Inequality constraints in the univariate GARCH model’, Journal of Business and
Economic Statistics 10(2), 229–235.

Newey, W. K. & West, K. D. (1987), ‘A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix’, Econometrica 55(3), 703–708.

Noureldin, D., Shephard, N. & Sheppard, K. (2012), ‘Multivariate high-frequency-based volatility (HEAVY) models’, Journal of Applied Econometrics.

Patton, A. (2006), ‘Modelling asymmetric exchange rate dependence’, International Economic Review
47(2), 527–556.

Patton, A. J. & Sheppard, K. (2009), ‘Optimal combinations of realised volatility estimators’, International
Journal of Forecasting 25(2), 218–238.
URL: http://ideas.repec.org/a/eee/intfor/v25y2009i2p218-238.html

Perez-Quiros, G. & Timmermann, A. (2000), ‘Firm size and cyclical variations in stock returns’, Journal of
Finance 55(3), 1229–1262.

Politis, D. N. & Romano, J. P. (1994), ‘The stationary bootstrap’, Journal of the American Statistical Association
89(428), 1303–1313.

Roll, R. (1977), ‘A critique of the asset pricing theory’s tests; part I: On past and potential testability of the
theory’, Journal of Financial Economics 4, 129–176.

Ross, S. A. (1976), ‘The arbitrage theory of capital asset pricing’, Journal of Economic Theory 13(3), 341–360.

Rousseeuw, P. J. & Leroy, A. M. (2003), Robust Regression and Outlier Detection, John Wiley & Sons Inc, Hobo-
ken, New Jersey.

Shanken, J. (1992), ‘On the estimation of beta-pricing models’, The Review of Financial Studies 5(1), 1–33.

Sharpe, W. (1964), ‘Capital asset prices: A theory of market equilibrium under conditions of risk’, Journal of
Finance 19, 425–442.

Sharpe, W. F. (1994), ‘The sharpe ratio’, The Journal of Portfolio Management 21(1), 49–58.

Sims, C. (1980), ‘Macroeconomics and reality’, Econometrica 48, 1–48.

Sklar, A. (1959), ‘Fonctions de répartition à n dimensions et leurs marges’, Publ. Inst. Statist. Univ. Paris 8, 229–
231.

Taylor, S. J. (1986), Modeling Financial Time Series, John Wiley and Sons Ltd.

Taylor, S. J. (2005), Asset price dynamics, volatility, and prediction, Princeton University Press, Princeton, N.J.;
Oxford.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society,
Series B 58, 267–288.

Tong, H. (1978), On a threshold model, in C. Chen, ed., ‘Pattern Recognition and Signal Processing’, Sijhoff
and Noordoff, Amsterdam.

Wackerly, D., Mendenhall, W. & Scheaffer, R. L. (2001), Mathematical Statistics with Applications, 6 edn,
Duxbury Press.

White, H. (1980), ‘A heteroskedasticity-consistent covariance matrix estimator and a direct test for het-
eroskedasticity’, Econometrica 48(4), 817–838.

White, H. (1996), Estimation, Inference and Specification Analysis, Econometric Society Monographs, Cam-
bridge University Press, Cambridge.

White, H. (2000), Asymptotic Theory for Econometricians, Economic Theory, Econometrics, and Mathemati-
cal Economics, Academic Press.

Zakoian, J. M. (1994), ‘Threshold heteroskedastic models’, Journal of Economic Dynamics and Control
18(5), 931–955.

Zumbach, G. (2007), The riskmetrics 2006 methodology, Technical report, RiskMetrics Group.
