Nonparametric Statistics: Theory and Applications¹

ZONGWU CAI
E-mail address: zcai@uncc.edu
Department of Mathematics & Statistics,
University of North Carolina, Charlotte, NC 28223, U.S.A.

September 18, 2012

© 2012, ALL RIGHTS RESERVED by ZONGWU CAI

¹ This manuscript may be printed and reproduced for individual or instructional use, but may not be printed for commercial purposes.
Preface
This is an advanced-level course on nonparametric econometrics with theory and applications. The focus is on both the theory and the skills of analyzing real data using nonparametric econometric techniques and statistical software such as R, in line with the spirit of STRONG THEORETICAL FOUNDATION and SKILL EXCELLENCE. In other words, this course covers advanced topics in the analysis of economic and financial data using nonparametric techniques, particularly nonlinear time series models and some models related to economic and financial applications. The topics covered range from classical approaches to modern modeling techniques, up to the research frontiers. The difference between this course and others is that you will learn not only the theory but also, step by step, how to build a model based on data (so-called "letting the data speak for themselves") through real data examples using statistical software, and how to explore real data using what you have learned. Therefore, no single book serves as a textbook for this course, so materials from several books and articles will be provided, together with the necessary handouts, including computer code such as R code (you might be asked to print out the materials yourself).
Several projects, including heavy computer work, are assigned throughout the term. The purpose of the projects is to train students to understand the theoretical concepts and to know how to apply the methodology to real problems. Group discussion is allowed for the projects, particularly for writing the computer code, but the final report for each project must be written in your own words; copying from each other will be regarded as cheating. If you use the R language, which is similar to S-PLUS, you can download it from the public web site at http://www.r-project.org/ and install it on your own computer, or you can use the PCs in our labs. You are STRONGLY encouraged to use (but not limited to) the package R since it is a very convenient programming language for doing statistical analysis and Monte Carlo simulations as well as various applications in quantitative economics and finance. Of course, you are welcome to use any other package such as SAS, GAUSS, STATA, SPSS or EViews, but I may not be able to help you if you do.
Contents
1 Package R and Simple Applications 1
1.1 Computational Toolkits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 How to Install R ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Data Analysis and Graphics Using R – An Introduction (109 pages) . . . . . 3
1.4 CRAN Task View: Empirical Finance . . . . . . . . . . . . . . . . . . . . . . 3
1.5 CRAN Task View: Computational Econometrics . . . . . . . . . . . . . . . . 3
2 Estimation of Covariance Matrix 5
2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 R Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Reading Materials: the paper by Zeileis (2004) . . . . . . . . . . . . . . . . 12
2.5 Computer Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Density, Distribution & Quantile Estimations 16
3.1 Time Series Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Mixing Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Martingale and Mixingale . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Nonparametric Density Estimate . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Boundary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.5 Project for Density Estimation . . . . . . . . . . . . . . . . . . . . . 31
3.2.6 Multivariate Density Estimation . . . . . . . . . . . . . . . . . . . . . 32
3.2.7 Reading Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Distribution Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Smoothed Distribution Estimation . . . . . . . . . . . . . . . . . . . 34
3.3.2 Relative Efficiency and Deficiency . . . . . . . . . . . . . . . . . . . . 37
3.4 Quantile Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Value at Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Nonparametric Quantile Estimation . . . . . . . . . . . . . . . . . . . 39
3.5 Computer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Nonparametric Regression Models 47
4.1 Prediction and Regression Functions . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Kernel Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Boundary Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Local Polynomial Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Implementation in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Complexity of Local Polynomial Estimator . . . . . . . . . . . . . . . 55
4.3.4 Properties of Local Polynomial Estimator . . . . . . . . . . . . . . . 57
4.3.5 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Project for Regression Function Estimation . . . . . . . . . . . . . . . . . . . 63
4.5 Functional Coefficient Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.2 Local Linear Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.3 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.4 Smoothing Variable Selection . . . . . . . . . . . . . . . . . . . . . . 67
4.5.5 Goodness-of-Fit Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5.6 Asymptotic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5.7 Conditions and Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5.8 Monte Carlo Simulations and Applications . . . . . . . . . . . . . . . 78
4.6 Additive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.2 Backfitting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.6.3 Projection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.6.4 Two-Stage Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6.5 Monte Carlo Simulations and Applications . . . . . . . . . . . . . . . 87
4.6.6 New Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.7 Additive Model for Boston House Price Data . . . . . . . . . . . . . . 88
4.7 Computer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.7.1 Example 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.7.2 Codes for Additive Modeling Analysis of Boston Data . . . . . . . . . 94
4.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 Nonparametric Quantile Models 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Modeling Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2.1 Local Linear Quantile Estimate . . . . . . . . . . . . . . . . . . . . . 105
5.2.2 Asymptotic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2.3 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.4 Covariance Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Empirical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.1 A Simulated Example . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Real Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4 Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5 Proofs of Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.6 Computer Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6 Conditional VaR and Expected Shortfall 140
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3 Nonparametric Estimating Procedures . . . . . . . . . . . . . . . . . . . . . 145
6.3.1 Estimation of Conditional PDF and CDF . . . . . . . . . . . . . . . . 146
6.3.2 Estimation of Conditional VaR and ES . . . . . . . . . . . . . . . . . 149
6.4 Distribution Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.2 Asymptotic Properties for Conditional PDF and CDF . . . . . . . . . 151
6.4.3 Asymptotic Theory for CVaR and CES . . . . . . . . . . . . . . . . . 154
6.5 Empirical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.5.1 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.5.2 Simulated Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.5.3 Real Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.6 Proofs of Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.7 Proofs of Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.8 Computer Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
List of Tables
3.1 Sample sizes required for p-dimensional nonparametric regression to have comparable performance
List of Figures
2.1 Time plots of U.S. weekly interest rates (in percentages) from January 5, 1962 to September 10, 1999
2.2 Scatterplots of U.S. weekly interest rates from January 5, 1962 to September 10, 1999: the left panel
2.3 Residual series of linear regression Model I for two U.S. weekly interest rates: the left panel is time
2.4 Time plots of the change series of U.S. weekly interest rates from January 12, 1962 to September
2.5 Residual series of the linear regression models: Model II (top) and Model III (bottom) for two change
3.1 Bandwidth is taken to be 0.25, 0.5, 1.0 and the optimal one (see later) with the Epanechnikov kernel.
3.2 The ACF and PACF plots for the original data (top panel) and the first difference (middle panel).
4.1 Scatterplots of x_t, |x_t|, and x_t^2 versus x_t with the smoothed curves computed using scatter.smooth
4.2 Scatterplots of x_t, |x_t|, and x_t^2 versus x_t with the smoothed curves computed using scatter.smooth
4.3 The results from model (4.66) . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 (a) Residual plot for model (4.66). (b) Plot of g_1(x_6) versus x_6. (c) Residual plot for model (4.67)
5.1 Simulated Example: The plots of the estimated coefficient functions for three quantiles τ = 0.05
5.2 Boston Housing Price Data: Displayed in (a)-(d) are the scatter plots of the house price versus the
5.3 Boston Housing Price Data: The plots of the estimated coefficient functions for three quantiles
5.4 Exchange Rate Series: (a) Japanese-dollar exchange rate return series Y_t; (b) autocorrelation function
5.5 Exchange Rate Series: The plots of the estimated coefficient functions for three quantiles τ = 0.05
6.1 Simulation results for Example 1 when p = 0.05. Displayed in (a)-(c) are the true CVaR functions
6.2 Simulation results for Example 1 when p = 0.05. Displayed in (a)-(c) are the true CES functions
6.3 Simulation results for Example 1 when p = 0.01. Displayed in (a)-(c) are the true CVaR functions
6.4 Simulation results for Example 1 when p = 0.01. Displayed in (a)-(c) are the true CES functions
6.5 Simulation results for Example 2 when p = 0.05. (a) Boxplots of MADEs for both the WDKLL and
6.6 (a) 5% CVaR estimate for DJI index. (b) 5% CES estimate for DJI index . . 164
6.7 (a) 5% CVaR estimates for IBM stock returns. (b) 5% CES estimates for IBM stock returns index
Chapter 1
Package R and Simple Applications
1.1 Computational Toolkits
When you work with large data sets, messy data handling, models, etc., you need to choose computational tools that are useful for dealing with these kinds of problems. There are menu-driven systems where you click some buttons and get some work done, but these are useless for anything nontrivial. To do serious economics and finance these days, you have to write computer programs. And this is true of any field, for example, applied econometrics or empirical macroeconomics, and not just of computational finance, which is a hot buzzword recently.
The question is how to choose the computational tools. According to Ajay Shah (December 2005), you should pay attention to the following elements: price, freedom, elegant and powerful computer science, and network effects. A low price is better than a high price; price = 0 is obviously best of all. Freedom here has many aspects. A good software system is one that does not tie you down in terms of hardware/OS, so that you are able to keep moving. Another aspect of freedom is in working with colleagues, collaborators and students. With commercial software, this becomes a problem, because your colleagues may not have the same software that you are using. Here free software really wins spectacularly. Good practice in research involves a great accent on reproducibility. Reproducibility is important both to avoid mistakes and because the next person working in your field should be standing on your shoulders. This requires an ability to release code, which is only possible with free software. Systems like SAS and Gauss use archaic computer science. The code is inelegant and the language is not powerful. In this day and age, writing C or Fortran by hand is too low level. Hell, with Gauss, even a minimal thing like online help is tawdry.
One prefers a system built by people who know their computer science: it should be an elegant, powerful language. All standard CS knowledge should be nicely in play to give you a gorgeous system. Good computer science gives you more productive humans. Lots of economists use Gauss and give out Gauss source code, so there is a network effect in favor of Gauss. A similar thing is right now happening with statisticians and R.
Here I cite comparisons among the most commonly used packages (see Ajay Shah (December 2005)); see the web site at
http://www.mayin.org/ajayshah/COMPUTING/mytools.html.
R is a very convenient programming language for doing statistical analysis and Monte Carlo simulations as well as various applications in quantitative economics and finance. Indeed, we prefer to think of it as an environment within which statistical techniques are implemented. I will teach it at the introductory level, but NOTICE that you will have to learn R on your own. Note that about 97% of commands in S-PLUS and R are the same. In particular, for analyzing time series data, R has a lot of bundles and packages which can be downloaded for free, for example, at http://www.r-project.org/.
R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
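As a small illustration of defining a new function in R, the sketch below (the function name my_summary is ours, not from any package) computes a few summary statistics of a numeric vector using only base R:
# A minimal sketch of a user-defined function (illustrative names)
my_summary <- function(x) {
  # return the mean, standard deviation and quartiles of a numeric vector
  c(mean = mean(x), sd = sd(x), quantile(x, probs = c(0.25, 0.5, 0.75)))
}
set.seed(1)
print(my_summary(rnorm(100)))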
1.2 How to Install R ?
(1) go to the web site http://www.r-project.org/;
(2) click CRAN;
(3) choose a site for downloading, say http://cran.cnr.Berkeley.edu;
(4) click Windows (95 and later);
(5) click base;
(6) click R-2.15.1-win.exe (Version of 22-06-2012) to save this file first and then run it to install.
The basic R system is now installed on your computer. If you need to install other packages, do the following:
(7) After it is installed, there is an icon on the screen. Click the icon to get into R;
(8) Go to the top, find the Packages menu, and then click it;
(9) Go down to Install package(s)... and click it;
(10) There is a new window. Choose a location to download the packages, say USA(CA1),
move mouse to there and click OK;
(11) There is a new window listing all packages. You can select any one of packages and
click OK, or you can select all of them and then click OK.
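Packages can also be installed and loaded directly from the R console instead of through the menus; a minimal sketch (the package names below are only examples):
install.packages("sandwich")        # choose a CRAN mirror when prompted
install.packages(c("zoo", "lmtest"))
library(sandwich)                   # load an installed package into the current session
update.packages()                   # update all installed packages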
1.3 Data Analysis and Graphics Using R – An Introduction (109 pages)
See the file r-notes.pdf (109 pages), which can be downloaded from
http://www.math.uncc.edu/~zcai/r-notes.pdf.
I encourage you to download this file and learn it on your own.
1.4 CRAN Task View: Empirical Finance
This CRAN Task View contains a list of packages useful for empirical work in Finance, grouped by topic. Besides these packages, a very wide variety of functions suitable for empirical work in Finance is provided by both the basic R system (and its set of recommended core packages) and a number of other packages on the Comprehensive R Archive Network (CRAN). Consequently, several of the other CRAN Task Views may contain suitable packages, in particular the Econometrics Task View. The web site is
http://cran.r-project.org/src/contrib/Views/Finance.html
1.5 CRAN Task View: Computational Econometrics
Base R ships with a lot of functionality useful for computational econometrics, in particular in the stats package. This functionality is complemented by many packages on CRAN; a brief overview is given below. There is also considerable overlap between the tools for econometrics in this view and for finance in the Finance view. Furthermore, the finance SIG is a suitable mailing list for obtaining help and discussing questions about both computational finance and econometrics. The packages in this view can be roughly structured into the following topics. The web site is
http://cran.r-project.org/src/contrib/Views/Econometrics.html
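If you want to install all packages listed in one of these Task Views at once, the ctv package on CRAN provides a shortcut; a sketch, assuming the view names are unchanged:
install.packages("ctv")
library(ctv)
install.views("Econometrics")   # installs every package in the Econometrics view
install.views("Finance")        # installs every package in the Finance view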
Chapter 2
Estimation of Covariance Matrix
2.1 Methodology
Consider the regression model stated in (2.1) below. There may exist situations in which the error $e_t$ has serial correlation and/or conditional heteroscedasticity, but the main objective of the analysis is to make inference about the regression coefficients $\beta$. When $e_t$ has serial correlation, one usually assumes that $e_t$ follows an ARIMA-type model, but this assumption might not always be satisfied in applications. Here, we consider a general situation without making this assumption. In situations under which the ordinary least squares (LS) estimates of the coefficients remain consistent, methods are available to provide consistent estimates of the covariance matrix of the coefficients. Two such methods are widely used in economics and finance. The first is the heteroscedasticity consistent (HC) estimator; see Eicker (1967) and White (1980). The second is the heteroscedasticity and autocorrelation consistent (HAC) estimator; see Newey and West (1987).
To ease the discussion, we write the regression model as
$$y_t = \beta^T x_t + e_t, \qquad (2.1)$$
where $y_t$ is the dependent variable, $x_t = (x_{1t}, \ldots, x_{pt})^T$ is a $p$-dimensional vector of explanatory variables (including the constant and lagged variables), and $\beta = (\beta_1, \ldots, \beta_p)^T$ is the parameter vector. The LS estimate of $\beta$ is given by
$$\widehat{\beta} = \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1} \sum_{t=1}^{n} x_t y_t,$$
and the associated covariance matrix has the so-called sandwich form
$$\Sigma_{\beta} = \mathrm{Cov}(\widehat\beta) = \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1} C \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1},$$
which, if $e_t$ is iid, reduces to
$$\Sigma_{\beta} = \sigma_e^2 \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1},$$
where $C$, called the "meat", is given by
$$C = \mathrm{Var}\left(\sum_{t=1}^{n} e_t x_t\right),$$
and $\sigma_e^2$ is the variance of $e_t$, estimated by the variance of the residuals of the regression. In the presence of serial correlation or conditional heteroscedasticity, the prior covariance matrix estimator is inconsistent, often resulting in inflated $t$-ratios for $\widehat\beta$.
The estimator of White (1980) is based on the following:
$$\widehat\Sigma_{\beta,hc} = \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1} \widehat{C}_{hc} \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1},$$
where, with $\widehat{e}_t = y_t - \widehat\beta^T x_t$ being the residual at time $t$,
$$\widehat{C}_{hc} = \frac{n}{n-p} \sum_{t=1}^{n} \widehat{e}_t^{\,2}\, x_t x_t^T.$$
The estimator of Newey and West (1987) is
$$\widehat\Sigma_{\beta,hac} = \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1} \widehat{C}_{hac} \left(\sum_{t=1}^{n} x_t x_t^T\right)^{-1},$$
where $\widehat{C}_{hac}$ is given by
$$\widehat{C}_{hac} = \sum_{t=1}^{n} \widehat{e}_t^{\,2}\, x_t x_t^T + \sum_{j=1}^{l} w_j \sum_{t=j+1}^{n} \left( x_t \widehat{e}_t \widehat{e}_{t-j} x_{t-j}^T + x_{t-j} \widehat{e}_{t-j} \widehat{e}_t x_t^T \right),$$
where $l$ is a truncation parameter and $w_j$ is a weight function such as the Bartlett weights defined by $w_j = 1 - j/(l+1)$. Other weight functions can also be used. Newey and West (1987) showed that if $l \to \infty$ and $l^4/T \to 0$, then $\widehat{C}_{hac}$ is a consistent estimator of $C$. Newey and West (1987) suggested choosing $l$ to be the integer part of $4\,(n/100)^{1/4}$, and Newey and West (1994) suggested some adaptive (data-driven) methods to choose $l$; see Newey and West (1994) for details. In general, this estimator essentially uses a nonparametric method to estimate the covariance matrix of $\sum_{t=1}^{n} e_t x_t$, and a class of kernel-based heteroskedasticity and autocorrelation consistent (HAC) covariance matrix
estimators was introduced by Andrews (1991). For example, the Bartlett weight $w_j$ above can be replaced by $w_j = K(j/(l+1))$, where $K(\cdot)$ is a kernel function such as the truncated kernel $K(x) = I(|x| \le 1)$, the Tukey-Hanning kernel $K(x) = (1 + \cos(\pi x))/2$ for $|x| \le 1$, the Parzen kernel
$$K(x) = \begin{cases} 1 - 6x^2 + 6|x|^3, & 0 \le |x| \le 1/2, \\ 2\,(1-|x|)^3, & 1/2 \le |x| \le 1, \\ 0, & \text{otherwise}, \end{cases}$$
and the quadratic spectral kernel
$$K(x) = \frac{25}{12\pi^2 x^2} \left[ \frac{\sin(6\pi x/5)}{6\pi x/5} - \cos(6\pi x/5) \right].$$
Andrews (1991) suggested using a data-driven method to select the bandwidth $l$: $\widehat{l} = 2.66\,(\widehat\alpha T)^{1/5}$ for the Parzen kernel, $\widehat{l} = 1.7462\,(\widehat\alpha T)^{1/5}$ for the Tukey-Hanning kernel, and $\widehat{l} = 1.3221\,(\widehat\alpha T)^{1/5}$ for the quadratic spectral kernel, where
$$\widehat\alpha = \frac{4 \sum_{i=1}^{k} \widehat\rho_i^{\,2}\, \widehat\sigma_i^{\,4} / (1-\widehat\rho_i)^{8}}{\sum_{i=1}^{n} \widehat\sigma_i^{\,4} / (1-\widehat\rho_i)^{4}}$$
with $\widehat\rho_i$ and $\widehat\sigma_i$ being parameters estimated from an AR(1) model for $u_t = e_t x_t$.
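To make the formulas above concrete, the following R sketch computes the Newey-West HAC covariance estimate directly from the residuals and regressors of a fitted linear model, using the Bartlett weights $w_j = 1 - j/(l+1)$ and the truncation lag $l = \lfloor 4(n/100)^{1/4}\rfloor$. The function name nw_cov is ours; in practice one would use the sandwich package discussed in Section 2.3.
# Minimal sketch of the Newey-West (1987) HAC covariance estimator
nw_cov <- function(fit, l = NULL) {
  X <- model.matrix(fit)               # n x p matrix of regressors
  e <- residuals(fit)                  # OLS residuals
  n <- nrow(X)
  if (is.null(l)) l <- floor(4 * (n / 100)^(1/4))   # rule-of-thumb truncation lag
  bread <- solve(crossprod(X))         # (sum x_t x_t')^{-1}
  meat <- crossprod(X * e)             # sum e_t^2 x_t x_t'
  for (j in seq_len(l)) {
    w <- 1 - j / (l + 1)               # Bartlett weight
    G <- t(X[(j + 1):n, , drop = FALSE] * e[(j + 1):n]) %*%
         (X[1:(n - j), , drop = FALSE] * e[1:(n - j)])
    meat <- meat + w * (G + t(G))      # add the lag-j cross-product terms
  }
  bread %*% meat %*% bread             # sandwich form
}
# Usage (hypothetical data): se <- sqrt(diag(nw_cov(lm(y ~ x, data = mydata))))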
2.2 An Example
Example 2.1: We consider the relationship between two U.S. weekly interest rate series: $x_t$, the 1-year Treasury constant maturity rate, and $y_t$, the 3-year Treasury constant maturity rate. Both series have 1967 observations from January 5, 1962 to September 10, 1999 and are measured in percentages. The series are obtained from the Federal Reserve Bank of St. Louis.
Figure 2.1 shows the time plots of the two interest rates, with the solid line denoting the 1-year rate and the dashed line the 3-year rate. The left panel of Figure 2.2 plots $y_t$ versus $x_t$, indicating that, as expected, the two interest rates are highly correlated. A naive way to describe the relationship between the two interest rates is to use the simple model, Model I: $y_t = \beta_1 + \beta_2 x_t + e_t$. This results in the fitted model $y_t = 0.911 + 0.924\, x_t + \widehat{e}_t$, with $\widehat\sigma_e^2 = 0.538$ and $R^2 = 95.8\%$, where the standard errors of the two coefficients are 0.032 and 0.004, respectively. This simple model (Model I) confirms the high correlation between the two interest rates. However, the model is seriously inadequate, as shown by Figure 2.3, which gives the time plot and ACF of its residuals. In particular, the sample ACF of the residuals
Figure 2.1: Time plots of U.S. weekly interest rates (in percentages) from January 5, 1962
to September 10, 1999. The solid line (black) is the Treasury 1-year constant maturity rate
and the dashed line the Treasury 3-year constant maturity rate (red).
Figure 2.2: Scatterplots of U.S. weekly interest rates from January 5, 1962 to September 10,
1999: the left panel is 3-year rate versus 1-year rate, and the right panel is changes in 3-year
rate versus changes in 1-year rate.
is highly significant and decays slowly, showing the pattern of a unit-root nonstationary time series. The behavior of the residuals suggests that marked differences exist between the two interest rates. Using modern econometric terminology, if one assumes that the two interest rate series are unit-root nonstationary, then the behavior of the residuals indicates that the two interest rates are not co-integrated. In other words, the data fail to support the hypothesis that there exists a long-term equilibrium between the two interest rates. In some sense, this is not surprising because the pattern of an "inverted yield curve" did occur during the data span. By an inverted yield curve, we mean the situation under which
Figure 2.3: Residual series of linear regression Model I for two U.S. weekly interest rates:
the left panel is time plot and the right panel is ACF.
interest rates are inversely related to their time to maturities.
The unit-root behavior of both interest rates and of the residuals leads to the consideration of the change series of the interest rates. Let $\Delta x_t = x_t - x_{t-1} = (1 - L)\,x_t$ be the changes in the 1-year interest rate and $\Delta y_t = y_t - y_{t-1} = (1 - L)\,y_t$ denote the changes in the 3-year interest rate. Consider the linear regression, Model II: $\Delta y_t = \beta_1 + \beta_2\, \Delta x_t + e_t$. Figure 2.4 shows time plots of the two change series, whereas the right panel of Figure 2.2 provides a scatterplot
Figure 2.4: Time plots of the change series of U.S. weekly interest rates from January 12, 1962 to September 10, 1999: changes in the Treasury 1-year constant maturity rate are denoted by the black solid line, and changes in the Treasury 3-year constant maturity rate are indicated by the red dashed line.
between them. The change series remain highly correlated, with a fitted linear regression model given by $\Delta y_t = 0.0002 + 0.7811\, \Delta x_t + \widehat{e}_t$, with $\widehat\sigma_e^2 = 0.0682$ and $R^2 = 84.8\%$. The standard errors of the two coefficients are 0.0015 and 0.0075, respectively. This model further confirms the strong linear dependence between the interest rates. The two top panels of Figure 2.5 show the time plot (left) and sample ACF (right) of the residuals (Model II). Once again,
Figure 2.5: Residual series of the linear regression models: Model II (top) and Model III
(bottom) for two change series of U.S. weekly interest rates: time plot (left) and ACF (right).
the ACF shows some significant serial correlation in the residuals, but the magnitude of the correlation is much smaller. This weak serial dependence in the residuals can be modeled using the simple time series models discussed in the previous sections, and we have a linear regression with time series errors.
For illustration, we consider the first-differenced interest rate series in Model II. The $t$-ratio of the coefficient of $\Delta x_t$ is 104.63 if both serial correlation and conditional heteroscedasticity in the residuals are ignored; it becomes 46.73 when the HC estimator is used, and it reduces to 40.08 when the HAC estimator is employed.
2.3 R Commands
To use the HC or HAC estimators, we can use the package sandwich in R; the relevant commands are vcovHC(), vcovHAC(), and meatHAC(). There is a set of functions implementing a class of kernel-based heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimators as introduced by Andrews (1991). In vcovHC(), the estimators differ in their choice of the $\omega_i$ in $\Omega = \mathrm{Var}(e) = \mathrm{diag}\{\omega_1, \ldots, \omega_n\}$; an overview of the most important cases is given in the following:
$$\begin{aligned}
\text{const}: &\quad \omega_i = \widehat\sigma^2, \\
\text{HC0}: &\quad \omega_i = \widehat{e}_i^{\,2}, \\
\text{HC1}: &\quad \omega_i = \frac{n}{n-k}\, \widehat{e}_i^{\,2}, \\
\text{HC2}: &\quad \omega_i = \frac{\widehat{e}_i^{\,2}}{1-h_i}, \\
\text{HC3}: &\quad \omega_i = \frac{\widehat{e}_i^{\,2}}{(1-h_i)^2}, \\
\text{HC4}: &\quad \omega_i = \frac{\widehat{e}_i^{\,2}}{(1-h_i)^{\delta_i}},
\end{aligned}$$
where $h_i = H_{ii}$ are the diagonal elements of the hat matrix and $\delta_i = \min\{4, h_i/\bar h\}$.
vcovHC(x, type = c("HC3", "const", "HC", "HC0", "HC1", "HC2", "HC4"),
omega = NULL, sandwich = TRUE, ...)
meatHC(x, type = , omega = NULL)
vcovHAC(x, order.by = NULL, prewhite = FALSE, weights = weightsAndrews,
adjust = TRUE, diagnostics = FALSE, sandwich = TRUE, ar.method = "ols",
data = list(), ...)
meatHAC(x, order.by = NULL, prewhite = FALSE, weights = weightsAndrews,
adjust = TRUE, diagnostics = FALSE, ar.method = "ols", data = list())
kernHAC(x, order.by = NULL, prewhite = 1, bw = bwAndrews,
kernel = c("Quadratic Spectral", "Truncated", "Bartlett", "Parzen",
"Tukey-Hanning"), approx = c("AR(1)", "ARMA(1,1)"), adjust = TRUE,
diagnostics = FALSE, sandwich = TRUE, ar.method = "ols", tol = 1e-7,
data = list(), verbose = FALSE, ...)
weightsAndrews(x, order.by = NULL,bw = bwAndrews,
kernel = c("Quadratic Spectral","Truncated","Bartlett","Parzen",
"Tukey-Hanning"), prewhite = 1, ar.method = "ols", tol = 1e-7,
data = list(), verbose = FALSE, ...)
bwAndrews(x,order.by=NULL,kernel=c("Quadratic Spectral", "Truncated",
"Bartlett","Parzen","Tukey-Hanning"), approx=c("AR(1)", "ARMA(1,1)"),
weights = NULL, prewhite = 1, ar.method = "ols", data = list(), ...)
Also, there is a set of functions implementing the Newey and West (1987, 1994) heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimators.
NeweyWest(x, lag = NULL, order.by = NULL, prewhite = TRUE, adjust = FALSE,
diagnostics = FALSE, sandwich = TRUE, ar.method = "ols", data = list(),
verbose = FALSE)
bwNeweyWest(x, order.by = NULL, kernel = c("Bartlett", "Parzen",
"Quadratic Spectral", "Truncated", "Tukey-Hanning"), weights = NULL,
prewhite = 1, ar.method = "ols", data = list(), ...)
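A usage sketch of these functions (assuming the sandwich and lmtest packages are installed and that mydata is a hypothetical data frame with columns y and x):
library(sandwich)
library(lmtest)
fit <- lm(y ~ x, data = mydata)                    # mydata is an illustrative data frame
coeftest(fit)                                      # usual (iid) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))    # White (1980) HC estimator
coeftest(fit, vcov = vcovHAC(fit))                 # Andrews (1991) kernel HAC estimator
coeftest(fit, vcov = NeweyWest(fit))               # Newey-West (1987, 1994) estimator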
2.4 Reading Materials: the paper by Zeileis (2004)
2.5 Computer Codes
#####################################################
# This is Example 2.1 for weekly interest rate series
#####################################################
z<-read.table("c:/res-teach/xiada/teaching05-07/data/ex2-1.txt",header=F)
# first column=one year Treasury constant maturity rate;
# second column=three year Treasury constant maturity rate;
# third column=date
x=z[,1]
y=z[,2]
n=length(x)
u=seq(1962+1/52,by=1/52,length=n)
x_diff=diff(x)
y_diff=diff(y)
# Fit a simple regression model and examine the residuals
fit1=lm(y~x) # Model 1
e1=fit1$resid
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-2.1.eps",
horizontal=F,width=6,height=6)
matplot(u,cbind(x,y),type="l",lty=c(1,2),col=c(1,2),ylab="",xlab="")
dev.off()
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-2.2.eps",
horizontal=F,width=6,height=6)
par(mfrow=c(1,2),mex=0.4,bg="light grey")
plot(x,y,type="p",pch="o",ylab="",xlab="",cex=0.5)
plot(x_diff,y_diff,type="p",pch="o",ylab="",xlab="",cex=0.5)
dev.off()
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-2.3.eps",
horizontal=F,width=6,height=6)
par(mfrow=c(1,2),mex=0.4,bg="light green")
plot(u,e1,type="l",lty=1,ylab="",xlab="")
abline(0,0)
acf(e1,ylab="",xlab="",ylim=c(-0.5,1),lag=30,main="")
dev.off()
# Take differences and fit a simple regression again
fit2=lm(y_diff~x_diff) # Model 2
e2=fit2$resid
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-2.4.eps",
horizontal=F,width=6,height=6)
matplot(u[-1],cbind(x_diff,y_diff),type="l",lty=c(1,2),col=c(1,2),
ylab="",xlab="")
abline(0,0)
dev.off()
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-2.5.eps",
horizontal=F,width=6,height=6)
par(mfrow=c(2,2),mex=0.4,bg="light pink")
ts.plot(e2,type="l",lty=1,ylab="",xlab="")
abline(0,0)
acf(e2, ylab="", xlab="", ylim=c(-0.5,1), lag=30, main="")
# fit a model to the differenced data with an MA(1) error
fit3=arima(y_diff,xreg=x_diff, order=c(0,0,1)) # Model 3
e3=fit3$resid
ts.plot(e3,type="l",lty=1,ylab="",xlab="")
abline(0,0)
acf(e3, ylab="",xlab="",ylim=c(-0.5,1),lag=30,main="")
dev.off()
#################################################################
library(sandwich) # HC and HAC are in the package "sandwich"
library(zoo)
z<-read.table("c:/res-teach/xiada/teaching05-07/data/ex2-1.txt",header=F)
x=z[,1]
y=z[,2]
x_diff=diff(x)
y_diff=diff(y)
# Fit a simple regression model and examine the residuals
fit1=lm(y_diff~x_diff)
print(summary(fit1))
e1=fit1$resid
# Heteroskedasticity-Consistent Covariance Matrix Estimation
#hc0=vcovHC(fit1,type="const")
#print(sqrt(diag(hc0)))
# type=c("const","HC","HC0","HC1","HC2","HC3","HC4")
# HC0 is the White estimator
hc1=vcovHC(fit1,type="HC0")
print(sqrt(diag(hc1)))
#Heteroskedasticity and autocorrelation consistent (HAC) estimation
#of the covariance matrix of the coefficient estimates in a
#(generalized) linear regression model.
hac1=vcovHAC(fit1,sandwich=T)
print(sqrt(diag(hac1)))
2.6 References
Andrews, D.W.K. (1991). Heteroskedasticity and autocorrelation consistent covariance
matrix estimation. Econometrica, 59, 817-858.
Eicker, F. (1967). Limit theorems for regression with unequal and dependent errors. In
Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability
(L. LeCam and J. Neyman, eds.), University of California Press, Berkeley.
Newey, W.K. and K.D. West (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55, 703-708.
Newey, W.K. and K.D. West (1994). Automatic lag selection in covariance matrix estima-
tion. Review of Economic Studies, 61, 631-653.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817-838.
Zeileis, A. (2004). Econometric computing with HC and HAC covariance matrix estimators.
Journal of Statistical Software, Volume 11, Issue 10.
Zeileis, A. (2006). Object-oriented computation of sandwich estimators. Journal of Statis-
tical Software, 16, 1-16.
Chapter 3
Density, Distribution & Quantile
Estimations
3.1 Time Series Structure
Since most economic and financial data are time series, we discuss our methodologies and theory under the framework of time series. For linear models, the time series structure can often be assumed to have some well-known form such as an autoregressive moving average (ARMA) model. However, in a nonparametric setting, this assumption might not be valid. Therefore, we assume a more general form of time series dependence, commonly used in the literature and described as follows.
3.1.1 Mixing Conditions
Mixing dependence is commonly used to characterize the dependence structure and is often referred to as short range dependence or weak dependence, which means that as the distance in time between two observations grows, the dependence between them becomes weaker and weaker, and does so quickly. It is well known that α-mixing includes many time series models as special cases. In fact, under very mild assumptions, linear processes, including linear autoregressive models and, more generally, bilinear time series models, are α-mixing with mixing coefficients decaying exponentially. Many nonlinear time series models, such as functional coefficient autoregressive processes with/without exogenous variables, nonlinear additive autoregressive models with/without exogenous variables, ARCH and GARCH type processes, stochastic volatility models, and many continuous-time diffusion models (including the Black-Scholes type models), are strong mixing under some mild conditions. See Genon-Catalot, Jeantheau and Laredo (2000), Cai (2002), Carrasco and Chen (2002), and Chen and Tang (2005) for more details.
To simplify the notation, we only introduce mixing conditions for strictly stationary processes (in spite of the fact that a mixing process is not necessarily stationary). The idea is to define mixing coefficients to measure the strength (in different ways) of dependence between two segments of a time series which are apart from each other in time. Let $\{X_t\}$ be a strictly stationary time series. For $n \ge 1$, define
$$\alpha(n) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{n}^{\infty}} |P(A)P(B) - P(AB)|,$$
where $\mathcal{F}_i^j$ denotes the $\sigma$-algebra generated by $\{X_t;\ i \le t \le j\}$. If $\alpha(n) \to 0$ as $n \to \infty$, $\{X_t\}$ is called $\alpha$-mixing or strong mixing. There are several other mixing conditions, such as $\beta$-mixing, $\rho$-mixing, $\phi$-mixing, and $\psi$-mixing; see the books by Hall and Heyde (1980) and Fan and Yao (2003, page 68) for details. Indeed,
$$\beta(n) = E\left[\sup_{A \in \mathcal{F}_n^{\infty}} |P(A) - P(A \mid X_t,\ t \le 0)|\right],$$
$$\rho(n) = \sup_{X \in \mathcal{F}_{-\infty}^{0},\, Y \in \mathcal{F}_n^{\infty}} |\mathrm{Corr}(X, Y)|,$$
$$\phi(n) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_n^{\infty},\, P(A) > 0} |P(B) - P(B \mid A)|,$$
and
$$\psi(n) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_n^{\infty},\, P(A)P(B) > 0} |1 - P(B \mid A)/P(B)|.$$
It is well known that the relationships among the mixing coefficients are
$$\alpha(n) \le \frac{1}{4}\,\rho(n) \le \frac{1}{2}\,\phi^{1/2}(n),$$
so that $\psi$-mixing $\Rightarrow$ $\phi$-mixing $\Rightarrow$ $\rho$-mixing $\Rightarrow$ $\alpha$-mixing, as well as $\beta$-mixing $\Rightarrow$ $\alpha$-mixing. Note that all our theoretical results are derived under mixing conditions. The following inequalities are very useful in applications; they can be found in the book by Hall and Heyde (1980, pp. 277-280).
Lemma 3.1 (Davydov's inequality): (i) If $E|X_i|^p + E|X_j|^q < \infty$ for some $p \ge 1$ and $q \ge 1$ with $1/p + 1/q < 1$, it holds that
$$|\mathrm{Cov}(X_i, X_j)| \le 8\, \alpha^{1/r}(|j-i|)\, \|X_i\|_p\, \|X_j\|_q,$$
where $r = (1 - 1/p - 1/q)^{-1}$.
(ii) If $P(|X_i| \le C_1) = 1$ and $P(|X_j| \le C_2) = 1$ for some constants $C_1$ and $C_2$, it holds that
$$|\mathrm{Cov}(X_i, X_j)| \le 4\, \alpha(|j-i|)\, C_1 C_2.$$
Note that if we allow $X_i$ and $X_j$ to be complex-valued random variables, (ii) still holds with the constant 4 on the right-hand side of the inequality replaced by 16.
(iii) If $P(|X_i| \le C_1) = 1$ and $E|X_j|^p < \infty$ for some constant $C_1$ and $p > 1$, then
$$|\mathrm{Cov}(X_i, X_j)| \le 6\, C_1\, \|X_j\|_p\, \alpha^{1-1/p}(|j-i|).$$
Lemma 3.2: If $E|X_i|^p + E|X_j|^q < \infty$ for some $p \ge 1$ and $q \ge 1$ with $1/p + 1/q = 1$, it holds that
$$|\mathrm{Cov}(X_i, X_j)| \le 2\, \phi^{1/p}(|j-i|)\, \|X_i\|_p\, \|X_j\|_q.$$
3.1.2 Martingale and Mixingale
Martingales are very useful in applications. Here is the definition. Let $\{X_n, n \in \mathcal{N}\}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $\{\mathcal{F}_n, n \in \mathcal{N}\}$ be an increasing sequence of sub-$\sigma$-fields of $\mathcal{F}$. Suppose that the sequence $\{X_n, n \in \mathcal{N}\}$ satisfies
(i) $X_n$ is measurable with respect to $\mathcal{F}_n$,
(ii) $E|X_n| < \infty$,
(iii) $E[X_n \mid \mathcal{F}_m] = X_m$ for all $m < n$, $n \in \mathcal{N}$.
Then the sequence $\{X_n, n \in \mathcal{N}\}$ is said to be a martingale with respect to $\{\mathcal{F}_n, n \in \mathcal{N}\}$, and we write that $\{X_n, \mathcal{F}_n, n \in \mathcal{N}\}$ is a martingale. If (i) and (ii) are retained and (iii) is replaced by the inequality $E[X_n \mid \mathcal{F}_m] \ge X_m$ (respectively $E[X_n \mid \mathcal{F}_m] \le X_m$), then $\{X_n, \mathcal{F}_n, n \in \mathcal{N}\}$ is called a sub-martingale (super-martingale). Define $Y_n = X_n - X_{n-1}$. Then $\{Y_n, \mathcal{F}_n, n \in \mathcal{N}\}$ is called a martingale difference (MD) if $\{X_n, \mathcal{F}_n, n \in \mathcal{N}\}$ is a martingale. Clearly, $E[Y_n \mid \mathcal{F}_{n-1}] = 0$, which means that an MD is not predictable based on past information. In finance language, a stock market is efficient; equivalently, it is an MD.
Another type of dependence structure is the mixingale, a so-called asymptotic martingale. The concept of a mixingale, introduced by McLeish (1975), is defined as follows. Let $\{X_n, n \ge 1\}$ be a sequence of square-integrable random variables on a probability space $(\Omega, \mathcal{F}, P)$, and let $\{\mathcal{F}_n, -\infty < n < \infty\}$ be an increasing sequence of sub-$\sigma$-fields of $\mathcal{F}$. Then $\{X_n, \mathcal{F}_n\}$ is called an $L_r$-mixingale (difference) sequence for $r \ge 1$ if, for some sequences of nonnegative constants $\{c_n\}$ and $\{\psi_m\}$, where $\psi_m \to 0$ as $m \to \infty$, we have
(i) $\|E(X_n \mid \mathcal{F}_{n-m})\|_r \le \psi_m\, c_n$, and (ii) $\|X_n - E(X_n \mid \mathcal{F}_{n+m})\|_r \le \psi_{m+1}\, c_n$,
for all $n \ge 1$ and $m \ge 0$. The idea of a mixingale is to build a bridge between martingales and mixing. The following examples give an idea of the scope of $L_2$-mixingales.
Examples:
1. A square-integrable martingale is a mixingale with $c_n = \|X_n\|$, $\psi_0 = 1$ and $\psi_m = 0$ for $m \ge 1$.
2. A linear process is given by $X_n = \sum_{i=-\infty}^{\infty} \theta_{i-n}\, \varepsilon_i$ with $\{\varepsilon_i\}$ iid with mean zero and variance $\sigma^2$, and $\sum_{i=-\infty}^{\infty} \theta_i^2 < \infty$. Then $\{X_n, \mathcal{F}_n\}$ is a mixingale with all $c_n = \sigma$ and $\psi_m^2 = \sum_{|i| \ge m} \theta_i^2$.
3. If $\{X_n\}$ is a square-integrable $\phi$-mixing sequence, then it is a mixingale with $c_n = 2\|X_n\|_2$ and $\psi_m = \phi^{1/2}(m)$, where $\phi(m)$ is the $\phi$-mixing coefficient.
4. If $\{X_n\}$ is an $\alpha$-mixing sequence with $\|X_n\|_p < \infty$ for some $p > 2$, then it is a mixingale with $c_n = 2(\sqrt{2} + 1)\|X_n\|_p$ and $\psi_m = \alpha^{1/2 - 1/p}(m)$, where $\alpha(m)$ is the $\alpha$-mixing coefficient.
Note that Examples 3 and 4 can be derived from the following inequality, due to McLeish (1975).
Lemma 3.3 (McLeish's inequality): Suppose that $X$ is a random variable measurable with respect to $\mathcal{A}$, and $\|X\|_r < \infty$ for some $1 \le p \le r \le \infty$. Then
$$\|E(X \mid \mathcal{F}) - E(X)\|_p \le \begin{cases} 2\,[\phi(\mathcal{F}, \mathcal{A})]^{1 - 1/r}\, \|X\|_r, & \text{for } \phi\text{-mixing}, \\ 2\,(2^{1/p} + 1)\,[\alpha(\mathcal{F}, \mathcal{A})]^{1/p - 1/r}\, \|X\|_r, & \text{for } \alpha\text{-mixing}. \end{cases}$$
3.2 Nonparametric Density Estimate
Let $\{X_i\}$ be a random sample with (unknown) marginal distribution function $F(\cdot)$ (CDF) and probability density function $f(\cdot)$ (PDF). The question is how to estimate $f(\cdot)$ and $F(\cdot)$. Since
$$F(x) = P(X_i \le x) = E[I(X_i \le x)] = \int_{-\infty}^{x} f(u)\,du,$$
and
$$f(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h} \approx \frac{F(x+h) - F(x-h)}{2h}$$
if $h$ is very small, by the method of moment estimation (MME), $F(x)$ can be estimated by
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$
which is called the empirical cumulative distribution function (ecdf), so that $f(x)$ can be estimated by
$$f_n(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} = \frac{1}{n} \sum_{i=1}^{n} K_h(X_i - x),$$
where $K(u) = I(|u| \le 1)/2$ and $K_h(u) = K(u/h)/h$. Indeed, the kernel function $K(u)$ can be taken to be any symmetric density function. Here, $h$ is called the bandwidth. $f_n(x)$ was proposed initially by Rosenblatt (1956), and Parzen (1962) explored its properties in detail. Therefore, it is called the Rosenblatt-Parzen density estimate.
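As a minimal illustration, the R sketch below implements $f_n(x)$ with the uniform kernel $K(u) = I(|u| \le 1)/2$ used above; any symmetric density (e.g., the Epanechnikov or Gaussian kernel) could be substituted. The function and variable names are ours.
# Rosenblatt-Parzen kernel density estimate with the uniform kernel
kde <- function(x, X, h) {
  # x: evaluation points; X: data; h: bandwidth
  K <- function(u) 0.5 * (abs(u) <= 1)            # K(u) = I(|u| <= 1)/2
  sapply(x, function(x0) mean(K((X - x0) / h)) / h)
}
set.seed(123)
X <- rnorm(300)                                    # simulated data
grid <- seq(-4, 4, by = 0.1)
fhat <- kde(grid, X, h = 0.5)
# compare with the true density: plot(grid, fhat, type = "l"); lines(grid, dnorm(grid), lty = 2)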
Exercise: Show that $F_n(x)$ is an unbiased estimate of $F(x)$ but $f_n(x)$ is a biased estimate of $f(x)$. Think intuitively about (1) why $f_n(x)$ is biased, (2) where the bias comes from, and (3) why $K(\cdot)$ should be symmetric.
3.2.1 Asymptotic Properties
Asymptotic Properties for ECDF
If $\{X_i\}$ is stationary, then $E[F_n(x)] = F(x)$ and
$$n\,\mathrm{Var}(F_n(x)) = \mathrm{Var}(I(X_i \le x)) + 2\sum_{i=2}^{n}\left(1 - \frac{i-1}{n}\right)\mathrm{Cov}\big(I(X_1 \le x),\, I(X_i \le x)\big)$$
$$= \underbrace{F(x)[1 - F(x)] + 2\sum_{i=2}^{n}\mathrm{Cov}\big(I(X_1 \le x),\, I(X_i \le x)\big)}_{\to\ \sigma_F^2(x),\ \text{assuming that } \sigma_F^2(x) < \infty}\ -\ \underbrace{\frac{2}{n}\sum_{i=2}^{n}(i-1)\,\mathrm{Cov}\big(I(X_1 \le x),\, I(X_i \le x)\big)}_{\to\ 0\ \text{by the Kronecker Lemma}}$$
$$\to\ \sigma_F^2(x) \equiv F(x)[1 - F(x)] + \underbrace{2\sum_{i=2}^{\infty}\mathrm{Cov}\big(I(X_1 \le x),\, I(X_i \le x)\big)}_{\text{this term is called } A_d}.$$
Therefore,
$$n\,\mathrm{Var}(F_n(x)) \to \sigma_F^2(x). \qquad (3.1)$$
One can show, based on the mixing theory, that
$$\sqrt{n}\,[F_n(x) - F(x)] \to N\big(0,\ \sigma_F^2(x)\big). \qquad (3.2)$$
It is clear that $A_d = 0$ if the $\{X_i\}$ are independent. If $A_d \ne 0$, the question is how to estimate it. We can use the HC estimator of White (1980), the HAC estimator of Newey and West (1987) (see Chapter 2), or the kernel method of Andrews (1991).
The result in (3.2) can be used to construct a test statistic for the null hypothesis
$$H_0: F(x) = F_0(x) \quad \text{versus} \quad H_a: F(x) \ne (>)(<)\ F_0(x).$$
This test statistic is the well-known Kolmogorov-Smirnov test, defined as
$$D_n = \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$$
for the two-sided test. One can show (see Serfling (1980)) that under some regularity conditions,
$$P(\sqrt{n}\, D_n \le d) \to 1 - 2\sum_{j=1}^{\infty} (-1)^{j+1} \exp(-2 j^2 d^2)$$
and
$$P(\sqrt{n}\, D_n^{+} \le d) = P(\sqrt{n}\, D_n^{-} \le d) \to 1 - \exp(-2 d^2),$$
where $D_n^{+} = \sup_{-\infty < x < \infty} [F_n(x) - F_0(x)]$ and $D_n^{-} = \sup_{-\infty < x < \infty} [F_0(x) - F_n(x)]$ for the one-sided tests. In R, there is a built-in command for the Kolmogorov-Smirnov test, which is ks.test().
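For example, to test whether a simulated sample comes from the standard normal distribution, one can call ks.test() directly; a small sketch:
set.seed(1)
x <- rnorm(200)
ks.test(x, "pnorm")                               # two-sided test of H0: F = N(0,1)
ks.test(x, "pnorm", alternative = "greater")      # one-sided version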
Asymptotic Properties for Density Estimation
Next, we derive the asymptotic variance of $f_n(x)$. First, define $Z_i = K_h(X_i - x)$. Then,
$$E[Z_1 Z_i] = \int\!\!\int K_h(u-x)\,K_h(v-x)\, f_{1,i}(u,v)\,du\,dv = \int\!\!\int K(u)\,K(v)\, f_{1,i}(x+uh,\, x+vh)\,du\,dv \to f_{1,i}(x,x),$$
where $f_{1,i}(u,v)$ is the joint density of $(X_1, X_i)$, so that
$$\mathrm{Cov}(Z_1, Z_i) \to f_{1,i}(x,x) - f^2(x).$$
It is easy to show that
$$h\,\mathrm{Var}(Z_1) \to \nu_0(K)\, f(x),$$
where $\nu_j(K) = \int u^j K^2(u)\,du$. Therefore,
$$nh\,\mathrm{Var}(f_n(x)) = h\,\mathrm{Var}(Z_1) + \underbrace{2h\sum_{i=2}^{n}\left(1 - \frac{i-1}{n}\right)\mathrm{Cov}(Z_1, Z_i)}_{A_f\ \to\ 0\ \text{under some assumptions}} \to \nu_0(K)\, f(x).$$
To show that $A_f \to 0$, let $d_n \to \infty$ with $d_n h \to 0$. Then,
$$|A_f| \le h\sum_{i=2}^{d_n} |\mathrm{Cov}(Z_1, Z_i)| + h\sum_{i=d_n+1}^{n} |\mathrm{Cov}(Z_1, Z_i)|.$$
For the first term, if $f_{1,i}(u,v) \le M_1$, then it is bounded by $h\, d_n = o(1)$. For the second term, we apply Davydov's inequality (see Lemma 3.1) to obtain
$$h\sum_{i=d_n+1}^{n} |\mathrm{Cov}(Z_1, Z_i)| \le M_2 \sum_{i=d_n+1}^{n} \alpha(i)/h = O\big(d_n^{-\beta+1}\, h^{-1}\big)$$
if $\alpha(n) = O(n^{-\beta})$ for some $\beta > 2$. If $d_n = O(h^{-2/\beta})$, then the second term is dominated by $O(h^{1 - 2/\beta})$, which goes to 0 as $n \to \infty$. Hence,
$$nh\,\mathrm{Var}(f_n(x)) \to \nu_0(K)\, f(x). \qquad (3.3)$$
By comparing (3.1) and (3.3), one can see clearly that there is an infinite sum involved in $\sigma_F^2(x)$ due to the dependence, but the asymptotic variance in (3.3) is the same as that for the iid case (without the infinite sum). We can establish the following asymptotic normality for $f_n(x)$; the proof will be discussed later.
Theorem 3.1: Under regularity conditions, we have
$$\sqrt{nh}\left[ f_n(x) - f(x) - \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o_p(h^2)\right] \to N\big(0,\ \nu_0(K)\, f(x)\big),$$
where the term $\frac{h^2}{2}\,\mu_2(K)\, f''(x)$ is called the asymptotic bias and $\mu_2(K) = \int u^2 K(u)\,du$.
Exercise: By comparing (3.1) and (3.3), what can you observe?
Example 3.1: Let us examine how important the choice of bandwidth is. The data $\{X_i\}_{i=1}^{n}$ are generated from $N(0,1)$ (iid) and $n = 300$. The grid points are taken to be $[-4, 4]$ with an increment of 0.1. The bandwidth is taken to be 0.25, 0.5 and 1.0, respectively, and the kernel can be the Epanechnikov kernel $K(u) = 0.75\,(1-u^2)\,I(|u| \le 1)$ or the Gaussian kernel. Comparisons are given in Figure 3.1.
Figure 3.1: Bandwidth is taken to be 0.25, 0.5, 1.0 and the optimal one (see later) with the Epanechnikov kernel.
Example 3.2: Next, we apply kernel density estimation to estimate the density of the weekly 3-month Treasury bill rate from January 2, 1970 to December 26, 1997. Figure 3.2 displays the ACF and PACF plots for the original data (top panel) and the first difference (middle panel), and the estimated density of the differenced series together with the true standard normal density: the bottom left panel uses the built-in function density() and the bottom right panel uses our own code.
Note that the R code for the above two examples can be found in Section 3.5. R has a built-in function density() for computing the nonparametric density estimate. Also, you can use the command plot(density()) to plot the estimated density. Further, R has a built-in function ecdf() for computing the empirical cumulative distribution function and plot(ecdf()) for plotting the step function.
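A short usage sketch of these built-in functions (the bandwidth and kernel arguments shown are only examples):
x <- rnorm(300)
d <- density(x, bw = 0.5, kernel = "epanechnikov")   # kernel density estimate
plot(d)                                              # plot the estimated density
Fn <- ecdf(x)                                        # empirical CDF as a step function
plot(Fn)                                             # plot the step function
Fn(0)                                                # estimated P(X <= 0)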
Figure 3.2: The ACF and PACF plots for the original data (top panel) and the first difference (middle panel). The bottom left panel is for the built-in function density() and the bottom right panel is for our own code.
3.2.2 Optimality

As we have already shown,
$$E(f_n(x)) = f(x) + \frac{h^2}{2}\mu_2(K)\, f''(x) + o(h^2),$$
and
$$\mathrm{Var}(f_n(x)) = \frac{\nu_0(K)\, f(x)}{nh} + o((nh)^{-1}),$$
so that the asymptotic mean integrated squared error (AMISE) is
$$\mathrm{AMISE} = \frac{h^4}{4}\mu_2^2(K) \int [f''(x)]^2 dx + \frac{\nu_0(K)}{nh}.$$
Minimizing the AMISE gives
$$h_{opt} = C_1(K)\, \|f''\|_2^{-2/5}\, n^{-1/5}, \qquad (3.4)$$
where
$$C_1(K) = \left\{ \nu_0(K)/\mu_2^2(K) \right\}^{1/5}.$$
With this asymptotically optimal bandwidth, the optimal AMISE is given by
$$\mathrm{AMISE}_{opt} = \frac{5}{4}\, C_2(K)\, \|f''\|_2^{2/5}\, n^{-4/5},$$
where
$$C_2(K) = \left\{ \nu_0^2(K)\, \mu_2(K) \right\}^{2/5}.$$
To choose the best kernel, it suffices to choose one to minimize $C_2(K)$.

Proposition 1: The nonnegative probability density function K minimizing $C_2(K)$ is a re-scaling of the Epanechnikov kernel:
$$K_{opt}(u) = \frac{3}{4a}\left(1 - u^2/a^2\right)_{+} \quad \mbox{for any } a > 0.$$
Proof: First of all, we note that $C_2(K_h) = C_2(K)$ for any $h > 0$. Let $K_0$ be the Epanechnikov kernel. For any other nonnegative K, by re-scaling if necessary, we assume that $\mu_2(K) = \mu_2(K_0)$. Thus, we need only to show that $\nu_0(K_0) \le \nu_0(K)$. Let $G = K - K_0$. Then,
$$\int G(u)\,du = 0 \quad \mbox{and} \quad \int u^2 G(u)\,du = 0,$$
which implies that
$$\int (1 - u^2)\, G(u)\,du = 0.$$
Using this and the fact that $K_0$ has the support $[-1, 1]$, we have
$$\int G(u)\, K_0(u)\,du = \frac{3}{4}\int_{|u|\le 1} G(u)(1 - u^2)\,du = -\frac{3}{4}\int_{|u|>1} G(u)(1 - u^2)\,du = \frac{3}{4}\int_{|u|>1} K(u)(u^2 - 1)\,du.$$
Since K is nonnegative, so is the last term. Therefore,
$$\int K^2(u)\,du = \int K_0^2(u)\,du + 2\int K_0(u) G(u)\,du + \int G^2(u)\,du \ge \int K_0^2(u)\,du,$$
which proves that $K_0$ is the optimal kernel.
Remark: This proposition implies that the Epanechnikov kernel should be used in practice.
3.2.3 Boundary Problems

In many applications, the density f(.) has a bounded support. For example, the interest rate cannot be less than zero and income is always nonnegative. It is reasonable to assume that the interest rate has support $[0, \infty)$. However, because a kernel density estimator spreads point masses smoothly around the observed data points, some of the mass for points near the boundary of the support is distributed outside the support of the density. Therefore, the kernel density estimator underestimates the density in the boundary regions. The problem is more severe for large bandwidths and for the left boundary where the density is high. Therefore, some adjustments are needed. To gain some further insight, let us assume without loss of generality that the density function f(.) has bounded support [0, 1] and we deal with the density estimate at the left boundary. For simplicity, suppose that K(.) has support [-1, 1]. For the left boundary point $x = ch$ ($0 \le c < 1$), it can easily be seen that as $h \to 0$,
$$E(f_n(ch)) = \int_{-c}^{1/h - c} f(ch + hu)\, K(u)\,du = f(0+)\,\mu_{0,c}(K) + h\, f'(0+)[c\,\mu_{0,c}(K) + \mu_{1,c}(K)] + o(h), \qquad (3.5)$$
where $f(0+) = \lim_{x \downarrow 0} f(x)$, $\mu_{j,c}(K) = \int_{-c}^{\infty} u^j K(u)\,du$, and $\nu_{j,c}(K) = \int_{-c}^{\infty} u^j K^2(u)\,du$. Also, we can show that $\mathrm{Var}(f_n(ch)) = O(1/nh)$. Therefore,
$$f_n(ch) = f(0+)\,\mu_{0,c}(K) + h\, f'(0+)[c\,\mu_{0,c}(K) + \mu_{1,c}(K)] + o_p(h).$$
In particular, if c = 0 and K(.) is symmetric, then $E(f_n(0)) = f(0)/2 + o(1)$.
There are several methods to deal with the density estimation at boundary points. Possible approaches include the boundary kernel (see Gasser and Müller (1979) and Müller (1993)), reflection (see Schuster (1985) and Hall and Wehrly (1991)), transformation (see Wand, Marron and Ruppert (1991) and Marron and Ruppert (1994)), local polynomial fitting (see Hjort and Jones (1996a) and Loader (1996)), and others.
Boundary Kernel

One way of choosing a boundary kernel is
$$K^{(c)}(u) = \frac{12}{(1+c)^4}(1+u)\left[(1-2c)u + \frac{3c^2 - 2c + 1}{2}\right] I_{[-1,c]}(u).$$
Note that $K^{(1)}(t) = K(t)$, the Epanechnikov kernel defined above. Moreover, Zhang and Karunamuni (1998) have shown that this kernel is optimal in the sense of minimizing the MSE in the class of all kernels of order (0, 2) with exactly one change of sign in their support. The downside to the boundary kernel is that it is not necessarily non-negative, as will be seen on densities where f(0) = 0.
Reflection

The reflection method is to construct the kernel density estimate based on the synthetic data $\{\pm X_t;\ 1 \le t \le n\}$, where the reflected data are $\{-X_t;\ 1 \le t \le n\}$ and the original data are $\{X_t;\ 1 \le t \le n\}$. This results in the estimate
$$f_n(x) = \frac{1}{n}\left\{ \sum_{t=1}^{n} K_h(X_t - x) + \sum_{t=1}^{n} K_h(-X_t - x) \right\}, \qquad \mbox{for } x \ge 0.$$
Note that when x is away from the boundary, the second term in the above is practically negligible. Hence, it only corrects the estimate in the boundary region. This estimator is twice the kernel density estimate based on the synthetic data $\{\pm X_t;\ 1 \le t \le n\}$. See Schuster (1985) and Hall and Wehrly (1991).
Transformation

The transformation method is to first transform the data by $Y_i = g(X_i)$, where g(.) is a given monotone increasing function, ranging from $-\infty$ to $\infty$. Now apply the kernel density estimator to this transformed data set to obtain the estimate $f_n(y)$ for Y and apply the inverse transform to obtain the density of X. Therefore,
$$f_n(x) = g'(x)\,\frac{1}{n}\sum_{t=1}^{n} K_h(g(X_t) - g(x)).$$
The density at x = 0 corresponds to the tail density of the transformed data since $\log(0) = -\infty$, which cannot usually be estimated well due to the lack of data in the tails. Except at this point, the transformation method does a fairly good job. If g(.) is unknown, as in many situations, Karunamuni and Alberts (2003) suggested a parametric form and then estimated the parameter. Also, Karunamuni and Alberts (2003) considered other types of transformations.
Local Likelihood Fitting

The main idea is to consider the approximation $\log(f(X_t)) \approx P(X_t - x)$, where $P(u - x) = \sum_{j=0}^{p} a_j (u - x)^j$, together with the localized version of the log-likelihood
$$\sum_{t=1}^{n} \log(f(X_t))\, K_h(X_t - x) - n \int K_h(u - x)\, f(u)\,du.$$
With this approximation, the local likelihood becomes
$$L(a_0, \ldots, a_p) = \sum_{t=1}^{n} P(X_t - x)\, K_h(X_t - x) - n \int K_h(u - x) \exp(P(u - x))\,du.$$
Let $\widehat{a}_j$ be the maximizer of the above local likelihood $L(a_0, \ldots, a_p)$. Then, the local likelihood density estimate is
$$f_n(x) = \exp(\widehat{a}_0).$$
If the maximizer does not exist, then $f_n(x) = 0$. See Loader (1996) and Hjort and Jones (1996a) for more details. If R is used for the local fit for density estimation, please use the function density.lf() in the package locfit.
Exercise: Please conduct a Monte Carlo simulation to see what the boundary effects are and how the correction methods work. For example, you can consider some densities with a finite support, such as the beta distribution. A starting point is sketched below.
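As a starting point for this exercise, here is a minimal sketch of the reflection correction with a Gaussian kernel; the Exp(1) data, the fixed bandwidth h = 0.3 and the grid are illustrative choices only:

# Reflection correction near a left boundary at 0 (sketch)
set.seed(1)
x = rexp(500)                                # Exp(1) data, so f(0+) = 1
h = 0.3
grid = seq(0, 3, by = 0.05)
f_raw = sapply(grid, function(g) mean(dnorm((x - g)/h))/h)                       # no correction
f_ref = sapply(grid, function(g) mean(dnorm((x - g)/h) + dnorm((-x - g)/h))/h)   # reflection
matplot(grid, cbind(f_raw, f_ref, dexp(grid)), type = "l", lty = 1:3, col = 1:3,
        xlab = "x", ylab = "density")
legend("topright", c("no correction", "reflection", "true"), lty = 1:3, col = 1:3)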
3.2.4 Bandwidth Selection

Simple Bandwidth Selectors

The optimal bandwidth (3.4) is not directly usable since it depends on the unknown quantity $\|f''\|_2$. When f(x) is a Gaussian density with standard deviation $\sigma$, it is easy to see from (3.4) that
$$h_{opt} = (8\sqrt{\pi}/3)^{1/5}\, C_1(K)\, \sigma\, n^{-1/5},$$
which is called the normal reference bandwidth selector in the literature, obtained by replacing the unknown parameter $\sigma$ in the above equation by the sample standard deviation s. In particular, after calculating the constant $C_1(K)$ numerically, we have the following normal reference bandwidth selector
$$\widehat{h}_{opt} = \left\{\begin{array}{ll} 1.06\, s\, n^{-1/5} & \mbox{for the Gaussian kernel,} \\ 2.34\, s\, n^{-1/5} & \mbox{for the Epanechnikov kernel.} \end{array}\right.$$
Hjort and Jones (1996b) proposed an improved rule obtained by using an Edgeworth expansion for f(x) around the Gaussian density. Such a rule is given by
$$\widehat{h}_{opt} = h_{opt}\left\{ 1 + \frac{35}{48}\widehat{\gamma}_4 + \frac{35}{32}\widehat{\gamma}_3^2 + \frac{385}{1024}\widehat{\gamma}_4^2 \right\}^{-1/5},$$
where $\widehat{\gamma}_3$ and $\widehat{\gamma}_4$ are respectively the sample skewness and kurtosis. For details about the Edgeworth expansion, please see the book by Hall (1992).
Note that the normal reference bandwidth selector is only a simple rule of thumb. It is a good selector when the data are nearly Gaussian distributed, and is often reasonable in many applications. However, it can lead to over-smoothing when the underlying distribution is asymmetric or multi-modal. In that case, one can either subjectively tune the bandwidth, or select the bandwidth by more sophisticated bandwidth selectors. One can also transform the data first to make their distribution closer to normal, then estimate the density using the normal reference bandwidth selector and apply the inverse transform to obtain an estimated density for the original data. Such a method is called the transformation method. There are quite a few important techniques for selecting the bandwidth, such as cross-validation (CV) and plug-in bandwidth selectors. A conceptually simple technique, with theoretical justification and good empirical performance, is the plug-in technique. This technique relies on finding an estimate of the functional $\|f''\|_2^2$, which can be obtained by using a pilot bandwidth. An implementation of this approach is proposed by Sheather and Jones (1991), and an overview of the progress of bandwidth selection can be found in Jones, Marron and Sheather (1996).
Function dpik() in the package KernSmooth in R selects a bandwidth for kernel density estimation using the plug-in method.
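For example, the normal reference rule and the plug-in selector can be compared on the same data (a minimal sketch; the simulated sample is an illustrative choice):

# Two simple bandwidth choices for the same data
library(KernSmooth)
set.seed(1)
x = rnorm(400)
h_nr = 1.06 * sd(x) * length(x)^(-1/5)   # normal reference rule (Gaussian kernel)
h_pi = dpik(x)                           # plug-in selector from KernSmooth
c(h_nr, h_pi)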
Cross-Validation Method

The integrated squared error (ISE) of $f_n(x)$ is defined by
$$\mathrm{ISE}(h) = \int [f_n(x) - f(x)]^2 dx.$$
A commonly used measure of discrepancy between $f_n(x)$ and $f(x)$ is the mean integrated squared error (MISE), $\mathrm{MISE}(h) = E[\mathrm{ISE}(h)]$. It can be shown easily (or see Chiu, 1991) that $\mathrm{MISE}(h) \approx \mathrm{AMISE}(h)$. The optimal bandwidth minimizing the AMISE is given in (3.4). The least squares cross-validation (LSCV) method proposed by Rudemo (1982) and Bowman (1984) is a popular method to estimate the optimal bandwidth $h_{opt}$. Cross-validation is very useful for assessing the performance of an estimator via estimating its prediction error. The basic idea is to set one of the data points aside for validation of a model and use the remaining data to build the model. The main idea is to choose h to minimize ISE(h). Since
$$\mathrm{ISE}(h) = \int f_n^2(x)\,dx - 2\int f(x)\, f_n(x)\,dx + \int f^2(x)\,dx,$$
the question is how to estimate the second term on the right-hand side. Well, let us consider the simplest case when $\{X_t\}$ are iid. Re-express $f_n(x)$ as
$$f_n(x) = \frac{n-1}{n} f_n^{(s)}(x) + \frac{1}{n} K_h(X_s - x)$$
for any $1 \le s \le n$, where
$$f_n^{(s)}(x) = \frac{1}{n-1}\sum_{t \ne s} K_h(X_t - x),$$
which is the kernel density estimate without the sth observation, commonly called the jackknife estimate or leave-one-out estimate. It is easy to see that for any $1 \le s \le n$, $f_n(x) \approx f_n^{(s)}(x)$. Let $T_s = \{X_1, \ldots, X_{s-1}, X_{s+1}, \ldots, X_n\}$. Then,
$$E\left[ f_n^{(s)}(X_s) \mid T_s \right] = \int f_n^{(s)}(x)\, f(x)\,dx \approx \int f_n(x)\, f(x)\,dx,$$
which, by using the method of moments, can be estimated by $\frac{1}{n}\sum_{s=1}^{n} f_n^{(s)}(X_s)$. Therefore, the cross-validation criterion is
$$\mathrm{CV}(h) = \int f_n^2(x)\,dx - \frac{2}{n}\sum_{s=1}^{n} f_n^{(s)}(X_s) = \frac{1}{n^2}\sum_{s,t} K_h^{*}(X_s - X_t) - \frac{2}{n(n-1)}\sum_{t \ne s} K_h(X_s - X_t),$$
where $K_h^{*}(\cdot)$ is the convolution of $K_h(\cdot)$ and $K_h(\cdot)$, that is,
$$K_h^{*}(u) = \int K_h(v)\, K_h(u - v)\,dv.$$
Let $\widehat{h}_{cv}$ be the minimizer of CV(h). Then, it is called the optimal bandwidth based on cross-validation. Stone (1984) showed that $\widehat{h}_{cv}$ is a consistent estimate of the optimal bandwidth $h_{opt}$.

Function lscv() in the package locfit in R selects a bandwidth for kernel density estimation using the least squares cross-validation method.
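Alternatively, CV(h) can be computed directly. The sketch below assumes a Gaussian kernel, for which the convolution K_h * K_h is the N(0, 2h^2) density; the bandwidth grid and the simulated data are illustrative:

# Brute-force least squares cross-validation with a Gaussian kernel (sketch)
lscv_score = function(h, x){
  n = length(x)
  d = outer(x, x, "-")                               # all pairwise differences X_s - X_t
  term1 = sum(dnorm(d, sd = sqrt(2)*h))/n^2          # convolution term, including s = t
  term2 = 2*sum(dnorm(d[row(d) != col(d)], sd = h))/(n*(n - 1))
  term1 - term2
}
set.seed(1)
x = rnorm(300)
hs = seq(0.05, 1, by = 0.01)
h_cv = hs[which.min(sapply(hs, lscv_score, x = x))]  # minimizer of CV(h)
h_cv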
3.2.5 Project for Density Estimation

I. Do Monte Carlo simulations to compare the performances of the kernel density estimators under different settings and to draw your own conclusions based on your simulations. Please do the following:

1. Use the Rosenblatt-Parzen method by choosing different sample sizes (take several different sample sizes, say 250, 400, 600 and 1000), different kernels (say the normal and Epanechnikov kernels), different bandwidths, and different bandwidth selection methods such as cross-validation and plug-in as well as the normal reference. Any conclusions and comments?

2. Compare the Rosenblatt-Parzen method with the local density method as in Loader (1996) or Hjort and Jones (1996a). Any conclusions and comments?

3. Compare the various methods for boundary correction.

To assess the finite sample performance, for each setting, you need to compute the mean absolute deviation error (MADE) for $\widehat{f}(\cdot)$, defined as
$$\mathrm{MADE} = n_0^{-1}\sum_{k=1}^{n_0}\left| \widehat{f}(u_k) - f(u_k) \right|,$$
where $\widehat{f}(\cdot)$ is the nonparametric estimate of $f(\cdot)$ and $\{u_k\}$ are the grid points, taken to be arbitrary within the range of the data. Note that you can choose any distribution to generate your samples for your simulation. Also, note that the choice of the grid points is not important, so they can be chosen arbitrarily. In general, the number of replications can be taken to be $n_{sim} = 500$ or 1000. The question is how to report the simulation results. There are two ways of doing so. You can display the $n_{sim}$ values of MADE either in a boxplot form (boxplot() in R) or in a table by presenting the median and standard deviation of the $n_{sim}$ values of MADE. Either one is okay, but the boxplot is preferred by most people.

II. Consider three real data sets for the US Treasury bill (Secondary Market Rate): the daily 3-month Treasury bill from January 4, 1954 to May 2, 2007, in the data file DTB3.txt or DTB3.csv, the weekly 3-month Treasury bill from January 8, 1954 to April 27, 2007, in the data file WTB3MS.txt or WTB3MS.csv, and the monthly 3-month Treasury bill from January 1, 1934 to March 1, 2007, in the data file TB3MS.txt or TB3MS.csv.

1. Apply the Ljung-Box test [Box.test() in R] to see if the three series are autocorrelated or not. Also, you might look at the autocorrelation function (ACF) [acf() in R] or/and the partial autocorrelation function (PACF) [pacf() in R].

2. Apply the kernel density estimation to estimate the three density functions.

3. Any conclusions and comments on the three density functions?

Note that the real data sets can be downloaded from the web site of the Federal Reserve Bank of Saint Louis at http://research.stlouisfed.org/fred2/categories/46. You can use any statistical package to do your simulation. Try to use R since it is very simple. You need to hand in all necessary materials (tables or graphs) to support your conclusions. If you need any help, please come to see me.
3.2.6 Multivariate Density Estimation

As we discussed earlier, the kernel density or distribution estimation above is basically one-dimensional. For the multivariate case, the kernel density estimate is given by
$$f_n(x) = \frac{1}{n}\sum_{t=1}^{n} K_H(X_t - x), \qquad (3.6)$$
where $K_H(u) = K(H^{-1}u)/\det(H)$, $K(u)$ is a multivariate kernel function, and H is the bandwidth matrix such that for all $1 \le i, j \le p$, $n h_{ij} \to \infty$ and $h_{ij} \to 0$, where $h_{ij}$ is the (i, j)th element of H. The bandwidth matrix is introduced to capture the dependence structure in the independent variables. In particular, if H is a diagonal matrix and $K(u) = \prod_{j=1}^{p} K_j(u_j)$, where $K_j(\cdot)$ is a univariate kernel function, then $f_n(x)$ becomes
$$f_n(x) = \frac{1}{n}\sum_{t=1}^{n}\prod_{j=1}^{p} K_{h_j}(X_{jt} - x_j),$$
which is called the product kernel density estimator. This case is commonly used in practice. Similar to the univariate case, it is easy to derive the theoretical results for the multivariate case, which is left as an exercise. See Wand and Jones (1995) for details.
Curse of Dimensionality

For the product kernel estimate with $h_j = h$, we can show easily that
$$E(f_n(x)) = f(x) + \frac{h^2}{2}\mathrm{tr}(\mu_2(K)\, f''(x)) + o(h^2),$$
where $\mu_2(K) = \int u\, u^T K(u)\,du$, and
$$\mathrm{Var}(f_n(x)) = \frac{\nu_0(K)\, f(x)}{n h^p} + o((n h^p)^{-1}),$$
so that the AMSE is given by
$$\mathrm{AMSE} = \frac{\nu_0(K)\, f(x)}{n h^p} + \frac{h^4}{4}\, B(x),$$
where $B(x) = (\mathrm{tr}(\mu_2(K)\, f''(x)))^2$. By minimizing the AMSE, we obtain the optimal bandwidth
$$h_{opt} = \left\{\frac{p\, \nu_0(K)\, f(x)}{B(x)}\right\}^{1/(p+4)} n^{-1/(p+4)},$$
which leads to the optimal rate of convergence $O(n^{-4/(4+p)})$ for the MSE by trading off the rates between the bias and variance. When p is large, the so-called curse of dimensionality arises. To understand this problem quantitatively, let us look at the rate of convergence. To have a performance comparable with one-dimensional nonparametric regression with $n_1$ data points, for p-dimensional nonparametric regression we need the number of data points $n_p$ to satisfy
$$O(n_p^{-4/(4+p)}) = O(n_1^{-4/5}),$$
or $n_p = O(n_1^{(p+4)/5})$. Note that here we only emphasize the rate of convergence for the MSE by ignoring the constant part. Table 3.1 shows the result with $n_1 = 100$. The increase of the required sample sizes is exponentially fast.
Table 3.1: Sample sizes required for p-dimensional nonparametric regression to have comparable performance with that of 1-dimensional nonparametric regression using size 100

dimension     2    3    4      5      6       7       8       9        10
sample size   252  631  1,585  3,982  10,000  25,119  63,096  158,490  398,108
Exercise: Please derive the asymptotic results given in (3.6) for the general multivariate case.

In R, the built-in function density() is only for the univariate case. For multivariate situations, there are two packages, ks and KernSmooth. Function kde() in ks can compute the multivariate density estimate for 2- to 6-dimensional data and function bkde2D() in KernSmooth computes the 2D kernel density estimate. Also, ks provides some functions for bandwidth matrix selection, such as Hbcv() and Hscv() for the 2D case as well as Hlscv() and Hpi().
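A small 2D illustration (a sketch; the simulated data and the diagonal bandwidths are illustrative choices):

# 2D kernel density estimate with bkde2D()
library(KernSmooth)
set.seed(1)
xy = cbind(rnorm(500), rnorm(500))
est = bkde2D(xy, bandwidth = c(0.4, 0.4))
contour(est$x1, est$x2, est$fhat)           # contour plot of the estimated density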
3.2.7 Reading Materials

Applications in Finance: Please read the papers by Aït-Sahalia and Lo (1998, 2000), Pritsker (1998) and Hong and Li (2005) on how to apply the kernel density estimation to the nonparametric estimation of the state-price densities (SPD) or risk neutral densities (RND) and to nonparametric risk estimation based on the state-price density. Please download the data from http://finance.yahoo.com/ (say, the S&P 500 index) to estimate the SPD.
3.3 Distribution Estimation

3.3.1 Smoothed Distribution Estimation

The question is how to obtain a smoothed estimate of the CDF F(x). Well, one way of doing so is to integrate the estimated PDF $f_n(x)$, which gives
$$\widetilde{F}_n(x) = \int_{-\infty}^{x} f_n(u)\,du = \frac{1}{n}\sum_{i=1}^{n} \mathcal{K}\left(\frac{x - X_i}{h}\right),$$
where $\mathcal{K}(x) = \int_{-\infty}^{x} K(u)\,du$ is the distribution function of $K(\cdot)$. Why do we need this smoothed estimate of the CDF? To answer this question, we need to consider the mean squared error (MSE).
First, we derive the asymptotic bias. By integration by parts, we have
$$E\left[\widetilde{F}_n(x)\right] = E\left[\mathcal{K}\left(\frac{x - X_i}{h}\right)\right] = \int F(x - hu)\, K(u)\,du = F(x) + \frac{h^2}{2}\mu_2(K)\, f'(x) + o(h^2).$$
Next, we derive the asymptotic variance:
$$E\left[\mathcal{K}^2\left(\frac{x - X_i}{h}\right)\right] = \int F(x - hu)\, b(u)\,du = F(x) - h\,\psi\, f(x) + o(h),$$
where $b(u) = 2 K(u)\mathcal{K}(u)$ and $\psi = \int u\, b(u)\,du$. Then,
$$\mathrm{Var}\left(\mathcal{K}\left(\frac{x - X_i}{h}\right)\right) = F(x)[1 - F(x)] - h\,\psi\, f(x) + o(h).$$
Define $I_j(x) = \mathrm{Cov}(I(X_1 \le x), I(X_{j+1} \le x)) = F_j(x, x) - F^2(x)$ and
$$I_{nj}(x) = \mathrm{Cov}\left(\mathcal{K}\left(\frac{x - X_1}{h}\right), \mathcal{K}\left(\frac{x - X_{j+1}}{h}\right)\right).$$
By means of Lemma 2 in Lehmann (1966), the covariance $I_{nj}(x)$ may be written as follows:
$$I_{nj}(x) = \int\!\!\int\left[ P\left(\mathcal{K}\left(\tfrac{x - X_1}{h}\right) > u,\ \mathcal{K}\left(\tfrac{x - X_{j+1}}{h}\right) > v\right) - P\left(\mathcal{K}\left(\tfrac{x - X_1}{h}\right) > u\right) P\left(\mathcal{K}\left(\tfrac{x - X_{j+1}}{h}\right) > v\right)\right] du\,dv.$$
Inverting the CDF $\mathcal{K}(\cdot)$ and making two changes of variables, the above relation becomes
$$I_{nj}(x) = \int\!\!\int\left[ F_j(x - hu, x - hv) - F(x - hu)\, F(x - hv)\right] K(u)\, K(v)\,du\,dv.$$
Expanding the right-hand side of the above equation according to Taylor's formula, we obtain
$$|I_{nj}(x) - I_j(x)| \le C\, h^2.$$
By Davydov's inequality (see Lemma 3.1), we have
$$|I_{nj}(x) - I_j(x)| \le C\, \alpha(j),$$
so that for any $1/2 < \delta < 1$,
$$|I_{nj}(x) - I_j(x)| \le C\, h^{2\delta}\, \alpha^{1-\delta}(j).$$
Therefore,
$$\frac{1}{n}\sum_{j=1}^{n-1}(n - j)\,|I_{nj}(x) - I_j(x)| \le \sum_{j=1}^{n-1}|I_{nj}(x) - I_j(x)| \le C\, h^{2\delta}\sum_{j=1}^{\infty}\alpha^{1-\delta}(j) = O(h^{2\delta}),$$
provided that $\sum_{j=1}^{\infty}\alpha^{1-\delta}(j) < \infty$ for some $1/2 < \delta < 1$. Indeed, this assumption is satisfied if $\alpha(n) = O(n^{-\beta})$ for some $\beta > 2$. By stationarity, it is clear that
$$n\,\mathrm{Var}\left(\widetilde{F}_n(x)\right) = \mathrm{Var}\left(\mathcal{K}\left(\frac{x - X_1}{h}\right)\right) + \frac{2}{n}\sum_{j=1}^{n-1}(n - j)\, I_{nj}(x).$$
Therefore,
$$n\,\mathrm{Var}\left(\widetilde{F}_n(x)\right) = F(x)[1 - F(x)] - h\,\psi\, f(x) + o(h) + 2\sum_{j=1}^{\infty} I_j(x) + O(h^{2\delta}) = \sigma_F^2(x) - h\,\psi\, f(x) + o(h).$$
We can establish the following asymptotic normality for $\widetilde{F}_n(x)$, but the proof will be discussed later.

Theorem 3.2: Under regularity conditions, we have
$$\sqrt{n}\left\{\widetilde{F}_n(x) - F(x) - \frac{h^2}{2}\mu_2(K)\, f'(x) + o_p(h^2)\right\} \to N\left(0, \sigma_F^2(x)\right).$$
Similarly, we have
$$n\,\mathrm{AMSE}\left(\widetilde{F}_n(x)\right) = \frac{n h^4}{4}\mu_2^2(K)\,[f'(x)]^2 + \sigma_F^2(x) - h\,\psi\, f(x).$$
If $\psi > 0$, minimizing the AMSE gives
$$h_{opt} = \left\{\frac{\psi\, f(x)}{\mu_2^2(K)\,[f'(x)]^2}\right\}^{1/3} n^{-1/3},$$
and with this asymptotically optimal bandwidth, the optimal AMSE is given by
$$n\,\mathrm{AMSE}_{opt}\left(\widetilde{F}_n(x)\right) = \sigma_F^2(x) - \frac{3}{4}\left\{\frac{\psi^2 f^2(x)}{\mu_2(K)\, f'(x)}\right\}^{2/3} n^{-1/3}.$$
Remark: From the aforementioned equation, we can see that if $\psi > 0$, the AMSE of $\widetilde{F}_n(x)$ can be smaller than that for $F_n(x)$ in the second order. Also, it is easy to see that if $K(\cdot)$ is the Epanechnikov kernel, $\psi > 0$.
3.3.2 Relative Efficiency and Deficiency

To measure the relative efficiency and deficiency of $\widetilde{F}_n(x)$ over $F_n(x)$, we define
$$i(n) = \min\left\{ k \in \{1, 2, \ldots\};\ \mathrm{MSE}(F_k(x)) \le \mathrm{MSE}\left(\widetilde{F}_n(x)\right)\right\}.$$
We have the following results, whose detailed proofs can be found in Cai and Roussas (1998).

Proposition 2: (i) Under regularity conditions,
$$\frac{i(n)}{n} \to 1, \quad \mbox{if and only if} \quad n h_n^4 \to 0.$$
(ii) Under regularity conditions,
$$\frac{i(n) - n}{n h_n} \to \psi\,\theta(x), \quad \mbox{if and only if} \quad n h_n^3 \to 0,$$
where $\theta(x) = f(x)/\sigma_F^2(x)$.

Remark: It is clear that the quantity $\psi\,\theta(x)$ may be looked upon as a way of measuring the performance of the estimate $\widetilde{F}_n(x)$. Suppose that the kernel $K(\cdot)$ is chosen so that $\psi > 0$, which is equivalent to $\psi\,\theta(x) > 0$. Then, for sufficiently large n, $i(n) > n + n h_n (\psi\,\theta(x) - \varepsilon)$. Thus, i(n) is substantially larger than n and, indeed, $i(n) - n$ tends to $\infty$. Actually, Reiss (1981) and Falk (1983) posed the question of determining the exact value of the superiority of $\psi$ over a certain class of kernels. More specifically, let $\mathcal{K}_m$ be the class of kernels $K$ on $[-1, 1]$ which are absolutely continuous and satisfy the requirements $K(-1) = 0$, $K(1) = 1$, and $\int_{-1}^{1} u^{\lambda} K(u)\,du = 0$, $\lambda = 1, \ldots, m$, for some $m = 0, 1, \ldots$ (where the moment condition is vacuous for m = 0). Set $\psi_m = \sup\{\psi;\ K \in \mathcal{K}_m\}$. Then, Mammitzsch (1984) answered the question posed by determining $\psi_m$ in an elegant manner. See Cai and Roussas (1998) for more details and simulation results.
Exercise: Please conduct a Monte Carlo simulation to see what the differences are between the smoothed and non-smoothed distribution estimators.
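A minimal sketch for this exercise, with a Gaussian kernel so that the integrated kernel is pnorm(); the sample, bandwidth and grid are illustrative choices:

# Empirical CDF versus smoothed CDF (sketch)
set.seed(1)
x = rnorm(200)
h = 0.4
grid = seq(-3, 3, by = 0.05)
F_emp = ecdf(x)(grid)                                      # step-function estimate
F_sm  = sapply(grid, function(g) mean(pnorm((g - x)/h)))   # smoothed estimate
matplot(grid, cbind(F_emp, F_sm, pnorm(grid)), type = "l", lty = 1:3, col = 1:3,
        xlab = "x", ylab = "CDF")
legend("topleft", c("empirical", "smoothed", "true"), lty = 1:3, col = 1:3)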
3.4 Quantile Estimation

Let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics of $\{X_t\}_{t=1}^{n}$. Define the inverse of F(x) as $F^{-1}(p) = \inf\{x \in \Re;\ F(x) \ge p\}$, where $\Re$ is the real line. The traditional estimate of F(x) has been the empirical distribution function $F_n(x)$ based on $X_1, \ldots, X_n$, while the estimate of the p-th quantile $\xi_p = F^{-1}(p)$, $0 < p < 1$, is the sample quantile function $\xi_{pn} = F_n^{-1}(p) = X_{([np])}$, where [x] denotes the integer part of x. It is a consistent estimator of $\xi_p$ for $\alpha$-mixing data (Yoshihara, 1995). However, as stated in Falk (1983), $F_n(x)$ does not take into account the smoothness of F(x), i.e., the existence of a probability density function f(x). In order to incorporate this characteristic, investigators proposed several smoothed quantile estimates, one of which is based on $\widetilde{F}_n(x)$, obtained as a convolution between $F_n(x)$ and a properly scaled kernel function; see the previous section. Finally, note that R has a command quantile() which can be used for computing $\xi_{pn}$, the nonparametric estimate of the quantile.
3.4.1 Value at Risk

Value at Risk (VaR) is a popular measure of market risk associated with an asset or a portfolio of assets. It has been chosen by the Basel Committee on Banking Supervision as a benchmark risk measure and has been used by financial institutions for asset management and minimization of risk. Let $\{X_t\}_{t=1}^{n}$ be the market value of an asset over n periods of a time unit, and let $Y_t = -\log(X_t/X_{t-1})$ be the negative log-returns (loss). Suppose $\{Y_t\}_{t=1}^{n}$ is a strictly stationary dependent process with marginal distribution function F(y). Given a positive value p close to zero, the $1 - p$ level VaR is
$$\nu_p = \inf\{u : F(u) \ge 1 - p\} = F^{-1}(1 - p),$$
which specifies the smallest amount of loss such that the probability of the loss in market value being larger than $\nu_p$ is less than p. Comprehensive discussions on VaR are available in Duffie and Pan (1997) and Jorion (2001), and references therein. Therefore, VaR can be regarded as a special case of a quantile. R has a built-in package called VaR for a set of methods for the calculation of VaR, particularly for some parametric models such as the generalized Pareto distribution (GPD). But the restrictive parametric specifications might be misspecified.

A general form for the generalized Pareto distribution with shape parameter $k \ne 0$, scale parameter $\sigma$, and threshold parameter $\theta$, is
$$f(x) = \frac{1}{\sigma}\left(1 + k\,\frac{x - \theta}{\sigma}\right)^{-1/k - 1}, \quad \mbox{and} \quad F(x) = 1 - \left(1 + k\,\frac{x - \theta}{\sigma}\right)^{-1/k}$$
for $\theta < x$, when $k > 0$. In the limit for $k = 0$, the density is $f(x) = \frac{1}{\sigma}\exp(-(x - \theta)/\sigma)$ for $\theta < x$. If $k = 0$ and $\theta = 0$, the generalized Pareto distribution is equivalent to the exponential distribution. If $k > 0$ and $\theta = \sigma/k$, the generalized Pareto distribution is equivalent to the Pareto distribution.

Another popular risk measure is the expected shortfall (ES), which is the expected loss given that the loss is at least as large as some given quantile of the loss distribution (e.g., the VaR), defined as
$$\mu_p = E(Y_t \mid Y_t > \nu_p) = \int_{\nu_p}^{\infty} y\, f(y)\,dy / p.$$
It is well known from Artzner, Delbaen, Eber and Heath (1999) that ES is a coherent risk measure in that it satisfies the four axioms: homogeneity (increasing the size of a portfolio by a factor should scale its risk measure by the same factor), monotonicity (a portfolio must have greater risk if it has systematically lower values than another), risk-free condition or translation invariance (adding some amount of cash to a portfolio should reduce its risk by the same amount), and subadditivity (the risk of a portfolio must be less than the sum of the separate risks, or merging portfolios cannot increase risk). VaR satisfies homogeneity, monotonicity, and the risk-free condition but is not subadditive. See Artzner et al. (1999) for details.
3.4.2 Nonparametric Quantile Estimation

The smoothed sample quantile estimate of $\xi_p$, $\widetilde{\xi}_p$, based on $\widetilde{F}_n(x)$, is defined by
$$\widetilde{\xi}_p = \widetilde{F}_n^{-1}(1 - p) = \inf\left\{x \in \Re;\ \widetilde{F}_n(x) \ge 1 - p\right\}.$$
$\widetilde{\xi}_p$ is referred to in the literature as the perturbed (smoothed) sample quantile. Asymptotic properties of $\widetilde{\xi}_p$, both under independence and under certain modes of dependence, have been investigated extensively in the literature; see Cai and Roussas (1997) and Chen and Tang (2005).

By the differentiability of $\widetilde{F}_n(x)$, we use the Taylor expansion and ignore the higher-order terms to obtain
$$\widetilde{F}_n(\widetilde{\xi}_p) = 1 - p \approx \widetilde{F}_n(\xi_p) - f_n(\xi_p)(\xi_p - \widetilde{\xi}_p), \qquad (3.7)$$
so that
$$\widetilde{\xi}_p - \xi_p \approx -[\widetilde{F}_n(\xi_p) - (1 - p)]/f_n(\xi_p) \approx -[\widetilde{F}_n(\xi_p) - (1 - p)]/f(\xi_p)$$
since $f_n(x)$ is a consistent estimator of f(x). As an application of Theorem 3.2, we can establish the following theorem for the asymptotic normality of $\widetilde{\xi}_p$, but the proof is omitted since it is similar to that for Theorem 3.2.
Theorem 3.3: Under regularity conditions, we have
$$\sqrt{n}\left\{\widetilde{\xi}_p - \xi_p + \frac{h^2}{2}\mu_2(K)\, f'(\xi_p)/f(\xi_p) + o_p(h^2)\right\} \to N\left(0, \sigma_F^2(\xi_p)/f^2(\xi_p)\right).$$
Next, let us examine the AMSE. To this end, we can derive the asymptotic bias and variance. From the previous section, we have
$$E\left[\widetilde{\xi}_p\right] = \xi_p - \frac{h^2}{2}\mu_2(K)\, f'(\xi_p)/f(\xi_p) + o(h^2),$$
and
$$n\,\mathrm{Var}\left(\widetilde{\xi}_p\right) = \sigma_F^2(\xi_p)/f^2(\xi_p) - h\,\psi/f(\xi_p) + o(h).$$
Therefore, the AMSE is
$$n\,\mathrm{AMSE}(\widetilde{\xi}_p) = \frac{n h^4}{4}\mu_2^2(K)\left[f'(\xi_p)/f(\xi_p)\right]^2 + \sigma_F^2(\xi_p)/f^2(\xi_p) - h\,\psi/f(\xi_p).$$
If $\psi > 0$, minimizing the AMSE gives
$$h_{opt} = \left\{\frac{\psi\, f(\xi_p)}{\mu_2^2(K)\,[f'(\xi_p)]^2}\right\}^{1/3} n^{-1/3},$$
and with this asymptotically optimal bandwidth, the optimal AMSE is given by
$$n\,\mathrm{AMSE}_{opt}(\widetilde{\xi}_p) = \sigma_F^2(\xi_p)/f^2(\xi_p) - \frac{3}{4}\left\{\frac{\psi^2}{\mu_2(K)\, f'(\xi_p)\, f(\xi_p)}\right\}^{2/3} n^{-1/3},$$
which indicates a reduction of the AMSE in the second order. Chen and Tang (2005) conducted an intensive simulation study to demonstrate the advantages of the nonparametric estimate $\widetilde{\xi}_p$ over the sample quantile $\xi_{pn}$ in the VaR setting. We refer to the paper by Chen and Tang (2005) for simulation results and empirical examples.
Exercise: Please use the above procedures to estimate the ES nonparametrically and discuss its properties, as well as conduct simulation studies and empirical applications.
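A rough sketch of how the smoothed CDF can be inverted numerically for the VaR, with a simple plug-in average for the ES; the simulated losses, the level p and the bandwidth h are illustrative choices only:

# Nonparametric VaR and ES from the smoothed CDF (sketch)
set.seed(1)
y = 0.01 * rt(1000, df = 5)                      # simulated losses (negative log-returns)
p = 0.05
h = 0.005
F_sm = function(u) mean(pnorm((u - y)/h))        # smoothed CDF at u (Gaussian kernel)
var_np = uniroot(function(u) F_sm(u) - (1 - p), interval = range(y))$root
es_np  = mean(y[y > var_np])                     # plug-in expected shortfall
c(var_np, es_np)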
3.5 Computer Code
# April 10, 2007
graphics.off() # clean the previous graphs on the screen
###############
# Example 3.1
##############
#########################################################
# Define the Epanechnikov kernel function
kernel<-function(x){0.75*(1-x^2)*(abs(x)<=1)}
###############################################################
# Define the kernel density estimator
kernden=function(x,z,h,ker){
# parameters: x=variable; h=bandwidth; z=grid point; ker=kernel
nz<-length(z)
nx<-length(x)
x0=rep(1,nx*nz)
dim(x0)=c(nx,nz)
x1=t(x0)
x0=x*x0
x1=z*x1
x0=x0-t(x1)
if(ker==1){x1=kernel(x0/h)} # Epanechnikov kernel
if(ker==0){x1=dnorm(x0/h)} # normal kernel
f1=apply(x1,2,mean)/h
return(f1)
}
####################################################################
############################################################################
# Simulation for different bandiwidths and different kernels
n=300 # n=300
ker=1 # ker=1 => Epan; ker=0 => Gaussian
h0=c(0.25,0.5,1) # set initial bandwidths
z=seq(-4,4,by=0.1) # grid points
nz=length(z) # number of grid points
x=rnorm(n) # simulate x ~ N(0, 1)
if(ker==1){h_o=2.34*n^{-0.2}} # bandwidth for Epanechnikov kernel
if(ker==0){h_o=1.06*n^{-0.2}} # bandwidth for normal kernel
f1=kernden(x,z,h0[1],ker)
f2=kernden(x,z,h0[2],ker)
f3=kernden(x,z,h0[3],ker)
f4=kernden(x,z,h_o,ker)
text1=c("True","h=0.25","h=0.5","h=1","h=h_o")
data=cbind(dnorm(z),f1,f2,f3,f4) # combine them as a matrix
win.graph()
matplot(z,data,type="l",lty=1:5,col=1:5,xlab="",ylab="")
legend(-1,0.2,text1,lty=1:5,col=1:5)
##################################################################
##################
# Example 3.2
##################
z1=read.table("c:/res-teach/xiada/teaching05-07/data/ex3-2.txt")
# data: weekly 3-month Treasury bill from 1970 to 1997
x=z1[,4]/100 # decimal
n=length(x)
y=diff(x) # Delta x_t=x_t-x_{t-1}=change rate
x=x[1:(n-1)]
n=n-1
x_star=(x-mean(x))/sqrt(var(x)) # standardized
den_3mtb=density(x_star,bw=0.30,kernel=c("epanechnikov"),
from=-3,to=3,n=61)
den_est=den_3mtb$y # estimated density values
z_star=seq(-3,3,by=0.1)
text1=c("Estimated Density","Standard Norm")
win.graph()
par(bg="light green")
plot(den_3mtb,main="Density of 3mtb (Built-in)",ylab="",xlab="",
col.main="red")
points(z_star,dnorm(z_star),type="l",lty=2,col=2,ylab="",xlab="")
legend(0,0.45,text1,lty=c(1,2),col=c(1,2),cex=0.7)
h_den=0.5
f_hat=kernden(x_star,z_star,h_den,1)
ff=cbind(f_hat,dnorm(z_star))
win.graph()
par(bg="light blue")
matplot(z_star,ff,type="l",lty=c(1,2),col=c(1,2),ylab="",xlab="")
title(main="Density of 3mtb",col.main="red")
legend(0,0.55,text1,lty=c(1,2),col=c(1,2),cex=0.7)
#################################################################
3.6 References
Aït-Sahalia, Y. and A.W. Lo (1998). Nonparametric estimation of state-price densities implicit in financial asset prices. Journal of Finance, 53, 499-547.

Aït-Sahalia, Y. and A.W. Lo (2000). Nonparametric risk management and implied risk aversion. Journal of Econometrics, 94, 9-51.
Andrews, D.W.K. (1991). Heteroskedasticity and autocorrelation consistent covariance
matrix estimation. Econometrica, 59, 817-858.
Artzner, P., F. Delbaen, J.M. Eber, and D. Heath (1999). Coherent measures of risk.
Mathematical Finance, 9, 203-228.
Bowman, A. (1984). An alternative method of cross-validation for the smoothing of density
estimate. Biometrika, 71, 353-360.
Cai, Z. (2002). Regression quantile for time series. Econometric Theory, 18, 169-192.
Cai, Z. and G.G. Roussas (1997). Smooth estimate of quantiles under association. Statistics
and Probability Letters, 36, 275-287.
Cai, Z. and G.G. Roussas (1998). Efficient estimation of a distribution function under quadrant dependence. Scandinavian Journal of Statistics, 25, 211-224.
Carrasco, M. and X. Chen (2002). Mixing and moments properties of various GARCH and
stochastic volatility models. Econometric Theory, 18, 17-39.
Chen, S.X. and C.Y. Tang (2005). Nonparametric inference of value at risk for dependent financial returns. Journal of Financial Econometrics, 3, 227-255.
Chiu, S.T. (1991). Bandwidth selection for kernel density estimation. The Annals of
Statistics, 19, 1883-1905.
Duffie, D. and J. Pan (1997). An overview of value at risk. Journal of Derivatives, 4, 7-49.
Fan, J. and Q. Yao (2003). Nonlinear Time Series: Nonparametric and Parametric Meth-
ods. Springer-Verlag, New York.
Gasser, T. and H.-G. Müller (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, 757, 23-68. Springer-Verlag, New York.

Falk, M. (1983). Relative efficiency and deficiency of kernel type estimators of smooth distribution functions. Statistica Neerlandica, 37, 73-83.

Genon-Catalot, V., T. Jeantheau and C. Laredo (2000). Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli, 6, 1051-1079.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.
Hall, P. and C.C. Heyde (1980). Martingale Limit Theory and its Applications. Academic
Press, New York.
Hall, P. and T.E. Wehrly (1991). A geometrical method for removing edge effects from kernel-type nonparametric regression estimators. Journal of the American Statistical Association, 86, 665-672.
Hjort, N.L. and M.C. Jones (1996a). Locally parametric nonparametric density estimation.
The Annals of Statistics, 24,1619-1647.
Hjort, N.L. and M.C. Jones (1996b). Better rules of thumb for choosing bandwidth in
density estimation. Working paper, Department of Mathematics, University of Oslo,
Norway.
Hong, Y. and H. Li (2005). Nonparametric specification testing for continuous-time models with applications to interest rate term structures. Review of Financial Studies, 18, 37-84.

Jones, M.C., J.S. Marron and S.J. Sheather (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91, 401-407.
Jorion, P. (2001). Value at Risk, 2nd Edition. New York: McGraw-Hill.
Karunamuni, R.J. and T. Alberts (2003). On boundary correction in kernel density estimation. Working paper, Department of Mathematical and Statistical Sciences, University of Alberta, Canada.
Lehmann, E. (1966). Some concepts of dependence. Annals of Mathematical Statistics, 37,
1137-1153.
Loader, C. R. (1996). Local likelihood density estimation. The Annals of Statistics, 24,
1602-1618.
Mammitzsch, V. (1984). On the asymptotically optimal solution within a certain class of kernel type estimators. Statistics & Decisions, 2, 247-255.
Marron, J.S. and D. Ruppert (1994). Transformations to reduce boundary bias in kernel
density estimation. Journal of the Royal Statistical Society Series B, 56, 653-671.
McLeish, D.L. (1975). A maximal inequality and dependent strong laws. The Annals of
Probability, 3, 829-839.
Müller, H.-G. (1993). On the boundary kernel method for nonparametric curve estimation near endpoints. Scandinavian Journal of Statistics, 20, 313-328.

Newey, W.K. and K.D. West (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55, 703-708.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065-1076.
Pritsker, M. (1998). Nonparametric density estimation and tests of continuous time interest
rate models. Review of Financial Studies, 11, 449-487.
Reiss, R.D. (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics, 8, 116-119.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function.
Annals of Mathematical Statistics, 27, 832-837.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9, 65-78.

Schuster, E.F. (1985). Incorporating support constraints into nonparametric estimates of densities. Communications in Statistics - Theory and Methods, 14, 1123-1126.

Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Sheather, S.J. and M.C. Jones (1991). A reliable data-based bandwidth selection method
for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53,
683-690.
Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel density
estimates. The Annals of Statistics, 12, 1285-1297.
Wand, M.P. and M.C. Jones (1995). Kernel Smoothing. London: Chapman and Hall.
Wand, M.P., J.S. Marron and D. Ruppert (1991). Transformations in density estimation
(with discussion). Journal of the American Statistical Association, 86, 343-361.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817-838.
Yoshihara, K. (1995). The Bahadur representation of sample quantiles for sequences of
strongly mixing random variables. Statistics and Probability Letters, 24, 299-304.
Zhang, S. and R.J. Karunamuni (1998). On kernel density estimation near endpoints. Journal of Statistical Planning and Inference, 70, 301-316.
Chapter 4

Nonparametric Regression Models

4.1 Prediction and Regression Functions

Suppose that we have the information set $I_t$ at time t and we want to forecast the future value, say $Y_{t+1}$ (one-step ahead forecast, or $Y_{t+s}$, s-step ahead). There are several forecasting criteria available in the literature. The general form is
$$m(I_t) = \arg\min_a E[\rho(Y_{t+1} - a) \mid I_t],$$
where $\rho(\cdot)$ is an objective (loss) function. Here are three major directions.

(1) If $\rho(z) = z^2$ is the quadratic function, then $m(I_t) = E(Y_{t+1} \mid I_t)$, called the mean regression function. Implicitly, it requires that the distribution of $Y_t$ should be symmetric. If the distribution of $Y_t$ is skewed, then this is not a good criterion.

(2) If $\rho_{\tau}(y) = y(\tau - I_{\{y<0\}})$, called the check function, where $\tau \in (0, 1)$ and $I_A$ is the indicator function of any set A, then $m(I_t)$ satisfies
$$\int_{-\infty}^{m(I_t)} f(y \mid I_t)\,dy = F(m(I_t) \mid I_t) = \tau,$$
where $f(y \mid I_t)$ and $F(m(I_t) \mid I_t)$ are the conditional PDF and CDF of $Y_{t+1}$ given $I_t$, respectively. This $m(I_t)$ becomes the conditional quantile or quantile regression, denoted by $q_{\tau}(I_t)$, proposed by Koenker and Bassett (1978, 1982). In particular, if $\tau = 1/2$, then $m(I_t)$ is the well-known least absolute deviation (LAD) regression, which is robust. If $q_{\tau}(I_t)$ is a linear function of regressors like $\beta^T X_t$ as in Koenker and Bassett (1978, 1982), Koenker (2005) developed the R package quantreg to make statistical inferences on the linear quantile regression model.
To fit a linear quantile regression in R, one can use the command rq() in the package quantreg. For a nonlinear parametric model, the command is nlrq(). For a nonparametric quantile model in the univariate case, one can use the command lprq() for implementing the local polynomial estimation. For an additive quantile regression, one can use the commands rqss() and qss().
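For instance, a minimal call (the Engel food-expenditure data shipped with quantreg are used purely for illustration):

# Median (LAD) regression with quantreg (sketch)
library(quantreg)
data(engel)
fit = rq(foodexp ~ income, tau = 0.5, data = engel)
summary(fit)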
(3) If $\rho(x) = \frac{1}{2}x^2\, I_{\{|x| \le M\}} + M(|x| - M/2)\, I_{\{|x| > M\}}$, the so-called Huber function in the literature, then it is the Huber robust regression. We will not discuss this topic. If you have an interest, please read the book by Rousseeuw and Leroy (1987). In R, the library MASS has the function rlm() for robust linear models. Also, the library lqs contains functions for bounded-influence regression.

Note that for the second and third cases, the regression functions usually do not have a closed-form expression. Since the information set $I_t$ contains too many variables (high dimension), it is common to approximate $I_t$ by some finite number of variables, say $X_t = (X_{t1}, \ldots, X_{tp})^T$ ($p \ge 1$), including the lagged variables and exogenous variables. First, our focus is on the mean regression $m(X_t)$. Of course, by the same token, we can consider the nonparametric estimation of the conditional variance $\sigma^2(x) = \mathrm{Var}(Y_t \mid X_t = x)$. Why do we need to consider nonlinear (nonparametric) models in economic practice? To find the answer, please read the book by Granger and Teräsvirta (1993).
4.2 Kernel Estimation

How can we estimate m(x) nonparametrically? Let us look at the Nadaraya-Watson estimate of the mean regression m(x). The main idea is as follows:
$$m(x) = \int y\, f(y \mid x)\,dy = \frac{\int y\, f(x, y)\,dy}{\int f(x, y)\,dy},$$
where f(x, y) is the joint PDF of $X_t$ and $Y_t$. To estimate m(x), we can apply the plug-in method. That is, plug the nonparametric kernel density estimate $f_n(x, y)$ (product kernel method) into the right-hand side of the above equation to obtain
$$\widehat{m}_{nw}(x) = \frac{\int y\, f_n(x, y)\,dy}{\int f_n(x, y)\,dy} = \cdots = \frac{1}{n}\sum_{t=1}^{n} Y_t\, K_h(X_t - x)\big/f_n(x) = \sum_{t=1}^{n} W_t\, Y_t,$$
where $f_n(x)$ is the kernel density estimate of f(x), defined in Chapter 3, and
$$W_t = K_h(X_t - x)\Big/\sum_{t=1}^{n} K_h(X_t - x).$$
$\widehat{m}_{nw}(x)$ is the well-known Nadaraya-Watson (NW) estimator, proposed by Nadaraya (1964) and Watson (1964). Note that the weights $W_t$ do not depend on $Y_t$. Therefore, $\widehat{m}_{nw}(x)$ is called a linear estimator, similar to the least squares estimate (LSE).
Let us look at the NW estimator from a different angle. $\widehat{m}_{nw}(x)$ can be re-expressed as the minimizer of a locally weighted least squares criterion; that is,
$$\widehat{m}_{nw}(x) = \arg\min_a \sum_{t=1}^{n}(Y_t - a)^2\, K_h(X_t - x).$$
This means that when $X_t$ is in a neighborhood of x, $m(X_t)$ is approximated by a constant a (local approximation). Indeed, we consider the following working model
$$Y_t = m(X_t) + \varepsilon_t \approx a + \varepsilon_t$$
with the weights $K_h(X_t - x)$, where $\varepsilon_t = Y_t - E(Y_t \mid X_t)$. Therefore, the Nadaraya-Watson estimator is also called the local constant estimator.

In the implementation, for each x, we can fit the following transformed linear model
$$Y_t^{*} = \beta_1 X_t^{*} + \varepsilon_t,$$
where $Y_t^{*} = \sqrt{K_h(X_t - x)}\, Y_t$ and $X_t^{*} = \sqrt{K_h(X_t - x)}$. In R, we can use the functions lm() or glm() with weights $K_h(X_t - x)$ to fit a weighted least squares or generalized linear model. Or, you can use the weighted least squares theory (matrix multiplication); see Section 4.7.
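A minimal sketch of this weighted lm() implementation at a single point x0; the simulated regression model, the Gaussian kernel and the bandwidth h = 0.1 are illustrative choices:

# Local constant (NW) estimator via weighted least squares (sketch)
nw_at = function(x0, x, y, h){
  w = dnorm((x - x0)/h)/h                   # kernel weights K_h(X_t - x0)
  coef(lm(y ~ 1, weights = w))[1]           # weighted local constant fit
}
set.seed(1)
x = runif(300); y = sin(2*pi*x) + rnorm(300, sd = 0.3)
grid = seq(0, 1, by = 0.02)
fit = sapply(grid, nw_at, x = x, y = y, h = 0.1)
plot(x, y); lines(grid, fit, col = 2, lwd = 2)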
4.2.1 Asymptotic Properties

We derive the asymptotic properties of the nonparametric estimator for the time series situation. Note that the mathematical derivations are different for the iid case and time series situations since $E[Y_t \mid X_1, \ldots, X_n] \ne E[Y_t \mid X_t] = m(X_t)$, the equality being true only in the iid case. To ease notation, we consider only the simple case when p = 1. Write
$$\widehat{m}_{nw}(x) = \underbrace{\frac{1}{n}\sum_{t=1}^{n} m(X_t)\, K_h(X_t - x)\big/f_n(x)}_{I_1} + \underbrace{\sum_{t=1}^{n} W_t\, \varepsilon_t}_{I_2}.$$
We will show that $I_1$ contributes only the asymptotic bias and $I_2$ gives the asymptotic normality. First, we derive the asymptotic bias for interior points. By Taylor's expansion, when $X_t$ is in $(x - h, x + h)$, we have
$$m(X_t) = m(x) + m'(x)(X_t - x) + \frac{1}{2} m''(x_t^{*})(X_t - x)^2,$$
where $x_t^{*} = x + \lambda(X_t - x)$ with $-1 < \lambda < 1$. Then,
$$I_{11} \equiv \frac{1}{n}\sum_{t=1}^{n} m(X_t)\, K_h(X_t - x) = m(x)\, f_n(x) + m'(x)\underbrace{\frac{1}{n}\sum_{t=1}^{n}(X_t - x)\, K_h(X_t - x)}_{J_1(x)} + \frac{1}{2}\underbrace{\frac{1}{n}\sum_{t=1}^{n} m''(x_t^{*})(X_t - x)^2\, K_h(X_t - x)}_{J_2(x)}.$$
Then,
$$E[J_1(x)] = E[(X_t - x)\, K_h(X_t - x)] = \int (u - x)\, K_h(u - x)\, f(u)\,du = h\int u\, K(u)\, f(x + hu)\,du = h^2 f'(x)\,\mu_2(K) + o(h^2).$$
Similar to the derivation of the variance of $f_n(x)$ in (3.3), we can show that $nh\,\mathrm{Var}(J_1(x)) = O(1)$. Therefore, $J_1(x) = h^2 f'(x)\,\mu_2(K) + o_p(h^2)$. By the same token, we have
$$E[J_2(x)] = E\left[m''(x_t^{*})(X_t - x)^2\, K_h(X_t - x)\right] = h^2\int m''(x + hu)\, u^2 K(u)\, f(x + hu)\,du = h^2 m''(x)\,\mu_2(K)\, f(x) + o(h^2)$$
and $\mathrm{Var}(J_2(x)) = O(1/nh)$. Therefore, $J_2(x) = h^2 m''(x)\,\mu_2(K)\, f(x) + o_p(h^2)$. Hence,
$$I_1 = m(x) + m'(x)\, J_1(x)/f_n(x) + \frac{1}{2} J_2(x)/f_n(x) = m(x) + \frac{h^2}{2}\mu_2(K)\left[m''(x) + 2\, m'(x)\, f'(x)/f(x)\right] + o_p(h^2)$$
by the fact that $f_n(x) = f(x) + o_p(1)$. The term
$$B_{nw}(x) = \frac{h^2}{2}\mu_2(K)\left[m''(x) + 2\, m'(x)\, f'(x)/f(x)\right] \qquad (4.1)$$
is regarded as the asymptotic bias. The bias term involves not only the curvature of m(x) (that is, m''(x)) but also the unknown density function f(x) and its derivative f'(x), so that the design cannot be adaptive.

Under some regularity conditions, similar to (3.3), we can show that for x being an interior grid point,
$$nh\,\mathrm{Var}(I_2) \to \nu_0(K)\,\sigma_{\varepsilon}^2(x)/f(x) \equiv \sigma_m^2(x),$$
where $\sigma_{\varepsilon}^2(x) = \mathrm{Var}(\varepsilon_t \mid X_t = x)$. Further, we can establish the asymptotic normality (the proof is provided later)
$$\sqrt{nh}\left\{\widehat{m}_{nw}(x) - m(x) - B_{nw}(x) + o_p(h^2)\right\} \to N\left(0, \sigma_m^2(x)\right),$$
where $B_{nw}(x)$ is given in (4.1).
4.2.2 Boundary Behavior

For expositional purposes, in what follows we only consider the case when p = 1. As for the boundary behavior of the NW estimator, we can follow Fan and Gijbels (1996). Without loss of generality, we consider the left boundary point $x = ch$, $0 < c < 1$. As in Fan and Gijbels (1996), we take $K(\cdot)$ to have support [-1, 1] and $m(\cdot)$ to have support [0, 1]. Similar to (3.5), it is easy to see that if $x = ch$,
$$E[J_1(ch)] = E[(X_t - ch)\, K_h(X_t - ch)] = \int_0^1 (u - ch)\, K_h(u - ch)\, f(u)\,du = h\int_{-c}^{1/h - c} u\, K(u)\, f(h(u + c))\,du = h\, f(0+)\,\mu_{1,c}(K) + h^2 f'(0+)[\mu_{2,c}(K) + c\,\mu_{1,c}(K)] + o(h^2),$$
and
$$E[J_2(ch)] = E\left[m''(x_t^{*})(X_t - ch)^2\, K_h(X_t - ch)\right] = h^2\int_{-c}^{1/h - c} m''(h(c + u))\, u^2 K(u)\, f(h(u + c))\,du = h^2 m''(0+)\,\mu_{2,c}(K)\, f(0+) + o(h^2).$$
Also, we can see that
$$\mathrm{Var}(J_1(ch)) = O(1/nh) \quad \mbox{and} \quad \mathrm{Var}(J_2(ch)) = O(1/nh),$$
which imply that
$$J_1(ch) = h\, f(0+)\,\mu_{1,c}(K) + o_p(h) \quad \mbox{and} \quad J_2(ch) = h^2 m''(0+)\,\mu_{2,c}(K)\, f(0+) + o_p(h^2).$$
This, in conjunction with (3.5), gives
$$I_1 - m(ch) = m'(ch)\, J_1(ch)/f_n(ch) + \frac{1}{2} J_2(ch)/f_n(ch) = a(c, K)\, h + b(c, K)\, h^2 + o_p(h^2),$$
where
$$a(c, K) = \frac{m'(0+)\,\mu_{1,c}(K)}{\mu_{0,c}(K)},$$
and
$$b(c, K) = \frac{\mu_{2,c}(K)\, m''(0+)}{2\,\mu_{0,c}(K)} + \frac{f'(0+)\, m'(0+)\left[\mu_{2,c}(K)\,\mu_{0,c}(K) - \mu_{1,c}^2(K)\right]}{f(0+)\,\mu_{0,c}^2(K)}.$$
Here, $a(c, K)\, h + b(c, K)\, h^2$ serves as the asymptotic bias term, which is of the order O(h). We can show that at the boundary point, the asymptotic variance has the form
$$nh\,\mathrm{Var}(\widehat{m}_{nw}(ch)) \to \nu_{0,c}(K)\,\sigma_{\varepsilon}^2(0+)/[\mu_{0,c}^2(K)\, f(0+)],$$
which is of the same order as that for an interior point, although the scaling constant is different.
4.3 Local Polynomial Estimate

To overcome the above shortcomings of the local constant estimate, we can use the local polynomial fitting scheme; see Fan and Gijbels (1996). The main idea is described as follows.

4.3.1 Formulation

Assume that the regression function m(x) has a continuous derivative of order (q + 1). For ease of notation, assume that p = 1. When $X_t \in (x - h, x + h)$, then
$$m(X_t) \approx \sum_{j=0}^{q}\frac{m^{(j)}(x)}{j!}(X_t - x)^j = \sum_{j=0}^{q}\beta_j (X_t - x)^j,$$
where $\beta_j = m^{(j)}(x)/j!$. Therefore, when $X_t \in (x - h, x + h)$, the model becomes
$$Y_t \approx \sum_{j=0}^{q}\beta_j (X_t - x)^j + \varepsilon_t.$$
Hence, we can apply the weighted least squares method. The locally weighted least squares criterion becomes
$$\sum_{t=1}^{n}\left\{Y_t - \sum_{j=0}^{q}\beta_j (X_t - x)^j\right\}^2 K_h(X_t - x). \qquad (4.2)$$
Minimizing the above with respect to $\beta = (\beta_0, \ldots, \beta_q)^T$ gives the local polynomial estimate $\widehat{\beta}$:
$$\widehat{\beta} = \left(X^T W X\right)^{-1} X^T W Y, \qquad (4.3)$$
where $W = \mathrm{diag}\{K_h(X_1 - x), \ldots, K_h(X_n - x)\}$,
$$X = \left(\begin{array}{cccc} 1 & (X_1 - x) & \cdots & (X_1 - x)^q \\ 1 & (X_2 - x) & \cdots & (X_2 - x)^q \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (X_n - x) & \cdots & (X_n - x)^q \end{array}\right), \quad \mbox{and} \quad Y = \left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array}\right).$$
Therefore, for $0 \le j \le q$,
$$\widehat{m}^{(j)}(x) = j!\,\widehat{\beta}_j.$$
This means that the local polynomial method estimates not only the regression function itself but also its derivatives.
4.3.2 Implementation in R

There are several ways of implementing the local polynomial estimator. One way is to write your own code using matrix multiplication as in (4.3), or employing the functions lm() or glm() with weights $K_h(X_t - x)$. In R, several built-in packages implement the local polynomial estimate. For example, the package KernSmooth contains several functions: bkde() computes the kernel density estimate, bkde2D() computes the 2D kernel density estimate, and bkfe() computes the kernel functional (derivative) density estimate. Function dpik() selects a bandwidth for kernel density estimation using the plug-in method and function dpill() chooses a bandwidth for the local linear (q = 1) regression estimate using the plug-in approach. Finally, function locpoly() is for the local polynomial fitting, including a local polynomial estimate of the density of X (or its derivative) if the dependent variable is omitted.
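A minimal local linear example with locpoly() and the plug-in bandwidth dpill() (the simulated data are an illustrative choice):

# Local linear regression with KernSmooth (sketch)
library(KernSmooth)
set.seed(1)
x = runif(500); y = sin(2*pi*x) + rnorm(500, sd = 0.3)
h = dpill(x, y)                              # plug-in bandwidth for the local linear fit
fit = locpoly(x, y, degree = 1, bandwidth = h)
plot(x, y); lines(fit$x, fit$y, col = 2, lwd = 2)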
Example 4.1: We apply the kernel regression estimation and local polynomial fitting methods to estimate the drift and diffusion of the weekly 3-month Treasury bill from January 2, 1970 to December 26, 1997. Let $x_t$ denote the weekly 3-month Treasury bill rate. It is common to model $x_t$ by assuming that it satisfies the continuous-time stochastic differential equation (Black-Scholes model)
$$d\,x_t = \mu(x_t)\,dt + \sigma(x_t)\,dW_t,$$
where $W_t$ is a Wiener process, $\mu(x_t)$ is called the drift function and $\sigma(x_t)$ is called the diffusion function. Our interest is to identify $\mu(x_t)$ and $\sigma(x_t)$. Assume a time series sequence $\{X_{t\Delta},\ 1 \le t \le n\}$ is observed at equally spaced time points. Using the infinitesimal generator (Øksendal, 1985), the first-order approximations of the moments of $x_t$, a discretized version of the Itô process, are given by Stanton (1997) (see Fan and Zhang (2003) for higher orders)
$$\Delta x_t = \mu(x_t)\,\Delta + \sigma(x_t)\,\sqrt{\Delta}\,\varepsilon_t,$$
where $\Delta x_t = x_{t+\Delta} - x_t$, $\varepsilon_t \sim N(0, 1)$, and $x_t$ and $\varepsilon_t$ are independent. Therefore,
$$\mu(x_t) = \lim_{\Delta \to 0} E[\Delta x_t \mid x_t]/\Delta \quad \mbox{and} \quad \sigma^2(x_t) = \lim_{\Delta \to 0} E\left[(\Delta x_t)^2 \mid x_t\right]/\Delta.$$
Hence, estimating $\mu(x)$ and $\sigma^2(x)$ becomes a nonparametric regression problem. We can use both the local constant and local polynomial methods to estimate $\mu(x)$ and $\sigma^2(x)$. As a result, the local constant estimators (red line) together with the lowess smoothers (black line) and the scatterplots of $\Delta x_t$ [in (a)], $|\Delta x_t|$ [in (b)], and $(\Delta x_t)^2$ [in (c)] versus $x_t$ are presented in Figure 4.1, and the local linear estimators (red line) together with the lowess smoothers (black line) and the corresponding scatterplots are displayed in Figure 4.2. An alternative approach can be found in Aït-Sahalia (1996).

[Figure 4.1: Scatterplots of $\Delta x_t$, $|\Delta x_t|$, and $(\Delta x_t)^2$ versus $x_t$ with the smoothed curves computed using scatter.smooth() and the local constant estimation.]
[Figure 4.2: Scatterplots of $\Delta x_t$, $|\Delta x_t|$, and $(\Delta x_t)^2$ versus $x_t$ with the smoothed curves computed using scatter.smooth() and the local linear estimation.]
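A rough sketch of how the local linear estimates of the drift and squared diffusion in this example could be computed; it assumes x holds the weekly 3-month Treasury bill rates (read in as in Section 3.5) and that observations are Delta = 1/52 years apart, both of which are assumptions made for illustration:

# Local linear estimation of drift and squared diffusion (sketch)
library(KernSmooth)
delta = 1/52
dx = diff(x)                                 # Delta x_t
xl = x[-length(x)]                           # lagged level x_t
mu_fit  = locpoly(xl, dx/delta,   degree = 1, bandwidth = dpill(xl, dx/delta))
sig_fit = locpoly(xl, dx^2/delta, degree = 1, bandwidth = dpill(xl, dx^2/delta))
par(mfrow = c(1, 2))
plot(mu_fit$x,  mu_fit$y,  type = "l", xlab = "x", ylab = "drift")
plot(sig_fit$x, sig_fit$y, type = "l", xlab = "x", ylab = "squared diffusion")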
4.3.3 Complexity of Local Polynomial Estimator

To implement the local polynomial estimator, we have to choose the order of the polynomial q, the bandwidth h and the kernel function K(.). These parameters are of course confounded with each other. Clearly, when $h = \infty$, the local polynomial fitting becomes a global polynomial fitting and the order q determines the model complexity. Unlike in parametric models, the complexity of local polynomial fitting is primarily controlled by the bandwidth, as shown in Fan and Gijbels (1996) and Fan and Yao (2003). Hence q is usually small and the issue of choosing q becomes less critical. We discuss those issues in detail as follows.

(1) If the objective is to estimate $m^{(j)}(\cdot)$ ($j \ge 0$), the local polynomial fitting corrects the boundary bias automatically when $q - j$ is odd. Further, when $q - j$ is odd, compared with the order $q - 1$ fit (so that $q - j - 1$ is even), the order q fit contains one extra parameter without increasing the variance for estimating $m^{(j)}(\cdot)$. But this extra parameter creates opportunities for bias reduction, particularly in the boundary regions; see the next section and the books by Fan and Gijbels (1996) and Ruppert and Wand (1994). For these reasons, the odd order fits (the order q chosen so that $q - j$ is odd) outperform the even order fits [the order $(q - 1)$ fit]. Based on theoretical and practical considerations, the order $q = j + 1$ is recommended in Fan and Gijbels (1996). If the primary objective is to estimate the regression function, one uses the local linear fit; if the target function is the first order derivative, one uses the local quadratic fit, and so on.

(2) It is well known that the choice of the bandwidth h plays an important role in any kernel smoothing, including the local polynomial fitting. A too large bandwidth causes over-smoothing (reducing variance), creating excessive modeling bias, while a too small bandwidth results in under-smoothing (reducing bias but increasing variance), obtaining wiggly estimates. The bandwidth can be subjectively chosen by users via visually inspecting resulting estimates, or automatically chosen by the data via minimizing an estimated theoretical risk (discussed later). Since the choice of bandwidth is not an easy task, it is often attacked by people who do not know nonparametric techniques well.

(3) Since the estimate is based on the local regression (4.2), it is reasonable to require a non-negative weight function K(.). It can be shown (see Fan and Gijbels (1996)) that for all choices of q and j, the optimal weight function is $K(z) = \frac{3}{4}(1 - z^2)_+$, the Epanechnikov kernel, based on minimizing the asymptotic variance of the local polynomial estimator. Thus, it is a universal weighting scheme and provides a useful benchmark for other kernels to compare with. As shown in Fan and Gijbels (1996) and Fan and Yao (2003), other kernels have nearly the same efficiency for practical choices of q and j. Hence the choice of the kernel function is not critical.

The local polynomial estimator compares favorably with other estimators, including the Nadaraya-Watson (local constant) estimator and other linear estimators such as the Gasser and Müller estimator of Gasser and Müller (1979) and the Priestley and Chao estimator of Priestley and Chao (1972). Indeed, it was shown by Fan (1993) that the local linear fitting is asymptotically minimax based on the quadratic loss function among all linear estimators and is nearly minimax among all possible estimators. This minimax property is extended by Fan, Gasser, Gijbels, Brockmann and Engel (1995) to more general local polynomial fitting. For detailed comparisons of the above four estimators, see Fan and Gijbels (1996).
Note that the Gasser and Müller estimator and the Priestley and Chao estimator are designed particularly for the fixed design case, that is, $X_t = t$. Let $s_t = (2t + 1)/2$ ($t = 1, \ldots, n-1$) with $s_0 = -\infty$ and $s_n = \infty$. The Gasser and Müller estimator is
$$\widehat{m}_{gm}(t_0) = \sum_{t=1}^{n}\int_{s_{t-1}}^{s_t} K_h(u - t_0)\,du\; Y_t.$$
Unlike the local constant estimator, no denominator is needed since the total weight
$$\sum_{t=1}^{n}\int_{s_{t-1}}^{s_t} K_h(u - t_0)\,du = 1.$$
Indeed, the Gasser and Müller estimator is an improved version of the Priestley and Chao estimator, which is defined as
$$\widehat{m}_{pc}(t_0) = \sum_{t=1}^{n} K_h(t - t_0)\, Y_t.$$
Note that the Priestley and Chao estimator is only applicable for the equally spaced setting.
4.3.4 Properties of Local Polynomial Estimator
Dene, for 0 j q,
s
n,j
(x) =
n

t=1
(X
t
x)
j
K
h
(X
t
x)
and S
n
(x) = X
T
WX. Then, the (i + 1, j + 1)th element of S
n
(x) is s
n,i+j
(x). Similar to
the evaluation of I
11
, we can show easily that
s
n,j
(x) = nh
j

j
(K) f(x)1 + o
p
(1).
Dene, H = diag1, h, , h
q
and S = (
i+j
(K))
0i,jq
. Then, it is not dicult to show
that S
n
(x) = nf(x) H S H1 + o
p
(1).
First of all, for 0 \le j \le q, let e_j be the (q+1) \times 1 vector with (j+1)th element equal to one
and all other elements zero. Then, \hat\beta_j can be re-expressed as
\hat\beta_j = e_j^T \hat\beta = \sum_{t=1}^{n} W_{j,n,h}(X_t - x)\, Y_t,
where W_{j,n,h}(X_t - x) is called the effective kernel in Fan and Gijbels (1996) and Fan and
Yao (2003), given by
W_{j,n,h}(X_t - x) = e_j^T S_n(x)^{-1} \left(1, (X_t - x), \ldots, (X_t - x)^q\right)^T K_h(X_t - x).
It is not difficult to show (based on least squares theory) that W_{j,n,h}(X_t - x) satisfies the
following so-called discrete moment conditions:
\sum_{t=1}^{n} (X_t - x)^l\, W_{j,n,h}(X_t - x) = \begin{cases} 1 & \text{if } l = j,\\ 0 & \text{otherwise.}\end{cases} \qquad (4.4)
Note that the local constant estimator does not have this property; see J_1(x)
in Section 4.2.1. This property implies that the local polynomial estimator is
unbiased for estimating \beta_j when the true regression function m(x) is a polynomial
of order q.
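These moment conditions are easy to verify numerically. The following R sketch builds the weights W_{j,n,h}(X_t - x) explicitly and checks that the matrix of discrete moments is the identity; the simulated design, the point x0, the bandwidth, and the choice q = 2 are assumptions made only for this illustration.
# Numerical check of the discrete moment conditions (4.4) for a local quadratic
# fit (q = 2); the simulated design, the point x0 and the bandwidth are assumed.
set.seed(2)
n <- 300
X <- rnorm(n)
x0 <- 0.5; h <- 0.4; q <- 2
K <- function(u) 0.75*pmax(1 - u^2, 0)    # Epanechnikov kernel
kh <- K((X - x0)/h)/h
D <- outer(X - x0, 0:q, "^")              # columns: (X_t - x0)^0, ..., (X_t - x0)^q
Sn <- t(D) %*% (kh*D)                     # S_n(x) = X^T W X
W <- solve(Sn, t(kh*D))                   # row j+1 holds W_{j,n,h}(X_t - x0), t = 1,...,n
round(t(D) %*% t(W), 10)                  # should print the identity matrix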
To gain more insight into the local polynomial estimator, define the equivalent kernel
(see Fan and Gijbels (1996))
W_j(u) = e_j^T S^{-1} (1, u, \ldots, u^q)^T K(u).
Then, it can be shown (see Fan and Gijbels (1996)) that
W_{j,n,h}(X_t - x) = \frac{1}{n h^{j+1} f(x)}\, W_j\!\left((X_t - x)/h\right) \{1 + o_p(1)\}
and
\int u^l\, W_j(u)\, du = \begin{cases} 1 & \text{if } l = j,\\ 0 & \text{otherwise.}\end{cases}
The implications of these results are as follows.
As pointed out by Fan and Yao (2003), the local polynomial estimator works like a
kernel regression estimator with a known design density f(x). This explains why the local
polynomial fit adapts to various design densities. In contrast, the kernel regression estimator
has large bias in regions where the derivative of f(x) is large; namely, it cannot adapt
to highly-skewed designs. To see this, imagine that the true regression function has a large slope
in such a region. Since the derivative of the design density is large, for a given x there are more
points on one side of x than on the other. When the local average is taken, the Nadaraya-
Watson estimate is biased towards the side with more local data points because the local
data are asymmetrically distributed. This issue is more pronounced at the boundary regions,
since there the local data are even more asymmetric. On the other hand, the local polynomial fit
creates asymmetric weights, if needed, to compensate for this kind of design bias. Hence, it
is adaptive to various design densities and to the boundary regions.
We next derive the asymptotic bias and variance expressions for local polynomial estimators.
For independent data, we can obtain the bias and variance expressions by conditioning
on the design matrix X. However, for time series data, conditioning on X would mean conditioning
on nearly the entire series. Hence, we derive the asymptotic bias and variance using
asymptotic normality rather than conditional expectation. As explained in Chapter 3,
localizing in the state domain weakens the dependence structure of the local data. Hence, one
would expect that the results for independent data continue to hold for stationary
processes under certain mixing conditions. The mixing condition and the bandwidth should be
related, as will be seen later.
Set B_n(x) = (b_1(x), \ldots, b_{q+1}(x))^T, where, for 0 \le j \le q,
b_{j+1}(x) = \sum_{t=1}^{n} \left\{ m(X_t) - \sum_{l=0}^{q} \frac{m^{(l)}(x)}{l!} (X_t - x)^l \right\} (X_t - x)^j K_h(X_t - x).
Then,
\hat\beta - \beta = \left(X^T W X\right)^{-1} B_n(x) + \left(X^T W X\right)^{-1} X^T W \varepsilon,
where \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T. It is easy to show that if q is odd,
B_n(x) = n h^{q+1} H f(x)\, \frac{m^{(q+1)}(x)}{(q+1)!}\, c_{1,q} \{1 + o_p(1)\},
where, for 1 \le k \le 3, c_{k,q} = (\mu_{q+k}(K), \ldots, \mu_{2q+k}(K))^T. If q is even,
B_n(x) = n h^{q+2} H f(x) \left\{ c_{2,q}\, \frac{m^{(q+1)}(x) f'(x)}{f(x)\,(q+1)!} + c_{3,q}\, \frac{m^{(q+2)}(x)}{(q+2)!} \right\} \{1 + o_p(1)\}.
Note that f'(x)/f(x) does not appear on the right hand side of B_n(x) when q is odd. In
either case, we can show that
nh\, \mathrm{Var}\!\left( H (\hat\beta - \beta) \right) \to \sigma^2(x)\, S^{-1} S^* S^{-1} / f(x) \equiv \Sigma(x),
where S^* is the (q+1) \times (q+1) matrix with (i, j)th element \nu_{i+j-2}(K).
This shows that the leading conditional bias term depends on whether q is odd or even.
By a Taylor series expansion argument, we know that when considering |X_t - x| < h, the
remainder term from a qth order polynomial expansion should be of order O(h^{q+1}), so the
result for odd q is quite easy to understand. When q is even, q + 1 is odd, hence the term
h^{q+1} is associated with \int u^l K(u)\, du for l odd, and this term is zero because K(u) is an even
function. Therefore, the h^{q+1} term disappears, while the remainder term becomes O(h^{q+2}).
Whether q is odd or even, then, the bias term is an even power of h. This
is similar to the case where one uses higher order kernel functions based on a symmetric
kernel function (an even function), where the bias is always an even power of h.
Finally, we can show that when q is odd,
\sqrt{nh}\, \left\{ H(\hat\beta - \beta) - B(x) \right\} \to N(0, \Sigma(x)),
where the asymptotic bias term for the local polynomial estimator is
B(x) = \frac{h^{q+1}}{(q+1)!}\, m^{(q+1)}(x)\, S^{-1} c_{1,q} \{1 + o_p(1)\}.
Or,
\sqrt{n h^{2j+1}}\, \left\{ \hat m^{(j)}(x) - m^{(j)}(x) - B_j(x) \right\} \to N(0, \sigma_{jj}(x)),
where the asymptotic bias and variance for the local polynomial estimator of m^{(j)}(x) are
B_j(x) = \frac{j!\, h^{q+1-j}}{(q+1)!}\, m^{(q+1)}(x) \int u^{q+1} W_j(u)\, du\, \{1 + o_p(1)\}
and
\sigma_{jj}(x) = \frac{(j!)^2\, \sigma^2(x)}{f(x)} \int W_j^2(u)\, du.
Similarly, we can derive the asymptotic bias and variance at boundary points if the regression
function has a finite support. For details, see Fan and Gijbels (1996), Fan and Yao (2003),
and Ruppert and Wand (1994). Indeed, define S_c, S_c^*, and c_{k,q,c} similarly to S, S^* and c_{k,q},
with \mu_j(K) and \nu_j(K) replaced by \mu_{j,c}(K) and \nu_{j,c}(K), respectively. We can show that
\sqrt{nh}\, \left\{ H(\hat\beta(ch) - \beta(ch)) - B_c(0) \right\} \to N(0, \Sigma_c(0)), \qquad (4.5)
where the asymptotic bias term for the local polynomial estimator at the left boundary point
is
B_c(0) = \frac{h^{q+1}}{(q+1)!}\, m^{(q+1)}(0)\, S_c^{-1} c_{1,q,c} \{1 + o_p(1)\},
and the asymptotic variance is \Sigma_c(0) = \sigma^2(0)\, S_c^{-1} S_c^* S_c^{-1} / f(0). Or,
\sqrt{n h^{2j+1}}\, \left\{ \hat m^{(j)}(ch) - m^{(j)}(ch) - B_{j,c}(0) \right\} \to N(0, \sigma_{jj,c}(0)),
where, with W_{j,c}(u) = e_j^T S_c^{-1} (1, u, \ldots, u^q)^T K(u),
B_{j,c}(0) = \frac{j!\, h^{q+1-j}}{(q+1)!}\, m^{(q+1)}(0) \int_{-c}^{\infty} u^{q+1} W_{j,c}(u)\, du\, \{1 + o_p(1)\}
and
\sigma_{jj,c}(0) = \frac{(j!)^2\, \sigma^2(0)}{f(0)} \int_{-c}^{\infty} W_{j,c}^2(u)\, du.
Exercise: Please derive the asymptotic properties of the local polynomial estimator; that
is, prove (4.5).
The above conclusions show that when q - j is odd, the bias at the boundary is of the
same order as that for points in the interior. Hence, the local polynomial fit does not create
excessive boundary bias when q - j is odd. Thus, the appealing boundary behavior of local
polynomial mean estimation extends to derivative estimation. However, when q - j is even,
the bias at the boundary is larger than in the interior, and the bias can also be large at
points where f(x) is discontinuous. This is referred to as the boundary effect. For these reasons
(and the minimax efficiency arguments), it is recommended that one always set q - j to be
odd when estimating m^{(j)}(x). It is indeed an odd world!
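As a small illustration of derivative estimation with q - j odd (here q = 2, j = 1), the following R sketch fits a local quadratic polynomial by weighted least squares at a single point; the data-generating model, the Gaussian kernel and the bandwidth are assumptions made only for this example.
# Illustrative local quadratic fit (q = 2) for estimating the first derivative
# m'(x0) at a single point; model, kernel and bandwidth are assumed.
set.seed(3)
n <- 400
X <- runif(n, -2, 2)
m.true <- function(u) u^3 - u
Y <- m.true(X) + rnorm(n, sd = 0.2)
x0 <- 1; h <- 0.3
w <- dnorm((X - x0)/h)/h                 # kernel weights K_h(X_t - x0)
fit <- lm(Y ~ I(X - x0) + I((X - x0)^2), weights = w)
coef(fit)[2]                             # estimate of m'(x0); true value is 3*x0^2 - 1 = 2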
4.3.5 Bandwidth Selection
As seen in previous sections, for stationary sequences of data under certain mixing conditions,
the local polynomial estimator performs very much like that for independent data, because
windowing reduces the dependence among the local data. Partially because of this, there are not
many studies on bandwidth selection for these problems. However, it is reasonable to expect that
bandwidth selectors for independent data continue to work for dependent data under
certain mixing conditions. Below, we summarize a few useful approaches. When the data do
not have strong enough mixing, the general strategy is to increase the bandwidth in order to
reduce the variance.
As we have already seen for nonparametric density estimation, the cross-
validation method is very useful for assessing the performance of an estimator via esti-
mating its prediction error. The basic idea is to set one data point aside for validation
of the model and use the remaining data to build the model. The criterion is defined as
\mathrm{CV}(h) = \sum_{s=1}^{n} \left[ Y_s - \hat m_{-s}(X_s) \right]^2,
where \hat m_{-s}(X_s) is the local polynomial estimator with j = 0 and bandwidth h, but computed with-
out using the sth observation. The above summand is indeed the squared prediction error
of the sth data point based on the training set \{(X_t, Y_t): t \ne s\}. The idea of the cross-
validation method is simple, but it is computationally intensive. An improved version, in terms
of computation, is the generalized cross-validation (GCV), proposed by Wahba (1977) and
Craven and Wahba (1979). This criterion can be described as follows. The fitted values
\hat Y = (\hat m(X_1), \ldots, \hat m(X_n))^T can be expressed as \hat Y = H(h) Y, where H(h) is an n \times n hat
matrix, depending on the X-variate and the bandwidth h; it is also called the smoothing
matrix. Then the GCV approach selects the bandwidth h that minimizes
\mathrm{GCV}(h) = \left[ n^{-1}\, \mathrm{tr}(I - H(h)) \right]^{-2} \mathrm{MASE}(h),
where \mathrm{MASE}(h) = \sum_{t=1}^{n} (Y_t - \hat m(X_t))^2 / n is the average of the squared residuals.
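A minimal R sketch of both criteria is given below; it is only an illustration under simulated data, and it uses the Nadaraya-Watson smoother because its hat matrix H(h) is easy to write down explicitly. That choice, the kernel, and the bandwidth grid are assumptions of the example, not prescriptions of these notes.
# Illustrative leave-one-out CV and GCV bandwidth choice for a kernel smoother,
# using the explicit hat matrix of the Nadaraya-Watson estimator.
set.seed(4)
n <- 150
X <- runif(n)
Y <- sin(2*pi*X) + rnorm(n, sd = 0.3)
hat.matrix <- function(h) {
  K <- dnorm(outer(X, X, "-")/h)
  K/rowSums(K)                               # row i holds the weights for fitting at X_i
}
cv.gcv <- function(h) {
  H <- hat.matrix(h)
  res <- Y - H %*% Y
  cv <- mean((res/(1 - diag(H)))^2)          # leave-one-out CV via the usual shortcut
  gcv <- mean(res^2)/(1 - mean(diag(H)))^2   # GCV(h) as defined above
  c(cv = cv, gcv = gcv)
}
hs <- seq(0.02, 0.3, by = 0.01)
crit <- sapply(hs, cv.gcv)
hs[apply(crit, 1, which.min)]                # bandwidths minimizing CV and GCV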
A drawback of cross-validation type methods is their inherent variability (see Hall and
Johnstone, 1992). Further, they cannot be applied directly to select bandwidths for estimating
derivative curves. As pointed out by Fan, Heckman, and Wand (1995), cross-validation
type methods perform poorly due to their large sample variation, and even worse so for dependent
data. Plug-in methods avoid these problems. The basic idea is to find a bandwidth h
minimizing the estimated mean integrated squared error (MISE). See Ruppert, Sheather and
Wand (1995) and Fan and Gijbels (1995) for details.
Nonparametric AIC Selector
Inspired by the nonparametric version of the Akaike final prediction error criterion proposed
by Tjøstheim and Auestad (1994b) for lag selection in a nonparametric setting, Cai (2002)
proposed a simple and quick method to select the bandwidth for the foregoing estimation proce-
dures, which can be regarded as a nonparametric version of the Akaike information criterion
(AIC), attentive to the structure of time series data and to the over-fitting or under-fitting
tendency. Note that the idea is also motivated by its analogue in Cai and Tiwari (2000).
The basic idea is described as follows.
By recalling the classical AIC for linear models under the likelihood setting,
-2\, \text{(maximized log likelihood)} + 2\, \text{(number of estimated parameters)},
Cai (2002) proposed the following nonparametric AIC: select h minimizing
\mathrm{AIC}(h) = \log\{\mathrm{MASE}(h)\} + \psi(\mathrm{tr}(H(h)), n), \qquad (4.6)
where \psi(\mathrm{tr}(H(h)), n) is chosen particularly to be of the form of the bias-corrected version of
the AIC, due to Hurvich and Tsai (1989),
\psi(\mathrm{tr}(H(h)), n) = \frac{2\,\{\mathrm{tr}(H(h)) + 1\}}{n - \{\mathrm{tr}(H(h)) + 2\}}, \qquad (4.7)
and \mathrm{tr}(H(h)) is the trace of the smoothing matrix H(h), regarded as the nonparametric
version of the degrees of freedom and called the effective number of parameters. See the book
by Hastie and Tibshirani (1990, Section 3.5) for a detailed discussion of this aspect
for nonparametric models. Note that (4.6) is actually a generalization of the AIC for
the parametric regression and autoregressive time series contexts, in which \mathrm{tr}(H(h)) is the
number of regression (autoregressive) parameters in the fitted model. In view of (4.7),
when \psi(\mathrm{tr}(H(h)), n) = -2 \log(1 - \mathrm{tr}(H(h))/n), then (4.6) becomes the generalized cross-
validation (GCV) criterion, commonly used to select the bandwidth in the time series liter-
ature and even in the iid setting; when \psi(\mathrm{tr}(H(h)), n) = 2\, \mathrm{tr}(H(h))/n, then (4.6) is the classical
AIC discussed in Engle, Granger, Rice, and Weiss (1986) for time series data; and when
\psi(\mathrm{tr}(H(h)), n) = -\log(1 - 2\, \mathrm{tr}(H(h))/n), (4.6) is the T-criterion, proposed and studied by
Rice (1984) for iid samples. It is clear that when \mathrm{tr}(H(h))/n \to 0, the nonparametric
AIC, the GCV and the T-criterion are asymptotically equivalent. However, the T-criterion
requires \mathrm{tr}(H(h))/n < 1/2, and, when \mathrm{tr}(H(h))/n is large, the GCV has a relatively weak
penalty. This is especially true for the nonparametric setting. Therefore, the criterion pro-
posed here counteracts the over-fitting tendency of the GCV. Note that Hurvich, Simonoff,
and Tsai (1998) gave a detailed derivation of the nonparametric AIC for the nonpara-
metric regression problem under the iid Gaussian error setting, and they argued that the
nonparametric AIC performs reasonably well and better than some existing methods in the
literature.
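The following R sketch illustrates the criterion (4.6)-(4.7); as in the earlier sketch, the Nadaraya-Watson hat matrix is used only because it is explicit, and the simulated data and bandwidth grid are assumptions of the illustration.
# Illustrative nonparametric AIC (4.6)-(4.7) for bandwidth selection,
# using a Nadaraya-Watson hat matrix for simplicity.
set.seed(5)
n <- 150
X <- runif(n)
Y <- sin(2*pi*X) + rnorm(n, sd = 0.3)
np.aic <- function(h) {
  K <- dnorm(outer(X, X, "-")/h)
  H <- K/rowSums(K)                        # smoothing matrix H(h)
  trH <- sum(diag(H))                      # effective number of parameters
  mase <- mean((Y - H %*% Y)^2)
  log(mase) + 2*(trH + 1)/(n - trH - 2)    # AIC(h) with the Hurvich-Tsai correction
}
hs <- seq(0.02, 0.3, by = 0.01)
hs[which.min(sapply(hs, np.aic))]          # selected bandwidth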
4.4 Project for Regression Function Estimation
Do Monte Carlo simulations to compare the performance of the local linear and local constant
estimators of the nonparametric regression function in different settings, and draw
your own conclusions based on your simulations. Please do the following:
1. Choose different sample sizes, different kernels, different bandwidths, and different
bandwidth selection methods. Any conclusions and comments?
2. Compare the local linear method with the local constant method. Any conclusions and
comments?
Note that you can choose any distribution to generate the samples for your simulation.
You can use any statistical package to do your simulation; try R, since it is very
simple. You have to write a report presenting what you did in detail, explaining what you
observe, and making your comments. Please hand in all necessary materials (tables or
graphs) to support your conclusions. If you need any help, please come to see me.
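A possible skeleton for this project is sketched below in R; every choice in it (the true regression function, design, error distribution, kernel, bandwidth and number of replications) is an assumption that you are expected to vary.
# Skeleton for the Monte Carlo comparison of local constant vs. local linear fits.
set.seed(6)
m.true <- function(u) u + 2*exp(-16*u^2)
grid <- seq(-1.8, 1.8, length = 101)
h <- 0.25
one.rep <- function(n) {
  X <- runif(n, -2, 2)
  Y <- m.true(X) + rnorm(n, sd = 0.4)
  lc <- ll <- numeric(length(grid))
  for (k in seq_along(grid)) {
    w <- dnorm((X - grid[k])/h)
    lc[k] <- sum(w*Y)/sum(w)                                # local constant
    ll[k] <- coef(lm(Y ~ I(X - grid[k]), weights = w))[1]   # local linear
  }
  c(lc = mean((lc - m.true(grid))^2), ll = mean((ll - m.true(grid))^2))
}
rowMeans(replicate(200, one.rep(n = 200)))                  # average squared errors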
4.5 Functional Coefficient Model
4.5.1 Model
As mentioned earlier, when p is large, there exists the so-called "curse of dimensionality". To
overcome this shortcoming, one way is to consider the functional coefficient model
as studied in Cai, Fan and Yao (2000), or the additive model discussed in Section 4.6. First,
we study the functional coefficient model. To use the notation from Cai, Fan and Yao (2000),
we change the notation from the previous sections.
Let \{U_i, X_i, Y_i\}_{i=-\infty}^{\infty} be jointly strictly stationary processes, with U_i taking values in
\mathbb{R}^k and X_i taking values in \mathbb{R}^p. Typically, k is small. Let E(Y_1^2) < \infty. We define the
multivariate regression function
m(u, x) = E(Y \mid U = u, X = x), \qquad (4.8)
where (U, X, Y) has the same distribution as (U_i, X_i, Y_i). In a pure time series context,
both U_i and X_i consist of some lagged values of Y_i. The functional-coefficient regression
model has the form
m(u, x) = \sum_{j=1}^{p} a_j(u)\, x_j, \qquad (4.9)
where the functions a_j(\cdot) are measurable from \mathbb{R}^k to \mathbb{R}^1 and x = (x_1, \ldots, x_p)^T. This
model has been studied extensively in the literature; see Cai, Fan and Yao (2000) for
detailed discussions.
For simplicity, in what follows, we consider only the case k = 1 in (4.9). Extension to the
case k > 1 involves no fundamentally new ideas. Note that models with large k are often
not practically useful due to the "curse of dimensionality". If k is large, one way to overcome the
problem is to consider the index functional coefficient model proposed by
Fan, Yao and Cai (2003),
m(u, x) = \sum_{j=1}^{p} a_j(\beta^T u)\, x_j, \qquad (4.10)
where \beta_1 = 1 for identifiability. Fan, Yao and Cai (2003) studied the estimation procedures, bandwidth se-
lection and applications. Hong and Lee (2003) considered applications of model (4.10)
to exchange rates, Juhl (2005) studied the unit root behavior of nonlinear time series
models, Li, Huang, Li and Fu (2002) modelled the production frontier using China's manu-
facturing industry data, and Cai, Das, Xiong and Wu (2006) considered nonparametric
two-stage instrumental variable estimators for returns to education.
4.5.2 Local Linear Estimation
As recommended by Fan and Gijbels (1996), we estimate the coefficient functions a_j(\cdot)
using the local linear regression method from the observations \{U_i, X_i, Y_i\}_{i=1}^{n}, where X_i =
(X_{i1}, \ldots, X_{ip})^T. We assume throughout that a_j(\cdot) has a continuous second derivative. Note
that we may approximate a_j(\cdot) locally at u_0 by a linear function a_j(u) \approx a_j + b_j (u - u_0).
The local linear estimator is defined as \hat a_j(u_0) = \hat a_j, where (\hat a_j, \hat b_j) minimize the sum of
weighted squares
\sum_{i=1}^{n} \left[ Y_i - \sum_{j=1}^{p} \left\{ a_j + b_j (U_i - u_0) \right\} X_{ij} \right]^2 K_h(U_i - u_0), \qquad (4.11)
where K_h(\cdot) = h^{-1} K(\cdot/h), K(\cdot) is a kernel function on \mathbb{R}^1 and h > 0 is a bandwidth. It
follows from least squares theory that
\hat a_j(u_0) = \sum_{k=1}^{n} K_{n,j}(U_k - u_0, X_k)\, Y_k, \qquad (4.12)
where
K_{n,j}(u, x) = e_{j,2p}^T \left( \widetilde X^T W \widetilde X \right)^{-1} \begin{pmatrix} x\\ u\, x \end{pmatrix} K_h(u), \qquad (4.13)
e_{j,2p} is the 2p \times 1 unit vector with 1 at the jth position, \widetilde X denotes the n \times 2p matrix with
(X_i^T, X_i^T (U_i - u_0)) as its ith row, and W = \mathrm{diag}\{ K_h(U_1 - u_0), \ldots, K_h(U_n - u_0) \}.
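A compact R sketch of this estimator for p = 2 follows; the data-generating process, the Gaussian kernel and the bandwidth are illustrative assumptions, and lm.wfit is used only as a convenient weighted least squares solver.
# Illustrative local linear estimation of the functional-coefficient model
# Y_i = a_1(U_i) X_{i1} + a_2(U_i) X_{i2} + e_i (p = 2); all settings are assumed.
set.seed(7)
n <- 500
U <- runif(n)
X1 <- rnorm(n); X2 <- rnorm(n)
a1 <- function(u) sin(2*pi*u); a2 <- function(u) 1 - u
Y <- a1(U)*X1 + a2(U)*X2 + rnorm(n, sd = 0.3)
h <- 0.1
grid <- seq(0.05, 0.95, length = 46)
est <- matrix(NA, length(grid), 2)
for (k in seq_along(grid)) {
  u0 <- grid[k]
  w <- dnorm((U - u0)/h)/h                          # K_h(U_i - u0)
  D <- cbind(X1, X2, X1*(U - u0), X2*(U - u0))      # 2p local linear regressors
  est[k, ] <- lm.wfit(x = D, y = Y, w = w)$coefficients[1:2]
}
matplot(grid, est, type = "l", lty = 1:2, xlab = "u", ylab = "coefficient estimates")
lines(grid, a1(grid), lty = 3); lines(grid, a2(grid), lty = 4)   # true curves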
4.5.3 Bandwidth Selection
Various existing bandwidth selection techniques for nonparametric regression can be adapted
for the foregoing estimation; see, e.g., Fan, Yao, and Cai (2003) and the nonparametric
AIC discussed in Section 4.3.5. Also, Fan and Gijbels (1996) and Ruppert, Sheather,
and Wand (1995) developed data-driven bandwidth selection schemes based on asymptotic
formulas for the optimal bandwidths, which are less variable and more effective than
conventional data-driven bandwidth selectors such as the cross-validation bandwidth rule.
Similar algorithms can be developed for the estimation of functional-coefficient models based
on (4.23); however, this is left as a future research topic.
Cai, Fan and Yao (2000) proposed a simple and quick method for selecting the bandwidth
h. It can be regarded as a modified multi-fold cross-validation criterion that is attentive to
the structure of stationary time series data. Let m and Q be two given positive integers with
n > mQ. The basic idea is first to use Q sub-series of lengths n - qm (q = 1, \ldots, Q) to
estimate the unknown coefficient functions and then to compute the one-step forecasting errors
on the next section of the time series of length m based on the estimated models. More
precisely, we choose h that minimizes the average mean squared (AMS) error
\mathrm{AMS}(h) = \sum_{q=1}^{Q} \mathrm{AMS}_q(h), \qquad (4.14)
where, for q = 1, \ldots, Q,
\mathrm{AMS}_q(h) = \frac{1}{m} \sum_{i=n-qm+1}^{n-qm+m} \left[ Y_i - \sum_{j=1}^{p} \hat a_{j,q}(U_i)\, X_{i,j} \right]^2,
and \{\hat a_{j,q}(\cdot)\} are computed from the sample \{(U_i, X_i, Y_i),\ 1 \le i \le n - qm\} with bandwidth
equal to h [n/(n - qm)]^{1/5}. Note that we re-scale the bandwidth h for different sample sizes according
to its optimal rate, i.e., h \propto n^{-1/5}. In practical implementations, we may use m = [0.1\,n] and
Q = 4. The selected bandwidth does not depend critically on the choice of m and Q, as long
as mQ is reasonably large, so that the evaluation of the prediction errors is stable. A weighted
version of \mathrm{AMS}(h) can be used if one wishes to down-weight the prediction errors at
earlier times. We believe that this bandwidth should be good for modeling and forecasting
time series.
4.5.4 Smoothing Variable Selection
Of importance is how to choose an appropriate smoothing variable U in applying functional-
coefficient regression models if U is a lagged variable. Knowledge of the physical background of
the data may be very helpful, as Cai, Fan and Yao (2000) discussed in modeling the lynx
data. Without any prior information, it is pertinent to choose U by some data-driven
method such as the Akaike information criterion (AIC) and its variants, cross-validation,
or other criteria. Ideally, we would choose U as a linear function of the given explanatory
variables according to some optimal criterion, which is fully explored in the work by
Fan, Yao and Cai (2003). Nevertheless, we propose here a simple and practical approach:
let U be the one of the given explanatory variables such that AMS defined in (4.14) attains its
minimum value. Obviously, this idea can also be extended to select p (the number of lags) as
well.
4.5.5 Goodness-of-Fit Test
To test whether model (4.9) holds with a specified parametric form, which is popular in
economic and financial applications, such as the threshold autoregressive (TAR) models
a_j(u) = \begin{cases} a_{j1}, & \text{if } u \le \eta,\\ a_{j2}, & \text{if } u > \eta, \end{cases}
or generalized exponential autoregressive (EXPAR) models
a_j(u) = \alpha_j + (\beta_j + \gamma_j u)\, \exp\left( -\theta_j u^2 \right),
or smooth transition autoregressive (STAR) models
a_j(u) = \left[ 1 + \exp\left( -\theta_j u \right) \right]^{-1} \quad \text{(logistic)},
or
a_j(u) = 1 - \exp\left( -\theta_j u^2 \right) \quad \text{(exponential)},
or
a_j(u) = \left[ 1 + \exp\left( -\theta_j |u| \right) \right]^{-1} \quad \text{(absolute)}
[for more discussion of these models, please see the survey paper by van Dijk, Teräsvirta and
Franses (2002)], we propose a goodness-of-fit test based on the comparison of the residual sum
of squares (RSS) from both the parametric and the nonparametric fittings. This method is closely
related to the sieve likelihood method proposed by Fan, Zhang and Zhang (2001). Those
authors demonstrated the optimality of this kind of procedure for independent samples.
Consider the null hypothesis
H_0: a_j(u) = \alpha_j(u, \theta), \quad 1 \le j \le p, \qquad (4.15)
where \alpha_j(\cdot, \theta) is a given family of functions indexed by an unknown parameter vector \theta. Let
\hat\theta be an estimator of \theta. The RSS under the null hypothesis is
\mathrm{RSS}_0 = n^{-1} \sum_{i=1}^{n} \left[ Y_i - \alpha_1(U_i, \hat\theta)\, X_{i1} - \cdots - \alpha_p(U_i, \hat\theta)\, X_{ip} \right]^2.
Analogously, the RSS corresponding to model (4.9) is
\mathrm{RSS}_1 = n^{-1} \sum_{i=1}^{n} \left[ Y_i - \hat a_1(U_i)\, X_{i1} - \cdots - \hat a_p(U_i)\, X_{ip} \right]^2.
The test statistic is defined as
T_n = (\mathrm{RSS}_0 - \mathrm{RSS}_1)/\mathrm{RSS}_1 = \mathrm{RSS}_0/\mathrm{RSS}_1 - 1,
and we reject the null hypothesis (4.15) for large values of T_n. We use the following nonpara-
metric bootstrap approach to evaluate the p-value of the test:
1. Generate the bootstrap residuals \{\varepsilon_i^*\}_{i=1}^{n} from the empirical distribution of the centered
residuals \{\hat\varepsilon_i - \bar\varepsilon\}_{i=1}^{n}, where
\hat\varepsilon_i = Y_i - \hat a_1(U_i)\, X_{i1} - \cdots - \hat a_p(U_i)\, X_{ip}, \qquad \bar\varepsilon = \frac{1}{n} \sum_{i=1}^{n} \hat\varepsilon_i,
and define
Y_i^* = \alpha_1(U_i, \hat\theta)\, X_{i1} + \cdots + \alpha_p(U_i, \hat\theta)\, X_{ip} + \varepsilon_i^*.
2. Calculate the bootstrap test statistic T_n^* based on the sample \{U_i, X_i, Y_i^*\}_{i=1}^{n}.
3. Reject the null hypothesis H_0 when T_n is greater than the upper-\alpha point of the condi-
tional distribution of T_n^* given \{U_i, X_i, Y_i\}_{i=1}^{n}.
The p-value of the test is simply the relative frequency of the event \{T_n^* \ge T_n\} in the
replications of the bootstrap sampling. For the sake of simplicity, we use the same bandwidth
in calculating T_n^* as in T_n. Note that we bootstrap the centered residuals from
the nonparametric fit instead of the parametric fit, because the nonparametric estimate of the
residuals is always consistent, no matter whether the null or the alternative hypothesis is
correct. The method should provide a consistent estimator of the null distribution even
when the null hypothesis does not hold. Kreiss, Neumann, and Yao (1998) considered
nonparametric bootstrap tests in a general nonparametric regression setting. They proved
that, asymptotically, the conditional distribution of the bootstrap test statistic is indeed the
distribution of the test statistic under the null hypothesis. It may be proven that a similar
result holds here as long as \hat\theta converges to \theta at the rate n^{-1/2}.
It is a great challenge to derive the asymptotic properties of the test statistic T_n in the
time series context under general assumptions; that is, to show that
b_n\, [T_n - \mu_n] \to N(0, \sigma^2)
for some b_n and \mu_n, which is a great project for future research. Note that Fan, Zhang and
Zhang (2001) derived the above result for iid samples.
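To make the bootstrap recipe concrete, here is an R sketch for the simplest case p = 1, testing a constant-coefficient null against the functional-coefficient alternative; the simulated model, the bandwidth, and the (small) number of bootstrap replications are all assumptions of this illustration, not the authors' code.
# Illustrative bootstrap goodness-of-fit test with p = 1: H_0 is a constant
# coefficient a(u) = theta versus the functional-coefficient alternative.
set.seed(8)
n <- 200
U <- runif(n); X <- rnorm(n)
Y <- (1 + 0.8*sin(2*pi*U))*X + rnorm(n, sd = 0.5)
h <- 0.15
a.hat <- function(U, X, Y) {                        # local linear fit of a(.) at each U_i
  sapply(seq_along(U), function(i) {
    w <- dnorm((U - U[i])/h)/h
    coef(lm(Y ~ 0 + X + I(X*(U - U[i])), weights = w))[1]
  })
}
T.stat <- function(U, X, Y) {
  theta <- coef(lm(Y ~ 0 + X))                      # parametric fit under H_0
  RSS0 <- mean((Y - theta*X)^2)
  RSS1 <- mean((Y - a.hat(U, X, Y)*X)^2)
  RSS0/RSS1 - 1
}
Tn <- T.stat(U, X, Y)
res <- Y - a.hat(U, X, Y)*X
res <- res - mean(res)                              # centered nonparametric residuals
theta <- coef(lm(Y ~ 0 + X))
Tboot <- replicate(100, {                           # B = 100 to keep the sketch fast
  Ystar <- theta*X + sample(res, n, replace = TRUE)
  T.stat(U, X, Ystar)
})
mean(Tboot >= Tn)                                   # bootstrap p-value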
4.5.6 Asymptotic Results
We first present a result on mean squared convergence that serves as a building block for
our main result and is also of independent interest. We now introduce some notation. Let
S_n = S_n(u_0) = \begin{pmatrix} S_{n,0} & S_{n,1}\\ S_{n,1} & S_{n,2} \end{pmatrix}
and
T_n = T_n(u_0) = \begin{pmatrix} T_{n,0}(u_0)\\ T_{n,1}(u_0) \end{pmatrix}
with
S_{n,j} = S_{n,j}(u_0) = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T \left( \frac{U_i - u_0}{h} \right)^j K_h(U_i - u_0)
and
T_{n,j}(u_0) = \frac{1}{n} \sum_{i=1}^{n} X_i \left( \frac{U_i - u_0}{h} \right)^j K_h(U_i - u_0)\, Y_i. \qquad (4.16)
Then, the solution to (4.11) can be expressed as
\hat\beta = H^{-1} S_n^{-1} T_n, \qquad (4.17)
where H = \mathrm{diag}(1, \ldots, 1, h, \ldots, h), with p diagonal elements equal to 1 and p diagonal elements
equal to h. To simplify the notation, we denote
\Omega = (\omega_{l,m})_{p \times p} = E\left( X X^T \mid U = u_0 \right). \qquad (4.18)
Also, let f(u, x) denote the joint density of (U, X) and f_u(u) be the marginal density of U.
We use the following convention: if U = X_{j_0} for some 1 \le j_0 \le p, then f(u, x) becomes
f(x), the joint density of X.
Theorem 4.1. Let Condition A.1 hold, and let f(u, x) be continuous at the point u_0.
Let h_n \to 0 and n h_n \to \infty as n \to \infty. Then it holds that
E(S_{n,j}(u_0)) \to f_u(u_0)\, \Omega(u_0)\, \mu_j,
and
n h_n\, \mathrm{Var}\left( \{S_{n,j}(u_0)\}_{l,m} \right) \to f_u(u_0)\, \nu_{2j}\, \tilde\omega_{l,m}
for each 0 \le j \le 3 and 1 \le l, m \le p.
As a consequence of Theorem 4.1, we have
S_n \stackrel{P}{\to} f_u(u_0)\, S, \quad \text{and} \quad S_{n,3} \stackrel{P}{\to} \mu_3\, f_u(u_0)\, \Omega(u_0),
in the sense that each element converges in probability, where
S = \begin{pmatrix} \Omega & \mu_1 \Omega\\ \mu_1 \Omega & \mu_2 \Omega \end{pmatrix}.
Put
\sigma^2(u, x) = \mathrm{Var}(Y \mid U = u, X = x) \qquad (4.19)
and
\Omega^*(u_0) = E\left[ X X^T \sigma^2(U, X) \mid U = u_0 \right]. \qquad (4.20)
Let c_0 = \mu_2/(\mu_2 - \mu_1^2) and c_1 = -\mu_1/(\mu_2 - \mu_1^2).
Theorem 4.2. Let \sigma^2(u, x) and f(u, x) be continuous at the point u_0. Then, under Conditions
A.1 and A.2,
\sqrt{n h_n} \left\{ \hat a(u_0) - a(u_0) - \frac{h^2}{2}\, \frac{\mu_2^2 - \mu_1 \mu_3}{\mu_2 - \mu_1^2}\, a''(u_0) \right\} \stackrel{D}{\to} N\left( 0, \Sigma^2(u_0) \right), \qquad (4.21)
provided that f_u(u_0) \ne 0, where
\Sigma^2(u_0) = \frac{c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2}{f_u(u_0)}\, \Omega^{-1}(u_0)\, \Omega^*(u_0)\, \Omega^{-1}(u_0). \qquad (4.22)
Theorem 4.2 indicates that the asymptotic bias of \hat a_j(u_0) is
\frac{h^2}{2}\, \frac{\mu_2^2 - \mu_1 \mu_3}{\mu_2 - \mu_1^2}\, a_j''(u_0)
and the asymptotic variance is (n h_n)^{-1} \sigma_j^2(u_0), where
\sigma_j^2(u_0) = \frac{c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2}{f_u(u_0)}\, e_{j,p}^T\, \Omega^{-1}(u_0)\, \Omega^*(u_0)\, \Omega^{-1}(u_0)\, e_{j,p}.
When \mu_1 = 0, the bias and variance expressions simplify to h^2 \mu_2\, a_j''(u_0)/2 and
\sigma_j^2(u_0) = \frac{\nu_0}{f_u(u_0)}\, e_{j,p}^T\, \Omega^{-1}(u_0)\, \Omega^*(u_0)\, \Omega^{-1}(u_0)\, e_{j,p}.
The optimal bandwidth for estimating a_j(\cdot) can be defined as the one that minimizes the
squared bias plus the variance. The optimal bandwidth is given by
h_{j,\mathrm{opt}} = \left[ \frac{ \left( \mu_2^2 \nu_0 - 2 \mu_1 \mu_2 \nu_1 + \mu_1^2 \nu_2 \right) e_{j,p}^T\, \Omega^{-1}(u_0)\, \Omega^*(u_0)\, \Omega^{-1}(u_0)\, e_{j,p} }{ f_u(u_0)\, (\mu_2^2 - \mu_1 \mu_3)^2 \left[ a_j''(u_0) \right]^2 } \right]^{1/5} n^{-1/5}. \qquad (4.23)
4.5.7 Conditions and Proofs
We first impose some conditions on the regression model; they might not be the weakest
possible.
Condition A.1
a. The kernel function K(\cdot) is a bounded density with bounded support [-1, 1].
b. |f(u, v \mid x_0, x_1; l)| \le M < \infty for all l \ge 1, where f(u, v \mid x_0, x_1; l) is the conditional
density of (U_0, U_l) given (X_0, X_l) = (x_0, x_1), and f(u \mid x) \le M < \infty, where f(u \mid x) is the
conditional density of U given X = x.
c. The process \{U_i, X_i, Y_i\} is \alpha-mixing with \sum_k k^c\, [\alpha(k)]^{1-2/\delta} < \infty for some \delta > 2 and
c > 1 - 2/\delta.
d. E|X|^{2\delta} < \infty, where \delta is given in Condition A.1c.
Condition A.2
a. Assume that
E\left[ Y_0^2 + Y_l^2 \mid U_0 = u, X_0 = x_0; U_l = v, X_l = x_1 \right] \le M < \infty, \qquad (4.24)
for all l \ge 1, x_0, x_1 \in \mathbb{R}^p, and u and v in a neighborhood of u_0.
b. Assume that h_n \to 0 and n h_n \to \infty. Further, assume that there exists a sequence of
positive integers s_n such that s_n \to \infty, s_n = o\left( (n h_n)^{1/2} \right), and (n/h_n)^{1/2}\, \alpha(s_n) \to 0,
as n \to \infty.
c. There exists \delta^* > \delta, where \delta is given in Condition A.1c, such that
E\left[ |Y|^{\delta^*} \mid U = u, X = x \right] \le M_4 < \infty \qquad (4.25)
for all x \in \mathbb{R}^p and u in a neighborhood of u_0, and
\alpha(n) = O\left( n^{-\theta^*} \right), \qquad (4.26)
where \theta^* \ge \delta \delta^* / \{2 (\delta^* - \delta)\}.
d. E|X|^{2\delta^*} < \infty, and n^{1/2 - \delta/4}\, h^{\delta/\delta^* - 1/2 - \delta/4} = O(1).
Remark A.1. We provide a sufficient condition for the mixing coefficient \alpha(n) to satisfy
Conditions A.1c and A.2b. Suppose that h_n = A n^{-\rho} (0 < \rho < 1, A > 0), s_n =
(n h_n / \log n)^{1/2} and \alpha(n) = O\left( n^{-d} \right) for some d > 0. Then Condition A.1c is satisfied for
d > 2(1 - 1/\delta)/(1 - 2/\delta) and Condition A.2b is satisfied if d > (1 + \rho)/(1 - \rho). Hence both
conditions are satisfied if
\alpha(n) = O\left( n^{-d} \right), \qquad d > \max\left\{ \frac{1 + \rho}{1 - \rho},\ \frac{2(1 - 1/\delta)}{1 - 2/\delta} \right\}.
Note that there is a trade-off between the order \delta of the moment of Y and the rate of decay
of the mixing coefficient; the larger the order \delta, the weaker the required decay rate of \alpha(n).
To study the joint asymptotic normality of \hat a(u_0), we need to center the vector T_n(u_0)
by replacing Y_i with Y_i - m(U_i, X_i) in the expression (4.16) for T_{n,j}(u_0). Let
T_{n,j}^*(u_0) = \frac{1}{n} \sum_{i=1}^{n} X_i \left( \frac{U_i - u_0}{h} \right)^j K_h(U_i - u_0)\, [Y_i - m(U_i, X_i)],
and
T_n^* = \begin{pmatrix} T_{n,0}^*\\ T_{n,1}^* \end{pmatrix}.
Because the coefficient functions a_j(u) are approximated only in the neighborhood |U_i - u_0| < h,
by Taylor's expansion,
m(U_i, X_i) = X_i^T a(u_0) + (U_i - u_0)\, X_i^T a'(u_0) + \frac{h^2}{2} \left( \frac{U_i - u_0}{h} \right)^2 X_i^T a''(u_0) + o_p(h^2),
where a'(u_0) and a''(u_0) are the vectors consisting of the first and second derivatives of the
functions a_j(\cdot). Then,
T_{n,0} - T_{n,0}^* = S_{n,0}\, a(u_0) + h\, S_{n,1}\, a'(u_0) + \frac{h^2}{2}\, S_{n,2}\, a''(u_0) + o_p(h^2)
and
T_{n,1} - T_{n,1}^* = S_{n,1}\, a(u_0) + h\, S_{n,2}\, a'(u_0) + \frac{h^2}{2}\, S_{n,3}\, a''(u_0) + o_p(h^2),
so that
T_n - T_n^* = S_n H \beta + \frac{h^2}{2} \begin{pmatrix} S_{n,2}\\ S_{n,3} \end{pmatrix} a''(u_0) + o_p(h^2), \qquad (4.27)
where \beta = (a(u_0)^T, a'(u_0)^T)^T. Thus it follows from (4.17), (4.27), and Theorem 4.1 that
H \left( \hat\beta - \beta \right) = f_u^{-1}(u_0)\, S^{-1} T_n^* + \frac{h^2}{2}\, S^{-1} \begin{pmatrix} \mu_2 \Omega\\ \mu_3 \Omega \end{pmatrix} a''(u_0) + o_p(h^2), \qquad (4.28)
from which the bias term of \hat\beta(u_0) is evident. Clearly,
\hat a(u_0) - a(u_0) = \frac{\Omega^{-1}}{f_u(u_0)\, (\mu_2 - \mu_1^2)} \left\{ \mu_2 T_{n,0}^* - \mu_1 T_{n,1}^* \right\} + \frac{h^2}{2}\, \frac{\mu_2^2 - \mu_1 \mu_3}{\mu_2 - \mu_1^2}\, a''(u_0) + o_p(h^2). \qquad (4.29)
Thus (4.29) indicates that the asymptotic bias of \hat a(u_0) is
\frac{h^2}{2}\, \frac{\mu_2^2 - \mu_1 \mu_3}{\mu_2 - \mu_1^2}\, a''(u_0).
Let
Q_n = \frac{1}{n} \sum_{i=1}^{n} Z_i, \qquad (4.30)
where
Z_i = X_i \left\{ c_0 + c_1 \left( \frac{U_i - u_0}{h} \right) \right\} K_h(U_i - u_0)\, [Y_i - m(U_i, X_i)] \qquad (4.31)
with c_0 = \mu_2/(\mu_2 - \mu_1^2) and c_1 = -\mu_1/(\mu_2 - \mu_1^2). It follows from (4.29) and (4.30) that
\sqrt{n h_n} \left\{ \hat a(u_0) - a(u_0) - \frac{h^2}{2}\, \frac{\mu_2^2 - \mu_1 \mu_3}{\mu_2 - \mu_1^2}\, a''(u_0) \right\} = \frac{\Omega^{-1}}{f_u(u_0)}\, \sqrt{n h_n}\, Q_n + o_p(1). \qquad (4.32)
We need the following lemma, whose proof is more involved than that of Theorem 4.1;
therefore, we prove only this lemma. Throughout this section, we let C denote a generic
constant, which may take different values at different places.
Lemma 4.1. Under Conditions A.1 and A.2 and the assumption that h_n \to 0 and n h_n \to \infty
as n \to \infty, if \sigma^2(u, x) and f(u, x) are continuous at the point u_0, then we have
(a) h_n\, \mathrm{Var}(Z_1) \to f_u(u_0)\, \Omega^*(u_0)\, [c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2];
(b) h_n\, \sum_{l=1}^{n-1} |\mathrm{Cov}(Z_1, Z_{l+1})| = o(1); and
(c) n h_n\, \mathrm{Var}(Q_n) \to f_u(u_0)\, \Omega^*(u_0)\, [c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2].
Proof: First, by conditioning on (U_1, X_1) and using Theorem 1 of Sun (1984), we have
\mathrm{Var}(Z_1) = E\left[ X_1 X_1^T\, \sigma^2(U_1, X_1) \left\{ c_0 + c_1 \left( \frac{U_1 - u_0}{h} \right) \right\}^2 K_h^2(U_1 - u_0) \right]
= \frac{1}{h} \left[ f_u(u_0)\, \Omega^*(u_0) \left\{ c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2 \right\} + o(1) \right]. \qquad (4.33)
The result (c) follows in an obvious manner from (a) and (b), along with
\mathrm{Var}(Q_n) = \frac{1}{n} \mathrm{Var}(Z_1) + \frac{2}{n} \sum_{l=1}^{n-1} \left( 1 - \frac{l}{n} \right) \mathrm{Cov}(Z_1, Z_{l+1}). \qquad (4.34)
It thus remains to prove part (b). To this end, let d_n be a sequence of positive integers
such that d_n h_n \to 0. Define
J_1 = \sum_{l=1}^{d_n - 1} |\mathrm{Cov}(Z_1, Z_{l+1})| \quad \text{and} \quad J_2 = \sum_{l=d_n}^{n-1} |\mathrm{Cov}(Z_1, Z_{l+1})|.
It remains to show that J_1 = o(h^{-1}) and J_2 = o(h^{-1}).
We remark that, because K(\cdot) has bounded support [-1, 1], a_j(u) is bounded for u in
the neighborhood [u_0 - h, u_0 + h]. Let B = \max_{1 \le j \le p} \sup_{|u - u_0| < h} |a_j(u)| and
g(x) = \sum_{j=1}^{p} |x_j|. Then \sup_{|u - u_0| < h} |m(u, x)| \le B g(x). By conditioning on (U_1, X_1) and
(U_{l+1}, X_{l+1}), and using (4.24) and Condition A.1b, we have, for all l \ge 1,
|\mathrm{Cov}(Z_1, Z_{l+1})|
\le C\, E\left[ |X_1 X_{l+1}^T|\, \{|Y_1| + B g(X_1)\}\{|Y_{l+1}| + B g(X_{l+1})\}\, K_h(U_1 - u_0)\, K_h(U_{l+1} - u_0) \right]
\le C\, E\left[ |X_1 X_{l+1}^T|\, \{M_2 + B^2 g^2(X_1)\}^{1/2} \{M_2 + B^2 g^2(X_{l+1})\}^{1/2}\, K_h(U_1 - u_0)\, K_h(U_{l+1} - u_0) \right]
\le C\, E\left[ |X_1 X_{l+1}^T|\, \{1 + g(X_1)\}\{1 + g(X_{l+1})\} \right] \le C. \qquad (4.35)
It follows that
J_1 \le C\, d_n = o\left( h^{-1} \right)
by the choice of d_n. We next consider the upper bound of J_2. To this end, using Davydov's
inequality (see Hall and Heyde 1980, Corollary A.2), we obtain, for all 1 \le j, m \le p and
l \ge 1,
|\mathrm{Cov}(Z_{1j}, Z_{l+1,m})| \le C\, [\alpha(l)]^{1 - 2/\delta} \left\{ E |Z_j|^{\delta} \right\}^{1/\delta} \left\{ E |Z_m|^{\delta} \right\}^{1/\delta}. \qquad (4.36)
By conditioning on (U, X) and using Conditions A.1b and A.2c, one has
E |Z_j|^{\delta} \le C\, E\left[ |X_j|^{\delta}\, K_h^{\delta}(U - u_0)\, \left\{ |Y|^{\delta} + B^{\delta} g^{\delta}(X) \right\} \right]
\le C\, E\left[ |X_j|^{\delta}\, K_h^{\delta}(U - u_0)\, \left\{ M_3 + B^{\delta} g^{\delta}(X) \right\} \right]
\le C\, h^{1-\delta}\, E\left[ |X_j|^{\delta}\, \left\{ M_3 + B^{\delta} g^{\delta}(X) \right\} \right] \le C\, h^{1-\delta}. \qquad (4.37)
A combination of (4.36) and (4.37) leads to
J_2 \le C\, h^{2/\delta - 2} \sum_{l=d_n}^{\infty} [\alpha(l)]^{1 - 2/\delta} \le C\, h^{2/\delta - 2}\, d_n^{-c} \sum_{l=d_n}^{\infty} l^c\, [\alpha(l)]^{1 - 2/\delta} = o\left( h^{-1} \right) \qquad (4.38)
by choosing d_n such that h^{1 - 2/\delta} d_n^c = C, so that the requirement d_n h_n \to 0 is satisfied.
Proof of Theorem 4.2. We use the small-block and large-block technique: namely, partition
\{1, \ldots, n\} into 2 q_n + 1 subsets with large blocks of size r = r_n and small blocks of size s = s_n. Set
q = q_n = \left\lfloor \frac{n}{r_n + s_n} \right\rfloor. \qquad (4.39)
We now use the Cramér-Wold device to derive the asymptotic normality of Q_n. For any unit
vector d \in \mathbb{R}^p, let Z_{n,i} = \sqrt{h}\, d^T Z_{i+1}, i = 0, \ldots, n-1. Then
\sqrt{nh}\, d^T Q_n = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} Z_{n,i},
and, by Lemma 4.1,
\mathrm{Var}(Z_{n,0}) \to f_u(u_0)\, d^T \Omega^*(u_0)\, d\, \left[ c_0^2 \nu_0 + 2 c_0 c_1 \nu_1 + c_1^2 \nu_2 \right] \equiv \theta^2(u_0) \qquad (4.40)
and
\sum_{l=1}^{n-1} |\mathrm{Cov}(Z_{n,0}, Z_{n,l})| = o(1). \qquad (4.41)
Define the random variables, for 0 \le j \le q - 1,
\eta_j = \sum_{i=j(r+s)}^{j(r+s)+r-1} Z_{n,i}, \qquad \xi_j = \sum_{i=j(r+s)+r}^{(j+1)(r+s)-1} Z_{n,i}, \qquad \zeta_q = \sum_{i=q(r+s)}^{n-1} Z_{n,i}.
Then,
\sqrt{nh}\, d^T Q_n = \frac{1}{\sqrt{n}} \left\{ \sum_{j=0}^{q-1} \eta_j + \sum_{j=0}^{q-1} \xi_j + \zeta_q \right\} \equiv \frac{1}{\sqrt{n}} \left\{ Q_{n,1} + Q_{n,2} + Q_{n,3} \right\}. \qquad (4.42)
We show that, as n \to \infty,
\frac{1}{n} E[Q_{n,2}]^2 \to 0, \qquad \frac{1}{n} E[Q_{n,3}]^2 \to 0, \qquad (4.43)
\left| E[\exp(i t Q_{n,1})] - \prod_{j=0}^{q-1} E[\exp(i t \eta_j)] \right| \to 0, \qquad (4.44)
\frac{1}{n} \sum_{j=0}^{q-1} E\left[ \eta_j^2 \right] \to \theta^2(u_0), \qquad (4.45)
and
\frac{1}{n} \sum_{j=0}^{q-1} E\left[ \eta_j^2\, I\left\{ |\eta_j| \ge \varepsilon\, \theta(u_0)\, \sqrt{n} \right\} \right] \to 0 \qquad (4.46)
for every \varepsilon > 0. (4.43) implies that Q_{n,2} and Q_{n,3} are asymptotically negligible in probability,
(4.44) shows that the summands \eta_j in Q_{n,1} are asymptotically independent, and (4.45) and
(4.46) are the standard Lindeberg-Feller conditions for the asymptotic normality of Q_{n,1} in the
independent setup.
We first establish (4.43). For this purpose, we choose the large block size. Condition A.2b
implies that there is a sequence of positive constants \gamma_n \to \infty such that
\gamma_n s_n = o\left( \sqrt{n h_n} \right) \quad \text{and} \quad \gamma_n (n/h_n)^{1/2}\, \alpha(s_n) \to 0. \qquad (4.47)
Define the large block size r_n by r_n = \lfloor (n h_n)^{1/2}/\gamma_n \rfloor and the small block size s_n. Then it
can easily be shown from (4.47) that, as n \to \infty,
s_n/r_n \to 0, \qquad r_n/n \to 0, \qquad r_n (n h_n)^{-1/2} \to 0, \qquad (4.48)
and
(n/r_n)\, \alpha(s_n) \to 0. \qquad (4.49)
Observe that
E[Q_{n,2}]^2 = \sum_{j=0}^{q-1} \mathrm{Var}(\xi_j) + 2 \sum_{0 \le i < j \le q-1} \mathrm{Cov}(\xi_i, \xi_j) \equiv I_1 + I_2. \qquad (4.50)
It follows from stationarity and Lemma 4.1 that
I_1 = q_n\, \mathrm{Var}(\xi_1) = q_n\, \mathrm{Var}\left( \sum_{j=1}^{s_n} Z_{n,j} \right) = q_n\, s_n\, [\theta^2(u_0) + o(1)]. \qquad (4.51)
Next consider the second term I_2 on the right-hand side of (4.50). Let r_j^* = j (r_n + s_n); then
r_j^* - r_i^* \ge r_n for all j > i, and we thus have
|I_2| \le 2 \sum_{0 \le i < j \le q-1} \sum_{j_1=1}^{s_n} \sum_{j_2=1}^{s_n} |\mathrm{Cov}(Z_{n, r_i^* + r_n + j_1}, Z_{n, r_j^* + r_n + j_2})|
\le 2 \sum_{j_1=1}^{n - r_n} \sum_{j_2=j_1 + r_n}^{n} |\mathrm{Cov}(Z_{n, j_1}, Z_{n, j_2})|.
By stationarity and Lemma 4.1, one obtains
|I_2| \le 2 n \sum_{j=r_n+1}^{n} |\mathrm{Cov}(Z_{n,1}, Z_{n,j})| = o(n). \qquad (4.52)
Hence, by (4.48)-(4.52), we have
\frac{1}{n} E[Q_{n,2}]^2 = O\left( q_n s_n n^{-1} \right) + o(1) = o(1). \qquad (4.53)
It follows from stationarity, (4.48), and Lemma 4.1 that
\mathrm{Var}[Q_{n,3}] = \mathrm{Var}\left( \sum_{j=1}^{n - q_n(r_n + s_n)} Z_{n,j} \right) = O\left( n - q_n(r_n + s_n) \right) = o(n). \qquad (4.54)
Combining (4.48), (4.53), and (4.54), we establish (4.43). As for (4.45), by stationarity,
(4.48), (4.49), and Lemma 4.1, it is easily seen that
\frac{1}{n} \sum_{j=0}^{q_n - 1} E\left[ \eta_j^2 \right] = \frac{q_n}{n} E\left[ \eta_1^2 \right] = \frac{q_n r_n}{n} \cdot \frac{1}{r_n} \mathrm{Var}\left( \sum_{j=1}^{r_n} Z_{n,j} \right) \to \theta^2(u_0).
To establish (4.44), we use Lemma 1.1 of Volkonskii and Rozanov (1959) (see also Ibragimov
and Linnik 1971, p. 338) to obtain
\left| E[\exp(i t Q_{n,1})] - \prod_{j=0}^{q_n - 1} E[\exp(i t \eta_j)] \right| \le 16\, (n/r_n)\, \alpha(s_n),
which tends to 0 by (4.49).
It remains to establish (4.46). For this purpose, we use Theorem 4.1 of Shao and Yu
(1996) and Condition A.2 to obtain
E\left[ \eta_1^2\, I\left\{ |\eta_1| \ge \varepsilon\, \theta(u_0)\, \sqrt{n} \right\} \right] \le C\, n^{1 - \delta/2}\, E\left[ |\eta_1|^{\delta} \right] \le C\, n^{1 - \delta/2}\, r_n^{\delta/2} \left\{ E\left[ |Z_{n,0}|^{\delta^*} \right] \right\}^{\delta/\delta^*}. \qquad (4.55)
As in (4.37),
E\left[ |Z_{n,0}|^{\delta^*} \right] \le C\, h^{1 - \delta^*/2}. \qquad (4.56)
Therefore, by (4.55) and (4.56),
E\left[ \eta_1^2\, I\left\{ |\eta_1| \ge \varepsilon\, \theta(u_0)\, \sqrt{n} \right\} \right] \le C\, n^{1 - \delta/2}\, r_n^{\delta/2}\, h^{(2 - \delta^*)\delta/(2\delta^*)}. \qquad (4.57)
Thus, by (4.39) and the definition of r_n, and using Conditions A.2c and A.2d, we obtain
\frac{1}{n} \sum_{j=0}^{q-1} E\left[ \eta_j^2\, I\left\{ |\eta_j| \ge \varepsilon\, \theta(u_0)\, \sqrt{n} \right\} \right] \le C\, \gamma_n^{1 - \delta/2}\, n^{1/2 - \delta/4}\, h^{\delta/\delta^* - 1/2 - \delta/4} \to 0 \qquad (4.58)
because \gamma_n \to \infty. This completes the proof of the theorem.
4.5.8 Monte Carlo Simulations and Applications
1. Applications to Time Series
See Cai, Fan and Yao (2000) for the detailed Monte Carlo simulation results and applications.
2. Boston Housing Data
1. Description of Data
The well-known Boston house price data set, which can be downloaded from
http://lib.stat.cmu.edu/datasets/boston, consists of 14 variables, collected on each
of 506 different houses from a variety of locations. The Boston house-price data set was
used originally by Harrison and Rubinfeld (1978) and was re-analyzed in Belsley, Kuh
and Welsch (1980) using various transformations, given in the table on pages 244-261 of that book.
The variables, denoted by X_1, ..., X_13 and Y, are, in order:
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centers
RAD index of accessibility to radial highways
TAX full-value property-tax rate per 10,000USD
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT lower status of the population
MEDV Median value of owner-occupied homes in $1000s
The dependent variable is Y, the median value of owner-occupied homes in $1,000s
(house price). The major factors possibly affecting the house prices used in the literature
are: X_13 = the proportion of population of lower educational status, X_6 = the average number
of rooms per house, X_1 = the per capita crime rate, X_10 = the full property tax rate, and
X_11 = the pupil/teacher ratio. For a complete description of all 14 variables, see Harrison
and Rubinfeld (1978), and see Gilley and Pace (1996) for corrections.
2. Linear Models
Harrison and Rubinfeld (1978) were the first to analyze this data set, using a standard regres-
sion model of Y on all 13 variables, including some higher order terms and transformations
of Y and the X_j's. The purpose of that study was to examine the effects of pollution
on housing prices via a hedonic pricing methodology. Belsley, Kuh and Welsch
(1980) used this data set to illustrate the effects of using robust regression and outlier
detection strategies. From these results, we might conclude that the model might not be
linear and there might exist outliers. Also, Pace and Gilley (1997) added a georeferencing
idea (spatial statistics) and used a spatial estimation method on this data set.
Exercise: Please use all possible methods to explore this dataset and determine
the best linear model you can obtain.
3. Fit a Varying-Coefficient Model
Recently, Şentürk and Müller (2003) studied the correlation between the house price Y and
the crime rate X_1, adjusted by the confounding variable X_13, through a varying coefficient
model, and they concluded that the expected effect of increasing crime rate on
declining house prices seems to be observed only for lower educational status
neighborhoods in Boston. Finally, it is surprising that all the existing nonparametric
models mentioned above did not include the crime rate X_1, which may be an important
factor affecting the housing price, and did not consider interaction terms such as that between X_13
and X_1.
See the paper by Fan and Huang (2005) for fitting a varying coefficient model to the
Boston housing data.
Exercise: Please fit a varying coefficient model to the Boston housing data.
4.6 Additive Model
4.6.1 Model
In this section, we use the notation from Cai (2002). Let \{X_t, Y_t, Z_t\}_{t=-\infty}^{\infty} be jointly
stationary processes, where X_t and Y_t take values in \mathbb{R}^p and \mathbb{R}^q with p, q > 0, respectively.
The regression surface is defined by
m(x, y) = E\left( Z_t \mid X_t = x, Y_t = y \right). \qquad (4.59)
Here, it is assumed that E|Z_t| < \infty. Note that the regression function m(\cdot, \cdot) defined in
(4.59) can identify only the sum
m(x, y) = \mu + g_1(x) + g_2(y). \qquad (4.60)
Such a decomposition holds, for example, for the following nonlinear additive autoregressive
model with exogenous variables (ARX):
Y_t = \mu + g_1(X_{t-j_1}, \ldots, X_{t-j_p}) + g_2(Y_{t-i_1}, \ldots, Y_{t-i_q}) + \varepsilon_t,
X_{t-j_1} = g_3(X_{t-j_2}, \ldots, X_{t-j_p}) + \eta_t.
For detailed discussions of the ARX model, the reader is referred to the papers by Masry
and Tjøstheim (1997) and Cai and Masry (2000). For identifiability, it is assumed that
E\{g_1(X_t)\} = 0 and E\{g_2(Y_t)\} = 0. Then, the projection of m(x, y) in the g_1(x)-direction
is defined by
E\{m(x, Y_t)\} = \mu + g_1(x) + E\{g_2(Y_t)\} = \mu + g_1(x). \qquad (4.61)
Clearly, g_1(\cdot) can be identified up to an additive constant, and g_2(\cdot) can be retrieved likewise.
A thorough discussion of additive time series models defined in (4.60) can be found
in Chen and Tsay (1993). Additive components can be estimated with a one-dimensional
nonparametric rate. Several methods have been proposed to estimate the additive components.
For example, Chen and Tsay (1993) used iterative backfitting procedures,
such as the ACE algorithm and the BRUTO approach; see Hastie and Tibshirani (1990)
for details. But their asymptotic properties are not well understood, due to the implicit
definition of the resulting estimators. To attenuate the drawbacks of iterative procedures,
Auestad and Tjøstheim (1991) and Tjøstheim and Auestad (1994a) proposed a direct method
based on an average regression surface idea, referred to as the projection method in Tjøstheim
and Auestad (1994a) for time series data. As pointed out by Cai and Fan (2000), a direct
method has some advantages: it does not rely on iterations, it makes computation
fast, and, more importantly, it allows an asymptotic analysis. Finally, the projection method
was extended to nonlinear ARX models by Masry and Tjøstheim (1997) using the kernel
method and by Cai and Masry (2000) coupled with the local polynomial approach. It should be
remarked that the projection method, under the name of marginal integration, was proposed
independently by Newey (1994) and Linton and Nielsen (1995) for iid samples, and since then
some important progress has been made by several authors. For example, by combining
marginal integration with one-step backfitting, Linton (1997, 2000) presented an efficient
estimator, Mammen, Linton, and Nielsen (1999) established rigorously the asymptotic theory
of the backfitting, Cai and Fan (2000) considered estimating each component using the
weighted projection method coupled with local linear fitting in an efficient way, and
Sperlich, Tjøstheim, and Yang (2002) extended the efficient method to models with simple
interactions.
The projection method has some disadvantages despite the aforementioned merits.
It may not be efficient if the covariates (endogenous or exogenous variables) are strongly
correlated, which is particularly relevant for autoregressive models. The intuitive interpretation
is that the additive components are not orthogonal. To overcome this shortcoming, two efficient
estimation methods have been proposed in the literature. The first is the weight function
procedure, proposed by Fan, Härdle, and Mammen (1998) for iid samples and extended to
time series situations by Cai and Fan (2000). With an appropriate choice of the weight
function, additive components can be efficiently estimated in the sense that an additive
component can be estimated with the same asymptotic bias and variance as if the rest of the
components were known. The second combines the marginal integration with one-step
backfitting, introduced by Linton (1997, 2000) for iid samples and extended by Sperlich,
Tjøstheim, and Yang (2002) to additive models with simple interactions, but this method
has not been advocated for time series situations. However, there has not been any attempt
to discuss bandwidth selection for the projection method and its variations in the literature,
due to their complexity. In practice, one bandwidth is usually used for all components,
although Cai and Fan (2000) argued that different bandwidths might be used theoretically
to deal with the situation in which the additive components possess different degrees of smoothness.
Therefore, the projection method may not be optimal in practice, in the sense that one
bandwidth is used.
To estimate the unknown additive components in (4.60) efficiently, following the spirit of the
marginal integration with one-step backfitting proposed by Linton (1997) for iid samples, I
use a two-stage method, due to Linton (2000), coupled with the local linear (polynomial)
method, which has some attractive properties, such as mathematical efficiency, bias reduction
and adaptation to edge effects (see Fan and Gijbels, 1996). The basic idea of the two-stage
approach is described as follows. At the first stage, one obtains initial estimated values
for all components. More precisely, the idea for estimating any additive component is first to
estimate the high-dimensional regression surface directly by the local linear method and then to
average the regression surface over the remaining variables to stabilize the variance. Such an initial
estimate, in general, is under-smoothed, so that its bias is asymptotically negligible.
At the second stage, the local linear (polynomial) technique is used again to estimate any
additive component by using the initial estimated values of the remaining components. In such
a way, it is shown that the estimate at the second stage is not only efficient, in the sense of
being equivalent to a procedure based on knowing the other components, but also makes the
bandwidth selection much easier. Note that this technique is not novel to this chapter, since
the two-stage method was first used by Linton (1997, 2000) for iid samples, but many details
and insights are.
4.6.2 Backfitting Algorithm
The building block of the generalized additive model algorithm is the scatterplot smoother.
We will first describe scatterplot smoothing in a simple setting, and then indicate how it is
used in generalized additive modelling. Here y is a response or outcome variable, and x is
a prognostic factor. We wish to fit a smooth curve f(x) that summarizes the dependence
of y on x. If we were to find the curve that simply minimizes \sum_{i=1}^{n} [y_i - f(x_i)]^2, the result
would be an interpolating curve that would not be smooth at all. The cubic spline smoother
imposes smoothness on f(x). We seek the function f(x) that minimizes
\sum_{i=1}^{n} [y_i - f(x_i)]^2 + \lambda \int [f''(x)]^2\, dx. \qquad (4.62)
Notice that \int [f''(x)]^2\, dx measures the "wiggliness" of the function f(x): linear f(x)'s have
\int [f''(x)]^2\, dx = 0, while non-linear f's produce values bigger than zero. \lambda is a non-negative
smoothing parameter that must be chosen by the data analyst. It governs the tradeoff
between the goodness of fit to the data (as measured by the residual sum of squares) and the
wiggliness of the function. Larger values of \lambda force f(x) to be smoother.
For any value of \lambda, the solution to (4.62) is a cubic spline, i.e., a piecewise cubic polynomial
with pieces joined at the unique observed values of x in the dataset. Fast and stable numerical
procedures are available for computation of the fitted curve. What value of \lambda do we use in
practice? In fact, it is not convenient to express the desired smoothness of f(x) in terms
of \lambda, as the meaning of \lambda depends on the units of the prognostic factor x. Instead, it is
possible to define an "effective number of parameters" or "degrees of freedom" of a
cubic spline smoother, and then use a numerical search to determine the value of \lambda that yields
this number. In practice, if we chose the effective number of parameters to be 5, roughly
speaking, this means that the complexity of the curve is about the same as that of a polynomial
regression of degree 4. However, the cubic spline smoother "spreads out" its parameters in
a more even manner, and hence is much more flexible than polynomial regression. Note
that the degrees of freedom of a smoother need not be an integer.
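In R, this degrees-of-freedom parametrization is available directly through smooth.spline(); a short illustration follows, where the simulated data and the choice df = 5 are assumptions of the example.
# Illustrative cubic smoothing spline with the smoothness given through the
# effective degrees of freedom rather than lambda.
set.seed(9)
x <- runif(100)
y <- sin(2*pi*x) + rnorm(100, sd = 0.3)
fit <- smooth.spline(x, y, df = 5)     # about as complex as a degree-4 polynomial fit
plot(x, y)
lines(predict(fit, seq(0, 1, length = 200)))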
The above discussion tells how to fit a curve to a single prognostic factor. With multiple
prognostic factors, if x_{ij} denotes the value of the jth prognostic factor for the ith observation,
we fit the additive model
y_i = \sum_{j=1}^{d} f_j(x_{ij}) + \varepsilon_i.
A criterion like (4.62) can be specified for this problem, and a simple iterative procedure exists
for estimating the f_j's. We apply a cubic spline smoother to the partial residuals
y_i - \sum_{j \ne k} \hat f_j(x_{ij}) as a function of x_{ik}, for each prognostic factor in turn. The process
continues until the estimates \hat f_j(x) stabilize. This procedure is known as backfitting, and the
resulting fit is analogous to a multiple regression fit for linear models.
To fit an additive model or a partially additive model in R, the function is gam() in
the package gam. For details, please look at the help command help(gam) after loading
the package gam [library(gam)]. Note that the function gam() allows one to fit a semi-
parametric additive model such as
Y = \beta^T X + \sum_{j=1}^{p} g_j(Z_j) + \varepsilon,
which can be done by specifying some components to enter without smoothing.
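The following short R sketch illustrates the two calls just described on simulated data; the data-generating model, the df values, and the availability of the gam package are assumptions of this example.
# Illustrative additive and partially additive (semi-parametric) fits with gam().
library(gam)
set.seed(10)
n <- 400
z1 <- runif(n); z2 <- runif(n); x1 <- rnorm(n)
y <- 0.5*x1 + sin(2*pi*z1) + z2^2 + rnorm(n, sd = 0.3)
fit.add <- gam(y ~ s(z1, df = 4) + s(z2, df = 4) + s(x1, df = 4))   # fully additive
fit.semi <- gam(y ~ x1 + s(z1, df = 4) + s(z2, df = 4))             # x1 enters linearly
summary(fit.semi)
par(mfrow = c(1, 2)); plot(fit.semi, se = TRUE)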
4.6.3 Projection Method
This section is devoted to a brief review of the projection method and discusses its merits
and disadvantages.
It is assumed that all additive components have continuous second partial derivatives,
so that m(u, v) can be locally approximated by a linear term in a neighborhood of (x, y),
namely, m(u, v) \approx \beta_0 + \beta_1^T (u - x) + \beta_2^T (v - y), with the \beta_j depending on x and y, where
\beta_1^T denotes the transpose of \beta_1.
Let K(\cdot) and L(\cdot) be symmetric kernel functions on \mathbb{R}^p and \mathbb{R}^q, respectively, and let h_{11} =
h_{11}(n) > 0 and h_{12} = h_{12}(n) > 0 be bandwidths in the step of estimating the regression
surface. Here, to handle various degrees of smoothness, Cai and Fan (2000) propose using h_{11}
and h_{12} differently, although the implementation may not be easy in practice. The reader is
referred to the paper by Cai and Fan (2000) for details. Given observations \{X_t, Y_t, Z_t\}_{t=1}^{n},
let \{\hat\beta_j\} be the minimizer of the following locally weighted least squares criterion:
\sum_{t=1}^{n} \left[ Z_t - \beta_0 - \beta_1^T (X_t - x) - \beta_2^T (Y_t - y) \right]^2 K_{h_{11}}(X_t - x)\, L_{h_{12}}(Y_t - y),
where K_h(\cdot) = K(\cdot/h)/h^p and L_h(\cdot) = L(\cdot/h)/h^q. Then, the local linear estimator of the
regression surface m(x, y) is \hat m(x, y) = \hat\beta_0. By computing the sample average of \hat m(\cdot, \cdot)
based on (4.61), the projection estimators of g_1(\cdot) and g_2(\cdot) are defined as, respectively,
\hat g_1(x) = \frac{1}{n} \sum_{t=1}^{n} \hat m(x, Y_t) - \hat\mu, \quad \text{and} \quad \hat g_2(y) = \frac{1}{n} \sum_{t=1}^{n} \hat m(X_t, y) - \hat\mu,
where \hat\mu = n^{-1} \sum_{t=1}^{n} Z_t. Under some regularity conditions, by using the same arguments
as those employed in the proof of Theorem 3 of Cai and Masry (2000), it can be shown
(although this is neither easy nor short) that the asymptotic bias and asymptotic variance of \hat g_1(x)
are, respectively, h_{11}^2\, \mathrm{tr}\{\mu_2(K)\, g_1''(x)\}/2 and v_1(x) = \nu_0(K)\, A(x), where
A(x) = \int p_2^2(y)\, \sigma^2(x, y)\, p^{-1}(x, y)\, dy \quad \text{and} \quad \sigma^2(x, y) = \mathrm{Var}\left( Z_t \mid X_t = x, Y_t = y \right).
Here, p(x, y) stands for the joint density of X_t and Y_t, p_1(x) denotes the marginal density of
X_t, p_2(y) is the marginal density of Y_t, \nu_0(K) = \int K^2(u)\, du, and \mu_2(K) = \int u u^T K(u)\, du.
The foregoing method has some advantages: it is easy to understand, it makes
computation fast, and it allows an asymptotic analysis. However, it can be quite inefficient in
an asymptotic sense. To demonstrate this, let us consider the ideal situation in which g_2(\cdot)
and \mu are known. In such a case, one can estimate g_1(\cdot) by directly regressing the partial
error \tilde Z_t = Z_t - \mu - g_2(Y_t) on X_t, and such an ideal estimator is optimal in an asymptotic
minimax sense (see, e.g., Fan and Gijbels, 1996). The asymptotic bias for the ideal estimator
is h_{11}^2\, \mathrm{tr}\{\mu_2(K)\, g_1''(x)\}/2 and the asymptotic variance is
v_0(x) = \nu_0(K)\, B(x) \quad \text{with} \quad B(x) = p_1^{-1}(x)\, E\left[ \sigma^2(X_t, Y_t) \mid X_t = x \right] \qquad (4.63)
(see, e.g., Masry and Fan, 1997). It is clear that v_1(x) = v_0(x) if X_t and Y_t are independent.
If X_t and Y_t are correlated and \sigma^2(x, y) is a constant \sigma^2, it follows from the Cauchy-
Schwarz inequality that
B(x) = \frac{\sigma^2}{p_1(x)} \left[ \int p^{1/2}(y \mid x)\, \frac{p_2(y)}{p^{1/2}(y \mid x)}\, dy \right]^2 \le \frac{\sigma^2}{p_1(x)} \int \frac{p_2^2(y)}{p(y \mid x)}\, dy = A(x),
which implies that the ideal estimator always has smaller asymptotic variance than the pro-
jection estimator, although both have the same bias. This suggests that the projection method
could lead to inefficient estimation of g_1(\cdot) and g_2(\cdot) when X_t and Y_t are serially correlated,
which is particularly relevant for autoregressive models. To alleviate this shortcoming, I
propose the two-stage approach described next.
4.6.4 Two-Stage Procedure
The two-stage method, due to Linton (1997, 2000), is introduced here. The basic idea is to get an
initial estimate of g_2(\cdot) using a small bandwidth h_{12}. The initial estimate can be obtained
by the projection method, and h_{12} can be chosen so small that the bias of estimating g_2(\cdot)
is asymptotically negligible. Then, using the partial residuals Z_t^* = Z_t - \hat g_2(Y_t), we
apply the local linear regression technique to the pseudo regression model
Z_t^* = g_1(X_t) + \varepsilon_t^*
to estimate g_1(\cdot). This leads naturally to the weighted least-squares problem
\sum_{t=1}^{n} \left[ Z_t^* - \delta_1 - \delta_2^T (X_t - x) \right]^2 J_{h_2}(X_t - x), \qquad (4.64)
where J(\cdot) is a kernel function on \mathbb{R}^p and h_2 = h_2(n) > 0 is the bandwidth at the second
stage. The advantage of this is twofold: the bandwidth h_2 can now be selected purposely for
estimating g_1(\cdot) only, and any bandwidth selection technique for nonparametric regression can
be applied here. Minimizing (4.64) with respect to \delta_1 and \delta_2 gives the two-stage estimate
of g_1(x), denoted by \hat g_1(x) = \hat\delta_1, where \hat\delta_1 and \hat\delta_2 are the minimizers of (4.64).
It is shown in Theorem 4.3, which follows, that under some regularity conditions, the
asymptotic bias and variance of the two-stage estimate \hat g_1(x) are the same as those for the
ideal estimator, provided that the initial bandwidth h_{12} satisfies h_{12} = o(h_2).
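A small simulation sketch of the two stages is given below for p = q = 1; the data-generating model, the kernels, the subsampling used to speed up the stage-one average, and all bandwidths are assumptions of this illustration, not recommended settings for the procedure.
# Illustrative two-stage estimation of g_1 in Z = mu + g_1(X) + g_2(Y) + e:
# an undersmoothed projection-type initial estimate of g_2, then a local
# linear fit of the partial residuals on X.
set.seed(12)
n <- 200
X <- runif(n, -1, 1); Yv <- runif(n, -1, 1)
g1 <- function(x) sin(pi*x); g2 <- function(y) y^2 - 1/3
Z <- 1 + g1(X) + g2(Yv) + rnorm(n, sd = 0.3)
surf <- function(x, y, hx, hy) {                 # bivariate local linear fit at (x, y)
  w <- dnorm((X - x)/hx)*dnorm((Yv - y)/hy)
  coef(lm(Z ~ I(X - x) + I(Yv - y), weights = w))[1]
}
# Stage 1: projection estimate of g_2 with a small bandwidth h12 in the Y direction;
# the average over x uses a subsample of the X_t only to keep this sketch fast.
h11 <- 0.3; h12 <- 0.1
Xsub <- X[seq(1, n, by = 4)]
g2.hat <- sapply(Yv, function(y) mean(sapply(Xsub, function(x) surf(x, y, h11, h12))) - mean(Z))
# Stage 2: local linear regression of the partial residuals Z - g2.hat on X.
Zstar <- Z - g2.hat
h2 <- 0.25
grid <- seq(-0.9, 0.9, length = 37)
g1.hat <- sapply(grid, function(x0) {
  w <- dnorm((X - x0)/h2)
  coef(lm(Zstar ~ I(X - x0), weights = w))[1] - mean(Z)
})
plot(grid, g1.hat, type = "l", xlab = "x", ylab = "estimate of g_1")
lines(grid, g1(grid), lty = 2)                   # true component for comparison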
Sampling Properties
To establish the asymptotic normality of the two-stage estimator, it is assumed that the
initial estimator satisfies a linear approximation; namely,
\hat g_2(Y_t) - g_2(Y_t) \approx \frac{1}{n} \sum_{i=1}^{n} L_{h_{12}}(Y_i - Y_t)\, \pi(X_i, Y_t)\, \varepsilon_i + \frac{1}{2}\, h_{12}^2\, \mathrm{tr}\left\{ \mu_2(L)\, g_2''(Y_t) \right\}, \qquad (4.65)
where \varepsilon_t = Z_t - m(X_t, Y_t) and \pi(x, y) = p_1(x)/p(x, y). Note that under some regularity
conditions, by following the same arguments as in Masry (1996), one might show (although
the proof is not easy, quite lengthy, and tedious) that (4.65) holds. Note that this assumption
is also imposed in Linton (2000) for iid samples to simplify the proof of the asymptotic results
for the two-stage estimator. Now, the asymptotic normality for the two-stage estimator is
stated here; its proof can be found in Cai (2002).
THEOREM 4.3. Under (4.65) and Assumptions A1-A9 stated in Cai (2002), if the band-
widths h_{12} and h_2 are chosen such that h_{12} \to 0, n h_{12}^q \to \infty, h_2 \to 0, and n h_2^p \to \infty as
n \to \infty, then
\sqrt{n h_2^p} \left\{ \hat g_1(x) - g_1(x) - \mathrm{bias}(x) + o_p\left( h_{12}^2 + h_2^2 \right) \right\} \stackrel{D}{\to} N\left( 0, v_0(x) \right),
where the asymptotic bias is
\mathrm{bias}(x) = \frac{h_2^2}{2}\, \mathrm{tr}\left\{ \mu_2(J)\, g_1''(x) \right\} - \frac{h_{12}^2}{2}\, \mathrm{tr}\left\{ \mu_2(L)\, E\left( g_2''(Y_t) \mid X_t = x \right) \right\}
and the asymptotic variance is v_0(x) = \nu_0(J)\, B(x).
We remark that, by Theorem 4.3, the asymptotic variance of the two-stage estimator is
independent of the initial bandwidths. Thus, the initial bandwidths should be chosen as
small as possible. This is another benefit of using the two-stage procedure: the bandwidth
selection problem becomes relatively easy. In particular, when h_{12} = o(h_2), the bias from
the initial estimation is asymptotically negligible. For the ideal situation that g_2(\cdot)
is known, Masry and Fan (1997) show that, under some regularity conditions, the optimal
estimate of g_1(x), denoted by \hat g_1^*(x), obtained by using (4.64) with the partial residual Z_t^*
replaced by the partial error \tilde Z_t = Z_t - g_2(Y_t), is asymptotically normally distributed,
\sqrt{n h_2^p} \left\{ \hat g_1^*(x) - g_1(x) - \frac{h_2^2}{2}\, \mathrm{tr}\left\{ \mu_2(J)\, g_1''(x) \right\} + o_p(h_2^2) \right\} \stackrel{D}{\to} N\left( 0, v_0(x) \right).
This, in conjunction with Theorem 4.3, shows that the two-stage estimator and the ideal
estimator share the same asymptotic bias and variance if h_{12} = o(h_2).
4.6.5 Monte Carlo Simulations and Applications
See the paper by Cai (2002) for the detailed Monte Carlo simulation results and applications.
4.6.6 New Developments
See the paper by Mammen, Linton and Nielsen (1999).
4.6.7 Additive Model for the Boston House Price Data
There have been several papers devoted to the analysis of this dataset using non-
parametric methods. For example, Breiman and Friedman (1985), Pace (1993), Chaudhuri,
Doksum and Samarov (1997), and Opsomer and Ruppert (1998) used four covariates: X_6,
X_10, X_11 and X_13, or their transformations (including a transformation of Y), to fit the
data through a mean additive regression model such as
\log(Y) = \mu + g_1(X_6) + g_2(X_{10}) + g_3(X_{11}) + g_4(X_{13}) + \varepsilon, \qquad (4.66)
where the additive components g_j(\cdot) are unspecified smooth functions. Pace (1993) and
Chaudhuri, Doksum and Samarov (1997) also considered the nonparametric estimation of the
first derivative of each additive component, which measures how much the response changes as
one covariate is perturbed while the other covariates are held fixed; see Chaudhuri, Doksum
and Samarov (1997). Let us use model (4.66) to fit the Boston house price data. The results
are summarized in Figure 4.3 (the R code can be found in Section 4.7.2). Also, we fit a
semi-parametric additive model,
\log(Y) = \mu + g_1(X_6) + \beta_2 X_{10} + \beta_3 X_{11} + \beta_4 X_{13} + \varepsilon. \qquad (4.67)
The results are summarized in Figure 4.4 (the R code can be found in Section 4.7.2).
4.7 Computer Code
4.7.1 Example 4.1
# 04-28-2007
graphics.off() # clean the previous graphs on the screen
###############
# Example 4.1
################
[Figure 4.3 about here: The results from model (4.66). The four panels show the estimated additive components lo(x6), lo(x10), lo(x11), and lo(x13) plotted against x6, x10, x11, and x13, respectively.]
##########################################################################
z1=read.table(file="c:/res-teach/xiada/teaching05-07/data/ex4-1.txt")
# data: weekly 3-month Treasury bill from 1970 to 1997
x=z1[,4]/100
n=length(x)
y=diff(x) # Delta x_t=x_t-x_{t-1}
x=x[1:(n-1)]
n=n-1
x_star=(x-mean(x))/sqrt(var(x))
z=seq(min(x),max(x),length=50)
win.graph()
#postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-4.1.eps",
[Figure 4.4 about here: (a) Residual plot for model (4.66). (b) Plot of the estimated component g_1(x6) versus x6. (c) Residual plot for model (4.67). (d) Density estimate of Y.]
# horizontal=F,width=6,height=6)
par(mfrow=c(2,2),mex=0.4,bg="light blue")
scatter.smooth(x,y,span=1/10,ylab="",xlab="x(t-1)",evaluation=60)
title(main="(a) y(t) vs x(t)",col.main="red")
scatter.smooth(x,abs(y),span=1/10,ylab="",xlab="x(t-1)",evaluation=60)
title(main="(b) |y(t)| vs x(t)",col.main="red")
scatter.smooth(x,y^2,span=1/10,ylab="",xlab="x(t-1)",evaluation=60)
title(main="(c) y(t)^2 vs x(t)",col.main="red")
#dev.off()
#######################################################################
#########################
# Nonparametric Fitting #
#########################
#########################################################
# Define the Epanechnikov kernel function
kernel<-function(x){0.75*(1-x^2)*(abs(x)<=1)}
###############################################################
# Define the kernel density estimator
kernden=function(x,z,h,ker){
# parameters: x=variable; h=bandwidth; z=grid point; ker=kernel
nz<-length(z)
nx<-length(x)
x0=rep(1,nx*nz)
dim(x0)=c(nx,nz)
x1=t(x0)
x0=x*x0
x1=z*x1
x0=x0-t(x1)
if(ker==1){x1=kernel(x0/h)} # Epanechnikov kernel
if(ker==0){x1=dnorm(x0/h)} # normal kernel
f1=apply(x1,2,mean)/h
return(f1)
}
###############################################################
# Define the local constant estimator
local.constant=function(y,x,z,h,ker){
# parameters: x=variable; h=bandwidth; z=grid point; ker=kernel
nz<-length(z)
nx<-length(x)
x0=rep(1,nx*nz)
dim(x0)=c(nx,nz)
x1=t(x0)
x0=x*x0
x1=z*x1
x0=x0-t(x1)
if(ker==1){x1=kernel(x0/h)} # Epanechnikov kernel
if(ker==0){x1=dnorm(x0/h)} # normal kernel
x2=y*x1
f1=apply(x1,2,mean)
f2=apply(x2,2,mean)
f3=f2/f1
return(f3)
}
####################################################################
# Define the local linear estimator
local.linear<-function(y,x,z,h){
# parameters: y=response, x=design matrix; h=bandwidth; z=grid point
nz<-length(z)
ny<-length(y)
beta<-rep(0,nz*2)
dim(beta)<-c(nz,2)
for(k in 1:nz){
x0=x-z[k]
w0<-kernel(x0/h)
beta[k,]<-glm(y~x0,weight=w0)$coeff
}
return(beta)
}
##############################################################
h=0.02
# Local constant estimate
mu_hat=local.constant(y,x,z,h,1)
sigma_hat=local.constant(abs(y),x,z,h,1)
sigma2_hat=local.constant(y^2,x,z,h,1)
#win.graph()
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-4.1.eps",
horizontal=F,width=6,height=6)
par(mfrow=c(2,2),mex=0.4,bg="light yellow")
scatter.smooth(x,y,span=1/10,ylab="",xlab="x(t-1)")
points(z,mu_hat,type="l",lty=1,lwd=3,col=2)
title(main="(a) y(t) vs x(t)",col.main="red")
legend(0.04,0.0175,"Local Constant Estimate")
scatter.smooth(x,abs(y),span=1/10,ylab="",xlab="x(t-1)")
points(z,sigma_hat,type="l",lty=1,lwd=3,col=2)
title(main="(b) |y(t)| vs x(t)",col.main="red")
scatter.smooth(x,y^2,span=1/10,ylab="",xlab="x(t-1)")
title(main="(c) y(t)^2 vs x(t)",col.main="red")
points(z,sigma2_hat,type="l",lty=1,lwd=3,col=2)
dev.off()
# Local Linear Estimate
fit2=local.linear(y,x,z,h)
mu_hat=fit2[,1]
fit2=local.linear(abs(y),x,z,h)
sigma_hat=fit2[,1]
fit2=local.linear(y^2,x,z,h)
sigma2_hat=fit2[,1]
#win.graph()
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-4.2.eps",
horizontal=F,width=6,height=6)
par(mfrow=c(2,2),mex=0.4,bg="light green")
scatter.smooth(x,y,span=1/10,ylab="",xlab="x(t-1)")
points(z,mu_hat,type="l",lty=1,lwd=3,col=2)
title(main="(a) y(t) vs x(t)",col.main="red")
legend(0.04,0.0175,"Local Linear Estimate")
scatter.smooth(x,abs(y),span=1/10,ylab="",xlab="x(t-1)")
points(z,sigma_hat,type="l",lty=1,lwd=3,col=2)
title(main="(b) |y(t)| vs x(t)",col.main="red")
scatter.smooth(x,y^2,span=1/10,ylab="",xlab="x(t-1)")
title(main="(c) y(t)^2 vs x(t)",col.main="red")
points(z,sigma2_hat,type="l",lty=1,lwd=3,col=2)
dev.off()
#####################################################################
4.7.2 Codes for Additive Modeling Analysis of Boston Data
The following is the R code for making Figures 4.3 and 4.4.
data=read.table("c:/res-teach/xiada/teaching05-07/data/ex4-2.txt")
y=data[,14]
x1=data[,1]
x6=data[,6]
x10=data[,10]
x11=data[,11]
x13=data[,13]
y_log=log(y)
library(gam)
fit_gam=gam(y_log~lo(x6)+lo(x10)+lo(x11)+lo(x13))
resid=fit_gam$residuals
y_hat=fit_gam$fitted
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-boston1.eps",
horizontal=F,width=6,height=6,bg="light grey")
par(mfrow=c(2,2),mex=0.4)
plot(fit_gam)
title(main="Component of X_13",col.main="red",cex=0.6)
dev.off()
fit_gam1=gam(y_log~lo(x6)+x10+x11+x13)
s1=fit_gam1$smooth[,1] # obtain the smoothed component
resid1=fit_gam1$residuals
y_hat1=fit_gam1$fitted
print(summary(fit_gam1))
postscript(file="c:/res-teach/xiada/teaching05-07/figs/fig-boston2.eps",
horizontal=F,width=6,height=6,bg="light green")
par(mfrow=c(2,2),mex=0.4)
plot(y_hat,resid,type="p",pch="o",ylab="",xlab="y_hat")
title(main="Residual Plot of Additive Model",col.main="red",cex=0.6)
abline(0,0)
plot(x6,s1,type="p",pch="o",ylab="s1(x6)",xlab="x6")
title(main="Component of X_6",col.main="red",cex=0.6)
plot(y_hat1,resid1,type="p",pch="o",ylab="",xlab="y_hat")
title(main="Residual Plot of Model II",col.main="red",cex=0.5)
abline(0,0)
plot(density(y),ylab="",xlab="",main="Density of Y")
dev.off()
4.8 References
Aït-Sahalia, Y. (1996). Nonparametric pricing of interest rate derivative securities. Econometrica, 64, 527-560.

Belsley, D.A., E. Kuh and R.E. Welsch (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Breiman, L. and J.H. Friedman (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580-619.

Cai, Z. (2002). A two-stage approach to additive time series models. Statistica Neerlandica, 56, 415-433.

Cai, Z., M. Das, H. Xiong and X. Wu (2006). Functional-coefficient instrumental variables models. Journal of Econometrics, 133, 207-241.

Cai, Z. and J. Fan (2000). Average regression surface for dependent data. Journal of Multivariate Analysis, 75, 112-142.

Cai, Z., J. Fan and Q. Yao (2000). Functional-coefficient regression models for nonlinear time series. Journal of the American Statistical Association, 95, 941-956.

Cai, Z. and E. Masry (2000). Nonparametric estimation of additive nonlinear ARX time series: Local linear fitting and projection. Econometric Theory, 16, 465-501.

Cai, Z. and R.C. Tiwari (2000). Application of a local linear autoregressive model to BOD time series. Environmetrics, 11, 341-350.

Chaudhuri, P., K. Doksum and A. Samarov (1997). On average derivative quantile regression. The Annals of Statistics, 25, 715-744.

Chen, R. and R. Tsay (1993). Nonlinear additive ARX models. Journal of the American Statistical Association, 88, 310-320.

Engle, R.F., C.W.J. Granger, J. Rice, and A. Weiss (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81, 310-320.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiency. The Annals of Statistics, 21, 196-216.

Fan, J., T. Gasser, I. Gijbels, M. Brockmann and J. Engel (1996). Local polynomial fitting: optimal kernel and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49, 79-99.

Fan, J. and I. Gijbels (1996). Local Polynomial Modeling and Its Applications. London: Chapman and Hall.

Fan, J., N.E. Heckman, and M.P. Wand (1995). Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association, 90, 141-150.

Fan, J. and T. Huang (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli, 11, 1031-1057.

Fan, J. and Q. Yao (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. New York: Springer-Verlag.

Fan, J., Q. Yao and Z. Cai (2003). Adaptive varying-coefficient linear models. Journal of the Royal Statistical Society, Series B, 65, 57-80.

Fan, J. and C. Zhang (2003). A re-examination of diffusion estimators with applications to financial model validation. Journal of the American Statistical Association, 98, 118-134.

Fan, J., C. Zhang and J. Zhang (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics, 29, 153-193.

Gasser, T. and H.-G. Müller (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, 757, 23-28. Springer-Verlag, New York.

Gilley, O.W. and R.K. Pace (1996). On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management, 31, 403-405.

Granger, C.W.J. and T. Teräsvirta (1993). Modelling Nonlinear Economic Relationships. Oxford University Press, Oxford, U.K.

Hall, P. and C.C. Heyde (1980). Martingale Limit Theory and Its Applications. New York: Academic Press.

Hall, P. and I. Johnstone (1992). Empirical functionals and efficient smoothing parameter selection (with discussion). Journal of the Royal Statistical Society, Series B, 54, 475-530.

Harrison, D. and D.L. Rubinfeld (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102.

Hastie, T.J. and R.J. Tibshirani (1990). Generalized Additive Models. Chapman and Hall, London.

Hong, Y. and T.-H. Lee (2003). Inference on predictability of foreign exchange rates via generalized spectrum and nonlinear time series models. The Review of Economics and Statistics, 85, 1048-1062.

Hurvich, C.M., J.S. Simonoff and C.-L. Tsai (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, Series B, 60, 271-293.

Jiang, G.J. and J.L. Knight (1997). A nonparametric approach to the estimation of diffusion processes, with an application to a short-term interest rate model. Econometric Theory, 13, 615-645.

Johannes, M.S. (2004). The statistical and economic role of jumps in continuous-time interest rate models. Journal of Finance, 59, 227-260.

Juhl, T. (2005). Functional coefficient models under unit root behavior. Econometrics Journal, 8, 197-213.

Koenker, R. (2005). Quantile Regression. Cambridge University Press, New York.

Koenker, R. and G.W. Bassett (1978). Regression quantiles. Econometrica, 46, 33-50.

Koenker, R. and G.W. Bassett (1982). Robust tests for heteroscedasticity based on regression quantiles. Econometrica, 50, 43-61.

Kreiss, J.P., M. Neumann and Q. Yao (1998). Bootstrap tests for simple structures in nonparametric time series regression. Unpublished manuscript.

Li, Q., C. Huang, D. Li and T. Fu (2002). Semiparametric smooth coefficient models. Journal of Business and Economic Statistics, 20, 412-422.

Linton, O.B. (1997). Efficient estimation of additive nonparametric regression models. Biometrika, 84, 469-473.

Linton, O.B. (2000). Efficient estimation of generalized additive nonparametric regression models. Econometric Theory, 16, 502-523.

Linton, O.B. and J.P. Nielsen (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82, 93-100.

Mammen, E., O.B. Linton, and J.P. Nielsen (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. The Annals of Statistics, 27, 1443-1490.

Masry, E. and J. Fan (1997). Local polynomial estimation of regression functions for mixing processes. Scandinavian Journal of Statistics, 24, 165-179.

Masry, E. and D. Tjøstheim (1997). Additive nonlinear ARX time series and projection estimates. Econometric Theory, 13, 214-252.

Şentürk, D. and H.-G. Müller (2003). Inference for covariate adjusted regression via varying coefficient models. Forthcoming in Scandinavian Journal of Statistics.

Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and Its Applications, 9, 141-142.

Øksendal, B. (1985). Stochastic Differential Equations: An Introduction with Applications, 3rd edition. New York: Springer-Verlag.

Opsomer, J.D. and D. Ruppert (1998). A fully automated bandwidth selection for additive regression models. Journal of the American Statistical Association, 93, 605-618.

Pace, R.K. (1993). Nonparametric methods with applications to hedonic models. Journal of Real Estate Finance and Economics, 7, 185-204.

Pace, R.K. and O.W. Gilley (1997). Using the spatial configuration of the data to improve estimation. Journal of Real Estate Finance and Economics, 14, 333-340.

Priestley, M.B. and M.T. Chao (1972). Nonparametric function fitting. Journal of the Royal Statistical Society, Series B, 34, 384-392.

Rice, J. (1984). Bandwidth selection for nonparametric regression. The Annals of Statistics, 12, 1215-1230.

Ruppert, D., S.J. Sheather and M.P. Wand (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1257-1270.

Ruppert, D. and M.P. Wand (1994). Multivariate weighted least squares regression. The Annals of Statistics, 22, 1346-1370.

Rousseeuw, P.J. and A.M. Leroy (1987). Robust Regression and Outlier Detection. New York: Wiley.

Shao, Q. and H. Yu (1996). Weak convergence for weighted empirical processes of dependent sequences. The Annals of Probability, 24, 2098-2127.

Sperlich, S., D. Tjøstheim, and L. Yang (2002). Nonparametric estimation and testing of interaction in additive models. Econometric Theory, 18, 197-251.

Stanton, R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. Journal of Finance, 52, 1973-2002.

Sun, Z. (1984). Asymptotic unbiased and strong consistency for density function estimator. Acta Mathematica Sinica, 27, 769-782.

Tjøstheim, D. and B. Auestad (1994a). Nonparametric identification of nonlinear time series: Projections. Journal of the American Statistical Association, 89, 1398-1409.

Tjøstheim, D. and B. Auestad (1994b). Nonparametric identification of nonlinear time series: Selecting significant lags. Journal of the American Statistical Association, 89, 1410-1419.

van Dijk, D., T. Teräsvirta, and P.H. Franses (2002). Smooth transition autoregressive models - a survey of recent developments. Econometric Reviews, 21, 1-47.

Watson, G.S. (1964). Smooth regression analysis. Sankhyā, Series A, 26, 359-372.
Chapter 5
Nonparametric Quantile Models
For details, see the paper by Cai and Xu (2005). If you would like to read the whole paper, you can download it from the web site http://www.wise.xmu.edu.cn/ under the WORKING PAPER column. Next we present only a part of Cai and Xu (2005).
5.1 Introduction
Over the last three decades, quantile regression, also called conditional quantile or regression quantile, introduced by Koenker and Bassett (1978), has been used widely in various disciplines, such as finance, economics, medicine, and biology. It is well known that when the distribution of the data is markedly skewed or the data contain some outliers, the median regression, a special case of quantile regression, is more interpretable and robust than the mean regression. Also, regression quantiles can be used to test heteroscedasticity formally or graphically (Koenker and Bassett, 1982; Efron, 1991; Koenker and Zhao, 1996; Koenker and Xiao, 2002). Although some individual quantiles, such as the conditional median, are sometimes of interest in practice, more often one wishes to obtain a collection of conditional quantiles which can characterize the entire conditional distribution. More importantly, another application of conditional quantiles is the construction of prediction intervals for the next value given a small section of the recent past values in a stationary time series (Granger, White, and Kamstra, 1989; Koenker, 1994; Zhou and Portnoy, 1996; Koenker and Zhao, 1996; Taylor and Bunn, 1999). Also, Granger, White, and Kamstra (1989), Koenker and Zhao (1996), and Taylor and Bunn (1999) considered interval forecasting for parametric autoregressive conditional heteroscedastic (ARCH) type models. For more details about the historical and recent developments of quantile regression with applications to time series data, particularly in finance, see, for example, the papers and books by J.P. Morgan (1995), Duffie and Pan (1997), Jorion (2000), Koenker (2000), Koenker and Hallock (2001), Tsay (2000, 2002), Khindanova and Rachev (2000), and Bao, Lee and Saltoğlu (2001), and the references therein.
Recently, the quantile regression technique has been successfully applied to politics. For example, in the 1992 presidential election, the Democrats used the yearly Current Population Survey data to show that between 1980 and 1992 there was an increase in the number of people in the high-salary category as well as an increase in the number of people in the low-salary category. This phenomenon can be illustrated by using the quantile regression method as follows: compute the 90% and 10% quantile regression functions of salary as a function of time. An increasing 90% quantile regression function and a decreasing 10% quantile regression function correspond to the Democrats' claim that the rich got richer and the poor got poorer during the Republican administrations; see Figure 6.4 in Fan and Gijbels (1996, p. 229).
More importantly, following the regulations of the Bank for International Settlements, many financial institutions have begun to use a uniform measure of market risk called Value-at-Risk (VaR), which can be defined as the maximum potential loss of a specific portfolio over a given horizon. In essence, the interest is to compute an estimate of a lower-tail quantile (with a small probability) of future portfolio returns, conditional on current information. Therefore, the VaR can be regarded as a special application of quantile regression. There is a vast amount of literature in this area; see, to name just a few, J.P. Morgan (1995), Duffie and Pan (1997), Engle and Manganelli (2004), Jorion (2000), Tsay (2000, 2002), Khindanova and Rachev (2000), and Bao, Lee and Saltoğlu (2001), and references therein.
In this chapter, we assume that $\{X_t, Y_t\}_{t=-\infty}^{\infty}$ is a stationary sequence. Denote by $F(y \mid x)$ the conditional distribution of $Y$ given $X = x$, where $X_t = (X_{t1}, \ldots, X_{td})'$, with $'$ denoting the transpose of a matrix or vector, is the associated covariate vector in $\mathbb{R}^d$ with $d \ge 1$, which might be a function of exogenous (covariate) variables or some lagged (endogenous) variables or time $t$. The regression (conditional) quantile function $q_\tau(x)$ is defined as, for any $0 < \tau < 1$,
$$q_\tau(x) = \inf\left\{y \in \mathbb{R}^1 : F(y \mid x) \ge \tau\right\}, \quad \text{or} \quad q_\tau(x) = \operatorname*{argmin}_{a \in \mathbb{R}^1} E\left\{\rho_\tau(Y_t - a) \mid X_t = x\right\}, \qquad (5.1)$$
where $\rho_\tau(y) = y\,(\tau - I_{\{y<0\}})$ with $y \in \mathbb{R}^1$ is called the loss (check) function, and $I_A$ is the indicator function of any set $A$. There are several advantages of using a quantile regression:

- A quantile regression does not require knowing the distribution of the dependent variable.
- It does not require the symmetry of the measurement error.
- It can characterize the heterogeneity.
- It can estimate the mean and variance simultaneously.
- It is a robust procedure.
- There are a lot more.

Having conditioned on the observed characteristics $X_t = x$, based on the Skorohod representation, $Y_t$ and the quantile function $q_\tau(x)$ have the following relationship:
$$Y_t = q(X_t, U_t), \qquad (5.2)$$
where $U_t \mid X_t \sim U(0, 1)$. We will refer to $U_t$ as the rank variable, and note that representation (5.2) is essential to what follows. The rank variable $U_t$ is responsible for heterogeneity of outcomes among individuals with the same observed characteristics $X_t$. It also determines their relative ranking in terms of potential outcomes; hence one may think of the rank $U_t$ as representing some unobserved characteristic. This interpretation makes quantile analysis an interesting tool for describing and learning the structure of heterogeneous effects and controlling for unobserved heterogeneity.

Clearly, the simplest form of model (5.1) is $q_\tau(x) = \beta_\tau' x$, which is called the linear quantile regression model and has been well studied by many authors. For details, see the papers by Duffie and Pan (1997), Koenker (2000), Tsay (2000), Koenker and Hallock (2001), Khindanova and Rachev (2000), Bao, Lee and Saltoğlu (2001), Engle and Manganelli (2004), and references therein.
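For readers who want to try the linear quantile regression model in R, here is a tiny illustration (not taken from Cai and Xu (2005); the data frame mydata with columns y, x1, and x2 is a hypothetical placeholder), assuming the quantreg package is installed:

library(quantreg)
# fit the linear quantile regression q_tau(x) = beta_tau' x at tau = 0.95
fit <- rq(y ~ x1 + x2, tau=0.95, data=mydata)
summary(fit)   # coefficient estimates with confidence information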
In many practical applications, however, the linear quantile regression model might not be rich enough to capture the underlying relationship between the quantile of the response variable and its covariates. Indeed, some components may be highly nonlinear or some covariates may be interactive. To make the quantile regression model more flexible, there is a swiftly growing literature on nonparametric quantile regression. Various smoothing techniques, such as kernel methods, splines, and their variants, have been used to estimate the nonparametric quantile regression for both independent and time series data. For the recent developments and the detailed discussions on theory, methodologies, and applications, see, for example, the papers by He, Ng, and Portnoy (1998), Yu and Jones (1998), He and Ng (1999), He and Portnoy (2000), Honda (2000, 2004), Tsay (2000, 2002), Lu, Hui and Zhao (2000), Khindanova and Rachev (2000), Bao, Lee and Saltoğlu (2001), Cai (2002a), De Gooijer and Gannoun (2003), Horowitz and Lee (2005), Yu and Lu (2004), and Li and Racine (2004), and references therein. In particular, for the univariate case, Honda (2000) and Lu, Hui and Zhao (2000) recently derived the asymptotic properties of the local linear estimator of the quantile regression function under an $\alpha$-mixing condition. For the high dimensional case, however, the aforementioned methods encounter some difficulties, such as the so-called curse of dimensionality, their implementation in practice is not easy, and the visual display is not so useful for exploratory purposes.

To attenuate the above problems, De Gooijer and Zerom (2003), Horowitz and Lee (2005), and Yu and Lu (2004) considered an additive quantile regression model $q_\tau(X_t) = \sum_{k=1}^{d} g_k(X_{tk})$. To estimate each component, for the time series case, De Gooijer and Zerom (2003) first estimated a high dimensional quantile function by inverting the conditional distribution function estimated by a weighted Nadaraya-Watson approach, proposed by Cai (2002a), and then used a projection method to estimate each component, as discussed in Cai and Masry (2000), while Yu and Lu (2004) focused on independent data and used a backfitting algorithm to estimate each component. On the other hand, to estimate each additive component for independent data, Horowitz and Lee (2005) used a two-stage approach consisting of series estimation at the first step and local polynomial fitting at the second step. For independent data, the above model was extended by He, Ng and Portnoy (1998), He and Ng (1999), and He and Portnoy (2000) to include interaction terms by using spline methods.
In this chapter, we adapt another dimension-reduction modelling method to analyze dynamic time series data, termed the smooth (functional or varying) coefficient modelling approach. This approach allows appreciable flexibility in the structure of fitted models. It allows for linearity in some continuous or discrete variables, which can be exogenous or lagged, and for nonlinearity in other variables through the coefficients. In such a way, the model has the ability of capturing individual variations. More importantly, it can ease the so-called curse of dimensionality and combines both additivity and interactivity. A smooth coefficient quantile regression model for time series data takes the following form:
$$q_\tau(U_t, X_t) = \sum_{k=0}^{d} a_k(U_t)\, X_{tk} = X_t'\, a_\tau(U_t), \qquad (5.3)$$
where $U_t$ is called the smoothing variable, which might be one part of $\{X_{t1}, \ldots, X_{td}\}$ or just time or other exogenous variables or the lagged variables, $X_t = (X_{t0}, X_{t1}, \ldots, X_{td})'$ with $X_{t0} \equiv 1$, the $a_k(\cdot)$ are smooth coefficient functions, and $a_\tau(\cdot) = (a_{0,\tau}(\cdot), \ldots, a_{d,\tau}(\cdot))'$. Here, some of the $a_{k,\tau}(\cdot)$ are allowed to depend on $\tau$. For simplicity, we drop $\tau$ from $a_{k,\tau}(\cdot)$ in what follows. It is our interest here to estimate the coefficient functions $a(\cdot)$ rather than the quantile regression surface $q_\tau(\cdot, \cdot)$ itself. Note that model (5.3) was studied by Honda (2004) for the independent sample, but our focus here is on the dynamic model for nonlinear time series, which is more appropriate for economic and financial applications.

The general setting in (5.3) covers many familiar quantile regression models, including the quantile autoregressive model (QAR) proposed by Koenker and Xiao (2004), who applied the QAR model to unit root inference. In particular, it includes a specific class of ARCH models, such as the heteroscedastic linear models considered by Koenker and Zhao (1996). Also, if there is no $X_t$ in the model ($d = 0$), $q_\tau(U_t, X_t)$ becomes $q_\tau(U_t)$, so that model (5.3) reduces to the ordinary nonparametric quantile regression model, which has been studied extensively. For the recent developments, refer to the papers by He, Ng and Portnoy (1998), Yu and Jones (1998), He and Ng (1999), He and Portnoy (2000), Honda (2000), Lu, Hui and Zhao (2000), Cai (2002a), De Gooijer and Zerom (2003), Horowitz and Lee (2005), Yu and Lu (2004), and Li and Racine (2004). If $U_t$ is just time, then the model is called the time-varying coefficient quantile regression model, which is potentially useful to see whether the quantile regression changes over time; cases of practical interest include, for example, the aforementioned illustrative example for the 1992 presidential election and the analysis of the reference growth data by Cole (1994), Wei, Pere, Koenker and He (2006), and Wei and He (2006), and the references therein. However, if $U_t$ is time, the observed time series might not be stationary. Therefore, the treatment of the non-stationary case would require a different approach, so it is beyond the scope of this chapter and deserves further investigation. For more applications, see the work in Xu (2005). Finally, note that the smooth coefficient mean regression model is one of the most popular nonlinear time series models in mean regression and has various applications. For more discussions, refer to the papers by Chen and Tsay (1993), Cai, Fan, and Yao (2000), Cai and Tiwari (2000), Cai (2007), Hong and Lee (2003), and Wang (2003), the book by Tsay (2002), and references therein.
The motivation of this study comes from an analysis of the well-known Boston housing price data, consisting of several variables collected on each of 506 different houses from a variety of locations. The interest is to identify the factors affecting the house price in the Boston area. As argued by Şentürk and Müller (2005), the correlation between the house price and the crime rate can be adjusted by the confounding variable, which is the proportion of the population of lower educational status, through a varying coefficient model, and the expected effect of increasing crime rate on declining house prices seems to be observed only for lower educational status neighborhoods in Boston. The interesting features of this dataset are that the response variable is the median price of a home in a given area and that the distributions of the price and the major covariate (the confounding variable) are left skewed, so that quantile methods are suitable for the analysis of this dataset. Such a problem can therefore be tackled by using model (5.3). In another example, one is interested in exploring the possible nonlinearity feature, heteroscedasticity, and predictability of exchange rates such as the Japanese Yen per US dollar. The detailed analysis of these data sets is reported in Section 3.
5.2 Modeling Procedures
5.2.1 Local Linear Quantile Estimate
Now, we apply the local polynomial method to the smooth coefficient quantile regression model as follows. For the sake of brevity, we only consider the case where the smoothing variable in (5.3) is one-dimensional, denoted by $U_t$ in what follows. Extension to multivariate $U_t$ involves fundamentally no new ideas, although the theory and procedure continue to hold. Note that models with a high dimensional $U_t$ might not be practically useful due to the curse of dimensionality. A local polynomial fitting has several nice properties such as high statistical efficiency in an asymptotic minimax sense, design adaptation, and automatic edge correction (see, e.g., Fan and Gijbels, 1996).
We estimate the functions $a_k(\cdot)$ using the local polynomial regression method from observations $\{(U_t, X_t, Y_t)\}_{t=1}^{n}$. We assume throughout the chapter that the coefficient functions $a(\cdot)$ have the $(q+1)$th derivative, so that for any given grid point $u_0$, $a_k(\cdot)$ can be approximated by a polynomial function in a neighborhood of the given grid point $u_0$ as
$$a(U_t) \approx a(u_0) + a'(u_0)\,(U_t - u_0) + \cdots + a^{(q)}(u_0)\,(U_t - u_0)^q/q!$$
and
$$q_\tau(U_t, X_t) \approx \sum_{j=0}^{q} X_t'\,\beta_j\,(U_t - u_0)^j,$$
where $\beta_j = a^{(j)}(u_0)/j!$. Then, the locally weighted loss function is
$$\sum_{t=1}^{n} \rho_\tau\left( Y_t - \sum_{j=0}^{q} X_t'\,\beta_j\,(U_t - u_0)^j \right) K_h(U_t - u_0), \qquad (5.4)$$
where $K(\cdot)$ is a kernel function, $K_h(x) = K(x/h)/h$, and $h = h_n$ is a sequence of positive numbers tending to zero, which controls the amount of smoothing used in estimation. Solving the minimization problem in (5.4) gives $\widehat{a}(u_0) = \widehat{\beta}_0$, the local polynomial estimate of $a(u_0)$, and $\widehat{a}^{(j)}(u_0) = j!\,\widehat{\beta}_j$ ($j \ge 1$), the local polynomial estimate of the $j$th derivative $a^{(j)}(u_0)$ of $a(u_0)$. By moving $u_0$ along the real line, one obtains the estimate for the entire curve. For various practical applications, Fan and Gijbels (1996) recommended using the local linear fit ($q = 1$). Therefore, for expositional purposes, in what follows, we only consider the case $q = 1$ (local linear fitting).
The programming involved in the local (polynomial) linear quantile estimation is relatively simple and can be modified with little effort from existing programs for a linear quantile model. For example, for each grid point $u_0$, the local linear quantile estimate can be computed with the R package quantreg of Koenker (2004) by setting the covariates as $X_t$ and $X_t\,(U_t - u_0)$ and the weights as $K_h(U_t - u_0)$.
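The following is a minimal sketch of this recipe (not the author's own code): a single local linear quantile fit at one grid point u0, assuming the quantreg package and the Epanechnikov kernel; y is the response vector, X the covariate matrix (without an intercept column), and U the smoothing variable.

library(quantreg)
epa <- function(x) 0.75*(1-x^2)*(abs(x)<=1)   # Epanechnikov kernel
llqr.fit <- function(y, X, U, u0, h, tau){
  X  <- as.matrix(X)
  w  <- epa((U-u0)/h)/h               # kernel weights K_h(U_t-u0)
  use <- w > 0                        # keep only observations inside the window
  Xu <- X*(U-u0)                      # covariates X_t (U_t-u0) for the slope part
  fit <- rq(y[use] ~ X[use,] + Xu[use,], tau=tau, weights=w[use])
  coef(fit)[1:(ncol(X)+1)]            # hat a(u0): intercept a_0(u0) and a_1(u0),...,a_d(u0)
}

Dropping the Xu term in the formula gives the local constant (Nadaraya-Watson type) estimate discussed next.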
Although some modifications are needed, the method developed here for the local linear quantile estimation is applicable to a general local polynomial quantile estimation. In particular, we note that the local constant (Nadaraya-Watson type) quantile estimate of $a(u_0)$, denoted by $\overline{a}(u_0)$, is obtained by minimizing the following objective function:
$$\sum_{t=1}^{n} \rho_\tau\left(Y_t - X_t'\,\beta\right) K_h(U_t - u_0), \qquad (5.5)$$
which is a special case of (5.4) with $q = 0$. We compare $\widehat{a}(u_0)$ and $\overline{a}(u_0)$ theoretically at the end of Section 2.2 and empirically in Section 3.1, and the comparison leads us to suggest that one should use the local linear approach in practice.
5.2.2 Asymptotic Results

We first give some regularity conditions that are sufficient for the consistency and asymptotic normality of the proposed estimators, although they might not be the weakest possible. We introduce the following notation. Denote
$$\Omega(u_0) \equiv E\left[X_t X_t' \mid U_t = u_0\right] \quad \text{and} \quad \Omega^{*}(u_0) \equiv E\left[X_t X_t'\, f_{y|u,x}\left(q_\tau(u_0, X_t)\right) \mid U_t = u_0\right],$$
where $f_{y|u,x}(y)$ is the conditional density of $Y$ given $U$ and $X$. Let $f_u(u)$ denote the marginal density of $U$.

Assumptions:

(C1) $a(u)$ is twice continuously differentiable in a neighborhood of $u_0$ for any $u_0$.

(C2) $f_u(u)$ is continuous and $f_u(u_0) > 0$.

(C3) $f_{y|u,x}(y)$ is bounded and satisfies the Lipschitz condition.

(C4) The kernel function $K(\cdot)$ is symmetric and has a compact support, say $[-1, 1]$.

(C5) $\{(X_t, Y_t, U_t)\}$ is a strictly stationary $\alpha$-mixing process with mixing coefficient $\alpha(t)$ satisfying $\sum_{t \ge 1} t^{l}\, \alpha^{(\delta-2)/\delta}(t) < \infty$ for some positive real number $\delta > 2$ and $l > (\delta - 2)/\delta$.

(C6) $E\|X_t\|^{2\delta^{*}} < \infty$ with $\delta^{*} > \delta$.

(C7) $\Omega(u_0)$ is positive-definite and continuous in a neighborhood of $u_0$.

(C8) $\Omega^{*}(u_0)$ is continuous and positive-definite in a neighborhood of $u_0$.

(C9) The bandwidth $h$ satisfies $h \to 0$ and $nh \to \infty$.

(C10) $f(u, v \mid x_0, x_s; s) \le M < \infty$ for $s \ge 1$, where $f(u, v \mid x_0, x_s; s)$ is the conditional density of $(U_0, U_s)$ given $(X_0 = x_0, X_s = x_s)$.

(C11) $n^{1/2 - \delta/4}\, h^{\delta/\delta^{*} - 1/2 - \delta/4} = O(1)$.
Remark 1: (Discussion of Conditions) Assumptions (C1) - (C3) include some smoothness conditions on the functionals involved. The requirement in (C4) that $K(\cdot)$ be compactly supported is imposed for the sake of brevity of proofs, and can be removed at the cost of lengthier arguments. In particular, the Gaussian kernel is allowed. The $\alpha$-mixing condition is one of the weakest mixing conditions for weakly dependent stochastic processes. Stationary time series or Markov chains fulfilling certain (mild) conditions are $\alpha$-mixing with exponentially decaying coefficients; see the discussions in Section 1 and Cai (2002a) for more examples. On the other hand, the assumption on the convergence rate of $\alpha(\cdot)$ in (C5) might not be the weakest possible and is imposed to simplify the proof. Further, (C10) is just a technical assumption, which is also imposed by Cai (2002a). (C6) - (C8) require some standard moments. Clearly, (C11) allows the choice of a wide range of smoothing parameter values and is slightly stronger than the usual condition of $nh \to \infty$. However, for bandwidths of optimal size (i.e., $h = O(n^{-1/5})$), (C11) is automatically satisfied for $\delta \ge 3$ and it is still fulfilled for $2 < \delta < 3$ if $\delta^{*}$ satisfies $\delta < \delta^{*} \le \delta/(3 - \delta)$, so that we do not concern ourselves with such refinements. Indeed, this assumption is also imposed by Cai, Fan and Yao (2000) for the mean regression. Finally, if there is no $X_t$ in model (5.3), (C5) can be replaced by the polynomial mixing rate condition (C5)$'$: $\alpha(t) = O(t^{-\eta})$ for some $\eta > 2$, and (C11) can be substituted by a weaker bandwidth condition (C11)$'$; see Cai (2002a) for details.
Remark 2: (Identification) It is clear from (5.3) that
$$\Omega(u_0)\, a(u_0) = E\left[q_\tau(u_0, X_t)\, X_t \mid U_t = u_0\right].$$
Then, $a(u_0)$ is identified (uniquely determined) if and only if $\Omega(u_0)$ is positive definite for any $u_0$. Therefore, Assumption (C7) is the necessary and sufficient condition for the model identification.

To establish the asymptotic normality of the proposed estimator, similar to Chaudhuri (1991), we first derive the local Bahadur representation for the local linear estimator. To this end, our analysis follows the approach of Koenker and Zhao (1996), which can simplify the theoretical proofs. Define $\mu_j = \int u^j K(u)\, du$ and $\nu_j = \int u^j K^2(u)\, du$. Also, set $\psi_\tau(x) = \tau - I_{\{x<0\}}$, $U_{th} = (U_t - u_0)/h$,
$$X_t^{*} = \begin{pmatrix} X_t \\ U_{th}\, X_t \end{pmatrix}, \qquad Y_t^{*} = Y_t - X_t'\left[a(u_0) + a'(u_0)\,(U_t - u_0)\right],$$
and
$$\widehat{\theta}_n = \sqrt{nh}\; H \begin{pmatrix} \widehat{\beta}_0 - a(u_0) \\ \widehat{\beta}_1 - a'(u_0) \end{pmatrix} \quad \text{with} \quad H = \mathrm{diag}\{I,\, h\, I\}.$$
Theorem 5.1: (Local Bahadur Representation) Under assumptions (C1) - (C9), we have
$$\widehat{\theta}_n = \frac{\left[\Omega_1^{*}(u_0)\right]^{-1}}{\sqrt{nh}\, f_u(u_0)} \sum_{t=1}^{n} \psi_\tau\left(Y_t^{*}\right) X_t^{*}\, K(U_{th}) + o_p(1), \qquad (5.6)$$
where $\Omega_1^{*}(u_0) = \mathrm{diag}\left\{\Omega^{*}(u_0),\ \mu_2\, \Omega^{*}(u_0)\right\}$.

Remark 3: From Theorem 5.1 and Lemma 5.5 (in Section 5.4), it is easy to see that the local linear estimator $\widehat{a}(u_0)$ is consistent with the optimal nonparametric convergence rate $\sqrt{nh}$.
Theorem 5.2: (Asymptotic Normality) Under assumptions (C1) - (C11), we have the following asymptotic normality:
$$\sqrt{nh}\left\{ H \begin{pmatrix} \widehat{a}(u_0) - a(u_0) \\ \widehat{a}'(u_0) - a'(u_0) \end{pmatrix} - \frac{h^2\, \mu_2}{2} \begin{pmatrix} a''(u_0) \\ 0 \end{pmatrix} + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \Sigma(u_0)\right),$$
where $\Sigma(u_0) = \mathrm{diag}\left\{\tau(1-\tau)\, \nu_0\, \Sigma_a(u_0),\ \tau(1-\tau)\, \nu_2\, \mu_2^{-2}\, \Sigma_a(u_0)\right\}$ with
$$\Sigma_a(u_0) = \left[\Omega^{*}(u_0)\right]^{-1} \Omega(u_0) \left[\Omega^{*}(u_0)\right]^{-1} / f_u(u_0). \qquad (5.7)$$
In particular,
$$\sqrt{nh}\left\{ \widehat{a}(u_0) - a(u_0) - \frac{h^2\, \mu_2}{2}\, a''(u_0) + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \tau(1-\tau)\, \nu_0\, \Sigma_a(u_0)\right).$$
Remark 4: From Theorem 5.2, the asymptotic mean squared error (AMSE) of $\widehat{a}(u_0)$ is given by
$$\mathrm{AMSE} = \frac{h^4\, \mu_2^2}{4}\, \|a''(u_0)\|^2 + \frac{\tau(1-\tau)\, \nu_0}{nh\, f_u(u_0)}\, \mathrm{tr}\left(\Sigma_a(u_0)\right),$$
which gives the optimal bandwidth $h_{\mathrm{opt}}$ by minimizing the AMSE,
$$h_{\mathrm{opt}} = \left\{\frac{\tau(1-\tau)\, \nu_0\, \mathrm{tr}\left(\Sigma_a(u_0)\right)}{f_u(u_0)\, \mu_2^2\, \|a''(u_0)\|^2}\right\}^{1/5} n^{-1/5},$$
and the optimal AMSE is
$$\mathrm{AMSE}_{\mathrm{opt}} = \frac{5}{4}\left\{\frac{\tau(1-\tau)\, \nu_0\, \mathrm{tr}\left(\Sigma_a(u_0)\right)}{f_u(u_0)}\right\}^{4/5} \left(\mu_2\, \|a''(u_0)\|\right)^{2/5} n^{-4/5}.$$
Further, notice that results similar to Theorem 5.2 were obtained by Honda (2004) for independent data. Finally, it is interesting to note that the asymptotic bias in Theorem 5.2 is the same as that for the mean regression case but the two asymptotic variances are different; see, for example, Cai, Fan and Yao (2000).
If model (5.3) does not have $X_t$ ($d = 0$), it becomes the nonparametric quantile regression model $q_\tau(\cdot)$. Then, we have the following asymptotic normality for the local linear estimator of the nonparametric quantile regression function $q_\tau(\cdot)$, which covers the results in Yu and Jones (1998), Honda (2000), Lu, Hui and Zhao (2000), and Cai (2002a) for both independent and time series data.

Corollary 5.1: If there is no $X_t$ in (5.3), then
$$\sqrt{nh}\left\{ \widehat{q}_\tau(u_0) - q_\tau(u_0) - \frac{h^2\, \mu_2}{2}\, q_\tau''(u_0) + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \sigma_\tau^2(u_0)\right),$$
where $\sigma_\tau^2(u_0) = \tau(1-\tau)\, \nu_0\, f_u^{-1}(u_0)\, f_{y|u}^{-2}\left(q_\tau(u_0)\right)$.
Now we compare the performance of the local linear estimate $\widehat{a}(u_0)$ obtained from (5.4) with that of the local constant estimate $\overline{a}(u_0)$ given in (5.5). To this effect, first, we derive the asymptotic results for the local constant estimator; the proof is omitted since it follows along the same lines as the proofs of Theorems 5.1 and 5.2; see Xu (2005) for details. Under some regularity conditions, it can be shown that
$$\sqrt{nh}\left\{ \overline{a}(u_0) - a(u_0) - \widetilde{b} + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \tau(1-\tau)\, \nu_0\, \Sigma_a(u_0)\right),$$
where
$$\widetilde{b} = \frac{h^2\, \mu_2}{2}\left\{ a''(u_0) + 2\, a'(u_0)\, \frac{f_u'(u_0)}{f_u(u_0)} + 2\, \left[\Omega^{*}(u_0)\right]^{-1} \Omega^{*\prime}(u_0)\, a'(u_0) \right\},$$
which implies that the asymptotic bias of $\overline{a}(u_0)$ is different from that of $\widehat{a}(u_0)$ but both have the same asymptotic variance. Therefore, the local constant quantile estimator does not adapt to nonuniform designs: the bias can be large when $f_u'(u_0)/f_u(u_0)$ or $\left[\Omega^{*}(u_0)\right]^{-1} \Omega^{*\prime}(u_0)$ is large even when the true coefficient functions are linear. It is surprising that, to the best of our knowledge, this finding seems to be new for the nonparametric quantile regression setting, although it is well documented in the literature for the ordinary regression case; see Fan and Gijbels (1996) for details.

Finally, to examine the asymptotic behaviors of the local linear and local constant quantile estimators at the boundaries, we offer Theorem 5.3 below, but its proofs are omitted due
to their similarity to those for Theorem 5.2, with some modifications, and to those for the ordinary regression setting (Fan and Gijbels, 1996); see Xu (2005) for the detailed proofs. Without loss of generality, we consider only the left boundary point $u_0 = c\,h$, $0 < c < 1$, if $U_t$ takes values only from $[0, 1]$. A result similar to Theorem 5.3 holds for the right boundary point $u_0 = 1 - c\,h$. Define $\mu_{j,c} = \int_{-c}^{1} u^j K(u)\, du$ and $\nu_{j,c} = \int_{-c}^{1} u^j K^2(u)\, du$.

Theorem 5.3: (Asymptotic Normality) Under the assumptions of Theorem 5.2, we have the following asymptotic normality of the local linear quantile estimator at the left boundary point,
$$\sqrt{nh}\left\{ \widehat{a}(c\,h) - a(c\,h) - \frac{h^2\, b_c}{2}\, a''(0+) + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \tau(1-\tau)\, v_c\, \Sigma_a(0+)\right),$$
where
$$b_c = \frac{\mu_{2,c}^2 - \mu_{1,c}\, \mu_{3,c}}{\mu_{2,c}\, \mu_{0,c} - \mu_{1,c}^2} \quad \text{and} \quad v_c = \frac{\mu_{2,c}^2\, \nu_{0,c} - 2\, \mu_{1,c}\, \mu_{2,c}\, \nu_{1,c} + \mu_{1,c}^2\, \nu_{2,c}}{\left(\mu_{2,c}\, \mu_{0,c} - \mu_{1,c}^2\right)^2}.$$
Further, we have the following asymptotic normality of the local constant quantile estimator at the left boundary point $u_0 = c\,h$ for $0 < c < 1$,
$$\sqrt{nh}\left\{ \overline{a}(c\,h) - a(c\,h) - \widetilde{b}_c + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \tau(1-\tau)\, \nu_{0,c}\, \Sigma_a(0+)/\mu_{0,c}^2\right),$$
where
$$\widetilde{b}_c = \left\{ h\, \mu_{1,c}\, a'(0+) + \frac{h^2\, \mu_{2,c}}{2}\left[ a''(0+) + \frac{2\, a'(0+)\, f_u'(0+)}{f_u(0+)} + 2\, \Omega^{*-1}(0+)\, \Omega^{*\prime}(0+)\, a'(0+) \right] \right\} \Big/ \mu_{0,c}.$$
Similar results hold for the right boundary point $u_0 = 1 - c\,h$.

Remark 5: We remark that if the point $0$ were an interior point, then Theorem 5.3 would hold with $c = 1$, which becomes Theorem 5.2. Also, as $c \to 1$, $b_c \to \mu_2$ and $v_c \to \nu_0$, and these limits are exactly the constant factors appearing respectively in the asymptotic bias and variance for an interior point. Therefore, Theorem 5.3 shows that the local linear estimation has automatic good behavior at the boundaries without the need of boundary correction. Further, one can see from Theorem 5.3 that at the boundaries, the asymptotic bias term for the local constant quantile estimate is of order $h$, compared to order $h^2$ for the local linear quantile estimate. This shows that the local linear quantile estimate does not suffer from boundary effects but the local constant quantile estimate does, which is another advantage of the local linear quantile estimator over the local constant quantile estimator. This again suggests that one should use the local linear approach in practice.
As a special case, Theorem 5.3 includes the asymptotic properties of the local constant quantile estimator of the nonparametric quantile function $q_\tau(\cdot)$ at both interior and boundary points, stated as follows.

Corollary 5.2: If there is no $X_t$ in (5.3), then the asymptotic normality of the local constant quantile estimator is given by
$$\sqrt{nh}\left\{ \overline{q}_\tau(u_0) - q_\tau(u_0) - \frac{h^2\, \mu_2}{2}\left[ q_\tau''(u_0) + \frac{2\, q_\tau'(u_0)\, f_u'(u_0)}{f_u(u_0)} \right] + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \sigma_\tau^2(u_0)\right).$$
Further, at the left boundary point, we have
$$\sqrt{nh}\left\{ \overline{q}_\tau(c\,h) - q_\tau(c\,h) - \overline{b}_c + o_p\left(h^2\right) \right\} \ \stackrel{D}{\longrightarrow}\ N\left(0,\, \sigma_c^2\right),$$
where
$$\overline{b}_c = \left\{ h\, \mu_{1,c}\, q_\tau'(0+) + \frac{h^2\, \mu_{2,c}}{2}\left[ q_\tau''(0+) + \frac{2\, q_\tau'(0+)\, f_u'(0+)}{f_u(0+)} \right] \right\} \Big/ \mu_{0,c}$$
and $\sigma_c^2 = \tau(1-\tau)\, \nu_{0,c}\, f_u^{-1}(0+)\, f_{y|u}^{-2}\left(q_\tau(0+)\right)/\mu_{0,c}^2$.
5.2.3 Bandwidth Selection
It is well known that the bandwidth plays an essential role in the trade-off between reducing bias and variance. To the best of our knowledge, there has been almost nothing done about selecting the bandwidth in the context of estimating the coefficient functions in quantile regression, even though there is a rich amount of literature on this issue in the mean regression setting; see, for example, Cai, Fan and Yao (2000). In practice, it is desirable to have a quick and easily implemented data-driven method. Based on this spirit, Yu and Jones (1998) and Yu and Lu (2004) proposed a simple and convenient method for the nonparametric quantile estimation. Their approach assumes that the second derivatives of the quantile function are parallel. However, this assumption might not be valid for many applications in economics and finance due to (nonlinear) heteroscedasticity. Further, the mean regression approach cannot directly estimate the variance function. To attenuate these problems, we propose a method of selecting the bandwidth for the foregoing estimation procedure, based on the nonparametric version of the Akaike information criterion (AIC), which can attend to the structure of time series data and the over-fitting or under-fitting tendency. This idea is motivated by its analogue in Cai and Tiwari (2000) and Cai (2002b) for nonlinear time series models. The basic idea is described below.

By recalling the classical AIC for linear models under the likelihood setting,
$$-2\,(\text{maximized log likelihood}) + 2\,(\text{number of estimated parameters}),$$
we propose the following nonparametric version of the bias-corrected AIC, due to Hurvich and Tsai (1989) for parametric models and Hurvich, Simonoff and Tsai (1998) for nonparametric regression models, to select $h$ by minimizing
$$\mathrm{AIC}(h) = \log\left\{\widehat{\sigma}_\tau^2\right\} + \frac{2\,(p_h + 1)}{n - (p_h + 2)}, \qquad (5.8)$$
where $\widehat{\sigma}_\tau^2$ and $p_h$ are defined later. This criterion may be interpreted as the AIC for the local quantile smoothing problem and seems to perform well in some limited applications. Note that, similar to (5.8), Koenker, Ng and Portnoy (1994) considered the Schwarz information criterion (SIC) of Schwarz (1978), with the second term on the right-hand side of (5.8) replaced by $2\, n^{-1}\, p_h\, \log n$, where $p_h$ is the number of active knots for the smoothing spline quantile setting, and Machado (1993) studied similar criteria for parametric quantile regression models and more general M-estimators of regression.
Now the question is how to define $\widehat{\sigma}_\tau^2$ and $p_h$ in this setting. In the mean regression setting, $\widehat{\sigma}^2$ is just the estimate of the variance $\sigma^2$. In the quantile regression, we define $\widehat{\sigma}_\tau^2$ as
$$\widehat{\sigma}_\tau^2 = \frac{1}{n} \sum_{t=1}^{n} \rho_\tau\left(Y_t - X_t'\, \widehat{a}(U_t)\right),$$
which may be interpreted as the mean squared error in the least squares setting and was also used by Koenker, Ng and Portnoy (1994). In nonparametric models, $p_h$ is the nonparametric version of the degrees of freedom, called the effective number of parameters, and it is usually based on the trace of various quasi-projection (hat) matrices in the least squares theory (linear estimators); see, for example, Hastie and Tibshirani (1990), Cai and Tiwari (2000), and Cai (2002b) for a cogent discussion for nonparametric regression models and nonlinear time series models. For the quantile smoothing setting, the explicit expression for the quasi-projection matrix does not exist due to its nonlinearity. However, we can use the first-order approximation (the local Bahadur representation) given in (5.6) to derive an explicit expression, which may be interpreted as the quasi-projection matrix in this setting. To this end, define
$$S_n = S_n(u_0) = a_n \sum_{t=1}^{n} \eta_t\, X_t^{*}\, X_t^{*\prime}\, K(U_{th}),$$
where $\eta_t = I\left(Y_t \le X_t'\, \widehat{a}(u_0) + a_n\right) - I\left(Y_t \le X_t'\, \widehat{a}(u_0)\right)$ and $a_n = (nh)^{-1/2}$. It is shown in Section 5.5 that
$$S_n(u_0) = f_u(u_0)\, \Omega_1^{*}(u_0) + o_p(1). \qquad (5.9)$$
From (5.6), it is easy to verify that $\widehat{\theta}_n \approx a_n\, S_n^{-1} \sum_{t=1}^{n} \psi_\tau\left(Y_t^{*}\right) X_t^{*}\, K(U_{th})$. Then, we have
$$\widehat{q}_\tau(U_t, X_t) - q_\tau(U_t, X_t) \approx \frac{1}{n} \sum_{s=1}^{n} \psi_\tau\left(Y_s^{*}(U_t)\right) K_h(U_s - U_t)\, X_t^{0\prime}\, S_n^{-1}(U_t)\, X_s^{*},$$
where $X_t^{0} = \begin{pmatrix} X_t \\ 0 \end{pmatrix}$. The coefficient of $\psi_\tau\left(Y_s^{*}(U_s)\right)$ on the right-hand side of the above expression is $\delta_s = a_n^2\, K(0)\, X_s^{0\prime}\, S_n^{-1}(U_s)\, X_s^{0}$. Now, we have that $p_h = \sum_{s=1}^{n} \delta_s$, which can be regarded as an approximation to the trace of the quasi-projection (hat) matrix for linear estimators. In the practical implementation, we need to estimate $a(u_0)$ first since $S_n(u_0)$ involves $\widehat{a}(u_0)$. We recommend using a pilot bandwidth, which can be chosen as the one proposed by Yu and Jones (1998). Similar to the least squares theory, as expected, the criterion proposed in (5.8) counteracts the over-fitting tendency of the generalized cross-validation, due to its relatively weak penalty, and the under-fitting of the SIC of Schwarz (1978) studied by Koenker, Ng and Portnoy (1994), because of its heavy penalty.
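As a practical illustration, the following is a rough sketch (not the code used in Cai and Xu (2005), and ignoring numerical safeguards such as a near-singular S_n) of how AIC(h) in (5.8) can be computed for a given h; epa and llqr.fit are the hypothetical helpers sketched in Section 5.2.1, y is the response, X is the design matrix whose first column is the intercept (a column of ones), and U is the smoothing variable.

rho <- function(u, tau) u*(tau - (u < 0))        # check (loss) function
aic.h <- function(y, X, U, h, tau){
  n <- length(y); an <- 1/sqrt(n*h)
  # hat a(U_t) at every observed U_t (local linear fit at each point)
  ahat <- t(sapply(U, function(u0) llqr.fit(y, X[,-1], U, u0, h, tau)))
  sigma2 <- mean(rho(y - rowSums(X*ahat), tau))  # hat sigma_tau^2
  delta <- numeric(n)
  for(s in 1:n){
    u0 <- U[s]
    K  <- epa((U-u0)/h)
    Xs <- cbind(X, X*(U-u0)/h)                   # X_t^* = (X_t', U_th X_t')'
    fit0 <- drop(X %*% ahat[s,])                 # X_t' hat a(u0)
    eta  <- (y <= fit0 + an) - (y <= fit0)
    Sn <- an * t(Xs*(eta*K)) %*% Xs              # S_n(u0)
    X0 <- c(X[s,], rep(0, ncol(X)))              # X_s^0 = (X_s', 0')'
    delta[s] <- an^2*epa(0)*drop(t(X0) %*% solve(Sn) %*% X0)
  }
  ph <- sum(delta)                               # effective number of parameters p_h
  log(sigma2) + 2*(ph+1)/(n-(ph+2))              # AIC(h) in (5.8)
}
# h_opt is obtained by evaluating aic.h over a grid of h values and taking the minimizer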
5.2.4 Covariance Estimate
For the purpose of statistical inference, we next consider the estimation of the asymptotic covariance matrix in order to construct pointwise confidence intervals. In practice, a quick and simple way to estimate the asymptotic covariance matrix is desirable. In view of (5.7), the explicit expression of the asymptotic covariance provides a direct estimator. Therefore, we can use the so-called sandwich method. In other words, we need to obtain a consistent estimate of both $\Omega(u_0)$ and $\Omega^{*}(u_0)$. To this effect, define
$$\widehat{\Omega}_{n,0} = \frac{1}{n} \sum_{t=1}^{n} X_t X_t'\, K_h(U_t - u_0) \quad \text{and} \quad \widehat{\Omega}_{n,1} = \frac{1}{n} \sum_{t=1}^{n} w_t\, X_t X_t'\, K_h(U_t - u_0),$$
where $w_t = I\left(X_t'\,\widehat{a}(u_0) - \delta_n < Y_t \le X_t'\,\widehat{a}(u_0) + \delta_n\right)/(2\,\delta_n)$ for any $\delta_n \to 0$ as $n \to \infty$. It is shown in Section 5.5 that
$$\widehat{\Omega}_{n,0} = f_u(u_0)\, \Omega(u_0) + o_p(1) \quad \text{and} \quad \widehat{\Omega}_{n,1} = f_u(u_0)\, \Omega^{*}(u_0) + o_p(1). \qquad (5.10)$$
Therefore, the consistent estimate of $\Sigma_a(u_0)$ is given by
$$\widehat{\Sigma}_a(u_0) = \left[\widehat{\Omega}_{n,1}(u_0)\right]^{-1} \widehat{\Omega}_{n,0}(u_0) \left[\widehat{\Omega}_{n,1}(u_0)\right]^{-1}.$$
Note that $\widehat{\Omega}_{n,1}(u_0)$ might be close to singular in some sparse regions. To avoid this computational difficulty, there are two alternative ways to construct a consistent estimate of $f_u(u_0)\, \Omega^{*}(u_0)$ through estimating the conditional density of $Y$, $f_{y|u,x}\left(q_\tau(u, x)\right)$. The first method is the Nadaraya-Watson type (or local linear) double kernel method of Fan, Yao and Tong (1996), defined as
$$\widehat{f}_{y|u,x}\left(q_\tau(u, x)\right) = \sum_{t=1}^{n} K_{h_2}(U_t - u,\, X_t - x)\, L_{h_1}\left(Y_t - \widehat{q}_\tau(u, x)\right) \Big/ \sum_{t=1}^{n} K_{h_2}(U_t - u,\, X_t - x),$$
where $L(\cdot)$ is a kernel function, and the second one is the difference quotient method of Koenker and Xiao (2004),
$$\widehat{f}_{y|u,x}\left(q_\tau(u, x)\right) = \frac{\tau_j - \tau_{j-1}}{\widehat{q}_{\tau_j}(u, x) - \widehat{q}_{\tau_{j-1}}(u, x)},$$
for some appropriately chosen sequence of $\{\tau_j\}$; see Koenker and Xiao (2004) for more discussions. Then, in view of the definition of $f_u(u_0)\, \Omega^{*}(u_0)$, the estimator $\widetilde{\Omega}_{n,1}$ can be constructed as
$$\widetilde{\Omega}_{n,1} = \frac{1}{n} \sum_{t=1}^{n} \widehat{f}_{y|u,x}\left(\widehat{q}_\tau(U_t, X_t)\right) X_t X_t'\, K_h(U_t - u_0).$$
By an analogue of (5.10), one can show that under some regularity conditions, both estimators are consistent.
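A compact sketch (again only illustrative, not the author's code) of the sandwich estimate of Sigma_a(u0) based on the indicator-window weights w_t is given below; it reuses the epa kernel from the earlier sketch, X is the design matrix whose first column is the intercept, ahat.u0 is the local linear estimate at u0 (e.g., from llqr.fit), and delta.n is a small window width.

sandwich.cov <- function(y, X, U, u0, h, ahat.u0, delta.n){
  n  <- length(y)
  Kh <- epa((U-u0)/h)/h                      # K_h(U_t - u0)
  O0 <- t(X*Kh) %*% X / n                    # Omega_hat_{n,0}
  r  <- y - drop(X %*% ahat.u0)              # Y_t - X_t' hat a(u0)
  w  <- (r > -delta.n & r <= delta.n)/(2*delta.n)
  O1 <- t(X*(w*Kh)) %*% X / n                # Omega_hat_{n,1}
  solve(O1) %*% O0 %*% solve(O1)             # Sigma_hat_a(u0)
}

Pointwise confidence intervals then follow from the asymptotic normality in Theorem 5.2, with the approximate variance tau*(1-tau)*nu_0*Sigma_hat_a(u0)/(n*h).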
5.3 Empirical Examples
In this section we report a Monte Carlo simulation to examine the finite sample properties of the proposed estimators, to further explore the possible nonlinearity feature, heteroscedasticity, and predictability of the exchange rate of the Japanese Yen per US dollar, and to identify the factors affecting the house price in the Boston area. In our computation, we use the Epanechnikov kernel $K(u) = 0.75\,(1 - u^2)\, I(|u| \le 1)$ and construct the pointwise confidence intervals based on the consistent estimate of the asymptotic covariance described in Section 2.4 without the bias correction. For a predetermined sequence of $h$'s from a wide range, say from $h_a$ to $h_b$ with a fixed increment, based on the AIC bandwidth selector described in Section 2.3, we compute $\mathrm{AIC}(h)$ for each $h$ and choose $h_{\mathrm{opt}}$ to minimize $\mathrm{AIC}(h)$.
5.3.1 A Simulated Example
Example 5.1: We consider the following data generating process:
$$Y_t = a_1(U_t)\, Y_{t-1} + a_2(U_t)\, Y_{t-2} + \sigma(U_t)\, e_t, \qquad t = 1, \ldots, n, \qquad (5.11)$$
where $a_1(U_t) = \sin(\sqrt{2}\, U_t)$, $a_2(U_t) = \cos(\sqrt{2}\, U_t)$, and $\sigma(U_t) = 3\,\exp\left(-4\,(U_t - 1)^2\right) + 2\,\exp\left(-5\,(U_t - 2)^2\right)$; $U_t$ is generated independently from the uniform distribution on $(0, 3)$ and $e_t \sim N(0, 1)$. The quantile regression function is
$$q_\tau(U_t, Y_{t-1}, Y_{t-2}) = a_0(U_t) + a_1(U_t)\, Y_{t-1} + a_2(U_t)\, Y_{t-2},$$
where $a_0(U_t) = \Phi^{-1}(\tau)\, \sigma(U_t)$ and $\Phi^{-1}(\tau)$ is the $\tau$-th quantile of the standard normal distribution. Therefore, only $a_0(\cdot)$ is a function of $\tau$. Note that $a_0(\cdot) = 0$ when $\tau = 0.5$. To assess the finite sample performance, we compute the mean absolute deviation error (MADE) for $\widehat{a}_j(\cdot)$, which is defined as
$$\mathrm{MADE}_j = n_0^{-1} \sum_{k=1}^{n_0} \left| \widehat{a}_j(u_k) - a_j(u_k) \right|,$$
where $\widehat{a}_j(\cdot)$ is either the local linear or the local constant quantile estimate of $a_j(\cdot)$ and $\{u_k = 0.1\,(k-1) + 0.2 : 1 \le k \le n_0 = 27\}$ are the grid points. The Monte Carlo simulation is repeated 500 times for each sample size $n = 200$, $500$, and $1000$ and for each $\tau = 0.05$, $0.50$, and $0.95$. We compute the optimal bandwidth for each replication, sample size, and $\tau$, and we report the median and standard deviation (in parentheses) of the 500 MADE values for each scenario in Table 5.1.
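As a quick way to reproduce this design, the following is a short sketch (under the reconstruction of the coefficient functions given above, not the author's original code) that simulates one sample from the data generating process (5.11).

sim.dgp <- function(n, burn=100){
  N <- n + burn + 2
  U  <- runif(N, 0, 3)
  a1 <- sin(sqrt(2)*U)
  a2 <- cos(sqrt(2)*U)
  sig <- 3*exp(-4*(U-1)^2) + 2*exp(-5*(U-2)^2)
  y <- numeric(N)
  for(t in 3:N) y[t] <- a1[t]*y[t-1] + a2[t]*y[t-2] + sig[t]*rnorm(1)
  keep <- (burn+3):N                          # drop the burn-in period
  list(y=y[keep], ylag1=y[keep-1], ylag2=y[keep-2], U=U[keep])
}
dat <- sim.dgp(500)
# local linear quantile fit at one grid point, e.g., u0=1.5 and tau=0.05:
# llqr.fit(dat$y, cbind(dat$ylag1, dat$ylag2), dat$U, u0=1.5, h=0.1, tau=0.05)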
From Table 5.1, we can observe that the MADE values for both the local linear and local constant quantile estimates decrease when $n$ increases for all three values of $\tau$, and the local linear estimate outperforms the local constant estimate. This is another example showing that the local linear method is superior to the local constant method even in the quantile setting. Also, the performance of the median quantile estimate is slightly better than that for the two tails ($\tau = 0.05$ and $0.95$). This observation is not surprising because of the sparsity of data in the tail regions. Moreover, another benefit of using the quantile method is that we can obtain the estimate of $a_0(\cdot)$ (conditional standard deviation) simultaneously with the estimation of $a_1(\cdot)$ and $a_2(\cdot)$ (functions in the conditional mean), which, in contrast, avoids the two-stage approach needed to estimate the variance function in the mean regression; see Fan and Yao (1998) for details. However, it is interesting to see that, due to the larger variation, the performance for $a_0(\cdot)$, although reasonably good, is not as good as that for $a_1(\cdot)$ and $a_2(\cdot)$. This can be further evidenced from Figure 5.1. The results in this simulated experiment show that the proposed procedure is reliable and they are in line with our asymptotic theory.
Table 5.1: The Median and Standard Deviation of 500 MADE Values

The Local Linear Estimator
            tau = 0.05                     tau = 0.5                      tau = 0.95
   n    MADE_0  MADE_1  MADE_2      MADE_0  MADE_1  MADE_2      MADE_0  MADE_1  MADE_2
  200    0.911   0.186   0.177       0.401   0.092   0.089       0.920   0.187   0.175
        (0.520) (0.041) (0.041)     (0.091) (0.032) (0.032)     (0.517) (0.042) (0.039)
  500    0.510   0.085   0.083       0.311   0.055   0.055       0.517   0.085   0.083
        (0.414) (0.023) (0.020)     (0.056) (0.019) (0.018)     (0.390) (0.023) (0.023)
 1000    0.419   0.060   0.059       0.311   0.050   0.049       0.416   0.060   0.059
        (0.071) (0.018) (0.017)     (0.051) (0.014) (0.014)     (0.072) (0.017) (0.017)

The Local Constant Estimator
            tau = 0.05                     tau = 0.5                      tau = 0.95
   n    MADE_0  MADE_1  MADE_2      MADE_0  MADE_1  MADE_2      MADE_0  MADE_1  MADE_2
  200    3.753   0.285   0.290       0.501   0.144   0.147       3.763   0.287   0.287
        (2.937) (0.050) (0.051)     (0.115) (0.027) (0.028)     (3.188) (0.052) (0.051)
  500    2.201   0.147   0.146       0.355   0.084   0.085       2.223   0.147   0.147
        (3.025) (0.024) (0.025)     (0.062) (0.016) (0.015)     (3.320) (0.025) (0.025)
 1000    0.883   0.086   0.086       0.322   0.060   0.061       0.882   0.086   0.087
        (0.462) (0.015) (0.014)     (0.054) (0.012) (0.011)     (0.427) (0.015) (0.015)
Finally, Figure 5.1 plots the local linear estimates of all three coefficient functions together with their true values (solid line): σ(·) in Figure 5.1(a), a_1(·) in Figure 5.1(b), and a_2(·) in Figure 5.1(c), for the three quantiles τ = 0.05 (dashed line), 0.50 (dotted line), and 0.95 (dotted-dashed line), for n = 500, based on a typical sample chosen so that its MADE value equals the median of the 500 MADE values. The selected optimal bandwidths are h_opt = 0.10 for τ = 0.05, 0.075 for τ = 0.50, and 0.10 for τ = 0.95. Note that the estimate of σ(·) for τ = 0.50 cannot be recovered from the estimate of a_0(·) = 0, so it is not presented in Figure 5.1(a). The 95% pointwise confidence intervals without the bias correction are depicted in Figure 5.1 as thick lines for the τ = 0.05 quantile estimate. By the same token, we can compute the pointwise confidence intervals (not shown here) for the rest. Basically, all confidence intervals cover the true values. Also, we can see that the confidence interval for a_0(·) is wider than those for a_1(·) and a_2(·) due to the larger variation. Similar plots
Figure 5.1: Simulated Example: The plots of the estimated coefficient functions for three quantiles τ = 0.05 (dashed line), τ = 0.50 (dotted line), and τ = 0.95 (dot-dashed line) with their true functions (solid line): σ(u) versus u in (a), a_1(u) versus u in (b), and a_2(u) versus u in (c), together with the 95% pointwise confidence interval (thick line) with the bias ignored for the τ = 0.5 quantile estimate.
are obtained (not shown here) for the local constant estimates due to the space limitations.
Overall, the proposed modeling procedure performs fairly well.
5.3.2 Real Data Examples

Example 5.2: (Boston House Price Data) We analyze a subset of the Boston house price data (available at http://lib.stat.cmu.edu/datasets/boston) of Harrison and Rubinfeld (1978). This dataset consists of 14 variables collected on each of 506 different houses from a variety of locations. The dependent variable is Y, the median value of owner-occupied homes in $1,000s (house price); some major factors affecting the house prices used here are: the proportion of population of lower educational status (i.e., the proportion of adults with only a high school education and the proportion of male workers classified as laborers), denoted by U; the average number of rooms per house in the area, denoted by X_1; the per capita crime rate by town, denoted by X_2; the full property tax rate per $10,000, denoted by X_3; and the pupil/teacher ratio by town school district, denoted by X_4. For the complete description of all 14 variables, see Harrison and Rubinfeld (1978). Gilley and Pace (1996) provided corrections and examined censoring. Recently, there have been several papers devoted to the analysis of this dataset. For example, Breiman and Friedman (1985), Chaudhuri, Doksum and Samarov (1997), and Opsomer and Ruppert (1998) used four covariates, X_1, X_3, X_4 and U, or their transformations, to fit the data through a mean additive regression model, whereas Yu and Lu (2004) employed the additive quantile technique to analyze the data. Further, Pace and Gilley (1997) added a georeferencing factor to improve estimation by a spatial approach. Recently, Şentürk and Müller (2005) studied the correlation between the house price Y and the crime rate X_2 adjusted by the confounding variable U through a varying coefficient model, and they concluded that the expected effect of increasing crime rate on declining house prices seems to be observed only for lower educational status neighborhoods in Boston. Some existing analyses (e.g., Breiman and Friedman, 1985; Yu and Lu, 2004) in both mean and quantile regressions concluded that most of the variation seen in housing prices in the restricted data set can be explained by two major variables, X_1 and U. Indeed, the correlation coefficients between Y and U and between Y and X_1 are 0.7377 and 0.6954 in absolute value, respectively. The scatter plots of Y versus U and X_1 are displayed in Figures 5.2(a) and 5.2(b), respectively. The interesting features of this data set are that the response variable is the median price of a home in a given area and that the distributions of Y and the major covariate U are left skewed (the density estimates are not presented). Therefore, quantile methods are particularly well suited to the analysis of this dataset. Finally, it is surprising that all the existing nonparametric models mentioned above did not include the crime rate X_2, which may be an important factor affecting the housing price, and did not consider interaction terms such as those between U and X_2.
Based on the above discussion, we conclude that the model studied in this chapter might be well suited to the analysis of this dataset. Therefore, we analyze this dataset by the following quantile smooth coefficient model:¹
q_τ(U_t, X_t) = a_{0,τ}(U_t) + a_{1,τ}(U_t) X_{t1} + a_{2,τ}(U_t) X*_{t2},   1 ≤ t ≤ n = 506,   (5.12)
¹ We do not include the other variables such as X_3 and X_4 in model (5.12), since we found that the coefficient functions for these variables seem to be constant. Therefore, a semiparametric model would be appropriate if the model included these variables. This, of course, deserves further investigation.
Figure 5.2: Boston Housing Price Data: Displayed in (a)-(d) are the scatter plots of the house price versus the covariates U, X_1, X_2, and log(X_2), respectively.
where X*_{t2} = log(X_{t2}). The reason for using the logarithm of X_{t2} in (5.12), instead of X_{t2} itself, is that the correlation between Y_t and X*_{t2} (the correlation coefficient is 0.4543) is slightly stronger than that between Y_t and X_{t2} (0.3883), which can also be seen from Figures 5.2(c) and 5.2(d). In the model fitting, the covariates X_1 and X_2 are centralized. For the purpose of comparison, we also consider the following functional coefficient model in the mean regression,
Y_t = a_0(U_t) + a_1(U_t) X_{t1} + a_2(U_t) X*_{t2} + e_t,   (5.13)
and we employ the local linear fitting technique to estimate the coefficient functions a_j(·), denoted by ā_j(·); see Cai, Fan and Yao (2000) for details.
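A sketch of how such a local linear quantile fit can be computed in R is given below. It assumes the quantreg package (Koenker, 2004) and uses rq() with kernel weights at a single grid point u0; the variable names Y, U, X1, X2 are placeholders for the house price, the lower-educational-status proportion, and the centralized covariates X_1 and X*_2, and the sketch is not the code actually used for the reported results:

library(quantreg)                          # assumed available; provides rq() with weights

# local linear estimate of (a0.tau(u0), a1.tau(u0), a2.tau(u0)) at a point u0
llqc <- function(u0, Y, U, X1, X2, tau, h) {
  w  <- 0.75 * (1 - ((U - u0) / h)^2) * (abs(U - u0) <= h)   # Epanechnikov weights
  du <- U - u0
  fit <- rq(Y ~ du + X1 + I(X1 * du) + X2 + I(X2 * du),
            tau = tau, weights = w, subset = w > 0)
  coef(fit)[c("(Intercept)", "X1", "X2")]  # local intercept and slopes at u0
}
# repeating llqc() over a grid of u0 values traces out the coefficient curves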
The coefficient functions are estimated through the local linear quantile approach, using the bandwidth selector described in Section 2.3. The selected optimal bandwidths are h_opt = 2.0 for τ = 0.05, 1.5 for τ = 0.50, and 3.5 for τ = 0.95. Figures 5.3(e), 5.3(f),
Figure 5.3: Boston Housing Price Data: The plots of the estimated coefficient functions for three quantiles τ = 0.05 (solid line), τ = 0.50 (dashed line), and τ = 0.95 (dotted line), and the mean regression (dot-dashed line): â_{0,τ}(u) and ā_0(u) versus u in (e), â_{1,τ}(u) and ā_1(u) versus u in (f), and â_{2,τ}(u) and ā_2(u) versus u in (g). The thick dashed lines indicate the 95% pointwise confidence interval for the median estimate with the bias ignored.
and 5.3(g) present the estimated coefficient functions â_{0,τ}(·), â_{1,τ}(·), and â_{2,τ}(·), respectively, for the three quantiles τ = 0.05 (solid line), 0.50 (dashed line), and 0.95 (dotted line), together with the estimates ā_j(·) from the mean regression model (dot-dashed line). Also, the 95% pointwise confidence intervals for the median estimate are displayed by the thick dashed lines without the bias correction. First, from these three figures, one can see that the median estimates are quite close to the mean estimates and that the estimates based on the mean regression are always within the 95% confidence interval of the median estimates. It can be concluded that the distribution of the measurement error e_t in (5.13) might be symmetric and that a_{j,0.5}(·) in (5.12) is almost the same as a_j(·) in (5.13). Also, one can observe
from Figure 5.3(e) that the three quantile curves are parallel, which implies that the intercept a_{0,τ}(·) depends on τ, and that they decrease exponentially, which supports the argument in Yu and Lu (2004) that a logarithm transformation may be needed. More importantly, one can observe from Figures 5.3(f) and 5.3(g) that the three estimated quantile coefficient curves intersect. This reveals that the structure of the quantiles is complex, that the lower and upper quantiles behave differently, and that heteroscedasticity might exist. Unfortunately, this phenomenon was not observed in any of the previous analyses in the aforementioned papers.

From Figure 5.3(f), first, we can observe that â_{1,0.50}(·) and â_{1,0.95}(·) are almost the same but â_{1,0.05}(·) is different. Secondly, we can see that the correlation between the house price and the number of rooms per house is almost always positive, except for houses with the median price or higher (τ = 0.50 and 0.95) in very low educational status neighborhoods (U > 23). Thirdly, for the low price houses (τ = 0.05), the correlation is always positive; it decreases when U is between 0 and 14 and then stays almost constant afterwards. This implies that the expected effect of increasing the number of rooms is to make the house price slightly higher in any low educational status neighborhood but much higher in relatively high educational status neighborhoods. Finally, for the median and/or higher price houses, the correlation decreases when U is between 0 and 14, stays almost constant until U reaches 20, and then decreases again, becoming negative for U larger than 23. This means that the number of rooms has a positive effect on the median and/or higher price houses in relatively high and low educational status neighborhoods, but increasing the number of rooms might not increase the house price in very low educational status neighborhoods. In other words, it is very difficult to sell high price houses with a large number of rooms at a reasonable price in very low educational status neighborhoods.

From Figure 5.3(g), first, one can conclude that the overall trend of all curves is decreasing, with â_{2,0.95}(·) decreasing faster than the others, and that â_{2,0.05}(·) and â_{2,0.50}(·) tend to be constant for U larger than 16. Secondly, the correlation between the housing prices (τ = 0.50 and 0.95) and the crime rate seems to be positive for smaller U values (about U ≤ 13) and becomes negative afterwards. This positive correlation between the housing prices (τ = 0.50 and 0.95) and the crime rate for relatively high educational status neighborhoods seems counterintuitive. However, the reason for this positive correlation is the existence of high educational status neighborhoods close to central Boston where high house prices and crime rates occur simultaneously. Therefore, the expected effect of increasing crime rate on declining house prices for τ = 0.50 and 0.95 seems to be observed only for lower educational status neighborhoods in Boston. Finally, it can be seen that the correlation between the housing prices for τ = 0.05 and the crime rate is almost always negative, although its magnitude depends on the value of U. This implies that an increasing crime rate slightly decreases the house prices for the cheap houses (τ = 0.05).
In summary, we conclude that there is a nonlinear relationship between the conditional quantiles of the housing price and the affecting factors. It seems that the factors U, X_1, and X_2 do have different effects on different quantiles of the conditional distribution of the housing price. Overall, the housing price and the proportion of population of lower educational status have a strong negative correlation, the number of rooms has a mostly positive effect on the housing price, whereas the crime rate has a mostly negative effect on the housing price. In particular, by using the proportion of population of lower educational status U as the confounding variable, we demonstrate the substantial benefits obtained by characterizing the effects of the factors X_1 and X_2 on the housing price according to the neighborhoods.
Example 5.3: (Exchange Rate Data) This example concerns the closing bid prices of the Japanese Yen (JPY) in terms of the US dollar. There is a vast amount of literature devoted to the study of exchange rate time series; see Sercu and Uppal (2000) and the references therein for details. Here we use the proposed model and its modeling approaches to explore the possible nonlinearity feature, heteroscedasticity, and predictability of the exchange rate series. The data form a weekly series from January 1, 1974 to December 31, 2003. The daily noon buying rates in New York City certified by the Federal Reserve Bank of New York for customs and cable transfer purposes were obtained from the Chicago Federal Reserve Board (www.frbchi.org). The weekly series is generated by selecting the Wednesday observations (if a Wednesday is a holiday, then the following Thursday is used), which gives 1566 observations. The use of weekly data avoids the so-called weekend effect as well as other biases associated with nontrading, bid-ask spreads, asynchronous rates, and so on, which are often present in higher frequency data. Previous analyses of this particularly difficult data set can be found in Gallant, Hsieh and Tauchen (1991), Fan, Yao and Cai (2003), and Hong and Lee (2003), and the references therein. We model the return series Y_t = 100 log(ξ_t / ξ_{t−1}), plotted in Figure 5.4(a), using the techniques developed in this chapter, where ξ_t denotes the exchange rate level in the t-th week. Typically, classical financial theory would treat Y_t as a martingale difference process; therefore, Y_t would be unpredictable.
Figure 5.4: Exchange Rate Series: (a) Japanese Yen/US dollar exchange rate return series Y_t; (b) autocorrelation function of Y_t; (c) the moving average technical trading rule variable.
But this assumption was strongly rejected by Hong and Lee (2003) by examining five major currencies and applying several testing procedures. Note that the return series Y_t has 1565 observations. Figure 5.4(b) shows that there is almost no significant autocorrelation in Y_t, which was also confirmed by Tsay (2002) and Hong and Lee (2003) using several statistical testing procedures.

Based on the evidence from Fan, Yao and Cai (2003) and Hong and Lee (2003), the exchange rate series is predictable by using the functional coefficient autoregressive model
Y_t = a_0(U_t) + Σ_{j=1}^{d} a_j(U_t) Y_{t−j} + σ_t e_t,   (5.14)
where U_t is the smoothing variable defined below and σ_t is a function of U_t and the lagged variables. If U_t is observable, a_j(·) can be estimated by a local linear fitting, denoted by ā_j(·); see Cai, Fan and Yao (2000) for details. Here, σ_t is the stochastic volatility, which may depend on U_t and the lagged variables Y_{t−j}. Now the question is how to choose U_t. Usually, U_t can be chosen based on knowledge of the data or economic theory. However, if no prior information is available, U_t may be chosen as a function of the explanatory vector X_{t−j} or through the use of data-driven methods such as AIC or cross-validation. Recently, Fan, Yao and Cai (2003) proposed a data-driven method for the choice of U_t as a linear combination of X_{t−j} and the lagged variables Y_{t−j}. Following the analysis of Fan, Yao and Cai (2003) and Hong and Lee (2003), we choose the smoothing variable U_t as a moving average technical trading rule (MATTR) in finance, so that the autoregressive coefficients vary with investment positions. U_t is defined as U_t = ξ_{t−1}/M_t − 1, where M_t = Σ_{j=1}^{L} ξ_{t−j}/L is the moving average, which can be regarded as a proxy for the trend at time t − 1. Similar to Hong and Lee (2003), we choose L = 26 (half a year). Thus, U_t + 1 is the ratio of the exchange rate at time t − 1 to the average of the most recent L exchange rates at time t − 1. The time series plot of U_t is given in Figure 5.4(c). As pointed out by Hong and Lee (2003), U_t is expected to reveal some useful information on the direction of changes. The MATTR signals 1 (the position to buy JPY) when U_t > 0 and −1 (the position to sell JPY) when U_t < 0. For detailed discussions of the MATTR, see, for example, LeBaron (1997, 1999), Hong and Lee (2003), Fan, Yao and Cai (2003), and the references therein. Note that model (5.14) was studied by Fan, Yao and Cai (2003) for the daily data and by Hong and Lee (2003) for the weekly data under the homogeneity assumption (σ_t = σ) based on least squares theory. In particular, Hong and Lee (2003) provided some empirical evidence to conclude that model (5.14) outperforms the martingale model and autoregressive models.
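A small R sketch of the MATTR smoothing variable is given below; xi denotes the weekly exchange rate level series ξ_t (the variable name is an assumption of this sketch):

mattr <- function(xi, L = 26) {
  n <- length(xi)
  U <- rep(NA, n)
  for (t in (L + 1):n) {
    M    <- mean(xi[(t - L):(t - 1)])      # moving average of the last L levels, M_t
    U[t] <- xi[t - 1] / M - 1              # U_t = xi_{t-1}/M_t - 1
  }
  U
}
# returns and smoothing variable, aligned on t = 2, ..., n:
# Y <- 100 * diff(log(xi));  U <- mattr(xi)[-1]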
We analyze this exchange rate series by using the smooth coefficient model under the quantile regression framework with only two lagged variables,² as follows:
q_τ(U_t, Y_{t−1}, Y_{t−2}) = a_{0,τ}(U_t) + a_{1,τ}(U_t) Y_{t−1} + a_{2,τ}(U_t) Y_{t−2}.   (5.15)
The first 1540 observations of Y_t are used for estimation and the last 25 observations are left for prediction. The coefficient functions a_{j,τ}(·) are estimated through the local linear quantile approach and denoted by â_{j,τ}(·). The selected optimal bandwidths are h_opt = 0.03 for τ = 0.05, 0.025 for τ = 0.50, and 0.03 for τ = 0.95.

² We also considered models with more than two lagged variables and found that the conclusions are similar; those results are not reported here.
Figure 5.5: Exchange Rate Series: The plots of the estimated coefficient functions for three quantiles τ = 0.05 (solid line), τ = 0.50 (dashed line), and τ = 0.95 (dotted line), and the mean regression (dot-dashed line): â_{0,0.50}(u) and ā_0(u) versus u in (d), â_{0,0.05}(u) and â_{0,0.95}(u) versus u in (e), â_{1,τ}(u) and ā_1(u) versus u in (f), and â_{2,τ}(u) and ā_2(u) versus u in (g). The thick dashed lines indicate the 95% pointwise confidence interval for the median estimate with the bias ignored.
Figures 5.5(d)-5.5(g) depict the estimated coefficient functions â_{0,τ}(·), â_{1,τ}(·), and â_{2,τ}(·), respectively, for the three quantiles τ = 0.05 (solid line), 0.50 (dashed line), and 0.95 (dotted line), together with the estimates ā_j(·) (dot-dashed line) from the mean regression model (5.14). Also, the 95% pointwise confidence intervals for the median estimate are displayed by the thick dashed lines without the bias correction.

First, from Figures 5.5(d), 5.5(f), and 5.5(g), we see clearly that the median estimates â_{j,0.50}(·) in (5.15) are almost parallel with or close to the mean estimates ā_j(·) in (5.14), and the mean estimates are almost always within the 95% confidence interval of the median estimates. Secondly, â_{0,0.50}(·) in Figure 5.5(d) shows a nonlinear pattern (increasing and then decreasing), and â_{0,0.05}(·) and â_{0,0.95}(·) in Figure 5.5(e) behave nonlinearly (slightly U-shaped) and symmetrically. More importantly, one can observe from Figures 5.5(f) and 5.5(g) that the lower and upper quantile estimated coefficient curves intersect and behave slightly differently. In particular, from Figure 5.5(g), we observe that â_{2,0.05}(U_t) seems to be nonlinear but â_{2,0.95}(U_t) looks constant when U_t < 0.06, and both â_{2,0.05}(U_t) and â_{2,0.95}(U_t) decrease when U_t > 0.06. One might conclude that the distribution of the measurement error e_t in (5.14) might not be symmetric about 0 and that there exists nonlinearity in a_{j,τ}(·). This supports the nonlinearity test of Hong and Lee (2003). Also, our findings lead to the conclusions that the quantile structure is complex and that heteroscedasticity exists. This observation supports the existing conclusion in the literature that GARCH (generalized ARCH) effects occur in exchange rate time series; see Engle, Ito and Lin (1990) and Tsay (2002).
Finally, we consider the post-sample forecasting of the last 25 observations based on the local linear quantile estimators, computed with the same bandwidths as those used in the model fitting. The 95% nonparametric prediction interval is constructed as (q̂_{0.025}(·), q̂_{0.975}(·)), and the prediction results are reported in Table 5.2, which shows that 24 out of the 25 predictive intervals contain the corresponding true values. The average length of the intervals is 5.77, which is about 35.5% of the range of the data. Therefore, we can conclude that under the dynamic smooth coefficient quantile regression model assumption, the prediction intervals based on the proposed method work reasonably well.
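The interval check itself is straightforward once the two extreme conditional quantile estimates are available; a hedged R sketch is given below, where q.hat(tau, u, y1, y2) is assumed to return the fitted conditional quantile from the local linear fit (it is not constructed here):

check.intervals <- function(q.hat, Y, U, idx = 1541:1565) {
  lo <- hi <- numeric(length(idx))
  for (i in seq_along(idx)) {
    t     <- idx[i]
    lo[i] <- q.hat(0.025, U[t], Y[t - 1], Y[t - 2])   # lower prediction limit
    hi[i] <- q.hat(0.975, U[t], Y[t - 1], Y[t - 2])   # upper prediction limit
  }
  covered <- (Y[idx] >= lo) & (Y[idx] <= hi)
  list(coverage = mean(covered), avg.length = mean(hi - lo))
}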
5.4 Derivations

In this section, we give the derivations of the theorems and present certain lemmas, with their detailed proofs relegated to Section 5.5. First, we need the following two lemmas.

Lemma 5.1: Let V_n(θ) be a vector function that satisfies
(i) −θ′ V_n(λ θ) ≥ −θ′ V_n(θ) for λ ≥ 1, and
(ii) sup_{‖θ‖ ≤ M} ‖V_n(θ) + D θ − A_n‖ = o_p(1), where ‖A_n‖ = O_p(1), 0 < M < ∞, and D is a positive-definite matrix.
If θ_n is a vector such that ‖V_n(θ_n)‖ = o_p(1), then we have
(1) ‖θ_n‖ = O_p(1)   and   (2) θ_n = D^{−1} A_n + o_p(1).
Table 5.2: The Post-Sample Predictive Intervals for the Exchange Rate Data

  Observation   True Value   Predictive Interval
  Y_1541          0.392      (-2.891, 2.412)
  Y_1542          0.509      (-3.099, 2.405)
  Y_1543          1.549      (-2.943, 2.446)
  Y_1544         -0.121      (-2.684, 2.525)
  Y_1545         -0.991      (-2.677, 2.530)
  Y_1546         -0.646      (-3.110, 2.401)
  Y_1547         -0.354      (-3.178, 2.365)
  Y_1548         -1.393      (-3.083, 2.372)
  Y_1549          0.997      (-3.110, 2.230)
  Y_1550         -0.916      (-3.033, 2.431)
  Y_1551         -3.707      (-3.021, 2.286)
  Y_1552         -0.919      (-3.841, 2.094)
  Y_1553         -0.901      (-3.603, 2.770)
  Y_1554          0.071      (-3.583, 2.821)
  Y_1555         -0.497      (-3.351, 2.899)
  Y_1556         -0.648      (-3.436, 2.783)
  Y_1557          1.648      (-3.524, 2.866)
  Y_1558         -1.184      (-3.121, 2.810)
  Y_1559          0.530      (-3.529, 2.531)
  Y_1560          0.107      (-3.222, 2.648)
  Y_1561         -0.804      (-3.294, 2.651)
  Y_1562          0.274      (-3.419, 2.534)
  Y_1563         -0.847      (-3.242, 2.640)
  Y_1564         -0.060      (-3.426, 2.532)
  Y_1565         -0.088      (-3.300, 2.576)
Proof: The proof follows from Jurečková (1977) and Koenker and Zhao (1996).
Lemma 5.2: Let β̂ be the minimizer of the function
Σ_{t=1}^{n} w_t ρ_τ(y_t − X_t′ β),
where w_t > 0. Then,
‖Σ_{t=1}^{n} w_t X_t ψ_τ(y_t − X_t′ β̂)‖ ≤ dim(X) max_{t ≤ n} ‖w_t X_t‖.

Proof: The proof follows from Ruppert and Carroll (1980).
From the definition of θ, we have
β = ( a(u_0)′, a′(u_0)′ )′ + a_n H^{−1} θ,
where a_n is defined in (5.10). Then, Y_t − Σ_{j=0}^{q} X_t′ β_j (U_t − u_0)^j = Y_t* − a_n θ′ X_t*. Therefore,
θ̂ = argmin_θ Σ_{t=1}^{n} ρ_τ[ Y_t* − a_n θ′ X_t* ] K(U_{th}) ≡ argmin_θ G(θ).
Now, define V_n(θ) as
V_n(θ) = a_n Σ_{t=1}^{n} ψ_τ[ Y_t* − a_n θ′ X_t* ] X_t* K(U_{th}).   (5.16)
To establish the asymptotic properties of θ̂, in the next three lemmas we show that V_n(θ) satisfies Lemma 5.1, so that we can derive the local Bahadur representation for θ̂. The results are stated here and their detailed proofs are given in Section 5.5. For notational convenience, define A_m = {θ : ‖θ‖ ≤ M} for some 0 < M < ∞.
Lemma 5.3: Under the assumptions of Theorem 5.1, we have
sup_{θ ∈ A_m} ‖V_n(θ) − V_n(0) − E[V_n(θ) − V_n(0)]‖ = o_p(1).

Lemma 5.4: Under the assumptions of Theorem 5.1, we have
sup_{θ ∈ A_m} ‖E[V_n(θ) − V_n(0)] + f_u(u_0) Ω_1*(u_0) θ‖ = o(1).

Lemma 5.5: Let Z_t = ψ_τ(Y_t*) X_t* K(U_{th}). Under the assumptions of Theorem 5.1, we have
E[Z_1] = ( h³ f_u(u_0) / 2 ) ( μ_2 Ω*(u_0) a″(u_0), 0 )′ {1 + o(1)}
and
Var[Z_1] = h τ (1 − τ) f_u(u_0) Ω_1(u_0) {1 + o(1)},
where
Ω_1(u_0) = diag{ Ω_0(u_0), Ω_2(u_0) }.
Further,
Var[V_n(0)] → τ (1 − τ) f_u(u_0) Ω_1(u_0).
Therefore, ‖V_n(0)‖ = O_p(1).
Now we can embark on the proofs of the theorems.

Proof of Theorem 5.1: By Lemmas 5.5, 5.3, and 5.4, V_n(θ) satisfies condition (ii) of Lemma 5.1; that is, ‖A_n‖ = O_p(1) and sup_{θ ∈ A_m} ‖V_n(θ) + D θ − A_n‖ = o_p(1), with D = f_u(u_0) Ω_1*(u_0) and A_n = V_n(0). It follows from Lemma 5.2 that ‖V_n(θ̂)‖ = o_p(1), where θ̂ is the minimizer of G(θ). Finally, since ψ_τ(x) is an increasing function of x, each summand −(θ′ X_t*) ψ_τ( Y_t* − λ a_n θ′ X_t* ) K(U_{th}) is nondecreasing in λ, and hence so is
−θ′ V_n(λ θ) = −a_n Σ_{t=1}^{n} (θ′ X_t*) ψ_τ( Y_t* − λ a_n θ′ X_t* ) K(U_{th}).
Thus, condition (i) of Lemma 5.1 is satisfied. Therefore, it follows that
θ̂ = D^{−1} A_n + o_p(1) = [Ω_1*(u_0)]^{−1} / ( √(nh) f_u(u_0) ) Σ_{t=1}^{n} ψ_τ(Y_t*) X_t* K(U_{th}) + o_p(1).   (5.17)
This proves (5.6).
Proof of Theorem 5.2: Let η_t = ψ_τ( Y_t − X_t′ a(U_t) ). Then E(η_t) = 0 and Var(η_t) = τ (1 − τ). From (5.17),
θ̂ = [Ω_1*(u_0)]^{−1} / ( √(nh) f_u(u_0) ) Σ_{t=1}^{n} [ ψ_τ(Y_t*) − η_t ] X_t* K(U_{th}) + [Ω_1*(u_0)]^{−1} / ( √(nh) f_u(u_0) ) Σ_{t=1}^{n} η_t X_t* K(U_{th}) + o_p(1) ≡ B_n + T_n + o_p(1).
Similar to the proof of Theorem 2 in Cai, Fan and Yao (2000), by using the small-block and large-block technique and the Cramér-Wold device, one can show that
T_n → N( 0, Σ(u_0) ).   (5.18)
By the stationarity and Lemma 5.5,
E[B_n] = [Ω_1*(u_0)]^{−1} / ( √(nh) f_u(u_0) ) · n E[Z_1] {1 + o(1)} = a_n^{−1} (h²/2) ( μ_2 a″(u_0), 0 )′ {1 + o(1)}.   (5.19)
Since ψ_τ(Y_t*) − η_t = I( Y_t ≤ X_t′ a(U_t) ) − I( Y_t ≤ X_t′ ( a(u_0) + a′(u_0)(U_t − u_0) ) ), we have
[ ψ_τ(Y_t*) − η_t ]² = I( d_{1t} < Y_t ≤ d_{2t} ),   (5.20)
where d_{1t} = min(c_{1t}, c_{2t}) and d_{2t} = max(c_{1t}, c_{2t}) with c_{1t} = X_t′ a(U_t) and c_{2t} = X_t′ [ a(u_0) + a′(u_0)(U_t − u_0) ]. Further,
E{ [ ψ_τ(Y_t*) − η_t ]² K²(U_{th}) X_t* X_t*′ } = E{ [ F_{y|u,x}(d_{2t}) − F_{y|u,x}(d_{1t}) ] K²(U_{th}) X_t* X_t*′ } = O(h³).
Thus, Var(B_n) = o(1). This, in conjunction with (5.18) and (5.19) and the Slutsky theorem, proves the theorem.
5.5 Proofs of Lemmas

Note that the same notation as in Sections 5.2 and 5.4 is used here. Throughout this section, we denote by C a generic constant, which may take different values at different appearances. Let F_{y|u,x}(y) denote the conditional distribution of Y given U and X.
Proof of Lemma 5.3: First, for any θ ∈ A_m, we consider the following term:
V_n(θ) − V_n(0) = a_n Σ_{t=1}^{n} [ ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) ] X_t* K(U_{th}) ≡ a_n Σ_{t=1}^{n} V_{nt}(θ),
where Y*_{nt} = Y_t* − a_n θ′ X_t* and V_{nt}(θ) = V_{nt} = [ ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) ] X_t* K(U_{th}) = (V_{nt1}′, V_{nt2}′)′ with
V_{nt1} = [ ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) ] X_t K(U_{th})   and   V_{nt2} = [ ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) ] X_t U_{th} K(U_{th}).
Thus,
‖V_n(θ) − V_n(0) − E[V_n(θ) − V_n(0)]‖ ≤ a_n ‖Σ_{t=1}^{n} (V_{nt1} − E V_{nt1})‖ + a_n ‖Σ_{t=1}^{n} (V_{nt2} − E V_{nt2})‖ ≡ V_n^{(1)} + V_n^{(2)}.
Clearly,
V_n^{(1)} ≡ a_n ‖Σ_{t=1}^{n} (V_{nt1} − E V_{nt1})‖ ≤ Σ_{i=0}^{d} |V_n^{(1i)}|,
where V_n^{(1i)} = a_n Σ_{t=1}^{n} ( V^{(i)}_{nt1} − E V^{(i)}_{nt1} ) and V^{(i)}_{nt1} = [ ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) ] X_{ti} K(U_{th}) is the i-th component of V_{nt1}. Then,
Var(V_n^{(1i)}) = a_n² E[ Σ_{t=1}^{n} ( V^{(i)}_{nt1} − E V^{(i)}_{nt1} ) ]²
= a_n² [ Σ_{t=1}^{n} Var(V^{(i)}_{nt1}) + 2 Σ_{s=1}^{n−1} (1 − s/n) Cov(V^{(i)}_{n11}, V^{(i)}_{n(s+1)1}) ]
≤ (1/h) [ Var(V^{(i)}_{n11}) + 2 Σ_{s=1}^{d_n−1} |Cov(V^{(i)}_{n11}, V^{(i)}_{n(s+1)1})| + 2 Σ_{s=d_n}^{∞} |Cov(V^{(i)}_{n11}, V^{(i)}_{n(s+1)1})| ]
≡ J_1 + J_2 + J_3
for some d_n specified later. For J_3, use Davydov's inequality (see, e.g., Corollary A.2 of Hall and Heyde, 1980) to obtain
|Cov(V^{(i)}_{n11}, V^{(i)}_{n(s+1)1})| ≤ C α^{1−2/δ}(s) [ E|V^{(i)}_{n11}|^{δ} ]^{2/δ}.
Similar to (5.20), for any k > 0,
| ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) |^{k} = I( d_{3t} < Y_t ≤ d_{4t} ),
where d_{3t} = min(c_{2t}, c_{2t} + c_{3t}) and d_{4t} = max(c_{2t}, c_{2t} + c_{3t}) with c_{3t} = a_n θ′ X_t*. Therefore, by Assumption (C3), there exists a C > 0 independent of θ such that
E[ | ψ_τ(Y*_{nt}) − ψ_τ(Y_t*) |^{k} | U_t, X_t ] = F_{y|u,x}(d_{4t}) − F_{y|u,x}(d_{3t}) ≤ C a_n |θ′ X_t*|,
which implies that
E|V^{(i)}_{n11}|^{δ} = E[ | ψ_τ(Y*_{n1}) − ψ_τ(Y_1*) |^{δ} |X_{1i}|^{δ} K^{δ}(U_{1h}) ] ≤ C a_n E[ |θ′ X_1*| |X_{1i}|^{δ} K^{δ}(U_{1h}) ] ≤ C a_n h
uniformly in θ over A_m by Assumption (C6). Then,
J_3 ≤ C a_n^{2/δ} h^{2/δ−1} Σ_{s=d_n}^{∞} [α(s)]^{1−2/δ} ≤ C a_n^{2/δ} h^{2/δ−1} d_n^{−ℓ} Σ_{s=d_n}^{∞} s^{ℓ} [α(s)]^{1−2/δ} = o( a_n^{2/δ} h^{2/δ−1} d_n^{−ℓ} )
uniformly in θ over A_m. As for J_2, we use Assumption (C10) to get
|Cov(V^{(i)}_{n11}, V^{(i)}_{n(s+1)1})| ≤ C { E[ X_{1i} X_{(s+1)i} K(U_{1h}) K(U_{(s+1)h}) ] + a_n² h² } = O(h²)
uniformly in θ over A_m. It follows that J_2 = O(d_n h) uniformly in θ over A_m. Analogously,
J_1 = h^{−1} Var(V^{(i)}_{n11}) ≤ h^{−1} E(V^{(i)}_{n11})² = O(a_n)
uniformly in θ over A_m. By choosing d_n such that d_n^{ℓ} h^{1−2/δ} = c, we have d_n h → 0 and Var(V_n^{(1i)}) = o(1). Therefore, V_n^{(1i)} = o_p(1), so that V_n^{(1)} = o_p(1) uniformly in θ over A_m. By the same token, we can show that V_n^{(2)} = o_p(1) uniformly in θ over A_m. This completes the proof of the lemma.
Proof of Lemma 5.4: It is easy to justify that
E[V_n(θ) − V_n(0)] = n a_n E{ [ ψ_τ(Y_t* − a_n θ′ X_t*) − ψ_τ(Y_t*) ] X_t* K(U_{th}) }
= n a_n E{ [ F_{y|u,x}(c_{2t}) − F_{y|u,x}(c_{2t} + a_n θ′ X_t*) ] X_t* K(U_{th}) }
≈ −(1/h) E[ f_{y|u,x}(c_{2t}) X_t* X_t*′ K(U_{th}) ] θ
→ −f_u(u_0) Ω_1*(u_0) θ
uniformly in θ over A_m by Assumption (C3). The proof of the lemma is complete.
Proof of Lemma 5.5: Observe by Taylor expansions and Assumption (C3) that
E[Z_t] = E{ [ τ − F_{y|u,x}(c_{2t}) ] X_t* K(U_{th}) }
≈ E{ [ F_{y|u,x}( c_{2t} + X_t′ a″(u_0) h² U²_{th} / 2 ) − F_{y|u,x}(c_{2t}) ] X_t* K(U_{th}) }
≈ (h²/2) E{ f_{y|u,x}(c_{2t}) X_t* X_t′ a″(u_0) U²_{th} K(U_{th}) }
≈ (h²/2) E{ f_{y|u,x}( q_τ(u_0, X_t) ) X_t* X_t′ a″(u_0) U²_{th} K(U_{th}) }
≈ ( h³ f_u(u_0) / 2 ) ( μ_2 Ω*(u_0) a″(u_0), 0 )′.   (5.21)
Also, we have
Var[Z_t] ≈ E{ [ τ − I(Y_t < c_{2t}) ]² X_t* X_t*′ K²(U_{th}) }
= E{ [ τ² − 2 τ F_{y|u,x}(c_{2t}) + F_{y|u,x}(c_{2t}) ] X_t* X_t*′ K²(U_{th}) }
≈ τ (1 − τ) E[ X_t* X_t*′ K²(U_{th}) ]
≈ τ (1 − τ) h f_u(u_0) Ω_1(u_0).   (5.22)
Next, we show that the last part of the lemma holds. Clearly, V_n(0) = a_n Σ_{t=1}^{n} Z_t. Similar to the proof of Lemma 5.3, we have
Var[V_n(0)] = (1/h) Var(Z_1) + (2/h) Σ_{s=1}^{d_n−1} (1 − s/n) Cov(Z_1, Z_{s+1}) + (2/h) Σ_{s=d_n}^{n} (1 − s/n) Cov(Z_1, Z_{s+1}) ≡ J_4 + J_5 + J_6
for some d_n specified later. By (5.22),
J_4 → τ (1 − τ) f_u(u_0) Ω_1(u_0).
Therefore, it suffices to show that |J_5| = o(1) and |J_6| = o(1). For J_6, using Davydov's inequality (see, e.g., Corollary A.2 of Hall and Heyde, 1980) and the boundedness of ψ_τ(·), we obtain
|Cov(Z_1, Z_{s+1})| ≤ C α^{1−2/δ}(s) [ E|Z_1|^{δ} ]^{2/δ} ≤ C h^{2/δ} α^{1−2/δ}(s),
which gives
J_6 ≤ C h^{2/δ−1} Σ_{s=d_n}^{∞} [α(s)]^{1−2/δ} ≤ C h^{2/δ−1} d_n^{−ℓ} Σ_{s=d_n}^{∞} s^{ℓ} [α(s)]^{1−2/δ} = o( h^{2/δ−1} d_n^{−ℓ} ) = o(1)
by choosing d_n to satisfy d_n^{ℓ} h^{1−2/δ} = c. As for J_5, we use Assumption (C10) and (5.21) to get
|Cov(Z_1, Z_{s+1})| ≤ C { E[ |X_1*′ X_{s+1}*| K(U_{1h}) K(U_{(s+1)h}) ] + h⁶ } = O(h²),
so that J_5 = O(d_n h) = o(1) by the choice of d_n. This finishes the proof of the lemma.
Proof of (5.9) and (5.10): By the Taylor expansion,
F_{y|u,x}( X_t′ a(u_0) + a_n ) − F_{y|u,x}( X_t′ a(u_0) ) ≈ f_{y|u,x}( X_t′ a(u_0) ) a_n.
Therefore,
E[Ŝ_n] ≈ h^{−1} E[ f_{y|u,x}( X_t′ a(u_0) ) X_t* X_t*′ K(U_{th}) ] → f_u(u_0) Ω_1*(u_0).
Similar to the proof for Var[V_n(0)] in Lemma 5.5, one can show that Var(Ŝ_n) → 0. Therefore, Ŝ_n → f_u(u_0) Ω_1*(u_0) in probability. This proves (5.9). Clearly,
E[ Ω̂_{n,0} ] = E[ X_t X_t′ K_h(U_t − u_0) ] = ∫ Ω(u_0 + h v) f_u(u_0 + h v) K(v) dv → f_u(u_0) Ω(u_0).
Similarly, one can show that Var(Ω̂_{n,0}) → 0. This proves the first part of (5.10). By the same token, one can show that E[ Ω̂_{n,1} ] → f_u(u_0) Ω*(u_0) and Var(Ω̂_{n,1}) → 0. Thus, Ω̂_{n,1} = f_u(u_0) Ω*(u_0) + o_p(1). This proves (5.10).
5.6 Computer Codes

Please see the files chapter5-1.r, chapter5-2.r, and chapter5-3.r for making the figures. If you want to learn the codes for the computations, they are available upon request.
5.7 References
An, H.Z. and Chen, S.G. (1997). A Note on the Ergodicity of Nonlinear Autoregressive
Models. Statistics and Probability Letters, 34, 365-372.
An, H.Z. and Huang, F.C. (1996). The Geometrical Ergodicity of Nonlinear Autoregressive
Models. Statistica Sinica, 6, 943-956.
Auestad, B. and Tjøstheim, D. (1990). Identification of nonlinear time series: First order
characterization and order determination. Biometrika, 77, 669-687.
Bao, Y., Lee, T.-H. and Saltoğlu, B. (2001). Evaluating predictive performance of value-at-
risk models in emerging markets: a reality check. Journal of Forecasting, forthcoming.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformation for multiple
regression and correlation. Journal of the American Statistical Association, 80, 580-619.
Cai, Z. (2002a). Regression quantile for time series. Econometric Theory, 18, 169-192.
Cai, Z. (2002b). A two-stage approach to additive time series models. Statistica Neerlandica,
56, 415-433.
Cai, Z. (2007). Trending time-varying coefficient time series models with serially correlated
errors. Journal of Econometrics, 137, 163-188.
Cai, Z., Fan, J. and Yao, Q. (2000). Functional-coefficient regression models for nonlinear
time series. Journal of the American Statistical Association, 95, 941-956.
Cai, Z. and Masry, E. (2000). Nonparametric estimation in nonlinear ARX time series
models: Projection and linear fitting. Econometric Theory, 16, 465-501.
Cai, Z. and Tiwari, R.C. (2000). Application of a local linear autoregressive model to BOD
time series. Environmetrics, 11, 341-350.
Cai, Z. and Xu, X. (2005). Nonparametric quantile estimations for dynamic smooth coefficient
models. Forthcoming in Journal of the American Statistical Association.
Chaudhuri, P. (1991). Nonparametric estimates of regression quantiles and their local
Bahadur representation. The Annals of Statistics, 19, 760-777.
Chaudhuri, P., Doksum, K. and Samarov, A. (1997). On average derivative quantile regression.
The Annals of Statistics, 25, 715-744.
CHAPTER 5. NONPARAMETRIC QUANTILE MODELS 136
Chen, R. and Tsay, R.S. (1993). Functional-coecient autoregressive models. Journal of
the American Statistical Association, 88, 298-308.
Cole, T.J. (1994). Growth charts for both cross-sectional and longitudinal data. Statistics
in Medicine, 13, 2477-2492.
De Gooijer, J. and Zerom, D. (2003). On additive conditional quantiles with high dimen-
sional covariates. Journal of American Statistical Association, 98, 135-146.
Duffie, D. and Pan, J. (1997). An overview of value at risk. Journal of Derivatives, 4, 7-49.
Engle, R.F., Ito, T. and Lin, W. (1990). Meteor showers or heat waves? Heteroskedastic
intra-daily volatility in the foreign exchange market. Econometrica, 58, 525-542.
Engle, R.F. and Manganelli, S. (2004). CAViaR: conditional autoregressive value at risk
by regression quantile. Journal of Business and Economic Statistics, 22, 367-381.
Efron, B. (1991). Regression percentiles using asymmetric squared error loss. Statistica
Sinica, 1, 93-125.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman
and Hall, London.
Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic
regression. Biometrika, 85, 645-660.
Fan, J., Yao, Q. and Cai, Z. (2003). Adaptive varying-coefficient linear models. Journal of
the Royal Statistical Society, Series B, 65, 57-80.
Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity
measures in nonlinear dynamical systems. Biometrika, 83, 189-206.
Gallant, A.R., Hsieh, D.A. and Tauchen, G.E. (1991). On fitting a recalcitrant series:
the pound/dollar exchange rate, 1974-1983. In Nonparametric And Semiparametric
Methods in Econometrics and Statistics (W.A. Barnett, J. Powell and G.E. Tauchen,
eds.), pp.199-240. Cambridge: Cambridge University Press.
Gilley, O.W. and Pace, R.K. (1996). On the Harrison and Rubinfeld Data. Journal of
Environmental Economics and Management, 31, 403-405.
Gorodetskii, V.V. (1977). On the strong mixing property for linear sequences. Theory of
Probability and Its Applications, 22, 411-413.
Granger, C.W.J., White, H. and Kamstra, M. (1989). Interval forecasting: an analysis
based upon ARCH-quantile estimators. Journal of Econometrics, 40, 87-96.
Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Applications. Academic
Press, New York.
CHAPTER 5. NONPARAMETRIC QUANTILE MODELS 137
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic housing prices and demand for clean air.
Journal of Environmental Economics and Management, 5, 81-102.
Hastie, T.J. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall,
London.
He, X. and Ng, P. (1999). Quantile splines with several covariates. Journal of Statistical
Planning and Inference, 75, 343-352.
He, X., Ng, P. and Portnoy, S. (1998). Bivariate quantile smoothing splines. Journal of the
Royal Statistical Society, Series B, 60, 537-550.
He, X. and Portnoy, S. (2000). Some asymptotic results on bivariate quantile splines.
Journal of Statistical Planning and Inference, 91, 341-349.
Honda, T. (2000). Nonparametric estimation of a conditional quantile for α-mixing processes.
Annals of the Institute of Statistical Mathematics, 52, 459-470.
Honda, T. (2004). Quantile regression in varying coefficient models. Journal of Statistical
Planning and Inference, 121, 113-125.
Hong, Y. and Lee, T.-H. (2003). Inference on predictability of foreign exchange rates via
generalized spectrum and nonlinear time series models. The Review of Economics and Statistics, 85, 1048-1062.
Horowitz, J.L. and Lee, S. (2005). Nonparametric Estimation of an Additive Quantile
Regression Model. Journal of the American Statistical Association, 100, 1238-1249.
Hurvich, C.M., Simonoff, J.S. and Tsai, C.-L. (1998). Smoothing parameter selection in
nonparametric regression using an improved Akaike information criterion. Journal of
the Royal Statistical Society, Series B, 60, 271-293.
Hurvich, C.M. and Tsai, C.-L. (1989). Regression and time series model selection in small
samples. Biometrika, 76, 297-307.
Jorion, P. (2000). Value at Risk, 2ed. McGraw Hill, New York.
Jurečková, J. (1977). Asymptotic relations of M-estimates and R-estimates in linear regression
model. The Annals of Statistics, 5, 464-472.
Khindanova, I.N. and Rachev, S.T. (2000). Value at risk: Recent advances. Handbook on
Analytic-Computational Methods in Applied Mathematics, CRC Press LLC.
Koenker, R. (1994). Confidence intervals for regression quantiles. In Proceedings of the
Fifth Prague Symposium on Asymptotic Statistics (P. Mandl and M. Hušková, eds.),
349-359. Physica, Heidelberg.
Koenker, R. (2004). Quantreg: An R package for quantile regression and related methods
http://cran.r-project.org.
CHAPTER 5. NONPARAMETRIC QUANTILE MODELS 138
Koenker R. (2000). Galton, Edgeworth, Frisch, and prospects for quantile regression in
econometrics. Journal of Econometrics, 95, 347-374.
Koenker, R. and Bassett, G.W. (1978). Regression quantiles. Econometrica, 46, 33-50.
Koenker, R. and Bassett, G.W. (1982). Robust tests for heteroscedasticity based on regres-
sion quantiles. Econometrica, 50, 43-61.
Koenker, R. and Hallock, K.F. (2001). Quantile regression: An introduction. Journal of
Economic Perspectives, 15, 143-157.
Koenker, R., Ng, P. and Portnoy, S. (1994). Quantile smoothing splines. Biometrika, 81,
673-680.
Koenker, R. and Xiao, Z. (2002). Inference on the quantile regression process. Economet-
rica, 70, 1583-1612.
Koenker, R. and Xiao, Z. (2004). Unit root quantile autoregression inference. Journal of
American Statistical Association, 99, 775-787.
Koenker, R. and Zhao, Q. (1996). Conditional quantile estimation and inference for ARCH
models. Econometric Theory, 12, 793-813.
LeBaron, B. (1997). Technical trading rule and regime shifts in foreign exchange. In
Advances in Trading Rules (E. Acar and S. Satchell, eds.). Butterworth-Heinemann.
LeBaron, B. (1999). Technical trading rule profitability and foreign exchange intervention.
Journal of International Economics, 49, 125-143.
Li, Q. and Racine, J. (2004). Nonparametric estimation of conditional CDF and quan-
tile functions with mixed categorical and continuous data. Journal of Business and
Economic Statistics, forthcoming.
Lu, Z. (1998). On the ergodicity of non-linear autoregressive model with an autoregressive
conditional heteroscedastic term. Statistica Sinica, 8, 1205-1217.
Lu, Z., Hui, Y.V. and Zhao, Q. (2000). Local linear quantile regression under dependence:
Bahadur representation and application. Working Paper, Department of Management
Sciences, City University of Hong Kong.
Masry, E. and Tjøstheim, D. (1995). Nonparametric estimation and identification of non-
linear ARCH time series: Strong convergence and asymptotic normality. Econometric
Theory, 11, 258-289.
Masry, E. and Tjøstheim, D. (1997). Additive nonlinear ARX time series and projection
estimates. Econometric Theory, 13, 214-252.
Machado, J.A.F. (1993). Robust model selection and M-estimation. Econometric Theory,
9, 478-493.
CHAPTER 5. NONPARAMETRIC QUANTILE MODELS 139
Morgan, J.P. (1995). Riskmetrics Technical Manual, 3ed.
Opsomer, J.D. and Ruppert, D. (1998). A fully automated bandwidth selection for additive
regression model. Journal of the American Statistical Association, 93, 605-618.
Pace, R.K. and Gilley, O.W. (1997). Using the spatial configuration of the data to improve
estimation. The Journal of Real Estate Finance and Economics, 14, 333-340.
Ruppert, D. and Carroll, R.J. (1980). Trimmed least squares estimation in the linear model.
Journal of The American Statistical Association, 75, 828-838.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6,
461-464.
Sercu, P. and Uppal, R. (2000). Exchange Rate Volatility, Trade, and Capital Flows under
Alternative Rate Regimes. Cambridge: Cambridge University Press.
Şentürk, D. and Müller, H.G. (2005). Covariate adjusted correlation analysis. Forthcoming
in Scandinavian Journal of Statistics.
Taylor, J.W. and Bunn, D.W. (1999). A quantile regression approach to generating predic-
tion intervals. Management Science, 45, 225-237.
Tsay, R.S. (2000). Extreme values, quantile estimation and value at risk. Working paper,
Graduate School of Business, University of Chicago.
Tsay, R.S. (2002). Analysis of Financial Time Series. John Wiley & Sons, New York.
Wang, K. (2003). Asset pricing with conditioning information: A new test. Journal of
Finance, 58, 161-196.
Wei, Y. and He, X. (2006). Conditional growth charts (with discussion). The Annals of
Statistics, 34, 2069-2097.
Wei, Y., Pere, A., Koenker, R. and He, X. (2006). Quantile regression methods for reference
growth charts. Statistics in Medicine, 25, 1369-1382.
Withers, C.S. (1981). Conditions for linear processes to be strong mixing. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 477-480.
Yu, K. and Jones, M.C. (1998). Local linear quantile regression. Journal of the American
Statistical Association, 93, 228-237.
Yu, K. and Lu, Z. (2004). Local linear additive quantile regression. Scandinavian Journal
of Statistics, 31, 333-346.
Xu, X. (2005). Semiparametric Quantile Dynamic Time Series Models and Their Applica-
tions. Ph.D. Dissertation, University of North Carolina at Charlotte.
Zhou, K.Q. and Portnoy, S.L. (1996). Direct use of regression quantiles to construct confidence
sets in linear models. The Annals of Statistics, 24, 287-306.
Chapter 6

Conditional VaR and Expected Shortfall

For details, see the paper by Cai and Wang (2006). If you would like to read the whole paper, you can download it from the web site at http://www.wise.xmu.edu.cn/ under the WORKING PAPER column. Next we present only a part of the whole paper of Cai and Wang (2006).

6.1 Introduction

The value-at-risk (hereafter, VaR) and expected shortfall (ES) have become two popular measures of market risk associated with an asset or a portfolio of assets during the last decade. In particular, VaR has been chosen by the Basle Committee on Banking Supervision as the benchmark risk measure for capital requirements, and both of them have been used by financial institutions for asset management and minimization of risk, as well as being developed rapidly as analytic tools to assess the riskiness of trading activities. See, to name just a few, Morgan (1996), Duffie and Pan (1997), Jorion (2001, 2003), and Duffie and Singleton (2003) for the financial background, statistical inference, and various applications. In terms of the formal definition, VaR is simply a quantile of the loss distribution (future portfolio values) over a prescribed holding period (e.g., 2 weeks) at a given confidence level, while ES is the expected loss, given that the loss is at least as large as some given quantile of the loss distribution (e.g., VaR). It is well known from Artzner, Delbaen, Eber and Heath (1999) that ES is a coherent risk measure in that it satisfies the following four axioms:
- homogeneity: increasing the size of a portfolio by a factor should scale its risk measure by the same factor;
- monotonicity: a portfolio must have greater risk if it has systematically lower values than another;
- risk-free condition or translation invariance: adding some amount of cash to a portfolio should reduce its risk by the same amount; and
- subadditivity: the risk of a portfolio must be less than the sum of the separate risks, or merging portfolios cannot increase risk.
VaR satisfies homogeneity, monotonicity, and the risk-free condition, but it is not subadditive. See Artzner, et al. (1999) for details. As advocated by Artzner, et al. (1999), ES is preferred due to its better properties, although VaR is widely used in applications.
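To fix ideas before turning to the conditional versions studied in this chapter, the unconditional sample analogues of the two measures are simple to compute; the following R lines are a toy illustration only, with p denoting the tail probability (the convention used in Section 6.2 below):

var.es <- function(loss, p = 0.05) {
  VaR <- quantile(loss, 1 - p, names = FALSE)   # unconditional VaR: (1 - p)-quantile of the loss
  ES  <- mean(loss[loss >= VaR])                # expected loss, given the loss is at least VaR
  c(VaR = VaR, ES = ES)
}
# e.g., var.es(rnorm(10000), p = 0.05)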
Measures of risk might depend on the state of the economy since economic and market conditions vary over time. This requires that risk managers focus on the conditional distributions of profit and loss, which take full account of current information about the investment environment (macroeconomic and financial as well as political) in forecasting future market values, volatilities, and correlations. As pointed out by Duffie and Singleton (2003), not only are the prices of the underlying market indices changing randomly over time, the portfolio itself is changing, as are the volatilities of prices, the credit qualities of counterparties, and so on. On the other hand, one would expect the VaR to increase as the past returns become very negative, because one bad day makes the probability of the next somewhat greater. Similarly, very good days also increase the VaR, as would be the case for volatility models. Therefore, VaR could depend on the past returns in some way. Hence, an appropriate risk analytical tool or methodology should be allowed to adapt to varying market conditions and to reflect the latest available information in a time series setting rather than in the iid framework. Most of the existing risk management literature has concentrated on unconditional distributions and the iid setting, although there have been some studies on conditional distributions and time series data. For more background, see Chernozhukov and Umanstev (2001), Cai (2002), Fan and Gu (2003), Engle and Manganelli (2004), Cai and Xu (2005), Scaillet (2005), and Cosma, Scaillet and von Sachs (2006), and references therein for conditional models, and Duffie and Pan (1997), Artzner, et al. (1999), Rockafellar and Uryasev (2000), Acerbi and Tasche (2002), Frey and McNeil (2002), Scaillet (2004), Chen and Tang (2005), Chen (2006), among others, for unconditional models. Also, most studies in the literature and in applications are limited to parametric models, such as all the standard industry models like CreditRisk+, CreditMetrics, CreditPortfolio View and the model proposed by the KMV corporation. See Chernozhukov and Umanstev (2001), Frey and McNeil (2002), Engle and Manganelli (2004), and references therein on parametric models in practice, and Fan and Gu (2003) and references therein for semiparametric models.
The main focus of this chapter is on studying the conditional value-at-risk (CVaR) and conditional expected shortfall (CES) and on proposing a new nonparametric estimation procedure to estimate the CVaR and CES functions, where the conditioning information is allowed to contain economic and market (exogenous) variables and the past observed returns. Parametric models for CVaR and CES can be most efficient if the underlying functions are correctly specified. See Chernozhukov and Umanstev (2001) for a polynomial type regression model and Engle and Manganelli (2004) for a GARCH type parametric model for CVaR based on regression quantiles. However, a misspecification may cause serious bias, and model constraints may distort the underlying distributions. Nonparametric modeling is appealing in several aspects. One of the advantages of nonparametric modeling is that little or no restrictive prior information on the functionals is needed. Further, it may provide useful insight for further parametric fitting.

The approach proposed by Cai and Wang (2006) has several advantages. The first one is to propose a new nonparametric approach to estimate CVaR and CES. In essence, our estimator for CVaR is based on inverting a newly proposed estimator of the conditional distribution function for time series data, and the estimator for CES is obtained by a plug-in method, based on plugging in the estimated conditional probability density function and the estimated CVaR function. Note that they are analogous to the estimators studied by Scaillet (2005), who used the Nadaraya-Watson (NW) type double kernel (smoothing in both the y and x directions) estimation, and by Cai (2002), who utilized the weighted Nadaraya-Watson (WNW) kernel type technique to avoid the so-called boundary effects, as well as by Yu and Jones (1998), who employed the double kernel local linear method. More precisely, our newly proposed estimator combines the WNW method of Cai (2002) and the double kernel local linear technique of Yu and Jones (1998), and is termed the weighted double kernel local linear (WDKLL) estimator.

The second merit is to establish the asymptotic properties of the WDKLL estimators of the conditional probability density function (PDF) and cumulative distribution function (CDF) for α-mixing time series at both boundary and interior points. It is therefore shown that the WDKLL method enjoys the same convergence rates as those of the double kernel local linear estimator of Yu and Jones (1998) and the WNW estimator of Cai (2002). It is also shown that the WDKLL estimators have desired sampling properties at both boundary and interior points of the support of the design density, which seems to be seminal. Finally, we derive the WDKLL estimator of CVaR by inverting the WDKLL conditional distribution estimator, and the WDKLL estimator of CES by plugging in the WDKLL estimators of the PDF and the CVaR. We show that the WDKLL estimator of CVaR always exists because the WDKLL estimator of the CDF is itself a distribution function, and that it inherits all the better properties of the WDKLL estimator of the CDF; that is, the WDKLL estimator of the CDF is a CDF, is differentiable, and possesses asymptotic properties such as design adaptation, avoidance of boundary effects, and mathematical efficiency. Note that, to preserve shape constraints, Cosma, Scaillet and von Sachs (2006) recently used a wavelet method to estimate the conditional probability density and cumulative distribution functions and then to estimate conditional quantiles.
Note that the CVaR defined here is essentially the conditional quantile, or quantile regression, of Koenker and Bassett (1978), based on the conditional distribution, rather than the CVaR defined in some of the risk management literature (see, e.g., Rockafellar and Uryasev, 2000; Jorion, 2001, 2003), which is what we call ES here. Also, note that the ES here is called TailVaR in Artzner, et al. (1999). Moreover, as mentioned above, CVaR can be regarded as a special case of quantile regression. See Cai and Xu (2005) for the state of the art of current research on nonparametric quantile regression, including CVaR. Further, note that both ES and CES have been known for decades in the actuarial sciences and are very popular in the insurance industry. Indeed, they have been used to assess the risk of a portfolio of potential claims and to design reinsurance treaties. See the book by Embrechts, Klüppelberg, and Mikosch (1997) for an excellent review of this subject and the papers by McNeil (1997), Hürlimann (2003), Scaillet (2005), and Chen (2006). Finally, ES or CES is also closely related to other applied fields, such as the mean residual life function in reliability and the biometric function in biostatistics. See Oakes and Dasu (1990) and Cai and Qian (2000) and references therein.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 144
6.2 Setup
Assume that the observed data (X
t
, Y
t
); 1 t n, X
t

d
, are available and they are
observed from a stationary time series model. Here Y
t
is the risk or loss variable which can
be the negative logarithm of return (log loss) and X
t
is allowed to include both economic and
market (exogenous) variables and the lagged variables of Y
t
and also it can be a vector. But,
for the expositional purpose, we consider only the case when X
t
is a scalar (d = 1). Note that
the proposed methodologies and their theory for the univariate case (d = 1) continue to hold
for multivariate situations (d > 1). Extension to the case d > 1 involves no fundamentally
new ideas. Note that models with large d are often not practically useful due to curse of
dimensionality.
We now turn to considering the nonparametric estimation of the conditional expected
shortfall
p
(x), which is dened as

p
(x) = E[Y
t
[ Y
t

p
(x), X
t
= x],
where
p
(x) is the conditional value-at-risk, which is dened as the solution of
P(Y
t

p
(x) [ X
t
= x) = S(
p
(x) [ x) = p
or expressed as
p
(x) = S
1
(p [ x), where S(y [ x) is the conditional survival function of Y
t
given X
t
= x; S(y [ x) = 1 F(y [ x), and F(y [ x) is the conditional cumulative distribution
function. It is easy to see that

p
(x) =
_

p(x)
y f(y [ x) dy/p,
where f(y [ x) is the conditional probability density function of Y
t
given X
t
= x. To estimate

p
(x), one can use the plugging-in method as

p
(x) =
_

p(x)
y

f(y [ x) dy/p, (6.1)
where
p
(x) is a nonparametric estimation of
p
(x) and

f(y [ x) is a nonparametric estimation
of f(y [ x). But the bandwidths for
p
(x) and

f(y [ x) are not necessary to be same.
Note that Scaillet (2005) used the NW type double kernel method to estimate f(y [ x)
rst, due to Roussas (1969), denoted by

f(y [ x), and then estimated
p
(x) by inverting
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 145
the estimated conditional survival function, denoted by
p
(x), and nally estimated
p
(x)
by plugging

f(y [ x) and
p
(x) into (6.1), denoted by
p
(x), where
p
(x) =

S
1
(y [ x) and

S(y [ x) =
_

f(u [ x)du. But, it is well documented (see, e.g., Fan and Gijbels, 1996) that the
NW kernel type procedures have serious drawbacks: the asymptotic bias involves the design
density so that they can not be adaptive, and boundary eects exist so that they require
boundary modications. In particular, boundary eects might cause a serious problem for
estimating
p
(x) since it is only concerned with the tail probability. The question is now
how to provide a better estimate for f(y [ x) and
p
(x) so that we have a good estimate for

p
(x). Therefore, we address this issue in the next section.
6.3 Nonparametric Estimating Procedures
We start with the nonparametric estimators for the conditional density function and its
distribution function rst and then turn to discussing the nonparametric estimators for the
conditional VaR and ES functions.
There are several methods available for estimating
p
(x), f(y [ x), and F(y [ x) in the
literature, such as kernel and nearest-neighbor.
1
To attenuate these drawbacks of the kernel
type estimators mentioned in Section 6.2, recently, some new methods have been proposed to
estimate conditional quantiles. The rst one, a more direct approach, by using the check
function such as the robustied local linear smoother, was provided by Fan, Hu, and Troung
(1994) and further extended by Yu and Jones (1997, 1998) for iid data. A more general
nonparametric setting was explored by Cai and Xu (2005) for time series data. This modeling
idea was initialed by Koenker and Bassett (1978) for linear regression quantiles and Fan, Hu,
and Troung (1994) for nonparametric models. See Cai and Xu (2005) and references therein
for more discussions on models and applications. An alternative procedure is rst to estimate
the conditional distribution function by using double kernel local linear technique of Fan,
Yao, and Tong (1996) and then to invert the conditional distribution estimator to produce
an estimator of a conditional quantile or CVaR. Yu and Jones (1997, 1998) compared these
two methods theoretically and empirically and suggested that the double kernel local linear
would be better.
1
To name just a few, see Lejeune and Sarda (1988), Troung (1989), Samanta (1989), and Chaudhuri
(1991) for iid errors, Roussas (1969) and Roussas (1991) for Markovian processes, and Troung and Stone
(1992) and Boente and Fraiman (1995) for mixing sequences.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 146
6.3.1 Estimation of Conditional PDF and CDF
To make a connection between the conditional density (distribution) function and nonpara-
metric regression problem, it is noted by the standard kernel estimation theory (see, e.g.,
Fan and Gijbles, 1996) that for a given symmetric density function K(),
EK
h
0
(y Y
t
) [ X
t
= x = f(y [ x) +
h
2
0
2

2
(K) f
2,0
(y [ x) + o(h
2
0
) f(y [ x), as h
0
0,
(6.2)
where K
h
0
(u) = K(u/h
0
)/h
0
,
2
(K) =
_

u
2
K(u)du, f
2,0
(y [ x) =
2
/y
2
f(y [ x), and
denotes an approximation by ignoring the higher terms. Note that Y

t
(y) = K
h
0
(y Y
t
) can
be regarded as an initial estimate of f(y [ x) smoothing in the y direction. Also, note that
this approximation ignores the higher order terms O(h
j
0
) for j 2, since they are negligible
if h
0
= o(h), where h is the bandwidth used in smoothing in the x direction (see (6.3) below).
Therefore, the smoothing in the y direction is not important in the context of this subject
so that intuitively, it should be under-smoothed. Thus, the left hand side of (6.2) can be
regraded as a nonparametric regression of the observed variable Y

t
(y) versus X
t
and the
local linear (or polynomial) tting scheme of Fan and Gijbles (1996) can be applied to here.
This leads us to consider the following locally weighted least squares regression problem:
n

t=1
Y

t
(y) a b (X
t
x)
2
W
h
(x X
t
), (6.3)
where W() is a kernel function and h = h(n) > 0 is the bandwidth satisfying h 0 and
nh as n , which controls the amount of smoothing used in the estimation. Note
that (6.3) involves two kernels K() and W(). This is the reason of calling double kernel.
Minimizing the above locally weighted least squares in (6.3) with respect to a and b, we
obtain the locally weighted least squares estimator of f(y [ x), denoted by

f(y [ x), which is
a. From Fan and Gijbels (1996) or Fan, Yao and Tong (1996),

f(y [ x) can be re-expressed
as a linear estimator form as

f
ll
(y [ x) =
n

t=1
W
ll,t
(x, h) Y

t
(y),
where with S
n,j
(x) =

n
t=1
W
h
(x X
t
) (X
t
x)
j
, the weights W
ll,t
(x, h) are given by
W
ll,t
(x, h) =
[S
n,2
(x) (x X
t
) S
n,1
(x)] W
h
(x X
t
)
S
n,0
(x)S
n,2
(x) S
2
n,1
(x)
.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 147
Clearly, W
ll,t
(x, h) satisfy the so-called discrete moments conditions as follows: for 0
j 1,
n

t=1
W
ll,t
(x, h) (X
t
x)
j
=
0,j
=
_
1 if j = 0
0 otherwsie
(6.4)
based on the least squares theory; see (3.12) of Fan and Gijbels (1996, p.63). Note that the
estimator

f
ll
(y [ x) can range outside [0, ). The double kernel local linear estimator of
F(y [ x) is constructed (see (8) of Yu and Jones (1998)) by integrating

f
ll
(y [ x)

F
ll
(y [ x) =
_
y

f
ll
(y [ x)dy =
n

t=1
W
ll,t
(x, h) G
h
0
(y Y
t
),
where G() is the distribution function of K() and G
h
0
(u) = G(u/h
0
). Clearly,

F
ll
(y [ x)
is continuous and dierentiable with respect to y with

F
ll
([ x) = 0 and

F
ll
([ x) = 1.
Note that the dierentiability of the estimated distribution function can make the asymptotic
analysis much easier for the nonparametric estimators of CVaR and CES (see later).
Although Yu and Jones (1998) showed that the double kernel local linear estimator has
some attractive properties such as no boundary eects, design adaptation, and mathematical
eciency (see, e.g., Fan and Gijbels, 1996), it has the disadvantage of producing conditional
distribution function estimators that are not constrained either to lie between zero and one
or to be monotone increasing, which is not good for estimating CVaR if the inverting method
is used. In both these respects, the NW method is superior, despite its rather large bias
and boundary eects. The properties of positivity and monotonicity are particularly advan-
tageous if the method of inverting conditional distribution estimator is applied to produce
the estimator of a conditional quantile or CVaR. To overcome these diculties, Hall, Wol,
and Yao (1999) and Cai (2002) proposed the WNW estimator based on an empirical likeli-
hood principle, which is designed to possess the superior properties of local linear methods
such as bias reduction and no boundary eects, and to preserve the property that the NW
estimator is always a distribution function, although it might require more computation-
al eorts since it requires estimating and optimizing additional weights aimed at the bias
correction. Cai (2002) discussed the asymptotic properties of the WNW estimator at both
interior and boundary points for the mixing time series under some regularity assumptions
and showed that the WNW estimator has a better performance than other competitors. See
Cai (2002) for details. Recently, Cosma, Scaillet and von Sachs (2006) proposed a shape
preserving estimation method to estimate cumulative distribution functions and probability
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 148
density functions using the wavelet methodology for multivariate dependent data and then
to estimate a conditional quantile or CVaR.
The WNW estimator of the conditional distribution F(y [ x) of Y
t
given X
t
= x is dened
by

F
c1
(y [ x) =
n

t=1
W
c,t
(x, h) I(Y
t
y), (6.5)
where the weights W
c,t
(x, h) are given by
W
c,t
(x, h) =
p
t
(x) W
h
(x X
t
)

n
t=1
p
t
(x) W
h
(x X
t
)
, (6.6)
and p
t
(x) is chosen to be p
t
(x) = n
1
1 + (X
t
x) W
h
(x X
t
)
1
0 with , a
function of data and x, uniquely dened by maximizing the logarithm of the empirical
likelihood
L
n
() =
n

t=1
log 1 + (X
t
x) W
h
(x X
t
)
subject to the constraints

n
t=1
p
t
(x) = 1 and the discrete moments conditions in (6.4); that
is,
n

t=1
W
c,t
(x, h) (X
t
x)
j
=
0,j
(6.7)
for 0 j 1. Also, see Cai (2002) for details on this aspect. In implementation, Cai (2002)
recommended using the Newton-Raphson scheme to nd the root of equation L

n
() = 0.
Note that 0

F
c1
(y [ x) 1 and it is monotone in y. But

F
c1
(y [ x) is not continuous in y
and of course, not dierentiable in y either. Note that under regression setting, Cai (2001)
provided a comparison of the local linear estimator and the WNW estimator and discussed
the asymptotic minimax eciency of the WNW estimator.
To accommodate all nice properties (monotonicity, continuity, dierentiability, and lying
between zero and one) and the attractive asymptotic properties (design adaption, avoiding
boundary eects, and mathematical eciency, see Cai (2002) for detailed discussions) of
both estimators

F
ll
(y [ x) and

F
c1
(y [ x) under a unied framework, we propose the following
nonparametric estimators for the conditional density function f(y [ x) and its conditional
distribution function F(y [ x), termed as weighted double kernel local linear estimation,

f
c
(y [ x) =
n

t=1
W
c,t
(x, h) Y

t
(y),
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 149
where W
c,t
(x, h) is given in (6.6), and

F
c
(y [ x) =
_
y

f
c
(y [ x)dy =
n

t=1
W
c,t
(x, h) G
h
0
(y Y
t
). (6.8)
Note that if p
t
(x) in (6.6) is a constant for all t, or = 0, then

f
c
(y [ x) becomes the classical
NW type double kernel estimator used by Scaillet (2005). However, Scaillet (2005) adopted
a single bandwidth for smoothing in both the y and x directions. Clearly,

f
c
(y [ x) is a
probability density function so that

F
c
(y [ x) is a cumulative distribution function (monotone,
0

F
c
(y [ x) 1,

F
c
([ x) = 0, and

F
c
([ x) = 1). Also,

F
c
(y [ x) is continuous and
dierentiable in y. Further, as expected, it will be shown that like

F
c1
(y [ x),

F
c
(y [ x) has
the attractive properties such as no boundary eects, design adaptation, and mathematical
eciency.
6.3.2 Estimation of Conditional VaR and ES
We now are ready to formulate the nonparametric estimators for
p
(x) and
p
(x). To this
end, from (6.8),
p
(x) is estimated by inverting the estimated conditional survival distribution

S
c
(y [ x) = 1

F
c
(y [ x), denoted by
p
(x) and dened as
p
(x) =

S
1
c
(p [ x). Note that
p
(x)
always exists since

S
c
(p [ x) is a survival function itself. Plugging-in
p
(x) and

f
c
(y [ x) into
(6.1), we obtain the nonparametric estimation of
p
(x),

p
(x) = p
1
_

p(x)
y

f
c
(y [ x) dy = p
1
n

t=1
W
c,t
(x, h)
_

p(x)
y K
h
0
(y Y
t
)dy
= p
1
n

t=1
W
c,t
(x, h)
_
Y
t

G
h
0
(
p
(x) Y
t
) + h
0
G
1,h
0
(
p
(x) Y
t
)

, (6.9)
where

G(u) = 1 G(u), G
1,h
0
(u) = G
1
(u/h
0
), and G
1
(u) =
_

u
v K(v)dv. Note that as
mentioned earlier,
p
(x) in (6.9) can be an any consistent estimator.
6.4 Distribution Theory
6.4.1 Assumptions
Before we proceed with the asymptotic properties of the proposed nonparametric estimators,
we rst list all assumptions needed for the asymptotic theory, although some of them might
not be the weakest possible. Note that proofs of the asymptotic results presented in this
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 150
section may be found in Section 6.6 with some lemmas and their detailed proofs relegated
to Section 6.7. First, we introduce some notation. Let (K) =
_

u K(u)

G(u)du and

j
(W) =
_

u
j
W(u)du. Also, for any j 0, write
l
j
(u [ v) = E[Y
j
t
I(Y
t
u) [ X
t
= v] =
_

u
y
j
f(y [ v)dy, l
a,b
j
(u [ v) =

ab
u
a
v
b
l
j
(u [ v),
and l
a,b
j
(
p
(x) [ x) = l
a,b
j
(u [ v)

u=p(x),v=x
. Clearly, l
0
(u [ v) = S(u [ v) and l
1
(
p
(x) [ x) =
p
p
(x). Finally, l
1,0
j
(u [ v) = u
j
f(u [ v) and l
2,0
j
(u [ v) = [u
j
f
1,0
(u [ v) + j u
j1
f(u [ v)].
We now list the following regularity conditions.
Assumption A:
A1. For xed y and x, 0 < F(y [ x) < 1, g(x) > 0, the marginal density of X
t
, and is
continuous at x, and F(y [ x) has continuous second order derivative with respect to
both x and y.
A2. The kernels K() and W() are symmetric, bounded, and compactly supported density.
A3. h 0 and nh , and h
0
0 and nh
0
, as n .
A4. Let g
1,t
(, ) be the joint density of X
1
and X
t
for t 2. Assume that [g
1,t
(u, v)
g(u) g(v)[ M < for all u and v.
A5. The process (X
t
, Y
t
) is a stationary -mixing with the mixing coecient satisfying
(t) = O
_
t
(2+)
_
for some > 0.
A6. nh
1+2/
.
A7. h
0
= o(h).
Assumption B:
B1. Assume that E
_
[Y
t
[

[ X
t
= u
_
M
3
< for some > 2, in a neighborhood of x.
B2. Assume that [g
1,t
(y
1
, y
2
[x
1
, x
2
)[ M
1
< for all t 2, where g
1,t
(y
1
, y
2
[ x
1
, x
2
) be
the conditional density of Y
1
and Y
t
given X
1
= x
1
and X
t
= x
2
.
B3. The mixing coecient of the -mixing process (X
t
, Y
t
)

t=
satises

t1
t
a

12/
(t)
< for some a > 1 2/, where is given in Assumption B1.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 151
B4. Assume that there exists a sequence of integers s
n
> 0 such that s
n
, s
n
=
o((nh)
1/2
), and (n/h)
1/2
(s
n
) 0, as n .
B5. There exists

> such that E


_
[Y
t
[

[ X
t
= u
_
M
4
< in a neighborhood of
x, (t) = O(t

), where is given in Assumption B1,

/2(

), and
n
1/2/4
h
/

1/2/4
= O(1).
Remark 1. Note that Assumptions A1 - A5 and B1 - B5 are used commonly in the literature
of time series data (see, e.g., Masry and Fan, 1997, Cai, 2001). Note that -mixing imposed
in Assumption A5 is weaker than -mixing in Hall, Wol, and Yao (1999) and -mixing
in Fan, Yao, and Tong (1996). Because A6 is satised by the bandwidths of optimal size
(i.e., h n
1/5
) if > 1/2, we do not concern ourselves with such renements. Indeed,
Assumptions A1 - A6 are also required in Cai (2002). Assumption A7 means that the initial
step bandwidth should be chosen as small as possible so that the bias from the initial step
can be ignored. Since the common technique truncation approach for time series data is
not applicable to our setting (see, e.g., Masry and Fan, 1997), the purpose of Assumption
B5 is to use the moment inequality. If (t) decays geometrically, then Assumptions B4 and
B5 are satised automatically. Note that Assumptions B3, B4, and B5 are stronger than
Assumptions A5 and A6. This is not surprising because the higher moments involved, the
faster decaying rate of () is required. Finally, Assumptions B1 - B5 are also imposed in
Cai (2001).
6.4.2 Asymptotic Properties for Conditional PDF and CDF
First, we investigate the asymptotic behaviors of

f
c
(y [ x), including the asymptotic normality
stated in the following theorem.
Theorem 6.1: Under Assumptions A1 - A6 with h in A3 and A6 replaced by h
0
h, we have
_
nh
0
h
_

f
c
(y [ x) f(y [ x) B
f
(y [ x)
_
N
_
0,
2
f
(y [ x)
_
,
where the asymptotic bias is
B
f
(y [ x) =
h
2
2

2
(W) f
0,2
(y [ x) +
h
2
0
2

2
(K) f
2,0
(y [ x),
and the asymptotic variance is
2
f
(y [ x) =
0
(K
2
)
0
(W
2
) f(y [ x)/g(x).
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 152
Remark 2: The asymptotic results for

f
c
(y [ x) in Theorem 6.1 are similar to those for

f
ll
(y [ x) in Fan, Yao, and Tong (1996) for the -mixing sequence, which is stronger than
-mixing, but as mentioned earlier,

f
ll
(y [ x) is not always a probability density function.
The asymptotic bias and variance are intuitively expected. The bias comes from the approx-
imations in both x and y directions and the variance is from the local conditional variance
in the density estimation setting, which is f(y [ x).
Next, we study the asymptotic behaviors for

S
c
(y [ x) at both interior and boundary
points. Similar to Theorem 6.1 for

f
c
(y [ x), we have the following asymptotic normality for

S
c
(y [ x).
Theorem 6.2: Under Assumptions A1 - A6, we have

nh
_

S
c
(y [ x) S(y [ x) B
S
(y [ x)
_
N
_
0,
2
S
(y [ x)
_
,
where the asymptotic bias is given by
B
S
(y [ x) =
h
2
2

2
(W) S
0,2
(y [ x)
h
2
0
2

2
(K) f
1,0
(y [ x),
and the asymptotic variance is
2
S
(y [ x) =
0
(W
2
) S(y [ x) [1 S(y [ x)]/g(x). In particular,
if Assumption A7 holds true, then,

nh
_

S
c
(y [ x) S(y [ x)
h
2
2

2
(W) S
0,2
(y [ x)
_
N
_
0,
2
S
(y [ x)
_
.
Remark 3: Note that the asymptotic results for

S
c
(y [ x) in Theorem 6.2 are analogous to
those for

S
ll
(y [ x) = 1

F
ll
(y [ x) in Yu and Jones (1998) for iid data, but as mentioned
previously,

F
ll
(y [ x) is not always a distribution function. A comparison of B
s
(y [ x) with
the asymptotic bias for

S
c1
(y [ x) (see Theorem 1 in Cai (2002)), it reveals that there is an
extra term
h
2
0
2
f
1,0
(y [ x)
2
(K) in the asymptotic bias expression B
s
(y [ x) due to the vertical
smoothing in the y direction. Also, there is an extra term in the asymptotic variance (see
(6.20)). These extra terms are carried over from the initial estimate but they can be ignored
if the bandwidth at the initial step is taken to be a higher order than the bandwidth at the
smoothing step.
Remark 4: It is important to examine the performance of

S
c
(y [ x) by considering the
asymptotic mean squared error (AMSE). Theorem 6.2 concludes that the AMSE of

S
c
(y [ x)
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 153
is
AMSE
_

S
c
(y [ x)
_
=
h
2

2
(W) S
0,2
(y [ x) h
2
0

2
(K) f
1,0
(y [ x)
2
4
+
1
nh

0
(W
2
) S(y [ x) [1 S(y [ x)]
g(x)
. (6.10)
By minimizing AMSE in (6.10) and taking h
0
= o(h), therefore, we obtain the optimal
bandwidth given by
h
opt,S
(y [ x) =
_

0
(W
2
) S(y [ x) [1 S(y [ x)]

2
(W) S
0,2
(y [ x)
2
g(x)
_
1/5
n
1/5
.
Therefore, the optimal rate of the AMSE of

S
c
(y [ x) is n
4/5
.
As for the boundary behavior of the WDKLL estimator, we can follow Cai (2002) to
establish a similar result for

S
c
(y [ x) like Theorem 2 in Cai (2002). Without loss of generality,
we consider the left boundary point x = c h, 0 < c < 1. From Fan, Hu, and Troung (1994), we
take W() to have support [1, 1] and g() to have support [0, 1]. Then, under Assumptions
A1 - A7, by following the same proof as that for Theorem 6.2 and using the second assertion
in Lemma 6.1, although not straightforward, we can show that

nh
_

S
c
(y [ c h) S
c
(y [ c h) B
S,c
(y)
_
N
_
0,
2
S,c
(y)
_
, (6.11)
where the asymptotic bias term is given by B
S,c
(y) = h
2

0
(c) S
0,2
(y [ 0+)/[2
1
(c)] and the
asymptotic variance is
2
S,c
(y) =
2
(0) S(y [ 0+)[1 S(y [ 0+)]/[
2
1
(c) g(0+)] with g(0+) =
lim
z0
g(z),

0
(c) =
_
c
1
u
2
W(u)
1
c
u W(u)
du,
j
(c) =
_
c
1
W
j
(u)
1
c
u W(u)
j
du, 1 j 2,
and
c
being the root of equation L
c
() = 0
L
c
() =
_
c
1
u W(u)
1 u W(u)
du.
Note that the proof of (6.11) is similar to that for Theorem 2 in Cai (2002) and omitted.
Theorem 6.2 and (6.11) reect two of the major advantages of the WKDLL estimator: (a) the
asymptotic bias does not depend on the design density g(x), and indeed it is dependent only
on the simple conditional distribution curvature S
0,2
(y [ x) and conditional density curvature
f
1,0
(y [ x); and (b) it has an automatic good behavior at boundaries. See Cai (2002) for the
detailed discussions.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 154
Finally, we remark that if the point 0 were an interior point, then, (6.11) would hold
with c = 1, which becomes Theorem 6.2. Therefore, Theorem 6.2 shows that the WKDLL
estimation has the automatic good behavior at boundaries without the need of the boundary
correction.
6.4.3 Asymptotic Theory for CVaR and CES
By the dierentiability of

S
c
(
p
(x) [ x), we use the Taylor expansion and ignore the higher
terms to obtain

S
c
(
p
(x) [ x) = p

S
c
(
p
(x) [ x)

f
c
(
p
(x) [ x) (
p
(x)
p
(x)), (6.12)
then, by Theorem 6.1,

p
(x)
p
(x) [

S
c
(
p
(x) [ x) p]/

f
c
(
p
(x) [ x) [

S
c
(
p
(x) [ x) p]/f(
p
(x) [ x).
As an application of Theorem 6.2, we can establish the following theorem for the asymptotic
normality of
p
(x) but the proof is omitted since it is similar to that for Theorem 6.2.
Theorem 6.3: Under Assumptions A1 - A6, we have

nh [
p
(x)
p
(x) B

(x)] N
_
0,
2

(x)
_
,
where the asymptotic bias is B

(x) = B
S
(
p
(x) [ x)/f(
p
(x) [ x) and the asymptotic variance
is
2

(x) =
0
(W
2
) p(1 p)/[g(x)f
2
(
p
(x) [ x)]. In particular, if Assumption A7 holds, then,

nh
_

p
(x)
p
(x)
h
2
2
S
0,2
(
p
(x) [ x)
f(
p
(x) [ x)

2
(W)
_
N
_
0,
2

(x)
_
.
Remark 5: First, as a consequence of Theorem 6.3,
p
(x)
p
(x) = O
p
_
h
2
+ h
2
0
+ (nh)
1/2
_
so that
p
(x) is a consistent estimator of
p
(x) with a convergence rate. Also, note that
the asymptotic results for
p
(x) in Theorem 6.3 are akin to those for
ll,p
(x) =

S
1
ll
(p [ x)
in Yu and Jones (1998) for iid data. But in the bias term of Theorem 6.3, the quantity
S
0,2
(
p
(x) [ x)/f(
p
(x) [ x), involving the second derivative of the conditional distribution
function with respect to x, replaces

p
(x), the second derivative of the conditional VaR
function itself, which is in the bias term of the check function type local linear estimator
in Yu and Jones (1998) for iid data and Cai and Xu (2005) for time series. See Cai and Xu
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 155
(2005) for details. This is not surprising since the bias comes only from the approximation.
The former utilizes the approximation of the conditional distribution function but the later
uses the approximation of the conditional VaR function. Finally, Theorems 6.2 and 6.3
imply that if the initial bandwidth h
0
is chosen small as possible such as h
0
= o(h), the
nal estimates of S(y [ x) and
p
(x) are not sensitive to the choice of h
0
as long as it satises
Assumption A7. This makes the selection of bandwidths much easier in practice, which will
be elaborated later (see Section 6.5.1).
Remark 6: Similar to Remark 5, we can derive the asymptotic mean squared error for

p
(x). By following Yu and Jones (1998), Theorem 6.3 and (6.20) (given in Section 6.6)
imply that the AMSE of
p
(x) is given by
AMSE(
p
(x)) =
h
2
S
0,2
(
p
(x) [ x)
2
(W) h
2
0
f
1,0
(
p
(x) [ x)
2
(K)
2
4 f
2
(
p
(x) [ x)
+
1
nh

0
(W
2
) [p(1 p) + 2 h
0
f(
p
(x) [ x) (K)]
f
2
(
p
(x) [ x) g(x)
. (6.13)
Note that the above result is similar to that in Theorem 1 in Yu and Jones (1998) for the
double kernel local linear conditional quantile estimator. But, a comparison of (6.13) with
Theorem 3 in Cai (2002) for the WNW estimator reveals that (6.13) has two extra terms
(negligible if Assumption A7 is satised) due to the vertical smoothing in the y direction, as
mentioned previously. By minimizing AMSE in (6.13) and taking h
0
= o(h), therefore, we
obtain the optimal bandwidth given by
h
opt,
(x) =
_

0
(W
2
) p(1 p)

2
(W) S
0,2
(
p
(x) [ x)
2
g(x)
_
1/5
n
1/5
.
Therefore, the optimal rate of the AMSE of
p
(x) is n
4/5
. By comparing h
opt,
(x) with
h
opt,S
(y [ x), it turns out that h
opt,
(x) is h
opt,
(y [ x) evaluated at y =
p
(x). Therefore, the
best choice of the bandwidth for estimating S
c
(y [ x) can be used for estimating
p
(x).
Remark 7: Similar to (6.11), one can establish the asymptotic result at boundaries for

p
(x) as follows, one can show that under Assumption A7,

nh [
p
(c h)
p
(c h) B
,c
] N
_
0,
2
,c
_
,
where the asymptotic bias is B
,c
= h
2

2
(c)S
0,2
(
p
(0+)[0+)/[2
1
(c)f(
p
(0+)[0+)] and the
asymptotic variance is
2
,c
=
0
(0) p [1 p]/[
2
1
(c) f
2
(
p
(0+) [ 0+) g(0+)]. Clearly,
p
(x)
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 156
inherits all good properties from the WDKLL estimator of S
c
(y [ x). Note that the above
result can be established by using the second assertion in Lemma 6.1 and following the same
lines along with those used in the proof of Theorem 6.2 and omitted.
Finally, we examine the asymptotic behavior for
p
(x) at both interior and boundary
points. First, we establish the following theorem for the asymptotic normality for
p
(x)
when x is an interior point.
Theorem 6.4: Under Assumptions A1 - A4 and B2 - B5, we have

nh [
p
(x)
p
(x) B

(x)] N
_
0,
2

(x)
_
,
where the asymptotic bias is B

(x) = B
,0
(x) +
h
2
0
2

2
(K) p
1
f(
p
(x) [ x) with
B
,0
(x) =
h
2
2

2
(W) p
1
_
l
0,2
1
(
p
(x) [ x)
p
(x) S
0,2
(
p
(x) [ x)

,
and the asymptotic variance is

(x) =

0
(W
2
)
p g(x)
_
p
1
l
2
(
p
(x) [ x) p
2
p
(x) + (1 p)
p
(x)
p
(x) 2
p
(x)

.
In particular, if Assumption A7 holds true, then,

nh [
p
(x)
p
(x) B
,0
(x)] N
_
0,
2

(x)
_
.
Remark 8: First, Theorem 6.4 concludes that
p
(x)
p
(x) = O
p
_
h
2
+ h
2
0
+ (nh)
1/2
_
so
that
p
(x) is a consistent estimator of
p
(x) with a convergence rate. Also, note that the
asymptotic results in Theorem 6.4 imply that
p
(x) is a consistent estimator for
p
(x) with
a convergence rate (nh)
1/2
. Further, note that although the asymptotic variance
2

(x) is
the same as that in Scaillet (2005) for
p
(x), Scaillet (2005) did not provide an expression
for the asymptotic bias term like B

(x) in the rst result or B


,0
(x) in the second conclusion
in Theorem 6.4. Clearly, the second term in the asymptotic bias expression is carried over
from the y direction smoothing at the initial step and it is negligible if Assumption A7 is
satised. Clearly, Assumption A7 implies that B

(x) becomes B
,0
(x).
Remark 9: Like Remark 5, the AMSE for
p
(x) can be derived in the same manner. It
follows from Theorem 6.4 that the AMSE of
p
(x) is given by
AMSE(
p
(x)) =
1
nh

2

(x) +
_
B
,0
(x) +
h
2
0
2

2
(K) p
1
f(
p
(x) [ x)
_
2
. (6.14)
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 157
Under Assumption A7, minimizing AMSE in (6.14) with respect to h yields the optimal
bandwidth given by
h
opt,
(x) =
_

(x)

2
(W) p
1
_
l
0,2
1
(
p
(x) [ x)
p
(x) S
0,2
(
p
(x) [ x)
_
_
2/5
n
1/5
.
Therefore, as expected, the optimal rate of the AMSE of
p
(x) is n
4/5
.
Finally, we oer the asymptotic results for
p
(x) at the left boundary point x = c h. By
the same fashion, one can show that under Assumption A7,

nh [
p
(c h)
p
(c h) B
,c
] N
_
0,
2
,c
_
,
where the asymptotic bias is
B
,c
= h
2

2
(c) p
1
_
l
0,2
1
(
p
(0+) [ 0+)
p
(0+) S
0,2
(
p
(0+) [ 0+)

/[2
1
(c)],
and the asymptotic variance is

2
,c
=

0
(0)
p
2
1
(c) g(0+)
_
p
1
l
2
(
p
(0+) [ 0+) p
2
p
(0+) + (1 p)
p
(0+)
p
(0+) 2
p
(0+)

.
Note that the proof of the above result can be carried over by using the second assertion in
Lemma 6.1 and following the same lines along with those used in the proof of Theorem 6.4 and
omitted. Next, we consider the comparison of the performance of the WDKLL estimation

p
(x) with the NW type kernel estimator
p
(x) as in Scaillet (2005). To this eect, it is not
very dicult to derive the asymptotic results for the NW type kernel estimator but the proof
is omitted since it is along the same line with the proof of Theorem 6.2. See Scaillet (2005) for
the results at the interior point. Under some regularity conditions, it can be shown although
tediously (see Cai (2002) for details) that at the left boundary x = c h, the asymptotic bias
term for the NW type kernel estimator
p
(x) is of the order h by comparing to the order
h
2
for the WDKLL estimate (see B
,c
above). This shows that the WDKLL estimate does
not suer from boundary eects but the NW type kernel estimator estimate does. This is
another advantage of the WDKLL estimator over the WW type kernel estimator
p
(x).
6.5 Empirical Examples
To illustrate the proposed methods, we consider two simulated examples and two real data
examples on stock index returns and security returns. Throughout this section, the Epanech-
nikov kernel K(u) = 0.75(1 u
2
)
+
is used and bandwidths are selected as described in the
next section.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 158
6.5.1 Bandwidth Selection
With the basic model at hand, one must address the important bandwidth selection issue,
as the quality of the curve estimates depends sensitively on the choice of the bandwidth. For
practitioners, it is desirable to have a convenient and eective data-driven rule. However,
almost nothing has been done so far about this problem in the context of estimating
p
(x)
and
p
(x) although there are some results available in the literature in other contexts for
some specic purposes.
As indicated earlier, the choice of the initial bandwidth h
0
is not very sensitive to the
nal estimation but it needs to be specied. First, we use a very simple idea to choose
h
0
. As mentioned previously, the WNW method involves only one bandwidth in estimating
the conditional distribution and VaR. Because the WNW estimate is a linear smoother (see
(6.5)), we recommend using the optimal bandwidth selector, the so-called nonparametric
AIC proposed by Cai and Tiwari (2000), to select the bandwidth, called

h. Then we take
0.1

h or smaller as the initial bandwidth h


0
. For the given h
0
, we can select h as follows.
According to (6.8),

F
c
([) is a linear estimator so that the nonparametric AIC selector of Cai
and Tiwari (2000) can be applied here to select the optimal bandwidth for

F
c
([), denoted
by h
S
. As mentioned at the end of Remark 6, the bandwidth for
p
(x) is the same as that
for

F
c
([) so that it is simply to take h
S
as h

. From (6.9),
p
(x) is a linear estimator too
for given
p
(x). Therefore, by the same token, the nonparametric AIC selector is applied
to selecting h

for
p
(x). This simple approach is used in our implementation in the next
sections.
6.5.2 Simulated Examples
In the simulated examples, we demonstrate the nite sample performance of the estimators
in terms of the mean absolute deviation error (MADE). For example, the MADE for
p
(x)
is dened as
c
p
=
1
n
0
n
0

k=1
[
p
(x
k
)
p
(x
k
)[,
where x
k

n
0
k=1
are the pre-determined regular grid points. Similarly, we can dene the
MADE for
p
(x), denoted by c
p
.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 159
Example 6.1. We consider an ARCH type model with X
t
= Y
t1
,
Y
t
= 0.9 sin(2.5X
t
) + (X
t
)
t
,
where
2
(x) = 0.8

1.2 + x
2
and
t
are iid standard normal random variables. We consider
three sample sizes: n = 250, 500, and 1000 and the experiment is repeated 500 times for
each sample size. The mean absolute deviation errors are computed for each sample size and
each replication.
The 5% WDKLL and NW estimations are summarized in Figure 6.1 for CVaR and in
Figure 6.2 for CES. For each n, the boxplots of 500 c
p
-values of the WDKLL and NW
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
(a) n=100
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
(b) n=300
1.0 0.5 0.0 0.5 1.0
0
.
5
1
.
0
1
.
5
2
.
0
2
.
5
(c) n=500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
(d) MADE
n = 100 n = 300 n = 500
Figure 6.1: Simulation results for Example 1 when p = 0.05. Displayed in (a) - (c) are the
true CVaR functions (solid lines), the estimated WDKLL CVaR functions (dashed lines),
and the estimated NW CVaR functions (dotted lines) for n = 250, 500 and 1000, respectively.
Boxplots of the 500 MADE values for both the WDKLL and NW estimations of CVaR are
plotted in (d).
estimations are plotted in Figure 6.1(d) for CVaR and in Figure 6.2(d) for CES.
From Figures 6.1(d) and 6.2(d), we can observe that the estimation becomes stable as
the sample size increases for both the WDKLL and NW estimators. This is in line with our
asymptotic theory that the proposed estimators are consistent. Further, it is obvious that
the MADEs of the WDKLL estimator are smaller than those for the NW estimator. This
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 160
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
2
.
5
(a) n=100
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
2
.
5
(b) n=300
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
(c) n=500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
(d) MADE
n = 100
Figure 6.2: Simulation results for Example 1 when p = 0.05. Displayed in (a) - (c) are the
true CES functions (solid lines), the estimated WDKLL CES functions (dashed lines), and
the estimated NW CES functions (dotted lines) for n = 250, 500 and 1000, respectively.
Boxplots of the 500 MADE values for both the WDKLL and NW estimations of CES are
plotted in (d).
indicates that our WDKLL estimator has smaller bias than that for the NW estimator. This
implies that the overall performance of the WDKLL estimator should be better than that
for the NW estimator.
Figures 6.1(a) (c) for n = 250, 500 and 1000, respectively, display the true CVaR
function (solid line)
p
(x) = 0.9 sin(2.5x) + (x)
1
(1 p), where () is the standard
normal distribution function, together with the dashed and dotted lines representing the
proposed WDKLL (dashed) and NW (dotted) estimates of CVaR, respectively, which are
computed based on a typical sample. The typical sample is selected in such a way that its
c
p
value is equal to the median in the 500 replications. From Figures 6.1(a) (c), we can
observe that both the estimated curves are closer to the true curve as n increases and the
performance of the WDKLL estimator is better than that for the NW estimator, especially
at boundaries.
In Figures 6.2(a)(c), the true CES function
p
(x) = 0.9 sin(2.5x)p+(x)
1
(
1
(1p))
is displayed by the solid line, where
1
(t) =
_

t
u(u)du and () is the standard normal
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 161
distribution density function, and the dashed and dotted lines present the proposed WDKLL
(dashed) and NW (dotted) estimates of CES, respectively, from a typical sample. The
typical sample is selected in such a way that its c
p
-value is equal to the median in the 500
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
3
.
5
(a) n=100
1.0 0.5 0.0 0.5 1.0
1
.
5
2
.
0
2
.
5
(b) n=300
1.0 0.5 0.0 0.5 1.0
1
.
5
2
.
0
2
.
5
3
.
0
(c) n=500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
1
.
2
(d) MADE
n = 100 n = 300 n = 500
Figure 6.3: Simulation results for Example 1 when p = 0.01. Displayed in (a) - (c) are the
true CVaR functions (solid lines), the estimated WDKLL CVaR functions (dashed lines),
and the estimated NW CVaR functions (dotted lines) for n = 250, 500 and 1000, respectively.
Boxplots of the 500 MADE values for both WDKLL and NW estimation of the conditional
VaR are plotted in (d).
replications. We can conclude from Figures 6.2(a)(c) that the CES estimator has a similar
performance as that for the CVaR estimator.
The 1% WDKLL and NW estimates of CVaR and CES are computed under the same
setting and they are displayed in Figures 6.3 and 6.4, respectively. Similar conclusions
to those for the 5% estimates can be observed. But it is not surprising to see that the
performance of the 1% CVaR and CES estimates is not good as that for the 5% estimates
due to the sparsity of data.
Example 6.2. In the above example, we consider only the case when X
t
is one-dimensional.
In this example, we consider the multivariate situation, i.e. X
t
consists of two lagged vari-
ables: X
t1
= Y
t1
and X
t2
= Y
t2
. The data generating model is given below:
Y
t
= m(X
t
) + (X
t
)
t
,
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 162
1.0 0.5 0.0 0.5 1.0
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
(a) n=100
1.0 0.5 0.0 0.5 1.0
1
.
5
2
.
0
2
.
5
3
.
0
(b) n=300
1.0 0.5 0.0 0.5 1.0
1
.
5
2
.
0
2
.
5
3
.
0
(c) n=500
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
1
.
2
(d) MADE
n = 100 n = 300 n = 500
Figure 6.4: Simulation results for Example 1 when p = 0.01. Displayed in (a) - (c) are the
true CES functions (solid lines), the estimated WDKLL CES functions (dashed lines), and
the estimated NW CES functions (dotted lines) for n = 250, 500 and 1000, respectively.
Boxplots of the 500 MADE values for both the WDKLL and NW estimations of CVaR are
plotted in (d).
where m(x) = 0.63 x
1
0.47 x
2
,
2
(x) = 0.5 + 0.23 x
2
1
+ 0.3 x
2
2
, and
t
are iid generated
from N(0, 1). Three sample sizes: n = 200, 400, and 600, are considered here. For each
sample size, we replicate the design 500 times. Here we present only the boxplots of the 500
MADEs for the CVaR and CES estimates in Figure 6.5. Figure 6.5(a) displays the boxplots
of the 500 c
p
-values of the WDKLL and NW estimates of CVaR and the boxplots of the
500 c
p
-values of the WDKLL and NW estimates of CES are given in Figure 6.5(b). From
Figures 6.5(a) and (b), it is visually veried that both WDKLL and NW estimations become
stable as the sample size increases and the performance of the WDKLL estimator is better
than that for the NW estimator.
6.5.3 Real Examples
Example 6.3. Now we illustrate our proposed methodology by considering a real data set
on Dow Jones Industrials (DJI) index returns. We took a sample of 1801 daily prices from
DJI index, from November 3, 1998 to January 3, 2006, and computed the daily returns as
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 163
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
1
.
2
1
.
4
(a) MADE of VaR
n = 200 n = 400 n = 600
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
1
.
2
1
.
4
(b) MADE of ES
n = 200 n = 400 n = 600
Figure 6.5: Simulation results for Example 2 when p = 0.05. (a) Boxplots of MADEs for
both the WDKLL and NW CVaR estimates. (b) Boxplots of MADEs for Both the WDKLL
and NW CES estimates.
100 times the dierence of the log of prices. Let Y
t
be the daily negative log return (log loss)
of DJI and X
t
be the rst lagged variable of Y
t
. The estimators proposed in this chapter
are used to estimate the 5% CVaR and CES functions. The estimation results are shown in
Figure 6.6 for the 5% CVaR estimate in Figure 6.6(a) and the 5% CES estimate in Figure
6.6(b). Both CVaR and CES estimates exhibit a U-shape, which corresponds to the so-called
volatility smile. Therefore, the risk tends to be lower when the lagged log loss of DJI is
close to the empirical average and larger otherwise. We can also observe that the curves are
asymmetric. This may indicate that the DJI is more likely to fall down if there was a loss
within the last day than there was a same amount positive return.
Example 6.4. We apply the proposed methods to estimate the conditional value-at-risk
and expected shortfall of the International Business Machine Co. (NYSE: IBM) security
returns. The data are daily prices recorded from March 1, 1996 to April 6, 2005. We use
the same method to calculate the daily returns as in Example 3. In order to estimate the
value-at-risk of a stock return, generally, the information set X
t
may contain a market index
of corresponding capitalization and type, the industry index, and the lagged values of stock
return. For this example, Y
t
is the log loss of IBM stock returns and only two variables are
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 164
1.0 0.5 0.0 0.5 1.0
1
.
6
0
1
.
6
5
1
.
7
0
1
.
7
5
1
.
8
0
1
.
8
5
1
.
9
0
(a) Conditional VaR
1.0 0.5 0.0 0.5 1.0
2
.
2
2
.
3
2
.
4
2
.
5
2
.
6
(b) Conditional Es
Figure 6.6: (a) 5% CVaR estimate for DJI index. (b) 5% CES estimate for DJI index.
chosen as information set for the sake of simplicity. Let X
t1
be the rst lagged variable of
Y
t
and X
t2
denote the rst lagged daily log loss of Dow Jones Industrials (DJI) index. Our
main results from the estimation of the model are summarized in Figure 6.7. The surfaces
of the estimators of IBM returns are given in Figure 6.7(a) for CVaR and in Figure 6.7(b)
for CES. For visual convenience, Figures 6.7(c) and (e) depict the estimated CVaR and CES
curves (as function of X
t2
) for three dierent values of X
t1
= (0.275, 0.025, 0.325) and
Figures 6.7(d) and (f) display the estimated CVaR and CES curves (as function of X
t1
) for
three dierent values of X
t2
= (0.225, 0.025, 0.425).
From Figures 6.7(c) - (f), we can observe that most of these curves are U-shaped. This is
consistent with the results observed in Example 3. Also, we can see that these three curves
in each gure are not parallel. This implies that the eects of lagged IBM and lagged DJI
variables on the risk of IBM are dierent and complex. To be concrete, let us examine Figure
6.7(d). Three curves are close to each other when the lagged IBM log loss is around 0.2
and far away otherwise. This implies that DJI has fewer eects (less information) on CVaR
around this value. Otherwise, DJI has more eects when the lagged IBM log loss is far from
this value.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 165
IB
M
0.4
0.2
0.0
0.2
0.4
D
J
I
0.4
0.2
0.0
0.2
0.4
2.6
2.7
2.8
2.9
(a) Conditional VaR surface
IB
M
0.4
0.2
0.0
0.2
0.4
D
J
I
0.4
0.2
0.0
0.2
0.4
3.8
4.0
4.2
4.4
4.6
(b) Conditional ES surface
0.4 0.2 0.0 0.2 0.4
2
.
6
0
2
.
7
0
2
.
8
0
2
.
9
0
(c) Conditional VaR
x1=0.275
x1=0.025
x1=0.350
0.4 0.2 0.0 0.2 0.4
2
.
6
2
.
7
2
.
8
2
.
9
(d) Conditional VaR
x2=0.225
x2=0.025
x2=0.425
0.4 0.2 0.0 0.2 0.4
3
.
7
3
.
8
3
.
9
4
.
0
4
.
1
4
.
2
(e) Conditional ES
x1=0.275
x1=0.025
x1=0.350
0.4 0.2 0.0 0.2 0.4
3
.
8
4
.
0
4
.
2
4
.
4
4
.
6
(f) Conditional ES
x2=0.225
x2=0.025
x2=0.425
Figure 6.7: (a) 5% CVaR estimates for IBM stock returns. (b) 5% CES estimates for IBM
stock returns index. (c) 5% CVaR estimates for three dierent values of lagged negative
IBM returns (0.275, 0.025, 0.325). (d) 5% CVaR estimates for three dierent values of
lagged negative DJI returns (0.225, 0.025, 0.425). (e) 5% CES estimates for three dierent
values of lagged negative IBM returns (0.275, 0.025, 0.325). (f) 5% CES estimates for
three dierent values of lagged negative DJI returns (0.225, 0.025, 0.425).
6.6 Proofs of Theorems
In this section, we present the proofs of Theorems 6.1 - 6.4. First, we list two lemmas. The
proof of Lemma 6.1 can be found in Cai (2002) and the proof of Lemma 6.2 is relegated to
Section 6.7.
Lemma 6.1: Under Assumptions A1 - A5, we have
= h
0
1 + o
p
(1) and p
t
(x) = n
1
b
t
(x) 1 + o
p
(1),
where
0
=
2
(W) g

(x)/[2
2
(W
2
) g(x)] and b
t
(x) = [1 h
0
(X
t
x) W
h
(x X
t
)]
1
. Fur-
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 166
ther, we have
p
t
(c h) = n
1
b
c
t
(c h) 1 + o
p
(1),
where b
c
t
(x) = [1 +
c
(X
t
x) K
h
(x X
t
)]
1
.
Lemma 6.2: Under Assumptions A1 - A5, we have, for any j 0,
J
j
= n
1
n

t=1
c
t
(x)
_
X
t
x
h
_
j
= g(x)
j
(W) + O
p
(h
2
),
where c
t
(x) = b
t
(x) W
h
(x X
t
).
Before we start to provide the main steps for proofs of theorems. First, it follows from
Lemmas 6.1 and 6.2 that
W
c,t
(x, h)
b
t
(x) W
h
(x X
t
)

n
t=1
b
t
(x) W
h
(x X
t
)
n
1
g
1
(x) b
t
(x) W
h
(x X
t
) =
c
t
(x)
ng(x)
. (6.15)
Now we embark on the proofs of the theorems.
Proof of Theorem 6.1: By (6.7), we decompose

f
c
(y [ x) f(y [ x) into three parts as
follows

f
c
(y [ x) f(y [ x) I
1
+ I
2
+ I
3
, (6.16)
where with
t,1
= Y

t
(y) E(Y

t
(y)[X
t
),
I
1
=
n

t=1

t,1
W
c,t
(x, h), I
2
=
n

t=1
[E(Y

t
(y)[X
t
) f(y[X
t
)] W
c,t
(x, h),
and
I
3
=
n

t=1
[f(y [ X
t
) f(y [ x)] W
c,t
(x, h).
An application of the Taylor expansion, (6.7), (6.15), and Lemmas 6.1 and 6.2 gives
I
3
=
n

t=1
1
2
f
0,2
(y [ x) W
c,t
(x, h) (X
t
x)
2
+ o
p
(h
2
)
=
1
2
g
1
(x) f
0,2
(y [ x) n
1
n

t=1
c
t
(x) (X
t
x)
2
+ o
p
(h
2
)
=
h
2
2

2
(W) f
0,2
(y [ x) + o
p
(h
2
).
By (6.2) and following the same steps as in the proof of Lemma 6.2, we have
I
2
=
h
2
0

2
(K)
2 g(x)
n
1
n

t=1
f
2,0
(y [ X
t
) c
t
(x) + o
p
(h
2
0
+ h
2
) =
h
2
0
2

2
(K) f
2,0
(y [ x) + o
p
(h
2
0
+ h
2
).
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 167
Therefore,
I
2
+ I
3
=
h
2
2

2
(W) f
0,2
(y [ x) +
h
2
0
2

2
(K) f
2,0
(y [ x) + o
p
(h
2
+ h
2
0
) = B
f
(y [ x) + o
p
(h
2
+ h
2
0
).
Thus, (6.16) becomes
_
nh
0
h
_

f
c
(y [ x) f(y [ x) B
f
(y [ x) + o
p
(h
2
+ h
2
0
)
_
=
_
nh
0
h I
1
= g
1
(x) I
4
1 + o
p
(1) N
_
0,
2
f
(y [ x)
_
,
where I
4
=
_
h
0
h/n

n
t=1

t,1
c
t
(x). This, together with Lemma 6.3 in Section 6.7, therefore,
proves the theorem.
Proof of Theorem 6.2: Similar to (6.16), we have

S
c
(y [ x) S(y [ x) I
5
+ I
6
+ I
7
, (6.17)
where with
t,2
=

G
h
0
(y Y
t
) E(

G
h
0
(y Y
t
)[X
t
),
I
5
=
n

t=1

t,2
W
c,t
(x, h), I
6
=
n

t=1
[E

G
h
0
(y Y
t
) [ X
t
S(y[X
t
)] W
c,t
(x, h),
and
I
7
=
n

t=1
[S(y [ X
t
) S(y [ x)] W
c,t
(x, h).
Similar to the analysis of I
2
, by the Taylor expansion, (6.7), and Lemmas 6.1 and 6.2, we
have
I
7
=
n

t=1
1
2
S
0,2
(y [ x) W
c,t
(x, h) (X
t
x)
2
+ o
p
(h
2
)
=
1
2
S
0,2
(y [ x) g
1
(x) n
1
n

t=1
c
t
(x) (X
t
x)
2
+ o
p
(h
2
)
=
h
2
2

2
(W) S
0,2
(y [ x) + o
p
(h
2
).
To evaluate I
6
, rst, we consider the following
E[

G
h
0
(y Y
t
) [ X
t
= x] =
_

K(u) S(y h
0
u [ x)du
= S(y [ x) +
h
2
0
2

2
(K) S
2,0
(y [ x) + o(h
2
0
)
= S(y [ x)
h
2
0
2

2
(K) f
1,0
(y [ x) + o(h
2
0
). (6.18)
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 168
By (6.18) and following the same arguments as in the proof of Lemma 6.2, we have
I
6
=
h
2
0

2
(K)
2 g(x)
n
1
n

t=1
f
1,0
(y [ X
t
)c
t
(x) +o
p
(h
2
0
+h
2
) =
h
2
0
2

2
(K) f
1,0
(y [ x) +o
p
(h
2
0
+h
2
).
Therefore,
I
6
+ I
7
=
h
2
2

2
(W) S
0,2
(y [ x)
h
2
0
2

2
(K) f
1,0
(y [ x) + o
p
(h
2
+ h
2
0
) = B
S
(y [ x) + o
p
(h
2
+ h
2
0
),
so that by (6.17),

nh
_

S
c
(y [ x) S(y [ x) B
S
(y [ x) + o
p
(h
2
+ h
2
0
)
_
=

nhI
5
.
Clearly, to accomplish the proof of theorem, it suces to establish the asymptotic normality
of

nh I
5
. To this end, rst, we compute Var(
t,2
[ X
t
= x). Note that
E[

G
2
h
0
(y Y
t
) [ X
t
= x] =
_

G
2
h
0
(y u) f(u [ x)du
=
_

K(u
1
) K(u
2
) S(max(y h
0
u
1
, y h
0
u
2
) [ x)du
1
du
2
= S(y [ x) + 2 h
0
(K) f(y [ x) + O(h
2
0
), (6.19)
which, in conjunction with (6.18), implies that
Var(
t,2
[ X
t
= x) = S(y [ x) [1 S(y [ x)] + 2 h
0
(K) f(y [ x) + o(h
0
).
This, together with the fact that
Var(
t,2
c
t
(x)) = E
_
c
2
t
(x) E
2
t,2
[ X
t

= E
_
c
2
t
(x) Var(
t,2
[ X
t
)

,
leads to
h Var
t,2
c
t
(x) =
0
(W
2
) g(x) [S(y [ x)1 S(y [ x) + 2 h
0
(K) f(y [ x)] + o(h
0
).
Now, since [
t,2
[ 1, by following the same arguments as those used in the proofs of Lemmas
6.2 and 6.3 in Section 6.7 (or Lemma 1 and Theorem 1 in Cai (2002)), we can show although
tediously that
Var(I
8
) =
2
S
(y [ x) g
2
(x) + 2
0
(W
2
) h
0
(K) f(y [ x) g(x) + o(h
0
), (6.20)
where I
8
=
_
h/n

n
t=1

t,2
c
t
(x), and

nh I
5
= g
1
(x) I
8
1 + o
p
(1) N
_
0,
2
S
(y [ x)
_
.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 169
This completes the proof of Theorem 6.2.
Proof of Theorem 6.4: Similar to (6.12), we use the Taylor expansion and ignore the
higher terms to obtain
_

p(x)
y K
h
0
(y Y
t
)dy
_

p(x)
y K
h
0
(y Y
t
)dy
p
(x) K
h
0
(
p
(x) Y
t
) [
p
(x)
p
(x)]
= Y
t

G
h
0
(
p
(x) Y
t
)
p
(x) K
h
0
(
p
(x) Y
t
) [
p
(x)
p
(x)] + h
0
G
1,h
0
(
p
(x) Y
t
).
Plugging the above into (6.9) leads to
p
p
(x)
p,1
(x) + I
9
, (6.21)
where

p,1
(x) =
n

t=1
W
c,t
(x, h) Y
t

G
h
0
(
p
(x) Y
t
)
p
(x)

f
c
(
p
(x)[x)[
p
(x)
p
(x)],
which will be shown later to be the source of both the asymptotic bias and variance, and
I
9
= h
0
n

t=1
W
c,t
(x, h) G
1,h
0
(
p
(x) Y
t
),
which will be shown to contribute only the asymptotic bias (see Lemma 6.4 in Section 6.7).
From (6.12) and (6.8),

f
c
(
p
(x) [ x) [
p
(x)
p
(x)]
n

t=1
W
c,t
(x, h)

G
h
0
(
p
(x) Y
t
) p.
Therefore, by (6.15),

p,1
(x) =
n

t=1
W
c,t
(x, h) [Y
t

p
(x)

G
h
0
(
p
(x) Y
t
) p
p
(x)]
=
n

t=1
W
c,t
(x, h)
t,3
+
n

t=1
W
c,t
(x, h) E
t
(x) [ X
t

g
1
(x) n
1
n

t=1

t,3
c
t
(x) +
n

t=1
W
c,t
(x, h) E
t
(x) [ X
t


p,2
(x) +
p,3
(x),
where
t
(x) = [Y
t

p
(x)]

G
h
0
(
p
(x) Y
t
) +p
p
(x) and
t,3
=
t
(x) E
t
(x) [ X
t
. Next, we
derive the asymptotic bias and variance for
p,1
(x). Indeed, we will show that asymptotic
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 170
bias of
p
(x) comes from both
p,3
(x) and I
9
, and the asymptotic variance for
p,1
(x) is only
from
p,2
(x). First, we consider
p,3
(x). Now, it is easy to see by the Taylor expansion that
E[Y
t

G
h
0
(
p
(x) Y
t
) [ X
t
= v] =
_

K(u)du
_

p(x)h
0
u
y f(y [ v)dy
=
_

l
1
(
p
(x) h
0
u [ v) K(u)du = l
1
(
p
(x) [ v) +
h
2
0
2

2
(K) l
2,0
1
(
p
(x) [ v) + o(h
2
0
)
= l
1
(
p
(x) [ v)
h
2
0
2

2
(K)
_

p
(x) f
1,0
(
p
(x) [ v) + f(
p
(x) [ x)

+ o(h
2
0
),
which, in conjunction with (6.18), leads to
(v) = E[
t
(x) [ X
t
= v] = A(
p
(x) [ v)
h
2
0
2

2
(K) f(
p
(x) [ v) + o(h
2
0
), (6.22)
where A(
p
(x)[v) = l
1
(
p
(x) [ v)
p
(x) [S(
p
(x) [ v)p]. It is easy to verify that A(
p
(x)[v) =
E[Y
t

p
(x) I(Y
t

p
(x)) [ X
t
= v] + p
p
(x), A(
p
(x)[x) = p
p
(x), and A
0,2
(
p
(x)[x) =
l
0,2
1
(
p
(x) [ x)
p
(x) S
0,2
(
p
(x) [ x). Therefore, by (6.22), the Taylor expansion, and (6.7),

p,3
(x) becomes

p,3
(x) =
n

t=1
W
c,t
(x, h) (X
t
) = (x) +
1
2

(x)
n

t=1
W
c,t
(x, h) (X
t
x)
2
+ o
p
(h
2
).
Further, by Lemmas 6.1 and 6.2,

p,3
(x) = (x) +
h
2
2

2
(W)

(x) + o
p
(h
2
)
= p
p
(x) +
h
2
2

2
(W) A
0,2
(
p
(x) [ x)
h
2
0
2

2
(K) f(
p
(x) [ x) + o
p
(h
2
0
).
This, in conjunction with Lemma 6.4 in Section 6.7, concludes that

p,3
(x) + I
9
= p [
p
(x) + B

(x)] + o
p
(h
2
+ h
2
0
),
so that by (6.21),

p,1
(x) p [
p
(x) + B

(x)] =
p,2
(x) + o
p
(h
2
+ h
2
0
),
and

p
(x)
p
(x) B

(x) = p
1

p,2
(x) + o
p
(h
2
+ h
2
0
).
Finally, by Lemma 6.5 in Section 6.7, we have

nh
_

p
(x)
p
(x) B

(x) + o
p
(h
2
+ h
2
0
)

=
1
p g(x)
I
10
1 + o
p
(1) N
_
0,
2

(x)
_
,
where I
10
=
_
h/n

n
t=1

t,3
c
t
(x). Thus, we prove the theorem.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 171
6.7 Proofs of Lemmas
In this section, we present the proofs of Lemmas 6.2, 6.3, 6.4, and 6.5. Note that we use the
same notation as in Sections 6.2 - 6.6. Also, throughout this section, we denote a generic
constant by C, which may take dierent values at dierent appearances.
Proof of Lemma 6.2: Let
t
= c
t
(x)(X
t
x)
j
/h
j
. It is easy to verify by the Taylor
expansion that
E(J
j
) = E(
t
) =
_
v
j
W(v) g(x h v)
1 + h
0
v W(v)
dv = g(x)
j
(W) + O(h
2
), (6.23)
and
E(
2
t
) = h
1
_
v
2j
W
2
(v) g(x h v)
[1 + h
0
v W(v)]
2
dv = O(h
1
).
Also, by the stationarity, a straightforward manipulation yields
nVar(J
j
) = Var(
1
) +
n

t=2
l
n,t
Cov(
1
,
t
), (6.24)
where l
n,t
= 2 (nt +1)/n. Now decompose the second term on the right hand side of (6.24)
into two terms as follows
n

t=2
[Cov(
1
,
t
)[ =
dn

t=2
( ) +
n

t=dn+1
( ) J
j1
+ J
j2
, (6.25)
where d
n
= O(h
1/(1+/2)
). For J
j1
, it follows by Assumption A4 that [Cov(
1
,
t
)[ C, so
that J
j1
= O(d
n
) = o(h
1
). For J
j2
, Assumption A2 implies that [(X
t
x)
j
W
h
(x X
t
)[
C h
j1
, so that [
t
[ C h
1
. Then, it follows from the Davydovs inequality (see, e.g.,
Theorem 17.2.1 of Ibragimov and Linnik (1971)) that [Cov(
1
,
t+1
)[ C h
2
(t), which,
together with Assumption A5, implies that
J
j2
C h
2

tdn
(t) C h
2
d
(1+)
n
= o(h
1
).
This, together with (6.24) and (6.25), therefore implies that Var(J
j
) = O((nh)
1
) = o(1).
This completes the proof of the lemma.
Lemma 6.3: Under Assumptions A1 - A6 with h in A3 and A6 replaced by h h
0
, we have
I
4
=
_
h
0
h
n
n

t=1

t,1
c
t
(x) N
_
0,
2
f
(y [ x) g
2
(x)
_
.
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 172
Proof: It follows by using the same lines as those used in the proof of Lemma 6.2 and
Theorem 1 in Cai (2002), omitted. The outline is described as follows. First, similar to the
proof of Lemma 6.2, it is easy to see that
Var(I
4
) = h
0
h Var(
t,1
c
t
(x)) + h
0
h
n

t=2
l
n,t
Cov(
1,1
c
1
(x),
t,1
c
t
(x)). (6.26)
Next, we compute Var(
t,1
[ X
t
= x). Note that
h
0
E[Y

t
(y)
2
[ X
t
= x] =
_

K
2
(u) f(y h
0
u [ x)du =
0
(K
2
) f(y [ x) + O(h
2
0
),
which, together with the fact that
Var(
t,1
c
t
(x)) = E
_
c
2
t
(x) E
2
t,1
[ X
t

= E
_
c
2
t
(x) Var(
t,1
[ X
t
)

and (6.2), implies that


h h
0
Var(
t,1
c
t
(x)) =
0
(K
2
)
0
(W
2
) f(y [ x) g(x) + O(h
2
0
) =
2
f
(y [ x) g
2
(x) + O(h
2
0
).
As for the second term on the right hand side of (6.26), similar to (6.25), it is decomposed into
two summons. By using Assumption A4 for the rst summon and using the Davydovs in-
equality and Assumption A5 to the second summon, we can show that the second term on the
right hand side of (6.26) goes to zero as n goes to innity. Thus, Var(I
4
)
2
f
(y [ x) g
2
(x) by
(6.26). To show the normality, we employ Doobs small-block and large-block technique (see,
e.g., Ibragimov and Linnik, 1971, p. 316). Namely, partition 1, . . . , n into 2 q
n
+1 subsets
with large-block of size r
n
= (nh h
0
)
1/2
and small-block of size s
n
= (nh h
0
)
1/2
/ log n,
where q
n
= n/(r
n
+ s
n
) with x denoting the integer part of x. By following the same
steps as in the proof of Theorem 1 in Cai (2002), we can accomplish the rest of proofs:
the summands for the large-blocks are asymptotically independent, two summands for the
small-blocks are asymptotically negligible in probability, and the standard Lindeberg-Feller
conditions hold for the summands for the large-blocks. See Cai (2002) for details. So, the
proof of the lemma is complete.
Lemma 6.4: Under Assumptions A1 - A6, we have
I
9
= h
0
n

t=1
W
c,t
(x, h) G
1,h
0
(
p
(x) Y
t
) = h
2
0

2
(K) f(
p
(x) [ x) + o
p
(h
2
0
).
CHAPTER 6. CONDITIONAL VAR AND EXPECTED SHORTFALL 173
Proof: Dene
t,1
= c
t
(x) G
1,h
0
(
p
(x) Y
t
). Then, by Lemma 6.1, I
9
= I
10
1 + o
p
(1),
where I
10
= g
1
(x) h
0

n
t=1

t,1
/n. Similar to (6.23),
E (
t,1
) = E [c
t
(x) E G
1,h
0
(
p
(x) Y
t
) [ X
t
]
=
_

K(u) W(v) u S(
p
(x) h
0
u) [ x) g(x h v)
1 + h
0
v W(v)
dudv
= h
0

2
(K) f(
p
(x) [ x) g(x) + O(h
0
h
2
),
and
E(
2
t,1
) = E
_
b
2
t
(x) W
2
h
(x X
t
) E
_
G
2
1,h
0
(
p
(x) Y
t
) [ X
t
_
= O(h
0
/h),
so that Var(
t,1
) = O(h
0
/h). By following the same arguments in the derivation of Var(J
j
)
in Lemma 6.2, one can show that Var(I
10
) = O((nh)
1
) = o(1). This proves the lemma.
Lemma 6.5: Under Assumptions A1 - A4 and B2 - B5, we have
I
10
=
_
h
n
n

t=1

t,3
c
t
(x) N
_
0, p
2
g
2
(x)
2

(x)
_
.
Proof: It follows by using the same lines as those used in the proof of Lemma A.1 and
Theorem 1 in Cai (2001), omitted. The main idea is as follows. First, similar to the proof
of Lemmas 6.2 and 6.3, we will show by Assumptions B1 - B3 that
Var(I
10
) p
2

(x) g
2
(x). (6.27)
Finally, we need to compute Var(
t,3
c
t
(x)). Since
Var(
t,3
c
t
(x)) = E
_
c
2
t
(x) E
2
t,3
[ X
t

= E
_
c
2
t
(x) Var(
t
(x) [ X
t
)

,
then, we rst need to calculate Var(
t
(x) [ X
t
). To this eect, by (6.22),
Var(
t
(x) [ X
t
= v) = Var[(Y
t

p
(x))

G
h
0
(
p
(x) Y
t
) [ X
t
= v]
= E
_
(Y
t

p
(x))
2

G
2
h
0
(
p
(x) Y
t
)[X
t
= v

[l
1
(
p
(x)[v)
p
(x)S(
p
(x)[v)]
2
+ O(h
2
0
).
Similar to (6.19),
$$E\bigl[(Y_t-\nu_p(x))^2\,\bar G^2_{h_0}(\nu_p(x)-Y_t)\mid X_t=v\bigr] = \int \bar G^2_{h_0}(\nu_p(x)-y)\,(y-\nu_p(x))^2\,f(y\mid v)\,dy$$
$$= \int\!\!\int K(u_1)\,K(u_2)\,\Lambda\bigl(\max(\nu_p(x)-h_0u_1,\ \nu_p(x)-h_0u_2)\mid v\bigr)\,du_1\,du_2
= \Lambda(\nu_p(x)\mid v) - 2h_0\,\lambda_{1,0}(\nu_p(x)\mid v)\,\mu_1(K) + O(h_0^2) = \Lambda(\nu_p(x)\mid v) + O(h_0^2)$$
since $\lambda_{1,0}(\nu_p(x)\mid v) = 0$, where $\Lambda(u\mid v) = l_2(u\mid v) - 2\nu_p(x)\,l_1(u\mid v) + \nu_p^2(x)\,S(u\mid v)$. Therefore,
$$\mathrm{Var}(\eta_t(x)\mid X_t=v) = \mathrm{Var}\bigl[(Y_t-\nu_p(x))\,I(Y_t\le\nu_p(x))\mid X_t=v\bigr] + O(h_0^2),$$
and
$$h\,\mathrm{Var}\{\eta_{t,3}\,c_t(x)\} = \nu_0(W^2)\,\mathrm{Var}\bigl[(Y_t-\nu_p(x))\,I(Y_t\le\nu_p(x))\mid X_t=x\bigr]\,g(x) + o(1).$$
Similar to Lemmas 6.2 and 6.3, clearly, we have
$$\mathrm{Var}(I_{10}) = h\,\mathrm{Var}\{\eta_{t,3}\,c_t(x)\} + h\sum_{t=2}^{n} l_{n,t}\,\mathrm{Cov}\{\eta_{1,3}\,c_1(x),\ \eta_{t,3}\,c_t(x)\},$$
and the first term on the right-hand side of the above equation converges to $p^2\,\sigma_\mu^2(x)\,g^2(x)$. As for the second term on the right-hand side, similar to (6.25), it is decomposed into two sums. By using Assumptions A4 and B2 for the first sum and the Davydov inequality together with Assumptions A5 and B3 for the second sum, we can show that the second term goes to zero as $n\to\infty$. Thus, (6.27) holds. To establish asymptotic normality, we employ Doob's small-block and large-block technique (see, e.g., Ibragimov and Linnik, 1971, p. 316). Namely, partition $\{1,\ldots,n\}$ into $2q_n+1$ subsets with large blocks of size $r_n$ and small blocks of size $s_n$, where $s_n$ is given in Assumption B4, $q_n = \lfloor n/(r_n+s_n)\rfloor$, and $r_n = (nh)^{1/2}/\gamma_n$ with $\{\gamma_n\}$ a sequence of positive numbers satisfying $\gamma_n s_n/\sqrt{nh}\to 0$ and $\gamma_n\,(n/h)^{1/2}\,\alpha(s_n)\to 0$ by Assumption B4. By following the same steps as in the proof of Theorem 1 in Cai (2001) and using Assumption B5, we can accomplish the rest of the proof: the summands over the large blocks are asymptotically independent, the summands over the small blocks are asymptotically negligible in probability, and the standard Lindeberg-Feller conditions hold for the large-block summands. See Cai (2001) for details. Therefore, the lemma is proved.
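To make the small-block/large-block step concrete, the partition used in this and the preceding lemmas can be written schematically as follows (a standard restatement of the technique, not an addition to the argument). Writing the normalized sum as $S_n = \sum_{t=1}^{n} Z_t$, with $Z_t = \sqrt{h/n}\,\eta_{t,3}\,c_t(x)$ here,
$$S_n = \sum_{j=0}^{q_n-1} U_j + \sum_{j=0}^{q_n-1} V_j + R_n, \qquad U_j = \sum_{t=j(r_n+s_n)+1}^{j(r_n+s_n)+r_n} Z_t, \qquad V_j = \sum_{t=j(r_n+s_n)+r_n+1}^{(j+1)(r_n+s_n)} Z_t,$$
with $R_n$ collecting the remaining at most $r_n+s_n$ terms. The large-block sums $U_j$ are separated in time by gaps of length $s_n$, so the mixing condition allows them to be replaced by independent copies at a cost of order $q_n\,\alpha(s_n)$; the small-block and remainder sums are asymptotically negligible because $q_n s_n/n \to 0$; and the Lindeberg-Feller condition for the independent large blocks yields the normal limit.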
6.8 Computer Codes
Please see the files chapter6-1.r, chapter6-2.r, chapter6-3.r, and chapter6-4.r for making the figures. If you want to learn the codes for computation, they are available upon request.
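For readers without access to those files, the following is a minimal, self-contained sketch in R of the kind of computation involved. It uses a plain Nadaraya-Watson smoother of the conditional distribution (rather than the weighted double-kernel local linear estimator developed in this chapter), inverts it to obtain the conditional VaR, and averages the smoothed lower tail for the conditional expected shortfall. The function name, the rule-of-thumb bandwidths, and the simulated data are illustrative assumptions and are not taken from the course files.

# Minimal illustrative sketch (not the course files chapter6-*.r): a plain
# Nadaraya-Watson estimate of the conditional distribution, inverted to get
# the conditional VaR (quantile) and averaged in the tail for the expected
# shortfall. Smoothing in the y direction uses the integrated Gaussian kernel.
nw.cvar.ces <- function(x0, X, Y, p = 0.05, h = NULL, h0 = NULL) {
  n <- length(Y)
  if (is.null(h))  h  <- 1.06 * sd(X) * n^(-1/5)   # crude rule-of-thumb bandwidths
  if (is.null(h0)) h0 <- 1.06 * sd(Y) * n^(-1/5)
  w <- dnorm((x0 - X) / h)                 # kernel weights in the x direction
  w <- w / sum(w)
  # Smoothed conditional c.d.f.: F.hat(y | x0) = sum_t w_t * Phi((y - Y_t)/h0)
  F.hat <- function(y) sum(w * pnorm((y - Y) / h0))
  # Conditional VaR: the p-th conditional quantile, found by root search
  nu.p <- uniroot(function(y) F.hat(y) - p,
                  interval = range(Y) + c(-1, 1) * 3 * h0)$root
  # Conditional expected shortfall: E(Y | Y <= nu.p, X = x0), using the
  # smoothed tail weight G((nu.p - Y_t)/h0) in place of I(Y_t <= nu.p)
  G <- pnorm((nu.p - Y) / h0)
  mu.p <- sum(w * G * Y) / sum(w * G)
  list(VaR = nu.p, ES = mu.p, h = h, h0 = h0)
}

# Small simulated example: Y depends on X with conditional heteroskedasticity
set.seed(1)
n <- 500
X <- rnorm(n)
Y <- 0.2 * X + (0.5 + 0.3 * abs(X)) * rnorm(n)
out <- nw.cvar.ces(x0 = 0, X = X, Y = Y, p = 0.05)
print(out[c("VaR", "ES")])

The Gaussian kernel and its distribution function are used here only for convenience; the asymptotic results above are stated for general kernels K and W, and in practice the bandwidths h and h0 would be chosen by a data-driven rule rather than the crude rule of thumb in this sketch.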
6.9 References
Acerbi, C. and D. Tasche (2002). On the coherence of expected shortfall. Journal of
Banking and Finance, 26, 1487-1503.
Artzner, P., F. Delbaen, J.M. Eber, and D. Heath (1999). Coherent measures of risk.
Mathematical Finance, 9, 203-228.
Boente, G. and R. Fraiman (1995). Asymptotic distribution of smoothers based on local
means and local medians under dependence. Journal of Multivariate Analysis, 54,
77-90.
Cai, Z. (2001). Weighted Nadaraya-Watson regression estimation. Statistics and Probability
Letters, 51, 307-318.
Cai, Z. (2002). Regression quantiles for time series data. Econometric Theory, 18, 169-192.
Cai, Z. (2007). Trending time-varying coefficient time series models with serially correlated errors. Journal of Econometrics, 137, 163-188.
Cai, Z. and L. Qian (2000). Local estimation of a biometric function with covariate effects. In Asymptotics in Statistics and Probability (M. Puri, ed.), 47-70.
Cai, Z. and R.C. Tiwari (2000). Application of a local linear autoregressive model to BOD
time series. Environmetrics, 11, 341-350.
Cai, Z. and X. Wang (2006). Nonparametric methods for estimating conditional value-at-
risk and expected shortfall. Forthcoming in Journal of Econometrics.
Cai, Z. and X. Xu (2005). Nonparametric quantile estimations for dynamic smooth coefficient models. Forthcoming in Journal of the American Statistical Association.
Chaudhuri, P. (1991). Nonparametric estimates of regression quantiles and their local Bahadur representation. The Annals of Statistics, 19, 760-777.
Chen, S.X. (2006). Nonparametric estimation of expected shortfall. Working paper, Department of Statistics, Iowa State University.
Chen, S.X. and C.Y. Tang (2005). Nonparametric inference of value at risk for dependent financial returns. Journal of Financial Econometrics, 3, 227-255.
Chernozhukov, V. and L. Umantsev (2001). Conditional value-at-risk: Aspects of modeling and estimation. Empirical Economics, 26, 271-292.
Cosma, A., O. Scaillet and R. von Sachs (2006). Multivariate wavelet-based shape preserving estimation for dependent observations. Bernoulli, in press.
Duffie, D. and J. Pan (1997). An overview of value at risk. Journal of Derivatives, 4, 7-49.
Duffie, D. and K.J. Singleton (2003). Credit Risk: Pricing, Measurement, and Management. Princeton: Princeton University Press.
Embrechts, P., C. Klüppelberg, and T. Mikosch (1997). Modelling Extremal Events for Insurance and Finance. New York: Springer-Verlag.
Engle, R.F. and S. Manganelli (2004). CAViaR: Conditional autoregressive value at risk by regression quantiles. Journal of Business and Economic Statistics, 22, 367-381.
Fan, J. and I. Gijbels (1996). Local Polynomial Modeling and Its Applications. London:
Chapman and Hall.
Fan, J. and J. Gu (2003). Semiparametric estimation of value-at-risk. Econometrics Journal, 6, 261-290.
Fan, J., T.-C. Hu and Y.K. Truong (1994). Robust nonparametric function estimation.
Scandinavian Journal of Statistics, 21, 433-446.
Fan, J., Q. Yao, and H. Tong (1996). Estimation of conditional densities and sensitivity
measures in nonlinear dynamical systems. Biometrika, 83, 189-206.
Frey, R. and A.J. McNeil (2002). VaR and expected shortfall in portfolios of dependent
credit risks: conceptual and practical insights. Journal of Banking and Finance, 26,
1317-1334.
Hall, P., R.C.L. Wolff, and Q. Yao (1999). Methods for estimating a conditional distribution
function. Journal of the American Statistical Association, 94, 154-163.
Hürlimann, W. (2003). A Gaussian exponential approximation to some compound Poisson
distributions. ASTIN Bulletin, 33, 41-55.
Ibragimov, I.A. and Yu. V. Linnik (1971). Independent and Stationary Sequences of Random Variables. Groningen, the Netherlands: Wolters-Noordhoff.
Jorion, P. (2001). Value at Risk, 2nd Edition. New York: McGraw-Hill.
Jorion, P. (2003). Financial Risk Manager Handbook, 2nd Edition. New York: John Wiley.
Koenker, R. and G.W. Bassett (1978). Regression quantiles. Econometrica, 46, 33-50.
Lejeune, M.G. and P. Sarda (1988). Quantile regression: a nonparametric approach. Computational Statistics and Data Analysis, 6, 229-281.
Masry, E. and J. Fan (1997). Local polynomial estimation of regression functions for mixing
processes. The Scandinavian Journal of Statistics, 24, 165-179.
McNeil, A. (1997). Estimating the tails of loss severity distributions using extreme value
theory. ASTIN Bulletin, 27, 117-137.
Morgan, J.P. (1996). RiskMetrics - Technical Document, 4th Edition. New York.
Oakes, D. and T. Dasu (1990). A note on residual life. Biometrika, 77, 409-410.
Rockafellar, R. and S. Uryasev (2000). Optimization of conditional value-at-risk. Journal
of Risk, 2, 21-41.
Roussas, G.G. (1969). Nonparametric estimation of the transition distribution function of
a Markov process. The Annals of Mathematical Statistics, 40, 1386-1400.
Roussas, G.G. (1991). Estimation of transition distribution function and its quantiles in
Markov processes: Strong consistency and asymptotic normality. In G.G. Roussas
(ed.), Nonparametric Functional Estimation and Related Topics, pp. 443-462. Amsterdam: Kluwer Academic.
Samanta, M. (1989). Nonparametric estimation of conditional quantiles. Statistics and
Probability Letters, 7, 407-412.
Scaillet, O. (2004). Nonparametric estimation and sensitivity analysis of expected shortfall.
Mathematical Finance, 14, 115-129.
Scaillet, O. (2005). Nonparametric estimation of conditional expected shortfall. Revue
Assurances et Gestion des Risques/Insurance and Risk Management Journal, 74, 639-
660.
Truong, Y.K. (1989). Asymptotic properties of kernel estimators based on local medians.
The Annals of Statistics, 17, 606-617.
Truong, Y.K. and C.J. Stone (1992). Nonparametric function estimation involving time series. The Annals of Statistics, 20, 77-97.
Yu, K. and M.C. Jones (1997). A comparison of local constant and local linear regression
quantile estimation. Computational Statistics and Data Analysis, 25, 159-166.
Yu, K. and M.C. Jones (1998). Local linear quantile regression. Journal of the American
Statistical Association, 93, 228-237.