
Journal of Econometrics 189 (2015) 272–284


Functional index coefficient models with variable selection✩


Zongwu Cai a,b, Ted Juhl a, Bingduo Yang c,∗

a Department of Economics, University of Kansas, Lawrence, KS 66045, USA
b Wang Yanan Institute for Studies in Economics and Fujian Key Lab of Statistical Sciences, Xiamen University, Xiamen, Fujian 361005, China
c School of Finance, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330013, China

article info

Article history: Available online 19 March 2015

JEL classification: C140; C580; C520

Keywords: Functional index coefficient autoregressive model; Model selection; Oracle property; Penalty function; Smoothly clipped absolute deviation

abstract

We consider model (variable) selection in a semi-parametric time series model with functional coefficients. Variable selection in the semi-parametric model must account for the fact that the parametric part of the model is estimated at a faster convergence rate than the nonparametric component. Our variable selection procedures employ a smoothly clipped absolute deviation penalty function and consist of two steps. The first is to select covariates with functional coefficients that enter in the semi-parametric model. Then, we perform variable selection for variables with parametric coefficients. The asymptotic properties, such as consistency, sparsity and the oracle property of these two-step estimators, are established. A Monte Carlo simulation study is conducted to examine the finite sample performance of the proposed estimators and variable selection procedures. Finally, an empirical example exploring the predictability of asset returns demonstrates the practical application of the proposed functional index coefficient autoregressive models and variable selection procedures.

© 2015 Elsevier B.V. All rights reserved.

✩ We thank the Guest Editors, Professors Shiqing Ling, Michael McAleer and Howell Tong, and the anonymous referees for their insightful comments that greatly improved our paper. Cai's research was supported, in part, by the National Natural Science Foundation of China grants #71131008 (Key Project), #70871003 and #70971113. Yang's research was supported by the National Natural Science Foundation of China grant #71401066 and the Specialized Research Fund for the Doctoral Program of Higher Education #20130161120023.
∗ Corresponding author.
E-mail addresses: caiz@ku.edu (Z. Cai), juhl@ku.edu (T. Juhl), bdyang2006@gmail.com (B. Yang).
http://dx.doi.org/10.1016/j.jeconom.2015.03.022

1. Introduction

Linear time series models such as linear autoregressive moving average models, hereafter ARMA models (Box and Jenkins, 1970), were well developed in the last century. However, linear ARMA models may not capture some important and potentially nonlinear features of the data in economics and finance. Many nonlinear time series models have been proposed. The early work includes the bilinear models (Granger and Andersen, 1978), the threshold autoregressive (TAR) models (Tong, 1990), the smooth transition AR (STAR) models (Chan and Tong, 1986; Teräsvirta, 1994) and Markov switching models (Hamilton, 1989), among others. One of the popular semiparametric models is the functional coefficient autoregressive (FAR) model, which was proposed by Chen and Tsay (1993) and extended by Cai et al. (2000b). The coefficients in the FAR model are in unknown vector functional form depending on lagged terms, which satisfy

    r_t = \sum_{j=1}^{p} g_j(r_{t-1}^{*})\, r_{t-j} + \varepsilon_t,

where r_{t-1}^{*} = (r_{t-i_1}, \ldots, r_{t-i_d})^T with 1 \le i_1 < i_2 < \cdots < i_d, and g_j(\cdot) is an unknown function in R^d for 1 \le j \le p. The above FAR model covers several traditional varying coefficient models as special cases, such as the threshold autoregressive models in Tong (1990) and the STAR models in Chan and Tong (1986) and Teräsvirta (1994).

Due to the curse of dimensionality, Chen and Tsay (1993) considered only the single threshold variable case r_{t-1}^{*} = r_{t-k} for some k, and they proposed an arranged local regression to estimate the functional coefficients {g_j(\cdot)} with an iterative algorithm. In fact, their method is similar to the local constant semiparametric estimator, as pointed out by Cai et al. (2000b). For efficient estimation of the FAR model, the reader is referred to the papers by Cai et al. (2000a) and Fan and Zhang (1999).

To overcome the curse of dimensionality and incorporate more variables in the functional coefficients {g_j(\cdot)}, we assume that r_{t-1}^{*} is a linear combination of the r_{t-i_k}'s, e.g. r_{t-1}^{*} = \beta^T r_{t-1}, where r_{t-1} = (r_{t-1}, \ldots, r_{t-d})^T. We denote this model as the functional index

coefficient autoregressive (FIAR) model satisfying

    r_t = \sum_{j=1}^{p} g_j(\beta^T r_{t-1})\, r_{t-j} + \varepsilon_t,

where g_j(\cdot) is an unknown function in R for 1 \le j \le p. In fact, the above FIAR model can be regarded as a case of the functional index coefficient models of Fan et al. (2003) with

    y_i = \sum_{j=1}^{p} g_j(\beta^T Z_i) X_{ji} + \varepsilon_i \equiv g(\beta^T Z_i)^T X_i + \varepsilon_i, \quad 1 \le i \le n, \qquad (1)

where y_i is a dependent variable, X_i = (X_{1i}, X_{2i}, ..., X_{pi})^T is a p × 1 vector of covariates, Z_i is a d × 1 vector of local variables, the ε_i are independently and identically distributed (i.i.d.) with mean 0 and standard deviation σ, β ∈ R^d is a d × 1 vector of unknown parameters, and g(·) = (g_1(·), ..., g_p(·))^T is a p × 1 vector of unknown functional coefficients. We assume that ∥β∥ = 1 and that the first element of β is positive for identification, where ∥·∥ is the Euclidean norm (L2-norm). Note that both X_i and Z_i can include lagged variables of y_i. In particular, if X_{1i} ≡ 1, then model (1) contains an intercept function term.

Xia and Li (1999) studied the asymptotic properties of model (1) under mixing conditions when the index part of the above model is not constrained to be a linear combination of Z_i. However, for the efficiency of estimation and the accuracy of prediction, it is important to select variables in both Z_i and X_i, and to potentially exclude variables in Eq. (1). Fan et al. (2003) provided algorithms to estimate the local parameters β and the functional coefficients g(·). Meanwhile, they deleted the least significant variables in a given model according to the t-value, and selected the best model in terms of the Akaike information criterion (AIC) of Akaike (1973) in multiple steps. However, as mentioned in Fan and Li (2001), this stepwise deletion procedure may suffer from stochastic errors inherited from the multiple stages. Meanwhile, there is no theory for this variable selection procedure, and the authors did not mention how to select the regressors X_i. These selection issues motivate us to consider variable selection on both the local variables Z_i and the covariates X_i in model (1).

The FIAR model reduces the curse of dimensionality since each of the nonparametric functions has only one argument. However, there still remain potential areas of dimension reduction. First, there are several nonparametric functions in the p × 1 vector g(β^T Z). In addition, the vector Z is d-dimensional. Hence, by using model selection methods, there is potential to find a more parsimonious model that effectively captures the features of our data. Variable selection methods and their algorithms can be traced back four decades. Pioneering contributions include the AIC and the Bayesian information criterion (BIC) of Schwarz (1978). Various shrinkage-type methods have been developed recently, including but not limited to the nonnegative garrotte of Breiman (1995), bridge regression of Fu (1998), the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996), the smoothly clipped absolute deviation of Fan and Li (2001), the adaptive LASSO of Zou (2006), and so on. The reader is referred to the review paper by Fan and Lv (2010) for details. Here, we recommend the smoothly clipped absolute deviation (SCAD) penalty function of Fan and Li (2001) since it merits the three properties of unbiasedness, sparsity and continuity. Furthermore, it has the oracle property; namely, the resulting procedures perform as well as those that correspond to the case when the true model is known in advance.

The shrinkage method has been successfully extended to semiparametric models; see, for example, variable selection in partially linear models in Liang and Li (2009), partially linear models in longitudinal data in Fan and Li (2004), single-index models in Kong and Xia (2007), semiparametric regression models in Brent et al. (2008) and Li and Liang (2008), varying coefficient partially linear models with errors-in-variables in Zhao and Xue (2010), and partially linear single-index models in Liang et al. (2010), and the references therein.

However, the aforementioned papers focused mainly on variable selection with parametric coefficients. The shrinkage method has also been extended to select significant variables with functional coefficients. Lin and Zhang (2006) proposed the component selection and smoothing operator (COSSO) for model selection and model fitting in multivariate nonparametric regression models in the framework of smoothing spline analysis of variance. Meanwhile, they extended the COSSO to the exponential families (Zhang and Lin, 2006). Wang et al. (2008) proposed variable selection procedures with basis function approximations and SCAD, which is similar to the COSSO, and they argued that their procedures can select significant variables with time-varying effects and estimate the nonzero smooth coefficient functions simultaneously. Huang et al. (2010) proposed to use the adaptive group LASSO for variable selection in nonparametric additive models based on a spline approximation, in which the number of variables and additive components may be larger than the sample size. By adopting the idea of the grouping method in Yuan and Lin (2006), Wang and Xia (2009) used the kernel LASSO to apply shrinkage to functional coefficients in the varying coefficient models. Their pure nonparametric shrinkage procedure is different from the approaches using spline and basis functions (Lin and Zhang, 2006; Wang et al., 2008; Huang et al., 2010). For a comprehensive survey of variable selection in nonparametric and semiparametric regression models via shrinkage, the reader is referred to the paper by Su and Zhang (2013).

Almost all the variable selection procedures mentioned above are based on the assumption that the observations are independent and identically distributed (i.i.d.). To the best of our knowledge, few papers consider variable selection under non-i.i.d. settings. Such procedures may not be appropriate for analyzing financial and economic data directly, since most financial/economic data are weakly dependent. To address this issue, Wang et al. (2007) extended the LASSO to the regression model with autoregressive errors. In this paper, we consider variable selection in functional index coefficient models under a very general dependence structure: the strong mixing context. Our variable selection procedures consist of two steps. The first is to select covariates with functional coefficients, and then we perform model selection for local variables with parametric coefficients.

The rest of this paper is organized as follows. In Section 2, we present the identification conditions for functional index coefficient models, our new two-step estimation procedures, and some properties of the SCAD penalty function and numerical implementations. In Section 3, we propose variable selection procedures for both covariates with functional coefficients and local variables with parametric coefficients. We then establish the consistency, the sparsity and the oracle property of all the proposed estimators. A simple bandwidth selection method is also discussed in the same section. Monte Carlo simulation results for the proposed two-step procedures are reported in Section 4. An empirical example applying the functional index coefficient autoregressive model and its variable selection procedures is extensively studied in Section 5. Finally, the concluding remarks are given in Section 6 and all the regularity conditions and technical proofs are gathered in the Appendix.

2. Identification, estimation and penalty function

2.1. Identification

The identification problem in the single index model was first investigated by Ichimura (1993), and extensively studied by

Li and Jeffrey (2007) and Horowitz (2009). Meanwhile, partial conditions for identification in functional index coefficient models were shown in Fan et al. (2003). Here we present the conditions for identification below.

Theorem 1 (Identification in Functional Index Coefficient Models). Assume that the dependent variable Y is generated by Eq. (1), X is a p-dimensional vector of covariates and Z is a d-dimensional vector of local variables. β is a d-dimensional vector of unknown parameters and g(·) is a p-dimensional vector of unknown functional coefficients. Then, β and g(·) are identified if the following conditions hold:

Assumption I.
I1. The vector functions g(·) are continuous and not constant everywhere.
I2. The components of Z are continuously distributed random variables.
I3. There exists no perfect multi-collinearity among the components of Z, and none of the components of Z is constant.
I4. There exists no perfect multi-collinearity among the components of X.
I5. The first element of β is positive and ∥β∥ = 1, where ∥·∥ is the standard Euclidean norm.
I6. When X = Z, E(Y|X, Z) becomes E(Y|X) and it cannot be expressed in the form E(Y|X) = α^T X β^T X + γ^T X + c, where α, γ ∈ R^d and c ∈ R are constant, and α and β are not parallel to each other.

Remark 1. Assumption I1 is a mild condition since continuous and bounded functions are commonly assumed in nonparametric estimation, and it is obvious that β cannot be identified if any element of g(·) is a constant. We can relax Assumption I2 to allow some components of Z to be discrete random variables; however, two more conditions should then be imposed, see Ichimura (1993) and Horowitz (2009) for details. The perfect multi-collinearity problem in Assumptions I3 and I4 is similar to that in classical linear models. In fact, it would be hard to obtain accurate estimates if high correlation exists among the components of either Z or X. Meanwhile, the model is not identified if any component of Z is constant. For example, if Z1 = 1, then E(Y|X, Z) = g^T(β1 + β2 Z2 + ··· + βd Zd)X = f^T(β2 Z2 + ··· + βd Zd)X. An alternative to Assumption I5 is to set the first coefficient to 1, i.e. β1 = 1. However, this makes it infeasible to implement variable selection procedures, since we do not have any prior information on whether the coefficient β1 of Z1 is zero or not. Assumption I6 can be found in the paper by Fan et al. (2003).

2.2. Estimation procedures

Model (1) can be regarded as a semiparametric model. Therefore, to estimate both the functions g(·) and the parameters β, it is common to use a two-stage approach. To estimate g(·), one needs an initial estimator β̂, which might have little effect on the final estimation of g(·) if the sample size n is large enough, due to the fact that the convergence rate of the parametric estimator β̂ is faster than that of the nonparametric estimator ĝ(·). Here, we propose variable selection and estimation in two steps:

Step One: Given an initial estimator β̂ such that ∥β̂ − β∥ = O_p(1/\sqrt{n}), minimize the penalized local least squares Q(ĝ, β̂, h) to obtain ĝ(·), where

    Q(\hat{g}, \hat{\beta}, h) = \sum_{j=1}^{n} \sum_{i=1}^{n} \left[ y_i - \hat{g}^T(\hat{\beta}^T Z_j) X_i \right]^2 K_h(\hat{\beta}^T Z_i - \hat{\beta}^T Z_j) + n \sum_{k=1}^{p} P_{\lambda_n}\left(\|\hat{g}_{\cdot k}\|\right), \qquad (2)

with K(·) being a kernel function, K_h(z) = K(z/h)/h, and P_{λ_n}(·) being a penalty function. {λ1, ..., λp} are tuning parameters and ĝ_{·k} = [ĝ_k(β̂^T Z1), ..., ĝ_k(β̂^T Zn)]^T is the vector of estimates of the kth functional coefficient at the corresponding sample points. As recommended, an initial estimator β̂ can be obtained by various algorithms, such as the method in Fan et al. (2003), or by average derivative estimators such as Newey and Stoker (1993). As long as the initial estimator satisfies ∥β̂ − β∥ = O_p(1/\sqrt{n}), as expected, the parametric estimator β̂ has little effect on the shrinkage estimation of the functional coefficients ĝ(·) in the above equation if the sample size n is large. We choose the penalty term P_{λ_n}(·) as the SCAD function, which is described in Section 2.3, and the L2 functional norm ∥ĝ_{·k}∥ = [ĝ_k²(β̂^T Z1) + ··· + ĝ_k²(β̂^T Zn)]^{1/2} has the same definition as the standard Euclidean norm. The purpose of using the penalized locally weighted least squares is to select significant covariates X_i in model (1).

Note that when the penalty term P_{λ_n}(z) = λ_n|z|, the penalized local least squares becomes the Lasso type, so that the above objective function in Eq. (2) reduces to the case in the paper by Wang and Xia (2009).

Step Two: Given the estimator of the function ĝ(·), minimize the penalized global least squares Q(β, ĝ), where

    Q(\beta, \hat{g}) = \frac{1}{2} \sum_{i=1}^{n} \left[ y_i - \hat{g}^T(\beta^T Z_i) X_i \right]^2 + n \sum_{k=1}^{d} \Psi_{\zeta_n}(|\beta_k|), \qquad (3)

with Ψ(·) being a penalty function. {ζ1, ..., ζd} are tuning parameters and |β_k| is the absolute value of β_k.

Clearly, the above general setting may cover several other existing variable selection procedures as special cases. For example, when p = 1 and the regressor X = 1, the above procedure becomes variable selection for the single-index model in Kong and Xia (2007), which provided an alternative variable selection method called separated cross validation. When p = 2 and the only regressor is the market return, the above model reduces to the case in the paper by Cai et al. (2014a) for an application in finance. In particular, they considered semiparametric estimates of time-varying betas and alpha in the conditional capital asset pricing model with variable selection. Furthermore, the model includes as a special case variable selection in partially linear single-index models as addressed in Liang et al. (2010), if only the first functional coefficient g(·) is nonlinear and all others are constant. Finally, it transforms to variable selection in semiparametric regression modeling by Li and Liang (2008), if the dimension of the local variables d = 1 and some of the functional coefficients g(·) are constant and others are not.

2.3. Penalty function and implementation

As pointed out by Fan and Li (2001), a good penalty function should enjoy the following three desirable properties: unbiasedness for large true unknown parameters; sparsity, which automatically sets small estimates to zero; and continuity of the resulting estimator, to avoid instability in model prediction.

To achieve all three of the aforementioned properties, Fan and Li (2001) proposed the following so-called SCAD penalty function,

    P_\lambda(|\beta|) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\ -(|\beta|^2 - 2a\lambda|\beta| + \lambda^2)/[2(a-1)], & \lambda < |\beta| \le a\lambda, \\ (a+1)\lambda^2/2, & |\beta| > a\lambda. \end{cases} \qquad (4)
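To make the piecewise SCAD penalty in Eq. (4) concrete, here is a minimal NumPy sketch of the penalty and its first derivative (the derivative formula is the standard one from Fan and Li, 2001); the function and variable names are ours, not the paper's:

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), Eq. (4), evaluated elementwise."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam,
        lam * b,                                        # linear near zero
        np.where(
            b <= a * lam,
            -(b**2 - 2 * a * lam * b + lam**2) / (2 * (a - 1)),  # quadratic bridge
            (a + 1) * lam**2 / 2,                        # flat for large |beta|
        ),
    )

def scad_derivative(beta, lam, a=3.7):
    """First derivative P'_lambda(|beta|): lam on [0, lam],
    (a*lam - |beta|)/(a - 1) on (lam, a*lam], and 0 beyond a*lam."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam,
        lam,
        np.where(b <= a * lam, (a * lam - b) / (a - 1), 0.0),
    )
```

Because the penalty is flat beyond aλ, large coefficients are left essentially unpenalized, which is what yields (near) unbiasedness; a = 3.7 is the default value recommended by Fan and Li (2001).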

An important property of the SCAD penalty function is that it has the following first derivative,

    P'_\lambda(|\beta|) = \begin{cases} \lambda, & |\beta| \le \lambda, \\ (a\lambda - |\beta|)/(a-1), & \lambda < |\beta| \le a\lambda, \\ 0, & |\beta| > a\lambda, \end{cases} \quad \text{for some } a > 2, \qquad (5)

which makes the computational implementation easy. It can be clearly seen that P_λ(|β|) is not differentiable at 0 with respect to β. Thus, it is not easy to minimize the penalized least squares due to this singularity. To ease the implementation, Fan and Li (2001) suggested approximating the penalty function by a quadratic function,

    P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\left\{ P'_\lambda(|\beta_j^{(0)}|)/|\beta_j^{(0)}| \right\} \left( \beta_j^2 - \beta_j^{(0)2} \right) \quad \text{for } \beta_j \approx \beta_j^{(0)}. \qquad (6)

Alternatively, Zou and Li (2008) proposed the local linear approximation for non-concave penalty functions,

    P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + P'_\lambda(|\beta_j^{(0)}|)\left( |\beta_j| - |\beta_j^{(0)}| \right) \quad \text{for } \beta_j \approx \beta_j^{(0)}, \qquad (7)

which can reduce the computational cost without losing any statistical efficiency. Meanwhile, some other algorithms, such as the minorize-maximize algorithm (Hunter and Li, 2005), have also been proposed.

In view of (6), given a good initial value β^{(0)}, we can find the one-step estimator as follows:

    \beta^{(1)} = \arg\min_\beta \left\{ \frac{1}{2}(\beta - \beta^{(0)})^T \left[-\nabla^2 \ell(\beta^{(0)})\right](\beta - \beta^{(0)}) + n \sum_{k=1}^{d} \frac{P'_\lambda(|\beta_k^{(0)}|)}{2|\beta_k^{(0)}|}\, \beta_k^2 \right\}, \qquad (8)

where ℓ(·) is a loss function and ∇²ℓ(β^{(0)}) = ∂²ℓ(β^{(0)})/∂β∂β^T. As argued in Fan and Li (2001), there is no need to iterate until convergence as long as the initial estimator is reasonable. Also, the MLE from the full model without the penalty term can be regarded as a reasonable initial estimator. Using the local linear approximation of Zou and Li (2008) in Eq. (7), the sparse one-step estimator given in Eq. (8) becomes

    \beta^{(1)} = \arg\min_\beta \left\{ \frac{1}{2}(\beta - \beta^{(0)})^T \left[-\nabla^2 \ell(\beta^{(0)})\right](\beta - \beta^{(0)}) + n \sum_{k=1}^{d} P'_\lambda(|\beta_k^{(0)}|)\, |\beta_k| \right\}. \qquad (9)

As demonstrated in Zou and Li (2008), this one-step estimator is as efficient as the fully iterative estimator, provided that the initial estimator is good enough. For example, we let β^{(0)} be the maximum likelihood estimator without the penalty term.

3. Large sample theory

3.1. Penalized nonparametric estimator for functional coefficients

Let {(X_i, Z_i, y_i)} be a strictly stationary and strongly mixing sequence, and let f(·, β) be the density function of β^T Z, where β is an interior point of the compact set B. Assume δ is a small positive constant and define A_z = {Z : f(β^T Z, β) ≥ δ, β ∈ B, there exist a and b such that β^T Z ∈ [a, b]} as the domain of Z. Then, β^T Z is bounded and the density f(·, β) is bounded away from 0. Also, define the domain of the bandwidth h, H_n = {h : there exist C1 and C2 such that C1 n^{−1/5} < h < C2 n^{−1/5}}. For Z ∈ A_z, β ∈ B, and h ∈ H_n, define an n × p matrix penalized estimator as

    \hat{G}(\hat{\beta}) = \left[ \hat{g}(\hat{\beta}^T Z_1), \ldots, \hat{g}(\hat{\beta}^T Z_n) \right]^T = \left[ \hat{g}_{\cdot 1}, \ldots, \hat{g}_{\cdot p} \right],

where

    \hat{g}(\hat{\beta}^T Z) = \left[ \hat{g}_1(\hat{\beta}^T Z), \ldots, \hat{g}_p(\hat{\beta}^T Z) \right]^T \in R^p,

and

    \hat{g}_{\cdot k} = \left[ \hat{g}_k(\hat{\beta}^T Z_1), \ldots, \hat{g}_k(\hat{\beta}^T Z_n) \right]^T \in R^n.

Similarly, we define the true values G_0(β), g_0(β^T Z) and g_{0·k}, respectively. Without loss of generality, we assume that the first p0 functional coefficients are non-zero and the other p − p0 functional coefficients are zero, i.e. ∥g_{·k}∥ ≠ 0 and g_{·k} is not constant everywhere for 1 ≤ k ≤ p0, and ∥g_{·k}∥ = 0 for p0 < k ≤ p. Let α_n = max{P'_{λ_n}(∥g_{·k}∥) : 1 ≤ k ≤ p0}. Then, by minimizing the penalized local least squares Q(ĝ, β̂, h) in Eq. (2), one can obtain the penalized local least squares estimator ĝ(·) of g(·).

To study the asymptotic distribution of the penalized local least squares estimator, we impose some technical conditions as follows.

Assumption A.
A1. The vector functions g(·) have continuous second order derivatives with respect to the support of A_z.
A2. For any β ∈ B and Z ∈ A_z, the density function f(·, β) is continuous and there exists a small positive δ such that f(·, β) > δ.
A3. The kernel function K(·) is a bounded density with a bounded support region. Let µ2 = ∫ v² K(v) dv and ν0 = ∫ K²(v) dv.
A4. lim_{n→∞} inf_{θ→0+} P'_{λ_n}(θ)/λ_n > 0, n^{−1/10} λ_n → 0, h ∝ n^{−1/5} and ∥β̂ − β0∥ = O_p(1/\sqrt{n}).
A5. Define Ω(z, β) = E(X_i X_i^T | β^T Z_j = z). Assume that Ω(·) is nonsingular and has a bounded second order derivative on A_z.
A6. {(X_i, Z_i, y_i)} is a strictly stationary and strongly mixing sequence with mixing coefficient satisfying α(m) = O(ρ^m) for some 0 < ρ < 1.
A7. Assume that the conditional density f(z_i, z_s | z_j) is continuous and has a bounded second order derivative.
A8. Assume that Ω(z_i, z_s, z_j) = E(X_i X_i^T X_s X_s^T | β^T Z_i = z_i, β^T Z_s = z_s, β^T Z_j = z_j) is continuous and has a bounded second order derivative. Define Ω1(z_i, z_s, z_j) = ∂Ω(z_i, z_s, z_j)/∂z_i and Ω2(z_i, z_s, z_j) = ∂Ω(z_i, z_s, z_j)/∂z_s.

Remark 2. The conditions in A2 imply that the distances between two ranked values β^T Z_{(i)} are at most of order O_p(log n/n) (Janson, 1987). For any value Z ∈ A_z, we can find a closest value β^T Z_j to Λ = β^T Z such that |β^T Z_j − Λ| = O_p(log n/n). With the conditions in A1, ∥g(β^T Z_j) − g(Λ)∥ = O_p(log n/n), which is of smaller order than the nonparametric convergence rate n^{−2/5}. This implies that we only need to estimate ĝ(β^T Z_i) for i = 1, 2, ..., n rather than ĝ(Λ) for all values in the domain A_z. For the detailed arguments, we refer to the paper by Wang and Xia (2009). Assumption A3 is a common assumption in nonparametric estimation. The assumption ∥β̂ − β0∥ = O_p(1/\sqrt{n}) in A4 implies that the estimator β̂ has little effect on the estimation of ĝ(·) if the sample size n is large, since the convergence rate of the parametric estimator β̂ is faster than that of the nonparametric function estimator ĝ(·). The assumptions in A5–A8 are very standard and are used for the proofs under mixing conditions; see Cai et al. (2000b). In particular, the assumptions in A6 are common conditions for weakly dependent data. Most financial models satisfy these conditions, such as ARMA, ARCH and GARCH models; see Cai (2002).
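The one-step update in Eq. (9) of Section 2.3 pairs a local quadratic model of the loss with a weighted L1 penalty. Under the simplifying assumption (ours, not the paper's) that the negative Hessian −∇²ℓ(β^{(0)}) is replaced by its diagonal, each coordinate has a closed-form soft-thresholding solution; all names below are illustrative:

```python
import numpy as np

def scad_derivative(b, lam, a=3.7):
    """P'_lambda(|b|) from Eq. (5), elementwise."""
    b = np.abs(np.asarray(b, dtype=float))
    return np.where(b <= lam, lam,
                    np.where(b <= a * lam, (a * lam - b) / (a - 1), 0.0))

def one_step_lla(beta0, hess_diag, lam, n, a=3.7):
    """One-step sparse update in the spirit of Eq. (9), assuming a
    diagonal approximation H of the negative Hessian. Each coordinate
    then solves (1/2) H_kk (b - b0_k)^2 + n * P'(|b0_k|) * |b|, whose
    minimizer is soft-thresholding with threshold n * P'(|b0_k|) / H_kk."""
    beta0 = np.asarray(beta0, dtype=float)
    thresh = n * scad_derivative(beta0, lam, a) / np.asarray(hess_diag, dtype=float)
    return np.sign(beta0) * np.maximum(np.abs(beta0) - thresh, 0.0)
```

Note how the SCAD weighting does the work: initial coefficients larger than aλ get P'_λ = 0 and are not shrunk at all, while small ones are thresholded exactly to zero, which is the sparsity mechanism behind the oracle property.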

Define the nonparametric estimator ĝ(z, β̂) ≡ [ĝ_a(z, β̂), ĝ_b(z, β̂)]^T, where z = β̂^T Z, ĝ_a(z, β̂) = [ĝ_1(z, β̂), ..., ĝ_{p0}(z, β̂)]^T ∈ R^{p0} and ĝ_b(z, β̂) = [ĝ_{p0+1}(z, β̂), ..., ĝ_p(z, β̂)]^T ∈ R^{p−p0}. Analogously, we denote the true value g_0(z, β0) ≡ [g_{0a}(z, β0), g_{0b}(z, β0)]^T. The following theorem presents the asymptotic properties of the penalized nonparametric estimator ĝ(z, β̂), including the oracle property, sparsity and asymptotic normality.

Theorem 2 (Oracle Property). Let (X_i, Z_i) be a strongly mixing and strictly stationary sequence. Under Assumptions A1–A8, if ∥β̂ − β0∥ = O_p(n^{−1/2}), lim_{n→∞} inf_{θ→0+} P'_{λ_n}(θ)/λ_n > 0, h ∝ n^{−1/5} and n^{−1/10} λ_n → 0 as n → ∞, then
(a) Sparsity: sup_{Z∈A_z} ∥ĝ_k(z, β̂)∥ = 0 for all p0 < k ≤ p.
(b) Asymptotic normality:

    \sqrt{nh}\left( \hat{g}_a(z, \hat{\beta}) - g_{0a}(z, \beta_0) - h^2 B(z, \beta_0) + o_p(h^2) \right) \sim N(0, V(z, \beta_0)),

where V(z, β0) = ν0 M^{−1}(z, β0) σ², and

    B(z, \beta_0) = \mu_2 M^{-1}(z, \beta_0) \dot{M}(z, \beta_0) \dot{g}(z, \beta_0) + \frac{1}{2} \mu_2 \ddot{g}(z, \beta_0)

with M(z, β0) = f(z, β0)Ω(z, β0), Ṁ(z, β0) = ∂M(z, β0)/∂z, ġ(z, β0) = ∂g(z, β0)/∂z and g̈(z, β0) = ∂ġ(z, β0)/∂z.

Remark 3. The unpenalized estimator can be written as

    \hat{g}_u(z, \beta) = \left[ \sum_{i=1}^{n} X_{ia} X_{ia}^T K_h(\hat{\beta}^T Z_i - z) \right]^{-1} \left[ \sum_{i=1}^{n} X_{ia} Y_i K_h(\hat{\beta}^T Z_i - z) \right].

Similar to the argument in the paper by Wang and Xia (2009), under regularity conditions, one can show that sup_{Z∈A_z} ∥ĝ_a(z, β̂) − ĝ_u(z, β̂)∥ = o_p(n^{−2/5}). This suggests that the difference between the penalized estimator ĝ_a(z, β̂) and the unpenalized estimator ĝ_u(z, β̂) is of smaller order than the optimal nonparametric convergence rate n^{−2/5}. Thus, the penalized estimator ĝ_a(z, β̂) enjoys the same large sample properties as the unpenalized estimator ĝ_u(z, β̂) as the sample size n goes to infinity.

Sparsity is an important statistical property in high-dimensional statistics. By assuming that only a small subset of the variables is important for the dependent variable, it reduces complexity and thereby improves the interpretability and predictability of the model. The sparsity property in Theorem 2 demonstrates that our penalized model estimates the zero components of the true parameter vector exactly as zero with probability one as the sample size goes to infinity.

3.2. Penalized estimator for parametric coefficients

To perform variable selection for variables with parametric coefficients, we minimize the penalized least squares in Eq. (3). We assume that the first d1 coefficients of β are nonzero and all the remaining parameters are zero. That is, β0 = (β10^T, β20^T)^T, where all elements of β10 with dimension d1 are nonzero and the (d − d1)-dimensional coefficient vector β20 = 0. Finally, define V_n = Σ_{i=1}^{n} (Z_i − E(Z_i|β0^T Z_i)) ġ^T(β0^T Z_i) X_i ε_i, where the vector ġ(·) is the first derivative of the function vector g(·), and ε_i is independent and identically distributed (i.i.d.) with mean 0 and standard deviation σ. Let Ṽ0 = (1/n)Var(V_n)/σ², and define e to be an asymptotically standard normal random d-dimensional vector such that V_n = n^{1/2} σ Ṽ0^{1/2} e. Let V_{1n} = Σ_{i=1}^{n} (Z_{1i} − E(Z_{1i}|β10^T Z_{1i})) ġ^T(β10^T Z_{1i}) X_i ε_{1i}, where ε_{1i} is the same as ε_i since β20 = 0. Similarly, we define Ṽ10 = (1/n)Var(V_{1n})/σ² and let e1 be an asymptotically standard normal random d1-dimensional vector such that V_{1n} = n^{1/2} σ Ṽ10^{1/2} e1.

To study the asymptotic distribution of the penalized least squares estimator β̂, we impose some technical conditions as follows.

Assumption B.
B1. The vector functions g(·) have continuous second order derivatives with respect to the support of A_z.
B2. For any β ∈ B and Z ∈ A_z, the density function f(·, β) is continuous and there exists a small positive δ such that f(·, β) > δ.
B3. The kernel function K(·) is a bounded density with a bounded support region. Let µ2 = ∫ v² K(v) dv and ν0 = ∫ K²(v) dv.
B4. lim_{n→∞} inf_{θ→0+} P'_{ζ_n}(θ)/ζ_n > 0, ζ_n → 0, \sqrt{n} ζ_n → ∞ and h ∝ n^{−1/5}.
B5. Same as Assumption A6.
B6. E(ε_i|X_i, Z_i) = 0, E(ε_i²|X_i, Z_i) = σ², E|X_i|^m < ∞ and E|y_i|^m < ∞ for all m > 0.

Remark 4. The assumptions in B4 underlie the oracle property in Theorem 4. An alternative condition for the bandwidth in Ichimura (1993) is nh^8 → 0. However, the condition nh^8 → 0 is still satisfied under our condition h ∝ n^{−1/5} in B4. For Assumption B6, it is not hard to extend to the heteroscedastic case, E(ε_i²|X_i, Z_i) = σ²(X_i, Z_i); this requires some higher moment conditions on X_i and y_i so that the Chebyshev inequality can be applied.

Now, we have the asymptotic properties for the penalized least squares estimator β̂.

Theorem 3. Let {(X_i, Z_i, y_i)} be a strictly stationary and strongly mixing sequence, a_n = max{Ψ'_{ζ_n}(β_k) : β_k ≠ 0}, and β̂ = argmin_{β∈B} Q(β, ĝ). Under Assumptions B1–B6, and if max{Ψ''_{ζ_n}(β_k) : β_k ≠ 0} → 0, then the order of ∥β̂ − β0∥ is O_p(n^{−1/2} + a_n). If the penalty function is the SCAD function, a_n = 0 as the sample size n → ∞, and ∥β̂ − β0∥ = O_p(n^{−1/2}).

Theorem 4 (Oracle Property). Let {(X_i, Z_i, y_i)} be a strictly stationary and strongly mixing sequence. Under Assumptions B1–B6, assuming ζ_n → 0 and \sqrt{n} ζ_n → ∞ as n → ∞, then
(a) Sparsity: β̂2 = 0.
(b) Asymptotic normality:

    \sqrt{n}\left( \hat{\beta}_1 - \beta_{10} \right) \to N\!\left( 0,\ \tilde{V}_{10}^{-1} V_{10} \tilde{V}_{10}^{-1} \right),

where Ṽ10 is defined earlier and V10 = Γ(0) + 2 Σ_{ℓ=1}^{∞} Γ(ℓ) with Γ(ℓ) = Cov(Γ_i, Γ_{i−ℓ}) and Γ_i = (Z_{1i} − E(Z_{1i}|β10^T Z_{1i})) ġ^T(β10^T Z_{1i}) X_i ε_{1i}.

When the random variables {Γ_i}_{i=1}^{∞} are either i.i.d. or a martingale difference sequence, V10 becomes V10 = Γ(0) = Var(Γ_i). Otherwise, the autocovariance function Γ(ℓ) may be nonzero at least for some lag orders ℓ > 0 due to serial correlation. Theorem 4 shows that our variable selection procedures of minimizing the penalized least squares enjoy the oracle property.

3.3. Choosing bandwidth and tuning parameters

To do the nonparametric estimation and variable selection simultaneously, we should choose suitable regularization parameters: the bandwidth h for the nonparametric estimator and the λ's for the penalty terms. For simplicity, we consider global bandwidth selection rather than pointwise selection. Recent literature reveals that the
Recent literature reveals that the BIC-type selector identifies the true model consistently and the resulting estimator possesses the oracle property. In contrast, the AIC-type selector tends to be less efficient and to overfit in the final model; see the papers by Wang et al. (2007) and Zhang et al. (2010). This motivates us to select the bandwidth h and the tuning parameters λ's simultaneously with a BIC-type criterion. We define our BIC criterion as

BIC(h, λ) = log(SSE(h, λ)) + df(h, λ) log(n)/n,

where SSE(h, λ) is the sum of squared errors obtained from the penalized least squares with parameters (h, λ), and df(h, λ) is the number of nonzero coefficients of β̂ given h and λ. This BIC criterion is reasonable since it balances the trade-off between the variance and the number of nonzero coefficients through the bandwidth h and the tuning parameters λ's. Further, it enjoys the property of consistency, which indicates that it selects the correct model with probability one as the sample size goes to infinity (Zhang et al., 2010). However, it is still computationally expensive to choose the d-dimensional tuning parameter λ = (λ1, λ2, ..., λd)^T. By adopting the idea of Fan and Li (2004) to reduce the dimension of λ, we let λk = λ0 σ̂(β̂k^(0)), where σ̂(β̂k^(0)) is the standard deviation of the unpenalized estimator β̂k^(0). The theoretical properties of BIC(h, λ) and of the dimension reduction technique with λk = λ0 σ̂(β̂k^(0)) need further study and can be regarded as future research topics. The reader is referred to the papers by Cai et al. (2000a) and Fan and Li (2001) for more on choosing the bandwidth in nonparametric estimation and the tuning parameters in variable selection.

Table 1
Simulation results for the covariates with functional coefficients.

                 σ = 4    σ = 2
n = 200
Shrinkage rate   79.4%    93.4%
Keeping rate     92.0%    99.8%
n = 400
Shrinkage rate   94.5%    100%
Keeping rate     98.6%    100%
n = 1000
Shrinkage rate   100%     100%
Keeping rate     100%     100%

Table 2
Simulation results for the local variables with parametric coefficients.

                 σ = 15   σ = 7.5
n = 200
Shrinkage rate   83.2%    91.1%
Keeping rate     93.4%    96.9%
n = 400
Shrinkage rate   92.3%    100%
Keeping rate     97.5%    100%
n = 1000
Shrinkage rate   100%     100%
Keeping rate     100%     100%

4. Monte Carlo simulations

Example 1. In this example, we study the finite-sample performance of the variable selection for covariates with functional coefficients. In our simulations, the optimal bandwidth and the tuning parameter λn are chosen by the BIC criterion in Section 3.3. The Epanechnikov kernel K(x) = 0.75(1 − x²) for |x| ≤ 1 is used, and the value of a in SCAD is set to 3.7 as suggested in Fan and Li (2001).

In this example, we assume that the data are generated by

yi = (Z1i + Z2i) + (Z1i + Z2i)² X1i + σ εi,  1 ≤ i ≤ n,

and the working model is

yi = g0(β^T Zi) + Σ_{k=1}^6 gk(β^T Zi) Xki + ei,

where εi is generated from the standard normal distribution and Z = (Z1, Z2)^T with Z1 = Φ(Z1*), Z2 = Φ(Z2*), and Φ(·) being the cumulative standard normal distribution function. The eight-dimensional vector (Z1*, Z2*, X1, ..., X6)^T follows the vector autoregressive process

(Zi*^T, Xi^T)^T = A (Z*_{i−1}^T, X_{i−1}^T)^T + ξi,

where Z* = (Z1*, Z2*)^T, X = (X1, X2, ..., X6)^T, and A is an 8 × 8 matrix with the diagonal elements being 0.15 and all others being 0.05. The initial value of (Z*_1^T, X_1^T)^T and each component of the random vector ξi are generated from the i.i.d. standard normal distribution. Note that for this setup the data generated by the above autoregressive process are weakly dependent. We consider three sample sizes, n = 200, n = 400 and n = 1000, and two standard deviations, σ = 2 and σ = 4. Sample sizes n = 200, 400, 1000 correspond to about one year, two years and four years of trading days, respectively. For each setting, we replicate 1000 times. The "Shrinkage rate" and "Keeping rate" are reported in Table 1, in which "Shrinkage rate" represents the percentage of replications in which the five zero functional coefficients correctly shrink to 0 and "Keeping rate" stands for the percentage in which the two nonzero functional coefficients are correctly kept nonzero. Clearly, one can see from Table 1 that both rates improve with a larger sample size and smaller noise. Meanwhile, the proposed estimator performs as well as the oracle estimator when the sample size is n = 1000, as well as in the case of n = 400 and σ = 2. This simulation shows that the proposed variable selection procedures perform fairly well in finite samples.

Example 2. To examine the performance of the variable selection for local variables with parametric coefficients, similar to Tibshirani (1996) and Fan and Li (2001), our data generating process is given below:

yi = ui + ui² Xi + σ εi,

where ui = Zi^T β, β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T, and εi is generated from the standard normal distribution. Furthermore, the nine-dimensional vector (Zi^T, Xi)^T is generated from the vector autoregressive process

(Zi^T, Xi)^T = A* (Z_{i−1}^T, X_{i−1})^T + ei,

where A* is a 9 × 9 matrix with the diagonal elements being 0.15 and all others being 0.05. The initial value of (Z1^T, X1)^T and each element of the random vector ei are generated from the i.i.d. standard normal distribution. Similar to the previous example, we consider three sample sizes, n = 200, 400 and 1000, and for each setting we replicate 1000 times. We also consider two values for σ, namely σ = 7.5 and σ = 15. Table 2 displays the simulation results of the SCAD variable selection for the local variables with parametric coefficients. Similar to the conclusions from Table 1, it can be seen from Table 2 that the "Shrinkage rate" for irrelevant local variables and the "Keeping rate" for relevant local variables perform better with a larger sample size and smaller noise.
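To make the ingredients of these simulations concrete, the SCAD penalty with a = 3.7 and the BIC-type score of Section 3.3 can be written down directly. The sketch below is illustrative only: the function names are ours, and the grid search over (h, λ) that would minimize `bic_score` is omitted.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    # SCAD penalty P_lam(theta) of Fan and Li (2001); a = 3.7 as in the
    # simulations above.
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

def scad_derivative(theta, lam, a=3.7):
    # P'_lam(theta): equals lam near zero, decays linearly, and vanishes
    # beyond a*lam, which removes the bias for large coefficients.
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

def bic_score(sse, df, n):
    # BIC(h, lam) = log(SSE(h, lam)) + df(h, lam) * log(n) / n, where df is
    # the number of nonzero estimated coefficients given (h, lam).
    return np.log(sse) + df * np.log(n) / n
```

The penalty is continuous at λ and at aλ, and its derivative is zero beyond aλ; this flat region is the source of the oracle behavior discussed above, since large coefficients are left unpenalized while small ones are shrunk exactly to zero.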
Specifically, it performs as well as the oracle estimator for the cases where n = 400 and σ = 7.5, as well as for sample size n = 1000. The Monte Carlo simulation results indicate that our variable selection for local variables has good finite-sample properties.

Example 3. To investigate the performance of variable selection for covariates and local variables simultaneously, we add to Example 1 one more step of variable selection for local variables. All the settings are the same as in Example 1, except that the true model is defined as

yi = g0(Z1i) + g1(Z1i) X1i + σ εi,  1 ≤ i ≤ n,

where g0(u) = u and g1(u) = u². We assume that the index coefficient depends only on the local variable Z1 in this true model. The local variable Z2 and the five covariates (X2, ..., X6) are not included in the true model but are estimated in the working model; see Example 1 for details. Two-step selection procedures are employed in this simulation. The first step selects among the six covariates (X1, ..., X6) with functional coefficients as well as the constant term. Then, we perform variable selection for the local variables Z1 and Z2 with parametric coefficients. The simulation results for these two-step selection procedures are tabulated in Table 3. Table 3 shows that, with a larger sample size and smaller noise, the shrinkage rates for both nonsignificant covariates and nonsignificant local variables become larger. This indicates that our two-step procedures perform quite well, so that the proposed methods are efficient.

Table 3
Simulation results for the two-step selection procedures.

                                                    σ = 2    σ = 1
n = 200
Shrinkage rate for nonsignificant covariates        79.1%    89.7%
Keeping rate for significant covariates             92.0%    96.3%
Shrinkage rate for nonsignificant local variables   77.0%    82.0%
n = 400
Shrinkage rate for nonsignificant covariates        81.8%    91.2%
Keeping rate for significant covariates             95.7%    99.3%
Shrinkage rate for nonsignificant local variables   82.5%    93.8%
n = 1000
Shrinkage rate for nonsignificant covariates        88.9%    95.7%
Keeping rate for significant covariates             100%     100%
Shrinkage rate for nonsignificant local variables   93.6%    100%

5. Empirical example

In the previous section, we conducted Monte Carlo simulation studies to illustrate the effectiveness of the proposed estimation methods. In this section, to demonstrate the practical usefulness of the proposed model and its estimation methods, we apply these methodologies to study the predictability of asset returns. Our data consist of daily, weekly and monthly returns on three indexes: the Dow Jones Industrial Average, the NASDAQ Composite and the S&P 500 Index. The sample of these three indexes spans the 30 years from May 1, 1984 through April 30, 2014. It ends on 30 April because most listed corporations post their annual reports at the end of April. A sample of up to 30 years is considered so that there are enough data for the nonparametric estimation in the model. All the data are downloaded from the Wind Information database.¹

¹ The Web site for Wind Information is http://www.wind.com.cn/En/Default.aspx.

Table 4 shows the summary statistics of returns at the one-day, one-week and one-month horizons. All horizons show negative skewness, which indicates that a relatively long lower tail exists. At the one-day and one-week horizons, as expected, the returns exhibit high sample kurtosis, which demonstrates that more sample points lie far from the sample mean and the tails are heavier. The Box–Pierce tests show that the autocorrelations of the monthly returns of the three indexes are not significantly different from zero, whereas those at the daily and weekly horizons are significantly different from zero. This phenomenon suggests that most financial variables are not i.i.d.; to be precise, they are weakly dependent.

To explore the performance of functional index coefficient autoregressive models, we assume that our working model is as follows:

rt = g0(zt) + Σ_{j=1}^p gj(zt) r_{t−j} + εt,
εt = σt et,  et ~ skewed-t(λ, ν),
σt² = ω + α ε²_{t−1} + γ ε²_{t−1} I_{t−1} + ρ σ²_{t−1},

where zt = β1 r_{t−1} + β2 r_{t−2} + β3 r_{t−3} and we assume β1² + β2² + β3² = 1 in order to satisfy the identification condition. The standardized residual et is skewed-t distributed with skewness parameter λ and degrees of freedom ν, and γ captures the leverage effect. The indicator function I_{t−1} takes the value 1 for ε_{t−1} ≤ 0 and 0 otherwise. This model can be viewed as an extension of the model by Chen and Tsay (1993). We use the two-step variable selection procedures to select variables and to estimate the unknown coefficient functions simultaneously. First, we select covariates based on the penalized local least squares, and then we perform variable selection for the local variables based on the penalized global least squares. After the two-step variable selection procedures are applied to the above model, the estimated coefficients of the local variables and the norms of the covariates are reported in Table 5. Note that zt may include other financial or state economy variables as in Cai et al. (2014a).

In the columns of local variables, both the one-day lagged return and the two-day lagged return affect the daily returns of all three indexes, while the three-day lagged return has no effect on the daily returns of either NASDAQ or the S&P 500. Only the one-week lagged return contributes to the weekly returns of NASDAQ and the S&P 500; the two-week lagged return makes no contribution. For DOW, the one-week lagged return and the three-week lagged return behave similarly for the weekly return. At the monthly horizon, the one-month lagged return has a significant effect on all three indexes, the two-month lagged return on NASDAQ, and the three-month lagged return on both DOW and the S&P 500.

In the columns of covariates, only the one-day lagged covariate and the two-day lagged covariate contribute to the daily returns of the three indexes. Meanwhile, at the weekly horizon, only the one-week lagged covariate and the three-week lagged covariate are important factors for the weekly returns of the three indexes. Further, at the monthly horizon, the one-month lagged covariate contributes to the monthly returns of the three indexes, the two-month lagged covariate to NASDAQ, and the three-month lagged covariate to DOW and the S&P 500.

The coefficients of the GJR-GARCH model for the error terms are tabulated in Table 6. The significance of the skewness λ and the degrees of freedom ν at all horizons indicates non-normality of the standardized residuals. Interestingly, leverage effects exist at both the one-day and one-week horizons, but are not observed at the one-month horizon. Meanwhile, we cannot find any heteroscedasticity in terms of the GJR-GARCH model at the one-month horizon.

6. Conclusion

Variable selection technology and its algorithms are well developed for models with i.i.d. data and for many fully parametric models.
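Before turning to the conclusions, the error-variance recursion of the Section 5 working model can be made explicit. The sketch below only filters a given residual series through the GJR-GARCH recursion σt² = ω + α ε²_{t−1} + γ ε²_{t−1} I_{t−1} + ρ σ²_{t−1}; the function name and the start-up at the implied unconditional variance are our illustrative choices, and the skewed-t maximum likelihood estimation of (ω, α, γ, ρ) used for Table 6 is not shown.

```python
import numpy as np

def gjr_garch_filter(eps, omega, alpha, gamma, rho):
    # sigma_t^2 = omega + alpha*eps_{t-1}^2
    #           + gamma*eps_{t-1}^2 * I(eps_{t-1} <= 0) + rho*sigma_{t-1}^2,
    # where the indicator term produces the leverage effect captured by gamma.
    eps = np.asarray(eps, dtype=float)
    sig2 = np.empty(eps.size)
    # Illustrative start-up: the unconditional variance implied by the
    # parameters when the negative-shock indicator is active half the time.
    sig2[0] = omega / max(1e-12, 1.0 - alpha - 0.5 * gamma - rho)
    for t in range(1, eps.size):
        leverage = gamma * eps[t - 1] ** 2 if eps[t - 1] <= 0.0 else 0.0
        sig2[t] = omega + alpha * eps[t - 1] ** 2 + leverage + rho * sig2[t - 1]
    return sig2
```

With γ > 0, a negative shock raises next period's variance by more than a positive shock of the same size, which is the leverage effect found at the daily and weekly horizons in Table 6.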
Variable selection in both semi-parametric and nonparametric models has become popular in recent years. In contrast to the i.i.d. setting in those papers on variable selection, we consider variable selection in functional index coefficient models in a strong mixing context, so that most weakly dependent financial time series can be analyzed by our procedures under the general conditions considered in this paper. Our variable selection procedures select both covariates with functional coefficients and local variables with parametric coefficients in two steps. Theoretical properties such as consistency, sparsity, and the oracle property of these two-step estimators are derived. Monte Carlo simulations show that our two-step procedures perform fairly well. To address the issue of stock return predictability, an example of functional index coefficient autoregressive models is studied extensively; it can be viewed as an extension of the model in Chen and Tsay (1993).

In financial economics, many regressions may suffer from spurious regression due to the presence of highly persistent regressors. Persistence can be found in many financial variables, such as book-to-market ratios, the dividend–price ratio, the earnings–price ratio, the short-term Treasury bill rate and the yield spread (Campbell and Yogo, 2006; Phillips and Lee, 2013). The theory for regression models with persistent variables is very different from that for models with stationary variables; see, for example, Cai and Wang (2014) and Cai et al. (forthcoming). There is little literature on variable selection in models with persistent regressors. For future research, it would be interesting to consider variable selection for linear and nonlinear time series prediction models with persistent and/or nonstationary variables.

Table 4
Summary statistics of returns for different horizons.

          Sample size   Mean     Median   StdDev   Skewness   Kurtosis   Min        Max       ρ1        Box–Pierce test
One day horizon
DOW       7572          0.0414   0.0520   1.1229   −1.0822    29.6289    −22.6100   11.0800   −0.0349   0.0000
NASDAQ    7572          0.0471   0.1064   1.4084   −0.0250    8.2972     −11.3500   14.1700   0.0150    0.0000
S&P 500   7572          0.0392   0.0588   1.1522   −0.8293    21.4268    −20.4700   11.5800   −0.0408   0.0000
One week horizon
DOW       1566          0.1958   0.3499   2.2885   −0.6595    5.3837     −18.1500   11.2900   −0.0682   0.0001
NASDAQ    1566          0.2249   0.3348   2.9844   −0.7072    7.3782     −25.3000   18.9800   0.0209    0.0304
S&P 500   1566          0.1843   0.3133   2.3041   −0.5709    5.2818     −18.2000   12.0300   −0.0703   0.0004
One month horizon
DOW       360           0.8371   1.1460   4.3900   −0.8078    2.8407     −23.2200   13.8200   0.0181    0.9318
NASDAQ    360           0.9957   1.7370   6.4387   −0.5703    1.9663     −27.2300   21.9800   0.1001    0.5826
S&P 500   360           0.7866   1.1430   4.4202   −0.7755    2.3396     −21.7600   13.1800   0.0517    0.9723

Table 5
Coefficients for local variables and covariates.

          Local variables                Covariates^a
          r_{t−1}  r_{t−2}  r_{t−3}      1         r_{t−1}   r_{t−2}   r_{t−3}   r_{t−4}  r_{t−5}  r_{t−6}
One day horizon
DOW       0.7087   0.6385   −0.3000      3.1532    2.8250    2.6166    0         0        0        0
NASDAQ    0.6609   −0.7505  0            5.0515    3.2321    3.7374    0         0        0        0
S&P 500   0.4183   −0.9083  0            4.4862    1.8567    3.9876    0         0        0        0
One week horizon
DOW       0.7716   0        0.6360       16.4212   8.0355    0         6.6187    0        0        0
NASDAQ    1        0        0            30.9023   14.9169   0         4.3741    0        0        0
S&P 500   1        0        0            15.5780   9.5644    0         2.1025    0        0        0
One month horizon
DOW       0.8638   0        −0.5034      70.1698   59.3783   0         34.5686   0        0        0
NASDAQ    0.8488   0.5286   0            189.735   68.8333   42.7579   0         0        0        0
S&P 500   0.8697   0        0.4934       98.7581   48.1937   0         27.6172   0        0        0

^a We calculate the norm of the functional coefficients for the covariates.

Table 6
Estimation results of the GJR-GARCH model for the error terms. The skewness λ and the degrees of freedom ν are parameters of the skewed-t distribution of the standardized residuals et. The four parameters ω, α, γ and ρ are from the GJR-GARCH model. The corresponding t-ratios based on robust standard errors are reported in parentheses.

          λ          ν          ω         α         γ         ρ
One day horizon
DOW       0.8594*    8.4413*    0.0134*   0.0000    0.1786*   0.9009*
          (32.02)    (5.02)     (3.20)    (0.00)    (5.53)    (49.04)
NASDAQ    0.8592*    12.8651*   0.0263*   0.0000    0.1524*   0.9068*
          (31.02)    (3.30)     (2.44)    (0.02)    (3.83)    (38.27)
S&P 500   0.8534*    8.2578*    0.0201*   0.0000    0.1882*   0.8935*
          (28.55)    (5.45)     (3.14)    (0.48)    (5.02)    (44.84)
One week horizon
DOW       0.8670*    9.1538*    0.2994    0.0131    0.1913*   0.8166*
          (29.45)    (4.48)     (1.79)    (0.59)    (2.96)    (10.40)
NASDAQ    0.8886*    7.6217*    0.2626*   0.0657*   0.1351*   0.8262*
          (24.22)    (5.47)     (1.97)    (0.03)    (0.06)    (0.06)
S&P 500   0.8709*    10.4121*   0.2680    0.0045    0.2322*   0.8119*
          (30.73)    (3.98)     (1.75)    (0.24)    (3.19)    (10.55)
One month horizon
DOW       1.0221*    2.0107*    0.0000    0.4868    1.0000    0.0013
          (84.93)    (177.50)   (0.08)    (0.52)    (0.59)    (0.20)
NASDAQ    1.0304*    2.0100*    0.0036    0.0140    1.0000    0.2310
          (69.11)    (234.73)   (0.70)    (0.09)    (0.51)    (0.34)
S&P 500   1.0056*    2.0344*    0.0918    0.1551    0.4581    0.6138
          (53.03)    (37.72)    (0.61)    (0.16)    (0.32)    (0.72)

* Denotes significance at the 5% level.

Appendix. Mathematical proofs

In this Appendix, we briefly present the derivations of the main results given in the previous sections. Before embarking on the proofs, we define some notation and list the lemmas that will be used throughout this appendix. First, let C denote a finite positive constant and Rm an ignorable small-order term; both may differ from one appearance to another. Now, we present Lemmas 1 and 2.

Lemma 1. Let {Xi, Zi, yi} be a strong mixing and strictly stationary
sequence and suppose that Assumptions A1–A8 hold. Assume that h ∝ n^{−1/5}, n^{−1/10} αn → 0 and ∥β̂ − β0∥ = Op(1/√n). Then we have

n^{−1} Σ_{i=1}^n ∥ĝ(β̂^T Zi) − g0(β0^T Zi)∥² = Op(n^{−4/5}).

Proof. By the triangle inequality,

n^{−1} Σ_{i=1}^n ∥ĝ(β̂^T Zi) − g0(β0^T Zi)∥² ≤ n^{−1} Σ_{i=1}^n ∥ĝ(β̂^T Zi) − g0(β̂^T Zi)∥² + n^{−1} Σ_{i=1}^n ∥g0(β̂^T Zi) − g0(β0^T Zi)∥².

The second term on the right hand side satisfies

n^{−1} Σ_{i=1}^n ∥g0(β̂^T Zi) − g0(β0^T Zi)∥²
  = n^{−1} Σ_{i=1}^n ∥ġ0(β0^T Zi)(β̂ − β0)^T Zi + op(n^{−1/2})∥²   (by Taylor expansion)
  ≤ n^{−1} Σ_{i=1}^n C (β̂ − β0)^T Zi Zi^T (β̂ − β0) + op(n^{−1})   (by Assumption A1)
  = C (β̂ − β0)^T E(Zi Zi^T)(β̂ − β0) + op(n^{−1})
  = Op(n^{−1}),

where C is the maximum value of ∥ġ0(β0^T Zi)∥², so the second term is of order Op(n^{−1}). Now, it suffices to show that n^{−1} Σ_{i=1}^n ∥ĝ(β̂^T Zi) − g0(β̂^T Zi)∥² = Op(n^{−4/5}). Following the proof in Wang and Xia (2009), we let u = (u_{ik}) ∈ R^{n×p} be an arbitrary n × p matrix with rows u_{i·} and columns u_{·k}, so that u = (u_{1·}, u_{2·}, ..., u_{n·})^T = (u_{·1}, u_{·2}, ..., u_{·p}). Set ∥u∥² = Σ_{i,k} u²_{ik} to be the L2-norm of an arbitrary matrix u = (u_{ik}). For any small ε > 0, if we can show that there is a large constant C such that

P{ inf_{n^{−1}∥u∥² = C} Q(G0 + (nh)^{−1/2} u, β̂) > Q(G0, β̂) } > 1 − ε,

then the proof is finished. To this end, define

D ≡ n^{−1} h {Q(G0 + (nh)^{−1/2} u, β̂) − Q(G0, β̂)}
  = n^{−1} h Σ_{j=1}^n Σ_{i=1}^n (yi − g0^T(β̂^T Zj) Xi − (nh)^{−1/2} u_{j·}^T Xi)² Kh(β̂^T Zi − β̂^T Zj)
    − n^{−1} h Σ_{j=1}^n Σ_{i=1}^n (yi − g0^T(β̂^T Zj) Xi)² Kh(β̂^T Zi − β̂^T Zj)
    + h Σ_{k=1}^p [Pλn(∥g_{0·k} + (nh)^{−1/2} u_{·k}∥) − Pλn(∥g_{0·k}∥)]
  ≥ n^{−1} Σ_{j=1}^n [u_{j·}^T Σ̂(β̂^T Zj) u_{j·} − 2 u_{j·}^T êj]
    + h Σ_{k=1}^{p0} [Pλn(∥g_{0·k} + (nh)^{−1/2} u_{·k}∥) − Pλn(∥g_{0·k}∥)],

where Σ̂(β̂^T Zj) = n^{−1} Σ_{i=1}^n Xi Xi^T Kh(β̂^T Zi − β̂^T Zj) and

êj = n^{−1/2} h^{1/2} Σ_{i=1}^n [Xi Xi^T (g0(β0^T Zi) − g0(β̂^T Zi)) + Xi Xi^T (g0(β̂^T Zi) − g0(β̂^T Zj)) + Xi εi] Kh(β̂^T Zi − β̂^T Zj).

Let λ̂min_j be the smallest eigenvalue of Σ̂(β̂^T Zj), λ̂min = min{λ̂min_j, j = 1, ..., n}, and ê = (ê1, ..., ên)^T ∈ R^{n×p}. Then

D ≥ n^{−1} Σ_{j=1}^n (∥u_{j·}∥² λ̂min_j − 2 ∥u_{j·}∥ ∥êj∥) − n^{−1/2} h^{1/2} Σ_{k=1}^{p0} P′λn(∥g_{0·k}∥) ∥u_{·k}∥,

where the first term on the right hand side follows from the Cauchy–Schwarz inequality and the second term follows from a Taylor expansion and the triangle inequality. Therefore,

D ≥ λ̂min n^{−1} Σ_{j=1}^n ∥u_{j·}∥² − 2 (n^{−1}∥u∥²)^{1/2} (n^{−1}∥ê∥²)^{1/2} − n^{−1/2} h^{1/2} αn Σ_{k=1}^{p0} ∥u_{·k}∥
  ≥ λ̂min n^{−1}∥u∥² − 2 (n^{−1}∥u∥²)^{1/2} (n^{−1}∥ê∥²)^{1/2} − h^{1/2} αn √p0 (n^{−1} Σ_{k=1}^{p0} ∥u_{·k}∥²)^{1/2}
  = λ̂min C − 2 √C (n^{−1}∥ê∥²)^{1/2} − h^{1/2} αn √(p0 C).

As we will show below,

n^{−1}∥ê∥² = Op(1)  and  λ̂min →P λmin_0  as n → ∞,

where λmin_0 = inf_{z∈[0,1]} λmin(f(β̂^T Z) Ω(β̂^T Z)) and λmin(·) denotes the minimal eigenvalue of an arbitrary positive definite matrix. By Assumptions A2 and A4, since λmin_0 > 0 and h^{1/2} αn → 0, we can show that D > 0 for a sufficiently large C. Then this proof is complete.

To show n^{−1}∥ê∥² = Op(1), it is easy to see that n^{−1}∥ê∥² →P E∥êj∥² and

E∥êj∥² ≤ n^{−1} h E∥Σ_{i=1}^n [Xi Xi^T (g0(β̂^T Zi) − g0(β̂^T Zj))] Kh(β̂^T Zi − β̂^T Zj)∥²
        + n^{−1} h E∥Σ_{i=1}^n [Xi εi Kh(β̂^T Zi − β̂^T Zj)]∥²
        + n^{−1} h E∥Σ_{i=1}^n [Xi Xi^T (g0(β0^T Zi) − g0(β̂^T Zi))] Kh(β̂^T Zi − β̂^T Zj)∥²
      ≡ A + B + D̃,

where A denotes the first term, B the second and D̃ the last. We introduce the notation zi = β^T Zi, zs = β^T Zs and zj = β^T Zj. Then

A = n^{−1} h E Σ_{i≠s≠j} [(g0(β̂^T Zi) − g0(β̂^T Zj))^T Xi Xi^T Xs Xs^T (g0(β̂^T Zs) − g0(β̂^T Zj)) Kh(β̂^T Zi − β̂^T Zj) Kh(β̂^T Zs − β̂^T Zj)] + n^{−1} h E Σ_{(i=s)≠j} {···}
  ≡ n^{−1} h E Σ_{i≠s≠j} [(g0(zi) − g0(zj))^T Xi Xi^T Xs Xs^T (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)] + n^{−1} h E Σ_{(i=s)≠j} {···}
  ≡ A1 + A2,
where A1 denotes the first term and A2 the second. It is easy to show that

A1 ≡ nh E{(g0(zi) − g0(zj))^T Xi Xi^T Xs Xs^T (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)} + Rm
   = nh E{(g0(zi) − g0(zj))^T Ω(zi, zs, zj) (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)} + Rm
   = nh ∫ E{(g0(zi) − g0(zj))^T Ω(zi, zs, zj) (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj) | zj} f(zj) dzj + Rm
   ≡ A11 + Rm,

where Rm is an ignorable small-order term, which may differ at different appearances. Let zi = zj + wh and zs = zj + vh. Then

A11 = nh ∫ [∫∫ (ġ0(zj) wh + ½ C1 w²h²)^T Ω(zj + wh, zj + vh, zj) (ġ0(zj) vh + ½ C2 v²h²) k(w) k(v) f((zj + wh, zj + vh) | zj) dw dv] f(zj) dzj
    ≡ nh ∫ A12(zj) f(zj) dzj,

where

A12(zj) = ∫∫ (ġ0(zj) wh + ½ C1 w²h²)^T [Ω(zj, zj, zj) + Ω1(zj, zj, zj) wh + Ω2(zj, zj, zj) vh + op(w²h²) + op(v²h²)] (ġ0(zj) vh + ½ C2 v²h²) [f((zj, zj) | zj) + f1((zj, zj) | zj) wh + f2((zj, zj) | zj) vh + op(w²h²) + op(v²h²)] k(w) k(v) dw dv
        = I12(zj) h⁴ ∫∫ w² v² k(w) k(v) dw dv + op(h⁴),

where I12(zj) is an integrable function. Then A1 = Op(nh⁵) = Op(1). Also, we can show that

A2 = n^{−1} h E Σ_{i≠j} [(g0(zi) − g0(zj))^T Xi Xi^T Xi Xi^T (g0(zi) − g0(zj)) Kh²(zi − zj)]
   = h E{(g0(zi) − g0(zj))^T Xi Xi^T Xi Xi^T (g0(zi) − g0(zj)) Kh²(zi − zj)} + Rm
   = h E{(g0(zi) − g0(zj))^T Ω(zi, zj) (g0(zi) − g0(zj)) Kh²(zi − zj)} + Rm
   = h ∫ E{(g0(zi) − g0(zj))^T Ω(zi, zj) (g0(zi) − g0(zj)) Kh²(zi − zj) | zj} f(zj) dzj + Rm
   ≡ h ∫ A21(zj) f(zj) dzj + Rm,

where

A21(zj) ≡ ∫ (g0(zi) − g0(zj))^T Ω(zi, zj) (g0(zi) − g0(zj)) Kh²(zi − zj) f(zi | zj) dzi.

Let zi = zj + wh. Then

A21(zj) = (1/h) ∫ (ġ0(zj) wh + C w²h²)^T Ω(zj + wh, zj) (ġ0(zj) wh + C w²h²) k²(w) f(zj + wh | zj) dw
        = I21(zj) h ∫ w² k²(w) dw + Rm,

where I21(zj) is an integrable function; hence A2 = Op(h²) = op(1), and therefore A = Op(1). Now we consider the term B:

B = n^{−1} h E[(Σ_{i=1}^n Xi εi Kh(zi − zj))^T (Σ_{s=1}^n Xs εs Kh(zs − zj))]
  = n^{−1} h E Σ_{(i=s)≠j} [Xi Xs^T εi εs Kh(zi − zj) Kh(zs − zj)]
    + 2 n^{−1} h E Σ_{(i=j)≠s} [Xi Xs^T εi εs Kh(zi − zj) Kh(zs − zj)]
    + n^{−1} h E Σ_{(i≠s)≠j} [Xi Xs^T εi εs Kh(zi − zj) Kh(zs − zj)]
    + n^{−1} h E Σ_{i=s=j} [Xi Xs^T εi εs Kh(zi − zj) Kh(zs − zj)]
  ≡ B1 + B2 + B3 + B4,

where the definitions of the Bj's are apparent. Now,

B1 = h E[Xi Xi^T εi² Kh²(zi − zj)] + Rm
   = h E[Xi Xi^T Kh²(zi − zj) E(εi² | Xi, zi, zj)] + Rm
   = h σ² E[Xi Xi^T Kh²(zi − zj)] + Rm
   = h σ² E[Ω(zi, zj) Kh²(zi − zj)] + Rm
   = h σ² E{E[Ω(zi, zj) Kh²(zi − zj) | zj]} + Rm.

Let zi = zj + wh. Then we have

E[Ω(zi, zj) Kh²(zi − zj) | zj] = ∫ Ω(zi, zj) (1/h²) k²((zi − zj)/h) f_{zi|zj}(zi | zj) dzi
  = (1/h) ∫ Ω(zj + wh, zj) k²(w) f_{zi|zj}(zj + wh | zj) dw
  = I_{B1}(zj) Op(1/h) ∫ k²(w) dw,

where I_{B1}(zj) is an integrable function of zj, so that B1 = Op(1). Next,

B2 = 2 n^{−1} h E Σ_{s≠j} [Xj Xs^T εj εs Kh(0) Kh(zs − zj)]
   = 2 n^{−1} h Σ_{ℓ=−∞}^{∞} E[X_{s+ℓ} Xs^T ε_{s+ℓ} εs Kh(0) Kh(z_{s+ℓ} − zs)] + Rm
   = 2 n^{−1} h Σ_{ℓ=−∞}^{∞} E[E(X_{s+ℓ} Xs^T ε_{s+ℓ} εs | z_{s+ℓ}, zs) Kh(0) Kh(z_{s+ℓ} − zs)] + Rm
   = Op(1),
B3 = n^{−1} h E Σ_{(i≠s)≠j} [Xi Xs^T εi εs Kh(zi − zj) Kh(zs − zj)]
   = n^{−1} h Σ_{ℓ=−∞}^{∞} E[Xi X_{i−ℓ}^T εi ε_{i−ℓ} Kh(zi − zj) Kh(z_{i−ℓ} − zj)]
   = n^{−1} h Σ_{ℓ=−∞}^{∞} E[E(Xi X_{i−ℓ}^T εi ε_{i−ℓ} | zi, z_{i−ℓ}, zj) Kh(zi − zj) Kh(z_{i−ℓ} − zj)]
   = Op(h),

and

B4 = n^{−1} h E[Xj^T Xj εj² Kh²(0)] = n^{−1} h E[Xj^T Xj E(εj² | Xj) Kh²(0)]
   = n^{−1} h σ² Kh²(0) E[Xj^T Xj] = Op(n^{−4/5}).

Thus, B = Op(1). Now,

D̃ = n^{−1} h E∥Σ_{i=1}^n [Xi Xi^T (g0(β0^T Zi) − g0(β̂^T Zi))] Kh(β̂^T Zi − β̂^T Zj)∥²
  ≤ h E∥[Xi Xi^T (g0(β0^T Zi) − g0(β̂^T Zi))] Kh(β̂^T Zi − β̂^T Zj)∥²
  = h E∥[Xi Xi^T (ġ0(β0^T Zi)(β̂ − β0)^T Zi + op(n^{−1/2}))] Kh(β̂^T Zi − β̂^T Zj)∥²
  ≤ C (h/n) E∥Xi Xi^T Kh(β̂^T Zi − β̂^T Zj)∥²
  = Op(1/n).

This proves the lemma. □

Lemma 2. Let {Xi, Zi, yi} be a strong mixing and strictly stationary sequence, h ∝ n^{−1/5}, lim_{n→∞} inf_{θ→0+} P′λn(θ)/λn > 0, and n^{−1/10} λn → 0. Then ∥ĝ_{·k}∥ = 0 as n → ∞ for k > d0.

Proof. Assume ∥ĝ_{·k}∥ ≠ 0. Then

∂Q(G, β̂, h)/∂g_{·k} = J1 + J2 = 0,

where J1 = (J11, J12, ..., J1n)^T with

J1j = −2 Σ_{i=1}^n Xik (yi − ĝ^T(β̂^T Zj) Xi) Kh(β̂^T Zi − β̂^T Zj),

and J2 = n P′λn(∥g_{·k}∥) g_{·k} / ∥g_{·k}∥. Similar to the proof of (A.7) in Wang and Xia (2009), by Lemma 1 we can derive that ∥J1∥ = Op(n h^{−1/2}), and we know that

∥J2∥ = n P′λn(∥g_{·k}∥) = [P′λn(∥g_{·k}∥)/λn] · √h λn · n h^{−1/2}.

Since P′λn(∥g_{·k}∥)/λn > 0 and √h λn → 0, we have P(∥J2∥ < ∥J1∥) → 1 as n → ∞, which contradicts the assumption. Hence ∥ĝ_{·k}∥ = 0 as n → ∞. □

Proof of Theorem 2. (a) Following steps similar to those in the proof of Theorem 1 of Wang and Xia (2009), with Lemma 2 and Hunter and Li (2005), we can conclude that sup_{Z∈Az} ∥ĝk(z, β̂)∥ = 0 for all d1 < k ≤ d.

(b) We want to show that there exists a Ĝa that minimizes Q((Ga, 0), β̂, h). Taking the first derivative of Q((Ga, 0), β̂, h) with respect to ĝa(β̂^T Zj), we obtain the normal equation

Σ_{i=1}^n Xia (yi − ĝa^T(β̂^T Zj) Xia) Kh(β̂^T Zi − β̂^T Zj) + n Πj = 0,

where Πj is an a-dimensional vector whose kth component is

P′λn(∥ĝ_{·k}∥) ĝk(β̂^T Zj) / ∥ĝ_{·k}∥.

Since P′λn(∥ĝ_{·k}∥) = 0 when ∥ĝ_{·k}∥ ≠ 0 and n is large, Π = 0 for large n. Note that

Σ_{i=1}^n Xia (yi − ĝa^T(β̂^T Zj) Xia) Kh(β̂^T Zi − β̂^T Zj) = 0.

In fact, the above normal equation holds for all z = β̂^T Z, Z ∈ Az, β̂ ∈ B. It follows that

Σ_{i=1}^n Xia (yi − ĝa^T(z, β̂) Xia) Kh(β̂^T Zi − z) = 0

and

ĝa(z, β̂) = (Σ_{i=1}^n Xia Xia^T Kh(β̂^T Zi − z))^{−1} Σ_{i=1}^n Xia yi Kh(β̂^T Zi − z).

Then,

ĝa(z, β̂) − g0a(z, β0) = {ĝa(z, β̂) − ĝa(z, β0)} + {ĝa(z, β0) − g0a(z, β0)}.

By a Taylor expansion, the first term on the right hand side is of order Op(n^{−1/2}) and the second term is of order Op(n^{−2/5}). Thus the asymptotic property of ĝa(z, β̂) − g0a(z, β0) is the same as that of the second term, ĝa(z, β0) − g0a(z, β0), whose asymptotic property can be found in the proof of Theorem 3 of Xia and Li (1999). □

Proof of Theorem 3. It follows from Theorem 1 in Xia and Li (1999) that

Q̂1(β, h) = S̃(β) + T(h) + R1(β, h) + R2(h),

where Q̂1(β, h) = Σ_{i=1}^n (yi − ĝ^T(β^T Zi) Xi)², T(h) and R2(h) do not depend on β, and R1(β, h) is an ignorable term. Furthermore,

S̃(β) = n [Ṽ0^{1/2}(β − β0) − n^{−1/2} σ ε]^T [Ṽ0^{1/2}(β − β0) − n^{−1/2} σ ε] + R3 + R4(β),

where R3 does not depend on β and h, and R4(β) is an ignorable term.

Let δn = n^{−1/2} + an and t = (t1, ..., td)^T. For any small ε > 0, if we can show that there exists a large constant C such that

P{ inf_{∥t∥=C} Q(β0 + δn t, ĝ) > Q(β0, ĝ) } > 1 − ε,

then ∥β̂ − β0∥ = Op(δn). Define Dn = Q(β0 + δn t, ĝ) − Q(β0, ĝ). Then

Dn ≥ ½ Σ_{i=1}^n (yi − ĝ^T(β0^T Zi + δn t^T Zi) Xi)² − ½ Σ_{i=1}^n (yi − ĝ^T(β0^T Zi) Xi)²
     + n Σ_{k=1}^{d1} Ψζn(|β10k + δn tk|) − n Σ_{k=1}^{d1} Ψζn(|β10k|)   (by β20 = 0)
Z. Cai et al. / Journal of Econometrics 189 (2015) 272–284 283

and and
 T 
d1
  d1
∂Q β1T , β2T , ĝ 1 ∂ Q̂1 (β, h)
n Ψζn (|β10k + δn tk |) − n Ψζn (|β10k |) = + nΨζ′n (|βk |) sgn (βk )
k=1 k=1 ∂βk 2 ∂βk
Ψζ′n (|βk |)
d0 
   
1
 1
= nζn Op sgn (βk ) .

=n δn Ψζ′n (|β10k |)sgn(β10k )tk + δn2 Ψζ′′n (|β10k |)tk2 √ +
2 nζn ζn
k=1
√ Ψζ′ (|βk |)
+ op (nδn2 ) Since nζn → ∞ and lim infn→∞,βk →0+ n
> 0, the sign of
ζn
∂Q
1 is determined by the sign of βk . It follows from Part (a) that

≤ d1 nδn an ∥t ∥ + nδn2 max1≤k≤d0 {Ψζ′′n (|β10k |)}∥t ∥2 + op (nδn2 ) ∂βk
2  
(by Cauchy–Schwarz inequality)
T
∂Q β1T , β2T , ĝ 
=0

≤ nδn2 d0 C + Op (nδn2 )

∂β 
β̂
as n → ∞ and max1≤k≤d0 {Ψζ′′n (|β10k |)} → 0 β=( 01 )

and
and   
n
1
n
1 1 ∂ Ŝ β̂1 , 0 , h
d
$$
D_n = \frac{1}{2}\sum_{i=1}^{n}\big(y_i - \hat g^{T}(\beta_0^{T}Z_i + \delta_n t^{T}Z_i)X_i\big)^2 - \frac{1}{2}\sum_{i=1}^{n}\big(y_i - \hat g^{T}(\beta_0^{T}Z_i)X_i\big)^2
$$
$$
= \frac{1}{2}n\big[\tilde V_0^{1/2}\delta_n t - n^{-1/2}\sigma\varepsilon\big]^{T}\big[\tilde V_0^{1/2}\delta_n t - n^{-1/2}\sigma\varepsilon\big] - \frac{1}{2}n\big[n^{-1/2}\sigma\varepsilon\big]^{T}\big[n^{-1/2}\sigma\varepsilon\big] + R_1(\beta_0 + \delta_n t, h) - R_1(\beta_0, h) + o_p(1)
$$
(by the theorem in Xia and Li, 1999)
$$
= \frac{1}{2}n\delta_n^2 t^{T}\tilde V_0 t - n^{1/2}\delta_n t^{T}\tilde V_0^{1/2}\sigma\varepsilon + R_1(\beta_0 + \delta_n t, h) - R_1(\beta_0, h) + o_p(1)
$$
$$
= \frac{1}{2}n\delta_n^2 t^{T}\tilde V_0 t - \delta_n t^{T}V_n + R_1(\beta_0 + \delta_n t, h) - R_1(\beta_0, h) + o_p(1).
$$
Since the $R_1$ terms are negligible as $n \to \infty$ and $n^{-1/2}V_n = O_p(1)$, we have $-\delta_n t^{T}V_n = C \cdot O_p(\sqrt n\,\delta_n) = C \cdot O_p(n\delta_n^2)$. By choosing a sufficiently large $C$, the quadratic term $\frac{1}{2}n\delta_n^2 t^{T}\tilde V_0 t$, which is of order $n\delta_n^2 C^2$, dominates the others. Hence $D_n \ge 0$ holds. $\square$

Proof of Theorem 4. Let $\hat\beta_1 - \beta_{10} = O_p(n^{-1/2})$. We want to show that $(\hat\beta_1^{T}, 0^{T})^{T} = \arg\min_{(\beta_1^{T},\beta_2^{T})^{T} \in \mathcal B} Q(\beta_1^{T}, \beta_2^{T}, \hat g)$. It suffices to show that, for some constant $C$ and $k = q_0+1, \ldots, q$,
$$
\frac{\partial Q(\beta_1^{T}, \beta_2^{T}, \hat g)}{\partial \beta_k}
\begin{cases}
> 0 & \text{for } 0 < \beta_k < Cn^{-1/2}, \\
< 0 & \text{for } -Cn^{-1/2} < \beta_k < 0.
\end{cases}
$$
Note that
$$
\frac{\partial \hat Q_1(\beta, h)}{\partial \beta_k} = \frac{\partial \tilde S(\beta)}{\partial \beta_k} + R_m = e_k^{T}\frac{\partial \tilde S(\beta)}{\partial \beta} + R_m = 2n e_k^{T}\tilde V_0(\beta - \beta_0) - 2n^{1/2}\sigma e_k^{T}\tilde V_0^{1/2}\varepsilon + R_m = 2n e_k^{T}\tilde V_0(\beta - \beta_0) - 2 e_k^{T}V_n + R_m,
$$
where $R_m$ represents small-order terms and $e_k$ is a $d$-dimensional vector with $k$th element being one and all others being zero. Since $\beta - \beta_0 = O_p(1/\sqrt n)$ and $V_n = O_p(\sqrt n)$, then
$$
\frac{\partial \hat S(\beta, h)}{\partial \beta_k} = O_p(\sqrt n),
$$
so the penalty term, whose derivative is $n\Psi'_{\zeta_n}(|\beta_k|)\operatorname{sgn}(\beta_k)$, dominates and the sign of $\partial Q/\partial \beta_k$ is determined by $\operatorname{sgn}(\beta_k)$. Hence $\hat\beta_2 = 0$ with probability tending to one, and the first-order condition for the nonzero coefficients is
$$
\frac{1}{2}\frac{\partial \hat S\big((\hat\beta_1^{T}, 0^{T})^{T}, h\big)}{\partial \beta_1} + n\Delta\Psi_{\zeta_n 1} = 0,
$$
where $\Delta\Psi_{\zeta_n 1} = \{\Psi'_{\zeta_n}(|\beta_1|)\operatorname{sgn}(\beta_1), \ldots, \Psi'_{\zeta_n}(|\beta_{d_1}|)\operatorname{sgn}(\beta_{d_1})\}^{T}$. Note that as $n \to \infty$ and $\zeta_n \to 0$, $\Psi'_{\zeta_n}(|\beta_k|) = 0$ for $k = 1, \ldots, d_1$, and
$$
\frac{1}{2}\frac{\partial \hat S\big((\hat\beta_1^{T}, 0^{T})^{T}, h\big)}{\partial \beta_1} = 0,
$$
which implies that
$$
n V_{10}\big(\hat\beta_1 - \beta_{10}\big) - n^{1/2}\sigma V_{10}^{1/2}\varepsilon_1 + o_p(n^{1/2}) = 0,
$$
and hence
$$
\sqrt n\big(\hat\beta_1 - \beta_{10}\big) = V_{10}^{-1}\frac{1}{\sqrt n}V_{1n} + o_p(1) = V_{10}^{-1}\frac{1}{\sqrt n}\sum_{i=1}^{n}\big(Z_{1i} - E(Z_{1i} \mid \beta_{10}^{T}Z_{1i})\big)\dot g^{T}(\beta_{10}^{T}Z_{1i})X_i\varepsilon_{1i} + o_p(1),
$$
so that
$$
\sqrt n\big(\hat\beta_1 - \beta_{10}\big) \xrightarrow{D} N\Big(0,\; V_{10}^{-1}\Big[\sum_{\ell=-\infty}^{\infty}\Gamma(\ell)\Big]V_{10}^{-1}\Big),
$$
where $\Gamma(\ell) = E\big(\Gamma_i\Gamma_{i-\ell}^{T}\big)$ with $\Gamma_i = \big(Z_{1i} - E(Z_{1i} \mid \beta_{10}^{T}Z_{1i})\big)\dot g^{T}(\beta_{10}^{T}Z_{1i})X_i\varepsilon_{1i}$. $\square$
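The sparsity argument hinges on a property of the SCAD penalty (Fan and Li, 2001): its derivative is bounded away from zero near the origin, so the penalty term of order $n\zeta_n$ dominates the $O_p(\sqrt n)$ score for coefficients shrinking to zero, while the derivative vanishes beyond $a\zeta_n$, so coefficients bounded away from zero are asymptotically unpenalized. A minimal numerical sketch (the function name is ours; the form and the default $a = 3.7$ follow Fan and Li, 2001, and this is illustrative rather than the authors' code):

```python
def scad_derivative(t, lam, a=3.7):
    """Derivative p'_lam(t) of the SCAD penalty for t >= 0 (Fan and Li, 2001)."""
    if t <= lam:
        # Near the origin the derivative equals lam: the penalty is active,
        # so an n * p'(|beta_k|) term dominates an O_p(sqrt(n)) score.
        return lam
    # Linearly decaying part, exactly zero once t > a * lam.
    return lam * max(a * lam - t, 0.0) / ((a - 1) * lam)

lam = 0.1  # plays the role of zeta_n
print(scad_derivative(0.05, lam))  # 0.1 -- small coefficient is penalized
print(scad_derivative(1.0, lam))   # 0.0 -- large coefficient is unpenalized
```

This is why, in the proof above, $\Psi'_{\zeta_n}(|\beta_k|) = 0$ for the nonzero coefficients once $\zeta_n \to 0$.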
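The asymptotic covariance in Theorem 4 involves the long-run variance $\sum_{\ell}\Gamma(\ell)$ of the serially dependent scores $\Gamma_i$. In practice such a quantity is typically estimated with a kernel (HAC) estimator; the sketch below uses a Bartlett kernel on simulated scores. The function `bartlett_lrv`, the lag choice, and the identity matrix standing in for an estimate of $V_{10}$ are all illustrative assumptions, not part of the paper:

```python
import numpy as np

def bartlett_lrv(g, max_lag):
    """Bartlett-kernel estimate of the long-run variance sum_l Gamma(l)
    for an (n x d) array g of mean-zero score vectors."""
    n, d = g.shape
    lrv = g.T @ g / n                      # Gamma(0)
    for l in range(1, max_lag + 1):
        w = 1.0 - l / (max_lag + 1.0)      # Bartlett weight
        gamma_l = g[l:].T @ g[:-l] / n     # Gamma(l)
        lrv += w * (gamma_l + gamma_l.T)   # add Gamma(l) + Gamma(-l)
    return lrv

rng = np.random.default_rng(0)
g = rng.standard_normal((500, 2))          # stand-in for the scores Gamma_i
omega = bartlett_lrv(g, max_lag=4)
# Sandwich covariance V10^{-1} Omega V10^{-1} for some estimate V10:
V10 = np.eye(2)
cov = np.linalg.inv(V10) @ omega @ np.linalg.inv(V10)
print(cov.shape)  # (2, 2)
```

The symmetrized weighting guarantees a symmetric estimate; with the Bartlett kernel it is also positive semi-definite.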
References

Akaike, H., 1973. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60, 255–265.
Box, G.E.P., Jenkins, G.M., 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37, 373–384.
Brent, A.J., Lin, D.Y., Zeng, D., 2008. Penalized estimating functions and variable selection in semiparametric regression models. J. Amer. Statist. Assoc. 103, 672–680.
Cai, Z., 2002. Regression quantiles for time series. Econometric Theory 18, 169–192.
Cai, Z., Fan, J., Li, R., 2000a. Efficient estimation and inferences for varying-coefficient models. J. Amer. Statist. Assoc. 95, 888–902.
Cai, Z., Fan, J., Yao, Q., 2000b. Functional-coefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95, 941–956.
Cai, Z., Ren, Y., Yang, B., 2014a. A semiparametric conditional capital asset pricing model. Working paper, The Wang Yanan Institute for Studies in Economics, Xiamen University.
Cai, Z., Wang, Y., 2014. Testing predictive regression models with nonstationary regressors. J. Econometrics 178, 4–14.
Cai, Z., Wang, Y., Wang, Y., 2014b. Testing instability in predictive regression model with nonstationary regressors. Econometric Theory, forthcoming. http://dx.doi.org/10.1017/S0266466614000590.
Campbell, J.Y., Yogo, M., 2006. Efficient tests of stock return predictability. J. Financ. Econ. 81, 27–60.
Chan, K.S., Tong, H., 1986. On estimating thresholds in autoregressive models. J. Time Ser. Anal. 7, 179–190.
Chen, R., Tsay, R.S., 1993. Functional coefficient autoregressive model. J. Amer. Statist. Assoc. 88, 298–308.
Fan, J., Li, R., 2001. Variable selection via non-concave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fan, J., Li, R., 2004. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Amer. Statist. Assoc. 99, 710–723.
Fan, J., Lv, J., 2010. A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20, 101–148.
Fan, J., Yao, Q., Cai, Z., 2003. Adaptive varying-coefficient linear models. J. R. Stat. Soc. Ser. B 65, 57–80.
Fan, J., Zhang, W.Y., 1999. Statistical estimation in varying coefficient models. Ann. Statist. 27, 1491–1518.
Fu, W.J., 1998. Penalized regressions: the bridge versus the LASSO. J. Comput. Graph. Statist. 7, 397–416.
Granger, C.W.J., Andersen, A.P., 1978. An Introduction to Bilinear Time Series Models. Vandenhoeck & Ruprecht, Göttingen.
Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384.
Horowitz, J.L., 2009. Semiparametric and Nonparametric Methods in Econometrics. Springer-Verlag, New York.
Huang, J., Joel, L.H., Wei, F.R., 2010. Variable selection in nonparametric additive models. Ann. Statist. 38, 2282–2313.
Hunter, D.R., Li, R., 2005. Variable selection using MM algorithms. Ann. Statist. 33, 1617–1642.
Ichimura, H., 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single index models. J. Econometrics 58, 71–120.
Janson, S., 1987. Maximal spacing in several dimensions. Ann. Probab. 15, 274–280.
Kong, E., Xia, Y., 2007. Variable selection for the single index model. Biometrika 94, 217–229.
Li, Q., Jeffrey, S.R., 2007. Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton.
Li, R., Liang, H., 2008. Variable selection in semiparametric regression modeling. Ann. Statist. 36, 261–286.
Liang, H., Li, R., 2009. Variable selection for partially linear models with measurement errors. J. Amer. Statist. Assoc. 104, 234–248.
Liang, H., Liu, X., Li, R., Tsai, C.L., 2010. Estimation and testing for partially linear single-index models. Ann. Statist. 38, 3811–3836.
Lin, Y., Zhang, H., 2006. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34, 2272–2297.
Newey, W.K., Stoker, T.M., 1993. Efficiency of weighted average derivative estimators and index models. Econometrica 61, 1199–1223.
Phillips, P.C.B., Lee, J.H., 2013. Predictive regression under various degrees of persistence and robust long-horizon regression. J. Econometrics 177, 250–264.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Su, L., Zhang, Y., 2013. Variable selection in nonparametric and semiparametric regression models. In: Handbook in Applied Nonparametric and Semi-Nonparametric Econometrics and Statistics. Research Collection School of Economics.
Teräsvirta, T., 1994. Specification, estimation, and evaluation of smooth transition autoregressive models. J. Amer. Statist. Assoc. 89, 208–218.
Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B 58, 267–288.
Tong, H., 1990. Non-linear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, UK.
Wang, H., Li, G., Tsai, C.L., 2007. Regression coefficient and autoregressive order shrinkage and selection via LASSO. J. R. Stat. Soc. Ser. B 69, 63–68.
Wang, L.F., Li, H.Z., Huang, J.H., 2008. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Amer. Statist. Assoc. 103, 1556–1569.
Wang, H., Xia, Y., 2009. Shrinkage estimation of the varying coefficient model. J. Amer. Statist. Assoc. 104, 747–757.
Xia, Y., Li, W.K., 1999. On single-index coefficient regression models. J. Amer. Statist. Assoc. 94, 1275–1285.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–57.
Zhang, Y., Li, R., Tsai, C.L., 2010. Regularization parameter selections via generalized information criterion. J. Amer. Statist. Assoc. 105, 312–323.
Zhang, H.H., Lin, Y., 2006. Component selection and smoothing for nonparametric regression in exponential families. Statist. Sinica 16, 1021–1041.
Zhao, P.X., Xue, L., 2010. Variable selection for semi-parametric varying coefficient partially linear errors-in-variables models. J. Multivariate Anal. 101, 1872–1883.
Zou, H., 2006. The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc. 101, 1418–1429.
Zou, H., Li, R., 2008. One-step sparse estimates in non-concave penalized likelihood models. Ann. Statist. 36, 1509–1533.