Functional Index Coefficient Models With Variable Selection
Journal of Econometrics
coefficient autoregressive (FIAR) model satisfying

    r_t = Σ_{j=1}^p g_j(β^T r_{t−1}) r_{t−j} + ε_t,

where g_j(·) is an unknown function in R for 1 ≤ j ≤ p. In fact, the above FIAR model can be regarded as a special case of the functional index coefficient models of Fan et al. (2003),

    y_i = Σ_{j=1}^p g_j(β^T Z_i) X_{ji} + ε_i ≡ g(β^T Z_i)^T X_i + ε_i,  1 ≤ i ≤ n,  (1)

where y_i is a dependent variable, X_i = (X_{1i}, X_{2i}, ..., X_{pi})^T is a p × 1 vector of covariates, Z_i is a d × 1 vector of local variables, the ε_i are independently and identically distributed (i.i.d.) with mean 0 and standard deviation σ, β ∈ R^d is a d × 1 vector of unknown parameters, and g(·) = (g_1(·), ..., g_p(·))^T is a p × 1 vector of unknown functional coefficients. We assume that ∥β∥ = 1 and that the first element of β is positive for identification, where ∥·∥ is the Euclidean norm (L2-norm). Note that both X_i and Z_i can include lagged values of y_i. In particular, if X_{1i} ≡ 1, then model (1) contains an intercept function term.

Xia and Li (1999) studied the asymptotic properties of model (1) under mixing conditions when the index part of the model is not constrained to be a linear combination of Z_i. However, for the efficiency of estimation and the accuracy of prediction, it is important to select variables in both Z_i and X_i, and potentially to exclude variables from Eq. (1). Fan et al. (2003) provided algorithms to estimate the local parameters β and the functional coefficients g(·). Meanwhile, they deleted the least significant variables in a given model according to t-values and selected the best model, in multiple steps, in terms of the Akaike information criterion (AIC) of Akaike (1973). However, as mentioned in Fan and Li (2001), this stepwise deletion procedure may suffer from stochastic errors inherited from the multiple stages. Moreover, there is no theory for this variable selection procedure, and the authors did not mention how to select the regressors X_i. These selection issues motivate us to consider variable selection on both the local variables Z_i and the covariates X_i in model (1).

The FIAR model reduces the curse of dimensionality since each of the nonparametric functions has only one argument. However, there still remain potential areas of dimension reduction. First, there are several nonparametric functions in the p × 1 vector g(β^T Z). In addition, the vector Z is d-dimensional. Hence, by using model selection methods, there is potential to find a more parsimonious model that effectively captures the features of our data. Variable selection methods and their algorithms can be traced back four decades. Pioneering contributions include the AIC and the Bayesian information criterion (BIC) of Schwarz (1978). Various shrinkage-type methods have been developed more recently, including but not limited to the nonnegative garrote of Breiman (1995), bridge regression of Fu (1998), the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996), the smoothly clipped absolute deviation of Fan and Li (2001), the adaptive LASSO of Zou (2006), and so on. The reader is referred to the review paper by Fan and Lv (2010) for details. Here, we adopt the smoothly clipped absolute deviation (SCAD) penalty function of Fan and Li (2001) since it enjoys the three properties of unbiasedness, sparsity and continuity. Furthermore, it has the oracle property; namely, the resulting procedures perform as well as those for the case when the true model is known in advance.

The shrinkage method has been successfully extended to semiparametric models; see, for example, variable selection in partially linear models in Liang and Li (2009), partially linear models for longitudinal data in Fan and Li (2004), single-index models in Kong and Xia (2007), semiparametric regression models in Brent et al. (2008) and Li and Liang (2008), varying coefficient partially linear models with errors-in-variables in Zhao and Xue (2010), partially linear single-index models in Liang et al. (2010), and the references therein.

However, the aforementioned papers focused mainly on variable selection with parametric coefficients. The shrinkage method has also been extended to select significant variables with functional coefficients. Lin and Zhang (2006) proposed the component selection and smoothing operator (COSSO) for model selection and model fitting in multivariate nonparametric regression models in the framework of smoothing spline analysis of variance. Meanwhile, they extended the COSSO to the exponential families (Zhang and Lin, 2006). Wang et al. (2008) proposed variable selection procedures based on basis function approximations and the SCAD, which are similar to the COSSO, and they argued that their procedures can select significant variables with time-varying effects and estimate the nonzero smooth coefficient functions simultaneously. Huang et al. (2010) proposed using the adaptive group LASSO for variable selection in nonparametric additive models based on a spline approximation, in which the number of variables and additive components may be larger than the sample size. By adopting the idea of the grouping method in Yuan and Lin (2006), Wang and Xia (2009) used the kernel LASSO to apply shrinkage to the functional coefficients in varying coefficient models. Their purely nonparametric shrinkage procedure differs from the approaches using splines and basis functions (Lin and Zhang, 2006; Wang et al., 2008; Huang et al., 2010). For a comprehensive survey of variable selection in nonparametric and semiparametric regression models via shrinkage, the reader is referred to the paper by Su and Zhang (2013).

Almost all the variable selection procedures mentioned above are based on the assumption that the observations are independent and identically distributed (i.i.d.). To the best of our knowledge, few papers consider variable selection under non-i.i.d. settings. It might not be appropriate to apply such procedures directly to financial and economic data, since most financial/economic data are weakly dependent. To address this issue, Wang et al. (2007) extended the LASSO to regression models with autoregressive errors. In this paper, we consider variable selection in functional index coefficient models under a very general dependence structure (the strong mixing context). Our variable selection procedures consist of two steps. The first is to select covariates with functional coefficients, and then we perform model selection for local variables with parametric coefficients.

The rest of this paper is organized as follows. In Section 2, we present the identification conditions for functional index coefficient models, our new two-step estimation procedures, some properties of the SCAD penalty function, and numerical implementations. In Section 3, we propose variable selection procedures for both covariates with functional coefficients and local variables with parametric coefficients. We then establish the consistency, the sparsity and the oracle property of all the proposed estimators. A simple bandwidth selection method is also discussed in the same section. Monte Carlo simulation results for the proposed two-step procedures are reported in Section 4. An empirical example applying the functional index coefficient autoregressive model and its variable selection procedures is studied extensively in Section 5. Finally, concluding remarks are given in Section 6, and all the regularity conditions and technical proofs are gathered in the Appendix.

2. Identification, estimation and penalty function

2.1. Identification

The identification problem in the single index model was first investigated by Ichimura (1993), and extensively studied by
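To fix ideas, the following sketch simulates data from a model of the form of model (1). The dimensions, the index vector β, and the coefficient functions g_1, g_2 are hypothetical choices for illustration only, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 2, 2

# identification: ||beta|| = 1 with a positive first element
beta = np.array([0.8, 0.6])

Z = rng.normal(size=(n, d))          # local variables Z_i
X = rng.normal(size=(n, p))          # covariates X_i
u = Z @ beta                         # single index beta^T Z_i

# hypothetical functional coefficients g(.) = (g_1, g_2)^T
G = np.column_stack([np.sin(u), u ** 2])

eps = rng.normal(scale=0.5, size=n)  # i.i.d. errors with sd 0.5
y = np.sum(G * X, axis=1) + eps      # y_i = g(beta^T Z_i)^T X_i + eps_i
```

Each coefficient function depends on Z only through the scalar index β^T Z_i, which is the dimension-reduction feature discussed above.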
274 Z. Cai et al. / Journal of Econometrics 189 (2015) 272–284
Li and Jeffrey (2007) and Horowitz (2009). Meanwhile, partial conditions for identification in functional index coefficient models were given in Fan et al. (2003). Here we present the conditions for identification below.

Theorem 1 (Identification in Functional Index Coefficient Models). Assume that the dependent variable Y is generated by Eq. (1), X is a p-dimensional vector of covariates and Z is a d-dimensional vector of local variables. β is a d-dimensional vector of unknown parameters and g(·) is a p-dimensional vector of unknown functional coefficients. Then, β and g(·) are identified if the following conditions hold:

Assumption I.
I1. The vector functions g(·) are continuous and not constant everywhere.
I2. The components of Z are continuously distributed random variables.
I3. There exists no perfect multi-collinearity among the components of Z and none of the components of Z is constant.
I4. There exists no perfect multi-collinearity among the components of X.
I5. The first element of β is positive and ∥β∥ = 1, where ∥·∥ is the standard Euclidean norm.
I6. When X = Z, E(Y|X, Z) becomes E(Y|X) and it cannot be expressed in the form E(Y|X) = α^T X β^T X + γ^T X + c, where α, γ ∈ R^d and c ∈ R are constants, and α and β are not parallel to each other.

Remark 1. Assumption I1 is a mild condition since continuous

with K(·) being a kernel function, K_h(z) = K(z/h)/h and P_{λ_n}(·) being a penalty function. {λ_1, ..., λ_p} are tuning parameters and ĝ_{·k} = [ĝ_k(β̂^T Z_1), ..., ĝ_k(β̂^T Z_n)]^T is the estimate of the kth functional coefficient at the corresponding sample points. As recommended, an initial estimator β̂ can be obtained by various algorithms, such as the method in Fan et al. (2003), or by average derivative estimators as in Newey and Stoker (1993). As long as the initial estimator satisfies ∥β̂ − β∥ = O_p(1/√n), as expected, the parametric estimator β̂ has little effect on the shrinkage estimation of the functional coefficients ĝ(·) in the above equation when the sample size n is large. We choose the penalty term P_{λ_n}(·) to be the SCAD function, which is described in Section 2.3, and the L2 functional norm ∥ĝ_{·k}∥ = [ĝ_k²(β̂^T Z_1) + ··· + ĝ_k²(β̂^T Z_n)]^{1/2} has the same definition as the standard Euclidean norm. The purpose of using the penalized locally weighted least squares is to select the significant covariates X_i in model (1).

Note that when the penalty term is P_{λ_n}(z) = λ_n|z|, the penalized local least squares becomes the Lasso type, so that the objective function in Eq. (2) reduces to the case considered by Wang and Xia (2009).

Step Two: Given the estimator ĝ(·) of the function, minimize the penalized global least squares Q(β, ĝ), where

    Q(β, ĝ) = (1/2) Σ_{i=1}^n [y_i − ĝ^T(β^T Z_i) X_i]² + n Σ_{k=1}^d Ψ_{ζ_n}(|β_k|).  (3)
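As a rough illustration of the locally weighted least squares underlying the first step (shown here without the penalty term; the Epanechnikov kernel and all data choices below are ours, for illustration only), the functional coefficients can be estimated at each sample index value by a kernel-weighted regression:

```python
import numpy as np

def local_ls(y, X, index, h):
    """Unpenalized kernel-weighted least squares: for each sample point j,
    regress y on X with Epanechnikov weights K_h(index_i - index_j),
    giving an estimate of g(index_j).  y: (n,), X: (n, p), index: (n,)."""
    n, p = X.shape
    G = np.empty((n, p))
    for j in range(n):
        v = (index - index[j]) / h
        w = np.where(np.abs(v) <= 1.0, 0.75 * (1.0 - v ** 2), 0.0) / h
        XW = X * w[:, None]
        # tiny ridge term guards against singular local designs
        G[j] = np.linalg.solve(X.T @ XW + 1e-8 * np.eye(p), XW.T @ y)
    return G
```

Shrinking an entire column of these local fits toward zero, via a group-type penalty on ∥ĝ_{·k}∥, is what the penalized version adds on top of this building block.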
The important property of the SCAD penalty function is that it has the following first derivative,

    P'_λ(|β|) = λ                     if |β| ≤ λ,
                (aλ − |β|)/(a − 1)    if λ < |β| ≤ aλ,
                0                     if |β| > aλ,

for some a > 2,  (5)

which makes the computational implementation easy. It can be clearly seen that P_λ(|β|) is not differentiable at 0 with respect to β. Thus, it is not easy to minimize the penalized least squares due to this singularity. To ease the implementation, Fan and Li (2001) suggested approximating the penalty function by a quadratic function,

    P_λ(|β_j|) ≈ P_λ(|β_j^{(0)}|) + (1/2){P'_λ(|β_j^{(0)}|)/|β_j^{(0)}|}(β_j² − β_j^{(0)2})  for β_j ≈ β_j^{(0)}.  (6)

Alternatively, Zou and Li (2008) proposed the local linear approximation for non-concave penalty functions,

    P_λ(|β_j|) ≈ P_λ(|β_j^{(0)}|) + P'_λ(|β_j^{(0)}|)(|β_j| − |β_j^{(0)}|)  for β_j ≈ β_j^{(0)},  (7)

which can reduce the computational cost without losing any statistical efficiency. Meanwhile, some other algorithms, such as the minorize-maximize algorithm (Hunter and Li, 2005), have also been proposed.

In view of (6), given a good initial value β^{(0)}, we can find the one-step estimator as follows:

    β^{(1)} = argmin_β { (1/2)(β − β^{(0)})^T [−∇²ℓ(β^{(0)})] (β − β^{(0)}) + n Σ_{k=1}^d {P'_λ(|β_k^{(0)}|)/(2|β_k^{(0)}|)} β_k² },  (8)

where ℓ(·) is a loss function and ∇²ℓ(β^{(0)}) = ∂²ℓ(β^{(0)})/∂β∂β^T. As argued in Fan and Li (2001), there is no need to iterate until convergence as long as the initial estimator is reasonable. Also, the MLE estimator from the full model without the penalty term can be regarded as a reasonable initial estimator. Using the local linear approximation of Zou and Li (2008) in Eq. (7), the sparse one-step estimator given in Eq. (8) becomes

    β^{(1)} = argmin_β { (1/2)(β − β^{(0)})^T [−∇²ℓ(β^{(0)})] (β − β^{(0)}) + n Σ_{k=1}^d P'_λ(|β_k^{(0)}|)|β_k| }.  (9)

As demonstrated in Zou and Li (2008), this one-step estimator is as efficient as the fully iterative estimator, provided that the initial estimator is good enough. For example, we let β^{(0)} be the maximum likelihood estimator without the penalty term.

3. Large sample theory

3.1. Penalized nonparametric estimator for functional coefficients

Let {(X_i, Z_i, y_i)} be a strictly stationary and strong mixing sequence, and let f(·, β) be the density function of β^T Z, where β is an interior point of the compact set B. Assume δ is a small positive constant and define A_z = {Z : f(β^T Z, β) ≥ δ, β ∈ B, and there exist a and b such that β^T Z ∈ [a, b]} as the domain of Z. Then, β^T Z is bounded and the density f(·, β) is bounded away from 0. Also, define the domain of the bandwidth h as H_n = {h : there exist C_1 and C_2 such that C_1 n^{−1/5} < h < C_2 n^{−1/5}}. For Z ∈ A_z, β ∈ B, and h ∈ H_n, define the n × p matrix penalized estimator as

    Ĝ(β̂) = [ĝ(β̂^T Z_1), ..., ĝ(β̂^T Z_n)]^T = [ĝ_{·1}, ..., ĝ_{·p}],

where

    ĝ(β̂^T Z) = [ĝ_1(β̂^T Z), ..., ĝ_p(β̂^T Z)]^T ∈ R^p,

and

    ĝ_{·k} = [ĝ_k(β̂^T Z_1), ..., ĝ_k(β̂^T Z_n)]^T ∈ R^n.

Similarly, we define the true values G_0(β), g_0(β^T Z) and g_{0·k}, respectively. Without loss of generality, we assume that the first p_0 functional coefficients are non-zero and the other p − p_0 functional coefficients are zero; i.e., ∥g_{·k}∥ ≠ 0 and g_{·k} is not constant everywhere for 1 ≤ k ≤ p_0, and ∥g_{·k}∥ = 0 for p_0 < k ≤ p. Let α_n = max{P'_λ(∥g_{·k}∥) : 1 ≤ k ≤ p_0}. Then, by minimizing the penalized local least squares Q(ĝ, β̂, h) in Eq. (2), one can obtain the penalized local least squares estimator ĝ(·) of g(·).

To study the asymptotic distribution of the penalized local least squares estimator, we impose some technical conditions as follows.

Assumption A.
A1. The vector functions g(·) have continuous second order derivatives on the support A_z.
A2. For any β ∈ B and Z ∈ A_z, the density function f(·, β) is continuous and there exists a small positive δ such that f(·, β) > δ.
A3. The kernel function K(·) is a bounded density with a bounded support region. Let μ_2 = ∫ v² K(v) dv and ν_0 = ∫ K²(v) dv.
A4. lim_{n→∞} inf_{θ→0+} P'_{λ_n}(θ)/λ_n > 0, n^{−1/10} λ_n → 0, h ∝ n^{−1/5} and ∥β̂ − β_0∥ = O_p(1/√n).
A5. Define Ω(z, β) = E(X_i X_i^T | β^T Z_i = z). Assume that Ω(·) is nonsingular and has a bounded second order derivative on A_z.
A6. {(X_i, Z_i, y_i)} is a strictly stationary and strongly mixing sequence with mixing coefficient satisfying α(m) = O(ρ^m) for some 0 < ρ < 1.
A7. Assume that the conditional density f(z_i, z_s | z_j) is continuous and has a bounded second order derivative.
A8. Assume that Ω(z_i, z_s, z_j) = E(X_i X_i^T X_s X_s^T | β^T Z_i = z_i, β^T Z_s = z_s, β^T Z_j = z_j) is continuous and has a bounded second order derivative. Define Ω_1(z_i, z_s, z_j) = ∂Ω(z_i, z_s, z_j)/∂z_i and Ω_2(z_i, z_s, z_j) = ∂Ω(z_i, z_s, z_j)/∂z_s.

Remark 2. The conditions in A2 imply that the distances between two adjacent ranked values β^T Z_(i) are at most of order O_p(log n/n) (Janson, 1987). For any value Z ∈ A_z, we can find a closest value β^T Z_j to Λ = β^T Z such that |β^T Z_j − Λ| = O_p(log n/n). With the conditions in A1, ∥g(β^T Z_j) − g(Λ)∥ = O_p(log n/n), which is of smaller order than the nonparametric convergence rate n^{−2/5}. This implies that we only need to estimate ĝ(β^T Z_i) for i = 1, 2, ..., n rather than ĝ(Λ) for all values in the domain A_z. For the detailed arguments, we refer to the paper by Wang and Xia (2009). Assumption A3 is a common assumption in nonparametric estimation. The assumption ∥β̂ − β_0∥ = O_p(1/√n) in A4 implies that the estimator β̂ has little effect on the estimation of ĝ(·) when the sample size n is large, since the convergence rate of the parametric estimator β̂ is faster than that of the nonparametric function estimators ĝ(·). The assumptions in A5–A8 are very standard and are used for the proofs under mixing conditions; see Cai et al. (2000b). In particular, the assumptions in A6 are common conditions for weakly dependent data. Most financial models satisfy these conditions, such as ARMA, ARCH and GARCH models; see Cai (2002).
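The SCAD derivative in Eq. (5) and the quadratic approximation in Eq. (6) admit a simple implementation. The sketch below is our own toy version with a least-squares loss, so that −∇²ℓ = XᵀX and each approximation step implied by Eq. (8) has the closed form of a generalized ridge regression:

```python
import numpy as np

def scad_deriv(b, lam, a=3.7):
    """P'_lambda(|b|) from Eq. (5): lam for |b| <= lam,
    (a*lam - |b|)/(a - 1) for lam < |b| <= a*lam, and 0 beyond a*lam."""
    b = np.abs(b)
    return np.where(b <= lam, lam,
                    np.where(b <= a * lam, (a * lam - b) / (a - 1.0), 0.0))

def lqa_step(X, y, beta0, lam, a=3.7, eps=1e-8):
    """One local-quadratic-approximation update for SCAD-penalized least
    squares: minimize 0.5*||y - X b||^2 + n * sum_k
    P'_lam(|b0_k|)/(2|b0_k|) * b_k^2, whose minimizer is a ridge form."""
    n = len(y)
    D = np.diag(scad_deriv(beta0, lam, a) / (np.abs(beta0) + eps))
    return np.linalg.solve(X.T @ X + n * D, X.T @ y)
```

Iterating `lqa_step` a few times from an unpenalized initial estimator drives truly-zero coefficients toward zero, while coefficients with |β| > aλ receive a zero derivative and hence are essentially unpenalized, which is the unbiasedness property mentioned above.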
Define the nonparametric estimator ĝ(z, β̂) ≡ [ĝ_a(z, β̂), ĝ_b(z, β̂)]^T, where z = β̂^T Z, ĝ_a(z, β̂) = [ĝ_1(z, β̂), ..., ĝ_{p_0}(z, β̂)]^T ∈ R^{p_0} and ĝ_b(z, β̂) = [ĝ_{p_0+1}(z, β̂), ..., ĝ_p(z, β̂)]^T ∈ R^{p−p_0}. Analogously, we denote the true value g_0(z, β_0) ≡ [g_{0a}(z, β_0), g_{0b}(z, β_0)]^T. The following theorem presents the asymptotic properties of the penalized nonparametric estimator ĝ(z, β̂), including the oracle property, sparsity and asymptotic normality of the estimator ĝ(z, β̂).

Theorem 2 (Oracle Property). Let (X_i, Z_i) be a strong mixing and strictly stationary sequence. Under Assumptions A1–A8, if ∥β̂ − β_0∥ = O_p(n^{−1/2}), lim_{n→∞} inf_{θ→0+} P'_{λ_n}(θ)/λ_n > 0, h ∝ n^{−1/5} and n^{−1/10} λ_n → 0 as n → ∞, then

(a) Sparsity: sup_{Z∈A_z} ∥ĝ_k(z, β̂)∥ = 0 for all p_0 < k ≤ p.
(b) Asymptotic normality:

    √(nh) [ĝ_a(z, β̂) − g_{0a}(z, β_0) − h² B(z, β_0) + o_p(h²)] ∼ N(0, V(z, β_0)),

where V(z, β_0) = ν_0 M^{−1}(z, β_0) σ², and

    B(z, β_0) = μ_2 M^{−1}(z, β_0) Ṁ(z, β_0) ġ(z, β_0) + (1/2) μ_2 g̈(z, β_0),

with M(z, β_0) = f(z, β_0) Ω(z, β_0), Ṁ(z, β_0) = ∂M(z, β_0)/∂z, ġ(z, β_0) = ∂g(z, β_0)/∂z and g̈(z, β_0) = ∂ġ(z, β_0)/∂z.

To perform variable selection for the variables with parametric coefficients, we should minimize the penalized least squares listed in Eq. (3). We assume that the first d_1 coefficients of β are nonzero and all the rest of the parameters are zero. That is, β_0 = (β_{10}^T, β_{20}^T)^T, where all elements of β_{10} with dimension d_1 are nonzero and the (d − d_1)-dimensional coefficient vector β_{20} = 0. Finally, define V_n = Σ_{i=1}^n (Z_i − E(Z_i|β_0^T Z_i)) ġ^T(β_0^T Z_i) X_i ε_i, where the vector ġ(·) is the first derivative of the function vector g(·), and ε_i is independent and identically distributed (i.i.d.) with mean 0 and standard deviation σ. Let V_0 = (1/n) Var(V_n)/σ², and define e to be an asymptotically standard normal random d-dimensional vector such that V_n = n^{1/2} σ V_0^{1/2} e. Let V_{1n} = Σ_{i=1}^n (Z_{1i} − E(Z_{1i}|β_{10}^T Z_{1i})) ġ^T(β_{10}^T Z_{1i}) X_i ε_{1i}, where ε_{1i} is the same as ε_i since β_{20} = 0. Similarly, we define V_{10} = (1/n) Var(V_{1n})/σ² and let e_1 be an asymptotically standard normal random d_1-dimensional vector such that V_{1n} = n^{1/2} σ V_{10}^{1/2} e_1.

To study the asymptotic distribution of the penalized least squares estimator β̂, we impose some technical conditions below.

Assumption B.
B1. The vector functions g(·) have continuous second order derivatives on the support A_z.
B2. For any β ∈ B and Z ∈ A_z, the density function f(·, β) is continuous and there exists a small positive δ such that f(·, β) > δ.
B3. The kernel function K(·) is a bounded density with a bounded support region. Let μ_2 = ∫ v² K(v) dv and ν_0 = ∫ K²(v) dv.
B4. lim_{n→∞} inf_{θ→0+} P'_{ζ_n}(θ)/ζ_n > 0, ζ_n → 0, √n ζ_n → ∞ and h ∝ n^{−1/5}.
B5. Same as Assumption A6.
B6. E(ε_i|X_i, Z_i) = 0, E(ε_i²|X_i, Z_i) = σ², E|X_i|^m < ∞ and E|y_i|^m < ∞ for all m > 0.

Remark 4. The assumptions in B4 underlie the oracle property in Theorem 4. An alternative condition for the bandwidth in Ichimura (1993) is nh^8 → 0. However, the condition nh^8 → 0 is still satisfied by our condition h ∝ n^{−1/5} in B4. As for Assumption B6, it is not hard to extend to the heteroscedastic case, E(ε_i²|X_i, Z_i) = σ²(X_i, Z_i), which requires some higher moment conditions on X_i and y_i so that the Chebyshev inequality can be applied.

When the random variables {Γ_i}_{i=1}^∞ are either i.i.d. or a martingale difference sequence, V_{10} becomes V_{10} = Γ(0) = Var(Γ_i). Otherwise, the autocovariance function Γ(ℓ) may not be zero for at least some lag orders ℓ > 0 due to serial correlation. Theorem 4 shows that our variable selection procedures based on minimizing the penalized least squares enjoy the oracle property.

3.3. Choosing bandwidth and tuning parameters

To do the nonparametric estimation and variable selection simultaneously, we should choose suitable regularization parameters: the bandwidth h for the nonparametric estimator and the λ's for the penalty terms. For simplicity, we consider global bandwidth selection rather than pointwise selection. Recent literature reveals that the
BIC-type selector identifies the true model consistently and the resulting estimator possesses the oracle property. In contrast, the AIC-type selector tends to be less efficient and to overfit in the final model; see the papers by Wang et al. (2007) and Zhang et al. (2010). This motivates us to select the bandwidth h and the tuning parameters λ's simultaneously with a BIC-type criterion. We define our BIC criterion as

    BIC(h, λ) = log(SSE(h, λ)) + df(h, λ) log(n)/n,

where SSE(h, λ) is the sum of squared errors obtained from the penalized least squares with parameters (h, λ), and df(h, λ) is the number of nonzero coefficients of β̂ conditional on the parameters h and λ. This BIC criterion is reasonable since it balances the trade-off between the variance and the number of non-zero coefficients in terms of the bandwidth h and the tuning parameters λ's. Further, it enjoys the property of consistency, which indicates that it selects the correct model with probability one as the sample size goes to infinity (Zhang et al., 2010). However, it is still computationally expensive to choose the d-dimensional tuning parameters λ = (λ_1, λ_2, ..., λ_d)^T. By adopting the idea of Fan and Li (2004) to reduce the dimension of λ, we let λ_n = λ_0 σ̂(β̂_k^{(0)}), where σ̂(β̂_k^{(0)}) is the standard deviation of the unpenalized estimator β̂_k^{(0)}. The theoretical properties of BIC(h, λ) and of the dimension reduction technique with λ_k = λ_0 σ̂(β̂_k^{(0)}) need further study and can be regarded as future research topics. The reader is referred to the papers by Cai et al. (2000a) and Fan and Li (2001) for more on choosing the bandwidth in nonparametric estimation and the tuning parameters in variable selection.

4. Monte Carlo simulations

Example 1. In this example, we study the finite sample performance of the variable selection for covariates with functional coefficients. In our simulations, the optimal bandwidth and the tuning parameter λ_n are chosen by the BIC criterion in Section 3.3. The Epanechnikov kernel K(x) = 0.75(1 − x²) for |x| ≤ 1 is used. We choose the value of a in the SCAD penalty to be 3.7, as suggested in Fan and Li (2001).

In this example, we assume that the data are generated by

    y_i = (Z_{1i} + Z_{2i}) + (Z_{1i} + Z_{2i})² X_{1i} + σ ε_i,  1 ≤ i ≤ n,

and the working model is

    y_i = g_0(β^T Z_i) + Σ_{k=1}^6 g_k(β^T Z_i) X_{ki} + e_i,

where ε_i is generated from the standard normal distribution and Z = (Z_1, Z_2)^T with Z_1 = Φ(Z_1^*), Z_2 = Φ(Z_2^*) and Φ(·) being the cumulative standard normal distribution function. The eight-dimensional vector (Z_1^*, Z_2^*, X_1, ..., X_6)^T follows the vector autoregressive process

    (Z_i^{*T}, X_i^T)^T = A (Z_{i−1}^{*T}, X_{i−1}^T)^T + ξ_i,

where Z^* = (Z_1^*, Z_2^*)^T, X = (X_1, X_2, ..., X_6)^T and A is an 8 × 8 matrix with the diagonal elements being 0.15 and all others being 0.05. The initial value of (Z_1^{*T}, X_1^T)^T and each component of the random vector ξ_i are generated from i.i.d. standard normal distributions. Note that for this setup, the data generated by the above autoregressive process are weakly dependent. We consider three sample sizes, n = 200, n = 400 and n = 1000, and two standard deviations, σ = 2 and σ = 4. The sample sizes n = 200, 400, 1000 correspond to about one year, two years and four years of trading days, respectively. For each setting, we replicate 1000 times. The "Shrinkage rate" and "Keeping rate" are reported in Table 1, in which "Shrinkage rate" represents the percentage of replications in which the five zero functional coefficients are correctly shrunk to 0 and "Keeping rate" stands for the percentage of replications in which the two non-zero functional coefficients are correctly not set to 0. Clearly, one can see from Table 1 that the "Shrinkage rate" and "Keeping rate" improve with a larger sample size and smaller noise. Meanwhile, it shows that the proposed estimator performs as well as the oracle estimator when the sample size is n = 1000, as well as in the case of n = 400 and σ = 2. This simulation shows that the proposed variable selection procedures perform fairly well in finite samples.

Example 2. To examine the performance of the variable selection for local variables with parametric coefficients, similar to Tibshirani (1996) and Fan and Li (2001), our data generating process is given below:

    y_i = u_i + u_i² X_i + σ ε_i,

where u_i = Z_i^T β, β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and ε_i is generated from the standard normal distribution. Furthermore, the nine-dimensional vector (Z_i^T, X_i)^T is generated from the following vector autoregressive process

    (Z_i^T, X_i)^T = A^* (Z_{i−1}^T, X_{i−1})^T + e_i,

where A^* is a 9 × 9 matrix with the diagonal elements being 0.15 and all others being 0.05. The initial value of (Z_1^T, X_1)^T and each element of the random vector e_i are generated from i.i.d. standard normals. Similar to the previous example, we consider three sample sizes, n = 200, 400 and 1000, and for each simulation we replicate 1000 times. We also consider two values of σ, σ = 7.5 and σ = 15. Table 2 displays the simulation results of the SCAD variable selection for the local variables with parametric coefficients. Similar to the conclusions from Table 1, it can be seen from Table 2 that the "Shrinkage rate" for irrelevant local variables and the "Keeping rate" for relevant local variables perform better with a larger sample size and smaller noise.

Table 1
Simulation results for the covariates with functional coefficients.

                      σ = 4     σ = 2
  n = 200
    Shrinkage rate    79.4%     93.4%
    Keeping rate      92.0%     99.8%
  n = 400
    Shrinkage rate    94.5%     100%
    Keeping rate      98.6%     100%
  n = 1000
    Shrinkage rate    100%      100%
    Keeping rate      100%      100%

Table 2
Simulation results for the local variable with parametric coefficients.

                      σ = 15    σ = 7.5
  n = 200
    Shrinkage rate    83.2%     91.1%
    Keeping rate      93.4%     96.9%
  n = 400
    Shrinkage rate    92.3%     100%
    Keeping rate      97.5%     100%
  n = 1000
    Shrinkage rate    100%      100%
    Keeping rate      100%      100%
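The weakly dependent regressors in Examples 1 and 2 are driven by a first-order vector autoregression; a sketch of that data generating process (with the stated diagonal 0.15 and off-diagonal 0.05 coefficients) is:

```python
import numpy as np

def simulate_var(n, dim, diag=0.15, off=0.05, seed=0):
    """Simulate W_i = A W_{i-1} + xi_i, where A has `diag` on the diagonal
    and `off` elsewhere; the initial value and the shocks xi_i are
    i.i.d. standard normal, as in Examples 1 and 2."""
    rng = np.random.default_rng(seed)
    A = np.full((dim, dim), off) + (diag - off) * np.eye(dim)
    W = np.empty((n, dim))
    W[0] = rng.normal(size=dim)
    for i in range(1, n):
        W[i] = A @ W[i - 1] + rng.normal(size=dim)
    return W

# Example 1: (Z1*, Z2*, X1, ..., X6) with dim = 8; Example 2 uses dim = 9
W = simulate_var(400, 8)
```

For dim = 8 the largest eigenvalue of A is 0.15 + 7 × 0.05 = 0.5 < 1, so the process is stationary and weakly dependent, consistent with the strong mixing setting of the paper.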
εt = σt et et ∼ skewed-t (λ, ν)
size and smaller noise. Specifically, it performs as good as the or- σt2 = ω + αεt2−1 + γ εt2−1 It −1 + ρσt2−1
acle estimator for the cases where n = 400 and σ = 7.5 as well
as sample size n = 1000. The Monte Carlo simulation results indi- where zt = β1 rt −1 + β2 rt −2 + β3 rt −3 and we assume β12 +
cate that our variable selection for local variables merits good finite β22 + β32 = 1 in order to satisfy the identification condition. The
sample properties. standardized residuals et is skewed-t distributed with skewness
parameter λ and degree of freedom ν , γ captures the leverage
Example 3. To investigate the performance of variable selection effect. The indicator function It −1 takes value of 1 for εt ≤ 0 and 0
for covariates and local variables simultaneously, we do one more otherwise. This model can be viewed as an extension of the model
step with variable selection for local variables in Example 1. All the by Chen and Tsay (1993). We use the two-step variable selection
settings are the same as in Example 1, except the true model is procedures to select variables and to estimate unknown coefficient
defined as functions simultaneously. Firstly, we select covariates based on
penalized local least squares and then do variable selection for local
yi = g0 (Z1i ) + g1 (Z1i )X1i + σ εi , 1 ≤ i ≤ n,
important variables based on penalized global least squares. After
where g0 (u) = u and g1 (u) = u . We assume that the index coef-
2
the two-step variable selection procedures are employed in above
ficient depends only on local variable Z1 in this true model. Local model, the estimated coefficients of local variables and the norms
variable Z2 and five covariates (X2 , . . . , X6 ) are not included in the of covariates are reported in Table 5. Note that zt may include other
model but estimated in the working model; see Example 1 for de- financial or state economy variables as in Cai et al. (2014a).
tails. Two-step selection procedures are employed in this simula- In the columns of local variables, both one day lagged return
tion. The first is to select six covariates (X1 , . . . , X6 ) with functional and two days lagged return have effect on the daily return of
coefficients as well as constant term. Then, we perform variable se- these three indexes. Three days lagged return does not have any
lection for local variables Z1 and Z2 with parametric coefficients. effect on the daily return of both NASDAQ and S&P 500. Only
The simulation results for these two-step selection procedures are one week lagged return contributes to the weekly return of both
tabulated in Table 3. Table 3 shows that with larger sample size NASDAQ and S&P 500. However, two weeks lagged return does not
and smaller noise, shrinkage rates for both nonsignificant covari- have any contribution. Specifically, one week lagged return and
ates and local nonsignificant covariates become larger. These indi- three weeks lagged return perform similar for the weekly return
cate that our two-step procedures perform quite well so that the of DOW. For the monthly horizon, one month lagged return has a
proposed methods are efficient. significant effect on these three indexes, two months lagged return
for NASDAQ and three months lagged return for both DOW and S&P
5. Empirical example

In the previous section, we conducted Monte Carlo simulation studies to illustrate the effectiveness of the proposed estimation methods. In this section, to demonstrate the practical usefulness of the proposed model and its estimation methods, we apply these methodologies to examine the predictability of asset returns. Our data consist of daily, weekly and monthly returns on three indexes: the Dow Jones Industrial Average, the NASDAQ Composite and the S&P 500. The sample covers the 20 years from May 1, 1994 to April 30, 2014; it ends on April 30 because most listed corporations post their annual reports at the end of April. A sample of this length is used so that there are enough data for the nonparametric estimation in the model. All the data are downloaded from the Wind Information database. Table 4 shows the summary statistics of returns for the one day, one week and one month horizons. All horizons show negative skewness, which indicates that a relatively long lower tail exists. For one day and one week

In the columns of covariates in Table 5, only the one day lagged and the two days lagged covariates contribute to the daily return of the three indexes. Meanwhile, for the weekly horizon, only the one week lagged and the three weeks lagged covariates are important factors for the weekly return of the three indexes. Further, for the monthly horizon, the one month lagged covariate contributes to the monthly return of all three indexes, with the two months lagged return also selected for the NASDAQ and the three months lagged return for the DOW and the S&P 500.

The coefficients of the GJR-GARCH model for the error terms are tabulated in Table 6. The significance of the skewness λ and the degrees of freedom µ at all horizons points to non-normality of the standardized residuals. It is interesting that leverage effects exist at both the one day and the one week horizon; however, they cannot be observed at the one month horizon. Meanwhile, we cannot find any heteroscedasticity in terms of the GJR-GARCH model at the one month horizon.

6. Conclusion
Table 4
Summary statistics of returns for different horizons.

            Sample size   Mean   Median   StdDev   Skewness   Kurtosis   Min   Max   ρ1   Box–Pierce test

Table 5
Coefficients for local variables and covariates.

                  Local variables                 Covariates(a)
                  rt−1     rt−2      rt−3         1          rt−1      rt−2      rt−3      rt−4   rt−5   rt−6
One day horizon
DOW               0.7087   0.6385    −0.3000      3.1532     2.8250    2.6166    0         0      0      0
NASDAQ            0.6609   −0.7505   0            5.0515     3.2321    3.7374    0         0      0      0
S&P 500           0.4183   −0.9083   0            4.4862     1.8567    3.9876    0         0      0      0
One week horizon
DOW               0.7716   0         0.6360       16.4212    8.0355    0         6.6187    0      0      0
NASDAQ            1        0         0            30.9023    14.9169   0         4.3741    0      0      0
S&P 500           1        0         0            15.5780    9.5644    0         2.1025    0      0      0
One month horizon
DOW               0.8638   0         −0.5034      70.1698    59.3783   0         34.5686   0      0      0
NASDAQ            0.8488   0.5286    0            189.735    68.8333   42.7579   0         0      0      0
S&P 500           0.8697   0         0.4934       98.7581    48.1937   0         27.6172   0      0      0

(a) We calculate the norm of the functional coefficients for covariates.
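The columns of Table 4 are standard sample moments together with the first-order autocorrelation ρ1. As an illustration of how such a row is computed (on simulated heavy-tailed returns, since the Wind series are not reproduced here; the function name `summary_row` is ours, not from the paper):

```python
import numpy as np
from scipy import stats

def summary_row(r):
    """Summary statistics in the style of Table 4 for a return series r."""
    r = np.asarray(r, dtype=float)
    rho1 = np.corrcoef(r[:-1], r[1:])[0, 1]  # first-order autocorrelation ρ1
    return {
        "n": len(r),
        "mean": r.mean(),
        "median": np.median(r),
        "stddev": r.std(ddof=1),
        "skewness": stats.skew(r),
        "kurtosis": stats.kurtosis(r, fisher=False),  # Pearson kurtosis, 3 = normal
        "min": r.min(),
        "max": r.max(),
        "rho1": rho1,
    }

# synthetic heavy-tailed "daily returns" in place of the actual index data
rng = np.random.default_rng(0)
daily = rng.standard_t(df=5, size=5000) * 0.01
row = summary_row(daily)
```

A Box–Pierce (or Ljung–Box) statistic on the autocorrelations would complete the row; it is omitted here to keep the sketch minimal.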
n⁻¹ Σ_{i=1}^n ∥ĝ(β̂ᵀZi) − g0(β0ᵀZi)∥² = Op(n⁻⁴ᐟ⁵).

Proof. By the triangle inequality,

n⁻¹ Σ_{i=1}^n ∥ĝ(β̂ᵀZi) − g0(β0ᵀZi)∥² ≤ n⁻¹ Σ_{i=1}^n ∥ĝ(β̂ᵀZi) − g0(β̂ᵀZi)∥² + n⁻¹ Σ_{i=1}^n ∥g0(β̂ᵀZi) − g0(β0ᵀZi)∥².

Next, consider D. We have

D = ⋯ + h Σ_{k=1}^{p} [Pλn(∥g0·k + (nh)⁻¹ᐟ² u·k∥) − Pλn(∥g0·k∥)]
  ≥ n⁻¹ Σ_{j=1}^{n} [uj·ᵀ Σ̂(β̂ᵀZj) uj· − 2uj·ᵀ êj] + h Σ_{k=1}^{p0} [Pλn(∥g0·k + (nh)⁻¹ᐟ² u·k∥) − Pλn(∥g0·k∥)],

where Σ̂(β̂ᵀZj) = n⁻¹ Σ_{i=1}^n Xi Xiᵀ Kh(β̂ᵀZi − β̂ᵀZj) and

êj = n⁻¹ Σ_{i=1}^n [Xi Xiᵀ(g0(β0ᵀZi) − g0(β̂ᵀZi)) + Xi Xiᵀ(g0(β̂ᵀZi) − g0(β̂ᵀZj)) + Xi εi] Kh(β̂ᵀZi − β̂ᵀZj).

Let λ̂min^j be the smallest eigenvalue of Σ̂(β̂ᵀZj), λ̂min = min{λ̂min^j, j = 1, …, n} and ê = (ê1, …, ên)ᵀ ∈ R^{n×p}. Then,

D ≥ n⁻¹ Σ_{j=1}^{n} (∥uj·∥² λ̂min^j − 2∥uj·∥∥êj∥) − n⁻¹ᐟ² h¹ᐟ² αn Σ_{k=1}^{p0} ∥u·k∥,

where the first term on the right hand side follows by the Cauchy–Schwarz inequality and the second term follows by a Taylor expansion and the triangle inequality. Therefore,

D ≥ λ̂min n⁻¹ Σ_{j=1}^{n} ∥uj·∥² − 2(n⁻¹∥u∥²)¹ᐟ² (n⁻¹∥ê∥²)¹ᐟ² − n⁻¹ᐟ² h¹ᐟ² αn Σ_{k=1}^{p0} ∥u·k∥
  ≥ λ̂min n⁻¹∥u∥² − 2(n⁻¹∥u∥²)¹ᐟ² (n⁻¹∥ê∥²)¹ᐟ² − h¹ᐟ² αn √p0 (n⁻¹ Σ_{k=1}^{p0} ∥u·k∥²)¹ᐟ².

To bound n⁻¹∥ê∥², decompose nh E∥êj∥² according to the three components of êj, where A denotes the first term, B is for the second term and D̃ stands for the last term. We introduce the notation zi = βᵀZi, zs = βᵀZs and zj = βᵀZj. Then

A = n⁻¹h E Σ_{i≠s≠j} [(g0(β̂ᵀZi) − g0(β̂ᵀZj))ᵀ Xi Xiᵀ Xs Xsᵀ (g0(β̂ᵀZs) − g0(β̂ᵀZj)) Kh(β̂ᵀZi − β̂ᵀZj) Kh(β̂ᵀZs − β̂ᵀZj)] + n⁻¹h E Σ_{(i=s)≠j} {···}
  ≡ n⁻¹h E Σ_{i≠s≠j} [(g0(zi) − g0(zj))ᵀ Xi Xiᵀ Xs Xsᵀ (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)] + n⁻¹h E Σ_{(i=s)≠j} {···}
  ≡ A1 + A2,
Z. Cai et al. / Journal of Econometrics 189 (2015) 272–284 281
where A1 denotes the first term and A2 is for the second term. It is easy to show that

A1 ≡ nh E{(g0(zi) − g0(zj))ᵀ Xi Xiᵀ Xs Xsᵀ (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)} + Rm
  = nh E{(g0(zi) − g0(zj))ᵀ Ω(zi, zs, zj) (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj)} + Rm
  = nh ∫ E{(g0(zi) − g0(zj))ᵀ Ω(zi, zs, zj) (g0(zs) − g0(zj)) Kh(zi − zj) Kh(zs − zj) | zj} f(zj) dzj + Rm
  ≡ A11 + Rm,

where the definition of A11 is apparent and Rm is an ignorable small order term, which might be different at different appearances. Let zi = zj + wh and zs = zj + vh. Then,

A11 = nh ∫∫∫ (ġ0(zj)wh + ½C1w²h²)ᵀ Ω(zj + wh, zj + vh, zj) (ġ0(zj)vh + ½C2v²h²) k(w)k(v) f((zj + wh, zj + vh)|zj) dw dv f(zj) dzj ≡ nh ∫ A12(zj) f(zj) dzj,

where

A12(zj) = ∫∫ (ġ0(zj)wh + ½C1w²h²)ᵀ [Ω(zj, zj, zj) + Ω1(zj, zj, zj)wh + Ω2(zj, zj, zj)vh + op(w²h²) + op(v²h²)] (ġ0(zj)vh + ½C2v²h²) [f((zj, zj)|zj) + f1((zj, zj)|zj)wh + f2((zj, zj)|zj)vh] dw dv = I12(Zj)h⁴ + Rm,

where I12(Zj) is an integrable function. Then, A1 = Op(nh⁵) = Op(1). Also, we can show that

A2 = n⁻¹h E Σ_{i≠j} [(g0(zi) − g0(zj))ᵀ Xi Xiᵀ Xi Xiᵀ (g0(zi) − g0(zj)) Kh²(zi − zj)] = h E{A21(zj)} + Rm,

where

A21(zj) ≡ ∫ (g0(zi) − g0(zj))ᵀ Ω(zi, zj) (g0(zi) − g0(zj)) Kh²(zi − zj) f(zi|zj) dzi.

Let zi = zj + wh. Then,

A21(zj) = h⁻¹ ∫ (ġ0(zj)wh + Cw²h²)ᵀ Ω(zj + wh, zj) (ġ0(zj)wh + Cw²h²) k²(w) f(zj + wh|zj) dw = I21(Zj) h ∫ w²k²(w) dw + Rm,

where I21(Zj) is an integrable function; then A2 = Op(h²) = op(1). Hence, A = Op(1). Now, we consider the term B as follows:

B = n⁻¹h E [Σ_{i=1}^n Xi εi Kh(zi − zj)]ᵀ [Σ_{s=1}^n Xs εs Kh(zs − zj)]
  = n⁻¹h E Σ_{(i=s)≠j} Xiᵀ Xs εi εs Kh(zi − zj) Kh(zs − zj) + 2n⁻¹h E Σ_{(i=j)≠s} Xiᵀ Xs εi εs Kh(zi − zj) Kh(zs − zj) + n⁻¹h E Σ_{(i≠s)≠j} Xiᵀ Xs εi εs Kh(zi − zj) Kh(zs − zj) + n⁻¹h E Σ_{i=s=j} Xiᵀ Xs εi εs Kh(zi − zj) Kh(zs − zj)
  ≡ B1 + B2 + B3 + B4,

where the definitions of the Bj's are apparent. Now,

B1 = h E[Xi Xiᵀ εi² Kh²(zi − zj)] + Rm = h E[Xi Xiᵀ Kh²(zi − zj) E(εi²|Xi, zi, zj)] + Rm = hσ² E[Xi Xiᵀ Kh²(zi − zj)] + Rm,

and

E[Ω(zi, zj) Kh²(zi − zj) | zj] = h⁻² ∫ Ω(zi, zj) k²((zi − zj)/h) f(zi|zj) dzi = h⁻¹ ∫ Ω(zj + wh, zj) k²(w) f(zj + wh|zj) dw = IB2(zj) Op(1/h) ∫ k²(w) dw,

so that B1 = Op(1). Also,

B4 = n⁻¹h E[Xjᵀ Xj εj² Kh²(0)] = n⁻¹h E[Xjᵀ Xj E(εj²|Xj) Kh²(0)] = n⁻¹h σ² Kh²(0) E[Xjᵀ Xj] = Op(n⁻⁴ᐟ⁵).

Thus, B = Op(1). Now,

D̃ = n⁻¹h E ∥Σ_{i=1}^n [Xi Xiᵀ (g0(β0ᵀZi) − g0(β̂ᵀZi))] Kh(β̂ᵀZi − β̂ᵀZj)∥²
  ≤ h E ∥[Xi Xiᵀ (g0(β0ᵀZi) − g0(β̂ᵀZi))] Kh(β̂ᵀZi − β̂ᵀZj)∥²
  = h E ∥[Xi Xiᵀ (ġ0(β0ᵀZi)(β̂ − β0)ᵀZi + op(n⁻¹ᐟ²))] Kh(β̂ᵀZi − β̂ᵀZj)∥²
  ≤ C(h/n) E ∥Xi Xiᵀ Kh(β̂ᵀZi − β̂ᵀZj)∥²
  = Op(1/n).

This proves the lemma.

Now,

ĝa(z, β̂) = [Σ_{i=1}^n Xia Xiaᵀ Kh(β̂ᵀZi − z)]⁻¹ Σ_{i=1}^n Xia yi Kh(β̂ᵀZi − z).

Then,

ĝa(z, β̂) − g0a(z, β0) = {ĝa(z, β̂) − ĝa(z, β0)} + {ĝa(z, β0) − g0a(z, β0)}.

By a Taylor expansion, the first term on the right hand side of the above equation is of order Op(n⁻¹ᐟ²) and the second term is of order Op(n⁻²ᐟ⁵). Thus the asymptotic property of ĝa(z, β̂) − g0a(z, β0) is the same as that of the second term, and the asymptotic property of the second term, ĝa(z, β0) − g0a(z, β0), can be found in the proof of Theorem 3 of Xia and Li (1999).
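The estimator ĝa(z, β̂) above is a kernel-weighted least squares (local constant) fit at the index value z. A minimal sketch of how it can be computed, assuming a Gaussian kernel and synthetic data (the function names, the data-generating design and the kernel choice are ours, for illustration only):

```python
import numpy as np

def kernel(u):
    """Gaussian kernel k(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def g_hat(z, beta, Z, X, y, h):
    """Local constant estimator
    [sum_i X_i X_i' K_h(beta'Z_i - z)]^{-1} sum_i X_i y_i K_h(beta'Z_i - z)."""
    idx = Z @ beta                      # single index beta'Z_i
    w = kernel((idx - z) / h) / h       # K_h(beta'Z_i - z)
    S = (X * w[:, None]).T @ X          # sum_i w_i X_i X_i'
    T = (X * w[:, None]).T @ y          # sum_i w_i X_i y_i
    return np.linalg.solve(S, T)

# synthetic check: y_i = g1(beta'Z_i) x_{1i} + g2(beta'Z_i) x_{2i} + noise
rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 2))
beta = np.array([0.6, 0.8])             # ||beta|| = 1 identification
u = Z @ beta
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.sin(u) * X[:, 0] + 0.5 * u * X[:, 1] + 0.1 * rng.normal(size=n)
est = g_hat(0.0, beta, Z, X, y, h=n ** (-1 / 5))  # h proportional to n^{-1/5}
```

At z = 0 the true coefficients are (sin(0), 0) = (0, 0), so both components of `est` should be close to zero; the bandwidth rate h ∝ n⁻¹ᐟ⁵ matches the one assumed in the lemmas.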
Lemma 2. Let {Xi, Zi, yi} be a strong mixing and strictly stationary sequence, h ∝ n⁻¹ᐟ⁵, lim_{n→∞} inf_{θ→0⁺} P′λn(θ)/λn > 0, and n⁻¹ᐟ¹⁰ λn → 0. Then, ∥ĝ·k∥ = 0 as n → ∞ for k > d0.

Proof. Assume ∥ĝ·k∥ ≠ 0. Then,

∂Q(G, β̂, h)/∂g·k = J1 + J2 = 0,

where J1 = (J11, J12, …, J1n)ᵀ and J1j = −2 Σ_{i=1}^n Xik (yi − ĝᵀ(β̂ᵀZj)Xi).

Proof of Theorem 3. It follows from Theorem 1 in Xia and Li (1999) that

Q̂1(β, h) = S̃(β) + T(h) + R1(β, h) + R2(h),

where Q̂1(β, h) = Σ_{i=1}^n (yi − ĝᵀ(βᵀZi)Xi)², T(h) and R2(h) do not depend on β, and R1(β, h) is an ignorable term. Furthermore,

S̃(β) = n[Ṽ0¹ᐟ²(β − β0) − n⁻¹ᐟ²σε]ᵀ[Ṽ0¹ᐟ²(β − β0) − n⁻¹ᐟ²σε]

and

Σ_{k=1}^{d1} nΨζn(|β10k + δn tk|) − Σ_{k=1}^{d1} nΨζn(|β10k|)
  = n Σ_{k=1}^{d0} [δn Ψ′ζn(|β10k|) sgn(β10k) tk + ½δn² Ψ″ζn(|β10k|) tk²] + op(nδn²)
  ≤ d1 nδn an ∥t∥ + ½nδn² max_{1≤k≤d0}{Ψ″ζn(|β10k|)} ∥t∥² + op(nδn²)   (by the Cauchy–Schwarz inequality)
  ≤ nδn² d0 C + Op(nδn²)

as n → ∞ and max_{1≤k≤d0}{Ψ″ζn(|β10k|)} → 0, and

½ Σ_{i=1}^n (yi − ĝᵀ(β0ᵀZi + δn tᵀZi)Xi)² − ½ Σ_{i=1}^n (yi − ĝᵀ(β0ᵀZi)Xi)²
  = ½n[Ṽ0¹ᐟ² δn t − n⁻¹ᐟ²σε]ᵀ[Ṽ0¹ᐟ² δn t − n⁻¹ᐟ²σε] − ½n[n⁻¹ᐟ²σε]ᵀ[n⁻¹ᐟ²σε] + R1(β0 + δn t, h) − R1(β0, h) + op(1)   (by the theorem in Xia and Li, 1999)
  = ½nδn² tᵀ Ṽ0 t − n¹ᐟ² δn tᵀ Ṽ0¹ᐟ² σε + R1(β0 + δn t, h) − R1(β0, h) + op(1)
  = ½nδn² tᵀ Ṽ0 t − δn tᵀ Vn + R1(β0 + δn t, h) − R1(β0, h) + op(1).

Since the R1 terms are negligible as n → ∞ and n⁻¹ᐟ² Vn = Op(1), we have −δn tᵀ Vn = C · Op(δn √n) = C · Op(1). By choosing a sufficiently large C, the term ½nδn² tᵀ Ṽ0 t will dominate the others. Hence, Dn ≥ 0 holds.

Proof of Theorem 4. Let β̂1 − β10 = Op(n⁻¹ᐟ²). We want to show that (β̂1ᵀ, 0ᵀ)ᵀ = argmin_{(β1ᵀ, β2ᵀ)ᵀ ∈ B} Q((β1ᵀ, β2ᵀ)ᵀ, ĝ). It suffices to show that for some constant C and k = q0 + 1, …, q,

∂Q((β1ᵀ, β2ᵀ)ᵀ, ĝ)/∂βk > 0 for 0 < βk < Cn⁻¹ᐟ²   and   < 0 for −Cn⁻¹ᐟ² < βk < 0.

Note that

∂Q̂1(β, h)/∂βk = ∂S̃(β)/∂βk + Rm = ekᵀ ∂S̃(β)/∂β + Rm = 2n ekᵀ Ṽ0(β − β0) − 2n¹ᐟ² σ ekᵀ Ṽ0¹ᐟ² ε + Rm = 2n ekᵀ Ṽ0(β − β0) − 2ekᵀ Vn + Rm,

where Rm represents a small order term and ek is a d-dimensional vector with kth element being one and all others being zero. Since β − β0 = Op(1/√n) and Vn = Op(√n), then

∂Ŝ(β, h)/∂βk = Op(√n)

and

∂Q((β1ᵀ, β2ᵀ)ᵀ, ĝ)/∂βk = ½ ∂Q̂1(β, h)/∂βk + nΨ′ζn(|βk|) sgn(βk) = nζn [Op(1/(√n ζn)) + (Ψ′ζn(|βk|)/ζn) sgn(βk)].

Since √n ζn → ∞ and lim inf_{n→∞, βk→0⁺} Ψ′ζn(|βk|)/ζn > 0, the sign of ∂Q/∂βk is determined by the sign of βk. It follows from Part (a) that

∂Q((β1ᵀ, β2ᵀ)ᵀ, ĝ)/∂β |_{β = (β̂1ᵀ, 0ᵀ)ᵀ} = 0

and

½ ∂Ŝ((β̂1ᵀ, 0ᵀ)ᵀ, h)/∂β1 + ½ nΔΨζn^{d1} = 0,

where ΔΨζn^{d1} = {Ψ′ζn(|β1|) sgn(β1), …, Ψ′ζn(|βd1|) sgn(βd1)}ᵀ. Note that as n → ∞ and ζn → 0, Ψ′ζn(|βk|) = 0 for k = 1, …, d1, and

½ ∂Ŝ((β̂1ᵀ, 0ᵀ)ᵀ, h)/∂β1 = 0,

which implies that

√n(β̂1 − β10) = V10⁻¹ (1/√n) V1n = V10⁻¹ (1/√n) Σ_{i=1}^n (Z1i − E(Z1i|β10ᵀZ1i)) ġᵀ(β10ᵀZ1i) Xi ε1i + op(1),

so that

√n(β̂1 − β10) →D N(0, V10⁻¹ [Σ_{ℓ=−∞}^{∞} Γ(ℓ)] V10⁻¹),

where Γ(ℓ) = E(Γi Γi−ℓᵀ) with Γi = (Z1i − E(Z1i|β10ᵀZ1i)) ġᵀ(β10ᵀZ1i) Xi ε1i.

References

Akaike, H., 1973. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60, 255–265.
Box, G.E.P., Jenkins, G.M., 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37, 373–384.
Brent, A.J., Lin, D.Y., Zeng, D., 2008. Penalized estimating functions and variable selection in semiparametric regression models. J. Amer. Statist. Assoc. 103, 672–680.
Cai, Z., 2002. Regression quantiles for time series. Econometric Theory 18, 169–192.
Cai, Z., Fan, J., Li, R., 2000a. Efficient estimation and inferences for varying coefficient models. J. Amer. Statist. Assoc. 95, 888–902.
Cai, Z., Fan, J., Yao, Q., 2000b. Functional-coefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95, 941–956.
Cai, Z., Ren, Y., Yang, B., 2014a. A semiparametric conditional capital asset pricing model. Working paper, The Wang Yanan Institute for Studies in Economics, Xiamen University.
Cai, Z., Wang, Y., 2014. Testing predictive regression models with nonstationary regressors. J. Econometrics 178, 4–14.
Cai, Z., Wang, Y., Wang, Y., 2014b. Testing instability in predictive regression model with nonstationary regressors. Econometric Theory (forthcoming). http://dx.doi.org/10.1017/S0266466614000590.
Campbell, J.Y., Yogo, M., 2006. Efficient tests of stock return predictability. J. Financ. Econ. 81, 27–60.
Chan, K.S., Tong, H., 1986. On estimating thresholds in autoregressive models. J. Time Ser. Anal. 7, 179–190.
Chen, R., Tsay, R.S., 1993. Functional coefficient autoregressive models. J. Amer. Statist. Assoc. 88, 298–308.
Fan, J., Li, R., 2001. Variable selection via non-concave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fan, J., Li, R., 2004. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Amer. Statist. Assoc. 99, 710–723.
Fan, J., Lv, J., 2010. A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20, 101–148.
Fan, J., Yao, Q., Cai, Z., 2003. Adaptive varying-coefficient linear models. J. R. Stat. Soc. Ser. B 65, 57–80.
Fan, J., Zhang, W.Y., 1999. Statistical estimation in varying coefficient models. Ann. Statist. 27, 1491–1518.
Fu, W.J., 1998. Penalized regressions: the bridge versus the LASSO. J. Comput. Graph. Statist. 7, 397–416.
Granger, C.W.J., Andersen, A.P., 1978. An Introduction to Bilinear Time Series Models. Vanderhoek and Ruprecht, Göttingen.
Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384.
Horowitz, J.L., 2009. Semiparametric and Nonparametric Methods in Econometrics. Springer-Verlag, New York.
Huang, J., Joel, L.H., Wei, F.R., 2010. Variable selection in nonparametric additive models. Ann. Statist. 38, 2282–2313.
Hunter, D.R., Li, R., 2005. Variable selection using MM algorithms. Ann. Statist. 33, 1617–1642.
Ichimura, H., 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single index models. J. Econometrics 58, 71–120.
Janson, S., 1987. Maximal spacing in several dimensions. Ann. Probab. 15, 274–280.
Kong, E., Xia, Y., 2007. Variable selection for the single index model. Biometrika 94, 217–229.
Li, Q., Jeffrey, S.R., 2007. Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton.
Li, R., Liang, H., 2008. Variable selection in semiparametric regression modeling. Ann. Statist. 36, 261–286.
Liang, H., Li, R., 2009. Variable selection for partially linear models with measurement errors. J. Amer. Statist. Assoc. 104, 234–248.
Liang, H., Liu, X., Li, R., Tsai, C.L., 2010. Estimation and testing for partially linear single-index models. Ann. Statist. 38, 3811–3836.
Lin, Y., Zhang, H., 2006. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34, 2272–2297.
Newey, W.K., Stoker, T.M., 1993. Efficiency of weighted average derivative estimators and index models. Econometrica 61, 1199–1223.
Phillips, P.C.B., Lee, J.H., 2013. Predictive regression under various degrees of persistence and robust long-horizon regression. J. Econometrics 177, 250–264.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Su, L., Zhang, Y., 2013. Variable selection in nonparametric and semiparametric regression models. In: Handbook in Applied Nonparametric and Semi-Nonparametric Econometrics and Statistics. Research Collection School of Economics.
Teräsvirta, T., 1994. Specification, estimation, and evaluation of smooth transition autoregressive models. J. Amer. Statist. Assoc. 89, 208–218.
Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B 58, 267–288.
Tong, H., 1990. Non-linear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, UK.
Wang, H., Li, G., Tsai, C.L., 2007. Regression coefficient and autoregressive order shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B 69, 63–78.
Wang, L.F., Li, H.Z., Huang, J.H., 2008. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Amer. Statist. Assoc. 103, 1556–1569.
Wang, H., Xia, Y., 2009. Shrinkage estimation of the varying coefficient model. J. Amer. Statist. Assoc. 104, 747–757.
Xia, Y., Li, W.K., 1999. On single-index coefficient regression models. J. Amer. Statist. Assoc. 94, 1275–1285.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67.
Zhang, Y., Li, R., Tsai, C.L., 2010. Regularization parameter selections via generalized information criterion. J. Amer. Statist. Assoc. 105, 312–323.
Zhang, H.H., Lin, Y., 2006. Component selection and smoothing for nonparametric regression in exponential families. Statist. Sinica 16, 1021–1041.
Zhao, P.X., Xue, L., 2010. Variable selection for semi-parametric varying coefficient partially linear errors-in-variables models. J. Multivariate Anal. 101, 1872–1883.
Zou, H., 2006. The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc. 101, 1418–1429.
Zou, H., Li, R., 2008. One-step sparse estimates in non-concave penalized likelihood models. Ann. Statist. 36, 1509–1533.