Automatic Selection by Penalized Asymmetric Lq-Norm in a High-Dimensional Model with Grouped Variables
Abstract
The paper focuses on the automatic selection of grouped explanatory variables in a high-dimensional model when the model errors are asymmetric. After introducing the model and the notation, we define the adaptive group LASSO expectile estimator, for which we prove the oracle properties: sparsity and asymptotic normality. Afterwards, the results are generalized by considering the asymmetric Lq-norm loss function. The theoretical results are obtained in several cases with respect to the number of variable groups. This number can be fixed or can depend on the sample size n, with the possibility that it is of the same order as n. Note that these new estimators allow us to consider weaker assumptions on the data and on the model errors than the usual ones. A simulation study demonstrates the competitive performance of the proposed penalized expectile regression, especially when the sample size is close to the number of explanatory variables and the model errors are asymmetric. An application to air pollution data is considered.
1 Introduction
With advances in computing and data collection, we are increasingly faced with handling prob-
lems of high-dimensional models with grouped explanatory variables for which an automatic
selection of relevant groups must be performed. For many real applications, the selection of the relevant grouped variables is very important for the prediction of the response variable and for the estimation of the model parameters. Several methods have been proposed in the literature for
automatic selection of the variables and then of the groups of relevant variables, by penalizing
the loss function with an adaptive penalty of LASSO type. The loss term is to be chosen ac-
cording to the assumptions on the model errors while the penalty term aims to select significant
(group of) explanatory variables. Let us then give some recent bibliographical references on
this topic. For the least square (LS) loss function with adaptive group LASSO penalty, Wang
and Leng (2008) prove the convergence rate and the oracle properties of the associated estima-
tor when the group number of explanatory variables is fixed. These results were extended in
Zhang and Xiang (2016) where the number p of groups depends on the sample size n but with
p of order strictly less than n. For a quantile model with grouped variables, Ciuperca (2019),
Ciuperca (2020) automatically select the groups of relevant variables by adaptive LASSO and
adaptive elastic-net methods, respectively. Based on PCA and PLS n^{1/2}-consistent estimators used as adaptive weights, Mendez-Civieta et al. (2021) study the sparsity of the adaptive group LASSO quantile estimator. On the other hand, Wang and Wang (2014) study the
adaptive LASSO estimators for generalized linear model (GLM), while Wang and Tian (2019)
consider the grouped variables for a GLM, their results including the case p > n. Zhou et al.
(2019) prove the oracle inequalities for the estimation and prediction error of overlapping group
Lasso method in the GLMs. Concerning the computational aspects, when the LS loss function is
penalized with L1 -norm for the subgroup of coefficients and with L2 -norm or L1 -norm for fused
coefficient subgroups, Dondelinger and Mukherjee (2020) present two coordinate descent algorithms for computing the corresponding estimators.
Let us emphasize now that, for a model with asymmetric errors, the LS estimation method is not appropriate, while the quantile method makes the inference more difficult because of the non-differentiability
of the loss function. Combining the ideas of the LS method with those of the quantile method, the expectile estimation method can be considered, with the advantage that the loss function is differentiable; the theoretical study is then more amenable and the numerical computation is simplified
(a comparison between quantile and expectile models can be found in Schulze-Waltrup et al.
(2015)). Remark also that the quantile model is a generalization of the median model and char-
acterizes the tail behaviour of a distribution, while the expectile method is a generalization of
least squares. The reader can found asymptotic properties of the sample expectiles in Holzmann
and Klar (2016). The automatic selection of the relevant variables in a model with ungrouped
explanatory variables was realized in Liao et al. (2019), Ciuperca (2021) by the adaptive LASSO
expectile estimation method, results generalized afterwards by Hu et al. (2021) for the asymmet-
ric Lq -norm loss function.
The present paper generalizes these last three papers to a model with grouped explanatory variables, the number p of groups being either fixed or dependent on n. The conver-
gence rates and the asymptotic distributions of the estimators depend on the size order of p with
respect to n. Note that our results remain valid when p is of the same order as n. It should also be noted that the proposed methods encounter fewer numerical problems than those based on the non-differentiable quantile loss function, proposed and studied in Ciuperca (2019), Ciuperca (2020). The
simulation study and application on real data show that our proposed adaptive group LASSO
methods have good performance.
The remainder of the paper is organized as follows. Section 2 introduces the model and general
notations. Section 3 defines and studies the adaptive group LASSO expectile estimator firstly
when the number p of groups is fixed and afterwards when p depends on the sample size n. The
method and the results are generalized in Section 4 for asymmetric Lq -norm loss function, where
convergence rate and oracle properties are stated for the adaptive group LASSO Lq -estimator.
Section 5 presenting simulation results is followed by an application on real data in Section 6.
The proofs of the theoretical results are relegated to Section 7.
2 Model and notations
In this section, we introduce the studied statistical model and some general notations.
We give some notation before presenting the statistical model. All vectors and matrices are denoted by bold symbols and all vectors are column vectors. For a vector v, we denote by v^⊤ its transpose, by ‖v‖_2 its Euclidean norm, and by ‖v‖_1 and ‖v‖_∞ its L_1 and L_∞ norms, respectively. For a positive definite matrix M, we denote by δ_min(M) and δ_max(M) its smallest and largest
eigenvalues, respectively. We will also use the following notations: if U_n and V_n are random variable sequences, V_n = o_P(U_n) means that lim_{n→∞} P(|V_n/U_n| > ε) = 0 for all ε > 0. Moreover, V_n = O_P(U_n) means that, for all ε > 0, there exists C > 0 so that lim_{n→∞} P(|V_n/U_n| > C) < ε. If (a_n)_{n∈N}, (b_n)_{n∈N} are two positive deterministic sequences, we write a_n ≫ b_n when lim_{n→∞} a_n/b_n = ∞. We use →_L and →_P to denote the convergence in distribution and in probability, respectively, as n → ∞. For an event E, 1_E denotes the indicator that the event E happens. For a real x, ⌊x⌋ is the integer part of x. For an index set A, we denote either by |A| or by Card(A) the cardinality of A and by A^c its complementary set. Throughout this paper, C will always denote a generic constant, not depending on n, whose value is not of interest. We will denote by 0_k the zero k-vector. When not specified, the convergence is as n → ∞.
We consider the following model with p groups of explanatory variables:

Y_i = Σ_{j=1}^{p} X_{ij}^⊤ β_j + ε_i = X_i^⊤ β + ε_i ,    i = 1, ..., n.    (1)
where Y_i is the response variable and ε_i is the model error, both being random variables. For each group j ∈ {1, ..., p}, the vector of parameters is β_j = (β_{j1}, ..., β_{jd_j}) ∈ R^{d_j} and the design for observation i is the d_j-vector X_{ij}. In other words, the vector of all coefficients is β = (β_1, ..., β_p), related to the vector of all explanatory variables X_i = (X_{i1}, ..., X_{ip}); we set r ≡ Σ_{j=1}^{p} d_j for the total number of coefficients. For j = 1, ..., p, we denote by β_j^0 = (β_{j1}^0, ..., β_{jd_j}^0) the true (unknown) value of the coefficient vector β_j corresponding to the j-th group of explanatory variables. For observation i, we denote by X_{ij,k} the k-th variable of the j-th group. Let us emphasize that the relevant groups of explanatory variables correspond to the non-zero vectors β_j^0. More precisely, without loss
of generality, we suppose that the first p_0 (p_0 ≤ p) groups of variables are relevant, that is,

Y_i = Σ_{j=1}^{p_0} X_{ij}^⊤ β_j^0 + ε_i ,    i = 1, ..., n,    (2)

with β_j^0 ≠ 0_{d_j} for j ≤ p_0 and β_j^0 = 0_{d_j} for j > p_0. Denoting by A ≡ {j ∈ {1, ..., p}; β_j^0 ≠ 0_{d_j}} the index set of the relevant groups and taking into account model (2), we have A = {1, ..., p_0}. Throughout this paper, we denote by β_A the non-zero sub-vector of β which contains all sub-vectors β_j with j ∈ A. For any j ∈ A, the corresponding group of explanatory variables is relevant, while for j ∈ A^c the group of variables is irrelevant. In applications, since β^0 is unknown, A is also unknown. We denote by X_{i,A} the r_0-vector, with r_0 ≡ Σ_{j=1}^{p_0} d_j, which contains the elements X_{ij} with j ∈ {1, ..., p_0}.
3 Adaptive group LASSO expectile estimator

In this section, we introduce and study the automatic selection of the relevant groups of explanatory variables by penalizing the expectile loss function with an adaptive L_{2,1}-norm. Two cases will be considered: first when the number p of variable groups is fixed and afterwards when it depends on the number n of observations. For a fixed τ ∈ (0, 1), we define the expectile function ρ_τ : R → R_+ by:
ρ_τ(u) ≡ u² |τ − 1_{u<0}| .    (3)
The value τ is called the expectile index and ρ_τ is the expectile function of order τ. We also define: ψ_τ(t) ≡ ρ_τ(ε − t), g_τ(ε) ≡ ψ'_τ(0) = −2ε(τ 1_{ε≥0} + (1 − τ) 1_{ε<0}), h_τ(ε) ≡ ψ''_τ(0) = 2(τ 1_{ε≥0} + (1 − τ) 1_{ε<0}), σ²_{g_τ} ≡ Var[g_τ(ε)], μ_{h_τ} ≡ E[h_τ(ε)]. We can then define the expectile estimator:
β̃_n ≡ arg min_{β∈R^r} G_n(β),    with G_n(β) ≡ Σ_{i=1}^{n} ρ_τ(Y_i − X_i^⊤ β).    (4)
For τ = 1/2 we find the classical least squares (LS) estimator. Unfortunately, the estimator β̃_n does not allow automatic selection of groups of variables. In order to select the significant explanatory variables,
we would need to perform some hypothesis tests, which can be tedious if p is large. In order
to overcome this inconvenience, Zou (2006) proposed in the particular case τ = 1/2 and un-
grouped explanatory variables, under specific assumptions on the design, the adaptive LASSO
estimator which automatically selects variables. Inspired by this idea, in this section we would
like to select the groups of relevant variables. Then, we will introduce an estimator, denoted by β̂_n = (β̂_{n;1}, ..., β̂_{n;p}), minimizing the expectile process (4) penalized by an adaptive LASSO term. The main objective of this section is to study the asymptotic properties of the adaptive group LASSO expectile estimator (ag E) defined by:

β̂_n ≡ arg min_{β∈R^r} Q_n(β),    with Q_n(β) ≡ n^{-1} G_n(β) + λ_n Σ_{j=1}^{p} ω̂_{n;j} ‖β_j‖_2,    (5)

where the adaptive weights are ω̂_{n;j} ≡ ‖β̃_{n;j}‖_2^{-γ}, with β̃_{n;j} the expectile estimators corresponding to the j-th group of the explanatory variables.
The sequence (λ_n)_{n∈N} is called the tuning parameter and γ is a known positive constant. Corresponding to β̂_n, we define the index set of the non-null adaptive group LASSO expectile estimators:

Â_n ≡ {j ∈ {1, ..., p}; ‖β̂_{n;j}‖_2 ≠ 0}.
The set Â_n is an estimator of A and |Â_n| is an estimator of the number p_0 of significant groups of variables. Two cases will be considered in this section: one where p is constant and one where p depends on n. Remark that β̂_n is a generalization of the estimator for non-grouped explanatory variables (d_j = 1 for any j = 1, ..., p) proposed by Liao et al. (2019) when p is fixed and by Ciuperca (2020) for p depending on n.
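To fix ideas on how β̂_n can be computed in practice, the following R sketch (not the authors' implementation; the solver, the defaults and the variable names are illustrative assumptions of ours) minimizes the penalized criterion (5) by proximal gradient descent, using the closed-form group soft-thresholding operator of the weighted L_{2,1} penalty and an unpenalized pilot estimator for the adaptive weights.

# Illustrative sketch of the adaptive group LASSO expectile estimator (5);
# not the authors' code: solver, step size and stopping rule are our choices.
agE <- function(X, Y, groups, tau = 0.5, lambda = 0.1, gamma = 1,
                step = NULL, maxit = 5000, tol = 1e-8) {
  Y <- drop(Y); n <- nrow(X)
  # pilot (unpenalized) estimator used to build the adaptive weights omega_j;
  # any root-n consistent estimator can play this role
  beta_tilde <- coef(lm(Y ~ X - 1))
  omega <- sapply(split(beta_tilde, groups), function(b) sum(b^2)^(-gamma / 2))
  # gradient of the expectile loss n^{-1} sum_i rho_tau(Y_i - X_i' beta)
  grad <- function(beta) {
    res <- drop(Y - X %*% beta)
    -crossprod(X, 2 * abs(tau - (res < 0)) * res) / n
  }
  if (is.null(step))
    step <- 1 / (2 * max(eigen(crossprod(X) / n, only.values = TRUE)$values))
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    z <- drop(beta - step * grad(beta))
    # proximal step: group soft-thresholding with threshold step * lambda * omega_j
    beta_new <- unlist(lapply(unique(groups), function(g) {
      zg <- z[groups == g]
      max(0, 1 - step * lambda * omega[[as.character(g)]] / sqrt(sum(zg^2))) * zg
    }), use.names = FALSE)
    if (sqrt(sum((beta_new - beta)^2)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

The exact zeros produced by the group soft-thresholding step are what allow whole groups to be removed, so Â_n can be read directly from the groups with non-zero Euclidean norm.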
We present now the assumptions, on the model errors and on the design, that will be necessary
in this section. The assumptions on (λn ) and γ will be presented in each of the next two subsec-
tions, depending on whether or not p depends on n.
The model errors (εi )1≤i≤n satisfy the following assumption:
(A1) (ε_i)_{1≤i≤n} are independent, identically distributed random variables, having 0 as their τ-th expectile, with a positive density, continuous in a neighborhood of 0. We also suppose E(ε_i^4) < ∞.
Two assumptions are considered for the deterministic design (Xi )1≤i≤n :
Assumption (A3) is classical for linear regression, while assumption (A1) is typical in the context of expectile regression (Liao et al. (2019), Ciuperca (2021)); it implies E[ε_i(τ 1_{ε_i≥0} + (1 − τ) 1_{ε_i<0})] = −E[g_τ(ε_i)]/2 = 0. Hence, the expectile index τ will be fixed throughout this section, such that E[g_τ(ε_i)] = 0. Assumption (A2) is commonly considered in high-dimensional models when the number of parameters diverges with n (Wang and Wang (2014), Zhao et al. (2018), Wang and Tian (2019), Ciuperca (2021), Hu et al. (2021), Zhou et al. (2019)).
Note that in the particular case τ = 1/2 of the LS loss function and p fixed, we obtain the adap-
tive group LASSO LS estimator proposed and studied by Wang and Leng (2008). The proofs of
the results presented in the following two subsections can be found in Subsection 7.1.
3.1 Case of fixed p

When p is fixed, the following conditions are considered for the tuning parameter sequence (λ_n)_{n∈N}, with λ_n → 0 as n → +∞, and for γ:

(a) n^{1/2} λ_n → 0,    (b) n^{(γ+1)/2} λ_n → ∞,    as n → +∞.    (6)
The two conditions of (6) are classical for adaptive LASSO penalties when the number of co-
efficients is fixed (see for example Wu and Liu (2009), Ciuperca (2019), Liao et al. (2019)).
Observe also that assumption (A3) implies, when p is fixed, that there exists a positive definite r × r matrix Υ so that

n^{-1} Σ_{i=1}^{n} X_i X_i^⊤ → Υ,    as n → +∞.    (7)
Let us first find the convergence rate of the adaptive group LASSO expectile estimator β̂_n.

Lemma 3.1. Under assumptions (A1), (A2), (A3) and (6)(a), we have: ‖β̂_n − β^0‖_2 = O_P(n^{-1/2}).
The convergence rate of β̂_n is of optimal order n^{-1/2} and coincides with that obtained by Liao et al. (2019) for the SCAD expectile estimator with ungrouped variables, or by Wang and Leng (2008) for the adaptive group LASSO LS estimator, both when the number of model parameters is fixed. The following theorem shows the asymptotic normality of β̂_n corresponding to the relevant groups of variables. In this case, the non-zero parameter estimators have the same asymptotic distribution as they would have if the true non-zero parameters were known.
Theorem 3.1. Under assumptions (A1)-(A3) and (6), we have:

n^{1/2} (β̂_n − β^0)_A →_L N(0_{r_0}, σ²_{g_τ} μ_{h_τ}^{-2} Υ_A^{-1}),    as n → ∞,

with Υ_A the sub-matrix of Υ with indexes in {1, ..., d_1, d_1 + 1, ..., d_1 + d_2, ..., Σ_{j=1}^{p_0−1} d_j + 1, ..., Σ_{j=1}^{p_0} d_j}.
Theorem 3.2. Under assumptions (A1)-(A3) and (6), we have: lim_{n→∞} P[Â_n = A] = 1.

Theorem 3.2 states in particular that the estimators of the groups of non-zero parameters are indeed non-zero with a probability converging to 1 when n converges to infinity. An estimator which satisfies the sparsity property and is asymptotically normal for the non-zero coefficient vector enjoys the oracle properties.
3.2 Case of p depending on n

In this subsection, we study the asymptotic properties of the adaptive group LASSO expectile estimator β̂_n defined by (5) when p depends on n: p = p_n and p_n → +∞ as n → +∞. For readability, we keep the notation p instead of p_n, even if p depends on n. Obviously, p_0, r, r_0, A can also depend on n but, for the same readability reasons, the index n does not appear.
Since p depends on n, new assumptions are considered:
(A5) For h_0 ≡ min_{1≤j≤p_0} ‖β_j^0‖_2, there exists a constant K > 0 so that K ≤ n^{-α} h_0, with α > (c − 1)/2.
Assumption (A4) is often used when p depends on n (see Ciuperca (2021) and Ciuperca (2019)). Since ‖x‖_2 ≤ r^{1/2} ‖x‖_∞, assumptions (A2) and (A4) imply relation (8), a supposition considered by Ciuperca (2019) for the grouped quantile case and by Ciuperca (2021), Hu et al. (2021) for the ungrouped expectile case and the ungrouped Lq-norm case, respectively. Assumption (A4) enables us to control the convergence rate of p and ensures the convergence towards 0 of the sequence p/n. Assumption (A5) is classical for a model where the number of variable groups depends on the sample size (see Ciuperca (2019), Zhang and Xiang (2016)). This assumption implies that coefficients can be non-zero for a fixed n but converge to 0 when n converges to infinity. By assumption (A3) we deduce that r = rank(n^{-1} Σ_{i=1}^{n} X_i X_i^⊤).
Two cases will be considered with respect to the order c of p: c ∈ [0, 1/2) and c ∈ [1/2, 1]. Note that the first case also covers the possibility that p does not depend on n. To shorten the paper, only the case c ∈ [0, 1/2) will be presented for the expectile loss function. The case c ∈ [1/2, 1] will be considered in Section 4, which will have the expectile function as a special case.
Instead of conditions (6)(a) and (6)(b), the tuning parameter (λ_n)_{n∈N}, with λ_n → 0 as n → +∞, satisfies the following two assumptions:

λ_n n^{1/2 − αγ} → 0,    as n → +∞.    (9)
Obviously, for c = 0 and α = 0, relations (9) and (10) become the two conditions of (6). Condition (10) is also considered by Zhang and Xiang (2016) for the LS loss function (τ = 1/2) with adaptive group LASSO penalty, in the case of i.i.d. model errors with zero mean and finite variance. Condition (9) is weaker than λ_n n^{(1+c)/2 − αγ} → 0, the corresponding condition of Zhang and Xiang (2016).
The following theorem deals with the convergence rate of β̂_n. The same convergence rate was obtained for adaptive group LASSO estimators when the loss functions are: the likelihood in Wang and Tian (2019), the LS loss in Zhang and Xiang (2016) and the quantile loss in Ciuperca (2019). Obviously, if c = 0 we recover the results of the previous subsection.
Theorem 3.3. Under assumptions (A1)-(A5) and condition (9) for (λ_n)_{n∈N}, we have: ‖β̂_n − β^0‖_2 = O_P((p/n)^{1/2}).
Theorem 3.4. Under assumptions (A1)-(A5) and conditions (9) and (10) for (λ_n)_{n∈N}, we have: lim_{n→∞} P[Â_n = A] = 1.
Theorem 3.5. Under the same assumptions as in Theorem 3.4, for all vectors u ∈ R^{r_0} so that ‖u‖_2 = 1, considering Υ_{n,A} ≡ n^{-1} Σ_{i=1}^{n} X_{i,A} X_{i,A}^⊤, we have:

n^{1/2} (u^⊤ Υ_{n,A}^{-1} u)^{-1/2} u^⊤ (β̂_n − β^0)_A →_L N(0, σ²_{g_τ} μ_{h_τ}^{-2}),    as n → ∞.
4 Adaptive group LASSO Lq-estimator

In this section, the results of Section 3 are generalized by considering the asymmetric Lq-norm as the loss function, with q > 1. For the index τ ∈ (0, 1), we consider now the following loss function:
ρ_τ(x; q) ≡ |τ − 1_{x<0}| · |x|^q ,    x ∈ R.    (11)
For q = 2 we find the expectile regression studied in Section 3. Remark also that if q = 1 then we obtain the quantile function, which will not be considered in this section because this case was considered by Ciuperca (2019). The properties of function (11) can be found in Daouia et al. (2019), where it is specified, for example, that the choice q ∈ (1, 2) is preferable for data with outliers, in order to combine the robustness of the quantile method with the sensitivity of the expectile method. For t ∈ R, consider the function ψ_τ(t; q) ≡ ρ_τ(ε − t; q), for which
sensitivity of that expectile. For t ∈ R, consider the function: ψτ (t; q) ≡ ρτ (ε − t; q) for which
∂ψτ
we denote: gτ (ε; q) ≡ ∂t (0; q) = −qτ |ε|q−1 11ε≥0 + q(1 − τ )|ε|q−1 11ε<0 and hτ (ε; q) ≡
∂ 2 ψτ
∂t2
(0; q) = q(q − 1)τ |ε|q−2 11ε≥0 + q(q − 1)(1 − τ )|ε|q−2 11ε<0 . Hu et al. (2021) proved that
µhτ (q) ≡ E[hτ (ε; q)] < ∞ and σg2τ (q) ≡ Var [gτ (ε; q)] < ∞.
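As a quick illustration (not from the paper), the asymmetric Lq loss (11), and the fact that its population minimizer, the Lq-quantile, interpolates between the quantile (q = 1) and the expectile (q = 2), can be checked numerically in R; the error distribution and the index τ below are arbitrary examples.

# Illustrative sketch: asymmetric Lq loss (11) and the Lq-quantile
# arg min_t E[rho_tau(eps - t; q)], approximated by a sample mean.
rho_q <- function(x, tau, q) abs(tau - (x < 0)) * abs(x)^q
set.seed(2)
eps <- rexp(1e5) - 1                     # an asymmetric sample (illustrative)
lq_quantile <- function(tau, q)
  optimize(function(t) mean(rho_q(eps - t, tau, q)), interval = c(-3, 5))$minimum
sapply(c(1, 1.5, 2), function(q) lq_quantile(tau = 0.8, q = q))
# q = 1 gives (approximately) the 0.8-quantile, q = 2 the 0.8-expectile.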
For the asymmetric Lq-norm, relation (4) becomes:

G_n(β; q) ≡ Σ_{i=1}^{n} ρ_τ(Y_i − X_i^⊤ β; q),    β̃_n(q) ≡ arg min_{β∈R^r} G_n(β; q).
Since the quantities arg min_{t∈R} E[ρ_τ(ε − t; q)] are called Lq-quantiles (see Chen (1996), Daouia et al. (2019)), β̃_n(q) is called the Lq-quantile estimator. Then, we define the adaptive group LASSO Lq-estimator:

β̂_n(q) ≡ arg min_{β∈R^r} Q_n(β; q),    with Q_n(β; q) ≡ n^{-1} G_n(β; q) + λ_n Σ_{j=1}^{p} ω̂_{n;j}(q) ‖β_j‖_2

and the adaptive weights ω̂_{n;j}(q) ≡ ‖β̃_{n;j}(q)‖_2^{-γ}.
For τ = 1/2 and q = 2 we find the adaptive group LASSO LS method proposed and studied by Zhang and Xiang (2016). Let us also underline that, in contrast to the present paper, Hu et al. (2021) consider an asymmetric Lq regression with ungrouped variables, where the explanatory variable selection is carried out with SCAD and adaptive LASSO penalties.
Now, in this section, for the model errors (ε_i)_{1≤i≤n}, we suppose the following assumption, considered also by Hu et al. (2021) for Lq-quantile regression with ungrouped variables and which for q = 2 becomes assumption (A1):

(A1q) The errors (ε_i)_{1≤i≤n} are i.i.d. random variables with a continuous positive density in a neighborhood of zero and with the τ-th Lq-quantile equal to zero: E[g_τ(ε; q)] = 0. We also suppose E(|ε_i|^{2q}) < ∞.
As in Section 3, the value of τ will be fixed throughout this section, such that E[g_τ(ε; q)] = 0. The true non-zero parameters satisfy the following assumption:

(A5q) There exists a constant K > 0 such that min_{1≤j≤p_0} ‖β_j^0‖_2 ≥ K.

Assumption (A5q) is a particular case of assumption (A5), obtained for α = 0; it supposes that the L_2-norm of the non-zero groups does not depend on n. The proofs of the results presented in the following two subsections are postponed to Subsection 7.2.
4.1 Case p = O(n^c), c ∈ [0, 1/2)

As in Section 3, we first consider c ∈ [0, 1/2), with the particular case c = 0 when p is fixed.
First of all, let us underline that, by elementary calculations, we have, for t → 0:

E[ρ_τ(ε − t; q) − ρ_τ(ε; q)] = μ_{h_τ}(q) · t²/2 + o(t²).    (14)
As for the penalized expectile method presented in Subsection 3.2, when p = O(n^c) with c ∈ (0, 1/2), the tuning parameter sequence (λ_n)_{n∈N} satisfies conditions (9) and (10). In order to study the oracle properties of β̂_n(q), we must first find the convergence rate of the Lq-quantile estimator β̃_n(q) and afterwards that of β̂_n(q). For a model with ungrouped variables, c ∈ (0, 1) and a design satisfying (8), Hu et al. (2021) prove that the convergence rate of β̂_n(q) is of order (p/n)^{1/2}.

Theorem 4.1. Suppose that assumptions (A1q), (A2), (A3) and (A4) hold.
(i) Then ‖β̃_n(q) − β^0‖_1 = O_P((p/n)^{1/2}).
(ii) If the tuning parameter sequence (λ_n)_{n∈N} satisfies (9) and assumption (A5q) is also fulfilled, then ‖β̂_n(q) − β^0‖_1 = O_P((p/n)^{1/2}).
We observe that these convergence rates do not depend on q. With these two results we can now prove the sparsity property and afterwards the asymptotic normality of β̂_n(q).
Theorem 4.2. Under assumptions (A1q), (A2), (A3), (A4), (A5q) and conditions (9) and (10) on (λ_n)_{n∈N}, we have: lim_{n→∞} P[Â_n = A] = 1.

Theorem 4.3. Under the assumptions of Theorem 4.2, for all r_0-vectors u ∈ R^{r_0} such that ‖u‖_2 = 1, we have: n^{1/2} (u^⊤ Υ_{n,A}^{-1} u)^{-1/2} u^⊤ (β̂_n(q) − β^0)_A →_L N(0, σ²_{g_τ}(q) μ_{h_τ}^{-2}(q)), as n → ∞.
4.2 Case p = O(n^c), c ∈ [1/2, 1]
In this case, in order to study β̂_n(q), we first need the convergence rate of β̃_n(q). The convergence rates of the estimators β̃_n(q) and β̂_n(q) are given in Lemma 4.1 and Theorem 4.4 below.

Lemma 4.1. Under assumptions (A1q), (A2), (A3), (A6), we have that ‖β̃_n(q) − β^0‖_1 = O_P(a_n), with the sequence (a_n)_{n∈N} such that a_n → 0 and n^{1/2} a_n → ∞.

In this case, the adaptive weights are taken as: ω̂_{n;j}(q) = min(‖β̃_{n;j}(q)‖_2^{-γ}, n^{1/2}).
Theorem 4.4. Under assumptions (A1q), (A2), (A3), (A5q), (A6), with the tuning parameter (λ_n)_{n∈N} and the sequence (b_n)_{n∈N} → 0 satisfying n^{1/2} b_n → ∞ and λ_n (p_0)^{1/2} b_n^{-1} → 0 as n → ∞, we have: ‖β̂_n(q) − β^0‖_1 = O_P(b_n).
Note that the convergence rate of β̂_n(q) involves p_0, the number of non-zero groups. Now that we know the convergence rates of these two estimators, we can study the oracle properties of β̂_n(q).
Theorem 4.5. Suppose that assumptions (A1q), (A2), (A3), (A5q), (A6) hold, and that the tuning parameter (λ_n)_{n∈N} and the sequence (b_n)_{n∈N} → 0 satisfy n^{1/2} b_n → ∞, (p_0)^{1/2} b_n^{-1} λ_n → 0 and a_n^{-γ} b_n^{-1} λ_n → ∞, as n → ∞. Then:
(i) P[Â_n = A] → 1, as n → ∞.
(ii) For any vector u of size r_0 such that ‖u‖_1 = 1, we have: n^{1/2} (u^⊤ Υ_{n,A}^{-1} u)^{-1/2} u^⊤ (β̂_n(q) − β^0)_A →_L N(0, σ²_{g_τ}(q) μ_{h_τ}^{-2}(q)), as n → ∞.
5 Simulation study
In this section, we study by Monte Carlo simulations our adaptive group LASSO expectile estimator and compare it with the adaptive group LASSO quantile estimator, in terms of sparsity and accuracy. All simulations are performed using the R language. Two scenarios are considered: ungrouped and grouped variables. Moreover, in each scenario, p is first fixed and afterwards varies with n. As specified in Section 3, the value of τ is fixed and it must satisfy the condition:
varied with n. As specified in Section 3, the value of τ is fixed and it must satisfy the condition:
E(ε11ε<0 )
τ= . (15)
E(ε(11ε<0 − 11ε>0 ))
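Relation (15) follows from solving E[g_τ(ε)] = 0, that is τ E(ε 1_{ε≥0}) + (1 − τ) E(ε 1_{ε<0}) = 0, with respect to τ. A minimal R sketch of its empirical counterpart (not the paper's code; the sample distributions below are only examples) replaces the expectations by sample means:

# Illustrative estimation of the expectile index tau via the empirical version of (15).
tau_hat <- function(eps) {
  mean(eps * (eps < 0)) / mean(eps * ((eps < 0) - (eps > 0)))
}
set.seed(1)
tau_hat(rnorm(1e5))                  # mean-zero errors: value close to 1/2
tau_hat(rchisq(1e5, df = 1) - 1.2)   # errors with non-zero mean: value away from 1/2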
5.1 Ungrouped explanatory variables

In this subsection, the linear models have ungrouped variables, that is, d_j = 1 for any j = 1, ..., p. The R packages used are SALES, with the function ernet, for expectile regression and quantreg, with the function rq, for quantile regression. Based on relation (15), all simulations are preceded by an estimation of τ, depending on the distribution of the model error ε.
5.1.1 Case of fixed p

Parameters choice. Taking into account (6), we consider λ_n = n^{-1/2 − γ/4}, with γ ∈ (0, 1). We choose p_0 = 5, A = {1, ..., 5} and: β_1^0 = 1, β_2^0 = −2, β_3^0 = 0.5, β_4^0 = 4, β_5^0 = −6, β_j^0 = 0 for all j > p_0. For the model errors ε, three distributions are considered: N(0, 1), which is symmetric, and Exp(−1) and N(1.2, 0.4²) + χ²(1), the last two being asymmetric. The explanatory variables have a standard normal distribution. Over M = 1000 Monte Carlo replications, the adaptive LASSO expectile estimator (ag E) is compared with the adaptive LASSO quantile estimator (ag Q). For ag Q, the tuning parameter is of order n^{-3/5} and the value of the power in the weight of the penalty is 1.225 (see Ciuperca (2021)).
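As an illustration of this design (not the authors' code), one Monte Carlo replication could be generated and fitted with the illustrative agE() solver sketched in Section 3; here d_j = 1, the error is N(0, 1) so τ = 1/2, and all names below are ours.

# One illustrative replication with fixed p, ungrouped variables (d_j = 1).
set.seed(10)
n <- 100; p <- 25; gamma <- 5/8
lambda_n <- n^(-1/2 - gamma / 4)                 # tuning parameter of this subsection
beta0 <- c(1, -2, 0.5, 4, -6, rep(0, p - 5))
X <- matrix(rnorm(n * p), n)
Y <- drop(X %*% beta0) + rnorm(n)
beta_hat <- agE(X, Y, groups = 1:p, tau = 0.5, lambda = lambda_n, gamma = gamma)
A_hat <- which(beta_hat != 0)                    # estimated set of relevant variables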
Results. Looking at the sparsity property, two cardinalities are calculated: Card(A ∩ Â_n), which is the number of true non-zeros estimated as non-zero, and Card(Â_n \ A), which is the number of false non-zeros. Note that for a perfect estimation method we should get Card(A ∩ Â_n) = p_0 = 5 and Card(Â_n \ A) = 0. Table 1 presents these two cardinalities. Looking at Card(Â_n \ A), ag E shows better performance than ag Q, especially when p = O(n). Note also that the accuracy of the evaluation of Card(Â_n \ A) rises with γ and that, concerning Card(A ∩ Â_n), ag Q is more accurate when p ≪ n, even if the results for ag E are very close. However, when p is larger and the error distribution is asymmetric, ag E provides better estimates. Notice also that the number of true non-zeros estimated as non-zero decreases when γ increases.
5.1.2 Case of p depending on n

Parameters choice. The numbers p and p_0 are calibrated in two ways: firstly p = ⌊n/2⌋, p_0 = 2⌊n^{1/2}⌋, which corresponds to c = 1, and afterwards p = ⌊n(log n)^{-1}⌋, p_0 = 2⌊n^{1/4}⌋, which corresponds to c < 1. For all j = 1, ..., p_0, β_j^0 ∼ N(0, 2). The design and the model errors are similar to those of the fixed-p case.
Results. The sparsity is studied by calculating: (100/p_0) Card(A ∩ Â_n), which is the percentage of true non-zero parameters estimated as non-zero, and (100/(p − p_0)) Card(Â_n \ A), which is the percentage of false non-zeros. For a perfect estimation method we should get (100/p_0) Card(A ∩ Â_n) = 100 and (100/(p − p_0)) Card(Â_n \ A) = 0. Denoting by β̂_n^{(m)} the estimation obtained for the m-th Monte Carlo replication, we also calculate mean(|β̂_n − β^0|) = (Mp)^{-1} Σ_{m=1}^{M} Σ_{j=1}^{p} ‖β̂_{n;j}^{(m)} − β_j^0‖_1, the accuracy of the complete estimation vector, and mean(|(β̂_n − β^0)_A|) = (Mp_0)^{-1} Σ_{m=1}^{M} Σ_{j=1}^{p_0} ‖β̂_{n;j}^{(m)} − β_j^0‖_1, the accuracy of the estimation of the non-zero parameters. Remark that in Table 2 we are in the case where p = O(n). Looking at the sparsity property, ag Q is slightly better for the symmetric error (N(0, 1)) but ag E is better when the distributions are asymmetric. When n is larger, there is no significant difference between the two methods. We can notice a connection between accuracy and the sparsity property: the more efficiently an estimator selects variables, the less accurate it is on the estimation of the non-zero parameters. ag E tends to be more precise for smaller values of n. Another case is presented in Table 3. Here, ag E always has a better sparsity property and ag Q is always more precise.
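For completeness, the selection and accuracy measures defined above can be computed, for one replication in the ungrouped case (d_j = 1), by a short helper like the following sketch (illustrative names, not the paper's code):

# Selection percentages and L1 accuracies for one replication, ungrouped case.
selection_measures <- function(beta_hat, beta0) {
  A <- which(beta0 != 0); A_hat <- which(beta_hat != 0)
  p <- length(beta0); p0 <- length(A)
  c(pct_true_nonzero  = 100 * length(intersect(A, A_hat)) / p0,
    pct_false_nonzero = 100 * length(setdiff(A_hat, A)) / (p - p0),
    mean_err_all      = mean(abs(beta_hat - beta0)),
    mean_err_A        = mean(abs(beta_hat - beta0)[A]))
}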
5.1.3 Effect of γ

Parameters choice. We take p ∈ {10, 100} and n = 100. The parameter β^0 and the model errors ε are similar to those of the fixed-p case.
Results. Results are presented for ag E. In the case of a symmetric error distribution,
ε | n | p | Card(A ∩ Â_n): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | Card(Â_n \ A): ag E (γ = 5/8), ag E (γ = 11/12), ag Q
N (0, 1) 50 10 4.999 4.995 5 0.007 0.001 0.221
25 4.998 4.989 4.999 0.034 0.001 1.326
50 4.91 4.85 4.897 1.596 2.668 15.97
100 10 5 5 5 0 0 0.127
25 5 5 5 0.002 0 0.573
100 4.973 4.901 4.912 1.438 2.937 49.28
200 10 5 5 5 0 0 0.048
100 5 5 5 0 0 1.764
200 4.986 4.96 4.949 0.333 0.557 109.4
N (−1.2, 0.42 ) + χ2 (1) 50 10 4.984 4.972 5 0.043 0.019 0.133
25 4.97 4.958 4.994 0.284 0.194 1.242
50 4.87 4.762 4.831 5.984 8.134 24.73
100 10 5 5 5 0.005 0 0.035
25 5 4.997 5 0.089 0.039 1.452
100 4.924 4.87 4.892 5.927 8.723 57.23
200 10 5 5 5 0 0 0.013
100 5 5 5 0.01 0 1.337
200 4.961 4.929 4.938 0.861 1.54 122.2
Exp(−1) 50 10 4.999 4.993 5 0 0 0.033
25 4.995 4.986 4.999 0.008 0.005 0.558
50 4.906 4.822 4.871 2.225 4.033 21
100 10 5 5 5 0 0 0.005
25 5 5 5 0 0 0.073
100 4.945 4.902 4.914 0.883 1.619 48.48
200 10 5 5 5 0 0 0.001
100 5 5 5 0 0 0.419
200 4.979 4.971 4.939 0.12 0.21 106.3
Table 1: Sparsity study of the adaptive LASSO expectile estimator (ag E) and of the adaptive LASSO
quantile estimator (ag Q) when explanatory variables are ungrouped and p0 = 5.
ε | n | 100 p_0^{-1} Card(A ∩ Â_n): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | 100 (p − p_0)^{-1} Card(Â_n \ A): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | mean(|β̂_n − β^0|): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | mean(|(β̂_n − β^0)_A|): ag E (γ = 5/8), ag E (γ = 11/12), ag Q
N (0, 1) 50 100 100 98.21 0.154 0.12 0.009 0.196 0.223 0.347 0.350 0.399 0.619
100 100 99.99 100 0 0 0 0.161 0.15 0.109 0.402 0.375 0.289
400 99.99 98 100 0 0 0 0.086 0.077 0.031 0.43 0.384 0.154
N (−1.2, 0.42 ) 50 100 100 93.05 1.03 0.509 0.045 0.215 0.196 0.595 0.364 0.384 1.063
+ χ2 (1) 100 100 100 99.5 0.003 0 0 0.166 0.149 0.196 0.414 0.37 0.491
400 100 100 99.57 0 0 0 0.077 0.083 0.079 0.385 0.414 0.393
Exp(−1) 50 100 100 96.01 0.045 0.009 0 0.237 0.217 0.456 0.424 0.387 0.814
100 100 100 99.96 0 0 0 0.17 0.153 0.094 0.425 0.382 0.235
400 100 98.9 100 0 0 0 0.078 0.073 0.022 0.389 0.366 0.109
Table 2: Sparsity study of ag E and of ag Q when explanatory variables are ungrouped, p = ⌊n/2⌋, p_0 = 2⌊n^{1/2}⌋.
ε | n | 100 p_0^{-1} Card(A ∩ Â_n): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | 100 (p − p_0)^{-1} Card(Â_n \ A): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | mean(|β̂_n − β^0|): ag E (γ = 5/8), ag E (γ = 11/12), ag Q | mean(|(β̂_n − β^0)_A|): ag E (γ = 5/8), ag E (γ = 11/12), ag Q
N (0, 1) 50 100 100 100 0.1 0.013 0.062 0.139 0.140 0.073 0.416 0.420 0.221
100 100 100 100 0 0 0 0.115 0.11 0.049 0.403 0.383 0.167
400 100 99.3 100 0 0 0 0.048 0.043 0.011 0.393 0.354 0.012
N (−1.2, 0.42 ) 50 100 100 100 0.99 0.425 0.025 0.137 0.123 0.074 0.41 0.37 0.222
+ χ2 (1) 100 100 100 100 0 0 0 0.113 0.108 0.04 0.394 0.38 0.141
400 100 100 100 0 0 0 0.049 0.048 0.008 0.404 0.4 0.068
Exp(−1) 50 100 100 100 0.037 0.05 0 0.124 0.109 0.049 0.373 0.328 0.147
100 100 100 100 0 0 0 0.112 0.105 0.033 0.391 0.366 0.115
400 100 100 100 0 0 0 0.047 0.053 0.007 0.386 0.438 0.048
Table 3: Sparsity study of the adaptive LASSO expectile estimator (ag E) and of the adaptive LASSO quantile estimator (ag Q) when explanatory variables are ungrouped, p = ⌊n(log n)^{-1}⌋, p_0 = 2⌊n^{1/4}⌋.
Figures 1 and 2 show that the number of true non-zeros estimated as non-zero and the number of false non-zeros decrease as functions of γ. Consequently, the choice of γ will depend on the context. However, if p ≪ n, taking γ close to 1 will always be the best option. In the case of an asymmetric error distribution, Card(A ∩ Â_n) is still a decreasing function of γ, while Card(Â_n \ A) has a minimum value (Figure 4). It would be interesting to choose γ near this minimum. We can still notice that, when p ≪ n, the influence of γ is mainly on Card(Â_n \ A) (Figure 3). Finally, γ ≈ 0.6 is a good choice when p = O(n) and, for p ≪ n, the choice of γ close to 1 is favorable.
[Figure panels: (100/p_0) Card(A ∩ Â_n) and (100/(p − p_0)) Card(Â_n \ A), plotted as functions of γ.]
5.1.4 Effect of ‖β^0‖_2
Results. For the true non-zeros, from Figures 5, 6, 7 and 8 we can deduce the value of v needed to obtain a satisfactory selection as p ≃ n and as the distribution of the error becomes asymmetric. The effect on 100(p − p_0)^{-1} Card(Â_n \ A) is not relevant, even if, when p = O(n), this value seems to have a maximum for a symmetric error distribution. Table 4 presents the values of ‖β^0‖_2 for which different values of (100/p_0) Card(A ∩ Â_n) are achieved. More precisely, ^{99}‖β^0‖_2 is the value of ‖β^0‖_2 for which (100/p_0) Card(A ∩ Â_n) = 99% and ^{95}‖β^0‖_2 the value for which (100/p_0) Card(A ∩ Â_n) = 95%.
5.2 Grouped explanatory variables

In this subsection, the explanatory variables are grouped and the expectile index is τ = 1/2. The R packages used are grpreg, with the function grpreg, for expectile regression and rqPen, with the function QICD.group, for quantile regression.
Parameters choice. We take λ_n = n^{-1/2 − γ/4}, A = {1, ..., 4} and: β_1^0 = (0.5, 1, 1.5, 1, 0.5), β_2^0 = (1, 1, 1, 1, 1), β_3^0 = (−1, 0, 1, 2, 1.5), β_4^0 = (−1.5, 1, 0.5, 0.5, 0.5), β_j^0 = 0_5 for any j > p_0. The errors ε follow the N(0, 1) or the Cauchy C(0, 0.1) distribution, while the explanatory variables have a standard normal distribution. For ag Q, the tuning parameter is of order n^{-3/5} and the power value in the weight of the penalty is 1.225 (see Ciuperca (2020)).
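A sketch of one replication under this grouped design, again with the illustrative agE() solver of Section 3 (group sizes d_j = 5, τ = 1/2, γ = 9/10; everything below is an assumption of ours, not the authors' code):

# One illustrative replication with p = 10 groups of size 5 (fixed-p grouped design).
set.seed(3)
n <- 100; p <- 10; d <- 5; gamma <- 9/10
groups <- rep(seq_len(p), each = d)
beta0 <- c(0.5, 1, 1.5, 1, 0.5,   1, 1, 1, 1, 1,   -1, 0, 1, 2, 1.5,
           -1.5, 1, 0.5, 0.5, 0.5, rep(0, (p - 4) * d))
X <- matrix(rnorm(n * p * d), n)
Y <- drop(X %*% beta0) + rnorm(n)
beta_hat <- agE(X, Y, groups, tau = 0.5, lambda = n^(-1/2 - gamma / 4), gamma = gamma)
group_norms <- tapply(beta_hat, groups, function(b) sqrt(sum(b^2)))
A_hat <- which(group_norms > 0)                  # selected groups (compare with A = 1:4)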
Results. The same sparsity measurements as in Subsubsection 5.1.2 are presented, over 1000 Monte Carlo replications. From Table 5 we deduce, in the case ε ∼ N(0, 1), that ag E has a better sparsity property than ag Q, especially when n is small. Since the Cauchy distribution has no mean, when ε ∼ C(0, 0.1) and n is small, ag E is still better for the true non-zeros, but worse for the false non-zeros (Card(Â_n \ A)) in either case. If n increases, the penalized quantile estimation is always better.
Parameters choice. We calibrate p and p_0 in two ways: firstly p = ⌊n/5⌋, p_0 = 2⌊5^{-1} n^{1/2}⌋, which corresponds to c = 1, and afterwards p = ⌊n(2 log n)^{-1}⌋, p_0 = 2⌊n^{1/4}⌋, which corresponds to c < 1. We consider β_j^0 ∼ N(0_5, 2 I_5), for any j = 1, ..., p_0, with I_5 the identity matrix of order 5. All the other parameters are similar to those of the fixed-p case.
Results. The same measurements of sparsity and accuracy as in Subsubsection 5.1.2 are presented. The results of Table 6 are for the case p = O(n). When ε ∼ N(0, 1), ag E is always better in variable selection and more accurate, for the two values of γ considered (1/10 and 9/10) and for all values of n. When ε ∼ C(0, 0.1), for γ = 1/10 in the adaptive weight of ag E, the results of the two estimation methods are similar, while for γ = 9/10 the results obtained by ag E are better than those of ag Q. Moreover, in Table 7, the sparsity and accuracy results of the penalized expectile estimation are better for the two considered values of γ.
6 Application to real data

In addition to the simulation study presented in the previous section, we demonstrate in this section the practical utility of the proposed estimator. Thus, ag E will be used on real data concerning air pollution, more precisely two gases: ozone and nitrogen dioxide.
6.1 Data
ε | n | p | 100 p_0^{-1} Card(A ∩ Â_n): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | 100 (p − p_0)^{-1} Card(Â_n \ A): ag E (γ = 1/10), ag E (γ = 9/10), ag Q
N (0, 1) 50 5 98.53 99 18.35 4.3 3.9 0.5
7 98.63 98.53 18.425 4.3 4.8 0.4
9 98.55 98.68 18.6 5.06 4.9 1.22
250 10 100 100 93.62 0 0 0.15
25 100 100 93 0 0 0.187
49 100 100 93.3 0 0 0.204
500 10 100 100 100 0 0 0.033
25 100 100 100 0 0 0.062
99 100 100 100 0 0 0.081
C(0, 0.1) 50 5 94.7 94.43 29.8 2 3.4 2.3
7 93.8 94.73 29.48 2.93 2.63 1.97
9 94.5 94.63 29.58 2.58 3.12 1.46
250 10 100 99.98 99.63 1.15 1.12 0.7
25 99.98 100 99.85 1.94 1.19 0.61
49 99.98 99.99 99.75 0.48 1.42 0.7
500 10 100 99.95 100 1.02 1.08 0.017
25 99.95 99.95 100 0.98 0.5 0.019
99 99.98 99.98 100 0.91 0.75 0.008
Table 5: Sparsity study of ag E when τ = 1/2 and p is fixed, for grouped explanatory variables. Comparison with ag Q.
ε | n | 100 p_0^{-1} Card(A ∩ Â_n): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | 100 (p − p_0)^{-1} Card(Â_n \ A): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | mean(|β̂_n − β^0|): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | mean(|(β̂_n − β^0)_A|): ag E (γ = 1/10), ag E (γ = 9/10), ag Q
N (0, 1) 51 100 100 92.55 0.025 0 10.2 0.011 0.033 0.21 0.041 0.099 1.27
101 100 100 99.325 0.025 0 9.7 0.009 0.031 0.21 0.044 0.216 1.07
249 100 100 100 0 0 9 0.0037 0.024 0.041 0.0375 0.217 0.306
C(0, 0.01) 51 99.9 100 99.1 10.21 0 10.17 0.231 0.055 0.27 0.24 0.349 1.83
101 99.97 100 96.35 12.1 0 9.47 0.0385 0.033 0.145 0.063 0.237 0.78
249 99.98 100 100 12.38 0 8.12 0.19 0.028 0.056 0.197 0.259 0.461
Table 6: Sparsity study of ag E when τ = 1/2, p = ⌊n/5⌋, p_0 = 2⌊5^{-1} n^{1/2}⌋. Comparison with ag Q.
ε | n | 100 p_0^{-1} Card(A ∩ Â_n): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | 100 (p − p_0)^{-1} Card(Â_n \ A): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | mean(|β̂_n − β^0|): ag E (γ = 1/10), ag E (γ = 9/10), ag Q | mean(|(β̂_n − β^0)_A|): ag E (γ = 1/10), ag E (γ = 9/10), ag Q
N (0, 1) 51 99.95 100 95.15 6.55 0 32.5 0.0411 0.12 0.87 0.061 0.227 1.13
101 99.87 100 99.9 1.375 0 35.6 0.027 0.101 0.42 0.0347 0.114 0.8
249 100 100 100 0 0 38.37 0.007 0.041 0.056 0.0157 0.158 0.195
C(0, 0.01) 51 99.15 100 96.32 19.65 0 30.9 0.062 0.0893 0.7 0.0756 0.177 0.921
101 99.98 99.99 99.5 18.55 0.05 36.2 0.061 0.437 0.401 0.07 0.185 0.674
249 100 99.85 100 17.44 0.019 32.12 0.037 0.172 0.076 0.0575 0.173 0.385
Table 7: Sparsity study of ag E when τ = 1/2, p = ⌊n(2 log n)^{-1}⌋, p_0 = 2⌊n^{1/4}⌋. Comparison with ag Q.
The explanatory variables are the concentrations of the pollutants CO, C6H6 and NOx, together with the temperature (T), the relative humidity (RH) and the absolute humidity (AH). The explained variables are O3 and NO2. We add to the explanatory variables the concentrations of O3 and NO2 up to three days before. More precisely, for ozone, we add the variables O3_{−1}, O3_{−2}, O3_{−3}, which are respectively the daily ozone concentrations one, two and three days before the observation. All usable data are taken from March 13, 2004 to May 4, 2005. The learning database is from May 1 to September 15, 2004 (115 observations), while the test database is from September 16 to 30, 2004 (15 observations).
We will first determine the groups of variables influencing the explained variable.
Model and parameters choice: We consider τ = 1/2, γ = 1 (as p ≪ n) and λ_n = n^{ξ}, with ξ chosen so that the accuracy measure of the estimation of the non-zero parameters is minimal. Three groups of variables are formed: the "pollutant" group: CO, C6H6, NOx and O3 or NO2, depending on the studied explained variable; the "weather" group: T, RH, AH; the "past" group: O3_{−1}, O3_{−2}, O3_{−3} or NO2_{−1}, NO2_{−2}, NO2_{−3}, depending on the studied explained variable.
Results when the response variable is Ozone. The following values of the Euclidean norms of the estimated group coefficients are obtained: ‖β̂_n(pollutant)‖_2 = 0.46, ‖β̂_n(weather)‖_2 = 0.15, ‖β̂_n(past)‖_2 = 0.29. Hence, all groups are selected.
Results when the response variable is Nitrogen dioxide. In this case: ‖β̂_n(pollutant)‖_2 = 0.15, ‖β̂_n(weather)‖_2 = 0.25, ‖β̂_n(past)‖_2 = 0.0091.
Because the three groups of variables are relevant, to better study the influence of each variable
of a group, we now consider models with ungrouped explanatory variables.
Parameters choice: We take γ = 1 (since p ≪ n). For a better fit of the model, the expectile index τ is estimated for each model, taking into account relation (15). More precisely, using the normalized observations (ỹ_i)_i of Y, τ is estimated by:

τ̂ = Σ_{i} ỹ_i 1_{ỹ_i<0} / Σ_{i} ỹ_i (1_{ỹ_i<0} − 1_{ỹ_i>0}).
Tables 8 and 9 present the MAD, which is the empirical mean of the absolute values of the residuals, and the empirical variance of the residuals.
Results when the response variable is Ozone. For τ̂ = 0.42 and λ_n = n^{-0.999}, the explanatory variables selected by the adaptive LASSO expectile method are C6H6, NO2, T, AH, O3_{−1}: β̂_{n;C6H6} = 0.78, β̂_{n;NO2} = 0.016, β̂_{n;T} = 0.14, β̂_{n;AH} = −0.024, β̂_{n;O3_{−1}} = 0.18. The results of the adaptive LASSO expectile method are compared with those obtained by the adaptive LASSO quantile and the classical LS methods (Table 8). Note that the adaptive LASSO expectile and LS estimators are more precise than the adaptive LASSO quantile estimator on the learning and test data.
Results when the response variable is Nitrogen dioxide. For τ̂ = 0.31 and λ_n = n^{-0.999}, the selected variables are CO, NOx, T, AH, NO2_{−1}, NO2_{−2}, NO2_{−3}: β̂_{n;CO} = 0.42, β̂_{n;NOx} = 0.25, β̂_{n;T} = 0.37, β̂_{n;AH} = −0.2, β̂_{n;NO2_{−1}} = 0.066, β̂_{n;NO2_{−2}} = 0.0001, β̂_{n;NO2_{−3}} = 0.0096. The precision of the three estimation methods is roughly the same, with a slight advantage for the adaptive LASSO expectile estimation (Table 9).
Estimator MAD on all data MAD on learning data MAD on test data Variance on all data Variance learning Variance test
Adaptive LASSO expectile 0.504 0.303 0.43 0.484 0.162 0.432
Adaptive LASSO quantile 0.473 0.355 0.612 0.404 0.231 0.551
Least squares 0.522 0.31 0.428 0.516 0.152 0.464
Table 8: Comparison of the performance of the estimators, for the ozone model, ungrouped variables.
Estimator MAD on all data MAD on learning data MAD on test data Variance on all data Variance learning Variance test
adaptive LASSO expectile 0.880 0.347 0.416 15.54 0.194 0.331
adaptive LASSO quantile 0.853 0.349 0.445 16.18 0.219 0.363
Least squares 0.84 0.334 0.532 11.8 0.174 0.515
Table 9: Comparison of the performance of the estimators, for the NO2 model, ungrouped variables.
7 Proofs
In this section we present the proofs of the results stated in Sections 3 and 4.
7.1 Proofs of Section 3
Proof of Lemma 3.1. According to Wu and Liu (2009), it suffices to show that, for all ε > 0, there exists B > 0 large enough such that, for n large enough:

P[ inf_{u∈R^r, ‖u‖_2=1} Q_n(β^0 + B n^{-1/2} u) > Q_n(β^0) ] ≥ 1 − ε.

Let B > 0 and u ∈ R^r be such that ‖u‖_2 = 1. Then,

Q_n(β^0 + B n^{-1/2} u) − Q_n(β^0) ≥ n^{-1} (G_n(β^0 + B n^{-1/2} u) − G_n(β^0)) + λ_n Σ_{j=1}^{p_0} ω̂_{n;j} (‖β_j^0 + B n^{-1/2} u_j‖_2 − ‖β_j^0‖_2).    (16)
By the proof of Theorem 1 of Liao et al. (2019), under assumptions (A1), (A2), (A3), we have:

G_n(β^0 + B n^{-1/2} u) − G_n(β^0) = Σ_{i=1}^{n} [ ρ_τ(ε_i − B X_i^⊤ u / √n) − ρ_τ(ε_i) ]
 = Σ_{i=1}^{n} [ g_τ(ε_i) B X_i^⊤ u / √n + (h_τ(ε_i)/2) (B X_i^⊤ u / √n)² ] + o_P(1).    (17)
In order to study (17), we will prove the following two asymptotic results:

n^{-1/2} Σ_{i=1}^{n} g_τ(ε_i) X_i →_L N(0, σ²_{g_τ} Υ),    as n → ∞,    (18)

n^{-1} Σ_{i=1}^{n} h_τ(ε_i) X_i X_i^⊤ →_P μ_{h_τ} Υ,    as n → ∞.    (19)
By assumption (A1), we get E[n^{-1/2} g_τ(ε_i) X_i] = 0. Moreover, by assumptions (A1), (A3) and relation (7): Var[n^{-1/2} Σ_{i=1}^{n} g_τ(ε_i) X_i] → σ²_{g_τ} Υ as n → +∞. On the other hand, for ξ > 0, we have:

E[ ‖g_τ(ε_i) X_i/√n‖²_2 · 1_{‖g_τ(ε_i) X_i/√n‖_2 > ξ} ] = E[ (‖g_τ(ε_i) X_i/√n‖⁴_2 / ‖g_τ(ε_i) X_i/√n‖²_2) 1_{‖g_τ(ε_i) X_i/√n‖_2 > ξ} ]
 ≤ E[ ‖g_τ(ε_i) X_i/√n‖⁴_2 1_{‖g_τ(ε_i) X_i/√n‖_2 > ξ} ] / ξ² ≤ E[ ‖g_τ(ε_i) X_i/√n‖⁴_2 ] / ξ².

Then, by assumptions (A1) and (A3), we obtain:

Σ_{i=1}^{n} E[ ‖g_τ(ε_i) X_i/√n‖²_2 · 1_{‖g_τ(ε_i) X_i/√n‖_2 > ξ} ] ≤ (1/ξ²) E[g_τ(ε)⁴] Σ_{i=1}^{n} (X_i^⊤ X_i / n)² → 0,    as n → +∞.
Thus, by the Lindeberg-Feller Central Limit Theorem (CLT) we obtain (18). Relation (19) follows from the decomposition n^{-1} Σ_{i=1}^{n} h_τ(ε_i) X_i X_i^⊤ = n^{-1} Σ_{i=1}^{n} (h_τ(ε_i) − μ_{h_τ}) X_i X_i^⊤ + μ_{h_τ} n^{-1} Σ_{i=1}^{n} X_i X_i^⊤, whose first term converges to zero in probability and whose second term converges to μ_{h_τ} Υ by (7). Combining (17), (18) and (19), we obtain:

G_n(β^0 + B n^{-1/2} u) − G_n(β^0) = O_P(B²).    (20)

Let us now study the second term of the right-hand side of (16). For any j ≤ p_0 we have ‖β_j^0‖_2 ≠ 0 and, by the consistency of the expectile estimator, we obtain:

ω̂_{n;j} →_P ‖β_j^0‖_2^{-γ} ≠ 0,    as n → ∞.    (21)
Proof of Theorem 3.1. By Lemma 3.1, we can set û_n ≡ n^{1/2} (β̂_n − β^0) and, more generally, consider parameters of the form β^0 + n^{-1/2} u, with u ∈ R^r. By elementary calculations, for t → 0, we obtain ρ_τ(ε − t) = ρ_τ(ε) + g_τ(ε) t + h_τ(ε) t²/2 + o(t²). Consider the process L_n(u) ≡ n [Q_n(β^0 + n^{-1/2} u) − Q_n(β^0)] = G_n(β^0 + n^{-1/2} u) − G_n(β^0) + n λ_n Σ_{j=1}^{p} ω̂_{n;j} (‖β_j^0 + n^{-1/2} u_j‖_2 − ‖β_j^0‖_2), which can be written, taking into account (23):

L_n(u) = Σ_{i=1}^{n} [ g_τ(ε_i) X_i^⊤ u / √n + (h_τ(ε_i)/2) (X_i^⊤ u / √n)² ] + o_P(1) + n λ_n Σ_{j=1}^{p} ω̂_{n;j} (‖β_j^0 + u_j/√n‖_2 − ‖β_j^0‖_2).    (24)
Remark that relations (18) and (19) hold for the first term of the right-hand side of (24). We are now interested in the second term of the right-hand side of relation (24). We will prove:

n λ_n Σ_{j=1}^{p} ω̂_{n;j} (‖β_j^0 + n^{-1/2} u_j‖_2 − ‖β_j^0‖_2) →_P Σ_{j=1}^{p} W(β_j^0, u),    as n → ∞,    (25)

with W(β_j^0, u) = 0 if j ∈ A; W(β_j^0, u) = 0 if j ∈ A^c and u_j = 0_{d_j}; W(β_j^0, u) = ∞ if j ∈ A^c and u_j ≠ 0_{d_j}.

The minimizer of the limiting process L(u), by a result on strictly convex quadratic forms, is −μ_{h_τ}^{-1} Υ_A^{-1} z_A, where z_A denotes the Gaussian limit in (18) restricted to the components with indexes in A. We have then, by an epi-convergence result of Geyer (1994) and Knight and Fu (2000), that û_{n,A} →_L N(0_{r_0}, σ²_{g_τ} μ_{h_τ}^{-2} Υ_A^{-1}), as n → ∞. The proof is finished.
Proof of Theorem 3.2. By Lemma 3.1: β̂_{n,A} = β^0_A + O_P(n^{-1/2}) and β̂_{n,A^c} = O_P(n^{-1/2}). Then, we consider β ∈ R^r such that ‖β − β^0‖_2 = O(n^{-1/2}), β = (β_A, β_{A^c}), β_A ∈ R^{r_0}, β_{A^c} ∈ R^{r−r_0}. In order to prove the theorem, we consider a β such that ‖β_{A^c}‖_2 ≠ 0. Then,

n [Q_n((β_A, 0_{r−r_0})) − Q_n((β_A, β_{A^c}))] = Σ_{i=1}^{n} ρ_τ(Y_i − X_{i,A}^⊤ β_A) − Σ_{i=1}^{n} ρ_τ(Y_i − X_i^⊤ β) − n λ_n Σ_{j=p_0+1}^{p} ω̂_{n;j} ‖β_j‖_2.    (26)
Now, by assumption (A2) we have, for any i = 1, ..., n, that X_{i,A}^⊤ (β_A − β^0_A) → 0 as n → ∞. Then, we deduce, in the same way as for (23): Σ_{i=1}^{n} [ρ_τ(Y_i − X_{i,A}^⊤ β_A) − ρ_τ(ε_i)] = Σ_{i=1}^{n} [ g_τ(ε_i) X_{i,A}^⊤ (β_A − β^0_A) + 2^{-1} h_τ(ε_i) (X_{i,A}^⊤ (β_A − β^0_A))² + o_P((X_{i,A}^⊤ (β_A − β^0_A))²) ]. By similar arguments as for the proof of (18), using assumptions (A1), (A2) and since β_A − β^0_A = O(n^{-1/2}), we deduce: Σ_{i=1}^{n} g_τ(ε_i) X_{i,A}^⊤ (β_A − β^0_A) = O_P( [ n^{1/2}(β_A − β^0_A)^⊤ (n^{-1} Σ_{i=1}^{n} X_{i,A} X_{i,A}^⊤) n^{1/2}(β_A − β^0_A) σ²_{g_τ} ]^{1/2} ) = O_P(1) and Σ_{i=1}^{n} h_τ(ε_i) (X_{i,A}^⊤ (β_A − β^0_A))² = μ_{h_τ} n^{1/2}(β_A − β^0_A)^⊤ [ n^{-1} Σ_{i=1}^{n} X_{i,A} X_{i,A}^⊤ ] n^{1/2}(β_A − β^0_A) + o_P(1) = O_P(1). We obtain then:

Σ_{i=1}^{n} [ρ_τ(Y_i − X_{i,A}^⊤ β_A) − ρ_τ(ε_i)] = O_P(1),    (27)

with Υ_{A_j} denoting the sub-matrix of Υ with indexes in {d_{j−1} + 1, ..., d_j}. Since ‖β_j^0‖_2 ≠ 0, then lim_{n→∞} P[A ⊆ Â_n] = 1, which together with lim_{n→∞} P[Â_n ⊆ A] = 1 imply the theorem.
Recall first a result given in Ciuperca (2021), needed in the following proofs.

Lemma 7.1 (Theorem 2.1.(i) of Ciuperca (2021)). Under assumptions (A1), (A2), (A3), (A4), we have: ‖β̃_n − β^0‖_2 = O_P((p/n)^{1/2}). More precisely, for u ∈ R^r such that ‖u‖_2 = 1:
Proof of Theorem 3.3. We show that, for all ε > 0, there exists B > 0 large enough such that, for n large enough:

P[ inf_{u∈R^r, ‖u‖_2=1} Q_n(β^0 + B (p/n)^{1/2} u) > Q_n(β^0) ] ≥ 1 − ε.    (29)
Let u ∈ R^r be such that ‖u‖_2 = 1 and B > 0. In order to prove relation (29), we will study the difference Q_n(β^0 + B(p/n)^{1/2} u) − Q_n(β^0) in relation (30).
Let us first study the penalty of the right-hand side of relation (30). The following inequality holds with probability 1: min_{j∈A} ‖β_j^0‖_2 ≤ max_{j∈A} ‖β̃_{n;j} − β_j^0‖_2 + min_{j∈A} ‖β̃_{n;j}‖_2. By Lemma 7.1, max_{j∈A} ‖β̃_{n;j} − β_j^0‖_2 = O_P(n^{(c−1)/2}), from which, since α > (c − 1)/2 by assumption (A5), it results that n^{-α} max_{j∈A} ‖β̃_{n;j} − β_j^0‖_2 = o_P(1). On the other hand, still using assumption (A5), we have: K ≤ h_0 n^{-α} ≤ n^{-α} max_{j∈A} ‖β̃_{n;j} − β_j^0‖_2 + n^{-α} min_{j∈A} ‖β̃_{n;j}‖_2. Then, the two last relations imply:

lim_{n→+∞} P[ min_{j∈A} ‖β̃_{n;j}‖_2 > K n^{α}/2 ] = 1.    (31)

Taking into account relation (31) and the Cauchy-Schwarz inequality, we get, for a constant B > 0, with probability converging to one, relation (32). On the other hand, using (9) together with (32), we obtain:

λ_n Σ_{j=1}^{p} ω̂_{n;j} (‖β_j^0 + B(p/n)^{1/2} u_j‖_2 − ‖β_j^0‖_2) ≥ −o_P(B p/n).    (33)

On the other hand, by Lemma 7.1: 0 < n^{-1} [G_n(β^0 + B(p/n)^{1/2} u) − G_n(β^0)] = O_P(B² p n^{-1}). Therefore, taking into account (33) and choosing B and n large enough, we deduce that the first term of the right-hand side of (30) dominates the penalty term and relation (29) follows.
Proof of Theorem 3.4. For B > 0 large enough, consider the parameter set V_p(β^0) ≡ {β ∈ R^r; ‖β − β^0‖_2 ≤ B(p/n)^{1/2}}. Consider also the parameter set W_n = {β ∈ V_p(β^0); ‖β_{A^c}‖_2 > 0}. If β ∈ W_n, then p > p_0. In order to show the sparsity property, we will first show:

lim_{n→∞} P[β̂_n ∈ W_n] = 0.    (34)
For this, we consider two vectors β = (β_A, β_{A^c}) ∈ W_n and β^{(1)} = (β^{(1)}_A, β^{(1)}_{A^c}) such that β^{(1)}_A = β_A and β^{(1)}_{A^c} = 0_{r−r_0}. Then, in order to prove relation (34), we will study the difference

Q_n(β) − Q_n(β^{(1)}) = n^{-1} (G_n(β) − G_n(β^{(1)})) + λ_n Σ_{j=p_0+1}^{p} ω̂_{n;j} ‖β_j‖_2.    (35)
By elementary calculations, in the same way as for relation (23), we can rewrite:

G_n(β) − G_n(β^{(1)}) = Σ_{i=1}^{n} [ ρ_τ(ε_i − X_{i,A}^⊤ (β_A − β^0_A)) − ρ_τ(ε_i − X_i^⊤ (β_A − β^0_A, β_{A^c})) ]
 = Σ_{i=1}^{n} [ g_τ(ε_i) X_{i,A}^⊤ (β_A − β^0_A) + (h_τ(ε_i)/2) (X_{i,A}^⊤ (β_A − β^0_A))² + o_P((X_{i,A}^⊤ (β_A − β^0_A))²) ]
 − Σ_{i=1}^{n} [ g_τ(ε_i) X_i^⊤ (β_A − β^0_A, β_{A^c}) + (h_τ(ε_i)/2) (X_i^⊤ (β_A − β^0_A, β_{A^c}))² + o_P((X_i^⊤ (β_A − β^0_A, β_{A^c}))²) ].

The quadratic terms satisfy 0 < n^{-1} Σ_{i=1}^{n} 2^{-1} h_τ(ε_i) (X_{i,A}^⊤ (β_A − β^0_A))² = O_P(‖β_A − β^0_A‖²_2) = O_P(p n^{-1}). Likewise, we obtain: 0 < n^{-1} Σ_{i=1}^{n} 2^{-1} h_τ(ε_i) (X_i^⊤ (β_A − β^0_A, β_{A^c}))² = O_P(p n^{-1}). Thus

n^{-1} (G_n(β) − G_n(β^{(1)})) = O_P(p/n).    (36)
We are now interested in the penalty term of the right-hand side of (35). By Lemma 7.1, for all j ∈ A^c, we have: ω̂_{n;j} = ‖β̃_{n;j}‖_2^{-γ} = ‖β̃_{n;j} − β_j^0‖_2^{-γ} = O_P((p/n)^{-γ/2}) and, since β ∈ W_n,

0 < λ_n Σ_{j=p_0+1}^{p} ω̂_{n;j} ‖β_j‖_2 = Σ_{j=p_0+1}^{p} O_P(λ_n (p/n)^{(1−γ)/2}).    (37)

Relations (36) and (37) imply, for relation (35):

Q_n(β) − Q_n(β^{(1)}) = O_P(p/n) + (p − p_0) O_P(λ_n (p/n)^{(1−γ)/2}).    (38)

Using condition (10), the second term of the right-hand side of (38) dominates, and therefore: Q_n(β) − Q_n(β^{(1)}) = (p − p_0) O_P(λ_n (p/n)^{(1−γ)/2}). On the other hand, by (35) and (36), we have: Q_n(β^0) − Q_n(β^{(1)}) = O_P(p/n). Therefore, using (10), we get that, with probability converging to one, Q_n(β^{(1)}) < Q_n(β) for all β ∈ W_n (39). Then, relation (39) implies relation (34), which in turn implies:

lim_{n→∞} P[Â_n ⊆ A] = 1.    (40)

Finally, by Theorem 3.3 and a relation similar to (31), we have lim_{n→∞} P[min_{j∈A} ‖β̂_{n;j}‖_2 > 0] = 1, which implies lim_{n→∞} P[Â_n ⊇ A] = 1 and, together with (40), completes the proof.
Proof of Theorem 3.5. By Theorems 3.3 and 3.4, we consider the parameter β under the form β = β^0 + (p/n)^{1/2} η, with η = (η_A, η_{A^c}), η_{A^c} = 0_{r−r_0}, ‖η_A‖_2 ≤ C. Then, we will study:

Q_n(β^0 + (p/n)^{1/2} η) − Q_n(β^0) = n^{-1} Σ_{i=1}^{n} [ ρ_τ(ε_i − (p/n)^{1/2} X_i^⊤ η) − ρ_τ(ε_i) ] + P,    (41)

with P = λ_n Σ_{j=1}^{p_0} ω̂_{n;j} (‖β_j‖_2 − ‖β_j^0‖_2). For all j ∈ {1, ..., p_0}, we have that |‖β_j‖_2 − ‖β_j^0‖_2| ≤ (p/n)^{1/2} ‖η_j‖_2. On the other hand, using assumption (A5), we have ‖β_j^0‖_2 ≥ h_0 ≥ K n^{α} and ‖β̃_{n;j}‖_2 − ‖β_j^0‖_2 = O_P((p/n)^{1/2}) = O_P(n^{(c−1)/2}). Since, by the same assumption, α > (c − 1)/2, we deduce that ω̂_{n;j} = O_P(n^{-αγ}), for all j ∈ A. We have then, with probability equal to 1: |P| ≤ λ_n Σ_{j=1}^{p_0} ω̂_{n;j} |‖β_j‖_2 − ‖β_j^0‖_2| ≤ λ_n (p/n)^{1/2} Σ_{j=1}^{p_0} ω̂_{n;j} ‖η_j‖_2. Applying the Cauchy-Schwarz inequality and condition (9), we get with probability one: |P| ≤ λ_n (p/n)^{1/2} (Σ_{j=1}^{p_0} ω̂²_{n;j})^{1/2} (Σ_{j=1}^{p_0} ‖η_j‖²_2)^{1/2} = λ_n (p/n)^{1/2} (Σ_{j=1}^{p_0} ω̂²_{n;j})^{1/2} ‖η_A‖_2. This relation, together with condition (9) and ω̂_{n;j} = O_P(n^{-αγ}), implies, with probability converging to one: |P| ≤ C λ_n (p/n)^{1/2} (Σ_{j=1}^{p_0} n^{-2αγ})^{1/2} ≤ C (p/n)^{1/2} (p_0)^{1/2} λ_n n^{-αγ} ≤ C (p/n)^{1/2} p^{1/2} λ_n n^{-αγ} = o(p/n). We are now interested in the first term of the right-hand side of (41). By assumption (A3) and the Markov inequality, similarly as in the proof of Theorem 3.1, we have:
(1/n) Σ_{i=1}^{n} [ ρ_τ(ε_i − (p/n)^{1/2} X_i^⊤ η) − ρ_τ(ε_i) ]
 = −(1/n) (p/n)^{1/2} Σ_{i=1}^{n} g_τ(ε_i) (X_{i,A}^⊤ η_A) + (1/(2n)) (p/n) Σ_{i=1}^{n} ( E[h_τ(ε_i)] + h_τ(ε_i) − E[h_τ(ε_i)] ) η_A^⊤ X_{i,A} X_{i,A}^⊤ η_A (1 + o_P(1))
 = −(1/n) (p/n)^{1/2} Σ_{i=1}^{n} g_τ(ε_i) (X_{i,A}^⊤ η_A) + (1/(2n)) (p/n) E[h_τ(ε_i)] η_A^⊤ Σ_{i=1}^{n} X_{i,A} X_{i,A}^⊤ η_A (1 + o_P(1)) = O_P(p/n).
Thus, the minimization of the right-hand side of (41) with respect to η amounts to minimizing n^{-1} Σ_{i=1}^{n} [ ρ_τ(ε_i − (p/n)^{1/2} X_i^⊤ η) − ρ_τ(ε_i) ]. Moreover, the minimizer of −n^{-1} (p n^{-1})^{1/2} Σ_{i=1}^{n} g_τ(ε_i) (X_{i,A}^⊤ η_A) + (2n)^{-1} p n^{-1} Σ_{i=1}^{n} E[h_τ(ε_i)] η_A^⊤ X_{i,A} X_{i,A}^⊤ η_A verifies −n^{-1} (p/n)^{1/2} Σ_{i=1}^{n} g_τ(ε_i) X_{i,A} + (p/n) μ_{h_τ} Υ_{n,A} η_A = 0_{r_0}, hence

(p/n)^{1/2} η_A = (Υ_{n,A}^{-1} / μ_{h_τ}) (1/n) Σ_{i=1}^{n} g_τ(ε_i) X_{i,A}.    (42)
Let u ∈ R^{r_0} be such that ‖u‖_2 = 1. In order to study η_A of (42), let us consider the random variables, for i = 1, ..., n: R_i ≡ μ_{h_τ}^{-1} g_τ(ε_i) u^⊤ Υ_{n,A}^{-1} X_{i,A}. Taking into account assumption (A1) and applying the Lindeberg-Feller CLT to the variables R_i, the asymptotic normality follows.
7.2 Proofs of Section 4

Proof of Theorem 4.1. (i) We prove that, for all ε > 0, there exists a constant B > 0 large enough such that, when n is large: P[ inf_{u∈R^r, ‖u‖_1=1} G_n(β^0 + B(p/n)^{1/2} u; q) > G_n(β^0; q) ] ≥ 1 − ε. Consider B > 0 a constant to be determined later and u ∈ R^r such that ‖u‖_1 = 1. Then:

G_n(β^0 + B(p/n)^{1/2} u; q) − G_n(β^0; q) ≡ ∆_1 + ∆_2,    (43)
with ∆_1 ≡ Σ_{i=1}^{n} { ρ_τ(ε_i − B(p/n)^{1/2} X_i^⊤ u; q) − ρ_τ(ε_i; q) − E[ ρ_τ(ε_i − B(p/n)^{1/2} X_i^⊤ u; q) − ρ_τ(ε_i; q) ] }.
On the other hand, using assumptions (A1q), (A3), we have: E[ B(p/n)^{1/2} g_τ(ε_i; q) X_i^⊤ u ] = 0. On the other hand, by the Cauchy-Schwarz inequality, taking also into account assumption (A4) and the fact that ‖X_i‖_2 ≤ r^{1/2} ‖X_i‖_∞, we obtain (p/n)^{1/2} max_{1≤i≤n} ‖X_i‖_2 = o(1), which implies ((p/n)^{1/2} X_i^⊤ u)² ≤ (p/n)^{1/2} |X_i^⊤ u|, for any i = 1, ..., n. Then, there exists a constant c_1 > 0 and we can suppose, without loss of generality, that c_1 = 1. Then |D_i| 1_{|ε_i| ≤ (p/n)^{1/2} |X_i^⊤ u|} ≤ (p/n)^{1/2} |X_i^⊤ u| 1_{|ε_i| ≤ (p/n)^{1/2} |X_i^⊤ u|}.
n r n r
pX > 2 p > pX > 2 p
T1 ≤ (Xi u) P |εi | ≤ |Xi u| ≤ (Xi u) P |εi | ≤ kuk1 max kXi k∞
n n n n 16i6n
i=1 i=1
33
n r r
pX > 2 p p
≤ (Xi u) · O F max kXi k∞ − F − max kXi k∞
n n 16i6n n 16i6n
i=1
r
p
=O p = o(p). (48)
n
In order to study T_2, we write: D_i 1_{|ε_i| > (p/n)^{1/2} |X_i^⊤ u|} = h_τ(ε_i; q) ((p/n)^{1/2} X_i^⊤ u)² 1_{|ε_i| > (p/n)^{1/2} |X_i^⊤ u|} (1 + o(1)). Then:

T_2 ≤ (q(q − 1))² Σ_{i=1}^{n} ((p/n)^{1/2} X_i^⊤ u)⁴ ∫_{|x| > (p/n)^{1/2} |X_i^⊤ u|} |x|^{2(q−2)} f(x) dx (1 + o(1)).

If 1 < q < 2, using Hölder's inequality and assumption (A2), we obtain: T_2 ≤ (q(q − 1))² Σ_{i=1}^{n} ((p/n)^{1/2} X_i^⊤ u)^{2q} (1 + o(1)) = O(p^q n^{-(q−1)}). Then, taking into account the previous relations, claim (i) follows.
(ii) Similarly to (33), using (i), assumptions (A5q) and (9), and taking also into account the equivalence of norms in finite dimension, we have: λ_n Σ_{j=1}^{p} ω̂_{n;j}(q) (‖β_j^0 + B(p/n)^{1/2} u_j‖_2 − ‖β_j^0‖_2) ≥ λ_n Σ_{j=1}^{p_0} ω̂_{n;j}(q) (‖β_j^0 + B(p/n)^{1/2} u_j‖_2 − ‖β_j^0‖_2) ≥ −o_P(B p n^{-1}), and claim (ii) follows.
Proof of Theorem 4.2. The idea of the proof is similar to that of Theorem 3.4. Let us consider the parameter sets V_p(β^0) ≡ {β ∈ R^r; ‖β − β^0‖_1 ≤ B(p/n)^{1/2}}, with B > 0 large enough, and W_n ≡ {β ∈ V_p(β^0); ‖β_{A^c}‖_1 > 0}. Let us consider two parameter vectors β = (β_A, β_{A^c}) ∈ W_n and β^{(1)} = (β^{(1)}_A, β^{(1)}_{A^c}) ∈ V_p(β^0), such that β^{(1)}_A = β_A and β^{(1)}_{A^c} = 0_{r−r_0}, for which we study the difference: Q_n(β; q) − Q_n(β^{(1)}; q) = n^{-1} (G_n(β; q) − G_n(β^{(1)}; q)) + λ_n Σ_{j=p_0+1}^{p} ω̂_{n;j}(q) ‖β_j‖_2. By Lemma A1 of Hu et al. (2021),
Σ_{i=1}^{n} ρ_τ(ε_i − (p/n)^{1/2} X_i^⊤ u; q) = Σ_{i=1}^{n} ρ_τ(ε_i; q) + (p/n)^{1/2} Σ_{i=1}^{n} g_τ(ε_i; q) X_i^⊤ u + (E[h_τ(ε; q)]/2) (p/n) u^⊤ Σ_{i=1}^{n} X_i X_i^⊤ u + o_P(p^{1/2}).    (52)
Then, using (52) we have, similarly to the proof of Theorem 3.4: G_n(β; q) − G_n(β^{(1)}; q) = [G_n(β; q) − G_n(β^0; q)] − [G_n(β^{(1)}; q) − G_n(β^0; q)] = Σ_{i=1}^{n} g_τ(ε_i; q) X_{i,A}^⊤ (β − β^0)_A = O_P(p). Thus, this last relation, together with a reasoning similar to (37) and taking into account (10), leads to: Q_n(β; q) − Q_n(β^{(1)}; q) = O_P(p/n) + Σ_{j=p_0+1}^{p} O_P(λ_n (p/n)^{(1−γ)/2}) = Σ_{j=p_0+1}^{p} O_P(λ_n (p/n)^{(1−γ)/2}).
Proof of Theorem 4.3. The proof, similar to that of Theorem 3.5, is omitted.
Proof of Lemma 4.1. The proof idea is similar to that of Theorem 4.1(i). Consider u ∈ R^r such that ‖u‖_1 = 1. Using Hölder's inequality, |X_i^⊤ u| ≤ ‖X_i‖_∞ ‖u‖_1, relation (44) becomes:

0 < ∆_2 = O_P( a_n² μ_{h_τ}(q) B² Σ_{i=1}^{n} (X_i^⊤ u)² ) (1 + o_P(1)) = O_P(B² a_n² n).    (53)

With respect to the proof of Theorem 4.1(i), now ∆_1 ≡ Σ_{i=1}^{n} { ρ_τ(ε_i − B a_n X_i^⊤ u; q) − ρ_τ(ε_i; q) − E[ ρ_τ(ε_i − B a_n X_i^⊤ u; q) − ρ_τ(ε_i; q) ] }.
O_P(a_n^q n^{1/2}) = o_P(a_n n^{1/2}). If q > 2, then T_2 ≤ O(a_n⁴ n), from where Σ_{i=1}^{n} (D_i − E[D_i]) = O_P(a_n) + O_P(a_n² n^{1/2}) = o_P(a_n n^{1/2}). Therefore, in the two cases, Σ_{i=1}^{n} (D_i − E[D_i]) = o_P(a_n n^{1/2}) and then: ∆_1 = −O_P(B n^{1/2} a_n) + o_P(a_n n^{1/2}) = −O_P(B n^{1/2} a_n). In conclusion, since n^{1/2} a_n → ∞, we have: G_n(β^0 + B a_n u; q) − G_n(β^0; q) = O_P(B² a_n² n) − O_P(B a_n n^{1/2}) > 0 with probability converging to one for B large enough, and the lemma follows.
Proof of Theorem 4.4. As in the proof of Theorem 4.1(ii), we study the difference Q_n(β^0 + B b_n u; q) − Q_n(β^0; q), for u ∈ R^r, ‖u‖_1 = 1 and B > 0. By the proof of Lemma 4.1 we have relation (54). Similarly to (32), taking into account assumption (A5q) and the equivalence of norms in finite dimension, we obtain:

λ_n Σ_{j=1}^{p} ω̂_{n;j}(q) (‖β_j^0 + B b_n u_j‖_2 − ‖β_j^0‖_2) ≥ −B (p_0)^{1/2} λ_n b_n.    (55)

By assumption, λ_n (p_0)^{1/2} b_n^{-1} → 0, so that B² b_n² ≫ B λ_n (p_0)^{1/2} b_n, which implies that (54) dominates (55); the theorem then follows as in the proof of Theorem 4.1(ii).
Proof of Theorem 4.5. (i) We use the decomposition involving Σ_{i=1}^{n} o_P( X_i^⊤(β − β^0) h_τ(ε_i; q) X_{ij} ). Similarly to (46), we have: Σ_{i=1}^{n} g_τ(ε_i; q) X_{ij} = O_P(n^{1/2}). Similarly to the proof of E[h_τ(ε; q)] < ∞ in Hu et al. (2021), we show that Var[h_τ(ε; q)] < ∞. Then, considering u = β − β^0 with ‖u‖_1 = O(b_n), using the Markov inequality together with assumptions (A1q), (A2), (A3), we have: Σ_{i=1}^{n} X_i^⊤ u h_τ(ε_i; q) X_{ij} = E[ Σ_{i=1}^{n} X_i^⊤ u h_τ(ε_i; q) X_{ij} ] + O_P( Var[ Σ_{i=1}^{n} X_i^⊤ u h_τ(ε_i; q) X_{ij} ]^{1/2} ) = O(n b_n) + O_P((n b_n²)^{1/2}) = O_P(n b_n). Thus,

Σ_{i=1}^{n} g_τ(Y_i − X_i^⊤ β; q) X_{ij} = O_P(n^{1/2}) + O_P(n b_n) = O_P(n b_n).    (58)

On the other hand, since j ∈ Â_n ∩ A^c, we have that λ_n ω̂_{n;j}(q) = λ_n ‖β̃_{n;j}(q)‖_2^{-γ} = λ_n O_P(a_n^{-γ}). Since λ_n a_n^{-γ} b_n^{-1} → ∞, this is in contradiction with (58), the left-hand side of relation (56) being much smaller than its right-hand side. Then, P[Â_n ∩ A^c ≠ ∅] → 0 as n → ∞. Taking also into account (57), we have then P[Â_n = A] → 1 as n → ∞.
(ii) By Theorem 4.4 and (i) of the present theorem, we consider the parameters of the form β = β^0 + b_n δ, with δ ∈ R^r, δ = (δ_A, δ_{A^c}), δ_{A^c} = 0_{r−r_0}, ‖δ_A‖_1 ≤ C. Then,

Q_n(β^0 + b_n δ; q) − Q_n(β^0; q) = n^{-1} Σ_{i=1}^{n} [ ρ_τ(Y_i − X_i^⊤(β^0 + b_n δ); q) − ρ_τ(ε_i; q) ] + P,    (59)

with P = λ_n Σ_{j=1}^{p_0} ω̂_{n;j}(q) (‖β_j^0 + b_n δ_j‖_2 − ‖β_j^0‖_2). Relation (55) implies, using the Cauchy-Schwarz inequality:

|P| = O_P(λ_n b_n (p_0)^{1/2}).    (60)

On the other hand, for the first term of the right-hand side of (59), we have:

Σ_{i=1}^{n} [ ρ_τ(Y_i − X_i^⊤(β^0 + b_n δ); q) − ρ_τ(ε_i; q) ] = Σ_{i=1}^{n} [ ρ_τ(ε_i − b_n X_{i,A}^⊤ δ_A; q) − ρ_τ(ε_i; q) ]
 = −b_n Σ_{i=1}^{n} g_τ(ε_i; q) X_{i,A}^⊤ δ_A + (b_n²/2) Σ_{i=1}^{n} δ_A^⊤ X_{i,A} X_{i,A}^⊤ δ_A E[h_τ(ε_i; q)] (1 + o_P(1)),    (61)

which has as minimizer the solution of: −b_n Σ_{i=1}^{n} g_τ(ε_i; q) X_{i,A} + n b_n² Υ_{n,A} δ_A E[h_τ(ε_i; q)] = 0_{r_0}, from where, using assumption (A3):

b_n δ_A = (Υ_{n,A}^{-1} / E[h_τ(ε_i; q)]) (1/n) Σ_{i=1}^{n} g_τ(ε_i; q) X_{i,A}.    (62)
Comparing (60) with (61), it is the first term of the right-hand side of (59) which dominates. Then, the minimizer of (59) is given by (62). Consider R_i ≡ (E[h_τ(ε_i; q)])^{-1} g_τ(ε_i; q) u^⊤ Υ_{n,A}^{-1} X_{i,A}; the rest of the proof is similar to that of Theorem 3.5.
References
Chen, Z., (1996). Conditional Lp-quantiles and their application to the testing of symmetry in non-parametric regression. Statist. Probab. Lett., 29(2), 107–115.
Ciuperca, G., (2019). Adaptive group LASSO selection in quantile models. Statist. Papers,
60(1), 173–197.
Ciuperca, G., (2020). Adaptive elastic-net selection in a quantile model with diverging number
of variable groups. Statistics, 54(5), 1147–1170.
Ciuperca, G., (2021). Variable selection in high-dimensional linear model with possibly asym-
metric errors. Comput. Statist. Data Anal., 155, 107–112.
Dondelinger, F., Mukherjee, S., (2020). The joint lasso: high-dimensional regression for group
structured data. Biostatistics, 21(2), 219–235.
Daouia, A., Girard, S., Stupfler, G., (2019). Extreme M-quantiles as risk measures: from L1 to
Lp optimization. Bernoulli, 25(1), 264–309.
Geyer, C. J., (1994). On the asymptotics of constrained M-estimation. Ann. Statist., 22(4), 1993–2010.
Holzmann, H., Klar, B., (2016). Expectile asymptotics. Electron. J. Stat., 10(2), 2355–2371.
Hu, J., Chen, Y., Zhang, W., Guo, X., (2021). Penalized high-dimensional M-quantile regres-
sion: From L1 to Lp optimization. Canad. J. Statist., 49(3), 875–905.
Knight, K., Fu, W., (2000). Asymptotics for lasso-type estimators. Ann. Statist., 28, 1356–1378.
Liao, L., Park, C., Choi, H., (2019). Penalized expectile regression: an alternative to penalized
quantile regression. Ann. Inst. Statist. Math., 71(2), 409–438.
Mendez-Civieta, A., Aguilera-Morillo, C., Lillo, R.E., (2021). Adaptive sparse group LASSO
in quantile regression. Adv. Data Anal. Classif., 15(3), 547–573.
Tibshirani, R., (1996). Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B, 58, 267–288.
Schulze-Waltrup, L., Sobotka, F., Kneib, T., Kauermann, G., (2015). Expectile and quantile
regression—David and Goliath?. Stat. Model., 15(5), 433–456.
Wang, H., Leng, C., (2008). A note on adaptive group lasso. Comput. Statist. Data Anal., 52,
5277-5286.
Wang, M., Tian, G.L., (2019). Adaptive group Lasso for high-dimensional generalized linear
models. Statist. Papers, 60 (5), 1469–1486.
Wang, M., Wang, X., (2014). Adaptive Lasso estimators for ultrahigh dimensional generalized
linear models. Statist. Probab. Lett., 89, 41–50.
Wu, Y., Liu, Y., (2009). Variable selection in quantile regression. Statist. Sinica, 19, 801–817.
Zhang, C., Xiang, Y., (2016). On the oracle property of adaptive group LASSO in high-
dimensional linear models. Statist. Papers, 57(1), 249-265.
Zhao, J., Chen, Y., Zhang, Y., (2018). Expectile regression for analyzing heteroscedasticity in
high dimension. Statist. Probab. Lett., 137, 304–311.
Zhou, S., Zhou, J., Zhang, B., (2019). Overlapping group lasso for high-dimensional generalized
linear models. Comm. Statist. Theory Methods, 48(19), 4903–4917.
Zou, H., (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. , 101(476),
1418–1428.