The Non-overlapping Statistical Approximation to Overlapping Group Lasso

Mingyu Qi
Department of Statistics, University of Virginia
Charlottesville, VA 22904-4135, USA
mq3sq@virginia.edu

Tianxi Li
School of Statistics, University of Minnesota, Twin Cities
Minneapolis, MN 55455, USA
tianxili@umn.edu

arXiv:2211.09221v3 [stat.ML] 20 Feb 2024

Abstract
The group lasso penalty is widely used to introduce structured sparsity in statistical learning, characterized by its ability
to eliminate predefined groups of parameters automatically. However, when the groups are overlapping, solving the
group lasso problem can be time-consuming in high-dimensional settings because of the non-separability induced by the
groups. This difficulty has significantly limited the penalty’s applicability in cutting-edge computational areas, such as
gene pathway selection and graphical model estimation. This paper introduces a non-overlapping and separable penalty to
efficiently approximate the overlapping group lasso penalty. The approximation substantially improves the computational
efficiency in optimization, especially for large-scale and high-dimensional problems. We show that the proposed penalty
is the tightest separable relaxation of the overlapping group lasso norm within the family of ℓ_{q1}/ℓ_{q2} norms. Furthermore, the estimators based on our proposed norm are statistically equivalent to those derived from the overlapping group lasso
in terms of estimation error, support recovery, and minimax rate under the squared loss. The effectiveness of the method
is demonstrated through extensive simulation examples and a predictive task on breast cancer tumor data.

Keywords overlapping group lasso · separable approximation · computational efficiency · statistical error bound ·
support recovery · high-dimensional regression

1. Introduction
Grouping patterns of variables are commonly observed in real-world applications. For example, in regression modeling,
explanatory variables might belong to different groups with the expectation that the variables are highly correlated
within the groups. In this context, variable selection or model regularization should also consider the grouping patterns,
and one may prefer to either include the whole group of variables in the selection or completely rule out the group.
Group lasso (Yuan and Lin, 2006) is a popular method designed for this group selection task via ℓ1/ℓ2
regularization, and it is representative of a broader class of group selection approaches (Bach, 2008; Levina et al., 2008; Meier et al., 2008; Ravikumar
et al., 2009; Zhao et al., 2009b; Danaher et al., 2014; Loh, 2014; Basu et al., 2015; Xiang et al., 2015; Campbell and
Allen, 2017; Tank et al., 2017; Yan and Bien, 2017; Austin et al., 2020; Yang and Peng, 2020).
While the original group lasso penalty (Yuan and Lin, 2006) focuses on regularizing disjoint parameter groups,
overlapping groups appear frequently in many applications such as tumor metastasis analysis (Jacob et al., 2009; Zhao
et al., 2009b; Yuan et al., 2011; Chen et al., 2012) and structured model selection problems (Mohan et al., 2014; Cheng
et al., 2017; Yu and Bien, 2017; Tarzanagh and Michailidis, 2018). For example, in tumor metastasis analysis, scientists
usually aim to select a small number of tumor-related genes. Biological theory indicates that rather than functioning
in isolation, genes act in groups to perform biological functions. Hence, the gene selection is more meaningful if
co-functioning groups of genes are selected together (Ma and Kosorok, 2010). In particular, gene pathways, in the form
of overlapping groups of genes, render mechanistic insights into the co-functioning pattern. Applying group lasso with
these overlapping groups is then a natural way to incorporate the prior group information into tumor metastasis analysis.
For another example, graphical models have been widely used to represent conditional dependency structures among
variables. Cheng et al. (2017) developed a mixed graphical model for high-dimensional data with both continuous and
discrete variables. In their model, the groups are naturally determined by groups of parameters corresponding to each
edge, and these groups overlap because edges share common nodes. Selecting the graph structures under this class of
models requires eliminating groups of parameters, which is achieved by the overlapping group lasso penalty.
The optimization involving the group lasso penalty with non-overlapping groups is efficient (Friedman et al., 2010;
Qin et al., 2013; Yang and Zou, 2015). However, the overlapping group lasso problems present more complex
challenges despite their convex nature. This is because the non-separability between groups intrinsically increases the
problem’s dimensionality compared with the non-overlapping situation, as revealed in the study of Yan and Bien (2017).
Proposed methods for such optimization problems include the second-order cone program method SLasso (Jenatton
et al., 2011a), the ADMM-based methods (Boyd et al., 2011; Deng et al., 2013), and their smoothed improvement,
FoGLasso, introduced by Yuan et al. (2011). Nevertheless, these exact solvers involve expensive gradient
calculations when the overlap becomes severe, which may limit the applicability of the overlapping group lasso
penalty in many large-scale applications such as genome-wide association studies (Yang et al., 2010; Lee and Xing,
2012, 2014) or graphical model fitting problems (Cheng et al., 2017). For instance, Cheng et al. (2017) showed that
the overlapping group lasso, though a natural choice for the problem, is infeasible even for moderate-size graphs, and they
used a fast lasso approach (Tibshirani, 1996) to solve the graph estimation problem without theoretical guarantees. As we introduce
later, our proposed solution includes the method of Cheng et al. (2017) as a special case, but our method is more general
and comes with theoretical guarantees.
In this paper, we propose a non-overlapping approximation alternative to the overlapping group lasso penalty. The
approximation is formulated as a weighted non-overlapping group lasso penalty that respects the original overlapping
group patterns, making optimization significantly easier. The proposed penalty is shown to be the tightest separable
relaxation of the original overlapping group lasso penalty within a broad family of penalties. Our analysis reveals that
the estimator derived from our method is statistically equivalent to the original overlapping group lasso estimator in
terms of estimation error and support recovery. The practical utility of our proposed method is exemplified through
simulation examples and its application in a predictive task involving a breast cancer gene dataset. As a high-level
summary, the major contribution of this paper is the design of a novel approximation penalty to the overlapping group
lasso penalty, which enjoys substantially better computational efficiency in optimization while maintaining statistical
properties equivalent to those of the original penalty.
The remainder of this paper is organized as follows: Section 2 introduces the overlapping group lasso problem and
the proposed approximation method. We also establish the optimality of the proposed penalty from the optimization
perspective. Section 3 details the statistical properties of the penalized estimator based on the proposed penalty.
Comparisons between our estimator and the original overlapping group lasso estimator are made to show that they are
statistically equivalent with respect to estimation errors and variable selection performance. Empirical evaluations using
simulated and real breast cancer gene expression data are presented in Sections 4 and 5, respectively. Finally, Section 6
concludes the paper with additional discussions.

2. Methodology
Notation and Preliminaries. Throughout this paper, for an integer z, the notation [z] is used to denote the index set
{1, · · · , z}. Given two sequences {an } and {bn }, we denote an ≲ bn or an = O(bn ) if an ⩽ Cbn for a sufficiently
large n and a universal constant C > 0. We write an ≪ bn or an = o(bn ) if an /bn → 0. Furthermore, an ≍ bn if
both a_n ≲ b_n and a_n ≳ b_n hold. Given a set T, |T| represents the cardinality of T. When referring to a matrix A, A_T
denotes the sub-matrix consisting of columns indexed by T, and A_{T,T} denotes the sub-matrix induced by both rows and
columns indexed by T. Additionally, for a vector x ∈ R^p, we define ∥x∥_a = (|x_1|^a + |x_2|^a + · · · + |x_p|^a)^{1/a}. Recall
the operator norm definition: ∥A∥_{a,b} = sup_{∥u∥_a ≤ 1} ∥Au∥_b. When A is a symmetric matrix, γ_min(A) and γ_max(A)
denote its smallest and largest eigenvalues, respectively. We will introduce other notations within the text as needed.
Table 9 in Appendix A lists all the notations in the paper.

2.1 Overlapping Group Lasso

Suppose in a statistical learning problem, the parameters are represented by a vector β ∈ Rp , where βj denotes the j-th
element of β. Let G = {G1 , · · · , Gm } be the m predefined groups for the p parameters, with each group Gg being a
subset of [p], and ∪_{g∈[m]} G_g = [p]. For each group G_g, let d^G_g = |G_g| denote the group size, and d^G_max = max_{g∈[m]} d^G_g. For
any set T ⊂ [p], β_T denotes the subvector of β indexed by T. Let w = {w_1, · · · , w_m} be the user-defined positive
weights associated with the groups. The group lasso penalty (Yuan and Lin, 2006) is defined as

ϕ_G(β) = Σ_{g∈[m]} w_g ∥β_{G_g}∥_2.    (1)
We will omit G in all notations when the group structure is clearly given.

In statistical estimation problems involving group selection, the group lasso norm is combined with a convex empirical
loss function Ln , and the estimator is determined by solving the following M-estimation problem:
minimize_{β∈R^p} {L_n(β) + λ_n ϕ(β)}.    (2)

If the groups are disjoint, then the group lasso penalty will select and eliminate variables by groups. When the groups
overlap, the above estimation enforces an “all-out” pattern by simultaneously setting all variables in certain groups to
zero; thus, the zeroed-out variables form a union of a subset of the groups (Jenatton et al., 2011a). Such a pattern is
desirable in many problems, such as graphical models, multi-task learning and gene analysis (Jacob et al., 2009; Zhao
et al., 2009b; Mohan et al., 2014; Cheng et al., 2017; Tarzanagh and Michailidis, 2018). Another generalization of
the group lasso for overlapping groups is the latent overlapping group lasso (Jacob et al., 2009; Mairal and Yu, 2013),
following an “all-in” pattern by keeping the nonzero patterns as a union of groups. As noted in Yan and Bien (2017), the
decision to use an “all-in” or “all-out” strategy depends on the problem and the corresponding scientific interpretations.
The comparison between these two strategies is not our objective in this paper. However, both methods suffer from
computational difficulties. We focus on introducing an approximation method for the overlapping group lasso penalty
(1) and will leave the computational improvement of the latent overlapping group lasso for future work.
Problem (2) is a non-smooth convex optimization problem (Jenatton et al., 2011a; Chen et al., 2012), and the proximal
gradient method (Beck and Teboulle, 2009; Nesterov, 2013) is one of the most general yet efficient strategies to solve it.
Intuitively, proximal gradient descent minimizes the objective iteratively by applying the proximal operator of λn ϕ(β)
at each step.
The proximal operator associated with the group lasso penalty in (1) is defined as

prox_{λ_n}(µ) = argmin_{β∈R^p} (1/2)∥µ − β∥²_2 + λ_n ϕ(β),    (3)

whose dual problem is shown to be the following by Jenatton et al. (2011b):

minimize_{{ξ^g∈R^p}_{g∈[m]}} (1/2)∥µ − Σ_{g=1}^{m} ξ^g∥²_2, s.t. ∥ξ^g∥_2 ≤ λ_n w_g, and ξ^g_j = 0 if j ∉ G_g.    (4)

The proximal operator (3) and its dual can be computed using a block coordinate descent (BCD) algorithm, as studied by
Jenatton et al. (2011b). We list the procedure in Algorithm 1 for readers’ information. The convergence of Algorithm 1
is guaranteed by Proposition 2.7.1 of Bertsekas (1997).

Algorithm 1 BCD algorithm for the proximal operator of the overlapping group lasso
Input: G, {w_g}_{g=1}^{m}, µ, λ_n.
Requirement: w_g > 0 for all g ∈ [m], λ_n > 0.
Initialization: Set ξ^g = 0 ∈ R^p for all g ∈ [m].
Output: β*
1: while stopping criterion not reached do
2:   for all g ∈ {1, · · · , m} do
3:     Calculate r^g = µ − Σ_{h≠g} ξ^h.
4:     if ∥r^g_{G_g}∥_2 ⩽ λ_n w_g then set ξ^g_j = r^g_j for j ∈ G_g and ξ^g_j = 0 for j ∉ G_g
5:     else set ξ^g_j = λ_n w_g r^g_j / ∥r^g_{G_g}∥_2 for j ∈ G_g and ξ^g_j = 0 for j ∉ G_g
6:     end if
7:   end for
8: end while
9: β* = µ − Σ_{g=1}^{m} ξ^g.
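To make the procedure concrete, the following Python sketch implements Algorithm 1 under a few assumptions not stated above: numpy is available, groups are supplied as lists of zero-based index arrays, and the dual projection restricts the residual norm to each group's coordinates. It is a minimal illustration rather than the solver used in our experiments.

```python
import numpy as np

def prox_overlapping_group_lasso(mu, groups, weights, lam, tol=1e-8, max_iter=1000):
    """BCD sketch for the proximal operator (3) of the overlapping group lasso.

    mu: length-p numpy array; groups: list of index lists; weights: list of w_g."""
    p = len(mu)
    m = len(groups)
    xi = np.zeros((m, p))                       # dual variables xi^g, one p-vector per group
    for _ in range(max_iter):
        max_change = 0.0
        for g in range(m):
            Gg, wg = groups[g], weights[g]
            r = mu - (xi.sum(axis=0) - xi[g])   # residual excluding the g-th dual block
            new_xi = np.zeros(p)
            norm_r = np.linalg.norm(r[Gg])
            if norm_r <= lam * wg:              # restricted residual already inside the ball
                new_xi[Gg] = r[Gg]
            else:                               # project the restricted residual onto the ball
                new_xi[Gg] = lam * wg * r[Gg] / norm_r
            max_change = max(max_change, np.abs(new_xi - xi[g]).max())
            xi[g] = new_xi
        if max_change < tol:                    # simple stopping criterion
            break
    return mu - xi.sum(axis=0)                  # primal solution beta*
```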

Although additional techniques employing smoothing have been developed to improve the optimization
(Yuan et al., 2011; Chen et al., 2012), (3) and (4) continue to offer crucial insights into the computational bottlenecks
caused by overlapping groups. Notably, the duality between (3) and (4) reveals that the overlapping group lasso
problem has an intrinsic dimension equal to that of a Σ_{g∈[m]} d_g-dimensional separable problem. When the groups have a
nontrivial proportion of overlapping variables, the computation of the overlapping group lasso becomes substantially
more difficult, eventually prohibitive on large-scale problems. This issue significantly limits the applicability of the
overlapping group lasso penalty. Next, we introduce our non-overlapping approximation to rectify this challenge.

2.2 The Non-overlapping Approximation of the Overlapping Group Lasso

The fundamental challenge in solving overlapping group lasso problems stems from the non-separability of the penalty.
To enhance computational efficiency, our approach hinges on introducing separable operators. As a starting point, we
will illustrate this concept using a toy example of an interlocking group structure as a special case. In this structure, the
groups are arranged sequentially, with each group overlapping with its adjacent neighbors (Figure 1a). For simplicity,
we consider a uniform weight scenario where wg ≡ 1 for all groups.

Figure 1: Illustration of the proposed group partition in an interlocking group structure. (a) Interlocking group
structure; (b) partitioned group structure. Red regions are the overlapping variables in the original group structure.

We now partition the original overlapping groups in Figure 1a into smaller groups as in Figure 1b. This partition
identifies intersections as individual groups. We define these new groups as 𝒢 = {𝒢_1, · · · , 𝒢_𝓂}, where, in this specific
instance, 𝓂 = 2m − 1. Taking G_1 as an example, we have G_1 = 𝒢_1 ∪ 𝒢_2, and by the triangle inequality,

∥β_{G_1}∥_2 ≤ ∥β_{𝒢_1}∥_2 + ∥β_{𝒢_2}∥_2.

Extending this principle to each group, the norm of the overlapping group lasso based on G can be bounded by a
reweighted non-overlapping group norm based on 𝒢:

Σ_{g∈[m]} ∥β_{G_g}∥_2 ≤ Σ_{ℊ∈[𝓂]} h_ℊ ∥β_{𝒢_ℊ}∥_2,    (5)

where h_ℊ equals 1 for odd ℊ and 2 for even ℊ. Consequently, controlling the sum on the right-hand side of (5)
effectively controls the overlapping group norm on the left-hand side. The key advantage of this approach is the
separability of the right-hand-side norm, which substantially simplifies and enhances the efficiency of optimization.
While this example concerns the interlocking group structure, the idea is applicable to any general overlapping
pattern, as introduced in the two steps below.

Step 1: overlapping-induced partition construction. Our method starts by constructing a new non-overlapping
group structure 𝒢 from G, following Algorithm 2. We represent the initial group structure G by an m × p binary
matrix G, where G_{gj} = 1 if and only if the j-th variable is a member of the g-th group, and G_{gj} = 0 otherwise. To
clearly differentiate the original group structure G and the derived non-overlapping structure 𝒢, we employ standard
letters, such as {g, d, m, w, G}, to represent quantities of the original group structure, while calligraphic letters, like
{ℊ, 𝒹, 𝓂, 𝓌, 𝒢}, are used for quantities of 𝒢. For instance, 𝓂 denotes the number of groups in 𝒢, and ℊ ∈ [𝓂]
serves as the index for groups within 𝒢.

Algorithm 2 Algorithm to construct the overlapping-induced partition 𝒢
Input: Binary matrix G.
Output: New group structure 𝒢.
1: Initialize the column index set as C = {1, . . . , p}.
2: Initialize k = 1.
3: while C is not empty do
4:   Choose the first column index j in C, and set I to be the set of all column indices in C whose columns of G are identical to G_{·,j}: I = {j′ ∈ C : G_{·,j′} = G_{·,j}}.
5:   Set 𝒢_k = I, and remove I from C: C ← C \ I.
6:   k = k + 1.
7: end while
8: Return 𝒢 ← {𝒢_1, 𝒢_2, . . .}, where each 𝒢_k represents a group.
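A minimal Python sketch of Algorithm 2, assuming the group structure is supplied as the m × p binary numpy array G described above (the function name is ours, for illustration only):

```python
import numpy as np

def overlapping_induced_partition(G):
    """Sketch of Algorithm 2: variables with identical membership patterns
    (identical columns of the binary matrix G) form one group of the partition."""
    m, p = G.shape
    remaining = list(range(p))
    partition = []
    while remaining:
        j = remaining[0]
        # all columns identical to column j form one new group
        block = [jp for jp in remaining if np.array_equal(G[:, jp], G[:, j])]
        partition.append(block)
        remaining = [jp for jp in remaining if jp not in block]
    return partition
```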

Step 2: overlapping-based group weights calculation. Note that each group within 𝒢 is a subset of at least one of
the original groups in G. Conversely, each group in G can be reconstructed as the union of groups in 𝒢. We introduce
the following mappings:

F(ℊ) = {g : g ∈ [m], 𝒢_ℊ ⊂ G_g} and F^{-1}(g) = {ℊ : ℊ ∈ [𝓂], 𝒢_ℊ ⊂ G_g}.

Given positive weights w of G, we set the weights 𝓌 of 𝒢 as:

𝓌_ℊ = Σ_{g∈F(ℊ)} w_g,  ℊ ∈ [𝓂].    (6)

With the new partition 𝒢 and the new weights 𝓌 from the previous two steps, we define the following norm as the
proposed alternative to the original overlapping group lasso norm:

ψ_𝒢(β) = Σ_{ℊ=1}^{𝓂} 𝓌_ℊ ∥β_{𝒢_ℊ}∥_2.    (7)

In general, by the triangle inequality, the proposed norm is always an upper bound of the original group lasso norm:

ϕ_G(β) = Σ_{g=1}^{m} w_g ∥β_{G_g}∥_2 ⩽ Σ_{ℊ=1}^{𝓂} 𝓌_ℊ ∥β_{𝒢_ℊ}∥_2 = ψ_𝒢(β).    (8)
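Continuing the sketch of Algorithm 2, the hypothetical helpers below compute the overlapping-based weights in (6) and evaluate the proposed norm (7); they assume the partition produced by the `overlapping_induced_partition` sketch above and a numpy coefficient vector.

```python
import numpy as np

def induced_weights(G, w, partition):
    """Weights in (6): for each induced group, sum the weights of the original
    groups containing it (the set F of that group)."""
    new_w = []
    for block in partition:
        containing = np.where(G[:, block[0]] == 1)[0]   # original groups covering this block
        new_w.append(sum(w[g] for g in containing))
    return new_w

def psi_norm(beta, partition, new_w):
    """The proposed separable norm (7): a weighted non-overlapping group lasso."""
    return sum(wg * np.linalg.norm(beta[block]) for block, wg in zip(partition, new_w))
```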

Our proposed penalty is essentially a weighted non-overlapping group lasso on 𝒢. For illustration, Figure 2 shows
the unit ball of these two norms based on G_1 = {β_1, β_2} and G_2 = {β_1, β_2, β_3} in a three-dimensional problem. All
singular points of the ϕ_G-ball (where exact zeros occur in (2)) are also singular points of the ψ_𝒢-ball.
Readers may observe that the inequality in (8) can also hold for other separable norms. For instance, consider
partitioning all p variables into individual groups and employing a weighted lasso norm as another upper bound for ϕ_G,
represented by:

Σ_{j=1}^{p} (Σ_{g: j∈G_g} w_g) |β_j|.    (9)

This approach was taken by Cheng et al. (2017). So what is special about our proposed norm in (7)?
Intuitively, as illustrated by our construction process for 𝒢 or Figure 2, our method introduces additional singular
points in the norm only when it is necessary to achieve separability. Unlike the lasso upper bound, this process avoids
adding redundancy. As such, our approximation is expected to maintain a certain level of tightness. We now formally
substantiate this intuition. Given any group structure G and weights w, following Cai et al. (2022), we define the
ℓ_{q1}/ℓ_{q2} norm of β for any 0 ⩽ q1, q2 ⩽ ∞ as

∥β_{{G,w}}∥_{q1,q2} = (Σ_{g∈[m]} w_g ∥β_{G_g}∥_{q2}^{q1})^{1/q1}.    (10)

Figure 2: Illustration of two norms in R³: the outer region depicts the unit ball of the overlapping group lasso norm
defined by {β : ϕ_G(β) ⩽ 1}; the inner region represents the unit ball of our proposed separable norm
{β : ψ_𝒢(β) ⩽ 1}.

This general class of norms potentially includes most commonly used penalties, including the weighted lasso penalty.
The subsequent theorem shows that the proposed ψ_𝒢(β) is the tightest separable relaxation of the original overlapping
group lasso norm among all separable ℓ_{q1}/ℓ_{q2} norms.

Theorem 1. Let 𝔾 represent the set of all possible partitions of [p]. Given the original groups G and their weights w,
there do not exist 0 ⩽ q1, q2 ⩽ ∞, G̃ ∈ 𝔾, and w̃ ∈ (0, ∞)^p such that

ϕ_G(β) ⩽ ∥β_{{G̃,w̃}}∥_{q1,q2} ⩽ ψ_𝒢(β) for all β ∈ R^p, and ∥β_{{G̃,w̃}}∥_{q1,q2} < ψ_𝒢(β) for some β ∈ R^p.    (11)

3. Statistical Properties
Incorporating the proposed norm ψ_𝒢 into an M-estimation procedure leads to the following optimization problem:

minimize_{β∈R^p} {L_n(β) + λ_n ψ_𝒢(β)},    (12)
which is different but related to (2). In this section, by studying the statistical properties of the regularized estimator
based on ψG and the estimator based on ϕG , we show that ψG could be used as an alternative to ϕG . Following previous
group lasso studies (Huang and Zhang, 2010; Lounici et al., 2011; Chen et al., 2012; Negahban et al., 2012; Dedieu,
2019), our analysis will focus on high-dimensional linear models. Specifically, define the linear model as
Y = Xβ ∗ + ε, (13)
where Y ∈ Rn×1 is the response vector, X ∈ Rn×p is the covariate matrix, and ε ∈ Rn×1 is a random noise vector.
The overlapping group lasso coefficient estimator under the linear regression model is defined by a solution of (2) under
the squared loss:

β̂^G ∈ argmin_{β∈R^p} (1/2n)∥Y − Xβ∥²_2 + λ_n ϕ_G(β).    (14)

Correspondingly, we define the regularized estimator by our approximation norm as

β̂_𝒢 ∈ argmin_{β∈R^p} (1/2n)∥Y − Xβ∥²_2 + λ_n ψ_𝒢(β).    (15)
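To illustrate why (15) is computationally attractive, here is a minimal proximal-gradient sketch for (15) under the squared loss, assuming numpy, a fixed step size, and groups given as index lists with their weights. Because ψ_𝒢 is separable, its proximal operator is the closed-form block soft-thresholding shown below, so no inner routine such as Algorithm 1 is required. This is an illustration only, not the SLEP-based solver used in our experiments.

```python
import numpy as np

def block_soft_threshold(v, groups, weights, thresh):
    """Closed-form proximal operator of the separable penalty psi in (7)."""
    out = v.copy()
    for Gg, wg in zip(groups, weights):
        norm_g = np.linalg.norm(v[Gg])
        scale = max(0.0, 1.0 - thresh * wg / norm_g) if norm_g > 0 else 0.0
        out[Gg] = scale * v[Gg]
    return out

def solve_separable_group_lasso(X, y, groups, weights, lam, n_iter=500):
    """Proximal gradient sketch for (15) with loss (1/2n)||y - X beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = block_soft_threshold(beta - step * grad, groups, weights, step * lam)
    return beta
```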

The solution uniqueness of (14) and (15) has been studied by Jenatton et al. (2011a), and we include their results in
Appendix B for completeness. However, our study only requires the estimator to be one solution to the problem, as
in Jenatton et al. (2011a); Negahban et al. (2012); Wainwright (2019). So we will not specifically worry about the
uniqueness in our discussion.
As a remark, our objective is not to present (15) as an approximate optimization problem of (14). Rather, we focus on
the statistical equivalence of the two classes of estimators defined by (14) and (15) in terms of their statistical properties
under sparse regression models when appropriate values of λn are chosen (which may differ for each estimator). Our
theoretical analysis focuses on three aspects. In Section 3.1, we establish that under reasonable assumptions, the ℓ2
estimation error bound for (15) is no larger than that for (14). In Section 3.2, we present the minimax error rate for the
overlapping sparse group regression problem, showing that both (14) and (15) are minimax optimal under additional
requirements of the group structures. Lastly, in Section 3.3, we demonstrate that both estimators consistently recover
the support of the sparse β ∗ with high probability under similar sample size requirements.

3.1 Estimation Error Bounds

We start by introducing additional quantities. Define the overlapping degree h^G_j as the number of groups in G containing
β_j, and h^G_max = max_j h^G_j. Given a group index set I ⊆ [m], we use G_I to denote the union ∪_{g∈I} G_g. Given G and I,
following Wainwright (2019), we define two parameter spaces:

M(I) = {β ∈ R^p | β_j = 0 for all j ∈ (G_I)^c},
M^⊥(I) = {β ∈ R^p | β_j = 0 for all j ∈ G_I},

and we further use β_{M(I)} to denote the projection of β onto M(I).
Given any set T ⊆ [p], we define the set of groups G_T = {g ∈ [m] | G_g ∩ T ≠ ∅}. Notice that (G_{G_T})^c is called the
hull of T in Jenatton et al. (2011a). Let supp(β) = {j ∈ [p] | β_j ≠ 0} denote the support set. We define the group
support set S^G(β) = G_{supp(β)} and the augmented group support set S̄^G(β) = {g ∈ [m] | G_g ∩ G_{S(β)} ≠ ∅}. Furthermore,
define s = |supp(β)|, s_g = |S(β)|, and s̄_g = |S̄(β)|. We omit the superscript G in these notations when G is clearly given in
context. Now we introduce additional assumptions under the regression model (13).
Assumption 1 (Sub-Gaussian noise for the response variable). The coordinates of ε are i.i.d. zero-mean sub-Gaussian
with parameter σ. Specifically, there exists σ > 0 such that E[exp(tε)] ⩽ exp(σ 2 t2 /2) for all t ∈ R.
Our theoretical studies also hold for a fixed design of X, with trivial modifications. We prefer to introduce the random
design here to make the statements more concise and interpretable, especially for the comparison in Section 3.3.
Assumption 2 (Normal random design for covariates). The rows of the data matrix X are i.i.d. from N (0, Θ), where
1/c1 ⩽ γmin (Θ) ⩽ γmax (Θ) ⩽ c1 for some constant c1 > 0.
Lastly, we need some mild constraints on the group dimensions.
Assumption 3 (Dimension of the group structure). The predefined group structure G satisfies dmax ⩽ c2 n for some
constant c2 > 0. In addition, we assume log m ≪ n.
The following theorem establishes the ℓ2 estimation error bounds for the two estimators.
Theorem 2. Given G and its induced 𝒢 according to Algorithm 2, define h^g_min = min_{j∈G_g} h_j and h^g_max = max_{j∈G_g} h_j. Let
δ ∈ (0, 1) be a scalar that might depend on n. Under Assumptions 1, 2 and 3, for β̂^G and β̂_𝒢 defined in (14) and (15),
we have the following results:

1. Suppose that β* satisfies the group sparsity condition

s̄_g(β*) ≲ (n / (log m + d_max)) · (min_{g∈[m]}(w_g² h^g_min) / max_{g∈S}(w_g² h^g_max)).    (16)

When λ_n = c′σ √[(d_max/n + log m/n + δ) / min_{g∈[m]}(w_g² h^g_min)] for some constant c′ > 0, we have

∥β̂^G − β*∥²_2 ≲ σ² · ((Σ_{g∈S} w_g²) · h^{G_S}_max / min_{g∈[m]}(w_g² h^g_min)) · (d_max/n + log m/n + δ)    (17)

with probability at least 1 − e^{−c_3 nδ} for a constant c_3 > 0.

2. Suppose that β* satisfies the group sparsity condition

s̄_g(β*) ≲ (n / (log m + d_max)) · (min_{ℊ∈[𝓂]} 𝓌_ℊ² / max_{ℊ∈S} 𝓌_ℊ²).    (18)

When λ_n = (c′σ / min_{ℊ∈[𝓂]} 𝓌_ℊ) √(d_max/n + log m/n + δ) for some constant c′ > 0, we have

∥β̂_𝒢 − β*∥²_2 ≲ σ² · (Σ_{ℊ∈{F^{-1}(g)}_{g∈S}} 𝓌_ℊ² / min_{ℊ∈[𝓂]} 𝓌_ℊ²) · (d_max/n + log m/n + δ)    (19)

with probability at least 1 − e^{−c_4 nδ} for a constant c_4 > 0.

The error bound in (17) subsumes the non-overlapping group lasso error bound as a particular instance. When the
groups in G are disjoint, the reduced form of (17) matches the bounds studied in Huang and Zhang (2010); Lounici
et al. (2011); Negahban et al. (2012); Wainwright (2019). The main difference in the context of overlapping groups is
the necessity to account for the overlapping degree and the extension of sparsity requirements to augmented groups.
The conditions specified in (16) and (18) relate to the cardinality of the augmented group support set (the number of
non-zero groups in non-overlapping group structure). Although the conditions in (16) and (18) may initially appear
distinct, they generally converge to a similar requirement in many typical cases, which can lead to an informative
comparison between the two bounds in (17) and (19). The following results can characterize this.
Assumption 4. Assume the predefined group structure G and its induced group structure 𝒢 satisfy max{d_max, m} ≍
max{𝒹_max, 𝓂}.
Proposition 3. Suppose that max_{g∈S} |F^{-1}(g)| is bounded by a constant. Under Assumption 4, the following inequality
holds:

(Σ_{ℊ∈F^{-1}(S)} 𝓌_ℊ² / min_{ℊ∈[𝓂]} 𝓌_ℊ²) · (d_max/n + log m/n + δ) ≲ ((Σ_{g∈S} w_g²) · h^{G_S}_max / min_{g∈[m]}(w_g² h^g_min)) · (d_max/n + log m/n + δ).

This implies that the error bound for the estimator β̂^G in (17) also serves as an upper bound for the error associated
with the estimator β̂_𝒢.

The quantity |F^{-1}(g)| is the number of groups in 𝒢 that intersect with G_g. Proposition 3 requires that every G_g
such that G_g ∩ supp(β*) ≠ ∅ is partitioned into a bounded number of non-overlapping groups. On the other hand,
Assumption 4 requires that the maximum of the group size and the number of groups in the given group structure G
should have the same order as the corresponding maximum in the induced structure 𝒢. The above requirement
always holds for interlocking groups with similar group and overlap sizes (see Figure 1). More importantly, we can
always assess the assumption directly on data by calculating the group sizes and numbers for both G and 𝒢. In Section
4.3, we evaluate five group structures from real-world gene pathways and examine the ratio of the maximum of the two
quantities from each G and 𝒢. Assumption 4 appears reasonable in all of these real-world grouping structures. See details
in Table 1.
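As a rough illustration of how this check can be carried out on data, the sketch below computes max{d_max, m} for the original structure and the analogous quantity for the induced partition from the binary membership matrix; Assumption 4 asks the two returned numbers to be of the same order. The function name and interface are ours, for illustration.

```python
import numpy as np

def assumption4_quantities(G):
    """Return max{d_max, m} for the original structure and the analogous
    quantity for the induced partition (Assumption 4 asks them to be comparable)."""
    m, p = G.shape
    d_max = int(G.sum(axis=1).max())                 # largest original group size
    # the induced partition groups variables with identical membership columns
    _, counts = np.unique(G, axis=1, return_counts=True)
    m_induced = len(counts)                          # number of induced groups
    d_max_induced = int(counts.max())                # largest induced group size
    return max(d_max, m), max(d_max_induced, m_induced)
```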

3.2 Lower Bound of Estimation Error

Proposition 3 compares the two estimators’ upper bounds of estimation errors. While the comparison gives intuitive
ideas, it does not rigorously establish the statistical equivalence without the tightness of the error bounds. To strengthen
our findings, we now investigate the minimax estimation error rate in linear regression models characterized by
overlapping group sparsity. We will focus on the following class of group-wise sparse vectors:

Ω(G, s_g) = {β : Σ_{G_g∈G} 1{∥β_{G_g}∥_2 ≠ 0} ⩽ s_g}.    (20)

Following the assumption of Cai et al. (2022), we focus on the special case of equal-size groups.
Assumption 5 (Equal size groups). The m predefined groups of G come with equal group size d, m ≪ p, d ≪ log(p).

Theorem 4 (Lower bound of estimation error). Under Assumptions 1, 2 and 5, we have

inf_{β̂} sup_{β∈Ω(G,s_g)} E∥β̂ − β∥²_2 ≳ σ² s_g (d + log(m/s_g)) / n.    (21)

Combining Theorem 2 and Theorem 4, we can see that both estimators attain the minimax error rate and are statistically
equivalent, as demonstrated by the following corollary.
Corollary 1. Under Assumptions 1–4, if h^{G_S}_max ≍ 1, both β̂^G and β̂_𝒢 attain the minimax estimation rate specified in
(21).

3.3 Support Recovery Consistency

We now proceed to analyze the support recovery consistency of β̂^G and β̂_𝒢. We begin by introducing more quantities
for our analysis. For any β ∈ R^p, we define the mapping r^G(β) : R^p → R^p as:

r^G(β)_j = β_j Σ_{g∈G_{supp(β)}: j∈G_g} w_g / ∥β_{G_g∩supp(β)}∥_2 if j ∈ supp(β), and r^G(β)_j = 0 if j ∉ supp(β).    (22)

r^G(β) is closely related to subgradients of the penalty and is used for determining optimality conditions. In the lasso
case, r^G(β) is the sign vector, which is exactly the subgradient of the lasso penalty. When focusing on β*, we write S = supp(β*),
r^G = r^G(β*), and β*_min = min{|β*_j| : β*_j ≠ 0}.

Our analysis essentially follows the strategy in Jenatton et al. (2011a). The major difference is that we study the problem
with a more tailored setup for the random design rather than the fixed design as in Jenatton et al. (2011a). Using
random designs, as discussed before, is helpful to compare the two estimators β̂ G and β̂G directly. We now introduce
additional assumptions used to study the pattern consistency, which can be seen as the population-level counterpart of
the assumptions in Jenatton et al. (2011a).
Assumption 1’ (Gaussian noise for the response variable). Under model (13), the coordinates of ε are i.i.d from
N (0, σ 2 ).
Assumption 6 (Irrepresentable condition). For any β ∈ R^p, define

ϕ^c_S(β_{S^c}) = Σ_{g∈[m]\G_S} w_g ∥β_{S^c∩G_g}∥_2,

and its dual norm

(ϕ^c_S)*[u] = sup_{ϕ^c_S(β_{S^c})≤1} β_{S^c}^⊤ u.

Assume that there exists τ ∈ (0, 3/2] such that

(ϕ^c_S)*[Θ_{S^cS} Θ_{SS}^{-1} r_S] ⩽ 1 − τ/2.    (23)
Assumption 1’ is widely used to study support recovery consistency of linear regression. For example, in addition
to Jenatton et al. (2011a), it is also used in Zhao and Yu (2006); Wainwright (2009, 2019). Assumption 6 is the
population-level version of the irrepresentable condition as discussed in Zhao and Yu (2006) and Wainwright (2019).
Theorem 5. Suppose Assumption 1’, Assumption 2 and Assumption 6 hold. Under model (13), assume the support of
β* is compatible with the overlapping group lasso penalty, such that the zero positions are given by an exact union of
groups in G. Mathematically, that means

[p] \ (∪_{G_g∩S=∅} G_g) = S.    (24)

1. If

log(p − |S|) ⩾ |S|,
λ_n |S|^{1/2} ≲ min{ β*_min / A_S , β*_min a_{S^c} / (A_S Σ_{g∈G_S} w_g √(|G_g∩S|)) },    (25)
n ≳ max{ σ² log(p − |S|) / (a²_{S^c} λ²_n) , max_{j∈S}{(β*_j)²} log(p − |S|) / (a²_{S^c} λ²_n) },    (26)

where a_S = min_{g∈G_S} w_g/d_g, a_{S^c} = min_{g∈G_{S^c}} w_g/d_g, and A_S = h_max(G_S) max_{g∈G_S} w_g ∥u∥_1.

Then for the overlapping group lasso estimator β̂^G, we have:

P(supp(β̂^G) ≠ S) ⩽ 8 exp(−n/4) + exp(−n a²_S τ² γ_min(Θ_SS) / (∥r_S∥²_2 γ_max(Θ_{S^cS^c|S}))) + exp(−n λ²_n τ² a²_{S^c} / (144σ²)) + 2|S| exp(−n c²(S, G) / (2σ²)),    (27)

with

c(S, G) ≍ min{ β*_min / A_S , β*_min a_{S^c} / (A_S Σ_{g∈G_S} w_g √(|G_g∩S|)) }.

2. Furthermore, if max_{g∈G_S} |F^{-1}(g)| ≍ 1, then for the proposed estimator β̂_𝒢 the analogous property holds:

P(supp(β̂_𝒢) ≠ S) ⩽ 8 exp(−n/4) + exp(−n a²_S τ² γ_min(Θ_SS) / (∥r^𝒢_S∥²_2 γ_max(Θ_{S^cS^c|S}))) + exp(−n λ²_n τ² a²_{S^c} / (144σ²)) + 2|S| exp(−n c²(S, 𝒢) / (2σ²)),    (28)

with

c(S, 𝒢) ≍ min{ β*_min / A_S , β*_min a_{S^c} / (A_S Σ_{ℊ∈𝒢_S} 𝓌_ℊ √(|𝒢_ℊ∩S|)) }.

The conditions involved in the above theorem can be seen as the population-level counterparts of those used in Jenatton
et al. (2011a) for the overlapping group lasso estimator under the fixed design. As an illustration of the conditions, in
the lasso context, (25) and (26) reduce to the typical scaling of n ≈ log p and λ_n ≈ σ(log p/n)^{1/2}. Together with the
requirements on the sample size, |S| log(p − |S|), and on β*_min, they match the requirements in Wainwright (2009) for
support recovery by the lasso regression. For non-overlapping group lasso estimators, our assumptions align with
the conditions outlined in Corollary 9.27 of Wainwright (2019) under the random design.
Theorem 5 shows that both estimators consistently identify the support of the group sparse regression coefficients.
Compared to the previous study of the overlapping group lasso estimator of Jenatton et al. (2011a), we switch to the
random design of X, because such a setting renders a common basis for the comparison of the two estimators directly.
Specifically, comparing (27) and (28), as well as the common conditions, we can see that the two estimators give
comparable performance in support recovery with respect to the sampling complexity.

4. Simulation
In this section, we assess the performance of the proposed estimator to demonstrate our claimed properties. At a
high level, we want to use the simulation experiments to show that the proposed estimator based on (7) gives similar
statistical performance to the overlapping group lasso estimator while admitting much better computational efficiency.
Our estimator achieves this primarily because of the tightest separable relaxation property of Theorem 1, which can be
attributed to two designs of the norm (7): the induced partition G and the corresponding overlapping-based weights
w. Therefore, in our simulation experiments, we will also evaluate the effects of these two designs by comparing
the proposed estimator with other benchmark estimators. In Sections 4.1–4.3, we evaluate the performance of the
proposed estimator and compare it with the weighted lasso estimator with overlapping-based weights, as discussed
in (9), under various configurations. This sequence of experiments will demonstrate the importance of our proposed
partition G. In Section 4.4, we compare the proposed estimator with two other group lasso estimators, using the same G
but overlapping-ignorant weights, under the same set of configurations. The results will demonstrate the importance of
using the proposed overlapping-based weights w.

Two MATLAB-based solvers for the overlapping group lasso problems are employed. The first solver (Yuan et al.,
2011) is from the SLEP package (Liu et al., 2009). It can handle general overlapping group structures. The second
solver is from the SPAM package (Mairal et al., 2014), which is designed to solve the overlapping group lasso problem
when the groups can be represented by tree structures, formally defined in Section 4.2. Therefore, the SPAM solver is
used only for the experiment in Section 4.2. The SLEP solver is more general, but using the two solvers can provide
a more thorough evaluation across multiple implementations. For a fair comparison, the SLEP and SPAM package
solvers were also applied to solve lasso and non-overlapping group lasso estimators in our benchmark set to ensure that
the timing comparison implementation is consistent.
As an important side note, SLEP is widely acknowledged as one of the most efficient solvers for the overlapping
group lasso problem (Yuan et al., 2011; Chen et al., 2012; Cheng et al., 2017). Still, for non-overlapping group lasso
problems, alternative solvers, such as Yang and Zou (2015), may offer much better computational efficiency. For
example, Yang and Zou (2015) reported that their solver is about 10–30 times faster than the SLEP package when
solving non-overlapping group lasso problems. Such solvers are available because of the separability in non-overlapping
groups and are not available for overlapping problems. For a fair comparison to avoid implementation bias, we use
SLEP to solve for our estimator. Therefore, the computational advantage we demonstrate will be conservative. In
practice, with the better solvers used, our method would enjoy an even more substantial computational advantage over
the original overlapping group lasso than reported in the experiments.

Evaluation criterion. For each configuration, we generate 50 independent replicates and report the average result.
The performance assessment is conducted in three aspects:

• Regularization path computing time. We begin by performing a line search to determine two pivotal values:
λmax and λmin. The search for λmax starts at 10^8 and decreases progressively, multiplying by 0.9 at each
iteration, until reaching the first value at which no variables are selected. In contrast, the determination of λmin
starts from 10^−8 and increases incrementally, multiplying by 1.1 each time, until the first value is found that
retains the entire set of variables. Following this, we select 50 values in log-scale within the range [λmin, λmax].
Subsequently, we compute the entire regularization path using these λ values and record the computation
time associated with this process as a performance metric (see the sketch after this list). The computing time evaluation mimics the most
practical situation where the whole regularization path is solved for tuning purposes.
• Relative ℓ2 estimation error: From the entire regularization path, we select the smallest relative estimation
error, defined as ∥β̂ − β ∗ ∥2 /∥β ∗ ∥2 , as the estimation error for the method. This serves as the measure of the
ideally tuned performance.
• Support discrepancy: From the entire regularization path, we select the smallest support discrepancy, defined
as |{i ∈ [p] : |sign(β̂i )| ̸= |sign(βi∗ )|}|/p. Such a (normalized) Hamming distance is commonly used as a
performance metric for support recovery (Grave et al., 2011; Jenatton et al., 2011a) to quantify the accuracy of
pattern selection.
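The sketch below illustrates one reading of the line search and the support discrepancy metric described above, assuming a hypothetical routine num_selected(lam) that returns how many variables the penalized fit retains at penalty level lam.

```python
import numpy as np

def lambda_grid(num_selected, p, n_lambda=50):
    """Line-search sketch for [lambda_min, lambda_max] and a log-scale grid."""
    lam_max = 1e8
    while num_selected(lam_max * 0.9) == 0:      # shrink while nothing is selected yet
        lam_max *= 0.9
    lam_min = 1e-8
    while num_selected(lam_min * 1.1) == p:      # grow while all variables are still kept
        lam_min *= 1.1
    return np.logspace(np.log10(lam_min), np.log10(lam_max), n_lambda)

def support_discrepancy(beta_hat, beta_star):
    """Normalized Hamming distance between estimated and true supports."""
    return np.mean((beta_hat != 0) != (beta_star != 0))
```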

4.1 Interlocking group structure

In the first set of experiments, we evaluate the performances based on interlocking group structure (Figure 1a). This group
structure exhibits a relatively low degree of overlap and is frequently used for evaluating overlapping group lasso methods
(Yuan et al., 2011; Chen et al., 2012). Specifically, we set m interlocked groups with d variables in each group and 0.2d
variables in each intersection. For example, when m = 5 and d = 10, we have G1 = {1, · · · , 10}, G2 = {9, 10, · · · , 18}, · · · , G5 = {33, 34, · · · , 42}.
In the experiment, we will vary m and d to evaluate their impacts on the performance.
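For concreteness, the interlocking structure can be generated as in the sketch below (using zero-based indices, so the example above is shifted by one).

```python
def interlocking_groups(m, d):
    """Interlocking structure: m groups of size d, adjacent groups sharing 0.2*d variables."""
    overlap = int(0.2 * d)
    stride = d - overlap
    return [list(range(g * stride, g * stride + d)) for g in range(m)]
```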
Following the strategy of Yan and Bien (2017), we generate the data matrix X from a Gaussian distribution N(0, Θ),
where Θ is determined to match the correlations within the specified group structure. Initially, we construct a matrix Θ̃
as follows:

Θ̃_ij = 1 if i = j; 0 if β_i and β_j belong to different groups in G; 0.6 if β_i and β_j are in the same group in 𝒢; and 0.36 if β_i and β_j are in the same group in G but different groups in 𝒢,

and then Θ is derived as the projection of Θ̃ onto the set of symmetric positive definite matrices with a minimum
eigenvalue of 0.1. Such strong within-group correlation patterns have also been used in Zhao et al. (2009a); Yang and
Zou (2015).
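One standard way to carry out the projection onto symmetric positive definite matrices with minimum eigenvalue 0.1 is eigenvalue clipping, sketched below; the exact routine used in our experiments may differ.

```python
import numpy as np

def project_to_pd(theta_tilde, min_eig=0.1):
    """Project a symmetric matrix onto matrices with smallest eigenvalue >= min_eig
    by clipping its eigenvalues."""
    sym = (theta_tilde + theta_tilde.T) / 2
    eigvals, eigvecs = np.linalg.eigh(sym)
    return eigvecs @ np.diag(np.clip(eigvals, min_eig, None)) @ eigvecs.T
```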

We generate β* by first sampling its p coordinates from the normal distribution N(10, 16), then randomly flipping the signs
of the coordinates and randomly setting 90% of the groups to zero. This setup aligns with the settings in Bach (2008);
Friedman et al. (2010); Huang and Zhang (2010). The response variable Y is generated from Y = Xβ* + ϵ, where ϵ
follows a normal distribution with mean 0 and variance σ², and we set σ² = 3 following Yang and Zou (2015). The
group weight in the overlapping group lasso problem is w_g = √d_g, as is commonly used in practice. We used the absolute
difference in function values between iterations as the stopping criterion for all methods, with a tolerance set at 10^−5.

[Figure 3 here; panels compare the overlapping group lasso, the proposed approximation method, and the weighted lasso.]
Figure 3: Regularization path computing time, ℓ2 estimation error, and support discrepancy under different configurations of interlocking groups. (a) Varying sample size n when fixing m = 400 and d = 40 (p = 12808);
(b) Varying number of groups m when fixing n = 4000 and d = 40; (c) Varying group size d when fixing
n = 4000 and m = 400.

Figure 3 presents the average computation times, estimation errors, and support discrepancy with 95% confidence
intervals (CIs). The result highlights the significant computational advantage of the proposed method over the original
overlapping group lasso. Specifically, our method is 5–20 times faster than the original overlapping group lasso.

Even though the overlap is not severe within the interlocking group structure, solving the overlapping group lasso
problem carries a more substantial computational burden due to the non-separable structure within its penalty term. The
computational time increases with larger sample sizes, a greater number of variables, and larger group sizes, and the
computational disadvantage of the overlapping group lasso is more substantial as the problem scales up. In contrast, our
proposed method consistently achieves accuracy similar to the overlapping group lasso estimator in both the estimation
error and support discrepancy. This consistency in performance, observed across a spectrum of configurations, serves as
an empirical confirmation of the validity of our theoretical findings.
On the other hand, the weighted lasso approximation is slightly faster than our method. This is expected from the
optimization perspective. However, the weighted lasso approximation exhibits much higher errors than the overlapping
group lasso estimator and our estimator across all configurations, revealing that the weighted lasso also gives a poor
approximation to the overlapping group lasso. This is because the weighted lasso fails to leverage the group information,
unlike our estimator, which uses the induced groups 𝒢.
In summary, our proposed estimator achieves comparable statistical performance to the original overlapping group
lasso estimator while significantly enhancing computational efficiency. In contrast, although computationally efficient,
the weighted lasso yields notably poor estimations, rendering it an uncompetitive alternative for approximating the
original problems.

4.2 Nested tree structure of overlapping groups

In this experiment, we evaluate the performance of the estimators under a configuration of the tree-group structures
introduced in Jenatton et al. (2011b), as below.
Definition 1. (Jenatton et al., 2011b) A set of groups G = {G_1, · · · , G_m} is said to be tree-structured in [p] if
∪_{g∈[m]} G_g = [p] and if, for all g, g′ ∈ [m], G_g ∩ G_{g′} ≠ ∅ implies either G_g ⊂ G_{g′} or G_{g′} ⊂ G_g.

In particular, we consider the special case of the tree groups, the nested group structure where all groups are nested.
This configuration is interesting as it represents an extreme setting of overlapping groups – the overlapping degree
is maximized in a certain sense and we hope to evaluate the methods in this extreme scenario. The nested group
structure was also used in a few previous studies (Kim and Xing, 2012; Nowakowski et al., 2023). In this experiment,
the SPAM solver, designed for the tree group structures, is also used to provide a more thorough evaluation across
different implementations. We consider the following nested group configuration: 800 groups G = {G1 , . . . , G800 } are
established, where Gg ⊂ Gg+1 and |Gg | = g × 4, g = 1, · · · , 800 with p = 3200 in total. The sample size varies from
600 to 2400. The data matrix X is generated from N(0, Θ), where Θ is generated by first constructing the matrix Θ̃ as

Θ̃_ij = 1 if i = j; 0.6 if β_i and β_j belong to the same group in 𝒢; and 0.36 if β_i and β_j are in the same group in G but in different groups in 𝒢,

and then projecting Θ̃ onto the set of symmetric positive definite matrices with minimum eigenvalue 0.1. The generative
process for β* and y remains nearly identical to before, where the only difference is that the first 90% of the groups are
set to zero following the hierarchical structure. The group weights are set to w_g = 1/d_g as suggested by Nowakowski
et al. (2023). For a fair comparison of the two solvers, in this experiment, we adopt the stopping criterion provided in
the SPAM package (Mairal et al., 2014) with a convergence tolerance of 10^−5.
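As a small illustration, this nested structure can be generated as follows (zero-based indices; function name ours):

```python
def nested_groups(m=800, step=4):
    """Nested structure used here: G_g = {0, ..., 4g - 1}, so G_1 is contained in G_2,
    G_2 in G_3, and so on, with p = 4 * m variables in total."""
    return [list(range(step * (g + 1))) for g in range(m)]
```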

[Figure 4 here; panels compare the overlapping group lasso, the proposed approximation method, and the weighted lasso, each solved with both SLEP and SPAM.]
Figure 4: Regularization path computing time, ℓ2 estimation error, and support discrepancy across various sample sizes
under the nested tree group structure.

Figure 4 shows the performance of the three methods based on both solvers. SLEP is generally faster than SPAM, but
the two solvers give consistent conclusions about the estimators. As studied by Jenatton et al. (2011b), solving the
overlapping group lasso problem becomes highly efficient under such a nested group structure because, under a tree
structure, a single iteration over all groups is adequate to obtain the exact solution of the proximal operator. Our timing
results support this statement. Compared with the previous setting, the timing advantage of our method is reduced.
However, our method is still at least twice as fast as the overlapping group lasso. When considering estimation error
and support discrepancy, our proposed estimator consistently delivers similar results compared to the overlapping group
lasso estimator. The comparison with the weighted lasso remains similar to the previous experiment; while the lasso
estimator is also fast to compute, it delivers very poor approximation.
In summary, solving overlapping group lasso problems exhibits efficiency when applied to tree structures. However,
even in such cases, our proposed estimator maintains reasonable computational advantage and similar statistical
estimation performance compared to the original overlapping group lasso estimator.

4.3 Group structures based on real-world gene pathways

Table 1: Summary information for the gene pathways: the mean and standard deviation of both group size (d̄/sd(d))
and the overlapping degree (h̄/sd(h)), the number of genes (p), and the ratio required in Assumption 4.

Pathways                           d̄/sd(d)        h̄/sd(h)       p     max{𝓂, 𝒹max}/max{m, dmax}
BioCarta (Kong et al., 2006)       15.4/8.71      3.25/5.56     1129  2.35
PID (Schaefer et al., 2008)        38.51/19.59    3.28/5.09     2297  5.95
KEGG (Kanehisa et al., 2015)       58.48/47.36    2.58/3.39     4207  3.61
WIKI (Slenter et al., 2017)        38.17/44.10    4.35/7.70     6242  4.94
Reactome (Gillespie et al., 2021)  45.31/54.10    8.78/13.26    8331  2.35

The previous two sets of experiments are based on human-designed group structures. To reflect more realistic situations,
in this set of experiments, we use five gene pathway sets from the Molecular Signatures Database (Subramanian et al.,
2005) as group structures, summarized in Table 1. Each gene pathway represents a collection of genes united by
common biological characteristics. These pathways have been widely adopted in studies of cancer and biological
mechanisms (Menashe et al., 2010; Yuan et al., 2011; Livshits et al., 2015; Chen et al., 2020).
In particular, this data set can be used to assess the empirical applicability of Assumption 4 in our theory. The last
column of Table 1 shows the ratio between max{𝓂, 𝒹max} and max{m, dmax}. All values are within the range
[2, 6], indicating that the two terms can be treated as being of the same order.

We use the gene expression data from Van De Vijver et al. (2002) as the covariate matrix X, which can be accessed
through the R package breastCancerNKI (Schroeder et al., 2021). This design matrix has 295 observations and 24,481
genes. We perform gene filtering for each gene pathway set to exclude genes not defined within any pathways, a data
processing step commonly used in similar studies (Jacob et al., 2009; Lee and Xing, 2014; Chen et al., 2012). The
data-generating procedure for β ∗ and y remains almost the same as before, except that we use a much sparser model
because of the smaller sample size of the data. Specifically, we randomly sample 0.05m active groups and set the
coefficients in the other groups to zero. The weights in the overlapping group lasso are set to be w_g = √d_g.

Table 2: Comparison of the average computing time (in seconds) and the corresponding 95% confidence intervals for
each pathway group structure.

Group Structure   Overlapping group lasso       Weighted lasso            Proposed approximation
BioCarta          67.18 [62.28, 72.08]          6.22 [5.99, 6.45]         16.03 [15.17, 16.89]
KEGG              287.27 [267.18, 307.36]       28.77 [26.42, 31.12]      48.32 [45.12, 51.52]
PID               445.99 [420.56, 471.42]       10.27 [9.74, 10.80]       31.25 [29.43, 33.07]
WIKI              1279.22 [1214.34, 1344.10]    63.56 [57.36, 69.76]      132.79 [121.82, 143.76]
Reactome          3739.97 [3569.27, 3910.67]    116.34 [106.32, 126.36]   194.61 [181.31, 207.91]

Table 3: Comparison of the relative ℓ2 estimation errors and the corresponding 95% confidence intervals for each group
structure.

Group Structure   Overlapping group lasso   Lasso               Proposed approximation
BioCarta          0.22 [0.20, 0.24]         0.28 [0.24, 0.32]   0.25 [0.22, 0.28]
KEGG              0.52 [0.47, 0.57]         0.80 [0.76, 0.84]   0.54 [0.51, 0.57]
PID               0.23 [0.21, 0.25]         0.50 [0.44, 0.56]   0.25 [0.23, 0.28]
WIKI              0.55 [0.49, 0.61]         0.65 [0.58, 0.72]   0.55 [0.49, 0.61]
Reactome          0.66 [0.63, 0.69]         0.85 [0.83, 0.87]   0.65 [0.62, 0.68]

Table 4: Comparison of the support discrepancy and the corresponding 95% confidence intervals for each group
structure.

Group Structure   Overlapping group lasso   Lasso                 Proposed approximation
BioCarta          0.041 [0.039, 0.043]      0.043 [0.040, 0.046]  0.041 [0.039, 0.043]
KEGG              0.023 [0.021, 0.025]      0.026 [0.024, 0.028]  0.023 [0.021, 0.025]
PID               0.033 [0.031, 0.035]      0.033 [0.031, 0.035]  0.033 [0.031, 0.035]
WIKI              0.013 [0.012, 0.014]      0.013 [0.011, 0.015]  0.013 [0.012, 0.014]
Reactome          0.012 [0.011, 0.013]      0.020 [0.019, 0.021]  0.012 [0.010, 0.014]

Tables 2, 3, and 4 display the computing time, the estimation errors, and the support discrepancies, respectively, for the five pathway group
structures. The high-level message remains consistent. Both our proposed group lasso approximation and the lasso
approximation substantially reduce the computing time. Across all settings, the proposed method reduces the
computation time by a factor of 4–20 and is more than 10 times faster in all settings with higher dimensions. Meanwhile,
the proposed estimator delivers statistical performance similar to that of the original overlapping group lasso estimator.
In contrast, the lasso approximation fails to leverage the group information effectively and yields inferior estimation
results.

4.4 Comparison of the proposed weights against other weighting choices

In addition to the partitioned groups, the overlapping-based weight defined in (6) for each partitioned group ℊ is another
crucial component for ensuring the tightness of (7). We demonstrate this aspect with experiments comparing the
proposed weights (6) against two other commonly used choices of weights that do not consider the original overlapping
pattern: uniform weights and group size-dependent weights (Yuan and Lin, 2006), both applied to the same induced groups 𝒢.
Specifically, uniform weighting assigns all groups the same weight, while size-dependent weighting
uses the weight √𝒹_ℊ when w_g = √d_g (interlocking and gene pathway groups) and 1/𝒹_ℊ when w_g = 1/d_g (nested groups).
The comparative analysis is performed under all group structures in the previous simulations, maintaining consistent
simulation settings.

[Figure 5 here; panels compare the overlapping group lasso, the proposed approximation method, and the proposed partition with uniform and group size-dependent weights: (a) performance under the interlocking group structure, (b) performance vs. sample size under the nested tree structure.]
Figure 5: ℓ2 estimation error and support discrepancy of the proposed method using different choices of
weights under the interlocking group structure and the nested tree structure. Figure 5a is an extension of Figure 3,
and Figure 5b is an extension of Figure 4.

Figure 5a and Figure 5b illustrate the weight-effect comparison in the settings of Figure 3 and Figure 4, respectively.
Under the interlocking group structure (Figure 5a), the three weighting schemes deliver similar performance in terms of
estimation errors. Still, the size-dependent weighting leads to a larger support discrepancy. This interlocking group
structure does not strongly differentiate the three weighting schemes because the overlapping degree is nearly uniform. The
nested group structures (Figure 5b) more effectively highlight the importance of the proposed weights. Our method
significantly outperforms the other two weighting schemes and aligns well with the original overlapping group lasso
estimator. The weight design comparison on the gene pathway group structures is shown in Tables 5–6. The proposed
estimator gives a close approximation to the original overlapping group lasso, but the other two weighting designs lead
to significantly different performance in several settings.

Table 5: Comparative analysis of average estimation errors and the corresponding 95% confidence intervals for three
weighting designs. The * indicates that the error is statistically different from that of the overlapping group lasso
by a paired t-test.

Group Structure   Proposed weight      Uniform weight        Group size-dependent weight
BioCarta          0.25 [0.22, 0.28]    0.28 [0.26, 0.30]*    0.35 [0.30, 0.40]*
KEGG              0.54 [0.51, 0.57]    0.80 [0.77, 0.83]*    0.58 [0.51, 0.65]*
PID               0.25 [0.23, 0.27]    0.24 [0.21, 0.27]     0.39 [0.36, 0.42]*
WIKI              0.55 [0.49, 0.61]    0.83 [0.80, 0.86]*    0.74 [0.67, 0.81]*
Reactome          0.65 [0.62, 0.68]    0.58 [0.55, 0.61]*    0.69 [0.63, 0.75]

Table 6: Comparative analysis of average support discrepancy and the corresponding 95% confidence intervals for
three weighting designs. The * indicates that the value is statistically different from that of the overlapping group
lasso by a paired t-test.

Group Structure   Proposed weight        Uniform weight          Group size-dependent weight
BioCarta          0.041 [0.039, 0.043]   0.045 [0.042, 0.048]*   0.042 [0.039, 0.045]
KEGG              0.023 [0.021, 0.025]   0.059 [0.055, 0.063]*   0.024 [0.022, 0.026]
PID               0.033 [0.031, 0.035]   0.037 [0.035, 0.039]*   0.030 [0.027, 0.033]*
WIKI              0.013 [0.012, 0.014]   0.025 [0.023, 0.027]*   0.013 [0.012, 0.014]
Reactome          0.012 [0.010, 0.014]   0.010 [0.008, 0.012]*   0.022 [0.021, 0.023]*

In summary, the experiments demonstrate that the weights designed in our penalty also serve as an indispensable part of
a successful approximation to the overlapping group lasso estimation, which is another aspect of the tightest separable
relaxation property in Theorem 1.

5. Application Example: Pathway Analysis of Breast Cancer Data


In this section, we demonstrate the proposed method by predictive tasks on the breast cancer tumor data, as previously
used in Section 4.3. This time, unlike the previous simulation studies, we use the complete data set with tumor labels
for each observation. Specifically, each observation is labeled according to the status of the breast cancer tumors, with
79 classified as metastatic and 216 as non-metastatic. These labels serve as the response variable for our analysis.
Gene pathways have been widely used to identify key gene groups in cancer studies. In particular, Yuan et al. (2011); Chen et al.
(2012); Lee and Xing (2014) used the overlapping group lasso techniques to exclude less significant biological pathways
in cancer prediction. As a detailed example, Chen et al. (2012) leveraged the overlapping group lasso penalty to pinpoint
biologically meaningful gene groups. Their analysis revealed multiple groups of genes associated with essential
biological functions, such as protease activity, protease inhibitors, nicotine, and nicotinamide metabolism, which turned
out to be important breast cancer markers (Ma and Kosorok, 2010). This evidence highlights the potential of using
the overlapping group lasso penalty in cancer analysis. On the other hand, another way to incorporate gene pathway
information in such analysis is to retain genes by entire pathways. Jacob et al. (2009) used the latent overlapping group
lasso penalty to achieve this, while Mairal and Yu (2013) further introduced an ℓ∞ variant. The success of all these
previous studies reveals the potential of the gene pathway information in cancer prediction. They also show that the
proper way to use the pathways (e.g., either eliminating-by-group, as in overlapping group lasso, or including-by-group,
as in latent overlapping group lasso) highly depends on the data set and genes.
In our analysis, we use regularized logistic regression to build a classifier with the overlapping group lasso penalty
(OGL), our proposed group lasso approximation penalty (Proposed approximation), the standard lasso penalty, the
latent overlapping group lasso penalty (LOG) (Jacob et al., 2009), and the ℓ∞ latent overlapping group lasso penalty
of Mairal and Yu (2013). As mentioned in previous sections, our focus is not on justifying that the overlapping group
lasso should be used. Instead, our primary objective is to demonstrate that when an overlapping group lasso
penalty is used, our method provides a good approximation to the overlapping group lasso (with much faster
computation) across various pathway sets (Table 1), whether or not the overlapping group lasso penalty is the best
option for the problem.

Two additional aspects can also be evaluated as by-products of our analysis. First, as the lasso penalty does not
consider the pathway information, comparing the performance of the group-based penalty and the lasso penalty in
this problem would verify whether a specific gene pathway set contains predictive grouping information for breast
cancer tumor type. Second, by assessing the predictive performances among the overlapping group lasso classifier and
the latent overlapping group lasso classifiers, we can verify whether a specific gene pathway set is more suitable for
eliminating-by-group or including-by-group strategies for prediction.

Table 7: Computing time (in seconds) under different pathway databases.


Database     OGL      Lasso    Proposed approximation
BioCarts     732      26       75
KEGG         2468     102      225
PID          1231     41       107
WIKI         5172     170      395
Reactome     11356    321      1186

We adopt the evaluation procedure of Lee and Xing (2014): we randomly split the data set into 200 training
observations and 95 test observations. All methods are tuned by 5-fold cross-validation on the training data. We
calculate the area under the receiver operating characteristic curve (AUC), a commonly used metric for classification
accuracy (Hanley and McNeil, 1982), on the test data. The total time for the entire cross-validation process is recorded
as the computation time. The experiment is repeated 100 times independently; one replicate is sketched below. Table 7
and Table 8 show the average computing time and AUC, respectively.
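The following schematic (ours, not the authors' code) illustrates one replicate of this procedure; `fit_fn` is a hypothetical stand-in for fitting any one of the penalized logistic regressions over a grid of penalty levels `lambdas`.

```python
import time
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

def evaluate_once(X, y, fit_fn, lambdas, seed):
    """One replicate: 200/95 split, 5-fold CV tuning, test AUC, and tuning time."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=200, test_size=95, stratify=y, random_state=seed)
    start = time.time()
    cv_auc = []
    for lam in lambdas:
        fold_auc = []
        for tr, va in KFold(n_splits=5, shuffle=True, random_state=seed).split(X_tr):
            clf = fit_fn(X_tr[tr], y_tr[tr], lam)          # hypothetical fitting routine
            prob = clf.predict_proba(X_tr[va])[:, 1]
            fold_auc.append(roc_auc_score(y_tr[va], prob))
        cv_auc.append(np.mean(fold_auc))
    best_lam = lambdas[int(np.argmax(cv_auc))]
    clf = fit_fn(X_tr, y_tr, best_lam)
    elapsed = time.time() - start                           # recorded computation time
    test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return test_auc, elapsed
```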
The following can be summarized from the results:

• First and foremost, the proposed estimator acts as an effective and computationally efficient approximation
to the overlapping group lasso estimator. The results clearly support this claim: the proposed estimator
delivers the predictive performance closest to that of the overlapping group lasso estimator across the
pathway databases while reducing the computing time roughly tenfold.
• Second, the lasso classifier performs best only on the WIKI pathway set, suggesting that the pathways in the
WIKI database might not be sufficiently informative for cancer prediction.
• Third, whether the overlapping group lasso regularization or the latent overlapping group lasso regularization
is superior depends on the specific group information. Among the four pathway sets with useful group
information, the overlapping group lasso delivers superior predictive performance for the BioCarts and
PID databases, while the latent overlapping group lasso classifiers provide better predictions on the KEGG and
Reactome databases.

Table 8: Predictive AUC results of the five methods under different pathway databases.

Database     OGL       Lasso     Proposed approximation    LOG       LOG∞
BioCarts     0.7103    0.6989    0.7242                    0.6888    0.6995
KEGG         0.7021    0.6862    0.7081                    0.7390    0.7333
PID          0.7475    0.7004    0.7301                    0.6881    0.6891
WIKI         0.6862    0.7282    0.6893                    0.7149    0.7207
Reactome     0.6921    0.7301    0.7053                    0.7463    0.7438

As a remark, while our evaluation is based on prediction accuracy, it is not the only criterion for deciding whether a method
is appropriate for the dataset. For example, Mairal and Yu (2013) found that neither the overlapping group lasso model
nor the latent overlapping group lasso model outperformed simple ridge regularization in prediction. The value of
structured penalties also lies in their ability to identify potentially more interpretable genes, depending on the biological
interpretations.

6. Discussion
We have introduced a separable penalty that approximates the group lasso penalty when groups overlap. The
penalty is designed by partitioning the original overlapping groups into disjoint subgroups and reweighting the new
groups according to the original overlapping pattern. The penalty is the tightest separable relaxation of the overlapping
group lasso among all ℓq1/ℓq2 norms. We have also shown that, for linear problems, the proposed estimator is statistically
equivalent to the original overlapping group lasso estimator while enjoying significantly faster computation for large-scale
problems; a minimal sketch of the resulting per-subgroup update is given below.
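Separability is also what drives the computational gains summarized above: with disjoint subgroups, the proximal operator of the penalty decomposes into independent block soft-thresholding updates, one per subgroup. A minimal sketch under these assumptions (not the authors' implementation):

```python
import numpy as np

def prox_separable_group_lasso(beta, new_groups, new_weights, step, lam):
    """Proximal operator of step * lam * sum_g w_g ||beta_g||_2 for disjoint groups:
    each subgroup is shrunk independently (block soft-thresholding)."""
    out = beta.copy()
    for G, w in zip(new_groups, new_weights):
        norm = np.linalg.norm(beta[G])
        shrink = max(0.0, 1.0 - step * lam * w / norm) if norm > 0 else 0.0
        out[G] = shrink * beta[G]
    return out
```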
Several interesting directions could be considered for future research. The overlapping group lasso penalty performs
variable selection by eliminating entire groups of variables. A counterpart selection procedure includes variables by
entire groups, which is achieved by the latent overlapping group lasso (Jacob et al., 2009). That penalty also suffers
from a non-separability computational bottleneck, and it would be valuable to investigate whether a similar approximation
strategy could be designed to boost the computational performance in that scenario. More generally, the introduced
concept of the "tightest separable relaxation" might be a promising direction for optimizing non-separable functions.
Studying the more general form and corresponding properties of this concept may generate fundamental insights about
optimization.

Acknowledgments

The work is supported in part by the NSF grant DMS-2015298 and the 3-Cavaliers award from the University of
Virginia. The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota and
Research Computing at the University of Virginia for providing resources that contributed to the research results
reported within this paper. We appreciate the insightful feedback and comments from the editor and reviewer, which
significantly improved the paper.

References
E. Austin, W. Pan, and X. Shen. A new semiparametric approach to finite mixture of regressions using penalized
regression via fusion. Statistica Sinica, 30(2):783, 2020.
F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(6),
2008.
S. Basu, A. Shojaie, and G. Michailidis. Network granger causality with inherent grouping structure. The Journal of
Machine Learning Research, 16(1):417–453, 2015.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal
on imaging sciences, 2(1):183–202, 2009.
D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating
direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.
T. T. Cai, A. R. Zhang, and Y. Zhou. Sparse group lasso: Optimal sample complexity, convergence rate, and statistical
inference. IEEE Transactions on Information Theory, 2022.
F. Campbell and G. I. Allen. Within group variable selection through the exclusive lasso. Electronic Journal of Statistics,
11(2):4220–4257, 2017.
J. Chen, C. Liu, J. Cen, T. Liang, J. Xue, H. Zeng, Z. Zhang, G. Xu, C. Yu, Z. Lu, et al. Kegg-expressed genes and
pathways in triple negative breast cancer: Protocol for a systematic review and data mining. Medicine, 99(18), 2020.
X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured
sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.
J. Cheng, T. Li, E. Levina, and J. Zhu. High-dimensional mixed graphical models. Journal of Computational and
Graphical Statistics, 26, 2017.
P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple
classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.
A. Dedieu. An error bound for lasso and group lasso in high dimensions. arXiv:1912.11398, 2019.

W. Deng, W. Yin, and Y. Zhang. Group sparse optimization by alternating direction method. Proceedings of the SPIE,
2013.
J. H. Friedman, T. J. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv: Statistics
Theory, 2010.
M. Gillespie, B. Jassal, R. Stephan, M. Milacic, K. Rothfels, A. Senff-Ribeiro, J. Griss, C. Sevilla, L. Matthews,
C. Gong, C. Deng, T. Varusai, E. Ragueneau, Y. Haider, B. May, V. Shamovsky, J. Weiser, T. Brunson, N. Sanati,
L. Beckman, X. Shao, A. Fabregat, K. Sidiropoulos, J. Murillo, G. Viteri, J. Cook, S. Shorser, G. Bader, E. Demir,
C. Sander, R. Haw, G. Wu, L. Stein, H. Hermjakob, and P. D’Eustachio. The reactome pathway knowledgebase 2022.
Nucleic Acids Research, 50(D1):D687–D692, 11 2021.
R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-
Wesley, Reading, MA, second edition, 1994. ISBN 0201558025 9780201558029 0201580438 9780201580433
0201142368 9780201142365.
E. Grave, G. R. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated designs. Advances in
Neural Information Processing Systems, 24, 2011.
J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve.
Radiology, 143(1):29–36, 1982.
J. Huang and T. Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010. ISSN
00905364, 21688966. URL http://www.jstor.org/stable/20744481.
L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th Annual
International Conference on Machine Learning,ICML, 09:433–440, 2009.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. The Journal of
Machine Learning Research, 12:2777–2824, 2011a.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. The Journal of
Machine Learning Research, 12:2297–2334, 2011b.
M. Kanehisa, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. KEGG as a reference resource for gene and
protein annotation. Nucleic Acids Research, 44(D1):D457–D462, 10 2015.
S. Kim and E. P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an
application to eqtl mapping. The Annals of Applied Statistics, 6:1095–1117, 2012.
S. W. Kong, W. T. Pu, and P. J. Park. A multivariate approach for integrating genome-wide expression data and
biological knowledge. Bioinformatics, 22(19):2373–2380, 2006.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages
1302–1338, 2000.
S. Lee and E. Xing. Screening rules for overlapping group lasso. arXiv:1410.6880, 2014.
S. Lee and E. P. Xing. Leveraging input and output structures for joint mapping of epistatic and marginal eqtls.
Bioinformatics, 28(12):i137–i146, 2012.
E. Levina, A. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested lasso penalty. The
Annals of Applied Statistics, 2(1):245–263, 2008.
J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. URL
http://www.public.asu.edu/~jye02/Software/SLEP.
A. Livshits, A. Git, G. Fuks, C. Caldas, and E. Domany. Pathway-based personalized analysis of breast cancer
expression data. Molecular oncology, 9(7):1471–1483, 2015.
P.-L. Loh. High-dimensional statistics with systematically corrupted data. University of California, Berkeley, 2014.
K. Lounici, M. Pontil, S. V. D. Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group
sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.
S. Ma and M. R. Kosorok. Detection of gene pathways with predictive power for breast cancer prognosis. BMC
bioinformatics, 11(1):1–11, 2010.
J. Mairal and B. Yu. Supervised feature selection in graphs with path coding penalties and network flows. Journal of
Machine Learning Research, 14(8), 2013.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, R. Jenatton, and G. Obozinski. SPAMS: A sparse modeling software, v2.3. URL
http://spams-devel.gforge.inria.fr/downloads.html, 2014.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
I. Menashe, D. Maeder, M. Garcia-Closas, J. D. Figueroa, S. Bhattacharjee, M. Rotunno, P. Kraft, D. J. Hunter, S. J.
Chanock, P. S. Rosenberg, et al. Pathway analysis of breast cancer genome-wide association study highlights three
pathways and one canonical signaling cascade. Cancer research, 70(11):4453–4459, 2010.
K. Mohan, P. London, M. Fazel, D. Witten, and S.-I. Lee. Node-based learning of multiple gaussian graphical models.
The Journal of Machine Learning Research, 15(1):445–488, 2014.
S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of
m-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125–161,
2013.
S. Nowakowski, P. Pokarowski, W. Rejchel, and A. Sołtys. Improving group lasso for high-dimensional categorical
data. In International Conference on Computational Science, pages 455–470. Springer, 2023.
Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. Mathematical
Programming Computation, 5(2), 2013.
P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 71(5):1009–1030, 2009.
C. Schaefer, K. Anthony, S. Krupa, J. Buchoff, M. Day, T. Hannay, and K. Buetow. Pid: The pathway interaction
database. Nature Precedings, 3, 08 2008. doi: 10.1038/npre.2008.2243.1.
M. Schroeder, B. Haibe-Kains, A. Culhane, C. Sotiriou, G. Bontempi, and J. Quackenbush. breastCancerNKI:
Gene expression dataset published by van't Veer et al. [2002] and van de Vijver et al. [2002] (NKI), 2021. URL
http://compbio.dfci.harvard.edu/. R package version 1.32.0.
D. N. Slenter, M. Kutmon, K. Hanspers, A. Riutta, J. Windsor, N. Nunes, J. Mélius, E. Cirillo, S. L. Coort, D. Digles,
F. Ehrhart, P. Giesbertz, M. Kalafati, M. Martens, R. Miller, K. Nishida, L. Rieswijk, A. Waagmeester, L. M. T.
Eijssen, C. T. Evelo, A. R. Pico, and E. L. Willighagen. WikiPathways: a multifaceted pathway database bridging
metabolomics to other omics research. Nucleic Acids Research, 46(D1):D661–D667, 11 2017. ISSN 0305-1048.
A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R.
Golub, E. S. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide
expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.
A. Tank, E. B. Fox, and A. Shojaie. An efficient admm algorithm for structural break detection in multivariate time
series. arXiv preprint arXiv:1711.08392, 2017.
D. A. Tarzanagh and G. Michailidis. Estimation of graphical models through structured norm minimization. Journal of
machine learning research, 18(1), 2018.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B
(Methodological), 58(1):267–288, 1996.
M. J. Van De Vijver, Y. D. He, L. J. Van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse,
C. Roberts, M. J. Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England
Journal of Medicine, 347(25):1999–2009, 2002.
M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1 constrained quadratic
programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press,
2019.
S. Xiang, X. Shen, and J. Ye. Efficient nonconvex sparse group feature selection via continuous and discrete optimization.
Artificial Intelligence, 224:28–50, 2015. ISSN 0004-3702.
X. Yan and J. Bien. Hierarchical sparse modeling: A choice of two group lasso formulations. Statistical Science, 32(4):
531–560, 2017.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale snp
data via adaptive group lasso. BMC bioinformatics, 11(1):1–11, 2010.
J. Yang and J. Peng. Estimating time-varying graphical models. Journal of Computational and Graphical Statistics, 29
(1):191–202, 2020.
Y. Yang and H. Zou. A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and
Computing, 25(6):1129–1141, 2015.

G. Yu and J. Bien. Learning local dependence in ordered data. The Journal of Machine Learning Research, 18(1):
1354–1413, 2017.
L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. Advances in Neural Information Process
Systems, pages 352–360, 2011.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563,
2006.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection.
Annals of Statistics, 37(6A):3468–3497, 2009a.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection.
Annals of Statistics, 37(6A):3468–3497, 2009b.

A. Notation summary

Table 9: Mathematical notations in the paper.

Indices:
$[z]$ : index set $\{1, \dots, z\}$
$G_g$ : index set of the $g$th group
$G_S$ : collection of non-zero groups, $\bigcup_{g\in S(\beta)} G_g$
$G_{\bar S}$ : $\bigcup_{g\in \bar S(\beta)} G_g$
$\beta_j$ : the $j$th element of $\beta$
$\beta_{G_g}$ : sub-vector of $\beta$ indexed by $G_g$
$\beta_{M(S)}$ : projection of $\beta$ onto $M(S)$
$A_{\cdot, T}$ : sub-matrix consisting of the columns indexed by $T$

Parameters:
$H$ : a diagonal matrix, $\operatorname{diag}(\frac{1}{h_1}, \cdots, \frac{1}{h_p})$
$G$ : group structure matrix, $G_{gj} = 1$ iff $\beta_j \in G_g$
$d_g$ : group size, $d_g = \sum_{j\in[p]} G_{gj}$
$d_{\max}$ : maximum group size, $d_{\max} = \max_{g\in[m]} d_g$
$h_j$ : overlap degree, $h_j = \sum_{g\in[m]} G_{gj}$
$h^g_{\max}$ : maximum overlap degree in $G_g$, $h^g_{\max} = \max_{j\in G_g} h_j$
$h^g_{\min}$ : minimum overlap degree in $G_g$, $\min_{j\in G_g} h_j$
$h_{\max}$ : maximum overlap degree, $h_{\max} = \max_{j\in[p]} h_j$
$h_{\bar g}$ : overlap degree of $\bar G_{\bar g}$, $h_{\{j \mid j\in \bar G_{\bar g}\}}$
$\sigma$ : parameter in the sub-Gaussian distribution
$s_g$ : number of non-zero groups, $|S|$
$\bar s_g$ : number of groups in the augmented group support set, $|\bar S|$
$\kappa$ : parameter controlling convexity

Definitions:
$\phi(\beta)$ : group lasso norm, $\sum_{g\in[m]} w_g \|\beta_{G_g}\|_2$
$\phi^*(\beta)$ : dual norm of $\phi(\beta)$, $\max_{g\in[m]} \frac{1}{w_g}\|(H\beta)_{G_g}\|_2$
$F(\bar g) \subseteq [m]$ : overlapping groups which include the variables in $\bar G_{\bar g}$
$F^{-1}(g) \subseteq [\bar m]$ : non-overlapping groups that were partitioned from $G_g$
$\|\beta_{\{G,w\}}\|_{q_1, q_2}$ : $\ell_{q_1}/\ell_{q_2}$ norm, $\Big(\sum_{g\in[m]} \Big(w_g\big(\sum_{j\in G_g} |\beta_j|^{q_2}\big)^{1/q_2}\Big)^{q_1}\Big)^{1/q_1}$
$\operatorname{supp}(\beta)$ : support set, $\{j \in \{1,\dots,p\} \mid \beta_j \ne 0\}$
$S(\beta)$ : group support set, $\{g \in \{1,\dots,m\} \mid G_g \cap \operatorname{supp}(\beta) \ne \emptyset\}$
$\bar S(\beta)$ : $\{g \in \{1,\dots,m\} \mid G_g \cap G_{S(\beta)} \ne \emptyset\}$
$M(S)$ : $\{\beta \in \mathbb{R}^p \mid \beta_j = 0 \text{ for all } j \in (G_S)^c\}$
$M^\perp(S)$ : $\{\beta \in \mathbb{R}^p \mid \beta_j = 0 \text{ for all } j \in G_S\}$
$\Omega(G, s_g)$ : $\big\{\beta : \sum_{g\in[m]} 1\{\|\beta_{G_g}\|_2 \ne 0\} \le s_g\big\}$
$J_G(\beta)$ : $[p] \setminus \bigcup_{G_g \cap \operatorname{supp}(\beta) = \emptyset} G_g$
$G_{J_G(\beta)}$ : $\{g \in [m] \mid G_g \cap J_G(\beta) \ne \emptyset\}$
$G_{J_G(\beta)^c}$ : $\{g \in [m] \mid G_g \cap J_G(\beta)^c \ne \emptyset\}$

B. Uniqueness of the overlapping group lasso problem
The group lasso penalization problems (14) and (15) are generally convex but may not be strictly convex. The
uniqueness of their solutions has been studied by Jenatton et al. (2011a). Here we restate their result for
completeness. Note that our theoretical properties in Section 3 do not rely on such uniqueness.
Lemma 1. (Proposition 1 of Jenatton et al. (2011a)) If the Gram matrix $Q = X^\top X/n$ is invertible, or if there exists
$g \in [m]$ such that $G_g = [p]$, then the optimization problem specified in (14), with $\lambda_n > 0$, is guaranteed to have a
unique solution. The same holds for problem (15) with $G$ replaced by $\bar G$.

C. Additional Theoretical Results


To begin with, we introduce our proposed upper bound for the dual norm of the overlapping group lasso.
Proposition 1. The sharp upper bound for $\phi^*$ (the dual norm of the overlapping group lasso penalty in (1)) is
$$\max_{g\in[m]} \frac{1}{w_g}\big\|(H\beta)_{G_g}\big\|_2,$$
where $H$ is a diagonal matrix with diagonal entries $(\frac{1}{h_1}, \cdots, \frac{1}{h_p})$.
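As a quick numerical illustration (ours, with hypothetical toy groups and weights), one can check this bound by lower-bounding the dual norm with randomly sampled vectors $u$ rescaled so that $\phi(u) = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
groups = [np.array([0, 1, 2]), np.array([2, 3, 4]), np.array([4, 5])]   # toy overlapping groups
w = np.array([1.0, 1.5, 2.0])                                           # toy group weights
h = np.array([sum(j in G for G in groups) for j in range(p)])           # overlap degrees h_j

phi = lambda u: sum(wg * np.linalg.norm(u[G]) for G, wg in zip(groups, w))
upper = lambda v: max(np.linalg.norm((v / h)[G]) / wg for G, wg in zip(groups, w))

v = rng.normal(size=p)
U = rng.normal(size=(5000, p))
U /= np.array([phi(u) for u in U])[:, None]      # rescale so phi(u) = 1 for each sample
dual_lower = (U @ v).max()                       # Monte Carlo lower bound on phi*(v)
print(dual_lower <= upper(v))                    # the upper bound of Proposition 1 holds
```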
Assumption 7. Under model (13), we assume

1. (Sub-Gaussian noises) The coordinates of $\varepsilon$ are i.i.d. zero-mean sub-Gaussian random variables with parameter $\sigma$; that is, there exists $\sigma > 0$ such that
$$\mathbb{E}\big[e^{t\varepsilon_i}\big] \le e^{\sigma^2 t^2/2}, \quad \text{for all } t \in \mathbb{R}.$$

2. (Group normalization condition) $\sqrt{\gamma_{\max}\big(X_{G_g}^\top X_{G_g}/n\big)} \le c$ for some constant $c$ and every $g\in[m]$.

3. (Restricted strong convexity condition) For some $\kappa > 0$,
$$\frac{\big\|X(\bar\beta - \beta^*)\big\|_2^2}{n} \ge \kappa \big\|\bar\beta - \beta^*\big\|_2^2, \quad \text{for all } \bar\beta \in \Big\{\beta \;\Big|\; \phi\big((\beta - \beta^*)_{M^\perp(S)}\big) \le 3\,\phi\big((\beta-\beta^*)_{M(S)}\big)\Big\}.$$
Remark: The assumption requires an upper bound on the quadratic form associated with each group. This type
of assumption is commonly used to develop estimation error upper bounds for the non-overlapping group lasso
(Lounici et al., 2011; Huang and Zhang, 2010; Dedieu, 2019; Negahban et al., 2012; Wainwright, 2019). Additionally,
restricted curvature conditions have been discussed in detail by Wainwright (2019). The curvature $\kappa$ in Assumption 7 is
a parameter measuring convexity. Generally speaking, restricted curvature conditions state that the loss function is
locally strongly convex in a neighborhood of the ground truth, which guarantees that closeness in the loss function
implies closeness between the estimate and the true parameter. However, such a strong convexity condition cannot
hold globally in the high-dimensional setting, so we focus on a restricted set of estimates. Restricted curvature conditions are
milder than the group-based RIP conditions used in Huang and Zhang (2010) and Dedieu (2019), which require that all
submatrices up to a certain size are close to isometries (Wainwright, 2019). Based on Assumption 7, Theorem 6 gives an
$\ell_2$-norm estimation error upper bound for the overlapping group lasso.
Theorem 6. Define $h^g_{\min} = \min_{j\in G_g} h_j$, $d_{\max} = \max_{g\in[m]} d_g$, and $\bar d_{\max} = \max_{\bar g\in[\bar m]} d_{\bar g}$. Suppose Assumption 7 holds. For any
$\delta \in [0,1]$,

1. with $\lambda_n = \frac{8c\sigma}{\sqrt{\min_{g\in[m]}(w_g^2 h^g_{\min})}}\sqrt{\frac{d_{\max}\log 5}{n} + \frac{\log m}{n} + \delta}$, the following bound holds for $\hat\beta^{G}$ in (14):
$$\big\|\hat\beta^{G} - \beta^*\big\|_2^2 \;\lesssim\; \frac{\sigma^2}{\kappa^2}\cdot \frac{\sum_{g\in \bar S} w_g^2 \cdot h^{G_S}_{\max}}{\min_{g\in[m]} w_g^2 h^g_{\min}} \cdot \Big(\frac{d_{\max}\log 5}{n} + \frac{\log m}{n} + \delta\Big), \qquad (29)$$
with probability at least $1 - e^{-2n\delta}$.

2. with $\lambda_n = \frac{8c\sigma}{\min_{\bar g\in[\bar m]} \bar w_{\bar g}}\sqrt{\frac{\bar d_{\max}\log 5}{n} + \frac{\log \bar m}{n} + \delta}$, the following bound holds for $\hat\beta_{\bar G}$ in (15):
$$\big\|\hat\beta_{\bar G} - \beta^*\big\|_2^2 \;\lesssim\; \frac{\sigma^2}{\kappa^2}\cdot \frac{\sum_{\bar g\in F^{-1}(S)} \bar w_{\bar g}^2}{\min_{\bar g\in[\bar m]} \bar w_{\bar g}^2} \cdot \Big(\frac{\bar d_{\max}\log 5}{n} + \frac{\log \bar m}{n} + \delta\Big). \qquad (30)$$

Following the framework in (Negahban et al., 2012; Wainwright, 2019), we further study the applicability of the
restricted curvature conditions in terms of a random design matrix. Given a group structure G, Theorem 6 is developed
based on the assumption that the fixed design matrix X satisfies the restricted curvature condition. In practice, verifying
that a given design matrix X satisfies this condition is difficult. Indeed, developing methods to “certify” design matrices
this way is one line of ongoing research (Wainwright, 2019). However, it is possible to give high-probability results
based on the following assumptions.
Theorem 7. Under Assumptions 1, 2, and 3, we have

1. With probability at least $1 - e^{-c'n}$, $\max_{g\in[m]} \sqrt{\gamma_{\max}\big(X_{G_g}^\top X_{G_g}/n\big)} \le c$ for some constants $c, c' > 0$, as long as
$\log m = o(n)$.

2. The restricted strong convexity condition, namely
$$\frac{\big\|X(\bar\beta-\beta^*)\big\|_2^2}{n} \ge \kappa\big\|\bar\beta-\beta^*\big\|_2^2, \quad \text{for all } \bar\beta \in \Big\{\beta \;\Big|\; \phi\big((\beta-\beta^*)_{M^\perp(S)}\big) \le 3\,\phi\big((\beta-\beta^*)_{M(S)}\big)\Big\},$$
holds with probability at least $1 - \frac{e^{-n/32}}{1-e^{-n/64}}$ for some constant $\kappa > 0$.

D. Proofs
D.1 Proof of Theorem 1

Lemma 8. Consider a norm || ·{G̃,w̃} ||q1 ,q2 satisfying the conditions of Equation (11). The following two statements
hold:

1. For any g ∈ [m], there exists a g̃ ∈ [|G̃|] such that Gg ⊆ G̃g̃ .

2. For any g̃ ∈ [|G̃|], there exists a g ∈ [m] such that G̃g̃ = Gg .

Proof Based on Lemma 8, if a norm $\|\beta_{\{\tilde G,\tilde w\}}\|_{q_1,q_2}$ exists that satisfies Equation (11), then it must be that $\tilde G = \bar G$.
Consequently, any disparity between $\|\beta_{\{\tilde G,\tilde w\}}\|_{q_1,q_2}$ and our proposed norm could only be due to differences in the weights
or in the values of $q_1$ or $q_2$.
For any $\beta$ with non-zero elements solely in the $\bar g$th group $\bar G_{\bar g}$, we have
$$\sum_{\bar g\in[\bar m]} \bar w_{\bar g}\big\|\beta_{\bar G_{\bar g}}\big\|_2 = \sum_{\bar g\in[\bar m]}\Big(\sum_{g\in F(\bar g)} w_g\Big)\big\|\beta_{\bar G_{\bar g}}\big\|_2 \le \big\|\beta_{\{\bar G,\tilde w\}}\big\|_{q_1,q_2} \le \sum_{g\in[m]} w_g\big\|\beta_{G_g}\big\|_2. \qquad (31)$$
This implies that $\big(\tilde w_{\bar g}^{q_1}\|\beta_{\bar G_{\bar g}}\|_{q_2}^{q_1}\big)^{1/q_1} = \bar w_{\bar g}\|\beta_{\bar G_{\bar g}}\|_2$. By setting one element in $\bar G_{\bar g}$ to 1 and the other elements to 0, it follows
that $\tilde w_{\bar g} = \bar w_{\bar g}$. Since this holds for any group in $\bar G$, we have $\tilde w = \bar w$.
From Equation (31), it is evident that $\big(\bar w_{\bar g}^{q_1}\|\beta_{\bar G_{\bar g}}\|_{q_2}^{q_1}\big)^{1/q_1} = \bar w_{\bar g}\|\beta_{\bar G_{\bar g}}\|_2$ for any $\beta$ with non-zero elements only in $\bar G_{\bar g}$.
This suggests that $q_1 = 1$ and $q_2 = 2$. Therefore, the existing norm $\|\beta_{\{\tilde G,\tilde w\}}\|_{q_1,q_2}$ does not satisfy the second condition
in Equation (11).

D.1.1 P ROOF OF L EMMA 8

Proof We begin by proving the first item in Lemma 8. Recall that G represents the space of all possible partitions of
[p]. Given that G̃ ∈ G, for an arbitrary g ∈ [m], suppose Gg ⊈ G̃g̃ for any g̃. Then, we can identify the smallest set T
such that [
Gg ⊆ G̃g̃ .
g̃∈T

Let T = {t1 , t2 , · · · , t|T | }. Choose βj ∈ (Gg ∩ G̃t1 ) and βk ∈ (Gg ∩ G̃t2 ). As βj and βk are both in Gg , if an original
group includes βj , it also contains βk . Consider a vector β where only βj and βk are non-zero. We have
X  X q
wg ||βGg ||2 = wg βj2 + βk2 ⩽ ||β{G̃,w̃} ||q1 ,q2
g∈[m] {g|βj ∈Gg }
X  X q
⩽ wg ||βGg ||2 = wg βj2 + βk2 ,
g∈[m] {g|βj ∈Gg }

leading to
1 1 1  X q
||β{G̃,w̃} ||q1 ,q2 = ((w̃t1 |βj |)q1 + (w̃t2 |βk |)q1 ) q1 = w̃tq11 |βj | + w̃tq21 |βk | = wg βj2 + βk2 ,
{g|βj ∈Gg }

for any 0 ⩽ q1 , q2 ⩽ ∞. However, by setting


 1 1 √ P 
βj = βk = 1, β{[p]\{j,k}} = 0 if wtq11 + wtq21 ̸= 2 {g|βj ∈Gg } wg
1 1 √ P  ,
β = 2, β = 1, β q1 q1
j k {[p]\{j,k}} = 0 if wt1 + wt2 = 2 {g|βj ∈Gg } wg

we arrive at a contradiction. Thus, we demonstrate that if a norm || ·{G̃,w} ||q1 ,q2 exists, then each group in G̃ is a union
of groups in G.

Now, to prove the second item: Since the first part implies that each group in G̃ is S
a union of groups in G, let us
consider g̃ ∈ [|G̃|]. Suppose there exists a index set V ⊆ [m] such that G̃g̃ = Gg with |V | > 1. Denote
g∈V
V = {v1 , · · · , v|V | }, and consider two cases:
Case 1: ∄a ∈ [m] s.t. (Gv1 ∪ Gv2 ) ⊆ Ga .
Case 2: ∃a ∈ [m] s.t. (Gv1 ∪ Gv2 ) ⊆ Ga .
Under Case 1, if only Gv1 and Gv2 have non-zero values in β, we obtain:
X  X q  X q
wg ||βGg ||2 = wg βG2v + wg βG2v
1 2
g∈[m] g∈F (v1 ) g∈F (v2 )
X
⩽ ||β{G̃,w̃} ||q1 ,q2 ⩽ wg ||βGg ||2
g∈[m]
q q
= wv1 βG2v + wv2 βG2v
1 2
 X q  X q
= wg βG2v + wg βG2v ,
1 2
g∈F (v1 ) g∈F (v2 )

which leads to
q q  X  q1  X  q1
|βj |q2 |βj |q2
2 2
wv1 βG2v + wv2 βG2v = w̃g̃ = w̃g̃ .
1 2
j∈G̃g̃ j∈{Gv1 ∪Gv2 }

This equation does not hold by picking j ∈ Gv1 , k ∈ Gv2 , and setting
1
(
βj = βk = 1, β{[p]\{j,k}} = 0 ̸ w̃g̃ · 2 q2
if wv1 + wv2 =
1 .
βj = 2, βk = 1, β{[p]\{j,k}} = 0 if wv1 + wv2 = w̃g̃ · 2 q2
Therefore, |V | > 1 cannot happen.
Under Case 2, let βj ∈ Gv1 and βk ∈ Gv2 . Define β j as the vector with 1 at the j-th element and 0 elsewhere, and β k as
the vector with 1 at the k-th element and 0 elsewhere, with j ̸= k.
When β = β j , we find:
X  X  X
wg ||βGg ||2 = wg ⩽ w̃g̃ ⩽ wg ||βGg ||2 = wv1 ,
g∈[m] g∈F (v1 ) g∈[m]

indicating that w̃g̃ = wv1 for all q1 , q2 .


Similarly, for β = β k , we have:
X  X  X
wg ||βGg ||2 = wg ⩽ w̃g̃ ⩽ wg ||βGg ||2 = wv2 ,
g∈[m] g∈F (v2 ) g∈[m]

indicating that w̃g̃ = wv2 for all q1 , q2 .


If wv1 ̸= wv2 , then such a weight assignment is not feasible. Assuming wv1 = wv2 = wg̃ = k, then for any β with
1
non-zero values only in Gv1 , we have wg̃ ||βGv1 ||2 = (wg̃ ||βGv1 ||qq12 ) q1 , implying that if a norm satisfies (11), it must be
an ℓ1 /ℓ2 norm.
Since Gv1 and Gv2 are different groups, there is at least one original group that contains variables in Gv1 but not in Gv2 ,
and vice versa. Taking β with non-zero values in both Gv1 and Gv2 , we find:
X
k||βGg ||2 > k||βGv1 ∪ βGv2 ||2 = ||β{G̃,w̃} ||1,2 ,
g∈[m]

which is a contradiction. Hence, in both cases, |V | > 1 is not possible, implying that there exists a g ∈ [m] such that
G̃g̃ = Gg .

D.2 Proof of Theorem 2

Proof We begin by examining the bound for the estimator β̂ G . Considering a fixed design matrix X and a group
structure G that comply with Assumption 7, and selecting an appropriate λn , Theorem 6 asserts that both inequalities
(17) and (19) hold with a probability of at least 1 − e−2nδ .
Under Assumptions 1, 2, and 3, Theorem 7 establishes that Assumption 7 is valid with probability at least
$1 - e^{-c_2 n\delta^2} - \frac{e^{-n/32}}{1-e^{-n/64}}$, where $c_2$ is a positive constant.

Considering these two theorems together, we conclude that under Assumptions 1, 2, and 3, both (17) and (19) are
satisfied with probability at least $1 - e^{-c_2 n\delta^2} - e^{-2n\delta} - \frac{e^{-n/32}}{1-e^{-n/64}}$. This probability can be further bounded below by
$1 - e^{-c'n\delta}$ for some suitable constant $c'$.
The bound for $\hat\beta_{\bar G}$ can be derived directly, noting that it is a group lasso estimator with the non-overlapping group structure $\bar G$ and weights $\bar w$.

D.3 Proof of Corollary 3

Proof
Assuming $\max\{d_{\max}, m\} \asymp \max\{\bar d_{\max}, \bar m\}$, we have $\big(\frac{d_{\max}\log 5}{n} + \frac{\log m}{n} + \delta\big) \asymp \big(\frac{\bar d_{\max}\log 5}{n} + \frac{\log \bar m}{n} + \delta\big)$.
With $\bar w_{\bar g} = \sum_{g\in F(\bar g)} w_g$, by the Cauchy–Schwarz inequality, we have
$$\bar w_{\bar g}^2 = \Big(\sum_{g\in F(\bar g)} w_g\Big)^2 \le h_{\bar g}\Big(\sum_{g\in F(\bar g)} w_g^2\Big).$$
Therefore,
$$\sum_{\bar g\in F^{-1}(S)} \bar w_{\bar g}^2 \;\le\; \sum_{\bar g\in F^{-1}(S)} h_{\bar g}\Big(\sum_{g\in F(\bar g)} w_g^2\Big) \;\le\; h^{G_S}_{\max}\sum_{\bar g\in F^{-1}(S)}\;\sum_{g\in F(\bar g)} w_g^2.$$
Let us introduce $k_g$ as the number of non-overlapping groups into which the $g$th original group is partitioned in the new
structure $\bar G$, and define $K$ as the maximum number of such partitions, i.e., $K = \max_g k_g$.
Now we want to show that
$$\sum_{\bar g\in F^{-1}(S)}\;\sum_{g\in F(\bar g)} w_g^2 \;\le\; \sum_{g\in \bar S} k_g w_g^2.$$
Recall the definition of $F^{-1}(S)$ as
$$F^{-1}(S) = \{\bar g \mid \bar g\in F^{-1}(g),\ g\in S\}.$$
For each $\bar g\in F^{-1}(g)$ that also belongs to $F^{-1}(S)$, we add $w_g^2$ to the summation. Therefore, the maximum contribution
from each original group $g$ to the sum $\sum_{\bar g\in F^{-1}(S)}\sum_{g\in F(\bar g)} w_g^2$ is $k_g w_g^2$.
Given that
$$\{g \mid g\in F(\bar g) \text{ and } \bar g\in F^{-1}(S)\} = \bar S,$$
it follows that
$$h^{G_S}_{\max}\sum_{\bar g\in F^{-1}(S)}\;\sum_{g\in F(\bar g)} w_g^2 \;\le\; h^{G_S}_{\max}\sum_{g\in \bar S} k_g w_g^2 \;\le\; h^{G_S}_{\max} K\sum_{g\in \bar S} w_g^2.$$

On the other hand, we have
$$\min_{\bar g\in[\bar m]} \bar w_{\bar g}^2 = \min_{\bar g\in[\bar m]}\Big(\sum_{g\in F(\bar g)} w_g\Big)^2 \ge \min_{\bar g\in[\bar m]}\Big(h_{\bar g}\min_{g\in F(\bar g)} w_g\Big)^2 \ge \min_{\bar g\in[\bar m]} h_{\bar g}\Big(\min_{g\in F(\bar g)} w_g\Big)^2 \ge \min_{g\in[m]} w_g^2 h^g_{\min},$$
where the last step uses the fact that $h_{\bar g} \ge h^g_{\min}$ for every $g\in F(\bar g)$.
Therefore,
$$\frac{\sum_{\bar g\in F^{-1}(S)} \bar w_{\bar g}^2}{\min_{\bar g\in[\bar m]} \bar w_{\bar g}^2} \;\le\; \frac{K\big(\sum_{g\in \bar S} w_g^2\big)\cdot h^{G_S}_{\max}}{\min_{g\in[m]} w_g^2 h^g_{\min}}.$$
Consequently, if $K$ is upper bounded by a constant, then
$$\frac{\sigma^2}{\kappa^2}\cdot \frac{\sum_{\bar g\in F^{-1}(S)} \bar w_{\bar g}^2}{\min_{\bar g\in[\bar m]}\bar w_{\bar g}^2}\cdot\Big(\frac{\bar d_{\max}\log 5}{n} + \frac{\log \bar m}{n} + \delta\Big) \;\lesssim\; \frac{\sigma^2}{\kappa^2}\cdot\frac{\sum_{g\in \bar S} w_g^2\cdot h^{G_S}_{\max}}{\min_{g\in[m]} w_g^2 h^g_{\min}}\cdot\Big(\frac{d_{\max}\log 5}{n} + \frac{\log m}{n} + \delta\Big).$$

D.4 Proof of Proposition 1

Proof Let HGg be the sub-matrix of H consisting of the columns indexed by Gg . Let uGg , vGg be the sub-vectors of
u, v indexed by Gg respectively. Given two vectors u, v ∈ Rp , we have
ϕ∗ (v) = sup uT v = sup {u1 v1 + u2 v2 + · · · + up vp }

ϕ(u)⩽1 ϕ(u)⩽1
 
v1 vp
= sup · h1 · u1 + · · · + · hp · up
ϕ(u)⩽1 h1 hp
(m ) (m  )
X T X (Hv)Gg
= sup HGg vGg uGg = sup · wg · uGg
ϕ(u)⩽1 g=1 ϕ(u)⩽1 g=1 wg
(m ) 
X (Hv)Gg 1

2
⩽ sup · wg uGg 2 ⩽ max · (Hv)Gg 2 · ϕ(u)
ϕ(u)⩽1 g=1 wg g∈[m] wg

1
⩽ max · (Hv)Gg 2
,
g∈[m] wg
where the first inequality is achieved by using Cauchy’s inequality.

Let g0 = arg max w1g (Hv)Gg and hgmax


0
= 1. Define u ∈ Rp as
g∈[m] 2

0 for j ∈
(
/ Gg0
1 vj 1
uj = · · for j ∈ Gg0 ,
wg0 hj 2 (Hv)Gg
0 2

then we have v
m
1 1 vj 2
X u X
ϕ (u) = wg uGg = w g0 · · ·t
u
g=1
2 wg0 (Hv)Gg j∈G
hj 4
g0
0 2
v
1 vj 2
u X
= = 1,
u
hj 2
t
(Hv)Gg j∈G g0
0 2

where the last equality holds due to the fact that hj = 1 for any j ∈ Gg0 , and we also have
1 1 X vj 2 1 1 2
uT v = · 2 = w · (Hv)Gg
wg0 (Hv) h g (Hv)Gg 0 2
Gg j∈Gg0 j 0
0 2 0 2
1 1
= (Hv)Gg = max (Hv)Gg = ϕ∗ (v) .
wg0 0 2 g∈[m] wg0 2

Therefore, this is a sharp bound.

D.5 Proof of Theorem 6

Proof In this section, we mostly follow the proof in Chapter 14 of Wainwright (2019). By default, we take $S = S(\beta^*)$
and $\bar S = \bar S(\beta^*)$ in all settings. From the optimality of $\hat\beta^G$, we have
1 2 1 2  
0⩾ Y − X β̂ − Y − Xβ ∗ + λn ϕ(β̂) − ϕ(β ∗ )
n 2 n 2
1 T   
= Y Y − 2Y T X β̂ + β̂ T X T X β̂ − Y T Y + 2Y T Xβ ∗ − β ∗T X T Xβ ∗ + λn ϕ(β̂) − ϕ (β ∗ )
n
1   
= (2X T Xβ ∗ − 2X T Y )T (β̂ − β ∗ ) + (β̂ − β ∗ )T X T X(β̂ − β ∗ ) + λn ϕ(β̂) − ϕ(β ∗ )
n
2  
 Y − Xβ ∗   X β̂ − β ∗  
= ▽ 2
, β̂ − β ∗ + 2
+ λn ϕ(β̂) − ϕ(β ∗ )
n n
2
 Y − Xβ ∗     2  

⩾ ▽ 2
, β̂ − β +κ β̂ − β ∗ + λn ϕ(β̂) − ϕ(β ∗ )
n 2
2
 Y − Xβ ∗     2  
⩾− ▽ 2
, β̂ − β ∗ + κ β̂ − β ∗ + λn ϕ(β̂) − ϕ(β ∗ ) ,
n 2

where the penultimate step is valid due to the assumption of restrictive strong convexity.
By applying Holder’s inequality with the regularizer ϕ and its dual norm ϕ∗ , we have
2  ∥Y − Xβ ∗ ∥2  
∥Y − Xβ ∗ ∥2 
  

▽ , β̂ − β ⩽ ϕ∗ ▽ 2
ϕ β̂ − β ∗ . (32)
n n

Next, we have
   
ϕ(β̂) = ϕ β ∗ + (β̂ − β ∗ ) = ϕ βM ∗
(S) + β ∗

M (S) + ( β̂ − β ∗
)M (S) + ( β̂ − β ∗
) ⊥
M (S)
   
∗ ∗ ∗ ∗
⩾ ϕ βM (S) + (β̂ − β )M ⊥ (S) − ϕ(βM ⊥ (S) ) − ϕ (β̂ − β )M (S)
   
∗ ∗ ∗ ∗
= ϕ(βM (S) ) + ϕ (β̂ − β ) ⊥
M (S) − ϕ(β ⊥
M (S) ) − ϕ ( β̂ − β )M (S) .

The inequality holds by applying the triangle inequality on ϕ(β̂), and the last step holds by applying Lemma 10.
Consequently, we have
   
ϕ(β̂) − ϕ (β ∗ ) ⩾ ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − 2ϕ(βM

⊥ (S) )
    (33)
= ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) ,
 
∗ ∗
where ϕ βM ⊥ (S) = 0 as βM ⊥ (S) is a zero vector.

Based on Equation(32) and Equation(33), we have
1 2 1  
2
Y − X β̂ − ∥Y − Xβ ∗ ∥2 + λn ϕ(β̂) − ϕ(β ∗ )
n 2 n
2
∥Y − Xβ ∗ ∥2 
    2  

⩾− ▽ , β̂ − β + κ β̂ − β ∗ + λn ϕ(β̂) − ϕ(β ∗ )
n 2
2
∥Y − Xβ ∗ ∥2 
  2       
⩾ κ β̂ − β ∗ + λn ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − ▽ , β̂ − β ∗
2 n
  2       ∥Y − Xβ ∗ ∥2   
⩾ κ β̂ − β ∗ + λn ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − ϕ∗ ▽ 2
ϕ β̂ − β ∗
2 n
  2      λ  
n
⩾ κ β̂ − β ∗ + λn ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − ϕ β̂ − β ∗ ,
2 2
∥Y −Xβ ∗ ∥22
 
where the last step is valid because Lemma 9 implies that we can guarantee λn ⩾ 2ϕ∗ ▽ n with high
probability by taking appropriate λn . Moreover, Lemma 11 implies that
n    o
β̂ ∈ β ∈ Rp | ϕ (β − β ∗ )M ⊥ (S) ⩽ 3ϕ (β − β ∗ )M (S) .

By the triangle inequality, we have


     
ϕ(β̂ − β ∗ ) = ϕ (β̂ − β ∗ )M (S) + (β̂ − β ∗ )M ⊥ (S) ⩽ ϕ (β̂ − β ∗ )M (S) + ϕ (β̂ − β ∗ )M ⊥ (S) ,

and hence we have


1 2 1  
2
Y − X β̂ − ∥Y − Xβ ∗ ∥2 + λn ϕ(β̂) − ϕ(β ∗ )
n 2 n
  2      λ  
n
⩾ κ β̂ − β ∗ + λn ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − ϕ β̂ − β ∗
2 2
  2     
⩾ κ β̂ − β ∗ + λn ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S)
2
λn     
− ϕ (β̂ − β ∗ )M (S) + ϕ (β̂ − β ∗ )M ⊥ (S)
2
2 λn  

⩾ κ β̂ − β + ϕ(β̂ − β ∗ )M ⊥ (S) − 3ϕ(β̂ − β ∗ )M (S)
2 2
2 3λn  

⩾ κ β̂ − β − ϕ (β̂ − β ∗ )M (S) .
2 2
   
By definition, we have ϕ (β̂ − β ∗ )M (S) = wg β̂ − β ∗
P
, and by Cauchy-Schwarz inequality, we have
Gg 2
g∈S
s
  sX   2
G
X
wg β̂ − β ∗ ⩽ wg 2 · hmax
S
· max β̂ − β ∗
Gg 2 g∈S Gg 2
g∈S g∈S
r
2
sX  
G
⩽ wg 2 · hmax
S
· β̂ − β ∗
2
g∈S
sX q  
G
= wg 2 · hmax
S
β̂ − β ∗ .
2
g∈S

2
q  
G
rP
3λn
On the other hand, since κ β̂ − β ∗ − 2 wg 2 · hmax
S
β̂ − β ∗ ⩽ 0, we have
2 2
g∈S

2 9λ2n X 2 GS
β̂ − β ∗ ⩽ wg · hmax
2 4κ2
g∈S

64c2 σ 2 wg 2 · hmax (S) 


P

9 g∈S dmax log 5 log m
⩽ 2· · + +δ
min wg2 hgmin

4κ n n
g∈[m]
G
wg 2 · hmax
P S
2 2
 
144c σ g∈S dmax log 5 log m
⩽ · · + +δ
κ 2 min wg2 hgmin n n
g∈[m]

D.6 Lemmas for the proof of Theorem 6

In these lemmas, we abbreviate β̂ G by β̂.


Lemma 9. Under Assumption 7 and (2), taking
$$\lambda_n = \frac{8c\sigma}{\sqrt{\min_{g\in[m]}\big(w_g^2 h^g_{\min}\big)}}\sqrt{\frac{d_{\max}\log 5}{n} + \frac{\log m}{n} + \delta} \quad \text{for some } \delta\in[0,1],$$
then $P\Big(\lambda_n \ge 2\phi^*\big(\tfrac{X^\top\varepsilon}{n}\big)\Big) \ge 1 - e^{-2n\delta}$.

Proof [Proof of Lemma 9]


 
Xig1 Xig2 Xigd
Let Vi·g = −εi hg wg , hg wg , . . . , hg wg ∈ Rdg . According to the variational form of ℓ2 norm, we have
g

1 2 d
1
Pn 1
Pn g dg −1
n ∥ V ∥
i=1 i·g 2 = sup u, n i=1 Vi·g , where S is the Euclidean sphere in Rdg . Also, for any vector
u∈S dg−1
dg−1
u∈S and t ∈ R, we have
dg n n
 dg
!
  P n

  tP uj
P
Vi·gj
 t
P P
uj Vi·gj
1 t u, Vi·g 1 1
log E e i=1 = log E e j=1 i=1 = log E e i=1 j=1
n n n
! !
n dg uj Xig εi n dg uj Xig
 −t P P j  −t P P j
εi
 
1 i=1 g=1
hg wg
j 1 i=1 j=1
hg wg
j
= log E e = log E e .
n n

n
Since {ϵi }i=1 are i.i.d zero mean sub-Gaussian random variables with parameter σ, let u = (u1 , · · · , udg )T ∈ Rdg ×1 ,
Xi,g = (Xig1 , · · · Xigdg )T ∈ Rdg ×1 , then we have
! ! !
n dg uj xig dg u X dg u X
 −t P P j  −tε P j 1gj  −tε P j ngj
εi
  
1 hg w g
j 1 1 hg wg
j 1 n hg wg
j
log E e i=1 j=1
= log E e j=1
+ · · · + log E e j=1

n n n
n d g n X dg
t2 σ 2 X X uj Xigj 2 t2 σ 2
  X 2 
  1
⩽ ⩽ uj Xig
2n i=1 j=1 wg hgj 2n wg2 (hgmin )2 i=1 j=1 j

n n
t2 σ 2 t2 σ 2
X  X 
1 2 1 T T
= ⟨u, Xi,g ⟩ = (u Xi,g Xi,g u)
2n wg2 (hgmin )2 i=1 2n wg2 (hgmin )2 i=1
n
t2 σ 2
  X  
1 T 1 T
= u Xi,g X i,g u
2 wg2 (hgmin )2 n i=1
T
t2 σ 2 XG XGg
 
1 T g
= u u
2 wg2 (hgmin )2 n
T
t2 σ 2 XG XGg
 
1 g
⩽ γmax ( )
2 wg2 (hgmin )2 n

T
XG XGg
By Assumption

7, we have γmax ( gn ) ⩽ c2 . Combining this with the previous proof, we have
n
P  n 
 t u, Vi·g 
1 c2 t2 σ 2
P
n log E e i=1 ⩽ 2wg2 (hg
. Therefore, the random variable u, Vi·g is the sub-Gaussian with the
min ) i=1
r
2 2
parameter at most w2c hσg , and by properties of sub-Gaussian variables, we have
g ( min )

D X n E λ 
n λ2n wg2 hgmin
log P u, Vi·g ⩾ ⩽− .
i=1
4 32C 2 σ 2

Pn
We can find
a
1
2covering
 of S dg −1 in Euclidean norm:{u1 , u2 , . . . , uN } with N ≤ 5dg , recall that n1 ∥ i=1 Vi·g ∥2 =
n
1
Vi·g , so that for any u ∈ S dg−1 , we can find a uq(u) ∈ u1 , . . . , uN , such that uq(u) − u 2 ⩽ 12 ,
P 
n sup u,
u∈S dg−1 i=1
and
n n n
1 D X E 1 D X E D X E
sup u, Vi·g = sup u − uq(u) , Vi·g + uq(u) , Vi·g
n u∈S dg−1 i=1
n u∈S dg−1 i=1 i=1
n
1 D X E 1 D E
⩽ sup u − uq(u) , Vi·g + max uq , Vi·g
n u∈S dg−1 i=1
n q∈[N ]

By applying the Cauchy-Schwarz inequality, we have


n n n
1 D X E u − uq(u) X 1 X
sup u − uq(u) , Vi·g ⩽ 2
Vi·g ⩽ Vi·g .
n u∈S dg−1 i=1
n i=1
2 2n i=1 2

n n D n E
1 1 1
uq ,
P P P
Hence, we obtain n Vi·g ⩽ 2n Vi·g + max
n q∈[N Vi·g , which indicates that
i=1 2 i=1 2 ] i=1
n n
1 X D
q 1
X E
Vi·g ⩽ 2 max u , Vi·g .
n i=1
2 q∈[N ] n i=1

Consequently, we can express the probability as
1 X n n
λn   D 1X E λ 
n
P Vi·g ⩾ ⩽ P max uq , Vi·g ⩾
n i=1 2 2 q∈[N ] n i=1 4
N n
X D 1X E λ 
n
⩽ P uq , Vi·g ⩾
q=1
n i=1
4
 nλ2 w2 hg   nλ2 w2 hg 
n g min n g min
⩽ N exp − 2 2
⩽ exp − 2 2
+ dg log 5 ,
32C σ 32C σ
q
8Cσ dmax log 5 log m
and by setting λn = r
n + n + δ, we get
min (wg2 hg
min )
g∈[m]

n m n
 1 X λn  X  1 X λn 
P max Vi·g ⩾ ⩽ P Vi·g ⩾
g∈[m] n 2 2 n i=1 2 2
i=1 g=1
 nλ2n 2 g

⩽ exp − min (w h ) + dmax log 5 + log m
32C 2 σ 2 g∈[m] g min
⩽ exp{−2nδ}.

From Proposition 1, we have


 X ⊤ε  n n
1  HX ⊤ ε  1 1X X
ig1 Xigdg  1X
ϕ∗ ⩽ max = max −εi ,··· , = max Vi·g .
n g∈[m] wg n Gg 2 g∈[m] wg n
i=1
hg1 hgdg 2 g∈[m] n i=1 2
 ⊤

Therefore, P λn ⩾ 2ϕ∗ ( Xn ε ) ⩾ 1 − e−2nδ .

Lemma 10. The group lasso regularizer (1) is decomposable with respect to the pair M (S) , M ⊥ (S) . That is,


ϕ(a + b) = ϕ(a) + ϕ(b), for all a ∈ M (S) and for all b ∈ M ⊥ (S).
Proof [Proof of Lemma 10]
m
X X X
ϕ (a + b) = wg (a + b)Gg = wg (a + b)Gg + wg (a + b)Gg
2 2 2
g=1 g∈M (S) g ∈M
/ (S)
X X X X
= wg aGg 2
+ wg bGg 2
= wg aGg 2
+ wg bGg 2
g∈M (S) g∈M ⊥ (S) g∈M (S) g∈M ⊥ (S)

= ϕ (a) + ϕ (b)

     
XT ε
Lemma 11. If λn ⩾ 2ϕ∗ n , then ϕ (β̂ − β ∗ )M ⊥ (S) ⩽ 3ϕ (β̂ − β ∗ )M (S) .

Proof [Proof of Lemma 11 (also see proposition 9.13 in Wainwright (2019)] From equation (33), we have
   
ϕ(β̂) − ϕ (β ∗ ) ⩾ ϕ (β̂ − β ∗ )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) ,
On the other hand, by the convexity of the cost function, we have
2 2
Y − Xβ ∗ 2  Y − Xβ ∗
   
1 2 1 ∗ 2 ∗ 2


Y − X β̂ 2 − Y − Xβ 2 ⩾ ▽ , β̂ − β ⩾− ▽ , β̂ − β .
n n n n
By applying Holder’s inequality with the regularizer ϕ and its dual norm ϕ∗ , we have
2 2 
∥Y − Xβ ∗ ∥2  ∥Y − Xβ ∗ ∥2
   
▽ , β̂ − β ∗ ⩽ ϕ∗ ▽ ϕ β̂ − β ∗ .
n n

Therefore,
2 2 
∥Y − Xβ ∗ ∥2  ∥Y − Xβ ∗ ∥2
  
1 2 1 2

Y − X β̂ 2
− Y − Xβ ∗ 2
⩾− ▽ , β̂ − β ∗ ⩾ −ϕ∗ ▽ ϕ β̂ − β ∗
n n n n
λn   λ n
 
⩾ − ϕ β̂ − β ∗ ⩾ − ϕ(β̂ − β ∗ )M (S) + ϕ(β̂ − β ∗ )M ⊥ (S) ,
2 2
and
1 2 1 2
 
0= Y − X β̂ 2 − Y − Xβ ∗ 2 + λn ϕ(β̂) − ϕ(β ∗ )
n n
      λ  
n

≥ λn ϕ (β̂ − β )M ⊥ (S) − ϕ (β̂ − β ∗ )M (S) − 2ϕ(βM ∗
⊥ (S) ) − ϕ(β̂ − β ∗ )M (S) + ϕ(β̂ − β ∗ )M ⊥ (S)
2
λn   ∗
 


= ϕ (β̂ − β )M ⊥ (S) − 3ϕ β̂ − β )M (S) ,
2
from which the claim follows.

D.7 Proof of Theorem 7

Proof [Proof of Theorem 7 Part 1]


By Lemma 12, we have

 XGT g XGg r
||| − ΘGg ,Gg |||2

dg dg 2
P n
⩽ c5 ( + ) + δ > 1 − c4 e−c2 nδ
|||Θ|Gg ,Gg ||2 n n
T
By triangle inequality, since XG g
XGg is a positive semi-definite, we have
T T T
XG g
XGg XG g
XGg XG g
XGg
γmax ( ) = ||| |||2 = ||| − ΘGg ,Gg |||2 + |||ΘGg ,Gg |||2
n nr n
dg dg
⩽ (1 + c5 ( + ) + δ)|||ΘGg ,Gg |||2 ,
n n
2
with probability at least 1 − c4 e−c2 nδ . Because |||ΘGg ,Gg |||2 ≤ |||Θ|||2 ⩽ c1 for some constant c1 and dg ⩽ n, we
T
XG XGg 2
have γmax ( gn ) ⩽ c + δ for some constant c, with probability at least 1 − e−c2 nδ . Taking the union probability
for all m groups, we have
T
XG g
XGg
max γmax ( )≤c+δ
g∈[m] n
with probability at least 1 − exp(−c′ 2nδ 2 ) for some constant c′ > 0 as long as
log m ≪ nδ 2 .
For simplicity, we take δ as a constant.

Proof [Proof of Theorem 7 Part 2] First note that we must have ρ(Θ) ≤ γmax (Θ) ≤ c1 by Assumptions 1,2, and 3. By
applying Minkowski inequality, we have
v
m um
X √ uX 2 √ r 2
ϕ(β) = wg βGg 2 ⩽ mt wg2 βGg 2 ⩽ m max wg2 hgmax ∥β∥2
g∈[m]
g=1 g=1
∥Xβ∥2 2
Let β = β ∗ − β̄, we now want to prove that ϕ βM ⊥ (S̄) ⩽ 3ϕ βM (S̄) implies n 2 ⩾ γ64
  min
∥β∥2 .
 
Since ϕ βM ⊥ (S̄) ⩽ 3ϕ βM (S̄) , combining with triangle inequality, we have
p r
ϕ(β) = ϕ βM (S̄) + ϕ βM ⊥ (S̄) ⩽ 4ϕ βM (S̄) ⩽ 4 sg max wg2 hgmax βM (S̄) 2
  
g∈S̄
p r
⩽ 4 sg max wg2 hgmax ∥β∥2
g∈S̄

From Lemma 13, we have
r
∥Xβ∥2 1 1 1
2(log m + dmax log 5)
√ ⩾ Θ2 β − 8ρ(Θ) max p g ϕ(β)
n 4 2 g∈[m] wg hmin n
r
1 1 2(log m + dmax log 5) p r
⩾ √ ∥β∥2 − 32ρ(Θ) max p g sg max wg2 hgmax ∥β∥2
4 c1 g∈[m] wg hmin n g∈S̄

1
⩾ √ ∥β∥2 ,
64 c1
where the last step is valid due to Assumption 2 and 3.

D.8 Lemmas for the proof of Theorem 7

Lemma 12. (Theorem 6.5 in (Wainwright, 2019))


Let |||.|||2 be the spectral norm of a matrix. There are universal constants c2 , c3 , c4 , c5 such that, for any matrix
A ∈ Rn×p , if all rows are drawn i.i.d from N (0, Θ), then the sample covariance matrix Θ̂ satisfies the bound
  t2 θ 2 n
E et|||Θ̂−Θ|||2 ⩽ ec3 n +4p for all |t| < ,
64e2 |||Θ|||2
and hence for all δ ∈ [0, 1]

r !
|||Θ̂ − Θ|||2 p p 2
P ⩽ c5 ( + ) + δ > 1 − c4 e−c2 nδ (34)
|||Θ|||2 n n
Lemma 13. Under Assumptions 1,2, and 3, and use ρ(Θ) to denote the maximum diagonal of a covariance matrix Θ.
For any vector β ∈ Rp and a given group structure with m groups, we have
!r
∥Xβ∥2 1 1 1 2(log m + dmax log 5)
√ ≥ Θ2 β − 8ρ(Θ) max p g ϕ(β), (35)
n 4 2 g∈[m] wg hmin n
n
e− 32
with probability at least 1 − n .
1−e− 64
p
o with, for a vector β ∈ R with a fixed group structure, we define the set
Proof [Proofnof Lemma 13] To begin
1
S p−1 (Θ) = β ∈ Rp Θ 2 β = 1 , the function
2
r
1 2(log m + dmax log 5)
g(t) = 4ρ(Θ) max p g ·t
g∈[m] wg hmin n
and the event  
p−1
 n×p ∥Xβ∥2 1
E S (Θ) = X ∈ R inf √ + 2g(ϕ(β)) ⩽ .
β∈S p−1 (Θ) n 4
where ϕ(.) is the overlapping group lasso regularizer. In addition, given 0 ⩽ rℓ ⩽ ru , we define the set
K (rℓ , ru ) = β ∈ S p−1 (Θ) g (ϕ(β)) ∈ [rℓ , ru ] ,


and the event:  


∥Xβ∥2 1
A (rℓ , ru ) = X ∈ Rn×p inf √ ⩽ − ru .
β∈K(rℓ ,ru ) n 2

Based on lemma 13.1 and lemma 13.2, we have



(∞ )
n 2ℓ 2
−n
X X
ℓ−1 ℓ − 32 82 υ

P (X ∈ E) ⩽ P (A(0, υ)) + P A(2 υ, 2 υ) ⩽ e e .
ℓ=1 t=0

∞ ∞ n
n n 2ℓ
υ2 n ℓ 2
e− 32
1
and 22ℓ ⩾ 2ℓ, we have P (X ∈ E) ⩽ e− 32 e− 8 2 ⩽ e− 32 e−n 4 υ ⩽
P P
Since υ = n .
4
ℓ=0 ℓ=0 1−e− 64
We just get upper bound of P (X ∈ E). We next show that the bound in (35) always hold on the complementary set E c .
|Xβ∥2
If X ∈/ E, based on the definition of E, we have inf √
n
⩾ 14 − 2g (ϕ(β)) . That is ∀β ∈ S p−1 (Θ).
β∈S p−1 (Θ)
∥Xβ∥2 1 ′ ′ β′

n
⩾ 4 − 2g (ϕ(β)). Therefore, for any β ∈ {β ∈ R| 1 ∈ S p−1 (Θ)}, we have
Θ 2 β′
2


β
X 1
β′
  
Θ 2 β′ 2 1
√ 2
⩾ − 2g ϕ 1
n 4 Θ 2 β′ 2

Xβ ′ 1 1
 
√ 2
⩾ Θ 2 β ′ 2 − 2g ϕ(β ′ ) ,
n 4

We finish the proof by substituting the definition of g(ϕ(β)).

S∞
Lemma 13.1 For υ = 14 , we have E ⊆ A(0, υ) ∪ ℓ−1

ℓ=1 A 2 υ, 2ℓ υ .
n n 2
Lemma 13.2 For any pair (rℓ , ru ), where 0 ⩽ rℓ ⩽ ru , we have P (A (rℓ , ru )) ⩽ e− 32 e− 8 ru .
Proof
S∞ ℓ−1

Proof [Proof of Lemma 13.1] By definition, K(0, υ) ∪ ℓ=1 K 2 υ, 2ℓ υ is a cover of S p−1 (Θ). Therefore, for
any β, it either belongs to K(0, υ) or K 2ℓ−1 υ, 2ℓ υ .
Case 1 If β ∈ K(0, υ), by definition, we have g (ϕ(β)) ∈ [0, υ] and
∥Xβ∥2 1 1 1
√ ⩽ − 2g (ϕ(β)) ⩽ = − υ.
n 4 4 2
Therefore, the event A(0, υ) must happen in this case. 
/ K(0, υ), we must have β ∈ K 2ℓ−1 υ, 2ℓ υ for some ℓ = 1, 2, · · · , and moreover
Case 2: If β ∈
∥Xβ∥2 1 1  1 1
⩽ − 2g (ϕ(β)) ⩽ − 2 · 2ℓ−1 υ ⩽ − 2 · 2ℓ−1 υ ⩽ − 2ℓ υ.


n 4 4 2 2
∞ 
 
So that the event A 2ℓ−1 υ, 2ℓ υ must happen. Therefore, E ⊆ A(0, υ) ∪ A 2ℓ−1 υ, 2ℓ υ .
S
ℓ=1

Proof [Proof of Lemma 13.2] To prove Lemma 13.2, we define and bound the random variable T (rℓ , ru ) =
∥Xβ∥2
− inf √
n
. Let S n−1 be a unit ball on Rn , by the variational representation of the ℓ2 -norm, we have
β∈K(rℓ ,ru )

∥Xβ∥2 ⟨u, Xβ⟩ ⟨u, Xβ⟩


T (rℓ , ru ) = − inf √ =− inf sup √ = sup infn−1 √ .
β∈K(rℓ ,ru ) n β∈K(rℓ ,ru ) u∈S n−1 n β∈K(rℓ ,ru ) u∈S n

1 1
Let X = W Θ 2 , where W ∈ Rn×p is a standard Gaussian matrix, and define the transformed vector v = Θ 2 β, then
⟨u, Xβ⟩ ⟨u, W v⟩
T (rℓ , ru ) = sup √ = sup infn−1 infn−1 √ ,
n
β∈K(rℓ ,ru ) u∈S v∈K̄(rℓ ,ru ) u∈S n
n  1
 o
where K̄ (rℓ , ru ) = v ∈ Rp ∥v∥2 = 1, g ϕ(Θ− 2 v) ∈ [rℓ , ru ] .

Define Zu,v = ⟨u,W



n
v⟩
, since (u, v) range over a subset of S n−1 × S p−1 , each variable Zu,v is zero-mean Gaussian
−1
with variance n . We compare the Gaussian process Zu,v to the zero-mean Gaussian process Yu,v which defined as:
⟨ζ, u⟩ ⟨ξ, v⟩
Yu,v = √ + √ where ζ ∈ Rn , ξ ∈ Rp , have i.i.d N (0, 1) entries.
n n

Next, we show that the Yu,v and Zu,v defined above satisfy conditions in Gordon’s inequality. By definition, we have
2 n p
⟨u, W v⟩ ⟨u′ , W v ′ ⟩

2 1 XX 2
E (Zu,v − Zu′ ,v′ ) = E √ − √ = ui vj − u′i vj′
n n n i=1 j=1
n p
1 XX 2 (36)
= ui vj − u′i vj + u′i vj − u′i vj′
n i=1 j=1
1 2 2 2 2

2

2

= ∥v∥2 ∥u − u′ ∥2 + ∥u′ ∥2 ∥v − v ′ ∥2 + 2 ∥v∥2 − ⟨v, v ′ ⟩ ⟨u, u′ ⟩ − ∥u∥2 ,
n
 
2 2 2 2
On one hand, since ∥v∥2 ⩽ 1, ∥u′ ∥2 ⩽ 1, (7) ⩽ n1 ∥u − u′ ∥2 + ∥v − v ′ ∥2 .
On the other hand, we have
2
⟨ζ, u − u′ ⟩ ⟨ξ, v − v ′ ⟩

2
E (Yu,v − Yu′ ,v′ ) = E √ + √
n n
(37)
 
n p n X p
1 X X X 1 2 2

=  (u − u′ )2 + (v − v ′ )2  = ∥u − u′ ∥2 + ∥v − v ′ ∥2 .
n i=1 j=1 i=1 j=1
n

Taking equation (36) and (37) together, we have


2 1 2 2

2
E (Zu,v − Zu′ ,v′ ) ⩽ ∥u − u′ ∥2 + ∥v − v ′ ∥2 = E (Yu,v − Yu′ ,v′ ) .
n
   
2 2
If V = V ′ , then nE (Zu,v − Zu′ ,v′ ) = ∥u − u′ ∥2 = nE (Yu,v − Yu′ ,v′ ) .
By applying Gordon’s inequality, we have
! !
E sup inf Zu,v ⩽E sup inf Yu,v .
n−1 n−1
v∈K̃(rℓ ,ru ) u∈S v∈K̃(rℓ ,ru ) u∈S

Therefore,
!  !
⟨u, W v⟩ ⟨ξ, v⟩ ⟨ζ, u⟩
E (T (rℓ , ru )) = E sup infn−1 √ ⩽E sup infn−1 √ + √
v∈K̃(rℓ ,ru ) u∈S n v∈K̃(rℓ ,ru ) u∈S n n
 D 1 E
Σ 2 ξ, β  
= E  sup √  − E ∥ζ∥√ 2
β∈K(rℓ ,ru ) n n

  q   
∥ζ∥2 ξ12 +...+ξn
2 |ξ1 |+...+|ξn |
Next, we bound these two terms. For the second term, we have E √
n
=E n ⩾E n =
D 1 E! !
1
Θ 2 ξ,β
q
ϕ(β)ϕ∗ (Θ 2 ξ) 1
2
π . For the first term, we have E sup √
n
⩽E sup √
n
, where ϕ∗ (Θ 2 ξ) is the the
β∈K(rℓ ,ru ) β∈K(rℓ ,ru )
dual norm defined before. Since β ∈ K (rℓ , ru ), g (ϕ(β)) ⩽ ru , by the definition of g(t), we have
ru
ϕ(β) ⩽  q . (38)
1 2(log m+dmax log 5)
4ρ(Θ) max √ g n
g∈[m] wg hmin

   
1 1 1
Let ηGg = (Θ 2 ξ)Gg , to bound E max (Θ 2 ξ)Gg = E max ηGg 2
. Since Θ 2 ξ ∼ N (0, Θ), by the
g 2 g
1
properties of normal distribution, its corresponding marginal distribution of jth variable (Θ 2 ξ)j also follows zero mean
1
normal distribution with covariance matrix Θjj , which is the jth diagonal elements of Θ. Therefore, any subset of Θ 2 ξ

is a zero-mean sub-Gaussian random sequence with parameters at most ρ(Θ). By equation (38) and Lemma 13.2.3, we
have
1 1 
ϕ(β)ϕ∗ Θ 2 ξ ϕ∗ Θ 2 ξ
  
ru
E sup √ ⩽E sup q 2(log m+dmax log 5)  √n
n

β∈K(rℓ ,ru ) β∈K(rℓ ,ru ) 4ρ(Θ) max 1
g
g∈[m] wg hmin n
 
 ϕ∗ Θ 12 ξ 
ru
= q 2(log m+dmax log 5)  E √
n
4ρ(Θ) max wg h1g n
g∈[m] min
 
ru 1  1 
⩽ max √
q 2(log m+dmax log 5)  E g∈[m] H Θ ξ 2
nwg Gg 2
4ρ(Θ) max wg h1g n
g∈[m] min
 
ru 1  1 
⩽ max √
q 2(log m+dmax log 5)  E g∈[m] Θ ξ2
nwg hgmin Gg 2
4ρ(Θ) max wg h1g n
g∈[m] min
 
ru  1 
⩽ q E max Θ ξ 2
g∈[m] Gg 2
4ρ(Θ) 2(log m+dnmax log 5)
ru  p  r
u
⩽ q  2ρ(Θ) (log m + dmax log 5) 2σ 2 ⩽
2(log m+dmax log 5) 2
4ρ(Θ) n

q
Therefore, E [T (rℓ , ru )] ⩽ − π2 + r2u . Next we want to bound P T (rℓ , ru ) ⩾ − 21 + ru based on the bound of


this expectation. To apply Lemma 13.2.4, we first show that, the f = T (rl , ru ), a function of the random variable W
is a √1n -Lipschitz function and without making confusion, we denote the corresponding function as T (W ). For any
standard Gaussian matrix W1 and W2 , we have

⟨u, W1 v⟩ ⟨u, W2 v⟩
|T (W1 ) − T (W2 )| = sup √
infn−1 − sup infn−1 √
v∈K̃(rℓ ,ru ) u∈S n v∈K̃(rℓ ,ru ) u∈S n
   
∥W1 v∥ ∥W2 v∥
= sup − √ 2 − sup − √ 2
v∈K̃(rℓ ,ru ) n v∈K̃(rℓ ,ru ) n
! !
∥W1 v∥2 ∥W2 v∥2
= − inf √ − − inf √
v∈K̃(rℓ ,ru ) n v∈K̃(rℓ ,ru ) n

∥W2 v∥2 ∥W1 v∥2


= inf √ − inf √ .
v∈K̃(rℓ ,ru ) n v∈K̃(rℓ ,ru ) n

∥W1 v1 ∥2 ∥W1 v∥2 ∥W2 v2 ∥2 ∥W2 v∥2


Suppose that √
n
= inf √
n
and √
n
= inf √
n
.
v∈K̃(rℓ ,ru ) v∈K̃(rℓ ,ru )

Case 1 If ∥W1 v1 ∥2 > ∥W2 v2 ∥2 , then we have

∥W2 v∥2 ∥W1 v∥2


|T (W1 ) − T (W2 )| = inf √ − inf √
v∈K̃(rℓ ,ru ) n v∈K̃(rℓ ,ru ) n
∥W1 v1 ∥2 − ∥W2 v2 ∥2 ∥W1 v2 ∥2 − ∥W2 v2 ∥2
= √ ⩽ √
n n
∥(W1 − W2 )v2 ∥2 ∥W1 − W2 ∥F
⩽ √ ⩽ √
n n
.

Case 2 If ∥W1 v1 ∥2 ⩽ ∥W2 v2 ∥2 , then we have

∥W2 v∥2 ∥W1 v∥2


|T (W1 ) − T (W2 )| = inf √ − inf √
v∈K̃(rℓ ,ru ) n v∈K̃(rℓ ,ru ) n
∥W2 v2 ∥2 − ∥W1 v1 ∥2 ∥W2 v1 ∥2 − ∥W1 v1 ∥2
= √ ⩽ √
n n
∥(W1 − W2 )v1 ∥2 ∥W1 − W2 ∥F
⩽ √ ⩽ √
n n
.
where ∥.∥F represent the Frobenious norm of a matrix. Thus under the Euclidean norm, T (W ) is a √1 -Lipschitz
n
function. Therefore, by lemma 13.2.3, we have
2
P(T (rl , ru ) − E(T (rl , ru )) ⩾ t) ⩽ e−nt /2
, ∀t ⩾ 0
q
n n 2
Set t = π2 − 12 + r2u ⩾ 14 + r2u , we have, E(T (rl , ru )) + t ⩽ − 12 + ru and P T (rℓ , ru ) ⩾ − 12 + ru ⩽ e− 32 e− 8 ru ,
 

which is actually the Lemma 13.2

Lemma 13.2.1 (Gordon’s Inequality) Let {Zu,v }u∈U,v∈V and {Yu,v }u∈U,v∈V be zero-mean Gaussian process indexed
by a non-empty index set I = U × V . If
   
2 2
1. E (Zu,v − Zu′ v′ ) ≤ E (Yu,v − Yu′ ,v′ ) for all pairs (u, v) and (u′ v ′ ) ∈ I
   
2 2
2. E (Zu,v − Zu′ v ) = E (Yu,v − Yu′ ,v ) ,
then we have E(max min Zu,v ) ≤ E(max min Yu,v ).
v∈V u∈U v∈V u∈U
Lemma 13.2.2 Suppose that α = (α1 , ..., αd ), where each αi , i ∈ [d] is a zero-mean sub-Gaussian
 random variable
with parameter at most σ 2 , then for any t ∈ R, we have E (exp (t ∥α∥2 )) ⩽ 5d exp 2t2 σ 2 .
Lemma 13.2.3 Suppose that α = (α1 , ..., αd ), where each αi , i ∈ [d] is a zero-mean sub-Gaussian random variable
with parameter at most σ 2 , and for a given group structure G, let αGg be the corresponding group norm, m be the
number of groups and dmax be the maximum group size, then
 
p
E max αGg ⩽ 2 2σ 2 (log m + dmax log 5)
g

Lemma 13.2.4 (Theorem 2.26 in (Wainwright, 2019)): Let x = (x1 , · · · , xn ) be a vector of i.i.d standard Gaussian
variable, and f : Rn → R be a L-Lipschitz, with respect to the Euclidean norm, then f (x) − Ef (x) is sub-Gaussian
t2
with parameter at most L, and hence P ((f (x) − E [f (x))) ⩾ t] ⩽ e− 2L2 , ∀t ⩾ 0.

D.8.1 P ROOF OF L EMMA 13.2.2

We can find a 12 - cover of S d−1 , and for any u ∈ S d−1 in the Euclidean norm with cardinally at most N ⩽ 5d , say
there exists uq(u) ∈ u1 , . . . , uN , such that uq(u) − u 2 ⩽ 12 .


1
By the variational representation of the ℓ2 norm, we have ∥α∥2 = max ⟨u, α⟩ ⩽ max uq(u) , α + 2 ∥α∥2 .
u∈S d−1 q(u)∈[N ]
Therefore, ∥α∥2 ⩽ 2 max uq(u) , α . Consequently,
q(u)∈[N ]
    
E (exp (t ∥α∥2 )) ⩽ E exp 2t max ⟨uq , α⟩ = E max exp (2t ⟨uq , α⟩)
q∈[N ] q∈[N ]
N
4t2 σ 2
X  
E (exp (2t ⟨uq , α⟩)) ⩽ 5d exp ⩽ 5d exp 2t2 σ 2 .


q=1
2

D.8.2 P ROOF OF L EMMA 13.2.3
     
For any t > 0, by Jensen’s inequality, we have exp tE max αGg ⩽ E exp t max αGg 2
g g
  m m
  
5dg exp 2t2 σ 2 ⩽ m · 5dmax · exp(2t2 σ 2 ).
P P
= E max exp t αGg 2
⩽ E exp t αGg 2

j j=1 j=1
   
2 2
By taking log at both sides, we have tE max αGg ⩽ log m + dmax log 5 + 2t σ . That is E max αGg ⩽
g g
2 2
log m+dmax log 5+2t σ
t .
q  
log m+dmax log 5
p
Let t = 2σ 2 , we have E max αGg ⩽ 2 (log m + dmax log 5) 2σ 2 .
g

D.9 Proof of Theorem 4

The two lemmas below are integral to the proof:


Lemma 14 (Packing Number for Binary Sets). Consider a set A defined for real numbers m, sg as
 
 Xm 
A = a ∈ {0, 1}m | aj ≤ sg .
 
j=1
!
m
− 2
q
sg sg
Then the 2 -packing number of set A ⩾ , and
m
!
sg
sg ·2 2

2
 
m
− 2 !
sg m
log   ≍ sg log( ).
m sg
 sg  · 2 2 sg
2
q
2dsg
Lemma 15 (Packing Number for Sparse Group Vectors). For the set Ω(G, sg ), the 5 -packing number ≳
!
m
−2
sg √
· ( 2)dsg , and
m
!
s g
sg ·2
2

2
 
m
−2 √ ds
!
sg m
log   · ( 2) g
≍ sg (d + log( )).
m sg
 sg  · 2 2 sg
2

Proof [Proof of Theorem 4]


q
2dsg
First, select N points ω (1) , . . . , ω (N ) from Ω(G, sg ) such that ω (i) − ω (j) > 5 for all distinct i, j. Clearly,
ω (i) − ω (j) ⩽ 4sg d.
p

Define β (i) = rω (i) for each i. This results in


2ksg r2 2
≤ β (i) − β (j) ⩽ 4sg dr2 .
5 2

Next, let y (i) = Xβ (i) + ε for 1 ⩽ i ⩽ N . Consider the Kullback-Leibler divergence between different distribution
pairs: "  !#

(i) (j)
 p y (i) , X
DKL (y , X), (y , X) = E(y(j) ,X) log  .
p y (j) , X
 
where p y (i) , X is the probability density of y (i) , X . Conditioning on X, we have
" ! #
p y (i) , X ∥X(β (i) − β (j) )∥22
E(y(j) ,X) log  |X = .
p y (j) , X 2σ 2

Thus, for 1 ≤ i ̸= j ≤ N,
 2

(i)
 
(j)
 X β (i) − β (j) 2 n(β (i) − β (j) )⊤ Σ(β (i) − β (j) )
DKL y ,X , y ,X = EX =
2σ 2 2σ 2
2
3c1 β (i) − β (j) 2 2c1 ndr2 sg
≤ ≤ .
2σ 2 σ2

  ndr 2 sg
m +log 2
From Lemma 15, log N ≍ sg d + log sg . Setting σ2log N = 21 , we obtain
v 
u
u d + log m σ 2
t sg
r≳ .
3nd
!
q ndr 2 sg
2r 2 ksg σ2
+log 2
By generalized Fano’s Lemma, inf sup E∥β̂ − β∥2 ⩾ 5 1− log N . Consequently,
β̂ β
 
 2 σ 2 sg (d + log( smg ))
inf sup E∥β̂ − β∥22 ≥ inf sup E∥β̂ − β∥2 ≳ .
n

Proof [Proof of Lemma 14]


 
m
Notice that the cardinality of A is . Denote the hamming distance between any two points x, y ∈ A by
sg
h(a, b) = | {j : aj ̸= bj } |.
Then, for a fixed point a ∈ A,  
n sg o m sg
| b ∈ A, h(a, b) ≤ = sg · 2⌊ 2 ⌋ |.
2 ⌊2⌋
g s
In  all elements b ∈ A with h(a, b) ≤ 2 can be obtained as follows. First, take any subset J ⊂ [m] of cardinality
 sgfact,
2 , then set aj = bj for j ∈
/ J and choose bj ∈ {0, 1} for j ∈ J.
!
m
−2
sg
Now let As be any subset of A with cardinality at most T = , then we have
m
!
sg
  sg ·2 2

2
 
sg o m
sg
| {b ∈ A | there exist a ∈ As with h(a, b) ≤ ≤ (|As |) · sg  ·2 2 | < |A|.
2 2

s
It implies that one can find an element b ∈ A with h(a, b) > 2g for all a ∈ As . Therefore one can construct a subset
s
As with |As | ≥ T and the property h(a, b) > 2g for any two distinct elements a, b ∈ As .
q
s sg
On the other hand, h(a, b) > 2g implies ∥a − b∥ > 2 . Therefore, there exist at least T points in A such that the
q
sg
distance between any two points is greater than 2 .
!
m
sg ⌊ s2g ⌋!(m−⌊ s2g ⌋)! (m−sg +1)···m−⌊
sg
2 ⌋ Q⌈ s2g ⌉ m−sg +j
Moreover, since = = sg = , we have
m
sg !(m−sg )! (⌊ ⌋+1)···sg j=1 ⌊ s2g ⌋+j
!
2
sg
2
 
!⌊ s2g ⌋ m
s  ⌈ s2g ⌉
m− ⌊ 2g sg

m − sg + 1
⩽  ⩽ ,
2sg m
sg ⌈sg ⌉
sg  2 2

2
and therefore we can find C1 , C2 , such that C1 sg log( smg ) ⩽ log T ⩽ C2 sg log( smg ), so that
 
m
− 2 !
sg m
log   ≍ sg log( ).
m
s g sg
sg  ·2 2
2

43
Proof [Proof of Lemma 15]
 !c 
S
Given a group support a ∈ A, define ka = i|i∈ Gg , and the set
{g|ag =0}
 [  [ c 
Ω(a) = ω ∈ Rp | ωi = 0 if i ∈ Gg , ωi ∈ {−1, 1} if i ∈ Gg .
{g|ag =0} {g|ag =0}

Notice that Ω(a) ⊆ Ω(G, sg ), and |Ω(a) | = 2ka . Also denote the hamming distance between x, y ∈ Ω(a) by
h(x, y) = | {j : xj ̸= yj } |.
(a)
Then for any fixed x ∈ ΩG , we have
⌊ k10
a
ka X⌋  k 
(a) a
{y ∈ Ω , h(x, y) ≤ } =
10 j
j=0

(a) 2ka −2
Let Ωs be any subset of Ω(a) with cardinality at most N (a) = k !. Then,
⌊ a⌋
10
P ka
j=0 j
ka
{y ∈ Ω(a) | ∃x ∈ Ω(a)
s with h(x, y) ≤ } < |Ω(a) |.
10
q
ka 2ka
On the other hand, h(x, y) > 10 implies ∥x − y∥ ≥ 5 . Thus, there are at least N (a) points in Ω(a) with pairwise
q
distances greater than 2k5a .
From Chapter 9 in Graham et al. (1994),
X  k  9  ka  9 ka 9 ka
a
< ka ≤ (10e) 10 ≤ 2 2 .
j 8 ⌊ 10 ⌋ 8 8
ka
j≤⌊ 10 ⌋
ka √
Consequently, we have N (a) > 89 2 2 ≳ ( 2)ka .
The value of ka depends on the predefined groups and group support a and spans a range from 0 to sg d. Lemma 15
seeks a lower bound for all conceivable overlapping patterns, necessitating an analysis of the maximum value of ka .

q according to Lemma 14, we can identify at least T points in A where the distance between any two√points
Furthermore,
sg 8 sg d
exceeds 2 . For {a1 , · · · , aT } group supports, if there is a group structure such that we could find at least 9 ( 2)
q
2sg d
on each group support, and the distance between every pair of these points is greater than 5 , then Lemma 15 is
proved.
Considering m non-overlapping groups, qka = sgq d for each group support a. In addition, given any two group support
q
sg dsg 2dsg (a)
a, b with ∥a − b∥ > 2 , ∥x − y∥ > 2 > 5 for any x ∈ Ω and y ∈ Ω(b) . Thus, considering all possible
!
m
−2
sg √
overlapping patterns, we can find at least · 89 ( 2)dsg point in Ω(G, sg ), such that the distance between
m
!
sg
  sg2 ·2
q 2
2dsg
every pair of points is greater than 5 .

44
D.10 Proof of Theorem 5

This proof consists of parts: Parts I-IV dedicated to Theorem 5.1, and Part V is for Theorem 5.2. To be more specific,
Part I provides some additional concepts, Part II introduces the reduced problem, Part III shows the successful selection
of the correct pattern under favorable conditions, and Part IV establishes that certain conditions are satisfied with high
probability.

D.10.1 PART I

Recall that S = supp(β ∗ ). With S, we define the norm ϕS for any β ∈ Rp as


X
ϕS (βS ) = wg ∥βS∩Gg ∥2 ,
g∈GS

along with its dual norm (ϕS )∗ [u] = supϕS (βS )≤1 βS⊤ u. Similarly, for Sc = [p] \ S, we define the norm ϕcS for any
β ∈ Rp as X
ϕcS (βSc ) = wg ∥βSc ∩Gg ∥2 ,
g∈[m]\GS

accompanied by its corresponding dual norm (ϕcS )∗ [u] = supϕcS (βSc )≤1 βS⊤c u.
We also introduce equivalence parameters aS , AS , aSc , ASc as follows:
∀β ∈ Rp , aS ∥βS ∥1 ⩽ ϕS (βS ) ⩽ AS ∥βS ∥1 , (39)
∀β ∈ Rp , aSc ∥βSc ∥1 ⩽ ϕcS (βSc ) ⩽ ASc ∥βSc ∥1 . (40)

We now study the equivalence parameters from two aspects. First, since
sup βS⊤ u ⩾ sup βS⊤ u ⩾ sup βS⊤ u,
aS ∥βS ∥1 ⩽1 ϕS (βS )⩽1 AS ∥βS ∥1 ⩽1

by the definition of dual norm, we have

∀u ∈ R|S| , A−1 ∗ −1
S ∥u∥∞ ⩽ (ϕS ) [u] ⩽ aS ∥u∥∞ . (41)
Similarly, by order-reversing,
c
∀u ∈ R|S | , A−1 c ∗ −1
Sc ∥u∥∞ ⩽ (ϕS ) [u] ⩽ aSc ∥u∥∞ . (42)

Second, by the Cauchy-Schwarz inequality, for any β ∈ Rp and g ∈ GS ,


w
p g ∥βS∩Gg ∥1 ⩽ wg ∥βS∩Gg ∥2 ⩽ max wg ∥βS∩Gg ∥1 .
dg g∈GS

Consequently, we have
wg
min p ∥βS ∥1 ⩽ ϕS (βS ) ⩽ hmax (GS ) max wg ∥βS ∥1 ,
g∈GS dg g∈GS
w
Therefore, we can set aS = min √ g and AS = hmax (GS ) max wg . With an trivial extension, we can set aSc =
g∈GS dg g∈GS
p
min wg / dg .
g∈GSc

D.10.2 PART II

From the full problem to the reduced problem


Recall that the group lasso estimator in (14) is defined as

1
β̂ G = arg min ∥Y − Xβ∥22 + λn ϕG (β). (43)
β∈Rp 2n

45
1
Now we write ϕG (β) = ϕ(β) and L(β) = 2n ∥Y − Xβ∥22 for ease of notation. Following Jenatton et al. (2011a);
Wainwright (2009), we consider the following restricted problem
X
β̂ R = arg min L(β) + λn ϕ(β) = arg min L(β) + λn wg βS∩Gg 2
β∈Rp ,βSc =0 β∈Rp ,βSc =0
g∈GS
(44)
:= arg min L(β) + λn ϕS (βS ).
β∈Rp ,βSc =0

1
Let LS (βS ) = 2n ∥Y − XS βS ∥22 . Due to the restriction of β̂ R , we can obtain β̂ R by first solving the following reduced
problem
1 X
β̂S = arg min ∥Y − XS βS ∥22 + λn wg βS∩Gg 2
βS ∈R|S| 2n
g∈GS (45)
= arg min LS (βS ) + λn ϕS (βS )
βS ∈R|S|

and then padding β̂S with zeros on Sc . In addition,

1
LS (β̂S ) = ∥Y − XS β̂S ∥22
2n
1  ⊤ 
= Y Y − 2Y ⊤ XS β̂S + (XS β̂S )⊤ XS β̂S
2n
1  ⊤ 
= Y Y − 2(Xβ ∗ + ϵ)⊤ XS β̂S + (XS β̂S )⊤ XS β̂S
2n
1  ⊤ 
= Y Y − 2(XS βS∗ )⊤ XS β̂S − 2ϵ⊤ XS β̂S + (XS β̂S )⊤ XS β̂S ,
2n
and consequently,

1 ⊤ 1 1
∇LS (β̂S ) = XS XS β̂S − XS⊤ XS βS∗ − ϵ⊤ XS
n n n (46)
:= QSS (β̂S − βS∗ ) − qS ,
n
1 ⊤ 1
P
where Q = n X X, q= n ϵi x i .
i=1

D.10.3 PART III

Part III mostly follows the proof in Theorem 7 of Jenatton et al. (2011a). Here we aim to show that supp(β̂ G ) = S
under certain conditions.
To begin with, Given β ∈ Rp , we define J G (β) as:
n [ o
J G (β) = [p] \ Gg .
Gg ∩supp(β)=∅

J (β) is called the adapted hull of the support of β in Jenatton et al. (2011a). For simplicity, we write J G (β) = J(β).
G

Notice that by assumption we have


n [ o
J(β ∗ ) = [p] \ Gg = S.
Gg ∩supp(β ∗ )=∅

Now we consider the reduced problem (45), and we want to show that for all g ∈ GS , β̂S∩Gg > 0. That is, no

active group is missing.
Lemma 16. (Lemma 14 of Jenatton et al. (2011a))
For the loss L(β) and norm ϕ in (43), β̂ ∈ Rp is a solution of
min L(β) + λn ϕ(β) (47)
β∈Rp

46
if and only if (
∇L(β̂)J(β̂) + λn r(β̂)J(β̂) = 0
h i (48)
(ϕcJ(β̂) )∗ ∇L(β̂)J(β̂)c ⩽ λn .

In addition, the solution β̂ satisfies

ϕ∗ [∇L(β̂)] ⩽ λn . (49)

As β̂S is the solution of (45), Equation (49) in Lemma 16 implies that


h i h   i
(46)
(ϕS )∗ ∇LS (β̂S ) = (ϕS )∗ QSS β̂S − βS − qS ⩽ λn . (50)

By the property of the equivalent parameters, we have


  (41) h   i (50)
A−1
S Q SS β̂ S − β S − qS ⩽ (ϕ S )∗
QSS β̂ S − β S − q S ⩽ λn . (51)

If ∗
γmin (QSS ) βmin
λn ⩽ 1 , (52)
3|S| 2 AS
and ∗
γmin (QSS ) βmin
∥qS ∥∞ ⩽ 1 , (53)
3|S| 2
then we have  
β̂S − βS∗ = Q−1 SS QSS β̂S − βS


 ∞ 
−1
⩽ QSS ∞,∞ QSS β̂S − βS∗

1
 
−1
⩽ |S| γmax QSS QSS β̂S − βS∗

2

1
  ∞ 
−1 (54)
⩽ |S| 2 γmin (QSS ) QSS β̂S − βS − qS + ∥qS ∥∞

(51) 1
−1
⩽ |S| 2 γmin (QSS ) (λn AS + ∥qS ∥∞ )
1 1
−1 −1
⩽ |S| 2 γmin (QSS ) λn AS + |S| 2 γmin (QSS ) ∥qS ∥∞
2 ∗
⩽ βmin .
3

βmin
If there exist a group g ∈ GS such that β̂S∩Gg < 3 , then


βmin 2β ∗
β̂S − βS∗ ∗
> βmin − = min .
∞ 3 3
Thus, Equation (54) implies that for all g ∈ GS ,

βmin
β̂S∩Gg > > 0. (55)
∞ 3

Secondly, we want to show that β̂ R solves problem (43). As β̂ R is obtained by padding β̂S with zeros on Sc ,
 [   [ 
J(β̂ R ) = [p] \ Gg = [p] \ Gg
Gg ∩supp(β̂ R )=∅ Gg ∩supp(β̂S )=∅
 
(54) [
= [p] \ Gg = S.
Gg ∩S=∅

47
From Lemma 16 we know that β̂ R is the optimal for problem (43) if and only if
∇L(β̂ R )S + λn r(β̂ R )S = 0, (56)
and h i
(ϕcS )∗ ∇L(β̂ R )Sc ⩽ λn . (57)

We now verify the condition in Equation (56). Since


1
L(β̂ R ) = ∥Y − X β̂ R ∥22
2n
1  ⊤ 
= Y Y − 2(Xβ ∗ )⊤ X β̂ R − 2ϵ⊤ X β̂ R + (X β̂ R )⊤ X β̂ R ,
2n
we have
h1   1 i
∇L(β̂ R )S = X ⊤ X β̂ R − β ∗ − ϵ⊤ X
hn  i n S 

R
= Q β̂ − β − qS = QSS β̂ R − β ∗ − qS
 S   S (58)
R ∗ ∗
= QSS β̂S − βS − qS = QSS β̂S − βS − qS

= ∇LS (β̂S ).

On the other hand, as β̂ R is obtained by padding β̂S with zeros on Sc , we have

λn r(β̂ R )S = λn rS (β̂S ).

Because β̂S is the optimal for problem (45), Equation (48) in Lemma 16 implies that
(46)
∇LS (β̂S ) + λn rS (β̂S ) = QSS (β̂S − βS∗ ) − qS + λn rS (β̂S ) = 0. (59)

Thus, Equation (56) holds as


(59)
∇L(β̂ R )S + λn rS (β̂ R ) = ∇LS (β̂S ) + λn rS (β̂S ) = 0. (60)

Now we continue to show Equation (57). Notice that


   
(58) (59)
β̂ R − β ∗ = β̂S − βS∗ = Q−1 SS (qS − λn rS (β̂S )). (61)
S

Let qSc |S = qSc − QSc S Q−1


SS qS , we have

 
(58)
∇L(β̂ R )Sc = Q(β̂ R − β ∗ ) − qSc = QSc S (β̂ R − β ∗ )S − qSc
Sc
 
(61)
= QSc S Q−1
SS qS − λn rS (β̂S ) − qSc
. (62)
= −QSc S Q−1 −1
SS λn rS (β̂S ) + QSc S QSS qS − qSc
 
= −λn QSc S Q−1
SS r S (β̂ S ) − rS (β ∗
S ) − λn QSc S Q−1 ∗
SS rS (βS ) − qSc |S .

The previous expression leads us to study the difference of rS (β̂S ) − rS (βS∗ ). We now introduce the following lemma.
Lemma 17. (Lemma 12 of Jenatton et al. (2011a))
For any J ⊂ [p], let uJ and vJ be two nonzero vectors in R|J| , and define the mapping rJ : R|J| 7→ R|J| such that
ωg
rJ (βJ )j = βj Σ .
g∈GJ ,Gg ∩j̸=ϕ βJ∩Gg
2
Then there exists ξJ = t0 uJ + (1 − t0 )vJ for some t0 ∈ (0, 1), such that
  
X X wg 1{j∈Gg } X X X |ξj ||ξk |wg4 1{j,k∈Gg }
∥rJ (uJ ) − rJ (vJ )∥1 ⩽ ∥uJ − vJ ∥∞  + 
3
 .
j∈J g∈G
ξ J∩G g 2 j∈J k∈J g∈G ξ J∩Gg
J J 2

48
Lemma 17 implies that
 
X X w 1
g {j∈Gg } X X X (wg ) 4
1 |
{j,k∈Gg } jβ̃ ||β̃ |
k 
rS (β̂S ) − rS (βS∗ ) ⩽ β̂S − βS∗ + , (63)

 3
1 ∞ β̃S∩Gg 3 β̃
j∈S g∈GS
2
j∈S k∈S g∈GS w g S∩Gg
2

where β̃ = t0 β̂S + (1 − t0 ) βS∗ .

To find an upper bound of the right-hand side. Recall that Equation (54) implies that β̂S − βS∗ ∗
⩽ 23 βmin , so we

have
q
β̃S∩Gg ⩾ |S ∩ Gg | min{|β̃|j | β̃j ̸= 0}
2
q

⩾ |S ∩ Gg |(βmin − t0 β̂S − βS∗ )

q
∗ ∗
⩾ |S ∩ Gg |(βmin − β̂S − βS )


βmin
q
⩾ |S ∩ Gg | .
3
Consequently, the first term could be upper bounded by
X X wg 1{j∈Gg } X wg |S ∩ Gg | 3 X
q
= ⩽ ∗ wg |S ∩ Gg |
β̃S∩Gg β̃S∩Gg βmin
j∈S g∈GS g∈GS g∈GS
2 2

On the other hand, the Cauchy-Schwarz inequality gives


2 2
β̃S∩Gg ⩽ |S ∩ Gg | β̃S∩Gg .
1 2

Thus, the second term could also be upper bounded by


2
4
X X X (wg )4 1{j,k∈Gg } |β̃j ||β̃k | X wg β̃S∩Gg
1
3 = 3
j∈S k∈S g∈GS wg3 β̃S∩Gg g∈GS wg3 β̃S∩Gg
2 2
X wg |S ∩ Gg |

g∈GS β̃S∩Gg
2
3 X q
⩽ ∗ wg |S ∩ Gg |.
βmin
g∈GS

6
P p
Let c2 = ∗
βmin g∈GS wg |S ∩ Gg |, then Equation (63) implies

rS (β̂S ) − rS (βS∗ ) ⩽ c2 β̂S − βS∗ .


1 ∞

If
−1
∥QSc S QSS2 ∥2,∞ ⩽ 3, (64)
then we have

49
−1 −1
   
QSc S Q−1
SS r (β̂
S S ) − r (β
S S

) = QSc S QSS2 QSS2 rS (β̂S ) − rS (βS∗ )
∞ ∞
−1 −1
⩽ QSc S QSS2 QSS2 rS (β̂S ) − rS (βS∗ )
∞,2 2 2
−1
⩽ 3γmax (QSS2 ) rS (β̂S ) − rS (βS∗ )

− 21
⩽ 3γmin (QSS )c2 β̂S − βS∗

(54) − 21 1
−1
⩽ 3c2 γmin (QSS ) |S| γmin (QSS ) (λn AS + ∥qS ∥∞ )
2

6
q
− 32 1
X
=3 ∗ wg |S ∩ Gg |γmin (QSS ) |S| 2 (λn AS + ∥qS ∥∞ ) .
βmin
g∈GS

If the following conditions are satisfied:


6 X τ
q
− 23 1
a−1
S c ∗ wg |S ∩ Gg |γmin (QSS ) |S| 2 λn AS ⩽ , (65)
βmin 12
g∈GS

6 τ
q
− 32 1
X
a−1
Sc ∗ wg |S ∩ Gg |γmin (QSS ) |S| 2 ∥qS ∥∞ ⩽ , (66)
βmin 12
g∈GS

(ϕcS )∗ [QSc S Q−1


SS rS ] ⩽ 1 − τ, (67)

λn τ
(ϕcS )∗ [qSc |S ] ⩽ , (68)
2
then we have

h i h   i
(62)
(ϕcS )∗ ∇L(β̂ R )Sc = (ϕcS )∗ λn QSc S Q−1 ∗ −1 ∗
SS rS (β̂S ) − rS (βS ) + λn QSc S QSS rS (βS ) − qSc |S
h  i
⩽ (ϕcS )∗ λn QSc S Q−1 ∗
+ (ϕcS )∗ λn QSc S Q−1 ∗ c ∗
   
SS r (β̂
S S ) − r (β
S S ) SS rS (βS ) + (ϕS ) −qSc |S


h  i λn τ
⩽ λn (ϕcS ) QSc S Q−1 SS r (
S Sβ̂ ) − r (β
S S

) + λn (1 − τ ) +
2
(42)
−1
  λ n τ
⩽ λn a (Sc ) QSc S Q−1
SS rS (β̂S ) − rS (βS )

+ λn −
∞ 2
λn τ λn τ λn τ
⩽ + + λn − ⩽ λn ,
4 4 2
which is Equation (57).
Because Equation (56) and Equation (57) are satisfied, Lemma 16 implies that β̂ R is the optimal. Thus,
supp(β̂ G ) = supp(β̂ R ) = S.

D.10.4 PART IV

The results in Part III depend on conditions (52), (53), (64), (65), (66), (67), and (68), which are summarized as follows:

−1
∥QSc S QSS2 ∥2,∞ ⩽ 3, (69)

 
3
∗ ∗

γ 
min (QSS ) βmin τ γmin (QSS )aSc βmin
2 
1
λn |S| 2 ⩽ min , p , (70)
3AS
P

 72AS wg |Gg ∩ S| 

g∈GS

50
(ϕcS )∗ [QSc S QSS rS ] ⩽ 1 − τ, (71)

λn τ
(ϕcS )∗ [qSc |S ] ⩽ , (72)
2
 
3
∗ ∗

γ 
min (QSS ) βmin τ γmin (QSS )aSc βmin
2 
∥qS ∥∞ ⩽ min , p . (73)
3AS
P

 72AS wg |Gg ∩ S| 
g∈GS

In Part IV, we want to make sure that these conditions hold with high probability.
Condition (69)
To begin with, for any matrix A ∈ Rm×n , the Cauchy-Schwarz inequality implies that
s X 
∥A∥2,∞ = sup ∥Au∥∞ = sup max Aij uj
∥u∥2 ⩽1 ∥u∥2 ⩽1 i∈[m]
j∈[n]
s X sX 
⩽ sup max A2ij u2j
∥u∥2 ⩽1 i∈[m]
j∈[n] j∈[n]
s X  nq o
⩽ max A2ij ⩽ max diag(AA⊤ ) .
i∈[m] i∈[m]
j∈[n]

1 ⊤ −1
Recall that Q = n X X. Let A = QSc S QSS2 , we have
q
−1
∥Q Sc S QSS2 ∥2,∞ ⩽ max{ diag(QSc S Q−1
SS QSSc )}.

Using the Schur complement of Q on the block matrices QSS and QSc Sc , the positiveness of Q implies the positiveness
of QSc Sc − QSc S Q−1
SS QSSc . Thus,

max diag(QSc S Q−1


SS QSSc ) ⩽ max diag(QSc Sc ) ⩽ max
c
Qjj .
j∈S

Lemma 18. (Lemma 1 of Laurent and Massart (2000))


Suppose that the random variable U follows χ2 distribution with d degrees of freedom, then for any positive x,

P(U − d ≥ 2 dx + √2x) ⩽ exp(−x),
P(d − U ≥ 2 dx) ⩽ exp(−x).
nQjj
As X follows multivariate normal, Q̃jj = Θ2jj
∼ χ2n . Then by Lemma 18, we have
p [ X
P(maxc Qjj > 3) ⩽ P(maxc Qjj > 5) ⩽ P( Qjj > 5) ⩽ P(Qjj > 5)
j∈S j∈S
j∈Sc j∈Sc
X X Qjj
⩽ P(Qjj > 5Θ2jj ) = P(n > 5n)
Θ2jj
j∈Sc j∈Sc
X (74)
⩽ P(Q̃jj > n + 2n + 2n) ⩽ (p − |S|) exp(−n)
j∈Sc

= exp(−n + log(p − |S|))


n
⩽ exp(− ),
2
where the last inequality holds as n > 2 log(p − |S|). Thus,
−1 p n
P(∥QSc S QSS2 ∥2,∞ > 3) ⩽ P(maxc Qjj > 3) ⩽ exp(− ).
j∈S 2

51
Similarly, let QSc Sc |S = QSc Sc − QSc S Q−1
SS QSSc . The diagonal terms of QSc Sc |S is less than the diagonal terms of
QSc Sc , which implies
1/2 p n
P(∥QSc Sc |S ∥2,∞ > 3) ⩽ P(maxc Qjj > 3) ⩽ exp(− ).
j∈S 2

Condition (70)
Lemma 19. (Lemma 9 of Wainwright (2009))
Suppose that d ⩽ n and X ∈ Rn×d have i.i.d rows Xi ∼ N (0, Θ), then
   
1 ⊤ n
P γmax X X ⩾ 9γmax (Θ) ⩽ 2 exp(− ),
n 2
   
1 9 n
P γmax ( X ⊤ X)−1 ⩾ ⩽ 2 exp(− ).
n γmin (Θ) 2

As we assume that |S| ⩽ n and XSS ∼ N (0, ΘSS ), then Lemma 19 implies
n
P (γmax (QSS ) ⩾ 9γmax (ΘSS )) ⩽ 2 exp(− ),
2
and also
n
P (γmin (ΘSS ) ⩾ 9γmin (QSS )) ⩽ 2 exp(− ).
2
Thus, by assuming that
3
∗ ∗
 
1 3γmin (Θ)βmin τ γmin
2
(Θ)aSc βmin
λn |S| ⩽ min
2 , p ,
AS
P
8AS wg |Gg ∩ S|
g∈GS

we have 3
∗ ∗
 
1 γmin (QSS ) βmin τ γmin
2
(QSS )aSc βmin
λn |S| 2 ⩽ min , p
3AS
P
72AS wg |Gg ∩ S|
g∈GS
holds with high probability.

Condition (71)
For any j ∈ Sc , Xj ∈ Rn is zero-mean Gaussian. Following the decomposition in Wainwright (2009), we have
Xj⊤ = ΘjS Θ−1 ⊤ ⊤
SS XS + Ej , (75)
    −1
where Ej are i.i.d from N 0, ΘSc Sc |S jj with ΘSc Sc |S = ΘSc Sc − ΘSc S (ΘSS ) ΘSSc . Let ESc be an |S c | × n
matrix, with each row representing Ej for an element j ∈ Sc , then we have

QSc S Q−1 ⊤ ⊤
SS rS = XSc XS (XS XS )
−1
rS
(75)
= ΘSc S Θ−1 ⊤ ⊤ ⊤ −1

SS XS + ESc XS (XS XS ) rS
(76)
= ΘSc S Θ−1 ⊤ ⊤
SS rS + ESc XS (XS XS )
−1
rS
:= ΘSc S Θ−1
SS rS + η.

The preceding expression prompts us to establish an upper bound for the dual norm of η. To achieve this, we begin by
examining the scenario in which XS is fixed. Our objective now is to derive the covariance matrix of η. For any j ∈ Sc ,
we have
E[ηj ] = E Ej⊤ XS (XS⊤ XS )−1 rS = 0.
 

52
For any pair of j, k ∈ Sc , we have
E[ηj ηk ] = E Ej⊤ XS (XS⊤ XS )−1 rS Ek⊤ XS (XS⊤ XS )−1 rS
 

= E r⊤ ⊤ −1 ⊤
XS Ej Ek⊤ XS (XS⊤ XS )−1 rS
 
S (XS XS )

= r⊤ ⊤ −1 ⊤
XS E Ej Ek⊤ XS (XS⊤ XS )−1 rS ,
 
S (XS XS )
where
 (75) 
E Ej Ek⊤ = E Xj − XS Θ−1 ⊤
Xk⊤ − ΘkS Θ−1 ⊤
  
SS ΘjS SS XS
= E Xj Xk⊤ − E XS Θ−1 ⊤ −1 ⊤ −1 ⊤ −1 ⊤
       
SS ΘjS Xk − E Xj ΘkS ΘSS XS | XS + E XS ΘSS ΘjS ΘkS ΘSS XS
= E Xj Xk⊤ − XS Θ−1 −1 ⊤ −1 −1 ⊤
   ⊤
SS ΘjS E Xk − E [Xj ] ΘkS ΘSS XS + XS ΘSS ΘjS ΘkS ΘSS XS
= E Xj Xk⊤ − E [Xj ] E Xk⊤ = Cov Xj , Xk⊤ = ΘSc Sc |S jk In×n .
      

Consequently,
E[ηj ηk ] = r⊤ ⊤ −1 ⊤
XS E Ej Ek⊤ XS (XS⊤ XS )−1 rS
 
S (XS XS )

= r⊤ ⊤ −1 ⊤
XS ΘSc Sc |S jk In×n XS (XS⊤ XS )−1 rS

S (XS XS )

r⊤
S (QSS )
−1
rS
= r⊤ ⊤ −1
 
S (XS XS ) rS · ΘSc Sc |S jk
= · ΘSc Sc |S jk .
n
r⊤ (Q )−1 rS 
And we have Cov(η) = S SS n · ΘSc Sc |S := Ξ.
Lemma 20. (Theorem 2.26 in Wainwright (2019))
Let (X1 , . . . , Xn ) be a vector of i.i.d. standard Gaussian variables, and let f : Rn 7→ R be a Lipschitz function with
respect to the Euclidean norm and Lipschitz constant L. Then the variable f (X) − E[f (X)] is sub-Gaussian with
parameter at most L, and hence

t2
P[|f (X) − E[f (X)]| ⩾ t] ⩽ 2 exp(− ) for all t ⩾ 0.
2L2
h 1 i 1
To apply the concentration bound in Lemma 20, we define function Ψ(u) = (ϕ∗Sc ) Ξ 2 u . As η = Ξ 2 W where
W ∼ N (0, I|Sc |×|Sc | ), (ϕcS )∗ (η) has the same distribution as Ψ(W ) . We continue to show that Ψ is a Lipschitz
function given fixed XS .
h 1 i

|Ψ(u) − Ψ(v)| ⩽ Ψ(u − v) = (ϕcS ) Ξ 2 (u − v)
1
⩽ a−1
S Ξ 2 (u − v)

 12
r⊤ −1

S (QSS ) rS
a−1

= S · ΘSc Sc |S (u − v)
n ∞
1 1
− 21
a−1 γmax Q−1
 
⩽ S ∥rS ∥2 n 2
SS γmax ΘSc Sc |S ∥u − v∥2 .
2

Thus, the corresponding Lipstichiz constant is


1  12
− 12 2
Lη = a−1 γmax Q−1

S ∥rS ∥2 n SS γmax ΘSc Sc |S .

On the other hand, suppose that E [(ϕcS )∗ (η)] ⩽ τ4 , since Ψ is a Lipschitiz function, by applying t = τ
4 in concentration
Lemma 20 on Lipschitz functions of multivariate standard random variables, we have

∗ τ  τ  τ τ
P (ϕcS ) [η] > = P Ψ(W ) > = P Ψ(W ) − >
2 2 4 4
  c ∗  τ
⩽ P Ψ(W ) − E (ϕS ) (η) >
4
τ2
 
 τ
= P Ψ(W ) − E [Ψ(W )] > ⩽ exp − 2 .
4 4Lη

53
Now we further assume that {γmax (Q−1
SS ) ⩽
9
γmin (ΘSS ) }. Under this condition, we have
1
3a−1

S ∥rS ∥2 γmax ΘSc Sc |S
1 1 2
− 12
a−1 γmax Q−1
 
Lη = S ∥rS ∥2 n 2
SS γmax ΘSc Sc |S ⩽
2
1 . (77)
(nγmin (ΘSS )) 2
Lemma 21. (Sudakov inequality, Theorem 5.27 in Wainwright (2019))
If X and Y are a.s. bounded, centered Gaussian processes on T such that
2 2
E (Xt − Xs ) ≤ E (Yt − Ys )
then
E sup Xt ≤ E sup Yt .
T T

Lemma 22. (Exercise 2.12 in Wainwright (2019)) Let X1 , . . . , Xn be independent σ 2 -subgaussian random variables.
Then p
E[ max |Xi |] ≤ 2 σ 2 log n.
1≤i≤n

On the other hand, for any ut , us , we have


1 1
E(u⊤ ⊤ 2 ⊤ 2 ⊤ 2 2 ⊤
t η − us η) = E(ut Ξ W − us Ξ W ) = (ut − us ) Ξ(ut − us )
1 1
⩽||ut − us ||22 γmax (Ξ) = E(γmax
2
(Ξ) u⊤ ⊤
t W − γmax (Ξ) us W )
2 2

By using Sudakov-Fernique inequality in Lemma 21, we have


1
h 1
i h i
E sup u⊤ Ξ 2 W ⩽ E sup γmax 2
(Ξ) u⊤ W
ϕcS (u)⩽1 ϕcS (u)⩽1

Consequently, h i h i h i
1
E (ϕcS )∗ (η) = E sup u⊤ η = E sup u⊤ Ξ 2 W
ϕcS (u)⩽1 ϕcS (u)⩽1
1
h i 1
h i (78)
⩽ γmax
2
(Ξ) E sup u⊤ W = γmax (Ξ) 2 E (ϕcS )∗ (W ) .
ϕcS (u)⩽1

Notice that 2
2
 X wg
∥rS ∥2 ⩽ |S| max r2j = |S| max{βj∗ · ∗ }
j∈S j∈S ∥βG ∥
g ∩S 2
g∈GG
S ,Gg ∩j̸=∅
 X wg 2
⩽ |S| max{|βj∗ |} · max{ ∗ }
j∈S ∥βG ∥
g ∩S 2
g∈GG
S ,Gg ∩j̸=∅

 max {|βj |} 2
j∈S w
p g
X
⩽ |S| ∗ · max{ }
βmin |Gg ∩ S
g∈GG
S ,Gg ∩j̸=∅

 max {|βj∗ |} 2
j∈S X
⩽ |S| · max{ wg } (79)

βmin
g∈GG
S ,Gg ∩j̸=∅

 max {|βj |} 2
j∈S
⩽ |S| ∗ · hmax (GS ) max wg }
βmin g∈GG
S

 max {|βj |} 2  A 2
j∈S 2 ∗ 2 S
⩽ ∗ |S|A S = max {(β j ) }|S| ∗
βmin j∈S βmin
max{(βj∗ )2 }
j∈S

λ2n

54
Thus, if XS satisfies γmax (Q−1
SS ) ⩽
9
γmin (ΘSS ) , we have
(78) 1
E [(ϕcS )∗ (η)] ⩽ γmax (Ξ) 2 E [(ϕcS )∗ (W )]
−1 1 
∥rS ∥2 γmin
2
(QSS ) γmax
2
ΘSc Sc |S
⩽ 1 E [(ϕcS )∗ (W )]
n 2
1 
∥rS ∥2 3γmax
2
ΘSc Sc |S
⩽ 1 E [(ϕcS )∗ (W )]
(nγmin (ΘSS )) 2
1
(42) ∥rS ∥2 3γmax
2
ΘSc Sc |S
 (80)
E a−1
 
⩽ 1 Sc ∥W ∥∞
(nγmin (ΘSS )) 2

1 
∥rS ∥2 3γmax
2
ΘSc Sc |S
⩽ 1 E [∥W ∥∞ ]
aSc (nγmin (ΘSS )) 2
1 
Lemma 22 6 ∥rS ∥2 γmax
2
ΘSc Sc |S p τ
⩽ 1 log(p − |S|) ⩽ ,
aSc (nγmin (ΘSS )) 2 4
where the last inequality holds as Assumption 6 implies that
max{(βj∗ )2 } log(p − |S|) 2 2 
j∈S (79)∥rS ∥2 log(p − |S|) 576 ∥rS ∥2 log(p − |S|)γmax ΘSc Sc |S
n≳ ≳ ⩾ .
a2Sc λ2n a2Sc a2Sc γmin (ΘSS )τ 2
Consequently, Equation (77) and (80) together implies

∗ τ 9 
P (ϕcS ) [η] > | XS , γmax (Q−1
SS ) ⩽
2 γmin (ΘSS )
2  (81)
 τ  τ na2S γmin (ΘSS )
2 
⩽ exp − ⩽ exp − 2 .
4L2η

12 ∥rS ∥2 γmax ΘSc Sc |S

Thus, let A be the event {XS | γmax (Q−1


SS ) ⩽
9
γmin (ΘSS ) }. We have

∗ τ  
∗ τ 9 
P (ϕcS ) [η] > | XS = P (ϕcS ) [η] > | XS , γmax (Q−1 SS ) ⩽
2 2 γmin (ΘSS )

∗ τ 9 
+P (ϕcS ) [η] > | XS , γmax (Q−1 SS ) >
2 γmin (ΘSS )
!
τ 2 na2S γmin (ΘSS )
⩽ exp − 2  + P (A c )
4 ∥rS ∥2 γmax ΘSc Sc |S
!
τ 2 na2S γmin (ΘSS ) n
⩽ exp − 2  + 2 exp(− ).
4 ∥rS ∥2 γmax ΘSc Sc |S 2

55
Condition (72)
Now we are going to study condition (72). Recall that qSc |S = qSc − QSc S Q−1 SS qS and QSc Sc |S = QSc Sc −
QSc S Q−1 Q
SS SS c . Given X, q c
S |S is a centered Gaussian random vector with covariance matrix
h i
E qSc |S qS⊤c |S = E qSc qS⊤c − qSc qS⊤ Q−1 −1 ⊤ −1 ⊤ −1
 
SS QSS − QSc S QSS qS qSc + QSc S QSS qS qS QSS QSSc
c

= E qSc qS⊤c − QSc S Q−1 ⊤ −1


 
SS qS qS QSS QSSc
= E qSc qS⊤c − E QSc S Q−1 ⊤ −1
   
SS qS qS QSS QSSc
σ2 σ2 σ2
QSc Sc −
= QSc S Q−1
SS QSSc := QSc Sc |S .
n n n
 
∗ 1/2 ∗
Next, we define ψ(u) = (ϕcS ) σn−1/2 QSc Sc |S u so that (ϕcSc ) qSc |S has the same distribution as ψ(W ). Now


we want to show that ψ is a Lipschitz function


 
∗ 1/2
|ψ(u) − ψ(v)| ⩽ ψ(u − v) = (ϕcS ) σn−1/2 QSc Sc |S (u − v)
1
⩽ σn−1/2 a−1
Sc QSc Sc |S (u − v)
2


1
⩽ σn−1/2 a−1
Sc QSc Sc |S
2
∥(u − v)∥∞
2,∞
1
⩽ σn−1/2 a−1
Sc QSc Sc |S
2
∥(u − v)∥2
2,∞

1/2
Suppose that QSc Sc |S ⩽ 3, then ψ is a Lipschitz function with Lipschitz constant 3σn−1/2 a−1
Sc . In addition, if
2,∞
λn τ λn τ
E[(ϕcS )∗ (qSc |S )] ⩽ 4 , then by Lemma 20 , we have for t = 4 ,

     
c ∗
  λn τ λn τ λn τ λn τ
P (ϕS ) qSc |S ⩾ = P ψ(W ) > = P ψ(W ) − >
2 2 4 4
 
λ n τ
⩽ P ψ(W ) − E[(ϕcS )∗ (qSc |S )] >
4
   2 2 2 
λn τ τ λn naSc
= P ψ(W ) − E [ψ(W )] > ⩽ exp − .
4 144σ 2
Now, we consider random X. For any ut , us , we have
2 σ2 σ2 1
2 2
E (ut − us )⊤ qSc |S = (ut − us )⊤ QSc Sc |S (ut − us ) ⩽

QS2 c Sc |S 2
(ut − us ) 2
n n
 −1 1 2

= E σn 2 ∥QSc Sc |S ∥2 (ut − us ) W
2

By using Sudakov-Fernique inequality, if ∥QSc Sc |S ∥2 ⩽ 9, we get


E[(ϕcS )∗ (qSc |S )] = E sup u⊤ qSc |S
ϕcS (u)≤1
1
⩽ σn−1/2 ∥QSc Sc |S ∥22 E sup u⊤ W
ϕcS (u)≤1

− 12 ∗
1 (82)
⩽ σn ∥QSc Sc |S ∥2 E (ϕcS ) (W )
 
2

1 ∗
⩽ 3σn− 2 E (ϕcS ) (W )
 

λn τ
⩽ .
4
On the other hand, Assumption 1’ and 6 imply that
 ∗ 
9σ 2 E2 (ϕcS ) (W ) 9σ 2 log(p − |S|) λ2 τ 2
⩽ 2 ⩽ n .
n aSc n 16

56
Therefore, we have
τ 2 nλ2n a2Sc
   
∗  λn τ 1/2
P (ϕcS ) qSc |S ⩾ | X, QSc Sc |S ⩽3 ⩽ exp − .
2 2,∞ 144σ 2
1/2
Let B be the event {X | QSc Sc |S ⩽ 3}. We have
2,∞
   
∗  λn τ ∗  λn τ 1/2
P (ϕcS ) qSc |S ⩾ | X = P (ϕcS ) qSc |S ⩾ | X, QSc Sc |S ⩽3
2 2 2,∞
 
∗ λn τ 1/2
+P (ϕcS ) qSc |S ⩾
 
| X, QSc Sc |S >3
2 2,∞
 2 2 2 
τ nλn aSc
⩽ exp − + P (Bc )
144σ 2
 2 2 2 
(69) τ nλn aSc n
⩽ exp − + exp(− ).
144σ 2 2
Condition (73)
The last condition (73) lead us to control the term P (∥qS ∥∞ ⩾ c′ (S, G)), with
 
3
∗ ∗

γ 
min (QSS ) βmin τ γmin (QSS )aSc βmin
2 
c′ (S, G) = min , p .
3AS
P

 72AS wg |Gg ∩ S| 
g∈GS

For any given X, Jenatton et al. (2011a) showed that for any δ > 0,

nδ 2
 
P (∥qS ∥∞ ⩾ δ) ⩽ 2|S| exp − 2 .

Recall under the event A, we have

γmin (ΘSS )
⩽ γmin (QSS ).
9
Which implies that
 
3
∗ ∗

γ 
min (ΘSS ) βmin τ γmin (ΘSS ) aSc βmin
2

c′ (S, G) ⩾ min , p
27AS
P

 648AS wg |Gg ∩ S| 

g∈GS
 
 
∗ ∗
 
 βmin τ aSc βmin 
⩾ min , 3 := c(S, G).
 27c1 AS 648c 2 A P
w
p
|G ∩ S| 

 1 S g g 

g∈GS

Thus, consider random X, we have


nc2 (S, G)
 

P (∥qS ∥∞ ⩾ c (S, G) | A) ⩽ P (∥qS ∥∞ ⩾ c(S, G) | A) ⩽ 2|S| exp −
2σ 2
Thus,

P (∥qS ∥∞ ⩾ c′ (S, G)) = P (∥qS ∥∞ ⩾ c′ (S, G) ∩ A) + P (∥qS ∥∞ ⩾ c′ (S, G) ∩ A c )


⩽ P (∥qS ∥∞ ⩾ c′ (S, G) ∩ A) + P (A c )
= P (∥qS ∥∞ ⩾ c′ (S, G) | A) P (A) + P (A c )
⩽ P (∥qS ∥∞ ⩾ c′ (S, G) | A) + P (A c )
nc2 (S, G)
 
⩽ 2|S| exp − + 2 exp(−n/2).
2σ 2

57
In summary, the probability of one of the conditions being violated is upper bound by
!
na2S τ 2 γmax (ΘSS ) nλ2n τ 2 a2Sc nc2 (S, G)
   
n
8 exp(− ) + exp − 2 + exp − + 2|S| exp − .
2 32σ 2 c42 2σ 2

4 ∥rS ∥2 γmax ΘSc Sc |S

D.10.5 PART V

First, given the original group structure G and its induced counterpart G, along with their respective weights w and w,
we consider the scenario where J = S. For all β ∈ Rp , we have
X X X
ϕG

S (βS ) = wg ∥βS∩Gg ∥2 ⩽ wg ∥βS∩Gg ∥2
g∈GG
S
g∈GS g:g∈F −1 (g),Gg ⊂S
X X 
= wg ∥βS∩Gg ∥2
g:Gg ⊂S g:g∈F (g),g∈GS
X X  (83)
= wg ∥βS∩Gg ∥2
g:Gg ⊂S g:g∈F (g)
X
= wg ∥βS∩Gg ∥2 = ϕG
S (β).
g∈GG
S

Since ϕG G G G
S (β) ⩽ ϕS (β), we can set aS = aS = min
√wg . Since
g∈GG dg
S
X
max wg = max wg ⩽ hmax (GS ) max wg ,
g∈GG
S
g:Gg ∩S̸=∅ g∈GG
S
g∈F (g)
G p
we can set AG
S = AS . On the other hand, for all β ∈ R , we have

X X X
(ϕG c c

S ) (βS ) = wg ∥βSc ∩Gg ∥2 ⩽ wg ∥βSc ∩Gg ∥2
g∈[m]\GG
S
g∈[m]\GS g:g∈F −1 (g),Gg ⊂Sc
X X 
= wg ∥βSc ∩Gg ∥2
g:Gg ⊂Sc g:g∈F (g),g∈[m]\GG
S
X X  (84)
= wg ∥βSc ∩Gg ∥2
g:Gg ⊂Sc g:g∈F (g)
X
c
= wg ∥βSc ∩Gg ∥2 = (ϕG
S ) (β).
g∈[m]\GG
S

G
p
Consequently, with an trivial extension, we can set aG
Sc = aSc ⩽ min wg / dg .
g∈GG
Sc

Based on the result of Theorem 5.1, Equation (28) holds if


1
n β∗ β ∗ aSc o
min
λn |S| 2 ≲ min , P minp .
AS AS wg |Gg ∩ S|
g∈GS

By the Cauchy–Schwarz inequality, we have


X q X X q
wg |Gg ∩ S| ⩽ wg |Gg ∩ S|
g∈GS g∈GS g∈F −1 (g)
X q X 
= |Gg ∩ S| wg
g∈F −1 (g),g∈GS g∈F (g)
X q
= wg |Gg ∩ S|
g∈GS

58
If F −1 (g) = O(1) for every g ∈ GS , we have
X X q 2
|Gg ∩ S| = |Gg ∩ S| ≍ |Gg ∩ S| .
g∈F −1 (g) g∈F −1 (g)
p P p
Consequent, we have |Gg ∩ S| ≍ |Gg ∩ S|,
g∈F −1 (g)
X q X q
wg |Gg ∩ S| ≍ wg |Gg ∩ S|,
g∈GS g∈GS

and
∗ ∗
β ∗ aGc ∗
aG
   
βmin βmin βmin Sc
min G
, G P minpS ≍ min , .
AG
p
AG
P
AS AS wg |Gg ∩ S| S S wg |Gg ∩ S|
g∈GS g∈GS

59

You might also like