

A Fitted Sparse-Group Lasso for Genome-Based Evaluations

Jan Klosa, Noah Simon, Volkmar Liebscher, and Dörte Wittenburg

Abstract—In life sciences, high-throughput techniques typically lead to high-dimensional data and often the number of covariates is
much larger than the number of observations. This inherently comes with multicollinearity challenging a statistical analysis in a linear
regression framework. Penalization methods such as the lasso, ridge regression, the group lasso, and convex combinations thereof,
which introduce additional conditions on regression variables, have proven themselves effective. In this study, we introduce a novel
approach by combining the lasso and the standardized group lasso leading to meaningful weighting of the predicted (“fitted”) outcome
which is of primary importance, e.g., in breeding populations. This “fitted” sparse-group lasso was implemented as a proximal-averaged
gradient descent method and is part of the R package “seagull” available at CRAN. For the evaluation of the novel method, we
executed an extensive simulation study. We simulated genotypes and phenotypes which resemble data of a dairy cattle population.
Genotypes at thousands of genomic markers were used as covariates to fit a quantitative response. The proximity of markers on a
chromosome determined grouping. In the majority of simulated scenarios, the new method revealed improved prediction abilities
compared to other penalization approaches and was able to localize the signals of simulated features.

Index Terms—Biology and genetics, iterative methods, optimization, statistical computing

Jan Klosa and Dörte Wittenburg are with the Institute of Genetics and Biometry, Research Institute for Farm Animal Biology (FBN), 18196 Dummerstorf, Germany. E-mail: {klosa, wittenburg}@fbn-dummerstorf.de.
Noah Simon is with the Department of Biostatistics, University of Washington, Seattle, WA 98195 USA. E-mail: nrsimon@uw.edu.
Volkmar Liebscher is with the Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany. E-mail: volkmar.liebscher@uni-greifswald.de.
Manuscript received 5 May 2021; revised 20 Oct. 2021; accepted 1 Mar. 2022. Date of publication 7 Mar. 2022; date of current version 3 Feb. 2023. This work was supported in part by the German Research Foundation (DFG) under Grant WI 4450/2-1. The publication of this article was supported by the Open Access Fund of the FBN. (Corresponding author: Dörte Wittenburg.) Digital Object Identifier no. 10.1109/TCBB.2022.3156805. This work is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/.

1 INTRODUCTION

In the framework of linear regression, an n-dimensional response vector and a p-dimensional vector of regressors are assumed to hold a linear relationship. Ordinary least squares aims to provide a solution for the p regression coefficients that minimizes the residual sum of squares. If p > n, this minimizer is no longer unique. Instead, the set of minimizers includes an infinite number of potential candidates, which all solve the initial regression problem equally well. In order to determine a meaningful one, different approaches were introduced in the past. One approach – called penalization or regularization – consists of applying one or more additional conditions to the set of minimizers. The corresponding penalty function of such an approach belongs to one of two mutually distinct categories: it is either differentiable everywhere, as in ridge regression (or Tikhonov regularization), or it is not (e.g., the lasso [1] and the group lasso (GL) [2]). Approaches from the second category are considerably harder to solve, but they often implicitly apply variable selection, which increases the interpretability of the underlying model. More complex methods can be built by applying several penalties at once. Examples are the elastic net (EN) [3], which combines the lasso and ridge regression, or the sparse-group lasso (SGL) [4], where the lasso and GL are applied simultaneously.

Moreover, p > n leads to multicollinearity among regressors. Dependencies between variables can be addressed by grouping together those that are strongly correlated with each other. A method which takes such a group structure into consideration is the GL. There are explicit formulas to solve the GL [2], some of which require orthogonality of the regressors. In order to guarantee a flawless mathematical application to non-orthogonalized data, the standardized group lasso was proposed [5]. Whereas in the GL the grouped regression coefficients form the penalty, the standardized group lasso penalizes the grouped predicted ("fitted") outcome.

The objective of our study is to present a new penalty which combines the strong selective ability of the lasso with the mathematical adequacy of the standardized group lasso – the fitted sparse-group lasso (fitSGL). The solution to this mathematical problem is non-trivial, as the two distinct penalties act on different scales which cannot be merged by any transformation. Therefore, established tools for calculations are not appropriate. Here, we provide a solution via proximal-averaged gradient descent [6] with additional acceleration based on [7].
One field of application where highly correlated predictor variables may appear is the genetic evaluation of individual trait expression in a breeding population. Fitted values, also known as genomic breeding values (GBVs), are determinant for breeding decisions. Explicitly accounting for them in a grouping approach is expected to improve the prediction precision of GBVs and to help identify individuals with particularly high or low GBVs. Furthermore, the selection property of the lasso helps to detect trait-associated sites on the genome. Thus, for evaluation purposes, we applied the novel method to a large variety of simulated scenarios resembling data from a dairy cattle population. The scenarios differed in terms of sample size and features of causal variants influencing trait expression. The outcome was compared to that of other lasso-type penalization approaches.

2 MATERIAL AND METHODS

2.1 Linear Regression Model
The underlying linear model consists of an n-dimensional response vector y, a matrix of features X with dimensions n × p, and the corresponding vector of regression coefficients b. Then,

$$y = Xb + e \qquad (1)$$

where e is a vector of i.i.d. normally distributed random variables. In order to estimate b, penalization is applied so that the linear regression minimization is altered by adding a penalty function $f_\lambda$:

$$\min_{b}\; \frac{1}{2n}\left\lVert y - Xb\right\rVert_2^2 + f_\lambda(b). \qquad (2)$$

In this study, we combine the lasso and the standardized group lasso [5] and refer to this as fitSGL. The corresponding optimization problem is

$$\min_{b}\; \frac{1}{2n}\left\lVert y - \sum_{l=1}^{L} X^{(l)} b^{(l)}\right\rVert_2^2 + \lambda\alpha \left\lVert b\right\rVert_1 + \lambda(1-\alpha)\sum_{l=1}^{L}\sqrt{p_l}\left\lVert X^{(l)} b^{(l)}\right\rVert_2 \qquad (3)$$

where λ > 0 is the penalty parameter and α ∈ [0, 1] is the mixing parameter. The superscript (l) denotes group l. Therefore, $b^{(l)}$ is a subvector of b, and $X^{(l)}$ are the corresponding columns of X. Furthermore, $p_l$ is the number of elements in group l.
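To make the structure of (3) concrete, the following R sketch evaluates the fitSGL objective for a given coefficient vector. It is purely illustrative: the function name, the argument list, and the representation of the grouping as a list of column-index vectors are assumptions for this example and not part of the "seagull" interface.

# Illustrative evaluation of the fitSGL objective (3).
# 'groups' is assumed to be a list of integer vectors, one per group,
# holding the column indices of X that belong to that group.
fitsgl_objective <- function(b, y, X, groups, lambda, alpha) {
  n <- length(y)
  loss <- sum((y - X %*% b)^2) / (2 * n)       # least-squares part
  pen_lasso <- lambda * alpha * sum(abs(b))    # element-wise (lasso) penalty
  pen_group <- lambda * (1 - alpha) * sum(sapply(groups, function(idx) {
    sqrt(length(idx)) * sqrt(sum((X[, idx, drop = FALSE] %*% b[idx])^2))
  }))                                          # penalty on grouped fitted values
  loss + pen_lasso + pen_group
}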
2.2 Proximal Gradient Descent Update
Solving (3) is not trivial, as both penalty terms are based on different coordinate systems, and there is no substitution available to overcome this issue. Therefore, we divide the initial penalization into two sub-problems: a lasso and a standardized group lasso problem. We compute a proximal gradient descent (PGD) [8] update for each of them. PGD is an iterative algorithm: starting with a guess $\hat{b}^0$, a sequence of updates $\hat{b}^{m+1}$ is computed over iterations (m = 0, 1, ...). After each iteration, the updates are merged according to their respective weights α and 1 − α; this process is called proximal averaging (PA) [9]. Finally, the algorithm stops once a convergence criterion is met.
Unlike for the lasso, the PGD update for the standardized group lasso, which is

$$\min_{b}\; \frac{1}{2n}\left\lVert y - \sum_{l=1}^{L} X^{(l)} b^{(l)}\right\rVert_2^2 + \lambda\sum_{l=1}^{L}\sqrt{p_l}\left\lVert X^{(l)} b^{(l)}\right\rVert_2, \qquad (4)$$

has not been published yet. To solve this, we followed the technique described in [5] and adapted it to fit the PGD framework, resulting in a joint update of the regression coefficients with respect to step width t. This approach consists – in its core – of (compact) singular value decompositions of the matrices $X^{(l)}$ and a subsequent transformation into orthogonal variables. The transformation leads to the GL with PGD update according to [8], [10]. Eventually, a back transformation leads to the desired update in iteration m + 1 (with m = 0, 1, ...),

$$b^{(l),m+1} = \left(1 - \frac{\sqrt{p_l}\,\lambda t}{\left\lVert X^{(l)}\, T\!\left(b^{(l),m}\right)\right\rVert_2}\right)_{+} T\!\left(b^{(l),m}\right) \qquad (5)$$

with

$$T\!\left(b^{(l),m}\right) := b^{(l),m} - \frac{t}{n}\left(X^{(l)\mathsf{T}} X^{(l)}\right)^{-1} X^{(l)\mathsf{T}} \left(\sum_{k=1}^{L} X^{(k)} b^{(k),m} - y\right) \qquad (6)$$

and $(x)_+ = \max(x, 0)$. Assuming that each $X^{(l)}$ has full column rank, the inverse matrix in (6) exists; otherwise, a modified approach becomes necessary, which is described in the next section.
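As an illustration of (5) and (6), the following R sketch performs the update for a single group l, assuming that the group's design matrix has full column rank. The function and argument names are chosen for this example only and do not reflect the internal structure of the "seagull" package.

# Update (5)-(6) for one group l (illustrative).
# X_l:         columns of X belonging to group l (assumed full column rank)
# b_l:         current coefficients of group l, i.e., b^(l),m
# fitted_all:  current overall fit, sum_k X^(k) b^(k),m
update_group <- function(b_l, X_l, fitted_all, y, lambda, t, n) {
  p_l <- ncol(X_l)
  grad <- crossprod(X_l, fitted_all - y)               # X^(l)T (sum_k X^(k) b^(k),m - y)
  T_bl <- b_l - (t / n) * solve(crossprod(X_l), grad)  # transformation T in (6)
  scale <- 1 - sqrt(p_l) * lambda * t / sqrt(sum((X_l %*% T_bl)^2))
  max(scale, 0) * as.vector(T_bl)                      # group-wise scaling in (5)
}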
A proper penalty parameter λ is commonly determined via a grid search. All of the considered penalty functions in this paper share the property that if λ exceeds a certain threshold, say $\lambda_{\max}$, then the estimates for the regression coefficients no longer change after a single iteration, i.e., $\hat{b}^0 = \hat{b}^1 = \dots$. Thus, the implemented grid search is based on a logarithmic scale from $\lambda_{\max}$ to $0.001\,\lambda_{\max}$, where the upper value $\lambda_{\max}$ was determined from the corresponding update formula by substituting $\hat{b}^0 = \hat{b}^1 = 0$ and solving for λ.

2.3 Alteration for Rank Deficiencies
In the case that at least for one group l the matrix $X^{(l)}$ does not have full column rank, the matrix $X^{(l)\mathsf{T}} X^{(l)}$ is not invertible. In [5], the authors proposed regularization by a positive value $d_l^2$, so that $X^{(l)\mathsf{T}} X^{(l)} + d_l^2 I^{(l)}$ replaces $X^{(l)\mathsf{T}} X^{(l)}$. Here, $I^{(l)}$ is the identity matrix with dimensions $p_l \times p_l$. This treatment alters the initial optimization problem (4) to

$$\min_{b}\; \frac{1}{2n}\left\lVert y - \sum_{l=1}^{L} X^{(l)} b^{(l)}\right\rVert_2^2 + \frac{1}{2n}\sum_{l=1}^{L} d_l^2 \left\lVert b^{(l)}\right\rVert_2^2 + \lambda \sum_{l=1}^{L} \sqrt{df_l\left(\left\lVert X^{(l)} b^{(l)}\right\rVert_2^2 + d_l^2\left\lVert b^{(l)}\right\rVert_2^2\right)} \qquad (7)$$

where

$$df_l = \sum_{i=1}^{p_l} \frac{d_{l,i}^2}{d_{l,i}^2 + d_l^2} \qquad (8)$$

with $d_{l,i}$ being the i-th singular value of $X^{(l)}$. Note that (7) coincides with (4) if $d_l = 0$. By introducing the following expressions
$$\tilde{y} = \begin{pmatrix} y \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^{n+p}, \qquad \tilde{X}^{(l)} = \begin{pmatrix} X^{(l)} \\ 0 \\ \vdots \\ d_l I^{(l)} \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^{(n+p)\times p_l},$$

with the block $d_l I^{(l)}$ placed in the rows that correspond to group l among the p appended rows, Equation (7) can be rewritten as

$$\min_{b}\; \frac{1}{2n}\left\lVert \tilde{y} - \sum_{l=1}^{L} \tilde{X}^{(l)} b^{(l)}\right\rVert_2^2 + \lambda \sum_{l=1}^{L} \sqrt{df_l}\, \left\lVert \tilde{X}^{(l)} b^{(l)}\right\rVert_2. \qquad (9)$$

Since the introduced matrices $\tilde{X}^{(l)}$ have full column rank, the PGD update of (9) is achieved by (5) and (6) with substitution of the corresponding components.
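The augmentation leading to (9) can be written down in a few lines of R. The sketch below only illustrates how $\tilde{y}$ and $\tilde{X}^{(l)}$ are assembled; the function name and the argument group_idx (the column positions of group l in X) are hypothetical.

# Build y_tilde and X_tilde^(l) for the rank-deficient case (illustrative).
# y_tilde appends p zeros to y; X_tilde^(l) stacks X^(l) on top of a p x p_l
# block that is zero except for d_l * I in the rows belonging to group l.
augment_group <- function(y, X, group_idx, d_l) {
  n <- nrow(X); p <- ncol(X); p_l <- length(group_idx)
  y_tilde <- c(y, rep(0, p))
  ridge_block <- matrix(0, nrow = p, ncol = p_l)
  ridge_block[cbind(group_idx, seq_len(p_l))] <- d_l
  X_tilde_l <- rbind(X[, group_idx, drop = FALSE], ridge_block)
  list(y_tilde = y_tilde, X_tilde_l = X_tilde_l)
}

With this construction, the squared norm of the appended rows reproduces the second term of (7), so the full-rank update (5)-(6) can be applied to the augmented quantities.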
2.4 Final Algorithm for fitSGL
We implemented the proximal-averaged gradient descent scheme for the fitSGL by incorporating an acceleration (in terms of u) based on [7] (step 7 in Algorithm 1). We used warm starts to perform a grid search over consecutive values for the penalty parameter λ. The final algorithm is termed proximal-averaged accelerated gradient descent (PA-AGD).

Algorithm 1. PA-AGD
1: Initialize $\hat{b}^0$, $h^0$, $t = 0.1$, and $u^0 = 1$
2: for m = 0, 1, ... do
3:   Calculate the proximal center $r^m = \hat{b}^m - (t/n)\,X^{\mathsf{T}}(X\hat{b}^m - y)$
4:   Calculate the j-th element of $P^t_{f_1}$ (the proximal gradient descent update of the lasso): $[P^t_{f_1}]_j = \operatorname{sign}(r^m_j)\,(|r^m_j| - \lambda t)_+$
5:   Calculate $P^t_{f_2}$ (the proximal gradient descent update of the standardized group lasso) via equation (5)
6:   Calculate the proximal average $h^{m+1} = \alpha P^t_{f_1} + (1-\alpha) P^t_{f_2}$
7:   Accelerate: $u^{m+1} = \bigl(1 + \sqrt{1 + 4(u^m)^2}\bigr)/2$ and $\hat{b}^{m+1} = h^{m+1} + (h^{m+1} - h^m)(u^m - 1)/u^{m+1}$
8: end for
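For orientation, the following self-contained R sketch mirrors the structure of Algorithm 1 for a single, fixed value of λ. It is a simplified illustration (fixed number of iterations, no convergence check, no warm starts) under the assumption that every group matrix has full column rank; it is not the implementation used in "seagull".

# Simplified PA-AGD loop for one penalty value lambda (illustrative).
# groups: list of integer vectors with the column indices of each group.
pa_agd <- function(y, X, groups, lambda, alpha, t = 0.1, max_iter = 1000) {
  n <- nrow(X); p <- ncol(X)
  b <- rep(0, p); h_old <- b; u <- 1
  for (m in seq_len(max_iter)) {
    fitted_all <- as.vector(X %*% b)
    resid <- fitted_all - y
    # Step 3: proximal center
    r <- b - (t / n) * as.vector(crossprod(X, resid))
    # Step 4: lasso update (soft-thresholding)
    P1 <- sign(r) * pmax(abs(r) - lambda * t, 0)
    # Step 5: standardized group lasso update via (5)-(6)
    P2 <- b
    for (idx in groups) {
      X_l <- X[, idx, drop = FALSE]
      T_bl <- b[idx] - (t / n) * as.vector(solve(crossprod(X_l), crossprod(X_l, resid)))
      scale <- 1 - sqrt(length(idx)) * lambda * t / sqrt(sum((X_l %*% T_bl)^2))
      P2[idx] <- max(scale, 0) * T_bl
    }
    # Step 6: proximal average
    h_new <- alpha * P1 + (1 - alpha) * P2
    # Step 7: acceleration
    u_new <- (1 + sqrt(1 + 4 * u^2)) / 2
    b <- h_new + (h_new - h_old) * (u - 1) / u_new
    h_old <- h_new; u <- u_new
  }
  b
}

A warm-started logarithmic grid as described in Section 2.2 could then, for instance, be generated by exp(seq(log(lambda_max), log(0.001 * lambda_max), length.out = 50)), with lambda_max computed beforehand.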
2.5 Application to Genome-Based Association Analysis
Trait expression is often associated with genetics. The link between genetic and phenotypic variation can be elucidated with genome-based association analysis, for which molecular markers provide useful information. The most common form of a molecular marker is the single nucleotide polymorphism (SNP). Such a marker bears only two variants, leading to three different genotypes at each site in a diploid organism. Then, in a genome-wide regression model as in (1), the phenotype y is regressed onto the observed genotype at p SNPs distributed over the whole genome. Hence, $x_{ij} \in \{0, 1, 2\}$ for individual i = 1, ..., n and SNP j = 1, ..., p. Linkage and linkage disequilibrium (LD) between markers can cause extremely high correlation among predictor variables, which typically follows a block structure. Thus, the challenge of a genome-wide regression approach is to identify the causal sites in genomic regions of high LD. Based on simulated data, we compared the performance of fitSGL to the lasso, GL, SGL, and EN, which are often used in animal and plant breeding (e.g., [11], [12]).
2.6 Simulation Study
We conducted an extensive simulation study to evaluate characteristics of the proposed fitSGL. The data resembled a dairy cattle population for which genome-based evaluations drive the breeding success.

To generate a realistic amount of LD between SNPs, genotype data were simulated with the software AlphaSim version 1.05, which is suited to breeding populations and is now integrated in the R package "AlphaSimR" [13]. We let the software simulate two chromosomes with a length of 100 centimorgan each. Each of the chromosomes consisted of 1,660 SNPs, giving a total of p = 3,320. As proposed by default, 6 consecutive generations were simulated. Each generation consisted of 200,000 individuals, half of which were females. After random mating in generations 1-3, 200 males with high performance were mated to 1,000 dams in each of the generations 4 and 5. This mating scheme led to a half-sib family structure which is typical in livestock. We then used the data of the last two generations to set up 100 experiments for a comprehensive statistical analysis. For each experiment, we randomly picked 10 out of the 200 sires from generation 5. The corresponding 10,000 offspring in generation 6 were split into training and validation sets. In scenario (A), where p > n, the training data consisted of 1,000 individuals (100 progeny of each sire). The remaining 9,000 half sibs formed the validation data. In scenario (B), where p < n, the roles of training and validation data were reversed, i.e., the training data consisted of 9,000 individuals, whereas the validation data were formed by the remaining 1,000 offspring. Further, an additional set of 10,000 half sibs was simulated for each experiment. This set was split into 9,000 and 1,000 individuals as well, which served as independent test set for scenario (A) and (B), respectively.

In order to simulate the vector of features b, we assumed that genetic effects appear in groups of highly correlated SNPs. We first picked all 200,000 individuals of generation 5 and clustered the genotypes with respect to LD using the R package "BALD" version 0.2.1, as in [14]. This led to a total of 106 and 268 groups for the first and second chromosome, respectively. We then randomly picked groups from both chromosomes, either 1, 3, or 9 from each. These groups were allowed to harbor quantitative trait loci (QTL) with non-zero effect on y. The features corresponding to all remaining groups were set to 0. Furthermore, we divided the scenarios according to the proportion of simulated QTL, i.e., we allowed either 1/3, 2/3 or all of the SNPs inside the QTL groups to have a non-zero effect. The effects were sampled either from a Gamma distribution with shape parameter 0.42, rate $(0.42\, n_{\mathrm{QTL}})^{1/2}$ (with $n_{\mathrm{QTL}}$ the total number of simulated QTL) and randomly drawn sign, or from a Normal distribution with mean 0 and variance $(0.99\, n_{\mathrm{QTL}})^{-1}$.

An individual's GBV was determined by its genotype and the vector of features:

$$\mathrm{GBV}_i = \sum_{j=1}^{p} x_{ij} b_j.$$

We then simulated the phenotypes of each offspring by adding a residual to the GBV. The ratio of the variance of GBV ($\sigma_a^2$) to the phenotypic variance ($\sigma_y^2$) constitutes the heritability ($h^2$). For a range of $h^2 \in \{0.1, 0.3, 0.5\}$, the variance of the error term e was determined by $\sigma_e^2 = \sigma_a^2 (1 - h^2)/h^2$. Due to variations in the number of QTL groups per chromosome, the proportion of QTL per group, and the heritability, each of the categories (A) p > n and (B) p < n consisted of 54 individual settings, which are summarized in Supplemental File 1. The simulated data are available online at https://dx.doi.org/10.22000/432. Due to the simulated family stratification, we applied a family-wise centering of the genotypic and phenotypic data in each experiment prior to the evaluation.
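As a sketch of the phenotype simulation described above, the following R code derives GBVs from the genotypes and simulated effects, sets the residual variance according to a target heritability, and applies a family-wise centering of the phenotype. All names are illustrative assumptions for this example.

# geno: n x p matrix of 0/1/2 genotype codes; b_true: simulated effects;
# h2: target heritability; family: vector of sire family labels.
simulate_phenotypes <- function(geno, b_true, h2, family) {
  gbv <- as.vector(geno %*% b_true)          # GBV_i = sum_j x_ij * b_j
  var_a <- var(gbv)                          # genetic variance sigma_a^2
  var_e <- var_a * (1 - h2) / h2             # residual variance from heritability
  y <- gbv + rnorm(length(gbv), mean = 0, sd = sqrt(var_e))
  # Family-wise centering of the phenotype (genotypes can be treated analogously).
  y_centered <- y - ave(y, family)
  list(gbv = gbv, y = y, y_centered = y_centered)
}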
2.7 Evaluation Criteria
Since EN, SGL, and fitSGL are all convex combinations of two penalties, the mixing parameter α needed to be specified in advance; it was set to 0.5. Each regularization path consisted of estimated features alongside 50 values for the penalization parameter λ. The solutions were estimated using the training data. The evaluation criterion to select one of the 50 solutions was based on the minimal mean squared error (MSE) of predicted GBV in the corresponding validation data. The performance of the methods was evaluated in terms of the precision of the predicted GBV $X\hat{b}$, i.e., the correlation of predicted GBV and simulated phenotypic values within the independent test set.

Additionally, we assessed the quality of fitSGL and the comparative methods with respect to their ability to detect trait-associated sites on the genome. Once the solution with minimal MSE was determined for each experiment and in each scenario, we calculated the sensitivity, the specificity, the positive predictive value (PPV), the negative predictive value (NPV), and the accuracy (ACC). All these measures were based on the determination of true and false positives (TP, FP) and true and false negatives (TN, FN).
Due to the proximity of SNPs, the appearance of at least locally strong correlations among predictor variables can lead to the identification of putatively causal sites only because a SNP was in high LD with a simulated QTL. The computation of the binary measures (TP, FP, TN, FN) needs to account for this association phenomenon. Two options are obvious, similar to [14]. First, the proximity-associated SNP: we considered a small interval around every QTL, i.e., the QTL itself and two SNPs to the left and to the right. If an algorithm identified any SNP from inside this window as causal, we let this count as a true positive. Second, the group-associated SNP: since the simulation of QTL was already based on prior grouping of SNPs, we defined a true positive result if an algorithm correctly identified any SNP of the group of SNPs in which a QTL was located. Fig. 1 gives a visual impression of both approaches.

Fig. 1. Schematics of LD matrices to illustrate the difference between a proximity-associated SNP (turquoise area in the left panel) and a group-associated SNP (turquoise area in the right panel). Dark blue marks a simulated QTL. Toy example with p = 15 SNPs. Group sizes are 5, 7, and 3.

At last, we compared the ability of all methods to correctly identify the best performing individuals. For that, we took the top 10% performing individuals based on the simulated and predicted GBV and determined the intersection of both sets.

The analyses were performed with R version 4.1.0 [15].
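The binary measures themselves are straightforward once TP, FP, TN, and FN have been counted according to one of the two association rules above; a minimal R helper could look as follows.

# Sensitivity, specificity, PPV, NPV, and accuracy from the four counts.
binary_measures <- function(TP, FP, TN, FN) {
  c(sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    PPV         = TP / (TP + FP),
    NPV         = TN / (TN + FN),
    ACC         = (TP + TN) / (TP + FP + TN + FN))
}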
2.8 Impact of Mixing Parameter
We selected a single simulation scenario to investigate the influence of the choice of α on performance. This particular setting closely resembled the dairy trait "milk protein percentage", with QTL information available from the Cattle QTL database [16]. The QTL distribution of this trait approximately corresponded to the simulated setting where 3 groups of QTL were present per chromosome and 1/3 of the SNPs within such QTL groups had a simulated non-zero effect. As heritability we chose 0.3. We let α ∈ {0.1, 0.5, 0.9}.

2.9 Scalability
We analyzed real dairy cattle data from [17], retrieved from Dryad (https://doi.org/10.5061/dryad.cs133), to demonstrate the scalability of "seagull" version 1.1.0 and to compare its computation times to the established R package "glmnet" version 2.0-18. The dataset consisted of marker genotypes at p = 164,312 sites distributed over 29 chromosomes for n = 1,092 individuals. As phenotype we used the fat-percentage average from day 1 to day 305 of lactation. All methods except the lasso and EN required grouping of the genotypic data. We calculated the squared correlation between SNP genotypes, i.e., $r^2_{ij} = \mathrm{corr}(x_{\cdot i}, x_{\cdot j})^2$ for i, j = 1, ..., p, and used this as a measure of similarity among SNPs. Grouping was performed using the R package "adjclust" version 0.6.3, which is the follow-up implementation of "BALD". We selected L = 200 groups per chromosome, similar to our simulation study with 106 and 268 groups of SNPs on the first and second chromosome, respectively, in the generation of ancestors (Section 2.6). Additionally, we reduced the number of groups by an order of magnitude to examine its impact on computation time.
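The similarity measure used for grouping can be computed in base R as sketched below; the clustering step itself is then delegated to an adjacency-constrained hierarchical clustering such as the one provided by "adjclust" (its call is not reproduced here).

# Squared correlation r2_ij = corr(x_.i, x_.j)^2 between SNP genotypes,
# computed per chromosome as a similarity measure.
# geno: n x p matrix of 0/1/2 genotype codes for one chromosome.
snp_similarity <- function(geno) {
  cor(geno)^2
}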

3 RESULTS
The presented results are based on the 27 scenarios where effects were simulated via a Gamma distribution. The results based on a Normal distribution were mainly similar. Among all algorithms, the GL was most sensitive to this change of effect sampling. In particular, the GL showed high precision of predicted GBVs if effects were sampled from a Normal distribution and when every SNP within the QTL groups had a non-zero effect.

3.1 Precision of Prediction
With respect to the precision of GBV prediction, the novel method fitSGL outperformed the lasso in all 27 settings for scenario (A) and in 20 settings for scenario (B). The maximum improvement of the mean correlation between simulated phenotypic values and predicted GBVs over 100 experiments was 2.81% and 0.20% for (A) and (B), respectively. However, compared to the lasso, the fitSGL lost advantages in scenario (B) with increasing heritability.

The fitSGL outperformed EN in 27 settings in (A) and 26 settings in (B), with maximum average improvements of 2.34% and 0.25%, respectively. In (B), EN performed better than fitSGL on average only in the case where a single group of simulated QTL was present and the QTL coverage within this group was 100%.

The fitSGL delivered improved results compared to GL in 24 settings in (A) and 27 settings in (B). The respective improvements in mean correlation went up to 10.76% in (A) and 3.82% in (B). The scenarios where GL performed better than fitSGL had either 3 or 9 groups of simulated QTL per chromosome, or the heritability was 10%.

Compared to SGL, the fitSGL had higher correlations in 21 settings for both (A) and (B), with maximum average improvements of 5.79% and 2.09%, respectively. We found a gradual decrease of the advantage of fitSGL if either the heritability decreased or the total number of simulated QTL increased.

Fig. 2 shows means and standard errors of the deviation of the correlations between simulated phenotypic value and predicted GBV of the lasso, GL, SGL, and EN from fitSGL in percent for 9 out of 27 simulation settings. A negative value indicates an improvement of the precision of prediction with fitSGL compared to another method. Thus, the displayed means reflect tendencies.

Fig. 2. Deviation of the lasso (yellow green), group lasso (GL, turquoise green), sparse-group lasso (SGL, blue), and elastic net (EN, purple) from the fitSGL in percent, displayed via means and standard errors. Positive values indicate an improvement over fitSGL. Left column: results from category (A) p > n; right column: category (B) p < n. Top row: 1 group of simulated QTL per chromosome; middle: 3 groups; bottom: 9 groups. The effects were sampled from a Gamma distribution. The heritability of the trait was equal to 0.1. The 3 consecutive results for each method represent different levels of within-group sparsity, from left to right: 1/3, 2/3, or all of the SNPs within QTL groups had a non-zero effect. Results are based on correlations between simulated phenotypic and predicted genetic values within the test data from 100 experiments.

3.2 Binary Statistical Measures
Fig. 3 shows average values for sensitivity, specificity, PPV, NPV, and ACC over 100 experiments for 3 different settings for each category (A) and (B) using proximity-based measures. The larger any of these values was, the better. We chose radar plots to visualize these measures, as the total covered area within such a plot gives an impression of a method's overall performance. We noticed that, even though EN was not superior based on any single criterion, its overall performance was very good. This was indicated by the area it covered within each radar plot of Fig. 3. If based on proximity measures, the calculated area covered by EN was larger than that of any other method in the vast majority of cases. However, the differences between EN and the lasso were also marginal. If the covered area was based on group measures, the lasso was the peak performer compared to all other methods.

Fig. 3. Average performance of the different methods in 6 different scenarios using proximity-based measures. The following measures are shown: Sens. = Sensitivity = TP/(TP+FN); Spec. = Specificity = TN/(TN+FP); PPV = Positive Predictive Value = TP/(TP+FP); NPV = Negative Predictive Value = TN/(TN+FN); ACC = Accuracy = (TP+TN)/(TP+FP+TN+FN). Left column: results from category (A) p > n; right column: category (B) p < n. Top row: 1 group of simulated QTL per chromosome; middle: 3 groups; bottom: 9 groups. One third of the SNPs within these QTL groups had a simulated non-zero effect. The effects were sampled from a Gamma distribution. The heritability of the trait was equal to 0.1. Results are based on 100 experiments.

In general, by comparing average TP, FN, TN, and FP, we observed a superiority of fitSGL for TP and FN but an inferiority with respect to TN and FP. With proximity-based measures, fitSGL showed both the highest means for TP as well as the lowest means for FN in 25 settings of category (A) and in 23 settings of category (B). When group-based measures were used, these values changed to 24 and 22, respectively. Since the sensitivity is calculated from TP and FN, the respective results were very similar. Furthermore, we observed that the smaller the proportion of simulated QTL was, the higher the chances were for another method to perform just as well as fitSGL.

Though the results for NPV were very similar to those for sensitivity, we observed different patterns for specificity, PPV, and ACC. In the case of proximity-based measures, the lasso performed better than any other method. With group-based measures, the best results were obtained with the lasso and SGL.
3.3 Identification of Best Performing Individuals
In scenario (A), fitSGL correctly identified most of the 10% best performing individuals based on their predicted GBV in 18 out of 27 settings. In 15 out of these 18 settings, either only a single group or 3 groups of simulated QTL were present on each chromosome. SGL outperformed all other methods in 8 settings, and the lasso in a single setting. The two best performing methods were SGL and fitSGL, where the average overlap of fitSGL was at least 2.2% greater than that of SGL in settings with fewer than 9 groups of simulated QTL per chromosome. In settings with 9 groups of QTL, both methods performed almost equally well. The range of correctly identified individuals was [45.5%, 85.8%] for SGL and [43.3%, 88.4%] for fitSGL.

Similarly, fitSGL performed better than any other method in 13 out of 27 settings in scenario (B). In only 2 out of these 13 settings, 9 groups of QTL were simulated on each chromosome. The lasso reached peak performance in 9 settings and SGL in 5. All of SGL's best performances were found in settings with 9 groups of QTL per chromosome, but there the average overlap of fitSGL and SGL was the same. In settings with either 1 or 3 groups of simulated QTL per chromosome, the lasso and fitSGL performed best and with equal overlap. The proportion of correctly assigned top 10% performing individuals ranged from 63.5% to 95.2% (lasso), from 65.6% to 93.1% (SGL), and from 63.7% to 95.3% (fitSGL).

Figures displaying the average performance for the remaining settings are provided in Supplementary Files 2 and 3.

3.4 Impact of Mixing Parameter
FitSGL outperformed EN and SGL with respect to the MSE and the correlations of predicted GBVs independently of the choice of α. However, the relative difference between the methods gradually diminished as α increased towards the lasso penalty. For example, in scenario (A) and α = 0.1, fitSGL performed on average 1.92% and 4.48% better than EN and SGL, respectively. With α = 0.5 these values changed to 0.56% and 4.16%, and finally to 0.09% and 0.31% for α = 0.9. With respect to the binary statistical measures, we noticed the tendency that if α increased, the numbers of TN and FN also increased, whereas the numbers of TP and FP decreased. Hence, the sensitivity decreased and the specificity increased. These observations were independent of the n/p ratio and of whether proximity- or group-based measures were considered. Furthermore, we observed that the identification of the best performing individuals improved with increasing α. In scenario (A), for instance, the average overlap of correctly identified individuals from EN started at 67.9%, increased to 73.0%, and further to 73.8% for the largest α. The respective values for SGL were 64.9%, 67.4%, and 73.2%. For fitSGL, 70.8% was observed for α = 0.1 and 74.0% for both α = 0.5 and α = 0.9. However, in scenario (A), we found the peak performance of fitSGL with respect to the MSE and the correlation of predicted GBVs at α = 0.5.
3.5 Scalability
The lasso was the only algorithm available in both packages, seagull and glmnet, allowing a direct comparison. The time to calculate the full regularization path was 24 seconds for glmnet and 2 hours and 27 minutes for seagull. EN from glmnet also required 24 seconds. The remaining methods from seagull, i.e., GL, SGL, and fitSGL, needed 1 h 50 min, 2 h 6 min, and 1 h 3 min, respectively, when 200 groups of SNPs were present per chromosome. However, if 20 groups were present, the respective numbers changed to 1 h 53 min, 2 h 8 min, and 10 h 14 min. Thus, the computational time for GL and SGL apparently did not depend on the number of groups per chromosome, whereas fitSGL was heavily sensitive to it. FitSGL relies strongly on matrix algebra within groups, but there is another major factor influencing the speed of calculations within the seagull package: acceleration was not implemented for the lasso, GL, and SGL algorithms but only for fitSGL (step 7 in Algorithm 1). This explains, first, why fitSGL with 200 groups per chromosome ran faster than the other methods from seagull and, second, why the other algorithms from seagull ran slower than the lasso and EN from glmnet. More examples of the scalability of seagull in real data applications can be found in [18].
4 DISCUSSION
We introduced fitSGL as a novel penalization approach for estimating the vector of features b. FitSGL was designed for correlated predictor variables, which need to be grouped in advance. As an example, we verified its ability to predict genetic values of not-yet phenotyped individuals and to detect genomic regions associated with trait expression. We inspected its performance based on simulated data and observed that fitSGL was a competitive approach in many respects.

Just like EN and SGL regularization, the penalty of fitSGL is a convex combination of two terms, which are linked via the parameter α. Based on [19], we set this parameter to 0.5 for all of these methods to balance between the estimation and prediction error. We investigated the impact of this mixing parameter using a representative setting. If high accuracy for the prediction of the future performance of an individual and high rates of correctly identified best performing individuals are desired, then larger values of α are favorable, i.e., α ≥ 0.5.

To substantiate our objectives, fitSGL was evaluated with respect to two major criteria: (i) its precision in predicting the individuals' performance and, consequently, its potential to identify selection candidates for breeding objectives, and (ii) its ability to detect trait-associated sites. The first criterion would favor a larger weight on the penalty that harbors Xb, whereas for the second criterion a larger weight on the penalty for b would be advisable. By setting α to 0.5 for the fitSGL to support both perspectives equally, a strong competition is introduced between sparsity on the level of b and a more subtle sparsity on the level of Xb. As a direct consequence, the false positive rate is tremendously increased compared to methods which introduce sparsity only on the level of b.

We observed that the new method outperformed all other methods with respect to the above mentioned evaluation criteria (i) and (ii) whenever the simulated signal, i.e., the number of causal variants, was very sparse, indicating a strong dependence on the genetic architecture. The sensitivity of the estimation process towards single signals can be adjusted through α. However, this aspect requires more research in the future.

Another parameter to be mindful of is the regularization parameter $d_l$ from (7). In [5], it was suggested to set this parameter so that the degrees of freedom in (8) are equal among groups. However, in fitSGL this parameter is solely required for groups with rank deficiency. The above suggestion might therefore not fit perfectly: it would result in the same weight for every such group, and thus potentially lead to poor interpretability when compared to any group with full rank. Instead, since the columns of X were initially standardized to gain independence of the scale of X, we suggest $d_l = 1$ as a scale-free solution.

Furthermore, it is necessary to specify the step width t. This parameter determines the width between consecutive iterations. If a value too large is chosen, there is a chance that at a certain point the algorithm jumps over the optimal solution. If it is chosen too small, the changes of the solution from one iteration to the next might be small enough to indicate convergence, even though the solution is still far from optimal. During preliminary investigations, we tested different values for t and found that values ≥ 1 caused unstable behaviour of the MSE. Thus, we propose to reduce t by an order of magnitude, i.e., t = 0.1.

GL, SGL, and fitSGL require additional grouping information, which can be obtained with one of the various clustering algorithms. In genome-based evaluations, the ordering of predictor variables is determined by the physical coordinates of markers on a chromosome. Hence, we applied adjclust, which is a hierarchical clustering procedure allowing only adjacent clusters to merge. The outcome is a tree structure, but we selected a fixed number of groups per chromosome in the real data analysis only to demonstrate the scalability of the methods. In a practical application, however, the optimal number of groups should be selected based on an objective criterion such as the gap statistic (as implemented in the BALD package) or the slope heuristics of capushe (also available in adjclust).

5 CONCLUSION
We implemented a lasso-type penalization approach which accounts not only for sparsity of signals but also for sparsity of fitted values. As fitted values are of particular importance for breeding applications, we validated the new approach "fitSGL" for its use in a genome-wide regression analysis. If only few regions associated with trait expression exist on the genome, our method proved beneficial, especially if p > n. The lower the impact of regressors on trait expression is, the more difficult it is to identify the causal signals per se. FitSGL performed best under such circumstances. In the other investigated scenarios, the novel method was still competitive with the other penalization approaches, often being closest in its performance to the sparse-group lasso. We extended our R package "seagull" (available at CRAN) to include fitSGL.

ACKNOWLEDGMENTS
The funder had no role in the design of the study, the collection, analysis, and interpretation of data, or the writing of the manuscript. We thank the two anonymous reviewers for their constructive and helpful comments.
REFERENCES
[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996, doi: 10.2307/2346178.
[2] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. B, vol. 68, no. 1, pp. 49–67, Feb. 2006, doi: 10.1111/j.1467-9868.2005.00532.x.
[3] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. B, vol. 67, no. 2, pp. 301–320, Apr. 2005, doi: 10.1111/j.1467-9868.2005.00503.x.
[4] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, "A sparse-group lasso," J. Comput. Graph. Statist., vol. 22, no. 2, pp. 231–245, Apr. 2013, doi: 10.1080/10618600.2012.681250.
[5] N. Simon and R. Tibshirani, "Standardization and the group lasso penalty," Statistica Sinica, vol. 22, no. 3, pp. 983–1001, Jul. 2012, doi: 10.5705/ss.2011.075.
[6] Y. Yu, "Better approximation and faster algorithm using the proximal average," in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 458–466.
[7] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, Jan. 2009, doi: 10.1137/080716542.
[8] N. Parikh and S. Boyd, "Proximal algorithms," FNT Optim., vol. 1, no. 3, pp. 127–239, 2014, doi: 10.1561/2400000003.
[9] H. H. Bauschke, R. Goebel, Y. Lucet, and X. Wang, "The proximal average: Basic theory," SIAM J. Optim., vol. 19, no. 2, pp. 766–785, Jan. 2008, doi: 10.1137/070687542.
[10] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa, "Solving structured sparsity regularization with proximal methods," in Proc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases, Part II, 2010, pp. 418–433, doi: 10.1007/978-3-642-15883-4_27.
[11] Z. Li and M. J. Sillanpää, "Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection," Theor. Appl. Genet., vol. 125, no. 3, pp. 419–435, Aug. 2012, doi: 10.1007/s00122-012-1892-9.
[12] Z. A. Desta and R. Ortiz, "Genomic selection: Genome-wide prediction in plant improvement," Trends Plant Sci., vol. 19, no. 9, pp. 592–601, Sep. 2014, doi: 10.1016/j.tplants.2014.05.006.
[13] J. M. Hickey and G. Gorjanc, "Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods," G3 (Bethesda), vol. 2, no. 4, pp. 425–427, Apr. 2012, doi: 10.1534/g3.111.001297.
[14] A. Dehman, C. Ambroise, and P. Neuvial, "Performance of a blockwise approach in variable selection using linkage disequilibrium information," BMC Bioinf., vol. 16, no. 1, May 2015, Art. no. 148, doi: 10.1186/s12859-015-0556-6.
[15] R Core Team, R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2021. [Online]. Available: https://www.R-project.org/
[16] Z.-L. Hu, C. A. Park, and J. M. Reecy, "Building a livestock genetic and genomic information knowledgebase through integrative developments of animal QTLdb and CorrDB," Nucleic Acids Res., vol. 47, no. D1, pp. D701–D710, Jan. 2019, doi: 10.1093/nar/gky1084.
[17] Z. Chen, Y. Yao, P. Ma, Q. Wang, and Y. Pan, "Haplotype-based genome-wide association study identifies loci and candidate genes for milk yield in Holsteins," PLoS One, vol. 13, no. 2, Feb. 2018, Art. no. e0192695, doi: 10.1371/journal.pone.0192695.
[18] J. Klosa, N. Simon, P. O. Westermark, V. Liebscher, and D. Wittenburg, "Seagull: Lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent," BMC Bioinf., vol. 21, no. 1, Sep. 2020, Art. no. 407, doi: 10.1186/s12859-020-03725-w.
[19] T. Hastie, R. Tibshirani, and J. Friedman, "High-dimensional problems: p >> N," in The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Berlin, Germany: Springer, 2009.

Jan Klosa received the MSc degree in mathematics from the Technical University of Brunswick, Germany, in 2012. From 2013 to 2014 he was employed with the University of Uppsala, Sweden. He joined the Research Institute for Farm Animal Biology (FBN) in Dummerstorf, Germany, in 2014. He is engaged in research on statistical and numerical methods for studying the genotype-to-phenotype association in livestock populations.

Noah Simon received the BA degree in mathematics from Pomona College, Claremont, CA, in 2008, and the PhD degree in statistics from Stanford University, Stanford, CA, in 2013. Since 2013 he has been employed as professor with the Department of Biostatistics, University of Washington, Seattle, WA. He works on the development of statistical methodology for prediction and inference with high-dimensional and/or complex data, as well as the design of adaptive clinical trials. He additionally works on collaborative problems in cardiology, oncology, and cystic fibrosis, among other areas.

Volkmar Liebscher received the graduate degree in mathematics in 1991 and the PhD degree from the Friedrich-Schiller-University of Jena in 1994, in the area of quantum probability. After a postdoc position in Jena ending with his habilitation, he turned in 1998 to the area of biomathematics at the GSF Research Centre for Environment and Health (now HMGU) in Munich. Since 2005 he has held the chair of biomathematics, and since 2017 the chair of biomathematics and statistics, at the University of Greifswald. His research interests include stochastic modelling of biological processes, biostatistical methods for expression data, and the development of statistical methods in image analysis, curve estimation, and robust statistics.

Dörte Wittenburg received the diploma degree in business mathematics from the University of Rostock, Germany, in 2005, and the doctoral degree in biomathematics from the University of Greifswald, Germany, in 2008. She was a postdoctoral fellow with the Research Institute for Farm Animal Biology (FBN) in Dummerstorf, Germany, until she became a group leader in statistical genomics with the FBN in 2013. Her research interests include statistical modelling of genetic effects on quantitative traits with special focus on dependencies between genomic markers in breeding populations.