A Fitted Sparse-Group Lasso For Genome-Based Evaluations
Abstract—In life sciences, high-throughput techniques typically lead to high-dimensional data and often the number of covariates is
much larger than the number of observations. This inherently comes with multicollinearity challenging a statistical analysis in a linear
regression framework. Penalization methods such as the lasso, ridge regression, the group lasso, and convex combinations thereof,
which introduce additional conditions on regression variables, have proven themselves effective. In this study, we introduce a novel
approach by combining the lasso and the standardized group lasso leading to meaningful weighting of the predicted (“fitted”) outcome
which is of primary importance, e.g., in breeding populations. This “fitted” sparse-group lasso was implemented as a proximal-averaged
gradient descent method and is part of the R package “seagull” available at CRAN. For the evaluation of the novel method, we
executed an extensive simulation study. We simulated genotypes and phenotypes which resemble data of a dairy cattle population.
Genotypes at thousands of genomic markers were used as covariates to fit a quantitative response. The proximity of markers on a
chromosome determined grouping. In the majority of simulated scenarios, the new method revealed improved prediction abilities
compared to other penalization approaches and was able to localize the signals of simulated features.
of the lasso helps detect trait-associated sites on the genome. Thus, for evaluation purposes, we applied the novel method to a large variety of simulated scenarios resembling data from a dairy cattle population. The scenarios differed in terms of sample size and features of causal variants influencing trait expression. The outcome was compared to that of other lasso-type penalization approaches.

has not been published yet. To solve this, we followed the technique described in [5] and adapted it to fit the PGD framework, resulting in a joint update of regression coefficients with respect to step width $t$. This approach consists, at its core, of (compact) singular value decompositions of the matrices $X^{(l)}$ and a subsequent transformation into orthogonal variables. The transformation leads to the GL with PGD update according to [8], [10]. Eventually, a back transformation leads to the desired update in iteration $m + 1$ (with $m = 0, 1, \ldots$),

with $d_{l,i}$ being the $i$-th singular value of $X^{(l)}$. Note that (7) coincides with (4) if $\delta_l = 0$. By introducing the following expressions
$$\tilde{y} = \begin{pmatrix} y \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^{n+p}, \qquad \tilde{X}^{(l)} = \begin{pmatrix} X^{(l)} \\ 0 \\ \vdots \\ 0 \\ \delta_l I \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^{(n+p) \times p_l}$$

$1, \ldots, p$. Linkage and linkage disequilibrium (LD) between markers can cause extremely high correlation among predictor variables which typically satisfy a block structure. Thus, the challenge of a genome-wide regression approach is to identify the causal sites in genomic regions of high LD. Based on simulated data, we compared the performance of fitSGL to the lasso, GL, SGL, and EN often used in animal and plant breeding (e.g., [11], [12]).
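The augmented expressions $\tilde{y}$ and $\tilde{X}^{(l)}$ can be illustrated with a minimal NumPy sketch. It assumes that the $\delta_l I$ block sits in the rows matching the column positions of group $l$ among the $p$ appended rows; the function name `augment_group` and its arguments are hypothetical and not part of the seagull package. Whatever the row placement, the block structure implies $\tilde{X}^{(l)\top} \tilde{X}^{(l)} = X^{(l)\top} X^{(l)} + \delta_l^2 I$, which is the ridge-type regularization of the group's Gram matrix.

```python
import numpy as np


def augment_group(X_l, y, group_cols, p, delta_l):
    """Stack y and one group's design matrix with p additional rows.

    X_l        : (n, p_l) columns of X that belong to group l
    group_cols : positions of these columns within the full design matrix
    p          : total number of predictors
    delta_l    : regularization parameter of group l
    """
    n, p_l = X_l.shape
    # p extra rows: zero everywhere except delta_l * I in the rows that
    # correspond to the columns of group l (assumed placement).
    extra = np.zeros((p, p_l))
    extra[np.asarray(group_cols), np.arange(p_l)] = delta_l
    X_tilde = np.vstack([X_l, extra])            # shape (n + p, p_l)
    y_tilde = np.concatenate([y, np.zeros(p)])   # shape (n + p,)
    return X_tilde, y_tilde


# Small illustration with random data (values are arbitrary).
rng = np.random.default_rng(0)
n, p = 5, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
group_cols = [2, 3]
X_l = X[:, group_cols]
X_tilde, y_tilde = augment_group(X_l, y, group_cols, p, delta_l=1.0)

# Difference of Gram matrices equals delta_l^2 * I.
gram_gap = X_tilde.T @ X_tilde - X_l.T @ X_l
```

With `delta_l = 0` the appended rows vanish, which mirrors the remark that the regularized update coincides with the unregularized one in that case.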
$$\mathrm{GBV}_i = \sum_{j=1}^{p} x_{ij} \beta_j .$$
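As a rough illustration of the GBV definition, the following sketch computes genomic breeding values as the product of a genotype matrix and a coefficient vector, and then the correlation with simulated phenotypes that the results section uses as its measure of prediction precision. The 0/1/2 genotype coding, the toy numbers, and the function names are assumptions made only for this example.

```python
import numpy as np


def predict_gbv(X, beta):
    """GBV_i = sum_j x_ij * beta_j for every individual i."""
    return np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)


def prediction_precision(y_sim, gbv):
    """Pearson correlation between simulated phenotypes and predicted GBVs."""
    return float(np.corrcoef(y_sim, gbv)[0, 1])


# Toy genotypes (rows: individuals, columns: markers), assumed 0/1/2 coding.
X = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [1, 2, 2]])
beta = np.array([0.5, 0.0, -1.0])   # sparse marker effects (hypothetical)
gbv = predict_gbv(X, beta)
```

A marker with a zero coefficient, such as the second one here, does not contribute to any GBV, which is how sparsity in $\beta$ translates into marker selection.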
3.1 Precision of Prediction
With respect to the precision of GBV prediction, the novel method fitSGL outperformed the lasso in all 27 settings for scenario (A) and in 20 settings for scenario (B). The maximum improvement of the mean correlation between simulated phenotypic values and predicted GBVs over 100 experiments was 2.81% and 0.20% for (A) and (B), respectively. However, compared to the lasso, the fitSGL lost advantages in scenario (B) with increasing heritability.

The fitSGL outperformed EN in 27 settings of (A) and 26 settings of (B), with maximum average improvements of 2.34% and 0.25%, respectively. In (B), EN performed better than fitSGL on average only in the case where a single group of simulated QTL was present and the QTL coverage within this group was 100%.

The fitSGL delivered improved results compared to GL in 24 settings of (A) and 27 settings of (B). The respective improvements in mean correlation went up to 10.76% in (A) and 3.82% in (B). The scenarios where GL performed better than fitSGL had either 3 or 9 groups of simulated QTL per chromosome, or the heritability was 10%.

Compared to SGL, the fitSGL had higher correlations in 21 settings for both (A) and (B), with maximum average improvements of 5.79% and 2.09%, respectively. We found a gradual decrease of advantage for fitSGL if either the heritability decreased or the total number of simulated QTL increased.

Fig. 2 shows means and standard errors of the deviation of correlations between simulated phenotypic value and predicted GBV of the lasso, GL, SGL, and EN from fitSGL in percent for 9 out of 27 simulation settings. A negative value indicates an improvement of precision of prediction with fitSGL compared to another method. Thus, the displayed means reflect tendencies.

3.2 Binary Statistical Measures
Fig. 3 shows average values for sensitivity, specificity, PPV, NPV, and ACC over 100 experiments for 3 different settings for each category (A) and (B) using proximity-based measures. The larger any of these values was, the better. We chose radar plots to visualize these measures, as the total covered area within such a plot gives an impression of a method's overall performance. We noticed that, even though EN was not superior based on a single criterion, its overall performance was very good. This was indicated by the area it covered within each radar plot of Fig. 3. If based on proximity measures, the calculated area covered by EN was larger than that of any other method in the vast majority of cases, although the differences between EN and the lasso were marginal. However, if the covered area was based on group measures, the lasso was the peak performer compared to all other methods.

In general, by comparing average TP, FN, TN, and FP, we observed a superiority of fitSGL for TP and FN but an inferiority with respect to TN and FP. With proximity-based measures, fitSGL showed both the highest means for TP and the lowest means for FN in 25 settings of category (A) and in 23 settings of category (B). When group-based measures were used, these values changed to 24 and 22, respectively. Since the sensitivity was calculated via TP and FN, the respective results were very similar. Furthermore, we observed that the smaller the proportion of simulated QTL was, the higher the chances were for another method to perform just as well as fitSGL.

Though results for NPV were very similar to those of sensitivity, we observed different patterns for specificity, PPV, and ACC. In the case of proximity-based measures, the lasso performed better than any other method. With group-based measures, best results were obtained with the lasso and SGL.

3.3 Identification of Best Performing Individuals
In scenario (A), fitSGL correctly identified most of the 10% best performing individuals based on their predicted GBV in 18 out of 27 settings. In 15 out of these 18 settings, either only a single group or 3 groups of simulated QTL were present on each chromosome. SGL outperformed all other methods in 8 settings, and the lasso in a single setting. The two best performing methods were SGL and fitSGL, where the average overlap of fitSGL was at least 2.2% greater than that of SGL in settings with fewer than 9 groups of simulated QTL per chromosome. In settings with 9 groups of QTL, both methods performed almost equally well. The range of correctly identified individuals was [45.5%, 85.8%] for SGL and [43.3%, 88.4%] for fitSGL.

Similarly, fitSGL performed better than any other method in 13 out of 27 settings in scenario (B). In only 2 out of these 13 settings, 9 groups of QTL were simulated on each chromosome. The lasso reached peak performance in 9 settings and SGL in 5. All of SGL's best performances were found in settings with 9 groups of QTL per chromosome, but the average overlap of fitSGL and SGL was the same. In settings with either 1 or 3 groups of simulated QTL per chromosome, the lasso and fitSGL performed best and with equal overlap. The proportion of correctly assigned top 10% performing individuals ranged from 63.5% to 95.2% (lasso), from 65.6% to 93.1% (SGL), and from 63.7% to 95.3% (fitSGL).

Figures displaying the average performance for the remaining settings are provided in Supplementary Files 2 and 3.

3.4 Impact of Mixing Parameter
FitSGL outperformed EN and SGL with respect to MSE and correlations of predicted GBVs independently of the choice of $\alpha$. However, the relative difference between methods gradually diminished with increasing $\alpha$, i.e., when approaching the lasso penalty. For example, in scenario (A) with $\alpha = 0.1$, fitSGL performed on average 1.92% and 4.48% better than EN and SGL, respectively. With $\alpha = 0.5$, these values changed to 0.56% and 4.16%, and finally to 0.09% and 0.31% for $\alpha = 0.9$. With respect to binary statistical measures, we noticed the tendency that if $\alpha$ increased, the numbers for TN and FN also increased, whereas the numbers for TP and FP decreased. Hence, the sensitivity decreased and the specificity increased. These observations were independent of the $n/p$-ratio and of whether proximity- or group-based measures were considered. Furthermore, we observed that the identification of the best performing individuals improved with increasing $\alpha$. In scenario (A), for instance, the average overlap of correctly identified individuals from EN started at 67.9%, increased to 73.0%, and further to 73.8% for the largest $\alpha$. The respective values for SGL were 64.9%, 67.4%,
Fig. 2. Deviation of the lasso (yellow green), group lasso (GL, turquoise green), sparse-group lasso (SGL, blue), and elastic net (EN, purple) from the
fitSGL in percent displayed via means and standard errors. Positive values indicate an improvement over fitSGL. Left column: results from category
(A) p>n, right column: category (B) p<n. Top row: 1 group of simulated QTL per chromosome, middle: 3 groups, bottom: 9 groups. The effects were
sampled from a Gamma distribution. The heritability of the trait was equal to 0.1. The 3 consecutive results for each method represent different levels
of within-group sparsity, from left to right: 1/3, 2/3, or all of the SNPs within QTL groups had non-zero effect. Results are based on correlations
between simulated phenotypic and predicted genetic values within the test data from 100 experiments.
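The quantity plotted in Fig. 2 can be sketched as follows. The snippet assumes that the deviation is the relative difference of a method's correlation from that of fitSGL, computed per experiment and then summarized by mean and standard error; whether the figure uses relative or absolute differences is not stated in this section, so this choice and the function name are assumptions.

```python
import numpy as np


def deviation_from_reference(cor_method, cor_reference):
    """Mean and standard error of the percent deviation from a reference.

    Both inputs hold one correlation per experiment. Positive deviations
    mean the method reached a higher correlation than the reference.
    """
    cor_method = np.asarray(cor_method, dtype=float)
    cor_reference = np.asarray(cor_reference, dtype=float)
    # Assumed definition: relative difference in percent, per experiment.
    dev = 100.0 * (cor_method - cor_reference) / cor_reference
    mean = dev.mean()
    # Standard error of the mean with the sample standard deviation.
    se = dev.std(ddof=1) / np.sqrt(dev.size)
    return float(mean), float(se)


# Hypothetical correlations from two experiments.
mean_dev, se_dev = deviation_from_reference([0.55, 0.60], [0.50, 0.60])
```

Under this convention a negative mean for, say, the lasso corresponds to an advantage of fitSGL, in line with the sign convention of the caption.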
and 73.2%. For fitSGL, 70.8% was observed for $\alpha = 0.1$ and 74.0% for both $\alpha = 0.5$ and $\alpha = 0.9$. However, in scenario (A), we found the peak performance of fitSGL with respect to the MSE and the correlation of predicted GBVs with $\alpha = 0.5$.

3.5 Scalability
The lasso was the only algorithm available in both packages seagull and glmnet, allowing a direct comparison. The time to calculate the full regularization path was 24 seconds for glmnet and 2 hours and 27 minutes for seagull. EN from
Fig. 3. Average performance of the different methods in 6 different scenarios using proximity-based measures. The following measures are shown: Sens. = Sensitivity = TP/(TP+FN), Spec. = Specificity = TN/(TN+FP), PPV = Positive Predictive Value = TP/(TP+FP), NPV = Negative Predictive Value = TN/(TN+FN), ACC = Accuracy = (TP+TN)/(TP+FP+TN+FN). Left column: results from category (A) p>n, right column: category (B) p<n. Top row: 1 group of simulated QTL per chromosome, middle: 3 groups, bottom: 9 groups. One third of the SNPs within these QTL groups had a simulated non-zero effect. The effects were sampled from a Gamma distribution. The heritability of the trait was equal to 0.1. Results are based on 100 experiments.
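The five measures listed in the caption of Fig. 3 follow directly from the counts TP, FP, TN, and FN. A minimal sketch is given below; how the counts themselves are obtained (proximity-based or group-based) is not covered, and the function name, the handling of empty denominators, and the example counts are assumptions for illustration only.

```python
def binary_measures(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, and accuracy from confusion counts."""

    def ratio(num, den):
        # Return NaN when a measure is undefined (empty denominator).
        return num / den if den > 0 else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for 100 markers, 10 of which are causal.
measures = binary_measures(tp=8, fp=5, tn=85, fn=2)
```

The example also shows why a method can combine high sensitivity with a modest PPV: a few additional false positives barely change specificity when non-causal markers dominate, but they weigh heavily against the small number of true positives.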
glmnet also required 24 seconds. The remaining methods from seagull, i.e., GL, SGL, and fitSGL, needed 1h 50min, 2h 6min, and 1h 3min, respectively, when 200 groups of SNPs were present per chromosome. However, if 20 groups were present, the respective numbers changed to 1h 53min, 2h 8min, and 10h 14min. Thus, the computational time for GL and SGL apparently did not depend on the number of groups per chromosome, whereas fitSGL was heavily sensitive to it. FitSGL relies strongly on matrix algebra within groups, but there is another major factor influencing the speed of calculations within the seagull package: acceleration was not implemented for the lasso, GL, and SGL algorithms but for fitSGL (step 7 in Algorithm 1). This explains, first, why the fitSGL with 200 groups per chromosome ran faster than other methods from seagull and, second, why other algorithms from seagull ran slower than the lasso and EN from glmnet. More examples about scalability of seagull in real data applications can be found in [18].

4 DISCUSSION
We introduced fitSGL as a novel penalization approach for estimating the vector of features $\beta$. FitSGL was designed for correlated predictor variables which need to be grouped in advance. As an example, we verified its ability to predict genetic values of not-yet phenotyped individuals and to detect genomic regions associated with trait expression. We inspected its performance based on simulated data and observed that fitSGL was a competitive approach in many respects.

Just like EN and SGL regularization, the penalty of fitSGL is a convex combination of two terms, which are linked via the parameter $\alpha$. Based on [19], we set this parameter to 0.5 for all of these methods to balance between the estimation and prediction error. We investigated the impact of this mixing parameter using a representative setting. If high accuracy for the prediction of the future performance of an individual and high rates of correctly identified best performing individuals were desired, then larger values of $\alpha$ were favorable, i.e., $\alpha \geq 0.5$.

To substantiate our objectives, fitSGL was evaluated with respect to the two major criteria: (i) its precision of predicting the individuals' performance and, consequently, its potential to identify selection candidates for breeding objectives, and (ii) the ability to detect the trait-associated sites. The first criterion would favor a larger weight on the penalty that harbors $X\beta$, whereas for the second criterion a larger weight on the penalty for $\beta$ would be advisable. By setting $\alpha$ to 0.5 for the fitSGL to support both perspectives equally, a strong competition is introduced between sparsity on the level of $\beta$ and a more subtle sparsity on the level of $X\beta$. As a direct consequence, the false positive rate is tremendously increased compared to methods which introduce sparsity only on the level of $\beta$.

We observed that the new method outperformed all other methods with respect to the above-mentioned evaluation criteria (i) and (ii) whenever the simulated signal, i.e., the number of causal variants, was very sparse, indicating a strong dependence on genetic architecture. The sensitivity of the estimation process towards single signals can be adjusted through $\alpha$. However, this aspect requires more research in the future.

Another parameter to be mindful of is the regularization parameter $\delta_l$ from (7). In [5], it was suggested to set this parameter so that the degrees of freedom in (8) are equal among groups. However, in fitSGL this parameter is solely required for groups with rank deficiency. The above suggestion might not fit perfectly. It would result in the same weight for every such group, and thus potentially lead to poor interpretability when compared to any group with full rank. Instead, since the columns of $X$ were initially standardized to gain independence of the scale of $X$, we suggest $\delta_l = 1$ as a scale-free solution.

Furthermore, it is necessary to specify the step width $t$. This parameter determines the width between consecutive iterations. If a value too large is chosen, there is a chance that at a certain point the algorithm jumps over the optimal solution. If chosen too small, the changes of the solution from one iteration to the next might be small enough to indicate convergence, even though the solution is still far from optimal. During preliminary investigations, we tested different values for $t$ and found that values of $t \geq 1$ caused unstable behaviour of MSE. Thus, we propose to reduce $t$ by an order of magnitude, i.e., $t = 0.1$.

GL, SGL, and fitSGL require additional grouping information which could be obtained with one of the various clustering algorithms. In genome-based evaluations, the ordering of predictor variables is determined by the physical coordinates of markers on a chromosome. Hence, we applied adjclust, which is a hierarchical clustering procedure allowing only adjacent clusters to merge. The outcome is a tree structure, but we selected a fixed number of groups per chromosome in real data analysis only to demonstrate scalability of methods. In a practical application, however, the optimal number of groups should be selected based on an objective criterion such as the gap statistic (as implemented in the BALD package) or the slope heuristics capushe (also available in adjclust).

We implemented a lasso-type penalization approach, which not only accounts for sparsity of signals but also of fitted values. As fitted values are of particular importance for breeding applications, we validated the new approach “fitSGL” for its use in a genome-wide regression analysis. If only few regions associated with trait expression exist on the genome, our method proved beneficial, especially if $p > n$. The lower the impact of regressors on trait expression is, the more difficult it is to identify the causal signals per se. FitSGL performed best under such circumstances. In other investigated scenarios, the novel method was still competitive with the other penalization approaches, often being closest in its performance to the sparse-group lasso. We extended our R package “seagull” (available at CRAN) to include fitSGL.

ACKNOWLEDGMENTS
The funder had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript. We thank the two anonymous reviewers for their constructive and helpful comments.

[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996, doi: 10.2307/2346178.
[2] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Roy. Statist. Soc. B, vol. 68, no. 1, pp. 49–67, Feb. 2006, doi: 10.1111/j.1467-9868.2005.00532.x.
[3] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Statist. Soc. B, vol. 67, no. 2, pp. 301–320, Apr. 2005, doi: 10.1111/j.1467-9868.2005.00503.x.
[4] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, “A sparse-group lasso,” J. Comput. Graph. Statist., vol. 22, no. 2, pp. 231–245, Apr. 2013, doi: 10.1080/10618600.2012.681250.
[5] N. Simon and R. Tibshirani, “Standardization and the group lasso penalty,” Statistica Sinica, vol. 22, no. 3, pp. 983–1001, Jul. 2012, doi: 10.5705/ss.2011.075.
[6] Y. Yu, “Better approximation and faster algorithm using the proximal average,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 458–466.
[7] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, Jan. 2009, doi: 10.1137/080716542.
[8] N. Parikh and S. Boyd, “Proximal algorithms,” FNT Optim., vol. 1, no. 3, pp. 127–239, 2014, doi: 10.1561/2400000003.
[9] H. H. Bauschke, R. Goebel, Y. Lucet, and X. Wang, “The proximal average: Basic theory,” SIAM J. Optim., vol. 19, no. 2, pp. 766–785, Jan. 2008, doi: 10.1137/070687542.
[10] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa, “Solving structured sparsity regularization with proximal methods,” in Proc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases, Part II, 2010, pp. 418–433, doi: 10.1007/978-3-642-15883-4_27.
[11] Z. Li and M. J. Sillanpää, “Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection,” Theor. Appl. Genet., vol. 125, no. 3, pp. 419–435, Aug. 2012, doi: 10.1007/s00122-012-1892-9.
[12] Z. A. Desta and R. Ortiz, “Genomic selection: Genome-wide prediction in plant improvement,” Trends Plant Sci., vol. 19, no. 9, pp. 592–601, Sep. 2014, doi: 10.1016/j.tplants.2014.05.006.
[13] J. M. Hickey and G. Gorjanc, “Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods,” G3 (Bethesda), vol. 2, no. 4, pp. 425–427, Apr. 2012, doi: 10.1534/g3.111.001297.
[14] A. Dehman, C. Ambroise, and P. Neuvial, “Performance of a blockwise approach in variable selection using linkage disequilibrium information,” BMC Bioinf., vol. 16, no. 1, May 2015, Art. no. 148, doi: 10.1186/s12859-015-0556-6.
[15] R Core Team, “R foundation for statistical computing,” 2021. [Online]. Available: https://www.R-project.org/
[16] Z.-L. Hu, C. A. Park, and J. M. Reecy, “Building a livestock genetic and genomic information knowledgebase through integrative developments of animal QTLdb and CorrDB,” Nucleic Acids Res., vol. 47, no. D1, pp. D701–D710, Jan. 2019, doi: 10.1093/nar/gky1084.
[17] Z. Chen, Y. Yao, P. Ma, Q. Wang, and Y. Pan, “Haplotype-based genome-wide association study identifies loci and candidate genes for milk yield in holsteins,” PLoS One, vol. 13, no. 2, Feb. 2018, Art. no. e0192695, doi: 10.1371/journal.pone.0192695.
[18] J. Klosa, N. Simon, P. O. Westermark, V. Liebscher, and D. Wittenburg, “Seagull: Lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent,” BMC Bioinf., vol. 21, no. 1, Sep. 2020, Art. no. 407, doi: 10.1186/s12859-020-03725-w.
[19] T. Hastie, R. Tibshirani, and J. Friedman, “High-dimensional problems: p>>N,” in The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Berlin, Germany: Springer,

Jan Klosa received the MSc degree in mathematics from the Technical University of Brunswick, Germany, in 2012. From 2013 to 2014 he was employed with the University of Uppsala, Sweden. He joined the Research Institute for Farm Animal Biology (FBN) in Dummerstorf, Germany, in 2014. He is engaged in research on statistical and numerical methods for studying the genotype-to-phenotype association in livestock populations.

Noah Simon received the BA degree in mathematics from Pomona College, Claremont CA, in 2008, and the PhD degree in statistics from the Stanford University, Stanford CA, in 2013. Since 2013 he has been employed as professor with the Department of Biostatistics, University of Washington, Seattle, WA. He works on the development of statistical methodology for prediction and inference with high dimensional and/or complex data, as well as the design of adaptive clinical trials. He additionally works on collaborative problems in cardiology, oncology, and cystic fibrosis among other areas.

Volkmar Liebscher received the graduate degree in mathematics, in 1991, and the PhD degree from the Friedrich-Schiller-University of Jena, in 1994, in the area of Quantum Probability. After a postdoc position in Jena ending with his habilitation, he turned in 1998 to the area of biomathematics at the GSF Research Centre for Environment and Health, now HMGU, in Munich. He has been chair of biomathematics since 2005, and chair of biomathematics and statistics since 2017, at the University of Greifswald. His research interests include stochastic modelling of biological processes, biostatistical methods for expression data and development of statistical methods in image analysis, curve estimation and robust statistics.

Dörte Wittenburg received the diploma degree in business mathematics from the University of Rostock, Germany, in 2005, and the doctoral degree in biomathematics from the University of Greifswald, Germany, in 2008. She was postdoctoral fellow with the Research Institute for Farm Animal Biology (FBN) in Dummerstorf, Germany, until she became a group leader in statistical genomics with the FBN in 2013. Her research interests include statistical modelling of genetic effects on quantitative traits with special focus on dependencies between genomic markers in breeding populations.