Abstract
Numerous variable selection methods rely on a two-stage procedure, where a sparsity-inducing penalty is used in the first stage to predict the support, which is then conveyed to the second stage for estimation or inference purposes. In this framework, the first stage screens variables to find a set of possibly relevant variables and the second stage operates on this set of candidate variables, to improve estimation accuracy or to assess the uncertainty associated with the selection of variables. We advocate that more information can be conveyed from the first stage to the second one: we use the magnitude of the coefficients estimated in the first stage to define an adaptive penalty that is applied at the second stage. We give the example of an inference procedure that greatly benefits from the proposed transfer of information. The procedure is precisely analyzed in a simple setting, and our large-scale experiments empirically demonstrate that actual benefits can be expected in much more general situations, with sensitivity gains ranging from 50 to 100% compared to the state of the art.
Notes
In their two-stage procedure, Liu and Yu (2013) also proposed to construct confidence regions and to conduct hypothesis testing by bootstrapping residuals. Their approach fundamentally differs from that of Wasserman and Roeder (2009), in that inference does not rely on the two-stage procedure itself, but on the properties of the estimator obtained in the second stage.
Though many sparsity-inducing penalties, such as the Elastic-Net, the group-Lasso, or the fused-Lasso, lend themselves to the approach proposed here, the paper is restricted to the simple Lasso penalty.
Note that there is no finite-sample exact permutation test in multiple linear regression (Anderson and Robinson 2001). A test based on partial residuals (under the null hypothesis regression model) is asymptotically exact for unpenalized regression, but it does not apply to penalized regression.
References
Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99(10), 6562–6566 (2002)
Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11(10), R106 (2010)
Anderson, M.J., Robinson, J.: Permutation tests for linear models. Austral. N. Z. J. Stat. 43(1), 75–88 (2001)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)
Balding, D.: A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7(10), 781–791 (2006)
Belloni, A., Chernozhukov, V.: Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547 (2013)
Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57(1), 289–300 (1995)
Boulesteix, A.L., Schmid, M.: Machine learning versus statistical modeling. Biom. J. 56, 588–593 (2014)
Bühlmann, P.: Statistical significance in high-dimensional linear models. Bernoulli 19, 1212–1242 (2013)
Candès, E., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35, 2313–2351 (2007)
Chatterjee, A., Lahiri, S.N.: Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Stat. 41(3), 1232–1259 (2013)
Chong, I.G., Jun, C.H.: Performance of some variable selection methods when multicollinearity is present. Chemom. Intel. Lab. Syst. 78(1–2), 103–112 (2005)
Cule, E., Vineis, P., De Iorio, M.: Significance testing in ridge regression for genetic data. BMC Bioinf. 12(372), 1–15 (2011)
Dalmasso, C., Carpentier, W., Meyer, L., Rouzioux, C., Goujard, C., Chaix, M.L., Lambotte, O., Avettand-Fenoel, V., Le Clerc, S., Denis de Senneville, L., Deveau, C., Boufassa, F., Debre, P., Delfraissy, J.F., Broet, P., Theodorou, I.: Distinct genetic loci control plasma HIV-RNA and cellular HIV-DNA levels in HIV-1 infection: the ANRS genome wide association 01 study. PLoS One 3(12), e3907 (2008)
Dudoit, S., Van der Laan, M.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Grandvalet, Y.: Least absolute shrinkage is equivalent to quadratic penalization. In: Niklasson L, Bodén M, Ziemke T (eds) ICANN’98, Perspectives in Neural Computing, vol 1, Springer, New York, pp. 201–206 (1998)
Grandvalet, Y., Canu, S.: Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in Neural Information Processing Systems 11 (NIPS 1998), MIT Press, Cambridge, pp. 445–451 (1999)
Halawa, A.M., El Bassiouni, M.Y.: Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 65(1), 341–356 (1999)
Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models, Monographs on Statistics and Applied Probability, vol. 43. Chapman & Hall, London (1990)
Huang, J., Horowitz, J.L., Ma, S.: Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 36(2), 587–613 (2008)
Kyung, M., Gill, J., Ghosh, M., Casella, G.: Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5(2), 369–411 (2010)
Liu, H., Yu, B.: Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electr. J. Stat. 7, 3124–3169 (2013)
Lockhart, R., Taylor, J., Tibshirani, R.J., Tibshirani, R.: A significance test for the lasso. Ann. Stat. 42(2), 413–468 (2014)
Meinshausen, N.: Relaxed lasso. Comput. Stat. Data Anal. 52(1), 374–393 (2007)
Meinshausen, N., Meier, L., Bühlmann, P.: \(p\)-values for high-dimensional regression. J. Am. Stat. Assoc. 104(488), 1671–1681 (2009)
Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.A., Grill, J., Frouin, V.: Variable selection for generalized canonical correlation analysis. Biostatistics 15(3), 569–583 (2014)
Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Verzelen, N.: Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electr. J. Stat. 6, 38–90 (2012)
Wang, Y., Yang, J., Yin, W., Zhang, W.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)
Wasserman, L., Roeder, K.: High-dimensional variable selection. Ann. Stat. 37(5A), 2178–2201 (2009)
Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 601–608 (2001)
Zhang, C.H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B 76(1), 217–242 (2014)
Acknowledgments
This work was supported by the UTC foundation for Innovation, in the ToxOnChip program. It has been carried out in the framework of the Labex MS2T (ANR-11-IDEX-0004-02) within the “Investments for the future” program, managed by the National Agency for Research.
Appendices
Variational equivalence
We show below that the quadratic penalty in \(\varvec{\beta }\) in Problem (2) acts as the Lasso penalty \(\lambda \left\| \varvec{\beta }\right\| _{1}\).
Proof
The Lagrangian of Problem (2) is:
Thus, the first-order optimality conditions for \(\tau _j\) are
where the term in \(\nu _j\) vanishes due to complementary slackness, which here implies \(\nu _j \tau _j^\star = 0\). Together with the constraints of Problem (2), the last equation implies \(\tau _j^\star = \left| \beta _{j}\right| \), hence Problem (2) is equivalent to
which is the original Lasso formulation. \(\square \)
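For intuition, the key inequality behind this equivalence can be checked directly. Assuming, as the derivation above suggests, that Problem (2) constrains \(\varvec{\tau }\) to satisfy \(\tau _j \ge 0\) and \(\sum _j \tau _j \le \sum _j \left| \beta _{j}\right| \), the Cauchy–Schwarz inequality yields, for any admissible \(\varvec{\tau }\) and any \(\varvec{\beta }\) with nonzero entries, \(\left( \sum _j \left| \beta _{j}\right| \right) ^2 = \left( \sum _j \frac{\left| \beta _{j}\right| }{\sqrt{\tau _j}} \sqrt{\tau _j}\right) ^2 \le \left( \sum _j \frac{\beta _{j}^2}{\tau _j}\right) \left( \sum _j \tau _j\right) \le \left( \sum _j \frac{\beta _{j}^2}{\tau _j}\right) \left( \sum _j \left| \beta _{j}\right| \right) \), so that \(\lambda \sum _j \beta _{j}^2/\tau _j \ge \lambda \left\| \varvec{\beta }\right\| _{1}\), with equality reached at \(\tau _j = \left| \beta _{j}\right| \) (zero coefficients are handled by a limiting argument).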
Efficient implementation
Permutation tests rely on the simulation of numerous data sets sampled under the null hypothesis distribution. The number of replications must be large to estimate the rather extreme quantiles we are typically interested in. Here, we use \(B=1000\) replications for the \(q=|\mathscr {S}_{\hat{\lambda }}|\) variables selected in the screening stage. Since each replication involves fitting a model, the total computational cost of solving these B systems of size q, for each of the q selected variables, is \(O(Bq(q^{3}+q^{2}n))\). However, block-wise decompositions and inversions bring computational savings by a factor of q.
First, recall that the adaptive-ridge estimate of the cleaning stage is
\(\hat{\varvec{\beta }} = \left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1} {\mathbf {X}}^{\top }\mathbf {y}\,,\)
where \(\varvec{{\varLambda }}\) is the diagonal adaptive-penalty matrix defined at the screening stage, whose jth diagonal entry is \({\lambda }/{\tau _{j}^\star }\), as defined in (1–3).
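To fix ideas, here is a minimal sketch of this computation in Python/NumPy (a sketch under stated assumptions, not the authors' implementation). The function name adaptive_ridge is ours, and we assume that \(\tau _{j}^\star \) is the magnitude of the jth coefficient estimated at the screening stage, in line with the variational argument of the previous appendix.

```python
import numpy as np

def adaptive_ridge(X, y, lasso_beta, lam):
    """Cleaning-stage estimate (sketch): solves (X'X + Lambda) beta = X'y,
    where Lambda is diagonal with entries lam / tau_j*.
    Assumption (ours): tau_j* = |lasso_beta_j|, the magnitude of the jth
    screening-stage coefficient; X holds only the q selected variables."""
    tau = np.abs(lasso_beta)            # screening-stage magnitudes tau_j*
    Lambda = np.diag(lam / tau)         # diagonal adaptive-penalty matrix
    A = X.T @ X + Lambda                # q x q penalized Gram matrix
    return np.linalg.solve(A, X.T @ y)  # adaptive-ridge coefficients
```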
In the F-statistic (4), the permutation affects the calculation of the larger model \(\hat{\mathbf {y}}_{1}\), which is denoted \(\hat{\mathbf {y}}_{1}^{(b)}\) for the bth permutation. Using a similar notation convention for the design matrix and the estimated parameters, we have \(\hat{\mathbf {y}}_{1}^{(b)}={\mathbf {X}^{(b)}}\hat{\varvec{\beta }}^{(b)}\). When testing the relevance of variable j, \({\mathbf {X}^{(b)}}\) is defined as the concatenation of the permuted variable \(\mathbf {x}_{j}^{(b)}\) and the other original variables: \({\mathbf {X}^{(b)}}= (\mathbf {x}_{j}^{(b)}, \mathbf {X}_{\backslash j}) = (\mathbf {x}_{j}^{(b)}, \mathbf {x}_1, ..., \mathbf {x}_{j-1}, \mathbf {x}_{j+1}, ..., \mathbf {x}_{p})\). Then, \(\hat{\varvec{\beta }}^{(b)}\) can be efficiently computed by using \(a^{(b)}\in \mathbb {R}\), \(\mathbf {v}^{(b)}\in \mathbb {R}^{q-1}\) and \(\hat{\varvec{\beta }}_{\backslash j}\in \mathbb {R}^{q-1}\) defined as follows:
Indeed, using the Schur complement, one writes \(\hat{\varvec{\beta }}^{(b)}\) as follows:
Hence, \(\hat{\varvec{\beta }}^{(b)}\) can be obtained as a correction of the vector of coefficients \(\hat{\varvec{\beta }}_{\backslash j}\) obtained under the smaller model. The key observation here is that \(\mathbf {x}_{j}^{(b)}\) does not appear in the expression \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\), which is the bottleneck in the computation of \(a^{(b)}\), \(\mathbf {v}^{(b)}\) and \(\hat{\varvec{\beta }}_{\backslash j}\). This inversion can therefore be performed once for all permutations. Additionally, \({(\mathbf {X}_{\backslash j}^\top \mathbf {X}_{\backslash j}+ {\varvec{{\varLambda }}_{\backslash j}})}^{-1}\) can be cheaply computed from \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) as follows:
Thus, we compute \(\left( {\mathbf {X}}^{\top }\mathbf {X}+ \varvec{{\varLambda }}\right) ^{-1}\) once, first correct it for the removal of variable j, then correct it for permutation b, eventually requiring \(O(B(q^{3}+q^{2}n))\) operations.
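To make the book-keeping concrete, the following Python/NumPy sketch applies this scheme to one tested variable; it is an illustration under the assumptions stated in the comments, not the authors' code, and the names permutation_betas, Lambda_diag and n_perm are placeholders. The vector v and the scalar Schur complement computed below correspond, up to inversion, to \(\mathbf {v}^{(b)}\) and \(a^{(b)}\) above.

```python
import numpy as np

def permutation_betas(X, y, Lambda_diag, j, n_perm=1000, seed=0):
    """Refit the cleaning-stage model for n_perm permutations of column j.

    X: n x q matrix of the q selected variables; Lambda_diag: diagonal of the
    adaptive-penalty matrix (entries lambda / tau_j*); j: tested variable.
    Returns the corrected coefficient vectors beta^(b), one row per permutation."""
    rng = np.random.default_rng(seed)
    n, q = X.shape
    # Computed once: full penalized Gram inverse and its down-date without variable j
    G = np.linalg.inv(X.T @ X + np.diag(Lambda_diag))
    keep = np.delete(np.arange(q), j)
    G_noj = G[np.ix_(keep, keep)] - np.outer(G[keep, j], G[j, keep]) / G[j, j]
    X_noj = X[:, keep]
    beta_noj = G_noj @ (X_noj.T @ y)            # smaller-model coefficients
    x_j = X[:, j]
    d = x_j @ x_j + Lambda_diag[j]              # invariant under permutation of x_j
    betas = np.empty((n_perm, q))
    for b in range(n_perm):
        x_jb = x_j[rng.permutation(n)]          # permuted column x_j^(b)
        c = X_noj.T @ x_jb                      # cross-products, O(qn)
        v = G_noj @ c                           # O(q^2)
        schur = d - c @ v                       # scalar Schur complement
        beta_j = (x_jb @ y - c @ beta_noj) / schur
        betas[b, j] = beta_j                    # coefficient of the permuted variable
        betas[b, keep] = beta_noj - v * beta_j  # correction of the remaining ones
    return betas
```

Per tested variable, each permutation then costs \(O(qn + q^{2})\) instead of a full refit, consistently with the overall \(O(B(q^{3}+q^{2}n))\) count given above.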