Abstract
We propose a Random Splitting Model Averaging procedure, RSMA, to achieve stable predictions in high-dimensional linear models. The idea is to split the training data repeatedly to construct and estimate candidate models, and to use the corresponding test data to form second-level data. The second-level data are then used to estimate optimal weights for the candidate models by quadratic optimization under non-negativity constraints. The procedure has three appealing features: (1) RSMA avoids model overfitting and, as a result, improves prediction accuracy; (2) by adaptively choosing optimal weights, it yields more stable predictions through averaging over several candidate models; (3) based on RSMA, we propose a weighted importance index that ranks the predictors and discriminates relevant predictors from irrelevant ones. Simulation studies and a real data analysis demonstrate that the RSMA procedure has excellent predictive performance and that the associated weighted importance index ranks the predictors well.
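To make the procedure concrete, below is a minimal R sketch of the RSMA idea. It is an illustration under stated assumptions, not the paper's exact implementation: we assume the lasso (via cv.glmnet from the glmnet package) as the candidate-model fitter, solve the non-negatively constrained quadratic program for the weights with non-negative least squares (the nnls package), and rescale the weights to sum to one; the function names rsma_fit and rsma_predict and the defaults n_splits and train_frac are ours.

```r
library(glmnet)  # lasso fits for the candidate models (our assumption)
library(nnls)    # non-negative least squares for the weight step (our assumption)

rsma_fit <- function(X, y, n_splits = 20, train_frac = 0.5) {
  n <- nrow(X)
  # Random splits: each split's training half yields one candidate model
  splits <- replicate(n_splits, sample(n, floor(train_frac * n)),
                      simplify = FALSE)
  fits <- lapply(splits, function(idx) cv.glmnet(X[idx, ], y[idx]))

  # Second-level data: stack, over splits, the held-out responses and
  # every candidate model's predictions on those held-out observations.
  # (A candidate may have seen some of another split's test rows during
  # training; the paper's exact construction may differ.)
  y2 <- numeric(0)
  P2 <- NULL
  for (idx in splits) {
    test <- setdiff(seq_len(n), idx)
    preds <- sapply(fits, function(f)
      as.numeric(predict(f, newx = X[test, , drop = FALSE], s = "lambda.min")))
    y2 <- c(y2, y[test])
    P2 <- rbind(P2, preds)
  }

  # Weights: minimize ||y2 - P2 %*% w||^2 subject to w >= 0,
  # then rescale to sum to one (the normalization is our assumption)
  w <- nnls(P2, y2)$x
  list(fits = fits, weights = w / sum(w))
}

rsma_predict <- function(obj, Xnew) {
  # Averaged prediction: weighted combination of the candidate models
  preds <- sapply(obj$fits, function(f)
    as.numeric(predict(f, newx = Xnew, s = "lambda.min")))
  drop(preds %*% obj$weights)
}
```

In this sketch, `rsma_fit(X, y)` followed by `rsma_predict(fit, Xnew)` produces the averaged predictions; candidate models that predict the held-out data poorly receive weight near zero, which is what stabilizes the averaged prediction.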
Acknowledgments
The authors thank the associate editor and two referees for their constructive suggestions, which helped improve an earlier version of the manuscript. Lin's research was supported by the Natural Science Foundation of Shenzhen University (Grant No. 201542). Wang's research was supported by the National Science Fund for Distinguished Young Scholars in China (Grant No. 10725106), the National Natural Science Foundation of China (Grant Nos. 11171331 and 11331011), a grant from the Key Lab of Random Complex Structure and Data Science, CAS, and the Natural Science Foundation of Shenzhen University. Zhang's research was supported by the National Natural Science Foundation of China (Grant No. 11401391), the Project of the Department of Education of Guangdong Province of China (Grant No. 2014KTSCX112), and the Natural Science Foundation of Shenzhen University (Grant No. 701, 000360023408). Pang's research was supported by the Central Research Grant from the Hong Kong Polytechnic University (Grant No. G-YBKQ).
About this article
Cite this article
Lin, B., Wang, Q., Zhang, J. et al. Stable prediction in high-dimensional linear models. Stat Comput 27, 1401–1412 (2017). https://doi.org/10.1007/s11222-016-9694-6