Abstract
In recent years, as the use of high-fidelity simulation models has increased, so has the importance of computationally efficient surrogate models. However, high-dimensional models require many samples for surrogate modeling. To reduce this computational burden, we propose an integrated algorithm that combines accurate variable selection with surrogate modeling. A key strength of the proposed method is that it requires fewer samples than conventional surrogate modeling methods, achieved by excluding dispensable variables while maintaining model accuracy. In the proposed method, the importance of selected variables is evaluated through the quality of a model approximated using only those variables. Nonparametric probabilistic regression is adopted as the modeling method to account for the inaccuracy introduced by modeling with only the selected variables. In particular, Gaussian process regression (GPR) is utilized because its model performance indices are well suited for use in the variable selection criterion. Outstanding variables that yield distinctly superior model performance are finally selected as essential variables. The proposed algorithm employs a conservative selection criterion and appropriate sequential sampling to prevent incorrect variable selection and sample overuse. The performance of the proposed algorithm is verified on two test problems with challenging properties such as high dimensionality, nonlinearity, and the presence of interaction terms. A numerical study shows that the proposed algorithm becomes more effective as the fraction of dispensable variables increases.
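To make the wrapper idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of scoring candidate variable subsets by the quality of a GPR model built from those variables alone, using scikit-learn. The toy function, the `score_subset` helper, the greedy forward search, and the stopping tolerance are all illustrative assumptions; the paper's own criterion and sequential sampling scheme are more elaborate.

```python
# Wrapper-style variable selection with a GPR surrogate (illustrative sketch,
# NOT the authors' exact algorithm). A subset's importance is measured by the
# cross-validated quality of a GPR model trained on that subset alone.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy function: only x0 and x1 matter; x2..x7 are dispensable variables.
def f(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

X = rng.uniform(-2, 2, size=(60, 8))
y = f(X)

def score_subset(idx):
    """Cross-validated R^2 of a GPR fit using only the columns in idx.
    The WhiteKernel absorbs the 'noise' caused by omitted variables."""
    gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                   normalize_y=True)
    return cross_val_score(gpr, X[:, idx], y, cv=5).mean()

# Greedy forward selection: add the variable that most improves the score,
# stopping when no candidate gives a distinct improvement (conservative).
selected, remaining, best = [], list(range(X.shape[1])), -np.inf
while remaining:
    scores = {j: score_subset(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best + 0.01:
        break
    best = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(sorted(selected))  # indices of the selected essential variables
```

Because GPR is probabilistic, the unexplained contribution of excluded variables is modeled as noise rather than corrupting the mean prediction, which is the property the abstract highlights.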
Abbreviations
- n: Dimension of input
- X: Training input
- X∗: New input
- f∗: Posterior output with zero mean function
- \( {\overline{\mathbf{f}}}_{\ast } \): Best estimate of f∗
- g∗: Posterior output with explicit basis function
- cov(g∗): Covariance of g∗
- y: Training output (noisy response)
- m_i(x_i): Mean function of GPR in the x_i–y plane
- c(x|X): Posterior variance at a specified point x
- m: Number of observations
- h(x): Basis function of GPR
- k(x): Covariance function of GPR
- cov(f∗): Covariance of f∗
- \( {\overline{\mathbf{g}}}_{\ast } \): Best estimate of g∗
- \( \boldsymbol{\upbeta}, \widehat{\boldsymbol{\upbeta}} \): Coefficients of the basis function and their estimates
- \( \boldsymbol{\uptheta}, \widehat{\boldsymbol{\uptheta}} \): Hyperparameters of the covariance function and their estimates
- \( {\sigma}^2, {\widehat{\sigma}}^2 \): Noise variance and its estimate
- k_i(x_i, x′; θ): Covariance function of GPR in the x_i–y plane with hyperparameter θ
- ε: Gaussian noise
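For reference, the symbols \( {\overline{\mathbf{f}}}_{\ast } \) and cov(f∗) above correspond to the standard GPR predictive equations for test inputs X∗ given noisy training data (X, y) with noise variance σ². This is the textbook formulation, not an excerpt from the paper; K(·,·) denotes the matrix of covariance-function evaluations between two sets of inputs.

```latex
\begin{align}
\overline{\mathbf{f}}_{\ast} &= K(\mathbf{X}_{\ast}, \mathbf{X})
  \bigl[K(\mathbf{X}, \mathbf{X}) + \sigma^{2}\mathbf{I}\bigr]^{-1}\mathbf{y}, \\
\operatorname{cov}(\mathbf{f}_{\ast}) &= K(\mathbf{X}_{\ast}, \mathbf{X}_{\ast})
  - K(\mathbf{X}_{\ast}, \mathbf{X})
  \bigl[K(\mathbf{X}, \mathbf{X}) + \sigma^{2}\mathbf{I}\bigr]^{-1}
  K(\mathbf{X}, \mathbf{X}_{\ast}).
\end{align}
```

The quantities g∗, \( {\overline{\mathbf{g}}}_{\ast } \), and cov(g∗) generalize these by adding an explicit basis h(x)ᵀβ to the mean function.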
Funding
This research was supported by the project "Development of a thermoelectric power generation system and business model utilizing unused industrial heat," funded by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea (No. 20172010000830).
Additional information
Responsible Editor: YoonYoung Kim
Cite this article
Lee, K., Cho, H. & Lee, I. Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data. Struct Multidisc Optim 59, 1439–1454 (2019). https://doi.org/10.1007/s00158-018-2137-6