Abstract
In recent years, as the use of high-fidelity simulation models has increased, so has the importance of computationally efficient surrogate models. However, high-dimensional models require many samples for surrogate modeling. To reduce this computational burden, we propose an integrated algorithm that combines accurate variable selection with surrogate modeling. A key strength of the proposed method is that it requires fewer samples than conventional surrogate modeling methods, achieved by excluding dispensable variables while maintaining model accuracy. In the proposed method, the importance of selected variables is evaluated through the quality of a model approximated using only those variables. Nonparametric probabilistic regression is adopted as the modeling method to account for the inaccuracy introduced by modeling with only the selected variables. In particular, Gaussian process regression (GPR) is utilized because its model performance indices are well suited for use in the variable selection criterion. Outstanding variables that yield distinctly superior model performance are finally selected as essential variables. The proposed algorithm employs a conservative selection criterion and appropriate sequential sampling to prevent incorrect variable selection and sample overuse. The performance of the proposed algorithm is verified on two test problems with challenging properties such as high dimensionality, nonlinearity, and the presence of interaction terms. A numerical study shows that the proposed algorithm becomes more effective as the fraction of dispensable variables increases.
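To make the wrapper idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of scoring candidate variable subsets by the quality of a GPR model built from those variables alone, using scikit-learn. The toy function, the `score_subset` helper, the greedy forward search, and the stopping tolerance are all illustrative assumptions; the paper's own criterion and sequential sampling scheme are more elaborate.

```python
# Wrapper-style variable selection with a GPR surrogate (illustrative sketch,
# NOT the authors' exact algorithm). A subset's importance is measured by the
# cross-validated quality of a GPR model trained on that subset alone.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy function: only x0 and x1 matter; x2..x7 are dispensable variables.
def f(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

X = rng.uniform(-2, 2, size=(60, 8))
y = f(X)

def score_subset(idx):
    """Cross-validated R^2 of a GPR fit using only the columns in idx.
    The WhiteKernel absorbs the 'noise' caused by omitted variables."""
    gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                   normalize_y=True)
    return cross_val_score(gpr, X[:, idx], y, cv=5).mean()

# Greedy forward selection: add the variable that most improves the score,
# stopping when no candidate gives a distinct improvement (conservative).
selected, remaining, best = [], list(range(X.shape[1])), -np.inf
while remaining:
    scores = {j: score_subset(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best + 0.01:
        break
    best = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(sorted(selected))  # indices of the selected essential variables
```

Because GPR is probabilistic, the unexplained contribution of excluded variables is modeled as noise rather than corrupting the mean prediction, which is the property the abstract highlights.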
Abbreviations
- n: Dimension of input
- X: Training input
- X∗: New input
- f∗: Posterior output with zero mean function
- \( {\overline{\mathbf{f}}}_{\ast } \): Best estimate of f∗
- g∗: Posterior output with explicit basis function
- cov(g∗): Covariance of g∗
- y: Training output (noisy response)
- m_i(x_i): Mean function of GPR in the x_i–y plane
- c(x|X): Posterior variance at a specified point x
- m: Number of observations
- h(x): Basis function of GPR
- k(x): Covariance function of GPR
- cov(f∗): Covariance of f∗
- \( {\overline{\mathbf{g}}}_{\ast } \): Best estimate of g∗
- \( \boldsymbol{\upbeta}, \widehat{\boldsymbol{\upbeta}} \): Coefficients of the basis function and their estimates
- \( \boldsymbol{\uptheta}, \widehat{\boldsymbol{\uptheta}} \): Hyperparameters of the covariance function and their estimates
- \( {\sigma}^2, {\widehat{\sigma}}^2 \): Noise variance and its estimate
- k_i(x_i, x′; θ): Covariance function of GPR in the x_i–y plane with hyperparameter θ
- ε: Gaussian noise
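For reference, the symbols \( {\overline{\mathbf{f}}}_{\ast } \) and cov(f∗) above correspond to the standard GPR predictive equations for test inputs X∗ given noisy training data (X, y) with noise variance σ². This is the textbook formulation, not an excerpt from the paper; K(·,·) denotes the matrix of covariance-function evaluations between two sets of inputs.

```latex
\begin{align}
\overline{\mathbf{f}}_{\ast} &= K(\mathbf{X}_{\ast}, \mathbf{X})
  \bigl[K(\mathbf{X}, \mathbf{X}) + \sigma^{2}\mathbf{I}\bigr]^{-1}\mathbf{y}, \\
\operatorname{cov}(\mathbf{f}_{\ast}) &= K(\mathbf{X}_{\ast}, \mathbf{X}_{\ast})
  - K(\mathbf{X}_{\ast}, \mathbf{X})
  \bigl[K(\mathbf{X}, \mathbf{X}) + \sigma^{2}\mathbf{I}\bigr]^{-1}
  K(\mathbf{X}, \mathbf{X}_{\ast}).
\end{align}
```

The quantities g∗, \( {\overline{\mathbf{g}}}_{\ast } \), and cov(g∗) generalize these by adding an explicit basis h(x)ᵀβ to the mean function.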
Funding
This research was supported by the project "Development of a thermoelectric power generation system and business model utilizing unused industrial heat," funded by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea (No. 20172010000830).
Additional information
Responsible Editor: YoonYoung Kim
Cite this article
Lee, K., Cho, H. & Lee, I. Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data. Struct Multidisc Optim 59, 1439–1454 (2019). https://doi.org/10.1007/s00158-018-2137-6