
Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data

  • Research Paper
  • Published in: Structural and Multidisciplinary Optimization

Abstract

In recent years, the importance of computationally efficient surrogate models has grown as high-fidelity simulation models are used more widely. However, high-dimensional models require many samples for surrogate modeling. To reduce this computational burden, we propose an integrated algorithm that combines accurate variable selection with surrogate modeling. One of the main strengths of the proposed method is that it requires fewer samples than conventional surrogate modeling methods, because dispensable variables are excluded while model accuracy is maintained. In the proposed method, the importance of the selected variables is evaluated through the quality of a model approximated with the selected variables only. Nonparametric probabilistic regression is adopted as the modeling method to handle the inaccuracy introduced by modeling with only the selected variables. In particular, Gaussian process regression (GPR) is used because its model performance indices can be exploited directly in the variable selection criterion. Outstanding variables that yield distinctly superior model performance are finally selected as essential variables. The proposed algorithm uses a conservative selection criterion and appropriate sequential sampling to prevent incorrect variable selection and sample overuse. Its performance is verified on two test problems with challenging properties such as high dimensionality, nonlinearity, and interaction terms. A numerical study shows that the proposed algorithm becomes more effective as the fraction of dispensable variables increases.
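
As a rough illustration of the wrapper-style idea described in the abstract, the sketch below scores candidate variable subsets by the quality of a GPR model built on those variables only and keeps a variable only when it clearly improves the fit. This is a minimal sketch, not the authors' algorithm: the greedy forward search, the cross-validated R² score, the `min_gain` threshold (standing in for the conservative selection criterion), the function names `score_subset` and `greedy_gpr_selection`, and the use of scikit-learn's `GaussianProcessRegressor` are all illustrative assumptions.

```python
# Illustrative sketch only: greedy wrapper-style variable selection with a GPR
# surrogate, in the spirit of the abstract (not the paper's exact algorithm).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import cross_val_score


def score_subset(X, y, cols):
    """Cross-validated R^2 of a GPR fitted on the selected columns only."""
    kernel = RBF(length_scale=np.ones(len(cols))) + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return cross_val_score(gpr, X[:, cols], y, cv=5).mean()


def greedy_gpr_selection(X, y, min_gain=0.01):
    """Greedy forward selection: add a variable only if it clearly helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = {j: score_subset(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:  # conservative stopping rule
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(60, 10))        # 10-D input, limited data
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.01 * rng.standard_normal(60)
    print(greedy_gpr_selection(X, y))            # expected to favor x0 and x1
```

In this toy setup only two of the ten inputs influence the response, which mirrors the regime where the abstract reports the method to be most effective: the more dispensable variables there are, the more samples are saved by modeling in the reduced space.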


Abbreviations

n : Dimension of the input
X : Training inputs
\( X_{\ast} \) : New (test) inputs
\( \mathbf{f}_{\ast} \) : Posterior output with zero mean function
\( {\overline{\mathbf{f}}}_{\ast} \) : Best estimate of \( \mathbf{f}_{\ast} \)
\( \mathbf{g}_{\ast} \) : Posterior output with explicit basis function
\( \operatorname{cov}(\mathbf{g}_{\ast}) \) : Covariance of \( \mathbf{g}_{\ast} \)
y : Training outputs (noisy responses)
\( m_i(x_i) \) : Mean function of GPR in the \( x_i \)–y plane
c(x | X) : Posterior variance at a specified point x
m : Number of observations
h(x) : Basis function of GPR
k(x) : Covariance function of GPR
\( \operatorname{cov}(\mathbf{f}_{\ast}) \) : Covariance of \( \mathbf{f}_{\ast} \)
\( {\overline{\mathbf{g}}}_{\ast} \) : Best estimate of \( \mathbf{g}_{\ast} \)
\( \boldsymbol{\upbeta}, \widehat{\boldsymbol{\upbeta}} \) : Coefficients of the basis function and their estimates
\( \boldsymbol{\uptheta}, \widehat{\boldsymbol{\uptheta}} \) : Hyperparameters of the covariance function and their estimates
\( {\sigma}^2, {\widehat{\sigma}}^2 \) : Noise variance and its estimate
\( k_i(x_i, x^{\prime}; \boldsymbol{\uptheta}) \) : Covariance function of GPR in the \( x_i \)–y plane with hyperparameter \( \boldsymbol{\uptheta} \)
ε : Gaussian noise
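
For context, these symbols follow the standard Gaussian process regression posterior expressions; the formulas below are a generic reference sketch (using the usual plug-in estimates), not an excerpt from the paper. Here \( K(X, X) \), \( K(X_{\ast}, X) \), and \( K(X_{\ast}, X_{\ast}) \) denote covariance matrices built from the covariance function, and \( H \), \( H_{\ast} \) collect the basis-function values \( h(\mathbf{x}) \) at the training and new inputs.

\[ {\overline{\mathbf{f}}}_{\ast} = K(X_{\ast}, X)\left[K(X, X) + \sigma^2 I\right]^{-1} \mathbf{y} \]

\[ \operatorname{cov}(\mathbf{f}_{\ast}) = K(X_{\ast}, X_{\ast}) - K(X_{\ast}, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X_{\ast}) \]

\[ {\overline{\mathbf{g}}}_{\ast} = H_{\ast}^{\top}\widehat{\boldsymbol{\upbeta}} + K(X_{\ast}, X)\left[K(X, X) + \sigma^2 I\right]^{-1}\left(\mathbf{y} - H^{\top}\widehat{\boldsymbol{\upbeta}}\right) \]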


Funding

This research was supported by the project for development of a thermoelectric power generation system and business model utilizing unused industrial heat, funded by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea (No. 20172010000830).

Author information

Correspondence to Ikjin Lee.

Additional information

Responsible Editor: YoonYoung Kim

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Lee, K., Cho, H. & Lee, I. Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data. Struct Multidisc Optim 59, 1439–1454 (2019). https://doi.org/10.1007/s00158-018-2137-6

