Lasso regularization for mixture experiments with noise variables.

Manuel González-Navarrete, Fabián Manríquez-Méndez and Manuel Pereira-Barahona

Departamento de Matemática y Estadística, Universidad de La Frontera. Avda. Francisco Salazar 01145, Temuco, Chile. E-mail address: manuel.gonzaleznavarrete@ufrontera.cl

Instituto de Estadística, Universidad de Valparaíso. Av. Gran Bretaña 1111, Valparaíso, Chile. E-mail address: fabian.manriquez@postgrado.uv.cl

Departamento de Estadística, Universidad del Bío-Bío. Avda. Collao 1202, Concepción, Chile. E-mail address: mpereira@ubiobio.cl

Abstract

We apply classical and Bayesian lasso regularization to a family of models that include mixture and process variables. We compare the performance of these estimators with that of ordinary least squares estimators through a simulation study and a real data application. Our results demonstrate the superior performance of Bayesian lasso, particularly via coordinate ascent variational inference, in terms of variable selection accuracy and response optimization.

1 Introduction

The exploration of complex systems through mixture experiments and the influence of external process variables represents a significant challenge in various fields of science and engineering. The ability to predict and optimize responses in such systems is crucial for technological advancement and innovation [7, 8].

In a mathematical framework, the study of mixture experiments involves building a model describing the relationship between the response and the mixture and process variables. This task requires the choice of an experimental design and the fitting of the statistical model to the data collected after experimentation. The usual tools to estimate model parameters are ordinary least squares [3, 7] and, to a lesser extent, partial least squares [18, 22].

In this context, regularization techniques such as lasso and its Bayesian extension stand out as fundamental tools for model analysis and selection in statistical literature [17], being good candidates for challenges such as high-dimensional mixture experiments. Lasso, introduced by [25], marked an advancement in regression by proposing a technique that minimizes the sum of squared residuals with a constraint on the $L^{1}$ norm of the coefficients, facilitating variable selection and reducing model complexity. Bayesian lasso, proposed by [23], extends this approach by incorporating a Bayesian perspective that assigns Laplace prior distributions to the regression parameters. This innovation maintains the effectiveness of lasso in variable selection while introducing Bayesian flexibility in estimation. Subsequent developments, such as those by [15], [21], and [1], have delved into the hierarchical structure of the model and improved inference algorithms, highlighting the robustness of Bayesian lasso in variable selection and predictive accuracy.

This study focuses on the integration of mixture experiments with process variables and the application of lasso regularization techniques to simultaneously optimize the mean and variance of the response. In particular, we explore the mixture-process models with noise variables described in [7], emphasizing their importance in understanding the interaction between mixture components and process conditions. The mathematical formulation underlying the optimization of these models is discussed, aiming to evaluate the performance of the classical formulation and of Bayesian lasso, the latter fitted by Markov chain Monte Carlo [26] and variational approximation methods [5]. A practical application of these concepts is illustrated through a simulation study that evaluates their performance in the variable selection task. Moreover, a real data example from [13] is included, discussing the effectiveness of the proposed approach in the study of mixture experiments.

The rest of the paper is organized as follows. In Section 2 we introduce the theoretical aspects of the mathematical study of mixture experiments and the proposed regularization methods. Section 3 includes the results of a simulation study to evaluate the performance of such methods. In Section 4 we present an application to a soap production experiment. Finally, Section 5 contains some conclusions.

2 Theoretical Background

2.1 Mixture experiments with noise variables

Mixture models with process variables represent an advanced tool for analyzing systems where both component proportions and specific external conditions (process variables) influence the system’s response. These models extend traditional frameworks by incorporating additional variables that reflect the conditions under which the experiment is conducted.

The general formulation of a model including mixture components $\mathbf{x}$ , process variables $\mathbf{w}$ , and possibly noise variables $\mathbf{z}$ , is described by the equation:

\begin{split}Y=&f(\textbf{x},\textbf{w},\textbf{z})=\displaystyle\sum_{i}\alpha_{i}x_{i}+\mathop{\sum\sum}_{i<j}\alpha_{ij}x_{i}x_{j}+\displaystyle\sum_{i}\sum_{p}\delta_{ip}x_{i}w_{p}\\ &+\mathop{\sum\sum}_{i<j}\sum_{p}\delta_{ijp}x_{i}x_{j}w_{p}+\displaystyle\sum_{i}\sum_{t}\gamma_{it}x_{i}z_{t}+\mathop{\sum\sum}_{i<j}\sum_{t}\gamma_{ijt}x_{i}x_{j}z_{t}\\ &+\displaystyle\sum_{i}\sum_{p}\sum_{t}\eta_{ipt}x_{i}w_{p}z_{t}+\mathop{\sum\sum}_{i<j}\sum_{p}\sum_{t}\eta_{ijpt}x_{i}x_{j}w_{p}z_{t}+\varepsilon\end{split}

(2.1)

where $Y$ is the response variable; $\underline{\beta}=(\underline{\alpha},\underline{\delta},\underline{\gamma},\underline{\eta})$ is the vector of coefficients modeling linear effects and interactions; and $\varepsilon$ represents the error term, normally distributed with mean zero and variance $\sigma^{2}$ .

Key constraints for such models include [13]:

• The proportions of mixture components $\mathbf{x}$ must sum to 1, i.e., $\sum_{i}x_{i}=1$ , ensuring that the model adequately reflects the nature of mixtures.

• Mixture components $x_{i}$ and process variables $w_{p}$ should be selected to faithfully reflect the system under study, including only those factors that have a significant impact on the response.

• For convenience, the noise variables are assumed to be independent and identically distributed, with $\mathbb{E}(Z_{t})=0$ and $\mathbb{V}(Z_{t})=1$ .
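To make the structure of (2.1) concrete, the following minimal sketch expands one experimental run into its regressor terms. The function name `model_terms` and the term labels are illustrative choices, not notation from the paper; the sketch only assumes that every mixture term (linear or binary blending) is crossed with each process variable, each noise variable, and each process-noise pair, as in (2.1).

```python
from itertools import combinations

def model_terms(x, w, z, tol=1e-9):
    """Return {label: value} for the regressors of model (2.1) at one run.

    x: mixture proportions, w: process variables, z: noise variables.
    The labels are hypothetical and only serve to identify each term.
    """
    if abs(sum(x) - 1.0) > tol:
        raise ValueError("mixture proportions must sum to 1")
    # Pure mixture blocks: x_i and x_i * x_j for i < j.
    base = {f"x{i + 1}": xi for i, xi in enumerate(x)}
    for i, j in combinations(range(len(x)), 2):
        base[f"x{i + 1}x{j + 1}"] = x[i] * x[j]
    terms = dict(base)
    # Cross every mixture term with w_p, with z_t, and with w_p * z_t.
    for name, val in base.items():
        for p, wp in enumerate(w, 1):
            terms[f"{name}*w{p}"] = val * wp
        for t, zt in enumerate(z, 1):
            terms[f"{name}*z{t}"] = val * zt
        for p, wp in enumerate(w, 1):
            for t, zt in enumerate(z, 1):
                terms[f"{name}*w{p}*z{t}"] = val * wp * zt
    return terms

terms = model_terms(x=[0.5, 0.3, 0.2], w=[1.0], z=[0.1, -0.4])
print(len(terms))
```

With three components, one process variable and two noise variables, the expansion contains 6 mixture terms, 6 mixture-process terms, 12 mixture-noise terms and 12 three-way terms.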

The usual tool to estimate the parameters in (2.1) is the method of ordinary least squares, whose estimator is given by

\hat{\underline{\beta}}_{OLS}={\arg\min}_{\underline{\beta}}(y-X\underline{\beta})^{t}(y-X\underline{\beta})

(2.2)

Once we obtain the fitted model for the response variable $Y$ , the objective is to find the configurations of mixture variables $\mathbf{x}$ , process variables $\mathbf{w}$ and noise variables $\mathbf{z}$ that optimize the response. This task is carried out with the desirability function approach proposed by Derringer and Suich [11], which is extensively used in the recent literature [3, 9, 10, 24].

The desirability function is defined for estimated response functions, such as the moments of $Y$ , $\mathbb{E}(Y^{n})$ . Its value increases as the "desirability" of the corresponding response increases. For instance, the desirability function of the estimated expectation of a function $g$ of the response variable is given by

d\left(\widehat{\mathbb{E}}(g(Y))\right)=\begin{cases}0,&\text{if}\ \widehat{\mathbb{E}}(g(Y))\leq\mathbb{E}(g(Y))_{*}\\ \left[\frac{\widehat{\mathbb{E}}(g(Y))-\mathbb{E}(g(Y))_{*}}{\mathbb{E}(g(Y))^{*}-\mathbb{E}(g(Y))_{*}}\right]^{r},&\text{if}\ \mathbb{E}(g(Y))_{*}<\widehat{\mathbb{E}}(g(Y))<\mathbb{E}(g(Y))^{*}\\ 1,&\text{if}\ \widehat{\mathbb{E}}(g(Y))\geq\mathbb{E}(g(Y))^{*}\end{cases}

(2.3)

The values $\mathbb{E}(g(Y))_{*}$ and $\mathbb{E}(g(Y))^{*}$ give the minimum and maximum acceptable values of $\mathbb{E}(g(Y))$ , respectively. The exponent $r>0$ is chosen by the analyst and controls the shape of the desirability function between these two bounds. Finally, the individual desirabilities are combined using the geometric mean,

D(\textbf{x},\textbf{w},\textbf{z})=\left(d\left(\widehat{\mathbb{E}}(g_{1}(Y))\right)\cdot\ldots\cdot d\left(\widehat{\mathbb{E}}(g_{d}(Y))\right)\right)^{1/d}

(2.4)

This single value of $D$ is maximized to obtain the overall assessment of the desirability of the combined expected response functions. In particular, we use $g_{1}(Y)=Y$ and $g_{2}(Y)=-(Y-\mathbb{E}(Y))^{2}$ . In other words, we maximize the expectation and minimize the variance of the response variable $Y$ .
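The desirability computation in (2.3) and (2.4) can be sketched in a few lines. The function names and the numerical bounds below are hypothetical; the only assumption taken from the text is that the second response is $g_{2}(Y)=-(Y-\mathbb{E}(Y))^{2}$ , so that a smaller variance maps to a larger desirability.

```python
import math

def desirability(value, low, high, r=1.0):
    """Individual desirability (2.3): 0 below `low`, 1 above `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return ((value - low) / (high - low)) ** r

def overall_desirability(ds):
    """Geometric mean (2.4) of the individual desirabilities."""
    if any(d == 0.0 for d in ds):
        return 0.0
    return math.exp(sum(math.log(d) for d in ds) / len(ds))

# Hypothetical acceptable ranges for the mean and the variance.
est_mean, est_var = 1800.0, 400.0
d_mean = desirability(est_mean, low=1500.0, high=2000.0)
# g_2(Y) = -(Y - E(Y))^2, so its expectation is minus the variance.
d_var = desirability(-est_var, low=-1000.0, high=0.0)
print(overall_desirability([d_mean, d_var]))
```

Because the geometric mean is zero as soon as one factor is zero, a configuration that is unacceptable for either the mean or the variance is rejected outright.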

2.2 Lasso regularization

The introduction of lasso, as proposed by Tibshirani (1996), represented a significant step forward in regression techniques. By minimizing the sum of squared residuals under a constraint on the $L^{1}$ norm of coefficients, lasso facilitates variable selection and effectively reduces model complexity. Building upon this foundation, Bayesian lasso, as introduced by Park and Casella (2008), takes regression analysis further by adopting a Bayesian framework that assigns Laplace prior distributions to regression parameters. This Bayesian perspective not only preserves the variable selection capabilities of lasso but also introduces greater flexibility in parameter estimation.

2.2.1 Classical formulation

The lasso is a form of penalized least squares that minimizes the residual sum of squares while controlling the $L^{1}$ norm of the coefficient vector $\underline{\beta}$ . The lasso estimator for a classical regression model is given by,

\hat{\underline{\beta}}_{L}={\arg\min}_{\underline{\beta}}(y-X\underline{\beta})^{t}(y-X\underline{\beta})+\lambda||\underline{\beta}||_{1}

(2.5)

where $\lambda\geq 0$ is called the shrinkage parameter. In the case $\lambda=0$ , we have $\hat{\underline{\beta}}_{L}=\hat{\underline{\beta}}_{OLS}$ , the ordinary least squares (OLS) estimate, while a sufficiently large $\lambda$ shrinks $\hat{\underline{\beta}}_{L}$ to zero. The lasso has a Bayesian interpretation [25]: the lasso estimate can be seen as the mode of the posterior distribution of $\underline{\beta}$ when independent double-exponential prior distributions are assigned to the $p$ regression coefficients,

p(\underline{\beta}\mid\tau)=(\tau/2)^{p}\exp\left(-\tau||\underline{\beta}||_{1}\right)

(2.6)

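and the likelihood is $p(\underline{y}\mid\underline{\beta},\sigma^{2})=\mathcal{N}(\underline{y}\mid X\underline{\beta},\sigma^{2}\textbf{I}_{n})$ , for any fixed values of $\sigma>0$ and $\tau>0$ . Up to an additive constant, the negative log posterior is $\frac{1}{2\sigma^{2}}(y-X\underline{\beta})^{t}(y-X\underline{\beta})+\tau||\underline{\beta}||_{1}$ ; multiplying by $2\sigma^{2}$ recovers (2.5) with penalty $\lambda=2\tau\sigma^{2}$ .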
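As an illustration of how the estimator in (2.5) can be computed, the sketch below uses cyclic coordinate descent with soft-thresholding. This is one standard solver for the lasso, not necessarily the one used in the paper, and the function names are illustrative. With the objective written as in (2.5), without a factor $1/2$ on the residual sum of squares, the threshold applied to each coordinate is $\lambda/2$ .

```python
def soft_threshold(a, k):
    """Shrink a towards zero by k, returning 0 when |a| <= k."""
    if a > k:
        return a - k
    if a < -k:
        return a + k
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (y - Xb)'(y - Xb) + lam * ||b||_1 by coordinate descent.

    X is a list of rows, y a list of responses (plain Python lists).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for beta = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the partial residual that
            # excludes the current contribution of beta[j].
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Orthogonal toy design: the small coefficient is set exactly to zero.
print(lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0))
```

Setting `lam=0.0` in the same call returns the OLS solution, in agreement with the remark following (2.5).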

2.2.2 Bayesian formulation

The work [23] shows a Bayesian formulation of lasso regression. The hierarchical model is defined by:

\begin{split}\underline{y}\mid X,\underline{\beta},\sigma^{2}&\sim\mathcal{N}(X\cdot\underline{\beta},\sigma^{2}\cdot I_{n})\\ \underline{\beta}\mid\sigma^{2},\underline{\tau}&\sim\mathcal{N}(\underline{0},\sigma^{2}\mathbf{D}_{\tau})\\ \tau_{j}\mid\lambda&\sim\operatorname{Exp}(\lambda)\quad j=1,\ldots,p\end{split}

(2.7)

where $\mathbf{D}_{\tau}=\operatorname{diag}(\tau_{1},\ldots,\tau_{p})$ and the $\tau_{j}$ are conditionally independent given $\lambda$ for all $j$ . The model can be completed with the gamma prior distributions $(\sigma^{2})^{-1}\sim Ga(a_{0},b_{0})$ and $\lambda\sim Ga(c_{0},d_{0})$ , where $a_{0},b_{0},c_{0}$ and $d_{0}$ are the hyperparameters. Let $\underline{\theta}=\left(\underline{\beta},\sigma^{2},\underline{\tau},\lambda\right)$ be the vector of the parameters for this model. The posterior distribution is proportional to the likelihood times the prior distributions of the latent components and the parameters:

p(\underline{\theta}\mid\underline{y},X)\propto p(\underline{y}\mid X,\underline{\beta},\sigma^{2})p(\underline{\beta}\mid\sigma^{2},\underline{\tau})p(\underline{\tau}\mid\lambda)p(\sigma^{2})p(\lambda)

The posterior distribution for this hierarchical model is not available in closed form, so it has to be approximated numerically.
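The hierarchy in (2.7) is easiest to read as a generative recipe. The sketch below draws once from it, top down. It assumes a shape-rate parameterization for $Ga(\cdot,\cdot)$ and a rate parameterization for $\operatorname{Exp}(\lambda)$ , which the text does not state explicitly; the function name and the hyperparameter values are hypothetical.

```python
import random

def draw_from_hierarchy(X, a0, b0, c0, d0, rng):
    """One forward draw of (lambda, tau, sigma^2, beta, y) from model (2.7).

    Assumption: Ga(shape, rate) and Exp(rate). random.gammavariate takes
    (shape, scale), hence scale = 1 / rate below.
    """
    p = len(X[0])
    lam = rng.gammavariate(c0, 1.0 / d0)          # lambda ~ Ga(c0, d0)
    precision = rng.gammavariate(a0, 1.0 / b0)    # 1/sigma^2 ~ Ga(a0, b0)
    sigma2 = 1.0 / precision
    tau = [rng.expovariate(lam) for _ in range(p)]             # tau_j | lambda
    beta = [rng.gauss(0.0, (sigma2 * t) ** 0.5) for t in tau]  # beta | sigma^2, tau
    y = [sum(xij * bj for xij, bj in zip(row, beta))
         + rng.gauss(0.0, sigma2 ** 0.5) for row in X]         # y | beta, sigma^2
    return {"lam": lam, "sigma2": sigma2, "tau": tau, "beta": beta, "y": y}

rng = random.Random(1)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
draw = draw_from_hierarchy(X, a0=2.0, b0=1.0, c0=2.0, d0=1.0, rng=rng)
print(draw["beta"])
```

Small values of $\tau_{j}$ give a prior for $\beta_{j}$ that is tightly concentrated around zero, which is the mechanism behind the shrinkage.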

2.3 Bayesian model estimation

In this section we explain the point estimator we use, the variable selection methods and the computational tools to approximate the posterior distributions. In particular, we include the Markov chain Monte Carlo (MCMC) and variational alternatives by coordinate ascent variational inference (CAVI) and automatic differentiation variational inference (ADVI).

2.3.1 Point estimation and variable selection in Bayesian lasso

We adopt the posterior mean, $\hat{\theta}=\mathbb{E}(\underline{\theta}\mid\underline{y},X)$ , as point estimate. This choice is driven by its capacity to summarize the information provided by the whole posterior distribution, offering an estimate that accounts for the range of plausible parameter values, in contrast to the maximum a posteriori (MAP) estimate, which only reports the posterior mode [23].

Furthermore, Bayesian lasso provides interval estimates that can guide variable selection. Usually, a 95% credible interval is computed for each parameter and, if the interval contains the value zero, the regression coefficient is excluded [23]. This criterion will be denoted by CI (credible interval).

However, as discussed by [20], the 95% credible intervals are usually too wide and most predictors would consequently be removed. Therefore, we use the criterion proposed by Li and Lin, that is, we consider the posterior probability of the interval $\left[-\sqrt{\mathbb{V}(\beta_{j}\mid\underline{y},X)},\ \sqrt{\mathbb{V}(\beta_{j}\mid\underline{y},X)}\right]$ . A regression coefficient is excluded if this probability exceeds a certain threshold and is retained otherwise. In particular, we use $0.5$ as the threshold. This criterion is called scaled neighborhood (SN).
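Both selection rules can be evaluated directly from posterior draws of a coefficient. The sketch below is a minimal version under the assumption that the draws come from MCMC or from sampling a fitted variational distribution; the function names are illustrative, the quantiles are crude empirical ones, and the three sets of draws are synthetic stand-ins for a posterior.

```python
import random
import statistics

def keep_by_credible_interval(draws, level=0.95):
    """CI criterion: keep the coefficient if the central interval excludes 0."""
    s = sorted(draws)
    lo = s[int((1.0 - level) / 2.0 * (len(s) - 1))]
    hi = s[int((1.0 + level) / 2.0 * (len(s) - 1))]
    return not (lo <= 0.0 <= hi)

def keep_by_scaled_neighborhood(draws, threshold=0.5):
    """SN criterion: exclude if P(|beta_j| <= posterior sd) exceeds threshold."""
    sd = statistics.pstdev(draws)
    inside = sum(1 for d in draws if -sd <= d <= sd) / len(draws)
    return inside <= threshold

rng = random.Random(0)
strong = [rng.gauss(5.0, 1.0) for _ in range(2000)]  # clearly non-zero
null = [rng.gauss(0.0, 1.0) for _ in range(2000)]    # centred at zero
weak = [rng.gauss(1.5, 1.0) for _ in range(2000)]    # moderate effect

for name, d in [("strong", strong), ("null", null), ("weak", weak)]:
    print(name, keep_by_credible_interval(d), keep_by_scaled_neighborhood(d))
```

The moderate effect illustrates why SN is less conservative than CI: its 95% interval covers zero, yet little posterior mass lies within one posterior standard deviation of zero.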

2.3.2 Markov chain Monte Carlo

In the field of Bayesian inference, Markov chain Monte Carlo (MCMC) is the most common method to approximate posterior distributions [26]. However, standard MCMC samplers often produce highly autocorrelated draws and converge slowly, especially in high-dimensional spaces. To overcome this problem, the rstan package for R (see [14]) uses an advanced version of Hamiltonian Monte Carlo (HMC) known as the no-U-turn sampler (NUTS) (see [16]). NUTS eliminates the need to manually tune the number of steps in each trajectory, which improves the exploration of the parameter space, reduces autocorrelation between samples and speeds up convergence.

NUTS is an extension of the HMC algorithm that solves the problem of selecting the optimal number of simulation steps. It employs a recursive strategy that automatically expands the trajectory in parameter space, stopping when a reversal or "U-turn" is detected in the trajectory, hence its name. This enhances sampling efficiency by reducing autocorrelation between samples and optimizing the use of computational resources.

Formally, updating the parameters $\theta$ in NUTS can be described using Hamiltonian dynamics, where an auxiliary momentum $p$ is introduced, and the system’s evolution is simulated under the Hamiltonian $H(\theta,p)=U(\theta)+K(p)$ . Here, $U(\theta)$ is the potential energy, defined as the negative logarithm of the posterior density, and $K(p)$ is the kinetic energy associated with the momentum $p$ , typically defined as $K(p)=\frac{1}{2}p^{T}M^{-1}p$ , where $M$ is the mass (or covariance) matrix that can be adjusted to reflect parameter scales (see [16]). The update can be written as

\theta_{n+1},p_{n+1}=\text{Leapfrog}(\theta_{n},p_{n},\epsilon_{n},L_{n})

(2.8)

where $\text{Leapfrog}(\cdot)$ denotes the leapfrog integration steps used for numerical simulation of Hamiltonian dynamics, $\epsilon_{n}$ is the adaptive step size, and $L_{n}$ is the number of leapfrog steps, determined dynamically [4].
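The leapfrog map in (2.8) is short enough to write out. The sketch below uses a fixed number of steps and a scalar inverse mass, so it is plain HMC integration rather than the dynamic trajectory building of NUTS; the standard normal target and all names are illustrative assumptions.

```python
def leapfrog(theta, p, eps, n_steps, grad_U, inv_mass=1.0):
    """Integrate Hamiltonian dynamics for n_steps leapfrog steps of size eps."""
    theta, p = list(theta), list(p)
    g = grad_U(theta)
    for _ in range(n_steps):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]           # half step in p
        theta = [ti + eps * inv_mass * pi for ti, pi in zip(theta, p)]  # full step
        g = grad_U(theta)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]           # half step in p
    return theta, p

def hamiltonian(theta, p, U, inv_mass=1.0):
    """H(theta, p) = U(theta) + K(p) with K(p) = p' M^{-1} p / 2."""
    return U(theta) + 0.5 * inv_mass * sum(pi * pi for pi in p)

# Illustrative target: standard normal, so U(theta) = ||theta||^2 / 2.
U = lambda th: 0.5 * sum(t * t for t in th)
grad_U = lambda th: list(th)

theta0, p0 = [1.0, -0.5], [0.3, 0.8]
theta1, p1 = leapfrog(theta0, p0, eps=0.1, n_steps=20, grad_U=grad_U)
print(hamiltonian(theta0, p0, U), hamiltonian(theta1, p1, U))
```

The two printed energies are close but not equal; in HMC this discretization error is what the accept-reject step (or the slice variable in NUTS) corrects for.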

The implementation of NUTS in rstan allows for more efficient Bayesian inference in high-dimensional models and reduces manual intervention in selecting sampler hyperparameters. The MCMC algorithms, specifically NUTS, were run until convergence, evaluated using the $\hat{R}$ statistic (potential scale reduction), ensuring $\hat{R}<1.1$ (see [12]).
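For reference, the potential scale reduction factor compares between-chain and within-chain variability. The sketch below implements the basic Gelman-Rubin form for a single scalar parameter; Stan reports a refined split, rank-normalized variant, so this is an illustration of the idea rather than a reproduction of rstan's output. The chains are synthetic.

```python
import random

def r_hat(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (0.0, 0.0, 3.0, 3.0)]
print(r_hat(mixed), r_hat(stuck))
```

Chains exploring the same distribution give a value close to 1, whereas chains stuck in different regions give a value well above the $1.1$ cut-off.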

2.3.3 Methods based on variational inference

The concept behind variational inference methods is to propose a family of densities and find a member $q$ of that family which closely approximates the target posterior $p(\underline{\theta}\mid\underline{y},X)$ [5]. In other words, instead of computing the true posterior, we endeavor to determine the parameters $\phi$ of a particular distribution $q^{*}$ (the approximation to our true posterior) such that

q^{*}=\arg\min\mathcal{L}(q(\underline{\theta};\phi)\ ||\ p(\underline{\theta}\mid\underline{y},X))

(2.9)

where $\mathcal{L}(\cdot\ ||\ \cdot)$ denotes the Kullback-Leibler divergence, given by $\mathcal{L}(q(\underline{\theta};\phi)\ ||\ p(\underline{\theta}\mid\underline{y},X))=\mathbb{E}_{q}\left[\ln\frac{q(\underline{\theta};\phi)}{p(\underline{\theta}\mid\underline{y},X)}\right]$ . Since the normalizing constant of the posterior does not depend on $q$ , solving (2.9) is equivalent to solving

q^{*}=\arg\max\{\underbrace{\mathbb{E}_{q}\left[\ln p(\underline{y},\underline{\theta},X)-\ln q(\underline{\theta};\phi)\right]}_{ELBO}\}

(2.10)

The expression inside the braces is called the evidence lower bound (ELBO). In particular, we optimize the ELBO under the mean field assumption, that is, the variational density factorizes into a product of marginal densities, $q(\underline{\theta})=\prod_{i=1}^{p}q(\theta_{i})$ .
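The step from (2.9) to (2.10) can be spelled out in one short derivation, using $p(\underline{\theta}\mid\underline{y},X)=p(\underline{y},\underline{\theta},X)/p(\underline{y},X)$ in the notation of the text:

```latex
\begin{aligned}
\mathcal{L}\bigl(q(\underline{\theta};\phi)\ ||\ p(\underline{\theta}\mid\underline{y},X)\bigr)
 &= \mathbb{E}_{q}\left[\ln q(\underline{\theta};\phi)\right]
   -\mathbb{E}_{q}\left[\ln p(\underline{\theta}\mid\underline{y},X)\right] \\
 &= \mathbb{E}_{q}\left[\ln q(\underline{\theta};\phi)\right]
   -\mathbb{E}_{q}\left[\ln p(\underline{y},\underline{\theta},X)\right]
   +\ln p(\underline{y},X) \\
 &= -\operatorname{ELBO}+\ln p(\underline{y},X).
\end{aligned}
```

The term $\ln p(\underline{y},X)$ is constant in $\phi$ , so minimizing the divergence and maximizing the ELBO select the same $q^{*}$ . Since the divergence is non-negative, the ELBO is also a lower bound on $\ln p(\underline{y},X)$ , which explains its name.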

• Coordinate ascent variational inference (CAVI): This algorithm for solving the optimization problem was introduced by [4]. CAVI optimizes one factor of the mean field variational density at a time, that is, it iteratively optimizes $q_{j}$ for $j=1,\ldots,p$ while the other variational factors are held fixed. The optimal $q_{j}$ is proportional to the exponential of the expected log of the complete conditional distribution, and is given by

q(\theta_{j})\propto\exp\{\mathbb{E}_{\theta_{-(j)}}[\ln p(\theta_{j}\mid\theta_{-(j)},\underline{y},X)]\}\quad j=1,\ldots,p

(2.11)

where $\theta_{-(j)}=(\theta_{1},\ldots,\theta_{j-1},\theta_{j+1},\ldots,\theta_{p})$ .

In CAVI, the parameters of the variational distribution are updated iteratively until a convergence criterion is met. This requires analytic derivations of the updates, which can be time-consuming at best and impractical in certain scenarios. The objective throughout is to maximize the ELBO under the mean field assumption.

For the Bayesian lasso model in (2.7), the variational posterior for $\underline{\beta}$ and $\sigma^{2}$ is given by (for details, see [2])

\begin{split}q(\underline{\beta},\sigma^{2})&=\mathcal{N}(\underline{\beta}\mid m_{\beta},\sigma^{2}\cdot C_{\beta})Ga((\sigma^{2})^{-1}\mid a_{\sigma^{2}},b_{\sigma^{2}})\end{split}

which is recognized as a normal-gamma distribution with parameters:

C^{-1}_{\beta}=\mathbb{E}\left[D_{\tau}^{-1}\right]+X^{t}X,\quad m_{\beta}=C_{\beta}X^{t}y,\quad a_{\sigma^{2}}=a_{0}+\frac{n}{2}\quad\text{and}\quad b_{\sigma^{2}}=b_{0}+\frac{1}{2}\left(y^{t}y-m_{\beta}^{t}C_{\beta}^{-1}m_{\beta}\right).

The variational distribution for $\tau_{j}$ is given by $q(\tau_{j})=\mathcal{GIG}(\tau_{j}\mid c_{\tau},d_{\tau},f_{\tau_{j}})$ , where $\mathcal{GIG}$ is a generalized inverse Gaussian distribution, with parameters $c_{\tau}=\frac{1}{2}$ , $d_{\tau}=2\mathbb{E}_{\lambda}\left[\lambda\right]$ and $f_{\tau_{j}}=\mathbb{E}_{\sigma^{2}\beta}\left[(\sigma^{2})^{-1}\beta_{j}^{2}\right].$ Finally, the variational distribution for $\lambda$ is given by

q(\lambda)=Ga(\lambda\mid a_{\lambda},b_{\lambda})

which is a gamma distribution with parameters

a_{\lambda}=c_{0}+p\quad\text{and}\quad b_{\lambda}=d_{0}+\sum_{j=1}^{p}\mathbb{E}\left[\tau_{j}\right].

• Automatic differentiation variational inference (ADVI): Implementing CAVI requires careful thought about the target distribution and the choice of an appropriate variational family specific to the problem. Alternatively, [19] offer a way to automate variational inference. We assume that all model parameters are continuous. In ADVI, the ELBO is first re-written as

\operatorname{ELBO}(\underline{y},\phi):=\mathbb{E}_{q}\left[\ln p(\underline{y},T^{-1}(\zeta),X)+\ln|J_{T^{-1}}(\zeta)|-\ln q(\zeta;\phi)\right]

Here, $T$ is a function that transforms $\theta$ to $\zeta$ , where $\zeta\in\mathbb{R}^{dim(\theta)}$ . That is, $T:\operatorname{support}(\theta)\rightarrow\mathbb{R}^{dim(\theta)}$ , with $\zeta=T(\theta)$ , and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of $T$ . As all the transformed parameters $\zeta$ have support on the real line, a suitable variational distribution for $\zeta$ is a normal distribution. Specifying a multivariate Gaussian variational distribution $q(\zeta;\phi)=N(\zeta|m,LL^{t})$ for $\zeta$ , with variational parameters $\phi=(m,L)$ , enables us to compute the expectation and its gradient using a Monte Carlo estimate. Specifically, to estimate the ELBO, one can sample values from the variational distribution and evaluate the expression inside the expectation above. To maximize the ELBO, the gradient of the ELBO with respect to the variational parameters is required, that is,

\nabla_{\phi}\operatorname{ELBO}(\underline{y},\phi):=\nabla_{\phi}\mathbb{E}_{q}\left[\ln p(\underline{y},T^{-1}(\zeta),X)+\ln|J_{T^{-1}}(\zeta)|-\ln q(\zeta;\phi)\right]

Once again, this gradient can be estimated by Monte Carlo integration. However, differentiating through a random draw whose distribution depends on $\phi$ is not straightforward. Hence, we first draw a standard normal random variable and then scale and shift it by the variational standard deviation and mean. In this way, the gradient can be moved inside the expectation:

\nabla_{\phi}\operatorname{ELBO}(\underline{y},\phi)\approx\frac{1}{S}\sum_{s=1}^{S}\nabla_{\phi}\left[\ln p(\underline{y},T^{-1}(\zeta),X)+\ln|J_{T^{-1}}(\zeta)|-\ln q(\zeta;\phi)\right]\Big|_{\zeta=m+L\epsilon^{(s)}}

(2.12)

where $\epsilon^{(s)}\sim N(0,I),s=1,\ldots,S.$ One can thus easily compute the stochastic gradient approximation in (2.12).
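The reparameterized gradient in (2.12) can be illustrated on a one-dimensional toy target where the gradient is available by hand, so that no automatic differentiation library is needed. The target below, an unnormalized gamma density on $\theta>0$ with $T=\ln$ , is an assumption made only for this sketch and is not the Bayesian lasso posterior; the step size, sample size and function name are likewise hypothetical.

```python
import math
import random

def advi_gamma_target(a, b, n_iter=4000, n_samples=20, lr=0.01, seed=0):
    """Stochastic gradient ascent on the ELBO with q(zeta) = N(m, s^2).

    Target: ln p(theta) = (a - 1) ln(theta) - b * theta on theta > 0.
    With zeta = ln(theta), ln|J_{T^{-1}}(zeta)| = zeta, so the transformed
    log density is a * zeta - b * exp(zeta), with derivative a - b * exp(zeta).
    """
    rng = random.Random(seed)
    m, omega = 0.0, 0.0                      # s = exp(omega) keeps s > 0
    for _ in range(n_iter):
        s = math.exp(omega)
        g_m = g_omega = 0.0
        for _ in range(n_samples):
            eps = rng.gauss(0.0, 1.0)
            zeta = m + s * eps               # reparameterized draw
            grad = a - b * math.exp(zeta)    # d/dzeta of the log density
            g_m += grad                      # chain rule: dzeta/dm = 1
            g_omega += grad * s * eps        # chain rule: dzeta/domega = s * eps
        g_m /= n_samples
        g_omega = g_omega / n_samples + 1.0  # entropy of q contributes +1
        m += lr * g_m
        omega += lr * g_omega
    return m, math.exp(omega)

m_hat, s_hat = advi_gamma_target(a=4.0, b=2.0)
print(m_hat, s_hat)
```

For this target the ELBO can be maximized analytically, giving $s^{2}=1/a$ and $m=\ln(a/b)-1/(2a)$ , so the stochastic iterates can be checked against $m\approx 0.568$ and $s=0.5$ for $a=4$ , $b=2$ .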

For the variational inference methods, convergence was determined by monitoring the ELBO. Specifically, convergence was declared when the relative change in the ELBO between successive iterations fell below a predefined tolerance threshold. This criterion indicates that the optimization process had stabilized at a (local) maximum of the ELBO (see [19]).

3 A simulation study for variable selection

In this section we analyse the performance of lasso regularization in the variable selection task. We include the results for classical lasso and Bayesian lasso methods (CAVI, ADVI and MCMC), with variable selection criteria CI and SN.

We suppose the experiments are governed by a reduced form of the quadratic mixture model (2.1), with $i=1,2,3$ , $p=1$ and $t=1,2$ . We consider a model where $\underline{\beta}=(\underline{\alpha},\underline{\delta},\underline{\eta})$ , for which we set $\underline{\alpha}=\underline{\delta}=\underline{1}$ and $\underline{\eta}=\underline{0}$ , and then evaluate the ability of lasso to shrink the estimates of $\underline{\eta}$ to zero.

3.1 Data Generation

The mixture components $x_{1},x_{2},$ and $x_{3}$ were generated under constraints to ensure their sum is 1. Specifically, $x_{1}$ and $x_{2}$ were drawn from uniform distributions $U(0.2,0.8)$ and $U(0.15,0.5)$ respectively, while $x_{3}$ was determined as $1-x_{1}-x_{2}$ and required to lie within the range $[0.05,0.3]$ .

Additional predictors $w_{1},z_{1},$ and $z_{2}$ were introduced to simulate the effects of process and noise variables. The variable $w_{1}$ was sampled from a binary distribution taking values $0.5$ and $1$ with equal probability. Both $z_{1}$ and $z_{2}$ were drawn from standard normal distributions.

The response variable $Y$ was then generated based on the reduced model, with an added noise term $\varepsilon$ drawn from a normal distribution with mean 0 and standard deviation $\sigma=0.5$ .
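The data-generating scheme just described can be sketched as follows. Two points are assumptions of this sketch rather than details given in the text: the range constraint on $x_{3}$ is enforced by rejection sampling, and the number of runs per simulated data set (`n_runs=50`) is a placeholder. The response uses $\underline{\alpha}=\underline{\delta}=\underline{1}$ and $\underline{\eta}=\underline{0}$ , so the noise variables are generated but do not enter the mean.

```python
import random

def simulate_run(rng, sigma=0.5):
    """Generate one run (x, w1, z, y) of the reduced model."""
    # Assumption: redraw until x3 = 1 - x1 - x2 falls in [0.05, 0.3].
    while True:
        x1 = rng.uniform(0.2, 0.8)
        x2 = rng.uniform(0.15, 0.5)
        x3 = 1.0 - x1 - x2
        if 0.05 <= x3 <= 0.3:
            break
    w1 = rng.choice([0.5, 1.0])                 # two levels, equal probability
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Mixture terms: x1, x2, x3, x1*x2, x2*x3, x1*x3.
    mix = [x1, x2, x3, x1 * x2, x2 * x3, x1 * x3]
    # alpha = 1 on every mixture term, delta = 1 on every mixture-by-w1 term,
    # eta = 0 on every term involving z1 or z2.
    mean = sum(mix) + w1 * sum(mix)
    y = mean + rng.gauss(0.0, sigma)
    return {"x": (x1, x2, x3), "w1": w1, "z": (z1, z2), "y": y}

rng = random.Random(2024)
data = [simulate_run(rng) for _ in range(50)]   # n_runs = 50 is hypothetical
print(data[0])
```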

The implementation of the proposed Bayesian methods requires careful selection of hyperparameters and convergence criteria. In this work, we used a prior configuration that includes gamma distributions for the precision $(\sigma^{2})^{-1}$ and for $\lambda$ , and exponential distributions for the $\tau_{j}$ . Initial values for the Markov chains were based on estimates previously obtained via ordinary least squares (OLS) to improve sampling efficiency. These choices aim to ensure robust and accurate estimation of the model parameters.

3.2 Results

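For each method, the frequency with which each coefficient was retained across the 1000 simulations was recorded. Table 1 shows detailed results on the variable selection for the simulation study, where $N(\cdot)$ denotes the proportion of simulations in which the corresponding coefficient was retained. Since $\underline{\alpha}$ and $\underline{\delta}$ are truly non-zero and $\underline{\eta}$ is truly zero, methods with larger $N(\alpha)$ and $N(\delta)$ and smaller $N(\eta)$ are considered to perform better.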

The confusion matrices in Figure 1 underscore that CAVI outperforms other methods with the highest true positives and lowest false negatives, indicating superior parameter selection accuracy. Conversely, MCMC variants show notably poorer performance, highlighting their inefficacy in accurate parameter identification.

	LASSO	BL-MCMC		BL-CAVI		BL-ADVI
	LASSO	CI	SN	CI	SN	CI	SN
$N(\alpha_{1})$	1.000	1.000	1.000	1.000	1.000	0.998	0.998
$N(\alpha_{2})$	1.000	0.905	0.995	1.000	1.000	0.998	0.998
$N(\alpha_{3})$	1.000	0.640	0.963	1.000	1.000	0.998	0.998
$N(\alpha_{12})$	1.000	0.527	0.850	1.000	1.000	0.991	0.995
$N(\alpha_{23})$	0.997	0.007	0.039	0.856	0.999	0.857	0.913
$N(\alpha_{13})$	0.023	0.507	0.633	0.958	1.000	0.947	0.960
$N(\delta_{11})$	1.000	1.000	1.000	1.000	1.000	0.998	0.998
$N(\delta_{21})$	1.000	0.699	0.964	1.000	1.000	0.998	0.998
$N(\delta_{31})$	1.000	0.438	0.949	0.955	1.000	0.994	0.997
$N(\delta_{121})$	0.051	0.489	0.607	0.992	1.000	0.977	0.986
$N(\delta_{231})$	0.096	0.003	0.028	0.394	0.986	0.763	0.852
$N(\delta_{131})$	0.215	0.006	0.099	0.709	0.999	0.901	0.941
$N(\eta_{111})$	0.002	0.017	0.121	0.000	0.101	0.100	0.283
$N(\eta_{211})$	0.001	0.004	0.022	0.000	0.084	0.092	0.210
$N(\eta_{311})$	0.000	0.004	0.010	0.000	0.040	0.047	0.158
$N(\eta_{112})$	0.003	0.016	0.203	0.003	0.122	0.095	0.268
$N(\eta_{212})$	0.001	0.006	0.094	0.002	0.101	0.075	0.216
$N(\eta_{312})$	0.000	0.007	0.047	0.000	0.052	0.051	0.157
$N(\eta_{1211})$	0.002	0.005	0.036	0.004	0.092	0.035	0.132
$N(\eta_{2311})$	0.001	0.002	0.015	0.000	0.054	0.000	0.063
$N(\eta_{1311})$	0.002	0.003	0.013	0.001	0.042	0.018	0.084
$N(\eta_{1212})$	0.001	0.007	0.098	0.002	0.110	0.031	0.124
$N(\eta_{2312})$	0.003	0.005	0.062	0.000	0.048	0.006	0.059
$N(\eta_{1312})$	0.002	0.007	0.059	0.000	0.064	0.022	0.101

Table 1: Proportion of the 1000 simulations in which each regression coefficient was retained.

Figure 1: The confusion matrices for lasso regularization methods.

In addition, to evaluate the effectiveness of lasso and Bayesian lasso methods in the context of simulations, we use the balanced accuracy index (BAI), introduced by [6]. The BAI improves evaluation in contexts where the proportions of selected and non-selected variables may be unbalanced. It is calculated as the average of the true positive rate and the true negative rate:

BAI=\frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right),

where $TP$ is the number of truly non-zero parameters correctly selected, $FP$ indicates the number of truly zero parameters incorrectly selected, $FN$ represents the number of truly non-zero parameters incorrectly excluded, and $TN$ is the number of truly zero parameters correctly not selected.
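As a quick check of the definition, the sketch below computes the BAI for the classical lasso from the retention proportions in the LASSO column of Table 1. Treating the average retention proportions as the true positive and false positive rates is an assumption about how the counts were aggregated over simulations; the function name is illustrative.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """BAI: average of the true positive rate and the true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Retention proportions from the LASSO column of Table 1.
nonzero = [1.000, 1.000, 1.000, 1.000, 0.997, 0.023,   # alpha terms
           1.000, 1.000, 1.000, 0.051, 0.096, 0.215]   # delta terms
zero = [0.002, 0.001, 0.000, 0.003, 0.001, 0.000,      # eta terms
        0.002, 0.001, 0.002, 0.001, 0.003, 0.002]

tp = sum(nonzero)            # expected number of true coefficients retained
fn = len(nonzero) - tp
fp = sum(zero)               # expected number of null coefficients retained
tn = len(zero) - fp
bai_lasso = balanced_accuracy(tp, fn, tn, fp)
print(round(bai_lasso, 4))
```

The result is approximately $0.85$ , in line with the value reported for the classical lasso in Table 2.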

Table 2 shows that, according to the BAI, CAVI-SN stands out as the most efficient method for variable selection in the context of simulations, closely followed by CAVI and ADVI with the CI criterion, which also demonstrate high efficiency. The classical lasso and ADVI-SN methods show good performance, whereas the MCMC methods perform less effectively. These results suggest that Bayesian variants of lasso, particularly CAVI, may offer significant advantages in terms of variable selection performance, especially in datasets where the balance between sensitivity and specificity is important.

	LASSO	BL-MCMC		BL-CAVI		BL-ADVI
	LASSO	CI	SN	CI	SN	CI	SN
BAI	0.849	0.756	0.806	0.952	0.961	0.952	0.907

Table 2: Comparison of methods performance by using the balanced accuracy index.

4 An example of soap production

We analyse the performance of the different lasso methods by applying them to data from a soap processing plant, discussed in [13]. This scenario involves examining the output as a function of the soap mixture components ( $x_{1}$ , $x_{2}$ , $x_{3}$ ) under the constraints:

0.2\leq x_{1}\leq 0.8,\quad 0.15\leq x_{2}\leq 0.5,\quad 0.05\leq x_{3}\leq 0.3,\quad x_{1}+x_{2}+x_{3}=1.

(4.1)

The process variables of interest are the mixing time ( $w_{1}$ ), the plodder temperature ( $z_{1}$ ) and the humidity ( $z_{2}$ ), the latter two being harder to control and therefore treated as noise variables.

Consider the model in (2.1) in its linear version. Its matrix formulation is given by

\underline{x}=\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ \end{bmatrix},\quad w=[w],\quad\underline{z}=\begin{bmatrix}z_{1}\\ z_{2}\\ \end{bmatrix},\quad V=\begin{bmatrix}w&0\\ 0&w\\ \end{bmatrix},

(4.2)

and

\underline{\alpha}=\begin{bmatrix}\alpha_{1}\\ \alpha_{2}\\ \alpha_{3}\\ \end{bmatrix},\quad\underline{\delta}=\begin{bmatrix}\delta_{1}\\ \delta_{2}\\ \delta_{3}\\ \end{bmatrix},\quad\Delta=\begin{bmatrix}\gamma_{11}&\gamma_{12}\\ \gamma_{21}&\gamma_{22}\\ \gamma_{31}&\gamma_{32}\\ \end{bmatrix}\text{ and }H=\begin{bmatrix}\eta_{11}&\eta_{12}\\ \eta_{21}&\eta_{22}\\ \eta_{31}&\eta_{32}\\ \end{bmatrix},

(4.3)

where,

Y=f(x,w,z)=\underline{x}^{\prime}\underline{\alpha}+\underline{x}^{\prime}\underline{\delta}w+\underline{x}^{\prime}\Delta\underline{z}+\underline{x}^{\prime}HV\underline{z}+\epsilon

(4.4)

Table 3 shows the OLS estimates, reported in [13], and the estimates from the proposed regularization methods for the 18 parameters in (4.3). Note that [13] uses OLS to fit the model and then performs a statistical significance analysis by iteratively eliminating terms with $p$ -values greater than 0.05 from the initial model, until all remaining terms are statistically significant. Classical lasso excludes more covariates than the other methods, whereas the CAVI Bayesian lasso retains the most covariates.

In Figure 2, we include the density and boxplot for the posterior distribution of the parameter $\delta_{1}$ , which was eliminated by BL-MCMC under the CI criterion. We also include the parameters $\delta_{3}$ and $\gamma_{21}$ , which BL-ADVI eliminated only under the CI criterion. Note that the posterior densities for these parameters show that MCMC has more variability and that CAVI is the most homogeneous, concentrating the probability in a region that does not contain the value zero.

Given the model’s 18 parameters and the 40 observations, leave-one-out cross-validation (LOO CV) is particularly advantageous for evaluating our model (see [27]). LOO CV makes full use of the available data and provides a nearly unbiased estimate of the generalization error, which is essential for reliable performance assessment in this context. Table 4 presents the LOO CV results for OLS and for the classical and Bayesian lasso regularization methods. These results compare the effectiveness of each method in minimizing the generalization error and highlight the robustness of the Bayesian approaches, particularly the BL-CAVI method, which achieved the lowest LOO CV value.
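The LOO CV procedure itself is generic: each observation is held out in turn, the model is refitted on the rest, and the held-out response is predicted. The sketch below shows this loop with a root mean squared prediction error as the summary, which is an assumption since the text does not state the error metric behind Table 4. The `fit` and `predict` callables are placeholders, illustrated here with a simple straight-line fit on made-up data rather than the mixture model.

```python
def loo_cv_rmse(xs, ys, fit, predict):
    """Leave-one-out CV: refit without observation i, then predict it."""
    sq_errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        sq_errors.append((ys[i] - predict(model, xs[i])) ** 2)
    return (sum(sq_errors) / len(sq_errors)) ** 0.5

def fit_line(xs, ys):
    """Least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def predict_line(model, x):
    a, b = model
    return a + b * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_exact = [1.0 + 2.0 * x for x in xs]          # noiseless line
ys_noisy = [1.0, 3.4, 4.7, 7.2, 9.0]            # made-up perturbed values
print(loo_cv_rmse(xs, ys_exact, fit_line, predict_line))
print(loo_cv_rmse(xs, ys_noisy, fit_line, predict_line))
```

Any of the estimators compared in this section can be plugged in through `fit` and `predict`, as long as variable selection is repeated inside each fold.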

Parameter	OLS	LASSO	BL-MCMC		BL-CAVI		BL-ADVI
Parameter	OLS	LASSO	CI	SN	CI	SN	CI	SN
$\widehat{\alpha}_{1}$	1898.99	1928.32	1900.10		1899.02		1897.50
$\widehat{\alpha}_{2}$	1626.42	1699.97	1624.27		1627.03		1624.75
$\widehat{\alpha}_{3}$	1537.79	1431.18	1541.43		1536.33		1535.70
$\widehat{\delta}_{1}$	39.53	-	-	38.07	39.52		40.40
$\widehat{\delta}_{2}$	285.90	170.58	288.79		285.08		284.81
$\widehat{\delta}_{3}$	-	-	-		26.57		-	24.52
$\widehat{\gamma}_{11}$	9.45	6.59	9.42		9.45		9.45
$\widehat{\gamma}_{21}$	-	2.47	-		-2.08		-	-1.74
$\widehat{\gamma}_{31}$	34.60	-	31.95		34.16		33.99
$\widehat{\gamma}_{12}$	-	-	-		-		-
$\widehat{\gamma}_{22}$	-20.00	-16.87	-20.00		-20.00		-20.02
$\widehat{\gamma}_{32}$	-	-	-		-		-
$\widehat{\eta}_{11}$	16.84	10.15	16.85		16.83		16.69
$\widehat{\eta}_{21}$	39.17	20.43	37.45		38.91		38.62
$\widehat{\eta}_{31}$	-21.09	-	-17.64		-20.55		-20.16
$\widehat{\eta}_{12}$	-	-	-		-		-
$\widehat{\eta}_{22}$	-25.00	-20.49	-24.99		-24.99		-25.02
$\widehat{\eta}_{32}$	-	-	-		-		-

Table 3: Estimated values for coefficients in (4.3).

	OLS	LASSO	BL-MCMC		BL-CAVI		BL-ADVI
			CI	SN	CI	SN	CI	SN
LOO CV	11.87	11.35	11.10	11.48	10.64		11.07	11.44

Table 4: Values of LOO CV for OLS, classical lasso and Bayesian lasso versions (MCMC, CAVI and ADVI).

4.1 Optimization of the response surface by the desirability function

After fitting a combined mixture-process-noise variable model, the final goal is to identify levels of the mixture components and controllable variables that simultaneously yield acceptable values of the mean and variance of the response. To address this optimization problem, we use the desirability function approach outlined in (2.3) and (2.4). Following the methodology described in [13], we first assume that the noise variables have zero mean. Under this assumption, we use the delta method, applying a first-order Taylor series approximation around the mean of the noise variables. The expected value and variance of $Y$ are therefore approximated as follows:

$$\mathbb{E}(Y)\sim\underline{x}^{\prime}\underline{\alpha}+\underline{x}^{\prime}\textbf{A}\underline{w}, \tag{4.5}$$

and

$$\mathbb{V}(Y)\sim\left[\Lambda^{\prime}\underline{X}+V^{\prime}\Lambda\underline{X}\right]^{\prime}\Sigma_{X}\left[\Lambda^{\prime}\underline{X}+V^{\prime}H\underline{X}\right]. \tag{4.6}$$

These equations allow us to quantify how the proportions of the mixture components and the process conditions affect the mean and the variability of the response. Table 5 shows the optimal values for OLS and the proposed regularization methods. All the methods yield identical proportions for the mixture components, yet there are notable differences in the values of the process variable. In terms of the expected value of the response variable, the Bayesian methods using the SN criterion outperform their CI counterparts, and BL-CAVI attains the largest value. When the variance of the response variable is also considered, the CAVI approximation remains the best choice, exhibiting the smallest coefficient of variation.
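The first-order approximation behind (4.5) and (4.6) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the response function, its coefficients and the noise variance are hypothetical, chosen so that a single zero-mean noise variable interacts with a controllable variable $w$. The residual error variance is omitted.

```python
import numpy as np

def delta_method(h, mu_z, cov_z, eps=1e-6):
    """First-order (delta method) approximation of the mean and variance
    of h(z), where z has mean mu_z and covariance cov_z."""
    mu_z = np.asarray(mu_z, dtype=float)
    grad = np.empty_like(mu_z)
    for k in range(mu_z.size):
        step = np.zeros_like(mu_z)
        step[k] = eps
        # Central finite difference for the k-th partial derivative
        grad[k] = (h(mu_z + step) - h(mu_z - step)) / (2 * eps)
    return h(mu_z), grad @ cov_z @ grad

def response(w):
    """Hypothetical response: the slope in the noise variable z depends on w."""
    return lambda z: 100.0 + 5.0 * w + (4.0 - 3.0 * w) * z[0]

cov_z = np.array([[2.0]])  # assumed variance of the noise variable
for w in (0.0, 0.5, 1.0):
    mean, var = delta_method(response(w), [0.0], cov_z)
    print(f"w={w:.1f}  mean={mean:.2f}  var={var:.2f}")
```

In this toy model the variance transmitted by the noise variable is $(4-3w)^{2}\sigma_{z}^{2}$, so it shrinks as $w$ increases. This is the kind of dependence on controllable settings that the optimization exploits.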

	OLS	LASSO	BL-MCMC		BL-CAVI		BL-ADVI
			CI	SN	CI	SN	CI	SN
$x_{1}$	0.45	0.45	0.45		0.45		0.45
$x_{2}$	0.50	0.50	0.50		0.50		0.50
$x_{3}$	0.05	0.05	0.05		0.05		0.05
$w$	0.85	0.91	0.89		0.90		0.87	0.91
$\widehat{\mu_{Y}}$	1881.28	1884.35	1872.76	1888.01	1901.13		1882.75	1890.29
$\widehat{\sigma_{Y}}$	15.59	15.78	15.62	15.62	15.50		15.66	15.61
$CV$	0.00828	0.00837	0.00834	0.00827	0.00815		0.00831	0.008257

Table 5: Optimal values for OLS, classical lasso and Bayesian lasso versions (MCMC, CAVI and ADVI). $CV=\widehat{\sigma_{Y}}/\widehat{\mu_{Y}}$.
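As a quick check, the coefficient of variation can be recomputed from the rounded means and standard deviations reported in Table 5. The labels below are only shorthand for the table columns, and small differences in the last digit relative to the table are expected from rounding.

```python
# Estimated mean and standard deviation of the response, copied from Table 5
methods = ["OLS", "LASSO", "BL-MCMC (CI)", "BL-MCMC (SN)",
           "BL-CAVI", "BL-ADVI (CI)", "BL-ADVI (SN)"]
mu = [1881.28, 1884.35, 1872.76, 1888.01, 1901.13, 1882.75, 1890.29]
sigma = [15.59, 15.78, 15.62, 15.62, 15.50, 15.66, 15.61]

# CV = sigma / mu for each method
cv = {m: s / u for m, u, s in zip(methods, mu, sigma)}
for m, value in cv.items():
    print(f"{m:13s} {value:.5f}")

best = min(cv, key=cv.get)
print("smallest CV:", best)
```

The smallest ratio corresponds to BL-CAVI, in agreement with the last row of the table.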

5 Conclusions

We proposed the use of classical and Bayesian lasso regularization for mixture experiments with noise variables. The model formulation was given in (2.1), where optimization was pursued through the desirability function method [11], aiming to simultaneously maximize the mean and minimize the variance of the response variable.

The findings from our study highlight the efficacy of the proposed regularization techniques in the context of mixture experiments with noise variables. The comparative analysis, which included ordinary least squares (OLS), lasso, and several Bayesian lasso formulations (CAVI, ADVI, and MCMC), demonstrated the superior performance of the CAVI algorithm. CAVI consistently outperformed the other methods in both the simulation study and the real data application, particularly in terms of variable selection accuracy and response surface optimization. In the practical application involving a soap processing plant, CAVI not only provided precise parameter estimates but also optimized the response, achieving a higher expected value and a lower variance. The Bayesian lasso variants, especially CAVI and ADVI, proved advantageous over classical lasso in robustness and flexibility of parameter estimation. While MCMC-based methods were reliable, they faced challenges related to convergence and computational efficiency in high-dimensional spaces. Variational inference methods (CAVI and ADVI) offered efficient approximations to the posterior distributions, significantly reducing computational time compared to MCMC. The application of Bayesian regularization techniques, particularly CAVI, enhances model selection, parameter estimation, and response optimization in complex systems influenced by both mixture components and process variables.

Acknowledgements

MGN was partially supported by Fondecyt Iniciación 11200500.

References

[1] Alhamzawi, R. & Taha Mohammad Ali, H. (2020). A new Gibbs sampler for Bayesian lasso. Commun. Stat. - Simul. 49(7), 1855-1871.
[2] Alves, L.C., Dias, R. & Migon, H.S. (2024). Variational Bayesian Lasso for spline regression. Comput. Stat. 39, 2039-2064.
[3] Azcarate, S.M., Pinto, L. & Goicoechea, H.C. (2020). Applications of mixture experiments for response surface methodology implementation in analytical methods development. J. Chemometrics, 34(12), e3246.
[4] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
[5] Blei, D.M., Kucukelbir, A. & McAuliffe, J.D. (2017). Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 112(518), 859-877.
[6] Brodersen, K.H., Ong, C.S., Stephan, K.E. & Buhmann, J.M. (2010). The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition (3121-3124). IEEE.
[7] Cornell, J.A. (2002). Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data (3rd ed.). John Wiley & Sons.
[8] Cornell, J.A. (2011). A Primer on Experiments with Mixtures. Wiley, Hoboken, NJ.
[9] Costa, N.R. & Lourenço, J. (2016). Multiresponse problems: desirability and other optimization approaches. J. Chemometrics, 30, 702-714.
[10] Costa, N.R. & Pereira, Z.L. (2010). Multiple response optimization: a global criterion-based method. J. Chemometrics, 24(6), 333-342.
[11] Derringer, G. & Suich, R. (1980). Simultaneous Optimization of Several Response Variables. J. Qual. Technol. 12(4), 214-219.
[12] Gelman, A. & Rubin, D.B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Stat. Sci. 7(4), 457-472.
[13] Goldfarb, H.B., Borror, C.M. & Montgomery, D.C. (2003). Mixture-process variable experiments with noise variables. J. Qual. Technol. 35(4), 393-405.
[14] Guo, J., Gabry, J., Goodrich, B. & Weber, S. (2020). Package 'rstan'. URL https://cran.r-project.org/web/packages/rstan/.
[15] Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4), 835-845.
[16] Hoffman, M.D. & Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593-1623.
[17] James, G., Witten, D., Hastie, T. & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer New York, NY.
[18] Kettaneh-Wold, N. (1992). Analysis of mixture data with partial least squares. Chemom. Intell. Lab. Syst. 14(1-3), 57-69.
[19] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A. & Blei, D.M. (2017). Automatic differentiation variational inference. J. Mach. Learn. Res. 18, 1-45.
[20] Li, Q. & Lin, N. (2010). The Bayesian elastic net. Bayesian Anal. 5(1), 151-170.
[21] Mallick, H. & Yi, N. (2014). A new Bayesian lasso. Stat. Interface. 7(4), 571
[22] Muteki, K. & MacGregor, J.F. (2007). Sequential design of mixture experiments for the development of new products. J. Chemometrics, 21, 496-505.
[23] Park, T. & Casella, G. (2008). The Bayesian lasso. J. Am. Stat. Assoc. 103(482), 681-686.
[24] Taavitsainen, V.M., Lehtovaara, A. & Lähteenmäki, M. (2010). Response surfaces, desirabilities and rational functions in optimizing sugar production. J. Chemometrics, 24(7-8), 505-513.
[25] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58(1), 267-288.
[26] Turkman, M.A.A., Paulino, C.D. & Müller, P. (2019). Computational Bayesian Statistics: An Introduction (Vol. 11). Cambridge University Press.
[27] Wong, T.T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 48(9), 2839-2846.