A Tutorial Review of Machine Learningbased MPC Methods
A Tutorial Review of Machine Learningbased MPC Methods
Review
Zhe Wu*, Panagiotis D. Christofides*, Wanlu Wu, Yujia Wang, Fahim Abdullah, Aisha Alnajdi
and Yash Kadakia
Dobbelaere et al. (2021), on the use of ML tools in chemical et al. (2022), model uncertainty and process disturbances are
engineering, the authors identified the lack of interpret- unavoidable issues in controller implementation, how to speed
ability of black-box ML models and the challenges in up ML-based MPC calculations and improve the robustness of
obtaining sufficient and reliable data as the main obstacles ML-based MPC are valid concerns that require attention dur-
that prevent the application of ML models. In addition, ing the execution phase of ML-based MPC. Overall, safety is an
controller robustness, operation safety, and system stability overarching concern that applies to all stages of the develop-
are some of the common issues that have been raised in the ment of ML-based MPC (Brunke et al. 2022; Hewing et al. 2020).
discussion of ML-based MPC (Bonassi et al. 2022; Brunke et al. Safe data collection, safe modeling, and safe implementation
2022; Hewing et al. 2020; Nian et al. 2020; Schweidtmann et al. are vital to ensure safe operation under ML-based MPC.
2021). In this review article, we consolidate some of the Finally, the lack of interpretability of black-box ML models
common challenges faced in the industrial implementation raises a general concern among global community (Bonassi
of ML-based MPC and categorize them according to their et al. 2022; Dobbelaere et al. 2021; Schweidtmann et al. 2021;
theoretical and practical aspects. Shang and You 2019). The lack of understanding of the decision
The theoretical challenge of ML-based MPC lies in the process and internal workings of data-driven models can
mathematical guarantee of closed-loop stability under ML- impede users’ trust towards these models, especially for con-
based MPC (Berberich et al. 2020). Closed-loop stability is trol applications where safety is paramount. Thus, enhancing
essential to ensure safe, efficient, and reliable operation of the transparency of ML models is a critical step towards
control systems. ML models are typically developed using a set gaining industrial approval of ML-based MPC.
of training data that is representative of the underlying data Substantial reviews on the application of ML models to
distribution. Even with sufficient training, some ML models process system engineering have been discussed by Daou-
may struggle to generalize to new, unseen data beyond the tidis et al. (2023), Everett (2021), Khan and Ammar Taqvi
training set. This may result in poor controller performance (2023), Lee et al. (2018), Mowbray et al. (2022), Pan et al. (2022),
and stability issues in ML-based MPC. Thus, understanding and Shang and You (2019). However, since the objective of
the generalization performance of ML models is a key chal- these reviews was to provide an overview on the develop-
lenge in guaranteeing closed-loop stability. ment of ML models and to analyze existing and potential
On the other hand, the practical challenges of imple- applications of ML models in the industry, discussions on
menting ML-based MPC in industrial settings are significantly ML-based MPCs were limited. On the other hand, reviews
more diverse, arising from the different stages in the devel- specific to ML-based MPC focused on various aspects of ML-
opment of ML-based MPC. The formation of an ML-based MPC based MPC (Abdullah and Christofides 2023a; Berberich and
can be divided chronologically into three phases: data collec- Allgöwer 2024; Bonassi et al. 2022; Brunke et al. 2022; Dev
tion, modeling, and execution phases. Each phase presents a et al. 2021; Gonzalez et al. 2023; Lu et al. 2019; Mesbah et al.
unique set of challenges that need to be addressed to ensure 2022; Nian et al. 2020; Norouzi et al. 2023; Ren et al. 2022; Tang
the feasibility of ML-based MPC. Data scarcity and data cor- and Daoutidis 2022). For instance, Mesbah et al. (2022) and
ruption are common issues that plague the data collection Nian et al. (2020) examine the applications of a particular
process in industrial settings. As ML models are highly type of ML-based MPC: reinforcement learning (RL)-based
dependent on the quantity and quality of the data used for MPC, while Brunke et al. (2022) explores its safety aspect.
training, the question of how to develop accurate ML models While Ren et al. (2022) focuses on a broader category of ML
under such circumstances poses a major concern in the models, neural networks (NNs), providing a tutorial review
application of ML-based MPC (Thebelt et al. 2022). Additionally, on their modeling approaches in MPC, there was limited
practical challenges often arise when modeling large-scale mention of the challenges related to the implementation of
systems in industries. The curse of dimensionality, a phe- ML-based MPC. Bonassi et al. (2022) consolidated recent ef-
nomenon in which an increase in data dimensions results in forts in the development of RNN-based MPC and discussed
an exponential growth in data requirement and a reduction in issues related to RNN-based MPC in terms of safe verification
the efficiency and effectiveness of ML algorithms, presents a and interpretability of the RNN model, as well as stability
significant challenge in modeling large-scale systems. Thus, and robustness of RNN-based MPC. Similarly, Berberich and
how to effectively capture the dynamics of complex large-scale Allgöwer (2024) focus on closed-loop stability and guarantee
systems and bypass the curse of dimensionality is another key of ML-based MPC. However, to the best of the authors’
challenge in the development of ML-based MPC for industrial knowledge, a comprehensive overview of the challenges
applications. The heavy computational burden and sluggish faced in the implementation of ML-based MPC has yet to be
processing speed of ML-based MPC is a well-acknowledged established. Hence, this review aims to complement the
problem (Wu et al. 2019d). In addition, as noted by Mesbah existing literature by providing an extensive review of the
Z. Wu et al.: Machine learning-based model predictive control 3
theoretical and practical challenges in ML-based MPC, spe- x ̇ = F(x, u, d), x(t0 ) = x 0 (1)
cifically NN-based MPC, as well as a summary of the current
where x ∈ Rn is the state vector, u ∈ Rm is the manipu-
efforts taken to address each of these challenges. The review
lated input vector, and d ∈ D is the disturbance vector
also presents a dual perspective to approach some of the
with D ≔ {d ∈ Rq | |d| ≤ dm , dm ≥ 0}. The control input
practical issues, namely from both modeling and control
vector u is constrained by the following set
viewpoints.
u ∈ U ≔ {ui, min ≤ ui ≤ ui, max , i = 1, …, m} ⊂ Rm . F(⋅, ⋅, ⋅) is a
This article is organized as follows: preliminary knowl-
sufficiently smooth vector function. Throughout the
edge on the class of systems considered and an introduction to
manuscript, the initial time t0 is taken to be zero (t0 = 0),
neural networks and ML-based MPC is provided in Section 2.
and it is assumed that F(0, 0, 0) = 0, and thus, the origin
In Section 3, the theoretical challenges of ML-based MPC and
is a steady-state of the nominal system (i.e., w(t) ≡ 0) of
current advances in characterizing the generalization per-
Eq. (1) (i.e., (x *s , u*s ) = (0, 0), where u*s and x *s represent the
formance of ML models and analyzing the closed-loop sta-
steady-state input and state vectors, respectively).
bility of ML-based MPC are reviewed. In Section 4, practical
challenges and potential solutions to resolve these issues are
discussed. This includes topics such as data scarcity, data
2.3 Supervised learning – neural networks
quality, the curse of dimensionality, model uncertainty,
computational efficiency, and safety concerns of ML-based
Machine learning can be divided into four learning types:
MPC. In Section 5, novel ML modeling and ML-MPC control
supervised, unsupervised, semi-supervised, and reinforce-
methods mentioned in Section 4 are applied to a nonlinear
ment learning. In supervised learning, the dataset S used for
chemical process to demonstrate their effectiveness. Finally,
model construction is the collection of M labeled data,
Section 6 concludes with an outlook on the future directions of
i.e., S = {(xi , yi ), i = 1, …, M}. Each data sample (xi , yi ) con-
ML-based MPC.
sists of a feature input vector xi ∈ Rdx and a labeled/target
output yi , where dx is the dimension of xi , i.e., the number of
features. If the label yi can only take on finitely many values,
2 Preliminaries i.e., the labeled output yi is discrete-valued, then the learning
task is identified as a classification problem. On the other
2.1 Notation hand, if the labeled output yi is continuous, then this is a
regression task. In the context of MPC, where models are
The notation |⋅| is used to denote the Euclidean norm of a required to predict uncountably many values, this consti-
vector. x T denotes the transpose of x. For a given matrix tutes a regression problem. ML models such as linear and
A ∈ Rm×n , its Frobenius norm is denoted by ‖A‖F . The no- nonlinear regression, autoregressive models, state-space
tation Lf V(x) denotes the standard Lie derivative where models, and neural networks have been widely adopted for
Lf V(x) ≔ ∂V(x)
∂xT f (x). Set subtraction is denoted by “\” MPC applications (Huang and Kadali 2008; Tang and Daou-
(i.e., A \ B ≔ {x ∈ Rn | x ∈ A, x ∉ B}). ∅ signifies the null set. tidis 2022). Neural networks are a subset of machine learning
The function f (⋅) is of class C 1 if it is continuously models that consist of layers of interconnected neurons.
differentiable in its domain. A function f : Rn → Rm is Neurons are the most fundamental unit of NNs. Given a data
said to be L-Lipschitz, if for all x, y ∈ Rn , if |f (x) − f (y)| ≤ sample (xi , yi ), each neuron in the NN acts as a processing
L|x − y| where L > 0. A continuous function α : [0, a) → unit that first computes the weighted sum of the inputs
[0, ∞) is said to belong to class K if it is strictly increasing and the bias term associated with the neuron, that is,
and is zero only when evaluated at zero. For a random (∑dj=1
x
wj x i, j ) + b, where xi = [x i, 1 , x i, 2 , …, x i, dx ], b is the bias
variable X, its expected value is denoted as E[X]. The term, and wj is the weight associated with the j-th feature.
notation P(A) represents the probability that an event A is The neuron then applies an activation function f (⋅) to the
occurring. weighted sum to produce its output yi . The different ar-
rangements and connections of neurons within the network
architecture determine the type of neural network. The
2.2 Class of systems formulations of two neural networks commonly used in MPC
applications, feedforward neural networks (FNNs) and
The class of continuous-time nonlinear systems considered recurrent neural network (RNNs), will be provided in the
is described by the following system of first-order nonlinear following section. Discussions on data generation, model
ordinary differential equations (ODEs): training, and model incorporation into MPC will not be
4 Z. Wu et al.: Machine learning-based model predictive control
covered in this review as an in-depth review has been pro- create loops within the networks. The recurrent structure of
vided in (Ren et al. 2022). the RNNs allows them to retain information from previous
time steps which facilitates the capturing of patterns in
sequential data. Therefore, RNNs are widely utilized for
2.3.1 FNN and RNN
modeling nonlinear dynamic processes, particularly in ap-
plications that require time series predictions where RNNs
The formulation of a one-hidden-layer FNN is provided
have demonstrated their efficiency to capture nonlinear
below, with hidden states h ∈ Rdh computed as follows:
behaviors over some time period. To illustrate the differ-
h = σ h (Wx + bh ) (2) ence, we consider a one-hidden-layer RNN. The computation
of RNN hidden states ht ∈ Rdh is slightly different from FNN,
where σ h is the element-wise nonlinear activation function
with the inclusion of a time factor t, as shown in Eq. (4)
(e.g., ReLU) and bh ∈ Rdh is the bias vector of the hidden state.
below:
W ∈ Rdh ×dx is the weight matrix connected to the input vector
x. The output layer y is computed as follows: ht = σ h (Uht−1 + Wxt + bh ) (4)
y = σ y (Vh + by ) (3) where σ h is the element-wise nonlinear activation function
dy ×dh of the hidden layer and bh ∈ Rdh is the bias vector of the
where V ∈ R is the weight matrix, by is the bias vector for
hidden state. U ∈ Rdh ×dh and W ∈ Rdh ×dx are weight matrices
the output, and σ y is the element-wise activation function
connected to the hidden states and the input vector,
in the output layer (typically linear unit for regression prob-
respectively. From Eq. (4), it can be seen that in addition to
lems). To develop an FNN model for the nonlinear system of
using the current input vector xt for computation, RNNs also
Eq. (1) using a training dataset with data collected at every
use the hidden state of the previous time step ht−1 in calcu-
sampling time t = tk , k = 1, 2, …, where tk+1 ≔ tk + Δ, and Δ
lating the current hidden state ht .
represents one sampling period, one can choose the current
The aforementioned configuration constitutes the stan-
state x(tk ) and the manipulated input u(tk ) that is applied
T
over t ∈ [tk , tk+1 ) as FNN inputs, i.e., x = [x(tk )T u(tk )T ] , to dard and simplest RNN structure. Since the first introduction of
predict the output y = x(tk+1 ) at the next sampling time. RNNs in the 1980s, many variations of the conventional RNNs
RNNs are a type of neural network that uses sequential have emerged and demonstrated exceptional capabilities in
data or time-series data. While the hidden layer configura- processing time series data. Popular variants of RNNs include
tion varies among different NNs, the output layer configu- the long short-term memory (LSTM) network (Hochreiter and
ration remains consistent for FNNs and RNNs. The key Schmidhuber 1997) and gated recurrent unit (GRU) (Cho et al.
difference between an FNN and an RNN lies in the direction 2014). LSTMs and GRUs are gated variants of RNNs that modify
of information flow. A figure showing the network struc- the computation of RNN hidden states. A standard LSTM cell
tures of an FNN and an RNN is presented in Figure 1. uses three gates, namely the forget, input, and output gates, to
From Figure 1, the flow of information in an FNN is update its cell and hidden states. The update equations are
observed to be unidirectional, where information is passed
listed below:
sequentially in the forward fashion through the input layer,
hidden layer, and finally to the output layer. On the other f t = σ(W f xt + U f ht−1 + bf ) (5a)
hand, RNNs are designed for sequential data where the or-
it = σ(W i xt + U i ht−1 + bi ) (5b)
der of inputs is important. RNNs represent an improvement
over FNNs in the sense that RNNs have connections that ot = σ(W o xt + U o ht−1 + bo ) (5c)
ht = ot ⊙ tanh(ct ) (5f)
bias terms bf , bi , bo ∈ Rdh . The gates use element-wise can also be readily applied to other types of RNNs, such as
nonlinear activation functions (e.g., sigmoid function σ(⋅) LSTMs and GRUs.
and tanh(⋅)). The weight matrices W f , W i , W o , W c , ∈Rdh ×dx
and U f , U i , U o , U c ∈ Rdh ×dh are used to connect the input
vector and the hidden states to the different gates, respec-
tively. The term ̃ct represents the candidate cell state in the
2.4 ML-based MPC
LSTM cell. It serves as an intermediate state that reflects the
To simplify the notation for MPC using RNN models, we
potential information used to update the cell state ct ,
represent the RNN model in the following continuous-time
modulated by the input and forget gates in Eq. (5e).
form for the nominal system of Eq. (1) (i.e., w(t) ≡ 0):
Compared to LSTMs, GRUs have a more simplified
structure with only two gates, the update rt and reset gates ẋ ̂ = Fnn (x ̂ , u) (7)
zt . The update equations of GRU are listed as follows:
where x ̂ ∈ Rn and u ∈ Rm are the state vector predicted by
rt = σ(W r xt + U r ht−1 + br ) (6a) the RNN model and the manipulated input vector, respec-
zt = σ(W z xt + U z ht−1 + bz ) (6b) tively. Note that neural networks are generally developed
as discrete-time models with sampled training data. While
̃ t = tanh(W h xt + U h (rt ⊙ ht−1 ) + bh )
h (6c) there are some methods such as using differential equa-
tions to model the evolution of the hidden state, or using
̃t
ht = (1 − zt ) ⊙ ht−1 + zt ⊙ h (6d) neural ordinary differential equations (neural ODEs)
with weight matrices W r , W r , W h ∈ Rdh ×dx , U r , U z , U h ∈ Rdh ×dh (Chen et al. 2018a), that directly model continuous dy-
and bias parameters br , bz , bh ∈ Rdh . The term h ̃ t represents namics in neural networks, the continuous-time
the candidate hidden state which is used to compute the GRU representation of RNNs adopted in this work is primarily
hidden state ht in Eq. (6d). A summary of the differences in to simplify notation. In other words, when incorporated
the network structures of a simple RNN, an LSTM, and a GRU into MPC, the RNN model is used to predict states at
is provided in Figure 2. discrete future time steps, rather than generating a
While RNNs, LSTMs, and GRUs are all types of neural continuous-time state trajectory. Consequently, the objec-
networks used for processing sequential data, they have tive function and constraints in the MPC framework will be
different structures and mechanisms to handle the vanishing based solely on the states predicted by the RNN models.
gradient problem and maintain long-term dependencies. In A general tracking model predictive control design is
general, RNNs are suitable for simpler tasks, LSTMs can be given by the following optimization problem:
used for tasks that require long-term memory, and GRUs may tk+N
achieve a balance of efficiency and performance. We will J = min ∫ L(x̃(t) , u(t))dt (8a)
u
tk
primarily focus on simple RNNs when addressing theoretical
and practical challenges in the following sections. However, it s.t. x̃˙ (t) = Fnn (x̃(t), u(t)) (8b)
should be noted that the methods discussed in this manuscript
The stability of MPC is a fundamental consideration in its Given a single-layer RNN model trained with M data sam-
application. ML models trained on historical data may ples, where each sample has a time sequence length of T, its
struggle to generalize to new or unseen operating condi- input and output are denoted as xi, t ∈ Rdx and yi, t ∈ Rdy
tions, leading to poor performance and stability issues in respectively, where i = 1, 2, …, M and t = 1, 2, …, T. This RNN
MPC. Developing a better understanding and techniques for model is designed to predict the states over the next T time
improving the generalization of ML models is a key chal- steps, based on past or current state measurements and any
lenge in ML-based MPC. manipulated inputs. This approach is analogous to solving a
nonlinear ODE (e.g., Eq. (1)) given the initial condition of x
and the manipulated input u that will be applied. In the
3.1 Generalization performance derivation of the generalization error bound, we assume
negligible bias terms in the RNN model. The revised equa-
Early work in ML-MPC often assumes that the model-plant tions of the hidden layer and the output layer are provided
mismatch is bounded, and therefore, the closed-loop stabil- below:
ity of MPC holds through the robust design of MPC. However,
hi, t = σ h (Uhi, t−1 + Wxi, t ) (10)
this assumption may not be true for practical ML models.
The generalization error in machine learning refers to the yi, t = σ y (Vhi, t ) (11)
expected error of a model on unseen data drawn from the
same distribution as the training data. In other words, it The loss function is denoted as L(y̌ , y), where y̌ is the
measures how well a trained model performs on new, un- predicted RNN output and y is the true/labeled output. The
seen data points. Generalization error is a critical concept in following assumptions are made on the RNN model and
machine learning because the ultimate goal is to build dataset.
models that can make accurate predictions on data that they
Assumption 1. All inputs into the RNN model are bounded,
have not seen during training. Initial research on the ⃒⃒ ⃒⃒
generalizability of ML models was developed using Vapnik– that is, for all i = 1, …, M and t = 1, …, T, ⃒⃒xi, t ⃒⃒ ≤ AX .
Chervonenkis (VC) dimension, a method that characterizes
the capacity and complexity of models (Sontag 1998b; Vapnik Assumption 2. The Frobenius norms of all the weight
et al. 1994). However, due to the simplified assumptions matrices are bounded in the following manner:
underlying the VC dimension approach, the derived gener- ‖U‖F ≤ AU, F , ‖V‖F ≤ AV, F , ‖W‖F ≤ AW, F
alization error bounds can be overly conservative (Chen
et al. 2019). Thus, alternatives such as probably approxi-
Assumption 3. The nonlinear activation σ h is positive ho-
mately correct (PAC) – Bayesian method (Neyshabur et al.
mogeneous and 1-Lipschitz continuous, i.e., for all γ ≥ 0,
2017) and empirical Rademacher complexity approach un-
x ∈ R, we have σ h (γx) = γσ h (x).
der the PAC Learning framework (Bartlett et al. 2017) were
introduced and adopted in recent years. Numerous studies
Assumption 4. Data samples used for training, validation,
have analyzed the generalizability of RNNs, predominantly
and testing purposes are drawn from the same distribution.
focusing on their performance in classification tasks (Chen
et al. 2019; Sontag 1998a; Wei and Ma 2019; Zhang et al. 2018). All the assumptions made adhere to common practice in
Recent works by Akpinar et al. (2019) and Wu et al. (2021b), machine learning theory. Specifically, the first two assump-
have proposed PAC analysis frameworks to derive general- tions assume the boundedness of RNN inputs and weights, a
ization error bounds for RNNs in regression problems. For condition typically satisfied in many modeling tasks where a
demonstration purposes, this review will follow the empir- finite class of neural network hypotheses is used to model
ical Rademacher complexity approach presented in Wu et al. nonlinear systems based on data collected from a finite set.
(2021b) to derive the generalization error bound of RNNs. The third assumption can be satisfied by activation functions
Interested readers may refer to works by Koiran and Sontag such as ReLU and its variants. It is used for the derivation of
(1998) and Sontag (1998a) for the VC dimension method, generalization error in this section, and can be omitted when
Zhang et al. (2018) for the PAC-Bayesian approach, and using other proof techniques, as demonstrated in Golowich
Akpinar et al. (2019) for the traditional PAC framework in the et al. (2018). The last assumption is a fundamental and neces-
derivation of the generalization error bound for RNNs. sary condition for analyzing generalization performance,
8 Z. Wu et al.: Machine learning-based model predictive control
which is adopted in many machine learning works that finite hypothesis class that fulfills Assumptions 1–4 was
consider the application of machine learning models to the considered, the RNN output can be proven to be bounded. This
same process without disturbances or model uncertainties. aligns with the fact that the nonlinear system described in Eq.
However, in the presence of disturbances that cause varia- (1) operates in the stability region Ωρ , thus ensuring that the
tions in process dynamics over time, the generalization error RNN outputs are bounded within a compact set. Thus, the
can still be derived for machine learning models by account- upper bound of yt is denoted by rt > 0, that is, |yt | ≤ rt ,
ing for the drift in distribution. We will discuss this further t = 1, …, T. Without loss of generality, it is assumed that the
when introducing online machine learning in Section 4.4.1. true outputs are also constrained by rt . Since the RNN outputs
The following text entails the essential definitions and and the true outputs are bounded, i.e., for all |yt |, |ȳ t | ≤ rt , we
lemmas widely used within the theoretical framework of can show that the MSE loss function is a locally Lipschitz
machine learning. continuous function satisfying the following inequality.
⃒⃒ ⃒⃒
|L(y1 , ȳ ) − L(y2 , ȳ )| ≤ Lr ⃒⃒y1 − y2 ⃒⃒ (14)
Definition 1. The expected loss/error or generalization
error of a function f which predicts output values y for each where Lr is the Lipschitz constant. Since Lipschitz conditions
given input x, with an underlying distribution Q, is given as: occur several times throughout this article for different
functions, to clarify, the definition of an L-Lipschitz function
LQ (f ) ≜ E[L(f (x), y)] = ∫ L(f (x), y)P(x, y)dxdy (12)
X×Y
in Section 2.1 (Notation) refers to the global Lipschitz con-
dition. However, when we assume that the mean-squared-
where the vector spaces of all possible inputs and outputs error loss function is Lipschitz continuous, we are referring
are denoted by X and Y , respectively, and the term P(x, y) to the local Lipschitz condition. This is because the neural
represents the joint probability distribution for x and y. network’s input and output data are drawn from a compact
However, since the joint probability distribution P is often set in the state space.
unknown, the empirical error, calculated from the data Further analysis shows that the generalization error can
samples, is used as an estimate of the expected loss. be perceived as a combination of the approximation and the
estimation error. The breakdown of the generalization error
Definition 2. The empirical error or risk of a dataset with M of a given neural network function fS taken from a hypothesis
data samples S = (s1 , …, sM ), where si = (x i , yi ), is defined as: class F , trained with a specific learning algorithm using a
training dataset S sampled from distribution Q, is as follows:
̂ S [L(f (x), y)] = 1 ∑ L(f (x i ), yi )
M
E (13)
M i=1 LQ (fS ) − LQ (f * ) = (min LQ (f ) − LQ (f * )) + (LQ (fS )
f ∈F
− min LQ (f )) (15)
f ∈F
In order to ensure that the RNN model can capture the
nonlinear dynamics of the system of Eq. (1) and generalize where LQ (fS ) is the generalization error of the function fS on
well to unseen operating conditions, it is essential to show the underlying data distribution Q. The term f * represents
that the generalization error E[L(f (x), y)] can be bounded. the global optimal hypothesis that yields the lowest gener-
Since the empirical error is used as a proxy measure of the alization error, for the data distribution Q; this hypothesis
generalization error, it is necessary that the empirical error could be outside of the finite hypothesis class F . The term
be sufficiently small and bounded such that the generaliza- minf ∈F LQ (f ) searches for the optimal hypothesis within F
tion error can be bounded. The empirical error can also be that minimizes loss functions over the distribution Q.
viewed as the loss associated with the training dataset. As the Thus, minf ∈F LQ (f ) can be seen as the generalization error of
training process of ML models is designed to reduce the the local optimal hypothesis. The term LQ (fS ) − LQ (f * ),
models’ training loss, it is achievable to obtain a sufficiently measures the generalization gap between the selected
small and bounded empirical error. The subsequent text hypothesis fS and the global optimal hypothesis f * . This dif-
will outline the steps taken to derive the upper bound of the ference can be decomposed into the approximation error
generalization error. and estimation error, represented by the first and second
In this study, the loss function used is the mean squared parenthesized terms, respectively.
error (MSE). While it can be readily demonstrated that the The approximation error can be thought of as how far
MSE loss function L(y, ȳ ) is not Lipschitz continuous for every the local optimal minf ∈F LQ (f ) deviates from the global
y, ȳ ∈Rdy , we can prove that the MSE loss function is locally optimal LQ (f * ). This provides insights into how close the
Lipschitz continuous for y, ȳ within a compact set in Rdy . As a hypothesis class F is to the global optimal hypothesis f * . In
Z. Wu et al.: Machine learning-based model predictive control 9
general, the likelihood of the optimal hypothesis f * being in functions f (⋅) that map the RNN inputs to the RNN output at
the hypothesis class F is greater for a larger hypothesis t-th time step, and trained with M i.i.d. data samples; the
class. Hence, larger hypothesis classes tend to have a smaller following inequality holds for all g t ∈ G t , with probability at
approximation error. On the other hand, the estimation least 1 − δ over samples S = (xi, t , yi, t )Tt=1 , i = 1, …, M
error compares the performance of the candidate hypothe- √̅̅̅̅̅
sis fS , trained with training dataset S, with the best hypoth- 1 M log(δ2)
E[g t (x, y)] ≤ ∑ g t (xi , yi ) + 2R S (G t ) + 3 (17)
esis within the hypothesis class, minf ∈F LQ (f ). Thus, the M i=1 2M
estimation error is dependent on both hypothesis class and
training data. In general, an increase in the size of the hy- It can be seen from Eq. (17) that the generalization error
pothesis class F could result in a higher estimation error, bound depends on the empirical error (the first term),
as it may be more challenging to search for the optimal Rademacher complexity (the second term), and an error
hypothesis in a larger hypothesis class. The error decom- function associated with confidence δ and the number of
position analysis in Eq. (15) illustrates how the generaliza- samples M (the last term). Since the first and last terms are
tion error is influenced by the size of the training dataset known given a set of M training data, in order to charac-
and the complexity of the hypothesis class. Thus, the next terize the upper bound for the generalization error
segment will explore methods to quantify the effect of these E[g t (x, y)], we need to determine the upper bound for the
factors on the generalization error. Rademacher complexity R S (G t ).
Intuitively, the Rademacher complexity measures the
maximum correlation between functions in the hypothesis
3.1.2 Upper bound for generalization error
class F and random noise. A smaller Rademacher
complexity indicates that the hypothesis class is less likely to
Generalization error is a measure of how well a model
fit random noise and therefore may generalize better to
generalizes from the training data to unseen data. While it
unseen data. The following error bound was derived for the
is typically estimated using a separate validation dataset or
generalization error of the RNN of Eqs. (10) and (11).
through techniques such as cross-validation, deriving a
theoretical understanding is also important as it can help
Theorem 1 . (c.f. Theorem 1 in Wu et al. (2021b)) Given an i.i.d
improve the architecture and training of ML models to
training samples of size M, S = (xi, t , yi, t )Tt=1 , i = 1, …, M and
achieve the desired generalization performance. The
an RNN model that satisfy Assumptions 1–4, the following
computer science community has made great efforts to
inequality holds for G t , the family of loss function associated
derive an upper bound for the generalization error of
to the hypothesis class F t of vector-valued functions that
various neural network models. Specifically, the following
map the RNN inputs to the RNN output at t-th time step, with
lemma characterizes the upper bound of the generalization
probability at least 1 − δ over S:
error using the Rademacher complexity R S (⋅), which √̅̅̅̅̅̅̅̅
quantifies the richness of a class of functions, and is often H( 2 log(2)t + 1)AX
E[g t (x, y)] ≤ O (Lr dy √̅̅ )
used in machine learning theory to bound the generaliza- M
√̅̅̅̅̅
tion error.
1 M log(δ2)
+ ∑ g t (xi , yi ) + 3 (18)
M i=1 2M
Definition 3. (Empirical Rademacher Complexity) The
t
empirical Rademacher complexity of a given hypothesis where H = AV, F AW, F (AAU,U,FF)−1−1.
class K of real-valued functions, trained with a set of data
samples S = {s1 , …, sM }, is defined as Remark 2. The generalization error bound of Eq. (18) implies
that the following attempts can be taken to reduce the
1 M
R S (K ) = Eϵ [sup ∑ ϵi k(si )] (16) generalization error: (1) minimize the empirical loss
k∈K M i=1 1 M
M ∑i=1 g t (xi , yi ) over the training data samples S through a
where ϵ = (ϵ1 , …, ϵM )T with ϵi being independent and careful design of neural network, and (2) increase the
identically distributed (i.i.d.) Rademacher random variables number of training samples M. Additionally, as discussed in
satisfying P(ϵi = 1) = P(ϵi = −1) = 0.5. the error decomposition of Eq. (15), increasing the
complexity hypothesis class in terms of larger weight
Lemma 1 . (c.f. Theorem 3.3 in Mohri et al. (2018)) Consider a matrices bounds M could decrease the approximation error,
class of loss functions G t (G t = {g t : (x, y) → L(f (x), y)}) but may also increase the estimation error, which corre-
associated to the hypothesis class F t of vector-valued sponds to the term O (⋅) in Eq. (18). This is consistent with the
10 Z. Wu et al.: Machine learning-based model predictive control
analysis of the trade-off between approximation error and provided that a stabilizing controller Φnn (x) can be found).
estimation error in Eq. (15). Therefore, in practice, we One of the primary advantages of the LMPC approach over
generally start with a simple neural network and gradually the nonlinear control law Φnn (x) is its ability to explicitly
increase its complexity in terms of more neurons, layers and incorporate optimality considerations, as well as constraints
larger weight matrices bounds to improve the training and on inputs and states within an online optimization frame-
testing performance. The whole process stops when the work. This approach improves the closed-loop performance
testing error starts increasing, which indicates the occur- of the system. Furthermore, since the closed-loop stability
rence of overfitting. and feasibility of LMPC are guaranteed by Φnn (x), there is no
need to introduce a terminal penalty term in the cost func-
Note that for neural networks with different architectures
tion. Additionally, while the horizon length N affects the
(e.g., types of NNs, number of layers and neurons, activation
closed-loop performance, it does not impact the stability of
functions, etc.), we will have different values for Rademacher
the closed-loop system.
complexity. For instance, Golowich et al. (2018) derived the
A key step for closed-loop stability under MPC is to
Rademacher complexity upper bounds for a multi-layer FNN
ensure that the discrepancy between NN predictions and the
(see Eq. (19)).
actual state evolution is bounded. If we consider a deter-
√̅̅̅̅̅̅̅̅
BX ( 2 log(2)η + 1) ∏dj=1 Bj, F ministic error bound, that is, the error between NN pre-
O ⎝
⎛ √̅̅ ⎠
⎞ (19)
M dictions and the true evolution of states is bounded for all
times, the ML-MPC of Eq. (9) guarantees closed-loop stability
The FNN has η hidden layers in total, the Frobenius by designing the nonlinear control law Φnn (x) appropriately
norm of the weight matrices of the j-th hidden layer is to render the steady-state stable in the presence of the worst-
bounded by Bj, F . Similar to the RNN, the FNN is trained using case scenario where the prediction error reaches its bound
a dataset of size M and its input x is bounded by BX , at all times. However, since in reality, the generalization
i.e., |x| ≤ BX . errors of any neural networks developed using supervised
learning methods are bound in some probability (e.g., the
error bound in Eq. (18)), closed-loop stability under ML-MPC
3.2 Closed-loop stability is actually guaranteed in a probabilistic sense. This implies
that closed-loop stability is guaranteed with a certain prob-
Closed-loop stability in MPC is important for ensuring the safety ability for each time step. In other words, there exists a small
and reliability of plant operations, and some recent efforts have probability that stability may not hold if the prediction error
been made to investigate the stability of ML-based MPC (Limon exceeds the theoretical bound. Additionally, although sta-
et al. 2017; Meng et al. 2022; Soloperto et al. 2022; Wu et al. 2019c, bility cannot be guaranteed due to the probabilistic nature of
2021b). This section will explore and analyze the closed-loop its prediction error, it should be noted that the actual
stability of ML-based MPC based on the LMPC formulation of Eq. probability of closed-loop stability for each time step could
(9). Specifically, in the LMPC of Eq. (9), the constraint of Eq. (9e) be higher than the lower bound 1 − δ for many reasons. For
ensures that the time derivative of the Lyapunov function V(x), instance, (1) if the RNN model is well trained and the
at time tk remains less than or equal to the value it would attain modeling error remains significantly below its upper bound,
if the nonlinear control law u = Φnn (x) is applied in a sample- and (2) if the next state remains within the stability region
and-hold manner within the closed-loop system. This constraint even when the modeling error surpasses its upper limit in a
allows us to demonstrate (given state measurements at syn- single sampling period, then the probability. Therefore, the
chronous sampling times) that LMPC inherits the stability and theoretical error bound of Eq. (18) serves as a conservative
robustness properties of the nonlinear control law Φnn (x). The estimate of the probability of maintaining closed-loop sta-
stability characteristic of LMPC is directly inherited from the bility, and can be used to guide the construction of network
stability of the nonlinear control law Φnn (x) when applied in a architecture and selection of sample size.
sample-and-hold manner.
Additionally, the feasibility of LMPC is inherently
guaranteed by the nonlinear control law Φnn (x), since it is a
feasible solution to the optimization problem of Eq. (9).
4 Practical challenges in
Detailed results on this aspect for MPCs using first-principles applications of ML-MPC
model can be found in Mahmood and Mhaskar (2008) and
Mhaskar et al. (2006) (note that the results in these papers In addition to the theoretical understanding of the general-
can be readily applied to neural-network-based MPC ization performance of ML models and the resulting
Z. Wu et al.: Machine learning-based model predictive control 11
closed-loop stability properties, there exist many other current state of a system x(tk ) and manipulated input u(tk )
practical challenges for the implementation of ML-based to predict the evolution of the state of the system over a
MPC systems in real-world systems. period of time, that is, x(t) ∀ t ∈ [tk , tk + Δ], where Δ rep-
resents one sampling period. The state trajectory x(t) con-
sists of multiple internal states, i.e., the collection of states
4.1 Data scarcity between tk and tk + Δ separated by a fixed time interval τ (τ
can be considered as the smallest time interval for which
The quantity and quality of the data used for model devel- sensor measurements are available). The loss function of
opment are paramount for the performance and accuracy of PIRNN is defined as follows:
ML models. In Section 3.1, we see that an increase in the
Loss = αX LossX + αG LossG (20a)
number of training samples is helpful in reducing the
generalization error of the model, thereby improving its where
accuracy. However, in practice, it can be difficult to gather a
1 NX 1 NT
substantial amount of data samples to meet the re- LossX = ∑ ∑ |x n (ti ) − ̃xn (ti )|2 (20b)
N X n=1 N T i=0
quirements of developing an accurate ML model. This is
especially true for complex systems with a large number of 1 NG
feature variables, where data collection can be costly and LossG = γ ∑ |x n (t0 ) − ̃xn (t0 )|2
N G n=1
limited. Hence, this section presents an overview of some 1 NG 1 NT
popular techniques to address data scarcity in machine + ∑ ∑ |G (̃xn (ti ), un )|2 (20c)
N G n=1 N T i=0
learning.
states ̃x and the manipulated inputs u into F(̃x, u). Finally, the from domain knowledge, such as the first-principles model,
ODE residual can be calculated by subtracting F(̃x, u) from x̃˙ . the accuracy of these theoretical models may affect the
Specifically, given an initial condition, we do not need labeled generalization performance of PINNs. The limitations of the
output data (i.e., dynamic state trajectory from the training data theoretical models can be addressed by applying the PINNs
starting at that initial condition) to evaluate LossG . In partic- in an inverse manner, using observed data. For example,
ular, one could take a set of initial points (possibly random) in Zheng and Wu (2023) proposed an inverse problem of PIRNN
the domain of consideration where there is no labeled output to improve the first-principles model used in the loss func-
(i.e., state trajectories), use the PIRNN to predict the trajectory tion of PIRNN using real-time data.
forward with that starting point (̃xn (t, u)), and evaluate the In addition, different types of domain knowledge can be
physics-driven loss LossG directly using the physics-informed incorporated into the design of PINNs. For example, for
knowledge of the first-principles model F(x, u). It is also noted systems with physical constraints on states and/or inputs,
that in the extreme case with no labeled output data used as additional loss terms can be included in the loss function to
training data (hence N X = 0), the PIRNN learns the dynamics enforce the constraints on the relevant states/inputs. The
using only the loss function LossG . A summary of the PIRNN following loss term is an example that imposes non-negative
training process is provided in Figure 3, interested reader may constraints on the states (Wu et al. 2023a):
refer to Zheng et al. (2023) for details. 1 NG 1 NT
A forward problem is a problem that involves solving LossReLU = β ∑ ∑ ReLU (−̃xn (ti )) (21)
N G n=1 N T i=0
for the outputs of a system when the inputs and governing
equations are known. On the other hand, an inverse problem where β denotes the weight coefficient of the loss term. In
seeks to infer unknown inputs or parameters from the Eq. (21), the non-negative constraints are implemented by the
observed outputs by working backward. While the discussion ReLU function. As the output of the ReLU function is always
above introduced and explored the applications of PINNs to non-negative, the additional loss term guarantees physically
forward problems, PINNs have also been extensively applied reasonable predictions by ensuring that the model is penal-
to inverse problems. For example, PINNs demonstrated ized whenever the physical constraints are violated. It has to
remarkable accuracy when used to derive velocity and pres- be noted that the physical constraints, such as non-negativity
sure fields from fluid flow images, such as temperature constraints on process states, are integrated into PIRNN
gradient maps that depict the flow in an espresso cup (Cai models as soft constraints. Unlike using a ReLU activation
et al. 2021; Raissi et al. 2020). function at the output layer, the soft constraint method offers
Following the study of a generalization error bound for greater flexibility by modifying the loss function without
purely data-driven RNN models, recent efforts have been requiring adjustments to the network architecture.
made to analyze the generalization performance of PINNs. In addition to incorporating physics-induced loss terms,
For example, in Mishra and Molinaro (2022, 2023), the domain knowledge such as process structure knowledge of a
generalization error analysis for a general class of PINNs process network can be used to improve the design of the NN
approximating solutions of the forward and inverse prob- architecture to reflect the underlying physics (de Giuli et al.
lems for PDEs was performed, respectively. Furthermore, in 2024; Wu et al. 2020). In many industrial chemical processes,
Zheng et al. (2023), the results of generalization errors were operations in the upstream phase of production have a
developed specifically for PIRNNs. As PINNs were developed direct impact on those in the downstream phase, whereas
Physics-
Finite difference informed loss
Data loss
Output …
layer
Hidden
layer
Unfold
the reverse influence is often negligible, unless recycling is domain labels are available, i.e., no label from target
involved. While theoretical models (if available) are able to domain, then this is a transductive TL problem. On the other
capture the relationship between the upstream and down- hand, if the target domain labels are available, then this is an
stream processes in their equations, this relational infor- inductive TL task. If both the source and target domain labels
mation is rarely utilized in data-driven models. In Wu et al. are unavailable, then this constitutes an unsupervised TL
(2020), a partially-connected RNN structure that mirrors the task. In this review, we will focus our discussion on the
process network of a two continuous stirred tank reactors inductive TL problem, where labeled data from both the
(CSTR) system, was proposed. Figure 4 shows how a standard source and the target domains are available.
RNN model, with a fully connected structure, can be Consider an inductive TL task of transferring knowledge
decoupled into a partially-connected structure. Unlike the from a source task to a target task. The process first involves
fully connected structure where all inputs affect all output, developing a pre-trained model (e.g., RNN model) on a large
in the partially connected structure, the output of the first dataset from the source domain. Thereafter, the pre-trained
RNN layer x 1 = [C A1 T 1 ] is designed to be affected only by the RNN model is adapted to the target domain. Formally, we
input u1 = [C A10 Q1 ], and the output of the second RNN layer define Q and P to be the source and target distributions. The
x 2 = [C A2 T 2 ] is affected by both inputs u1 and u2 = [C A20 Q2 ]. RNN input is denoted as xi ∈ Rdx and the labeled output as
C A1 , T 1 , C A2 , T 2 denote the concentration of the reactant yi ∈ Rdy . The set S = (s1 , …, sM ) = ((x1 , y1 ), …, (xM , yM )) ∈
A and the temperature in the two reactors, respectively, and (X × Y)M denotes the collection of M labeled data sampled
C A10 , Q1 , C A20 , Q2 denote the inlet concentration of A and from a certain distribution, where X denotes the input space
heat input rate for the outer cooling jacket, respectively. and Y denotes the output space. The labeling function or
Thus, the partially-connected network resembles the con- ground truth model that returns yi = f (xi ), is denoted as
nections in a two-CSTR system. Likewise, in de Giuli et al. f : X → Y. It is noted that the labeling function for the target
(2024), relational information was used in developing a data- task fP can be different from the source task fQ . Furthermore,
driven model for a district heating system, where individual the loss function is defined as L(⋅, ⋅) : Y × Y → R+ , where R+
RNN models were connected based on the physical system denotes the set of all positive real numbers. The expected loss
network structure. Both studies reported a significant for any two function a(x), b(x) on the distribution Q, that maps
improvement in the model’s generalization performance the input space X to the output space Y, is given by
and accuracy, highlighting the merits of PIMLs. L Q (a, b) ≔ Ex~Q [L(a(x), b(x))].
In modeling nonlinear processes, it is assumed that both
4.1.2 Transfer learning the source and target processes can be represented in the form
of Eq. (1), but with different process dynamics. For instance, in
Another perspective to address data scarcity in data-centric the case of a chemical reactor, the source process with similar
approaches would be to reuse models already developed for configurations might involve the same reactor type but under
similar tasks. This is the key concept of transfer learning varying operating conditions and reactions, different types of
(TL), where knowledge learned from a task (source) can be reactors performing the same reactions, or even different
transferred to a related task (target) to boost performance or reactors under different conditions (see Figure 5). Therefore,
reduce the data requirement for the new task (Alhajeri et al. the distributions of Q and P for the source and target pro-
2024; Thebelt et al. 2022). Pan and Yang (2009) provided a cesses are different. However, in reality, because the source
comprehensive overview of the types of TL tasks. In sum- and target distributions are typically unknown, we use the
mary, TL can be classified into transductive, inductive, and corresponding empirical distributions Q̂ and P̂ for the source
unsupervised TL, based on the availability of label infor- and target samples, respectively. These samples, S from the
mation from source and target domains. If only the source source distribution Q, and T from the target distribution P, are
leading to insufficient training data. Therefore, synthetic data values between existing data. These approaches, though
generation is applied as an augmentation of incomplete and simpler than modern generative models, can be effective for
unlabeled process signal data. Various generative models certain applications where data patterns are relatively well
have been proposed for data augmentation such as generative understood and straightforward.
adversarial networks (GAN), variational autoencoders (VAE),
normalizing flow (NF) models, Gaussian mixture models,
hidden Markov models, latent Dirichlet allocation, and 4.1.4 Active learning
Boltzmann machines (Bond-Taylor et al. 2021; Harshvardhan
et al. 2020). Specifically, GAN, VAE, and NF have been applied Active learning is another approach that intelligently selects
in some recent works in chemical engineering for synthetic data points that are associated with high uncertainty or low
data generation, e.g., He et al. (2020), Lee and Chen (2023), Qin confidence predictions by the current model for labeling and
and Zhao (2022), Xie et al. (2019), Zhang et al. (2021b, 2024), and inclusion in training. Unlike synthetic data generation, the
Zhu et al. (2021). goal of active learning is to choose as few labeled samples as
GANs are a class of generative models inspired by the possible to minimize the cost of obtaining real data. Active
concept of Nash equilibrium in game theory (Goodfellow learning methods can generally be classified into three cat-
et al. 2014). A typical GAN consists of two networks: a egories: pool-based sampling, stream-based selective sam-
generator G and a discriminator D. The generator takes pling, and membership query synthesis (Settles 2009). In
random variables z (z~pz (z)) as input, and aims to generate pool-based sampling, a large set of unlabeled data is avail-
samples G(z)~pg that can fool the discriminator. The able, from which samples are drawn iteratively at no cost.
On the other hand, stream-based selective sampling involves
discriminator is used to evaluate whether a given sample x is
drawing unlabeled samples one at a time. In membership
from real data (x~pdata (x)) or from the generator x~pg (x). It
query synthesis, the active learner generates synthetic
returns a score D(x), where D(x) = 1 if x is classified as real
samples and requests labels for them. In Zhao et al. (2022b),
data, and D(x) = 0 if x is classified as generated data. Both
pool-based active learning is used to enrich the training set
the generator and discriminator continuously optimize for modeling a nonlinear chemical process by iteratively
themselves until they reach Nash equilibrium. Through this identifying the training data that improve model perfor-
process, the difference between two distributions pg (x) and mance most efficiently. Additionally, active learning has the
pdata (x) is minimized such that the generator can capture the potential to be intergrated with machine-learning-aided
distribution of real data pdata . optimal experiment design strategies aimed at minimizing
VAE is another class of generative models that combines time and costs associated with experiments in chemical
Bayesian inference with deep networks (Kingma and Welling engineering problems, such as identifying the proper kinetic
2013). Typically structured like an autoencoder, a VAE consists model structure (Sangoi et al. 2022, 2024).
of an encoder and a decoder. Given real data x following the
distribution x~pdata (x), the encoder qϕ (z|x) maps the real
data to a latent space, and the decoder pθ (x|z) reconstructs 4.2 Data quality
the latent variable to original data, where the latent variables
z are designed to follow a continuous distribution z~p(z). The Another common issue faced in the development and
objective of VAE training is to balance the reconstruction implementation of ML models for real-world applications is
accuracy and the divergence between the latent distribution the quality of the data. The presence of noise in sensor data is
qϕ (z|x) and the prior distribution p(z). Therefore, by sam- almost inevitable due to factors such as sensor limitation,
pling from the prior distribution p(z) and decoding these environmental conditions, measurement error, etc. Since
samples, VAEs are effective in generating new samples that noisy data can impede the learning process of ML models,
closely mimic the statistical properties of real data. this section will explore popular and innovative solutions to
handle and mitigate the issue of noise-corrupted data.
Remark 3. In addition to GANs and VAEs, traditional
methods, such as bootstrapping in statistics and interpola- 4.2.1 Conventional approaches
tion techniques (e.g., linear or quadratic interpolation
between nearby points), are also commonly used for data Dropout method is a popular regularization technique used in
augmentation. Bootstrapping generates new datasets by the training of neural networks to prevent overfitting
resampling with replacement from the original data, while (Abdullah et al. 2022a; Hinton et al. 2012; Srivastava et al. 2014).
interpolation creates synthetic data points by estimating Overfitting occurs when a model learns the training data too
16 Z. Wu et al.: Machine learning-based model predictive control
well, including its noise and outliers, which results in poor distribution of NN output. Hence, by allowing us to quantify
generalization to new, unseen data. Introduced by Hinton the uncertainty in predictions and reducing the risk of
et al. (2012), dropout involves randomly “dropping out” a overfitting, MC dropout is a powerful method for learning
fraction of the neurons in the network during each training the ground truth from data corrupted by noise.
iteration. This means that for each forward and backward
pass, certain neurons are temporarily removed from the 4.2.2 Co-teaching method
network, along with all their incoming and outgoing con-
nections. The neurons to be dropped out are chosen at Co-teaching is an innovative way to address noise in labeled
random with a probability p, known as the dropout rate. By data. The idea behind co-teaching stems from the observations
training the network with dropout, the model becomes more that deep learning models tend to fit simple patterns at the
robust and less likely to rely on specific neurons, encouraging early stage of the training process and progressively learn more
it to learn more distributed and generalizable representa- complex nuances as training continues (Han et al. 2018). Based
tions. During testing or inference, dropout is turned off, and on the belief that the training loss is related to the level of noise
all neurons are used, but their outputs are scaled by the in the data sample, i.e., noise-free or ‘clean’ data samples are
dropout rate to maintain consistency with the training phase. more likely to have small training loss and vice versa, co-
Furthermore, Monte Carlo (MC) dropout, a technique used teaching is designed to have two simultaneously trained NNs. A
to estimate uncertainty in deep neural networks, can be used to schematic of the co-teaching method with two NNs, A and B, is
develop stochastic neural networks that characterize the un- shown in Figure 6. For every mini batch training iteration,
certainties of prediction (Gal and Ghahramani 2016a,b). In the models identify and collect a small portion of the data
contrast to the standard dropout, which is only applied during samples with small training losses. Subsequently, the models
the training phase, the MC dropout applies dropout during both exchange the identified dataset with ‘clean’ samples and update
the training and inference phases. Hence, predictions made by their weights based on the exchanged dataset. The process is
NN models using the MC dropout are not deterministic. While repeated until all training epochs have been completed.
the standard dropout helps mitigate the impact of data over- Although the co-teaching method was initially proposed
fitting by learning a deterministic model, the MC dropout learns for classification problems with noisy labels, co-teaching has
a stochastic model that can quantify the system’s uncertainty. been successfully adapted for regression problems, such as
This ability to estimate uncertainty is particularly valuable for modeling of nonlinear processes in the presence of noise
improving controller design under uncertainty. Therefore, MC (Abdullah et al. 2022b; Wu et al. 2021c). In addition to the
dropout has been adopted in many chemical process modeling standard co-teaching algorithm highlighted in this section,
works when training datasets (e.g., sensor measurements) is variants such as asymmetric co-teaching (Yang et al. 2020),
corrupted with noise, and there is a need to estimate prediction stochastic co-teaching (de Vos et al. 2023; Robinet et al. 2022),
uncertainty (Alhajeri et al. 2022; Wu et al. 2021a). and co-teaching+ (Yu et al. 2019) have been proposed to
Specifically, the MC dropout method aims to find the improve the accuracy of the model. In essence, by leveraging
posterior distribution of the model weights p(W | X, Y),
where X and Y respectively denote the input and output
matrices of the NN, and W denotes the weight matrix of the
NN. However, since it is intractable to obtain the posterior
distribution in practice, an estimation of the predictive dis- A B
tribution of the NN output is used and its calculation is
provided as follows (Wu et al. 2021c):
1 Nt
p(y* | x* , X, Y) ≈ ∑ p(y* | x* , Wk ) (22)
N t k=1
A B
where N t represents the total number of times the model is
executed with different dropout masks (i.e., the number of
realizations). Since the NN model with MC dropout is sto-
chastic, we are able to generate random predictions by
A B
running the MC dropout-NN model multiple times using the
same input. By performing multiple forward passes through
the network with different dropout masks and averaging the Figure 6: The symmetric co-teaching framework that trains two
results, we can gather an approximate probabilistic networks (A and B) simultaneously.
Z. Wu et al.: Machine learning-based model predictive control 17
on the peer network’s perspective, co-teaching is an effective dense layers. The spectral norm of the weight matrix w,
method to reduce the influence of noisy data and improves denoted as ‖W‖2 is 1, as the spectral norm of a matrix is
the overall generalization capability of the NN model. defined to be equal to the largest singular value in its sin-
gular value decomposition (SVD). Since the Jacobian matrix
4.2.3 Lipschitz-constrained NN of the activation function σ : Rm → Rm has a spectral norm of
1 almost everywhere (except for a set of measure 0), then
To reduce the effect of data noise on the generalization based on Theorem 3.1.6 in (Federer 2014), the function σ is
performance of the model, another approach is to design an 1-Lipschitz continuous with respect to the Euclidean norm.
inherently robust NN. In particular, Lipschitz-based NN have Hence, it can be concluded that every SpectralDense layer is
demonstrated robustness and trustworthiness, especially in 1-Lipschitz continuous. Following this, the definition of the
handling adversarial attacks. Since many real-world systems class of Lipschitz-constrained neural networks (LCNNs) is
(e.g., chemical processes) are Lipschitz continuous, devel- given as follows.
oping a Lipschitz-constrained NN provides a promising
solution to addressing data noise in the training set. In Definition 4. Let L N m
n be the class of Lipschitz-constrained
mathematical terms, given a function f that maps from set neural networks (LCNNs) defined as follows:
X ∈ Rn to set Y ∈ Rm , i.e., f : X → Y , f is said to be Lipschitz LN m n ≔ { f | f : R → R , ∃j ∈ N
n m
continuous if there exists a constant C L ≥ 0 such that for all such that f = W j+1 fj ◦ fj−1 ◦ … ◦ f2 ◦ f1 ,
x, y ∈ X, the inequality |f (x) − f (y)| ≤ C L ⋅ |x − y| holds. The where fi = σ(W i x + b), and ‖ W i ‖2 = 1, i = 1, …, j } (24)
term C L is known as the Lipschitz constant, and the function
f can also be referred to as C L -Lipschitz. The Lipschitz con- where σ denotes the GroupSort activation function with
stant C L allows one to quantify the maximum change in the group size 2. Every LCNN in L N m n is a composition of many
output of a Lipschitz continuous function f with respect to SpectralDense layers (i.e., W i , i = 1, …, j), with a final weight
changes in its input. Hence, if the Lipschitz constant C L is matrix W j+1 at the end. The spectral norm constraint is
small, the function f will be less sensitive to perturbations in applied to all weight matrices except the final weight matrix
its input, making it more robust. Various methods have been W j+1 .
proposed to constrain the Lipschitz constant of neural Since every SpectralDense layer fi , i = 1, …, j, is 1-Lipschitz
network models, with most focusing on weight matrices continuous, i.e., the Lipschitz constant for each Spec-
(Arjovsky et al. 2017; Cisse et al. 2017; Gouk et al. 2021), gra- tralDense layer is bounded by 1, it can be easily shown that
dients (Anil et al. 2019; Gulrajani et al. 2017; Hein and for every LCNN in class L N m n , its Lipschitz constant is
Andriushchenko 2017), or network architectures (Tang 2023; bounded by the spectral norm of the final weight matrix
Wang and Manchester 2023). W j+1 .
Here, we give an example of using the SpectralDense
The SpectralDense LCNN approach was adopted in Tan
layer that was proposed by (Serrurier et al. 2021). Specif-
and Wu (2024) and Tan et al. (2024b) to handle noisy data
ically, SpectralDense layers are dense layers such that (1)
when modeling chemical processes. The proposed LCNN
the activation function σ is a GroupSort function and (2) the
demonstrated higher accuracy and generalization perfor-
largest singular value of the weight matrix W is 1. The
mance compared to the conventional Dense NN trained on
following equations describe the GroupSort function (of
the same set of noisy data. This highlights the effectiveness of
group size 2) i.e., σ : Rm → Rm :
enforcing Lipschitz continuity in NN designs in handling
σ ([x 1 , x 2 , …, x m−1 , x m ]T ) = [max(x 1 , x 2 ), min(x 1 , x 2 ), …, noisy data.
max(x m−1 , x m ), min(x m−1 , x m )]T
(23a)
4.3 Curse of dimensionality: ML-MPC of
σ ([x 1 , x 2 , …, x m−2 , x m−1 , x m ]T ) = [max(x 1 , x 2 ), min(x 1 , x 2 ), …,
large-scale systems
min(x m−2 , x m−1 ), x m ]T
(23b)
The curse of dimensionality is a practical challenge for
where Eq. (23a) applies when m is an even number and Eq. modeling of large-scale systems. It refers to the phenomena
(23b) when m is a odd number. Aside from the activation that arise when working with high-dimensional systems, the
function that does not operate component-wise and the requirement for data amount grows exponentially, leading to
weight matrices having a spectral norm of 1, the Spec- more complex network structures, longer training time, and
tralDense layers are structurally similar to the conventional poorer model performance, as the number of dimensions
18 Z. Wu et al.: Machine learning-based model predictive control
(features or variables) increases. This issue can be addressed interpretations of a nonlinear extension of the aforementioned
from the modeling and control perspectives, respectively. linear dimensionality reduction technique PCA and is funda-
mentally an autoencoder with a nonlinear activation function.
4.3.1 Model perspective: reduced-order modeling The use of a feedforward neural network in NLPCA renders it a
static model at the cost of reduced complexity.
Reduced-order modeling (ROM) is a powerful technique to The aforementioned SINDy modeling approach for mul-
address the curse of dimensionality in high-dimensional tiscale systems was later used in Abdullah et al. (2021b) to
systems by reducing the complexity of the system while develop a controller based on the slow dynamics. The reduced-
preserving its essential behavior. In the context of machine order model-based controller, due to its lower complexity
learning and data analysis, reduced-order modeling aims to and computational cost, was able to outperform a full-order
capture the most significant features or dynamics of the data first-principles model-based controller as the former could use
while reducing the dimensionality of the problem. Common a longer prediction horizon in the model predictive control
dimensionality reduction techniques include methods such scheme, which is impacted significantly by the prediction
as principal component analysis (PCA) (e.g., Hassanpour horizon length. While there is an inevitable loss of accuracy in
et al. 2020 integrates PCA with neural networks), which a reduced-order model, for tasks involving optimization such
identifies a linear transformation that maps data from a as process intensification and optimal control, the computa-
higher-dimensional space to a lower-dimensional space with tional tractability of solving the mathematical optimization
minimal information loss by minimizing the squared sum of
problem is of greater priority, justifying the construction and
the orthogonal distances between the measured data points
deployment of such reduced-order models in process systems
and a straight line, and, more recently, autoencoders.
engineering. SINDy was further developed to handle noisy
Dimensionality reduction is particularly advantageous in
data and real-time changes in process dynamics using sub-
process systems engineering where time-scale separation is
sampling, co-teaching, error-triggered model update mecha-
a common phenomenon in unit models such as distillation
nisms, and partial model update algorithms (Abdullah and
columns and catalytic reactors (Chang and Aluko 1984),
Christofides 2023b; Abdullah et al. 2022a,b), all of which
which can justify the use of reduced-order models. Specif-
are techniques that can be extended to the reduced-order
ically, if such a timescale separation is not factored into the
modeling framework of Abdullah et al. (2021a,b) as well.
design of a standard nonlinear feedback controller, the
In addition to PCA and SINDy, the autoencoder (AE) is an
controller may become ill-conditioned due to the stiff ordi-
unsupervised learning model that adopts an FNN architec-
nary differential equations that arise, resulting in perfor-
ture to perform tasks such as dimensionality reduction
mance deterioration and possibly even unstable closed-loop
(Kramer 1991). A typical AE comprises two components, the
dynamics (Kokotović et al. 1999).
encoder and the decoder. The encoder, parameterized by
Reduced-order modeling for two-time-scale systems using
sparse identification of nonlinear dynamics (SINDy) was θ = {W e , b}, compresses the input data x ∈ Rdx into a lower-
studied in Abdullah et al. (2021a,b). In Abdullah et al. (2021a), the dimensional representation xr ∈ Rdh by a function fe (⋅),
mathematical framework of singular perturbations was uti- described by Eq. (25).
lized to decompose the original two-time-scale system into two xr = fe (x) = σ e (W e x + b) (25)
lower-order subsystems, each separately modeling the slow
and fast dynamics of the original multiscale system. Specif- where W e ∈ Rdh ×dx is the weight matrix and b ∈ Rdh is the bias.
ically, after a brief transient period, the fast states converge to a σ e (⋅) is the nonlinear activation function (e.g., hyperbolic
slow manifold and can be algebraically related to the slow tangent function). The encoded representation xr consists of
states using nonlinear functional representations. To capture lower-dimensional, complex hierarchical nonlinear features,
the nonlinear relationship between the slow and fast states, which are often regarded as more efficient representations of
nonlinear principal component analysis (NLPCA), developed the original data (Bank et al. 2023). On the other hand, the
by Dong and McAvoy (1996), was applied in Abdullah et al. decoder, parameterized by θ′ = {W d , b′ }, aims to reconstruct
(2021a), following which SINDy was used to derive well- the input data x ∈ Rdx from the encoded lower-dimensional
conditioned, reduced-order ODE models for the slow states. The representation xr ∈ Rdh via function fd (⋅):
reduced-order SINDy models, owing to their numerical stabil- x′ = fd (xr ) = σ d (W d h + b′ ) (26)
ity, allowed for integration with much larger time steps. Once
the slow states were predicted with the SINDy ODE model, where W d ∈ Rdx ×dh , b′ ∈ Rdx are another weight matrix and
NLPCA was used to algebraically predict the fast states without bias, respectively. σ d (⋅) is the linear activation function. The
any integration. NLPCA is one of the manifestations or objective of an AE is to encode an input data efficiently,
Z. Wu et al.: Machine learning-based model predictive control 19
particular, regularization models are the most commonly actions. Since the designs of ML-based DMPCs closely follow
used embedded methods, where the loss function is modified those using first-principles models, the formulations of ML-
such that the model learns the important features while DMPCs are omitted here. Various alternative configurations of
minimizing the fitting error (Li et al. 2017). Interested reader DMPC systems have been proposed in literature, each varies in
may refer to Karagiannopoulos et al. (2007) and Li et al. terms of the coordination and communication schemes be-
(2017) for detailed reviews on the various feature selection tween the subsystems’ MPC. Readers are directed to Christo-
methods and Zhao et al. (2023) on their applications in fides et al. (2013), Scattolini (2009), and Stewart et al. (2010) for
reduced-order RNN models. comprehensive reviews of DMPC.
historical data, (2) monitor incoming data and compute the optimization to find control inputs that perform well under
prediction error using new data, and (3) if the current pre- uncertainty. In recent works by Berberich et al. (2020), Chen
diction error or accumulated prediction error in a sliding and You (2021), Hu and You (2023), Mahmood et al. (2023), and
time window exceeds the predefined threshold, update the Manzano et al. (2020), robust data-driven MPCs and robust
model using the new data. Furthermore, when incorporating learning-based MPCs have been developed to enhance the
online machine learning models into MPC, event-triggered robustness of controllers against uncertainties such as pre-
mechanism designed based on stability criteria can be diction errors from machine learning models and process
adopted to update models (Wu et al. 2019b). While online disturbances. For example, Chen and You (2021) used ML to
learning helps maintain the accuracy of machine learning learn uncertainties, and designed a robust MPC for green-
models in dynamic environments, potential drift in the un- house in-door climate control problems. Mahmood et al.
derlying data distribution over time due to the change of (2023) developed a robust data-driven-based MPC based on
process dynamics under disturbances can pose a significant the minimax approach for temperature control and optimi-
challenge to the performance of updated models. To this end, zation of energy consumption.
some recent works have investigated the generalization per-
formance of online learning models that take independent
and identically distributed (i.i.d.) real-time data and non-i.i.d. 4.5 Computational efficiency
real-time data for online learning, respectively (Hu and Wu
2024; Hu et al. 2023a,b). These two cases represent the sce- ML-based MPC is generally solved slowly due to the
narios where system dynamics remain unchanged and where complexity of ML models (the ML model is often required to
they change over time, respectively. Specifically, it is shown in be evaluated multiple times during optimization, which can
Hu and Wu (2024) that the generalization performance of significantly increase computational burden), and the non-
online learning models depends on several factors, including convexity of optimization problem.
the divergence between historical data and real-time data
distributions, network complexity, and sample size. 4.5.1 Model perspective: optimization and
convexification
4.4.2 Control perspective: robust MPC and tube-based
MPC One approach to improve the computational efficiency of ML-
MPC is to simplify the model architecture, such as by reducing
From the control perspective, we can design robust MPC and the number of neurons or layers. While manually doing this
tube-based MPC to account for plant-model mismatch in un- can be challenging, automated tools and techniques can assist
certain systems. Specifically, tube-based MPC addresses the in finding an optimal configuration, thereby reducing
uncertainty in system dynamics by considering a range of computational overhead. Reduced-order modeling that has
possible future trajectories rather than a single trajectory. It been introduced in the previous section could be a solution to
creates a “tube” around the nominal trajectory, within which large-scale nonlinear systems. Additionally, hyperparameter
the actual trajectory is expected to lie. Tube-based MPC uses optimization could be one solution to finding the optimal
techniques like robust optimization or stochastic optimization hyperparameters for ML models. Some common techniques
to compute the tube around the nominal trajectory. Tube- for hyperparameter optimization of ML models include grid
based MPC often involves solving optimization problems with search (Bergstra et al. 2011) and Bayesian optimization (e.g.,
constraints that ensure that the system remains within the tools such as Optuna (Akiba et al. 2019) and Hyperopt (Bergstra
defined tube despite uncertainties. Machine learning tech- et al. 2013)). An analysis and comparison of common hyper-
niques have been incorporated into tube-based MPC to parameter optimization approaches for developing an LSTM
further improve the characterization of uncertainties and forecast model for a cyber-physical production system can be
robustness. Recent developments in tube-based MPC using found in Pravin et al. (2022).
machine learning include work by Gao et al. (2024), Zhang Another approach is to build input-convex ML models
et al. (2022), and Zheng et al. (2022b). (Amos et al. 2017; Chen et al. 2018c; Wang et al. 2025; Yang and
Robust MPC directly incorporates uncertainty into the Bequette 2021). An input-convex model in the context of
control law formulation. It aims to optimize control inputs machine learning refers to a model whose loss function is
such that the system remains stable and satisfies performance convex with respect to its input. This property can be highly
criteria under the worst-case scenario of uncertainty. Robust beneficial for optimization because convex functions have a
MPC typically involves solving optimization problems with single global minimum, making the optimization process more
robust constraints or using techniques like min-max straightforward and ensuring that gradient-based methods
22 Z. Wu et al.: Machine learning-based model predictive control
converge reliably. Certain linear models, such as linear regres- piecewise function of the system state. This precomputation
sion and logistic regression, are inherently convex because their allows for real-time implementation with constant-time
loss functions are convex with respect to the model parameters. complexity, regardless of the system’s complexity or pre-
However, for a more general class of nonlinear ML models, it is diction horizon. By eliminating the need for online optimi-
possible to design neural network architectures and loss zation during operation, explicit MPC can achieve faster
functions to be input convex under certain conditions. This control loop execution times, making it suitable for appli-
design can significantly improve training stability and cations with stringent real-time requirements.
convergence. We provide an example of enforcing input The explicit control law is derived using multi-parametric
convexity in FNNs. Following the same idea, input-convex programming algorithms, which include multi-parametric
RNNs and LSTMs have been designed in some recent works linear/quadratic programming (mpLP/mpQP) and multi-
(Chen et al. 2018c; Wang et al. 2025). The output of each layer of parametric mixed-integer linear/quadratic programming
input-convex FNN follows: (mpMILP/mpMIQP) (Pistikopoulos et al. 2020). Unlike typical
optimization problem where the parameters, e.g., system state
zl+1 = g l (W zl zl + W xl x + bl ), l = 0, 1, …, L − 1, (28)
x, are fixed and known, in multi-parametric programming, the
and with z0 , W z0 = 0. The output zl+1 is input-convex for parameters are unknown at the point of computation. Multi-
single-step prediction if all weights W zl are non-negative and parametric programming addresses this uncertainty by
all activation functions g l are convex and non-decreasing generating an optimal solution map for all possible values of
(Amos et al. 2017), while the output zl+1 is input-convex for the uncertain parameters, e.g., finding the optimal control
multi-step ahead predictions if all weights W zl and W xl are action u* for all possible states x (Ali et al. 2023; Tian et al. 2021).
non-negative and all activation functions g l are convex and By obtaining precomputed solutions offline, the online compu-
non-decreasing (Bünning et al. 2021). Therefore, under tational load of the MPC is significantly reduced. By trans-
certain conditions (e.g., convex objective functions and forming MPC problems of discrete-time, linear time-invariant
convex constraints), the MPC using input-convex NNs state-space models with linear/quadratic cost functions into
(ICNNs) becomes a convex optimization problem, which is mpLP/mpQP problems, these MPC problems can be solved
computationally less expensive to solve. explicitly, using solvers such as Python Parametric OPtimization
Toolbox (PPOPT), Parametric OPtimization Toolbox (POP), and
Remark 4. It is important to note that while ICNN models offer Multi-Parametric Toolbox (MPT) (Kenefake and Pistikopoulos
benefits such as global optimality and stability, they may lose 2022; Kvasnica et al. 2004; Oberdieck et al. 2016).
accuracy when applied to highly non-convex functions due to As it can be time-consuming to solve ML-based MPC (Wu
their inherent convexity. However, for many practical systems et al. 2019d), there has been a growing interest in converting
that are not highly non-convex, ICNN models can provide a ML-MPC into an explicit ML-MPC for faster computation.
computationally efficient alternative for ML model-based opti- However, the black-box nature of ML models creates obstacles
mization problems while maintaining the desired accuracy. In in the path towards explicit ML-MPC. As ML models can be
practical applications, comparing the testing losses of ICNN difficult to express explicitly, i.e., do not have explicit expres-
models with traditional FNN models can be an effective way to sions, it is a challenge to adopt existing explicit MPC algo-
evaluate the performance of ICNN models. If they yield similar rithms for ML-MPC. An approach to bypass this problem is to
accuracy, ICNN models can be considered a good approxima- utilize the unique property of the ReLU activation function
tion for nonlinear systems. Additionally, partially input convex and represent the ML model as a mixed-integer linear pro-
architecture of ICNN (PICNN) can be utilized to further restore gramming (MILP) problem. The MILP problem is then incor-
the representation ability of ICNN models by making the output porated into the formulation of an explicit ML-MPC and solved
a convex function to some elements of the input (Amos et al. using mpMILP (Chen et al. 2018b; Grimstad and Andersson
2017). Therefore, developing ICNN models requires a delicate 2019; Katz et al. 2020). An althernative approach to solve the
balance between convexity and representation power to ensure explicit ML-MPC using multi-parametric nonlinear program-
optimal performance for various applications. ming (mpNLP) methods. As deriving the exact solutions
to mpNLP is still an unsolvable problem, existing mpNLP
algorithms generally use either piecewise linearization or
4.5.2 Control perspective: explicit ML-MPC quadratic constraints to approximate the strong nonlinear
terms (Kassa and Kassa 2016; Pappas et al. 2021). In Wang et al.
Explicit MPC provides another solution from the control (2024a,b), the authors developed explicit ML-MPC for ML
perspective to improve computational efficiency. In explicit models with a general class of nonlinear activation functions
MPC, the control law is precomputed and stored as a by first approximating ML models by piecewise linear
Z. Wu et al.: Machine learning-based model predictive control 23
functions. The corresponding mpNLP problems are then chemical plants, where unsafe actions can lead to accidents,
approximated into mpLP/mpQP problems which can be solved injuries, or other severe consequences. To ensure safe explo-
efficiently by existing algorithms. ration, safe RL has recently been studied, where various tech-
In addition, compared to some works that develop an niques such as reward shaping and safety constraints through
ML model to learn state-input relationship that can be used barrier functions have been developed to limit the action and
to replace the controller in a closed-loop system, explicit state space (Garcıa and Fernández 2015; Kim and Kim 2022;
MPC offers transparency and interpretability since the Wang and Wu 2024a).
control law in explicit MPC is represented as a piecewise Safe modeling in supervised learning often refers to
function of the system state, for which engineers can easily ensuring the robustness and reliability of predicted outputs.
analyze and understand how control actions are determined This involves several strategies and techniques with the goal of
based on the current state of the system. This interpretability ensuring that the model predictions are consistent, reliable,
is valuable for troubleshooting, tuning, and verifying the and conform to necessary constraints in real-world systems. To
controller’s behavior in practical applications. achieve safe modeling in terms of reliable predictions, we can
impose hard constraints on NN outputs through the design of
Remark 5. The main challenge with explicit ML-MPC is that ML activation functions, or incorporate the constraints as a regu-
models are typically black-box models without an explicit larization term (similarly to physics-informed ML introduced in
functional form (or they are too complex to be directly incor- Section 4.1.1). Additionally, robustness requires that the pre-
porated into explicit MPC solvers). Nonlinearity adds another diction of ML models is robust to small perturbations in input
layer of difficulty, as a nonlinear ML model leads to mpNLP, data. Some common techniques include adversarial training
which is generally hard to solve. Potential solutions include that intentionally introduces adversarial examples in the
using the unique properties of ReLU activation functions for training process to improve its robustness, and novel design of
ML models to formulate a mixed-integer linear programming NN architectures with inherent robustness such as Lipschitz-
(MILP) problem, or approximating nonlinear ML models with constrained NNs that have been introduced in Section 4.2.3.
piecewise linear functions, allowing for the formulation of Lastly, from the control perspective, safe implementation
mpLP or mpQP problems. of ML models in MPC requires the improvement of existing
controllers to account for the impact of safety as the last line of
defense. Due to the approximation of ML models, ML-based
4.6 Safe and secure ML-MPC MPC may lead to suboptimal or even unreasonable control
actions that may cause unsafe operations. To mitigate these
While most existing research of ML-MPC in engineering disci-
risks, safety constraints have been incorporated into the design
plines has focused on improving its prediction accuracy and
of MPCs, ensuring that control actions and the resulting state
performance, safety and security are emerging research areas
evolution remain within safe bounds. For example, barrier
of significant importance. The misuse of ML-MPC technologies
functions can be used to design MPCs to effectively prevent the
could lead to unsafe, and potentially catastrophic, consequences
system from violating safety constraints by heavily penalizing
in safety-critical systems, causing environmental damage,
states near the constraint boundaries. While there are various
capital loss, and human injuries.
types of barrier functions, we provide an example of the control
barrier function (CBF) for the nonlinear affine control system
4.6.1 Safety ẋ = f (x) + g(x)u proposed in Wieland and Allgöwer (2007).
Ensuring safety of ML-MPC includes safe learning (data collec- Definition 5. Given a set of unsafe states in state-space D , a
tion), safe modeling, and safe implementation. Safe data C 1 function B(x) : Rn → R is a CBF if the following proper-
collection is often not the most critical issue in supervised ties are satisfied:
learning because datasets are provided for offline learning. The
B(x) > 0, ∀ x∈D (29a)
data used in supervised learning is usually pre-collected,
cleaned, and labeled, which reduces the risks associated with Lf B(x) ≤ 0, ∀ x ∈ {z ∈ Rn \ D | Lg B(z) = 0} (29b)
data collection. However, the importance of safe data collection
becomes much more pronounced in other machine learning U ≔ {x ∈ Rn | B(x) ≤ 0} ≠ ∅ (29c)
techniques, such as RL. Specifically, in RL, an agent interacts
with an environment to learn optimal actions through trial and To further reinforce closed-loop stability while ensuring
error. This interaction can involve significant risks, especially in safety simultaneously, CBFs can be integrated with control
real-world applications like autonomous driving, robotics, and Lyapunov functions via weighted sums. As a result, control
24 Z. Wu et al.: Machine learning-based model predictive control
Lyapunov-barrier functions (CLBFs) that was proposed in control systems can be improved through a variety of funda-
Romdlony and Jayawardhana (2016) has been used to design mental operation and control methods that address the
safe MPC in Wu et al. (2018, 2019a). The definition of CLBFs is following aspects: security by design, advanced recovery,
given as follows: advanced threat detection, secure remote access, and combined
safety. Specifically, we can improve data security in both the
Definition 6. Consider the nonlinear system learning and implementation stages of ML-MPC. For example,
̇
x = f (x) + g(x)u with a set of unsafe states in state-space unlike the conventional ML approaches for modeling a
(i.e., D ), a proper, lower-bounded and C 1 function W c (x) : nonlinear process network with multiple subsystems, where
Rn → R is a CLBF if W c (x) has a minimum at the origin and the training process is performed on a central server with
also satisfies the following properties: training data collected from all subsystems, federated learning
(FL), an emerging distributed ML framework to preserve data
W c (x) > ρc , ∀ x ∈ D ⊂ ϕuc (30a)
privacy, distributes the training data across multiple local
Lf W c (x) < 0, subsystems, and subsequently, aggregate the submodels
∀ x ∈ {z ∈ ϕuc \(D ∪ {0} ∪ Xe ) | Lg W c (z) = 0} (30b) trained locally for each subsystem to create a global FL model
(Zhang et al. 2021a; Zhao et al. 2018). Since FL only exchanges the
U ρc ≔ {x ∈ ϕuc | W c (x) ≤ ρc } ≠ ∅ (30c)
NN weight information, and maintains local data in local sys-
tems without sharing with each other (see Figure 9), data se-
ϕuc\(D ∪ U ρc ) ∩ D = ∅ (30d)
curity is significantly improved under the FL framework. In Xu
where ρc ∈ R, ϕuc is a neighborhood around the origin, and and Wu (2024), FL was applied to model the distributed
Xe ≔ {x ∈ ϕuc \ (D ∪ {0}) | ∂W∂xc (x) = 0} is a set of states where nonlinear systems with guaranteed data privacy for ML
Lf W c (x) = 0 (for x ≠ 0) due to ∂W c (x)/∂x = 0. The formu- methods, and then incorporated into the design of MPC.
lation of CLBF-MPC can be found in Wu et al. (2019a). Addi- In addition to data security in the learning stage, the
tionally, in Wu and Christofides (2020), CLBF-MPC using ML smooth operation of ML-MPC in real-time heavily depends
models were developed to control chemical processes with on the accuracy of recorded data and the reliability of net-
unknown process models. Chen et al. (2022a) discussed the worked communication channels. Any compromise in the
use of ML methods for the construction of barrier functions integrity or confidentiality of this data due to unauthorized
when safe and unsafe regions cannot be represented in access or manipulation by malicious entities can lead to
functional forms. Furthermore, in Chen et al. (2022b), the serious consequences, impacting operational safety and
generalization performance was analyzed for ML-based economic performance. As sophisticated cyber-attacks pose
construction of barrier functions and the resulting safe MPC. risks to system information, there is a need to develop ML-
In addition to control Lyapunov and control barrier func- MPC that ensures the confidentiality of industrial data. A
tions that can be used to ensure stability and safety, promising solution to tackle this challenge is the adoption of
respectively, in MPC, control invariant sets can been incor- an encrypted control system (Farokhi et al. 2017; Kim et al.
porated into machine-learning-based controllers to improve
stability (e.g., reinforcement learning-based controllers in
Bo et al. 2023).
2016), offering a versatile and effective means to improve within the MPC at time t. The MPC subsequently computes
data security and confidentiality. It can be seamlessly optimized inputs u(t), which are encrypted prior to trans-
implemented across various systems without requiring mission to the actuator. After the actuator receives the
system-specific modifications, thus addressing the core encrypted signals as input, the encrypted input is decrypted,
challenge of secure data transmission in networked systems. leading to a quantized input, û (t) that is applied to the
The work of Suryavanshi et al. (2023) presents the closed- process. Additionally, Kadakia et al. (2024a) developed an
loop architecture of an encrypted MPC. As depicted in Figure 10, encrypted distributed MPC for networked systems, and
the sensor signals x(t) are subjected to encryption using the Kadakia et al. (2024d) further integrated cyber-attack
public key before being sent to the model predictive controller detection with encrypted control systems. Moreover, to
(MPC). Regarding encryption techniques, Schlüter et al. (2023) control nonlinear processes with the objective of maxi-
discussed various potential methods that can be used to ensure mizing economic performance or achieving desired tracking
the confidentiality of transmitted data. These methods include performance, a two-layer framework was proposed in
homomorphic encryption (HE), secure multi-party computation Kadakia et al. (2024b) and Kadakia et al. (2024c) to integrate
(SMPC), differential privacy (DP), and random affine trans- encrypted feedback control with dynamic process eco-
formations (RAT). Other encryption methods include symmet- nomics optimization through economic MPC, and tracking
ric encryption and partially homomorphic encryption (PHE) to MPC, respectively.
secure data. Symmetric encryption, like advanced encryption
standard (AES), is a non-homomorphic technique that prohibits
mathematical operations within encrypted data (Rijmen and
5 Applications of ML-MPC to a
Daemen 2001). It is noted that fully homomorphic encryption, as
seen in schemes like Brakerski–Gentry–Vaikuntanathan (BGV), chemical process example
allows both addition and multiplication operations within
encrypted data (Gentry et al. 2012), while partially homomor- In this section, we use a nonlinear chemical process to
phic encryption enables either addition or multiplication op- demonstrate the performance of various ML modeling and
erations within encrypted data. For example, the Paillier ML-MPC control methods, addressing different practical
cryptosystem supports addition operations in an encrypted challenges discussed in previous sections. We begin with a
environment (Paillier 1999). Paillier encryption, one of the semi- brief introduction to developing conventional RNN models
homomorphic cryptosystems with additive homomorphism, for nonlinear dynamic systems. We then explore advanced
has been widely used for its computational efficiency compared RNN models incorporating physics-informed ML, transfer
to other semi-homomorphic encryption schemes like El-Gamal learning, dropout, co-teaching, Lipschitz-constrained archi-
(Elgamal 1985), and its ability to perform additive operations in tecture, input-convex structure, online learning, and feder-
an encrypted space, without decryption. Its security guarantees ated learning. These advanced methods are designed to
rely on a standard cryptographic assumption called decisional tackle practical issues such as data scarcity, noise, robust-
composity residuosity (DCR). Thus, homomorphic (partially and ness, convexity, model uncertainties, and data security.
fully) encryption must be used when the transmitted cipher- Following this, we show several novel designs of ML-MPCs
texts need to be utilized for performing linear mathematical that enhance computational efficiency, process operational
operations without decryption. safety, and cybersecurity. Additionally, the Python codes for
After obtaining the encrypted data, it undergoes some of the aforementioned ML and ML-MPC methods are
decryption, resulting in quantized states x̂ (t). These quan- provided for reference.
tized states serve as the initial values for the plant model
100
Stability region
Process data with noise
Collocation points (initial states)
Terminal set min
50
-50
Figure 11: Collocation points sampled uniformly
across the stability region Ωρ , together with a few
-100 examples of the noisy process state trajectories
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 captured within a small neighborhood around
the origin (i.e., steady-state C As and T s ).
the region captured in the process data, as well as the area The open-loop state profiles predicted by the three models
beyond). 110 pairs of initial states were sampled across the are presented in Figure 12. It can be seen from Figure 12 that the
closed-loop stability region Ωρ , as shown in Figure 11. Each prediction performance of the standard RNN model starts to
initial state was subjected to 500 uniformly sampled manip- deviates from ground truth from t = 0.25 h onward, the time
ulated inputs in a sample-and-hold fashion. Hence, a total of when the states begin to deviate from their respectively steady-
55,000 collocation points (i.e., the total number of combina- state values. The unsatisfactory performance of RNN after t =
tions with 110 pairs of initial states and 500 pairs of manipu- 0.25 h highlights the poor generalizability of data-driven
lated inputs) were obtained. During the model training models when provided with unseen data beyond its training
process, 80 % of the process data and collocation points were set. On the other hand, the standard PIRNN model that was
used for training, and the remaining 20 % were saved for trained with both process data and collocation points,
validation. demonstrated remarkable generalization performance, in the
The standard RNN model was trained solely with the sense that its prediction matches the ground truth closely.
process data. It is a purely data-driven model that serves Moreover, the PIRNN without MSEX model was able to provide
as a baseline to evaluate the predictive capabilities of the satisfactory prediction performance despite being a purely
PIRNN models in regions beyond the range provided by physics-driven model (i.e., no observed data was used for
the training data. Additionally, a purely physics-based training). Hence, the strength of physics-informed ML can be
RNN model, i.e., PIRNN without MSEX , was developed. The corroborated from this example, where the predictive perfor-
PIRNN without MSEχ model was trained only using the mance of the RNN models was significantly enhanced with the
collocation points, without utilizing the observed process incorporation of physical knowledge into the models. The
data. It serves as another baseline for comparison with the exceptional generalizability of physics-informed ML can be
standard PIRNN. Finally, the standard PIRNN was created
especially beneficial for controlling dynamic systems in the
using both collocation points and process data, with its
chemical industry, where the process data collected are often
loss function described in Eq. (20). All three RNN models
ill-sampled (e.g., concentrated within a small region around the
were developed in PyTorch and share the same network
steady-state set points). The code for PIRNN is available in our
architecture, which consists of three hidden recurrent
GitHub repository.1
layers with 128, 256, and 64 recurrent units, respectively.
The models also had the same parameter settings: the
number of training epochs was set to 300, with Adam as 5.2.2 Transfer learning RNNs
the optimizer (learning rate of 0.001) and tanh as the
activation function. The models had 4 input features and 2 In the presence of data scarcity, transfer learning can be
output features. Specifically, given the initial state mea- used to accelerate the training process and improve the
surements (ΔC A and ΔT at the current time step), and the
manipulated inputs (ΔC A0 and ΔQ), the models are
1 GitHub link to the PIRNN code: https://github.com/Keerthana-
required to predict the future system states (i.e., ΔC A and Vellayappan/Demonstration-of-Physics-Informed-Machine-Learning-
ΔT) over a sampling period Δ = 1 × 10−2 h. Model.
28 Z. Wu et al.: Machine learning-based model predictive control
0.5
-0.5
-1
0 0.05 0.1 0.15 0.2 0.25 0.3
60
Figure 12: Comparison of open-loop state
40 profiles (i.e., ΔC A (top figure), and ΔT (bottom
figure)) predicted by RNN (green dotted-dashed
20 line), PIRNN (red dotted line), and purely
0 physics-driven RNN (orange dashed line) with
the ground truth (blue solid line). Noise-free
0 0.05 0.1 0.15 0.2 0.25 0.3 state measurements were used to train the
three RNN models.
generalization performance of RNNs for a target process hidden layers each containing 32 neurons was developed and
with limited data using the pre-trained model for a similar trained for 300 epochs, using solely the target data set.
source process with sufficient data. The model development Table 2 presents the training time and testing errors of TL-
framework and results of transfer learning based RNN RNN and standard RNN trained different sizes of the target
models presented in this section are taken from Xiao et al. data set. Given sufficient target data (i.e., 16,800 training sam-
(2023). In Xiao et al. (2023), one source CSTR was selected for ples and 7,200 testing samples), it can be observed from Table 2
the construction of a TL-based model of a target CSTR pro- that the prediction performance of TL-RNN and RNN models
cess. Except for the ideal gas constant R, the two CSTRs had are comparable (i.e., TL-RNN testing error = 3.144 × 10−5 , RNN
different parameter values. Specifically, the parameters of testing error = 2.635 × 10−5 ). This shows that the TL-RNN model
the source CSTR were 1.1 times its counterparts in the target can achieve similar results as the best performing RNN
CSTR. The model construction framework has been provided equivalent under standard training process. Moreover, it can
in Section 4.1.2. In essence, a single hidden layer RNN is be seen from Table 2 that the TL-RNN model used up less
developed using data from the source CSTR. Subsequently, training time than the standard RNN model when trained with
the target model is obtained by adding a new RNN hidden 16,800 data samples. This could be attributed to the fact that the
layer to the pre-trained RNN source model and fine-tuned TL-RNN model had less trainable parameters than the stan-
using the data from the target CSTR. To obtain the source dard RNN during the first part of its training process, where
model in Keras, we trained an RNN model with one hidden the parameters in the first hidden layer of the TL-RNN were set
layer of 32 neurons using 42,840 training samples and 7,560 to be untrainable. This reduction in trainable parameters
testing samples, all collected from the source CSTR. The could have accelerated the training process of TL-RNN. The
training time for 150 training epochs was 112 s. The testing results under 24,000 samples suggest that, when sufficient
error of the source model is reported to be 1.834 × 10−5 , training samples are provided for the target process, transfer
which is a sufficiently small modeling error using normal- learning can achieve a similar performance as the standard
ized data from the CSTR described in Eq. (31). Afterward, a RNN model while requiring less training time. Additionally, we
TL-RNN model was built for the target CSTR by adding a consider the scenario where the target dataset contains fewer
hidden layer of size 32 to the pre-trained source model. Thus, samples (i.e., 1,920 training samples and 1,280 testing samples).
the target model has two RNN hidden layers of 32 neurons. The It is observed from Table 2 that in the case of a small dataset
training process for the TL-RNN is divided into two steps. In the (i.e., 3,200 samples), the transfer learning RNN model can
first step, the parameters in the first hidden layer, i.e., the pre- achieve better prediction performance while reducing the
trained source model, are set to be ‘untrainable’ by using the training time compared to the standard RNN model. The code
function ‘model.layers[0].trainable = False’ in Keras. In this for transfer learning is available in our GitHub repository.2
stage, only the second hidden layer is trained. The second
hidden layer was trained for 150 epochs. In the second step, all
the hidden layers in the RNN model are set as ‘trainable’, and
the entire model is further trained for 150 epochs. As a 2 GitHub link to the transfer learning-based RNN code: https://github.
benchmark for comparison, a standard RNN model with two com/MingXiaop/Transfer-Learning-for-nonlinear-chemical-process.
Z. Wu et al.: Machine learning-based model predictive control 29
Table : Testing errors of standard and TL-RNNs. ground truth (i.e., the nominal state trajectory in black),
especially at the start. Conversely, the mean state trajectory
Data set size Training time (s) Testing error predicted by the dropout LSTM shows a closer match to the
TL-RNN , . : × ground truth, showing the capability of MC dropout in
Standard RNN , . : × handling noisy data.
TL-RNN , . : × The effect of incorporating co-teaching into LSTM was
Standard RNN , . : × also studied in Wu et al. (2021c). As mentioned in Section
4.2.2, co-teaching involves training two NN models. In Wu
et al. (2021c), the co-teaching process starts by training the
5.2.3 Dropout and co-teaching RNNs with noisy data two LSTM models with a noisy dataset. Subsequently, the
models iteratively identify and exchange clean data se-
Since neural networks are demonstrated to be able to miti- quences and update their weights accordingly. This allows
gate the impact of Gaussian noise to some extent. In this the co-teaching models to capture a balanced pattern that
section, we will use the findings in (Wu et al. 2021c) to un- accounts for both noisy and clean data. To understand the
derstand the capability of the LSTM model to handle non- effectiveness of the co-teaching method, the testing perfor-
Gaussian noise, as well as how approaches such as dropout mances of the standard LSTM, dropout LSTM and co-
and co-teaching can help to improve the learning perfor- teaching LSTM models were compared and are listed in
mances of the LSTM models developed with noisy data. Table 3. For fair comparison, all LSTM models were trained
We will first explore the effect of implementing the MC and tested on the same noisy dataset. The LSTM models also
dropout, proposed in Gal and Ghahramani (2016a,b), in an shared the same network structure and hyperparameters,
LSTM model. By treating the LSTM weights as random var- i.e., the same number of neurons, layers, epochs, and acti-
iables and finding the posterior distribution of the weights vation functions. The difference between the predicted state
by sampling the network with randomly dropped out trajectories and the underlying (noise-free) state trajec-
weights during testing, the MC dropout method helps tories, i.e., MSE, was chosen as the criterion for performance
quantify the uncertainty in the prediction and uses the in- evaluation, where a smaller MSE value signifies better model
formation to update the weights. The open-loop prediction performance.
results of the standard LSTM and the dropout LSTM are
presented in Figure 13 for comparison. As the predictions Table : Statistical analysis of the open-loop predictions under non-
made by the LSTM model using MC dropout are stochastic in Gaussian noise.
nature, the LSTM predictions were executed repeatedly 300
times to generate the predicted state trajectories distribu- Methods MSE x MSE x
tion. The mean state trajectory is represented as a red line, a) LSTM: noise-free data only . .
and the 95 % standard deviation interval is marked by the b) LSTM: mixed data . .
gray region in Figure 13. It is observed that the prediction c) LSTM: noisy data only . .
) Co-teaching LSTM . .
made by the standard LSTM model (yellow line), trained
) Dropout LSTM . .
with non-Gaussian noise, deviates significantly from the
(kmol/m3)
1.5
1
95% std
0.5 Dropout LSTM
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 Standard
0.045 LSTM0.05
Ground truth
Noisy data
0
K
As shown in Table 3, the standard LSTM models had the Table : Comparison of the testing errors (TEs) and Lipschitz constants
highest MSE out of the three methods. Furthermore, when (LCs) for various hidden layer architectures and standard deviation (SD) of
noise introduced into the training dataset.
comparing standard LSTM trained noisy data (i.e., 1c in
Table 3) with mixed LSTM trained data (i.e., 1b in Table 3), a
Hidden Noise LCNN TE Dense TE LCNN Dense LC
slight improvement in model prediction was observed, layers SD ð× Þ ð× Þ LC ð× Þ
highlighting the adverse impact noisy data have on model
(, ) . . . . .
accuracy. These observations imply that the standard
(,, . . . . .
modeling approach cannot achieve the desired model ac- ,)
curacy without a high-quality dataset. However, co-teaching (, ) . . . . .
LSTM and dropout models developed with the same noisy (,, . . . . .
dataset outperformed the standard LSTM models, with co- ,)
developed two RNN models for the CSTR of Eq. (31) subjected triggered by errors. It is shown that the update of RNN
to the aforementioned disturbances, one being a standard models is triggered two times throughout the operation.
RNN model trained offline using historical data, and the
other one being an RNN model updated online using real- Remark 6. Due to space constraints, we are unable to
time data. The closed-loop state trajectories under Lyapunov present all the NN modeling approaches that address each
MPC (LMPC) controllers designed with the two models are practical issue discussed in this article. Readers who are
shown in Figure 14a. As shown in Figure 14a, the closed-loop interested in reduced-order modeling, ML-based distrib-
state trajectory under LMPC using the standard RNN models uted MPC, and federated learning methods, which often
(that is, without online update of the RNN model) exhibited require more complex process networks for demonstra-
oscillatory behavior around the origin due to disturbances. tion, can refer to the references provided in the corre-
Whereas, the trajectory under the LMPC with online update sponding sections.
of the RNN model was able to drive the closed-loop state into
a small neighborhood around the origin successfully.
Figure 14b shows the evolution of the moving-horizon 5.3 NN-based MPC
error detector Ernn (t) that is designed as the accumulated
prediction error for the closed-loop system of Eq. (31) under After we obtain the NN models that learn the dynamics of the
the LMPC of Eq. (9) with online update of RNN models CSTR of Eq. (31), NN-based MPC can be developed to control the
(K)
(kmol/m3)
system by manipulating C A0 and Q. A conventional LMPC Table : Convergence runtime of MPCs using LSTM and ICLSTM.
scheme using standard RNN models has been developed in Ren
et al. (2022) and has been shown to achieve the desired closed- ½CAi ; Ti Plain LSTM ICLSTM
loop performance by stabilizing the states in the steady state. Time (s) % Decrease Time (s)
The Python code for MPC using conventional RNN models can ½ :; ,: ± : : % : ± :
be found in our GitHub repository.4 In this subsection, we again ½ :; ,: ± : : % : ± :
will focus more on novel designs of ML-MPCs that address the ½ ; ,: ± : : % : ± :
practical issues such as computational efficiency, safety and ½ :; ,: ± : : % : ± :
½ :; ,: ± : : % : ± :
data security in real-world applications.
½ :; : ± : : % : ± :
½ :; : ± : : % : ± :
5.3.1 Convex MPC using input-convex NNs ½:; ,: ± : : % ,: ± :
½:; ,: ± : : % ,: ± :
½:; ,: ± : : % ,: ± :
While neural networks offer advantages in process modeling, ½:; ,: ± : : % : ± :
ensuring computational efficiency is crucial for real-time ½:; ,: ± : : % : ± :
optimization and control tasks. In a chemical plant, numerous ½:; : ± : : % : ± :
operations require real-time or near-real-time control to ½:; : ± : : % : ± :
½:; : ± : : % : ± :
maintain product quality, safety, and operational efficiency.
Average ,. : % .
Swift decision-making is pivotal for safety in chemical pro-
cesses, as delays in addressing reactant changes can result in
undesired reactions or unsafe conditions. Inspired by the fact
The objective function L in Eq. (32a) incorporates a cost
that convex optimization is easier to solve than non-convex
function J in terms of the system states x and the control actions
optimization, in this subsection, our goal is to preserve the
u. The dynamic function Fnn (̃x(t), u(t)) in Eq. (32b) is param-
convexity in neural-network-based predictive control that will
eterized as recurrent neural networks (i.e., plain LSTM and
be discussed later by developing input-convex NNs where the
ICLSTM). In this experiment, the PyIpopt library was executed
neural network outputs remain convex with respect to the
on an Intel Core i7-12700 processor with 64 GB of RAM, using 15
input. Specifically, in addition to the input-convex feedforward
different initial conditions within the stability region
neural network introduced in Section 4.5.1, there are a variety
(i.e., covering the whole stability region). Table 5 presents the
of input-convex NNs in the family of RNNs such as input-
average runtime across 3 runs for each case and their corre-
convex RNNs and input-convex LSTMs. Specifically, we develop
sponding percentage decrease with respect to ICLSTM,
input-convex LSTM (ICLSTM), following the formulation in
showing that ICLSTM-based MPC yields an improvement in
Wang et al. (2025), and compare its performance in closed-loop
convergence runtime. Specifically, it attains an average per-
control with the MPC using plain LSTM model. Subsequently,
centage decrease of 40.0 % compared to plain LSTM. The Py-
we consider a simple MPC scheme using a neural network
thon code for MPC using ICLSTM models can be found in our
model as the prediction model given by the following optimi-
GitHub repository.5
zation problem:
tk+N
L = min ∫ J(̃x(t), u(t))dt (32a) 5.3.2 Safe ML-based MPC
u∈S(Δ) tk
s.t. ẋ (t)
̃ = Fnn (̃x(t), u(t)) (32b) Safe MPCs should be developed to ensure that process opera-
tions remain within the safe operating region, particularly
u(t) ∈ U, ∀t ∈ [tk , tk+N ) (32c) when there are potential unsafe operating conditions in
chemical processes. CLBF functions can be incorporated into
̃x(tk ) = x(tk ) (32d)
the MPC scheme (termed CLBF-MPC) to regulate the CSTR of Eq.
where ̃x is the predicted state trajectory, S(Δ) is the set of (31) to the steady-state while avoiding the unsafe operation at
piecewise constant functions with period Δ, and N is the the same time. The CLBF-MPC scheme is formulated by the
number of sampling periods in the prediction horizon. following optimization problem (Wu et al. 2019a):
4 GitHub link to RNN-based MPC code: https://github.com/GuoQWu/ 5 GitHub link to ICLSTM-based MPC code: https://github.com/
Machine-learning-based-model-predictive-control. killingbear999/ICLSTM.
Z. Wu et al.: Machine learning-based model predictive control 33
tk+N
as H ≔ {x ∈ R2 | F(x) < 0.07}. The CLBF is designed as the
min ∫ lt (̃x(t), u(t))dt (33a)
u∈S(Δ)
tk weighted sum of the following control barrier function B(x)
and control Lyapunov function V(x):
s.t. ẋ (t)
̃ = Fnn (̃x(t), u(t)) (33b)
⎪
⎧ F(x)
⎨ eF(x)−0.07 − e−6 , if x ∈ H
̃x(tk ) = x(tk ) (33c) B(x) = ⎪ (34)
⎩ −e−6 , if x ∉ H
u(t) ∈ U, ∀ t ∈ [tk , tk+N ) (33d)
and V(x) = xT Px with the following positive definite P
̇ ̇
W c (x(tk ), u(tk )) ≤ W c (x(tk ), Φ(x(tk ))), matrix:
(33e) 1, 060 22
P=[ ] (35)
if W c (x(tk )) > ρ′min and x(tk ) ∉ B δ (x e ) 22 0.52
W c (̃x(t)) ≤ ρ′min , ∀ t ∈ [tk , tk+N ), if W c (x(tk )) ≤ ρ′min (33f) In Figure 15, it is demonstrated that for all initial states x 0
in U ρ̂ (marked by circles), the closed-loop trajectories avoid
W c (̃x(t)) < W c (x(tk )), ∀ t ∈ (tk , tk+N ), if x(tk ) the bounded unsafe region D b that is embedded within U ρ̂
∈ B δ (x e ) (33g) (a subset of the safe operating region U ρ ), and ultimately
converges to U ρmin under the CLBF-MPC of Eq. (33).
where Δ is the sampling period, S(Δ) is the set of piecewise
constant functions with time interval Δ, ̃x(t) is the predicted
Remark 7. Although a CSTR example was used to illustrate
state trajectory, and N is the number of sampling steps in
the applications of various machine learning modeling
the prediction horizon. We use Ẇ c (x, u) to represent
∂W c ( x)
and ML-based MPC methods, it is important to note that
∂x
(Fnn (x, u)). The optimization problem of Eq. (33) is to
ML-based MPC can be applied to a variety of complex
minimize the object function of Eq. (33a) subject to the con- chemical engineering problems. Due to space constraints,
straints of Eqs. (33b)–(33g). The NN model that captures the we will not provide a detailed discussion in the review;
dynamics of Eq. (31) can be used as the prediction model in Eq. however, we have provided some examples on the appli-
(33b). The initial condition for this prediction model is deter- cation of ML models, as well as, ML-based MPC methods to
mined by the current state measurement, as shown in Eq. model and control complex systems, for interested readers
(33c). The constraints outlined in Eqs. (33e)–(33g) ensure that seeking more examples in this area. For example, neural
the closed-loop state remains bounded within a small neigh- network models have been applied to model an industrial
borhood around the origin (i.e., U ρ′ ) and does not enter the ethylene splitter in Jalanko et al. (2021) and experimental
min
unsafe region for all times. Specifically, when x(tk ) is outside electrochemical reactors in Çıtmacı et al. (2022) and Luo
of U ρ′ and x(tk ) ∉ B δ (x e ), the constraint of Eq. (33e) drives et al. (2022). Moreover, an LSTM-based MPC method has
min
the closed-loop state into a smaller level set of W c (x) by been developed in Luo et al. (2023) for the same electro-
decreasing the value of W c (̃x) along the predicted state trajec- chemical reactor of Çıtmacı et al. (2022). Other notable
tory at least at the rate under the CLBF-based controller works on ML-based MPC include: using an LSTM-based
u = Φ(x). When x(tk ) enters U ρ′ (i.e., x(tk ) is also
min
bounded in a small ball around the origin
B d (0) ≔ {x ∈ Rn | |x| ≤ d}), the constraint of Eq. (33f)
maintains the closed-loop state inside B d (0) afterwards.
However, if the state is trapped in other stationary points
during the path towards the origin, i.e., x(tk ) ∈ B δ (x e ), we
activate the constraint of Eq. (33g) to drive the state away
from x e in the direction of decreasing W c (x).
We consider a bounded unsafe region D b in state-
space, and demonstrate that the CLBF-MPC of Eq. (33)
can drive the state to a small neighborhood around
the origin while not entering the unsafe region. Specif- Figure 15: Closed-loop state trajectories for the system of Eq. (31) under
the CLBF-MPC using an RNN model. The initial conditions are marked by
ically, the unsafe region is defined as an ellipse:
circles, and the set of bounded unsafe states D b is the gray area
D b ≔ {x ∈ R2 | F(x) = (x1 +0.92) + (x2500
−42)
2 2
economic MPC to control the heating, ventilation, and air knowledge that can improve model performance. In many ML-
conditioning (HVAC) system of a building in Ellis and MPC applications, NN models are initially trained offline until
Chinde (2020), and using an ANN-based MPC to control the achieving sufficiently small errors before incorporation into
film properties in the thin film chemical deposition of MPC. However, this process involves extensive data collection
quantum dots in Sitapure and Kwon (2022). and training, potentially consuming time and resources.
Therefore, a future direction is to integrate stability re-
quirements into NN model development, ensuring that NN
6 Conclusion and outlook models naturally meet the MPC stability criteria and can be
easily implemented within MPC frameworks (Tan et al. 2024a).
The tutorial provided an overview of machine learning-based Additionally, for modeling distributed systems, knowledge of
model predictive control methods, highlighting both theo- network structure (i.e., units [nodes] and their relationships
retical insights and practical challenges associated with the [edges]) can be integrated into the development of graph
development of NNs and the incorporation of NNs into MPC. neural networks (GNNs) to improve the modeling accuracy.
Closed-loop stability of ML-based MPC was first established Overall, there are various types of domain knowledge that can
based on the generalization error analysis for NNs. Various be integrated into neural networks tailored to specific ML-MPC
ML methods such as physics-informed ML, transfer learning, applications in different ways (e.g., loss function, network ar-
and novel designs of NN architectures were discussed chitecture, weight constraints, learning algorithms, etc.).
alongside advanced control methods to address the practical To successfully implement ML-MPC in real-world large-
challenges including data scarcity, data quality, the curse of scale systems, addressing adaptability and scalability is
dimensionality, model uncertainty, computational efficiency, important to ensure computational efficiency and main-
and safety in ML-MPC. Finally, a chemical process example taining performance across diverse applications. Transfer
was studied to demonstrate the effectiveness of various ML- learning offers a promising approach by leveraging knowl-
MPC methods to address the aforementioned practical issues. edge from one process to another in modeling and control
In addition to the topics covered in this paper, several tasks for process scale-up. However, finding a suitable
emerging areas in ML-based MPC require significant attention source process that closely matches the target process can be
for future research. For example, explainable AI (XAI) is crit- challenging in practice. Inspired by the success of large
ical to improving the transparency, trustworthiness, and us- language models in many recent studies and applications, a
ability of ML models in MPC. By understanding how a neural compelling future direction is to develop a single, universal
network arrives at its predictions, users can trust more on the neural network (referred to as a foundation model) capable
model, and identify the errors more effectively in real-world of rapidly adapting to model any new chemical process
applications. Although neural networks are powerful tools for (Wang and Wu 2024c). Foundation models have shown
learning complex patterns and making predictions across success in fields such as computer science, chemistry, and
various domains, they are typically developed as black-box material sciences. In the field of chemical engineering, large
models with inherent complexity, which makes it challenging language models have been applied in Hirtreiter et al. (2024)
to understand the reasoning behind their outputs. Physics- to generate control structures for process flow diagrams
informed ML provides one solution to incorporate domain (PFDs) from PFDs without control structures, as part of an
knowledge into NN models, yet it does not completely address effort to automate the generation of piping and instrumen-
the challenge of model explainability. One common approach tation diagrams (P&IDs). However, the application of foun-
for XAI is SHapley Additive exPlanations (SHAP). SHAP is a dation models to chemical process modeling and control is
method based on cooperative game theory that assigns each still in its infancy (Decardi-Nelson et al. 2024). This is partly
feature an importance value for a particular prediction. It due to the complexity of chemical engineering, which in-
provides a unified framework to explain the output of any volves large-scale industrial processes characterized by
machine learning model by attributing the prediction proprietary complex data that is rarely shared publicly by
outcome to different input features. However, developing industries. Additionally, adapting ML-based MPC from a
suitable XAI methods to explain predictions, limitations, and small-scale to a large-scale system involves several key
resulting behaviors of neural network models in MPC remains considerations such as real-time computation requirements,
an ongoing challenge. availability of sensor data and sensor-related issues (e.g.,
Regarding physics-informed machine learning, while this missing, delayed, and asynchronous measurements), and
review paper discusses several approaches integrating physics optimization of MPC hyper-parameters across different
knowledge (e.g., first-principles models and structural process scales. Addressing these challenges not only enables more
knowledge) into NN development, there are numerous types of efficient utilization of data but also improves the
Z. Wu et al.: Machine learning-based model predictive control 35
applicability of ML-MPC in various chemical engineering Akpinar, N.-J., Kratzwald, B., and Feuerriegel, S. (2019). Sample complexity
applications. bounds for recurrent neural networks with application to
combinatorial graph problems. arXiv preprint arXiv:1901.10289.
Alhajeri, M.S., Abdullah, F., Wu, Z., and Christofides, P.D. (2022). Physics-
Research ethics: Not applicable. informed machine learning modeling for predictive control using
Informed consent: Not applicable. noisy data. Chem. Eng. Res. Des. 186: 34–49.
Author contributions: Z.W. and P.D.C. conceptualized the Alhajeri, M.S., Ren, Y.M., Ou, F., Abdullah, F., and Christofides, P.D. (2024).
review and oversaw all aspects of the project. Z.W. and W.W. Model predictive control of nonlinear processes using transfer
learning-based recurrent neural networks. Chem. Eng. Res. Des. 205:
were responsible for the initial literature review. Z.W. took the
1–12.
lead in writing the original manuscript, with significant inputs
Ali, M., Cai, X., Khan, F.I., Pistikopoulos, E.N., and Tian, Y. (2023). Dynamic
from W.W. and F.A., Y.W., F.A., A.A. and Y.K. contributed risk-based process design and operational optimization via multi-
significantly to the review and editing process. All authors parametric programming. Digit. Chem. Eng. 7: 100096.
have accepted responsibility for the entire content of this Amos, B., Xu, L., and Kolter, J.Z. (2017). Proceedings of the 34th international
manuscript and approved its submission. conference on machine learning, August 6–11, 2017: input convex neural
networks. PMLR, Sydney, Australia, pp. 146–155.
Use of Large Language Models, AI and Machine Learning
Anil, C., Lucas, J., and Grosse, R. (2019). Proceedings of the 36th international
Tools: None declared. conference on machine learning, June 9–15, 2019: sorting out Lipschitz
Conflict of interest: The authors state no conflict of interest. function approximation. PMLR, California, USA, pp. 291–301.
Research funding: Financial support from the National Antonelo, E.A., Camponogara, E., Seman, L.O., Jordanou, J.P., de Souza, E.R.,
Science Foundation, the Department of Energy, NRF-CRP (27- and Hübner, J.F. (2024). Physics-informed neural nets for control of
dynamical systems. Neurocomputing 579: 127419.
2021-0001), MOE AcRF Tier 1 FRC Grant (22-5367-A0001),
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Proceedings of the 34th
Singapore, and A*STAR MTC YIRG 2022 (M22K3c0093),
international conference on machine learning, August 6–11, 2017:
Singapore is gratefully acknowledged. Wasserstein generative adversarial networks. PMLR, Sydney, Australia,
Data availability: The raw data can be obtained on request pp. 214–223.
from the corresponding authors. Arnold, F. and King, R. (2021). State-space modeling for control based on
physics-informed neural networks. Eng. Appl. Artif. Intell. 101: 104195.
Bangi, M.S.F., Kao, K., and Kwon, J.S.-I. (2022). Physics-informed neural
networks for hybrid modeling of lab-scale batch fermentation for
References β-carotene production using Saccharomyces cerevisiae. Chem. Eng.
Res. Des. 179: 415–423.
Abbasi, M., Santos, B.P., Pereira, T.C., Sofia, R., Monteiro, N.R., Simões, C.J., Bank, D., Koenigstein, N., and Giryes, R. (2023) Autoencoders. In: Machine
Brito, R.M., Ribeiro, B., Oliveira, J.L., and Arrais, J.P. (2022). Designing learning for data science handbook: data mining and knowledge
optimized drug candidates with generative adversarial network. discovery handbook. Springer, Cham, pp. 353–374.
J. Cheminf. 14: 40. Bartlett, P.L., Foster, D.J., and Telgarsky, M.J. (2017). Spectrally-normalized
Abdullah, F. and Christofides, P.D. (2023a). Data-based modeling and margin bounds for neural networks. In: Advances in neural information
control of nonlinear process systems using sparse identification: an processing systems, Vol. 30. Curran Associates, Inc, Red Hook, NY.
overview of recent results. Comput. Chem. Eng. 174: 108247. Batra, R., Dai, H., Huan, T.D., Chen, L., Kim, C., Gutekunst, W.R., Song, L., and
Abdullah, F. and Christofides, P.D. (2023b). Real-time adaptive sparse- Ramprasad, R. (2020). Polymers for extreme conditions designed
identification-based predictive control of nonlinear processes. Chem. using syntax-directed variational autoencoders. Chem. Mater. 32:
Eng. Res. Des. 196: 750–769. 10489–10500.
Abdullah, F., Wu, Z., and Christofides, P.D. (2021a). Data-based reduced- Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2006). Analysis of
order modeling of nonlinear two-time-scale processes. Chem. Eng. Res. representations for domain adaptation. In: Advances in neural
Des. 166: 1–9. information processing systems, Vol. 19. MIT Press, Cambridge, MA.
Abdullah, F., Wu, Z., and Christofides, P.D. (2021b). Sparse-identification- Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan,
based model predictive control of nonlinear two-time-scale processes. J.W. (2010). A theory of learning from different domains. Mach. Learn.
Comput. Chem. Eng. 153: 107411. 79: 151–175.
Abdullah, F., Alhajeri, M.S., and Christofides, P.D. (2022a). Modeling Berberich, J. and Allgöwer, F. (2024). An overview of systems-theoretic
and control of nonlinear processes using sparse identification: guarantees in data-driven model predictive control. arXiv preprint
using dropout to handle noisy data. Ind. Eng. Chem. Res. 61: arXiv:2406.04130.
17976–17992. Berberich, J., Köhler, J., Müller, M.A., and Allgöwer, F. (2020). Data-driven
Abdullah, F., Wu, Z., and Christofides, P.D. (2022b). Handling noisy data in model predictive control with stability and robustness guarantees.
sparse model identification using subsampling and co-teaching. IEEE Trans. Automat. Control 66: 1702–1717.
Comput. Chem. Eng. 157: 107628. Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Proceedings of hyper-parameter optimization. In: Advances in neural information
the 25th ACM SIGKDD international conference on knowledge discovery & processing systems, Vol. 24. Curran Associates, Inc, Red Hook, NY.
data mining, August 4–8, 2019: optuna: a next-generation Bergstra, J., Yamins, D., and Cox, D. (2013). Proceedings of the 30th
hyperparameter optimization framework. Association for Computing international conference on machine learning, June 16–21, 2013: making a
Machinery, Anchorage, AK, USA, pp. 2623–2631. science of model search: hyperparameter optimization in hundreds of
36 Z. Wu et al.: Machine learning-based model predictive control
dimensions for vision architectures. PMLR, Atlanta, GA, USA, Chen, W.-H. and You, F. (2021). Semiclosed greenhouse climate control
pp. 115–123. under uncertainty via machine learning and data-driven robust model
Bhadriraju, B., Narasingam, A., and Kwon, J.S.-I. (2019). Machine learning- predictive control. IEEE Trans. Control Syst. Technol. 30: 1186–1197.
based adaptive model identification of systems: application to a Chen, R.T., Rubanova, Y., Bettencourt, J., and Duvenaud, D.K. (2018a) Neural
chemical process. Chem. Eng. Res. Des. 152: 372–383. ordinary differential equations. In: Advances in neural information
Bhowmick, A., D’Souza, M., and Raghavan, G.S. (2021) LipBaB: computing processing systems, Vol. 31. Curran Associates, Inc, Red Hook, NY.
exact Lipschitz constant of ReLU networks. In: Artificial neural networks Chen, S., Saulnier, K., Atanasov, N., Lee, D.D., Kumar, V., Pappas, G.J., and
and machine learning – ICANN 2021. Springer, Cham, pp. 151–162. Morari, M. (2018b). Proceedings of the 2018 annual American control
Bi, K., Beykal, B., Avraamidou, S., Pappas, I., Pistikopoulos, E.N., and Qiu, T. conference, June 27–29, 2018: approximating explicit model predictive
(2020). Integrated modeling of transfer learning and intelligent control using constrained neural networks. Milwaukee, Wisconsin, USA,
heuristic optimization for a steam cracking process. Ind. Eng. Chem. pp. 1520–1527.
Res. 59: 16357–16367. Chen, Y., Shi, Y., and Zhang, B. (2018c). Optimal control via neural networks:
Bitmead, R.R., Gevers, M., and Wertz, V. (1990). Adaptive optimal control the a convex approach. arXiv preprint arXiv:1805.11835.
thinking man’s GPC. Prentice Hall, Victoria, Australia. Chen, M., Li, X., and Zhao, T. (2019). On generalization bounds of a family of
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman, J. (2007). recurrent neural networks. arXiv preprint arXiv:1910.12947.
Learning bounds for domain adaptation. In: Advances in neural Chen, S., Wu, Z., and Christofides, P.D. (2020a). Decentralized machine-
information processing systems, Vol. 20. Curran Associates, Inc, Red learning-based predictive control of nonlinear processes. Chem. Eng.
Hook, NY. Res. Des. 162: 45–60.
Bo, S., Agyeman, B.T., Yin, X., and Liu, J. (2023). Control invariant set Chen, S., Wu, Z., Rincon, D., and Christofides, P.D. (2020b). Machine
enhanced safe reinforcement learning: improved sampling efficiency, learning-based distributed model predictive control of nonlinear
guaranteed stability and robustness. Comput. Chem. Eng. 179: 108413. processes. AIChE J. 66: e17013.
Bonassi, F., Farina, M., Xie, J., and Scattolini, R. (2022). On recurrent neural Chen, S., Wu, Z., and Christofides, P.D. (2022a). Machine-learning-based
networks for learning-based control: recent results and ideas for construction of barrier functions and models for safe model predictive
future developments. J. Process Control 114: 92–104. control. AIChE J. 68: e17456.
Bond-Taylor, S., Leach, A., Long, Y., and Willcocks, C.G. (2021). Deep Chen, S., Wu, Z., and Christofides, P.D. (2022b). Statistical machine-learning-
generative modelling: a comparative review of VAEs, GANs, based predictive control using barrier functions for process
normalizing flows, energy-based and autoregressive models. IEEE operational safety. Comput. Chem. Eng. 163: 107860.
Trans. Pattern Anal. Mach. Intell. 44: 7327–7347. Cheng, F., He, Q.P., and Zhao, J. (2019). A novel process monitoring
Bradford, E., Imsland, L., Zhang, D., and del Rio Chanona, E.A. (2020). approach based on variational recurrent autoencoder. Comput. Chem.
Stochastic data-driven model predictive control using Gaussian Eng. 129: 106515.
processes. Comput. Chem. Eng. 139: 106844. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F.,
Briceno-Mena, L.A., Romagnoli, J.A., and Arges, C.G. (2022). PemNet: a Schwenk, H., and Bengio, Y. (2014). Proceedings of the 2014 conference
transfer learning-based modeling approach of high-temperature on empirical methods in natural language processing (EMNLP), October
polymer electrolyte membrane electrochemical systems. Ind. Eng. 25–29, 2014: learning phrase representations using RNN encoder–
Chem. Res. 61: 3350–3357. decoder for statistical machine translation. Doha, Qatar, pp. 1724–1734.
Brunke, L., Greeff, M., Hall, A.W., Yuan, Z., Zhou, S., Panerati, J., and Christofides, P.D., Scattolini, R., De La Pena, D.M., and Liu, J. (2013).
Schoellig, A.P. (2022). Safe learning in robotics: from learning-based Distributed model predictive control: a tutorial review and future
control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. research directions. Comput. Chem. Eng. 51: 21–41.
Syst. 5: 411–444. Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. (2017).
Bünning, F., Schalbetter, A., Aboudonia, A., de Badyn, M.H., Heer, P., and Proceedings of the 34th international conference on machine learning,
Lygeros, J. (2021). Proceedings of the 3rd conference on learning for August 6–11, 2017: parseval networks: improving robustness to adversarial
dynamics and control, June 7–8, 2021: input convex neural networks for examples. PMLR, Sydney, Australia, pp. 854–863.
building MPC. PMLR, pp. 251–262. Çıtmacı, B., Luo, J., Jang, J.B., Canuso, V., Richard, D., Ren, Y.M., Morales-
Cai, S., Wang, Z., Fuest, F., Jeon, Y.J., Gray, C., and Karniadakis, G.E. (2021). Guio, C.G., and Christofides, P.D. (2022). Machine learning-based
Flow over an espresso cup: inferring 3-D velocity and pressure fields ethylene concentration estimation, real-time optimization and
from tomographic background oriented Schlieren via physics- feedback control of an experimental electrochemical reactor. Chem.
informed neural networks. J. Fluid Mech. 915: A102. Eng. Res. Des. 185: 87–107.
Chandrasekar, A., Abdulhussain, H., Thompson, M.R., and Mhaskar, P. Daoutidis, P., Lee, J.H., Rangarajan, S., Chiang, L., Gopaluni, B.,
(2024). Utilizing neural networks for image-based model predictive Schweidtmann, A.M., Harjunkoski, I., Mercangöz, M., Mesbah, A.,
controller of a batch rotational molding process. IFAC-PapersOnLine Boukouvala, F., et al. (2023). Machine learning in process systems
58: 470–475. engineering: challenges and opportunities. Comput. Chem. Eng. 181:
Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection 108523.
methods. Comput. Electr. Eng. 40: 16–28. David, S.B., Lu, T., Luu, T., and Pál, D. (2010). Proceedings of the 13th
Chang, H.-C. and Aluko, M. (1984). Multi-scale analysis of exotic dynamics in international conference on artificial intelligence and statistics, May 13–
surface catalyzed reactions–I: justification and preliminary model 15, 2010: impossibility theorems for domain adaptation. JMLR Workshop
discriminations. Chem. Eng. Sci. 39: 37–50. and Conference Proceedings, Sardinia, Italy, pp. 129–136.
Chen, H. and Allgöwer, F. (1998). A quasi-infinite horizon nonlinear model de Giuli, L.B., La Bella, A., and Scattolini, R. (2024). Physics-informed neural
predictive control scheme with guaranteed stability. Automatica 34: network modeling and predictive control of district heating systems.
1205–1217. IEEE Trans. Control Syst. Technol. 32: 1182–1195.
Z. Wu et al.: Machine learning-based model predictive control 37
de Vos, B.D., Jansen, G.E., and Išgum, I. (2023). Stochastic co-teaching for nets. In: Advances in neural information processing systems, Vol. 27.
training neural networks with unknown levels of label noise. Sci. Rep. Curran Associates, Inc, Red Hook, NY.
13: 16875. Gouk, H., Frank, E., Pfahringer, B., and Cree, M.J. (2021). Regularisation of
Decardi-Nelson, B., Alshehri, A.S., Ajagekar, A., and You, F. (2024). neural networks by enforcing Lipschitz continuity. Mach. Learn. 110:
Generative AI and process systems engineering: the next Frontier. 393–416.
Comput. Chem. Eng. 187: 108723. Grimstad, B. and Andersson, H. (2019). ReLU networks as surrogate models
Degeest, A., Verleysen, M., and Frénay, B. (2019) About filter criteria for in mixed-integer linear programs. Comput. Chem. Eng. 131: 106580.
feature selection in regression. In: Advances in computational Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A.C. (2017).
intelligence. Springer, Cham, pp. 579–590. Improved training of Wasserstein GANs. In: Advances in neural
Dev, P., Jain, S., Arora, P.K., and Kumar, H. (2021). Machine learning and its information processing systems, Vol. 30. Curran Associates, Inc, Red
impact on control systems: a review. Mater. Today: Proc. 47: Hook, NY.
3744–3749. Guo, J., Du, W., and Nascu, I. (2020). Adaptive modeling of fixed-bed
Dobbelaere, M.R., Plehiers, P.P., Van de Vijver, R., Stevens, C.V., and reactors with multicycle and multimode characteristics based on
Van Geem, K.M. (2021). Machine learning in chemical engineering: transfer learning and just-in-time learning. Ind. Eng. Chem. Res. 59:
strengths, weaknesses, opportunities, and threats. Engineering 7: 6629–6637.
1201–1211. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M.
Dong, D. and McAvoy, T. (1996). Nonlinear principal component analysis–based (2018). Co-teaching: robust training of deep neural networks with
on principal curves and neural networks. Comput. Chem. Eng. 20: 65–78. extremely noisy labels. In: Advances in neural information processing
Elgamal, T. (1985). A public key cryptosystem and a signature scheme based systems, Vol. 31. Curran Associates, Inc, Red Hook, NY.
on discrete logarithms. IEEE Trans. Inf. Theor. 31: 469–472. Han, X., Zhang, L., Zhou, K., and Wang, X. (2019). ProGAN: protein solubility
Ellis, M.J. and Chinde, V. (2020). An encoder–decoder LSTM-based EMPC generative adversarial nets for data augmentation in DNN
framework applied to a building HVAC system. Chem. Eng. Res. Des. framework. Comput. Chem. Eng. 131: 106533.
160: 508–520. Harshvardhan, G., Gourisaria, M.K., Pandey, M., and Rautaray, S.S. (2020). A
Everett, M. (2021). Proceedings of the 60th IEEE conference on decision and comprehensive survey and analysis of generative models in machine
control (CDC), December 14–17, 2021: neural network verification in learning. Comput. Sci. Rev. 38: 100285.
control. IEEE, Austin, TX, USA, pp. 6326–6340. Hassanpour, H., Corbett, B., and Mhaskar, P. (2020). Integrating dynamic
Farokhi, F., Shames, I., and Batterham, N. (2017). Secure and private control neural network models with principal component analysis for
using semi-homomorphic encryption. Control Eng. Pract. 67: 13–20. adaptive model predictive control. Chem. Eng. Res. Des. 161: 26–37.
Federer, H. (2014). Geometric measure theory. Springer Berlin Heidelberg, He, R., Li, X., Chen, G., Chen, G., and Liu, Y. (2020). Generative adversarial
Heidelberg. network-based semi-supervised learning for real-time risk warning of
Ferramosca, A., Limon, D., González, A.H., Odloak, D., and Camacho, E.F. process industries. Expert Syst. Appl. 150: 113244.
(2010). MPC for tracking zone regions. J. Process Control 20: 506–516. Hein, M. and Andriushchenko, M. (2017). Formal guarantees on the
Gal, Y. and Ghahramani, Z. (2016a). Proceedings of the 33rd international robustness of a classifier against adversarial manipulation. In:
conference on machine learning, June 19–24, 2016: dropout as a Bayesian Advances in neural information processing systems, Vol. 30. Curran
approximation: representing model uncertainty in deep learning. PMLR, Associates, Inc, Red Hook, NY.
New York, USA, pp. 1050–1059. Hewing, L., Wabersich, K.P., Menner, M., and Zeilinger, M.N. (2020).
Gal, Y. and Ghahramani, Z. (2016b) A theoretically grounded application of Learning-based model predictive control: toward safe learning in
dropout in recurrent neural networks. In: Advances in neural control. Annu. Rev. Control Robot. Auton. Syst. 3: 269–296.
information processing systems, Vol. 29. Curran Associates, Inc, Red Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., and
Hook, NY. Salakhutdinov, R.R. (2012). Improving neural networks by preventing
Gao, Y., Yan, S., Zhou, J., Cannon, M., Abate, A., and Johansson, K.H. (2024). co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Proceedings of the 6th annual learning for dynamics & control conference, Hirtreiter, E., Schulze Balhorn, L., and Schweidtmann, A.M. (2024). Toward
July 15–17, 2024: learning-based rigid tube model predictive control. automatic generation of control structures for process flow diagrams
PMLR, Oxford, UK, pp. 492–503. with large language models. AIChE J. 70: e18259.
Garcıa, J. and Fernández, F. (2015). A comprehensive survey on safe Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural
reinforcement learning. J. Mach. Learn. Res. 16: 1437–1480. Comput. 9: 1735–1780.
Gentry, C., Halevi, S., and Smart, N.P. (2012). Proceedings of the annual Hoi, S.C., Sahoo, D., Lu, J., and Zhao, P. (2021). Online learning: a
cryptology conference– CRYPTO 2012, August 19–23, 2012: homomorphic comprehensive survey. Neurocomputing 459: 249–289.
evaluation of the AES circuit. Springer Berlin Heidelberg, Santa Barbara, Hoskins, J.C. and Himmelblau, D.M. (1988). Artificial neural network models
CA, USA, pp. 850–867. of knowledge representation in chemical engineering. Comput. Chem.
Golowich, N., Rakhlin, A., and Shamir, O. (2018). Proceedings of the 31st Eng. 12: 881–890.
conference on learning theory, July 6–9, 2018: size-independent sample Hu, C. and Wu, Z. (2024). Model predictive control of switched nonlinear
complexity of neural networks. PMLR, Stockholm, Sweden, pp. 297–299. systems using online machine learning. Chem. Eng. Res. Des. 209:
González, A.H. and Odloak, D. (2009). A stable MPC with zone control. 221–236.
J. Process Control 19: 110–122. Hu, G. and You, F. (2023). Multi-zone building control with thermal comfort
Gonzalez, C., Asadi, H., Kooijman, L., and Lim, C.P. (2023). Neural networks constraints under disjunctive uncertainty using data-driven robust
for fast optimisation in model predictive control: a review. arXiv model predictive control. Adv. Appl. Energy 9: 100124.
preprint arXiv:2309.02668. Hu, C., Cao, Y., and Wu, Z. (2023a). Online machine learning modeling and
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., predictive control of nonlinear systems with scheduled mode
Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial transitions. AIChE J. 69: e17882.
38 Z. Wu et al.: Machine learning-based model predictive control
Hu, C., Chen, S., and Wu, Z. (2023b). Economic model predictive control of Kingma, D.P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv
nonlinear systems using online learning of neural networks. Processes preprint arXiv:1312.6114.
11: 342. Koiran, P. and Sontag, E.D. (1998). Vapnik-Chervonenkis dimension of
Huang, B. and Kadali, R. (2008) System identification: conventional recurrent neural networks. Discrete Appl. Math. 86: 63–79.
approach. In: Dynamic modeling, predictive control and performance Kokotović, P., Khalil, H.K., and O’Reilly, J. (1999). Singular perturbation
monitoring: a data-driven subspace approach. Springer London, methods in control: analysis and design. Society for Industrial and
London, pp. 9–29. Applied Mathematics, Chap. 3, pp. 93–156.
Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. (2006). Kramer, M.A. (1991). Nonlinear principal component analysis using
Correcting sample selection bias by unlabeled data. In: Advances in neural autoassociative neural networks. AIChE J. 37: 233–243.
information processing systems, Vol. 19. MIT Press, Cambridge, MA. Kvasnica, M., Grieder, P., Baotić, M., and Morari, M. (2004). Proceedings of
Huang, Z., Liu, J., and Huang, B. (2023). Model predictive control of agro- the 7th international workshop on hybrid systems: computation and
hydrological systems based on a two-layer neural network modeling control (HSCC 2004), March 25–27, 2004: multi-parametric toolbox (MPT).
framework. Int. J. Adapt. Control Signal Process. 37: 1536–1558. Philadelphia, PA, USA, pp. 448–462.
Jalanko, M., Sanchez, Y., Mahalec, V., and Mhaskar, P. (2021). Adaptive Lanzetti, N., Lian, Y.Z., Cortinovis, A., Dominguez, L., Mercangöz, M., and
system identification of industrial ethylene splitter: a comparison of Jones, C. (2019). Proceedings of the 18th European control conference
subspace identification and artificial neural networks. Comput. Chem. (ECC), June 25–28, 2019: recurrent neural network based MPC for process
Eng. 147: 107240. industries. IEEE, Naples, Italy, pp. 1005–1010.
Kadakia, Y.A., Abdullah, F., Alnajdi, A., and Christofides, P.D. (2024a). Lee, Y.S. and Chen, J. (2023). Developing semi-supervised latent dynamic
Encrypted distributed model predictive control of nonlinear variational autoencoders to enhance prediction performance of
processes. Control Eng. Pract. 145: 105874. product quality. Chem. Eng. Sci. 265: 118192.
Kadakia, Y.A., Abdullah, F., Alnajdi, A., and Christofides, P.D. (2024b). Lee, J.H., Shin, J., and Realff, M.J. (2018). Machine learning: overview of the
Integrating dynamic economic optimization and encrypted control for recent progresses and implications for the process systems
cyber-resilient operation of nonlinear processes. AIChE J. 70: e18509. engineering field. Comput. Chem. Eng. 114: 111–121.
Kadakia, Y.A., Suryavanshi, A., Alnajdi, A., Abdullah, F., and Christofides, P.D. Lee, S., Kwak, M., Tsui, K.-L., and Kim, S.B. (2019). Process monitoring using
(2024c). Proceedings of the 2024 American control conference, July 10–12, variational autoencoder for high-dimensional nonlinear processes.
2024: a two-tier encrypted control architecture for enhanced cybersecurity Eng. Appl. Artif. Intell. 83: 13–27.
of nonlinear processes. Toronto, Canada, pp. 4452–4459. Lee, N., Kim, H., Jung, J., Park, K.-H., Linga, P., and Seo, Y. (2022). Time series
Kadakia, Y.A., Suryavanshi, A., Alnajdi, A., Abdullah, F., and Christofides, P.D. prediction of hydrate dynamics on flow assurance using PCA and
(2024d). Integrating machine learning detection and encrypted recurrent neural networks with iterative transfer learning. Chem. Eng.
control for enhanced cybersecurity of nonlinear processes. Comput. Sci. 263: 118111.
Chem. Eng. 180: 108498. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J., and Liu, H.
Karagiannopoulos, M., Anyfantis, D., Kotsiantis, S.B., and Pintelas, P.E. (2017). Feature selection: a data perspective. ACM Comput. Surv. 50:
(2007). Proceedings of the 8th hellenic European research on computer 1–45.
mathematics & its applications, September 20–22, 2007: feature selection Limon, D., Calliess, J., and Maciejowski, J.M. (2017). Learning-based
for regression problems. Athens, Greece. nonlinear model predictive control. IFAC-PapersOnLine 50:
Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. 7769–7776.
(2021). Physics-informed machine learning. Nat. Rev. Phys. 3: 422–440. Lu, J., Cao, Z., Zhao, C., and Gao, F. (2019). 110th anniversary: an overview on
Kassa, A.M. and Kassa, S.M. (2016). A branch-and-bound multi-parametric learning-based model predictive control for batch processes. Ind. Eng.
programming approach for non-convex multilevel optimization with Chem. Res. 58: 17164–17173.
polyhedral constraints. J. Global Optim. 64: 745–764. Luo, J., Canuso, V., Jang, J.B., Wu, Z., Morales-Guio, C.G., and Christofides,
Katz, J., Pappas, I., Avraamidou, S., and Pistikopoulos, E.N. (2020). P.D. (2022). Machine learning-based operational modeling of an
Integrating deep learning models and multiparametric programming. electrochemical reactor: handling data variability and improving
Comput. Chem. Eng. 136: 106801. empirical models. Ind. Eng. Chem. Res. 61: 8399–8410.
Kenefake, D. and Pistikopoulos, E.N. (2022). Proceedings of the 32nd Luo, J., Çıtmacı, B., Jang, J.B., Abdullah, F., Morales-Guio, C.G., and
European aymposium on computer-aided process engineering, June 12– Christofides, P.D. (2023). Machine learning-based predictive control
15, 2022: PPOPT-multiparametric solver for explicit MPC. Toulouse, using on-line model linearization: application to an experimental
France, pp. 1273–1278. electrochemical reactor. Chem. Eng. Res. Des. 197: 721–737.
Khan, N. and Ammar Taqvi, S.A. (2023). Machine learning an intelligent Mahmood, M. and Mhaskar, P. (2008). Enhanced stability regions for model
approach in process industries: a perspective and overview. predictive control of nonlinear process systems. AIChE J. 54:
ChemBioEng Rev. 10: 195–221. 1487–1498.
Kim, Y. and Kim, J.W. (2022). Safe model-based reinforcement learning for Mahmood, F., Govindan, R., Bermak, A., Yang, D., and Al-Ansari, T. (2023).
nonlinear optimal control with state and input constraints. AIChE J. 68: Data-driven robust model predictive control for greenhouse
e17601. temperature control and energy utilisation assessment. Appl. Energy
Kim, J., Lee, C., Shim, H., Cheon, J.H., Kim, A., Kim, M., and Song, Y. (2016). 343: 121190.
Encrypting controller using fully homomorphic encryption for security Mansour, Y., Mohri, M., and Rostamizadeh, A. (2008). Domain adaptation
of cyber-physical systems. IFAC-PapersOnLine 49: 175–180. with multiple sources. In: Advances in neural information processing
Kim, S., Noh, J., Gu, G.H., Aspuru-Guzik, A., and Jung, Y. (2020). Generative systems, Vol. 21. Curran Associates, Inc, Red Hook, NY.
adversarial networks for crystal structure prediction. ACS Cent. Sci. 6: Mansour, Y., Mohri, M., and Rostamizadeh, A. (2009). Domain adaptation:
1412–1420. learning bounds and algorithms. arXiv preprint arXiv:0902.3430.
Z. Wu et al.: Machine learning-based model predictive control 39
Manzano, J.M., Limon, D., de la Peña, D.M., and Calliess, J.-P. (2020). Robust Paillier, P. (1999). Public-key cryptosystems based on composite degree
learning-based MPC for nonlinear constrained systems. Automatica residuosity classes. In: Proceedings of the international conference on
117: 108948. the theory and applications of cryptographic techniques, May 2–6, 1999:
Mayne, D.Q., Rawlings, J.B., Rao, C.V., and Scokaert, P.O. (2000). public-key cryptosystems based on composite degree residuosity classes.
Constrained model predictive control: stability and optimality. Springer, Prague, Czech Republic, pp. 223–238.
Automatica 36: 789–814. Pan, S.J. and Yang, Q. (2009). A survey on transfer learning. IEEE Trans. Knowl.
Meng, F., Shen, X., and Karimi, H.R. (2022). Emerging methodologies in Data Eng. 22: 1345–1359.
stability and optimization problems of learning-based nonlinear Pan, I., Mason, L.R., and Matar, O.K. (2022). Data-centric engineering:
model predictive control: a survey. Int. J. Circ. Theor. Appl. 50: integrating simulation, machine learning and statistics. Challenges
4146–4170. and opportunities. Chem. Eng. Sci. 249: 117271.
Mesbah, A., Wabersich, K.P., Schoellig, A.P., Zeilinger, M.N., Lucia, S., Pappas, I., Diangelakis, N.A., and Pistikopoulos, E.N. (2021). The exact
Badgwell, T.A., and Paulson, J.A. (2022). Proceedings of the 2022 solution of multiparametric quadratically constrained quadratic
American control conference, June 8–10, 2022: fusion of machine learning programming problems. J. Global Optim. 79: 59–85.
and MPC under uncertainty: what advances are on the horizon? IEEE, Parker, S., Wu, Z., and Christofides, P.D. (2023). Cybersecurity in process
Atlanta, GA, USA, pp. 342–357. control, operations, and supply chain. Comput. Chem. Eng. 171: 108169.
Mhaskar, P., El-Farra, N.H., and Christofides, P.D. (2006). Stabilization of Patel, R., Bhartiya, S., and Gudi, R. (2023). Optimal temperature trajectory
nonlinear systems with state and control constraints using Lyapunov- for tubular reactor using physics informed neural networks. J. Process
based predictive control. Syst. Control Lett. 55: 650–659. Control 128: 103003.
Mishra, S. and Molinaro, R. (2022). Estimates on the generalization error of Pistikopoulos, E.N., Diangelakis, N.A., and Oberdieck, R. (2020). Multi-
physics-informed neural networks for approximating a class of inverse parametric optimization and control. John Wiley & Sons, London.
problems for PDEs. IMA J. Numer. Anal. 42: 981–1022. Pravin, P., Tan, J.Z.M., Yap, K.S., and Wu, Z. (2022). Hyperparameter
Mishra, S. and Molinaro, R. (2023). Estimates on the generalization error of optimization strategies for machine learning-based stochastic energy
physics-informed neural networks for approximating PDEs. IMA efficient scheduling in cyber-physical production systems. Digit. Chem.
J. Numer. Anal. 43: 1–43. Eng. 4: 100047.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Qin, R. and Zhao, J. (2022). High-efficiency generative adversarial network
machine learning. MIT press, Cambridge, MA. model for chemical process fault diagnosis. IFAC-PapersOnLine 55:
Mowbray, M., Vallerio, M., Perez-Galvan, C., Zhang, D., Chanona, A.D.R., and 732–737.
Navarro-Brull, F.J. (2022). Industrial data science–a review of machine Raissi, M., Perdikaris, P., and Karniadakis, G.E. (2019). Physics-informed
learning applications for chemical and process industries. React. Chem. neural networks: a deep learning framework for solving forward and
Eng. 7: 1471–1509. inverse problems involving nonlinear partial differential equations.
Murray-Smith, R., Sbarbaro, D., Rasmussen, C.E., and Girard, A. (2003). J. Comput. Phys. 378: 686–707.
Adaptive, cautious, predictive control with Gaussian process priors. Raissi, M., Yazdani, A., and Karniadakis, G.E. (2020). Hidden fluid mechanics:
IFAC Proc. Vol. 36: 1155–1160. learning velocity and pressure fields from flow visualizations. Science
Na, J., Jeon, K., and Lee, W.B. (2018). Toxic gas release modeling for real-time 367: 1026–1030.
analysis using variational autoencoder with convolutional neural Rakhlin, A., Sridharan, K., and Tewari, A. (2010). Online learning: random
networks. Chem. Eng. Sci. 181: 68–78. averages, combinatorial parameters, and learnability. In: Advances in
Nagy, Z.K. (2007). Model based control of a yeast fermentation bioreactor neural information processing systems, Vol. 23. Curran Associates, Inc,
using optimally designed artificial neural networks. Chem. Eng. J. 127: Red Hook, NY.
95–109. Ren, Y., Alhajeri, M.S., Luo, J., Chen, S., Abdullah, F., Wu, Z., and
Nascimento, C.A.O., Giudici, R., and Guardani, R. (2000). Neural network Christofides, P.D. (2022). A tutorial review of neural network modeling
based approach for optimization of industrial chemical processes. approaches for model predictive control. Comput. Chem. Eng.: 107956,
Comput. Chem. Eng. 24: 2303–2314. https://doi.org/10.1016/j.compchemeng.2022.107956.
Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2017). A PAC-Bayesian Rendall, R., Castillo, I., Schmidt, A., Chin, S.-T., Chiang, L.H., and Reis, M.
approach to spectrally-normalized margin bounds for neural (2019). Wide spectrum feature selection (WiSe) for regression model
networks. arXiv preprint arXiv:1707.09564. building. Comput. Chem. Eng. 121: 99–110.
Nian, R., Liu, J., and Huang, B. (2020). A review on reinforcement learning: Rijmen, V. and Daemen, J. (2001). Advanced encryption standard. In:
introduction and applications in industrial process control. Comput. Proceedings of federal information processing standards publications,
Chem. Eng. 139: 106886. Vol. 19. National Institute of Standards and Technology, p. 22.
Ning, C. and You, F. (2021). Online learning based risk-averse stochastic MPC Robinet, F., Parera, C., Hundt, C., and Frank, R. (2022). Proceedings of the
of constrained linear uncertain systems. Automatica 125: 109402. 2022 IEEE/CVF winter conference on applications of computer vision,
Norouzi, A., Heidarifar, H., Borhan, H., Shahbakhti, M., and Koch, C.R. (2023). January 4–8, 2022: weakly-supervised free space estimation through
Integrating machine learning and model predictive control for stochastic co-teaching. Waikoloa, HI, USA, pp. 618–627.
automotive applications: a review and future directions. Eng. Appl. Rogers, A.W., Cardenas, I.O.S., Del Rio-Chanona, E.A., and Zhang, D. (2023).
Artif. Intell. 120: 105878. Investigating physics-informed neural networks for bioprocess hybrid
Nouira, A., Sokolovska, N., and Crivello, J.-C. (2018). CrystalGAN: learning to model construction. In: Computer aided chemical engineering, Vol. 52.
discover crystallographic structures with generative adversarial Elsevier, Amsterdam, pp. 83–88.
networks. arXiv preprint arXiv:1810.11203. Romdlony, M.Z. and Jayawardhana, B. (2016). Stabilization with guaranteed
Oberdieck, R., Diangelakis, N.A., Papathanasiou, M.M., Nascu, I., and safety using control Lyapunov–barrier function. Automatica 66: 39–47.
Pistikopoulos, E.N. (2016). POP–parametric optimization toolbox. Ind. Sangoi, E., Quaglio, M., Bezzo, F., and Galvanin, F. (2022). Optimal design of
Eng. Chem. Res. 55: 8979–8991. experiments based on artificial neural network classifiers for fast
40 Z. Wu et al.: Machine learning-based model predictive control
kinetic model recognition. In: Computer aided chemical engineering, Tan, G.Y., Xiao, M., Wu, G., and Wu, Z. (2024a). Proceedings of the 2024
Vol. 49. Elsevier, Amsterdam, pp. 817–822. American control conference, July 10–12, 2024: machine learning
Sangoi, E., Quaglio, M., Bezzo, F., and Galvanin, F. (2024). An optimal modeling of nonlinear processes with Lyapunov stability guarantees.
experimental design framework for fast kinetic model identification Toronto, Canada, pp. 528–535.
based on artificial neural networks. Comput. Chem. Eng. 187: 108752. Tan, W.G.Y., Xiao, M., and Wu, Z. (2024b). Robust reduced-order machine
Saraswathi K, S., Bhosale, H., Ovhal, P., Parlikkad Rajan, N., and Valadi, J.K. learning modeling of high-dimensional nonlinear processes using
(2020). Random forest and autoencoder data-driven models for noisy data. Digit. Chem. Eng. 11: 100145.
prediction of dispersed-phase holdup and drop size in rotating disc Tang, W. (2023). Synthesis of data-driven nonlinear state observers using
contactors. Ind. Eng. Chem. Res. 60: 425–435. lipschitz-bounded neural networks. arXiv preprint arXiv:2310.03187.
Scattolini, R. (2009). Architectures for distributed and hierarchical model Tang, W. and Daoutidis, P. (2022). Proceedings of the 2022 American control
predictive control–a review. J. Process Control 19: 723–731. conference, June 8–10, 2022: data-driven control: overview and
Schilter, O., Vaucher, A., Schwaller, P., and Laino, T. (2023). Designing perspectives. IEEE, Atlanta, Georgia, USA, pp. 1048–1064.
catalysts with deep generative models and computational data. A case Terzi, E., Bonassi, F., Farina, M., and Scattolini, R. (2021). Learning model
study for Suzuki cross coupling reactions. Digit. Discov. 2: 728–735. predictive control with long short-term memory networks. Int.
Schlüter, N., Binfet, P., and Darup, M.S. (2023). A brief survey on encrypted J. Robust Nonlinear Control 31: 8877–8896.
control: from the first to the second generation and beyond. Annu. Rev. Thebelt, A., Wiebe, J., Kronqvist, J., Tsay, C., and Misener, R. (2022).
Control 56: 100913. Maximizing information from chemical engineering data sets:
Schweidtmann, A.M., Esche, E., Fischer, A., Kloft, M., Repke, J.-U., Sager, S., applications to machine learning. Chem. Eng. Sci. 252: 117469.
and Mitsos, A. (2021). Machine learning in chemical engineering: a Tian, Y., Pappas, I., Burnak, B., Katz, J., and Pistikopoulos, E.N. (2021).
perspective. Chem. Ing. Tech. 93: 2029–2039. Simultaneous design & control of a reactive distillation system–a
Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J.-M., and parametric optimization & control approach. Chem. Eng. Sci. 230:
Del Barrio, E. (2021). Proceedings of the 2021 IEEE/CVF conference on 116232.
computer vision and pattern recognition, June 20–25, 2021: achieving Tian, J., Han, D., Li, M., and Shi, P. (2022). A multi-source information transfer
robustness in classification using optimal transport with hinge learning method with subdomain adaptation for cross-domain fault
regularization. Nashville, TN, USA, pp. 505–514. diagnosis. Knowl. Base Syst. 243: 108466.
Settles, B. (2009). Active learning literature survey. Computer Sciences Vapnik, V., Levin, E., and Le Cun, Y. (1994). Measuring the VC-dimension of a
Technical Report 1648. University of Wisconsin–Madison. learning machine. Neural Comput. 6: 851–876.
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Wächter, A. and Biegler, L.T. (2006). On the implementation of an interior-
Found. Trends Mach. Learn. 4: 107–194. point filter line-search algorithm for large-scale nonlinear
Shang, C. and You, F. (2019). Data analytics and machine learning for smart programming. Math. Program. 106: 25–57.
process manufacturing: recent advances and perspectives in the big Wang, R. and Manchester, I. (2023). Proceedings of the 40th international
data era. Engineering 5: 1010–1016. conference on machine learning, July 23–29, 2023: direct
Sitapure, N. and Kwon, J.S.-I. (2022). Neural network-based model predictive parameterization of lipschitz-bounded deep networks. PMLR, Hawaii,
control for thin-film chemical deposition of quantum dots using data USA, pp. 36093–36110.
from a multiscale simulation. Chem. Eng. Res. Des. 183: 595–607. Wang, Y. and Wu, Z. (2024a). Control Lyapunov-barrier function-based safe
Soloperto, R., Müller, M.A., and Allgöwer, F. (2022). Guaranteed closed-loop reinforcement learning for nonlinear optimal control. AIChE J. 70:
learning in model predictive control. IEEE Trans. Automat. Control 68: e18306.
991–1006. Wang, Y. and Wu, Z. (2024b). Physics-informed reinforcement learning for
Sontag, E.D. (1998a). A learning result for continuous-time recurrent neural optimal control of nonlinear systems. AIChE J. 70: e18542.
networks. Syst. Control Lett. 34: 151–158. Wang, Z. and Wu, Z. (2024c). Foundation model for chemical process
Sontag, E.D. (1998b). VC dimension of neural networks. NATO ASI Ser. F modeling: meta-learning with physics-informed adaptation. arXiv
Comput. Syst. Sci. 168: 69–96. preprint arXiv:2405.11752.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Wang, X., Ayachi, S., Corbett, B., and Mhaskar, P. (in press). Integrating
(2014). Dropout: a simple way to prevent neural networks from autoencoder with Koopman operator to design a linear data-driven
overfitting. J. Mach. Learn. Res. 15: 1929–1958. model predictive controller. Can. J. Chem. Eng.
Stewart, B.T., Venkat, A.N., Rawlings, J.B., Wright, S.J., and Pannocchia, G. Wang, Z., Dai, Z., Póczos, B., and Carbonell, J. (2019). Proceedings of the 2019
(2010). Cooperative distributed model predictive control. Syst. Control IEEE/CVF conference on computer vision and pattern recognition, June 15–
Lett. 59: 460–469. 20, 2019: characterizing and avoiding negative transfer. Long Beach, CA,
Su, H.T., McAvoy, T.J., and Werbos, P. (1992). Long-term predictions of USA, pp. 11293–11302.
chemical processes using recurrent neural networks: a parallel Wang, W., Wang, Y., Tian, Y., and Wu, Z. (2024a). Explicit machine learning-
training approach. Ind. Eng. Chem. Res. 31: 1338–1352. based model predictive control of nonlinear processes via multi-
Subraveti, S.G., Li, Z., Prasad, V., and Rajendran, A. (2022). Physics-based parametric programming. Comput. Chem. Eng. 186: 108689.
neural networks for simulation and synthesis of cyclic adsorption Wang, Z., Yu, D., and Wu, Z. (2025). Real-time machine-learning-based
processes. Ind. Eng. Chem. Res. 61: 4095–4113. optimization using input convex long short-term memory network.
Suryavanshi, A., Alnajdi, A., Alhajeri, M., Abdullah, F., and Christofides, P.D. Appl. Energy 377: 124472.
(2023). Encrypted model predictive control design for security to Wang, W., Zhang, H., Wang, Y., Tian, Y., and Wu, Z. (2024b). Fast explicit
cyberattacks. AIChE J. 69: e18104. machine learning-based model predictive control using input convex
Tan, W.G.Y. and Wu, Z. (2024). Robust machine learning modeling for neural networks. Ind. Eng. Chem. Res. 63: 17279–17293.
predictive control using Lipschitz-constrained neural networks. Wei, C. and Ma, T. (2019). Data-dependent sample complexity of deep
Comput. Chem. Eng. 180: 108466. neural networks via Lipschitz augmentation. In: Advances in neural
Z. Wu et al.: Machine learning-based model predictive control 41
information processing systems, Vol. 32. Curran Associates, Inc, Red Xiu, X., Yang, Y., Kong, L., and Liu, W. (2020). Laplacian regularized robust
Hook, NY. principal component analysis for process monitoring. J. Process Control
Wieland, P. and Allgöwer, F. (2007). Constructive safety using control barrier 92: 212–219.
functions. IFAC Proc. Vol. 40: 462–467. Xu, Z. and Wu, Z. (2024). Privacy-preserving federated machine learning
Wong, W., Chee, E., Li, J., and Wang, X. (2018). Recurrent neural network- modeling and predictive control of heterogeneous nonlinear systems.
based model predictive control for continuous pharmaceutical Comput. Chem. Eng. 187: 108749.
manufacturing. Mathematics 6: 242. Yang, S. and Bequette, B.W. (2021). Optimization-based control using input
Wu, Z. and Christofides, P.D. (2020). Control Lyapunov-barrier function- convex neural networks. Comput. Chem. Eng. 144: 107143.
based predictive control of nonlinear processes using machine Yang, F., Li, K., Zhong, Z., Luo, Z., Sun, X., Cheng, H., Guo, X., Huang, F., Ji, R.,
learning modeling. Comput. Chem. Eng. 134: 106706. and Li, S. (2020). Asymmetric co-teaching for unsupervised cross-
Wu, Z., Durand, H., and Christofides, P.D. (2018). Safe economic model domain person re-identification. In: Proceedings of the thirty-fourth
predictive control of nonlinear systems. Syst. Control Lett. 118: 69–76. AAAI conference on artificial intelligence, February 7–12, 2020: asymmetric
Wu, Z., Albalawi, F., Zhang, Z., Zhang, J., Durand, H., and Christofides, P.D. co-teaching for unsupervised cross-domain person re-identification,
(2019a). Control Lyapunov-barrier function-based model predictive Vol. 34. New York, USA, pp. 12597–12604.
control of nonlinear systems. Automatica 109: 108508. Yao, Y. and Doretto, G. (2010). Proceedings of the 2010 IEEE computer society
Wu, Z., Rincon, D., and Christofides, P.D. (2019b). Real-time adaptive conference on computer vision and pattern recognition, June 13–18, 2010:
machine-learning-based predictive control of nonlinear processes. boosting for transfer learning with multiple sources. San Francisco, CA,
Ind. Eng. Chem. Res. 59: 2275–2290. USA, pp. 1855–1862.
Wu, Z., Tran, A., Rincon, D., and Christofides, P.D. (2019c). Machine learning- You, Y. and Nikolaou, M. (1993). Dynamic process modeling with recurrent
based predictive control of nonlinear processes. Part I: theory. AIChE J. neural networks. AIChE J. 39: 1654–1667.
65: e16729. Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., and Sugiyama, M. (2019). Proceedings
Wu, Z., Tran, A., Rincon, D., and Christofides, P.D. (2019d). Machine- of the 36th international conference on machine learning, June 9–15, 2019:
learning-based predictive control of nonlinear processes. Part II: how does disagreement help generalization against label corruption?
computational implementation. AIChE J. 65: e16734. PMLR, California, USA, pp. 7164–7173.
Wu, Z., Rincon, D., and Christofides, P.D. (2020). Process structure-based Zhang, S. and Qiu, T. (2022). Semi-supervised LSTM ladder autoencoder for
recurrent neural network modeling for model predictive control of chemical process fault diagnosis and localization. Chem. Eng. Sci. 251:
nonlinear processes. J. Process Control 89: 74–84. 117467.
Wu, Z., Luo, J., Rincon, D., and Christofides, P.D. (2021a). Machine learning- Zhang, J., Lei, Q., and Dhillon, I. (2018) Stabilizing gradients for deep neural
based predictive control using noisy data: evaluating performance networks via efficient svd parameterization. In: Proceedings of the 35th
and robustness via a large-scale process simulator. Chem. Eng. Res. international conference on machine learning, July 10–15, 2018: stabilizing
Des. 168: 275–287. gradients for deep neural networks via efficient svd parameterization.
Wu, Z., Rincon, D., Gu, Q., and Christofides, P.D. (2021b). Statistical machine PMLR, Stockholm, Sweden, pp. 5806–5814.
learning in model predictive control of nonlinear processes. Zhang, C., Xie, Y., Bai, H., Yu, B., Li, W., and Gao, Y. (2021a). A survey on
Mathematics 9: 1912. federated learning. Knowl. Base Syst. 216: 106775.
Wu, Z., Rincon, D., Luo, J., and Christofides, P.D. (2021c). Machine learning Zhang, X., Zou, Y., and Li, S. (2021b). Semi-supervised generative adversarial
modeling and predictive control of nonlinear processes using noisy network with guaranteed safeness for industrial quality prediction.
data. AIChE J. 67: e17164. Comput. Chem. Eng. 153: 107418.
Wu, G., Yion, W.T.G., Dang, K.L.N.Q., and Wu, Z. (2023a). Physics-informed Zhang, X., Pan, W., Scattolini, R., Yu, S., and Xu, X. (2022). Robust tube-based
machine learning for MPC: application to a batch crystallization model predictive control with Koopman operators. Automatica 137:
process. Chem. Eng. Res. Des. 192: 556–569. 110114.
Wu, Z., Zhang, B., Yu, H., Ren, J., Pan, M., He, C., and Chen, Q. (2023b). Zhang, Z., Wang, X., Wang, G., Jiang, Q., Yan, X., and Zhuang, Y. (2024). A
Accelerating heat exchanger design by combining physics-informed data enhancement method based on generative adversarial network
deep learning and transfer learning. Chem. Eng. Sci. 282: 119285. for small sample-size with soft sensor application. Comput. Chem. Eng.
Wu, Z., Li, M., He, C., Zhang, B., Ren, J., Yu, H., and Chen, Q. (2024). Physics- 186: 108707.
informed learning of chemical reactor systems using decoupling– Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. (2018). Federated
coupling training framework. AIChE J.: e18436, https://doi.org/10. learning with non-iid data. arXiv preprint arXiv:1806.00582.
1002/aic.18436. Zhao, T., Zheng, Y., Gong, J., and Wu, Z. (2022a). Machine learning-based
Xiao, M. and Wu, Z. (2023). Modeling and control of a chemical process reduced-order modeling and predictive control of nonlinear
network using physics-informed transfer learning. Ind. Eng. Chem. Res. processes. Chem. Eng. Res. Des. 179: 435–451.
62: 17216–17227. Zhao, T., Zheng, Y., and Wu, Z. (2022b). Improving computational efficiency
Xiao, M., Hu, C., and Wu, Z. (2023). Modeling and predictive control of of machine learning modeling of nonlinear processes using sensitivity
nonlinear processes using transfer learning method. AIChE J. 69: e18076. analysis and active learning. Digit. Chem. Eng. 3: 100027.
Xiao, M., Vellayappan, K., Pravin, P., Gudena, K., and Wu, Z. (2024). Zhao, T., Zheng, Y., and Wu, Z. (2023). Feature selection-based machine
Optimization-based multi-source transfer learning for modeling of learning modeling for distributed model predictive control of
nonlinear processes. Chem. Eng. Sci. 295: 120117. nonlinear processes. Comput. Chem. Eng. 169: 108074.
Xie, R., Jan, N.M., Hao, K., Chen, L., and Huang, B. (2019). Supervised Zheng, Y. and Wu, Z. (2023). Physics-informed online machine learning and
variational autoencoders for soft sensor modeling with missing data. predictive control of nonlinear processes with parameter uncertainty.
IEEE Trans. Ind. Inf. 16: 2820–2828. Ind. Eng. Chem. Res. 62: 2804–2818.
42 Z. Wu et al.: Machine learning-based model predictive control
Zheng, S. and Zhao, J. (2020). A new unsupervised data mining method Zheng, Y., Hu, C., Wang, X., and Wu, Z. (2023). Physics-informed recurrent
based on the stacked autoencoder for chemical process fault neural network modeling for predictive control of nonlinear
diagnosis. Comput. Chem. Eng. 135: 106755. processes. J. Process Control 128: 103005.
Zheng, Y., Wang, X., and Wu, Z. (2022a). Machine learning modeling and Zhu, Q.X., Xu, T.X., Xu, Y., and He, Y.L. (2021). Improved virtual sample
predictive control of the batch crystallization process. Ind. Eng. Chem. generation method using enhanced conditional generative
Res. 61: 5578–5592. adversarial networks with cycle structures for soft sensors with limited
Zheng, Y., Zhang, T., Li, S., Zhang, G., and Wang, Y. (2022b). Gp-based MPC data. Ind. Eng. Chem. Res. 61: 530–540.
with updating tube for safety control of unknown system. Digit. Chem. Zhu, W., Zhang, J., and Romagnoli, J. (2022). General feature extraction for
Eng. 4: 100041. process monitoring using transfer learning approaches. Ind. Eng.
Zheng, Y., Zhao, T., Wang, X., and Wu, Z. (2022c). Online learning-based Chem. Res. 61: 5202–5214.
predictive control of crystallization processes under batch-to-batch Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. (2020).
parametric drift. AIChE J. 68: e17815. A comprehensive survey on transfer learning. Proc. IEEE 109: 43–76.