



A Survey on Deep Learning for Data-Driven Soft Sensors

Qingqiang Sun and Zhiqiang Ge, Senior Member, IEEE

IEEE Transactions on Industrial Informatics, early access, January 2021. DOI: 10.1109/TII.2021.3053128

Abstract—Soft sensors are widely constructed in the process industry to realize process monitoring, quality prediction, and many other important applications. With the development of hardware and software, industrial processes have taken on new characteristics which lead to poor performance of traditional soft sensor modeling methods. Deep learning, as a kind of data-driven approach, has shown great potential in many fields, as well as in soft sensing scenarios. After a period of development, especially in the last five years, many new issues have emerged which need to be investigated. Therefore, in this paper, the necessity and significance of deep learning for soft sensor applications are first demonstrated by analyzing the merits of deep learning and the trends of industrial processes. Next, mainstream deep learning models, tricks, and frameworks/toolkits are summarized and discussed to help designers propel the development of soft sensors. Then, existing works are reviewed and analyzed to discuss the demands and problems that occur in practical applications. Finally, an outlook and conclusions are given.

Index Terms—Soft Sensor, Deep Learning, Industrial Big Data, Data-driven Modeling, Neural Networks.

I. INTRODUCTION

Nowadays, the process industry is becoming more and more complicated, due to the development of information technologies and the increase of customer demands. As a result, the cost and difficulty of direct measurement and analysis of key quality variables are increasing [1-3]. However, in order to monitor the operation status of systems, realize the smooth control of processes, and improve the quality of products, those key variables or quality indices have to be obtained as quickly and accurately as possible. Therefore, the soft sensing technique, a kind of mathematical model with easy-to-measure auxiliary variables as inputs and hard-to-measure variables as outputs, has been developed over the past decades to estimate or predict important variables expediently [4].

There are three main types of approaches to establish soft sensing models, namely mechanism-based, knowledge-based, and data-driven methods [5]. The first two kinds of approaches can work well if a detailed and accurate mechanism of the process is known, or if a wealth of experience and knowledge about the process is available. However, the increasing complexity of industrial processes means these preconditions can no longer be easily satisfied. As a result, data-driven modeling has become the mainstream soft sensor modeling approach [6, 7]. Conventional data-driven soft sensor modeling methods mainly include a wide variety of statistical inference and machine learning techniques, such as Principal Component Regression (PCR), which combines Principal Component Analysis (PCA) with a regression model, Partial Least Squares (PLS) regression, Support Vector Machine (SVM), and Artificial Neural Network (ANN) [8-12]. In the last two decades, with technical breakthroughs on some key issues, networks with a sufficient number of hidden layers or with sufficiently complex structures have become available, which are known as Deep Learning (DL) techniques [13, 14]. DL techniques allow computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, object detection, and many other domains such as drug discovery and genomics [15].

In recent years, there has been a proliferation of research that applies deep learning approaches to soft sensors. Between the conventional artificial intelligence field and the soft sensing field, many differences objectively exist, and many questions need to be investigated and discussed, including but not limited to the following: Is it necessary and suitable to use deep learning techniques in soft sensing scenarios? What deep learning models can be utilized for practical application? How can they be applied to solve problems in real processes? What are the potential research points for the future? Therefore, the motivation of this work is to answer these questions as reasonably as possible.

The rest of the paper is organized as follows. Section II discusses the distinct merits of DL and demonstrates its necessity for soft sensor modeling. Section III provides an overview of several typical DL models and core training techniques. Then the state of the art of soft sensor applications using DL approaches is investigated in Section IV. Discussions and an outlook are given in Section V. Finally, conclusions of this work are made in Section VI.

This work was supported in part by the National Key Research and Development Program of China (2018YFC0808600), the National Natural Science Foundation of China (NSFC) (61722310), the Natural Science Foundation of Zhejiang Province (LR18F030001), and in part by the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University (ICT20098). (Corresponding author: Zhiqiang Ge.) The authors are with the State Key Laboratory of Industrial Control Technology, Institute of Industrial Process Control, College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, P. R. China (e-mail: sunqingqiang@zju.edu.cn; gezhiqiang@zju.edu.cn).


II. SIGNIFICANCE OF DEEP LEARNING FOR SOFT SENSORS

A detailed review of conventional methods can be found in existing work, such as [7, 16]. Although those methods already have many applications, they may suffer from drawbacks such as the heavy workload brought by handcrafted feature engineering, or inefficiency when dealing with large amounts of data. To demonstrate the significance of DL for soft sensor modeling, the distinct merits of DL and the trends and characteristics of industrial processes should be discussed.

A. Merits of deep learning techniques

To begin with, the structure of a simple network with a single hidden layer is shown in Fig. 1. There are three layers, namely an input layer, a hidden layer, and an output layer. The input layer contains the variables $x_1, \ldots, x_m$ and a constant node "1". The hidden layer has many nodes, and each node has an activation function $\sigma$. The feature in each node is extracted from the original input layer through an affine transformation followed by the activation function, as defined by the following formula:

$$H_i = \sigma\big(M_i(x_1, \ldots, x_m)\big) = \sigma\Big(\sum_{k=1}^{m} w_{ik}^{0} x_k + b_i^{0}\Big) \quad (1)$$

Then the final output is the combination of those composite functions:

$$y(x) = \sum_{k=1}^{n} w_k^{1} H_k(x) \quad (2)$$

The weight and bias parameters $(w_{ik}^{0}, b_i^{0})$ need to be learned by minimizing the loss function, which is defined according to the specific task and target. This process is called "training" or "learning".

Fig. 1. The structure of a network with a single hidden layer.
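To make Eqs. (1)-(2) concrete, here is a minimal NumPy sketch of the forward pass of such a single-hidden-layer network. The sigmoid activation and all variable names are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def sigmoid(z):
    # activation function sigma, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W0, b0, w1):
    """Forward pass of a single-hidden-layer network, Eqs. (1)-(2).

    x  : (m,)   input variables x_1..x_m
    W0 : (n, m) hidden-layer weights w_ik^0
    b0 : (n,)   hidden-layer biases b_i^0
    w1 : (n,)   output weights w_k^1
    """
    H = sigmoid(W0 @ x + b0)   # Eq. (1): H_i = sigma(sum_k w_ik^0 x_k + b_i^0)
    y = w1 @ H                 # Eq. (2): y(x) = sum_k w_k^1 H_k(x)
    return y

# toy usage: m = 4 input variables, n = 8 hidden nodes
rng = np.random.default_rng(0)
y = forward(rng.normal(size=4), rng.normal(size=(8, 4)),
            np.zeros(8), rng.normal(size=8))
```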
According to the Universal Approximation Theorem, if there are enough nodes in the hidden layer, the function represented by the network shown in Fig. 1 can approximate any continuous function [17-19]. Furthermore, using multiple layers of neurons to represent some functions can be much simpler.

Since Hinton et al. proposed a fast learning algorithm for the Deep Belief Network (DBN), the maximum depth of a network has been able to reach tens of layers [13]. Later on, He et al. proposed the Deep Residual Network, which solved the performance degradation problem caused by increasing network depth; since then, the depth of a neural network can reach a level of hundreds of layers [20]. However, "deep" in deep learning theory is not absolutely defined. In the speech recognition domain, a four-layer network can be considered "deep", while in image recognition, networks with more than 20 layers are common.

Deep learning has its own advantages compared with conventional soft sensor modeling methods. Here we classify the latter into three categories at a coarser granularity: rule-based systems, classical machine learning, and shallow representation learning. The differences between them are shown in Fig. 2, in which the green blocks indicate components that are able to learn information from data [21].

A rule-based system, also known as a production system or expert system, is the simplest form of artificial intelligence. Rules are coded into the program as the representation of knowledge, and they tell the system what to do or what to conclude in different situations [22-24]. In this way, the performance of a rule-based system depends almost entirely on expert knowledge, which is hard to obtain and hard to update, especially in complicated cases. A rule-based system could be considered as having "fixed" intelligence; in contrast, a machine learning system is more adaptive and closer to human intelligence. Instead of outputting a result directly from a fixed set of rules written by humans, classical machine learning first extracts features from raw input data and then maps from the features to the final output. However, the forms of the features are still handcrafted based on knowledge and experience, which is called feature engineering [25, 26]. In order to extract features that better represent the underlying problem, the process of feature engineering is usually complicated, including feature selection, feature construction, and feature extraction. Because the upper bound of the performance of conventional machine learning is mainly determined by the data and features, the effect of those approaches relies heavily on the ability of the engineer to extract good features. Therefore, representation learning approaches were proposed to automatically learn implicit useful representations or features from raw data [27]. In this way, the data representation is often trained in conjunction with the subsequent predictive task. Representation learning does not rely on expert experience, but it requires a large training data set. Compared with shallow representation learning, deep learning is a kind of deep representation learning, which tries to learn more hierarchical and more abstract representations using deep networks. As an end-to-end approach, what deep learning needs is sufficient high-quality data rather than complicated feature engineering.

Fig. 2. The comparison of four kinds of theories.


However, is deep learning always better than conventional machine learning, and is deep representation learning always better than shallower representation learning? The key factor is the amount of data available for modeling, especially labeled data [28]. Visually, the performance of an algorithm is plotted as a function of the amount of data used for a task in Fig. 3. Improvements in data availability and computational scale have been two of the biggest drivers of recent progress in machine learning, which means large enough training sets are available and large enough neural networks are trainable. For traditional learning algorithms, like support vector machines or logistic regression, performance improves for a while as more data are added. However, even as more data are accumulated after that, the performance of those algorithms usually plateaus: their learning curves flatten out, and the algorithms stop improving even as more data are given, since they do not know what to do with huge amounts of data. Nevertheless, if a small neural network (NN), which contains only a small number of hidden units/layers/parameters, is trained on the same supervised learning task, slightly better performance might be attainable. Analogously, if larger and larger NNs are trained, even better performance can be obtained. Besides, it is notable that in the regime of small training sets, the relative ordering of the algorithms is actually not very well defined. In this case, the performance of the model depends mainly on the skill of feature engineering and other algorithmic details, so it is quite possible that traditional algorithms could do better. Moreover, even if only a small amount of data is available, the transferable character of deep learning algorithms can still ensure modeling performance, since the underlying networks are relatively general as long as the data distributions are as consistent as possible [29, 30]. In contrast, in big-data regimes with very large training sets, large NNs can be seen to dominate the other approaches much more consistently. Thus, the more reliable way to improve the performance of an algorithm today is to train a bigger network and get more data.

Fig. 3. Scale drives algorithm performance.

In conclusion, the merits of deep learning techniques compared with traditional algorithms mainly lie in (i) learning representations without the requirement of knowledge or experience, and (ii) taking full advantage of huge amounts of data for performance improvements.

B. Trends of industrial processes

Industrial processes are more and more complicated and ever changing. The ever-increasing demands for profits and environmental factors have increased the complexity of industrial processes. For example, the demands for different product grades lead many chemical processes to work under multiple conditions [31, 32]. Besides, complicated process mechanisms also increase the difficulty of process modeling, as in the penicillin fermentation process, in which the microorganisms have to experience multiple growth phases [33]. Due to such causes, process industries may possess many characteristics such as nonlinearity, multimodality, etc. Therefore, it is increasingly difficult to construct monitoring or predictive models for those complex processes. In addition, changes in process characteristics or operating conditions are almost ubiquitous [34]. In chemical processes, for instance, equipment characteristics change due to catalyst deactivation, scale adhesion, preventive maintenance, and other factors. Changes in loads and feedstocks also result in process variations and deteriorate the performance of process modeling, as in the pharmaceutical industry [35]. Therefore, soft sensors have to be updated as the process characteristics change, but manual and frequent reconstruction of them should be avoided due to its heavy workload, especially in feature engineering. This trend and the corresponding issues are shown in the left part of Fig. 4.

Looking at the development of the process industry in recent years, industrial big data is another trend that cannot be ignored [36, 37]. More and more process monitoring sensors are installed to measure real-time process status (e.g., temperature, flow rate, pressure, etc.), and many data storage devices (e.g., Distributed Control Systems (DCS)) are utilized in plants and factories [38]. All of these developments make it possible to obtain large amounts of data for process modeling. At the same time, the data form has also evolved a lot [39]: for instance, from univariate to multivariate to high-dimensional [40-42]; from homogeneous data to heterogeneous datasets [43, 44]; and from static to dynamic [45, 46]. Therefore, sufficient and varied data are available, which need to be utilized efficiently to train monitoring or predictive models. This trend is shown in the right part of Fig. 4.

Fig. 4. Deep learning matches the trends of industrial processes.


In a nutshell, based on extensive literature research and to the best of our knowledge, two main trends in the development of industrial processes can be identified: (i) they are more and more complicated and ever changing, and (ii) a huge amount of process data are generated and stored. Under such circumstances, the characteristics of deep learning techniques, discussed in Section II, match these two trends well. First, deep learning can avoid complicated feature engineering and learn abstract representations automatically (Fig. 2). Second, deep learning can make full use of large amounts of data to effectively improve modeling performance (Fig. 3). These are the reasons why deep learning techniques are of great significance, and are going to be more and more significant, for soft sensor applications.

III. DEEP LEARNING MODELS AND GENERAL TRICKS

In this section, typical models and general tricks in the DL field are reviewed and summarized, including the Autoencoder (AE), Restricted Boltzmann Machine (RBM), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN).

A. Autoencoder

An Autoencoder is a system which attempts to reproduce its original input. To achieve this goal, the AE must capture the most important information that can represent the input data [47, 48]. Therefore, the code dimension is constrained to be less than the input dimension, which is also called an undercomplete AE.

Technically, the full encoding and decoding process can be represented by the following formulas:

$$h = \mathrm{encode}(x) = f_e(W_e x + b_e) \quad (3)$$

$$\tilde{x} = \mathrm{decode}(h) = f_d(W_d h + b_d) \quad (4)$$

where $x$ is the original input vector, $h$ is the feature vector after encoding, $\tilde{x}$ is the reconstructed input vector, $\{W_e, b_e\}$ and $\{W_d, b_d\}$ are the weights and biases of the encoder and decoder respectively, and $f_e(\cdot)$, $f_d(\cdot)$ are the corresponding nonlinear activation functions, like Sigmoid, Tanh, ReLU, and so on.
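As an illustration of Eqs. (3)-(4), the following is a minimal PyTorch sketch of an undercomplete AE; the layer sizes, Sigmoid activations, and mean-squared reconstruction loss are assumptions made for the example, not prescriptions from the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=10, code_dim=3):
        super().__init__()
        # Eq. (3): h = f_e(W_e x + b_e); code_dim < in_dim makes it undercomplete
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        # Eq. (4): x_tilde = f_d(W_d h + b_d)
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h)

# training minimizes the reconstruction error ||x - x_tilde||^2
model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 10)                     # a toy batch standing in for process data
loss = nn.functional.mse_loss(model(x), x)
opt.zero_grad(); loss.backward(); opt.step()
```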
Besides, AEs can be stacked to construct a deeper network, namely the Stacked Autoencoder (SAE). The learning strategy of the SAE is represented in Fig. 5; the whole process is actually one of unsupervised layer-wise training. An SAE possesses more encoding layers, so it can extract more abstract representations. The Autoencoder also has many extensions, such as the Denoising Autoencoder (DAE) [49], Sparse Autoencoder (SAE) [50, 51], Contractive Autoencoder (CAE), and so on [52].

Fig. 5. The learning strategy of SAE.

B. Restricted Boltzmann Machine

The RBM is an undirected probabilistic graphical model with one visible layer and one hidden layer. There is no connection between neurons in the same layer, which is the meaning of "restricted". The goal of the RBM is to make the output of the visible layer as close to the original input as possible, so that the hidden layer can be regarded as a different representation of the visible layer. The joint and conditional probability distributions are related to an energy function, and the detailed derivation can be found in [21]. RBMs can be trained by approximate maximum likelihood stochastic gradient descent, often involving a Markov chain Monte Carlo procedure to obtain model samples. A much more complete tutorial, with other tips and tricks, can be found in [53, 54].

The RBM has various extensions, among which are the Deep Belief Network (DBN) and the Deep Boltzmann Machine (DBM). The DBN is a hybrid graphical model involving both directed and undirected connections. Except for the top two layers, which are undirected (a pure RBM), the connections of all the other layers are directed (a Bayesian network). The DBN has multiple hidden layers, and hidden units in adjacent layers are connected. All of the local conditional probability distributions in a DBN are copied directly from those of its constituent RBMs. A DBN is pre-trained layer-wise by a fast greedy algorithm and then fine-tuned using a contrastive version of the wake-sleep algorithm [13]. A DBM, on the other hand, is an undirected graphical model with several layers, constructed to learn high-level representations of the input [55]. Generally speaking, the DBM is more robust than the DBN, but the cost is greater computational complexity, since the DBM needs to be trained jointly.
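The approximate maximum-likelihood training mentioned above is often realized with contrastive divergence (CD-1). The NumPy sketch below shows one schematic CD-1 update for a binary RBM; the sampling scheme, dimensions, and learning rate are illustrative assumptions, not details taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    v0 : (n_v,) visible input; W : (n_v, n_h); b, c : visible/hidden biases.
    """
    # positive phase: sample hidden units given the data
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step back to the visible layer and up again
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # approximate gradient of the log-likelihood
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c

# toy usage with 6 visible and 3 hidden units
n_v, n_h = 6, 3
W = 0.01 * rng.normal(size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, size=n_v).astype(float)   # one binary training sample
W, b, c = cd1_update(v, W, b, c)
```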
C. Convolutional Neural Network

The CNN is a specialized kind of neural network for processing data that have a grid-like topology, such as time-series data (a 1D grid taking samples at regular time intervals) and image data (a 2D grid of pixels). It is notable that the "convolution" here actually refers to the cross-correlation function, which is the same as convolution but without flipping the kernel:

$$S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(i+m,\, j+n)\, K(m, n) \quad (5)$$

where $I$ and $K$ denote the 2D input and the 2D kernel function respectively, the symbol $*$ denotes the convolution operation, and $i$, $j$ are the indexes in these two dimensions.

The detailed computation process of a 2D convolution case can be seen in Fig. 6. From the example, the merits of such a convolution operation can be summarized: (a) sparse interactions: the size of the kernel is much smaller than that of the input, so the interaction between the input and output is a kind of sparse connectivity, which saves a lot of time complexity compared with common fully connected networks; (b) parameter sharing: unlike the entries of the weight matrix of traditional networks, which are used only once when computing the output of a layer, every element of a kernel is used at every position of the input, so the storage requirements for parameters are reduced significantly; and (c) equivariant representations: due to parameter sharing, the result of a convolution applied to a shifted input is the same as shifting the output of the convolution of the original input.


It is because of these three features that the CNN is particularly suited to processing grid-like data [56].

Generally, the convolution is followed by a pooling operation to further adjust the output. The pooling function uses the overall statistics of the adjacent outputs at a certain location to replace the network output at that location, and no parameters need to be learned. For instance, the max pooling operation uses the maximum output to represent the corresponding rectangular region [57]. Other common pooling functions, such as the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel, are also widely used to compress the parameter space. The CNN also has many variants, such as LeNet, AlexNet, VGGNet, and so on [58-60].

Fig. 6. A 2D convolution case.
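Eq. (5) and the pooling step described above can be spelled out in a few lines of NumPy. The sketch below assumes a "valid" output size and a non-overlapping 2x2 max-pooling window; it is meant only to illustrate the operations, not any particular CNN.

```python
import numpy as np

def conv2d(I, K):
    """'Valid' 2D cross-correlation, Eq. (5): S(i,j) = sum_{m,n} I(i+m, j+n) K(m,n)."""
    H, W = I.shape
    kh, kw = K.shape
    S = np.empty((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def max_pool(S, p=2):
    """Non-overlapping p x p max pooling."""
    H, W = S.shape[0] // p * p, S.shape[1] // p * p
    return S[:H, :W].reshape(H // p, p, W // p, p).max(axis=(1, 3))

I = np.arange(16.0).reshape(4, 4)        # 2D grid input (e.g., an image patch)
K = np.array([[1.0, 0.0], [0.0, -1.0]])  # small 2D kernel
print(max_pool(conv2d(I, K)))            # feature map after convolution + pooling
```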
D. Recurrent Neural Network

The RNN is developed for processing sequential data. The basic architecture and loss computation graph of an RNN are shown in Fig. 7. The left network can be unfolded over the time sequence to get the right form. Every time step has an input, a hidden unit, and an output. Besides, recurrent connections exist between hidden units.

Fig. 7. The typical components of an RNN ($x$ is the input data in sequence form, $h$ is the hidden layer, $o$ is the output layer, $y$ is the target label, and $L$ is the loss; $U$, $V$, and $W$ are the corresponding weight matrices).

Given a specific initial state $h^{(0)}$, the RNN can propagate forward. Suppose the activation of the hidden layer is $\tanh(\cdot)$ and the output layer is fed into a softmax function to generate normalized probabilities $\hat{y}$; then the corresponding layers from $t = 1$ to $t = \tau$ can be updated according to the following formulas:

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, \quad (6)$$
$$h^{(t)} = \tanh\big(a^{(t)}\big), \quad (7)$$
$$o^{(t)} = c + V h^{(t)}, \quad (8)$$
$$\hat{y}^{(t)} = \mathrm{softmax}\big(o^{(t)}\big), \quad (9)$$

where $b$ and $c$ denote the bias vectors.

The total loss is simply the sum of the losses over all the time steps. For example, if $L^{(t)}$ is computed as the negative log-likelihood of $y^{(t)}$ given $x^{(1)}, \ldots, x^{(t)}$, then

$$L\big(\{x^{(1)}, \ldots, x^{(\tau)}\}, \{y^{(1)}, \ldots, y^{(\tau)}\}\big) = \sum_t L^{(t)} = -\sum_t \log p_{\mathrm{model}}\big(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}\big), \quad (10)$$

where $p_{\mathrm{model}}\big(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}\big)$ is given by reading the entry for $y^{(t)}$ from the model's output vector $\hat{y}^{(t)}$. The parameters are updated using back-propagation through time (BPTT) [21, 61].

The basic problem of the RNN is that gradients propagated over many stages tend to either vanish or explode, which is known as the challenge of long-term dependencies [62, 63]. Therefore, Long Short-Term Memory (LSTM) and other gated RNNs like Gated Recurrent Units (GRUs) have been proposed, which use several gate units to control the memory and forgetting behaviors of the hidden state [64-67].

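For reference, the forward recursion of Eqs. (6)-(9) and the loss of Eq. (10) can be written compactly in NumPy, as in the sketch below; the dimensions and random parameters are placeholders, and the targets are assumed to be integer class labels.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_loss(xs, ys, U, V, W, b, c, h):
    """Forward pass of Eqs. (6)-(9) and the total NLL loss of Eq. (10).

    xs : sequence of input vectors x^(1..tau); ys : integer labels y^(1..tau)
    """
    L = 0.0
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x          # Eq. (6)
        h = np.tanh(a)                 # Eq. (7)
        o = c + V @ h                  # Eq. (8)
        y_hat = softmax(o)             # Eq. (9)
        L -= np.log(y_hat[y])          # Eq. (10): negative log-likelihood
    return L

# toy dimensions: 3 inputs, 5 hidden units, 4 output classes, tau = 6 steps
rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3)); ys = rng.integers(0, 4, size=6)
L = rnn_loss(xs, ys, rng.normal(size=(5, 3)), rng.normal(size=(4, 5)),
             rng.normal(size=(5, 5)), np.zeros(5), np.zeros(4), np.zeros(5))
```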
A summary of the four main commonly used DL techniques is listed in Table 1.

Table 1. Summary of four main types of Deep Learning Models

AE - Characteristics: unsupervised; common data; learns feature representations of the input automatically. Merits: effective dimension reduction; denoising; low computational complexity. Demerits: the output layer has no practical use after training, and a high hidden dimension may lead to self-replication of the input. Main applications for soft sensor modeling: semi-supervised modeling, missing data problems, etc.

RBM - Characteristics: unsupervised; common data; probabilistic generative model. Merits: robust to ambiguous data; dimension reduction; feature extraction; collaborative filtering. Demerits: high computational complexity caused by joint parameter optimization. Main applications for soft sensor modeling: strong correlation problems, ensemble learning, etc.

CNN - Characteristics: supervised; grid-like data; local feature extractor. Merits: sparse interactions; parameter sharing; equivariant representations. Demerits: the contradiction between the dependence on network depth and the slow parameter updating of deeper networks. Main applications for soft sensor modeling: local dynamic modeling, frequency-domain processing, etc.

RNN - Characteristics: supervised; sequence data; updates parameters by BPTT. Merits: learns the relationship between different time steps. Demerits: the challenge of long-term dependence. Main applications for soft sensor modeling: dynamic modeling, etc.


E. General tricks for developing DL models

Although deep learning has huge potential, it can be very challenging to train deep models with satisfactory generalization performance efficiently. The reasons mainly lie in the overfitting and gradient vanishing problems caused by deep structures. To overcome or mitigate these issues, several tricks are helpful when training deep models.

Regularization

Regularization is an effective tool to overcome the high-variance problem, namely overfitting. A direct way is to regularize the cost function with a parameter norm penalty, such as L2 regularization. When minimizing the cost function, the parameters are then also constrained not to become too large [68].

Dataset Augmentation

Getting more data for training machine learning models is the best way to improve their generalization performance. Although it may not be easy to collect large amounts of data from real scenarios, creating new synthetic data is meaningful for some specific tasks, such as object recognition [69] and speech recognition [70]. Introducing noise into the input layer can also be regarded as a kind of data augmentation [71, 72].

Early Stopping

The training cost usually decreases at first and then increases as further learning is conducted, which signals the onset of overfitting. To avoid this problem, each time a better validation error is achieved, the parameter setting should be saved, so that it is possible to return to the point with the best performance after all training steps [73]. The early stopping strategy can thus prevent over-learning of the parameters.

Sparse Representations

Another kind of penalty constrains the activation units, which indirectly imposes a penalty on the complexity of the parameters. Similar to common regularization, a penalty term based on the activation state of the hidden units is added to the cost function. To obtain a relatively smaller cost, the probability of neuronal activation should be as small as possible [74]. Other approaches, such as KL divergence penalties or imposing a hard constraint on activation values, are also applied [75, 76].

Dropout

Dropout is a kind of ensemble-like strategy [77]. The basic principle is to remove non-output units (e.g., by multiplying their output by zero) from the base network to form several sub-networks. Every input unit and hidden unit is included according to a sampling probability, so that the randomness and diversity of the sub-models can be guaranteed. The ensemble weights are often obtained according to the probability $p(y \mid x)$ of the sub-models [78]. Another significant advantage is that there are few restrictions on the applicable model or training process. However, dropout does not work well if only a small amount of data is available [79].

Batch Normalization

Batch normalization is a method of adaptive reparameterization which aims to better train extremely deep networks [80]. During training, the parameters of the hidden layers in a deep network change constantly, which leads to the internal covariate shift problem. Generally, the global distribution gradually approaches the upper and lower limits of the value interval of the nonlinear function, so the gradients easily vanish during back-propagation. With batch normalization, the mean and the variance of each unit are standardized so as to stabilize learning, but the relationships between units and the nonlinear statistics of a single unit are allowed to change.
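Several of these tricks (the L2 parameter norm penalty via weight decay, dropout, batch normalization, and early stopping) can be combined in a few lines of PyTorch. The sketch below uses an arbitrary toy architecture and random stand-in data purely for illustration; none of the hyperparameters are recommendations from the survey.

```python
import copy
import torch
import torch.nn as nn

net = nn.Sequential(                 # a small regularized network
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(32, 1))
# weight_decay adds the L2 parameter-norm penalty to the cost function
opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

best_val, best_state, patience = float("inf"), None, 0
for epoch in range(200):
    net.train()
    x, y = torch.rand(64, 10), torch.rand(64, 1)    # stand-in training batch
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

    net.eval()                                      # early stopping on validation error
    with torch.no_grad():
        xv, yv = torch.rand(64, 10), torch.rand(64, 1)
        val = nn.functional.mse_loss(net(xv), yv).item()
    if val < best_val:
        # save the best parameter setting so we can return to it later
        best_val, best_state, patience = val, copy.deepcopy(net.state_dict()), 0
    else:
        patience += 1
        if patience >= 10:                          # stop once validation stalls
            break
net.load_state_dict(best_state)
```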
F. Frameworks for developing deep learning algorithms

To facilitate the development of deep learning algorithms, several open-source frameworks are available, which may contain state-of-the-art algorithms or well-designed underlying network elements, such as TensorFlow [81], Caffe [82], Theano [83], CNTK [84], Keras [85], PyTorch [86], and so on. A comparison of these platforms is shown in Table 2.

Table 2. Comparison of mainstream platforms

TensorFlow - The most popular deep learning framework at present, with powerful communities. However, the interface design is too arcane and the system design is too complex.

Caffe - Easy to use, with concise source code, superior performance, and fast prototyping. However, it is difficult to extend and configure.

Theano - It has a strong academic atmosphere, but there are big defects in the engineering design, and its development has now stopped.

CNTK - The performance is outstanding, and it is good for speech-related research, but the community is not active.

Keras - More like a deep learning interface; the easiest to get started with, but not flexible enough.

PyTorch - Concise, fast, easy to use, with an active community.

IV. DL APPLICATIONS FOR SOFT SENSOR MODELING

Successful development of deep learning algorithms is actually a highly iterative process, which can be summarized as in Fig. 8. For soft sensing applications, the first step is to find the demands or problems existing in real industrial processes (such as semi-supervised learning, dynamic modeling, missing data, etc.) and to come up with a new idea worth trying. The next thing that needs to be done is to code it up with open-source frameworks or toolkits. After that, the data are collected and fed into the program to obtain a result that tells the designer how well this particular algorithm or configuration works. Based on the outcome, the designer should refine the ideas and change the strategies to find a better neural network. The process is then repeated, and the scheme is improved iteratively until the ideal effect is achieved.

Fig. 8. The iterative process for developing deep learning algorithms (Idea → Code → Experiment → Idea).


To help readers learn about the state-of-the-art progress and better develop high-performance soft sensors, soft sensing applications based on deep learning techniques are reviewed here. Existing work is introduced and discussed, and factors such as motivation, strategy, and effectiveness are highlighted. The following contents are organized according to the mainstream model to which each work belongs.

A. Autoencoder based applications

The AE and its variants are widely used to construct soft sensors for semi-supervised learning and for dealing with missing data in industrial processes. Excellent performance can also be achieved by combining them with traditional machine learning algorithms.

Since the AE is an unsupervised learning model, it is often modified into a semi-supervised or supervised form so as to complete predictive tasks. For example, a semi-supervised probabilistic latent variable regression model was developed using the Variational Autoencoder (VAE) in [87]. A common way is to introduce supervision from the label variables into the encoding and decoding procedures. In [88], a Variable-wise Weighted Stacked AE (VW-SAE) was proposed, which introduces the linear Pearson coefficient between the inputs of each hidden layer and the quality labels during pre-training, so as to extract features in a semi-supervised way. Furthermore, techniques based on nonlinear relationships, like mutual information [89], were adopted to better extract feature representations. However, both linear and nonlinear relationships are artificially specified and may be inadequate or unsuitable. Thus, a relatively more intelligent and automatic way is to add the predictive loss of the quality labels to the pre-training cost [90]. Besides, other strategies can also be adopted to build connections between the hidden layers and the label values. Sun et al. used gated units to measure the contribution of the features in different hidden layers and to better control the information flows between the hidden layers and the output layer [91]. Moreover, focusing on semi-supervised scenarios with only a small number of labeled samples and an excess of unlabeled samples, a kind of double ensemble learning approach was proposed which takes both data diversity and structural diversity into account [92].

Missing data is one of the most commonly encountered problems when designing industrial soft sensors. As a variant of the autoencoder, the VAE performs well in learning data distributions and dealing with the missing data problem. For example, a generative model named VA-WGAN was proposed based on the VAE and the Wasserstein GAN; it can generate the same distributions as real data from industrial processes, which is hard to achieve with conventional regression models [93]. In [94], a VAE was employed to extract the distribution of each variable for a just-in-time modeling approach, and its effectiveness was verified on a numerical example and an industrial process. Moreover, the authors enriched the theory by proposing an output-relevant VAE for just-in-time soft sensor applications, which aims to deal with missing data [95]. Differently from the former, two kinds of VAEs were used in a new soft sensor framework which also focuses on missing data [96]. The first one, named the Supervised Deep VAE, was designed to obtain the distribution of the latent features, which was used as a prior for the second one, known as the Modified Unsupervised Deep VAE. The framework was then constructed by combining the encoder of the first with the decoder of the second, and it works well in the missing data situation.

In some cases, AEs can work better when combined with other methods or when their learning strategy is improved. For example, Yao et al. implemented a deep network of autoencoders for unsupervised feature extraction and then utilized an extreme learning machine for the regression task [97]. Wang et al. adopted the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm to optimize the weight parameters learned by an SAE, and the extracted features were then fed into a support vector regression (SVR) model for estimating the rotor deformation of air preheaters [98]. Instead of using pure data-driven models, Wang et al. combined a knowledge-based model (KDM), named the Lab model, with a data-driven model (DDM), namely a Stacked Autoencoder, and the experimental results verified that the hybrid method is superior to using only the KDM or the DDM [99]. Using an improved gradient descent algorithm, Yan et al. proposed a DAE-based method which was demonstrated to be effective compared with conventional approaches like shallow learning methods [100]. Besides, to adaptively model time-varying processes, a just-in-time fine-tuning framework was proposed for SAE-based soft sensor construction [101].

B. Restricted Boltzmann Machine based applications

Nonlinearity is a widely existing characteristic of industrial processes. Aiming at this, the RBM and its variants, especially the DBN, are generally used as unsupervised nonlinear feature extractors in industrial process modeling.

Predictors can take advantage of features learned by an RBM or DBN, and SVR and the BPNN are two common kinds of predictors. For example, to address the problem of high nonlinearity and strong correlation among the multiple variables of a coal-fired boiler process, a novel deep structure using continuous RBM (CRBM) and SVR algorithms was proposed [102]. A related work was proposed by Lian et al., which uses a DBN and SVR with improved particle swarm optimization to complete the task of rotor thermal deformation prediction [103]. In [104], a soft sensor model based on the DBN and BPNN was proposed to predict the 4-carboxybenzaldehyde concentration in the purified terephthalic acid industrial production process. Faced with the complexity of nonlinear system modeling, an improved BPNN based on the RBM was proposed in [105]; in this work, the structure of the BPNN is optimized by utilizing sensitivity analysis and mutual information theories, and the parameters are initialized by an RBM. In [106], a DBN was used to learn hierarchical features for a BPNN, which was constructed to model the relationships between the extracted features and the mill level in a ball mill production process. In addition to SVR and the BPNN, the Extreme Learning Machine (ELM) can also work as a predictor based on the features extracted by a DBN; this idea was realized in the measurement of nutrient solution composition for soilless culture [107].

To overcome the data-rich-but-information-poor problem, RBMs can be utilized for ensemble learning. For instance, Zheng et al. proposed a soft sensing framework which integrates the ensemble strategy, the DBN, and correntropy kernel regression into a unified framework [108].


Similarly, an ensemble deep kernel learning model, which adopts a DBN for unsupervised information extraction, was proposed for an industrial polymerization process [109]. In other cases, a lack of labeled samples also leads to poor information, which can be settled by semi-supervised learning using a DBN, like the work proposed in [110]. In [111], a DBN-based soft sensor was designed, focusing on labeled data scarcity, computational complexity reduction, and unsupervised feature exploitation.

RBMs have some other interesting applications as well. Graziani et al. designed a soft sensor based on a DBN for a plant process to estimate an unknown measurement delay rather than quality variables [112]. Another DBN-based model was applied to process flame images, rather than common structured data, in industrial combustion processes for oxygen content prediction [113]. And Zhu et al. investigated the selection of the DBN structure for a soft sensor application in an industrial polymerization process; compared with feedforward neural networks, the DBN-based method gives more accurate predictions of the polymer melt index [114].

C. Convolutional Neural Network based applications

CNNs are mainly utilized for processing grid-like data, especially image data. Besides, they can also be developed to capture local dynamic characteristics of industrial process data or to process signals in the frequency domain.

By processing image data, the CNN can be used to construct soft sensors. For example, Horn et al. used a CNN to extract features in froth flotation sensing, which showed good feature extraction speed and predictive performance [115]. However, images are still seldom utilized for soft sensor construction compared to common data forms.

As for dynamic problems, Yuan et al. proposed a multichannel CNN (MCNN) for soft sensing applications in an industrial debutanizer column and a hydrocracking process, which can learn the dynamics and various local correlations of different variable combinations [116]. Besides, Wang et al. used two CNN-based soft sensor models to deal with abundant process data, with the purpose of keeping complexity low while embracing the process dynamics at the same time [117]. In [118], a soft sensor using a convolutional neural network was proposed, which predicts the measurements at the next time step by extracting time-dependent correlations from a moving window.

In the frequency domain, CNNs can acquire high invariance to signal translation, scaling, and distortion. In [119], a pair of convolution and max-pooling layers was utilized at the lowest part of the network to extract high-level abstractions from the vibration spectral features of the mill bearing, and an ELM then learns a mapping from the extracted features to the mill level. In the field of aerospace engineering, a virtual sensor model with partial vibration measurements using a CNN was proposed for estimating the structural response, which is important for structural health monitoring and damage detection in operational conditions where physical sensors are limited [120].

D. Recurrent Neural Network based applications

RNNs are widely used for dynamic modeling, and various variants like LSTM are also applied in real cases.

RNN-based soft sensors have been developed to estimate variables with strong dynamic characteristics, such as the curing of epoxy/graphite fiber composites [121], the contact area that the tires of a car make with the ground [122], the indoor air quality (IAQ) in the subway [123], the melt-flow length in the injection molding process [124], biomass concentrations [125], and the product concentration of reactive distillation columns [126].

Apart from methods based on the ordinary RNN, LSTM is also a popular model in soft sensing applications; it can be deeper and more powerful, since the long-term dependence problem is weakened. For example, an LSTM-based soft sensor model was proposed to cope with the strong nonlinearity and dynamics of the process in [127]. Besides, Yuan et al. proposed a supervised LSTM network, which uses both the input and quality variables to learn dynamic hidden states, and the method was proved to be effective on a penicillin fermentation process and an industrial debutanizer column [128]. An LSTM network was also used to predict the content of nitrogen-derived components in wastewater treatment plants [129].

There are other variants designed for specific industrial applications. As an example, a two-stream network structure, which adopts the batch normalization and dropout tricks, was designed to learn diverse features of various process data [130]. In [131], another type of RNN, called the Time Delayed Neural Network (TDNN), was implemented for inferential state estimation of an ideal reactive distillation column. Besides, the Echo State Network (ESN), as a kind of RNN, was also used for soft sensing applications in the high-density polyethylene (HDPE) and purified terephthalic acid (PTA) production processes [132]; by taking advantage of singular value decomposition (SVD), the collinearity and over-fitting problems were solved. Recently, an ensemble semi-supervised model combining the SAE with a Bidirectional LSTM (BLSTM) was proposed in [133]; the new method can not only extract and utilize the temporal behavior in labeled and unlabeled data but also take into consideration the time dependency hidden in the quality metric itself. Also, a GRU-based method was proposed for automatic deep extraction of robust dynamic features in [134], and it achieves good performance on a debutanizer distillation process.

E. Other Deep Learning based applications

In addition to applications based on the above mainstream models, some other deep models are also used to solve soft sensing problems. Some typical applications are discussed in the following; the others will not be analyzed in detail here.

Semi-supervised modeling

In [135], a semi-supervised framework was constructed by integrating manifold embedding into a deep neural network (DNN), in which the manifold embedding exploits the local neighbor relationships among industrial data and improves the utilization efficiency of unlabeled data in the deep neural network. Besides, a just-in-time semi-supervised soft sensor based on the extreme learning machine was proposed in [136] to estimate online the Mooney viscosity with multiple recipes.

Dynamic modeling

Besides CNNs and RNNs, some other neural networks are used for dynamic modeling.


Graziani et al. proposed a dynamic DNN-based soft sensor to estimate the research octane number for a reformer unit in a refinery, and nonlinear finite impulse response models were investigated [137]. Wang et al. proposed a dynamic network called NARX-DNN, which can interpret the quality prediction error of validation data from different aspects and automatically determine the most appropriate delay of historical data [138]. Besides, a dynamic strategy was adopted to improve the dynamics-capturing performance of the extreme learning machine, which is combined with PLS in [139].

Data generation

Due to the harsh environment of industrial processes, directly collecting data may be difficult. Therefore, a Generative Adversarial Networks based method was proposed for data generation in [140].

Elimination of redundancy

In [141], a double least absolute shrinkage and selection operator (dLASSO) algorithm was integrated into a multilayer perceptron (MLP) network to solve two redundancy problems: input variable redundancy and model structure redundancy.

Inference and approximation

Due to their strong learning ability, deep neural networks can be used for intelligent control purposes. For example, a soft sensor based on Levenberg-Marquardt and adaptive linear networks was designed and applied to the inferential control of a multicomponent distillation process [142]. In addition, the adaptive fuzzy means algorithm was utilized to evolve a radial basis function (RBF) neural network, aimed at the approximation of an unknown system [143].

F. Summary of the existing applications

The purposes of developing DL-based novel soft sensors include feature extraction, solving missing value issues, capturing dynamic characteristics, semi-supervised modeling, and so on (as shown in Table 1). It is worth noting that only existing applications in the soft sensor field are discussed in detail, which does not mean that what has not yet appeared in the field of soft sensors is impossible. For example, although the VAE is the mainstream DL method for dealing with missing value problems in soft sensor applications, methods based on the RBM and GAN are also feasible in other fields [144, 145]. To design feasible models, different strategies have been adopted, such as optimizing the network structure, improving the training algorithm, and integrating different algorithms.

From the applications discussed in the above subsections, some points can be further summarized. Firstly, statistics on soft sensor applications using DL methods can be seen in Fig. 9, which is based on a total of 57 references discussed and cited in Section IV. From diagram (a), the trend is clear that there have been more and more algorithms based on DL theory in recent years, which reflects the increasing demand for DL models in real industrial process modeling. Moreover, compared with the three other main model types, CNN-based methods are applied less; this is because grid-like data such as images are more often used for classification rather than regression tasks. Besides, although the AE looks simpler than the other main models, it is easier to develop and extend, so it is also of great potential.

As shown in diagram (b) of Fig. 9, soft sensors based on DL theory have been constructed in many scenarios, including the chemical industry, power industry, machinery manufacturing, aerospace engineering, and so on. Among them, chemical industry applications account for the largest proportion, at about 66.7%.

The effectiveness of most of the work reviewed in this survey is verified through numerical simulation experiments (e.g., [95], [116], etc.), by using publicly available benchmark datasets (e.g., [139]), or by modeling datasets from real-world processes (e.g., [93], [94], [95], [110], [116], [123], etc.). The most common case is the third type, which can reflect the characteristics of real processes as much as possible. For example, in the chemical industry, actual run data are collected from processes like the debutanizer process [96], polymerization processes [109], and the hydrocracking process [116], to name a few. However, more detailed and specific factors need to be considered when applying those soft sensors to real scenarios.

Fig. 9. Statistics on existing relevant work: (a) publications in different years; (b) applications in different fields.

V. DISCUSSIONS AND OUTLOOK

Although deep learning has made great progress in many fields, there is still a lot of work to do to better apply the advanced methods in the soft sensor domain, especially to meet the demands of practical industrial processes. Data and structure are the two most important issues that must be considered at all times. Around these two topics, some hot research directions should receive more attention in the future.

Lack of labeled samples

Although data are easy to obtain under the trend of big data, the annotation cost is still very expensive. Therefore, we always hope that a model with good generalization ability can be trained using fewer labeled samples.

more serious imbalance between unlabeled and labeled data makes it less satisfactory. Self-supervised learning (SSL) is another feasible solution, which is a kind of unsupervised strategy [146]. Different from transfer learning [32, 33], the useful feature representations are learned from a pretext task designed on the unlabeled input data itself (not from other similar datasets). Contrastive learning is one of the most popular types of SSL, and has made great achievements in the speech, image, text, and reinforcement learning fields [147]. However, a lot of investigation and exploration work remains to be done for its soft sensing application.
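A minimal sketch of such a contrastive pretext task on unlabeled process data is given below, in the spirit of InfoNCE-style objectives [147]; the encoder, the noise-based augmentation, and the temperature are illustrative assumptions, not a published soft sensor design.

```python
import torch
import torch.nn.functional as F

# Contrastive pretext task: two noise-jittered "views" of the same sample
# should map to nearby embeddings; other samples in the batch act as
# negatives. Sizes and constants are illustrative.
encoder = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 16))

def augment(x):                      # simple augmentation: additive noise
    return x + 0.05 * torch.randn_like(x)

def info_nce(x, tau=0.1):
    z1 = F.normalize(encoder(augment(x)), dim=1)   # view-1 embeddings
    z2 = F.normalize(encoder(augment(x)), dim=1)   # view-2 embeddings
    logits = z1 @ z2.t() / tau                     # pairwise similarities
    labels = torch.arange(x.size(0))               # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: minimize info_nce over unlabeled batches, then fine-tune the
# encoder on the few labeled samples for the regression task.
loss = info_nce(torch.randn(64, 10))
```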
Hyperparameter optimization
For a long time, how to optimize the hyperparameters and structures of networks has been a difficult issue for researchers and engineers [106, 114, 141], and most of this work requires manual trial and error. To avoid the heavy workload and great randomness, meta-learning, also called "learning to learn", was proposed and investigated [148]. The motivation is to offer machines a human-like learning ability. Instead of learning a single function for a specific task, meta-learning learns a function that outputs functions for several subtasks. At the same time, many subtasks are required for meta-learning, and each subtask has its own training set and test set. After effective training, the machine can possess the ability to optimize hyperparameters, including selecting network structures, by itself. This is attractive for multimodal and changing processes.
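As a concrete sketch of the inner/outer-loop structure, the following implements a first-order meta-learning update in the spirit of Reptile, a simpler relative of MAML [148]; the model, step sizes, and task sampling are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

# First-order meta-learning sketch: adapt a copy of the model to each
# subtask, then move the meta-parameters toward the adapted ones.
meta_net = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))

def adapt(net, task_x, task_y, steps=5, lr=1e-2):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):                      # inner loop: task training set
        opt.zero_grad()
        nn.functional.mse_loss(net(task_x), task_y).backward()
        opt.step()
    return net

def meta_step(task_x, task_y, meta_lr=0.1):
    fast = adapt(copy.deepcopy(meta_net), task_x, task_y)
    with torch.no_grad():                       # outer loop: meta update
        for p, q in zip(meta_net.parameters(), fast.parameters()):
            p += meta_lr * (q - p)

# Usage: call meta_step for many subtasks (e.g., different operating
# modes); meta_net then adapts to a new mode with only a few samples.
meta_step(torch.randn(20, 10), torch.randn(20, 1))
```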
Model reliability
Deep learning methods learn features in an end-to-end way, which increases the difficulty for engineers or designers to understand what they learned and how. Besides, the dependence of the learning process on data increases the inaccuracy caused by poor data quality. Both of these factors pose a threat to the reliability of DL models. Therefore, it is important to improve model reliability, and model visualization [149, 150] and the combination with experience or knowledge [151] are two feasible ways. Model visualization helps researchers understand what has been learned, while introducing experience or knowledge helps reduce the inaccuracy brought by relying on data alone. Nevertheless, these two points need more investigation for practical industrial application.
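For instance, a common visualization recipe is to project the hidden features of a trained soft sensor into two dimensions with t-SNE [149] and color the points by the quality variable; the sketch below uses random placeholders for the learned features and labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Inspect a deep soft sensor by embedding its hidden-layer activations
# in 2-D; a well-organized map suggests a sensible representation.
features = np.random.randn(500, 32)   # placeholder for hidden features
quality = np.random.randn(500)        # placeholder for the quality label

emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=quality, s=8)
plt.colorbar(label="quality variable")
plt.title("t-SNE view of learned features")
plt.show()
```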
Distributed parallel modeling
With the trend of industrial big data discussed in Section II, how to efficiently model the process from a large amount of data is an important and urgent issue. A feasible solution is to transform the original deep learning models into distributed and parallel forms. By splitting a large data set into several small distributed blocks, data processing can be carried out simultaneously, which is conducive to large-scale data modeling [152, 153]. So far, however, there is still a long way to go.
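The divide-and-combine idea can be sketched as follows; the per-block least-squares sub-models and coefficient averaging are deliberately simplistic stand-ins for the distributed deep models of [152, 153].

```python
import numpy as np
from multiprocessing import Pool

# Split the historical data into blocks, fit a sub-model on each block
# in parallel, and combine the sub-models by averaging coefficients.
def fit_block(block):
    X, y = block
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # per-block least squares
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((100_000, 8)), rng.standard_normal(100_000)
    blocks = [(X[i::4], y[i::4]) for i in range(4)]  # 4 distributed blocks
    with Pool(4) as pool:
        thetas = pool.map(fit_block, blocks)         # fitted in parallel
    theta = np.mean(thetas, axis=0)                  # combine sub-models
    print("combined coefficients:", theta)
```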
VI. CONCLUSIONS
Deep learning techniques have shown their great potential in many fields, as well as in soft sensing. In order to summarize the past, analyze the present, and look into the future, in this work we made the following contributions to the application of deep learning theory in the field of soft sensors: (i) the merits of deep learning compared with traditional algorithms and the trends of industrial processes were discussed in detail to demonstrate the necessity and significance of deep learning algorithms for soft sensor modeling; (ii) the main DL models, tricks, and frameworks/toolkits were discussed and summarized to help readers better develop DL-based soft sensors; (iii) practical application scenarios were analyzed by reviewing and discussing existing work and publications; (iv) possible research hot spots for future work were briefly investigated.

It is our hope for this paper to serve as a taxonomy and also a tutorial of advances elucidated from a multitude of works on deep learning based soft sensors, and to provide the community with a picture of the roadmap and matters for future endeavors.

REFERENCES
[1] B. Huang and R. Kadali, Dynamic Modeling, Predictive Control and Performance Monitoring, Springer London, 2008.
[2] X. Wang, B. Huang, and T. Chen, “Multirate Minimum Variance Control Design and Control Performance Assessment: A Data-Driven Subspace Approach,” IEEE Trans. Contr. Syst. Technol., vol. 15, no. 1, pp. 65-74, 2006.
[3] Z. Chen, S. X. Ding, T. Peng, C. Yang, and W. Gui, “Fault Detection for Non-Gaussian Processes Using Generalized Canonical Correlation Analysis and Randomized Algorithms,” IEEE Trans. Ind. Electron., vol. 65, no. 2, pp. 1559-1567, 2018.
[4] Y. Jiang, S. Yin, J. Dong, O. Kaynak, “A Review on Soft Sensors for Monitoring, Control and Optimization of Industrial Processes,” IEEE Sensors Journal, 2020, doi: 10.1109/JSEN.2020.3033153.
[5] V. Venkatasubramanian, R. Rengaswamy, S. N. Kavuri, “A review of process fault detection and diagnosis: Part II: Qualitative models and search strategies,” Computers & Chemical Engineering, vol. 27, no. 3, pp. 313-326, 2003.
[6] P. Kadlec, B. Gabrys, S. Strandt, “Data-driven soft sensors in the process industry,” Comput. Chem. Eng., vol. 33, pp. 795-814, 2009.
[7] M. Kano, M. Ogawa, “The state of the art in chemical process control in Japan: good practice and questionnaire survey,” J. Process Control, vol. 20, pp. 969-982, 2010.
[8] K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 11, pp. 559-572, 1901.
[9] H. Wold, “Estimation of principal components and related models by iterative least squares,” Multivar. Anal., vol. 1, pp. 391-420, 1966.
[10] Q. Jiang, X. Yan, H. Yi and F. Gao, “Data-Driven Batch-End Quality Modeling and Monitoring Based on Optimized Sparse Partial Least Squares,” IEEE Transactions on Industrial Electronics, vol. 67, no. 5, pp. 4098-4107, May 2020, doi: 10.1109/TIE.2019.2922941.
[11] W. Yan, H. Shao, X. Wang, “Soft sensing modeling based on support vector machine and Bayesian model selection,” Comput. Chem. Eng., vol. 28, pp. 1489-1498, 2004.
[12] K. Desai, Y. Badhe, S. S. Tambe, B. D. Kulkarni, “Soft-sensor development for fed-batch bioreactors using support vector regression,” Biochem. Eng. J., vol. 27, pp. 225-239, 2006.
[13] G. Hinton, S. Osindero, Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Comput., vol. 18, no. 7, pp. 1527-1554, 2006.
[14] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” J. Mach. Learn. Res., vol. 9, pp. 249–256, 2010.
[15] Y. LeCun, Y. Bengio, G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[16] F. A. A. Souza, R. Araújo, and J. Mendes, “Review of soft sensor methods for regression applications,” Chemometrics and Intelligent Laboratory Systems, vol. 152, pp. 69-79, 2016.
[17] K. Hornik, et al. “Multilayer feedforward networks are universal approximations,” Neural Networks, vol. 2, pp. 359-366, 1989.
[18] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Control Signals Systems, vol. 2, pp. 303-314, 1989.
[19] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, pp. 251-257, 1991.
[20] K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385v1, 2015.
[21] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, vol. 1, Cambridge, MA, USA: The MIT Press, 2016.
[22] C. Grosan, A. Abraham, “Rule-Based Expert Systems,” Intelligent Systems, vol. 17, pp. 149-185, 2011.
[23] A. Ligęza, Logical Foundations for Rule-based Systems, 2nd ed., Springer, Heidelberg, 2006.
[24] J. Durkin, Expert Systems: Design and Development, Prentice Hall, New York, 1994.
[25] C. R. Turner, A. Fuggetta, L. Lavazza, A. L. Wolf, “A conceptual basis for feature engineering,” Journal of Systems and Software, vol. 49, no. 1, pp. 3-15, 1999.
[26] F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, D. Turaga, “Learning Feature Engineering for Classification,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Aug. 2017, doi: 10.24963/ijcai.2017/352.
[27] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[28] A. Ng, “Scale drives machine learning progress,” in Machine Learning Yearning, pp. 10-12. [Online]. Available: https://www.deeplearning.ai/machine-learning-yearning/
[29] S. J. Pan, Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[30] Y. Bengio, “Deep Learning of Representations for Unsupervised and Transfer Learning,” Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, pp. 17-36, 2012.
[31] W. Shao, Z. Song, and L. Yao, “Soft sensor development for multimode processes based on semisupervised Gaussian mixture models,” IFAC-PapersOnLine, vol. 51, no. 18, pp. 614–619, 2018.
[32] F. A. A. Souza and R. Araújo, “Mixture of partial least squares experts and application in prediction settings with multiple operating modes,” Chemometrics Intell. Lab. Syst., vol. 130, no. 15, pp. 192–202, 2014.
[33] H. Jin, X. Chen, L. Wang, K. Yang, and L. Wu, “Dual learning-based online ensemble regression approach for adaptive soft sensor modeling of non-linear time-varying processes,” Chemometrics Intell. Lab. Syst., vol. 151, pp. 228–244, 2016.
[34] M. Kano and K. Fujiwara, “Virtual sensing technology in process industries: trends and challenges revealed by recent industrial applications,” Journal of Chemical Engineering of Japan, 2012, doi: 10.1252/jcej.12we167.
[35] L. X. Yu, “Pharmaceutical Quality by Design: Product and Process Development, Understanding, and Control,” Pharm. Res., vol. 25, pp. 781–791, 2008, doi: 10.1007/s11095-007-9511-1.
[36] S. J. Qin, “Process Data Analytics in the Era of Big Data,” AIChE Journal, vol. 60, no. 9, pp. 3092-3100, 2014.
[37] N. Stojanovic, M. Dinic, L. Stojanovic, “Big data process analytics for continuous process improvement in manufacturing,” 2015 IEEE International Conference on Big Data, 2015, doi: 10.1109/BigData.2015.7363900.
[38] L. Yao, Z. Ge, “Big data quality prediction in the process industry: A distributed parallel modeling framework,” J. Process Contr., vol. 68, pp. 1-13, 2018.
[39] M. S. Reis and G. Gins, “Industrial Process Monitoring in the Big Data/Industry 4.0 Era: from Detection, to Diagnosis, to Prognosis,” Processes, vol. 5, no. 3, 35, 2017, doi: 10.3390/pr5030035.
[40] S. W. Roberts, “Control charts tests based on geometric moving averages,” Technometrics, vol. 1, pp. 239-250, 1959.
[41] C. A. Lowry, W. H. Woodall, C. W. Champ, C. E. Rigdon, “A multivariate exponentially weighted moving average control chart,” Technometrics, vol. 34, pp. 46–53, 1992.
[42] T. Kourti, J. F. MacGregor, “Multivariate SPC methods for process and product monitoring,” J. Qual. Technol., vol. 28, pp. 409–428, 1996.
[43] M. S. Reis, P. M. Saraiva, “Prediction of profiles in the process industries,” Ind. Eng. Chem. Res., vol. 51, pp. 4254–4266, 2012.
[44] C. Duchesne, J. J. Liu, J. F. MacGregor, “Multivariate image analysis in the process industries: A review,” Chemom. Intell. Lab. Syst., vol. 117, pp. 116-128, 2012.
[45] D. C. Montgomery, C. M. Mastrangelo, “Some statistical process control methods for autocorrelated data,” J. Qual. Technol., vol. 23, pp. 179–193, 1991.
[46] T. J. Rato, M. S. Reis, “Advantage of using decorrelated residuals in dynamic principal component analysis for monitoring large-scale systems,” Ind. Eng. Chem. Res., vol. 52, pp. 13685–13698, 2013.
[47] G. E. Hinton and J. L. McClelland, “Learning representations by recirculation,” in NIPS 1987, pp. 358–366, 1988.
[48] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[49] H. Larochelle, I. Lajoie, Y. Bengio, P. A. Manzagol, “Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371-3408, 2010.
[50] B. Schölkopf, J. Platt, T. Hofmann, “Efficient learning of sparse representations with an energy-based model,” Proceedings of Advances in Neural Information Processing Systems, pp. 1137-1144, 2006.
[51] M. A. Ranzato, Y. L. Boureau, Y. LeCun, “Sparse feature learning for deep belief networks,” Proceedings of the International Conference on Neural Information Processing Systems, vol. 20, pp. 1185-1192, 2007.
[52] A. Hassanzadeh, A. Kaarna, T. Kauranne, “Unsupervised multi-manifold classification of hyperspectral remote sensing images with contractive autoencoder,” Neurocomputing, vol. 257, pp. 67-78.
[53] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[54] G. E. Hinton, “A practical guide to training restricted Boltzmann machines,” in Neural Networks: Tricks of the Trade, Springer, Berlin, Heidelberg, pp. 599-619, 2012.
[55] G. E. Hinton, R. R. Salakhutdinov, “Deep Boltzmann machines,” J. Mach. Learn. Res., vol. 5, no. 2, pp. 1967-2006, 2009.
[56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, 2012.
[57] Y. Zhou and R. Chellappa, “Computation of optical flow using a neural network,” IEEE 1988 International Conference on Neural Networks, 1988, doi: 10.1109/ICNN.1988.23914.
[58] Y. LeCun, L. Bottou, Y. Bengio, et al. “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[59] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[60] K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[61] P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
[62] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[63] R. Pascanu, T. Mikolov, Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, pp. 1310-1318, 2013.
[64] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[65] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
[66] K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of Empirical Methods in Natural Language Processing 2014, 2014.
[67] G. Chrupala, A. Kadar, and A. Alishahi, “Learning language through pictures,” arXiv:1506.03694, 2015.
[68] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural networks architectures,” Neural Computation, vol. 7, no. 2, pp. 219-269, 1995.
[69] D. M. Montserrat, Q. Lin, J. Allebach, E. J. Delp, “Training object detection and recognition CNN models using data augmentation,” Electronic Imaging, vol. 2017, no. 10, pp. 27-36, 2017.
[70] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, vol. 117, 2013.
[71] P. Vincent, H. Larochelle, Y. Bengio, et al. “Extracting and composing robust features with denoising autoencoders,” Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103, 2008.
[72] B. Poole, J. Sohl-Dickstein, and S. Ganguli, “Analyzing noise in autoencoders and deep networks,” arXiv preprint arXiv:1406.1831, 2014.
[73] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” Advances in Neural Information Processing Systems, 2001.
[74] Z. Zhang, Y. Xu, J. Yang, X. Li, D. Zhang, “A survey of sparse representation: algorithms and applications,” IEEE Access, vol. 3, pp. 490-530, 2015.
[75] H. Larochelle, Y. Bengio, “Classification using discriminative restricted Boltzmann machines,” Proceedings of the 25th International Conference on Machine Learning, pp. 536-543, 2008.
[76] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pp. 40–44, 1993.
[77] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[78] G. E. Hinton, N. Srivastava, A. Krizhevsky, et al. “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[79] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[80] S. Ioffe, C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[81] M. Abadi, P. Barham, J. Chen, et al. “Tensorflow: A system for large-scale machine learning,” 12th Symposium on Operating Systems Design and Implementation, pp. 265-283, 2016.
[82] Y. Jia, E. Shelhamer, J. Donahue, et al. “Caffe: Convolutional architecture for fast feature embedding,” Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, 2014.
[83] F. Bastien, P. Lamblin, R. Pascanu, et al. “Theano: new features and speed improvements,” arXiv preprint arXiv:1211.5590, 2012.
[84] F. Seide, A. Agarwal, “CNTK: Microsoft's open-source deep-learning toolkit,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135-2135, 2016.
[85] A. Gulli, S. Pal, Deep Learning with Keras, Packt Publishing Ltd, 2017.
[86] A. Paszke, S. Gross, F. Massa, et al. “Pytorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, pp. 8026-8037, 2019.
[87] B. Shen, L. Yao, Z. Ge, “Nonlinear probabilistic latent variable regression models for soft sensor application: From shallow to deep structure,” Control Engineering Practice, vol. 94, 2020, doi: 10.1016/j.conengprac.2019.104198.
[88] X. Yuan, B. Huang, Y. Wang, et al. “Deep learning-based feature representation and its application for soft sensor modeling with variable-wise weighted SAE,” IEEE Transactions on Industrial Informatics, vol. 14, no. 7, pp. 3235-3243, 2018.
[89] X. Yan, J. Wang, and Q. Jiang, “Deep relevant representation learning for soft sensing,” Information Sciences, vol. 514, pp. 263-274, 2020.
[90] X. Yuan, J. Zhou, B. Huang, et al. “Hierarchical quality-relevant feature representation for soft sensor modeling: a novel deep learning strategy,” IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3721-3730, 2019.
[91] Q. Sun, Z. Ge, “Gated Stacked Target-Related Autoencoder: A Novel Deep Feature Extraction and Layerwise Ensemble Method for Industrial Soft Sensor Application,” IEEE Transactions on Cybernetics, 2020, doi: 10.1109/TCYB.2020.3010331.
[92] Q. Sun, Z. Ge, “Deep Learning for Industrial KPI Prediction: When Ensemble Learning Meets Semi-Supervised Data,” IEEE Transactions on Industrial Informatics, 2020, doi: 10.1109/TII.2020.2969709.
[93] X. Wang, H. Liu, “Data supplement for a soft sensor using a new generative model based on a variational autoencoder and Wasserstein GAN,” Journal of Process Control, vol. 85, pp. 91-99, 2020.
[94] F. Guo, R. Xie, B. Huang, “A deep learning just-in-time modeling approach for soft sensor based on variational autoencoder,” Chemometrics and Intelligent Laboratory Systems, vol. 197, 2020, doi: 10.1016/j.chemolab.2019.103922.
[95] F. Guo, W. Bai, B. Huang, “Output-relevant Variational autoencoder for Just-in-time soft sensor modeling with missing data,” Journal of Process Control, vol. 92, pp. 90-97, 2020.
[96] R. Xie, N. M. Jan, K. Hao, et al. “Supervised Variational Autoencoders for Soft Sensor Modeling with Missing Data,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2820-2828, 2019.
[97] L. Yao, Z. Ge, “Deep learning of semisupervised process data with hierarchical extreme learning machine and soft sensor application,” IEEE Transactions on Industrial Electronics, vol. 65, no. 2, pp. 1490-1498, 2017.
[98] X. Wang, H. Liu, “Soft sensor based on stacked auto-encoder deep neural network for air preheater rotor deformation prediction,” Advanced Engineering Informatics, vol. 36, pp. 112-119, 2018.
[99] X. Wang and H. Liu, “A Knowledge- and Data-Driven Soft Sensor Based on Deep Learning for Predicting the Deformation of an Air Preheater Rotor,” IEEE Access, vol. 7, pp. 159651-159660, 2019.
[100] W. Yan, D. Tang and Y. Lin, “A Data-Driven Soft Sensor Modeling Method Based on Deep Learning and its Application,” IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 4237-4245, May 2017, doi: 10.1109/TIE.2016.2622668.
[101] Y. Wu, D. Liu, X. Yuan and Y. Wang, “A just-in-time fine-tuning framework for deep learning of SAE in adaptive data-driven modeling of time-varying industrial processes,” IEEE Sensors Journal, doi: 10.1109/JSEN.2020.3025805.
[102] W. Fan, F. Si, S. Ren, et al. “Integration of continuous restricted Boltzmann machine and SVR in NOx emissions prediction of a tangential firing boiler,” Chemometrics and Intelligent Laboratory Systems, vol. 195, 2019, doi: 10.1016/j.chemolab.2019.103870.
[103] P. Lian, H. Liu, X. Wang, et al. “Soft sensor based on DBN-IPSO-SVR approach for rotor thermal deformation prediction of rotary air-preheater,” Measurement, vol. 165, 2020, doi: 10.1016/j.measurement.2020.108109.
[104] R. Liu, Z. Rong, B. Jiang, Z. Pang and C. Tang, “Soft Sensor of 4-CBA Concentration Using Deep Belief Networks with Continuous Restricted Boltzmann Machine,” 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), Nanjing, China, pp. 421-424, 2018, doi: 10.1109/CCIS.2018.8691166.
[105] J. Qiao, L. Wang, “Nonlinear system modeling and application based on restricted Boltzmann machine and improved BP neural network,” Applied Intelligence, 2020, doi: 10.1007/s10489-019-01614-1.
[106] M. Lu, Y. Kang, X. Han and G. Yan, “Soft sensor modeling of mill level based on Deep Belief Network,” The 26th Chinese Control and Decision Conference (2014 CCDC), Changsha, pp. 189-193, 2014, doi: 10.1109/CCDC.2014.6852142.
[107] X. Wang, W. Hu, K. Li, L. Song and L. Song, “Modeling of Soft Sensor Based on DBN-ELM and Its Application in Measurement of Nutrient Solution Composition for Soilless Culture,” 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), Chongqing, China, pp. 93-97, 2018, doi: 10.1109/IICSPI.2018.8690373.
[108] S. Zheng, K. Liu, Y. Xu, et al. “Robust soft sensor with deep kernel learning for quality prediction in rubber mixing processes,” Sensors, vol. 20, no. 3, 2020, doi: 10.3390/s20030695.
[109] Y. Liu, C. Yang, Z. Gao, et al. “Ensemble deep kernel learning with application to quality prediction in industrial polymerization processes,” Chemometrics and Intelligent Laboratory Systems, vol. 174, pp. 15-21, 2018.
[110] C. Shang, F. Yang, D. Huang, et al. “Data-driven soft sensor development based on deep learning technique,” Journal of Process Control, vol. 24, no. 3, pp. 223-233, 2014.
[111] S. Graziani and M. G. Xibilia, “Deep Learning for Soft Sensor Design,” in Development and Analysis of Deep Learning Architectures, Springer, Cham, pp. 31-59, 2020.
[112] S. Graziani and M. G. Xibilia, “Design of a Soft Sensor for an Industrial Plant with Unknown Delay by Using Deep Learning,” 2019 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Auckland, New Zealand, pp. 1-6, 2019, doi: 10.1109/I2MTC.2019.8827074.
[113] Y. Liu, Y. Fan, J. Chen, “Flame images for oxygen content prediction of combustion systems using DBN,” Energy & Fuels, vol. 31, no. 8, pp. 8776-8783, 2017.
[114] C. H. Zhu, J. Zhang, “Developing Soft Sensors for Polymer Melt Index in an Industrial Polymerization Process Using Deep Belief Networks,” International Journal of Automation and Computing, vol. 17, no. 1, pp. 44-54, 2020.
[115] Z. C. Horn, et al. “Performance of convolutional neural networks for feature extraction in froth flotation sensing,” IFAC-PapersOnLine, vol. 50, no. 2, pp. 13-18, 2017.
[116] X. Yuan, S. Qi, Y. Shardt, et al. “Soft sensor model for dynamic processes based on multichannel convolutional neural network,” Chemometrics and Intelligent Laboratory Systems, 2020, 104050.
[117] K. Wang, C. Shang, L. Liu, et al. “Dynamic soft sensor development based on convolutional neural networks,” Industrial & Engineering Chemistry Research, vol. 58, no. 26, pp. 11521-11531, 2019.
[118] W. Zhu, et al. “Deep learning based soft sensor and its application on a pyrolysis reactor for compositions predictions of gas phase components,” Computer Aided Chemical Engineering, Elsevier, vol. 44, pp. 2245-2250, 2018.
[119] J. Wei, L. Guo, X. Xu and G. Yan, “Soft sensor modeling of mill level based on convolutional neural network,” The 27th Chinese Control and Decision Conference (2015 CCDC), Qingdao, pp. 4738-4743, 2015, doi: 10.1109/CCDC.2015.7162762.
[120] S. Sun, Y. He, S. Zhou, et al. “A data-driven response virtual sensor technique with partial vibration measurements using convolutional neural network,” Sensors, vol. 17, no. 12, 2017, doi: 10.3390/s17122888.
[121] H. B. Su, L. T. Fan, J. R. Schlup, “Monitoring the process of curing of epoxy/graphite fiber composites with a recurrent neural network as a soft sensor,” Engineering Applications of Artificial Intelligence, vol. 11, no. 2, pp. 293-306, 1998.
[122] C. A. Duchanoy, M. A. Moreno-Armendáriz, L. Urbina, et al. “A novel recurrent neural network soft sensor via a differential evolution training algorithm for the tire contact patch,” Neurocomputing, vol. 235, pp. 71-82, 2017.
[123] J. Loy-Benitez, S. K. Heo, C. K. Yoo, “Soft sensor validation for monitoring and resilient control of sequential subway indoor air quality through memory-gated recurrent neural networks-based autoencoders,” Control Engineering Practice, vol. 97, 104330, 2020.
[124] X. Chen, F. Gao, G. Chen, “A soft-sensor development for melt-flow-length measurement during injection mold filling,” Materials Science and Engineering: A, vol. 384, no. 1-2, pp. 245-254, 2004.
[125] L. Z. Chen, S. K. Nguang, X. M. Li, et al. “Soft sensors for on-line biomass measurements,” Bioprocess and Biosystems Engineering, vol. 26, no. 3, pp. 191-195, 2004.
[126] G. Kataria, K. Singh, “Recurrent neural network based soft sensor for monitoring and controlling a reactive distillation column,” Chemical Product and Process Modeling, vol. 13, no. 3, 2017, doi: 10.1515/cppm-2017-0044.
[127] W. Ke, D. Huang, F. Yang and Y. Jiang, “Soft sensor development and applications based on LSTM in deep neural networks,” 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, pp. 1-6, 2017, doi: 10.1109/SSCI.2017.8280954.
[128] X. Yuan, L. Li and Y. Wang, “Nonlinear Dynamic Soft Sensor Modeling with Supervised Long Short-Term Memory Network,” IEEE Transactions on Industrial Informatics, vol. 16, no. 5, pp. 3168-3176, May 2020, doi: 10.1109/TII.2019.2902129.
[129] I. Pisa, I. Santín, J. L. Vicario, et al. “ANN-based soft sensor to predict effluent violations in wastewater treatment plants,” Sensors, vol. 19, no. 6, 1280, 2019.
[130] R. Xie, K. Hao, B. Huang, L. Chen and X. Cai, “Data-Driven Modeling Based on Two-Stream λ Gated Recurrent Unit Network with Soft Sensor Application,” IEEE Transactions on Industrial Electronics, vol. 67, no. 8, pp. 7034-7043, Aug. 2020, doi: 10.1109/TIE.2019.2927197.
[131] S. R. V. Raghavan, T. K. Radhakrishnan, K. Srinivasan, “Soft sensor based composition estimation and controller design for an ideal reactive distillation column,” ISA Transactions, vol. 50, no. 1, pp. 61-70, 2011.
[132] Y. L. He, Y. Tian, Y. Xu, et al. “Novel soft sensor development using echo state network integrated with singular value decomposition: Application to complex chemical processes,” Chemometrics and Intelligent Laboratory Systems, vol. 200, 103981, 2020, doi: 10.1016/j.chemolab.2020.103981.
[133] X. Yin, Z. Niu, Z. He, et al. “Ensemble deep learning based semi-supervised soft sensor modeling method and its application on quality prediction for coal preparation process,” Advanced Engineering Informatics, vol. 46, 101136, 2020.
[134] X. Zhang and Z. Ge, “Automatic Deep Extraction of Robust Dynamic Features for Industrial Big Data Modeling and Soft Sensor Application,” IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4456-4467, July 2020, doi: 10.1109/TII.2019.2945411.
[135] W. Yan, R. Xu, K. Wang, et al. “Soft Sensor Modeling Method Based on Semisupervised Deep Learning and Its Application to Wastewater Treatment Plant,” Industrial & Engineering Chemistry Research, vol. 59, no. 10, pp. 4589-4601, 2020.
[136] W. Zheng, Y. Liu, Z. Gao, et al. “Just-in-time semi-supervised soft sensor for quality prediction in industrial rubber mixers,” Chemometrics and Intelligent Laboratory Systems, vol. 180, pp. 36-41, 2018.
[137] S. Graziani, M. G. Xibilia, “Deep structures for a reformer unit soft sensor,” 2018 IEEE 16th International Conference on Industrial Informatics (INDIN), IEEE, pp. 927-932, 2018.
[138] K. Wang, C. Shang, F. Yang, Y. Jiang and D. Huang, “Automatic hyper-parameter tuning for soft sensor modeling based on dynamic deep neural network,” 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, pp. 989-994, 2017, doi: 10.1109/SMC.2017.8122739.
[139] Y. He, Y. Xu, and Q. Zhu, “Soft-sensing model development using PLSR-based dynamic extreme learning machine with an enhanced hidden layer,” Chemometrics and Intelligent Laboratory Systems, vol. 154, pp. 101-111, 2016.
[140] X. Wang, “Data Preprocessing for Soft Sensor Using Generative Adversarial Networks,” 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, pp. 1355-1360, 2018, doi: 10.1109/ICARCV.2018.8581249.
[141] Y. Fan, B. Tao, Y. Zheng and S. Jang, “A Data-Driven Soft Sensor Based on Multilayer Perceptron Neural Network with a Double LASSO Approach,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 7, pp. 3972-3979, July 2020, doi: 10.1109/TIM.2019.2947126.
[142] A. Rani, V. Singh, J. R. P. Gupta, “Development of soft sensor for neural network based control of distillation column,” ISA Transactions, vol. 52, no. 3, pp. 438-449, 2013.
[143] A. Alexandridis, “Evolving RBF neural networks for adaptive soft-sensor design,” International Journal of Neural Systems, vol. 23, no. 6, 1350029, 2013.
[144] M. D. Zeiler, et al. “Modeling pigeon behavior using a Conditional Restricted Boltzmann Machine,” ESANN, 2009.
[145] Y. Luo, et al. “Multivariate time series imputation with generative adversarial networks,” Advances in Neural Information Processing Systems, 2018.
[146] L. Jing and Y. Tian, “Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, doi: 10.1109/TPAMI.2020.2992393.
[147] A. Oord, Y. Li, O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[148] C. Finn, P. Abbeel, S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[149] L. van der Maaten, G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579-2605, Nov. 2008.
[150] M. D. Zeiler, R. Fergus, “Visualizing and understanding convolutional networks,” European Conference on Computer Vision, Springer, Cham, pp. 818-833, 2014.
[151] S. Kabir, R. U. Islam, M. S. Hossain, et al. “An Integrated Approach of Belief Rule Base and Deep Learning to Predict Air Pollution,” Sensors, vol. 20, no. 7, 1956, 2020.
[152] Q. Jiang, S. Yan, H. Cheng and X. Yan, “Local-Global Modeling and Distributed Computing Framework for Nonlinear Plant-Wide Process Monitoring with Industrial Big Data,” IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2020.2985223.
[153] Z. Yang, Z. Ge, “Monitoring and Prediction of Big Process Data with Deep Latent Variable Models and Parallel Computing,” Journal of Process Control, vol. 92, pp. 19-34, 2020.

Qingqiang Sun received the B.Eng. degree in Electrical Engineering and Automation from Xiamen University, Xiamen, China, in 2017, and the M.Eng. degree from the Department of Control Science and Engineering, Zhejiang University, Hangzhou, China, in 2020.
His research interests include data-based modeling, process data deep learning, and soft sensing.
Zhiqiang Ge (M'13-SM'17) received the B.Eng. and Ph.D. degrees in Automation from the Department of Control Science and Engineering, Zhejiang University, Hangzhou, China, in 2004 and 2009, respectively.
He was a Research Associate with the Department of Chemical and Biomolecular Engineering, Hong Kong University of Science and Technology from Jul. 2010 to Dec. 2011, and a visiting Professor with the Department of Chemical and Materials Engineering, University of Alberta from Jan. 2013 to May 2013. Dr. Ge was an Alexander von Humboldt research fellow with the University of Duisburg-Essen from Nov. 2014 to Jan. 2017, and also a JSPS invitation Fellow with Kyoto University from Jun. 2018 to Aug. 2018. He is currently a Full Professor with the College of Control Science and Engineering, Zhejiang University. His research interests include industrial big data, process monitoring, soft sensors, data-driven modeling, machine intelligence, and knowledge automation.