
IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 36, NO. 2, MAY 2023

Effective Variational-Autoencoder-Based Generative Models for Highly Imbalanced Fault Detection Data in Semiconductor Manufacturing

Shu-Kai S. Fan, Member, IEEE, Du-Ming Tsai, and Pei-Chi Yeh

Abstract—In current semiconductor manufacturing, the limited raw trace data pertaining to defective wafers make fault detection (FD) assignments extremely difficult due to the data imbalance in wafer classification. To mitigate this problem, this paper proposes using a variational autoencoder (VAE) as a data augmentation strategy for resolving the data imbalance of temporal raw trace data. A VAE is first trained with the few available defective samples. By extracting the latent variables that characterize the distribution of the defective samples, we make use of the statistical randomness of the latent variables to generate synthesized defective samples via the decoder of the trained VAE. Two data representations and VAE modeling strategies, concatenation of multiple raw trace data and individual raw trace data as the input of the VAE during the training stage, are investigated. A real-data plasma-enhanced chemical vapor deposition (PECVD) process having only a few defective samples is used to illustrate the performance enhancement in wafer classification arising from the proposed data augmentation framework. Based on the computational comparisons between noted classification models, the proposed generative VAE model via the individual strategy enables the adaptive boosting (AdaBoost) classifier to achieve perfect performance in every metric when the 80% and 100% over-sampling ratios are adopted.

Index Terms—Variational autoencoder (VAE), data augmentation, wafer classification, semiconductor manufacturing, fault detection.

Manuscript received 21 June 2022; revised 14 November 2022; accepted 16 January 2023. Date of publication 3 February 2023; date of current version 5 May 2023. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST-111-2221-E-027-070-MY3. (Corresponding author: Shu-Kai S. Fan.)
Shu-Kai S. Fan is with the Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei 106344, Taiwan (e-mail: morrisfan@ntut.edu.tw).
Du-Ming Tsai is with the Department of Industrial Engineering and Management, Yuan Ze University, Taoyuan 32003, Taiwan.
Pei-Chi Yeh is with the Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei 106344, Taiwan, and also with the RD Process Center, Taiwan Semiconductor Manufacturing Company Ltd., Hsinchu 308, Taiwan.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TSM.2023.3238555.
Digital Object Identifier 10.1109/TSM.2023.3238555
0894-6507 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

THE SEMICONDUCTOR industry has long turned into an important core driving force behind the development of innovative high-tech applications and consumer electronics. In semiconductor manufacturing, fault detection and classification (FDC) plays a central role in the paradigm of extended advanced process control (eAPC) [1], [2], [3]. The construction of an effective wafer classification model promises immediate action to help the process engineer with FDC tasks before metrology data are taken. In the recent literature, defective wafer detection and classification have been considerably addressed by investigating the sensor data of the process equipment in semiconductor fabrication plants [4], [5], [6], [7], [8], [9]. In common semiconductor practice, the sensor data are also known as the raw trace data of status variable identification (SVID). The raw trace data contain a wide variety of temporal, time-indexed sequences of measurements collected in situ from the sensors installed in the process equipment, where each sensor corresponds to a specific SVID. Technically speaking, an SVID can be, in essence, quantitative or qualitative according to the purpose of the sensor design. Typically, qualitative SVIDs are related to the wafer count, time stamp, chip tally, etc.; quantitative SVIDs are related to the processing variables in the tool that can be measured in the metric system, like chamber inner heater zone power, chamber pressure reading, chamber outer heater current, temperature power, etc. In a word, fault detection (FD) is performed at an early stage of the processing steps, attempting to detect defective wafers (or process excursions) without recourse to the metrology system and to prevent defective wafers from going downstream in the pipeline.

A. Literature Review

Process engineers used to conduct FD assignments in terms of univariate statistics (also known as UVAs), such as the mean, standard deviation, range, maximum, minimum, slope, skewness, kurtosis, and coefficient of variation, applied to every individual pre-defined processing step. However, if non-key SVIDs and/or non-key processing steps are selected for monitoring, an unexpectedly high false positive rate (i.e., type I error) or false negative rate (i.e., type II error) will result [10]. Toward this end, the entire nonlinear SVID profile of raw trace data is monitored instead for the purpose of safeguarding against yield loss [6], [7], [8], [9]. In the statistical process control (SPC) literature, profile monitoring provides an alternative to the FD tasks in the electronics manufacturing industry [11], [12], [13].

As the manufacturing management technology (MMT) in wafer fabrication advances, a high yield of 95% or even beyond has become standard practice. Machine and deep learning modeling often suffers from the difficulty of class imbalance in the FD tasks. The class imbalance arises from the

minority class, pertaining to the defective wafer type, the misclassification of which certainly leads to a critical yield loss in the subsequent processing steps. Data augmentation is considered a practical remedy for dealing with the class imbalance in supervised learning. In the open literature, research on data augmentation has focused on sampling technique design [14], generative adversarial networks (GANs) [15], data representations using autoencoders [16], etc. Data augmentation in wafer defect map classification has also received considerable attention in the recent semiconductor literature [17], [18], [19], [20]. To date, surprisingly few studies have focused on the data augmentation of raw trace data under the circumstances where high class imbalance is taking place for the FD tasks. Namely, the wafers being processed in a tool are to be identified as normal or abnormal, in terms of the raw trace data, without recourse to the metrology system. A viable alternative to FD under the class imbalance scenario can rest on the construction of a one-class model via an autoencoder or a one-class support vector machine (OCSVM) [9]. However, one-class models without ad hoc analysis are typically subject to a high false positive rate. Abnormal wafers rarely happen in modern semiconductor fabrication plants due to the high-yield production, so how to develop an effective data augmentation strategy for FD is the major motivation behind this research.

The GAN framework has emerged as a powerful tool for data augmentation but is better suited to various image and video synthesis tasks. Therefore, variational-autoencoder-based generative models are proposed over GAN-based models in this paper, where FD data augmentation is investigated, as will be elaborated shortly.

B. Motivation and Problem Statement

Epitaxy is an extension of a thin film grown on a single-crystal substrate: a continuum of a single-crystal structure formed by the addition of atoms on the single-crystal substrate. The growth of the vapor-phase crystalline silicon layer is called vapor phase epitaxy. In the process, a substrate wafer is used as a seed crystal. Among various epitaxial growth methods, vapor phase epitaxial growth is currently the most crucial method for growing silicon devices. Vapor phase epitaxy can be divided into two technologies: physical vapor deposition (PVD) and chemical vapor deposition (CVD). The former uses physical phenomena, while the latter mainly uses chemical reactions to deposit thin films. However, the application of PVD is typically limited to the deposition of thin metal films. All thin films required for semiconductor devices, including semiconductors, conductors and dielectrics, can be prepared by CVD.

In this paper, the raw trace data of plasma-enhanced chemical vapor deposition (PECVD) for FD assignments will be used to illustrate the proposed generative model for data augmentation, where the problem of high class imbalance arises. PECVD is an important processing tool for 300mm fabrication plants, where the deposition of thin films of various materials takes place at a relatively lower temperature than that of thermal CVD. PECVD rests on three fundamentals: (i) plasma creation by ionization of atoms and molecules via the parallel electrode plates, (ii) a pressure control system for maintaining the plasma state by means of the gas inlet and pump, and (iii) the chamber for a reaction through a pre-defined period of deposition time, as pictorially shown in Fig. 1. At the outset, the process begins by introducing reactive gases between the parallel electrode plates. By applying a medium radio frequency (RF) power supply between the electrodes, the reactive gases are excited into plasma and the chemical reaction starts.

Fig. 1. Plasma enhanced chemical vapor deposition (PECVD).

Plasma is a completely or partially ionized gas, or a mixture of charged particles, neutral atoms, and electrons, which can be produced either by heating the gas or by means of electrical energy. Hence, the plasma state is a high-energy condition, but it is neutral in that the overall charge is approximately zero. As the CVD precursors having high vapor pressure are introduced into the PECVD chamber, the plasma causes the dissociation and activation of the precursor, therefore allowing the deposition of the film at a lower temperature. Advantages of PECVD include a high deposition rate and a low deposition temperature; both organic and inorganic materials can be used as CVD precursors.

Particularly in PECVD, temperature, pressure, gas flow rate and RF power/impedance are critically important types of SVIDs. For instance, process pressure, chamber pressure, transfer module pressure, load-lock pressure, fore-line pressure and exhaust pressure are often monitored for FDC purposes. A real-world PECVD processing dataset will be discussed in Section III, where the data of abnormal wafers are rarely available in the examined case. For such FD scenarios, data augmentation is warranted.

II. GENERATIVE MODELS OF DATA AUGMENTATION BASED ON VARIATIONAL AUTOENCODER (VAE)

In this section, the data format of the PECVD process investigated in this paper is first defined. To cope with the class imbalance problem in the FD practice of semiconductor manufacturing, the proposed generative models using variational autoencoders are addressed. Upon the completion of the model building, how to generate the synthesized FDC profile data is discussed. Lastly, two different data augmentation strategies incorporated in the proposed generative models are introduced.

A. Representation of Raw Trace Data

Let Xijk denote the raw trace data matrix of the PECVD process that consists of N wafers (i = 1, 2, . . . , N). There are M key SVIDs being measured per wafer for j = 1, 2, . . . , M and T measurements being collected per SVID for k = 1, 2, . . . , T. The dataset Xijk of size N contains N1 normal wafers as the majority class and N2 abnormal wafers as the minority class. Under the circumstances investigated in this paper, the class imbalance occurs, i.e., N2 ≪ N1, where the number of the majority class greatly outnumbers that of the minority class.

B. Variational Autoencoder (VAE)

In common practice, highly imbalanced data apparently pose an undue difficulty in machine learning [14], [15], [16], [17], [18], [21]. Variational autoencoders (VAEs), an important type of deep generative model proposed by Kingma and Welling [22], are used here for the purposes of data augmentation. In a word, VAEs are considered a regularized version of autoencoders whose encoding distribution is particularly constrained during the learning process to ensure that the embedded latent variables own inherent properties for generative modeling. VAEs utilize the concept of variational inference to reduce the high-dimensional data into a latent vector on condition that a multivariate distribution is presumed as a prior for that latent vector.

A VAE is an unsupervised learning model that consists of encoder and decoder structures to reconstruct the input data as a procedure of data representation. The encoder structure reduces the dimensionality and extracts the information from the input data into the middle hidden layer, also known as the latent space. The other part of the VAE is the decoder structure, which reconstructs the extracted information from the latent vector back to the original input data. Unlike the traditional autoencoder architecture, the VAE model uses a continuous latent space, in a probabilistic sense, to reconstruct data as close to the input as possible. A general architecture of a VAE is schematically shown in Fig. 2.

Fig. 2. General architecture of a VAE model.

As shown in the figure, the key issue reduces to the estimation of the log-likelihood and posterior density functions in the deep latent-variable model while the posterior inference is intractable. The recognition model qφ(z|x) serves as a neural network encoder where a latent vector z is generated according to

z ∼ N(µ(x), Σ(x))  (1)

given an input vector x, a tuple of Xijk with respect to a particular wafer i. The latent vector z is therefore employed to generate a sample from the likelihood density pθ(x|z), also known as the neural network decoder or the generative model. Note that the probability density qφ(z|x) serves as an approximation of the actual posterior density pθ(z|x) of the generative model; the variational parameters φ and the generative parameters θ are jointly optimized.

In terms of the evidence lower bound (ELBO), the loss function of the VAE model, which includes two components, the latent loss and the reconstruction loss, can be shown as follows:

Lφ,θ = −DKL(qφ(z|x) ‖ pθ(z)) + Ez∼q(z|x)[log pθ(x|z)]  (2)

where the first term represents the Kullback–Leibler (KL) divergence, ensuring that the learned distribution qφ is similar to the prior distribution pθ(z). The second term represents the reconstruction likelihood. The convenient choice of the prior distribution is

pθ(z) ∼ N(0, I)  (3)

For a detailed account, see the fundamentals and mathematical derivations in [22], [23], [24]. Upon the completion of the trained VAE models, a standard normal random variable is sampled from N(0, I), serving as the input to the trained decoder pθ(x|z) (see Fig. 2) for the purpose of generating the synthesized data.

To sum up, the main objectives of the proposed generative models in this paper are multifold: (i) train the VAE model with only a few defective wafers to construct the data representation of the raw trace data of defective wafers; (ii) with the distributional randomness of the latent variables in the trained VAE model, use the generative model to synthesize the raw trace data of defective wafers; (iii) make use of the generated samples as additional training data to alleviate the imbalance of the original raw trace data when a classification model is warranted; (iv) use the trained classification models with data augmentation to compare the performances of different classification models as a wide variety of augmentation ratios are assessed.

C. Data Augmentation Strategies of Raw Trace Data

Two data augmentation strategies are proposed in this paper. Suppose that there are M key SVIDs to be taken into account. The first strategy is to construct M individual VAE models independently, one per SVID, which are used to generate synthesized raw trace profiles in relation to abnormal wafers. Specifically, a raw trace vector x of size T × 1 serves as the input to every individual VAE model during the training process, as expressed by:

xT_ij = [xij1, xij2, . . . , xijT]1×T, i = 1, 2, . . . , n2  (4)

for the j-th SVID VAE model, where n2 ∈ N2, a sample subset of the minority class used for data augmentation and model building in the training process.

The second strategy is to construct an overall VAE model by concatenating the raw trace data of the M SVIDs in series as the input vector of size (T × M) × 1, as shown by:

xT_i = [xi11, xi12, . . . , xi1T, . . . , xiM1, xiM2, . . . , xiMT]1×(T×M)  (5)

for a single VAE model including all M SVIDs, where n2 is defined as in (4). Henceforth, the former is referred to as "the individual strategy"; the latter is referred to as "the concatenation strategy".

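As a concrete illustration of (1)-(3), the latent-loss term of the VAE objective and the latent sampling step can be sketched in a few lines of NumPy under the common diagonal-Gaussian encoder assumption. The dimensions and values here are hypothetical; in practice a deep-learning framework would carry out this computation inside the training loop.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ): the latent-loss
    # term of the ELBO in (2) for a diagonal-Gaussian encoder.
    return 0.5 * np.sum(np.square(mu) + np.exp(log_var) - log_var - 1.0)

def sample_latent(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): a draw according to (1) that
    # the trained decoder p_theta(x|z) turns into a synthesized profile.
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(20)        # a 20-dimensional latent layer, as in Section III-B
log_var = np.zeros(20)   # sigma = 1, so the encoder matches the prior N(0, I)
print(kl_to_standard_normal(mu, log_var))   # -> 0.0 (encoder equals the prior)
z = sample_latent(mu, log_var, rng)         # latent draw fed to the decoder
```

When the encoder distribution coincides with the prior, the KL term vanishes, which is exactly the behavior the first term of (2) rewards during training.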
III. INVESTIGATION OF REAL PROCESS DATA AND MODEL VALIDATION

In this section, a real PECVD process dataset is investigated. The fundamentals of PECVD are already detailed in Sections I and II. The PECVD process data were collected from a 12-in fabrication plant of a leading global semiconductor foundry company in Taiwan. The enabling IC processing technologies and manufacturing solutions include logic-signal, embedded high-voltage, embedded non-volatile-memory, etc. The fabrication plant under study is located in the Southern Taiwan Science Park in Tainan.

The PECVD tool contains two chambers, and only one chamber with an FD data collection of 965 wafers, 919 normal and 46 abnormal, is considered in this experiment. The percentage of abnormal wafers is only about 4.7%, dictating a high class imbalance. Among 66 SVIDs, SVIDs 25-32 were identified as the key SVIDs by the process engineers [8]. The FD signal is collected approximately once per second from each sensor, and there are 309 measurements for each SVID. In this study, N = 965, N1 = 919, N2 = 46, M = 8 and T = 309.

Fig. 3. FDC profiles of eight key SVIDs.

The downsizing strategy for standardizing the raw trace data is used in this study; the minimum sampled processing time in this chamber is T = 309. Prior to performing the formal data analysis, the sensor readings of every wafer need to be pre-processed because the measurement time points of the wafers are not synchronized, e.g., 1.1 seconds, 2.3 seconds, etc. In this paper, all the FD data in every wafer per key parameter are synchronized to the nearest integer second, i.e., 1, 2, . . . , 309. Meanwhile, the corresponding SVID value is approximated by interpolation.
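The synchronization step just described can be sketched with linear interpolation onto the integer-second grid. The time stamps and sensor readings below are hypothetical stand-ins for one SVID of one wafer; the paper does not specify the interpolation scheme, so linear interpolation is an assumption.

```python
import numpy as np

# Hypothetical, unsynchronized time stamps (seconds) and sensor readings.
t_raw = np.array([0.0, 1.1, 2.3, 3.4, 4.6])
v_raw = np.array([10.0, 10.4, 10.9, 11.3, 11.8])

# Resample onto the integer-second grid 1, 2, ..., T by linear interpolation,
# so every wafer shares the same T time points per SVID.
t_grid = np.arange(1, 5)
v_sync = np.interp(t_grid, t_raw, v_raw)
print(v_sync.round(3))
```

Applying this per wafer and per key SVID yields the common T = 309 grid assumed by the matrix representation of Section II-A.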
To have a quick look at the FD data, the temporal profiles of the eight key SVIDs are pictorially demonstrated in Fig. 3. As can be clearly seen from the figure, the FD profiles of the normal and abnormal wafers are nearly indistinguishable. The data profiles of normal wafers are placed in the background and those of abnormal wafers in the foreground, making the diminutive discrepancies easier to spot. These figures show visually how SVIDs 25-32 depict the differentiation between normal and abnormal wafers.

Before proceeding to the data augmentation phase, the dataset of 965 samples needs to be split appropriately for an initial test on wafer classification. The data of 665 normal and 15 abnormal wafers are randomly selected as the training dataset to train various machine learning models, where these 15 (i.e., n2) abnormal samples are also used for the data augmentation to be carried out in a later section. The data of 254 normal and 31 abnormal wafers are used as the test dataset to test the trained machine learning models. The configuration of the raw trace data arrangement is pictorially shown in Fig. 4.

Fig. 4. Training and test datasets of the raw trace data.

To form a basis for the performance evaluation, four selected classification models, random forests, AdaBoost, bootstrap aggregating (bagging) and XGBoost, are trained and tested in terms of the original dataset without data augmentation. Since the data size in this study is not large (less than one thousand), only supervised aggregation-based machine learning models for classification are considered. To offer a wide array of ensemble structures, the random forests and bagging algorithms based on parallel ensemble learning, the AdaBoost algorithm based on sequential ensemble learning, and the XGBoost algorithm based on end-to-end gradient boosting are chosen as the benchmark models for classification. These four models are implemented by using the Scikit-learn library for the Python programming language [25], and the default parameters of these four models are employed herein since hyper-parameterization is not the major concern of data augmentation. The preliminary classification evaluation is tabulated in Table I.

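The benchmark setup can be sketched with scikit-learn defaults as below. The synthetic data stand in for the flattened raw trace features (the real study uses 680 training wafers with a 665:15 class split), and XGBoost, which comes from the separate xgboost package with its own scikit-learn-style interface, is omitted to keep the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened raw trace features with a roughly
# 95:5 class imbalance, mimicking the wafer dataset's skew.
X, y = make_classification(n_samples=400, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Default hyper-parameters throughout, as in the paper.
models = {
    "random forests": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.3f}")
```

Note that plain accuracy is flattering under such imbalance, which is precisely why the paper also reports recall and the false negative rate.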

TABLE I. Preliminary classification evaluation without data augmentation.

The statistics reported in the table are calculated over 10 independent runs. The true negative rate (i.e., specificity) and positive predictive value (PPV, i.e., precision) for the four models are all perfect (i.e., 100%). However, the recall (i.e., true positive rate) is less than 70% in every case; namely, the false negative rate (FNR) is greater than 30%. In spite of the high accuracy displayed in Table I, a recall lower than 90% or even worse is practically unacceptable in current semiconductor practice due to the stringent requirement for wafer yield nowadays. In what follows, data augmentation via VAE-based models is introduced, intended to improve the recall without sacrificing the remaining metrics. It is worth noting that bagging is considered one of the earliest ensemble models and random forests is an extended version of bagging.

A. Configuration of Data Augmentation

In this paper, the proposed VAE models use convolutional layers in the encoder and decoder structures and are trained in two different ways. The first training structure uses the individual strategy as described in (4). This strategy means that the raw trace profile of each key SVID is fed independently, one by one, as the input to the VAE model. The data of the fifteen abnormal wafers are designated for data augmentation. In a word, eight generative VAE models in terms of the individual strategy will be trained and created, as shown in Fig. 5.

Fig. 5. The individual strategy.

The second training structure uses the concatenation strategy as described in (5). This strategy means that the raw trace profiles of the 8 key SVIDs of the same wafer are concatenated as the input to the VAE model. So, only a single generative VAE model in terms of the concatenation strategy will be trained and built by using the designated 15 abnormal wafers. The concatenation strategy is schematically shown in Fig. 6.

Fig. 6. The concatenation strategy.

Loss functions are of essential importance in machine learning, rendering a way to evaluate the distance or difference between the predicted output and the ground truth value (or label) in order to train the model effectively. To examine the efficacy of the trained model visually, the convergence plot of the loss function defined in (2) versus epoch is investigated. It can be clearly seen from Fig. 7 for the individual strategy during the training stage that the loss functions for SVIDs 25-32 all converge well and drop quickly below 0.02 (or even below 0.01) in the first 200 epochs. In particular, more epochs are required to train a useful VAE model for SVIDs 26-27; the corresponding loss functions appear to stabilize only as approximately 1500 epochs elapse. Taking a close look at the raw trace profiles in Fig. 3, the difficulty probably occurs because there exist extremely high impulses greater than 35,000 for SVIDs 26-27.

Attention is now placed on the concatenation strategy, for which the loss function versus epoch is plotted in Fig. 8. Evidently, the loss function of the concatenation strategy works satisfactorily but does not converge as well as that of the individual strategy and exhibits larger oscillations (i.e., zigzags) once the loss drops below 0.05. The loss function converges swiftly in the very early training period, but an unexpected local impulse takes place around 500 epochs. VAEs learn by means of a pre-defined loss function as shown in (2), used to assess how well the model fits the given data. From the viewpoint of function value reduction, the models under both data augmentation strategies provide good overall performance of model building. In the next section, the generated profiles of the FD data are examined.

B. Generated Profiles of FDC Data Based on Two Different Data Augmentation Strategies

First, the deep network structure of the VAE model designed to generate temporal profile data in terms of the individual strategy is illustrated in Fig. 9. For the sake of intuitive interpretation, the input is treated as a row vector as shown in (4) and (5). The encoder is composed of the first convolutional layer of size 32 × 1 × 154, the second convolutional layer of size 64 × 1 × 76 and a hidden layer of 4864 neurons. The latent layer is of size 20. The decoder is composed of a hidden layer of 4864 neurons, the third de-convolutional layer of size 64 × 1 × 45, the second de-convolutional layer of size 32 × 1 × 133, and the first de-convolutional layer of size 1 × 1 × 309. The input and output layers are both of size 1 × 1 × 309.

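The encoder feature-map lengths quoted above are consistent with stride-2 1-D convolutions, and the 4864-neuron hidden layer is exactly the flattened second feature map (64 channels × 76 positions). The kernel sizes chosen below are assumptions made only to reproduce the reported sizes; the paper states the layer shapes, not the kernels.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    # Standard output-length formula for a 1-D convolution.
    return (length + 2 * padding - kernel) // stride + 1

# Assumed stride-2 kernels that reproduce the reported map sizes 309 -> 154 -> 76.
l1 = conv1d_out_len(309, kernel=2, stride=2)   # first conv layer: 32 x 1 x 154
l2 = conv1d_out_len(l1, kernel=4, stride=2)    # second conv layer: 64 x 1 x 76
print(l1, l2, 64 * l2)                         # -> 154 76 4864
```

The product 64 × 76 = 4864 confirms that the hidden layer is simply the flattened output of the second convolutional layer.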

Fig. 7. Convergence plots of the VAE loss function for SVIDs 28-32 using the individual strategy.

Fig. 8. Convergence plot of the VAE loss function for the concatenation strategy.

Fig. 9. Deep network structure of the VAE model via the individual strategy.

Note that the deep network structure of the VAE model using the concatenation strategy closely resembles the one in Fig. 9. For brevity, it is not elaborated here.

To visually examine the generative effectiveness of the proposed VAE model, the decoders trained by means of both data augmentation strategies are used to generate the abnormal raw trace data of the SVIDs. For the sake of conciseness, only the generated raw trace profiles of SVIDs 25-28 are shown in Fig. 10. The generated profile data of the remaining SVIDs are available upon request. As can be seen from the illustrations, the individual strategy is able to generate the abnormal wafer data better than the concatenation strategy, particularly in several local neighborhoods. It is interesting to note that the proposed VAE-based generative model took less than 5 minutes to train the VAE model of each strategy, and less than 1 minute to generate 600 synthesized temporal profiles of defective wafers. The computing machinery used in this study comprised an Intel Core i7-8700 CPU, DDR4 32GB RAM and an NVIDIA GeForce RTX 2080 GPU.

C. Over-Sampling Ratios and Classification Comparisons via the Proposed Generative VAE Models

In the investigated training dataset, the imbalance ratio (IR) is computed as

IR = 1 − 15/665 = 0.9774  (6)

which presents a high class imbalance. A value of zero for IR in (6) indicates perfect class balance in the dataset. To address the class imbalance issue, the over-sampling (OS) ratio is defined by

αos = nminority / nmajority  (7)

where nminority denotes the number of instances in the minority class after over-sampling and nmajority the number in the majority class without over-sampling. In this experimental study, various OS ratios, 20%, 40%, 60%, 80% and 100%, are tested for the purposes of classification comparison, as shown in Table II. To simplify matters, the definitions of the confusion matrix, accuracy, precision, recall, F1-score and false negative rate (FNR) can be referred to [8], [9].

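Definitions (6) and (7) can be checked with a short worked example. The mapping from an OS ratio to a required number of synthesized wafers below is an illustration of definition (7) under the assumption that αos is applied against the 665 majority-class training wafers; the exact counts used in Table II may differ.

```python
# Imbalance ratio (6) of the training dataset: 15 abnormal vs. 665 normal.
n_minority, n_majority = 15, 665
ir = 1 - n_minority / n_majority
print(round(ir, 4))   # -> 0.9774

def n_synthetic(alpha_os, n_minority=15, n_majority=665):
    # Synthesized abnormal wafers required so that the augmented minority
    # count reaches the over-sampling ratio alpha_os defined in (7).
    return round(alpha_os * n_majority) - n_minority

print(n_synthetic(0.20))   # -> 118 synthesized wafers at the 20% OS ratio
print(n_synthetic(1.00))   # -> 650 synthesized wafers at the 100% OS ratio
```

At the 100% OS ratio the two classes are fully balanced, which corresponds to IR = 0 in the sense of (6).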

Fig. 11. Classification accuracy of random forests using 4 different data


augmentation methods.

Fig. 12. Classification accuracy of AdaBoost using 4 different data


augmentation methods.

Fig. 10. Generated temporal profiles of abnormal wafers for SVIDs 25-28
via the VAE models.

TABLE II
OVER -S AMPLING R ATIOS IN THE T RAINING DATASET

Fig. 13. Classification accuracy of Bagging using 4 different data augmen-


tation methods.

SMOTE [26], bootstrap, and the individual and concatenation strategies in the VAE, are considered. In the literature, SMOTE has been considered one of the most popular algorithms for oversampling signal data and has received abundant citations. The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. It can be used to estimate summary statistics such as the mean or standard deviation, and here it serves a role similar to that of the neural network decoder in the proposed VAE generative model. On that account, these two data augmentation methods are legitimate choices for comparison. The classification accuracy results are shown in Figs. 11–14, respectively. Note that all the statistics are evaluated over 10 independent runs. The “original” bar indicates the accuracy achieved without data augmentation, and bootstrap here refers to a self-sampling process that continues, without external input, until the desired sample size is reached.

Evidently from these illustrations, the proposed VAE model coupled with the individual strategy takes the lead in improving accuracy among the four data augmentation methods for every classifier. On the contrary, the concatenation strategy used in the VAE model cannot perform competitively compared with SMOTE and bootstrap. Important clues to its poor data augmentation performance can be observed in the convergence plot in Fig. 8 and the generated profile realizations listed in the right column of Fig. 10. Concatenating these 8 highly incommensurable FD data series, each with a high degree of variability, into a single data series makes it unduly difficult to properly decode the data representation of the raw trace data.
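As a rough, self-contained illustration of the generation step behind Fig. 10 — drawing latent vectors from the standard normal prior and passing them through the trained decoder — the sketch below substitutes a toy linear map for the decoder. The weights, latent dimension, and profile length are placeholders, not the networks trained in this paper.

```python
import random

def sample_latent(dim, n, seed=0):
    """Draw n latent vectors z ~ N(0, I), as in VAE ancestral sampling."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def toy_decoder(z, weights, bias):
    """Placeholder linear 'decoder': maps a latent vector to a profile."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b
            for row, b in zip(weights, bias)]

# Toy dimensions: 2-D latent space, synthesized profiles of length 4.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.2, 0.1]]
b = [1.0, 1.0, 1.0, 1.0]
synthetic = [toy_decoder(z, W, b) for z in sample_latent(2, 3)]
print(len(synthetic), len(synthetic[0]))  # 3 synthetic profiles of length 4
```

In the actual model, the decoder is the trained neural network, so the randomness of z yields statistically varied, realistic defective profiles rather than linear combinations.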

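For concreteness, the two baseline oversamplers can be sketched in a few lines. These are minimal stand-ins, not the implementations evaluated in the paper; in particular, real SMOTE interpolates toward one of the k nearest neighbors rather than a random minority partner.

```python
import random

def bootstrap_oversample(minority, n_new, seed=0):
    """Resample minority traces with replacement until n_new are drawn."""
    rng = random.Random(seed)
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, seed=0):
    """SMOTE-style synthesis: linear interpolation between random minority
    pairs (the reference algorithm picks among k nearest neighbors)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append([x + lam * (y - x) for x, y in zip(a, b)])
    return out

traces = [[1.0, 2.0, 3.0], [1.2, 1.9, 3.1], [0.9, 2.1, 2.8]]
print(len(bootstrap_oversample(traces, 5)), len(smote_like(traces, 5)))
```

Neither sketch models the joint distribution of the temporal data, which is exactly the limitation the VAE-based generator is meant to overcome.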
Authorized licensed use limited to: National Cheng Kung Univ.. Downloaded on December 05,2023 at 16:24:18 UTC from IEEE Xplore. Restrictions apply.
212 IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 36, NO. 2, MAY 2023

TABLE III. Performance Evaluation of AdaBoost Using the Individual Strategy.

Fig. 14. Classification accuracy of XGBoost using 4 different data augmentation methods.

To sum up, the concatenation strategy is not recommended for the proposed generative VAE model used to characterize the raw trace data of the PECVD process addressed in this paper.

For random forests, the individual strategy in combination with the proposed VAE model boosts classification accuracy the most among the compared data augmentation strategies, but seems unaffected by increasing OS ratios. Similarly, the other three strategies appear not to benefit from data augmentation, which could be partly due to the bootstrap aggregating technique already embedded in random forests.

It is worth noting that AdaBoost performs perfectly in accuracy, precision, recall and F-score when the individual strategy in the proposed VAE model is applied to the data augmentation with OS ratios of 80% and 100% (see Fig. 12). When the OS ratio is less than or equal to 60%, none of the four data augmentation methods helps AdaBoost improve classification accuracy considerably. The great success of the individual strategy with AdaBoost can be attributed to two reasons: (i) appropriate characterization of the FD data by the individual strategy, and (ii) a sequential ensemble structure that successively refits weak classifiers to differently weighted instances of the training dataset. By doing so, subsequent classifiers focus on relatively hard-to-classify instances. As is apparent from Fig. 12, heavier data augmentation, such as OS ratios of 80% and 100%, facilitates this adaptiveness in dealing with the additional minority cases.

For bagging and XGBoost, the individual strategy in combination with the proposed VAE model attains the highest accuracy (0.9793 and 0.9944, respectively) when the OS ratio of 60% is used. XGBoost returns the highest accuracy of 0.9944, almost as perfect as AdaBoost at OS ratios of 80% and 100%. Unlike AdaBoost, the success of XGBoost rests on a parallel boosting mechanism within a single tree under an optimized distributed gradient-boosting framework.

Taking a close look at Figs. 11–14, the SMOTE and bootstrap methods do not provide noticeable improvement in accuracy even when the OS ratio rises. The major difficulty may arise from the fact that these two methods do not take the distribution of the temporal data into account, so the synthesized temporal profile data cannot resemble the original series well in some localities. Compared with the individual strategy used for the proposed generative model, the concatenation strategy embedded in a single VAE model cannot deal adequately with the non-commensurability in scale between SVIDs, therefore yielding augmented data inferior to those of the individual strategy. Such inferiority in data augmentation leads to unstable improvement in accuracy despite an increasing OS ratio.

D. Classification Performance of AdaBoost and XGBoost Using the Individual Strategy in the Proposed VAE Model

Based on the comparison results in Section III-C, the two boosting algorithms, AdaBoost and XGBoost, produce the best classification accuracy with the assistance of the proposed VAE model under the individual strategy. Full details of the classification performance, including accuracy, precision, recall, F1-score and FNR, are reported in Tables III and IV. The best augmentation ratios for AdaBoost and XGBoost are indicated in bold face.

In Table III, the performance of AdaBoost improves in every evaluation indicator almost linearly as the number of minority cases increases. In spite of perfect precision (100%) and very high accuracy (98.21% and 97.93%), AdaBoost still returns FNRs of 16.45% and 19.03% when the minority class is augmented to 266 and 399 cases. As mentioned earlier, such high FNRs are totally unacceptable in semiconductor practice. In this regard, at least 532 minority cases are necessary to achieve world-class production yields. As can clearly be seen from Table III, a recall significantly lower than the perfect precision implies that the trained model has a very decisive ability to identify the raw trace data of abnormal wafers, but only for part of the defective patterns in the temporal raw trace data. With the aid of augmented raw trace data of defective wafers during the training stage, the machine learning model surely enhances its detectability by learning more diverse raw trace patterns of defective wafers. The gap between recall and precision begins to narrow as the OS ratio increases.

In Table IV, the performance of XGBoost culminates in every evaluation indicator as the number of minority cases increases to 399. In the meantime, an FNR of 5.16% is achieved, barely acceptable in practice. Surprisingly, the performance of every indicator declines afterwards, which remains an unsolved question for further scrutiny. On the whole, AdaBoost slightly outperforms XGBoost for the PECVD process investigated here. Another potential research opportunity is to confirm whether the discrepancy in performance arises from the difference between the sequential and parallel boosting frameworks.
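For reference, the indicators reported in Tables III and IV follow the usual confusion-matrix definitions, with defective wafers as the positive class. The counts below are hypothetical numbers chosen to mirror the perfect-precision, imperfect-recall pattern discussed for AdaBoost, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, F-beta and false negative rate
    from confusion-matrix counts (positive class = defective wafer)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    fbeta = ((1 + beta**2) * precision * recall
             / (beta**2 * precision + recall))
    fnr = fn / (fn + tp)             # FNR = 1 - recall
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, fbeta, fnr

# Hypothetical counts: every wafer flagged defective is truly defective
# (perfect precision), but one fifth of the defects are missed.
acc, p, r, f1, fnr = classification_metrics(tp=80, fp=0, fn=20, tn=900)
print(p, round(fnr, 2))  # precision 1.0, FNR 0.2
```

The same function with beta set to 0.5 or 2 yields the F0.5- and F2-scores mentioned among the supplementary indicators.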

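The sequential reweighting credited for AdaBoost's success — successively refitting weak classifiers to reweighted training instances — can be illustrated by one round of the textbook discrete AdaBoost update. This is a generic sketch, not the exact configuration used in the experiments.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of discrete AdaBoost: compute the weighted error and the
    classifier weight alpha, then up-weight the misclassified samples."""
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    total = sum(new)
    return alpha, [w / total for w in new]   # renormalize to sum to 1

w0 = [0.25, 0.25, 0.25, 0.25]
miss = [True, False, False, False]           # one hard-to-classify sample
alpha, w1 = adaboost_reweight(w0, miss)
print(round(w1[0], 3))  # the missed sample's weight grows from 0.25 to 0.5
```

Because the missed minority instances keep gaining weight, later weak learners concentrate on them, which is why heavier augmentation of the minority class gives the ensemble more such instances to adapt to.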

TABLE IV. Performance Evaluation of XGBoost Using the Individual Strategy.

All the detailed performance evaluation reports regarding the other data augmentation methods (SMOTE, bootstrap and the concatenation strategy), the other classifiers (random forests and bagging), and the other indicators (F0.5-score, F2-score and specificity) are available upon request. In an earlier experiment, the Gaussian mixture model (GMM) was used to replace the latent vector in (1) and Fig. 3. However, the resulting VAE-GMM model does not perform as competitively as the original VAE model. The corresponding experimental results are also available upon request.

The statement that one-class models may suffer from a high false positive rate is claimed in Section I. To do it justice under the PECVD data configuration investigated in this paper, OCSVM models using the polynomial and radial basis function kernels were evaluated, respectively, and yielded a false positive rate of approximately 0.5. This additional computational result re-stresses the contribution of the proposed VAE-based generative model for highly imbalanced fault detection data. Similar outcomes of high false positive rates were also reported in [28].

IV. CONCLUSION

This paper investigates an important problem in the routine FD tasks of current semiconductor manufacturing: class imbalance in wafer classification. To resolve this difficulty, a new VAE-based model with two different data augmentation strategies is proposed. A real-data PECVD process in semiconductor manufacturing is used to illustrate the proposed generative model. Under various OS ratios, the decoder of the trained VAE model is utilized to generate synthesized profile data of abnormal wafers for the purpose of data augmentation. To verify the effectiveness of the proposed VAE model, four different machine learning models are compared in classification performance under four different data augmentation strategies, i.e., SMOTE, bootstrap, and the individual and concatenation strategies used in the VAE model. Based on a comprehensive computational study, the proposed VAE model coupled with the individual strategy outperforms the other data augmentation methods for every tested classifier. In particular, AdaBoost in combination with the proposed VAE model and the individual strategy delivers perfect classification performance (i.e., 100% in accuracy, precision, recall and F1-score; 0 in FNR) when OS ratios of 80% and 100% are selected. The major contribution of the proposed generative VAE model with the individual strategy is to provide practitioners with a viable raw trace data augmentation tool for ordinary semiconductor manufacturing practice. An immediate extension of the current work for future research would be applying the proposed generative model to other important processing steps in semiconductor manufacturing, such as chemical mechanical planarization, etching, and ion implantation, since it is of practical relevance in FD. How to determine an optimum OS ratio automatically and adaptively in the advanced process control paradigm bears further scrutiny [27]. Although excellent performances are demonstrated in this paper, additional on-site confirmation experiments are still required to further validate the effectiveness of the proposed generative model.

REFERENCES

[1] F. Zhu et al., “Methodology for important sensor screening for fault detection and classification in semiconductor manufacturing,” IEEE Trans. Semicond. Manuf., vol. 34, no. 1, pp. 65–73, Feb. 2021.
[2] S. Yasuda, T. Tanaka, M. Kitabata, and Y. Jisaki, “Chamber and recipe-independent FDC indicator in high-mix semiconductor manufacturing,” IEEE Trans. Semicond. Manuf., vol. 34, no. 3, pp. 301–306, Aug. 2021.
[3] D. H. Kim and S. J. Hong, “Use of plasma information in machine-learning-based fault detection and classification for advanced equipment control,” IEEE Trans. Semicond. Manuf., vol. 34, no. 3, pp. 408–419, Aug. 2021.
[4] H. Lee, Y. Kim, and C. O. Kim, “A deep learning model for robust wafer fault monitoring with sensor measurement noise,” IEEE Trans. Semicond. Manuf., vol. 30, no. 1, pp. 23–31, Feb. 2017.
[5] J. Jang, B. W. Min, and C. O. Kim, “Denoised residual trace analysis for monitoring semiconductor process faults,” IEEE Trans. Semicond. Manuf., vol. 32, no. 3, pp. 293–301, Aug. 2019.
[6] E. Kim, S. Cho, B. Lee, and M. Cho, “Fault detection and diagnosis using self-attentive convolutional neural networks for variable-length sensor data in semiconductor manufacturing,” IEEE Trans. Semicond. Manuf., vol. 32, no. 3, pp. 302–309, Aug. 2019.
[7] S.-K. S. Fan, D.-M. Tsai, F. He, J.-Y. Huang, and C.-H. Jen, “Key parameter identification and defective wafer detection of semiconductor manufacturing processes using image processing techniques,” IEEE Trans. Semicond. Manuf., vol. 32, no. 4, pp. 544–552, Nov. 2019.
[8] S.-K. S. Fan, C.-Y. Hsu, D.-M. Tsai, F. He, and C.-C. Cheng, “Data-driven approach for fault detection and diagnostic in semiconductor manufacturing,” IEEE Trans. Autom. Sci. Eng., vol. 17, no. 4, pp. 1925–1936, Oct. 2020.
[9] S.-K. S. Fan, C.-Y. Hsu, C.-H. Jen, K.-L. Chen, and L.-T. Juan, “Defective wafer detection using a denoising autoencoder for semiconductor manufacturing processes,” Adv. Eng. Inform., vol. 46, Oct. 2020, Art. no. 101166.
[10] S.-K. S. Fan, S.-C. Lin, and P.-F. Tsai, “Wafer fault detection and key step identification for semiconductor manufacturing using principal component analysis, AdaBoost and decision tree,” J. Ind. Prod. Eng., vol. 33, no. 3, pp. 151–168, Jun. 2016.
[11] S.-K. S. Fan, N.-C. Yao, Y.-J. Chang, and C.-H. Jen, “Statistical monitoring of nonlinear profiles by using piecewise linear approximation,” J. Process Control, vol. 21, no. 8, pp. 1217–1229, Sep. 2011.
[12] S.-K. S. Fan, Y.-J. Chang, and N. Aidara, “Nonlinear profile monitoring of reflow process data based on the sum of sine functions,” Qual. Rel. Eng. Int., vol. 29, no. 5, pp. 743–758, Jul. 2013.
[13] S.-K. S. Fan, C.-H. Jen, and J.-X. Lee, “Profile monitoring for autocorrelated reflow processes with small samples,” Processes, vol. 7, no. 2, p. 104, Jan. 2019.
[14] T. Lee, K. B. Lee, and C. O. Kim, “Performance of machine learning algorithms for class-imbalanced process fault detection problems,” IEEE Trans. Semicond. Manuf., vol. 29, no. 4, pp. 436–445, Nov. 2016.
[15] X. Jiang and Z. Ge, “Data augmentation classifier for imbalanced fault classification,” IEEE Trans. Autom. Sci. Eng., vol. 18, no. 3, pp. 1206–1217, Jul. 2021.
[16] X. Yuan, C. Ou, Y. Wang, C. Yang, and W. Gui, “A layer-wise data augmentation strategy for deep learning networks and its soft sensor application in an industrial hydrocracking process,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3296–3305, Aug. 2021.
[17] M. Saqlain, Q. Abbas, and J. Y. Lee, “A deep convolutional neural network for wafer defect identification on an imbalanced dataset in semiconductor manufacturing processes,” IEEE Trans. Semicond. Manuf., vol. 33, no. 3, pp. 436–444, Aug. 2020.
[18] Y. Hyun and H. Kim, “Memory-augmented convolutional neural networks with triplet loss for imbalanced wafer defect pattern classification,” IEEE Trans. Semicond. Manuf., vol. 33, no. 4, pp. 622–634, Nov. 2020.
[19] S. Wang, Z. Zhong, Y. Zhao, and L. Zuo, “A variational autoencoder enhanced deep learning model for wafer defect imbalanced classification,” IEEE Trans. Compon. Packag. Manuf. Technol., vol. 11, no. 12, pp. 2055–2060, Dec. 2021, doi: 10.1109/TCPMT.2021.3126083.
[20] J. Yu and J. Liu, “Multiple granularities generative adversarial network for recognition of wafer map defects,” IEEE Trans. Ind. Informat., vol. 18, no. 3, pp. 1674–1683, Mar. 2022, doi: 10.1109/TII.2021.3092372.
[21] Y. Lu, Y.-M. Cheung, and Y. Y. Tang, “Bayes imbalance impact index: A measure of class imbalanced data set for classification problem,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3525–3539, Sep. 2020.
[22] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” 2013, arXiv:1312.6114.
[23] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Found. Trends Mach. Learn., vol. 12, no. 4, pp. 307–392, Nov. 2019.
[24] S. Bond-Taylor, A. Leach, Y. Long, and C. G. Willcocks, “Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7327–7347, Nov. 2022, doi: 10.1109/TPAMI.2021.3116668.
[25] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, Jan. 2002.
[27] S.-K. S. Fan, C.-W. Cheng, and D.-M. Tsai, “Fault diagnosis of wafer acceptance test and chip probing between front-end-of-line and back-end-of-line processes,” IEEE Trans. Autom. Sci. Eng., vol. 19, no. 4, pp. 3068–3082, Oct. 2022, doi: 10.1109/TASE.2021.3106011.
[28] S.-K. S. Fan, D.-M. Tsai, C.-H. Jen, C.-Y. Hsu, F. He, and L.-T. Juan, “Data visualization of anomaly detection in semiconductor processing tools,” IEEE Trans. Semicond. Manuf., vol. 35, no. 2, pp. 186–197, May 2022.
