Robust Vector Quantized-Variational Autoencoder

Chieh-Hsin Lai¹   Dongmian Zou²   Gilad Lerman¹

¹School of Mathematics, University of Minnesota, Twin Cities, USA. ²Division of Natural and Applied Sciences, Duke Kunshan University, Jiangsu, China. Correspondence to: Chieh-Hsin Lai <laixx313@umn.edu>, Dongmian Zou <dongmian.zou@dukekunshan.edu.cn>, Gilad Lerman <lerman@umn.edu>.

Abstract

Image generative models can learn the distributions of the training data and consequently generate examples by sampling from these distributions. However, when the training dataset is corrupted with outliers, generative models will likely produce examples that are also similar to the outliers. In fact, a small portion of outliers may induce state-of-the-art generative models, such as the Vector Quantized-Variational AutoEncoder (VQ-VAE), to learn a significant mode from the outliers. To mitigate this problem, we propose a robust generative model based on VQ-VAE, which we name Robust VQ-VAE (RVQ-VAE). In order to achieve robustness, RVQ-VAE uses two separate codebooks for the inliers and outliers. To ensure the codebooks embed the correct components, we iteratively update the sets of inliers and outliers during each training epoch. To ensure that the encoded data points are matched to the correct codebooks, we quantize using a weighted Euclidean distance, whose weights are determined by directional variances of the codebooks. Both codebooks, together with the encoder and decoder, are trained jointly according to the reconstruction loss and the quantization loss. We experimentally demonstrate that RVQ-VAE is able to generate examples from inliers even if a large portion of the training data points are corrupted.

Figure 1. Demonstration of generation by VAE, VQ-VAE and RVQ-VAE with minor corruption: Given a training set of face images, where only 5% have masks, we trained VAE, VQ-VAE and RVQ-VAE models, and randomly generated ten examples from each model. VQ-VAE generated more outliers than normal faces, VAE generated only a few outliers but the images were blurry, and RVQ-VAE generated normal faces without masks.

1. Introduction

In the past several years, deep learning has made remarkable progress in generating realistic images (Goodfellow et al., 2014; Van Den Oord et al., 2017; Brock et al., 2019; Menick & Kalchbrenner, 2019; Ma et al., 2019; Razavi et al., 2019). Deep models usually learn the image distribution from the training data and then draw novel samples from it. Therefore, it is crucial that the training set correctly represents the distribution. However, in practice, the training set for the generative task can be corrupted and one needs to develop robust methods that aim to learn the uncorrupted distribution. In general, there are two types of corruption. The first one is when the images themselves are defective or corrupted by noise. The second one is when the image dataset is corrupted by another category of images with different patterns, which are considered as outliers. In many real-world applications, corrupted datasets are inevitable during the data collection process and it is too expensive to clean up datasets. Even widely used datasets such as CIFAR-10 or ImageNet suffer from corruption and mislabeling (Northcutt et al., 2021). In addition, corruption may also occur when a new area of study is explored and it is unclear how to distinguish between normal and abnormal points. For example, in the beginning of the COVID-19 pandemic it was hard to diagnose COVID-19 patients and distinguish them from the rest of the patients infected by a virus (Chowdhury et al., 2020; Xiao et al., 2020). In particular, the datasets of X-ray images of patients infected with COVID-19 may contain wrongly classified ones.

One may expect that a small portion of outliers in the training data should not affect the overall performance of a generative model. However, this is not true, and even some state-of-the-art generative models may produce unexpected results when the training data is mildly corrupted. Indeed, the middle row of Fig. 1 demonstrates an image generation example with the common Vector Quantized-Variational Auto-Encoder (VQ-VAE) using a training set with 5% outliers, where VQ-VAE mainly learns from the outlier distribution of masked faces and generates more outliers than inliers. This intriguing observation strongly urges the development of robust generative methods, and in particular, robust VAE-type generating methods.
We are unaware of any work addressing the robustness of VAE-type generative models. Kaneko & Harada (2020) built robust generative adversarial networks (GANs) that could handle noisy images. Balaji et al. (2020) used a robust formulation of optimal transport in a Wasserstein GAN (Arjovsky et al., 2017) and demonstrated that they may handle a small fraction of outliers. However, both methods may not be sufficiently robust to outliers whose fractions are not too small.

In this work, we present an end-to-end model that makes the VQ-VAE framework robust to outliers in the training set. We call this model Robust VQ-VAE (RVQ-VAE). It uses two separate discrete latent spaces (codebooks) to encode the inliers and outliers. Iteratively, RVQ-VAE performs two main steps. The first one, which we refer to as self-recognition, assigns training data points to their corresponding codebooks according to the reconstruction errors. The second one, which we refer to as joint-training, updates the codebooks as well as the parameters of the encoder and the decoder by backpropagation. To some extent, this iterative process is similar to an EM algorithm (Dempster et al., 1977). However, both steps above are different from the ones of the EM algorithm and are novel in our context.

We remark that an additional novel subcomponent of the self-recognition step is “confidence”-based matching. It finds the best quantized vector in the codebook using a weighted Euclidean distance whose weights are determined according to the directional variances of the codebooks.

We experimentally validated our model using two different datasets, while corrupting their training points with different percentages of outliers. The consideration of many such percentages is rather time-consuming and extensive. Not only does our model successfully resolve the vulnerability of VQ-VAE to small corruption ratios, but it also proves to be robust to higher corruption ratios (up to 30%). In addition, our model is more robust to outliers than other robust generative models.

We briefly summarize the above contributions:

• RVQ-VAE is an end-to-end image generative VAE-type model, robust to outliers in the training data.
• It uses two separate codebooks to quantize inliers and outliers. Their iterative update has some novel components, such as the “confidence”-based matching.
• Numerical experiments indicate that RVQ-VAE outperforms other robust and non-robust models.

The rest of this paper is organized as follows. In §2 we review related previous works. In §3 we describe the details of the proposed method, RVQ-VAE. In §4 we test the robustness of RVQ-VAE while comparing it with competing methods. Lastly, we conclude this work in §5.

2. Related Works

Variational AutoEncoders (VAEs) (Kingma & Welling, 2019) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are well-studied generative models. VAEs have been applied to various generation tasks (Lotfollahi et al., 2020; Huang et al., 2018; Vahdat & Kautz, 2020). It is generally easier to train VAEs than GANs. However, GAN-based methods have been able to generate high-fidelity synthetic images (Arjovsky et al., 2017; Karras et al., 2020b;a; Zhao et al., 2020), whereas, until recently, VAE-based methods seemed to generate blurred images (Cai et al., 2019).

In order to address this issue of low-quality generation, Van Den Oord et al. (2017) recently proposed the Vector Quantized-Variational AutoEncoder (VQ-VAE) framework. It uses a discrete latent space, or a codebook, containing a collection of vectors, where the number of vectors is a priori fixed and is much smaller than the dimension of the input data. The vectors (codes) in the codebook are updated throughout the training process. VQ-VAE uses this codebook to obtain a discrete, or “vectorized”, latent representation of the output of the encoder by searching for the nearest element in the codebook. We further review the VQ-VAE model in §3.1. Both VQ-VAE and its modified version, VQ-VAE-2 (Razavi et al., 2019), can generate high-fidelity images that are comparable to those generated by GAN-based methods (Brock et al., 2019). Although VQ-VAE was originally designed to handle both image and audio data, and was recently extended to other natural language processing tasks (van Niekerk et al., 2020), we only focus on its application to image generation.

Most VAE or GAN-based methods are not designed to handle corruption of the training set. Kaneko & Harada (2020) designed a family of GAN-based methods which can learn a clean generator from a training dataset corrupted by noise. Their method and its variations can be adapted to different noise models without complete prior information of the noise. Balaji et al. (2020) modified the Wasserstein distance in Wasserstein GAN (Arjovsky et al., 2017) by a robust formulation of optimal transport. They apply the method for image generation, where the training datasets are corrupted by only a small portion of outliers from other categories with distinct structures. Unlike these two methods, we aim to follow the VQ-VAE framework and consider relatively large fractions of outliers (up to 30%).

Reconstruction errors of autoencoders are widely used in detection of outliers in imaging data (Zhai et al., 2016; Zong et al., 2018; Perera et al., 2019; Zhou et al., 2021b;a; Li et al., 2021). To ensure that inliers have smaller errors and outliers
have larger ones, some previous works (Lai et al., 2020a;b) propose the use of sums of absolute deviations instead of squared distances for the loss function of the autoencoders. Our model also adopts such a robust reconstruction error when determining which codebook each data point should be assigned to. In contrast to our model, in the above works the reconstruction error provides a score that is directly used for anomaly detection.

Weighted Euclidean distances are often adopted in detecting outliers. A common example is the Mahalanobis distance, which proves effective in both anomaly detection (Lee et al., 2018; Kamoi & Kobayashi, 2020; Hou et al., 2020) and generative networks (Mroueh & Sercu, 2017). Our distance is proportional to directional standard deviations (we later motivate this proportion), unlike the inverse proportion of the Mahalanobis distance.

3. Method

We review the VQ-VAE framework in §3.1 and describe our robust VQ-VAE-type method in §3.2.

3.1. The VQ-VAE framework for image generation

Unlike a regular VAE, which encodes the input in a continuous latent space, VQ-VAE finds compressed representations of images by projecting the output of the encoder onto a discrete latent space. These representations aim to extract essential features of the images without distorting them too much, so the decoder may adequately reconstruct them. More precisely, the output of the encoder is quantized to a set of vectors, called the codebook, denoted by C := {e^(i)}_{i=1}^K ⊂ R^D, where K is the size of the codebook and D is the dimension of the latent features. Given a data point x from the input dataset, let z_enc(x) denote the output of the encoder E of VQ-VAE, that is, z_enc(x) := E(x). The vectorized representation of z_enc(x), denoted by z_vq(x), is the nearest element in the codebook, that is,

    z_vq(x) = e^(k),   k = argmin_{1≤i≤K} ‖z_enc(x) − e^(i)‖_2.   (1)

This representation, z_vq(x), is then fed forward to the decoder D. Given a training dataset S, the parameters of the encoder and decoder, as well as the codebook, are updated by minimizing the following loss function:

    L_vqvae(E, C, D) := E_{x∼S} { ‖x − D(z_enc(x) + sg[z_vq(x) − z_enc(x)])‖_2^2
                                  + ‖sg[z_enc(x)] − z_vq(x)‖_2^2 + β ‖z_enc(x) − sg[z_vq(x)]‖_2^2 },   (2)

where β is a hyperparameter and sg[·] is the stop-gradient operator (Bengio et al., 2013; Van Den Oord et al., 2017), which is the identity during the forward pass and has zero gradients during the backward computation. The discrete latent indices obtained by optimizing (2) are used for training the prior distribution of the VQ-VAE. This distribution is modeled by an autoregressive neural network, and its common choice is PixelCNN (Oord et al., 2016; Salimans et al., 2017) that follows the design of the VQ-VAE. Ultimately, the learned autoregressive prior is utilized for generation.
3.2. Description of RVQ-VAE

We assume that the training dataset is unlabeled and corrupted with outliers, whose fraction is at most 30%. For this setting, we propose RVQ-VAE, which follows the general framework of VQ-VAE. We first train an encoder and a decoder, together with a pair of codebooks, namely an inlier codebook and an outlier codebook. Then we learn a latent prior, which is modeled by PixelCNN, according to the codebooks. Finally, we use it for generation.

To train a good model that generates from the inlier distribution, we need to obtain a faithful inlier codebook. For this purpose, we iteratively apply two steps for training. The first one, which we refer to as self-recognition, classifies each data point as either an inlier or an outlier according to the reconstruction error from the previous stage. Each inlier or outlier is then quantized to a vector in the inlier or outlier codebook. Fig. 2 demonstrates the basic idea of self-recognition. The second step, which we refer to as joint-training, updates the parameters of all components of the autoencoder (the encoder, the decoder and the two codebooks) according to the quantization result. We further describe the two steps in detail below.

Figure 2. Illustration of the self-recognition step of RVQ-VAE. It uses two separate codebooks for inliers and outliers. The other step of joint-training updates E and D together with the codebooks.

3.2.1. Self-recognition

We denote the inlier and outlier codebooks by C_in := {e_in^(i)}_{i=1}^{K_in} ⊂ R^D and C_out := {e_out^(i)}_{i=1}^{K_out} ⊂ R^D, respectively, where K_in and K_out are the sizes of the inlier and outlier codebooks. We use more vectors in the inlier codebook and therefore require that K_in > K_out.

The quantization process that maps z_enc to a code is similar to (1), but is different in two ways. First, we need to make a choice between C_in and C_out. Second, we adopt a “confidence matching”, in which the distance between z_enc and each code is a weighted Euclidean one. We clarify this quantization process as follows. Before training the VQ-VAE, we fix a hyperparameter η that bounds the percentage of data points associated with the outlier codebook. In general, 0 < η < 50%, though we fix the default parameter η = 20%. In the first epoch, we have not yet received the reconstruction errors, and thus we temporarily treat all data points as inliers. For the consequent epochs, we re-label each data point as either an inlier or an outlier, according to its reconstruction error obtained from the previous epoch.

That is, we compute ‖x − x̃‖_2 for each input data point x and its reconstructed point x̃ = D(z_vq(x)) via the decoder from the previous epoch. The data points with the top η (20% in our experiments) largest reconstruction errors are associated with the outlier codebook, while the rest are associated with the inlier codebook. We next explain the assignment of inliers and outliers to vectors in their codebooks, which we refer to as confidence matching and perform in each epoch.

For our proposed confidence matching, we replace the standard ℓ2 distance in (1) with a weighted one, which introduces scaling in each direction. That is, for any two vectors z = (z_i)_{i=1}^D and µ = (µ_i)_{i=1}^D ∈ R^D and a “confidence” weight vector s = (s_i)_{i=1}^D ∈ R^D, the corresponding weighted Euclidean distance is

    d_W(z; µ, s) := sqrt( Σ_{i=1}^D s_i^2 (z_i − µ_i)^2 ).   (3)

The coordinate s_i of s expresses the “confidence” in the i-th coordinate. In the confidence matching, for any input image x and any choice of inliers or outliers, we let z := z_enc(x) and µ be the inlier or outlier codebook vectors. In this case we choose s_i as the standard deviation of the inlier or outlier codebook associated with z_enc(x) along the i-th coordinate. We find this choice of s_i to be a natural “confidence score”. Indeed, when we assign z_enc(x) to a vector, we are more confident about the directions where the set of possible vectors has a larger variance, so that the vectors are far apart from each other. For a direction along which the variance of the vectors is small, we need to be more careful when we perform the quantization, because there is not much difference between the vectors and we do not lose much if we switch between them. Consequently, the quantization process may be unstable if we rely on such a direction. We remark that the loss function used for updating the codebooks employs the same distance (see the second and third terms in (7), which we will explain later). We experimentally noticed that when using the regular Euclidean distance instead, the norms of the codebook vectors dramatically increase and consequently the training process diverges. In principle, we could have used the full covariance of the codebook as the “confidence matrix”. However, that would be computationally costly and significantly slow down the training process.

In summary, consider an input x from the training data. In the first epoch, we embed z_enc(x) to z_vq(x), the nearest point from the inlier codebook C_in using the weighted Euclidean distance. More precisely,

    z_vq(x) = e_in^(k),   k = argmin_{1≤i≤K_in} d_W(z_enc(x); e_in^(i), std(C_in)),   (4)

where std(C_in) ∈ R^D is the vector composed of the standard deviations of the inlier codebook C_in in all coordinates:

    std(C_in)_j = std( {e_in,j^(i)}_{i=1}^{K_in} ),   j = 1, ..., D,   (5)

where e_in,j^(i) is the j-th entry of e_in^(i). For all the consequent epochs, if ‖x − x̃‖_2, the reconstruction error for x, is not among the largest η of all errors, we update z_vq according to (4) again. Otherwise, we search for z_vq(x) in the outlier codebook C_out:

    z_vq(x) = e_out^(k),   k = argmin_{1≤i≤K_out} d_W(z_enc(x); e_out^(i), std(C_out)),   (6)

where std(C_out) is the vector of standard deviations for the outlier codebook C_out, which is defined similarly to (5).
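The following sketch shows one way (3)–(6) could be implemented: std(C) is the coordinate-wise standard deviation of a codebook, d_W weights each coordinate by it, and the top-η reconstruction errors from the previous epoch select the outlier codebook. The helper names (weighted_quantize, self_recognition, recon_err_prev) are illustrative assumptions, not the authors' API.

```python
import torch

def codebook_std(codebook):
    # std(C) of (5): coordinate-wise standard deviation over the K codes
    return codebook.std(dim=0)                                # (D,)

def weighted_quantize(z_enc, codebook):
    # (3)-(4): nearest code under the weighted Euclidean distance with s = std(C)
    s = codebook_std(codebook)
    diffs = z_enc.unsqueeze(1) - codebook.unsqueeze(0)        # (N, K, D)
    d_w = ((s * diffs) ** 2).sum(dim=-1).sqrt()               # (N, K)
    k = d_w.argmin(dim=1)
    return codebook[k], k

def self_recognition(z_enc, recon_err_prev, C_in, C_out, eta=0.2):
    """Quantize each point with C_in or C_out: the top-eta fraction of the
    previous epoch's reconstruction errors is sent to the outlier codebook."""
    n_out = int(round(eta * z_enc.shape[0]))
    is_out = torch.zeros(z_enc.shape[0], dtype=torch.bool)
    if n_out > 0:
        is_out[recon_err_prev.topk(n_out).indices] = True
    z_in, _ = weighted_quantize(z_enc, C_in)                  # eq. (4)
    z_out, _ = weighted_quantize(z_enc, C_out)                # eq. (6)
    z_vq = torch.where(is_out.unsqueeze(1), z_out, z_in)
    return z_vq, is_out
```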
the quantization, because there is not much difference
between the vectors and we do not lose much if we switch
3.2.2. J OINT- TRAINING
between them. Consequently, the quantization process may
be unstable if we rely on such a direction. We remark that the In each epoch, after associating with the training data specific
loss function used for updating the codebooks employs the vectors in the codebooks, we propose using the following
robust loss function for updating the parameters of the encoder E, the decoder D and the two codebooks C_in and C_out:

    L_rvqvae(E, C_in, C_out, D) := E_{x∼S} { ‖x − D(z_enc(x) + sg[z_vq(x) − z_enc(x)])‖_2
                                             + d_W(sg[z_enc(x)]; z_vq(x), std(C(x)))
                                             + β d_W(sg[z_vq(x)]; z_enc(x), std(C(x))) },   (7)

where C(x) = C_in if x is associated with the inlier codebook and C(x) = C_out otherwise. To ensure robustness in the early stage of training, when the assignment of the codebook might be wrong, we use least-absolute-deviations minimization (which avoids squaring each term in (7)), as advocated in Lerman & Maunu (2018) and Lai et al. (2020b;a). We emphasize that we use the weighted Euclidean distance of (3) (see the second and third terms of (7)), which was already used for quantization in (4) and (6).

Once the RVQ-VAE is trained, we only utilize the inlier codebook to obtain the discrete latent indices {k^(t)}_{t=1}^L ⊂ {1, 2, ..., K_in} of the training dataset {x^(t)}_{t=1}^L. Namely, for each x^(t), where t = 1, ..., L, we search for the index k^(t) by computing

    k^(t) = argmin_{1≤i≤K_in} d_W(z_enc(x^(t)); e_in^(i), std(C_in)).   (8)

The training of RVQ-VAE is summarized as Algorithm 1 in Appendix A. The discrete latent indices are further used for training the PixelCNN, and then the sampling procedure is identical to that of a regular VQ-VAE (Van Den Oord et al., 2017).
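A sketch of the robust loss (7), assuming x_rec is the decoder output on the straight-through latent z_enc + sg[z_vq − z_enc], that latents are flattened to shape (N, D), and that std_C holds std(C(x)) row by row for the batch. The least-absolute-deviations form (no squaring of the three terms) follows the text above; the function names are ours.

```python
import torch

def d_w(z, mu, s):
    # weighted Euclidean distance (3), evaluated row-wise for a batch
    return ((s * (z - mu)) ** 2).sum(dim=-1).sqrt()

def rvq_vae_loss(x, x_rec, z_enc, z_vq, std_C, beta=0.25):
    """Eq. (7): un-squared reconstruction error plus the two weighted
    quantization terms, with stop-gradients implemented by .detach()."""
    recon = (x - x_rec).flatten(1).norm(dim=1)            # ||x - D(...)||_2
    codebook_term = d_w(z_enc.detach(), z_vq, std_C)      # d_W(sg[z_enc]; z_vq, std(C(x)))
    commit_term = d_w(z_vq.detach(), z_enc, std_C)        # d_W(sg[z_vq]; z_enc, std(C(x)))
    return (recon + codebook_term + beta * commit_term).mean()
```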
outlier class. We demonstrate random examples from them
in Fig 3. We consider the following outlier ratios (fraction
4. Experiments of outliers from all data points): 0%, 5%, 10%, 15%, 20%,
25%, 30%. For each dataset, ratio and baseline model, we
4.1. Baselines
train a generative model.
We compare RVQ-VAE with the following methods for
FaceMask contains 64 × 64 human face images downsized
image generation: VQ-VAE (Van Den Oord et al., 2017),
from the 1024 × 1024 images in Kottarathil (2020). There
GAN (Goodfellow et al., 2014), Noise Robust GAN
are two classes: unmasked faces and masked faces with
(NR-GAN) (Kaneko & Harada, 2020) and Robust Optimal
10,000 images in each. For each experiment, we take all
Transport (Robust OT) (Balaji et al., 2020). While the details
10,000 unmasked faces as inliers and randomly choose
of VQ-VAE were presented in §3.1, we describe the other
outliers to have a chosen ratio.
methods as follows.
RoomCrop is constructed from the LSUN dataset(Yu et al.,
GAN learns the data distribution via adversarial training. It
2015) in the following way. We randomly sample 60,000
contains two components of neural networks: a generator
images from the “bedroom” category of LSUN and downsize
and a discriminator. The generator learns to map a simple
them to 64 × 64. After that, we split the images into two
noise to the data distribution. On the other hand, the
subsets, each containing 30,000 images. The first subset
discriminator learns to distinguish candidates produced by
is kept as inliers. For each image in the second subset
the generator from the true data distribution.
of outliers, we randomly choose the number of cropped
NR-GAN is a GAN-based method that aims to learn a clean rectangles to be 1, 2 or 3. For each rectangle we randomly
image generator while the training set is noisy. It trains a choose a center and area between 15 and 25 and replace
its pixel values to be black. The second subset is used as outliers. For each experiment, we take all 30,000 inliers and randomly choose outliers to have a chosen ratio.
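For illustration, the RoomCrop corruption could be scripted roughly as below with NumPy. The paper leaves the geometry slightly ambiguous ("a center and an area between 15 and 25"), so this sketch assumes squares with side lengths between 15 and 25 pixels; it is a guess at the procedure, not the authors' script.

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_black_rectangles(img):
    """img: (64, 64, 3) uint8 array. Blacks out 1-3 randomly placed squares
    whose side length is drawn between 15 and 25 pixels (an assumption)."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(rng.integers(1, 4)):          # 1, 2 or 3 rectangles
        side = int(rng.integers(15, 26))
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        y0, y1 = max(cy - side // 2, 0), min(cy + side // 2, h)
        x0, x1 = max(cx - side // 2, 0), min(cx + side // 2, w)
        out[y0:y1, x0:x1] = 0                    # replace pixel values with black
    return out
```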
The encoder of RVQ-VAE is composed of 3 convolutional layers with kernel sizes (4, 4, 3), output channels (64, 128, 128) and strides (2, 2, 1), followed by 2 residual blocks, each consisting of a convolutional layer with kernel size 3 and a convolutional layer with kernel size 1. The decoder of RVQ-VAE has a deconvolutional layer with kernel size 3, output channel 128 and stride 1, followed by two residual blocks and 2 deconvolutional layers with kernel sizes (4, 4), output channels (64, 3) and strides (2, 2), respectively. In the neural network, the activation functions are all taken to be ReLU. The dimension of the vectors in the two codebooks is D = 64. The size of the inlier codebook is K_in = 512 and that of the outlier codebook is K_out = 256. The threshold is η = 20% and the regularization weight is β = 0.25. The neural networks are optimized by Adam (Kingma & Ba, 2015) with a learning rate of 3e−4 and trained for 400 epochs with a batch size of 10.
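Read literally, that description corresponds to PyTorch modules along the following lines; padding values, the internal layout of the residual blocks, and the absence of a final activation are not fully specified in the text, so they are assumptions here (chosen so that 64 × 64 inputs map to 16 × 16 latent grids).

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # "a convolutional layer with kernel size 3 and a convolutional layer with kernel size 1"
    def __init__(self, ch=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(), nn.Conv2d(ch, ch, 1),
        )
    def forward(self, x):
        return x + self.block(x)

encoder = nn.Sequential(                       # 3 x 64 x 64 -> 128 x 16 x 16
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, stride=1, padding=1),
    ResidualBlock(), ResidualBlock(),
)

decoder = nn.Sequential(                       # 128 x 16 x 16 -> 3 x 64 x 64
    nn.ConvTranspose2d(128, 128, 3, stride=1, padding=1),
    ResidualBlock(), ResidualBlock(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)
```

Note that with D = 64 an additional projection (e.g. a 1 × 1 convolution or a reshape) would still be needed to map the 128-channel encoder features to the 64-dimensional codebook space; the paper does not spell this step out.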
4.3. Metrics and results

Figure 4. Percentage of generated inliers (on the left) and FID scores (on the right) for the baseline methods using the FaceMask and RoomCrop datasets. The inliers are recognized by a well-trained classifier.

To evaluate the percentage of outliers in the generated examples, we train a binary classifier with the following neural network structure: 2 convolutional layers with kernel sizes (3, 3), output channels (32, 64), strides (2, 2), and ReLU activation functions. The convolutional layers are followed by a max pooling layer and 2 dropout layers with ratios (0.25, 0.5). These are followed by a flatten layer and fully connected layers with output channels (3136, 128, 2). At last, we apply the softmax function to produce the probability of each class. The classifier is trained using the same number of inliers and outliers.
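One plausible reading of that classifier in PyTorch is sketched below; with no padding and these strides, the flattened feature size for 64 × 64 inputs comes out to 64 × 7 × 7 = 3136, matching the stated fully connected sizes. The exact placement of the pooling and dropout layers is our assumption.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),   # 64x64 -> 31x31
    nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),  # 31x31 -> 15x15
    nn.MaxPool2d(2),                                         # 15x15 -> 7x7
    nn.Dropout(0.25),
    nn.Flatten(),                                            # 64 * 7 * 7 = 3136
    nn.Linear(3136, 128), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(128, 2),   # class probabilities via softmax at evaluation time
)
```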
For each dataset and model, we generate the same number of examples as the number of inliers in the training set. Then we apply the binary classifier to determine the percentage of inliers within the generated images. To quantitatively measure the quality of the images generated by each method, we also compute the Fréchet Inception Distance (FID) (Karras et al., 2020b) between the set of generated images and the images of the inlier class. Fig. 4 reports both the percentage of generated inliers and the FID scores for all methods and outlier ratios.

To visualize the generated images, we further plot 100 samples generated by each method with outlier ratios 5%, 15%, 25%, 30% for FaceMask and RoomCrop in Figs. 5 and 6, respectively. We present the generated images with outlier ratios 0%, 10% and 20% in the supplemental material.

Figure 5. 100 synthetic images of FaceMask generated by RVQ-VAE and competing methods. From left to right, the columns show the results obtained by training on the dataset with outlier ratios 5%, 15%, 25% and 30%, respectively. The rows, from top to bottom, correspond to RVQ-VAE, VQ-VAE, robust OT, GAN and NR-GAN.

Figure 6. 100 synthetic images of RoomCrop generated by RVQ-VAE and competing methods. From left to right, the columns show the results obtained by training on the dataset with outlier ratios 5%, 15%, 25% and 30%, respectively. The rows, from top to bottom, correspond to RVQ-VAE, VQ-VAE, robust OT, GAN and NR-GAN.

In terms of the percentage of generated inliers, we note that VQ-VAE almost only generates abnormal images when the outlier ratios of the training set for FaceMask are greater than 10%. Its performance is also poor for RoomCrop. As the outlier ratio increases, the GAN-based methods, including GAN, NR-GAN and robust OT, do not seem to be robust at all, because the proportion of outliers generated is similar to the outlier ratio in the training set. In contrast, RVQ-VAE produces significantly fewer outliers in the generated examples.

In terms of FID, the GAN-based models have lower FID scores than VQ-VAE or RVQ-VAE, that is, their quality is better. Note that even when there are no outliers in the training data, VQ-VAE produces a relatively larger FID (see the supplementary material for visualization). This suggests that the large FID of RVQ-VAE is not due to our regime, but due to the fact that VQ-VAE cannot produce images of the same quality as GAN in this generative task. We also note that images generated by either RVQ-VAE or VQ-VAE have similar quality. That is, RVQ-VAE does not sacrifice the quality of VQ-VAE for robustness. Lastly, we observe that for the GAN-based models, the FID increases when the outlier ratio in the training set increases. In contrast, for RVQ-VAE, the FID stays almost constant, implying that the fraction of outliers in the generated examples did not significantly increase.

5. Conclusion and future works

We present RVQ-VAE, a novel robust generative model within the framework of VQ-VAE. Our central idea is to iteratively update two codebooks for inliers and outliers and to employ only the inlier codebook for generation. By confidence matching we rescale the importance of the coordinates and consequently improve the stability of the quantization process. Our model is more robust to outliers in the training set, compared with the other robust generative models.

We noted in §4.3 that VQ-VAE, and thus also RVQ-VAE, do
not generate images of the same quality as GANs. Nevertheless, robust GAN-based models are not robust for the tasks we considered, especially when the percentage of outliers is nontrivial. In order to improve the quality of the generated images, we will try to extend our work to more advanced frameworks such as VQ-VAE-2. We also plan to extend our ideas to other VAE structures that have a discrete latent space.

We do not consider the case where the inliers may belong to multiple classes, where it will be more difficult to distinguish between inliers and outliers. In future work we may consider multiple codebooks for the inliers to address this scenario.
We considered specific datasets with corrupted images. We noticed that current GAN-based robust models were not sufficiently robust to the outliers in these datasets. In the future, we would like to study more extensively different types of datasets and carefully characterize the successes and failures of RVQ-VAE and other algorithms. We would also like to extend this study to audio and video data.

Lastly, we expect we can extend the presented model to address the different problem of anomaly detection.
References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. PMLR, 2017.

Balaji, Y., Chellappa, R., and Feizi, S. Robust optimal transport with applications in generative modeling and domain adaptation. In Advances in Neural Information Processing Systems, volume 33, pp. 12934–12944, 2020. URL https://proceedings.neurips.cc/paper/2020/file/9719a00ed0c5709d80dfef33795dcef3-Paper.pdf.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cai, L., Gao, H., and Ji, S. Multi-stage variational auto-encoders for coarse-to-fine image generation. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 630–638. SIAM, 2019.

Chowdhury, M. E., Rahman, T., Khandakar, A., Mazhar, R., Kadir, M. A., Mahbub, Z. B., Islam, K. R., Khan, M. S., Iqbal, A., Al Emadi, N., et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access, 8:132665–132676, 2020.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Hou, Y., Chen, Z., Wu, M., Foo, C.-S., Li, X., and Shubair, R. M. Mahalanobis distance based adversarial network for anomaly detection. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3192–3196. IEEE, 2020.

Huang, H., He, R., Sun, Z., Tan, T., et al. IntroVAE: Introspective variational autoencoders for photographic image synthesis. Advances in Neural Information Processing Systems, 31, 2018.

Kamoi, R. and Kobayashi, K. Why is the Mahalanobis distance effective for anomaly detection? arXiv preprint arXiv:2003.00402, 2020.

Kaneko, T. and Harada, T. Noise robust generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8404–8414, 2020.

Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020a.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020b.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019. doi: 10.1561/2200000056. URL http://dx.doi.org/10.1561/2200000056.

Kottarathil, P. Face mask lite dataset. https://www.kaggle.com/prasoonkottarathil/face-mask-lite-dataset, 2020.

Lai, C.-H., Zou, D., and Lerman, G. Novelty detection via robust variational autoencoding. arXiv preprint arXiv:2006.05534, 2020a.

Lai, C.-H., Zou, D., and Lerman, G. Robust subspace recovery layer for unsupervised anomaly detection. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=rylb3eBtwr.

Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems, 31, 2018.

Lerman, G. and Maunu, T. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018. doi: 10.1109/JPROC.2018.2853141.

Li, T., Wang, Z., Liu, S., and Lin, W.-Y. Deep unsupervised anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3636–3645, 2021.

Lotfollahi, M., Naghipourfar, M., Theis, F. J., and Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics, 36(Supplement 2):i610–i617, 2020.

Ma, X., Kong, X., Zhang, S., and Hovy, E. MaCow: Masked convolutional generative flow. Advances in Neural Information Processing Systems, 32:5893–5902, 2019.

Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HylzTiC5Km.

Mroueh, Y. and Sercu, T. Fisher GAN. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2510–2520, 2017.

Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=XccDXrDNLek.

Oord, A. v. d., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016.

Perera, P., Nallapati, R., and Xiang, B. OCGAN: One-class novelty detection using GANs with constrained latent representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2898–2906, 2019.

Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14866–14876, 2019.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.

Vahdat, A. and Kautz, J. NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems, volume 33, pp. 19667–19679, 2020. URL https://proceedings.neurips.cc/paper/2020/file/e3b21256183cf7c2c7a66be163579d37-Paper.pdf.

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.

van Niekerk, B., Nortje, L., and Kamper, H. Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.09409, 2020.

Xiao, A. T., Tong, Y. X., and Zhang, S. False-negative of RT-PCR and prolonged nucleic acid conversion in COVID-19: Rather than recurrence. Journal of Medical Virology, 2020.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Zhai, S., Cheng, Y., Lu, W., and Zhang, Z. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 1100–1109. PMLR, 2016.

Zhao, Y., Li, C., Yu, P., Gao, J., and Chen, C. Feature quantization improves GAN training. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 11376–11386. PMLR, 2020.

Zhou, K., Li, J., Luo, W., Li, Z., Yang, J., Fu, H., Cheng, J., Liu, J., and Gao, S. Proxy-bridged image reconstruction network for anomaly detection in medical images. IEEE Transactions on Medical Imaging, 2021a.

Zhou, K., Li, J., Xiao, Y., Yang, J., Cheng, J., Liu, W., Luo, W., Liu, J., and Gao, S. Memorizing structure-texture correspondence for image anomaly detection. IEEE Transactions on Neural Networks and Learning Systems, 2021b.

Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
A. Algorithm of RVQ-VAE

Algorithm 1 Training RVQ-VAE

Input: Training dataset {x^(t)}_{t=1}^L; initialized parameters of E and D; initialized codebooks C_in = {e_in^(i)}_{i=1}^{K_in} ⊂ R^D and C_out = {e_out^(i)}_{i=1}^{K_out} ⊂ R^D; threshold η; regularization weight β; number of epochs T; batch size I; learning rate α
Output: Trained parameters of E and D; trained inlier codebook C_in; discrete latent indices {k^(t)}_{t=1}^L of {x^(t)}_{t=1}^L

 1: for epoch τ = 1, ..., T do
 2:   for each batch {x^(t)}_{t∈I} do
 3:     if τ = 1 then
 4:       k^(t) ← argmin_{1≤i≤K_in} d_W(z_enc(x^(t)); e_in^(i), std(C_in))
 5:       z_vq(x^(t)) ← e_in^(k^(t))
 6:     else
 7:       I_out ← indices of the top ⌈η × |I|⌉ elements of {Err(τ−1)^(t)}_{t∈I}
 8:       if t ∈ I_out then
 9:         k^(t) ← argmin_{1≤i≤K_out} d_W(z_enc(x^(t)); e_out^(i), std(C_out))
10:         z_vq(x^(t)) ← e_out^(k^(t))
11:       else
12:         k^(t) ← argmin_{1≤i≤K_in} d_W(z_enc(x^(t)); e_in^(i), std(C_in))
13:         z_vq(x^(t)) ← e_in^(k^(t))
14:       end if
15:     end if
16:     x̃^(t) ← D(z_vq(x^(t)))
17:     Err(τ)^(t) ← ‖x^(t) − x̃^(t)‖_2
18:     (E, C_in, C_out, D) ← (E, C_in, C_out, D) − α ∇_(E, C_in, C_out, D) L_rvqvae(E, C_in, C_out, D) according to (7)
19:   end for
20: end for
21: Obtain the trained E and C_in
22: for each t = 1, ..., L do
23:   k^(t) ← argmin_{1≤i≤K_in} d_W(z_enc(x^(t)); e_in^(i), std(C_in))
24: end for
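For readers who prefer code to pseudocode, the epoch loop of Algorithm 1 could be organized roughly as below. This is a self-contained sketch under simplifying assumptions (latents handled as (N, D) tensors, batches yielding Python-int sample indices, ties in the error ranking ignored); it is not the released implementation.

```python
import torch

def std_of(codebook):
    return codebook.std(dim=0)                                  # std(C), eq. (5)

def d_w(z, mu, s):
    return ((s * (z - mu)) ** 2).sum(dim=-1).sqrt()             # eq. (3), batched

def quantize(z, codebook):
    s = std_of(codebook)
    d = d_w(z.unsqueeze(1), codebook.unsqueeze(0), s)           # (N, K)
    return codebook[d.argmin(dim=1)]

def run_epoch(tau, batches, encoder, decoder, C_in, C_out, opt,
              prev_err, eta=0.2, beta=0.25):
    """One epoch of Algorithm 1 (sketch). `batches` yields (idx, x) with idx a
    list of sample indices; `prev_err` maps index -> Err(tau-1)."""
    new_err = {}
    for idx, x in batches:
        z = encoder(x)
        if tau == 1:                                            # lines 3-5: all inliers
            use_out = torch.zeros(len(idx), dtype=torch.bool)
        else:                                                   # lines 7-8: top-eta errors
            e = torch.tensor([prev_err[i] for i in idx])
            n_out = max(int(round(eta * len(idx))), 1)
            use_out = e >= e.topk(n_out).values[-1]
        z_in, z_out = quantize(z, C_in), quantize(z, C_out)
        z_vq = torch.where(use_out.unsqueeze(1), z_out, z_in)
        x_rec = decoder(z + (z_vq - z).detach())                # straight-through
        err = (x - x_rec).flatten(1).norm(dim=1)                # line 17
        s = torch.where(use_out.unsqueeze(1), std_of(C_out), std_of(C_in))
        loss = (err + d_w(z.detach(), z_vq, s)
                    + beta * d_w(z, z_vq.detach(), s)).mean()   # eq. (7)
        opt.zero_grad(); loss.backward(); opt.step()            # line 18
        for i, e_i in zip(idx, err.tolist()):                   # store Err(tau)
            new_err[i] = e_i
    return new_err
```

A driver would call run_epoch for τ = 1, ..., T, passing the returned error dictionary back in as prev_err for the next epoch, and finally extract the inlier indices with quantize(·, C_in) as in lines 21–24.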

B. Additional generated images


Figs. 7 and 8 extend the demonstration in Figs. 5 and 6 of the main text by considering the following outlier ratios: 0%, 10% and 20% (recall that the earlier figures considered 5%, 15%, 25% and 30%). These figures were not shown in the main text due to the page limit.

For each method (RVQ-VAE, VQ-VAE, robust OT, GAN and NR-GAN) and outlier ratio, Figs. 7 and 8 demonstrate 100 generated images for FaceMask and RoomCrop, respectively.

C. Numerical representations for some figures


We present the numerical values depicted in Fig. 4 of the main text as tables.


Table 1. Percentage of generated inliers for FaceMask


Training outlier ratios
Methods 5% 10% 15% 20% 25% 30%
RVQ-VAE 98.42% 96.55% 94.67% 93.18% 92.18% 91.81%
VQ-VAE 48.68% 14.08% 2.34% 0.98% 0.62% 0.22%
robust OT 94.44% 89.55% 84.53% 79.80% 74.41% 69.39%
GAN 95.73% 89.83% 84.60% 78.51% 73.75% 69.11%
NR-GAN 94.29% 89.64% 86.24% 79.78% 75.52% 71.29%

Table 2. Percentage of generated inliers for RoomCrop


Training outlier ratios
Methods 5% 10% 15% 20% 25% 30%
RVQ-VAE 97.56% 95.72% 91.98% 86.64% 80.96% 77.44%
VQ-VAE 66.71% 53.85% 39.75% 25.93% 22.15% 16.29%
robust OT 96.19% 92.94% 88.95% 82.85% 72.71% 67.80%
GAN 92.78% 88.91% 84.63% 82.00% 74.23% 70.15%
NR-GAN 97.43% 88.91% 82.69% 78.65% 72.64% 70.36%

Table 3. FID scores of generated images for FaceMask


Training outlier ratios
Methods 0% 5% 10% 15% 20% 25% 30%
RVQ-VAE 42.723 43.367 44.680 42.496 42.340 44.552 42.537
VQ-VAE 42.659 100.767 146.244 234.322 245.413 241.596 242.206
robust OT 30.420 32.670 35.593 36.400 40.971 51.398 58.344
GAN 5.681 10.578 15.675 23.871 37.029 40.558 48.488
NR-GAN 6.381 13.650 17.358 21.060 32.201 36.971 45.143

Table 4. FID scores of generated images for RoomCrop


Training outlier ratios
Methods 0% 5% 10% 15% 20% 25% 30%
RVQ-VAE 92.987 92.831 93.574 94.891 95.026 94.578 97.213
VQ-VAE 92.816 104.214 107.996 111.145 113.437 116.230 117.203
robust OT 24.062 31.880 38.935 49.554 54.691 85.075 108.884
GAN 23.301 30.595 33.507 42.098 50.377 54.553 57.381
NR-GAN 25.137 30.184 38.976 41.260 43.630 46.796 53.351

D. Effectiveness of the binary classifier

In order to illustrate the effectiveness of the classifier used to evaluate the percentage of outliers in the generated examples, we report its F1 scores on labeled test data. Table 5 records the F1 scores of the trained classifier for the following outlier ratios of the testing set: 0%, 5%, 10%, 15%, 20%, 25%, 30%. Since the classifier we employed showed almost perfect F1 scores, the large margin of our model over the other ones is validated.

Table 5. Testing F1 scores of the trained binary classifier on FaceMask and RoomCrop
Testing outlier ratios
Datasets 0% 5% 10% 15% 20% 25% 30%
FaceMask 1.000 1.000 1.000 0.995 0.996 0.984 0.979
RoomCrop 0.997 0.989 0.968 0.970 0.966 0.959 0.934
Figure 7. 100 synthetic images of FaceMask generated by RVQ-VAE and competing methods for outlier ratios 0%, 10% and 20% (from left to right). The rows, from top to bottom, correspond to RVQ-VAE, VQ-VAE, robust OT, GAN and NR-GAN.
Figure 8. 100 synthetic images of RoomCrop generated by RVQ-VAE and competing methods for outlier ratios 0%, 10% and 20% (from left to right). The rows, from top to bottom, correspond to RVQ-VAE, VQ-VAE, robust OT, GAN and NR-GAN.
