
Deep Spectral Clustering using Dual Autoencoder Network

Xu Yang1, Cheng Deng1*, Feng Zheng2, Junchi Yan3, Wei Liu4*

1 School of Electronic Engineering, Xidian University, Xi'an 710071, China
2 Department of Computer Science and Engineering, Southern University of Science and Technology
3 Department of CSE, and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
4 Tencent AI Lab, Shenzhen, China

{xuyang.xd, chdeng.xd}@gmail.com, zhengf@sustc.edu.cn, yanjunchi@sjtu.edu.cn, wl2223@columbia.edu

* Corresponding author.

arXiv:1904.13113v1 [cs.LG] 30 Apr 2019

Abstract

Clustering methods have recently attracted ever-increasing attention in the learning and vision communities. Deep clustering combines embedding and clustering together to obtain an optimal embedding subspace for clustering, which can be more effective than conventional clustering methods. In this paper, we propose a joint learning framework for discriminative embedding and spectral clustering. We first devise a dual autoencoder network, which enforces the reconstruction constraint for the latent representations and their noisy versions, to embed the inputs into a latent space for clustering. As such, the learned latent representations can be more robust to noise. Then, mutual information estimation is utilized to provide more discriminative information from the inputs. Furthermore, a deep spectral clustering method is applied to embed the latent representations into the eigenspace and subsequently cluster them, which can fully exploit the relationships between the inputs to achieve optimal clustering results. Experimental results on benchmark datasets show that our method can significantly outperform state-of-the-art clustering approaches.

Figure 1. Visualizing the discriminative embedding capability on MNIST-test with the t-SNE algorithm. (a) Raw data: the space of the raw data; (b) ConvAE: data points in the latent subspace of a convolutional autoencoder; (c) Our method: data points in the latent subspace of the proposed autoencoder network. Our method provides a more discriminative embedding subspace.

1. Introduction

As an important task in unsupervised learning [43, 8, 22, 24] and the vision community [48], clustering [14] has been widely used in image segmentation [37], image categorization [45, 47], and digital media analysis [1]. The goal of clustering is to find a partition that keeps similar data points in the same cluster and dissimilar ones in different clusters. In recent years, many clustering methods have been proposed, such as K-means clustering [28], spectral clustering [31, 46, 17], and non-negative matrix factorization clustering [41], among which K-means and spectral clustering are two well-known conventional algorithms that are applicable to a wide range of tasks. However, these shallow clustering methods depend on low-level features such as raw pixels, SIFT [32], or HOG [7] of the inputs. Their distance metrics are only exploited to describe local relationships in the data space, and have limited ability to represent the latent dependencies among the inputs [3].

This paper presents a novel deep learning based unsupervised clustering approach. Deep clustering, which integrates the embedding and clustering processes to obtain an optimal embedding subspace for clustering, can be more effective than shallow clustering methods. The main reason is that deep clustering methods can effectively model the distribution of the inputs and capture their non-linear properties, making them more suitable for real-world clustering scenarios.

Recently, many clustering methods have been promoted by deep generative approaches, such as the autoencoder network [29]. The popularity of the autoencoder network lies in its powerful ability to capture high-dimensional probability distributions of the inputs without supervised information. The encoder model projects the inputs into the latent space and adopts an explicit approximation of maximum likelihood to estimate the distribution diversity between the latent representations and the inputs. Simultaneously, the decoder model reconstructs the latent representations to ensure that the output maintains all of the details of the inputs [38]. Almost all existing deep clustering methods endeavor to minimize the reconstruction loss, in the hope of making the latent representations more discriminative, which directly determines the clustering quality. However, in fact, the discriminative ability of the latent representations has no substantial connection with the reconstruction loss, causing the performance gap that is to be bridged in this paper.
We propose a novel dual autoencoder network for deep spectral clustering. First, a dual autoencoder, which enforces the reconstruction constraint for the latent representations and their noisy versions, is utilized to establish the relationships between the inputs and their latent representations. Such a mechanism is performed to make the latent representations more robust. In addition, we adopt mutual information estimation to preserve as much discriminative information from the inputs as possible. In this way, the decoder can be viewed as a discriminator that determines whether the latent representations are discriminative. Fig. 1 demonstrates the performance of our proposed autoencoder network by comparing different data representations of MNIST-test data points. Clearly, our method provides a more discriminative embedding subspace than the convolutional autoencoder network. Furthermore, deep spectral clustering is harnessed to embed the latent representations into the eigenspace, which is followed by clustering. This procedure can exploit the relationships between the data points effectively and obtain optimal results. The proposed dual autoencoder network and deep spectral clustering network are jointly optimized.

The main contributions of this paper are threefold:

• We propose a novel dual autoencoder network for generating discriminative and robust latent representations, which is trained with mutual information estimation and different reconstruction results.

• We present a joint learning framework to embed the inputs into a discriminative latent space with a dual autoencoder and simultaneously assign them to the ideal distribution by a deep spectral clustering model.

• Empirical experiments demonstrate that our method outperforms state-of-the-art methods, including both traditional and deep network-based models, over five benchmark datasets.

2. Related Work

Recently, a number of deep learning-based clustering methods have been proposed. Deep Embedding Clustering (DEC) [40] adopts a fully connected stacked autoencoder network in order to learn the latent representations by minimizing the reconstruction loss in the pre-training phase; the objective function applied to the clustering phase is the Kullback-Leibler (KL) divergence between the soft assignments of clustering modelled by a t-distribution. Later, a K-means loss is adopted at the clustering phase to train a fully connected autoencoder network [42], which is a joint approach of dimensionality reduction and K-means clustering. In addition, the Gaussian Mixture Variational Autoencoder (GMVAE) [9] shows that a minimum information constraint can be utilized to mitigate the effect of over-regularization in VAEs, and provides unsupervised clustering within the VAE framework by considering a Gaussian mixture as a prior distribution. Discriminatively Boosted Clustering [23], a fully convolutional network with layer-wise batch normalization, adopts the same objective function as DEC and uses a boosting factor when training a stacked autoencoder.

Shah and Koltun [34] jointly solve the tasks of clustering and dimensionality reduction by efficiently optimizing a continuous global objective based on robust statistics, which allows heavily mixed clusters to be untangled. Following this method, a deep continuous clustering approach is suggested in [35], where the autoencoder parameters and a set of representatives defined against each data point are simultaneously optimized. The convex clustering approach proposed by [6] optimizes the representatives by minimizing the distances between each representative and its associated data point; non-convex objectives are involved to penalize the pairwise distances between the representatives.

Furthermore, to improve the performance of clustering, some methods combine convolutional layers with fully connected layers. Joint Unsupervised Learning (JULE) [44] jointly optimizes a convolutional neural network with the clustering parameters in a recurrent manner using an agglomerative clustering approach, where image clustering is conducted in the forward pass and representation learning is performed in the backward pass. Dizaji et al. [10] propose DEPICT, a method that trains a convolutional autoencoder with a softmax layer stacked on top of the encoder; the softmax entries represent the assignment of each data point to one cluster. VaDE [18] is a variational autoencoder method for deep embedding that combines a Gaussian mixture model for clustering. In [16], a deep autoencoder is trained to minimize a reconstruction loss together with a self-expressive layer; this objective encourages a sparse representation of the original data. Zhou et al. [50] present a deep adversarial subspace clustering (DASC) method to learn more favorable representations and to supervise sample representation learning by adversarial deep learning [21]. However, reconstructions through low-dimensional representations are often very blurry. One possible remedy is to train a discriminator with adversarial learning, but this can further increase the difficulty of training. Comparatively, our method introduces a relative reconstruction loss and mutual information estimation to obtain more discriminative representations, and jointly optimizes the autoencoder network and the deep spectral clustering network for optimal clustering.
Figure 2. Illustration of the overall architecture (encoder, dual decoders, noisy-transformer ψ, mutual information estimation with negative samples, and the graph spectral clustering branch producing y). We first pre-train a dual autoencoder to embed the inputs into a latent space, and reconstruction results are obtained from the latent representations and their noisy versions based on the noisy-transformer ψ. The mutual information calculated with negative sampling estimation is utilized to learn the discriminative information from the inputs. Then, we assign the latent representations to the ideal clusters by a deep spectral clustering model, and jointly optimize the dual autoencoder and the spectral clustering network simultaneously.

3. Methodology

As aforementioned, our framework consists of two main components: a dual autoencoder and a deep spectral clustering network. The dual autoencoder, which reconstructs the inputs using the latent representations and their noisy versions, is introduced to make the latent representations more robust. In addition, mutual information estimation between the inputs and the latent representations is applied to preserve the input information as much as possible. Then we utilize the deep spectral clustering network to embed the latent representations into the eigenspace, where clustering is subsequently performed. The two networks are merged into a unified framework and jointly optimized with KL divergence. The framework is shown in Fig. 2.

Let X = {x_1, ..., x_n} denote the input samples and Z = {z_1, ..., z_n} their corresponding latent representations, where z_i = f(x_i; \theta_e) \in R^d is learned by the encoder E. The parameters of the encoder are denoted by \theta_e, and d is the feature dimension. \tilde{x}_{z_i} = g(z_i; \theta_d) represents the reconstructed data point, which is the output of the decoder D, and the parameters of the decoder are denoted by \theta_d. We adopt a deep spectral clustering network C to map z_i to y_i = c(z_i; \theta_y) \in R^K, where K is the number of clusters.
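To make the notation concrete, here is a minimal PyTorch-style sketch of the three components E, D, and C. The 3×3 convolutions, the 120-dimensional latent space, and the four fully connected clustering layers follow the settings reported in Tab. 2 and Sec. 4.3, but the strides, hidden widths, and activations are assumptions made for illustration rather than the authors' released architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """f(x; theta_e): maps a 1x28x28 image to a d-dimensional latent code z."""
    def __init__(self, d=120):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 7 -> 4
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 4 -> 2
        )
        self.fc = nn.Linear(32 * 2 * 2, d)

    def forward(self, x):
        h = self.conv(x)
        return self.fc(h.flatten(1))

class Decoder(nn.Module):
    """g(z; theta_d): reconstructs an image from a latent code (shared by both branches)."""
    def __init__(self, d=120):
        super().__init__()
        self.fc = nn.Linear(d, 32 * 2 * 2)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(32, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 2 -> 4
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1), nn.ReLU(),                     # 4 -> 7
            nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 7 -> 14
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid()  # 14 -> 28
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 32, 2, 2)
        return self.deconv(h)

class ClusterNet(nn.Module):
    """c(z; theta_y): four fully connected layers mapping z to a K-dimensional output y."""
    def __init__(self, d=120, K=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, K),
        )

    def forward(self, z):
        return self.net(z)
```

The later sketches assume these modules, or tensors produced by them, as their interface.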
3.1. Discriminative latent representation

We first train the dual autoencoder network to embed the inputs into a latent space. Based on the original reconstruction loss, we add a noise-disturbed reconstruction loss to learn the decoder network. In addition, we introduce the maximization of mutual information [13] into the learning procedure of the encoder network, so that the network can obtain more robust representations.

Encoder: Feature extraction is the major step in clustering, and a good feature can effectively improve clustering performance. However, a single reconstruction loss cannot well guarantee the quality of the latent representations. We hope that the representations help us to identify the corresponding samples among the inputs, which means that they capture the most unique information extracted from the inputs. Mutual information measures the essential correlation between two variables and can effectively estimate the similarity between the features Z and the inputs X. The mutual information is defined as:

I(X, Z) = \iint p(z|x) p(x) \log \frac{p(z|x)}{p(z)} dx dz = \mathrm{KL}(p(z|x)p(x) \| p(z)p(x)),   (1)

where p(x) is the distribution of the inputs, p(z|x) is the distribution of the latent representations, and the distribution of the latent space p(z) can be calculated by p(z) = \int p(z|x) p(x) dx. The mutual information is expected to be as large as possible when training the encoder network, hence we have:

p(z|x) = \arg\max_{\theta_e} I(X, Z).   (2)
In addition, the learned latent representations are required to obey the prior distribution of the standard normal distribution through KL divergence, which is beneficial for making the latent space more regular. The distribution difference between p(z) and its prior q(z) is defined as:

\mathrm{KL}(p(z) \| q(z)) = \int p(z) \log \frac{p(z)}{q(z)} dz.   (3)

According to Eqs. (2) and (3), we have:

p(z|x) = \arg\min_{\theta_e} \left\{ -\iint p(z|x) p(x) \log \frac{p(z|x)}{p(z)} dx dz + \alpha \int p(z) \log \frac{p(z)}{q(z)} dz \right\}.   (4)

It can be further rewritten as:

p(z|x) = \arg\min_{\theta_e} \left\{ \iint p(z|x) p(x) \left[ -(\alpha + 1) \log \frac{p(z|x)}{p(z)} + \alpha \log \frac{p(z|x)}{q(z)} \right] dx dz \right\}.   (5)

According to Eq. (1), Eq. (5) can be viewed as:

p(z|x) = \arg\min_{\theta_e} \left\{ -\beta I(X, Z) + \gamma \mathbb{E}_{x \sim p(x)} [\mathrm{KL}(p(z|x) \| q(z))] \right\}.   (6)

Unfortunately, the KL divergence is unbounded. Instead of the KL divergence, the JS divergence is therefore adopted for mutual information maximization:

p(z|x) = \arg\min_{\theta_e} \left\{ -\beta \mathrm{JS}(p(z|x)p(x), p(z)p(x)) + \gamma \mathbb{E}_{x \sim p(x)} [\mathrm{KL}(p(z|x) \| q(z))] \right\}.   (7)

The variational estimation of the JS divergence [33] is defined as:

\mathrm{JS}(p(x) \| q(x)) = \max_{T} \left( \mathbb{E}_{x \sim p(x)} [\log \sigma(T(x))] + \mathbb{E}_{x \sim q(x)} [\log(1 - \sigma(T(x)))] \right),   (8)

where T(x) = \log \frac{2 p(x)}{p(x) + q(x)} [33]. Here p(z|x)p(x) and p(z)p(x) are utilized to replace p(x) and q(x). As a result, Eq. (7) can be written as:

p(z|x) = \arg\min_{\theta_e} \left\{ -\beta \left( \mathbb{E}_{(x,z) \sim p(z|x)p(x)} [\log \sigma(T(x, z))] + \mathbb{E}_{(x,z) \sim p(z)p(x)} [\log(1 - \sigma(T(x, z)))] \right) + \gamma \mathbb{E}_{x \sim p(x)} [\mathrm{KL}(p(z|x) \| q(z))] \right\}.   (9)

Negative sampling estimation [13], which uses a discriminator to distinguish real samples from noisy ones in order to estimate the distribution of the real samples, is generally utilized to solve the problem in Eq. (9). σ(T(x, z)) is a discriminator, where x and its latent representation z together form a positive sample pair. We randomly select z_t from the disturbed batch to construct a negative sample pair for x. Note that Eq. (9) represents the global mutual information between X and Z.

Figure 3. Local mutual information estimation: the M × M feature map from the middle layer (and the same map drawn from another image as a negative sample) is concatenated with the replicated latent vector, and a 1 × 1 convolutional discriminator scores each location as "real" or "fake".

Furthermore, we extract the feature map from the middle layer of the convolutional network and construct the relationship between this feature map and the latent representation, which is the local mutual information. The estimation method plays the same role as for the global mutual information. The middle-layer features are combined with the latent representation to obtain a new feature map, and a 1×1 convolution is then used as the estimation network of the local mutual information, as shown in Fig. 3. The selection of negative samples is the same as in the global mutual information estimation. Therefore, the objective function that needs to be optimized can be defined as:

L_e = -\beta \left( \mathbb{E}_{(x,z) \sim p(z|x)p(x)} [\log \sigma(T_1(x, z))] + \mathbb{E}_{(x,z) \sim p(z)p(x)} [\log(1 - \sigma(T_1(x, z)))] \right)
      - \frac{\beta}{hw} \sum_{i,j} \left( \mathbb{E}_{(x,z) \sim p(z|x)p(x)} [\log \sigma(T_2(C_{ij}, z))] + \mathbb{E}_{(x,z) \sim p(z)p(x)} [\log(1 - \sigma(T_2(C_{ij}, z)))] \right)
      + \gamma \mathbb{E}_{x \sim p(x)} [\mathrm{KL}(p(z|x) \| q(z))],   (10)

where h and w represent the height and width of the feature map, C_{ij} represents the feature vector of the middle feature map at coordinates (i, j), and q(z) is the standard normal distribution.
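The following sketch estimates the minibatch version of Eq. (10), with negative pairs obtained by shuffling the latent codes within the batch. The bilinear global discriminator T1, the 1×1 convolutional local discriminator T2, and the Gaussian parameterization of p(z|x) used for the KL term are illustrative assumptions; the paper does not specify these architectural details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDiscriminator(nn.Module):
    """T1(x, z): scores an (image-feature, latent) pair via a bilinear form on pooled conv features."""
    def __init__(self, c_feat=32, d=120):
        super().__init__()
        self.bilinear = nn.Bilinear(c_feat, d, 1)

    def forward(self, feat_map, z):
        pooled = feat_map.mean(dim=(2, 3))          # (B, c_feat)
        return self.bilinear(pooled, z).squeeze(1)  # (B,) raw scores

class LocalDiscriminator(nn.Module):
    """T2(C_ij, z): 1x1 convolutions over the feature map concatenated with the replicated latent vector."""
    def __init__(self, c_feat=32, d=120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_feat + d, 256, 1), nn.ReLU(),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, feat_map, z):
        b, _, h, w = feat_map.shape
        z_rep = z.unsqueeze(-1).unsqueeze(-1).expand(b, z.size(1), h, w)
        return self.net(torch.cat([feat_map, z_rep], dim=1)).squeeze(1)  # (B, h, w)

def mi_encoder_loss(feat_map, z, mu, logvar, t1, t2, beta=0.01, gamma=1.0):
    """Sketch of Eq. (10): global + local MI terms with shuffled negatives, plus KL(p(z|x) || N(0, I))."""
    perm = torch.randperm(z.size(0), device=z.device)
    z_neg = z[perm]  # negative latent codes drawn from the shuffled batch

    # Global mutual information term (JS/GAN-style binary cross-entropy).
    pos = t1(feat_map, z)
    neg = t1(feat_map, z_neg)
    global_term = F.softplus(-pos).mean() + F.softplus(neg).mean()  # -log sigma(pos) - log(1 - sigma(neg))

    # Local mutual information term, averaged over the h*w spatial locations.
    pos_map = t2(feat_map, z)
    neg_map = t2(feat_map, z_neg)
    local_term = F.softplus(-pos_map).mean() + F.softplus(neg_map).mean()

    # KL term, assuming the encoder parameterizes p(z|x) as N(mu, diag(exp(logvar))).
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()

    return beta * (global_term + local_term) + gamma * kl
```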
Decoder: In existing decoder networks, the reconstruction loss is generally a suboptimal scheme for clustering, due to the natural trade-off between the reconstruction and clustering tasks. The reconstruction loss mainly depends on two parts: the distribution of the latent representations and the generative capacity of the decoder network. However, the generative capacity of the decoder network is not required in the clustering task. Our real goal is not to obtain the best reconstruction results, but to get more discriminative features for clustering. We directly use noise disturbance in the latent space to discard known nuisance factors from the latent representations. Models trained in this fashion become robust by exclusion rather than inclusion, and are expected to perform well on clustering tasks even when the inputs contain unseen nuisances [15]. A noisy-transformer ψ is utilized to convert the latent representations Z into their noisy versions Ẑ, and the decoder then reconstructs the inputs from both Ẑ and Z. The reconstruction results can be defined as \tilde{x}_{\hat{z}_i} = g(\hat{z}_i; \theta_d) and \tilde{x}_{z_i} = g(z_i; \theta_d), and the relative reconstruction loss can be written as:

L_r(\tilde{x}_{\hat{z}_i}, \tilde{x}_{z_i}) = \| \tilde{x}_{\hat{z}_i} - \tilde{x}_{z_i} \|_F^2,   (11)

where \| \cdot \|_F stands for the Frobenius norm. We also use the original reconstruction loss to ensure the performance of the decoder network, and consider ψ as multiplicative Gaussian noise. The complete reconstruction loss can be defined as:

L_r = \| \tilde{x}_{\hat{z}_i} - \tilde{x}_{z_i} \|_F^2 + \delta \| x - \tilde{x}_{z_i} \|_F^2,   (12)

where δ controls the relative strength of the two reconstruction terms.

Hence, considering all of the items, the total loss of the autoencoder network can be defined as:

\min_{\theta_d, \theta_e} L_r + L_e.   (13)
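A minimal sketch of Eqs. (11) and (12), assuming the noisy-transformer ψ applies multiplicative Gaussian noise z ⊙ ε with ε ~ N(1, σ²I) and that a single decoder is shared by both reconstruction branches; the noise level σ is a hypothetical choice.

```python
import torch
import torch.nn.functional as F

def noisy_transformer(z, sigma=0.5):
    """psi(z): multiplicative Gaussian noise applied to the latent code."""
    eps = 1.0 + sigma * torch.randn_like(z)
    return z * eps

def dual_reconstruction_loss(x, z, decoder, delta=0.5, sigma=0.5):
    """Eq. (12): relative loss between the two reconstructions plus the original reconstruction loss."""
    x_rec = decoder(z)                                   # reconstruction from the clean latent code
    x_rec_noisy = decoder(noisy_transformer(z, sigma))   # reconstruction from the disturbed code
    relative = F.mse_loss(x_rec_noisy, x_rec)            # ~ || x_tilde_zhat - x_tilde_z ||^2 (mean-reduced)
    original = F.mse_loss(x_rec, x)                      # ~ || x - x_tilde_z ||^2 (mean-reduced)
    return relative + delta * original
```

During joint training this term is added to the encoder loss L_e of the previous sketch, which corresponds to Eq. (13).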
3.2. Deep Spectral Clustering

The learned autoencoder parameters \theta_e and \theta_d are used as the initial condition in the clustering phase. Spectral clustering can effectively use the relationships between samples to reduce intra-class differences, and produces better clustering results than K-means. In this step, we first adopt the autoencoder network to learn the latent representations. Next, a spectral clustering method is used to embed the latent representations into the eigenspace of their associated graph Laplacian matrix [25]. All the samples are subsequently clustered in this space. Finally, both the autoencoder parameters and the clustering objective are jointly optimized.

Specifically, we first utilize the latent representations Z to construct the non-negative affinity matrix W:

W_{i,j} = e^{-\frac{\| z_i - z_j \|^2}{2 \sigma^2}}.   (14)

The loss function of spectral clustering is defined as:

L_c = \mathbb{E}[ W_{i,j} \| y_i - y_j \|^2 ],   (15)

where y_i is the output of the network. When we adopt a general neural network to output y, we randomly select a minibatch of m samples at each iteration, and the loss function can thus be defined as:

L_c = \frac{1}{m^2} \sum_{i,j=1}^{m} W_{i,j} \| y_i - y_j \|^2.   (16)

In order to prevent all points from being mapped to the same cluster by the network, the output y is required to be orthonormal in expectation, that is:

\frac{1}{m} Y^T Y = I_{k \times k},   (17)

where Y is the m × k matrix of the outputs whose i-th row is y_i^T. The last layer of the network is utilized to enforce this orthogonality constraint [36]. This layer takes its input from K units and acts as a linear layer with K outputs, whose weights are set to orthogonalize the output, producing the orthogonalized output Y for a minibatch. Let Ỹ denote the m × k matrix containing the inputs to this layer for the minibatch; a linear map that orthogonalizes the columns of Ỹ is computed through its QR decomposition. Since A^T A is full rank for any matrix A with full column rank, the QR decomposition can be obtained from the Cholesky decomposition:

A^T A = B B^T,   (18)

where B is a lower triangular matrix and Q = A (B^{-1})^T. Therefore, in order to orthogonalize Ỹ, the last layer multiplies Ỹ from the right by \sqrt{m} (B^{-1})^T, where B is obtained from the Cholesky decomposition of Ỹ^T Ỹ; the \sqrt{m} factor is needed to satisfy Eq. (17).
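The sketch below implements Eqs. (14), (16), and (17) on a minibatch: a Gaussian affinity over the latent codes, the pairwise spectral loss, and the Cholesky-based orthogonalization of the clustering outputs. Detaching the orthogonalizing factor so that it acts as a fixed linear map during backpropagation follows the SpectralNet recipe and is an assumption here, since the paper does not spell out that detail.

```python
import torch

def gaussian_affinity(z, sigma=1.0):
    """Eq. (14): W_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)) on a minibatch of latent codes."""
    dist2 = torch.cdist(z, z).pow(2)
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def orthogonalize(y_tilde):
    """Eq. (17): right-multiply by sqrt(m) * (B^{-1})^T, with B the Cholesky factor of Y~^T Y~."""
    m = y_tilde.size(0)
    gram = y_tilde.t() @ y_tilde                 # k x k Gram matrix, assumed full rank
    B = torch.linalg.cholesky(gram)              # lower triangular, gram = B B^T
    # Detach so the orthogonalizing map is treated as a fixed linear layer in this step.
    return (m ** 0.5) * y_tilde @ torch.linalg.inv(B).t().detach()

def spectral_loss(z, y_tilde, sigma=1.0):
    """Eq. (16): (1/m^2) * sum_ij W_ij ||y_i - y_j||^2 over the minibatch."""
    m = z.size(0)
    W = gaussian_affinity(z, sigma)
    y = orthogonalize(y_tilde)
    dist2 = torch.cdist(y, y).pow(2)
    return (W * dist2).sum() / (m ** 2)
```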
We unify the latent representation learning and the spectral clustering using KL divergence. In the clustering phase, the last term of Eq. (10) can be rewritten as:

\mathbb{E}_{x \sim p(x)} [\mathrm{KL}(p((y, z)|x) \| q(y, z))],   (19)

where p((y, z)|x) = p(y|z) p(z|x) and q(y, z) = q(z|y) q(y). Note that q(z|y) is a normal distribution with mean \mu_y and variance 1. Therefore, the overall loss of the autoencoder and the spectral clustering network is defined as:

\min_{\theta_d, \theta_e, \theta_y} L_r + L_e + L_c.   (20)

Finally, we jointly optimize the two networks until convergence to obtain the desired clustering results.
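Putting the pieces together, one joint update of Eq. (20) could look like the following, reusing the hypothetical helpers from the previous sketches; the use of the last convolutional feature map for the local mutual information term and the crude Gaussian stand-in for p(z|x) are assumptions, not the authors' exact procedure.

```python
import torch

def train_step(x, encoder, decoder, cluster_net, t1, t2, optimizer,
               beta=0.01, gamma=1.0, delta=0.5):
    """One joint update of Eq. (20): L_r + L_e + L_c on a minibatch x."""
    optimizer.zero_grad()
    feat_map = encoder.conv(x)              # convolutional feature map used for the local MI term
    z = encoder.fc(feat_map.flatten(1))     # latent representations
    mu, logvar = z, torch.zeros_like(z)     # crude Gaussian stand-in for p(z|x) in the KL term
    loss = (dual_reconstruction_loss(x, z, decoder, delta)                    # L_r, Eq. (12)
            + mi_encoder_loss(feat_map, z, mu, logvar, t1, t2, beta, gamma)   # L_e, Eq. (10)
            + spectral_loss(z, cluster_net(z)))                               # L_c, Eq. (16)
    loss.backward()
    optimizer.step()
    return loss.item()
```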
4. Experiments

In this section, we evaluate the effectiveness of the proposed clustering method on five benchmark datasets, and then compare its performance with several state-of-the-art methods.
Table 1. Description of Datasets

Dataset        Samples  Classes  Dimensions
MNIST-full     70,000   10       1×28×28
MNIST-test     10,000   10       1×28×28
USPS           9,298    10       1×16×16
Fashion-MNIST  70,000   10       1×28×28
YTF            10,000   41       3×55×55

Figure 4. Image samples from the benchmark datasets used in our experiments: (a) MNIST; (b) Fashion-MNIST.

4.1. Datasets

In order to show that our method works well with various kinds of datasets, we choose the following image datasets. Considering that clustering tasks are fully unsupervised, we concatenate the training and testing samples when applicable. MNIST-full [20]: a dataset containing a total of 70,000 handwritten digits, with 60,000 training and 10,000 testing samples, each being a 28×28 monochrome image. MNIST-test: a dataset consisting only of the testing part of MNIST-full. USPS: a handwritten digits dataset from the USPS postal service, containing 9,298 samples of 16×16 images. Fashion-MNIST [39]: this dataset has the same number of images and the same image size as MNIST, but it is considerably more complicated; instead of digits, it consists of various types of fashion products. YTF: we adopt the first 41 subjects of the YTF dataset, and the images are cropped and resized to 55×55. Some image samples are shown in Fig. 4. Brief descriptions of the datasets are given in Tab. 1.

4.2. Clustering Metrics

To evaluate the clustering results, we adopt two standard evaluation metrics: Accuracy (ACC) and Normalized Mutual Information (NMI) [41].

The best mapping between cluster assignments and true labels is computed using the Hungarian algorithm [19] to measure accuracy. For completeness, we define ACC as:

\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n},   (21)

where l_i and c_i are the true label and the predicted cluster of data point x_i, and m ranges over all one-to-one mappings between clusters and labels.

NMI calculates a normalized measure of similarity between two labelings of the same data, defined as:

\mathrm{NMI} = \frac{I(l; c)}{\max\{H(l), H(c)\}},   (22)

where I(l; c) denotes the mutual information between the true labels l and the predicted clusters c, and H represents their entropy. NMI does not change under permutations of the clusters (classes), and it is normalized to the range [0, 1], with 0 meaning no correlation and 1 indicating perfect correlation.
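Both metrics can be computed with standard tooling; the sketch below uses SciPy's Hungarian solver for the optimal cluster-to-label mapping in Eq. (21) and scikit-learn's NMI with the max-entropy normalization of Eq. (22).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """Eq. (21): best one-to-one mapping between predicted clusters and true labels (Hungarian algorithm)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n_classes = max(labels_true.max(), labels_pred.max()) + 1
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1                       # co-occurrence counts between clusters and labels
    row, col = linear_sum_assignment(-cost)   # maximize the matched counts
    return cost[row, col].sum() / labels_true.size

def clustering_nmi(labels_true, labels_pred):
    """Eq. (22): mutual information normalized by max{H(l), H(c)}."""
    return normalized_mutual_info_score(labels_true, labels_pred, average_method="max")
```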
4.3. Implementation Details

In our experiments, we set β = 0.01, γ = 1, and δ = 0.5. The channel numbers and kernel sizes of the autoencoder network are shown in Tab. 2, and the dimension of the latent space is set to 120. The deep spectral clustering network consists of four fully connected layers, and we adopt ReLU [26] as the non-linear activation. We construct the original weight matrix W with probabilistic K-nearest neighbors for each dataset; the weight W_{ij} is calculated on a nearest-neighbor graph [11], and the number of neighbors is set to 3.
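As one concrete way to realize this construction, the sketch below builds a sparse Gaussian-weighted k-nearest-neighbor affinity with scikit-learn; the symmetrization step and the fixed bandwidth σ are our assumptions, since the paper does not give the exact formula for its probabilistic k-NN weights.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_affinity(z, n_neighbors=3, sigma=1.0):
    """Sparse Gaussian affinity restricted to the k nearest neighbors, then symmetrized."""
    dist = kneighbors_graph(z, n_neighbors, mode="distance", include_self=False)
    dist.data = np.exp(-dist.data ** 2 / (2.0 * sigma ** 2))   # Gaussian weights on neighbor edges
    return 0.5 * (dist + dist.T)                               # make the affinity symmetric
```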
concatenate the training and testing samples when applica- 4.4. Comparison Methods
ble. MNIST-full [20]: A dataset containing a total of 70,000 We compare our clustering model with several base-
handwritten digits with 60,000 training and 10,000 testing lines, including K-means [28], spectral clustering with nor-
samples, each being a 32×32 monochrome image. MNIST- malized cuts (SC-Ncut) [37], large-scale spectral clustering
test: A dataset only consists of the testing part of MNIST- (SC-LS) [4], NMF [2], graph degree linkage-based agglom-
full data. USPS: A handwritten digits dataset from the erative clustering (AC-GDL) [49]. In addition, we also eval-
USPS postal service, containing 9,298 samples of 16×16 uate the performance of our method with several state-of-
images. Fashion-MNIST [39]: This dataset has the same the-art clustering algorithms based on deep learning, includ-
number of images and the same image size with MNIST, ing deep adversarial subspace clustering (DASC) [50], deep
but it is fairly more complicated. Instead of digits, it con- embedded clustering (DEC) [40], variational deep embed-
sists of various types of fashion products. YTF: We adopt ding (VaDE) [18], joint unsupervised learning (JULE) [44],
the first 41 subjects of YTF dataset and the images are first deep embedded regularized clustering (DEPICT) [10], im-
cropped and resized to 55 × 55. Some image samples are proved deep embedded clustering with locality preservation
shown in Fig. 4. The brief descriptions of the datasets are (IDEC) [12], deep spectral clustering with a set of near-
given in Tab. 1. est neighbor pairs (SpectralNet) [36], clustering with GAN
(ClusterGAN) [30] and GAN with the mutual information
4.2. Clustering Metrics (InfoGAN) [5].

To evaluate the clustering results, we adopt two standard 4.5. Evaluation of Clustering Algorithm
evaluation metrics: Accuracy (ACC) and Normalized Mu- We run our method with 10 random trials and report the
tual Information (NMI) [41]. average performance, the error range is no more than 2%. In
The best mapping between cluster assignments and true terms of the compared methods, if the results of their meth-
labels is computed using the Hungarian algorithm to mea- ods on some datasets are not reported, we run the released
Table 2. Description of the structure of the autoencoder network

Dataset        encoder-1/decoder-4  encoder-2/decoder-3  encoder-3/decoder-2  encoder-4/decoder-1
MNIST          3×3×16               3×3×16               3×3×32               3×3×32
USPS           3×3×16               3×3×32               -                    -
Fashion-MNIST  3×3×16               3×3×16               3×3×32               3×3×32
YTF            5×5×16               5×5×16               5×5×32               5×5×32

Table 3. Clustering performance of different algorithms on five datasets based on ACC and NMI

Method            MNIST-full       MNIST-test       USPS             Fashion-10       YTF
                  NMI     ACC      NMI     ACC      NMI     ACC      NMI     ACC      NMI     ACC
K-means [28]      0.500   0.532    0.501   0.546    0.601   0.668    0.512   0.474    0.776   0.601
SC-Ncut [37]      0.731   0.656    0.704   0.660    0.794   0.649    0.575   0.508    0.701   0.510
SC-LS [4]         0.706   0.714    0.756   0.740    0.755   0.746    0.497   0.496    0.759   0.544
NMF [2]           0.452   0.471    0.467   0.479    0.693   0.652    0.425   0.434    -       -
AC-GDL [49]       0.017   0.113    0.864   0.933    0.825   0.725    0.010   0.112    0.622   0.430
DASC [50]         0.784   0.801*   0.780   0.804    -       -        -       -        -       -
DEC [40]          0.834   0.863*   0.830*  0.856*   0.767*  0.762*   0.546*  0.518*   0.446*  0.371*
VaDE [18]         0.876   0.945    -       -        0.512   0.566    0.630   0.578    -       -
JULE [44]         0.913*  0.964*   0.915*  0.961*   0.913   0.950    0.608   0.563    0.848   0.684
DEPICT [10]       0.917*  0.965*   0.915*  0.963*   0.906   0.899    0.392   0.392    0.802   0.621
IDEC [12]         0.867*  0.881*   0.802   0.846    0.785*  0.761*   0.557   0.529    -       -
SpectralNet [36]  0.814   0.800    0.821   0.817    -       -        -       -        0.798   0.685
InfoGAN [5]       0.840   0.870    -       -        -       -        0.590   0.610    -       -
ClusterGAN [30]   0.890   0.950    -       -        -       -        0.640   0.630    -       -
Our Method        0.941   0.978    0.946   0.980    0.857   0.869    0.645   0.662    0.857   0.691

The clustering results are shown in Tab. 3, where the first five methods are conventional clustering methods. From the table, we can see that our proposed method outperforms the competing methods on these benchmark datasets. We observe that the proposed method improves the clustering performance both on the digit datasets and on the product dataset. In particular, on MNIST-test the clustering accuracy is over 98%. Specifically, it exceeds the second-best method, DEPICT, which is trained on the noisy versions of the inputs, by 1.6% in ACC and 3.1% in NMI. Moreover, our method achieves much better clustering results than several classical shallow baselines. This is because, compared with the shallow methods, our method uses a multi-layer convolutional autoencoder as the feature extractor and adopts a deep clustering network to obtain the optimal clustering results. The Fashion-MNIST dataset is very difficult to deal with due to the complexity of its samples, but our method still obtains good results.

We also investigate the parameter sensitivity on MNIST-test, and the results are shown in Fig. 5, where Fig. 5(a) reports the ACC obtained with different parameters and Fig. 5(b) reports the corresponding NMI. The figure intuitively demonstrates that our method maintains acceptable results for most parameter combinations and is relatively stable.

Figure 5. ACC (a) and NMI (b) of our method with different β and γ on the MNIST dataset.

4.6. Evaluation of Learning Approach

We compare different strategies for training our model. For training a multi-layer convolutional autoencoder, we analyze the following four approaches: (1) a convolutional autoencoder with the original reconstruction loss (ConvAE), (2) a convolutional autoencoder with the original reconstruction loss and mutual information (ConvAE+MI), (3) a convolutional autoencoder with the improved reconstruction loss (ConvAE+RS), and (4) a convolutional autoencoder with the improved reconstruction loss and mutual information (ConvAE+MI+RS). The last strategy in Tab. 4 (ConvAE+MI+RS+SN) is the joint training of the convolutional autoencoder and the deep spectral clustering network.
Table 4. Clustering performance with different strategies on five datasets based on ACC and NMI

Method           MNIST-full      MNIST-test      USPS            Fashion-10      YTF
                 NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC
ConvAE           0.745   0.776   0.751   0.781   0.652   0.698   0.556   0.546   0.642   0.476
ConvAE+MI        0.800   0.835   0.796   0.844   0.744   0.785   0.609   0.592   0.738   0.571
ConvAE+RS        0.803   0.841   0.801   0.850   0.752   0.798   0.597   0.614   0.721   0.558
ConvAE+MI+RS     0.910   0.957   0.914   0.961   0.827   0.831   0.640   0.656   0.801   0.606
ConvAE+MI+RS+SN  0.941   0.978   0.946   0.980   0.857   0.869   0.645   0.662   0.857   0.691

Figure 6. Visualization of the discriminative capability of the embedding subspaces on MNIST-test data: (a) raw data, (b) ConvAE, (c) DEC, (d) SpectralNet, (e) ConvAE+RS, (f) ConvAE+MI, (g) ConvAE+RS+MI, (h) ConvAE+MI+RS+SN.

Tab. 4 reports the performance of the different training strategies. It clearly demonstrates that each strategy of our method improves the clustering accuracy effectively, especially after adding the mutual information and the improved reconstruction loss to the convolutional autoencoder network. Fig. 6 demonstrates the importance of the proposed strategies by comparing different data representations of MNIST-test data points using t-SNE visualization [27]: Fig. 6(a) shows the space of the raw data, Fig. 6(b) the data points in the latent subspace of a convolutional autoencoder, Fig. 6(c) and 6(d) the results of DEC and SpectralNet, respectively, and the remaining panels the proposed model with different strategies. The results demonstrate that the latent representations obtained by our method have a clearer distribution structure.

5. Conclusion

In this paper, we propose an unsupervised deep clustering method with a dual autoencoder network and a deep spectral network. First, the dual autoencoder, which reconstructs the inputs using the latent representations and their noise-contaminated versions, is utilized to establish the relationships between the inputs and the latent representations in order to obtain more robust latent representations. Furthermore, we maximize the mutual information between the inputs and the latent representations, which preserves the information of the inputs as much as possible. Hence, the features of the latent space obtained by our autoencoder are robust to noise and more discriminative. Finally, the spectral clustering network is fused into a unified framework to cluster the features of the latent space, so that the relationships between the samples can be effectively utilized. We evaluate our method on several benchmarks, and the experimental results show that it outperforms state-of-the-art approaches.

6. Acknowledgement

Our work was supported by the National Natural Science Foundation of China under Grants 61572388, 61703327 and 61602176, the Key R&D Program-The Key Industry Innovation Chain of Shaanxi under Grants 2017ZDCXL-GY-05-04-02, 2017ZDCXL-GY-05-02 and 2018ZDXM-GY-176, and the National Key R&D Program of China under Grant 2017YFE0104100.
References

[1] Lingling An, Xinbo Gao, Xuelong Li, Dacheng Tao, Cheng Deng, Jie Li, et al. Robust reversible watermarking via clustering and enhanced pixel-wise masking. IEEE Transactions on Image Processing, 21(8):3598–3611, 2012.
[2] Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Jiawei Han. Locality preserving nonnegative matrix factorization. In IJCAI, volume 9, pages 1010–1015, 2009.
[3] Pu Chen, Xinyi Xu, and Cheng Deng. Deep view-aware metric learning for person re-identification. In IJCAI, pages 620–626, 2018.
[4] Xinlei Chen and Deng Cai. Large scale spectral clustering with landmark-based representation. In AAAI, volume 5, page 14, 2011.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[6] Eric C Chi and Kenneth Lange. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.
[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893. IEEE, 2005.
[8] C Deng, E Yang, T Liu, W Liu, J Li, and D Tao. Unsupervised semantic-preserving adversarial hashing for image search. IEEE Transactions on Image Processing, 2019.
[9] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[10] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5747–5756. IEEE, 2017.
[11] Quanquan Gu and Jie Zhou. Co-clustering on manifolds. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 359–368. ACM, 2009.
[12] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1753–1759, 2017.
[13] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[14] Steven CH Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 6(3):18, 2010.
[15] Ayush Jaiswal, Rex Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Unsupervised adversarial invariance. In Advances in Neural Information Processing Systems, pages 5097–5107, 2018.
[16] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 24–33, 2017.
[17] Wenhao Jiang and Fu-lai Chung. Transfer spectral clustering. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 789–803. Springer, 2012.
[18] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
[19] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, and Dacheng Tao. Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, pages 4242–4251, 2018.
[22] Chao Li, Cheng Deng, Lei Wang, De Xie, and Xianglong Liu. Coupled cyclegan: Unsupervised hashing network for cross-modal retrieval. arXiv preprint arXiv:1903.02149, 2019.
[23] Fengfu Li, Hong Qiao, and Bo Zhang. Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognition, 83:161–173, 2018.
[24] Yeqing Li, Junzhou Huang, and Wei Liu. Scalable sequential spectral clustering. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Wei Liu, Junfeng He, and Shih-Fu Chang. Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 679–686, 2010.
[26] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[28] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
[29] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.
[30] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan: Latent space clustering in generative adversarial networks. arXiv preprint arXiv:1809.03627, 2018.
[31] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[32] Pauline C Ng and Steven Henikoff. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13):3812–3814, 2003.
[33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[34] Sohil Atul Shah and Vladlen Koltun. Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819, 2017.
[35] Sohil Atul Shah and Vladlen Koltun. Deep continuous clustering. arXiv preprint arXiv:1803.01449, 2018.
[36] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
[37] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[38] Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deep discriminative latent space for clustering. arXiv preprint arXiv:1805.10795, 2018.
[39] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[40] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
[41] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003.
[42] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. arXiv preprint arXiv:1610.04794, 2016.
[43] Erkun Yang, Cheng Deng, Tongliang Liu, Wei Liu, and Dacheng Tao. Semantic structure-based unsupervised deep hashing. In IJCAI, pages 1064–1070, 2018.
[44] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
[45] Muli Yang, Cheng Deng, and Feiping Nie. Adaptive-weighting discriminative regression for multi-view classification. Pattern Recognition, 88(4):236–245, 2019.
[46] Xu Yang, Cheng Deng, Xianglong Liu, and Feiping Nie. New l2,1-norm relaxation of multi-way graph cut for clustering. In AAAI, 2018.
[47] Jinfeng Yi, Lijun Zhang, Tianbao Yang, Wei Liu, and Jun Wang. An efficient semi-supervised clustering algorithm with sequential constraints. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1405–1414. ACM, 2015.
[48] Tianshu Yu, Junchi Yan, Wei Liu, and Baoxin Li. Incremental multi-graph matching via diversity and randomness based graph clustering. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
[49] Wei Zhang, Deli Zhao, and Xiaogang Wang. Agglomerative clustering via maximum incremental path integral. Pattern Recognition, 46(11):3056–3065, 2013.
[50] Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1596–1604, 2018.
