
Knowledge-Based Systems 197 (2020) 105841


Deep density-based image clustering



Yazhou Ren a,b,∗, Ni Wang a, Mingxia Li a, Zenglin Xu c

a School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
b Institute of Electronic and Information Engineering of UESTC in Guangdong, Dongguan 523808, China
c School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

∗ Corresponding author at: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China. E-mail address: yazhou.ren@uestc.edu.cn (Y. Ren).

Article info

Article history:
Received 11 August 2019
Received in revised form 4 January 2020
Accepted 29 March 2020
Available online 11 April 2020

Keywords:
Deep clustering
Density-based clustering
Feature learning

Abstract

Recently, deep clustering, which is able to perform feature learning that favors clustering tasks via deep neural networks, has achieved remarkable performance in image clustering applications. However, the existing deep clustering algorithms generally need the number of clusters in advance, which is usually unknown in real-world tasks. In addition, the initial cluster centers in the learned feature space are generated by k-means. This only works well on spherical clusters and probably leads to unstable clustering results. In this paper, we propose a two-stage deep density-based image clustering (DDC) framework to address these issues. The first stage is to train a deep convolutional autoencoder (CAE) to extract low-dimensional feature representations from high-dimensional image data, and then apply t-SNE to further reduce the data to a 2-dimensional space favoring density-based clustering algorithms. In the second stage, we propose a novel density-based clustering technique for the 2-dimensional embedded data to automatically recognize an appropriate number of clusters with arbitrary shapes. Concretely, a number of local clusters are generated to capture the local structures of clusters, and then are merged via their density relationship to form the final clustering result. Experiments demonstrate that the proposed DDC achieves comparable or even better clustering performance than state-of-the-art deep clustering methods, even though the number of clusters is not given.

© 2020 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.knosys.2020.105841

1. Introduction

Image clustering is one of the most extensively studied topics in machine learning and has many applications in a wide range of fields, including image retrieval [1,2] and annotation [3]. It seeks to partition images into clusters according to a similarity measure, such that similar images are grouped in the same cluster and images which are dissimilar from each other are grouped into different clusters. A number of traditional clustering methods have been proposed in the past decades, such as partitional clustering [4,5], hierarchical clustering [6], density-based clustering [7–11], distribution-based clustering [12], clustering based on non-negative matrix factorization (NMF) [13–15], etc. These methods usually fail to cluster image data sets with high dimensionality. The main reason is that reliable similarity measures are hard to obtain in the high-dimensional space.

To mitigate this issue, a common approach is to first reduce the dimensionality of the data via feature selection or feature extraction techniques, and then conduct clustering in the lower-dimensional space. Another way is to consider clustering and feature learning together in one framework; for example, Torre et al. perform k-means clustering and linear discriminant analysis jointly [16]. However, these shallow models typically have limited representation power, and thus their improvement in image clustering performance is not significant.

Recently, deep clustering methods, which perform feature learning by applying deep neural networks (DNN) and conduct clustering in the latent learned feature space, have shown impressive performance in image clustering tasks and have attracted increasing attention [17–28]. Despite this success, most of the existing deep clustering methods actually apply partitional clustering, e.g., k-means, in the latent learned feature space. This brings the following drawbacks: (1) The number of clusters must be given in advance, which is usually unknown in practical clustering tasks. (2) The partitional clustering techniques can only find spherical clusters and perform worse on irregular clusters or imbalanced data. (3) The k-means-like clustering methods involve randomness, probably leading to unstable clustering results.

Some methods have been proposed to estimate the number of clusters in deep clustering models [29–31]. However, these methods do not consider the local information of clusters, nor do they consider that points with different densities should play different roles in density-based clustering techniques. Thus, the performance of these methods is still not satisfactory, and two questions naturally arise: (1) How can deep clustering methods effectively find an appropriate number of clusters with irregular shapes when the number of clusters is not known a priori? (2) Do we really need to refine the deep neural networks with the initial cluster assignment?


In this paper, we aim to answer these two questions and propose a novel and effective deep density-based clustering (DDC) method for images. Specifically, DDC first learns a deep feature representation of the data via a deep autoencoder. Second, t-SNE [32] is adopted to further reduce the learned features to a 2-dimensional space while preserving the pairwise similarity of data instances. Finally, we develop a novel density-based clustering method which considers both the local structures of clusters and the importance of instances to generate the final clustering results. The source code of the proposed DDC is available at https://github.com/Yazhou-Ren/DDC.

The contributions of this work are as follows:

• We propose an effective density-based technique for deep clustering which can automatically find an appropriate number of image clusters with arbitrary shapes. We first reduce the original data to a 2-dimensional space and then develop a novel density-based clustering method for the learned data.
• DDC offers good cluster visualization and interpretability. Its properties are theoretically and empirically analyzed. Its efficiency and robustness to parameter setting are also empirically verified.
• Extensive experiments show that DDC becomes the new state-of-the-art deep clustering method on various image cluster discovery tasks when the number of clusters is unknown.

2. Related work

2.1. Deep clustering

Due to their good representation ability, deep neural networks (DNN) have gained impressive achievements in various types of machine learning and computer vision applications [33–35]. Most DNN methods focus on supervised problems in which the label information is known. In recent years, increasing attention has been paid to adopting DNNs in unsupervised learning tasks, and a number of deep clustering methods have been proposed.

One kind of deep clustering method divides the clustering procedure into two stages, i.e., feature learning and clustering. These methods first perform feature learning via DNN and then apply clustering algorithms in the learned space [19,20,36,37]. The other kind incorporates the two stages into one framework. Song et al. [38] refine the autoencoder such that data representations in the learned space are close to their affiliated cluster centers. Xie et al. [21] propose deep embedded clustering (DEC) to jointly learn the cluster assignment and the feature representations. Ren et al. [39] propose semi-supervised deep embedded clustering to enhance the performance of DEC by using pairwise constraints. Yang et al. [23] and Chang et al. [17] apply convolutional neural networks (CNN) for exploring image clusters. Guo et al. [40] improve DEC with local structure preservation. Guo et al. [18] use data augmentation in the DEC framework and achieve state-of-the-art clustering performance on several image data sets.

2.2. Density-based clustering

The key advantage of density-based clustering is that the number of clusters is not needed and clusters with arbitrary shape can be found. Over the past decades, many density-based clustering methods have been developed. DBSCAN [7] defines a cluster as a set of points from continuous high-density regions and treats those points in low-density regions as outliers or noise. Inspired by this popular algorithm, many density-based clustering methods have been designed, such as OPTICS [41], DENCLUE [42], DESCRY [43], and others [44–47]. DenPeak (clustering by fast search and find of density peaks) [48] is another immensely popular density-based clustering method, which assumes that cluster centers are located in regions of higher density and that the distances among different centers are relatively large. Several improvements of DenPeak have also been proposed [49–51]. The methods described above are applied in the original feature space. Thus, their performance in grouping high-dimensional images is unsatisfactory due to their limited representation ability.

Most recently, several deep clustering methods [29–31] which seek to address the issue of estimating the number of clusters have been proposed, i.e., DDC-UF (deep density clustering of unconstrained faces) [29], DCC (deep continuous clustering) [30], and DED (deep embedding determination) [31]. However, these methods ignore the local structures in each cluster and do not allow points to play different roles according to their densities. By contrast, the proposed DDC takes both the local information of clusters and the importance of points into account and achieves significant improvements in clustering performance.

3. Deep density-based image clustering

This section presents the proposed deep density-based image clustering (DDC) in detail. Let X = {x_i ∈ R^D}_{i=1}^n denote the image data set, where n is the number of data points and D is the dimensionality. DDC aims at grouping X into an appropriate number of disjoint clusters without any prior knowledge such as the number of clusters or label information. DDC is a two-stage deep clustering model which contains two main steps, i.e., deep feature learning, which nonlinearly transforms the original features to a low-dimensional space, and density-based clustering, which automatically recognizes an appropriate number of clusters with arbitrary shapes in the latent space.

3.1. Deep feature learning

As deep clustering methods generally do, we adopt a deep autoencoder to initialize the feature transformation due to its excellent representation ability. An autoencoder consists of two parts: the encoder h = f_Θ(x) (which maps each data point x to a learned representation h) and the decoder x′ = g_Ω(h) (which maps data from the learned feature space back to the original one). Here, the feature dimensionality of h is d. Θ and Ω denote the parameters of the encoder and decoder, respectively. In this paper, we use the denoising autoencoder [52], which solves the following problem:

    arg min_{Θ,Ω} (1/n) Σ_{i=1}^{n} ‖x_i − g_Ω(f_Θ(x̃_i))‖₂²    (1)

where x̃ is a corrupted copy of x obtained by adding noise, e.g., adding Gaussian noise or randomly setting a portion of the input to 0. We use the stacked autoencoder (SAE) [53] in this work, in which each layer is a denoising autoencoder trained to reconstruct the previous layer's output. For image clustering, we adopt the deep convolutional autoencoder (CAE) in the experiments, whose structure will be stated in Section 3.3.
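To make the training objective in Eq. (1) concrete, the following is a minimal sketch of a denoising autoencoder written with Keras. The layer sizes, noise level, and optimizer are illustrative assumptions on our part rather than settings taken from the paper; the convolutional structure actually used by DDC is given in Section 3.3.

```python
# Minimal sketch of the denoising autoencoder objective in Eq. (1).
# Layer sizes, noise level, and training settings are illustrative assumptions.
from tensorflow.keras import layers, models

def build_denoising_autoencoder(input_dim, latent_dim=10):
    inp = layers.Input(shape=(input_dim,))
    corrupted = layers.GaussianNoise(0.2)(inp)            # x~: corrupted copy of x (only active in training)
    h = layers.Dense(500, activation='relu')(corrupted)   # encoder f_Theta
    h = layers.Dense(latent_dim, name='embedding')(h)     # learned representation h
    out = layers.Dense(500, activation='relu')(h)          # decoder g_Omega
    out = layers.Dense(input_dim)(out)                     # reconstruction of the clean x
    ae = models.Model(inp, out)
    ae.compile(optimizer='adam', loss='mse')               # (1/n) sum ||x_i - g(f(x~_i))||^2
    encoder = models.Model(inp, h)                         # noise layer is inactive at inference time
    return ae, encoder

# Usage on flattened images X of shape (n, D), scaled to [0, 1]:
# ae, encoder = build_denoising_autoencoder(X.shape[1])
# ae.fit(X, X, batch_size=256, epochs=50)
# H = encoder.predict(X)   # d-dimensional features, later reduced to 2-D with t-SNE
```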
In [18], the data augmentation (DA) technique is used in the training process of the deep autoencoder and achieves significant improvements in clustering performance. The resulting optimization model is:

    arg min_{Θ,Ω} (1/n) Σ_{i=1}^{n} ‖x̄_i − g_Ω(f_Θ(x̄_i))‖₂²    (2)

where x̄_i = T_rand(x_i) denotes a random transformation of x_i (as in [18], we randomly shift by at most 3 pixels in each direction and randomly rotate by at most 10°).

When the training of the deep autoencoder (solving Eq. (1) or Eq. (2)) is finished, we obtain the feature representations H = {h_i = f_Θ(x_i) ∈ R^d}_{i=1}^n. For visualization and to better fit the designed density-based clustering algorithm, we further reduce the data H to a 2-dimensional space Z = {z_i ∈ R²}_{i=1}^n by using t-SNE [32], which preserves pairwise similarities well.

t-SNE is a dimensionality reduction method which can visualize high-dimensional data in a 2-dimensional space. Firstly, t-SNE defines the joint probability p_ij of data points h_i and h_j as:

    p_ij = (p_{j|i} + p_{i|j}) / (2n)    (3)

where

    p_{j|i} = exp(−‖h_i − h_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖h_i − h_k‖² / 2σ_i²)    (4)

Here, σ_i is a bandwidth parameter for h_i. Secondly, the joint probability q_ij of z_i and z_j in the learned 2-dimensional space is calculated as:

    q_ij = (1 + ‖z_i − z_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖z_k − z_l‖²)⁻¹    (5)

Both p_ii and q_ii are set to 0. Then, t-SNE seeks to minimize the Kullback–Leibler divergence between the two joint probability distributions P and Q:

    KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (6)

When all the 2-dimensional data points {z_i ∈ R²}_{i=1}^n are obtained, we develop a novel density-based clustering in the embedded space Z as described below.

3.2. Density-based clustering

We propose a novel density-based clustering method to obtain an appropriate partition of the data Z = {z_i ∈ R²}_{i=1}^n in the 2-dimensional feature space when the number of clusters is unavailable.

3.2.1. Local clusters generation

DDC shares two fundamental definitions (i.e., ρ_i and δ_i of point z_i) with DenPeak [48]. Concretely, DDC defines the density ρ_i of point z_i via a Gaussian kernel:

    ρ_i = Σ_{z_j ∈ Z∖{z_i}} exp(−(d_ij / d_c)²)    (7)

where d_ij is the Euclidean distance between points z_i and z_j, and d_c is the cutoff distance that needs to be predefined. A higher value of ρ_i means a higher density around point z_i. δ_i of point z_i denotes the minimum Euclidean distance between z_i and those points whose densities are larger than that of z_i. That is,

    δ_i = min_{j: ρ_j > ρ_i} d_ij    (8)

For the point with the highest density, its δ is set to the maximum of the pairwise distances. DenPeak simply chooses several points with the highest ρ and δ values as cluster centers. Different from DenPeak, we consider those points with relatively large ρ and δ values as local cluster centers. The corresponding definition is given in Definition 1.

Definition 1 (Local Cluster Centers). Those points satisfying the following condition are defined as local cluster centers:

    δ_i > d_c and ρ_i > ρ̄    (9)

where ρ̄ = (1/n) Σ_{j=1}^{n} ρ_j is the average density of all the points {z_i}_{i=1}^n.

It is easy to verify that a local cluster center z_i owns the largest density in its d_c-neighborhood, i.e., a circle with z_i as the center and d_c as the radius. When all the local cluster centers are obtained, we assign each remaining point to the same cluster as its nearest neighbor of higher density. Then, a set of local clusters is found and will be used to generate the final clustering. To analyze the characteristics of local cluster centers, the following two theorems are stated.

Theorem 1. A local cluster center z_i owns the largest density value ρ_i locally in its d_c-neighborhood.

Proof. We use proof by contradiction. For a local cluster center z_i, assume that there exists a point z_j in the d_c-neighborhood of z_i satisfying ρ_j > ρ_i. Then, δ_i ≤ d_c holds according to Eq. (8). This contradicts Eq. (9) in Definition 1. Thus, the assumption is wrong and the theorem is proved. □

Theorem 2. The distance between two local cluster centers with different densities is at least d_c.

Proof. Suppose z_i and z_j are two local cluster centers with ρ_i ≠ ρ_j. Assume the distance d_ij < d_c; then z_i and z_j are in the d_c-neighborhoods of each other. Since z_i is a local cluster center, it owns the highest density in its d_c-neighborhood. Thus, ρ_i ≥ ρ_j. z_j is also a local cluster center; similarly, we have ρ_j ≥ ρ_i. Thus, ρ_i = ρ_j. This contradicts the condition of the theorem. □

Thus, the distance between two local cluster centers is smaller than d_c only when they have the same density and Eq. (9) holds at the same time. In real tasks, this situation extremely rarely occurs. As a consequence, Theorems 1 and 2 indicate two important properties of local cluster centers: (1) Each local center has the highest density locally. (2) The selected cluster centers are not too close to each other, preventing a huge number of cluster centers from being selected.

3.2.2. Merging local clusters

Suppose L local clusters (C^(1), C^(2), . . . , C^(L)) are obtained; they will be merged to form the final clustering result. First, we define core and border points in Definition 2.

Definition 2 (Core and Border Points of a Cluster). Suppose a point z_i is from local cluster C^(k); it is defined as a core point if the following condition holds:

    ρ_i > ρ̄^(k)    (10)

where ρ̄^(k) = (1/n_k) Σ_{z_j ∈ C^(k)} ρ_j is the average density of all the points in C^(k) and n_k is the number of points in C^(k). Otherwise, z_i is considered a border point.

Definition 2 indicates that whether a point is a core or border point depends on its own density and the average density of the local cluster to which this point belongs. Generally, the core points of a cluster are located in its central regions, while the border points lie on the boundary, in areas of lower density.
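To make the local-cluster generation step concrete, below is a small NumPy sketch of Eqs. (7)–(10): it computes ρ_i and δ_i on the 2-dimensional embedding Z (obtained, e.g., with sklearn.manifold.TSNE(n_components=2).fit_transform(H)), selects local cluster centers according to Definition 1, assigns every remaining point to the same cluster as its nearest higher-density neighbor, and flags core points according to Definition 2. It is a brute-force O(n²) illustration of the stated definitions, not the authors' released implementation; the cutoff d_c is derived from the mean pairwise distance as described later in Section 3.3.

```python
# Brute-force O(n^2) sketch of local-cluster generation (Eqs. (7)-(10)); illustrative only.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_clusters(Z, ratio=0.1):
    """Z: (n, 2) embedded data. Returns cluster labels, center indices, and a core-point mask."""
    n = Z.shape[0]
    D = squareform(pdist(Z))                             # pairwise Euclidean distances d_ij
    dc = ratio * pdist(Z).mean()                         # d_c = mean pairwise distance x ratio (Section 3.3)

    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0       # Eq. (7); subtract the self term exp(0) = 1

    delta = np.full(n, D.max())                          # Eq. (8); the densest point keeps the max distance
    nearest_higher = np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]               # points with strictly higher density
        if higher.size > 0:
            j = higher[np.argmin(D[i, higher])]
            delta[i], nearest_higher[i] = D[i, j], j

    centers = np.where((delta > dc) & (rho > rho.mean()))[0]   # Definition 1 / Eq. (9)

    labels = np.full(n, -1)
    labels[centers] = np.arange(centers.size)
    for i in np.argsort(-rho):                           # descending density: parents are labeled first
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]        # same cluster as nearest higher-density neighbor

    core = np.zeros(n, dtype=bool)                       # Definition 2 / Eq. (10)
    for k in range(centers.size):
        members = labels == k
        core[members] = rho[members] > rho[members].mean()
    return labels, centers, core
```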
Then, we define the connectivity of clusters in Definitions 3 and 4.

Definition 3 (Density Directly-Connectable Clusters). A local cluster C^(k) is density directly-connectable from a local cluster C^(l) if:

    ∃ core points z_i ∈ C^(k) and z_j ∈ C^(l) such that d_ij < d_c.    (11)

Definition 4 (Density-Connectable Clusters). A local cluster C^(k) is density-connectable to a local cluster C^(l) if:

    ∃ a path C^(k) = C_1, C_2, . . . , C_m = C^(l)    (12)

where cluster C_j is density directly-connectable from cluster C_{j−1} (j = 2, . . . , m) and m is the path length.

It is easy to verify that both the density directly-connectable and density-connectable relations are symmetric. Finally, all the density-connectable local clusters are merged and the final clustering result is produced. When two local clusters are merged, the cluster center with the higher density becomes the center of the new merged cluster.

According to Definitions 3 and 4, two clusters are merged only when their central areas are very close to each other. This ensures that the new merged cluster also has continuous high-density areas. The pseudo-code of the proposed DDC is summarized in Algorithm 1.

Algorithm 1 Deep Density-based Image Clustering (DDC).
Input: Image data set X; cutoff distance d_c.
Output: The final clustering result.
1: Stage 1 → Deep feature learning
2: Train a deep autoencoder via Eq. (1) or (2).
3: Transform X to lower feature representations H via the encoder f_Θ(·).
4: Map H to a 2-dimensional data set Z via t-SNE.
5: Stage 2 → Density-based clustering
6: // Local clusters generation
7: for each point z_i in Z do
8:     Compute ρ_i and δ_i via Eqs. (7) and (8).
9: end for
10: Choose local cluster centers via Eq. (9).
11: Assign the remaining points and obtain local clusters C^(1), C^(2), . . . , C^(L).
12: // Generate the final clustering result
13: Define core and border points via Eq. (10).
14: Merge all the density-connectable local clusters via Eqs. (11) and (12).
15: Return the final clustering result.

3.3. Implementation

According to the different optimization problems, DDC provides two specific algorithms:

(1) DDC: Use the CAE and solve Eq. (1).
(2) DDC-DA: Use the CAE and solve Eq. (2), in which data augmentation is adopted.

As in [18] and [31], the structure of the CAE is always set to Conv5-32 → Conv5-64 → Conv3-128 → Fc10 → Conv3-128 → Conv5-64 → Conv5-32. Here, Conv5-32 represents a convolutional layer with 32 filters and a 5 × 5 kernel. The stride is always set to 2. Fc10 denotes the fully connected layer with 10 neurons. In the convolutional autoencoders, all the internal layers except for the input, embedding, and output layers are activated by the ReLU function. The structure of the autoencoder also indicates that the dimensionality of the learned representations H is 10.

Given the embedded 2-dimensional data Z, DDC has only one parameter (d_c) that needs to be set. We set the value of d_c according to the data Z itself. Concretely, we compute d̄ as the average value of all pairwise distances in Z and then set d_c = d̄ × ratio. If ratio is extremely large, a small number of clusters will be found by DDC; if ratio is extremely small, a large number of clusters will be detected. However, we will empirically verify that DDC achieves stable performance over a wide range of ratio. The default value of ratio is 0.1.
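As an illustration of the CAE just described, the following Keras sketch builds the Conv5-32 → Conv5-64 → Conv3-128 → Fc10 encoder (stride 2, ReLU on the internal layers) with a mirrored transposed-convolution decoder for 28 × 28 gray images. The padding choices, the linear output layer, and the optimizer are assumptions on our part, not details taken from the paper.

```python
# Sketch of the CAE structure described in Section 3.3, for 28 x 28 gray images.
# Padding choices, the linear output layer, and the optimizer are assumptions.
from tensorflow.keras import layers, models

def build_cae(input_shape=(28, 28, 1), embedding_dim=10):
    x = layers.Input(shape=input_shape)
    e = layers.Conv2D(32, 5, strides=2, padding='same', activation='relu')(x)    # Conv5-32 -> 14x14
    e = layers.Conv2D(64, 5, strides=2, padding='same', activation='relu')(e)    # Conv5-64 -> 7x7
    e = layers.Conv2D(128, 3, strides=2, padding='valid', activation='relu')(e)  # Conv3-128 -> 3x3
    e = layers.Flatten()(e)
    h = layers.Dense(embedding_dim, name='embedding')(e)                          # Fc10, no activation

    d = layers.Dense(3 * 3 * 128, activation='relu')(h)
    d = layers.Reshape((3, 3, 128))(d)
    d = layers.Conv2DTranspose(64, 3, strides=2, padding='valid', activation='relu')(d)  # -> 7x7
    d = layers.Conv2DTranspose(32, 5, strides=2, padding='same', activation='relu')(d)   # -> 14x14
    out = layers.Conv2DTranspose(1, 5, strides=2, padding='same')(d)                     # -> 28x28 reconstruction

    cae = models.Model(x, out)
    cae.compile(optimizer='adam', loss='mse')
    encoder = models.Model(x, h)
    return cae, encoder

# DDC trains the CAE on X (Eq. (1)), or on randomly shifted/rotated copies of X (Eq. (2), DDC-DA),
# then maps images to the 10-dimensional embedding H = encoder.predict(X) before applying t-SNE.
```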
3.4. Relations to existing methods

DBSCAN [7] and DenPeak [48] are two widely used density-based clustering methods. They are applied in the original feature space, while the proposed DDC works in the 2-dimensional embedded space. Besides, DBSCAN is sensitive to its parameters and tends to merge clusters with overlapping areas [7,9]. These shortcomings prevent its successful use in image clustering tasks.

DenPeak assumes that each cluster has only one center, leading to the following disadvantages: (1) In real applications, multiple centers/modes usually coexist in one cluster. Thus, DenPeak typically loses information about the local structures of a cluster. (2) It is difficult for DenPeak to select a suitable number of clusters because usually a number of points (much larger than the ground-truth number of clusters) with high ρ and δ values can be considered as candidate cluster centers. To address these issues, DDC first selects all the potential cluster centers to obtain the local clusters, and then aggregates all density-connectable clusters to form the final clustering result. An illustration exhibiting the different behaviors of DenPeak and DDC is given in Figs. 1 and 2. Here, several data sets with irregular cluster shapes are used, i.e., Twomoon, Flame, and t4 (http://cs.joensuu.fi/sipu/datasets/). DDC is directly applied on the 2-dimensional data without using the CAE and t-SNE. The ratio of DDC is 0.1. DenPeak follows the parameter setting described in Section 4.3.

Fig. 1. Twomoon: Clustering performance comparison of DenPeak and DDC. The Twomoon data set has 2000 points from two classes. (a): The decision graph of DenPeak. (b): The final result of DenPeak. (c): Initial local clusters of DDC. (d): The final result of DDC. (e): The border points detected by DDC are plotted as black points. The center of each cluster is highlighted with a black '♦'. Points with the same color are from the same cluster. As shown in (a), a number of points with high ρ and δ values can be considered as centers, and it is hard for DenPeak to choose an appropriate number of clusters. Even when it is told that 2 clusters exist, the result of DenPeak is still not satisfactory, as (b) shows. By contrast, DDC first generates a relatively large number of local cluster centers and then merges them to form the final clustering result. Comparing (c) with (e), we find that two clusters are typically merged if there exist core points that are from both clusters and are close to each other. It is shown in (e) that border points generally locate around the boundary of each real cluster, while core points locate in central areas.

Fig. 2. Clustering results of DenPeak and DDC on the Flame and t4 data sets. (a) and (b) correspond to the Flame data set. (c) and (d) show the results on t4. DenPeak is told to select the true number of clusters. Due to the loss of local structure information, DenPeak fails to find suitable clusters (as shown in (a) and (c)). In contrast, DDC performs perfectly on these two data sets. Even when noisy data exist (as exhibited in (d)), DDC can still automatically recognize the 4 irregular clusters.

DED [31] is a recently proposed deep clustering model that transforms the original data via DNN to a 2-dimensional feature space that favors the density-based clustering algorithm. However, DED directly applies DenPeak on the 2-dimensional data, thereby inheriting the disadvantages of DenPeak.

4. Experimental setup

This section describes the tested image data sets, comparing methods, parameter settings, and evaluation measures.

4.1. Image data sets

Five popular image data sets are used to assess the performance of the comparing methods.

The MNIST database (http://yann.lecun.com/exdb/mnist/) consists of 70000 handwritten digits of 28 × 28 pixel size from 10 categories (digits 0–9). The MNIST-test data set only contains the test set of MNIST, with 10000 images. The USPS data set (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) is collected from handwritten digits on envelopes by the U.S. postal service. It contains 9298 grayscale images of size 16 × 16. Fashion [54] is a data set comprising 28 × 28 gray images of 70000 fashion products from 10 categories. Its test set with 10000 images is used in our experiments. The LetterA-J data set (https://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) consists of more than 500k 28 × 28 grayscale images of English letters from A to J. We randomly select 10000 images from its uncleaned subset as the test set.

The summary of all data sets is shown in Table 1. The features of each data set are scaled to [0, 1].

Table 1
Image data sets used in the experiments.
Data set      # examples   # classes   Image size
MNIST         70000        10          28 × 28
MNIST-test    10000        10          28 × 28
USPS          9298         10          16 × 16
Fashion       10000        10          28 × 28
LetterA-J     10000        10          28 × 28

4.2. Evaluation measures

Clustering accuracy (ACC) and normalized mutual information (NMI) are used to estimate the performance of the comparing algorithms. Their values are both in [0, 1]. A higher value of ACC or NMI indicates better clustering performance.
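For reference, a common way to compute these two measures is sketched below: NMI comes directly from scikit-learn, and ACC maps predicted cluster labels to ground-truth classes with the Hungarian algorithm before measuring accuracy. The exact implementation used in the paper is not specified, so this is an assumption-laden illustration; it pads the confusion matrix to a square so that ACC remains defined when the number of detected clusters differs from the number of classes.

```python
# Sketch of the evaluation measures: ACC via Hungarian matching, NMI via scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one mapping between predicted clusters and true classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids = np.unique(y_true)
    pred_ids = np.unique(y_pred)
    size = max(true_ids.size, pred_ids.size)       # square cost matrix even if cluster counts differ
    w = np.zeros((size, size), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[np.searchsorted(pred_ids, p), np.searchsorted(true_ids, t)] += 1
    row, col = linear_sum_assignment(-w)            # maximize the matched counts
    return w[row, col].sum() / y_true.size

# Example:
# acc = clustering_accuracy(y, labels)
# nmi = normalized_mutual_info_score(y, labels)
```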
4.3. Comparing methods

We compare the proposed DDC with both shallow clustering methods and deep ones. Shallow baselines are k-means [4], DBSCAN [7], and DenPeak [48]. Deep methods based on both fully connected and convolutional autoencoders are compared, including DEC (deep embedded clustering) [21], IDEC (improved DEC with local structure preservation) [40], DCN (deep clustering network) [22], JULE (joint unsupervised learning for image clustering) [23], DEPICT (deep embedded regularized clustering) [24], ClusterGAN [25], DWSC (deep weighted k-subspace clustering) [26], DKM (deep k-means) [27], VaDE (variational deep embedding) [28], DCC (deep continuous clustering) [30], DED (deep embedding determination) [31], and DEC-DA (DEC with data augmentation) [18].

Among all the comparing methods, DBSCAN, DenPeak, DCC, DED, and the proposed DDC do not need the number of clusters in advance. For all other methods, the number of clusters is set to the ground-truth number of categories. When applying DBSCAN, the 4th nearest neighbor distances are computed w.r.t. the entire data, and the parameter Eps is set to the median of those values. The MinPts value of DBSCAN is always set to 4. For DenPeak, the Gaussian kernel is used and d_c is set such that the average number of points in a d_c-neighborhood is approximately 1% × n. To give DenPeak and DED an advantage, the detected number of clusters is set to the true number of classes according to the decision graph. So far, given the ground-truth number of clusters, ConvDEC-DA achieves state-of-the-art clustering performance in image clustering [18]. We compare ConvDEC-DA and its version without DA in our experiments.

The reported ACC and NMI values are either excerpted from the original papers or are the average values of running the released code with the corresponding suggested parameters for 10 independent trials.

5. Results and analysis

5.1. Results on real image data

Table 2 gives the clustering results of the comparing methods measured by ACC and NMI. In each column, the best two results are highlighted in boldface. From Table 2 we have the following observations: (1) The shallow models generally perform worse than the deep clustering methods. DBSCAN works the worst, mainly because it is hard to choose suitable parameters in the high-dimensional space. (2) Data augmentation (DA) can improve the clustering performance. Except for the two methods using DA (i.e., ConvDEC-DA and DDC-DA), our DDC always achieves the highest ACC and NMI values. (3) Our DDC-DA always achieves one of the best two clustering results, even though the number of clusters is not given. Even given the true number of clusters, DED still performs much worse than DDC and DDC-DA. (4) We also find that ConvDEC-DA sometimes performs unstably. For instance, it can usually obtain a high ACC value (>0.98) on MNIST-test, but it occasionally performs worse (ACC <0.84) on this data set. This might be caused by bad initial cluster centers provided by k-means in the learned feature space. By contrast, our DDC and DDC-DA are more stable, with small standard deviations.

Table 2
Results of the comparing methods. In each column, the best two results are highlighted in boldface. The results marked by '*' are excerpted from the papers. '–' denotes that the results are unavailable from the papers or codes, and '- -' means 'out of memory'.
              MNIST           MNIST-test      USPS            Fashion         LetterA-J
              ACC     NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC     NMI
k-means       0.485   0.470   0.563   0.510   0.611   0.607   0.554   0.512   0.354   0.309
DBSCAN        - -     - -     0.114   0       0.167   0       0.100   0       0.100   0
DenPeak       - -     - -     0.357   0.399   0.390   0.433   0.344   0.398   0.300   0.211
DEC           0.849   0.816   0.856   0.830   0.758   0.769   0.591   0.618   0.407   0.374
IDEC          0.881*  0.867*  0.846   0.802   0.759   0.777   0.523   0.600   0.381   0.318
DCN           0.830*  0.810*  0.802*  0.786*  0.688*  0.683*  –       –       –       –
JULE          0.964*  0.913*  0.961*  0.915*  0.950*  0.913*  –       –       –       –
DEPICT        0.965*  0.917*  0.963*  0.915*  0.964*  0.927*  –       –       –       –
ClusterGAN    0.950*  0.890*  –       –       –       –       –       –       –       –
DWSC          0.948*  0.889*  –       –       –       –       –       –       –       –
DKM           0.840*  0.796*  –       –       0.757*  0.776*  –       –       –       –
VaDE          0.945*  0.876*  0.287*  0.287*  0.566*  0.512*  –       –       –       –
DCC           0.963*  –       –       –       –       –       –       –       –       –
DED           - -     - -     0.690   0.818   0.781   0.855   0.473   0.617   0.371   0.440
ConvDEC       0.940   0.916   0.861   0.847   0.784   0.820   0.514   0.588   0.517   0.536
ConvDEC-DA    0.985   0.961   0.955   0.949   0.970   0.953   0.570   0.632   0.571   0.608
DDC           0.965   0.932   0.965   0.916   0.967   0.918   0.619   0.682   0.573   0.546
DDC-DA        0.969   0.941   0.970   0.927   0.977   0.939   0.609   0.661   0.691   0.629

The average numbers of clusters detected by our DDC and DDC-DA, as well as the corresponding standard deviations, are given in Table 3. From Table 3 we find that our methods can always find the correct number of categories on MNIST-test and USPS. On MNIST, Fashion, and LetterA-J, the recognized numbers of clusters are slightly different from the true values. These results indicate the capability of the proposed DDC framework to automatically recognize reasonable numbers of clusters.

Table 3
The average number of detected clusters.
Data set      DDC           DDC-DA
MNIST         10.8 ± 0.4    10.7 ± 0.5
MNIST-test    10.0 ± 0.0    10.0 ± 0.0
USPS          10.0 ± 0.0    10.0 ± 0.0
Fashion       10.5 ± 0.8    10.2 ± 0.8
LetterA-J     9.1 ± 0.8     10.1 ± 0.9

5.2. Sensitivity analysis

This section tests the sensitivity of DDC w.r.t. the parameter ratio on the MNIST-test and USPS data sets. The tested range is [0.05, 0.16]. Both the ACC and NMI values of DDC-DA are reported in Fig. 3, from which we can observe that our method achieves stably excellent performance over a wide range of ratio. When applying the DDC methods in real clustering applications, the default value of ratio is recommended to be set to 0.1.

Fig. 3. Sensitivity analysis of the parameter ratio (ACC and NMI).

5.3. Runtime analysis

We compare our method with DEC-DA [18] because these two models use the same CAE structure and DEC-DA has been shown to be efficient compared with other existing deep clustering methods. The experiments are run on a server with 32 GB RAM and two Tesla P100 GPUs. Concretely, the runtimes of our DDC-DA on MNIST-test and USPS are 737 and 583 s, respectively. Those of ConvDEC-DA are 798 and 436 s, respectively. DDC-DA needs time to estimate the density ρ and distance δ for each point, while ConvDEC-DA needs to refine the CAE with the initial cluster centers. Thus, these two methods show competitive performance in terms of efficiency.

6. Discussion

We also conduct experiments that directly use t-SNE to reduce the original data to the 2-dimensional space and then apply the proposed density-based clustering technique. The clustering results are much worse than those of our DDC methods. The main reason is that the CAE can transform the original data to a lower-dimensional space in which the intrinsic local structures are preserved. It is better to further reduce these lower-dimensional representations to a 2-dimensional space than to extract the 2-dimensional embedding directly from the original high-dimensional data. As a consequence, DED [31] and our DDC make use of both the CAE and t-SNE to obtain the 2-dimensional representations that favor density-based clustering.

Now, let us come back to the question raised in Section 1: Is it really necessary to refine the deep autoencoder with the initial cluster assignment? To answer this question, we first visualize the clustering results on MNIST-test and LetterA-J in the embedded 2-dimensional space of DDC-DA in Figs. 4 and 5, respectively. For data whose clusters are well separated (as shown in Fig. 4(a)), centroid-based clustering methods such as ConvDEC-DA, which depend greatly on the initial selection of cluster centers, need to refine the CAE iteratively to achieve satisfactory results. By contrast, our DDC can deliver remarkable performance without refinement, even when several clusters in the middle area overlap.

For data in which many points from different categories are mixed together (as shown in the middle area of Fig. 5(a)), the refinement of ConvDEC-DA cannot separate the mixed points correctly, and neither can our DDC. If this happens and no additional information is given, the effectiveness of refining the autoencoder is not significant for either centroid-based or density-based clustering. In our opinion, one needs prior information (e.g., pairwise constraints) or knowledge transferred from related tasks to handle this situation.

Fig. 4. Visualization of DDC-DA on MNIST-test. (a) The ground truth labels of the embedded 2-dimensional data. (b) The initial result of DDC-DA. (c) The final result of DDC-DA. (d) The border points detected by DDC-DA.

Fig. 5. Visualization of DDC-DA on LetterA-J. (a) The ground truth labels of the embedded 2-dimensional data. (b) The initial result of DDC-DA. (c) The final result of DDC-DA. (d) The border points detected by DDC-DA.
7. Conclusion and future work

We propose a novel deep density-based clustering (DDC) method for image clustering. It is well known that for high-dimensional data such as images, it is difficult to obtain satisfactory performance by applying clustering methods in the original space of the image data. So in DDC, we first use a CAE with good representation ability to extract 10-dimensional features from the original data. After this, t-SNE is used to reduce the 10-dimensional data to a 2-dimensional space, which favors our density-based clustering. DDC considers both the local information of clusters and the importance of points in the clustering process. It is empirically shown to be the new state-of-the-art deep clustering method when the number of clusters is not given. Its efficiency and robustness are also verified. An interesting direction for future work is to incorporate semi-supervised learning and transfer learning into deep density-based clustering.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Yazhou Ren: Conceptualization, Methodology, Formal analysis, Resources, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition. Ni Wang: Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Visualization. Mingxia Li: Methodology, Software, Validation, Investigation, Writing - original draft, Visualization. Zenglin Xu: Resources, Writing - review & editing, Funding acquisition.

Acknowledgments

This paper was in part supported by the National Natural Science Foundation of China (Nos. 61806043, 61572111, and 61832001), the Project funded by China Postdoctoral Science Foundation (No. 2016M602674), and the Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515011002).

References

[1] Y. Chen, J.Z. Wang, R. Krovetz, CLUE: cluster-based retrieval of images by unsupervised learning, IEEE Trans. Image Process. 14 (8) (2005) 1187–1201.
[2] P. Xie, E.P. Xing, Integrating image clustering and codebook learning, in: AAAI, 2015, pp. 1903–1909.
[3] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, TPAMI 30 (6) (2008) 985–1002.
[4] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281–297.
[5] Y. Ren, U. Kamath, C. Domeniconi, Z. Xu, Parallel boosted clustering, Neurocomputing 351 (2019) 87–100.
[6] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: A review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[7] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996, pp. 226–231.
[8] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, TPAMI 24 (5) (2002) 603–619.
[9] Y. Ren, U. Kamath, C. Domeniconi, G. Zhang, Boosted mean shift clustering, in: ECML-PKDD, 2014, pp. 646–661.
[10] Y. Ren, C. Domeniconi, G. Zhang, G. Yu, A weighted adaptive mean shift clustering algorithm, in: SDM, 2014, pp. 794–802.
[11] Y. Ren, X. Hu, K. Shi, G. Yu, D. Yao, Z. Xu, Semi-supervised denpeak clustering with pairwise constraints, in: Proceedings of the 15th Pacific Rim International Conference on Artificial Intelligence, 2018, pp. 837–850.
[12] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006, pp. 430–439.
[13] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: NIPS, MIT Press, 2001, pp. 556–562.
[14] S. Huang, Z. Xu, J. Lv, Adaptive local structure learning for document co-clustering, Knowl.-Based Syst. 148 (2018) 74–84.
[15] S. Huang, P. Zhao, Y. Ren, T. Li, Z. Xu, Self-paced and soft-weighted nonnegative matrix factorization for data representation, Knowl.-Based Syst. 164 (2019) 29–37.
[16] F.D.l. Torre, T. Kanade, Discriminative cluster analysis, in: ICML, 2006, pp. 241–248.
[17] J. Chang, L. Wang, G. Meng, S. Xiang, C. Pan, Deep adaptive image clustering, in: CVPR, 2017, pp. 5879–5887.
[18] X. Guo, E. Zhu, X. Liu, J. Yin, Deep embedded clustering with data augmentation, in: ACML, 2018, pp. 550–565.
[19] X. Peng, S. Xiao, J. Feng, W.Y. Yau, Z. Yi, Deep subspace clustering with sparsity prior, in: IJCAI, 2016, pp. 1925–1931.
[20] F. Tian, B. Gao, Q. Cui, E. Chen, T.-Y. Liu, Learning deep representations for graph clustering, in: AAAI, 2014, pp. 1293–1299.
[21] J. Xie, R.B. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: ICML, 2016, pp. 478–487.
[22] B. Yang, X. Fu, N.D. Sidiropoulos, M. Hong, Towards K-means-friendly spaces: Simultaneous deep learning and clustering, in: ICML, 2017, pp. 3861–3870.
[23] J. Yang, D. Parikh, D. Batra, Joint unsupervised learning of deep representations and image clusters, in: CVPR, 2016, pp. 5147–5156.
[24] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, H. Huang, Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization, in: ICCV, 2017, pp. 5736–5745.
[25] S. Mukherjee, H. Asnani, E. Lin, S. Kannan, ClusterGAN: Latent space clustering in generative adversarial networks, AAAI (2019) 4610–4617.
[26] W. Huang, M. Yin, J. Li, S. Xie, Deep clustering via weighted k-subspace network, IEEE Signal Process. Lett. 26 (11) (2019) 1628–1632.
[27] M.M. Fard, T. Thonet, E. Gaussier, Deep k-means: Jointly clustering with k-means and learning representations, 2018, pp. 1–14, arXiv preprint arXiv:1806.10069.
[28] Z. Jiang, Y. Zheng, H. Tan, B. Tang, H. Zhou, Variational deep embedding: An unsupervised and generative approach to clustering, in: IJCAI, 2017, pp. 1965–1972.
[29] W.-A. Lin, J.-C. Chen, C.D. Castillo, R. Chellappa, Deep density clustering of unconstrained faces, in: CVPR, 2018, pp. 8128–8137.
[30] S.A. Shah, V. Koltun, Deep continuous clustering, 2018, pp. 1–11, arXiv preprint arXiv:1803.01449.
[31] Y. Wang, E. Zhu, Q. Liu, Y. Chen, J. Yin, Exploration of human activities using sensing data via deep embedded determination, in: Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, 2018, pp. 473–484.
[32] L.v.d. Maaten, G. Hinton, Visualizing data using t-SNE, JMLR 9 (2008) 2579–2605.
[33] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
[34] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, TPAMI 35 (8) (2013) 1798–1828.
[35] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. (2006) 1527–1554.
[36] G. Chen, Deep learning with nonparametric clustering, 2015, pp. 1–14, arXiv preprint arXiv:1501.03084.
[37] M. Shao, S. Li, Z. Ding, Y. Fu, Deep linear coding for fast graph clustering, in: IJCAI, 2015, pp. 3798–3804.
[38] C. Song, F. Liu, Y. Huang, L. Wang, T. Tan, Auto-encoder based data clustering, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Springer, 2013, pp. 117–124.
[39] Y. Ren, K. Hu, X. Dai, L. Pan, S.C. Hoi, Z. Xu, Semi-supervised deep embedded clustering, Neurocomputing 325 (2019) 121–130.
[40] X. Guo, L. Gao, X. Liu, J. Yin, Improved deep embedded clustering with local structure preservation, in: IJCAI, 2017, pp. 1573–1759.
[41] M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: SIGMOD, ACM, 1999, pp. 49–60.
[42] A. Hinneburg, D.A. Keim, et al., An efficient approach to clustering in large multimedia databases with noise, in: KDD, vol. 98, 1998, pp. 58–65.
[43] F. Angiulli, C. Pizzuti, M. Ruffolo, DESCRY: a density based clustering algorithm for very large data sets, in: Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2004, pp. 203–210.
[44] Q. Du, Z. Dong, C. Huang, F. Ren, Density-based clustering with geographical background constraints using a semantic expression model, ISPRS Int. J. Geo-Inf. 5 (5) (2016) 72.
[45] Y. Gu, X. Ye, F. Zhang, Z. Du, R. Liu, L. Yu, A parallel varied density-based clustering algorithm with optimized data partition, J. Spatial Sci. (2017) 1–22.
[46] Y. Lv, T. Ma, M. Tang, J. Cao, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, An efficient and scalable density-based clustering algorithm for datasets with complex structures, Neurocomputing 171 (2016) 9–22.
[47] S.T. Mai, X. He, J. Feng, C. Plant, C. Böhm, Anytime density-based clustering of complex data, Knowl. Inf. Syst. 45 (2) (2015) 319–355.
[48] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[49] Y. Liu, Z. Ma, F. Yu, Adaptive density peak clustering based on k-nearest neighbors with aggregating strategy, Knowl.-Based Syst. 133 (2017) 208–220.
[50] R. Mehmood, S. El-Ashram, R. Bie, H. Dawood, A. Kos, Clustering by fast search and merge of local density peaks for gene expression microarray data, Sci. Rep. 7 (2017) 45602.
[51] J. Xu, G. Wang, W. Deng, DenPEHC: Density peak based efficient hierarchical clustering, Inform. Sci. 373 (2016) 200–218.
[52] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: ICML, 2008, pp. 1096–1103.
[53] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, JMLR 11 (2010) 3371–3408.
[54] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017, pp. 1–6, arXiv preprint arXiv:1708.07747.
