Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering

Wei He Shangzhi Zhang Chun-Guang Li Xianbiao Qi Rong Xiao Jun Guo School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, P.R. China Intellifusion, Shenzhen, P.R. China
Abstract

Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity to obtain the spectral embedding. Nevertheless, such a step suffers from a lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition, and thus a $k$-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective, scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, NeuNcut is able to directly infer the clustering membership for unseen out-of-sample data, which gives an efficient way to handle clustering tasks with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets, and the experimental results validate the effectiveness and the superiority of our approach.

keywords:
Neural Normalized Cut, Unsupervised learning, Spectral clustering, Differential programming
journal: Pattern Recognition

1 Introduction

Finding clusters from unlabeled data is a task of great scientific significance and practical value in pattern recognition, machine learning and data science. Spectral clustering [1, 2], as a popular tool for data clustering, has been widely applied in various areas, owing to its wide generality, excellent empirical performance and rich theoretical foundation in spectral graph theory [3]. Spectral clustering finds a graph partition that minimizes the sum of weights on edges between different subgraphs, which is essentially the problem of minimizing a graph cut. Because this problem is combinatorial in nature, the common practice in spectral clustering, as in, e.g., Ratio cut [4], Normalized cut (Ncut) [5] and Min-max cut [6], is to solve a relaxed problem rather than the original one and to find the partition in the graph spectral domain.

Roughly, spectral clustering algorithms consist of two main steps: 1) computing the spectral embeddings via the eigenvectors associated with the $k$ smallest eigenvalues of the pre-computed graph Laplacian; and then 2) applying the $k$-means algorithm to obtain the partition from the embedding. Despite its popularity, spectral clustering suffers from several drawbacks. The spectral embeddings are usually computed only for the given samples, and thus cannot be generalized to unseen out-of-sample data. Moreover, the eigen-decomposition of the graph Laplacian, whose complexity is quadratic in the number of samples, is time consuming or even computationally prohibitive when handling clustering tasks with ultra large-scale data. Besides, the spectral embeddings obtained by eigen-decomposition contain negative entries, and hence can hardly provide a good approximation to the clustering membership. As a remedy, a $k$-means algorithm is usually applied on the spectral embeddings as a rounding heuristic to obtain the clustering membership. Since the problem is solved in two separate stages, the solution obtained by $k$-means is sub-optimal for the original graph cut problem [7].

To address the scalability issue, methods based on Nyström extension [8], landmarks [9] and bipartite graphs [10] have been proposed. Still, none of these methods can generalize the spectral embeddings to unseen data. Recently, there have been a few attempts to tackle the scalability and generalizability of spectral embeddings, e.g., SpectralNet [11] and SpecNet2 [12] train neural networks to approximate the eigenvectors of the graph Laplacian. Nevertheless, as in conventional spectral clustering, these methods still have to apply a $k$-means step on the learned spectral embeddings, and thus the obtained solution is sub-optimal for the original graph cut problem.

In this paper, we attempt to develop an efficient and effective approach to learn the clustering membership for spectral clustering directly, aiming to endow spectral clustering with the generalization ability to handle out-of-sample data. Specifically, we first reformulate the problem of minimizing the normalized graph cut loss by incorporating a relaxed orthogonality penalty and a set of properly adjusted box constraints on the continuous relaxation of the segmentation matrix; then we reparameterize the segmentation matrix via a neural network with a softmax output to learn the soft clustering membership (rather than the spectral embedding). Because it employs the normalized cut loss and uses a neural network to reparameterize the clustering membership, we term our approach Neural Normalized Cut (NeuNcut). By relaxing the orthogonality constraint, NeuNcut is easier to train. Owing to the neural network with a softmax output, NeuNcut can directly infer the clustering membership without a $k$-means step to quantize the embeddings. Our code is available at: https://github.com/hewei98/NeuNcut.

The contributions of the paper can be summarized as follows.

  1. We reformulate the combinatorial problem of minimizing a normalized cut into a continuous problem with a relaxed orthogonality penalty and a set of properly adjusted box constraints. The box constraints enable the segmentation matrix to satisfy more desired properties.

  2. We propose to reparameterize the clustering membership matrix with a properly designed neural network and present an EM-like procedure to solve the reformulated normalized cut problem.

  3. We conduct extensive experiments on both synthetic data and real-world data to demonstrate the effectiveness of our approach, as well as its generalization ability to out-of-sample data.

2 Related Work

2.1 Spectral Clustering

Spectral clustering is a well-known approach to approximate the solution of the graph cut problem. Instead of directly solving the combinatorial problem, the common practice first relaxes the graph cut problem into its corresponding continuous problem and then solves it as an eigen-decomposition problem with the graph Laplacian. Different spectral clustering methods differ in the way they avoid trivial solutions. For instance, Ratio Cut [4] maximizes the number of data points within each partition; Normalized Cut (Ncut) [5] maximizes the number of edges within each partition; and Min-max Cut [6] maximizes the similarity within each partition. Besides, by connecting to nonnegative matrix factorization [13], various nonnegative spectral clustering methods [14, 15] have also been developed, in which a nonnegativity constraint is imposed on the segmentation matrix.

2.2 Scalable Spectral Clustering

Spectral clustering suffers from a scalability issue when handling large-scale data, due to its heavy memory requirements for storing pairwise affinities and its computational cost for computing the spectral embedding. In early work, the Nyström method [8] is employed to address the scalability issue by randomly choosing a few samples to construct an affinity sub-matrix. In [9], a unified framework for landmark-based spectral clustering is presented, in which an anchor graph is constructed, followed by eigen-decomposition, and $k$-means is performed in the anchor space. Additionally, when a sparse affinity sub-matrix is obtained, bipartite graph clustering methods can be used to obtain the partition, e.g., [10]. A coarse-to-fine anchor approximation strategy is proposed in [16] to further reduce the computation cost for ultra large-scale datasets. Nevertheless, all the methods mentioned above operate on the given samples and lack the ability to handle unseen out-of-sample data.

2.3 Deep Clustering

Inspired by the success of deep learning in the past decade, a number of methods attempt to learn the clustering by leveraging a deep learning framework. For example, in VaDE [17], a variational autoencoder equipped with a Gaussian mixture prior is trained to perform deep embedding and clustering; in DEPICT [18], a pre-trained auto-encoder with a softmax layer is trained with a relative entropy based loss, in which the target distribution is initialized with a clustering method; in SCAN [19], a novel framework based on minimizing the consistency between pairwise samples is presented, in which a regularization based on maximizing the entropy of the clustering assignments [20] is incorporated to prevent clustering degeneracy. However, methods such as DEPICT and SCAN rely on an entropy-maximizing regularization to prevent collapsed solutions. Adopting such a regularization essentially assumes that all clusters are of equal size, which is misleading when the clusters in the data are imbalanced.

2.4 Differential Spectral Clustering

Recently, there have been a few attempts to address the limitations of spectral clustering in terms of scalability and generalization ability. SpectralNet [11] introduces deep neural networks with an orthogonalization layer to learn the spectral embeddings and then uses $k$-means to quantize the embedding. SpecNet2 [12] proposes an orthogonalization-free objective to learn the spectral embedding. AutoSC [21] integrates an automatically constructed affinity matrix with a neural network to learn the spectral embedding. BaSiS [22] learns the spectral embeddings by using affine registration techniques to align the mini-batches. These methods, SpectralNet, SpecNet2, AutoSC and BaSiS, are designed to learn the orthogonal spectral embeddings of the graph Laplacian and thus require running $k$-means on the learned embedding of the entire dataset to find the clusters, leading to unsatisfactory clustering results. CNC [23] aims to directly optimize the normalized cut objective via a neural network without continuous relaxation or an explicit orthogonalization constraint. However, since the loss function of CNC is still combinatorial in nature, it is quite challenging to optimize without proper relaxation. Rather than enforcing a strict orthogonality constraint to learn the spectral embeddings (as in [11]) or directly optimizing a normalized cut loss of combinatorial nature (as in [23]), we take a compromise path and directly learn the clustering membership with a properly relaxed normalized cut loss.

3 Our Approach: Neural Normalized Cut

This section first introduces some preliminaries on spectral clustering with normalized cut, then reformulates the normalized cut problem and presents our approach, Neural Normalized Cut (NeuNcut).

3.1 Preliminaries in Normalized Cut and Its Relaxations

Spectral clustering is able to handle non-convex clusters. However, solving the spectral clustering problem by simply minimizing the vanilla graph cut objective results in a trivial solution that consists of singleton clusters. In one of the most popular spectral clustering methods, normalized cut [5], the trivial solution is prevented by taking into account the volume of the clusters.

Consider a dataset of $n$ data points $\bm{x}_i \in \mathbb{R}^{d}$, arranged as a data matrix $X = [\bm{x}_1, \cdots, \bm{x}_n] \in \mathbb{R}^{d \times n}$. Denote a given affinity matrix as $A \in \mathbb{R}^{n \times n}$, in which each element $a_{i,j}$ measures the pairwise similarity between data points $\bm{x}_i$ and $\bm{x}_j$. By viewing each data point $\bm{x}_i$ as a graph vertex $v_i$, we can define a graph $\mathcal{G}(\mathcal{V}, A)$, where $\mathcal{V}$ is the set of vertices. The goal of spectral clustering is to find the partition $\{\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(k)}\}$ of the vertices in $\mathcal{V}$ on the graph, where $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)} \cup \cdots \cup \mathcal{V}^{(k)}$ and $k$ denotes the number of true clusters. Precisely, the normalized cut objective is defined as follows:

$$\mathrm{Ncut}(\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(k)}) := \sum_{\ell=1}^{k} \frac{\mathrm{cut}(\mathcal{V}^{(\ell)}, \overline{\mathcal{V}}^{(\ell)})}{\mathrm{vol}(\mathcal{V}^{(\ell)})},$$ (1)

where $\overline{\mathcal{V}}^{(\ell)}$ denotes the complement of $\mathcal{V}^{(\ell)}$, $\mathrm{cut}(\mathcal{V}^{(\ell)}, \overline{\mathcal{V}}^{(\ell)}) = \sum_{i: v_i \in \mathcal{V}^{(\ell)}} \sum_{j: v_j \notin \mathcal{V}^{(\ell)}} a_{i,j}$, $\mathrm{vol}(\mathcal{V}^{(\ell)}) = \sum_{i: v_i \in \mathcal{V}^{(\ell)}} D_{i,i}$ is the volume of the $\ell$-th cluster, and $D_{i,i} = \sum_{j=1}^{n} a_{i,j}$ is the degree of the $i$-th vertex.
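As a concrete illustration of Eq. (1), the following sketch evaluates the normalized cut of two candidate partitions on a small hand-made graph. The 6-vertex graph (two triangles joined by one weak edge), the helper names `build_affinity` and `ncut`, and the use of plain Python lists instead of an array library are assumptions made for this example only.

```python
def build_affinity(n, weighted_edges):
    """Return a symmetric n x n affinity matrix from (i, j, weight) triples."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        A[i][j] = w
        A[j][i] = w
    return A


def ncut(A, clusters):
    """Normalized cut of Eq. (1): sum over clusters of cut(V, complement) / vol(V)."""
    n = len(A)
    degree = [sum(row) for row in A]  # D_{i,i} = sum_j a_{i,j}
    total = 0.0
    for members in clusters:
        inside = set(members)
        # Weight of edges leaving the cluster.
        cut = sum(A[i][j] for i in inside for j in range(n) if j not in inside)
        # Sum of degrees of the vertices in the cluster.
        vol = sum(degree[i] for i in inside)
        total += cut / vol
    return total


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5}, bridged by edge (2,3).
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.5)]
A = build_affinity(6, EDGES)

balanced = ncut(A, [[0, 1, 2], [3, 4, 5]])    # cuts only the weak bridge
singleton = ncut(A, [[0], [1, 2, 3, 4, 5]])   # isolates a single vertex
print(balanced, singleton)
```

On this toy graph the balanced partition scores about 0.154, whereas splitting off a single vertex scores about 1.18, which shows how the volume term in the denominator penalizes singleton clusters.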

Let $H = [\bm{h}_1, \cdots, \bm{h}_k] \in \{0,1\}^{n \times k}$ be a binary segmentation matrix, for which $\bm{h}_\ell = (h_{1,\ell}, \cdots, h_{n,\ell})^\top$, where nonzero entries indicate which vertices belong to the $\ell$-th cluster, i.e.,

$$h_{i,\ell} := \begin{cases} 1 & \text{if } v_i \in \mathcal{V}^{(\ell)} \\ 0 & \text{if } v_i \notin \mathcal{V}^{(\ell)}. \end{cases}$$ (2)

Then, the normalized cut objective can be rewritten as follows (see, e.g., [1] for a detailed deduction):

$$\mathrm{Ncut}(\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(k)}) = \sum_{\ell=1}^{k} \tilde{\bm{h}}_\ell^\top L \tilde{\bm{h}}_\ell = \mathrm{trace}(\tilde{H}^\top L \tilde{H}),$$ (3)

where $\tilde{H} = [\tilde{\bm{h}}_1, \cdots, \tilde{\bm{h}}_k]$, in which $\tilde{\bm{h}}_\ell = \frac{1}{\sqrt{\mathrm{vol}(\mathcal{V}^{(\ell)})}} \bm{h}_\ell$, $L = D - A$ is the graph Laplacian, and $D = \mathrm{Diag}(D_{1,1}, \cdots, D_{n,n})$ is an $n \times n$ degree matrix.
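The identity in Eq. (3) can be checked numerically. The sketch below builds $H$, $\tilde{H}$ and $L = D - A$ for a fixed hard partition of a toy graph and compares the trace form with the definition in Eq. (1); it also confirms that $\tilde{H}^\top D \tilde{H}$ is the identity for a hard partition. The graph, the partition and the helper `quad_form` are illustrative assumptions.

```python
import math

# Hypothetical toy graph: two triangles joined by one weak edge.
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.5)]
n, k = 6, 2
A = [[0.0] * n for _ in range(n)]
for i, j, w in EDGES:
    A[i][j] = A[j][i] = w

labels = [0, 0, 0, 1, 1, 1]                 # hard partition to evaluate
degree = [sum(row) for row in A]            # D_{i,i}
L = [[(degree[i] if i == j else 0.0) - A[i][j] for j in range(n)]
     for i in range(n)]                     # graph Laplacian L = D - A


def quad_form(M, u, v):
    """Return u^T M v for a square matrix M given as a list of rows."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


# Columns h_l of H (Eq. (2)) and scaled columns h~_l = h_l / sqrt(vol_l).
H_cols = [[1.0 if labels[i] == l else 0.0 for i in range(n)] for l in range(k)]
vol = [sum(degree[i] * h[i] for i in range(n)) for h in H_cols]
Ht_cols = [[x / math.sqrt(vol[l]) for x in H_cols[l]] for l in range(k)]

# Right-hand side of Eq. (3): trace(H~^T L H~) = sum_l h~_l^T L h~_l.
trace_form = sum(quad_form(L, h, h) for h in Ht_cols)

# Left-hand side via Eq. (1): sum_l cut_l / vol_l.
cut = [sum(A[i][j] for i in range(n) for j in range(n)
           if labels[i] == l and labels[j] != l) for l in range(k)]
direct = sum(cut[l] / vol[l] for l in range(k))

# H~^T D H~, expected to be the k x k identity for a hard partition.
gram = [[sum(Ht_cols[a][i] * degree[i] * Ht_cols[b][i] for i in range(n))
         for b in range(k)] for a in range(k)]

print(trace_form, direct, gram)
```

The two values agree up to floating-point error because, for an indicator vector $\bm{h}_\ell$, the quadratic form $\bm{h}_\ell^\top L \bm{h}_\ell$ equals the total weight of edges leaving the cluster.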

Relaxing Segmentation Matrix Continuously. Solving for the partition $\{\mathcal{V}^{(1)}, \ldots, \mathcal{V}^{(k)}\}$ amounts to finding the segmentation matrix $H$. Unfortunately, finding $\{\bm{h}_\ell\}_{\ell=1}^{k}$ for $k > 2$ is an NP-hard combinatorial optimization problem. Alternatively, the common practice in spectral clustering is to solve a continuous relaxation of $\tilde{H}$. For the Ncut problem, one solves:

$$\min_{\tilde{H} \in \mathbb{R}^{n \times k}} \, \mathrm{trace}(\tilde{H}^\top L \tilde{H}) \quad \text{s.t.} \quad \tilde{H}^\top D \tilde{H} = I,$$ (4)

where $I \in \mathbb{R}^{k \times k}$ is an identity matrix and the constraint $\tilde{H}^\top D \tilde{H} = I$ is imposed to avoid trivial solutions. By letting $F = D^{1/2} \tilde{H}$, the problem in Eq. (4) becomes:

$$\min_{F \in \mathbb{R}^{n \times k}} \, \mathrm{trace}(F^\top \tilde{L} F) \quad \text{s.t.} \quad F^\top F = I,$$ (5)

where $\tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$ is the normalized graph Laplacian. It is then easy to show that the matrix $F$ can be obtained by computing the $k$ eigenvectors associated with the $k$ smallest eigenvalues of $\tilde{L}$.
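The change of variables between Eq. (4) and Eq. (5) can be verified term by term. The short derivation below uses only the definitions above and assumes that every degree $D_{i,i}$ is strictly positive, so that $D^{-1/2}$ exists (an assumption that the text leaves implicit).

```latex
\begin{aligned}
F = D^{1/2}\tilde{H}
  \;&\Longleftrightarrow\; \tilde{H} = D^{-1/2} F, \\
\mathrm{trace}\bigl(\tilde{H}^\top L \tilde{H}\bigr)
  &= \mathrm{trace}\bigl(F^\top D^{-1/2} L D^{-1/2} F\bigr)
   = \mathrm{trace}\bigl(F^\top \tilde{L} F\bigr), \\
\tilde{H}^\top D \tilde{H}
  &= F^\top D^{-1/2} D D^{-1/2} F
   = F^\top F .
\end{aligned}
```

With the constraint in the form $F^\top F = I$, the Rayleigh-Ritz (Ky Fan) argument applies: the minimum of $\mathrm{trace}(F^\top \tilde{L} F)$ is the sum of the $k$ smallest eigenvalues of $\tilde{L}$, attained when the columns of $F$ are the corresponding orthonormal eigenvectors.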

Nonnegative Spectral Clustering. Although relaxing the segmentation matrix $H$, and hence $\tilde{H}$, into a real-valued matrix subject to the orthogonality constraint $\tilde{H}^\top D \tilde{H} = I$ leads to an eigenvalue decomposition problem, the solution returned by the $k$ eigenvectors associated with the smallest eigenvalues usually cannot provide a satisfactory approximation to the desired segmentation matrix. This is because the optimal solution for $\tilde{H}$ obtained from these $k$ eigenvectors may contain arbitrary negative entries, which make it deviate from a clustering membership. To remedy this deficiency, nonnegative spectral clustering has been developed, in which a nonnegativity constraint is imposed on the problem as follows:

$$\min_{\tilde{H} \in \mathbb{R}^{n \times k}} \, \mathrm{trace}(\tilde{H}^\top L \tilde{H}), \quad \text{s.t.} \quad \tilde{H}^\top D \tilde{H} \approx I, \ \tilde{H} \geq 0.$$ (6)

However, as a cost of imposing the nonnegativity constraint, the problem can no longer be solved as an eigenvalue decomposition problem, but has to resort to a nonnegative matrix factorization scheme, under which the orthogonality constraint can hardly be satisfied strictly.

Remark 1. We notice that there exists a latent trade-off in the stage of relaxing the segmentation matrix $H$ defined in (2) for spectral clustering. The spectral clustering methods introduce an orthogonality constraint but at a cost of giving up the nonnegativity constraint; whereas the nonnegative spectral clustering methods impose the nonnegativity constraint but at a cost of giving up the orthogonality constraint. This trade-off between orthogonality and nonnegativity inspires us to take a compromise path to reformulate the normalized graph cut problem.

3.2 Reformulating Normalized Cut Problem

Now we begin to reformulate the normalized cut problem. Since we want to learn the (soft) clustering membership directly, rather than merely the spectral embedding, it is desirable that the index of the largest entry in each row of the matrix $H$ indicates the clustering membership, so that the assignment of each data point to a cluster can be determined without an extra $k$-means step. Moreover, Remark 1 suggests that the orthogonality constraint is not necessary as long as additional constraints are imposed to make $H$ satisfy the properties desired of a clustering membership. Thus, we propose to address the normalized cut problem by solving the following relaxed problem:

$$\begin{split} \min_{\tilde{H} \in \mathbb{R}^{n \times k}} & \ \mathrm{trace}\left(\tilde{H}^\top L \tilde{H}\right) + \frac{\gamma}{2} \left\| \tilde{H}^\top D \tilde{H} - I \right\|_F^2, \\ \text{s.t.} & \quad 0 \leq \tilde{H}\Lambda \leq 1, \ (\tilde{H}\Lambda) \cdot \mathbf{1} = \mathbf{1}, \end{split}$$ (7)

where $\gamma > 0$ is a penalty parameter, $\|\tilde{H}^\top D \tilde{H} - I\|_F^2$ is a relaxed orthogonality constraint, $\mathbf{1}$ is a $k$-dimensional vector consisting of $1$'s, and $\Lambda \in \mathbb{R}^{k \times k}$ is an unknown diagonal matrix defined by:

$$\Lambda = \mathrm{Diag}\left(\mathrm{vol}(\mathcal{V}^{(1)}), \cdots, \mathrm{vol}(\mathcal{V}^{(k)})\right)^{\frac{1}{2}}.$$ (8)

Compared to the conventional normalized cut formulation in Eq. (4) and the nonnegative spectral clustering in Eq. (6), the main differences are twofold: a) the orthogonality constraint $\tilde{H}^\top D \tilde{H} = I$ is relaxed; and b) a set of adjusted box constraints $0 \leq \tilde{H}\Lambda \leq 1$ and $(\tilde{H}\Lambda) \cdot \mathbf{1} = \mathbf{1}$ are added, rather than using the naive nonnegativity constraint $H \geq 0$. The reason to incorporate such a set of calibrated constraints is to approximate the segmentation matrix $H$ defined in Eq. (2) more accurately in the relaxed problem. It will be clear soon that such constraints can be elegantly and implicitly satisfied.
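To make the role of $\Lambda$ concrete, the sketch below takes a hard partition with unequal cluster volumes, forms $\Lambda$ as in Eq. (8) and $\tilde{H} = H\Lambda^{-1}$, and checks that $\tilde{H}\Lambda$ satisfies the box and row-sum constraints of Eq. (7) while the orthogonality penalty vanishes. The 5-vertex graph, the partition and the tolerance `TOL` are assumptions chosen for illustration.

```python
import math

# Hypothetical toy graph: a triangle {0,1,2} and a pair {3,4}, bridged by (2,3).
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (2, 3, 0.5)]
n, k = 5, 2
A = [[0.0] * n for _ in range(n)]
for i, j, w in EDGES:
    A[i][j] = A[j][i] = w

labels = [0, 0, 0, 1, 1]
degree = [sum(row) for row in A]

# Binary segmentation matrix H (n x k), as in Eq. (2).
H = [[1.0 if labels[i] == l else 0.0 for l in range(k)] for i in range(n)]

# Diagonal of Lambda from Eq. (8): square roots of the cluster volumes.
vol = [sum(degree[i] * H[i][l] for i in range(n)) for l in range(k)]
lam = [math.sqrt(v) for v in vol]

# H~ = H Lambda^{-1}; multiplying back by Lambda recovers the membership matrix.
H_tilde = [[H[i][l] / lam[l] for l in range(k)] for i in range(n)]
scaled = [[H_tilde[i][l] * lam[l] for l in range(k)] for i in range(n)]

TOL = 1e-12  # tolerance for floating-point round-off
in_box = all(-TOL <= x <= 1.0 + TOL for row in scaled for x in row)
rows_sum_to_one = all(abs(sum(row) - 1.0) <= TOL for row in scaled)

# Orthogonality penalty ||H~^T D H~ - I||_F^2 from Eq. (7).
gram = [[sum(H_tilde[i][a] * degree[i] * H_tilde[i][b] for i in range(n))
         for b in range(k)] for a in range(k)]
penalty = sum((gram[a][b] - (1.0 if a == b else 0.0)) ** 2
              for a in range(k) for b in range(k))

print(vol, in_box, rows_sum_to_one, penalty)
```

Here the two clusters have volumes 6.5 and 2.5, so the columns of $\tilde{H}$ are scaled differently; constraining $\tilde{H}\Lambda$ rather than $\tilde{H}$ is what keeps the constraints on the same scale as the memberships in $H$.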

Rather than solving for $\tilde{H}$ directly from problem (7), noting that $\tilde{H} = H\Lambda^{-1}$, we reparameterize $H$ via a Multi-Layer Perceptron (MLP) with a softmax output layer, i.e.,

$$\mathbf{f}(\cdot; \Theta) = \texttt{softmax}(\mathbf{g}(\cdot; \Theta)),$$ (9)

where $\mathbf{g}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}$ is the MLP and $\Theta$ denotes all the parameters in the network. Given the input data $X$, the matrix $H$ is approximated by the network output $\mathbf{f}(X; \Theta)$, which can be optimized by updating the parameters in $\Theta$. Owing to the softmax layer, the output of the network can serve to approximate the clustering memberships. In particular, we note that by incorporating the softmax layer and rewriting $\tilde{H} = \mathbf{f}(X; \Theta)\Lambda^{-1}$, the constraints $0 \leq \tilde{H}\Lambda \leq 1, \ (\tilde{H}\Lambda) \cdot \mathbf{1} = \mathbf{1}$ in Eq. (7) are automatically satisfied, provided that $\Lambda$ is known or estimated.
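The following minimal sketch shows why a softmax output satisfies these constraints for any choice of parameters. The one-hidden-layer architecture, the ReLU activation, the random initialization and the names `init_mlp` and `forward` are assumptions for illustration; the text only specifies an MLP $\mathbf{g}$ followed by a softmax.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_mlp(d, hidden, k, seed=0):
    """Random parameters Theta for a hypothetical one-hidden-layer MLP g: R^d -> R^k."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": mat(hidden, d), "b1": [0.0] * hidden,
            "W2": mat(k, hidden), "b2": [0.0] * k}


def forward(x, theta):
    """f(x; Theta) = softmax(g(x; Theta)), as in Eq. (9), for one data point x."""
    hid = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)   # ReLU layer
           for row, b in zip(theta["W1"], theta["b1"])]
    logits = [sum(w * h for w, h in zip(row, hid)) + b
              for row, b in zip(theta["W2"], theta["b2"])]
    return softmax(logits)


k = 3
theta = init_mlp(d=2, hidden=8, k=k)

# Data points stored one per row here (the text stores them as columns of X).
X = [[0.0, 1.0], [2.0, -1.0], [-3.0, 0.5]]
Y = [forward(x, theta) for x in X]          # n x k soft membership matrix

# Cluster label = index of the largest entry in each row; no k-means step.
labels = [max(range(k), key=row.__getitem__) for row in Y]

# An out-of-sample point only needs one more forward pass.
y_new = forward([1.5, 1.5], theta)
print(Y, labels, y_new)
```

Every row of `Y` lies in $[0, 1]$ and sums to one regardless of the values in `theta`, so the constraints hold by construction and training only has to minimize the loss.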

The loss function to train the neural network $\mathbf{f}(\cdot;\Theta)$ turns out to be:

$$\begin{aligned}\mathcal{L}(X,A;\Theta) := {} & \underbrace{\mathrm{trace}\left((\mathbf{f}(X;\Theta)\Lambda^{-1})^{\top}\cdot L\cdot(\mathbf{f}(X;\Theta)\Lambda^{-1})\right)}_{\mathcal{L}_{Lap}} \\ & + \frac{\gamma}{2}\underbrace{\left\|(\mathbf{f}(X;\Theta)\Lambda^{-1})^{\top}\cdot D\cdot(\mathbf{f}(X;\Theta)\Lambda^{-1})-I\right\|_{F}^{2}}_{\mathcal{L}_{orth}}.\end{aligned} \tag{10}$$

Unfortunately, the matrix $\Lambda$ in Eq. (10) is still unknown. Let $Y=\mathbf{f}(X;\Theta)\in\mathbb{R}^{n\times k}$ be the output matrix; we replace the volume of the $\ell$-th cluster, $\mathrm{vol}(\mathcal{V}^{(\ell)})$, by the following estimate:

$$\widetilde{\mathrm{vol}}(\mathcal{V}^{(\ell)}):=\sum_{i=1}^{n}y_{i,\ell}\cdot D_{i,i}, \tag{11}$$

where $y_{i,\ell}$, an element of the output matrix $Y$, is the belief that the $i$-th data point belongs to the $\ell$-th cluster $\mathcal{V}^{(\ell)}$. Then, according to the definition in Eq. (8), $\Lambda$ can also be replaced by:

$$\tilde{\Lambda}=\mathrm{Diag}\left(\widetilde{\mathrm{vol}}(\mathcal{V}^{(1)}),\cdots,\widetilde{\mathrm{vol}}(\mathcal{V}^{(k)})\right)^{\frac{1}{2}}. \tag{12}$$

That is, the task of finding the desired segmentation matrix $H$ defined in Eq. (2) turns into a task of training the neural network $\mathbf{f}(\cdot;\Theta)$ via the loss in Eq. (10) and updating $\Lambda$ via Eq. (12) alternately. For clarity, we summarize our training procedure in Algorithm 1. After training, the network $\mathbf{f}(\cdot;\Theta)$ has learned to map the input data $X$ in the feature space $\mathbb{R}^{d}$ to cluster assignments in the space $\mathbb{R}^{k}$.
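
To make Eqs. (10)-(12) concrete, here is a minimal NumPy sketch that estimates the cluster volumes from a soft assignment matrix and evaluates both loss terms. It assumes the unnormalized Laplacian $L=D-A$ (Section 3.1 is not reproduced here), and the function name `neuncut_loss` and the toy graph are illustrative choices, not part of the original method description.

```python
import numpy as np

def neuncut_loss(Y, A, gamma):
    """Loss of Eq. (10) with Lambda replaced by its estimate from Eqs. (11)-(12)."""
    deg = A.sum(axis=1)                  # diagonal entries D_ii of the degree matrix
    D = np.diag(deg)
    L = D - A                            # assumed form of the graph Laplacian
    vol = Y.T @ deg                      # Eq. (11): soft volume per cluster, shape (k,)
    H = Y / np.sqrt(vol)                 # H-tilde = Y @ Lambda^{-1}, with Lambda from Eq. (12)
    lap = np.trace(H.T @ L @ H)          # L_Lap
    orth = np.sum((H.T @ D @ H - np.eye(Y.shape[1])) ** 2)   # L_orth (squared Frobenius norm)
    return lap + 0.5 * gamma * orth, lap, orth

# Toy graph: two disconnected pairs of points.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
Y_hard = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])   # correct hard partition
Y_flat = np.full((4, 2), 0.5)                                 # uninformative assignment

total_hard, lap_hard, orth_hard = neuncut_loss(Y_hard, A, gamma=1.0)
total_flat, lap_flat, orth_flat = neuncut_loss(Y_flat, A, gamma=1.0)
```

On this toy graph the correct partition makes both terms vanish, while the uniform assignment also has a zero Laplacian term but a strictly positive orthogonality penalty, which is how the penalty discourages degenerate solutions.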

Remark 2. As an end-to-end approach, our NeuNcut can be used to replace conventional spectral clustering in many downstream clustering tasks, e.g., the spectral clustering step in subspace clustering methods [24]. The benefits of our framework are twofold. First, the neural network $\mathbf{f}(X;\Theta)$ maps data points directly to their cluster memberships. Hence, it can be trained on a small sampled subset of the data and then generalized to infer the cluster memberships of unseen data points directly. This provides an efficient mechanism for handling the clustering task on ultra large-scale datasets. Second, the neural network can be trained in a mini-batch mode and is therefore scalable. For each mini-batch that contains $m$ data points, only an $m\times m$ graph Laplacian needs to be computed and cached. As the number of batch samples grows, the graph Laplacian on the mini-batch data converges to the manifold Laplacian [25, 26].

Remark 3. It is worth noting that the updating step in Eq. (12) is analogous to an Expectation (E) step that estimates the expectation $\mathbb{E}[\Lambda]$. When $\Lambda$ is fixed, we train the neural network $\mathbf{f}(X;\Theta)$ by minimizing the loss function in Eq. (10); this stage is effectively analogous to a Maximization (M) step that maximizes the likelihood $\exp(-\mathcal{L}(X,A;\Theta))$. Thus, our proposed scheme to solve NeuNcut by alternately training the neural network $\mathbf{f}(X;\Theta)$ and updating $\Lambda$ is essentially an EM-style algorithm, as shown in steps 10-14 of Algorithm 1.

Algorithm 1 An EM-style Procedure for Solving NeuNcut
1:  Input: Training data $X$, trade-off parameter $\gamma>0$, number of iterations $T$, batch size $m$ and learning rate $\eta$.
2:  Initialization: $t=0$, random initialization of MLP parameters $\Theta^{(t)}$.
3:  for each $t\in\{1,\cdots,T\}$ do
4:     # Data preparation
5:     Randomly sample mini-batch data $X^{(t)}$ from $X$.
6:     Obtain affinities $A^{(t)}\in\mathbb{R}^{m\times m}$ from $X^{(t)}$ by the methods in Section 3.4.
7:     Compute the degree matrix $D^{(t)}$ and graph Laplacian $L^{(t)}$ w.r.t. $A^{(t)}$ as defined in Section 3.1.
8:     # Forward pass
9:     Compute output $Y^{(t)}\in\mathbb{R}^{m\times k}$ by Eq. (9).
10:     # Estimating volume
11:     Compute volume $\Lambda^{(t)}\in\mathbb{R}^{k\times k}$ by Eq. (12).
12:     # Backward propagation
13:     Compute $\nabla_{\Theta}\mathcal{L}(X^{(t)},A^{(t)};\Theta^{(t)})$ of Eq. (10).
14:     Set $\Theta^{(t+1)}\leftarrow\Theta^{(t)}-\eta\cdot\nabla_{\Theta}\mathcal{L}(X^{(t)},A^{(t)};\Theta^{(t)})$.
15:     $t\leftarrow t+1$.
16:  end for
17:  Output: $\Theta^{(t)}$
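
The loop in Algorithm 1 can be sketched end to end on a toy problem. The version below is only an illustration under several assumptions that are not in the paper: a single linear layer stands in for the MLP, the gradient is approximated by central finite differences instead of backpropagation, plain gradient descent replaces Adam, the affinity is the heat kernel, and the Laplacian is taken as $L=D-A$. In line with Remark 3, $\Lambda$ is estimated from the current output and then held fixed while the gradient is computed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(theta, X, k):
    """Toy stand-in for f(.; Theta): one linear layer followed by softmax."""
    d = X.shape[1]
    W = theta[: d * k].reshape(d, k)
    b = theta[d * k:]
    return softmax(X @ W + b)

def loss_fixed_lambda(theta, X, L, D, lam_inv, gamma, k):
    """Eq. (10) with the diagonal of Lambda^{-1} (lam_inv) held fixed."""
    H = forward(theta, X, k) * lam_inv
    orth = H.T @ D @ H - np.eye(k)
    return np.trace(H.T @ L @ H) + 0.5 * gamma * np.sum(orth ** 2)

def train(X, k, gamma=1.0, T=30, m=32, lr=0.05, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = 0.1 * rng.standard_normal(d * k + k)            # step 2: random init
    eps = 1e-5
    for _ in range(T):                                       # step 3
        Xb = X[rng.choice(n, size=m, replace=False)]         # step 5: mini-batch
        sq = ((Xb[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=-1)
        A = np.exp(-sq / (2.0 * sigma ** 2))                 # step 6: heat-kernel affinity
        deg = A.sum(axis=1)
        D = np.diag(deg)
        L = D - A                                            # step 7 (assumed Laplacian)
        Y = forward(theta, Xb, k)                            # step 9: forward pass
        lam_inv = 1.0 / np.sqrt(Y.T @ deg)                   # step 11: Eqs. (11)-(12)
        grad = np.zeros_like(theta)                          # step 13: finite differences
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = eps
            up = loss_fixed_lambda(theta + e, Xb, L, D, lam_inv, gamma, k)
            down = loss_fixed_lambda(theta - e, Xb, L, D, lam_inv, gamma, k)
            grad[j] = (up - down) / (2.0 * eps)
        theta = theta - lr * grad                            # step 14: gradient step
    return theta

# Hypothetical toy data: two well-separated Gaussian blobs in 2D.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2.0, 0.3, (40, 2)), data_rng.normal(2.0, 0.3, (40, 2))])
theta = train(X, k=2)
labels = forward(theta, X, 2).argmax(axis=1)   # direct inference on all points
```

Finite differences are used here only to keep the sketch dependency-free; they scale poorly with the number of parameters, so a practical implementation would rely on automatic differentiation.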

3.3 When Orthogonality Meets Softmax

In our NeuNcut, we impose in Eq. (7) a set of adjusted box constraints, which are automatically satisfied owing to the softmax layer in the neural network $\mathbf{f}(\cdot;\Theta)$. In this section, we explain when strict orthogonality can be satisfied, given that the neural network is used to learn $Y$.

Theorem 1

Let $Y\in\mathbb{R}^{n\times k}$ be the clustering membership matrix produced by the softmax layer of $\mathbf{f}(\cdot;\Theta)$, and let $\tilde{H}=Y\Lambda^{-1}$, where $\Lambda$ is defined in Eq. (8). Then, the orthogonality constraint $\tilde{H}^{\top}D\tilde{H}=I$ holds if and only if $Y$ is a binary clustering membership matrix.

Proof 1

If the orthogonality constraint $\tilde{H}^{\top}D\tilde{H}=I$ is satisfied, then for the $\ell$-th cluster $\mathcal{V}^{(\ell)}$ we must have $\tilde{\bm{h}}_{\ell}^{\top}D\tilde{\bm{h}}_{\ell}=1$. Note that

$$\tilde{\bm{h}}_{\ell}^{\top}D\tilde{\bm{h}}_{\ell}=\sum_{i=1}^{n}D_{i,i}\cdot\tilde{h}_{i,\ell}^{2}=\frac{\sum_{i=1}^{n}D_{i,i}\cdot y_{i,\ell}^{2}}{\mathrm{vol}(\mathcal{V}^{(\ell)})},$$

where $\mathrm{vol}(\mathcal{V}^{(\ell)})$ is replaced by its estimate defined in Eq. (11) in our approach. Thus we have

$$\tilde{\bm{h}}_{\ell}^{\top}D\tilde{\bm{h}}_{\ell}=\frac{\sum_{i=1}^{n}D_{i,i}y_{i,\ell}^{2}}{\sum_{i=1}^{n}D_{i,i}y_{i,\ell}}.$$

Since $0\leq y_{i,\ell}\leq 1$, we have $0\leq y^{2}_{i,\ell}\leq 1$ and $y^{2}_{i,\ell}\leq y_{i,\ell}$. Note also that $D_{i,i}>0$; thus $\sum_{i=1}^{n}D_{i,i}y_{i,\ell}^{2}\leq\sum_{i=1}^{n}D_{i,i}y_{i,\ell}$, i.e., $\tilde{\bm{h}}_{\ell}^{\top}D\tilde{\bm{h}}_{\ell}\leq 1$, where equality holds only if $y_{i,\ell}^{2}=y_{i,\ell}$, i.e., $y_{i,\ell}\in\{0,1\}$, for every $i$. Because each row of $Y$ sums to one, this means that in every row exactly one entry $y_{i,\ell}=1$ and $y_{i,\ell^{\prime}}=0$ for all $\ell^{\prime}\neq\ell$. Hence each row of $Y$ must be binary. Conversely, if $Y$ is binary, then $y_{i,\ell}^{2}=y_{i,\ell}$ gives $\tilde{\bm{h}}_{\ell}^{\top}D\tilde{\bm{h}}_{\ell}=1$, and the columns of $Y$ have disjoint supports, so the off-diagonal entries of $\tilde{H}^{\top}D\tilde{H}$ vanish.
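
The inequality in the proof can be checked numerically. The sketch below evaluates the ratio $\sum_i D_{i,i}y_{i,\ell}^2/\sum_i D_{i,i}y_{i,\ell}$ for a soft and a binary membership matrix; the degrees and membership values are made-up numbers chosen only for illustration.

```python
import numpy as np

def diag_of_orthogonality(Y, deg):
    """Diagonal of H~^T D H~ when Lambda is estimated by Eq. (11):
    sum_i D_ii * y_il^2 / sum_i D_ii * y_il, one value per cluster l."""
    return (deg @ Y ** 2) / (deg @ Y)

deg = np.array([2.0, 1.0, 3.0, 1.5])     # hypothetical positive degrees D_ii
Y_soft = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
Y_hard = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

soft_diag = diag_of_orthogonality(Y_soft, deg)   # strictly below 1 for soft rows
hard_diag = diag_of_orthogonality(Y_hard, deg)   # equal to 1 for binary rows

# For the binary case the full matrix H~^T D H~ is the identity.
H_hard = Y_hard / np.sqrt(deg @ Y_hard)
gram_hard = H_hard.T @ np.diag(deg) @ H_hard
```

The soft assignment stays strictly below one on the diagonal, matching the claim that exact orthogonality forces every row of $Y$ to be one-hot.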

This result tells us that if a strict orthogonality constraint is imposed in NeuNcut, it will push one entry in each row of $Y$ to $1$ and all others to $0$, which is numerically very difficult with a softmax function. Indeed, each row of $Y$ in Eq. (9) being binary is equivalent to requiring some entries of $\mathbf{g}(\cdot;\Theta)$ to tend to infinity, which poses numerical difficulty in training the MLP. Furthermore, there is no need to produce binary clustering memberships in NeuNcut, since the cluster index of each data point can be conveniently assigned via argmax.

In addition, the theorem also suggests a practical way to set the penalty parameter $\gamma$. That is, the penalty term $\mathcal{L}_{orth}$ with a reasonably small $\gamma$ in Eq. (10) makes the MLP easier to optimize, by encouraging a soft clustering membership $Y$ without damaging the correctness of the clustering. As we will see in Section 4.5, NeuNcut can prevent degenerate solutions and obtain satisfactory clustering results when the orthogonality constraint is relaxed, as long as the penalty weight $\gamma$ is larger than a certain threshold.

3.4 Methods to Learn Affinity

We train our NeuNcut with three types of affinity: a) heat kernel affinity; b) SiameseNet based heat kernel affinity; and c) self-expressiveness induced affinity.

Heat kernel affinity. To be comparable with other classic methods, we implement the most common setting to train our NeuNcut. The affinity is defined by the heat kernel with a bandwidth parameter $\sigma>0$:

$$a_{i,j}=\exp\left(-\frac{\|\bm{x}_{i}-\bm{x}_{j}\|^{2}_{2}}{2\sigma^{2}}\right). \tag{13}$$
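
As a concrete sketch of Eq. (13), the snippet below builds the dense heat-kernel affinity for a small batch together with the degree matrix and Laplacian used in the loss. The unnormalized form $L=D-A$ is an assumption, since the definition in Section 3.1 is not reproduced here, and the three sample points are arbitrary.

```python
import numpy as np

def heat_kernel_affinity(X, sigma):
    """Dense heat-kernel affinity of Eq. (13) between all rows of X."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)      # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])   # hypothetical mini-batch, m=3
A = heat_kernel_affinity(X, sigma=1.0)
D = np.diag(A.sum(axis=1))    # degree matrix
L = D - A                     # unnormalized graph Laplacian (assumed form)
```

Because the affinity is dense, memory grows as $m^2$ in the batch size, which is why computing it per mini-batch rather than on the full dataset matters.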

SiameseNet based heat kernel affinity. In SpectralNet [11], the pairwise affinity is learned by a Siamese network, which is trained via a contrastive loss:

$$\mathcal{L}_{const}(\bm{x}_{i},\bm{x}_{j};\Psi)=\begin{cases}\|\bm{z}_{i}-\bm{z}_{j}\|_{2}^{2},&\bm{x}_{j}\in\mathcal{N}_{3}(\bm{x}_{i})\\ \max(1-\|\bm{z}_{i}-\bm{z}_{j}\|_{2},0)^{2},&\bm{x}_{j}\in\overline{\mathcal{N}_{3}}(\bm{x}_{i}),\end{cases} \tag{14}$$

where $\Psi$ denotes the parameters of the Siamese network, $\bm{z}_{i}$ is the output of the Siamese network corresponding to the input $\bm{x}_{i}$, $\mathcal{N}_{3}(\bm{x}_{i})$ denotes the set of the three nearest neighbors of $\bm{x}_{i}$ determined by Euclidean distance, and $\overline{\mathcal{N}_{3}}(\bm{x}_{i})$ denotes a set of three non-neighbors of $\bm{x}_{i}$ chosen at random from all non-neighbors. Once the Siamese network is trained, the pairwise affinity is defined by the SiameseNet based heat kernel as follows:

$$a_{i,j}=\exp\left(-\frac{\|\bm{z}_{i}-\bm{z}_{j}\|^{2}_{2}}{2\sigma^{2}}\right). \tag{15}$$
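
The two branches of the contrastive loss in Eq. (14) can be written as a short function on a single pair of embeddings. This is a sketch of the per-pair loss only; the Siamese network itself and the neighbor sampling are omitted, and the embedding vectors below are hypothetical values.

```python
import numpy as np

def contrastive_loss(z_i, z_j, is_neighbor, margin=1.0):
    """Per-pair loss of Eq. (14); margin=1.0 matches the constant in the equation."""
    dist = np.linalg.norm(z_i - z_j)
    if is_neighbor:
        return dist ** 2                      # pull neighbors together
    return max(margin - dist, 0.0) ** 2       # push non-neighbors at least `margin` apart

z_a = np.array([0.0, 0.0])
z_near = np.array([0.3, 0.4])   # distance 0.5 from z_a
z_far = np.array([3.0, 4.0])    # distance 5.0 from z_a

loss_pos = contrastive_loss(z_a, z_near, is_neighbor=True)    # squared distance
loss_neg_close = contrastive_loss(z_a, z_near, is_neighbor=False)
loss_neg_far = contrastive_loss(z_a, z_far, is_neighbor=False)
```

A non-neighbor pair that is already farther apart than the margin contributes nothing, so the loss only acts on pairs that violate the desired geometry.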

Self-expressiveness induced affinity. When clustering high dimensional data, it is reasonable to assume that the data approximately lie on a union of subspaces [27]. Here, we adopt SENet [24] to learn the self-expressiveness induced affinity, which parameterizes the self-expressive coefficients $c_{ij}$ with a key-query network and trains it by minimizing a self-expression loss with elastic net regularization [28]:

$$\underset{\{c_{ij}\}_{i\neq j}}{\min}\ \frac{\eta}{2}\Big\|\bm{x}_{j}-\sum_{i\neq j}c_{ij}\bm{x}_{i}\Big\|_{2}^{2}+\sum_{i\neq j}\left(\lambda|c_{ij}|+\frac{1-\lambda}{2}c_{ij}^{2}\right), \tag{16}$$

where $\eta>0$ and $0\leq\lambda\leq 1$ are two hyper-parameters. Given the self-expressive coefficients $c_{ij}$, by default we define the symmetric affinity as $a_{ij}=(|c_{ij}|+|c_{ji}|)/2$.
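
The following sketch evaluates the objective of Eq. (16) for a given coefficient matrix and builds an affinity from it. Two assumptions are made for illustration: the per-point objective is summed over all points $j$, and the affinity is taken to be the symmetrized form $(|c_{ij}|+|c_{ji}|)/2$. The key-query network that SENet uses to produce the coefficients is not modeled; `C` is simply a given array.

```python
import numpy as np

def self_expression_loss(X, C, eta, lam):
    """Eq. (16) summed over all points j, with C[i, j] = c_ij and the diagonal ignored."""
    C = C.copy()
    np.fill_diagonal(C, 0.0)                 # enforce i != j
    residual = X - C.T @ X                   # row j: x_j - sum_i c_ij x_i
    fit = 0.5 * eta * np.sum(residual ** 2)
    reg = np.sum(lam * np.abs(C) + 0.5 * (1.0 - lam) * C ** 2)   # elastic net
    return fit + reg

def affinity_from_coefficients(C):
    """Symmetrized affinity (|c_ij| + |c_ji|) / 2 (assumed form)."""
    return 0.5 * (np.abs(C) + np.abs(C.T))

# Hypothetical data: points 0 and 1 coincide, point 2 lies in another direction.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
loss = self_expression_loss(X, C, eta=2.0, lam=0.5)
A = affinity_from_coefficients(np.array([[0.0, 2.0], [-1.0, 0.0]]))
```

In the toy example the first two points express each other exactly, so the only reconstruction error comes from the third point, and the remaining cost is the elastic net penalty on the two nonzero coefficients.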

4 Experiments

In this section, we provide comprehensive evaluations for our NeuNcut on both synthetic data and real-world data. (Footnote 1: The code of this work will be released upon the acceptance of the manuscript.)

4.1 Datasets and Metrics

Datasets. To evaluate the performance of NeuNcut, we use the following datasets. MNIST [29] consists of 70,000 gray images of handwritten digits 0-9. Fashion MNIST (F-MNIST) [30] consists of 70,000 gray images of 10 fashion products; several fashion product clusters in F-MNIST are hard to distinguish. For Extended MNIST (E-MNIST) [31], following [32], we select all lowercase letters, giving 190,998 images belonging to 26 extremely imbalanced categories. CIFAR-10 [33] consists of 60,000 color images belonging to 10 categories. MNIST8M [34], a.k.a. infinite MNIST, consists of 8 million samples produced from MNIST by using pseudo-random deformations and translations. TinyImageNet and ImageNet-1k are two popular subsets of ImageNet [35] that consist of color images belonging to 200 and 1,000 categories, respectively.

For each dataset, we train our NeuNcut on a sampled subset, called the training set, and then evaluate the clustering performance obtained by directly inferring the pseudo labels over the entire dataset with the trained NeuNcut. The size of the sampled training subset for each dataset is shown in Table 1. For MNIST, F-MNIST, E-MNIST and MNIST8M, we compute a feature vector of dimension 3,472 using ScatNet [36] and reduce the dimension to 500 using PCA. (Footnote 2: Following [32], we remove the mean after dimension reduction via PCA for E-MNIST.) For CIFAR-10, we use MCR$^2$ [37] to extract 128-dimensional features. For TinyImageNet and ImageNet-1k, we use the image encoder of CLIP [38], a large-scale pretrained model, to extract 768-dimensional features; the resulting datasets are denoted by “TinyImageNet-CLIP” and “ImageNet1k-CLIP”, respectively.

Metrics. We use three common evaluation metrics: a) clustering accuracy (ACC); b) normalized mutual information (NMI); and c) adjusted Rand index (ARI). The details of the definitions can be found in the appendices of [37]. In short, ACC and NMI take values in $[0,1]$, ARI is bounded above by 1 (and can be slightly negative for partitions worse than chance), and a higher value indicates better performance.
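
Since cluster indices are arbitrary, clustering accuracy is computed after matching predicted clusters to ground-truth classes. The sketch below does this by brute force over all permutations, which is only practical for small $k$; it is an illustrative stand-in, as evaluations at the scale reported here would normally use the Hungarian algorithm for the matching step.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred, k):
    """ACC: best agreement over all one-to-one relabelings of predicted clusters."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.zeros((k, k), dtype=int)      # counts[p, t]: points with pred p and truth t
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    best = max(
        sum(counts[p, perm[p]] for p in range(k))
        for perm in permutations(range(k))    # brute force: k! candidate matchings
    )
    return best / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
acc_perfect = clustering_accuracy(y_true, [1, 1, 0, 0, 2, 2], k=3)   # same partition, relabeled
acc_one_off = clustering_accuracy(y_true, [1, 1, 0, 2, 2, 2], k=3)   # one point misplaced
```

The first prediction is a pure relabeling of the ground truth and therefore scores perfectly, while the second has exactly one misassigned point out of six.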

Parameter Settings. In NeuNcut, we use a multi-layer perceptron (MLP) with two hidden layers and ReLU as the activation function. The number of hidden units in each layer is 512. We use the Adam optimizer with the initial learning rate listed in Table 1 and a cosine annealing learning-rate schedule. When using the Euclidean distance-based affinity, we set $\sigma$ in Eq. (13) to 3 for all datasets. For the SiameseNet based heat kernel affinity and the self-expressiveness induced affinity, the parameter settings of the Siamese network and SENet follow [11] and [24], respectively. Other hyper-parameters of NeuNcut are shown in Table 1.

| Dataset | # Train data | $\gamma$ | $lr$ | $wd$ | $m$ | epochs |
|---|---|---|---|---|---|---|
| MNIST | 20,000 | 100 | 0.005 | $10^{-4}$ | 1000 | 100 |
| F-MNIST | 50,000 | 100 | 0.005 | $10^{-4}$ | 1000 | 300 |
| E-MNIST | 50,000 | 350 | 0.01 | 0 | 1000 | 300 |
| CIFAR-10 | 20,000 | 100 | 0.005 | $10^{-4}$ | 1000 | 300 |
| MNIST8M | 20,000 | 100 | 0.005 | $10^{-4}$ | 1000 | 100 |
| TinyImageNet | 100,000 | 150 | 0.001 | $10^{-4}$ | 1000 | 100 |
| ImageNet-1k | 1,281,167 | 500 | 0.001 | $10^{-4}$ | 3000 | 20 |

(a) Best hyper-parameters of NeuNcut on real-world datasets.

Linear: $\mathbb{R}^{d}\rightarrow\mathbb{R}^{512}$
ReLU
2$\times$ Linear: $\mathbb{R}^{512}\rightarrow\mathbb{R}^{512}$
ReLU
Linear: $\mathbb{R}^{512}\rightarrow\mathbb{R}^{k}$
Softmax

(b) Model parameters.

Table 1: Hyper-parameters and model parameters of NeuNcut. $\gamma$: penalty weight, $lr$: initial learning rate, $wd$: weight decay, $m$: size of mini-batches, epochs: training epochs, $d$: dimension of data points, $k$: number of clusters.
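
The layer stack in Table 1(b) can be sketched as a plain NumPy forward pass followed by argmax inference. Several details below are assumptions made for illustration: a ReLU is applied after each of the three 512-unit linear layers, the weights use a He-style random initialization, and the input sizes (`d=500`, `k=10`) mirror the MNIST setting. Training is not shown.

```python
import numpy as np

def init_mlp(d, k, hidden=512, seed=0):
    """Random parameters for Linear(d,512) -> 2 x Linear(512,512) -> Linear(512,k)."""
    rng = np.random.default_rng(seed)
    sizes = [d, hidden, hidden, hidden, k]
    return [
        (rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))   # assumed init
        for a, b in zip(sizes[:-1], sizes[1:])
    ]

def mlp_forward(params, X):
    """Forward pass f(X; Theta): hidden Linear+ReLU layers, then Linear and softmax."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)            # Linear followed by ReLU
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

params = init_mlp(d=500, k=10)
X_new = np.random.default_rng(1).standard_normal((4, 500))   # stand-in for unseen features
Y = mlp_forward(params, X_new)
labels = Y.argmax(axis=1)      # cluster index inferred directly, no k-means step
```

Out-of-sample inference is just this forward pass plus an argmax, which is what makes clustering new data points cheap once the network has been trained.
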
| Setting | Method | MNIST ACC | MNIST NMI | MNIST ARI | F-MNIST ACC | F-MNIST NMI | F-MNIST ARI |
|---|---|---|---|---|---|---|---|
| I | Ncut | 0.693 | 0.809 | 0.665 | 0.557 | 0.629 | 0.437 |
| | SpectralNet | 0.622* | 0.687* | - | 0.516 | 0.587 | 0.398 |
| | NeuNcut | 0.784 | 0.817 | 0.734 | 0.602 | 0.603 | 0.473 |
| II | Ncut | 0.821 | 0.867 | 0.795 | 0.561 | 0.645 | 0.456 |
| | SpectralNet | 0.800* | 0.814* | - | 0.590 | 0.665 | 0.457 |
| | NeuNcut | 0.943 | 0.866 | 0.879 | 0.643 | 0.644 | 0.489 |
| III | Ncut | 0.861 | 0.872 | 0.839 | 0.582 | 0.593 | 0.454 |
| | SpectralNet | 0.971 | 0.924 | 0.934 | 0.601 | 0.650 | 0.458 |
| | NeuNcut | 0.969 | 0.922 | 0.914 | 0.650 | 0.659 | 0.501 |
| IV | Ncut | 0.854 | 0.904 | 0.837 | 0.645 | 0.726 | 0.448 |
| | SpectralNet | 0.816 | 0.833 | 0.752 | 0.617 | 0.645 | 0.438 |
| | NeuNcut | 0.982 | 0.947 | 0.958 | 0.687 | 0.663 | 0.535 |
| V | Ncut | 0.827 | 0.874 | 0.810 | 0.641 | 0.650 | 0.479 |
| | SpectralNet | 0.841 | 0.901 | 0.834 | 0.694 | 0.670 | 0.544 |
| | NeuNcut | 0.983 | 0.952 | 0.960 | 0.783 | 0.725 | 0.643 |

Table 2: Clustering performance of Normalized cut (Ncut), SpectralNet and NeuNcut when combined with different feature and affinity components. Results marked with (*) are cited from [11].

4.2 Comparison to Normalized cut [5] and SpectralNet [11]

Feature extraction and affinity learning are important for the success of spectral clustering. To fairly evaluate the performance of NeuNcut, we conduct experiments on MNIST and F-MNIST under different combinations of feature extraction and affinity learning methods, and compare against spectral clustering via Normalized cut [5] and SpectralNet [11]. We report the experimental results under the following five settings: (I) Original feature space + heat kernel based affinity; (II) Auto-encoders based feature [17] + heat kernel affinity; (III) Auto-encoders based feature + SiameseNet based heat kernel affinity; (IV) ScatNet [36] based feature + heat kernel affinity; (V) ScatNet based feature + self-expressiveness induced affinity. Experimental results are provided in Table 2. We observe that NeuNcut outperforms Normalized cut [5] and SpectralNet [11] in almost all settings. NeuNcut achieves its best performance when combined with ScatNet and the self-expressiveness induced affinity. In the following, we use ScatNet based feature + self-expressiveness induced affinity (i.e., setting V in Table 2) as the default setting for MNIST and F-MNIST.

4.3 Evaluation on Generalization Performance

To evaluate the generalization performance, we randomly select a number of samples for training NeuNcut and evaluate the directly inferred clustering results on the entire dataset. We also report the performance of Normalized cut (Ncut) [5] and SpectralNet [11] on the entire datasets. Conventional spectral clustering has no generalization ability, whereas SpectralNet learns an embedding from the graph Laplacian but has to perform $k$-means on the entire dataset. All methods use ScatNet [36] based features and the self-expressiveness induced affinity. As can be seen in Table 3, our NeuNcut trained on a small subset of at least 10,000 samples outperforms Ncut [5] and SpectralNet [11], which validates the good generalization ability of NeuNcut for out-of-sample extension.

| Methods | Train data | MNIST ACC | MNIST NMI | MNIST ARI | F-MNIST ACC | F-MNIST NMI | F-MNIST ARI |
|---|---|---|---|---|---|---|---|
| Ncut | NA | 0.827 | 0.874 | 0.810 | 0.641 | 0.650 | 0.479 |
| SpectralNet | NA | 0.841 | 0.901 | 0.834 | 0.694 | 0.670 | 0.544 |
| NeuNcut | 1,000 | 0.711 | 0.699 | 0.616 | 0.572 | 0.510 | 0.389 |
| NeuNcut | 2,000 | 0.858 | 0.783 | 0.749 | 0.659 | 0.616 | 0.487 |
| NeuNcut | 5,000 | 0.967 | 0.913 | 0.928 | 0.694 | 0.638 | 0.537 |
| NeuNcut | 10,000 | 0.977 | 0.938 | 0.953 | 0.717 | 0.692 | 0.598 |
| NeuNcut | 20,000 | 0.983 | 0.952 | 0.960 | 0.752 | 0.695 | 0.618 |
| NeuNcut | 50,000 | 0.981 | 0.946 | 0.957 | 0.783 | 0.725 | 0.643 |

Table 3: Clustering performance of Ncut, SpectralNet and NeuNcut with varying number of training samples on MNIST and F-MNIST datasets.
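For readers who want to reproduce the ACC column, the following is a minimal sketch of how clustering accuracy is commonly computed, assuming the standard best-match definition (the fraction of points that agree with the ground truth under the best one-to-one relabeling of clusters). The function name `clustering_accuracy` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred, k):
    """Best-match accuracy: maximize agreement over all relabelings of the k clusters."""
    # counts[p][t] = number of points with predicted cluster p and true class t
    counts = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        counts[p][t] += 1
    best = 0
    # Brute force over permutations is only practical for small k;
    # the Hungarian algorithm is the usual choice for larger k.
    for perm in permutations(range(k)):
        matched = sum(counts[p][perm[p]] for p in range(k))
        best = max(best, matched)
    return best / len(y_true)


# Toy labels (hypothetical): predicted cluster ids are a permutation of the
# true ids, with one point assigned to the wrong cluster.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]
acc = clustering_accuracy(y_true, y_pred, k=3)
```

Because cluster indices produced by an unsupervised method are arbitrary, the relabeling step is what makes accuracies comparable across methods.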

4.4 Experiments on Large-scale Data

For normalized cut [5], there is a computation bottleneck when computing the embedding of the entire dataset if the size of the data is too large. By contrast, our NeuNcut takes only a small affinity graph of each mini-batch data and thus enjoys a good scalability to handle large-scale data.

On large-scale synthetic data. We generate some typical 2D synthetic data and visualize the predictions of NeuNcut. “Double rings” contains two concentric circles and “double C” contains two clusters in the shape of the letter “C”. Each dataset contains 10 million data points. We use 10,000 samples (i.e., only 0.1% of the samples) for training and construct heat kernel based affinities. As can be observed from Figure 1, our NeuNcut assigns nearly 100% of the points to the correct clusters.

Figure 1: Visualization of NeuNcut predictions on large-scale synthetic data. 10,000 samples are plotted. Panels: (a) Double rings (ACC=100.00%); (b) Double C (ACC=99.90%).
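The synthetic setup can be sketched as follows. This is only an illustration under stated assumptions: the ring radii, the noise level, the bandwidth `sigma`, and the kernel form $\exp(-\|x_i-x_j\|^2/(2\sigma^2))$ are hypothetical choices, since the exact generator and kernel parameters are not given in this section.

```python
import numpy as np


def make_double_rings(n, radii=(1.0, 2.0), noise=0.05, seed=0):
    """Sample n 2D points from two noisy concentric rings (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = np.asarray(radii)[labels] + noise * rng.standard_normal(n)
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return X, labels


def heat_kernel_affinity(X, sigma=0.2):
    """Dense heat kernel affinity for a small batch of points."""
    # Squared Euclidean distances between all pairs in the batch.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


X, labels = make_double_rings(1000)
W = heat_kernel_affinity(X[:200])  # affinity of one mini-batch only
```

Only the affinity of a mini-batch is ever materialized here, which is what keeps the memory footprint independent of the total number of points.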

On large-scale real data. We compare the performance and total running time of spectral clustering via Normalized cut (Ncut), SpectralNet [11] and NeuNcut on MNIST8M, TinyImageNet-CLIP and ImageNet1k-CLIP. For MNIST8M, we construct affinities from a pretrained SENet [24]. For TinyImageNet-CLIP and ImageNet1k-CLIP, we compute the heat kernel affinities. Due to the computational bottleneck of eigen-decomposition, we report the average performance of Ncut over all subsets that contain 100,000 samples. Table 4 shows that NeuNcut achieves satisfactory clustering accuracy on the three large-scale real datasets. Besides, NeuNcut does not need to perform $k$-means on the datasets and thus saves inference time compared to SpectralNet.

| Methods | MNIST8M Time | MNIST8M ACC | MNIST8M NMI | MNIST8M ARI | TinyImageNet-CLIP Time | TinyImageNet-CLIP ACC | TinyImageNet-CLIP NMI | TinyImageNet-CLIP ARI | ImageNet1k-CLIP Time | ImageNet1k-CLIP ACC | ImageNet1k-CLIP NMI | ImageNet1k-CLIP ARI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ncut | $3.22\times 10^{3}$ | 0.960 | 0.915 | 0.918 | 51.47 | 0.628 | 0.770 | 0.447 | 533.55 | 0.560 | 0.812 | 0.425 |
| SpectralNet | 2.27 | 0.961 | 0.913 | 0.920 | 9.97 | 0.622 | 0.766 | 0.419 | 19.22 | 0.560 | 0.798 | 0.441 |
| NeuNcut | 1.36 | 0.968 | 0.923 | 0.926 | 5.40 | 0.646 | 0.781 | 0.506 | 5.23 | 0.624 | 0.813 | 0.473 |

Table 4: Total running time (min.), ACC, NMI and ARI of Ncut, SpectralNet, and NeuNcut on MNIST8M, TinyImageNet-CLIP and ImageNet1k-CLIP.

4.5 Ablation Studies

In this section, we provide ablation studies on the network size, the mini-batch size and the hyper-parameter $\gamma$. We train NeuNcut on 20,000 randomly selected samples under setting V (i.e., ScatNet based feature + self-expressiveness induced affinity) and then directly infer the cluster memberships.

Effect of Network Size. To evaluate how the network size affects the clustering performance, we conduct experiments on MNIST, varying the number of hidden layers (i.e., the layer depth) used for NeuNcut in the range of $\{1,2,3\}$ and the number of neurons in each hidden layer (i.e., the hidden dimension) in the range of $\{128,256,512,1024\}$. In these experiments, we set the batch size to 1000. Experimental results are provided in Figure 2, where we display ACC and NMI under different network sizes. We can observe that the neural network with two hidden layers, each containing 512 neurons, achieves the best clustering performance. Further increasing the depth does not improve the clustering performance, but requires more training time.

Figure 2: Clustering performance of NeuNcut with varying network size on MNIST. Panels: (a) ACC; (b) NMI.

Effect of Batch Size. The batch size is important for approximating the graph Laplacian $L$ with its mini-batch version. To evaluate the impact of batch size, we train NeuNcut with varying batch size in the range of $\{50,100,200,\dots,1400\}$ and report the mean value of ACC over 3 trials. As can be seen in Table 5, NeuNcut yields satisfactory results on MNIST once the batch size reaches 100. For E-MNIST, which contains more imbalanced categories, NeuNcut demands a batch size of at least 200 to yield satisfactory results.

| Batch size | 50 | 100 | 200 | 400 | 600 | 800 | 1000 | 1200 | 1400 |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | 0.751 | 0.973 | 0.978 | 0.980 | 0.982 | 0.980 | 0.983 | 0.977 | 0.981 |
| E-MNIST | 0.463 | 0.477 | 0.684 | 0.709 | 0.711 | 0.708 | 0.716 | 0.709 | 0.713 |

Table 5: Clustering accuracy of NeuNcut with varying batch size on MNIST and E-MNIST.
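To make the role of the batch size concrete, the sketch below builds a mini-batch version of a graph Laplacian by restricting a full affinity matrix to the sampled indices. It assumes the unnormalized form $L = D - W$ and a precomputed dense affinity; both are assumptions made for illustration, and the function name `minibatch_laplacian` is hypothetical.

```python
import numpy as np


def minibatch_laplacian(W, batch_idx):
    """Return the affinity block and Laplacian restricted to one mini-batch."""
    # Restrict the affinity to the sampled batch (an m x m block).
    W_b = W[np.ix_(batch_idx, batch_idx)]
    # Degrees are recomputed inside the batch, so they only approximate
    # the full-graph degrees; a larger batch gives a closer approximation.
    d = W_b.sum(axis=1)
    L_b = np.diag(d) - W_b
    return W_b, L_b


rng = np.random.default_rng(0)
A = rng.random((500, 500))
W = (A + A.T) / 2.0          # random symmetric affinity, for illustration only
np.fill_diagonal(W, 0.0)
batch_idx = rng.choice(500, size=100, replace=False)
W_b, L_b = minibatch_laplacian(W, batch_idx)
```

With very small batches, a cluster may contribute few or no points to a given batch, which is one plausible reading of why the imbalanced E-MNIST needs a larger batch size in Table 5.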

Effect of Hyper-parameter $\gamma$. Note that Eq. (10) introduces an additional hyper-parameter $\gamma>0$. We conduct experiments with NeuNcut under different $\gamma$ and show the results in Figure 3. The clustering performance of our method is not sensitive to the parameter $\gamma$, e.g., NeuNcut achieves stable results when $\gamma\in[80,240]$ on MNIST and when $\gamma\in[80,180]$ on F-MNIST.

Figure 3: Clustering accuracy (mean$\pm$std) of NeuNcut with varying $\gamma$ on MNIST and F-MNIST. Panels: (a) MNIST; (b) F-MNIST.

4.6 Searching Best $\gamma$ Without Labels

Since label information is not available during training, it is improper to search for the best hyper-parameters by checking the clustering accuracy. To this end, we provide a practical way to find the best $\gamma$ without using the ground-truth labels. Note that the loss function in Eq. (10) consists of two terms, i.e., $\mathcal{L}_{Lap}$ and $\mathcal{L}_{orth}$. We observe a clear connection between the optimal losses and the parameter $\gamma$. Taking the experiment on MNIST as an example, we train our NeuNcut with different $\gamma$ and record the optimal $\mathcal{L}_{Lap}$ and $\mathcal{L}_{orth}$ during the entire training process. As can be seen in Figure 4, when $\gamma$ is small, $\mathcal{L}_{orth}$ is larger than its recorded minimum, while $\mathcal{L}_{Lap}$ is equal to $0$ or very small. In this case our NeuNcut produces a collapsed solution. When $\gamma$ reaches a threshold, $\mathcal{L}_{orth}$ approaches its minimum, and in this case NeuNcut can produce satisfactory clustering results. Further increasing $\gamma$ slightly increases the value of $\mathcal{L}_{Lap}$ while the value of $\mathcal{L}_{orth}$ is unchanged. Finally, an arbitrarily large $\gamma$ makes $\mathcal{L}_{Lap}$ hard to optimize and harms the clustering accuracy.

Based on these observations, we suggest an empirical rule to find the best $\gamma$. Specifically, we start from a very large $\gamma$ (e.g., $\gamma=10^{6}$) to train NeuNcut and record the optimal $\mathcal{L}^{(o)}_{orth}$ (here $\mathcal{L}^{(o)}_{orth}=0.256$), which can be regarded as the lower bound of $\mathcal{L}_{orth}$. Then, we gradually decrease the value of $\gamma$ and find the threshold of $\gamma$ that corresponds to the lowest one among the optimal $\mathcal{L}_{Lap}$ while still keeping $\mathcal{L}_{orth}=0.256$. This estimated threshold sets the lower bound of feasible $\gamma$ and is consistent with the real feasible range, i.e., $\gamma\geq 80$, as can be validated by Figure 3(a).

Figure 4: The optimal $\mathcal{L}_{Lap}$ (panel (a)) and the optimal $\mathcal{L}_{orth}$ (panel (b)) with a varying penalty weight $\gamma$ in Eq. (10) when training on MNIST.
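The empirical rule of this subsection can be sketched in a few lines. The sketch assumes that one training run per candidate $\gamma$ has already been performed and that each run is summarized by its recorded optimal $\mathcal{L}_{Lap}$ and $\mathcal{L}_{orth}$. The function name `select_gamma`, the tolerance `tol`, and all numbers except the lower bound 0.256 are hypothetical and serve only as an illustration.

```python
def select_gamma(records, tol=1e-3):
    """records: list of (gamma, best_lap, best_orth) tuples, one per training run."""
    # Step 1: the run with the largest gamma gives the reference lower bound of L_orth.
    _, _, orth_ref = max(records, key=lambda r: r[0])
    # Step 2: keep the runs whose L_orth still sits at that lower bound.
    feasible = [r for r in records if abs(r[2] - orth_ref) <= tol]
    # Step 3: among them, take the run with the lowest L_Lap
    # (ties are broken in favor of the smaller gamma).
    gamma, _, _ = min(feasible, key=lambda r: (r[1], r[0]))
    return gamma


# Made-up numbers for illustration only (not results from the paper).
records = [
    (1e6, 0.90, 0.256),
    (240, 0.52, 0.256),
    (80, 0.50, 0.256),
    (40, 0.10, 0.400),
    (10, 0.00, 0.900),
]
chosen = select_gamma(records)
```

Note that the runs with small $\gamma$ are rejected even though their $\mathcal{L}_{Lap}$ is lowest, because their $\mathcal{L}_{orth}$ stays above the lower bound, which is the signature of a collapsed solution described above.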

4.7 Computation Cost and Running Time

| Methods | MNIST | F-MNIST | E-MNIST | CIFAR-10 |
|---|---|---|---|---|
| Ncut | 283 | 334 | 869 | 312 |
| NeuNcut (5000) | 11 | 47 | 43 | 21 |
| NeuNcut (10000) | 13 | 58 | 51 | 40 |
| NeuNcut (20000) | 25 | 92 | 97 | 101 |
| NeuNcut (50000) | 68 | 258 | 249 | 211 |

Table 6: Total running time (sec.) of Ncut and NeuNcut ($N$) on MNIST, F-MNIST, E-MNIST and CIFAR-10. Ncut is performed on the entire dataset and $N$ denotes the number of training samples.

Computation Complexity. Considering a dataset containing $n$ data points, the common strategy for spectral clustering is to sparsify the affinity at first, e.g., keeping the largest $s$ entries in each row. In such a setting, the time complexity is $\mathcal{O}(kns)$ for solving for the $k$ eigenvectors with a sparse eigen-solver. In NeuNcut, by contrast, the time complexity of the loss in Eq. (10) is $\mathcal{O}(2tkm^{2})$, where $t$ denotes the number of training iterations and $m$ denotes the batch size.

Running Time. We compare the total running time of NeuNcut with varying number of training samples and the running time of spectral clustering via Ncut on the entire MNIST, F-MNIST, E-MNIST and CIFAR-10 datasets. We use an Intel(R) Xeon E5-2630 CPU to solve spectral clustering via Ncut since there is no available GPU acceleration package. NeuNcut is trained on a single NVIDIA GeForce 1080Ti GPU. As shown in Table 6, NeuNcut takes much less running time since it can be trained on a small training set and still produce generalizable clustering results. Besides, NeuNcut infers the cluster memberships directly, which saves the time of applying $k$-means clustering.

4.8 Learning Curves

Figure 5: Clustering performance curves and loss curves during the training of NeuNcut on MNIST. Panels: (a) Metrics; (b) $\mathcal{L}_{Lap}$ and $\mathcal{L}_{orth}$.

In Figure 5, we plot the clustering performance (ACC, NMI and ARI metrics) curves as well as the loss curves $\mathcal{L}_{Lap}$ and $\mathcal{L}_{orth}$ as defined in Eq. (10) during training on MNIST. We can observe that our NeuNcut converges and obtains stable clustering results on MNIST within 1,000 training iterations. The penalty term $\mathcal{L}_{orth}$ decreases steadily to its lower bound, while the loss term $\mathcal{L}_{Lap}$ rapidly increases from 0 and then slightly decreases during the training.

| Methods | MNIST ACC | MNIST NMI | MNIST ARI | F-MNIST ACC | F-MNIST NMI | F-MNIST ARI |
|---|---|---|---|---|---|---|
| Ncut ($s$=1000) | 0.798 | 0.812 | 0.727 | 0.641 | 0.650 | 0.480 |
| Ncut ($s$=10) | 0.827 | 0.874 | 0.810 | 0.458 | 0.445 | 0.212 |
| Ncut ($s$=3) | 0.721 | 0.818 | 0.708 | 0.405 | 0.368 | 0.159 |
| SpectralNet ($s$=1000) | 0.795 | 0.815 | 0.744 | 0.667 | 0.639 | 0.524 |
| SpectralNet ($s$=10) | 0.831 | 0.882 | 0.813 | 0.694 | 0.670 | 0.544 |
| SpectralNet ($s$=3) | 0.841 | 0.901 | 0.960 | 0.582 | 0.685 | 0.485 |
| NeuNcut ($s$=1000) | 0.981 | 0.945 | 0.957 | 0.783 | 0.725 | 0.643 |
| NeuNcut ($s$=10) | 0.982 | 0.947 | 0.959 | 0.764 | 0.709 | 0.618 |
| NeuNcut ($s$=3) | 0.983 | 0.952 | 0.960 | 0.758 | 0.701 | 0.606 |

Table 7: Performance of spectral clustering with Ncut, SpectralNet and NeuNcut on MNIST and F-MNIST with varying $s$, which is the number of nonzero affinity entries kept in each row.

4.9 Evaluation on Sparsity of Affinity

Because of the memory bottleneck, spectral clustering has to keep only the $s$ largest entries of each row to construct a sparse affinity matrix. However, setting all other entries to zero may affect the clustering performance. Here, we compare the clustering performance of Ncut, SpectralNet and NeuNcut with $s\in\{3,10,1000\}$. Table 7 shows that the value of $s$ severely affects the clustering performance of spectral clustering and SpectralNet. Moreover, since the optimal value of $s$ varies across datasets, it is nearly impossible to find the best $s$ when the ground truth is missing or unknown. For example, SpectralNet achieves its best clustering accuracy on MNIST when $s=3$ but its worst accuracy on F-MNIST with the same $s$. By contrast, NeuNcut shows robustness to varying $s$.
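The row-wise sparsification discussed here can be sketched as follows. The function name `keep_top_s` is hypothetical, and the final symmetrization by elementwise maximum is an assumed choice (row-wise truncation alone generally yields an asymmetric matrix); the paper does not specify this step in the present section.

```python
import numpy as np


def keep_top_s(W, s):
    """Zero all but the s largest entries in each row, then symmetrize."""
    n = W.shape[0]
    # Column indices of the s largest entries per row (their order is arbitrary).
    idx = np.argpartition(W, n - s, axis=1)[:, n - s:]
    W_s = np.zeros_like(W)
    rows = np.arange(n)[:, None]
    W_s[rows, idx] = W[rows, idx]
    # Assumed symmetrization: keep an edge if either endpoint selected it,
    # so a row may end up with more than s nonzero entries.
    return np.maximum(W_s, W_s.T)


rng = np.random.default_rng(0)
A = rng.random((50, 50))
W = (A + A.T) / 2.0          # random symmetric affinity, for illustration only
np.fill_diagonal(W, 0.0)
W_s = keep_top_s(W, s=5)
```

A small $s$ discards most within-cluster connections along with the between-cluster ones, which is consistent with the sensitivity of Ncut and SpectralNet to $s$ reported in Table 7.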

4.10 Comparison to State-of-the-art Methods

| Methods | MNIST ACC | MNIST NMI | MNIST ARI | F-MNIST ACC | F-MNIST NMI | F-MNIST ARI | E-MNIST ACC | E-MNIST NMI | E-MNIST ARI | CIFAR-10 ACC | CIFAR-10 NMI | CIFAR-10 ARI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $k$-means [39] | 0.541 | 0.507 | 0.367 | 0.505 | 0.578 | 0.403 | 0.459 | 0.438 | 0.316 | 0.525 | 0.589 | 0.276 |
| VaDE [17] | 0.963 | 0.912 | 0.922 | 0.604 | 0.641 | 0.477 | 0.561 | 0.694 | 0.518 | - | - | - |
| EnSC [28] | 0.980 | 0.945 | 0.957 | 0.672 | 0.705 | 0.565 | T | T | T | 0.613 | 0.601 | 0.430 |
| DEPICT [18] | 0.965 | 0.917 | - | 0.392 | 0.392 | - | - | - | - | - | - | - |
| ESC [32] | 0.971 | 0.925 | 0.936 | 0.668 | 0.708 | 0.556 | 0.732 | 0.825 | 0.759 | 0.653 | 0.629 | 0.438 |
| SCAN [19] | 0.969 | 0.916 | 0.929 | 0.538 | 0.575 | 0.363 | 0.567 | 0.652 | 0.545 | 0.756 | 0.633 | 0.577 |
| SENet [24] | 0.968 | 0.918 | 0.931 | 0.697 | 0.663 | 0.556 | 0.721 | 0.798 | 0.766 | 0.765 | 0.655 | 0.573 |
| EDESC [40] | 0.913 | 0.862 | - | 0.631 | 0.670 | - | - | - | - | 0.627 | 0.464 | - |
| Ncut [5] | 0.854 | 0.904 | 0.837 | 0.645 | 0.726 | 0.448 | 0.662 | 0.769 | 0.654 | 0.693 | 0.636 | 0.428 |
| SpectralNet [11] | 0.971 | 0.924 | 0.934 | 0.694 | 0.670 | 0.544 | 0.556 | 0.750 | 0.556 | 0.728 | 0.624 | 0.546 |
| SpecNet2 [12] | 0.974 | 0.937 | 0.940 | 0.680 | 0.676 | 0.542 | 0.570 | 0.753 | 0.575 | 0.696 | 0.641 | 0.531 |
| AutoSC [21] | 0.978 | - | - | 0.646 | - | - | - | - | - | - | - | - |
| CNC [23] | 0.972 | 0.924 | - | - | - | - | - | - | - | 0.702 | 0.586 | - |
| NeuNcut (ours) | 0.983 | 0.952 | 0.960 | 0.783 | 0.725 | 0.643 | 0.716 | 0.789 | 0.774 | 0.776 | 0.647 | 0.594 |
| NeuRcut (ours) | 0.978 | 0.938 | 0.951 | 0.713 | 0.693 | 0.587 | 0.661 | 0.610 | 0.595 | 0.759 | 0.640 | 0.566 |

Table 8: Clustering results on MNIST, F-MNIST, E-MNIST and CIFAR-10. We compare our method with the most relevant spectral clustering methods and other baseline clustering methods. Legend: ‘-’ denotes results that are not reported, ‘T’ means the computation time exceeds 24 hours.

We compare the clustering performance of our NeuNcut to the following baselines: the methods most relevant to our NeuNcut, including SpectralNet [11], SpecNet2 [12], AutoSC [21] and CNC [23] (BaSiS [22] learns spectral embeddings in a supervised manner and is therefore not compared); conventional clustering methods, including $k$-means [39] and Normalized cut (Ncut) [5]; a set of competitive clustering methods, including VaDE [17], DEPICT [18], SCAN [19] and EDESC [40]; and a set of advanced subspace clustering methods, including EnSC [28], ESC [32] and SENet [24]. Among the relevant baselines, we reproduce spectral clustering via Ncut, SpectralNet, SpecNet2 and SCAN and report their best performances on the same features and affinities, and cite the results of AutoSC and CNC. (AutoSC uses ScatNet for feature extraction and constructs the affinity via a self-expressive model, so citing the results from its paper gives a fair comparison to ours. CNC is a relevant approach to ours, but we failed to reproduce its performance.) Experimental results are reported in Table 8. Among these baselines, DEPICT and SCAN use an entropy based regularizer to avoid degenerate solutions. The success of these methods relies on a class-balanced prior, i.e., the assumption that all clusters have an equal number of data points. These methods perform poorly on imbalanced datasets such as E-MNIST. State-of-the-art subspace clustering methods including EnSC, ESC, SENet and EDESC generate self-expressive coefficients and apply spectral clustering. Our NeuNcut can be used to replace the spectral clustering step in most subspace clustering methods to achieve better performance and enable them to handle large-scale datasets (e.g., see “SENet” and “NeuNcut” in Table 8 for comparison). As a differential spectral clustering method, our NeuNcut outperforms Ncut and other spectral-based clustering methods on all four datasets. In particular, our NeuNcut outperforms all baseline methods on MNIST, F-MNIST and CIFAR-10, and achieves the second highest accuracy on E-MNIST.

4.11 Flexible Extensions

Our NeuNcut can be easily extended to other conventional spectral clustering methods, such as Ratio cut [4]:

$$\mathrm{Ratio\ cut}(\mathcal{V}^{(1)},\ldots,\mathcal{V}^{(k)}) := \sum_{\ell=1}^{k}\frac{\mathrm{cut}(\mathcal{V}^{(\ell)},\overline{\mathcal{V}}^{(\ell)})}{|\mathcal{V}^{(\ell)}|}, \qquad (17)$$

where $|\mathcal{V}^{(\ell)}|$ denotes the size of the $\ell$-th cluster. Similarly, the desired segmentation matrix $\tilde{H}$ of the Ratio cut can again be expressed as $\tilde{H}:=H\Upsilon^{-1}$, where $H$ is reparametrized by $\mathbf{f}(X;\Theta)$ and $\Upsilon=\mathrm{Diag}(|\mathcal{V}^{(1)}|,\cdots,|\mathcal{V}^{(k)}|)^{1/2}$. Here, the size of the $\ell$-th cluster can also be replaced by its estimate $|\mathcal{V}^{(\ell)}|\doteq\sum_{i=1}^{n}y_{i,\ell}$. Again, this is analogous to the expectation step in an EM-style algorithm. Then, we relax the orthogonality constraint of Ratio cut (i.e., $\tilde{H}^{\top}\tilde{H}=I$) to a penalty function and obtain the following loss function:

$$\begin{split}\mathcal{L}(X,W;\Theta) := & \ \mathrm{trace}\left((\mathbf{f}(X;\Theta)\Upsilon^{-1})^{\top}L(\mathbf{f}(X;\Theta)\Upsilon^{-1})\right)\\ + & \ \frac{\gamma}{2}\left\|(\mathbf{f}(X;\Theta)\Upsilon^{-1})^{\top}(\mathbf{f}(X;\Theta)\Upsilon^{-1})-I\right\|_{F}^{2}.\end{split} \qquad (18)$$

We note that SpectralNet [11] is not a fully differential programming approach for Ratio cut because $k$-means is still needed after the spectral embedding. Here, we have a fully differential programming approach for Ratio cut, termed NeuRcut. We conduct experiments with the same setting as NeuNcut and report the results in Table 8. Again, our NeuRcut outperforms SpectralNet.
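To make Eq. (18) concrete, the following NumPy sketch evaluates the NeuRcut loss for a given matrix of soft assignments. It is a minimal illustration under stated assumptions, not the authors' implementation: the network output $\mathbf{f}(X;\Theta)$ is replaced by a fixed matrix `Y` whose rows sum to one, $L$ is assumed to be the unnormalized Laplacian $D-W$, and the function name `ratio_cut_loss` and the toy graph are hypothetical.

```python
import numpy as np


def ratio_cut_loss(Y, W, gamma):
    """Evaluate Eq. (18) for soft assignments Y (n x k) and affinity W (n x n)."""
    # Unnormalized graph Laplacian (assumed form of L for Ratio cut).
    L = np.diag(W.sum(axis=1)) - W
    # Estimated cluster sizes |V^(l)| = sum_i y_{i,l}.
    sizes = Y.sum(axis=0)
    # H_tilde = f(X; Theta) Upsilon^{-1}, with Upsilon = Diag(sizes)^{1/2}.
    H = Y / np.sqrt(sizes)
    lap = np.trace(H.T @ L @ H)
    k = Y.shape[1]
    orth = 0.5 * gamma * np.linalg.norm(H.T @ H - np.eye(k), "fro") ** 2
    return lap + orth, lap, orth


# Toy graph: two disconnected triangles (illustration only).
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

Y_good = np.repeat(np.eye(2), 3, axis=0)   # each triangle is one cluster
Y_flat = np.full((6, 2), 0.5)              # uninformative assignment

good_total, good_lap, good_orth = ratio_cut_loss(Y_good, W, gamma=100.0)
flat_total, flat_lap, flat_orth = ratio_cut_loss(Y_flat, W, gamma=100.0)
```

On this toy graph the correct partition makes both terms vanish, while the uninformative assignment has a zero cut term but a nonzero orthogonality penalty. This illustrates why the penalty is needed to rule out degenerate solutions that the trace term alone cannot distinguish.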

5 Conclusion

We proposed a differential and generalizable approach for spectral clustering, termed Neural Normalized Cut (NeuNcut), which can be trained in mini-batch mode and used to infer the clustering membership for out-of-sample data directly. Such a generalization ability provides an efficient and effective way for clustering large-scale data. Extensive experiments on both synthetic data and real-world datasets have validated the superior performance of our proposed NeuNcut.

Limitations. Our NeuNcut maps the data to the cluster assignment space, not to the space of orthogonal eigenfunctions. That is, the output of NeuNcut cannot approximate the eigenvectors obtained by eigen-decomposition, which rules out applications that rely on eigenvectors or spectral embeddings, such as the approximation of the Fiedler vector and positional encoding for graph neural networks.

Future works. It is worth noting that NeuNcut is a differential version of Normalized cut [5], hence it can potentially be used to replace conventional spectral clustering in a variety of applications, especially when facing clustering tasks with ultra-large-scale data. Nevertheless, we also note that in our NeuNcut the feature and the affinity are still assumed to be given and fixed, thus it will be a worthwhile future work to develop a unified framework for jointly learning both the feature and the affinity. In addition, we have observed empirically that our NeuNcut enjoys a good generalization ability to out-of-sample data, thus establishing the relevant theoretical guarantee is also an appealing direction for future work.

Acknowledgment

W. He, C.-G. Li and J. Guo are supported by the National Natural Science Foundation of China under Grant 61876022.

References

  • [1] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416.
  • [2] M. Filippone, F. Camastra, F. Masulli, S. Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognition 41 (1) (2008) 176–190.
  • [3] L. Ding, C. Li, D. Jin, S. Ding, Survey of spectral clustering based on graph theory, Pattern Recognition 151 (2024) 110366.
  • [4] P. K. Chan, M. D. Schlag, J. Y. Zien, Spectral k-way ratio-cut partitioning and clustering, in: Proceedings of the 30th International Design Automation Conference, 1993, pp. 749–754.
  • [5] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–905.
  • [6] C. H. Ding, X. He, H. Zha, M. Gu, H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, in: Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 107–114.
  • [7] X. Zhu, Y. Zhu, W. Zheng, Spectral rotation for deep one-step clustering, Pattern Recognition 105 (2020) 107175.
  • [8] C. Fowlkes, S. Belongie, F. Chung, J. Malik, Spectral grouping using the nyström method, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 214–225.
  • [9] G. Yang, S. Deng, X. Chen, C. Chen, Y. Yang, Z. Gong, Z. Hao, Reskm: A general framework to accelerate large-scale spectral clustering, Pattern Recognition 137 (2023) 109275.
  • [10] X. Yang, W. Yu, R. Wang, G. Zhang, F. Nie, Fast spectral clustering learning with hierarchical bipartite graph for large-scale data, Pattern Recognition Letters 130 (2020) 345–352.
  • [11] U. Shaham, K. P. Stanton, H. Li, R. Basri, B. Nadler, Y. Kluger, Spectralnet: Spectral clustering using deep neural networks, in: Proceedings of the 6th International Conference on Learning Representations, 2018.
  • [12] Z. Chen, Y. Li, X. Cheng, Specnet2: Orthogonalization-free spectral embedding by neural networks, in: Proceedings of The Mathematical and Scientific Machine Learning Conference, Vol. 190, 2022, pp. 33–48.
  • [13] C. Boutsidis, E. Gallopoulos, Svd based initialization: A head start for nonnegative matrix factorization, Pattern Recognition 41 (4) (2008) 1350–1362.
  • [14] H. Lu, Z. Fu, X. Shu, Non-negative and sparse spectral clustering, Pattern Recognition 47 (1) (2014) 418–426.
  • [15] R. Shang, Z. Zhang, L. Jiao, W. Wang, S. Yang, Global discriminative-based nonnegative spectral clustering, Pattern Recognition 55 (2016) 172–182.
  • [16] D. Huang, C.-D. Wang, J.-S. Wu, J.-H. Lai, C.-K. Kwoh, Ultra-scalable spectral clustering and ensemble clustering, IEEE Transactions on Knowledge and Data Engineering 32 (6) (2019) 1212–1226.
  • [17] Z. Jiang, Y. Zheng, H. Tan, B. Tang, H. Zhou, Variational deep embedding: An unsupervised and generative approach to clustering, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2017, pp. 1965–1972.
  • [18] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, H. Huang, Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization, in: IEEE International Conference on Computer Vision, 2017, pp. 5736–5745.
  • [19] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, L. Van Gool, Scan: Learning to classify images without labels, in: European Conference on Computer Vision, 2020, pp. 268–285.
  • [20] M. Caron, P. Bojanowski, A. Joulin, M. Douze, Deep clustering for unsupervised learning of visual features, in: European Conference on Computer Vision, 2018, pp. 132–149.
  • [21] J. Fan, Y. Tu, Z. Zhang, M. Zhao, H. Zhang, A simple approach to automated spectral clustering, in: Advances in Neural Information Processing Systems, Vol. 35, 2022, pp. 9907–9921.
  • [22] O. Streicher, I. Cohen, G. Gilboa, Basis: Batch aligned spectral embedding space, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 10396–10405.
  • [23] A. Nazi, W. Hang, A. Goldie, S. Ravi, A. Mirhoseini, Generalized clustering by learning to optimize expected normalized cuts, arXiv preprint arXiv:1910.07623 (2019).
  • [24] S. Zhang, C. You, R. Vidal, C.-G. Li, Learning a self-expressive network for subspace clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 12393–12403.
  • [25] M. Belkin, P. Niyogi, Convergence of laplacian eigenmaps, Advances in Neural Information Processing Systems (2006) 129–136.
  • [26] M. Belkin, P. Niyogi, Towards a theoretical foundation for laplacian-based manifold methods, Journal of Computer and System Sciences 74 (8) (2008) 1289–1308.
  • [27] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2790–2797.
  • [28] C. You, C.-G. Li, D. Robinson, R. Vidal, Oracle based active set algorithm for scalable elastic net subspace clustering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3928–3937.
  • [29] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • [30] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv: 1708.07747 (2019).
  • [31] G. Cohen, S. Afshar, J. Tapson, A. Van Schaik, Emnist: Extending mnist to handwritten letters, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2017, pp. 2921–2926.
  • [32] C. You, C. Li, D. Robinson, R. Vidal, A scalable exemplar-based subspace clustering algorithm for class-imbalanced data, in: European Conference on Computer Vision, 2018, pp. 68–85.
  • [33] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical Report TR-2009, University of Toronto, Toronto (2009).
  • [34] G. Loosli, S. Canu, L. Bottou, Training invariant support vector machines using selective sampling, in: Large Scale Kernel Machines, MIT press, 2007, pp. 301–320.
  • [35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255.
  • [36] J. Bruna, S. Mallat, Invariant scattering convolution networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1872–1886.
  • [37] Y. Yu, K. H. R. Chan, C. You, C. Song, Y. Ma, Learning diverse and discriminative representations via the principle of maximal coding rate reduction, in: Advances in Neural Information Processing Systems, 2020, pp. 9422–9434.
  • [38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: Proceedings of the International Conference on Machine Learning, 2021, pp. 8748–8763.
  • [39] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
  • [40] J. Cai, J. Fan, W. Guo, S. Wang, Y. Zhang, Z. Zhang, Efficient deep embedded subspace clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 1–10.