
HC-GAE: The Hierarchical Cluster-based Graph Auto-Encoder for Graph Representation Learning

Zhuo Xu,  Lu Bai,  Lixin Cui,  Ming Li, 
Yue Wang,  Edwin R. Hancock
Zhuo Xu and Lu Bai (Corresponding Author: bailu@bnu.edu.cn) are with School of Artificial Intelligence, Beijing Normal University, Beijing, China. Lixin Cui and Yue Wang are with School of Information, Central University of Finance and Economics, Beijing, China. Ming Li is with the Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua, China. Edwin R. Hancock is with Department of Computer Science, University of York, York, UK
Abstract

Graph Auto-Encoders (GAEs) are powerful tools for graph representation learning. In this paper, we develop a novel Hierarchical Cluster-based GAE (HC-GAE) that can learn effective structural characteristics for graph data analysis. To this end, during the encoding process, we commence by utilizing hard node assignment to decompose a sample graph into a family of separated subgraphs. We compress each subgraph into a coarsened node, transforming the original graph into a coarsened graph. During the decoding process, in contrast, we adopt soft node assignment to reconstruct the original graph structure by expanding the coarsened nodes. By hierarchically performing the compressing procedure during the encoding process and the expanding procedure during the decoding process, the proposed HC-GAE can effectively extract bidirectionally hierarchical structural features of the original sample graph. Furthermore, we re-design the loss function so that it integrates information from both the encoder and the decoder. Since the graph convolution operation of the proposed HC-GAE is restricted to each individual separated subgraph and cannot propagate node information between different subgraphs, the proposed HC-GAE can significantly reduce the over-smoothing problem arising in classical convolution-based GAEs. The proposed HC-GAE can generate effective representations for either node classification or graph classification, and experiments demonstrate its effectiveness on real-world datasets.

Index Terms:
Graph Auto-Encoder; Graph Neural Networks; Graph Classification; Node Classification

I Introduction

In real-world applications, graph-structured data has been widely used to characterize pairwise relationships among the components of complex systems. With the recent rapid development of deep learning, graph representation learning approaches relying on neural networks have been introduced for the analysis of various graph data, e.g., social networks [1], transportation networks [2], and protein compounds [3]. One challenge arising in these studies is that graph data has a nonlinear structure defined in an irregular non-Euclidean space, so it is hard to directly employ traditional neural networks to learn graph representations.

To overcome the above problem, there has been increasing interest in generalizing traditional neural networks, especially the Convolutional Neural Network (CNN) [4], to irregular graph data. The resulting convolution-based Graph Neural Networks (GNNs) [5] and related approaches [6] have been proposed for graph-based tasks and utilize both graph features and topologies. For instance, the Higher-order Graph Convolutional Network (HiGCN) [7] has been developed based on higher-order interactions to recognize intrinsic features across varying topological scales. Its expressiveness makes it suitable for various graph-based tasks. DeepRank-GNN [8] combines rotation-invariant graphs with a GNN to represent protein-protein complexes. Because GNN models extract graph representations with more semantic learning under supervised conditions, researchers have increasingly sought self-supervised frameworks built on GNNs to accomplish representation learning.

As a typical framework for representation learning, the classical Auto-Encoder [9] learns representations by reconstructing its input. In particular, the Graph Auto-Encoder (GAE) [10], built on GNN models, has generalized this reconstruction ability to graph structures [11, 12]. Owing to their extensibility, GAEs have developed into a family of classical models for self-supervised representation learning, with numerous derived models. For instance, the Self-Supervised Masked Graph Autoencoder (GraphMAE) [13] focuses on feature reconstruction and adopts a masking strategy and the scaled cosine error for training. Compared to traditional GAE approaches such as VGAE [14], its decoder is retrofitted with a GNN and a re-masking operation. Following GraphMAE, S2GAE [15] continues to adopt the masking strategy to improve the auto-encoder framework. To generate the cross-representation, its decoder is designed to capture the cross-correlation of nodes.

Challenges. Although the classical GAE-based methods achieve effective performance for graph representation learning, they still face significant challenges, summarized as follows.

(a) The limitation for multiple downstream tasks: Generally, the representations extracted from GAEs can be divided into several categories, including node-level representations for node classification and graph-level representations for graph classification. However, it is difficult for GAEs to generate universal representations for multiple downstream tasks simultaneously, because they tend to over-emphasize the node features. For instance, GraphMAE [13] focuses on node feature reconstruction, which leads to topological missing and weakens the reconstruction of structure information. This is harmful for graph-level representation learning.

(b) The over-smoothing problem: GAEs are usually built on GNNs, so both the encoder and decoder modules are defined in terms of stacked graph convolution operations that rely on propagating node information between adjacent nodes. When the GAE becomes deeper, the node features tend to become similar or indistinguishable after multiple rounds of information passing [16], resulting in the notorious over-smoothing problem [17] and degrading the performance of the GAE.

Contributions. The aim of this paper is to overcome the above challenging problems by proposing a novel HC-GAE model. Overall, the main contributions are threefold.

First, we propose a novel Hierarchical Cluster-based GAE (HC-GAE) for graph representation learning. Specifically, for the encoding process, we adopt hard node assignment to decompose a sample graph into a family of separated subgraphs. We perform the graph convolution operation on each subgraph to further extract node features and compress the nodes belonging to each subgraph into a coarsened node, transforming the original graph into a coarsened graph. Since the separated subgraphs are isolated from each other, the convolution operation cannot propagate node information between different subgraphs. The proposed HC-GAE can in turn reduce the over-smoothing problem arising in classical GAEs. Moreover, since the effect of a graph structure perturbation is limited to the subgraph in which it occurs, performing the convolution operation on each subgraph separately strengthens the robustness of the encoder. As a result, the outputs of the encoder can be employed as the graph-level representations. For the decoding process, in contrast, we adopt soft node assignment to reconstruct the original graph structure by probabilistically expanding each coarsened node into all retrieved nodes. Thus, the outputs of the decoder can be employed as the node-level representations. Since the HC-GAE is defined by hierarchically performing the compressing procedure during the encoding process and the expanding procedure during the decoding process, it can effectively extract bidirectionally hierarchical structural features of the original sample graph, resulting in effective hierarchical graph-level and node-level representations for either graph classification or node classification.

Second, we propose a new loss function for training the proposed HC-GAE model. For calculating the complete loss value, we integrate the local loss from the subgraphs in the encoding operation and the global loss from the reconstructed graphs in the decoding operation. The global loss can capture the information from both the structure and the feature reconstruction processes. The combination of these two pretext tasks broadens the strict requirement causing the topological closeness. In addition, to avoid the over-fitting problem, we add the local loss as the regularization in our loss function.

Third, we empirically evaluate the performance of the proposed HC-GAE model on both node and graph classification tasks, demonstrating the effectiveness of the proposed model.

II Related Works

II-A Graph Neural Network

GNNs are widely used across numerous application scenarios [18, 19, 20] and have achieved prominent success. The input of a GNN is a graph, a kind of non-Euclidean data consisting of nodes and edges. Given the complex structure of graphs, GNNs leverage an information passing mechanism among nodes for graph embedding learning. The process of information passing can be divided into aggregating, combining and readout.

Given an input graph $G(V,E)$ with node set $V$ and edge set $E$, the node information is represented as the feature matrix $X \in \mathbb{R}^{n \times d}$ with $d$ features, and the structure information is represented as the adjacency matrix $A \in \{0,1\}^{n \times n}$. The GNN for each layer is defined as

$$Z_G = \mathrm{GNN}(X, A; \Theta), \tag{1}$$

where $Z_G$ is the graph embedding and $\Theta$ is the parameter set of the GNN. This embedding is used for the downstream tasks. GNN methods can be categorized into spectral and spatial approaches [21]. When computing power is insufficient to operate directly on the graph, several studies instead work in the graph spectral domain [22].

Graph convolutional networks (GCNs) [6], a typical derivative of GNNs, generalize convolutional neural networks (CNNs) [4] to graph-structured data. They have been applied to various graph tasks [23] and are widely utilized in deep learning models [24, 25]. For example, the GCN proposed by Kipf et al. [6] adopts the following layer-wise scheme to realize the hierarchical model, i.e.,

$$H^{(l+1)} = \mathrm{ReLU}\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \tag{2}$$

where $H^{(l)} \in \mathbb{R}^{n \times d}$ is the hidden embedding matrix in layer $l$, $W^{(l)} \in \mathbb{R}^{d \times d}$ is the trainable weight matrix in layer $l$, $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is the corresponding diagonal degree matrix with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$, and $H^{(l+1)}$ is the embedding matrix extracted for the next layer $l+1$ of the hierarchical model. Compared to traditional GNNs, the hierarchical GCN can capture global representations through multi-layer passing. However, as the hierarchical GCN deepens, the node information is propagated to the whole graph. The over-smoothing problem, in which the node representations of the graph tend to become similar, is evident in multi-layer GCNs.
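To make the propagation rule in Equation 2 concrete, the following NumPy sketch applies a single GCN layer to a toy three-node path graph. The function name `gcn_layer`, the toy graph, and the identity weight matrix are illustrative assumptions and are not taken from any reference implementation.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step of Equation 2: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])      # adjacency matrix with self-loops
    deg = A_tilde.sum(axis=1)             # diagonal entries of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)       # safe: every degree is >= 1 due to self-loops
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0) # ReLU

# Toy path graph 0 - 1 - 2 with two features per node (illustrative values).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.array([[1., 0.],
               [0., 1.],
               [1., 1.]])
W0 = np.eye(2)   # identity weights, so only the normalized aggregation is visible
H1 = gcn_layer(H0, A, W0)
print(H1)
```

Each row of `H1` is a degree-normalized mixture of the node's own features and those of its neighbors. Stacking many such layers repeatedly mixes features across the whole graph, which is the mechanism behind the over-smoothing effect described above.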

II-B Graph Auto-Encoder

The GAE is a classical self-supervised framework for graph representation learning. The earliest works related to GAEs are DeepWalk [26] and Node2Vec [27], where encoders play an important role in learning latent representations of vertices. With the addition of GNNs, encoders in GAEs gain the ability to cope with non-Euclidean data [28]. As a self-supervised learning model, the pretext task of GAEs in training is graph reconstruction [29]. In detail, the reconstruction targets can be categorized into fine-grained and coarse-grained ones.

The fine-grained targets contain either nodes or edges. For instance, the Variational Graph Auto-Encoder (VGAE) model [14] adopts two stages, an encoder and a decoder, to accomplish representation learning. Given an input graph $G = (V, E)$, the goal of the VGAE is to embed the graph with the encoder function $f: V \times E \rightarrow Z \in \mathbb{R}^{n \times d}$, which maps the node set $V$, containing $n$ nodes with $d$ features, to the embedding matrix $Z$. Then, the decoder reconstructs the graph through the network $g: Z \rightarrow E'$, where $E'$ is the reconstructed edge set. The training process of the VGAE is

$$Z = f(V, E), \quad E' = g(Z). \tag{3}$$

In the encoding and decoding processes, the VGAE obtains the conditional probability $q(Z \mid V, E)$ from the encoder and $p(E' \mid Z)$ from the decoder. The loss function is defined as

$$\mathcal{L} = \mathrm{KL}[q(Z \mid V, E) \,\|\, p(Z)] - \mathbb{E}_{q(Z \mid V, E)}[\log p(E' \mid Z)], \tag{4}$$

where $\mathrm{KL}[\cdot]$ is the Kullback-Leibler divergence, $\mathbb{E}$ is the expectation, and $p(Z)$ is the Gaussian prior.
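As an illustration of Equation 4, the sketch below evaluates a single-sample estimate of the loss in NumPy. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, a standard normal prior, and an inner-product decoder with Bernoulli edge likelihoods, which are common choices for the VGAE [14]. The function name `vgae_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def vgae_loss(mu, logvar, A, rng):
    """Single-sample estimate of Equation 4: KL[q || p] - E_q[log p(E' | Z)]."""
    # Reparameterized sample Z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(0.5 * logvar) * eps
    # Inner-product decoder (assumption): edge logits are Z Z^T.
    logits = Z @ Z.T
    # Bernoulli log-likelihood over all entries of A, including the diagonal,
    # written with logaddexp for numerical stability:
    #   log sigmoid(x) = -logaddexp(0, -x),  log(1 - sigmoid(x)) = -logaddexp(0, x)
    log_lik = -(A * np.logaddexp(0.0, -logits)
                + (1.0 - A) * np.logaddexp(0.0, logits)).sum()
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return kl - log_lik, kl, log_lik

# Toy 3-node path graph with a 2-dimensional latent space (illustrative values).
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
mu = np.zeros((3, 2))
logvar = np.zeros((3, 2))
loss, kl, log_lik = vgae_loss(mu, logvar, A, rng)
print(loss, kl, log_lik)
```

With `mu = 0` and `logvar = 0` the posterior coincides with the prior, so the KL term vanishes and the loss reduces to the negative reconstruction log-likelihood.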

The Self-Supervised Graph Autoencoder (S2GAE) proposed by Tan et al. [15] randomly masks a portion of the edges and then learns to reconstruct the missing edges. The Self-Supervised Masked Graph Autoencoder (GraphMAE) [13] also uses a masking strategy, in its case to reconstruct node features. These methods focus on local information and disregard several challenges such as over-smoothing.

The coarse-grained targets contain the whole graph, the subgraphs or the paths of the graph. For example, the Heterogeneous Graph Masked Autoencoder (HGMA) [10] adopts a dynamic masking strategy to mask the nodes and edges in paths and then performs path reconstruction. MaskGAE [30] aims to reconstruct the masked edges and node degrees jointly. Recently, some researchers have noted that combining Graph Contrastive Learning (GCL) [31] with the GAE framework can capture complex interdependencies in graphs. Self-supervised Learning for Graph Anomaly Detection (SL-GAD) [32] obtains two subgraphs through graph view sampling and then reconstructs them in two separate decoders for contrastive learning.

Although all these methods improve representation learning, the expressiveness of the representations is still weak. The aforementioned GAEs rarely address the limitations of their learning schemes. For a specific downstream task such as node classification, GAEs can perform well owing to their strong focus on node feature reconstruction. This phenomenon, in which models focus on the node features, is termed topological missing. Since the graph features are over-emphasized, the GAEs are weak in graph structure reconstruction, and these models can be limited in multiple downstream tasks [29]. Meanwhile, the over-smoothing problem of GNNs mentioned in Section II-A affects feature learning in GAEs. In particular, over-smoothing caused by information passing affects the GNN encoder. When the graph structure is perturbed, the noise can be propagated to neighboring nodes through the edges. After several rounds of information passing, the generated graph representations are noisy.

Current Challenges. Graph representation learning based on the GAE framework has achieved good performance. However, two problems remain: (a) the limitation for multiple downstream tasks and (b) over-smoothing. These problems are widespread in current GAEs. Note that some models may overcome one of these challenges, but cannot solve both simultaneously.

III The Methodology

To overcome the aforementioned challenges, we propose a novel HC-GAE to learn effective graph representations. The overview of our model is shown in Figure 1. Similar to other GAEs, our model has two stages, an encoder and a decoder. In the encoder, the input graphs are compressed into coarsened graphs through multiple layers. The outputs of the encoder are the graph-level representations for graph classification. Then, in the decoding process, the decoder reconstructs the graphs and outputs the node-level representations for node classification.

Figure 1: The architecture of our proposed model, HC-GAE.

In the following subsections, we first present our proposed GNN encoder and introduce the subgraphs utilized in the encoder. Then, we introduce the GNN decoder with the soft assignment. Next, we present our loss function, which differs from the standard GAE loss, for effective training. Finally, we discuss the theoretical properties of our proposed HC-GAE.

III-A The GNN Encoder with the Separated Subgraphs

The first module of our model is the GNN encoder, which adopts a hierarchical architecture to compress the input graph. The GNN encoder is composed of multiple layers that successively compress the features and the nodes of the graph. The details of our proposed encoder layer are shown in Figure 2. Each layer can be divided into two processes, assignment and coarsening. The assignment process generates subgraphs from the input graph, and the coarsening process maps these subgraphs to the nodes of the coarsened graph.

Figure 2: The computational architecture for our proposed layer in the GNN encoder.

Assignment. For each layer $l$ of the encoder, the input graph is denoted as $G^{(l)} = (X^{(l)}, A^{(l)})$, where $X^{(l)} \in \mathbb{R}^{n_{(l)} \times d_{(l)}}$ is the feature matrix and $A^{(l)} \in \mathbb{R}^{n_{(l)} \times n_{(l)}}$ is the adjacency matrix. The number of nodes in $G^{(l)}$ is $n_{(l)}$, and each node has $d_{(l)}$ features. Note that $G^{(l)}$ is the original input graph when $l = 1$ and a coarsened graph when $l > 1$. In the assignment process, the graph $G^{(l)}$ is decomposed into subgraphs. We realize the node assignment through hard assignment, where each node cannot be assigned to multiple subgraphs.
Given the feature matrix $X^{(l)}$ and the adjacency matrix $A^{(l)}$, we first calculate the soft assignment matrix $S_{\mathrm{soft}}$, which allows a node to be assigned to several subgraphs, as follows,

$$S_{\mathrm{soft}} = \begin{cases} \mathrm{softmax}(\mathrm{GNN}(X^{(l)}, A^{(l)})) & \mathrm{if}\ l = 1 \\ \mathrm{softmax}(X^{(l)}) & \mathrm{if}\ l > 1 \end{cases}, \tag{5}$$

where $S_{\mathrm{soft}} \in \mathbb{R}^{n_{(l)} \times n_{(l+1)}}$ and $n_{(l+1)} < n_{(l)}$. Based on $S_{\mathrm{soft}}$, the $(i,j)$-th entry of the hard assignment matrix $S^{(l)} \in \{0,1\}^{n_{(l)} \times n_{(l+1)}}$ satisfies

$$S^{(l)}(i,j) = \begin{cases} 1 & \mathrm{if}\ S_{\mathrm{soft}}(i,j) = \max\limits_{\forall j \in n_{l+1}} [S_{\mathrm{soft}}(i,:)] \\ 0 & \mathrm{otherwise} \end{cases}. \tag{6}$$

Clearly, each $i$-th row of the hard assignment matrix $S^{(l)}$ sets its maximum element to $1$ and the remaining elements to $0$, i.e., the $i$-th node is assigned only to the $j$-th subgraph. The $j$-th subgraph is denoted as $G_j^{(l)}(V_j^{(l)}, E_j^{(l)})$, where $V_j^{(l)}$ is the node set of $G_j^{(l)}$ and $E_j^{(l)}$ is the set of edges between the nodes in $V_j^{(l)}$.
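The assignment step in Equations 5 and 6 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the matrix `scores` is a hand-written stand-in for the GNN output in Equation 5, and the helper names `softmax_rows`, `hard_assignment`, and `split_subgraphs` are hypothetical. Ties in a row are broken by taking the first maximum, which is a choice of this sketch.

```python
import numpy as np

def softmax_rows(M):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = np.exp(M - M.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def hard_assignment(S_soft):
    """Equation 6: one-hot rows marking the largest soft assignment per node."""
    n, k = S_soft.shape
    S = np.zeros((n, k), dtype=int)
    S[np.arange(n), S_soft.argmax(axis=1)] = 1   # argmax takes the first maximum on ties
    return S

def split_subgraphs(A, S):
    """Return (node indices, induced adjacency) for each cluster column of S."""
    subgraphs = []
    for j in range(S.shape[1]):
        idx = np.flatnonzero(S[:, j])
        subgraphs.append((idx, A[np.ix_(idx, idx)]))  # keep only intra-cluster edges
    return subgraphs

# Toy path graph 0 - 1 - 2 - 3, to be split into two subgraphs.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
# Stand-in for GNN(X, A) in Equation 5 (assumed values, one column per subgraph).
scores = np.array([[2.0, 0.0],
                   [1.5, 0.2],
                   [0.1, 1.0],
                   [0.0, 3.0]])
S_soft = softmax_rows(scores)
S = hard_assignment(S_soft)
subgraphs = split_subgraphs(A, S)
print(S)
```

In this toy case nodes 0 and 1 form the first subgraph and nodes 2 and 3 the second. The edge between nodes 1 and 2 crosses the two clusters and therefore appears in neither induced adjacency matrix, which is what isolates the subgraphs from each other during the subsequent convolution.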

Coarsening. Once the subgraphs have been generated, the coarsening process compresses them into nodes of the coarsened graph. Given the associated feature matrix $X_j^{(l)}$ and adjacency matrix $A_j^{(l)}$ of the subgraph $G_j^{(l)}$, we adopt a local graph coarsening operation to extract the local information as

$$Z_j^{(l)} = A_j^{(l)} X_j^{(l)} W_j^{(l)}, \tag{7}$$

where $W_j^{(l)} \in \mathbb{R}^{d_{(l)} \times d_{(l+1)}}$ ($d_{(l)} > d_{(l+1)}$) is the trainable weight matrix of layer $l$, and $Z_j^{(l)} \in \mathbb{R}^{|V_j^{(l)}| \times d_{(l+1)}}$ represents the resulting matrix of subgraph $G_j^{(l)}$. To compress each subgraph $G_j^{(l)}$, we utilize a mapping vector $\mathbf{s}_j^{(l)}$, and

$$\mathbf{s}_j^{(l)} = \mathrm{softmax}(A_j^{(l)} X_j^{(l)} D_j^{(l)}), \tag{8}$$

where $D_j^{(l)} \in \mathbb{R}^{d_{(l)} \times 1}$ is the trainable vector. $\mathbf{s}_j^{(l)}$ plays an important role in compressing each $j$-th subgraph $G_j^{(l)}$ into a node of the coarsened graph. After these local graph operations on the separated subgraphs in the $l$-th layer, we aggregate the local information to generate the coarsened graph, which serves as the input graph $G^{(l+1)}$ for the next layer. We collect the feature matrices of the subgraphs in the $l$-th layer as the feature matrix $Z^{(l)} \in \mathbb{R}^{n_{(l)} \times d_{(l+1)}}$, whose node sequence follows the original input graph $G^{(l)}$.
In detail, each vertex feature vector of $Z^{(l)}$ is equal to that of the corresponding node in $G_j^{(l)}$, since each original node in $G^{(l)}$ essentially corresponds to an embedded node of a subgraph $G_j^{(l)}$. Given the hard assignment matrix $S^{(l)}$ defined by Equation 6, the mapping vector $\mathbf{s}_j^{(l)}$ of each subgraph $G_j^{(l)}$ defined by Equation 8, the feature matrix $Z^{(l)}$ and the adjacency matrix $A^{(l)}$, we calculate the feature matrix $X^{(l+1)}$ and the adjacency matrix $A^{(l+1)}$ of the resulting coarsened graph $G^{(l+1)}$ as

$$X^{(l+1)}=\mathrm{Reorder}\left[\mathop{\|}\limits_{j=1}^{n_{(l+1)}}\mathbf{s}^{(l)^{\top}}_{j}\right]Z^{(l)}, \tag{9}$$

and

$$A^{(l+1)}=S^{(l)^{\top}}A^{(l)}S^{(l)}, \tag{10}$$

where $\|$ is the concatenation operation, and the function $\mathrm{Reorder}$ reorders the entries of $[\cdot]$ to follow the node order of the subgraphs $G_{j}^{(l)}$. In our model, we set the number of encoder layers to $L$. The resulting graph $G^{(L+1)}=(X^{(L+1)},A^{(L+1)})$ is the input graph for the decoder. Meanwhile, we adopt a non-parameterized readout function (e.g., MaxPooling or MeanPooling) to embed $X^{(L+1)}$ into the graph-level representation for graph classification.
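The coarsening step of Equations 8 to 10 can be illustrated with a small NumPy sketch on a toy path graph. This is only a sketch under stated assumptions: the names `coarsen` and `softmax`, the random stand-ins for the features, embeddings and trainable vectors, and the realization of Reorder as placing each mapping vector back at the original node positions are illustrative choices, not details fixed by the text above.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def coarsen(A, X, Z, S, D):
    """One coarsening step in the spirit of Equations 8-10 (illustrative).

    A: (n, n) adjacency, X: (n, d_in) input features, Z: (n, d_out) embeddings,
    S: (n, k) one-hot hard assignment, D: (k, d_in) one vector per subgraph.
    """
    n, k = S.shape
    W = np.zeros((k, n))  # row j holds s_j^T at the original node positions
    for j in range(k):
        idx = np.flatnonzero(S[:, j])      # nodes of the j-th subgraph
        if idx.size == 0:
            continue                       # empty cluster: its row stays zero
        A_j = A[np.ix_(idx, idx)]          # adjacency restricted to the subgraph
        W[j, idx] = softmax(A_j @ X[idx] @ D[j])  # Eq. 8
    X_next = W @ Z                         # Eq. 9
    A_next = S.T @ A @ S                   # Eq. 10
    return X_next, A_next, W

# Toy input (assumed): path graph 0-1-2-3 split into clusters {0, 1} and {2, 3}.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X = rng.normal(size=(4, 3))
Z = rng.normal(size=(4, 5))
D = rng.normal(size=(2, 3))

X_next, A_next, W = coarsen(A, X, Z, S, D)
graph_repr = X_next.mean(axis=0)  # MeanPooling readout over coarsened nodes
```

In this toy case each coarsened node is a convex combination of the embeddings inside its own subgraph, and the off-diagonal entries of `A_next` count the edges that cross between the two clusters.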

III-B GNN Decoder with the Soft Assignment

After encoding the input graph, we adopt our proposed GNN decoder, which aims to generate the reconstructed graph. Similar to the encoder, our decoder is composed of multiple layers. The details of a layer in the GNN decoder are shown in Figure 3. We call the input of a decoder layer the retrieved graph and its output the reconstructed graph. The reconstructed graph is generated by the soft assignment, where each input node in the retrieved graph is assigned to the whole reconstructed graph. Next, we describe how the reconstructed graph is generated from the retrieved graph.

Figure 3: The illustration of our proposed layer in the GNN decoder.

Reconstruction. We denote the retrieved graph, which is the input to the $l^{\prime}$-th layer of the decoder, as $G^{\prime(l^{\prime})}=(X^{\prime(l^{\prime})},A^{\prime(l^{\prime})})$, where $X^{\prime(l^{\prime})}\in\mathbb{R}^{n_{(l^{\prime})}\times d_{(l^{\prime})}}$ is the feature matrix and $A^{\prime(l^{\prime})}\in\mathbb{R}^{n_{(l^{\prime})}\times n_{(l^{\prime})}}$ is the adjacency matrix. Note that $G^{\prime(l^{\prime})}$ is equal to $G^{(L)}$ when $l^{\prime}=1$. Given the input $G^{\prime(l^{\prime})}$ with the feature matrix $X^{\prime(l^{\prime})}$ and the adjacency matrix $A^{\prime(l^{\prime})}$, we denote the learned re-assignment matrix as $\bar{S}^{(l^{\prime})}\in\mathbb{R}^{n_{(l^{\prime})}\times n_{(l^{\prime}+1)}}$ and the embedding matrix as $\bar{Z}^{(l^{\prime})}\in\mathbb{R}^{n_{(l^{\prime})}\times d_{(l^{\prime}+1)}}$, both of which belong to layer $l^{\prime}$. In contrast to the more complex encoding process, these matrices are calculated directly as

$$\bar{S}^{(l^{\prime})}=\mathrm{softmax}\left(\mathrm{GNN}_{l^{\prime},\mathrm{re}}(X^{\prime(l^{\prime})},A^{\prime(l^{\prime})})\right), \tag{11}$$

and

$$\bar{Z}^{(l^{\prime})}=\mathrm{GNN}_{l^{\prime},\mathrm{emb}}(X^{\prime(l^{\prime})},A^{\prime(l^{\prime})}), \tag{12}$$

where $\mathrm{GNN}_{l^{\prime},\mathrm{re}}$ and $\mathrm{GNN}_{l^{\prime},\mathrm{emb}}$ are two different GNN blocks that do not share parameters. Although both GNN blocks take the same inputs, they serve distinct functions. In detail, $\mathrm{GNN}_{l^{\prime},\mathrm{re}}$ generates a probability distribution that assigns nodes to the reconstructed graph, while $\mathrm{GNN}_{l^{\prime},\mathrm{emb}}$ generates the new embeddings. With the re-assignment matrix $\bar{S}^{(l^{\prime})}$ and the embedding matrix $\bar{Z}^{(l^{\prime})}$, we calculate the resulting $X^{\prime(l^{\prime}+1)}$ and $A^{\prime(l^{\prime}+1)}$ as

$$X^{\prime(l^{\prime}+1)}=\bar{S}^{(l^{\prime})^{\top}}\bar{Z}^{(l^{\prime})}, \tag{13}$$

and

$$A^{\prime(l^{\prime}+1)}=\bar{S}^{(l^{\prime})^{\top}}A^{\prime(l^{\prime})}\bar{S}^{(l^{\prime})}, \tag{14}$$

where $X^{\prime(l^{\prime}+1)}\in\mathbb{R}^{n_{(l^{\prime}+1)}\times d_{(l^{\prime}+1)}}$ is the feature matrix and $A^{\prime(l^{\prime}+1)}\in\mathbb{R}^{n_{(l^{\prime}+1)}\times n_{(l^{\prime}+1)}}$ is the adjacency matrix of the reconstructed graph $G^{\prime(l^{\prime}+1)}$. In our model, we set the number of decoder layers to $L^{\prime}$, so the resulting graph $G^{\prime(L^{\prime}+1)}=(X^{\prime(L^{\prime}+1)},A^{\prime(L^{\prime}+1)})$ is also the output of our GAE. Note that the feature matrix $X^{\prime(L^{\prime}+1)}$ provides the node-level representations for the node classification task.
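A single decoder layer following Equations 11 to 14 can be sketched in NumPy as below. The sketch rests on several assumptions that the text does not fix: each GNN block is replaced by a hypothetical one-layer mean aggregation with self-loops (`gnn_block`), the softmax is taken row-wise so that every input node receives a distribution over the reconstructed nodes, and the weights are random placeholders.

```python
import numpy as np

def row_softmax(M):
    """Softmax over each row, so every input node gets a distribution."""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gnn_block(X, A, W):
    """Hypothetical stand-in for a GNN block: mean aggregation with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(A_hat @ X @ W)

def decoder_layer(X, A, W_re, W_emb):
    """Expand a retrieved graph (X, A) into a larger reconstructed graph."""
    S_bar = row_softmax(gnn_block(X, A, W_re))  # Eq. 11, shape (n, n_next)
    Z_bar = gnn_block(X, A, W_emb)              # Eq. 12, shape (n, d_next)
    X_next = S_bar.T @ Z_bar                    # Eq. 13, shape (n_next, d_next)
    A_next = S_bar.T @ A @ S_bar                # Eq. 14, shape (n_next, n_next)
    return X_next, A_next, S_bar

# Toy input (assumed): expand a 2-node retrieved graph to 4 nodes.
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [1.0, 2.0]])
X = rng.normal(size=(2, 3))
W_re = rng.normal(size=(3, 4))    # separate parameters for the two blocks
W_emb = rng.normal(size=(3, 5))

X_next, A_next, S_bar = decoder_layer(X, A, W_re, W_emb)
```

Because every row of `S_bar` sums to one under this row-wise convention, the total edge weight of the retrieved graph is carried over to the reconstructed graph; the layer only redistributes it over more nodes.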

III-C Our Loss Function

Compared to the standard loss $\mathcal{L}$ in Equation 4, our loss function is re-designed to strengthen the expressiveness of the representations. In detail, our model is composed of the encoder, which focuses on the local information of the subgraphs, and the decoder, which generates the reconstructed graphs. Therefore, our loss is divided into two parts, the local loss and the global loss, i.e.,

$$
\begin{aligned}
\mathcal{L}_{\mathrm{local}} &= \sum_{l=1}^{L}\sum_{j=1}^{n_{(l+1)}}\mathrm{KL}\left[q(Z_{j}^{(l)}\mid X_{j}^{(l)},A_{j}^{(l)})\,\|\,p(Z^{(l)})\right],\\
\mathcal{L}_{\mathrm{global}} &= -\sum_{l=1}^{L}\mathbb{E}_{q(X^{(L)},A^{(L)}\mid X^{(l)},A^{(l)})}\left[\log p(X^{\prime(L-l+2)},A^{\prime(L-l+2)}\mid X^{(L)},A^{(L)})\right],\\
\mathcal{L}_{\mathrm{HC-GAE}} &= \mathcal{L}_{\mathrm{local}}+\mathcal{L}_{\mathrm{global}},
\end{aligned}
\tag{15}
$$

where $\mathcal{L}_{\mathrm{HC-GAE}}$ is our proposed loss, $\mathcal{L}_{\mathrm{local}}$ is the local loss, $\mathcal{L}_{\mathrm{global}}$ is the global loss, and $p(Z^{(l)})$ is the Gaussian prior for the subgraphs of the $l$-th layer. Compared to the loss $\mathcal{L}$ in Equation 4, our local loss $\mathcal{L}_{\mathrm{local}}$ trains the subgraph generation, which preserves the local information and avoids over-smoothing in the GNN encoder. The global loss $\mathcal{L}_{\mathrm{global}}$ trains the reconstruction of the graph features and structure. Since $\mathcal{L}_{\mathrm{local}}$ acts as a regularization term, combining it with $\mathcal{L}_{\mathrm{global}}$ broadens the reconstruction requirement to cover multiple downstream tasks. Therefore, $\mathcal{L}_{\mathrm{HC-GAE}}$ not only strengthens the graph representations by adding the local information, but also addresses the challenges mentioned in Section II-B.
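The structure of the combined loss in Equation 15 can be sketched numerically as follows. The concrete distributions are assumptions made only for illustration: the posterior of each subgraph is taken to be a diagonal Gaussian compared against a standard normal prior (which has a closed-form KL divergence), and the negative log-likelihood of the global term is replaced by a unit-variance Gaussian error with constants dropped. The function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over all entries."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def gaussian_nll(target, recon):
    """Negative log-likelihood under a unit-variance Gaussian, constants dropped."""
    return 0.5 * np.sum((target - recon) ** 2)

def hc_gae_loss(posteriors, pairs):
    """Local KL term plus global reconstruction term (illustrative only).

    posteriors: list over encoder layers of lists of (mu, log_var) per subgraph.
    pairs: list of (X, A, X_rec, A_rec), matching an encoder-layer graph with
           the decoder-layer graph that is meant to reconstruct it.
    """
    local = sum(kl_to_standard_normal(mu, lv)
                for layer in posteriors for (mu, lv) in layer)
    global_ = sum(gaussian_nll(X, Xr) + gaussian_nll(A, Ar)
                  for (X, A, Xr, Ar) in pairs)
    return local + global_, local, global_

# Toy input (assumed): one layer with two subgraphs and a perfect reconstruction.
posteriors = [[(np.zeros((2, 3)), np.zeros((2, 3))),
               (np.ones((2, 3)), np.zeros((2, 3)))]]
X = np.arange(12.0).reshape(4, 3)
A = np.eye(4)
total, local, global_ = hc_gae_loss(posteriors, [(X, A, X, A)])
```

In this toy case the first subgraph already matches the prior and the reconstruction is exact, so the whole loss comes from the second subgraph's KL term, which is the regularizing role the local loss plays.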

TABLE I: Datasets for node classification

| Datasets | Cora | CiteSeer | PubMed | Computers | CS |
|---|---|---|---|---|---|
| Nodes | 2708 | 3312 | 19717 | 13752 | 18333 |
| Edges | 5429 | 4660 | 44338 | 245861 | 81894 |
| Features | 1433 | 3703 | 500 | 767 | 6805 |
| Classes | 7 | 6 | 3 | 10 | 15 |

TABLE II: Datasets for graph classification

| Datasets | IMDB-B | IMDB-M | PROTEINS | COLLAB | MUTAG |
|---|---|---|---|---|---|
| Graphs | 1000 | 1500 | 1113 | 5000 | 188 |
| Nodes (mean) | 19.77 | 13 | 39.06 | 74.49 | 17.93 |
| Edges (mean) | 96.53 | 65.94 | 72.82 | 2457.78 | 19.79 |
| Classes | 2 | 3 | 2 | 3 | 2 |

III-D Discussions

(a) Why are our results effective in multiple downstream tasks?

As mentioned in Section I and Section II-B, traditional GAE methods may over-emphasize the goal of graph reconstruction. A typical consequence is topological missing, which damages graph structure learning and aggravates over-fitting to the graph features [33]. As a result, the graph-level representations they produce are not effective for graph classification. To generate generalized representations for multiple downstream tasks, our approach introduces several novel operations. We summarize the reasons why they are effective as follows.

First, our model adopts a series of assignment strategies to improve the encoding and decoding. In the encoder, we decompose the input graph into separated subgraphs and assign the nodes by the hard assignment. In the decoder, the reconstructed graph is generated by the soft assignment. The hard assignment preserves the local heterogeneous information and discards redundancy in graph-level representation learning, while the soft assignment accomplishes the generation of the node-level representations. The combination of these assignment strategies ensures that our proposed model has the general capability to learn multi-level representations for various downstream tasks. Second, our model re-designs the loss function to suit the training of the new modules. The global loss $\mathcal{L}_{\mathrm{global}}$ sets two reconstruction goals, the graph features and the graph structure. Having multiple self-supervised goals reduces the over-emphasis on graph features that causes topological missing. The local loss $\mathcal{L}_{\mathrm{local}}$ not only acts as a regularization term in $\mathcal{L}_{\mathrm{HC-GAE}}$, but also captures the local information from the subgraphs for training. Adding local information is a common way to improve the generalization of graph representations [34].

(b) Why does the over-smoothing hardly affect the model?

In the encoding process, we adopt separated subgraphs, between which there are no connections. Unlike hierarchical GNN methods such as DiffPool [35] or GAE methods such as VGAE [14], message passing is confined to the individual subgraphs, so node information can hardly propagate across the whole graph. This can significantly reduce the over-smoothing problem.
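The effect of confining message passing to separated subgraphs can be seen in a small numerical experiment. The sketch below is an illustration under simplifying assumptions: propagation is a plain linear mean aggregation with self-loops rather than a trained GNN, and the toy graph (two triangles joined by one bridge edge) and the names `propagate`, `A_sep`, `spread_full` and `spread_sep` are hypothetical.

```python
import numpy as np

def propagate(A, x, steps):
    """Apply `steps` rounds of mean aggregation with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    P = A_hat / A_hat.sum(axis=1, keepdims=True)
    for _ in range(steps):
        x = P @ x
    return x

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Hard assignment of the nodes to two clusters; masking with S S^T removes
# every edge whose endpoints lie in different clusters.
S = np.zeros((6, 2))
S[:3, 0] = 1.0
S[3:, 1] = 1.0
A_sep = A * (S @ S.T)

x0 = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # one scalar feature per node
x_full = propagate(A, x0, 50)      # message passing over the whole graph
x_sep = propagate(A_sep, x0, 50)   # message passing inside subgraphs only

spread_full = x_full.max() - x_full.min()
spread_sep = x_sep.max() - x_sep.min()
```

After many propagation rounds the node features on the connected graph become nearly indistinguishable (`spread_full` is close to zero), whereas on the separated subgraphs the two groups keep their distinct values (`spread_sep` stays close to one).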

IV Experiments

In this section, we evaluate the performance of our proposed model over two important graph learning tasks including node classification and graph classification. The details of the datasets and the experiment settings are shown in Table I and Table II.

TABLE III: Node classification performance based on accuracy. A.R. is the average rank.

| Datasets | Cora | CiteSeer | PubMed | Computers | CS | A.R. |
|---|---|---|---|---|---|---|
| DGI | 85.41±0.34 | 74.51±0.51 | 85.95±0.66 | 84.68±0.39 | 91.33±0.30 | 4.0 |
| VGAE | 83.60±0.52 | 63.37±1.21 | 78.23±1.63 | 87.21±0.26 | 89.79±0.09 | 5.2 |
| SSL-GCN | 57.29±0.13 | 59.57±1.77 | 75.06±0.37 | 80.49±0.10 | 84.71±0.95 | 6.8 |
| GraphSage | 74.30±1.84 | 60.20±2.15 | 81.96±0.74 | 87.05±0.25 | 89.74±0.19 | 5.6 |
| GraphMAE | 85.45±0.40 | 72.48±0.77 | 85.74±0.14 | 88.04±0.61 | 93.47±0.04 | 3.0 |
| S2GAE | 86.15±0.25 | 74.60±0.06 | 86.91±0.28 | 90.94±0.08 | 91.70±0.08 | 2.2 |
| HC-GAE | 87.97±0.10 | 75.29±0.09 | 87.56±0.35 | 91.07±0.14 | 92.28±0.07 | 1.2 |

TABLE IV: Graph classification performance based on accuracy. A.R. is the average rank.

| Datasets | IMDB-B | IMDB-M | PROTEINS | COLLAB | MUTAG | A.R. |
|---|---|---|---|---|---|---|
| WLSK | 64.48±0.90 | 43.38±0.75 | 71.70±0.67 | N/A | 80.72±3.00 | 7.75 |
| DGCNN | 67.45±0.83 | 46.33±0.73 | 73.21±0.34 | N/A | 85.83±1.66 | 6.25 |
| DiffPool | 72.6±3.9 | 47.2±1.8 | 75.1±3.5 | 78.9±2.3 | 85.0±10.3 | 5.20 |
| Graph2Vec | 71.10±0.54 | 50.44±0.87 | 73.30±2.05 | N/A | 83.15±9.25 | 5.75 |
| InfoGCL | 75.10±0.90 | 51.40±0.80 | N/A | 80.00±1.30 | 91.20±1.30 | 3.50 |
| GraphMAE | 75.52±0.66 | 51.63±0.52 | 75.30±0.39 | 80.32±0.46 | 88.19±1.26 | 3.20 |
| S2GAE | 75.76±0.62 | 51.79±0.36 | 76.37±0.43 | 81.02±0.53 | 88.26±0.76 | 2.00 |
| HC-GAE | 76.72±0.60 | 51.90±1.47 | 78.13±1.37 | 80.41±0.02 | 92.38±1.17 | 1.20 |

Figure 4: The ablation experiments on graph classification task.

IV-A Node Classification

Datasets. For the node classification task, we consider 5 real-world datasets (Cora, CiteSeer, PubMed, Amazon-Computers and Coauthor-CS). To compare our model fairly with the baselines, we follow the previous study [36] in carrying out the experiments, and utilize an SVM classifier to predict the node labels. We evaluate model performance based on the accuracy score.

Baselines. We compare our model with 6 self-supervised models including DGI [37], VGAE [14], SSL-GCN [38], GraphSage [39], GraphMAE [13], S2GAE [15]. The reported results of baselines are from previous papers if available.

Experimental Setup. In order to compare the methods fairly, we adopt 10-fold cross-validation to test all models, and we generally follow the same parameter settings across the different baselines. We select the Adam optimizer to optimize the model parameters in our experiments. The neural network models are trained for 50 epochs. During the training process, we set the hidden dimension of the models to 128 and the dropout to 0.5. Specifically, for the node classification task, we set the batch size to 1024 and the learning rate to 1e-2. For the graph classification task, we set the batch size to 64 and the learning rate to 5e-4. For our proposed model HC-GAE, we set the number of encoder layers $L$ to 3 and the number of decoder layers $L^{\prime}$ to 3. The node numbers of the three layers in our encoder are $\{128,64,32\}$, and the node numbers of the three layers in our decoder are $\{32,64,128\}$. The experiments were performed on four GeForce RTX 2080 Ti GPUs.
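For reference, the settings listed above can be gathered into a small configuration object. The values are the ones stated in the setup; the class name `HCGAEConfig`, its field names, and the two instance names are hypothetical and only serve to organize them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HCGAEConfig:
    """Hyperparameters from the experimental setup (field names are illustrative)."""
    task: str
    batch_size: int
    learning_rate: float
    epochs: int = 50
    hidden_dim: int = 128
    dropout: float = 0.5
    optimizer: str = "adam"
    cv_folds: int = 10
    encoder_nodes: tuple = (128, 64, 32)   # node numbers of the 3 encoder layers
    decoder_nodes: tuple = (32, 64, 128)   # node numbers of the 3 decoder layers

# Task-specific settings: only batch size and learning rate differ.
NODE_CLS = HCGAEConfig(task="node_classification",
                       batch_size=1024, learning_rate=1e-2)
GRAPH_CLS = HCGAEConfig(task="graph_classification",
                        batch_size=64, learning_rate=5e-4)
```

Keeping the shared values as defaults makes it explicit that the two tasks differ only in batch size and learning rate.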

Results. We summarize the results in Table III. Our proposed model HC-GAE outperforms all the baseline models on four of the five datasets and achieves the best average rank. Only on the CS dataset is the accuracy of the self-supervised method GraphMAE a little higher than that of our model. From these results, we observe that VGAE and its improved variants are effective in node representation learning, and that the improved GAE methods (e.g., GraphMAE, S2GAE) perform better than VGAE. This verifies that the GAE framework is suitable for node-level representation learning.

IV-B Graph Classification

Datasets. For the graph classification task, we adopt 5 standard graph datasets (IMDB-B, IMDB-M, PROTEINS, COLLAB, MUTAG). In the experiments, we follow the previous study [13], and feed the resulting graph representations into an SVM classifier for prediction. We again evaluate the performance based on the accuracy score, and report the mean 10-fold cross-validation accuracy with standard deviation.

Baselines. We compare our model with a typical graph kernel model WLSK [40] based on the subtree invariants, 2 supervised baseline models including DGCNN [41] and DiffPool [35], 4 self-supervised baseline models including Graph2Vec [42], InfoGCL [43], GraphMAE [13], S2GAE [15].

Results. We summarize the results in Table IV. Similar to the node classification task, our model performs strongly on graph classification, achieving the best accuracy on four of the five datasets. Only on the COLLAB dataset does S2GAE slightly outperform our model. From these comparisons on node classification and graph classification, we can analyse why our model is effective. GAEs such as GraphMAE, which focus on graph feature reconstruction, can obtain outstanding performance on a specific node classification dataset. However, the performance of GraphMAE on graph classification is not as good as that of the GAEs that focus on both graph features and structure (e.g., S2GAE, HC-GAE). This verifies that topological missing disturbs GraphMAE, while the combination of the assignment strategies and the re-designed loss improves our proposed HC-GAE for multiple downstream tasks. In the hierarchical methods such as DGCNN and DiffPool, there is an assignment process when the input graphs are compressed into the coarsened graphs: DGCNN adopts the top-k strategy to select nodes, and DiffPool utilizes the soft assignment. Since these methods cannot prevent the information passing that causes over-smoothing, their performance is limited in the experiments. In contrast, the separated subgraphs in our encoder restrict this information passing, which accounts for the improvement of our GAE.

IV-C Ablation Study

In order to analyse the effectiveness of our encoder, we replace the separated subgraphs and the hard assignment strategy in the encoder with the soft assignment strategy. The comparison results are shown in Figure 4. In this experiment, we denote our model with the soft assignment in the encoder as HC-GAE-SE. We observe that the performance of HC-GAE-SE on graph classification is lower than that of the original HC-GAE. Compared to HC-GAE-SE, our original model has two factors that strengthen the training. First, the separated subgraphs protect the encoding from over-smoothing, which would otherwise degrade the performance. The subgraph generation combined with the hard assignment confines message passing to within each subgraph. Second, the loss calculation relies on the local graph information. Without the original encoder, $\mathcal{L}_{\mathrm{HC-GAE}}$ lacks the local loss $\mathcal{L}_{\mathrm{local}}$ and cannot drive the encoding process to extract the local information.

V Conclusion

In this paper, we have proposed a novel HC-GAE model to effectively learn multi-level graph representations for various downstream tasks, namely node classification and graph classification. During the encoding process, we have adopted the hard node assignment to decompose a sample graph into a family of separated subgraphs, which are compressed into the coarsened nodes of the resulting coarsened graph. On the other hand, during the decoding process, we have utilized the soft node assignment to reconstruct the original graph structure. Through the encoding and decoding processes, the proposed HC-GAE can effectively extract hierarchical graph representations. The re-designed loss function balances the training of the encoder and the decoder. In the experiments, we have evaluated the performance of the proposed HC-GAE model on both node classification and graph classification. The experimental results have demonstrated the effectiveness of the proposed model.

References

  • [1] W. Ju, Y. Gu, X. Luo, Y. Wang, H. Yuan, H. Zhong, and M. Zhang, “Unsupervised graph-level representation learning with hierarchical contrasts,” Neural Networks, vol. 158, pp. 359–368, 2023.
  • [2] Z. Liu, Y. Chen, F. Xia, J. Bian, B. Zhu, G. Shen, and X. Kong, “Tap: Traffic accident profiling via multi-task spatio-temporal graph representation learning,” ACM Transactions on Knowledge Discovery from Data, vol. 17, no. 4, pp. 1–25, 2023.
  • [3] L. Wang, H. Liu, Y. Liu, J. Kurtin, and S. Ji, “Learning hierarchical protein representations via complete 3d graph networks,” in International Conference on Learning Representations (ICLR), 2023.
  • [4] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural networks: analysis, applications, and prospects,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [5] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 4–24, 2020.
  • [6] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [7] Y. Huang, Y. Zeng, Q. Wu, and L. Lü, “Higher-order graph convolutional network with flower-petals laplacians on simplicial complexes,” arXiv preprint arXiv:2309.12971, 2023.
  • [8] M. Réau, N. Renaud, L. C. Xue, and A. M. Bonvin, “Deeprank-gnn: a graph neural network framework to learn patterns in protein–protein interfaces,” Bioinformatics, vol. 39, no. 1, p. btac759, 2023.
  • [9] U. Michelucci, “An introduction to autoencoders,” arXiv preprint arXiv:2201.03898, 2022.
  • [10] Y. Tian, K. Dong, C. Zhang, C. Zhang, and N. V. Chawla, “Heterogeneous graph masked autoencoders,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 9997–10 005.
  • [11] J. Li, X. Fu, S. Zhu, H. Peng, S. Wang, Q. Sun, P. S. Yu, and L. He, “A robust and generalized framework for adversarial graph embedding,” IEEE Transactions on Knowledge and Data Engineering, 2023.
  • [12] Y. Liu, X. Yang, S. Zhou, X. Liu, Z. Wang, K. Liang, W. Tu, L. Li, J. Duan, and C. Chen, “Hard sample aware network for contrastive deep graph clustering,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 7, 2023, pp. 8914–8922.
  • [13] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, and J. Tang, “Graphmae: Self-supervised masked graph autoencoders,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 594–604.
  • [14] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
  • [15] Q. Tan, N. Liu, X. Huang, S.-H. Choi, L. Li, R. Chen, and X. Hu, “S2gae: Self-supervised graph autoencoders are generalizable learners with graph masking,” in Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023, pp. 787–795.
  • [16] D. Mesquita, A. Souza, and S. Kaski, “Rethinking pooling in graph neural networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 2220–2231, 2020.
  • [17] J. H. Giraldo, K. Skianis, T. Bouwmans, and F. D. Malliaros, “On the trade-off between over-smoothing and over-squashing in deep graph neural networks,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 566–576.
  • [18] J. Li, H. Shomer, H. Mao, S. Zeng, Y. Ma, N. Shah, J. Tang, and D. Yin, “Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [19] V. Vasudevan, M. Bassenne, M. T. Islam, and L. Xing, “Image classification using graph neural network and multiscale wavelet superpixels,” Pattern Recognition Letters, vol. 166, pp. 89–96, 2023.
  • [20] L. Wei, H. Zhao, Z. He, and Q. Yao, “Neural architecture search for gnn-based graph classification,” ACM Transactions on Information Systems, vol. 42, no. 1, pp. 1–29, 2023.
  • [21] C. Gao, Y. Zheng, N. Li, Y. Li, Y. Qin, J. Piao, Y. Quan, J. Chang, D. Jin, X. He et al., “A survey of graph neural networks for recommender systems: Challenges, methods, and directions,” ACM Transactions on Recommender Systems, vol. 1, no. 1, pp. 1–51, 2023.
  • [22] X. Wang and M. Zhang, “How powerful are spectral graph neural networks,” in International Conference on Machine Learning.   PMLR, 2022, pp. 23 341–23 362.
  • [23] U. A. Bhatti, H. Tang, G. Wu, S. Marjan, and A. Hussain, “Deep learning with graph convolutional networks: An overview and latest applications in computational intelligence,” International Journal of Intelligent Systems, vol. 2023, pp. 1–28, 2023.
  • [24] F. Hu, Y. Zhu, S. Wu, L. Wang, and T. Tan, “Hierarchical graph convolutional networks for semi-supervised node classification,” arXiv preprint arXiv:1902.06667, 2019.
  • [25] K. Guo, Y. Hu, Y. Sun, S. Qian, J. Gao, and B. Yin, “Hierarchical graph convolution network for traffic forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 1, 2021, pp. 151–159.
  • [26] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 701–710.
  • [27] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • [28] E. Pan and Z. Kang, “Beyond homophily: Reconstructing structure for graph-agnostic clustering,” in International Conference on Machine Learning.   PMLR, 2023, pp. 26 868–26 877.
  • [29] J. Li, R. Wu, W. Sun, L. Chen, S. Tian, L. Zhu, C. Meng, Z. Zheng, and W. Wang, “What’s behind the mask: Understanding masked graph modeling for graph autoencoders,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 1268–1279.
  • [30] J. Li, R. Wu, W. Sun, L. Chen, S. Tian, L. Zhu, C. Meng, Z. Zheng, and W. Wang, “Maskgae: Masked graph modeling meets graph autoencoders,” arXiv preprint arXiv:2205.10053, 2022.
  • [31] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems, vol. 33, pp. 5812–5823, 2020.
  • [32] Y. Zheng, M. Jin, Y. Liu, L. Chi, K. T. Phan, and Y.-P. P. Chen, “Generative and contrastive self-supervised learning for graph anomaly detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12 220–12 233, 2021.
  • [33] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” arXiv preprint arXiv:1809.10341, 2018.
  • [34] Y. You, T. Chen, Z. Wang, and Y. Shen, “When does self-supervision help graph convolutional networks?” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 871–10 880.
  • [35] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” Advances in neural information processing systems, vol. 31, 2018.
  • [36] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 118–22 133, 2020.
  • [37] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” in International Conference on Learning Representations (ICLR), 2019.
  • [38] Q. Zhu, B. Du, and P. Yan, “Self-supervised training of graph convolutional networks,” arXiv preprint arXiv:2006.02380, 2020.
  • [39] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • [40] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial intelligence and statistics.   PMLR, 2009, pp. 488–495.
  • [41] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [42] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.
  • [43] D. Xu, W. Cheng, D. Luo, H. Chen, and X. Zhang, “Infogcl: Information-aware graph contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 30 414–30 425, 2021.