Node and edge dual-masked self-supervised graph representation

Tang, Peng; Xie, Cheng; Duan, Haoran

doi:10.1007/s10115-023-01950-2


Regular Paper
Open access
Published: 23 December 2023

Volume 66, pages 2307–2326, (2024)

Knowledge and Information Systems


Abstract

Self-supervised graph representation learning has been widely used in many intelligent applications since labeled information can hardly be found in these data environments. Currently, masking and reconstruction-based (MR-based) methods lead the state-of-the-art records in the self-supervised graph representation field. However, existing MR-based methods did not fully consider both the deep-level node and structure information which might decrease the final performance of the graph representation. To this end, this paper proposes a node and edge dual-masked self-supervised graph representation model to consider both node and structure information. First, a dual masking model is proposed to perform node masking and edge masking on the original graph at the same time to generate two masking graphs. Second, a graph encoder is designed to encode the two generated masking graphs. Then, two reconstruction decoders are designed to reconstruct the nodes and edges according to the masking graphs. At last, the reconstructed nodes and edges are compared with the original nodes and edges to calculate the loss values without using the labeled information. The proposed method is validated on a total of 14 datasets for graph node classification tasks and graph classification tasks. The experimental results show that the method is effective in self-supervised graph representation. The code is available at: https://github.com/TangPeng0627/Node-and-Edge-Dual-Mask.


1 Introduction

Graphs are arguably the data representation closest to the real world: social relations, knowledge bases, chemical structures, and proteins are all naturally graph-structured. However, high-quality labels for supervised learning are difficult to obtain for such data, because labeling graph nodes (e.g., unknown molecules, non-trivial proteins, or social relations) requires substantial domain knowledge. Self-Supervised Graph Representation Learning (SSGRL) derives supervisory signals from the data itself, enabling large-scale models to be trained on massive unlabeled data by designing different proxy tasks. SSGRL is therefore well suited to processing real-world graph data without pre-defined labels. Recently, SSGRL has been widely used in the knowledge engineering [40] and bioinformatics [7] domains, and it has become a research hotspot in intelligent applications. Excellent results have been achieved in traffic prediction [46], protein prediction [48], and bioinformatics [49].

Contrastive learning (CL)-based [25, 33, 45] and generative learning (GL)-based [14, 18, 28] methods are now the two most popular types of SSGRL approach. Contrastive learning-based techniques drive self-supervised graph representation learning by maximizing the similarity of positive sample nodes and minimizing the similarity of negative sample nodes. Positive sample nodes are obtained through graph augmentation techniques, and the remaining nodes are used as negative samples. These methods call for expensive training and powerful graph augmentation techniques, and on varied graph data it is challenging to obtain the best results with current graph augmentation approaches. As a result, CL-based SSGRL depends heavily on high-quality data augmentation (Fig. 1).

Generative learning-based methods such as self-supervised graph autoencoders (GAE) [18] can avoid the above problems because they aim to reconstruct corrupted data without additional data augmentation methods. In the early research, such methods [18, 23, 28] drive self-supervised graph representation learning by reconstructing adjacency matrices, which makes such methods only focus on graph structure information and ignore rich graph node information. In the latest research [14, 16, 24], the method represented by GraphMAE [14] achieves state-of-the-art results. GraphMAE aims to achieve self-supervised graph learning by masking and restoring graph nodes, which takes full advantage of the information in graph nodes and trains more powerful graph representation encoders. However, since such methods mainly emphasize node information and ignore graph structure information, they usually cannot achieve better results on downstream tasks of graph classification focusing on graph structure information.

In general, most generative learning methods in current research are roughly divided into two categories, one is based on the reconstruction of graph structure, and the other is based on the reconstruction of graph node information. Both methods have their advantages. This paper combines the two methods so that the model can be applied to a wider range of tasks. However, how to effectively combine these two methods so that the model does not conflict when reconstructing graph structure information and graph node information is the challenge of this paper. To this end, this paper proposes a node and edge dual-masked self-supervised graph representation model to ensure that the final graph representation can capture the information of nodes in the graph and obtain deep graph structure information. First, a dual masking model is proposed to perform node masking and edge masking on the original graph at the same time to generate two masking graphs ${\mathcal {G}}^{1}_{mask}$ and ${\mathcal {G}}^{2}_{mask}$. Second, a graph encoder $G^{e}_{\theta }$ is designed to encode the two generated masking graphs to obtain graph representations $H_{1}$ and $H_{2}$. Then, two reconstruction decoders $G^{d1}_{\theta }$ and $G^{d2}_{\theta }$ are designed to reconstruct the nodes and edges according to the masking graphs. At last, the reconstructed nodes and edges are compared with the original nodes and edges to calculate the loss values without using the labeled information.

The method has been verified on a large number of node classification and graph classification tasks and achieves good performance. On the graph classification task in particular, it improves on the previous state of the art by 0.5–2%. In summary, the work has the following highlights:

We are the first to propose mask learning that simultaneously masks and reconstructs the nodes and edges of a graph, and we achieve state-of-the-art results in the field of self-supervised graph representation;
Compared to CL-based methods, our method does not rely on high-quality graph augmentation and thus can be used on real-world datasets;
Compared with the state-of-the-art methods, our method considers both the node information and the structure information of the graph.

2 Related work

2.1 Graph neural network

In early research [21, 22, 26], graph embedding was the primary strategy for graph algorithms. To apply machine learning to graphs, this type of method generates node sequences using random walks over the nodes to obtain node embeddings. However, the graph representations obtained by most graph embedding techniques are ineffective in downstream tasks, since the techniques focus only on graph structure information and ignore the rich node information. Graph convolutional neural networks have addressed these difficulties with great success. The first study [5] fully exploits node feature information and graph structure information by fitting convolution kernels with Chebyshev polynomials. GCN [17] then significantly reduces the complexity of this approach using a first-order approximation. Since these first two methods are difficult to use on large graphs, GraphSAGE [8] proposes obtaining node representations by aggregating information from first-order neighboring nodes. To provide more accurate graph representations, GAT [34] builds on GraphSAGE and uses edge weights as learnable parameters. Graph neural networks have so far outperformed earlier graph embedding techniques and have become the most popular methods of graph representation. These supervised learning-based methods [6, 41] usually require high-quality graph labels; however, obtaining graph labels usually requires extensive domain expertise. Therefore, self-supervised graph representation learning is the most effective approach to solving this problem.

2.2 Contrastive self-supervised graph learning

Contrastive learning first achieved great success in computer vision, where many excellent methods [3, 13] were proposed. Inspired by this, many researchers began to study how to apply contrastive learning to self-supervised graph representation learning. DGI [35] first introduced the concept of mutual information into graph representation learning. The method relies on maximizing the mutual information between the augmented graph representation and the currently extracted graph information. Subsequently, GMI [25] and InfoGraph [31] improved the method so that mutual information can be better used in the graph domain. MoCo [11] improves the NCE loss based on cross-entropy loss and proposes the InfoNCE loss. GRACE [50] brings InfoNCE to graph representation learning. GraphCL [45] leverages the ideas of SimCLR [3] to study the performance impact of different data augmentations on self-supervised graph representation learning.

The above methods are usually extremely expensive to train, since they require a large number of negative samples to ensure that the model does not collapse. Inspired by SimSiam's [4] idea of using prediction instead of contrast, together with a momentum encoder to ensure that the model does not collapse, BGRL [33] first proposed contrastive learning without negative samples for graph representations. LaGraph [37] then proposed a method for predicting latent graphs.

Data augmentation in contrastive learning methods is intuitive and understandable in computer vision. Graph augmentation, however, still lacks a full theoretical explanation, so there is no guarantee that a given augmentation is optimal.

2.3 Generative self-supervised graph learning

Generative methods are important in self-supervised graph representation learning. In earlier methods [26], the graph structure was flattened into a sequence by random walk sampling, and node embeddings were then produced using a powerful natural language model. These methods [2] rely only on adjacency matrix information and do not fully utilize the initial feature information of nodes. Later graph neural network methods such as GCN [17] and GAT [34] combine the features of nodes with the relationships between nodes, overcoming the limitation that previous methods can use only a single type of information in the graph. Building on this, GAE [18] applies autoencoders to graph representation learning. It uses a GNN as the encoder to obtain node representations, and its decoder reconstructs the adjacency matrix from the inner product of the node representations. However, real-world data often contain a large number of noisy edges, and this method pays too much attention to the structural information of the graph. Reconstructing these noisy edges therefore makes the model unstable and degrades performance. On this basis, GATE [29] also reconstructs the features of nodes for better graph representation.

Subsequently, Masked AutoEncoders [10] were proposed in the field of computer vision. This method can restore images with 75% occluded pixels, showing strong model performance. Inspired by this idea, GraphMAE [14] applies this method to self-supervised graph representation learning. It improves the original MSE loss function and proposes a scaled cosine loss function more suitable for graph representation learning. The method uses a graph attention network GAT as the encoder and then restores the node’s features in the decoder. Its performance reaches state of the art in the field of self-supervised graph representation learning. However, this method ignores the important structure information of the graph. Therefore, we propose a new method to fully use graph node information and graph structure information based on Masked AutoEncoders.

3 Method

The method is divided into three steps. First, a mask operation is performed on the graph to obtain a node-masked graph and an edge-masked graph. Second, the masked graphs are used as input to the graph encoder to obtain graph representations. Finally, the masked nodes and edges are recovered using the graph representations as input to the decoders. Figure 2 shows a block diagram of the entire model.

3.1 Graph masking

In this section, we introduce how to mask the graph. In computer vision, it is common to mask the pixels of an image with a very large mask rate, usually reaching 75%. A high mask rate prevents the model from restoring a pixel by simply copying the data of surrounding pixels. In the latest work, GraphMAE has achieved state-of-the-art results in self-supervised graph representation learning by masking graph nodes. In graph representation learning, however, the input to the model is a graph, and nodes cannot be directly compared to pixels: the relationship between pixels is determined by their positions in the image, whereas the relationship between nodes is represented by edges. Edges therefore play an important role in the graph and cannot be ignored when masking and reconstructing. Our method generates two mask graphs, one with only nodes masked and the other with only edges masked. Both graphs use the same encoder to obtain their graph representations, but different strategies are used in the decoding stage: we design two different decoders, one for reconstructing nodes and one for reconstructing edges.

The input graph consists of its node feature matrix $X$ and adjacency matrix $A$, denoted as ${\mathcal {G}}= (A, X)$, where $A\in {\mathbb {R}}^{|V|\times |V|}$ is the adjacency matrix of the graph and $X\in {\mathbb {R}}^{|V|\times d}$ is the node feature matrix. The first mask graph masks the node feature matrix $X$: a fraction $p_{1}$ of the nodes is selected and their features are set to 0, giving ${\mathcal {G}}_{mask}^{1} =( A,X^{'})$. The second mask graph masks the adjacency matrix $A$: a fraction $p_{2}$ of the edges in the input graph is removed, giving ${\mathcal {G}}_{mask}^{2} =(A^{'},X)$. The values of $p_{1}$ and $p_{2}$ are two important parameters of the model. In computer vision, a larger mask rate generally yields better model performance. If the masking rate is too small, the reconstruction task is too easy and the trained model performs poorly. However, if the masking rate is too large, too much information is missing, which leads to the model's collapse and makes the reconstruction task impossible to complete. Selecting appropriate values of $p_{1}$ and $p_{2}$ is therefore the key to training the model. The experimental section shows the effects of different $p_{1}$ and $p_{2}$ on model performance.
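The dual masking step can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the graph is undirected and stored as a dense 0/1 adjacency list-of-lists, the masked nodes and removed edges are drawn uniformly at random (the text does not specify the sampling scheme), and the function names `mask_nodes` and `mask_edges` are hypothetical rather than taken from the released code.

```python
import random

def mask_nodes(X, p1, rng):
    """Zero the feature rows of a random fraction p1 of nodes.

    Returns the masked copy X' and the sorted indices of the masked nodes.
    Uniform sampling is an assumption of this sketch.
    """
    n = len(X)
    k = int(round(p1 * n))
    masked = sorted(rng.sample(range(n), k))
    masked_set = set(masked)
    X_masked = [[0.0] * len(row) if i in masked_set else list(row)
                for i, row in enumerate(X)]
    return X_masked, masked

def mask_edges(A, p2, rng):
    """Remove a random fraction p2 of the undirected edges.

    Returns the masked copy A' and the list of removed (i, j) pairs.
    """
    n = len(A)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i][j]]
    k = int(round(p2 * len(edges)))
    removed = rng.sample(edges, k)
    A_masked = [list(row) for row in A]
    for i, j in removed:
        A_masked[i][j] = 0  # drop both directions to keep A' symmetric
        A_masked[j][i] = 0
    return A_masked, removed

# Toy graph with 4 nodes and 4 undirected edges.
rng = random.Random(0)
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

X_masked, masked_nodes = mask_nodes(X, 0.5, rng)   # G^1_mask = (A, X')
A_masked, removed_edges = mask_edges(A, 0.5, rng)  # G^2_mask = (A', X)
```

Both functions return copies, so the original graph remains available as the reconstruction target.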

3.2 The graph encoder

Nowadays, Graph Neural Networks (GNNs), such as GCN (Graph Convolutional Network) [17], GIN (Graph Isomorphism Network) [38] and GAT (Graph Attention Network) [34], are widely used as graph encoders and perform similarly on graph representation. This paper selects a classic GCN model as the graph encoder.

In the training process, ${\mathcal {G}}_{mask}^{1}$ and ${\mathcal {G}}_{mask}^{2}$ are used as the encoder input. After the model training, the original graph ${\mathcal {G}}=\left( A, X \right) $ is used as the input of the encoder to obtain the graph representation H. The graph representation H can be used for downstream machine learning tasks, such as graph node classification, graph classification, link prediction, and other tasks. During model training, ${\mathcal {G}}_{mask}^{1}$ and ${\mathcal {G}}_{mask}^{2}$ get the graph feature representations $H_{1}$ and $H_{2}$ after passing through the encoder $G_{\theta }^{e} \left( \cdot \right) $. The formula is defined as follows.

$$H = G_{\theta }^{e}({\mathcal {G}}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} X W_{l}) \tag{1}$$

$$H_{1} = G_{\theta }^{e}({\mathcal {G}}_{mask}^{1}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} X^{'} W_{l}) \tag{2}$$

$$H_{2} = G_{\theta }^{e}({\mathcal {G}}_{mask}^{2}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}}^{'} {\tilde{D}}^{-\frac{1}{2}} X W_{l}) \tag{3}$$

where ${\tilde{A}} =A+I$, ${\tilde{D}}_{ii}= \sum _{j}{\tilde{A}}_{i,j}$, $I$ is the identity matrix, and $W_{l}$ is a learnable weight matrix. $\sigma $ denotes the activation function; in this paper, the PReLU [12] activation function is used.
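A single GCN encoder layer of this form can be sketched as follows. The sketch uses plain Python lists to stay dependency-free, and the PReLU slope is fixed at 0.25 for illustration, whereas in a trained model it is a learnable parameter. The helper names (`matmul`, `normalized_adjacency`, `prelu`, `gcn_layer`) are hypothetical.

```python
import math

def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def prelu(Z, slope=0.25):
    """PReLU with a fixed negative slope (learnable in practice)."""
    return [[z if z > 0 else slope * z for z in row] for row in Z]

def gcn_layer(A, X, W, slope=0.25):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} X W)."""
    return prelu(matmul(matmul(normalized_adjacency(A), X), W), slope)

# Two connected nodes with one-hot features and a toy weight matrix.
A = [[0, 1], [1, 0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, -1.0], [1.0, -1.0]]
H = gcn_layer(A, X, W)
```

Passing the node-masked features or the edge-masked adjacency matrix to the same `gcn_layer` gives the counterparts of $H_{1}$ and $H_{2}$, which is how a shared encoder would be applied to both mask graphs.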

After model training is complete, the original graph ${\mathcal {G}}$ is input to the encoder, as shown in Eq. 1. The resulting graph representation H is used for the node classification task. For graph classification, H is further aggregated into a single-vector graph representation ${\mathcal {R}}$ by a readout function, as shown in the following equation.

$${\mathcal {R}} = \sum _{i=1}^{|V|}H_{i} \tag{4}$$

Here, $|V|$ is the number of graph nodes, $H_{i}$ is the representation of node $i$, H is used for the node classification task, and ${\mathcal {R}}$ is used for the graph classification task.
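As a minimal illustration of the sum readout, the node representations can be added column by column. The function name `sum_readout` is hypothetical, and H is assumed to hold one row per node.

```python
def sum_readout(H):
    """R = sum_i H_i: add the node representation vectors elementwise."""
    return [sum(col) for col in zip(*H)]

# Three nodes with 2-dimensional representations.
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
R = sum_readout(H)
```

Because the sum is invariant to the order of the nodes, the resulting vector does not depend on how the graph's nodes happen to be indexed.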

3.3 Dual decoder for reconstruction of graph nodes and edges

The graph representations $H_{1}$ of the node-masked graph and $H_{2}$ of the edge-masked graph have now been obtained. Feeding these representations to the two decoders yields ${\hat{H}}_{1}$ and ${\hat{H}}_{2}$. Each decoder is a GCN layer, defined as follows.

$${\hat{H}}_{1} = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} H_{1} W_{l}) \tag{5}$$

$${\hat{H}}_{2} = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}}^{'} {\tilde{D}}^{-\frac{1}{2}} H_{2} W_{l}) \tag{6}$$

The node reconstruction loss is computed between the decoded representation ${\hat{H}}_{1}$ and the original node features $X$ at the masked nodes. It is based on the scaled cosine error function [14], as defined in Eq. 7. The decoded representation ${\hat{H}}_{2}$ is then used to reconstruct the adjacency matrix as $A_{mask} = {\hat{H}}_{2}{\hat{H}}^{T}_{2}$, which gives the correlation matrix between nodes by directly computing the inner product between each pair of node representations. Finally, the similarity between the reconstructed adjacency matrix $A_{mask}$ and the original adjacency matrix A is measured with the mean squared error function, as shown in Eq. 8.

$${\mathcal {L}}_{sce} = \frac{1}{\mathrm{sum}([MASK])}\sum _{i\in [MASK]}\left( 1-\frac{{\hat{H}}^{T}_{i} X_{i} }{||{\hat{H}}_{i}||\cdot ||X_{i}||} \right) \tag{7}$$

$${\mathcal {L}}_{mse} = \frac{1}{n^{2}} \sum _{i=1}^{n}\sum _{j=1}^{n}\left( A_{i,j}-(A_{mask})_{i,j}\right) ^{2} \tag{8}$$

Here, [MASK] is the index set of the masked nodes, and n is the number of graph nodes. The final loss of the proposed method is a weighted sum of ${\mathcal {L}}_{sce}$ and ${\mathcal {L}}_{mse}$, as defined in Eq. 9.

$${\mathcal {L}} = \alpha \cdot {\mathcal {L}}_{sce} + (1-\alpha ) \cdot {\mathcal {L}}_{mse} \tag{9}$$

where $0<\alpha <1$ is the weighted hyperparameter that balances the two losses. The whole process of the method is presented in Algorithm 1.

As shown in Algorithm 1, the input of the algorithm is the original graph ${\mathcal {G}}$, and the output is the node representation H and the aggregated graph representation ${\mathcal {R}}$. H is used for downstream node classification tasks, and ${\mathcal {R}}$ is used for downstream graph classification tasks. Our method achieves state-of-the-art results in most downstream tasks. The experimental data are described in detail in the experiments section.
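The scaled cosine error on masked nodes, the mean squared error on the reconstructed adjacency matrix, and their weighted sum can be sketched as below. The sketch assumes that the decoded node representations have the same dimension as the original features and hold one row per node, so that pairwise inner products between rows give a $|V|\times |V|$ matrix. The small `eps` term guarding against zero norms and all function names are assumptions of this illustration, not details from the paper.

```python
import math

def sce_loss(H_hat, X, masked, eps=1e-12):
    """Mean of (1 - cosine similarity) over the masked node indices."""
    total = 0.0
    for i in masked:
        dot = sum(h * x for h, x in zip(H_hat[i], X[i]))
        norm = (math.sqrt(sum(h * h for h in H_hat[i]))
                * math.sqrt(sum(x * x for x in X[i])))
        total += 1.0 - dot / (norm + eps)  # eps avoids division by zero
    return total / len(masked)

def reconstruct_adjacency(H_hat):
    """Inner-product decoder: entry (i, j) is <H_hat[i], H_hat[j]>."""
    return [[sum(a * b for a, b in zip(hi, hj)) for hj in H_hat]
            for hi in H_hat]

def mse_loss(A, A_rec):
    """Mean squared error over all n^2 entries of the adjacency matrix."""
    n = len(A)
    return sum((A[i][j] - A_rec[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def total_loss(l_sce, l_mse, alpha):
    """Weighted sum alpha * L_sce + (1 - alpha) * L_mse, with 0 < alpha < 1."""
    return alpha * l_sce + (1.0 - alpha) * l_mse

# Toy example with two nodes, both masked.
X = [[1.0, 0.0], [0.0, 1.0]]
A = [[0, 1], [1, 0]]
H_hat_1 = [[0.0, 1.0], [0.0, 2.0]]  # decoded output for the node-masked graph
H_hat_2 = [[1.0, 0.0], [0.0, 1.0]]  # decoded output for the edge-masked graph

l_sce = sce_loss(H_hat_1, X, masked=[0, 1])
l_mse = mse_loss(A, reconstruct_adjacency(H_hat_2))
loss = total_loss(l_sce, l_mse, alpha=0.5)
```

Note that the cosine term ignores the magnitude of the decoded vectors, while the adjacency term is sensitive to it, which is one reason a weighting factor between the two losses is useful.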

3.4 Computational complexity analysis

Assume C, N, D, and M represent an architecture-dependent constant, the number of nodes, the feature dimension of the nodes, and the number of edges, respectively. Assume that the computational costs of backward and forward propagation are equivalent. Our decoder complexity is twice that of GraphMAE because we use two decoders. The computational costs for the encoders and projection are $2C(N+M)$ and $4C(N+M)$, respectively. The cost of model training is $C_{\textrm{Ours}}(N^{2})$ because the loss function used to reconstruct the edges is ${\mathcal {L}}_{mse}$, which is computed over all $N^{2}$ node pairs. In summary, Table 1 provides the computational costs analysis of representative SSGRL methods.
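As a rough worked example of how these terms compare, consider a hypothetical sparse graph with $N = 10^{4}$ nodes and $M = 5\times 10^{4}$ edges. These sizes are chosen purely for illustration and are not taken from the datasets used in this paper.

$$2C(N+M) = 1.2\times 10^{5}\,C, \qquad 4C(N+M) = 2.4\times 10^{5}\,C, \qquad N^{2} = 10^{8}$$

Under these assumptions, the number of node pairs entering the edge reconstruction loss is several hundred times larger than the $N+M$ terms, so the quadratic term would be expected to dominate on large sparse graphs.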

Table 1 The computational costs analysis of representative SSGRL methods


4 Experiments

In this section, we verify the effectiveness of the method on two downstream tasks, node classification and graph classification. An ablation experiment is set up to verify the importance of reconstructing edges in self-supervised graph representation learning. The hyperparameter experiment discusses the influence of different masking rates of nodes and edges on the experimental results.

Table 2 The results of node classification task


4.1 Node classification

Datasets: We select seven widely used graph datasets to verify the performance of our method in node classification. The seven datasets are Cora, CiteSeer, PubMed [42], Photo, Computers [30], DBLP [1], CS [30] and WikiCS. Cora, CiteSeer, PubMed, and DBLP are citation networks, and the node features of these datasets are bag-of-words representations of documents. Photo and Computers are Amazon co-purchase graphs, in which the nodes represent items and the edges indicate whether two items are frequently purchased together. CS is a co-author network whose nodes represent authors, whose node features are keywords from the authors' papers, and whose edges indicate whether two authors have co-authored a paper. The WikiCS dataset is built from Wikipedia pages related to computer science. The nodes in this dataset represent computer science articles, and the edges indicate whether there are hyperlinks between these articles. Detailed statistics for these 7 datasets are provided in Table 3.

Table 3 The statistics of the node classification datasets


Table 4 Graph classification results


Settings: For each experiment, we follow the linear evaluation scheme of DGI [35]. Each encoder uses a one-layer GCN [17]. First, it is trained in a self-supervised way to obtain the node representation H. Then, H is used to train and test a simple l2-regularized logistic regression classifier for evaluation. For model tuning, we tune the mask rate within 0.1 to 0.9 and the learning rate within 1e−1 to 1e−5. We adopt the public splits for Cora, CiteSeer, PubMed, and WikiCS, and a 1:1:8 training/validation/testing split for the other 4 datasets [47]. We implement the model with PyTorch. All experiments are conducted on an NVIDIA RTX3090 GPU with 24GB VRAM.
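The linear evaluation step can be illustrated with a small l2-regularized logistic regression trained on frozen representations. This is a simplified sketch, not the evaluation code used in the paper: it handles only two classes, is fitted by plain batch gradient descent, and uses hypothetical hyperparameters (`l2`, `lr`, `epochs`) and toy embeddings standing in for the encoder output H.

```python
import math

def train_logreg(H, y, l2=0.01, lr=0.5, epochs=200):
    """Binary l2-regularized logistic regression via batch gradient descent."""
    n, d = len(H), len(H[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the l2 penalty
        grad_b = 0.0
        for h, t in zip(H, y):
            z = sum(wj * hj for wj, hj in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad_w[j] += (p - t) * h[j] / n
            grad_b += (p - t) / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(H, w, b):
    """Predict class 1 when the linear score is positive, else class 0."""
    return [1 if sum(wj * hj for wj, hj in zip(w, h)) + b > 0 else 0
            for h in H]

# Toy frozen embeddings (stand-ins for rows of H) with binary labels.
H_train = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
y_train = [1, 1, 0, 0]
H_test = [[1.0, 0.0], [-1.0, 0.0]]

w, b = train_logreg(H_train, y_train)
train_pred = predict(H_train, w, b)
test_pred = predict(H_test, w, b)
```

The encoder is never updated in this step, so the classifier's accuracy reflects only how linearly separable the learned representations are.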

Competitors: Three categories of graph representation methods are compared. They are: (1) Generative Methods, GAE [18], GPT-GNN [15] and GATE [29]; (2) Contrastive Learning Methods, DGI [35], GMI [25], MVGRL [9], GRACE [50], CCA-SSG [47], BGRL [33], LaGraph [37] and AFGRL [20]; (3) Mask-based Method, GraphMAE [14]. There are also two supervised baselines for reference, GCN [17] and GAT [34].

The results of the experiment are shown in Table 2. In the table, bold indicates the highest record, and underline indicates the second-best record. Our method is 1.3% below the state-of-the-art methods on the Cora dataset and 0.2% below on the CiteSeer dataset. We achieve the best results on four datasets and the second-best on one dataset. Specifically, compared with the state-of-the-art method LaGraph, our method achieves 1.8% and 0.3% improvements on the Computers and CS datasets, respectively. Compared with the state-of-the-art method GraphMAE, it achieves a 0.4% improvement on the PubMed dataset. This shows that adding edge reconstruction degrades the model's performance on some specific datasets, but the model still achieves better performance on most datasets.

4.2 Graph classification

Table 5 The statistics of the graph classification datasets


Datasets: We select seven widely used graph datasets to verify the performance of our method in graph classification. The seven datasets are IMDB-BINARY [39], IMDB-MULTI, PROTEINS, COLLAB [39], MUTAG [19], REDDIT-BINARY and NCI1 [36]. Among them, IMDB-BINARY and IMDB-MULTI are movie collaboration datasets that consist of the ego-networks of 1,000 actors who played roles in movies in IMDB. In each graph, nodes represent actors, and edges indicate whether two actors have collaborated on a movie. These graphs are derived from the Action and Romance genres. PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids, and an edge connects two nodes if they are less than 6 Angstroms apart. COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher's ego network, i.e., the researcher and their collaborators are nodes and an edge indicates collaboration between two researchers. MUTAG is a collection of nitroaromatic compounds, where nodes represent atoms and edges represent bonds between atoms. REDDIT-BINARY is a dataset of posts published on Reddit. Its nodes represent posts, and edges indicate whether the same user has commented on both posts. The NCI1 dataset comes from the cheminformatics domain, where each input graph represents a chemical compound: each vertex stands for an atom of the molecule, and edges between vertices represent bonds between atoms. Detailed statistics for these 7 datasets are provided in Table 5.

Table 6 Experimental results on graph node classification tasks without masking edges or masking nodes


Table 7 Experimental results on graph classification tasks without masking edges or masking nodes


Settings: We follow GraphCL's [45] linear evaluation protocol. In particular, we tune the number of encoder layers from 1 to 5 in increments of 1. We first train the encoder in a self-supervised form. Then we freeze the parameters of the encoder to get the corresponding representation H. Specifically, we sum the readout outputs R of each view as a new representation. Finally, we train a linear classification model with the fixed representation R, and report the mean tenfold cross-validation accuracy with standard deviation after 5 runs.
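The cross-validation bookkeeping in this protocol can be sketched as follows. The fold assignment (shuffle, then deal indices round-robin into k folds) and the use of the sample standard deviation are assumptions of this sketch; the text does not specify either, and the function names are hypothetical.

```python
import random
import statistics

def k_fold_splits(n, k, rng):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # fold sizes differ by at most one
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def summarize(accuracies):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Tenfold splits over 25 graphs; each graph is tested exactly once.
splits = list(k_fold_splits(25, 10, random.Random(0)))

# Aggregating hypothetical per-run accuracies (illustrative numbers only).
mean_acc, std_acc = summarize([0.70, 0.72, 0.74])
```

In a full evaluation, a linear classifier would be fitted on the representations R of the training indices of each split and scored on the test indices, with the fold accuracies averaged per run before the runs are summarized.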

Competitors: Three types of methods are compared. They are: (1) Supervised Methods, GIN [38] and DiffPool [43]; (2) Graph Kernel Methods, WL [36] and DGK [39]; (3) Self-supervised Methods, graph2vec [22], InfoGraph [31], GraphCL [45], JOAO [44], GCC [27], MVGRL [9], AD-GCL [32], LaGraph [37] and GraphMAE [14].

Result: We compare the performance of state-of-the-art self-supervised models in Table 4. The highest record on each dataset is highlighted in bold, and the second-best is underlined. The results in Table 4 show that our method achieves state-of-the-art results on six datasets. The table also shows that generative methods outperform contrastive learning methods based on negative samples or predictions. Compared with the latest method, LaGraph, our method is better across the board, with a performance gain of 0.2–4.9%; in particular, it achieves a 4.88% improvement on the COLLAB dataset. This shows the great promise of generative methods. Our method also outperforms GraphMAE, the state of the art in 2022, by 0.2–2.5% on six datasets, which illustrates the advantage of considering edges in graph structures. This aligns with our initial expectation: GraphMAE can obtain better node representations when only node reconstruction is performed, but this is no longer an advantage for graph classification tasks, which require information about the overall graph.

4.3 Ablation experiment

The following ablation experiments are set up to verify the method's effectiveness. The base model masks neither nodes nor edges. The base(+n) model masks nodes but does not mask edges. The base(+e) model is the opposite of base(+n): it masks edges but does not mask nodes. The all model is our final method. The experimental results are shown in Tables 6 and 7. Table 6 reports the ablation experiment on the node classification task, and Table 7 reports the ablation experiment on the graph classification task.

In the ablation experiment on the node classification task, the model that only reconstructs edges improves on the base model, which demonstrates the effectiveness of edge reconstruction for improving model performance. The model (all) that reconstructs both edges and nodes achieves better results than the model base(+e) that reconstructs only edges. This means that adding edge reconstruction does not reduce performance on the node classification task and can even lead to some improvement.

The ablation results on the graph classification task show that edge reconstruction can significantly improve model performance. On the REDDIT-B dataset, the model that reconstructs only edges achieves better results, whereas on the COLLAB dataset, the model that reconstructs only nodes achieves better results. Regardless, both the node-reconstruction and edge-reconstruction models achieve better results than the base model. This differs from the results on the node classification task, and it shows that both node reconstruction and edge reconstruction lead to a better graph representation.

It can be seen from the results of ablation experiments that a better graph representation can be obtained by reconstructing edges and nodes. However, there are sometimes degraded results for specific datasets. In general, combining the two is more applicable to a wide range of datasets. Our method effectively combines the different advantages of reconstructed nodes and reconstructed edges.

Table 8 Optimal parameters for each model


4.4 Hyperparameter experiment

In this method, the mask rate of nodes and the mask rate of edges are two important parameters. In order to study the impact of different masking rates on model performance, we have done a lot of experiments to find the most suitable masking rate. We selected one dataset each from graph node classification task and graph classification task to conduct experiments. The results are shown in Figs. 3 and 4.

The datasets selected in the hyperparameter experiments are Amazon Computers and IMDB-MULTI. In Fig. 3, the abscissa shows five node masking rates from 0.1 to 0.9, and the ordinate shows the accuracy on the dataset. As can be seen from the left panel, the Amazon Computers dataset reaches its highest accuracy at a node mask rate of 0.7 and its lowest accuracy at the minimum rate of 0.1 and the maximum rate of 0.9. In the right panel, the IMDB-MULTI dataset is also most accurate at a node mask rate of 0.7, and the accuracy drops significantly when the mask rate reaches 0.9. These results suggest that a low mask rate makes the reconstruction task too simple for the model to learn the deep information in the graph, whereas a high mask rate discards too much graph information, making the node reconstruction task infeasible. Fig. 4 shows that the model likewise performs best when the edge mask rate is 0.5–0.7: as with the node mask rate, rates that are too low or too high degrade the model's performance.

The final selected hyperparameters for all experiments in this paper are listed in Table 8.

4.5 Discussion

In this method, the dimension of the final graph representation is also very important for downstream tasks. A lower feature dimension can greatly reduce the training cost of the model, but it also loses more information. The machine's memory limit in turn caps the maximum feature dimension. We therefore conduct experiments with multiple feature dimensions to observe their impact on model performance.

We selected four datasets from the node classification task to observe the impact of the final representation dimension on downstream tasks: Citeseer, PubMed, CS, and Computers. Experiments were performed on these datasets with feature dimensions of 128, 256, 512, 1024 and 2048. The results are shown in Fig. 5. The figure shows that all four datasets achieve better results with higher-dimensional feature representations. The accuracy on the CS dataset is already around 93%, so the improvement is not obvious in the figure; even so, a feature dimension of 2048 improves accuracy by 0.4% over a feature dimension of 128. The Citeseer dataset shows the most pronounced change: compared with a feature dimension of 128, a feature dimension of 2048 improves accuracy by 10.8%. These experiments suggest that our method achieves better downstream task performance with higher-dimensional node representations.

5 Conclusion

In this paper, we propose a node and edge dual-masked self-supervised graph representation method. For the first time, we propose to use two different decoders to reconstruct nodes and edges, respectively. The method is validated on a total of 14 real-world datasets for graph node classification tasks and graph classification tasks, and the experimental results demonstrate its effectiveness. In the future, we plan to explore better ways of making edge reconstruction and node reconstruction act on the model, and to explore generative models that are more suitable for self-supervised graph representation learning than existing methods.

References

Bojchevski A, Günnemann S (2017) Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815
Chen L, Cui J, Tang X, Qian Y, Li Y, Zhang Y (2022) Rlpath: a knowledge graph link prediction method using reinforcement learning based attentive relation path searching and representation learning. Appl Intell 52(4):4715–4726
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: ICML, PMLR, pp 1597–1607
Chen X, He K (2021) Exploring simple Siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15750–15758
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems, vol 29
Fan H, Zhong Y, Zeng G, Sun L (2021) Attributed network representation learning via improved graph attention with robust negative sampling. Appl Intell 51(1):416–426
Fang Y, Zhang Q, Yang H, Zhuang X, Deng S, Zhang W, Qin M, Chen Z, Fan X, Chen H (2022) Molecular contrastive learning with chemical element knowledge graph. In: AAAI
Hamilton W, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: NeurIPS
Hassani K, Khasahmadi AH (2020) Contrastive multi-view representation learning on graphs. In: ICML, PMLR, pp 4116–4126
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: CVPR
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: CVPR, pp 9729–9738
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: ICCV
Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y (2018) Learning deep representations by mutual information estimation and maximization. In: ICLR
Hou Z, Liu X, Dong Y, Wang C, Tang J, et al (2022) Graphmae: Self-supervised masked graph autoencoders. In: KDD
Hu Z, Dong Y, Wang K, Chang KW, Sun Y (2020) Gpt-gnn: Generative pre-training of graph neural networks. In: SIGKDD, pp 1857–1867
Jin W, Derr T, Liu H, Wang Y, Wang S, Liu Z, Tang J (2020) Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. In: ICLR
Kipf TN, Welling M (2016) Variational graph auto-encoders. In: NeurIPS
Kriege N, Mutzel P (2012) Subgraph matching kernels for attributed graphs. arXiv preprint arXiv:1206.6483
Lee N, Lee J, Park C (2022) Augmentation-free self-supervised learning on graphs. In: Proceedings of the AAAI conference on artificial intelligence, vol 36, pp 7372–7380
Lu X, Wang L, Jiang Z, He S, Liu S (2022) Mmkrl: a robust embedding approach for multi-modal knowledge graph representation learning. Appl Intell 52(7):7480–7497
Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005
Pan S, Hu R, Long G, Jiang J, Yao L, Zhang C (2018) Adversarially regularized graph autoencoder for graph embedding. arXiv preprint arXiv:1802.04407
Park J, Lee M, Chang HJ, Lee K, Choi JY (2019) Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6519–6528
Peng Z, Huang W, Luo M, Zheng Q, Rong Y, Xu T, Huang J (2020) Graph representation learning via graphical mutual information maximization. In: WWW
Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: KDD
Qiu J, Chen Q, Dong Y, Zhang J, Yang H, Ding M, Wang K, Tang J (2020) Gcc: Graph contrastive coding for graph neural network pre-training. In: SIGKDD
Salehi A, Davulcu H (2019) Graph attention auto-encoders. arXiv preprint arXiv:1905.10715
Salehi A, Davulcu H (2019) Graph attention auto-encoders. In: ICTAI
Shchur O, Mumme M, Bojchevski A, Günnemann S (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868
Sun FY, Hoffmann J, Verma V, Tang J (2019) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: ICLR
Suresh S, Li P, Hao C, Neville J (2021) Adversarial graph augmentation to improve graph contrastive learning. Adv Neural Inf Process Syst 34:15920–15933
Thakoor S, Tallec C, Azar MG, Azabou M, Dyer EL, Munos R, Veličković P, Valko M (2022) Large-scale representation learning on graphs via bootstrapping. In: ICLR
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks. In: ICLR
Veličković P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2018) Deep graph infomax. In: ICLR
Wale N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst
Xie Y, Xu Z, Ji S (2022) Self-supervised representation learning via latent graph prediction. In: ICML
Xu K, Hu W, Leskovec J, Jegelka S (2018) How powerful are graph neural networks? In: ICLR
Yanardag P, Vishwanathan S (2015) Deep graph kernels. In: SIGKDD, pp 1365–1374
Yang C, Liu J, Shi C (2021) Extract the knowledge of graph neural networks and go beyond it: an effective knowledge distillation framework. In: WWW
Yang F, Zhang H, Tao S, Hao S (2022) Graph representation learning via simple jumping knowledge networks. Appl Intell 1–19
Yang Z, Cohen W, Salakhudinov R (2016) Revisiting semi-supervised learning with graph embeddings. In: ICML
Ying Z, You J, Morris C, Ren X, Hamilton W, Leskovec J (2018) Hierarchical graph representation learning with differentiable pooling. In: NeurIPS
You Y, Chen T, Shen Y, Wang Z (2021) Graph contrastive learning automated. In: ICML, PMLR
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020) Graph contrastive learning with augmentations. In: NeurIPS, vol 33, pp 5812–5823
Zeng Z, Zhao W, Qian P, Zhou Y, Zhao Z, Chen C, Guan C (2022) Robust traffic prediction from spatial-temporal data based on conditional distribution learning. IEEE Trans Cybern 52(12):13458–13471. https://doi.org/10.1109/TCYB.2021.3131285
Zhang H, Wu Q, Yan J, Wipf D, Yu PS (2021) From canonical correlation analysis to self-supervised graph neural networks. In: NeurIPS
Zhao Z, Qian P, Yang X, Zeng Z, Guan C, Tam WL, Li X (2023) Semignn-ppi: Self-ensembling multi-graph neural network for efficient and generalizable protein-protein interaction prediction. arXiv preprint arXiv:2305.08316
Zhu Y, Qian P, Zhao Z, Zeng Z (2022) Deep feature fusion via graph convolutional network for intracranial artery labeling. In: 2022 44th annual international conference of the IEEE engineering in medicine & biology society (EMBC). IEEE, pp 467–470
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2020) Deep graph contrastive representation learning. In: ICML workshop on graph representation learning and beyond

Acknowledgements

The work is supported by the National Natural Science Foundation of China (Grant Nos. 62106216 and 62162064) and the Key Scientific and Technological Project of Yunnan Province (Nos. 202102AB080019-2 and 202002AB080001-5).

Author information

Authors and Affiliations

School of Software, Yunnan University, Kunming, 650504, China
Peng Tang, Cheng Xie & Haoran Duan

Contributions

P.T. put forward the idea of the paper and completed the experimental design and the writing of the main body of the paper. C.X. corrected and guided the paper. H.D. checked and improved the experiments.

Corresponding author

Correspondence to Cheng Xie.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Tang, P., Xie, C. & Duan, H. Node and edge dual-masked self-supervised graph representation. Knowl Inf Syst 66, 2307–2326 (2024). https://doi.org/10.1007/s10115-023-01950-2

Received: 24 December 2022
Revised: 05 June 2023
Accepted: 25 July 2023
Published: 23 December 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10115-023-01950-2

1 Introduction