Semi-supervised node classification is a crucial challenge in relational data mining and has attracted increasing interest in research on graph neural networks (GNNs). However, previous approaches merely utilize labeled nodes to supervise the overall optimization, but fail to sufficiently explore the information of their underlying label distribution. Even worse, they often overlook the robustness of models, which may cause instability of network outputs to random perturbations. To address the aforementioned shortcomings, we develop a novel framework termed Hybrid Curriculum Pseudo-Labeling (HCPL) for efficient semi-supervised node classification. Technically, HCPL iteratively annotates unlabeled nodes by training a GNN model on the labeled samples and any previously pseudo-labeled samples, and repeatedly conducts this process. To improve the model robustness, we introduce a hybrid pseudo-labeling strategy that incorporates both prediction confidence and uncertainty under random perturbations, therefore mitigating the influence of erroneous pseudo-labels. Finally, we leverage the idea of curriculum learning to start from annotating easy samples, and gradually explore hard samples as the iteration grows. Extensive experiments on a number of benchmarks demonstrate that our HCPL beats various state-of-the-art baselines in diverse settings.

1 Introduction

With the increasing popularity of cloud computing technologies and the Internet of Things, as well as the expansion of social media, structured data is rising at an unprecedented rate. A graph is an effective and powerful tool for representing a large number of relational data across various domains including biology, social networks, information science, and so on [25]. Thus, the exploration of graph-structured data is very critical and necessary. In recent years, graph neural networks (GNNs) have been proposed and shown incredible performance in studying graph-structured data. Typically, GNNs are capable of combining vertex attributes and graph topology information to learn vertex representations for a variety of downstream graph-based tasks, including node classification [29, 31, 54, 79], graph classification [27, 35], graph clustering [26, 70], link prediction [7, 73], and traffic forecasting [23, 37]. Here in this article, we study semi-supervised node classification, which aims to forecast the categories of unlabeled nodes using a limited number of labeled nodes.

Indeed, various GNN algorithms for semi-supervised node classification have been developed [21, 29, 52], the bulk of which rely on the construction of diverse neighborhood propagation strategies for learning effective node representations. The most prominent technique is graph convolutional networks (GCNs) [29], which aggregate node representations of their local neighbors iteratively. Following GCNs, a number of graph convolutions have been developed sequentially using a variety of message passing algorithms. For example, Graph Attention Network (GAT) [52] integrates the attention mechanism into the message passing process that enables feature information to be passed adaptively. By eliminating nonlinearities and compressing weight matrices between successive layers, Simple Graph Convolution (SGC) [60] reduces the computational cost of a GCN. Hamilton et al. [21] propose to sample node neighborhoods for higher efficiency. Graph Isomorphism Network (GIN) [66] further improves the expressive capability of GCNs and is capable of capturing different graph structural information. GNNs have also been applied to various applications such as multi-view learning [69] and recommendation [40, 67].

Despite their remarkable performance in semi-supervised node classification, existing approaches suffer from two critical limitations that may impair model performance. On the one hand, these methods typically leverage the unlabeled nodes only when propagating their attributes during message passing, while ignoring the information in the underlying label distribution. This problem may easily lead to underfitting, particularly in the absence of adequate annotated labels, thus limiting the performance of the network [54]. On the other hand, they usually pay less attention to the robustness of the model. Existing GNNs often have predefined attributes and neighborhood propagation patterns, which makes each node extremely reliant on its initial features and neighbors. When networks are attacked by noise in real-world applications [75], they may output unstable predictions, leading to considerable performance deterioration.

Numerous semi-supervised learning algorithms have been comprehensively investigated in recent years for making full use of unlabeled data [9, 30, 41, 50, 51]. Pseudo-labeling [30] is one of the most classic methods in this field. It uses a predictor to iteratively output the categories of unlabeled samples and adds well-classified examples to the training dataset. On this basis, further works usually encourage the classifier to make low-entropy predictions on unlabeled samples [20]. Pseudo-labeling techniques have many downstream applications in various tasks. For example, FixMatch [49] combines pseudo-labeling and consistency regularization to address label scarcity in image classification. Cross Pseudo Supervision (CPS) [9] introduces pseudo-labeling techniques into semantic segmentation problems. However, pseudo-labeling techniques have been studied predominantly in visual domains and have not yet been applied effectively to node classification problems on graphs. Therefore, it is promising to explore unlabeled nodes on the graph with semi-supervised techniques.

Toward this end, this article develops a simple but effective approach called Hybrid Curriculum Pseudo-Labeling (HCPL) for semi-supervised node classification. The core of our idea is to sufficiently explore the unlabeled data through a pseudo-labeling strategy. Specifically, we iteratively annotate unlabeled nodes by training a model on the labeled samples and any previously pseudo-labeled samples, and repeat the process in a self-training way. Moreover, we not only involve a perturbed GNN predictor with the adaptive decoupling of the representation transformation and neighborhood propagation, but also introduce a hybrid pseudo-labeling strategy to increase the robustness under noise attack. We take both prediction confidence and uncertainty into account while dealing with noise, alleviating the impact of potentially erroneous pseudo-labels. Note that curriculum learning [4] is a training strategy that trains a learning model from easier data to harder data. Inspired by this, we utilize the idea of curriculum learning to first annotate easy samples with high confidence and robustness and then annotate hard samples with less confidence and robustness. This strategy trains the model in a meaningful order and helps it avoid error accumulation. Experimental results on several popular benchmark datasets demonstrate that our HCPL outperforms a wide range of state-of-the-art approaches. To sum up, the contributions of this work are as follows:

– We develop a novel approach named HCPL for semi-supervised node classification, which leverages curriculum learning to produce confident pseudo-labels and make full use of the abundant unlabeled data, whereas existing works usually do not explore the semantic information in unlabeled data.

– We study the model robustness and introduce a novel hybrid pseudo-labeling strategy that takes into consideration both prediction confidence and prediction uncertainty to produce accurate pseudo-labels.

– Extensive experiments on six graph datasets show that HCPL achieves remarkable performance compared with a variety of state-of-the-art methods in different settings.

2 Related Work

2.1 GNNs

Recent years have witnessed increasing interest in applying deep learning methods to graph-structured data. GNNs have come into the spotlight due to their superior capability for graph representation learning [32, 34, 47, 55, 64], with wide applications such as fake news detection [24, 68]. Pioneering efforts use spectral-based approaches for localized and effective graph convolution. GCN [29] simplifies GNNs into spatial-based models, which utilize the adjacency matrix and increase the effectiveness of GNNs. These spatial-based approaches are typically based on a neighborhood aggregation mechanism that updates a node's representation by aggregating information from its neighbors. Following that, several GCN variants have been developed, including GAT [52], SGC [60], and GIN [66]. GAT incorporates the attention mechanism to assess the importance of different neighbors to the center node and uses the attention scores as feature aggregation weights. Inspired by the Weisfeiler-Lehman algorithm [46], GIN [66] improves the expressive capability of GCNs by capturing different graph structures. However, the majority of GNN-based methods optimize the model using the cross-entropy on labeled nodes only, which neglects the abundant unlabeled nodes. In contrast, our research builds on the strength of GNNs and focuses on enhancing semi-supervised node classification via curriculum learning in a pseudo-labeling way. GNNs can also be applied to various settings such as PU learning [61], open-world learning [62], unsupervised domain adaptation [63], and cross-modal retrieval [39]. For example, LSDAN [61] incorporates the attention mechanism with GNNs to model node significance from both short and long terms. OpenWGL [62] utilizes a variational graph autoencoder to explore unseen nodes in the test set. GCLN [63] models both attraction and repulsion forces for consistency learning within a single graph and across graphs. DAGNN [39] adopts a multi-hop GNN to investigate the relationship between labels for effective cross-modal retrieval.

2.2 Semi-supervised Learning

Semi-supervised learning (SSL) has recently drawn a lot of interest and achieved considerable success in a variety of fields. SSL is capable of reducing the need for labeled data by using a vast volume of unlabeled data. Because unlabeled data can be quickly obtained with minimal human effort, the performance of models may be improved at a low cost using SSL. Two mainstream methodologies of SSL are self-training and consistency regularization. The pioneering semi-supervised learning works are based on self-training (also called pseudo-labeling) [8, 30, 48, 51], which uses the class predictions of models as pseudo-labels to train on unlabeled data in a supervised way. Specifically, unlabeled samples are iteratively added to the training data by annotating them with a weak model trained with labeled data. Another line of SSL is based on recent breakthroughs in consistency learning [5, 49], which encourages the network to make consistent predictions under noise perturbation on unlabeled samples. Semi-supervised learning techniques have been extensively utilized in a variety of fields such as computer vision and knowledge mining [1, 9, 13, 43, 49]. Inspired by recent advances in visual domains, our proposed HCPL combines curriculum learning and semi-supervised learning, and develops a novel pseudo-labeling strategy for effective semi-supervised node classification on graphs. Our work is also related to teaching-to-learn and learning-to-teach (TLLT) [15], which also adopts an easy-to-hard curriculum strategy for graph-based semi-supervised learning. TLLT performs message propagation based on the teacher module and chooses the following samples from the learner module in an alternating manner. ML-TLLT [18] introduces this framework to solve the problem of multi-label learning, which studies every possible label for the unlabeled samples along with the label dependencies. SMMCL [19] incorporates curriculum learning to learn the procedure of label propagation on graphs. MMCL [17] leverages curriculum learning to acquire the difficulty of accurately classifying each unlabeled sample for semi-supervised image classification. Gong et al. [16] adapt TLLT to saliency detection, which successfully identifies the salient objects in images with high propagation quality.

2.3 Graph-based Semi-supervised Learning

Semi-supervised node classification is one of the most fundamental problems in graph data mining [14, 29, 33, 56, 59], with various applications in social analysis [42], bioinformatics [65], image annotation [76], text generation [58], and noise cleaning [74]. For example, ASFS [72] explores the pairwise relationships in the latent space and then utilizes the graph structure to guide semantics learning in a semi-supervised setting. That work utilizes a classic feature selection framework, while ours adopts GNNs for semi-supervised node classification. In semi-supervised node classification, only a few annotated nodes are available in the graph to predict the labels of the remaining nodes. Traditional methods for this problem are typically based on graph Laplacian regularization [3, 36, 77]. For example, Belkin et al. [3] propose to exploit the geometry of the marginal distribution and give a new form of regularization. Recently, GNNs have emerged as a powerful approach to learning from graphs. However, existing GNN methods usually focus on developing effective message passing patterns [29, 60], but neglect to sufficiently exploit the information of unlabeled nodes. Our article, by contrast, tackles this issue through effective pseudo-labeling for better semi-supervised node classification. We believe our method can be extended to tackle various graph-related semi-supervised tasks [12].

3 Method

To begin with, we introduce the problem definition and present our approach HCPL for semi-supervised node classification on graphs. Previous methods usually neglect the label information contained in unlabeled nodes as well as the robustness of the model. To tackle these issues, our approach explores unlabeled data via pseudo-labeling. Namely, we annotate unlabeled nodes by training a model on labeled samples and any previously pseudo-labeled data, and then repeat this procedure in a self-training fashion. Specifically, we first propose a perturbed GNN predictor and then present a hybrid pseudo-labeling technique, which takes both prediction confidence and uncertainty into account under noise attack. Finally, an optimization pipeline equipped with curriculum learning is used to dynamically and automatically select pseudo-labels. The overall framework is illustrated in Figure 1.

Fig. 1.

3.1 Problem Formulation

A graph is represented as \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), in which \(\mathcal {V}\) denotes a set of N nodes and \(\mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\) denotes a set of edges in the graph. \(\mathbf {x}_i \in \mathbb {R}^{F}\) denotes the attribute feature of node \(v_i\), where F is the dimension of the node attributes. In addition, each node \(v_i\) is associated with a label vector, i.e., \(\mathbf {y}_i \in \lbrace 0, 1\rbrace ^{C}\), where C is the number of label categories. In our case, \(M (M \lt N)\) nodes, which form the labeled set \(\mathcal {V}^L\), are annotated with their labels, while the labels of the other \(N - M\) nodes, which form the unlabeled set \(\mathcal {V}^U\), are unknown. Our aim is to predict the unobserved labels for the unlabeled nodes in \(\mathcal {V}^U\). Take the citation network Cora as an example. Each node \(v_i\) corresponds to a research paper, and two research papers (i.e., \(v_i\) and \(v_j\)) are linked if the paper \(v_i\) is cited by the paper \(v_j\). There are seven classes of topics, i.e., reinforcement learning, neural networks, case-based reasoning, genetic algorithms, probabilistic methods, rule learning, and theory. We need to classify all papers without labels into their associated classes.
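To make the notation concrete, the following is a minimal sketch of how the problem inputs could be held in memory. The class name `NodeClassificationTask` and its fields are hypothetical and introduced only for illustration; they are not part of the original formulation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeClassificationTask:
    """Toy container (hypothetical) for a graph with a partially labeled node set."""
    num_nodes: int                  # N
    edges: List[Tuple[int, int]]    # E, stored as node-index pairs
    features: List[List[float]]     # x_i in R^F, one row per node
    labels: Dict[int, int]          # class index for the M labeled nodes only
    num_classes: int                # C

    def labeled_nodes(self) -> List[int]:
        """Indices of the labeled set V^L."""
        return sorted(self.labels)

    def unlabeled_nodes(self) -> List[int]:
        """Indices of the unlabeled set V^U (all nodes without a known label)."""
        return [i for i in range(self.num_nodes) if i not in self.labels]

    def one_hot(self, node: int) -> List[int]:
        """Label vector y_i in {0, 1}^C for a labeled node."""
        y = [0] * self.num_classes
        y[self.labels[node]] = 1
        return y
```

The goal of the task is then to predict a class for every index returned by `unlabeled_nodes()`.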

3.2 Perturbed GNN Predictor

Here, we present a perturbed GNN as the backbone of our HCPL. GNNs have been frequently utilized to collect node attribute information as well as topological information on graphs using deep neural networks. We begin with the introduction of the popular message passing procedure. Formally, the embedding of node \(v_i\in \mathcal {V}\) at layer k is represented as \(\mathbf {h}_i^{(k)}\). The message passing procedure in GNNs usually involves two steps: (i) an aggregation step, which collects the semantic information from the neighborhood of node \(v_i\) at the previous layer \(k-1\); and (ii) a combination step, which merges the node embedding of \(v_i\) at the previous layer with the obtained neighbor embedding at the current layer. To summarize,

\begin{equation} \begin{gathered}\mathbf {h}_{\mathcal {N}(v_i)}^{(k)}={\it AGG}^{(k)}_{\theta } \left(\left\lbrace \mathbf {h}_{j}^{(k-1)}\right\rbrace _{v_j \in \mathcal {N}(v_i)}\right), \\ \mathbf {h}_{i}^{(k)}= {\it COM}^{(k)}_{\theta }\left(\mathbf {h}_{i}^{(k-1)}, \mathbf {h}_{\mathcal {N}(v_i)}^{(k)} \right), \\ \end{gathered} \end{equation}

(1)

where \(\mathcal {N}(v_i)\) denotes the neighborhood of \(v_i\), and \({\it AGG}^{(k)}_{\theta }\) and \({\it COM}^{(k)}_{\theta }\) represent the aggregation and combination operators at the layer k, respectively. After performing neighborhood aggregation K times, the embedding vectors at all layers are condensed into a single embedding vector:

\begin{equation} \mathbf {h}_{i} = {\it SUM}_{\theta }\left(\left\lbrace \mathbf {h}_{i}^{(k)}\right\rbrace _{k=1}^K\right), \end{equation}

(2)

where \({\it SUM}_\theta\) represents the summarization operator. Widely used mean aggregator, LSTM aggregator, and pooling aggregator [21] can also be utilized to generate informative and structure-aware representations for various downstream tasks.
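The generic scheme in Equations (1) and (2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the aggregator is a mean over neighbors, the combination is an unweighted average of the node's own and its neighbors' embeddings, and the summarization is an element-wise sum over layers. None of these choices is claimed to be the one used by HCPL, and no trainable parameters are included.

```python
def mean_aggregate(h, neighbors, i):
    """AGG: element-wise mean of the neighbors' embeddings from the previous layer."""
    nbrs = neighbors[i]
    dim = len(h[i])
    if not nbrs:
        return [0.0] * dim
    return [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]


def combine(h_self, h_nbr):
    """COM (assumed): average of the node's own and its aggregated neighbor embedding."""
    return [(a + b) / 2.0 for a, b in zip(h_self, h_nbr)]


def message_passing(h0, neighbors, num_layers):
    """Run K rounds of aggregation and combination; return [h^(1), ..., h^(K)]."""
    layers = []
    h = h0
    for _ in range(num_layers):
        h = [combine(h[i], mean_aggregate(h, neighbors, i)) for i in range(len(h))]
        layers.append(h)
    return layers


def summarize(layers):
    """SUM (assumed): element-wise sum of each node's embeddings over all layers."""
    n, dim = len(layers[0]), len(layers[0][0])
    return [[sum(layer[i][d] for layer in layers) for d in range(dim)] for i in range(n)]
```

Here `neighbors[i]` is the list of node indices adjacent to node `i`, and `h0` holds the initial embeddings, one row per node.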

In our implementation, we begin with a Multi-Layer Perceptron (MLP) to process the initial attributes. Formally, we have

\begin{equation} \mathbf {z}_i = MLP (\mathbf {x}_i), \end{equation}

(3)

where the vectors \(\mathbf {z}_i\) are stacked into the feature matrix \(\mathbf {Z} \in \mathbb {R}^{|\mathcal {V}|\times d}\) and d is the embedding dimension. Then, given the adjacency matrix \(\mathbf {A} \in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\), we use decoupled propagation with symmetric normalization to generate node embeddings at every layer:

\begin{equation} \mathbf {H}_{k}=\widetilde{\mathbf {A}}^{k} \mathbf {Z}, k=1,2, \ldots , K, \end{equation}

(4)

where \(\widetilde{\mathbf {A}}=\widetilde{\mathbf {D}}^{-\frac{1}{2}} \widehat{\mathbf {A}} \widetilde{\mathbf {D}}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix, \(\widehat{\mathbf {A}} = \mathbf {A}+ \mathbf {I}\) is the adjacency matrix with self-loops, and \(\widetilde{\mathbf {D}}\) is the diagonal degree matrix of \(\widehat{\mathbf {A}}\). Finally, we use an attention mechanism [2, 33] to aggregate embeddings at all layers. Formally, the summarized node representation \(\mathbf {h}_i\) for node \(v_i\) is written as

\begin{equation} \begin{gathered}w_i^k = \sigma \left(\mathbf {h}^k_i \mathbf {W}\right), \\ \mathbf {h}_{i} = \sum _{k=1}^K \frac{w_i^k}{\sum _{k^{\prime }=1}^Kw_i^{k^{\prime }}} \mathbf {h}_{i}^{k},\\ \end{gathered} \end{equation}

(5)

where \(\mathbf {h}_i^k\) is the i-th row of \(\mathbf {H}_k\), i.e., the representation of the i-th node at layer k, \(\sigma (\cdot)\) denotes an activation function, and \(\mathbf {W}\) is a trainable matrix used to compute the weights.
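As a rough illustration of the decoupled propagation and the layer-wise attention described above, the sketch below uses dense Python lists. It assumes that the activation \(\sigma\) is a sigmoid and that the trainable matrix reduces to a single projection vector `w`; both are assumptions made for the example, and the helper names are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def propagate(a_norm, z, num_hops):
    """Return [H_1, ..., H_K], where H_k is the k-th power of a_norm applied to Z."""
    outputs, h = [], z
    for _ in range(num_hops):
        h = matmul(a_norm, h)
        outputs.append(h)
    return outputs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_combine(hops, w):
    """Score each hop with sigmoid(h_i^k . w), normalize the scores over k, and sum."""
    n, dim = len(hops[0]), len(hops[0][0])
    out = []
    for i in range(n):
        scores = [sigmoid(sum(h[i][d] * w[d] for d in range(dim))) for h in hops]
        total = sum(scores)
        out.append([sum(s / total * h[i][d] for s, h in zip(scores, hops))
                    for d in range(dim)])
    return out
```

Because the propagation step has no trainable weights, the feature transformation (the MLP producing `z`) and the neighborhood propagation stay separate, which is the decoupling the text refers to.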

Nonetheless, neighborhood propagation strategies are typically predefined in GNNs, leaving each node largely reliant on its attributes and neighbors. The neural network could therefore be misled during graph convolution by noise attacks on node attributes and connection patterns. Here, we utilize two popular graph augmentation strategies [71] to make it easier to generate perturbation-invariant node representations.

– Attribute Masking: We randomly choose several vertices and mask a portion of their attributes. The prior behind this strategy is that masking part of a vertex's attributes does not change its semantics much. It plays a role similar to dropout in deep neural networks.

– Edge Deletion: Certain edges are randomly dropped from the graph, sampled i.i.d. from a uniform distribution. This strategy corresponds to the prior that node semantics should be robust to random attacks on edge connectivity patterns.
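The two augmentations listed above can be sketched as follows. The function names and the rate parameters (`node_rate`, `attr_rate`, `drop_rate`) are hypothetical, and masking is implemented here by setting attributes to zero, which is one common choice rather than a detail stated in the text.

```python
import random


def mask_attributes(features, node_rate, attr_rate, rng):
    """Zero out a random subset of attributes for a random subset of nodes."""
    masked = [row[:] for row in features]  # copy so the input stays untouched
    for row in masked:
        if rng.random() < node_rate:       # choose this node with prob. node_rate
            for d in range(len(row)):
                if rng.random() < attr_rate:
                    row[d] = 0.0
    return masked


def drop_edges(edges, drop_rate, rng):
    """Drop each edge independently with probability drop_rate."""
    return [e for e in edges if rng.random() >= drop_rate]
```

Calling both functions with a fresh `random.Random(seed)` on every forward pass yields a different perturbed graph each time, which is what the repeated predictions in the uncertainty estimate rely on.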

The augmented version of \(\mathcal {G}\) is denoted as \(\tilde{\mathcal {G}}\). After our GNN, we feed the representation \(\mathbf {h}_i\) of each node into a two-layer MLP classifier to produce the prediction vector \(\mathbf {p}_i \in \mathbb {R}^{C}\). Formally,

\begin{equation} \mathbf {p}_i = {\it MLP}_{\theta }(\mathbf {h}_i), \end{equation}

(6)

where \(\theta\) denotes the parameters of the GNN predictor.

3.3 Hybrid Pseudo-Label Selection

In our framework, we first optimize the model using labeled data and then employ the trained model to output the label distribution for unlabeled data. We then seek to collect accurate pseudo-labels and add the corresponding unlabeled data to the training set. Previous methods [30] usually rely on confidence scores alone for pseudo-labeling, which could select biased and overconfident samples and thus result in error accumulation. To address this, we propose a hybrid pseudo-label selection strategy based on both the confidence and uncertainty of the prediction, which is elaborated as follows.

Confidence-based Selection. Intuitively, based on the label distribution, we seek to select samples with high confidence and assign them hard labels. Formally, let \(p_i^c\) denote the probability of the i-th example belonging to the class c, and the pseudo-label can be obtained as follows:

\begin{equation} \tilde{y}_i^c=\mathbf {1}\left[p_i^c \ge \gamma \right], \end{equation}

(7)

where \(\gamma\) is a fixed threshold. Note that the use of hard labels is associated with entropy minimization [20], where the model prediction is enforced to be of low entropy on unlabeled samples.

Uncertainty-based Selection. However, the accuracy of pseudo-labels selected by confidence alone is still far from satisfactory, since this criterion does not take the robustness of the network into consideration. From a different perspective, a prediction is unreliable if the predicted distribution is unstable under random attack [43]. In this spirit, we run the perturbed GNN predictor W times and calculate the uncertainty of pseudo-labels using the standard deviation. Let \(\underline{p_i^c}\) denote the list of W predictions, and we have

\begin{equation} \tilde{y}_i^c=\mathbf {1}\left[sd\left(\underline{p_i^c}\right) \le \eta \right], \end{equation}

(8)

where \(\eta\) is another threshold and \({\it sd}(\cdot)\) denotes the standard deviation of the W predictions. In this way, we consider the robustness of our model by selecting pseudo-labels that are invariant to random attacks, and thus our model is more likely to produce correct pseudo-labels.

Finally, we combine the advantages of the two strategies by taking the intersection of selected pseudo-labels. Note that, since we forward the GNN predictor for W times, we use the mean of the prediction to replace the single output in Equation (7). In formulation,

\begin{equation} \tilde{y}_i^c=\mathbf {1}\left[sd\left(\underline{p_i^c}\right) \le \eta \right] \mathbf {1}\left[mean\left(\underline{p_i^c}\right) \ge \gamma \right], \end{equation}

(9)

where \(mean(\underline{p_i^c})\) denotes the mean of the W predictions. In this way, only pseudo-labels with both high confidence and high robustness are selected, which improves the accuracy of pseudo-labels. In summary, the motivation of our strategy is to evaluate the difficulty of classifying each sample accurately, which guides the subsequent curriculum learning. On the one hand, we select samples with high maximal probabilities, for which the model is confident about its prediction. On the other hand, we re-run the perturbed GNN predictor and evaluate the standard deviation of the predictions, which reflects their uncertainty and assesses difficulty from a different view. Taking the intersection of the two criteria identifies samples that are reliable under both rules. Ablation studies further validate the effectiveness of our hybrid strategy, and we believe our method can be utilized in more semi-supervised settings such as cross-modal retrieval.
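A minimal sketch of the hybrid rule in Equation (9) is given below. It makes two simplifying assumptions that are not stated in the text: only the class with the highest mean probability is tested for each node, and the population standard deviation is used for \({\it sd}(\cdot)\). The function name `hybrid_select` is hypothetical.

```python
import statistics


def hybrid_select(runs, gamma, eta):
    """Select pseudo-labels that are both confident and stable.

    runs[w][i][c] is the predicted probability of class c for node i in the
    w-th perturbed forward pass. Returns {node index: pseudo-label}.
    """
    num_nodes, num_classes = len(runs[0]), len(runs[0][0])
    selected = {}
    for i in range(num_nodes):
        means = [statistics.fmean([run[i][c] for run in runs]) for c in range(num_classes)]
        c_best = max(range(num_classes), key=means.__getitem__)
        spread = statistics.pstdev([run[i][c_best] for run in runs])
        # Intersection of the confidence test and the uncertainty test.
        if means[c_best] >= gamma and spread <= eta:
            selected[i] = c_best
    return selected
```

A node whose prediction flips between perturbed passes is rejected even when its average confidence is high, which is the case that a confidence-only rule would miss.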

3.4 Optimization Pipeline with Curriculum Learning

In our framework, we first train the GNN predictor using labeled data. Specifically, we employ the standard cross-entropy loss to train labeled nodes on the augmented graphs. Formally,

\begin{equation} \ell _{s} = -\frac{1}{|\mathcal {V}^{L}|} \sum _{v_i\in \mathcal {V}^{L}}\mathbf {y}_{i}^{T}\log \mathbf {p}_{i}. \end{equation}

(10)
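For one-hot label vectors, the inner product in Equation (10) reduces to the log-probability of the true class. The following sketch computes the loss under that reading; the small constant `eps` is an addition for numerical safety and is not part of the equation.

```python
import math


def supervised_loss(probs, labels):
    """Mean cross-entropy over labeled nodes.

    probs[i] is the predicted distribution p_i of node i, and labels maps each
    labeled node index to its class index.
    """
    eps = 1e-12  # guards against log(0); an assumption, not part of Equation (10)
    total = 0.0
    for node, cls in labels.items():
        total -= math.log(probs[node][cls] + eps)
    return total / len(labels)
```

Once pseudo-labels are added to the labeled subset, the same function applies with `labels` extended by the selected nodes.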

Then, following the principle of self-training, we output the prediction for unlabeled data for W times, and then select reliable pseudo-labels based on Equation (9). These unlabeled nodes and their pseudo-labels are added to the labeled subset.

Note that a fixed threshold in selection is not optimal. For example, a large threshold \(\gamma\) may lead to too few pseudo-labels, while a small value may bring in too many inaccurate pseudo-labels. As a result, we introduce a novel pipeline that adopts curriculum learning, resulting in dynamic selection thresholds over multiple iterations. To be specific, we gradually select more unlabeled samples from easy to difficult by increasing \(\eta\) and decreasing \(\gamma\). In our implementation, we use the percentile of scores to decide the thresholds. Assume the total number of iterations is T and \(argmax_{c \in C} \lbrace mean(\underline{p_i^c})\rbrace = c_i^{\prime }\). For the t-th iteration, the thresholds are adjusted with

\begin{equation} \begin{gathered}\gamma _t = Percentile\left(\left\lbrace mean\left(\underline{p_i^{c_i^{\prime }}}\right)\right\rbrace _{v_i \in \mathcal {V}^U}, 100-\frac{100\, t}{T}\right), \\ \eta _t = Percentile\left(\left\lbrace sd\left(\underline{p_i^{c_i^{\prime }}}\right)\right\rbrace _{v_i \in \mathcal {V}^U}, \frac{100\, t}{T}\right),\\ \end{gathered} \end{equation}

(11)

where \(Percentile(S,m)\) denotes the value of the m-th percentile of set S. Then, we select pseudo-labels with dynamic thresholds as follows:

\begin{equation} \tilde{y}_i^c=\mathbf {1}\left[sd\left(\underline{p_i^c}\right) \le \eta _t \right] \mathbf {1}\left[mean\left(\underline{p_i^c}\right) \ge \gamma _t\right]. \end{equation}

(12)
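The percentile schedule of Equation (11) can be sketched as below. The text does not say how percentiles are interpolated, so linear interpolation between sorted values is assumed here; `percentile` and `curriculum_thresholds` are hypothetical helper names.

```python
def percentile(values, q):
    """q-th percentile (0 <= q <= 100), linearly interpolated between sorted values."""
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def curriculum_thresholds(mean_scores, sd_scores, t, total_iters):
    """Return (gamma_t, eta_t) for iteration t (1-based) out of total_iters.

    mean_scores and sd_scores hold, for every unlabeled node, the mean and the
    standard deviation of the predictions for its most likely class.
    """
    gamma_t = percentile(mean_scores, 100.0 - 100.0 * t / total_iters)
    eta_t = percentile(sd_scores, 100.0 * t / total_iters)
    return gamma_t, eta_t
```

At `t == total_iters` the confidence threshold falls to the smallest mean score and the uncertainty threshold rises to the largest standard deviation, so every unlabeled node passes both tests.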

Note that at the T-th iteration, all the unlabeled nodes will be exhausted, i.e., annotated in self-training. An example is illustrated in Figure 2. Through curriculum learning [8], we start from easy samples with high confidence and low uncertainty, and gradually explore hard unlabeled samples. We also adopt two strategies. First, after each iteration, we restore the labeled set and re-annotate every node in the unlabeled set. This enables pseudo-annotated nodes to enter or leave the updated set. Second, we train the GNN predictor from scratch, i.e., we reinitialize the parameters of the GNN predictor after each iteration instead of applying the popular fine-tuning. These two strategies prevent concept drift or confirmation bias introduced at the early stage of self-training from accumulating, improving the performance of our proposed HCPL. The whole pipeline of the optimization process is illustrated in Algorithm 1.
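The control flow described in this section can be outlined as a short skeleton. It is a sketch of the loop structure only, not a reproduction of Algorithm 1: the three callbacks `train_fn`, `predict_fn`, and `select_fn` are hypothetical placeholders for training from scratch, running the perturbed predictor, and applying the dynamic thresholds, and the final retraining step after the last iteration is an assumption.

```python
def self_training(labeled, unlabeled, total_iters, train_fn, predict_fn, select_fn):
    """Curriculum self-training skeleton.

    labeled:    dict mapping node -> true label.
    unlabeled:  list of unlabeled nodes.
    train_fn(train_labels) -> a freshly initialized model trained on train_labels.
    predict_fn(model, nodes) -> scores for the nodes from repeated perturbed passes.
    select_fn(scores, t, total_iters) -> dict mapping node -> pseudo-label.
    """
    pseudo = {}
    for t in range(1, total_iters + 1):
        # Train from scratch on the true labels plus the current pseudo-labels;
        # true labels take precedence if a node appears in both.
        model = train_fn({**pseudo, **labeled})
        # Re-annotate every unlabeled node. The previous pseudo-labels are
        # discarded, so nodes can enter or leave the pseudo-labeled set.
        scores = predict_fn(model, unlabeled)
        pseudo = select_fn(scores, t, total_iters)
    # Assumed final step: retrain once more on all labels and pseudo-labels.
    return train_fn({**pseudo, **labeled}), pseudo
```

Rebuilding `pseudo` from an empty selection in every iteration, rather than only appending to it, is what allows early mistakes to be revised later.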

Fig. 2.

4 Experiments

In this part, we conduct extensive experiments on six real-world datasets to show the effectiveness of our HCPL, and highlight the following results:

– HCPL significantly outperforms all competing baselines in all experimental settings.

– Ablation studies demonstrate the effectiveness of the different components of HCPL.

– The performance of our method is stable with respect to the main hyper-parameters within proper ranges.

– Our HCPL is more robust to random attack than the baselines.

4.1 Experimental Setup

Datasets. Our HCPL is evaluated on six widely used benchmark node classification datasets, including three paper citation datasets [6, 44], i.e., Cora, CiteSeer, and PubMed, two purchasing graph datasets [45], i.e., Amazon Computers and Amazon Photo, and one co-author network dataset [45], i.e., Coauthor CS. In the three paper citation datasets, nodes denote publications and edges denote citation links. The purpose is to classify these nodes into different areas. Both purchasing graph datasets are collected from Amazon, where nodes represent goods and an edge is constructed when two goods are frequently bought together. Coauthor CS is a co-author network dataset where nodes denote authors and edges indicate co-author relationships. The statistics of these datasets are summarized in Table 1.

Table 1.

Dataset	#Nodes	#Edges	#Features	#Classes	Edge density	Type
Cora	2,708	5,278	1,433	7	0.0004	Citation
CiteSeer	3,327	4,552	3,703	6	0.0004	Citation
PubMed	19,717	44,324	500	3	0.0001	Citation
Amazon Computers	13,752	245,861	767	10	0.0007	Co-purchase
Amazon Photo	7,650	119,081	745	8	0.0011	Co-purchase
Coauthor CS	18,333	81,894	6,805	15	0.0001	Coauthor

Table 1. Statistics of Six Datasets

We utilize the same splits in the previous work [54] to construct train/validation/test datasets for three citation datasets, while for the other three datasets, we randomly choose 30 nodes from each category as labeled training data, 30 nodes as validation data, and other nodes as the test data. For a fair comparison, we adopt the same dataset splits on all datasets for all baseline methods.
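The random split used for the three non-citation datasets can be sketched as follows. The text is ambiguous about whether the 30 validation nodes are drawn per class or in total; this sketch assumes per class, and the function name and `seed` parameter are hypothetical.

```python
import random


def per_class_split(labels, train_per_class, val_per_class, seed):
    """Randomly split node indices into train/validation/test sets by class.

    labels: list of class ids, one per node.
    Returns three sorted index lists (train, val, test).
    """
    rng = random.Random(seed)
    by_class = {}
    for node, cls in enumerate(labels):
        by_class.setdefault(cls, []).append(node)
    train, val, test = [], [], []
    for cls in sorted(by_class):
        nodes = by_class[cls][:]
        rng.shuffle(nodes)
        train += nodes[:train_per_class]
        val += nodes[train_per_class:train_per_class + val_per_class]
        test += nodes[train_per_class + val_per_class:]  # all remaining nodes
    return sorted(train), sorted(val), sorted(test)
```

Fixing `seed` and reusing the resulting split for every method is one way to keep the comparison fair across baselines.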

Compared Methods. To evaluate the effectiveness of our developed HCPL, we compare it with the following state-of-the-art baseline models for semi-supervised node classification.

– Chebyshev [10]: It is a formulation of CNNs that leverages the idea of spectral graph theory to devise fast localized convolutional filters suitable for graph data.

– GCN [29]: It is a classic semi-supervised GNN model based on the spectral theory that generates node representations via aggregating information from neighbors.

– GAT [52]: It is a GNN model that improves GCN by incorporating the attention mechanism to assign different weights to the neighbors of a node.

– SGC [60]: It is a fast algorithm that lowers the unnecessary computational cost of GCN by removing the nonlinearity between layers and compressing the weight matrix.

– DGI [53]: It is an unsupervised approach for learning node representations, which focuses on the mutual information between node-level representations and their associated graph-level representations.

– GMI [38]: It presents a new method for measuring the similarity degree between input graphs and hidden node embeddings, generalizing the concept of mutual information computations to the graph domain.

– MVGRL [22]: It introduces a self-supervised approach, which learns node-level and graph-level representations by maximizing mutual information between representations encoded from different topological views of graphs.

– GRACE [78]: It is a novel framework based on contrastive learning for unsupervised graph representation learning via a hybrid scheme for generating graph views on both topology and feature levels.

– CG\(^3\) [54]: It is a novel GCN-based semi-supervised learning algorithm that enriches the label information by leveraging node similarities and structural knowledge from two different perspectives.

– AM-GCN [57]: It fuses multi-view information from topological structures and features using the attention mechanism.

Parameter Settings. We implement all the compared methods using PyTorch 1.8.0 and PyTorch Geometric 1.7.2, which are capable of smoothly training GNNs for a range of applications connected to graph-structured data. Extensive experiments are performed on an NVIDIA GeForce GTX 1080 Ti. For simplicity, we adopt a two-layer GCN [29] as the default GNN backbone and include a model variant, HCPL-A, that utilizes GAT [52] as the backbone. The dimension of the hidden embedding is set to 256 for all datasets and the number of iterations is set to 20. These two hyper-parameters will be discussed in Section 4.4. Adam [28] is employed during optimization due to its effectiveness. We set the learning rate to 0.01 with a decay rate of 0.0005. For all experiments, we report the mean accuracy with standard deviations from five runs. The validation dataset is utilized to tune all hyper-parameters, and the test dataset is used to report the final results. For the parameters of the adopted baselines, we refer to their original papers and utilize their tuning strategies for the best performance.
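The reporting convention of mean accuracy with standard deviation over five runs can be sketched as follows. The text does not state whether the sample or the population standard deviation is reported; the sample version is assumed here, and the accuracies in the usage note are made-up toy values, not results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Format the accuracies (in %) of repeated runs as 'mean +/- std'."""
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (assumed)
    return f"{mean:.1f} +/- {std:.1f}"
```

For example, `summarize_runs([80.0, 81.0, 82.0, 83.0, 84.0])` returns `"82.0 +/- 1.6"`.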

4.2 Experimental Results

Table 2 displays the comparison results on six datasets. From the table, the following observations can be made:

Table 2.

Methods	Cora	CiteSeer	PubMed	Amazon Computers	Amazon Photo	Coauthor CS
Chebyshev [10]	80.7 \(\pm\) 0.2	70.2 \(\pm\) 0.6	77.4 \(\pm\) 0.1	72.5 \(\pm\) 0.0	88.4 \(\pm\) 0.1	90.4 \(\pm\) 0.2
GCN [29]	81.3 \(\pm\) 0.4	71.5 \(\pm\) 0.2	78.8 \(\pm\) 0.6	77.7 \(\pm\) 0.7	88.1 \(\pm\) 0.8	91.6 \(\pm\) 0.7
GAT [52]	82.7 \(\pm\) 0.1	70.7 \(\pm\) 0.4	78.5 \(\pm\) 0.2	79.5 \(\pm\) 0.2	88.0 \(\pm\) 0.6	91.2 \(\pm\) 0.5
SGC [60]	77.7 \(\pm\) 0.0	72.6 \(\pm\) 0.0	76.4 \(\pm\) 0.0	74.8 \(\pm\) 0.1	87.9 \(\pm\) 0.1	90.2 \(\pm\) 0.2
DGI [53]	80.9 \(\pm\) 0.3	71.4 \(\pm\) 0.2	76.3 \(\pm\) 1.1	77.7 \(\pm\) 0.8	85.3 \(\pm\) 0.9	90.6 \(\pm\) 0.5
GMI [38]	81.6 \(\pm\) 0.4	71.9 \(\pm\) 0.5	81.8 \(\pm\) 0.4	78.9 \(\pm\) 0.1	84.9 \(\pm\) 0.0	90.7 \(\pm\) 0.0
MVGRL [22]	81.3 \(\pm\) 0.4	71.9 \(\pm\) 0.1	79.3 \(\pm\) 0.1	79.5 \(\pm\) 0.8	88.1 \(\pm\) 0.2	91.7 \(\pm\) 0.1
AM-GCN [57]	81.0 \(\pm\) 0.3	72.8 \(\pm\) 0.4	OOM	80.9 \(\pm\) 0.7	91.3 \(\pm\) 0.2	OOM
GRACE [78]	82.8 \(\pm\) 0.3	71.3 \(\pm\) 0.7	79.0 \(\pm\) 0.2	75.1 \(\pm\) 0.1	83.2 \(\pm\) 0.1	91.2 \(\pm\) 0.2
CG\(^3\) [54]	83.5 \(\pm\) 0.3	73.7 \(\pm\) 0.2	79.2 \(\pm\) 0.6	80.5 \(\pm\) 0.1	90.0 \(\pm\) 0.2	92.4 \(\pm\) 0.1
HCPL (Ours)	84.2 \(\pm\) 0.6	74.4 \(\pm\) 0.7	82.4 \(\pm\) 0.7	82.2 \(\pm\) 0.8	92.3 \(\pm\) 0.5	93.2 \(\pm\) 0.4
HCPL-A (Ours)	84.5 \(\pm\) 0.4	73.6 \(\pm\) 0.5	81.6 \(\pm\) 0.4	83.4 \(\pm\) 1.2	92.6 \(\pm\) 0.7	92.5 \(\pm\) 0.3

Table 2. Results on Six Datasets in Terms of Accuracy (in \(\%\)) Over Five Runs

OOM means out-of-memory.

–

GCN-based algorithms (i.e., GCN, GAT, and SGC) overall perform better than the traditional method (i.e., Chebyshev), which shows that the superior representation-learning ability of GCN helps to enhance the performance for semi-supervised node classification.

–

The methods (i.e., DGI, GMI, MVGRL, GRACE, CG\(^3\), and HCPL) that explore the representations or label distribution of unlabeled data generally perform better than the other methods, showing that exploiting additional unlabeled data through unsupervised or semi-supervised learning is an important complement to supervised learning and enhances model performance.

–

In all of the datasets, our approach produces the best results. In particular, on the large-scale datasets Amazon Computers and Amazon Photo, our HCPL outperforms the best overall baseline CG\(^3\) by 2.1% and 2.6% in relative accuracy (82.2 vs. 80.5 and 92.3 vs. 90.0), respectively, validating the effectiveness of our HCPL. We attribute this improvement to two reasons: (i) Our hybrid pseudo-labeling strategy incorporates both prediction confidence and uncertainty to generate accurate pseudo-labels for unlabeled nodes. (ii) Our curriculum learning pipeline gradually explores unlabeled samples to avoid overconfident annotations.

–

We conduct one-sample paired t-tests and find that the improvements over the best baseline are statistically significant, with p-value < 0.05 on all the datasets. However, the variance of our HCPL is slightly larger than that of the baselines on several datasets. A potential reason is that, in some cases, wrong pseudo-labels make the performance slightly unstable when learning from unlabeled nodes. In practice, we suggest running the model multiple times and selecting the best model based on the validation set. We calculate the classification accuracy on all test nodes.

–

The performance improvement over the best baseline (CG\(^3\)) is limited on Cora. A potential reason is the high homophily ratio of Cora and thus the low risk of biased pseudo-labeling, which makes curriculum learning less important. In practice, we can measure the risk of biased pseudo-labeling by active learning and then design the algorithm accordingly.
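The significance test mentioned in the observations above can be sketched as follows. A paired t-test is equivalent to a one-sample t-test on the per-run differences between two methods. The per-run accuracies below are hypothetical, and pairing runs by shared seeds or splits is an assumption; instead of computing an exact p-value, the sketch compares the statistic with the two-sided critical value of Student's t distribution for 4 degrees of freedom at the 0.05 level (2.776).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(ours, baseline):
    """Return (t-statistic, degrees of freedom) of a paired t-test.

    The test is a one-sample t-test on the per-run differences.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-run accuracies (in %) for five matched runs.
ours = [84.0, 84.9, 83.5, 84.6, 84.0]
baseline = [83.4, 83.8, 83.3, 83.6, 83.4]
t_stat, dof = paired_t_statistic(ours, baseline)

# Two-sided critical value of Student's t at alpha = 0.05 with 4 degrees of freedom.
T_CRIT_DOF4 = 2.776
significant = abs(t_stat) > T_CRIT_DOF4
print(f"t = {t_stat:.2f}, dof = {dof}, significant at 0.05: {significant}")
```

With only five runs the test has few degrees of freedom, so the difference must be consistent across runs to reach significance.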

Moreover, by comparing GAT and HCPL-A, we can validate that GAT also benefits from our hybrid curriculum pseudo-labeling. In addition, our HCPL-A performs better than all the baselines in most cases, which again validates the superiority of our approach. Of note, our HCPL is trained in an iterative way. Hence, we assess the accuracy of our HCPL at each iteration to observe whether the performance improves as the number of iterations grows under curriculum learning. The results on three datasets are shown in Figure 3. We can observe that the performance increases in most cases after each iteration, which validates that our proposed HCPL benefits from pseudo-labeling and curriculum learning.

Fig. 3.

Further, we vary the number of labeled samples to assess the performance of HCPL under different levels of supervision. We choose a proportion of labeled samples for model training in each run following [54]. We first take the Cora dataset as an example, where the label rate varies over 0.5%, 1%, 2%, 3%, 5%, 10%, 20%, and 50%. The results are summarized in Table 3. Again, we can see that our HCPL consistently beats the baselines in all settings, demonstrating the superiority of our HCPL in tackling scarce supervision. We also conduct similar experiments on CiteSeer and PubMed following the setting (i.e., label rate) in [54]. The results are shown in Tables 4 and 5, and similar trends can be observed on both datasets.

Table 3.

Label Rate	0.5%	1%	2%	3%	5%	10%	20%	50%
Chebyshev	37.9	59.4	73.5	76.1	80.7	82.6	82.4	82.9
GCN	47.8	63.9	72.7	76.4	81.3	82.1	85.0	86.5
GAT	57.1	70.9	74.3	78.2	82.7	83.4	85.3	87.2
SGC	48.4	66.5	69.7	73.9	77.7	78.9	81.2	79.9
DGI	68.0	73.4	76.7	78.3	80.9	81.2	81.3	81.6
GMI	67.8	71.6	75.5	77.6	81.6	84.0	84.2	84.7
MVGRL	57.6	67.6	76.2	77.8	81.3	83.8	84.5	84.9
GRACE	63.8	73.5	75.2	76.2	82.8	83.6	84.4	85.9
CG\(^3\)	68.1	74.2	77.3	79.1	83.5	84.3	85.1	86.6
HCPL (Ours)	71.7	75.4	78.2	81.1	84.2	84.9	86.3	88.4

Table 3. Results on Cora Dataset with Different Label Rates in Terms of Classification Accuracies (in \(\%\))

Table 4.

Label Rate	0.5%	1%	2%	3%	5%	10%	20%	50%
Chebyshev	34.0	58.3	64.6	67.2	71.3	71.7	72.2	75.7
GCN	47.6	55.8	65.3	69.2	71.7	72.6	73.4	77.6
GAT	53.2	63.9	68.3	69.5	71.2	72.1	75.1	79.0
SGC	46.8	59.3	67.1	68.6	72.7	73.0	74.5	78.8
DGI	61.0	65.8	67.5	68.8	71.6	72.3	73.1	76.5
GMI	54.4	63.5	66.7	68.5	72.5	74.8	75.0	75.9
MVGRL	61.3	65.1	68.5	70.3	71.2	72.8	73.1	74.8
GRACE	61.8	62.5	70.7	71.4	71.9	73.0	74.2	76.6
CG\(^3\)	62.9	70.1	70.9	71.7	73.9	74.5	74.8	77.2
HCPL (Ours)	64.4	71.4	71.9	73.0	74.6	75.1	75.7	80.4

Table 4. Results on CiteSeer Dataset with Different Label Rates in Terms of Classification Accuracies (in \(\%\))

Table 5.

Label Rate	0.03%	0.05%	0.1%	0.3%	0.5%	3%	10%
Chebyshev	58.9	67.2	71.5	77.4	80.1	82.1	82.9
GCN	61.3	65.6	72.3	78.8	80.8	83.9	86.1
GAT	62.8	66.7	71.1	78.5	80.1	83.6	84.8
SGC	61.0	64.3	68.5	76.4	77.8	78.6	79.2
DGI	61.5	66.2	71.4	76.3	79.9	80.2	80.4
GMI	58.7	65.2	76.3	81.8	82.5	83.2	83.7
MVGRL	60.3	67.3	73.4	79.3	81.9	82.7	83.6
GRACE	64.9	68.6	73.6	79.0	80.4	81.4	82.5
CG\(^3\)	67.0	71.1	74.5	79.2	81.7	82.3	82.9
HCPL (Ours)	70.9	74.0	77.6	82.4	82.8	84.7	87.2

Table 5. Results on PubMed Dataset with Different Label Rates in Terms of Classification Accuracies (in \(\%\))

4.3 Ablation Study

In this part, we perform extensive experiments over the core components of the proposed HCPL. In particular, five model variants are compared with the full model, each of which removes or alters a single component of our framework while keeping the other components unchanged:

–

HCPL w/o aug: We delete the perturbation over the input in the GNN-based predictor.

–

HCPL w/o cur: We remove the curriculum strategy and annotate all the unlabeled nodes for self-training.

–

HCPL w/o unc: We remove the uncertainty-based selection strategy and only use confidence to select pseudo-labels.

–

HCPL - inv cur: We annotate from the hard examples to easy examples.

–

HCPL - random: We annotate samples randomly during iterations.
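To make the difference between these variants concrete, the following minimal sketch ranks unlabeled nodes for annotation. It is an illustration under stated assumptions rather than the exact HCPL procedure: the helper names, the use of the standard deviation across perturbed forward passes as the uncertainty, and the easiness score (confidence minus uncertainty) are all hypothetical choices, and the probability vectors are made up.

```python
import random
from statistics import mean, pstdev


def confidence_and_uncertainty(prob_runs):
    """prob_runs: class-probability vectors for one node, one per perturbed pass."""
    num_classes = len(prob_runs[0])
    avg = [mean([run[c] for run in prob_runs]) for c in range(num_classes)]
    label = max(range(num_classes), key=avg.__getitem__)
    confidence = avg[label]
    # Assumed uncertainty: spread of the predicted-class probability across passes.
    uncertainty = pstdev([run[label] for run in prob_runs])
    return label, confidence, uncertainty


def select_nodes(preds, k, mode="curriculum", use_uncertainty=True, seed=0):
    """preds: {node: (label, confidence, uncertainty)}; returns (node, label) pairs."""

    def easiness(node):
        _, conf, unc = preds[node]
        return conf - unc if use_uncertainty else conf  # False: "HCPL w/o unc"

    nodes = sorted(preds)
    if mode == "curriculum":   # easy -> hard (full model)
        nodes.sort(key=easiness, reverse=True)
    elif mode == "inverse":    # hard -> easy ("HCPL - inv cur")
        nodes.sort(key=easiness)
    elif mode == "random":     # "HCPL - random"
        random.Random(seed).shuffle(nodes)
    elif mode == "all":        # "HCPL w/o cur": annotate every node at once
        k = len(nodes)
    return [(n, preds[n][0]) for n in nodes[:k]]


# Hypothetical predictions from three perturbed forward passes per node.
runs = {
    "a": [[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]],  # confident and stable
    "b": [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]],  # confident on average, unstable
    "c": [[0.30, 0.70], [0.31, 0.69], [0.29, 0.71]],  # less confident but stable
}
preds = {n: confidence_and_uncertainty(r) for n, r in runs.items()}
print(select_nodes(preds, k=2))                         # hybrid ranking
print(select_nodes(preds, k=2, use_uncertainty=False))  # confidence only
```

In this toy example the hybrid score prefers the stable node "c" over the unstable node "b", whereas the confidence-only ranking prefers "b", which is the kind of erroneous pseudo-label the uncertainty term is meant to filter out.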

The results are shown in Table 6. First, we can see a decline in the performance of HCPL w/o aug, demonstrating the necessity of the augmentation strategy, which also improves the robustness of HCPL. Second, HCPL performs better than HCPL w/o cur and HCPL - inv cur, validating that pseudo-labeling may introduce biases that deteriorate the performance, while our curriculum learning strategy is capable of alleviating this issue. Moreover, we analyze the homophily ratio of each dataset, which denotes the fraction of edges that connect nodes from the same category in a graph. A lower homophily ratio increases the difficulty of semi-supervised learning under label scarcity, which makes curriculum learning more important. In particular, the homophily ratios of Cora and CiteSeer are 0.810 and 0.736, respectively. From the ablation studies, it can be observed that when we remove curriculum learning, the performance drops by 1.18% and 1.61% on Cora and CiteSeer, respectively. This supports our analysis that the curriculum learning strategy is more beneficial for challenging tasks. Third, removing the uncertainty-based selection strategy leads to a decline in performance, which shows that it produces more accurate pseudo-labels by taking model robustness into account. Fourth, although these model variants still perform well owing to the remaining components, we always observe a decline compared with the full model, which validates the effectiveness of every component.
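The homophily ratio defined above (the fraction of edges whose endpoints share a category) can be computed directly from an edge list and node labels. The sketch below uses a hypothetical toy graph; the function name is illustrative.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same class label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: 5 undirected edges over 4 nodes with 2 classes.
labels = {0: "A", 1: "A", 2: "B", 3: "B"}
edges = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3)]

# Only (0, 1) and (2, 3) connect nodes of the same class.
ratio = edge_homophily(edges, labels)
print(ratio)
```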

Table 6.

Methods	Cora	CiteSeer	PubMed	Amazon Computers	Amazon Photo	Coauthor CS
HCPL w/o aug	83.5 \(\pm\) 0.7	73.4 \(\pm\) 1.1	82.1 \(\pm\) 0.7	81.6 \(\pm\) 0.4	91.8 \(\pm\) 0.4	92.9 \(\pm\) 0.3
HCPL w/o cur	83.2 \(\pm\) 0.6	73.2 \(\pm\) 0.5	81.3 \(\pm\) 0.6	81.4 \(\pm\) 0.6	91.6 \(\pm\) 0.3	92.6 \(\pm\) 0.5
HCPL - inv cur	82.8 \(\pm\) 0.8	72.5 \(\pm\) 0.9	80.9 \(\pm\) 1.1	80.8 \(\pm\) 1.5	91.3 \(\pm\) 0.8	92.1 \(\pm\) 0.7
HCPL - random	83.5 \(\pm\) 0.7	73.4 \(\pm\) 0.7	81.5 \(\pm\) 0.9	81.7 \(\pm\) 1.0	91.6 \(\pm\) 0.9	92.7 \(\pm\) 0.5
HCPL w/o unc	83.9 \(\pm\) 0.3	73.3 \(\pm\) 0.8	81.5 \(\pm\) 0.8	81.8 \(\pm\) 0.7	91.9 \(\pm\) 0.6	92.8 \(\pm\) 0.7
HCPL (Ours)	84.2 \(\pm\) 0.6	74.4 \(\pm\) 0.7	82.4 \(\pm\) 0.7	82.2 \(\pm\) 0.8	92.3 \(\pm\) 0.5	93.2 \(\pm\) 0.4

Table 6. Comparison with Variants for Ablation Study (in \(\%\))

4.4 Sensitivity Analysis

In this part, we study the sensitivity of HCPL to two hyper-parameters, i.e., the embedding dimension of the hidden layer and the total number of iterations.

We first study the influence of the hidden embedding size by varying the dimension in \(\lbrace 32, 64, 128, 256, 512, 1024\rbrace\) with the other settings fixed. We plot the results on all the datasets in Figure 4 and observe that the performance generally first increases and then stabilizes as the embedding dimension grows. A potential reason is that a larger hidden dimension improves the representation capacity, but the model tends to saturate once the dimension exceeds a certain value.

Fig. 4.

Next, we study the effect of the number of iterations. Specifically, we fix all the other hyper-parameters and vary the iteration number in \(\lbrace 5,10,20,50\rbrace\). The results are plotted in Figure 5. We can observe that in most cases increasing the iteration number leads to a gain in performance. A possible explanation is that a larger iteration number introduces fewer pseudo-labels at each iteration, and these pseudo-labels are usually more reliable for self-training. However, an excessively large number of iterations incurs a higher computational cost. As a result, we set the number of iterations to 20 by default.
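The trade-off can be illustrated with a small sketch of how many pseudo-labels are added per iteration. It assumes, purely for illustration, that the unlabeled nodes are split evenly across iterations; the actual schedule used by HCPL may differ, and the number of unlabeled nodes is hypothetical.

```python
def per_iteration_budget(num_unlabeled, num_iterations):
    """Split num_unlabeled nodes as evenly as possible across iterations."""
    base, extra = divmod(num_unlabeled, num_iterations)
    return [base + (1 if t < extra else 0) for t in range(num_iterations)]


# Hypothetical pool of 2000 unlabeled nodes and the iteration numbers studied above.
for num_iterations in (5, 10, 20, 50):
    budget = per_iteration_budget(2000, num_iterations)
    print(num_iterations, max(budget))
```

More iterations mean a smaller batch of new pseudo-labels per round, at the price of retraining the model more times.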

Fig. 5.

4.5 Robustness Analysis

In this part, we test the robustness of our HCPL by perturbing the graph, i.e., deleting edges or masking node attributes at random. Figure 6 illustrates the performance of three methods (GCN, GAT, and HCPL) when varying the perturbation rate from 10\(\%\) to 90\(\%\) on three datasets, i.e., Cora, PubMed, and Amazon Photo. It can be seen that our HCPL obtains the best results under all random perturbation rates. Moreover, the accuracy of our HCPL decreases less as the perturbation rate grows, demonstrating the robustness of our HCPL.
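The two perturbations can be sketched as follows. This is a minimal illustration under assumptions: edges are dropped uniformly at random without replacement, attribute entries are zeroed independently with probability equal to the perturbation rate, and the toy graph and features are hypothetical. The exact protocol behind Figure 6 may differ.

```python
import random


def drop_edges(edges, rate, rng):
    """Delete round(rate * |E|) edges chosen uniformly at random."""
    num_drop = round(rate * len(edges))
    dropped = set(rng.sample(range(len(edges)), num_drop))
    return [e for i, e in enumerate(edges) if i not in dropped]


def mask_features(features, rate, rng):
    """Zero out each attribute entry independently with probability `rate`."""
    return [[0.0 if rng.random() < rate else x for x in row] for row in features]


rng = random.Random(0)
edges = [(i, i + 1) for i in range(10)]          # hypothetical path graph, 10 edges
features = [[1.0, 2.0, 3.0] for _ in range(11)]  # hypothetical node attributes

for rate in (0.1, 0.5, 0.9):
    kept = drop_edges(edges, rate, rng)
    masked = mask_features(features, rate, rng)
    print(rate, len(kept))
```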

Fig. 6.

4.6 Efficiency Analysis

In this part, we analyze the efficiency of the competing methods by comparing their running time. The results on six datasets are collected in Table 7. From the results, we can find that our method is more efficient than several recent works (i.e., MVGRL, CG\(^3\), and, on the larger datasets, AM-GCN). Besides GNNs, these recent works utilize additional complex techniques, which can incur a large computational cost. MVGRL introduces different data augmentation strategies to generate multiple topological views for mutual information maximization. CG\(^3\) needs to calculate data similarities and involves both local graph convolution and global hierarchical graph convolution. AM-GCN extracts node embeddings from different views and then fuses them using the attention mechanism. These techniques bring more computational cost than our curriculum learning. Although some of the early methods are more efficient, their performance is much worse than ours. Therefore, our HCPL exhibits competitive scalability with a comparable running time.

Table 7.

Methods	Cora	CiteSeer	PubMed	Amazon Computers	Amazon Photo	Coauthor CS
Chebyshev	7.9	8.2	11.1	17.6	10.4	17.5
GCN	6.6	7.2	10.7	19.9	11.2	21.4
GAT	8.7	8.8	6.8	16.1	8.3	14.7
SGC	3.8	3.9	3.9	3.7	3.6	7.5
DGI	16.8	19.1	59.6	56.3	31.7	90.2
GMI	90.5	85.5	520.8	624.8	396.5	812.6
MVGRL	287.1	296.9	489.9	554.0	472.2	578.7
AM-GCN	18.5	7.9	OOM	1055.0	237.6	OOM
GRACE	145.7	69.6	209.8	297.8	215.6	590.6
CG\(^3\)	1156.0	1036.1	1326.8	1702.4	1563.7	3512.4
HCPL (Ours)	52.4	59.9	70.4	125.5	75.8	189.2

Table 7. Running Time of the Compared Methods (in Seconds)

OOM means out-of-memory.

4.7 Visualization

In this subsection, we present the t-SNE visualization [11] of the node embeddings generated by four methods on Cora, CiteSeer, and Amazon Photo. The results can be found in Figure 7. From the results, the embeddings generated by our HCPL are more discriminative on these three datasets, since embeddings belonging to different categories are better separated. This finding can be attributed to our hybrid pseudo-labeling strategy, which provides extra high-quality supervision for the neural network, again validating the superiority of our approach.

Fig. 7.

5 Conclusion

In this article, we investigate the problem of semi-supervised node classification on graphs and propose a simple yet effective model, HCPL. Note that pseudo-labeling techniques are predominantly studied in the visual domain but have not yet been well applied to node classification problems on graphs. Our HCPL annotates unlabeled samples by training a classification model on the labeled nodes as well as the pseudo-labeled nodes, and repeats the procedure with self-training. We propose a hybrid pseudo-label selection strategy for reliable guidance. Moreover, the concept of curriculum learning is introduced to progressively learn from easy pseudo-labels to hard pseudo-labels in terms of confidence and uncertainty. In this way, our HCPL can sufficiently explore the unlabeled data through our pseudo-labeling strategy. Extensive experiments on six well-known benchmarks validate the effectiveness of the proposed HCPL. In future work, we will attempt to introduce our pseudo-labeling strategy into other graph-related tasks such as link prediction and graph classification.

Acknowledgments

The authors are grateful to the anonymous reviewers for critically reading the manuscript and for giving important suggestions to improve the paper.

References

[1]

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. 2021. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In CVPR.
