Article

Long-Tailed Graph Representation Learning via Dual Cost-Sensitive Graph Convolutional Network

1 National Institute of Advanced Industrial Science and Technology Tokyo Waterfront, 2 Chome-3-26 Aomi, Koto City, Tokyo 135-0064, Japan
2 Department of Computer Science, University of Innsbruck, Innrain 52, 6020 Innsbruck, Austria
3 Faculty of Library, Information and Media Science, University of Tsukuba, 1 Chome-1-1 Tennodai, Tsukuba 305-8577, Japan
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(14), 3295; https://doi.org/10.3390/rs14143295
Submission received: 1 June 2022 / Revised: 1 July 2022 / Accepted: 5 July 2022 / Published: 8 July 2022
(This article belongs to the Special Issue Theory and Application of Machine Learning in Remote Sensing)

Abstract:
Deep learning algorithms have seen a massive rise in popularity for remote sensing over the past few years. Recently, studies on applying deep learning techniques to graph data in remote sensing (e.g., public transport networks) have been conducted. In graph node classification tasks, traditional graph neural network (GNN) models assume that different types of misclassifications have an equal loss and thus seek to maximize the posterior probability of the sample nodes under labeled classes. The graph data used in realistic scenarios tend to follow unbalanced long-tailed class distributions, where a few majority classes contain most of the vertices and the minority classes contain only a small number of nodes, making it difficult for a GNN to accurately predict minority class samples owing to its bias toward the majority classes. In this paper, we propose a dual cost-sensitive graph convolutional network (DCSGCN) model. The DCSGCN is a two-tower model containing two subnetworks that compute the posterior probability and the misclassification cost, respectively. The model uses the cost as "complementary information" in a prediction to correct the posterior probability from the perspective of minimal risk. Furthermore, we propose a new method for computing node cost labels based on topological graph information and the node class distribution. The results of extensive experiments demonstrate that DCSGCN outperformed other competitive baselines on different real-world imbalanced long-tailed graphs.

Graphical Abstract

1. Introduction

In classical learning problems, data are usually assumed to be independently and identically distributed without considering the relationships between them; however, when edges are used to explicitly express the relationships between data and a graph representation of the data is constructed, the performance of the learning algorithm is often significantly improved [1]. In the past, graph-based learning methods were mainly devoted to studying how to propagate information from sample nodes over graphs in a linear fashion, with typical algorithms including spectral clustering [2], random walk [3], and label propagation [4]. In recent years, with the rapid development of deep learning, studies on applying deep learning techniques to graph data with a high structural specificity, e.g., studies on graph neural networks (GNNs), have been conducted. Such studies have been aimed at the development of more flexible and effective nonlinear information propagation algorithms for learning high-quality graph representations to better conduct various tasks on graphs, such as node classification, link prediction, and community detection. As a popular research direction in the field of machine learning and data mining, GNNs have shown excellent performance and a wide range of applications in many fields such as logic inference [5], knowledge graphs [6], recommendation systems [7], natural language processing [8], air quality monitoring [9], and remote sensing [10].
In a considerable number of real-world graph node classification tasks, the sample data follow a long-tailed distribution. That is, a few classes have rich samples, whereas most of the classes have only a handful of samples. For example, in the NCI chemical compound graph, only approximately 5% of the molecules were labeled as active in an anti-cancer bioassay test [11]. As another example, in the Cora dataset [12], which is widely used in the graph representation learning (GRL) field, the largest category, "Probabilistic Methods", contains 4.54 times as many samples as the smallest category, "Theory". Furthermore, in the real world, it is sometimes extremely important to correctly classify samples of minority classes that do not occur frequently. For example, a rare positive result in a COVID-19 test should not be incorrectly classified as negative because otherwise, undetected patients may accelerate the spread of the virus. Similarly, criminal activity on a continuously monitored street occurs only occasionally, yet the surveillance system is still expected to detect it. However, most standard learning algorithms attempt to minimize the overall classification error during training and implicitly assign the same misclassification cost to all types of errors. As a result, the imbalanced class distribution of data forces the classifier to overfit the majority class. That is, the features of the minority class samples cannot be adequately learned [13].
In graph data, the problem of overfitting the classifier to the majority class is exacerbated by the presence of the topological interplay effect [11]. This effect is manifested by the widespread presence of different forms of topological connections (i.e., explicit representations of inter-sample relationships) between nodes in the graph data, making the judgment of node classes by the classifier not only dependent on the node features but also influenced by other nodes connected to the target node. Therefore, the learning of a specific class using a GNN is strongly influenced by the neighboring classes. Furthermore, through message passing between nodes, feature propagation is dominated by the majority classes, making it more difficult for the GNN to define the demarcation line between classes and thus learn the more discriminative features required for classification.
To overcome the above problem, in this paper, a new graph neural network model adapted to node classification on imbalanced graph datasets is proposed, i.e., the dual cost-sensitive graph convolutional network (DCSGCN). To the best of our knowledge, our study is among the first to be devoted to an imbalanced graph node classification task. Our idea is inspired by research in the field of cost-sensitive learning. Our starting point is that different types of errors lead to different degrees of misclassification cost, and the learning algorithm needs to focus on samples that may lead to high misclassification costs, thereby decreasing the overall misclassification cost of the classifier. In the case of imbalanced data classification, the classifier tends to become overfit to the majority classes. Therefore, to accurately identify the minority class samples, minority classes should be considered more important during learning. That is, misclassifying a minority class sample should incur a higher cost than misclassifying a majority class sample.
We conducted experiments on a large number of standard datasets within the graph representation learning domain. In addition to the classical metrics Micro-F1 [14] and Macro-F1 [14] used in classification scenarios, we also tested the standard metrics in the field of cost-sensitive learning, i.e., the misclassification cost [15] and the number of high-cost errors [15]. To the best of our knowledge, this study is the first to report the performance of graph node classification models under the cost and high-cost error metrics. Further, we report the performance variation of DCSGCN under various changes in the experimental settings (e.g., different training set sizes and degrees of imbalance) and the possible causes of such changes in performance. We also conducted a detailed ablation study and parameter analysis of the DCSGCN to verify the role of its key components and parameters. Finally, we visualized the learned vector representation of the DCSGCN to illustrate its superior classification performance.
In summary, our contribution is two-fold:
  • Unlike standard GNNs based on maximizing the posterior probability, we present DCSGCN, the first cost-sensitive graph node classification model based on minimizing the classification risk. In addition, we are the first to introduce methods for computing node cost labels, which are based on graph topological information and the node class distribution. Our study is among the first devoted to the task of semi-supervised multi-class imbalanced long-tailed graph node classification.
  • In extensive experiments conducted on a wide range of standard datasets, for the first time, we report the performance of graph neural network models from a cost-sensitive learning perspective. The experimental results demonstrate the simplicity and effectiveness of the proposed approach. Compared with the current state-of-the-art model RECT [16], DCSGCN shows an improvement of 10.6% in terms of Micro-F1 and a reduction of 34.3% in the average cost.
The remainder of this paper is structured as follows. We survey the related work in Section 2. In Section 3, we outline our methods for learning the representations of long-tailed imbalanced graphs and then for generating cost labels based on label distribution and graph topology. Section 4 explains the experimental settings, while Section 5 describes the results of our experiments and answers the research questions of interest. Finally, we conclude the paper in Section 6.

2. Related Work

2.1. Cost-Sensitive Learning

Cost-sensitive learning [17] is an important research topic in the field of machine learning. There are different types of costs in this field, such as misclassification [15], data acquisition [18], computation [19], and human–computer interaction costs [20]. In this study, we focused on the misclassification cost. Traditional classification learning methods pursue the lowest classification error rate, assuming that different types of misclassifications have equal costs. However, in domains such as face recognition-based access control systems [21], the cost owing to different types of misclassification varies significantly (e.g., the cost of treating an intruder as a non-intruder is much greater than treating a non-intruder as an intruder). Therefore, cost-sensitive models seek to minimize the cost of misclassification.
Cost-sensitive learning has been shown to be an effective approach for alleviating the class-imbalance problem in classification [22]. The basic idea is to assign a larger penalty cost to misclassified minority class samples [17]. Existing cost-sensitive classification algorithms can be generally grouped into three categories [23]: (1) pre-processing the training data, (2) post-processing the output, and (3) applying direct cost-sensitive learning methods. Data pre-processing aims to make the classification results on the new training set equivalent to cost-sensitive classification decisions on the original training set, typically along the lines of sampling [17] and weighting [22]. Post-processing the output makes the classifier biased toward minority classes by adjusting the classifier decision threshold, as represented by MetaCost [24] and ETA [25]. Direct cost-sensitive learning methods embed the cost information into the objective function of the learning algorithm to obtain the minimal expected cost, such as cost-sensitive decision trees [26] and cost-sensitive SVMs [27].
In recent years, an increasing number of studies have combined cost-sensitive learning with deep learning. For example, a CSDNN [28] provides deep neural network cost sensitivity through pre-training. A CoSen CNN [29] applies a proposed optimization framework to collaboratively learn the parameters of the neural network itself as well as the cost-related parameters. According to these reports, deep cost-sensitive models have shown a better performance than classical methods on standard test sets. Our proposed model belongs to the third category of the approaches mentioned above (i.e., direct cost-sensitive learning) and is a novel exploration of a cost-sensitive graph neural network architecture.

2.2. Graph Neural Network

A growing number of applications generate data in non-Euclidean domains, which are then represented as graphs with complex relationships and inter-object dependencies. Existing machine learning algorithms face substantial difficulties in handling the complexity of graph data. Over the past decade, many studies on extending traditional machine learning approaches to graph data have emerged. Among them, graph representation learning (GRL) has evolved considerably, and graph neural networks can be broadly regarded as the third (and latest) generation of GRL after traditional graph embedding and modern graph embedding. A growing body of research has shown that GNNs are extremely effective for both traditional GRL tasks (e.g., recommender systems and social network analysis) and new research areas (e.g., healthcare, physics, and combinatorial optimization) [1].
A typical GNN consists of graph filters and/or graph pooling layers. The former take the node features and graph structure as inputs and output new node features. The latter take a graph as input and output a coarsened graph with fewer nodes. GNNs can be broadly classified into two categories, i.e., spatial and spectral approaches, based on their graph filters. The former explicitly leverage the graph structure, for example, spatially close neighbors, whereas the latter analyze the graph using a graph Fourier transform and an inverse graph Fourier transform [30].
Classical spatial-based GNNs include [31,32,33,34,35,36,37]. Ref. [31] is a very early GNN that uses the local transition function as a graph filter. A GraphSAGE filter [32] uses different aggregators (mean/LSTM/pooling) to aggregate information regarding the one-hop neighbors of the nodes. In addition, a GAT-filter [33] relies on a self-attention mechanism to distinguish the importance of neighboring nodes during the aggregation process. An ECC-filter [34] was also proposed to handle graphs with different types of edges. Similarly, a GGNN-filter [36] was designed for graphs with different types of directed edges. By contrast, a Mo-filter [35] is based on a Gaussian kernel. Finally, an MPNN [37] is a more general framework, with the GraphSAGE-filter and GAT-filter mentioned above being special cases. In general, spatial-based GNNs are more generalized and flexible.
Spectral-based graph filters use graph spectral theory in the design of the filtering operations within the spectral domain. Early studies [30] required the eigendecomposition of the Laplacian matrix and matrix multiplication between dense matrices and were thus computationally expensive. To overcome this problem, a Poly-filter [38] based on a K-order truncated polynomial was proposed. To solve the problem of the Poly-filter in which the basis of the polynomial is not orthogonal, a Cheby-filter [38] based on the Chebyshev polynomial was introduced. A GCN-filter [39] is a simplified version of a Cheby-filter. The latter involves the K-hop neighborhood of a node during the filtering process, whereas in the former, K = 1. A GCN-filter can also be regarded as a spatial-based filter. Currently, GCNs are one of the most widely used types of GNN. In our model, we used a GCN as the key component.
Graph embedding and graph kernel techniques are strongly related to the study of GNNs. Compared to GNNs, the former only focus on representing network nodes as low-dimensional vectors without targeting subsequent tasks such as graph node classification and link prediction. Many graph embedding techniques are linear and not trained in an end-to-end manner, such as random walk [3] and matrix factorization [40]. On the other hand, graph kernel techniques employ a kernel function to obtain the vector representations of graphs. They are also an important type of approach for solving the graph classification problem. However, compared to GNNs, they are not learnable and are far less efficient.
GNNs are designed for different graph-based tasks, such as node classification, link prediction, graph classification, and community detection. To use the information encoded in the edges in a more refined way, [41] presented an inductive–transductive learning scheme based on GNNs. To overcome the difficulty of taking into account information coming from peripheral nodes within graphs, GNNs that are cascaded to form layered architectures [42] were proposed. In particular, our task is semi-supervised, which means that we need to learn the representation of all nodes from a few labeled nodes and the remaining unlabeled nodes. Recent studies on semi-supervised graph node classification can be found in [43,44]. GNNs are a rapidly growing field; for a more comprehensive and detailed introduction, we refer the reader to [1,45,46].

2.3. Connections to Our Work: When Graph Neural Networks Meet Cost-Sensitive Learning

This study aims to develop a graph neural network-based solution for a special graph representation learning task: semi-supervised graph representation learning on imbalanced data (thus related to Section 2.2). Meanwhile, our solution is constructed from the perspective of cost-sensitive learning (thus related to Section 2.1). To the best of our knowledge, this is the first study demonstrating the effectiveness of applying cost-sensitive learning to imbalanced graph representation learning. In recent years, we have observed the emergence of studies on our research task [11,16,47]. Among them, the DR-GCN [11] model relies on a class-conditioned adversarial training process to facilitate the separation of labeled nodes and the identification of minority class nodes. GraphSMOTE [47] tries to transfer the classical SMOTE method [48], which deals with imbalanced data, to graph data. In addition, RECT [16] has reported the best performance on imbalanced graph node classification tasks, and its core idea is based on the design and optimization of a class-semantic-related objective function.
Our proposed method is quite simple yet outperforms the state-of-the-art (SOTA) approach in extensive experiments conducted on standard datasets. In addition, we report for the first time the performance of classical GNN models based on standard evaluation metrics in the field of cost-sensitive learning. We specify the details of the proposed model in the next section.

3. Methodology

3.1. Method Overview

Conventionally, for a sample v belonging to class i, a standard GNN attempts to minimize the cross-entropy loss between the predicted and labeled class of v; that is, it maximizes the posterior probability that v belongs to class i:
\min \mathcal{L}_{CE}(\mathrm{GNN}(v), i) \Leftrightarrow \max P(v \mid i) \quad (1)
Herein, we omit the regularization parameter term for simplicity. In our cost-sensitive setting, let the (i, j) entry in cost matrix C be the misclassification cost of predicting v in class i when the true class is j. Instead of maximizing the posterior probability that v belongs to class i, our proposed DCSGCN model seeks to minimize the risk (or misclassification cost) of placing v into class i (denoted as Risk(v|i), which is the summation of the product of the posterior probability P(v|j) and the misclassification cost C(i, j) over potential classes):
\min \mathcal{L}_{CE}(\mathrm{DCSGCN}(v), i) \Leftrightarrow \min \mathrm{Risk}(v \mid i) = \sum_{j} P(v \mid j)\, C(i, j) \quad (2)
In the above equation, the lower the posterior probability that v belongs to a non-i class, and the lower the cost of classifying v into i when v actually belongs to a non-i class, the smaller the total risk Risk(v|i). In the case of imbalanced data, assuming i is a minority class and j is a majority class, according to the above assumption, we should have C(j, i) ≫ C(i, j). Therefore, for a minority class sample v, even if the classifier overfits the majority class features such that its learned posterior probability satisfies P(v|i) < P(v|j), we can still have Risk(v|i) < Risk(v|j) and thus make a correct prediction.
According to Equation (2), DCSGCN is naturally designed as a two-tower model (see Figure 1). The model contains two bilayer graph convolutional network (GCN) components: ConvNet_P and ConvNet_C. Among them, ConvNet_P is responsible for learning the posterior probability P(v|j), and ConvNet_C learns the misclassification cost information C(i, j). The outputs of the two components are combined to obtain the risk of assigning a node to each category.
Another major problem we face during training is that the training set only has category labels and lacks cost labels; thus, ConvNet_C cannot be trained directly. Therefore, in this study, we propose three heuristics for computing the cost labels. Our idea is quite simple: suppose the numbers of samples of class i and class j in a region S near node v are N_i^S and N_j^S, respectively; we then have
\frac{C(j, i)}{C(i, j)} \propto \sum_{S} w_S \frac{N_j^S}{N_i^S} \quad (3)
According to the above equation, the cost of misclassifying class i (denoted as C(j, i)) is negatively related to the number of samples contained in that class (denoted as N_i^S). When i is a minority class in S and j is a majority class, then N_j^S ≫ N_i^S and C(j, i) ≫ C(i, j), which is consistent with the above cost requirement. To make the calculation more accurate, S is designed as multiple regions of different areas centered at v, and w_S is the weight of S based on the regional node features. Based on the cost label c obtained, the final loss function is as shown below, where λ is the weight parameter and L_MSE is the mean squared error. Here, L_CE(DCSGCN(v), i) and L_MSE(ConvNet_C(v), c) are the losses of the cost-sensitive prediction and the cost matrix learning, respectively. By minimizing L, DCSGCN can simultaneously learn effective cost labels and make correct classifications.
\mathcal{L} = \lambda \cdot \mathcal{L}_{CE}(\mathrm{DCSGCN}(v), i) + (1 - \lambda) \cdot \mathcal{L}_{MSE}(\mathrm{ConvNet}_C(v), c) \quad (4)

3.2. Notations and Definitions

Input. The graph G = {V, A, X, Y_L, H}. Here, V = {v_1, v_2, ..., v_n} denotes the set of graph nodes, and A ∈ R^{n×n} denotes the adjacency matrix, where A_{ij} = 1 when there is an undirected edge between nodes v_i and v_j, and A_{ij} = 0 otherwise. The self-loops in G have been removed, and thus A_{ii} = 0 for all i ∈ {1, 2, ..., n}. Moreover, X ∈ R^{n×k} is the feature matrix, where X[i, :] ∈ R^{1×k} is the feature vector of node v_i and has dimension k. In addition, Y_L = {y_1, y_2, ..., y_p} is the class information for the labeled node set L = {v_1, v_2, ..., v_p}, where H = {1, 2, ..., m} denotes the set of node classes, y_i ∈ H for i ∈ {1, 2, ..., p}, and N_i represents the number of samples belonging to the ith category in L. Given the imbalanced class distribution of the nodes, the number of samples contained in different classes may vary significantly. We define the sets of majority and minority classes as H_maj and H_min, respectively. For i ∈ H_maj and j ∈ H_min, we have N_i ≫ N_j. The commonly used notations in this paper are listed in Table 1.
Output. Our goal is to learn a graph neural network f that maps the input information G into a dense low-dimensional vector representation Z ∈ R^{n×d} for all nodes, where Z[i, :] ∈ R^{1×d} is the vector of node v_i with dimension d, and the category labels Y_U = {y_{p+1}, y_{p+2}, ..., y_n} for the unlabeled node set U = {v_{p+1}, v_{p+2}, ..., v_n}. Clearly, an effective f can correctly predict unlabeled samples from both the majority and minority classes.
Cost matrix. The cost matrix C is an m × m real square matrix (where m is the number of node classes), in which the (i, j) entry of C is the misclassification cost of predicting class i when the true class is j, and C(i, j) ∈ [0, +∞). Naturally, because the prediction of the classifier is correct if i = j, then C(i, j) = 0. In the case of an imbalanced classification, the minority class samples are more likely to be misclassified as the majority class, and thus when i ∈ H_maj and j ∈ H_min, we should have C(i, j) > C(j, i). The element values of C can be set manually using domain knowledge or learned from the training data. In this study, we designed three heuristic algorithms for automatically computing the cost labels of the training set based on graph topological information, which will be presented later.
Cost-sensitive prediction. When the cost matrix C is known, we can make a cost-sensitive prediction consistent with the Bayes optimal decision, which computes the corresponding cost of assigning node v to each class and selects the class y with the lowest classification cost (denoted as L(v, y)):
y^* = \mathop{\arg\min}_{1 \le y \le m} L(v, y) = \mathop{\arg\min}_{1 \le y \le m} \sum_{k=1}^{m} P(k \mid v)\, C(y, k) \quad (5)
For any class y, L(v, y) measures the total cost of assigning node v to class y, P(k|v) represents the estimated likelihood of the classifier assigning v to class k, and C(y, k) denotes the cost of predicting v as class y when its true class is k. The total cost L(v, y) is small when v is unlikely to belong to a non-y class and the cost of misclassifying a non-y class sample into class y is small. This is the essence of cost-sensitive prediction: one class may be judged optimal even when another class is more probable. For the non-cost-sensitive case, that is, when all misclassifications have the same cost and the correct classification cost is zero, the above equation degenerates to the commonly used prediction y^* = \arg\min_{1 \le y \le m} \sum_{k \ne y} P(k \mid v) = \arg\max_{1 \le y \le m} P(y \mid v); that is, the pursuit of the class with the largest posterior probability. It follows that minimizing the misclassification cost is a more general setting than maximizing the posterior probability.
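To make Equation (5) concrete, the short PyTorch sketch below contrasts the maximum-posterior decision with the Bayes optimal cost-sensitive decision for a single node; the posterior vector and cost matrix are hypothetical numbers chosen only for illustration, not values from the paper.
import torch
# Hypothetical posterior P(k | v) for one node over m = 3 classes.
p = torch.tensor([0.2, 0.7, 0.1])
# Hypothetical cost matrix: C[y, k] = cost of predicting class y when the true class is k.
C = torch.tensor([[0.0, 1.0, 5.0],
                  [4.0, 0.0, 8.0],
                  [1.0, 1.0, 0.0]])
# Expected cost of predicting each class y: L(v, y) = sum_k P(k | v) * C[y, k].
expected_cost = C @ p                  # tensor([1.20, 1.60, 0.90])
y_map = torch.argmax(p)                # class 1: maximum-posterior prediction
y_cost = torch.argmin(expected_cost)   # class 2: Bayes optimal cost-sensitive prediction
Here the two decisions differ: class 1 is the most probable, but assigning the node to class 2 has the lowest expected cost, which is exactly the behavior a cost-sensitive classifier exploits for minority classes.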

3.3. Proposed Model: Dual Cost-Sensitive Graph Convolutional Network (DCSGCN)

3.3.1. Estimating Node Probability: C o n v N e t P

We use a classical two-layer GCN structure [39] to compute the posterior probability of the class to which a node belongs and denote it as ConvNet_P. The first and second layers of ConvNet_P are denoted as L_P^{(1)} and L_P^{(2)}, respectively, and their corresponding outputs {O_P^{(1)}, O_P^{(2)}} are as follows:
O_P^{(1)} = \mathrm{ReLU}\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W_P^{(1)} \right) \quad (6)
O_P^{(2)} = \mathrm{softmax}\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} O_P^{(1)} W_P^{(2)} \right) \quad (7)
where \tilde{A} = A + I and I ∈ R^{n×n} is an identity matrix of size n. In addition, \tilde{D} ∈ R^{n×n} is a diagonal matrix with \tilde{D}_{ii} = Σ_j \tilde{A}_{ij}. Here, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} is the normalized adjacency matrix, and A and X are the adjacency and feature matrices introduced in Section 3.2, respectively.
Furthermore, W_P^{(1)} ∈ R^{k×r} and W_P^{(2)} ∈ R^{r×m} are the learnable parameters in the first and second layers of ConvNet_P, where r and m are the dimensions of the output vectors of L_P^{(1)} and L_P^{(2)}, respectively. Here, m is the same as the number of node classes. In addition, ReLU and softmax are the respective activation functions of the first and second layers, where ReLU(Z)_i = max(0, Z_i) and softmax(Z)_i = exp(Z_i) / Σ_i exp(Z_i). Moreover, O_P^{(1)} ∈ R^{n×r} and O_P^{(2)} ∈ R^{n×m}, where O_P^{(2)} represents the posterior probability of the class to which each node belongs. For {L_P^{(1)}, L_P^{(2)}, O_P^{(1)}, O_P^{(2)}, W_P^{(1)}, W_P^{(2)}}, the superscript indicates the layer number, and the subscript P indicates that the parameters belong to ConvNet_P, distinguishing them from those of ConvNet_C below.
In (6) and (7), the role of \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, i.e., the normalized adjacency matrix, is to enrich the feature vector of a node by linearly adding the feature vectors of all its neighbors. This is because the basic assumption of a GCN is that neighboring nodes (and thus those having similar neighbors) are more likely to belong to the same class. The role of W_P^{(1)} and W_P^{(2)} is to transform the feature dimension of the nodes, making sparse high-dimensional node features dense at low dimensions. In addition, (6) and (7) can also be equivalently described as the process by which the input signal (i.e., the node feature X) is filtered through a graph Fourier transform in the graph spectral domain [44]; however, in this paper, we consider the spatial domain.
Eventually, given the one-hot training labels Y_L^{oh} = {y_1^{oh}, ..., y_p^{oh}}, we compute the cross-entropy error between the posterior probability O_P^{(2)} and Y_L as follows:
\mathcal{L}_{CE}(\mathrm{ConvNet}_P) = - \sum_{i=1}^{p} \sum_{j=1}^{m} Y_L^{oh}[i][j] \ln O_P^{(2)}[i][j] \quad (8)
By minimizing L_CE(ConvNet_P), we can learn the parameters of ConvNet_P such that it predicts the posterior probability of the class to which an unlabeled node belongs.
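For illustration, the following PyTorch sketch implements a two-layer GCN of the kind used for ConvNet_P (Equations (6)–(8)); the dense-matrix normalization, class name, and loss computation are our own simplified assumptions rather than the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    # D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A (self-loops added here).
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
class ConvNetP(nn.Module):
    # Two-layer GCN producing class posteriors, cf. Equations (6) and (7).
    def __init__(self, k: int, r: int, m: int):
        super().__init__()
        self.W1 = nn.Linear(k, r, bias=False)   # W_P^(1)
        self.W2 = nn.Linear(r, m, bias=False)   # W_P^(2)
    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H = F.relu(A_hat @ self.W1(X))                 # Equation (6)
        return F.softmax(A_hat @ self.W2(H), dim=1)    # Equation (7)
# Cross-entropy of Equation (8) over labeled nodes train_idx with labels y_train:
# loss_P = F.nll_loss(torch.log(conv_p(A_hat, X)[train_idx] + 1e-12), y_train)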

3.3.2. Estimating Node Cost: C o n v N e t C

In parallel with ConvNet_P, we use ConvNet_C, another neural network with a two-layer GCN structure, to predict the node misclassification cost. For a sample (x, y) in the training set, where x and y denote the feature vector and label of node v, respectively, ConvNet_C predicts the v-dependent cost matrix c ∈ R^{m×m}, whose element c_{ij} denotes the cost of predicting v in class i when the true class is j. Here, we use a GCN instead of a feedforward neural network to predict the cost because we want to exploit the topological information of the graph.
The setting we adopt here is an example-dependent cost, which means that the cost matrix may differ between samples (even if they belong to the same class), as opposed to a uniform cost matrix used for each category. This is because different nodes occupy different positions in the network. For example, consider two nodes v and u that belong to the same global minority class. Suppose v belongs to the local majority class (i.e., most of the neighbors of v belong to the same global minority class as v), whereas u belongs to the minority class even locally. Because ConvNet_P tends to assign similar classes to neighboring nodes, the probability of correctly predicting v as a global minority class will be much higher than that of correctly predicting u. Therefore, to prevent ConvNet_P from making incorrect predictions, the cost generated by ConvNet_C for incorrectly predicting u as a global majority class should be much larger than the cost of misclassifying v. From this perspective, the example-dependent cost information computed by ConvNet_C can be interpreted as a "correction" to the posterior probabilities generated by ConvNet_P. Clearly, the class-dependent cost matrix is a special case of the example-dependent setting.
Similar to (6) and (7), the outputs {O_C^{(1)}, O_C^{(2)}} of the first layer L_C^{(1)} and the second layer L_C^{(2)} in ConvNet_C are as follows:
O_C^{(1)} = \mathrm{ReLU}\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W_C^{(1)} \right) \quad (9)
O_C^{(2)} = \mathrm{ReLU}\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} O_C^{(1)} W_C^{(2)} \right) \quad (10)
In the above equations, {\tilde{D}, \tilde{A}, X} are the same as in ConvNet_P. In addition, W_C^{(1)} ∈ R^{k×r} and W_C^{(2)} ∈ R^{r×m²} are the learnable parameters in the first and second layers of ConvNet_C. Moreover, r and m² are the respective dimensions of the output vectors of L_C^{(1)} and L_C^{(2)}, m is the number of node categories, and O_C^{(1)} ∈ R^{n×r}, O_C^{(2)} ∈ R^{n×m²}. Note that the second layer also uses ReLU as the activation function because the cost matrix generated by the second layer is not a posterior probability. Finally, the mean squared error (MSE) is used as the loss function of ConvNet_C for training its parameters:
\mathcal{L}_{MSE}(\mathrm{ConvNet}_C) = \frac{1}{p} \sum_{i=1}^{p} \left\| O_C^{(2)}[i] - C_L[i] \right\|_2^2 \quad (11)
where C_L = {c_1, c_2, ..., c_p} is the set of cost matrix labels of the training samples {v_1, v_2, ..., v_p}, and ‖·‖_2 denotes the 2-norm of the matrix. We elaborate on three heuristics for computing C_L based on the topological information of the graph in Section 3.4.
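A minimal PyTorch sketch of such a cost subnetwork is given below, mirroring Equations (9)–(11); the reshaping of the m²-dimensional output into a per-node m × m cost matrix and all identifiers are our own illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
class ConvNetC(nn.Module):
    # Two-layer GCN predicting a flattened m x m cost matrix per node, cf. Equations (9)-(11).
    def __init__(self, k: int, r: int, m: int):
        super().__init__()
        self.m = m
        self.W1 = nn.Linear(k, r, bias=False)       # W_C^(1)
        self.W2 = nn.Linear(r, m * m, bias=False)   # W_C^(2)
    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H = F.relu(A_hat @ self.W1(X))              # Equation (9)
        out = F.relu(A_hat @ self.W2(H))            # Equation (10); ReLU keeps costs non-negative
        return out.view(-1, self.m, self.m)         # one m x m cost matrix per node
# MSE of Equation (11) against heuristic cost labels C_L with shape [p, m, m]:
# loss_C = F.mse_loss(conv_c(A_hat, X)[train_idx], C_L)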

3.3.3. Ensemble of Node Probability and Cost

For a node sample v_i, ConvNet_P and ConvNet_C compute the posterior probability O_P^{(2)}[i] and the cost matrix O_C^{(2)}[i], respectively. Based on O_P^{(2)}[i] and O_C^{(2)}[i], we can calculate the misclassification cost L(v_i, y) of predicting each class based on (5) and select the class with the minimal expected cost as follows:
y_i^* = \mathop{\arg\min}_{1 \le y \le m} L(v_i, y) = \mathop{\arg\min}_{1 \le y \le m} \sum_{k=1}^{m} O_P^{(2)}[i][k]\, O_C^{(2)}[i][y][k] \quad (12)
where O_P^{(2)}[i][k] is the probability, computed by ConvNet_P, that v_i belongs to the kth class, and O_C^{(2)}[i][y][k] is the cost, computed by ConvNet_C, of predicting v_i as the yth class when its true class is the kth class. Equation (12) is a cost-sensitive prediction consistent with the Bayes optimal decision.
Minimizing L(v_i, y) is equivalent to maximizing −L(v_i, y). Using softmax to normalize −L(v_i, y), we obtain the following cross-entropy error L_CE(ENS), where O_ENS and L_CE(ENS) represent the predicted probabilities and the loss obtained by ensembling ConvNet_P and ConvNet_C, respectively.
O_{ENS}[y] = \frac{\exp(-L(v_i, y))}{\sum_{y'} \exp(-L(v_i, y'))} \quad (13)
\mathcal{L}_{CE}(ENS) = - \sum_{i=1}^{p} \sum_{j=1}^{m} Y_L^{oh}[i][j] \ln O_{ENS}[i][j] \quad (14)
Algorithm 1 shows the "two-step" training process of DCSGCN. First, we pre-train ConvNet_P by minimizing L_CE(ConvNet_P) to compute the posterior probabilities of the classes to which the training samples belong. After L_CE(ConvNet_P) converges, we train the parameters of ConvNet_C and fine-tune ConvNet_P. In this second step, the loss function L is a weighted sum of L_MSE(ConvNet_C) and L_CE(ENS):
\mathcal{L} = \lambda \cdot \mathcal{L}_{CE}(ENS) + (1 - \lambda) \cdot \mathcal{L}_{MSE}(\mathrm{ConvNet}_C) \quad (15)
Algorithm 1 Dual Cost-Sensitive Graph Convolutional Network
Inputs: Graph data G = {V, A, X, Y_L}
Outputs: Network parameters {W_P^(1), W_P^(2), W_C^(1), W_C^(2)}; node embeddings O_P^(2)
  1: Initialize {W_P^(1), W_P^(2), W_C^(1), W_C^(2)}, training epochs τ_1 and τ_2, and weight λ
  2: for t ∈ [0, τ_1] do
  3:       O_P^(2) ← ConvNet_P(A, X)                                  ▹ Equation (7)
  4:       loss_P ← L_CE(O_P^(2), Y_L)                                ▹ Equation (8)
  5:       Update {W_P^(1), W_P^(2)} using ∂loss_P/∂W_P^(1), ∂loss_P/∂W_P^(2)
  6: end for
  7: Compute cost labels C_L                  ▹ Equation (16), Equation (17), or Equation (19)
  8: for t ∈ [0, τ_2] do
  9:       O_P^(2) ← ConvNet_P(A, X)                                  ▹ Equation (7)
  10:      O_C^(2) ← ConvNet_C(A, X)                                  ▹ Equation (10)
  11:      loss_C ← L_MSE(O_C^(2), C_L)                               ▹ Equation (11)
  12:      loss_ENS ← L_CE(O_P^(2), O_C^(2), Y_L)                     ▹ Equation (14)
  13:      loss ← λ · loss_ENS + (1 − λ) · loss_C                     ▹ Equation (15)
  14:      Update {W_P^(1), W_P^(2), W_C^(1), W_C^(2)} using ∂loss/∂W_P^(1), ∂loss/∂W_P^(2), ∂loss/∂W_C^(1), ∂loss/∂W_C^(2)
  15: end for
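As a rough PyTorch-style sketch of the second training stage of Algorithm 1 (lines 8–15), assuming the ConvNetP and ConvNetC modules from the earlier sketches, an Adam optimizer, and illustrative variable names, the joint update can be written as follows.
import torch
import torch.nn.functional as F
# Assumed from the earlier sketches: conv_p (ConvNetP), conv_c (ConvNetC), A_hat, X,
# train_idx, y_train (class labels), C_L (cost labels, shape [p, m, m]), and lam in [0, 1].
optimizer = torch.optim.Adam(list(conv_p.parameters()) + list(conv_c.parameters()), lr=0.01)
for epoch in range(200):                                 # tau_2 in Algorithm 1
    optimizer.zero_grad()
    P = conv_p(A_hat, X)                                 # [n, m] posteriors, Equation (7)
    C = conv_c(A_hat, X)                                 # [n, m, m] costs, Equation (10)
    risk = torch.einsum('ik,iyk->iy', P, C)              # L(v_i, y), Equation (12)
    log_O_ens = F.log_softmax(-risk[train_idx], dim=1)   # Equation (13), in log space
    loss_ens = F.nll_loss(log_O_ens, y_train)            # Equation (14)
    loss_c = F.mse_loss(C[train_idx], C_L)               # Equation (11)
    loss = lam * loss_ens + (1 - lam) * loss_c           # Equation (15)
    loss.backward()
    optimizer.step()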
By minimizing L, DCSGCN can simultaneously learn the cost matrix of the samples and conduct cost-aware node classification. Here, λ ∈ [0, 1] is used to adjust the weights of the two subnetworks. We pre-train ConvNet_P because we want it to compute the posterior probabilities of the nodes independently of ConvNet_C, whereas ConvNet_C only provides a "correction" of the posterior probabilities. If pre-training is omitted and ConvNet_P and ConvNet_C are only jointly trained, the output of ConvNet_P may contain too much cost information, thus deviating from the meaning of the posterior probabilities.
Another key point regarding DCSGCN is whether the parameters of the first layer in ConvNet_P and ConvNet_C should be shared (i.e., W_P^{(1)} = W_C^{(1)}). As an argument in favor of sharing, if we want ConvNet_C to provide "complementary" information about the equilibrium of the sample class distribution, the potential premise is that we need ConvNet_C to make the same guess about the sample class as ConvNet_P. For a node v_i, the first layers of ConvNet_P and ConvNet_C learn the representations O_P^{(1)}[i] and O_C^{(1)}[i] of v_i, respectively. If there is a significant difference between them, the predictions of v_i by ConvNet_C and ConvNet_P are likely to be different. Therefore, forcing parameter sharing can be more effective in ensuring the consistency of the two subnetworks in terms of node class prediction. However, we occasionally need ConvNet_P and ConvNet_C to be inconsistent in terms of label prediction. For example, samples belonging to the global majority class and local minority class are more likely to be misclassified as a local majority class. Therefore, we want DCSGCN to generate the posterior probability of the global majority class and the cost complement of the global minority class. Not requiring ConvNet_P and ConvNet_C to share the parameters of the first layer makes the model more flexible in utilizing the topological information of the nodes.
In Section 5.6, we verify whether pre-training ConvNet_P and sharing the first-layer parameters of ConvNet_P and ConvNet_C help improve the classification performance. The results of the validation show that pre-training has a positive effect, whereas sharing the parameters slightly degrades the model performance. Finally, Figure 1 illustrates the structure of our designed network.

3.4. Methods for Computing Cost Labels

In this section, we present three heuristics for computing the cost matrix of the training samples, which are based on the class distribution of the training set and the topological information of the input graph.

3.4.1. Global Cost

Our first approach assigns the same cost matrix C_GLO to all samples in the training set without considering their position information in the graph. In Section 3.2, we defined N_i as the number of samples belonging to the ith class in the labeled node set L, i ∈ {1, 2, ..., m}, and H_maj and H_min as the sets of global majority classes and global minority classes, respectively. Then, for each sample node, the misclassification cost of predicting it as class i when its true class is j is
C_{GLO}(i, j) = \begin{cases} \left( \dfrac{N_i + k}{N_j + k} \right)^{\alpha}, & i \ne j \\ 0, & i = j \end{cases} \quad (16)
From (16), the misclassification cost is zero when the classifier correctly classifies a sample, and the cost label when a sample is misclassified is related to the numbers of samples in the predicted class i and the true class j. Supposing {i ∈ H_maj, j ∈ H_min}, because N_i ≫ N_j, C_GLO(i, j) is large; when {i ∈ H_maj, j ∈ H_maj} or {i ∈ H_min, j ∈ H_min}, C_GLO(i, j) is approximately equal to 1 because N_i ≈ N_j. Finally, when {i ∈ H_min, j ∈ H_maj} and N_i ≪ N_j, C_GLO(i, j) is small and approximately equal to 0. Therefore, (16) is consistent with the principle underlying our cost computing method: in the case of imbalanced training samples, owing to the topological interplay effect, the minority class samples are more likely to be surrounded by majority class samples and misclassified into the majority class. To avoid this situation, the cost of misclassifying the minority class samples should be much larger than that of misclassifying the majority class samples. We use the ratio of the sizes of different classes to approximate the cost of misclassification. To improve the fit, k and α are added as parameters to adjust the sample proportion. Here, k is designed to be a small fraction of the total number of samples in the training set (e.g., 1%); the presence of k (add-k smoothing) prevents the ratio from degenerating when a class has no training samples in a region. In addition, α is the exponential scaling of the sample proportion after add-k smoothing. Together with the other DCSGCN parameters, the optimal values of k and α are obtained from the validation set.
Figure 2 shows a toy example of how to compute C_GLO for a target node v. There are only two classes for simplicity. The minority class 1 (represented by the red nodes) contains two samples, and the majority class 2 (represented by the blue nodes) contains seven samples. Here, k is 10% of the total number of nodes, including v, i.e., k = 1. In addition, α = 1. Therefore, the cost of misclassifying a class 1 sample into class 2 is (7 + 1)/(2 + 1) = 2.67, and the cost of misclassifying a class 2 sample into class 1 is (2 + 1)/(7 + 1) = 0.38.
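The computation of Equation (16) and the toy example of Figure 2 can be reproduced with a few lines of Python; the helper name global_cost and the list-based representation are illustrative assumptions of ours.
from collections import Counter
def global_cost(labels, num_classes, k=1.0, alpha=1.0):
    # Class-pair cost of Equation (16): C[i][j] = ((N_i + k) / (N_j + k)) ** alpha, 0 on the diagonal.
    counts = Counter(labels)
    N = [counts.get(c, 0) for c in range(num_classes)]
    return [[0.0 if i == j else ((N[i] + k) / (N[j] + k)) ** alpha
             for j in range(num_classes)]
            for i in range(num_classes)]
# Toy example of Figure 2: class 0 ("red", 2 samples) and class 1 ("blue", 7 samples), k = 1.
C_glo = global_cost([0, 0, 1, 1, 1, 1, 1, 1, 1], num_classes=2, k=1.0)
# C_glo[1][0] = 8/3 ~ 2.67 (red sample predicted as blue); C_glo[0][1] = 3/8 ~ 0.38.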

3.4.2. Simple Average Cost

The global cost relates the size of a class to the misclassification cost. However, its major drawback is that the cost matrix is the same for all nodes in the training set. In fact, the cost matrix of a sample node is influenced by its position in the network. Consider the case in which all neighbors of a global majority class node belong to the global minority class. Although rare, this situation does occur; for example, COVID-19 patients represent a small fraction of the global population, yet a healthy person may have family members who are infected with the virus. Owing to the topological interplay effect, a GNN-based classifier tends to misclassify such a node as the majority class of its neighbors, that is, the global minority class. Moreover, its global cost matrix encourages the classifier to classify it as a global minority class. Instead of correcting the posterior probability, the global cost exacerbates the tendency to misclassify.
Naturally, the cost matrix should be built by considering not only the proportions of the global inter-class samples (which are the same for all nodes) but also the proportions of the local inter-class samples (which may vary between nodes). For a node v, we denote the set of nodes directly connected to v as S_1, and the union of S_1 and the set of nodes directly connected to nodes in S_1 but not to v (i.e., at a distance of 2 from v) as S_2; S_2 is then the set of nodes at a maximum distance of 2 from v. Similarly, the set of nodes at a maximum distance of 3 from v is denoted as S_3, and so on, until the set of nodes at a maximum distance of l from v is denoted as S_l, where l is the distance from the farthest node to v in the training set. In the absence of isolated nodes, S_l includes all nodes in the training set. The construction of {S_1, S_2, ..., S_l} can be conducted simply through a breadth-first search on the graph. For each node set S_i, i ∈ {1, 2, ..., l}, we can compute the cost matrix C_GLO^i based on the node class distribution within S_i using (16). In other words, letting N_x^i denote the number of xth class samples in S_i, x ∈ {1, 2, ..., m}, then C_GLO^i(x, y) = ((N_x^i + k)/(N_y^i + k))^α when x ≠ y, and C_GLO^i(x, y) = 0 when x = y. Thus, {C_GLO^i} reflects the different costs assigned to v resulting from the different node distributions in different regions centered at v and expanding toward the network edge. In the above example, because v belongs to a global majority class and a local minority class, C_GLO^1 tends to push the classifier toward the correct class, thus attenuating the negative effect of the global cost. Eventually, the cost matrix C_SA of a node is the simple average of {C_GLO^i}:
C_{SA} = \frac{1}{l} \sum_{i=1}^{l} C_{GLO}^{i} \quad (17)
Algorithm 2 shows the computation process for C_SA. Figure 2 also illustrates a toy example of how to compute C_SA. The sets of nodes with maximum distances {1, 2, 3} from node v are {S_1, S_2, S_3}, respectively, where S_3 includes all nodes. In S_1, the numbers of red and blue class samples are 2 and 0, respectively, and the costs of misclassifying v into the red and blue classes are 3 and 0.33, respectively (assuming k = 1, α = 1). Similarly, in S_2, the numbers of red and blue class samples are 2 and 4, resulting in costs of misclassifying v into red and blue of 0.60 and 1.67. Finally, in the global view, the numbers of red and blue samples are 2 and 7, and the corresponding misclassification costs are 0.38 and 2.67, respectively. In addition, C_SA is the average over the three sets, i.e., C_SA(red, blue) = (3 + 0.6 + 0.38)/3 = 1.33 and C_SA(blue, red) = (0.33 + 1.67 + 2.67)/3 = 1.56. Compared to the global cost C_GLO, C_SA encourages the classifier to classify v into the correct class, i.e., blue.
Algorithm 2 Calculate Simple Average Cost
Inputs: Target node v; adjacency matrix A
Outputs: Simple average cost C_SA for v
  1: Initialize Q as a queue and D as a dictionary
  2: Label v as explored
  3: Q.enqueue(v)
  4: D[v] ← 0
  5: while Q is not empty do
  6:       v ← Q.dequeue()
  7:       for all edges from v to w in A.adjacentEdges(v) do
  8:             if w is not labeled as explored then
  9:                   Label w as explored
  10:                 Q.enqueue(w)
  11:                 D[w] ← D[v] + 1
  12:           end if
  13:     end for
  14: end while
  15: max_depth ← max(D.values())
  16: for i ∈ [1, max_depth] do
  17:     Construct node set S_i
  18:     Compute C_GLO^i based on S_i                  ▹ Equation (16)
  19: end for
  20: Compute C_SA using {C_GLO^i}                      ▹ Equation (17)
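Below is a compact Python sketch of Algorithm 2 that assumes an adjacency-list representation of the graph and reuses the hypothetical global_cost helper from the previous sketch; it builds the cumulative level sets S_1, ..., S_l by breadth-first search and averages their cost matrices as in Equation (17).
from collections import deque
def bfs_depths(adj, v):
    # Breadth-first distances from v; adj maps each node to a list of neighbors (lines 1-14).
    depth, queue = {v: 0}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in depth:
                depth[w] = depth[u] + 1
                queue.append(w)
    return depth
def simple_average_cost(adj, labels, v, num_classes, k=1.0, alpha=1.0):
    # Simple average cost C_SA for node v, Equation (17) (lines 15-20 of Algorithm 2).
    depth = bfs_depths(adj, v)
    max_depth = max(d for d in depth.values() if d > 0)
    total = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(1, max_depth + 1):
        S_i = [u for u, d in depth.items() if 0 < d <= i]       # cumulative level set S_i
        C_i = global_cost([labels[u] for u in S_i], num_classes, k, alpha)
        for a in range(num_classes):
            for b in range(num_classes):
                total[a][b] += C_i[a][b] / max_depth
    return total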

3.4.3. Weighted Average Cost

The simple average cost distinguishes the global dominance of a category from its local dominance and is thus node-dependent. However, it still ignores the fact that the dominance of a sample category in different regions may have a different importance. Returning to the example shown in Figure 2, to compute the cost matrix of node v, we first construct three node sets {S_1, S_2, S_3}, compute the cost for each node set separately to obtain {C_GLO^1, C_GLO^2, C_GLO^3}, and average them to obtain the final cost C_SA. However, C_SA still tends to classify v into the incorrect class, that is, the global minority class red, because v is in the minority only in S_1, and S_1 holds only one vote among {S_1, S_2, S_3}.
In the process of set voting, we want to highlight the sets in which the target node is in the minority of those sets. We provide an intuitive interpretation as follows: For a target node v and its corresponding node set { S 1 , S 2 , …, S l }, suppose v H m a j ; then, as i reaches closer to l, the category distribution in S i tends to approximate the global distribution. Therefore, if v is in a minority category in S i , i is more likely to be small, and v will belong to a local minority category. Because the topological interplay effect is more pronounced locally in v, the classifier should prefer to avoid misclassifying v as a local majority class, and thus the cost information calculated based on S i should be given more weight. When v H m i n , the situation is reversed. If v is a majority class in S i , then i is more likely to be small. At this point, the cost information calculated based on S i will incorrectly classify v as a minority class in S i . When i is large, the class distribution in S i will be close to the global distribution, and the cost calculated based on S i will encourage the classifier to make the correct choice, i.e., the global minority class. Therefore, S i should have a larger weight at this point.
Based on the above analysis, we designed a weighting method that relies on the node features. Naturally, the average features of a node set are closer to the features of its majority class nodes. Therefore, when the target node v is a minority in S_i, the cosine distance between its features X_v and the average features of the nodes in S_i, \bar{X}_{S_i} = (1/|S_i|) Σ_{u ∈ S_i} X_u, is large, where |S_i| is the number of nodes contained in S_i. This cosine distance can be regarded as the weight of S_i. In summary, given the target node v and the corresponding {S_1, S_2, ..., S_l} (constructed in the same way as described in Section 3.4.2), the weighted-average-based cost matrix C_WA of v is
W_i = \mathrm{Dist}_{cosine}(X_v, \bar{X}_{S_i}) \quad (18)
C_{WA} = \frac{\sum_{i=1}^{l} W_i\, C_{GLO}^{i}}{\sum_{i=1}^{l} W_i} \quad (19)
In the example shown in Figure 2, assuming that the blue class samples are all characterized by (1, 0)^T and the red class samples are all characterized by (0, 1)^T, the weights of S_1, S_2, and S_3 are 1.0, 0.10, and 0.04, respectively. Combining these with the simple average cost calculated before, the cost of misclassifying v as red is (3 × 1.0 + 0.60 × 0.10 + 0.38 × 0.04)/1.14 = 2.70, and the cost of misclassifying it as blue is (0.33 × 1.0 + 1.67 × 0.10 + 2.67 × 0.04)/1.14 = 0.53. It follows that C_WA considers the cost incurred by misclassifying v as red to be larger, thus encouraging the classifier to make the correct choice, i.e., the blue class.
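A possible Python sketch of Equations (18) and (19), reusing the hypothetical bfs_depths and global_cost helpers from the previous sketches and assuming a NumPy feature matrix X indexed by node, is shown below.
import numpy as np
def weighted_average_cost(adj, labels, X, v, num_classes, k=1.0, alpha=1.0):
    # Weighted average cost C_WA for node v, Equations (18) and (19).
    depth = bfs_depths(adj, v)
    max_depth = max(d for d in depth.values() if d > 0)
    numerator = np.zeros((num_classes, num_classes))
    denominator = 0.0
    for i in range(1, max_depth + 1):
        S_i = [u for u, d in depth.items() if 0 < d <= i]
        X_bar = np.mean([X[u] for u in S_i], axis=0)                 # average features of S_i
        cos = np.dot(X[v], X_bar) / (np.linalg.norm(X[v]) * np.linalg.norm(X_bar) + 1e-12)
        w_i = 1.0 - cos                                              # cosine distance, Equation (18)
        numerator += w_i * np.array(global_cost([labels[u] for u in S_i], num_classes, k, alpha))
        denominator += w_i
    return numerator / denominator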

4. Experimental Settings

4.1. Datasets

For the fairness and validity of the experiments, we compared the performance of all analyzed methods on nine widely used standard datasets covering different domains in the field of graph representation learning. In Table 2, we list the statistical information of all datasets used, including the number of nodes, the number of edges, the dimensions of the node features, the number of node classes, the minority classes, and the tuned optimal values of the key DCSGCN parameters {λ, α, k}. Below is a brief description of each dataset used.
Citation networks (3). Cora, Citeseer, and Pubmed [49,50]. In these datasets, nodes represent papers, edges represent citations between papers, and the features and classes of the nodes are the bag-of-words representation of the paper and the research topic (e.g., neural networks and reinforcement learning), respectively.
WebKB (3). Three sub-datasets, i.e., Cornell, Texas, and Wisconsin, are included (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/, accessed on 31 May 2022). In these datasets, nodes and edges represent the relevant web pages of a university and links between web pages, respectively. The features of a node are the bag-of-words representation of the corresponding web page, and the node category is the web page topic (e.g., Student and Faculty).
Wikipedia networks (2). Two sub-datasets, i.e., Chameleon and Squirrel (https://snap.stanford.edu/data/wikipedia-article-networks.html, accessed on 31 May 2022), are included, each describing a network of Wikipedia pages on a particular topic. Nodes represent article pages, and edges represent the links between pages. Node features are bag-of-word vectors of articles, and node categories are built based on the monthly traffic of the articles.
Actor co-occurrence network (1). In this dataset, a node represents an actor, and an edge represents whether two actors co-occur on the same Wikipedia page. Node features correspond to keywords on the Wikipedia pages. Node categories were built based on the actor's Wikipedia page. This dataset was proposed by [51].
All of the above datasets are available through the PyTorch geometric package (https://github.com/rusty1s/pytorch_geometric, accessed on 31 May 2022). In addition, we notice that although GNNs have been widely adopted in the remote sensing field, the current publicly-released benchmark datasets for GRL (such as the PyTorch geometric datasets (https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html, accessed on 31 May 2022) and the deep graph library datasets (https://docs.dgl.ai/api/python/dgl.data.html, accessed on 31 May 2022)) still have not covered the remote sensing field. Thus, unfortunately, we are not able to verify the effectiveness of our approach directly on remote sensing-related graphs. However, according to previous studies, such as [9] and [10], GNNs that are demonstrated as effective on non-remote sensing datasets can still play an important role in remote sensing applications.

4.2. Analyzed Methods

In this section, we describe all the analyzed methods, including three types of baselines and our proposed three models. The types of baselines are as follows.
Type I: Graph neural networks (2). These include the vanilla GCN [39] and its modified version, GCNII [52].
Type II: Graph neural networks + cost-sensitive learning methods (4). For the GCN and GCNII, we tested their combination with two classical cost-sensitive learning techniques, i.e., under-sampling and threshold-moving (both have multiple implementation schemes; this study applies the algorithms proposed in [15]). The former reduces the degree of imbalance in the training set by under-sampling the majority class samples, and the latter moves the classification decision boundary to increase the classifier's preference for the minority classes. The four combined methods are GCN_US, GCN_TM, GCNII_US, and GCNII_TM.
Type III: Graph neural networks specific for imbalanced classification (3). Three state-of-the-art GNN models [16] designed for imbalanced datasets were tested: RECT-L, RECT-N, and RECT. Unlike traditional GNNs, the RECT family adopts a novel objective function that explores class-semantic knowledge.
Type IV: Proposed methods (3). The methods proposed in this paper are denoted as DCSGCN_GLO, DCSGCN_SA, and DCSGCN_WA, which represent the dual cost-sensitive GCN based on the global cost, simple average cost, and weighted average cost, respectively.
To facilitate reproduction, the implementation of the baseline approaches relies on publicly released code from the sources, including GCN [39] (Available online: https://github.com/tkipf/pygcn, accessed on 31 May 2022), GCNII [52] (Available online: https://github.com/chennnM/GCNII, accessed on 31 May 2022), RECT [16] (Available online: https://github.com/zhengwang100/RECT, accessed on 31 May 2022).

4.3. Evaluation Metrics

Four evaluation metrics were used to evaluate the performance of all the methods. First, we tested two metrics that are standard in the graph node classification task: Micro-F1 and Macro-F1. Micro-F1 measures the F1-score of the aggregated contributions of all classes. In our multi-class setting, Micro-F1 and accuracy are equivalent. Macro-F1 is defined as the arithmetic mean of the label-wise F1-scores. Compared to Micro-F1, Macro-F1 does not consider the class size. Both Micro-F1 and Macro-F1 combine the precision and recall of the model, the range of which is [0, 1]. The larger the value is, the stronger the model performance.
Second, we tested two standard metrics from the cost-sensitive learning task: the misclassification cost and the number of high-cost errors. The former is the average cost resulting from all misclassifications (the ground-truth misclassification cost is discussed later in Section 4.4), and the latter is the number of misclassifications in which a minority class sample is predicted as belonging to a majority class. Both metrics range within [0, +∞), and the smaller the value, the stronger the model. To the best of our knowledge, this study is the first to report the performance of graph neural network models under the cost and high-cost error metrics in a graph node classification task. The formulas for all metrics are shown below.
\mathrm{Micro\mbox{-}F1} = \frac{\sum_{u=1}^{m} 2 \cdot TP_u}{\sum_{u=1}^{m} \left( 2 \cdot TP_u + FP_u + FN_u \right)} \quad (20)
\mathrm{Macro\mbox{-}F1} = \frac{1}{m} \sum_{u=1}^{m} \frac{2 \cdot TP_u}{2 \cdot TP_u + FP_u + FN_u} \quad (21)
cost = \frac{1}{N} \sum_{i=1}^{N} C\left( f(v_i), y_i \right) \quad (22)
high\_cost\_errors = \sum_{i=1}^{N} \left[\!\left[ f(v_i) \in H_{maj},\; y_i \in H_{min} \right]\!\right] \quad (23)
In the above equations, TP_u, FP_u, and FN_u denote the numbers of true positives, false positives, and false negatives in the uth class, respectively. In addition, m denotes the number of classes, and N is the number of classifications (i.e., the number of test set samples). Moreover, f(v_i) and y_i denote the predicted class and label class for test node v_i, respectively; C is the ground-truth cost matrix; H_maj and H_min are the majority and minority class sets, respectively; and [[·]] is the Iverson bracket, which represents an indicator function.
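The four metrics can be computed as in the Python sketch below; the use of scikit-learn's f1_score for Equations (20) and (21) is a substitution of ours, and the function signature is illustrative.
import numpy as np
from sklearn.metrics import f1_score
def evaluate(y_pred, y_true, C, H_maj, H_min):
    # Micro-/Macro-F1 (Equations (20)-(21)), average cost (22), and high-cost errors (23).
    # C is a 2D NumPy array; H_maj and H_min are sets of class indices.
    micro = f1_score(y_true, y_pred, average='micro')
    macro = f1_score(y_true, y_pred, average='macro')
    cost = float(np.mean([C[p, t] for p, t in zip(y_pred, y_true)]))
    high_cost = sum(1 for p, t in zip(y_pred, y_true) if p in H_maj and t in H_min)
    return micro, macro, cost, high_cost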

4.4. Training Settings

In this study, we assume that different types of errors lead to different degrees of misclassification cost; in particular, misclassifying a minority class sample results in a higher cost than misclassifying a majority class sample. We use a benchmark that is widely applied in cost-sensitive learning to generate the ground-truth misclassification cost matrix C, i.e., the randomized proportional setup [28,53]. The diagonal entries C(i, i) are set to zero, whereas the non-diagonal entries C(i, j) are uniformly sampled from [0, 10 N_i / N_j], where N_i represents the number of nodes in class i. This randomized proportional setup charges a higher expected cost for misclassifying a minority class than a majority class because it considers the class distribution of the dataset. All experiments were conducted in PyTorch (1 February 2020, community edition), and the Adam optimizer was used to train all analyzed methods. For each dataset, following [16], we first remove 70% of the samples from each minority class to increase the data imbalance. Then, we randomly split the labeled nodes of each class into 20%/20%/60% for training, validation, and testing. A 5-fold cross-validation is adopted to evaluate the model performance. For the pre-training of ConvNet_P, the number of training epochs was set to 50; for the training of DCSGCN, the number of training epochs was 200.
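The two data-preparation steps described above can be sketched as follows. This is only an illustration under the stated class-ratio formula; the exact randomization and the 5-fold cross-validation logic of the original implementation may differ, and the rows of C are taken to index the predicted class and the columns the true class, following C_{f(v_i), y_i}.

```python
import numpy as np

def randomized_proportional_cost(labels, rng=None):
    """Ground-truth cost matrix: C[i, i] = 0 and C[i, j] ~ Uniform(0, 10 * N_i / N_j),
    so predicting a large class i for a sample of a small class j is expensive."""
    rng = np.random.default_rng(rng)
    counts = np.bincount(labels).astype(float)
    m = len(counts)
    C = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                C[i, j] = rng.uniform(0.0, 10.0 * counts[i] / counts[j])
    return C

def imbalanced_split(labels, minority, drop=0.7, rng=None):
    """Remove a fraction `drop` of each minority class, then split the remaining
    nodes of every class into 20%/20%/60% train/validation/test."""
    rng = np.random.default_rng(rng)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        if c in minority:
            idx = idx[: int(round(len(idx) * (1.0 - drop)))]
        n_tr = n_va = int(0.2 * len(idx))
        train += list(idx[:n_tr])
        val += list(idx[n_tr:n_tr + n_va])
        test += list(idx[n_tr + n_va:])
    return np.array(train), np.array(val), np.array(test)
```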

4.5. Parameter Tuning

On the validation set of each dataset, we used a grid search to tune the parameters in the following order: learning rate (range of {0.001, 0.005, 0.01, 0.05, 0.1}) → weight decay (range of {10^{-5}, 5 × 10^{-5}, 10^{-4}, 5 × 10^{-4}, 10^{-3}}) → number of hidden units (range of {8, 16, 32, 64}) → dropout rate (range of 0.1–0.9, step size 0.1) → λ (range of 0.0–1.0, step size 0.1) → α (range of {0.25, 0.5, 1.0, 1.5}) → k (range of {0.001, 0.01, 0.1}). Among them, the last three parameters are specific to the DCSGCN family, and the effects of changing their values on the performance of DCSGCN are shown in Section 5.8.
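One way to read the tuning procedure is as a coordinate-wise (one parameter at a time) grid search over the listed ranges. The sketch below illustrates this reading; train_and_validate is a hypothetical helper that runs a full training job with the given hyperparameters and returns the validation Micro-F1.

```python
# Hyperparameter ranges in the tuning order given in the text.
search_space = {
    "lr":           [0.001, 0.005, 0.01, 0.05, 0.1],
    "weight_decay": [1e-5, 5e-5, 1e-4, 5e-4, 1e-3],
    "hidden":       [8, 16, 32, 64],
    "dropout":      [round(0.1 * i, 1) for i in range(1, 10)],
    "lam":          [round(0.1 * i, 1) for i in range(0, 11)],
    "alpha":        [0.25, 0.5, 1.0, 1.5],
    "k":            [0.001, 0.01, 0.1],
}

def sequential_grid_search(train_and_validate, search_space):
    """Tune one hyperparameter at a time, keeping the best values found so far fixed."""
    best = {name: values[0] for name, values in search_space.items()}
    for name, values in search_space.items():
        scores = {}
        for v in values:
            params = dict(best, **{name: v})
            scores[v] = train_and_validate(**params)  # validation Micro-F1
        best[name] = max(scores, key=scores.get)
    return best
```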

5. Experimental Results

5.1. Research Questions

In this section, the experimental results are discussed. Specifically, we answer the following research questions: (1) Does DCSGCN demonstrate a higher classification accuracy on imbalanced datasets compared to existing models (see Section 5.2)? (2) Can DCSGCN achieve a smaller misclassification cost and fewer high-cost errors while maintaining the classification accuracy (see Section 5.3)? (3) How does the classification performance of DCSGCN vary on datasets with different degrees of balance (see Section 5.4)? (4) How does the classification performance of DCSGCN vary on training sets of different sizes (see Section 5.5)? (5) Do the two key settings in training DCSGCN, i.e., pre-training Conv_P and not sharing the first-layer parameters of Conv_P and Conv_C, indeed improve the model performance (see Section 5.6)? (6) Does DCSGCN perform better than the classic data-oversampling methods (see Section 5.7)? (7) How do the three key parameters λ, α, and k of DCSGCN affect the network performance (see Section 5.8)? (8) Finally, what do the node representations learned by DCSGCN look like when visualized (see Section 5.9)? In the following sections, we answer the above questions with a detailed analysis of the experimental data.

5.2. Classification Accuracy

Table 3 and Table 4 show the performance of all methods under two measures of classification accuracy: Micro-F1 (which is equivalent to the accuracy under our multi-class classification scenario) and Macro-F1. The results are presented as the mean based on 10 replicated experiments. From the experimental results, we conducted the following analysis:
  • In our setting of removing 70% of the minority class samples, we noticed that the Micro-F1 values of both GCN and GCNII decreased to different degrees compared to the performance reported in related studies [39,52], and the performance of GCNII decreased more significantly. For example, in the Cora dataset, the decrease in the Micro-F1 values for the GCN and GCNII was approximately 7.5% and 30.6%, respectively. These data demonstrate the necessity of developing specific algorithms for imbalanced graph data.
  • The classic cost-sensitive learning methods under-sampling and threshold-moving improve the accuracy of type I methods. For under-sampling, it brings a 5.0% average performance improvement in terms of Micro-F1. For threshold-moving, this figure is 2.3%. It follows that under-sampling is the better of the two methods based on our experiments.
  • Stronger baselines than under-sampling and threshold-moving are the three type III models specifically designed for imbalanced datasets, i.e., RECT-L, RECT-N, and RECT. Among them, RECT exhibited the best performance: it achieves the best Micro-F1 scores on the Cora and Texas datasets and the best Macro-F1 score on the Cora dataset. Compared to under-sampling, RECT brings an average improvement of 5.6% in terms of Micro-F1; compared to threshold-moving, this figure is 8.5%.
  • It is encouraging to note that our proposed DCSGCN model achieves the best overall classification performance and the strongest robustness. We observed that in six of the nine datasets (excluding Cora, Citeseer, and Texas), DCSGCN achieved the highest Micro-F1; in eight of the nine datasets (excluding Cora), DCSGCN achieved the best Macro-F1. On Citeseer, the Micro-F1 of DCSGCN differs from that of the best performer by only 0.44%. As another observation, for Micro-F1, among the six datasets where DCSGCN is the best performer, the gains over the best baseline on Pubmed, Cornell, and Actor are relatively small, i.e., 0.4%, 2.8%, and 0.3%, respectively. For Wisconsin, Chameleon, and Squirrel, DCSGCN has an extremely significant advantage over the best baseline, i.e., 5.8%, 14.2%, and 11.3%, respectively. Similar observations were made for Macro-F1. Naturally, the fit of DCSGCN differs across datasets with different sample distributions and topological features; we regard the study of such variation as a follow-up research problem.
  • Among the three variants of DCSGCN, we noted that the simplest, global cost-based DCSGCN_GLO achieves the best performance, winning on three (Micro-F1) and four (Macro-F1) of the datasets, respectively. For the two methods based on the node-dependent cost, DCSGCN_SA, which uses the local context, wins on two (Micro-F1) and three (Macro-F1) of the datasets, and DCSGCN_WA, which uses the region features, wins on only one dataset under both metrics. This suggests that the information provided by the local context and region semantics is not always helpful for classification. One possible explanation is that the sample-dependent cost, while bringing more flexibility to the neural network, is also more likely to result in overfitting, particularly when the amount of training data is insufficient. In summary, the performance ranking of the three DCSGCN models is DCSGCN_GLO > DCSGCN_SA > DCSGCN_WA.
In summary, our answer to research question (1) (see Section 5.1) is “yes.” DCSGCN demonstrates a SOTA performance in the task of imbalanced graph node classification. Our idea of using cost to “complement” the posterior probability is simple but effective. In addition, the simplest method of estimating the sample cost achieves the highest performance.

5.3. Cost and High-Cost Errors

Table 5 and Table 6 show the performance of all methods under two metrics from a cost-sensitive classification perspective, i.e., cost and high-cost errors, respectively. Both metrics measure the ability of the classifier to correctly identify the minority class samples under the topological interplay effect. Similarly, the results are expressed in the form of the mean based on 10 replicate experiments. From the collected results, we conducted the following analysis.
  • As a counterintuitive finding, both under-sampling and threshold-moving increase the cost of the type I models (1.4%↑ and 2.1%↑, respectively). Past research has demonstrated that both under-sampling and threshold-moving are effective strategies for reducing the cost when the training samples are independently and identically distributed [15]. Our observations show that under-sampling and threshold-moving can fail on graph data (the presence of edges indicates explicit dependency relationships among the nodes, so the nodes are not independently and identically distributed). Combined with the previous analysis in Section 5.2, we infer that, on graph data, the improvements brought by under-sampling and threshold-moving come mainly from reducing the misclassification of the majority class samples rather than that of the minority class samples.
  • The RECT family can effectively reduce the misclassification cost. Among the three type III methods, the best in terms of both cost and high-cost errors is RECT. Combined with the previous analysis, we conclude that the RECT family outperforms the GCN- and GCNII-based models both when classifying the samples of all classes and when classifying the minority class samples.
  • Our proposed DCSGCN models are the best performers among all compared methods. In terms of cost, DCSGCN achieves the minimal cost on all datasets (9/9). In terms of high-cost errors, it achieves the fewest errors on most of the datasets (8/9). Compared to its performance in terms of Micro-F1 and Macro-F1, DCSGCN shows a more significant advantage under the cost and high-cost error metrics. Specifically, compared to the average performance of the RECT family, DCSGCN reduces the cost by 34.3% on average and the number of high-cost errors by 37.7% on average.
  • Among the three variants of DCSGCN, DCSGCN_GLO performs the best, which is consistent with the observation in Section 5.2. DCSGCN_GLO achieves the lowest cost on five of the datasets and the fewest high-cost errors on four of the datasets. For the other two methods, DCSGCN_SA and DCSGCN_WA, each achieves the lowest cost on three of the datasets, whereas in terms of high-cost errors, DCSGCN_WA is better (best on three datasets versus one). In general, based on the classification accuracy of the minority samples, we have DCSGCN_GLO > DCSGCN_WA > DCSGCN_SA.
In summary, our answer to question (2) (see Section 5.1) is also “yes.” Under the standard metrics for cost-sensitive classification, DCSGCN showed the best performance: while leading in classification accuracy, it also achieves the smallest misclassification cost. DCSGCN_GLO is still the best approach, whereas both DCSGCN_SA and DCSGCN_WA can still achieve the best performance on certain datasets, and neither of the latter two is clearly superior to the other.

5.4. Influence of Imbalance Ratio

In this subsection, we explore how the performance of DCSGCN changes as the imbalance of the training set changes. We first reduced the percentage of randomly removed minority class samples from 70% to 30% for each dataset (which will reduce the imbalance of the training set) and compared the performance of DCSGCN with all baselines under this moderated setting (see Table 7 and Table 8). The results in the tables are the mean values based on 10 replicated experiments, and we omit the standard deviation owing to space limitations. Subsequently, we varied the percentage of removed minority samples in [10%, 20%, …, 90%] and recorded the performance change of DCSGCN on the Cora dataset (see Figure 3).
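The sweep over removal percentages can be organized as in the following sketch, where split_fn and run_fn are hypothetical callables wrapping the data split of Section 4.4 and a full DCSGCN training run, respectively.

```python
def imbalance_sweep(split_fn, run_fn, ratios=None):
    """Vary the fraction of removed minority-class samples and collect the
    (Micro-F1, cost) pair per ratio, as plotted in Figure 3.
    split_fn(drop) -> (train, val, test); run_fn(split) -> (micro_f1, cost)."""
    ratios = ratios or [round(0.1 * p, 1) for p in range(1, 10)]  # 10% ... 90%
    return {drop: run_fn(split_fn(drop)) for drop in ratios}
```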
As shown in Table 7 and Table 8, there is a small increase in Micro-F1 and a small decrease in cost for each compared method when more minority class samples are available in the training set, which is consistent with our hypothesis. Moreover, DCSGCN maintains its performance advantage. Compared with the two strong baselines, GCN and RECT, DCSGCN increases Micro-F1 by 8.6% and 10.3% on average and reduces the cost by 18.7% and 27.5% on average, respectively; in terms of cost, the superiority of DCSGCN is more significant. Finally, we observe no obvious change in the relative performance of DCSGCN_GLO, DCSGCN_SA, and DCSGCN_WA compared with our previous findings on the more imbalanced data.
As seen in Figure 3, Micro-F1 shows a significant decreasing trend as the imbalance of the training set increases. Among the three DCSGCN models, the change in DCSGCN_GLO is the most moderate, and the change in DCSGCN_WA is the most drastic. For DCSGCN_SA on Cora, when the proportion of removed minority class samples reaches 70%, its Micro-F1 shifts from a small fluctuation to a drastic decrease. The cost is positively correlated with the imbalance of the training set. Similarly, we found that DCSGCN_WA has the strongest sensitivity to the imbalance in terms of cost. We infer that this is because, unlike the other two methods, DCSGCN_WA has one more parameter that is affected by the training set in addition to the inter-class sample ratio, i.e., the semantic features of the training data. The greater the number of parameters depending on the training set, the stronger the sensitivity.

5.5. Performance under Dense Training

Similar to [54], in this section, we discuss whether DCSGCN can maintain its leading performance on a larger training set. For each dataset, we randomly retained 60% of its nodes to train the network and 20% of the nodes for testing. As with sparse training, we then tested the performance of all compared methods when removing 30% and when removing 70% of the minority class samples. All other experimental settings were the same as those described in Section 4.4 and Section 4.5.

Table 9 and Table 10 show the Micro-F1 and the cost, respectively, of each analyzed model under dense training when 70% of the minority class samples are removed. Correspondingly, Table 11 and Table 12 show the model performance under dense training when 30% of the minority class samples are removed. The results in the tables are the mean values based on 10 replicated experiments, and we omit the standard deviation owing to space limitations.

When comparing Table 3 with Table 9 and Table 5 with Table 10, it can easily be seen that as the amount of training data increases, most of the Micro-F1 values generally increase and the cost values generally decrease for all methods. Similar observations can be made when comparing Table 7 with Table 11 and Table 8 with Table 12, where more training data are given.

At the same time, we found that DCSGCN still has a significant advantage on large training sets. For 70% sample removal, compared with the GCN, the average Micro-F1 improvement of DCSGCN is 17.1% and the average cost reduction is 44.8%; the corresponding values compared with RECT are 14.7% and 37.1%. For 30% sample removal, the corresponding performance gain of DCSGCN over the GCN is (Micro-F1, 11.4%↑; cost, 30.6%↓), and over RECT it is (Micro-F1, 5.8%↑; cost, 13.4%↓). This again demonstrates the effectiveness of DCSGCN. In addition, we observed a significant performance improvement for DCSGCN_SA and DCSGCN_WA compared to the case of sparse training, whereas the advantage of DCSGCN_GLO decreases. We infer that, with more training data, the more flexible sample-dependent cost-sensitive learning exhibits a stronger classification performance than class-dependent cost-sensitive learning.

5.6. Validation of Key Procedures in Training DCSGCN

In this section, we answer research question (5) in Section 5.1, namely, whether the two key steps in training DCSGCN, i.e., pre-training ConvNet_P and not sharing the first-layer parameters of ConvNet_P and ConvNet_C, indeed improve the model performance. For all DCSGCN models, we tested the following four combinations: {A, pre-trained, shared; B, pre-trained, not shared; C, not pre-trained, shared; D, not pre-trained, not shared}, and present the results in Table 13. The test dataset used was Cora, and all experimental settings and parameters were the same as those in Section 4.4 and Section 4.5. Naturally, (A + B) − (C + D) describes the performance difference between pre-training and not pre-training, whereas (A + C) − (B + D) describes the performance difference between sharing parameters and not sharing parameters.
By looking at Table 13, we find that, for all models, pre-training ConvNet_P improves Micro-F1 and reduces the cost; the average accuracy improvement and cost reduction are 2.4% and 0.38, respectively. Meanwhile, not sharing the first-layer parameters of ConvNet_P and ConvNet_C significantly improves the performance of DCSGCN_GLO and DCSGCN_SA and slightly impairs the performance of DCSGCN_WA. Therefore, we conclude that practice B (pre-trained, not shared) achieves the best overall performance.
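To make the two ablated design choices concrete, the following is a simplified two-tower sketch in PyTorch Geometric (our own illustration, not the released DCSGCN implementation): a flag controls whether the posterior tower ConvNet_P and the cost tower ConvNet_C share their first graph convolution layer, and a helper pre-trains only the ConvNet_P parameters with cross-entropy before joint training. The joint objective combining L_CE(ENS) and L_MSE(ConvNet_C) is omitted here, and the non-negative cost head is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TwoTowerSketch(torch.nn.Module):
    """ConvNet_P predicts class posteriors; ConvNet_C predicts per-class costs.
    `share_first` toggles whether the towers reuse the same first GCN layer."""
    def __init__(self, in_dim, hidden, n_classes, share_first=False):
        super().__init__()
        self.p1 = GCNConv(in_dim, hidden)
        self.c1 = self.p1 if share_first else GCNConv(in_dim, hidden)
        self.p2 = GCNConv(hidden, n_classes)  # posterior head
        self.c2 = GCNConv(hidden, n_classes)  # cost head (non-negative outputs assumed)

    def forward(self, x, edge_index):
        post = F.log_softmax(self.p2(F.relu(self.p1(x, edge_index)), edge_index), dim=1)
        cost = F.relu(self.c2(F.relu(self.c1(x, edge_index)), edge_index))
        return post, cost

def pretrain_p(model, x, edge_index, y, train_idx, epochs=50, lr=0.01):
    """Practices A/B: optimize only the ConvNet_P tower with cross-entropy
    before the joint training stage."""
    p_params = list(model.p1.parameters()) + list(model.p2.parameters())
    opt = torch.optim.Adam(p_params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        post, _ = model(x, edge_index)
        F.nll_loss(post[train_idx], y[train_idx]).backward()
        opt.step()
```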
Figure 4 shows the evolution of Micro-F1 and cost during the training process of DCSGCN_WA under each of the above practices (A–D) on the Cora dataset. It can be clearly observed that practice B (pre-trained, not shared) has the fastest convergence speed and the best convergence performance.

5.7. Comparison of DCSGCN and Advanced Data Over-Sampling Strategies

In this section, we further explore the performance differences between DCSGCN and {GNN + data over-sampling}. Our choice of GNNs includes GCN [39] and GCNII [52], while our selection of data over-sampling approaches comprises SMOTE [48] and ADASYN [55]. We then compare the performance of DCSGCN_WA with the following four combinations: {A, GCN + SMOTE; B, GCNII + SMOTE; C, GCN + ADASYN; D, GCNII + ADASYN}, and present the results in Table 14. The test dataset used was Cora, and all experimental settings and parameters were the same as those in Section 4.4 and Section 4.5. The implementation of SMOTE and ADASYN relies on the imbalanced-learn library [56]. By looking at Table 14, we find that the performance of the {GNN + data over-sampling} combinations lies roughly between that of the type II and type III methods (see Section 4.2). Therefore, we conclude that our proposed methods are better at learning how to embed imbalanced graph data than the classic data over-sampling strategies.
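For reference, the {GNN + data over-sampling} baselines can be assembled along the following lines with imbalanced-learn: the training-node features are over-sampled with SMOTE or ADASYN, and the synthetic samples are then attached to the graph in some way (e.g., as isolated nodes or via nearest-neighbor edges). That wiring step is a design choice we leave open here, and the helper name is ours rather than code from the paper.

```python
from imblearn.over_sampling import SMOTE, ADASYN

def oversample_train_features(X, y, train_idx, method="smote", rng=0):
    """Generate synthetic minority-class feature vectors from the training
    nodes only; attaching them to the graph is left to the caller.
    Note: SMOTE/ADASYN need more minority samples than their k_neighbors (default 5)."""
    sampler = SMOTE(random_state=rng) if method == "smote" else ADASYN(random_state=rng)
    X_res, y_res = sampler.fit_resample(X[train_idx], y[train_idx])
    return X_res, y_res
```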

5.8. Influence of Key Parameters

In this section, we explore how the three key parameters λ, α, and k of DCSGCN affect the performance of DCSGCN_WA. Among them, λ controls the weights of the category label learning and the cost label learning (see Equation (11)), whereas α and k control the scale of the proportion of inter-class samples when computing the cost matrix (see Equation (16)). The values of λ, α, and k are searched within {0.0, 0.1, 0.2, …, 1.0}, {0.25, 0.5, 1.0, 1.5}, and {0.001, 0.01, 0.1}, respectively.
Figure 5 shows the variation in the performance of DCSGCN_WA on the validation set of each dataset as the values of λ, α, and k change. As shown in Figure 5a, Micro-F1 increases on almost all datasets when 0.0 ≤ λ ≤ 0.2 and then enters a small fluctuation phase when 0.2 ≤ λ ≤ 0.8. In terms of cost, excluding the three smallest datasets (Cornell, Texas, and Wisconsin), where the performance fluctuates significantly, the cost similarly levels off when λ ≥ 0.2. Therefore, we conclude that [0.2, 1.0] is a suitable range of values for λ. When λ is too small, DCSGCN tends too much toward learning the cost label, weakening the learning of the posterior probability.
For α, we found that, when excluding Cornell, Texas, and Wisconsin, its value does not have a significant effect on Micro-F1 or the cost; we made similar observations for k. For Cornell, Texas, and Wisconsin, Micro-F1 increases and then decreases as α increases; in addition, Micro-F1 decreases and the cost increases when k changes from 0.01 to 0.1. One possible reason for this observation is that the category distribution of these three datasets is the most imbalanced, so the inter-class sample ratios are far from 1 and the performance is therefore sensitive to α; moreover, the number of nodes in these datasets is the smallest, and, therefore, when k is large (i.e., k ≥ 0.01), the number of nodes added by add-k smoothing is not negligible in comparison with the real number of nodes in the minority classes, making the model performance sensitive to k. In summary, using α = 0.5 and k = 0.01 is a good choice, as revealed experimentally.

5.9. Graph Visualization

To more intuitively analyze the features of the learned node representations obtained from the DCSGCN models and how they differ from those of the GCN, we projected the output vectors of the last layer of GCN, DCSGCN_GLO, DCSGCN_SA, and DCSGCN_WA on Cora into two dimensions, as illustrated in Figure 6. For the dimension reduction method, we used t-SNE [57].
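The projection in Figure 6 can be reproduced along the following lines, where Z is the matrix of last-layer node outputs and labels holds the node classes; the t-SNE settings shown are library defaults rather than the values used in the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(Z, labels, title="t-SNE of node representations"):
    """Project last-layer node outputs to 2-D with t-SNE and color points by class."""
    Z2 = TSNE(n_components=2, init="pca", random_state=0).fit_transform(Z)
    plt.scatter(Z2[:, 0], Z2[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```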
It is clear from Figure 6 that some of the class representations learned by the GCN lack clear separation boundaries. For example, for classes 0, 1, and 6, the distances between the nodes from these three classes are relatively small, and some of the nodes from these classes are mixed together with class 5 nodes. In particular, for minority classes 1 and 6, many of the samples lie on the boundary between the majority classes. Our observations show that the GCN lacks the ability to learn discriminative minority class features under the influence of the majority classes.
By contrast, the representations of the minority classes learned through DCSGCN are significantly more discriminative: we can easily delineate the feature space of each minority class and its boundary with the majority classes. In particular, minority classes 1 and 6 in Figure 6c have an extremely clear boundary space, with nearly no nodes present, separating them from the other classes. This observation demonstrates that DCSGCN can weaken the negative impact of an imbalanced class distribution to a considerable extent and learn the unique features of minority classes more effectively. We also see that even DCSGCN cannot avoid misclassifying a considerable number of nodes on the class boundaries, which reflects the difficulty of the imbalanced node classification task.

6. Conclusions

In this paper, we propose a new model, the dual cost-sensitive graph convolutional network (DCSGCN), for the task of long-tailed graph node classification. Compared with many GNNs based on maximizing the posterior probability, the proposed DCSGCN is based on minimizing the misclassification cost for a class prediction. To generate sample cost labels for network training, we propose three algorithms based on the node class distribution and graph topology. Extensive experiments on different datasets demonstrate the high effectiveness and robustness of DCSGCN in handling imbalanced data. In addition, a large number of tests under different experimental settings describe, in detail, the performance variations and characteristics of DCSGCN under different application scenarios.
To the best of our knowledge, this study is the first to introduce the idea of cost-sensitive learning in the field of graph representation learning and imbalanced node classification. The proposed model is simple and effective. In a future study, we will explore, in more depth, new theoretical directions and application scenarios using the organic combination of cost-sensitive learning and network science.

Author Contributions

Methodology/system implementation/experiments/original draft preparation: Y.D. All authors contributed to the study conception, revised, and approved the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available at https://github.com/rusty1s/pytorch_geometric (accessed on 31 May 2022).

Acknowledgments

We would like to thank the reviewers for taking the time and effort necessary to review the manuscript. We sincerely appreciate all valuable comments and suggestions, which helped us to improve the quality of the manuscript. This manuscript is an extension of the authors’ earlier work to be presented at the 2022 International Joint Conference on Neural Networks.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GRL  Graph representation learning
GNN  Graph neural network
GCN  Graph convolutional network
DCSGCN  Dual cost-sensitive graph convolutional network

References

  1. Ma, Y.T. Deep Learning on Graphs; Cambridge University Press: Cambridge, UK, 2021.
  2. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2002, 14, 849–856.
  3. Spitzer, F. Principles of Random Walk; Springer Science & Business Media: New York, NY, USA, 2013; Volume 34.
  4. Wang, F.; Zhang, C. Label propagation through linear neighborhoods. IEEE Trans. Knowl. Data Eng. 2007, 20, 55–67.
  5. Zhang, Y.; Chen, X.; Yang, Y.; Ramamurthy, A.; Li, B.; Qi, Y.; Song, L. Efficient probabilistic logic reasoning with graph neural networks. arXiv 2020, arXiv:2001.11850.
  6. Zhang, Z.; Zhuang, F.; Zhu, H.; Shi, Z.; Xiong, H.; He, Q. Relational graph neural network with hierarchical attention for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9612–9619.
  7. Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, E.; Tang, J.; Yin, D. Graph neural networks for social recommendation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 417–426.
  8. Gui, T.; Zou, Y.; Zhang, Q.; Peng, M.; Fu, J.; Wei, Z.; Huang, X.J. A lexicon-based graph neural network for Chinese NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1040–1050.
  9. Kang, Y.; Chen, J.; Cao, Y.; Xu, Z. A Higher-Order Graph Convolutional Network for Location Recommendation of an Air-Quality-Monitoring Station. Remote Sens. 2021, 13, 1600.
  10. Ouyang, S.; Li, Y. Combining deep semantic segmentation network and graph convolutional neural network for semantic segmentation of remote sensing imagery. Remote Sens. 2020, 13, 119.
  11. Shi, M.; Tang, Y.; Zhu, X.; Wilson, D.; Liu, J. Multi-class imbalanced graph convolutional network learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020.
  12. Lin, F.; Cohen, W.W. Semi-supervised classification of network data using very few labels. In Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, Odense, Denmark, 9–11 August 2010; pp. 192–199.
  13. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  14. Opitz, J.; Burst, S. Macro f1 and macro f1. arXiv 2019, arXiv:1911.03347.
  15. Zhou, Z.H.; Liu, X.Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2005, 18, 63–77.
  16. Wang, Z.; Ye, X.; Wang, C.; Cui, J.; Yu, P. Network embedding with completely-imbalanced labels. IEEE Trans. Knowl. Data Eng. 2020, 33, 3634–3647.
  17. Elkan, C. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence; Lawrence Erlbaum Associates Ltd.: Mahwah, NJ, USA, 2001; Volume 17, pp. 973–978.
  18. Sheng, V.S. Fast data acquisition in cost-sensitive learning. In Industrial Conference on Data Mining; Springer: Berlin/Heidelberg, Germany, 2011; pp. 66–77.
  19. Sze, V.; Chen, Y.H.; Emer, J.; Suleiman, A.; Zhang, Z. Hardware for machine learning: Challenges and opportunities. In Proceedings of the 2017 IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, USA, 30 April–3 May 2017; pp. 1–8.
  20. Jaimes, A.; Sebe, N. Multimodal human–computer interaction: A survey. Comput. Vis. Image Underst. 2007, 108, 116–134.
  21. Li, H.; Zhang, L.; Huang, B.; Zhou, X. Sequential three-way decision and granulation for cost-sensitive face recognition. Knowl.-Based Syst. 2016, 91, 241–251.
  22. Zhou, Z.H.; Liu, X.Y. On multi-class cost-sensitive learning. Comput. Intell. 2010, 26, 232–257.
  23. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Cost-sensitive learning. In Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; pp. 63–78.
  24. Domingos, P. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 155–164.
  25. Sheng, V.S.; Ling, C.X. Thresholding for Making Classifiers Cost-Sensitive; American Association for Artificial Intelligence: Palo Alto, CA, USA, 2006; Volume 6, pp. 476–481.
  26. Lomax, S.; Vadera, S. A survey of cost-sensitive decision tree induction algorithms. ACM Comput. Surv. 2013, 45, 1–35.
  27. Morik, K.; Brockhausen, P.; Joachims, T. Combining Statistical Learning with a Knowledge-Based Approach: A Case Study in Intensive Care Monitoring; Technical Report; Universitat Dortmund: Dortmund, Germany, 1999.
  28. Chung, Y.A.; Lin, H.T.; Yang, S.W. Cost-aware pre-training for multiclass cost-sensitive deep learning. arXiv 2015, arXiv:1511.09337.
  29. Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Networks Learn. Syst. 2017, 29, 3573–3587.
  30. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203.
  31. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Networks 2008, 20, 61–80.
  32. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1025–1035.
  33. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  34. Simonovsky, M.; Komodakis, N. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3693–3702.
  35. Monti, F.; Bronstein, M.M.; Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. arXiv 2017, arXiv:1704.06803.
  36. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
  37. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1263–1272.
  38. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Inf. Process. Syst. 2016, 29, 3844–3852.
  39. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
  40. Shen, X.; Pan, S.; Liu, W.; Ong, Y.S.; Sun, Q.S. Discrete network embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3549–3555.
  41. Rossi, A.; Tiezzi, M.; Dimitri, G.M.; Bianchini, M.; Maggini, M.; Scarselli, F. Inductive–transductive learning with graph neural networks. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; pp. 201–212.
  42. Bianchini, M.; Dimitri, G.M.; Maggini, M.; Scarselli, F. Deep neural networks for structured data. In Computational Intelligence for Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; pp. 29–51.
  43. Li, J.; Rong, Y.; Cheng, H.; Meng, H.; Huang, W.; Huang, J. Semi-supervised graph classification: A hierarchical graph perspective. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 972–982.
  44. Zhuang, C.; Ma, Q. Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 499–508.
  45. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
  46. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 4–24.
  47. Zhao, T.; Zhang, X.; Wang, S. GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, 8–12 March 2021; pp. 833–841.
  48. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  49. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag. 2008, 29, 93.
  50. Namata, G.; London, B.; Getoor, L.; Huang, B.; EDU, U. Query-driven active surveying for collective classification. In Proceedings of the 10th International Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, 1 July 2012; Volume 8, p. 1.
  51. Tang, J.; Sun, J.; Wang, C.; Yang, Z. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 807–816.
  52. Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; Li, Y. Simple and deep graph convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1725–1735.
  53. Abe, N.; Zadrozny, B.; Langford, J. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 3–11.
  54. Spurek, P.; Danel, T.; Tabor, J.; Smieja, M.; Struski, L.; Slowik, A.; Maziarka, L. Geometric graph convolutional neural networks. arXiv 2019, arXiv:1909.05310.
  55. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
  56. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. 2017, 18, 1–5.
  57. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Figure 1. Model architecture. First, during the pre-training stage, input data only pass through the bilayer GCN network ConvNet_P, which learns the mapping between input graph node features and node class labels by minimizing the cross-entropy error L_CE(ConvNet_P). Second, at the joint training stage, ConvNet_C is learned, and ConvNet_P is fine-tuned. The input data pass through both ConvNet_P and ConvNet_C and are transformed differently. Using the cost label, ConvNet_C learns the node-dependent cost. Such cost information is used to correct the posterior probability from ConvNet_P by using the ensemble-oriented regularizer L_CE(ENS). By minimizing the sum of L_CE(ENS) and L_MSE(ConvNet_C), DCSGCN combines the opinions of the two sub-networks to give better predictions.
Figure 2. Examples of computing cost in different ways (left: global cost C G L O ; right: simple average cost C S A and weighted average cost C W A ). For simplicity, there are only two groups of nodes. Two samples make up the minority class 1 (shown by the red nodes), while seven samples make up the majority class 2 (represented by the blue nodes). Following Equations (16), (17), and (19), the costs of misclassifying v as a red node are 0.38, 1.33, and 2.70, respectively. C W A is the only method that encourages the classifier to make the correct choice (i.e., the blue class).
Figure 3. Influence of class imbalance ratio on Cora. (a) Micro-F1; (b) Cost.
Figure 4. Model performance during the training process. (a) Micro-F1; (b) Cost.
Figure 5. The effect of key parameters ( λ , α , and k) on the performance of DCSGCN W A for each dataset. (a) The effect of λ on Micro-F1; (b) The effect of α on Micro-F1; (c) The effect of k on Micro-F1; (d) The effect of λ on cost; (e) The effect of α on cost; (f) The effect of k on cost.
Figure 6. A t-SNE visualization of learned node representation of the Cora dataset obtained from GCN, DCSGCN G L O , DCSGCN S A , and DCSGCN W A . The node colors denote the labels. (a) Method: GCN; (b) Method: DCSGCN G L O ; (c) Method: DCSGCN S A ; (d) Method: DCSGCN W A .
Table 1. Commonly used notations.
Notation / Description
G: A graph.
V: The set of graph nodes.
v: A node v ∈ V.
A: The adjacency matrix.
L: The labeled node set.
U: The unlabeled node set.
Y_L: The class information for the labeled node set.
Y_U: The predicted labels for the unlabeled node set.
H: The node classes.
H_maj: The majority classes.
H_min: The minority classes.
Z: The node vector representation matrix.
C: The cost matrix.
P(y|v): The posterior probability of v belonging to class y.
\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}: The normalized adjacency matrix.
W_P, W_C: The learnable parameters of DCSGCN.
L_P, L_C: The layers of DCSGCN.
O_P, O_C: The outputs of DCSGCN.
L_CE(ENS), L_MSE(ConvNet_C): The loss functions of DCSGCN.
Table 2. Dataset statistics.
Dataset Cora Cite. Pubm. Corn. Texa. Wisc. Cham. Squi. Actor
# Nodes 2708 3327 19,717 183 183 251 2277 5201 7600
# Edges 5429 4732 44,338 295 309 499 36,101 217,073 33,544
# Features 1433 3703 500 1703 1703 1703 2325 2089 931
# Classes 7 6 3 5 5 5 5 5 5
Minority classes {2, 7} {1} {1} {2, 3} {2, 3} {1, 4, 5} {5} {3} {1}
λ 0.5 0.3 0.5 0.6 0.5 0.5 0.9 1.0 0.5
α 0.5 0.5 0.5 0.5 0.5 0.5 0.25 0.5 0.25
k 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
Table 3. Performance of analyzed methods based on Micro-F1/Accuracy (%). The higher the score, the better the method.
Dataset Cora Citeseer Pubmed Cornell Texas Wisconsin Chameleon Squirrel Actor
GCN 74.03 72.41 81.24 33.64 45.45 46.36 43.31 28.26 26.29
GCNII 54.89 44.12 70.25 51.82 50.91 41.06 28.90 19.55 25.11
GCN U S 75.63 70.91 83.76 47.27 52.50 42.38 45.79 28.52 25.77
GCN T M 73.78 72.56 81.28 52.73 47.27 44.37 44.04 26.05 25.70
GCNII U S 50.46 62.67 70.26 48.18 51.36 46.77 32.70 19.64 25.22
GCNII T M 55.69 46.72 70.39 46.36 50.91 42.38 33.58 19.87 23.07
RECT-L 77.95 66.56 82.38 52.25 53.15 28.48 24.30 22.82 24.26
RECT-N 84.35 69.24 80.78 50.45 53.15 28.34 35.04 28.38 26.65
RECT 84.38 70.02 82.54 51.17 53.15 31.79 35.73 30.07 25.74
DCSGCN G L O 82.31 71.79 84.15 48.10 48.18 50.63 56.22 41.38 26.96
DCSGCN S A 82.28 72.12 83.65 55.54 49.75 46.12 59.99 40.36 25.79
DCSGCN W A 82.05 71.69 83.04 55.45 45.21 52.62 59.33 40.56 26.63
Table 4. Performance of all analyzed methods in terms of Macro-F1 (%). The higher the score, the better the method.
Dataset Cora Citeseer Pubmed Cornell Texas Wisconsin Chameleon Squirrel Actor
GCN 56.56 62.79 77.52 20.69 13.75 18.88 37.21 21.30 18.09
GCNII 34.45 31.18 52.29 13.65 13.49 11.64 16.54 6.74 8.03
GCN U S 64.96 61.71 82.16 13.96 21.10 14.00 39.14 22.37 18.59
GCN T M 56.45 62.75 77.73 20.48 15.96 16.55 36.32 18.13 19.02
GCNII U S 29.90 50.99 52.39 13.81 15.77 13.47 20.97 7.04 8.06
GCNII T M 35.18 35.17 52.41 12.67 13.49 11.91 21.93 6.69 7.79
RECT-L 75.92 59.15 82.44 19.03 13.88 11.21 17.95 17.94 10.93
RECT-N 83.24 61.29 80.67 13.91 13.88 12.37 33.60 24.55 14.59
RECT 83.33 61.78 82.46 14.27 14.25 12.83 33.01 26.31 12.79
DCSGCN G L O 79.60 63.42 82.67 25.18 29.65 23.58 55.41 39.64 16.92
DCSGCN S A 80.21 64.97 82.24 27.01 29.52 21.05 59.08 38.37 19.47
DCSGCN W A 80.00 63.71 81.80 29.34 22.41 22.67 57.97 38.31 18.38
Table 5. Performance of all analyzed methods in terms of cost (the lower the value, the better). Please refer to Section 4.3 and Section 4.4 for the information on the ground-truth matrix.
Dataset Cora Citeseer Pubmed Cornell Texas Wisconsin Chameleon Squirrel Actor
GCN 2.60 1.96 1.41 8.51 9.09 8.34 3.15 3.59 4.70
GCNII 2.37 2.10 1.10 10.09 9.27 8.29 3.05 3.57 4.77
GCN U S 2.64 1.96 1.40 9.22 9.68 8.76 3.07 3.70 4.69
GCN T M 3.63 3.54 2.44 9.72 10.26 9.83 3.96 4.03 5.24
GCNII U S 4.63 3.41 2.44 9.93 9.62 9.55 3.84 4.03 5.22
GCNII T M 3.38 3.33 2.43 11.00 9.99 9.79 3.70 4.01 5.26
RECT-L 1.70 2.15 0.95 7.34 7.93 8.67 4.17 3.86 5.08
RECT-N 1.25 2.06 1.07 8.15 7.84 8.81 3.48 3.58 4.77
RECT 1.20 2.02 0.98 8.14 7.70 9.30 3.49 3.50 4.79
DCSGCN G L O 1.27 1.83 0.89 4.21 2.90 5.83 2.26 2.77 4.62
DCSGCN S A 1.30 1.83 0.92 4.36 3.80 5.20 2.12 2.91 4.34
DCSGCN W A 0.99 1.83 0.92 5.21 3.89 5.46 2.18 2.85 4.33
Table 6. Performance of all analyzed methods in terms of high-cost errors (the lower the value, the better). Please refer to Section 4.3 and Section 4.4 for the information on the ground-truth matrix.
Dataset Cora Citeseer Pubmed Cornell Texas Wisconsin Chameleon Squirrel Actor
GCN 257.3 174.4 1272.7 13.1 16.6 45.7 253.3 635.9 537.5
GCNII 210.3 176.2 852.8 15.6 14.6 41.5 250.9 642.1 538.2
GCN U S 245.8 163.8 1251.3 14.8 16.7 46.6 245.7 650.8 526.6
GCN T M 247.4 168.9 2504.2 14.6 15.6 45.9 257.2 623.2 544.7
GCNII U S 252.9 167.5 2503.1 14.9 14.4 43.5 243.5 659.1 534.7
GCNII T M 244.6 171.2 2518.6 15.9 14.2 43.4 247.6 647.3 543.3
RECT-L 78.6 164.5 481.5 14.2 14.6 40.2 244.3 645.2 536.8
RECT-N 58.0 158.2 567.8 13.3 14.1 42.0 200.1 606.7 537.6
RECT 55.5 158.5 508.2 13.7 14.8 41.5 202.6 638.6 536.7
DCSGCN G L O 62.9 139.3 617.1 9.9 5.3 33.5 160.4 472.2 501.0
DCSGCN S A 69.4 146.5 625.1 12.2 11.3 38.3 134.9 468.8 505.0
DCSGCN W A 45.9 125.9 587.3 13.4 9.6 36.6 160.6 362.7 509.0
Table 7. Micro-F1 (Accuracy) percent of selected methods with regard to the removal of 30% of the samples of the minority classes.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 78.89 84.08 83.29 82.69 83.94
Cite. 71.81 71.59 71.42 72.28 71.87
Pubm. 84.35 84.55 85.53 85.36 84.57
Corn. 48.18 51.17 52.98 51.82 59.34
Texa. 46.36 54.41 51.57 50.58 42.98
Wisc. 45.03 42.91 49.85 50.63 47.44
Cham. 43.75 35.36 61.45 60.10 60.02
Squi. 28.29 29.25 42.98 40.61 40.74
Actor 26.18 27.02 27.24 26.28 27.31
Table 8. Cost of selected methods with regard to the removal of 30% of the samples of minority classes.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 2.01 1.19 1.04 1.08 0.84
Cite. 2.02 1.98 1.74 1.71 1.78
Pubm. 1.01 0.86 0.79 0.77 0.87
Corn. 9.90 8.19 5.63 10.02 6.65
Texa. 9.34 7.18 7.02 5.24 4.45
Wisc. 9.58 8.99 7.48 5.82 8.08
Cham. 3.15 3.60 1.80 2.05 1.99
Squi. 3.59 3.54 2.75 2.86 2.86
Actor 4.66 4.86 4.54 4.11 4.49
Table 9. Performance of selected methods under dense training in terms of Micro-F1 (Accuracy) percent with regard to removing 70% of the minority class samples.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 75.46 85.30 85.89 85.89 85.83
Cite. 72.07 70.57 77.88 76.50 73.75
Pubm. 82.00 84.42 85.64 86.78 86.68
Corn. 45.95 55.26 64.62 51.11 67.08
Texa. 43.24 50.00 41.03 40.29 48.89
Wisc. 45.10 31.37 55.61 56.15 43.14
Cham. 46.05 38.55 64.47 64.25 71.16
Squi. 27.28 31.26 42.92 45.59 49.65
Actor 27.57 27.50 26.83 27.73 26.62
Table 10. Performance of selected methods under dense training in terms of cost with regard to removing 70% of the minority class samples.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 2.43 1.04 0.96 0.87 1.04
Cite. 1.93 1.92 1.32 1.40 1.74
Pubm. 1.32 0.88 0.85 0.71 0.65
Corn. 13.51 5.98 8.66 6.44 5.18
Texa. 13.43 8.05 2.95 8.31 5.23
Wisc. 10.12 12.93 4.70 4.14 6.44
Cham. 3.01 3.45 1.80 1.80 1.35
Squi. 3.64 3.44 2.54 2.54 2.36
Actor 4.70 4.84 4.88 4.22 4.77
Table 11. Performance of selected methods under dense training in terms of Micro-F1 (Accuracy) percent with regard to removing 30% of the minority class samples.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 80.81 87.03 88.28 85.38 87.01
Cite. 72.67 74.02 74.90 74.35 74.80
Pubm. 84.84 86.09 86.40 86.74 86.82
Corn. 54.05 55.79 51.11 54.79 54.79
Texa. 45.95 55.26 47.17 40.05 40.79
Wisc. 43.14 51.37 50.09 55.26 48.66
Cham. 43.42 37.02 68.42 67.98 66.45
Squi. 29.20 31.76 43.27 45.52 46.51
Actor 26.91 27.79 27.12 27.18 27.28
Table 12. Performance of selected methods under dense training in terms of cost with regard to removing 30% of the minority class samples.
Dataset GCN RECT DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Cora 1.76 0.96 0.86 1.00 0.86
Cite. 1.95 1.60 1.66 1.61 1.61
Pubm. 0.93 0.78 0.74 0.68 0.75
Corn. 11.40 7.04 3.71 7.47 9.07
Texa. 11.23 6.99 5.29 9.35 5.04
Wisc. 8.89 9.11 7.25 7.11 8.73
Cham. 3.15 3.48 1.49 1.43 1.70
Squi. 3.54 3.41 2.57 2.61 2.45
Actor 4.75 4.75 4.81 4.17 5.04
Table 13. Performance comparisons of DCSGCN models with different training procedures.
DCSGCN_GLO DCSGCN_SA DCSGCN_WA
Micro-F1 Cost Micro-F1 Cost Micro-F1 Cost
(A + B) − (C + D) 1.1% −0.50 3.2% −0.49 2.9% −0.16
(A + C) − (B + D) −5.7% 0.82 −6.7% 1.17 1.1% −0.20
Table 14. Performance improvements of DCSGCN W A over different {GNN + data over-sampling} combinations.
A B C D
Micro-F1 Cost Micro-F1 Cost Micro-F1 Cost Micro-F1 Cost
12.4% −0.74 13.9% −1.12 12.1% −0.71 17.0% −1.36
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
