1 Introduction
In recent years, graph machine learning [27, 38] has been widely adopted for modeling graph-structured data. In this realm, the emergence of graph neural networks (GNNs) [71] has offered promising results in diverse domains, such as recommendation systems [25, 85, 91], social network analysis [47, 84, 93], and drug discovery [4, 73]. GNNs, like typical neural networks [45, 75], are abstractions of the underlying data. Thus, their inference can suffer from faults [28, 53, 58], which can lead to severe prediction failures, especially in security-critical use cases. Testing is a fundamental practice that is widely adopted to ensure the performance of neural networks, including GNNs. However, like traditional deep neural networks (DNNs), GNN testing suffers from the lack of automated testing oracles, which necessitates the manual labeling of test inputs. This labeling process can require significant human effort, especially for large and complex graphs. Moreover, in certain specialized domains, such as protein interface prediction [62] in drug discovery, labeling relies heavily on domain-specific knowledge, further increasing its cost.
Prior works [6, 26, 44, 81] have focused on test prioritization to relieve the labeling-cost problem for DNNs. Test prioritization approaches aim to prioritize test inputs that are more likely to be misclassified (i.e., fault-revealing test inputs) so that such inputs can be identified earlier to reveal system bugs. Existing approaches are mainly divided into two categories: coverage-based and confidence-based test prioritization approaches. Coverage-based approaches prioritize test inputs based on neuron coverage by adapting coverage-based prioritization methods from traditional software testing [51, 92]. Confidence-based approaches assume that test inputs for which the model is less confident are more likely to be misclassified and thus should be prioritized higher. Feng et al. [26] proposed the state-of-the-art confidence-based approach DeepGini, which considers a test input more likely to be misclassified by a DNN model if the model outputs similar prediction probabilities for each class. More recently, Wang et al. [81] proposed PRIMA, which leverages mutation analysis and learning-to-rank methods to prioritize test inputs for DNNs. However, despite its effectiveness in DNN test prioritization, PRIMA cannot be directly applied to GNNs, since its mutation operators are not adapted to graph-structured data and GNN models.
Furthermore, existing studies [36] have focused on metrics for data selection (e.g., margin and least confidence), which can also be used to detect possibly misclassified test data. Although the aforementioned approaches have been demonstrated to be effective for DNN models in some cases, they have the following limitations when applied to GNN models:
–
First, to the best of our knowledge, current coverage-based approaches do not provide interfaces for GNN models and thus cannot be directly applied. Moreover, existing research [26] has demonstrated that coverage-based approaches are less effective than confidence-based approaches.
–
Second, despite the effectiveness of confidence-based approaches on traditional DNNs, they do not take into account the interdependencies between test inputs of GNNs, which are particularly crucial for GNN inference. In other words, GNN test inputs are typically represented as graph-structured data consisting of nodes and edges, while confidence-based prioritization approaches usually deal with test sets in which each test is independent and has no connections with others.
–
Third, the effectiveness of uncertainty-based metrics can be limited when facing certain adversarial attacks. If an attack aims to generate test inputs that maximize the probability of incorrect classification, the utility of uncertainty metrics diminishes: their underlying assumption is that the more uncertain a model is about classifying a test, the more likely that test is to be misclassified. Under such attacks, however, a model can be confident about a test that nonetheless has a high probability of being misclassified.
To overcome the aforementioned problems, in this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of test prioritization approaches specifically designed for GNNs. GraphPrior identifies and prioritizes possibly misclassified test inputs via mutation analysis. Given a test set for a GNN model, GraphPrior regards a test input that kills more mutated models (i.e., slightly changed variants of the original GNN model) as more likely to be misclassified. Here, a test input kills a mutated model if the mutated model and the original GNN model produce different prediction results for it. To this end, we design a set of mutation rules that generate mutated models specifically for GNNs by slightly changing the training parameters of the original model. After obtaining the mutation results of each test input, GraphPrior introduces several ranking models (ML/DL models) [5, 42, 83] to rank the test set. The working principle of GraphPrior is inspired by mutation testing research, as realized for both model-based [1, 18, 63] and code-based [2, 17, 64] testing. The key underlying principle in all cases is that test cases that distinguish the behavior of mutants from that of the original artifact are useful and more likely to detect other underlying faults [1, 9, 63].
While both GraphPrior and PRIMA (the state-of-the-art DNN test prioritization approach) use mutation analysis, GraphPrior differs from PRIMA in terms of its mutation rules, feature generation, and ranking models: (1) GraphPrior’s mutation rules can directly or indirectly affect the message passing between nodes in graph data. In contrast, the mutation rules of PRIMA are designed for traditional DNNs, where the test inputs are independent, and therefore its mutation rules do not affect the relationships between tests; (2) GraphPrior generates a mutation feature vector for each test input based on its mutation results, where the \(i\)th element in the vector denotes whether the \(i\)th mutated model is killed by this input. This feature generation strategy is intuitive and reproducible. In addition, it exhibits several other advantages. First, by using binary indicators (1 or 0) as elements of the mutation feature vector, the information is transformed into a concise vector representation. Second, the fine-grained nature of the mutation feature vector allows for a detailed analysis of the effects of individual mutations. In particular, further analysis can be conducted to assess the contribution of each mutated model to GraphPrior. By tracing the top critical mutated models back to their corresponding mutation rules, we can gain insights into which mutation rules contribute more to GraphPrior. The experimental results demonstrate its effectiveness; (3) GraphPrior employs five ranking models and compares their effectiveness in utilizing mutation features for test prioritization, while PRIMA uses only a single ranking model. By comparing multiple ranking models, GraphPrior can identify the optimal ranking model for learning mutation features in test prioritization.
GraphPrior has broad applicability across a wide range of contexts, including software development, scientific research, and financial systems. For instance, GraphPrior can be employed to gain insights into the vulnerabilities of GNN models used in financial transaction fraud detection. In this specific context, where nodes represent accounts and edges represent transaction transfers, the first step is to utilize the GNN model under test to identify a group of potentially fraudulent accounts. Subsequently, these identified accounts serve as test inputs for GraphPrior. By prioritizing accounts that are more likely to be misclassified by the model (i.e., accounts falsely classified as fraudulent), GraphPrior places them at the top of the recommendation list. Consequently, by labeling and analyzing these bug-revealing tests earlier, the fraud analysis team can unveil the bugs and vulnerabilities of the GNN model more efficiently.
It is important to note that GraphPrior is specifically designed for GNNs, and its impact on DNNs has not been evaluated. This is because, in graph datasets, nodes are interconnected, and the mutation rules of GraphPrior can directly or indirectly affect the message passing between nodes in the prediction process. In contrast, in traditional DNNs, each sample in a dataset is typically independent, and as a result, such mutation rules are unlikely to affect the transmission of information between tests. Therefore, the effectiveness of GraphPrior’s mutation rules for DNNs remains uncertain, as no related experiments have been conducted to evaluate it.
We conducted an extensive study to evaluate the performance of GraphPrior based on 604 subjects. Here, a subject refers to a pair of a graph dataset and a GNN model. We compare GraphPrior with six uncertainty-based metrics [26, 80, 82] that can be used to prioritize possibly misclassified test inputs and adopt random selection as the baseline method. Our experimental results demonstrate that GraphPrior performs well across all subjects and outperforms the compared approaches, on average.
As mentioned before, one essential problem of confidence-based approaches is that adversarial attacks may lead to a model being more confident in an incorrect prediction, resulting in the failure of the approach. Therefore, we also evaluate GraphPrior on test inputs generated by graph adversarial attacks from existing studies [3, 48, 86, 100]. Furthermore, since the effectiveness of test prioritization methods may vary with the degree of the adversarial attack, we set different attack levels to generate adversarial data and compared GraphPrior with the baseline approaches. In addition, we compare the effectiveness of different mutation rules in generating top contributing mutated models, aiming to identify which mutation rules contribute more to each GNN model. In the last step, we investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining tests to improve a GNN model. Our experimental results demonstrate that GraphPrior achieves better effectiveness than the uncertainty-based test prioritization methods. We publish our dataset, results, and tools to the community on GitHub.
Our work has the following major contributions:
–
Approach. We propose GraphPrior, a set of mutation-based test prioritization approaches for GNNs. To this end, we design a set of mutation rules that mutate GNN models by slightly changing their training parameters. We carefully select ranking models to analyze the mutation results for effective test prioritization.
–
Study. We conduct an extensive study based on 604 GNN subjects involving natural and adversarial test sets. We compare GraphPrior with existing DNN approaches that could detect possibly misclassified test inputs. Our experimental results demonstrate the effectiveness of GraphPrior.
–
Mutation rule analysis. We compare the effectiveness of the GNN mutation rules in generating top contributing mutated models, observing that the mutation rule HC (i.e., mutating Hidden Channels) makes top contributions to most GNN models in test input prioritization.
3 Approach
3.1 Overview
In this article, we propose GraphPrior, a set of test prioritization approaches for GNNs. GraphPrior consists of six mutation-based test prioritization approaches: KMGP, LRGP, RFGP, LGGP, DNGP, and XGGP, which are discussed later in this section. We present the overview of GraphPrior in Figure 1, in which the input of GraphPrior is a GNN test set, and the output is the prioritized test set. Given a test set \(T\) for a GNN model \(G\), the implementation process of GraphPrior is as follows:
•
Generating mutants for the GNN model \(G\). First, GraphPrior generates mutated models (i.e., mutants) for the GNN model \(G\) based on carefully designed mutation rules (cf. Section 3.2).
•
Obtaining mutation results through killing mutants. For each test input, GraphPrior identifies which mutated models it kills. Here, a mutated model is killed by a test input if the prediction results of this input via the mutated model and the original model \(G\) are different. In this way, GraphPrior obtains the mutation result of each test input.
•
Generating feature vectors from the mutation results. For each test input, GraphPrior generates a mutation feature vector for it based on its mutation results. The \(i\)th element of this feature vector denotes whether this input kills the \(i\)th mutated model. More specifically, given a test input \(t \in T\), if \(t\) kills a mutated model \(M_i\), then the \(i\)th element of \(t\)’s mutation feature vector is set to 1. Otherwise, the \(i\)th element is set to 0.
•
Ranking test inputs based on mutation feature vectors via ranking models. GraphPrior utilizes ranking models [5, 42, 83] to calculate a misclassification score for each test input based on its feature vector. This score indicates how likely a test input is to be misclassified by the GNN model. Finally, GraphPrior ranks the test inputs by their misclassification scores in descending order and outputs the prioritized test set \(T^{\prime }\). (A minimal code sketch of this workflow is given after this list.)
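To make this workflow concrete, the sketch below derives mutation results and mutation feature vectors from an original model and its mutants. It assumes PyTorch Geometric-style node classification models invoked as `model(data.x, data.edge_index)`; the helper names are illustrative rather than part of GraphPrior's released code.

```python
import torch

def predict_labels(model, data):
    """Return the predicted label of every node."""
    model.eval()
    with torch.no_grad():
        out = model(data.x, data.edge_index)      # class logits per node
    return out.argmax(dim=-1)                     # predicted labels

def mutation_feature_vectors(original_model, mutated_models, data, test_mask):
    """Build one binary mutation feature vector per test input.

    Element k of a vector is 1 if the k-th mutated model is killed by the
    input, i.e., its prediction differs from the original model's prediction.
    """
    orig_pred = predict_labels(original_model, data)[test_mask]
    features = []
    for mutant in mutated_models:
        mut_pred = predict_labels(mutant, data)[test_mask]
        features.append((mut_pred != orig_pred).long())   # 1 = killed
    # Shape: [num_test_inputs, num_mutated_models]
    return torch.stack(features, dim=1)
```

The resulting matrix feeds either the killing-based approach (by summing each row) or the feature-based approaches (as input to a ranking model), as described below.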
3.2 Mutation Rules
In GraphPrior, mutation rules are employed to generate mutated models of a GNN model by making slight changes to its training parameters. We select the following parameters because they can impact the message passing in the GNN prediction process. More specifically, in the mutated GNN model, the manner in which nodes acquire information from their neighboring nodes differs slightly from that of the original GNN model. Although variants of a GNN can be obtained even without changing training parameters, the resulting model mutants cannot produce meaningful differences in the GNN model’s behavior. By changing the selected training parameters to generate mutants, we intentionally introduce meaningful modifications to the model’s behavior in terms of the interdependencies between nodes during the prediction process. We present all the mutation rules of GraphPrior as follows:
–
Self Loops (SL) [45, 79]. SL is a Boolean parameter, which controls whether to add self-loops to the input graph. When the SL parameter is set to True, self-loops are introduced to each node in the graph. By incorporating self-loops, the inherent information of nodes can be effectively aggregated into their representation vectors, leading to a change in the weighting of their neighboring nodes, and thus affecting the interdependence of nodes in the prediction process.
–
Bias (BIA) [30, 45, 79]. BIA is a Boolean parameter, which determines whether to introduce a predetermined offset to the representation vectors of nodes. When the BIA parameter is enabled (set to True), each node will be assigned a corresponding bias parameter to its representation vector, allowing the GNN model to better capture the inherent properties of the graph and improve the interdependence between nodes in the prediction process.
–
Cached (CA) [45]. CA is a Boolean parameter that controls whether to cache part of the message passing computation of node embeddings on the first forward pass. When the CA parameter is set to True, the cached computation is reused in subsequent executions to save computation time. Caching the computation of node embeddings can affect the interdependence between nodes by altering the order and efficiency of message passing.
–
Improved (IMP) [45]. IMP is a Boolean parameter that controls whether to use the improved message passing strategy, thus affecting the interdependence between nodes in the prediction process.
–
Normalize (NOR) [21, 30]. NOR is a Boolean parameter, which determines whether to normalize the messages passed between nodes in the prediction process. When this parameter is set to “True,” the messages are normalized by the number of neighbors that a node has before being passed to the next layer. This normalization can impact the contribution of each neighbor to the node’s final representation, thus affecting the message passing between nodes in the prediction process.
–
Concat (CON) [79]. CON is a Boolean parameter, which controls how the representations of neighboring nodes are combined during message passing. When it is set to True, the representations of neighboring nodes are concatenated before being passed, resulting in a more expressive representation of the nodes, enabling the GNN to capture more nuanced interdependencies between them.
–
Heads (HDS) [79]. HDS is an integer parameter that determines the number of attention heads used in multi-head attention. Increasing the number of heads allows the model to capture more complex interdependence among nodes in the graph. Each attention head can focus on a different aspect of the node neighborhood, enabling the model to learn different representations of the graph.
–
Epoch (EP) [21, 30, 79]. EP is an integer parameter that controls the number of times a GNN model iterates over the training dataset. By increasing the number of epochs, a GNN model can better capture the interdependence between nodes for model inference.
–
Hidden Channel (HC) [21, 30, 45, 79]. HC is an integer parameter, which controls the dimensionality of the hidden representation in each layer of the GNN. Therefore, changing this parameter can impact the interdependence between nodes in a graph by enabling the GNN to learn more expressive node embeddings.
–
Negative Slope (NS) [79]. NS is a float parameter, which controls the slope of the negative part of the activation function used in the Gated Linear Unit (GLU) operation. GLU is a common non-linear function used in GNNs for message passing. Specifically, the GLU operation is used to combine the node features with the weighted sum of their neighboring nodes’ features, which is the message passed between nodes in the graph. The negative slope parameter determines the slope of the activation function for negative input values in the GLU operation, thus impacting the message passing between nodes.
Based on the above mutation rules, for a given test set and a GNN model, GraphPrior generates \(N\) mutated models of the original model. We consider that a test input kills a mutated model if the predictions for this input via the mutated model and the original GNN model are different. Based on this, GraphPrior obtains the mutation results of all the test inputs.
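As an illustration of how such mutants can be obtained, the sketch below flips one Boolean training parameter of the original configuration and retrains. The `build_and_train` helper is hypothetical and stands in for the project's actual training routine; the parameter names mirror the Boolean rules listed above, and the integer/float rules are sampled analogously (cf. Section 4.6).

```python
import copy
import random

# Boolean training parameters targeted by the mutation rules above.
BOOLEAN_RULES = ["add_self_loops", "bias", "cached", "improved", "normalize", "concat"]

def generate_boolean_mutant(original_config, rule, build_and_train):
    """Flip one Boolean parameter of the original configuration and retrain.

    `original_config` is a dict of training parameters, and `build_and_train`
    is a hypothetical function that trains a GNN from such a configuration.
    """
    mutated_config = copy.deepcopy(original_config)
    mutated_config[rule] = not mutated_config[rule]   # True -> False, False -> True
    return build_and_train(mutated_config)

def generate_mutants(original_config, build_and_train, n_mutants):
    """Generate a pool of mutated models by repeatedly applying Boolean rules."""
    return [
        generate_boolean_mutant(original_config, random.choice(BOOLEAN_RULES), build_and_train)
        for _ in range(n_mutants)
    ]
```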
Considering that the primary objective of generating mutated models is to obtain informative features for test prioritization, a statistical analysis is employed to validate their effectiveness. To achieve this, a series of repeated experiments are conducted, as outlined in Section 5. The results of these experiments demonstrate that GraphPrior’s effectiveness is statistically significant, thereby confirming the statistical validity of the generated mutated models for the purpose of test prioritization.
3.3 Killing-based GraphPrior
This section presents the workflow of KMGP, the Killing Mutants-based GNN Test Prioritization approach. KMGP operates on a “killing-based” principle, whereby test inputs that kill more mutated models are considered more likely to be misclassified and are prioritized higher. Notably, KMGP assigns equal importance to each mutated model in the process of test prioritization, which distinguishes it from the feature-based approaches elaborated in subsequent sections. Given a GNN model \(G\) and a test input set \(T=\left\lbrace t_1, t_2, \ldots , t_n\right\rbrace\), the execution of KMGP can be divided into three key stages: mutation generation, killing-based mutation analysis, and test prioritization.
Mutation generation. In the mutation generation stage, a group of mutated models \(\lbrace G^{\prime }_1, G^{\prime }_2, \ldots , G^{\prime }_N\rbrace\) is generated for the original GNN model \(G\).
Killing-based mutation analysis. This stage involves obtaining the mutation results of each test input \(t\in T\) using the process outlined in Section 3.2. Subsequently, KMGP counts the number of mutants killed by each test input based on their mutation results.
Test prioritization. In the third stage, KMGP prioritizes all the test inputs in \(T\) based on the number of mutated models they killed, with those that kill more mutants being prioritized higher in the test sequence.
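A minimal sketch of this killing-based ranking is given below; it consumes the binary kill matrix described in Section 3.1 and orders test inputs by their kill counts (all names are illustrative).

```python
import torch

def kmgp_prioritize(feature_matrix):
    """Rank test inputs by the number of mutated models they kill.

    `feature_matrix` is the binary kill matrix of shape
    [num_test_inputs, num_mutated_models] from Section 3.1.
    Returns test-input indices, most suspicious first.
    """
    kill_counts = feature_matrix.sum(dim=1)              # kills per test input
    order = torch.argsort(kill_counts, descending=True)  # more kills -> higher priority
    return order

# Example: order = kmgp_prioritize(features); prioritized_tests = [tests[i] for i in order]
```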
3.4 Feature-based GraphPrior
In comparison to the killing-based GraphPrior approach, the feature-based approaches are characterized by automatic mutation feature analysis. This process involves generating mutation feature vectors based on the execution of the mutated models, followed by the use of ranking models (ML/DL models), which assign different importance to each mutated model for test prioritization.
Overall, the feature-based approaches’ workflow entails three key stages: mutated model generation, mutation feature generation, and learning-to-rank.
❶
Mutated model generation. Given a GNN model \(G\) and a test set \(T\), during the first stage, the feature-based approaches generate a group of mutated models (denoted as \(\lbrace G^{\prime }_1, G^{\prime }_2, \ldots , G^{\prime }_N\rbrace\)) of the GNN model \(G\) based on the mutation rules specified in Section 3.2.
❷
Mutation feature generation. Subsequently, the feature-based approaches associate a feature vector \(V_t\) of size \(N\) with each test input \(t\), where \(N\) represents the number of mutated models, and \(v_k (=V_t[k])\) maps to the execution output for the mutated model \(G^{\prime }_k\). If \(t\) kills the mutated model \(G^{\prime }_k\) (i.e., the prediction results for \(t\) via the mutated models \(G^{\prime }_k\) and the original model \(G\) are different), then \(v_{k}\) is set to 1. Otherwise, it is set to 0.
❸
Learning-to-rank. In the final stage, the feature-based approaches input the mutation features of each test input to a ranking model (ML/DL models) [5, 15, 42, 78, 83]. The ranking models can automatically learn different importance for each mutation feature and output misclassification scores. Here, each mutation feature corresponds to the execution result of a mutated model, so the ranking models effectively learn the importance of each mutated model for test prioritization. Finally, the feature-based approaches rank all the test inputs by their misclassification scores in descending order.
In our study, we propose five feature-based GraphPrior approaches, which follow a similar workflow to the one described above but leverage different ranking models. These five approaches are XGGP (XGBoost-based GNN Test Prioritization), LRGP (Logistic Regression-based GNN Test Prioritization), LGGP (LightGBM-based GNN Test Prioritization), RFGP (Random Forest-based GNN Test Prioritization), and DNGP (DNN-based GNN Test Prioritization). We briefly introduce the basic principle of the ranking models of these approaches as follows:
(1)
XGGP leverages the XGBoost algorithm [15] as the ranking model. XGBoost is a highly effective gradient boosting algorithm that combines decision trees to enhance the accuracy of predictions. XGGP utilizes the XGBoost algorithm to predict the misclassification score for a given test input based on its mutation features. This score reflects the likelihood that the input will be misclassified by a GNN model.
(2)
LRGP leverages the Logistic Regression algorithm [83] as the ranking model. Logistic regression leverages a logistic function to model the association between a categorical dependent variable and one or more independent variables.
(3)
LGGP leverages the LightGBM algorithm [42] as the ranking model. LightGBM is a gradient boosting framework that employs tree-based learning algorithms. The fundamental principle of LightGBM is similar to that of XGBoost, which employs decision trees based on learning algorithms. However, LightGBM introduces a novel optimization in the framework, with a primary focus on enhancing the speed of model training.
(4)
RFGP leverages the random forest algorithm [5] as the ranking model. Random Forest is an ensemble learning algorithm that constructs multiple decision trees using random subsets of the training data and input features. The predictions from individual trees are combined to produce the final prediction using averaging or voting.
(5)
DNGP leverages a DNN model [78] as the ranking model. The DNN model can learn to rank test inputs based on their mutation features. After training, the DNN model can generate a score that reflects their misclassification probability. This score can then be used to rank test inputs in a test set.
Compared to the mutation features of PRIMA, the distinctive aspect of GraphPrior’s mutation features lies in the mutation rules they are built on, which are specifically designed for GNNs. These mutation rules have the potential to directly or indirectly impact the message passing mechanism between nodes in graph data. Our experimental results in Section 5 demonstrate the effectiveness of the feature-based GraphPrior approaches. The observed effectiveness can be attributed, in part, to the selection of mutation rules and ranking models. Specifically, our mutation rules are designed to generate informative mutation features by changing the message passing between nodes in the GNN prediction process. Furthermore, our ranking models are able to utilize these mutation features effectively for test prioritization. After sufficient training, the ranking models output a misclassification score that indicates how likely a sample is to be misclassified based on its mutation features. A score closer to 1 indicates a higher probability of misclassification. By sorting the misclassification scores of test inputs in descending order, the feature-based GraphPrior approaches can effectively prioritize tests that are more likely to be misclassified.
3.5 Usage of GraphPrior
By utilizing ranking models, GraphPrior predicts a misclassification score for each test input within a given test set. These predicted scores are then utilized for test prioritization, whereby test inputs with higher scores are prioritized higher. Notably, the ranking models are pre-trained before the execution of GraphPrior. The training process is standardized across the different ranking models and follows a consistent set of procedures, which are presented in detail below.
❶
Splitting datasets. Given a GNN model \(G\) with a dataset \(T\), we first split \(T\) into two partitions, the training set \(R\) and the test set, in a 7:3 ratio [61]. The test set remains untouched for the purpose of evaluating GraphPrior.
❷
Constructing the training set for ranking models. Based on the training set \(R\), we aim to build a training set \(R^{\prime }\) for training the ranking models. First, we generate a group of mutated models and obtain, for each input \(r_i \in R\), its mutation feature vector \(V_i\) (i.e., a one-dimensional vector in which the \(k\)th element denotes whether the \(k\)th mutated model is killed by this input). The mutation feature vectors are used to build the training set \(R^{\prime }\) (i.e., the training set of the ranking models). Second, we let the original GNN model \(G\) classify each input \(r_i \in R\) and compare the prediction with the ground truth of \(r_i\). In this way, we identify whether \(r_i\) is misclassified by the GNN model \(G\). If \(r_i\) is misclassified by \(G\), then we label it as 1; otherwise, we label it as 0. In this way, we build the ranking model training set \(R^{\prime }\).
❸
Training ranking models. Based on \(R^{\prime }\), we train the ranking models. Upon the completion of the training process, the ranking model is capable of receiving the mutation feature vector of a test input as an input and producing a misclassification score as an output. This score serves as an indicator of the probability of the test input being incorrectly classified by the GNN model.
It is worth noting that the original labels of the training set \(R^{\prime }\) are binary (i.e., 1 or 0), but the well-trained ranking models can output continuous values (i.e., the misclassification scores). To achieve this, we make some adaptations when implementing the adopted ranking algorithms (e.g., random forest and XGBoost). Although these ranking algorithms originally deal with classification tasks, an intermediate value is calculated for the classification. For example, if the intermediate value exceeds 0.5 (a default threshold, which can be adjusted), then the input is classified into the first category; otherwise, into the other category. Here, after training, we let the ranking models directly output this intermediate value, as it indicates the likelihood of a test input being misclassified by the GNN model, where a higher value implies a greater likelihood of misclassification. We call this intermediate value the “misclassification score” and leverage the scores of test inputs to rank them.
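The sketch below illustrates this training-and-scoring procedure with a random forest ranking model (as in RFGP) using scikit-learn. Treating the positive-class probability from `predict_proba` as the intermediate value described above is our reading of the text, and the helper names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ranking_model(train_features, train_misclassified):
    """Train a ranking model on mutation feature vectors.

    `train_features`: [num_train_inputs, num_mutants] binary kill matrix.
    `train_misclassified`: 1 if the original GNN misclassified the input, else 0.
    """
    model = RandomForestClassifier(n_estimators=100)  # n_estimators as in Section 4.6
    model.fit(train_features, train_misclassified)
    return model

def prioritize(model, test_features):
    """Score and rank test inputs by their predicted misclassification probability."""
    # Probability of class 1 (misclassified) serves as the misclassification score;
    # this assumes both classes 0 and 1 appear in the training labels.
    scores = model.predict_proba(test_features)[:, 1]
    return np.argsort(-scores), scores   # indices in descending score order
```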
4 Study Design
4.1 Research Questions
Our experimental evaluation answers the research questions below.
–
RQ1: How does the killing-based GraphPrior approach perform in prioritizing test inputs for GNNs?
In terms of test prioritization for GNNs, existing prioritization approaches usually do not take into account the interdependencies between nodes (tests) in a graph (test set). To fill the gap, we propose GraphPrior, which contains six GNN-oriented test prioritization approaches. Among them, KMGP is a killing-based approach, which regards a test input that kills more mutants as more likely to be misclassified. In this research question, we evaluate the effectiveness of the killing-based KMGP by comparing it with existing approaches that have been demonstrated as effective in detecting possibly misclassified test inputs.
–
RQ2: How do the feature-based GraphPrior approaches perform in GNN test prioritization?
In addition to the killing-based KMGP, GraphPrior involves five feature-based approaches. The core difference is that the killing-based approach regards the importance of each mutated model as equal, while the feature-based approaches learn different importance for each mutated model for test prioritization. More specifically, feature-based approaches extract features from mutation results and adopt ranking models [5, 42, 83] to utilize the mutation features for test prioritization. In this research question, we compare the effectiveness of the killing-based and feature-based approaches to investigate the effect of ranking models in leveraging mutation results.
–
RQ3: How does GraphPrior perform on test inputs generated from graph adversarial attacks?
When faced with graph adversarial attacks, confidence-based test prioritization approaches may be fooled, thus becoming more confident in incorrect predictions. Therefore, we evaluate to what extent the effectiveness of GraphPrior is affected by graph adversarial attacks. We compare GraphPrior and confidence-based approaches [26, 36] on test inputs generated by graph adversarial attacks from existing studies [3, 48, 86, 100] to demonstrate its effectiveness.
–
RQ4: How does GraphPrior perform against different levels of graph adversarial attacks?
In this research question, we investigate the effectiveness of GraphPrior against different levels of graph adversarial attacks. To answer this research question, we set different levels of attacks to generate test inputs and compare GraphPrior with existing approaches to demonstrate its effectiveness.
–
RQ5: Which mutation rules generate more top contributing GNN mutants?
We investigate the contribution of each mutation rule in generating effective mutants of GNNs. For each GNN model, we select its top contributing mutation features through the XGBoost ranking algorithm [15], which is an optimized ML algorithm for ranking tasks based on gradient boosting. We match each selected feature with the corresponding GNN mutant and identify the mutation rule that generated it. In this way, we determine which mutation rules generate more top contributing mutants for test prioritization.
–
RQ6: Can GraphPrior and the uncertainty-based metrics be used in active learning scenarios to improve a GNN model by retraining?
In the face of a large number of unlabeled inputs and a limited time budget, it is not feasible to manually label all the inputs and use them to retrain a GNN. One established solution to reduce data labeling costs is active learning [67], which involves selecting informative subsets of training samples to improve model performance. In this research question, we investigate the effectiveness of GraphPrior and the uncertainty-based metrics in selecting informative retraining inputs to improve the quality of a GNN model.
4.2 GNN Models and Datasets
In our study, we adopt a total of 604 subjects to evaluate the effectiveness of GraphPrior and the compared approaches [26, 36]. Table 1 exhibits their basic information. Among the 604 subjects considered in this study, 16 subjects were utilized in the experiments of RQ1, 16 subjects in RQ2, 108 subjects in RQ3, 432 subjects in RQ4, 16 subjects in RQ5, and 16 subjects in RQ6. It is worth noting that, among these subjects, a total of 64 subjects (which were utilized in RQ1, RQ2, RQ5, and RQ6) were associated with clean datasets, while the remaining 540 subjects (which were utilized in RQ3 and RQ4) were associated with adversarial datasets.
Our study involves four GNN models: GCN (Graph Convolutional Networks) [45], GAT (Graph Attention Networks) [79], GraphSAGE (Graph SAmple and aggreGatE) [30], and TAGCN (Topology Adaptive Graph Convolutional Network) [21], tested on four datasets, namely Cora [88], CiteSeer [88], PubMed [88], and LastFM [70]. We present their descriptions as follows:
4.2.1 GNN Models.
–
GCN [45]. GCN is a class of convolutional neural networks that can work directly on graphs. It solves the problem of classifying nodes (such as documents) in graphs (such as citation networks) where only a small number of nodes are labeled. The core idea of GCN is to use the edge information of a graph to aggregate node information to generate new node representations. GCN has been used in several existing studies [31, 35, 89].
–
GAT [79]. GAT introduces a self-attention mechanism in the propagation process. Compared to GCN, which regards all neighbors of a node equally, the attention mechanism assigns different attention scores to each neighbor, thereby identifying more important neighbors.
–
GraphSAGE [30]. GraphSAGE is a generalized inductive framework that generates node embeddings by sampling and aggregating features of neighbor nodes.
–
TAGCN [21]. TAGCN introduces a systematic approach to design a set of fixed-size learnable filters to perform convolutions on graphs. These filters are adaptive to the topology of the graph as they scan it for convolution.
4.2.2 Datasets.
–
Cora [88]. The Cora dataset is a citation graph composed of 2,708 scientific publications (nodes) and 5,429 links (edges) between them. Nodes represent ML papers, and edges represent citations between pairs of papers. Each paper is classified into one of seven classes, such as reinforcement learning and neural networks.
–
CiteSeer [88]. The CiteSeer dataset consists of 3,327 scientific publications (nodes) and 4,732 links (edges). Each paper belongs to one of six categories such as AI and ML.
–
PubMed [88]. The PubMed dataset contains 19,717 diabetes-related scientific publications (nodes) and 44,338 links (edges). Publications are classified into three classes such as Cancer and AIDS (i.e., Acquired Immune Deficiency Syndrome).
–
LastFM Asia Social Network [70]. The dataset LastFM Asia Social Network was collected from the social network of users on the Last.fm music platform in Asia. Nodes are LastFM users, and edges are mutual follower relationships between them. LastFM contains 7,624 nodes and 27,806 edges. The classification task of the LastFM dataset is to predict the home country of a user (e.g., Philippines, Malaysia, Singapore). (A sketch showing one way to load these datasets is given after this list.)
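As a convenience, the sketch below shows one way to load these benchmark graphs with PyTorch Geometric; it assumes a recent torch_geometric installation that ships the Planetoid and LastFMAsia dataset classes and is not part of GraphPrior's released artifacts.

```python
from torch_geometric.datasets import Planetoid, LastFMAsia

def load_dataset(name, root="data"):
    """Load one of the four benchmark graphs as a single torch_geometric Data object."""
    if name in ("Cora", "CiteSeer", "PubMed"):
        dataset = Planetoid(root=f"{root}/{name}", name=name)   # citation graphs [88]
    elif name == "LastFM":
        dataset = LastFMAsia(root=f"{root}/LastFMAsia")         # social network [70]
    else:
        raise ValueError(f"unknown dataset: {name}")
    return dataset[0]   # node features, edge_index, and labels

# Example: data = load_dataset("Cora"); print(data.num_nodes, data.num_edges)
```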
Notably, we evaluate GraphPrior on different types of test inputs (i.e., both natural test inputs and adversarial test inputs). We adopted eight graph adversarial attacks, presented in Section 4.4.
4.3 Compared Approaches
In our study, we considered seven compared approaches in total, including one baseline (i.e., random selection), four DNN test prioritization approaches, and two active learning approaches. We selected these approaches for the following reasons: (1) they can be adapted for GNN test prioritization; (2) they have been demonstrated to be effective for DNNs in existing studies [26, 36, 82]; (3) their implementations have been released by the authors. A small sketch that computes the uncertainty scores used by these approaches is provided after the list below.
–
DeepGini. DeepGini [26] prioritizes test inputs based on model confidence. It leverages the Gini coefficient to measure the likelihood of a test input being misclassified, calculated by Equation (1):
\[
\xi (x) = 1-\sum _{i=1}^{N} p_i(x)^2, \qquad (1)
\]
where \(\xi (x)\) refers to the likelihood of the test input \(x\) being misclassified, \(p_i(x)\) refers to the probability that the test input \(x\) is predicted to be label \(i\), and \(N\) refers to the number of labels.
–
Margin. Margin [80] regards a test input with a smaller difference between the top two most confident predictions as more likely to be misclassified. The margin score is calculated by Equation (2):
\[
M(x) = p_{k}(x)-p_{j}(x), \qquad (2)
\]
where \(M(x)\) refers to the margin score, \(p_{k}(x)\) refers to the most confident prediction probability, and \(p_{j}(x)\) refers to the second most confident prediction probability.
–
Least Confidence. Least Confidence [80] regards test inputs for which the model has the least confidence as more likely to be misclassified. Least confidence is calculated by Equation (3):
\[
L(x) = \max _{i=1,\ldots ,N} p_i(x), \qquad (3)
\]
where \(L(x)\) refers to the confidence score and \(p_i(x)\) refers to the probability that the test input \(x\) is predicted to be label \(i\) via a model \(M\). Test inputs with lower \(L(x)\) are prioritized higher.
–
Vanilla Softmax. Vanilla Softmax [82] is computed by subtracting the highest activation probability in the output softmax layer from 1, resulting in a metric that is positively correlated with the misclassification probability. Equation (4) presents the calculation of the Vanilla Softmax metric:
\[
V(x) = 1-\max _{c=1,\ldots ,N} l_c(x), \qquad (4)
\]
where \(l_c(x)\) belongs to a valid softmax array in which all values are between 0 and 1 and their sum is 1.
–
Prediction-Confidence Score (PCS). PCS [82] calculates the difference between the predicted class and the second most confident class in softmax likelihood.
–
Entropy. Entropy [82] calculates the entropy of the softmax likelihood.
–
Random selection [22]. In random selection, the execution order of the test inputs is determined randomly.
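For reference, the following sketch computes the uncertainty scores used by the confidence-based baselines from a softmax probability vector, following the descriptions and equations above; note that Margin, Least Confidence, and PCS prioritize smaller values first, while the other metrics prioritize larger values first.

```python
import numpy as np

def uncertainty_scores(probs):
    """Compute the baseline uncertainty metrics for one softmax vector `probs`."""
    p_sorted = np.sort(probs)[::-1]                        # probabilities, descending
    return {
        "deepgini": 1.0 - np.sum(probs ** 2),              # Equation (1), larger first
        "margin": p_sorted[0] - p_sorted[1],               # Equation (2), smaller first
        "least_confidence": p_sorted[0],                   # Equation (3), smaller first
        "vanilla_softmax": 1.0 - p_sorted[0],               # Equation (4), larger first
        "pcs": p_sorted[0] - p_sorted[1],                  # top-1 minus top-2, smaller first
        "entropy": -np.sum(probs * np.log(probs + 1e-12)),  # larger first
    }

# Example: uncertainty_scores(np.array([0.5, 0.3, 0.2]))
```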
4.4 Graph Adversarial Attacks
In RQ3 and RQ4, we evaluate the effectiveness of GraphPrior on test inputs generated through diverse graph adversarial attacks, in which attackers aim to generate graph adversarial perturbations by manipulating the graph structure or node features to fool the GNN models. We introduce all the attacks we applied in our experiments as follows:
–
Disconnect Internally, Connect Externally (DICE) [100]. The DICE attack is a type of white-box attack whereby the adversary has access to all information about the targeted GNN model, including its parameters, training data, labels, and predictions. Specifically, the DICE attack randomly adds edges between nodes with different labels or removes edges between nodes sharing the same label. Through this, the attack can generate adversarial perturbations that can fool the targeted GNN model. (A simplified sketch of this perturbation strategy is given after this list.)
–
PGD attack [86]. The PGD attack leverages the Projected Gradient Descent (PGD) algorithm to search for optimal structural perturbations to attack GNNs.
–
Min-max attack (MMA) [86]. The min-max attack is a type of untargeted white-box GNN attack. The attack problem is formulated as a min-max problem, where the inner maximization updates the model’s parameters (\(\theta\)) by maximizing the attack loss and can be solved using gradient ascent, while the outer minimization can be achieved by using PGD [59].
–
Node Embedding Attack-Add (NEAA) [3]. In node embedding attack-add, the attackers are capable of modifying the original graph structure by adding new edges while adhering to a predefined budget constraint.
–
Node Embedding Attack-Remove (NEAR) [3]. In node embedding attack-remove, the attackers modify the original graph structure by removing edges.
–
Random Attack-Add (RAA) [48]. The Random Attack-Add approach randomly adds edges to the input graph to fool the targeted GNN model.
–
Random Attack-Flip (RAF) [48]. The Random Attack-Flip approach randomly flips edges in the input graph to fool the targeted GNN model.
–
Random Attack-Remove (RAR) [48]. The Random Attack-Remove approach randomly removes edges from the input graph to fool the targeted GNN model.
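To illustrate how such structural perturbations operate, the sketch below applies the DICE strategy described above (removing edges between same-label nodes or adding edges between different-label nodes) to an undirected edge list; it is a simplified stand-in for the attack implementations used in the experiments.

```python
import random

def dice_attack(edges, labels, n_perturbations):
    """Apply DICE-style perturbations to an undirected edge list.

    `edges` is a set of (u, v) pairs with u < v, and `labels[i]` is node i's label.
    Each step either removes an edge between same-label nodes or adds an
    edge between different-label nodes.
    """
    edges = set(edges)
    nodes = list(range(len(labels)))
    for _ in range(n_perturbations):
        if random.random() < 0.5:
            # Disconnect internally: drop an edge whose endpoints share a label.
            same = [e for e in edges if labels[e[0]] == labels[e[1]]]
            if same:
                edges.remove(random.choice(same))
                continue
        # Connect externally: add an edge between nodes with different labels.
        while True:
            u, v = random.sample(nodes, 2)
            u, v = min(u, v), max(u, v)
            if labels[u] != labels[v] and (u, v) not in edges:
                edges.add((u, v))
                break
    return edges
```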
4.5 Evaluation of Mutation Rules (RQ5)
In RQ5, we investigated the contribution of different mutation rules in generating top contributing mutated models. First, for each GNN model, we utilize the cover metric in XGBoost [15] to evaluate the importance of its mutation features and rank them in descending order of the importance scores. The cover metric evaluates the importance of mutation features by quantifying the average coverage of each instance by the leaf nodes in a decision tree. Specifically, it calculates the number of times a particular feature is used to split the data across all trees in the ensemble and then sums up the coverage values for each feature over all trees. This coverage value is then normalized by the total number of instances to obtain the average coverage of each instance by the leaf nodes. The importance of a feature is then calculated based on its coverage value, and features with higher coverage values are considered more important.
Upon obtaining the importance of each mutation feature, which corresponds to a specific mutated model, we match and determine the importance of the respective mutated models. Subsequently, we select the top N critical mutated models and identify the specific mutation rules employed in their generation. This enables a comparative analysis of the contributions of various mutation rules.
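A minimal sketch of this analysis with the xgboost package is shown below. It assumes the kill matrix and misclassification labels from Section 3.5, and the `mutant_rules` bookkeeping list that maps each mutant back to its mutation rule is illustrative.

```python
import xgboost as xgb

def top_contributing_rules(features, labels, mutant_rules, top_n=10):
    """Rank mutated models by the XGBoost 'cover' importance of their features.

    `features`: [num_inputs, num_mutants] kill matrix; `labels`: 1 if misclassified.
    `mutant_rules[k]` records which mutation rule generated the k-th mutant.
    """
    booster = xgb.XGBClassifier(n_estimators=100).fit(features, labels).get_booster()
    cover = booster.get_score(importance_type="cover")      # e.g., {'f3': 12.4, ...}
    ranked = sorted(cover.items(), key=lambda kv: kv[1], reverse=True)
    # Map the top features back to the mutation rules that created the mutants.
    return [mutant_rules[int(name[1:])] for name, _ in ranked[:top_n]]
```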
4.6 Implementation and Configuration
We implemented GraphPrior in Python based on the PyTorch 1.11.0 framework [65]. We also integrate the available implementations of the compared approaches [26, 57, 80, 82] into our experimental pipeline to adapt them to the GNN prioritization problem. Regarding our mutation rules, we set the number of mutated models to 80~240 across different subjects. Balancing the tradeoff between execution time and the effectiveness of GraphPrior is a critical consideration in determining the number of mutants. Building on relevant literature [81], we identified a suitable range of mutants. Our preliminary investigations on multiple subjects demonstrate that these settings effectively maintain the effectiveness of GraphPrior while controlling the runtime within a reasonable range. For subjects associated with longer mutant generation times, we generate a comparatively smaller number of mutants than for other subjects. Additionally, the range was achieved through the full execution of all pre-defined mutation rules. It is worth noting that the total number of mutation rules was predetermined and fixed. Thus, even with the addition of new mutants, the impact on the performance of GraphPrior is minor, as the new mutants are created based on the existing mutation rules.
With regard to the specific mutation rules that change integer/float training parameters, we define a parameter range close to the original parameter values to achieve slight mutations. We conducted a preliminary study using multiple subjects, demonstrating the effectiveness of such settings. Moreover, to obtain parameter values from the specified range, we adopt uniform sampling [56] as the sampling methodology. This technique ensures an equitable probability of selecting each value within the parameter range and has been widely adopted across the ML testing field [56, 60, 96].
More specifically, we set the hidden channel parameter in the range [15, 20), the epochs parameter to at most 50, the heads parameter to at most 5, and the negative slope parameter to at most 0.2. For the mutation rules that change Boolean parameters, if the parameter value of the original model is True, then we set it to False; if the original value is False, then we set it to True. The parameter ranges for our mutation rules are carefully selected to ensure the change to the original GNN model is slight.
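The following sketch shows how mutated values for the integer/float parameters might be drawn by uniform sampling within the ranges stated above; the helper and the lower bounds assumed for epochs and heads are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def sample_mutated_parameters(original):
    """Uniformly sample mutated values for the integer/float training parameters.

    Ranges follow Section 4.6: hidden channels in [15, 20), epochs <= 50,
    heads <= 5, and negative slope <= 0.2. `original` holds the unmutated values.
    """
    mutated = dict(original)
    mutated["hidden_channels"] = int(rng.integers(15, 20))       # [15, 20)
    mutated["epochs"] = int(rng.integers(1, 51))                 # at most 50
    mutated["heads"] = int(rng.integers(1, 6))                   # at most 5
    mutated["negative_slope"] = float(rng.uniform(0.0, 0.2))     # at most 0.2
    return mutated
```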
With respect to the configuration of the ranking models utilized in GraphPrior, we made several parameter selections: For the random forest, XGBoost, and LightGBM ranking algorithms, we set the n_estimators parameter to 100. For the DNN ranking model, we set the learning_rate parameter to 0.01. Finally, for the logistic regression ranking algorithm, we set the max_iter parameter to 100.
We conducted the following experiments on a high-performance computing cluster, where each cluster node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16 GB SXM2 GPU. For data processing, we conducted the corresponding experiments on a MacBook Pro laptop with macOS Big Sur 11.6, an Intel Core i9 CPU, and 64 GB RAM.
4.7 Measurements
Following the existing study [26], we leverage Average Percentage of Fault-Detection (APFD) [92] to evaluate the prioritization effectiveness of GraphPrior and the compared approaches. APFD is a standard metric for prioritization evaluation. Typically, higher APFD values indicate faster misclassification detection rates. We calculate the APFD values by Equation (5):
\[
APFD = 1-\frac{\sum _{i=1}^{k} o_i}{k n}+\frac{1}{2 n}, \qquad (5)
\]
where \(n\) is the number of test inputs in the test set \(T\), \(k\) is the number of test inputs in \(T\) that will be misclassified by the GNN model \(G\), and \(o_i\) is the index of the \(i\)th misclassified test in the prioritized test set. More specifically, \(o_i\) is an integer that represents the position of the \(i\)th misclassified test in the test set that has been prioritized. When \(\sum _{i=1}^k o_i\) is small (i.e., the total index sum of the misclassified tests within the prioritized list is small), indicating that the misclassified tests are prioritized higher, the APFD will be large according to Equation (5). Therefore, a large APFD indicates better prioritization effectiveness. Following the existing study [26], we normalize the APFD values to [0, 1]. We consider a prioritization approach better when its APFD value is closer to 1. We present the comparison results in tables.
For a more detailed analysis, we utilize PFD (Percentage of Fault Detected) [26] to evaluate the fault detection rate of each approach on different ratios of prioritized test inputs. High PFD values indicate high effectiveness in detecting misclassified test inputs:
\[
PFD = \frac{F_c}{F_t},
\]
where \(F_c\) is the number of faults (i.e., misclassified test inputs) correctly detected and \(F_t\) is the total number of faults. More specifically, we evaluate the fault detection rate of GraphPrior against different ratios of prioritized tests. We use PFD-n to denote the PFD when the first n% of prioritized test inputs are inspected.
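The sketch below computes APFD (Equation (5)) and PFD from a prioritized list of per-test correctness flags; the variable names are illustrative.

```python
def apfd(misclassified_flags):
    """APFD for a prioritized test set.

    `misclassified_flags[i]` is True if the i-th test (in prioritized order,
    0-indexed here) is misclassified by the GNN under test.
    """
    n = len(misclassified_flags)
    positions = [i + 1 for i, f in enumerate(misclassified_flags) if f]  # 1-based o_i
    k = len(positions)
    if k == 0:
        return 1.0
    return 1.0 - sum(positions) / (k * n) + 1.0 / (2 * n)

def pfd(misclassified_flags, ratio):
    """PFD-n: fraction of all faults found in the first `ratio` of the prioritized list."""
    n = len(misclassified_flags)
    total_faults = sum(misclassified_flags)
    detected = sum(misclassified_flags[: int(n * ratio)])
    return detected / total_faults if total_faults else 0.0
```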
5 Results and Analysis
5.1 RQ1: Effectiveness of the Killing-based GraphPrior Approach (KMGP)
Objectives: We investigate the effectiveness of the killing-based GraphPrior approach, KMGP (cf. Section 3.3), comparing it with existing approaches that can be used to identify possibly misclassified test inputs.
Experimental design: We used 16 pairs of datasets and GNN models as subjects to evaluate the effectiveness of GraphPrior. Table 1 exhibits their basic information. We carefully selected seven compared approaches (i.e., DeepGini, least confidence, margin, Vanilla SM, PCS, entropy, and random selection), which can be adapted for GNN test prioritization. Random selection is considered the baseline. We adopt two metrics to measure the effectiveness of GraphPrior and the compared approaches: Average Percentage of Fault-Detection (APFD) and PFD, which are explained in Section 4.7.
Due to the randomness of the training process of a GNN model, we conduct a statistical analysis by repeating all the experiments 10 times. More specifically, for each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes.
Results: The GraphPrior approach KMGP outperforms all the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in GNN test prioritization. Table 2 presents the comparison results of KMGP and the compared approaches using the APFD metric. We highlight the approach with the highest effectiveness for each case in grey. The results demonstrate that KMGP outperforms the other approaches in the majority of cases, specifically in 87.5% (14 out of 16) of subjects. Vanilla SM, in contrast, performs the best in only 12.5% of cases. Additionally, the average APFD value achieved by KMGP was 0.748, which is higher than that of the compared techniques, with improvements of 4.76%~49.6%. These results suggest that KMGP offers a promising solution for prioritizing GNN test inputs.
Table 3 exhibits the comparison results among the test prioritization techniques with respect to PFD. We highlight the approach with the highest effectiveness for each case in grey. The findings indicate that, for 68.75% (11 out of 16) of the subjects, KMGP performs best when prioritizing less than 50% of tests. Furthermore, for a majority of the subjects, specifically 87.5% (14 out of 16), KMGP exhibits the best performance when prioritizing less than 30% of tests. Table 4 exhibits the overall comparison results in terms of PFD. We can see that when prioritizing 10%~30% of test inputs, the average effectiveness of KMGP outperforms that of the compared approaches in 100% of cases. When prioritizing 10%~50% of test inputs, the average effectiveness of KMGP outperforms that of the compared approaches in 90% of cases. Figure 2 plots the ratio of detected misclassified tests against the prioritized tests. We see that GraphPrior achieves a higher APFD value in comparison to DeepGini, entropy, least confidence, margin, Vanilla SM, PCS, and random. These results confirm the effectiveness of KMGP in GNN test input prioritization.
To demonstrate the stability of our findings, a statistical analysis is performed. Specifically, all the experiments are repeated 10 times for each subject, resulting in 10 distinct GNN model instances obtained through separate training processes for a given original GNN model. Based on the statistical analysis of the resulting data, the p-value was found to be lower than \(10^{-05}\), indicating that the KMGP approach can consistently outperform the compared approaches in terms of test prioritization.
5.2 RQ2: Effectiveness of the Feature-based GraphPrior Approaches
Objectives: We investigate the effectiveness of feature-based approaches in GraphPrior, including XGGP, LRGP, RFGP, LGGP, and DNGP, compared with the killing-based approach KMGP.
Experimental design: We compared the effectiveness of the feature-based GraphPrior approaches with the killing-based approach KMGP on 16 subjects (four graph datasets × four GNN models). Due to the randomness of the training process of a GNN model, we repeat all the experiments 10 times and calculate the average results. For each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes. For evaluation, we calculated the APFD values of all the approaches on each subject, which reflect the misclassification detection rate. Moreover, we calculated the PFD values of all the approaches on different ratios of prioritized tests to further investigate the effectiveness of the feature-based approaches.
Results: The experimental results of this research question are exhibited in Tables 5, 6, and 7. Table 5 presents the comparison results in terms of APFD, while Tables 6 and 7 present the comparison results in terms of PFD.
Among all the GraphPrior approaches, RFGP demonstrates the highest level of effectiveness in most cases. Table 5 exhibits the comparison results among KMGP (i.e., the killing-based GraphPrior approach) and the feature-based GraphPrior approaches in terms of APFD. The results demonstrate that RFGP outperforms the other GraphPrior approaches, on average. Moreover, the average APFD value of RFGP exceeds that of KMGP by around 0.02. Additionally, across different subjects, RFGP outperforms the other GraphPrior approaches in the majority of cases. To provide a more detailed analysis, Tables 6 and 7 exhibit the comparison results of all GraphPrior approaches in terms of PFD. The findings also confirm that RFGP is the most effective GraphPrior approach. Furthermore, Table 7 indicates that, on average, RFGP is consistently more effective than the other GraphPrior approaches across different test prioritization ratios. Figure 3 presents some examples aimed at providing a more visually intuitive understanding of the performance of the various GraphPrior approaches. Collectively, these results suggest that RFGP is the most effective GraphPrior approach for the evaluated datasets.
Additionally, although the killing-based GraphPrior approach, KMGP, shows good effectiveness on some specific datasets, its average effectiveness is lower than that of several feature-based GraphPrior approaches, such as RFGP, LGGP, and XGGP. This result suggests that KMGP is less stable compared to some feature-based approaches. For example, in Figure 3(b), we can see that KMGP (represented by the red line) is less effective than the other GraphPrior approaches. In fact, the main difference between KMGP and the feature-based GraphPrior approaches lies in their strategy for utilizing mutation results. Specifically, KMGP treats all mutated models as having equal importance, whereas feature-based GraphPrior approaches, such as RFGP, employ ranking models to assign higher weights to the more important mutated models, thereby better utilizing mutation results for test prioritization. The superior performance of RFGP indicates that the random forest algorithm it utilizes can effectively identify important mutated models and assign them high weights.
The efficiency of GraphPrior (all six approaches) is acceptable. Table 8 illustrates the efficiency of GraphPrior in comparison with the other approaches. The time cost of GraphPrior can be decomposed into three phases, namely mutant generation, training, and execution. Mutant generation involves producing mutated models by retraining the original GNN model. The training time represents the average duration needed for training a ranking model. Finally, the execution time denotes the average duration expended on test prioritization. By decomposing the time cost into these distinct phases, we provide a more detailed understanding of the efficiency of GraphPrior in contrast to the other approaches. As evident from Table 8, the average execution time of GraphPrior for test prioritization is 40 seconds, with the most time-consuming phase being mutant generation, which takes around 35 minutes. In contrast, the average execution time of the compared approaches is less than one second. Although GraphPrior is not as efficient as the compared approaches, it provides a viable alternative to costly and time-consuming manual labeling, and its total time cost remains acceptable in real-world scenarios.
5.3 RQ3: Effectiveness of GraphPrior on Adversarial Test Inputs
Objectives: We further investigate the effectiveness of GraphPrior on adversarial test data. Here, we adopt eight graph adversarial attacks (cf. Section 4.4) from existing studies [3, 48, 86, 100]. The results can answer whether GraphPrior can perform well on adversarial test sets for GNNs, compared with existing approaches that can be used to identify possibly misclassified test inputs.
Experimental design: We evaluate GraphPrior on adversarial datasets generated by eight graph attack techniques [3, 48, 86, 100]. In this research question, we set the attack level to 0.3, which means that 30% of the test inputs in the test set are adversarial tests. It is important to note that a high attack level, such as 90%, would result in a significant ratio of adversarial test inputs. Under such circumstances, a larger number of bug cases could be selected by any of the prioritization methods, making it difficult to demonstrate the effectiveness of GraphPrior. Thus, to ensure an effective evaluation of GraphPrior and the compared approaches, we selected a reasonable attack level (i.e., 0.3), which limits the proportion of adversarial test inputs. In total, in this research question, we evaluate GraphPrior on 108 subjects (four GNN models, four datasets, and eight graph adversarial attacks). We then ran all six GraphPrior approaches and the compared approaches on the subjects and calculated the APFD values of each approach under each graph adversarial attack. Moreover, we calculated the PFD values of each approach for different ratios of prioritized tests.
Results: GraphPrior approaches outperform the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in the context of graph adversarial attacks. Table
9 shows the test prioritization effectiveness (measured by APFD) of GraphPrior and the compared approaches across a variety of adversarial attacks. The experimental results indicate that the GraphPrior approaches exhibit superior performance, with the average APFD values ranging from 0.692 to 0.732, while the compared approaches range from 0.499 to 0.711. Notably, five GraphPrior approaches, namely, RFGP, XGGP, LRGP, LGGP, and KMGP, outperform all the compared approaches, on average, across all the adversarial attacks. Table
10 presents the comparison results of GraphPrior and the compared approaches in terms of PFD, confirming the superior performance of GraphPrior from both the perspective of average effectiveness and the number of best cases. Furthermore, Table
11 presents the overall comparison results in terms of PFD, which further support the above conclusions by demonstrating that the largest average effectiveness of each case is achieved by the GraphPrior approaches, along with the largest number of best cases.
Among all the GraphPrior approaches proposed, the effectiveness of RFGP stands out as the most notable. From Table
9, in which the effectiveness is measured by the APFD values, we see that RFGP performs the best across different adversarial attacks, with an average improvement of 2.95%~46.69% over the uncertainty-based test prioritization approaches. Table
10 presents the test prioritization effectiveness in terms of PFD. The column #Best case in PFD denotes the number of best cases a test prioritization approach achieved across all cases (i.e., all subjects of a graph adversarial attack). The results demonstrate that, against a majority of adversarial attacks, RFGP consistently outperforms all other GraphPrior approaches in terms of average effectiveness. Moreover, Table
11 presents the overall comparison results in terms of PFD, further indicating that RFGP outperforms all other approaches in terms of average effectiveness. Notably, when prioritizing 20% to 40% of the test inputs, RFGP consistently exhibits the highest number of best cases across a variety of subjects.
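For reference, the metrics reported in Tables 9~11 can be computed from a prioritized list of test verdicts roughly as follows; the sketch uses the standard APFD formula and our own helper names, and may differ in minor details from the exact scripts used in the evaluation.

```python
import numpy as np

def apfd(is_misclassified_in_rank_order):
    """Average Percentage of Fault Detection for a prioritized test list.

    is_misclassified_in_rank_order: 1/0 array, index 0 = highest-priority test.
    Standard formula: 1 - sum(ranks of misclassified tests)/(n*k) + 1/(2n).
    """
    y = np.asarray(is_misclassified_in_rank_order)
    n, k = len(y), int(y.sum())
    if k == 0:
        return 1.0  # degenerate case: no faults to detect
    fault_ranks = np.flatnonzero(y) + 1  # 1-based ranks of misclassified tests
    return 1.0 - fault_ranks.sum() / (n * k) + 1.0 / (2 * n)

def pfd_at(is_misclassified_in_rank_order, ratio):
    """Fraction of all misclassified tests found within the first `ratio` of the list."""
    y = np.asarray(is_misclassified_in_rank_order)
    cutoff = int(np.ceil(ratio * len(y)))
    return y[:cutoff].sum() / max(int(y.sum()), 1)

# e.g. a PFD of ~0.83 at ratio=0.4 means that the first 40% of the prioritized
# tests already cover more than 80% of all misclassified inputs.
```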
5.4 RQ4: Effectiveness of GraphPrior against Adversarial Attacks at Varying Attack Levels
Objectives: We investigate the effectiveness of GraphPrior on adversarial test inputs with different attack levels.
Experimental design: To investigate the effectiveness of GraphPrior on test inputs generated via different levels of graph adversarial attacks, we set different attack levels (i.e., 0.1, 0.2, 0.3, and 0.4) for the eight graph adversarial techniques (i.e., DICE, Min-max attack, NEAA, NEAR, PGD attack, RAA, RAF, and RAR). As mentioned in RQ3, the attack level indicates the ratio of adversarial inputs in the dataset. For example, 0.4 means that 40% of the tests in the dataset are adversarial tests. We select these attack levels because a high attack level (e.g., 80%) would produce a substantial proportion of adversarial test inputs. Consequently, such circumstances could yield a greater number of bug cases selected by any prioritization method, thereby affecting the evaluation of GraphPrior. Therefore, we carefully selected a range of attack levels that are not unduly high for the evaluation of GraphPrior. In this research question, we evaluate GraphPrior and the compared approaches on 432 subjects in total.
Results: GraphPrior outperforms all the compared approaches on adversarial test inputs generated at different attack levels. More specifically, Table 12 presents the effectiveness of GraphPrior and the compared approaches under the attacks DICE, MMA, RAA, and RAR, with the attack level ranging from 0.1 to 0.4. In this research question, we apply eight adversarial attacks in total; the remaining experimental results (i.e., the results for the other four adversarial attacks) are available in our GitHub repository. The experimental results presented in Table 12 demonstrate that GraphPrior, consisting of DNGP, KMGP, LGGP, LRGP, RFGP, and XGGP, outperforms all the compared approaches across different levels of the adversarial attacks.
Table
13 presents the overall comparison of GraphPrior and the compared approaches across the eight adversarial attacks with different attack levels. Specifically, we evaluate each test prioritization approach in terms of the number of cases in which it performed the best, as well as its average PFD values across different attack levels. For example, “All-0.1” refers to the overall results of each approach under all adversarial attacks with an attack level of 0.1. Table 13 demonstrates that GraphPrior outperforms all compared approaches, achieving the best effectiveness in 99.94% of the tested cases; only one best case is achieved by the compared approach Margin. Furthermore, GraphPrior approaches such as RFGP and KMGP consistently exhibit the largest average PFD values across different attack levels.
Among all the GraphPrior approaches, RFGP and KMGP exhibit superior performance across different attack levels in comparison to other GraphPrior approaches. In Table
12, we see that, across attack levels from 0.1 to 0.4, RFGP achieves the largest number of best cases, followed by KMGP. For example, when the attack level is 0.1, RFGP performs the best in 46.47% of cases and KMGP in 35.33% of cases. Notably, when prioritizing 10% of the test inputs, KMGP achieves the largest number of best cases; when the attack level is 0.2~0.4, RFGP achieves the largest number of best cases.
Additionally, our experimental results, as illustrated in Table
13, reveal that RFGP exhibits the largest average PFD values compared to the other evaluated approaches across varying attack levels. Specifically, when 40% of the test inputs are prioritized, RFGP achieves a PFD value ranging from 0.832 to 0.836, indicating that it can detect more than 80% of the misclassified tests.
5.5 RQ5: Contribution Analysis of Different Mutation Rules
Objectives: For each evaluated GNN model, we investigate which mutation rules generate more of the top contributing mutated models for test prioritization.
Experimental design: In our study, we employed one or more mutation rules to generate each mutated model, and each mutated model corresponds to one mutation feature. Thus, to evaluate the importance of different mutation rules, we first evaluate the importance of the mutation features. We adopted the cover metric of the XGBoost algorithm to measure the importance of each mutation feature for the ranking models; a detailed account of this approach is presented in Section 4.5. After computing the importance scores of all mutation features, we selected the top-N important features for each subject and thereby identified the top-N mutated models. We then identified the mutation rules used to generate each of these mutated models and compared the contributions of the mutation rules accordingly. Across the different subjects in this research question, we generated 80~240 mutated models.
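The following sketch illustrates how such a cover-based importance analysis can be carried out with the XGBoost API (get_score(importance_type="cover")); the feature naming scheme and the mapping from mutated models to mutation rules are placeholders for illustration rather than our actual implementation.

```python
import numpy as np
import xgboost as xgb

def top_contributing_rules(mutation_feats, labels, rules_per_mutant, top_n=10):
    """mutation_feats: (n_tests, n_mutants) kill matrix; labels: 1 if a test is misclassified.
    rules_per_mutant[i] lists the mutation rules used to generate mutated model i."""
    feature_names = [f"mutant_{i}" for i in range(mutation_feats.shape[1])]
    dtrain = xgb.DMatrix(mutation_feats, label=labels, feature_names=feature_names)
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

    # 'cover' weighs each feature by the number of observations affected by its splits.
    cover = booster.get_score(importance_type="cover")
    ranked = sorted(cover, key=cover.get, reverse=True)[:top_n]

    rule_counts = {}
    for feat in ranked:
        mutant_id = int(feat.split("_")[1])
        for rule in rules_per_mutant[mutant_id]:
            rule_counts[rule] = rule_counts.get(rule, 0) + 1
    return ranked, rule_counts  # e.g. {"HC": 10, "BIA": 10, "SL": 3, ...}
```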
Results: The mutation rule HC made high contributions to the effectiveness of GraphPrior on all four types of GNN models. Tables 14 to 17 illustrate the contributions of different mutation rules to the effectiveness of GraphPrior on the different GNN models (i.e., GCN, GAT, GraphSAGE, and TAGCN). For each GNN model, we identify the top-N mutated models that made the top contributions to the effectiveness of GraphPrior; the mutation rules applied to generate each mutated model are highlighted in grey. Table 14 presents the contributions of the top-N mutated models to the effectiveness of GraphPrior for the GCN model. Notably, the mutation rules BIA and HC contributed to 100% of the top contributing mutated models, while SL, NOR, CA, and IMP contributed to a lower percentage of the top contributing mutated models. We conclude that, for the GCN model, the mutation rules BIA and HC were the most effective in generating the top important mutated models. Moving to GAT, GraphSAGE, and TAGCN, whose results are presented in Tables 15, 16, and 17, the mutation rule HC also generates a large ratio (i.e., 100%, 90%, and 90%, respectively) of the top contributing mutated models. We can conclude that, across the four types of GNN models, HC consistently makes top contributions to the effectiveness of GraphPrior.
Some mutation rules, such as NOR and BIA, made high contributions to the effectiveness of GraphPrior on specific GNN models, generating a considerable ratio (i.e., from 50% to 100%) of the top contributing mutated models. For example, on GCN and GraphSAGE, BIA contributed to 100% of the top-N mutated models; on TAGCN, NOR contributed to 100% of the top-N mutated models.
5.6 RQ6: Enhancing GNNs with GraphPrior
Objectives: We investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining subsets to improve the performance of a GNN model.
Experimental design: Following the prior research by Ma et al. [57], our retraining experiments are structured as follows: First, we randomly partitioned the dataset into three sets, an initial training set, a candidate set, and a test set, with a ratio of 4:4:2. The candidate set was reserved exclusively for retraining, while the test set was kept untouched for evaluation. In the first round, we trained a GNN model using only the initial training set and computed its accuracy on the test set, employing the best model obtained over the training epochs for the subsequent retraining process. In the second round, we incorporated an additional 10% of new inputs from the candidate set into the existing training set without replacement; the inputs selected for inclusion were those ranked in the first 10% by the test prioritization approaches, namely, GraphPrior and the compared techniques. Following Ma et al. [57], we retrained the GNN models on the complete augmented training set, which ensures that the old and new training data are treated equally. We repeated this retraining process for multiple rounds until the candidate set was empty, keeping the test data untouched throughout. To account for the randomness involved in model training, we repeated all experiments 10 times and report the average results.
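A minimal sketch of this retraining protocol is shown below, assuming a generic prioritize/train_gnn/evaluate interface; these names are hypothetical, since the actual training code depends on the specific GNN model and dataset.

```python
import numpy as np

def retraining_study(prioritize, train_gnn, evaluate, num_nodes, seed=0):
    """Sketch of the retraining protocol (hypothetical interface):
    prioritize(model, cand_idx) -> cand_idx sorted from most to least suspicious;
    train_gnn(train_idx) -> trained GNN; evaluate(model, test_idx) -> accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_nodes)
    n = num_nodes
    train_idx = idx[: int(0.4 * n)]                 # initial training set (ratio 4)
    cand_idx = idx[int(0.4 * n): int(0.8 * n)]      # candidate set for retraining (ratio 4)
    test_idx = idx[int(0.8 * n):]                   # held-out test set (ratio 2)

    step = max(1, int(0.1 * len(cand_idx)))         # add 10% of the candidates per round
    model = train_gnn(train_idx)
    accuracies = [evaluate(model, test_idx)]

    while len(cand_idx) > 0:
        ranked = prioritize(model, cand_idx)        # GraphPrior or an uncertainty metric
        selected, cand_idx = ranked[:step], ranked[step:]
        train_idx = np.concatenate([train_idx, selected])
        model = train_gnn(train_idx)                # retrain on the full augmented set
        accuracies.append(evaluate(model, test_idx))
    return accuracies
```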
Results: Table
18 illustrates the average accuracy of GNN models after retraining with 10% to 100% prioritized test inputs. For each case, we highlight the approach with the highest effectiveness in grey to facilitate quick and easy interpretation of the results.
GraphPrior and the uncertainty-based test prioritization approaches outperform the random selection approach. However, the observed improvement is relatively small, indicating that GNN test prioritization approaches can guide the retraining of GNN models but with limited effect. In Table
18, we observe that the test prioritization methods, including GraphPrior and the compared approaches, consistently perform better than random selection across varying ratios of added data. Furthermore, when the ratio of incorporated prioritized tests exceeds 10%, a significant majority of the test prioritization methods (83.3%, i.e., 10 out of 12) outperform random selection in each case. However, the improvements achieved by these methods over random selection are relatively small, with the highest increase being only 0.014. Additionally, Figure 4 depicts an example outcome of the retraining experiments conducted on the Cora dataset using the GCN model, comparing the performance of the test prioritization approaches against random selection (indicated by the black line). As observed from the results, the test prioritization approaches perform better than random selection, but the improvement is slight.
One reason for the limited effectiveness of GraphPrior and the uncertainty-based test prioritization approaches lies in their inadequate consideration of node importance (i.e., a node's impact on other nodes in the dataset). In a GNN dataset, the complex interdependence among test inputs and their neighbors can lead to nodes having different importance. For example, nodes with greater connectivity can affect more nodes, making them relatively more critical. However, the current test prioritization approaches focus only on the ability of test inputs to reveal system bugs, without regard to the importance of nodes. Although the test inputs they select may have a higher likelihood of misclassification, their importance within the dataset can be minor if they have very few neighbors, and retraining on such inputs has less effect. Consequently, it is crucial to consider node importance in the selection of retraining data to achieve more effective outcomes.
GraphPrior achieved better effectiveness than the uncertainty-based test prioritization methods. In Table
18, we see that, when adding 20% or more of the test cases for retraining, the GraphPrior approaches perform the best in 100% of cases. Figure
4 visually demonstrates that the GraphPrior approaches (solid line) perform better than the compared approaches (dotted line) in most cases.
6 Discussion
6.1 Generality of GraphPrior
Although confidence-based test prioritization approaches demonstrate excellent effectiveness on traditional DNNs, they do not consider the interdependencies between test inputs, which are particularly crucial in GNN test prioritization. Our proposed GraphPrior leverages mutation analysis of GNN models to perform GNN test input prioritization and has been demonstrated effective on node classification tasks through 604 carefully designed subjects. In fact, the scheme of GraphPrior (i.e., modifying training parameters to mutate the GNN model for test prioritization) can also be generalized to other levels of GNN tasks, including graph-level and edge-level tasks. In the future, we will further verify the extension of GraphPrior from this perspective.
[The applicability of GraphPrior to regression tasks]. We also discuss the potential applicability of GraphPrior to regression tasks. Currently, the mutation rules and ranking models of GraphPrior are specifically designed for classification tasks. To extend GraphPrior to regression tasks, modifications to the mutation rules and ranking models would be required; if appropriate mutation rules can be identified and suitable ranking models designed, GraphPrior could also be applied to regression tasks.
6.2 Limitations of GraphPrior
[Diversity of the prioritized data]. One limitation of GraphPrior lies in guaranteeing the diversity of the selected bug data. This limitation is also noted in prior work on uncertainty-based test prioritization approaches [26], which did not consider the diversity of bugs when prioritizing test inputs. Similarly, GraphPrior does not aim for diversity in the prioritized tests. However, GraphPrior has demonstrated the ability to identify a significant majority of misclassified test inputs using a small ratio of prioritized test cases. Specifically, RFGP (i.e., the most effective GraphPrior approach) detects over 80% of misclassified tests by prioritizing only 40% of the test inputs. While prioritizing diverse bugs can improve the overall quality of testing, efficiently identifying a large proportion of bugs with a small set of prioritized tests, even without explicitly ensuring bug diversity, remains a practical strategy when time and resources are limited.
[GraphPrior in active learning scenarios]. Active learning [68] operates under the assumption that samples within a dataset have varying contributions to the improvement of the current model and aims to select the most informative samples for inclusion in the training set. Our investigation in RQ6 has demonstrated that GraphPrior and uncertainty-based metrics can be utilized to select informative retraining tests. However, the effectiveness of these approaches is limited. Specifically, despite the demonstrated success of uncertainty-based metrics such as DeepGini and Margin in previous studies on DNNs [26, 36], their effectiveness in the context of GNNs is slight. We explore potential reasons for this phenomenon below.
One crucial reason for their limited effectiveness lies in their inadequate consideration of node importance, i.e., the impact that a node has on other nodes in the graph dataset. In a GNN dataset, the complex interdependence among test inputs and their neighbors can result in differing levels of importance for different nodes. For instance, nodes with higher connectivity can be more influential and hence more critical. However, current test prioritization approaches focus only on the ability of test inputs to expose system bugs without taking node importance into account. Although these approaches may identify inputs with a higher likelihood of misclassification, the importance of those inputs within the dataset may be negligible if they have only a few neighbors. Retraining on such inputs is, therefore, less effective.
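Purely as an illustration of this point, and not as part of GraphPrior, one could blend an existing suspiciousness score with a normalized node-degree term so that influential nodes are selected for retraining first; the helper below sketches such a weighting under these assumptions.

```python
import numpy as np

def importance_weighted_scores(misclassification_scores, node_degrees, alpha=0.5):
    """Illustrative only (not part of GraphPrior): combine a suspiciousness score
    with a normalized node-degree term so that influential nodes rank higher.

    alpha trades off 'likely misclassified' against 'important in the graph'.
    """
    s = np.asarray(misclassification_scores, dtype=float)
    d = np.asarray(node_degrees, dtype=float)
    d_norm = d / d.max() if d.max() > 0 else d
    return alpha * s + (1 - alpha) * d_norm

# Retraining order: ranking = np.argsort(-importance_weighted_scores(scores, degrees))
```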
Furthermore, we elaborate on the difference between GraphPrior and the existing active learning methods evaluated in our study. The active learning methods used for comparison in our article are primarily uncertainty-based and aimed at datasets where each sample is independent of the others. For graph datasets, however, these methods select retraining data without considering the interdependencies between nodes and also neglect node importance, merely selecting possibly misclassified nodes. In contrast, GraphPrior employs mutation analysis to identify test inputs that are more likely to be misclassified while considering the interdependencies between nodes during the mutation process. Despite this added consideration, GraphPrior's goal remains to select possibly misclassified test inputs, and it does not explicitly consider node importance, leading to limited effectiveness similar to that of the uncertainty-based methods.
[Generating mutants for large-scale GNN models]. In our experiments, which are based on our current model and datasets, the time cost of our retraining method (for generating mutants) is within an acceptable range. When dealing with large-scale GNN models, GraphPrior can require large computational resources, but it can remain feasible in situations where the cost of manual labeling outweighs the computational cost.
6.3 Threats to Validity
Threats to Internal Validity. The internal threats to validity mainly lie in the implementation of our proposed GraphPrior and the compared approaches. To reduce this threat, we implemented GraphPrior based on the widely used library PyTorch and adopted the implementations of the compared approaches published by their authors. Another internal threat lies in the randomness of model training. To mitigate this threat and ensure the stability of our experimental results, we conducted a statistical analysis: we repeated the training process 10 times for both the original and the mutated models and calculated the statistical significance of the experiments.
The selection of mutation rules in our study presents another internal threat to validity. Despite our best efforts to collect a comprehensive set of mutation rules, it is possible that other training parameters beyond our current knowledge could serve as mutation rules. To mitigate this threat, we selected mutation rules that can directly or indirectly affect node interdependence in the prediction process. The selection of parameter ranges for mutation rules is another internal threat that could affect the effectiveness of the rules. To mitigate this threat, we adopted a strategy in which we inverted the values of Boolean parameters, setting true to false and false to true. For integer and float parameters, we selected a range that introduces only slight changes to the original GNN model. Our experimental results demonstrated the effectiveness of GraphPrior, indicating that the mutation rules and selected parameter range are suitable for GNN test prioritization.
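The Boolean-inversion and slight-perturbation strategy can be pictured with the following sketch; the parameter names are placeholders rather than the actual GNN training parameters used as mutation rules.

```python
import random

def mutate_hyperparameters(params, float_jitter=0.05, seed=None):
    """Apply one mutation-rule style change per parameter: invert Booleans and
    slightly perturb ints/floats so the mutated model stays close to the original.
    Parameter names here are placeholders, not the actual GraphPrior rules."""
    rng = random.Random(seed)
    mutated = dict(params)
    for name, value in params.items():
        if isinstance(value, bool):                 # check bool before int
            mutated[name] = not value               # e.g. self_loops: True -> False
        elif isinstance(value, int):
            mutated[name] = max(1, value + rng.choice([-1, 1]))
        elif isinstance(value, float):
            mutated[name] = value * (1 + rng.uniform(-float_jitter, float_jitter))
    return mutated

# Example with placeholder names:
# mutate_hyperparameters({"self_loops": True, "heads": 8, "dropout": 0.5})
```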
Threats to External Validity. The external threats to validity mainly lie in the GNN models under test and the testing datasets used in our study. To mitigate this threat, we adopted a large number of subjects (pairs of models and datasets) and leveraged different types of test inputs. We applied eight graph adversarial attacks from public studies to generate adversarial test inputs and varied the attack level for a more detailed evaluation. In the future, we will apply GraphPrior to a more diverse set of GNN models and test datasets.
8 Conclusion
To improve the efficiency of GNN testing, we aim to prioritize possibly misclassified test inputs so that GNN bugs can be revealed earlier. However, a crucial limitation of existing test prioritization approaches is that, when applied to GNNs, they do not take into account the interdependence between test inputs (nodes). In this article, we propose GraphPrior, a set of test prioritization approaches designed specifically for GNN testing. GraphPrior assumes that a test input is more likely to be misclassified if it can kill many mutated models. Based on this assumption, GraphPrior leverages carefully designed mutation rules to generate mutated models for GNNs. Subsequently, GraphPrior obtains the mutation results of test inputs by executing the mutated models. GraphPrior utilizes the mutation results in two ways, namely, killing-based and feature-based methods. When scoring a test, killing-based methods consider each mutated model equally important, while feature-based methods learn a different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. We conducted an extensive study to evaluate the effectiveness of the GraphPrior approaches on 604 subjects, comparing them with existing approaches that can detect possibly misclassified test inputs. The experimental results demonstrate the effectiveness of GraphPrior. In terms of APFD, the killing-based GraphPrior approach, KMGP, exceeds the compared approaches (i.e., DeepGini, Margin, Vanilla Softmax, PCS, Entropy, Least Confidence, and random selection) by 0.034~0.248, on average. Furthermore, RFGP (i.e., the feature-based GraphPrior approach) exhibits better performance than the other GraphPrior approaches. Specifically, RFGP outperforms the uncertainty-based test prioritization approaches against different adversarial attacks, with an average improvement of 2.95%~46.69%.