
GraphPrior: Mutation-based Test Input Prioritization for Graph Neural Networks

Published: 24 November 2023

Abstract

Graph Neural Networks (GNNs) have achieved promising performance in a variety of practical applications. Similar to traditional DNNs, GNNs could exhibit incorrect behavior that may lead to severe consequences, and thus testing is necessary and crucial. However, labeling all the test inputs for GNNs can be costly and time-consuming, especially when dealing with large and complex graphs, which seriously affects the efficiency of GNN testing. Existing studies have focused on test prioritization for DNNs, which aims to identify and prioritize fault-revealing tests (i.e., test inputs that are more likely to be misclassified) to detect system bugs earlier in a limited time. Although some DNN prioritization approaches have been demonstrated effective, there is a significant problem when applying them to GNNs: They do not take into account the connections (edges) between GNN test inputs (nodes), which play a significant role in GNN inference. In general, DNN test inputs are independent of each other, while GNN test inputs are usually represented as a graph with complex relationships between tests. In this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of approaches to prioritize test inputs specifically for GNNs via mutation analysis. Inspired by mutation testing in traditional software engineering, in which test suites are evaluated based on the mutants they kill, GraphPrior generates mutated models for GNNs and regards test inputs that kill many mutated models as more likely to be misclassified. Then, GraphPrior leverages the mutation results in two ways: killing-based and feature-based methods. When scoring a test input, the killing-based method considers each mutated model equally important, while feature-based methods learn different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. We conducted an extensive study based on 604 subjects to evaluate GraphPrior on both natural and adversarial test inputs. The results demonstrate that KMGP, the killing-based GraphPrior approach, outperforms the compared approaches in a majority of cases, with an average improvement of 4.76%~49.60% in terms of APFD. Furthermore, the feature-based GraphPrior approach, RFGP, performs the best among all the GraphPrior approaches. On adversarial test inputs, RFGP outperforms the compared approaches across different adversarial attacks, with an average improvement of 2.95%~46.69%.

1 Introduction

In recent years, graph machine learning [27, 38] has been widely adopted for modeling graph-structured data. In this realm, the emergence of graph neural networks (GNNs) [71] has offered promising results in diverse domains, such as recommendation systems [25, 85, 91], social network analysis [47, 84, 93], and drug discovery [4, 73]. GNNs, like typical neural networks [45, 75], are abstractions of the underlying data. Thus, their inference can suffer from faults [28, 53, 58], which can lead to severe prediction failures, especially in security-critical use cases. Testing is considered to be a fundamental practice that is widely adopted to ensure the performance of neural networks, including GNNs. However, like traditional deep neural networks (DNNs), GNN testing also suffers from the lack of automated testing oracles, which necessitates the manual labeling of test inputs. This labeling process can require significant human effort, especially for large and complex graphs. Moreover, in certain specialized domains, such as the protein interface prediction [62] of drug discovery, labeling relies heavily on domain-specific knowledge, further increasing its costs.
Prior works [6, 26, 44, 81] have focused on test prioritization to relieve the labeling-cost problem for DNNs. Test prioritization approaches aim to prioritize test inputs that are more likely to be misclassified (i.e., fault-revealing test inputs) so that such inputs can be identified earlier to reveal system bugs. Existing approaches are mainly divided into two categories: coverage-based and confidence-based test prioritization approaches. Coverage-based approaches prioritize test inputs based on neuron coverage by adapting coverage-based prioritization methods from traditional software testing [51, 92]. Confidence-based approaches assume that test inputs for which the model is less confident are more likely to be misclassified and thus should be prioritized higher. Feng et al. [26] proposed the state-of-the-art confidence-based approach DeepGini, which considers that a test input is more likely to be misclassified by a DNN model if the model outputs similar prediction probabilities for each class. More recently, Wang et al. [81] proposed PRIMA, which leveraged mutation analysis and learning-to-rank methods to prioritize test inputs for DNNs. However, despite its effectiveness in DNN test prioritization, PRIMA cannot be directly applied to GNNs, since its mutation operators are not adapted to graph-structured data and GNN models.
Furthermore, existing studies [36] have focused on metrics for data selection (e.g., margin and least confidence), which can also be used to detect possibly misclassified test data. Although the aforementioned approaches have been demonstrated to be effective for DNN models in some cases, they have the following limitations when applied to GNN models:
First, to the best of our knowledge, current coverage-based approaches do not provide interfaces for GNN models and thus cannot be directly applied. Moreover, existing research [26] has demonstrated that coverage-based approaches are not effective compared to confidence-based approaches.
Second, despite the effectiveness of confidence-based approaches on traditional DNNs, they do not take into account the interdependencies between test inputs of GNNs, which are particularly crucial for GNN inference. In other words, GNN test inputs are typically represented as graph-structured data consisting of nodes and edges, while confidence-based prioritization approaches usually deal with test sets in which each test is independent and has no connections with others.
Third, the effectiveness of uncertainty-based metrics can be limited when facing some specific adversarial attacks. If the aim of an attack is to generate test inputs that maximize the probability of incorrect classification, then the utility of uncertainty metrics can be limited. This is because the underlying assumption of uncertainty-based metrics is that if a model is more uncertain about classifying a test, then this test is more likely to be misclassified. However, in such scenarios, even if a model is confident about a test, the test can still have a high probability of being misclassified.
To overcome the aforementioned problems, in this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of test prioritization approaches specifically for GNNs. GraphPrior identifies and prioritizes possibly misclassified test inputs via mutation analysis. Given a test set for a GNN model, GraphPrior regards a test input that kills more mutated models (i.e., slightly changed variants of the original GNN model) as more likely to be misclassified. Here, a test input kills a mutated model if the original GNN model and the mutated model produce different prediction results for it. To this end, we design a set of mutation rules to generate mutated models specifically for GNNs by slightly changing the training parameters of the original model. After obtaining the mutation results of each test input, GraphPrior introduces several ranking models (ML/DL models) [5, 42, 83] to rank the test set. The working principle of GraphPrior is inspired by mutation testing research, as this has been realized for both model-based [1, 18, 63] and code-based [2, 17, 64] testing. The key underlying principle in all cases is that test cases that distinguish the behavior of mutants from that of the original artifact are useful and more likely to detect other underlying faults [1, 9, 63].
While both GraphPrior and PRIMA (i.e., the state-of-the-art DNN test prioritization approach) use mutation analysis, GraphPrior differs from PRIMA in terms of its mutation rules, feature generation, and ranking models: (1) GraphPrior’s mutation rules can directly or indirectly affect the message passing between nodes in graph data. In contrast, the mutation rules of PRIMA are designed for traditional DNNs, where the test inputs are independent, and therefore, the mutation rules do not affect the relationships between tests; (2) GraphPrior generates a mutation feature vector for each test input based on its mutation results, where the \(i\)th element in the vector denotes whether the \(i\)th mutated model is killed by this input. This feature generation strategy is intuitive and reproducible. In addition to this, the generation method exhibits several other advantages. First, by using binary indicators (1 or 0) as elements of the mutation feature vector, the information is transformed into a concise vector representation. Second, the fine-grained nature of the mutation feature vector allows for a detailed analysis of the effects of individual mutations. In particular, further analysis can be conducted to assess the contributions of each mutated model to GraphPrior. By tracing back to the corresponding mutation rules for the top critical mutated models, we can gain insights into which mutation rules made higher contributions to GraphPrior. The experimental results demonstrate its effectiveness; (3) GraphPrior employs five ranking models and compares their effectiveness in utilizing mutation features for test prioritization, while PRIMA only uses a single ranking model. By comparing multiple ranking models, GraphPrior can identify the optimal ranking model for learning mutation features in test prioritization.
GraphPrior has broad applicability across a wide range of contexts, including software development, scientific research, and financial systems. For instance, GraphPrior can be employed to gain insights into the vulnerabilities of GNN models used in financial transaction fraud detection. In this specific context, where nodes represent accounts and edges represent transaction transfers, the first step is to utilize the GNN model under test to identify a group of potentially fraudulent accounts. Subsequently, these identified accounts serve as test inputs for GraphPrior. By prioritizing accounts that are more likely to be misclassified by the model (i.e., accounts falsely classified as fraudulent), GraphPrior places them at the top of the recommendation list. Consequently, by labelling and analyzing these bug-revealing tests earlier, the fraud analysis team can unveil the bugs and vulnerabilities of the GNN model more efficiently.
It is important to note that GraphPrior is specifically designed for GNNs, and its impact on DNNs has not been evaluated. This is because, in graph datasets, nodes are interconnected, and the mutation rules of GraphPrior can directly or indirectly affect the message passing between nodes in the prediction process. In contrast, in traditional DNNs, each sample in a dataset is typically independent, and as a result, such mutation rules are unlikely to affect the transmission of information between tests. Therefore, the effectiveness of GraphPrior’s mutation rules for DNNs remains uncertain, as no related experiments have been conducted to evaluate it.
We conducted an extensive study to evaluate the performance of GraphPrior based on 604 subjects. Here, a subject refers to a pair of graph dataset and GNN model. We compare GraphPrior with six uncertainty-based metrics [26, 80, 82] that can be used to prioritize possibly misclassified test inputs and adopt random selection as the baseline method. Our experimental results demonstrate that GraphPrior performs well across all subjects and outperforms the compared approaches, on average.
As mentioned before, one essential problem of confidence-based approaches is that adversarial attacks may lead to a model being more confident in the incorrect prediction, resulting in the failure of the approach. Therefore, we also evaluate GraphPrior on test inputs generated from graph adversarial attacks of existing studies [3, 48, 86, 100]. Furthermore, since the effectiveness of test prioritization methods may vary depending on the degree of the adversarial attack, we set different attack levels to generate adversarial data and compared GraphPrior with the compared approaches. In addition to the evaluation of GraphPrior, we compare the effectiveness of different mutation rules in generating top contributing mutated models, aiming to identify which mutation rules contribute more to each GNN model. In the last step, we investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining tests to improve a GNN model. Our experimental results demonstrate that GraphPrior achieved better effectiveness compared with the uncertainty-based test prioritization methods. We publish our dataset, results, and tools to the community on Github.1
Our work has the following major contributions:
Approach. We propose GraphPrior, a set of mutation-based test prioritization approaches for GNNs. To this end, we design a set of mutation rules that mutate GNN models by slightly changing their training parameters. We carefully select ranking models to analyze the mutation results for effective test prioritization.
Study. We conduct an extensive study based on 604 GNN subjects involving natural and adversarial test sets. We compare GraphPrior with existing DNN approaches that could detect possibly misclassified test inputs. Our experimental results demonstrate the effectiveness of GraphPrior.
Mutation rule analysis. We compare the effectiveness of the GNN mutation rules in generating top contributing mutated models, observing that the mutation rule HC (i.e., mutating Hidden Channels) makes top contributions to most GNN models in test input prioritization.

2 Background

In this section, we introduce the key domain concepts for our work, including Graph Neural Networks and Test Input Prioritization for DNNs.

2.1 Graph Neural Networks

GNNs have achieved great success in handling machine learning problems on graph-structured data [25, 76, 98]. Unlike traditional neural networks running on fixed-sized vectors, GNNs deal with graphs of varying sizes and structures. Therefore, GNNs can capture complex relationships between data points and make more accurate predictions. GNNs have been used in a wide range of tasks, including recommendation systems [25, 85, 90], protein-protein interaction (PPI) prediction [40, 62, 97], and traffic forecasting [10, 41, 95].
Graphs. A graph is a data structure consisting of two components: nodes (vertices) and edges. A graph \(H\) can be defined as \(H = (V, E)\), where \(V\) is the set of nodes and \(E\) is the set of edges between them. In a graph, nodes can represent entities (e.g., persons, places, or things), while the edges define the relationships between nodes. The edges can be either directed or undirected based on the directional dependencies that exist between nodes. Graphs can be utilized to model complex systems such as social media networks, molecular structures, and citation networks. For example, in the context of citation networks, publications can be represented as nodes, and the citations between them can be represented as edges. Graph datasets are collections of graph data that can be used to train and evaluate GNNs. Some benchmark graph datasets [79] include Cora, CiteSeer, and PubMed. In this article, we evaluated GraphPrior and the compared approaches on several graph datasets obtained from existing studies [70, 88].
Graph Embeddings. Graph embedding [7] is an approach used to transform nodes, edges, and their associated features into lower dimensional representation while maximally preserving the graph structural information and graph properties. Graph analytics methods usually suffer from high computational and storage costs, limiting their applicability in real-world scenarios. The use of graph embedding has shown promising results as an efficient and effective way to address the graph analytics problem.
Message Passing Scheme. In GNNs, the message-passing scheme is commonly employed [29], whereby nodes aggregate and transform the information from their neighbors in each layer. Through stacking multiple GNN layers, this mechanism facilitates the propagation of information across the entire graph structure, allowing for the effective embedding of nodes into low-dimensional representations. These node representations may subsequently be leveraged by a differentiable prediction layer, thereby enabling end-to-end training of the complete model.
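To make the scheme concrete, the following is a minimal, framework-free sketch of one mean-aggregation message-passing step; the function and variable names are illustrative and not taken from any specific GNN library or from the models studied in this article.

```python
import numpy as np

def message_passing_layer(h, edges, W):
    """One simplified message-passing step: each node averages its neighbors'
    feature vectors (plus its own), then applies a shared linear transform W
    followed by a ReLU non-linearity."""
    n = h.shape[0]
    agg = h.copy()                      # include the node's own features (self-loop)
    deg = np.ones(n)                    # start at 1 to count the self-loop
    for src, dst in edges:              # undirected edge list assumed
        agg[dst] += h[src]
        agg[src] += h[dst]
        deg[src] += 1
        deg[dst] += 1
    agg = agg / deg[:, None]            # mean aggregation over the neighborhood
    return np.maximum(agg @ W, 0.0)     # linear transform + ReLU

# Toy usage: 3 nodes with 4-dimensional features and 2 undirected edges.
h = np.random.rand(3, 4)
edges = [(0, 1), (1, 2)]
W = np.random.rand(4, 2)
print(message_passing_layer(h, edges, W).shape)  # (3, 2)
```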
GNN models. A GNN model is a type of neural network designed to operate on graph data structures. Typically, a GNN model contains two crucial parts: a graph convolution layer [45] to capture the relationship between nodes in the graph and a classifier [87] to make predictions based on the captured relationship. In general, a GNN model takes graph-structured data as inputs and produces outputs based on its corresponding task. For example, the output for a GNN model that deals with node-level tasks (i.e., GNN tasks that are concerned with predicting the identity or role of each node within a graph) is typically a prediction for nodes in the input graph. In this article, we evaluated our proposed test prioritization approach, GraphPrior, and the compared approaches on various GNN models [21, 30, 45, 79] that deal with node classification tasks.
Graph Adversarial Attacks. Graph adversarial attacks [3, 16, 77, 99] involve the manipulation of graph structure or node features to generate graph adversarial perturbations that can fool GNN models. This vulnerability of GNNs has raised serious concerns regarding their reliability and safety, particularly in safety-critical applications such as financial systems and risk management. For instance, in a credit scoring system, attackers can exploit the vulnerability of GNNs to create fake connections with high-credit customers to evade fraud detection models. In this article, we applied eight graph adversarial attacks from existing studies [3, 48, 86, 100] to generate adversarial inputs for the evaluation of GraphPrior.

2.2 Test Input Prioritization for DNNs

In Deep Neural Networks (DNNs) testing, test input prioritization aims to prioritize tests that are more likely to be misclassified (i.e., bug-revealing test inputs) by the DNN model. In this way, more important test inputs can be labeled earlier in a limited time, which can improve the efficiency of DNN testing. In the literature, several prioritization approaches have been proposed to deal with the labeling-cost issues [6, 26, 81, 94].
The majority of approaches for prioritizing tests in DNNs can be classified into two categories: coverage-based and confidence-based [81]. Confidence-based approaches, such as DeepGini [26], prioritize test inputs based on the model’s confidence. Specifically, these methods identify inputs that are likely to be incorrectly predicted by the DNN model, given that the model outputs similar probabilities for each class. In contrast, coverage-based approaches, such as CTM [92], simply extend traditional software system testing methods to DNN testing and have been shown to underperform compared to confidence-based approaches [26]. Weiss et al. [82] conducted a comprehensive investigation of the capabilities of various DNN test input prioritization techniques, including some notable uncertainty-based metrics such as Vanilla Softmax, Prediction-Confidence Score (PCS), and Entropy. The Vanilla Softmax metric is calculated as the highest activation in the output softmax layer for a classification problem, subtracted from 1. PCS, however, is defined as the difference in softmax likelihood between the predicted class and the second most confident class. Additionally, Entropy, computed over the softmax layer, is considered an alternative to the Gini metric proposed by the authors of DeepGini. These metrics have been demonstrated to be effective in identifying possibly misclassified test inputs and can aid in guiding test prioritization efforts.
The aforementioned uncertainty-based test prioritization can be adapted for test input prioritization for GNNs. GraphPrior differs from these approaches in that GraphPrior leverages mutation analysis to perform test prioritization. The mutation analysis of GraphPrior exploits the specific properties of GNNs. Specifically, GraphPrior’s mutation rules can directly or indirectly affect the message passing between nodes in a graph. In contrast, uncertainty-based approaches rely on the prediction uncertainty of the DNN model to prioritize test inputs without accounting for the interdependence between nodes.
Currently, the state-of-the-art technique for DNN test prioritization is PRIMA, which prioritizes fault-revealing test inputs based on mutation analysis. However, PRIMA is not suitable for GNN test prioritization because: (1) its input mutation rules are specifically designed for DNN testing datasets where each sample is independent of each other. In contrast, graph datasets have complex interdependence between nodes, making PRIMA unsuitable for test prioritization in this context; (2) GNNs employ graph operations and message passing mechanisms to aggregate and update information from neighboring nodes, thereby facilitating improved representation and learning within graph structures. The model mutation rules employed in PRIMA are not suitable for accommodating the graph operation mechanisms intrinsic to GNNs.
In addition to the aforementioned test prioritization techniques, several active learning [80] methods can also be adapted to prioritize DNN tests, such as Least Confidence and Margin. Active learning aims to select the most informative samples to be labeled by a human expert. When applied to test prioritization, active learning can be used to identify the most critical and informative test cases that can reveal bugs in the system.

3 Approach

3.1 Overview

In this article, we propose GraphPrior, a set of test prioritization approaches for GNNs to prioritize test inputs. GraphPrior consists of six mutation-based test prioritization approaches: KMGP, LRGP, RFGP, LGGP, DNGP, and XGGP. These approaches are discussed later in this section. We present the overview of GraphPrior in Figure 1, in which the input of GraphPrior is a GNN test set, and the output is the test set that has been prioritized. Given a test set \(T\) for a GNN model \(G\), the implementation process of GraphPrior is presented as follows:
Fig. 1. Overview of GraphPrior.
Generating mutants for the GNN model \(G\). First, GraphPrior generates mutated models (i.e., mutants) for the GNN model \(G\) based on carefully designed mutation rules (cf. Section 3.2).
Obtaining mutation results through killing mutants. For each test input, GraphPrior identifies which mutated models it kills. Here, a mutated model is killed by a test input if the prediction results of this input via the mutated model and the original model \(G\) are different. In this way, GraphPrior obtains the mutation result of each test input.
Generating feature vectors from the mutation results. For each test input, GraphPrior generates a mutation feature vector for it based on its mutation results. The \(i\)th element of this feature vector denotes whether this input kills the \(i\)th mutated model. More specifically, given a test input \(t \in T\), if \(t\) kills a mutated model \(M_i\), then the \(i\)th element of \(t\)’s mutation feature vector is set to 1. Otherwise, the \(i\)th element is set to 0.
Ranking test inputs based on mutation feature vectors via ranking models. GraphPrior utilizes ranking models [5, 42, 83] to calculate a misclassification score for each test input based on its feature vector. This score can indicate how likely a test input will be misclassified by the GNN model. Finally, GraphPrior ranks the test inputs based on their misclassification scores in descending order and outputs the prioritized test set \(T^{\prime }\).
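The first three stages can be summarized in a short sketch. The helper interface below is assumed purely for illustration: the original model and each mutated model are callables that return predicted labels for all nodes of the input graph.

```python
import numpy as np

def mutation_feature_vectors(model, mutants, graph, test_idx):
    """Sketch of stages 1-3: run the original GNN and every mutated model on the
    test nodes and record, for each test input, which mutants it kills (i.e.,
    which mutants predict a different label than the original model)."""
    orig_pred = model(graph)[test_idx]                 # assumed: model(graph) -> labels per node
    features = np.zeros((len(test_idx), len(mutants)), dtype=int)
    for j, mutant in enumerate(mutants):
        features[:, j] = (mutant(graph)[test_idx] != orig_pred).astype(int)  # 1 = mutant j killed
    return features                                    # row i = mutation feature vector of test i
```

In stage 4, these vectors are fed either to a simple kill count (the killing-based approach of Section 3.3) or to a trained ranking model (the feature-based approaches of Section 3.4).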

3.2 Mutation Rules

In GraphPrior, mutation rules are employed to generate mutated models of a GNN model by making slight changes to its training parameters. We select the following parameters, because they can impact the message passing in the GNN prediction process. More specifically, in the mutated GNN model, the manner in which nodes acquire information from their neighboring nodes is slightly different from that of the original GNN model. Although variations of GNNs can be obtained even without changing training parameters, the resulting model mutants cannot produce meaningful differences in the GNN model’s behavior. By changing the selected training parameters to generate mutants, we can intentionally introduce meaningful modifications to the model’s behavior in terms of the interdependencies between nodes during the prediction process. We present all the mutation rules of GraphPrior as follows:
Self Loops (SL) [45, 79]. SL is a Boolean parameter, which controls whether to add self-loops to the input graph. When the SL parameter is set to True, self-loops are introduced to each node in the graph. By incorporating self-loops, the inherent information of nodes can be effectively aggregated into their representation vectors, leading to a change in the weighting of their neighboring nodes, and thus affecting the interdependence of nodes in the prediction process.
Bias (BIA) [30, 45, 79]. BIA is a Boolean parameter, which determines whether to introduce a predetermined offset to the representation vectors of nodes. When the BIA parameter is enabled (set to True), each node will be assigned a corresponding bias parameter to its representation vector, allowing the GNN model to better capture the inherent properties of the graph and improve the interdependence between nodes in the prediction process.
Cached (CA) [45]. CA is a Boolean parameter that controls whether to cache the computation of node embeddings during the forward pass. When the CA parameter is set to True, the node embeddings are cached and reused during the backward pass to save computation time. Caching the computation of node embeddings can affect the interdependence between nodes by altering the order and efficiency of message passing.
Improved (IMP) [45]. IMP is a Boolean parameter that controls whether to use the improved message passing strategy, thus affecting the interdependence between nodes in the prediction process.
Normalize (NOR) [21, 30]. NOR is a Boolean parameter, which determines whether to normalize the messages passed between nodes in the prediction process. When this parameter is set to “True,” the messages are normalized by the number of neighbors that a node has before being passed to the next layer. This normalization can impact the contribution of each neighbor to the node’s final representation, thus affecting the message passing between nodes in the prediction process.
Concat (CON) [79]. CON is a Boolean parameter, which controls how the representations of neighboring nodes are combined during message passing. When it is set to True, the representations of neighboring nodes are concatenated before being passed, resulting in a more expressive representation of the nodes, enabling the GNN to capture more nuanced interdependencies between them.
Heads (HDS) [79]. HDS is an integer parameter that determines the number of attention heads used in multi-head attention. Increasing the number of heads allows the model to capture more complex interdependence among nodes in the graph. Each attention head can focus on a different aspect of the node neighborhood, enabling the model to learn different representations of the graph.
Epoch (EP) [21, 30, 79]. EP is an integer parameter that controls the number of times a GNN model iterates over the training dataset. By increasing the number of epochs, a GNN model can better capture the interdependence between nodes for model inference.
Hidden Channel (HC) [21, 30, 45, 79]. HC is an integer parameter, which controls the dimensionality of the hidden representation in each layer of the GNN. Therefore, changing this parameter can impact the interdependence between nodes in a graph by enabling the GNN to learn more expressive node embeddings.
Negative Slope (NS) [79]. NS is a float parameter, which controls the slope of the negative part of the activation function used in the Gated Linear Unit (GLU) operation. GLU is a common non-linear function used in GNNs for message passing. Specifically, the GLU operation is used to combine the node features with the weighted sum of their neighboring nodes’ features, which is the message passed between nodes in the graph. The negative slope parameter determines the slope of the activation function for negative input values in the GLU operation, thus impacting the message passing between nodes.
Based on the above mutation rules, for a given test set and a GNN model, GraphPrior generates \(N\) mutated models of the original model. We consider that a test input kills a mutated model if the predictions for this input via the mutated models and the original GNN model are different. Based on it, GraphPrior obtains the mutation results of all the test inputs.
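As a rough illustration of how such mutants could be produced, the sketch below perturbs a configuration dictionary of training parameters; the dictionary keys, their initial values, and the `mutate_config` helper are hypothetical, and the numeric ranges follow the settings described later in Section 4.6. Each mutated configuration would then be used to train one mutated model.

```python
import copy
import random

# Illustrative training configuration of the original GNN (values are placeholders).
base_config = {
    "add_self_loops": True,   # SL
    "bias": True,             # BIA
    "cached": False,          # CA
    "improved": False,        # IMP
    "normalize": True,        # NOR
    "hidden_channels": 16,    # HC
    "epochs": 200,            # EP
    "heads": 8,               # HDS
}

def mutate_config(config, rule):
    """Apply a single mutation rule: flip Boolean parameters, and resample
    integer parameters uniformly from a small range near the original value.
    Float parameters (e.g., negative slope) would be perturbed similarly."""
    mutated = copy.deepcopy(config)
    if isinstance(mutated[rule], bool):
        mutated[rule] = not mutated[rule]
    elif rule == "hidden_channels":
        mutated[rule] = random.randint(15, 19)   # range [15, 20), cf. Section 4.6
    elif rule == "epochs":
        mutated[rule] = random.randint(1, 50)    # <= 50
    elif rule == "heads":
        mutated[rule] = random.randint(1, 5)     # <= 5
    return mutated

# One mutated configuration per rule; training each yields one mutated model.
mutant_configs = [mutate_config(base_config, rule) for rule in base_config]
```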
Considering that the primary objective of generating mutated models is to obtain informative features for test prioritization, a statistical analysis is employed to validate their effectiveness. To achieve this, a series of repeated experiments are conducted, as outlined in Section 5. The results of these experiments demonstrate that GraphPrior’s effectiveness is statistically significant, thereby confirming the statistical validity of the generated mutated models for the purpose of test prioritization.

3.3 Killing-based GraphPrior

This section presents the workflow of KMGP, the Killing Mutants-based GNN Test Prioritization approach. Notably, KMGP operates on a “killing-based” principle, where test inputs that can kill more mutated models are considered as more likely to be misclassified and will be prioritized higher. It is worth noting that KMGP assigns equal importance to each mutated model in the process of test prioritization, a distinct feature that distinguishes it from feature-based approaches, which will be elaborated upon in subsequent sections. Given a GNN model \(G\) and a test input set \(T=\left\lbrace t_1, t_2, \ldots , t_n\right\rbrace\), the detailed execution of KMGP can be divided into three key stages: mutation generation, killing-based mutation analysis, and test prioritization.
Mutation generation. In the mutation generation stage, a group of mutated models \(\lbrace G^{\prime }_1, G^{\prime }_2, \ldots , G^{\prime }_N\rbrace\) is generated for the original GNN model \(G\).
Killing-based mutation analysis. This stage involves obtaining the mutation results of each test input \(t\in T\) using the process outlined in Section 3.2. Subsequently, KMGP counts the number of mutants killed by each test input based on their mutation results.
Test prioritization. In the third stage, KMGP prioritizes all the test inputs in \(T\) based on the number of mutated models they killed, with those that kill more mutants being prioritized higher in the test sequence.
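A minimal sketch of the killing-based scoring is shown below, reusing the kill matrix (i.e., the mutation feature vectors) produced in the sketch of Section 3.1; the matrix values here are toy data.

```python
import numpy as np

def kmgp_prioritize(features):
    """KMGP sketch: every mutated model is weighted equally, so a test's score
    is simply the number of mutants it kills; tests are sorted by that count."""
    kill_counts = features.sum(axis=1)
    return np.argsort(-kill_counts)          # test indices, most kills first

# Toy kill matrix: 4 test inputs x 3 mutated models.
features = np.array([[1, 0, 1],
                     [0, 0, 0],
                     [1, 1, 1],
                     [0, 1, 0]])
print(kmgp_prioritize(features))             # [2 0 3 1]
```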

3.4 Feature-based GraphPrior

In comparison to the killing-based GraphPrior approach, the feature-based approaches are characterized by automatic mutation feature analysis. This process involves the generation of mutation feature vectors based on the execution of mutated models, followed by the use of ranking models (ML/DL models), which assign different importance to each mutated model for test prioritization.
Overall, the feature-based approaches’ workflow entails three key stages: mutated model generation, mutation feature generation, and learning-to-rank.
Mutated model generation. Given a GNN model \(G\) and a test set \(T\), during the first stage, the feature-based approaches generate a group of mutated models (denoted as \(\lbrace G^{\prime }_1, G^{\prime }_2, \ldots , G^{\prime }_N\rbrace\)) of the GNN model \(G\) based on the mutation rules specified in Section 3.2.
Mutation feature generation. Subsequently, the feature-based approaches associate a feature vector \(V_t\) of size \(N\) with each test input \(t\), where \(N\) represents the number of mutated models, and \(v_k (=V_t[k])\) maps to the execution output for the mutated model \(G^{\prime }_k\). If \(t\) kills the mutated model \(G^{\prime }_k\) (i.e., the prediction results for \(t\) via the mutated models \(G^{\prime }_k\) and the original model \(G\) are different), then \(v_{k}\) is set to 1. Otherwise, it is set to 0.
Learning-to-rank. In the final stage, the feature-based approaches input the mutation features of each test input to the ranking model (ML/DL models) [5, 15, 42, 78, 83]. The ranking models can automatically learn different importance for each mutation feature to output misclassification scores. Here, each mutation feature corresponds to the execution result of a mutated model so we can consider that the ranking models learn the importance of each mutated model for test prioritization. Finally, the feature-based approaches rank all the test inputs based on their misclassification scores in descending order.
In our study, we propose five feature-based GraphPrior approaches, which follow the similar workflow described above, but leverage different ranking models. These five approaches are XGGP (XGBoost-based GNN Test Prioritization), LRGP (Logistic Regression-based GNN Test Prioritization), LGGP (LightGBM-based GNN Test Prioritization), RFGP (Random Forest-based GNN Test Prioritization), and DNGP (DNN-based GNN Test Prioritization). We briefly introduce the basic principle of the ranking models of these approaches as follows:
(1)
XGGP leverages the XGBoost algorithm [15] as the ranking model. XGBoost is a highly effective gradient boosting algorithm that combines decision trees to enhance the accuracy of predictions. XGGP utilizes the XGBoost algorithm to predict the misclassification score for a given test input based on its mutation features. This score reflects the likelihood that the input will be misclassified by a GNN model.
(2)
LRGP leverages the Logistic Regression algorithm [83] as the ranking model. Logistic regression leverages a logistic function to model the association between a categorical dependent variable and one or more independent variables.
(3)
LGGP leverages the LightGBM algorithm [42] as the ranking model. LightGBM is a gradient boosting framework that employs tree-based learning algorithms. The fundamental principle of LightGBM is similar to XGBoost, which employs decision trees based on learning algorithms. However, LightGBM introduces a novel optimization in the framework, with a primary focus on enhancing the speed of model training.
(4)
RFGP leverages the random forest algorithm [5] as the ranking model. Random Forest is an ensemble learning algorithm that constructs multiple decision trees using random subsets of the training data and input features. The predictions from individual trees are combined to produce the final prediction using averaging or voting.
(5)
DNGP leverages a DNN model [78] as the ranking model. The DNN model can learn to rank test inputs based on their mutation features. After training, the DNN model can generate a score that reflects their misclassification probability. This score can then be used to rank test inputs in a test set.
Compared to the mutation features of PRIMA, the distinctive aspect of GraphPrior’s mutation features lies in their utilized mutation rules, which are specifically designed for GNNs. These mutation rules have the potential to directly or indirectly impact the message passing mechanism between nodes in graph data. Our experiment results in Section 5 demonstrate the effectiveness of the feature-based GraphPrior approaches. The observed effectiveness can be attributed, in part, to the selection of mutation rules and ranking models. Specifically, our mutation rules have been designed to generate informative mutation features by changing the message passing between nodes in the GNN prediction process. Furthermore, our ranking models are able to utilize these mutation features for test prioritization effectively. After sufficient training, ranking models can output a misclassification score that indicates how likely a sample would be misclassified based on its mutation features. A score closer to 1 indicates a higher probability of misclassification. By sorting the misclassification scores of test inputs in descending order, the feature-based GraphPrior approaches can effectively prioritize tests that are more likely to be misclassified.

3.5 Usage of GraphPrior

By utilizing ranking models, GraphPrior predicts a misclassification score for each test input within a given test set. These predicted scores are then utilized for test prioritization, whereby test inputs with higher scores are prioritized higher. Particularly, the ranking models are pre-trained before the execution of GraphPrior. The training process is standardized across all the different ranking models and follows a consistent set of procedures, which are presented in detail below.
Splitting datasets. Given a GNN model \(G\) with a dataset \(T\), we first split \(T\) into two partitions, the training set \(R\) and the test set, in a 7:3 ratio [61]. The test set remains untouched for the purpose of evaluating GraphPrior.
Constructing the training set for ranking models. Based on the training set \(R\), we aim to build a training set \(R^{\prime }\) for training the ranking models. First, we generate a group of mutated models for each input \(r_i \in R\). Then, we obtain the mutation feature vector \(V_i\) of \(r_i\) (i.e., a one-dimensional vector in which the \(i\)th element denotes whether the \(i\)th mutated model is killed by this input). The mutation feature vector of \(r_i\) is used to build the training set \(R^{\prime }\) (i.e., the training set of the ranking models). Second, we let the original GNN model \(G\) classify each input \(r_i \in R\) and compare it with the ground truth of \(r_i\). In this way, we can identify whether \(r_i\) is misclassified by the GNN model \(G\). If \(r_i\) is misclassified by \(G\), then we label it as 1. Otherwise, we label it as 0. In this way, we have built the ranking model training set \(R^{\prime }\).
Training ranking models. Based on \(R^{\prime }\), we train the ranking models. Upon the completion of the training process, the ranking model is capable of receiving the mutation feature vector of a test input as an input and producing a misclassification score as an output. This score serves as an indicator of the probability of the test input being incorrectly classified by the GNN model.
It is worth noting that the original labels of the training set \(R^{\prime }\) are binary (i.e., 1 or 0), but well-trained ranking models can output continuous values (i.e., the misclassification scores). To achieve this, we make some adaptations to implement the adopted ranking algorithms (e.g., random forest and XGBoost). First, although the ranking algorithms we adopted initially deal with classification tasks, an intermediate value is calculated for the classifications. For example, if the intermediate value exceeds 0.5 (default value, which can be adjusted), then the input will be classified into the first category; otherwise, into the other category. Here, after training, we let the ranking models directly output the intermediate value, as this value can indicate the likelihood of a test input being misclassified by the GNN model, where a higher value implies a greater likelihood of misclassification. We call this intermediate value the “misclassification score” and leverage the scores of test inputs to rank them.
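As a minimal sketch of this training-and-scoring procedure, assuming scikit-learn's RandomForestClassifier as the RFGP ranking model, with placeholder arrays standing in for the real mutation feature vectors and misclassification labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: mutation feature vectors of R' and binary labels
# (1 = the original GNN misclassified the input, 0 = correctly classified).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(500, 120))
y_train = rng.integers(0, 2, size=500)

ranker = RandomForestClassifier(n_estimators=100)   # n_estimators=100 as in Section 4.6
ranker.fit(X_train, y_train)

# At prioritization time, the "misclassification score" is the intermediate
# probability of class 1 rather than the hard 0/1 prediction.
X_test = rng.integers(0, 2, size=(50, 120))
scores = ranker.predict_proba(X_test)[:, 1]
prioritized = np.argsort(-scores)                   # higher scores ranked first
```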

4 Study Design

4.1 Research Questions

Our experimental evaluation answers the research questions below.
RQ1: How does the killing-based GraphPrior approach perform in prioritizing test inputs for GNNs?
In terms of test prioritization for GNNs, existing prioritization approaches usually do not take into account the interdependencies between nodes (tests) in a graph (test set). To fill the gap, we propose GraphPrior, which contains six GNN-oriented test prioritization approaches. Among them, KMGP is a killing-based approach, which regards a test input that kills more mutants as more likely to be misclassified. In this research question, we evaluate the effectiveness of the killing-based KMGP by comparing it with existing approaches that have been demonstrated as effective in detecting possibly misclassified test inputs.
RQ2: How do the feature-based GraphPrior approaches perform in GNN test prioritization?
In addition to the killing-based KMGP, GraphPrior involves five feature-based approaches. The core difference is that the killing-based approach regards the importance of each mutated model as equal, while the feature-based approaches learn different importance for each mutated model for test prioritization. More specifically, feature-based approaches extract features from mutation results and adopt ranking models [5, 42, 83] to utilize the mutation features for test prioritization. In this research question, we compare the effectiveness of killing-based and feature-based approaches to investigate the effect of ranking models in leveraging mutation results.
RQ3: How does GraphPrior perform on test inputs generated from graph adversarial attacks?
When faced with graph adversarial attacks, confidence-based test prioritization approaches may be fooled, thus becoming more confident in incorrect predictions. Therefore, we evaluate to what extent the effectiveness of GraphPrior is affected by graph adversarial attacks. We compare GraphPrior and confidence-based approaches [26, 36] on test inputs generated from graph adversarial attacks of existing studies [3, 48, 86, 100] to demonstrate its effectiveness.
RQ4: How does GraphPrior perform against different levels of graph adversarial attacks?
In this research question, we investigate the effectiveness of GraphPrior against different levels of graph adversarial attacks. To answer this research question, we set different levels of attacks to generate test inputs and compare GraphPrior with existing approaches to demonstrate its effectiveness.
RQ5: Which mutation rules generate more top contributing GNN mutants?
We investigate the contributions of each mutation rule in generating effective mutants of GNNs. For each GNN model, we select the top contributing mutation features to it through the XGBoost ranking algorithm [15], which is an optimized ML algorithm for ranking tasks based on the implementation of gradient boosting. We match each selected feature with the corresponding GNN mutant and identify the mutation rule that generates it. In this way, we obtain which mutation rules generate more top contributing mutants for test prioritization.
RQ6: Can GraphPrior and the uncertainty-based metrics be used in active learning scenarios to improve a GNN model by retraining?
In the face of a large number of unlabeled inputs and a limited time budget, it is not feasible to manually label all the inputs and use them to retrain a GNN. One established solution to reduce data labeling costs is active learning [67], which involves selecting informative subsets of training samples to improve the model performance. In this research question, we investigate the effectiveness of GraphPrior and the uncertainty-based metrics in selecting informative retraining inputs to improve the quality of a GNN model.

4.2 GNN Models and Datasets

In our study, we adopt a total of 604 subjects to evaluate the effectiveness of GraphPrior and the compared approaches [26, 36]. Table 1 exhibits their basic information. Among the 604 subjects considered in this study, 16 subjects were utilized in the experiments of RQ1, 16 subjects in RQ2, 108 subjects in RQ3, 432 subjects in RQ4, 16 subjects in RQ5, and 16 subjects in RQ6. It is worth noting that, among these subjects, a total of 64 subjects (which were utilized in RQ1, RQ5, and RQ6) were associated with clean datasets, while the remaining 540 subjects (which were utilized in RQ3 and RQ4) were associated with adversarial datasets.
Table 1. GNN Models and Datasets

ID | Dataset  | #Nodes | #Edges | Model     | Type
1  | CiteSeer | 3,327  | 4,732  | GCN       | Original, DICE, MMA, PGD, RAA, RAF, RAR
2  | CiteSeer | 3,327  | 4,732  | GAT       | Original, DICE, MMA, PGD, RAA, RAF, RAR
3  | CiteSeer | 3,327  | 4,732  | TAGCN     | Original, DICE, MMA, PGD, RAA, RAF, RAR
4  | CiteSeer | 3,327  | 4,732  | GraphSAGE | Original, DICE, MMA, PGD, RAA, RAF, RAR
5  | Cora     | 2,708  | 5,429  | GCN       | Original, DICE, MMA, PGD, RAA, RAF, RAR, NEAR, NEAA
6  | Cora     | 2,708  | 5,429  | GAT       | Original, DICE, MMA, PGD, RAA, RAF, RAR, NEAR, NEAA
7  | Cora     | 2,708  | 5,429  | TAGCN     | Original, DICE, MMA, PGD, RAA, RAF, RAR, NEAR, NEAA
8  | Cora     | 2,708  | 5,429  | GraphSAGE | Original, DICE, MMA, PGD, RAA, RAF, RAR, NEAR, NEAA
9  | LastFM   | 7,624  | 27,806 | GCN       | Original, DICE, PGD, RAA, RAF, RAR, NEAR, NEAA
10 | LastFM   | 7,624  | 27,806 | GAT       | Original, DICE, PGD, RAA, RAF, RAR, NEAR, NEAA
11 | LastFM   | 7,624  | 27,806 | TAGCN     | Original, DICE, PGD, RAA, RAF, RAR, NEAR, NEAA
12 | LastFM   | 7,624  | 27,806 | GraphSAGE | Original, DICE, PGD, RAA, RAF, RAR, NEAR, NEAA
13 | PubMed   | 19,717 | 44,338 | GCN       | Original, DICE, RAA, RAF, RAR, NEAR, NEAA
14 | PubMed   | 19,717 | 44,338 | GAT       | Original, DICE, RAA, RAF, RAR, NEAR, NEAA
15 | PubMed   | 19,717 | 44,338 | TAGCN     | Original, DICE, RAA, RAF, RAR, NEAR, NEAA
16 | PubMed   | 19,717 | 44,338 | GraphSAGE | Original, DICE, RAA, RAF, RAR, NEAR, NEAA
Our study involves four GNN models: GCN (Graph Convolutional Networks) [45], GAT (Graph Attention Networks) [79], GraphSAGE (Graph SAmple and aggreGatE) [30], and TAGCN (Topology Adaptive Graph Convolutional Network) [21], tested on four datasets, namely, Cora [88], CiteSeer [88], PubMed [88], and LastFM [70]. We present their descriptions as follows:

4.2.1 GNN Models.

GCN [45]. GCN is a class of convolutional neural networks that can work directly on the graph. It solves the problem of classifying nodes (such as documents) in graphs (such as citation networks), of which only a small number of nodes are labeled. The core idea of GCN is to use the edge information of a graph to aggregate node information to generate new node representations. GCN has been used in several existing studies [31, 35, 89].
GAT [79]. GAT introduces a self-attention mechanism in the propagation process. Compared to GCN, which regards all neighbors of a node equally, the attention mechanism assigns different attention scores to each neighbor, thereby identifying more important neighbors.
GraphSAGE [30]. GraphSAGE is a generalized inductive framework that generates node embeddings by sampling and aggregating features of neighbor nodes.
TAGCN [21]. TAGCN introduces a systematic approach to design a set of fixed-size learnable filters to perform convolutions on graphs. These filters adapt to the topology of the graph as they scan it for convolution.

4.2.2 Datasets.

Cora [88]. The Cora dataset is a citation graph composed of 2,708 scientific publications (nodes) and 5,429 links (edges) between them. Nodes represent ML papers, and edges represent citations between pairs of papers. Each paper is classified into one of seven classes, such as reinforcement learning and neural networks.
CiteSeer [88]. The CiteSeer dataset consists of 3,327 scientific publications (nodes) and 4,732 links (edges). Each paper belongs to one of six categories such as AI and ML.
PubMed [88]. The PubMed dataset contains 19,717 diabetes-related scientific publications (nodes) and 44,338 links (edges). Publications are classified into three classes such as Cancer and AIDS (i.e., Acquired Immune Deficiency Syndrome).
LastFM Asia Social Network [70]. The dataset LastFM Asia Social Network was collected from the social network of users on the Last.fm music platform in Asia. Nodes are LastFM users, and edges are mutual follower relationships between them. LastFM contains 7,624 nodes and 27,806 edges. The classification task of the LastFM dataset is to predict the home country of a user (e.g., Philippines, Malaysia, Singapore).
Notably, we evaluate GraphPrior on different types of test inputs (i.e., both natural test inputs and adversarial test inputs). We adopted eight graph adversarial attacks, presented in Section 4.4.

4.3 Compared Approaches

In our study, we considered seven compared approaches in total, including one baseline (i.e., random selection), four DNN test prioritization approaches, and two active learning approaches. We select these approaches due to the following reasons: (1) These approaches can be adapted for GNN test prioritization; (2) The selected approaches have been demonstrated as effective for DNNs in existing studies [26, 36, 82]; (3) The implementations of these approaches have been released by the authors.
DeepGini. DeepGini [26] prioritizes test inputs based on model confidence. DeepGini leverages the Gini coefficient to measure the likelihood of a test input being misclassified. DeepGini leverages Equation (1) to calculate the ranking scores.
\begin{equation} \xi (x) = 1-\sum _{i=1}^N\left(p_i(x)\right)^2 , \end{equation}
(1)
where \(\xi (x)\) refers to the likelihood of the test input \(x\) being misclassified. \(p_i(x)\) refers to the probability that the test input \(x\) is predicted to be label \(i\). \(N\) refers to the number of labels.
Margin. Margin [80] regards a test input with a smaller difference between its top two most confident predictions as more likely to be misclassified. The margin score is calculated by Equation (2).
\begin{equation} M(x)=p_{k}(x)-p_{j}(x), \end{equation}
(2)
where \(M(x)\) refers to the margin score. \(p_{k}(x)\) refers to the most confident prediction probability. \(p_{j}(x)\) refers to the second most confident prediction probability.
Least Confidence. Least Confidence [80] regards test inputs for which the model has the least confidence as more likely to be misclassified. Least confidence is calculated by Equation (3).
\begin{equation} L(x) = \max _{i=1: n} p_{i}(x), \end{equation}
(3)
where \(L(x)\) refers to the confidence score. \(p_i(x)\) refers to the probability that the test input \(x\) is predicted to be label \(i\) via a model \(M\).
Vanilla Softmax. Vanilla Softmax [82] is computed by subtracting the highest activation probability in the output softmax layer from 1, resulting in a metric that is positively correlated with the misclassification probability. Equation (4) presents the calculation of the Vanilla Softmax metric.
\begin{equation} \text{V}(x)=1-\max _{c=1}^C l_c(x), \end{equation}
(4)
where \(l_c(x)\) belongs to a valid softmax array in which all values are between 0 and 1, and their sum is 1.
Prediction-Confidence Score (PCS). PCS [82] calculates the difference between the predicted class and the second most confident class in softmax likelihood.
Entropy. Entropy [82] calculates the entropy of the softmax likelihood.
Random selection [22]. In random selection, the execution order of the test inputs is determined randomly.
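For reference, the uncertainty-based metrics above can all be computed directly from the model's softmax output. The sketch below is an illustrative numpy implementation of Equations (1)-(4) and the related scores; the sign conventions (negating metrics where smaller values mean lower confidence) are our own choice so that larger scores are always prioritized first.

```python
import numpy as np

def uncertainty_scores(probs):
    """`probs` is an (n_tests, n_classes) softmax matrix; each returned array
    is a score per test, with larger values meaning 'prioritize earlier'."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]            # per-row descending
    return {
        "deepgini": 1.0 - np.sum(probs ** 2, axis=1),     # Equation (1)
        "margin": -(sorted_p[:, 0] - sorted_p[:, 1]),     # Equation (2), negated
        "least_confidence": -np.max(probs, axis=1),       # Equation (3), negated
        "vanilla_softmax": 1.0 - np.max(probs, axis=1),   # Equation (4)
        "pcs": -(sorted_p[:, 0] - sorted_p[:, 1]),        # same quantity as margin
        "entropy": -np.sum(probs * np.log(probs + 1e-12), axis=1),
    }

# Toy usage: two test inputs over three classes; the second (more uniform)
# input receives the higher DeepGini score.
probs = np.array([[0.60, 0.30, 0.10],
                  [0.34, 0.33, 0.33]])
print(uncertainty_scores(probs)["deepgini"])
```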

4.4 Graph Adversarial Attacks

In RQ3 and RQ4, we evaluate the effectiveness of GraphPrior on test inputs generated through diverse graph adversarial attacks, in which attackers aim to generate graph adversarial perturbations by manipulating the graph structure or node features to fool the GNN models. We introduce all the attacks we applied in our experiments as follows:
Disconnect Internally, Connect Externally (DICE) [100]. The DICE attack is a type of white-box attack whereby the adversary has access to all information about the targeted GNN model, including its parameters, training data, labels, and predictions. Specifically, the DICE attack randomly adds edges between nodes with different labels or removes edges between nodes sharing the same label. Through this, the attack can generate adversarial perturbations that can fool the targeted GNN model.
PGD attack [86]. The PGD attack leverages the Projected Gradient Descent (PGD) algorithm to search for optimal structural perturbations to attack GNNs.
Min-max attack (MMA) [86]. The min-max attack is a type of untargeted white-box GNN attack. The attack problem is formulated as a min-max problem, where the inner maximization is designed to update the model’s parameters (\(\theta\)) by maximizing the attack loss, and it can be solved using gradient ascent. However, the outer minimization can be achieved by using PGD [59].
Node Embedding Attack-Add (NEAA) [3]. In node embedding attack-add, the attackers are capable of modifying the original graph structure by adding new edges while adhering to a predefined budget constraint.
Node Embedding Attack-Remove (NEAR) [3]. In node embedding attack-remove, the attackers modify the original graph structure by removing edges.
Random Attack-Add (RAA) [48]. The Random Attack-Add approach randomly adds edges to the input graph to fool the targeted GNN model.
Random Attack-Flip (RAF) [48]. The Random Attack-Flip approach randomly flips edges in the input graph to fool the targeted GNN model.
Random Attack-Remove (RAR) [48]. The Random Attack-Remove approach randomly removes edges from the input graph to fool the targeted GNN model.

4.5 Evaluation of Mutation Rules (RQ5)

In RQ5, we investigated the contribution of different mutation rules in generating top contributing mutated models. First, for each GNN model, we utilize the cover metric in XGBoost [15] to evaluate the importance of its mutation features and rank them according to the descending order of the importance scores. The cover metric can evaluate the importance of mutation features by quantifying the average coverage of each instance by the leaf nodes in a decision tree. Specifically, it calculates the number of times a particular feature is used to split the data across all trees in the ensemble and then sums up the coverage values for each feature over all trees. This coverage value is then normalized by the total number of instances to obtain the average coverage of each instance by the leaf nodes. The importance of a feature is then calculated based on its coverage value, and features with higher coverage values are considered more important.
Upon obtaining the importance of each mutation feature, which corresponds to a specific mutated model, we proceed to match and determine the importance of the respective mutated models. Subsequently, we select the top N critical mutated models and identify the specific mutated rules employed in their generation. This enables a comparative analysis of the contributions of various mutation rules.
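A minimal sketch of this analysis using the xgboost package is shown below; the placeholder arrays stand in for the real mutation feature vectors and misclassification labels, and mapping the top feature indices back to mutation rules relies on bookkeeping of which rule produced each mutated model.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 100)).astype(float)   # mutation feature vectors (placeholder)
y = rng.integers(0, 2, size=400)                         # 1 = misclassified by the GNN (placeholder)

clf = xgb.XGBClassifier(n_estimators=100)
clf.fit(X, y)

# "cover"-based importance of each mutation feature, i.e., of each mutated model.
cover = clf.get_booster().get_score(importance_type="cover")   # e.g., {"f3": 12.5, ...}
top = sorted(cover.items(), key=lambda kv: kv[1], reverse=True)[:10]
top_mutants = [int(name[1:]) for name, _ in top]               # "f<i>" -> index of mutant i
print(top_mutants)  # indices of the top contributing mutated models
```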

4.6 Implementation and Configuration

We implemented GraphPrior in Python based on the PyTorch 1.11.0 framework [65]. We also integrate the available implementations of the compared approaches [26, 57, 80, 82] into our experimental pipeline to adapt to the GNN prioritization problem. Regarding our mutation rules, we set the number of mutated models as 80~240 across different subjects. Balancing the tradeoff between execution time and the effectiveness of GraphPrior is a critical consideration in determining the number of mutants. Building on relevant literature [81], we identified a suitable range of mutants. Our preliminary investigations on multiple subjects demonstrate that these settings effectively maintain the effectiveness of GraphPrior while controlling the runtime within a reasonable range. In the case of subjects associated with longer mutant generation times, we choose to generate a comparatively smaller number of mutants compared to other subjects. Additionally, the range was achieved through the full execution of all pre-defined mutation rules. It is worth noting that the total number of mutation rules was predetermined and fixed. Thus, even with the addition of new mutants, the impact on the performance of GraphPrior is minor, as the new mutants are created based on the existing mutation rules.
With regard to the specific mutation rules that change the integer/float training parameters, we define a parameter range close to the original parameter values to achieve slight mutations. We conducted a preliminary study using multiple subjects, demonstrating the effectiveness of such settings. Moreover, to obtain parameter values from the specified range, we adopt uniform sampling [56] as the sampling methodology. This technique ensures an equitable probability of selecting each value within the parameter range and has been widely adopted across the ML testing field [56, 60, 96].
More specifically, we set the hidden channel parameter in the range [15, 20), the epochs parameter to at most 50, the heads parameter to at most 5, and the negative slope parameter to at most 0.2. For the mutation rules that change Boolean-type parameters, if the parameter value of the original model is true, then we set it to false; if the original value is false, then we set it to true. The parameter ranges for our mutation rules are carefully selected to ensure that the change to the original GNN model is slight.
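The following is a minimal sketch of how a mutated parameter configuration could be drawn from these ranges via uniform sampling; the parameter names in the configuration dictionary are illustrative, and an actual mutation rule would typically change only one of them at a time.

```python
import random

def mutate_params(original: dict, rule: str, seed: int = 0) -> dict:
    """Return a slightly mutated copy of the training parameters via uniform sampling."""
    random.seed(seed)
    mutated = dict(original)
    if rule == "hidden_channels":
        mutated["hidden_channels"] = random.randint(15, 19)      # integer drawn from [15, 20)
    elif rule == "epochs":
        mutated["epochs"] = random.randint(1, 50)                # epochs <= 50
    elif rule == "heads":
        mutated["heads"] = random.randint(1, 5)                  # attention heads <= 5
    elif rule == "negative_slope":
        mutated["negative_slope"] = random.uniform(0.0, 0.2)     # negative slope <= 0.2
    else:
        mutated[rule] = not mutated[rule]                        # Boolean parameters are simply negated
    return mutated

original_cfg = {"hidden_channels": 16, "epochs": 200, "heads": 8,
                "negative_slope": 0.2, "bias": True, "normalize": True}
print(mutate_params(original_cfg, "hidden_channels"))
print(mutate_params(original_cfg, "bias"))
```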
With respect to the configuration of the ranking models utilized in GraphPrior, we made several parameter selections: For the random forest, XGBoost, and LightGBM ranking algorithms, we set the n_estimators parameter to 100. For the DNN ranking model, we set the learning_rate parameter to 0.01. Finally, for the logistic regression ranking algorithm, we set the max_iter parameter to 100.
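The sketch below instantiates the five ranking models with the stated parameters using standard library classes; whether GraphPrior uses exactly these classifier variants (e.g., an MLP as the DNN ranking model) is an assumption for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

ranking_models = {
    "RFGP": RandomForestClassifier(n_estimators=100),  # random forest
    "XGGP": XGBClassifier(n_estimators=100),            # XGBoost
    "LGGP": LGBMClassifier(n_estimators=100),            # LightGBM
    "DNGP": MLPClassifier(learning_rate_init=0.01),      # simple DNN ranker with learning rate 0.01
    "LRGP": LogisticRegression(max_iter=100),             # logistic regression
}
```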
We conducted the experiments on a high-performance computing cluster; each cluster node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16 GB SXM2 GPU. For data processing, we used a MacBook Pro laptop with Mac OS Big Sur 11.6, an Intel Core i9 CPU, and 64 GB RAM.

4.7 Measurements

Following the existing study [26], we leverage Average Percentage of Fault-Detection (APFD) [92] to evaluate the prioritization effectiveness of GraphPrior and the compared approaches. APFD is a standard metric for prioritization evaluation. Typically, higher APFD values indicate faster misclassification detection rates. We calculate the APFD values by Equation (5)
\begin{equation} APFD = 1-\frac{\sum_{i=1}^{k} o_i}{kn}+\frac{1}{2n} , \end{equation}
(5)
where \(n\) is the number of test inputs in the test set \(T\), \(k\) is the number of test inputs in \(T\) that are misclassified by the GNN model \(G\), and \(o_i\) is the index of the \(i\)th misclassified test in the prioritized test set, that is, an integer representing its position in the prioritized order. When \(\sum _{i=1}^k o_i\) is small (i.e., the total index sum of the misclassified tests within the prioritized list is small), the misclassified tests are ranked higher, and the APFD value is large according to Equation (5). Therefore, a large APFD indicates better prioritization effectiveness. Following the existing study [26], we normalize the APFD values to [0,1]. We consider a prioritization approach better when its APFD value is closer to 1. We present the comparison results in tables.
For a more detailed analysis, we utilize PFD (Percentage of Fault Detected) [26] to evaluate the fault detection rate of each approach on different ratios of prioritized test inputs. Higher PFD values indicate higher effectiveness in detecting misclassified test inputs. PFD is calculated by Equation (6)
\begin{equation} PFD = \frac{F_c}{F_t} , \end{equation}
(6)
where \(F_c\) is the number of faults (i.e., misclassified test inputs) correctly detected and \(F_t\) is the total number of faults. More specifically, we evaluate the fault detection rate of GraphPrior against different ratios of prioritized tests, and we use PFD-n to denote the PFD achieved on the first n% of prioritized test inputs.
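For reference, the following is a minimal sketch of how APFD (Equation (5)) and PFD-n (Equation (6)) can be computed from a prioritized test order and its ground-truth misclassification labels; variable names are illustrative.

```python
import numpy as np

def apfd(misclassified: np.ndarray) -> float:
    """APFD of a prioritized order; misclassified[i] is True if the i-th prioritized test is misclassified."""
    n = len(misclassified)                    # total number of test inputs
    o = np.flatnonzero(misclassified) + 1     # 1-based positions of the misclassified tests
    k = len(o)                                # number of misclassified tests
    return 1.0 - o.sum() / (k * n) + 1.0 / (2 * n)

def pfd(misclassified: np.ndarray, ratio: float) -> float:
    """PFD-n: fraction of all misclassified tests detected within the first `ratio` of the order."""
    n_selected = int(np.ceil(ratio * len(misclassified)))
    return misclassified[:n_selected].sum() / misclassified.sum()

order = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=bool)  # an illustrative prioritized order
print(apfd(order), pfd(order, 0.25))
```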

5 Results and Analysis

5.1 RQ1: Effectiveness of the Killing-based GraphPrior Approach (KMGP)

Objectives: We investigate the effectiveness of the killing-based GraphPrior approach, KMGP (cf. Section 3.3), comparing it with existing approaches that can be used to identify possibly misclassified test inputs.
Experimental design: We used 16 pairs of datasets and GNN models as subjects to evaluate the effectiveness of GraphPrior. Table 1 exhibits their basic information. We carefully selected seven compared approaches (i.e., DeepGini, least confidence, margin, Vanilla SM, PCS, entropy, and random selection), which can be adapted for GNN test prioritization. Random selection is considered the baseline. We adopt two metrics to measure the effectiveness of GraphPrior and the compared approaches: Average Percentage of Fault-Detection (APFD) and PFD, which are explained in Section 4.7.
Due to the randomness of the training process of a GNN model, we conduct a statistical analysis by repeating all the experiments 10 times. More specifically, for each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes.
Results: The GraphPrior approach KMGP outperforms all the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in GNN test prioritization. Table 2 presents the comparison results of KMGP and the compared approaches using the APFD metric. We highlight the approach with the highest effectiveness for each case in grey. The results demonstrate that KMGP outperforms the other approaches in the majority of cases, specifically, in 87.5% (14 out of 16) of the subjects, while Vanilla SM performs the best in only 12.5% of cases. Additionally, the average APFD value achieved by KMGP is 0.748, which is higher than that of the compared techniques, with improvements of 4.76%~49.6%. These results suggest that KMGP offers a promising solution for prioritizing GNN test inputs.
Table 2.
| Data | Model | KMGP | DeepGini | Least Confidence | Margin | Vanilla SM | PCS | Entropy | Random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CiteSeer | GAT | 0.708 | 0.671 | 0.691 | 0.694 | 0.691 | 0.694 | 0.646 | 0.508 |
| CiteSeer | GCN | 0.701 | 0.641 | 0.677 | 0.682 | 0.677 | 0.682 | 0.638 | 0.502 |
| CiteSeer | GraphSAGE | 0.739 | 0.663 | 0.684 | 0.684 | 0.684 | 0.684 | 0.659 | 0.497 |
| CiteSeer | TAGCN | 0.712 | 0.658 | 0.691 | 0.694 | 0.691 | 0.694 | 0.620 | 0.499 |
| Cora | GAT | 0.841 | 0.742 | 0.770 | 0.763 | 0.770 | 0.763 | 0.733 | 0.487 |
| Cora | GCN | 0.812 | 0.690 | 0.736 | 0.739 | 0.736 | 0.739 | 0.684 | 0.495 |
| Cora | GraphSAGE | 0.792 | 0.727 | 0.781 | 0.784 | 0.781 | 0.784 | 0.704 | 0.515 |
| Cora | TAGCN | 0.782 | 0.701 | 0.739 | 0.738 | 0.739 | 0.738 | 0.690 | 0.498 |
| LastFM | GAT | 0.801 | 0.633 | 0.695 | 0.713 | 0.695 | 0.713 | 0.534 | 0.498 |
| LastFM | GCN | 0.761 | 0.713 | 0.758 | 0.746 | 0.758 | 0.746 | 0.603 | 0.497 |
| LastFM | GraphSAGE | 0.702 | 0.734 | 0.761 | 0.754 | 0.761 | 0.754 | 0.626 | 0.502 |
| LastFM | TAGCN | 0.673 | 0.719 | 0.741 | 0.730 | 0.741 | 0.730 | 0.657 | 0.498 |
| PubMed | GAT | 0.735 | 0.642 | 0.670 | 0.661 | 0.670 | 0.661 | 0.645 | 0.502 |
| PubMed | GCN | 0.748 | 0.645 | 0.680 | 0.670 | 0.680 | 0.670 | 0.647 | 0.501 |
| PubMed | GraphSAGE | 0.747 | 0.631 | 0.685 | 0.675 | 0.685 | 0.675 | 0.634 | 0.498 |
| PubMed | TAGCN | 0.720 | 0.613 | 0.663 | 0.672 | 0.663 | 0.672 | 0.615 | 0.497 |
| Average |  | 0.748 | 0.677 | 0.714 | 0.712 | 0.714 | 0.712 | 0.646 | 0.500 |
Table 2. Effectiveness Comparison among KMGP and the Compared Approaches in Terms of APFD
Table 3 exhibits the comparison results among the test prioritization techniques with respect to PFD. We highlight the approach with the highest effectiveness for each case in grey. The findings indicate that, for 68.75% (11 out of 16) of the subjects, KMGP performs best when prioritizing less than 50% of tests. Furthermore, for a majority of the subjects, specifically, 87.5% (14 out of 16), KMGP exhibits the best performance when prioritizing less than 30% of tests. In addition, Table 4 exhibits the overall comparison results in terms of PFD. We can see that, when prioritizing 10%~30% of test inputs, the average effectiveness of KMGP outperforms that of the compared approaches in 100% of cases; when prioritizing 10%~50% of test inputs, it does so in 90% of cases. Figure 2 plots the ratio of detected misclassified tests against the ratio of prioritized tests. We see that GraphPrior achieves a higher APFD value in comparison to DeepGini, entropy, least confidence, margin, Vanilla SM, PCS, and random. These results confirm the effectiveness of KMGP in GNN test input prioritization.
Table 3.
DataModelApproachesPFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70DataModelApproachesPFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70
CiteSeerGATKMGP0.2640.4640.6290.7500.8120.8410.875LastFMGATKMGP0.3890.6830.8100.8690.9020.9270.945
DeepGini0.2110.3820.5210.6460.7480.8280.895DeepGini0.2010.3630.4950.6030.6950.7700.839
Entropy0.2030.3730.5060.6210.7160.7880.844Entropy0.1910.3230.4220.4940.5530.6070.675
Least Confidence0.2310.4090.5500.6800.7770.8610.913Least Confidence0.2370.4290.5850.7060.7910.8560.908
Margin0.2280.4010.5470.6880.7940.8640.914Margin0.2620.4660.6230.7340.8140.8680.916
Vanilla SM0.2310.4090.5500.6800.7770.8610.913Vanilla SM0.2370.4290.5850.7060.7910.8560.908
PCS0.2280.4010.5470.6880.7940.8640.914PCS0.2620.4660.6230.7340.8140.8680.916
Random0.0990.1920.2960.3910.4930.5910.689Random0.1010.2010.3000.4010.4950.5890.695
GCNKMGP0.2780.4920.6520.7230.7710.8110.865GCNKMGP0.4030.6480.7280.7700.8300.8680.915
DeepGini0.2000.3550.4900.6000.6970.7830.858DeepGini0.2670.4670.6000.7150.7990.8750.928
Entropy0.2010.3540.4870.5950.6920.7790.856Entropy0.2540.4110.5010.5700.6270.6850.755
Least Confidence0.2290.4060.5440.6610.7480.8270.889Least Confidence0.2980.5300.6910.7990.8800.9270.956
Margin0.2140.3990.5560.6740.7760.8440.895Margin0.2780.4990.6610.7830.8650.9200.951
Vanilla SM0.2290.4060.5440.6610.7480.8270.889Vanilla SM0.2980.5300.6910.7990.8800.9270.956
PCS0.2140.3990.5560.6740.7760.8440.895PCS0.2780.4990.6610.7830.8650.9200.951
Random0.0980.1970.2920.3880.4880.5870.690Random0.0980.1990.3020.3970.5030.6000.704
GraphSAGEKMGP0.3060.5250.6790.7740.8350.8790.910GraphSAGEKMGP0.3020.4820.5800.6680.8000.8420.902
DeepGini0.2080.3740.5130.6260.7380.8230.885DeepGini0.2850.5010.6550.7650.8360.8930.929
Entropy0.2070.3710.5100.6220.7270.8160.877Entropy0.2830.4430.5200.5870.6490.7080.775
Least Confidence0.2230.4050.5450.6700.7690.8500.904Least Confidence0.2940.5270.7090.8250.8920.9220.946
Margin0.2140.3980.5490.6720.7690.8510.908Margin0.2760.5250.7000.8190.8830.9130.944
Vanilla SM0.2230.4050.5450.6700.7690.8500.904Vanilla SM0.2940.5270.7090.8250.8920.9220.946
PCS0.2140.3980.5490.6720.7690.8510.908PCS0.2760.5250.7000.8190.8830.9130.944
Random0.1010.2060.3110.4170.5150.6090.693Random0.0950.1940.2980.3980.4980.5960.697
TAGCNKMGP0.2950.4900.6170.7230.7950.8450.888TAGCNKMGP0.2500.4310.5440.6440.7060.8190.892
DeepGini0.2160.3750.5120.6220.7190.8080.877DeepGini0.2600.4610.6150.7310.8210.8850.934
Entropy0.2140.3660.4920.5920.6930.7490.801Entropy0.2580.4510.5770.6530.7200.7690.816
Least Confidence0.2460.4270.5700.6780.7720.8450.905Least Confidence0.2600.4750.6420.7690.8650.9280.966
Margin0.2340.4300.5780.6880.7760.8500.907Margin0.2380.4500.6160.7550.8560.9220.962
Vanilla SM0.2460.4270.5700.6780.7720.8450.905Vanilla SM0.2600.4750.6420.7690.8650.9280.966
PCS0.2340.4300.5780.6880.7760.8500.907PCS0.2380.4500.6160.7550.8560.9220.962
Random0.1010.1960.2970.3830.4820.5860.684Random0.1000.2030.2990.4010.4970.5960.697
CoraGATKMGP0.4540.7590.8840.9190.9390.9540.975GATPubMedKMGP0.3360.5880.6970.7540.8130.8590.893
DeepGini0.2950.5090.6690.7810.8520.8920.928DeepGini0.2050.3590.4950.6070.7020.7820.856
Entropy0.2930.5030.6580.7660.8420.8860.918Entropy0.2050.3600.4960.6090.7070.7880.864
Least Confidence0.2960.5390.7240.8300.8990.9320.962Least Confidence0.2130.3840.5320.6570.7580.8410.895
Margin0.2820.5250.7080.8150.8790.9340.970Margin0.2150.3880.5320.6560.7500.8170.871
Vanilla SM0.2960.5390.7240.8300.8990.9320.962Vanilla SM0.2130.3840.5320.6570.7580.8410.895
PCS0.2820.5250.7080.8150.8790.9340.970PCS0.2150.3880.5320.6560.7500.8170.871
Random0.0990.1920.2940.3920.4780.5780.679Random0.1010.2010.2980.3960.4970.5950.696
GCNKMGP0.3840.7040.8540.8840.9090.9330.952GCNKMGP0.3470.6070.7430.7880.8260.8600.894
DeepGini0.2490.4180.5690.6820.7760.8530.908DeepGini0.2150.3950.5340.6240.6980.7710.838
Entropy0.2450.4110.5590.6760.7630.8400.897Entropy0.2160.3950.5350.6260.7010.7740.842
Least Confidence0.2650.4800.6430.7700.8480.9060.954Least Confidence0.2230.4070.5600.6860.7820.8440.890
Margin0.2540.4690.6530.7810.8600.9120.956Margin0.2110.3970.5500.6790.7680.8320.876
Vanilla SM0.2650.4800.6430.7700.8480.9060.954Vanilla SM0.2230.4070.5600.6860.7820.8440.890
PCS0.2540.4690.6530.7810.8600.9120.956PCS0.2110.3970.5500.6790.7680.8320.876
Random0.0970.1970.2910.3980.5050.5960.695Random0.0980.2020.3020.4030.5030.6020.704
GraphSAGEKMGP0.4890.7050.7770.8200.8480.8860.919GraphSAGEKMGP0.3960.6350.7130.7570.8080.8500.889
DeepGini0.3230.4980.6230.7360.8290.8780.922DeepGini0.2140.3640.4880.5890.6760.7560.829
Entropy0.3180.4820.6040.7100.7920.8460.885Entropy0.2150.3650.4900.5910.6800.7610.834
Least Confidence0.3560.5840.7230.8330.9030.9400.962Least Confidence0.2290.4070.5610.6820.7740.8460.901
Margin0.3630.6040.7350.8300.8970.9390.964Margin0.2290.4120.5550.6680.7560.8320.884
Vanilla SM0.3560.5840.7230.8330.9030.9400.962Vanilla SM0.2290.4070.5610.6820.7740.8460.901
PCS0.3630.6040.7350.8300.8970.9390.964PCS0.2290.4120.5550.6680.7560.8320.884
Random0.1070.2050.3060.4030.5000.5960.691Random0.0960.2000.3030.4000.5050.6060.704
TAGCNKMGP0.3720.6680.7880.8410.8630.8880.914TAGCNKMGP0.3790.5450.6100.7220.7910.8440.885
DeepGini0.2490.4500.5860.6960.7830.8570.914DeepGini0.2100.3520.4680.5530.6440.7320.811
Entropy0.2460.4420.5780.6890.7710.8380.895Entropy0.2110.3540.4700.5570.6500.7360.814
Least Confidence0.2730.4810.6380.7620.8500.9130.954Least Confidence0.2230.3970.5410.6580.7440.8150.867
Margin0.2550.4660.6380.7640.8610.9220.964Margin0.2320.4140.5660.6750.7610.8220.868
Vanilla SM0.2730.4810.6380.7620.8500.9130.954Vanilla SM0.2230.3970.5410.6580.7440.8150.867
PCS0.2550.4660.6380.7640.8610.9220.964PCS0.2320.4140.5660.6750.7610.8220.868
Random0.1020.2040.3090.4030.5070.5920.699Random0.0990.1960.2970.3980.4990.6010.700
Table 3. Effectiveness Comparison among KMGP and the Compared Approaches in Terms of PFD
Table 4.
DataApproaches#Best case in PFDAverage PFD
PFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70PFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70
CiteSeerKMGP44443110.2850.4920.6440.7420.8030.8440.884
DeepGini00000000.2080.3710.5090.6230.7250.8100.878
Entropy00000000.2060.3660.4980.6070.7070.7830.844
Least Confidence00000000.2320.4110.5520.6720.7660.8450.902
Margin00000000.2220.4070.5570.6800.7780.8520.906
Vanilla SM00000000.2320.4110.5520.6720.7660.8450.902
PCS00001330.2220.4070.5570.6800.7780.8520.906
Random00000000.0990.1970.2990.3940.4940.5930.689
CoraKMGP44433210.4240.7090.8250.8660.8890.9150.940
DeepGini00000000.2790.4680.6110.7230.8100.8700.918
Entropy00000000.2750.4590.5990.7100.7920.8520.898
Least Confidence00000000.2970.5210.6810.7980.8750.9220.958
Margin00000000.2880.5160.6830.7970.8740.9260.963
Vanilla SM00011100.2970.5210.6810.7980.8750.9220.958
PCS0000 130.2880.5160.6830.7970.8740.9260.963
Random00000000.1010.1990.3000.3990.4970.5900.691
LastFMKMGP32211110.3360.5610.6650.7370.8090.8640.913
DeepGini00000000.2530.4480.5910.7030.7870.8550.907
Entropy00000000.2460.4070.5050.5760.6370.6920.755
Least Confidence00000000.2720.4900.6560.7740.8570.9080.944
Margin00000000.2630.4850.6500.7720.8540.9050.943
Vanilla SM12233330.2720.4900.6560.7740.8570.9080.944
PCS00000000.2630.4850.650.7720.8540.9050.943
Random00000000.0980.1990.2990.3990.4980.5950.698
PubMedKMGP44444420.3640.5930.6900.7550.8090.8530.890
DeepGini00000000.2110.3670.4960.5930.6790.7600.833
Entropy00000000.2110.3680.4970.5950.6840.7640.838
Least Confidence00000000.2220.3980.5480.6700.7640.8360.888
Margin00000020.2210.4020.5500.6690.7580.8250.874
Vanilla SM00000000.2220.3980.5480.6700.7640.8360.888
PCS00000000.2210.4020.5500.6690.7580.8250.874
Random00000000.0980.1990.3000.3990.5010.6010.701
Table 4. Average Comparison Results among KMGP and the Compared Approaches in Terms of PFD
Fig. 2.
Fig. 2. Test prioritization effectiveness among KMGP and the compared approaches for CiteSeer with GraphSAGE and LastFM with GAT. X-axis: the percentage of prioritized tests; Y-axis: the percentage of detected misclassified tests.
To demonstrate the stability of our findings, we performed a statistical analysis. Specifically, all the experiments were repeated 10 times for each subject, resulting in 10 distinct GNN model instances obtained through separate training processes for a given original GNN model. Based on the statistical analysis of the resulting data, the p-value was found to be lower than \(10^{-5}\), indicating that the KMGP approach consistently outperforms the compared approaches in terms of test prioritization.
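The specific statistical test is not detailed here; as one possible realization, the sketch below compares paired APFD values from the repeated runs with a Wilcoxon signed-rank test (the numbers are illustrative, not the paper's data).

```python
from scipy.stats import wilcoxon

# Illustrative APFD values of two approaches over 10 repeated runs of the same subject.
kmgp_apfd = [0.74, 0.75, 0.73, 0.76, 0.74, 0.75, 0.74, 0.73, 0.75, 0.76]
deepgini_apfd = [0.67, 0.68, 0.66, 0.69, 0.67, 0.68, 0.67, 0.66, 0.68, 0.69]

stat, p_value = wilcoxon(kmgp_apfd, deepgini_apfd)
print(p_value)  # a small p-value suggests the difference between the approaches is consistent
```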
Answer to RQ1: The GraphPrior approach KMGP outperforms all the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in GNN test prioritization.

5.2 RQ2: Effectiveness of the Feature-based GraphPrior Approaches

Objectives: We investigate the effectiveness of feature-based approaches in GraphPrior, including XGGP, LRGP, RFGP, LGGP, and DNGP, compared with the killing-based approach KMGP.
Experimental design: We evaluated the effectiveness of feature-based GraphPrior approaches with the killing-based approach KMGP on 16 subjects (four graph datasets × four GNN models). Due to the randomness of the training process of a GNN model, we repeat all the experiments 10 times and calculate the average results. For each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes. For evaluation, we calculated the APFD values of all the approaches on each subject, which can reflect the misclassification detection rate. Moreover, we calculated the PFD values of all the approaches on different ratios of prioritized tests to further investigate the effectiveness of feature-based approaches.
Results: The experimental results of this research question are exhibited in Tables 5, 6, and 7. Table 5 presents the comparison results in terms of APFD, while Table 6 and Table 7 present the comparison results in terms of PFD.
Table 5.
| Data | Model | DNGP | LGGP | XGGP | LRGP | RFGP | KMGP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CiteSeer | GAT | 0.633 | 0.678 | 0.669 | 0.651 | 0.675 | 0.708 |
| CiteSeer | GCN | 0.682 | 0.695 | 0.690 | 0.678 | 0.694 | 0.701 |
| CiteSeer | GraphSAGE | 0.656 | 0.694 | 0.699 | 0.682 | 0.710 | 0.739 |
| CiteSeer | TAGCN | 0.652 | 0.681 | 0.694 | 0.660 | 0.696 | 0.712 |
| Cora | GAT | 0.749 | 0.785 | 0.795 | 0.767 | 0.811 | 0.841 |
| Cora | GCN | 0.778 | 0.791 | 0.791 | 0.784 | 0.806 | 0.812 |
| Cora | GraphSAGE | 0.764 | 0.791 | 0.793 | 0.784 | 0.794 | 0.792 |
| Cora | TAGCN | 0.777 | 0.785 | 0.785 | 0.778 | 0.800 | 0.782 |
| LastFM | GAT | 0.799 | 0.814 | 0.812 | 0.802 | 0.826 | 0.801 |
| LastFM | GCN | 0.796 | 0.811 | 0.809 | 0.802 | 0.816 | 0.761 |
| LastFM | GraphSAGE | 0.771 | 0.785 | 0.780 | 0.778 | 0.789 | 0.702 |
| LastFM | TAGCN | 0.763 | 0.781 | 0.776 | 0.770 | 0.779 | 0.673 |
| PubMed | GAT | 0.740 | 0.774 | 0.768 | 0.763 | 0.773 | 0.735 |
| PubMed | GCN | 0.743 | 0.749 | 0.745 | 0.746 | 0.750 | 0.748 |
| PubMed | GraphSAGE | 0.743 | 0.776 | 0.767 | 0.768 | 0.774 | 0.747 |
| PubMed | TAGCN | 0.701 | 0.780 | 0.773 | 0.765 | 0.768 | 0.720 |
| Average |  | 0.734 | 0.761 | 0.759 | 0.749 | 0.766 | 0.748 |
Table 5. Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of APFD
Table 6.
DataModelApproachesPFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70DataModelApproachesPFD-10PFD-20PFD-30PFD-40PFD-50PFD-60PFD-70
CiteSeerGATKMGP0.2640.4640.6290.7500.8120.8410.875LastFMGATKMGP0.3890.6830.8100.8690.9020.9270.945
DNGP0.2520.4600.5960.6470.6930.7220.753DNGP0.3820.7280.8480.8630.8830.9050.926
LGGP0.2510.4650.6210.7150.7590.7910.833LGGP0.3970.7400.8610.8890.9040.9240.942
LRGP0.2440.4670.6110.6830.7210.7500.788LRGP0.3890.7290.8480.8760.8920.9100.929
RFGP0.2570.4700.6190.6970.7430.7810.827RFGP0.4040.7460.8740.9060.9270.9440.960
XGGP0.2560.4640.6180.7020.7400.7710.817XGGP0.3930.7370.8560.8860.9010.9210.942
GCNKMGP0.2780.4920.6520.7230.7710.8110.865GCNKMGP0.4030.6480.7280.7700.8300.8680.915
DNGP0.2480.4790.6430.6990.7480.7940.843DNGP0.4120.7170.8140.8490.8770.9060.93
LGGP0.2730.4830.6510.7170.7640.8030.856LGGP0.4280.7300.8300.8730.8980.9210.945
LRGP0.2510.4840.6430.6980.7450.7870.832LRGP0.4240.7170.8170.8590.8860.9120.937
RFGP0.2720.4860.6530.7160.7620.8070.852RFGP0.4310.7350.8420.8810.9060.9270.949
XGGP0.2650.4810.6500.7110.7600.8040.848XGGP0.4240.7240.8260.8690.8950.9180.942
GraphSAGEKMGP0.3060.5250.6790.7740.8350.8790.910GraphSAGEKMGP0.3020.4820.5800.6680.8000.8420.902
DNGP0.2710.5110.6350.6700.6950.7290.771DNGP0.3350.6220.7660.8370.8710.8990.924
LGGP0.2870.5150.6800.7330.7670.7970.831LGGP0.3440.6340.7840.8580.8900.9180.946
LRGP0.2730.5120.6710.7080.7370.7670.806LRGP0.3420.6260.7730.8480.8810.9070.936
RFGP0.2870.5150.6840.7300.7750.8160.865RFGP0.3480.6360.7870.8650.8980.9250.947
XGGP0.2830.5160.6610.7030.7530.8000.851XGGP0.3430.6300.7740.8480.8810.9100.941
TAGCNKMGP0.2950.4900.6170.7230.7950.8450.888TAGCNKMGP0.2500.4310.5440.6440.7060.8190.892
DNGP0.2850.5040.5780.6280.6820.7400.784DNGP0.2940.5520.7420.8400.8840.9150.936
LGGP0.2980.5130.6510.7000.7370.7730.811LGGP0.2990.5620.7580.8650.9140.9440.964
LRGP0.2920.5070.5870.6400.6920.7490.799LRGP0.2950.5550.7470.8460.8960.9270.950
RFGP0.2940.5110.6620.6940.7470.7930.845RFGP0.3000.5610.7560.8670.9150.9420.961
XGGP0.2970.5100.6360.6950.7480.8010.849XGGP0.2970.5580.7510.8600.9110.9360.960
CoraGATKMGP0.4540.7590.8840.9190.9390.9540.975PubMedGATKMGP0.3360.5880.6970.7540.8130.8590.893
DNGP0.3830.7220.7910.8000.8140.8270.848DNGP0.3340.6310.7300.7670.8030.8430.883
LGGP0.4270.7240.8230.8360.8480.8670.894LGGP0.3630.6430.7630.8160.8530.8930.932
LRGP0.3610.7250.8260.8340.8450.8580.871LRGP0.3540.6320.7460.8030.8410.8810.919
RFGP0.4280.7300.8690.8820.8940.9090.928RFGP0.3620.6390.7630.8150.8530.8940.929
XGGP0.3750.7290.8490.8700.8850.9020.916XGGP0.3600.6400.7560.8060.8440.8860.921
GCNKMGP0.3840.7040.8540.8840.9090.9330.952GCNKMGP0.3470.6070.7430.7880.8260.8600.894
DNGP0.3590.6910.8140.8440.8700.8930.914DNGP0.3470.6290.7390.7790.8160.8510.885
LGGP0.3570.6780.8310.8620.8890.9130.932LGGP0.3550.6340.7460.7850.8230.8570.891
LRGP0.3590.6870.8230.8530.8800.9020.920LRGP0.3530.6290.7410.7820.8180.8540.888
RFGP0.3790.6910.8480.8760.9000.9280.947RFGP0.3540.6290.7450.7870.8240.8580.892
XGGP0.3650.6820.8300.8610.8850.9110.930XGGP0.3480.6290.7400.7800.8180.8530.886
GraphSAGEKMGP0.4890.7050.7770.8200.8480.8860.919GraphSAGEKMGP0.3960.6350.7130.7570.8080.8500.889
DNGP0.4800.7050.7360.7760.8050.8450.879DNGP0.3960.6700.7170.7530.7910.8330.872
LGGP0.4750.7210.7720.8180.8570.8950.924LGGP0.4090.6840.7580.8030.8430.8820.917
LRGP0.4740.7280.7760.8020.8320.8630.906LRGP0.4060.6770.7440.7910.8330.8740.909
RFGP0.4870.7360.7710.8030.8480.8960.923RFGP0.4080.6840.7580.8020.8430.8810.914
XGGP0.4790.7180.7600.8030.8540.8940.939XGGP0.4040.6780.7480.7930.8320.8710.907
TAGCNKMGP0.3720.6680.7880.8410.8630.8880.914TAGCNKMGP0.3790.5450.6100.7220.7910.8440.885
DNGP0.3470.6710.7970.8440.8700.8910.912DNGP0.4020.5930.6440.6920.7340.7770.828
LGGP0.3570.6680.8040.8630.8890.9140.930LGGP0.4150.6310.7310.8040.8650.9100.946
LRGP0.3470.6690.7990.8480.8710.8910.914LRGP0.4090.6180.7070.7840.8400.8890.927
RFGP0.3760.6780.8200.8720.8950.9240.943RFGP0.4090.6210.7220.7950.8470.8890.923
XGGP0.3610.6700.7960.8520.8800.9040.926XGGP0.4100.6270.7220.7960.8520.8990.934
Table 6. Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of PFD
Table 7.
| Approaches | PFD-10 | PFD-20 | PFD-30 | PFD-40 | PFD-50 | PFD-60 | PFD-70 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KMGP | 0.353 | 0.589 | 0.707 | 0.775 | 0.828 | 0.869 | 0.907 |
| DNGP | 0.346 | 0.618 | 0.724 | 0.768 | 0.802 | 0.836 | 0.868 |
| LGGP | 0.358 | 0.627 | 0.754 | 0.809 | 0.844 | 0.875 | 0.906 |
| LRGP | 0.348 | 0.623 | 0.741 | 0.791 | 0.826 | 0.858 | 0.890 |
| RFGP | 0.362 | 0.629 | 0.761 | 0.812 | 0.849 | 0.882 | 0.913 |
| XGGP | 0.354 | 0.624 | 0.748 | 0.802 | 0.840 | 0.874 | 0.907 |
Table 7. Average Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of PFD
Among all the GraphPrior approaches, RFGP demonstrates the highest level of effectiveness in most cases. Table 5 exhibits the comparison results among KMGP (i.e., the killing-based GraphPrior approach) and the feature-based GraphPrior approaches in terms of APFD. The results demonstrate that RFGP outperforms the other GraphPrior approaches on average. Moreover, the average APFD value of RFGP exceeds that of KMGP by around 0.02. Additionally, across different subjects, RFGP outperforms the other GraphPrior approaches in the majority of cases. To provide a more detailed analysis, Tables 6 and 7 exhibit the comparison results of all GraphPrior approaches in terms of PFD. The findings also confirm that RFGP is the most effective GraphPrior approach. Furthermore, Table 7 indicates that, on average, RFGP is consistently more effective than the other GraphPrior approaches across different test prioritization ratios. Figure 3 presents some examples aimed at providing a more visually intuitive understanding of the performance of the various GraphPrior approaches. Collectively, these results suggest that RFGP is the most effective GraphPrior approach for the evaluated datasets.
Fig. 3.
Fig. 3. Test prioritization effectiveness of the six GraphPrior approaches for Cora with TAGCN and LastFM with GraphSAGE. X-axis: the percentage of prioritized tests; Y-axis: the percentage of detected misclassified tests.
Additionally, although the killing-based GraphPrior approach, KMGP, shows good effectiveness on some specific datasets, its average effectiveness is lower than that of several feature-based GraphPrior approaches, such as RFGP, LGGP, and XGGP. This result suggests that KMGP is less stable than some feature-based approaches. For example, in Figure 3(b), we can see that KMGP (represented by the red line) is less effective than the other GraphPrior approaches. In fact, the main difference between KMGP and the feature-based GraphPrior approaches lies in their strategy for utilizing mutation results. Specifically, KMGP treats all mutated models as equally important, whereas feature-based GraphPrior approaches, such as RFGP, employ ranking models to assign higher weights to the more important mutated models, thereby better utilizing mutation results for test prioritization. The superior performance of RFGP indicates that the random forest algorithm it utilizes can effectively identify important mutated models and assign them high weights.
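To make the difference concrete, the sketch below contrasts the two scoring strategies on the kill vector of a single test input; it assumes a scikit-learn-style ranking model that outputs a misclassification probability, which is an illustrative simplification.

```python
import numpy as np

def killing_based_score(kill_vector: np.ndarray) -> float:
    """KMGP-style score: every mutated model is weighted equally, so the score is the number of kills."""
    return float(kill_vector.sum())

def feature_based_score(kill_vector: np.ndarray, ranking_model) -> float:
    """RFGP-style score: a trained ranking model implicitly weighs the mutated models differently."""
    return float(ranking_model.predict_proba(kill_vector.reshape(1, -1))[0, 1])

# kill_vector[j] = 1 if mutated model j predicts a different label than the original GNN for this test input.
kill_vector = np.array([1, 0, 1, 1, 0])
print(killing_based_score(kill_vector))
```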
The efficiency of GraphPrior (all six approaches) is acceptable. Table 8 illustrates the efficiency of GraphPrior in comparison with the other approaches. The time cost of GraphPrior can be decomposed into three phases: mutant generation, training, and execution. Mutant generation involves producing mutated models by retraining the original GNN model. The training time is the average duration needed to train a ranking model. Finally, the execution time is the average duration spent on test prioritization. Decomposing the time cost into these phases provides a more detailed understanding of the efficiency of GraphPrior in contrast to the other approaches. As evident from Table 8, the average execution time of GraphPrior for test prioritization is 40 seconds, with the most time-consuming phase being mutant generation, which takes around 35 minutes. In contrast, the average execution time of the compared approaches is less than one second. Although GraphPrior is not as efficient as the compared approaches, it provides a viable alternative to costly and time-consuming manual labeling, and its total time cost remains acceptable in real-world scenarios.
Table 8.
| Time cost parts | GraphPrior | DeepGini | Least Confidence | Margin | Vanilla SM | PCS | Entropy | Random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mutant Generation | 35 min | - | - | - | - | - | - | - |
| Training | 3 min | - | - | - | - | - | - | - |
| Execution | 40 s | < 1 s | < 1 s | < 1 s | < 1 s | < 1 s | < 1 s | < 1 s |
Table 8. Time Comparison between GraphPrior and Compared Approaches
Answer to RQ2: Among all the GraphPrior approaches, RFGP demonstrates the highest level of effectiveness in most cases. The efficiency of GraphPrior (all six approaches) is acceptable.

5.3 RQ3: Effectiveness of GraphPrior on Adversarial Test Inputs

Objectives: We further investigate the effectiveness of GraphPrior on adversarial test data. Here, we adopt eight graph adversarial attacks (cf. Section 4.4) from the existing studies [3, 48, 86, 100]. The results can answer whether GraphPrior can perform well on adversarial test sets for GNNs, compared with existing approaches that can be used to identify possibly misclassified test inputs.
Experimental design: We evaluate GraphPrior on adversarial datasets generated by eight graph attack techniques [3, 48, 86, 100]. In this research question, we set the attack level to 0.3, which means that 30% of the test inputs in the test set are adversarial tests. It is important to note that a high attack level, such as 90%, would result in a very large ratio of adversarial test inputs. Under such circumstances, a larger number of bug cases could be selected by any of the prioritization methods, making it difficult to demonstrate the effectiveness of GraphPrior. Thus, to ensure an effective evaluation of GraphPrior and the compared approaches, we selected a reasonable attack level (i.e., 0.3), which limits the proportion of adversarial test inputs. In total, in this research question, we evaluate GraphPrior on 108 subjects built from four GNN models, four datasets, and eight graph adversarial attacks. We then ran all six GraphPrior approaches and the compared approaches on the subjects and calculated the APFD values of each approach under each graph adversarial attack. Moreover, we calculated the PFD values of each approach for different ratios of prioritized tests.
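As an illustration of how such a mixed test set can be assembled, the sketch below replaces a fraction `attack_level` of clean test inputs with their attacked counterparts; it treats each test input as a feature vector for simplicity, and the names are illustrative.

```python
import numpy as np

def mix_test_set(clean_tests: np.ndarray, adversarial_tests: np.ndarray,
                 attack_level: float, seed: int = 0) -> np.ndarray:
    """Return a test set in which `attack_level` of the inputs are adversarial (e.g., 0.3 -> 30%)."""
    rng = np.random.default_rng(seed)
    n_total = len(clean_tests)
    n_adv = int(round(attack_level * n_total))
    adv_idx = rng.choice(n_total, size=n_adv, replace=False)
    mixed = clean_tests.copy()
    mixed[adv_idx] = adversarial_tests[adv_idx]  # swap in the attacked versions of the selected inputs
    return mixed
```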
Results: GraphPrior approaches outperform the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in the context of graph adversarial attacks. Table 9 shows the test prioritization effectiveness (measured by APFD) of GraphPrior and the compared approaches across a variety of adversarial attacks. The experimental results indicate that the GraphPrior approaches exhibit superior performance, with average APFD values ranging from 0.692 to 0.732, while those of the compared approaches range from 0.499 to 0.711. Notably, five GraphPrior approaches, namely, RFGP, XGGP, LRGP, LGGP, and KMGP, outperform all the compared approaches, on average, across all the adversarial attacks. Table 10 presents the comparison results of GraphPrior and the compared approaches in terms of PFD, confirming the superior performance of GraphPrior from both the perspective of average effectiveness and the number of best cases. Furthermore, Table 11 presents the overall comparison results in terms of PFD, which further support the above conclusions by demonstrating that the largest average effectiveness in each case is achieved by the GraphPrior approaches, along with the largest number of best cases.
Table 9.
| Attack | DNGP | KMGP | LGGP | XGGP | LRGP | RFGP | DeepGini | Least Confidence | Margin | Random | Vanilla SM | PCS | Entropy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DICE | 0.672 | 0.710 | 0.707 | 0.706 | 0.695 | 0.713 | 0.667 | 0.698 | 0.693 | 0.500 | 0.698 | 0.693 | 0.642 |
| MMA | 0.691 | 0.725 | 0.721 | 0.724 | 0.705 | 0.731 | 0.684 | 0.717 | 0.718 | 0.499 | 0.717 | 0.718 | 0.672 |
| NEAA | 0.698 | 0.723 | 0.733 | 0.732 | 0.721 | 0.738 | 0.676 | 0.711 | 0.703 | 0.499 | 0.711 | 0.703 | 0.646 |
| NEAR | 0.737 | 0.735 | 0.767 | 0.764 | 0.757 | 0.774 | 0.678 | 0.719 | 0.717 | 0.499 | 0.719 | 0.717 | 0.644 |
| PGD | 0.718 | 0.730 | 0.743 | 0.743 | 0.729 | 0.753 | 0.693 | 0.728 | 0.727 | 0.498 | 0.728 | 0.727 | 0.656 |
| RAA | 0.659 | 0.701 | 0.697 | 0.696 | 0.684 | 0.703 | 0.671 | 0.702 | 0.695 | 0.499 | 0.702 | 0.695 | 0.648 |
| RAF | 0.657 | 0.702 | 0.696 | 0.696 | 0.683 | 0.703 | 0.670 | 0.701 | 0.694 | 0.500 | 0.701 | 0.694 | 0.646 |
| RAR | 0.703 | 0.724 | 0.735 | 0.734 | 0.723 | 0.742 | 0.673 | 0.708 | 0.707 | 0.498 | 0.708 | 0.707 | 0.645 |
| Average | 0.692 | 0.718 | 0.725 | 0.724 | 0.712 | 0.732 | 0.677 | 0.711 | 0.707 | 0.499 | 0.711 | 0.707 | 0.650 |
Table 9. Effectiveness Comparison among GraphPrior and the Compared Approaches in Terms of APFD
Table 10.
AttackApproaches#Best cases in PFDAverage PFDAttackApproaches#Best cases in PFDAverage PFD
PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40
DICEDNGP02000.2890.5530.7050.769PGDDNGP01100.3090.5720.7010.752
KMGP71220.3000.5200.6650.754KMGP74330.3360.5730.7140.789
LGGP10000.3010.5570.7190.801LGGP00000.3070.5630.7070.781
LRGP00000.2910.5550.7110.788LRGP02000.3050.5700.6970.762
RFGP4910100.3040.5610.7290.818RFGP11450.3260.5710.7270.806
XGGP00000.2930.5560.7160.799XGGP00000.3000.5580.7030.774
DeepGini00000.2150.3940.5350.655DeepGini00000.2380.4130.5560.672
Entropy00000.2120.3810.5070.611Entropy00000.2360.4090.5470.660
Least Confidence00000.2330.4280.5900.713Least Confidence00000.2550.4510.6040.727
Margin00000.2250.4230.5840.711Margin00000.2420.4490.6060.730
PCS00000.2250.4230.5840.711PCS00000.2420.4490.6060.730
Vanilla SM00000.2330.4280.5900.713Vanilla SM00000.2550.4510.6040.727
Random00000.1000.2000.2990.398Random00000.0980.1990.2990.397
MMADNGP02000.3200.5980.7290.785RAADNGP01000.3030.5730.7190.781
KMGP75540.3400.5780.7010.773KMGP53240.3080.5440.6750.755
LGGP11000.3270.5970.7390.809LGGP64430.3140.5780.7340.812
LRGP00000.3200.5950.7290.793LRGP01000.3070.5740.7270.800
RFGP44780.3410.5980.7540.829RFGP571090.3150.5790.7370.821
XGGP00000.3190.5920.7360.804XGGP00000.3070.5740.7300.808
DeepGini00000.2430.4260.5680.682DeepGini00000.2210.3950.5380.652
Entropy00000.2400.4120.5380.635Entropy00000.2190.3870.5180.623
Least Confidence00000.2630.4690.6220.741Least Confidence00000.2340.4250.5820.705
Margin00000.2530.4630.6220.743Margin00000.2200.4110.5700.698
PCS00000.2530.4630.6220.743PCS00000.2200.4110.5700.698
Vanilla SM00000.2630.4690.6220.741Vanilla SM00000.2340.4250.5820.705
Random00000.1020.2020.3030.402Random00000.1010.2010.3010.399
NEAADNGP00000.3320.6270.7830.840RAFDNGP01000.2950.5650.7150.780
KMGP32210.3350.5890.7330.805KMGP73130.3010.5330.6730.760
LGGP02010.3430.6360.8030.877LGGP45540.3070.5680.7310.812
LRGP10000.3340.6300.7950.860LRGP00000.2980.5650.7230.798
RFGP44660.3450.6400.8140.884RFGP571090.3080.5700.7360.821
XGGP00000.3360.6310.8000.869XGGP00000.2990.5650.7270.807
DeepGini00000.2450.4330.5790.694DeepGini00000.2180.3940.5360.650
Entropy00000.2400.4140.5380.632Entropy00000.2160.3850.5160.620
Least Confidence00000.2610.4720.6380.763Least Confidence00000.2300.4220.5800.706
Margin00000.2450.4570.6250.757Margin00000.2170.4090.5670.698
PCS00000.2450.4570.6250.757PCS00000.2170.4090.5670.698
Vanilla SM00000.2610.4720.6380.763Vanilla SM00000.2300.4220.5800.706
Random00000.1000.2000.3010.399Random00000.1000.2020.3010.402
NEARDNGP00000.3220.6180.7870.848RARDNGP02000.3340.6060.7200.766
KMGP12210.3350.6180.7800.856KMGP61140.3410.5680.6970.772
LGGP10000.3360.6200.7930.871LGGP24440.3470.6160.7520.814
LRGP00000.3260.6210.7980.866LRGP10000.3380.6110.7400.799
RFGP22230.3390.6270.8100.893RFGP781170.3480.6170.7610.823
XGGP00000.3270.6200.7960.872XGGP01010.3390.6130.7490.810
DeepGini00000.2470.4320.5760.687DeepGini00000.2310.4100.5510.662
Entropy00000.2440.4270.5690.675Entropy00000.2290.4000.5280.627
Least Confidence00000.2560.4580.6210.747Least Confidence00000.2470.4450.6030.723
Margin00000.2330.4310.6000.737Margin00000.2430.4440.6050.727
PCS00000.2330.4310.6000.737PCS00000.2430.4440.6050.727
Vanilla SM00000.2560.4580.6210.747Vanilla SM00000.2470.4450.6030.723
Random00000.1010.1980.2940.391Random00000.0990.2000.3000.401
Table 10. Effectiveness Comparison of GraphPrior and the Compared Approaches on Adversarial Test Inputs in Terms of PFD
Table 11.
| Approaches | #Best cases (PFD-10) | #Best cases (PFD-20) | #Best cases (PFD-30) | #Best cases (PFD-40) | Avg PFD-10 | Avg PFD-20 | Avg PFD-30 | Avg PFD-40 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DNGP | 0 | 9 | 1 | 0 | 0.313 | 0.589 | 0.732 | 0.790 |
| KMGP | 43 | 21 | 18 | 22 | 0.325 | 0.565 | 0.705 | 0.783 |
| LGGP | 15 | 16 | 13 | 12 | 0.323 | 0.592 | 0.747 | 0.822 |
| LRGP | 2 | 3 | 0 | 0 | 0.315 | 0.590 | 0.740 | 0.808 |
| RFGP | 32 | 42 | 60 | 57 | 0.328 | 0.595 | 0.758 | 0.837 |
| XGGP | 0 | 1 | 0 | 1 | 0.315 | 0.589 | 0.745 | 0.818 |
| DeepGini | 0 | 0 | 0 | 0 | 0.232 | 0.412 | 0.555 | 0.669 |
| Entropy | 0 | 0 | 0 | 0 | 0.230 | 0.402 | 0.533 | 0.635 |
| Least Confidence | 0 | 0 | 0 | 0 | 0.247 | 0.446 | 0.605 | 0.728 |
| Margin | 0 | 0 | 0 | 0 | 0.235 | 0.436 | 0.597 | 0.725 |
| PCS | 0 | 0 | 0 | 0 | 0.235 | 0.436 | 0.597 | 0.725 |
| Vanilla SM | 0 | 0 | 0 | 0 | 0.247 | 0.446 | 0.605 | 0.728 |
| Random | 0 | 0 | 0 | 0 | 0.101 | 0.202 | 0.301 | 0.399 |
Table 11. Average Effectiveness Comparison among GraphPrior and the Compared Approaches on Adversarial Test Inputs in Terms of PFD
Among all the proposed GraphPrior approaches, the effectiveness of RFGP stands out as the most notable. From Table 9, in which effectiveness is measured by APFD values, we see that RFGP performs the best across different adversarial attacks, with an average improvement of 2.95%~46.69% compared with the uncertainty-based test prioritization approaches. Table 10 presents the test prioritization effectiveness in terms of PFD. The column #Best cases in PFD denotes the number of cases in which a test prioritization approach performs the best across all subjects of a graph adversarial attack. The results demonstrate that, against a majority of adversarial attacks, RFGP consistently outperforms all other GraphPrior approaches in terms of average effectiveness. Moreover, Table 11 presents the overall comparison results in terms of PFD, further indicating that RFGP outperforms all the other approaches in terms of average effectiveness. Notably, when prioritizing 20% to 40% of the test inputs, RFGP consistently exhibits the highest number of best cases across a variety of subjects.
Answer to RQ3: GraphPrior approaches outperform the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in the context of graph adversarial attacks. Among all the GraphPrior approaches proposed, the effectiveness of RFGP stands out as the most notable.

5.4 RQ4: Effectiveness of GraphPrior against Adversarial Attacks at Varying Attack Levels

Objectives: We investigate the effectiveness of GraphPrior on adversarial test inputs with different attack levels.
Experimental design: To investigate the effectiveness of GraphPrior on test inputs generated via different levels of graph adversarial attacks, we set different attack levels (i.e., 0.1, 0.2, 0.3, and 0.4) for the eight graph adversarial techniques (i.e., DICE, Min-max attack, NEAA, NEAR, PGD attack, RAA, RAF, and RAR). As mentioned in RQ3, the attack level indicates the ratio of adversarial inputs in the dataset; for example, 0.4 means that 40% of the tests in the dataset are adversarial. We select these attack levels because a high attack level (e.g., 80%) would engender a substantial proportion of adversarial test inputs; consequently, a greater number of bug cases could be selected by any prioritization method, thereby affecting the evaluation of GraphPrior. Therefore, we carefully selected a range of attack levels that are not unduly high. In this research question, we evaluate GraphPrior and the compared approaches on 432 subjects in total.
Results: GraphPrior outperforms all the compared approaches on the adversarial test inputs generated from different attack levels. More specifically, Table 12 presents the effectiveness of GraphPrior and the compared approaches under the attacks DICE, MMA, RAA, and RAR, with the attack level ranging from 0.1 to 0.4. In this research question, we apply eight adversarial attacks in total; the remaining experimental results (i.e., the results of the other four adversarial attacks) are presented on our GitHub.2
Table 12.
AttackApproaches#Best cases in PFDAverage PFDAttackApproaches#Best cases in PFDAverage PFD
PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40
DICE-0.1DNGP01000.3220.5950.7240.775RAA-0.1DNGP00000.3330.6040.7250.774
KMGP73250.3350.5680.7000.776KMGP43550.3360.5740.6960.770
LGGP10100.3350.6000.7450.811LGGP36440.3430.6110.7510.814
LRGP00000.3230.5970.7360.797LRGP10000.3370.6070.7430.803
RFGP47970.3380.6050.7570.828RFGP87770.3450.6130.7570.822
XGGP01000.3250.5990.7430.810XGGP00000.3380.6080.7490.813
DeepGini00000.2370.4190.5590.674DeepGini00000.2320.4100.5490.660
Entropy00000.2330.4050.5280.627Entropy00000.2300.3990.5270.626
Least Confidence00000.2560.4590.6160.736Least Confidence00000.2480.4450.6020.722
Margin00000.2450.4510.6130.737Margin00000.2360.4380.5970.720
PCS00000.2450.4510.6130.737PCS00000.2360.4380.5970.720
Vanilla SM00000.2560.4590.6160.736Vanilla SM00000.2480.4450.6020.722
Random00000.0980.1980.2960.397Random00000.1000.2000.3010.401
DICE-0.2DNGP00000.3050.5730.7130.772RAA-0.2DNGP00000.3110.5840.7170.773
KMGP63230.3140.5450.6780.762KMGP65450.3180.5530.6830.762
LGGP12010.3140.5760.7320.807LGGP45330.3230.5870.7360.807
LRGP00000.3040.5750.7240.795LRGP11000.3140.5860.7290.796
RFGP561080.3160.5790.7410.820RFGP55980.3240.5900.7440.818
XGGP01000.3050.5740.7290.804XGGP00000.3140.5860.7340.805
DeepGini00000.2280.4090.5520.667DeepGini00000.2230.3970.5400.653
Entropy00000.2250.3950.5220.624Entropy00000.2210.3880.5190.620
Least Confidence00000.2440.4430.6020.724Least Confidence00000.2370.4310.5880.713
Margin00000.2350.4350.5960.723Margin00000.2260.4220.5820.709
PCS00000.2350.4350.5960.723PCS00000.2260.4220.5820.709
Vanilla SM00000.2440.4430.6020.724Vanilla SM00000.2370.4310.5880.713
Random00000.1010.2020.3020.401Random00000.0990.1990.2980.398
DICE-0.3DNGP02000.2890.5530.7050.769RAA-0.3DNGP01000.3030.5730.7190.781
KMGP71220.3000.5200.6650.754KMGP53240.3080.5440.6750.755
LGGP10000.3010.5570.7190.801LGGP64430.3140.5780.7340.812
LRGP00000.2910.5550.7110.788LRGP01000.3070.5740.7270.800
RFGP4910100.3040.5610.7290.818RFGP571090.3150.5790.7370.821
XGGP00000.2930.5560.7160.799XGGP00000.3070.5740.7300.808
DeepGini00000.2150.3940.5350.655DeepGini00000.2210.3950.5380.652
Entropy00000.2120.3810.5070.611Entropy00000.2190.3870.5180.623
Least Confidence00000.2330.4280.5900.713Least Confidence00000.2340.4250.5820.705
Margin00000.2250.4230.5840.711Margin00000.2200.4110.5700.698
PCS00000.2250.4230.5840.711PCS00000.2200.4110.5700.698
Vanilla SM00000.2330.4280.5900.713Vanilla SM00000.2340.4250.5820.705
Random00000.1000.2000.2990.398Random00000.1010.2010.3010.399
DICE-0.4DNGP01200.2760.5320.6940.770RAA-0.4DNGP01000.2900.5540.7130.783
KMGP72110.2880.5100.6470.740KMGP63140.2940.5250.6710.761
LGGP01110.2860.5350.7020.799LGGP45330.3000.5590.7260.812
LRGP00100.2770.5330.6990.785LRGP01000.2930.5560.7200.800
RFGP587100.2910.5380.7080.812RFGP661290.3020.5600.7310.823
XGGP00000.2800.5320.7000.795XGGP00000.2940.5560.7240.809
DeepGini00000.2110.3880.5330.654DeepGini00000.2150.3920.5350.650
Entropy00000.2090.3760.5080.613Entropy00000.2130.3840.5170.621
Least Confidence00000.2260.4190.5790.708Least Confidence00000.2260.4180.5760.702
Margin00000.2150.4060.5680.701Margin00000.2100.3990.5590.689
PCS00000.2150.4060.5680.701PCS00000.2100.3990.5590.689
Vanilla SM00000.2260.4190.5790.708Vanilla SM00000.2260.4180.5760.702
Random00000.0980.2000.3000.400Random00000.0980.2000.3000.400
MMA-0.2DNGP01000.3290.6110.7330.781RAR-0.2DNGP00000.3420.6130.7210.767
KMGP75440.3460.5830.7080.776KMGP63370.3480.5830.7040.774
LGGP01210.3360.6090.7480.810LGGP44410.3520.6210.7530.809
LRGP00000.3300.6080.7380.800LRGP10000.3420.6160.7400.792
RFGP55670.3490.6140.7630.833RFGP59880.3570.6240.7600.817
XGGP00000.3290.6070.7460.808XGGP00100.3450.6200.7480.806
DeepGini00000.2460.4280.5680.682DeepGini00000.2400.4160.5520.662
Entropy00000.2410.4130.5360.634Entropy00000.2380.4040.5270.626
Least Confidence00000.2690.4720.6270.745Least Confidence00000.2570.4550.6090.726
Margin00000.2570.4670.6270.749Margin00000.2480.4510.6090.729
PCS00000.2570.4670.6270.749PCS00000.2490.4510.6090.729
Vanilla SM00000.2690.4720.6270.745Vanilla SM00000.2570.4550.6090.726
Random00000.0990.1980.2960.396Random00000.1000.2000.3010.401
MMA-0.1DNGP00000.3280.6050.7250.775RAR-0.1DNGP01000.3410.6140.7230.771
KMGP65540.3440.5810.7050.774KMGP73060.3460.5790.7010.775
LGGP12100.3360.6050.7410.805LGGP43330.3530.6190.7510.812
LRGP00000.3310.6030.7340.794LRGP11000.3420.6150.7400.795
RFGP55670.3490.6100.7580.829RFGP471370.3560.6230.7620.822
XGGP00000.3300.6010.7380.802XGGP01000.3430.6180.7480.805
DeepGini00000.2480.4310.5730.686DeepGini00000.2370.4110.5500.659
Entropy00000.2450.4170.5410.639Entropy00000.2340.4010.5260.624
Least Confidence00000.2670.4730.6290.746Least Confidence00000.2500.4490.6040.725
Margin00010.2550.4660.6260.746Margin00000.2440.4470.6050.728
PCS00000.2550.4660.6260.746PCS00000.2440.4470.6050.728
Vanilla SM00000.2670.4730.6290.746Vanilla SM00000.2500.4490.6040.725
Random00000.0990.1990.2970.401Random00000.0980.1990.2970.397
MMA-0.3DNGP02000.3200.5980.7290.785RAR-0.3DNGP02000.3340.6060.7200.766
KMGP75540.3400.5780.7010.773KMGP61140.3410.5680.6970.772
LGGP11000.3270.5970.7390.809LGGP24440.3470.6160.7520.814
LRGP00000.3200.5950.7290.793LRGP10000.3380.6110.7400.799
RFGP44780.3410.5980.7540.829RFGP781170.3480.6170.7610.823
XGGP00000.3190.5920.7360.804XGGP01010.3390.6130.7490.810
DeepGini00000.2430.4260.5680.682DeepGini00000.2310.4100.5510.662
Entropy00000.2400.4120.5380.635Entropy00000.2290.4000.5280.627
Least Confidence00000.2630.4690.6220.741Least Confidence00000.2470.4450.6030.723
Margin00000.2530.4630.6220.743Margin00000.2430.4440.6050.727
PCS00000.2530.4630.6220.743PCS00000.2430.4440.6050.727
Vanilla SM00000.2630.4690.6220.741Vanilla SM00000.2470.4450.6030.723
Random00000.1020.2020.3030.402Random00000.0990.2000.3000.401
MMA-0.4DNGP02000.3220.6010.7320.787RAR-0.4DNGP01000.3330.6070.7230.770
KMGP76550.3450.5820.7110.778KMGP71130.3370.5640.6920.771
LGGP13200.3310.5980.7390.807LGGP35430.3410.6110.7490.815
LRGP00000.3240.5970.7280.790LRGP02000.3350.6090.7380.799
RFGP41570.3450.5980.7540.827RFGP6710100.3450.6130.7580.824
XGGP00000.3230.5920.7310.797XGGP00100.3350.6090.7450.809
DeepGini00000.2450.4270.5680.681DeepGini00000.2330.4060.5470.658
Entropy00000.2420.4130.5390.636Entropy00000.2310.3980.5240.623
Least Confidence00000.2650.4690.6240.741Least Confidence00000.2480.4390.5960.719
Margin00000.2530.4640.6250.744Margin00000.2430.4430.6000.723
PCS00000.2530.4640.6250.744PCS00000.2430.4430.6000.723
Vanilla SM00000.2650.4690.6240.741Vanilla SM00000.2480.4390.5960.719
Random00000.0980.2010.3000.399Random00000.0970.1990.2990.398
Table 12. Comparison Results of GraphPrior and the Compared Approaches against Different Levels of the Attacks DICE, MMA, RAA, and RAR in Terms of PFD
The experimental results presented in Table 12 demonstrate that GraphPrior, consisting of DNGP, KMGP, LGGP, LRGP, RFGP, and XGGP, outperforms all the compared approaches across different levels of the adversarial attacks.
Table 13 presents the overall comparison results among GraphPrior and the compared approaches across eight adversarial attacks with different attack levels. Specifically, we evaluate the effectiveness of each test prioritization approach in terms of the number of cases in which it performed the best, as well as its average PFD values across different attack levels. For example, “All-0.1” refers to the overall results of each approach under all the adversarial attacks with an attack level of 0.1. Table 13 demonstrates that GraphPrior outperforms all the compared approaches, achieving the best effectiveness in 99.94% of the tested cases. Only one best case is achieved by the compared approach Margin. Furthermore, GraphPrior approaches such as RFGP and KMGP consistently exhibit the largest average PFD values across different attack levels.
Table 13.
Attack LevelApproaches#Best case in PFDAverage PFD
PFD-10PFD-20PFD-30PFD-40PFD-10PFD-20PFD-30PFD-40
All-0.1DNGP03000.3340.6150.7380.784
KMGP422728330.3490.5940.7230.791
LGGP131814110.3460.6190.7600.820
LRGP30000.3360.6170.7520.808
RFGP344348460.3520.6240.7720.836
XGGP01210.3360.6170.7580.819
DeepGini00000.2430.4250.5660.679
Entropy00000.2410.4130.5410.642
Least Confidence00000.2610.4650.6230.742
Margin00010.2490.4570.6190.742
PCS00000.2490.4570.6190.742
Vanilla SM00000.2610.4650.6230.742
Random00000.0990.2000.3010.402
All-0.2DNGP02000.3230.6020.7340.786
KMGP442920290.3350.5800.7130.786
LGGP131910100.3320.6040.7530.820
LRGP23000.3230.6020.7450.806
RFGP333762520.3390.6080.7650.836
XGGP02000.3230.6020.7500.816
DeepGini00000.2380.4190.5610.675
Entropy00000.2350.4080.5380.640
Least Confidence00000.2540.4560.6140.736
Margin00010.2410.4460.6090.734
PCS00000.2410.4460.6090.734
Vanilla SM00000.2540.4560.6140.736
Random00000.0990.1990.2990.399
All-0.3DNGP09100.3130.5890.7320.790
KMGP432118220.3240.5650.7040.783
LGGP151613120.3220.5910.7470.822
LRGP23000.3140.5900.7400.808
RFGP324260570.3280.5950.7580.836
XGGP01010.3150.5880.7440.817
DeepGini00000.2320.4120.5540.669
Entropy00000.2290.4010.5320.635
Least Confidence00000.2470.4460.6050.728
Margin00000.2340.4350.5970.725
PCS00000.2340.4350.5970.725
Vanilla SM00000.2470.4460.6050.728
Random00000.1000.2000.2990.398
All-0.4DNGP08300.3060.5780.7270.790
KMGP432315230.3160.5540.6940.776
LGGP132013110.3140.5800.7390.819
LRGP23100.3070.5770.7320.805
RFGP343858580.3200.5810.7480.832
XGGP00200.3070.5760.7350.813
DeepGini00000.2280.4080.5520.669
Entropy00000.2260.3990.5320.636
Least Confidence00000.2420.4390.5990.724
Margin00000.2270.4260.5880.717
PCS00000.2270.4260.5880.717
Vanilla SM00000.2420.4390.5990.724
Random00000.0970.1990.2990.398
Table 13. Overall Comparison Results among GraphPrior and the Compared Approaches on Adversarial Tests with Different Attack Levels
Among all the GraphPrior approaches, RFGP and KMGP exhibit superior performance across different attack levels in comparison to the other GraphPrior approaches. In Table 12, we see that, across the attack levels from 0.1 to 0.4, RFGP achieves the largest number of best cases, followed by KMGP. For example, when the attack level is 0.1, RFGP performs the best in 46.47% of cases and KMGP in 35.33% of cases. Notably, when prioritizing 10% of the test inputs, KMGP achieves the largest number of best cases; when the attack level is 0.2~0.4, RFGP achieves the largest number of best cases.
Additionally, our experimental results, as illustrated in Table 13, reveal that the RFGP technique exhibits the largest average PFD values when compared to the other evaluated approaches across varying attack levels. Specifically, when 40% of the test inputs are prioritized, RFGP achieves a PFD value ranging from 0.832 to 0.836, which indicates the ability to detect more than 80% of misclassified tests.
Answer to RQ4: GraphPrior outperforms all the compared approaches on the adversarial test inputs generated from different attack levels. Among all the GraphPrior approaches, RFGP and KMGP exhibit superior performance across different attack levels in comparison to other GraphPrior approaches.

5.5 RQ5: Contribution Analysis of Different Mutation Rules

Objectives: For each evaluated GNN model, we investigate which mutation rules generate more of the top contributing mutated models for test prioritization.
Experimental design: In our study, we employed one or more mutation rules to generate each mutated model, and each mutated model corresponds to one mutation feature. Thus, to evaluate the importance of different mutation rules, we first evaluate the importance of the mutation features. We adopted the cover metric of the XGBoost algorithm to quantify the importance of each mutation feature for the ranking models; a detailed account of this approach is presented in Section 4.5. After computing the importance scores of all the mutation features, we selected the top-N important features for each subject and subsequently identified the top-N mutated models. We then identified the mutation rules utilized to generate each mutated model and compared the contributions of the mutation rules accordingly. Additionally, for the different subjects in this research question, we generate 80~240 mutated models.
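The bookkeeping step that maps the top-N mutated models back to mutation rules can be sketched as follows; the rule abbreviations are taken from Tables 14 to 17, while the example mapping itself is illustrative.

```python
from collections import Counter

# Illustrative mapping: mutated model id -> mutation rules applied to generate it.
rules_per_mutant = {0: ["HC", "BIA"], 1: ["SL", "HC"], 2: ["NOR"], 3: ["HC", "CA"]}
top_n_mutants = [0, 1, 3]  # indices of the top contributing mutated models (from the cover-based ranking)

rule_counts = Counter(rule for m in top_n_mutants for rule in rules_per_mutant[m])
for rule, count in rule_counts.most_common():
    print(f"{rule}: contributes to {count}/{len(top_n_mutants)} of the top mutated models")
```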
Results: The mutation rule HC made high contributions to the effectiveness of GraphPrior on all four types of GNN models. Tables 14 to 17 illustrate the contributions of different mutation rules to the effectiveness of GraphPrior on the different GNN models (i.e., GCN, GAT, GraphSAGE, and TAGCN). For each GNN model, we identify the top-N mutated models that made the top contributions to the effectiveness of GraphPrior; the mutation rules applied to generate each mutated model are marked with a checkmark. Table 14 presents the contributions of the top-N mutated models to the effectiveness of GraphPrior for the GCN model. Notably, the mutation rules BIA and HC contributed to 100% of the top contributing mutated models, while SL, NOR, CA, and IMP contributed to a lower percentage of them. We conclude that, for the GCN model, the mutation rules BIA and HC were the most effective in generating the top important mutated models. Moving to GAT, GraphSAGE, and TAGCN, whose results are presented in Tables 15, 16, and 17, the mutation rule HC also generates a large ratio (i.e., 100%, 90%, and 90%, respectively) of the top contributing mutated models. We conclude that, across the four different types of GNN models, HC consistently makes top contributions to the effectiveness of GraphPrior.
Table 14.
Top-NSLBIACAIMPNORHC
0\(\checkmark\)\(\checkmark\)   \(\checkmark\)
1\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)
2 \(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)
3 \(\checkmark\)   \(\checkmark\)
4\(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)\(\checkmark\)
5\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
6\(\checkmark\)\(\checkmark\)  \(\checkmark\)\(\checkmark\)
7\(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)
8\(\checkmark\)\(\checkmark\)\(\checkmark\)  \(\checkmark\)
9\(\checkmark\)\(\checkmark\)  \(\checkmark\)\(\checkmark\)
Table 14. The Contributions of Different Mutation Rules (GCN)
Table 15.
Top-NSLBIACONHDSEPNSHC
0\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
1\(\checkmark\) \(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)
2  \(\checkmark\) \(\checkmark\) \(\checkmark\)
3\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)
4\(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)\(\checkmark\)
5 \(\checkmark\)\(\checkmark\) \(\checkmark\) \(\checkmark\)
6  \(\checkmark\) \(\checkmark\)\(\checkmark\)\(\checkmark\)
7  \(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)
8 \(\checkmark\)\(\checkmark\) \(\checkmark\)\(\checkmark\)\(\checkmark\)
9 \(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\) \(\checkmark\)
Table 15. The Contributions of Different Mutation Rules (GAT)
Table 16.
Top-NBIANORHCEP
0\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
1\(\checkmark\)\(\checkmark\)\(\checkmark\) 
2\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
3\(\checkmark\)\(\checkmark\)\(\checkmark\) 
4\(\checkmark\)\(\checkmark\)\(\checkmark\) 
5\(\checkmark\)\(\checkmark\)\(\checkmark\) 
6\(\checkmark\) \(\checkmark\)\(\checkmark\)
7\(\checkmark\)\(\checkmark\)\(\checkmark\) 
8\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
9\(\checkmark\)\(\checkmark\)\(\checkmark\) 
Table 16. The Contributions of Different Mutation Rules (GraphSAGE)
Table 17.
Top-NNORHCEP
0\(\checkmark\)  
1\(\checkmark\)\(\checkmark\)\(\checkmark\)
2\(\checkmark\)\(\checkmark\)\(\checkmark\)
3\(\checkmark\)\(\checkmark\) 
4\(\checkmark\)\(\checkmark\)\(\checkmark\)
5\(\checkmark\)\(\checkmark\) 
6\(\checkmark\)\(\checkmark\)\(\checkmark\)
7\(\checkmark\)\(\checkmark\)\(\checkmark\)
8\(\checkmark\)\(\checkmark\) 
9\(\checkmark\)\(\checkmark\)\(\checkmark\)
Table 17. The Contributions of Different Mutation Rules (TAGCN)
Some mutation rules, such as NOR and BIA, made high contributions to the effectiveness of GraphPrior on specific GNN models, generating a considerable ratio (i.e., from 50% to 100%) of the top-critical mutated models. For example, on GCN and GraphSAGE, BIA contributed to 100% of the top-N mutated models; on TAGCN, NOR contributed to 100% of the top-N mutated models.
Answer to RQ5: The mutation rule HC made high contributions to the effectiveness of GraphPrior on all four types of GNN models. Some mutation rules, such as NOR and BIA, made high contributions to the effectiveness of GraphPrior on specific GNN models.

5.6 RQ6: Enhancing GNNs with GraphPrior

Objectives: We investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining subsets to improve the performance of a GNN model.
Experimental design: Following the prior research by Ma et al. [57], our retraining experiments are structured as follows: First, we randomly partition the dataset into three sets, an initial training set, a candidate set, and a test set, with a ratio of 4:4:2. The candidate set is reserved exclusively for retraining, while the test set is kept untouched for evaluation. In the first round, we train a GNN model using only the initial training set and compute its accuracy on the test set; we employ the best model obtained over the training epochs for the subsequent retraining process. In the second round, we incorporate an additional 10% of new inputs from the candidate set into the existing training set without replacement. The inputs selected for inclusion are those ranked in the first 10% by the test prioritization approaches, namely, GraphPrior and the compared techniques. Following Ma et al. [57], we retrain the GNN models on the complete augmented training set, which ensures that the old and new training data are treated equally. We repeat the retraining process for multiple rounds until the candidate set is empty, keeping the test data untouched throughout. Moreover, to account for the randomness involved in model training, we repeat all the experiments 10 times and report the average results.
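The retraining protocol described above can be summarized by the sketch below; the callback functions `prioritize` and `train_and_eval` stand in for the actual prioritization approach and GNN training code and are illustrative.

```python
def retraining_rounds(train_idx, candidate_idx, test_idx, prioritize, train_and_eval, step=0.10):
    """Iteratively move the top-`step` prioritized candidate inputs into the training set and retrain.

    prioritize(model, idx)      -> candidate indices sorted from most to least likely misclassified
    train_and_eval(train, test) -> (trained model, accuracy on the untouched test set)
    """
    train_idx, candidate_idx = list(train_idx), list(candidate_idx)
    model, acc = train_and_eval(train_idx, test_idx)        # round 1: initial training set only
    accuracies = [acc]
    batch = max(1, int(step * len(candidate_idx)))          # 10% of the candidate set per round
    while candidate_idx:
        ranked = list(prioritize(model, candidate_idx))
        selected, candidate_idx = ranked[:batch], ranked[batch:]
        train_idx += selected                                # augment the training set without replacement
        model, acc = train_and_eval(train_idx, test_idx)     # retrain on the full augmented training set
        accuracies.append(acc)
    return accuracies
```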
Results: Table 18 illustrates the average accuracy of the GNN models after retraining with 10% to 100% of the prioritized test inputs. For each case, we highlight the approach with the highest effectiveness in grey to facilitate quick interpretation of the results. GraphPrior and the uncertainty-based test prioritization approaches outperform the random selection approach. However, the observed improvement is relatively small, indicating that GNN test prioritization approaches can guide the retraining of GNN models but with limited effect. In Table 18, we observe that the test prioritization methods, including GraphPrior and the compared approaches, consistently perform better than random selection across varying ratios of added data. Furthermore, when incorporating more than 10% of the prioritized tests, a significant majority of the test prioritization methods, specifically, 83.4% (10 out of 12), outperform random selection in each case. However, the improvements achieved by these test prioritization methods over random selection are relatively small, with the highest increase being only 0.014. Additionally, Figure 4 visually depicts an example outcome of the retraining experiments conducted on the Cora dataset using the GCN model, showcasing a comparative evaluation of the test prioritization approaches against random selection (indicated by the black line). As observed from the results, the test prioritization approaches perform better than random selection, but the improvement is visually slight.
Approaches        10%    20%    30%    40%    50%    60%    70%    80%    90%    100%   Average
KMGP              0.787  0.810  0.825  0.834  0.844  0.854  0.861  0.867  0.874  0.878  0.844
DNGP              0.787  0.811  0.827  0.836  0.844  0.853  0.859  0.867  0.873  0.877  0.843
LGGP              0.787  0.812  0.825  0.835  0.845  0.852  0.861  0.868  0.873  0.877  0.844
LRGP              0.787  0.811  0.825  0.835  0.845  0.854  0.864  0.869  0.873  0.877  0.844
XGGP              0.788  0.811  0.824  0.834  0.845  0.853  0.861  0.867  0.873  0.877  0.843
RFGP              0.787  0.813  0.825  0.835  0.845  0.853  0.860  0.869  0.874  0.877  0.844
DeepGini          0.788  0.801  0.814  0.826  0.836  0.844  0.851  0.858  0.866  0.870  0.835
Entropy           0.789  0.801  0.816  0.829  0.836  0.845  0.852  0.858  0.866  0.872  0.837
LeastConfidence   0.789  0.802  0.816  0.828  0.836  0.846  0.853  0.860  0.866  0.872  0.837
Margin            0.788  0.801  0.818  0.827  0.837  0.845  0.853  0.861  0.867  0.872  0.837
VanillaSM         0.788  0.804  0.819  0.829  0.837  0.846  0.853  0.861  0.867  0.873  0.838
PCS               0.787  0.802  0.817  0.827  0.837  0.845  0.854  0.860  0.866  0.872  0.837
Random            0.789  0.799  0.814  0.825  0.834  0.843  0.853  0.860  0.866  0.872  0.836
Table 18. The GNNs’ Average Accuracy Value after Retraining with 10%~100% Prioritized Tests
Fig. 4. Enhancing the accuracy of the GNN with prioritized tests (Cora with GCN).
One reason the effectiveness of GraphPrior and the uncertainty-based test prioritization approaches is limited lies in their inadequate consideration of node importance (i.e., a node's impact on other nodes in the dataset). In a GNN dataset, the complex interdependence among test inputs and their neighbors can give them different importance. For example, nodes with greater connectivity can influence more nodes and are therefore relatively more critical. However, the current test prioritization approaches focus only on the ability of test inputs to reveal system bugs, without regard to node importance. Although the test inputs they select can have a higher likelihood of misclassification, their importance within the dataset can be minor if they have very few neighbors, and retraining on such inputs has less effect. Consequently, it is crucial to consider node importance when selecting retraining data to achieve more effective outcomes.
GraphPrior achieved better effectiveness than the uncertainty-based test prioritization methods. In Table 18, we see that, when adding 20% or more of the test cases for retraining, the GraphPrior approaches perform the best in 100% of the cases. Figure 4 visually demonstrates that the GraphPrior approaches (solid line) perform better than the compared approaches (dotted line) in most cases.
Answer to RQ6: GraphPrior and the uncertainty-based test prioritization approaches outperform the random selection approach. However, the observed improvement is relatively small, indicating that GNN test prioritization approaches can guide the retraining of GNN models but with limited effect. GraphPrior achieved better effectiveness than the uncertainty-based test prioritization methods.

6 Discussion

6.1 Generality of GraphPrior

Although confidence-based test prioritization approaches demonstrate excellent effectiveness on traditional DNNs, they do not consider the interdependencies between test inputs, which are particularly crucial in GNN test prioritization. Our proposed GraphPrior leverages mutation analysis of GNN models to prioritize GNN test inputs and has been demonstrated effective on node classification tasks through 604 carefully designed subjects. In fact, the scheme of GraphPrior (i.e., modifying training parameters to mutate the GNN model for test prioritization) can also be generalized to other dimensions of GNN tasks, including graph-level and edge-level tasks. In the future, we will further verify the extension of GraphPrior from this perspective.
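To illustrate the scheme, the following is a minimal, hypothetical sketch assuming PyTorch Geometric: a mutated model is obtained by flipping a Boolean construction parameter of a GCN (here the bias of the convolution layers, as an illustrative stand-in for a Boolean-parameter rule such as BIA), and a test node is considered to kill the mutant when the mutant's prediction for that node differs from the original model's. Training of both models is omitted for brevity; in GraphPrior, each mutated model is actually trained, and kills are aggregated over many mutants to obtain the killing-based score.

```python
# Illustrative sketch only: generate one GCN mutant by flipping a construction parameter
# and count which test nodes "kill" it (prediction differs from the original model).
import torch
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, bias=True):
        super().__init__()
        self.conv1 = GCNConv(8, 16, bias=bias)
        self.conv2 = GCNConv(16, 3, bias=bias)

    def forward(self, x, edge_index):
        return self.conv2(torch.relu(self.conv1(x, edge_index)), edge_index)

torch.manual_seed(0)
x = torch.randn(6, 8)                                  # 6 toy nodes with 8 features each
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4],
                           [1, 0, 2, 1, 4, 5]])        # toy edges in COO format

original = GCN(bias=True)
mutant = GCN(bias=False)                               # one mutated model (Boolean flip)
mutant.load_state_dict(original.state_dict(), strict=False)  # share weights; bias dropped

with torch.no_grad():
    pred_orig = original(x, edge_index).argmax(dim=1)
    pred_mut = mutant(x, edge_index).argmax(dim=1)

kills = (pred_orig != pred_mut).int()                  # 1 where the node kills this mutant
print(kills.tolist())                                  # summed over many mutants -> KMGP score
```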
[The applicability of GraphPrior to regression tasks]. We also discuss the potential applicability of GraphPrior to regression tasks. Currently, the mutation rules and ranking models of GraphPrior are specifically designed for classification tasks. Extending GraphPrior to regression tasks would require modifying both. If appropriate mutation rules can be identified for regression tasks and suitable ranking models can be designed, then GraphPrior could also be applied to regression tasks.

6.2 Limitations of GraphPrior

[Diversity of the prioritized data]. One limitation of GraphPrior lies in guaranteeing the diversity of the selected bug-revealing data. This limitation is also noted in prior work on uncertainty-based test prioritization approaches [26], which did not consider the diversity of bugs when prioritizing test inputs. Similarly, GraphPrior does not aim for diversity in the prioritized tests. However, GraphPrior has demonstrated the ability to identify a significant majority of misclassified test inputs with a small ratio of prioritized test cases. Specifically, RFGP (i.e., the most effective GraphPrior approach) detects over 80% of misclassified tests by prioritizing only 40% of the test inputs. While prioritizing diverse bugs can improve the overall quality of testing, efficiently identifying a large proportion of bugs with a small set of prioritized tests remains a practical strategy when time and resources are limited.
[GraphPrior in active learning scenarios]. Active learning [68] operates under the assumption that samples within a dataset contribute differently to the improvement of the current model and aims to select the most informative samples for inclusion in the training set. Our investigation in RQ6 demonstrated that GraphPrior and the uncertainty-based metrics can be utilized to select informative retraining tests, but their effectiveness is limited. Specifically, despite the demonstrated success of uncertainty-based metrics such as DeepGini and margin in previous studies on DNNs [26, 36], their effectiveness in the context of GNNs is slight. We explore potential reasons for this phenomenon below.
One crucial reason for their limited effectiveness lies in their inadequate consideration of node importance, i.e., the impact that a node has on other nodes in the graph dataset. In a GNN dataset, the complex interdependence among test inputs and their neighbors can result in differing levels of importance for different nodes. For instance, nodes with higher connectivity can be more influential and hence more critical. However, current test prioritization approaches focus only on the ability of test inputs to expose system bugs, without taking node importance into account. Although these approaches may identify inputs with a higher likelihood of misclassification, the importance of those inputs within the dataset may be negligible if they have only a few neighbors. Retraining on such inputs is, therefore, less effective.
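As an illustration only (this is not part of GraphPrior or the compared approaches), the sketch below shows one hypothetical way node importance could be folded into retraining-data selection: weighting the misclassification-likelihood score of each test node by its normalized degree, so that well-connected nodes are preferred among equally suspicious ones.

```python
# Hypothetical illustration: combining a prioritization score with a simple
# node-importance proxy (degree) when selecting retraining data.
import numpy as np

scores = np.array([0.9, 0.85, 0.4, 0.7, 0.2])     # higher = more likely misclassified
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (1, 4)]  # toy undirected graph; node 1 is most connected

degree = np.zeros(len(scores))
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

importance = degree / degree.max()                 # normalized connectivity as importance
combined = scores * importance                     # weight fault-proneness by importance

print("score-only order: ", np.argsort(-scores))   # ignores structure
print("importance-aware: ", np.argsort(-combined)) # well-connected nodes move up
```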
Furthermore, we elaborate on the difference between GraphPrior and the existing active learning methods evaluated in our study. The active learning methods used for comparison in our article are primarily uncertainty-based and are aimed at datasets in which each sample is independent of the others. For graph datasets, these methods select retraining data without considering the interdependencies between nodes and also neglect node importance, merely selecting possibly misclassified nodes. In contrast, GraphPrior employs mutation analysis to identify test inputs that are more likely to be misclassified while considering the interdependencies between nodes during the mutation process. Despite this added consideration, GraphPrior's goal remains to select misclassified test inputs; it does not explicitly consider node importance either, leading to slight effectiveness similar to the uncertainty-based methods.
[Generating mutants for large-scale GNN models]. In our experiments, based on the current models and datasets, the time cost of generating mutants (i.e., training the mutated models) is within an acceptable range. When dealing with large-scale GNN models, GraphPrior can require substantial computational resources, but it remains feasible in situations where the cost of manual labeling outweighs the computational cost.

6.3 Threats to Validity

Threats to Internal Validity. The internal threats to validity mainly lie in the implementation of our proposed GraphPrior and the compared approaches. To reduce this threat, we implemented GraphPrior based on the widely used library PyTorch and adopted the implementations of the compared approaches published by their authors. Another internal threat lies in the randomness of model training. To mitigate this threat and ensure the stability of our experimental results, we conducted a statistical analysis: we repeated the training process 10 times for both the original model and the mutated models and calculated the statistical significance of the results.
The selection of mutation rules in our study presents another internal threat to validity. Despite our best efforts to collect a comprehensive set of mutation rules, it is possible that other training parameters beyond our current knowledge could serve as mutation rules. To mitigate this threat, we selected mutation rules that can directly or indirectly affect node interdependence in the prediction process. The selection of parameter ranges for mutation rules is another internal threat that could affect the effectiveness of the rules. To mitigate this threat, we adopted a strategy in which we inverted the values of Boolean parameters, setting true to false and false to true. For integer and float parameters, we selected a range that introduces only slight changes to the original GNN model. Our experimental results demonstrated the effectiveness of GraphPrior, indicating that the mutation rules and selected parameter range are suitable for GNN test prioritization.
Threats to External Validity. The external threats to validity mainly lie in the GNN models under test and the testing datasets used in our study. To mitigate this threat, we adopted a large number of subjects (pairs of models and datasets) and leveraged different types of test inputs. We applied eight graph adversarial attacks from public studies to generate adversarial test inputs and varied the attack level for a more detailed evaluation. In the future, we will apply GraphPrior to more diverse GNN models and test datasets.

7 Related Work

We present the related work in four aspects: test prioritization techniques, deep neural network testing, mutation testing for DNNs, and mutation-based test prioritization for traditional software.

7.1 Test Prioritization Techniques

In traditional software testing, test prioritization [11, 12, 13, 19, 20, 33, 69, 92] aims to find an ideal order of test cases that reveals system bugs earlier. Prioritizing test cases helps address two critical constraints of software testing, time and budget, by detecting more fault-revealing test cases within a limited period. Di Nardo et al. [19] conducted a case study of coverage-based prioritization strategies on real-world regression faults, evaluating the effectiveness of several test case prioritization techniques in bug detection. Rothermel et al. [69] presented and compared three types of test case prioritization techniques for regression testing that are based on test execution information, demonstrating that each of the studied techniques increased the fault detection rate of the test suite. Henard et al. [33] conducted a comprehensive study to compare existing test prioritization approaches, finding that the differences between white-box [23, 24, 49, 92] and black-box [32, 34, 46] strategies are small. Chen et al. [13] proposed LET to prioritize test programs for compiler testing acceleration and demonstrated its effectiveness. LET works in two phases: a learning process that identifies program features and predicts the bug-revealing probability of a new test program, and a scheduling process that prioritizes test programs based on these probabilities. Chen et al. [11] proposed to prioritize test programs for compilers based on predicted test coverage information.
In terms of test prioritization for DNNs, Feng et al. [26] proposed the state-of-the-art approach DeepGini, which identifies possibly misclassified tests based on model uncertainty. DeepGini assumes that a test is more likely to be mispredicted if the DNN outputs similar probabilities for each class. Byun et al. [6] evaluated several metrics that prioritize bug-revealing inputs based on white-box measures of a DNN's sentiment, including softmax confidence (i.e., the predicted probability for output categories in DNNs that use softmax output layers), Bayesian uncertainty (i.e., the uncertainty of the prediction probability distributions in Bayesian Neural Networks), and input surprise (i.e., the distance of the neuron activation pattern between a test input and the training data). Wang et al. [81] proposed PRIMA to prioritize test inputs for DNNs via intelligent mutation analysis. PRIMA improves DNN test prioritization in two main aspects: first, it can be applied not only to classification models but also to regression models; second, it can handle test inputs generated by adversarial input generation approaches [8] that can increase the probability of the wrong class. Furthermore, some data selection approaches [80] have also been proposed to detect possibly misclassified tests for DNNs. Despite its effectiveness in DNN test prioritization, PRIMA cannot be directly applied to GNNs, because its mutation operators are not adapted to graph-structured data and GNN models.
More specifically, GNN models operate on graph-structured data, where nodes and edges represent entities and their relationships. Conversely, the input mutation rules of PRIMA were designed for independent test samples, rendering them unsuitable for GNNs. Moreover, GNNs incorporate unique graph operations and aggregation mechanisms, including graph convolution operations and message passing mechanisms. PRIMA’s model mutation rules are not applicable to the graph-level mechanisms employed by GNNs. As such, GNNs require specialized test prioritization techniques, such as GraphPrior, which leverages the properties of GNN models in its mutation analysis for test prioritization. More specifically, to address the limitations of PRIMA, GraphPrior introduces mutation rules that are designed based on the graph operations and aggregation mechanisms of GNNs. These rules can directly or indirectly impact message passing. Consequently, GraphPrior enables prioritizing tests for graph-structured data.
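As a side note on the uncertainty-based baselines discussed above, the sketch below reconstructs the intuition behind DeepGini as described earlier: a test receives a higher score when the predicted class probabilities are close to uniform, computed here as the Gini impurity of the softmax output. This is a hedged reconstruction of the metric in Feng et al. [26], not their reference implementation.

```python
# Sketch of the DeepGini intuition: prioritize tests whose class probabilities are most uniform.
import numpy as np

def deepgini(probs):
    """probs: (n_tests, n_classes) softmax outputs; returns one score per test (higher = more uncertain)."""
    probs = np.asarray(probs)
    return 1.0 - np.sum(probs ** 2, axis=1)

probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low score
    [0.40, 0.35, 0.25],   # uncertain prediction -> high score, prioritized first
])
scores = deepgini(probs)
order = np.argsort(-scores)           # descending: most uncertain tests first
print(scores.round(3), order)
```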

7.2 Deep Neural Network Testing

Besides test input prioritization, some test selection approaches have also been proposed to improve the efficiency of DNN testing. Test selection aims to precisely estimate the accuracy of the whole set by labeling only a set of selected test inputs, thereby reducing the labeling cost of DNN testing. Li et al. [50] proposed CES (Cross Entropy-based Sampling) and CSS (Confidence-based Stratified Sampling) to select a small group of representative test inputs to estimate the accuracy of the whole testing set. CES minimizes the cross-entropy between the selected set and the entire test set to ensure that the distribution of the selected test set is similar to the original test set, while CSS leverages the confidence features of test inputs to guarantee this similarity. Chen et al. [14] proposed PACE (Practical Accuracy Estimation), which selects test inputs based on clustering, prototype selection, and adaptive random testing. PACE first clusters all the test inputs into different groups based on their testing capabilities, then utilizes the MMD-critic algorithm [43] to select prototypes from each group, and leverages adaptive random testing to select tests from the inputs that do not belong to any group. Compared to the aforementioned research, our work focuses on test prioritization, which ranks all the test inputs without discarding any of them. In this way, testers or developers can find bug-revealing test inputs earlier.
In addition to improving the efficiency of DNN testing, several existing studies [37, 44, 54, 55, 56, 66] have focused on measuring the adequacy of DNN testing. Pei et al. [66] proposed the neuron coverage metric to evaluate how adequately a test set covers the logic of a DNN model and, based on this metric, proposed a white-box framework for testing DNNs. In a follow-up study, Ma et al. [55] proposed DeepGauge, a set of DNN testing coverage criteria to measure the test adequacy of DNNs. DeepGauge also considers neuron coverage to be a good indicator of the effectiveness of a test input; based on the basic neuron coverage metric, it introduces new metrics with different granularities to differentiate adversarial attacks from legitimate test data. Kim et al. [44] proposed surprise adequacy for testing DL models, which identifies how effective a test input is by measuring its surprise with respect to the training set; the surprise of a test input refers to how different the neuron activation pattern it triggers is from those observed on the training data.

7.3 Mutation Testing for DNNs

Several existing studies have explored the use of mutation testing for DNNs and developed different mutation operators and frameworks. Ma et al. [56] proposed DeepMutation to measure the quality of test data for DL systems based on mutation testing. To this end, they designed a set of source-level and model-level mutation operators to inject faults into the training data, training programs, and DL models; the quality of test data is evaluated by analyzing the extent to which the injected faults can be detected. This work was later extended into a mutation testing tool for DL systems named DeepMutation++ [37], which proposed a set of new mutation operators for Feed-forward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs) and can dynamically mutate the runtime states of an RNN. Humbatova et al. [39] proposed DeepCrime, the first mutation testing tool that implements a set of DL mutation operators based on real DL faults. Shen et al. [72] proposed MuNN, a mutation analysis method for neural networks that defines five mutation operators based on the characteristics of neural networks. Their results reveal that mutation analysis has strong domain characteristics, indicating the need for domain-specific mutation operators and new mutation mechanisms for deep neural networks.
The above studies in mutation testing have focused on traditional DNNs, which are typically evaluated on datasets with independent samples. However, the mutation rules employed in these studies do not account for the interdependence among test inputs, which is a crucial factor in GNN testing. In contrast, the mutation rules of GraphPrior are designed to affect the message passing mechanism in the GNN prediction process: in a mutated GNN model, the way nodes aggregate information from their neighboring nodes differs slightly from that of the original GNN model. The mutation features generated based on these mutation rules are fed into ranking models to predict the likelihood of a test input being misclassified by the GNN model.
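The following sketch illustrates, with synthetic data, how such a feature-based use of mutation results can be realized with a random-forest ranking model (in the spirit of RFGP, though not its exact implementation): each test input is encoded as a binary kill vector over the mutated models, a classifier is trained on labeled tests, and unlabeled tests are ranked by the predicted misclassification probability.

```python
# Illustrative sketch with synthetic data: feature-based ranking of tests from mutation results.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_train, n_test, n_mutants = 200, 50, 20

# Synthetic mutation features: misclassified tests tend to kill more mutants.
y_train = rng.integers(0, 2, size=n_train)                   # 1 = misclassified by the GNN
kill_rate = np.where(y_train == 1, 0.6, 0.2)[:, None]
X_train = (rng.random((n_train, n_mutants)) < kill_rate).astype(int)

X_test = (rng.random((n_test, n_mutants)) < 0.4).astype(int) # unlabeled tests to prioritize

ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
misclass_prob = ranker.predict_proba(X_test)[:, 1]           # learned importance per mutant
priority_order = np.argsort(-misclass_prob)                  # most suspicious tests first
print(priority_order[:10])
```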

7.4 Mutation-based Test Prioritization for Traditional Software

In traditional software testing, mutation testing is a well-established technique to evaluate the quality of test sets. Mutation-based test prioritization prioritizes test cases based on their ability to detect mutants; the key idea is that test cases that detect mutants are likely to be more effective at finding real faults in the code and should therefore be given higher priority. Several mutation-based approaches [52, 74] have been proposed. Lou et al. [52] proposed a test-case prioritization approach based on the fault detection capability of test cases, introducing two models to calculate this capability: a statistics-based model and a probability-based model. Their experimental study found that the statistics-based model outperforms the compared approaches. Shin et al. [74] proposed a test case prioritization technique guided by the diversity-aware mutation adequacy criterion and empirically evaluated the effectiveness of mutation-based prioritization techniques with large-scale developer-written test cases. Papadakis et al. [63] proposed mutating Combinatorial Interaction Testing models and using them to prioritize tests based on their ability to kill mutants, showing that the number of killed model-based mutants correlates strongly with the code-level faults revealed by the test cases. Unlike these approaches, which treat each test case as independent, our proposed GraphPrior targets GNNs, whose graph datasets usually contain complex connections between test inputs. GraphPrior utilizes several mutation rules to generate GNN mutants for test prioritization and, to better leverage the mutation results, adopts several ranking models [5, 42, 83] that learn to predict the probability of a test input being misclassified.

8 Conclusion

To improve the efficiency of GNN testing, we aim to prioritize possibly misclassified test inputs so that GNN bugs are revealed earlier. A crucial limitation of existing test prioritization approaches is that, when applied to GNNs, they do not take into account the interdependence between test inputs (nodes). In this article, we propose GraphPrior, a set of test prioritization approaches specifically for GNN testing. GraphPrior assumes that a test input is more likely to be misclassified if it kills many mutated models. Based on this assumption, GraphPrior leverages carefully designed mutation rules to generate mutated models for GNNs and obtains the mutation results of test inputs by executing the mutated models. GraphPrior utilizes the mutation results in two ways, namely, killing-based and feature-based methods. When scoring a test, killing-based methods consider each mutated model equally important, while feature-based methods learn different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. We conducted an extensive study to evaluate the effectiveness of the GraphPrior approaches on 604 subjects, comparing them with existing approaches that can detect possibly misclassified test inputs. The experimental results demonstrate the effectiveness of GraphPrior. In terms of APFD, the killing-based GraphPrior approach, KMGP, exceeds the compared approaches (i.e., DeepGini, margin, Vanilla Softmax, PCS, Entropy, least confidence, and random selection) by 0.034~0.248 on average. Furthermore, RFGP (i.e., the feature-based GraphPrior approach) exhibits better performance than the other GraphPrior approaches. Specifically, RFGP outperforms the uncertainty-based test prioritization approaches against different adversarial attacks, with an average improvement of 2.95%~46.69%.

References

[1]
Bernhard K. Aichernig, Harald Brandl, Elisabeth Jöbstl, Willibald Krenn, Rupert Schlick, and Stefan Tiran. 2015. Killing strategies for model-based mutation testing. Softw. Test. Verif. Reliab. 25, 8 (2015), 716–748. DOI:
[2]
Paul Ammann and Jeff Offutt. 2008. Introduction to Software Testing. Cambridge University Press.
[3]
Aleksandar Bojchevski and Stephan Günnemann. 2019. Adversarial attacks on node embeddings via graph poisoning. In International Conference on Machine Learning. PMLR, 695–704.
[4]
Pietro Bongini, Monica Bianchini, and Franco Scarselli. 2021. Molecular generative graph neural networks for drug discovery. Neurocomputing 450 (2021), 242–252.
[5]
Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5–32.
[6]
Taejoon Byun, Vaibhav Sharma, Abhishek Vijayakumar, Sanjai Rayadurgam, and Darren Cofer. 2019. Input prioritization for testing neural networks. In IEEE International Conference on Artificial Intelligence Testing (AITest’19). IEEE, 63–70.
[7]
Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30, 9 (2018), 1616–1637.
[8]
Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP’17). IEEE, 39–57.
[9]
Thierry Titcheu Chekam, Mike Papadakis, Yves Le Traon, and Mark Harman. 2017. An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption. In 39th International Conference on Software Engineering, Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard (Eds.). IEEE/ACM, 597–608. DOI:
[10]
Cen Chen, Kenli Li, Sin G. Teo, Xiaofeng Zou, Kang Wang, Jie Wang, and Zeng Zeng. 2019. Gated residual recurrent graph neural networks for traffic prediction. In AAAI Conference on Artificial Intelligence, Vol. 33. 485–492.
[11]
Junjie Chen. 2018. Learning to accelerate compiler testing. In 40th International Conference on Software Engineering. 472–475.
[12]
Junjie Chen, Yanwei Bai, Dan Hao, Yingfei Xiong, Hongyu Zhang, and Bing Xie. 2017. Learning to prioritize test programs for compiler testing. In IEEE/ACM 39th International Conference on Software Engineering (ICSE’17). IEEE, 700–711.
[13]
Junjie Chen, Guancheng Wang, Dan Hao, Yingfei Xiong, Hongyu Zhang, Lu Zhang, and Bing Xie. 2018. Coverage prediction for accelerating compiler testing. IEEE Trans. Softw. Eng. 47, 2 (2018), 261–278.
[14]
Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang, and Ming Yan. 2020. Practical accuracy estimation for efficient deep neural network testing. ACM Trans. Softw. Eng. Method. 29, 4 (2020), 1–35.
[15]
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[16]
Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. 2018. Adversarial attack on graph structured data. In International Conference on Machine Learning. PMLR, 1115–1124.
[17]
Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward. 1978. Hints on test data selection: Help for the practicing programmer. IEEE Comput. 11, 4 (1978), 34–41. DOI:
[18]
Xavier Devroey, Gilles Perrouin, Mike Papadakis, Axel Legay, Pierre-Yves Schobbens, and Patrick Heymans. 2016. Featured model-based mutation analysis. In 38th International Conference on Software Engineering, Laura K. Dillon, Willem Visser, and Laurie A. Williams (Eds.). ACM, 655–666. DOI:
[19]
Daniel Di Nardo, Nadia Alshahwan, Lionel Briand, and Yvan Labiche. 2013. Coverage-based test case prioritisation: An industrial case study. In IEEE 6th International Conference on Software Testing, Verification and Validation. IEEE, 302–311.
[20]
Hyunsook Do and Gregg Rothermel. 2006. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Trans. Softw. Eng. 32, 9 (2006), 733–752.
[21]
Jian Du, Shanghang Zhang, Guanhang Wu, José M. F. Moura, and Soummya Kar. 2017. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370 (2017).
[22]
Sebastian Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. 2002. Test case prioritization: A family of empirical studies. IEEE Trans. Softw. Eng. 28, 2 (2002), 159–182.
[23]
Sebastian Elbaum, Gregg Rothermel, and John Penix. 2014. Techniques for improving regression testing in continuous integration development environments. In 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 235–245.
[24]
Emelie Engström, Per Runeson, and Mats Skoglund. 2010. A systematic review on regression test selection techniques. Inf. Softw. Technol. 52, 1 (2010), 14–30.
[25]
Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. In World Wide Web Conference. 417–426.
[26]
Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. DeepGini: Prioritizing massive tests to enhance the robustness of deep neural networks. In 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 177–188.
[27]
Thomas Gaudelet, Ben Day, Arian R. Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy B. R. Hayter, Richard Vickers, Charles Roberts, Jian Tang, David Roblin, Tom L. Blundell, Michael M. Bronstein, and Jake P. Taylor-King. 2021. Utilizing graph machine learning within drug discovery and development. Brief. Bioinform. 22, 6 (2021).
[28]
Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. 2021. Robustness of graph neural networks at scale. Adv. Neural Inf. Process. Syst. 34 (2021), 7637–7649.
[29]
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning. PMLR, 1263–1272.
[30]
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 30 (2017).
[31]
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
[32]
Hadi Hemmati, Andrea Arcuri, and Lionel Briand. 2013. Achieving scalable model-based testing through test case diversity. ACM Trans. Softw. Eng. Methodol. 22, 1 (2013), 1–42.
[33]
Christopher Henard, Mike Papadakis, Mark Harman, Yue Jia, and Yves Le Traon. 2016. Comparing white-box and black-box test prioritization. In IEEE/ACM 38th International Conference on Software Engineering (ICSE’16). IEEE, 523–534.
[34]
Christopher Henard, Mike Papadakis, Gilles Perrouin, Jacques Klein, Patrick Heymans, and Yves Le Traon. 2014. Bypassing the combinatorial explosion: Using similarity to generate and prioritize t-wise test configurations for software product lines. IEEE Trans. Softw. Eng. 40, 7 (2014), 650–670.
[35]
Danfeng Hong, Lianru Gao, Jing Yao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. 2020. Graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Rem. Sens. 59, 7 (2020), 5966–5978.
[36]
Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Wei Ma, Mike Papadakis, and Yves Le Traon. 2021. Towards exploring the limitations of active learning: An empirical study. In 36th IEEE/ACM International Conference on Automated Software Engineering (ASE’21). IEEE, 917–929.
[37]
Qiang Hu, Lei Ma, Xiaofei Xie, Bing Yu, Yang Liu, and Jianjun Zhao. 2019. DeepMutation++: A mutation testing framework for deep learning systems. In 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19). IEEE, 1158–1161.
[38]
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Adv. Neural Inf. Process. Syst. 33 (2020), 22118–22133.
[39]
Nargiz Humbatova, Gunel Jahangirova, and Paolo Tonella. 2021. DeepCrime: Mutation testing of deep learning systems based on real faults. In 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 67–78.
[40]
Kanchan Jha, Sriparna Saha, and Hiteshi Singh. 2022. Prediction of protein–protein interaction using graph neural networks. Scient. Rep. 12, 1 (2022), 1–12.
[41]
Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Syst. Applic. 207 (2022), 117921.
[42]
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
[43]
Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. Adv. Neural Inf. Process. Syst. 29 (2016).
[44]
Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 1039–1049.
[45]
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[46]
Yves Ledru, Alexandre Petrenko, Sergiy Boroday, and Nadine Mandran. 2012. Prioritizing test cases with string distances. Autom. Softw. Eng. 19, 1 (2012), 65–95.
[47]
Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. 2017. DeepCas: An end-to-end predictor of information cascades. In 26th International Conference on World Wide Web. 577–586.
[48]
Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. 2020. DeepRobust: A PyTorch library for adversarial attacks and defenses. arXiv preprint arXiv:2005.06149 (2020).
[49]
Zheng Li, Mark Harman, and Robert M. Hierons. 2007. Search algorithms for regression test case prioritization. IEEE Trans. Softw. Eng. 33, 4 (2007), 225–237.
[50]
Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. 2019. Boosting operational DNN testing efficiency through conditioning. In 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 499–509.
[51]
Yiling Lou, Junjie Chen, Lingming Zhang, and Dan Hao. 2019. A survey on regression test-case prioritization. In Advances in Computers. Vol. 113. Elsevier, 1–46.
[52]
Yiling Lou, Dan Hao, and Lu Zhang. 2015. Mutation-based test-case prioritization in software evolution. In IEEE 26th International Symposium on Software Reliability Engineering (ISSRE’15). IEEE, 46–57.
[53]
Jiaqi Ma, Shuangrui Ding, and Qiaozhu Mei. 2020. Towards more practical adversarial attacks on graph neural networks. Adv. Neural Inf. Process. Syst. 33 (2020), 4756–4766.
[54]
Lei Ma, Felix Juefei-Xu, Minhui Xue, Bo Li, Li Li, Yang Liu, and Jianjun Zhao. 2019. DeepCT: Tomographic combinatorial testing for deep learning systems. In IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER’19). IEEE, 614–618.
[55]
Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multi-granularity testing criteria for deep learning systems. In 33rd ACM/IEEE International Conference on Automated Software Engineering. 120–131.
[56]
Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, et al. 2018. DeepMutation: Mutation testing of deep learning systems. In IEEE 29th International Symposium on Software Reliability Engineering (ISSRE’18). IEEE, 100–111.
[57]
Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems. ACM Trans. Softw. Eng. Methodol. 30, 2 (2021), 1–22.
[58]
Yao Ma, Suhang Wang, Tyler Derr, Lingfei Wu, and Jiliang Tang. 2019. Attacking graph convolutional networks via rewiring. arXiv preprint arXiv:1906.03750 (2019).
[59]
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
[60]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[61]
Quang Hung Nguyen, Hai-Bang Ly, Lanh Si Ho, Nadhir Al-Ansari, Hiep Van Le, Van Quan Tran, Indra Prakash, and Binh Thai Pham. 2021. Influence of data splitting on performance of machine learning models in prediction of shear strength of soil. Math. Prob. Eng. 2021 (2021), 1–15.
[62]
Niccolò Pancino, Alberto Rossi, Giorgio Ciano, Giorgia Giacomini, Simone Bonechi, Paolo Andreini, Franco Scarselli, Monica Bianchini, and Pietro Bongini. 2020. Graph neural networks for the prediction of protein-protein interfaces. In European Conference on Artificial Neural Networks. 127–132.
[63]
Mike Papadakis, Christopher Henard, and Yves Le Traon. 2014. Sampling program inputs with mutation analysis: Going beyond combinatorial interaction testing. In 7th IEEE International Conference on Software Testing, Verification and Validation. IEEE Computer Society, 1–10. DOI:
[64]
Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: An analysis and survey. Adv. Comput. 112 (2019), 275–378. DOI:
[65]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019).
[66]
Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In 26th Symposium on Operating Systems Principles. 1–18.
[67]
Michael Prince. 2004. Does active learning work? A review of the research. J. Eng. Educ. 93, 3 (2004), 223–231.
[68]
Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2021. A survey of deep active learning. ACM Comput. Surv. 54, 9 (2021), 1–40.
[69]
Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. 2001. Prioritizing test cases for regression testing. IEEE Trans. Softw. Eng. 27, 10 (2001), 929–948.
[70]
Benedek Rozemberczki and Rik Sarkar. 2020. Characteristic functions on graphs: Birds of a feather, from statistical descriptors to parametric models. In 29th ACM International Conference on Information & Knowledge Management. 1325–1334.
[71]
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Trans. Neural Netw. 20, 1 (2008), 61–80.
[72]
Weijun Shen, Jun Wan, and Zhenyu Chen. 2018. MuNN: Mutation analysis of neural networks. In IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C’18). IEEE, 108–115.
[73]
Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. 2020. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382 (2020).
[74]
Donghwan Shin, Shin Yoo, Mike Papadakis, and Doo-Hwan Bae. 2019. Empirical evaluation of mutation-based test case prioritization techniques. Softw. Test., Verif. Reliab. 29, 1-2 (2019), e1695.
[75]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[76]
Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. 2019. Relational action forecasting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 273–283.
[77]
Lichao Sun, Yingtong Dou, Carl Yang, Ji Wang, Philip S. Yu, Lifang He, and Bo Li. 2018. Adversarial attack and defense on graph data: A survey. arXiv preprint arXiv:1812.10528 (2018).
[78]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[79]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[80]
Dan Wang and Yi Shang. 2014. A new active labeling method for deep learning. In International Joint Conference on Neural Networks (IJCNN’14). IEEE, 112–119.
[81]
Zan Wang, Hanmo You, Junjie Chen, Yingyi Zhang, Xuyuan Dong, and Wenbin Zhang. 2021. Prioritizing test inputs for deep neural networks via mutation analysis. In IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 397–409.
[82]
Michael Weiss and Paolo Tonella. 2022. Simple techniques work surprisingly well for neural network test prioritization and active learning (replicability study). In 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 139–150.
[83]
Raymond E. Wright. 1995. Logistic regression. In Reading and Understanding Multivariate Statistics, L. G. Grimm and P. R. Yarnold (Eds.). American Psychological Association, 217–244.
[84]
Le Wu, Peijie Sun, Richang Hong, Yanjie Fu, Xiting Wang, and Meng Wang. 2018. SocialGCN: An efficient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815 (2018).
[85]
Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. 2022. Graph neural networks in recommender systems: A survey. Comput. Surv. 55, 5 (2022), 1–37.
[86]
Kaidi Xu, Hongge Chen, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Mingyi Hong, and Xue Lin. 2019. Topology attack and defense for graph neural networks: An optimization perspective. arXiv preprint arXiv:1906.04214 (2019).
[87]
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[88]
Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning. PMLR, 40–48.
[89]
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In AAAI Conference on Artificial Intelligence, Vol. 33. 7370–7377.
[90]
Ruiping Yin, Kan Li, Guangquan Zhang, and Jie Lu. 2019. A deeper graph neural network for recommender systems. Knowl.-based Syst. 185 (2019), 105020.
[91]
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
[92]
Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: A survey. Softw. Test., Verif. Reliab. 22, 2 (2012), 67–120.
[93]
Junliang Yu, Hongzhi Yin, Jundong Li, Min Gao, Zi Huang, and Lizhen Cui. 2020. Enhance social recommendation with adversarial graph convolutional networks. IEEE Trans. Knowl. Data Eng. 34, 8 (2020).
[94]
Long Zhang, Xuechao Sun, Yong Li, and Zhenyu Zhang. 2019. A noise-sensitivity-analysis-based test prioritization technique for deep neural networks. arXiv preprint arXiv:1901.00054 (2019).
[95]
Qin Zhang, Keping Yu, Zhiwei Guo, Sahil Garg, Joel J. P. C. Rodrigues, Mohammad Mehedi Hassan, and Mohsen Guizani. 2021. Graph neural network-driven traffic forecasting for the connected internet of vehicles. IEEE Trans. Netw. Sci. Eng. 9, 5 (2021), 3015–3027.
[96]
Dongbin Zhao, Haitao Wang, Kun Shao, and Yuanheng Zhu. 2016. Deep reinforcement learning with experience replay based on SARSA. In IEEE Symposium Series on Computational Intelligence (SSCI’16). IEEE, 1–6.
[97]
Hang Zhou, Weikun Wang, Jiayun Jin, Zengwei Zheng, and Binbin Zhou. 2022. Graph neural network for protein–protein interaction prediction: A comparative study. Molecules 27, 18 (2022), 6135.
[98]
Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
[99]
Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. 2018. Adversarial attacks on neural networks for graph data. In 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2847–2856.
[100]
Daniel Zügner and Stephan Günnemann. 2019. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412 (2019).
