1. Introduction
Over the last several years, the discipline of graph deep learning has advanced remarkably quickly, growing from a niche topic studied by a small group of researchers into one of the fastest-expanding areas of artificial intelligence. Graphs come with a complete set of mathematical underpinnings, and graph theory [1] offers an elegant theoretical framework on which we can examine, comprehend, and learn from complex systems in the real world. More crucially, the amount and quality of graph-structured data accessible to researchers have increased significantly in recent years, driven by the prevalence of such data: billions of IoT network devices, security knowledge graphs containing attack and threat semantics, the growth of sizable social networking platforms, extensive scientific interoperability, and even COVID-19 modeling of protein molecular structures. In numerous domains, including network security [2,3], IoT networks [4,5], Cyber-Physical Systems [6,7], social media [8,9,10], citation networks [11], structural biology [12,13], 3D modeling [14,15], and telecommunication networks [16,17,18,19], graph data are frequently utilized as an efficient data format to describe and extract complicated data patterns. Figure 1 shows how Cyber-Physical Systems, social networks, satellite Internet, security knowledge graphs, molecular structures, and 3D lattices can all be abstracted into graph-structured data.
Deep neural network models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Belief Networks (DBN), deep auto-encoders, and Generative Adversarial Networks (GAN) have stolen the show in industrial fields such as computer vision [20,21], natural language processing [21,22,23,24], and recommender systems [25,26]. This is because tasks such as image classification and machine translation operate on low-dimensional data representations with a grid-like layout.
Figure 2 depicts the extraction of text features using bag-of-words models and the extraction of image features using convolutional neural networks. However, many intriguing problems involve data that are not in a grid-like form but are spread over irregular domains. Grid-like structured data are commonly termed Euclidean data, whereas non-grid-like structured data found in atypical domains are referred to as non-Euclidean data. Non-Euclidean data do not have translation invariance, so traditional convolutional methods are very limited in processing non-Euclidean graph data. Traditional convolutional architectures such as CNNs and RNNs, which have made a splash on tasks such as computer vision and natural language processing, cannot rely on a fixed convolutional kernel size to sample feature information from arbitrarily sized neighborhoods. Consequently, Graph Convolutional Networks (GCN) were first presented by Kipf and Welling [27] in 2017. As an extension of convolutional neural networks to the graph domain, a GCN can successfully handle end-to-end learning tasks such as node classification, link prediction, relation extraction, and anomaly detection.
In a non-Euclidean space, edges connect pairs of nodes: an undirected graph contains only symmetric links, while a directed graph contains non-symmetric links. To model this structural information, it is crucial to assign each node a learnable embedding based on its spatial relationships. In many time-series tasks, attention mechanisms have become the standard tool for representation learning (embedding). In reality, it would be ideal for people to observe the whole picture of a thing before making a judgment; the human brain, however, can quickly focus on the most recognizable part (or feature) of the thing and make the corresponding judgment, rather than examining the whole picture from beginning to end. Based on this idea, the attention mechanism was created in the field of neural networks. The attention mechanism focuses on the most relevant part of the input to make a decision.
In learning graph representations, there is growing interest in incorporating graph structural knowledge into attention methods. The Graph Attention Network (GAT) [28] was the first to weight the summation of neighboring node features with an attention mechanism in which the neighbor weights depend entirely on the node features. Instead of explicitly recording structural linkages, particularly neighborhood ties, this family of approaches solely encodes feature interactions between nodes. There is evidence [29,30] that in most real-world networks, nodes tend to form relatively tightly connected groups whose internal link probability is higher than the average probability of a relationship established between two random nodes. We therefore include clustering coefficients in the attention mechanism, since they can more accurately assess the level of aggregation between node pairs. By altering the attention score with clustering coefficients, we further demonstrate the proposed model's strong expressive capability.
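To preview the idea, the sketch below biases pairwise attention logits with the local clustering coefficients of the target nodes. It is purely illustrative, assuming precomputed logits `e` and coefficients `C` and a hypothetical weighting factor `gamma`; the exact formulation used in this work is the one given in Section 3.

```python
import numpy as np

def cluster_aware_logits(e, C, gamma=1.0):
    """Illustrative sketch only (not the exact CA-GAT formulation):
    bias the raw attention logits e[i, j] with the local clustering
    coefficient of the target node j, so that nodes in denser
    neighborhoods receive more weight after softmax normalization.
    e: (N, N) raw attention logits, C: (N,) local clustering coefficients."""
    return e + gamma * C[None, :]
```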
This research addresses the crucial problem of encoding local knowledge into the attention process. The key innovation is the addition of local clustering coefficients, which directly indicate the level of cohesion between nodes and thereby capture the interactions between node pairs; this distinguishes our approach from the majority of current graph attention techniques. The structure of this paper is as follows.
Section 2 first presents earlier research and related work. The full description of our suggested strategy follows in Section 3. Our experimentation process and findings are presented in Section 4. Finally, in Section 5, we summarize the fundamental research and discuss problems and difficulties, such as the explainability and robustness of graph embeddings.
2. Preliminary
This section outlines the basics of graphs and attention mechanisms. We consider a given graph $G = (V, E)$, where $V$ denotes the set of nodes, $E$ is the set of observable edges, and $N = |V|$ is the number of nodes. The features of node $v_i \in V$ are denoted by $\mathbf{x}_i \in \mathbb{R}^{F}$. In a graph of $N$ nodes, the features of all nodes are stored in $\mathbf{X} \in \mathbb{R}^{N \times F}$.
2.1. Graph Attention
The Graph Convolutional Network performs well in node classification tasks and demonstrates how to combine the local graph structure with node attributes. However, because the GCN integrates neighboring node features in a manner intimately tied to the graph's structure, a trained model cannot generalize to different graph architectures. The Graph Attention Network therefore puts forward a technique for weighting the sum of the attributes of neighboring nodes. GAT assumes that the graph structure does not affect the weights of the neighboring node features, which are determined solely by the node features themselves. A collection of node features is provided as the input of GAT, $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^{F}$, where $F$ is the dimensionality of each node feature. The $(l+1)$-th layer node features generated by updating the $l$-th layer node features are technically described by the following equations:

$$e_{ij} = \mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j\right]\right),$$
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})},$$
$$\vec{h}_i^{\,(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\mathbf{W}\vec{h}_j^{\,(l)}\right),$$

where $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is the shareable weight matrix parameter, $\vec{a} \in \mathbb{R}^{2F'}$ is the shareable weight vector parameter, and $F'$ is the dimensionality of the hidden layer features.
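To make the update above concrete, the following is a minimal NumPy sketch of a single attention head over a dense adjacency matrix. The variable names follow the notation above; it is only an illustration of the mechanism, not the authors' reference implementation.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x >= 0, x, alpha * x)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gat_head(H, A, W, a):
    """One GAT attention head.
    H: (N, F) input node features, A: (N, N) adjacency with self-loops,
    W: (F, F') weight matrix, a: (2F',) attention vector."""
    Z = H @ W                                    # (N, F') transformed features
    Fp = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every node pair
    src = Z @ a[:Fp]                             # contribution of h_i
    dst = Z @ a[Fp:]                             # contribution of h_j
    e = leaky_relu(src[:, None] + dst[None, :])  # (N, N) raw attention scores
    e = np.where(A > 0, e, -1e9)                 # mask non-neighbors
    alpha = softmax(e, axis=1)                   # normalize over each node's neighborhood
    return alpha @ Z                             # (N, F') updated node features

# toy usage: 4 nodes, 3 input features, 2 hidden features
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
print(gat_head(H, A, W, a).shape)  # (4, 2)
```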
2.2. Local Clustering Coefficient
According to classical graph theory, the degree of node aggregation may be gauged using clustering coefficients, which are divided into global and local clustering coefficients. Since the attention mechanism focuses on local features, this paper introduces local clustering coefficients into the graph attention mechanism. The local clustering coefficient of a node quantifies the extent to which its neighboring nodes aggregate towards a complete graph. The first-order neighbors of a node $v$ together form its neighborhood $\mathcal{N}(v)$, and $k_v = |\mathcal{N}(v)|$ is the number of neighborhood nodes. The local clustering coefficient of a node $v$ is the ratio of the edges actually present between nodes in the neighborhood to all possible edges in the neighborhood. For a directed graph, the complete graph formed by the neighboring nodes of node $v$ has a total of $k_v(k_v - 1)$ edges. Therefore, the local clustering coefficient for a directed graph is expressed as follows:

$$C_v = \frac{\left|\{e_{uw} : u, w \in \mathcal{N}(v),\ e_{uw} \in E\}\right|}{k_v(k_v - 1)}.$$

The total number of edges in the complete undirected graph generated by node $v$'s neighbors is $k_v(k_v - 1)/2$. As a result, Equation (5) may be used to represent the undirected local clustering coefficient:

$$C_v = \frac{2\left|\{e_{uw} : u, w \in \mathcal{N}(v),\ e_{uw} \in E\}\right|}{k_v(k_v - 1)}. \quad (5)$$

The local clustering coefficient may also be written as

$$C_v = \frac{\lambda_G(v)}{\tau_G(v)},$$

where $\lambda_G(v)$ is the number of triangles in the undirected graph $G$ that have node $v$ as a vertex, and $\tau_G(v) = \binom{k_v}{2} = \frac{k_v(k_v - 1)}{2}$.
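As a quick illustration of Equation (5), the sketch below computes the undirected local clustering coefficients directly from a 0/1 adjacency matrix; it assumes a simple graph without self-loops and is meant only to mirror the definitions above.

```python
import numpy as np

def local_clustering_coefficient(A):
    """Undirected local clustering coefficients from a 0/1 adjacency matrix A (no self-loops)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                       # node degrees k_v
    # diag(A^3) counts 2 * (number of triangles through each node)
    triangles = np.diag(A @ A @ A) / 2.0    # lambda_G(v)
    possible = k * (k - 1) / 2.0            # tau_G(v) = k_v(k_v - 1)/2
    with np.errstate(divide="ignore", invalid="ignore"):
        C = np.where(possible > 0, triangles / possible, 0.0)
    return C

# toy example: a triangle (nodes 0, 1, 2) plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(local_clustering_coefficient(A))  # [1.0, 1.0, 0.333..., 0.0]
```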
4. Experiments
We report results for four classical benchmark tasks (Cora, CiteSeer, BlogCatalog, and PPI) from the literature [31,32,33]. We compare the proposed model with several strong baselines and earlier methods on these well-known node classification benchmarks, and in each case it matches or exceeds the state-of-the-art performance. This section also gives a short qualitative analysis of the feature representations retrieved by the suggested model and summarizes the experimental parameter settings and findings.
4.1. Datasets
Cora, which is made up of machine learning publications, is one of the datasets that graph deep learning has made extensive use of in recent years. The collection comprises 2708 academic publications categorized into seven groups: case-based, genetic algorithms, neural networks, probabilistic techniques, reinforcement learning, rule-based learning, and theory. Treating each article as a node in a graph, all papers form a linked graph with no isolated points, since each paper cites at least one other paper or is cited by another publication. Only 1433 distinct words were left in the corpus after stemming and the removal of stop words; words used fewer than ten times across all papers were also eliminated. As a result, each document is represented as a word vector with 1433 dimensions, i.e., 1433 features per node. Each element of the word vector represents a word and may only take one of two values: 0 indicates that the word does not occur in the document, while 1 indicates that it does.
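As a small illustration of this 0/1 encoding, the snippet below builds a binary word vector for a document over a hypothetical five-word vocabulary; both the vocabulary and the document are made up for the example.

```python
vocabulary = ["graph", "attention", "network", "protein", "image"]  # hypothetical 5-word dictionary
document = "attention network for graph classification"

# 0/1 word vector: 1 if the vocabulary word occurs in the document, 0 otherwise
word_vector = [1 if word in document.split() else 0 for word in vocabulary]
print(word_vector)  # [1, 1, 1, 0, 0]
```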
CiteSeer contains 3312 scientific publications classified into six categories: Agents, AI, DB, IR, ML, and HCI. The citation network has 4732 links in total. Each publication in the dataset is characterized by a 0/1 word vector denoting the absence or presence of the corresponding term in the dictionary, which consists of 3703 distinct terms.
BlogCatalog is a social network dataset made up of a graph of social relationships (such as friendships) and tags representing each blogger's interests. The dataset has 39 tag dimensions and 10,312 nodes with 333,983 edges. In BlogCatalog, a blogger specifies the other bloggers in their social network. A blogger's interests can be determined from the categories of the blogs they post, of which there are 60 in BlogCatalog; each blogger's blogs cover 1.6 categories on average. A blogger's profile vector is created by aggregating their blogs into categories, and this profile vector represents the set of attribute values for the blogger. Additionally, as indicated by the average node degree and the link density, the adjacency structure describing this dataset is very sparse.
A Protein–Protein Interaction (PPI) is said to occur when two proteins cooperate in a task or are engaged in a common biological process, and a PPI network can be used to describe the intricate connections between many proteins. The PPI dataset consists of 24 graphs, each corresponding to a different human tissue, with a total of 56,944 nodes and 818,716 edges and a feature length of 50 per node. Each graph has an average of 2371 nodes, with positional gene sets, motif gene sets, and immunological signatures as features. Up to 121 labels may be attached to a node (describing, for example, a protein's characteristics or location), and the labels are not one-hot encoded.
Table 1 provides an overview of the key properties of the datasets described above.
4.2. Benchmark Methods
MLP. In addition to the input and output layers, a multilayer perceptron (MLP), a classic form of Artificial Neural Network (ANN), may have several hidden layers. The hidden layer is fully connected to the input layer; assuming that a vector $\mathbf{X}$ represents the input layer, the output of the hidden layer is

$$\mathbf{H} = f(\mathbf{W}_1\mathbf{X} + \mathbf{b}_1),$$

where $\mathbf{W}_1$ is the weight, also known as the connection coefficient, $\mathbf{b}_1$ is the bias, and the function $f$ can be a commonly used sigmoid or tanh function. The following formulation describes the most typical three-layer MLP structural model:

$$\mathrm{MLP}(\mathbf{X}) = \mathrm{softmax}\left(\mathbf{W}_2\, f(\mathbf{W}_1\mathbf{X} + \mathbf{b}_1) + \mathbf{b}_2\right). \quad (11)$$

Therefore, all the parameters of the MLP are the connection weights and biases between the layers, which for Equation (11) are $\mathbf{W}_1$, $\mathbf{b}_1$, $\mathbf{W}_2$, and $\mathbf{b}_2$.
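A minimal NumPy sketch of the three-layer model in Equation (11), assuming a tanh hidden activation and a softmax output; the layer sizes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp_forward(X, W1, b1, W2, b2):
    """Three-layer MLP: input -> hidden (tanh) -> output (softmax), as in Equation (11)."""
    H = np.tanh(X @ W1 + b1)        # hidden layer, H = f(W1 X + b1)
    return softmax(H @ W2 + b2)     # class probabilities

# toy usage: 5 samples, 1433 input features (as in Cora), 16 hidden units, 7 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1433))
W1, b1 = rng.normal(size=(1433, 16)) * 0.01, np.zeros(16)
W2, b2 = rng.normal(size=(16, 7)) * 0.01, np.zeros(7)
print(mlp_forward(X, W1, b1, W2, b2).shape)  # (5, 7)
```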
GCN. A Graph Convolutional Network (GCN) is a neural network architecture that operates on graph data and is so powerful that even a randomly initialized two-layer GCN can generate a helpful feature representation of the nodes in the graph. The GCN takes two inputs. One input is a feature matrix $\mathbf{X}$ of dimension $N \times C$, where $C$ is the number of input features per node and $N$ is the number of nodes in the graph. The second input is the graph structure, given as an adjacency matrix $\mathbf{A}$ of size $N \times N$. Thus, Equations (12) and (13) may be used to represent the hidden layers of the GCN:

$$\mathbf{H}^{(1)} = \mathrm{ReLU}\left(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}^{(0)}\right), \quad (12)$$
$$\mathbf{Z} = \mathrm{softmax}\left(\hat{\mathbf{A}}\mathbf{H}^{(1)}\mathbf{W}^{(1)}\right), \quad (13)$$

with $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$ is the adjacency matrix with added self-loops, $\mathbf{I}_N$ is the identity matrix, and $\tilde{\mathbf{D}}$ is a diagonal matrix whose main diagonal elements are the degrees of each node, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
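To make the symmetric normalization and the two-layer propagation concrete, here is a minimal NumPy sketch of Equations (12) and (13); the adjacency matrix, feature matrix, and weights are illustrative.

```python
import numpy as np

def normalize_adjacency(A):
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2} as used by the GCN."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)                     # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_layer_gcn(A, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1), Equations (12) and (13)."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W0, 0)          # hidden layer with ReLU
    return softmax(A_hat @ H1 @ W1)             # per-node class probabilities

# toy usage: 4 nodes, 3 input features, 16 hidden units, 2 classes
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
print(two_layer_gcn(A, X, rng.normal(size=(3, 16)), rng.normal(size=(16, 2))).shape)  # (4, 2)
```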
GAT. The graph attention layer, described in Section 2.1, is the main component of the graph attention network, and it operates on every node in the whole graph. For instance, for a graph with $N$ nodes, each having feature dimension $F$, the graph attention layer produces outputs of feature dimension $F'$, the feature dimension of the output layer. The input data are represented by $\mathbf{h} = \{\vec{h}_1, \ldots, \vec{h}_N\}$, where each $\vec{h}_i$ is a vector of size $F$. The output is represented as $\mathbf{h}' = \{\vec{h}'_1, \ldots, \vec{h}'_N\}$, where each vector $\vec{h}'_i$ has size $F'$. As the formal representation was covered in depth in Section 2.1, it is not repeated here.
4.3. Experimental Setup
We employ two variants of the cluster-aware attention network in this article: a single-head and a multi-head CA-GAT. Most studies of multi-head attention use 6 to 16 heads, and Petar et al., in their work [28], note that the standard multi-head attention model generally chooses 8 heads. Following their work, we also chose 8 attention heads, each of which has a different set of parameters but the same structure. The model mainly uses regularization to address the small training set size, and we applied L2 regularization during training. In addition, after several rounds of cross-validation, an implicit node dropout rate of 0.6 worked best; dropout of 0.6 was therefore applied to the normalized attention coefficients and to the inputs of the three-layer CA-GAT. This configuration means that each node is exposed to a stochastically sampled neighborhood during each training cycle. To avoid vanishing gradients, we choose LeakyReLU as the activation function. LeakyReLU is a modified ReLU, a parameterized nonlinear activation function that can be expressed as

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0, \end{cases}$$

where $\alpha$ generally takes values in the range 0.01–0.2; $\alpha$ is chosen to be 0.2 in this work. Both models are trained as a three-layer CA-GAT using the Adam optimizer to minimize the cross-entropy on the training nodes, whose parameter update is

$$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t, \quad (15)$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first- and second-moment estimates of the gradient, the initial learning rate $\eta$ is shared across all datasets, and the small constant $\epsilon$ keeps the denominator of Equation (15) from being zero.
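As a rough, hedged sketch of this optimization setup (LeakyReLU with α = 0.2, dropout 0.6, Adam minimizing cross-entropy on the training nodes), the PyTorch snippet below uses a small stand-in model rather than the actual CA-GAT layers; the learning-rate and weight-decay values are placeholders, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in model: the real CA-GAT layers are described in Sections 2.1 and 3;
# this tiny MLP only illustrates the activation, dropout, and optimizer choices.
model = nn.Sequential(
    nn.Dropout(p=0.6),                  # dropout on layer inputs
    nn.Linear(1433, 64),
    nn.LeakyReLU(negative_slope=0.2),   # alpha = 0.2
    nn.Dropout(p=0.6),
    nn.Linear(64, 7),
)
# lr and weight_decay are illustrative placeholder values
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-4)

features = torch.randn(140, 1433)       # e.g., a Cora-sized training split
labels = torch.randint(0, 7, (140,))

for epoch in range(5):                  # the paper trains for 1000 rounds
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features), labels)  # cross-entropy on training nodes
    loss.backward()
    optimizer.step()
```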
4.4. Results and Explainability
The outcomes of our comparison studies after 1000 training rounds are shown in Table 2 and Table 3. In this study, we employ the average classification accuracy (including standard deviation) as the assessment metric for the transductive tasks and the micro-averaged F1-score for the inductive task. Specifically, we execute the proposed single-head technique 100 times on the Cora, CiteSeer, and BlogCatalog datasets and, after deleting the highest and lowest accuracies, obtain the average accuracy of the classification results; removing one of the highest and one of the lowest scores omits possible outliers. The average accuracy of the classification results was then calculated from 100 further runs of the multi-head approach on these datasets. For the PPI dataset, the micro-averaged F1-score on the test graph nodes was averaged over ten runs.
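A small sketch of the averaging procedure just described, assuming a list of per-run accuracies (the values below are hypothetical); one maximum and one minimum are dropped before averaging.

```python
def trimmed_mean_accuracy(accuracies):
    """Average accuracy after removing one highest and one lowest run."""
    scores = sorted(accuracies)
    trimmed = scores[1:-1]                      # drop the single min and max as outliers
    return sum(trimmed) / len(trimmed)

runs = [0.831, 0.845, 0.852, 0.848, 0.799]      # hypothetical per-run accuracies
print(round(trimmed_mean_accuracy(runs), 4))    # 0.8413
```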
The baseline approaches Graph-MLP [34], GCN [27], and GAT [28] are used as comparisons to demonstrate the advantages of our suggested methodology. Graph-MLP is a neural network containing three hidden layers: the first hidden layer captures the input node features, the second combines the neighborhood information and calculates the NContrast loss, and the last performs node classification. Each hidden layer has a feature dimension of 256, a dropout rate of 0.6, and a learning rate of 0.01. GCN was the first proposal to apply convolutional operations to graphs, modeling the learning of local structure and node features through an approximate analysis in the graph frequency domain. GCN contains two hidden layers, and its hyperparameters are chosen as follows: a learning rate of 0.01, a dropout rate of 0.5 (first and last layer), L2 regularization of $5 \times 10^{-4}$ (first layer), and 16 units per hidden layer. The GAT proposed by Petar et al. uses the attention mechanism. For the Cora and CiteSeer datasets, GAT consists of two hidden layers: the first hidden layer consists of 8 attention heads, each computing 8 features, and the second hidden layer is used for classification; dropout of 0.6 is applied to the inputs of both layers, and L2 regularization with $\lambda = 0.0005$ is applied during training. For the BlogCatalog and PPI datasets, GAT consists of three hidden layers: each of the first two layers consists of 4 attention heads computing 256 features, and the last layer performs (multi-label) classification with 6 attention heads computing 121 features per attention head. The learning rate for both models is 0.005. In line with our expectations, our findings show that state-of-the-art performance is attained or matched on all four datasets. Taking the Cora dataset as an example, our technique outperforms Graph-MLP, GCN, and GAT, respectively (the exact margins are reported in Table 2). Notably, the strong results on BlogCatalog imply that the study's central premise holds: individual nodes tend to organize into reasonably dense clusters in network architectures that mimic the real world when interacting with one another. Particularly in social networks, the representation of nodes in the network topology depends more on the aggregation coefficient between nodes than it does in networks created by chance connections between pairs of nodes.
Additionally, to more fairly evaluate the advantage of incorporating clustering into the attention mechanism, we test both LeakyReLU and ELU activations in each of the three scenarios. ELU combines properties of the sigmoid and ReLU functions, with soft saturation on the left side and no saturation on the right side. While the soft saturation on the left makes ELU more resistant to input variations and noise, the linear portion on the right helps ELU counteract gradient vanishing.
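For reference, the standard ELU mentioned here can be written as follows, where $\alpha > 0$ is a fixed hyperparameter:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \leq 0. \end{cases}$$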
5. Conclusions
The study of graphs has become crucial in areas of artificial intelligence such as knowledge graphs and social network analysis. In this article, we provide a brand-new graph attention model that uses aggregation coefficients to take into account the degree of aggregation around the central node in the network structure. By capturing local clustering coefficients and the interactions between nodes, we address the crucial problem of including neighborhood information in the attention mechanism, which distinguishes our approach from the majority of current graph attention techniques. The model put forward in this research is also computationally efficient: unlike some earlier graph neural networks, it can operate on all edges and nodes in parallel and does not require eigendecomposition or other dense matrix computations. Our suggested model was tested on four datasets and met or exceeded the most recent performance standards on well-known node classification benchmarks in both transductive and inductive settings. As assessment measures, we use the average classification accuracy for the transductive tasks and the micro-averaged F1-score for the inductive task.
To apply graph processing algorithms in a broader and more industrially relevant range of scenarios, future work may focus on improving and extending attention models that incorporate more aggregation-related edge features, for instance by applying improved, aggregation-aware attention networks to hypergraphs and dynamic graphs. Another fascinating avenue of inquiry is to use mathematical analysis to study in more depth the interpretability and adversarial resilience that attention mechanisms lend to models; future graph signal processing methods and graph deep learning models must offer both interpretability and adversarial resilience. Additionally, this study offers a fresh viewpoint on how graph data science may advance in the future as a bridge between graph signal processing and graph deep learning.