4.1 What Is Convolution?
We define convolution generally as an operation whereby an output is derived from two given inputs by integration or summation, expressing how one input is modified by the other.
Convolution in CNNs involves two matrix inputs: one is the previous layer of activations, and the other is a matrix
\(W \times H\) of learned weights, which is “slid” across the activation matrix, aggregating each
\(W \times H\) region using a simple linear combination (see Figure
7(a)). In the spatial graph domain, this type of convolution is not immediately well defined [
78]; the convolution of a rigid matrix of learned weights must occur on an equally rigid structure of activations. How, then, do we reconcile convolution with unstructured inputs such as graphs?
Note that at no point during our general definition of convolution was the structure of the given inputs alluded to. In fact, convolutional operations can be applied to continuous functions (e.g., audio recordings and other signals), N-dimensional discrete tensors (e.g., semantic vectors in 1D, and images in 2D), and so on. During convolution, one input is typically interpreted as a filter (or kernel) being applied to the other input, and we will adopt this language throughout this section. Specific filters can be utilized to perform specific tasks: In the case of audio recordings, high pass filters can be used to filter out low-frequency signals, and in the case of images, certain filters can be used to increase contrast, sharpen, or blur images. In our previous example of CNNs, filters are learned rather than designed.
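To make the sliding-window picture concrete before moving to graphs, the following minimal NumPy sketch slides a filter across a 2D input and aggregates each region with a linear combination. The array sizes, the hand-designed sharpening kernel, and the function name conv2d are illustrative assumptions, not constructs from any referenced work.

```python
import numpy as np

def conv2d(activations, kernel):
    """Slide `kernel` across `activations`, aggregating each region with a
    linear combination (sum of element-wise products), as in a CNN layer."""
    kh, kw = kernel.shape
    out_h = activations.shape[0] - kh + 1
    out_w = activations.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(activations[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)              # toy "previous layer" of activations
sharpen = np.array([[0, -1, 0],           # a hand-designed filter; CNNs learn theirs
                    [-1, 5, -1],
                    [0, -1, 0]])
print(conv2d(image, sharpen).shape)       # (6, 6)
```

Note that this rigid indexing of a \(W \times H\) region is exactly what has no direct analogue on a graph, which motivates the approaches below.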
4.2 Spatial Approaches
One might consider the early RGNNs described in Section
3 as using convolutional operations. In fact, these methods meet the criteria of locality, scalability, and interpretability. First, Equation (
1) only operates over the neighborhood of the central vertex
\(v_i\), and can be applied on any neighborhood in the graph due to its invariance to permutation and neighborhood size. Second, the NN
\(f\) is dependent on a fixed number of weights and has a fixed input and output size that is independent of
\(\vert {\mbox{V}}\vert\) . Finally, the convolution operation is immediately interpretable as a generalization of image-based convolution: In image-based convolution, neighboring pixel values are aggregated to produce embeddings; in graph-based spatial convolution, neighboring vertex features are aggregated to produce embeddings (see Figure
7). This type of graph convolution is referred to as the
spatial graph convolutional operation, since spatial connectivity is used to retrieve the neighborhoods in this process.
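As an illustration of the locality and invariance properties just described, the sketch below aggregates each vertex's neighborhood with a mean (one permutation- and size-invariant choice) and applies a small learned transform. The adjacency-list representation, feature dimensions, and the name spatial_conv are illustrative assumptions rather than the exact form of Equation (1).

```python
import numpy as np

def spatial_conv(features, neighbors, W, b):
    """One spatial graph convolution: for each vertex, aggregate the features
    of its neighbors (the mean is permutation- and size-invariant) and pass
    the result through a small learned transform with a non-linearity."""
    embeddings = []
    for i in range(len(features)):
        agg = np.mean(features[list(neighbors[i])], axis=0)   # local aggregation
        embeddings.append(np.tanh(W @ agg + b))               # learned weights + activation
    return np.stack(embeddings)

X = np.random.rand(5, 4)                          # 5 vertices, 4 features each
nbrs = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
W, b = np.random.rand(8, 4), np.zeros(8)          # parameter count independent of |V|
print(spatial_conv(X, nbrs, W, b).shape)          # (5, 8)
```

Because the parameters W and b act only on the aggregated neighborhood, their number does not grow with \(\vert {\mbox{V}}\vert\).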
Although the RGNN technique meets the definition of spatial convolution, there are numerous improvements in the literature. For example, the choice of aggregation function is not trivial—different aggregation functions can have notable effects on performance and computational cost.
A notable framework that investigated aggregator selection is the
GraphSAGE framework [
29], which demonstrated that learned aggregators can outperform simpler aggregation functions (such as taking the mean of embeddings) and thus can create more discriminative, powerful vertex embeddings. Regardless of the aggregation function, GraphSAGE works by computing embeddings based on the central vertex and an aggregation of its neighborhood (see Equation (
2)). By including the central vertex's own features, GraphSAGE ensures that vertices with near-identical neighborhoods can still receive distinct embeddings. GraphSAGE has since been outperformed on accepted benchmarks [
20] by other frameworks [
6], but the framework is still competitive and can be used to explore the concept of learned aggregators (see Section
4.3).
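The sketch below illustrates the GraphSAGE-style update of Equation (2) under simplifying assumptions: the aggregator is swappable (a mean or an element-wise max "pool"), but the learned transform inside the original pool aggregator and the embedding normalization are omitted for brevity, and the toy graph and shapes are illustrative.

```python
import numpy as np

def sage_layer(features, neighbors, W, aggregator="mean"):
    """GraphSAGE-style layer (a sketch of Equation (2)): aggregate the
    neighborhood, concatenate with the central vertex's own features,
    then apply a learned transform. The aggregator is swappable."""
    out = []
    for i in range(len(features)):
        nbr_feats = features[list(neighbors[i])]
        if aggregator == "mean":
            agg = nbr_feats.mean(axis=0)
        elif aggregator == "pool":                 # element-wise max "pooling" aggregator
            agg = np.maximum.reduce(nbr_feats)
        h = np.concatenate([features[i], agg])     # central vertex kept distinct
        out.append(np.maximum(W @ h, 0))           # ReLU
    return np.stack(out)

X = np.random.rand(5, 4)
nbrs = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
W = np.random.rand(8, 8)                           # input is 4 + 4 after concatenation
print(sage_layer(X, nbrs, W, aggregator="pool").shape)   # (5, 8)
```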
Alternatively,
Message Passing Neural Networks (MPNNs) compute
directional messages between vertices with a message function that is dependent on the source vertex, the destination vertex, and the edge connecting them [
25]. Rather than aggregating the neighbors' features and concatenating them with the central vertex's features as in GraphSAGE, MPNNs sum the incoming messages and pass the result to a readout function alongside the central vertex's features (see Equation (
3)). Both the message function and readout function can be implemented with simple NNs in practice. This generalizes the concepts outlined in Equation (
1) and allows for more meaningful patterns to be identified by the learned functions.
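A minimal sketch of this message-passing idea follows; the message and readout "NNs" are reduced to single linear transforms with a tanh activation, and the toy graph, edge features, and names are assumptions for illustration only.

```python
import numpy as np

def mpnn_layer(features, edges, edge_feats, msg_W, readout_W):
    """MPNN-style layer (a sketch of Equation (3)): each directed edge
    (src -> dst) produces a message from [h_src, h_dst, e_feat]; incoming
    messages are summed and passed to a readout with the central vertex."""
    incoming = np.zeros((len(features), msg_W.shape[0]))
    for (src, dst), e in zip(edges, edge_feats):
        m = np.tanh(msg_W @ np.concatenate([features[src], features[dst], e]))
        incoming[dst] += m                          # sum of incoming messages
    out = []
    for i in range(len(features)):
        out.append(np.tanh(readout_W @ np.concatenate([features[i], incoming[i]])))
    return np.stack(out)

X = np.random.rand(4, 3)                            # 4 vertices, 3 features each
edges = [(0, 1), (1, 0), (1, 2), (2, 3)]            # directed edges
E = np.random.rand(len(edges), 2)                   # 2 edge features per edge
msg_W = np.random.rand(6, 3 + 3 + 2)                # simple linear "message NN"
readout_W = np.random.rand(5, 3 + 6)                # simple linear "readout NN"
print(mpnn_layer(X, edges, E, msg_W, readout_W).shape)   # (4, 5)
```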
One of the most popular spatial convolution methods is
Graph Convolutional Networks (GCNs), which produce embeddings by summing features extracted from each neighboring vertex and then applying non-linearity [
97]. These methods are highly scalable and local, and they can be “stacked” to produce the layers of a CGNN. Each extracted feature is normalized according to the neighborhood sizes (degrees) of both the central and the neighboring vertex, thus ensuring that embeddings do not “explode” in scale during the forward pass.
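The sketch below shows one GCN-style propagation step under the usual formulation (self-loops plus symmetric degree normalization); the toy adjacency matrix and weight shapes are assumptions.

```python
import numpy as np

def gcn_layer(A, X, W):
    """GCN-style layer: add self-loops, normalize by the square roots of both
    vertices' degrees so embeddings do not "explode", then transform and activate."""
    A_hat = A + np.eye(A.shape[0])                  # include the central vertex
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt        # D^-1/2 (A + I) D^-1/2
    return np.maximum(A_norm @ X @ W, 0)            # sum over neighbors, then ReLU

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 8)
print(gcn_layer(A, X, W).shape)                     # (4, 8)
```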
Graph Attention Networks (GATs) extend GCNs: Instead of using the size of the neighborhoods to weight the importance of
\(v_i\) to \(v_j\), they implicitly calculate this weighting based on the normalized output of an attention mechanism [
87]. In this case, the attention mechanism is dependent on the embeddings of two vertices and the edge between them. Vertices are constrained to only be able to attend to neighboring vertices, thus localizing the filters. GATs are stabilized during training using multi-head attention and regularization and are considered less general than MPNNs [
88]. Although GATs limit the attention mechanism to the direct neighborhood, scalability to large graphs is not guaranteed, as attention mechanisms have computational complexity that grows quadratically with the number of vertices being considered.
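A single-head sketch of graph attention is given below; the LeakyReLU scoring function and concatenation-based attention vector follow the common GAT formulation, but multi-head attention and regularization are omitted, and the toy graph and shapes are assumptions.

```python
import numpy as np

def gat_layer(A, X, W, a, alpha=0.2):
    """Single-head GAT-style layer: attention scores are computed only between
    neighboring vertices, normalized with a softmax, and used to weight the
    neighbors' transformed features."""
    H = X @ W                                       # transformed features, |V| x d
    A_hat = A + np.eye(A.shape[0])                  # vertices also attend to themselves
    out = np.zeros_like(H)
    for i in range(A.shape[0]):
        nbrs = np.where(A_hat[i] > 0)[0]
        scores = np.array([a @ np.concatenate([H[i], H[j]]) for j in nbrs])
        scores = np.where(scores > 0, scores, alpha * scores)   # LeakyReLU
        weights = np.exp(scores) / np.exp(scores).sum()         # softmax over the neighborhood
        out[i] = (weights[:, None] * H[nbrs]).sum(axis=0)
    return out

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 8)
a = np.random.rand(16)                              # attention vector over [h_i || h_j]
print(gat_layer(A, X, W, a).shape)                  # (4, 8)
```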
Interestingly, all of these approaches consider information from the direct neighborhood and the previous embeddings, aggregate this information in some symmetric fashion, apply learned weights to calculate more complex features, and “activate” these results in some way to produce an embedding that captures non-linear relationships.
Results and Discussion
The mean, pool, and LSTM aggregators score test accuracies of
\(\mathbf {66.0\%}\) ,
\(\mathbf {74.4\%}\) , and
\(\mathbf {68.3\%}\) , respectively. As expected, the learned pool and LSTM aggregators are more effective than the simple mean operation, though they incur significant training overheads and may not be suitable for smaller training graphs or graph datasets. Indeed, in the original GraphSAGE paper [
29], it was found that the LSTM and pool methods generally outperformed the mean and GCN aggregation methods across a range of datasets.
At the time of publication, GraphSAGE outperformed the state-of-the-art on a variety of graph-based tasks on common benchmark datasets. Since that time, a number of inductive learning variants of GraphSAGE have been developed, and their performance on benchmark datasets is regularly updated.
The Laplacian is a second-order differential operator that is calculated as the divergence of the gradient of a function in Euclidean space. The Laplacian occurs naturally in equations that model physical interactions, including but not limited to electromagnetism, heat diffusion, celestial mechanics, and pixel interactions in computer vision applications. Similarly, it arises naturally in the graph domain, where we are interested in the “diffusion of information” throughout a graph structure.
More formally, if we define flux as the quantity passing outward through a surface, then the Laplacian represents the density of the flux
of the gradient flow of a given function. A step-by-step visualization of the Laplacian’s calculation is provided in Figure
9. Note that the definition of the Laplacian is dependent on three things:
functions, the
gradient of a function, and the
divergence of the gradient. Since we are seeking to define the Laplacian in the graph domain, we need to define how these constructs operate in the graph domain.
Functions in the graph domain (referred to as graph signals in GSP) are a mapping from every vertex in a graph to a scalar value:
\(f(G(V, E)): {\mbox{V}} \mapsto \mathbb{R}\). Multiple graph functions can be defined over a given graph, and we can interpret a single graph function as a single feature vector defined over the vertices in a graph. See Figure
10 for an example of a graph with two graph functions.
The
gradient of a function in the graph domain describes the direction and rate of fastest increase of a graph signal. In a graph structure, when we refer to “direction,” we are referring to the edges of the graph: the avenues by which a graph function can change. For example, in Figure
10, the graph functions are 8-dimensional vectors (defined over the vertices), but the gradients of the functions for this graph are 12-dimensional vectors (defined over the edges) and are calculated as in Equations (7). Refer to Table
3 for a formal definition of the incidence matrix
M.
In Equations (7), the gradient vectors describe the difference in graph function value
across the vertices, that is,
along the edges. Specifically, note that the largest magnitude value is 20,866 and corresponds to
\(e_{12}\), the edge between Hobart and Melbourne in Figure
10. In other words, the greatest magnitude edge is between the city with the least cases and the city with the most cases. Similarly, the lowest magnitude edge is
\(e_{2}\), the edge between Perth and Adelaide, which has the least difference in cases.
The divergence of a gradient function in the graph domain describes the outward flux of the gradient function at every vertex. To continue with our example, we could interpret the divergence of the gradient function \(f_{cases}\) as the outgoing “flow” of infectious disease cases from each capital city. Whereas the gradient function was defined over the graph’s edges, the divergence of a gradient function is defined over the graph’s vertices and is calculated as in Equation (8).
The maximum value in the divergence vector for the infectious disease graph signal is 134,671, corresponding to Melbourne (the 7th vertex). Again, this can be interpreted as the magnitude of the “source” of infectious disease cases from Melbourne. In contrast, the minimum value is \(-281{,}123\), corresponding to Canberra, the largest “sink” of infectious disease cases.
Note as well that the dimensionality of the original graph function is 8 (corresponding to the vertex space), its gradient's dimensionality is 12 (corresponding to the edge space), and the dimensionality of the result of applying the Laplacian is again 8 (corresponding to the vertex space). This mimics the calculation of the Laplacian in Figure
9, where the original scalar field (representing the magnitude at each point) is converted to a vector field (representing direction) and then back to a scalar field (representing how each point acts as a source).
The
graph Laplacian appears naturally in these calculations as a
\(\vert {\mbox{V}}\vert \times \vert {\mbox{V}}\vert\) matrix operator in the form
\({\bf L} ={\bf M} {\bf M} ^T\) (see Equation (8)). This corresponds to the formulation provided in Table
3, as shown in Equation (9), and this formulation is referred to as the combinatorial definition
\({\bf L} = {\bf D}- {\bf A_{W}}\) (the normalized definition is denoted \({\bf L}^{sn}\) [
17]). The graph Laplacian is pervasive in the field of GSP [
14].
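To ground these definitions, the following sketch builds an incidence matrix for a small toy graph (not the Australian-cities graph of Figure 10), computes the gradient and divergence of a graph function, and verifies that the resulting operator equals \({\bf L} = {\bf M}{\bf M}^T = {\bf D} - {\bf A}\); the edge orientation and signal values are arbitrary assumptions.

```python
import numpy as np

# Toy undirected graph: 4 vertices, 4 edges. An arbitrary orientation is chosen
# for each edge so that the incidence matrix M can be written down.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4

M = np.zeros((n, len(edges)))                     # |V| x |E| incidence matrix
A = np.zeros((n, n))                              # adjacency matrix (unweighted)
for e, (src, dst) in enumerate(edges):
    M[src, e], M[dst, e] = 1.0, -1.0
    A[src, dst] = A[dst, src] = 1.0

f = np.array([10.0, 4.0, 7.0, 1.0])               # a graph function/signal over the vertices

gradient = M.T @ f                                # |E|-dimensional: differences along each edge
divergence = M @ gradient                         # back to |V| dimensions: net outward flux

L = M @ M.T                                       # graph Laplacian via L = M M^T
D = np.diag(A.sum(axis=1))                        # degree matrix
print(np.allclose(L, D - A))                      # True: combinatorial definition L = D - A
print(np.allclose(divergence, L @ f))             # True: applying L gives divergence of gradient
```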
Since
\({\bf L} = {\bf D}- {\bf A_{W}}\) , the graph Laplacian must be a real (
\({\bf L}_{ij} \in \mathbb{R}, \; \forall \; 0 \le i,j \lt \vert {\mbox{V}}\vert\)) and symmetric (
\({\bf L} = {\bf L} ^T\) ) matrix. As such, it will have an
eigensystem composed of a set of
\(\vert {\mbox{V}}\vert\) orthonormal eigenvectors, each associated with a single real eigenvalue [
78]. We denote the
ith eigenvector with
\(u_i\) , and the associated eigenvalue with
\(\lambda _i\) , each satisfying
\({\bf L} u_i = \lambda _i u_i\) , where the eigenvectors
\(u_i\) are the
\(\vert {\mbox{V}}\vert\) -dimensional columns in the matrix (Fourier basis)
U. The Laplacian can be factored into three matrices such that
\({\bf L} = {\bf U} \mathbf {\Lambda } {\bf U} ^T\) through a process known as
eigendecomposition. A variety of algorithms exist for solving this kind of eigendecomposition problem (e.g., the QR algorithm and Singular Value Decomposition).
These eigenvectors form a basis in \(\mathbb{R}^{\vert {\mbox{V}}\vert}\), and as such, we can express any discrete graph function as a linear combination of these eigenvectors. We define the graph Fourier transform of any graph function/signal as \(\hat{f} = {\bf U}^T f \in \mathbb{R}^{\vert {\mbox{V}}\vert}\), and its inverse as \(f = {\bf U} \hat{f} \in \mathbb{R}^{\vert {\mbox{V}}\vert}\). To complete our goal of performing convolution in the spectral domain, we now complete the following steps (a code sketch follows the list):
(1)
Convert the lth graph function into the frequency space (i.e., generate its graph Fourier transform). We do this through matrix multiplication with the transpose of the Fourier basis: \({\bf U} ^T f_l\) . Note that multiplication with the eigenvector matrix is \(O(N^2)\) .
(2)
Apply the corresponding lth learned filter in frequency space. If we define \(\Theta _l\) as our lth learned filter (and a function of the eigenvalues of L), then this appears like so: \(\Theta _l {\bf U} ^T f_l\) .
(3)
Convert the result back to vertex space by multiplying it with the Fourier basis matrix. This completes the formulation defined in Equation (
10). By the convolution theorem, multiplication applied in the frequency space corresponds to convolution in vertex space, so the filter has been
convolved against the graph function [
52].
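A minimal end-to-end sketch of these three steps follows; the hand-designed low-pass filter stands in for a learned filter \(\Theta_l\), and the toy Laplacian and signal are assumptions.

```python
import numpy as np

def spectral_filter(L, f, theta):
    """Spectral graph convolution (a sketch of Equation (10)): transform the
    graph signal into frequency space with the Fourier basis U, scale each
    frequency component with a filter on the eigenvalues, and transform back."""
    eigvals, U = np.linalg.eigh(L)                 # L = U diag(eigvals) U^T (L is real symmetric)
    f_hat = U.T @ f                                # (1) graph Fourier transform
    filtered = theta(eigvals) * f_hat              # (2) filter applied in frequency space
    return U @ filtered                            # (3) inverse transform back to vertex space

# Laplacian of a small 4-vertex graph and a signal over its vertices.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
f = np.array([10.0, 4.0, 7.0, 1.0])

low_pass = lambda lam: np.exp(-0.5 * lam)          # a hand-designed (not learned) smooth filter
print(spectral_filter(L, f, low_pass))
```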
This simple formulation has a number of downsides. Foremost, the approach is
not localized—it has
global support—meaning that the filter is applied to all vertices (i.e., the entirety of the graph function). This means that useful filters are not shared, and that the locality of graph structures is not being exploited. Second, it is
not scalable: The number of learnable parameters grows with the size of the graph (not the scale of the filter) [
17], the
\(O(N^2)\) cost of matrix multiplication scales poorly to large graphs, and the
\(O(N^3)\) time complexity of QR-based eigendecomposition [
81] is prohibitive on large graphs. Moreover, directly computing this transform requires the diagonalization of the Laplacian and is
infeasible for large graphs (where
\(\vert {\mbox{V}}\vert\) exceeds a few thousand vertices) [
31]. Finally, since the structure of the graph dictates the values of the Laplacian,
graphs with dynamic topologies cannot use this method of convolution.
To alleviate the
locality issue, Reference [
21] noted that the smoothing of filters in the frequency space would result in localization in the vertex space. Instead of learning the filter directly, they formed the filter as a combination of smooth polynomial functions and learned the coefficients of these polynomials. Since the Laplacian is a local operator affecting only the direct neighbors of any given vertex, a polynomial of degree
\(r\) affects vertices
\(r\) -hops away. By approximating the spectral filter in this way (instead of directly learning it), spatial localization is thus guaranteed [
77]. Furthermore, this improved
scalability; learning
\(K\) coefficients of the predefined smooth polynomial functions meant that the number of learnable parameters was no longer dependent on the size of the input graph. Additionally, the learned model could be applied to other graphs, unlike directly learned spectral filter coefficients, which are basis-dependent. Since then, multiple polynomial families have been used for specialized effects (e.g., Chebyshev polynomials, Cayley polynomials [
51]).
Equation (
11) outlines this approach. The learnable parameters are
\(\theta _k\) —vectors of Chebyshev polynomial coefficients—and
\(T_k(\widetilde{\mathbf {\Lambda }})\) is the Chebyshev polynomial of order
\(k\) (dependent on the normalized diagonal matrix of scaled eigenvalues
\(\widetilde{\mathbf {\Lambda }}\) ). Chebyshev polynomials can be computed recursively with a stable recurrence relation and form an orthogonal basis [
83]. We recommend Reference [
65] for a full treatment on Chebyshev polynomials.
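The sketch below applies a Chebyshev polynomial filter of order \(K-1\) directly in the vertex domain using the recurrence \(T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)\), so no eigendecomposition of the Laplacian is required; the fixed coefficients theta stand in for learned parameters, and the toy graph is an assumption.

```python
import numpy as np

def chebyshev_filter(L, f, theta):
    """ChebNet-style filtering (a sketch of Equation (11)): evaluate Chebyshev
    polynomials of the rescaled Laplacian against the signal via the recurrence
    T_k = 2x T_{k-1} - T_{k-2}, avoiding an explicit eigendecomposition."""
    lam_max = np.linalg.eigvalsh(L).max()          # in practice this is often approximated
    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])   # rescale eigenvalues into [-1, 1]
    T_prev, T_curr = f, L_tilde @ f                # T_0 f and T_1 f
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev # Chebyshev recurrence applied to the signal
        out += theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
f = np.array([10.0, 4.0, 7.0, 1.0])
theta = np.array([0.5, 0.3, 0.1])                  # K = 3 coefficients (fixed here, learned in practice)
print(chebyshev_filter(L, f, theta))
```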
Interestingly, these approximate approaches demonstrate an equivalence between spatial and spectral techniques. Both are
spatially localized and allow for a single filter to extract repeating patterns of interest throughout a graph, both have a number of learnable parameters that is independent of the input graph size, and each has meaningful and intuitive interpretations from a spatial (Figure
7) and spectral (Figure
10) perspective. In fact, GCNs can be viewed as a first-order approximation of Chebyshev polynomial filters [
56]. For an in-depth treatment on the topic of GSP, we recommend References [
82] and [
78].