Accelerating Graph Neural Networks with a Novel Matrix Compression Format

João N. F. Alves^1,2,3,4 Samir Moustafa^1,2 Siegfried Benkner¹
Alexandre P. Francisco^3,4 Wilfried N. Gansterer¹ & Luís M. S. Russo^3,4

Abstract

The inference and training stages of Graph Neural Networks (GNNs) are often dominated by the time required to compute a long sequence of matrix multiplications between the sparse graph adjacency matrix and its embedding. To accelerate these stages, we first propose the Compressed Binary Matrix (CBM) storage format to succinctly represent the binary adjacency matrix of an unweighted graph. Then, we show how to generalize this representation to normalized adjacency matrices of unweighted graphs which arise in the context of GNNs. Finally, we develop efficient matrix multiplication kernels based on this compressed representation. The matrix multiplication kernels proposed in this work never require more scalar operations than classic sparse matrix multiplication algorithms. Experimental evaluation shows that the matrix multiplication strategies proposed outperform the current state-of-the-art implementations provided by Intel MKL, achieving speedups close to 5 $\times$ . Furthermore, our optimized matrix-multiplication strategies accelerated the inference time of a GNN by up to $3\times$ .

1 Introduction

Graph Neural Networks (GNNs) are the preferred tool to learn from graph-structured data and thus are considered key for future AI applications in domains like social network analysis, natural language processing, biology, physics, and many others [1]. The training and inference time of different GNN architectures is dominated by a long sequence of matrix products. This is particularly evident in GNNs that resort to Message Passing Layers (MPLs), where in each hidden layer the nodes of the graph aggregate the embedding of neighboring nodes and adjust their own embedding based on the information collected. In some variants of GNNs, such as the widely used Graph Convolutional Networks (GCNs) [2], the message produced in each layer is essentially the product of the adjacency matrix of the underlying graph and its current embedding.

For illustration purposes, consider a two-layer GCN. To propagate and combine node features across the graph, this network must compute the following operations once per inference and once per training epoch [3]:

\hat{\mathbf{A}}\ \sigma(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}^{0})\mathbf{W}^{% 1},

(1)

where $\hat{\mathbf{A}}$ represents the normalized adjacency matrix of the graph, such that $\hat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-% \frac{1}{2}}$ , $\mathbf{D}$ is the degree diagonal matrix of the graph, $\sigma$ denotes an element-wise activation function, $\mathbf{X}$ is the matrix of node features, and $\mathbf{W}^{0}$ and $\mathbf{W}^{1}$ are learnable dense matrices for the first and second layers [4]. It is important to note that in real scenarios the adjacency matrix of the graph is typically much larger than the remaining operand matrices of Equation 1. Therefore, matrix products involving this matrix represent most of the computational burden of training and inferring GNNs. Graphs arising in the context of GNNs are often extremely sparse. Popular GNNs frameworks, such as PyTorch [5], leverage this sparsity to accelerate both training and inference. This is achieved by representing the adjacency matrix of the graph, or its normalized form, in standard sparse matrix formats that only consider the nonzero elements of a sparse matrix. Sparse matrix formats, like COO or CSR, enable faster sparse-dense matrix multiplication kernels (SpMMs) which are known to require a number of scalar operations that is proportional to the number of nonzero elements in the sparse matrix.

In this paper we present a new matrix compression format called Compressed Binary Matrix (CBM) to accelerate widely-used GNN architectures (e.g., GCN [2], GraphSage[6], and GIN [7]) that spend most of their processing time on matrix products involving the adjacency matrix of the underlying graph or its normalized, when these GNNs learn on unweighted graphs. Our format exploits the fact that the adjacency matrix of an unweighted graph is binary, and therefore it can be compressed beyond what is possible with sparsity alone, while simultaneously reducing the number of scalar operation required to multiply our compressed representation of the adjacency matrix and another dense real-valued matrix. The key advantage of the CBM format is that it only represents the differences (deltas) of a row with respect to another similar row of the same matrix, which for many use cases tends to be significantly smaller than the number of non-zeros represented by standard compression formats.

1.1 Previous Works

The matrix-matrix and matrix-vector products have been extensively studied [8]. A particular case is the product of a binary sparse matrix by a real-valued vector or (dense) matrix, where the efficient representation of the binary matrix can be exploited to improve both the memory footprint and the operation running time. Although this was not expected in some preliminary studies [9], and impossibility results exist for more complex compression schemes [10], it works for some representational compression schemes. One of such schemes is the Single Tree Adjacency Forest (STAF) proposed by Nishino et al. [11]. The STAF representation of a binary matrix is obtained by reversing and inserting the adjacency list of each row of a binary matrix into a trie data-structure, meaning that suffixes that are common to more than one adjacency list are represented exactly once. The authors have shown that the this data-structure can be built in linear time with respect to the number of nonzero elements in the input matrix. STAFs enable fast matrix-matrix products by traversing the trie in root-to-leaf order, while accumulating partial sums that are common to different rows of the result matrix. The number of operations required to multiply a binary matrix represented by a STAF and a real-valued vector is proportional to the size of the trie, which is also upper-bounded by the number of nonzero elements in the input matrix. STAFs doe not exploit, however, row-wise similarities outside of common row suffixes. To address this issue the authors proposed splitting the input matrix into sets of columns and create a STAF per set. This optimized version achieves a significant speedup and memory footprint reduction compared to CSR and the sparse matrix multiply kernel offered by the Eigen library.

Francisco et al. [12] explored how succinct representations for binary matrices and graphs could be exploited to speedup binary matrix multiplication, namely the Webgraph representation by Boldi and Vigna [13] and the Biclique Extraction (BE) based representation by Hernandez and Navarro [14]. Both representations allow to reduce the memory footprint of binary matrices and accelerate the product of compressed binary matrices and real-valued vectors. The Webgraph representation exploits the similarity among rows, as well as clustering effects present in real-world graphs and matrices, relying on gap compression, referentiation, intervalisation and $\zeta$ codes. The BE representation exploits also rows similarities and clustering effects, extracting maximal bicliques and replacing them with differential compressed sets of nodes that share adjacencies. The key observations are that, in both cases, we can reuse partial results from previous computations. Taking an implementation of Pagerank using classic adjacency lists, authors achieved significant and similar speedups and memory footprint reductions, showing that the product computation can be reduced in time proportional to the compressed matrix size. The Webgraph and BE representations require however non-trivial preprocessing steps, Webgraph benefits from a suitable vertex reordering, obtained through graph clustering methods, and BE requires finding maximal bicliques, an NP-hard problem in general. These steps might take longer to execute than computing the SpMM we are trying to optimize. On the other hand the experimental evaluations considered naive implementations of standard representations, possibly leading to unfair comparisons in what concerns state of the art linear algebra libraries.

Elgohary et al. [15] also addressed the problem that large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications, introducing Compressed Linear Algebra (CLA) for lossless matrix compression. CLA also executes linear algebra operations directly on the compressed representations. However, it is not focused on binary matrices and, although it achieves a performance close to the uncompressed case, it only presents performance gains when data does not fit into memory.

Our work is somewhat related to the work by Björklund and Lingas [16]. They also consider a weighted graph on the rows of a binary matrix where the weight of an edge between two rows is equal to its Hamming distance, and then they rely on a minimum spanning tree of that graph to differentially compress the rows of a binary matrix with respect to one another, thus accelerating the product computation. They consider however only a single product of two binary matrices, and do not consider the overhead imposed by operating on a compressed representation of a binary matrix, as their results are purely theoretical.

1.2 Our Contributions

In this paper, we make the following contributions: (1) we present the Compressed Binary Matrix (CBM) format, an efficient compression scheme for binary matrices, that can reduce the memory footprint of unweighted graphs that arise in the context of GNNs; (2) we introduce a new algorithm to significantly accelerate the sequence of matrix products between the (potentially normalized) adjacency matrix of the graph, represented in CBM format, and its embedding; (3) we prove that even in the worst-case scenario, where compression is not possible, the number of scalar operations required to multiply a matrix represented in CBM format does not exceed the number of scalar operations required to multiply the same matrix using classic sparse storage formats; and (4) we have implemented the CBM format and corresponding matrix multiplication kernels such that they can be used together with state-of-the-art Deep Learning framework, such as PyTorch. Furthermore, experimental evaluation using real-world datasets demonstrates the effectiveness of our approach. Our method is nearly $5\times$ faster than state-of-the-art SpMM implementations in sequential and parallel environments, subsequently shortening the inference time of a 2-layer GCN by more than $3\times$ . Our implementation will be made available at https://github.com/cbm4scale.

As previously stated, the CBM format is akin to the work of Björklund and Lingas [16]. Nevertheless, it distinguishes itself by being specifically designed to accelerate the product between a single sparse binary matrix and a set of real-valued matrices, thus the binary matrix only needs to be compressed once¹¹1Ideally, the adjacency matrix of any unweighted graph would be provided in CBM, avoiding any compression overhead.. Furthermore, our format resorts to a Minimum Cost Arborescence (MCA), which allows us to ignore compression opportunities that do not lead to improvements in memory-footprint and matrix multiplication. The CBM format also overcomes limitations present in STAF and BE, since our format exploits compression opportunities along complete rows of the matrix (by default), is constructed in polynomial time, and does not allocate additional memory to store partial results.

2 Compressed Binary Matrix Format

Let $\mathbf{A}\in\{0,1\}^{m\times n}$ be a binary matrix, where $\vec{a}_{i}\in\{0,1\}^{n}$ represents the $i$ -th row of $\mathbf{A}$ for $i=1,\dots,m$ . Also assume that $\vec{a}_{i}$ is represented with an adjacency list with the column indices of the nonzero elements of $\vec{a}_{i}$ . The Compressed Binary Matrix (CBM) format resorts to differential compression to represent the rows of a binary matrix $\mathbf{A}$ . This is, if $\mathbf{A}$ is represented in CBM format, then any row $\vec{a}_{x}$ can be represented by another row $\vec{a}_{y}$ and two lists of deltas that indicate which elements must be included ( $\Delta^{+}_{x,y}$ ), or removed ( $\Delta^{-}_{x,y}$ ), from the adjacency list of $\vec{a}_{y}$ to obtain $\vec{a}_{x}$ :

\vec{a}_{x}=(\vec{a}_{y}\cup\Delta^{+}_{x,y})\setminus\Delta^{-}_{x,y},\textrm% { where }\Delta^{+}_{x,y}=\vec{a}_{x}\setminus\vec{a}_{y}\textrm{, and }\Delta% ^{-}_{x,y}=\vec{a}_{y}\setminus\vec{a}_{x}.

(2)

Assuming $\vec{a}_{y}$ is present in memory, Equation 2 suggests that the memory required to represent $\vec{a}_{x}$ is proportional to the number of deltas between $\vec{a}_{x}$ and $\vec{a}_{y}$ . If these rows are similar, then it is likely that the number of deltas is smaller than the number of nonzero elements of $\vec{a}_{x}$ . In that case, it would be more memory efficient to represent $\vec{a}_{x}$ with respect to $\vec{a}_{y}$ than through its adjacency list. Therefore, to further reduce the memory footprint of matrix A, the compression algorithm that builds the CBM format must find a suitable chain of compression to represent all rows of $\mathbf{A}$ . This is, for each row $\vec{a}_{x}$ , identify another similar row $\vec{a}_{y}$ that characterizes the former, such that: (1) the numbers of deltas required to represent each row $\vec{a}_{x}$ is minimized subjected to $\vec{a}_{y}$ , and (2) the number of deltas required to represent $\vec{a}_{x}$ is guaranteed to be less than, or equal to, the number of nonzero elements in $\vec{a}_{x}$ .

Minimizing the number of deltas.

To address point (1), the CBM format must first measure the number of deltas required to convert each row $\vec{a}_{y}$ into all other rows $\vec{a}_{x}$ of $\mathbf{A}$ , i.e., measure the Hamming distance for each pair of matrix rows. This step provides a global view of the dissimilarity between the rows of the matrix $\mathbf{A}$ , and it can be modeled as a fully-connected and undirected distance graph $G$ . This graph has $m$ vertices, where each vertex represents a unique row of the matrix, and the weight of each edge $(y,x)$ corresponds to the number of deltas required to represent $\vec{a}_{x}$ with respect to $\vec{a}_{y}$ . To reduce the number of deltas required to compress the rows of $\mathbf{A}$ we can find a Minimal Spanning Tree (MST) of $G$ , which by definition spans $G$ with the minimum sum of edge weights possible. Naturally, any MST of the distance graph, rooted in vertex $x$ , defines a chain of compression with as many deltas as the weight of this tree plus the number of nonzero elements of $\vec{a}_{x}$ . Thus, any MST rooted in the vertex corresponding to the row with the fewest nonzero elements, defines a chain of compression that satisfies point (1).

Worst-case guarantees.

Note that the chain of compression obtained by finding an MST of $G$ does not satisfy point (2), because the weight of the lightest incoming edge of any vertex $x$ can be greater than the number of nonzero elements in $\vec{a}_{x}$ . In such cases, representing $\vec{a}_{x}$ with an adjacency list is clearly more memory efficient. To avoid this issue, we extended the distance graph $G$ with a virtual vertex $\mathbf{0}$ which is connected to all other vertices of the graph. This virtual vertex represents a null row-vector $\vec{a}_{0}\in\{0\}^{n}$ , which ensures that the weight of each edge $(\mathbf{0},x)$ is equal to the number of nonzero elements in $\vec{a}_{x}$ . The inclusion of virtual vertex $\mathbf{0}$ in the distance graph $G$ ensures that the issue described above cannot occur, since the lightest incoming edge of any vertex $x$ is now at most as heavy as the number of nonzero elements in $\vec{a}_{x}$ . Therefore, any chain of compression characterized by an MST of $G$ , rooted on vertex $\mathbf{0}$ , is guaranteed to satisfy points (1) and (2). If we use this chain of compression to represent a matrix in CBM format, then following property will be observed:

Property 2.1.

The number of deltas required to represent any matrix $\mathbf{A}\in\{0,1\}^{m\times n}$ in Compressed Binary Matrix (CBM) format is never greater than the number of nonzero elements in matrix $\mathbf{A}$ .

To complete the construction, we simply traverse the compression chain above in topological order, and for every edge $(y,x)$ visited, we compute the lists of positive and negative deltas required to convert row $\vec{a}_{y}$ into $\vec{a}_{x}$ .

2.1 Time and Space Analysis

Lemma 2.1.

Any matrix $\mathbf{A}\in\{0,1\}^{m\times n}$ can be represented in CBM format in $O((m+1)\ \mathbf{nnz}(\mathbf{A})+m^{2}\log m)$ time.

Proof.

The construction of the extended distance graph $G$ for $\mathbf{A}$ requires the computation of $m(m+1)/2$ Hamming distances between all possible row pairs $(\vec{a}_{x},\vec{a}_{y})$ . The Hamming distance of each pair of rows can be reduced to the intersection of their adjacency lists, computed in $O(\mathbf{nnz}(\vec{a}_{x})+\mathbf{nnz}(\vec{a}_{y}))$ time. Hence, the time to compute all $m(m+1)/2$ Hamming distances is upper-bounded by

\sum_{x=0}^{m}\sum_{y=0}^{m}\big{(}\mathbf{nnz}(\vec{a}_{x})+\mathbf{nnz}(\vec% {a}_{y})\big{)}=(m+1)\ \mathbf{nnz}(\mathbf{A}).

(3)

Additionally, well-known MST algorithms, such as Prim or Kruskal, are known to find an MST in $O(E\log V)$ time, where $E$ and $V$ denote the number of edges and vertices in the graph. Since the extended distance graph $G$ contains $m(m+1)/2$ edges and $m+1$ vertices, finding an MST of $G$ requires $O((m+1)^{2}\log(m+1))$ time, and therefore, representing matrix $\mathbf{A}$ takes time

O((m+1)\ \mathbf{nnz}(\mathbf{A})+(m+1)^{2}\log(m+1))).

(4)

Finally, a single list of deltas can also be obtained from the intersection of the adjacency lists of $\vec{a}_{x}$ and $\vec{a}_{y}$ . Hence, the computation of $2m$ lists of deltas is already accounted for in Equation 4. ∎

Lemma 2.2.

The space required to represent a binary matrix $\mathbf{A}\in\{0,1\}^{m\times n}$ in Compressed Binary Matrix (CBM) format is proportional to $O(m+\sum_{i=0}^{m}(|\Delta^{+}_{x,r_{x}}|+|\Delta^{-}_{x,r_{x}}|))$ , where $r_{x}$ represents the index of the row selected to compress row $\vec{a}_{x}$ .

Proof.

Assuming matrix $\mathbf{A}$ is represented in CBM format. Then the chain of compression required to represent this matrix consists on: (1) a list of edges that represents any MST rooted in vertex $\mathbf{0}$ of graph $G$ , and (2) a list of positive and negative deltas for each edge of this MST. Since the extended version of $G$ is a fully-connected graph with $m+1$ vertices, it is known that any MST of $G$ contains $m$ edges. Therefore, the list of edges contains $m$ elements, and there are $m$ lists of deltas, whose size totals $\sum_{x=0}^{m}(|\Delta^{+}_{x,r_{x}}|+|\Delta^{-}_{x,r_{x}}|)$ . ∎

2.2 Fast Matrix Multiplication with Compressed Binary Matrix Format

Let $\vec{w}\in\mathbb{R}^{n}$ be a dense and real-valued vector, and $\vec{a}_{x}$ and $\vec{a}_{y}$ two distinct rows of a binary matrix $\mathbf{A}$ as previously defined. It follows from Equation 2 that we can resort to the inner-product $\vec{a}_{y}\cdot\vec{w}$ to compute $\vec{a}_{x}\cdot\vec{w}$ as

\vec{a}_{x}\cdot\vec{w}=((\vec{a}_{y}\cup\Delta^{+}_{x,y})\setminus\Delta^{-}_% {x,y})\cdot\vec{w}=(\vec{a}_{y}\cdot\vec{w})+(\Delta^{+}_{x,y}\cdot\vec{w})-(% \Delta^{-}_{x,y}\cdot\vec{w}).

(5)

This implies that the dot-product $\vec{a}_{x}\cdot\vec{w}$ can be calculated in $1+|\Delta^{+}_{x,y}|+|\Delta^{-}_{x,y}|$ scalar operations, if the value of $\vec{a_{y}}\cdot\vec{w}$ can be reused. Naturally, we can resort to this strategy to design fast matrix-vector multiplication kernels $\vec{u}\leftarrow\mathbf{A}\vec{v}$ , where $\mathbf{A}\in\{0,1\}^{m\times n}$ , $\vec{u}\in\mathbb{R}^{m}$ , and $\vec{v}\in\mathbb{R}^{n}$ , by computing all dot-products between the rows of $\mathbf{A}$ and $\vec{v}$ in an order where: (1) the value of the dot-product $\vec{a}_{y}\cdot\vec{v}$ is known before calculating $\vec{a}_{x}\cdot\vec{v}$ , and (2) the value of all dot-products $\vec{a}_{x}\cdot\vec{v}$ is calculated with respect to the value $\vec{a}_{y}\cdot\vec{v}$ that results in the minimum overall number of scalar operations. By definition, the chain of compression of matrix $\mathbf{A}$ already represents such an ordering. Therefore, we can accelerate matrix-vector multiplication by traversing the chain of compression of matrix $\mathbf{A}$ in topological order, and for each edge visited compute $u_{x}\leftarrow\vec{a}_{x}\cdot\vec{v}$ as

u_{x}\leftarrow u_{y}+(\Delta^{+}_{x,y}\cdot\vec{v})-(\Delta^{-}_{x,y}\cdot% \vec{v}),

(6)

where $u_{y}$ is known to already contain the value $\vec{a}_{y}\cdot\vec{v}$ . Note that classic sparse-dense matrix-vector multiplication kernels compute $\vec{u}\leftarrow\mathbf{A}\cdot\vec{v}$ in $\mathbf{nnz}(\mathbf{A})$ scalar operations, where $\mathbf{nnz}(\mathbf{A})$ represents the numbers of nonzero elements in matrix $\mathbf{A}$ . Assuming $\mathbf{A}$ is represented in CBM format, then the number of deltas required to represent each row $\vec{a}_{x}$ of $\mathbf{A}$ is known to be smaller than, or equal to, the number of nonzero elements in $\vec{a}_{x}$ . If the number of deltas required to represent $\vec{a}_{x}$ is strictly smaller than $\mathbf{nnz}(\vec{a}_{x})$ , then it is clear that the dot-product $\vec{a}_{x}\cdot\vec{v}$ requires at most $\mathbf{nnz}(\vec{a}_{x})$ scalar operations. On the other hand, if the number of deltas required to represent $\vec{a}_{x}$ is the same as the number of nonzero elements in $\vec{a}_{x}$ , Equation 5 suggests that the dot-product $\vec{a}_{x}\cdot\vec{v}$ would be computed in $1+\mathbf{nnz}(\vec{a}_{x})$ scalar operations. This scenario can be avoided by engineering the MST algorithm to select an out-going edge of the virtual node 0 in case of draw. Since $\vec{a}_{x}$ was not compressed in this scenario, the number of scalar operations required to compute this dot-product is exactly $\mathbf{nnz}(\vec{a}_{x})$ . As the number of scalar operations required to compute $\vec{a}_{x}\cdot\vec{v}$ never surpasses the number of nonzero elements in $\vec{a}_{x}$ for $x=1,\dots,m$ , the following property becomes evident:

Property 2.2.

The number of scalar operations required to compute matrix-vector multiplication based on the Compressed Binary Matrix (CBM) format is never greater than those required to compute matrix-vector multiplication based on classic sparse formats.

Additionally, Equation 5 shows that our matrix-vector multiplication strategy does not require the allocation of additional buffers, since the value of the dot-product required to compute $\vec{a}_{x}\cdot\vec{v}$ is guaranteed to already be stored in vector $\vec{u}$ . Therefore, the following property is observed:

Property 2.3.

The amount of memory required to compute matrix-vector multiplication based on the Compressed Binary Matrix (CBM) format is proportional to the size of its operands and remains constant during execution time.

Intuitively, we can resort to the matrix-vector multiplication strategy described above to design fast matrix-matrix multiplication kernels as described in Algorithm 1. This algorithm assumes that the left-hand side operand matrix $\mathbf{A}\in\{0,1\}^{m\times n}$ is represented in CBM format, while matrices $\mathbf{B}\in\mathbb{R}^{n\times p}$ and $\mathbf{C}\in\mathbb{R}^{m\times p}$ are dense and correspond to the right-hand side operand matrix and to the product matrix, respectively. As it can be observed, Algorithm 1 computes the matrix-vector product between matrix $\mathbf{A}$ and each column of matrix $\mathbf{B}$ . Therefore, we can conclude that Properties 2.2 and 2.3 hold for the matrix-matrix multiplication case.

Input:

\mathbf{A}\in\{0,1\}^{m\times n}

\mathbf{B}\in\mathbb{R}^{n\times p}

\mathbf{C}\in\mathbb{R}^{m\times p}

1foreach edge $(y,x)\in\text{ chain of compression of }\mathbf{A}$ (in topological order) do

2 foreach column vector $\vec{b}_{k}\in\mathbf{B}$ , where $k\in[1,p[$ do

\mathbf{C}_{x,k}\leftarrow\mathbf{C}_{y,k}+(\Delta^{+}_{x,y}\cdot\vec{b}_{k})-% (\Delta^{-}_{x,y}\cdot\vec{b}_{k})

;

return

\mathbf{C}

Algorithm 1 Matrix-Matrix product with CBM

Leveraging High Performance SpMM kernels.

The representation of a binary matrix $\mathbf{A}$ in CBM format was until now conceptualized as a chain of compression, where each node of this chain is associated with two lists of deltas. As is, a compressed matrix cannot be represented in a sparse format capable of leveraging efficient SpMM kernels. To address this issue, we represent the lists of deltas that characterize our format as a matrix $\mathbf{A^{\prime}}\in\mathbb{R}^{m\times n}$ , which can be represented in any convenient matrix format, and leverage efficient SpMM kernels provided by Intel MKL to compute all dot-products of Algorithm 1 in a single matrix product $\mathbf{A^{\prime}}\mathbf{B}$ . Once $\mathbf{A^{\prime}}\mathbf{B}$ is stored in matrix $\mathbf{C}$ , we can finalize the matrix multiplication with CBM, by traversing the chain of compression in topological order, and updating row $\vec{c}_{x}$ of $\mathbf{C}$ as $\vec{c}_{x}:=\vec{c}_{x}+\vec{c}_{y}$ for each edge $(y,x)$ that was visited. Naturally, these updates can also leverage efficient AXPY kernels, which are also provided by Intel MKL.

Multi-threading parallelism.

As suggested in the previous paragraph, matrix multiplication with the CBM format can be divided into two stages. The first one computes the product $\mathbf{C}:=\mathbf{A^{\prime}B}$ which is embarrassingly parallel, and Intel MKL already provides efficient multi-threaded and vectorized implementations of this operation. The second stage involves updating the rows of matrix $\mathbf{C}$ with respect to the chain of compression. This stage presents data-dependencies, since the final value of row $\vec{c}_{x}$ of $\mathbf{C}$ can only be calculated once $\vec{c}_{y}$ is known. There are however no dependencies between different branches of the chain of compression. Therefore, we can parallelize this stage by concurrently updating the rows of matrix $\mathbf{C}$ that are found in different branches of the virtual node 0.

Extending SpMM with CBM to normalized adjacency matrices.

Let $\hat{\mathbf{A}}\in\mathbb{R}^{n\times n}$ represent the normalized adjacency matrix of an unweighted graph, where $\hat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-% \frac{1}{2}}$ . As is, we cannot resort to the CBM format to represent $\hat{\mathbf{A}}$ since this matrix is not binary. However, fast matrix multiplication is still possible. Note that $\mathbf{D}^{-\frac{1}{2}}$ is a diagonal matrix, and therefore $(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ corresponds to a column-scaled matrix, i.e., the elements in a column of the matrix are either 0 or have a constant value that is unique to this column. We can represent this matrix in CBM format, by simply multiplying the corresponding matrix of deltas $(\mathbf{A^{\prime}}+\mathbf{I})$ by $\mathbf{D}^{-\frac{1}{2}}$ . At this point, we can efficiently compute $\mathbf{C}:=((\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}})\mathbf{B}$ as described previously. Finally, note that $\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}% \mathbf{B})$ , simply scales the rows of matrix $(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{B}$ . The cost of scaling the rows of this matrix can be hidden by fusing it with the update step of the product matrix $\mathbf{C}$ .

Speeding up SpMM with Edge Pruning.

Note that not all compression opportunities contribute to faster matrix multiplication kernels. The overheads associated with differential compression, such as traversing the chain of compression and updating the result matrix, might overcome potential performance gains if the number of scalar operations saved are not above a certain threshold. To address this issue, we can prune all edges of the distance graph of $\mathbf{A}$ where the number of scalar operations saved does not meet a user-defined threshold $\alpha\in\mathbb{N}$ . For each edge $(y,x)$ in the extended distance graph of $\mathbf{A}$ , prune this edge if its weight is greater than the number of nonzero elements in $\vec{a}_{x}$ minus $\alpha$ . Naturally, if we prune the edges of the extended distance graph of $\mathbf{A}$ in this manner, it is possible for a single edge direction to be pruned, while the opposite direction remains in the graph. Therefore, the extended distance graph of $\mathbf{A}$ is now directed, and a suitable chain of compression corresponds to a Minimum Cost Arborescence (MCA) rooted in the virtual node 0. Note that our compression algorithm remains correct, since the extended distance graph of $\mathbf{A}$ contains an out-going edge from the virtual node 0 to all other nodes; Furthermore, the time required to build the CBM format, as shown in Lemma 2.1, remains unchanged since finding an MCA or an MST present the same time complexity [17].

3 Experimental Evaluation

Experimental Setting.

The experiments found in this section were run on an Intel Xeon Gold 6130 (Skylake) CPU with 16 physical cores and 2.1 GHz fixed clock frequency. This machine runs on CentOS Linux 7 (version 3.10.0) operating system. The SpMM kernels tested in this section were implemented in C++, and rely on Intel MKL 24.0 sparse CSR format and corresponding matrix multiplication kernels. The C++ code developed in this work is then called from Python 3.11, via PyTorch C++ Extensions to reliably emulate common use-cases. Parallel experiments (with 16 cores) were implemented with OpenMP 4.5, and the threads were pinned to physical cores with GOMP_CPU_AFFINITY="0-15" environment variable.

3.1 Evaluation Metrics

The CBM format was evaluated with respect to quality of compression achieved (compression ratio), and the time required (runtime reduction) to compute SpMM and to infer a 2-layer GCN, when the adjacency matrix of the graphs are represented in our format. The compression ratio is defined as the ratio between the memory required to represent a matrix in CSR format and CBM format. In our implementation, the CBM format is composed by the corresponding matrix of deltas $\mathbf{A^{\prime}}$ and a tree representing the chain of compression, both stored in CSR format. In the context of sparse-dense matrix multiplication (SpMM), the runtime reduction is measured by comparing the average time, out of 50 runs, it takes to perform matrix multiplication with a randomly generated dense matrix with 500 columns using the CSR format, to the time taken to compute the same matrix multiplication with the CBM format. The formula used to capture this metric is $\frac{T_{\text{CSR}}-T_{\text{CBM}}}{T_{\text{CSR}}}\times 100\%$ , where $T_{\text{CSR}}$ is the time required to carry out sparse-dense matrix multiplication with CSR by the state-of-the-art SpMM implementation offered by Intel MKL, and $T_{\text{CBM}}$ is the time taken to compute the same product with CBM. We used same formula and number of runs in the context of GCN inference, however, $T_{\text{CSR}}$ and $T_{\text{CBM}}$ correspond to the time required by the inference stage of this network by resorting to SpMM kernels based on CSR and CBM, respectively. It is important to note that we did not consider the SpMM kernels that are native to PyTorch, because these kernels were substantially slower than the ones implemented in Intel MKL.

3.2 Datasets

To demonstrate the advantages of the CBM format in the context of SpMM and GCN inference, we selected six real-world graphs of varying size and average in-degree as depicted in Table 1. The selected graphs depict relationships between authors and/or academic papers, where nodes tend to share many common neighbors. This property suggests that the adjacency matrices of these graphs are good candidates to be represented in CBM format.

In co-paper graphs, each node represents a paper. An undirected edge is placed between two nodes if the corresponding papers share at least one common author. They depict the interconnection and collaborative patterns between various academic publications. Co-author graphs represent scientific collaborations between authors of academic papers. Here, nodes correspond to authors, and an undirected edge is placed between two nodes if authors of the nodes have co-authored a paper together. If a paper is authored collaboratively by a group of authors, it results in a fully connected subgraph, or clique, encompassing those grouped nodes. Citation graphs are directed graphs where each node represents an academic paper. Directed edges in these graphs illustrate citations, with an edge pointing from the citing paper to the cited paper. These graphs highlight the directional flow of information and the influence of one paper upon another within the academic community.

Table 1: Summary of key datasets used to evaluate SpMM and GNN efficiency and scalability, categorized by graph type and sorted by average in-degree. This table presents the total number of nodes, edges, and average in-degree.

Graph	Type	Nodes	Edges	Avg. In-degree
coPapersCiteseer [18]	Co-paper	434,102	32,073,440	74.8
coPapersDBLP [18]	Co-paper	540,486	30,491,458	57.4
ca-AstroPh [19]	Co-author	18,772	396,160	22.1
ca-HepPh [19]	Co-author	12,008	237,010	20.7
PubMed [20]	Citation network	19,717	88,648	5.4
Cora [20]	Citation network	2,708	10,556	4.8

3.3 Sparse-Dense Matrix Multiplication (SpMM) Evaluation

Finding the best $\alpha$ is key to improve the performance of matrix multiplication with CBM. Adjusting this parameter not only reduces overhead associated with traversing the compression chain, but also exposes more parallelism opportunities as it increases the out-degree of the virtual node. Given the importance of $\alpha$ we first consider the case where $\alpha=0$ and our edge pruning technique was not applied, and then we show how fine-tuning $\alpha$ improves the quality of matrix multiplication with CBM.

Refer to caption — (a) coPapersCiteseer Dataset

$\alpha$ equal to zero.

Representing Cora (Fig. 1(e)) and PubMed (Fig. 1(f)) datasets in CBM resulted in minimal, and even negative, compression gains with respect to CSR. This is likely caused by the small average in-degree of these graphs, which suggests that the compression opportunities found in these graphs do not offset the memory overhead required to represent the corresponding chains of compression. As expected, the poor compression rate of these datasets led to no speedup in the context of matrix multiplication. On the other hand, compressing ca-AstroPh (Fig. 1(c)) and ca-HepPH (Fig. 1(d)) with our format respectively increased the compression ratios of both datasets to 2.6 $\times$ and 1.7 $\times$ , subsequently accelerating matrix product for both datasets. Our matrix multiplication strategy achieved a runtime reduction of 26% and 8% for ca-AstroPh in sequential and parallel environments, respectively. For ca-HepPh the same multiplication kernel presented a runtime reduction of 40% and 18% also in sequential and parallel environments. The datasets that benefited the most from our format were the coPapersCiteseer (Fig. 1(a)) and coPapersDBLP (Fig. 1(b)), achieving impressive compression ratios of 9.8 $\times$ and 5.9 $\times$ , leading to substantial improvements in computational performance. Our matrix multiplication kernel with coPapersCiteseer achieved a runtime reduction of 71% and 77% in sequential and parallel environments, while the same kernel with coPapersDBLP presented a runtime reduction of 59% and 61% in the same experimental settings. These results highlight the potential of the CBM format to efficiently compress and multiply unweighted graphs that present natural communities and an high average in-degree.

$\alpha$ greater than zero.

Setting $\alpha$ to a value greater than 8 reduced the overhead associated with traversing the chain of compression of Cora (Fig. 1(e)) and PubMed (Fig. 1(f)), making the performance decay observed for $\alpha=0$ negligible in sequential and parallel settings, even when no compression gains are observed. Increasing the value of $\alpha$ is also beneficial for sequential matrix multiplication with ca-AstroPh (Fig. 1(c)) and ca-HepPh (Fig. 1(d)), improving the respective runtime reduction from 26% up to 28% ( $\alpha=2$ ) and from 40% up to 44% ( $\alpha=4$ ). Nevertheless, it is important to note that our compression algorithm will start to ignore good compression opportunities once $\alpha$ is large enough, worsening the performance of our matrix multiplication strategy. This effect is evident in both datasets for $\alpha$ greater than 16, where the compression ratio decreases alongside with the runtime reduction. Experiments with both co-author datasets show that adjusting $\alpha$ is even more important in the parallel case, increasing the runtime reduction of our multiplication strategy with ca-AstroPh from 8% up to 11% ( $\alpha=8$ ), and with ca-HepPh from 18% up to 26% ( $\alpha=64$ ). Furthermore, these experiments confirm that the degree of parallelism of our multiplication strategy increases concurrently with $\alpha$ . This effect is easily observed for ca-HepPh (Fig.1(d)), where the performance of our matrix multiplication kernel sharply declines for $\alpha$ greater than 2, followed by a steep increase in performance when $\alpha$ equals to 32 or 64 (even though our compression algorithm is already ignoring good compression opportunities at this point). Finding the best $\alpha$ is not as relevant for coPapersCiteseer (Fig.1(a)) and coPapersDBLP (Fig.1(b)), as most compression opportunities save more than 8 scalar operations in the context of our matrix multiplication strategy. This observation is verified, since the compression ratio of these datasets shows little to no decrease for $\alpha$ smaller than 8. This effect is most likely due to the high average in-degree of both datasets. Still, adjusting $\alpha$ is required to obtain the best runtime reduction for both coPapers datasets. In our experiments our matrix multiplication kernel with coPapersCiteseer achieved a peak runtime reduction of 71% ( $\alpha=2$ ) and 79% ( $\alpha=32$ ) in sequential and parallel environments, while the same kernel with coPapersDBLP peaks at 59% ( $\alpha=$ 4) and 63% ( $\alpha=16$ ) also in sequential and parallel environments.

3.4 Integrating CBM with GCN

To assess the impact of CBM in GNNs, we considered the runtime reduction of an inference stage of a 2-layer GCN with 500 features where the normalized adjacency matrix of each dataset is represent in our format. Matrix products involving the normalized adjacency matrix were carried out with our extended matrix multiplication kernel, as described in Section 2.2. As baseline for our experiments, we selected the same 2-layer GCN where the normalized adjacency matrix is represented in CSR and any products involving this matrix are carried out by the SpMM implementation found in Intel MKL. These experiments are illustrated in Figure 2. To keep the discussion concise, we only considered the values of $\alpha$ that led to the best matrix multiplication performance for each dataset analyzed in Section 3.3.

Representing both Cora and PubMed datasets in CBM format increased the inference time of the corresponding GCNs compared to the baseline. This behavior is expected because the only steps of the inference that benefit from the CBM format are the products involving the normalized adjacency matrix. As previously shown, compressing these datasets using our format did not accelerate the corresponding matrix products. Experiments with both co-author datasets demonstrated that our format can reduce the inference time for GCN models. For ca-AstroPh, our format reduced the inference time of the network by 17% in sequential environments and 3% in parallel environments. Similarly, for ca-HepPh, our format achieved a runtime reduction of 22% in sequential environments and 11% in parallel environments. Representing both co-Paper datasets in our format resulted in the highest runtime reductions during GCN inference. For coPapersCiteseer, our format achieved an average runtime reduction of 48% in sequential environments and 66% in parallel environments. For coPapersDBLP, our format achieved a runtime reduction of 43% in sequential environments and 52% in parallel environments.

4 Final Remarks

In this work we proposed the Compressed Binary Matrix (CBM) format which simultaneously reduces the memory footprint of unweighted graphs and binary matrices, and enables the implementation of new matrix multiplication kernels that might be significantly faster than the current state-of-the-art. Experimental evaluation results shown significant speedups, both in sequential and parallel environments, up to 5 $\times$ . We obtained also significant performance improvements in the context of the GCN inference stage by integrating the CBM format in a deep learning framework, namely PyTorch, observing speed ups of 3 $\times$ . Although we did not discuss the CBM format construction time, we observe that it can be built in a reasonable amount of time for a dataset provider. In our experiments it took us less than 16 seconds to convert the largest dataset into our format in a sequential CPU environment. It is important to stress that the effectiveness of our format depends on the specific dataset, as discussed in the experimental evaluation. While we suspect that graphs with a high average degree and a tendency to form communities are good candidates, the best way to determine if a graph is suitable is to examine the compression ratio achieved by our format for a reasonable value of $\alpha$ . Finally we highlight that our format is future proof, since future optimizations to high-performance SpMM kernels will also accelerate matrix multiplication with the CBM format. Future work concerns integrating and evaluating the CBM format in the context of different GNNs architectures, and also targeting the training stage of this networks. Additionally, we intend to implement and evaluate our format and corresponding multiplication kernels in GPU architectures.

Acknowledgments

This work has been supported by the Innovation Study CBM4scale, funded by the Inno4scale project, which is funded by the European High-Performance Computing Joint Undertaking (JU) under Grant Agreement No 101118139. The JU receives support from the European Union’s Horizon Europe Programme.

Author Contributions

JNFA devised the main conceptual of the Compressed Binary Matrix (CBM), including the various matrix multiplication algorithms and their optimizations. JNFA also implemented the CBM format and the related multiplication kernels in C++, including the integration of SpMM and AXPY (from Intel MKL) into the matrix multiplications kernels based on the CBM format, as proposed in Section 2.2. Additionally, JNFA implemented an interface that enables calling the previous C++ routines from Python. SM designed the Python benchmarks to compare CBM and CSR in matrix multiplication and compression quality, as described in Section 3.3. SM also integrated the CBM and CSR-based matrix multiplication kernels into the Message Passing Layer (MPL) to evaluate the impact of the CBM format during the inference stage of a 2-layer GCN model implemented in PyTorch, as discussed in Section 3.4. SB, APF, WNG, and LMSR supervised the project. JNFA wrote the bulk of this draft. All authors provided critical feedback, participated in discussions, contributed to the interpretation of the results, and approved the final manuscript.

References

[1] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021.
[2] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[3] Y. Shao, H. Li, X. Gu, H. Yin, Y. Li, X. Miao, W. Zhang, B. Cui, and L. Chen. Distributed graph neural network training: A survey. ACM Comput. Surv., 56(8), 2024.
[4] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[5] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[6] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[7] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[8] J. Alman and V. Vassilevska Williams. Further limitations of the known approaches for matrix multiplication. In A. R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pages 25:1–25:15. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018.
[9] C. Karande, K. Chellapilla, and R. Andersen. Speeding up algorithms on compressed web graphs. Internet Mathematics, 6(3):227–256, 2009.
[10] A. Abboud, A. Backurs, K. Bringmann, and M. Künnemann. Impossibility results for grammar-compressed linear algebra. Advances in Neural Information Processing Systems, 33:8810–8823, 2020.
[11] M. Nishino, N. Yasuda, S. i. Minato, and M. Nagata. Accelerating graph adjacency matrix multiplications with adjacency forest. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 1073–1081. SIAM, 2014.
[12] A. P. Francisco, T. Gagie, D. Köppl, S. Ladra, and G. Navarro. Graph compression for adjacency-matrix multiplication. SN Computer Science, 3(3):193, 2022.
[13] P. Boldi and S. Vigna. The webgraph framework i: compression techniques. In Proceedings of the 13th international conference on World Wide Web, pages 595–602, 2004.
[14] C. Hernández and G. Navarro. Compressed representations for web and social graphs. Knowledge and Information Systems, 40(2):279–313, 2014.
[15] A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, and B. Reinwald. Compressed linear algebra for declarative large-scale machine learning. Communications of the ACM, 62(5):83–91, 2019.
[16] A. Björklund and A. Lingas. Fast boolean matrix multiplication for highly clustered data. In Algorithms and Data Structures: 7th International Workshop, WADS 2001 Providence, RI, USA, August 8–10, 2001 Proceedings 7, pages 258–263. Springer, 2001.
[17] M. Böther, O. Kißig, and C. Weyand. Efficiently computing directed minimum spanning trees. In 2023 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), pages 86–95. SIAM, 2023.
[18] R. A. Rossi and N. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI Conference on Artificial Intelligence, 2015.
[19] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. June 2014.
[20] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, pages 40–48, 2016.

Accelerating Graph Neural Networks with a Novel Matrix Compression Format

Abstract

1 Introduction

1.1 Previous Works

1.2 Our Contributions

2 Compressed Binary Matrix Format

Minimizing the number of deltas.

Worst-case guarantees.

Property 2.1.

2.1 Time and Space Analysis

Lemma 2.1.

Proof.

Lemma 2.2.

Proof.

2.2 Fast Matrix Multiplication with Compressed Binary Matrix Format

Property 2.2.

Property 2.3.

Leveraging High Performance SpMM kernels.

Multi-threading parallelism.

Extending SpMM with CBM to normalized adjacency matrices.

Speeding up SpMM with Edge Pruning.

3 Experimental Evaluation

Experimental Setting.

3.1 Evaluation Metrics

3.2 Datasets

3.3 Sparse-Dense Matrix Multiplication (SpMM) Evaluation

α𝛼\alphaitalic_α equal to zero.

α𝛼\alphaitalic_α greater than zero.

3.4 Integrating CBM with GCN

4 Final Remarks

Acknowledgments

Author Contributions

References

$\alpha$ equal to zero.

$\alpha$ greater than zero.