Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Accelerating Graph Neural Networks with a Novel Matrix Compression Format

João N. F. Alves1,2,3,4 Samir Moustafa1,2 Siegfried Benkner1
Alexandre P. Francisco3,4
Wilfried N. Gansterer1 & Luís M. S. Russo3,4
Abstract

The inference and training stages of Graph Neural Networks (GNNs) are often dominated by the time required to compute a long sequence of matrix multiplications between the sparse graph adjacency matrix and its embedding. To accelerate these stages, we first propose the Compressed Binary Matrix (CBM) storage format to succinctly represent the binary adjacency matrix of an unweighted graph. Then, we show how to generalize this representation to normalized adjacency matrices of unweighted graphs which arise in the context of GNNs. Finally, we develop efficient matrix multiplication kernels based on this compressed representation. The matrix multiplication kernels proposed in this work never require more scalar operations than classic sparse matrix multiplication algorithms. Experimental evaluation shows that the matrix multiplication strategies proposed outperform the current state-of-the-art implementations provided by Intel MKL, achieving speedups close to 5×\times×. Furthermore, our optimized matrix-multiplication strategies accelerated the inference time of a GNN by up to 3×3\times3 ×.

1 Introduction

Graph Neural Networks (GNNs) are the preferred tool to learn from graph-structured data and thus are considered key for future AI applications in domains like social network analysis, natural language processing, biology, physics, and many others [1]. The training and inference time of different GNN architectures is dominated by a long sequence of matrix products. This is particularly evident in GNNs that resort to Message Passing Layers (MPLs), where in each hidden layer the nodes of the graph aggregate the embedding of neighboring nodes and adjust their own embedding based on the information collected. In some variants of GNNs, such as the widely used Graph Convolutional Networks (GCNs) [2], the message produced in each layer is essentially the product of the adjacency matrix of the underlying graph and its current embedding.

For illustration purposes, consider a two-layer GCN. To propagate and combine node features across the graph, this network must compute the following operations once per inference and once per training epoch [3]:

𝐀^σ(𝐀^𝐗𝐖0)𝐖1,^𝐀𝜎^𝐀superscript𝐗𝐖0superscript𝐖1\hat{\mathbf{A}}\ \sigma(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}^{0})\mathbf{W}^{% 1},over^ start_ARG bold_A end_ARG italic_σ ( over^ start_ARG bold_A end_ARG bold_XW start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) bold_W start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , (1)

where 𝐀^^𝐀\hat{\mathbf{A}}over^ start_ARG bold_A end_ARG represents the normalized adjacency matrix of the graph, such that 𝐀^=𝐃12(𝐀+𝐈)𝐃12^𝐀superscript𝐃12𝐀𝐈superscript𝐃12\hat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-% \frac{1}{2}}over^ start_ARG bold_A end_ARG = bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT, 𝐃𝐃\mathbf{D}bold_D is the degree diagonal matrix of the graph, σ𝜎\sigmaitalic_σ denotes an element-wise activation function, 𝐗𝐗\mathbf{X}bold_X is the matrix of node features, and 𝐖0superscript𝐖0\mathbf{W}^{0}bold_W start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT and 𝐖1superscript𝐖1\mathbf{W}^{1}bold_W start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT are learnable dense matrices for the first and second layers [4]. It is important to note that in real scenarios the adjacency matrix of the graph is typically much larger than the remaining operand matrices of Equation 1. Therefore, matrix products involving this matrix represent most of the computational burden of training and inferring GNNs. Graphs arising in the context of GNNs are often extremely sparse. Popular GNNs frameworks, such as PyTorch [5], leverage this sparsity to accelerate both training and inference. This is achieved by representing the adjacency matrix of the graph, or its normalized form, in standard sparse matrix formats that only consider the nonzero elements of a sparse matrix. Sparse matrix formats, like COO or CSR, enable faster sparse-dense matrix multiplication kernels (SpMMs) which are known to require a number of scalar operations that is proportional to the number of nonzero elements in the sparse matrix.

In this paper we present a new matrix compression format called Compressed Binary Matrix (CBM) to accelerate widely-used GNN architectures (e.g., GCN [2], GraphSage[6], and GIN [7]) that spend most of their processing time on matrix products involving the adjacency matrix of the underlying graph or its normalized, when these GNNs learn on unweighted graphs. Our format exploits the fact that the adjacency matrix of an unweighted graph is binary, and therefore it can be compressed beyond what is possible with sparsity alone, while simultaneously reducing the number of scalar operation required to multiply our compressed representation of the adjacency matrix and another dense real-valued matrix. The key advantage of the CBM format is that it only represents the differences (deltas) of a row with respect to another similar row of the same matrix, which for many use cases tends to be significantly smaller than the number of non-zeros represented by standard compression formats.

1.1 Previous Works

The matrix-matrix and matrix-vector products have been extensively studied [8]. A particular case is the product of a binary sparse matrix by a real-valued vector or (dense) matrix, where the efficient representation of the binary matrix can be exploited to improve both the memory footprint and the operation running time. Although this was not expected in some preliminary studies [9], and impossibility results exist for more complex compression schemes [10], it works for some representational compression schemes. One of such schemes is the Single Tree Adjacency Forest (STAF) proposed by Nishino et al. [11]. The STAF representation of a binary matrix is obtained by reversing and inserting the adjacency list of each row of a binary matrix into a trie data-structure, meaning that suffixes that are common to more than one adjacency list are represented exactly once. The authors have shown that the this data-structure can be built in linear time with respect to the number of nonzero elements in the input matrix. STAFs enable fast matrix-matrix products by traversing the trie in root-to-leaf order, while accumulating partial sums that are common to different rows of the result matrix. The number of operations required to multiply a binary matrix represented by a STAF and a real-valued vector is proportional to the size of the trie, which is also upper-bounded by the number of nonzero elements in the input matrix. STAFs doe not exploit, however, row-wise similarities outside of common row suffixes. To address this issue the authors proposed splitting the input matrix into sets of columns and create a STAF per set. This optimized version achieves a significant speedup and memory footprint reduction compared to CSR and the sparse matrix multiply kernel offered by the Eigen library.

Francisco et al. [12] explored how succinct representations for binary matrices and graphs could be exploited to speedup binary matrix multiplication, namely the Webgraph representation by Boldi and Vigna [13] and the Biclique Extraction (BE) based representation by Hernandez and Navarro [14]. Both representations allow to reduce the memory footprint of binary matrices and accelerate the product of compressed binary matrices and real-valued vectors. The Webgraph representation exploits the similarity among rows, as well as clustering effects present in real-world graphs and matrices, relying on gap compression, referentiation, intervalisation and ζ𝜁\zetaitalic_ζ codes. The BE representation exploits also rows similarities and clustering effects, extracting maximal bicliques and replacing them with differential compressed sets of nodes that share adjacencies. The key observations are that, in both cases, we can reuse partial results from previous computations. Taking an implementation of Pagerank using classic adjacency lists, authors achieved significant and similar speedups and memory footprint reductions, showing that the product computation can be reduced in time proportional to the compressed matrix size. The Webgraph and BE representations require however non-trivial preprocessing steps, Webgraph benefits from a suitable vertex reordering, obtained through graph clustering methods, and BE requires finding maximal bicliques, an NP-hard problem in general. These steps might take longer to execute than computing the SpMM we are trying to optimize. On the other hand the experimental evaluations considered naive implementations of standard representations, possibly leading to unfair comparisons in what concerns state of the art linear algebra libraries.

Elgohary et al. [15] also addressed the problem that large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications, introducing Compressed Linear Algebra (CLA) for lossless matrix compression. CLA also executes linear algebra operations directly on the compressed representations. However, it is not focused on binary matrices and, although it achieves a performance close to the uncompressed case, it only presents performance gains when data does not fit into memory.

Our work is somewhat related to the work by Björklund and Lingas [16]. They also consider a weighted graph on the rows of a binary matrix where the weight of an edge between two rows is equal to its Hamming distance, and then they rely on a minimum spanning tree of that graph to differentially compress the rows of a binary matrix with respect to one another, thus accelerating the product computation. They consider however only a single product of two binary matrices, and do not consider the overhead imposed by operating on a compressed representation of a binary matrix, as their results are purely theoretical.

1.2 Our Contributions

In this paper, we make the following contributions: (1) we present the Compressed Binary Matrix (CBM) format, an efficient compression scheme for binary matrices, that can reduce the memory footprint of unweighted graphs that arise in the context of GNNs; (2) we introduce a new algorithm to significantly accelerate the sequence of matrix products between the (potentially normalized) adjacency matrix of the graph, represented in CBM format, and its embedding; (3) we prove that even in the worst-case scenario, where compression is not possible, the number of scalar operations required to multiply a matrix represented in CBM format does not exceed the number of scalar operations required to multiply the same matrix using classic sparse storage formats; and (4) we have implemented the CBM format and corresponding matrix multiplication kernels such that they can be used together with state-of-the-art Deep Learning framework, such as PyTorch. Furthermore, experimental evaluation using real-world datasets demonstrates the effectiveness of our approach. Our method is nearly 5×5\times5 × faster than state-of-the-art SpMM implementations in sequential and parallel environments, subsequently shortening the inference time of a 2-layer GCN by more than 3×3\times3 ×. Our implementation will be made available at https://github.com/cbm4scale.

As previously stated, the CBM format is akin to the work of Björklund and Lingas [16]. Nevertheless, it distinguishes itself by being specifically designed to accelerate the product between a single sparse binary matrix and a set of real-valued matrices, thus the binary matrix only needs to be compressed once111Ideally, the adjacency matrix of any unweighted graph would be provided in CBM, avoiding any compression overhead.. Furthermore, our format resorts to a Minimum Cost Arborescence (MCA), which allows us to ignore compression opportunities that do not lead to improvements in memory-footprint and matrix multiplication. The CBM format also overcomes limitations present in STAF and BE, since our format exploits compression opportunities along complete rows of the matrix (by default), is constructed in polynomial time, and does not allocate additional memory to store partial results.

2 Compressed Binary Matrix Format

Let 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT be a binary matrix, where ai{0,1}nsubscript𝑎𝑖superscript01𝑛\vec{a}_{i}\in\{0,1\}^{n}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT represents the i𝑖iitalic_i-th row of 𝐀𝐀\mathbf{A}bold_A for i=1,,m𝑖1𝑚i=1,\dots,mitalic_i = 1 , … , italic_m. Also assume that aisubscript𝑎𝑖\vec{a}_{i}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is represented with an adjacency list with the column indices of the nonzero elements of aisubscript𝑎𝑖\vec{a}_{i}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The Compressed Binary Matrix (CBM) format resorts to differential compression to represent the rows of a binary matrix 𝐀𝐀\mathbf{A}bold_A. This is, if 𝐀𝐀\mathbf{A}bold_A is represented in CBM format, then any row axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT can be represented by another row aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT and two lists of deltas that indicate which elements must be included (Δx,y+subscriptsuperscriptΔ𝑥𝑦\Delta^{+}_{x,y}roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT), or removed (Δx,ysubscriptsuperscriptΔ𝑥𝑦\Delta^{-}_{x,y}roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT), from the adjacency list of aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT to obtain axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT:

ax=(ayΔx,y+)Δx,y, where Δx,y+=axay, and Δx,y=ayax.formulae-sequencesubscript𝑎𝑥subscript𝑎𝑦subscriptsuperscriptΔ𝑥𝑦subscriptsuperscriptΔ𝑥𝑦 where subscriptsuperscriptΔ𝑥𝑦subscript𝑎𝑥subscript𝑎𝑦, and subscriptsuperscriptΔ𝑥𝑦subscript𝑎𝑦subscript𝑎𝑥\vec{a}_{x}=(\vec{a}_{y}\cup\Delta^{+}_{x,y})\setminus\Delta^{-}_{x,y},\textrm% { where }\Delta^{+}_{x,y}=\vec{a}_{x}\setminus\vec{a}_{y}\textrm{, and }\Delta% ^{-}_{x,y}=\vec{a}_{y}\setminus\vec{a}_{x}.over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ∪ roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ) ∖ roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT , where roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT = over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∖ over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , and roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT = over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ∖ over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT . (2)

Assuming aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT is present in memory, Equation 2 suggests that the memory required to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is proportional to the number of deltas between axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. If these rows are similar, then it is likely that the number of deltas is smaller than the number of nonzero elements of axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. In that case, it would be more memory efficient to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT with respect to aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT than through its adjacency list. Therefore, to further reduce the memory footprint of matrix A, the compression algorithm that builds the CBM format must find a suitable chain of compression to represent all rows of 𝐀𝐀\mathbf{A}bold_A. This is, for each row axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, identify another similar row aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT that characterizes the former, such that: (1) the numbers of deltas required to represent each row axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is minimized subjected to aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT, and (2) the number of deltas required to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is guaranteed to be less than, or equal to, the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.

Minimizing the number of deltas.

To address point (1), the CBM format must first measure the number of deltas required to convert each row aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT into all other rows axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT of 𝐀𝐀\mathbf{A}bold_A, i.e., measure the Hamming distance for each pair of matrix rows. This step provides a global view of the dissimilarity between the rows of the matrix 𝐀𝐀\mathbf{A}bold_A, and it can be modeled as a fully-connected and undirected distance graph G𝐺Gitalic_G. This graph has m𝑚mitalic_m vertices, where each vertex represents a unique row of the matrix, and the weight of each edge (y,x)𝑦𝑥(y,x)( italic_y , italic_x ) corresponds to the number of deltas required to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT with respect to aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. To reduce the number of deltas required to compress the rows of 𝐀𝐀\mathbf{A}bold_A we can find a Minimal Spanning Tree (MST) of G𝐺Gitalic_G, which by definition spans G𝐺Gitalic_G with the minimum sum of edge weights possible. Naturally, any MST of the distance graph, rooted in vertex x𝑥xitalic_x, defines a chain of compression with as many deltas as the weight of this tree plus the number of nonzero elements of axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. Thus, any MST rooted in the vertex corresponding to the row with the fewest nonzero elements, defines a chain of compression that satisfies point (1).

Worst-case guarantees.

Note that the chain of compression obtained by finding an MST of G𝐺Gitalic_G does not satisfy point (2), because the weight of the lightest incoming edge of any vertex x𝑥xitalic_x can be greater than the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. In such cases, representing axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT with an adjacency list is clearly more memory efficient. To avoid this issue, we extended the distance graph G𝐺Gitalic_G with a virtual vertex 𝟎0\mathbf{0}bold_0 which is connected to all other vertices of the graph. This virtual vertex represents a null row-vector a0{0}nsubscript𝑎0superscript0𝑛\vec{a}_{0}\in\{0\}^{n}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ { 0 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, which ensures that the weight of each edge (𝟎,x)0𝑥(\mathbf{0},x)( bold_0 , italic_x ) is equal to the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. The inclusion of virtual vertex 𝟎0\mathbf{0}bold_0 in the distance graph G𝐺Gitalic_G ensures that the issue described above cannot occur, since the lightest incoming edge of any vertex x𝑥xitalic_x is now at most as heavy as the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. Therefore, any chain of compression characterized by an MST of G𝐺Gitalic_G, rooted on vertex 𝟎0\mathbf{0}bold_0, is guaranteed to satisfy points (1) and (2). If we use this chain of compression to represent a matrix in CBM format, then following property will be observed:

Property 2.1.

The number of deltas required to represent any matrix 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT in Compressed Binary Matrix (CBM) format is never greater than the number of nonzero elements in matrix 𝐀𝐀\mathbf{A}bold_A.

To complete the construction, we simply traverse the compression chain above in topological order, and for every edge (y,x)𝑦𝑥(y,x)( italic_y , italic_x ) visited, we compute the lists of positive and negative deltas required to convert row aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT into axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.

2.1 Time and Space Analysis

Lemma 2.1.

Any matrix 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT can be represented in CBM format in O((m+1)𝐧𝐧𝐳(𝐀)+m2logm)𝑂𝑚1𝐧𝐧𝐳𝐀superscript𝑚2𝑚O((m+1)\ \mathbf{nnz}(\mathbf{A})+m^{2}\log m)italic_O ( ( italic_m + 1 ) bold_nnz ( bold_A ) + italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_log italic_m ) time.

Proof.

The construction of the extended distance graph G𝐺Gitalic_G for 𝐀𝐀\mathbf{A}bold_A requires the computation of m(m+1)/2𝑚𝑚12m(m+1)/2italic_m ( italic_m + 1 ) / 2 Hamming distances between all possible row pairs (ax,ay)subscript𝑎𝑥subscript𝑎𝑦(\vec{a}_{x},\vec{a}_{y})( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ). The Hamming distance of each pair of rows can be reduced to the intersection of their adjacency lists, computed in O(𝐧𝐧𝐳(ax)+𝐧𝐧𝐳(ay))𝑂𝐧𝐧𝐳subscript𝑎𝑥𝐧𝐧𝐳subscript𝑎𝑦O(\mathbf{nnz}(\vec{a}_{x})+\mathbf{nnz}(\vec{a}_{y}))italic_O ( bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) + bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) ) time. Hence, the time to compute all m(m+1)/2𝑚𝑚12m(m+1)/2italic_m ( italic_m + 1 ) / 2 Hamming distances is upper-bounded by

x=0my=0m(𝐧𝐧𝐳(ax)+𝐧𝐧𝐳(ay))=(m+1)𝐧𝐧𝐳(𝐀).superscriptsubscript𝑥0𝑚superscriptsubscript𝑦0𝑚𝐧𝐧𝐳subscript𝑎𝑥𝐧𝐧𝐳subscript𝑎𝑦𝑚1𝐧𝐧𝐳𝐀\sum_{x=0}^{m}\sum_{y=0}^{m}\big{(}\mathbf{nnz}(\vec{a}_{x})+\mathbf{nnz}(\vec% {a}_{y})\big{)}=(m+1)\ \mathbf{nnz}(\mathbf{A}).∑ start_POSTSUBSCRIPT italic_x = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_y = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) + bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) ) = ( italic_m + 1 ) bold_nnz ( bold_A ) . (3)

Additionally, well-known MST algorithms, such as Prim or Kruskal, are known to find an MST in O(ElogV)𝑂𝐸𝑉O(E\log V)italic_O ( italic_E roman_log italic_V ) time, where E𝐸Eitalic_E and V𝑉Vitalic_V denote the number of edges and vertices in the graph. Since the extended distance graph G𝐺Gitalic_G contains m(m+1)/2𝑚𝑚12m(m+1)/2italic_m ( italic_m + 1 ) / 2 edges and m+1𝑚1m+1italic_m + 1 vertices, finding an MST of G𝐺Gitalic_G requires O((m+1)2log(m+1))𝑂superscript𝑚12𝑚1O((m+1)^{2}\log(m+1))italic_O ( ( italic_m + 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_log ( italic_m + 1 ) ) time, and therefore, representing matrix 𝐀𝐀\mathbf{A}bold_A takes time

O((m+1)𝐧𝐧𝐳(𝐀)+(m+1)2log(m+1))).O((m+1)\ \mathbf{nnz}(\mathbf{A})+(m+1)^{2}\log(m+1))).italic_O ( ( italic_m + 1 ) bold_nnz ( bold_A ) + ( italic_m + 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_log ( italic_m + 1 ) ) ) . (4)

Finally, a single list of deltas can also be obtained from the intersection of the adjacency lists of axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. Hence, the computation of 2m2𝑚2m2 italic_m lists of deltas is already accounted for in Equation 4. ∎

Lemma 2.2.

The space required to represent a binary matrix 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT in Compressed Binary Matrix (CBM) format is proportional to O(m+i=0m(|Δx,rx+|+|Δx,rx|))𝑂𝑚superscriptsubscript𝑖0𝑚subscriptsuperscriptΔ𝑥subscript𝑟𝑥subscriptsuperscriptΔ𝑥subscript𝑟𝑥O(m+\sum_{i=0}^{m}(|\Delta^{+}_{x,r_{x}}|+|\Delta^{-}_{x,r_{x}}|))italic_O ( italic_m + ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( | roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT | + | roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ) ), where rxsubscript𝑟𝑥r_{x}italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT represents the index of the row selected to compress row axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.

Proof.

Assuming matrix 𝐀𝐀\mathbf{A}bold_A is represented in CBM format. Then the chain of compression required to represent this matrix consists on: (1) a list of edges that represents any MST rooted in vertex 𝟎0\mathbf{0}bold_0 of graph G𝐺Gitalic_G, and (2) a list of positive and negative deltas for each edge of this MST. Since the extended version of G𝐺Gitalic_G is a fully-connected graph with m+1𝑚1m+1italic_m + 1 vertices, it is known that any MST of G𝐺Gitalic_G contains m𝑚mitalic_m edges. Therefore, the list of edges contains m𝑚mitalic_m elements, and there are m𝑚mitalic_m lists of deltas, whose size totals x=0m(|Δx,rx+|+|Δx,rx|)superscriptsubscript𝑥0𝑚subscriptsuperscriptΔ𝑥subscript𝑟𝑥subscriptsuperscriptΔ𝑥subscript𝑟𝑥\sum_{x=0}^{m}(|\Delta^{+}_{x,r_{x}}|+|\Delta^{-}_{x,r_{x}}|)∑ start_POSTSUBSCRIPT italic_x = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( | roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT | + | roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ). ∎

2.2 Fast Matrix Multiplication with Compressed Binary Matrix Format

Let wn𝑤superscript𝑛\vec{w}\in\mathbb{R}^{n}over→ start_ARG italic_w end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT be a dense and real-valued vector, and axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and aysubscript𝑎𝑦\vec{a}_{y}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT two distinct rows of a binary matrix 𝐀𝐀\mathbf{A}bold_A as previously defined. It follows from Equation 2 that we can resort to the inner-product aywsubscript𝑎𝑦𝑤\vec{a}_{y}\cdot\vec{w}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG to compute axwsubscript𝑎𝑥𝑤\vec{a}_{x}\cdot\vec{w}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG as

axw=((ayΔx,y+)Δx,y)w=(ayw)+(Δx,y+w)(Δx,yw).subscript𝑎𝑥𝑤subscript𝑎𝑦subscriptsuperscriptΔ𝑥𝑦subscriptsuperscriptΔ𝑥𝑦𝑤subscript𝑎𝑦𝑤subscriptsuperscriptΔ𝑥𝑦𝑤subscriptsuperscriptΔ𝑥𝑦𝑤\vec{a}_{x}\cdot\vec{w}=((\vec{a}_{y}\cup\Delta^{+}_{x,y})\setminus\Delta^{-}_% {x,y})\cdot\vec{w}=(\vec{a}_{y}\cdot\vec{w})+(\Delta^{+}_{x,y}\cdot\vec{w})-(% \Delta^{-}_{x,y}\cdot\vec{w}).over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG = ( ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ∪ roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ) ∖ roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ) ⋅ over→ start_ARG italic_w end_ARG = ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG ) + ( roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG ) - ( roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG ) . (5)

This implies that the dot-product axwsubscript𝑎𝑥𝑤\vec{a}_{x}\cdot\vec{w}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_w end_ARG can be calculated in 1+|Δx,y+|+|Δx,y|1subscriptsuperscriptΔ𝑥𝑦subscriptsuperscriptΔ𝑥𝑦1+|\Delta^{+}_{x,y}|+|\Delta^{-}_{x,y}|1 + | roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT | + | roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT | scalar operations, if the value of aywsubscript𝑎𝑦𝑤\vec{a_{y}}\cdot\vec{w}over→ start_ARG italic_a start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG ⋅ over→ start_ARG italic_w end_ARG can be reused. Naturally, we can resort to this strategy to design fast matrix-vector multiplication kernels u𝐀v𝑢𝐀𝑣\vec{u}\leftarrow\mathbf{A}\vec{v}over→ start_ARG italic_u end_ARG ← bold_A over→ start_ARG italic_v end_ARG, where 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, um𝑢superscript𝑚\vec{u}\in\mathbb{R}^{m}over→ start_ARG italic_u end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, and vn𝑣superscript𝑛\vec{v}\in\mathbb{R}^{n}over→ start_ARG italic_v end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, by computing all dot-products between the rows of 𝐀𝐀\mathbf{A}bold_A and v𝑣\vec{v}over→ start_ARG italic_v end_ARG in an order where: (1) the value of the dot-product ayvsubscript𝑎𝑦𝑣\vec{a}_{y}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG is known before calculating axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG, and (2) the value of all dot-products axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG is calculated with respect to the value ayvsubscript𝑎𝑦𝑣\vec{a}_{y}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG that results in the minimum overall number of scalar operations. By definition, the chain of compression of matrix 𝐀𝐀\mathbf{A}bold_A already represents such an ordering. Therefore, we can accelerate matrix-vector multiplication by traversing the chain of compression of matrix 𝐀𝐀\mathbf{A}bold_A in topological order, and for each edge visited compute uxaxvsubscript𝑢𝑥subscript𝑎𝑥𝑣u_{x}\leftarrow\vec{a}_{x}\cdot\vec{v}italic_u start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ← over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG as

uxuy+(Δx,y+v)(Δx,yv),subscript𝑢𝑥subscript𝑢𝑦subscriptsuperscriptΔ𝑥𝑦𝑣subscriptsuperscriptΔ𝑥𝑦𝑣u_{x}\leftarrow u_{y}+(\Delta^{+}_{x,y}\cdot\vec{v})-(\Delta^{-}_{x,y}\cdot% \vec{v}),italic_u start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ← italic_u start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT + ( roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG ) - ( roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG ) , (6)

where uysubscript𝑢𝑦u_{y}italic_u start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT is known to already contain the value ayvsubscript𝑎𝑦𝑣\vec{a}_{y}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG. Note that classic sparse-dense matrix-vector multiplication kernels compute u𝐀v𝑢𝐀𝑣\vec{u}\leftarrow\mathbf{A}\cdot\vec{v}over→ start_ARG italic_u end_ARG ← bold_A ⋅ over→ start_ARG italic_v end_ARG in 𝐧𝐧𝐳(𝐀)𝐧𝐧𝐳𝐀\mathbf{nnz}(\mathbf{A})bold_nnz ( bold_A ) scalar operations, where 𝐧𝐧𝐳(𝐀)𝐧𝐧𝐳𝐀\mathbf{nnz}(\mathbf{A})bold_nnz ( bold_A ) represents the numbers of nonzero elements in matrix 𝐀𝐀\mathbf{A}bold_A. Assuming 𝐀𝐀\mathbf{A}bold_A is represented in CBM format, then the number of deltas required to represent each row axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT of 𝐀𝐀\mathbf{A}bold_A is known to be smaller than, or equal to, the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. If the number of deltas required to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is strictly smaller than 𝐧𝐧𝐳(ax)𝐧𝐧𝐳subscript𝑎𝑥\mathbf{nnz}(\vec{a}_{x})bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ), then it is clear that the dot-product axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG requires at most 𝐧𝐧𝐳(ax)𝐧𝐧𝐳subscript𝑎𝑥\mathbf{nnz}(\vec{a}_{x})bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) scalar operations. On the other hand, if the number of deltas required to represent axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is the same as the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, Equation 5 suggests that the dot-product axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG would be computed in 1+𝐧𝐧𝐳(ax)1𝐧𝐧𝐳subscript𝑎𝑥1+\mathbf{nnz}(\vec{a}_{x})1 + bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) scalar operations. This scenario can be avoided by engineering the MST algorithm to select an out-going edge of the virtual node 0 in case of draw. Since axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT was not compressed in this scenario, the number of scalar operations required to compute this dot-product is exactly 𝐧𝐧𝐳(ax)𝐧𝐧𝐳subscript𝑎𝑥\mathbf{nnz}(\vec{a}_{x})bold_nnz ( over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ). As the number of scalar operations required to compute axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG never surpasses the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT for x=1,,m𝑥1𝑚x=1,\dots,mitalic_x = 1 , … , italic_m, the following property becomes evident:

Property 2.2.

The number of scalar operations required to compute matrix-vector multiplication based on the Compressed Binary Matrix (CBM) format is never greater than those required to compute matrix-vector multiplication based on classic sparse formats.

Additionally, Equation 5 shows that our matrix-vector multiplication strategy does not require the allocation of additional buffers, since the value of the dot-product required to compute axvsubscript𝑎𝑥𝑣\vec{a}_{x}\cdot\vec{v}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_v end_ARG is guaranteed to already be stored in vector u𝑢\vec{u}over→ start_ARG italic_u end_ARG. Therefore, the following property is observed:

Property 2.3.

The amount of memory required to compute matrix-vector multiplication based on the Compressed Binary Matrix (CBM) format is proportional to the size of its operands and remains constant during execution time.

Intuitively, we can resort to the matrix-vector multiplication strategy described above to design fast matrix-matrix multiplication kernels as described in Algorithm 1. This algorithm assumes that the left-hand side operand matrix 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT is represented in CBM format, while matrices 𝐁n×p𝐁superscript𝑛𝑝\mathbf{B}\in\mathbb{R}^{n\times p}bold_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_p end_POSTSUPERSCRIPT and 𝐂m×p𝐂superscript𝑚𝑝\mathbf{C}\in\mathbb{R}^{m\times p}bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_p end_POSTSUPERSCRIPT are dense and correspond to the right-hand side operand matrix and to the product matrix, respectively. As it can be observed, Algorithm 1 computes the matrix-vector product between matrix 𝐀𝐀\mathbf{A}bold_A and each column of matrix 𝐁𝐁\mathbf{B}bold_B. Therefore, we can conclude that Properties 2.2 and 2.3 hold for the matrix-matrix multiplication case.

Input: 𝐀{0,1}m×n𝐀superscript01𝑚𝑛\mathbf{A}\in\{0,1\}^{m\times n}bold_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, 𝐁n×p𝐁superscript𝑛𝑝\mathbf{B}\in\mathbb{R}^{n\times p}bold_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_p end_POSTSUPERSCRIPT, 𝐂m×p𝐂superscript𝑚𝑝\mathbf{C}\in\mathbb{R}^{m\times p}bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_p end_POSTSUPERSCRIPT
1foreach edge (y,x) chain of compression of 𝐀𝑦𝑥 chain of compression of 𝐀(y,x)\in\text{ chain of compression of }\mathbf{A}( italic_y , italic_x ) ∈ chain of compression of bold_A (in topological order) do
      2 foreach column vector bk𝐁subscript𝑏𝑘𝐁\vec{b}_{k}\in\mathbf{B}over→ start_ARG italic_b end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ bold_B , where k[1,p[k\in[1,p[italic_k ∈ [ 1 , italic_p [  do
            3 𝐂x,k𝐂y,k+(Δx,y+bk)(Δx,ybk)subscript𝐂𝑥𝑘subscript𝐂𝑦𝑘subscriptsuperscriptΔ𝑥𝑦subscript𝑏𝑘subscriptsuperscriptΔ𝑥𝑦subscript𝑏𝑘\mathbf{C}_{x,k}\leftarrow\mathbf{C}_{y,k}+(\Delta^{+}_{x,y}\cdot\vec{b}_{k})-% (\Delta^{-}_{x,y}\cdot\vec{b}_{k})bold_C start_POSTSUBSCRIPT italic_x , italic_k end_POSTSUBSCRIPT ← bold_C start_POSTSUBSCRIPT italic_y , italic_k end_POSTSUBSCRIPT + ( roman_Δ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_b end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ( roman_Δ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x , italic_y end_POSTSUBSCRIPT ⋅ over→ start_ARG italic_b end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT );
            
      
return 𝐂𝐂\mathbf{C}bold_C
Algorithm 1 Matrix-Matrix product with CBM

Leveraging High Performance SpMM kernels.

The representation of a binary matrix 𝐀𝐀\mathbf{A}bold_A in CBM format was until now conceptualized as a chain of compression, where each node of this chain is associated with two lists of deltas. As is, a compressed matrix cannot be represented in a sparse format capable of leveraging efficient SpMM kernels. To address this issue, we represent the lists of deltas that characterize our format as a matrix 𝐀m×nsuperscript𝐀superscript𝑚𝑛\mathbf{A^{\prime}}\in\mathbb{R}^{m\times n}bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT, which can be represented in any convenient matrix format, and leverage efficient SpMM kernels provided by Intel MKL to compute all dot-products of Algorithm 1 in a single matrix product 𝐀𝐁superscript𝐀𝐁\mathbf{A^{\prime}}\mathbf{B}bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_B. Once 𝐀𝐁superscript𝐀𝐁\mathbf{A^{\prime}}\mathbf{B}bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_B is stored in matrix 𝐂𝐂\mathbf{C}bold_C, we can finalize the matrix multiplication with CBM, by traversing the chain of compression in topological order, and updating row cxsubscript𝑐𝑥\vec{c}_{x}over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT of 𝐂𝐂\mathbf{C}bold_C as cx:=cx+cyassignsubscript𝑐𝑥subscript𝑐𝑥subscript𝑐𝑦\vec{c}_{x}:=\vec{c}_{x}+\vec{c}_{y}over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT := over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT for each edge (y,x)𝑦𝑥(y,x)( italic_y , italic_x ) that was visited. Naturally, these updates can also leverage efficient AXPY kernels, which are also provided by Intel MKL.

Multi-threading parallelism.

As suggested in the previous paragraph, matrix multiplication with the CBM format can be divided into two stages. The first one computes the product 𝐂:=𝐀𝐁assign𝐂superscript𝐀𝐁\mathbf{C}:=\mathbf{A^{\prime}B}bold_C := bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_B which is embarrassingly parallel, and Intel MKL already provides efficient multi-threaded and vectorized implementations of this operation. The second stage involves updating the rows of matrix 𝐂𝐂\mathbf{C}bold_C with respect to the chain of compression. This stage presents data-dependencies, since the final value of row cxsubscript𝑐𝑥\vec{c}_{x}over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT of 𝐂𝐂\mathbf{C}bold_C can only be calculated once cysubscript𝑐𝑦\vec{c}_{y}over→ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT is known. There are however no dependencies between different branches of the chain of compression. Therefore, we can parallelize this stage by concurrently updating the rows of matrix 𝐂𝐂\mathbf{C}bold_C that are found in different branches of the virtual node 0.

Extending SpMM with CBM to normalized adjacency matrices.

Let 𝐀^n×n^𝐀superscript𝑛𝑛\hat{\mathbf{A}}\in\mathbb{R}^{n\times n}over^ start_ARG bold_A end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT represent the normalized adjacency matrix of an unweighted graph, where 𝐀^=𝐃12(𝐀+𝐈)𝐃12^𝐀superscript𝐃12𝐀𝐈superscript𝐃12\hat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-% \frac{1}{2}}over^ start_ARG bold_A end_ARG = bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT. As is, we cannot resort to the CBM format to represent 𝐀^^𝐀\hat{\mathbf{A}}over^ start_ARG bold_A end_ARG since this matrix is not binary. However, fast matrix multiplication is still possible. Note that 𝐃12superscript𝐃12\mathbf{D}^{-\frac{1}{2}}bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT is a diagonal matrix, and therefore (𝐀+𝐈)𝐃12𝐀𝐈superscript𝐃12(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT corresponds to a column-scaled matrix, i.e., the elements in a column of the matrix are either 0 or have a constant value that is unique to this column. We can represent this matrix in CBM format, by simply multiplying the corresponding matrix of deltas (𝐀+𝐈)superscript𝐀𝐈(\mathbf{A^{\prime}}+\mathbf{I})( bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + bold_I ) by 𝐃12superscript𝐃12\mathbf{D}^{-\frac{1}{2}}bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT. At this point, we can efficiently compute 𝐂:=((𝐀+𝐈)𝐃12)𝐁assign𝐂𝐀𝐈superscript𝐃12𝐁\mathbf{C}:=((\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}})\mathbf{B}bold_C := ( ( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ) bold_B as described previously. Finally, note that 𝐃12(𝐀+𝐈)𝐃12𝐁)\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}% \mathbf{B})bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT bold_B ), simply scales the rows of matrix (𝐀+𝐈)𝐃12𝐁𝐀𝐈superscript𝐃12𝐁(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}\mathbf{B}( bold_A + bold_I ) bold_D start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT bold_B. The cost of scaling the rows of this matrix can be hidden by fusing it with the update step of the product matrix 𝐂𝐂\mathbf{C}bold_C.

Speeding up SpMM with Edge Pruning.

Note that not all compression opportunities contribute to faster matrix multiplication kernels. The overheads associated with differential compression, such as traversing the chain of compression and updating the result matrix, might overcome potential performance gains if the number of scalar operations saved are not above a certain threshold. To address this issue, we can prune all edges of the distance graph of 𝐀𝐀\mathbf{A}bold_A where the number of scalar operations saved does not meet a user-defined threshold α𝛼\alpha\in\mathbb{N}italic_α ∈ blackboard_N. For each edge (y,x)𝑦𝑥(y,x)( italic_y , italic_x ) in the extended distance graph of 𝐀𝐀\mathbf{A}bold_A, prune this edge if its weight is greater than the number of nonzero elements in axsubscript𝑎𝑥\vec{a}_{x}over→ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT minus α𝛼\alphaitalic_α. Naturally, if we prune the edges of the extended distance graph of 𝐀𝐀\mathbf{A}bold_A in this manner, it is possible for a single edge direction to be pruned, while the opposite direction remains in the graph. Therefore, the extended distance graph of 𝐀𝐀\mathbf{A}bold_A is now directed, and a suitable chain of compression corresponds to a Minimum Cost Arborescence (MCA) rooted in the virtual node 0. Note that our compression algorithm remains correct, since the extended distance graph of 𝐀𝐀\mathbf{A}bold_A contains an out-going edge from the virtual node 0 to all other nodes; Furthermore, the time required to build the CBM format, as shown in Lemma 2.1, remains unchanged since finding an MCA or an MST present the same time complexity  [17].

3 Experimental Evaluation

Experimental Setting.

The experiments found in this section were run on an Intel Xeon Gold 6130 (Skylake) CPU with 16 physical cores and 2.1 GHz fixed clock frequency. This machine runs on CentOS Linux 7 (version 3.10.0) operating system. The SpMM kernels tested in this section were implemented in C++, and rely on Intel MKL 24.0 sparse CSR format and corresponding matrix multiplication kernels. The C++ code developed in this work is then called from Python 3.11, via PyTorch C++ Extensions to reliably emulate common use-cases. Parallel experiments (with 16 cores) were implemented with OpenMP 4.5, and the threads were pinned to physical cores with GOMP_CPU_AFFINITY="0-15" environment variable.

3.1 Evaluation Metrics

The CBM format was evaluated with respect to quality of compression achieved (compression ratio), and the time required (runtime reduction) to compute SpMM and to infer a 2-layer GCN, when the adjacency matrix of the graphs are represented in our format. The compression ratio is defined as the ratio between the memory required to represent a matrix in CSR format and CBM format. In our implementation, the CBM format is composed by the corresponding matrix of deltas 𝐀superscript𝐀\mathbf{A^{\prime}}bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and a tree representing the chain of compression, both stored in CSR format. In the context of sparse-dense matrix multiplication (SpMM), the runtime reduction is measured by comparing the average time, out of 50 runs, it takes to perform matrix multiplication with a randomly generated dense matrix with 500 columns using the CSR format, to the time taken to compute the same matrix multiplication with the CBM format. The formula used to capture this metric is TCSRTCBMTCSR×100%subscript𝑇CSRsubscript𝑇CBMsubscript𝑇CSRpercent100\frac{T_{\text{CSR}}-T_{\text{CBM}}}{T_{\text{CSR}}}\times 100\%divide start_ARG italic_T start_POSTSUBSCRIPT CSR end_POSTSUBSCRIPT - italic_T start_POSTSUBSCRIPT CBM end_POSTSUBSCRIPT end_ARG start_ARG italic_T start_POSTSUBSCRIPT CSR end_POSTSUBSCRIPT end_ARG × 100 %, where TCSRsubscript𝑇CSRT_{\text{CSR}}italic_T start_POSTSUBSCRIPT CSR end_POSTSUBSCRIPT is the time required to carry out sparse-dense matrix multiplication with CSR by the state-of-the-art SpMM implementation offered by Intel MKL, and TCBMsubscript𝑇CBMT_{\text{CBM}}italic_T start_POSTSUBSCRIPT CBM end_POSTSUBSCRIPT is the time taken to compute the same product with CBM. We used same formula and number of runs in the context of GCN inference, however, TCSRsubscript𝑇CSRT_{\text{CSR}}italic_T start_POSTSUBSCRIPT CSR end_POSTSUBSCRIPT and TCBMsubscript𝑇CBMT_{\text{CBM}}italic_T start_POSTSUBSCRIPT CBM end_POSTSUBSCRIPT correspond to the time required by the inference stage of this network by resorting to SpMM kernels based on CSR and CBM, respectively. It is important to note that we did not consider the SpMM kernels that are native to PyTorch, because these kernels were substantially slower than the ones implemented in Intel MKL.

3.2 Datasets

To demonstrate the advantages of the CBM format in the context of SpMM and GCN inference, we selected six real-world graphs of varying size and average in-degree as depicted in Table 1. The selected graphs depict relationships between authors and/or academic papers, where nodes tend to share many common neighbors. This property suggests that the adjacency matrices of these graphs are good candidates to be represented in CBM format.

In co-paper graphs, each node represents a paper. An undirected edge is placed between two nodes if the corresponding papers share at least one common author. They depict the interconnection and collaborative patterns between various academic publications. Co-author graphs represent scientific collaborations between authors of academic papers. Here, nodes correspond to authors, and an undirected edge is placed between two nodes if authors of the nodes have co-authored a paper together. If a paper is authored collaboratively by a group of authors, it results in a fully connected subgraph, or clique, encompassing those grouped nodes. Citation graphs are directed graphs where each node represents an academic paper. Directed edges in these graphs illustrate citations, with an edge pointing from the citing paper to the cited paper. These graphs highlight the directional flow of information and the influence of one paper upon another within the academic community.

Table 1: Summary of key datasets used to evaluate SpMM and GNN efficiency and scalability, categorized by graph type and sorted by average in-degree. This table presents the total number of nodes, edges, and average in-degree.
Graph Type Nodes Edges Avg. In-degree
coPapersCiteseer [18] Co-paper 434,102 32,073,440 74.8
coPapersDBLP [18] Co-paper 540,486 30,491,458 57.4
ca-AstroPh [19] Co-author 18,772 396,160 22.1
ca-HepPh [19] Co-author 12,008 237,010 20.7
PubMed [20] Citation network 19,717 88,648 5.4
Cora [20] Citation network 2,708 10,556 4.8

3.3 Sparse-Dense Matrix Multiplication (SpMM) Evaluation

Finding the best α𝛼\alphaitalic_α is key to improve the performance of matrix multiplication with CBM. Adjusting this parameter not only reduces overhead associated with traversing the compression chain, but also exposes more parallelism opportunities as it increases the out-degree of the virtual node. Given the importance of α𝛼\alphaitalic_α we first consider the case where α=0𝛼0\alpha=0italic_α = 0 and our edge pruning technique was not applied, and then we show how fine-tuning α𝛼\alphaitalic_α improves the quality of matrix multiplication with CBM.

Refer to caption
Refer to caption
(a) coPapersCiteseer Dataset
Refer to caption
(b) coPapersDBLP Dataset
Refer to caption
(c) ca-AstroPh Dataset
Refer to caption
(d) ca-HepPh Dataset
Refer to caption
(e) Cora Dataset
Refer to caption
(f) PubMed Dataset
Figure 1: Performance impact of different α𝛼\alphaitalic_α values for the CBM format. These plots compare runtime reduction in sequential and parallel environments, along side with the compression rate across various α𝛼\alphaitalic_α values. The x-axis lists the α𝛼\alphaitalic_α values, the left y-axis shows runtime reduction relative to the original sparse-dense matrix product (SpMM), while the right y-axis shows the compression ratio against the dataset size in CSR.

α𝛼\alphaitalic_α equal to zero.

Representing Cora (Fig. 1(e)) and PubMed (Fig. 1(f)) datasets in CBM resulted in minimal, and even negative, compression gains with respect to CSR. This is likely caused by the small average in-degree of these graphs, which suggests that the compression opportunities found in these graphs do not offset the memory overhead required to represent the corresponding chains of compression. As expected, the poor compression rate of these datasets led to no speedup in the context of matrix multiplication. On the other hand, compressing ca-AstroPh (Fig. 1(c)) and ca-HepPH (Fig. 1(d)) with our format respectively increased the compression ratios of both datasets to 2.6×\times× and 1.7×\times×, subsequently accelerating matrix product for both datasets. Our matrix multiplication strategy achieved a runtime reduction of 26% and 8% for ca-AstroPh in sequential and parallel environments, respectively. For ca-HepPh the same multiplication kernel presented a runtime reduction of 40% and 18% also in sequential and parallel environments. The datasets that benefited the most from our format were the coPapersCiteseer (Fig. 1(a)) and coPapersDBLP (Fig. 1(b)), achieving impressive compression ratios of 9.8×\times× and 5.9×\times×, leading to substantial improvements in computational performance. Our matrix multiplication kernel with coPapersCiteseer achieved a runtime reduction of 71% and 77% in sequential and parallel environments, while the same kernel with coPapersDBLP presented a runtime reduction of 59% and 61% in the same experimental settings. These results highlight the potential of the CBM format to efficiently compress and multiply unweighted graphs that present natural communities and an high average in-degree.

α𝛼\alphaitalic_α greater than zero.

Setting α𝛼\alphaitalic_α to a value greater than 8 reduced the overhead associated with traversing the chain of compression of Cora (Fig. 1(e)) and PubMed (Fig. 1(f)), making the performance decay observed for α=0𝛼0\alpha=0italic_α = 0 negligible in sequential and parallel settings, even when no compression gains are observed. Increasing the value of α𝛼\alphaitalic_α is also beneficial for sequential matrix multiplication with ca-AstroPh (Fig. 1(c)) and ca-HepPh (Fig. 1(d)), improving the respective runtime reduction from 26% up to 28% (α=2𝛼2\alpha=2italic_α = 2) and from 40% up to 44% (α=4𝛼4\alpha=4italic_α = 4). Nevertheless, it is important to note that our compression algorithm will start to ignore good compression opportunities once α𝛼\alphaitalic_α is large enough, worsening the performance of our matrix multiplication strategy. This effect is evident in both datasets for α𝛼\alphaitalic_α greater than 16, where the compression ratio decreases alongside with the runtime reduction. Experiments with both co-author datasets show that adjusting α𝛼\alphaitalic_α is even more important in the parallel case, increasing the runtime reduction of our multiplication strategy with ca-AstroPh from 8% up to 11% (α=8𝛼8\alpha=8italic_α = 8), and with ca-HepPh from 18% up to 26% (α=64𝛼64\alpha=64italic_α = 64). Furthermore, these experiments confirm that the degree of parallelism of our multiplication strategy increases concurrently with α𝛼\alphaitalic_α. This effect is easily observed for ca-HepPh (Fig.1(d)), where the performance of our matrix multiplication kernel sharply declines for α𝛼\alphaitalic_α greater than 2, followed by a steep increase in performance when α𝛼\alphaitalic_α equals to 32 or 64 (even though our compression algorithm is already ignoring good compression opportunities at this point). Finding the best α𝛼\alphaitalic_α is not as relevant for coPapersCiteseer (Fig.1(a)) and coPapersDBLP (Fig.1(b)), as most compression opportunities save more than 8 scalar operations in the context of our matrix multiplication strategy. This observation is verified, since the compression ratio of these datasets shows little to no decrease for α𝛼\alphaitalic_α smaller than 8. This effect is most likely due to the high average in-degree of both datasets. Still, adjusting α𝛼\alphaitalic_α is required to obtain the best runtime reduction for both coPapers datasets. In our experiments our matrix multiplication kernel with coPapersCiteseer achieved a peak runtime reduction of 71% (α=2𝛼2\alpha=2italic_α = 2) and 79% (α=32𝛼32\alpha=32italic_α = 32) in sequential and parallel environments, while the same kernel with coPapersDBLP peaks at 59% (α=𝛼absent\alpha=italic_α =4) and 63% (α=16𝛼16\alpha=16italic_α = 16) also in sequential and parallel environments.

Refer to caption
Figure 2: Performance impact of integrating the CBM format and corresponding sparse-dense matrix multiplication kernel (SpMM) in the inference stage of a 2-layer GCN across the different datasets. The y-axis shows the runtime reduction achieved in GNN inference by matrix multiplication with CBM against the matrix multiplication kernel with CSR provided by Intel MKL.

3.4 Integrating CBM with GCN

To assess the impact of CBM in GNNs, we considered the runtime reduction of an inference stage of a 2-layer GCN with 500 features where the normalized adjacency matrix of each dataset is represent in our format. Matrix products involving the normalized adjacency matrix were carried out with our extended matrix multiplication kernel, as described in Section 2.2. As baseline for our experiments, we selected the same 2-layer GCN where the normalized adjacency matrix is represented in CSR and any products involving this matrix are carried out by the SpMM implementation found in Intel MKL. These experiments are illustrated in Figure 2. To keep the discussion concise, we only considered the values of α𝛼\alphaitalic_α that led to the best matrix multiplication performance for each dataset analyzed in Section 3.3.

Representing both Cora and PubMed datasets in CBM format increased the inference time of the corresponding GCNs compared to the baseline. This behavior is expected because the only steps of the inference that benefit from the CBM format are the products involving the normalized adjacency matrix. As previously shown, compressing these datasets using our format did not accelerate the corresponding matrix products. Experiments with both co-author datasets demonstrated that our format can reduce the inference time for GCN models. For ca-AstroPh, our format reduced the inference time of the network by 17% in sequential environments and 3% in parallel environments. Similarly, for ca-HepPh, our format achieved a runtime reduction of 22% in sequential environments and 11% in parallel environments. Representing both co-Paper datasets in our format resulted in the highest runtime reductions during GCN inference. For coPapersCiteseer, our format achieved an average runtime reduction of 48% in sequential environments and 66% in parallel environments. For coPapersDBLP, our format achieved a runtime reduction of 43% in sequential environments and 52% in parallel environments.

4 Final Remarks

In this work we proposed the Compressed Binary Matrix (CBM) format which simultaneously reduces the memory footprint of unweighted graphs and binary matrices, and enables the implementation of new matrix multiplication kernels that might be significantly faster than the current state-of-the-art. Experimental evaluation results shown significant speedups, both in sequential and parallel environments, up to 5×\times×. We obtained also significant performance improvements in the context of the GCN inference stage by integrating the CBM format in a deep learning framework, namely PyTorch, observing speed ups of 3×\times×. Although we did not discuss the CBM format construction time, we observe that it can be built in a reasonable amount of time for a dataset provider. In our experiments it took us less than 16 seconds to convert the largest dataset into our format in a sequential CPU environment. It is important to stress that the effectiveness of our format depends on the specific dataset, as discussed in the experimental evaluation. While we suspect that graphs with a high average degree and a tendency to form communities are good candidates, the best way to determine if a graph is suitable is to examine the compression ratio achieved by our format for a reasonable value of α𝛼\alphaitalic_α. Finally we highlight that our format is future proof, since future optimizations to high-performance SpMM kernels will also accelerate matrix multiplication with the CBM format. Future work concerns integrating and evaluating the CBM format in the context of different GNNs architectures, and also targeting the training stage of this networks. Additionally, we intend to implement and evaluate our format and corresponding multiplication kernels in GPU architectures.

Acknowledgments

This work has been supported by the Innovation Study CBM4scale, funded by the Inno4scale project, which is funded by the European High-Performance Computing Joint Undertaking (JU) under Grant Agreement No 101118139. The JU receives support from the European Union’s Horizon Europe Programme.

Author Contributions

JNFA devised the main conceptual of the Compressed Binary Matrix (CBM), including the various matrix multiplication algorithms and their optimizations. JNFA also implemented the CBM format and the related multiplication kernels in C++, including the integration of SpMM and AXPY (from Intel MKL) into the matrix multiplications kernels based on the CBM format, as proposed in Section 2.2. Additionally, JNFA implemented an interface that enables calling the previous C++ routines from Python. SM designed the Python benchmarks to compare CBM and CSR in matrix multiplication and compression quality, as described in Section 3.3. SM also integrated the CBM and CSR-based matrix multiplication kernels into the Message Passing Layer (MPL) to evaluate the impact of the CBM format during the inference stage of a 2-layer GCN model implemented in PyTorch, as discussed in Section 3.4. SB, APF, WNG, and LMSR supervised the project. JNFA wrote the bulk of this draft. All authors provided critical feedback, participated in discussions, contributed to the interpretation of the results, and approved the final manuscript.

References

  • [1] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021.
  • [2] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [3] Y. Shao, H. Li, X. Gu, H. Yin, Y. Li, X. Miao, W. Zhang, B. Cui, and L. Chen. Distributed graph neural network training: A survey. ACM Comput. Surv., 56(8), 2024.
  • [4] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • [5] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [6] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [7] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
  • [8] J. Alman and V. Vassilevska Williams. Further limitations of the known approaches for matrix multiplication. In A. R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pages 25:1–25:15. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018.
  • [9] C. Karande, K. Chellapilla, and R. Andersen. Speeding up algorithms on compressed web graphs. Internet Mathematics, 6(3):227–256, 2009.
  • [10] A. Abboud, A. Backurs, K. Bringmann, and M. Künnemann. Impossibility results for grammar-compressed linear algebra. Advances in Neural Information Processing Systems, 33:8810–8823, 2020.
  • [11] M. Nishino, N. Yasuda, S. i. Minato, and M. Nagata. Accelerating graph adjacency matrix multiplications with adjacency forest. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 1073–1081. SIAM, 2014.
  • [12] A. P. Francisco, T. Gagie, D. Köppl, S. Ladra, and G. Navarro. Graph compression for adjacency-matrix multiplication. SN Computer Science, 3(3):193, 2022.
  • [13] P. Boldi and S. Vigna. The webgraph framework i: compression techniques. In Proceedings of the 13th international conference on World Wide Web, pages 595–602, 2004.
  • [14] C. Hernández and G. Navarro. Compressed representations for web and social graphs. Knowledge and Information Systems, 40(2):279–313, 2014.
  • [15] A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, and B. Reinwald. Compressed linear algebra for declarative large-scale machine learning. Communications of the ACM, 62(5):83–91, 2019.
  • [16] A. Björklund and A. Lingas. Fast boolean matrix multiplication for highly clustered data. In Algorithms and Data Structures: 7th International Workshop, WADS 2001 Providence, RI, USA, August 8–10, 2001 Proceedings 7, pages 258–263. Springer, 2001.
  • [17] M. Böther, O. Kißig, and C. Weyand. Efficiently computing directed minimum spanning trees. In 2023 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), pages 86–95. SIAM, 2023.
  • [18] R. A. Rossi and N. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI Conference on Artificial Intelligence, 2015.
  • [19] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. June 2014.
  • [20] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, pages 40–48, 2016.