1 Introduction
Irregular memory access and low computational intensity inherent to graph processing cause major performance challenges on traditional hardware (e.g., CPUs) [1, 5, 13, 20].
Field programmable gate arrays (FPGAs) promise to accelerate common graph problems like breadth-first search (BFS), PageRank (PR), and weakly connected components (WCC) with their flexible memory hierarchy (e.g., low-latency on-chip memory) and massive parallelism [5]. Still, memory bandwidth is the bottleneck of graph processing even for highly optimized FPGA implementations. Thus, graph processing accelerators like AccuGraph [26] and FabGraph [24] utilize graph compression and asynchronous graph processing to reduce the load on the memory sub-system. Figure 1 shows the potential of these two graph processing accelerator properties. For graphs with a large average degree, a well-known compressed graph data structure like compressed sparse row (CSR) almost halves the number of bytes per edge to be processed. Asynchronous graph processing, in turn, may lead to a significant decrease in iterations over the graph. However, these approaches have not yet been scaled to multiple memory channels, limiting their performance on modern hardware [10].
In [12], we introduced GraphScale, the first scalable graph processing framework for FPGAs based on asynchronous graph processing on a compressed graph. GraphScale competes with other scalable graph processing frameworks that work on labeled graphs, like HitGraph [29] and ThunderGP [7], which, however, do not leverage the potential shown in Figure 1. While for asynchronous graph processing the challenge lies in handling the high-bandwidth data flow of vertex label reads and writes to on-chip scratch pads at scale, the CSR-compressed graph adds design complexity and higher resource utilization for materializing compressed edges on-chip, which is challenging when scaling to multiple memory channels. GraphScale combines asynchronous graph processing and graph compression by making the following contributions:
• We designed an asynchronous graph framework, efficiently solving common graph problems on compressed data (Section 3.1).
• We designed a scalable, resource-efficient two-level vertex label crossbar (Section 3.2).
• We proposed a novel two-dimensional partitioning scheme for graph processing on multiple memory channels (Section 3.3).
• We evaluated our approach against state-of-the-art graph processors: AccuGraph, FabGraph, HitGraph, and ThunderGP (Sections 5.1 and 5.2).
GraphScale combines asynchronous graph processing and graph compression with elastic scalability to many memory channels, properties that were previously mutually exclusive in other graph processing accelerators. Thus, we are able to configure the number of memory channels and the amount of compute power, which is not possible for the existing asynchronous graph processing frameworks working on compressed graphs, i.e., AccuGraph and FabGraph. However, we also identified two main challenges with GraphScale: (1) scaling beyond four DDR4 memory channels, which has not been shown for any of the existing graph processing frameworks, and (2) the increasing partitioning overhead when processing large graphs.
In this article, we extend our work on GraphScale from [12] by addressing the aforementioned challenges (1)–(2). For (1), we scale GraphScale to high-bandwidth memory (HBM), which allows for up to 32 memory channels, and for (2), we apply additional compression to the pointers array of the underlying graph data structure to reduce the identified partitioning overhead and thus improve performance when processing larger graphs. In summary, this extended article on the GraphScale system from [12] makes the following additional contributions:
• We design an OpenCL wrapper to enable scaling of GraphScale to HBM and discuss technical limitations of scalability (Section 5.3).
• We analyze which integer compression technique is best suited for implementation on the FPGA and provides the best compression ratio (Sections 2.2 and 4.1).
• We design and evaluate an inline decompressor based on the binary packing compression technique (Sections 4.2 and 5.4).
• We propose a performance model that explains why the performance improvement from the inline decompressor can only be observed on synthetic graphs (Section 5.5).
The resulting GraphScale system shows promising scalability with a maximum speedup of \(4.77\times\) on dense graphs and an average speedup of \(1.86\times\) over AccuGraph, FabGraph, HitGraph, and ThunderGP. Scaling to HBM provides a further speedup of \(1.53 \times\) over GraphScale running on DDR4 memory. Overall, we conjecture that asynchronous processing on a compressed graph with multi-channel memory scaling improves the graph processing performance but leads to interesting tradeoffs (e.g., for large graphs where partitioning overhead dominates performance). We show how this partitioning overhead can be tackled with an inline binary packing decompressor with a resulting average performance improvement of \(1.25 \times\) on synthetic graphs.
2 Background and Related Work
In this section, we introduce data structures, implementation schemes, and graph problems important to graph processing on FPGAs. Additionally, we discuss integer compression techniques, related work, and our contributions in the context of current graph accelerators.
2.1 Graph Processing
Graphs are abstract data structures (\(G = (V,E)\)) comprising a vertex set \(V\) and an edge set \(E \subseteq V \times V\). Current graph processing accelerators represent the graphs they work on in memory either as an array of edges, also called an edge list, or as a set of adjacency lists, each containing a vertex’s neighbors in the graph. One possible implementation of an adjacency lists structure is CSR, which we use in GraphScale. Unlike [27] and ScalaBFS [19], we attach a label to each vertex to allow for more meaningful graph processing. Additionally, graph processing accelerators utilize two dimensions of partitioning: vertical and horizontal. Vertical partitioning divides the vertex set into intervals such that each partition contains the incoming edges of one interval. In contrast, horizontal partitioning again divides the vertex set into intervals, but each partition contains the outgoing edges of one interval. In GraphScale, we horizontally partition the set of inverse edges (the result for an example graph is shown in Figure 2) and extend this scheme to work on multiple memory channels. The values of the pointers array (P) at positions \(i\) and \(i{+}1\) delimit the neighbors (N) of \(v_i\). For example, for vertex \(v_5\) in partition 1 these are the values of the neighbors array between 2 and 4 (i.e., \(v_3\) and \(v_4\)). As a third partitioning approach, interval-shard partitioning [30] employs both vertical and horizontal partitioning at once.
Depending on the underlying graph data structure, graphs are processed based on two fundamentally different approaches: edge- and vertex-centric graph processing. Edge-centric systems iterate over the edges as primitives of the graph on an underlying edge list. Vertex-centric systems iterate over the vertices and their neighbors as primitives of the graph on an underlying adjacency lists data structure (e.g., CSR). The vertex-centric approach can be further divided into push- and pull-based data flow. A push-based data flow denotes that updates are pushed along the forward direction of edges to update neighboring vertices, while in a pull-based data flow updates are pulled along the inverse direction of edges from neighboring vertices to update the current vertex. Lastly, there are currently two dominant update propagation schemes. Asynchronous graph processing directly applies updates to the working vertex label set when they are produced, and synchronous graph processing collects all updates in memory and applies them only after the iteration is finished.
In the context of this work, we consider the three graph problems BFS, PR, and WCC. BFS denotes a sequence of visiting the vertices of a graph. Vertices are labeled with their distance (in length of the shortest path in number of edges) to a root vertex. WCC specifies as output for each vertex its affiliation to a weakly connected component. Two vertices are in the same weakly connected component if there is an undirected path between them. PR is a measure to describe the importance of vertices in a graph. It is calculated as \(p(i, t{+}1) = \frac{1 - d}{|V|} + d \cdot \sum _{j \in N_G(i)} \frac{p(j, t)}{d_G(j)}\) for each \(i \in V\) with damping factor \(d\), neighbors \(N_G\), degree \(d_G\), and iteration \(t\).
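As a point of reference for the PR formula, a pull-based software formulation of one PR iteration could look as follows (a plain Python sketch for illustration, not GraphScale code; the damping factor, in-neighbors, and out-degrees are passed explicitly):

```python
# One pull-based PageRank iteration following the formula above.
# in_neighbors[i] lists all j with an edge (j, i); out_degree[j] is d_G(j).
def pagerank_iteration(p, in_neighbors, out_degree, d=0.85):
    n = len(p)  # |V|
    p_next = [0.0] * n
    for i in range(n):
        pulled = sum(p[j] / out_degree[j] for j in in_neighbors[i])
        p_next[i] = (1 - d) / n + d * pulled
    return p_next
```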
2.2 Integer Compression
Table 1 shows a selection of integer compression techniques that are often mentioned in related work and their feature set. To make sure that inline decompression at line rate can be achieved for the chosen compression technique, we only consider techniques that are not recursive (meaning we do not have to do multiple passes over the data to get the decompressed integers) and do not require a dictionary (which would use the very limited BRAM resources). This excludes Huffman [15], Interpolated [21], and RePair [16] compression. Additionally, we prefer a compression technique that works on blocks to simplify decompression of many values per clock cycle. All of these techniques can additionally be combined with difference encoding if the original sequence of integers is sorted. Difference encoding means that every encoded value \(v^{\prime }_i\) represents the difference of its original value \(v_i\) to its previous value \(v_{i - 1}\). Subsequently, we shortly introduce the five remaining compression techniques (variable length, Simple-8, binary packing, PFor, and Golomb).
Variable length. For each value, there is a prefix of ones of the same length as the subsequent packed value. This works well for very small values but can waste a lot of bits for large values.
Simple-8. As many values as possible are packed into 64-bit blocks, where the first 4 bits of each 64-bit block are used to determine the width of the values [2]. One drawback is that this technique possibly wastes bits as padding, but all blocks are fixed width.
Binary packing and PFor. Binary packing [2] takes a fixed number of values as a block and determines the maximum of the minimum number of bits needed to represent each value. Each block is prefixed with a header containing this bit width, and every value of the block is thereafter encoded with that bit width. PFor [31] works similarly to binary packing, but the header of each block additionally stores a base value against which each value is diffed. There are a number of possible optimizations that we do not cover here.
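The following Python sketch illustrates the basic idea of binary packing as described above (fixed-size blocks with one bit-width header each). It is a simplified software model for illustration only; the block size is chosen arbitrarily and does not reflect the hardware decompressor:

```python
BLOCK = 8  # values per block (illustrative choice)

def pack(values):
    """Encode values as (bit_width, count, packed_bits) blocks."""
    blocks = []
    for s in range(0, len(values), BLOCK):
        block = values[s:s + BLOCK]
        width = max(v.bit_length() for v in block) or 1  # header: bits per value
        packed = 0
        for k, v in enumerate(block):
            packed |= v << (k * width)  # place value k at bit offset k * width
        blocks.append((width, len(block), packed))
    return blocks

def unpack(blocks):
    """Decode all blocks back into the original integer sequence."""
    out = []
    for width, count, packed in blocks:
        mask = (1 << width) - 1
        out.extend((packed >> (k * width)) & mask for k in range(count))
    return out

values = [3, 7, 2, 0, 5, 1, 6, 4, 1000, 1001]
assert unpack(pack(values)) == values
```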
Golomb. Each value is divided by a parameter \(b\), where the quotient is encoded in unary code and the remainder is encoded into \(\log _2(b)\)-bit binary code [22].
2.3 Related Work
Table 2 shows a representative sub-set of current FPGA-based graph processing accelerators, found in a recent survey [13], and how they relate to GraphScale. AccuGraph [26], FabGraph [24], HitGraph [29], and ThunderGP [7, 8] all work on labeled graphs and are frameworks that map to multiple graph problems. The latter three employ edge-centric graph processing with different partitioning schemes. HitGraph and ThunderGP sort their edge list beforehand to enable update coalescing before writing updates back into memory and applying them in a second phase in each iteration (i.e., both synchronous). FabGraph, in contrast, employs interval-shard partitioning, which enables compression of vertex identifiers for each partition and asynchronous update propagation. AccuGraph follows a vertex-centric approach with a pull-based data flow on an inverse, horizontally partitioned CSR data structure. While FabGraph and AccuGraph enable the potential of a compressed data structure and asynchronous graph processing, they do not scale to multiple memory channels, and HitGraph and ThunderGP scale to multiple memory channels but do not exploit that potential.
GraphScale combines both sides on a conceptual basis similar to AccuGraph. Our design decisions on iteration scheme, data flow, partitioning, and data structure are motivated by the insights from [10]. The vertex-centric iteration scheme allows for more flexibility in compressing the data structure than the edge-centric scheme. The pull data flow is beneficial for iterations with many vertex label updates, which we expect especially for PR and WCC, and leads to more sequential memory accesses compared to the push data flow. The CSR data structure was chosen because it is easy to construct with little pre-processing and also allows for efficient sequential memory accesses. Finally, the partitioning scheme resulted from all other design decisions.
Besides these frameworks, there are specialized graph processing accelerators that do not map to other graph problems and do not work on labeled graphs; they surpass the performance of general-purpose frameworks due to their much simpler memory access patterns and the often massive memory bandwidth of their benchmark systems. For example, Zhou et al. [28] propose a PR accelerator, while dedicated BFS accelerators were proposed by [3, 6, 17] on the Convey HC-2 system, which is not commercially available anymore. Zhang and Li [27] use the high-bandwidth hybrid memory cube (HMC), allowing for extreme performance scaling, however, through BFS-specific simplifications and assumptions. Similarly, ScalaBFS [19] reports superior performance on HBM but is also limited to BFS. Notably, due to these problem-specific simplifications, those accelerators cannot easily be extended to support other graph problems (e.g., simply using BFS for WCC is inefficient).
3 GraphScale System Design/Architecture
In this section, we describe our graph processing framework GraphScale (cf. Figure 3) at an abstract processor-scale level and subsequently explain its non-trivial design concepts. In principle, a GraphScale graph processor consists of \(p\) graph cores (explained in Section 3.1) matched to the number of memory channels \(p\) (four in this example). Each graph core is only connected to its own memory channel and can thus only directly read and write data on this channel. This requires partitioning of the graph into at least \(p\) partitions. The details of partitioning and how the graph is distributed over the memory channels are discussed in Section 3.3. However, since graph partitioning does not eliminate data dependencies between partitions, the graph cores are connected via a high-performance crossbar for the exchange of vertex labels, enabling the scaling of the approach. The crossbar is explained in Section 3.2. The whole execution is governed by a processor controller. Before execution starts, the host code passes parameters for each partition and optimization flags to the processor controller, which stores them in a metadata store. When execution is triggered by the host code, the processor goes through a state machine, orchestrating the control signals for the execution of iterations over the graph.
3.1 Graph Core
A graph core (cf. Figure 4), as the basic building block of GraphScale, processes graphs based on the vertex-centric iteration scheme and pull-based data flow. It works on a partitioned inverse-CSR data structure of the graph (cf. Section 2.1) consisting of one vertex labels array and, for each partition, one pointers and one neighbors array. Furthermore, processing of the graph is structured into two phases per iteration: prefetching and processing. In the prefetching phase, the vertex label prefetcher reads a partition-specific interval of the vertex label array into the label scratch pad, an on-chip memory (BRAM) split up into \(e\) banks (set to 8 in this example). The label scratch pad serves all non-sequential read requests that occur during an iteration instead of off-chip DRAM, since BRAM has much higher bandwidth and a predictable one-cycle request latency independent of the access pattern.
Starting the data flow of the processing phase, the source builder reads vertex labels and pointers sequentially. Vertex labels and pointers are zipped to form \(v\) source vertices in parallel, each with a vertex index (generated on the fly), vertex label, and inclusive left bound and exclusive right bound of its neighbors in the neighbors array. The destination builder reads the neighbors array of the current partition sequentially and puts \(e\) neighbor vertex identifiers in parallel through the two-level crossbar, which passes the vertex identifiers to the correct label scratch pad bank of the correct graph core and returns the resulting vertex labels in the original order (discussed in more detail in Section 3.2). The vertex label annotated with the neighbor index is then passed to the edge builder, which combines source and destination vertices based on the left bound \(l\) and right bound \(r\) of the source vertex and the neighbor index \(j\) of the destination vertex such that \(l \le j \lt r\). Thus, we get up to \(e\) edges with a maximum of \(v\) source vertices as output from the edge builder per clock cycle.
The accumulator takes the \(e\) edges annotated with their source and destination vertex labels as input and updates vertices in four steps. First, updates are produced in the Update stage depending on the graph problems’ update function for each edge in parallel. For BFS, this means taking the minimum of the source vertex label and destination vertex label plus 1. If the latter is smaller, the output is flagged as an actual update of the source vertex label. This is crucial for algorithms that terminate when no more updates are produced in an iteration (e.g., BFS and WCC). The pairs of source vertex identifier and updated vertex labels are then passed to the Prefix Adder, which reduces the updates to the most significant element with the same source vertex identifier for each source vertex. The most significant entry is then selected by the \(v\) selectors in the SelectMSO stage of the accumulator and passed on to the last stage. Each selector already only selects pairs with \(i \% v = I\) for source vertex index \(i\) and selector index \(I\). The Sequential stage consists of \(v\) sequential operators that reduce updates from subsequent cycles to the same vertex into one that is output when a new source vertex identifier is encountered or a higher source vertex identifier is encountered. Thus, in total, the accumulator produces updates only when the new label is different based on the annotated edges and reduces them into a maximum of one update per source vertex.
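A behavioral software model of the accumulator's net effect (our own simplification, shown here for the BFS minimum reduce; it ignores pipelining and the cross-cycle Sequential stage) may help to follow these stages:

```python
# Behavioral model of the accumulator for BFS: produce an update per annotated edge,
# then reduce to at most one (minimum) update per source vertex.
def accumulate_bfs(edges):
    """edges: list of (src_id, src_label, dst_label) tuples from the edge builder."""
    updates = {}
    for src_id, src_label, dst_label in edges:
        candidate = dst_label + 1            # Update stage: BFS update function
        if candidate < src_label:            # flag only actual updates
            best = updates.get(src_id, src_label)
            updates[src_id] = min(best, candidate)  # prefix/sequential minimum reduce
    return updates                           # at most one update per source vertex

print(accumulate_bfs([(4, 99, 3), (4, 99, 1), (7, 0, 5)]))  # {4: 2}
```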
Figure 5 shows the parallel prefix-adder vertex-update accumulator. We add a suffix sub-accumulator (dotted outlines), necessary to attain correct results in some edge cases, and a merged signal for non-idempotent reduce operators like summation. The accumulator takes \(e\) pairs of source vertex identifier and updated vertex label (split with a comma) and returns one updated vertex label as the right-most identifier-label pair per incoming source vertex identifier (italicized). The prefix-adder accumulator consists of \(\log _2(e) + 1\) pipelined levels of registers (white) and reduce processing elements (PEs). The registers take one identifier-label pair as input and pass this input on in the next clock cycle. The reduce PEs (green and pink) take two identifier-label pairs as input and combine them, depending on the graph problem the graph core maps, if the source vertex identifiers are equal. The result is again put out in the next clock cycle. Right reduce PEs (green) pass on the right identifier-label pair unmodified if the identifiers are unequal, and left reduce PEs (pink) pass on the left pair. In this particular example, the parallel accumulator could be used, e.g., for BFS or WCC, because it uses minimum reduce PEs, which put out the minimum of the vertex labels if they should be combined. The connection pattern of the first \(\log _2 e\) levels of the accumulator represents a Ladner-Fischer prefix-adder.
In addition to the prefix adder, we introduce a suffix sub-adder, which reduces all identifier-label pairs with a source vertex identifier equal to the first one to the first element. In an additional pipeline step, this suffix accumulation result is reduced with the last prefix accumulation result if there have been multiple different source vertex identifiers in the input. We do this because the sequential source vertex identifiers can overlap from the last one to the first one as a result of how the edge builder works. In this edge case, updated vertex labels might be missed because only the right-most vertex label of a source vertex identifier is processed further. Finally, we only reduce two identifier-label pairs if all pairs in between have the same source vertex identifier, which we keep track of with the merged signal mentioned above.
The resulting updates are fed back to a buffered writer and into the label scratch pad so they can immediately be used in the same iteration. The buffered writer collects all updates to the same cache line and writes them back to memory when an update to a new cache line is encountered.
All parts of this design are orchestrated in their execution by a core controller. The core controller gets graph-wide parameters of the execution, like the number of vertices, number of edges, and address of the buffer in memory, and dynamic parameters, like the iteration number, from the processor controller. Based on this, it starts the prefetch phase and then the processing phase and calculates the addresses into the data structure arrays accordingly. Finally, it also flushes the pipeline so all updates are written back to memory before asserting the ready signal such that the next iteration can be started.
3.2 Scaling Graph Cores
To scale this very effective single-channel design with limited overhead, we propose the graph-core-to-memory-channel assignment shown in Figure 3. Each graph core works on exactly one memory channel. However, annotating edges with vertex labels requires communication between graph cores. Therefore, we propose a scalable resource-efficient two-level crossbar. In this section, we will describe how we achieve the necessary high throughput of this crossbar to saturate the accumulators of multiple graph cores with annotated edges.
We show the multi-stage design of the crossbar for two cores and \(e = 4\) in Figure 6. The first level (bank shuffle) receives \(e\) neighbors from the destination builder each cycle for each core, with a FIFO buffer for each lane in between to reduce stalls. The neighbor indices serve as addresses to vertex labels in the labels array. Before the processing of a partition starts, the partition’s vertex labels are prefetched to the label scratch pad. Since the on-board memory returns \(e\) neighbors per graph core per cycle at maximum throughput, \(e * p\) requests have to be served by the label scratch pad per cycle. The label scratch pad of each graph core is split up into \(b\) banks that can serve requests in parallel, where \(b\) is at least equal to \(e\) (larger values for \(b\) may be used to reduce contention on banks). The vertex labels are striped over these banks. The last \(\log _2 b\) bits of each neighbor index are used to address the bank of the label scratch pad that this vertex label can be requested from. Thus, the bank shuffle level puts each neighbor index into the right bank lane based on its last \(\log _2 b\) bits. This can introduce stalls because multiple neighbors from one line can go to the same bank (i.e., a multiplexer has to output entries from the same input cycle in multiple output cycles). However, since each neighbor only goes to one bank, we decouple the \(b\) bank shufflers and let them consume \(r\) full lines before stalling such that labels from later lines can overtake in other banks. For most graphs, this reduces stalls because the load is approximately balanced between banks.
In a second level, we introduce a core crossbar that shuffles the neighbor indices annotated with their line and lane they originally came from to the core that contains the vertex label. Core addressing is done by the first \(\log _2 p\) bits of the neighbor index. However, since the neighbor indices are already in the correct lane, this only requires \(p * b\) core shufflers with \(p\) inputs. The results are additionally annotated with the core they originally came from and fed into the label scratch pad. The core shufflers work independently from each other too, allowing neighbor indices to overtake.
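A small sketch of this two-level addressing (our own illustration; the assumption that the remaining middle bits form the bank-internal address, and the 32-bit identifier width, are inferred from the surrounding description rather than stated there) shows how a rewritten neighbor index is split:

```python
import math

# Split a rewritten neighbor index into graph core, scratch pad bank, and bank entry:
# the first log2(p) bits select the core, the last log2(b) bits select the bank,
# and (presumably) the remaining bits address the entry inside the 2^c-deep bank.
def route(neighbor_index, p, b, index_bits=32):
    core_bits = int(math.log2(p))
    bank_bits = int(math.log2(b))
    core = neighbor_index >> (index_bits - core_bits)   # first log2(p) bits
    bank = neighbor_index & (b - 1)                      # last log2(b) bits
    entry = (neighbor_index >> bank_bits) & ((1 << (index_bits - core_bits - bank_bits)) - 1)
    return core, bank, entry

print(route(0x80000005, p=2, b=16))  # -> (1, 5, 0): core 1, bank 5, entry 0
```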
The label scratch pad returns the vertex labels with a one-cycle latency but keeps the annotations (each bank is \(2^c\) entries deep). A second layer of core shufflers routes the vertex labels back to their original graph core. Thereafter, the vertex labels are unshuffled to the lane they originally came from and fed into a final reorder stage to restore the original sequence of the data, which is possibly changed because requests and responses overtook each other in the previous steps.
The reorder stage has a fixed number of lines it can keep open at a time (4 in this example), which we will call reorder slots. It is passed the valid signals of the incoming neighbors when they first enter the crossbar and puts them into a mask FIFO. The unshuffled labels are then still annotated with the line they originally came from modulo the number of reorder slots, which is used as the address to put them in a BRAM. There is one BRAM for each lane of reorder slots because in each cycle we possibly write one label and read one label per lane. The reorder stage also maintains a pointer pointing to the currently to-be-output line and compares the valid signals of this line to the mask FIFO output. If the mask FIFO valid output and the valid signals from the line are equal, the labels are put out, the pointer is incremented, the mask FIFO is popped, and the valid signals of the line are cleared. When the pointer is incremented above the last line, it overflows to 0.
Finally, the mask FIFO of the reorder stage is also used to exert backpressure. If the mask FIFO contains as many elements as there are reorder slots, the ready signal is deasserted and all stages stop. To handle the one-cycle latency of the label scratch pad, there is also an additional overflow register where the label scratch pad result can overflow to.
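The reorder logic can be approximated in software as follows (a simplified model we provide for illustration; it abstracts away the per-lane BRAM timing and the overflow register):

```python
from collections import deque

class ReorderStage:
    """Simplified software model of the reorder stage described above."""
    def __init__(self, lanes, slots):
        self.lanes, self.slots = lanes, slots
        self.mask_fifo = deque()                              # expected valid mask per line
        self.labels = [[None] * lanes for _ in range(slots)]  # one buffer per slot and lane
        self.arrived = [[False] * lanes for _ in range(slots)]
        self.head = 0                                         # currently to-be-output line

    def open_line(self, valid_mask):
        """Record the valid signals when a line of neighbors enters the crossbar."""
        if len(self.mask_fifo) >= self.slots:
            raise RuntimeError("backpressure: all reorder slots occupied")
        self.mask_fifo.append(list(valid_mask))

    def deliver(self, line, lane, label):
        """Store a label returning (possibly out of order) into its slot and lane."""
        slot = line % self.slots
        self.labels[slot][lane] = label
        self.arrived[slot][lane] = True

    def try_output(self):
        """Output the head line once all its valid lanes have arrived."""
        if not self.mask_fifo or self.arrived[self.head] != self.mask_fifo[0]:
            return None
        out = [l for l, valid in zip(self.labels[self.head], self.mask_fifo[0]) if valid]
        self.mask_fifo.popleft()
        self.arrived[self.head] = [False] * self.lanes
        self.head = (self.head + 1) % self.slots              # pointer overflows to 0
        return out
```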
3.3 Graph Partitioning and Optimization
Figure 7 shows the partitioning of the input graph as the last missing part of our graph processing accelerator GraphScale and why it is able to scale well. The partitioning is done in two dimensions. In the first dimension, the set of vertices is divided into \(p\) equal intervals \(I_q\) (\(I_0\) and \(I_1\) in this example for \(p=2\)), one stored on each memory channel and processed by its corresponding graph core \(P_q\). The second dimension of partitioning divides each vertex interval into \(l\) equal sub-intervals (\(J_0\) to \(J_5\) in this example for \(l=3\)) that fit into the label scratch pad of the graph core. We generate one sub-partition \(S_{i,j}\) for each pair of interval \(I_i\) and sub-interval \(J_j\) containing all edges with destination vertices in \(I_i\) and source vertices in \(J_j\), and rewrite the neighbor indices in the resulting CSR data structure such that the requests are shuffled to the correct graph core by the two-level crossbar (i.e., the first \(\log _2 p\) bits are the graph core index) and subtract the offset of the respective sub-interval \(J_j\). Sub-partitions \(S_{q,q*l}\) for each \(q \in [0,p)\) additionally form a meta-partition \(M_q\). During execution, all sub-intervals \(J_{q*l}\) are prefetched by their respective graph core \(q\) before processing of all sub-partitions of meta-partition \(M_q\) is triggered. During processing, edges whose source and destination vertices reside in different intervals \(I_i\) produce a data dependency between partitions and thus graph cores, such that a vertex label has to be communicated over the two-level crossbar to process that edge. This lightweight graph partitioning, however, may introduce load imbalance because it works on simple intervals of the vertex set.
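A host-side sketch of this two-dimensional partitioning could look as follows (our own simplified illustration; it only groups inverse edges into sub-partitions and omits the neighbor index rewriting and CSR construction):

```python
# Group inverse edges (dst, src) into sub-partitions S_{i,j}: destination interval I_i
# (one per graph core / memory channel) and source sub-interval J_j (scratch pad sized).
def partition(edges, num_vertices, p, l):
    interval = -(-num_vertices // p)      # vertices per interval I_i
    sub = -(-interval // l)               # vertices per sub-interval J_j
    parts = {}
    for dst, src in edges:
        i = dst // interval               # graph core that updates dst
        j = src // sub                    # sub-interval whose labels must be prefetched
        parts.setdefault((i, j), []).append((dst, src))
    return parts
```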
Each graph core writes all label updates to off-chip memory through the buffered writer while processing a partition. As a first optimization, immediate updates, we also immediately write the updates back to the label scratch pad if they are part of the current partition [26]. Thus, with this optimization, BRAM and off-chip memory are always in sync. Nevertheless, at the beginning of processing a partition, the vertex label set is unnecessarily prefetched even if the labels are already present in BRAM. Thus, as a second optimization, prefetch skipping, we skip the prefetch phase in this case as a lightweight control flow optimization if the vertex label set is already present in the label scratch pad [11]. This optimization only works in conjunction with immediate updates. As a third optimization, we apply stride mapping to improve partition balance, which we identified as a large issue during testing [9]. Because the graph cores work in lock-step on the meta-partitions, imbalanced partitions lead to a lot of idle time. Stride mapping is a lightweight technique for semi-random shuffling of the vertex identifiers before applying the partitioning, which creates a new vertex ordering with a constant stride (100 in our case, which results in \(v_0, v_{100}, v_{200}, \ldots\)).
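One plausible realization of stride mapping in the host code is sketched below (our own illustration; the stride of 100 matches the value used above):

```python
# Stride mapping: reorder vertex identifiers so the new order is
# v_0, v_100, v_200, ..., v_1, v_101, ... for stride 100.
def stride_mapping(num_vertices, stride=100):
    new_id = {}
    next_id = 0
    for start in range(stride):
        for old_id in range(start, num_vertices, stride):
            new_id[old_id] = next_id
            next_id += 1
    return new_id  # applied to all vertex identifiers before partitioning
```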
5 Evaluation
In this section, we introduce the system used for evaluation and the setup for measuring performance, together with the system parameters and the graph datasets used for the experiments. Thereafter, we comprehensively measure performance in multiple dimensions. We first look at the effects of the optimizations introduced in Section 3.3 before highlighting the scalability of the GraphScale framework. We compare GraphScale with the performance of the competitors that do not scale to multiple memory channels (i.e., AccuGraph and FabGraph) to show the scalability of GraphScale, and with those that do scale (i.e., HitGraph and ThunderGP) to show the advantages and disadvantages of asynchronous graph processing and a compressed graph. Next, we propose an OpenCL wrapper for GraphScale and show the resulting scalability to an HBM-enabled FPGA. Lastly, we show the performance improvements from the binary packing compression and propose a performance model to explain data-dependent performance characteristics of GraphScale.
Figure 10 shows the system context in which GraphScale is deployed. In principle, the system features a CPU and an accelerator board, which hosts the FPGA running GraphScale itself and the memory used as intermediate data storage for the graph during processing. The CPU manages the execution on the FPGA and is also responsible for loading and partitioning the graph. To execute a particular workload on a particular graph, the GraphScale framework is first synthesized with user-defined functions (UDFs) for the map and reduce operators in the accumulator. Map produces an update to the source vertex label for each edge, while reduce aggregates updates into one value for each to-be-updated vertex. For a switch from BFS to WCC, the reduce UDF stays the same, while only one line has to be changed in the map UDF. PR requires changing the map UDF significantly and replacing the reduce UDF with summation. Additionally, PR alternatingly works on two separate vertex label arrays. Second, the synthesized design is programmed onto the FPGA.
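As a software analogy for these UDFs (the actual operators are hardware modules; this Python version is only illustrative), switching from BFS to WCC indeed only changes the map function, while PR swaps the reduce operator for a summation:

```python
# Illustrative software analogy of the map/reduce UDFs.
def bfs_map(src_label, dst_label):
    return dst_label + 1      # neighbor distance plus one edge

def wcc_map(src_label, dst_label):
    return dst_label          # propagate the component identifier unchanged

def min_reduce(a, b):         # reduce UDF shared by BFS and WCC
    return min(a, b)

def sum_reduce(a, b):         # reduce UDF used for PR
    return a + b
```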
For execution of the programmed algorithm on a particular graph dataset, the edge list (or any other representation) of the graph is read from disk to the CPU and partitioned by the graph partitioner in the host code according to the GraphScale framework parameters. Additionally, the vertex labels of the graph are initialized with graph-problem-specific values. The graph partitions and vertex labels are then transferred to the respective channels of the FPGA memory. Thereafter, the parameters of the graph are passed to GraphScale via a control interface, which triggers execution. After the execution is finished, the results can be read back to CPU memory and used for further processing. If desired, the partitioned graph can be used multiple times in a row by loading new vertex labels and again triggering the execution.
For our experiments, we are working with a server equipped with an Intel FPGA Programmable Accelerator Card (PAC) D5005 attached via PCIe version 3. The system features two Intel Xeon Gold 6142 CPUs at 2.6GHz and 384GB of DDR4-2666 memory. The D5005 board is equipped with four channels of DDR4-2400 memory with a total capacity of 32GB and a resulting bandwidth of 76.8GB/s. The design itself is based on the Intel Open Programmable Execution Engine (OPAE) platform and is synthesized with Quartus 19.4.
For the HBM experiments, we are working with a workstation equipped with an Intel FPGA S10MX development kit attached via PCIe version 3. The system features an Intel Xeon Gold E5-2667 CPU at 2.9GHz and 96GB of DDR3-1600 memory. The S10MX development kit board is equipped with 32 channels of HBM 2 memory with a total capacity of 8GB and a resulting bandwidth of up to 512GB/s. GraphScale itself is embedded into the Intel OpenCL platform as a kernel and is synthesized with Quartus 21.2.
Table 3 shows the different system configurations used for our experiments for GraphScale as well as AccuGraph, FabGraph, HitGraph, and ThunderGP. The resource utilizations for AccuGraph, FabGraph, HitGraph, and ThunderGP are taken directly from their papers and are relative to the platforms they were implemented on, respectively. Besides the three graph problems BFS, PR, and WCC, we synthesized system variants that utilize different numbers of memory channels \(p\). All DDR4-based variants (i.e., GraphScale on the D5005 platform) have a total label scratch pad size of \(2^{21}\), 16 scratch pad banks, and 8 vertex pipelines. All types, including pointers, vertex identifiers, and vertex labels, are 32-bit unsigned integers except PR vertex labels, which are 64-bit and consist of the degree of the vertex and its PR value. Lastly, the depth of the reorder stage is set to 32. For HBM, we synthesized system variants with up to 16 memory channels except for PR, which did not fit on the FPGA. All variants except those for 16 channels have a total label scratch pad size of \(2^{20}\), 4 vertex pipelines, and an overprovisioned crossbar with 16 scratch pad banks. The GraphScale-BinPack variant implements binary packing compression. This parameterization results in a moderate resource utilization with rising lookup table (LUT) and register (Regs.) utilization and almost constant BRAM utilization because the scratch pad size is shared between the graph cores. The PR configuration has significantly higher resource utilization due to the doubled label size.
Graph datasets that are used to benchmark our system are listed in Table 4. This selection represents the most important graphs currently considered, as found by a recent survey [13]. Two important aspects when working with these graphs are their directedness and the choice of root vertices (e.g., for BFS or SSSP), because both can have a significant impact on performance. We also show graph properties like degree distribution and average degree that are useful for explaining the performance effects observed in the following. For the binary packing compression experiments, we generated different configurations of the perfect-\(v\)-\(e\) graph. This graph is generated in a way that perfectly utilizes the GraphScale crossbar to isolate graph loading from memory.
5.1 Effects of GraphScale Optimizations
Figure 11 shows the effects of the different optimizations from Section 3.3 when applied to the base framework. The measurements are performed on a four-channel GraphScale system and normalized to measurements with all optimizations turned off. The immediate updates optimization ensures that updates to the vertex labels of the current partition interval are written back to the scratch pad immediately, instead of just being written back to memory. This makes updates available earlier and leads to faster convergence for almost all graphs. Only the berk-stan graph does not benefit from this optimization, which is due to a specific combination of graph structure and selected root vertex. The prefetch skipping optimization skips the prefetch phase of each iteration if immediate updates are enabled. Hence, the prefetch skipping measurements have immediate updates enabled. Additionally, prefetch skipping only works on graphs with a single partition. Prefetch skipping is a lightweight control flow optimization that sometimes leads to small performance improvements. Lastly, stride mapping tries to optimize partition balance. Whenever partitions can be balanced (e.g., for the youtube or slash-dot graphs), the performance improves most significantly. However, in rare cases (e.g., the berk-stan graph) this may lead to performance degradation because, with asynchronous graph processing, result convergence depends on the vertex order, and a beneficial vertex order may be shuffled by stride mapping. From our observation, it is beneficial for faster convergence if high-degree vertices are at the beginning of the vertex sequence. In single-channel measurements, performance was better without stride mapping for almost all graphs. This is expected because partition balance is only important between channels but not between sub-partitions.
5.2 GraphScale Scalability
Figure 12 shows the scalability of GraphScale from a single memory channel up to four memory channels as speedup over the baseline of single-channel operation for BFS, PR, and WCC. For a single channel, the stride mapping optimization is disabled. Otherwise, all optimizations discussed in Section 5.1 are always enabled. The measurements show that there is some scaling overhead and that speedup is dependent on the dataset. This may be due to partition balance but is mainly influenced by the density (i.e., average degree) of the graph for BFS. This can, e.g., be observed for the orkut, dblp, and rmat-21-86 graphs. Two interesting exceptions are the roadnet-ca and top-cats graphs, which show super-linear scaling. This is due to stride mapping changing the vertex ordering and thus leading to convergence on the result in significantly fewer iterations. Scalability speedups for WCC are similar to the BFS measurements besides the even more pronounced super-linear scaling for roadnet-ca and top-cats.
For the comparison of GraphScale against AccuGraph, FabGraph, HitGraph, and ThunderGP, we use the performance numbers they reported in their respective papers, with the performance measure millions of traversed edges per second (MTEPS) defined by the Graph500 benchmark as \(|E|/t_{exec}\), with runtime \(t_{exec}\). More is better for this performance metric. This is different from the MTEPS* definition \(|E|*i/t_{exec}\) with number of iterations \(i\) used by HitGraph and ThunderGP. MTEPS* eliminates the number of iterations in favor of showing raw edge processing speed. However, faster convergence to results due to a lower number of iterations has more impact on actual runtime than the usually smaller differences in raw edge processing speed [10].
Figure 13 shows AccuGraph and FabGraph compared to GraphScale scaled to four memory channels. FabGraph was only measured for BFS and PR on the youtube, wiki-talk, live-journal, and pokec graphs, and the AccuGraph measurements did not include the pokec graph. For the BFS measurements, we use 0 as the root vertex as was done for AccuGraph and FabGraph in their respective papers. Overall, we show an average performance improvement over AccuGraph of \(1.48\times\) and over FabGraph of \(1.47\times\). Especially AccuGraph benefits from a much higher clock frequency due to lower design complexity. For wiki-talk, GraphScale scales the worst (cf. Figure 12) and is thus not able to provide a large improvement. FabGraph performs exceptionally well for PR on very sparse graphs (i.e., youtube and wiki-talk).
Figure 14 compares the four-channel GraphScale system to HitGraph. Because HitGraph does not provide BFS performance numbers, the GraphScale BFS results in Figure 14(a) are compared to single-source shortest path results from HitGraph, which produces the same output for edge weights of 1. We were not able to obtain the root vertices that were used for the HitGraph measurements and thus measure with our own. Overall, we show an average performance improvement over HitGraph of \(1.89\times\) for BFS and \(2.38\times\) for WCC. As already shown in Figure 12, GraphScale benefits from denser graphs like live-journal, in contrast to a sparse graph like wiki-talk. We also again observe the superior scaling of our approach for the roadnet-ca graph. For graphs with a large vertex set like rmat-24-16, our approach requires increasingly more partitions (9 for rmat-24-16), introducing a lot of overhead.
Figure 15 compares the four-channel GraphScale system to ThunderGP. For this experiment, we implemented a vertex range compression proposed by ThunderGP, which removes any vertex without an outgoing edge from the graph before partitioning it. While we apply this for the purpose of comparing the approaches on equal footing, we criticize this compression technique because it returns wrong results. Taking BFS as an example, vertices that only have incoming edges also receive updates even though they do not propagate them. ThunderGP uses random root vertices generated with an unseeded random generator. Thus, we reproduce their root vertices and measure on the exact same input parameters. Overall, we achieve a speedup over ThunderGP of \(2.05\times\) and \(2.87\times\) for BFS and WCC, respectively. The vertex range compression makes the wiki-talk graph much denser, which our approach benefits from. The only slowdown we observe is again for the rmat-24-16 graph due to partition overhead.
5.3 HBM
Figure 16 shows the integration of GraphScale into an OpenCL kernel, enabling usage with the S10MX development kit because Intel OPAE is not available for this system. The integration is handled by an adapter converting the function call interface of OpenCL kernels to the address- and data-based register interface of the processor controller. This works because GraphScale, which is added to the OpenCL kernel as a VHDL library function, retains its state between kernel function calls. The reset signal of GraphScale is only triggered when the OpenCL kernel is created. In addition to a core, address, and data parameter, the OpenCL kernel has one pointer parameter for each memory channel. For HBM, pointer parameters are bound to a memory channel at compile time and have a different data type than the data parameter. The core parameter multiplexes the pointers, and the address parameter multiplexes the data parameter and the pointer parameters, which are then passed to the data port of the processor controller of GraphScale. The core and address parameters are additionally concatenated and passed to the address port.
Figure 17 shows the scalability of GraphScale up to 16 channels on the HBM-enabled system (hatched) compared against GraphScale scaled over up to 4 channels of DDR4 memory (solid color). We directly compare HBM configurations with double the number of graph cores against the DDR4 configurations because the HBM memory channels are half as wide as the DDR4 memory channels, resulting in half as many edges processed per core simultaneously. There are no DDR4 configurations matching the 1-core and 16-core HBM configurations. The narrower data paths of the HBM channels, however, also lead to significantly higher clock frequencies (cf. Table 3). Additionally, we overprovision the crossbar with 16 label scratch pad banks per graph core, while the graph cores only issue a maximum of 8 requests per clock cycle due to the narrower data path. If there are no measurements shown for certain graphs (e.g., no measurement for the HBM 1-graph-core configuration for the live-journal graph), this means that the graph did not fit into the memory of this configuration. Each HBM channel only has 256MB of memory, and different from DDR4, there is no datapath connecting all memory channels, limiting graph size.
To fit 16 graph cores, we had to remove the ready signal decoupling registers in the crossbar due to resource utilization constraints, leading to a much-reduced clock frequency; halve the label scratch pad size; and remove the overprovisioned label scratch pad banks. For some graphs, the reduced clock frequency and default number of label scratch pad banks mean that GraphScale-HBM with 8 graph cores runs faster than GraphScale-HBM with 16 graph cores. Further scaling to more memory channels is limited by the resource utilization of the FPGA. For scalability to 32 channels, each graph core of GraphScale would have to be smaller than \(2\%\) of the LUT resources. In principle, having a number of cores that is not a power of two is also possible but does not make sense here because 16 cores already is the limit. In the future, FPGAs with more resources and possibly smaller board support packages (providing basic functionality like memory and PCIe controllers) will enable scaling the system even further. When it eventually is possible to saturate every memory channel with a graph core, there are further methods to extract even higher performance from HBM [14]. The rmat-24-16 graph did not fit into memory at all for PR.
On average, GraphScale on HBM is able to provide a speedup of \(1.53\times\) with a maximum of \(3.49\times\) for WCC on the road-net graph. Compared to the theoretical HBM bandwidth of 512GB/s, this may seem low. However, we are limited by resources such that we are not able to scale to the full 32 channels and do not achieve the best clock frequency for 16 channels for the same reason. Additionally, for HBM we would have to reach an unrealistic 500MHz clock frequency to utilize the full memory bandwidth, compared to only 300MHz required to reach the maximum bandwidth of a DDR4 channel.
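This can be verified with simple back-of-the-envelope arithmetic (assuming a 64-byte data path per DDR4 channel and a 32-byte data path per HBM channel, consistent with the halved channel width discussed above):

```python
# Clock frequency required to saturate one memory channel = bandwidth / data path width.
ddr4_channel_bw = 76.8e9 / 4          # 19.2 GB/s per DDR4-2400 channel
hbm_channel_bw = 512e9 / 32           # 16 GB/s per HBM channel

print(ddr4_channel_bw / 64 / 1e6)     # ~300 MHz for an assumed 64-byte DDR4 data path
print(hbm_channel_bw / 32 / 1e6)      # ~500 MHz for an assumed 32-byte HBM data path
```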
5.4 Compression
Figure 18 shows the effect of the binary packing compression on GraphScale performance on generated graphs for, on the one hand, an increasing average degree and, on the other hand, an increasing number of partitions. The experiments are run with all vertex labels set to \(-1\) to prevent updates from having to be written back to memory, and the number of iterations over the graph is fixed to 1 to isolate the performance of loading the graph. The measurements show a performance improvement for GraphScale-BinPack(1) compared against the baseline of GraphScale(1) the lower the average degree of the graph is (cf. p-21-2) and the more partitions the graph has (cf. p-24-16). For the p-21-1 graph, we observe a drop in performance because there are only half as many vertex pipelines in our design as edge pipelines, such that with an average degree below 2 the edge builder is limited by the number of vertex pipelines. On average, we observe a speedup of \(1.26 \times\) with the binary packing compression enabled and a maximum of \(1.48 \times\) for the p-21-2 graph, achieving our goal of reducing partitioning overhead for large graphs. However, for the real-world graphs that we use in our other experiments, we do not observe this improvement.
5.5 Understanding Data-dependent Performance Characteristics
To better understand the performance characteristics of GraphScale on different graphs, we propose a performance model based on two components. We model the theoretical maximum performance based on memory accesses alone (memory bound) and also simulate the theoretical maximum performance based on the crossbar (crossbar bound), which is mainly influenced by imbalances in accesses to the label scratch pad, both locally and globally, producing bubbles in the processing pipelines. As we observed in [10], the biggest influence on performance is the number of iterations over the graph, which we optimize for with asynchronous graph processing. Thus, we normalize all performance numbers in this experiment to one iteration over the graph. Additionally, imbalance in the number of edges between partitions can also cause lower performance, which we also exclude from the experiment by only looking at one graph core.
The memory bound \(M_{max}\) is mostly influenced by the average degree of the graph and the number of meta-partitions, the same factors we already observed for the compression. It can be modeled as follows (\(i\) is the number of iterations over the graph, and \(d = 1\) if the number of sub-intervals \(l\) is exactly one and \(d = i\) otherwise):
We additionally simulate the crossbar bound \(C_{max}\) by going over the edge list in cycles and consuming an edge if the corresponding simulated label scratch pad bank has space left in an internal queue. The lookahead is limited to the FIFO depth from Figure 6, regulated with a parameter in the simulation. We count the cycles that the simulation needs to consume all edges (a maximum of 16 per cycle with perfect balancing of requests) and calculate the theoretical maximum MTEPS from that.
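The simulation can be summarized in a few lines of Python (a simplified re-implementation for illustration; the bank count, lookahead, and queueing behavior are approximations of the description above):

```python
# Count cycles needed to consume all neighbor requests when each scratch pad bank
# accepts one request per cycle and only `lookahead` pending requests are visible.
def crossbar_bound_cycles(neighbor_indices, banks=16, lookahead=32):
    pending = list(neighbor_indices)
    cycles = 0
    while pending:
        cycles += 1
        used, consumed = set(), []
        for pos, n in enumerate(pending[:lookahead]):
            bank = n % banks              # last log2(banks) bits select the bank
            if bank not in used:
                used.add(bank)
                consumed.append(pos)
        for pos in reversed(consumed):
            pending.pop(pos)
    return cycles                         # MTEPS estimate: |E| * f_clk / cycles
```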
Figure 19 shows the result of our performance modeling compared against the actual performance measurements of GraphScale with one core. The final estimate is the minimum of the memory bound \(M_{max}\) and the crossbar bound \(C_{max}\). We observe that we can closely model the actual performance with this estimate. For most real-world graphs, the performance is significantly limited by the crossbar bound because there are local imbalances in the label scratch pad accesses. This is especially prevalent for the synthetic rmat graphs. However, the graphs do not exhibit global imbalances of accesses. This suggests that optimizing the partitioning of the graphs, taking our performance model as a measure of quality, could improve performance and unlock the performance gains shown for the binary packing compression.
5.6 Discussion
Overall, we observe an average speedup of \(1.48\times\) over AccuGraph and FabGraph and \(2.3\times\) over HitGraph and ThunderGP, with a maximum speedup of \(4.77\times\) for BFS on the wiki-talk graph over ThunderGP, confirming the potential of scaling asynchronous graph processing on compressed data. For the optimizations, we show the importance of partition balance, with stride mapping being very effective for graphs like slash-dot and youtube at the tradeoff of shuffling the potentially beneficial natural vertex ordering of real-world graphs. In our scalability measurements, we observe that GraphScale performance benefits from denser graphs in general (e.g., the orkut and dblp graphs). This was especially pronounced compared against FabGraph for PR on very sparse graphs. Additionally, compared to HitGraph and ThunderGP, we observe a significant slowdown for graphs with a large vertex set, like rmat-24-16. This results in a tradeoff in the compressed data structure between less data required for dense graphs and more partitioning overhead for graphs with a large vertex set. We tackle both of these challenges (sparse and large graphs) with our binary packing decompressor, achieving an average speedup of \(1.26 \times\) on synthetic graphs. To explain the lack of performance improvement from binary packing compression on real-world graphs, we present a performance model forming a theoretical upper bound for GraphScale performance. Lastly, we scale GraphScale to HBM, resulting in a \(1.53\times\) average speedup with a maximum of \(3.49\times\) speedup for WCC on the road-net graph.