The input is an unordered set of directed edges of the graph; undirected edges can be represented by a pair of directed edges. Each iteration contains three stages: scatter, gather, and apply. In the scatter stage (lines 2 through 6), an update tuple is generated for the destination vertex of each input edge \(\langle src, dst, weight \rangle\). The update tuple has the format \(\langle dst, value \rangle\), where dst is the destination vertex of the edge and value is computed from the vertex properties and the edge weight. In the gather stage (lines 7 through 9), all update tuples generated in the scatter stage are accumulated per destination vertex. The apply stage (lines 10 through 12) takes the values accumulated in the gather stage to compute the new vertex properties. The iterations end when the termination criterion is met. ThunderGP exposes corresponding functional APIs so that developers can customize the logic of the three stages for different algorithms.
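To make the programming model concrete, the sketch below shows how the three per-stage hooks could be customized for PageRank in HLS C++. The function names (scatterFunc, gatherFunc, applyFunc) and the prop_t type are illustrative assumptions for this example, not ThunderGP's actual API identifiers.

```cpp
// Minimal sketch of the three stage hooks, specialized for PageRank.
// Names and signatures are hypothetical; ThunderGP's real API may differ.
typedef float prop_t;

// Scatter: for an edge <src, dst, weight>, produce the update value sent
// to dst from the source vertex property (weight assumed to be 1/out-degree).
inline prop_t scatterFunc(prop_t srcProp, prop_t edgeWeight) {
    return srcProp * edgeWeight;
}

// Gather: accumulate update tuples destined for the same vertex.
inline prop_t gatherFunc(prop_t accumulated, prop_t update) {
    return accumulated + update;
}

// Apply: compute the new vertex property from the accumulated value.
inline prop_t applyFunc(prop_t accumulated, prop_t /*oldProp*/) {
    const prop_t kDamp = 0.85f;           // damping factor
    return (1.0f - kDamp) + kDamp * accumulated;
}
```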
Related work. There have been several research works on FPGA-accelerated graph processing. Table 1 summarizes the representative studies that can process large-scale graphs. In the early stage, Nurvitadhi et al. [44] proposed GraphGen, which accepts a vertex-centric graph specification and produces an accelerator for the FPGA platform. However, developers need to provide pipelined RTL implementations of the update function. Dai et al. [19] presented FPGP, which partitions large-scale graphs to fit into on-chip RAMs to alleviate the random accesses introduced during graph processing. More recently, they further proposed ForeGraph [20], which exploits BRAM resources from multiple FPGA boards. FabGraph from Shao et al. [52] delivers an additional 2× speedup over ForeGraph by adding two-level caching. Nevertheless, both works adopt the interval-shard-based partitioning method and buffer both source and destination vertices, resulting in a significant data replication factor and heavy preprocessing overhead [3]. Moreover, these two works are simulation-based and not publicly available. Engelhardt and So [22, 23] proposed two vertex-centric frameworks, GraVF and GraVF-M. However, these frameworks support only graphs that fit entirely in the on-chip RAMs and fail to scale to large graphs. Zhou et al. [70–73] proposed a set of FPGA-based graph processing works. Their latest work, HitGraph [72], vertically partitions large-scale graphs to enlarge the partition size. However, edges have to be sorted to minimize memory row conflicts, which imposes significant preprocessing overhead, especially for very large graphs. Furthermore, HitGraph executes the scatter and gather stages in a bulk synchronous parallel execution model in which the intermediate data must be written to and read back from the global memory. In contrast, ThunderGP pipelines these two stages, thus reducing accesses to the global memory. Yao et al. [69] proposed AccuGraph, a graph processing framework with an efficient parallel data conflict solver. Although vertical partitioning is adopted, edges are sorted during graph partitioning. Recently, Asiatici and Ienne [3] explored graph processing accelerators on multi-SLR FPGAs with caches that tolerate thousands of simultaneous misses instead of minimizing cache misses. However, the resulting performance is worse than that of HitGraph [72] and ThunderGP [12] because of bank conflicts in the proposed miss-optimized memory system. In terms of development effort, all preceding works require HDL programming, as summarized in Table 1.
In addition, a few HLS-based graph processing frameworks have been developed. Oguntebi and Olukotun [45] presented an open-source modular hardware library, GraphOps, which abstracts the low-level graph-processing hardware details for quickly constructing energy-efficient accelerators. However, optimizations such as on-chip data buffering are not exploited in their designs, leading to poor memory performance. Moreover, GraphOps still requires developers to manually compose pipelines or modify the logic of basic components when the built-in modules are insufficient to express the application logic.
The emerging HBM memory system of FPGAs has attracted considerable attention due to its potential for graph processing applications. GraphLily [28] is the first graph linear algebra overlay that exploits HBM. With a more constrained configuration space, it offers much faster compilation via bitstream reuse across applications. Our framework, in contrast, generates customized accelerators for graph applications and is thus more flexible. Our previous study of ThunderGP [12] almost fully utilized the available memory bandwidth of DRAM-based platforms while using only a portion of the computation resources, which indicates that performance can scale further given more memory bandwidth. This article makes three additional significant contributions. First, we optimize the resource consumption of the kernels so that more kernels can be instantiated to take advantage of the high memory bandwidth of HBM. Second, we propose two optimizations for accessing HBM channels efficiently. Third, we conduct comprehensive evaluations of ThunderGP on the HBM-enabled platform and compare its performance with that on the DRAM-based platform.
12] has almost fully utilized the available memory bandwidth of DRAM-based platforms while using a portion of computation resources, which indicates that the performance can further scale up if there is more available bandwidth. This article makes three additional significant contributions. First, we optimize the resource consumption of kernels to instantiate more kernels to take advantage of the high memory bandwidth of HBM. Second, we propose two optimizations to access HBM channels efficiently. Third, we conduct comprehensive evaluations of ThunderGP on the HBM-enabled platform and compare its performance with that on the DRAM-based platform.