1 Introduction
Recent years have seen massive interest in Artificial Intelligence (AI), due to a sharp increase in demand from industries such as security, finance, health-care, and the military. Meanwhile, graph theory, as one of the most fundamental modeling techniques in data science, contributes to many current AI applications. Problems that can be modeled as communicating entities, such as the Internet of Things (IoT), social networks, web search, transportation, health-care systems [23], and even biology [1], are examples for which graph data models are an excellent fit. For example, according to official statistics, as of the third quarter of 2020, Twitter had more than 180 million daily active users, producing more than 500 million tweets per day [27]. For such compute-intensive applications, strong support from the underlying hardware to accelerate software algorithms has become an indispensable requirement due to the widespread use of AI on huge datasets. Graph applications are among the important algorithms used in AI and machine learning (ML) implementations, since iterative model training is one of their greatest challenges. The concurrent nature of graph models, in which vertices/edges represent concurrent entities/links, provides large parallelism potential. However, efficiently implementing these applications on existing systems is not trivial. Therefore, efficient co-processor-based graph processing is considered a promising solution to the growing need for graph applications within the AI domain.
As the complexity of digital systems increases, the tendency grows to use productive High-Level Synthesis (HLS) languages, such as SystemC, over conventional RTL-level Verilog and VHDL. Moreover, as CPU-FPGA hybrid platforms become omnipresent, particularly in IT domains such as data centers running analytics applications, writing FPGA programs for these platforms becomes a dilemma for software programmers. This calls for even higher-level HLS languages, such as Intel OpenCL for FPGA, in which even the most basic digital design concepts, like the clock signal, are absent from the syntax. Currently, mainstream FPGA makers are following this approach to provide better programmability. However, as the abstraction level of a design increases, the optimality of the implementation degrades, particularly in complex designs. This issue applies specifically to graph-processing applications, in which, as we will discuss, a shrewd system architecture is necessary to efficiently exploit the valuable memory bandwidth. Considering these two opposing requirements, namely easier programmability and implementation efficiency, our goal is a hybrid graph-processing pipeline that mixes cycle-accurate implementation, to provide efficiency, with high-level programmability, to support productivity.
In this work, we propose an HLS-based graph-processing pipeline. The platform used for implementation consists of a host processor connected to multiple hardware accelerators on the FPGA. A software program on the host processor is only responsible for initiating execution on the FPGA, with no further intervention afterward. The FPGA user code is composed of two kinds of modules. The first is a collection of fixed modules, provided by our framework, written in SystemC in a cycle-accurate way, which form an efficient, high-throughput vertex-processing pipeline. The second is a single customizable module, written by the user in pure C/C++, in which the intended iterative graph algorithm is expressed entirely within this specific module of the template. Finally, all modules are converted to a bitstream to be executed on the FPGA. In this architecture, parallelism is provided in two dimensions: across multiple accelerators and inside a manually designed deep pipeline. While the proposed architecture is not limited to a particular platform, the current implementation of our tool specifically targets the emerging generation of Intel Xeon+FPGA platforms, on which our experiments have been executed. For comparison purposes, the candidate graph algorithms are also implemented in OpenCL, the alternative HLS platform. In very high-level OpenCL, the user program is written in a straightforward way, similar to software programming, and a pipelined architecture is automatically generated by the architecture compiler. We show that our reusable template can be used to implement a high-throughput or work-efficient graph algorithm by writing only pure software code. This way, the programmer does not need to engage in the complicated HLS techniques or transformations typically required to obtain an efficient architecture from an HLS design. The main contributions of this work are as follows:
•
We propose a framework to generate an efficient graph-processing pipeline for iterative graph algorithms. The generated pipeline is high-performance/work-efficient, SystemC-based, synchronous, deeply pipelined, fully synthesizable, and ready to be implemented on FPGA. The pipeline is optimized to attain the maximum feasible throughput (one edge per clock cycle) under the assumption of no vertex locality, which is the case for large-scale graphs (minimum cache hit rate, the worst case). While it can be executed on any CPU+FPGA architecture, it is specifically prepared for Intel's state-of-the-art Xeon+FPGA platform.
•
The framework is template-based and designed for convenient use by non-hardware experts, for the rapid generation of high-performance graph accelerators using only C/C++. This combines high-level programmability with the efficiency of the underlying SystemC language.
•
We implement a novel, fast bit-vector, mapped to FPGA Block RAMs, to keep the active vertex list. This enables work-efficient graph processing.
•
We compare and contrast OpenCL, the alternative HLS platform for FPGA, to show its limitations and the difficulties of implementing high-performance pipelined graph algorithms.
This article is organized as follows: The next section describes the background on graph processing as well as the Intel Xeon+FPGA platform. Section
3 gives the relevant studies on the topic. In Section
4, baseline implementations of graph algorithms in Intel OpenCL are presented, and various structures and their drawbacks are discussed. In Section
5, our HLS-based pipelined graph-processing architecture is introduced in detail. In Section
6, the experimental setup and results are discussed. In Section
7, limitations and future work are presented. The article is concluded in Section
8.
2 Background
In this section, we review some background on graph processing. First, we state why FPGAs are preferable to GPUs. After that, the FPGA hardware platform that we used is explained. Finally, to facilitate reading, some terminology is reviewed.
2.1 FPGA or GPU?
Currently, utilizing the computational power of GPUs in user applications such as multimedia processing tools, deep-learning algorithms, or scientific computations is a well-known approach for
High-Performance Computing (HPC). This
General Purpose GPU (GPGPU) computing can potentially be limited by control divergence and memory divergence. In contrast, graph-based applications typically have data-dependent behavior, both in control flow and in memory references, mainly due to the diversity of graph topologies. For this reason, the literature generally does not consider the GPU a suitable choice for accelerating graph applications [
7].
In a more recent trend,
High-Performance Reconfigurable Computing (HPRC), which integrates FPGA accelerators with general-purpose processors, has attracted the attention of dominant market vendors. Intel Corp. acquired Altera, one of the leading FPGA makers, in an investment valued at $16.7 billion, recorded at the time as the largest deal in the semiconductor industry. Despite the volatile architecture and lower operational frequency of FPGAs, due to complex routing interconnects, there are many applications where the flexibility of the FPGA architecture to design a deep and customizable pipeline outperforms the computational power of GPGPUs for equal available off-chip memory bandwidth [
11,
20]. In this trend, as FPGA-accelerators are becoming serious competitors for GPUs, many cloud service providers extensively use them in their data centers to provide massive parallelism [
31]. Currently, Amazon offers FPGA nodes on EC2 platforms [
2]. Microsoft has integrated FPGAs into its data centers at massive scale, aiming at FPGA-powered high-performance real-time AI systems, including enhancing the performance of the Bing web search engine [
8,
10].
2.2 Hardware Accelerator Research Program (HARP) Platform
We implemented and tested our approach on a version of the state-of-the-art Intel Xeon+FPGA platform, where a Xeon processor is connected to an Arria10-GX1150 FPGA via two PCIe channels and one QPI channel (see Figure
1). On this platform, a DDR memory is available on the processor side, and the FPGA can access it through the three serial channels and the processor's memory controller. The maximum total read/write bandwidth between the processor and the FPGA is
\(\sim\) 20 GB/s, while according to our measurements, read and write can go up to
\(\sim\) 17 and
\(\sim\) 15 GB/s, respectively. The part of the FPGA bitstream that includes the memory and host-CPU communication controllers used by the FPGA is provided to users by the Intel SDK as the so-called fixed
blue-stream. In addition to the blue-stream, using FPGA partial reconfiguration, users develop their own hardware design as an
Accelerator Function Unit (AFU). The customizable user code, called the
green-stream, is attached to the existing blue-stream to form a full bitstream. In this setting, the user logic implementation can work at a maximum frequency of 400 MHz. After the bitstream is programmed onto the FPGA, host-side software initiates the FPGA run; thereafter, the FPGA continues operating without host-CPU intervention until it indicates a done signal.
We have implemented our approach on a remote system, under the Intel
Hardware Accelerator Research Program (HARP), at Paderborn University in Germany. At the time of writing this article, the two SDKs that we used are available only under the HARP program, with some limitations. The two applied design flows, for the baseline and for the template, are illustrated in Figure
2. As can be seen, the OpenCL flow (bottom) is simpler in terms of execution, whereas the SystemC flow (top) involves HLS and place-and-route tools. From the user's point of view, OpenCL code is directly converted to a bitstream, whereas in SystemC, intermediate RTL code is generated. To convert SystemC to RTL, we followed Reference [15], which requires some additional third-party tools. In both cases, host software initializes the FPGA program.
2.3 Graph Processing
Iterative Graph Processing Model: A graph, with an initial value per vertex/edge, is iteratively processed by a given algorithm. After every iteration, the values are expected to improve until they reach the final desired result. Execution completes when all vertices have converged.
Memory Access Bottleneck: In large-scale graph applications, low cache utilization, high memory access latency, and low bandwidth utilization in accessing graph data on off-chip memory are well-known bottlenecks. This is because of the inevitable random (irregular) memory access pattern. Diverse references to graph data (e.g., neighbor vertices/edges of a vertex) on a wide range of memory addresses lead to poor data locality. This, in turn, causes dramatically low cache utilization and low memory bandwidth utilization, in addition to large (typically around 100 clock cycles) data access latency. Hence, graph applications are typically memory bandwidth-bound.
Gather-Apply-Scatter (GAS) model: This model represents the three conceptual phases of a vertex-centric graph program. In the gather phase, data of adjacent vertices/edges is collected. Then, in the apply phase, the necessary calculations are performed and the new vertex value is computed and updated. Finally, in the scatter phase, the neighbor vertices are informed about the value change in the current vertex. In the GAS model, data access is limited to neighbor vertices.
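As a point of reference, the model can be rendered in plain software as the following minimal C++ sketch (array names such as offsets, neighbors, and value are illustrative, not part of any framework API):

```cpp
#include <vector>

// One synchronous GAS-style iteration over a CSR graph (illustrative only).
// offsets[v] .. offsets[v + 1] delimits the neighbor list of vertex v.
std::vector<float> gasIteration(const std::vector<int>& offsets,
                                const std::vector<int>& neighbors,
                                const std::vector<float>& value) {
    const int numVertices = static_cast<int>(offsets.size()) - 1;
    std::vector<float> nextValue(numVertices, 0.0f);
    for (int v = 0; v < numVertices; ++v) {
        float acc = 0.0f;                                  // temporary gather result
        for (int e = offsets[v]; e < offsets[v + 1]; ++e)
            acc += value[neighbors[e]];                    // gather: read each neighbor
        nextValue[v] = 0.5f * acc;                         // apply: algorithm-specific update
        // scatter (push-based algorithms only): notify neighbors of the change
    }
    return nextValue;                                      // published to readers next iteration
}
```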
Convergence: A vertex is said to have converged when its data value reaches its final value (or gets close enough, if the data type allows). This value will not change in subsequent iterations. A graph is said to have converged when all of its vertices have converged.
Pull-based vs. Push-based: There are two general strategies for implementing graph algorithms. In pull-based execution, processing a vertex consists of first reading data from neighbor vertices/edges, then some computation, and finally writing the new value to the current vertex, without informing the neighbors (in other words, there is no explicit scatter phase). In contrast, in push-based execution, after performing its calculations, the vertex may also write data to neighbor vertices [13]. Comparatively, pull-based execution has more reads and higher edge-processing throughput, while push-based execution has more writes. Depending on the algorithm, one may be a better fit than the other. For example, in the case of a straightforward, non-work-efficient Breadth-First Search (BFS), push-based execution can be far more efficient than pull-based, because in pull-based execution, deep vertices, such as the leaves of a tree-like graph, unnecessarily iterate over all their parents for many iterations while waiting for a parent to change, wasting valuable memory bandwidth on futile reads. Even though push-based execution is more efficient in general, it is weaker in terms of parallel execution due to race conditions and false sharing. Consistent solutions, such as atomic accesses, incur a high hardware cost or degrade performance because of serialization. Our work is based on a pull-based design, but, as will be explained, the work-efficient mode obviates unnecessary vertex processing.
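As a rough software illustration of the difference, the two strategies can be sketched as follows (a BFS-like level propagation is used only as an example; names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Pull-based step: every vertex only reads its parents and writes itself.
void pullStep(const std::vector<int>& off, const std::vector<int>& nbr,
              const std::vector<int>& level, std::vector<int>& next, int iter) {
    for (std::size_t v = 0; v + 1 < off.size(); ++v)
        for (int e = off[v]; e < off[v + 1]; ++e)
            if (level[nbr[e]] == iter - 1 && next[v] > iter)
                next[v] = iter;               // write only to the current vertex
}

// Push-based step: only frontier vertices write, but they write to neighbors,
// which becomes a race condition if executed in parallel.
void pushStep(const std::vector<int>& off, const std::vector<int>& nbr,
              const std::vector<int>& frontier, std::vector<int>& level, int iter) {
    for (int v : frontier)
        for (int e = off[v]; e < off[v + 1]; ++e)
            if (level[nbr[e]] > iter)
                level[nbr[e]] = iter;         // write to a neighbor's value
}
```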
Synchronous vs. Asynchronous Execution: In an iterative graph algorithm, all or some of the vertices are processed in each iteration, and their next data values are calculated. There are two strategies for when to update the vertices with their calculated next values. In asynchronous execution, the vertex value is updated immediately, so neighbor vertices can read the new value in the same processing iteration. In contrast, in synchronous execution, the next value is kept in a temporary variable and is committed only at the end of the iteration. Therefore, there are two copies of the data: one for reading the old values and the other for writing the new values. New values become available for reading starting from the next iteration. Synchronous execution converges more slowly, but it is easier to implement, as there is no intra-iteration dependency. In asynchronous execution, some complexities may arise, such as possible race conditions (because data is both read and written) and the need to ensure the sequential consistency required for correctness, where consistent solutions can be costly [21].
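A minimal sketch of the synchronous (double-buffered) strategy is shown below; an asynchronous variant would instead update value[v] in place, so that later vertices in the same iteration already observe it (the code is illustrative, with the per-vertex update left as a placeholder):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Synchronous execution keeps two copies of the vertex values and swaps them
// at iteration boundaries, so readers never see values written in the same
// iteration (illustrative sketch).
void runSynchronous(std::vector<float>& value, int iterations) {
    std::vector<float> nextValue(value.size(), 0.0f);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t v = 0; v < value.size(); ++v)
            nextValue[v] = value[v];    // placeholder for the algorithm-specific update
        std::swap(value, nextValue);    // new values become visible only now
    }
}
```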
Workload Imbalance: In many systems, whether implemented at the software or the hardware level, balancing the workload is a significant problem for graph applications because of the power-law distribution of vertex degrees. Techniques such as dynamic load balancing or vertex-degree-aware scheduling, as mentioned in Section
3, try to tackle this problem.
High throughput vs. Work efficiency: In executing graph applications, one can opt for high-throughput execution, where all vertices are sequentially loaded to the pipeline in every iteration for possible processing. However, in work-efficient execution, instead of processing all of the vertices, only those tagged as active are processed. A vertex is active when at least one of its neighbors has been updated in the previous iteration. The active vertices that need to be processed are called
active-set. Therefore,
work efficiency refers to only processing active vertices at every iteration. The fundamental architectural difference is the need for a fast and efficient implementation of
active-set. The high-throughput mode achieves a higher number of processed edges per second, whereas the work-efficient mode converges faster. Using the number of edges processed per second as the only performance metric may not be wise, since it does not reflect work efficiency. However, work-efficiency support can introduce significant complexity and overhead [
21].
3 Related Work
While a wide variety of graph-processing approaches exist at both the hardware and the software level, to focus on FPGAs we do not consider the complete software frameworks developed to provide users with high-level and easy-to-use software modeling on CPUs and GPUs. We have given the reasons why GPUs are not the best option for graph processing in Section
2.1. Moreover, comparisons with CPU have been provided by FPGA works covered below.
Locality improvement works: A series of works focuses on optimizing off-chip memory bandwidth efficiency by proposing techniques such as data layout and compression. For example, in Reference [19], the significant locality potential of real-world graphs is explored in an online fashion. A locality-aware online scheduler tries to improve data reuse by exploiting the community structure of real-world graphs and predicting well-connected regions ahead of time. In Reference [32], the inherent graph property "vertex degree" is considered for optimizing a software/hardware co-design architecture. Since high-degree vertices can be the bottleneck in graph algorithms, the authors propose degree-aware adjacency-list reordering. In Reference [4], using the common CSR graph data format, graph workloads are analyzed to show higher performance sensitivity to L3 cache size than to private L2 cache size. Based on profiling insights, such as data-reuse distances, an application-specific and data-aware prefetcher is proposed to increase inherent data reuse. In Reference [29], different types of irregularities in graph analytics are classified. Then, to alleviate them with a co-design approach, data-aware dynamic scheduling is suggested to schedule the program on the fly, with microarchitectural support for extracting data dependencies at runtime. Such techniques are generally orthogonal to our work and can be applied simultaneously, because our worst-case assumption is that there is no vertex locality and memory access efficiency is minimal.
Non-FPGA-based works: Due to very low-level hardware modifications or implementation complexity, some designs are more suitable for Application-Specific Integrated Circuits (ASICs) than for FPGAs. As such, Graphicionado [14] proposes a set of data-type and memory-subsystem specializations to reduce memory access latency. Processing-In-Memory (PIM)-based accelerators, like Reference [9], reduce memory access cost by integrating accelerators inside the memory. In References [3, 22], the authors offer a configurable, work-efficient, asynchronous, and template-based graph-processing accelerator architectural model. However, due to the excessive complexity devoted to ensuring the advanced features listed in Reference [21], such as the strict sequential-consistency property in asynchronous execution, the design is not practically usable on all FPGA platforms. An FPGA cannot afford the intricacy, area, and interconnect-network burden of that architecture while giving a practically efficient implementation. In our work, the architecture is simplified to synchronous execution so that it fits better into an FPGA. In addition to the high-throughput mode, a work-efficient feature is added with a novel and efficient bit-vector.
FPGA-based works: There have been frameworks specifically targeting large-scale graph processing on FPGAs. ForeGraph [12] introduces a scalable multi-FPGA architecture. The graph is partitioned among FPGAs, inside dedicated off-chip memories, while an optimized communication mechanism among the FPGAs is provided. In the proposed scheduling scheme and data-compression technique, the graph is loaded into fast on-chip Block RAM used as a cache, and the dedicated off-chip memories provide higher bandwidth. The main idea is to obtain more Block RAM by using multiple FPGAs, hoping for more data locality. ForeGraph also reorders edges with potential data-write conflicts. In Reference [33], a data layout technique with architectural support is proposed to minimize the number of random accesses to external memory, which also reduces the power consumption of on-chip Block RAMs. In HitGraph [34], a design automation tool is proposed to generate synthesizable RTL code for a graph accelerator. In addition, to improve performance, several algorithmic optimizations, such as graph partitioning, optimized data layout, and inactive-partition skipping, are introduced. In the aforementioned studies, processing pipelines are usually simple and shallow, and side techniques for coping with the memory-bottleneck problem are the main novelty. Our work, without conflicting with these techniques, focuses directly on the pipeline architecture; we intentionally remove possible locality from the graph data with an initial shuffling. In Reference [30], a parallel accumulator is proposed to remove serialization in atomic operations for conflicting vertex updates, applicable to specific graph algorithms. WaveScheduler [28] proposes a scheduler for Sparse Matrix-Vector Multiplication (SpMV)-based multi-accelerator graph processing on FPGA. Besides two data-reordering optimizations, the key insight is an appropriate tiling of the underlying adjacency matrix to eliminate all read/write conflicts in on-chip BRAM. Again, most of these works focus on increasing locality and are orthogonal to our work. Because we do not assume locality in our implementation, we instead focus on designing an efficient pipeline that can also be used in other FPGA graph-processing frameworks.
5 Template-based Accelerator Architecture
In this section, we describe the pull-based template pipeline architecture used for vertex-centric graph processing.
In each iteration, large chunks of vertices are dynamically assigned to, and processed in, parallel running accelerators. For each vertex, all connected edges are processed in a loop. To optimally utilize memory bandwidth by enabling spatial locality and cache usage, the vertex and edge lists are fetched in order. Graphs are stored in the common Compressed Sparse Row (CSR) format, which facilitates straightforward streaming memory access. Since vertex degrees can differ considerably, vertex processing times may also vary considerably; hence, vertices are executed (processed) out of order. Similar to loads, vertices are committed and written back to memory in order, for efficient use of memory bandwidth and to avoid potential false-sharing problems. This way, the majority of the bandwidth (more than 90% in our experiments) can be dedicated to the inevitable random (irregular) memory accesses that read neighbor-vertex data.
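For reference, a CSR graph as used here can be sketched as follows (member names are illustrative, not the framework's actual data structures):

```cpp
#include <cstdint>
#include <vector>

// Illustrative CSR layout: the offset and neighbor arrays are contiguous, so
// the pipeline can stream them with sequential, cache-line-friendly reads;
// only reads of vertexValues[neighbor] are random accesses.
struct CsrGraph {
    std::vector<uint64_t> rowOffsets;    // size = numVertices + 1
    std::vector<uint32_t> neighborIds;   // neighbor vertex-IDs, size = numEdges
    std::vector<float>    vertexValues;  // read copy; a separate write copy is
                                         // kept for synchronous updates
};

// The degree of a vertex falls out of the offset array.
inline uint64_t degree(const CsrGraph& g, uint32_t v) {
    return g.rowOffsets[v + 1] - g.rowOffsets[v];
}
```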
As mentioned earlier, large-scale graph applications have a well-known inherent bottleneck in accessing off-chip memory, which leads to high latency and low bandwidth efficiency. Carefully utilizing this limited resource requires precise execution, which can be achieved at a cycle-accurate level; at the same time, this is tedious and complicated. In our template-based design, we implement the majority of the common modules only once; only a single user-specific module remains to be written. In the deep-pipeline vertex-processing architecture, multiple vertices are being processed in different stages of the pipeline, with many concurrent outstanding memory requests to tolerate the high latency of main-memory access. The architecture can be configured in two execution modes. In the lighter high-throughput mode, all vertices are loaded into the pipeline in each iteration. In the work-efficient mode, using a novel and fast bit-vector design to implement the active-list, only active vertices are loaded and processed in each iteration. We explain both of these options in the following subsections.
5.1 High-throughput Mode
Figure
4 gives the different modules in our high-throughput architecture, which we explain in detail below.
Data and Control Tables: There are a few tables to keep track of the vertices being processed. They keep vertex states, such as vertex degree and value, the number of remaining unprocessed edges, and temporary data for the gather phase. The table length is the maximum number of vertices in execution at the same time (the pipeline depth is set to 128 because of the roughly 100-clock-cycle memory latency). Different pipeline stages of the design may have simultaneous read/write accesses to these tables, so their implementation, which usually faces resource contention and timing constraints, has to be efficient as well. For this purpose, the on-chip multi-port memory resources of the FPGA, including Block RAMs and even memories built from LUTs and flip-flops, are utilized.
Table Allocator: responsible for allocating a vacant row to the next incoming vertex-ID in control and data tables. After that, the assigned row-id, which points to a specific vertex, flows through the next pipeline stages until the end of processing.
Vertex Initiator: sequentially reads a free row-id from the "Table Allocator" queue and an unprocessed vertex from a streaming memory port, and fills some table entries with initial vertex data, such as the vertex value and degree. Internally, this module is composed of multiple pipeline stages for higher throughput.
Edge Loop Setup: sequentially reads the row-id of an initialized vertex from the "Vertex Initiator" queue and the information of all connected edges from a streaming memory port. The data of a connected edge contains the vertex-ID of the other (neighbor) vertex. Then, in a so-called edge loop, a random-access memory request is sent for each edge of the current vertex to read the data of the neighbor vertex. The row-id of the current vertex under processing is also attached to the request and is returned with the response (supported by the memory controller). Since responses may arrive out of order, this ensures that the owner vertex of each response is known. This is the only irregular memory access in the overall design. When all requests for the neighbor vertices of the current vertex have been sent, the module moves to the next vertex (row-id) from the "Vertex Initiator" queue. Multiple vertices being processed (up to 128) may have many pending vertex-info requests for their own neighbors. Having many outstanding memory requests in flight hides the large off-chip memory latency (over 100 cycles). Ideally, requests are sent successively, one per clock cycle.
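The row-id tagging can be pictured with the following illustrative request/response shapes (these structs are a simplified sketch, not the platform's actual memory interface):

```cpp
#include <cstdint>

// A neighbor-read request carries the table row-id of the issuing vertex as a
// tag; the memory controller echoes the tag back with the response, so
// out-of-order responses can still be matched to the vertex that owns them.
struct NeighborReadRequest {
    uint64_t address;  // DDR address of the neighbor's vertex data
    uint8_t  rowId;    // table row (0..127) of the in-flight requesting vertex
};

struct NeighborReadResponse {
    uint64_t data;     // the requested neighbor data (part of one cache line)
    uint8_t  rowId;    // echoed tag identifying the owner vertex's table row
};
```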
Edge Loop Execution: The responses to the neighbor vertex-info requests sent by the previous module are received out of order, ideally one per clock cycle. Responses may also belong to various vertices. Because of poor locality, it is highly probable that each read response consumes one full cache-line of memory bandwidth. The data of each neighbor vertex is "gathered" (accumulated, or similar, depending on the specific graph algorithm) in temporary variable(s) in the table row of the related vertex. When all neighbors of any vertex under processing have been processed, its row-id is passed to the "apply" stage.
Apply: In this module, after a final complementary calculation on the result of the "gather" stage, the vertex value is updated in the related table. Then, a "done" bit is set for this vertex in the related table. Similar to other modules, vertices may finish this step out of order.
User Module: All customizable user codes are inside this module. It includes functions applicable to vertex data in gather and apply phases. This module is included and merged into the gather and apply modules at compile time.
Write data: sequentially reads a vertex row-id from the commit queue and waits for its completion by monitoring the "done" bit. When the done signal is detected, the vertex's new value is written back to the DDR memory. As the row-ids in the commit queue are in order, writes are also performed in order. This way, one iteration of the iterative algorithm is accomplished for a specific vertex. To realize the synchronous update strategy, separate read and write vertex-info data structures are kept in the off-chip DDR.
Deallocator: finally releases all table entries for the row-id of a finished vertex, to be reused by subsequent vertices.
5.2 Work-efficient Mode
As explained before, the work-efficient architecture does not process all vertices in every iteration. Therefore, it has some additional modules, plus some modifications to the previous ones. They are shown in Figure
5 and explained in detail in the paragraphs below.
Pre-fetch: The sequential vertex access in the "Vertex Initiator" module of the high-throughput mode makes it possible to issue a single read/write request for a large vertex chunk (containing all vertices dedicated to each accelerator). In work-efficient mode, however, only some of the vertices are processed in every iteration; hence, instead of one big chunk, there must be multiple smaller, disjoint ones. To reduce access overhead, especially for small requests to nearby addresses, a grouping mechanism creates such smaller chunks at runtime. The pre-fetch module is responsible for batching consecutive individual vertices (consecutive vertex-IDs) into one larger batch, so that a single memory request is sent for the whole batch (chunk).
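A software sketch of this grouping logic might look as follows, assuming the active vertex-IDs arrive in ascending order (as produced by the bit-vector described below); names are illustrative:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Batches runs of consecutive active vertex-IDs into (start, count) pairs so
// that each run can be fetched with a single contiguous memory request
// instead of many small ones (illustrative sketch).
std::vector<std::pair<uint32_t, uint32_t>>
batchConsecutive(const std::vector<uint32_t>& activeIds) {
    std::vector<std::pair<uint32_t, uint32_t>> batches;
    for (uint32_t id : activeIds) {
        if (!batches.empty() &&
            batches.back().first + batches.back().second == id)
            ++batches.back().second;      // extend the current run
        else
            batches.push_back({id, 1});   // start a new run at this vertex-ID
    }
    return batches;
}
```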
Read-CL: If successive disjoint vertex-read requests have overlapping cache-line data (the end of one request and the beginning of the next), the later request is served from a single locally buffered cache-line instead of a new memory read. This can be helpful for small requests to nearby addresses. This module also provides the values of unmodified vertices for the next write operation.
Write-CL: Similar to its read counterpart, this module is responsible for handling a cache-line for the write operation. The cache-line is filled element by element and then flushed to memory.
Scatter: Finally, after a vertex is updated with its new value, and provided that the change is beyond a specified threshold, the scatter module informs all neighbors about the value update. It iterates over all neighbors and sends their vertex-IDs to the bit-vector module to construct the active-list of the next iteration.
Bit vector: This key module is responsible for keeping the list of active vertices in work-efficient mode. This way, in each iteration, instead of processing all vertices, only those tagged as active are processed. Efficient implementation of the active-list can be challenging [21]. In our fast, pipelined solution, active-list members can be both quickly inserted into and quickly extracted from the bit-vector module, as it is mapped to fast on-chip Block RAMs. As depicted in Figure
6, the Bit vector is composed of two sets of ping-pong swappable memories: one keeps the active-list of the current iteration, which is read, and the other stores the active-list of the next iteration, which is written. The ping-pong memories are swapped after each iteration. In the basic bit-vector implementation, a single memory bit is assigned to each vertex, where a "1" indicates membership in the active-list. While inserting a vertex-ID into on-chip memory is naturally fast (just setting a single bit at a specific address, the vertex-ID, to "1" in a single clock cycle), retrieving the active vertices can be rather slow, since the whole bit-vector must be scanned. To tackle this problem, besides internal pipelining, a novel multi-level design is proposed for the Bit vector. In the advanced implementation, vertices are addressed at three different levels. A higher-level active-set is responsible for coarser groups of vertices, that is, larger chunks. When a vertex is added, it is added to all hierarchy levels in parallel, as shown in the right part of Figure
6. The benefit is realized as a reduction in active-list retrieval latency, as shown in the left part of Figure
6. Using the higher-level indicator bits, non-active chunks of vertices are quickly skipped without deeply searching the lower-level addresses. At the start of each iteration, the small top-level list is fully searched for any active chunk. Then, only the active addresses are collected in a FIFO and passed to the second-level list. The same is done for the third level. Finally, the list of all active vertex-IDs is extracted monotonically and sequentially.
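A behavioral software model of the three-level structure is sketched below (in hardware each level is mapped to Block RAM and the levels are updated in parallel; the chunking factor and names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Software model of the three-level bit-vector: l2 holds one bit per vertex,
// l1 one bit per chunk of kChunk vertices, l0 one bit per chunk of chunks.
struct MultiLevelBitVector {
    static constexpr uint32_t kChunk = 64;   // illustrative chunking factor
    std::vector<bool> l0, l1, l2;

    explicit MultiLevelBitVector(uint32_t numVertices)
        : l0((numVertices + kChunk * kChunk - 1) / (kChunk * kChunk), false),
          l1((numVertices + kChunk - 1) / kChunk, false),
          l2(numVertices, false) {}

    void insert(uint32_t v) {                // all levels are set in parallel in hardware
        l2[v] = true;
        l1[v / kChunk] = true;
        l0[v / (kChunk * kChunk)] = true;
    }

    // Retrieval: scan the small top level first and descend only into active chunks,
    // emitting active vertex-IDs monotonically.
    template <class Fn> void forEachActive(Fn fn) const {
        for (uint32_t i = 0; i < l0.size(); ++i) {
            if (!l0[i]) continue;            // skip a whole coarse chunk
            for (uint32_t j = i * kChunk; j < (i + 1) * kChunk && j < l1.size(); ++j) {
                if (!l1[j]) continue;        // skip a medium chunk
                for (uint32_t v = j * kChunk; v < (j + 1) * kChunk && v < l2.size(); ++v)
                    if (l2[v]) fn(v);        // active vertex found
            }
        }
    }
};
```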
Scheduler: To alleviate possible load imbalance, a runtime scheduler module (not shown in the figures) dynamically assigns the next vertex chunk to the first idle accelerator.
Note that the available Block RAM size of the underlying FPGA determines the maximum graph size supported in the work-efficient mode. In this version of Xeon+FPGA, and in the basic version of the work-efficient active-set, where a single memory bit is assigned to each vertex, a graph size of 8 million vertices is supported (2 million per accelerator, as shown in Figure
6). As an idea for future work, reducing the resolution makes it possible to support larger graphs at the expense of efficiency. For example, if a single bit of the bit-vector is assigned to a group of 32 vertices, then graphs of 256 million vertices can be handled. However, in this case, all 32 vertices have to be loaded, even if only one of them is actually active.
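The reduced-resolution idea amounts to a simple index mapping, sketched here (the group size of 32 follows the example above; the rest is hypothetical):

```cpp
#include <cstdint>

// One active bit covers a group of 32 vertices: 8 M bits can then track
// 8 M x 32 = 256 M vertices, at the cost of loading all 32 vertices of a
// group whenever any one of them is active (illustrative sketch).
constexpr uint32_t kGroupSize = 32;

inline uint32_t groupOf(uint32_t vertexId)    { return vertexId / kGroupSize; }
inline uint32_t firstVertexOf(uint32_t group) { return group * kGroupSize; }
```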
5.3 User Programming Interface
As mentioned earlier in Section
2.3,
Gather-Apply-Scatter (GAS) is the common processing model for vertex-centric programs. According to this model, the user defines the following three functions in the discussed
User Module to specify the three conceptual phases of an iterative algorithm. The template is customized only through these three functions, by which the graph application is defined in the GAS model. In this way, users are given an easy programming interface that provides programmability and productivity. At compile time, the functions are included, merged, and pipelined into the related modules (see Figures
4 and
5). In every iteration and for every vertex under processing:
Gather: is called once for each neighbor, whenever the neighbor is accessed and its data is read. Neighbor vertex values have already been requested in a streaming way, and they arrive after many clock cycles of latency. Accumulation and finding the maximum value are examples of common operations in this function. For example, in PageRank, the function can be:
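A plausible sketch of such a gather function is the following (the VertexData type and its field names are assumptions for illustration, not the framework's actual interface):

```cpp
// Illustrative per-vertex data; field names are assumed for this sketch.
struct VertexData {
    float value;    // current PageRank value (read copy)
    int   degree;   // out-degree of the vertex
    float sum;      // temporary accumulator held in the on-chip Data Table
};

// Gather: called once per neighbor; accumulates the neighbor's contribution.
void gather(VertexData& self, const VertexData& neighbor) {
    self.sum += neighbor.value / neighbor.degree;
}
```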
Apply: is called once, after processing the last neighbor, to finalize the calculation and update the vertex value. Division after accumulation, as in an averaging operation, is an example of this. For example, in PageRank, we have (A, B, and C are numerical constants of the PageRank formulation):
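A plausible sketch of the apply function is shown below; the exact way A, B, and C enter the update depends on the PageRank formulation used, so the arrangement here is only an assumption:

```cpp
// Apply: called once per vertex after its last neighbor has been gathered.
// How A, B, and C are combined below is an assumption for illustration.
void apply(VertexData& self, float A, float B, float C) {
    self.value = A + B * (self.sum / C);  // finalize the rank from the gathered sum
    self.sum = 0.0f;                      // reset the temporary for the next iteration
}
```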
Scatter: is called once, after "Apply," to inform the neighbors about the value update. This function is not used in pull-based implementations such as our framework.
Vertex data, which are large arrays indexed by vertex-ID (stored in global off-chip DDR memory), and any other temporary variables for a vertex in the pipeline (stored in the on-chip Data Tables) are defined by the user in a specific class. The user can implement any complex function, dependent on the data, the iteration number, and so on. The three C/C++ functions are converted to logic, just like the SystemC units. For more relaxed timing in high-latency functions, the user can define the function latency as a number of clock cycles; this value is passed to the underlying SystemC implementation during logic synthesis.
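A sketch of what such a user-defined class and latency hint could look like is given below (all names and the constant-based latency declaration are assumptions; the framework's actual syntax may differ):

```cpp
#include <cstdint>

// Illustrative user-defined per-vertex data. The persistent fields correspond
// to the large vertex-indexed arrays in off-chip DDR (with separate read and
// write copies for synchronous updates); the temporary field lives in the
// on-chip Data Table row while the vertex is in flight.
struct UserVertexData {
    float    value;      // persistent vertex value, stored in DDR
    uint32_t degree;     // persistent vertex degree, stored in DDR
    float    gatherSum;  // temporary accumulator, stored in the Data Table
};

// If a user function cannot complete in one cycle, its latency can be stated
// in clock cycles and forwarded to the SystemC synthesis flow; the constant
// below is only a hypothetical way of expressing that.
constexpr unsigned kApplyLatencyCycles = 4;
```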
8 Conclusion
In the age of AI, utilizing hardware accelerators, such as GPUs, FPGAs, and neural-network sticks, has become common practice. For processing graphs, a fundamental data-modeling technique, FPGAs are more efficient than GPUs due to the irregular data accesses involved. In this work, we designed and implemented a SystemC HLS-based graph-processing template with a precisely designed, cycle-accurate deep-pipeline architecture. The template-based design is simplified for easy mapping onto the FPGA, so that even software programmers can generate accelerators conveniently. The template is customized through a single module in C/C++, combining high-level programmability with the efficiency of the SystemC hardware language. In high-performance mode, the high-throughput pipeline achieves the maximum edge throughput with the minimum number of accelerators. The pipeline is optimized to reach the maximum feasible throughput under the assumption of no vertex locality (the worst case); however, any orthogonal locality-improving technique from the literature can be used in tandem to gain additional performance. In addition, the work-efficient mode significantly reduces total runtime with a novel active-list design. Through experiments on the Intel Xeon+FPGA platform, we showed the benefits of the proposed template compared with an OpenCL-based implementation. Based on our results, the template considerably outperforms the OpenCL version, providing convenient programmability, higher throughput, lower runtime, and lower power consumption.