
HLS-based High-throughput and Work-efficient Synthesizable Graph Processing Template Pipeline

Published: 24 January 2023

Abstract

Hardware systems composed of diverse execution resources are being deployed to cope with the complexity and performance requirements of Artificial Intelligence (AI) and Machine Learning (ML) applications. With the emergence of new hardware platforms, system-wide programming support has become much more important. While this is true for various devices ranging from CPUs to GPUs, it is especially critical for neural network accelerators implemented on FPGAs. For example, Intel's recent HARP platform couples a Xeon CPU with an FPGA and requires an extensive software stack to be used effectively. Programming such a hybrid system is a challenge for most non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem. However, as the abstraction level increases, the efficiency of the implementation decreases, creating two opposing requirements. In this work, we propose a framework to generate an HLS-based, FPGA-accelerated, high-throughput/work-efficient, synthesizable, and template-based graph-processing pipeline. While a fixed, clock-cycle-precisely designed deep-pipeline architecture, written in SystemC, is responsible for processing graph vertices, the user implements the intended iterative graph algorithm by implementing/modifying only a single module in C/C++. This way, efficiency and high performance are achieved together with better programmability and productivity. With similar programming effort, the proposed template is shown to outperform a high-throughput OpenCL baseline by up to 50% in terms of edge throughput. Furthermore, the novel work-efficient design significantly improves execution time and energy consumption by up to 100×.

1 Introduction

Recent years have seen massive interest in Artificial Intelligence (AI), driven by a sharp increase in demand from industries such as security, finance, health care, and the military. Meanwhile, graph theory, one of the most fundamental modeling techniques in data science, contributes to many current AI applications. Problems that can be modeled as communicating entities, such as the Internet of Things (IoT), social networks, web search, transportation, health-care systems [23], and even biology [1], are examples for which graph data models are an excellent fit. For example, according to official statistics, as of the third quarter of 2020, Twitter had more than 180 million daily active users and more than 500 million tweets posted per day [27]. For such compute-intensive applications, strong support from the underlying hardware to accelerate software algorithms has become an indispensable requirement, due to the widespread use of AI on huge datasets. Graph algorithms are among the important workloads in AI and machine learning (ML) implementations, since iterative model training is one of their greatest challenges. The concurrent nature of graph models (vertices/edges represent concurrent entities/links) provides large parallelism potential. However, efficiently implementing these applications on existing systems is not trivial. Therefore, efficient co-processor-based graph processing is considered a promising solution to the growing need for graph applications within the AI domain.
As the complexity of digital systems increases, so does the tendency to use productive High-Level Synthesis (HLS) languages, such as SystemC, over conventional RTL-level Verilog and VHDL. Moreover, as CPU-FPGA hybrid platforms become omnipresent, particularly in IT domains such as data centers running analytics applications, writing FPGA programs for these platforms becomes a dilemma for software programmers. This calls for even higher-level HLS languages, such as Intel OpenCL for FPGA, in which even the most basic digital design concepts, such as the clock signal, are absent from the syntax. Currently, mainstream FPGA makers follow this approach to provide better programmability. However, as the abstraction level of a design increases, the optimality of the implementation degrades, particularly for complex designs. This issue specifically applies to graph-processing applications, in which, as we will discuss, a carefully crafted system architecture is necessary to efficiently exploit the valuable memory bandwidth. Considering these two opposing requirements, namely easier programmability and implementation efficiency, our goal is a hybrid graph-processing pipeline that mixes cycle-accurate implementation, for efficiency, with high-level programmability, for productivity.
In this work, we propose an HLS-based graph-processing pipeline. The target platform consists of a host processor connected to multiple hardware accelerators on the FPGA. A software program on the host processor is only responsible for initiating execution on the FPGA, with no further intervention afterward. The FPGA user code is composed of two kinds of modules. The first is a collection of fixed modules, provided by our framework, written in SystemC in a cycle-accurate way and implementing an efficient, high-throughput vertex-processing pipeline. The second is a single customizable module, written by the user in pure C/C++, in which the intended iterative graph algorithm is implemented merely inside this specific module of the template. Finally, all modules are converted to a bitstream to be executed on the FPGA. In this architecture, parallelism is provided in two dimensions: across multiple accelerators and inside a manually designed deep pipeline. While the proposed architecture is not limited to a specific platform, the current implementation specifically targets the emerging generation of Intel Xeon+FPGA platforms, on which our experiments have been executed. For comparison purposes, candidate graph algorithms are also implemented in OpenCL, the alternative HLS platform. In very high-level OpenCL, the user program is written in a straightforward way, similar to software programming, and a pipelined architecture is automatically generated by the architecture compiler. We show that our reusable template can be used to implement a high-throughput or work-efficient graph algorithm by writing only pure software code. This way, the programmer does not need to engage in the complicated HLS techniques or transformations typically required to generate an efficient architecture out of an HLS design. The main contributions of this work are as follows:
We propose a framework to generate an efficient graph-processing pipeline for iterative graph algorithms. The generated pipeline is high-performance/work-efficient, SystemC-based, synchronous, deeply pipelined, fully synthesizable, and ready to be implemented on FPGA. The pipeline is optimized to reach the maximum feasible throughput (one edge per clock cycle), assuming that there is no vertex locality, which is the case for large-scale graphs (minimum cache hit rate, i.e., the worst case). While it can be executed on any CPU+FPGA architecture, it is specifically prepared for the state-of-the-art Intel Xeon+FPGA platform.
The framework is template-based, designed for convenient use by non-hardware experts, for the rapid generation of high-performance graph accelerators using only C/C++. This combines the high-level programmability with the efficiency of underlying SystemC language.
We implement a novel fast bit-vector to keep an active vertex list, which is mapped to FPGA Block RAMs. This enables work-efficient graph processing.
We compare and contrast the alternative HLS platform for FPGA, OpenCL, to show limitations and difficulties in implementing high-performance pipelined graph algorithms.
This article is organized as follows: The next section describes the background on graph processing as well as the Intel Xeon+FPGA platform. Section 3 reviews the relevant studies on the topic. In Section 4, baseline implementations of graph algorithms in Intel OpenCL, their various structures, and their drawbacks are discussed. In Section 5, our HLS-based pipelined graph-processing architecture is introduced in detail. In Section 6, the experimental setup and results are discussed. In Section 7, limitations and future work are discussed. The article is concluded in Section 8.

2 Background

In this section, we review some background on graph processing. First, we state why FPGA is preferable over GPU for this domain. After that, the FPGA hardware platform that we use is explained. Finally, to facilitate reading, we recall some common terminology.

2.1 FPGA or GPU?

Currently, utilizing the computational power of GPUs in user applications such as multimedia processing tools, deep-learning algorithms, or scientific computations is a well-known approach for High-Performance Computing (HPC). This General-Purpose GPU (GPGPU) computing can, however, be limited by control divergence and memory divergence. Graph-based applications typically have data-dependent behavior, both in control flow and in memory references, mainly due to the diversity of graph topology. For this reason, GPUs are generally not considered a suitable choice for accelerating graph applications in the literature [7].
In a more recent trend, High-Performance Reconfigurable Computing (HPRC), which integrates FPGA accelerators with general-purpose processors, has attracted the attention of dominant market vendors. Intel Corp. acquired Altera, one of the leading FPGA makers, in a deal valued at $16.7 billion, recorded as the largest acquisition in the semiconductor industry at the time. Despite the lower operating frequency of FPGAs, caused by their complex routing interconnects, there are many applications where the flexibility of the FPGA architecture to build a deep and customizable pipeline outperforms the computational power of GPGPUs for equal available off-chip memory bandwidth [11, 20]. As FPGA accelerators become serious competitors to GPUs, many cloud service providers use them extensively in their data centers to provide massive parallelism [31]. Currently, Amazon offers FPGA nodes on its EC2 platform [2]. Microsoft has integrated a massive number of FPGAs into its data centers, aiming at FPGA-powered high-performance real-time AI systems, including enhancing the performance of the Bing web search engine [8, 10].

2.2 Hardware Accelerator Research Program (HARP) Platform

We implemented and tested our approach on a version of the state-of-the-art Intel Xeon+FPGA platform, where a Xeon processor is connected to an Arria 10 GX1150 FPGA via two PCIe channels and one QPI channel (see Figure 1). On this platform, a DDR memory is available on the processor side, and the FPGA can access it through the three serial channels and the processor's memory controller. The maximum total read/write bandwidth between processor and FPGA is \(\sim\) 20 GB/s, while according to our measurements, read and write can go up to \(\sim\) 17 and \(\sim\) 15 GB/s, respectively. Part of the FPGA bitstream, which includes the memory and host-CPU communication controllers used by the FPGA, is already provided to users by the Intel SDK as the so-called fixed blue-stream. In addition to the blue-stream, using FPGA partial reconfiguration, users develop their own hardware design as an Accelerator Function Unit (AFU). This customizable user code, called the green-stream, is attached to the existing blue-stream to form a full bitstream. In this setting, the user logic can work at a maximum frequency of 400 MHz. After the bitstream is programmed to the FPGA, host-side software initiates the FPGA run, and thereafter the FPGA operates without host CPU intervention until it indicates a done signal.
Fig. 1.
Fig. 1. Intel Xeon+FPGA platform.
We have implemented our approach on a remote system at Paderborn University in Germany, under the Intel Hardware Accelerator Research Program (HARP). At the time of writing, the two SDKs that we used are only available under the HARP program, with some limitations. The two applied design flows, for the baseline and the template, are illustrated in Figure 2. As can be seen, the OpenCL flow (bottom) is simpler in terms of execution, whereas the SystemC flow (top) involves HLS and placement-and-routing tools. From the user's point of view, OpenCL code is directly converted to a bitstream, whereas in SystemC, intermediate RTL code is generated. To convert SystemC to RTL, we followed Reference [15], which requires some additional third-party tools. In both cases, host software initializes the FPGA program.
Fig. 2.
Fig. 2. FPGA design flow using SystemC (top) and OpenCL (bottom) for Xeon+FPGA platform. Host software on CPU initiates the FPGA run.

2.3 Graph Processing

Iterative Graph Processing Model: A graph with an initial value per vertex/edge is iteratively processed by a defined algorithm. After every iteration, values are expected to improve until they reach the final desired result. To complete the execution, all vertices must converge.
Memory Access Bottleneck: In large-scale graph applications, low cache utilization, high memory access latency, and low bandwidth utilization in accessing graph data on off-chip memory are well-known bottlenecks. This is because of the inevitable random (irregular) memory access pattern. Diverse references to graph data (e.g., neighbor vertices/edges of a vertex) on a wide range of memory addresses lead to poor data locality. This, in turn, causes dramatically low cache utilization and low memory bandwidth utilization, in addition to large (typically around 100 clock cycles) data access latency. Hence, graph applications are typically memory bandwidth-bound.
Gather-Apply-Scatter (GAS) model: Represents three conceptual phases of a vertex-centric graph program. In the gather phase, data of adjacent vertices/edges is collected. Then, in the apply phase, after the necessary calculations, the new vertex value is calculated and updated. Finally, in the scatter phase, the neighbor vertices are informed about the value change in the current vertex. In the GAS model, data access is limited to neighbor vertices.
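As a concrete illustration, the sketch below renders the three GAS phases for one vertex in plain C++; the type and function names are illustrative only and do not correspond to any specific framework interface.

```cpp
#include <vector>

// Illustrative only: a software rendering of the GAS phases for one vertex.
struct Vertex { float value; std::vector<int> neighbors; };

void process_vertex(int v, std::vector<Vertex>& g) {
    // Gather: collect data from adjacent vertices.
    float acc = 0.0f;
    for (int u : g[v].neighbors) acc += g[u].value;

    // Apply: compute and update the new vertex value.
    float new_value = acc / (g[v].neighbors.empty() ? 1 : g[v].neighbors.size());
    bool changed = (new_value != g[v].value);
    g[v].value = new_value;

    // Scatter: inform neighbors about the change; in pull-based designs this
    // phase is omitted, and neighbors simply re-read the value next iteration.
    if (changed) { /* e.g., mark neighbors as active for the next iteration */ }
}
```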
Convergence: A vertex is said to be converged when its corresponding data value reaches its final value (or gets close enough, if the data type allows). This value will not change in the following iterations. A graph is said to be converged if all of its vertices are converged.
Pull-based vs. Push-based: There are two general strategies to implement graph algorithms. In the pull-based strategy, vertex execution consists of, first, some data reads from neighbor vertices/edges, then some computation, and finally a write of the new value to the current vertex, without informing the neighbors (in other words, there is no explicit scatter phase). In contrast, in the push-based strategy, after doing its calculations, the vertex may also write data to neighbor vertices [13]. Comparatively, pull-based has a higher number of reads and higher edge-processing throughput, while push-based has more writes. Depending on the algorithm, one may be a better fit than the other. For example, in the case of a straightforward, non-work-efficient Breadth-First Search (BFS), push-based can be by far more efficient than pull-based, because in pull-based, deep vertices such as the leaves of a tree graph unnecessarily iterate over all of their parents for many iterations, waiting for a parent to change, and waste valuable memory bandwidth with vain data reads. Even though push-based is more efficient in general, it suffers in parallel execution due to race conditions and false sharing; consistent solutions, such as atomic accesses, either have a high hardware cost or degrade performance because of serialization. Our work is based on a pull-based design, but as will be explained, the work-efficient mode obviates unnecessary vertex processing.
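The difference in access direction can be sketched as follows (illustrative C++ with hypothetical names; `adj` is an adjacency list and `curr`/`next` are the synchronous value arrays).

```cpp
#include <vector>

using Neighbors = std::vector<int>;

// Pull: only reads neighbors, writes only the current vertex (no scatter phase).
void pull_update(int v, const std::vector<Neighbors>& adj,
                 const std::vector<float>& curr, std::vector<float>& next) {
    float acc = 0.0f;
    for (int u : adj[v]) acc += curr[u];                       // many reads
    next[v] = acc / (adj[v].empty() ? 1 : adj[v].size());      // one write
}

// Push: after computing, writes to every neighbor; concurrent execution
// needs atomics or serialization to stay correct.
void push_update(int v, const std::vector<Neighbors>& adj,
                 std::vector<float>& values) {
    float contrib = values[v] / (adj[v].empty() ? 1 : adj[v].size());
    for (int u : adj[v]) values[u] += contrib;                 // many writes, races
}
```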
Synchronous vs. Asynchronous Execution: In an iterative graph algorithm, in each iteration, all or part of the vertices are processed and their next data values are calculated. There are two strategies for when to update vertices with their newly calculated values. In asynchronous execution, the vertex value is updated immediately, so neighbor vertices can read the new value within the same processing iteration. In contrast, in synchronous execution, the next value is kept in a temporary variable and is committed only at the end of the iteration. Therefore, there are two copies of the data, one for reading old values and the other for writing new values; new values become readable starting from the next iteration. Synchronous execution converges more slowly, but it is easier to implement, as there is no intra-iteration dependency. In asynchronous execution, however, complexities may arise, such as the possibility of race conditions (because data is both read and written) and the need to ensure the sequential consistency required for correctness, where consistent solutions can be costly [21].
Workload Imbalance: In many systems, irrespective of being implemented at the software or hardware level, balancing the workload becomes a significant problem for graph applications because of the power-law distribution of vertex-degrees. Techniques such as dynamic load balancing or vertex-degree aware scheduling, as mentioned in Section 3, try to tackle this problem.
High throughput vs. Work efficiency: In executing graph applications, one can opt for high-throughput execution, where all vertices are sequentially loaded to the pipeline in every iteration for possible processing. However, in work-efficient execution, instead of processing all of the vertices, only those tagged as active are processed. A vertex is active when at least one of its neighbors has been updated in the previous iteration. The active vertices that need to be processed are called active-set. Therefore, work efficiency refers to only processing active vertices at every iteration. The fundamental architectural difference is the need for a fast and efficient implementation of active-set. The High-throughput mode achieves a higher number of processed edges per second, whereas the Work-efficiency mode converges faster. Setting the number of edges processed per second as the only performance metric may not be a smart decision, since it is inefficient in terms of work. However, work-efficiency support can have significant complexity and overhead [21].

3 Related Work

While a wide variety of graph-processing approaches exist at both the hardware and software levels, to keep the focus on FPGAs, we do not consider complete software frameworks that provide users with high-level and easy-to-use software modeling on CPUs and GPUs. We have given the reasons why GPUs are not the best option for graph processing in Section 2.1. Moreover, comparisons with CPUs have been provided by the FPGA works covered below.
Locality improvement works: A series of works focuses on optimizing off-chip memory bandwidth efficiency with techniques such as data layout and compression. For example, in Reference [19], the significant locality potential of real-world graphs is explored in an online fashion: a locality-aware online scheduler tries to improve data reuse by exploiting the community structure of real-world graphs and predicting well-connected regions ahead of time. In Reference [32], the inherent graph property "vertex degree" is considered for optimizing a software/hardware co-design architecture; since high-degree vertices can be the bottleneck in graph algorithms, the authors propose degree-aware adjacency-list reordering. In Reference [4], using the common CSR graph data format, graph workloads are analyzed to show a higher performance sensitivity to L3 cache size than to private L2 cache size; based on profiling insights, such as data-reuse distances, an application-specific and data-aware prefetcher is proposed to increase inherent data reuse. In Reference [29], different types of irregularities in graph analytics are classified; then, to alleviate them with a co-design approach, data-aware dynamic scheduling that schedules the program on-the-fly is suggested, with microarchitecture support to extract data dependencies at runtime. Such techniques are generally orthogonal to our work and can be applied simultaneously, because our worst-case assumption is that there is no vertex locality and the memory access efficiency is minimal.
Non-FPGA-based works: Due to very low-level hardware modifications or implementation complexity, some approaches are more suitable for ASICs (Application-Specific Integrated Circuits) than for FPGAs. For instance, Graphicionado [14] proposes a set of data-type and memory-subsystem specializations to reduce memory access latency. Processing-In-Memory (PIM)-based accelerators, such as Reference [9], reduce memory access cost by integrating accelerators inside the memory. In References [3, 22], the authors offer a configurable, work-efficient, asynchronous, and template-based graph-processing accelerator architectural model. However, due to the excessive complexity devoted to ensuring the advanced features listed in Reference [21], such as the strict sequential-consistency property in asynchronous execution, the design is not practically usable on all FPGA platforms: an FPGA cannot afford the intricacy, area, and interconnect-network burden of that architecture while still providing a practically efficient implementation. In our work, the architecture is simplified to synchronous execution to better fit an FPGA. In addition to the high-throughput mode, a work-efficient feature is added with a novel and efficient bit-vector.
FPGA-based works: Several frameworks target large-scale graph processing on FPGA in particular. ForeGraph by Microsoft [12] introduces a scalable multi-FPGA architecture. The graph is partitioned among FPGAs, inside dedicated off-chip memories, with an optimized inter-FPGA communication mechanism. In the proposed scheduling scheme and data-compression technique, the graph is loaded into fast on-chip Block RAM used as a cache, and the dedicated off-chip memories provide higher bandwidth. The main idea is to obtain more Block RAM by using multiple FPGAs, hoping for more data locality. ForeGraph also reorders edges with potential data-write conflicts. In Reference [33], a data layout technique with architectural support is proposed to minimize the number of random accesses to external memory, which also reduces the power consumption of on-chip Block RAMs. In HitGraph [34], a design automation tool is proposed to generate synthesizable RTL code for a graph accelerator; in addition, several algorithmic optimizations, such as graph partitioning, optimized data layout, and inactive-partition skipping, are introduced to improve performance. In the aforementioned studies, the processing pipelines are usually simple and shallow, and the side techniques are the main novelty for coping with the memory bottleneck. Our work, without conflicting with these techniques, focuses directly on the pipeline architecture, and we intentionally remove possible locality from the graph data with an initial shuffling. In Reference [30], a parallel accumulator is proposed to remove the serialization caused by atomic operations for conflicting vertex updates, applicable to specific graph algorithms. In contrast, WaveScheduler [28] proposes a scheduler for Sparse Matrix-Vector Multiplication (SpMV)-based multi-accelerator graph processing on FPGA. Besides two data-reordering optimizations, the key insight is the appropriate tiling of the underlying adjacency matrix to eliminate all read/write conflicts in on-chip BRAM. Again, most of these works focus on increasing locality and are orthogonal to our work. Because we do not rely on locality in our implementation, we rather focus on designing an efficient pipeline that can be used within other graph-processing frameworks for FPGA.

4 OpenCL-based Design

The simplest way to utilize the underlying FPGA architecture is through OpenCL, where an HLS tool for Xeon+FPGA by Intel (formerly Altera) is used. Unlike SystemC, in this environment, the programmer does not deal with clock signals explicitly. This is very convenient, as the programming style can be rather similar to conventional C/C++ software programming. Hence, with the aim of making FPGAs attractive for all programmers, this C-to-Gate tool is usable by non-experts, too. The OpenCL architecture compiler converts the software-like HLS code to logic in a bitstream file to be programmed to FPGA. Since OpenCL is mainly a pipeline generator, it is selected as our baseline.

4.1 High-performance OpenCL for FPGA

As widely accepted, generating automated, high-performance output is more difficult for a programming language at a higher abstraction level. Moreover, the lack of inherent parallelism in a programming language (such as C/C++) makes extracting a parallel execution model and managing the required resources a challenging task for the compiler [25, 35]. Generally, users are required to follow a specific coding style or use well-defined transformations to help the compiler extract the parallelism [18]. In the case of OpenCL, the key point for obtaining a high-throughput implementation of an iterative algorithm is writing the code in such a way that the compiler is able to automatically generate efficient pipelined hardware. In other words, a smaller loop Initiation Interval (II), or a minimal pipeline stall rate, is desirable. The initiation interval is the number of clock cycles between consecutive launches of an iterative task, which ideally is one clock cycle.
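As a small illustration, the loop below (the body of a hypothetical single work-item kernel, written here as plain C) has a loop-carried dependency that is only a simple integer accumulation, so a compiler of this kind can typically pipeline it with II = 1; a floating-point accumulation or a multi-cycle loop body would raise the II.

```c
/* Illustrative only: body of a single work-item (task) kernel. */
void sum_kernel_body(const int *restrict data, int *restrict result, const int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        /* A new loop iteration can enter the pipeline every II cycles;
           the integer add keeps the loop-carried dependency short (II = 1). */
        sum += data[i];
    }
    *result = sum;
}
```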
There are two choices for structuring kernels in Intel OpenCL: (1) the ND-Range kernel and (2) the task kernel (or single work-item kernel). While the former is mainly intended for GPUs, the latter is recommended for FPGAs [16, 17], as it benefits from FPGA-specific techniques. A task kernel is coded almost like a C/C++ software program and, in the end, is converted into a single logic module. In this kind of kernel, the compiler implicitly generates pipelined hardware for the program, where for/while loops provide concurrency across consecutive loop iterations. Moreover, additional parallelism can be obtained by having multiple instances of the same kernel (multi-accelerator). The ND-Range kernel, on the other hand, is structured according to the standard GPU-optimized OpenCL style. The kernel is converted into a small work-item (like a software thread), and a large number of these threads run concurrently on a Compute Unit (CU) hardware core. Unlike the task kernel, program loops are not pipelined inside an ND-Range kernel; instead, pipelining is supplied at the work-item level, and a compute-unit pipeline can run many work-items concurrently. Similarly, for additional hardware parallelism, multiple compute units can be instantiated. Work-items are grouped into so-called work-groups, which are assigned to a specific compute unit, and a runtime scheduler is responsible for dynamically distributing work-items and work-groups across the compute units. If the kernel body is not work-item dependent (which is the case for us), then single instruction multiple data (SIMD) can add further parallelism inside the compute unit. In light of these OpenCL features, we will discuss generating high-performance OpenCL code for our applications.

4.2 OpenCL Implementation Options

To obtain an efficient OpenCL implementation of vertex-centric graph algorithms as a baseline, we explored several scenarios.
(1)
Starting with the recommended task kernel structure, we use a straightforward doubly nested loop, with an outer loop over all vertices and an inner one over all edges of each vertex (see the top part of Figure 3; a simplified sketch of both loop structures is given after this list). Since the vertex degree is variable, the inner loop has a variable trip count, too. Therefore, in general, this style cannot generate efficient pipelines [16, 17].
Fig. 3.
Fig. 3. Pseudo-code for doubly nested loops (vertex-centric) at the top versus a single flat loop (edge-centric) at the bottom used in OpenCL implementation.
(2)
To overcome the above-mentioned problem of pipelining variable-length loops, we examined the ND-Range structure, where loops are not pipelined statically; dynamic runtime scheduling among work-items with variable workloads is then a possible solution. Therefore, the same doubly nested loops were examined in an ND-Range kernel with different configurations, including different work-item counts, work-group sizes, compute-unit counts, and so on. In this case, vertices are divided into a large number of small chunks, and each chunk is processed by a work-item. Since the execution order of work-items is unknown, accesses to the large shared global data in DDR memory (e.g., edge info) are not guaranteed to be consecutive. Consequently, low cache utilization even for sequential data is inevitable, which wastes the limited global memory bandwidth. Through a complicated manual implementation, a user-level tiny local cache was added to resolve this issue. This technique is called local banking [16] and comes with additional area and Block RAM usage costs. Even with these custom changes, pipeline utilization was not high enough, due to memory contention and a high pipeline stall rate, especially for rather large graphs.
(3)
In the task kernel style, it is also possible to use a so-called edge-centric approach, where edges are traversed sequentially in a single flat loop (see the bottom part of Figure 3). This way, edges are divided statically among kernels in large sequential chunks, leading to better workload balancing compared to vertex-centric processing. However, this does not come for free either. More specifically, a denser loop body has high latency due to the floating-point operations used in applications like PageRank, which prevented the generation of a high-throughput pipeline (II \(\geqslant\) 4 clock cycles). As a possible solution, one can replace floating-point data types with fixed-point ones, but the fixed-point format cannot support a wide range of real numbers, leading to precision problems in some cases. Another problem concerns sequential global inputs that are not accessed in every loop iteration. For example, some of the vertex data are only requested when the edge counter transitions from one vertex to the next. Since the compiler cannot handle this case on its own, as a manual solution, a helper kernel implements a FIFO, which is repeatedly filled with global data (from DDR) on one side and consumed by the pipeline on the other side. With this, the compiler handles the case efficiently and achieves II = 1. Therefore, one edge can be processed in every clock cycle by each kernel, provided that the global memory bandwidth is sufficient. For more parallelism, a few kernels (accelerators) are instantiated, where each kernel is assigned a large chunk of graph edges.
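For reference, the two loop structures of Figure 3 can be sketched roughly as follows (plain C with illustrative names, not the exact kernel code; the graph is assumed to be in CSR form with row_ptr/col_idx arrays, and the per-edge work is reduced to a simple accumulation).

```c
/* (a) Vertex-centric: doubly nested loop; the inner trip count varies per vertex,
       which hinders static loop pipelining in a task kernel. */
void vertex_centric(int num_vertices, const int *row_ptr, const int *col_idx,
                    const float *curr, float *next) {
    for (int v = 0; v < num_vertices; ++v) {
        float acc = 0.0f;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            acc += curr[col_idx[e]];                 /* gather from neighbors */
        next[v] = acc;                               /* apply/commit */
    }
}

/* (b) Edge-centric: a single flat loop over all edges; the vertex boundary is
       detected from the edge counter, which lets the compiler reach II = 1. */
void edge_centric(int num_vertices, int num_edges, const int *row_ptr,
                  const int *col_idx, const float *curr, float *next) {
    int v = 0;
    float acc = 0.0f;
    for (int e = 0; e < num_edges; ++e) {
        while (v < num_vertices && e == row_ptr[v + 1]) { /* crossed into next vertex */
            next[v++] = acc;
            acc = 0.0f;
        }
        acc += curr[col_idx[e]];
    }
    /* Commit the last vertex (and any trailing zero-degree vertices). */
    for (; v < num_vertices; ++v) { next[v] = acc; acc = 0.0f; }
}
```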

4.3 OpenCL for FPGA?

Our extensive implementation trials with many different OpenCL styles to achieve a high-performance baseline on FPGA again accentuate the difficulty of writing efficient code in such languages for realistic HPC applications [18, 35]. One evident reason, in the case of OpenCL for FPGA, is having less control over the final implementation, particularly due to the absence of a clock signal in the language syntax. Another noteworthy drawback is the very long compilation time for converting even a few lines of code to hardware. Furthermore, some custom requirements of the application or structure demanded excessive time to write and debug the programs.

5 Template-based Accelerator Architecture

In this section, we describe the pull-based template pipeline architecture used for vertex-centric graph processing.
In each iteration, large chunks of vertices are dynamically assigned to and processed by the parallel running accelerators. For each vertex, all connected edges are processed in a loop. To optimally utilize memory bandwidth, by enabling spatial locality and the cache memory, the vertex and edge lists are fetched in order. Graphs are stored in the common Compressed Sparse Row (CSR) format, which facilitates straightforward streaming memory access. Since vertex degrees can differ considerably, vertex processing times may also vary considerably; hence, vertices are executed (processed) out of order. Similar to loading, vertices are also committed and written back to memory in order, for efficient usage of memory bandwidth and to avoid potential false-sharing problems. This way, the majority of the bandwidth (more than 90% in our experiments) can be dedicated to the inevitable random (irregular) memory accesses that read the data of neighbor vertices.
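For clarity, the CSR layout assumed throughout this section can be sketched as follows (field names are illustrative): the row_ptr and col_idx arrays are the streams that are fetched in order, while reads of neighbor values through col_idx are the random accesses.

```cpp
#include <cstdint>
#include <vector>

// Illustrative CSR container; for vertex v, its neighbors are
// col_idx[row_ptr[v]] .. col_idx[row_ptr[v + 1] - 1].
struct CsrGraph {
    std::vector<uint32_t> row_ptr;   // |V| + 1 offsets into col_idx (streamed in order)
    std::vector<uint32_t> col_idx;   // |E| neighbor vertex IDs (streamed in order)
    std::vector<float>    values;    // per-vertex data, indexed by vertex ID (random reads)
};

inline uint32_t degree(const CsrGraph& g, uint32_t v) {
    return g.row_ptr[v + 1] - g.row_ptr[v];
}
```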
As mentioned earlier, large-scale graph applications have a well-known inherent bottleneck in accessing off-chip memory, which leads to high latency and low bandwidth efficiency. Carefully utilizing this limited resource requires precise execution, which can be achieved at the cycle-accurate level but is, at the same time, tedious and complicated. In our template-based design, we implement the common modules only once, except for a single user-specific module. In the deep-pipeline vertex-processing architecture, multiple vertices are in the processing phase at different stages of the pipeline, with many concurrent outstanding memory requests to tolerate the high latency of main memory accesses. The architecture can be configured in two execution modes. In the lighter high-throughput mode, all vertices are loaded into the pipeline in each iteration. In the work-efficient mode, using a novel and fast bit-vector design to implement the active-list, only active vertices are loaded and processed in each iteration. We explain both of these options in the following subsections.

5.1 High-throughput Mode

Figure 4 gives the different modules in our high-throughput architecture, which we explain in detail below.
Fig. 4.
Fig. 4. Simplified diagram of one high-throughput deeply pipelined vertex processing accelerator.
Data and Control Tables: There are a few tables to keep track of the vertices being processed. They keep vertex state, such as vertex degree and value, the number of remaining unprocessed edges, and temporary data for the gather phase. The table length is the maximum number of vertices in execution at the same time (the pipeline depth is set to 128 because of the \(\sim\) 100-clock-cycle memory latency). Different pipeline stages of the design may have simultaneous read/write accesses to these tables, so their implementation, which usually faces resource contention and timing constraints, has to be efficient, too. For this purpose, the on-chip multi-port memory resources of the FPGA, including Block RAMs and even memories built from LUTs and flip-flops, are utilized.
Table Allocator: responsible for allocating a vacant row to the next incoming vertex-ID in control and data tables. After that, the assigned row-id, which points to a specific vertex, flows through the next pipeline stages until the end of processing.
Vertex Initiator: sequentially reads a free row-id from “Table Allocator” queue and an unprocessed vertex from a streaming memory port to fill some table entries with initial vertex data, such as vertex value and degree. This module internally is composed of multiple pipeline stages for higher throughput.
Edge Loop Setup: sequentially reads the row-id of an initialized vertex from the "Vertex Initiator" queue and the information of all connected edges from a streaming memory port. The data of a connected edge contains the vertex-ID of the other (neighbor) vertex. Then, in a so-called edge loop, for each edge of the current vertex, a random-access memory request is sent to read the data of the neighbor vertex. The row-id of the current vertex under processing is attached to the request and returned with the response (supported by the memory controller); since responses may come back out of order, this ensures that the owner vertex of each response is known. This is the only irregular memory access in the overall design. When all requests for the neighbors of the current vertex have been sent, the module moves on to the next vertex (row-id) from the "Vertex Initiator" queue. The vertices being processed (up to 128) may have many pending requests for their own neighbors; having many outstanding memory requests in progress hides the large off-chip memory latency (over 100 cycles). Ideally, requests are sent successively, one per clock cycle.
Edge Loop Execution: The responses to the neighbor-vertex info requests sent by the previous module are received out of order, ideally one per clock cycle, and may belong to various vertices. Because of the poor locality, it is highly probable that each read response consumes one full cache-line of memory bandwidth. The data of each neighbor vertex is "gathered" (accumulated, or similar, depending on the specific graph algorithm) into temporary variable(s) in the table row of the related vertex. When all neighbors of any vertex under processing have been processed, its row-id is passed to the "apply" stage.
Apply: In this module, after having a final complementary calculation on the result of the “gather” stage, the vertex value is updated again in the related table. Then, a “done” bit is set for this vertex in the related table. Similar to other modules, vertices may finish this step out of order.
User Module: All customizable user codes are inside this module. It includes functions applicable to vertex data in gather and apply phases. This module is included and merged into the gather and apply modules at compile time.
Write data: sequentially reads a vertex row-id from the commit queue and waits for its completion by monitoring the "done" bit. When the done signal is detected, the vertex's new value is written back to the DDR memory. As the row-ids in the commit queue are in order, writes are also performed in order. This way, one iteration of the iterative algorithm for a specific vertex is accomplished. To realize the synchronous update strategy, separate read and write vertex-info data structures are kept in the off-chip DDR.
Deallocator: finally releases all table entries for row-id of a finished vertex, to be reused again by the next vertices.
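To give a flavor of how the fixed modules are written, the fragment below sketches one registered pipeline stage in SystemC with a simple valid/data handshake. It is a heavily simplified illustration and not the template's actual code; the 8-bit row-id width (for the pipeline depth of 128) and the port names are placeholders.

```cpp
#include <systemc.h>

// Minimal sketch of a cycle-accurate pipeline stage (illustrative only; the
// real template's stages, handshakes, and on-chip tables are more elaborate).
SC_MODULE(ApplyStage) {
    sc_in<bool>         clk, rst;
    sc_in<bool>         in_valid;     // a vertex finished its gather phase
    sc_in<sc_uint<8> >  in_row_id;    // table row of that vertex
    sc_out<bool>        out_valid;
    sc_out<sc_uint<8> > out_row_id;   // forwarded to the commit/write-back stage

    void run() {
        out_valid.write(false);       // reset behavior
        out_row_id.write(0);
        wait();
        while (true) {
            // Register the handshake: one vertex can move through per clock cycle.
            out_valid.write(in_valid.read());
            out_row_id.write(in_row_id.read());
            // ... the real module would also update the vertex value and set
            //     the "done" bit in the on-chip data table here ...
            wait();
        }
    }

    SC_CTOR(ApplyStage) {
        SC_CTHREAD(run, clk.pos());
        reset_signal_is(rst, true);
    }
};
```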

5.2 Work-efficient Mode

As explained before, the work-efficient architecture does not process all vertices in every iteration. Therefore, it has some additional modules, plus some modifications to the previous ones. They are shown in Figure 5 and explained in detail in the paragraphs below.
Fig. 5.
Fig. 5. Simplified diagram of one work-efficient deeply pipelined vertex processing accelerator.
Pre-fetch: The sequential vertex access in the "Vertex Initiator" module of the high-throughput mode makes it possible to issue a single read/write request for a large vertex chunk (containing all vertices dedicated to each accelerator). In work-efficient mode, however, only some of the vertices are processed in every iteration; hence, instead of one big chunk, there should be multiple disjoint and smaller ones. To reduce access overhead, especially for small requests with nearby addresses, a grouping mechanism creates such smaller chunks at runtime. The pre-fetch module is responsible for batching consecutive individual vertices (consecutive vertex-IDs) into one larger batch, so that a single memory request is sent for the whole batch (chunk).
Read-CL: If successive disjoint vertex read requests have overlapping cache-line data (the end of one request and the beginning of the next), then the memory request is served from a local, single-entry cache-line. This is helpful for small requests with nearby addresses. This module also provides the values of unmodified vertices for the next write operation.
Write-CL: Similar to its read counterpart, this module is responsible for handling a cache-line for the write operation. The cache-line is filled one-by-one and then flushed into the memory.
Scatter: Finally, after a vertex is updated with its new value, and provided that the change is beyond a specified threshold, the scatter module informs all neighbors about the value update. This module iterates over all neighbors and sends their vertex-IDs to the bit-vector module to construct the active-list of the next iteration.
Bit vector: This key module is responsible for keeping the list of active vertices in work-efficient mode. This way, in each iteration, instead of processing all vertices, only those tagged as active are processed. Efficient implementation of the active-list can be challenging [21]. In our pipelined and fast solution, active-list members can be both quickly inserted into and extracted from the bit-vector module, as it is mapped to fast on-chip Block RAMs. As depicted in Figure 6, the bit vector is composed of two sets of ping-pong swappable memories: one keeps the active-list of the current iteration, which is read from, and the other stores the active-list of the next iteration, which is written to. The ping-pong memories are swapped after each iteration. In the basic bit-vector implementation, a single memory bit is assigned to each vertex, where a "1" indicates membership in the active-list. While inserting a vertex-ID into the on-chip memory is naturally fast (just setting a single bit at a specific address, i.e., the vertex-ID, in a single clock cycle), retrieving the members can be rather slow, since the whole bit-vector must be scanned. To tackle this problem, besides internal pipelining, a novel multi-level design is proposed for the bit vector. In the advanced implementation, vertices are addressed at three different levels, where a higher-level active-set is responsible for coarser groups of vertices, meaning larger chunks. When a vertex is added, it is added to all hierarchy levels in parallel, as shown in the right part of Figure 6. The benefit is realized in the reduction of the active-list retrieval latency, as shown in the left part of Figure 6: using the higher-level indicator bits, non-active chunks of vertices are quickly skipped without deeply searching the lower-level addresses. At the start of each iteration, the small top-level list is fully searched for any active chunk. Then, only the active addresses are collected in a FIFO and passed to the second-level list, and the same is done for the third level. Finally, the list of all active vertex-IDs is extracted monotonically and sequentially.
Fig. 6.
Fig. 6. Multi-level bit-vector architecture in the basic version, where list resolution is one vertex.
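A software model of the multi-level retrieval idea is sketched below (illustrative C++). The fan-out of 32 per level is an assumption for the example, and the ping-pong pair and per-iteration clearing are omitted; the FPGA version maps each level to Block RAMs and pipelines the insert/scan paths.

```cpp
#include <cstdint>
#include <vector>

// Illustrative three-level bit-vector: level0 holds one bit per vertex,
// level1/level2 summarize progressively larger chunks of vertices.
struct MultiLevelBitVector {
    static constexpr uint32_t FANOUT = 32;   // vertices summarized per upper-level bit
    std::vector<bool> level0, level1, level2;

    explicit MultiLevelBitVector(uint32_t num_vertices)
        : level0(num_vertices, false),
          level1((num_vertices + FANOUT - 1) / FANOUT, false),
          level2((num_vertices + FANOUT * FANOUT - 1) / (FANOUT * FANOUT), false) {}

    // Insert: set the bit in all three levels (done in parallel in hardware).
    void insert(uint32_t v) {
        level0[v] = true;
        level1[v / FANOUT] = true;
        level2[v / (FANOUT * FANOUT)] = true;
    }

    // Retrieve: scan top-down, skipping inactive chunks without touching level0.
    template <typename F>
    void for_each_active(F&& visit) const {
        for (uint32_t c2 = 0; c2 < level2.size(); ++c2) {
            if (!level2[c2]) continue;                       // skip FANOUT*FANOUT vertices
            for (uint32_t c1 = c2 * FANOUT;
                 c1 < (c2 + 1) * FANOUT && c1 < level1.size(); ++c1) {
                if (!level1[c1]) continue;                   // skip FANOUT vertices
                for (uint32_t v = c1 * FANOUT;
                     v < (c1 + 1) * FANOUT && v < level0.size(); ++v)
                    if (level0[v]) visit(v);                 // active vertex found
            }
        }
    }
};
```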
Scheduler: For alleviating possible load imbalances, a runtime scheduler module (not shown in figures) is responsible for dynamically assigning the next vertex chunk to the first idle accelerator.
Note that the available Block RAM size of the underlying FPGA determines the maximum graph size supported in the work-efficient mode. On this version of Xeon+FPGA, and in the basic version of the work-efficient active-set, where a single memory bit is assigned to each vertex, a graph size of 8 million vertices is supported (2 million per accelerator, as shown in Figure 6). As an idea for future work, it is possible to support larger graphs by reducing the resolution, in exchange for efficiency. For example, if a single bit of the bit-vector is assigned to a group of 32 vertices, then graphs of up to 256 million vertices can be handled. However, in this case, all 32 vertices have to be loaded even if only one of them is actually active.

5.3 User Programming Interface

As mentioned in Section 2.3, Gather-Apply-Scatter (GAS) is the common processing model for vertex-centric programs. According to this model, the user defines the three following functions in the discussed User Module to specify the three conceptual phases of an iterative algorithm. Template customization is done only through these three functions, by which the graph application is defined in the GAS model. In this way, an easy programming interface is provided to the users for programmability and productivity. At compilation time, the functions are included, merged, and pipelined into the related modules (see Figures 4 and 5). In every iteration and for every vertex under processing:
Gather: is called once for each neighbor, whenever the neighbor is accessed and its data is read. Neighbor vertex values have already been requested in a streaming way, and they arrive after many clock cycles of latency. Accumulation or finding the maximum value are examples of common operations in this function. For example, in PageRank, the function can be:
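(The listing below is a minimal sketch of such a gather function; the struct and function signatures are illustrative rather than the template's exact interface, and `sum` stands for the per-vertex temporary kept in the on-chip data tables.)

```cpp
// Illustrative only: per-vertex data and the temporary used during gather.
struct VertexData { float value; unsigned degree; };
struct VertexTemp { float sum; };

// Called once per neighbor, when the neighbor's data arrives from memory.
void gather(VertexTemp &tmp, const VertexData &neighbor) {
    tmp.sum += neighbor.value / neighbor.degree;   // accumulate the neighbor's contribution
}
```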
Apply: is called once, after processing the last neighbor, to finalize the calculation and update the vertex value. A division after accumulation, as in an averaging operation, is an example. For PageRank, with A, B, and C being numerical constants of the PageRank formulation, we have:
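(Again a sketch, reusing the illustrative structs above. With the standard damping formulation value = (1 - beta)/N + beta * sum, the constants below play the role of A and B; how the article's third constant C enters the exact formula is not reproduced here.)

```cpp
// Illustrative only; NUM_VERTICES and BETA are example values.
static const unsigned NUM_VERTICES = 1u << 20;
static const float    BETA = 0.99f;                       // damping parameter (Section 6.2.2)
static const float    A    = (1.0f - BETA) / NUM_VERTICES;
static const float    B    = BETA;

// Called once per vertex, after its last neighbor has been gathered.
void apply(VertexData &v, VertexTemp &tmp) {
    v.value = A + B * tmp.sum;   // finalize the new PageRank value
    tmp.sum = 0.0f;              // reset the temporary for the next iteration
}
```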
Scatter: is called once, after “Apply” to inform the neighbors about value update. This function is not used in the pull-based implementation, such as our framework.
Vertex data, which are large arrays indexed by vertex ID (stored in the global off-chip DDR memory), and any other temporary per-vertex variables used in the pipeline (stored in the on-chip data tables) are defined by the user in a specific class. The user can implement any complex function, dependent on data, iteration number, and so on. The three C/C++ functions are converted to logic, the same as the SystemC units. For more relaxed timing in high-latency functions, the function latency can be defined by the user as a number of clock cycles; this value is passed to the underlying SystemC implementation during logic synthesis.
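As a hypothetical illustration of this mechanism (the framework's actual class layout and latency hook are not shown here), a user definition for PageRank might look roughly like this:

```cpp
// Illustrative only: how per-vertex state could be grouped by the user.
struct PageRankUserData {
    // Stored in off-chip DDR, indexed by vertex ID; kept in separate read and
    // write copies to realize the synchronous update strategy (Section 5.1).
    float    value;
    unsigned degree;

    // Temporary state, held in the on-chip data tables while the vertex is in flight.
    float sum;
};

// Latency of the user functions in clock cycles, passed to logic synthesis
// so the pipeline can be scheduled accordingly (example value).
static const unsigned USER_FUNCTION_LATENCY = 4;
```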

6 Experimental Evaluation

6.1 Experimental Setup

6.1.1 Graph Applications.

We implement and execute some conventional iterative graph algorithms in our template. For comparison purposes, they are also implemented using OpenCL. Some traditional algorithms, such as Breadth-First Search (BFS), are readily parallelizable, while others, like Depth-First Search (DFS), are inherently difficult to parallelize [24]. The three common parallelizable algorithms chosen for our experiments are listed below. As discussed earlier, due to the limitations of the FPGA hardware and the Xeon+FPGA platform in handling the complexity of the push-based mechanism, these algorithms are implemented in a pull-based fashion.
PageRank (PR): is a widely used algorithm in web search engines to rank webpages, based on reference counts from other pages. The score value assigned to rank each node has a real data type. In our baseline OpenCL implementation, it was not possible to implement this in an efficient pipelining due to floating-point limitations in OpenCL. Therefore, to have a fair comparison, we tested with a fixed-point version instead.
Breadth-First-Search (BFS): In the iterative execution of conventional BFS, starting from a randomized root vertex (same in all experiments), all reachable vertices are visited and labeled with their depth value.
Maximal Independent Set (MIS): An independent set, or anti-clique, is a set of vertices in which no two are adjacent. A Maximal Independent Set (MIS) is one that is not a subset of any other independent set. MIS is used as part of different applications, such as graph coloring. A fully parallelizable algorithm for finding one MIS (more than one solution can exist) is chosen [6]. During execution, vertices are labeled as undecided-yet, in-list, or out-of-list. Implementing the exact form of the cited algorithm requires two different "gather" functions to run in even/odd iterations.

6.1.2 Datasets.

The selected applications are evaluated using graph datasets provided by a widely used repository [26]. The datasets include cases from real-world applications, such as social networks (Com-Orkut, Com-LiveJournal), and graphs generated by simulation (Adaptive, Delaunay, Hugetric). Graph sizes vary from a few million to tens of millions of vertices, as given in Table 1. As mentioned earlier, there are some limitations in the tools; specifically, due to the maximum memory limitation in the baseline OpenCL compiler, we could not experiment on larger graphs. However, the graph sizes are large enough relative to the small FPGA platform cache to minimize locality (the cache size is set to a single cache-line of 64 bytes), so this limitation does not affect the presented results. Graph vertices are also shuffled to remove any locality that could affect the experimental results. Therefore, the performance improvement presented in this work does not rely on data locality; instead, it is achieved purely by the pipeline design. The profiler tool confirms that, in all graphs, the cache hit-rate for accessing neighbor vertices is minimal ( \(\frac{1}{16}\approx\) 6%, for a 32-bit data type).
Table 1.
Graph            | # of Vertices | # of Edges
Adaptive         | 6.8M          | 27M
Com-LiveJournal  | 4M            | 69M
Com-Orkut        | 3.1M          | 234M
Delaunay         | 16M           | 100M
Hugetric         | 6.6M          | 20M
Table 1. Graph Datasets Used in Experiments

6.1.3 Performance Estimation Metrics.

We have two kinds of performance metrics: throughput (TP) and work efficiency (WE). For TP, the number of processed edges (irregular accesses to neighbor vertices) per unit of time is counted, while for WE, the task completion time is measured. The TP metric represents the efficiency of the pipeline design. However, relying only on the TP metric can be misleading, as the WE mode can converge faster using the active-list.
Theoretically, an efficient pipeline finishes one task per clock cycle. In our case, processing one edge (neighbor vertex) per cycle, per pipeline (accelerator), is the primary goal of the high-throughput (HT) design. Recall that the cache memory is not in effect; thus, only the data of the requested neighbor vertex is used per memory request, with high memory access latency. We tried to achieve this objective in both the OpenCL-based design and our template-based pipeline implementation.
Since the off-chip memory bandwidth is the decisive bottleneck, once it is saturated, no more data can be provided to the pipeline. Beyond this point, performance cannot increase any further, and adding more parallel processing power of any kind, such as more accelerators, does not improve TP. Therefore, approaching the theoretical maximum memory bandwidth of the system efficiently, with the minimum number of accelerators, is our primary goal for achieving peak TP. For this purpose, we increase the number of accelerators until no improvement is seen in the performance. Using profiler tools, bandwidth saturation is measured and the design is tested accordingly. Assuming that \(\sim\) 10% of the bandwidth is consumed by the other, sequential graph data, the maximum TP of the system, under the no-locality assumption, is:
\begin{equation*} \text{Maximum TP} \approx 0.9 * \text{Memory Bandwidth}/\text{Cache-line Size.} \end{equation*}
For OpenCL and the template, with a cache-line size of 64 B, the available bandwidth is \(\sim\) 14 GB/s and \(\sim\) 17 GB/s, and the maximum throughput is \(\sim\) 200 M-Edge/s and \(\sim\) 240 M-Edge/s, respectively. Note that if locality existed and the cache were enabled with the exact same setup, then throughput could ideally increase up to 16× ([cache-line size]/[size of data type]), i.e., \(\sim\) 4 G-Edge/s; equivalently, this can be calculated as bandwidth divided by the data type size (17 GB/s / 4 B). However, locality-increasing techniques are orthogonal and not directly relevant to this work; they can be applied on top of the pipeline in a more sophisticated and comprehensive framework.

6.2 Results

We first present results for the baseline HT OpenCL; thereafter, the improvements obtained by the template-based SystemC design are presented.

6.2.1 Baseline OpenCL (OCL).

First, we illustrate the discussed memory bottleneck issue. Figure 7 depicts the peak edge throughput and the off-chip memory bandwidth usage of different applications (for the baseline HT OpenCL on the Delaunay dataset). As can be seen, when the maximum available memory bandwidth is saturated, the throughput also saturates. More specifically, in all applications, because of the high-throughput pipeline design (II = 1 clock cycle), two accelerators use up the whole available data read bandwidth, while even one accelerator can use more than 75% of it. As mentioned earlier, irregular memory accesses are responsible for around 90% of the bandwidth usage. Since the sequential writes consume comparatively negligible bandwidth, only the read bandwidth is shown.
Fig. 7.
Fig. 7. Saturation of off-chip memory bandwidth and throughput with the number of accelerators (Delaunay).
Figure 8 shows the average throughput for the BFS application on different graph datasets with different numbers of accelerators, illustrating the effect of graph topology on throughput. According to Table 1, the Com-LiveJournal graph is denser than the other three (Com-Orkut is too large for the OpenCL platform). Moreover, as can be seen from Table 2, this same graph converges quickly in the BFS application (15 iterations), causing many early wasted clock cycles due to the unnecessary loading of converged vertices. This leads to a comparatively poor average throughput for this graph. Since in this case performance is limited by the pipeline (not by memory), having more accelerators is potentially effective. The other graphs converge far more slowly, and their average throughput is measured over the first 200 iterations.
Fig. 8.
Fig. 8. Average throughput for the OpenCL implementation of BFS with different datasets and different number of accelerators.
Table 2.
Graph            | iter. 1 | iter. 2 | iter. 3 | iter. 4
Adaptive         | 4       | 12      | 24      | 40
Com-LiveJournal  | 1       | 17      | 125,309 | 307,515
Delaunay         | 5       | 14      | 26      | 52
Hugetric         | 3       | 8       | 16      | 27
Table 2. The Number of Vertices Converged in the First Four Iterations of BFS
Figure 9 shows the average throughput of the different applications over a high number of iterations, unless early convergence happens (specifically in the case of MIS). As can be seen, the OpenCL-based implementation can, in some cases, get rather close to the maximum feasible (saturated) throughput. Recall that a small portion of the bandwidth is dedicated to sequential data accesses. Since MIS generally converges very quickly, it exhibits low pipeline utilization and low average throughput after the initial iterations.
Fig. 9.
Fig. 9. Average throughput for the OpenCL implementation with different datasets on four accelerators.

6.2.2 Template-based (TMP).

As explained earlier, the SystemC template has two modes, namely HT and WE. While WE is the desired mode for users, we use the HT mode for comparison with the baseline OpenCL, which supports only HT. Figure 10 shows that even the HT mode outperforms the automatically generated OpenCL pipeline by up to 50% in peak throughput. This is mainly due to the cycle-accurate and more efficient pipeline design, the better memory management (both off-chip DDR and on-chip BRAM), and the higher frequency of the memory controller, which provides more bandwidth (17 vs. 14 GB/s) in the template-based SystemC design, altogether allowing more parallelism. Figures 11 and 12 show similar results, where the average application throughput and the average runtime per iteration are depicted for higher iteration numbers. The results show that the template-based implementation provides up to 33% shorter runtime. As expected, larger graphs with higher edge counts have longer runtimes.
Fig. 10.
Fig. 10. Peak throughput of OpenCL vs. Template-based HT on four accelerators.
Fig. 11.
Fig. 11. Average throughput of OpenCL vs. Template-based HT on four accelerators.
Fig. 12.
Fig. 12. Average runtime per iteration for OpenCL vs. template HT.
The WE mode provides even better results by working only on active vertices: only the vertices that really need to be processed, when the wave of changes reaches them, are loaded into the pipeline in each iteration. As can be seen from Figure 13, the MIS application [6] usually converges very quickly for any dataset (in all of our datasets, in \(\lt\) 20 iterations). Because of the greedy approach, almost all vertices are processed and converge early in practice; until full graph convergence, vertices are therefore engaged in processing, and hence included in the active-list, most of the time. As a result, the benefit of the WE mode is generally limited for MIS, a 15% improvement on average. In contrast, BFS usually converges slowly, especially for sparse graphs. Since the algorithm starts from a specified root and goes one level deeper in each iteration, only the frontier vertices are contained in the active-list. The results show around a 5× improvement for the two denser graphs (which converge in \(\lt\) 30 iterations) and at least 100× for the two sparse graphs (for which measurements are done over the first 1,000 iterations). PR, in turn, converges faster than BFS, as all vertices are engaged from the start, but no vertex is individually marked as converged until the whole graph converges. For all datasets, with the damping parameter ( \(\beta\) ) set to 0.99, convergence happens in fewer than 1,000 iterations. The results show that improvements from 2.5× up to 23× are achieved in the WE mode (the Delaunay graph is larger than the maximum graph size supported in WE mode, as discussed in Section 5.2).
Fig. 13. Average runtime per iteration of WE template normalized with respect to HT template.
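The listing below is a minimal, software-level sketch of a bit-vector-driven work-efficient iteration over a CSR graph, meant only to illustrate the active-list idea described above; it is not the template's hardware implementation, and names such as we_iteration, active, and next_active are hypothetical.

#include <algorithm>
#include <cstdint>
#include <vector>

// One work-efficient (WE) iteration: only vertices marked in the current bit-vector
// are processed, and a vertex's neighbors are scheduled for the next iteration only
// if the vertex's value actually changed. A BFS-style "min of neighbor levels + 1"
// update serves purely as an example; level[] is assumed to be initialized with a
// large finite sentinel (not UINT32_MAX) for unreached vertices.
void we_iteration(const std::vector<std::uint32_t>& row_ptr,   // CSR row pointers
                  const std::vector<std::uint32_t>& col_idx,   // CSR neighbor indices
                  std::vector<std::uint32_t>& level,           // per-vertex application data
                  std::vector<bool>& active,                   // current active-list (bit-vector)
                  std::vector<bool>& next_active) {            // active-list for next iteration
    const std::size_t n = row_ptr.size() - 1;
    std::fill(next_active.begin(), next_active.end(), false);
    for (std::size_t v = 0; v < n; ++v) {
        if (!active[v]) continue;                               // skip idle/converged vertices
        std::uint32_t updated = level[v];
        for (std::uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            updated = std::min<std::uint32_t>(updated, level[col_idx[e]] + 1u);
        if (updated != level[v]) {                              // a change propagates:
            level[v] = updated;
            for (std::uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
                next_active[col_idx[e]] = true;                 // activate neighbors
        }
    }
    active.swap(next_active);
}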
The improvement in total energy consumption of WE mode versus HT mode is shown in Figure 14. The measurements are taken at runtime by reading the current sensors embedded on the FPGA, which are exposed to users through dedicated status-register addresses. Only the power of the FPGA core (not I/O ports, etc.) is considered here; the core voltage is 0.95 V. The results mirror the runtime improvements, since the energy saving mainly comes from the shorter runtime provided by WE mode.
Fig. 14. Average energy consumption per iteration for HT vs. WE.
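As a rough illustration of how such sensor readings translate into an energy figure, the sketch below integrates periodically sampled core-current values at the fixed 0.95 V core voltage; the sampling interface itself is platform specific and therefore omitted.

#include <vector>

// Approximate FPGA core energy from periodically sampled core-current readings.
// The samples are assumed to come from the board's current sensors (read through
// status registers at runtime); the 0.95 V default matches the core voltage above.
double core_energy_joules(const std::vector<double>& current_samples_amps,
                          double sample_period_seconds,
                          double core_voltage = 0.95) {
    double energy = 0.0;
    for (double i_amps : current_samples_amps)
        energy += core_voltage * i_amps * sample_period_seconds;  // E += P * dt
    return energy;
}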

6.2.3 Resource Utilization and Clock Rate.

Here, we summarize low-level implementation details regarding resource utilization and clock rates. Table 3 summarizes area utilization, where all applications have similar values. This is because most of the hardware blocks used in the template are common to all applications. Moreover, in OpenCL, a similar coding structure (see Figure 3) with identical memory ports accessing the common CSR graph format is used, which leads to a similarly generated pipeline. While the OpenCL compiler relies more on Block RAM, the SystemC and RTL tools prefer using more logic. The WE mode uses more resources to implement the bit-vector and the other extra units. The OpenCL compiler tries to sustain the highest possible clock frequency, and \(\sim\) 210 MHz is achieved by all of the applications. In OpenCL (but not in our template), this frequency directly affects bandwidth and throughput, because the memory controller also runs at this clock frequency (210 MHz \(\times\) 64 B \(\approx\) 14 GB/s). In the SystemC flow, by contrast, the bandwidth is fixed ( \(\sim\) 17 GB/s read bandwidth) and the user can set the frequency, which allows reducing the clock rate as much as needed once performance is already saturated. In Figure 10, while OpenCL runs at \(\sim\) 210 MHz, the template runs at 100 MHz, because four cores are enough to saturate the available bandwidth.
Design \ Resource Type                        Logic    Block RAM
Blue-stream (fixed)                             20%        12%
Green-stream (user), OpenCL-HT, per core         3%         7%
Green-stream (user), Template-HT, per core       7%         2%
Green-stream (user), Template-WE, per core      13%        14%
Table 3. FPGA Resource Utilization
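For reference, the bandwidth estimate above follows directly from the controller clock and the 64 B access width; the small sketch below reproduces that arithmetic (the computed 13.4 GB/s is what the text rounds to \(\approx\) 14 GB/s).

#include <cstdio>

// Streaming bandwidth of a memory controller that moves one 64 B line per clock
// cycle, reproducing the rough estimate above (210 MHz x 64 B ~ 14 GB/s).
constexpr double bandwidth_gb_per_s(double clock_mhz, double line_bytes = 64.0) {
    return clock_mhz * 1e6 * line_bytes / 1e9;
}

int main() {
    std::printf("OpenCL controller @ 210 MHz: %.1f GB/s\n", bandwidth_gb_per_s(210.0));
    std::printf("Template controller: fixed ~17 GB/s read bandwidth\n");
    return 0;
}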

6.2.4 CPU Comparison.

After the FPGA experiments, it is also worthwhile to compare performance against the CPU. On any platform, processing a sufficiently large graph cannot effectively utilize the cache, since locality is minimal due to irregular (random) memory accesses. In such a setting, the total memory traffic per iteration is known beforehand and is roughly equal to [edge count] \(\times\) [cache-line size]. As the execution gets faster, this traffic is consumed faster, up to memory bandwidth saturation. To have a fair comparison with the CPU, we used a larger graph to eliminate the potential benefit of the considerably large cache present in the CPU. Our experiments show that, on the host CPU, cache utilization is minimized for a graph with \(\sim\) 100 million vertices (also see Reference [5]). Note that the locality level depends on the vertex count, not the edge count, since the irregular accesses are to neighbor vertex data during (sequential) edge processing. We compare one accelerator implemented with our template-based pipeline design in HT mode against one core of the HARP CPU. Figure 15 shows that, despite the more than 10 times higher clock rate of a typical CPU (3 GHz vs. 200 MHz), a single FPGA accelerator achieves up to five times higher throughput than a CPU core for a graph with \(\sim\) 140 M vertices. Note that we did not use the graphs given in Table 1, since the largest graph in that table has \(\sim\) 16 M vertices; instead, we used the much bigger Kmer-P1a graph [26], with \(\sim\) 140 M vertices, to nullify cache effects, since the vertex count (not the edge count) is the main factor for locality.
Fig. 15. Peak throughput of a single HT template accelerator vs. a single-thread HT CPU implementation for a graph with a size of \(\sim\) 140 M vertices (graph: Kmer-P1a [26]).
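A back-of-the-envelope check along the lines below captures this reasoning: the randomly accessed vertex data is compared against the CPU's last-level cache, while the per-iteration traffic scales with the edge count times the cache-line size. Except for the vertex count, all values in the sketch are illustrative assumptions.

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative assumptions (not values from the paper, except the vertex count):
    const std::uint64_t vertices      = 140'000'000ULL;   // ~140 M, as in the Kmer-P1a experiment
    const std::uint64_t edges         = 600'000'000ULL;   // placeholder edge count
    const std::uint64_t bytes_per_vtx = 8;                // hypothetical per-vertex record size
    const std::uint64_t line_bytes    = 64;
    const std::uint64_t llc_bytes     = 32ULL << 20;      // assumed 32 MiB last-level cache

    const double random_footprint_gb = vertices * bytes_per_vtx / 1e9;  // randomly accessed vertex data
    const double traffic_per_iter_gb = edges * line_bytes / 1e9;        // worst-case traffic per iteration

    std::printf("vertex data: %.2f GB vs. LLC %.0f MiB -> cache is %s\n",
                random_footprint_gb, llc_bytes / double(1 << 20),
                vertices * bytes_per_vtx > llc_bytes ? "ineffective" : "potentially useful");
    std::printf("worst-case traffic per iteration: %.1f GB\n", traffic_per_iter_gb);
    return 0;
}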

6.2.5 Observations.

First, the well-known memory bottleneck in graph applications, which throttles performance, is experimentally illustrated in Figure 7. The performance achieved by the baseline OpenCL implementation and the effect of different datasets and applications are shown in Figures 8 and 9. Note that, theoretically, the maximum achievable throughput in the worst case without locality is equal to [memory bandwidth] \(/\) [cache-line size] (Sections 6.1–3). These figures depict the reasonable quality of the baseline, which can come close to the maximum throughput. Recall from Section 4 that the main disadvantage of OpenCL is not this particular implementation but the difficulty of generating a high-performance pipeline, which restricts it to the HT mode. The advantage of our high-throughput SystemC implementation is shown in Figure 10, mainly due to the efficient deep-pipeline design and higher bandwidth. Figures 12–14 present the main advantage of our template, the WE mode: besides high-throughput processing, the active-list significantly reduces total runtime and power consumption, by up to 100×, particularly for sparse graphs. While a few accelerators are shown to be enough to maximize throughput, their low area allows them to fit conveniently on the FPGA. Finally, we showed that a single FPGA accelerator outperforms a single CPU core by around 5×.
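As a worked instance of this bound, plugging in the read-bandwidth figures reported earlier ( \(\sim\) 17 GB/s for the template, \(\sim\) 14 GB/s for OpenCL) and the 64 B line size gives:

\[
  T_{\max} = \frac{\text{memory bandwidth}}{\text{cache-line size}}, \qquad
  \frac{17\ \text{GB/s}}{64\ \text{B}} \approx 266\ \text{M edges/s}, \qquad
  \frac{14\ \text{GB/s}}{64\ \text{B}} \approx 219\ \text{M edges/s}.
\]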

7 Limitations and Future Work

This section explains the limitations of the proposed scheme and possible extensions to address them. One limitation concerns graph size. As explained in detail before, our tool supports iterative graph applications in two modes, namely, the HT and WE modes. The HT mode has no graph-size limit, but the WE mode does, due to the FPGA block memory required to implement the active-list. Furthermore, the current version of the proposed scheme does not support edge weights, so applications that require them cannot be directly implemented. For example, the shortest-path problem, which is otherwise very similar to BFS, cannot be modeled in the current implementation.
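To make the edge-weight limitation concrete, the generic per-edge updates below contrast unweighted BFS with a weighted shortest-path relaxation; this is a plain C++ illustration of the algorithmic difference, not the template's user-module interface.

#include <algorithm>
#include <cstdint>

// Unweighted BFS update: a neighbor is one hop away, so only vertex data and edge
// connectivity are needed (unreached levels are assumed to use a large finite sentinel).
inline void bfs_update(std::uint32_t neighbor_level, std::uint32_t& my_level) {
    my_level = std::min<std::uint32_t>(my_level, neighbor_level + 1u);
}

// Weighted shortest-path relaxation (Bellman-Ford style): it additionally needs the
// edge weight, which the current pipeline does not deliver to the user module.
inline void sssp_update(float neighbor_dist, float edge_weight, float& my_dist) {
    my_dist = std::min(my_dist, neighbor_dist + edge_weight);
}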
As part of our future work, we plan to: (1) provide support for edge weights, (2) develop and test the different active-list resolutions (see Section 5.2), (3) run on newer FPGA platforms with higher resource sizes to handle larger graphs in WE mode, (4) provide multiple user modules in the same accelerator to support more complex applications without higher-level software intervention.

8 Conclusion

In the age of AI, utilizing hardware accelerators such as GPUs, FPGAs, and neural network sticks has become common practice. For processing graphs, a fundamental data-modeling approach, FPGAs can be more efficient than GPUs due to irregular data accesses. In this work, we designed and implemented a SystemC HLS-based graph-processing template with a clock-wise precisely designed deep-pipeline architecture. The template-based design is simplified for easy mapping onto the FPGA, allowing even software programmers to generate accelerators conveniently. The template can be customized through a single module in C/C++, combining high-level programmability with the efficiency of the SystemC hardware language. In high-throughput mode, the pipeline achieves maximum edge throughput with the minimum number of accelerators; it is optimized to reach the maximum feasible throughput under the assumption of no vertex locality (worst case), although any orthogonal locality-improving technique from the literature can be used in tandem for additional performance. In addition, the work-efficient mode significantly reduces total runtime with a novel active-list design. Through experiments on the Intel Xeon+FPGA platform, we showed the benefits of the proposed template compared with an OpenCL-based implementation. Based on our results, the template considerably outperforms the OpenCL version by providing convenient programmability, higher throughput, lower runtime, and lower power consumption.

References

[1]
Tero Aittokallio and Benno Schwikowski. 2006. Graph-based methods for analysing networks in cell biology. Brief. Bioinf. 7, 3 (2006), 243–255.
[2]
Amazon Corporation. 2019. Enable faster FPGA accelerator development and deployment in the cloud. Retrieved from https://aws.amazon.com/ec2/instance-types/f1.
[3]
Andrey Ayupov, Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Steven Burns, and Ozcan Ozturk. 2018. A template-based design methodology for graph-parallel hardware accelerators. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 37, 2 (2018), 420–430.
[4]
Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. 2019. Analysis and optimization of the memory hierarchy for graph processing workloads. In IEEE International Symposium on High Performance Computer Architecture (HPCA).
[5]
Scott Beamer, Krste Asanović, and David Patterson. 2017. Reducing PageRank communication via propagation blocking. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 820–831.
[6]
Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. 2012. Greedy sequential maximal independent set and matching are parallel on average. In 24th Annual ACM symposium on Parallelism in Algorithms and Architectures. 308–317.
[7]
Martin Burtscher, Rupesh Nasre, and Keshav Pingali. 2012. A quantitative study of irregular programs on GPUs. In IEEE International Symposium on Workload Characterization (IISWC). 141–151.
[8]
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. 2016. A cloud-scale acceleration architecture. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9]
Nagadastagiri Challapalle, Sahithi Rampalli, Linghao Song, Nandhini Chandramoorthy, Karthik Swaminathan, John Sampson, Yiran Chen, and Vijaykrishnan Narayanan. 2020. GaaS-X: Graph analytics accelerator supporting sparse data representation using crossbar architectures. In ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[10]
Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. 2018. Serving DNNs in real time at datacenter scale with project brainwave. IEEE Micro 38, 2 (2018), 8–20.
[11]
Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, and Shaochong Zhang. 2018. Understanding performance differences of FPGAs and GPUs. In IEEE 26th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM). 93–96.
[12]
Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong Yang. 2017. Foregraph: Exploring large-scale graph processing on multi-FPGA architecture. In Proceedings of ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 217–226.
[13]
Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making pull-based graph processing performant. ACM SIGPLAN Not. 53, 1 (2018), 246–260.
[14]
Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15]
Intel Corporation. 2019. Rapid design methods for developing hardware accelerators. Retrieved from https://github.com/intel/rapid-design-methods-for-developing-hardware-accelerators.
[16]
Intel Corporation. 2020. Intel FPGA SDK for OpenCL Pro Edition: Best Practices Guide. Retrieved from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf.
[17]
Intel Corporation. 2020. Intel FPGA SDK for OpenCL Pro Edition: Programming Guide. Retrieved from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf.
[18]
Johannes de Fine Licht, Simon Meierhans, and Torsten Hoefler. 2018. Transformations of high-level synthesis codes for high-performance computing. Computing Research Repository (CoRR) abs/1805.08288 (2018).
[19]
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, and Daniel Sanchez. 2018. Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[20]
Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit Subhaschandra, and Guy Boudoukh. 2017. Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 5–14.
[21]
Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, Steven Burns, and Ozcan Ozturk. 2015. Architectural requirements for energy efficient execution of graph analytics applications. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 676–681.
[22]
Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy efficient architecture for graph analytics accelerators. In ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 166–177.
[23]
Yubin Park, Mallikarjun Shankar, Byung-Hoon Park, and Joydeep Ghosh. 2014. Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In IEEE 30th International Conference on Data Engineering Workshops (ICDEW). 12–19.
[24]
John H. Reif. 1985. Depth-first search is inherently sequential. Inform. Process. Lett. 20, 5 (1985), 229–234.
[25]
Nicolas Siret, Matthieu Wipliez, Jean François Nezan, and Francesca Palumbo. 2012. Generation of efficient high-level hardware code from dataflow programs. In Design, Automation and Test in Europe (DATE).
[26]
Texas A&M University. 2020. The SuiteSparse Matrix Collection. Retrieved from https://sparse.tamu.edu.
[27]
Twitter Corporation. 2020. Twitter Q3 2020 Earnings Report. Retrieved from https://investor.twitterinc.com/financial-information/quarterly-results.
[28]
Qinggang Wang, Long Zheng, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling Xue. 2020. A conflict-free scheduler for high-performance graph processing on multi-pipeline FPGAs. ACM Trans. Archit. Code Optim. 17, 2 (2020).
[29]
Mingyu Yan, Xing Hu, Shuangchen Li, Abanti Basak, Han Li, Xin Ma, Itir Akgun, Yujing Feng, Peng Gu, Lei Deng, et al. 2019. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach. In 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[30]
Pengcheng Yao, Long Zheng, Xiaofei Liao, Hai Jin, and Bingsheng He. 2018. An efficient graph accelerator with parallel data conflict management. In 27th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[31]
Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Andrey Ayupov, Steven Burns, and Ozcan Ozturk. 2015. Hardware accelerator design for data centers. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 770–775.
[32]
Jialiang Zhang and Jing Li. 2018. Degree-aware hybrid graph traversal on FPGA-HMC platform. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 229–238.
[33]
Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2016. High-throughput and energy-efficient graph processing on FPGA. In IEEE 24th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM). 103–110.
[34]
Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA. IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2249–2264.
[35]
Hamid Reza Zohouri, Naoya Maruyama, Aaron Smith, Motohiko Matsuda, and Satoshi Matsuoka. 2016. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In International Conference for High Performance Computing, Networking, Storage and Analysis. 409–420.

Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2 (March 2023), 560 pages
ISSN: 1539-9087  EISSN: 1558-3465  DOI: 10.1145/3572826
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 24 January 2023
Online AM: 20 April 2022
Accepted: 26 March 2022
Revised: 26 February 2022
Received: 09 October 2021
Published in TECS Volume 22, Issue 2

Author Tags

  1. Graph processing
  2. hardware accelerator
  3. deep pipeline
  4. SystemC
  5. OpenCL
  6. Xeon+FPGA

Funding Sources

  • Turkish Academy of Sciences
  • Technological Research Council of Turkey
