Several studies have explored the use of GNNs for particle physics applications, such as jet tagging (identification) [29], charged particle tracking [23], and calorimeter energy measurements [31]; a broader overview can be found in the survey [43]. To achieve low latency, FPGAs have been utilized. The work in [13] extends the hls4ml [11] tool to automatically translate GNNs into FPGA firmware for charged particle tracking. GarNet [22], a GNN-based algorithm, has been proposed for calorimeter energy regression.
Numerous studies also focus on general GNN acceleration [16, 17, 18, 28, 44, 53, 58]. AWB-GCN [17] is based on a column-wise-product architecture with runtime re-balancing for GCN acceleration. Its upgraded version, I-GCN [18], presents Islandization, a new runtime graph restructuring algorithm, to improve data locality. BoostGCN [53] presents a hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme for pipelined GCNs. Lin et al. [28] introduce GCN acceleration using HLS and hardware-friendly optimizations. G-NMP [44] presents a Near-Memory Processing (NMP) solution for accelerating GNNs that handles irregular memory accesses. Garg et al. [16] explore various dataflow choices for sparse and dense GNNs on spatial accelerators. The work in [4] develops a taxonomy of parallelism in GNNs. Sohrabizadeh et al. [40] present StreamGCN, a GCN accelerator specialized for streaming processing of small graphs. Chen et al. [7] introduce a heterogeneous pipeline architecture for GNNs on FPGAs equipped with high bandwidth memory (HBM). Abi-Karam et al. [2] propose the GenGNN framework to deliver ultra-fast GNN inference and support a diverse set of GNN models; their results indicate latency at the millisecond level. They also propose FlowGNN [38], which can flexibly support the majority of message-passing GNNs. Kang et al. [24] propose GROW, a GCN accelerator that uses Gustavson's algorithm to architect a sparse-dense GEMM engine with a row-wise product. Sun et al. [42] propose MultiGCN, which balances network latency and network bandwidth for GCNs in multi-node systems. Yang et al. [48] present DRGN, a dynamically reconfigurable accelerator for GNNs. EGCN [21] is proposed, using tiled matrix multiplication to reduce off-chip memory accesses. In terms of raw throughput, I-GCN [18] adopts 4,096 multiplier-accumulator (MAC) units running at 350 MHz, with a peak performance of 2,703 Giga Operations Per Second (GOPS); each MAC is counted as two operations since it includes both a multiplication and an addition. BoostGCN [53] has a peak performance of 1,792 GOPS, running at 250 MHz. GenGNN [2] utilizes 1,344 DSP blocks with a 16-bit fixed-point data representation, resulting in a peak performance of 806 GOPS, and their later work FlowGNN [38] has a peak performance of 1,499 GOPS; both run at 300 MHz. Our design (J4), running at 200 MHz, provides not only sub-microsecond latency but also an effective performance of 3,025 GOPS, which is higher than that of I-GCN, BoostGCN, GenGNN, and FlowGNN. The custom MMMs, which contain no multiplications or additions, are excluded from this count, with the exception of MMM3, which includes a small number of additions. Finally, none of these previous designs target a sub-microsecond scenario. Although application-level throughput could serve as a fairer metric, the use of different applications for benchmarking makes such a comparison more complicated.
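As a rough guide (an assumed relation, not a formula quoted from the cited papers), the peak throughput of these MAC-based designs follows from the number of MAC units and the clock frequency:
\[
\text{Peak throughput} \;\approx\; N_{\mathrm{MAC}} \times 2 \times f_{\mathrm{clk}},
\]
where the factor of two counts each MAC as one multiplication plus one addition; reported figures may fall below this bound depending on precision, DSP packing, and how many units are actually active.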
Previous studies have also investigated algorithm and hardware co-design for GNNs [51, 56, 57, 58]. The work in [56] presents a framework that automatically co-searches GNNs and accelerators to maximize both task accuracy and acceleration efficiency. Zhou et al. [58] propose model-architecture co-design with a lightweight algorithm for temporal GNN inference on FPGAs. You et al. [51] propose the GCoD framework, which involves a two-pronged accelerator. Zhong et al. [57] propose an algorithm-hardware co-design scheme with a reuse-aware sampling method to accelerate GNN inference in mini-batch scenarios. Other studies focus on accelerating GNN training [27, 30, 41, 52]. GraphACT [52] introduces an FPGA-based accelerator with a subgraph-based algorithm for GCN training. Su et al. [41] present an efficient graph sampling accelerator on HBM-enabled FPGAs for training GNNs. Lin et al. [27] propose HP-GNN, which automatically maps GNN training onto CPU-FPGA platforms. DietGNN [30], a crossbar-aware pruning technique, is proposed to accelerate the training of large-scale GNNs.
Most of the previous designs reuse a fixed hardware engine for all GNN layers and process the layers sequentially. This is not efficient for GNN inference on small graphs that requires ultra-low latency (such as \(\lt 1\,\mu s\)) and high throughput, as in scientific applications such as particle identification. This work instead focuses on a layer-wise architecture for streaming processing of input graphs, and we propose multiple novel optimizations to achieve high throughput and sub-microsecond latency. These previous studies are orthogonal to our proposed approach and hardware architecture; their techniques could complement our approach, which could be extended in the future to achieve even lower latency.
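To make the contrast with sequential, single-engine designs concrete, the sketch below is a hypothetical illustration in HLS-style C++ of a layer-wise top-level function in which each GNN layer maps to its own hardware stage, so consecutive input graphs can stream through the stages concurrently. It is not the implementation described in this paper; all names, sizes, and the dataflow directive placement are placeholders.
\begin{verbatim}
// Hypothetical sketch of a layer-wise streaming top level; placeholder
// names and sizes only, not the design evaluated in this work.
#include <array>

constexpr int kNodes = 28;  // assumed number of nodes per small input graph
constexpr int kFeats = 16;  // assumed feature width per node

using NodeFeats = std::array<std::array<float, kFeats>, kNodes>;

// Each layer would map to its own dedicated pipeline stage on the FPGA.
NodeFeats layer1(const NodeFeats& x) { /* aggregate + update, layer 1 */ return x; }
NodeFeats layer2(const NodeFeats& x) { /* aggregate + update, layer 2 */ return x; }
float readout(const NodeFeats& x)    { /* graph-level pooling + classifier */ return x[0][0]; }

// With an HLS dataflow directive, the three stages operate on different
// graphs at the same time, so throughput is set by the slowest stage
// rather than by the sum of all layer latencies.
float gnn_top(const NodeFeats& graph) {
    // #pragma HLS dataflow
    NodeFeats h1 = layer1(graph);
    NodeFeats h2 = layer2(h1);
    return readout(h2);
}

int main() {
    NodeFeats g{};  // all-zero example graph
    return gnn_top(g) > 0.5f ? 1 : 0;
}
\end{verbatim}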