Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

Qian Chen, Xiaofeng Yang, and Shengli Lu Q. Chen, X. Yang, and S. Lu are with the Department of National ASIC System Engineering Research Center, Southeast University, Nanjing 210096, China (e-mail: chenqian0103@seu.edu.cn; lsl@seu.edu.cn).This research work is supported by the Big Data Computing Center of Southeast University.

Abstract

Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflow can be categorized into coarse and fine granularity. Coarse dataflow offers good spatial locality but suffers from low parallelism, while fine dataflow provides high parallelism but disrupts the spatial structure, leading to increased nodes and poor data reuse. This paper proposes a novel hardware accelerator for SpTRSV or SpTRSV-like DAGs. The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. Additionally, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intra-node edges computation is developed to enhance data reuse. Experimental results on 264 benchmarks with node counts reaching up to 85,392 demonstrate that this work achieves average performance improvements of 12.2 $\times$ (up to 874.5 $\times$ ) over CPUs and 10.1 $\times$ (up to 740.4 $\times$ ) over GPUs. Compared to the state-of-the-art technique (DPU-v2), this work shows a 2.5 $\times$ (up to 5.9 $\times$ ) average performance improvement and 1.8 $\times$ (up to 4.1 $\times$ ) average energy efficiency enhancement.

Index Terms:

Sparse triangular solve, hardware-software codesign, hardware accelerator, parallel processor, sparse irregular computation

I Introduction

Sparse triangular solve (SpTRSV) is a crucial computational kernel extensively used in various domains, including direct solvers [1, 2, 3, 4], preconditioned iterative solvers [5, 6], and least square problems. Different from the sparse matrix-vector multiplication (SpMV) [7] and sparse matrix-matrix multiplication (SpMM) [8], the outputs of SpTRSV depend on the previous inputs, making it inherently sequential. It is also frequently executed in many scientific and engineering applications, such as transient simulations with fixed steps for power delivery networks [9, 10, 11]. As a result, SpTRSV has become a performance bottleneck in many applications [12].

SpTRSV is a typical sparse irregular graph computation represented by a directed acyclic graph (DAG), where nodes represent the matrix rows and edges represent the non-zeros that are not on the diagonal. The level-based or node-based methods can be used to accelerate such a DAG. The level-based method, also called the level-scheduling method or level-set method, segments independent nodes into levels and utilizes multiple threads to calculate the nodes in each level simultaneously. The threads between levels need to be globally synchronized to maintain the data dependency. The node-based method, also called the synchronization-free method, allocates nodes directly to threads without building the levels and each thread commences computation once all its dependent nodes are completed. The former is typically applied in CPUs [13, 14], whereas the latter is used in GPUs [15, 16] because GPUs have more lightweight threads than CPUs. Furthermore, some studies focus on exploiting 2D spatial structure of the coefficient matrix [17, 18, 19] or reducing dependencies [20, 21] to optimize SpTRSV based on CPUs and GPUs.

Accelerating SpTRSV on CPUs and GPUs faces some challenges: 1) SpTRSV requires frequent synchronization, but the associated overhead on both CPUs and GPUs is substantial; 2) each node has too few edges to be worth dedicating a separate thread on either CPUs or GPUs, even when these nodes are grouped into levels; 3) the irregular sparse nature disrupts both temporal and spatial locality, leading to inefficient memory access patterns on CPUs and GPUs. Designing a specific hardware architecture can effectively tackle these challenges. For example, Nimish Shah et al. employ processing elements (PEs) and software-managed scratchpads (or register files) to replace threads and hardware-managed caches, respectively [22, 23, 24]. They refer to the design as the DAG processing unit (DPU and DPU-v2). Compared to the hardware-managed caches, the software-managed scratchpads have a lower granularity of data access and are more controllable. This allows nodes to be placed in the appropriate locations in advance based on the data dependencies, addressing the issue of irregular memory access. Furthermore, the synchronization costs in a specific architecture are significantly lower than CPUs or GPUs. To address the limited parallelism of the original DAG, they transform it into a binary DAG by cascading multiple two-input nodes to represent each multi-input node. The binary DAG is then mapped onto a tree-shaped PEs array, thereby achieving higher parallelism.

We refer to the original DAG as the coarse DAG and the binary DAG as the fine DAG, with their respective dataflows termed coarse and fine dataflows. The coarse dataflow represents the node-based or level-based method, and the fine dataflow represents the method used by DPU and DPU-v2. While the fine dataflow improves parallelism, its performance still falls short of expectations. Compared to the coarse DAG, the fine DAG introduces numerous intermediate nodes and disrupts the spatial structure of the original matrix. In the coarse dataflow, the partial sum of a coarse node can be reused until completion. In contrast, in the fine dataflow, due to a coarse node being replaced by multiple cascaded binary nodes, the partial sum can only be written back to the register files if the cascading depth exceeds the designed PEs array. The cascading depth, which depends on the number of inputs of a coarse node, is variable and can easily reach tens or even hundreds. However, the depth of the tree-shaped PEs array is typically fixed and usually lower than ten, leading to inefficient data reuse. Additionally, the increased number of nodes exacerbates bank conflicts, further degrading performance.

In this paper, we propose a novel hardware accelerator to address the aforementioned challenges. The contributions of this paper are summarized as follows.

•

A novel hardware accelerator, complemented by a custom compiler, is proposed to accelerate SpTRSV or SpTRSV-like DAGs. The accelerator utilizes a medium granularity dataflow to achieve high performance. The dataflow incorporates a coarse node allocation method to maintain the spatial structure and a fine-grained edge computation method to increase parallelism.
•

A partial sum caching mechanism is introduced to store the intermediate results of a coarse node when there are no computable edges, allowing PEs to proceed with the computation of other nodes. The mechanism reduces the blocking frequency of PEs and enhances overall performance.
•

A reordering algorithm for intra-node edges computation is proposed. The algorithm schedules the computation of similar edges within the same cycle, to enhance data reuse and reduce bank constraints and conflicts.
•

Experimental results demonstrate that, compared to the state-of-the-art technique, our approach achieves an average throughput of 6.5 GOPS (up to 14.5 GOPS), a performance speedup of 2.5 $\times$ , and an energy efficiency improvement of 1.8 $\times$ . The PEs utilization in this work can reach up to 75.3%.

II Background

II-A Sparse Triangular Solve on CPUs and GPUs

Input :

n

rowptr

colidx

value

: A

n\times n

sparse triangular matrix stored in the form of compressed sparse row.

b

: The right-hand vector.

Output :

x

: The solution vector.

1 for $i\leftarrow 0$ to $n-1$ do

ie\leftarrow rowptr[i+1]-1

sum\leftarrow 0

4 for $j\leftarrow rowptr[i]$ to $ie-1$ do

sum\leftarrow sum+value[j]\times x[colidx[j]]

6 end for

x[i]\leftarrow(b[i]-sum)/value[ie]

8 end for

return $x$

Algorithm 1 Serial sparse triangular solve

Refer to caption — Figure 1: The level-scheduling method for SpTRSV.

The sparse triangular solve can be represented by solving $Lx=b$ , where $L$ is a sparse triangular matrix and $b$ is the right-hand vector. Sparse matrices are commonly stored in formats like compressed sparse row (CSR) and compressed sparse column (CSC) [25]. In CSR format, a sparse matrix is represented by three vectors: one storing the column indices of non-zeros in row-major order, another storing the corresponding values, and a third vector storing the position indices of the first non-zero in each row. The CSC format is similar but stores entries in column-major order. A basic approach to solving the triangular system is elimination. Analysis of the coefficient matrix shows that the unknowns in the solution vector are influenced only by preceding unknowns. Thus, the elimination can be performed sequentially based on the original order of the solution vector. Assuming the first $j-1$ rows are solved, solving for the $j$ -th unknown ( $x_{j}$ ) involves using the column indices of the non-zeros in this row to index the corresponding entries in the solution vector and perform the multiply-accumulate operations (lines 4-6 in algo. 1). The right-hand data $b_{i}$ and the diagonal entry $L_{ii}$ are then used to update and obtain the final result, as shown in equation 1.

Lx=b\hskip 9.00002pt\Rightarrow\hskip 9.00002ptx_{i}=\frac{1}{L_{ii}}\left(b_{% i}-\sum_{j=1}^{i-1}L_{ij}x_{j}\right)

(1)

For sparse matrices, calculating each unknown does not utilize all preceding results, allowing for mining the parallelism. Sparse triangular matrices can be converted into DAGs, where each edge represents a multiply-accumulate operation and each node represents an unknown and a self-update operation (subtraction and division). The level-scheduling method divides all nodes into multiple levels, where nodes within the same level are independent and have the same depth from the source node. The nodes within the same level can be computed simultaneously by multiple threads, but synchronization between levels is necessary to ensure data dependencies, as shown in figure 1. The synchronization-free method assigns a node to a warp on GPUs and begins parallel computation of the input edges after all dependent nodes are completed, without waiting for other nodes at the same level. For example, in figure 1, node 5 can start computation once nodes 2 and 4 are completed, without waiting for node 6. This method achieves higher parallelism but results in more communication overhead. Atomic operations are used to mitigate the cost of thread synchronization. In SpTRSV DAGs, consecutively dependent nodes (e.g., nodes 7 and 9 in figure 1) usually occupy most edges, limiting the performance of node-based methods. The block and tiled algorithms are proposed to exploit intra-node parallelism, and the rewrite method is introduced to reduce DAG dependencies, to achieve better performance.

However, using CPUs and GPUs to accelerate SpTRSV remains challenging [26, 27]. On one hand, the sparse irregular data distribution disrupts temporal and spatial locality, rendering high-access-granularity caches nearly ineffective [28, 29, 30, 31]. For example, in modern CPUs and GPUs, the minimum data width transferred from the memory to cache is typically 32 or 64 words. Due to irregular distribution, only one of these words may be used, causing high cache misses and frequent data transfers. On the other hand, SpTRSV requires frequent synchronization, and the computation per node is minimal, leading to threads on CPUs and GPUs consuming most time in communication rather than computation. Therefore, the customizable hardware platform may be an attractive solution to accelerate SpTRSV.

II-B Very-Long-Instruction-Word Architecture

The very-long-instruction-word (VLIW) architecture is a design approach for parallel processors. It enhances parallelism and processing efficiency by merging multiple operations into a single instruction word and executing them simultaneously [32, 33, 34, 35, 36]. Unlike traditional architectures, which rely on hardware for instruction scheduling and parallel execution, the VLIW architecture utilizes the compiler to determine instruction-level parallelism. This approach reduces the complexity of control logic and simplifies hardware design. By integrating software-managed register files, the VLIW architecture can implement communication between compute units (CUs) without synchronization overhead. Therefore, the VLIW architecture is well-suited for accelerating DAG tasks that require complex scheduling logic and frequent synchronization, such as SpTRSV. The main disadvantage of the VLIW architecture is that the compiler becomes more complex.

II-C DAG Processing Unit and This Work

In recent years, a few studies have attempted to accelerate SpTRSV based on customized architectures. The latest and representative work is the DAG processing unit v2 (DPU-v2) [22]. Based on the VLIW architecture, DPU-v2 utilizes tree-shaped PEs arrays and pipelining techniques to enhance data reuse. Each PE has two inputs and one output, performing either multiplication or addition operations, as illustrated in figure 2. To map such arrays, the DPU-v2 compiler converts the original coarse DAG into a fine binary DAG by inserting multiple cascaded two-input nodes. This approach theoretically maximizes parallelism since a coarse node is divided into multiple fine nodes that can be computed simultaneously. While DPU-v2 targets general DAG tasks, it is only effective for DAGs where each edge represents a single logical operation and each node contains few edges. DAGs from machine learning, like sum-product networks [37, 38], often meet these conditions.

However, the DPU-v2 architecture becomes inefficient for SpTRSV-like DAGs. These DAGs are characterized by three features: 1) each edge represents two or more cascaded logical operations, such as multiply-accumulate; 2) a node possibly contain multiple edges (more than 10); 3) consecutively dependent nodes occupy most edges. The first two features increase the number of nodes in the binary DAG and reduce the data reuse efficiency of the tree-shaped PEs array. For instance, consider using a tree-shaped PEs array with a depth of 2 to calculate a coarse node with 4 input edges, where the source nodes of these edges are completed, as shown in figure 3. The PEs array needs to be mapped 4 times to complete this node, and the intermediate results must be written back to register files in the first 3 mappings. Despite the appearance that 3 PEs are calculating a coarse node simultaneously, the pipelining structure and register files access degrade performance, potentially making it less efficient than using a single PE. Increasing the depth of tree-shaped PEs array can mitigate this issue but exponentially increases hardware complexity. The third feature limits the parallelism of the traditional coarse dataflow, such as level-scheduling and synchronization-free methods.

This work proposes an architecture with feedback-structured PEs to achieve adaptive and efficient data reuse. The compiler determines the reuse count by calculating the nodes indegree. The functionality of the PE aligns with the basic operations of the DAG tasks. This paper evaluates the performance of the architecture for SpTRSV. Given that the basic operations of SpTRSV are a cascade of multiplication and addition (see line 5 and line 7 in algo. 1, where subtraction and division can be transformed into addition and multiplication by the compiler), and that the coefficient matrix $L$ and right-hand vector $b$ are read-only, we design the architecture as shown in figure 2, in which PEs are assigned coarse nodes, $L$ and $b$ are fed to PEs through FIFOs, and the partial sums are reused through feedback structures. The architecture employs a novel dataflow, called the medium granularity dataflow, to overcome the parallelism limitations of the traditional coarse dataflow when handling consecutively dependent nodes.

III System Architecture

In real-world applications, a sparse triangular system is usually solved multiple times with the same coefficient matrix. We can preprocess the data structure before SpTRSV is executed because the preprocess time can be amortized. Figure 4 illustrates the system architecture, comprising a software compiler and a hardware accelerator. The compiler preprocesses a sparse triangular matrix based on the proposed dataflow, maps nodes to register files, and generates instructions. The accelerator executes these instructions to obtain the solution vector and stores it in the data memory.

III-A Compiler Overview

The compiler workflow is as follows: firstly, traverse the adjacency graph of the coefficient matrices and allocate nodes to PEs according to the topological order of the graph. Next, apply the medium granularity dataflow (section IV. A) and the partial sum caching mechanism (section IV. B) to establish the initial computation pattern for PEs. Then, without changing the order of coarse nodes computation and the access mode of the $psum$ register files determined in the previous step, rearrange the calculation order of the edges within the nodes in each cycle (section IV. C) to enhance data reuse and reduce bank conflicts. Subsequently, determine the memory access requirements of the PEs, set constraints for nodes, and resolve bank conflicts by a greedy graph coloring algorithm. Finally, address the potential spilling issues of the register files and generate the instructions.

III-B Accelerator Architecture

The proposed accelerator achieves flexible dataflow by executing the instructions, as shown in figure 4 (b). It consists of multiple CUs, interconnects, and on-chip memories. The CUs are connected by input and output interconnects, which are implemented by the crossbars, enabling communication between them. The crossbar-based interconnects decouple the PEs and register files, minimizing bank conflicts. The on-chip memories include instruction memory, data memory, and stream memory. The instruction memory stores instructions generated by the compiler. The compiler is only re-executed and the instruction memory is updated when the structure of the coefficient matrix changes. The stream memory stores the values of the coefficient matrix and the right-hands, without storing their positional information (i.e., in which rows or columns). This is because the data in the stream memory is reordered by the compiler through the analysis of the established medium granularity dataflow, so that the positional information is hidden in the instructions. Consequently, during the running of the accelerator, data in the instruction memory and stream memory is accessed sequentially. The results of the accelerator are written to the data memory and may be read out when the register file spills.

out=\left\{\begin{array}[]{l}\vspace{0.5em}\left(b_{i}-psum\right)\times L_{ij% }(ct=0)\\ psum+L_{ij}\times x_{i}(ct=1)\end{array}\right.

(2)

Each CU comprises a decoder, a control unit, a D type flip-flop (DFF), an $x_{i}$ register file, a $psum$ register file, a PE, and several multiplexers. The PE consists of a cascaded 32-bit floating-point adder and multiplier, implementing the two fundamental operations of SpTRSV via a 1-bit control signal ( $ct$ in figure 4 (b)). The functionality of SpTRSV and PE are described in equations 1 and 2, respectively. The division operation is performed by computing the reciprocal in the compiler and using multiplication in the hardware. The PE output can follow two paths: one as a partial sum, which is either written to the $psum$ register file or fed back to the $psum$ input of the PE, depending on whether the current node is blocked in the next cycle. When the node in the $psum$ register file is unblocked, the partial sum is read out and the address is released. This will be discussed in detail in section IV. B. When the PE starts computing a new node, the $psum$ input is set to zero. The other path is the solution for the node, which is written through the output interconnect to the $x_{i}$ register file of any CU, or directly reused by other PEs through input and output interconnects.

The decoder translates instructions to obtain read addresses and read-write enable signals for the register files and data memory, as well as selection signals for input and output interconnects, and sends other data to the control unit. The structure of the instruction is shown in figure 5 (a). Write addresses for the register files and data memory are automatically generated by the control unit to reduce instruction length. The principle for generating write addresses is to always write data to the empty location with the lowest address. The method for generating write addresses is illustrated in figure 5 (b). For the register files, a 1-bit valid flag is set for each address to determine if it is free, and a priority encoder is used to generate the write address. Whether an address in the $x_{i}$ register file is released after being read is determined by a 1-bit signal ( $R^{x_{i}}_{vs}$ ) in the instruction. Data in the $psum$ register file is released once read out, as it is intermediate data and will not be reused by other PEs. For the data memory, write addresses are generated by a counter initialized to zero, incrementing with the write enable signal valid. This is because data written to the data memory is the final result required by the user and will not be overwritten.

All CUs in the accelerator share a single clock, and PEs achieve synchronized computation through DFFs. Thus, with a software-managed memory system, the compiler can fully predict the behavior of the hardware and the details of data stored in the register file and memory, to address the problem of bank conflicts. Data will be spilled from the $x_{i}$ register file if the number of nodes exceeds capacity. Given the execution schedule, a live-range analysis is performed to determine when the spilling is required. If the number of free addresses falls below a threshold, the data in the register file will be written to the data memory, or the address will be directly released if the data memory already holds the same data. The spilled data will be loaded back before it is supposed to be consumed. The compiler decides whether to insert nop cycles to perform the store and load operations by considering the port occupancy of the $x_{i}$ register file.

IV Custom Compiler Methodology

In this section, we introduce the medium granularity dataflow, the partial sum caching mechanism, and the intra-node edges computation reordering algorithm. The medium granularity dataflow combines the advantages of coarse and fine dataflow to enhance performance for SpTRSV or SpTRSV-like DAG tasks. The partial sum caching mechanism provides temporary storage to reduce PEs blocking frequency and shorten the critical path length. The intra-node edges computation reordering algorithm exploits the spatial locality to improve data reuse and alleviate constraints on banked register files.

IV-A Medium Granularity Dataflow

In coarse dataflow, a node represents the minimal computational task. A PE or thread only starts computing a node after all its predecessors are completed. This method is typically used in CPUs and GPUs, where the task granularity must be large enough to offset thread management overheads (e.g., startup, termination, synchronization). However, real application DAGs often have many consecutively dependent nodes, resulting in nearly sequential computation (see the coarse dataflow example in figure 6). In fine dataflow, a coarse node is divided into multiple cascaded two-input fine nodes, each representing a multiplication or addition and mapped to a PE. This allows multiple PEs to compute a coarse node simultaneously. However, SpTRSV-like DAGs have two key characteristics: 1) the fundamental operations are cascaded multiplications and additions, needing two fine nodes per coarse node; 2) numerous coarse nodes with multiple inputs create many intermediate fine nodes. The two characteristics significantly increase node count and critical path length. Although a tree-shaped PE array can improve data reuse, the reuse depth is limited and many intermediate nodes still need to be written back to the register files for later use. Furthermore, the increased intermediate nodes also exacerbate bank conflicts.

The proposed dataflow combines the advantages of coarse and fine dataflows. Here, a coarse node is mapped to a PE, which processes any available edges immediately once its source node is completed, without waiting for all input edges. Thus, this method can be seen as coarse allocation with fine computation. The output of a PE is fed back to the $psum$ input to reuse partial sums (orange box in figure 4 (b)). Upon node completion, the output is written to the register files and routed through the interconnects for other PEs. This feedback structure adjusts the depth of data reuse based on the number of edges for each node, ensuring the partial sum is always reused or stored in the $psum$ register file, thus minimizing access demands on the register files and interconnects. The medium granularity dataflow enables parallel computation between nodes, necessitating more synchronization than the coarse or fine dataflow. This is one of the reasons that we utilize synchronized PEs based on the VLIW architecture, because the asynchronous computational architecture incurs additional synchronization costs.

Figure 6 illustrates an example to demonstrate how the three dataflows handle the consecutively dependent nodes. For the coarse dataflow, a PE must ensure that all dependent nodes are completed before computing a node. This results in serial computation for the consecutively dependent nodes. In the given example, assuming no bank conflicts and that the $L$ and $b$ are previously stored in register files, the coarse dataflow consumes 12 cycles. For the fine dataflow, the original DAG is converted into a binary DAG and mapped onto tree-shaped PEs. In this example, the binary DAG is divided into 9 blocks. Considering the impact of register file access and pipeline structure, these 9 blocks consume 19 cycles. Since PEs in the fine dataflow perform a single logical operation while the coarse and this-work dataflows perform 2 logical operations. For a fair comparison, we assume the fine dataflow operates at twice the clock frequency of the others. Thus, the fine dataflow consumes 9.5 cycles. In the proposed dataflow, we retain the node allocation strategy of the coarse dataflow, assigning one coarse node to a PE, but do not treat it as a minimal computational task. The PEs will compute a node if one of its dependent nodes is completed. For the given example, this work dataflow consumes 8 cycles. In summary, the proposed dataflow demonstrates superior performance compared to both the coarse and fine dataflows.

IV-B Partial Sum Caching Mechanism

Due to coarse node allocation, the medium granularity dataflow may experience blocking when a PE has no edges to compute for the current node. Blocking occurs in two scenarios: one where all nodes in the task list of the PE are blocked, and another where other nodes still have computable edges. The first scenario, related to the DAG structure and node allocation order, is rare and can be addressed by optimizing the dataflow, such as using a reconfigurable dataflow. For the more common second scenario, we propose a partial sum caching mechanism to mitigate the blocking. In our architecture, when blocking occurs, each CU uses a $psum$ register file to cache partial sums, then traverses the task list to find the first unblocked node and calculates it. The $psum$ register file is local and does not occupy interconnect resources. Nodes in the $psum$ register file must be prioritized to avoid potential deadlocks, as they are always parent nodes to others. Thus, in each cycle, if a node in the $psum$ register file is unblocked, the PE must pause the current task to compute this node, regardless of whether the current node is blocked.

Determining when to perform read/write operations on the $psum$ register file is crucial. Firstly, without considering the capacity of the $psum$ register file, the task of a PE for the current cycle is determined based on the above principles. If the PE is blocked (this blocking is from the first scenario), or the current node and the previous node are the same, or the previous node is completed and the current node is new, the PE will reuse the partial sum or set it to zero without accessing the $psum$ register file. If the previous node is completed and the current node is old, the PE will read the partial sum from the $psum$ register file and release the corresponding address. If the previous node is not completed and the current node is new, the PE will put the previous partial sum into the $psum$ register file and set the current partial sum to zero. In this case, the $psum$ register file must have at least two free addresses, or one free address if the current node is the first new node in the task list. Otherwise, the PE will be blocked to avoid the potential deadlock, as shown in figure 7. If the previous node is not completed and the current node is old, the PE will put the previous partial sum into the $psum$ register file and read the current partial sum from it. This scenario does not need to consider the capacity of the $psum$ register file, as it supports read-before-write operations. Figure 4 (a) shows a simple example to illustrate the partial sum caching mechanism, where PE 3 is blocked while computing node 5. In the third cycle, the PE stores node 5 into the $psum$ register file and computes the next node (node 9) in its task list. Then, in the next cycle, node 5 is unblocked and read out for computation. Meanwhile, node 9 is stored in the $psum$ register file.

IV-C Intra-node Edges Computation Reordering Algorithm

Input :

C

: A container with

m

sub-containers, each holding edges for computation in this cycle, where

m\leq

number of PEs.

Output :

M

: A key-value table mapping PE indices to the edges each PE computes in this cycle.

// Classify edges in

C

based on their source nodes. The R-value of a category indicates the number of edges in that category.

R\leftarrow compute\_category(C)

M\leftarrow\phi

D\leftarrow C

4 while not $D.empty()$ do

// Find the category with the most edges in

D

EV\leftarrow get\_max\_category(D)

6 if $EV.size()\geq 2$ then

// Find the category with the minimum R-value in

EV

E\leftarrow get\_min\_R\_category(R,EV)

8 else

E\leftarrow EV[0]

10 end if

// Compute the PE index in

E

and establish a key-value table.

M\_i\leftarrow get\_mapping(E)

M.add(M\_i)

// Mark the sub-containers associated with each key in

M\_i

and remove them in

D

D\_i\leftarrow mark\_sub\_contains(D,M\_i)

D.remove(D\_i)

15 end while

return $M$

Algorithm 2 Determine edges to be computed for PEs in a certain clock cycle

In each cycle, PEs may have multiple computable edges while processing the current node. Choosing which edge to compute does not affect the final result, but impacts the frequency of accessing the register banks. The traditional method selects the edge based on the ascending order of the source node ID. However, this can lead to repeated readouts of the same source node (the left subfigure of figure 8), as edges connected to it are scattered across different positions for other nodes. We consider edges with the same source node as similar edges and propose a reordering algorithm to group similar edges within the same cycle.

The intra-node edges computation reordering algorithm for each cycle involves the following four steps: firstly, classify similar edges and count the number in each category, denoted as the R-value (line 1 in algo. 2). Secondly, select the category with the highest count. In case of a tie, choose the category with the smallest R-value (lines 4-9). This maximizes reuse in the current cycle and ensures nodes can be reused in subsequent cycles, as illustrated in the right subfigure of figure 8, where PE 1 selects node 8 instead of node 7 in the second cycle, allowing node 7 to be reused in the next cycle. Thirdly, assign edges in the selected category to the corresponding PEs and remove all related edges (lines 10-13). Finally, repeat the second and third steps until all non-blocked PEs are assigned edges.

V Experiments

V-A Experimental Setup

The experiments compare this work with CPU, GPU, and DPU-v2 platforms. CPU: We use the Intel Math Kernel Library (MKL v2018.1) [39] on a Xeon4214 CPU at 2.2 GHz, utilizing the $mkl\_sparse\_s\_trsv$ function. Programs are compiled with GCC v9.5.0, using the -O3 optimization flag and OpenMP v4.0.7. GPU: We use the cuSPARSE Library (v2022.12) [40] on an RTX 3070 laptop GPU at 1.1 GHz, utilizing the $cusparseSpSV\_analysis$ function and $cusparseSpSV\_solve$ function. The $cusparseSpSV\_analysis$ function analyzes the matrix structure and generates a dataflow that matches the GPU architecture, serving as a preprocessing or compilation step for SpTRSV. Programs are compiled with CUDA v12.2. Data transfer time between the host and GPU is excluded. DPU-v2: The processor runs in its default configuration with 56 PEs and 64 register files [22]. For a fair comparison, our accelerator operates at 150 MHz, half the DPU-v2 clock frequency, because our PE connects the adder and multiplier in series while DPU-v2 connects in parallel. The accelerator is synthesized using TSMC 28 nm technology with 64 CUs. Each CU has $x_{i}$ and $psum$ register files configured with 64 and 8 words. The data memory is configured with 8192 words, while both the instruction memory and stream memory are configured with 65536 words. The synthesis results are shown in table I. Our compiler is implemented in C++ 11. The compiler runtime environment for the GPU, DPU-v2, and this work is the same. We use 264 public benchmarks [41] from various domains, such as circuit simulation and power networks, to evaluate the accelerator. Table II provides the basic information of the partial benchmarks.

TABLE I: Synthesis Result of The Accelerator With 64 CUs.

	Area		Power
	$\text{mm}^{\text{2}}$	%	mW	%
Datapath:
PEs	0.07	3.3	16.00	10.2
Fifos	0.16	7.7	28.22	18.1
Pipelining registers	0.02	0.8	6.85	4.4
Input interconnect	0.04	2.1	9.65	6.2
Output interconnect	0.04	2.1	8.36	5.3
Register file	0.28	13.1	29.86	19.1
Control:
Control units	0.02	0.9	5.41	3.5
Multiplexers	0.00	0.2	1.85	1.2
Data memory	0.11	5.4	7.07	4.5
Instruction memory	0.64	30.1	17.09	10.9
Stream memory	0.72	34.3	25.86	16.6
	2.11		156.21

TABLE II: Basic Information of Partial Benchmarks

ID	Name	N	NNZ	Binary nodes
1	rajat19	1157	3956	6755
2	c-36	7479	12186	16893
3	bayer05	3268	26316	49364
4	HB_jagmesh4	1440	22600	43760
5	cz628	628	9123	17618
6	ex3	1821	30204	58587
7	fpga_dcop_01	1220	4303	7386
8	nnc1374	1374	17897	34420
9	add32	4960	14451	23942
10	bips98_606	7135	28759	50383
11	fpga_trans_01	1220	5371	9522
12	gemat12	4929	28415	51901
13	filter2D	1668	8647	15626
14	ex10hs	2548	30515	58482
15	ex10	2410	26853	51296
16	add20	2395	9867	17339
17	rajat04	1041	7625	14209
18	circuit204	1020	8008	14996
19	bayer07	3268	26316	49364
20	ACTIVSg2000	4000	42840	81680

V-B Throughput Comparison over Fine Dataflow

Figure 9 (a) compares the throughput of the fine and this work dataflows. This work dataflow does not utilize the partial sum caching mechanism. The coarse dataflow was not tested because this work dataflow enables the edges computation based on it, which ensures the performance at least equal to or better than the coarse dataflow. For a few benchmarks, such as matrices rajat19, cz628, and rajat04, the throughput of our proposed dataflow is lower than that of the fine dataflow. This is because these matrices have relatively few consecutively dependent nodes and few input edges per node, leading to the more frequent blocking in our dataflow. In contrast, the fine dataflow, while not significantly increasing intermediate nodes in the binary DAG, achieves better throughput by enhancing parallelism. However, for most benchmarks, the proposed dataflow demonstrates higher throughput and performance.

V-C Impact of Partial Sum Caching Mechanism

Figure 9 (b) and (c) illustrate the impact of the partial sum caching mechanism on reducing blocking cycles and overall cycles under different $psum$ register files capacities. Given the wide variance in benchmark computations, the data was normalized. The results indicate that the partial sum caching mechanism reduces blocking frequency and improves overall performance. It achieves optimization with small capacities because it always prioritizes cached nodes, allowing efficiently use of the $psum$ register files. When the blocking cycles no longer change with capacity, residual blocking is attributed to sparse structures and unresolved bank conflicts. For certain benchmarks, such as matrices rajat19 and c-36, we observe that while blocking cycles significantly decrease, the critical path is shortened less or even remains unchanged. This is because the proportion of blocking cycles is inherently small, or the reduced blocking cycles are not on the critical path, which is related to the sparse structure and coarse node allocation.

V-D ICR Optimization for Bank Conflicts and Data Reuse

Figure 9 (d), (e), and (f) compare the changes in constraints, bank conflicts, and data reuse before and after applying the intra-node edges computation reordering (ICR) algorithm. The ICR algorithm effectively reduces constraints and improves data reuse. Generally, reducing constraints can lower the bank conflicts because in the greedy graph coloring algorithm, fewer constraints mean more choices for nodes regarding banks allocation, thereby increasing the optimization space of these nodes and their directly or indirectly connected nodes. In few cases, such as the matrix add32, the ICR algorithm might make bank conflicts increase. This occurs because the ICR algorithm focuses on the optimal computation order at the current time without considering the global. Although this can reduce the overall number of constraints, it can alter the constraint graph structure, potentially increasing constraints for some nodes. Nonetheless, such instances are rare, and in most cases, the ICR algorithm results in positive optimization.

V-E Instruction Breakdown

Figure 10 shows the breakdown of instructions, where the execute instructions represent the PEs performing regular logical operations, the nop instructions indicate the PEs being blocked and the outputs are unchanged. Fewer nop instructions signify that the system is closer to achieving the theoretical peak throughput of the proposed dataflow. Thanks to the crossbar-based interconnects, our architecture achieves approximately 90% of the peak throughput on average. The nop instructions arise from four aspects: 1) the sparse structure of the DAG; 2) limited capacity of $psum$ register files; 3) load imbalance caused by coarse node allocation; 4) interconnect port conflicts. Except for the first point, other issues can be mitigated through compiler algorithm optimizations.

TABLE III: Performance Comparison of 264 SuiteSparse Benchmarks [41] With Other Platforms

	CPU	GPU	DPU-v2	This work
Technology (nm)	14	8	28	28
Frequency (MHz)	2200	1110	300	150
Peak throughput (GOPS)	3379.2	15974.4	16.8	19.2
Avg. throughput (GOPS)	0.5	0.6	2.6	6.5
Speedup	1 $\times$	1.2 $\times$	4.9 $\times$	12.2 $\times$
Power (W)	>50	>50	0.109	0.156
Avg. energy efficiency (GOPS/W)	<0.01	<0.01	23.7	41.6
Avg. compile time (s)	-	0.02	105.90	0.03

V-F Performance Comparison with Other Platforms

Figure 11 shows the throughput comparison of this work with other platforms on partial benchmarks. The performance of our approach exceeds that of other platforms. From figure 9 (a), it is evident that for the matrix rajat04, the throughput of our dataflow is initially lower than DPU-v2. However, with the partial sum caching mechanism, the performance is improved and surpasses DPU-v2. Figure 12 compares the performance of our work against other platforms across 264 benchmarks with nodes ranging from 19 to 85,392. DPU-v2 fails to produce results for 7 benchmarks due to excessive compilation times (exceeding 300 minutes). These benchmarks are odepa400, olm1000, lung1, mhda416, extr1b, extr1, and ex22. The compilation time for DPU-v2 is significantly longer than that for our approach and the GPU. This is due to the coarse nodes with numerous inputs (matrices with many rows containing numerous non-zeros) greatly increase the number of nodes when converted into the binary nodes. The complexity of decomposing these binary nodes into blocks during DPU-v2 compilation is $O(N^{2})$ , where $N$ is the number of binary nodes, resulting in a time-consuming process. Although the compilation time can be amortized over multiple SpTRSV executions, it still affects overall performance. Longer compilation time necessitate more SpTRSV executions to realize performance benefits. The CPU exhibits low average throughput as SpTRSV disrupts both temporal and spatial locality, rendering SIMD functions and hardware-managed cache mechanisms ineffective. The GPU, despite its higher peak throughput than the CPU, performs worse on smaller benchmarks because limited cache capacity leading to more cache misses. However, as node size increases, the GPU gains a performance edge from its numerous CUDA cores. Table III summarizes key metrics of our work and other platforms. Overall, our approach achieves average performance improvements of 12.2 $\times$ (up to 874.5 $\times$ ) and 10.1 $\times$ (up to 740.4 $\times$ ) compared to the CPU and GPU, respectively. Compared to DPU-v2, our work demonstrates an average performance improvement of 2.5 $\times$ (up to 5.9 $\times$ ) and an energy efficiency improvement of 1.8 $\times$ (up to 4.1 $\times$ ). Furthermore, the throughput of our work shows good scalability.

VI Summary and Future Works

In this paper, we propose a novel hardware accelerator for sparse triangular solve. The accelerator employs a medium granularity dataflow that leverages coarse node allocation and fine edge computation. This hybrid granularity approach ensures both parallelism and spatial locality. We also introduce a partial sum caching mechanism to reduce the blocking frequency of processing elements and an intra-node edges computation reordering algorithm to enhance data reuse. Experimental results demonstrate that our method outperforms the current state-of-the-art techniques. Future works will focus on optimizing load balance issues caused by coarse node allocation and enhancing architecture scalability through an efficient interconnect design. The steps in the compiler, such as allocating nodes to PEs and determining the dataflow, are not independent and should be considered holistically to achieve optimal performance. Introducing artificial intelligence may yield satisfactory results.

References

[1] T. A. Davis, Direct methods for sparse linear systems. SIAM, 2006.
[2] T. Wang, W. Li, H. Pei, Y. Sun, Z. Jin, and W. Liu, “Accelerating sparse lu factorization with density-aware adaptive matrix multiplication for circuit simulation,” in 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023, pp. 1–6.
[3] K. He, S. X.-D. Tan, H. Wang, and G. Shi, “GPU-accelerated parallel sparse LU factorization method for fast circuit analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140–1150, 2015.
[4] W.-K. Lee, R. Achar, and M. S. Nakhla, “Dynamic GPU parallel sparse LU factorization for fast circuit simulation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 11, pp. 2518–2529, 2018.
[5] Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003.
[6] H. Anzt, E. Chow, and J. Dongarra, “Iterative sparse triangular solves for preconditioning,” in Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings 21. Springer, 2015, pp. 650–661.
[7] N. Jao, A. K. Ramanathan, J. Sampson, and V. Narayanan, “Sparse Vector-Matrix Multiplication Acceleration in Diode-Selected Crossbars,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 12, pp. 2186–2196, 2021.
[8] E. B. Tavakoli, M. Riera, M. H. Quraishi, and F. Ren, “FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson’s Algorithm on FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024.
[9] T. Yu and M. D. Wong, “PGT_SOLVER: An efficient solver for power grid transient analysis,” in 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2012, pp. 647–652.
[10] J. Yang, Z. Li, Y. Cai, and Q. Zhou, “PowerRush: An efficient simulator for static power grid analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 10, pp. 2103–2116, 2013.
[11] R. Achar, M. S. Nakhla, H. S. Dhindsa, A. R. Sridhar, D. Paul, and N. M. Nakhla, “Parallel and scalable transient simulator for power grids via waveform relaxation (PTS-PWR),” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 2, pp. 319–332, 2009.
[12] M. Freire, J. Ferrand, F. Seveso, E. Dufrechou, and P. Ezzatti, “A GPU method for the analysis stage of the SPTRSV kernel,” The Journal of Supercomputing, vol. 79, no. 13, pp. 15 051–15 078, 2023.
[13] E. Anderson and Y. Saad, “Solving sparse triangular linear systems on parallel computers,” International Journal of High Speed Computing, vol. 1, no. 01, pp. 73–95, 1989.
[14] J. H. Saltz, “Aggregation methods for solving sparse triangular systems on multiprocessors,” SIAM journal on scientific and statistical computing, vol. 11, no. 1, pp. 123–144, 1990.
[15] W. Liu, A. Li, J. Hogg, I. S. Duff, and B. Vinter, “A synchronization-free algorithm for parallel sparse triangular solves,” in Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings 22. Springer, 2016, pp. 617–630.
[16] W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter, “Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides,” Concurrency and Computation: Practice and Experience, vol. 29, no. 21, p. e4244, 2017.
[17] N. Ahmad, B. Yilmaz, and D. Unat, “A split execution model for sptrsv,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 11, pp. 2809–2822, 2021.
[18] Z. Lu, Y. Niu, and W. Liu, “Efficient block algorithms for parallel sparse triangular solve,” in Proceedings of the 49th International Conference on Parallel Processing, 2020, pp. 1–11.
[19] Z. Lu and W. Liu, “TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs,” CCF Transactions on High Performance Computing, vol. 5, no. 2, pp. 129–143, 2023.
[20] B. Yılmaz and A. F. Yıldız, “A Graph Transformation Strategy for Optimizing SpTRSV,” arXiv preprint arXiv:2206.05843, 2022.
[21] B. Yılmaz, “A novel graph transformation strategy for optimizing SpTRSV on CPUs,” Concurrency and Computation: Practice and Experience, vol. 35, no. 24, p. e7761, 2023.
[22] N. Shah, W. Meert, and M. Verhelst, “DPU-v2: Energy-efficient execution of irregular directed acyclic graphs,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1288–1307.
[23] N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert, and M. Verhelst, “DPU: DAG processing unit for irregular graphs with precision-scalable posit arithmetic in 28 nm,” IEEE Journal of Solid-State Circuits, vol. 57, no. 8, pp. 2586–2596, 2021.
[24] N. Shah, W. Meert, and M. Verhelst, Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra. Springer Nature, 2023.
[25] D. Langr and P. Tvrdik, “Evaluation criteria for sparse matrix storage formats,” IEEE Transactions on parallel and distributed systems, vol. 27, no. 2, pp. 428–440, 2015.
[26] F. Favaro, E. Dufrechou, P. Ezzatti, and J. P. Oliver, “Exploring FPGA optimizations to compute sparse numerical linear algebra kernels,” in Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1–3, 2020, Proceedings 16. Springer, 2020, pp. 258–268.
[27] Z. He, L. Song, R. F. Lucas, and J. Cong, “LevelST: Stream-based Accelerator for Sparse Triangular Solver,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024, pp. 67–77.
[28] K. Lu, Z. Li, L. Liu, J. Wang, S. Yin, and S. Wei, “Redesk: A reconfigurable dataflow engine for sparse kernels on heterogeneous platforms,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1–8.
[29] S. Li, D. Liu, and W. Liu, “Optimized data reuse via reordering for sparse matrix-vector multiplication on fpgas,” in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE, 2021, pp. 1–9.
[30] B. Liu and D. Liu, “Towards high-bandwidth-Utilization SpMV on FPGAs via partial vector duplication,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023, pp. 33–38.
[31] E. Yi, Y. Duan, Y. Bai, K. Zhao, Z. Jin, and W. Liu, “Cuper: Customized Dataflow and Perceptual Decoding for Sparse Matrix-Vector Multiplication on HBM-Equipped FPGAs,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6.
[32] A. Capitanio, N. Dutt, and A. Nicolau, “Partitioned register files for VLIWs: A preliminary analysis of tradeoffs,” ACM SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 292–300, 1992.
[33] J. A. Fisher, P. Faraboschi, and C. Young, Embedded computing: a VLIW approach to architecture, compilers and tools. Elsevier, 2005.
[34] A. K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, “An FPGA-based VLIW processor with custom hardware execution,” in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, 2005, pp. 107–117.
[35] M. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, 1988, pp. 318–328.
[36] D. Sabena, M. S. Reorda, and L. Sterpone, “On the automatic generation of optimized software-based self-test programs for VLIW processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 4, pp. 813–823, 2013.
[37] H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011, pp. 689–690.
[38] Y. Choi, A. Vergari, and G. Van den Broeck, “Probabilistic circuits: A unifying framework for tractable probabilistic models,” UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, p. 6, 2020.
[39] Intel, “Intel Math Kernel Library,” https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html, 2018.
[40] Nvidia, “cuSPARSE Library,” https://docs.nvidia.com/cuda/archive/12.2.1/cuda-toolkit-release-notes/index.html, 2022.
[41] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1–25, 2011.