Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

Qian Chen, Xiaofeng Yang, and Shengli Lu Q. Chen, X. Yang, and S. Lu are with the Department of National ASIC System Engineering Research Center, Southeast University, Nanjing 210096, China (e-mail: chenqian0103@seu.edu.cn; lsl@seu.edu.cn).This research work is supported by the Big Data Computing Center of Southeast University.
Abstract

Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflow can be categorized into coarse and fine granularity. Coarse dataflow offers good spatial locality but suffers from low parallelism, while fine dataflow provides high parallelism but disrupts the spatial structure, leading to increased nodes and poor data reuse. This paper proposes a novel hardware accelerator for SpTRSV or SpTRSV-like DAGs. The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. Additionally, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intra-node edges computation is developed to enhance data reuse. Experimental results on 264 benchmarks with node counts reaching up to 85,392 demonstrate that this work achieves average performance improvements of 12.2×\times× (up to 874.5×\times×) over CPUs and 10.1×\times× (up to 740.4×\times×) over GPUs. Compared to the state-of-the-art technique (DPU-v2), this work shows a 2.5×\times× (up to 5.9×\times×) average performance improvement and 1.8×\times× (up to 4.1×\times×) average energy efficiency enhancement.

Index Terms:
Sparse triangular solve, hardware-software codesign, hardware accelerator, parallel processor, sparse irregular computation

I Introduction

Sparse triangular solve (SpTRSV) is a crucial computational kernel extensively used in various domains, including direct solvers [1, 2, 3, 4], preconditioned iterative solvers [5, 6], and least square problems. Different from the sparse matrix-vector multiplication (SpMV) [7] and sparse matrix-matrix multiplication (SpMM) [8], the outputs of SpTRSV depend on the previous inputs, making it inherently sequential. It is also frequently executed in many scientific and engineering applications, such as transient simulations with fixed steps for power delivery networks [9, 10, 11]. As a result, SpTRSV has become a performance bottleneck in many applications [12].

SpTRSV is a typical sparse irregular graph computation represented by a directed acyclic graph (DAG), where nodes represent the matrix rows and edges represent the non-zeros that are not on the diagonal. The level-based or node-based methods can be used to accelerate such a DAG. The level-based method, also called the level-scheduling method or level-set method, segments independent nodes into levels and utilizes multiple threads to calculate the nodes in each level simultaneously. The threads between levels need to be globally synchronized to maintain the data dependency. The node-based method, also called the synchronization-free method, allocates nodes directly to threads without building the levels and each thread commences computation once all its dependent nodes are completed. The former is typically applied in CPUs [13, 14], whereas the latter is used in GPUs [15, 16] because GPUs have more lightweight threads than CPUs. Furthermore, some studies focus on exploiting 2D spatial structure of the coefficient matrix [17, 18, 19] or reducing dependencies [20, 21] to optimize SpTRSV based on CPUs and GPUs.

Accelerating SpTRSV on CPUs and GPUs faces some challenges: 1) SpTRSV requires frequent synchronization, but the associated overhead on both CPUs and GPUs is substantial; 2) each node has too few edges to be worth dedicating a separate thread on either CPUs or GPUs, even when these nodes are grouped into levels; 3) the irregular sparse nature disrupts both temporal and spatial locality, leading to inefficient memory access patterns on CPUs and GPUs. Designing a specific hardware architecture can effectively tackle these challenges. For example, Nimish Shah et al. employ processing elements (PEs) and software-managed scratchpads (or register files) to replace threads and hardware-managed caches, respectively [22, 23, 24]. They refer to the design as the DAG processing unit (DPU and DPU-v2). Compared to the hardware-managed caches, the software-managed scratchpads have a lower granularity of data access and are more controllable. This allows nodes to be placed in the appropriate locations in advance based on the data dependencies, addressing the issue of irregular memory access. Furthermore, the synchronization costs in a specific architecture are significantly lower than CPUs or GPUs. To address the limited parallelism of the original DAG, they transform it into a binary DAG by cascading multiple two-input nodes to represent each multi-input node. The binary DAG is then mapped onto a tree-shaped PEs array, thereby achieving higher parallelism.

We refer to the original DAG as the coarse DAG and the binary DAG as the fine DAG, with their respective dataflows termed coarse and fine dataflows. The coarse dataflow represents the node-based or level-based method, and the fine dataflow represents the method used by DPU and DPU-v2. While the fine dataflow improves parallelism, its performance still falls short of expectations. Compared to the coarse DAG, the fine DAG introduces numerous intermediate nodes and disrupts the spatial structure of the original matrix. In the coarse dataflow, the partial sum of a coarse node can be reused until completion. In contrast, in the fine dataflow, due to a coarse node being replaced by multiple cascaded binary nodes, the partial sum can only be written back to the register files if the cascading depth exceeds the designed PEs array. The cascading depth, which depends on the number of inputs of a coarse node, is variable and can easily reach tens or even hundreds. However, the depth of the tree-shaped PEs array is typically fixed and usually lower than ten, leading to inefficient data reuse. Additionally, the increased number of nodes exacerbates bank conflicts, further degrading performance.

In this paper, we propose a novel hardware accelerator to address the aforementioned challenges. The contributions of this paper are summarized as follows.

  • A novel hardware accelerator, complemented by a custom compiler, is proposed to accelerate SpTRSV or SpTRSV-like DAGs. The accelerator utilizes a medium granularity dataflow to achieve high performance. The dataflow incorporates a coarse node allocation method to maintain the spatial structure and a fine-grained edge computation method to increase parallelism.

  • A partial sum caching mechanism is introduced to store the intermediate results of a coarse node when there are no computable edges, allowing PEs to proceed with the computation of other nodes. The mechanism reduces the blocking frequency of PEs and enhances overall performance.

  • A reordering algorithm for intra-node edges computation is proposed. The algorithm schedules the computation of similar edges within the same cycle, to enhance data reuse and reduce bank constraints and conflicts.

  • Experimental results demonstrate that, compared to the state-of-the-art technique, our approach achieves an average throughput of 6.5 GOPS (up to 14.5 GOPS), a performance speedup of 2.5×\times×, and an energy efficiency improvement of 1.8×\times×. The PEs utilization in this work can reach up to 75.3%.

II Background

II-A Sparse Triangular Solve on CPUs and GPUs

Input :  n𝑛nitalic_n, rowptr𝑟𝑜𝑤𝑝𝑡𝑟rowptritalic_r italic_o italic_w italic_p italic_t italic_r, colidx𝑐𝑜𝑙𝑖𝑑𝑥colidxitalic_c italic_o italic_l italic_i italic_d italic_x, value𝑣𝑎𝑙𝑢𝑒valueitalic_v italic_a italic_l italic_u italic_e: A n×n𝑛𝑛n\times nitalic_n × italic_n sparse triangular matrix stored in the form of compressed sparse row.
b𝑏bitalic_b: The right-hand vector.
Output :  x𝑥xitalic_x: The solution vector.
1 for i0𝑖0i\leftarrow 0italic_i ← 0 to n1𝑛1n-1italic_n - 1 do
2       ierowptr[i+1]1𝑖𝑒𝑟𝑜𝑤𝑝𝑡𝑟delimited-[]𝑖11ie\leftarrow rowptr[i+1]-1italic_i italic_e ← italic_r italic_o italic_w italic_p italic_t italic_r [ italic_i + 1 ] - 1
3       sum0𝑠𝑢𝑚0sum\leftarrow 0italic_s italic_u italic_m ← 0
4       for jrowptr[i]𝑗𝑟𝑜𝑤𝑝𝑡𝑟delimited-[]𝑖j\leftarrow rowptr[i]italic_j ← italic_r italic_o italic_w italic_p italic_t italic_r [ italic_i ] to ie1𝑖𝑒1ie-1italic_i italic_e - 1 do
5             sumsum+value[j]×x[colidx[j]]𝑠𝑢𝑚𝑠𝑢𝑚𝑣𝑎𝑙𝑢𝑒delimited-[]𝑗𝑥delimited-[]𝑐𝑜𝑙𝑖𝑑𝑥delimited-[]𝑗sum\leftarrow sum+value[j]\times x[colidx[j]]italic_s italic_u italic_m ← italic_s italic_u italic_m + italic_v italic_a italic_l italic_u italic_e [ italic_j ] × italic_x [ italic_c italic_o italic_l italic_i italic_d italic_x [ italic_j ] ]
6       end for
7      x[i](b[i]sum)/value[ie]𝑥delimited-[]𝑖𝑏delimited-[]𝑖𝑠𝑢𝑚𝑣𝑎𝑙𝑢𝑒delimited-[]𝑖𝑒x[i]\leftarrow(b[i]-sum)/value[ie]italic_x [ italic_i ] ← ( italic_b [ italic_i ] - italic_s italic_u italic_m ) / italic_v italic_a italic_l italic_u italic_e [ italic_i italic_e ]
8 end for
return x𝑥xitalic_x
Algorithm 1 Serial sparse triangular solve

Refer to caption

Figure 1: The level-scheduling method for SpTRSV.

The sparse triangular solve can be represented by solving Lx=b𝐿𝑥𝑏Lx=bitalic_L italic_x = italic_b, where L𝐿Litalic_L is a sparse triangular matrix and b𝑏bitalic_b is the right-hand vector. Sparse matrices are commonly stored in formats like compressed sparse row (CSR) and compressed sparse column (CSC) [25]. In CSR format, a sparse matrix is represented by three vectors: one storing the column indices of non-zeros in row-major order, another storing the corresponding values, and a third vector storing the position indices of the first non-zero in each row. The CSC format is similar but stores entries in column-major order. A basic approach to solving the triangular system is elimination. Analysis of the coefficient matrix shows that the unknowns in the solution vector are influenced only by preceding unknowns. Thus, the elimination can be performed sequentially based on the original order of the solution vector. Assuming the first j1𝑗1j-1italic_j - 1 rows are solved, solving for the j𝑗jitalic_j-th unknown (xjsubscript𝑥𝑗x_{j}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT) involves using the column indices of the non-zeros in this row to index the corresponding entries in the solution vector and perform the multiply-accumulate operations (lines 4-6 in algo. 1). The right-hand data bisubscript𝑏𝑖b_{i}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the diagonal entry Liisubscript𝐿𝑖𝑖L_{ii}italic_L start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT are then used to update and obtain the final result, as shown in equation 1.

Lx=bxi=1Lii(bij=1i1Lijxj)𝐿𝑥𝑏subscript𝑥𝑖1subscript𝐿𝑖𝑖subscript𝑏𝑖superscriptsubscript𝑗1𝑖1subscript𝐿𝑖𝑗subscript𝑥𝑗Lx=b\hskip 9.00002pt\Rightarrow\hskip 9.00002ptx_{i}=\frac{1}{L_{ii}}\left(b_{% i}-\sum_{j=1}^{i-1}L_{ij}x_{j}\right)italic_L italic_x = italic_b ⇒ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_L start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT end_ARG ( italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) (1)

For sparse matrices, calculating each unknown does not utilize all preceding results, allowing for mining the parallelism. Sparse triangular matrices can be converted into DAGs, where each edge represents a multiply-accumulate operation and each node represents an unknown and a self-update operation (subtraction and division). The level-scheduling method divides all nodes into multiple levels, where nodes within the same level are independent and have the same depth from the source node. The nodes within the same level can be computed simultaneously by multiple threads, but synchronization between levels is necessary to ensure data dependencies, as shown in figure 1. The synchronization-free method assigns a node to a warp on GPUs and begins parallel computation of the input edges after all dependent nodes are completed, without waiting for other nodes at the same level. For example, in figure 1, node 5 can start computation once nodes 2 and 4 are completed, without waiting for node 6. This method achieves higher parallelism but results in more communication overhead. Atomic operations are used to mitigate the cost of thread synchronization. In SpTRSV DAGs, consecutively dependent nodes (e.g., nodes 7 and 9 in figure 1) usually occupy most edges, limiting the performance of node-based methods. The block and tiled algorithms are proposed to exploit intra-node parallelism, and the rewrite method is introduced to reduce DAG dependencies, to achieve better performance.

However, using CPUs and GPUs to accelerate SpTRSV remains challenging [26, 27]. On one hand, the sparse irregular data distribution disrupts temporal and spatial locality, rendering high-access-granularity caches nearly ineffective [28, 29, 30, 31]. For example, in modern CPUs and GPUs, the minimum data width transferred from the memory to cache is typically 32 or 64 words. Due to irregular distribution, only one of these words may be used, causing high cache misses and frequent data transfers. On the other hand, SpTRSV requires frequent synchronization, and the computation per node is minimal, leading to threads on CPUs and GPUs consuming most time in communication rather than computation. Therefore, the customizable hardware platform may be an attractive solution to accelerate SpTRSV.

II-B Very-Long-Instruction-Word Architecture

The very-long-instruction-word (VLIW) architecture is a design approach for parallel processors. It enhances parallelism and processing efficiency by merging multiple operations into a single instruction word and executing them simultaneously [32, 33, 34, 35, 36]. Unlike traditional architectures, which rely on hardware for instruction scheduling and parallel execution, the VLIW architecture utilizes the compiler to determine instruction-level parallelism. This approach reduces the complexity of control logic and simplifies hardware design. By integrating software-managed register files, the VLIW architecture can implement communication between compute units (CUs) without synchronization overhead. Therefore, the VLIW architecture is well-suited for accelerating DAG tasks that require complex scheduling logic and frequent synchronization, such as SpTRSV. The main disadvantage of the VLIW architecture is that the compiler becomes more complex.

II-C DAG Processing Unit and This Work

Refer to caption

Figure 2: Architecture comparison between DPU-v2 [22] and this work.

Refer to caption

Figure 3: An example of converting a coarse node into multiple fine nodes and mapping them to the tree-shaped PEs.

In recent years, a few studies have attempted to accelerate SpTRSV based on customized architectures. The latest and representative work is the DAG processing unit v2 (DPU-v2) [22]. Based on the VLIW architecture, DPU-v2 utilizes tree-shaped PEs arrays and pipelining techniques to enhance data reuse. Each PE has two inputs and one output, performing either multiplication or addition operations, as illustrated in figure 2. To map such arrays, the DPU-v2 compiler converts the original coarse DAG into a fine binary DAG by inserting multiple cascaded two-input nodes. This approach theoretically maximizes parallelism since a coarse node is divided into multiple fine nodes that can be computed simultaneously. While DPU-v2 targets general DAG tasks, it is only effective for DAGs where each edge represents a single logical operation and each node contains few edges. DAGs from machine learning, like sum-product networks [37, 38], often meet these conditions.

However, the DPU-v2 architecture becomes inefficient for SpTRSV-like DAGs. These DAGs are characterized by three features: 1) each edge represents two or more cascaded logical operations, such as multiply-accumulate; 2) a node possibly contain multiple edges (more than 10); 3) consecutively dependent nodes occupy most edges. The first two features increase the number of nodes in the binary DAG and reduce the data reuse efficiency of the tree-shaped PEs array. For instance, consider using a tree-shaped PEs array with a depth of 2 to calculate a coarse node with 4 input edges, where the source nodes of these edges are completed, as shown in figure 3. The PEs array needs to be mapped 4 times to complete this node, and the intermediate results must be written back to register files in the first 3 mappings. Despite the appearance that 3 PEs are calculating a coarse node simultaneously, the pipelining structure and register files access degrade performance, potentially making it less efficient than using a single PE. Increasing the depth of tree-shaped PEs array can mitigate this issue but exponentially increases hardware complexity. The third feature limits the parallelism of the traditional coarse dataflow, such as level-scheduling and synchronization-free methods.

This work proposes an architecture with feedback-structured PEs to achieve adaptive and efficient data reuse. The compiler determines the reuse count by calculating the nodes indegree. The functionality of the PE aligns with the basic operations of the DAG tasks. This paper evaluates the performance of the architecture for SpTRSV. Given that the basic operations of SpTRSV are a cascade of multiplication and addition (see line 5 and line 7 in algo. 1, where subtraction and division can be transformed into addition and multiplication by the compiler), and that the coefficient matrix L𝐿Litalic_L and right-hand vector b𝑏bitalic_b are read-only, we design the architecture as shown in figure 2, in which PEs are assigned coarse nodes, L𝐿Litalic_L and b𝑏bitalic_b are fed to PEs through FIFOs, and the partial sums are reused through feedback structures. The architecture employs a novel dataflow, called the medium granularity dataflow, to overcome the parallelism limitations of the traditional coarse dataflow when handling consecutively dependent nodes.

Refer to caption

Figure 4: (a) Overview of the custom compiler. The spin arrow indicates updating the node with the right hand b𝑏bitalic_b. (b) The architecture of the proposed accelerator. It consists of 2Nsuperscript2𝑁\text{2}^{N}2 start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT compute units (CUs) connected by input and output interconnects, where N𝑁Nitalic_N is a hyperparameter. Stream memory stores the sparse matrix non-zeros (L𝐿Litalic_L) and right hands (b𝑏bitalic_b), sequentially supplying the data to CUs. Data memory stores the solution vector x𝑥xitalic_x.

III System Architecture

In real-world applications, a sparse triangular system is usually solved multiple times with the same coefficient matrix. We can preprocess the data structure before SpTRSV is executed because the preprocess time can be amortized. Figure 4 illustrates the system architecture, comprising a software compiler and a hardware accelerator. The compiler preprocesses a sparse triangular matrix based on the proposed dataflow, maps nodes to register files, and generates instructions. The accelerator executes these instructions to obtain the solution vector and stores it in the data memory.

III-A Compiler Overview

The compiler workflow is as follows: firstly, traverse the adjacency graph of the coefficient matrices and allocate nodes to PEs according to the topological order of the graph. Next, apply the medium granularity dataflow (section IV. A) and the partial sum caching mechanism (section IV. B) to establish the initial computation pattern for PEs. Then, without changing the order of coarse nodes computation and the access mode of the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files determined in the previous step, rearrange the calculation order of the edges within the nodes in each cycle (section IV. C) to enhance data reuse and reduce bank conflicts. Subsequently, determine the memory access requirements of the PEs, set constraints for nodes, and resolve bank conflicts by a greedy graph coloring algorithm. Finally, address the potential spilling issues of the register files and generate the instructions.

III-B Accelerator Architecture

The proposed accelerator achieves flexible dataflow by executing the instructions, as shown in figure 4 (b). It consists of multiple CUs, interconnects, and on-chip memories. The CUs are connected by input and output interconnects, which are implemented by the crossbars, enabling communication between them. The crossbar-based interconnects decouple the PEs and register files, minimizing bank conflicts. The on-chip memories include instruction memory, data memory, and stream memory. The instruction memory stores instructions generated by the compiler. The compiler is only re-executed and the instruction memory is updated when the structure of the coefficient matrix changes. The stream memory stores the values of the coefficient matrix and the right-hands, without storing their positional information (i.e., in which rows or columns). This is because the data in the stream memory is reordered by the compiler through the analysis of the established medium granularity dataflow, so that the positional information is hidden in the instructions. Consequently, during the running of the accelerator, data in the instruction memory and stream memory is accessed sequentially. The results of the accelerator are written to the data memory and may be read out when the register file spills.

out={(bipsum)×Lij(ct=0)psum+Lij×xi(ct=1)𝑜𝑢𝑡casessubscript𝑏𝑖𝑝𝑠𝑢𝑚subscript𝐿𝑖𝑗𝑐𝑡0𝑝𝑠𝑢𝑚subscript𝐿𝑖𝑗subscript𝑥𝑖𝑐𝑡1out=\left\{\begin{array}[]{l}\vspace{0.5em}\left(b_{i}-psum\right)\times L_{ij% }(ct=0)\\ psum+L_{ij}\times x_{i}(ct=1)\end{array}\right.italic_o italic_u italic_t = { start_ARRAY start_ROW start_CELL ( italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_p italic_s italic_u italic_m ) × italic_L start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( italic_c italic_t = 0 ) end_CELL end_ROW start_ROW start_CELL italic_p italic_s italic_u italic_m + italic_L start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT × italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_c italic_t = 1 ) end_CELL end_ROW end_ARRAY (2)

Each CU comprises a decoder, a control unit, a D type flip-flop (DFF), an xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT register file, a psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file, a PE, and several multiplexers. The PE consists of a cascaded 32-bit floating-point adder and multiplier, implementing the two fundamental operations of SpTRSV via a 1-bit control signal (ct𝑐𝑡ctitalic_c italic_t in figure 4 (b)). The functionality of SpTRSV and PE are described in equations 1 and 2, respectively. The division operation is performed by computing the reciprocal in the compiler and using multiplication in the hardware. The PE output can follow two paths: one as a partial sum, which is either written to the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file or fed back to the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m input of the PE, depending on whether the current node is blocked in the next cycle. When the node in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file is unblocked, the partial sum is read out and the address is released. This will be discussed in detail in section IV. B. When the PE starts computing a new node, the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m input is set to zero. The other path is the solution for the node, which is written through the output interconnect to the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT register file of any CU, or directly reused by other PEs through input and output interconnects.

Refer to caption

Figure 5: (a) The structure and length of the instruction. Assuming that there are 2Nsuperscript2𝑁\text{2}^{N}2 start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT CUs, with each CU having 2Msuperscript2𝑀\text{2}^{M}2 start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and 2Ksuperscript2𝐾\text{2}^{K}2 start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT words in the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files respectively. Each CU has an addressing depth of 2Tsuperscript2𝑇\text{2}^{T}2 start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT for the data memory. (b) Automatic write address generation for the register file and data memory.

The decoder translates instructions to obtain read addresses and read-write enable signals for the register files and data memory, as well as selection signals for input and output interconnects, and sends other data to the control unit. The structure of the instruction is shown in figure 5 (a). Write addresses for the register files and data memory are automatically generated by the control unit to reduce instruction length. The principle for generating write addresses is to always write data to the empty location with the lowest address. The method for generating write addresses is illustrated in figure 5 (b). For the register files, a 1-bit valid flag is set for each address to determine if it is free, and a priority encoder is used to generate the write address. Whether an address in the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT register file is released after being read is determined by a 1-bit signal (Rvsxisubscriptsuperscript𝑅subscript𝑥𝑖𝑣𝑠R^{x_{i}}_{vs}italic_R start_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_v italic_s end_POSTSUBSCRIPT) in the instruction. Data in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file is released once read out, as it is intermediate data and will not be reused by other PEs. For the data memory, write addresses are generated by a counter initialized to zero, incrementing with the write enable signal valid. This is because data written to the data memory is the final result required by the user and will not be overwritten.

All CUs in the accelerator share a single clock, and PEs achieve synchronized computation through DFFs. Thus, with a software-managed memory system, the compiler can fully predict the behavior of the hardware and the details of data stored in the register file and memory, to address the problem of bank conflicts. Data will be spilled from the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT register file if the number of nodes exceeds capacity. Given the execution schedule, a live-range analysis is performed to determine when the spilling is required. If the number of free addresses falls below a threshold, the data in the register file will be written to the data memory, or the address will be directly released if the data memory already holds the same data. The spilled data will be loaded back before it is supposed to be consumed. The compiler decides whether to insert nop cycles to perform the store and load operations by considering the port occupancy of the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT register file.

IV Custom Compiler Methodology

In this section, we introduce the medium granularity dataflow, the partial sum caching mechanism, and the intra-node edges computation reordering algorithm. The medium granularity dataflow combines the advantages of coarse and fine dataflow to enhance performance for SpTRSV or SpTRSV-like DAG tasks. The partial sum caching mechanism provides temporary storage to reduce PEs blocking frequency and shorten the critical path length. The intra-node edges computation reordering algorithm exploits the spatial locality to improve data reuse and alleviate constraints on banked register files.

IV-A Medium Granularity Dataflow

Refer to caption

Figure 6: Comparison of the coarse, fine, and this-work dataflows for an SpTRSV-like DAG, with nodes 1, 2, and 3 already computed. In the coarse and this-work dataflows, nodes 7, 8, and 9 are assigned to PE 1, PE 2, and PE 3, respectively. In the fine dataflow, the three PEs are connected in a tree topology [22], and the coarse DAG is converted to the fine DAG and mapped to the PEs.

In coarse dataflow, a node represents the minimal computational task. A PE or thread only starts computing a node after all its predecessors are completed. This method is typically used in CPUs and GPUs, where the task granularity must be large enough to offset thread management overheads (e.g., startup, termination, synchronization). However, real application DAGs often have many consecutively dependent nodes, resulting in nearly sequential computation (see the coarse dataflow example in figure 6). In fine dataflow, a coarse node is divided into multiple cascaded two-input fine nodes, each representing a multiplication or addition and mapped to a PE. This allows multiple PEs to compute a coarse node simultaneously. However, SpTRSV-like DAGs have two key characteristics: 1) the fundamental operations are cascaded multiplications and additions, needing two fine nodes per coarse node; 2) numerous coarse nodes with multiple inputs create many intermediate fine nodes. The two characteristics significantly increase node count and critical path length. Although a tree-shaped PE array can improve data reuse, the reuse depth is limited and many intermediate nodes still need to be written back to the register files for later use. Furthermore, the increased intermediate nodes also exacerbate bank conflicts.

The proposed dataflow combines the advantages of coarse and fine dataflows. Here, a coarse node is mapped to a PE, which processes any available edges immediately once its source node is completed, without waiting for all input edges. Thus, this method can be seen as coarse allocation with fine computation. The output of a PE is fed back to the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m input to reuse partial sums (orange box in figure 4 (b)). Upon node completion, the output is written to the register files and routed through the interconnects for other PEs. This feedback structure adjusts the depth of data reuse based on the number of edges for each node, ensuring the partial sum is always reused or stored in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file, thus minimizing access demands on the register files and interconnects. The medium granularity dataflow enables parallel computation between nodes, necessitating more synchronization than the coarse or fine dataflow. This is one of the reasons that we utilize synchronized PEs based on the VLIW architecture, because the asynchronous computational architecture incurs additional synchronization costs.

Figure 6 illustrates an example to demonstrate how the three dataflows handle the consecutively dependent nodes. For the coarse dataflow, a PE must ensure that all dependent nodes are completed before computing a node. This results in serial computation for the consecutively dependent nodes. In the given example, assuming no bank conflicts and that the L𝐿Litalic_L and b𝑏bitalic_b are previously stored in register files, the coarse dataflow consumes 12 cycles. For the fine dataflow, the original DAG is converted into a binary DAG and mapped onto tree-shaped PEs. In this example, the binary DAG is divided into 9 blocks. Considering the impact of register file access and pipeline structure, these 9 blocks consume 19 cycles. Since PEs in the fine dataflow perform a single logical operation while the coarse and this-work dataflows perform 2 logical operations. For a fair comparison, we assume the fine dataflow operates at twice the clock frequency of the others. Thus, the fine dataflow consumes 9.5 cycles. In the proposed dataflow, we retain the node allocation strategy of the coarse dataflow, assigning one coarse node to a PE, but do not treat it as a minimal computational task. The PEs will compute a node if one of its dependent nodes is completed. For the given example, this work dataflow consumes 8 cycles. In summary, the proposed dataflow demonstrates superior performance compared to both the coarse and fine dataflows.

IV-B Partial Sum Caching Mechanism

Refer to caption

Figure 7: An example to illustrate the deadlock.

Due to coarse node allocation, the medium granularity dataflow may experience blocking when a PE has no edges to compute for the current node. Blocking occurs in two scenarios: one where all nodes in the task list of the PE are blocked, and another where other nodes still have computable edges. The first scenario, related to the DAG structure and node allocation order, is rare and can be addressed by optimizing the dataflow, such as using a reconfigurable dataflow. For the more common second scenario, we propose a partial sum caching mechanism to mitigate the blocking. In our architecture, when blocking occurs, each CU uses a psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file to cache partial sums, then traverses the task list to find the first unblocked node and calculates it. The psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file is local and does not occupy interconnect resources. Nodes in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file must be prioritized to avoid potential deadlocks, as they are always parent nodes to others. Thus, in each cycle, if a node in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file is unblocked, the PE must pause the current task to compute this node, regardless of whether the current node is blocked.

Determining when to perform read/write operations on the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file is crucial. Firstly, without considering the capacity of the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file, the task of a PE for the current cycle is determined based on the above principles. If the PE is blocked (this blocking is from the first scenario), or the current node and the previous node are the same, or the previous node is completed and the current node is new, the PE will reuse the partial sum or set it to zero without accessing the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file. If the previous node is completed and the current node is old, the PE will read the partial sum from the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file and release the corresponding address. If the previous node is not completed and the current node is new, the PE will put the previous partial sum into the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file and set the current partial sum to zero. In this case, the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file must have at least two free addresses, or one free address if the current node is the first new node in the task list. Otherwise, the PE will be blocked to avoid the potential deadlock, as shown in figure 7. If the previous node is not completed and the current node is old, the PE will put the previous partial sum into the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file and read the current partial sum from it. This scenario does not need to consider the capacity of the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file, as it supports read-before-write operations. Figure 4 (a) shows a simple example to illustrate the partial sum caching mechanism, where PE 3 is blocked while computing node 5. In the third cycle, the PE stores node 5 into the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file and computes the next node (node 9) in its task list. Then, in the next cycle, node 5 is unblocked and read out for computation. Meanwhile, node 9 is stored in the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file.

IV-C Intra-node Edges Computation Reordering Algorithm

Refer to caption

Figure 8: Comparison of data reuse and memory access frequency with and without intra-node computation reordering.
Input :  C𝐶Citalic_C: A container with m𝑚mitalic_m sub-containers, each holding edges for computation in this cycle, where m𝑚absentm\leqitalic_m ≤ number of PEs.
Output :  M𝑀Mitalic_M: A key-value table mapping PE indices to the edges each PE computes in this cycle.
1
// Classify edges in C𝐶Citalic_C based on their source nodes. The R-value of a category indicates the number of edges in that category.
2 Rcompute_category(C)𝑅𝑐𝑜𝑚𝑝𝑢𝑡𝑒_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦𝐶R\leftarrow compute\_category(C)italic_R ← italic_c italic_o italic_m italic_p italic_u italic_t italic_e _ italic_c italic_a italic_t italic_e italic_g italic_o italic_r italic_y ( italic_C )
3 Mϕ𝑀italic-ϕM\leftarrow\phiitalic_M ← italic_ϕ, DC𝐷𝐶D\leftarrow Citalic_D ← italic_C
4 while not D.empty()formulae-sequence𝐷𝑒𝑚𝑝𝑡𝑦D.empty()italic_D . italic_e italic_m italic_p italic_t italic_y ( ) do
       // Find the category with the most edges in D𝐷Ditalic_D.
5       EVget_max_category(D)𝐸𝑉𝑔𝑒𝑡_𝑚𝑎𝑥_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦𝐷EV\leftarrow get\_max\_category(D)italic_E italic_V ← italic_g italic_e italic_t _ italic_m italic_a italic_x _ italic_c italic_a italic_t italic_e italic_g italic_o italic_r italic_y ( italic_D )
6       if EV.size()2formulae-sequence𝐸𝑉𝑠𝑖𝑧𝑒2EV.size()\geq 2italic_E italic_V . italic_s italic_i italic_z italic_e ( ) ≥ 2 then
             // Find the category with the minimum R-value in EV𝐸𝑉EVitalic_E italic_V .
7             Eget_min_R_category(R,EV)𝐸𝑔𝑒𝑡_𝑚𝑖𝑛_𝑅_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦𝑅𝐸𝑉E\leftarrow get\_min\_R\_category(R,EV)italic_E ← italic_g italic_e italic_t _ italic_m italic_i italic_n _ italic_R _ italic_c italic_a italic_t italic_e italic_g italic_o italic_r italic_y ( italic_R , italic_E italic_V )
8      else
9             EEV[0]𝐸𝐸𝑉delimited-[]0E\leftarrow EV[0]italic_E ← italic_E italic_V [ 0 ]
10       end if
      // Compute the PE index in E𝐸Eitalic_E and establish a key-value table.
11       M_iget_mapping(E)𝑀_𝑖𝑔𝑒𝑡_𝑚𝑎𝑝𝑝𝑖𝑛𝑔𝐸M\_i\leftarrow get\_mapping(E)italic_M _ italic_i ← italic_g italic_e italic_t _ italic_m italic_a italic_p italic_p italic_i italic_n italic_g ( italic_E )
12       M.add(M_i)formulae-sequence𝑀𝑎𝑑𝑑𝑀_𝑖M.add(M\_i)italic_M . italic_a italic_d italic_d ( italic_M _ italic_i )
       // Mark the sub-containers associated with each key in M_i𝑀_𝑖M\_iitalic_M _ italic_i and remove them in D𝐷Ditalic_D.
13       D_imark_sub_contains(D,M_i)𝐷_𝑖𝑚𝑎𝑟𝑘_𝑠𝑢𝑏_𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑠𝐷𝑀_𝑖D\_i\leftarrow mark\_sub\_contains(D,M\_i)italic_D _ italic_i ← italic_m italic_a italic_r italic_k _ italic_s italic_u italic_b _ italic_c italic_o italic_n italic_t italic_a italic_i italic_n italic_s ( italic_D , italic_M _ italic_i )
14       D.remove(D_i)formulae-sequence𝐷𝑟𝑒𝑚𝑜𝑣𝑒𝐷_𝑖D.remove(D\_i)italic_D . italic_r italic_e italic_m italic_o italic_v italic_e ( italic_D _ italic_i )
15 end while
return M𝑀Mitalic_M
Algorithm 2 Determine edges to be computed for PEs in a certain clock cycle

In each cycle, PEs may have multiple computable edges while processing the current node. Choosing which edge to compute does not affect the final result, but impacts the frequency of accessing the register banks. The traditional method selects the edge based on the ascending order of the source node ID. However, this can lead to repeated readouts of the same source node (the left subfigure of figure 8), as edges connected to it are scattered across different positions for other nodes. We consider edges with the same source node as similar edges and propose a reordering algorithm to group similar edges within the same cycle.

The intra-node edges computation reordering algorithm for each cycle involves the following four steps: firstly, classify similar edges and count the number in each category, denoted as the R-value (line 1 in algo. 2). Secondly, select the category with the highest count. In case of a tie, choose the category with the smallest R-value (lines 4-9). This maximizes reuse in the current cycle and ensures nodes can be reused in subsequent cycles, as illustrated in the right subfigure of figure 8, where PE 1 selects node 8 instead of node 7 in the second cycle, allowing node 7 to be reused in the next cycle. Thirdly, assign edges in the selected category to the corresponding PEs and remove all related edges (lines 10-13). Finally, repeat the second and third steps until all non-blocked PEs are assigned edges.

V Experiments

V-A Experimental Setup

The experiments compare this work with CPU, GPU, and DPU-v2 platforms. CPU: We use the Intel Math Kernel Library (MKL v2018.1) [39] on a Xeon4214 CPU at 2.2 GHz, utilizing the mkl_sparse_s_trsv𝑚𝑘𝑙_𝑠𝑝𝑎𝑟𝑠𝑒_𝑠_𝑡𝑟𝑠𝑣mkl\_sparse\_s\_trsvitalic_m italic_k italic_l _ italic_s italic_p italic_a italic_r italic_s italic_e _ italic_s _ italic_t italic_r italic_s italic_v function. Programs are compiled with GCC v9.5.0, using the -O3 optimization flag and OpenMP v4.0.7. GPU: We use the cuSPARSE Library (v2022.12) [40] on an RTX 3070 laptop GPU at 1.1 GHz, utilizing the cusparseSpSV_analysis𝑐𝑢𝑠𝑝𝑎𝑟𝑠𝑒𝑆𝑝𝑆𝑉_𝑎𝑛𝑎𝑙𝑦𝑠𝑖𝑠cusparseSpSV\_analysisitalic_c italic_u italic_s italic_p italic_a italic_r italic_s italic_e italic_S italic_p italic_S italic_V _ italic_a italic_n italic_a italic_l italic_y italic_s italic_i italic_s function and cusparseSpSV_solve𝑐𝑢𝑠𝑝𝑎𝑟𝑠𝑒𝑆𝑝𝑆𝑉_𝑠𝑜𝑙𝑣𝑒cusparseSpSV\_solveitalic_c italic_u italic_s italic_p italic_a italic_r italic_s italic_e italic_S italic_p italic_S italic_V _ italic_s italic_o italic_l italic_v italic_e function. The cusparseSpSV_analysis𝑐𝑢𝑠𝑝𝑎𝑟𝑠𝑒𝑆𝑝𝑆𝑉_𝑎𝑛𝑎𝑙𝑦𝑠𝑖𝑠cusparseSpSV\_analysisitalic_c italic_u italic_s italic_p italic_a italic_r italic_s italic_e italic_S italic_p italic_S italic_V _ italic_a italic_n italic_a italic_l italic_y italic_s italic_i italic_s function analyzes the matrix structure and generates a dataflow that matches the GPU architecture, serving as a preprocessing or compilation step for SpTRSV. Programs are compiled with CUDA v12.2. Data transfer time between the host and GPU is excluded. DPU-v2: The processor runs in its default configuration with 56 PEs and 64 register files [22]. For a fair comparison, our accelerator operates at 150 MHz, half the DPU-v2 clock frequency, because our PE connects the adder and multiplier in series while DPU-v2 connects in parallel. The accelerator is synthesized using TSMC 28 nm technology with 64 CUs. Each CU has xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files configured with 64 and 8 words. The data memory is configured with 8192 words, while both the instruction memory and stream memory are configured with 65536 words. The synthesis results are shown in table I. Our compiler is implemented in C++ 11. The compiler runtime environment for the GPU, DPU-v2, and this work is the same. We use 264 public benchmarks [41] from various domains, such as circuit simulation and power networks, to evaluate the accelerator. Table II provides the basic information of the partial benchmarks.

TABLE I: Synthesis Result of The Accelerator With 64 CUs.
Area Power
mm2superscriptmm2\text{mm}^{\text{2}}mm start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT % mW %
Datapath:
PEs 0.07 3.3 16.00 10.2
Fifos 0.16 7.7 28.22 18.1
Pipelining registers 0.02 0.8 6.85 4.4
Input interconnect 0.04 2.1 9.65 6.2
Output interconnect 0.04 2.1 8.36 5.3
Register file 0.28 13.1 29.86 19.1
Control:
Control units 0.02 0.9 5.41 3.5
Multiplexers 0.00 0.2 1.85 1.2
Data memory 0.11 5.4 7.07 4.5
Instruction memory 0.64 30.1 17.09 10.9
Stream memory 0.72 34.3 25.86 16.6
2.11 156.21
TABLE II: Basic Information of Partial Benchmarks
ID Name N NNZ Binary nodes
1 rajat19 1157 3956 6755
2 c-36 7479 12186 16893
3 bayer05 3268 26316 49364
4 HB_jagmesh4 1440 22600 43760
5 cz628 628 9123 17618
6 ex3 1821 30204 58587
7 fpga_dcop_01 1220 4303 7386
8 nnc1374 1374 17897 34420
9 add32 4960 14451 23942
10 bips98_606 7135 28759 50383
11 fpga_trans_01 1220 5371 9522
12 gemat12 4929 28415 51901
13 filter2D 1668 8647 15626
14 ex10hs 2548 30515 58482
15 ex10 2410 26853 51296
16 add20 2395 9867 17339
17 rajat04 1041 7625 14209
18 circuit204 1020 8008 14996
19 bayer07 3268 26316 49364
20 ACTIVSg2000 4000 42840 81680

Refer to caption

Figure 9: (a) Throughput comparison between fine and this work dataflows. (b) Comparison of total cycles with different psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file capacities. (c) Comparison of blocking cycles with different psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register file capacities. (d) Comparison of constraints with and without intra-node edges computation reordering (ICR). (e) Comparison of bank conflicts with and without intra-node edges computation reordering (ICR). (f) Comparison of data reuse with and without intra-node edges computation reordering (ICR).

Refer to caption

Figure 10: Instruction breakdown of the partial benchmarks.

Refer to caption

Figure 11: Throughput comparison of the partial benchmarks with CPU [39], GPU [40], and DPU-v2 [22].

V-B Throughput Comparison over Fine Dataflow

Figure 9 (a) compares the throughput of the fine and this work dataflows. This work dataflow does not utilize the partial sum caching mechanism. The coarse dataflow was not tested because this work dataflow enables the edges computation based on it, which ensures the performance at least equal to or better than the coarse dataflow. For a few benchmarks, such as matrices rajat19, cz628, and rajat04, the throughput of our proposed dataflow is lower than that of the fine dataflow. This is because these matrices have relatively few consecutively dependent nodes and few input edges per node, leading to the more frequent blocking in our dataflow. In contrast, the fine dataflow, while not significantly increasing intermediate nodes in the binary DAG, achieves better throughput by enhancing parallelism. However, for most benchmarks, the proposed dataflow demonstrates higher throughput and performance.

V-C Impact of Partial Sum Caching Mechanism

Figure 9 (b) and (c) illustrate the impact of the partial sum caching mechanism on reducing blocking cycles and overall cycles under different psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files capacities. Given the wide variance in benchmark computations, the data was normalized. The results indicate that the partial sum caching mechanism reduces blocking frequency and improves overall performance. It achieves optimization with small capacities because it always prioritizes cached nodes, allowing efficiently use of the psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files. When the blocking cycles no longer change with capacity, residual blocking is attributed to sparse structures and unresolved bank conflicts. For certain benchmarks, such as matrices rajat19 and c-36, we observe that while blocking cycles significantly decrease, the critical path is shortened less or even remains unchanged. This is because the proportion of blocking cycles is inherently small, or the reduced blocking cycles are not on the critical path, which is related to the sparse structure and coarse node allocation.

V-D ICR Optimization for Bank Conflicts and Data Reuse

Figure 9 (d), (e), and (f) compare the changes in constraints, bank conflicts, and data reuse before and after applying the intra-node edges computation reordering (ICR) algorithm. The ICR algorithm effectively reduces constraints and improves data reuse. Generally, reducing constraints can lower the bank conflicts because in the greedy graph coloring algorithm, fewer constraints mean more choices for nodes regarding banks allocation, thereby increasing the optimization space of these nodes and their directly or indirectly connected nodes. In few cases, such as the matrix add32, the ICR algorithm might make bank conflicts increase. This occurs because the ICR algorithm focuses on the optimal computation order at the current time without considering the global. Although this can reduce the overall number of constraints, it can alter the constraint graph structure, potentially increasing constraints for some nodes. Nonetheless, such instances are rare, and in most cases, the ICR algorithm results in positive optimization.

V-E Instruction Breakdown

Figure 10 shows the breakdown of instructions, where the execute instructions represent the PEs performing regular logical operations, the nop instructions indicate the PEs being blocked and the outputs are unchanged. Fewer nop instructions signify that the system is closer to achieving the theoretical peak throughput of the proposed dataflow. Thanks to the crossbar-based interconnects, our architecture achieves approximately 90% of the peak throughput on average. The nop instructions arise from four aspects: 1) the sparse structure of the DAG; 2) limited capacity of psum𝑝𝑠𝑢𝑚psumitalic_p italic_s italic_u italic_m register files; 3) load imbalance caused by coarse node allocation; 4) interconnect port conflicts. Except for the first point, other issues can be mitigated through compiler algorithm optimizations.

Refer to caption

Figure 12: Throughput comparison of 264 benchmarks with CPU, GPU, DPU-v2, and this work. Horizontal axis numbers represent the operations (binary nodes) of each benchmark.
TABLE III: Performance Comparison of 264 SuiteSparse Benchmarks [41] With Other Platforms
CPU GPU DPU-v2 This work
Technology (nm) 14 8 28 28
Frequency (MHz) 2200 1110 300 150
Peak throughput (GOPS) 3379.2 15974.4 16.8 19.2
Avg. throughput (GOPS) 0.5 0.6 2.6 6.5
Speedup 1×\times× 1.2×\times× 4.9×\times× 12.2×\times×
Power (W) >50 >50 0.109 0.156
Avg. energy efficiency (GOPS/W) <0.01 <0.01 23.7 41.6
Avg. compile time (s) - 0.02 105.90 0.03

V-F Performance Comparison with Other Platforms

Figure 11 shows the throughput comparison of this work with other platforms on partial benchmarks. The performance of our approach exceeds that of other platforms. From figure 9 (a), it is evident that for the matrix rajat04, the throughput of our dataflow is initially lower than DPU-v2. However, with the partial sum caching mechanism, the performance is improved and surpasses DPU-v2. Figure 12 compares the performance of our work against other platforms across 264 benchmarks with nodes ranging from 19 to 85,392. DPU-v2 fails to produce results for 7 benchmarks due to excessive compilation times (exceeding 300 minutes). These benchmarks are odepa400, olm1000, lung1, mhda416, extr1b, extr1, and ex22. The compilation time for DPU-v2 is significantly longer than that for our approach and the GPU. This is due to the coarse nodes with numerous inputs (matrices with many rows containing numerous non-zeros) greatly increase the number of nodes when converted into the binary nodes. The complexity of decomposing these binary nodes into blocks during DPU-v2 compilation is O(N2)𝑂superscript𝑁2O(N^{2})italic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), where N𝑁Nitalic_N is the number of binary nodes, resulting in a time-consuming process. Although the compilation time can be amortized over multiple SpTRSV executions, it still affects overall performance. Longer compilation time necessitate more SpTRSV executions to realize performance benefits. The CPU exhibits low average throughput as SpTRSV disrupts both temporal and spatial locality, rendering SIMD functions and hardware-managed cache mechanisms ineffective. The GPU, despite its higher peak throughput than the CPU, performs worse on smaller benchmarks because limited cache capacity leading to more cache misses. However, as node size increases, the GPU gains a performance edge from its numerous CUDA cores. Table III summarizes key metrics of our work and other platforms. Overall, our approach achieves average performance improvements of 12.2×\times× (up to 874.5×\times×) and 10.1×\times× (up to 740.4×\times×) compared to the CPU and GPU, respectively. Compared to DPU-v2, our work demonstrates an average performance improvement of 2.5×\times× (up to 5.9×\times×) and an energy efficiency improvement of 1.8×\times× (up to 4.1×\times×). Furthermore, the throughput of our work shows good scalability.

VI Summary and Future Works

In this paper, we propose a novel hardware accelerator for sparse triangular solve. The accelerator employs a medium granularity dataflow that leverages coarse node allocation and fine edge computation. This hybrid granularity approach ensures both parallelism and spatial locality. We also introduce a partial sum caching mechanism to reduce the blocking frequency of processing elements and an intra-node edges computation reordering algorithm to enhance data reuse. Experimental results demonstrate that our method outperforms the current state-of-the-art techniques. Future works will focus on optimizing load balance issues caused by coarse node allocation and enhancing architecture scalability through an efficient interconnect design. The steps in the compiler, such as allocating nodes to PEs and determining the dataflow, are not independent and should be considered holistically to achieve optimal performance. Introducing artificial intelligence may yield satisfactory results.

References

  • [1] T. A. Davis, Direct methods for sparse linear systems.   SIAM, 2006.
  • [2] T. Wang, W. Li, H. Pei, Y. Sun, Z. Jin, and W. Liu, “Accelerating sparse lu factorization with density-aware adaptive matrix multiplication for circuit simulation,” in 2023 60th ACM/IEEE Design Automation Conference (DAC).   IEEE, 2023, pp. 1–6.
  • [3] K. He, S. X.-D. Tan, H. Wang, and G. Shi, “GPU-accelerated parallel sparse LU factorization method for fast circuit analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140–1150, 2015.
  • [4] W.-K. Lee, R. Achar, and M. S. Nakhla, “Dynamic GPU parallel sparse LU factorization for fast circuit simulation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 11, pp. 2518–2529, 2018.
  • [5] Y. Saad, Iterative methods for sparse linear systems.   SIAM, 2003.
  • [6] H. Anzt, E. Chow, and J. Dongarra, “Iterative sparse triangular solves for preconditioning,” in Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings 21.   Springer, 2015, pp. 650–661.
  • [7] N. Jao, A. K. Ramanathan, J. Sampson, and V. Narayanan, “Sparse Vector-Matrix Multiplication Acceleration in Diode-Selected Crossbars,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 12, pp. 2186–2196, 2021.
  • [8] E. B. Tavakoli, M. Riera, M. H. Quraishi, and F. Ren, “FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson’s Algorithm on FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2024.
  • [9] T. Yu and M. D. Wong, “PGT_SOLVER: An efficient solver for power grid transient analysis,” in 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).   IEEE, 2012, pp. 647–652.
  • [10] J. Yang, Z. Li, Y. Cai, and Q. Zhou, “PowerRush: An efficient simulator for static power grid analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 10, pp. 2103–2116, 2013.
  • [11] R. Achar, M. S. Nakhla, H. S. Dhindsa, A. R. Sridhar, D. Paul, and N. M. Nakhla, “Parallel and scalable transient simulator for power grids via waveform relaxation (PTS-PWR),” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 2, pp. 319–332, 2009.
  • [12] M. Freire, J. Ferrand, F. Seveso, E. Dufrechou, and P. Ezzatti, “A GPU method for the analysis stage of the SPTRSV kernel,” The Journal of Supercomputing, vol. 79, no. 13, pp. 15 051–15 078, 2023.
  • [13] E. Anderson and Y. Saad, “Solving sparse triangular linear systems on parallel computers,” International Journal of High Speed Computing, vol. 1, no. 01, pp. 73–95, 1989.
  • [14] J. H. Saltz, “Aggregation methods for solving sparse triangular systems on multiprocessors,” SIAM journal on scientific and statistical computing, vol. 11, no. 1, pp. 123–144, 1990.
  • [15] W. Liu, A. Li, J. Hogg, I. S. Duff, and B. Vinter, “A synchronization-free algorithm for parallel sparse triangular solves,” in Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings 22.   Springer, 2016, pp. 617–630.
  • [16] W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter, “Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides,” Concurrency and Computation: Practice and Experience, vol. 29, no. 21, p. e4244, 2017.
  • [17] N. Ahmad, B. Yilmaz, and D. Unat, “A split execution model for sptrsv,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 11, pp. 2809–2822, 2021.
  • [18] Z. Lu, Y. Niu, and W. Liu, “Efficient block algorithms for parallel sparse triangular solve,” in Proceedings of the 49th International Conference on Parallel Processing, 2020, pp. 1–11.
  • [19] Z. Lu and W. Liu, “TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs,” CCF Transactions on High Performance Computing, vol. 5, no. 2, pp. 129–143, 2023.
  • [20] B. Yılmaz and A. F. Yıldız, “A Graph Transformation Strategy for Optimizing SpTRSV,” arXiv preprint arXiv:2206.05843, 2022.
  • [21] B. Yılmaz, “A novel graph transformation strategy for optimizing SpTRSV on CPUs,” Concurrency and Computation: Practice and Experience, vol. 35, no. 24, p. e7761, 2023.
  • [22] N. Shah, W. Meert, and M. Verhelst, “DPU-v2: Energy-efficient execution of irregular directed acyclic graphs,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2022, pp. 1288–1307.
  • [23] N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert, and M. Verhelst, “DPU: DAG processing unit for irregular graphs with precision-scalable posit arithmetic in 28 nm,” IEEE Journal of Solid-State Circuits, vol. 57, no. 8, pp. 2586–2596, 2021.
  • [24] N. Shah, W. Meert, and M. Verhelst, Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra.   Springer Nature, 2023.
  • [25] D. Langr and P. Tvrdik, “Evaluation criteria for sparse matrix storage formats,” IEEE Transactions on parallel and distributed systems, vol. 27, no. 2, pp. 428–440, 2015.
  • [26] F. Favaro, E. Dufrechou, P. Ezzatti, and J. P. Oliver, “Exploring FPGA optimizations to compute sparse numerical linear algebra kernels,” in Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1–3, 2020, Proceedings 16.   Springer, 2020, pp. 258–268.
  • [27] Z. He, L. Song, R. F. Lucas, and J. Cong, “LevelST: Stream-based Accelerator for Sparse Triangular Solver,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024, pp. 67–77.
  • [28] K. Lu, Z. Li, L. Liu, J. Wang, S. Yin, and S. Wei, “Redesk: A reconfigurable dataflow engine for sparse kernels on heterogeneous platforms,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).   IEEE, 2019, pp. 1–8.
  • [29] S. Li, D. Liu, and W. Liu, “Optimized data reuse via reordering for sparse matrix-vector multiplication on fpgas,” in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).   IEEE, 2021, pp. 1–9.
  • [30] B. Liu and D. Liu, “Towards high-bandwidth-Utilization SpMV on FPGAs via partial vector duplication,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023, pp. 33–38.
  • [31] E. Yi, Y. Duan, Y. Bai, K. Zhao, Z. Jin, and W. Liu, “Cuper: Customized Dataflow and Perceptual Decoding for Sparse Matrix-Vector Multiplication on HBM-Equipped FPGAs,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2024, pp. 1–6.
  • [32] A. Capitanio, N. Dutt, and A. Nicolau, “Partitioned register files for VLIWs: A preliminary analysis of tradeoffs,” ACM SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 292–300, 1992.
  • [33] J. A. Fisher, P. Faraboschi, and C. Young, Embedded computing: a VLIW approach to architecture, compilers and tools.   Elsevier, 2005.
  • [34] A. K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, “An FPGA-based VLIW processor with custom hardware execution,” in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, 2005, pp. 107–117.
  • [35] M. Lam, “Software pipelining: An effective scheduling technique for VLIW machines,” in Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, 1988, pp. 318–328.
  • [36] D. Sabena, M. S. Reorda, and L. Sterpone, “On the automatic generation of optimized software-based self-test programs for VLIW processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 4, pp. 813–823, 2013.
  • [37] H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).   IEEE, 2011, pp. 689–690.
  • [38] Y. Choi, A. Vergari, and G. Van den Broeck, “Probabilistic circuits: A unifying framework for tractable probabilistic models,” UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, p. 6, 2020.
  • [39] Intel, “Intel Math Kernel Library,” https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html, 2018.
  • [40] Nvidia, “cuSPARSE Library,” https://docs.nvidia.com/cuda/archive/12.2.1/cuda-toolkit-release-notes/index.html, 2022.
  • [41] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1–25, 2011.