
SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow

Published: 15 January 2024

Abstract

Deep learning has become a highly popular research field, and deep learning algorithms previously ran primarily on CPUs and GPUs. However, with the rapid development of deep learning, it became clear that existing processors could not meet its large-scale computing requirements, and custom deep learning accelerators have become popular. The majority of the primary workloads in deep learning are general matrix-matrix multiplications (GEMMs), and emerging GEMMs are highly sparse and irregular. The TPU and SIGMA are representative recent GEMM accelerators, but the TPU does not support sparsity, and both the TPU and SIGMA suffer from insufficient utilization of their Processing Elements (PEs). We design and implement SparGD, a sparse GEMM accelerator with dynamic dataflow. SparGD has specialized PE structures, flexible distribution and reduction networks, and a simple dataflow switching module. When running sparse and irregular GEMMs, SparGD maintains high PE utilization while exploiting sparsity, and it can switch to the optimal dataflow according to the computing environment. For sparse, irregular GEMMs, our experimental results show that SparGD outperforms systolic arrays by 30 times and SIGMA by 3.6 times.

1 Introduction

In recent years, deep learning has become very popular. Deep learning models are widely used in several important fields, including data mining [23], machine translation [37], recommendation [1], natural language processing [25], and search technology. Before dedicated neural network processors appeared, deep learning algorithms ran mainly on CPUs and GPUs. The explosion of big data applications has propelled the development of deep learning, but it also poses serious challenges to the data processing speed and scalability of traditional computer systems [24]. Traditional Von Neumann computer architectures are relatively inflexible, with separate processing and data storage components. Frequent data movement between traditional processors and off-chip memory limits system performance and energy efficiency. Hardware accelerators are customized with flexible architectures to run tensor computations efficiently for machine learning models [34]. Deep learning accelerators are usually composed of a large number of highly parallel computing and storage units, which can accelerate the computing tasks in deep learning. Therefore, the design of dedicated deep learning chips has been gaining momentum.
The core computational task of most deep learning models in training and inference is the General Matrix-matrix Multiplication (GEMM). Accelerating the GEMM has therefore become a major goal of hardware accelerator design. State-of-the-art GEMM accelerators include Google’s TPU and SIGMA [17, 27]. The TPU uses a systolic array (SA) as its hardware structure. The SA is a two-dimensional array composed of Processing Elements (PEs), and data flows only between PEs. The SA reduces the exchange of data with the global cache and shortens data loading time, so it can reduce energy consumption and speed up the GEMM [35, 36]. SAs can efficiently compute dense GEMMs but cannot take advantage of sparsity. Compared to accelerators that support sparsity, the TPU incurs additional computing time and energy consumption when running sparse GEMMs. SIGMA is a state-of-the-art sparse GEMM accelerator that uses the Bitmap format for sparse data encoding and the Benes network for data routing. It proposes a FAN tree for accumulating irregular data, whose accumulation time is O(\(\log_2 N\)). The PE array utilization of the stationary matrix in SIGMA is high. However, SIGMA does not fully exploit the sparsity of the streaming matrix, so its PE array utilization of the streaming matrix is insufficient.
The SA supports the weight stationary (WS) dataflow, input stationary (IS) dataflow, and output stationary (OS) dataflow [28]. We find that the different dataflows of the SA perform differently in different computing environments: when the weight matrix is small, WS performs best, and when the input matrix is small, IS performs best. SIGMA has three dataflows: “N-sta, M-str,” “N-str, M-sta,” and “N-str, M-str” [27]. These dataflows also perform differently in different computing environments. No single dataflow achieves the highest performance in every computing environment, so choosing the dataflow according to the computing environment can improve processor performance. However, the TPU and SIGMA cannot switch dataflows according to the computing environment.
We propose SparGD, which supports sparse GEMM computation and dynamic dataflow. SparGD has a simpler and more efficient architecture than SIGMA. It can maintain high PE utilization while taking advantage of the sparsity. SparGD uses the Pipelined Adder Tree (PAT), which can achieve O(1) accumulation time for irregular data in continuous computation. SparGD can switch different dataflows in different computing environments, and its performance is always higher than that of a single dataflow processor in various computing environments.
In summary, the contributions of this article are as follows:
We design the architecture of SparGD to support sparse and irregular GEMM computations, and the PE utilization of both the streaming matrix and the stationary matrix can achieve close to 100%.
We propose a flexible distribution network, the PE Bus, and a pipeline reduction network, the PAT. The time of the PE Bus for data loading is O(1), and the time of the PAT for accumulating irregular data is O(1).
We analyze the performance of different dataflows in the SA, SIGMA, and SparGD. We find that their performance is different in different computing environments.
We design SparGD to support dynamic dataflow and achieve the highest performance in any computing environment by switching the dataflow during computation.
Experimental results show that the performance of SparGD is more than 30 times higher than the TPU and 3.6 times higher than SIGMA. Considering the improvement of the power efficiency and area efficiency, the hardware overheads of SparGD are acceptable.

2 Background and Motivation

2.1 Deep Learning Workload Characteristics

2.1.1 GEMM.

The Convolutional Neural Network (CNN) is one of the most successful deep learning algorithms [11]. The CNN usually includes convolutional layers, ReLU layers, pooling layers, and fully connected layers. The main workloads of the CNN are the convolution operations of the convolutional layers and the matrix multiplications of the fully connected layers. At present, neural network processors usually convert the convolution operation into matrix multiplication with a method such as the im2col algorithm. Therefore, the GEMM is the core computing operation of deep learning training and inference. In particular, GEMM operations can account for more than 70% of the computing cycles [27]. Thus, accelerating the GEMM is the main goal of hardware acceleration.
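To make the conversion concrete, the following Python sketch (our own illustration with NumPy, not the accelerator's implementation; the function names, shapes, and the stride-1/no-padding setting are assumptions) lowers a single-channel convolution to a GEMM via an im2col-style transformation:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a single-channel feature map x (H x W) into a matrix whose
    columns are the flattened kh x kw patches (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv_as_gemm(x, k):
    """Convolution expressed as a GEMM: (1 x kh*kw) @ (kh*kw x out_h*out_w)."""
    kh, kw = k.shape
    out = k.ravel() @ im2col(x, kh, kw)   # this matrix product is the GEMM
    return out.reshape(x.shape[0] - kh + 1, x.shape[1] - kw + 1)

# Check against a direct sliding-window convolution.
x, k = np.random.randn(6, 6), np.random.randn(3, 3)
ref = np.array([[(x[i:i + 3, j:j + 3] * k).sum() for j in range(4)] for i in range(4)])
assert np.allclose(conv_as_gemm(x, k), ref)
```

With multiple input channels and filters, the same idea turns the entire convolutional layer into one large GEMM, which is why GEMMs dominate the workload.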

2.1.2 Sparsity in Deep Learning Workloads.

Tensors in deep learning are often sparse. Multiple factors introduce sparsity into the tensors of deep learning models.
The CNN uses the ReLU activation function to turn negative values into zeros [18]. The sparsity of input activations can reach 40% in the CNN [20]. Max pooling also amplifies the sparsity [3]. Neural networks use dropout layers to avoid overfitting. With dropout, only a portion of the activations is kept, which leads to sparsity as well [30].
The weight pruning technique removes unimportant weights. The widely used pruning algorithms introduce significant sparsity. For example, more than 60% of weights in the convolutional layer and 90% of weights in the fully connected layer can be removed [14].
Pruning of input activations also leads to sparsity [2]. The MASR reconstructs batch normalization [12], achieving about 60% input activation sparsity for RNNs. For attention-based NLP models, SpAtten prunes unimportant tokens and heads [32]. It reports that computation and DRAM accesses can be reduced by up to 3.8 times and 1.1 times, respectively, while maintaining model accuracy.
GANs use transposed convolutions in their generator networks, where the input data is first upsampled by inserting zeros between values. For the transposed convolution layers in GANs, there is 60% sparsity on average [38].
The design of deep learning accelerators needs to take into account the sparsity of tensors. Accelerators taking advantage of sparsity can eliminate inefficient computations and improve performance. By processing only operations involving non-zero values, the execution time and energy consumption of computations can be reduced. Meanwhile, by storing only non-zero values, the memory requirements can be reduced, reducing both on-chip and off-chip memory access counts [16].

2.2 Typical GEMM Accelerators

2.2.1 TPU.

Google’s TPU is designed for data center applications [17]. The main computing component of the TPU is the Matrix Multiply Unit, a systolic array composed of 256 \(\times\) 256 computing units. The TPU passes data between the computing units so that each datum is processed multiple times, which significantly reduces I/O operations. The TPU applies the weight stationary dataflow: the weight matrices are fixed in the systolic array, and the input matrices are transmitted and processed between the computing units in a certain order. Both the input matrix and the partial sums are passed between the systolic array computing units. The calculated results are streamed out of the systolic array and stored in the result accumulator.
The systolic array is efficient for dense GEMMs but not for sparse GEMMs. When computing a sparse matrix, the systolic array sends zeros into the multiplier for multiplication, resulting in additional computing time and energy consumption.

2.2.2 SIGMA.

SIGMA is a GEMM accelerator that supports sparsity [27]. The basic building block in the SIGMA architecture is a processor called the Flexible Dot Product Engine (Flex-DPE). Several Flex-DPEs are connected via a simple NoC to create the full SIGMA compute fabric, and each GEMM operation uses a contiguous set of Flex-DPEs. SIGMA uses the Benes network for flexibility in data loading. The Benes is an N-input N-output multi-level non-blocking network with 2log(N)+1 stages, each containing N tiny 2 \(\times\) 2 switches. The Benes allows communication between any source and any destination without contention [4], and its data communication time is O(1). SIGMA uses the Bitmap scheme to encode sparse data and supports the calculation of sparse matrices without decompression. For sparse irregular workloads, the performance of SIGMA is 5.7x higher than the TPU. Although SIGMA supports sparsity well, it suffers from insufficient PE utilization for the streaming matrix, which is discussed in detail in Section 2.3.

2.3 Inefficiency of the TPU and SIGMA

In this section, we map a sparse irregular GEMM to the systolic array, SIGMA, and our proposed SparGD, respectively. Figure 1(a) shows a sparse irregular GEMM. The MK matrix is streaming and the KN matrix is stationary.
Fig. 1. The mapping of the sparse irregular GEMM in the systolic array, SIGMA, and SparGD.
Figure 1(b) shows the mapping of the sparse and irregular GEMM in the systolic array. The systolic array has 16 PEs. Due to the rigid structure of the systolic array, only half of the PEs can be used, and zeros need to be filled into the systolic array as well. Only half of the KN matrix can be calculated each time. Once the calculation of the streaming matrix is complete, the second half of the KN matrix must be loaded and calculated again, which results in poor PE utilization and performance.
Similar to the discussion in the SIGMA paper [27], for SIGMA and SparGD, we focus on two different types of PE utilization: StrUtil, the PE array utilization of the streaming matrix, and StaUtil, the PE array utilization of the stationary matrix. Equations (1) and (2) define StaUtil and StrUtil in the SA. Equations (3) and (4) define StaUtil and StrUtil in SIGMA. In these equations, M, N, and K are the matrix dimensions in Figure 1(a), \(N_{nz}\) is the number of non-zero values, \(N_{pe}\) is the number of PEs, and Scale is the scale of the PE array, which is 4 in Figure 1. Equations (5) and (6) define StaUtil and StrUtil in SparGD. In Equation (6), L is the length of the streaming matrix after element shifting.
\begin{equation} SA\_StaUtil=\frac{N_{nz} }{\frac{N}{Scale}\times N_{pe} } \end{equation}
(1)
\begin{equation} SA\_StrUtil=\frac{N_{nz} }{ N_{pe} } \end{equation}
(2)
\begin{equation} SIGMA\_StaUtil=\frac{N_{nz} }{\lceil \frac{N_{nz}}{N_{pe} } \rceil \times N_{pe}} \end{equation}
(3)
\begin{equation} SIGMA\_StrUtil=\frac{N_{nz} }{M\times K} \end{equation}
(4)
\begin{equation} SparGD\_StaUtil=\frac{N_{nz} }{\lceil \frac{N_{nz}}{N_{pe} } \rceil \times N_{pe}} \end{equation}
(5)
\begin{equation} SparGD\_StrUtil=\frac{N_{nz} }{L\times K} \end{equation}
(6)
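To make Equations (1) through (6) concrete, the sketch below (our own illustration; the non-zero counts and the shifted length L are made-up example numbers, not the exact matrices of Figure 1) evaluates the utilization formulas for a 4 \(\times\) 4 \(\times\) 4 GEMM on a 16-PE array:

```python
from math import ceil

def sa_str_util(n_nz, n_pe):
    # Equation (2): the SA must also map zeros, so only n_nz of n_pe slots are useful
    return n_nz / n_pe

def sigma_sta_util(n_nz, n_pe):
    # Equations (3) and (5): identical for SIGMA and SparGD
    return n_nz / (ceil(n_nz / n_pe) * n_pe)

def sigma_str_util(n_nz, M, K):
    # Equation (4): SIGMA streams the full (uncompressed) M x K matrix
    return n_nz / (M * K)

def spargd_str_util(n_nz, L, K):
    # Equation (6): L is the streaming-matrix length after element shifting (L <= M)
    return n_nz / (L * K)

# Example: 16 non-zeros in the stationary matrix, 9 non-zeros in the 4 x 4
# streaming matrix, and a shifted length of L = 3.
print(sigma_sta_util(16, 16))     # 1.0    -> StaUtil can reach 100%
print(sigma_str_util(9, 4, 4))    # 0.5625 -> SIGMA StrUtil suffers
print(spargd_str_util(9, 3, 4))   # 0.75   -> element shifting raises SparGD StrUtil
```

The exact percentages quoted for Figure 1 depend on the specific matrices drawn there; the sketch only illustrates how the formulas behave.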
Figure 1(c) shows the mapping of the sparse irregular GEMM in SIGMA, which includes 16 PEs. Due to the flexible distribution and reduction network, SIGMA maps only non-zero elements of the stationary matrix, so StaUtil of SIGMA can reach 100%. However, when the streaming matrix enters the array, StrUtil becomes insufficient. In Figure 1(c), elements “c” and “0” are in the same column. When “c” enters the array, “0” does not, so element “c” enters the array alone. Since “c” only uses part of the PEs, the other PEs are idle. In Figure 1(c) (Cycle 2), the PEs containing the values (A, B, C, D, E, F, G) are used by “c,” so they are useful, while the PEs containing the values (H, I, J, K, L, M, N, O, P) are not used by any element, so they are idle. Similarly, when “d” enters the array, the PEs containing the values (A, B, C, D, E, F, G) are useful, while the PEs containing the values (H, I, J, K, L, M, N, O, P) are idle. Therefore, StrUtil of SIGMA is insufficient (only 56% in this example).
Figure 1(d) shows the mapping of sparse irregular GEMM in SparGD, which includes 16 PEs. Similar to SIGMA, SparGD only maps non-zero elements of the stationary matrix to the PEs. StaUtil of SparGD can reach 100%. Unlike SIGMA, the streaming matrix in SparGD is shifted. When the streaming matrix flows in, the zero elements are skipped, and the nearest non-zero element flows into the PE. In Figure 1(d), when element “c” enters the array, element “e” enters the array simultaneously. Similarly, when element “d” enters the array, element “f” enters the array simultaneously. In this way, all PEs are effectively utilized. Therefore, StrUtil of SparGD is 90% in this example, which is much higher than that of SIGMA.

3 Analysis of Dataflows in Popular Acceleration Architecture

3.1 Introduction of Dataflows

The SA typically uses three dataflows: the WS dataflow, IS dataflow, and OS dataflow [28]. The difference between the three dataflows is that they fix different elements to the PEs of the systolic array. The direction of element transmission is also different among different dataflows. Figure 2 shows the three dataflows. We refer to the OS in the systolic array as “SA-OS,” the WS as “SA-WS,” and the IS as “SA-IS.” In this article, in a GEMM with the size “M-K-N,” the matrix with dimensions “M-K” is the input matrix, and the matrix with dimensions “K-N” is the weight matrix.
Fig. 2. The dataflows in the systolic array. (a) Example of the GEMM. (b) SA-OS. (c) SA-IS. (d) SA-WS.
SIGMA supports three types of dataflows: “N-sta, M-str,” “N-str, M-sta,” and “N-str, M-str” [27]. “N-sta, M-str” is similar to the WS dataflow in the systolic array, while “N-str, M-sta” is similar to the IS in the systolic array. In the following text, we refer to “N-sta, M-str” as “SIGMA-WS” and “N-str, M-sta” as “SIGMA-IS.” As the third type of dataflow, “N-str, M-str,” has been proven to always perform worse than the other two types in [27], we don’t discuss it in this section.

3.1.1 SA-OS.

In the SA-OS, the partial sum is fixed in the PEs of the SA [28]. As shown in Figure 2(b), before the calculation, the PE array is empty. During the calculation, the elements of the weight matrix are streamed into the top side of the PE array in order. The elements of the input matrix are streamed into the left side of the PE array in order. After being processed by PEs, the elements of the weight matrix and input matrix are passed down or right, respectively. Partial sums generated during the calculation are retained in the PEs and not passed between PEs. When all matrix elements have been processed, the matrix multiplication is completed.

3.1.2 SA-IS.

The SA-IS fixes the input matrix in the PEs of the SA [28]. Figure 2(c) shows the SA-IS. Before the computation, the elements of the input matrix are preloaded into the PEs of the SA, and these elements do not change during the computation. During the computation, the elements of the weight matrix are streamed from the top side into the PE array. After being processed by the PEs, the partial sums are generated. Then the elements of the weight matrix are passed down, while the partial sums are passed right. The final result is streamed out of the PE array.

3.1.3 SA-WS.

The SA-WS fixes the weight matrix in the PEs of the SA [28]. Figure 2(d) shows the SA-WS. Before the computation, the elements of the weight matrix are preloaded into the PEs. During the computation, the elements of the input matrix are streamed into the PE array from the left side. The partial sums are generated after the PEs process the input matrix elements, which are passed down to the lower PEs. After being processed, the input matrix elements are passed right again. The final result is streamed out of the PE array.

3.1.4 SIGMA-WS.

Similar to the SA-WS, the weight matrix in the SIGMA-WS is fixed in the PEs of SIGMA, and the input matrix is streamed into the PEs. As shown in Figure 3(b), the weight matrix is preloaded into the PEs. The fixed weight data are non-zero values. However, not all input data flowing into the PE array are non-zeros. The input data is broadcast to the PEs via a distribution network rather than being passed between PEs. After multiplying the input data with the weight data in the PEs, the multiplication results are accumulated through a reduction network. The final calculation result is generated and sent out from the PE array through the reduction network.
Fig. 3. The dataflows in SIGMA. (a) Example of the GEMM. (b) SIGMA-IS. (c) SIGMA-WS.

3.1.5 SIGMA-IS.

In contrast to SIGMA-WS, SIGMA-IS fixes the input matrix in the PEs of SIGMA. As shown in Figure 3(c), non-zero values in the input matrix are preloaded into the PEs, and uncompressed weight matrices flow into the PE array. The weight data is broadcast to the PEs through the distribution network. After the weight data is processed with the input data in the PEs, the multiplication results are accumulated, and the final calculation result is generated in the reduction network and sent out.

3.2 Performance Analysis of Dataflows

3.2.1 Dataflow Analysis in the Systolic Array.

When performing large-size matrix calculations on a small-size systolic array, the large matrix is partitioned into small blocks for computation. Because of how partitioned matrices are multiplied, the input and weight matrices are reused a different number of times when their sizes differ [31]. For example, in an 8 \(\times\) 8 SA, when computing the multiplication of a 16 \(\times\) 8 input matrix and an 8 \(\times\) 8 weight matrix, the input matrix is divided into two 8 \(\times\) 8 blocks and each block is used once, while the weight matrix consists of a single 8 \(\times\) 8 block that is used twice. If the SA-WS dataflow is used in this example, the SA reuses the weight matrix block twice. If the SA-IS dataflow is used, the SA does not reuse any matrix block. Different dataflows thus yield different amounts of data reuse for different matrix sizes, which affects the performance of the SA. We analyze the performance of different dataflows by running matrix multiplications on three SA processors of size 8 \(\times\) 8.
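As a rough sanity check of this reuse argument, the sketch below (our own simplified cost model, not the paper's simulator) counts how many times a stationary tile must be preloaded under WS versus IS blocking; treating the per-tile preload cost as the dominant overhead is an assumption:

```python
import math

def tile_preloads(M, K, N, pe=8):
    """Count stationary-tile preloads when an (M x K) x (K x N) GEMM is
    blocked onto a pe x pe array. WS keeps K x N tiles resident and streams
    the input; IS keeps M x K tiles resident and streams the weights."""
    m_t, k_t, n_t = (math.ceil(d / pe) for d in (M, K, N))
    ws_preloads = k_t * n_t      # each weight tile loaded once, reused m_t times
    is_preloads = m_t * k_t      # each input tile loaded once, reused n_t times
    return ws_preloads, is_preloads

# Input (16 x 8) larger than weight (8 x 8): WS preloads fewer tiles.
print(tile_preloads(16, 8, 8))   # (1, 2) -> SA-WS preferable
# Weight (8 x 16) larger than input (8 x 8): IS preloads fewer tiles.
print(tile_preloads(8, 8, 16))   # (2, 1) -> SA-IS preferable
```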
When the SA-WS and SA-IS preload the “fixed” data, additional time overhead is required. There is no additional preload overhead for the SA-OS, but there is a small amount of time overhead for outputting results in the SA-OS.
Figure 4 shows the performance of the dataflows in the SA. When the input matrix size is larger than the weight matrix size, the SA reuses input matrix blocks less and the weight matrix blocks more, resulting in the smallest computation time for the SA-WS dataflow. Replacing a “fixed” matrix requires additional time. In the SA-WS, if the SA reuses weight matrix blocks more, it needs less time to replace the “fixed” matrix block in the SA. Therefore, there is low additional computation time in the SA-WS in this case. This time is less than the additional time for the SA-OS to output results. The calculation methods for sparse and dense matrices are the same in the SA, so the sparsity does not affect this trend.
Fig. 4. The performance of the dataflows in the SA.
When the input matrix size is smaller than the weight matrix size, the SA reuses the input matrix blocks more and the weight matrix blocks less, resulting in the smallest computation time for the SA-IS. In this case, the time for the SA-IS to replace the “fixed” matrix block is less than the additional time for the SA-OS to output results. In the SA-IS, if the SA reuses input matrix blocks more, less additional computation time is required. Therefore, the SA-IS performs best in this case. In our experiment, the performance of the SA-OS lies between the SA-WS and SA-IS, so it cannot outperform both the SA-WS and SA-IS simultaneously. The sparsity of the data does not affect this trend.
In summary, the SA-IS dataflow is more suitable for computing environments with smaller input matrix sizes, while the SA-WS dataflow is more suitable for computing environments with smaller weight matrix sizes.

3.2.2 Dataflow Analysis in SIGMA.

Unlike the SA, SIGMA supports sparsity. Both the sparsity and the matrix size can affect the performance of dataflows. In this section, matrix multiplications of different sizes and sparsities are run on an 8 \(\times\) 8 SIGMA to analyze the performance of different dataflows.
Dense data. When processing dense data in SIGMA, neither the weight matrix nor the input matrix needs to be compressed, and all the data sent to the PE array are non-zeros. Therefore, only the factor of data reuse needs to be considered. Figure 5 shows the performance of different dataflows in SIGMA for computing dense data. When computing the dense data, the performance trend in SIGMA is the same as that of the SA. When the input matrix size is larger than the weight matrix size, SIGMA-IS reuses the input matrix block less, while SIGMA-WS reuses the weight matrix block more. In this case, the performance of SIGMA-WS is higher. When the input matrix size is smaller than the weight matrix size, the performance of the SIGMA-IS dataflow is higher.
Fig. 5. The performance of the dataflows in SIGMA when computing dense data.
Low-sparsity data. When processing sparse data in SIGMA, the matrix data fixed in the PE array is compressed, while the matrix data flowing into the PE array is not compressed. The sparsity may affect the performance of different dataflows in SIGMA. Figure 6 shows the performance of different dataflows in SIGMA under 20% sparsity.
Fig. 6. The performance of the dataflows in SIGMA when computing low-sparsity data (20% sparsity).
When the input matrix size is larger than the weight matrix size, using SIGMA-IS results in lower performance due to less reuse of input matrix blocks. Using SIGMA-WS leads to higher performance because of more reuse of weight matrix blocks. Low sparsity makes the actual size of the fixed matrix smaller, but it does not significantly affect the trend in SIGMA-WS and SIGMA-IS. Therefore, in the case of dense data and low sparse data, the performance trend of different dataflows is the same.
High-sparsity data. Figure 7 shows the performance of different dataflows in SIGMA under high sparsity (>80%). In high-sparsity situations, the performance of different dataflows in SIGMA is irregular. The stationary matrix in SIGMA is heavily compressed, while the streaming matrix is not compressed. When the weight matrix is small and the input matrix is large, using SIGMA-WS compresses a small amount of weight data while leaving a large amount of input data uncompressed. Although the PE array can reuse the weight matrix more times, the overall amount of compressed data may be small, which may result in poor performance for SIGMA-WS. On the other hand, using the SIGMA-IS dataflow compresses a large amount of input matrix data while leaving a small amount of weight data uncompressed. When the input matrix size is large, the PE array reuses the input matrix fewer times, but more data is compressed overall in SIGMA-IS, which may result in better performance.
Fig. 7. The performance of the dataflows in SIGMA when computing high-sparsity data (80% sparsity).
When the weight matrix is large and the input matrix is small, using SIGMA-WS compresses a large number of weights, with few input data uncompressed. Although the reuse times for weights are fewer, more data are compressed overall. Therefore, SIGMA-WS may have higher performance.
In summary, when processing low-sparsity and dense data, the performance trend of dataflows in SIGMA is similar to that of the SA. Due to data reuse, SIGMA-IS is more suitable for computing environments with smaller input matrix sizes, while SIGMA-WS is more suitable for computing environments with smaller weight matrix sizes. However, under high sparsity, the performance trend of different dataflows may be the opposite.
After analyzing the dataflow performance in the SA and SIGMA, we found that each type of dataflow is suitable for different computing environments. The processors that support only one type of dataflow cannot always achieve the highest performance in any computing environment. Dynamic dataflow can switch between dataflows during computation to achieve optimal performance, so we need to design processors with dynamic dataflow.

4 The Architecture of SparGD

4.1 Microarchitecture

In this section, we propose the architecture of SparGD. As shown in Figure 8(a), SparGD contains the Global Buffer, PE Array, Dataflow Switching Module, Accumulator, and Controller. The Global Buffer is used to store the block matrix in a Bitmap format. The PE Array is used to calculate the blocked GEMM. The Dataflow Switching Module controls the dynamic configuration of dataflows. The Accumulator is used to accumulate the block matrix, and the Controller controls the progress of SparGD.
Fig. 8. The architecture of SparGD. (a) SparGD architecture. (b) Dataflow Switching Module. (c) PE of SparGD.

4.1.1 PE and PE Groups.

As shown in Figure 8(a), SparGD consists of several PE groups. Each PE group consists of several PEs. Each PE consists of some data registers, routing registers, and multiplexers. The data registers in a PE can cache data for data reuse. For example, in the weight stationary dataflow, one data register in each PE can store an element of the weight matrix for reuse. Other registers and multiplexers in the PE are used for data routing.
Specifically, as shown in Figure 8(c), the PE contains five registers: VIDSecReg, VIDReg, StrSecReg, StrReg, and StaReg. The value ID (VID) identifies an accumulation group for the products of streaming data and stationary data; multiplication results with the same VID need to be added together. The value in VIDSecReg is used to select the VID that is written to VIDReg, and VIDReg stores the VID. The value in StrSecReg is used to select data on the PE Bus as the streaming data, StrReg stores the streaming data, and StaReg stores the stationary data. The values in VIDSecReg, StrSecReg, and StaReg are filled when loading the stationary data and do not change when loading the streaming data. StrReg and VIDReg are filled when loading the streaming data. The PE contains several multiplexers. When loading the streaming data, according to the signal on the PE Bus and the value in VIDSecReg, VIDMux_0 to VIDMux_4 select a VID and send it to VIDReg (more details are shown in Figure 16 (Step vii-b)). According to the value in StrSecReg, StrMux selects the data on the PE Bus and sends it to StrReg (more details are shown in Figure 16 (Step vii-a)). The PE contains a multiplier, which multiplies the stationary data by the streaming data. The multiplication result and the VID are output to the reduction network simultaneously.
Fig. 9. The structure and algorithm of PAT. (a) Linear Binary Tree. (b) Simplified FAN. (c) PAT. (d) PAT Algorithm.
Fig. 10. The performance and hardware overhead of Linear, ART, FAN, and PAT. (a) Performance evaluation. (b) Hardware overhead evaluation.
Fig. 11. The dataflows in SparGD. (a) SparGD-WS. (b) SparGD-IS.
Fig. 12. The performance of the dataflows in SparGD when computing low-sparsity data (20% sparsity).
Fig. 13. The performance of the dataflows in SparGD when computing high-sparsity data (80% sparsity).
Fig. 14. The performance of the dataflows in SparGD under different weight and input sparsity (20% input sparsity and 80% weight sparsity).
Fig. 15. The performance of the dataflows in SparGD under different weight and input sparsity (80% input sparsity and 20% weight sparsity).
Fig. 16. An example of running the GEMM in SparGD.
If a PE group contains K PEs, the large-size N \(\times\) N matrix is divided into small matrices of the size of K \(\times\) N or N \(\times\) K to perform block matrix operations. Several PE groups are used to compute a block matrix multiplication. As shown in Figure 16 (Step iv), the number of PE groups required is determined when loading the stationary data.

4.1.2 Distribution Network.

The distribution network is used to load the stationary data and stream the other data. In a systolic array, the distribution network consists of horizontal and vertical forwarding links between PEs. The data loading time of the systolic array is O(K) for a K \(\times\) K systolic array. SIGMA uses the Benes network as the distribution network, and its data loading time can be reduced to O(1). However, the Benes network requires additional logic to generate the routing information.
SparGD uses PE Bus as the distribution network. Each PE group is connected to a single PE Bus, and the data loading time can reach O(1). Compared with the Benes network, the bus structure is simpler, and the wiring cost is less.
The stationary data are unicast to the PEs in the PE group; the stationary values enter the PEs in order. Before unicasting the stationary data, SparGD calculates the row number corresponding to each stationary value. When a stationary value enters StaReg (shown in Figure 8(c)), this row number enters StrSecReg.
When loading the streaming data, each PE in the PE group needs to select the correct streaming data, and routing is required at this time. The row number in StrSecReg is used as the signal for the StrMux multiplexer in the PE. The StrMux multiplexer selects the streaming data on the PE bus according to the row number and stores it in the StrReg. Then the routing of streaming data is completed.
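The register and selection behavior described above can be summarized in a small behavioral model. The Python sketch below is our own abstraction of one SparGD PE (it mirrors the registers of Figure 8(c) but is not cycle-accurate RTL); the list-based PE Bus and the VID-group table are assumptions for illustration:

```python
class PE:
    """Behavioural sketch of one SparGD PE (not cycle-accurate)."""
    def __init__(self):
        self.sta_reg = 0    # StaReg: stationary operand (a non-zero value)
        self.str_sel = 0    # StrSecReg: index used as the StrMux select signal
        self.vid_sel = 0    # VIDSecReg: selects one VID group for the VID muxes
        self.str_reg = 0    # StrReg: streaming operand picked off the PE Bus
        self.vid_reg = 0    # VIDReg: accumulation-group ID of the product

    def load_stationary(self, value, c_sta, r_sta):
        # Written once per block while the stationary matrix is unicast.
        self.sta_reg, self.str_sel, self.vid_sel = value, c_sta, r_sta

    def load_streaming(self, pe_bus, c_str, vid_groups):
        # StrMux: choose the bus word whose index matches StrSecReg.
        self.str_reg = pe_bus[self.str_sel]
        # VID muxes: VIDSecReg picks a group, the word's C_str picks within it.
        self.vid_reg = vid_groups[self.vid_sel][c_str[self.str_sel]]

    def fire(self):
        # One multiply; result and VID leave for the reduction network together.
        return self.str_reg * self.sta_reg, self.vid_reg

# Toy usage with made-up values: a 4-wide PE Bus and 4 VID groups of 4.
vid_groups = [[g + 4 * c for c in range(4)] for g in range(4)]
pe = PE()
pe.load_stationary(value=5, c_sta=1, r_sta=0)
pe.load_streaming(pe_bus=[3, 7, 0, 2], c_str=[0, 1, 2, 3], vid_groups=vid_groups)
print(pe.fire())    # (35, 4): product 7 * 5, accumulation group 4
```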

4.1.3 Reduction Networks.

The reduction network is used to accumulate the multiplication results from the PE array. The reduction network of the systolic array is rigid, and it can only accumulate the same number of elements each time. Unlike the systolic arrays, flexible reduction networks usually require accumulating different numbers of elements. As shown in Figure 9(b) and Figure 9(c), the VID of “a” has three elements that need to be added together, while the VID of “b” has four elements to be added together.
The ART is a reduction network used in MAERI [19]. The ART is an adder tree augmented with additional links. These additional links are used to forward the adder output to other nodes of the same level instead of the parent node. The ART is built with three input adders; two inputs are from the child nodes and one input is from the sibling node. This induces high hardware overhead.
The FAN is a reduction network used in SIGMA [27]. The FAN is based on a traditional binary adder tree. It places forwarding links between adders at different levels. The average accumulation time of the ART or the FAN is O( \(log_2N\) ). However, neither the ART nor the FAN can support pipelined accumulation, which significantly limits the performance.
In this section, we propose the PAT as the reduction network, which is a linear adder tree with pipeline registers. Similar to the FAN, the PAT is based on the linear binary tree in Figure 9(a). Adding forwarding links to the linear binary tree yields the simplified FAN in Figure 9(b), and adding pipeline registers between the stages of the simplified FAN yields the PAT in Figure 9(c). The value on the original forwarding link is temporarily stored in a pipeline register and passed backward stage by stage. For example, in Figure 9(b), “adder 6” and “adder 7” have a forwarding link. In Figure 9(c), after adding the pipeline registers, “adder 6” is connected to “\(Reg_1[7]\)” of the next stage, and its value is passed down stage by stage until it finally reaches “adder 7.”
The PAT runs in a pipelined manner, in which the input of each stage is the output of the previous stage. Figure 9(d) presents the algorithm for the ith level of the PAT. The inputs to the algorithm are the value of the pipeline register of this stage, the \(VID,\) and the AdderID. The output is the value of the pipeline register of the next stage. Line 1 of the algorithm traverses the AdderIDs in \(Lev\_i\), which corresponds to the pipeline stage number in Figure 9(c). For example, if i is “2,” the AdderIDs in \(Lev\_i\) are “3” and “11.” Lines 2 to 4 assign initial values to the output registers. Line 5 determines whether an add operation needs to be performed according to the VID values. If an add operation is performed, lines 7 to 8 clear the two registers to remove the two added elements. Lines 9 to 26 then determine which register receives the sum; the condition in line 9 holds if the adder with this AdderID is on the left side of its parent node in the linear binary tree. Once all AdderIDs in \(Lev\_i\) have been traversed, the algorithm ends.
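A behavioral sketch of the idea, in Python rather than RTL: each pipeline stage holds (VID, value) pairs in its registers, adds adjacent entries that share a VID, and forwards everything else to the next stage. This is our own simplification of the algorithm in Figure 9(d), not the exact hardware:

```python
def pat_stage(entries):
    """One pipeline stage. `entries` is a list of (vid, value) pairs sorted by
    VID; adjacent pairs with the same VID are added, the rest are forwarded."""
    out, i = [], 0
    while i < len(entries):
        if i + 1 < len(entries) and entries[i][0] == entries[i + 1][0]:
            out.append((entries[i][0], entries[i][1] + entries[i + 1][1]))
            i += 2                      # both operands consumed by one adder
        else:
            out.append(entries[i])      # forwarded via the pipeline register
            i += 1
    return out

def pat_reduce(entries, stages):
    """Push one batch through all stages; in steady state a new batch can
    enter every cycle, so the amortized accumulation time is O(1)."""
    for _ in range(stages):
        entries = pat_stage(entries)
    return dict(entries)                # one accumulated value per VID

# Irregular batch: VID "a" has 3 products, "b" has 4, "c" has 1.
batch = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("b", 6), ("b", 7), ("c", 8)]
print(pat_reduce(batch, stages=3))      # {'a': 6, 'b': 22, 'c': 8}
```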
To show the advantage of the PAT, we implement the linear binary tree, ART, FAN, and PAT with the RTL Verilog HDL. We use the Xilinx Vivado Design Suite to evaluate their performance and hardware overhead. Each of these adder trees contains 31 adders. The values of different batches are used for accumulation. As shown in Figure 10, the PAT has the highest performance due to pipelined accumulation. The PAT is about 3x faster than the FAN or the ART. The hardware overhead of the PAT is the largest due to the additional pipeline registers. It is 2x more than the FAN or ART.

4.2 Dynamic Dataflow Design

4.2.1 Dataflows in SparGD.

SparGD supports two typical dataflows: the weight stationary dataflow and the input stationary dataflow. The weight stationary dataflow in SparGD is referred to as “SparGD-WS,” and the input stationary dataflow in SparGD is referred to as “SparGD-IS.”
SparGD-WS. In SparGD-WS, the weight matrix is fixed in the PE array of SparGD, and the fixed weight matrix only contains the non-zero data. The input data streaming into the PE array are also non-zero data. As shown in Figure 11(a), the non-zero weights are fixed in the PE array. During the calculation process, the weight data are reused, and the non-zero inputs flow into the PE array. The non-zero input data are distributed to the corresponding PEs through a distribution network. After multiplying the input data with the corresponding weight data, the multiplication results are accumulated through a reduction network, and the final calculation result is generated and sent out through the reduction network.
SparGD-IS. As shown in Figure 11(b), in SparGD-IS, the input matrix is fixed in the PE array, and both the input matrix and the weight data used in the PE array are non-zero data. The non-zero weights are distributed to the corresponding PEs. After multiplying the weight data with the input data, the multiplication results are accumulated in the reduction network and sent out of the PE array.

4.2.2 Dataflow Analysis of SparGD.

Same sparsity and different matrix size. In the case of computing dense data, SparGD has a similar performance trend as the SA and SIGMA. Therefore, we do not discuss the performance of dataflows in SparGD when computing dense data.
Figures 12 and 13 show the performance of different dataflows in SparGD under low- or high-sparsity conditions. In SparGD, both the data fixed in the PE array and the data streaming into the PE array are highly compressed, with only non-zero data participating in the operation. When the sparsity of the weight matrix and the input matrix is the same, only the size of the weight and input matrices needs to be considered. At this time, the performance trend of the dataflow in SparGD is the same as that in the SA. When the input matrix size is larger than the weight matrix size, the computational time of SparGD-WS is minimal, and its performance is high. On the contrary, when the input matrix size is smaller than the weight matrix size, SparGD-IS has higher performance.
Same matrix size and different sparsity. This section further discusses the performance when the sparsity of the weight matrix and the input matrix is different.
Figure 14 shows the performance of SparGD under different weight and input sparsity but with the same matrix size. When the sparsity of the weight matrix is high and the sparsity of the input matrix is low, there are fewer non-zero values in the weight matrix and more non-zero values in the input matrix. Therefore, the size of the compressed weight matrix is smaller, while the size of the compressed input matrix is larger. That is similar to processing dense data in the SA when the weight matrix size is small and the input matrix size is large. In this case, when using SparGD-WS, the PE array reuses the weight matrix block more times, resulting in the minimum computational time of SparGD-WS.
When the sparsity of the weight matrix is low and the sparsity of the input matrix is high, there are more non-zero values in the weight matrix and fewer non-zero values in the input matrix. Therefore, the size of the compressed weight matrix is larger and the size of the compressed input matrix is smaller. As shown in Figure 15, SparGD-IS has a high performance in this case.

4.2.3 Dataflow Switching Module.

By analyzing the dataflow performance in SparGD, we found that a single dataflow only has good performance in certain situations. To ensure optimal performance in all situations, we design a Dataflow Switch Module for SparGD, which allows SparGD to support both SparGD-WS and SparGD-IS simultaneously.
SparGD chooses the optimal dataflow based on the size and sparsity of the input matrix and weight matrix. Before computation, the Dataflow Switching Module generates control signals based on the size and sparsity of the input matrix and weight matrix, then controls which matrix is fixed in the PE array, which completes the configuration of the dataflow. The matrices enter SparGD in compressed Bitmap format, and the sizes of the weight matrix and input matrix are known. The condition for generating the dataflow control signal is given by Equation (7). DCtrl is the control signal. \(L_i\) is the length of the input values in the Bitmap, i.e., the number of non-zero values in the input matrix, and \(L_w\) is the length of the weight values in the Bitmap:
\begin{equation} DCtrl=\left\lbrace \begin{array}{lr} 1,&L_i\ge L_w \\ 0,&L_i\lt L_w \end{array}. \right. \end{equation}
(7)
The difference between SparGD-WS and SparGD-IS is which matrix is stationary and which matrix is streaming. These two dataflows in SparGD are symmetric, which simplifies the design of SparGD. The design of the Dataflow Switching Module does not require modification of the existing structure.
The implementation of the Dataflow Switching Module is shown in Figure 8(b). The Dataflow Switching Module includes a SizeReg that stores the values of \(L_w\) and \(L_i\). A subtractor compares \(L_w\) and \(L_i\). When \(L_w\) is greater than \(L_i\), the result of the subtraction is positive with a sign bit of 0, which is stored as DCtrl in DCtrlReg. If \(L_w\) is less than \(L_i\), the result of the subtraction is negative with a sign bit of 1, and DCtrl is set to 1. DCtrl mainly changes the order of extracting data from the Global Buffer: the matrix extracted first is stationary, and the matrix multiplied with it later is streaming. If DCtrl = 1, SparGD-WS is selected, and the weight matrix is first extracted from the Global Buffer and fixed in the PE array. Otherwise, SparGD-IS is selected, and the input matrix is fixed in the PE array.
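A minimal sketch of this decision, assuming the bitmap lengths \(L_i\) and \(L_w\) are already known from the compressed operands (the function and variable names are ours):

```python
def select_dataflow(l_i, l_w):
    """Dataflow Switching Module decision from Equation (7):
    DCtrl = 1 -> SparGD-WS (weight matrix stationary),
    DCtrl = 0 -> SparGD-IS (input matrix stationary)."""
    dctrl = 1 if l_i >= l_w else 0
    return dctrl, "SparGD-WS" if dctrl else "SparGD-IS"

# Fewer non-zero weights than inputs: keep the smaller weight matrix stationary.
print(select_dataflow(l_i=120, l_w=40))   # (1, 'SparGD-WS')
# Fewer non-zero inputs than weights: keep the smaller input matrix stationary.
print(select_dataflow(l_i=30, l_w=90))    # (0, 'SparGD-IS')
```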

4.3 Example

The following steps (corresponding to Figure 16) describe a walk-through example of SparGD. In this example, each PE group contains four PEs ( \(N_{PE}\) is 4).
Step i) Read two block matrices encoded in the Bitmap format.
Step ii) The Dataflow Switching Module generates a control signal based on Equation (7) to determine which dataflow to use. In this example, the processor selects the SparGD-IS dataflow, where the weight matrix is streaming and the input matrix is stationary.
Step iii) Perform the Row-wise OR operation on the streaming bitmap, and take the output as the valid bits of the column of the stationary bitmap. Then, the invalid elements are removed from the stationary bitmap.
Step iv) The number of ones in the stationary bitmap corresponds to the number of useful stationary values ( \(N_{sta}\) ). Since \(N_{sta}\) is 8 and \(N_{PE}\) is 4, two PE groups are required in this example.
Step v) Encode the stationary bitmap and the streaming bitmap. For the stationary bitmap, get the column number and row number of each value from the bitmap and put them into the \(C_{sta}\) and \(R_{sta}\) arrays. For the streaming bitmap, the column number is put into the \(C_{str}\) array. The values of the streaming matrix need to be shifted left by row, as shown in Figure 16 (Step v) (a small sketch of Steps iii and v is given after this walk-through).
Step vi) Unicast the stationary value to each PE in the PE group. The stationary data is put into the StaReg. The corresponding values in the \(R_{sta}\) and \(C_{sta}\) arrays are also allocated to the PE. The data in the \(R_{sta}\) array is put into VIDSecReg. The data in the \(C_{sta}\) array is put into StrSecReg.
Step vii) Broadcast the streaming value to each PE group by column. PEs need to select a correct streaming value, and routing is required at this step. In the previous step, there is already a value in StrSecReg. This value is used as the control signal for the multiplexer in Figure 16 (Step vii-a). The multiplexer selects the value on the PE Bus and stores it in StrReg. In Figure 16 (Step vii-a), if the value in StrSecReg is “00,” StrReg selects the first stream data “a.” If the value in StrSecReg is “01,” the StrReg selects the second stream data “e.”
When the streaming data enters StrReg, the corresponding \(C_{str}\) value is also put into the PE. The \(C_{str}\) value and the VIDSecReg value are taken as the control signals of the multiplexer in Figure 16 (Step vii-b). The multiplexer selects the corresponding VID and sends it to VIDReg. In Figure 16 (Step vii-b), the VID values range from 0 to 15. If the value in VIDSecReg is “00,” VIDReg selects the leftmost multiplexer (containing the VIDs “0, 4, 8, 12”). Then VIDReg selects the VID in this multiplexer based on the \(C_{str}\) value: if \(C_{str}\) is “000,” the first VID “0” is selected, and if \(C_{str}\) is “001,” VIDReg selects the second VID “4.”
StaReg and StrReg in the PE are connected to a multiplier. The multiplication result is generated after the stationary data and the streaming data are multiplied. The multiplication result and the value in VIDReg are sent to the reduction network simultaneously.
Step viii) Since the streaming matrix is out of order (after shifting), the multiplication results with the same VID may not be in adjacent PEs. Our reduction network, the PAT, requires multiplication results with the same VID to be adjacent, so the multiplication results must be sorted by VID. We group multiplication results into an Addition Group according to the same VIDSecReg. The sorting is performed only within each Addition Group, which reduces the sorting overhead. The sorted multiplication results are sent to the PAT for accumulation. When running a single task, the complexity of the PAT is O(\(\log_2 N\)). The sorting process increases the complexity of a single task. However, the PAT is pipelined: when running multiple tasks continuously, the PAT generates accumulated results in every cycle. Therefore, the overall complexity of the accumulation process is O(1).
The above unicast, broadcast, multiply, and add operations are all performed in a pipelined manner. A GEMM operation is complete once all non-zero values of the streaming matrix have flowed in and the output has been generated.
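For concreteness, the following sketch (our own simplification of Steps iii and v above, not the hardware implementation; the matrix layouts and variable names are assumptions) filters the stationary matrix with a row-wise OR of the streaming bitmap and then performs element shifting on the streaming matrix:

```python
import numpy as np

def rowwise_or_filter(streaming, stationary):
    """Step iii: a streaming row that is entirely zero makes the corresponding
    stationary column useless, so that column is cleared."""
    valid = (streaming != 0).any(axis=1)       # row-wise OR of the streaming bitmap
    return stationary * valid[np.newaxis, :]   # drop invalid stationary columns

def element_shift(streaming):
    """Step v: compact each streaming row to the left, keeping only non-zeros,
    and record their original column numbers in the C_str array."""
    shifted, c_str = [], []
    for row in streaming:
        cols = np.flatnonzero(row)
        shifted.append(row[cols])
        c_str.append(cols)
    return shifted, c_str

streaming = np.array([[1, 0, 2, 0],            # K x N weight matrix (K = 3, N = 4)
                      [0, 0, 0, 0],
                      [0, 3, 0, 4]])
stationary = np.array([[5, 6, 7],              # M x K input matrix (M = 2, K = 3)
                       [0, 8, 9]])
print(rowwise_or_filter(streaming, stationary))  # column 1 of the input is cleared
print(element_shift(streaming))                  # compacted rows plus C_str indices
```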

5 Evaluation

5.1 Experimental Setup

5.1.1 Experimental Method.

In this section, we first evaluate different sparse compression methods. We compare Element Shifting (ES) and the Bitmap with other compression methods, including the Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), Coordinate (COO), and Run-length Coding (RLC) formats. Then we compare the performance and PE utilization of SparGD, the SA, and SIGMA when running different types of GEMM. Based on the previous analysis, the SA-OS dataflow cannot outperform both the SA-IS and SA-WS dataflows in most cases, so we do not use the SA-OS dataflow in our experiments. Additionally, to compare SparGD with an accelerator different from the SA and SIGMA, we simulate and analyze the ExTensor. Finally, we conduct both hardware resource analysis and scalability analysis.

5.1.2 Experimental Platform.

To evaluate the performance and hardware overhead on the same experimental platform, we use RTL Verilog HDL to implement the SA, SIGMA, and SparGD with 64 PEs. In our experiments, the SA contains 8 \(\times\) 8 PEs; SIGMA contains 8 Flex-DPEs, each with 8 PEs; and SparGD contains 8 PE groups, each with 8 PEs. The data width is 8 bits. We implement two SA processors with the SA-WS and SA-IS dataflows, two SIGMA processors with the SIGMA-WS and SIGMA-IS dataflows, and finally SparGD. SparGD-WS in our experiments denotes a SparGD that uses only the SparGD-WS dataflow and cannot switch dataflows; SparGD-IS denotes a SparGD that uses only the SparGD-IS dataflow. Additionally, we implement the SA and SparGD with 16 \(\times\) 16 PEs, 64 \(\times\) 64 PEs, and 128 \(\times\) 128 PEs for scalability analysis. We use the Xilinx Vivado Design Suite for logic simulation to obtain performance data. To analyze area and power, we synthesize the designs with Synopsys Design Compiler (DC) in 28nm technology. The clock cycle during simulation is set to 8ns, and the main frequency during synthesis is set to 125MHz.
Similar to [15], we use Python to simulate the ExTensor when evaluating it. We simulate the main data transmission and calculation process of the ExTensor, which has 64 PEs. The simulated ExTensor adopts an optimized strategy for coordinate lookup, which can perform lookup by stepping. We set the length of one search step for the ExTensor to 16.

5.1.3 Workloads.

To clearly display the differences in different dataflows when running different types of GEMMs, we use some synthetic workloads. In other experiments, we select the real-world tensors from the SuiteSparse matrix collection in Table 1 (Set 0 to Set 8). We evaluate the sparse GEMM by multiplying these matrices pairwise. We also use some common GEMM workloads in typical transformer models in Table 1 (Set 9 to Set 12). These datasets have different dimensions and sparsities.
Workloads                                  M Size   K Size   N Size   Sparsity (M-K Sparsity, K-N Sparsity)
Set 0    ch4-4-b3 \(\times\) ch4-4-b2           24       96       72   96.8%, 96.8%
Set 1    ch5-5-b2 \(\times\) ch5-5-b1          600      200       25   98%, 92%
Set 2    klein-b2 \(\times\) klein-b1           20       30       10   90%, 80%
Set 3    n2c6-b2 \(\times\) n2c6-b1            455      105       15   97.1%, 86.6%
Set 4    n3c5-b2 \(\times\) n3c5-b1            120       45       10   93.3%, 98%
Set 5    n3c5-b5 \(\times\) n3c5-b4            210      252      210   97.6%, 97.6%
Set 6    n3c5-b7 \(\times\) n3c5-b6             30      120      210   97.6%, 97.6%
Set 7    n3c6-b2 \(\times\) n3c6-b1            455      105      105   97.1%, 98.1%
Set 8    n4c5-b11 \(\times\) n4c5-b10           10      120      630   90%, 99.1%
Set 9    Transformer                           256      512       64   80%, 90%
Set 10   BERT                                  128      768       64   70%, 90%
Set 11   DeiT-B                                256      768       64   90%, 80%
Set 12   DeiT-S                              1,024      384       64   90%, 70%
Table 1. GEMM Workloads in the SuiteSparse Matrix Collection and Transformer Models [7]

5.2 Data Compression Analysis

Encoding sparse data can reduce the memory footprint and the communication overhead during transmission. Common sparse encoding methods include the CSC, CSR, COO, and RLC. SparGD uses the Bitmap to store and load data. In the Bitmap, each element has a corresponding bit that indicates whether the element in the corresponding matrix is zero. SparGD uses the ES method to process the streaming matrix, which requires a \(C_{str}\) array (as shown in Figure 16 (Step v)) as the index of the shifted matrix.
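As a back-of-envelope illustration of why the Bitmap is compact (our own model, not a measurement; the index widths are idealized assumptions), the sketch below estimates the storage cost of Bitmap, COO, and CSR for a matrix with 8-bit values:

```python
import math

def footprint_bits(M, K, density, value_bits=8):
    """Rough storage cost (bits) of an M x K matrix for a few sparse formats."""
    nnz = int(M * K * density)
    row_bits = max(1, math.ceil(math.log2(M)))
    col_bits = max(1, math.ceil(math.log2(K)))
    return {
        "Bitmap": M * K + nnz * value_bits,                      # 1 flag bit per element + values
        "COO":    nnz * (row_bits + col_bits + value_bits),      # (row, col, value) per non-zero
        "CSR":    nnz * (col_bits + value_bits) + (M + 1) * 32,  # col ids + values + 32-bit row pointers
    }

# A 256 x 128 matrix at 80% sparsity (20% density).
print(footprint_bits(256, 128, density=0.2))
```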
Figure 17 compares the memory footprint of various sparse compression formats under different sparsity levels. The “2” in RLC_2 represents 2 bits of each index. The memory footprint of the ES, CSR, and CSC is similar. The memory footprint of the RLC and Bitmap is the lowest. Although the memory footprint of the ES is not the lowest, it is only used when processing streaming matrices in SparGD. When loading data and processing stationary matrices, SparGD adopts the Bitmap with a low memory footprint. Overall, SparGD compresses data with a lower memory footprint.
Fig. 17. Matrix memory overhead with dimensions M = 256 and K = 128.

5.3 Different Types of GEMM

5.3.1 Dense Regular and Dense Irregular GEMM.

Figure 18 shows the performance and array utilization of the SA, SIGMA, and SparGD of the 8 \(\times\) 8 size when running dense regular GEMMs. The performance is measured as the count of cycles. The utilization is measured as the average of StaUtil (the PE array utilization of the stationary matrix) and StrUtil (the PE array utilization of the streaming matrix). Since there is no sparsity and the matrix is regular, every matrix element in the GEMM must be mapped, and the array utilizations of the SA, SIGMA, and SparGD are all 100%. The systolic array has O(\(\sqrt{N}\)) distribution and reduction times. In SIGMA, the time complexities of distribution and reduction are O(1) and O(\(\log_2 N\)), respectively, so SIGMA is ~10% faster than the SA on average. SparGD uses the PAT, whose reduction time complexity is O(1). Therefore, SparGD is ~20% faster than the SA and ~10% faster than SIGMA on average.
Fig. 18. The performance and array utilization of the SA, SIGMA, and SparGD when running dense regular GEMMs. The three numbers in the horizontal axis represent the size of GEMMs (M-K-N).
The “MK” matrix is the input matrix and the “KN” matrix is the weight matrix. When the weight matrix is smaller than the input matrix, the performance of SA-WS is better than that of SA-IS, SIGMA-WS is better than SIGMA-IS, and SparGD-WS is better than SparGD-IS. The performance of SparGD equals that of SparGD-WS because SparGD switches to the SparGD-WS dataflow when the weight matrix is the smaller one. When the weight matrix is larger than the input matrix, the “IS” variants of the different processors outperform the “WS” variants. In this experiment, the performance of SparGD-IS may be worse than that of SIGMA-WS when the weight matrix is small, and the performance of SparGD-WS may be worse than that of SIGMA-IS when the input matrix is small. Therefore, without dynamic dataflow, SparGD could not always outperform SIGMA. Due to the use of dynamic dataflow, the performance of SparGD is always the highest.
Figure 19 shows the performance and array utilization of the SA, SIGMA, and SparGD of the 8 \(\times\) 8 size when running dense irregular GEMMs. Since the array size of the SA, SIGMA, and SparGD in our experiments is 8 \(\times\) 8, a GEMM is irregular when the size of the “K” dimension is less than 8. For dense and irregular GEMMs, the PEs in the SA cannot be fully filled, so the array utilization of the SA cannot reach 100%. This underutilization results in additional time for the SA. SIGMA and SparGD use flexible distribution networks, so all elements can be filled into the PEs and the array utilization of both can be near 100%. The high utilization brings high performance to SIGMA and SparGD. The performance of SparGD is always the highest because of the dynamic dataflow. In this experiment, SIGMA and SparGD are up to 2~5x faster than the TPU. SparGD is ~10% faster than SIGMA due to its more efficient pipelined reduction network and dynamic dataflow.
Fig. 19. The performance and array utilization of the SA, SIGMA, and SparGD when running dense irregular GEMMs.

5.3.2 Sparse Regular GEMM.

Figures 20 to 22 show the performance and array utilization of the SA, SIGMA, and SparGD of the 8 \(\times\) 8 size when running regular GEMMs with different sparsities. Due to the introduction of sparsity, the SA must map zeros to the PE array, resulting in insufficient array utilization. As shown in Figure 1(c), in SIGMA, StaUtil can be close to 100%, but StrUtil is low, so the average utilization of SIGMA is low as well. In SparGD, both StaUtil and StrUtil are close to 100%, so the utilization of SparGD is the highest. Due to insufficient utilization, the performance of the SA is the worst; due to the highest utilization, the performance of SparGD is the best. In our experiments, for the sparse regular GEMMs, the performance of SparGD is ~10x better than the SA and ~3.6x better than SIGMA. When the weight matrix is smaller than the input matrix, SA-WS outperforms SA-IS and SparGD-WS outperforms SparGD-IS, while SIGMA does not follow this trend. Due to its ability to switch dataflows, SparGD matches the performance of the optimal dataflow. As the sparsity increases, the performance gaps between SparGD, SIGMA, and the SA become larger.
Fig. 20. The performance and array utilization of the SA, SIGMA, and SparGD when running 50% sparse regular GEMMs.
Fig. 21. The performance and array utilization of the SA, SIGMA, and SparGD when running 60% sparse regular GEMMs.
Fig. 22. The performance and array utilization of the SA, SIGMA, and SparGD when running 70% sparse regular GEMMs.

5.3.3 Sparse Irregular GEMM.

Figures 23 to 25 show the performance and array utilization of the SA, SIGMA, and SparGD of the 8 \(\times\) 8 size when running irregular GEMMs with different sparsities. The ordinate is on a logarithmic scale. Sparsity and irregularity together give the SA the worst array utilization, while that of SparGD is always the highest and that of SIGMA is modest. The SA performs worst when dealing with sparse and irregular GEMMs. Due to its low StrUtil, the performance of SIGMA is worse than that of SparGD. In our experiments, the performance of SparGD is ~30x better than the SA and ~3.6x better than SIGMA. As in the previous experiment, SparGD, which can switch dataflows, has the best performance. As sparsity increases, the performance gaps between SparGD, SIGMA, and the SA become larger. On sparse and irregular GEMMs, the advantage of SparGD is very obvious.
Fig. 23. The performance and array utilization of the SA, SIGMA, and SparGD when running 50% sparse irregular GEMMs.
Fig. 24. The performance and array utilization of the SA, SIGMA, and SparGD when running 60% sparse irregular GEMMs.
Fig. 25. The performance and array utilization of SA, SIGMA, and SparGD when running 70% sparse irregular GEMMs.
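A back-of-the-envelope round-count model helps explain the widening gap. The sketch below is an illustrative assumption of ours (it counts only stationary-mapping rounds on an 8 \(\times\) 8 array and ignores streaming time, load imbalance, and reduction), not the simulator used for Figures 23 to 25.

import math

PES = 64  # 8 x 8 array

def sa_rounds(K, N, rows=8, cols=8):
    # A rigid SA maps dense K x N stationary tiles, zeros included.
    return math.ceil(K / rows) * math.ceil(N / cols)

def sparse_flexible_rounds(K, N, density):
    # A double-sided-sparse, flexibly mapped design packs only non-zeros,
    # up to 64 per round.
    nnz = math.ceil(K * N * density)
    return max(1, math.ceil(nnz / PES))

K, N = 6, 32   # irregular: K < 8
for sparsity in (0.5, 0.6, 0.7):
    print(f"sparsity={sparsity:.0%}: SA rounds = {sa_rounds(K, N)}, "
          f"sparse/flexible rounds = {sparse_flexible_rounds(K, N, 1 - sparsity)}")

Each SA round is additionally only 75% utilized here (K = 6), so wasted rounds and wasted PEs per round compound as sparsity and irregularity grow.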

5.4 All Optimal Dataflow

The performance of different dataflows in the SA and SIGMA has been analyzed above. To emphasize the advantage of the SparGD architecture, in this section we assume that the SA and SIGMA can also switch dataflows dynamically. We run the sparse GEMMs in Table 1 on the SA, SIGMA, and SparGD and, for the SA and SIGMA, select the results of the optimal dataflow. Figure 26 shows the experimental results; the ordinate is on a logarithmic scale. On our experimental platform, the performance of SparGD is 200.7 GOPS, while the performance of the SA and SIGMA is 6.6 GOPS and 41.3 GOPS, respectively. Whether running GEMMs from the SuiteSparse matrix collection or GEMMs from transformer models, SparGD achieves the highest performance. Although the SA and SIGMA are given dynamic dataflows here, this cannot compensate for the insufficient array utilization of their architectures. When running sparse GEMMs, high array utilization is the main advantage of the SparGD architecture.
Fig. 26. The performance of the SA, SIGMA, and SparGD when all have optimal dataflow.

5.5 Compare to the ExTensor

Many sparse GEMM accelerators are not based on the SA or SIGMA architectures; ExTensor is one example [15]. It is one of the state-of-the-art GEMM accelerators with double-sided sparsity. Like SparGD, ExTensor processes only non-zero data. ExTensor first identifies the valid non-zero data, i.e., the elements that actually participate in the calculation, among all non-zeros and then loads them into the PEs. In contrast, SparGD loads all non-zero data first and then finds and executes the valid calculations during the computation. In other words, ExTensor finds valid calculations before loading non-zero data, while SparGD finds them after loading non-zero data. ExTensor finds valid calculations by matching data coordinates, while SparGD finds them by distributing the correct streaming data to the stationary data.
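The contrast can be summarized at the data-structure level. The sketch below is our own abstract illustration, not either accelerator's hardware or compression format: the first function mimics coordinate matching before loading (ExTensor-style), and the second mimics routing a streaming element to matching stationary PEs after loading (SparGD-style).

def intersect_before_loading(a_coords, b_coords):
    # ExTensor-style: intersect the non-zero coordinate lists up front, so only
    # the "valid" non-zeros are ever sent to the PEs.
    return sorted(set(a_coords) & set(b_coords))

def route_after_loading(stationary_coords, streaming_coord):
    # SparGD-style: every stationary non-zero already sits in a PE; the
    # distribution network forwards a streaming element only to the PEs whose
    # stationary coordinate matches, and the other PEs simply receive nothing.
    return [pe for pe, c in enumerate(stationary_coords) if c == streaming_coord]

a_k = [0, 2, 5, 7]   # non-zero k-indices held stationary
b_k = [2, 3, 5]      # non-zero k-indices of a streaming vector
print(intersect_before_loading(a_k, b_k))   # [2, 5]
print(route_after_loading(a_k, 5))          # [2] (the PE holding k = 5)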
We run GEMMs from both the SuiteSparse matrix collection and the transformer models. Figure 27 shows the performance of ExTensor and SparGD. On our experimental platform, the performance of SparGD is 200.7 GOPS, and the performance of ExTensor is 143.6 GOPS. ExTensor performs better on small-size GEMMs (Set 0 and Set 2), while SparGD performs better on larger GEMMs. A small-size GEMM does not need to be blocked and can be completed in a single matrix multiplication. In this case, ExTensor identifies the valid non-zero elements and loads only a portion of the non-zeros, while SparGD loads all non-zero elements; therefore, ExTensor performs better for small-size matrix multiplication (Set 0 and Set 2). In practice, however, most GEMMs are large; for example, the GEMMs in transformer models always have large sizes (Set 9 to Set 12). A large GEMM must be blocked into multiple small matrix multiplications, so the accelerator performs many blocked matrix multiplications. In this case, SparGD can exploit the data reuse offered by the IS or WS dataflow (illustrated by the blocked-GEMM sketch after Figure 27), whereas ExTensor must find the valid data between the two matrices before each blocked matrix multiplication and therefore gets no such reuse. In addition, ExTensor does not optimize the reduction network: its multiplication results are not accumulated before being sent to DRAM; it sends each product together with its coordinates (corresponding to the VID in SparGD) to DRAM and then performs linear accumulation. Overall, SparGD has a performance advantage over ExTensor in most cases.
Fig. 27. The performance of SparGD and ExTensor.
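The data-reuse argument for blocked GEMM can be illustrated with a small NumPy sketch. The tile size and loop order below are illustrative assumptions, not SparGD's actual scheduler; the point is only that a weight-stationary-style loop loads each stationary tile once and reuses it across all row blocks, whereas re-matching coordinates per block forfeits that reuse.

import numpy as np

def blocked_gemm_ws(A, B, tile=8):
    # Blocked GEMM with a weight-stationary-style loop order: each tile of B
    # is loaded once and reused across every row block of A.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    stationary_loads = 0
    for k0 in range(0, K, tile):
        for n0 in range(0, N, tile):
            Bt = B[k0:k0 + tile, n0:n0 + tile]       # stationary tile, loaded once
            stationary_loads += 1
            for m0 in range(0, M, tile):             # ...and reused here
                C[m0:m0 + tile, n0:n0 + tile] += A[m0:m0 + tile, k0:k0 + tile] @ Bt
    return C, stationary_loads

A = np.random.rand(64, 32)
B = np.random.rand(32, 64)
C, loads = blocked_gemm_ws(A, B)
assert np.allclose(C, A @ B)
print("stationary tile loads:", loads)   # 32 tiles, each reused for 8 row blocks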

5.6 Hardware Cost Analysis

We evaluate the SA, SIGMA, and SparGD with 64 PEs each. Table 2 shows their area and power consumption. SIGMA and SparGD include a larger controller and more computational logic to support sparsity, so their area and power consumption are greater than those of the SA. Compared to SIGMA, SparGD has a larger area, mainly because of the controller, the PEs, and the reduction network (PAT). Compared to SIGMA's controller, the controller in SparGD has additional functions to support element shifting and dataflow selection. The PE in SIGMA caches only two operands, while the PE in SparGD contains more registers; similarly, the PAT contains more register resources than the FAN. Overall, the area of SparGD is 1.8x larger than that of the SA and 1.2x larger than that of SIGMA.
Design | SA | SIGMA | SparGD
Technology | Commercial 28nm | Commercial 28nm | Commercial 28nm
Number of PEs | 64 | 64 | 64
Power (mW) | 69.5 | 127.3 | 155.5
Area (\(\mu m^2\)) | Total: 35,347.0 | Total: 54,901.5 | Total: 64,279.1
Area breakdown | Local Buffer: 45%, Controller: 1.5%, PEs: 48.5%, Accumulator: 5% | Local Buffer: 43%, Controller: 7%, Benes: 12%, PEs: 30.5%, FAN: 5%, Accumulator: 2.5% | Local Buffer: 36.5%, Controller: 10.5%, PE Bus: 5.5%, PEs: 32%, PAT: 13.5%, Accumulator: 2%
Table 2. The Hardware Overhead of the SA, SIGMA, and SparGD
We evaluate the area efficiency and power efficiency of the SA, SIGMA, and SparGD on different workloads. Figure 28 shows their power efficiency. On our experimental platform, the power efficiency of SparGD is 1.29 TOPS/W, while the power efficiency of the SA and SIGMA is 0.1 TOPS/W and 0.3 TOPS/W, respectively. Although the power of SparGD is 2.2x higher than that of the SA, it delivers a performance improvement of more than 30x when running sparse GEMMs, so SparGD has the higher power efficiency. Figure 29 shows their area efficiency; similarly, although SparGD has the largest area, its area efficiency is also the highest (a quick consistency check of these figures follows Figure 29). We therefore consider the additional hardware overhead of SparGD acceptable.
Fig. 28. The power efficiency of the SA, SIGMA, and SparGD.
Fig. 29. The normalized area efficiency of the SA, SIGMA, and SparGD.
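As a quick consistency check, the reported power-efficiency figures can be reproduced from the performance numbers in Section 5.4 and the power numbers in Table 2 (whether the paper computes efficiency in exactly this way is our assumption):

def power_efficiency_tops_per_w(gops: float, power_mw: float) -> float:
    # (GOPS * 1e9 ops/s) / (mW * 1e-3 W), expressed in TOPS/W.
    return (gops * 1e9) / (power_mw * 1e-3) / 1e12

print(f"SparGD: {power_efficiency_tops_per_w(200.7, 155.5):.2f} TOPS/W")  # ~1.29
print(f"SIGMA:  {power_efficiency_tops_per_w(41.3, 127.3):.2f} TOPS/W")   # ~0.3
print(f"SA:     {power_efficiency_tops_per_w(6.6, 69.5):.2f} TOPS/W")     # ~0.1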

5.7 Scalability Analysis

We run a large batch of Set 2 in Table 1 on the SA and SparGD to evaluate the scalability of SparGD. Figure 30 shows the array utilization of the SA and SparGD. As the number of PEs increases, the utilization of SparGD remains near 100%, while the utilization of the SA keeps decreasing, because the SA cannot be filled when processing irregular GEMMs and the waste grows with the array size. Figures 31 and 32 show the area and power of the SA and SparGD. As the number of PEs increases, the area of both designs grows; because it needs more hardware resources, the area of SparGD grows faster than that of the SA. The power of the SA increases relatively slowly because its utilization is very low at large array sizes, whereas the power of SparGD increases rapidly due to its high utilization in large arrays. Figures 33 and 34 show the area efficiency and power efficiency of SparGD relative to the SA. Because low utilization keeps the SA's power low, the power efficiency of SparGD relative to the SA does not always increase, but its relative area efficiency increases quickly. Overall, SparGD has better scalability than the SA.
Fig. 30. Utilization.
Fig. 31. Area.
Fig. 32. Power.
Fig. 33. Area efficiency.
Fig. 34. Power efficiency.

6 Related Work

Sparsity is a key concern for the latest custom processors. Previous accelerators that consider sparsity can be divided into two types: those that handle only single-sided sparsity and those that handle double-sided sparsity. These sparsity-supporting accelerators include processors specifically designed for sparse GEMM. Accelerators with flexible interconnects can improve the communication efficiency of sparse data and provide better performance. This section introduces recent custom deep learning accelerators related to this article.

6.1 Sparsity

6.1.1 Single-sided Sparsity.

Eyeriss gates the multiplier when it sees an input activation of zero, but it does not gate the multiplier on zero weights [5]. This gating approach can save energy but cannot save execution time. Cnvlutin is a value-based approach that eliminates most of the ineffectual operations related to zeros, improving performance and energy with no accuracy loss [2]. Cnvlutin compresses activation values based on the ReLU operator, but it does not employ pruning to exploit weight sparsity. Cambricon-X exploits weight sparsity, keeping only non-zero weights in its buffer [39]. Cambricon-X exploits the sparsity and irregularity of NN models for increased efficiency, but it does not exploit activation sparsity. Unlike Eyeriss, Cnvlutin, and Cambricon-X, SparGD exploits both activation and weight sparsity.

6.1.2 Double-sided Sparsity.

SCNN and SparTen are recent sparse CNN accelerators that utilize both activation and weight sparsity [10, 26]. Specifically, SCNN employs a novel dataflow that maintains the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. SparTen achieves efficient inner-join operations by providing support for native two-sided sparse execution and memory storage. EIE performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing [13]. EIE uses a packed representation of weights and activations, passing only non-zero operands to multipliers. ExTensor finds the intersection of weights and activations in compressed data, operating only on useful computations [15]. GoSPA optimizes sparse convolution through two methods: first, it encodes the sparse data and filters out zero values, performing calculations only on non-zero values; second, it reorders the calculations so that the calculations of related data are executed in the same period [8]. Our work also exploits double-sided sparsity, but SparGD mainly targets sparse and irregular GEMMs.

6.1.3 For Sparse GEMM.

HIRAC proposes a hardware/software co-design architecture that can efficiently compute sparse GEMM without the need for complex interconnect networks. HIRAC improves PE utilization by compressing and converting sparse matrices into dense matrices using a scheme called SorPack. HIRAC also proposes a new graph-based hierarchical architecture to provide a scalable system that maximizes PE parallelism [29]. SWM utilizes the Winograd algorithm as well as the sparsity of activations and weights, and it proposes the DS scheme and the BCSR format to improve load balancing [33]. These latest solutions can effectively accelerate sparse GEMM, but they do not take into account the performance impact of dataflows. Our proposed SparGD not only supports sparse GEMM but also supports dynamic dataflows.

6.2 Flexible Interconnect

Eyeriss v2 is a DNN accelerator architecture designed for running compact and sparse DNNs [6]. To deal with widely varying layer shapes and sizes, it introduces a highly flexible on-chip network, called the hierarchical mesh, that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources. Furthermore, Eyeriss v2 can process sparse data directly in the compressed domain for both weights and activations and is therefore able to improve both processing speed and energy efficiency with sparse models. Eyeriss v2 uses a flexible NoC to support sparsity but targets small mobile CNNs instead of large GEMMs. MAERI is a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches [19]. MAERI uses a tree-based interconnection network to achieve flexible mapping, but it cannot support the sparsity of input features. FlexFlow can leverage the complementary effects among feature map, neuron, and synapse parallelism to mitigate the mismatch between the parallelism supported by the computing engine and the dominant parallelism of the workload [21]. FlexFlow develops a flexible dataflow architecture for different types of parallelism, but it does not target GEMMs. SIGMA proposes a flexible non-blocking interconnect [27]. SIGMA can support double-sided sparsity in the GEMM, but it suffers from insufficient utilization. SparGD, proposed in this article, adopts an efficient and flexible distribution network (the PE Bus) and reduction network (the PAT). Gemmini is a full-stack, open-source generator of DNN accelerators [9]. It provides flexibility through a two-level hierarchy. Accelerators generated by Gemmini use either the WS or the IS dataflow, but they do not support dynamic dataflow; moreover, Gemmini does not emphasize sparsity. STIFT is a strategy for spatio-temporal reduction in flexible DNN accelerator architectures [22]. It is capable of running any number of dynamic-size clusters in a non-blocking manner, and it has high area and power efficiency. Compared to STIFT, the PAT is more suitable for the pipelined process in SparGD; however, the PAT consumes more hardware resources, which we plan to improve in the future.

7 Conclusion

In this article, we design, implement, and evaluate SparGD, a state-of-the-art sparse GEMM accelerator with dynamic dataflow. SparGD has a specific PE structure, a flexible distribution network, and a pipelined reduction network, and it can dynamically configure its dataflow. SparGD achieves high performance through high array utilization and dynamic dataflow. For sparse and irregular GEMMs, our experiments show that the performance of SparGD is >30x better than that of the systolic array processor and >3.6x better than that of SIGMA. In addition, SparGD incurs only a small amount of additional hardware overhead.

References

[1]
Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, and Kim Hazelwood. 2021. Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High-performance Computer Architecture (HPCA’21). IEEE, 802–814.
[2]
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Computer Architecture News 44, 3 (2016), 1–13.
[3]
Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, and Zhi Yang. 2019. Seernet: Predicting convolutional neural network feature-map sparsity through low-bit quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11216–11225.
[4]
Amitabha Chakrabarty, Martin Collier, and Sourav Mukhopadhyay. 2009. Matrix-based nonblocking routing algorithm for Beneš networks. In 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns. IEEE, 551–556.
[5]
Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-state Circuits 52, 1 (2016), 127–138.
[6]
Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292–308.
[7]
Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. Retrieved August 7, 2023, from https://sparse.tamu.edu
[8]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized SParse convolutional neural network accelerator. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). 1110–1123. DOI:
[9]
Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In 2021 58th ACM/IEEE Design Automation Conference (DAC’21). IEEE, 769–774.
[10]
Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A sparse tensor accelerator for convolutional neural networks. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 151–165.
[11]
Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen. 2018. Recent advances in convolutional neural networks. Pattern Recognition 77 (2018), 354–377.
[12]
Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 2019. MASR: A modular accelerator for sparse RNNs. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT’19). IEEE, 1–14.
[13]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44, 3 (2016), 243–254.
[14]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28 (2015), 1–9.
[15]
Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. Extensor: An accelerator for sparse tensor algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 319–333.
[16]
Mark Horowitz. 2014. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-state Circuits Conference Digest of Technical Papers (ISSCC’14). IEEE, 10–14.
[17]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.
[18]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
[19]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. Maeri: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. ACM SIGPLAN Notices 53, 2 (2018), 461–475.
[20]
Jiajun Li, Shuhao Jiang, Shijun Gong, Jingya Wu, Junchao Yan, Guihai Yan, and Xiaowei Li. 2019. SqueezeFlow: A sparse CNN accelerator exploiting concise convolution rules. IEEE Transactions on Computing 68, 11 (2019), 1663–1677.
[21]
Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA’17). IEEE, 553–564.
[22]
Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio, and Tushar Krishna. 2022. STIFT: A spatio-temporal integrated folding tree for efficient reductions in flexible DNN accelerators. ACM Journal on Emerging Technologies in Computing Systems 19 (2022), 1–20.
[23]
Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet Tran, Álvaro López García, Ignacio Heredia, Peter Malík, and Ladislav Hluchỳ. 2019. Machine learning and deep learning frameworks and libraries for large-scale data mining: A survey. Artificial Intelligence Review 52 (2019), 77–124.
[24]
openai.com. 2018. AI and Compute. Retrieved March 7, 2023, from https://openai.com/blog/ai-and-compute/
[25]
Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. 2020. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 2 (2020), 604–624.
[26]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[27]
Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. Sigma: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA’20). IEEE, 58–70.
[28]
Ananda Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2020. A systematic methodology for characterizing scalability of DNN accelerators using scale-sim. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’20). IEEE, 58–68.
[29]
Hesam Shabani, Abhishek Singh, Bishoy Youhana, and Xiaochen Guo. 2023. HIRAC: A hierarchical accelerator with sorting-based packing for SpGEMMs in DNN applications. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA’23). 247–258.
[30]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[31]
Bo Wang, Sheng Ma, Guoyi Zhu, Xiao Yi, and Rui Xu. 2022. A novel systolic array processor with dynamic dataflows. Integration 85 (2022), 42–47.
[32]
Hanrui Wang, Zhekai Zhang, and Song Han. 2021. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA’21). IEEE, 97–110.
[33]
Di Wu, Xitian Fan, Wei Cao, and Lingli Wang. 2021. SWM: A high-performance sparse-Winograd matrix multiplication CNN accelerator. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, 5 (2021), 936–949.
[34]
Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A survey of design and optimization for systolic array based DNN accelerators. Computing Surveys 56 (2023), 1–37.
[35]
Rui Xu, Sheng Ma, Yaohua Wang, Xinhai Chen, and Yang Guo. 2021. Configurable multi-directional systolic array architecture for convolutional neural networks. ACM Transactions on Architecture and Code Optimization 18, 4, Article 42 (July 2021), 24 pages. DOI:
[36]
Rui Xu, Sheng Ma, Yaohua Wang, Yang Guo, Dongsheng Li, and Yuran Qiao. 2022. Heterogeneous systolic array architecture for compact CNNs hardware accelerators. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2022), 2860–2871. DOI:
[37]
Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. 2020. A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526 (2020).
[38]
Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. Ganax: A unified MIMD-SIMD acceleration for generative adversarial networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, 650–661.
[39]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
