1 Introduction
In recent years,
deep neural networks (DNNs) have seen significant improvements in accuracy and a rapid increase in applications on edge systems such as mobile phones and Internet of Things devices. They have become the key approach to solving tasks such as object detection, image classification and analysis, and semantic segmentation [
30]. However, these deep networks commonly comprise many layers, millions of parameters, and billions of operations, and they require tremendous storage and computation resources [
16,
27,
33,
43,
48], making it difficult to meet the energy-efficiency and performance requirements of resource-constrained devices. To address these issues, several model compression techniques [
10,
19,
20,
25,
36,
52], efficient dataflow techniques [
39,
46], and accelerators [
1,
2,
4,
7,
8,
9,
17,
18,
23,
38,
54,
55,
56,
57,
59] have been proposed and widely investigated in recent years.
In particular, exploiting sparsity by pruning DNNs has emerged as a promising approach for achieving energy-efficient and high-performance DNN solutions [
1,
2,
4,
7,
8,
9,
17,
18,
23,
38,
54,
55,
56,
57,
59]. Pruning methods can be broadly classified as either unstructured or structured, producing unstructured sparsity (zeros are distributed in an irregular pattern) or structured sparsity (zeros are in a regular pattern) in the model weights. A number of sparsity-aware DNN accelerators have been proposed to leverage unstructured pruning [
1,
2,
4,
7,
8,
17,
18,
23,
38,
54,
55,
56,
57] due to its ability to achieve a high compression ratio while maintaining model accuracy. However, exploiting unstructured sparsity tends to require complex hardware designs, and current accelerators face two key challenges in handling this type of sparsity.
The first challenge is the irregularity seen when accessing the data. To support the irregular activation and weight sparsity introduced by unstructured pruning, complex logic is required to pair non-zero values, which incurs additional power and area costs. For example, the prefix sum and priority encoder used to exploit sparsity consume 62.7% and 46% of the total area and power of SparTen [
17], respectively. In addition, irregularity in activations and weights can cause the partial sums to be unevenly distributed. Previous works [
50,
55] suffer from memory contention to update these sums, resulting in compute stalls and low utilization of the multiplier array. To achieve a better utilization rate with sparse workloads, STICKER [
55] requires offline software optimization to rearrange the input activations.
Second, we see that workload imbalance can be a significant challenge for accelerators. Sparsity in weights and activations across filters, channels, and layers can vary greatly, which results in imbalanced workload distribution across compute units. Units with smaller workloads must remain idle while waiting for other compute units with higher workloads to complete the computation, leading to a low utilization rate. For instance, in SCNN [
38] the multiplier utilization rate falls below 60% when the overall weight and activation sparsity is more than 60% in a convolutional layer.
As an alternative, recent work [
35] has shown that, compared to unstructured pruning, structured pruning can achieve comparable accuracy while being more competitive in both computational efficiency and storage when computing
convolutional neural networks (CNNs). In addition, structured pruning produces regular data sparsity, which simplifies memory access patterns and reduces hardware complexity. To leverage the advantages of structured pruning, recent sparsity-aware DNN accelerators [
21,
59] take a hardware-software co-design approach to reduce data irregularity and benefit from structured sparsity. These works propose structured pruning schemes that greatly reduce weight irregularity and processing complexity. However, a major problem remains even when using structured pruning: dynamic and irregular activation sparsity is still handled inefficiently. Recent work [
59] continues to store and access zero-valued activations from off-chip memory, resulting in unnecessary access energy and data transfer.
Cascading Structured Pruning (CSP) [
21] is another example of recent work that reduces weight irregularity to enable sequential access of activations, thus reducing power-consuming off-chip activation accesses. Unfortunately, this previous work is unable to skip either the access or the computation of zero-valued activations. This leads to low effectual
processing element (PE) utilization, ineffectual zero-valued multiplications, and large overheads when accessing input data. Overall, prior works are unable to both reduce irregularity and eliminate unnecessary computation and memory accesses of zero-valued data while maintaining high PE utilization and efficiency. A solution that accomplishes all of these goals is needed to build a highly efficient sparsity-aware system.
To overcome the limitations in sparsity-aware DNN accelerators and to achieve energy-efficient DNN inference in edge devices, we leverage structured pruning and propose our novel
Multiply-and-Fire (MnF) technique. MnF presents an event-driven dataflow that supports structured sparsity for convolution layers and both structured and unstructured sparsity for fully connected layers. In this event-driven dataflow, we take an
activation-centric approach and view one convolution operation as
one activation (an event) multiplied by all relevant weights of the filters instead of the typical vector-vector multiplication or matrix-vector multiplication methods. Similarly, in fully connected layers, we view computation as one activation multiplied by all relevant weights of neurons. This technique allows sparsity to be handled efficiently in a number of ways. (1) As all the computations of a single non-zero activation can be performed in parallel, that activation is maximally reused and only needs to be read from off-chip (global DRAM) and on-chip (local SRAM) memory once. Moreover, weights and partial sums are highly reused from the low-cost local memory in the PE. Memory accesses are a significant source of energy consumption in prior works [
15,
18,
21]; minimizing these accesses increases overall energy efficiency. (2) The proposed novel event-driven dataflow exploits irregular activation sparsity naturally and avoids ineffectual zero-valued multiplication without complex logic overhead like the prefix sum and priority encoder implemented in SparTen [
17] or the content-addressable memory used by ExTensor [
23], translating to both energy and area gains. (3) In convolution layers, both kernel-level parallelism (a two-dimensional weight matrix assigned to a channel of input activations) and filter-level parallelism (a three-dimensional matrix consisting of multiple kernels) are exploited. In fully connected layers, output-neuron parallelism is exploited. With our proposed efficient hardware, multipliers can maintain a utilization rate above 90% across most sparsity levels.
Overall, the main contributions of this work are as follows:
—
We propose an event-driven dataflow, the MnF methodology, which can support structured sparsity in convolution layers and both structured and unstructured sparsity in fully connected layers. It exploits activation sparsity and significantly improves activation reuse compared to the traditional dataflow techniques (see Section
3).
—
We propose a sparsity-aware accelerator that provides high energy efficiency and high performance by leveraging the proposed event-driven dataflow to allow one-time memory access to activation data. It is designed to exploit both kernelwise and filterwise parallelism and enables a single-cycle processing time of activation data (see Section
4).
—
We present a thorough study of modern dataflow techniques and use a variety of sparsity levels and convolutional layers to demonstrate the advantages of MnF (see Sections
2 and
5).
—
Finally, we perform a detailed study and performance comparison with related works. The proposed accelerator surpasses a recent sparsity-aware DNN accelerator, CSP [
21], by a geometric mean of 11.2× on the evaluated models in terms of energy efficiency (inferences/J) and a geometric mean of 1.41× speedup (inferences/second) on most of the evaluated models (see Section
5).
2 Background and Motivation
DNNs typically consist of numerous layers and a vast number of parameters, leading to considerable computational and storage requirements. These demands pose significant challenges in meeting the high-performance and energy-efficiency requirements of devices intended for deployment in resource-constrained environments. In Reference [
14], it was demonstrated that these DNNs are often over-parameterized, meaning that they have more parameters than necessary to perform their tasks. As a result, there is considerable redundancy within these models. To address this issue, a promising compression technique called pruning [
3,
19,
34,
36,
52] has emerged, which involves removing less important weights from the DNNs. The resulting sparse DNN retains accuracy similar to the dense model but has a significantly reduced model size and much lower processing requirements [
3,
20]. A number of prior works are designed to benefit from sparse DNNs to achieve better performance and energy efficiency [
1,
2,
4,
7,
8,
9,
17,
18,
23,
38,
54,
55,
56,
57,
59]. In the following subsections, we first present the basic concept of sparsity in DNNs, then discuss sparsity-aware accelerators, and finally present the motivation for this work.
2.1 Sparsity in Deep Neural Networks
Pruning introduces regular or irregular sparsity in weight parameters by applying structured or unstructured techniques, respectively. Unstructured pruning is able to achieve a high compression ratio while maintaining model accuracy. However, it introduces irregular weight sparsity as weights are randomly removed across the model [
20], which can pose challenges in hardware accelerator design. Structured pruning, in contrast, has a lower compression ratio but produces regular weight sparsity that is more hardware-friendly. In addition to pruning, the
rectified linear activation function (ReLU) used in DNN layers also introduces irregular activation sparsity [
28]. Weight-level sparsity is static, as neural network parameters are known before the computation begins. Activation-level sparsity, however, is dynamic and is only known at run time for a specific input. Due to such irregular and dynamic sparsity, it is difficult to reap the benefits of sparse models with traditional hardware solutions such as CPUs, GPUs, TPUs, and ASIC DNN accelerators. As a result, dedicated sparsity-aware DNN accelerators are required to handle sparsity efficiently and avoid unnecessary zero-valued multiplications between activations and weights [
1,
5,
6,
7,
8,
12,
15,
21,
59].
2.2 Sparsity-aware Accelerator
Many prior accelerators aim to benefit from the high compression ratio of unstructured pruning by designing specific dataflows and hardware to handle irregular weight and activation sparsity [
1,
7,
8,
12]. However, they cannot fully benefit from the high compression ratio due to the irregularity in the sparse data. More recent works shift the focus to structured pruning to leverage the regular sparsity in weights [
21,
32,
45,
59]. Although the compression ratio of structured pruning can be lower than that of unstructured pruning, the storage requirement of a structurally pruned model can be smaller or comparable because additional indices are not required to identify sparse data in a compressed format [
35]. Both prior structured and unstructured sparsity-aware accelerators manage to improve the performance of sparse processing, but they suffer from a number of drawbacks, such as ineffectual computations and costly pairing logic. In the following, we discuss the designs and drawbacks of two representative works.
2.2.1 Unstructured Sparsity-aware Accelerator.
Eyeriss v2 [
8] is one representative work that handles both irregular weight and activation sparsity using the row stationary dataflow proposed in Eyeriss [
7]. It stores and processes sparse data in a
compressed sparse column (CSC) format with a hierarchical mesh topology
network-on-chip (NoC). Although Eyeriss v2 is able to avoid the storage and access of zero-valued data and skip zero-valued multiplications, compressed data processing introduces several challenges. To represent the sparse data in the CSC format, the count and address value of each piece of data have to be stored, which adds memory overhead to this compression scheme. As shown in Figure
1, to identify the data, the CSC format requires 24 extra values to be stored with the 11 non-zero activations and weights. To process CSC data, read dependencies are introduced in the dataflow, where the address of the data has to be read before the actual value for both activations and weights. The input activations have to be read before the weights to find the right matching pairs. This dependency introduces unnecessary input accesses, as non-zero activations must still be read from memory even if all corresponding weights are zero. In addition, to support sparsity, the control logic and extra memories consume 73% additional area compared to the original Eyeriss implementation with limited sparsity support. The control logic in the PE alone (without multiply-and-accumulate logic) consumes an average of 25% of the power.
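To make the index overhead of such compressed formats concrete, the sketch below encodes a small, purely hypothetical weight tile in plain CSC. Eyeriss v2 uses its own CSC variant (with per-element counts), so this is only an illustration of the kind of metadata these formats carry, not the accelerator's exact scheme.

```python
def csc_encode(mat):
    """Encode a dense 2-D matrix (list of rows) in plain CSC: the non-zero
    values, their row indices, and one column pointer per column boundary."""
    rows, cols = len(mat), len(mat[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(cols):                  # walk column by column
        for i in range(rows):
            if mat[i][j] != 0:
                values.append(mat[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))        # end of this column in `values`
    return values, row_idx, col_ptr

# Hypothetical 4x4 weight tile with 5 non-zeros: the 5 values need 5 row
# indices plus 5 column pointers of pure metadata, and the pointers must be
# read before the values can be located.
tile = [[0, 2, 0, 0],
        [1, 0, 0, 3],
        [0, 0, 0, 0],
        [0, 4, 5, 0]]
print(csc_encode(tile))
```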
2.2.2 Structured Sparsity-aware Accelerator.
Since one of the key challenges of sparsity-aware DNN accelerators is to efficiently handle irregular sparsity, CSP [
21] employs a hardware-software co-design methodology to regularize weight sparsity. Although CSP avoids complex logic for handling irregular weight sparsity in its implementation, activation sparsity is not exploited. Since activation sparsity resulting from the ReLU function is typically around 50% [
28], not exploiting activation sparsity can lead to around 50% more off-chip and on-chip data accesses than necessary. In addition, zero-valued activations are computed in CSP, and these computations are ineffectual since they are multiplications with zeros. Figure
2 shows how CSP accesses zero-valued activations (highlighted in red and the 9 gray boxes in the first step) and processes each accessed value with the corresponding weights. These ineffectual computations are unnecessary and incur avoidable power and energy overheads.
CSP observes that re-fetching activation data from off-chip and on-chip global memory consumes a significant amount of energy. It is therefore designed to eliminate the re-fetch of activation data from off-chip memory. By preserving more data reuse opportunities, CSP achieves better energy efficiency than the prior work Cambricon-S [
59]. However, the dataflow still requires activation data to be accessed multiple times from on-chip memory. Because the pruning algorithm of CSP prunes weights at the same positions from one kernel to all later kernels, the positions of non-zero weights in kernel2 (as seen in Figure
2) can only be a subset of those in kernel1. As a result, the subset of activations (the 12 gray boxes in the second step) accessed with kernel1 must be re-accessed from on-chip memory to be processed with kernel2. This leads to a high number of activation accesses and reduces energy efficiency.
2.3 Motivation
To leverage the benefits of pruning, sparsity-aware accelerators must be able to process sparse data efficiently. However, prior unstructured sparsity-aware accelerators face multiple challenges brought about by the irregularity in both weight and activation data. The key challenges include memory overhead and the complex, expensive hardware logic needed to find matching pairs of non-zero weights and activations, as seen in Section
2.2.1. Recent structured sparsity-aware accelerators target a hardware-software co-design to reduce irregularity in weights, but they are unable to fully exploit activation sparsity and skip all ineffectual multiplications. This work aims to overcome these limitations with a novel event-driven dataflow and energy-efficient accelerator.
In convolution layers, the proposed event-driven dataflow targets structured pruning instead of unstructured pruning. As shown in Reference [
35], although structured pruning has a lower compression ratio, it can be more competitive than unstructured pruning in terms of storage requirement and computational efficiency. With the same accuracy and quantization method, the compression ratio of structured pruning is about
\(1.14\times\) to
\(2.56\times\) lower than that of unstructured pruning. However, the storage requirement of the structurally pruned model is in most cases
\(1.70\times\) to
\(3.00\times\) smaller than the unstructurally pruned model [
35]. In terms of computation efficiency, recent work [
35] has found that the weight-reduction-to-speedup ratio is approximately 1.0 for structurally pruned models, which means an
\(x\times\) weight reduction can result in an
\(x\times\) speedup. In unstructurally pruned models, by contrast, the ratio is between 2.7 and 3.5, which means the high compression ratio does not translate into a correspondingly high speedup in hardware. Since structured pruning results in regular sparsity and is comparable to unstructured pruning in terms of accuracy and storage, it is leveraged in this work. In fully connected layers, the proposed dataflow is capable of supporting both structured and unstructured pruning.
Overall, the proposed event-driven dataflow can handle activation sparsity efficiently without complex, high-overhead logic. It allows all ineffectual multiplications to be skipped and significantly improves the reuse of activation data to achieve low overall energy consumption. The implemented accelerator leverages the dataflow and enables a highly parallel computation of each activation with high energy efficiency and performance.
3 Multiply-and-Fire: Dataflow
In this section, we present our novel event-driven dataflow for MnF and outline its two distinct phases: the multiply phase and the fire phase. A simple example is used as an overview of the MnF event-driven dataflow. This example demonstrates how the proposed design aims to solve the two major problems mentioned in Section
2: (1) the complex and high-overhead logic in finding matching pairs of activations and weights and (2) the inability to fully exploit activation sparsity and skip all ineffectual computations. Meanwhile, to remove data irregularity and avoid additional memory overhead in unstructurally pruned models, we leverage the structured pruning techniques from previous works [
31,
34]. As a result, pruned weights are in a regular shape and can be stored and accessed in a dense format without incurring any memory overhead.
As shown on the left of Figure
3, each non-zero activation is treated as an event. The events are numbered according to the order in which they are sent. Zero-valued activations are not treated as events and are therefore not involved in the computation.
The multiply phase is illustrated in the middle of Figure
3. The dataflow takes an activation-centric approach in which we focus on processing one non-zero activation (event) at a time. Parallelism is exploited in the kernel (two-dimensional weight matrix, each assigned to a channel of input activations) and filter (three-dimensional matrix consisting of multiple kernels) levels in convolutional layers and the output neuron level in fully connected layers. Starting from the first event, all the corresponding weights are retrieved and sums are computed to update relevant output neurons. After processing the first activation, the activation can be discarded. Then the second non-zero activation is fetched and processed in a similar way. This mechanism reuses activations to a great extent because all computations requiring the activation are done at once and each activation only needs to be read once from the memory to the computing unit. Following the multiply phase, the right of Figure
3 shows
the fire phase. The
output feature maps (OFMs) are traversed to find values that exceed the ReLU threshold (zero). All values greater than the threshold are fired to the next layer. Since zero-valued activations from the OFM are discarded in the fire phase, they are never stored or accessed in subsequent computation. Note that Figure
3 only provides one example. The fire phase can perform more than the ReLU function, which is discussed in the following subsection. Together, the multiply and fire phases allow dynamic and irregular activation sparsity to be exploited naturally without any extra logic and enable all ineffectual computations to be skipped.
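As a rough functional illustration of the overall flow (not the hardware itself), one layer under the MnF dataflow can be sketched as below. The table-like `weights[index]` abstraction is an illustrative assumption that simply lists the (output, weight) contributions of each input position; the concrete convolutional and fully connected index arithmetic is detailed in Sections 3.1.1 and 3.1.2.

```python
def mnf_layer(events, weights, out_size):
    """One layer under the MnF dataflow: multiply phase, then fire phase.

    events:  list of (value, index) pairs for non-zero input activations only;
             zero-valued activations never appear here.
    weights: weights[index] is a list of (output_index, weight) pairs that the
             input at `index` contributes to (illustrative abstraction).
    """
    ofm = [0] * out_size
    # Multiply phase: each event drives all of its MACs at once
    # (sequential here; issued in parallel in hardware).
    for value, index in events:
        for out_index, w in weights[index]:
            ofm[out_index] += value * w
    # Fire phase: only outputs above the ReLU threshold become new events.
    return [(v, i) for i, v in enumerate(ofm) if v > 0]
```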
3.1 Multiply Phase
We consider two types of input events to support the acceleration of convolutional layers and fully connected layers.
3.1.1 Convolutional Layers.
Each non-zero activation (i.e.,
IFM[I_ch][I_i][I_j] in Figure
4) is processed sequentially by the PE array, as represented by the first five lines of the algorithm.
I_c,
I_w, and
I_h represent the number of input channels, width, and height of
input feature maps (IFMs), respectively. In this layer, both kernelwise and filterwise parallelism are exploited. We define a kernel as a two-dimensional matrix with width and height, and a filter as a three-dimensional matrix consisting of width, height, and channel size. Kernelwise parallelism means that the MAC computations of a single activation on a single kernel are processed in parallel. Filterwise parallelism means that the MAC computations of a single activation on multiple filters are processed in parallel. In the PE array, to exploit filterwise parallelism between PEs, each PE handles (
O_c /
pe_total) OFMs, where
O_c is the total number of OFMs and
pe_total is the total number of PEs. Since the total number of OFMs can exceed the number of PEs, each PE is designed with the capability to exploit filterwise parallelism. In addition, to exploit kernelwise parallelism within the PE, all the weights from a single kernel that need to be multiplied with the activation are read, and the results are used to update the relevant output neurons in parallel. As each PE is uniquely mapped to specific filters and generates the corresponding output feature maps, weights are highly reused within the PE and partial sums are immediately accumulated in the PE.
As an example, consider a convolutional layer with an IFM size of 4 × 4 × 3 and four 3 × 3 × 3 weight filters shown in Figure
4. Assuming that the kernel’s stride is 1 along the
x- and
y-axes and the activation value in the IFM at coordinates
IFM[0][1][1] is non-zero, such as 100, as shown in the top half of Figure
4, each PE is mapped to a different filter and processes different output feature maps in parallel. When a PE in the event-driven dataflow receives an event, it performs the steps outlined in the following algorithm (assuming zero padding and a single input/output channel, indexed with 0):
(i)
Read
w[0][0][1][1] and multiply it with activation
IFM[0][1][1]. Then add the result with the previous result of
OFM[0][0][0] to update
OFM[0][0][0] as shown in Figure
4 MAC0. (This operation can be represented as
OFM[0][0][0] +=
w[0][0][1][1] *
IFM[0][1][1])
(ii)
Shift the kernel right and perform
OFM[0][0][1] +=
w[0][0][1][0] *
IFM[0][1][1] as shown in Figure
4 MAC1.
(iii)
Shift the kernel down and perform
OFM[0][1][0] +=
w[0][0][0][1] *
IFM[0][1][1] as shown in Figure
4 MAC2.
(iv)
Shift the kernel right and perform
OFM[0][1][1] +=
w[0][0][0][0] *
IFM[0][1][1] as shown in Figure
4 MAC3.
(v)
All of the above operations can be handled by separate MAC units to allow for kernelwise parallelism.
In this way, the event-driven dataflow fully utilizes a single input activation event, performs all necessary multiplication operations, and updates all the required output neuron values in a convolutional layer’s OFMs. As we demonstrate later in our hardware design, all multiplication operations for a single input event can be processed at once, in parallel, in the hardware.
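The following minimal Python sketch captures this per-event update, assuming a stride of 1 and no padding as in the example of Figure 4; in hardware, the two inner loops are unrolled across MAC units rather than executed sequentially.

```python
def process_event_conv(act_val, ch, i_i, i_j, weights, ofm):
    """Issue every MAC that a single non-zero activation contributes to.

    weights: [O_c][I_c][K][K] structurally pruned filters, stored densely
    ofm:     [O_c][O_h][O_w]  running partial sums (stride 1, no padding)
    """
    o_c = len(weights)
    k = len(weights[0][0])
    o_h, o_w = len(ofm[0]), len(ofm[0][0])
    for f in range(o_c):                     # filterwise parallelism (across PEs)
        for ki in range(k):                  # kernelwise parallelism (inside a PE)
            for kj in range(k):
                oi, oj = i_i - ki, i_j - kj  # output position this weight feeds
                if 0 <= oi < o_h and 0 <= oj < o_w:
                    ofm[f][oi][oj] += act_val * weights[f][ch][ki][kj]
```

For IFM[0][1][1] in the 4 × 4 example, only the kernel offsets (1, 1), (1, 0), (0, 1), and (0, 0) pass the bounds check, reproducing MAC0 through MAC3 above.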
In terms of data storage format, this work packs the activation value together with its coordinate information in a specific format: input_value + channel_id + neuron_address. The channel_id indicates the channel to which the input activation belongs. The neuron_address indicates the first OFM neuron address that needs to be updated. The format is similar to the Coordinate Format (COO) and requires similar storage. It allows an activation and its coordinate information to be read in one cycle instead of the two cycles required by CSC/CSR, which is more hardware-friendly. Since weights are structurally pruned, they can be stored and accessed in a dense format.
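As a purely illustrative example of this packing, an event word could be formed as below. The field widths are assumptions: the text fixes the field order (input_value, channel_id, neuron_address) but not the exact bit widths.

```python
# Hypothetical field widths; only the field order comes from the text.
VALUE_BITS, CHANNEL_BITS, ADDR_BITS = 8, 6, 18

def pack_event(value, channel_id, neuron_address):
    """Pack one activation event so value and coordinates are read in a single cycle."""
    word = (value & ((1 << VALUE_BITS) - 1)) << (CHANNEL_BITS + ADDR_BITS)
    word |= (channel_id & ((1 << CHANNEL_BITS) - 1)) << ADDR_BITS
    word |= neuron_address & ((1 << ADDR_BITS) - 1)
    return word

def unpack_event(word):
    """Recover (value, channel_id, neuron_address) from a packed event word."""
    value = word >> (CHANNEL_BITS + ADDR_BITS)
    channel_id = (word >> ADDR_BITS) & ((1 << CHANNEL_BITS) - 1)
    neuron_address = word & ((1 << ADDR_BITS) - 1)
    return value, channel_id, neuron_address
```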
3.1.2 Fully Connected Layers.
Although this work focuses on structured sparsity achieved by structured pruning, unstructured sparsity in fully connected layers is also supported. The computation process is similar to that of the convolutional layer: each non-zero activation (i.e., I[I_i]) is processed sequentially by the PE array, as represented by the first three lines of the algorithm. I_t represents the number of input neurons. Since there are no kernels or filters in a fully connected layer, we exploit output-neuronwise parallelism across the PE array and within the PE. In the PE array, each PE handles (O_t / pe_total) output neurons, where O_t is the total number of output neurons and pe_total is the total number of PEs. Inside each PE, multiple weights from each neuron are read and multiplied with the input activation to update the relevant output neurons in parallel.
Take a two-layer fully connected network with unstructured sparsity as an example in which there are 5 and 10 neurons in the first and second layers, respectively. If neuron 1 in the first layer has a non-zero output neuron value such as 100, then the input event delivered to the multiply module of the second layer comprises the following information: (a) activation value of 100 and (b) neuron address of
1. Suppose the 10 output neurons are assigned to 2 PEs. When PE1 receives the input event, it performs the steps outlined in Figure
5(b). It reads all four non-zero weights linked with neuron address
1, skips neurons linked with zero-valued weights, and multiplies the non-zero weights by the input activation (in this example, 100). The multiplication results are subsequently added to the first four neurons (those with non-zero weights) of the second layer. PE2 simultaneously processes the same activation value (i.e., 100) with the rest of the output neurons in the same way. In this way, the event-driven dataflow performs all the required multiplications and updates the output neuron values in fully connected layers. For structured sparsity, the same process applies and may require fewer PEs since entire neurons are pruned.
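A minimal sketch of one PE's share of this work is shown below, using a dense weight matrix with zeros marking pruned connections for clarity; the PE indexing and contiguous slicing are illustrative assumptions (the accelerator itself stores unstructured weights in the compressed format described next).

```python
def process_event_fc(act_val, in_neuron, weights, out_accum, pe_id, pe_total):
    """One PE's updates for a single fully connected layer event.

    weights:   [O_t][I_t] matrix; zeros mark pruned (removed) connections
    out_accum: [O_t] running partial sums of the output neurons
    PE pe_id owns a contiguous slice of O_t / pe_total output neurons.
    """
    o_t = len(weights)
    per_pe = o_t // pe_total
    start = pe_id * per_pe
    for o in range(start, start + per_pe):   # output-neuronwise parallelism in the PE
        w = weights[o][in_neuron]
        if w != 0:                           # skip zero-valued (pruned) weights
            out_accum[o] += act_val * w
```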
In terms of storage format, activations are stored similarly to the convolutional layer, while weights follow a different format depending on the pruning technique applied. In structured pruning, neurons of less importance are entirely removed from the layer, resulting in a smaller dense layer containing only the neurons of high importance. In this case, we store weights in a dense format without additional index information. In contrast, unstructured pruning removes individual weights of low importance from each neuron, resulting in a sparse layer with the same number of neurons but fewer weights per neuron. We store the sparse weights in a compressed format similar to CSR, but instead of using row and column pointers, we use a single neuron index to identify which output neuron each weight is connected to, minimizing the memory overhead.
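One possible encoding along these lines is sketched below. The exact packing is not spelled out in the text, so this per-entry layout is an assumption: each non-zero weight simply carries its output-neuron index, grouped by input neuron.

```python
def compress_fc_weights(weights):
    """Group non-zero FC weights by input neuron, keeping one output-neuron
    index per weight (hypothetical packing; not the accelerator's exact format).

    weights: [O_t][I_t] matrix with zeros marking pruned connections.
    Returns: table[i] = list of (out_neuron_index, weight) for input neuron i.
    """
    o_t, i_t = len(weights), len(weights[0])
    table = [[] for _ in range(i_t)]
    for i in range(i_t):
        for o in range(o_t):
            if weights[o][i] != 0:
                table[i].append((o, weights[o][i]))
    return table
```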
3.2 Fire Phase
The fire phase can be viewed as a compression phase that is commonly not considered by the recent works [
8,
12,
21]. This phase performs four tasks: (1) ReLU, (2) max/average pooling, (3) formatting and compacting the output activation (output neuron) value, and (4) ordering the sequence of output activations. After completing the multiply phase, the output values in the convolutional or fully connected layer undergo a comparison with the ReLU threshold, typically set at zero. Output values below the threshold are filtered out by the fire module and are not transmitted to the next layer. However, if the value of the output neuron exceeds the threshold, then it is either transmitted to the pooling module for further processing or directly formatted as an input activation to the next layer. For convolutional layers followed by max/average pooling, the pooling operation is fused with ReLU and performed on-the-fly. Each output value goes through ReLU first and is collected based on the size of the pooling operation. When all values from a pool are collected, the maximal (max-pooling) or average (average-pooling) value is selected as the final value and then formatted and “fired” as an input activation to the next layer. This means only non-zero output activations are fired to the next layer instead of all activations including zeros. This process naturally compresses the sparse output activations. In addition, the fire phase can control the firing sequence of activation values by traversing the OFMs in different orders. This represents a fundamental characteristic of the event-driven dataflow. The ordering capability of the fire phase provides additional flexibility to the dataflow, which we consider as future work for improving energy efficiency. This is further discussed in Section
7.
The fire phase, together with the multiply phase, enables the accelerator to compute data in an event-driven manner. It significantly reduces the number of computations in the network as well as the need to re-compress the activations after ReLU into a corresponding compression format, leading to lower inference latency and better energy efficiency.
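A functional sketch of the fused ReLU/max-pooling path of the fire phase is given below, assuming a pooling window of 2 × 2 that evenly divides the OFM; average pooling and the output ordering policy would follow the same structure.

```python
def fire_phase(ofm, pool=2):
    """Fuse ReLU with max pooling and emit only non-zero outputs as events.

    ofm: [O_c][O_h][O_w] partial sums produced by the multiply phase.
    Returns a list of (value, channel_id, position) events for the next layer.
    """
    events = []
    for f, plane in enumerate(ofm):
        h, w = len(plane), len(plane[0])
        for i in range(0, h, pool):
            for j in range(0, w, pool):
                window = [max(plane[i + di][j + dj], 0)   # ReLU on-the-fly
                          for di in range(pool) for dj in range(pool)]
                value = max(window)                       # max pooling over the window
                if value > 0:                             # only non-zero outputs fire
                    events.append((value, f, (i // pool, j // pool)))
    return events
```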
6 Related Work
Multiple accelerators [
1,
2,
4,
7,
8,
9,
17,
18,
21,
23,
32,
38,
40,
45,
53,
54,
55,
56,
57,
58,
59] are designed to leverage the benefits of pruned DNNs to achieve high energy efficiency and throughput. EIE [
18] and Cnvlutin [
2] are early efforts that exploit unstructured sparsity to enhance DNN throughput and energy efficiency. EIE exploits both irregular activation and weight sparsity, but it only supports fully connected layers. Although Cnvlutin supports convolutional (Conv) layers, it only exploits irregular activation sparsity and still performs computations with zero-valued weights.
SCNN [
38] is the first sparsity-aware accelerator that takes advantage of unstructured sparsity in both activations and weights. Although it is able to avoid all zero-valued computations, the input-stationary dataflow together with the Cartesian product technique introduces non-zero-valued multiplications that do not exist in the original convolution algorithm. Furthermore, the dataflow employed produces a large number of irregularly distributed partial sums that require a large accumulator buffer and complex hardware logic to update the results. Since the partial sums can update the same output, memory contention is possible, which can lead to compute stalls and a low multiplier utilization rate. STICKER [
55] and Reference [
50] follow a dataflow similar to SCNN's and try to reduce the memory overhead. Instead of the large multi-banked buffer used by SCNN, STICKER and Reference [
50] employ a set-associative PE design to update the irregular partial sums. Although this significantly reduces the storage area, memory contention still remains when updating the partial sums. Furthermore, contention in a set of PEs can stall the whole computation process and lower the multiplier utilization rate, because these designs process data in batches and the PE array can only start processing the next batch when all operations in the current batch are done. SparTen [
17] and GoSPA [
12] are recent works that use an intersection-based method to exploit both unstructured weight and activation sparsity and perform only the necessary computations. SparTen has demonstrated better energy efficiency and latency than SCNN; however, it uses an expensive prefix-sum operation to identify the non-zero pairs of weights and activations. Unlike SparTen, GoSPA [
12] takes an on-the-fly intersection approach and employs a dedicated activation processing unit to identify the valid non-zero pairs of weights and activations, resulting in smaller energy overheads than SparTen.
In contrast to the aforementioned works, Cambricon-S [
59], CSP [
21], SPOTS [
45], and S2TA [
32] seek to exploit structured sparsity to accelerate sparse DNNs. Cambricon-S and CSP use a hardware-software co-design approach that regularizes the weight sparsity, but they cannot handle the irregular activation sparsity efficiently. SPOTS is a systolic-array-based design; it can skip ineffectual multiplications when the input and weight blocks contain all zeros but is unable to do so when a block contains both non-zero and zero values. S2TA goes a step further and exploits structured sparsity not only in the weights but also in the activations. To enable structured activation sparsity, S2TA implements a specialized dynamic activation pruning technique and a time-unrolled hardware architecture. Similarly, we leverage the structured pruning technique to regularize weight sparsity in this work, but we employ a straightforward MnF technique that naturally exploits unstructured sparsity in activations and skips all ineffectual computations. This technique eliminates complex hardware overhead and achieves highly efficient event-driven computations.
7 Discussions and Limitations
MnF achieves better performance and energy efficiency than prior works, but the current hardware design has limited support for different kernel sizes. The PE works best with kernel widths that are multiples of 3 and with pointwise
\(1\times 1\) kernels. It is able to handle other kernel sizes such as
\(11\times 11\) kernels, but the utilization rate can be lower and the energy efficiency worse, as seen in Section
5. However, the
\(3\times 3\) and
\(1\times 1\) kernel sizes are commonly used and constitute a large number of layers in many recent CNNs [
22,
24,
27,
43,
47], including the MobileNet and ResNet-50 models evaluated in this work. Therefore, we did not extend this support in the present work. Future work can look into supporting various kernel sizes to optimize performance and energy efficiency.
As seen from Figure
7(b), MnF has a large number of on-chip weight memory accesses. This can be optimized in future work by caching the two-dimensional kernel in the MAC unit and leveraging the fire phase of the proposed dataflow to order and enforce the processing sequence of input activations. Since an activation from an input channel is only multiplied by the kernel from the same channel, processing all activations from the same channel first allows the corresponding kernel to be maximally reused.
In addition, the current design only exploits kernelwise and filterwise parallelism in the PE. This means parallelism is limited to the maximum number of computations required by a single activation. In future work, we can look into processing multiple activations in the same PE to exploit parallelism between activations and further improve latency.