4.1 Overview
Figure 7 shows the hierarchical architecture of ReHarvest. A ReRAM-based PIM chip is composed of multiple tiles, interconnection networks, and other peripheral circuits. The I/O interface bridges the global data buffer (GDB) and the host main memory. The input/output feature maps and the intermediate data are stored in the GDB, which is connected to each tile via global bus-based networks, i.e., the global input network (GIN) and the global output network (GON). The tagging controller (TC) tags each packet with the coordinate of the input feature map. The adder group aggregates partial sums (Psums). Each tile is composed of a ReRAM-based processing engine and its auxiliary circuits. The control unit generates control signals according to the micro-instructions of PIM accelerators [4, 41]. The eDRAM serves as a buffer for input/output feature maps. The pooling unit and the activation unit are special circuits for pooling operations and activation functions, respectively. These components in a tile are connected to a shared bus. The ReRAM-based processing engine contains a set of input/output registers (IRs/ORs), 128 crossbar (XB) arrays, multiplexers (MUXs), ADCs, and shift-and-add units (S&A). The number of ADCs is equal to the number of ReRAM crossbar arrays in each tile. Each ADC is connected with a MUX and an S&A. The 128-to-1 MUX is connected to 128 bitlines of different crossbar arrays so that a single ADC can be multiplexed by 128 ReRAM crossbar arrays. The S&A circuit composes the final result from partial results.
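For concreteness, the per-tile resources above can be summarized as a small configuration sketch in Python; the 128x128 crossbar dimension is our assumption, inferred from the 128\(\times\)384 partitioning example in Section 4.3, and the class name is ours.

```python
from dataclasses import dataclass

@dataclass
class TileConfig:
    """Per-tile resources of ReHarvest as described in the overview.

    The crossbar dimension is inferred from the 128x384 matrix example
    in Section 4.3 and should be treated as an assumption.
    """
    num_crossbars: int = 128   # ReRAM crossbar (XB) arrays per tile
    num_adcs: int = 128        # one ADC per crossbar array in the tile
    mux_fan_in: int = 128      # each ADC sits behind a 128-to-1 MUX
    crossbar_rows: int = 128   # wordlines per crossbar (assumed)
    crossbar_cols: int = 128   # bitlines per crossbar (assumed)

cfg = TileConfig()
assert cfg.num_adcs == cfg.num_crossbars   # stated invariant: #ADCs == #crossbars
```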
4.2 Crossbar-to-ADC Mapping Structure
To improve the utilization of ADCs, ReHarvest decouples crossbar arrays and ADCs, and shares all ADCs in a resource pool. In this way, each crossbar array can harvest all unused ADCs in a tile to achieve bitline-level data parallelism for AD conversions. A crossbar array can thus perform an MVM much more efficiently because the 128 output currents from its bitlines can be converted into digital values by 128 ADCs simultaneously.
To achieve a many-to-many mapping between 128 ADCs and 128 crossbar arrays in a single tile, we design a grouped bitline-to-ADC mapping structure that stacks multiple crossbar arrays and MUXs in an orthogonal manner. As shown in Figure 8, all bitlines in the 128 crossbar arrays are grouped by their indexes and mapped to the same ADC. For example, the bitlines with index 0 in different crossbar arrays all connect to MUX-0 and ADC-0. All ADCs in the same tile form an ADC pool, and each ADC can be shared by multiple crossbar arrays via time-division multiplexing (TDM). We note that ReHarvest supports M-to-N mappings between crossbar arrays and ADCs, where M is not necessarily equal to N. The default size of an ADC resource pool is 128 in ReHarvest. By parallelizing analog-to-digital conversions for all bitlines in a crossbar array, the total execution time of an MVM can be reduced by about 128 times compared with traditional ReRAM PIM architectures.
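To make the grouped mapping concrete, the sketch below (with hypothetical function names and a normalized per-conversion latency t_adc) shows how the bitline index, not the crossbar index, selects the ADC, and why digitizing all 128 bitlines of one crossbar drops from 128 serial conversions to one parallel step.

```python
import math

NUM_XBARS = 128   # crossbar arrays per tile
NUM_ADCS = 128    # default ADC pool size in ReHarvest

def adc_for_bitline(xbar_id: int, bitline_id: int) -> int:
    # Grouped mapping: bitline b of every crossbar array is wired, through
    # MUX-b, to ADC-b. The crossbar id only drives the MUX select signal,
    # so which ADC is used depends solely on the bitline index.
    assert 0 <= xbar_id < NUM_XBARS and 0 <= bitline_id < NUM_ADCS
    return bitline_id

def conversion_time(num_bitlines: int, t_adc: float, num_adcs: int) -> float:
    """Time to digitize one MVM's bitline currents when each conversion
    takes t_adc and idle ADCs in the pool can be harvested in parallel."""
    return math.ceil(num_bitlines / num_adcs) * t_adc

# A tightly-coupled design serializes 128 bitlines on one ADC, while the
# decoupled pool converts them in a single step -> roughly 128x faster.
assert conversion_time(128, 1.0, num_adcs=1) / conversion_time(128, 1.0, num_adcs=128) == 128
```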
Placement and Wiring. A number of proposals [12, 28, 44] have demonstrated that it is feasible to stack multiple ReRAM crossbar arrays vertically. We need a 128-to-1 MUX to connect 128 bitlines to one ADC. This 128-to-1 MUX can be implemented with eight 16-to-1 transmission-gate MUXs and seven 2-to-1 CMOS MUXs in a hierarchical manner [63]. In total, the MUX tree is composed of only 720 transistors and incurs 467 ps of latency. Since its dimension is small relative to the height of the stacked crossbar arrays, it can be placed vertically and connected to 128 bitlines in the same plane. ADCs are still laid out horizontally due to their large dimensions. To minimize the timing skew of control signals among ADCs, we adopt a clock tree with an H-tree structure [8] for routing control signals. In this clock tree, only 127 repeaters are located at the cross-points of the H-tree. As shown in Figure 8, we place the stacked crossbars and MUXs in the center of a tile, and all ADCs are situated at the leaf nodes of the H-tree. Thus, all ADCs in a tile can be activated by the clock tree concurrently via only one control signal, and all MUXs also share the same control signal. Similarly, all ADCs are also connected to MUXs via an H-tree structure. In this way, all wires for transmitting current signals have the same length, and thus all output currents from 128 bitlines can arrive at all ADCs simultaneously. Unlike the clock tree, we do not use any repeaters on each individual wire between a MUX and an ADC because they may significantly increase the circuit complexity and the power consumption.
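As a quick sanity check on the numbers above, a balanced H-tree feeding 128 leaf ADCs has 127 branch points (one repeater each), and the hierarchical 128-to-1 MUX decomposes into eight 16-to-1 and seven 2-to-1 stages; the helper names below are ours.

```python
def h_tree_repeaters(num_leaves: int) -> int:
    # A balanced H-tree that fans one control signal out to num_leaves ADCs
    # has num_leaves - 1 branch points, with one repeater per branch point.
    assert num_leaves & (num_leaves - 1) == 0, "expects a power of two"
    return num_leaves - 1

def mux_tree_stages(fan_in: int, first_stage: int = 16) -> tuple[int, int]:
    # 128-to-1 MUX built hierarchically: a first stage of 16-to-1
    # transmission-gate MUXs, then a binary tree of 2-to-1 CMOS MUXs.
    first = fan_in // first_stage      # 8 x 16-to-1 MUXs
    second = first - 1                 # 7 x 2-to-1 MUXs (binary tree)
    return first, second

assert h_tree_repeaters(128) == 127    # matches the 127 repeaters above
assert mux_tree_stages(128) == (8, 7)  # eight 16-to-1 and seven 2-to-1 MUXs
```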
Although this stacked crossbar structure is somewhat similar to 3D Flash-based PIM units [7, 34, 42], there are two major differences as follows. First, we use a different way to select the target crossbar array for AD conversions. 3D-FPIM [34] selects the target crossbar array through a wordline decoder. Instead, ReHarvest selects the target crossbar array through MUXs, while wordline decoders are used to drive voltages corresponding to input vectors. Second, we use a different way to stack crossbar arrays. In 3D-FPIM, Flash cells from different crossbar arrays are connected to share bitline selectors and bitlines in the vertical direction. Accordingly, only one crossbar array can be activated at a time in the stacked crossbar arrays of 3D-FPIM. In contrast, stacked crossbar arrays in ReHarvest are physically independent in the vertical direction. Thus, different stacked crossbar arrays can be activated concurrently for different operations. For example, ReHarvest can map a weight matrix to one crossbar array while allowing other crossbar arrays to perform MVM operations simultaneously. This mechanism can partially hide the ReRAM write latency when the ReRAM resource is limited.
Analog Signal Integrity. Our stacked crossbar architecture can eliminate cross points between crossbars, MUXs, and ADCs, and thus minimizes the crosstalk [58] among wires and facilitates the wiring in integrated circuits. In addition, since only one crossbar array is activated in a tile most of the time, our stacked crossbar architecture introduces only very limited circuit noise among crossbars. However, as we directly connect MUXs to ADCs without intermediate repeaters, longer wires lead to higher wire resistance. Since wide (fat) wires are a feasible solution [2, 17, 20, 25, 37, 56, 60, 70] to combat impedance, RC delay, and IR drop in various IC designs, we adopt wires with a width of 0.3 \(\mu m\) to mitigate the impact of wire resistance on signal integrity while minimizing the wiring overhead (Section 5.9).
Remark. The proposed ADC-crossbar decoupled architecture offers several advantages. First, the decoupled architecture achieves high data parallelism without using the WR mechanism. Compared to previous ADC-crossbar tightly-coupled structures [52, 68], ReHarvest can significantly reduce the latency of MVM operations with less ReRAM resource consumption. Second, since the decoupled architecture can fully exploit the ADC resource, it is unnecessary to employ the inter-layer pipeline technology [52, 54], which usually suffers from pipeline stalls due to unbalanced computing latencies between two adjacent layers. Third, the decoupled architecture offers an opportunity to map weight matrices to unused crossbar arrays while one crossbar array in the same tile is still occupying the ADC pool. Thus, ReHarvest has the potential to hide the latency of matrix mapping during analog computing.
4.3 Multi-Tile Matrix Mapping
If we simply map the weight matrix to a few tiles compactly, the utilization of ADCs may still be low since the ADCs in other tiles are not fully utilized. To further improve the ADC utilization, we design an MTMM scheme to adapt to the ADC-crossbar decoupled architecture. The basic idea is to distribute a large weight matrix across multiple tiles so that more ADC pools can be utilized for a single MVM operation.
Assume the total number of tiles is n, and at least m crossbar arrays are required to map the weight matrix \(W_m\). We first partition \(W_m\) according to the size of a crossbar array. Then, each sub-matrix is mapped to a crossbar array in a different tile. As shown in Figure 9(a), a 128\(\times\)384 matrix is partitioned into three sub-matrices (M1, M2, M3) and then mapped to three tiles (T1, T2, T3). Two tiles (i.e., T4 and T5) still remain unused. Thus, we also replicate some sub-matrices to these remaining tiles to fully utilize the ADC pools in T4 and T5.
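A minimal sketch of this partitioning step, assuming 128\(\times\)128 crossbar arrays (the function name is ours):

```python
import numpy as np

XBAR_ROWS, XBAR_COLS = 128, 128   # assumed crossbar dimensions

def partition_weights(W: np.ndarray) -> list[np.ndarray]:
    """Split W into sub-matrices no larger than one crossbar array.
    For the 128x384 example this yields three 128x128 sub-matrices
    (M1, M2, M3), each mapped to a crossbar in a different tile."""
    subs = []
    for r in range(0, W.shape[0], XBAR_ROWS):
        for c in range(0, W.shape[1], XBAR_COLS):
            subs.append(W[r:r + XBAR_ROWS, c:c + XBAR_COLS])
    return subs

subs = partition_weights(np.zeros((128, 384)))
assert len(subs) == 3   # M1, M2, M3 -> T1, T2, T3; T4 and T5 are left for replicas
```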
We first discuss the case where \(n\gt m\). Let r denote the remainder of \(\frac{n}{m}\). When \(r=0\), the sub-matrices of \(W_m\) are mapped to the tiles so that every tile contains one sub-matrix. When \(r\ne 0\), we should fold and map \(W_m\) into the r remaining tiles. Let s be the remainder of \(\frac{m}{r}\). If \(s=0\), we map \(W_m\) into the r tiles evenly, and \(\frac{m}{r}\) crossbar arrays are used in each tile. If \(s\ne 0\), we first use \(\lfloor \frac{m}{r} \rfloor \times r\) crossbar arrays in the r tiles to map the majority of \(W_m\), and the remaining portion of \(W_m\) is replicated \(\lfloor \frac{r}{s}\rfloor\) times and mapped into the r tiles iteratively. Taking Figure 9(a) as an example, we first map M1 and M2 into T4 and T5, respectively, and then replicate M3 to both T4 and T5. Finally, T4 contains M1 and M3, and T5 contains M2 and M3.
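The folding and replication rule can be sketched as follows. This is our reading of the description above (0-indexed tiles and sub-matrices, round-robin placement), not the authors' exact implementation; for the Figure 9(a) case it reproduces T4 = {M1, M3} and T5 = {M2, M3}.

```python
def mtmm_mapping(n: int, m: int) -> list[list[int]]:
    """Return, for each of n tiles, the sub-matrix indices (0..m-1)
    mapped to it, for the n > m case described above."""
    assert n > m > 0
    tiles = [[] for _ in range(n)]
    r = n % m
    # The first n - r tiles hold floor(n/m) full replicas, one sub-matrix per tile.
    for t in range(n - r):
        tiles[t].append(t % m)
    if r == 0:
        return tiles
    rem = list(range(n - r, n))   # the r remaining tiles
    s = m % r
    q = m // r                    # sub-matrices per remaining tile (majority part)
    # Majority of W_m: q * r sub-matrices, placed round-robin over the r tiles.
    for k in range(q * r):
        tiles[rem[k % r]].append(k)
    # Leftover s sub-matrices are replicated floor(r/s) times across the r tiles.
    if s != 0:
        leftovers = list(range(q * r, m))
        for rep in range(r // s):
            for j, sub in enumerate(leftovers):
                tiles[rem[(rep * s + j) % r]].append(sub)
    return tiles

# Figure 9(a): n = 5 tiles, m = 3 sub-matrices (M1..M3 are indices 0..2).
print(mtmm_mapping(5, 3))   # [[0], [1], [2], [0, 2], [1, 2]] -> T4={M1,M3}, T5={M2,M3}
```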
After the sub-matrices are mapped to n tiles, the first \(n-r\) tiles can finish \(\lfloor \frac{n}{m}\rfloor\) MVM operations in one time slot. We illustrate how the remaining r tiles perform MVM operations as follows. First, the r tiles compute \(\lfloor \frac{r}{s} \rfloor\) partial results with the \(\lfloor \frac{r}{s} \rfloor\) replicated sub-matrices in one time slot. Then, the r tiles compute the other \(\lfloor \frac{r}{s} \rfloor\) partial results in \(\lfloor \frac{r}{s} \rfloor \times \lfloor \frac{m}{r} \rfloor\) time slots. In total, the r tiles generate \(\lfloor \frac{r}{s} \rfloor\) output vectors in \(1 + \lfloor \frac{r}{s} \rfloor \times \lfloor \frac{m}{r} \rfloor\) time slots. Figure 9(b) shows an example of processing five input vectors (X1–X5). T1, T2, and T3 can finish an MVM in each time slot. For T4 and T5, we first perform two MVMs with an input vector X4 and get two partial results (i.e., \(M1\cdot X4\) and \(M2\cdot X4\)). Then, we perform two MVMs (i.e., \(M3\cdot X4\) and \(M3\cdot X5\)) in time slot 1. At this time, all partial results for X4 have been generated and can be aggregated into the final output. Finally, we perform \(M1\cdot X5\) and \(M2\cdot X5\) in time slot 2, and compose these partial results with \(M3\cdot X5\) to form the final output for X5. In this example, we can process five input vectors in three time slots by utilizing all ADCs in five tiles.
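Plugging the Figure 9(b) parameters into the formula above (n = 5, m = 3, hence r = 2 and s = 1) confirms the three-time-slot schedule:

```python
n, m = 5, 3                 # tiles and sub-matrices in Figure 9
r = n % m                   # r = 2 remaining tiles (T4, T5)
s = m % r                   # s = 1
outputs = r // s                           # 2 extra output vectors (X4, X5)
timeslots = 1 + (r // s) * (m // r)        # 1 + 2 * 1 = 3 time slots
assert (outputs, timeslots) == (2, 3)
```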
When \(n\lt m\), we first map the m sub-matrices iteratively to the n tiles, and thus each tile contains at least \(\lfloor \frac{m}{n} \rfloor\) sub-matrices. In this way, the n tiles can complete the whole MVM in \(\lfloor \frac{m}{n}\rfloor +1\) time slots. We note that the utilization of ADCs in the last time slot may be low because only \(m\%n\) tiles are used. Thus, we also map \(W_m\) into these \(n-m\%n\) tiles with the same approach described in the case of \(n\gt m\). We also employ the same scheme as in Figure 9(b) to fully utilize the ADCs in these tiles.
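For the \(n\lt m\) case, the slot count and the last-slot occupancy follow directly; the sketch below uses a hypothetical configuration (m = 7 sub-matrices on n = 3 tiles) and our own helper name.

```python
def slots_when_n_lt_m(n: int, m: int) -> tuple[int, int]:
    """Time slots to finish one MVM and tiles busy in the last slot, when
    m sub-matrices are mapped iteratively onto n < m tiles (m % n != 0,
    as in the case discussed in the text)."""
    assert 0 < n < m and m % n != 0
    return m // n + 1, m % n

# e.g. m = 7 sub-matrices on n = 3 tiles: 3 time slots, only 1 tile busy
# in the last slot, so the other n - m % n = 2 tiles receive replicas.
assert slots_when_n_lt_m(3, 7) == (3, 1)
```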
Remark. MTMM can maximize the utilization of the ADC resource for each DNN layer. Although MTMM also replicates a few sub-matrices, the extra ReRAM resource consumption is much less than that of WR mechanisms [52, 54]. The reason is that the WR mechanism tends to use up the ReRAM arrays in all tiles, while MTMM only uses one crossbar array for one sub-matrix in each tile. Unlike inter-layer pipeline approaches [35, 52, 54] that process multiple DNN layers simultaneously, ReHarvest processes each layer independently, and thus can avoid inter-layer pipeline stalls and the ADC resource contention among DNN layers.
4.4 Fine-Grained Data Dispatching
Since MTMM distributes sub-matrices to multiple tiles, the Psums generated by these tiles must be shuffled to many tiles and used as inputs of the next layer. This many-to-many communication mode usually causes massive redundant packets in mesh or tree-based network-on-chip (NoC) designs [43, 52]. Moreover, for most CNN models, the input features dispatched to different ReRAM tiles are usually highly redundant, resulting in a significant waste of NoC bandwidth. These redundant packets may incur severe network congestion and offset the performance gain from the ADC-crossbar decoupled architecture and the MTMM scheme. To address this problem, we convert the many-to-many communication mode into many-to-one and one-to-many modes, and implement a fine-grained data dispatching (FGDD) scheme to eliminate data redundancy using two independent bus-based networks.
We extend a bus-based interconnection network [13] with fine-grained data tagging and filtering to support packet multicasting without data redundancy. We use a GIN to multicast the input data and a GON to aggregate the output data. Since they have similar structures, we take the GIN as an example. As shown in Figure 10(a), we employ multiple multicast controllers (MCs) and a shared bus to support multicasting among different tiles. To multicast the input data, we tag each input data with its destination (\(d_i\)). We use a pair of tags (\(tag_i\), \(tag_j\)) to index a tile. For 128 tiles, each tag pair only requires 14 bits to index these tiles. As shown in Figure 10(b), a tag pair is used to determine whether a tile should receive the input data tagged with \(d_i\). A packet whose tag does not match the tag pair of a tile is filtered out by the MC. Since a weight matrix may be partitioned and mapped into multiple tiles, we regard these tiles as a tile group (TG) to facilitate tagging. All tiles in a TG receive the same data by using the same tag pair. Since the distribution of weight matrices is different in different layers, ReHarvest should reconfigure the TGs and their tag pairs for each layer dynamically.
For input features without data redundancy, such as the input data of FC layers, only one tag is sufficient to filter unnecessary packets in the MC. Thus, the TC can simply configure the \(i_{th}\) TG's tag pair as (i, i). Let \(N_{tg}\) denote the number of TGs. For the \(j_{th}\) input feature, the TC tags all data packets with \(d_j = j\%N_{tg}\). If \(d_j\ne i\), the packet is filtered by the \(i_{th}\) TG.
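A behavioral sketch of this FC-layer tagging and filtering (the group count N_TG = 8 and the function names are illustrative assumptions):

```python
N_TG = 8   # number of tile groups (illustrative)

def fc_tag(j: int) -> int:
    # The TC tags the j-th input feature with d_j = j % N_tg.
    return j % N_TG

def fc_accepts(tg_index: int, d: int) -> bool:
    # With the tag pair configured as (i, i), TG i keeps only packets whose
    # destination tag equals i; everything else is filtered by its MC.
    return d == tg_index

assert fc_accepts(3, fc_tag(11))       # 11 % 8 == 3 -> delivered to TG 3
assert not fc_accepts(2, fc_tag(11))   # filtered by every other TG
```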
For highly-redundant input features, such as the input data of CONV layers, we design a fine-grained data transmission mode to eliminate duplicated transmissions. Let \(S_c\) denote the column size of the output feature map, and w denote the width of the CONV window. The TC tags feature values at the same position of the feature maps with \(d_i=i\%N_{tg}\) and sends them to the same TG. To process multiple CONV operations from the sliding window in a limited number of TGs iteratively, two tags are used for each TG. For the \(i_{th}\) TG, when \(i\lt w\), its tag pair (\(tag_i, tag_j\)) is configured as (i, \(i+N_{tg}\)). Otherwise, the tag pair is configured as (i, i). The packet with \(d_i\) is transferred to a TG when \(0\le tag_i-d_i\lt w\) or \(0\le tag_j-d_i\lt w\) is satisfied. As shown in Figure 10(c), the filter implements these two conditions in logic. By configuring the tag pair for each TG and changing w, ReHarvest can support different communication patterns such as broadcast, multicast, and unicast.
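The MC filter for the CONV mode reduces to the two range checks above; the following is a behavioral sketch (our function names), not the gate-level filter of Figure 10(c).

```python
def conv_tag_pair(i: int, n_tg: int, w: int) -> tuple[int, int]:
    # Tag pair of the i-th TG: (i, i + N_tg) while i < w, otherwise (i, i).
    return (i, i + n_tg) if i < w else (i, i)

def conv_accepts(tag_pair: tuple[int, int], d_i: int, w: int) -> bool:
    # A packet tagged d_i is delivered to a TG when
    # 0 <= tag_i - d_i < w  or  0 <= tag_j - d_i < w.
    tag_i, tag_j = tag_pair
    return 0 <= tag_i - d_i < w or 0 <= tag_j - d_i < w
```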
Figure 11 illustrates the fine-grained data transmission for CONV operations when \(N_{tg}\) is 4 and w is 3 with a stride of 1. According to the filter conditions, the input data tagged with 0 is sent to three TGs with tag pairs [(0, 4), (1, 5), (2, 6)]. TG (0, 4) processes the leftmost CONV window, and TG (1, 5) can process the next CONV window after the input data (\(d_i=1\)) is received. In this way, four TGs can process four CONV windows concurrently. With \(tag_j=4\), TG (0, 4) also receives the data with \(d_i = 2\) and \(d_i = 3\), and thus can process the fifth CONV window when the new data with \(d_i=0\) (the fifth column in Figure 11) is received. In this way, each TG can process multiple CONV windows in a sliding-window manner iteratively, without causing data redundancy on the bus.
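Running the filter sketch above with the Figure 11 configuration (\(N_{tg}=4\), w = 3) reproduces the described multicast pattern:

```python
N_TG, W = 4, 3
tag_pairs = [conv_tag_pair(i, N_TG, W) for i in range(N_TG)]   # [(0,4), (1,5), (2,6), (3,3)]

receivers = {d: [tp for tp in tag_pairs if conv_accepts(tp, d, W)] for d in range(N_TG)}
# Data tagged 0 reaches TGs (0,4), (1,5), (2,6); data tagged 2 and 3 also
# reach TG (0,4) via tag_j = 4, so TG (0,4) can start the fifth CONV window
# as soon as the next column tagged 0 arrives.
assert receivers[0] == [(0, 4), (1, 5), (2, 6)]
assert (0, 4) in receivers[2] and (0, 4) in receivers[3]
```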
Remark. Our FGDD scheme is implemented upon the bus-based interconnection network proposed in Eyeriss [13]. However, Eyeriss [13] only supports coarse-grained data transmission and usually leads to high data redundancy among different CONV windows. To address this problem, we propose fine-grained data tagging and filtering to support redundancy-free packet multicasting. Our approach can effectively reuse input features among adjacent CONV windows, and thus maximize the effective bus bandwidth. In this way, ReHarvest can orchestrate data transmission and MVM operations, and thus fully exploit the dominant ADC resource.