4.1 Overview
Figure 7 shows the hierarchical architecture of ReHarvest. A ReRAM-based PIM chip is composed of multiple tiles, interconnection networks, and other peripheral circuits. The I/O interface bridges the global data buffer (GDB) and the host main memory. The input/output feature maps and the intermediate data are stored in the GDB, which is connected to each tile via global bus-based networks, i.e., the global input network (GIN) and the global output network (GON). The tagging controller (TC) tags each packet with the coordinate of the input feature map. The adder group aggregates partial sums (Psums). Each tile is composed of a ReRAM-based processing engine and its auxiliary circuits. The control unit generates control signals according to the micro-instructions of PIM accelerators [4, 41]. The eDRAM serves as a buffer for input/output feature maps. The pooling unit and the activation unit are special circuits for pooling operations and activation functions, respectively. These components in a tile are connected to a shared bus. The ReRAM-based processing engine contains a set of input/output registers (IRs/ORs), 128 crossbar (XB) arrays, multiplexers (MUXs), ADCs, and shift-and-add units (S&A). The number of ADCs is equal to the number of ReRAM crossbar arrays in each tile. Each ADC is connected with a MUX and an S&A. The 128-to-1 MUX is connected to 128 bitlines of different crossbar arrays so that a single ADC can be multiplexed by 128 ReRAM crossbar arrays. The S&A circuit composes the final result from partial results.
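For concreteness, the per-tile resources above can be summarized as a small configuration sketch in Python; the 128x128 crossbar dimension is our assumption, inferred from the 128\(\times\)384 partitioning example in Section 4.3, and the class name is ours.

```python
from dataclasses import dataclass

@dataclass
class TileConfig:
    """Per-tile resources of ReHarvest as described in the overview.

    The crossbar dimension is inferred from the 128x384 matrix example
    in Section 4.3 and should be treated as an assumption.
    """
    num_crossbars: int = 128   # ReRAM crossbar (XB) arrays per tile
    num_adcs: int = 128        # one ADC per crossbar array in the tile
    mux_fan_in: int = 128      # each ADC sits behind a 128-to-1 MUX
    crossbar_rows: int = 128   # wordlines per crossbar (assumed)
    crossbar_cols: int = 128   # bitlines per crossbar (assumed)

cfg = TileConfig()
assert cfg.num_adcs == cfg.num_crossbars   # stated invariant: #ADCs == #crossbars
```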
4.2 Crossbar-to-ADC Mapping Structure
To improve the utilization of ADCs, ReHarvest decouples crossbar arrays and ADCs, and shares all ADCs in a resource pool. In this way, each crossbar array can harvest all unused ADCs in a tile to achieve bitline-level data parallelism for AD conversions. A crossbar array can thus perform an MVM much more efficiently because the 128 output currents from its bitlines can be converted into digital values by 128 ADCs simultaneously.
To achieve a many-to-many mapping between 128 ADCs and 128 crossbar arrays in a single tile, we design a grouped bitline-to-ADC mapping structure that stacks multiple crossbar arrays and MUXs in an orthogonal manner. As shown in Figure 8, all bitlines in the 128 crossbar arrays are grouped by their indexes and mapped to the same ADC. For example, the bitlines with index 0 in different crossbar arrays all connect to MUX-0 and ADC-0. All ADCs in the same tile form an ADC pool, and each ADC can be shared by multiple crossbar arrays via time-division multiplexing (TDM). We note that ReHarvest supports M-to-N mappings between crossbar arrays and ADCs, where M is not necessarily equal to N. The default size of an ADC resource pool is 128 in ReHarvest. By parallelizing analog-to-digital conversions for all bitlines in a crossbar array, the total execution time of an MVM can be reduced by about 128 times compared with traditional ReRAM PIM architectures.
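To make the grouped mapping concrete, the sketch below (with hypothetical function names and a normalized per-conversion latency t_adc) shows how the bitline index, not the crossbar index, selects the ADC, and why digitizing all 128 bitlines of one crossbar drops from 128 serial conversions to one parallel step.

```python
import math

NUM_XBARS = 128   # crossbar arrays per tile
NUM_ADCS = 128    # default ADC pool size in ReHarvest

def adc_for_bitline(xbar_id: int, bitline_id: int) -> int:
    # Grouped mapping: bitline b of every crossbar array is wired, through
    # MUX-b, to ADC-b. The crossbar id only drives the MUX select signal,
    # so which ADC is used depends solely on the bitline index.
    assert 0 <= xbar_id < NUM_XBARS and 0 <= bitline_id < NUM_ADCS
    return bitline_id

def conversion_time(num_bitlines: int, t_adc: float, num_adcs: int) -> float:
    """Time to digitize one MVM's bitline currents when each conversion
    takes t_adc and idle ADCs in the pool can be harvested in parallel."""
    return math.ceil(num_bitlines / num_adcs) * t_adc

# A tightly-coupled design serializes 128 bitlines on one ADC, while the
# decoupled pool converts them in a single step -> roughly 128x faster.
assert conversion_time(128, 1.0, num_adcs=1) / conversion_time(128, 1.0, num_adcs=128) == 128
```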
Placement and Wiring. A number of proposals [12, 28, 44] have demonstrated that it is feasible to stack multiple ReRAM crossbar arrays vertically. We need a 128-to-1 MUX to connect 128 bitlines to one ADC. This 128-to-1 MUX can be implemented with eight 16-to-1 transmission-gate MUXs and seven 2-to-1 CMOS MUXs in a hierarchical manner [63]. In total, the MUX tree is composed of only 720 transistors and incurs 467 ps of latency. Since its dimension is small relative to the height of the stacked crossbar arrays, it can be placed vertically and connected to 128 bitlines in the same plane. ADCs are still laid out horizontally due to their large dimensions. To minimize the timing skew of control signals among ADCs, we adopt a clock tree with an H-tree structure [8] for routing control signals. In this clock tree, only 127 repeaters are located at the cross-points of the H-tree. As shown in Figure 8, we place the stacked crossbars and MUXs in the center of a tile, and all ADCs are situated at the leaf nodes of the H-tree. Thus, all ADCs in a tile can be activated by the clock tree concurrently via only one control signal, and all MUXs also share the same control signal. Similarly, all ADCs are also connected to MUXs via an H-tree structure. In this way, all wires for transmitting current signals have the same length, and thus all output currents from 128 bitlines can arrive at all ADCs simultaneously. Unlike the clock tree, we do not use any repeaters on each individual wire between a MUX and an ADC because they may significantly increase the circuit complexity and the power consumption.
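As a quick sanity check on the numbers above, a balanced H-tree feeding 128 leaf ADCs has 127 branch points (one repeater each), and the hierarchical 128-to-1 MUX decomposes into eight 16-to-1 and seven 2-to-1 stages; the helper names below are ours.

```python
def h_tree_repeaters(num_leaves: int) -> int:
    # A balanced H-tree that fans one control signal out to num_leaves ADCs
    # has num_leaves - 1 branch points, with one repeater per branch point.
    assert num_leaves & (num_leaves - 1) == 0, "expects a power of two"
    return num_leaves - 1

def mux_tree_stages(fan_in: int, first_stage: int = 16) -> tuple[int, int]:
    # 128-to-1 MUX built hierarchically: a first stage of 16-to-1
    # transmission-gate MUXs, then a binary tree of 2-to-1 CMOS MUXs.
    first = fan_in // first_stage      # 8 x 16-to-1 MUXs
    second = first - 1                 # 7 x 2-to-1 MUXs (binary tree)
    return first, second

assert h_tree_repeaters(128) == 127    # matches the 127 repeaters above
assert mux_tree_stages(128) == (8, 7)  # eight 16-to-1 and seven 2-to-1 MUXs
```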
Although this stacked crossbar structure is somewhat similar to 3D Flash-based PIM units [7, 34, 42], there are two major differences as follows. First, we use a different way to select the target crossbar array for AD conversions. 3D-FPIM [34] selects the target crossbar array through a wordline decoder. Instead, ReHarvest selects the target crossbar array through MUXs, while wordline decoders are used to drive voltages corresponding to input vectors. Second, we use a different way to stack crossbar arrays. In 3D-FPIM, Flash cells from different crossbar arrays are connected to share bitline selectors and bitlines in the vertical direction. Accordingly, only one crossbar array can be activated at a time in the stacked crossbar arrays of 3D-FPIM. In contrast, stacked crossbar arrays in ReHarvest are physically independent in the vertical direction. Thus, different stacked crossbar arrays can be activated concurrently for different operations. For example, ReHarvest can map a weight matrix to one crossbar array while allowing other crossbar arrays to perform MVM operations simultaneously. This mechanism can partially hide the ReRAM write latency when the ReRAM resource is limited.
Analog Signal Integrity. Our stacked crossbar architecture can eliminate cross points between crossbars, MUXs, and ADCs, and thus minimizes the crosstalk [58] among wires and facilitates the wiring in integrated circuits. In addition, since only one crossbar array is activated in a tile most of the time, our stacked crossbar architecture introduces only very limited circuit noise among crossbars. However, as we directly connect MUXs to ADCs without intermediate repeaters, longer wires lead to higher wire resistance. Since wide (fat) wires are a feasible solution [2, 17, 20, 25, 37, 56, 60, 70] to combat impedance, RC delay, and IR drop in various IC designs, we adopt wires with a width of 0.3 \(\mu m\) to mitigate the impact of wire resistance on signal integrity while minimizing the wiring overhead (Section 5.9).
Remark. The proposed ADC-crossbar decoupled architecture offers several advantages. First, the decoupled architecture achieves high data parallelism without using the WR mechanism. Compared to previous ADC-crossbar tightly-coupled structures [52, 68], ReHarvest can significantly reduce the latency of MVM operations with less ReRAM resource consumption. Second, since the decoupled architecture can fully exploit the ADC resource, it is unnecessary to employ the inter-layer pipeline technology [52, 54], which usually suffers from pipeline stalls due to unbalanced computing latencies between two adjacent layers. Third, the decoupled architecture offers an opportunity to map weight matrices to unused crossbar arrays while one crossbar array in the same tile is still occupying the ADC pool. Thus, ReHarvest has the potential to hide the latency of matrix mapping during analog computing.
4.3 Multi-Tile Matrix Mapping
If we simply map the weight matrix to a few tiles compactly, the utilization of ADCs may still be low since the ADCs in other tiles are not fully utilized. To further improve the ADC utilization, we design an MTMM scheme to adapt to the ADC-crossbar decoupled architecture. The basic idea is to distribute a large weight matrix across multiple tiles so that more ADC pools can be utilized for a single MVM operation.
Assume the total number of tiles is n, and at least m crossbar arrays are required to map the weight matrix \(W_m\). We first partition \(W_m\) according to the size of a crossbar array. Then, each sub-matrix is mapped to a crossbar array in a different tile. As shown in Figure 9(a), a 128\(\times\)384 matrix is partitioned into three sub-matrices (M1, M2, M3) and then mapped to three tiles (T1, T2, T3). Two tiles (i.e., T4 and T5) still remain unused. Thus, we also replicate some sub-matrices to these remaining tiles to fully utilize the ADC pools in T4 and T5.
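A minimal sketch of this partitioning step, assuming 128\(\times\)128 crossbar arrays (the function name is ours):

```python
import numpy as np

XBAR_ROWS, XBAR_COLS = 128, 128   # assumed crossbar dimensions

def partition_weights(W: np.ndarray) -> list[np.ndarray]:
    """Split W into sub-matrices no larger than one crossbar array.
    For the 128x384 example this yields three 128x128 sub-matrices
    (M1, M2, M3), each mapped to a crossbar in a different tile."""
    subs = []
    for r in range(0, W.shape[0], XBAR_ROWS):
        for c in range(0, W.shape[1], XBAR_COLS):
            subs.append(W[r:r + XBAR_ROWS, c:c + XBAR_COLS])
    return subs

subs = partition_weights(np.zeros((128, 384)))
assert len(subs) == 3   # M1, M2, M3 -> T1, T2, T3; T4 and T5 are left for replicas
```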
We first discuss the case where \(n\gt m\). Let r denote the remainder of \(\frac{n}{m}\). When \(r=0\), the sub-matrices of \(W_m\) are mapped to the tiles so that every tile contains one sub-matrix. When \(r\ne 0\), we should fold and map \(W_m\) into the r remaining tiles. Let s be the remainder of \(\frac{m}{r}\). If \(s=0\), we map \(W_m\) into the r tiles evenly, and \(\frac{m}{r}\) crossbar arrays are used in each tile. If \(s\ne 0\), we first use \(\lfloor \frac{m}{r} \rfloor \times r\) crossbar arrays in the r tiles to map the majority of \(W_m\), and the remaining portion of \(W_m\) is replicated \(\lfloor \frac{r}{s}\rfloor\) times and mapped into the r tiles iteratively. Taking Figure 9(a) as an example, we first map M1 and M2 into T4 and T5, respectively, and then replicate M3 to both T4 and T5. Finally, T4 contains M1 and M3, and T5 contains M2 and M3.
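The folding and replication rule can be sketched as follows. This is our reading of the description above (0-indexed tiles and sub-matrices, round-robin placement), not the authors' exact implementation; for the Figure 9(a) case it reproduces T4 = {M1, M3} and T5 = {M2, M3}.

```python
def mtmm_mapping(n: int, m: int) -> list[list[int]]:
    """Return, for each of n tiles, the sub-matrix indices (0..m-1)
    mapped to it, for the n > m case described above."""
    assert n > m > 0
    tiles = [[] for _ in range(n)]
    r = n % m
    # The first n - r tiles hold floor(n/m) full replicas, one sub-matrix per tile.
    for t in range(n - r):
        tiles[t].append(t % m)
    if r == 0:
        return tiles
    rem = list(range(n - r, n))   # the r remaining tiles
    s = m % r
    q = m // r                    # sub-matrices per remaining tile (majority part)
    # Majority of W_m: q * r sub-matrices, placed round-robin over the r tiles.
    for k in range(q * r):
        tiles[rem[k % r]].append(k)
    # Leftover s sub-matrices are replicated floor(r/s) times across the r tiles.
    if s != 0:
        leftovers = list(range(q * r, m))
        for rep in range(r // s):
            for j, sub in enumerate(leftovers):
                tiles[rem[(rep * s + j) % r]].append(sub)
    return tiles

# Figure 9(a): n = 5 tiles, m = 3 sub-matrices (M1..M3 are indices 0..2).
print(mtmm_mapping(5, 3))   # [[0], [1], [2], [0, 2], [1, 2]] -> T4={M1,M3}, T5={M2,M3}
```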
After the sub-matrices are mapped to n tiles, the first \(n-r\) tiles can finish \(\lfloor \frac{n}{m}\rfloor\) MVM operations in one time slot. We illustrate how the remaining r tiles perform MVM operations as follows. First, the r tiles compute \(\lfloor \frac{r}{s} \rfloor\) partial results with the \(\lfloor \frac{r}{s} \rfloor\) replicated sub-matrices in one time slot. Then, the r tiles compute the other \(\lfloor \frac{r}{s} \rfloor\) partial results in \(\lfloor \frac{r}{s} \rfloor \times \lfloor \frac{m}{r} \rfloor\) time slots. In total, the r tiles generate \(\lfloor \frac{r}{s} \rfloor\) output vectors in \(1 + \lfloor \frac{r}{s} \rfloor \times \lfloor \frac{m}{r} \rfloor\) time slots. Figure 9(b) shows an example of processing five input vectors (X1–X5). T1, T2, and T3 can finish an MVM in each time slot. For T4 and T5, we first perform two MVMs with an input vector X4 and get two partial results (i.e., \(M1\cdot X4\) and \(M2\cdot X4\)). Then, we perform two MVMs (i.e., \(M3\cdot X4\) and \(M3\cdot X5\)) in time slot 1. At this time, all partial results for X4 have been generated and can be aggregated into the final output. Finally, we perform \(M1\cdot X5\) and \(M2\cdot X5\) in time slot 2, and compose these partial results with \(M3\cdot X5\) to form the final output for X5. In this example, we can process five input vectors in three time slots by utilizing all ADCs in five tiles.
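Plugging the Figure 9(b) parameters into the formula above (n = 5, m = 3, hence r = 2 and s = 1) confirms the three-time-slot schedule:

```python
n, m = 5, 3                 # tiles and sub-matrices in Figure 9
r = n % m                   # r = 2 remaining tiles (T4, T5)
s = m % r                   # s = 1
outputs = r // s                           # 2 extra output vectors (X4, X5)
timeslots = 1 + (r // s) * (m // r)        # 1 + 2 * 1 = 3 time slots
assert (outputs, timeslots) == (2, 3)
```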
When \(n\lt m\), we first map the m sub-matrices iteratively to the n tiles, and thus each tile contains at least \(\lfloor \frac{m}{n} \rfloor\) sub-matrices. In this way, the n tiles can complete the whole MVM in \(\lfloor \frac{m}{n}\rfloor +1\) time slots. We note that the utilization of ADCs in the last time slot may be low because only \(m\%n\) tiles are used. Thus, we also map \(W_m\) into these \(n-m\%n\) tiles with the same approach described in the case of \(n\gt m\). We also employ the same scheme as in Figure 9(b) to fully utilize the ADCs in these tiles.
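For the \(n\lt m\) case, the slot count and the last-slot occupancy follow directly; the sketch below uses a hypothetical configuration (m = 7 sub-matrices on n = 3 tiles) and our own helper name.

```python
def slots_when_n_lt_m(n: int, m: int) -> tuple[int, int]:
    """Time slots to finish one MVM and tiles busy in the last slot, when
    m sub-matrices are mapped iteratively onto n < m tiles (m % n != 0,
    as in the case discussed in the text)."""
    assert 0 < n < m and m % n != 0
    return m // n + 1, m % n

# e.g. m = 7 sub-matrices on n = 3 tiles: 3 time slots, only 1 tile busy
# in the last slot, so the other n - m % n = 2 tiles receive replicas.
assert slots_when_n_lt_m(3, 7) == (3, 1)
```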
Remark. MTMM can maximize the utilization of the ADC resource for each DNN layer. Although MTMM also replicates a few sub-matrices, the extra ReRAM resource consumption is much less than that of WR mechanisms [52, 54]. The reason is that the WR mechanism tends to use up the ReRAM arrays in all tiles, while MTMM only uses one crossbar array for one sub-matrix in each tile. Unlike inter-layer pipeline approaches [35, 52, 54] that process multiple DNN layers simultaneously, ReHarvest processes each layer independently, and thus can avoid inter-layer pipeline stalls and the ADC resource contention among DNN layers.
4.4 Fine-Grained Data Dispatching
Since MTMM distributes sub-matrices to multiple tiles, the Psums generated by these tiles must be shuffled to many tiles and used as inputs of the next layer. This many-to-many communication mode usually causes massive redundant packets in mesh or tree-based network-on-chip (NoC) designs [43, 52]. Moreover, for most CNN models, the input features dispatched to different ReRAM tiles are usually highly redundant, resulting in a significant waste of NoC bandwidth. These redundant packets may incur severe network congestion and offset the performance gain from the ADC-crossbar decoupled architecture and the MTMM scheme. To address this problem, we convert the many-to-many communication mode into many-to-one and one-to-many modes, and implement a fine-grained data dispatching (FGDD) scheme to eliminate data redundancy using two independent bus-based networks.
We extend a bus-based interconnection network [13] with fine-grained data tagging and filtering to support packet multicasting without data redundancy. We use a GIN to multicast the input data and a GON to aggregate the output data. Since they have similar structures, we take the GIN as an example. As shown in Figure 10(a), we employ multiple multicast controllers (MCs) and a shared bus to support multicasting among different tiles. To multicast the input data, we tag each input data with its destination (\(d_i\)). We use a pair of tags (\(tag_i\), \(tag_j\)) to index a tile. For 128 tiles, each tag pair only requires 14 bits to index these tiles. As shown in Figure 10(b), a tag pair is used to determine whether a tile should receive the input data tagged with \(d_i\). A packet whose tag does not match the tag pair of a tile is filtered out by the MC. Since a weight matrix may be partitioned and mapped into multiple tiles, we regard these tiles as a tile group (TG) to facilitate tagging. All tiles in a TG receive the same data by using the same tag pair. Since the distribution of weight matrices is different in different layers, ReHarvest should reconfigure the TGs and their tag pairs for each layer dynamically.
For input features without data redundancy, such as the input data of FC layers, only one tag is sufficient to filter unnecessary packets in the MC. Thus, the TC can simply configure the \(i_{th}\) TG's tag pair as (i, i). Let \(N_{tg}\) denote the number of TGs. For the \(j_{th}\) input feature, the TC tags all data packets with \(d_j = j\%N_{tg}\). If \(d_j\ne i\), the packet is filtered by the \(i_{th}\) TG.
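A behavioral sketch of this FC-layer tagging and filtering (the group count N_TG = 8 and the function names are illustrative assumptions):

```python
N_TG = 8   # number of tile groups (illustrative)

def fc_tag(j: int) -> int:
    # The TC tags the j-th input feature with d_j = j % N_tg.
    return j % N_TG

def fc_accepts(tg_index: int, d: int) -> bool:
    # With the tag pair configured as (i, i), TG i keeps only packets whose
    # destination tag equals i; everything else is filtered by its MC.
    return d == tg_index

assert fc_accepts(3, fc_tag(11))       # 11 % 8 == 3 -> delivered to TG 3
assert not fc_accepts(2, fc_tag(11))   # filtered by every other TG
```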
For highly-redundant input features, such as the input data of CONV layers, we design a fine-grained data transmission mode to eliminate duplicated transmissions. Let \(S_c\) denote the column size of the output feature map, and w denote the width of the CONV window. The TC tags feature values at the same position of the feature maps with \(d_i=i\%N_{tg}\) and sends them to the same TG. To process multiple CONV operations from the sliding window in a limited number of TGs iteratively, two tags are used for each TG. For the \(i_{th}\) TG, when \(i\lt w\), its tag pair (\(tag_i, tag_j\)) is configured as (i, \(i+N_{tg}\)). Otherwise, the tag pair is configured as (i, i). The packet with \(d_i\) is transferred to a TG when \(0\le tag_i-d_i\lt w\) or \(0\le tag_j-d_i\lt w\) is satisfied. As shown in Figure 10(c), the filter implements these two conditions in logic. By configuring the tag pair for each TG and changing w, ReHarvest can support different communication patterns such as broadcast, multicast, and unicast.
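The MC filter for the CONV mode reduces to the two range checks above; the following is a behavioral sketch (our function names), not the gate-level filter of Figure 10(c).

```python
def conv_tag_pair(i: int, n_tg: int, w: int) -> tuple[int, int]:
    # Tag pair of the i-th TG: (i, i + N_tg) while i < w, otherwise (i, i).
    return (i, i + n_tg) if i < w else (i, i)

def conv_accepts(tag_pair: tuple[int, int], d_i: int, w: int) -> bool:
    # A packet tagged d_i is delivered to a TG when
    # 0 <= tag_i - d_i < w  or  0 <= tag_j - d_i < w.
    tag_i, tag_j = tag_pair
    return 0 <= tag_i - d_i < w or 0 <= tag_j - d_i < w
```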
Figure 11 illustrates the fine-grained data transmission for CONV operations when \(N_{tg}\) is 4 and w is 3 with a stride of 1. According to the filter conditions, the input data tagged with 0 is sent to three TGs with tag pairs [(0, 4), (1, 5), (2, 6)]. TG (0, 4) processes the leftmost CONV window, and TG (1, 5) can process the next CONV window after the input data (\(d_i=1\)) is received. In this way, four TGs can process four CONV windows concurrently. With \(tag_j=4\), TG (0, 4) also receives the data with \(d_i = 2\) and \(d_i = 3\), and thus can process the fifth CONV window when the new data with \(d_i=0\) (the fifth column in Figure 11) is received. In this way, each TG can process multiple CONV windows in a sliding-window manner iteratively, without causing data redundancy on the bus.
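Running the filter sketch above with the Figure 11 configuration (\(N_{tg}=4\), w = 3) reproduces the described multicast pattern:

```python
N_TG, W = 4, 3
tag_pairs = [conv_tag_pair(i, N_TG, W) for i in range(N_TG)]   # [(0,4), (1,5), (2,6), (3,3)]

receivers = {d: [tp for tp in tag_pairs if conv_accepts(tp, d, W)] for d in range(N_TG)}
# Data tagged 0 reaches TGs (0,4), (1,5), (2,6); data tagged 2 and 3 also
# reach TG (0,4) via tag_j = 4, so TG (0,4) can start the fifth CONV window
# as soon as the next column tagged 0 arrives.
assert receivers[0] == [(0, 4), (1, 5), (2, 6)]
assert (0, 4) in receivers[2] and (0, 4) in receivers[3]
```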
Remark. Our FGDD scheme is implemented upon the bus-based interconnection network proposed in Eyeriss [13]. However, Eyeriss [13] only supports coarse-grained data transmission and usually leads to high data redundancy among different CONV windows. To address this problem, we propose fine-grained data tagging and filtering to support redundancy-free packet multicasting. Our approach can effectively reuse input features among adjacent CONV windows, and thus maximize the effective bus bandwidth. In this way, ReHarvest can orchestrate data transmission and MVM operations, and thus fully exploit the dominant ADC resource.