# RNNFast: An Accelerator for Recurrent Neural Networks Using Domain Wall Memory

Mohammad Hossein Samavatian<sup>1</sup>, Anys Bacha<sup>2</sup>, Li Zhou<sup>1</sup>, and Radu Teodorescu<sup>1</sup>

<sup>1</sup>The Ohio State University, {samavatian.1, zhou.785, teodorescu.1}@osu.edu

<sup>2</sup>University of Michigan, bacha@umich.edu

## **ABSTRACT**

Recurrent Neural Networks (RNNs) are an important class of neural networks designed to retain and incorporate context into current decisions. RNNs are particularly well suited for machine learning problems in which context is important, such as speech recognition or language translation.

This work presents RNNFast, a hardware accelerator for RNNs that leverages an emerging class of non-volatile memory called domain-wall memory (DWM). We show that DWM is very well suited for RNN acceleration due to its very high density and low read/write energy. At the same time, the sequential nature of input/weight processing of RNNs mitigates one of the downsides of DWM, which is the linear (rather than constant) data access time.

RNNFast is very efficient and highly scalable, with flexible mapping of logical neurons to RNN hardware blocks. The basic hardware primitive, the RNN processing element (PE), includes custom DWM-based multiplication, sigmoid and tanh units for high density and low energy. The accelerator is designed to minimize data movement by closely interleaving DWM storage and computation. We compare our design with a state-of-the-art GPGPU and find $21.8\times$ better performance with $70\times$ lower energy.

## 1. INTRODUCTION

Deep learning is transforming the way we approach everyday computing. From speech recognition that empowers today's digital assistants to business intelligence applications fueled by the analysis of social media postings, processing information in a way that preserves the correct context is crucial. For instance, the sentences "white blood cells destroying an infection" and "an infection destroying white blood cells" have very different meanings even though they contain the same words. Traditional machine learning designs such as Convolutional Neural Networks (CNNs) do not consider context and are therefore not well suited for solving such problems.

Recurrent Neural Networks (RNNs) are a powerful class of networks designed to consider context by retaining and using information from previously processed inputs. RNNs are used across a wide range of applications that include speech recognition for digital assistants such as Siri and Google Now, sentiment analysis for classifying social media postings, and language translation. The popularity of RNN networks in production applications was highlighted by Google in a recent paper [1], which reports that RNN workloads represent almost 30% of the workloads on Google's TPU datacenters. This is in contrast to only 5% for CNN workloads.

However, RNN workloads are computationally intensive because they store a partial history of the output sequence and perform computations on that history along with the current input. As a result, RNNs require both vast amounts of storage and increased processing power. For example, the RNN neuron requires 8× the number of weights and multiply-accumulate (MAC) operations of a typical CNN cell. RNN networks are also generally quite large. For instance, Amodei et al. [2] developed a network for performing speech recognition that utilized seven recurrent layers and a total of 35 million parameters. At this scale, RNNs with large input sets are susceptible to memory bottlenecks when running on existing accelerators such as GPUs [3].

To address these challenges, prior work has proposed FPGA-based accelerators for RNNs [3, 4, 5]. While effective, these designs are still expected to be almost an order of magnitude less efficient than ASIC implementations [6]. In addition, the fundamentally different design of the RNN cell makes previously proposed custom CNN accelerators [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] not directly applicable to RNN workloads.

This paper presents RNNFast, a hardware accelerator for RNN networks. RNNFast leverages domain-wall memory (DWM), an emerging non-volatile memory technology, to provide high density on-chip storage as well as energy efficient computation. DWM [23, 24, 25, 26, 27, 28, 29] is a magnetic spin-based memory technology, which stores information by setting the spin orientation of so-called magnetic domains in a ferromagnetic wire. Multiple magnetic domains can occupy a single wire (referred to as "racetrack") allowing up to 64 bits to be represented.

DWM has many attractive characteristics. It has read/write latencies that are close to SRAM and write performance and energy that are substantially lower than STT-RAM and other non-volatile memories [30]. Perhaps more importantly, DWM is expected to have $30\times$ higher density than SRAM and $10\times$ higher than DRAM or STT-RAM. The technology would therefore allow dramatically higher storage capacity in the same chip area. While the technology is still in the early stages of development, prototypes have yielded encouraging results [31]. We show that DWM is very well suited for RNN acceleration due to its very high density, linear access pattern, and low read/write energy.

The RNNFast architecture is modular and highly scalable, forgoing the need for long communication buses despite the high output fanout of typical RNN networks. RNNFast allows flexible mapping of logical neurons to RNN hardware blocks. The accelerator is designed to minimize data movement by closely interleaving DWM storage and computation. The basic hardware primitive, the RNN processing element (PE), includes custom DWM-based multiplication and custom nonlinear functional units for high performance and low energy. RNNFast also includes an error mitigation mechanism for position errors, which are expected to be relatively common in DWM. The error mitigation is tailored to the RNNFast data access pattern to minimize overhead. We compare RNNFast with a state-of-the-art NVIDIA P100 GPGPU and find that RNNFast improves performance by 21.8× while reducing energy by 70×.

We also compare with two alternative RNNFast designs: 1) a CMOS-based RNNFast design in which both memories and logic use traditional CMOS. We find the RNNFast design to be up to 2× more energy efficient than the CMOS version, in a much smaller chip area. 2) a design that replaces the multiply and accumulate units with a memristor-based implementation that uses an analog dot-product engine, a state-of-the-art design that has been shown to be very efficient for CNNs [17, 32]. RNNFast shows better performance, energy and area than the memristor-based design. Qualitative comparisons with FPGA-based RNN accelerators and Google's TPU also indicate RNNFast has better performance and lower energy.

This paper makes the following main contributions:

- Presents RNNFast, the first DWM-based custom accelerator for LSTM recurrent neural networks and its variants.
- Introduces novel DWM-based designs for efficient NN hardware, including sigmoid and tanh units.
- Implements an efficient error mitigation solution for DWM overshift errors.
- Presents a new efficient and scalable interconnection mechanism based on racetrack chains.
- Demonstrates that DWM is very well suited for efficient acceleration of recurrent neural networks.

The rest of this paper is organized as follows: Section 2 provides background information. Section 3 details the design and implementation of RNNFast. Section 4 presents the error mitigation aspects of the design. Sections 5 and 6 describe the evaluation. Section 7 discusses related work and Section 8 concludes.

## 2. BACKGROUND



Figure 1: (a) 3-layer RNN with 3 LSTM cells/layer, (b) LSTM cell, (c) an LSTM cell unrolled over time

## 2.1 The Long Short-Term Memory Cell

Most recurrent neural networks make use of special "neurons" called Long Short-Term Memory (LSTM) cells [33, 34]. LSTMs are designed to process and remember prior inputs and factor them into their outputs over time. Figure 1 shows an example of a very simple 3-layer RNN with 3 LSTM cells/layer. The output of each layer is a vector that is supplied as the input to the following layer. In addition to those inputs, a feedback loop takes the output vector of each layer and feeds it back as an additional input to each LSTM neuron. An illustration of the inputs and outputs of a single LSTM cell C unrolled over time is shown in Figure 1(c). An input $x_0$ into neuron C at time step t = 0 will generate an output $h_0$ that is propagated downstream to the next layer. In addition, $h_0$ is saved within the neuron's memory cell for use in the next time step. At time step t = 1, the same neuron C will process input $x_1$, but also use the previously stored output $h_0$ to generate the new output $h_1$.

A detailed look inside the LSTM neuron (Figure 1(b)) reveals a significantly more complex operation compared to CNN neurons. The strength of the LSTM lies in the way it regulates the fraction of information it recalls from its embedded memory and the fraction of input it processes for generating outputs over time. In other words, the LSTM cell progressively memorizes and forgets contextual information as it processes more inputs. This is achieved through special gates that are controlled through a set of mathematical functions [35] governed by equations (1) – (5).

$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i) \tag{1}$$

$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f) \tag{2}$$

$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c) \tag{4}$$

$$h_t = o_t \odot \tanh(c_t) \tag{5}$$

The input gate $i_t$ receives the input to be written into a neuron's memory cell at time step t. The forget gate $f_t$ controls what information should be erased from a neuron's memory cell at time step t. The cell $c_t$ represents the content of the neuron's memory cell. The output gate $o_t$ controls the amount of information read from the neuron's cell and how much of it contributes to the output. The output $h_t$ represents the output of the cell to the next layer at time step t. This output is also fed back into the input gate $i_{t+1}$ of the same LSTM cell at time step t+1. The Ws and bs represent the weights and biases respectively.

Figure 2: DWM device structure.

Because of the complex design, LSTM cells require substantially more storage and computation relative to their CNN counterparts. Moreover, RNN networks are also generally fully-connected, further increasing the data movement overhead.
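As an illustration, equations (1) to (5) can be written as a single time step in NumPy. This is a minimal software sketch, not the RNNFast datapath; the function name `lstm_step`, the dictionary layout of the weights, and the layer sizes are assumptions made for the example. It also makes the cost visible: each step needs eight matrix-vector products, four on the input $x_t$ and four on the feedback $h_{t-1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step following equations (1)-(5).

    W_x, W_h and b are dicts keyed by gate name: "i", "f", "o", "c".
    """
    # Pre-activations: one input product and one recurrent product per gate.
    pre = {g: W_x[g] @ x_t + W_h[g] @ h_prev + b[g] for g in "ifoc"}
    i_t = sigmoid(pre["i"])                       # (1) input gate
    f_t = sigmoid(pre["f"])                       # (2) forget gate
    o_t = sigmoid(pre["o"])                       # (3) output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre["c"])  # (4) cell state
    h_t = o_t * np.tanh(c_t)                      # (5) cell output
    return h_t, c_t

# Hypothetical sizes, chosen only for illustration.
n_in, n_hidden = 4, 3
rng = np.random.default_rng(0)
W_x = {g: rng.normal(size=(n_hidden, n_in)) for g in "ifoc"}
W_h = {g: rng.normal(size=(n_hidden, n_hidden)) for g in "ifoc"}
b = {g: np.zeros(n_hidden) for g in "ifoc"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # unroll over 5 time steps
    h, c = lstm_step(x_t, h, c, W_x, W_h, b)
```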

## 2.2 Domain-wall Memory

Domain wall (a.k.a. racetrack) memory was first proposed by Parkin et al. [23] from IBM in 2008. In 2011, Annunziata et al. [31] demonstrated the first 200mm DWM wafer, fabricated with IBM 90nm CMOS technology. Each die contained 256 racetrack cells, proving the feasibility of DWM fabrication. A large body of research has since sought to improve and optimize the technology at device and circuit levels [36, 37, 38, 39, 40, 41, 42] and find solutions to improve its reliability [43].

Domain wall (racetrack) memory represents information using the spin orientation of magnetic domains in a ferromagnetic wire, as shown in Figure 2. Each of these domains can be independently set to an up-spin or down-spin to represent the value of a single bit. Since multiple magnetic domains can reside on a single wire, multiple bits (32-64) of data can be packed in a single DWM device, resulting in a very high density. Three basic operations can be performed on a DWM device: read, write and shift. A magnetic tunnel junction (MTJ) [44, 45] structure is used to read data from the DWM cell (read port in Figure 2). In a DWM device, all the magnetic domains share a single read MTJ (generally referred to as a read head or port). The bit to be read needs to be aligned with the MTJ before it can be accessed. This is accomplished using a property that is unique to DWM, called domain wall motion, which refers to the shifting of magnetic domains down the ferromagnetic wire. When a current pulse of a suitable magnitude is applied through the ferromagnetic wire, the magnetic spins of all domains "move" across the wire in a direction opposite to the direction of current. The number of bit positions in a shift motion is controlled by the duration of the shift current. Additional blank domains are included at the ends of each racetrack to allow all data domains to be shifted to the read head without data loss at the ends of the wire [46].

Writing into DWM is also fast and energy efficient due to recently developed "shift-based writes" [41], as demonstrated in Figure 2 (write port). The design of the write head consists of a ferromagnetic wire with two fixed domains that straddle a free domain at an arbitrary location on the racetrack. One of the fixed domains is hardwired to up-spin and the other to down-spin at fabrication. The spin of either of the fixed domains can be shifted into the free domain through the domain motion process by applying a current pulse in the appropriate direction. The latency and energy of shift-based writes are equivalent to those of simple shifts.

Figure 3: RNNFast architecture overview at chip level.

The main challenge of racetrack memory is the access latency to data stored in a DWM tape, which varies with the number of shifts required to align the accessed bit with the read or write heads. RNNFast mitigates this disadvantage by optimizing data placement for sequential access such that most accesses only require a single shift.
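The cost difference between sequential and scattered accesses can be seen in a toy software model of one track. This sketch is an assumption-laden illustration: the `Racetrack` class, the single access port, and the 8-domain track are hypothetical, and the model moves a port index rather than the domains themselves, which gives the same shift count.

```python
class Racetrack:
    """Toy model of one DWM track with a single fixed access port."""

    def __init__(self, bits, port=0):
        self.bits = list(bits)   # one entry per magnetic domain
        self.port = port         # domain index currently under the port
        self.shifts = 0          # total single-position shifts issued

    def align(self, index):
        """Shift until domain `index` sits under the port."""
        self.shifts += abs(index - self.port)
        self.port = index

    def read(self, index):
        self.align(index)
        return self.bits[index]

    def write(self, index, bit):
        self.align(index)
        self.bits[index] = bit


data = [1, 0, 1, 1, 0, 0, 1, 0]

seq = Racetrack(data)
for i in range(len(data)):          # sequential scan: one shift per access
    seq.read(i)

rnd = Racetrack(data)
for i in [5, 0, 7, 2, 6, 1, 4, 3]:  # scattered accesses to the same 8 bits
    rnd.read(i)

print(seq.shifts, rnd.shifts)
```

In this example the sequential scan issues 7 shifts in total, while the scattered order issues 35 for the same eight reads.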

#### 2.2.1 Reliability Issues

DWM technology also presents reliability challenges, including possible misalignment of the data domains leading to erroneous reads and/or writes [43, 47]. Prior work [43] has classified DWM errors into two main types: "stop-in-the-middle" and "out-of-step" errors. The first class of errors is caused when data domains are not aligned with the read/write heads, leading to invalid accesses. The second class of errors is caused when the incorrect domain is aligned with the read/write head, which causes the wrong bit in the track to be accessed. The errors are generally caused by variability in the magnitude or duration of the current pulse applied during the domain shift operation. Zhang et al. [43] developed a technique for eliminating "stop-in-the-middle" errors that relies on the application of a short subthreshold shift current to nudge the misaligned domain back into alignment. They also demonstrate that the subthreshold pulse is small enough that it cannot misalign a correctly aligned domain. As a result, subthreshold shifts can virtually eliminate "stop-in-the-middle" errors, at the cost of increasing the number of "out-of-step" errors.

While subthreshold shifts can be applied in both directions, we choose to apply them in the shift direction. As a result, all "out-of-step" errors will be converted into overshift errors by 1 or more positions in the shift direction. For a single-position shift, which represents virtually all shifts in RNNFast, the probability of single-bit overshift is on the order of $10^{-5}$ [43], which is quite high. However, the probability of multi-bit overshift is about $10^{-21}$, which is negligible. As a result, RNNFast implements mitigation for single-bit overshift errors.

## 3. RNNFAST ARCHITECTURE

RNNFast is a custom architecture that leverages domain wall memory for accelerating recurrent neural networks. Figure 3 shows an overview of the design. At a high level the RNNFast chip consists of Global Memory, a Computational Memory array, Configuration Memory and I/O interface. The Global Memory is a dense memory block implemented using DWM. This is the main memory of the accelerator and is used to store input data. The Computational Memory is the main compute engine of RNNFast and is implemented primarily using DWM elements augmented with CMOS logic where appropriate. The compute array of RNNFast is organized as a pool of highly reconfigurable and tightly interconnected tile groups. The Configuration Memory holds the runtime configuration settings for the chip. RNNFast is optimized to deliver low latency without batching, and it is also efficient for batch workloads. Multiple inputs could be pipelined very efficiently for multilayer networks.

## 3.1 Compute Tiles

Tile groups are composed of multiple compute tiles, interconnected with their nearest horizontal and vertical neighbors through racetrack memories. Figure 4 shows the tile design and layout. Each compute tile consists of multiple LSTM hardware units that share a single input and a single output racetrack. The results of the computation within each tile are written directly onto the input track of the tile belonging to the next layer in the network. Tile groups are connected to each other through traditional wired interconnection networks.



Figure 4: Compute tile layout, internal design and interconnection through racetrack chains.

#### 3.1.1 Inter-tile Communication

RNNs are typically fully connected networks requiring all inputs to be delivered to all the neurons in a given layer. The high degree of connectivity that has to be supported by the hardware can lead to substantial energy and area overheads when traditional wired interconnects are used. To address this challenge we leverage the shifting mechanism of DWM racetracks for communication both within and across tiles.

Figure 5: (1) LSTM Tile with PE cells and aggregation unit (2) PE: Weight selection controller for mapping input to corresponding weight (3) Aggregation unit with Accumulator and Output Generator

Within a tile (Figure 4), inputs are read sequentially from the tile's input racetrack and broadcast to all LSTM units across a locally-shared bus. Each read is followed by a shift of the input track to align the next input element with the read head. In addition to the tile-local broadcast, each input is also sent to the neighboring tile on the left for addition to its input track. We call this process "chaining". Chains are essentially circular buffers that circulate all inputs to all tiles that are mapped to the same layer of the NN. Chains of different lengths can be configured depending on the number of neurons in each layer of the network. Racetracks are connected through MUXs (Figure 4) that enable different chain lengths. A variable number of tracks can be included in a chain by simply setting the rightmost track MUX to 0 and the rest to 1.
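The circulation behavior of a chain can be sketched in software. This is a behavioral model only, under stated assumptions: the function name `circulate` is hypothetical, all tracks in the chain are assumed to hold the same number of inputs, and the leftmost tile is assumed to wrap around to the rightmost one so that the chain acts as a circular buffer.

```python
from collections import deque

def circulate(tracks):
    """Simulate a racetrack chain; return the inputs each tile broadcast.

    tracks[k] is the initial content of tile k's input track. On each step
    every tile reads the element at its read head, broadcasts it to its
    local LSTM units, and passes it to the tile on its left. The leftmost
    tile wraps around to the rightmost one (circular-buffer assumption).
    """
    chain = [deque(t) for t in tracks]
    seen = [[] for _ in chain]
    total = sum(len(t) for t in tracks)
    for _ in range(total):
        heads = [t.popleft() for t in chain]           # read, then shift
        for k, value in enumerate(heads):
            seen[k].append(value)                      # tile-local broadcast
            chain[(k - 1) % len(chain)].append(value)  # send to left neighbor
    return seen

# Three tiles, two inputs per track: every tile ends up seeing all six inputs.
seen = circulate([[0, 1], [2, 3], [4, 5]])
```

After as many steps as there are inputs in the layer, each tile has broadcast every input exactly once, without any input travelling over a long shared bus.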

## 3.2 LSTM Units

Each tile consists of multiple LSTM compute units (64 in our design). A logical neuron can be mapped to one or more LSTM compute units depending on the number of weights it requires. RNNFast is a weight-stationary design, which means weights are locally stored inside the compute units. We expect a 1-to-1 mapping between logical neurons and hardware LSTM units for most networks. However, for large networks multiple LSTM units are combined to store all the weights corresponding to single neurons. For simpler Vanilla neurons, LSTM units can be split between multiple neurons. The architecture of an LSTM cell is shown in Figure 5.

#### 3.2.1 Processing Elements

The LSTM cell is further subdivided into multiple processing elements (PEs) ①. Per equations (1) - (5), each input $x_t$ is multiplied with four different sets of weights. A single PE can only be assigned to one of the weight sets (known as gates). However, an LSTM cell gate can be mapped to one or more PEs across LSTM units depending on its storage requirements and input/output fanout. PEs have racetrack-based storage for weights and racetrack-based compute units. Each PE unit holds a set of weights and performs the dot product on the corresponding subset of inputs. Each PE only consumes inputs corresponding to the weights it stores. Each input to a PE is multiplied by its weight and accumulated with the result of the previous multiplication ②. Each PE stores the result of the accumulation in its own output racetrack.

Figure 6: Mapping of inputs and weights to racetracks.

PEs include multiply-accumulate (MAC) engines for performing matrix multiplication. The MAC engine is composed of 256+16 DWM-based full adders. The MAC unit is deeply pipelined into 48 stages, each with a latency of 2 clock cycles, for a total latency of 96 cycles. A new input can be fed into the pipeline every two cycles.
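The pipeline figures above translate into a simple cycle estimate for streaming a dot product through one MAC engine. The sketch assumes an ideal pipeline with no stalls; the function name `mac_cycles` is hypothetical and the model ignores any final accumulation or write-back cost.

```python
STAGES = 48            # pipeline depth stated in the text
CYCLES_PER_STAGE = 2   # stage latency stated in the text

def mac_cycles(num_inputs):
    """Cycles to stream `num_inputs` products through one MAC pipeline.

    Assumes an ideal pipeline with no stalls: the first result appears
    after the full latency and each later input adds one issue interval.
    """
    if num_inputs <= 0:
        return 0
    latency = STAGES * CYCLES_PER_STAGE          # 96 cycles
    return latency + (num_inputs - 1) * CYCLES_PER_STAGE

print(mac_cycles(1), mac_cycles(1024))
```

For long input vectors the 96-cycle fill latency is amortized and throughput approaches one input every two cycles.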

To improve performance, these dot product operations are performed in parallel using two different MAC engines, one for input $x_t$ and one for feedback input $h_{t-1}$. Having 4 PEs per LSTM unit makes the design very flexible to different variants of RNNs (see Section 3.4). Each PE is capable of performing multiplication for the recurrent and input paths and accumulating the results independently. Hence, PEs can handle simple RNN neurons independently.

#### 3.2.2 Input and Weight Mapping

The assignment of inputs and weights to racetracks is a tradeoff between access latency and hardware overhead. In RNNFast, inputs are spread across multiple racetracks with 1 bit per track. This allows an entire input word to be read in a single cycle, as the top half of Figure 6 illustrates. Error detection bits are also included in the tracks and their role will be detailed in Section 4.
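The bit-sliced input layout can be sketched as follows. The 16-bit word width and the helper names `to_tracks` and `read_word` are assumptions for illustration, and the error detection bits are left out.

```python
WORD_BITS = 16  # hypothetical word width, used only for illustration

def to_tracks(words, width=WORD_BITS):
    """Track b holds bit b of every input word, in input order."""
    return [[(w >> b) & 1 for w in words] for b in range(width)]

def read_word(tracks, position):
    """Read one bit from every track at the same position: one access."""
    return sum(track[position] << b for b, track in enumerate(tracks))

inputs = [0x1234, 0x00FF, 0xBEEF]
tracks = to_tracks(inputs)
recovered = [read_word(tracks, i) for i in range(len(inputs))]
```

Because every track is shifted in lockstep, moving to the next input costs a single shift per track regardless of the word width.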

Unlike inputs, which move from track to track along the chain, weights are stationary at PE level and are reused multiple times. This means that after scanning all weights, the tracks need to be returned to the initial weight. To minimize the number of shifts, weight values are distributed both within and across multiple racetracks. Weight racetracks are provisioned with multiple read/write heads (5 in our design). Data layout is such that all read heads across all tracks can access all the bits of a single weight simultaneously. The bottom of Figure 6 illustrates this layout. Weight  $W_0$  (red) is currently aligned with the read heads. A single-position shift to the left will align the next weight  $W_1$  (blue) with all the read heads.

#### 3.2.3 Result Aggregation

If more than one LSTM unit is mapped to a neuron the partial results of the individual LSTMs have to be combined to form the neuron's output. Aggregation units ③ in each LSTM are used to sum up partial results in that LSTM block. In addition, the aggregation units apply the sigmoid and tanh functions and perform the multiplication and accumulation operations in order to generate the final output of the cell.

Figure 7: DW based implementation of sigmoid/tanh.

For cases in which neurons span multiple LSTM blocks, aggregation units in those blocks are linked to produce the final result. This is achieved by collecting all the partial results computed by each LSTM unit mapped to the same neuron to a single aggregation unit. Aggregation units are also chained through adjacent LSTM units. Each aggregation unit sends out its final result to the adjacent aggregation unit to its left. The adjacent unit will use the incoming result to either accumulate or bypass it to the next unit (Figure 5-③). Even-indexed aggregation units consume and odd-indexed aggregation units forward the incoming result. The leftmost LSTM in a neuron will be responsible for the final aggregation and will apply the sigmoid and tanh. Aggregation time is a logarithmic function of the number of LSTM cells mapped to a single neuron.
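One way to obtain logarithmic aggregation time is a pairwise reduction toward the leftmost unit. The sketch below is an interpretation, not the documented schedule: the text does not spell out the exact round structure, so the function `aggregate` and its pairing rule are assumptions. It expects a non-empty list of partial sums.

```python
def aggregate(partials):
    """Reduce partial sums toward index 0 (the leftmost unit).

    Returns (total, rounds). In each round, every surviving unit in an odd
    position sends its value left, and its even-positioned neighbor
    accumulates it. An unpaired unit at the end carries over unchanged.
    """
    values = list(partials)
    rounds = 0
    while len(values) > 1:
        merged = []
        for i in range(0, len(values), 2):
            merged.append(sum(values[i:i + 2]))  # even consumes, odd forwards
        values = merged
        rounds += 1
    return values[0], rounds

total, rounds = aggregate([1, 2, 3, 4, 5, 6, 7, 8])
```

With eight LSTM units mapped to one neuron this takes three rounds, compared with seven sequential additions along a simple chain.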

The design tradeoff for LSTM units is driven by the need to support networks both large and small. If LSTM units and PEs are too large, storage space will be wasted when small networks are mapped. If they are too small, large networks will require several LSTM units per neuron, increasing the aggregation time.

## 3.3 Nonlinear Functions

RNNFast uses hardware acceleration for the sigmoid and tanh nonlinear functions. The hardware is included in each Aggregation Unit (Figure 5). We propose an area-efficient unit, based on an approximate logic function and implemented using DWM, for the nonlinear functions. The approximation, proposed by prior work [48] as an alternative to the standard sigmoid, follows Equation 6:

$$\sigma(z) = \begin{cases} \dfrac{\frac{1}{2} + \frac{\hat{z}}{4}}{2^{|(z)|}} & \text{if } z < 0\\ 1 - \sigma(-z) & \text{if } z > 0 \end{cases} \tag{6}$$

This approximation has the advantage of being easier to implement in hardware. As Equation 6 shows, the hardware has to support division by powers of two ($2^n$). This can be implemented using shift operations, which are a feature of racetrack memories. The tanh approximation function can be computed from the sigmoid function through two multiplications and a subtraction. Note that $\hat{z} = z + |(z)|$, where (z) is the integer part of z.

Figure 8: Mapping multiple LSTM networks to RNNFast. Interconnection network helps extend racetrack chains beyond tile groups for large networks.

Figure 7 shows our DWM-based implementation of the sigmoid approximation. Sigmoid for a negative value will be computed as follows: a) the output integer part is initialized with binary '1'; b) two right shifts are performed to compute $\hat{z}/4$; c) +1/2 is applied to the result; d) the final result is shifted right $|(z)|$ times, which divides it by $2^{|(z)|}$. For a positive number, two subtraction steps are added at the beginning and end of the steps above. To compute the tanh approximation, a right shift $(2 \times z)$ and a subtraction will be applied in the first and last steps respectively. This design is very area and energy efficient, utilizing only a 16-bit racetrack memory along with some simple subtraction and counting logic. Section 6 evaluates the merits of the approximate designs relative to lookup tables (LUTs).
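The approximation can be checked numerically in floating point. This sketch reads Equation 6 as $(\frac{1}{2} + \hat{z}/4) / 2^{|(z)|}$ for negative $z$ and assumes that the integer part $(z)$ truncates toward zero, so that $\hat{z}$ is the signed fractional part of $z$. The tanh version uses the standard identity $\tanh(z) = 2\sigma(2z) - 1$. Function names are hypothetical, and the code does not model the 16-bit racetrack datapath.

```python
import math

def sigmoid_approx(z):
    """Piecewise approximation of Equation 6 (shift-friendly form)."""
    if z > 0:
        return 1.0 - sigmoid_approx(-z)
    int_part = math.trunc(z)        # (z): integer part, truncated toward 0
    z_hat = z + abs(int_part)       # fractional part of z, keeping its sign
    # Dividing by a power of two corresponds to a right shift in hardware.
    return (0.5 + z_hat / 4.0) / (2 ** abs(int_part))

def tanh_approx(z):
    """tanh(z) = 2 * sigmoid(2z) - 1, using the approximate sigmoid."""
    return 2.0 * sigmoid_approx(2.0 * z) - 1.0

grid = [i / 100 for i in range(-800, 801)]
max_sigmoid_err = max(
    abs(sigmoid_approx(z) - 1.0 / (1.0 + math.exp(-z))) for z in grid
)
max_tanh_err = max(abs(tanh_approx(z) - math.tanh(z)) for z in grid)
```

On this grid the sigmoid approximation stays within about 0.02 of the exact function, and the tanh approximation within about 0.04, using only additions, subtractions and power-of-two scalings.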

## 3.4 RNNFast Mapping and Configuration

The RNNFast hardware can be configured to implement different network sizes and topologies. Moreover, multiple distinct neural networks can be mapped to the same chip.

Outputs from one network can be delivered directly to the following network or stored in the on-chip memory for further processing, if needed. Figure 8 illustrates an example of four networks A, B, C and D mapped to two tile groups. Tile groups are connected through a wired interconnect. The racetrack chains for each row of tiles have additional read/write heads to provide access to the inter-tile network.

Multilayer networks span multiple rows with different layers mapped to consecutive rows. Tile groups are designed with wide rows to accommodate most network sizes (e.g. Nets A and C). However, when a network layer cannot fit in a single row, RNNFast supports splitting it across tile groups (e.g. Nets B and D). This is achieved by extending the input/output racetrack chains to neighboring tile groups using the inter-tile interconnect. We chose to split layers across tile groups (as opposed to within a tile group) in order to allow consecutive network layers to continue to be mapped to adjacent rows, preserving inter-layer communication.

One important design constraint was to enable the extension of the racetrack chains across tile groups without adding to the track chain shift latency. This is accomplished by implementing a look-ahead read port at the end of the track that reads inputs several cycles ahead of the end of the track, as illustrated for Net D in Figure 8. This allows the input to reach the destination row in the neighboring tile through the higher latency interconnect by the time the same input reaches the end of the source track.

Figure 9: LSTM vs GRU cell configuration on RNNFast

Although RNNFast is designed for the more demanding LSTM design, it is also compatible with LSTM variants like Gated Recurrent Unit (GRU) and Vanilla, which need fewer compute resources. Unlike the LSTM, the GRU does not use a memory element to control the flow of information and is useful when input sequences are not very long. Figure 9 shows how a GRU cell can be mapped to an RNNFast LSTM unit. The shaded areas represent unutilized components. The GRU utilizes 75% of the MAC resources: it has only two gates, computed by two fully active PEs, and a hidden state computation performed by two half-active PEs. In half-active PEs only a single MAC unit is active. Inactive nonlinear function units simply bypass their input to the output. Simple multiplexers are used to tailor the data flow to the LSTM or GRU. Simpler RNNs like *Vanilla* only utilize a single PE per neuron and do not need further computations in the aggregation unit. As a result, RNNFast can map four *Vanilla* neurons in each LSTM unit. The reconfiguration is performed similarly to an FPGA, with configuration signals that drive MUXs, set the racetrack chain length, control aggregation unit bypass, and power-gate inactive units. The values for these configuration signals are stored in configuration memory.
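The 75% utilization figure can be recovered by counting active MAC engines, assuming each of the four PEs in an LSTM unit contains exactly two MAC engines (one for the input path and one for the recurrent path, as described in Section 3.2.1):

$$\frac{\overbrace{2 \times 2}^{\text{fully active PEs}} + \overbrace{2 \times 1}^{\text{half-active PEs}}}{\underbrace{4 \times 2}_{\text{MAC engines per LSTM unit}}} = \frac{6}{8} = 75\%$$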

The RNNFast configuration is programmed through configuration registers that control input assignment at PE level, input track chaining, result aggregation setup, etc. A configuration file with the LSTM network(s) specifications is loaded into the device driver of the accelerator and propagated to the appropriate registers.

## 4. ERROR MITIGATION DESIGN

## **4.1 DWM Position Errors**

As detailed in Section 2.2, "out-of-step" shift errors (in which the wrong bit is aligned with the read/write heads) are a significant reliability challenge for DWM. We focus on single-bit overshift errors, which are expected to occur with a probability on the order of $10^{-5}$ [43], which is quite high. We used PyTorch [49] to inject errors in weights for both the im2txt and seq2seq models.

While prior work [21] has shown that neural networks are quite resilient to errors, we find that error rates on the order of DWM overshift errors can degrade output accuracy substantially. Figure 10 shows the accuracy of the output for two benchmarks, measured by the BLEU metric [50], relative to an error-free baseline. We inject single-bit overshift errors in different DWM components of RNNFast: the racetrack chains used to hold inputs and outputs for each NN layer, the weights associated with all PEs, the DWM components of the logic functions (MAC units and the nonlinear functions). Shift errors are modeled as a uniform distribution with an overshift probability of  $4.55 \times 10^{-5}$  [43].
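To make the error model concrete, the sketch below scans one track with single-position shifts that overshoot by one position with a fixed probability. It is a simplified stand-in, not the PyTorch injection setup used for the experiments: the function `read_with_overshift`, the wrap-around at the end of the track, and the absence of any correction are all assumptions.

```python
import random

P_OVERSHIFT = 4.55e-5  # per-shift overshift probability used in the text

def read_with_overshift(track, p=P_OVERSHIFT, rng=None):
    """Scan a track with single-position shifts that may overshoot by one.

    Returns (values_read, overshift_count). A position error persists
    until corrected, so every read after an overshoot is misaligned.
    Reads past the end wrap around; that is a simplification.
    """
    rng = rng or random.Random()
    pos, errors, out = 0, 0, []
    for _ in range(len(track)):
        out.append(track[pos % len(track)])
        step = 1
        if rng.random() < p:   # overshoot: move two positions instead of one
            step = 2
            errors += 1
        pos += step
    return out, errors

clean, n_clean = read_with_overshift(list(range(6)), p=0.0)
broken, n_broken = read_with_overshift(list(range(6)), p=1.0)
```

At the stated probability, one million single-position shifts would see roughly 45 overshift events on average, and without mitigation each one leaves all later reads on that track misaligned.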

Figure 10 shows that when errors are injected only in the logic, the drop in output accuracy is very low: <1% for im2txt and 3% for seq2seq, two of the benchmarks we run. This is because overshift off-by-one errors in the MAC and nonlinear functions tend to produce results that are relatively close to the correct value. As a result, the accuracy of the output is very high. However, when errors are injected into the input chains and the weight arrays, the output accuracy drops dramatically to between 10% and 35% of the original. When errors are injected uniformly in all DWM tracks, the output accuracy drops below 5% for *im2txt* and below 10% for *seq2seq*, meaning that the results are essentially useless. This data highlights that mitigation solutions for errors in the inputs as well as weights are essential.

Figure 10: Output accuracy (BLEU metric) relative to the error-free RNNFast baseline.

Figure 11: Output accuracy (BLEU metric) relative to the error-free RNNFast baseline for integer and fraction parts.

To better understand which errors have the worst effect on output quality, we selectively inject errors into different bits of data words. RNNFast uses 2's complement fixed point representation for both inputs and weights. We inject errors separately into the integer and the fraction portions of the word. Figure 11 shows the results of this experiment. When errors are injected only in the fraction, the drop in accuracy is less than 3% for both inputs and weights in *im2txt*. For *seq2seq* the accuracy degradation is worse when errors are injected in the weights compared to inputs, but overall output quality is still reasonably high.

Injecting errors with the same probability in the integer portion of the data words has a much more dramatic effect, leading to a drop in output accuracy of between 10% and 35%. The large effect is due to the fact that in these workloads both inputs and weights are small fractional numbers. A single bit flip in the integer portion can turn a small number into a much larger value, which has a disproportionate effect on the rest of the network.
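The difference between fraction-bit and integer-bit errors can be illustrated with a minimal sketch. The 16-bit word width matches the benchmarks, but the Q8.8 split (8 integer bits including sign, 8 fraction bits) and the helper names are assumptions made here for illustration; the text does not state the actual fixed-point format.

```python
WORD_BITS = 16
FRAC_BITS = 8  # assumed Q8.8 split; the actual RNNFast format is not given here


def to_fixed(x):
    """Encode a real number as a 16-bit two's complement fixed-point word."""
    return round(x * (1 << FRAC_BITS)) & ((1 << WORD_BITS) - 1)


def from_fixed(word):
    """Decode a 16-bit two's complement fixed-point word into a float."""
    if word & (1 << (WORD_BITS - 1)):
        word -= 1 << WORD_BITS
    return word / (1 << FRAC_BITS)


def flip_bit(word, bit):
    """Flip a single bit of the word (bit 0 is the least significant)."""
    return word ^ (1 << bit)


weight = to_fixed(0.125)                         # a typical small fractional value
fraction_error = from_fixed(flip_bit(weight, 3))   # flip inside the fraction field
integer_error = from_fixed(flip_bit(weight, 12))   # flip inside the integer field

print(fraction_error)  # 0.15625
print(integer_error)   # 16.125
```

Under these assumptions, a fraction-bit flip perturbs the value only slightly, while an integer-bit flip turns 0.125 into a value two orders of magnitude larger.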

## 4.2 RNNFast Error Mitigation

RNNFast addresses overshift errors by implementing an efficient error mitigation mechanism that considers the sensitivity of RNN workloads to errors that result in very large values. We implement different error detection and mitigation mechanisms for input/output racetrack chains and for weight arrays. Our error detection code (EDC) solution is optimized for RNNFast. RNNFast uses shift-read/write cycles for accessing weights and inputs instead of random numbers of consecutive shifts, as is the case in random access memory implementations of DWM.

We take advantage of this characteristic to implement a more efficient SEDSEC design that has lower area overhead and requires fewer extra domains and access ports than prior EDC solutions such as [43].

#### 4.2.1 Input Errors

In order to detect overshift errors in the input tracks, we append a 3-bit pattern to the left side of each track, as shown in the example in Figure 12. The figure shows a single track that stores bit n for multiple inputs  $I_1 - I_7$ . In the initial state the Error Detection Code (EDC) "101" is stored in the leftmost bits of the track. Input  $I_1$  is read in the current cycle. At time  $t_1$  the track is shifted left by 1 to access the next input. If the shift is correct, the leading (check) bit should be a "1". Input  $I_2$  is read and sent to the LSTM units. A new EDC is written at cycle  $t_3$  in the first three bits of the track using three parallel write ports. Note that updating the EDC does not introduce any time overhead, since a write cycle already exists following each read to allow data to be written into the next track in the chain.

At cycle  $t_4$  we show an overshift error. The track has incorrectly shifted left 2 positions instead of 1. This means that  $I_3$  (instead of  $I_2$ ) is now aligned with the read head. The check bit is now "0" indicating a shift error. To recover from this error we use an additional read head to also read  $I_2$ . The outputs of the two read heads are connected to a multiplexer. The check bit value selects the multiplexer output (shown in blue in Figure 12). A "1" selects the error-free output and a "0" selects the overshifted output. A similar mechanism selects the correct location for writing the input coming from the previous track in the chain. If an overshift error occurs, the write location is also shifted to the left, as the right hand side of Figure 12 shows.

At  $t_6$  the EDC code is again updated. Following an overshift error the shift controller will not issue a shift command for the following cycle ( $t_7$ ) since the track is already properly aligned to access the next input ( $I_4$ ) during that cycle. Note that, since individual words are stored across multiple tracks to enable single-cycle access, an overshift error will affect all inputs that share that track (up to 60 in our design). It is therefore important to detect and correct these errors.
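The detect-select-skip sequence described above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the circuit: the check bit is modeled abstractly (1 after a correct one-position shift, 0 after an overshift by two) rather than by tracking the "101" pattern domain by domain, and the function name `read_input_track` and the `overshift_cycles` parameter are hypothetical.

```python
def read_input_track(inputs, overshift_cycles):
    """Return the values delivered to the LSTM units, one per cycle.

    inputs           -- values stored along one track, in access order
    overshift_cycles -- set of cycle numbers in which the track shifts by 2
    """
    delivered = [inputs[0]]  # initial state: first input is under the primary head
    pos = 0                  # index currently aligned with the primary read head
    skip_shift = False       # set after an overshift; the track is already aligned
    cycle = 0
    while len(delivered) < len(inputs):
        cycle += 1
        if skip_shift:
            # No shift command is issued; the primary head reads the next input.
            skip_shift = False
            delivered.append(inputs[pos])
            continue
        step = 2 if cycle in overshift_cycles else 1
        pos += step
        check_bit = 1 if step == 1 else 0  # abstract model of the EDC check bit
        if check_bit == 1:
            delivered.append(inputs[pos])      # multiplexer selects the primary head
        else:
            delivered.append(inputs[pos - 1])  # multiplexer selects the extra head
            skip_shift = True
    return delivered


track = ["I1", "I2", "I3", "I4", "I5", "I6", "I7"]
print(read_input_track(track, overshift_cycles={2, 5}))
```

In this model the delivered sequence stays in the original order regardless of where overshifts occur, and no cycle is lost, which matches the no-stall property the design relies on.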

#### 4.2.2 Errors in Weight Arrays

A similar mechanism is deployed to detect and mitigate errors in weight arrays associated with each PE. However, because the access timing to the weights array is more critical and weights are stored in a more compact representation, the detection and mitigation steps are implemented differently. Unlike the input racetrack chain, access to the weight arrays does not require a write cycle, so an update to EDC code is not feasible. We instead store a fixed EDC pattern of "01010" at the rightmost edge of the weight tracks as shown in Figure 13. Error detection logic detects an overshift error when the current EDC bit does not match the expected value. For instance, in the initial state, the read heads are aligned with bits from weight  $W_0$  and the error detection logic expects to read "0" from the EDC. Note that weights are interleaved both within and across racetracks such that all read heads across all tracks can access all the bits of a single weight simultaneously.



Figure 12: Mitigation mechanism for overshift errors in the input track chains.



Figure 13: Mitigation mechanism for overshift errors in the weight track chains.

At time  $t_1$  a correct shift takes place and  $W_1$  can be read. At time  $t_2$  an overshift error occurs and weight  $W_3$  is read instead of  $W_2$ . A recovery mechanism similar to the one for inputs could be employed. This would require doubling the number of read heads in each track and extra logic. Since weight storage in RNNFast is substantial, the overhead would be nontrivial. We can, however, avoid this extra overhead by leveraging the observation that replacing the incorrect weight with "zero" yields very little loss in output accuracy compared to error-free execution. This is in contrast with using the erroneous weight, which can be a large value. The following cycle at  $t_3$ , the shift controller will not shift because the track is already aligned for accessing the next weight.
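A minimal behavioral sketch of the zero-substitution policy follows. It assumes that the expected EDC bit simply alternates with the weight index, as the fixed "01010" pattern suggests, and it ignores the interleaving of weights across tracks; `read_weight_track` and `overshift_steps` are hypothetical names.

```python
def read_weight_track(weights, overshift_steps):
    """Return the weight values actually used by the MAC, one per step.

    weights         -- weights stored along the track, in access order
    overshift_steps -- set of step numbers in which the track shifts by 2
    """
    used = [weights[0]]  # initial state: heads aligned with the first weight
    pos = 0              # index of the weight currently under the read heads
    skip_shift = False
    step = 0
    while len(used) < len(weights):
        step += 1
        target = len(used)  # index of the weight the controller wants to read
        if not skip_shift:
            pos += 2 if step in overshift_steps else 1
        skip_shift = False
        expected_bit = target % 2  # alternating pattern, e.g. 0,1,0,1,0
        observed_bit = pos % 2     # EDC bit aligned with the current position
        if observed_bit == expected_bit:
            used.append(weights[pos])
        else:
            used.append(0)         # overshift detected: substitute zero
            skip_shift = True      # already aligned for the next weight
    return used


print(read_weight_track([3, 1, 4, 1, 5, 9], overshift_steps={2}))
# [3, 1, 0, 1, 5, 9]
```

Only the weight that was skipped over is lost; every later weight is still read from its correct position without a second read head.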

## 5. EVALUATION METHODOLOGY

#### **5.1 RNNFast Modeling Infrastructure**

We implemented a detailed behavioral model to evaluate performance, chip area and energy consumption of the RNNFast design. A cycle-level model that accounts for the latency of each component in the design is used for the timing simulation. The simulated hardware is configured for each neural network in our benchmark set by enabling the appropriate number of hardware tiles, LSTMs and PEs. Since all LSTM units execute independently and in parallel, only a single LSTM per tile is simulated to speed up simulation time. For the energy evaluation, the number of reads, writes and shifts, as well as decoder, adder/multiplier and LUT accesses, are counted for all the units in the design.

To understand the energy consumption, shift latency and write latency of the Domain Wall Memory (DWM), an electrical model is necessary. A Verilog-A based SPICE model for DWM from [51,52,53] was simulated in Cadence Virtuoso. The DWM model estimates the effective resistance as a function of the length of the track and uses the width and thickness of the strip to calculate current density and position shift. A Cadence component was created for the DWM model and a test bench was set up to stimulate the device. A sensitivity analysis was conducted to study the effect of track length on shift latency and energy. Table 1 shows the characteristics of the DWM we model and also lists the architectural parameters for RNNFast and the power/area breakdown for different components.

|  |  | DWM properties |  |  |
|---|---|---|---|---|
| Racetrack width/length/thickness |  | 1F / 64F / 3nm | Domain length | 1F |
| Number of bits per track |  | 64 | Effective cell size | $2.56F^2$ |
| Read/shift/write latency |  | 1ns / 0.5ns / 0.5ns | Technology node | 32nm |
| Read/shift/write energy |  | 0.39pJ / 0.24pJ / 9.6fJ |  |  |
|  |  | **Tile properties** |  |  |
| Component | Configuration | Specification | Power (mW) | Area ($\mu m^2$) |
| Input buffer | 1 track/tile<br>with EDC | 16 stripes/track<br>64 cells/stripe | 2.59 | 2.68 |
| LSTM unit | 64 per tile | 4 PEs/LSTM<br>1 Aggre./LSTM | 9.74 | 2046 |
| Total tile |  | 256 PEs<br>64 Aggre. Units | 626 | 0.130mm<sup>2</sup> |
|  |  | **PE properties** |  |  |
| MAC | 2/PE | 272 Adder |  | 422 |
| Weight array | 2 tracks/PE<br>with EDC | 205 stripes/track<br>64 cells/stripe | 2.43 |  |
|  |  | **Aggregation Unit properties** |  |  |
| Accumulator | 4/LSTM | - |  |  |
| Multiplier | 2/LSTM | - | 0.004 | 356 |
| sigmoid | 3/LSTM | Approx. nonlinear func. design | 0.004 | 550 |
| tanh | 2/LSTM | Approx. nonlinear func. design |  |  |
|  |  | **On-chip DW Memory** |  |  |
| Size: 128MB | 4 R/W ports | Acc. energy: 0.89nJ<br>Acc. latency: 1.69ns | Leakage: 24.3mW | Area: 6.2mm<sup>2</sup> |

Table 1: Racetrack memory and RNNFast design parameters with associated power and area overheads.
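The event-counting approach to energy estimation can be sketched with the per-operation DWM energies from Table 1. The event counts below are made-up placeholders, not measurements, and the sketch covers only DWM reads, shifts and writes (decoder, adder/multiplier and LUT accesses would be added the same way).

```python
# Per-operation DWM energies from Table 1, converted to joules.
ENERGY_J = {
    "read": 0.39e-12,   # 0.39 pJ
    "shift": 0.24e-12,  # 0.24 pJ
    "write": 9.6e-15,   # 9.6 fJ
}


def dwm_energy(counts):
    """Total DWM access energy in joules for a dict of event counts."""
    return sum(ENERGY_J[op] * n for op, n in counts.items())


# Hypothetical event counts for one simulated run (illustration only).
example_counts = {"read": 1_000_000, "shift": 1_000_000, "write": 500_000}
total_joules = dwm_energy(example_counts)
print(f"{total_joules * 1e6:.4f} uJ")  # 0.6348 uJ
```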

#### 5.1.1 RNNFast Design Variations

We compare our design with two alternative RNNFast architectures that use CMOS and memristor technologies. We call them RNNFast-CMOS and RNNFast-Me, respectively. For RNNFast-CMOS, we used SRAM buffers for both LSTM inputs and weight storage within PEs. MAC units are also implemented fully in CMOS logic. We used SRAM-based LUTs for the nonlinear functions. Input SRAM buffers are also chained like racetrack memories in order to deliver all inputs to all LSTM units. We also compared RNNFast with an ISAAC-like [16] design for RNNs, which we call ISAAC-RNN, that stores inputs in eDRAM and is entirely CMOS and memristor-based. This is a state-of-the-art solution for accelerating dot products.

Moreover, we also used the ISAAC crossbar design on top of RNNFast, in a configuration called RNNFast-Me. The RNNFast-Me architecture uses 128x128 2-bit memristor crossbars, similar to those in ISAAC, for the dot product engine. RNNFast-Me leverages the architecture elements of RNNFast, with the exception of the memristor crossbar. In order to fit the memristor crossbar to RNNFast for a fair comparison, we change the input data layout. The compute capacity of the crossbar has to be factored into the RNNFast-Me design. First, each memristor dot product engine is capable of  $128 \times 16$  multiplications in parallel (128 inputs by 16 weights). In an LSTM neuron each input is multiplied by 4 different weights, as discussed in Section 2. Thus, each memristor dot product engine can handle 4 neurons, making each LSTM in RNNFast-Me computationally equivalent to 4 LSTMs in RNNFast. Second, the memristor crossbar operates on one bit of multiple inputs in each cycle, while the RNNFast design operates on all bits of an input in each cycle. To leverage the memristor crossbar performance, we changed the input DWM layout to store each input in a different racetrack. Therefore, in each cycle a single bit of each of 128 inputs is accessible.
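The capacity argument can be checked with a few lines of arithmetic. The crossbar dimensions, cell precision and the factor of 4 weights per input come from the text; the assumption that a 16-bit weight occupies eight adjacent 2-bit cells in a row is inferred from the stated 128-by-16 capacity and is not spelled out in this section.

```python
CROSSBAR_ROWS = 128           # one input per row
CROSSBAR_COLS = 128
BITS_PER_CELL = 2
WEIGHT_BITS = 16              # benchmark precision
WEIGHTS_PER_INPUT_PER_NEURON = 4  # each LSTM input is multiplied by 4 weights

# Assumption: a weight is spread across adjacent cells of one row.
cells_per_weight = WEIGHT_BITS // BITS_PER_CELL
weights_per_row = CROSSBAR_COLS // cells_per_weight
parallel_multiplications = CROSSBAR_ROWS * weights_per_row
neurons_per_engine = weights_per_row // WEIGHTS_PER_INPUT_PER_NEURON

print(cells_per_weight, weights_per_row, parallel_multiplications, neurons_per_engine)
# 8 16 2048 4
```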

## 5.2 Benchmarks

We used LSTM-based RNN workloads from the DeepBench [54] open source benchmark suite for DNNs, released by Baidu. For our experiments we used:

| Bench.    | Platform  | Precision | Layers×<br>Neurons        | Time-<br>step | Description          |
|-----------|-----------|-----------|---------------------------|---------------|----------------------|
| im2txt    | DeepBench | 16 bit    | 1×512                     | 11            | image caption        |
| seq2seq   | DeepBench | 16 bit    | 3×1024                    | 15            | language translation |
| mach-tran | DeepBench | 16 bit    | 1×512<br>1×1024<br>1×2048 | 25            | Machine translation  |
| lang-mod  | DeepBench | 16 bit    | 1×1536                    | 50            | language modeling    |
| D-Speech  | DeepBench | 16 bit    | 1×2816                    | 1500          | Deep Speech          |

Table 2: Summary of the benchmarks evaluated.

*Image Caption Generator:* This benchmark is based on the "Show and Tell" Model [55], which is an encoder-decoder type neural network. The decoder is an LSTM RNN that generates captions from a fixed-length vector input.

*Sequence-to-Sequence Model:* This benchmark is based on the RNN encoder-decoder model by Cho et al. [56], which performs language translation. The encoder and decoder are 3-layer LSTM networks.

*Machine Translation:* This benchmark is also based on the RNN encoder-decoder model by Cho et al. [56].

*Language Modeling:* This benchmark models a probability distribution over sequences of words. It is used in speech recognition, sentiment analysis, information retrieval and other applications [57].

*Deep Speech:* a Speech-To-Text engine that uses a model trained by machine learning techniques, based on Baidu's Deep Speech research [58].

All benchmarks are run using 16-bit precision arithmetic on both RNNFast and the P100 GPU.

#### 5.3 GPU Baseline

We choose as a baseline system for evaluation a state-of-the-art GPGPU optimized for machine learning: the NVIDIA Tesla P100 (Pascal architecture) with 16GB of CoWoS-HBM2 memory. All benchmarks use the DNN-optimized NVIDIA cuDNN libraries version 7 [59], which deliver roughly 6× performance improvement relative to a standard GPU implementation for LSTM on Torch [60].

We measure the runtime of the forward passes through the LSTM layers using instrumentation in DeepBench. We measure power consumption using the NVIDIA SMI profiler. Since the SMI provides total board power, in order to isolate the active power of the GPU we subtract the power measured at GPU idle. Since the board components are less energy proportional with activity compared to the GPU, they will account for most of the idle power.
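The idle-subtraction step can be written out as a short sketch. The function name and the sample numbers are hypothetical and chosen only to show the arithmetic; they are not measurements from the P100.

```python
def active_energy_joules(board_power_w, idle_power_w, runtime_s):
    """Energy attributed to GPU activity during one forward pass.

    board_power_w -- total board power reported while the workload runs (W)
    idle_power_w  -- board power measured with the GPU idle (W)
    runtime_s     -- measured forward-pass runtime (s)
    """
    active_power_w = board_power_w - idle_power_w
    return active_power_w * runtime_s


# Hypothetical readings: 180 W under load, 30 W idle, 2 ms forward pass.
energy_j = active_energy_joules(180.0, 30.0, 0.002)
print(f"{energy_j:.3f} J")  # 0.300 J
```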

## 6. EVALUATION

We evaluate the performance, energy consumption and area of RNNFast compared to the NVIDIA GPU and to the CMOS-based and memristor-based RNNFast designs. We also evaluate the reliability of the RNNFast error mitigation, show an area utilization estimate for different benchmarks, and include a high-level comparison to other RNN accelerators.

## 6.1 Performance Improvement

Figure 14 shows the execution time speedup for RNNFast, RNNFast-CMOS, RNNFast-Me and ISAAC-RNN relative to the P100 GPU for the seven benchmarks we run. RNNFast speedup relative to the GPU varies between  $12\times$  for *im2txt* and  $34.5\times$  for *D-Speech*, with an average speedup of  $21.8\times$ . RNNFast speedups increase with the network size, demonstrating the excellent scalability of the design. For instance, in *mach-tran* we test three different network sizes ranging from 512 to 2048, and we observe that the speedup increases from  $15.4\times$  to  $29.3\times$ . This is because the large number of threads required to handle the larger network becomes a bottleneck even for the GPU, whereas RNNFast scales much better.

RNNFast-Me and ISAAC-RNN also bring a substantial speedup relative to the GPU, ranging between 1.88× for *im2txt* and 5.8× for *D-Speech*. Although this is substantial, it is more than 6.1× slower than the DWM RNNFast implementation. This is primarily due to the higher latency of the LSTM unit in RNNFast-Me, which is 7.3× higher than that of an RNNFast LSTM unit. The higher latency is due to the memristor array read latency (100ns) and overheads that stem from the ADC/DAC components. A single memristor array can handle up to 4 neurons, which increases the throughput. Even so, for the same number of PE equivalents, RNNFast-Me is still fundamentally slower than RNNFast. ISAAC-RNN shows slightly higher speedup than RNNFast-Me because it uses CMOS adders in the aggregation units, which are faster. Since the main contributor to computation time in ISAAC-RNN is the crossbar, it does not show a considerable performance difference from RNNFast-Me. The RNNFast-CMOS design shows a 2.1× speedup compared to RNNFast. This is due to faster CMOS adders and random memory access instead of the shift-based access in RNNFast.

Figure 15 shows the energy consumption for RNNFast, RNNFast-CMOS, RNNFast-Me and ISAAC-RNN relative to the GPU in log scale. RNNFast reduces energy consumption on average by 70×. This is due to a much faster execution time achieved with about 1/3 the power of a GPU. The RNNFast-CMOS design has 55% higher energy compared to RNNFast. This reaches a 100% increase for *D-Speech* due to higher resource demand, which increases the leakage energy for both compute and memory logic in CMOS. This causes the CMOS design to reach its maximum TDP for smaller networks. ISAAC-RNN has slightly higher energy usage than RNNFast-Me due to its leaky eDRAM buffers and CMOS

RNNFast offers a much more scalable design relative to a GPU due to its modular design and the very high storage density of DWM. Figure 17 shows the log scale of execution time for the *mach-tran* benchmark as a function of problem (neural network) size ranging from 128 nodes to 16K nodes per layer in a single-layer configuration. For problem sizes larger than 16K, the GPU runs fail because the device runs out of memory. The GPU exhibits a super-linear increase in execution time with problem size due to memory pressure. RNNFast is consistently faster than the GPU, in the range of  $13.9 \times (0.5 \, \text{K})$  to  $156 \times (16 \, \text{K})$ , and also scales better to very large problem sizes of 16K nodes and beyond. RNNFast-Me scales similarly to RNNFast but it is also  $6.2 \times$  slower than RNNFast on average for *mach-tran*. RNNFast-CMOS shows almost  $2 \times$  speedup over RNNFast but faces major energy and area challenges, as discussed. Figure 18 shows a similar trend for *im2txt*. The GPU shows good performance up to  $0.5 \, \text{K}$ , but run time increases exponentially beyond that.

Figure 14: RNNFast, RNNFast-CMOS, ISAAC-RNN and RNNFast-Me speedup relative to the GPU execution.

Figure 15: Energy consumption for RNNFast, RNNFast-CMOS, ISAAC-RNN and RNNFast-Me relative to the GPU.

Figure 16: Storage saving and performance degradation for different network sizes for Approx. Function-based sigmoid design relative to LUT.

## 6.2 Error Mitigation

We also evaluate RNNFast resilience to position errors. Figure 19 shows the accuracy of the output, as evaluated by the BLEU metric [50], as a function of the probability of position errors. For a relatively low error probability of  $4.5 \times 10^{-7}$  the output accuracy is virtually unaffected. This is primarily due to the inherent robustness of the RNN to errors. However, at higher error rates the output accuracy degrades substantially. In the region around  $4.5 \times 10^{-5}$  (highlighted region), which is the expected rate for single-bit position errors, the output accuracy drops to 45% for *im2txt* and 10% for *seq2seq*, which is unacceptable for most applications. When RNNFast error mitigation is enabled, the drop in output accuracy is less than 2%.

The RNNFast error mitigation produces outputs with less than 5% accuracy loss even for much higher error rates of  $10^{-3}$ , and around 20% accuracy loss for  $10^{-2}$ . This shows that RNNFast EDC is robust to much higher error rates than what is expected for DWM technology.

It is also worth highlighting the fact that error mitigation incurs no performance penalty even when errors are detected. Correction or mitigation are performed without stalling the execution pipeline. This is an important design consideration because of the highly synchronized nature of the design. A single stall to correct an error would result in lost cycles for thousands of functional units.

#### 6.3 Nonlinear Function Hardware

We evaluate two designs for the nonlinear function hardware: a LUT-based implementation and an approximate logic function-based unit. The function-based implementation is area efficient since it does not require as much storage as the LUT-based design. However, the computation required, albeit simple, is slower than the simple lookup of the LUT version. The activation functions are not a significant latency bottleneck. However, at this scale we have thousands of such units on chip, so reducing their area adds up to real savings. Figure 16 shows the storage savings and performance degradation of the function-based sigmoid/tanh relative to the LUT design for multiple network sizes. The storage savings diminish as the network size increases because the storage space for the weights dominates. For large networks the storage savings are about 4%, which represents >1GB of DWM for a 16K network. As for the performance cost, it starts at about 9%, but falls below 1% for larger networks. The approximated nonlinear function does not result in loss of accuracy as measured by the BLEU score.
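To make the LUT versus function trade-off concrete, the sketch below contrasts a table lookup with a piecewise-linear sigmoid. The piecewise-linear form used here is the well-known PLAN approximation, chosen as a stand-in because its slopes are powers of two; it is not claimed to be the approximation implemented in RNNFast, and the 256-entry table size and the input range of [-8, 8] are likewise assumptions.

```python
import math


def sigmoid_plan(x):
    """Piecewise-linear (PLAN-style) sigmoid; needs no stored table."""
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375
    elif a >= 1.0:
        y = 0.125 * a + 0.625
    else:
        y = 0.25 * a + 0.5
    return y if x >= 0 else 1.0 - y


def tanh_plan(x):
    """tanh derived from the sigmoid via tanh(x) = 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid_plan(2.0 * x) - 1.0


def build_sigmoid_lut(entries=256, lo=-8.0, hi=8.0):
    """Precompute sigmoid samples; storage grows with the number of entries."""
    step = (hi - lo) / (entries - 1)
    return [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(entries)]


def sigmoid_lut(x, lut, lo=-8.0, hi=8.0):
    """Nearest-entry lookup: one access, no arithmetic on the data path."""
    x = min(max(x, lo), hi)
    index = round((x - lo) / (hi - lo) * (len(lut) - 1))
    return lut[index]


lut = build_sigmoid_lut()
for x in (-3.0, 0.0, 0.5, 2.0):
    exact = 1.0 / (1.0 + math.exp(-x))
    print(x, round(exact, 4), round(sigmoid_plan(x), 4), round(sigmoid_lut(x, lut), 4))
```

The function-based version trades a few comparisons and shift-and-add style operations for the storage that every LUT instance would otherwise need, which is the trade-off the evaluation quantifies.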

## 6.4 RNNFast Parameter Tuning

We also conduct a sensitivity analysis on the number of LSTM units per tile. Figure 20 illustrates the tile input buffer energy versus the number of LSTMs per tile for different network sizes. As the number of LSTMs per tile increases, the power/area overhead for the within-tile bus increases superlinearly. The minimum energy point is different depending on the size of the network. The 64 LSTM units per tile represents a reasonable compromise for medium-to-large networks. The maximum input racetrack size is also 64. Since most networks have an equal number of inputs and LSTM cells, having 64 LSTM units per tile makes the chaining in the design very storage-efficient, without any blank racetrack cells. If the number of LSTM units is lower or higher than the number of racetrack cells, we have to use more tiles to map a network (more input racetracks are needed to fit the data), while some tiles will be partially underutilized.
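A rough sketch of how layer size translates into tile count and utilization is shown below. It assumes one logical neuron per LSTM unit and ignores the input racetrack constraints discussed above, so it is a simplification of the actual mapping; `tiles_for_layer` is a hypothetical helper.

```python
import math

LSTM_UNITS_PER_TILE = 64  # configuration from Table 1


def tiles_for_layer(neurons, units_per_tile=LSTM_UNITS_PER_TILE):
    """Return (tiles needed, fraction of LSTM units in use) for one layer."""
    tiles = math.ceil(neurons / units_per_tile)
    utilization = neurons / (tiles * units_per_tile)
    return tiles, utilization


print(tiles_for_layer(512))  # (8, 1.0): layer size is a multiple of 64
print(tiles_for_layer(100))  # (2, 0.78125): the second tile is partly idle
```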

#### 6.5 Comparison to Other RNN Accelerators

The only prior work we are aware of on RNN accelerators has focused on FPGA implementations [3,5]. While a direct comparison with those designs is difficult, we offer a qualitative comparison based on their reported runtime and energy numbers. We scale the RNNFast input sizes to the problem sizes reported in those papers: 4.3 million weights and 32 nodes for [3] and [5], respectively.

In the case of [3], RNNFast is 2 orders of magnitude faster. Even though [3] reports energy for the entire board, so a direct comparison is unfair, RNNFast has orders of magnitude lower energy. The network used in [5] is much smaller, at only 64 nodes. The relatively large RNNFast chip is not very efficient for such small problem sizes, but it still achieves  $4\times$  lower energy and is  $4.7\times$  faster.

Very recently Fowers et al. [61] introduced Brainwave, an FPGA-based accelerator for RNNs with no batching for real-time AI. We show runtimes for a range of network sizes alongside estimated energy consumption in Table 3. Brainwave shows better performance for larger networks and poorer performance for smaller networks, as it is designed to be efficient for batch=1. Note that this is not an apples-to-apples comparison to our design, given that Brainwave uses 8-bit precision (vs. 16-bit for RNNFast) and a 14nm technology node (vs. 32nm for RNNFast). Even under these circumstances, RNNFast is comparable in performance with less than half the energy of Brainwave.

The Google TPU is also capable of running RNN workloads efficiently. The authors of [1] report up to  $8\times$  better performance for LSTM workloads compared to the NVIDIA K80. RNNFast is up to  $260\times$  faster than the newer NVIDIA P100 for workloads of similar size.

| FPGA<br>Design | Net size | Run time (ms)   | Energy (µJ)       | RNNFast<br>run time (ms) | RNNFast<br>energy (µJ) |
|----------------|----------|-----------------|-------------------|--------------------------|------------------------|
| [3]            | 4.2M     | 390             | 7.65E6 (Brd. Eng.) | 3.45                     | 7229                   |
| [5]            | 32       | 1.586E-3        | 0.8               | 3.32E-4                  | 0.0419                 |
| [61]           | 256-2K   | 0.00283-0.00296 | Est.: 3-233       | 0.00078-0.0044           | 1.68-103               |

Table 3: Energy and run time for three FPGA-based RNNs.

Figure 17: RNNFast, RNNFast-Me and GPU execution times vs. net size for *mach-tran*, normalized to RNNFast 0.125K.

Figure 18: RNNFast, RNNFast-Me and GPU execution times vs. network size for *im2txt*, normalized to RNNFast 0.125K.

Figure 19: Output accuracy relative to error free execution for benchmarks *im2txt* and *seq2seq* with and without RNNFast EDC.

Figure 20: Sensitivity analysis for number LSTMs per tile

## 7. RELATED WORK

**DNN Accelerators.** Many customized accelerators for machine learning algorithms and DNNs have been proposed recently [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. The majority of this work focuses on improving the performance of CNNs, exploring the potential for resource sharing, leveraging emerging memory technologies, optimizing basic operations, and developing domain-specific methods.

Data sharing and sparsity in both the activations and weights are explored to reduce the data movement and long latency of DRAM accesses. ShiDianNao [13] explored the inherent weight sharing and eliminated all DRAM accesses for weights. EIE [14] used compression of the network model to reduce the memory footprint and accelerate real-time networks in which batching cannot be employed to improve data re-use. Eyeriss [15] explored local data reuse of filter weights and activations in high-dimensional convolutions in order to minimize the energy of data movement.

Emerging memory technologies and in-memory processing have been leveraged for CNN designs to address memory latency limitations and to implement custom logic. For instance, ISAAC [16] presents a CNN architecture that implements a fast and energy-efficient analog dot-product engine using memristor-based crossbars. PRIME [17] combined a processor-in-memory architecture and ReRAM-based neural network computation. The crossbar array structure in ReRAM can be used to perform matrix-vector multiplication as well as regular memory to increase memory space. Neurocube [18] proposed a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The design in [62] also used a ReRAM crossbar for RNN acceleration, for a case of human activity detection with a small network size of 100 and a simple vanilla RNN. CNV [19] accelerates DNNs in hardware by eliminating a large fraction of ineffectual zero-valued operand multiplications. It improves performance and energy using data-parallel units and a co-designed data storage format without losing accuracy. RedEye [20] reduces analog readout and computational burden by moving convolutional processing into an image sensor's domain. Minerva [21] automates the co-design flow by optimizing across the algorithm, architecture and circuit levels. Cambricon [22] proposes a novel domain-specific Instruction Set Architecture (ISA) for neural network accelerators. PuDianNao [12] focuses on a range of popular machine learning algorithms. However, all these optimizations are CNN/DNN-specific.

Brainwave [61] proposed a single-threaded SIMD architecture for CNNs/RNNs. It expands compound SIMD operations into thousands of fixed-vector-size operations, which form primitives that are fanned out to compute units. These parallelized vector operations are mapped to one-dimensional flat functional units, connected in a way that allows vectors to flow through the pipeline without any bubbles.

## 8. CONCLUSION

The unprecedented growth of available data is accelerating the adoption of deep learning across a wide range of applications including speech recognition, machine translation, and language modeling. In this study, we propose RNNFast, a novel accelerator tuned for recurrent neural networks. Our approach demonstrates that using domain wall memory is not only feasible, but also very efficient. We compare our design with a state-of-the-art P100 NVIDIA GPU and find  $21.8 \times$  better performance with  $70 \times$  lower energy.

## 9. REFERENCES

- [1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, (New York, NY, USA), pp. 1–12, ACM, 2017.
- [2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, A. Y. Hannun, B. Jun, T. Han, P. LeGresley, X. Li, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, S. Qian, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, C. Wang, Y. Wang, Z. Wang, B. Xiao, Y. Xie, D. Yogatama, J. Zhan, and Z. Zhu, "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in *Proceedings of the 33rd International Conference on Machine Learning*, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 173–182, 2016.
- [3] Y. Guan, Z. Yuan, G. Sun, and J. Cong, "FPGA-based accelerator for long short-term memory recurrent neural networks," in *2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC)*, pp. 629–634, Jan 2017.
- [4] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, "FPGA acceleration of recurrent neural network based language model," in *2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines*, pp. 111–118, May 2015.
- [5] J. C. Ferreira and J. Fonseca, "An FPGA implementation of a long short-term memory neural network," in *2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)*, pp. 1–8, Nov 2016.
- [6] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr, "Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC," in *2016 26th International Conference on Field Programmable Logic and Applications (FPL)*, pp. 1–4, Aug 2016.
- [7] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, "Scaledeep: A scalable compute architecture for learning and evaluating deep networks," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, (New York, NY, USA), pp. 13–26, ACM, 2017.
- [8] Y. Shen, M. Ferdman, and P. Milder, "Maximizing cnn accelerator efficiency through resource partitioning," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, (New York, NY, USA), pp. 535–547, ACM, 2017.
- [9] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, (New York, NY, USA), pp. 27–40, ACM, 2017.
- [10] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "Dadiannao: A machine-learning supercomputer," in 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13-17, 2014, pp. 609–622, 2014.
- [11] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning," in Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1-5, 2014, pp. 269–284, 2014.

- [12] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "Pudiannao: A polyvalent machine learning accelerator," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, Istanbul, Turkey, March 14-18, 2015, pp. 369–381, 2015.
- [13] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: shifting vision processing closer to the sensor," in *Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 13-17, 2015*, pp. 92–104, 2015.
- [14] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: efficient inference engine on compressed deep neural network," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 243–254, 2016.
- [15] Y. Chen, J. S. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 367–379, 2016.
- [16] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 14–26, 2016.
- [17] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 27–39, 2016.
- [18] D. Kim, J. Kung, S. M. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 380–392, 2016.
- [19] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: ineffectual-neuron-free deep neural network computing," in *Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on*, pp. 1–13, IEEE, 2016.
- [20] R. LiKamWa, Y. Hou, Y. Gao, M. Polansky, and L. Zhong, "Redeye: Analog convnet image sensor architecture for continuous mobile vision," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 255–266, 2016.
- [21] B. Reagen, P. N. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. M. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, 2016.
- [22] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 393–405, 2016.
- [23] S. S. P. Parkin, M. Hayashi, and L. Thomas, "Magnetic domain-wall racetrack memory," *Science*, vol. 320, no. 5873, pp. 190–194, 2008.
- [24] Y. Wang, H. Yu, L. Ni, G.-B. Huang, M. Yan, C. Weng, W. Yang, and J. Zhao, "An energy-efficient nonvolatile in-memory computing architecture for extreme learning machine by domain-wall nanowire devices," *IEEE Transactions on Nanotechnology*, vol. 14, no. 6, pp. 998–1012, 2015.
- [25] Y. Wang, H. Yu, D. Sylvester, and P. Kong, "Energy efficient in-memory AES encryption based on nonvolatile domain-wall nanowire," in *Design, Automation & Test in Europe Conference & Exhibition, DATE 2014, Dresden, Germany, March 24-28, 2014*, pp. 1–4, 2014.
- [26] H. Yu, Y. Wang, S. Chen, W. Fei, C. Weng, J. Zhao, and Z. Wei, "Energy efficient in-memory machine learning for data intensive image-processing by non-volatile domain-wall memory," in 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014, Singapore, January 20-23, 2014, pp. 191–196, 2014.

- [27] K. Huang, R. Zhao, and Y. Lian, "Racetrack memory-based nonvolatile storage elements for multicontext fpgas," *IEEE Trans. VLSI Syst.*, vol. 24, no. 5, pp. 1885–1894, 2016.
- [28] J. Chung, J. Park, and S. Ghosh, "Domain wall memory based convolutional neural networks for bit-width extendability and energy-efficiency," in *Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ISLPED 2016, San Francisco Airport, CA, USA, August 08 - 10, 2016*, pp. 332–337, 2016.
- [29] W. Zhao, N. B. Romdhane, Y. Zhang, J.-O. Klein, and D. Ravelosona, "Racetrack memory based reconfigurable computing," in *Faible Tension Faible Consommation (FTFC), 2013 IEEE*, pp. 1–4, IEEE, 2013.
- [30] R. Venkatesan, V. J. Kozhikkottu, M. Sharad, C. Augustine, A. Raychowdhury, K. Roy, and A. Raghunathan, "Cache design with domain wall memory," *IEEE Trans. Comput.*, vol. 65, pp. 1010–1024, Apr. 2016.
- [31] A. J. Annunziata, M. C. Gaidis, L. Thomas, C. W. Chien, C. C. Hung, P. Chevalier, E. J. O'Sullivan, J. P. Hummel, E. A. Joseph, Y. Zhu, T. Topuria, E. Delenia, P. M. Rice, S. S. P. Parkin, and W. J. Gallagher, "Racetrack memory cell array with integrated magnetic tunnel junction readout," in 2011 International Electron Devices Meeting, pp. 24.3.1–24.3.4, Dec 2011.
- [32] A. Ankit, A. Sengupta, P. Panda, and K. Roy, "Resparc: A reconfigurable and energy-efficient architecture with memristive crossbars for deep spiking neural networks," arXiv preprint arXiv:1702.06064, 2017.
- [33] S. Hochreiter and J. Schmidhuber, "Long short-term memory," *Neural Computation*, vol. 9, no. 8, pp. 1735–1780, 1997.
- [34] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "Lstm: A search space odyssey," *IEEE Transactions on Neural Networks and Learning Systems*, 2016.
- [35] A. Graves, A. Mohamed, and G. E. Hinton, "Speech recognition with deep recurrent neural networks," in *IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013*, pp. 6645–6649, 2013.
- [36] Z. Sun, W. Wu, and H. H. Li, "Cross-layer racetrack memory design for ultra high density and low power consumption," in *Proceedings of the 50th Annual Design Automation Conference*, DAC '13, (New York, NY, USA), pp. 53:1–53:6, ACM, 2013.
- [37] S. Ghosh, "Design methodologies for high density domain wall memory," in 2013 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 30–31, July 2013.
- [38] Z. Sun, X. Bi, W. Wu, S. Yoo, and H. Li, "Array organization and data management exploration in racetrack memory," *IEEE Transactions on Computers*, vol. 65, pp. 1041–1054, April 2016.
- [39] Y. Zhang, C. Zhang, J. Nan, Z. Zhang, X. Zhang, J. O. Klein, D. Ravelosona, G. Sun, and W. Zhao, "Perspectives of racetrack memory for large-capacity on-chip memory: From device to system," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 63, pp. 629–638, May 2016.
- [40] S. Motaman, A. S. Iyengar, and S. Ghosh, "Domain wall memory-layout, circuit and synergistic systems," *IEEE Transactions on Nanotechnology*, vol. 14, pp. 282–291, March 2015.
- [41] C. Zhang, G. Sun, W. Zhang, F. Mi, H. Li, and W. Zhao, "Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power," in *The 20th Asia and South Pacific Design Automation Conference*, pp. 100–105, Jan 2015.
- [42] R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, "Dwm-tapestri - an energy efficient all-spin cache using domain wall shift based writes," in 2013 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1825–1830, March 2013.
- [43] C. Zhang, G. Sun, X. Zhang, W. Zhang, W. Zhao, T. Wang, Y. Liang, Y. Liu, Y. Wang, and J. Shu, "Hi-fi playback: Tolerating position errors in shift operations of racetrack memory," in *Proceedings of the 42nd Annual International Symposium on Computer Architecture*, ISCA '15, (New York, NY, USA), pp. 694–706, ACM, 2015.
- [44] C. Xu, D. Niu, X. Zhu, S. H. Kang, M. Nowak, and Y. Xie, "Device-architecture co-optimization of STT-RAM based memory for low power embedded systems," in *ICCAD*, pp. 463–470, IEEE Press, 2011.
- [45] C. W. Smullen, A. Nigam, S. Gurumurthi, and M. R. Stan, "The stetsims stt-ram simulation and modeling system," in *ICCAD*, pp. 318–325, IEEE Press, 2011.
- [46] A. Ranjan, S. G. Ramasubramanian, R. Venkatesan, V. Pai, K. Roy, and A. Raghunathan, "Dyrectape: A dynamically reconfigurable cache using domain wall memory tapes," in 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 181–186, March 2015.
- [47] A. Iyengar and S. Ghosh, "Modeling and analysis of domain wall dynamics for robust and low-power embedded memory," in *Proceedings of the 51st Annual Design Automation Conference*, DAC '14, (New York, NY, USA), pp. 65:1–65:6, ACM, 2014.
- [48] M. Tommiska, "Efficient digital implementation of the sigmoid function for reprogrammable logic," *IEE Proceedings-Computers and Digital Techniques*, vol. 150, no. 6, pp. 403–411, 2003.
- [49] A. Paszke, S. Chintala, R. Collobert, K. Kavukcuoglu, C. Farabet, S. Bengio, I. Melvin, J. Weston, and J. Mariethoz, "Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration," May 2017.
- [50] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: A method for automatic evaluation of machine translation," in *Proceedings of the 40th Annual Meeting on Association for Computational Linguistics*, ACL '02, (Stroudsburg, PA, USA), pp. 311–318, Association for Computational Linguistics, 2002.
- [51] S. Motaman and S. Ghosh, "Adaptive write and shift current modulation for process variation tolerance in domain wall caches," *IEEE Trans. VLSI Syst.*, vol. 24, no. 3, pp. 944–953, 2016.
- [52] S. Motaman, A. S. Iyengar, and S. Ghosh, "Domain wall memory-layout, circuit and synergistic systems," *IEEE Transactions on Nanotechnology*, vol. 14, no. 2, pp. 282–291, 2015.
- [53] S. Motaman, A. Iyengar, and S. Ghosh, "Synergistic circuit and system design for energy-efficient and robust domain wall caches," in International Symposium on Low Power Electronics and Design, ISLPED'14, La Jolla, CA, USA - August 11 - 13, 2014, pp. 195–200, 2014.
- [54] "Deepbench." https://svail.github.io/DeepBench/.
- [55] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge," *CoRR*, vol. abs/1609.06647, 2016.
- [56] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," *CoRR*, vol. abs/1406.1078, 2014.
- [57] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, pp. 275–281, 1998.
- [58] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," *CoRR*, vol. abs/1412.5567, 2014.
- [59] "Nvidia cuda deep neural network library." https://developer.nvidia.com/cudnn.
- [60] "Optimizing recurrent neural networks in cudnn 5." https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/.
- [61] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, "A configurable cloud-scale dnn processor for real-time ai," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14, June 2018.
- [62] Y. Long, E. M. Jung, J. Kung, and S. Mukhopadhyay, "Reram crossbar based recurrent neural network for human activity detection," in 2016 International Joint Conference on Neural Networks (IJCNN), pp. 939–946, July 2016.