
Attention as an RNN

Leo Feng (Mila & Borealis AI), leo.feng@mila.quebec
Frederick Tung (Borealis AI), frederick.tung@borealisai.com
Hossein Hajimirsadeghi (Borealis AI), hossein.hajimirsadeghi@borealisai.com
Mohamed Osama Ahmed (Borealis AI), mohamed.o.ahmed@borealisai.com
Yoshua Bengio (Mila – Université de Montréal), yoshua.bengio@mila.quebec
Greg Mori (Borealis AI), greg.mori@borealisai.com

arXiv:2405.13956v1 [cs.LG] 22 May 2024

Abstract

The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inference (like traditional RNNs). Empirically, we show that Aarens achieve performance comparable to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting, while being more time- and memory-efficient.

1 Introduction

Advancements in sequence modelling are highly impactful due to the wide range of applications,
including reinforcement learning (e.g., robotics and autonomous driving), time series classification
(e.g., financial fraud detection and medical diagnoses), and time series forecasting (e.g., weather
and energy consumption predictions). Over the past several years, an extensively explored topic
in sequence modelling is that of Transformer-based models (Vaswani et al., 2017). This is due to
Transformers’ strong performance and ability to leverage GPU parallelism. As a result, numerous
Transformer-specific survey papers have been written for various sequential settings such as rein-
forcement learning (Agarwal et al., 2023; Li et al., 2023), time series (Lin et al., 2022; Jiang et al.,
2024), and speech processing (Latif et al., 2023).

Preprint. Under review.


With the rapid increase in low-resource domains such as battery-powered devices, deployed models
in these domains must be computationally efficient. Transformers, on the other hand, are expensive
due to their quadratic scaling in memory and computation. Although their efficiency can be enhanced
at inference time using techniques such as KV-caching (Pope et al., 2023), Transformers remain
expensive for low-resource domains due to requiring (1) memory linear in the number of tokens and
(2) the caching of all preceding tokens, which must be provided to the model. The effect of these limitations is exacerbated in
settings with long contexts (i.e., large number of tokens) such as those common in time series (e.g.,
climate and economics).
To address this issue, we begin by examining attention, the component responsible for Transformers'
quadratic computational complexity. We show that (1) attention can be viewed as a special Recur-
rent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
Leveraging the RNN formulation of attention, we (2) show that popular attention-based models (e.g.,
Transformers and Perceivers) can be viewed as RNNs. However, unlike traditional RNNs such as
LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014), these attention-based
models are unable to perform efficient updates with new tokens, an important property in sequential
problem settings where data is processed in a stream.
To address this, we (3) introduce a new formulation of attention based on the parallel prefix scan
algorithm (Blelloch, 1990) that efficiently computes attention’s many-to-many RNN output. Building
on this new attention formulation, we (4) introduce Aaren ([A]ttention [a]s a [re]current neural
[n]etwork), a computationally efficient module that can not only (i) be trained in parallel (like
Transformers) but also (ii) be efficiently updated with new tokens, requiring only constant memory
for inference (like traditional RNNs). Empirically, we show that Aarens achieve comparable performance
to Transformers on 38 datasets spread across four popular sequential data settings: reinforcement
learning, event forecasting, time series classification, and time series forecasting, while being
more time- and memory-efficient.

2 Background
2.1 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are specialized models for sequence modelling. In brief, RNNs process sequential data by iteratively computing hidden states as follows:

$$h_t = f_\theta(h_{t-1}, x_t)$$

where t denotes the step index, x_t represents a token, h_t represents the hidden state, and f_θ is a neural network parameterized by θ. The value of the initial hidden state h_0 is typically learned via backpropagation. Popular RNNs such as LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014) fix the size of the hidden state h_t to a constant irrespective of the step index. As such, these models are efficient at test time, requiring only constant memory and time per token, and can be easily updated with new tokens, an important property in sequence modelling. However, due to being iterative by design, these popular RNNs suffer from scalability issues because of their lack of parallelizability. As such, RNNs were replaced as the standard for many sequence modelling settings by Transformers (Vaswani et al., 2017), a parallelizable attention-based architecture.
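To make the recurrence above concrete, the following is a minimal NumPy sketch of a simple RNN iterating over a token sequence. The tanh cell and all weight/variable names are illustrative assumptions (a deliberately simpler cell than the LSTM/GRU cells discussed above), not a specific model from the paper.

```python
import numpy as np

def rnn_forward(x, h0, W_h, W_x, b):
    """Iteratively compute hidden states h_t = f_theta(h_{t-1}, x_t) over a sequence x."""
    h = h0
    hiddens = []
    for x_t in x:                              # constant memory/time per token, but sequential
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # one possible choice of f_theta (assumption)
        hiddens.append(h)
    return np.stack(hiddens)                   # (N, d_h) hidden states

# Usage with toy dimensions (d_in = 8, d_h = 16):
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
hs = rnn_forward(x, np.zeros(16),
                 0.1 * rng.normal(size=(16, 16)),
                 0.1 * rng.normal(size=(16, 8)),
                 np.zeros(16))
```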

2.2 Attention

Attention retrieves information from a set of context tokens $X_C$ for a given set of query tokens $X_Q$ as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T)V$$

where $Q = X_Q W_q$ is the query matrix, $K = X_C W_k$ is the key matrix, and $V = X_C W_v$ is the value matrix. $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are weight matrices (learned parameters). $\mathrm{softmax}(QK^T)$ computes the weights of the context tokens for a weighted average. Notably, unlike RNNs, attention is not designed to be iterative; instead, it is designed to easily leverage GPU parallelism. Transformers (Vaswani et al., 2017) use self-attention,¹ a special case of attention where the query tokens are the same as the context tokens. However, self-attention requires quadratic computation with respect to the number of tokens and is not efficiently updateable with new tokens. As a result, Transformers are computationally expensive, limiting their applications in low-resource domains.

¹ For sequential data, a causal mask is typically used to prevent attending to future timesteps.
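For reference, here is a minimal NumPy sketch of the attention operation above, with an optional causal mask for self-attention. The variable names, the omission of the usual 1/√d scaling (matching the unscaled formula above), and the absence of multi-head projections are simplifying assumptions.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T) V over the context tokens (scaling and multi-head projections omitted)."""
    scores = Q @ K.T                                      # (N_q, N_c) pairwise dot products
    if causal:                                            # self-attention: mask future timesteps
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # standard softmax stability trick
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # weighted average of the values
```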

3 Methodology

Figure 1: Attention as a many-to-one RNN. The query tokens are the initial hidden states of the RNNs. (a) Conventional Attention: the conventional method of computing attention only computes its final output. As such, it can be viewed as a method of computing attention's many-to-one RNN output. (b) Transformer's Self-Attention (Vaswani et al., 2017) uses the input tokens as the initial hidden states. (c) Perceiver's Cross-Attention (Jaegle et al., 2021) uses input-dependent latents as the initial hidden states.

Addressing this, we propose an efficient attention-based module capable of leveraging GPU paral-
lelism while being efficiently updateable. We first show in Section 3.1 that attention can
be viewed as an RNN with the special ability to compute its many-to-one RNN (Figure 1a) output
efficiently. Leveraging the RNN formulation of attention, we further show that popular attention-
based models such as Transformers (Figure 1b) and Perceivers (Figure 1c) can be viewed as RNNs.
However, unlike traditional RNNs, these models are unable to efficiently update themselves with new
tokens, limiting their potential in sequential problem settings where data arrives as a stream. Tackling
this, we introduce in Section 3.2 an efficient method for computing attention as a many-to-many RNN
based on the parallel prefix scan algorithm. Building on this, we introduce in Section 3.3 Aaren
([A]ttention [a]s a [re]current neural [n]etwork), a computationally efficient module that can not only
(i) be trained in parallel (like Transformers) but also (ii) be efficiently updated with new tokens at
inference time, requiring only constant memory for inferences (like traditional RNNs).

3.1 Attention as a (many-to-one) RNN

Attention for a query vector q can be viewed as a function that maps N context tokens $x_{1:N}$ via their keys and values $\{(k_i, v_i)\}_{i=1}^{N}$ to a singular output $o_N = \mathrm{Attention}(q, k_{1:N}, v_{1:N})$. Given $s_i = \mathrm{dot}(q, k_i)$, the output $o_N$ can be formulated as:

$$o_N = \sum_{i=1}^{N} \mathrm{softmax}(s)_i v_i = \frac{\sum_{i=1}^{N} \exp(s_i) v_i}{\sum_{i=1}^{N} \exp(s_i)} = \frac{\hat{a}_N}{\hat{c}_N}$$

where the numerator is $\hat{a}_N = \sum_{i=1}^{N} \exp(s_i) v_i$ and the denominator is $\hat{c}_N = \sum_{i=1}^{N} \exp(s_i)$. Viewing attention as an RNN, we can compute both iteratively as rolling sums $\hat{a}_k = \hat{a}_{k-1} + \exp(s_k) v_k$ and $\hat{c}_k = \hat{c}_{k-1} + \exp(s_k)$ for $k = 1, \ldots, N$. However, in practice, this is an unstable implementation, running into numerical issues due to limited precision representations and potentially very small or large exponents (i.e., exp(s)). Mitigating this, we re-write the recurrence with a cumulative maximum term² $m_k = \max_{i \in \{1,\ldots,k\}} s_i$, computing instead

$$a_k = \sum_{i=1}^{k} \exp(s_i - m_k) v_i \qquad \text{and} \qquad c_k = \sum_{i=1}^{k} \exp(s_i - m_k).$$

Notably, the final result is the same: $o_N = \frac{\hat{a}_N}{\hat{c}_N} = \frac{a_N}{c_N}$. $a_k$, $c_k$, and $m_k$ are thus computed recurrently as follows:

$$a_k = a_{k-1} \exp(m_{k-1} - m_k) + v_k \exp(s_k - m_k)$$
$$c_k = c_{k-1} \exp(m_{k-1} - m_k) + \exp(s_k - m_k)$$
$$m_k = \max(m_{k-1}, s_k)$$

By encapsulating the recurrent computation of $a_k$, $c_k$, and $m_k$ from $a_{k-1}$, $c_{k-1}$, and $m_{k-1}$, we introduce an RNN cell that iteratively computes the output of attention (see Figure 2). Attention's RNN cell takes as input $(a_{k-1}, c_{k-1}, m_{k-1}, q)$ and computes $(a_k, c_k, m_k, q)$. Note that the query vector q is carried over in the RNN cell. The initial hidden state of attention's RNN is $(a_0, c_0, m_0, q) = (0, 0, 0, q)$.

Figure 2: Attention's RNN Cell.

² A common trick for computing softmax numerically stably is to subtract the max term (Goodfellow et al., 2016). The cumulative max is an extension of this trick for a recurrent computation of softmax.
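The recurrence above translates directly into code. Below is a minimal NumPy sketch of attention's RNN cell (Figure 2) together with the many-to-one computation that folds in context tokens one at a time in constant memory; the function and variable names are ours, not the paper's.

```python
import numpy as np

def attention_rnn_cell(state, k_t, v_t):
    """One RNN step: update (a, c, m, q) with a new (key, value) pair."""
    a, c, m, q = state
    s = float(q @ k_t)                                  # s_t = dot(q, k_t)
    m_new = max(m, s)                                   # cumulative maximum
    a_new = a * np.exp(m - m_new) + v_t * np.exp(s - m_new)
    c_new = c * np.exp(m - m_new) + np.exp(s - m_new)
    return (a_new, c_new, m_new, q)                     # query is carried over

def attention_many_to_one(q, keys, values):
    """Sequentially compute Attention(q, k_1:N, v_1:N) in O(1) memory."""
    state = (np.zeros_like(values[0]), 0.0, 0.0, q)     # initial hidden state (a_0, c_0, m_0, q)
    for k_t, v_t in zip(keys, values):
        state = attention_rnn_cell(state, k_t, v_t)
    a, c, _, _ = state
    return a / c                                        # o_N = a_N / c_N
```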
Methods for computing attention. By viewing attention as an RNN, we can see that there are
different ways to compute attention: (1) recurrently token-by-token (i.e., sequentially) in O(1)
memory or (2) in the conventional manner (i.e., in parallel) requiring linear O(N ) memory. Since
attention can be viewed as an RNN, the conventional method of computing attention can also be
viewed as an efficient method of computing attention’s many-to-one RNN output, i.e., the output of
an RNN that takes as input multiple context tokens but only outputs a single token at the end of the
RNN (see Figure 1a). Lastly, instead of computing attention fully sequentially or fully in parallel, we can also compute attention as (3) an RNN that processes the tokens block-by-block, requiring O(b) memory where b is the block size. This method is outside the scope of this work; a description of the block-by-block RNN is instead included in Appendix A.
Viewing existing attention-based models as RNNs. By viewing attention as an RNN, existing
attention-based models can also be viewed as variations of RNNs. For example, Transformers’
self-attentions are RNNs (Figure 1b) with the context tokens as their initial hidden states. Perceiver’s
cross-attentions are RNNs (Figure 1c) with context-dependent latents as their initial hidden states. By
leveraging the RNN formulation of their attention mechanisms, these existing models can compute
their outputs memory-efficiently.
Challenges of viewing attention as an RNN for existing models. However, when viewing existing attention-based models such as Transformers as RNNs, the models lack important properties common in traditional RNNs such as LSTMs and GRUs. Notably, LSTMs and GRUs are capable of efficiently updating themselves with new tokens in only O(1) constant memory and computation, an important feature for sequence modelling where data is received in a stream. In contrast, the RNN view of Transformers (see Figure 1b) would handle new tokens by adding a new RNN with the new token as its initial state. The new RNN processes all preceding tokens, requiring O(N) linear computation in the number of tokens. In Perceiver, the latents (L_i in Figure 1c) are input-dependent due to their architecture, meaning that their values change when receiving a new token. Since the initial hidden states (i.e., latents) of their RNNs change, Perceiver would thus require re-computing their RNNs from scratch, requiring O(NL) linear computation in the number of tokens (N) and the number of latents (L).

Algorithm 1: Parallel Prefix Scan (Hillis and Steele (1986)'s variation)
Require: associative operator ⊕ and {x_i}_{i=1}^N
Ensure: {z_k = ⊕_{i=1}^k x_i}_{k=1}^N
  z ← x
  for i ← 1, . . . , ⌊log(N)⌋ do
      for j ← 0, . . . , N − 1 do in parallel
          if j < 2^i then
              z'_j ← z_j
          else
              z'_j ← z_j ⊕ z_{j−2^i}
          end if
      end for
      z ← z'
  end for
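As a reference point, here is a small Python sketch of a Hillis and Steele (1986)-style inclusive scan in the spirit of Algorithm 1, written for a generic associative operator. The inner loop is sequential here, but every update within a round is independent and could run in parallel; the function name and indexing scheme are ours.

```python
def hillis_steele_scan(xs, op):
    """Inclusive prefix scan: z_k = x_1 (+) ... (+) x_k for an associative operator op."""
    z = list(xs)
    n = len(z)
    offset = 1
    while offset < n:                          # roughly log2(N) rounds
        z_next = z[:]                          # updates within a round are independent (parallel)
        for j in range(offset, n):
            z_next[j] = op(z[j - offset], z[j])
        z = z_next
        offset *= 2
    return z

# Usage: ordinary prefix sums with op = addition.
print(hillis_steele_scan([1, 2, 3, 4, 5], lambda a, b: a + b))   # [1, 3, 6, 10, 15]
```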

3.2 Attention as a (many-to-many) RNN

Addressing these limitations, we propose to develop an attention-based model capable of leveraging the RNN formulation's ability to perform efficient updates. To do so, we first introduce an efficient parallelized method of computing attention as a many-to-many RNN, i.e., a parallel method to compute $\{o_i = \mathrm{Attention}(q, x_{1:i})\}_{i=1}^{N}$. Specifically, we leverage the parallel prefix scan algorithm (Blelloch, 1990) (see Algorithm 1), a parallel computation method for computing N prefix computations from N sequential data points via an associative operator ⊕. The algorithm efficiently computes $\{\bigoplus_{i=1}^{k} x_i\}_{k=1}^{N}$ from $\{x_k\}_{k=1}^{N}$.

Recall that $\mathrm{Attention}(q, x_{1:k}) = o_k = \frac{a_k}{c_k}$ where $a_k = \sum_{i=1}^{k} \exp(s_i - m_k) v_i$, $c_k = \sum_{i=1}^{k} \exp(s_i - m_k)$, and $m_k = \max_{i \in \{1,\ldots,k\}} s_i$. To compute $\{\mathrm{Attention}(q, x_{1:k})\}_{k=1}^{N}$ efficiently, we can compute $\{a_k\}_{k=1}^{N}$, $\{c_k\}_{k=1}^{N}$, and $\{m_k\}_{k=1}^{N}$ via the parallel scan algorithm and afterwards combine $a_k$ and $c_k$ to compute $\mathrm{Attention}(q, x_{1:k})$.

To do so, we propose the following associative operator ⊕ that acts on 3-tuples³ of the form $(m_A, u_A, w_A)$ where A is a set of indices, $m_A = \max_{i \in A} s_i$, $u_A = \sum_{i \in A} \exp(s_i - m_A)$, and $w_A = \sum_{i \in A} \exp(s_i - m_A) v_i$. The parallel scan algorithm takes as input $\{(m_{\{i\}}, u_{\{i\}}, w_{\{i\}})\}_{i=1}^{N} = \{(s_i, 1, v_i)\}_{i=1}^{N}$. The algorithm recursively applies the operator ⊕, which works as follows:

$$(m_A, u_A, w_A) \oplus (m_B, u_B, w_B) = (m_{A \cup B}, u_{A \cup B}, w_{A \cup B})$$

where $m_{A \cup B} = \max(m_A, m_B)$, $u_{A \cup B} = u_A \exp(m_A - m_{A \cup B}) + u_B \exp(m_B - m_{A \cup B})$, and $w_{A \cup B} = w_A \exp(m_A - m_{A \cup B}) + w_B \exp(m_B - m_{A \cup B})$. Upon completion of applying the operator recursively, the algorithm outputs $\{(m_{\{1,\ldots,k\}}, u_{\{1,\ldots,k\}}, w_{\{1,\ldots,k\}})\}_{k=1}^{N} = \{(m_k, \sum_{i=1}^{k} \exp(s_i - m_k), \sum_{i=1}^{k} \exp(s_i - m_k) v_i)\}_{k=1}^{N}$, also known as $\{(m_k, c_k, a_k)\}_{k=1}^{N}$. Combining the last two values of each output tuple, we retrieve $\mathrm{Attention}(q, x_{1:k}) = o_k = \frac{a_k}{c_k}$, resulting in an efficient parallelized method for computing attention as a many-to-many RNN (Figure 3).

Figure 3: Attention as a many-to-many RNN.
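The following is a minimal NumPy sketch of the operator ⊕ and the many-to-many computation it enables. A simple left-to-right scan is used here for clarity; any prefix scan implementation (such as the parallel one sketched after Algorithm 1) could be substituted. Function and variable names are ours.

```python
import numpy as np

def combine(left, right):
    """The associative operator: merge (m_A, u_A, w_A) and (m_B, u_B, w_B)."""
    m_a, u_a, w_a = left
    m_b, u_b, w_b = right
    m = max(m_a, m_b)
    u = u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m)
    w = w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m)
    return (m, u, w)

def attention_many_to_many(q, keys, values):
    """Compute {Attention(q, x_1:k)}_{k=1..N} by scanning the tuples (s_i, 1, v_i)."""
    outputs, acc = [], None
    for k_i, v_i in zip(keys, values):
        leaf = (float(q @ k_i), 1.0, v_i)        # (m_{i}, u_{i}, w_{i}) = (s_i, 1, v_i)
        acc = leaf if acc is None else combine(acc, leaf)
        m_k, c_k, a_k = acc                      # scanned tuple is (m_k, c_k, a_k)
        outputs.append(a_k / c_k)                # o_k = a_k / c_k
    return np.stack(outputs)                     # (N, d) prefix attention outputs
```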

3.3 Aaren: Attention as a Recurrent Neural Network

Leveraging the parallelized many-to-many formulation of attention, we propose Aaren ([A]ttention [a]s a [re]current neural [n]etwork). The interface of Aaren is the same as that of a Transformer, mapping N inputs to N outputs, where the i-th output is an aggregate of the 1st to i-th inputs. As such, Aaren is also (1) naturally stackable and (2) capable of computing individual loss terms for each sequence token. However, unlike Transformers, which use causal self-attention, Aaren uses the aforementioned method of computing attention as a many-to-many RNN, making it more efficient. Aaren functions as follows:

$$h_1^{(0)}, \ldots, h_N^{(0)} \leftarrow x_1, \ldots, x_N$$
$$[h_1^{(j+1)}, \ldots, h_N^{(j+1)}] \leftarrow \mathrm{Aaren}(q^{(j)}, [h_1^{(j)}, \ldots, h_N^{(j)}])$$

Unlike Transformers, where the query is one of the input tokens to attention, Aaren's query token q is learned during training via backpropagation. In Figure 4, we include an example of a stacked Aaren model with input context tokens x_{1:3} and outputs y_{1:3}. Notably, since Aaren leverages the RNN formulation of attention, the stacking of Aarens is also the stacking of RNNs. Therefore, Aarens are also able to perform updates with new tokens efficiently, i.e., the iterative computation of y_k only requires constant computation as it relies solely on h_{k-1} and x_k. Unlike Transformer-based models, which (1) require linear memory (when using KV-caching) and (2) require storing all previous tokens, including those in intermediate Transformer layers, Aaren-based models (1) require only constant memory and (2) do not require storing all previous tokens, making Aarens significantly more computationally efficient than Transformers.

Figure 4: Stacking Aarens for sequence modelling.
³ See Appendix B for proof of the correctness and associative nature of ⊕.
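As an illustration of the interface described above, the following sketch wires stacked Aaren layers together on top of the attention_many_to_many function from the previous sketch. The layer class, the per-layer learned query, and the linear key/value maps are illustrative assumptions rather than the exact architecture used in the experiments.

```python
import numpy as np

class AarenLayer:
    """One Aaren block: a learned query q attends over the layer's input hidden states."""
    def __init__(self, d, rng):
        self.q = 0.1 * rng.normal(size=d)            # learned query (trained via backpropagation)
        self.W_k = 0.1 * rng.normal(size=(d, d))     # key projection (assumption)
        self.W_v = 0.1 * rng.normal(size=(d, d))     # value projection (assumption)

    def forward(self, h):                            # h: (N, d) hidden states h^(j)
        keys, values = h @ self.W_k.T, h @ self.W_v.T
        return attention_many_to_many(self.q, keys, values)   # h^(j+1): (N, d) prefix outputs

def stacked_aaren(x, layers):
    """h^(0) = x; h^(j+1) = Aaren(q^(j), h^(j)). Output k aggregates only tokens x_1..x_k."""
    h = x
    for layer in layers:
        h = layer.forward(h)
    return h
```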

4 Experiments
Our objective in the experiments is to compare Aarens with Transformers in terms of (1) performance
and (2) resources required (time and memory). To perform a comprehensive comparison, we evaluate
across four problem settings: reinforcement learning, event forecasting, time series forecasting, and
time series classification.
Datasets. In total, we evaluate Aarens and Transformers on 38 datasets, the majority of which are
real-world datasets. The datasets are split between problem settings as follows: 12 reinforcement
learning datasets, 8 event forecasting datasets, 8 time series forecasting datasets, and 10 time series
classification datasets. For each dataset, the models are evaluated with 5 seeds. Due to space
limitations, we refer the reader to Appendix C for descriptions of the individual datasets.
Models. To compare Aarens directly with Transformers, we replace the Transformers with Aarens in
domain-specialized Transformer models. For reinforcement learning, we performed the comparison
on the Decision Transformer (Chen et al., 2021). For event forecasting, we performed the comparison
on the Transformer Hawkes Process (Zuo et al., 2020; Bae et al., 2023). For time series forecasting, we
performed the comparison on a Transformer with input normalization following Liu et al. (2022). For
time series classification, we performed the comparison on a vanilla causal transformer following the
library by Wu et al. (2023).
Implementation Details. The experiments are run using the datasets and transformer implementa-
tions of popular repositories. More specifically, the reinforcement learning experiments with Decision
Transformer were run on the code by Barhate (2022). The time series forecasting and time series
classification experiments were run on the Time Series Library repository by Wu et al. (2023). The
event forecasting experiments were run on the code by Bae et al. (2023). As Transformers and
Aarens share the same interface and are both attention-based methods, they share the same set of
hyperparameters. For fairness, the same hyperparameters are used for both Transformers and Aarens.
Due to space limitations, specific hyperparameter details are included in Appendix E.⁴

⁴ The code will be released alongside the camera-ready.

4.1 Reinforcement Learning

In these experiments, we compare Aarens with Transformers on reinforcement learning (RL). In RL, the model's objective during training is to learn a policy from feedback/rewards obtained through trial and error while interacting with an environment. As such, RL is popular in interactive settings such as robotics, recommendation engines, and traffic control. For these experiments, we
consider Decision Transformers (Chen et al., 2021), a popular method for training a policy in an
offline manner on datasets of environment interactions. At test time, Decision Transformers are
evaluated in an online manner.
We evaluate on popular locomotion robotics environments from the D4RL benchmark (Fu et al., 2020):
HalfCheetah, Ant, Hopper, and Walker. For each of the locomotion environments, we compare the
models trained on three different kinds of datasets: Medium, Medium-Replay, and Medium-Expert.
Each of these datasets consists of 1 million timesteps generated by a policy. In total, we compare
model performances across 4 × 3 = 12 RL datasets. We refer the reader to Appendix C.1 for more
details regarding the individual tasks. The results in Table 1 show that Aarens achieve competitive
performance with Transformers across all twelve datasets and four environments. However, unlike Transformers, Aarens are able to efficiently process new environmental interactions with constant computation per interaction due to also being RNNs, making them more suitable for reinforcement learning.

4.2 Event Forecasting

In these experiments, we compare Aarens with Transformers on event forecasting (EF). In EF, the
model is given a sequence of irregularly spaced discrete events in time and models the probability
distribution of the next event time and its mark (i.e., event label/class). EF is popular in many
real-world settings such as finance (e.g., transactions), healthcare (e.g., patient observations), and
e-commerce (e.g., purchases) where user or system actions occur at irregularly spaced times. To
perform the comparison, we replace the Transformers in the Transformer Hawkes Process (Zuo et al.,
2020) with Aarens. Following Bae et al. (2023), a mixture of log-normal distributions is used to
model the probability distribution of the next event time.
Reinforcement Learning (Score ↑)
HalfCheetah Ant
Dataset
Medium Med-Replay Med-Expert Medium Med-Replay Med-Expert
Transformer 41.88 ± 1.47 36.57 ± 1.40 75.98 ± 6.34 94.25 ± 8.62 89.39 ± 4.96 125.47 ± 10.99
Aaren 42.16 ± 1.89 37.91 ± 1.94 75.74 ± 15.13 93.29 ± 4.04 85.53 ± 6.57 119.72 ± 12.63
Hopper Walker
Dataset
Medium Med-Replay Med-Expert Medium Med-Replay Med-Expert
Transformer 80.18 ± 5.85 79.73 ± 7.64 98.82 ± 10.33 77.84 ± 3.81 72.36 ± 5.63 109.66 ± 0.45
Aaren 80.86 ± 4.77 77.87 ± 5.68 103.89 ± 11.89 74.44 ± 5.16 71.44 ± 6.55 110.51 ± 1.30
Table 1: Reinforcement Learning Results. Measurement of the D4RL score (higher is better) (Fu
et al., 2020). The bolded results indicate the best-performing method.

For this setting, we consider 8 popular
benchmarking datasets for next event forecasting (Zhang et al., 2020; Zuo et al., 2020; Bae et al.,
2023): MIMIC, Wiki, Reddit, Mooc, StackOverflow, Sin, Uber, and Taxi. 7 out of the 8 are real-world
datasets whereas only Sin is a synthetic dataset. 3 out of the 8 datasets (Sin, Uber, and Taxi) do not
include marks/labels. We refer the reader to Appendix C.2 for details regarding individual datasets.
The results in Table 2 show that Aarens performed comparably with Transformers across all datasets.
Aaren’s ability to efficiently process new inputs is a particularly useful feature in event forecasting
settings where events arrive in an irregular stream.

Event Forecasting
Metric NLL (↓) RMSE (↓) Acc (↑)
Dataset MIMIC Wiki MIMIC Wiki MIMIC Wiki
Transformer 1.22 ± 0.08 9.66 ± 0.98 1.60 ± 0.28 0.28 ± 0.04 84.07 ± 1.46 23.60 ± 2.66
Aaren 1.21 ± 0.06 8.98 ± 1.03 1.56 ± 0.32 0.22 ± 0.05 84.53 ± 0.66 21.26 ± 1.29
Dataset Reddit Mooc Reddit Mooc Reddit Mooc
Transformer 0.40 ± 0.29 −0.22 ± 0.57 0.23 ± 0.03 0.20 ± 0.04 60.68 ± 1.62 37.79 ± 0.42
Aaren 0.31 ± 0.30 0.25 ± 1.61 0.30 ± 0.04 0.41 ± 0.28 62.34 ± 0.40 36.69 ± 1.48
Dataset StackOverflow Sin StackOverflow Sin StackOverflow
Transformer 2.92 ± 0.04 0.68 ± 0.05 1.44 ± 0.08 1.75 ± 0.09 46.44 ± 0.08
Aaren 2.91 ± 0.02 0.78 ± 0.13 1.27 ± 0.17 2.03 ± 0.25 46.34 ± 0.21
Dataset Uber Taxi Uber Taxi
Transformer 3.33 ± 0.14 2.01 ± 0.17 73.63 ± 5.73 10.34 ± 0.32
Aaren 3.48 ± 0.10 2.33 ± 0.12 54.61 ± 5.40 10.01 ± 0.52
Table 2: Event Forecasting Results. Sin, Uber, and Taxi datasets do not include marks/labels.

4.3 Time Series Forecasting

In these experiments, we compared Aarens with Transformers on time series forecasting (TSF). In
TSF, the model is given a series of observations of temporally continuous signals. The objective of
the model is to predict T future values of the series. TSF models are commonly used in a wide range
of domains, including those related to climate (e.g., weather), energy (e.g., supply and demand),
and economics (e.g., stock prices). To perform the comparison, we consider a causally masked
Transformer with input normalization following Liu et al. (2022). For this setting, we consider 8
real-world datasets used in prior works: Weather, Exchange, Traffic, ECL, ETTh1, ETTh2, ETTm1,
and ETTm2. For details regarding the individual datasets, we refer the reader to Appendix C.3.
Following Wu et al. (2023), the models are evaluated with T ∈ {96, 192, 336, 720} given an input
length of 96. Due to space limitations, Table 3 only includes results for T = 192. We refer the
reader to Table 5 in Appendix D for the full results. The results in Table 3 show that Aarens perform
comparably with Transformers across all datasets. However, unlike Transformers, Aarens efficiently
process the time series data, making them more suitable for time series-related domains.

Time Series Forecasting
Metric MSE (↓) MAE (↓)
Dataset Weather Exchange Traffic Weather Exchange Traffic
Transformer 0.24 ± 0.01 0.24 ± 0.02 0.63 ± 0.01 0.28 ± 0.00 0.34 ± 0.01 0.34 ± 0.00
Aaren 0.25 ± 0.01 0.25 ± 0.03 0.64 ± 0.01 0.28 ± 0.00 0.33 ± 0.02 0.35 ± 0.00
Dataset ETTh1 ETTm1 ECL ETTh1 ETTm1 ECL
Transformer 0.64 ± 0.05 0.52 ± 0.05 0.39 ± 0.03 0.57 ± 0.02 0.47 ± 0.01 0.48 ± 0.02
Aaren 0.59 ± 0.03 0.51 ± 0.03 0.37 ± 0.02 0.55 ± 0.01 0.47 ± 0.01 0.45 ± 0.01
Dataset ETTh2 ETTm2 ETTh2 ETTm2
Transformer 0.50 ± 0.03 0.38 ± 0.02 0.46 ± 0.01 0.37 ± 0.01
Aaren 0.49 ± 0.03 0.34 ± 0.04 0.48 ± 0.02 0.39 ± 0.02
Table 3: Time Series Forecasting Results. Following Wu et al. (2023), the models are evaluated
with T ∈ {96, 192, 336, 720} given an input length of 96. Due to space limitations, this table only
includes results for T = 192. We refer the reader to Appendix D (Table 5) for the full results.

Time Series Classification (Acc ↑)


Dataset EthanolConc. FaceDetection Handwriting Heartbeat Jap. Vowels
Transformer 29.89 ± 1.63 69.23 ± 0.52 26.54 ± 2.25 74.05 ± 1.21 96.38 ± 0.91
Aaren 29.58 ± 2.30 69.06 ± 0.61 27.39 ± 1.46 74.15 ± 0.77 96.65 ± 0.75
Dataset PEMS-SF SelfReg. SCP1 SelfReg. SCP2 ArabicDigits UWaveGesture
Transformer 78.73 ± 2.06 88.81 ± 0.92 52.89 ± 2.47 98.89 ± 0.57 79.81 ± 1.51
Aaren 81.85 ± 2.60 89.42 ± 1.85 54.22 ± 1.50 98.68 ± 0.20 82.00 ± 1.93
Table 4: Time Series Classification Results.

4.4 Time Series Classification

In these experiments, we compared Aarens with Transformers on time series classification (TSC).
In TSC, the model’s objective is to predict the label of a time series. This setting is common
in many important applications such as pattern recognition (e.g., electrocardiograms), anomaly
detection (e.g., bank fraud), or failure prediction (e.g., power grid fluctuations) (Dinger et al.,
2022). For this setting, we consider 10 real-world popular datasets from the UEA time series
classification Archive (Bagnall et al., 2018): EthanolConcentration, FaceDetection, Handwriting,
Heartbeat, JapaneseVowels, PEMS-SF, SelfRegulationSCP1, SelfRegulationSCP2, SpokenArabicDigits, and
UWaveGesture. For details regarding the individual datasets, we refer the reader to Appendix C.4. In
Table 4, we see that Aarens perform comparably with Transformers across all datasets.

4.5 Analyses

In these experiments, we compare Aarens with Transformers in terms of the resources required. To
do so, we use the code by Barhate (2022). For Transformers, we use KV-caching to improve their
efficiency.
Memory Complexity: In Figure 5 (left), we compare the memory usage of Aarens and Transformers
(using KV-caching) at inference time. We see that the memory usage of Transformers grows linearly
in the number of tokens even with the KV-caching technique. In contrast, Aarens use only constant memory regardless of the
number of tokens, making them significantly more memory-efficient.
Time Complexity: In Figure 5 (right), we compare the cumulative time needed by Aarens and
Transformers (using KV-caching) for sequentially processing a sequence of tokens. In the case
of Transformers, the cumulative amount of computation is quadratic in the number of tokens, i.e.,
O(1 + 2 + · · · + N) = O(N²). In contrast, for Aaren, the cumulative amount of computation is
linear. In the figure, we see a similar result in the cumulative time needed by the models. Specifically,
the cumulative time needed by Transformers grows quadratically while that of Aaren grows linearly.
Number of Parameters: Due to learning the initial hidden state q, Aaren modules require slightly
more parameters than Transformer modules. However, the difference is marginal due to q being

only a vector. Measuring this empirically in comparable models, we found that Transformers used 3,152,384 parameters. In contrast, the equivalent Aarens used 3,152,896 parameters, representing only a marginal ∼0.016% parameter increase – a minor trade-off for the significant gains in memory and time complexities.

Figure 5: Computational Resources. Plots comparing Aarens and Transformers (using KV-caching) when processing a sequence of tokens. (Left) Memory Usage Comparison. (Right) Cumulative Time Comparison.

5 Related Work
Closest to Aaren are approximations of attention such as those by RWKV (Peng et al., 2023),
RetNet (Sun et al., 2023), and Linear Transformer (Katharopoulos et al., 2020). These models
proposed linearizations of the standard softmax-based attention that allow them to be formulated as
an RNN. However, in doing so, these models also encode an exponential factor that biases tokens
based on their timestamp, limiting their potential applications. In contrast, Aaren leverages an exact
re-formulation of softmax attention as an RNN, allowing the model itself to compute the weight of
each token.
Feng et al. (2023) showed attention can be computed recurrently, using it to compress set-based inputs.
Rabe and Staats (2022) introduced a recurrent formulation of attention, showing that self-attention
can be computed efficiently. Katharopoulos et al. (2020) showed that Transformers with a causal
mask can be viewed as an RNN. In contrast, we (1) show a more general result: any attention
model can be viewed as an RNN. Furthermore, we (2) introduce Aaren, a new attention formulation
based on parallel prefix sums, that achieves results competitive with those of Transformers while being
more efficient.
The problem of computing prefix scans/sums has been well studied with various efficient parallelized
algorithms proposed for computing them. Since Aaren only requires the output of the prefix scan,
any efficient algorithm for computing it can be used. In this work, we outlined the method by
Hillis and Steele (1986). This method is time-efficient for parallel computation, requiring log₂(N)
sequential steps and O(N log(N)) overall computation. In contrast, the method by Ladner and
Fischer (1980) uses more sequential steps (specifically, 2 log₂(N) − 2) but only performs O(N)
overall computation. For a more in-depth introduction to parallel prefix sums algorithms, we refer the
reader to the following work by Blelloch (1990).
In this work, we applied Transformers to a subset of applications. For a broad overview of the
applications of Transformers, we refer the reader to the following survey by Islam et al. (2023). For
an overview of different transformer models applied to the specific settings considered in this paper,
we refer the reader to the following surveys (1) on transformers in reinforcement learning by Li et al.
(2023) and (2) on transformers in event forecasting, time series forecasting, time series classification,
and more by Wen et al. (2022).

6 Conclusion
In this work, we showed that attention can be formulated as an RNN and that the conventional way of
computing attention is a parallelized method of computing its many-to-one RNN output. Building on
the RNN formulation, we showed that existing attention-based models can be formulated as RNNs.

However, unlike traditional RNNs such as LSTMs and GRUs, these methods cannot be updated
efficiently with new tokens. Addressing this, we introduced a new parallelized method of computing
attention’s many-to-many RNN output based on the parallel prefix scan algorithm. Building on the
new attention formulation, we introduced Aaren, a new module that can not only (i) be trained in
parallel (like Transformers) but also (ii) be efficiently updated at inference time, thereby requiring only
constant memory (like RNNs). Empirically, we showed that Aarens achieve performance competitive
with Transformers on 38 datasets spread across four sequential data settings: reinforcement learning,
event forecasting, time series classification, and time series forecasting. Finally, we empirically showed
that Aarens are significantly more time- and memory-efficient than Transformers.

References
Agarwal, P., Rahman, A. A., St-Charles, P.-L., Prince, S. J., and Kahou, S. E. (2023). Transformers
in reinforcement learning: A survey. arXiv preprint arXiv:2307.05979.
Bae, W., Ahmed, M. O., Tung, F., and Oliveira, G. L. (2023). Meta temporal point processes. In
International Conference on Learning Representations.
Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A., Southam, P., and Keogh,
E. (2018). The UEA multivariate time series classification archive, 2018. arXiv preprint
arXiv:1811.00075.
Barhate, N. (2022). Minimal implementation of decision transformer. https://github.com/
nikhilbarhate99/min-decision-transformer.
Blelloch, G. E. (1990). Prefix sums and their applications. Technical Report CMU-CS-90-190, School
of Computer Science, Carnegie Mellon University.
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and
Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling.
Advances in neural information processing systems, 34:15084–15097.
Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using rnn encoder–decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734.
Dinger, T., Chang, Y.-c., Pavuluri, R., and Subramanian, S. (2022).
What is time series classification? https://developer.ibm.com/
learningpaths/get-started-time-series-classification-api/
what-is-time-series-classification/.
Feng, L., Tung, F., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. (2023). Memory efficient
neural processes via constant memory attention block. arXiv preprint arXiv:2305.14567.
Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. (2020). D4RL: Datasets for deep data-driven
reinforcement learning. arXiv preprint arXiv:2004.07219.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT
Press.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor. In International conference on
machine learning, pages 1861–1870. PMLR.
Hillis, W. D. and Steele, G. L. (1986). Data parallel algorithms. Commun. ACM, 29:1170–1183.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9:1735–80.
Islam, S., Elmekki, H., Elsebai, A., Bentahar, J., Drawel, N., Rjoub, G., and Pedrycz, W. (2023). A
comprehensive survey on applications of transformers for deep learning tasks. Expert Systems with
Applications, page 122666.
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. (2021). Perceiver:
General perception with iterative attention. In International conference on machine learning, pages
4651–4664. PMLR.
Jiang, Y., Pan, Z., Zhang, X., Garg, S., Schneider, A., Nevmyvaka, Y., and Song, D. (2024). Empow-
ering time series analysis with large language models: A survey. arXiv preprint arXiv:2402.03182.
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast
autoregressive transformers with linear attention. In International conference on machine learning,
pages 5156–5165. PMLR.
Ladner, R. E. and Fischer, M. J. (1980). Parallel prefix computation. J. ACM, 27(4):831–838.

Latif, S., Zaidi, A., Cuayahuitl, H., Shamshad, F., Shoukat, M., and Qadir, J. (2023). Transformers in
speech processing: A survey. arXiv preprint arXiv:2303.11607.
Li, W., Luo, H., Lin, Z., Zhang, C., Lu, Z., and Ye, D. (2023). A survey on transformers in
reinforcement learning. arXiv preprint arXiv:2301.03044.
Lin, T., Wang, Y., Liu, X., and Qiu, X. (2022). A survey of transformers. AI Open.
Liu, Y., Wu, H., Wang, J., and Long, M. (2022). Non-stationary transformers: Exploring the
stationarity in time series forecasting. Advances in Neural Information Processing Systems,
35:9881–9893.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X.,
Chung, M., Derczynski, L., et al. (2023). RWKV: Reinventing RNNs for the transformer era. In
Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077.
Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and
Dean, J. (2023). Efficiently scaling transformer inference. Proceedings of Machine Learning and
Systems, 5.
Rabe, M. N. and Staats, C. (2022). Self-attention does not need O(n²) memory. arXiv preprint
arXiv:2112.05682.
Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. (2023). Retentive
network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., and Sun, L. (2022). Transformers in time
series: A survey. arXiv preprint arXiv:2202.07125.
Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., and Long, M. (2023). TimesNet: Temporal 2D-
variation modeling for general time series analysis. In International Conference on Learning
Representations.
Zhang, Q., Lipani, A., Kirnap, O., and Yilmaz, E. (2020). Self-attentive hawkes process. In
International conference on machine learning, pages 11183–11193. PMLR.
Zheng, Q., Zhang, A., and Grover, A. (2022). Online decision transformer. In international conference
on machine learning, pages 27042–27059. PMLR.
Zuo, S., Jiang, H., Li, Z., Zhao, T., and Zha, H. (2020). Transformer hawkes process. In International
conference on machine learning, pages 11692–11702. PMLR.

Appendix

A Block-by-block method of computing attention as a (many-to-one) RNN.


In the main paper, we mentioned there are multiple ways to compute attention: (1) the conventional manner (i.e., in parallel), requiring O(N) linear memory, and (2) recurrently token-by-token as an RNN, requiring only O(1) constant memory. In between these two methods, there is a third method which computes attention (3) in a block-by-block manner as follows:

$$a_{i+b} = a_i \exp(m_i - m_{i+b}) + \sum_{j=i+1}^{i+b} v_j \exp(s_j - m_{i+b})$$
$$c_{i+b} = c_i \exp(m_i - m_{i+b}) + \sum_{j=i+1}^{i+b} \exp(s_j - m_{i+b})$$
$$m_{i+b} = \max(m_i, s_{i+1}, \ldots, s_{i+b})$$

where b is a block size that controls the rate at which data is processed. Computing attention in this manner requires O(b) memory, where b is a hyperparameter. By setting b = 1, we retrieve the RNN formulation that processes the tokens one-by-one, i.e., computing a_i, c_i, m_i from the previous time step's a_{i−1}, c_{i−1}, m_{i−1}.
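A minimal NumPy sketch of the block-by-block update above: given the running state (a_i, c_i, m_i) and the query q, a block of b new key/value pairs is folded in using O(b) working memory. The function and variable names are ours.

```python
import numpy as np

def block_update(state, q, keys_block, values_block):
    """Fold a block of b new (key, value) pairs into the running state (a, c, m)."""
    a, c, m = state
    s = keys_block @ q                               # (b,) scores s_{i+1}, ..., s_{i+b}
    m_new = max(m, float(s.max()))                   # m_{i+b}
    scale = np.exp(m - m_new)                        # rescale the old accumulators
    a_new = a * scale + (np.exp(s - m_new)[:, None] * values_block).sum(axis=0)
    c_new = c * scale + np.exp(s - m_new).sum()
    return (a_new, c_new, m_new)                     # attention so far = a_new / c_new
```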

B Parallel Prefix Scan’s Operator ⊕


In this work, we proposed a formulation of attention as a many-to-many RNN based on the parallel prefix scan algorithm. To do so, we defined an associative operator ⊕ that is recursively applied during the algorithm. In this section, we show the correctness of the operator and prove that it satisfies the associative property.

The operator ⊕ acts on 3-tuples of the form $(m_A, u_A, w_A)$ where A is a set of indices, $m_A = \max_{i \in A} s_i$, $u_A = \sum_{i \in A} \exp(s_i - m_A)$, and $w_A = \sum_{i \in A} \exp(s_i - m_A) v_i$. The operator ⊕ functions as follows:

$$(m_A, u_A, w_A) \oplus (m_B, u_B, w_B) = (m_{A \cup B}, u_{A \cup B}, w_{A \cup B})$$

where:
$$m_{A \cup B} = \max(m_A, m_B)$$
$$u_{A \cup B} = u_A \exp(m_A - m_{A \cup B}) + u_B \exp(m_B - m_{A \cup B})$$
$$w_{A \cup B} = w_A \exp(m_A - m_{A \cup B}) + w_B \exp(m_B - m_{A \cup B})$$

B.1 Correctness Proof

To confirm the correctness of the operator, we need to show: $m_{A \cup B} = \max_{i \in A \cup B} s_i$, $u_{A \cup B} = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B})$, and $w_{A \cup B} = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B}) v_i$.

Expanding the computation of $m_{A \cup B}$:

$$m_{A \cup B} = \max(m_A, m_B) = \max(\max_{i \in A} s_i, \max_{i \in B} s_i) = \max_{i \in A \cup B} s_i$$

Expanding the computation of $u_{A \cup B}$:

$$u_{A \cup B} = u_A \exp(m_A - m_{A \cup B}) + u_B \exp(m_B - m_{A \cup B})$$
$$= \left[\sum_{i \in A} \exp(s_i - m_A)\right] \exp(m_A - m_{A \cup B}) + \left[\sum_{i \in B} \exp(s_i - m_B)\right] \exp(m_B - m_{A \cup B})$$
$$= \sum_{i \in A} \exp(s_i - m_{A \cup B}) + \sum_{i \in B} \exp(s_i - m_{A \cup B}) = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B})$$

Expanding the computation of $w_{A \cup B}$:

$$w_{A \cup B} = w_A \exp(m_A - m_{A \cup B}) + w_B \exp(m_B - m_{A \cup B})$$
$$= \left[\sum_{i \in A} \exp(s_i - m_A) v_i\right] \exp(m_A - m_{A \cup B}) + \left[\sum_{i \in B} \exp(s_i - m_B) v_i\right] \exp(m_B - m_{A \cup B})$$
$$= \sum_{i \in A} \exp(s_i - m_{A \cup B}) v_i + \sum_{i \in B} \exp(s_i - m_{A \cup B}) v_i = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B}) v_i$$

resulting in $m_{A \cup B} = \max_{i \in A \cup B} s_i$, $u_{A \cup B} = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B})$, and $w_{A \cup B} = \sum_{i \in A \cup B} \exp(s_i - m_{A \cup B}) v_i$ as needed.

B.2 Associative Proof

For the operator ⊕ to be valid for the parallel scan algorithm, it must satisfy the associative property, i.e., $(a \oplus b) \oplus c = a \oplus (b \oplus c)$.

Since the operator is applied to 3-tuples of the form $(m_A, u_A, w_A)$ as follows:

$$(m_A, u_A, w_A) \oplus (m_B, u_B, w_B) = (m_{A \cup B}, u_{A \cup B}, w_{A \cup B})$$

showing that ⊕ is associative amounts to showing that:

$$[(m_A, u_A, w_A) \oplus (m_B, u_B, w_B)] \oplus (m_C, u_C, w_C) = (m_A, u_A, w_A) \oplus [(m_B, u_B, w_B) \oplus (m_C, u_C, w_C)]$$
$$(m_{(A \cup B) \cup C}, u_{(A \cup B) \cup C}, w_{(A \cup B) \cup C}) = (m_{A \cup (B \cup C)}, u_{A \cup (B \cup C)}, w_{A \cup (B \cup C)})$$

Since ∪ is an associative operator on sets, i.e., $(A \cup B) \cup C = A \cup (B \cup C)$, we have $m_{(A \cup B) \cup C} = m_{A \cup (B \cup C)}$, $u_{(A \cup B) \cup C} = u_{A \cup (B \cup C)}$, and $w_{(A \cup B) \cup C} = w_{A \cup (B \cup C)}$, meaning that ⊕ is associative as required.
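As a numerical sanity check of the two proofs above, the following sketch verifies on random inputs that scanning the tuples reproduces softmax attention and that ⊕ is associative. The combine function mirrors the operator defined above and is redefined here so the snippet is self-contained.

```python
import numpy as np

def combine(A, B):
    """The operator: merge (m_A, u_A, w_A) and (m_B, u_B, w_B)."""
    (m_a, u_a, w_a), (m_b, u_b, w_b) = A, B
    m = max(m_a, m_b)
    return (m,
            u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m),
            w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m))

rng = np.random.default_rng(0)
s, v = rng.normal(size=6), rng.normal(size=(6, 3))
tuples = [(float(s[i]), 1.0, v[i]) for i in range(6)]   # leaves (s_i, 1, v_i)

# Associativity: (a (+) b) (+) c == a (+) (b (+) c).
a, b, c = tuples[:3]
lhs, rhs = combine(combine(a, b), c), combine(a, combine(b, c))
assert all(np.allclose(x, y) for x, y in zip(lhs, rhs))

# Correctness: w / u over all indices equals softmax(s) @ v.
m, u, w = tuples[0]
for t in tuples[1:]:
    m, u, w = combine((m, u, w), t)
softmax = np.exp(s - s.max())
softmax = softmax / softmax.sum()
assert np.allclose(w / u, softmax @ v)
```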

C Breakdown of Individual Datasets


C.1 Reinforcement Learning

Our code for the reinforcement learning experiments is based on that of Barhate (2022) (MIT License).
For these experiments, we consider locomotion MuJoCo robotics environments (Apache 2.0 License)
popular in deep RL:

• HalfCheetah is a two-legged 2-d robot comprising a head, horizontal torso, two thighs, two shins, and two feet.
• Ant is a 3-d four-legged robot comprising a torso and four legs (2 parts per leg).
• Hopper is a single-legged 2-d robot comprising a torso, thigh, shin, and foot.
• Walker(2D) is a two-legged 2-d robot comprising a vertical torso, two thighs, two shins, and two feet.

The datasets used in our experiments are from D4RL (Fu et al., 2020) (Apache 2.0 License):

• Medium is a dataset generated by a pre-trained Soft Actor-Critic policy (Haarnoja et al., 2018) that was early stopped during training.
• Medium-Expert is a dataset comprising equal parts expert and suboptimal demonstrations.
• Medium-Replay is a dataset consisting of samples from the replay buffer during the training of the medium policy.

C.2 Event Forecasting

Our code for the event forecasting experiments and datasets is based on that of Bae et al. (2023) (CC
BY-NC-SA 4.0 License). For these experiments, we consider 8 datasets used in prior works (Bae
et al., 2023; Zhang et al., 2020; Zuo et al., 2020).
5 out of the 8 datasets include marks/labels:

• MIMIC (MIMIC-II) is a dataset based on an electronic medical record system, consisting of anonymous patients' clinical visits.
• Wiki is a dataset based on Wikipedia, consisting of actions performed on the most edited
pages.
• Reddit is a dataset based on the social media platform, consisting of the actions of the most
active users.
• Mooc is a dataset based on an online course, consisting of the sequence of actions performed
by various users.
• StackOverflow is a dataset based on the question-answering website, consisting of user
awards.

3 out of the 8 datasets do not include marks/labels:

• Sin is a synthetic dataset generated from a sine function with a periodicity of 4π and a
domain of [0, 32π].
• Uber is a dataset based on Uber NYC pickup data from 2015.
• Taxi is a dataset based on NYC Taxi pickup data from 2013.

C.3 Time Series Forecasting

Our code for the time series forecasting experiments is based on the Time series Library repository
(MIT License) by Wu et al. (2023). For these experiments, we consider 8 popular datasets available
in the Time Series Library repository:

• Weather is a dataset containing meteorological indicators recorded every 10 minutes for the year 2020.
• Exchange is a dataset consisting of the daily exchange rates of 8 countries (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016.
• Traffic is a dataset consisting of the hourly occupancy rate of 963 car lanes of San Francisco
bay area freeways.
• ECL is a dataset consisting of electricity consumption of 321 clients.
• ETT (h1, h2, m1, and m2) are datasets consisting of data from two stations' electricity transformers, where "h" denotes the hourly measurement and "m" denotes the minute-by-minute measurement.

C.4 Time Series Classification

Our code for the time series classification experiments is based on the Time series Library repository
(MIT License) by Wu et al. (2023). For these experiments, we consider 10 UEA datasets (Bagnall
et al., 2018) available in the Time Series Library repository:

• EthanolConcentration is a dataset of raw spectra recordings of whisky bottles and their alcohol concentration.
• FaceDetection is a dataset of MEG (Magnetoencephalography) recordings and their class
labels (Face or Scramble).
• Handwriting is a dataset of measurements taken from a smart watch while writing.
• Heartbeat is a dataset of heart sound recordings derived from the 2016 PhysioNet/Computing in Cardiology Challenge.

• JapaneseVowels is a dataset of recordings of Japanese-male vowel pronunciations.
• PEMS-SF is a dataset describing the occupancy rate of car lanes in San Francisco Bay
Area’s freeways available on California’s Department of Transportation PEMS (Performance
Measurement System) website.
• SelfRegulationSCP1 is a dataset of cortical potential recordings from a healthy subject.
• SelfRegulationSCP2 is a dataset of cortical potential recordings from an artificially respirated
ALS (Amyotrophic lateral sclerosis) patient.
• SpokenArabicDigits is a dataset of recordings of ten spoken Arabic digits from 44 male and
44 female Arabic native speakers.
• UWaveGestureLibrary is a dataset of accelerometer measurements of eight gestures.

D Additional Experiments
D.1 Full Time Series Forecasting Results

In time series forecasting, the objective of the model is to predict T future values of the series. Following the standard of previous works (Wu et al., 2023), we evaluated on T ∈ {96, 192, 336, 720}. However, due to space limitations, only T = 192 was included in the main paper. Here, in Table 5, we include the full results. We see that for all values of T, Aarens performed comparably with Transformers across all datasets. However, unlike Transformers, Aarens can efficiently process new inputs, making them more advantageous in sequential settings such as time series forecasting.

E Hyperparameters + Implementation Details


As Transformers and Aarens share the same interface and are both attention-based methods, they
share the same set of hyperparameters. For fairness, the same hyperparameters are used for both
Transformers and Aarens. The specific hyperparameters for their respective settings are as follows:
Reinforcement Learning. For these experiments, we found that the hyperparameters used by Zheng et al. (2022) performed better than the default Decision Transformer hyperparameters. As such, Zheng et al. (2022)'s hyperparameters were used for these experiments, i.e., an embedding dimension of 512, 4 attention heads, and 4 Transformer/Aaren blocks.
Event Forecasting. For these experiments, the default hyperparameters (except for the learning rate) in Bae et al. (2023)'s repository were used. We found that the default learning rate value of 0.0001 caused early convergence to poor local optima. As such, the learning rate was set to 0.0005 instead.
Time Series Forecasting. For these experiments, the default Transformers hyperparameters in the
Time Series Library repository (Wu et al., 2023) were used.
Time Series Classification. For these experiments, the default Transformers hyperparameters in the
Time Series Library repository (Wu et al., 2023) were used.

F Compute
Our experiments were run using a mix of Nvidia GTX 1080 Ti (12 GB) and Nvidia Tesla P100 (16
GB) GPUs. The analyses were performed on Nvidia GTX 1080 Ti (12 GB).
The reinforcement learning experiments required approximately the same amount of time: ∼ 2 − 4
hours each.
The event forecasting experiments varied in time depending on the dataset:

• MIMIC took ∼ 0.5 hours
• Wiki took ∼ 0.75 hours
• Reddit took ∼ 3.5 hours
• Mooc took ∼ 8 hours
• StackOverflow took ∼ 3.5 hours
• Sin took ∼ 1.5 hours
• Uber took ∼ 3 hours
• Taxi took ∼ 1.5 hours

Time Series Forecasting
Metric MSE (↓) MAE (↓)
Models Aaren Transformer Aaren Transformer
96 0.53 ± 0.04 0.54 ± 0.01 0.52 ± 0.02 0.50 ± 0.01
192 0.59 ± 0.03 0.64 ± 0.05 0.55 ± 0.01 0.57 ± 0.02
ETTh1
336 0.65 ± 0.03 0.65 ± 0.02 0.55 ± 0.01 0.55 ± 0.01
720 0.67 ± 0.05 0.70 ± 0.05 0.62 ± 0.02 0.58 ± 0.02
96 0.38 ± 0.02 0.41 ± 0.04 0.44 ± 0.02 0.40 ± 0.02
192 0.49 ± 0.03 0.50 ± 0.03 0.48 ± 0.02 0.46 ± 0.01
ETTh2
336 0.57 ± 0.05 0.59 ± 0.03 0.47 ± 0.02 0.50 ± 0.01
720 0.55 ± 0.03 0.60 ± 0.03 0.52 ± 0.01 0.52 ± 0.01
96 0.48 ± 0.02 0.44 ± 0.01 0.44 ± 0.01 0.41 ± 0.01
192 0.51 ± 0.03 0.52 ± 0.05 0.47 ± 0.01 0.47 ± 0.01
ETTm1
336 0.54 ± 0.02 0.57 ± 0.03 0.49 ± 0.01 0.51 ± 0.01
720 0.60 ± 0.03 0.66 ± 0.06 0.52 ± 0.01 0.56 ± 0.02
96 0.24 ± 0.03 0.25 ± 0.01 0.30 ± 0.02 0.30 ± 0.01
192 0.34 ± 0.04 0.38 ± 0.02 0.39 ± 0.02 0.37 ± 0.01
ETTm2
336 0.41 ± 0.03 0.49 ± 0.05 0.42 ± 0.01 0.43 ± 0.02
720 0.51 ± 0.03 0.56 ± 0.02 0.49 ± 0.02 0.47 ± 0.01
96 0.18 ± 0.00 0.18 ± 0.00 0.23 ± 0.00 0.23 ± 0.00
192 0.25 ± 0.01 0.24 ± 0.01 0.28 ± 0.00 0.28 ± 0.00
Weather
336 0.31 ± 0.00 0.31 ± 0.02 0.32 ± 0.00 0.34 ± 0.01
720 0.40 ± 0.00 0.38 ± 0.02 0.39 ± 0.00 0.39 ± 0.01
96 0.14 ± 0.01 0.14 ± 0.01 0.27 ± 0.01 0.25 ± 0.01
192 0.25 ± 0.03 0.24 ± 0.02 0.33 ± 0.02 0.34 ± 0.01
Exchange
336 0.42 ± 0.04 0.41 ± 0.02 0.44 ± 0.02 0.45 ± 0.01
720 1.20 ± 0.07 1.44 ± 0.19 0.79 ± 0.02 0.81 ± 0.04
96 0.63 ± 0.01 0.61 ± 0.01 0.35 ± 0.00 0.34 ± 0.00
192 0.64 ± 0.01 0.63 ± 0.01 0.35 ± 0.00 0.34 ± 0.00
Traffic
336 0.65 ± 0.01 0.64 ± 0.00 0.35 ± 0.00 0.34 ± 0.00
720 0.68 ± 0.01 0.67 ± 0.00 0.36 ± 0.01 0.36 ± 0.00
96 0.36 ± 0.02 0.35 ± 0.02 0.46 ± 0.01 0.43 ± 0.01
192 0.37 ± 0.02 0.39 ± 0.03 0.45 ± 0.01 0.48 ± 0.02
ECL
336 0.47 ± 0.05 0.48 ± 0.06 0.52 ± 0.03 0.55 ± 0.03
720 0.57 ± 0.05 0.62 ± 0.06 0.56 ± 0.02 0.55 ± 0.03
Table 5: Full Time Series Forecasting Results.


The time series forecasting experiments were run as a single script for T ∈ {96, 192, 336, 720}. The
experiments varied in time depending on the dataset:

• Weather took ∼ 6 hours
• Exchange took ∼ 0.5 hours
• Traffic took ∼ 1 hour
• ECL took ∼ 4 hours
• ETTh1 took ∼ 0.75 hours
• ETTm1 took ∼ 11 hours
• ETTh2 took ∼ 0.75 hours
• ETTm2 took ∼ 11 hours

The time series classification experiments were run as a single script. Running the experiments on all
datasets took in total ∼ 1 hour.

G Limitations and Broader Impact


Limitations. Unlike Transformers whose attention queries are input-dependent (i.e., xi ), Aarens’
attention queries are input-independent (i.e., q is a learned constant). In our experiments (reinforce-
ment learning, event forecasting, time series forecasting, and time series classification), we found
that Aarens performed comparably to Transformers regardless. However, this difference could be
a limitation in settings that require large, highly expressive sequence models, e.g., large language
models. This, however, is outside the scope of our work and our available resources.
Broader Impact. In this work, we propose Aaren, a module that achieves comparable performance
to Transformers while being more time- and memory-efficient. Since Aarens, unlike Transformers, can update themselves efficiently (as an RNN), they are more suitable for sequence modelling. Because (1) there exists a wide range of sequence modelling problem settings and (2) the number of low-resource domains (e.g., battery-powered or IoT devices) is rapidly increasing, Aaren has a broad potential impact. The general applicability of Aaren means that its societal impact
is dependent on the downstream application.
