Abstract
1 Introduction
Advancements in sequence modelling are highly impactful due to the wide range of applications,
including reinforcement learning (e.g., robotics and autonomous driving), time series classification
(e.g., financial fraud detection and medical diagnoses), and time series forecasting (e.g., weather
and energy consumption predictions). Over the past several years, an extensively explored topic
in sequence modelling is that of Transformer-based models (Vaswani et al., 2017). This is due to
Transformers’ strong performance and ability to leverage GPU parallelism. As a result, numerous
Transformer-specific survey papers have been written for various sequential settings such as rein-
forcement learning (Agarwal et al., 2023; Li et al., 2023), time series (Lin et al., 2022; Jiang et al.,
2024), and speech processing (Latif et al., 2023).
2 Background
2.1 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are specialized models for sequence modelling. In brief, RNNs
process sequential data by iteratively computing hidden states as follows:
h_t = f_θ(h_{t−1}, x_t)

where t denotes the step index, x_t represents a token, h_t represents the hidden state, and f_θ is a neural network parameterized by θ. The value of the initial hidden state h_0 is typically learned via backpropagation. Popular RNNs such as LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014) fix the size of the hidden state h_t to a constant irrespective of the step index. As such, these models are efficient at test time, requiring only constant memory and time per token, and can be easily updated with new tokens, an important property in sequence modelling. However, because these popular RNNs are iterative by design, they lack parallelizability and therefore suffer from scalability issues. As such, RNNs were replaced by a parallelizable attention-based module, Transformers (Vaswani et al., 2017), as the standard for many sequence modelling settings.
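To make the constant-memory property concrete, the following minimal sketch (not taken from the paper; the tanh cell and variable names are illustrative assumptions) processes a stream of tokens with a generic update h_t = f_θ(h_{t−1}, x_t), keeping only a fixed-size hidden state:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # Generic RNN update h_t = f_theta(h_{t-1}, x_t); a tanh cell is used purely for illustration.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

d_hidden, d_in = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(d_hidden, d_hidden))
W_x = rng.normal(size=(d_hidden, d_in))
b = np.zeros(d_hidden)

h = np.zeros(d_hidden)                   # h_0; learned via backpropagation in practice
for x_t in rng.normal(size=(10, d_in)):  # a stream of 10 tokens
    h = rnn_step(h, x_t, W_h, W_x, b)    # constant memory and time per token
```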
2.2 Attention
Attention retrieves information from a set of context tokens X_C for a given set of query tokens X_Q as follows:

Attention(Q, K, V) = softmax(QKᵀ)V

where Q = X_Q W_q is the query matrix, K = X_C W_k is the key matrix, and V = X_C W_v is the value matrix. W_q, W_k, W_v ∈ ℝ^{d×d} are weight matrices (learned parameters). softmax(QKᵀ) computes the weights of the context tokens for a weighted average. Notably, unlike RNNs, attention is not designed to be iterative; instead, it is designed to easily leverage GPU parallelism. Transformers (Vaswani et al., 2017) use self-attention¹, a special case of attention where the query tokens are the same as the context tokens. However, self-attention requires quadratic computation with respect to the number of tokens and is not efficiently updateable with new tokens. As a result, Transformers are computationally expensive, limiting their applications in low-resource domains.

¹For sequential data, a causal mask is typically used to prevent attending to future timesteps.
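As a rough illustration of the formulas above, the sketch below computes softmax(QKᵀ)V with an optional causal mask. It is a simplified reading of the equations in this section (the 1/√d scaling used in practice is omitted to match the formula as written), and all function and variable names are our own:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X_q, X_c, W_q, W_k, W_v, causal=False):
    # Attention(Q, K, V) = softmax(Q K^T) V; the 1/sqrt(d) scaling is omitted to match the text.
    Q, K, V = X_q @ W_q, X_c @ W_k, X_c @ W_v
    scores = Q @ K.T
    if causal:  # footnote 1: mask future timesteps (assumes X_q and X_c are the same sequence)
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

d, N = 8, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
W_q, W_k, W_v = [rng.normal(size=(d, d)) for _ in range(3)]
out = attention(X, X, W_q, W_k, W_v, causal=True)  # self-attention: query tokens == context tokens
```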
3 Methodology
Addressing this, we propose an efficient attention-based module capable of leveraging GPU paral-
lelism while being efficiently updateable. We begin by first showing in Section 3.1 that attention can
be viewed as an RNN with the special ability to compute its many-to-one RNN (Figure 1a) output
efficiently. Leveraging the RNN formulation of attention, we further show that popular attention-
based models such as Transformers (Figure 1b) and Perceivers (Figure 1c) can be viewed as RNNs.
However, unlike traditional RNNs, these models are unable to efficiently update themselves with new
tokens, limiting their potential in sequential problem settings where data arrives as a stream. Tackling
this, we introduce in Section 3.2 an efficient method for computing attention as a many-to-many RNN
based on the parallel prefix scan algorithm. Building on this, we introduce in Section 3.3 Aaren
([A]ttention [a]s a [re]current neural [n]etwork), a computationally efficient module that can not only
(i) be trained in parallel (like Transformers) but also (ii) be efficiently updated with new tokens at
inference time, requiring only constant memory for inferences (like traditional RNNs).
3.1 Attention as a Many-to-One RNN

Attention on a query vector q can be viewed as an RNN that consumes the context tokens one at a time, where s_i denotes the attention score of the query with the i-th key (i.e., the dot product qᵀk_i) and v_i the corresponding value. For numerical stability, the softmax weights are computed with a cumulative maximum m_k = max_{i∈{1,...,k}} s_i:

â_k = Σ_{i=1}^k exp(s_i − m_k) v_i   and   ĉ_k = Σ_{i=1}^k exp(s_i − m_k).

Notably, the final result is the same as without the stabilization: o_N = â_N/ĉ_N = a_N/c_N, where a_N = Σ_{i=1}^N exp(s_i) v_i and c_N = Σ_{i=1}^N exp(s_i). â_k, ĉ_k, and m_k are thus computed recurrently as follows:

m_k = max(m_{k−1}, s_k),
ĉ_k = ĉ_{k−1} exp(m_{k−1} − m_k) + exp(s_k − m_k),
â_k = â_{k−1} exp(m_{k−1} − m_k) + exp(s_k − m_k) v_k.

By encapsulating the recurrent computation of â_k, ĉ_k, and m_k from â_{k−1}, ĉ_{k−1}, and m_{k−1}, we introduce an RNN cell that iteratively computes the output of attention (see Figure 2). Attention's RNN cell takes as input (â_{k−1}, ĉ_{k−1}, m_{k−1}, q) and computes (â_k, ĉ_k, m_k, q). Note that the query vector q is carried over in the RNN cell. The initial hidden state of attention's RNN is (â_0, ĉ_0, m_0, q) = (0, 0, 0, q).
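The sketch below is a minimal rendering of this RNN cell (the variable names are ours; it assumes keys and values arrive one token at a time alongside the stream of context tokens):

```python
import numpy as np

def attention_rnn_step(state, k_t, v_t):
    # One step of attention's RNN cell: update (a_hat, c_hat, m, q) with a new (key, value) pair.
    a_hat, c_hat, m, q = state
    s_t = q @ k_t                      # score of the carried-over query with the new key
    m_new = max(m, s_t)
    scale = np.exp(m - m_new)          # rescale previous sums to the new running maximum
    c_new = c_hat * scale + np.exp(s_t - m_new)
    a_new = a_hat * scale + np.exp(s_t - m_new) * v_t
    return (a_new, c_new, m_new, q)

d, N = 8, 16
rng = np.random.default_rng(0)
q = rng.normal(size=d)
keys, values = rng.normal(size=(N, d)), rng.normal(size=(N, d))

state = (np.zeros(d), 0.0, 0.0, q)     # initial hidden state (a_0, c_0, m_0, q) = (0, 0, 0, q)
for k_t, v_t in zip(keys, values):
    state = attention_rnn_step(state, k_t, v_t)
o_N = state[0] / state[1]              # matches softmax(keys @ q) @ values
```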
Methods for computing attention. By viewing attention as an RNN, we can see that there are
different ways to compute attention: (1) recurrently token-by-token (i.e., sequentially) in O(1)
memory or (2) in the conventional manner (i.e., in parallel) requiring linear O(N ) memory. Since
attention can be viewed as an RNN, the conventional method of computing attention can also be
viewed as an efficient method of computing attention’s many-to-one RNN output, i.e., the output of
an RNN that takes as input multiple context tokens but only outputs a single token at the end of the
RNN (see Figure 1a). Lastly, instead of fully sequential or fully in parallel, we can also compute
attention as (3) an RNN that processes the tokens block-by-block requiring O(b) memory where
b is the size of the block. This method, however, is outside the scope of this work. As such, the
description of the block-by-block RNN is included in Appendix A.
Viewing existing attention-based models as RNNs. By viewing attention as an RNN, existing
attention-based models can also be viewed as variations of RNNs. For example, Transformers’
self-attentions are RNNs (Figure 1b) with the context tokens as their initial hidden states. Perceiver's (Jaegle et al., 2021) cross-attentions are RNNs (Figure 1c) with context-dependent latents as their initial hidden states. By leveraging the RNN formulation of their attention mechanisms, these existing models can compute their outputs in a memory-efficient manner.
Challenges of viewing attention as an RNN for existing models. However, when viewing existing attention-based models such as Transformers as RNNs, the models lack important properties common in traditional RNNs such as LSTMs and GRUs. Notably, LSTMs and GRUs are capable of efficiently updating themselves with new tokens in only O(1) constant memory and computation, an important feature for sequence modelling where data is received in a stream. In contrast, the RNN view of Transformers (see Figure 1b) would handle new tokens by adding a new RNN with the new token as its initial state. The new RNN processes all preceding tokens, requiring O(N) linear computation in the number of tokens. In Perceiver, the latents (L_i in Figure 1c) are input-dependent due to their architecture, meaning that their values change when receiving a new token. Since the initial hidden states (i.e., latents) of their RNNs change, Perceiver would thus require re-computing their RNNs from scratch, requiring O(NL) linear computation in the number of tokens (N) and the number of latents (L).

Algorithm 1: Parallel Prefix Scan (Hillis and Steele (1986)'s variation)
Require: associative operator ⊕ and {x_i}_{i=1}^N
Ensure: {z_k = ⊕_{i=1}^k x_i}_{k=1}^N
  z ← x
  for i ← 1, ..., ⌊log(N)⌋ do
    for j ← 0, ..., N − 1 do in parallel
      if j < 2^i then
        z′_j ← z_j
      else
        z′_j ← z_j ⊕ z_{j−2^i}
      end if
    end for
    z ← z′
  end for
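For illustration, the following sequential Python simulation mirrors Algorithm 1 as an offset-doubling inclusive scan for a generic associative operator ⊕; the inner loop is what would run in parallel on real hardware, and the function name is our own:

```python
def inclusive_scan(xs, op):
    # Offset-doubling inclusive prefix scan (Hillis-Steele style), simulated sequentially.
    # Returns [x_1, x_1 ⊕ x_2, ..., x_1 ⊕ ... ⊕ x_N] for an associative operator ⊕.
    z = list(xs)
    n, offset = len(z), 1
    while offset < n:                    # roughly log2(N) rounds
        z_new = list(z)
        for j in range(offset, n):       # this loop runs in parallel on a GPU
            z_new[j] = op(z[j - offset], z[j])
        z, offset = z_new, offset * 2
    return z

# Example with ordinary addition as the associative operator:
print(inclusive_scan([1, 2, 3, 4, 5], lambda a, b: a + b))  # [1, 3, 6, 10, 15]
```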
3.2 Attention as a Many-to-Many RNN

To compute attention as a many-to-many RNN efficiently, we leverage the parallel prefix scan algorithm (see Algorithm 1), a parallel computation method for computing N prefix computations from N sequential data points via an associative operator ⊕. The algorithm efficiently computes {⊕_{i=1}^k x_i}_{k=1}^N from {x_k}_{k=1}^N.

Recall that Attention(q, x_{1:k}) = o_k = â_k/ĉ_k, where â_k = Σ_{i=1}^k exp(s_i − m_k) v_i, ĉ_k = Σ_{i=1}^k exp(s_i − m_k), and m_k = max_{i∈{1,...,k}} s_i. To compute {Attention(q, x_{1:k})}_{k=1}^N efficiently, we can compute {â_k}_{k=1}^N, {ĉ_k}_{k=1}^N, and {m_k}_{k=1}^N via the parallel scan algorithm and afterwards combine â_k and ĉ_k to compute Attention(q, x_{1:k}).

To do so, we propose the following associative operator ⊕ that acts on 3-tuples³ of the form (m_A, u_A, w_A), where A is a set of indices, m_A = max_{i∈A} s_i, u_A = Σ_{i∈A} exp(s_i − m_A), and w_A = Σ_{i∈A} exp(s_i − m_A) v_i. The parallel scan algorithm takes as input {(m_{{i}}, u_{{i}}, w_{{i}})}_{i=1}^N = {(s_i, 1, v_i)}_{i=1}^N. The algorithm recursively applies the operator ⊕, which works as follows:

(m_A, u_A, w_A) ⊕ (m_B, u_B, w_B) = (m_{A∪B}, u_{A∪B}, w_{A∪B})

where m_{A∪B} = max(m_A, m_B), u_{A∪B} = u_A exp(m_A − m_{A∪B}) + u_B exp(m_B − m_{A∪B}), and w_{A∪B} = w_A exp(m_A − m_{A∪B}) + w_B exp(m_B − m_{A∪B}). Upon completion of applying the operator recursively, the algorithm outputs {(m_{{1,...,k}}, u_{{1,...,k}}, w_{{1,...,k}})}_{k=1}^N = {(m_k, Σ_{i=1}^k exp(s_i − m_k), Σ_{i=1}^k exp(s_i − m_k) v_i)}_{k=1}^N, i.e., {(m_k, ĉ_k, â_k)}_{k=1}^N. Combining the last two values of each output tuple, we retrieve Attention(q, x_{1:k}) = o_k = â_k/ĉ_k, resulting in an efficient parallelized method for computing attention as a many-to-many RNN (Figure 3).

Figure 3: Attention as a many-to-many RNN.
Figure 4: Stacking Aarens for sequence modelling.
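The sketch below is an illustrative rendering of this construction: it implements the operator ⊕ exactly as defined above and applies it as an inclusive prefix scan via itertools.accumulate, a sequential stand-in for the parallel scan of Algorithm 1. The function names and the assumption s_i = q·k_i are ours:

```python
import numpy as np
from itertools import accumulate

def combine(left, right):
    # The associative operator ⊕ on tuples (m_A, u_A, w_A) defined above.
    m_a, u_a, w_a = left
    m_b, u_b, w_b = right
    m = max(m_a, m_b)
    u = u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m)
    w = w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m)
    return (m, u, w)

def attention_many_to_many(q, keys, values):
    # Returns [Attention(q, x_{1:k}) for k = 1..N] from a prefix scan of the operator.
    leaves = [(q @ k_i, 1.0, v_i) for k_i, v_i in zip(keys, values)]  # (s_i, 1, v_i)
    prefixes = accumulate(leaves, combine)                            # yields (m_k, ĉ_k, â_k)
    return [a_k / c_k for (_, c_k, a_k) in prefixes]

d, N = 8, 16
rng = np.random.default_rng(0)
q, keys, values = rng.normal(size=d), rng.normal(size=(N, d)), rng.normal(size=(N, d))
outs = attention_many_to_many(q, keys, values)  # outs[-1] equals the many-to-one output o_N
```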
3.3 Aaren: Attention as a Recurrent Neural Network
Unlike Transformers, where the query is one of the input tokens to attention, Aaren's query token q is learned during training via backpropagation. In Figure 4, we include an example of a stacked Aaren model with input context tokens x_{1:3} and outputs y_{1:3}. Notably, since Aaren leverages the RNN formulation of attention, the stacking of Aarens is also the stacking of RNNs. Therefore, Aarens are also able to perform updates with new tokens efficiently, i.e., the iterative computation of y_k only requires constant computation as it relies solely on h_{k−1} and x_k. Unlike Transformer-based models, which (1) require linear memory (when using KV-caching) and (2) require storing all previous tokens, including those in intermediate Transformer layers, Aaren-based models (1) require only constant memory and (2) do not require storing all previous tokens, making Aarens significantly more computationally efficient than Transformers.
³See Appendix B for proof of the correctness and associative nature of ⊕.
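The following schematic sketch shows stacked Aaren-style layers performing constant-memory streaming updates by reusing the RNN cell of Section 3.1 with a learned query per layer. The key/value projections and the class structure are illustrative assumptions, not necessarily the paper's exact parameterization:

```python
import numpy as np

class AarenLayerSketch:
    # Illustrative Aaren-style layer: a learned query attends over its input stream as an RNN.
    def __init__(self, d, rng):
        self.q = rng.normal(size=d)              # the query token, learned via backpropagation
        self.W_k = rng.normal(size=(d, d))       # illustrative key/value projections
        self.W_v = rng.normal(size=(d, d))
        self.state = (np.zeros(d), 0.0, 0.0)     # (a_hat, c_hat, m)

    def step(self, x_t):
        # Constant-memory update with the new token; returns this layer's output y_k.
        k_t, v_t = self.W_k @ x_t, self.W_v @ x_t
        a, c, m = self.state
        s_t = self.q @ k_t
        m_new = max(m, s_t)
        scale = np.exp(m - m_new)
        c = c * scale + np.exp(s_t - m_new)
        a = a * scale + np.exp(s_t - m_new) * v_t
        self.state = (a, c, m_new)
        return a / c

d, depth = 8, 3
rng = np.random.default_rng(0)
layers = [AarenLayerSketch(d, rng) for _ in range(depth)]
for x_t in rng.normal(size=(16, d)):             # tokens arriving as a stream
    h = x_t
    for layer in layers:                         # stacking Aarens == stacking RNNs
        h = layer.step(h)                        # constant memory per layer, per token
```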
4 Experiments
Our objective in the experiments is to compare Aarens with Transformers in terms of (1) performance
and (2) resources required (time and memory). To perform a comprehensive comparison, we evaluate
across four problem settings: reinforcement learning, event forecasting, time series forecasting, and
time series classification.
Datasets. In total, we evaluate Aarens and Transformers on 38 datasets, the majority of which are
real-world datasets. The datasets are split between problem settings as follows: 12 reinforcement
learning datasets, 8 event forecasting datasets, 8 time series forecasting datasets, and 10 time series
classification datasets. For each dataset, the models are evaluated with 5 seeds. Due to space
limitations, we refer the reader to Appendix C for descriptions of the individual datasets.
Models. To compare Aarens directly with Transformers, we replace the Transformers with Aarens in
domain-specialized Transformer models. For reinforcement learning, we performed the comparison
on Decision Transformer (Chen et al., 2021). For event forecasting, we performed the comparison on the Transformer Hawkes Process (Zuo et al., 2020; Bae et al., 2023). For time series forecasting, we performed the comparison on a Transformer with input normalization following Liu et al. (2022). For
time series classification, we performed the comparison on a vanilla causal transformer following the
library by Wu et al. (2023).
Implementation Details. The experiments are run using the datasets and transformer implementa-
tions of popular repositories. More specifically, the reinforcement learning experiments with Decision
Transformer were run on the code by Barhate (2022). The time series forecasting and time series
classification experiments were run on the Time Series Library repository by Wu et al. (2023). The
event forecasting experiments were run on the code by Bae et al. (2023). As Transformers and
Aarens share the same interface and are both attention-based methods, they share the same set of
hyperparameters. For fairness, the same hyperparameters are used for both Transformers and Aarens.
Due to space limitations, specific hyperparameter details are included in Appendix E.⁴

⁴The code will be released alongside the camera-ready.
4.1 Reinforcement Learning

In these experiments, we compare Aarens with Transformers on reinforcement learning (RL). In RL, the objective during training is to learn a policy from the feedback/rewards obtained through trial and error while interacting with an environment. As such, RL is popular in interactive
settings such as robotics, recommendation engines, and traffic control. For these experiments, we
consider Decision Transformers (Chen et al., 2021), a popular method for training a policy in an
offline manner on datasets of environment interactions. At test time, Decision Transformers are
evaluated in an online manner.
We evaluate on popular locomotion robotics environments from the D4RL benchmark (Fu et al., 2020):
HalfCheetah, Ant, Hopper, and Walker. For each of the locomotion environments, we compare the
models trained on three different kinds of datasets: Medium, Medium-Replay, and Medium-Expert.
Each of these datasets consists of 1 million timesteps generated by a policy. In total, we compare
model performances across 4 × 3 = 12 RL datasets. We refer the reader to Appendix C.1 for more
details regarding the individual tasks. The results in Table 1 show that Aarens achieve competitive
performance with Transformers across all twelve datasets and four environments. However, unlike Transformers, Aarens are able to efficiently process new environment interactions in constant computation due to also being RNNs, making them more suitable for reinforcement learning.

Reinforcement Learning (Score ↑)

Environment  | HalfCheetah                                   | Ant
Dataset      | Medium       | Med-Replay   | Med-Expert      | Medium       | Med-Replay   | Med-Expert
Transformer  | 41.88 ± 1.47 | 36.57 ± 1.40 | 75.98 ± 6.34    | 94.25 ± 8.62 | 89.39 ± 4.96 | 125.47 ± 10.99
Aaren        | 42.16 ± 1.89 | 37.91 ± 1.94 | 75.74 ± 15.13   | 93.29 ± 4.04 | 85.53 ± 6.57 | 119.72 ± 12.63

Environment  | Hopper                                        | Walker
Dataset      | Medium       | Med-Replay   | Med-Expert      | Medium       | Med-Replay   | Med-Expert
Transformer  | 80.18 ± 5.85 | 79.73 ± 7.64 | 98.82 ± 10.33   | 77.84 ± 3.81 | 72.36 ± 5.63 | 109.66 ± 0.45
Aaren        | 80.86 ± 4.77 | 77.87 ± 5.68 | 103.89 ± 11.89  | 74.44 ± 5.16 | 71.44 ± 6.55 | 110.51 ± 1.30

Table 1: Reinforcement Learning Results. Measurement of the D4RL score (higher is better) (Fu et al., 2020). The bolded results indicate the best-performing method.
4.2 Event Forecasting

In these experiments, we compare Aarens with Transformers on event forecasting (EF). In EF, the model is given a sequence of irregularly spaced discrete events in time and models the probability distribution of the next event time and its mark (i.e., event label/class). EF is popular in many
real-world settings such as finance (e.g., transactions), healthcare (e.g., patient observations), and
e-commerce (e.g., purchases) where user or system actions occur at irregularly spaced times. To
perform the comparison, we replace the Transformers in the Transformer Hawkes Process (Zuo et al., 2020) with Aarens. Following Bae et al. (2023), a mixture of log-normal distributions is used to model the probability distribution of the next event time. For this setting, we consider 8 popular
benchmarking datasets for next event forecasting (Zhang et al., 2020; Zuo et al., 2020; Bae et al.,
2023): MIMIC, Wiki, Reddit, Mooc, StackOverflow, Sin, Uber, and Taxi. 7 out of the 8 are real-world
datasets whereas only Sin is a synthetic dataset. 3 out of the 8 datasets (Sin, Uber, and Taxi) do not
include marks/labels. We refer the reader to Appendix C.2 for details regarding individual datasets.
The results in Table 2 show that Aarens performed comparably with Transformers across all datasets.
Aaren’s ability to efficiently process new inputs is a particularly useful feature in event forecasting
settings where events arrive in an irregular stream.
Event Forecasting

Metric       | NLL (↓)       |               | RMSE (↓)      |              | Acc (↑)       |
Dataset      | MIMIC         | Wiki          | MIMIC         | Wiki         | MIMIC         | Wiki
Transformer  | 1.22 ± 0.08   | 9.66 ± 0.98   | 1.60 ± 0.28   | 0.28 ± 0.04  | 84.07 ± 1.46  | 23.60 ± 2.66
Aaren        | 1.21 ± 0.06   | 8.98 ± 1.03   | 1.56 ± 0.32   | 0.22 ± 0.05  | 84.53 ± 0.66  | 21.26 ± 1.29

Dataset      | Reddit        | Mooc          | Reddit        | Mooc         | Reddit        | Mooc
Transformer  | 0.40 ± 0.29   | −0.22 ± 0.57  | 0.23 ± 0.03   | 0.20 ± 0.04  | 60.68 ± 1.62  | 37.79 ± 0.42
Aaren        | 0.31 ± 0.30   | 0.25 ± 1.61   | 0.30 ± 0.04   | 0.41 ± 0.28  | 62.34 ± 0.40  | 36.69 ± 1.48

Dataset      | StackOverflow | Sin           | StackOverflow | Sin          | StackOverflow |
Transformer  | 2.92 ± 0.04   | 0.68 ± 0.05   | 1.44 ± 0.08   | 1.75 ± 0.09  | 46.44 ± 0.08  |
Aaren        | 2.91 ± 0.02   | 0.78 ± 0.13   | 1.27 ± 0.17   | 2.03 ± 0.25  | 46.34 ± 0.21  |

Dataset      | Uber          | Taxi          | Uber          | Taxi         |               |
Transformer  | 3.33 ± 0.14   | 2.01 ± 0.17   | 73.63 ± 5.73  | 10.34 ± 0.32 |               |
Aaren        | 3.48 ± 0.10   | 2.33 ± 0.12   | 54.61 ± 5.40  | 10.01 ± 0.52 |               |

Table 2: Event Forecasting Results. Sin, Uber, and Taxi datasets do not include marks/labels.
4.3 Time Series Forecasting

In these experiments, we compared Aarens with Transformers on time series forecasting (TSF). In TSF, the model is given a series of observations of temporally continuous signals. The objective of the model is to predict T future values of the series. TSF models are commonly used in a wide range
of domains, including those related to climate (e.g., weather), energy (e.g., supply and demand),
and economics (e.g., stock prices). To perform the comparison, we consider a causally masked
Transformer with input normalization following Liu et al. (2022). For this setting, we consider 8
real-world datasets used in prior works: Weather, Exchange, Traffic, ECL, ETTh1, ETTh2, ETTm1,
and ETTm2. For details regarding the individual datasets, we refer the reader to Appendix C.3.
Following Wu et al. (2023), the models are evaluated with T ∈ {96, 192, 336, 720} given an input
length of 96. Due to space limitations, Table 3 only includes results for T = 192. We refer the
reader to Table 5 in Appendix D for the full results. The results in Table 3 show that Aarens perform comparably with Transformers across all datasets. However, unlike Transformers, Aarens efficiently process the time series data, making them more suitable for time series-related domains.
Time Series Forecasting

Metric       | MSE (↓)     |             |             | MAE (↓)     |             |
Dataset      | Weather     | Exchange    | Traffic     | Weather     | Exchange    | Traffic
Transformer  | 0.24 ± 0.01 | 0.24 ± 0.02 | 0.63 ± 0.01 | 0.28 ± 0.00 | 0.34 ± 0.01 | 0.34 ± 0.00
Aaren        | 0.25 ± 0.01 | 0.25 ± 0.03 | 0.64 ± 0.01 | 0.28 ± 0.00 | 0.33 ± 0.02 | 0.35 ± 0.00

Dataset      | ETTh1       | ETTm1       | ECL         | ETTh1       | ETTm1       | ECL
Transformer  | 0.64 ± 0.05 | 0.52 ± 0.05 | 0.39 ± 0.03 | 0.57 ± 0.02 | 0.47 ± 0.01 | 0.48 ± 0.02
Aaren        | 0.59 ± 0.03 | 0.51 ± 0.03 | 0.37 ± 0.02 | 0.55 ± 0.01 | 0.47 ± 0.01 | 0.45 ± 0.01

Dataset      | ETTh2       | ETTm2       |             | ETTh2       | ETTm2       |
Transformer  | 0.50 ± 0.03 | 0.38 ± 0.02 |             | 0.46 ± 0.01 | 0.37 ± 0.01 |
Aaren        | 0.49 ± 0.03 | 0.34 ± 0.04 |             | 0.48 ± 0.02 | 0.39 ± 0.02 |

Table 3: Time Series Forecasting Results. Following Wu et al. (2023), the models are evaluated with T ∈ {96, 192, 336, 720} given an input length of 96. Due to space limitations, this table only includes results for T = 192. We refer the reader to Appendix D (Table 5) for the full results.
4.4 Time Series Classification

In these experiments, we compared Aarens with Transformers on time series classification (TSC). In TSC, the model's objective is to predict the label of a time series. This setting is common in many important applications such as pattern recognition (e.g., electrocardiograms), anomaly detection (e.g., bank fraud), or failure prediction (e.g., power grid fluctuations) (Dinger et al., 2022). For this setting, we consider 10 popular real-world datasets from the UEA Time Series Classification Archive (Bagnall et al., 2018): EthanolConcentration, FaceDetection, Handwriting, Heartbeat, JapaneseVowels, PEMS-SF, SelfRegulationSCP1, SelfRegulationSCP2, SpokenArabicDigits, and UWaveGestureLibrary. For details regarding the individual datasets, we refer the reader to Appendix C.4. In Table 4, we see that Aarens perform comparably with Transformers across all datasets.
4.5 Analyses
In these experiments, we compare Aarens with Transformers in terms of the resources required. To
do so, we use the code by Barhate (2022). For Transformers, we use KV-caching to improve their
efficiency.
Memory Complexity: In Figure 5 (left), we compare the memory usage of Aarens and Transformers (using KV-caching) at inference time. We see that the memory usage of Transformers grows linearly in the number of tokens, even with the KV-caching technique. In contrast, Aarens use only constant memory regardless of the number of tokens, making them significantly more efficient.
Time Complexity: In Figure 5 (right), we compare the cumulative time needed by Aarens and
Transformers (using KV-caching) for sequentially processing a sequence of tokens. In the case
of Transformers, the cumulative amount of computation is quadratic in the number of tokens, i.e.,
O(1 + 2 + . . . + N ) = O(N 2 ). In contrast, for Aaren, the cumulative amount of computation is
linear. In the figure, we see a similar result in the cumulative time needed by the models. Specifically,
the cumulative time needed by Transformers grows quadratically while that of Aaren grows linearly.
Figure 5: Computational Resources. Plots comparing Aarens and Transformers (using KV-caching) when processing a sequence of tokens. (Left) Memory Usage Comparison. (Right) Cumulative Time Comparison.

Number of Parameters: Due to learning the initial hidden state q, Aaren modules require slightly more parameters than Transformer modules. However, the difference is marginal since q is only a vector. Measuring this empirically in comparable models, we found that Transformers used 3,152,384 parameters. In contrast, the equivalent Aarens used 3,152,896 parameters, representing only a marginal ∼0.016% parameter increase, a minor trade-off for the significant gains in memory and time complexities.
5 Related Work
Closest to Aaren are approximations of attention such as those by RWKV (Peng et al., 2023),
RetNet (Sun et al., 2023), and Linear Transformer (Katharopoulos et al., 2020). These models
proposed linearizations of the standard softmax-based attention that allow them to be formulated as
an RNN. However, in doing so, these models also encode an exponential factor that biases tokens
based on their timestamp, limiting their potential applications. In contrast, Aaren leverages an exact
re-formulation of softmax attention as an RNN, allowing the model itself to compute the weight of
each token.
Feng et al. (2023) showed attention can be computed recurrently, using it to compress set-based inputs.
Rabe and Staats (2022) introduced a recurrent formulation of attention, showing that self-attention
can be computed efficiently. Katharopoulos et al. (2020) showed that Transformers with a causal
mask can be viewed as an RNN. In contrast, we (1) show a more general result, namely that any attention model can be viewed as an RNN. Furthermore, we (2) introduce Aaren, a new attention formulation based on parallel prefix sums that achieves results competitive with Transformers while being more efficient.
The problem of computing prefix scans/sums has been well studied with various efficient parallelized
algorithms proposed for computing them. Since Aaren only requires the output of the prefix scan,
any efficient algorithm for computing it can be used. In this work, we outlined the method by
Hillis and Steele (1986). This method is time-efficient for parallel computation, requiring log₂(N) sequential steps and O(N log N) overall computation. In contrast, the method by Ladner and Fischer (1980) uses more sequential steps (specifically, 2 log₂(N) − 2) but only performs O(N) overall computation. For a more in-depth introduction to parallel prefix sum algorithms, we refer the
reader to the following work by Blelloch (1990).
In this work, we applied Transformers to a subset of applications. For a broad overview of the
applications of Transformers, we refer the reader to the following survey by Islam et al. (2023). For
an overview of different transformer models applied to the specific settings considered in this paper,
we refer the reader to the following surveys (1) on transformers in reinforcement learning by Li et al.
(2023) and (2) on transformers in event forecasting, time series forecasting, time series classification,
and more by Wen et al. (2022).
6 Conclusion
In this work, we showed that attention can be formulated as an RNN, and that the conventional way of computing attention is a parallelized method of computing its many-to-one RNN output. Building on the RNN formulation, we showed that existing attention-based models can be formulated as RNNs.
However, unlike traditional RNNs such as LSTMs and GRUs, these methods cannot be updated
efficiently with new tokens. Addressing this, we introduced a new parallelized method of computing
attention’s many-to-many RNN output based on the parallel prefix scan algorithm. Building on the
new attention formulation, we introduced Aaren, a new module that can not only (i) be trained in
parallel (like Transformers) but also (ii) be efficiently updated at inference time, thereby requiring only
constant memory (like RNNs). Empirically, we showed that Aarens achieve performance competitive
with Transformers on 38 datasets spread across four sequential data settings: reinforcement learning,
event forecasting, time series classification, and time series forecasting. Finally, we empirically showed that Aarens are significantly more time- and memory-efficient than Transformers.
References
Agarwal, P., Rahman, A. A., St-Charles, P.-L., Prince, S. J., and Kahou, S. E. (2023). Transformers
in reinforcement learning: A survey. arXiv preprint arXiv:2307.05979.
Bae, W., Ahmed, M. O., Tung, F., and Oliveira, G. L. (2023). Meta temporal point processes. In
International Conference on Learning Representations.
Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A., Southam, P., and Keogh, E. (2018). The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075.
Barhate, N. (2022). Minimal implementation of decision transformer. https://github.com/nikhilbarhate99/min-decision-transformer.
Blelloch, G. E. (1990). Prefix sums and their applications. Technical Report CMU-CS-90-190, School
of Computer Science, Carnegie Mellon University.
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and
Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling.
Advances in neural information processing systems, 34:15084–15097.
Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
Dinger, T., Chang, Y.-c., Pavuluri, R., and Subramanian, S. (2022). What is time series classification? https://developer.ibm.com/learningpaths/get-started-time-series-classification-api/what-is-time-series-classification/.
Feng, L., Tung, F., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. (2023). Memory efficient
neural processes via constant memory attention block. arXiv preprint arXiv:2305.14567.
Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. (2020). D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT
Press.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor. In International conference on
machine learning, pages 1861–1870. PMLR.
Hillis, W. D. and Steele, G. L. (1986). Data parallel algorithms. Commun. ACM, 29:1170–1183.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9:1735–80.
Islam, S., Elmekki, H., Elsebai, A., Bentahar, J., Drawel, N., Rjoub, G., and Pedrycz, W. (2023). A
comprehensive survey on applications of transformers for deep learning tasks. Expert Systems with
Applications, page 122666.
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. (2021). Perceiver:
General perception with iterative attention. In International conference on machine learning, pages
4651–4664. PMLR.
Jiang, Y., Pan, Z., Zhang, X., Garg, S., Schneider, A., Nevmyvaka, Y., and Song, D. (2024). Empow-
ering time series analysis with large language models: A survey. arXiv preprint arXiv:2402.03182.
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR.
Ladner, R. E. and Fischer, M. J. (1980). Parallel prefix computation. J. ACM, 27(4):831–838.
Latif, S., Zaidi, A., Cuayahuitl, H., Shamshad, F., Shoukat, M., and Qadir, J. (2023). Transformers in
speech processing: A survey. arXiv preprint arXiv:2303.11607.
Li, W., Luo, H., Lin, Z., Zhang, C., Lu, Z., and Ye, D. (2023). A survey on transformers in
reinforcement learning. arXiv preprint arXiv:2301.03044.
Lin, T., Wang, Y., Liu, X., and Qiu, X. (2022). A survey of transformers. AI Open.
Liu, Y., Wu, H., Wang, J., and Long, M. (2022). Non-stationary transformers: Exploring the
stationarity in time series forecasting. Advances in Neural Information Processing Systems,
35:9881–9893.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Derczynski, L., et al. (2023). RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077.
Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and
Dean, J. (2023). Efficiently scaling transformer inference. Proceedings of Machine Learning and
Systems, 5.
Rabe, M. N. and Staats, C. (2022). Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682.
Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. (2023). Retentive
network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., and Sun, L. (2022). Transformers in time
series: A survey. arXiv preprint arXiv:2202.07125.
Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., and Long, M. (2023). TimesNet: Temporal 2D-variation modeling for general time series analysis. In International Conference on Learning Representations.
Zhang, Q., Lipani, A., Kirnap, O., and Yilmaz, E. (2020). Self-attentive Hawkes process. In International conference on machine learning, pages 11183–11193. PMLR.
Zheng, Q., Zhang, A., and Grover, A. (2022). Online decision transformer. In international conference
on machine learning, pages 27042–27059. PMLR.
Zuo, S., Jiang, H., Li, Z., Zhao, T., and Zha, H. (2020). Transformer Hawkes process. In International conference on machine learning, pages 11692–11702. PMLR.
Appendix
B Proof of the Correctness and Associativity of ⊕

Expanding the computation of w_{A∪B}:

w_{A∪B} = w_A exp(m_A − m_{A∪B}) + w_B exp(m_B − m_{A∪B})
        = [Σ_{i∈A} exp(s_i − m_A) v_i] exp(m_A − m_{A∪B}) + [Σ_{i∈B} exp(s_i − m_B) v_i] exp(m_B − m_{A∪B})
        = Σ_{i∈A} exp(s_i − m_{A∪B}) v_i + Σ_{i∈B} exp(s_i − m_{A∪B}) v_i
        = Σ_{i∈A∪B} exp(s_i − m_{A∪B}) v_i

resulting in m_{A∪B} = max_{i∈A∪B} s_i, u_{A∪B} = Σ_{i∈A∪B} exp(s_i − m_{A∪B}), and w_{A∪B} = Σ_{i∈A∪B} exp(s_i − m_{A∪B}) v_i as needed.
For the operator ⊕ to be valid for the parallel scan algorithm, it must satisfy the associative property, i.e., (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c). Since the operator acts on 3-tuples of the form (m_A, u_A, w_A) via

(m_A, u_A, w_A) ⊕ (m_B, u_B, w_B) = (m_{A∪B}, u_{A∪B}, w_{A∪B}),

showing that ⊕ is associative amounts to showing that

[(m_A, u_A, w_A) ⊕ (m_B, u_B, w_B)] ⊕ (m_C, u_C, w_C) = (m_A, u_A, w_A) ⊕ [(m_B, u_B, w_B) ⊕ (m_C, u_C, w_C)]
(m_{(A∪B)∪C}, u_{(A∪B)∪C}, w_{(A∪B)∪C}) = (m_{A∪(B∪C)}, u_{A∪(B∪C)}, w_{A∪(B∪C)})

which holds because each component depends only on the combined index set (as a maximum or a sum over it) and set union is associative, i.e., (A ∪ B) ∪ C = A ∪ (B ∪ C).
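As a quick numerical sanity check of this associativity property (under the same illustrative singleton construction (s_i, 1, v_i) as in Section 3.2, with names of our own choosing), one can verify (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) up to floating-point error:

```python
import numpy as np

def combine(left, right):
    # The operator ⊕ on (m, u, w) tuples from Section 3.2 / Appendix B.
    m_a, u_a, w_a = left
    m_b, u_b, w_b = right
    m = max(m_a, m_b)
    u = u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m)
    w = w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m)
    return (m, u, w)

rng = np.random.default_rng(0)
def leaf():
    # A singleton tuple (m_{i}, u_{i}, w_{i}) = (s_i, 1, v_i).
    return (rng.normal(), 1.0, rng.normal(size=4))

a, b, c = leaf(), leaf(), leaf()
lhs = combine(combine(a, b), c)   # (a ⊕ b) ⊕ c
rhs = combine(a, combine(b, c))   # a ⊕ (b ⊕ c)
assert np.isclose(lhs[0], rhs[0]) and np.isclose(lhs[1], rhs[1]) and np.allclose(lhs[2], rhs[2])
```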
C.1 Reinforcement Learning

Our code for the reinforcement learning experiments is based on that of Barhate (2022) (MIT License). For these experiments, we consider locomotion MuJoCo robotics environments (Apache 2.0 License) popular in deep RL:

• HalfCheetah is a two-legged 2-d robot comprising a head, a horizontal torso, two thighs, two shins, and two feet.
• Ant is a 3-d four-legged robot comprising a torso and four legs (2 parts per leg).
• Hopper is a single-legged 2-d robot comprising a torso, a thigh, a shin, and a foot.
• Walker(2D) is a two-legged 2-d robot comprising a vertical torso, two thighs, two shins, and two feet.
The datasets used in our experiments are from D4RL (Fu et al., 2020) (Apache 2.0 License):
C.2 Event Forecasting
Our code for the event forecasting experiments and datasets is based on that of Bae et al. (2023) (CC
BY-NC-SA 4.0 License). For these experiments, we consider 8 datasets used in prior works (Bae
et al., 2023; Zhang et al., 2020; Zuo et al., 2020).
3 out of the 8 datasets do not include marks/labels:
• Sin is a synthetic dataset generated from a sine function with a periodicity of 4π and a
domain of [0, 32π].
• Uber is a dataset based on Uber NYC pickup data from 2015.
• Taxi is a dataset based on NYC Taxi pickup data from 2013.
C.3 Time Series Forecasting

Our code for the time series forecasting experiments is based on the Time Series Library repository (MIT License) by Wu et al. (2023). For these experiments, we consider 8 popular datasets available in the Time Series Library repository:
C.4 Time Series Classification

Our code for the time series classification experiments is based on the Time Series Library repository (MIT License) by Wu et al. (2023). For these experiments, we consider 10 UEA datasets (Bagnall et al., 2018) available in the Time Series Library repository:
• JapaneseVowels is a dataset of recordings of Japanese-male vowel pronunciations.
• PEMS-SF is a dataset describing the occupancy rate of car lanes in San Francisco Bay
Area’s freeways available on California’s Department of Transportation PEMS (Performance
Measurement System) website.
• SelfRegulationSCP1 is a dataset of cortical potential recordings from a healthy subject.
• SelfRegulationSCP2 is a dataset of cortical potential recordings from an artificially respirated
ALS (Amyotrophic lateral sclerosis) patient.
• SpokenArabicDigits is a dataset of recordings of ten spoken Arabic digits from 44 male and
44 female Arabic native speakers.
• UWaveGestureLibrary is a dataset of accelerometer measurements of eight gestures.
D Additional Experiments
D.1 Full Time Series Forecasting Results
In time series forecasting, the objective of the model is to predict T future values of the series. Following the standard of previous works (Wu et al., 2023), we evaluated with T ∈ {96, 192, 336, 720}. However, due to space limitations, only T = 192 was included in the main paper. Here, in Table 5, we include the full results. We see that for all values of T, Aarens performed comparably with Transformers across all datasets. However, unlike Transformers, Aarens can efficiently process new inputs, making them more advantageous in sequential settings such as time series forecasting.

Time Series Forecasting (full results)

Dataset   | T   | Aaren MSE (↓) | Transformer MSE (↓) | Aaren MAE (↓) | Transformer MAE (↓)
ETTh1     | 96  | 0.53 ± 0.04   | 0.54 ± 0.01         | 0.52 ± 0.02   | 0.50 ± 0.01
          | 192 | 0.59 ± 0.03   | 0.64 ± 0.05         | 0.55 ± 0.01   | 0.57 ± 0.02
          | 336 | 0.65 ± 0.03   | 0.65 ± 0.02         | 0.55 ± 0.01   | 0.55 ± 0.01
          | 720 | 0.67 ± 0.05   | 0.70 ± 0.05         | 0.62 ± 0.02   | 0.58 ± 0.02
ETTh2     | 96  | 0.38 ± 0.02   | 0.41 ± 0.04         | 0.44 ± 0.02   | 0.40 ± 0.02
          | 192 | 0.49 ± 0.03   | 0.50 ± 0.03         | 0.48 ± 0.02   | 0.46 ± 0.01
          | 336 | 0.57 ± 0.05   | 0.59 ± 0.03         | 0.47 ± 0.02   | 0.50 ± 0.01
          | 720 | 0.55 ± 0.03   | 0.60 ± 0.03         | 0.52 ± 0.01   | 0.52 ± 0.01
ETTm1     | 96  | 0.48 ± 0.02   | 0.44 ± 0.01         | 0.44 ± 0.01   | 0.41 ± 0.01
          | 192 | 0.51 ± 0.03   | 0.52 ± 0.05         | 0.47 ± 0.01   | 0.47 ± 0.01
          | 336 | 0.54 ± 0.02   | 0.57 ± 0.03         | 0.49 ± 0.01   | 0.51 ± 0.01
          | 720 | 0.60 ± 0.03   | 0.66 ± 0.06         | 0.52 ± 0.01   | 0.56 ± 0.02
ETTm2     | 96  | 0.24 ± 0.03   | 0.25 ± 0.01         | 0.30 ± 0.02   | 0.30 ± 0.01
          | 192 | 0.34 ± 0.04   | 0.38 ± 0.02         | 0.39 ± 0.02   | 0.37 ± 0.01
          | 336 | 0.41 ± 0.03   | 0.49 ± 0.05         | 0.42 ± 0.01   | 0.43 ± 0.02
          | 720 | 0.51 ± 0.03   | 0.56 ± 0.02         | 0.49 ± 0.02   | 0.47 ± 0.01
Weather   | 96  | 0.18 ± 0.00   | 0.18 ± 0.00         | 0.23 ± 0.00   | 0.23 ± 0.00
          | 192 | 0.25 ± 0.01   | 0.24 ± 0.01         | 0.28 ± 0.00   | 0.28 ± 0.00
          | 336 | 0.31 ± 0.00   | 0.31 ± 0.02         | 0.32 ± 0.00   | 0.34 ± 0.01
          | 720 | 0.40 ± 0.00   | 0.38 ± 0.02         | 0.39 ± 0.00   | 0.39 ± 0.01
Exchange  | 96  | 0.14 ± 0.01   | 0.14 ± 0.01         | 0.27 ± 0.01   | 0.25 ± 0.01
          | 192 | 0.25 ± 0.03   | 0.24 ± 0.02         | 0.33 ± 0.02   | 0.34 ± 0.01
          | 336 | 0.42 ± 0.04   | 0.41 ± 0.02         | 0.44 ± 0.02   | 0.45 ± 0.01
          | 720 | 1.20 ± 0.07   | 1.44 ± 0.19         | 0.79 ± 0.02   | 0.81 ± 0.04
Traffic   | 96  | 0.63 ± 0.01   | 0.61 ± 0.01         | 0.35 ± 0.00   | 0.34 ± 0.00
          | 192 | 0.64 ± 0.01   | 0.63 ± 0.01         | 0.35 ± 0.00   | 0.34 ± 0.00
          | 336 | 0.65 ± 0.01   | 0.64 ± 0.00         | 0.35 ± 0.00   | 0.34 ± 0.00
          | 720 | 0.68 ± 0.01   | 0.67 ± 0.00         | 0.36 ± 0.01   | 0.36 ± 0.00
ECL       | 96  | 0.36 ± 0.02   | 0.35 ± 0.02         | 0.46 ± 0.01   | 0.43 ± 0.01
          | 192 | 0.37 ± 0.02   | 0.39 ± 0.03         | 0.45 ± 0.01   | 0.48 ± 0.02
          | 336 | 0.47 ± 0.05   | 0.48 ± 0.06         | 0.52 ± 0.03   | 0.55 ± 0.03
          | 720 | 0.57 ± 0.05   | 0.62 ± 0.06         | 0.56 ± 0.02   | 0.55 ± 0.03

Table 5: Full Time Series Forecasting Results.
F Compute
Our experiments were run using a mix of Nvidia GTX 1080 Ti (11 GB) and Nvidia Tesla P100 (16 GB) GPUs. The analyses were performed on an Nvidia GTX 1080 Ti (11 GB).
The reinforcement learning experiments each required approximately the same amount of time: ∼2–4 hours.
The event forecasting experiments varied in time depending on the dataset:

• StackOverflow took ∼3.5 hours
• Sin took ∼1.5 hours
• Uber took ∼3 hours
• Taxi took ∼1.5 hours
The time series forecasting experiments were run as a single script for T ∈ {96, 192, 336, 720}. The
experiments varied in time depending on the dataset:
The time series classification experiments were run as a single script. Running the experiments on all
datasets took in total ∼ 1 hour.