Neural Networks and the Chomsky Hierarchy
Grégoire Delétang*1 Anian Ruoss*1 Jordi Grau-Moya1 Tim Genewein1 Li Kevin Wenliang1 Elliot Catt1 Chris Cundy†2 Marcus Hutter1 Shane Legg1 Joel Veness1 Pedro A. Ortega†
ABSTRACT
Reliable generalization lies at the heart of safe ML and AI. However, understanding
when and how neural networks generalize remains one of the most important
unsolved problems in the field. In this work, we conduct an extensive empirical
study (20 910 models, 15 tasks) to investigate whether insights from the theory of
computation can predict the limits of neural network generalization in practice.
We demonstrate that grouping tasks according to the Chomsky hierarchy allows
us to forecast whether certain architectures will be able to generalize to out-of-
distribution inputs. This includes negative results where even extensive amounts
of data and training time never lead to any non-trivial generalization, despite
models having sufficient capacity to fit the training data perfectly. Our results
show that, for our subset of tasks, RNNs and Transformers fail to generalize on
non-regular tasks, LSTMs can solve regular and counter-language tasks, and only
networks augmented with structured memory (such as a stack or memory tape) can
successfully generalize on context-free and context-sensitive tasks.
1 INTRODUCTION
Statistical learning theory is the most widely used theory of generalization in practical machine
learning, justifying empirical risk minimization and estimating the generalization error via a test
set (Vapnik, 1998). However, its central assumption that training and test data are independent and
identically distributed (i.i.d.) is violated for many problems of interest (distribution shifts, continual
learning, etc.). An example of such a non-i.i.d. setting is testing generalization on sequence prediction
problems, where an agent is trained with sequences of length ℓ ≤ N and tested with arbitrarily
longer sequences ℓ > N. This problem is of particular importance since it subsumes all computable
problems (Dawid, 1984; Rich, 2007; Sipser, 1997; Solomonoff, 2009; 2010). Central to sequence
prediction is inductive inference, which consists of deriving a general rule from a finite set of concrete
instances and using this rule to make predictions. For example, in program induction (Goldberg,
1989; Gomez et al., 2008; Holland, 1992; Liang et al., 2013; Nordin, 1997; Solomonoff, 1964a;b;
Wineberg & Oppacher, 1994), the goal is to obtain a model that correctly identifies the underlying
data-generating process given examples of input-output sequences. Then, if the model is correct, it
can produce results in accordance with the generative process for previously unseen input sequences.
The key challenge of inductive inference (as opposed to deduction) is that it does not allow selecting
one hypothesis with certainty among the ones that fit the data. For instance, the sequence 2, 4, 6, 8
has infinitely many possible continuations. Thus, any principle that selects a particular continuation
requires additional assumptions that are independent of the data, i.e., inductive biases (Mitchell,
1980). In machine learning, the network architecture, training mechanisms (e.g., gradient descent),
and initial distributions over parameters all generate their corresponding inductive biases. This has led
to a vast number of approaches for designing inductive biases via architectural and training protocol
changes (see Battaglia et al. (2018) for an overview). However, the problem is that stronger inductive
biases generally come at the cost of decreasing the universality of a model, and thus finding a good
balance between the two is one of the biggest challenges in the contemporary literature.
Even if a neural architecture is theoretically universal or Turing complete, gradient-based training,
which cannot exhaustively search the parameter space, can impede finding the right solution and thus
* Equal contribution. Correspondence to {gdelt, anianr}@deepmind.com.
1 DeepMind. 2 Stanford University. † Work performed while the author was at DeepMind.
[Figure 1: hierarchy diagram relating the language classes (finite, regular, deterministic context-free, context-sensitive, recursively enumerable, and counter languages), the architectures (FFNN, Transformer, RNN, LSTM, Stack-RNN, Tape-RNN), and the memory structures of the associated automata (finite-state controller, counter, stack, linear tape, infinite tape).]
Figure 1: Formal language classes and their correspondence with neural network architectures.
Left: Our empirical evaluation locates the architectures on the hierarchy of formal language classes.
Right: Each formal language class is associated with a minimal computational model (automaton) to
recognize or generate the language (see Section 3). All automata have a finite-state controller at their
core, in addition to increasingly restrictive memory access as we descend the hierarchy.
practically render the model non-universal. Therefore, both architectural and training limitations
impact which sequence prediction problems a model can solve in practice. In formal language theory,
the Chomsky hierarchy (Chomsky, 1956) classifies such (sequence prediction) problems by increasing
complexity. This hierarchy is associated with an equivalent hierarchy of models (automata) that can
solve different problem classes (Savage, 1998; Sipser, 1997). Lower-level automata have restrictive
memory models and can only solve lower-level problems, while Turing machines with infinite
memory and unrestricted memory access lie on top of the hierarchy and can solve all computable
problems. However, unlike for classical automata, a unified placement of neural architectures on the
Chomsky hierarchy has not yet been practically established, which is precisely the goal of our work.
This work We conduct an extensive empirical study with the aim of discovering how neural
network models used for program induction relate to the idealized computational models defined
by the Chomsky hierarchy in practice (see Fig. 1 for a summary of our findings). We investigate
whether the theoretical limitations of certain neural models hold in practice when trained with
gradient-based methods. For example, previous work has theoretically argued that RNNs are Turing
complete (Siegelmann & Sontag, 1994). However, more recent theoretical analyses (Ackerman &
Cybenko, 2020; Merrill, 2019; Weiss et al., 2018) showed that RNNs lie much lower on the Chomsky
hierarchy. To complement these theoretical analyses, we conduct a large-scale empirical evaluation
on sequence prediction problems; we summarize our main findings in Fig. 1.
2 RELATED WORK
Learning formal languages A long line of work has empirically investigated whether common
machine learning architectures, including RNNs (Elman, 1990), GRUs (Cho et al., 2014), SCNs (Giles
et al., 1992; Pollack, 1991), LSTMs (Hochreiter & Schmidhuber, 1997), and Transformers (Vaswani
et al., 2017), are capable of learning formal languages. The main insights are: These networks can
learn simple counting languages (Hölldobler et al., 1997; Steijvers & Grünwald, 1996; Wiles &
Elman, 1995; Rodriguez & Wiles, 1997) and the Dyck-1 language (Skachkova et al., 2018; Suzgun
et al., 2019a). To learn more advanced languages, RNNs and LSTMs require exponential memory in
terms of the input length (Sennhauser & Berwick, 2018), as they lack an external memory structure.
They are capable of learning simple context-sensitive languages, such as a^n b^n c^n, but in a very limited
way, i.e., they generalize only to lengths close to those seen during training (Bodén & Wiles, 2000;
2002; Gers & Schmidhuber, 2001). Similarly, Transformers cannot learn Dyck-n languages for
n > 1 and long sequences (Ebrahimi et al., 2020). Concurrent work also studies length generalization
on synthetic reasoning tasks (akin to learning formal languages) but for pretrained large language
models (Anil et al., 2022; Zhang et al., 2022). Thus, while prior work has investigated a single
architecture on a restricted set of tasks under different experimental conditions, we provide a unified
experimental protocol that spans all the levels of the Chomsky hierarchy for a wide range of models.
Neural networks and the Chomsky hierarchy It was theoretically shown that RNNs and Trans-
formers are Turing complete (Chen et al., 2018; Pérez et al., 2019; 2021; Siegelmann & Sontag,
1994). However, these results are impractical as they rely on an unbounded number of recurrent
steps and on arbitrary numerical precision. Thus, more recent work (Ackerman & Cybenko, 2020;
Bhattamishra et al., 2020; Hahn, 2020; Hao et al., 2022; Korsky & Berwick, 2019; Merrill, 2019;
Merrill et al., 2020; Merrill & Sabharwal, 2022; Weiss et al., 2018) has refined these theoretical
analyses by considering linear computation steps and logarithmic precision, showing that: (i) RNNs
and GRUs can, in theory, recognize regular languages, and (ii) LSTMs are strictly more powerful
since they can learn a counting mechanism (i.e., are k-counter machines). Moreover, it was theoreti-
cally shown that Transformers are not well-aligned with the Chomsky hierarchy since they cannot
recognize certain regular languages (e.g., periodic finite-state languages), while being able to learn
some counter languages (e.g., Shuffle-Dyck and n-ary Boolean expressions). A different approach
proposed a computational model to capture the Transformer operations and used it to show which
tasks could conceivably be learned by a Transformer (histograms, sorting, Dyck languages) (Weiss
et al., 2021). However, this approach only upper-bounds the capabilities of a model and does not
provide any insight on whether gradient-based methods will find parameters that can solve a task in
practice, which is precisely the goal of our work. In that sense, our work complements the above
studies by investigating how well gradient-based learning can exploit the inductive biases of common
machine learning architectures to recognize languages on different levels of the Chomsky hierarchy.
3 BACKGROUND
We now provide the necessary background on how to link sequence prediction and formal language
theory, allowing us to associate neural architectures and the Chomsky hierarchy. As we want to
evaluate the capability of networks to generalize to sequences of unseen length, we need a way of
generating arbitrarily long sequences that all exhibit the same structure. To that end, we will consider
sequences as elements of infinite formal languages, which are generated by a grammar.
Table 1: Chomsky hierarchy grammar types, their corresponding minimal automata and memory
structures required for language recognition or generation, their production rules, and the correspond-
ing example languages from Section 3. The production rule symbols are defined as: a, a terminal;
A, B, non-terminals; α, β, γ, strings of terminals and/or non-terminals (α and β can be empty, but not γ).
Formal language A formal language L is a set of words, where each word is a finite string of
symbols drawn from a finite alphabet Σ. For example, with a binary alphabet Σ = {0, 1} we can
describe the following infinite languages: A = {0^n 1^m | n, m > 0}, B = {w | w is a palindrome},
C = {ww | w is a word}, and D = {w | w describes a terminating Turing machine}.
Generative grammar Languages of infinite cardinality can be generated with a finite alphabet
and a finite generative grammar. A formal grammar (Chomsky, 1956) is defined by a finite set of
terminal symbols, a finite set of non-terminal symbols, a distinctive start symbol, and a finite set of
production rules that map from non-terminal symbols to other symbols. The set of all words that
can be formed by such a grammar is its corresponding formal language. For example, the binary
palindrome language B can be generated by the set of rules: P → ε | 0 | 1 | 0P0 | 1P1, with
start and nonterminal symbol P and terminal symbols {0, 1}. The Chomsky hierarchy arises from
categorizing the allowed relationships between terminals and non-terminals in the grammar (see
Table 1). For example, B is not regular, since the right-hand side contains a variable enclosed by two
terminal symbols, but context-free, since the left-hand side consists of a single nonterminal symbol.
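As an illustration of how such a finite grammar generates an infinite language, the following sketch expands the rule P → ε | 0 | 1 | 0P0 | 1P1 to sample binary palindromes (the helper name and expansion budget are illustrative, not taken from the paper's codebase):

```python
import random

# Production rules for the binary palindrome grammar B:
# P -> epsilon | 0 | 1 | 0P0 | 1P1   (epsilon is the empty string).
RULES = ["", "0", "1", "0P0", "1P1"]

def sample_palindrome(max_expansions: int = 10) -> str:
    """Rewrites the non-terminal P until none remains (or the budget runs out)."""
    word = "P"
    for step in range(max_expansions):
        if "P" not in word:
            break
        # On the last step, only allow terminating rules so the derivation ends.
        choices = RULES[:3] if step == max_expansions - 1 else RULES
        word = word.replace("P", random.choice(choices), 1)
    return word.replace("P", "")  # dropping a leftover P keeps the word a palindrome

words = {sample_palindrome() for _ in range(20)}
assert all(w == w[::-1] for w in words)  # every generated word is a palindrome
print(sorted(words, key=len))
```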
4 METHODS
Problem setup For each task we define input and output languages LI and LO , with their corre-
sponding alphabets ΣI and ΣO . The goal of a task is to learn the deterministic function f that maps
words x ∈ LI to words y ∈ LO , i.e., language transduction. To that end, we consider models pθ (·|x)
as conditional distributions over next possible sequence continuations (with parameters θ). We one-hot
encode the tokens in both languages LI and LO as vectors of length |ΣI | and |ΣO |, respectively, denot-
ing the corresponding j-th entries of the i-th one-hot tokens x_ij and y_ij. We use gradient-based training
to minimize the average cross-entropy loss, where C(x, y) := −(1/|y|) Σ_i y_i^⊤ log p_θ(·|x) denotes the
per-example loss (see Algorithm A.1). Note that we do not use auto-regressive models, but rather ap-
pend |y| empty dummy tokens to the input. Thus, the model will know when the input sequence x has
terminated by reading the first empty token. We compute the per-sequence accuracy as the percentage
of correctly predicted tokens, i.e., A(x, y) := (1/|y|) Σ_i 1[(arg max_j y_ij) = (arg max_j p_θ(·|x)_j)].
Moreover, we compute the overall performance score as the per-sequence accuracy averaged over all
sequences of unseen length. We run our experiments over 10 different network parameter initialization
seeds and report the maximum score instead of the average, as this better indicates if an architecture
is capable of learning the right algorithm at all (we provide the means and variances in Appendix B).
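A minimal sketch of the per-example loss C(x, y) and per-sequence accuracy A(x, y) defined above, assuming the model output is already a matrix of per-position probabilities over Σ_O (function and variable names are illustrative, not taken from the released code):

```python
import numpy as np

def cross_entropy_loss(y_onehot: np.ndarray, probs: np.ndarray) -> float:
    """C(x, y) = -(1/|y|) * sum_i y_i^T log p_theta(.|x), with probs[i, j] the model probability of token j at position i."""
    return float(-(y_onehot * np.log(probs + 1e-12)).sum() / len(y_onehot))

def per_sequence_accuracy(y_onehot: np.ndarray, probs: np.ndarray) -> float:
    """A(x, y): fraction of output positions whose argmax prediction matches the target."""
    return float((probs.argmax(axis=-1) == y_onehot.argmax(axis=-1)).mean())

# Toy check: 3 output tokens over an output alphabet of size 2.
y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]], dtype=float)
print(cross_entropy_loss(y, p), per_sequence_accuracy(y, p))  # accuracy = 2/3
```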
Tasks For each level of the Chomsky hierarchy, we define various sequence prediction tasks ranging
from modular arithmetic (R) to binary addition (CS). We place the most emphasis on CS tasks as they
are currently out of reach for state-of-the-art models and thus provide an ideal benchmark to guide
future research on architecture design. We list the tasks in Table 2 and provide a full formalization
in Table A.1. Although seemingly simple, these tasks are well-known examples in the theory of
computation and concisely demonstrate the key properties of the grammars in the Chomsky hierarchy,
i.e., counting, repetition, long-distance dependencies, and hierarchy. Note that the levels of our
tasks are independent of the levels of LI or LO but correspond to the automata that can solve the
corresponding transduction tasks. For example, for the Reverse String task, LI = LO = {a, b}∗ ,
which are both regular. However, learning the function f that maps w ∈ LI to reverse(w) ∈ LO
requires a deterministic push-down transducer, thus rendering the task deterministic context-free.
Architectures We consider a wide range of neural network architectures, including both state-of-
the-art and memory-augmented models (full details in Appendix A.2). Our goal is to have at least one
model per level of the Chomsky hierarchy. To that end, we use a standard RNN as a controller and
augment it with two kinds of differentiable memory structures: a deterministic stack and a bounded
tape. The stack and tape have elements in Rd with d = 8, and we ensure that the stack and tape sizes
are large enough to simulate infinite memory during training and testing. The Stack-RNN (Joulin &
Mikolov, 2015) can perform any linear combination of PUSH, POP, and NO-OP operations on the stack.
The Tape-RNN, which is inspired by the Baby-NTM (Suzgun et al., 2019b), can perform any linear
combination of WRITE-LEFT, WRITE-RIGHT, WRITE-STAY, JUMP-RIGHT, JUMP-LEFT (details in
Appendix A). In addition to these memory-augmented models, we also evaluate LSTMs (Hochreiter
& Schmidhuber, 1997) and Transformers (Vaswani et al., 2017). We compute the Transformer
output from an encoder-only model that attends over the whole input sequence. We consider five
different positional encodings: none, classical sin/cos (Vaswani et al., 2017), the rotary positional
encodings (RoPE) (Su et al., 2021), ALiBi (Press et al., 2021), and the relative positional encodings
of Transformer-XL (Dai et al., 2019). The latter two have been shown to allow extrapolation to longer
sequences on some language modeling tasks. All models receive the tokens x as an input and are
expected to produce the output y (i.e., we do not feed the output back to the input unlike sequence-to-
sequence models (Sutskever et al., 2014)). We study CNNs, Stack-LSTMs, NDStack-RNNs (DuSell
& Chiang, 2020; 2022), and autoregressive versions of our models in Appendix B.
Table 2: Score (in percentage, see Section 4), i.e., accuracy averaged over all test lengths and
maximized over 10 random seeds (and other hyperparameters, see Appendix A). We consider a
model to generalize successfully (bold) if its score ≥ 90%. The random accuracy is 50% (except
for Cycle Navigation (R), Bucket Sort (CS), and the two modular arithmetic tasks where it
is 20%). We observe that, in general, RNNs with more permissive memory structures can solve
more challenging tasks. We denote permutation-invariant tasks with †, counting tasks with ?, tasks
requiring a nondeterministic controller with ◦, and tasks requiring superlinear running time with ×.
Figure 2: Performance curves on three tasks: (a) Parity Check (R), (b) Modular Arithmetic (DCF),
and (c) Duplicate String (CS). The dashed vertical red line is the training range, meaning that
sequences to the right have not been seen during training and thus measure generalization.
Data generation Instead of using fixed-size datasets, we define training and test distributions from
which we continually sample sequences. We define the maximum training sequence length as the
training range N, with N = 40. Every mini-batch is generated by first sampling a length ℓ from
the uniform distribution U(1, N) and then sampling sequences of length ℓ from the task’s generative
grammar. For testing, we sample the sequence length ℓ from U(N + 1, M), with M = 500.
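A minimal sketch of this sampling scheme, assuming a per-task sample_word(length) generator (the interface and constant names are illustrative, not the repository's API):

```python
import random

TRAINING_RANGE_N = 40
TEST_MAX_M = 500

def sample_batch(sample_word, batch_size: int, training: bool):
    """Samples a fresh mini-batch; sample_word(length) draws one word from the task grammar."""
    if training:
        length = random.randint(1, TRAINING_RANGE_N)                 # l ~ U(1, N)
    else:
        length = random.randint(TRAINING_RANGE_N + 1, TEST_MAX_M)    # l ~ U(N + 1, M)
    return [sample_word(length) for _ in range(batch_size)]

# Example with a trivial placeholder "task": uniformly random binary strings.
rand_word = lambda l: "".join(random.choice("ab") for _ in range(l))
print(sample_batch(rand_word, batch_size=2, training=False))
```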
We provide an open-source implementation of our models, tasks, and training and evaluation suite at
https://github.com/deepmind/neural_networks_chomsky_hierarchy.
5 RESULTS
In this section, we provide the results of our empirical evaluation. In Section 5.1 we relate the
networks to the Chomsky hierarchy, in Section 5.2 we analyze the algorithms learned by the networks,
and in Sections 5.3 and 5.4 we discuss the performance of LSTMs and Transformers in more detail.
(a) Internal RNN states on Parity Check (R), colored by the previous state and the current input.
(b) Stack-RNN action probabilities (above) and stack values (below) on the Modular Arithmetic (DCF) task.
(c) Tape-RNN action probabilities and memory values on Duplicate String (CS).
Figure 3: Analysis of the internal representations and memory dynamics of an RNN on a regular
task, a Stack-RNN on a DCF task, and a Tape-RNN on a CS task. The reverse-engineered network
mechanisms are indicative of the correct algorithms that solve the corresponding tasks.
We evaluate all architectures (defined in Section 4) on all tasks (defined in Table A.1) and show
the generalization accuracy averaged over the segment U(N + 1, M ) (as described in Section 4) in
Table 2. Furthermore, Fig. 2 displays the accuracy per sequence length for the best-performing model
per architecture on three sample tasks (see Appendix B for the remaining tasks). As the test sequences
have not been seen during training, we can gauge whether the architectures can learn the “correct”
algorithm. We observe that the networks generally match the computational models associated with
the Chomsky hierarchy: RNNs can solve tasks up to the regular level, Stack-RNNs up to the DCF
level, and Tape-RNNs up to the CS level. However, the match is not perfect, as some architectures
cannot solve tasks at their supposed level, e.g., Stack-RNN cannot solve Solve Equation (DCF)
and the Tape-RNN cannot solve the (challenging) Binary Multiplication (CS) and Compute
Sqrt (CS) tasks. This may be due to shortcomings in the architecture, the training protocol, or
the particular difficulty of the tasks. For example, Solve Equation (DCF) is context-free but
requires a nondeterministic FSA controller because it has to compute multiple modular arithmetic
expressions for the different values of the unknown variable x and compare their results to the actual
solution of the equation, which is only provided at the end of the input sequence. Similarly, Binary
Multiplication (CS) is a context-sensitive task but has a quadratic complexity, which is infeasible
for the architectures we investigate. Conversely, some architectures show non-trivial performance
on tasks beyond their supposed level, such as the Stack-RNN on Bucket Sort (CS) (more details
in Appendix B). Finally, we observe that Transformers and LSTMs are not well-aligned with the
Chomsky hierarchy: Transformers fail on regular tasks, while LSTMs can solve (counting) tasks
more difficult than regular. We discuss both cases in detail below (Sections 5.3 and 5.4).
To provide further evidence supporting the claim that some of the networks are able to learn the right
algorithm, we now provide an analysis of solutions learned by some of our networks on our tasks, by
investigating internal state representations and memory-update dynamics.
Regular tasks According to the Chomsky hierarchy, we expect networks that solve regular tasks
to simulate a finite-state automaton. We investigate this hypothesis by analyzing the internal states
of the RNN controller and checking whether they are finite (or form a finite number of clusters).
Figure 3a shows a scatter plot of a PCA projection of the internal RNN states when trained on Parity
Check (R). We observe that the states form four clusters, each representing one of the four values of
the tuple (subsequence result, last token seen). In principle, one can solve this task with only two
states, corresponding to the subsequence result: If the state is 1 and the input token is 1, move to state
0, otherwise stay in state 1, and finally output the state itself. However, due to the problem symmetry,
the network also learns to compute the parity of the number of zeros in the string, therefore adding
another two states to the computation. We also investigate whether and how Stack-RNNs use their
stack for this task, which is unnecessary, in principle. Indeed, the Stack-RNNs only use NO-OP or POP
(which is equivalent to NO-OP in the absence of PUSH) actions, indicating that the RNN controller
uses only a finite-state automaton and does not rely on the stack (see Appendix B.3 for more details).
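For reference, the two-state automaton described above can be written down directly; this is a plain illustration of the target algorithm, not code extracted from a trained network:

```python
def parity_check(sequence: str) -> str:
    """2-state FSM: the state flips on every 'b'; output 'b' if the count of b's is even, else 'a'."""
    state = 0  # 0 = even number of b's seen so far
    for token in sequence:
        if token == "b":
            state = 1 - state
    return "b" if state == 0 else "a"

assert parity_check("aaabba") == "b"  # two b's -> even
assert parity_check("ab") == "a"      # one b  -> odd
```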
Deterministic context-free tasks On DCF tasks, we anticipate Stack-RNNs to use their stack.
Indeed, on Reverse String (DCF), the networks learn the expected algorithm: PUSH until an empty
token appears, and then POP until the stack is empty. Moreover, the internal state of the controller
seems to be largely guided by the last input token and not the last state (details in Appendix B.3).
On Modular Arithmetic (DCF), which requires a stack to remember past results when entering a
new bracket, the networks use the stack exactly as expected. Figure 3b shows that the Stack-RNN
learns to solve this task by: (i) pushing the last result when a bracket is opened, (ii) not operating
on the stack and only using its internal controller when a number or an operator is seen, and (iii)
popping the last result from the stack when a bracket is closed. The top plot reports the probability of
each action given an input, and the bottom plot shows the stack evolution along the sequence. On the
bottom plot, the x-axis is the timestep and the y-axis is the cell position in the stack (top of the stack
on the bottom). Each cell is colored according to the first principal component of its actual content.
Context-sensitive tasks In contrast to the RNN and Stack-RNN, the reverse-engineered network
mechanisms for the Tape-RNN are less interpretable. Nevertheless, Fig. 3c shows that the Tape-
RNN has learned an algorithm that is indicative of the data-generating grammar of the Duplicate
String (CS) task. The input to the network consists of the sequence a, a, b, a of length ℓ = 4,
followed by 2ℓ α-symbols that the RNN can use to compute and manipulate the tape, followed by 2ℓ
β-symbols used to output the result (i.e., the sequence a, a, b, a, a, a, b, a of length 2ℓ). The panels
show the action probabilities and memory contents over time. We observe that up to time t = 5 the
RNN writes on the memory with WRITE-LEFT. Then, on time-step 6 the RNN mixes JUMP-RIGHT
and WRITE-STAY to duplicate the input string in the memory, before moving back to the initial
position with JUMP-RIGHT on time-step 7. After this, the RNN idles using WRITE-STAY until it
reaches the first β-symbol, whereupon it starts outputting the result using WRITE-LEFT actions.
5.3 LSTMS
LSTMs are similar to RNNs as they also carry a hidden state and are unrolled over a sequence of
inputs. However, prior work showed that LSTMs are strictly more powerful than RNNs because they
can also solve counting tasks (Bodén & Wiles, 2000; 2002; Gers & Schmidhuber, 2001; Merrill, 2019;
Weiss et al., 2018). Indeed, Table 2 provides further evidence for this claim since LSTMs are capable
of solving Bucket Sort (CS) almost perfectly, even though the task is context-sensitive. That is,
counting tasks can reside on levels higher than regular in the Chomsky hierarchy, since every Turing
machine can, in theory, be simulated by a two-counter machine (Minsky, 1967). However, finding an
architecture that solves non-regular tasks with counters via gradient descent is more difficult than
when the controller has access to a more permissive memory structure, such as a stack or a tape.
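To see why counting suffices for Bucket Sort (CS): it is enough to keep one counter per alphabet symbol and then emit the symbols in order. The sketch below assumes a five-symbol alphabet (suggested by the 20% chance level in Table 2) and is purely illustrative:

```python
from collections import Counter

def bucket_sort(sequence: str, alphabet: str = "01234") -> str:
    """Sorting over a fixed alphabet only needs one counter per symbol."""
    counts = Counter(sequence)
    return "".join(symbol * counts[symbol] for symbol in alphabet)

assert bucket_sort("420130") == "001234"
```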
5.4 TRANSFORMERS
Unlike the other architectures, Transformers are permutation invariant w.r.t. the position of a token in
a sequence because they rely on a global attention mechanism. As a result, Transformers are more
scalable since all tokens can be processed in parallel, however, at the cost of reduced expressivity
(i.e., permutation invariance). Similar to real-world problems, most of our tasks are not permutation
invariant, e.g., reversing a string depends on the tokens’ position. To overcome this problem,
Transformers are generally augmented with positional encodings, which are added to the tokens to
simulate some notion of position. Consequently, the augmented tokens intrinsically contain position
information, i.e., the token x at position 0 will no longer be considered the same as the same token x
at position 10. As described in Section 4, we evaluate five encodings (none, sin/cos, RoPE, ALiBi,
and the relative positional encoding from Transformer-XL) and report the best-performing variant.
Indeed, Table 2 shows that Transformers are most successful on permutation-invariant tasks: They
solve Bucket Sort (CS) and demonstrate non-trivial generalization on Cycle Navigation (R).
Moreover, they solve Even Pairs (R), which is not permutation-invariant but only requires checking
whether the first and last tokens are equal (i.e., all other positions are irrelevant). However, for all
other tasks, Transformers fail to generalize, regardless of their positional encodings (see Fig. 4a for the
Figure 4: (a) Accuracy per length by positional encoding (none, sin/cos, RoPE, ALiBi, Transformer-XL)
on Reverse String (DCF). (b) First layer activations (PCA) by sequence length (40, 50, and 150).
different positional encodings on Reverse String (DCF) and Fig. B.7 for the remaining tasks). In
particular, Transformers are unable to solve all permutation-invariant tasks (e.g., Parity Check (R),
which is a well-known failure mode (Chiang & Cholak, 2022)). We hypothesize that this poor
performance is due to the positional encodings, which take on new values when the sequences grow
longer (even for relative encodings), meaning that the network activations will be out-of-distribution.
Indeed, Fig. 4b shows that the 2D PCA of the first layer activations is similar for sequences of length
40 (i.e., the training range, in blue) and 50 (in orange), but completely different for length 150 (in
green). All our other models, which do not rely on global self-attention, do not suffer from this effect,
as they only have access to a local view of the input token (and, potentially, a local memory reading).
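The following sketch illustrates this hypothesis with the classical sin/cos encoding of Vaswani et al. (2017): positions beyond the training range map to encoding vectors the network has never seen. This is an illustration only; the paper evaluates five encoding variants, and the distance measure below is not from the paper.

```python
import numpy as np

def sincos_positional_encoding(position: int, d_model: int = 64) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    i = np.arange(d_model // 2)
    angles = position / np.power(10000.0, 2 * i / d_model)
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

train = np.stack([sincos_positional_encoding(p) for p in range(40)])  # training positions
test = sincos_positional_encoding(150)                                # unseen test position
# Distance from the test-position encoding to its nearest training encoding (nonzero -> out-of-distribution).
print(np.min(np.linalg.norm(train - test, axis=-1)))
```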
6 DISCUSSION
In this paper, we considered generalization from a sequence prediction viewpoint, using formal
language theory to establish the computational complexity of each task. Naturally, we had to choose
a maximum test sequence length for our evaluations, which theoretically renders all of our input and
output languages finite. However, our sequences share the structure of an infinite language, allowing
us to measure generalization by testing on input sequences that are significantly longer than those
seen during training (and are thus out-of-distribution). Therefore, observing successful generalization
in our setting is strong evidence that the network has learned the “correct” algorithm. However,
for some tasks where we consider generalization to be successful we still see a slight degradation
of the accuracy as the test length increases. This is a result of implementing finite state machines
and memory updates with a neural network trained via SGD: Even slight numerical inaccuracies in
state-transitions or memory dynamics might accumulate when increasing the test sequence length.
Our experiments indicate that RNNs, LSTMs, and Transformers, admittedly powerful architectures,
are fairly limited in terms of their ability to generalize to longer inputs. However, this result must
be contrasted with their ability to generalize and detect/extrapolate patterns for fixed size inputs
or fixed size context windows. Transformers, for instance, are capable of learning complex and
highly structured generalization patterns, but they cannot overcome the limitation of not having an
extendable memory (which only becomes apparent when probed as in our experiments). This might
imply hard limits for scaling laws (Kaplan et al., 2020): Even significantly increasing the amount of
training data and the size of a Transformer are insufficient for it to “climb the Chomsky hierarchy”.
7 CONCLUSION
We leveraged the theory of computation to better understand how and why neural networks generalize
on algorithmic sequence prediction tasks. Our extensive empirical evaluation demonstrates that there
is a model hierarchy on the tasks we investigated, which are representative of the different levels of
the Chomsky hierarchy. In particular, we showed that state-of-the-art architectures, such as LSTMs
and Transformers, cannot solve seemingly simple tasks, such as duplicating a string, when evaluated
on sequences that are significantly longer than those seen during training. Moreover, we showed that
models interacting with an external memory structure, such as a stack or a finite tape, can climb the
Chomsky hierarchy, indicating a promising direction for improvements in architecture design.
Although our results are consistent with prior theoretical analyses (Ackerman & Cybenko, 2020;
Merrill, 2019; Weiss et al., 2018), and suggest that we can group neural architectures according to the
Chomsky hierarchy, our claims are limited to our empirical study. In particular, we cannot guarantee
that no tasks exist higher in the hierarchy (Fig. 1) that an architecture could solve. Similarly, we
cannot guarantee that no tasks exist lower or on the same level of the hierarchy that an architecture
cannot solve. Moreover, without extracting and analyzing the automata implemented by the networks,
we cannot strictly claim that our architectures generalize to arbitrary-length inputs, as we only test
up to a maximum length. Finally, our results are w.r.t. our precise experimental setting, i.e., if an
architecture fails to generalize we cannot guarantee that no weight configuration would solve the
task; we were simply unable to find such a configuration with our training protocol.
ACKNOWLEDGMENTS
We thank Ann He, Chris Dyer, Markus Kunesch, Róbert Csordás, Tom McGrath, and Zhengdong
Wang for their helpful feedback and insightful conversations.
REFERENCES
Joshua Ackerman and George Cybenko. A survey of neural networks and formal languages. CoRR,
2020.
Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay V. Ramasesh,
Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization
in large language models. CoRR, 2022.
Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David
Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones,
Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King,
Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring,
Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan
Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL
http://github.com/deepmind.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. In 3rd International Conference on Learning Representations,
2015.
Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and
Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Logical extrapolation
without overthinking. CoRR, 2022.
Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores
Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,
Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish
Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan
Wierstra, Pushmeet Kohli, Matthew M. Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu.
Relational inductive biases, deep learning, and graph networks. CoRR, 2018.
Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the ability and limitations of transformers
to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing, 2020.
Mikael Bodén and Janet Wiles. Context-free and context-sensitive dynamics in recurrent neural
networks. Connect. Sci., 2000.
Mikael Bodén and Janet Wiles. On learning context-free and context-sensitive languages. IEEE
Trans. Neural Networks, 2002.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL
http://github.com/google/jax.
Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. Recurrent neural
networks as weighted language recognizers. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2018.
David Chiang and Peter Cholak. Overcoming a theoretical limitation of self-attention, 2022. URL
https://arxiv.org/abs/2202.12172.
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for
statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing, 2014.
Noam Chomsky. Three models for the description of language. IRE Trans. Inf. Theory, 1956.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov.
Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the
57th Conference of the Association for Computational Linguistics, 2019.
Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long
short-term memory. In Proceedings of the 33nd International Conference on Machine Learning,
2016.
Sreerupa Das, C. Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities and
limitations of a recurrent neural network with an external stack memory. In Fourteenth Annual
Conference of the Cognitive Science Society, 1992a.
Sreerupa Das, C. Lee Giles, and Guo-Zheng Sun. Using prior knowledge in a NNPDA to learn
context-free languages. In Advances in Neural Information Processing Systems 5, 1992b.
A. Philip Dawid. Present position and potential developments: Some personal views: Statistical
theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General),
1984.
Brian DuSell and David Chiang. Learning context-free languages with nondeterministic stack rnns.
In Proceedings of the 24th Conference on Computational Natural Language Learning, 2020.
Brian DuSell and David Chiang. Learning hierarchical structures with differentiable nondeterministic
stacks. In The Tenth International Conference on Learning Representations, 2022.
Javid Ebrahimi, Dhruv Gelda, and Wei Zhang. How can self-attention networks recognize dyck-n
languages? In Findings of the Association for Computational Linguistics, 2020.
Jeffrey L. Elman. Finding structure in time. Cogn. Sci., 1990.
Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning.
CoRR, 2017.
Felix A. Gers and Jürgen Schmidhuber. LSTM recurrent networks learn simple context-free and
context-sensitive languages. IEEE Trans. Neural Networks, 2001.
C. Lee Giles, Clifford B. Miller, Dong Chen, Hsing-Hen Chen, Guo-Zheng Sun, and Yee-Chun Lee.
Learning and extracting finite state automata with second-order recurrent neural networks. Neural
Comput., 1992.
David E. Goldberg. Genetic Algorithms in Search Optimization and Machine Learning. Addison-
Wesley, 1989.
Faustino J. Gomez, Jürgen Schmidhuber, and Risto Miikkulainen. Accelerated neural evolution
through cooperatively coevolved synapses. J. Mach. Learn. Res., 2008.
Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, 2016.
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, 2014.
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-
Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John P. Agapiou,
Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain,
Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis.
Hybrid computing using a neural network with dynamic external memory. Nat., 2016.
Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to
transduce with unbounded memory. In Advances in Neural Information Processing Systems 28,
2015.
Çaglar Gülçehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural turing
machine with continuous and discrete addressing schemes. Neural Comput., 2018.
Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc.
Comput. Linguistics, 2020.
Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon
Mendelsohn. Context-free transductions with neural stacks. In Proceedings of the Workshop:
Analyzing and Interpreting Neural Networks for NLP, 2018.
Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard attention
transformers: Perspectives from circuit complexity. CoRR, 2022.
Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020.
URL http://github.com/deepmind/dm-haiku.
Matteo Hessel, David Budden, Fabio Viola, Mihaela Rosca, Eren Sezener, and Tom Hennigan. Optax:
composable gradient transformation and optimisation, in jax!, 2020. URL http://github.com/
deepmind/optax.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 1997.
John H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control, and Artificial Intelligence. MIT Press, 1992.
Steffen Hölldobler, Yvonne Kalinke, and Helko Lehmann. Designing a counter: Another case study
of dynamics and activation landscapes in recurrent networks. In Advances in Artificial Intelligence,
21st Annual German Conference on Artificial Intelligence, 1997.
Armand Joulin and Tomás Mikolov. Inferring algorithmic patterns with stack-augmented recurrent
nets. In Advances in Neural Information Processing Systems 28, 2015.
Lukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In 4th International Conference on
Learning Representations, 2016.
Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In 4th International
Conference on Learning Representations, 2016.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.
CoRR, 2020.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International
Conference on Learning Representations, 2015.
Samuel A. Korsky and Robert C. Berwick. On the computational power of rnns. CoRR, 2019.
Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. In 4th
International Conference on Learning Representations, 2016.
Percy Liang, Michael I. Jordan, and Dan Klein. Learning dependency-based compositional semantics.
Comput. Linguistics, 2013.
Ankur Arjun Mali, Alexander Ororbia, Daniel Kifer, and C. Lee Giles. Recognizing long grammatical
sequences using recurrent networks augmented with an external differentiable stack. In Proceedings
of the 15th International Conference on Grammatical Inference, 2021.
William Merrill. Sequential neural networks as automata. CoRR, 2019.
William Merrill and Ashish Sabharwal. Log-precision transformers are constant-depth uniform
threshold circuits. CoRR, 2022.
William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, and Eran Yahav.
A formal hierarchy of RNN architectures. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020.
Marvin L. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, Inc., 1967.
Tom M. Mitchell. The need for biases in learning generalizations. Technical report, Rutgers
University, 1980.
Michael Mozer and Sreerupa Das. A connectionist symbol manipulator that discovers the structure
of context-free languages. In Advances in Neural Information Processing Systems 5, 1992.
Peter Nordin. Evolutionary program induction of binary machine code and its applications. PhD
thesis, Dortmund University of Technology, 1997.
Jorge Pérez, Javier Marinkovic, and Pablo Barceló. On the turing completeness of modern neural
network architectures. In 7th International Conference on Learning Representations, 2019.
Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is turing-complete. J. Mach. Learn.
Res., 2021.
Jordan B. Pollack. The induction of dynamical recognizers. Mach. Learn., 1991.
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra. Grokking:
Generalization beyond overfitting on small algorithmic datasets. CoRR, 2022.
Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases
enables input length extrapolation. CoRR, 2021.
Eric Price, Wojciech Zaremba, and Ilya Sutskever. Extensions and limitations of the neural GPU.
CoRR, 2016.
Elaine Rich. Automata, Computability and Complexity. Prentice-Hall, 2007.
Paul Rodriguez and Janet Wiles. Recurrent neural networks can learn to implement symbol-sensitive
counting. In Advances in Neural Information Processing Systems 10, 1997.
John E. Savage. Models of computation - exploring the power of computing. Addison-Wesley, 1998.
Luzi Sennhauser and Robert C. Berwick. Evaluating the ability of lstms to learn context-free
grammars. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP,
2018.
Hava T. Siegelmann and Eduardo D. Sontag. Analog computation via neural networks. Theor.
Comput. Sci., 1994.
Michael Sipser. Introduction to the theory of computation. PWS Publishing Company, 1997.
Natalia Skachkova, Thomas Alexander Trost, and Dietrich Klakow. Closing brackets with recurrent
neural networks. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks
for NLP, 2018.
Ray J. Solomonoff. A formal theory of inductive inference. part I. Inf. Control., 1964a.
Ray J. Solomonoff. A formal theory of inductive inference. part II. Inf. Control., 1964b.
Ray J. Solomonoff. Algorithmic probability: Theory and applications. In Information Theory and
Statistical Learning. Springer US, 2009.
Ray J. Solomonoff. Algorithmic probability, heuristic programming and agi. In Proceedings of the
3d Conference on Artificial General Intelligence, 2010.
Mark Steijvers and Peter Grünwald. A recurrent network that performs a context-sensitive prediction
task. In Eighteenth Annual Conference of the Cognitive Science Society, 1996.
John Stogin, Ankur Arjun Mali, and C. Lee Giles. Provably stable interpretable encodings of context
free grammars in rnns with a differentiable stack. CoRR, 2020.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with
rotary position embedding, 2021.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks.
In Advances in Neural Information Processing Systems 28, 2015.
G. Z. Sun, C. L. Giles, H. H. Chen, and Y. C. Lee. The neural network pushdown automaton: model,
stack and learning simulations. Technical report, University of Maryland, 1993.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems 27, 2014.
Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart M. Shieber. LSTM networks can
perform dynamic counting. CoRR, 2019a.
Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart M. Shieber. Memory-augmented
recurrent neural networks can learn generalized dyck languages. CoRR, 2019b.
Vladimir Vapnik. Statistical learning theory. Wiley, 1998.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems 30, 2017.
Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision
rnns for language recognition. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics, 2018.
Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers. In Proceedings of the 38th
International Conference on Machine Learning, 2021.
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In 3rd International Confer-
ence on Learning Representations, 2015.
Janet Wiles and Jeff Elman. Learning to count without a counter: A case study of dynamics and
activation landscapes in recurrent networks. In Seventeenth Annual Conference of the Cognitive
Science Society, 1995.
Mark Wineberg and Franz Oppacher. A representation scheme to perform program induction in a
canonical genetic algorithm. In Parallel Problem Solving from Nature - PPSN III, International
Conference on Evolutionary Computation, 1994.
Greg Yang and Alexander M. Rush. Lie-access neural turing machines. In 5th International
Conference on Learning Representations, 2017.
Dani Yogatama, Yishu Miao, Gábor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil
Blunsom. Memory architectures in recurrent neural network language models. In 6th International
Conference on Learning Representations, 2018.
Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. CoRR, 2015.
Wojciech Zaremba, Tomás Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms
from examples. In Proceedings of the 33nd International Conference on Machine Learning, 2016.
Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner.
Unveiling transformers with LEGO: a synthetic reasoning task. CoRR, 2022.
A EXPERIMENTAL DETAILS
A.1 LANGUAGE RECOGNITION VS. TRANSDUCTION
In practice, learning to recognize a formal language by training a classifier on words and non-words
of a language is hard because there is, in general, no formal construction for generating a sufficient
but finite set of negative examples. Thus, prior work has relied on other approaches to evaluate
language recognition capabilities. These approaches are variants of language generation, which are
as hard as their recognition counterpart (for a given language) and thus are good proxy tasks.
The first approach is autoregressive token prediction (standard in the machine learning commu-
nity (Bahdanau et al., 2015; Vaswani et al., 2017)): Given a prefix p of a word w ∈ L, predict the
probability of its next token t ∈ Σ. The main problem with this approach is that the probabilities
depend on the length distribution of the training strings. For instance, for the language of palin-
dromes of length 2, with alphabet Σ = {a, b}, the output probabilities are P (a|prefix = a) = 1 and
P (b|prefix = b) = 1. However, for palindromes of length 3, the probabilities are very different,
i.e., P (a|prefix = a) = 0.5 and P (b|prefix = b) = 0.5. Therefore, there is a conflict between
the different distributions, and no model can solve the task without knowing the length beforehand.
Accordingly, increasing the length at test time without informing the model makes it impossible to
find the correct new probabilities and, therefore, to generalize to inputs of unseen lengths.
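This conflict can be made explicit by enumerating the palindromes of each length over Σ = {a, b} and computing the empirical next-token distributions; the small sketch below is illustrative only:

```python
from itertools import product

def next_token_distribution(length: int, prefix: str) -> dict:
    """Empirical P(next token | prefix) over all palindromes of a given length."""
    words = ("".join(w) for w in product("ab", repeat=length))
    palindromes = [w for w in words if w == w[::-1]]
    continuations = [w[len(prefix)] for w in palindromes if w.startswith(prefix)]
    return {t: continuations.count(t) / len(continuations) for t in "ab"}

print(next_token_distribution(2, "a"))  # {'a': 1.0, 'b': 0.0}
print(next_token_distribution(3, "a"))  # {'a': 0.5, 'b': 0.5}
```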
The second approach is to predict the set of possible next tokens and not their probabilities (Bhat-
tamishra et al., 2020; Suzgun et al., 2019a). In this case, the distribution of possible next tokens
does not change with the length, and increasing the length at test time is possible. However, for each
prefix, the target will consist of a set of tokens, which is incompatible with the (standard sequence)
prediction setting and thus of lesser interest to the wider machine learning community.
Our approach is similar to the second one, but we only consider prefixes for which there is only one
possible (deterministic) continuation. Thus, instead of language generation, we consider language
transduction: The inputs and outputs form two individual languages, and the goal is to learn the
deterministic function that maps a word from one to the other. The automata solving these tasks
are the classical automata from Fig. 1 but with outputs attached to each transition and are denoted
transducers. Therefore, the hierarchy that we consider is a hierarchy over transducers and not over
acceptors as detailed in Section 3. However, the two types of machines (acceptor automata and
transducer automata) are interchangeable for the purposes of our investigation as they are both
memory-augmented finite-state automata that reside on different levels of the Chomsky hierarchy
depending on their type of external memory access (none, a stack, or a tape).
A.2 MODELS
Here we describe our neural architectures in more detail. All models have a final linear readout layer
to produce the output logits. We do not use any regularization, e.g., weight decay, dropout (except for
the Transformer models), etc. For training of the memory-augmented models, we set the tape size to
256 and the stack size to 128 and manually increase their sizes when testing on longer sequences.
Thus, as these memory sizes are significantly larger than the training range N (see Section 4), the
finite memory models appear infinite for our neural architectures in the context of our experimental
evaluation. Moreover, increasing the memory size does not affect how the RNN controller takes
actions, since it only has access to a local view of the external memory (e.g., the top of the stack or
the current memory cell).
RNN A vanilla single-layer RNN (Elman, 1990) with hidden size of 256.
Stack-RNN A single-layer RNN controller of hidden size 256 with access to a differentiable
stack (Joulin & Mikolov, 2015). The controller can perform any linear combination of PUSH, POP,
and NO-OP on the stack, with action weights given by a softmax over a linear readout of the RNN
output. Each cell of the stack contains a real vector of dimension 8.
NDStack-RNN A single-layer RNN controller of hidden size 256 with access to a differentiable
nondeterministic stack. We consider the NDStack-RNN proposed by (DuSell & Chiang, 2020; 2022),
which simulates a nondeterministic stack via dynamic programming. Concretely, we use normalized
actions and nondeterministic state reading with 2 states, and we sweep over 2 and 4 symbols. We did
not use the unnormalized actions suggested by DuSell & Chiang (2022).
Tape-RNN A single-layer RNN controller of hidden size 256 with access to a differentiable tape,
inspired by the Baby-NTM architecture (Suzgun et al., 2019b). The controller can perform any linear
combination of WRITE-RIGHT, WRITE-LEFT, WRITE-STAY, JUMP-LEFT, and JUMP-RIGHT on the
tape, with action weights given by a softmax. The actions correspond to: writing at the current
position and moving to the right (WRITE-RIGHT), writing at the current position and moving to the
left (WRITE-LEFT), writing at the current position (WRITE-STAY), jumping ℓ steps to the right without
writing (JUMP-RIGHT), where ℓ is the length of the input, and jumping ℓ steps to the left without
writing (JUMP-LEFT). Finally, to allow the Tape-RNN to perform memory operations without having
to produce the output, we append “computation” tokens (different from the empty tokens ∅ used to
produce the output, see Appendix A.4 below) to the input sequence. As in the Stack-RNN, each cell
of the tape contains a real vector of dimension 8.
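A simplified sketch of the five tape actions on a hard (non-differentiable) tape, to make the action semantics concrete; the actual Tape-RNN applies a softmax-weighted mixture of these updates, and whether positions wrap around at the tape boundary is an assumption of this sketch:

```python
import numpy as np

def apply_tape_action(tape, head, value, action, input_length):
    """Returns the updated (tape, head) for one hard action; tape has shape (size, d)."""
    tape = tape.copy()
    if action in ("WRITE-LEFT", "WRITE-RIGHT", "WRITE-STAY"):
        tape[head] = value                       # all WRITE actions write at the current position
    if action == "WRITE-LEFT":
        head = (head - 1) % len(tape)            # then move one cell to the left
    elif action == "WRITE-RIGHT":
        head = (head + 1) % len(tape)            # then move one cell to the right
    elif action == "JUMP-LEFT":
        head = (head - input_length) % len(tape) # jump l steps left without writing
    elif action == "JUMP-RIGHT":
        head = (head + input_length) % len(tape) # jump l steps right without writing
    return tape, head

tape, head = np.zeros((16, 8)), 0
tape, head = apply_tape_action(tape, head, np.ones(8), "WRITE-RIGHT", input_length=4)
print(head, tape[0])  # head moved to 1; cell 0 now holds the written vector
```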
LSTM A vanilla single-layer LSTM (Hochreiter & Schmidhuber, 1997) of hidden size 256.
Stack-LSTM A single-layer LSTM controller of hidden size 256 with access to a differentiable
stack (Joulin & Mikolov, 2015). The only difference to the Stack-RNN is that this model uses an
LSTM as the controller.
Transformer encoder A vanilla Transformer encoder (Vaswani et al., 2017). We do not use
the autoregressive sequence-to-sequence model to produce the output but attend the whole input
(including the empty tokens ∅). We use five blocks with dmodel = 64, where each block is composed
of an attention layer, two dense layers, and a layer normalization. We add residual connections as
in the original architecture (Vaswani et al., 2017). We consider five different positional encodings:
none, classical sin/cos (Vaswani et al., 2017), RoPE (Su et al., 2021), ALiBi (Press et al., 2021), and
the relative positional encoding from Transformer-XL (Dai et al., 2019).
Transformer (autoregressive) A vanilla Transformer (Vaswani et al., 2017). We do not use empty
tokens ∅ anymore, and the output is produced via the decoder (using causal masking). We use the
same hyperparameters and positional encodings as for the Transformer encoder above.
A.3 TASKS
Here, we describe our tasks (listed in Table A.1) in more detail. As mentioned in Section 4, all tasks
consist of an input sequence x ∈ LI from which the models have to predict the target sequence
y ∈ LO . We append |y| empty tokens ∅ to the input sequence, such that the models can process the
entire input sequence before having to produce the output (i.e., they know the input length before
having to produce the output). Note that half of our tasks have an output length of 1 and are thus just
standard classification tasks with the set of classes being the alphabet of the output language. The
networks still see one empty token ∅ at the end of the input string to signify that the input phase has
finished and that the output must be computed. We do not consider NDCF tasks since the distinction
between DCF and NDCF is fairly technical and does not fundamentally change the type of memory
structure required (i.e., a stack).
Even Pairs (R) Given a binary sequence, e.g., x = aabba, compute whether the total number of ab and ba substrings is even. For aabba, we have one ab and one ba, so the total number is 2 and thus even, i.e., y = b (where we arbitrarily equate odd with a and even with b). This task is equivalent to computing
whether the first and last character of the string are equal. Therefore, the task is regular since we can
solve it with a 2-state finite-state machine that looks at the first bit of the sequence and compares it
with the last one.
Modular Arithmetic (Simple) (R) Given a sequence of numbers in {0, 1, 2, 3, 4} and opera-
tions in {+, −, ·}, compute the result modulo 5. For example, x = 1 + 2 − 4 evaluates to y = 4.
Note that the length of the input sequences must be odd. Therefore, if we sample an even length during the training/evaluation process, we return an input sequence of this length plus 1. This task
is regular since we can solve it with a 5-state finite-state machine, which transitions for each new
operation with the state being equal to the current result.
Parity Check (R) Given a binary string, e.g., aaabba, compute if the number of bs is even. The
sequence x = aaabba contains 2 bs, which is even, i.e., y = b (where we arbitrarily equate odd with
a and even with b). This task is regular since we can solve it with a 2-state finite-state machine, which
transitions every time it sees a token different from its current state.
Cycle Navigation (R) Given a sequence of movements on a cycle of length 5, compute the end
position. The movements are STAY, INCREASE, DECREASE and are represented as {0, 1, 2}.
The agent always starts at position 0. For example, 010211 means the agent stops at position
0 + 0 + 1 + 0 − 1 + 1 + 1 = 2, and the output class is 2. This task is regular since we can solve
it with a 5-state finite-state machine, similar to Modular Arithmetic (Simple) (R), as the task
can be converted into a sum modulo 5.
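For concreteness, minimal reference implementations of the ground-truth functions for three of these regular tasks might look as follows (plain Python, illustrative only; the token encodings in our actual generators may differ).

```python
def even_pairs(s: str) -> str:
    """'b' iff the number of 'ab' and 'ba' substrings is even,
    which is equivalent to the first and last characters being equal."""
    return 'b' if s[0] == s[-1] else 'a'

def parity_check(s: str) -> str:
    """'b' iff the number of 'b's is even."""
    return 'b' if s.count('b') % 2 == 0 else 'a'

def cycle_navigation(moves: str) -> int:
    """Final position on a cycle of length 5; 0 = STAY, 1 = INCREASE, 2 = DECREASE."""
    deltas = {'0': 0, '1': 1, '2': -1}
    return sum(deltas[m] for m in moves) % 5

# Examples from the text.
assert even_pairs('aabba') == 'b'
assert parity_check('aaabba') == 'b'
assert cycle_navigation('010211') == 2
```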
Modular Arithmetic (DCF) Given a sequence of numbers in {0, 1, 2, 3, 4}, brackets, and opera-
tions in {+, −, ·}, compute the result modulo 5. For example, the sequence x = −(1−2)·(4−3·(−2))
evaluates to y = 0. Note that we do not have the same problem of odd lengths as in Modular
Arithmetic (Simple) (R), since we can use substrings of the form (−z) with z ∈ {0, 1, 2, 3, 4},
which have length 4 and thus can be used to produce input strings of even length. The task is DCF
since we can solve it with a 5-state finite-state machine and an external stack. That is, similar to
the Modular Arithmetic (Simple) (R) task, the finite-state machine transitions for each new
operation, with the state being equal to the current result. Moreover, the stack is used to handle
brackets: The machine pushes if the current input token is an opening bracket, pops if it is a closing
bracket, and uses no-ops to let the finite-state machine work otherwise.
Reverse String (DCF) Given a binary string, e.g., x = aabba, compute its reverse, i.e., y =
abbaa. The task is DCF since we can solve it by pushing the tokens of the input sequence on a stack
until the first empty token ∅ appears and then popping the tokens from the stack, outputting the
values one by one. This task cannot be solved with a finite-state machine since the number of possible
outputs is infinite.
Solve Equation (DCF) Given an equation consisting of numbers in {0, 1, 2, 3, 4}, brackets,
operations in {+, −, ·}, and an unknown variable z, compute the value of z such that the equation
holds modulo 5. For example, x = −(z − 2) · (4 − 3 · (−2)) = 0 modulo 5 holds for z = 1,
i.e., y = 1. This task is DCF since we can solve it by simulating all possible values of z in a nondeterministic finite-state machine and producing each result using the same algorithm as in
Modular Arithmetic (DCF). Note that a nondeterministic finite-state machine can be simulated
with a deterministic one, implying that this task is DCF and not NDCF.
Stack Manipulation (DCF) Given a binary string representing a stack’s content (given bottom-
to-top) and a sequence of PUSH a, PUSH b, or POP actions, execute the actions on the stack and return
the final stack content (in a top-to-bottom manner, i.e., as if popping the resulting stack). For example,
executing the sequence POP PUSH a POP on the stack abbaa, i.e., x = abbaa POP PUSH a POP,
results in y = abba. If a POP action is called on an empty stack, the action is ignored. This task
requires stack manipulation by design and is therefore DCF.
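A reference implementation of this target function, directly mirroring the description above (illustrative; the token encoding of the PUSH/POP actions in our generator may differ):

```python
def stack_manipulation(initial_stack: str, actions: list[str]) -> str:
    """Ground truth for Stack Manipulation: `initial_stack` is given bottom-to-top,
    and the result is returned top-to-bottom (as if popping the final stack)."""
    stack = list(initial_stack)           # index -1 is the top of the stack
    for action in actions:
        if action == 'POP':
            if stack:                     # POP on an empty stack is ignored
                stack.pop()
        else:                             # 'PUSH a' or 'PUSH b'
            stack.append(action.split()[1])
    return ''.join(reversed(stack))

# Example from the text: x = abbaa POP PUSH a POP  ->  y = abba
assert stack_manipulation('abbaa', ['POP', 'PUSH a', 'POP']) == 'abba'
```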
Binary Addition (CS) Given two binary numbers, e.g., 10010 and 101, compute their sum in
base 2, i.e., y = 10111 for x = 10010 + 101. The inputs and outputs are given in little-endian
representation. The task is CS since it exhibits cross-serial dependencies because the two numbers
are provided one after the other (and not bit-by-bit at the same time).
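As a hedged illustration of the little-endian convention, a reference adder could look as follows; the example strings below are the text's 10010 and 101 rewritten least-significant-bit first, and the function is illustrative only.

```python
def binary_addition_little_endian(a: str, b: str) -> str:
    """Add two binary numbers given as little-endian bit strings
    (least significant bit first) and return the little-endian sum."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, '0'), b.ljust(n, '0')
    result, carry = [], 0
    for bit_a, bit_b in zip(a, b):
        total = int(bit_a) + int(bit_b) + carry
        result.append(str(total % 2))
        carry = total // 2
    if carry:
        result.append('1')
    return ''.join(result)

# 18 + 5 = 23: '01001' and '101' are 10010 and 101 read least-significant-bit first.
assert binary_addition_little_endian('01001', '101') == '11101'  # 10111 reversed
```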
Binary Multiplication (CS) Given two binary numbers, e.g., 100 and 10110, compute their
product in base 2, i.e., y = 1011000 for x = 100 ∗ 10110. The inputs and outputs are given in
little-endian representation. The task is CS since it exhibits cross-serial dependencies because the two numbers are provided one after the other (and not bit-by-bit at the same time).
Table A.1: Our tasks with their level in the Chomsky hierarchy and example input/output pairs. The † denotes permutation-invariant tasks; the ? denotes counting tasks; the ◦ denotes tasks that require a nondeterministic controller; and the × denotes tasks that require superlinear running time in terms of the input length.
Compute Sqrt (CS) Given a binary number, e.g., 101001, compute the floor of its square root, i.e., y = ⌊√101001⌋ = 101. The inputs and outputs are given in little-endian representation. This task is CS since the digit-by-digit algorithms to compute the integer square root require many sequence manipulations, including binary multiplications, which is a CS task.
Duplicate String (CS) Given a binary string, e.g., x = abaab, output the string twice, i.e., y =
abaababaab. This task is CS since it corresponds to the well-known language {ww | w is a word}
(cf. Section 3), which is CS.
Missing Duplicate (CS) The input is a binary string of the form ww where w is itself a binary
string. One token in this string has been hidden, and the network must find out which one it is,
by looking at the value at the same position but on the other side of the string. For instance, if
x = ab_aba (i.e., w = aba), then y = a. This task is CS since it requires the recognition of the
well-known language {ww | w is a word} (cf. Section 3), which is CS.
Odds First (CS) Given a binary string t1 ...tn , output t1 t3 t5 ...t2 t4 t6 .... For example, the output
corresponding to x = aaabaa is y = aaaaba. The task is CS because it exhibits cross-serial
dependencies (the output is the interleaved input).
Bucket Sort (CS) Given a string over an alphabet of fixed size (5 in our case), return the sorted
string. Since the alphabet has a fixed size, the task can be solved via bucket sort, which only requires
a finite amount of counters (i.e., 5 counters in our case). For example, the output corresponding to
x = 421302214 is y = 011222344. The task is CS because it requires multiple counters that keep
track of the number of occurrences for every token.
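A reference implementation of this target function using one counter per symbol (illustrative; our generator may encode tokens differently):

```python
def bucket_sort(s: str, alphabet_size: int = 5) -> str:
    """Sort a string over a fixed-size digit alphabet with one counter per symbol."""
    counts = [0] * alphabet_size
    for token in s:
        counts[int(token)] += 1
    return ''.join(str(symbol) * count for symbol, count in enumerate(counts))

# Example from the text.
assert bucket_sort('421302214') == '011222344'
```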
A.4 EXPERIMENTAL SETUP
We now provide the full formalization of our problem setup (described in Section 4). Algorithm A.1
illustrates our training procedure. We train on sequences of length ℓ ∼ U(1, N), with N = 40 (using
batches of size 128). For testing, we consider sequences of length sampled from U(N + 1, M ), with
M = 500. In general, we report the score, which we define as the per-sequence accuracy A (see Section 4) averaged over all sequences of unseen length, i.e., score := 1/(M − N) · Σ_{ℓ=N+1}^{M} A(x, y). We
use the Adam optimizer (Kingma & Ba, 2015) with default hyperparameters for 1 000 000 steps,
which we have found to be sufficient to achieve a near-optimal training performance for all tasks
and architectures. As described in Algorithm A.1, we augment our input sequences x with m empty
tokens ∅, where m is the length of the corresponding target sequence y, so that the networks only
need to start outputting after they have consumed the entire input sequence. Moreover, we add
computation tokens (which are different from the empty tokens ∅) to the input sequence x for the
Tape-RNN to enable it to perform memory operations without having to output the result. In general,
we run all experiments with 10 different random seeds (used for network parameter initialization)
and three learning rates (1 × 10−4 , 3 × 10−4 and 5 × 10−4 ), and we report the result obtained by the
hyperparameters with the maximum score (means and standard deviations in Appendix B). However,
for Transformers, NDStack-RNNs, and Tape-RNNs we sweep over additional hyperparameters: For
Transformers we consider five different positional encodings, for the NDStack-RNN we optionally
read from internal states of the nondeterministic stack (one of the improvements proposed by DuSell
& Chiang (2022)) and sweep over 2 and 4 symbols, and for the Tape-RNN we consider various
numbers of computation tokens (0, ℓ, or 2ℓ), and various numbers of tapes (1 and 4).
Algorithm A.1: Training pipeline for our sequence prediction tasks. The comments (in blue)
show an example output for the Reverse String (DCF) task.
Input: model p_θ(·|x) with parameters θ, learning rate α, number of training steps S
1 Initialize parameters θ
2 for i ← 1 to S do
3     Sample length ℓ from U(1, 40)    /* ℓ = 3 */
4     Sample sequence x of length ℓ from the task's input grammar    /* x = 011 for ℓ = 3 */
5     Compute the corresponding output sequence y of length m    /* y = 110 with m = 3 */
6     Pad x with m empty tokens ∅ such that the model only needs to start outputting after having consumed the entire input sequence, i.e., x ← x · ∅ . . . ∅    /* x = 011∅∅∅ */
7     Set the output to the empty sequence of length m, i.e., o ← ∅ . . . ∅
8     for t ← 1 to m do
9         Compute the output probability distribution o_t ← p_θ(·|x_{1:ℓ+t})    /* o_t ∈ R^2 */
10    end
11    Compute the cross-entropy loss (averaged over output tokens) C ← −(1/m) Σ_{t=1}^{m} y_t^⊤ log o_t
12    Update the parameters with gradient descent θ ← θ − α∇C
13    Compute the per-sequence accuracy A ← (1/m) Σ_{t=1}^{m} 1[arg max_j y_{t,j} = arg max_j o_{t,j}]
14 end
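For readers who prefer code, one training step of the loop above might be condensed into something like the following Python/JAX sketch; model_apply, optimizer_update, and sample_task_example are hypothetical placeholders, and the released implementation (built on the DeepMind JAX ecosystem) is structured differently.

```python
import jax
import jax.numpy as jnp

def train_step(params, opt_state, rng, model_apply, optimizer_update,
               sample_task_example, max_train_length=40):
    """One step of Algorithm A.1 (sketch with placeholder components)."""
    rng_len, rng_seq = jax.random.split(rng)
    # Sample a length ~ U(1, 40) and a corresponding one-hot input/target pair.
    length = int(jax.random.randint(rng_len, (), 1, max_train_length + 1))
    x, y = sample_task_example(rng_seq, length)     # x: [length, vocab], y: [m, out_vocab]
    m = y.shape[0]
    # Pad the input with m empty tokens so the model sees the full input first.
    empty = jnp.zeros((m, x.shape[1]))
    x_padded = jnp.concatenate([x, empty], axis=0)

    def loss_fn(p):
        logits = model_apply(p, x_padded)            # [length + m, out_vocab]
        log_probs = jax.nn.log_softmax(logits[-m:])  # predictions for the last m positions
        return -jnp.mean(jnp.sum(y * log_probs, axis=-1))  # token-averaged cross-entropy

    loss, grads = jax.value_and_grad(loss_fn)(params)
    params, opt_state = optimizer_update(params, grads, opt_state)
    # Per-sequence accuracy: fraction of output positions predicted correctly.
    preds = jnp.argmax(model_apply(params, x_padded)[-m:], axis=-1)
    accuracy = jnp.mean(preds == jnp.argmax(y, axis=-1))
    return params, opt_state, loss, accuracy
```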
We implemented our evaluation suite in JAX (Bradbury et al., 2018) using the DeepMind JAX ecosys-
tem (Babuschkin et al., 2020; Hessel et al., 2020; Hennigan et al., 2020). We make all of our code pub-
licly available at https://github.com/deepmind/neural_networks_chomsky_hierarchy.
We ran each task-architecture-hyperparameter triplet on a single TPU on our internal cluster. For
RNNs, Stack-RNNs, LSTMs and Stack-LSTMs, the hyperparameters we have used are learning
rates and seeds, which gives 15 (tasks) · 4 (models) · 3 (learning rates) · 10 (seeds) = 1800
TPU-units. For Tape-RNNs, the hyperparameters we have used are the learning rate, the num-
ber of computation tokens, the number of tapes, and the random seeds, which yields
15 (tasks) · 3 (learning rates) · 3 (computation steps) · 2 (tapes) · 10 (seeds) = 2700 TPU-
units. For Transformers (encoder and autoregressive), the hyperparameters we have used are the
learning rate, the positional encodings, and the random seeds, which gives 15 (tasks) · 2 (models) ·
5 (positional encodings) · 3 (learning rates) · 10 (seeds) = 4500 TPU-units.
Table A.2: Mean and standard deviation of the running times (in hours) for all the architectures and
tasks. The † denotes permutation-invariant tasks; the ? denotes counting tasks; the ◦ denotes tasks
that require a nondeterministic controller; and the × denotes tasks that require superlinear running
time in terms of the input length.
Level Task RNN Stack-RNN NDStack-RNN Tape-RNN Transformer LSTM Stack-LSTM Autoreg. Transformer
Even Pairs 0.37 ± 0.01 0.56 ± 0.01 8.50 ± 0.16 3.31 ± 2.41 1.19 ± 0.73 0.75 ± 0.03 2.97 ± 0.01 1.61 ± 0.27
Modular Arithmetic (Simple) 0.57 ± 0.04 0.92 ± 0.06 8.34 ± 0.19 3.44 ± 2.48 1.31 ± 0.55 0.62 ± 0.00 3.05 ± 0.01 1.51 ± 0.24
R
Parity Check† 0.59 ± 0.02 0.78 ± 0.05 8.54 ± 0.12 3.37 ± 2.32 1.23 ± 0.58 0.68 ± 0.06 2.96 ± 0.01 1.61 ± 0.27
Cycle Navigation† 0.60 ± 0.03 0.57 ± 0.07 8.42 ± 0.23 3.26 ± 2.25 1.23 ± 0.79 0.76 ± 0.03 2.97 ± 0.01 1.66 ± 0.45
Stack Manipulation 4.07 ± 0.12 4.69 ± 0.85 17.80 ± 1.43 8.27 ± 2.93 5.06 ± 0.65 4.27 ± 0.10 8.89 ± 0.04 7.25 ± 2.24
Reverse String 0.50 ± 0.02 0.88 ± 0.11 16.66 ± 0.99 4.58 ± 2.88 1.62 ± 0.78 0.70 ± 0.03 5.48 ± 0.01 4.32 ± 2.03
DCF
Modular Arithmetic 2.70 ± 0.08 3.11 ± 0.95 9.04 ± 0.78 5.44 ± 2.16 3.59 ± 0.65 3.23 ± 0.72 5.21 ± 0.04 3.99 ± 0.47
Solve Equation◦ 2.82 ± 0.06 3.24 ± 0.79 9.16 ± 0.64 5.69 ± 2.30 3.59 ± 0.49 2.96 ± 0.10 5.25 ± 0.05 3.91 ± 0.28
Duplicate String 0.57 ± 0.05 1.33 ± 0.13 25.25 ± 2.31 5.90 ± 3.66 2.66 ± 0.76 0.92 ± 0.05 7.92 ± 0.02 5.02 ± 0.93
Missing Duplicate 0.49 ± 0.04 0.71 ± 0.04 8.49 ± 0.27 3.31 ± 2.16 1.24 ± 0.35 0.60 ± 0.02 3.05 ± 0.01 1.72 ± 0.30
Odds First 0.55 ± 0.04 0.99 ± 0.05 16.63 ± 1.34 4.49 ± 2.81 1.67 ± 0.38 0.80 ± 0.04 5.48 ± 0.01 4.57 ± 1.72
CS Binary Addition 1.33 ± 0.05 1.35 ± 0.04 17.36 ± 0.45 4.69 ± 2.94 2.29 ± 0.75 1.35 ± 0.05 5.62 ± 0.01 5.14 ± 1.48
Binary Multiplication× 1.35 ± 0.06 1.33 ± 0.04 17.04 ± 0.76 4.53 ± 2.76 2.06 ± 0.37 1.34 ± 0.04 5.49 ± 0.01 4.64 ± 1.95
Compute Sqrt 0.95 ± 0.05 1.05 ± 0.05 12.62 ± 1.21 3.82 ± 2.45 1.57 ± 0.34 0.97 ± 0.04 4.19 ± 0.01 3.30 ± 1.27
Bucket Sort†F 0.58 ± 0.09 0.98 ± 0.08 16.75 ± 2.01 3.54 ± 2.24 1.34 ± 0.76 0.79 ± 0.09 5.64 ± 0.05 4.52 ± 1.77
For the NDStack-RNN, the hyperparameters we have used are the learning rate, the number of stack symbols, and the
random seeds, which gives 15 (tasks) · 2 (number of symbols) · 3 (learning rates) · 10 (seeds) =
900 TPU-units. Moreover, we used an additional 300 TPU-units for the phase transition ex-
periment (Appendix B.2). For the CNN experiment, we used 15 (tasks) · 4 (kernel sizes) ·
3 (learning rates) · 10 (seeds) = 1800 TPU-units. For the scaling laws experiment, we used
15 (tasks) · 3 (training steps) · 5 (num layers) · 3 (learning rates) · 10 (seeds) = 6750 TPU-
units. Finally, for the autoregressive experiments (excluding the autoregressive Transformer), we
used 8 (tasks) · 3 (learning rates) · 10 (seeds) = 240 TPU-units for the RNNs, Stack-RNNs, and
LSTMs, i.e., 720 TPU-units, and 8 (tasks) · 3 (computation steps) · 2 (tapes) · 3 (learning rates) ·
10 (seeds) = 1440 TPU-units for the Tape-RNNs. Thus, in total we used 20 910 TPU-units. We
report all the running times, for each task and architecture, in Table A.2. Note that we aggregated the
results over the hyperparameters, which explains the high standard deviations for Transformers (five different positional encodings) and the Tape-RNN (three different numbers of computation tokens and two different numbers of tapes).
The NDStack-RNN is comparatively slow to train: it is roughly four times slower than a classical Stack-RNN, due to the computation of the large tensor γ, which denotes the transition weights between nodes. Furthermore, this tensor takes a lot of space in memory, i.e., the number of values it contains is, once unrolled over the whole trajectory, proportional to (l + m)^3, where l and m are the lengths of the input and output sequences, respectively. Thus, for our training range of 40 and a task where the output length is equal to the input length (e.g., Reverse String (DCF)), the required space is proportional to (40 + 40)^3 = 512 000. Therefore, with a batch size of 128 and 16 possible actions (i.e., state-symbol pairs), the tensor is roughly of size 33.5 GB in float32 representation, which exceeds the memory size of our GPUs. Consequently, we reduced the batch size to 16 (i.e., 8 times smaller than normal) for this architecture.
B ADDITIONAL EXPERIMENTS
B.1 MAIN RESULT
We visualize the performance curves for all tasks in Fig. B.6 and report the means and standard
deviations of the test accuracies for all architectures on our tasks in Table B.1. We observe that all the
networks, except for Transformers, solve all regular tasks perfectly for all seeds. However, only a few tasks beyond the regular level are solved on average (where we define solved as a score greater than or equal to 90%), but the hierarchy still persists. For example, the Tape-RNN has much better scores on the CS tasks, even on average, compared to the other architectures. However, even though the Tape-RNN can solve more tasks, it still fails on some of them. We hypothesize that this is because the limited set of actions makes the correct action trajectory long and difficult to find. Nevertheless, the Tape-RNN is the closest we have to an architecture capable of solving all tasks from regular to CS. Note that networks may be evaluated on unseen sequences even within the training range, as we are sampling from a training distribution and not a finite dataset (as discussed in Section 4). Thus, it is possible that some models do not achieve perfect accuracy within the training range U(1, N), such as in the Compute Sqrt (CS) task (cf. Fig. B.6).
Table B.1: Means and standard deviations (computed over random seeds) of the score (average test
accuracy, see Section 4) for the results of the main experiment (see Table 2 and Section 5.1). The †
denotes permutation-invariant tasks; the ? denotes counting tasks; the ◦ denotes tasks that require a
nondeterministic controller; and the × denotes tasks that require superlinear running time in terms of
the input length.
Stack Manipulation 54.5 ± 1.1 93.8 ± 15.1 66.9 ± 22.6 51.1 ± 8.1 58.1 ± 0.7
Reverse String 60.4 ± 2.3 96.2 ± 12.1 90.7 ± 15.6 56.1 ± 2.7 59.5 ± 1.0
DCF
Modular Arithmetic 34.6 ± 4.3 75.8 ± 11.3 60.9 ± 20.7 28.2 ± 3.4 53.1 ± 4.3
Solve Equation◦ 31.3 ± 8.7 46.1 ± 7.3 37.7 ± 12.7 23.5 ± 1.6 37.1 ± 14.0
Duplicate String 50.2 ± 0.0 51.6 ± 0.6 90.0 ± 16.3 52.8 ± 0.0 56.5 ± 0.9
Missing Duplicate 51.7 ± 0.2 53.6 ± 0.7 66.1 ± 23.3 53.8 ± 1.5 53.1 ± 0.6
Odds First 50.4 ± 0.2 51.5 ± 0.1 76.2 ± 16.6 52.7 ± 0.0 54.6 ± 0.8
CS Binary Addition 49.5 ± 0.3 50.9 ± 0.9 74.3 ± 10.6 51.7 ± 1.3 54.8 ± 0.4
Binary Multiplication× 49.6 ± 0.2 51.4 ± 0.7 54.5 ± 2.2 50.4 ± 3.6 52.9 ± 0.2
Compute Sqrt 53.9 ± 0.3 56.1 ± 0.3 55.6 ± 1.4 51.5 ± 0.4 57.3 ± 0.2
Bucket Sort†F 22.5 ± 3.6 56.6 ± 16.2 33.6 ± 12.8 35.7 ± 23.1 79.4 ± 21.9
(a) Maximum accuracy (over 10 different random seeds) per length for different training ranges (in color). (b) Score per training range (min, max, and mean taken over 10 random seeds).
Figure B.1: Phase transition effect for the Stack-RNN on Reverse String (DCF). For training
ranges N < 10, the model fails to generalize to longer sequences as it memorizes the training
distribution. The score in Fig. B.1b is the accuracy shown in Fig. B.1a averaged over sequence
lengths (x-axis).
In this section, we investigate how much training data is needed to learn the data-generating grammars.
For certain tasks and architectures we observe a substantial “phase transition” of the test accuracy for
different training ranges. For example, Fig. B.1 shows the generalization performance of Stack-RNNs
on the Reverse String (DCF) task for different training ranges. Clearly, the models fail to learn the
data-generating algorithm when only exposed to sequences of length smaller than 10. We hypothesize
that for small training ranges the networks overfit to the training data, essentially learning a lookup
table to recognize the small finite language. In contrast, larger training ranges encourage the networks
to learn the data-generating algorithm that generalizes to sequences of unseen length.
Top row: probability of each stack action (PUSH, POP, NO-OP) over sequence timesteps. Bottom row: PCA of the internal states.
(a) Stack-RNN on Parity Check (R). (b) Stack-RNN on Reverse String (DCF). (c) Stack-RNN on Bucket Sort (CS).
Figure B.2: Analysis of the Stack-RNN on the Parity Check (R) and Reverse String (DCF)
tasks. Left: The network does not use the stack on Parity Check and has the same state structure
as a classical RNN. Right: The network uses the stack on Reverse String and the PCA of the
states shows that they are clustered according to the last input token only.
Interpretation of the learned algorithm On the first layer, the model uses only the last empty
token and matches it with the tokens 0 and 2 (clearly seen on matrices (1, 1) and (1, 2)). Note that
the task is to count the number of 0s, the number of 1s, and to do a subtraction modulo 5. This first
layer is able to separate the different tokens; the output of the last token for each head is a weighted
sum (softmax) of a single value (the token which is attended to), and the weights are all equal to a
constant, which means the output of the last token is just the value of the token which is attended to.
Then the other tokens have a constant weight over the whole sequence, which means that the output is a weighted sum of all the tokens' values, with the weights also being constant and proportional to the number of occurrences of each token. This then allows the network to compute the occurrence count of each token individually in the linear layer that merges the heads. Fig. B.4 shows a scatter plot of the activations of the network after the first layer for sequences of length 20, which confirms this claim.
Figure B.3: Analysis of the Transformer attention matrices on the Cycle Navigation (R) task, for an input sequence of length 8, used as ticks on the x-axis.
Failure mode for long sequences The counting mechanism, which consists in having constant weights over the whole sequence, does not return true counts but frequencies, since they are multiplied by 1/N, where N is the sequence length (because we are using a softmax). This explains why the network can generalize reasonably well to longer lengths, but not perfectly. In terms of Fig. B.4, this means that the structure we see is preserved for longer sequences, but the points move closer and closer together, which makes further separation into classes more difficult. Fig. B.5 shows the evolution of the activations at the last layer for increasingly long sequences. The classification becomes more challenging because the input activations no longer form 5 clear clusters.
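A small numerical illustration of this point (illustrative Python/JAX, not taken from our codebase): with constant attention weights, the head output is the vector of token frequencies rather than counts, so two inputs whose counts differ by a fixed amount become harder to separate as the length grows.

```python
import jax
import jax.numpy as jnp

def uniform_attention_readout(tokens, num_symbols=3):
    """Output of a head with constant (uniform) attention weights over one-hot
    token values: the vector of token frequencies, not counts."""
    one_hot = jax.nn.one_hot(tokens, num_symbols)                # [N, num_symbols]
    weights = jnp.full((tokens.shape[0],), 1.0 / tokens.shape[0])
    return weights @ one_hot                                      # [num_symbols]

# Two Cycle Navigation inputs of the same length with different answers
# (final positions 1 and 2): their count difference is fixed (one extra INCREASE),
# but the gap between their frequency readouts shrinks roughly like 1/N.
for n in (10, 100, 1000):
    a = jnp.array([1] * 1 + [0] * (n - 1))   # one INCREASE, rest STAY -> position 1
    b = jnp.array([1] * 2 + [0] * (n - 2))   # two INCREASEs, rest STAY -> position 2
    gap = jnp.linalg.norm(uniform_attention_readout(a) - uniform_attention_readout(b))
    print(n, float(gap))                     # the gap decays as the length grows
```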
Bansal et al. (2022) showed that fully convolutional networks achieve near-perfect length generalization on the prefix sum task. At the same time, Merrill (2019) theoretically showed that one-layer convolutional networks with max-pooling activation functions are incapable of recognizing all regular languages. Therefore, we investigate the length generalization performance of a simple CNN on our benchmark and visualize the maximum and average accuracies over unseen test lengths in Tables B.2 and B.3, respectively. We observe that our simple architecture with kernel sizes in {3, 5, 7, 11} (see Appendix A.2) generally fails to solve even regular tasks, achieving the random baseline accuracy in most cases. This confirms the theoretical derivations by Merrill (2019) but contrasts with the results presented by Bansal et al. (2022). However, the convolutional architecture that Bansal et al. (2022) consider relies on an adaptive number of layers depending on the sequence length (similar to adaptive computation time (Graves, 2016)) and thus crucially differs from a classical CNN (such as the one we study).
Figure B.4: Analysis of the Transformer activations after the first layer on the Cycle Navigation
(R) task, for an input sequence of length 8. We report only the first two principal components of 512
sequences. We clearly see that the network learns to count the number of 0s, 1s and 2s and already
separates the output classes very well after this first layer.
Figure B.5: Analysis of the Transformer activations after the last layer, before the final linear logit
layer, on the Cycle Navigation (R) task, for an input sequence of length 8. We report only the
first two principal components of 512 sequences. We see that the activations form 5 clear clusters for
short sequences (length 20), and slowly decay into the center, mixing the clusters together.
We evaluate the experiment suite from Section 5.1 for the Stack-LSTMs and compare their performance with that of Stack-RNNs in Table B.2. We observe that Stack-LSTMs and Stack-RNNs perform roughly equally well when considering the score (maximum accuracy over the 10 initialization seeds), which is not surprising given that both networks are theoretically located on the same level of the Chomsky hierarchy. However, when considering the average, Stack-LSTMs perform worse, i.e., our training protocol (gradient descent) is less stable for this architecture, confirming the results of prior work (Hao et al., 2018). We hypothesize that this is due to the fact that LSTMs are more powerful than RNNs (as they can implement k-counter machines) and thus do not always simulate a finite-state automaton. This can be an advantage for more difficult tasks, e.g., counting tasks such as Bucket Sort (CS), but also a disadvantage for simpler tasks that only need a finite-state automaton controller to interact with the memory structure. If the controller is not a finite-state automaton, its internal state is not drawn from a finite set and therefore does not interact well with the stack content: the possible interactions are potentially infinite, and the network does not generalize well beyond the training range.
We also compared the Stack-RNN and the Stack-LSTM to their nondeterministic counterpart, the NDStack-RNN (Nondeterministic Stack-RNN) (DuSell & Chiang, 2020; 2022). We observe that this architecture performs well on regular tasks, like the other memory-augmented RNNs. However, it fails on most tasks beyond regular, potentially due to its large action space (24 actions) and the difficulty of training a tensor that tracks all possible stack values. We also do not use unnormalized actions, since this requires computing all the probabilities in log-space, and unfortunately JAX (Bradbury et al., 2018) does not support complex tensor multiplications like einsum in log-space. Surprisingly, the architecture solves the Missing Duplicate (CS) task perfectly. We ran some ablations and reduced the stack size to only 1: the model still solves the task perfectly. Moreover, the actions the model takes on the stack are not binary, leading to mixed computations. We conjecture that the network uses this mixed embedding of values on the stack (reduced to a single cell) efficiently as a memory, on top of the memory of the RNN controller.
In this paper, we chose to predict the output sequences from empty tokens, but we could also have chosen to predict the sequences autoregressively. Here, we conduct the same set of experiments in the autoregressive setting to confirm that this aspect of our experimental setup does not affect the architectures' positions in the Chomsky hierarchy. To that end, we use the true output sequence as the input to the model at train time and autoregressive sampling at test time, i.e., we use the predicted output token as the input for the next token until we reach the desired length (more details below for the different architectures). We show the results in Table B.4, with the corresponding means and standard deviations in Table B.5.
For the (memory-augmented) RNNs, i.e., the RNN, LSTM, Stack-RNN, and Tape-RNN, we replace the empty tokens ∅ with the true output sequence during training. However, to avoid predicting the input, we append an extra empty token between the input sequence (which may contain computation tokens) and the true output sequence. We compute the loss with an offset of 1, so the network has to predict the first output token at the position of the extra empty token, the second output token at the position of the first output token (now passed as input), and so on. We observe in Table B.4 that the models trained with this autoregressive scheme perform very similarly to the ones trained with the empty-token scheme (see Table 2 for comparison). That is, this part of the training setup does not affect the length generalization scores and therefore does not change where the networks lie in the Chomsky hierarchy.
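Schematically, the two input constructions differ as in the following sketch (illustrative Python/JAX; the names are placeholders and the real data pipeline is more involved).

```python
import jax.numpy as jnp

def autoregressive_inputs(x, y, empty_token, train: bool):
    """Build the model input for the autoregressive setting (sketch).

    x: [ell, vocab] input sequence (possibly with computation tokens appended).
    y: [m, vocab] true output sequence, one-hot encoded.
    At train time we teacher-force: the true outputs, shifted by one and preceded
    by a single extra empty token, are fed as inputs, and the loss is computed
    with an offset of 1. At test time the true outputs are unavailable, so the
    model's own predictions are fed back one token at a time until m tokens
    have been produced.
    """
    if train:
        return jnp.concatenate([x, empty_token[None, :], y[:-1]], axis=0)
    # At test time only the extra empty token is appended; decoding is a loop:
    #   for t in range(m):
    #       logits = model_apply(params, inputs)
    #       next_token = one_hot(argmax(logits[-1]))
    #       inputs = concatenate([inputs, next_token[None, :]])
    return jnp.concatenate([x, empty_token[None, :]], axis=0)
```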
Table B.2: Score (in percentage, see Section 4), i.e., accuracy averaged over all test lengths and
maximized over 10 random seeds (and other hyperparameters, see Appendix A), for the CNN, Stack-
LSTM, and NDStack-RNN. The results for the Stack-RNN are identical to Table 2 and are included
only for easier comparison with the Stack-LSTM. We consider a model to generalize successfully
(bold) if its score ≥ 90%. The random accuracy is 50% (except for Cycle Navigation (R),
Bucket Sort (CS), and the two modular arithmetic tasks where it is 20%). We denote permutation-
invariant tasks with †, counting tasks with ?, tasks requiring a nondeterministic controller with ◦, and
tasks requiring superlinear running time with ×.
B.7.2 TRANSFORMER
Table B.3: Means and standard deviations (computed over random seeds) of the score (average
test accuracy, see Section 4) for the results of Table B.2. The results for the Stack-RNN are
identical to Table B.1 and are included only for easier comparison with the Stack-LSTM. The †
denotes permutation-invariant tasks; the ? denotes counting tasks; the ◦ denotes tasks that require a
nondeterministic controller; and the × denotes tasks that require superlinear running time in terms of
the input length.
Stack Manipulation 52.4 ± 0.0 93.8 ± 15.1 66.2 ± 17.4 55.1 ± 2.5
Reverse String 53.0 ± 0.1 96.2 ± 12.1 94.0 ± 12.4 61.6 ± 0.6
DCF
Modular Arithmetic 30.7 ± 0.5 75.8 ± 11.3 51.2 ± 5.0 28.4 ± 4.1
Solve Equation◦ 19.9 ± 0.1 46.1 ± 7.3 42.7 ± 16.0 34.4 ± 9.7
Duplicate String 50.0 ± 0.0 51.6 ± 0.6 56.3 ± 1.3 50.6 ± 0.2
Missing Duplicate 50.6 ± 0.1 53.6 ± 0.7 53.3 ± 0.7 82.0 ± 23.3
Odds First 50.0 ± 0.0 51.5 ± 0.1 54.4 ± 0.7 52.2 ± 0.8
CS Binary Addition 49.7 ± 0.2 50.9 ± 0.9 55.1 ± 0.6 50.4 ± 1.3
Binary Multiplication× 49.8 ± 0.0 51.4 ± 0.7 52.9 ± 0.3 50.0 ± 0.4
Compute Sqrt 50.2 ± 0.0 56.1 ± 0.3 57.3 ± 0.2 54.3 ± 0.4
Bucket Sort†F 28.4 ± 0.6 56.6 ± 16.2 81.7 ± 19.9 25.9 ± 7.2
Table B.4: Score (in percentage, see Section 4), i.e., accuracy averaged over all test lengths and
maximized over 10 random seeds (and other hyperparameters, see Appendix A), for autoregressive
setting. We only include the tasks with output length larger than 1, since the results are identical to
those in Table 2 otherwise. We consider a model to generalize successfully (bold) if its score ≥ 90%.
The random accuracy is 50% (except for Cycle Navigation (R), Bucket Sort (CS), and the two
modular arithmetic tasks where it is 20%). We denote permutation-invariant tasks with †, counting
tasks with ?, tasks requiring a nondeterministic controller with ◦, and tasks requiring superlinear
running time with ×.
Panels (accuracy vs. sequence length, 0 to 500; legend: architectures, e.g., Tape-RNN, LSTM): (a) Even Pairs (R), (b) Modular Arithmetic (Simple) (R), (c) Parity Check (R), (d) Cycle Navigation (R), (e) Stack Manipulation (DCF), (f) Reverse String (DCF), (g) Modular Arithmetic (DCF), (h) Solve Equation (DCF), (i) Duplicate String (CS), (j) Missing Duplicate (CS), (k) Odds First (CS), (l) Binary Addition (CS), (m) Binary Multiplication (CS), (n) Compute Sqrt (CS), (o) Bucket Sort (CS).
Figure B.6: Performance curves on all tasks. For the Transformer encoder, we pick the best positional
encoding for each task (it is considered as a hyperparameter). The dashed vertical red line is the
training range, meaning that sequences to the right have not been seen during training and thus
measure generalization.
Panels (accuracy vs. sequence length, 0 to 500; legend: positional encodings, e.g., RoPE): (a) Even Pairs (R), (b) Modular Arithmetic (Simple) (R), (c) Parity Check (R), (d) Cycle Navigation (R), (e) Stack Manipulation (DCF), (f) Reverse String (DCF), (g) Modular Arithmetic (DCF), (h) Solve Equation (DCF), (i) Duplicate String (CS), (j) Missing Duplicate (CS), (k) Odds First (CS), (l) Binary Addition (CS), (m) Binary Multiplication (CS), (n) Compute Sqrt (CS), (o) Bucket Sort (CS).
Figure B.7: Performance curves on all tasks for the Transformer encoder architecture, for all the
positional encodings we used. The dashed vertical red line is the training range, meaning that
sequences to the right have not been seen during training and thus measure generalization.
Table B.5: Means and standard deviations (computed over random seeds) of the score (average test
accuracy, see Section 4) for the results in the autoregressive setting (i.e., Table B.4). We only include
the tasks with output length larger than 1, since the results are identical to those in Table 2 otherwise.
The † denotes permutation-invariant tasks; the ? denotes counting tasks; the ◦ denotes tasks that
require a nondeterministic controller; and the × denotes tasks that require superlinear running time in
terms of the input length.
Duplicate String 50.2 ± 0.1 50.5 ± 0.1 92.7 ± 4.7 51.4 ± 0.1
Odds First 50.4 ± 0.1 51.0 ± 0.02 52.3 ± 1.4 52.9 ± 0.4
Binary Addition 49.4 ± 0.2 50.5 ± 0.5 60.2 ± 4.2 52.9 ± 0.6
CS
Binary Multiplication× 47.5 ± 0.5 50.5 ± 1.3 50.8 ± 1.3 52.5 ± 0.5
Compute Sqrt 53.7 ± 0.3 57.5 ± 0.3 54.5 ± 0.5 57.7 ± 0.5
Bucket Sort†F 21.8 ± 3.4 35.6 ± 8.7 32.1 ± 4.3 93.0 ± 6.7
To provide a more detailed empirical result, we train the Transformer encoder with relative positional encodings for increasing amounts of in-distribution training data (100 000, 1 000 000, and 10 000 000 training steps) and increasing numbers of layers (5, 10, 15, 20, 25). Table B.6 shows that the score,
i.e., the accuracy averaged over all (unseen) test lengths and maximized over 10 seeds and 3 learning
rates, generally does not increase with more training data or more parameters. Thus, in this empirical
setup, the Chomsky hierarchy implies a hard limit on scaling laws (Kaplan et al., 2020).
Table B.6: Score (see the caption of Table 2) for the Transformer with relative positional encodings
and different numbers of layers and training steps (i.e., training data). We observe that the Chomsky
hierarchy implies a hard limit on scaling laws (Kaplan et al., 2020), i.e., more in-distribution data
and/or more parameters do not lead to better length generalization performance.
Layers
Level Task Training Steps 5 10 15 20 25
100 000 90.2 84.7 74.8 50.1 50.1
Even Pairs 1 000 000 90.0 78.6 78.7 50.1 50.1
10 000 000 97.5 89.7 89.5 50.1 50.1
100 000 21.7 21.6 21.5 20.5 19.9
Modular Arithmetic (Simple) 1 000 000 21.2 21.6 22.6 22.1 20.0
10 000 000 21.3 22.1 24.8 24.4 20.0
R
100 000 50.3 50.4 50.1 50.1 49.9
Parity Check† 1 000 000 51.5 50.3 50.1 50.1 50.1
10 000 000 52.0 50.5 50.1 50.1 50.1
100 000 24.5 25.4 23.7 22.9 23.2
Cycle Navigation† 1 000 000 22.9 22.5 23.6 22.0 23.7
10 000 000 22.8 22.9 25.9 23.2 23.3
100 000 54.9 55.4 55.3 54.5 54.9
Stack Manipulation 1 000 000 54.0 54.4 53.7 54.9 56.2
10 000 000 52.8 51.9 53.0 52.9 49.6
100 000 61.6 61.2 58.0 56.1 56.8
Reverse String 1 000 000 59.8 60.6 61.0 60.1 58.8
10 000 000 56.5 58.7 57.2 52.1 50.0
DCF
100 000 31.4 31.7 31.9 31.5 31.6
Modular Arithmetic 1 000 000 28.9 27.9 31.0 31.0 31.0
10 000 000 27.5 30.8 31.0 31.0 31.0
100 000 22.4 21.3 21.5 22.2 19.9
Solve Equation◦ 1 000 000 23.0 22.4 21.6 21.8 19.9
10 000 000 22.8 21.6 21.9 23.7 20.1
100 000 52.5 53.6 53.1 55.3 53.6
Duplicate String 1 000 000 51.9 51.5 53.1 53.7 54.2
10 000 000 51.9 51.1 51.7 51.3 50.0
100 000 53.9 52.7 52.6 52.1 50.0
Missing Duplicate 1 000 000 53.2 53.0 50.0 50.0 50.0
10 000 000 52.1 52.5 50.0 50.0 50.0
100 000 52.5 53.4 52.9 52.2 53.2
Odds First 1 000 000 52.5 53.4 52.9 52.7 53.5
10 000 000 51.8 52.8 52.0 50.0 50.0
100 000 52.3 54.7 51.7 51.2 50.8
CS Binary Addition 1 000 000 53.3 55.9 54.8 54.8 53.8
10 000 000 52.9 52.9 51.8 54.4 51.1
100 000 52.3 52.3 51.9 51.2 51.0
Binary Multiplication× 1 000 000 52.5 53.1 53.4 53.4 54.0
10 000 000 51.8 53.0 52.0 52.8 50.6
100 000 52.8 52.7 52.6 52.0 52.7
Compute Sqrt 1 000 000 52.3 53.4 53.7 54.0 53.5
10 000 000 51.6 52.7 53.4 53.3 53.0
100 000 90.9 93.9 93.0 91.4 87.5
Bucket Sort†F 1 000 000 93.1 91.6 92.1 93.7 93.0
10 000 000 84.3 81.9 80.4 89.5 65.1