Addition Multiplication RNN
Abstract—Previous RNN architectures have largely been superseded by LSTM, or "Long Short-Term Memory". Since its introduction, there have been many variations on this simple […]
The hyperbolic tangent function is not ideal, since LSTM memory values can grow large, but the hyperbolic tangent has a very small gradient when its input value is large.

f(x) = tanh(x)

This logarithm-based activation function does not saturate and can better handle larger inputs than tanh. Fig. 5 illustrates both functions. The recurrent cell values may grow quite large, causing the hyperbolic tangent function to quickly saturate and gradients to disappear. To obtain good performance from our design, we wanted to develop a non-saturating function that can still squash its input. Earlier works have used logarithm-based activations, and the function we used appears in [13] and originally in [3].

While the rectified linear unit [9] (ReLU) is a common choice for non-recurrent architectures, ReLUs are usually not suitable for LSTM cells, since they only have positive outputs (although they have been used with some success in LSTMs and RNNs, see [18], [17]), and exploding feedback loops are more likely when the activation function does not apply some "squashing effect". However, tanh is less suited to large inputs because its derivative is small outside a narrow interval around zero.
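The logarithm-based function from [13] and [3] is commonly written as f(x) = sign(x) · log(1 + |x|); assuming that form, a minimal NumPy sketch (our implementation used Theano) compares the two squashing behaviors:

```python
import numpy as np

def log_activation(x):
    # Assumed form of the logarithm-based squashing function,
    # f(x) = sign(x) * log(1 + |x|): odd, monotonic, non-saturating.
    # Its derivative, 1 / (1 + |x|), shrinks slowly, so large memory-cell
    # values still pass back a useful gradient, unlike tanh.
    return np.sign(x) * np.log1p(np.abs(x))

x = np.array([-50.0, -2.0, 0.0, 2.0, 50.0])
print(np.tanh(x))         # saturates to -1/+1 almost immediately
print(log_activation(x))  # keeps growing slowly with |x|
```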
The forget gate output has been renamed g^(s), and it now controls the convex combination of the previous memory cell value c_{t-1} with the output of the inner layer i_t. The weight vectors w_{v2} and w_{v3} only connect a neuron to its left and right neighbors. In software, we did not implement this as a matrix multiplication, but as an element-wise roll of the input, which is multiplied by weight vectors and then has bias vectors added. The function ρ(x, y) is an element-wise roll, where x is the input vector and y is the number of positions to roll left or right. For example, ρ([1, 2, 3], 1) = [3, 1, 2] and ρ([1, 2, 3], −1) = [2, 3, 1]. The inter-neuron connections within the memory cells are therefore sparse and local.
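As a loose NumPy sketch (our implementation used Theano), the roll-based sparse layer and the convex combination can be illustrated as follows. The self-connection weight w_{v1}, the squashing of the inner layer, the assignment of g^(s) to the previous cell value, and the omission of the usual input/candidate path are all simplifying assumptions here, not the full LSTWM update; np.roll matches the ρ(x, y) convention above.

```python
import numpy as np

def rho(x, y):
    # Element-wise roll: rho([1,2,3], 1) = [3,1,2], rho([1,2,3], -1) = [2,3,1].
    return np.roll(x, y)

def inner_layer(c_prev, wv1, wv2, wv3, bv1, squash):
    # Sparse, local connections between memory cells: each cell mixes
    # itself (wv1, an assumed self-connection) with its right- and
    # left-rolled neighbors (wv2, wv3), using element-wise products
    # instead of a dense matrix multiplication.
    return squash(wv1 * c_prev + wv2 * rho(c_prev, 1) + wv3 * rho(c_prev, -1) + bv1)

def memory_update(c_prev, i_t, g_s):
    # g_s is the (renamed) forget-gate output in [0, 1]; it selects a
    # convex combination of the previous cells and the inner-layer output.
    return g_s * c_prev + (1.0 - g_s) * i_t

width = 4
c_prev = np.array([0.5, -1.0, 2.0, 0.0])
wv1 = wv2 = wv3 = bv1 = np.zeros(width)      # zero init: starts as a standard LSTM
i_t = inner_layer(c_prev, wv1, wv2, wv3, bv1, np.tanh)
print(memory_update(c_prev, i_t, g_s=np.full(width, 0.9)))
```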
The reason for choosing a sparse layer is that dense layers grow quadratically with respect to layer width. We wanted to compare equal-width networks, since an LSTM's memory capacity is determined by the number of individual memory cells. By setting near-zero weights and/or biases for the inner layer, the network can also achieve a traditional forget gate. Since this layer does not use many extra parameters, we can compare equal-width networks. The vectors w_{v1}, w_{v2}, w_{v3} and b_{v1} are initialized with zeros, so the network starts out as a standard LSTM. In summary, an LSTWM network can modify its memory in a more complex manner without necessarily accepting new values or exposing its current values. Since the inner layer only uses the previous memory cell values, it can be computed in parallel with any downstream network and does not present a computational bottleneck if implemented in a parallel manner.
This architecture was adapted from a design in a previous version [20] of this work, which used normal forget gates and included an extra layer and an extra gate after the memory cell update. We found that the previous four-gate architecture did not perform as well on tasks that required a precise memory, likely because having three different memory operations at each timestep resulted in excessive changes to the memory. Setting appropriate initial bias values helped the situation in some cases; however, we found better designs that did not require as much hand-tuning. Our first attempt at resolving the issue was removing the forget gate, but removing the forget gate from our earlier design did not yield good results by itself. Figures 1 and 2 illustrate LSTM and LSTWM, respectively. The i subscript indicates the ith unit in a layer. The white double arrows indicate an input that determines a gate value (as opposed to a value that gets multiplied by a gate).
1) Training, Regularization and Other Details: Given that the memory cells do not necessarily decay exponentially at each timestep, it is important to control their magnitude. Rather than just using a regularization term on the network weights themselves, we also regularize the overall magnitude of the memory cells at each timestep.

When training an ANN of any type, a regularization term is usually included in the cost function. Often, the L2-norms of the weight matrices are added together and multiplied by a very small constant, which keeps the weight magnitudes in check. For our architecture to train quickly, it is important to use an additional regularization term. Given a batch of training data, we take the squared mean of the absolute value plus the mean absolute value of the memory-cell magnitudes at each timestep for every element in the batch; in other words, η · (mean(|cells|)^2 + mean(|cells|)) for some small constant η. We found that η ≈ 10^-2 or η ≈ 10^-3 worked well. Similar regularization terms are discussed in [17], although they differ in that they are applied to the change in memory cell values from one timestep to the next, rather than to their magnitude. Using direct connections between the inner cells can amplify gradients and quickly produce exploding values. By adding this regularization term, we penalize weight configurations that encourage uncontrollable memory-cell values. Since the cell values are squashed by the activation function before being multiplied by the output gate, regularizing these values shapes the optimization landscape in a way that encourages faster learning. This is also true to some extent for regular LSTM, which benefits from this type of regularization as well. We also applied this regularization function to the weights themselves. LSTWM can learn effectively without the extra regularization, but it will learn slowly in some cases.
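A minimal sketch of this memory-cell penalty as described above (NumPy here for clarity; the actual cost was built in Theano, and the tensor layout is assumed):

```python
import numpy as np

def cell_magnitude_penalty(cells, eta=1e-3):
    # cells: memory-cell values collected over every timestep and batch
    # element, e.g. shape (timesteps, batch, width).  The penalty is
    # eta * (mean(|cells|)**2 + mean(|cells|)), added to the training cost.
    m = np.mean(np.abs(cells))
    return eta * (m ** 2 + m)

cells = np.random.randn(200, 32, 256)   # hypothetical: 200 steps, batch 32, width 256
print(cell_magnitude_penalty(cells))    # small extra term in the cost function
```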
Note that we use the square-of-abs-mean, not the mean-squared. Using the square-of-abs-mean allows some values to grow large and only encourages most of the values to be small. This is important, especially for the memory cells, because the ability of a neural network to generalize depends on its ability to abstract away details and ignore small changes in input. Indeed, this is the entire purpose of using sigmoid-shaped squashing functions. If the goal were to keep every single memory cell within tanh's "gradient zone", one could instead use the hyperbolic cosine as a regularization function, since it grows very large outside this range.
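For contrast, a sketch of that stricter alternative (which we did not use): a cosh-based penalty explodes as soon as any individual cell drifts outside tanh's high-gradient region, whereas the square-of-abs-mean tolerates a few large cells.

```python
import numpy as np

def cosh_penalty(cells, eta=1e-3):
    # cosh grows roughly like exp(|x|), so a single cell with a large
    # magnitude dominates the penalty and is pushed back toward the
    # region where tanh still has a useful gradient.
    return eta * np.mean(np.cosh(cells) - 1.0)

print(cosh_penalty(np.array([0.1, 0.5, 8.0])))  # dominated by the 8.0 entry
```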
We used Python and Theano [21] to implement our network and experiments.

Fig. 5. Comparison of tanh and the log-based activation function.

Architecture            BPC on test set after training
LSTM-256-tanh           1.893
LSTWM-256-tanh          1.892
LSTM-256-log            1.880
LSTWM-256-log           1.880
LSTM-256-256-tanh       1.742
LSTWM-256-256-tanh      1.733
LSTM-256-256-log        1.730
LSTWM-256-256-log       1.725

Fig. 6. Performance on the text prediction task.
III. EXPERIMENTS

We have two experimental tasks to test our network on: text prediction and a combination digit-recognition and addition task. The networks were trained using ADAM [16], an optimization algorithm based on gradient descent, with the settings α = 0.001, β1 = 0.9, β2 = 0.999.

IV. TEXT PREDICTION
Like many other works, e.g. [6], we use the Hutter challenge dataset [12] as a performance test. This is a dataset of text and XML from Wikipedia, and the objective is to predict the next character in the sequence. (Note that we only use the dataset itself; this benchmark is unrelated to the Hutter compression challenge.) The first 95% of the data was used as a training set and the last 5% for testing. Error is measured in bits per character (BPC), which is identical to cross-entropy error except that the base-2 logarithm is used instead of the natural log. We used a batch size of 32 and no gradient clipping. To reduce the amount of training time needed, we used length-200 sequences for the first epoch and then length-2000 sequences for the remaining five epochs.
values. By adding this regularization term, we penalize weight Figures 3 and 4 show a running average of the training
configurations that encourage uncontrollable memory cell error and 6 shows the BPC on the test set after training. The
values. Since the cells values get squashed by the activation results are close for this particular task, with LSTWM taking a
function before being multiplied by the output gate, regularizing slight advantage. Notice that for this test, the logarithm based
these values shapes the optimization landscape in a way that activation does carry some benefit, and the best performing
encourages faster learning. This is also true to some extent for network was indeed LSTWM with the logarithmic activation.
regular LSTM, which benefits from this type of regularization
A. Training Information

Given the popularity of this task as a benchmark, a quick training phase is desirable. Input vectors are traditionally given in a one-hot format. However, the network layers can be quite wide, and each cell has many connections. We found that using a slightly larger nonzero value in the input vectors resulted in quicker training. Instead of using a value of 1.0 for the single non-zero element of each input vector, we used log(n) + 1.0, where n is the number of input symbols; in this case n = 205, since there are 205 distinct characters in this dataset. On tasks with many symbols, the mean magnitude of the elements of the input vector is then not as small, which accelerated training in our experiments.
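A sketch of this scaled one-hot encoding (the symbol-to-index mapping is hypothetical):

```python
import numpy as np

def scaled_one_hot(index, n_symbols=205):
    # A conventional one-hot vector, but with log(n) + 1.0 instead of 1.0
    # as the single non-zero entry, so the mean input magnitude does not
    # shrink as the symbol alphabet grows.  For n = 205 this is about 6.32.
    v = np.zeros(n_symbols)
    v[index] = np.log(n_symbols) + 1.0
    return v

print(scaled_one_hot(index=42)[42])  # ~6.32 for the 205-character alphabet
```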
Another method we used to speed up training was a pre-train epoch with shorter sequences. The shorter sequences mean that there are more iterations in this epoch and the […]
Fig. 7. Example input sequences for the combination digit task.
Fig. 8. Inputs colored and separated into columns to show individual input columns.
Fig. 9. Digit Combo task training error.