Backpropagation Through Time: What It Does and How to Do It

Paul Werbos
Backpropagation is now the most widely used tool in the field of artificial neural networks. At the core of backpropagation is a method for calculating derivatives exactly and efficiently in any large system made up of elementary subsystems or calculations which are represented by known, differentiable functions; thus, backpropagation has many applications which do not involve neural networks as such.

This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis. Next, it presents the basic equations for backpropagation through time, and discusses applications to areas like pattern recognition involving dynamic systems, systems identification, and control. Finally, it describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed.

I. INTRODUCTION

Backpropagation through time is a very powerful tool, with applications to pattern recognition, dynamic modeling, sensitivity analysis, and the control of systems over time, among others. It can be applied to neural networks, to econometric models, to fuzzy logic structures, to fluid dynamics models, and to almost any system built up from elementary subsystems or calculations. The one serious constraint is that the elementary subsystems must be represented by functions known to the user, functions which are both continuous and differentiable (i.e., possess derivatives). For example, the first practical application of backpropagation was for estimating a dynamic model to predict nationalism and social communications in 1974 [1].

Unfortunately, the most general formulation of backpropagation can only be used by those who are willing to work out the mathematics of their particular application. This paper will mainly describe a simpler version of backpropagation, which can be translated into computer code and applied directly by neural network users.

Section II will review the simplest and most widely used form of backpropagation, which may be called "basic backpropagation." The concepts here will already be familiar to those who have read the paper by Rumelhart, Hinton, and Williams [2] in the seminal book Parallel Distributed Processing, which played a pivotal role in the development of the field. (That book also acknowledged the prior work of Parker [3] and Le Cun [4], and the pivotal role of Charles Smith of the Systems Development Foundation.) This section will use new notation which adds a bit of generality and makes it easier to go on to complex applications in a rigorous manner. (The need for new notation may seem unnecessary to some, but for those who have to apply backpropagation to complex systems, it is essential.)

Section III will use the same notation to describe backpropagation through time. Backpropagation through time has been applied to concrete problems by a number of authors, including, at least, Watrous and Shastri [5], Sawai and Waibel et al. [6], Nguyen and Widrow [7], Jordan [8], Kawato [9], Elman and Zipser, Narendra [10], and myself [1], [11], [12], [15]. Section IV will discuss what is missing in this simplified discussion, and how to do better.

At its core, backpropagation is simply an efficient and exact method for calculating all the derivatives of a single target quantity (such as pattern classification error) with respect to a large set of input quantities (such as the parameters or weights in a classification rule). Backpropagation through time extends this method so that it applies to dynamic systems. This allows one to calculate the derivatives needed when optimizing an iterative analysis procedure, a neural network with memory, or a control system which maximizes performance over time.

II. BASIC BACKPROPAGATION

A. The Supervised Learning Problem

Basic backpropagation is currently the most popular method for performing the supervised learning task, which is symbolized in Fig. 1.

In supervised learning, we try to adapt an artificial neural network so that its actual outputs (Ŷ) come close to some target outputs (Y) for a training set which contains T patterns. The goal is to adapt the parameters of the network so that it performs well for patterns from outside the training set.

The main use of supervised learning today lies in pattern

Manuscript received September 12, 1989; revised March 15, 1990. The author is with the National Science Foundation, 1800 G St. NW, Washington, DC 20550. IEEE Log Number 9039172.

1550 PROCEEDINGS OF THE IEEE, VOL. 78, NO. 10, OCTOBER 1990
eral case. Given a history of X(1) ... X(T) and Y(1) ... Y(T), we want to find a mapping from X to Y which will perform well when we encounter new vectors X outside the training set. The index "t" may be interpreted either as a time index or as a pattern number index; however, this section will not assume that the order of patterns is meaningful.
as will be seen in the next section.

C. Adapting the Network: Approach

In basic backpropagation, we choose the weights W_ij so as to minimize square error over the training set:

    E = Σ_{t=1..T} E(t) = Σ_{t=1..T} Σ_{i=1..n} (1/2)·(Ŷ_i(t) − Y_i(t))².    (6)

This is simply a special case of the well-known method of least squares, used very often in statistics, econometrics, and engineering; the uniqueness of backpropagation lies in how this expression is minimized. The approach used here is illustrated in Fig. 3.

    ∂⁺TARGET/∂z_i = ∂TARGET/∂z_i + Σ_{j>i} (∂⁺TARGET/∂z_j)·(∂z_j/∂z_i)    (7)

where the derivatives with the superscript represent ordered derivatives, and the derivatives without superscripts represent ordinary partial derivatives. This chain rule is valid only for ordered systems where the values to be calculated can be calculated one by one (if necessary) in the order z_1, z_2, ..., z_n, TARGET. The simple partial derivatives represent the direct impact of z_i on z_j through the system equation which determines z_j. The ordered derivative represents the total impact of z_i on TARGET, accounting for both the direct and indirect effects. For example, suppose that we had a simple system governed by the following two equations, in order:

    z_2 = 4·z_1
    z_3 = 3·z_1 + 5·z_2.
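To make the distinction concrete, the two-equation system above can be worked through numerically. The following short Python sketch is mine, not the paper's: the simple partial derivative of z_3 with respect to z_1 is 3 (the direct effect only), while the ordered derivative also counts the indirect path through z_2, giving 3 + 5·4 = 23.

```python
# Ordered system from the text, computed in order:
#   z2 = 4 * z1
#   z3 = 3 * z1 + 5 * z2
def run(z1):
    z2 = 4.0 * z1
    z3 = 3.0 * z1 + 5.0 * z2
    return z3

direct = 3.0               # simple partial: coefficient of z1 in the z3 equation
ordered = 3.0 + 5.0 * 4.0  # chain rule for ordered derivatives: direct + indirect via z2

# finite-difference check: perturbing z1 moves z3 by the ordered derivative
eps = 1e-6
numeric = (run(1.0 + eps) - run(1.0 - eps)) / (2 * eps)
print(ordered, round(numeric, 6))   # 23.0 23.0
```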
z_j, which are later than z_i in the causal ordering we impose on the system.

This chain rule provides a straightforward, plodding, "linear" recipe for how to calculate the derivatives of a given TARGET variable with respect to all of the inputs (and parameters) of an ordered differentiable system in only one pass through the system. This paper will not explain this chain rule in detail since lengthy tutorials have been published elsewhere [1], [11]. But there is one point worth noting: because we are calculating ordered derivatives of one target variable, we can use a simpler notation, a notation which works out to be easier to use in complex practical examples [11]. We can write the ordered derivative of the TARGET with respect to z_i as "F_z_i," which may be described as "the feedback to z_i." In basic backpropagation, the TARGET variable of interest is the error E. This changes the appearance of our chain rule in that case to

…propagation of information is what gives backpropagation its name. A little calculus and algebra, starting from (5), shows us that

    s'(z) = s(z)·(1 − s(z)),    (13)

which we can use when we implement (11). Finally, to adapt the weights, the usual method is to set

    New W_ij = W_ij − learning_rate · F_W_ij    (14)

where the learning rate is some small constant chosen on an ad hoc basis. (The usual procedure is to make it as large as possible, up to 1, until the error starts to diverge; however, there are more analytic procedures available [11].)

F. Adapting the Network: Code

The key part of basic backpropagation, (10)-(13), may be coded up into a "dual" subroutine, as follows.
    C Next implement equation (9)
          DO 9 i=1,n
        9 F_Yhat(i) = Yhat(i) - Y(t,i)
    C Next implement (10)-(12)
          CALL F_NET(F_Yhat, W, x, F_W)
    C Next implement (14)
    C Note how weights are updated
    C within the "DO 100" loop.
          DO 14 i=m+1,N+n
          DO 14 j=1,i-1
       14 W(i,j) = W(i,j) - learning_rate*F_W(i,j)
      100 CONTINUE
     1000 CONTINUE

Fig. 4. Backwards flow of derivative calculation.

Fig. 5. Generalized network design with time lags.
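The pattern-learning loop of Section II (the forward subroutine NET, its dual F_NET, and the update rule (14)) can also be rendered as a short runnable sketch. The following Python version is mine, not the paper's Fortran; the sizes, data, and learning rate are invented for the demonstration, and a numerical check of the sigmoid-derivative identity (13) is included.

```python
import numpy as np

def s(z):
    # the logistic squashing function; (13) gives s'(z) = s(z)*(1 - s(z))
    return 1.0 / (1.0 + np.exp(-z))

# quick numerical check of (13) by central differences
z, eps = 0.7, 1e-6
assert abs((s(z + eps) - s(z - eps)) / (2 * eps) - s(z) * (1 - s(z))) < 1e-9

# invented sizes and data: units 0..m-1 are inputs, m..N-1 hidden, N..N+n-1 outputs
m, N, n, T = 3, 5, 2, 20
U = N + n
rng = np.random.default_rng(1)
X = rng.normal(size=(T, m))
Y = rng.uniform(0.2, 0.8, size=(T, n))   # reachable targets for sigmoid outputs
W = rng.normal(scale=0.5, size=(U, U))   # W[i, j] used only for j < i
learning_rate = 0.05

def net(W, inputs):
    # forward pass (NET): each active unit sees all earlier units
    x = np.zeros(U)
    x[:m] = inputs
    for i in range(m, U):
        x[i] = s(W[i, :i] @ x[:i])
    return x

def total_error(W):
    # eq. (6): summed squared error over the training set
    return sum(0.5 * np.sum((net(W, X[t])[N:] - Y[t]) ** 2) for t in range(T))

E_before = total_error(W)
for a_pass in range(200):
    for t in range(T):                    # pattern learning: update per pattern
        x = net(W, X[t])
        F_x = np.zeros(U)
        F_x[N:] = x[N:] - Y[t]            # eq. (9)
        F_net = np.zeros(U)
        F_W = np.zeros_like(W)
        for i in range(U - 1, m - 1, -1): # the dual F_NET: (10)-(12), backwards
            F_x[i] += W[i + 1:, i] @ F_net[i + 1:]
            F_net[i] = F_x[i] * x[i] * (1 - x[i])
            F_W[i, :i] = F_net[i] * x[:i]
        W = W - learning_rate * F_W       # eq. (14)
print(E_before, total_error(W))
```

After 200 passes the training-set error has dropped well below its starting value, which is all this sketch is meant to show.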
The key point here is that the weights W are adjusted in response to the current vector F_W, which only depends on the current pattern t; the weights are adjusted after each pattern is processed. (In batch learning, by contrast, the weights are adjusted only after the "DO 100" loop is completed.)

In practice, maximum_passes is usually set to an enormous number; the loop is exited only when a test of convergence is passed, a test of error size or weight change which can be injected easily into the loop. True real-time learning is like pattern learning, but with only one pass through the data and no memory of earlier times t. (The equations above could be implemented easily enough as a real-time learning scheme; however, this will not be true for backpropagation through time.) The term "on-line learning" is sometimes used to represent a situation which could be pattern learning or could be real-time learning. Most people using basic backpropagation now use pattern learning rather than real-time learning because, with their data sets, many passes through the data are needed to ensure convergence of the weights.

The reader should be warned that I have not actually tested the code here. It is presented simply as a way of explaining more precisely the preceding ideas. The C implementations which I have worked with have been less transparent, and harder to debug, in part because of the absence of range checking in that language. It is often argued that people "who know what they are doing" do not need range checking and the like; however, people who think they never make mistakes should probably not be writing this kind of code. With neural network code, especially, good diagnostics and tests are very important because bugs can lead to slow convergence and oscillation, problems which are hard to track down, and are easily misattributed to the algorithm in use. If one must use a language without range checking, it is extremely important to maintain a version of the code which is highly transparent and safe, however inefficient it may be, for diagnostic purposes.

III. BACKPROPAGATION THROUGH TIME

A. Background

Backpropagation through time, like basic backpropagation, is used most often in pattern recognition today. Therefore, this section will focus on such applications, using notation like that of the previous section. See Section IV for other applications.

In some applications, such as speech recognition or submarine detection, our classification at time t will be more accurate if we can account for what we saw at earlier times. Even though the training set still fits the same format as (7) above, we want to use a more powerful class of networks to do the classification; we want the output of the network at time t to account for variables at earlier times (as in Fig. 5).

The Introduction cited a number of examples where such "memory" of previous time periods is very important. For example, it is easier to recognize moving objects if our network accounts for changes in the scene from time t − 1 to time t, which requires memory of time t − 1. Many of the best pattern recognition algorithms involve a kind of "relaxation" approach where the representation of the world at time t is based on an adjustment of the representation at time t − 1; this requires memory of the internal network variables for time t − 1. (Even Kalman filtering requires such a representation.)

B. Example of a Recurrent Network

Backpropagation can be applied to any system with a well-defined order of calculations, even if those calculations depend on past calculations within the network itself. For the sake of generality, I will show how this works for the network design shown in Fig. 5, where every neuron is potentially allowed to input values from any of the neurons at the two previous time periods (including, of course, the input neurons). To avoid excess clutter, Fig. 5 shows the hidden and output sections of the network (parallel to Fig. 2) only for time T, but they are present at other times as well. To translate this network into a mathematical system, we can simply replace (2) above by

    net_i(t) = Σ_{j=1..i−1} W_ij·x_j(t) + Σ_{j=1..N+n} W'_ij·x_j(t−1) + Σ_{j=1..N+n} W''_ij·x_j(t−2).    (15)

Again, we can simply fix some of the weights to be zero, if we so choose, in order to simplify the network. In most applications today, the W'' weights are fixed to zero (i.e., erased from all formulas), and all the W' weights are fixed to zero as well, except for W'_ii. This is done in part for the sake of parsimony, and in part for historical reasons. (The "time-delay neural networks" of Watrous and Shastri [5] assumed that special case.) Here, I deliberately include extra terms for the sake of generality. I allow for the fact that all active neurons (neurons other than input neurons) can be allowed to input the outputs of any other neurons if there is a time lag in the connection. The weights W' and W'' are the weights on those time-lagged connections between neurons. [Lags of more than two periods are also easy to manage; they are treated just as one would expect from seeing how we handle the lag-two terms, as a special case.]

These equations could be embodied in a subroutine:

    SUBROUTINE NET2(X(t), W, W', W'', x(t − 2), x(t − 1), x(t), Yhat),
which is programmed just like the subroutine NET, with the modifications one would expect from (15). The output arrays are x(t) and Yhat.

When we call this subroutine for the first time, at t = 1, we face a minor technical problem: there is no value for x(−1) or x(0), both of which we need as inputs. In principle, we can use any values we wish to choose; the choice of x(−1) and x(0) is essentially part of the definition of our network. Most people simply set these vectors to zero, and argue that their network will start out with a blank slate in classifying whatever dynamic pattern is at hand, both in the training set and in later applications. (Statisticians have been known to treat these vectors as weights, in effect, to be adapted along with the other weights in the network. This works fine in the training set, but opens up questions of what to do when one applies the network to new data.)

In this section, I will assume that the data run from an initial time t = 1 through to a final time t = T, which plays a crucial role in the derivative calculations. Section IV will show how this assumption can be relaxed somewhat.

C. Adapting the Network: Equations

To calculate the derivatives F_W_ij, we use the same equations as before, except that (10) is replaced by

    F_x_i(t) = F_Ŷ_{i−N}(t) + Σ_{j=i+1..N+n} W_ji·F_net_j(t) + Σ_{j=m+1..N+n} W'_ji·F_net_j(t + 1) + Σ_{j=m+1..N+n} W''_ji·F_net_j(t + 2).    (16)

Once again, if one wants to fix the W'' terms to zero, one can simply delete the rightmost term.

Notice that this equation makes it impossible for us to calculate F_x_i(t) and F_net_i(t) until after F_net_j(t + 1) and F_net_j(t + 2) are already known; therefore, we can only use this equation by proceeding backwards in time, calculating F_net for time T, and then working our way backwards to time 1.

To adapt this network, of course, we need to calculate F_W'_ij and F_W''_ij as well as F_W_ij:

    F_W'_ij = Σ_{t=1..T} F_net_i(t + 1)·x_j(t)    (17)

    F_W''_ij = Σ_{t=1..T} F_net_i(t + 2)·x_j(t).    (18)

In all of these calculations, F_net(T + 1) and F_net(T + 2) should be treated as zero. For programming convenience, I will later define quantities like F_net'_i(t) = F_net_i(t + 1), but this is purely a convenience; the subscript "i" and the time argument are enough to identify which derivative is being represented. (In other words, net_i(t) represents a specific quantity z_j as in (8), and F_net_i(t) represents the ordered derivative of E with respect to that quantity.)

D. Adapting the Network: Code

To fully understand the meaning and implications of these equations, it may help to run through a simple (hypothetical) implementation.

First, to calculate the derivatives, we need a new subroutine, dual to NET2.

    SUBROUTINE F_NET2(F_Yhat, W, W', W'', x, F_net, F_net', F_net'', F_W, F_W', F_W'')
          REAL F_Yhat(n), W(N+n,N+n), W'(N+n,N+n), W''(N+n,N+n)
          REAL x(N+n), F_net(N+n), F_net'(N+n), F_net''(N+n)
          REAL F_W(N+n,N+n), F_W'(N+n,N+n), F_W''(N+n,N+n), F_x(N+n)
          INTEGER i, j, n, m, N
    C Initialize equation (16)
          DO 1 i=1,N
        1 F_x(i) = 0.
          DO 2 i=1,n
        2 F_x(i+N) = F_Yhat(i)
    C Run through (16), (11), and (12) as a set,
    C running backwards
          DO 1000 i=N+n,m+1,-1
    C first complete (16)
          DO 161 j=i+1,N+n
      161 F_x(i) = F_x(i) + W(j,i)*F_net(j)
          DO 162 j=m+1,N+n
      162 F_x(i) = F_x(i) + W'(j,i)*F_net'(j) + W''(j,i)*F_net''(j)
    C next implement (11)
          F_net(i) = F_x(i)*x(i)*(1 - x(i))
    C implement (12), (17), and (18) (as running sums)
          DO 12 j=1,i-1
       12 F_W(i,j) = F_W(i,j) + F_net(i)*x(j)
          DO 1718 j=1,N+n
          F_W'(i,j) = F_W'(i,j) + F_net'(i)*x(j)
     1718 F_W''(i,j) = F_W''(i,j) + F_net''(i)*x(j)
     1000 CONTINUE

Notice that the last two DO loops have been set up to perform running sums, to simplify what follows.

Finally, we may adapt the weights as follows, by batch learning, where I use the abbreviation x(i,) to represent the vector formed by x(i,j) across all j.

          REAL x(-1:T,N+n), Yhat(T,n)
          DATA x(0,),x(-1,) / (2*(N+n))*0.0 /
    C
          DO 1000 pass_number=1,maximum_passes
    C First calculate outputs and errors in
    C a forward pass
          DO 100 t=1,T
      100 CALL NET2(X(t), W, W', W'', x(t-2,), x(t-1,), x(t,), Yhat(t,))
    C Initialize the running sums to 0 and
    C set F_net'(T) and F_net''(T) to 0
          DO 200 i=m+1,N+n
          F_net'(i) = 0.
          F_net''(i) = 0.
          DO 199 j=1,N+n
          F_W(i,j) = 0.
          F_W'(i,j) = 0.
      199 F_W''(i,j) = 0.
      200 CONTINUE
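Equations (15)-(18) can be exercised end to end in a few dozen lines. The following NumPy sketch is mine, not the paper's Fortran; the sizes, data, and the helper names forward and backward are invented. It implements the forward pass of (15), the backwards-in-time recursion (16), and the running sums of (12), (17), and (18), and then verifies one of the resulting derivatives against finite differences.

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented sizes and data; units 0..m-1 are inputs, m..N-1 hidden, N..U-1 outputs.
m, N, n, T = 2, 3, 1, 5
U = N + n
rng = np.random.default_rng(0)
X = rng.normal(size=(T + 1, m))          # X[t] for t = 1..T (row 0 unused)
Y = rng.uniform(0.2, 0.8, size=(T + 1, n))
W  = rng.normal(scale=0.5, size=(U, U))  # same-time weights of (15), used for j < i
W1 = rng.normal(scale=0.5, size=(U, U))  # lag-one weights (W' in the text)
W2 = rng.normal(scale=0.5, size=(U, U))  # lag-two weights (W'' in the text)

def forward(W, W1, W2):
    x = np.zeros((T + 1, U))             # x[0] doubles as the zero vector x(0)
    E = 0.0
    for t in range(1, T + 1):
        xm1 = x[t - 1]
        xm2 = x[t - 2] if t >= 2 else np.zeros(U)   # x(-1) = 0, as in the text
        x[t, :m] = X[t]
        for i in range(m, U):            # eq. (15)
            net_i = W[i, :i] @ x[t, :i] + W1[i] @ xm1 + W2[i] @ xm2
            x[t, i] = s(net_i)
        E += 0.5 * np.sum((x[t, N:] - Y[t]) ** 2)
    return x, E

def backward(W, W1, W2, x):
    F_W, F_W1, F_W2 = np.zeros((3, U, U))
    F_net = np.zeros((T + 3, U))         # rows T+1, T+2 stay zero, as required
    for t in range(T, 0, -1):            # backwards in time
        F_x = np.zeros(U)
        F_x[N:] = x[t, N:] - Y[t]                       # F_Yhat term of (16)
        for i in range(U - 1, m - 1, -1):
            F_x[i] += W[i + 1:, i] @ F_net[t, i + 1:]   # same-time sum in (16)
            F_x[i] += W1[m:, i] @ F_net[t + 1, m:]      # lag-one sum in (16)
            F_x[i] += W2[m:, i] @ F_net[t + 2, m:]      # lag-two sum in (16)
            F_net[t, i] = F_x[i] * x[t, i] * (1 - x[t, i])
            F_W[i, :i] += F_net[t, i] * x[t, :i]        # running sum, eq. (12)
            F_W1[i] += F_net[t + 1, i] * x[t]           # eq. (17)
            F_W2[i] += F_net[t + 2, i] * x[t]           # eq. (18)
    return F_W, F_W1, F_W2

x, E0 = forward(W, W1, W2)
F_W, F_W1, F_W2 = backward(W, W1, W2, x)

# finite-difference check of one lag-one weight derivative
i, j, eps = m, 0, 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (forward(W, Wp, W2)[1] - forward(W, Wm, W2)[1]) / (2 * eps)
print(abs(numeric - F_W1[i, j]) < 1e-6)
```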
    s(z) = 1 − 1/(1 + z + 0.5·z²),  z > 0.

In a similar spirit, it is common to speed up learning by "stretching out" s(z) so that it goes from −1 to 1 instead of 0 to 1.

Backpropagation can also be used without using neural networks at all. For example, it can be used to adapt a network consisting entirely of user-specified functions, representing something like an econometric model. In that case, the way one proceeds depends on who one is programming for and what kind of model one has.

If one is programming for oneself and the model consists of a sequence of equations which can be invoked one after the other, then one should consider the tutorial paper [11], which also contains a more rigorous definition of what these "F_x" derivatives really mean and a proof of the chain rule for ordered derivatives. If one is developing a tool for others, then one might set it up to look like a standard econometric package (like SAS or Troll) where the user of the system types in the equations of his or her model; the backpropagation would go inside the package as a way to speed up these calculations, and would mostly be transparent to the user. If one's model consists of a set of simultaneous equations which need to be solved at each time, then one must use more complicated procedures [15]; in neural network terms, one would call this a "doubly recurrent network." (The methods of Pineda [16] and Almeida [17] are special cases of this situation.)

Pearlmutter [18] and Williams [19] have described alternative methods, designed to achieve results similar to those of backpropagation through time, using a different computational strategy. For example, the Williams-Zipser method is a special case of the "conventional perturbation" equation cited in [14], which rejected this as a neural network method on the grounds that its computational costs scale as the square of the network size; however, the method does yield exact derivatives with a time-forward calculation.

Supervised learning problems or forecasting problems which involve memory can also be translated into control problems [15, p. 352], [20], which allows the use of adaptive critic methods, to be discussed in the next section. Normally, this would yield only an approximate solution (or approximate derivatives), but it would also allow time-forward real-time learning. If the network itself contains calculation noise (due to hardware limitations), the adaptive critic approach might even be more robust than backpropagation through time because it is based on mathematics which allow for the presence of noise.

B. Applications Other Than Supervised Learning

Backpropagation through time can also be used in two other major applications: neuroidentification and neurocontrol. (For applications to sensitivity analysis, see [14] and [15].)

In neuroidentification, we try to do with neural nets what econometricians do with forecasting models. (Engineers would call this the identification problem or the problem of identifying dynamic systems. Statisticians refer to it as the problem of estimating stochastic time-series models.) Our training set consists of vectors X(t) and u(t), not X(t) and Y(t). Usually, X(t) represents a set of observations of the external world, and u(t) represents a set of actions that we had control over (such as the settings of motors or actuators). The combination of X(t) and u(t) is input to the network at each time t. Our target, at time t, is the vector X(t + 1).

We could easily build a network to input these inputs, and aim at these targets. We could simply collect the inputs and targets into the format of Section II, and then use basic backpropagation. But basic backpropagation contains no "memory." The forecast of X(t + 1) would depend on X(t), but not on previous time periods. If human beings worked like this, then they would be unable to predict that a ball might roll out the far side of a table after rolling down under the near side; as soon as the ball disappeared from sight [from the current vector X(t)], they would have no way of accounting for its existence. (Harold Szu has presented a more interesting example of this same effect: if a tiger chased after such a memoryless person, the person would forget about the tiger after first turning to run away. Natural selection has eliminated such people.) Backpropagation through time permits more powerful networks, which do have a "memory," for use in the same setup.

Even this approach to the neuroidentification problem has its limitations. Like the usual methods of econometrics [15], it may lead to forecasts which hold up poorly over multiple time periods. It does not properly identify where the noise comes from. It does not permit real-time adaptation. In an earlier paper [20], I have described some ideas for overcoming these limitations, but more research is needed. The first phase of Kawato's cascade method [9] for controlling a robot arm is an identification phase, which is more robust over time, and which uses backpropagation through time in a different way; it is a special case of the "pure robust method," which also worked well in the earliest applications which I studied [1], [20].

After we have solved the problem of identifying a dynamic system, we are then ready to move on to controlling that system.

In neurocontrol, we often start out with a model or network which describes the system or plant we are trying to control. Our problem is to adapt a second network, the action network, which inputs X(t) and outputs the control u(t). (In actuality, we can allow the action network to "see" or input the entire vector x(t) calculated by the model network; this allows it to account for memories such as the recent appearance of a tiger.) Usually, we want to adapt the action network so as to maximize some measure of performance or utility U(X, t) summed over time. Performance measures used in past applications have included everything from the energy used to move a robot arm [8], [9] through to net profits received by the gas industry [11]. Typically, we are given a set of possible initial states X(1), and asked to train the action network so as to maximize the sum of utility from time 1 to a final time T.

To solve this problem using backpropagation through time, we simply calculate the derivatives of our performance measure with respect to all of the weights in the action network. "Backpropagation" refers to how we calculate the derivatives, not to anything involving pattern recognition or error. We then adapt the weights according to these derivatives, as in (12), except that the sign of the adjustment term is now positive (because we are maximizing rather than minimizing).

The easiest way to implement this approach is to merge
the utility function, the model network, and the action network into one big network. We can then construct the dual to this entire network, as described in 1974 [1] and illustrated in my recent tutorial [11]. However, if we wish to keep the three component networks distinct, then the bookkeeping becomes more complicated. The basic idea is illustrated in Fig. 6, which maps exactly into the approach used by Nguyen and Widrow [7] and by Jordan [8].

Fig. 6. Backpropagating utility through time. (Dashed lines represent derivative calculations.)

Instead of working with a single subroutine, NET, we now need three subroutines:

    UTILITY(X; t; x''; U)
    MODEL(X(t), u(t); x(t); X(t + 1))
    ACTION(x(t); W; x'(t); u(t)).

In each of these subroutines, the two arguments on the right are technically outputs, and the argument on the far right is what we usually think of as the output of the network. We need to know the full vector x produced inside the model network so that the action network can "see" important memories. The action network does not need to have its own internal memory, but we need to save its internal state (x') so that we can later calculate derivatives. For simplicity, I will assume that MODEL does not contain any lag-two memory terms (i.e., W'' weights). The primes after the x's indicate that we are looking at the internal states of different networks; they are unrelated to the primes representing lagged values, discussed in Section III, which we will also need in what follows.

To use backpropagation through time, we need to construct dual subroutines for all three of these subroutines:

    F_UTILITY(x''; t; F_X)
    F_MODEL(F_net', F_X, x, F_net, F_u)
    F_ACTION(F_u; x'(t); F_W).

The outputs of these subroutines are the arguments on the far right (including F_net), which are represented by the broken lines in Fig. 6. The subroutine F_UTILITY simply reports out the derivatives of U(x, t) with respect to the variables x_i. The subroutine F_MODEL is like the earlier subroutine F_NET2, except that we need to output F_u instead of derivatives to weights. (Again, we are adapting only the action network here.) The subroutine F_ACTION is virtually identical to the old subroutine F_NET, except that we need to calculate F_W as a running sum (as we did in F_NET2). Of these three subroutines, F_MODEL is by far the most complex. Therefore, it may help to consider some possible code.

    SUBROUTINE F_MODEL(F_net', F_X, x, F_net, F_u)
    C The weights inside this subroutine are those
    C used in MODEL, analogous to those in NET2, and are
    C unrelated to the weights in ACTION
          REAL F_net'(N+n), F_X(n), x(N+n), F_net(N+n), F_u(p), F_x(N+n)
          INTEGER i, j, n, m, N, p
          DO 1 i=1,N
        1 F_x(i) = 0.
          DO 2 i=1,n
        2 F_x(i+N) = F_X(i)
          DO 1000 i=N+n,1,-1
          DO 910 j=i+1,N+n
      910 F_x(i) = F_x(i) + W(j,i)*F_net(j)
          DO 920 j=m+1,N+n
      920 F_x(i) = F_x(i) + W'(j,i)*F_net'(j)
     1000 F_net(i) = F_x(i)*x(i)*(1 - x(i))
          DO 2000 i=1,p
     2000 F_u(i) = F_x(n+i)

The last small DO loop here assumes that u(t) was part of the input vector to the original subroutine MODEL, inserted into the slots between x(n + 1) and x(m). Again, a good programmer could easily compress all this; my goal here is only to illustrate the mathematics.

Finally, in order to adapt the action network, we go through multiple passes, each starting from one of the starting values of X(1). In each pass, we call ACTION and then MODEL, one after the other, until we have built up a stream of forecasts from time 1 up to time T. Then, for each time t going backwards from T to 1, we call the UTILITY subroutine, then F_UTILITY, then F_MODEL, and then F_ACTION. At the end of the pass, we have the correct array of derivatives F_W, which we can then use to adjust the weights of the action network.

In general, backpropagation through time has the advantage of being relatively quick and exact. That is why I chose it for my natural gas application [11]. However, it cannot account for noise in the process to be controlled. To account for noise in maximizing an arbitrary utility function, we must rely on adaptive critic methods [21]. Adaptive critic methods do not require backpropagation through time in any form, and are therefore suitable for true real-time learning. There are other forms of neurocontrol as well [21] which are not based on maximizing a utility function.

In most of the examples above, I assumed that the training data form one lone time series, from t equals 1 to t equals T. Thus, in adapting the weights, I always assumed batch learning (except in the code in Section II); the weights were always adapted after a complete set of derivatives was calculated, based on a complete pass through all the data. Mechanically, one could use pattern learning in the backwards pass through time; however, this would lead to a host of problems, and it is difficult to see what it would gain.

Data in the real world are often somewhere between the two extremes represented by Sections II and III. Instead of having a set of unrelated patterns or one continuous time series, we often have a set of time series or strings. For example, in speech recognition, our training set may consist of a set of strings, each consisting of one word or one sentence. In robotics, our training set may consist of a set of strings, where each string represents one experiment with a robot.

In these situations, we can apply backpropagation through time to a single string of data at a time. For each string, we can calculate complete derivatives and update the weights. Then we can go on to the next string. This is like pattern learning, in that the weights are updated incrementally before the entire data set is studied. It requires intermediate storage for only one string at a time. To speed things up even further, we might adapt the net in stages, initially fixing certain weights (like W'_jj) to zero or one.

Nevertheless, string learning is not the same thing as real-time learning. To solve problems in neuroidentification and supervised learning, the only consistent way to have internal memory terms and to avoid backpropagation through time is to use adaptive critics in a supporting role [15]. That alternative is complex, inexact, and relatively expensive for these applications; it may be unavoidable for true real-time systems like the human brain, but it would probably be better to live with string learning and focus on other challenges in neuroidentification for the time being.

D. Speeding Up Convergence

For those who are familiar with numerical analysis and optimization, it goes without saying that steepest descent, as in (12), is a very inefficient method.

There is a huge literature in the neural network field on how to speed up backpropagation. For example, Fahlman and Touretzky of Carnegie-Mellon have compiled and tested a variety of intuitive insights which can speed up convergence.

…parts of engineering; however, there is a large literature on alternative approaches [12], both in neural network theory and in robust statistics.

These literatures are beyond the scope of this paper, but a few related points may be worth noting. For example, instead of minimizing square error, we could minimize the 1.5 power of error; all of the operations above still go through. We can minimize E of (5) plus some constant k times the sum of squares of the weights; as k goes to infinity and the network is made linear, this converges to Kohonen's pseudoinverse method, a common form of associative memory. Statisticians like Dempster and Efron have argued that the linear form of this approach can be better than the usual least squares methods; their arguments capture the essential insight that people can forecast by analogy to historical precedent, instead of forecasting by a comprehensive model or network. Presumably, an ideal network would bring together both kinds of forecasting [12], [20].

Many authors worry a lot about local minima. In using backpropagation through time in robust estimation, I found it important to keep the "memory" weights near zero at first, and free them up gradually in order to minimize problems. When T is much larger than m, as statisticians recommend for good generalization, local minima are probably a lot less serious than rumor has it. Still, with T larger than m, it is very easy to construct local minima. Consider the example with m = 2 shown in Table 1.

Table 1. Training Set for Local Minima
vergence a hundredfold. Their benchmark problems may t X(t) Y(t)
be very useful in evaluating other methods which claim to
1 0 1 .1
do the same. A few authors have copied simple methods 2 1 0 .I
from the field of numerical analysis, such as quasi-Newton 3 1 1 .9
methods (BFGS) and Polak-Ribiere conjugate gradients;
however, the former works only on small problems (a
hundred or so weights) [22], while the latter works well only
The error for each of the patterns can be plotted as a con-
with batch learningandverycareful linesearches.The need
tour map as a function of the two weights w,and w2.(For
for careful line searches i s discussed in the literature [23],
this simple example, no threshold term i s assumed.) Each
but I have found it to be unusually importantwhenworking
map i s made up of straight contours, defining a fairly sharp
with large problems, including simulated linear mappings.
trough about a central line. The three central lines for the
In my own work, I have used Shanno's more recent con-
three patterns form a triangle, the vertices of which cor-
jugate gradient method with batch learning; for a dense
respond roughly to the local minima. Even when Tis much
training set-made up of distinctly different patterns-this
larger than m, conflicts like this can exist within the training
method worked better than anything else I tried, including
set. Again, however, this may not be an overwhelming prob-
pattern learning methods [12]. Many researchers have used
lem in practical applications [19].
approximate Newton's methods, without saying that they
are using an approximation; however an exact Newton's
U. SUMMARY
method can also be implemented in O ( N )storage, and has
worked reasonably well in early tests [12]. Shanno has Backpropagation through time can be applied to many
reported new breakthroughs in function minimization different categories of dynamical systems-neural net-
which may perform still better [24]. Still, there i s clearly a works, feedforward systems of equations, systems with time
lot of room for improvement through further research. lags, systems with instantaneous feedback between vari-
Needless to say, it can be much easier to converge to a ables (as in ordinary differential equations or simultaneous
setofweightswhichdonot minimizeerrororwhichassume equation models), and s o on. The derivatives which it cal-
a simpler network; methods of that sort are also popular, culates can be used in pattern recognition, in systems iden-
but are useful only when they clearly fit the application at tification, and i n stochastic and deterministic control. This
hand for identifiable reasons. paper has presentedthe keyequationsof backpropagation,
as applied to neural networks of varying degrees of com-
E. Miscellaneous Issues plexity. It has also discussed other papers which elaborate
Minimizing square error and maximizing likelihood are on the extensions of this method to more general appli-
often taken for granted as fundamental principles in large cations and some of the tradeoffs involved.
1560 PROCEEDINGS OF THE IEEE, VOL. 78, NO. 10, OCTOBER 1990
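The string-learning procedure described above, in which backpropagation through time is run over one complete string and the weights are updated before moving to the next string, can be sketched as follows. This is a minimal illustration using a hypothetical one-unit linear recurrent model y(t) = wx*x(t) + wr*y(t-1), not the F-MODEL or F-ACTION subroutines of the paper; the function names and the model are assumptions made for the sketch.

```python
# Sketch of "string learning": BPTT over one string at a time, with a
# steepest-descent weight update after each string (hypothetical model).

def bptt_gradients(wx, wr, xs, ds):
    """Ordered derivatives of E = 0.5 * sum_t (y(t) - d(t))^2 for one string."""
    # Forward pass: record the state y(t) at every time step; y(0) = 0.
    ys = [0.0]
    for x in xs:
        ys.append(wx * x + wr * ys[-1])
    # Backward pass: F_y(t) = dE/dy(t), direct term plus the route
    # through y(t+1), accumulated from t = T down to t = 1.
    g_wx = g_wr = 0.0
    f_y = 0.0                       # F_y(T+1) = 0
    for t in range(len(xs), 0, -1):
        f_y = (ys[t] - ds[t - 1]) + wr * f_y
        g_wx += f_y * xs[t - 1]     # direct dependence of y(t) on wx
        g_wr += f_y * ys[t - 1]     # direct dependence of y(t) on wr
    return g_wx, g_wr

def train(strings, rate=0.05, epochs=200):
    wx, wr = 0.5, 0.0               # "memory" weight wr starts at zero
    for _ in range(epochs):
        for xs, ds in strings:      # one complete string at a time
            g_wx, g_wr = bptt_gradients(wx, wr, xs, ds)
            wx -= rate * g_wx       # steepest descent, as in (12)
            wr -= rate * g_wr
    return wx, wr
```

Note that the memory weight wr is initialized to zero, echoing the advice in Section E about keeping the "memory" weights near zero at first.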
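The penalized criterion mentioned in Section E, square error plus k times the sum of squared weights, can be made concrete with a linear network (no threshold term) trained on the three patterns of Table 1. The closed-form solution of that criterion is w = (X'X + kI)^(-1) X'Y; the helper name and the use of Cramer's rule below are illustrative choices, not from the paper. At k = 0 the sketch reduces to ordinary least squares, and as k grows the weights shrink toward zero.

```python
# Ridge-style solution of "square error + k * sum of squared weights"
# for a linear net on the three patterns of Table 1 (illustrative sketch).

X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]   # inputs x1(t), x2(t)
Y = [0.1, 0.1, 0.9]                        # targets Y(t)

def ridge_weights(k):
    # Accumulate the 2x2 matrix A = X'X + kI and the vector b = X'Y.
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), y in zip(X, Y):
        a11 += x1 * x1
        a12 += x1 * x2
        a22 += x2 * x2
        b1 += x1 * y
        b2 += x2 * y
    a11 += k
    a22 += k
    # Solve the 2x2 system A w = b by Cramer's rule.
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2
```

For these three patterns, ridge_weights(0.0) gives the ordinary least-squares weights, while large k drives both weights toward zero; varying k between these extremes trades training error against weight size, which is the tradeoff the penalized criterion is meant to control.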