Abstract-We present experimental results of adaptive learning using an optically controlled neural network. We have used example problems in nonlinear system identification and signal prediction, two common areas of potential neural network application, to study the capabilities of analog neural hardware. These experiments investigate the effects of a variety of nonidealities typical of analog hardware systems. They show that networks using large arrays of nonuniform components can perform analog computations with a much higher degree of accuracy than might be expected, given the degree of variation in the network's elements. We have also investigated effects of other common nonidealities, such as noise, weight quantization, and dynamic range limitations.

INTRODUCTION

feedforward networks, the changes in the weights after each trial are proportional to the error itself. This leads to a system that settles to a stable weight configuration as the errors become small. Notice, however, that the changes become zero only for zero gradient of the error in weight space. This zero can represent either a true global minimum or only a local one but, from a practical standpoint, this gradient descent algorithm generally results in useful, if not optimal, solutions. (Several variations to this back-propagation approach have been used to address some of its deficiencies. These are reviewed by Jacobs [6].)

The adaptive technique described above has proven to be successful in digital software-based networks. Applying the same
and the output vector is given by

    o_j = f_j( Σ_i w_ij o_i ).
The generalized delta rule is derived from (3) and, for the layer in the figure, the weight changes are

    Δw_ij = η δ_j o_i.   (6)

In this relationship, η is the learning rate and is the constant of proportionality implicit in (3). If the layer in question is an output layer, then δ_j is given by

    δ_j = (t_j - o_j) f_j'   (7)

where t_j is the target, or desired, output vector and f_j' denotes differentiation of the neuron's activation function with respect to the input signal. However, if the layer is hidden inside the network, it is not immediately apparent what its target response should be. For these weights, δ_j is computed iteratively using

    δ_j = f_j' Σ_k δ_k w_jk   (8)

where δ_k and w_jk refer to the layer immediately after the one in question (i.e., to the right of the layer in Fig. 3). In practice, the input is first propagated forward through the network. The weight changes are first computed in the last layer, using (6) and (7), and then working backwards through the network layer by layer using (6) and (8) (hence the phrase "back-propagation of errors").
Some of the difficulty in realizing this procedure in analog hardware is apparent in (8). To compute these weight changes, it is necessary to know the values of the interconnection weights w_jk in the next layer of the network. Because hardware components are generally subject to statistical variations, we usually have only an approximate idea of what their actual values may be. It is not feasible in a realistic system to provide enough interconnection between the computer and the neural network hardware to monitor the value of each individual synapse. Similar problems may arise from uncertainties about the exact nature of the activation function and its derivative. In such cases, our only recourse is to work with the presumed values of these parameters. Since the hardware components will typically depart from their nominal values, one purpose of this study is to determine the extent to which this will influence the performance of the network.

The strength of the synaptic connections in our experimental network was controlled by projecting a variable-length bar of light along the parallel electrodes of the photoconductor (Fig. 2). The overall image projected onto the array consisted of a field 240 pixels in width, i.e., each bar of light could assume 240 discrete values of length. The values of the synaptic matrix obtained by this method are both bounded and quantized. In mathematical models of neural networks that use floating point computations, these are not usually important considerations. This is another significant difference between hardware and software networks.

The average measured behavior of our optical synapses was found to obey the relationship

    G = [2.2 × 10⁻⁹ (Ω⁻¹)] L   (9)

where G is the conductance and L is the programmed length expressed as an integer ranging from 0 to 240. The measurements were taken by programming simultaneously all 10 connections in the hidden-to-output layer and, consequently, represent the average behavior of a collection of 10 synapses. In addition, we found significant temporal scatter in the values, particularly at larger programmed values, resulting from flicker in the projected CRT image coupling into the synapses.
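A small numerical sketch of the synapse model follows. The linear relation G = 2.2 × 10⁻⁹ L is taken from (9); the mapping of a desired weight onto a bar length and the size of the flicker term are our own assumptions, included only to illustrate the bounded, quantized, and noisy character of the programmed conductances.

    import numpy as np

    G_PER_STEP = 2.2e-9     # conductance per unit of programmed length, from (9), in ohm^-1

    def program_synapse(target_weight, w_max, rng=None, flicker=0.0):
        """Map a desired weight in [0, w_max] onto an integer bar length L (0 to 240)
        and return (L, G) with G = 2.2e-9 * L; 'flicker' adds assumed CRT noise."""
        L = int(round(240 * np.clip(target_weight / w_max, 0.0, 1.0)))  # bounded, quantized
        G = G_PER_STEP * L
        if rng is not None and flicker > 0.0:
            # Multiplicative noise: the absolute scatter grows with the programmed value.
            G *= 1.0 + flicker * rng.standard_normal()
        return L, G

    rng = np.random.default_rng(1)
    print(program_synapse(0.5, w_max=1.0))                         # ideal mid-range synapse
    print(program_synapse(0.5, w_max=1.0, rng=rng, flicker=0.02))  # with an assumed 2% flicker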
SYSTEM IDENTIFICATION

As a test of the capability of this hardware to adaptively learn a simple nonlinear task, we chose the system identification problem of a ballistic trajectory. For a projectile launched from the coordinates (0, 0) at time t = 0 with a random velocity v, angle θ, and downward acceleration a, its (x, y) coordinates at a later time t_0 will be given by

    x = v t_0 cos(θ),

and

    y = v t_0 sin(θ) - a t_0²/2.   (10)

The ranges of allowed values for v and θ (0.116 < v < 0.2, 60° < θ < 90°) were restricted such that, at time t_0 = 10, the projectile would be confined to the region (0 < x < 1), (0 < y < 1). The objective of the network was to compute the projectile's location at t_0 = 10, given v and θ as its inputs. The difference between the network output (x, y) and the computer-generated target (x_t, y_t) defined the error vector. This error vector, together with the measured value of the outputs of the hidden neurons, provided the data used to calculate the weight changes using the generalized delta rule as described above.
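The training pairs for this task can be generated directly from (10), as in the sketch below. This is our own illustration, not the authors' code; the value of the downward acceleration a is an assumption chosen only so that the targets fall inside the unit square, and the number of examples and the random seed are arbitrary.

    import numpy as np

    def ballistic_targets(n, t0=10.0, a=0.02, seed=0):
        """Generate (v, theta) inputs and (x, y) targets from (10)."""
        rng = np.random.default_rng(seed)
        v = rng.uniform(0.116, 0.2, n)                    # allowed velocity range
        theta = np.radians(rng.uniform(60.0, 90.0, n))    # allowed launch angles
        x = v * t0 * np.cos(theta)                        # eq. (10)
        y = v * t0 * np.sin(theta) - a * t0**2 / 2.0
        return np.column_stack([v, theta]), np.column_stack([x, y])

    inputs, targets = ballistic_targets(1000)             # 1000 random training examples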
the electronic components in the network. For this example, we
SIGNAL PREDICTION
As an example of using a neural network in signal prediction, we chose the two-dimensional deterministic chaotic relation first studied by Henon [11]. In the equations describing this relationship, two sequences x_t and y_t are dynamically coupled according to:

    x_{t+1} = y_t + 1 - a x_t²
    y_{t+1} = b x_t.   (11)

For the constant values a = 1.4 and b = 0.3, and for initial conditions of x and y near unity, the sequences x_t and y_t exhibit chaotic behavior.
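A short sketch (ours, not part of the original work; the sequence length and starting point are arbitrary) that iterates (11) to produce the data used for training and testing:

    def henon_sequence(n, a=1.4, b=0.3, x0=1.0, y0=1.0):
        """Iterate the Henon map (11): x_{t+1} = y_t + 1 - a*x_t**2, y_{t+1} = b*x_t."""
        xs, ys = [x0], [y0]
        for _ in range(n - 1):
            x, y = xs[-1], ys[-1]
            xs.append(y + 1.0 - a * x * x)
            ys.append(b * x)
        return xs, ys

    # Successive pairs (x_t, y_t) -> (x_{t+1}, y_{t+1}) serve as network inputs and targets.
    xs, ys = henon_sequence(5000)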
This particular choice of an example problem in signal prediction is of interest for several reasons. Lapedes and Farber [12] have considered the use of software-based neural networks to predict one-dimensional chaotic sequences, and showed their performance to be superior to several alternative predictive methods. Conventional techniques of linear prediction are obviously not capable of predicting the next value in the time sequence, given past values, with any degree of accuracy. In contrast to the ballistic system identification problem described above, (11) cannot be even closely approximated by a linear relationship. In practice, however, the problem is not significantly different from the system identification problem, since both require a nonlinear mapping from two input variables to two output variables.
The adaptive learning procedure differed from the one used in the system identification problem in one important respect. In our initial trials, an appropriate choice of the learning rate η was difficult to make. Using values of η greater than 0.1 typically resulted in divergent behavior of the network. At smaller values, the network was stable but did not converge to an answer that was a significantly better prediction than a random guess. We found the cause to be the quantization of the weights. Back-propagation learning is a gradient descent down an error surface in weight space. In a mathematically ideal neural network, this surface would be a smooth continuous function of the weights w. However, if we introduce quantization of the weights, then the error surface becomes instead a family of discrete points. If the weight increment Δw calculated from (6) is less than one quantum, then the weight remains unchanged. In such a case, the network tends to remain in regions of weight space where the gradient between these points is low. This problem can be partially overcome by increasing the value of η, but significant increases will cause stability problems.
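The stalling mechanism is easy to reproduce numerically. In the fragment below (our illustration; the quantum size is an assumed scale, not the hardware's calibrated value), a weight increment computed from (6) that is smaller than one quantum disappears entirely when the weight is rounded back onto the discrete hardware levels:

    import numpy as np

    QUANTUM = 1.0 / 240.0            # one of 240 programmable weight levels (assumed scale)

    def quantize(w):
        """Round a weight onto the nearest discrete hardware level."""
        return np.round(w / QUANTUM) * QUANTUM

    w = quantize(0.5)                # current (quantized) weight
    dw = 0.1 * 0.02 * 0.6            # eta * delta_j * o_i from (6): well below one quantum
    print(quantize(w + dw) == w)     # True: the increment is lost and learning stalls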
The approach that we used to circumvent this problem is illustrated in Fig. 6. There were two synaptic connection matrices in the neural network hardware, denoted as M1 and M2. For the purposes of computation, two floating point matrices, M1' and M2', were used in the software. The values in these matrices were used in the computation of (4)-(7) and the resulting incremental changes to the weight matrices were first added to M1' and M2'. During the adaptive formation of the synaptic matrices, M1' and M2' serve as higher precision storage arrays in which small increments to the weights accumulate until the net change is large enough to appear in the hardware elements. In this scheme, quantization of the synaptic weights occurs only at the point of implementation. (This may be computationally equivalent to summing the weight changes from many trials before actually making the changes.) This procedure considerably improved the ability of the network to learn. During the learning phase of network operation, however, it makes it necessary to provide software storage arrays having considerably higher precision than the network hardware's.

Fig. 6. A schematic diagram of the computational scheme used to avoid some of the problems in learning with quantized weight values. Software matrices M1' and M2' were higher precision than the corresponding hardware matrices M1 and M2. Learning update calculations and weight storage were in the software arrays, and an input-output signal propagated through the hardware.
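A minimal sketch of this bookkeeping is given below. It is our own rendering of the scheme in Fig. 6, not the authors' implementation; the matrix shape, quantum size, and initialization are assumptions.

    import numpy as np

    class ShadowWeights:
        """A high-precision software matrix M' paired with a quantized hardware matrix M.
        Learning updates accumulate in M'; the hardware copy changes only when the
        accumulated value crosses one of its discrete levels (the Fig. 6 scheme)."""

        def __init__(self, shape, quantum=1.0 / 240.0, seed=0):
            self.quantum = quantum                                              # assumed resolution
            self.soft = np.random.default_rng(seed).uniform(0.0, 1.0, shape)    # M'
            self.hard = self.quantize(self.soft)                                # M

        def quantize(self, w):
            return np.round(w / self.quantum) * self.quantum

        def update(self, dw):
            self.soft += dw                        # updates and storage in software
            self.hard = self.quantize(self.soft)   # quantization only at implementation

    # Sub-quantum increments now accumulate instead of being discarded:
    m = ShadowWeights((2, 5))
    before = m.hard.copy()
    for _ in range(5):
        m.update(np.full((2, 5), 0.0012))          # each step is below one quantum (~0.0042)
    print(bool((m.hard != before).all()))          # True: the accumulated change reaches hardware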
With this improvement, the hardware network learned to predict x_{t+1} and y_{t+1}, given x_t and y_t, to an accuracy of 16%, compared with 12% for an idealized software network having the same quantized weight matrices. The similarity between the two would, at first glance, suggest that they have learned to exhibit similar responses to their inputs. A more detailed examination, however, shows that the hardware and software, despite their similar degrees of accuracy, are quite different. Fig. 7(a) shows the locus of target points that both the hardware and software networks were being trained to output. (Note that these points are just a rescaled version of the well-known Henon attractor.) These data represent the last 1000 points, from trials 4001 to 5000, in the learning process. Comparison of the output of the software network, shown in Fig. 7(b), with that of the hardware, shown in Fig. 7(c), suggests that both sets of data approximate the ideal behavior to a similar degree of accuracy, but in qualitatively different ways.

The results of our hardware experiments show that the back-propagation learning procedure leads to a high degree of internal compensation for the hardware's nonidealities. The next logical line of investigation, then, is to determine which of the
Fig. 9. Average residual error after training as a function of synaptic weight quantization, obtained from simulation.

Fig. 11. Average residual error after training as a function of synaptic component variation, using full feedback from all neurons in the network and partial feedback from only the output neurons.
interfere with the ability of the network to successfully learn its target behavior. Inadequate dynamic range in the synaptic weights can directly result in degraded performance. Even large component nonuniformities, however, appear to have a minimal effect on the network's performance if the training is done directly on the hardware.
REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: M.I.T. Press, 1986.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, p. 533, 1986.
[3] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in IRE WESCON Convention Record. New York: IRE, 1960, pp. 96-104.
[4] B. Widrow and M. E. Hoff, "Associative storage and retrieval of digital information in networks of adaptive neurons," in Biological Prototypes and Synthetic Systems, vol. 1, E. E. Bernard and M. R. Kane, Eds. New York: Plenum, 1962.
[5] Y. LeCun, "Learning process in an asymmetric threshold network," in Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman Soulie, and G. Weisbuch, Eds., NATO ASI Series F, vol. 20. Berlin: Springer-Verlag, 1986.
[6] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, p. 295, 1988.
[7] M. A. Sivilotti, M. R. Emerling, and C. A. Mead, "VLSI architecture for implementation of neural networks," in AIP Conf. Proc. 151: Neural Networks for Computing, J. S. Denker, Ed. New York: American Institute of Physics, 1986.
[8] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[9] C. D. Kornfeld, R. C. Frye, C. C. Wong, and E. A. Rietman, "An optically programmed neural network," in IEEE 2nd Int. Conf. Neural Networks, San Diego, CA, 1988, vol. II, p. 357.
[10] E. A. Rietman, R. C. Frye, C. C. Wong, and C. D. Kornfeld, "Amorphous silicon photoconductive arrays for artificial neural networks," Applied Optics, to be published.
[11] M. Henon, "A two-dimensional mapping with a strange attractor," Comm. Math. Phys., vol. 50, p. 69, 1976.
[12] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: Prediction and system modeling," Los Alamos National Laboratory Report LA-UR-87-2662.
[13] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.

Robert C. Frye (M'90) received the B.S. in electrical engineering from Massachusetts Institute of Technology, Cambridge, in 1973. In 1975, he returned to M.I.T. and received the Ph.D. in electrical engineering in 1980.
From 1973 to 1975, he was with the Central Research Laboratories of Texas Instruments, where he worked on charge-coupled devices for analog signal processing. Since then, he has been with AT&T Bell Laboratories, Murray Hill, NJ, where he is currently a Distinguished Member of Technical Staff in the Electronic Materials Research Department.
Dr. Frye is a member of the Materials Research Society and a charter member of the International Neural Network Society.

Edward A. Rietman received the B.S. in physics and chemistry from the University of North Carolina in 1980 and the M.S. in materials science from Stevens Technical Institute in 1984.
He has been with AT&T Bell Laboratories, Murray Hill, NJ, since 1982 and has worked on polymer conductors, charge density wave materials, and ceramic superconductors. He began working on neural networks in 1987. He is the author of three books and the editor of the Journal of the British American Scientific Research Association.
Mr. Rietman is a member of the International Neural Network Society.

Chee C. Wong received the Ph.D. in materials science from Massachusetts Institute of Technology, Cambridge, in 1986.
Since then, he has been a Member of Technical Staff in the Electronic Materials Research Department at AT&T Bell Laboratories, Murray Hill, NJ. His research interests include amorphous silicon devices, thin-film processing, and advanced interconnection technology.