

Back-Propagation Learning and Nonidealities in
Analog Neural Network Hardware

Robert C. Frye, Member, IEEE, Edward A. Rietman, and Chee C. Wong

Manuscript received November 3, 1989; revised July 17, 1990.
The authors are with AT&T Bell Laboratories, Murray Hill, NJ 07974.
IEEE Log Number 9038746.

Abstract-We present experimental results of adaptive learning using an optically controlled neural network. We have used example problems in nonlinear system identification and signal prediction, two common areas of potential neural network application, to study the capabilities of analog neural hardware. These experiments investigate the effects of a variety of nonidealities typical of analog hardware systems. They show that networks using large arrays of nonuniform components can perform analog computations with a much higher degree of accuracy than might be expected, given the degree of variation in the network's elements. We have also investigated effects of other common nonidealities, such as noise, weight quantization, and dynamic range limitations.

INTRODUCTION

LAYERED networks of interconnected nonlinear neurons are finding increasing use in adaptive problems. Algorithmic procedures have been developed that enable these networks to learn through a trial and error procedure. In this learning process, the connection strengths between the active elements of the network are gradually modified until the circuit exhibits a desired behavior. A widely used adaptive method is the back-propagation of errors technique, as discussed by Rumelhart, Hinton, and Williams [1], [2], which is a generalization of the delta rule developed by Widrow and Hoff [3], [4] and by Le Cun [5]. The method seeks to minimize the error in the output of the network as compared to a target, or desired, response. For a network having multiple outputs, the rms error is given by

E = [ \sum_j (t_j - o_j)^2 ]^{1/2}    (1)

where t_j and o_j are the target and the output values for the jth component of the vectors. The goal of the back-propagation learning procedure is to minimize this error. If the network is time invariant, then its output will depend only on its inputs i_i and the current value of the connection weight matrices w_ij, i.e.,

o_j = f(i_i, w_{ij}).    (2)

For a given input vector, therefore, the error is determined by the values of the weighting coefficients that connect the network. The approach used in the adaptive procedure is to change these connections by an amount proportional to the gradient of the error in weight space, i.e.,

\Delta w_{ij} \propto -\frac{\partial E}{\partial w_{ij}}.    (3)

This procedure generally results in reductions in the average error as the weight matrices in the network evolve. For layered feedforward networks, the changes in the weights after each trial are proportional to the error itself. This leads to a system that settles to a stable weight configuration as the errors become small. Notice, however, that the changes become zero only for zero gradient of the error in weight space. This zero can represent either a true global minimum or only a local one but, from a practical standpoint, this gradient descent algorithm generally results in useful, if not optimal, solutions. (Several variations to this back-propagation approach have been used to address some of its deficiencies. These are reviewed by Jacobs [6].)

The adaptive technique described above has proven to be successful in digital software-based networks. Applying the same techniques to analog electronic hardware, however, has proven to be more difficult. Adaptive analog hardware requires the fabrication and interactive control of large numbers of variable resistance interconnections. This is simple in concept, but poses a formidable task in practice. By far the most promising method currently available for fabricating and interconnecting large numbers of electronic components is VLSI. An attendant consequence of this technology, however, is the surprisingly large component-to-component variation that is typical of VLSI devices [7]. These variations, which have a negligible effect on digital circuits, can be particularly troublesome in the fabrication of analog circuits based on resistive arrays, since the circuit function depends critically on the component values. Adaptive techniques are viewed as a possible way to build analog VLSI networks that are not overly constrained by these problems of component variations.

In this paper, we will present experimental results on adaptive learning using standard back-propagation techniques applied to an optically controlled hardware system. The particular hardware that we used showed a degree of nonuniformity similar to that in VLSI circuits, and many of the conclusions of these learning studies can be readily applied to VLSI and other alternative hardware systems. For many applications, analog neural network hardware offers significant speed advantages because of the high degree of parallelism in the architecture. Software-based networks, while exploiting the same adaptive advantages, fail to realize this parallelism. In building networks from real electronic components, however, we are confronted by several departures from the mathematically ideal networks that have been realized in software. The experiments that we will describe used back-propagation learning to solve simple problems of system identification and signal prediction in a hardware-based network. The purpose of this study was to explore the ways in which the nonidealities common to electronic hardware components influence these adaptive systems.
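As a minimal illustration of (1)-(3), the following sketch (our example, not code from the paper; the single tanh layer, the learning rate, and the finite-difference gradient estimate are arbitrary choices) drives a small weight matrix down the error surface.

```python
import numpy as np

def rms_error(t, o):
    # Eq. (1): error between the target vector t and the output vector o.
    return np.sqrt(np.sum((t - o) ** 2))

def output(i, w):
    # Eq. (2): a stand-in time-invariant network, here a single tanh layer.
    return np.tanh(i @ w)

rng = np.random.default_rng(0)
i = rng.uniform(-1.0, 1.0, 3)          # input vector
w = rng.uniform(-0.1, 0.1, (3, 2))     # connection weight matrix
t = np.array([0.3, -0.2])              # target (desired) response

eta, h = 0.05, 1e-6                    # learning rate and finite-difference step
print("initial error:", rms_error(t, output(i, w)))
for _ in range(500):
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):   # numerical estimate of dE/dw_ij
        wp = w.copy()
        wp[idx] += h
        grad[idx] = (rms_error(t, output(i, wp)) - rms_error(t, output(i, w))) / h
    w -= eta * grad                    # Eq. (3): change proportional to -dE/dw
print("final error:  ", rms_error(t, output(i, w)))
```

The layered-network form of this update, the generalized delta rule, is taken up below.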



Adaptive system identification and signal prediction have been widely used for many years, but have been primarily restricted to linear systems and signals for which the mathematical treatment of the problems is tractable. The use of nonlinear adaptive networks for these same applications is a direct extension of the linear adaptive process described by Widrow and Stearns [8]. The recent development of the back-propagation learning algorithms described above allows these same methods to be applied to nonlinear systems and signals. In system identification, for example, the basic objective is to adaptively train a network to emulate another unknown system, often referred to as the "plant" in control theory. The plant can be a physical system, like a robot arm or an airplane, or it may be a mathematical model. The neural network is trained by providing common inputs to both the network and the plant, and then comparing their responses. Signal prediction is closely related to system identification. The only important difference is that the desired response comes from a time sequence rather than a plant. In the learning stage, the network is taught to output the current value of the signal given only a delayed version as its input. After the training is completed, the input can be fed directly into the network, with no delay, and its output will predict the future values of the signal. Predictors have been widely used in signal encoding and noise reduction applications, but until recently have been restricted to linear systems.

ADAPTIVE HARDWARE

In previous work, we used optically controlled synapses to program a Hopfield content-addressable memory [9], [10]. The synaptic weight matrix in this network was built from an array of 120 x 120 photoconductive elements. The interconnection weights, or synaptic values, were varied using light. In the result reported here, we have taken a similar approach, but the synapses are instead a linear array of elements with a configuration similar to that shown in Fig. 1. The synaptic array consists of 120 narrow metal lines deposited on a layer of glow-discharge deposited amorphous silicon. The region between the lines forms a photoconductor, whose electrical conductance can be controlled by an appropriate image projected onto the array. Unlike the photoconductive synaptic elements used in our previous studies, the lateral geometry of these devices gives them more linear current-voltage response characteristics. Measured under uniform illumination, the elements in this array show an overall variation of roughly 10%. In the operation of the entire system, however, nonuniformities and misalignments in the optical illumination components tend to increase this variation to about ±30%. These component variations are an inescapable feature of nearly all large-scale hardware networks and are one important way in which hardware networks differ from their ideal software counterparts. The connection strength of the photosynaptic elements was modulated by changing the length of the bars of light between appropriate electrodes in the array. In our system, this synaptic image was dynamically generated by a computer and projected onto the photoconductive array using a high-intensity projection CRT.

Fig. 1. Basic configuration used for the optically controlled photoconductive synapses.

As Fig. 1 shows, the connections to the electronic neurons were made in differential pairs. This made it possible to implement both excitatory and inhibitory inputs. Depending on the sign of the synaptic weight, the image was projected onto a region of the array to establish conductance between the input and either the positive or negative summing node of the neuron. The neuron circuit, designed to differentially sum its input currents, is shown in Fig. 2. In this circuit, the first stage for both the positive and negative input channels consisted of a simple transconductance amplifier that had a virtual ground input and an output voltage proportional to its input current. The output of the negative channel's transconductance amplifier passed through a voltage inverter and was then summed with the positive channel. The two diodes in the feedback path of the final stage gave the neuron the sigmoidally shaped response characteristic shown in Fig. 2(b). The curve labeled "model" shows a hyperbolic tangent scaled to match the slope through the origin and the limiting behavior of the neuron. This model was used for comparative simulations of the hardware.

Fig. 2. Nonlinear differential current summing neuron circuit and its response characteristics.

Using optically programmable synapses and electronic neurons, we designed and built a layered feedforward network with three analog inputs, ten hidden nonlinear neurons, and two output neurons. These output neurons were identical to the hidden ones but did not have diodes in their feedback path and, consequently, had a linear output response. The interconnections between the neurons were the optically programmable photosynapses. In the implementation of the back-propagation learning, a digital computer was used to generate the training data, evaluate the output and error, and to compute and update the image that programmed the conductance values in the interconnection array. Inputs to the network were analog voltage levels, and the outputs of the network and the hidden neurons were monitored and used in the weight-update calculations.
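A small behavioral model of one optically programmed synapse feeding the differential current-summing neuron described above might look as follows. This is our sketch: the conductance-per-length scale anticipates the measured relation (9) given in the next section, while the neuron gain and saturation level are purely illustrative.

```python
import numpy as np

G_PER_STEP = 2.2e-9   # synapse conductance per unit of programmed bar length (see (9) below)
V_MAX = 0.5           # assumed output saturation level of the neuron model
GAIN = 10.0           # assumed small-signal gain applied to the differential current (in uA)

def synapse_current(v_in, bar_length):
    """Current from one optically programmed synapse: Ohm's law with G = G_PER_STEP * L."""
    return v_in * G_PER_STEP * bar_length

def neuron(i_plus, i_minus):
    """Differential current-summing neuron with the tanh 'model' response of Fig. 2(b)."""
    i_diff_uA = (i_plus - i_minus) * 1e6
    return V_MAX * np.tanh(GAIN * i_diff_uA)   # saturation from the diodes in the feedback path

# Excitatory or inhibitory action comes from steering each synapse to the positive
# or negative summing node, according to the sign of its weight.
v_inputs = np.array([1.0, 0.4, -0.8])          # analog input voltages
bar_lengths = np.array([120, 200, 60])         # programmed bar lengths (0..240 steps)
signs = np.array([+1, -1, +1])                 # + excitatory, - inhibitory

i_pos = sum(synapse_current(v, L) for v, L, s in zip(v_inputs, bar_lengths, signs) if s > 0)
i_neg = sum(synapse_current(v, L) for v, L, s in zip(v_inputs, bar_lengths, signs) if s < 0)
print(neuron(i_pos, i_neg))
```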

Computation of the weight changes, according to the gradient descent method expressed in (3), is usually done by the generalized delta rule described by Rumelhart, Hinton, and Williams. This rule was derived from a strictly mathematical standpoint. When we seek to apply the same method to an analog hardware-based network, some of the information needed to calculate the weight changes may not be available in an accurate form, since the components often depart from their ideal programmed values. This problem is demonstrated in the following discussion of the delta rule. Fig. 3 shows a generic layer within a layered feedforward network. In this representation, the output of the layer is a vector of signals o_j. Its input vector o_i may itself be an output from a preceding layer and its output vector o_j may, in turn, provide input to the next layer. The neurons have an activation function f, and are coupled to the input vector by a weight matrix w_ij. The net input to each neuron is given by

net_j = \sum_i o_i w_{ij}    (4)

and the output vector is given by

o_j = f(net_j).    (5)

Fig. 3. Diagram of a generic layer in a feedforward network.

The generalized delta rule is derived from (3) and, for the layer in the figure, the weight changes are obtained by

\Delta w_{ij} = \eta o_i \delta_j.    (6)

In this relationship, η is the learning rate and is the constant of proportionality implicit in (3). If the layer in question is an output layer, then δ_j is given by

\delta_j = (t_j - o_j) f'_j    (7)

where t_j is the target, or desired, output vector and f'_j denotes differentiation of the neuron's activation function with respect to the input signal. However, if the layer is hidden inside the network, it is not immediately apparent what its target response should be. For these weights, δ_j is computed iteratively using

\delta_j = f'_j \sum_k \delta_k w_{jk}    (8)

where δ_k and w_jk refer to the layer immediately after the one in question (i.e., to the right of the layer in Fig. 3). In practice, the input is first propagated forward through the network. The weight changes are first computed in the last layer, using (6) and (7), and then working backwards through the network layer by layer using (6) and (8) (hence the phrase "back-propagation of errors").

Some of the difficulty in realizing this procedure in analog hardware is apparent in (8). To compute these weight changes, it is necessary to know the values of the interconnection weights w_jk in the next layer of the network. Because hardware components are generally subject to statistical variations, we usually have only an approximate idea of what their actual values may be. It is not feasible in a realistic system to provide enough interconnection between the computer and the neural network hardware to monitor the value of each individual synapse. Similar problems may arise from uncertainties about the exact nature of the activation function f and its derivative. In such cases, our only recourse is to work with the presumed values of these parameters. Since the hardware components will typically depart from their nominal values, one purpose of this study is to determine the extent to which this will influence the performance of the network.

The strength of the synaptic connections in our experimental network was controlled by projecting a variable-length bar of light along the parallel electrodes of the photoconductor (Fig. 2). The overall image projected onto the array consisted of a field 240 pixels in width, i.e., each bar of light could assume 240 discrete values of length. The values of the synaptic matrix obtained by this method are both bounded and quantized. In mathematical models of neural networks that use floating point computations, these are not usually important considerations. This is another significant difference between hardware and software networks.

The average measured behavior of our optical synapses was found to obey the relationship

G = [2.2 \times 10^{-9} (\Omega^{-1})] L    (9)

where G is the conductance and L is the programmed length expressed as an integer ranging from 0 to 240. The measurements were taken by programming simultaneously all 10 connections in the hidden-to-output layer and, consequently, represent the average behavior of a collection of 10 synapses. In addition, we found significant temporal scatter in the values, particularly at larger programmed values, resulting from flicker in the projected CRT image coupling into the synapses.

SYSTEM IDENTIFICATION

As a test of the capability of this hardware to adaptively learn a simple nonlinear task, we chose the system identification problem of a ballistic trajectory. For a projectile launched from the coordinates (0, 0) at time t = 0 with a random velocity v, angle θ, and downward acceleration a, its (x, y) coordinates at a later time t_0 will be given by

x = v t_0 \cos(\theta),

and

y = v t_0 \sin(\theta) - a t_0^2 / 2.    (10)

The ranges of allowed values for v and θ (0.116 < v < 0.2, 60° < θ < 90°) were restricted such that, at time t_0 = 10, the projectile would be confined to the region (0 < x < 1), (0 < y < 1). The objective of the network was to compute the projectile's location at t_0 = 10, given v and θ as its inputs. The difference between the network output (x, y) and the computer generated target (x_t, y_t) defined the error vector. This error vector, together with the measured value of the outputs of the hidden neurons, provided the data used to calculate the weight changes using the generalized delta rule as described above.
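Equations (4)-(8) translate directly into a few lines of code. The sketch below is ours, not the authors' control software; the tanh hidden units, linear output units, and learning rate are assumptions consistent with the 3-input, 10-hidden, 2-output hardware described above.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.1, 0.1, (3, 10))   # input-to-hidden weights w_ij
W2 = rng.uniform(-0.1, 0.1, (10, 2))   # hidden-to-output weights w_jk
eta = 0.1                              # learning rate (assumed value)

def forward(x):
    net_h = x @ W1                     # Eq. (4): net_j = sum_i o_i w_ij
    o_h = np.tanh(net_h)               # Eq. (5): o_j = f(net_j), tanh hidden neurons
    o_out = o_h @ W2                   # linear output neurons (f is the identity)
    return o_h, o_out

def train_step(x, t):
    global W1, W2
    o_h, o_out = forward(x)
    # Output layer, Eq. (7): delta_k = (t_k - o_k) f'_k, with f' = 1 for linear outputs.
    delta_out = t - o_out
    # Hidden layer, Eq. (8): delta_j = f'_j * sum_k delta_k w_jk, with f' = 1 - tanh^2.
    delta_h = (1.0 - o_h ** 2) * (delta_out @ W2.T)
    # Eq. (6), applied layer by layer, working backwards ("back-propagation of errors").
    W2 += eta * np.outer(o_h, delta_out)
    W1 += eta * np.outer(x, delta_h)
```

In the hardware experiments this arithmetic runs on the host computer, except that the weights w_jk entering (8) are only the presumed values programmed through (9), a point taken up next.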

However, in implementing (8) to calculate the weight changes, only the presumed values of the weights [calculated from (9)] could be used, since their exact ones are not known.

In this particular problem of system identification, the plant was a set of two nonlinear equations relating the inputs, v and θ, to the outputs, x and y. The network's task is to learn the mapping between these two coordinate sets. In operation, we have generally found it useful when possible to scale the ranges of the input and output variables to suit the operating ranges of the electronic components in the network. For this example, we assigned the analog inputs for v and θ a voltage ranging from 0 to 2 V, and the network was trained to output a value of -2 V for x or y = 0, ranging up to a value of +2 V for x or y = 1. For its starting point, the network was programmed with small random values for its connection strengths. In this initial state, its output was near 0 V regardless of the value of the inputs, corresponding to an initial guess by the network of x = 0.5, y = 0.5. For randomly chosen trials, the average error initially exhibited by the untrained network was 0.33. (This value is just the average distribution of the final target points about the point (0.5, 0.5) for the range of v and θ that we used.)
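For concreteness, here is one way the trial generator for this task could look. This is our sketch: the acceleration a = 0.02 and the linear input scaling are not stated in the paper and are chosen only so that the quoted ranges of v, θ, x, and y map onto the quoted voltage ranges; the constant 1 V threshold input is described in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(2)
t0, a = 10.0, 0.02    # evaluation time from the text; a = 0.02 (assumed) keeps the stated
                      # (v, theta) ranges inside the unit box 0 < x, y < 1

def trial():
    """One random training trial: scaled analog inputs and target output voltages."""
    v = rng.uniform(0.116, 0.2)                     # allowed velocity range
    th = np.radians(rng.uniform(60.0, 90.0))        # allowed launch angle range
    x = v * t0 * np.cos(th)                         # Eq. (10)
    y = v * t0 * np.sin(th) - a * t0 ** 2 / 2.0
    inputs = np.array([(v - 0.116) / 0.084 * 2.0,   # one possible mapping onto 0..2 V
                       (np.degrees(th) - 60.0) / 30.0 * 2.0,
                       1.0])                        # constant threshold input
    targets = np.array([4.0 * x - 2.0,              # x, y = 0 -> -2 V; x, y = 1 -> +2 V
                        4.0 * y - 2.0])
    return inputs, targets
```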
The learning sequence consisted simply of generating a randomly chosen trial v and θ. These two numbers were then presented as an analog voltage input to the hardware. In addition to these two inputs, a constant value of 1 V was applied to the third input. Synaptic connections between this constant input and the hidden neurons result in the adjustment of the neurons' thresholds. After applying a trial set of inputs, the outputs of the network were compared with the target points computed from (10). This relationship requires that the network learn to generate a nonlinear function of the two inputs. After comparing actual and target outputs, the weight changes were computed using the error at the output and the measured values of the hidden neurons. The new weight values were then translated by the computer into an appropriate image and projected onto the photoconductive array through the video interface. After this, a new trial data pair was presented to the network and the procedure repeated. An ideal model of the hardware was continuously run alongside the experiment, learning from the same set of trials. This simulation's performance was used to gauge the limits of accuracy that could be expected from perfect hardware components.

The average magnitude of the error vector exhibited by the network, implemented both in hardware and in software, as a function of the number of learning trials is shown in Fig. 4. Each data point in this figure represents the error averaged over the most recent 25 trials. In repeated training sequences, the hardware invariably needed more trials to reach its steady-state error level. However, the accuracy finally obtained from the hardware was nearly as low as that of the software network, despite the component-to-component variations in the hardware. The importance of adaptation is evident, since the average error in each output channel of the hardware was less than 4% using components with more than 30% variation.

In both the hardware and software versions of the network, the restriction on the maximum value of the synaptic weight was simply imposed as an external constraint on the system. In the computation of the weight changes, the value of the integer L in (9) was allowed to exceed ±240. Such constraints are not embodied in the mathematical description of the network, but the adaptive learning process successfully compensates for them. An example of a much more severe constraint is the failure of one of the network's neurons. In one series of trials,

compare these calculations with the experimentally observed values, we find only a vague correspondence.

The ability of the network hardware to compute at a level of accuracy greatly exceeding that of its components is not a result of the average response of a large ensemble of components. It is, instead, a consequence of the adaptive process itself. In effect, the network learns to execute its particular analog computational tasks in the presence of its individual nonidealities. In systems like ours, the degree of variation in the components is enough to render almost useless a precalculation of the synaptic weight coefficients. However, by adaptively learning in the presence of these same variations, their effect on the network is negligible.

SIGNAL PREDICTION

As an example of using a neural network in signal prediction, we chose the two-dimensional deterministic chaotic relation first studied by Henon [11]. In the equations describing this relationship, two sequences x_t and y_t are dynamically coupled according to

x_{t+1} = y_t + 1 - a x_t^2
y_{t+1} = b x_t.    (11)

For the constant values a = 1.4 and b = 0.3, and for initial conditions of x and y near unity, the sequences x_t and y_t exhibit chaotic behavior.

This particular choice of an example problem in signal prediction is of interest for several reasons. Lapedes and Farber [12] have considered the use of software-based neural networks to predict one-dimensional chaotic sequences, and showed their performance to be superior to several alternative predictive methods. Conventional techniques of linear prediction are obviously not capable of predicting the next value in the time sequence, given past values, with any degree of accuracy. In contrast to the ballistic system identification problem described above, (11) cannot be even closely approximated by a linear relationship. In practice, however, the problem is not significantly different from the system identification problem, since both require a nonlinear mapping from two input variables to two output variables.

The adaptive learning procedure differed from the one used in the system identification problem in one important respect. In our initial trials, an appropriate choice of the learning rate η was difficult to make. Using values of η greater than 0.1 typically resulted in divergent behavior of the network. At smaller values, the network was stable but did not converge to an answer that was a significantly better prediction than a random guess. We found the cause to be the quantization of the weights. Back-propagation learning is a gradient descent down an error surface in weight space. In a mathematically ideal neural network, this surface would be a smooth continuous function of the weights w. However, if we introduce quantization of the weights, then the error surface becomes instead a family of discrete points. If the weight increment Δw calculated from (6) is less than one quantum, then the weight remains unchanged. In such a case, the network tends to remain in regions of weight space where the gradient between these points is low. This problem can be partially overcome by increasing the value of η, but significant increases will cause stability problems.

The approach that we used to circumvent this problem is illustrated in Fig. 6. There were two synaptic connection matrices in the neural network hardware, denoted as M_1 and M_2. For the purposes of computation, two floating point matrices, M_1' and M_2', were used in the software. The values in these matrices were used in the computation of (4)-(7), and the resulting incremental changes to the weight matrices were first added to M_1' and M_2'. During the adaptive formation of the synaptic matrices, M_1' and M_2' serve as higher precision storage arrays in which small increments to the weights accumulate until the net change is large enough to appear in the hardware elements. In this scheme, quantization of the synaptic weights occurs only at the point of implementation. (This may be computationally equivalent to summing the weight changes from many trials before actually making the changes.) This procedure considerably improved the ability of the network to learn. During the learning phase of network operation, however, it makes it necessary to provide software storage arrays having considerably higher precision than the network hardware's.

Fig. 6. A schematic diagram of the computational scheme used to avoid some of the problems in learning with quantized weight values. Software matrices M_1' and M_2' were higher precision than the corresponding hardware matrices M_1 and M_2. Learning update calculations and weight storage were in the software arrays, and an input-output signal propagated through the hardware.

With this improvement, the hardware network learned to predict x_{t+1} and y_{t+1}, given x_t and y_t, to an accuracy of 16%, compared with 12% for an idealized software network having the same quantized weight matrices. The similarity between the two would, at first glance, suggest that they have learned to exhibit similar responses to their inputs. A more detailed examination, however, shows that the hardware and software, despite their similar degrees of accuracy, are quite different. Fig. 7(a) shows the locus of target points that both the hardware and software networks were being trained to output. (Note that these points are just a rescaled version of the well-known Henon attractor.) These data represent the last 1000 points, from trials 4001 to 5000, in the learning process. Comparison of the output of the software network, shown in Fig. 7(b), with that of the hardware, shown in Fig. 7(c), suggests that both sets of data approximate the ideal behavior to a similar degree of accuracy, but in qualitatively different ways.

Fig. 7. Locus of (x, y) points (attractor) for the (a) target response and for the output of the (b) software and (c) hardware networks.
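To make the quantized-weight scheme of Fig. 6 concrete, the sketch below (our construction; the normalized ±1 weight range, the initial values, and the Henon starting point are assumptions) pairs a generator for the sequences of (11) with a high-precision software matrix M' that shadows a bounded, quantized hardware matrix M.

```python
import numpy as np

def henon(n, a=1.4, b=0.3):
    """Generate n points of the coupled sequences of Eq. (11)."""
    x, y = 1.0, 1.0                       # initial conditions near unity
    pts = []
    for _ in range(n):
        pts.append((x, y))
        x, y = y + 1.0 - a * x * x, b * x
    return np.array(pts)

QUANTUM = 2.0 / 240.0   # one synaptic quantum: 240 programmable bar lengths spread over
                        # an assumed normalized weight range of -1..+1

class ShadowedMatrix:
    """Floating-point matrix M' shadowing the quantized, bounded hardware matrix M (Fig. 6)."""
    def __init__(self, shape, rng):
        self.soft = rng.uniform(-0.05, 0.05, shape)   # high-precision storage, M'
        self.hard = self._quantize(self.soft)         # what the hardware actually holds, M

    def _quantize(self, W):
        return np.clip(np.round(W / QUANTUM) * QUANTUM, -1.0, 1.0)

    def update(self, dW):
        # Increments from (6) accumulate in M'; the hardware value changes only when the
        # accumulated change crosses a quantum boundary.
        self.soft += dW
        self.hard = self._quantize(self.soft)
```

Signals would propagate through the .hard matrices (the hardware), while the delta-rule arithmetic of (4)-(7) uses the .soft values as the presumed weights, as described above.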

The results of our hardware experiments show that the back-propagation learning procedure leads to a high degree of internal compensation for the hardware's nonidealities. The next logical line of investigation, then, is to determine which of the nonidealities commonly associated with electronic hardware have the most, or least, influence on the network's performance. Since we are unable to vary many of these parameters in the hardware, we rely instead on simulation results to identify their importance.

SIMULATION RESULTS

In our experiments, we have encountered several nonidealities common to analog neural network hardware that are not generally found in software-based networks and are not included in the mathematical formulations of adaptive learning. Noise, for example, can arise from many sources and may be introduced at various points into the network. In addition, as we have discussed above, weight quantization and nonuniformities in the synapse hardware affect the performance, and the maximum obtainable connection strength in hardware is limited.

For the simulation results that follow, the neuron activation function in the hidden layer was a hyperbolic tangent, and the neurons in the output layer were unity gain linear elements. The architecture was identical to the hardware's, having 10 hidden neurons, and the results below were obtained from training on the same prediction problem described in (11). For reference, an ideal neural network with this architecture implemented in software with double-precision floating point synapses reached a level of error of 2% after 20 000 learning trials. Presumably, this represents the limit of the network's capabilities. Data points in the following curves each represent the error averaged over 1000 trials for a single 20 000 trial learning run. We typically find that, for this particular problem, the convergence is good and the performance limits show little run-to-run variation.

The effects of noise introduced at both input and output after extensive training are shown in Fig. 8. Both are qualitatively similar. At higher levels of noise, the error shows a roughly linear increase comparable to the magnitude of the noise itself. For example, an rms noise level of 0.1 at the output results in an average prediction error of 0.1. Apparently, the noise contributes directly to inaccuracies in the output, but does not have a noticeable effect on the ability of the network to learn. Noise at the output of the network contributes to an inaccurate measure of the error, and would be expected to result in miscalculation of the weight changes. However, these miscalculations are small and, because they are uncorrelated with any other signals in the network, their time-averaged effect is negligible. The slightly stronger dependence of the error on input noise is probably a reflection of the network's nonlinear nature. The rms noise measured at the output of our hardware network, for comparison, was about 0.05 (normalized to the input signal range).

Fig. 8. Average residual error after training as a function of rms noise levels at the input and output of the network, obtained from simulation.

Fig. 9 shows the effects of weight quantization on the long-term average error. Quantization error is similar to noise in its effects. In fact, a common way to treat quantization or roundoff errors analytically is to model them with equivalent noise sources [13]. The approximate normalized level of quantization in our hardware experiments was 0.002. These simulations suggest that, for this particular problem, the precision in the hardware's synaptic matrix was more than adequate.

Fig. 9. Average residual error after training as a function of synaptic weight quantization, obtained from simulation.

Fig. 10 shows the effects of limiting the maximum weight, or dynamic range, in the synaptic connections. Below a value of unity, the performance degrades rapidly. For our hardware circuit, this maximum weight was about 0.8. (This number represents the product of the maximum synaptic conductance and the gain of the neuron.) These results indicate that the limited dynamic range was the most important limitation to the hardware's performance. Since the input and output in this problem were normalized to have a dynamic range of unity, this result probably reflects the need for the network as a whole to exhibit sufficient overall gain.

Fig. 10. Average residual error after training as a function of maximum synaptic weight, obtained from simulation.

We have simulated the effects of component variations by introducing a second factor into the synaptic weight coefficients, so that the actual weight w_ij was related to the ideal weight w'_ij by

w_{ij} = r_{ij} w'_{ij}    (12)

where r_ij was a time-invariant random distribution having a mean of unity. Computationally, this simulates a distribution of device response commonly found in most analog electronic implementations. For simplicity, however, the distribution in the simulations was uniform rather than the bell-shaped one usually seen in practice. To simulate the effects of unknown component variations, the synapse weight w_ij was used to determine the output response of the neurons. However, since this weight is not generally known in real hardware, only the ideal weight w'_ij was used in the weight update calculations.

If we compare simulation results, incorporating measured values of noise, weight quantization, dynamic range, and component variation, with the analog hardware, we find good agreement (14% average error in the simulation versus 15% in the hardware). Interestingly, the 30% component variation was not found to appreciably degrade the network's performance.

In Fig. 11, we show the accuracy of the network as a function of component variations using two different weight update computational methods. For the data labeled "partial feedback," only the outputs of the neurons in the output layer were monitored. The outputs of the hidden neurons were computed based on the inputs and the presumed values of the synapses. For the data labeled "full feedback," all of the neurons in the network were monitored and their values used in the calculations. In the former case, component variations have little effect on the performance of the network until they reach a value of about 30%. Above this value, the network quickly lost its ability to converge and, for component variations exceeding 40%, its performance was no better than random guessing. By contrast, component variations had virtually no effect on a network in which the output of all the neurons was monitored. This insensitivity is obtained at the expense of a considerable increase in interconnection complexity, but it enables the network to perform analog computations at an accuracy that greatly exceeds the tolerance of the devices used to form the network.

Fig. 11. Average residual error after training as a function of synaptic component variation, using full feedback from all neurons in the network and partial feedback from only the output neurons.

In subsequent experiments, we have found that results like the one shown in Fig. 11 depend on the complexity of the learning task and the size of the network. For smaller networks, even with full feedback the errors may increase for large amounts of component variation. For larger networks, even partial feedback is often adequate to ensure insensitivity to large amounts of component variation. The most important requirement in all cases is that the adaptive learning process must be done on an individual basis for each analog hardware system, so that it can learn to compensate for the hardware's nonidealities.
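A sketch of the component-variation experiment of (12) and Fig. 11 follows (our code; the ±30% uniform spread follows the text, the rest is illustrative): the signal propagates through the actual weights r_ij w'_ij, the update arithmetic sees only the ideal weights, and "partial" versus "full" feedback differs only in whether the measured hidden-neuron outputs are available.

```python
import numpy as np

rng = np.random.default_rng(4)

def variation(shape, spread):
    """Eq. (12) factors r_ij: uniform, time-invariant, mean of unity (spread=0.3 for 30%)."""
    return rng.uniform(1.0 - spread, 1.0 + spread, shape)

# Ideal (presumed) weights W' and the fixed per-synapse variation factors r.
W1p = rng.uniform(-0.1, 0.1, (3, 10)); R1 = variation(W1p.shape, 0.3)
W2p = rng.uniform(-0.1, 0.1, (10, 2)); R2 = variation(W2p.shape, 0.3)

def forward_actual(x):
    """The 'hardware': signals propagate through the actual weights r_ij * w'_ij."""
    h = np.tanh(x @ (R1 * W1p))
    return h, h @ (R2 * W2p)

def hidden_deltas(x, h_measured, delta_out, full_feedback=True):
    """Update side: Eq. (8) always uses the presumed weights W'; full feedback uses the
    measured hidden outputs, partial feedback recomputes them from the presumed synapses."""
    h = h_measured if full_feedback else np.tanh(x @ W1p)
    return (1.0 - h ** 2) * (delta_out @ W2p.T)
```

The weight updates themselves are the same delta-rule arithmetic sketched earlier; only the source of the hidden-neuron values changes between the two curves of Fig. 11.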

CONCLUSION

The absence of precise knowledge about the synaptic connection matrices does not appear to restrict our ability to adaptively train analog hardware using back-propagation. The key to its successful implementation is to uniquely train each hardware system. If this is done, using the nominal values of the synaptic weights in the computation of the weight changes results in successful learning, even in systems for which the actual weights depart significantly from those nominal values. This is undoubtedly a consequence of the adaptive learning procedure itself. The network, by learning in the presence of its internal faults and nonuniformities, automatically compensates for many of these problems. Other nonideal effects associated with electronic hardware, such as noise or quantization errors (which are similar in their effects to noise), introduce levels of error into the network that might be expected in other nonadaptive types of circuits. In the examples that we studied, they did not interfere with the ability of the network to successfully learn its target behavior. Inadequate dynamic range in the synaptic weights can directly result in degraded performance. Even large component nonuniformities, however, appear to have a minimal effect on the network's performance if the training is done directly on the hardware.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: M.I.T. Press, 1986.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, p. 533, 1986.
[3] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in IRE WESCON Convention Record. New York: IRE, 1960, pp. 96-104.
[4] B. Widrow and M. E. Hoff, "Associative storage and retrieval of digital information in networks of adaptive neurons," in Biological Prototypes and Synthetic Systems, vol. 1, E. E. Bernard and M. R. Kane, Eds. New York: Plenum, 1962.
[5] Y. Le Cun, "Learning process in an asymmetric threshold network," in Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman Soulie, and G. Weisbuch, Eds., NATO ASI Series F, vol. 20. Berlin: Springer-Verlag, 1986.
[6] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, p. 295, 1988.
[7] M. A. Sivilotti, M. R. Emerling, and C. A. Mead, "VLSI architecture for implementation of neural networks," in AIP Conf. Proc. 151: Neural Networks for Computing, J. S. Denker, Ed. New York: American Institute of Physics, 1986.
[8] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[9] C. D. Kornfeld, R. C. Frye, C. C. Wong, and E. A. Rietman, "An optically programmed neural network," in Proc. IEEE 2nd Int. Conf. Neural Networks, San Diego, CA, 1988, vol. II, p. 357.
[10] E. A. Rietman, R. C. Frye, C. C. Wong, and C. D. Kornfeld, "Amorphous silicon photoconductive arrays for artificial neural networks," Applied Optics, to be published.
[11] M. Henon, "A two-dimensional mapping with a strange attractor," Comm. Math. Phys., vol. 50, p. 69, 1976.
[12] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: Prediction and system modeling," Los Alamos National Laboratory Rep. LA-UR-87-2662.
[13] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.

Robert C. Frye (M'90) received the B.S. in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1973. In 1975, he returned to M.I.T. and received the Ph.D. in electrical engineering in 1980. From 1973 to 1975, he was with the Central Research Laboratories of Texas Instruments, where he worked on charge-coupled devices for analog signal processing. Since then, he has been with AT&T Bell Laboratories, Murray Hill, NJ, where he is currently a Distinguished Member of Technical Staff in the Electronic Materials Research Department. Dr. Frye is a member of the Materials Research Society and a charter member of the International Neural Network Society.

Edward A. Rietman received the B.S. in physics and chemistry from the University of North Carolina in 1980 and the M.S. in materials science from Stevens Technical Institute in 1984. He has been with AT&T Bell Laboratories, Murray Hill, NJ, since 1982 and has worked on polymer conductors, charge density wave materials, and ceramic superconductors. He began working on neural networks in 1987. He is the author of three books and the editor of the Journal of the British American Scientific Research Association. Mr. Rietman is a member of the International Neural Network Society.

Chee C. Wong received the Ph.D. in materials science from the Massachusetts Institute of Technology, Cambridge, in 1986. Since then, he has been a Member of Technical Staff in the Electronic Materials Research Department at AT&T Bell Laboratories, Murray Hill, NJ. His research interests include amorphous silicon devices, thin-film processing, and advanced interconnection technology.
