First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method
Roberto Battiti
Dipartimento di Matematica, Università di Trento,
38050 Povo (Trento), Italy
1 Introduction
There are cases in which learning speed is a limiting factor in the practical application of multilayer perceptrons to problems that require high accuracy in the network mapping function. In this class are applications related to system identification and nonlinear modeling, time-series prediction, navigation, manipulation, and robotics. In addition, the standard batch backpropagation (BP) method (e.g., Rumelhart and McClelland 1986) requires a selection of appropriate parameters by the user, which is mainly executed with a trial-and-error process. Since one of the competitive advantages of neural networks is the ease with which they may be applied to novel or poorly understood problems, it is essential to consider automated and robust learning methods with a good average performance on many classes of problems.
This review describes some methods that have been shown to accelerate the convergence of the learning phase on a variety of problems, and suggests a possible taxonomy of the different techniques based on their order (i.e., their use of first or second derivatives), space and computational requirements, and convergence properties. Some of these methods require only limited modifications of the standard BP algorithm.

The error function to be minimized is the sum-of-squares error

$$E(w) = \sum_p E_p(w) = \frac{1}{2} \sum_p \sum_{i=1}^{n_o} (t_{pi} - o_{pi})^2 \qquad (2.1)$$
where $t_{pi}$ and $o_{pi}$ are the target and the current output values for pattern $p$, respectively, and $n_o$ is the number of output units. The learning procedure known as backpropagation (Rumelhart and McClelland 1986) is composed of two stages. In the first, the contributions to the gradient coming from the different patterns ($\partial E_p/\partial w_{ij}$) are calculated by backpropagating the error signal. The partial contributions are then used to correct the weights, either after every pattern presentation (on-line BP), or after they are summed in order to obtain the total gradient (batch BP).
Let us define $g_k$ as the gradient of the error function [$g_k = \nabla E(w_k)$]. The batch backpropagation update is a form of gradient descent defined as

$$w_{k+1} = w_k - \epsilon\, g_k \qquad (2.2)$$
If the learning rate $\epsilon$ tends to zero, the difference between the weight vectors $w_{k+p}$ during one epoch of the on-line method tends to be small, and the step $\epsilon \nabla E_p(w_{k+p})$ induced by a particular pattern $p$ can be approximated by $\epsilon \nabla E_p(w_k)$ (by calculating the gradient at the initial weight vector). Summing the contributions for all patterns, the movement in weight space during one epoch will be similar to the one obtained with a single batch update. However, in general the learning rate has to be large to accelerate convergence, so that the paths in weight space of the two methods differ.
The on-line procedure has to be used if the patterns are not available before learning starts [see, for example, the perceptron used for adaptive equalization in Widrow and Stearns (1985)], and a continuous adaptation to a stream of input-output signals is desired. On the contrary, if all patterns are available, collecting the total gradient information before deciding the next step can be useful in order to avoid a mutual interference of the weight changes (caused by the different patterns) that could occur for large learning rates (this effect is equivalent to a sort of noise in the true gradient direction). One of the reasons in favor of the on-line approach is that it possesses some randomness that may help in escaping from a local minimum. The objection to this is that the method may, for the same reason, miss a good local minimum, while there is the alternative method of converging to the "nearest" local minimum and using randomness² to escape only after convergence. In addition, the on-line update may be useful when the number of patterns is so large that the errors involved in the finite precision computation of the total gradient may be comparable with the gradient itself. This effect is particularly present for analog implementations of backpropagation, but it can be controlled in digital implementations by increasing the number of bits during the gradient accumulation. The fact that many patterns possess redundant information [see, for example, the case of hand-written character recognition in LeCun (1986)] has been cited as an argument in favor of on-line BP, because many of the contributions to the gradient are similar, so that waiting for all contributions before updating can be wasteful. In other words, averaging over more examples to obtain a better estimate of the gradient does not improve the learning speed sufficiently to compensate for the additional computational cost of taking these patterns into account. Nonetheless, the redundancy can also be limited using batch BP, provided that learning is started with a subset of "relevant" patterns and continued after convergence by progressively increasing the example set. This method has for example been used in Kramer and Sangiovanni-Vicentelli (1988) for the digit recognition problem.³ Even if the training
²In addition, there are good reasons why random noise may not be the best available alternative to escape from a local minimum and avoid returning to it. See, for example, the recently introduced TABU methods (Glover 1987).
³If the redundancy is clear (when, for example, many copies of the same example are present) one may preprocess the example set in order to eliminate the duplication. On the contrary, if redundancy is only partial, the redundant patterns have to be presented to and learned by the network in both versions.
⁴The availability of many variations of the on-line technique is one of the reasons why "fair" comparisons with the batch and second-order versions are complex. Which version has to be chosen for the comparison? If the final convergence results have been obtained after a tuning process, should the tuning times be included in the comparison?
[see, for example, Bingham (1988)]. The result is that the convergence of stochastic LMS is guaranteed if $\epsilon < 1/(N \lambda_{\max})$, where $N$ is the number of parameters being optimized and $\lambda_{\max}$ is the largest eigenvalue of the autocorrelation function of the input.⁵ A detailed study of adaptive filters is presented in Widrow and Stearns (1985). The effects of the autocorrelation matrix of the inputs on the learning process (for a single linear unit) are discussed in LeCun et al. (1991). In this framework the appropriate learning rate for gradient descent is $1/\lambda_{\max}$. These results cannot be extended to multilayer networks (with nonlinear transfer functions) in a straightforward manner, but they can be used as a starting point for useful heuristics.
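As a small illustration of this heuristic (a sketch under the stated assumptions; X is an assumed array holding one input pattern per row):

```python
import numpy as np

def max_stable_rate(X):
    """Heuristic learning-rate bound from the LMS analysis: 1/lambda_max,
    where lambda_max is the largest eigenvalue of the sample autocorrelation
    matrix of the inputs.  X: (num_patterns, num_inputs)."""
    R = X.T @ X / X.shape[0]             # sample autocorrelation matrix
    lam_max = np.linalg.eigvalsh(R)[-1]  # eigvalsh returns ascending eigenvalues
    return 1.0 / lam_max
```

For a multilayer network this value is only a starting point for tuning, since the analysis holds for a single linear unit.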
The convergence properties of the LMS algorithm with adaptive learning rate are presented in Luo (1991), together with a clear comparison of the LMS algorithm with stochastic gradient descent and adaptive filtering algorithms. The main result is that if the learning rate $\epsilon_n$ for the $n$th training cycle satisfies the two conditions

$$\sum_{n=1}^{\infty} \epsilon_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \epsilon_n^2 < \infty \qquad (2.4)$$

then the weight vector converges to the minimizer of the error function.
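For example, the schedule $\epsilon_n = \epsilon_0/n$ satisfies both conditions, since the harmonic series $\sum_n 1/n$ diverges while $\sum_n 1/n^2 = \pi^2/6$ is finite; a constant rate violates the second condition, and a rate decreasing like $1/n^2$ violates the first.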
The asymptotic rate of convergence of batch gradient descent is governed by the eigenvalue structure of the Hessian. For a quadratic error with Hessian $G$, steepest descent with optimal step size satisfies

$$E(w_{k+1}) - E(w^*) \le \left(\frac{\eta_{\max} - \eta_{\min}}{\eta_{\max} + \eta_{\min}}\right)^2 \left[E(w_k) - E(w^*)\right] \qquad (2.6)$$

where $\eta_{\max}$ and $\eta_{\min}$ are the maximum and minimum eigenvalues of the matrix $G$, and $w^*$ is the minimizer [see Luenberger (1973)]. If these two eigenvalues are very different, the distance from the minimum value is multiplied each time by a number that is close to one. The type of convergence in equation 2.6 is termed q-linear convergence. One has q-superlinear convergence if, for some sequence $c_k$ that converges to 0, one has
$$\|w_{k+1} - w^*\| \le c_k\, \|w_k - w^*\| \qquad (3.4)$$
4 Newton's Method
Newton's method is obtained by using a local quadratic model $m_c$ of the error around the current point $w_c$ and solving for the step $s^N$ that leads to a point where the gradient of the model is zero: $\nabla m_c(w_c + s^N) = 0$. This corresponds to solving the linear system $\nabla^2 E(w_c)\, s^N = -\nabla E(w_c)$.

During one-dimensional searches along a descent direction $p$, the step length $\lambda$ is required to produce a sufficient decrease of the error,

$$E(w_c + \lambda p) \le E(w_c) + \alpha\, \lambda\, \nabla E(w_c)^T p \qquad (4.3)$$

for a fixed constant $\alpha \in (0, 1)$, and to satisfy the curvature condition

$$\nabla E(w_c + \lambda p)^T p \ge \beta\, \nabla E(w_c)^T p \qquad (4.4)$$

where $\beta$ is a fixed constant $\in (\alpha, 1)$. The condition $\beta > \alpha$ assures that the two requirements can be simultaneously satisfied.
If the two above conditions are satisfied at each iteration and if the error is bounded below, the sequence $w_k$ obtained is such that $\lim_{k\to\infty} \nabla E(w_k) = 0$, provided that each step is kept away from orthogonality to the gradient ($\lim_{k\to\infty} \nabla E(w_k)^T s_k / \|s_k\| \ne 0$).

This result is quite important: we are permitted to use fast approximated one-dimensional searches without losing global convergence. Recent computational tests show that methods based on fast one-dimensional searches in general require much less computational effort than methods based on sophisticated one-dimensional minimizations.¹²
The line-search method suggested in Dennis and Schnabel (1983) is well suited for multilayer perceptrons (where the gradient can be obtained with limited effort during the computation of the error) and requires only a couple of error and gradient evaluations per iteration, on average. The method is based on quadratic and cubic interpolations and is designed to use the available information about the function to be minimized in an efficient way [see Dennis and Schnabel (1983) for details]. A similar method based on quadratic interpolation is presented in Battiti (1989).
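A simplified backtracking search of this family is sketched below (quadratic interpolation only; the Dennis and Schnabel routine adds cubic models and further safeguards, so this is an illustration rather than their algorithm):

```python
import numpy as np

def backtrack(E, g, w, p, alpha=1e-4, lam=1.0, lam_min=1e-10):
    """Find a step length lam along the descent direction p satisfying the
    sufficient-decrease condition E(w + lam*p) <= E(w) + alpha*lam*g^T p."""
    E0, slope = E(w), g @ p          # slope < 0 for a descent direction
    while lam > lam_min:
        Et = E(w + lam * p)
        if Et <= E0 + alpha * lam * slope:
            return lam
        # minimizer of the quadratic model through E0, slope, and Et
        lam_quad = -slope * lam**2 / (2.0 * (Et - E0 - slope * lam))
        lam = float(np.clip(lam_quad, 0.1 * lam, 0.5 * lam))
    return lam
```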
In the modified Newton approach a nonnegative diagonal term is added to the model Hessian, so that the matrix

$$H_c + \mu_c I \qquad (4.5)$$

is positive definite and safely well conditioned. A proper value for $\mu_c$ can be efficiently found using the modified Cholesky factorization described in Gill et al. (1981) and the heuristics described in Dennis and Schnabel (1983). The resulting algorithm computes the step from the factors of this perturbed matrix.
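A minimal sketch of this step computation in Python (here the diagonal increment is found by naive doubling until the Cholesky factorization succeeds, instead of the more refined modified-Cholesky heuristics cited above):

```python
import numpy as np

def modified_newton_step(H, g, mu0=1e-3):
    """Solve (H + mu*I) s = -g with the smallest tried mu that makes the
    matrix positive definite, detected by a successful Cholesky factorization."""
    n, mu = H.shape[0], 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + mu * np.eye(n))
            break
        except np.linalg.LinAlgError:   # not positive definite: increase mu
            mu = max(2.0 * mu, mu0)
    y = np.linalg.solve(L, -g)          # solve L y = -g
    return np.linalg.solve(L.T, y)      # solve L^T s = y
```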
The Cholesky factors of a positive-definite symmetric matrix can be considered as a sort of "square root" of the matrix. The original matrix $M$ is expressed as the product $LDL^T$, where $L$ is a unit lower triangular matrix¹⁴ and $D$ is a diagonal matrix with strictly positive diagonal elements. Taking the square root of the diagonal elements and forming with them the matrix $D^{1/2}$, the original matrix can be written as $M = LD^{1/2}D^{1/2}L^T = \bar{L}\bar{L}^T = R^T R$, where $\bar{L}$ is a general lower triangular matrix, and $R$ a general upper triangular matrix. The Cholesky factorization can be computed in about $\frac{1}{6}N^3$ multiplications and additions and is characterized by good numerical stability. If the original matrix is not positive definite, the factorization can be modified in order to obtain factors $L$ and $D$ with all the diagonal elements in $D$ positive and all the elements in $L$ uniformly bounded. The obtained factorization corresponds to the factors of a matrix $\bar{H}_c$, differing from the original one only by a diagonal matrix $K$ with nonnegative elements:

$$\bar{H}_c = LDL^T = H_c + K \qquad (4.6)$$
In the trust-region approach the step is

$$s(\mu) = -(H_c + \mu I)^{-1} \nabla E(w_c)$$

for the unique $\mu \ge 0$ such that the step has the maximum allowed length ($\|s(\mu)\| = \delta_c$), unless the step with $\mu = 0$ is inside the trusted region ($\|s(0)\| \le \delta_c$), in which case $s(0)$ is the solution, equal to the Newton step. We omit the proof and the usable techniques for finding $\mu$, leaving the topics as a suggested reading for those in search of elegance and inspiration [see, for example, Dennis and Schnabel (1983)].
As a final observation, note that the diagonal modification to the Hessian is a sort of compromise between gradient descent and Newton's method: when $\mu$ tends to zero the original Hessian is (almost) positive definite and the step tends to coincide with Newton's step; when $\mu$ has to be large the diagonal addition $\mu I$ tends to dominate and the step tends to one proportional to the negative gradient:

$$s(\mu) = -(H_c + \mu I)^{-1} \nabla E(w_c) \approx -\frac{1}{\mu} \nabla E(w_c)$$

There is no need to decide from the beginning whether to use the gradient as a search direction; the algorithm takes care of selecting the direction that is appropriate for the local configuration of the error surface.
While not every usable multilayer perceptron needs to have thousands of weights, it is true that this number tends to be large for some interesting applications. Furthermore, while the analytic gradient is easily obtained, the calculation of second derivatives is complex and time-consuming. For these reasons, the methods described above, while fundamental from a theoretical standpoint, have to be simplified and approximated in suitable ways, which we describe in the next two sections.
5 Secant Methods
In more dimensions the situation is more complex. Let the current and next points be $w_c$ and $w_+$; defining $y_c = \nabla E(w_+) - \nabla E(w_c)$ and $s_c = w_+ - w_c$, the secant equation analogous to equation 5.1 is

$$H_+\, s_c = y_c \qquad (5.2)$$

¹⁷Historically these methods were called quasi-Newton methods. Here we follow the terminology of Dennis and Schnabel (1983), where the term quasi-Newton refers to all algorithms "derived" from Newton's method.
The new problem is that in more than one dimension equation 5.2 does not determine a unique $H_+$ but leaves the freedom to choose from an $(N^2 - N)$-dimensional affine subspace $Q(s_c, y_c)$ of matrices obeying equation 5.2. The new suggested strategy is that of using equation 5.2 not to determine but to update a previously available approximation. In particular, Broyden's update is based on a least change principle: find the matrix in $Q(s_c, y_c)$ that is closest to the previously available matrix. This is obtained by projecting¹⁸ the matrix onto $Q(s_c, y_c)$. The resulting Broyden's update is

$$H_+ = H_c + \frac{(y_c - H_c s_c)\, s_c^T}{s_c^T s_c} \qquad (5.3)$$
The Powell update is one step forward, but not the solution. In the previous sections we have shown the importance of having a symmetric and positive definite approximation to the Hessian. Now, one can prove that $H_+$ is symmetric and positive definite if and only if $H_+ = J_+ J_+^T$ for some nonsingular matrix $J_+$. Using this fact, one update of this kind can be derived, expressing $H_+ = J_+ J_+^T$ and using Broyden's method to derive a suitable $J_+$.¹⁹ The resulting update is historically known as the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update (by Broyden et al. 1973) and is given by

$$H_+ = H_c + \frac{y_c\, y_c^T}{y_c^T s_c} - \frac{H_c\, s_c\, s_c^T H_c}{s_c^T H_c s_c} \qquad (5.5)$$
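A direct transcription of equation 5.5 in Python (a sketch assuming numpy vectors and a symmetric H; practical codes update the Cholesky factors or the inverse approximation instead, as discussed below):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS positive definite secant update (equation 5.5).
    H: current Hessian approximation; s = w_+ - w_c; y = gradient difference.
    Positive definiteness is preserved when y^T s > 0 (see footnote 19)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```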
The BFGS positive definite secant update has been the most successful update in a number of studies performed over the years. The positive definite secant update converges q-superlinearly [a proof can be found in Broyden et al. (1973)]. It is common to take the initial matrix $H_0$ as the identity matrix (first step in the direction of the negative gradient). It is possible to update the Cholesky factors directly (Goldfarb 1976), with a total complexity of $O(N^2)$ [see the implementation in Dennis and Schnabel (1983)]. Secant methods for learning in the multilayer perceptron have been used, for example, in Watrous (1987). The $O(N^2)$ complexity of BFGS is clearly a problem for very large networks, but the method can still remain competitive if the number of examples is very large, so that the computation of the error function dominates. A comparison of various nonlinear optimization strategies can be found in Webb et al. (1988). Second-order methods in continuous time are considered in Parker (1987).

¹⁸The changes and the projections are executed using the Frobenius norm: $\|H\|_F = (\sum_{i,j} h_{ij}^2)^{1/2}$; the matrix is considered as a "long" vector.
¹⁹The solution exists if $y_c^T s_c > 0$, which is guaranteed if "accurate" line searches are performed (see Section 4.1).
²⁰Updating the Cholesky factorization and calculating the solution are both of order $O(N^2)$. The same order is obtained if the inverse Hessian is updated, as in equation 6.1, and the search direction is calculated by a matrix-vector product.
²¹The comparison with BP with fixed learning and momentum rates has little meaning: if an improper learning rate is chosen, standard BP becomes arbitrarily slow or not convergent; if parameters are chosen with a slow trial-and-error process, this time should also be included in the total computing time.
For the inverse Hessian approximation the BFGS update takes the form

$$H_+^{-1} = H_c^{-1} + \left(1 + \frac{y_c^T H_c^{-1} y_c}{y_c^T s_c}\right) \frac{s_c\, s_c^T}{y_c^T s_c} - \frac{s_c\, y_c^T H_c^{-1} + H_c^{-1} y_c\, s_c^T}{y_c^T s_c} \qquad (6.1)$$

Now, there is an easy way to reduce the storage for the matrix $H_c$: just forget the matrix and start each time from the identity $I$. Approximating equation 6.1 in this way, and multiplying by the gradient $g_c = \nabla E(w_c)$, the new search direction $p_+$ becomes

$$p_+ = -g_c + A_c\, s_c + B_c\, y_c \qquad (6.2)$$
where the two scalars $A_c$ and $B_c$ are the following combination of scalar products of the previously defined vectors $s_c$, $g_c$, and $y_c$ (last step, gradient, and difference of gradients):

$$A_c = -\left(1 + \frac{y_c^T y_c}{y_c^T s_c}\right) \frac{s_c^T g_c}{y_c^T s_c} + \frac{y_c^T g_c}{y_c^T s_c}, \qquad B_c = \frac{s_c^T g_c}{y_c^T s_c}$$
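In code the direction costs only a handful of scalar products (a sketch following equation 6.2 and the scalars above; the function name is illustrative and the vectors are numpy arrays):

```python
def oss_direction(g, s, y):
    """Memoryless-BFGS (one-step secant) search direction: the BFGS update
    is applied to the identity matrix, so only the last step s and gradient
    difference y are stored.  Cost and storage are O(N)."""
    ys = y @ s                       # must be > 0 (see footnote 19)
    sg, yg = s @ g, y @ g
    A = -(1.0 + (y @ y) / ys) * sg / ys + yg / ys
    B = sg / ys
    return -g + A * s + B * y
```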
²²If the problem is badly scaled, for example, if the typical magnitude of the variables changes a lot, it is useful to substitute the identity matrix with $H_0 = \max\{|E(w_0)|, \text{typical size of } E\} \cdot D_s^2$, where $D_s$ is a diagonal scaling matrix, such that the new variables $\hat{w} = D_s w$ are in the same range [see Dennis and Schnabel (1983)].
In the method of Becker and LeCun (1989), each weight $w_n$ is instead updated using a learning rate that is scaled by an estimate of the corresponding diagonal term of the Hessian:

$$\Delta w_n = -\frac{\epsilon}{\left\langle \partial^2 E/\partial w_n^2 \right\rangle + \mu}\; \frac{\partial E}{\partial w_n} \qquad (6.3)$$
The estimate $\langle \partial^2 E/\partial w_n^2 \rangle$ of the diagonal component of the Hessian is in turn obtained with an exponentially weighted average of the second derivative (or an estimate thereof, $\partial^2 E/\partial w_n^2$), as follows:

$$\left\langle \frac{\partial^2 E}{\partial w_n^2} \right\rangle_{\text{new}} = (1 - \gamma) \left\langle \frac{\partial^2 E}{\partial w_n^2} \right\rangle_{\text{old}} + \gamma\; \frac{\partial^2 E_p}{\partial w_n^2} \qquad (6.4)$$
Suppose that the weight $w_n$ connects the output of unit $j$ to unit $i$ ($w_n = w_{ij}$ in the double-index notation), $a_i$ is the total input to unit $i$, $f(\cdot)$ is the squashing function, and $x_j$ is the state of unit $j$. It is easy to derive

$$\frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2}\, x_j^2, \qquad \frac{\partial^2 E}{\partial a_i^2} = f'(a_i)^2 \sum_k w_{ki}^2\, \frac{\partial^2 E}{\partial a_k^2} + f''(a_i)\, \frac{\partial E}{\partial x_i} \qquad (6.6)$$
Finally, for the simulations in LeCun (1989), the term in equation 6.6 with the second derivative of the squashing function is neglected, as in the Levenberg-Marquardt method that will be described in Section 7, obtaining

$$\frac{\partial^2 E}{\partial a_i^2} \approx f'(a_i)^2 \sum_k w_{ki}^2\, \frac{\partial^2 E}{\partial a_k^2} \qquad (6.7)$$
Note that a positive estimate is obtained in this way (so that the negative gradient is multiplied by a positive-definite diagonal matrix). The parameters $\mu$ and $\epsilon$ in equation 6.3 and $\gamma$ in equation 6.4 are fixed and must be appropriately chosen by the user. The purpose of adding $\mu$ to the diagonal approximation is explained by analogy with the trust-region method (see equation 4.9). According to Becker and LeCun (1989) the method does not bring a tremendous speed-up, but converges reliably without requiring extensive parameter adjustments.
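A compact sketch of the resulting update (the function and parameter names are illustrative; all quantities are per-weight numpy vectors, and the instantaneous curvature estimates are assumed to come from a backpropagation pass as in equation 6.7):

```python
def pseudo_newton_update(w, grad, h_est, h_inst, eps, mu, gamma):
    """Diagonal second-order update in the style of Becker and LeCun (1989).
    h_est: running estimate of the diagonal Hessian terms;
    h_inst: instantaneous estimates of d^2E/dw^2 (eq. 6.7, always >= 0);
    eps, mu, gamma: fixed user-chosen parameters."""
    h_est = (1.0 - gamma) * h_est + gamma * h_inst   # exponential average, eq. 6.4
    w = w - eps * grad / (h_est + mu)                # scaled gradient step, eq. 6.3
    return w, h_est
```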
231nthis rule we omit details related to weight sharing, that is, having more connections
controlled by a single parameter.
In the nonlinear least-squares formulation the gradient and the Hessian of the error have a special structure:

$$\nabla E(w) = J(w)^T R(w) \qquad (7.1)$$

$$\nabla^2 E(w) = J(w)^T J(w) + S(w) \qquad (7.2)$$

where $R(w)$ is the vector of residuals $r_{pi}(w)$,²⁴ $J(w)$ is the Jacobian matrix $J(w)_{pi,j} = \partial r_{pi}(w)/\partial w_j$, and $S(w)$ is the part of the Hessian containing the second derivatives of $r_{pi}(w)$, that is, $S(w) = \sum_{p,i} r_{pi}(w)\, \nabla^2 r_{pi}(w)$.
The standard Newton iteration is the following:

$$w_+ = w_c - [J(w_c)^T J(w_c) + S(w_c)]^{-1} J(w_c)^T R(w_c) \qquad (7.3)$$
The particular feature of the problem is that, while $J(w_c)$ is easily calculated, $S(w_c)$ is not. On the other hand, a secant approximation seems "wasteful" because part of the Hessian [the $J(w_c)^T J(w_c)$ part] is easily obtained from $J(w)$ and, in addition, the remaining part $S(w)$ is negligible for small values of the residuals. The Gauss-Newton method consists in neglecting the $S(w)$ part, so that a single iteration is

$$w_+ = w_c - [J(w_c)^T J(w_c)]^{-1} J(w_c)^T R(w_c) \qquad (7.4)$$

It can be shown that this step is completely equivalent to minimizing the error obtained from using an affine model of $R(w)$ around $w_c$:

$$\min_w \frac{1}{2}\, \|R(w_c) + J(w_c)(w - w_c)\|^2 \qquad (7.5)$$

$$J(w_c)^T J(w_c)\, (w_+ - w_c) = -J(w_c)^T R(w_c) \qquad (7.6)$$
²⁴The couples of indices $(p, i)$ can be alphabetically ordered, for example, and mapped to a single index.
The QR factorization method can be used for the solution of equation 7.5 [see Dennis and Schnabel (1983)]. The method is locally q-quadratically convergent for small residuals. If $J(w_c)$ has full column rank, $J(w_c)^T J(w_c)$ is nonsingular, the Gauss-Newton step is a descent direction, and the method can be modified with line searches (damped Gauss-Newton method).
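Equivalently, in Python the Gauss-Newton step can be obtained from a standard linear least-squares solver (a sketch; numpy's lstsq uses an SVD-based orthogonal factorization, which plays the role of the QR solve mentioned above):

```python
import numpy as np

def gauss_newton_step(J, R):
    """Step minimizing the affine model ||R + J s||^2 of equation 7.5."""
    s, *_ = np.linalg.lstsq(J, -R, rcond=None)
    return s
```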
Another modification is based on the trust-region idea (see Section 4.3) and is known as the Levenberg-Marquardt method. The step is defined as

$$w_+ = w_c - [J(w_c)^T J(w_c) + \mu_c I]^{-1} J(w_c)^T R(w_c) \qquad (7.7)$$

This method can be used also if $J(w)$ does not have full column rank (this happens, for example, if the number of examples is less than the number of weights).
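A corresponding sketch of the damped step (solving the regularized normal equations directly; for large problems one would factor rather than explicitly form $J^T J$):

```python
import numpy as np

def levenberg_marquardt_step(J, R, mu):
    """Levenberg-Marquardt step of equation 7.7: (J^T J + mu*I) s = -J^T R.
    Well defined even when J is rank deficient, provided mu > 0."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(n), -(J.T @ R))
```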
It is useful to mention that the components of the Jacobian matrix $\partial r_{pi}(w)/\partial w_j$ can be calculated by the usual chain rule for derivatives with a number of backpropagation passes equal to the number of output units. If weight $w_{ab}$ connects unit $b$ to unit $a$ (please note that now the usual two-index notation is adopted), one obtains the corresponding backpropagation expressions, one backward pass per output unit.
²⁷If the initial rate is too large, some iterations are wasted to reduce it until an appropriate rate is found.
While the presentation has focused on the multilayer perceptron neural network, most of these techniques can be applied to alternative models.²⁹ It is also worth stressing that problems related to memory requirements are less stringent now than when these methods were invented, and problems related to massive computation can be approached by using concurrent computation. Most of the presented techniques are suitable for a parallel implementation, with a speed-up that is approximately proportional to the number of processors employed [see, for example, Kramer and Sangiovanni-Vicentelli (1988); Battiti et al. (1990); Battiti and Straforini (1991)].

We feel that the cross-fertilization between optimization techniques and neural networks is fruitful and deserves further research efforts. In particular, the relevance of second-order techniques to large-scale backpropagation tasks (with thousands of weights and examples) is a subject that deserves additional studies and comparative experiments.
Acknowledgments
The author is indebted to Profs. Geoffrey Fox, Roy Williams, and Edoardo Amaldi for helpful discussions. Thanks are due to Profs. Tommaso Poggio, Christopher Atkeson, and Michael Jordan for sharing their views on second-order methods. The results of a second-order survey by Eric A. Wan (Neuron Digest 1989, 6-53) were a useful source of references. The referees' detailed comments are also greatly appreciated. Part of this work was completed while the author was a research assistant at Caltech. The research group was supported in part by DOE Grant DE-FG-03-85ER25009, the National Science Foundation with Grant IST-8700064, and by IBM.
References
Allred, L. G., and Kelly, G. E. 1990. Supervised learning techniques for backpropagation networks. Proc. Int. Joint Conf. Neural Networks (IJCNN), Washington I, 721-728.
Barnard, E., and Cole, R. 1988. A neural-net training program based on conjugate gradient optimization. Oregon Graduate Center, CSE 89-014.
Battiti, R. 1989. Accelerated back-propagation learning: Two optimization methods. Complex Syst. 3, 331-342.
Battiti, R., and Masulli, F. 1990. BFGS optimization for faster and automated supervised learning. Proc. Int. Neural Network Conf. (INNC 90), Paris, France, 757-760.
²⁹For example, the RBF model introduced in Broomhead and Lowe (1988) and Poggio and Girosi (1990).
Battiti, R., and Straforini, M. 1991. Parallel supervised learning with the memoryless quasi-Newton method. In Parallel Computing: Problems, Methods and Applications, P. Messina and A. Murli, eds. Elsevier, Amsterdam.
Battiti, R., Colla, A. M., Briano, L. M., Cecinati, R., and Guido, P. 1990. An application-oriented development environment for neural net models on the multiprocessor Emma-2. Proc. IFIP Workshop Silicon Architectures Neural Nets, St. Paul de Vence, France, M. Somi and J. Calzadillo-Daguerre, eds. North-Holland, Amsterdam.
Becker, S., and LeCun, Y. 1989. Improving the convergence of backpropagation learning with second-order methods. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds. Morgan Kaufmann, San Mateo, CA.
Bengio, Y., and Moore, B. 1989. Acceleration of learning. Proc. GNCB-CNR
School, Trento, Italy.
Bingham, J. A. C. 1988. The Theory and Practice of Modem Design. Wiley, New
York.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Broyden, C. G., Dennis, J. E., and Moré, J. J. 1973. On the local and superlinear convergence of quasi-Newton methods. J.I.M.A. 12, 223-246.
Cardon, H., van Hoogstraten, R., and Davies, P. 1991. A neural network application in geology: Identification of genetic facies. Proc. Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland, 1519-1522.
Chan, L. W. 1990. Efficacy of different learning algorithms of the backpropagation network. Proc. IEEE TENCON-90.
Chan, L. W., and Fallside, F. 1987. An adaptive training algorithm for back propagation networks. Comput. Speech Language 2, 205-218.
Dennis, J. E., and Schnabel, R. B. 1983. Numerical Methods for Unconstrained
Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, NJ.
Dennis, J. E., Gay, D. M., and Welsch, R. E. 1981. Algorithm 573 NL2SOL: An adaptive nonlinear least-squares algorithm [E4]. TOMS 7, 369-383.
Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. 1979. LINPACK Users' Guide. SIAM, Philadelphia.
Drago, G. P., and Ridella, S. 1991. An optimum weights initialization for improving scaling relationships in BP learning. Proc. Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland, 1519-1522.
Fahlman, S. E. 1988. An empirical study of learning speed in back-propagation
networks. Preprint CMU-CS-88-162, Carnegie Mellon University, Pittsburgh,
PA.
Gawthrop, P., and Sbarbaro, D. 1990. Stochastic approximation and multilayer perceptrons: The gain backpropagation algorithm. Complex Syst. 4, 51-74.
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic
Press, London.
Glover, F. 1987. TABU search methods in artificial intelligence and operations research. ORSA Art. Int. Newslett. 1(2), 6.
Goldfarb, D. 1976. Factorized variable metric methods for unconstrained optimization. Math. Comp. 30, 796-811.
First- and Second-Order Methods for Learning 165
Goldstein, A. A. 1967. Constructive Real Analysis. Harper & Row, New York, NY.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Johansson, E. M., Dowla, F. U., and Goodman, D. M. 1990. Backpropagation
learning for multi-layer feed-forward neural networks using the conjugate
gradient method. Lawrence Livermore National Laboratory, Preprint UCRL-
JC-104850.
Kollias, S., and Anastassiou, D. 1989. An adaptive least squares algorithm for the efficient training of multilayered networks. IEEE Trans. CAS 36, 1092-1101.
Kramer, A. H., and Sangiovanni-Vicentelli, A. 1988. Efficient parallel learning
algorithms for neural networks. In Advances in Neural Information Processing
Systems, Vol. 1, pp. 75-89. Morgan Kaufmann, San Mateo, CA.
Lapedes, A., and Farber, R. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica 22D, 247-259.
LeCun, Y. 1986. HLM: A multilayer learning network. Proc. 1986 Connectionist Models Summer School, Pittsburgh, 169-177.
LeCun, Y. 1989. Generalization and network design strategies. In Connectionism
in Perspective, pp. 143-155. North Holland, Amsterdam.
LeCun, Y., Kanter, I., and Solla, S. A. 1991. Second order properties of error surfaces: Learning time and generalization. In Neural Information Processing Systems - NIPS, Vol. 3, pp. 918-924. Morgan Kaufmann, San Mateo, CA.
Luenberger, D. G. 1973. Introduction to Linear and Nonlinear Programming. Addison-Wesley, New York.
Luo, Zhi-Quan. 1991. On the convergence of the LMS algorithm with adaptive
learning rate for linear feedforward networks. Neural Comp. 3, 227-245.
Malferrari, L., Serra, R., and Valastro, G. 1990. Using neural networks for signal analysis in oil well drilling. Proc. III Ital. Workshop Parallel Architect. Neural Networks, Vietri s/m, Salerno, Italy, 345-353. World Scientific, Singapore.
Møller, M. F. 1990. A scaled conjugate gradient algorithm for fast supervised learning. PB-339 Preprint, Computer Science Department, University of Aarhus, Denmark. Neural Networks, to be published.
Moré, J. J., Garbow, B. S., and Hillstrom, K. E. 1980. User guide for MINPACK-1. Argonne National Labs Report ANL-80-74.
Nocedal, J. 1980. Updating quasi-Newton matrices with limited storage. Math.
Comp. 35, 773-782.
Parker, D. B. 1987. Optimal algorithms for adaptive networks: Second-order back propagation, second-order direct propagation, and second-order Hebbian learning. Proc. ICNN-1, San Diego, CA, II-593-II-600.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are
equivalent to multilayer networks. Science 247, 978-982.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Rumelhart, D. E., and McClelland, J. L. (eds.) 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.