Abstract
In this chapter, we describe the basic concepts behind the functioning of recurrent neural networks and explain the general properties that are common to several existing architectures. We introduce the basics of their training procedure, backpropagation through time, as a general way to propagate and distribute the prediction error to previous states of the network. The learning procedure consists of updating the model parameters by minimizing a suitable loss function, which includes the error achieved on the target task and, usually, one or more regularization terms. We then discuss several ways of regularizing the system, highlighting their advantages and drawbacks. Besides the standard stochastic gradient descent procedure, we also present several additional optimization strategies proposed in the literature for updating the network weights. Finally, we illustrate the vanishing gradient problem, an issue inherent to gradient-based optimization techniques that occurs in several situations when training neural networks. We conclude by discussing the most recent and successful approaches proposed in the literature to limit the vanishing of the gradients.
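The training procedure outlined in the abstract can be summarized in a short sketch. The following example is illustrative and not taken from the chapter; it assumes PyTorch and arbitrary placeholder sizes, data, and hyperparameters. It trains a small recurrent network with backpropagation through time, minimizes a loss combining the task error with an L2 regularization term, updates the weights with plain stochastic gradient descent, and applies gradient clipping as one common way to keep gradient magnitudes under control.

```python
# Minimal sketch of the training loop described in the abstract.
# All dimensions, data, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, output_size = 4, 16, 1
seq_len, batch_size = 20, 8
l2_lambda, learning_rate = 1e-4, 1e-2

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, output_size)
params = list(rnn.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=learning_rate)
criterion = nn.MSELoss()

# Synthetic data standing in for a real sequence-prediction task.
x = torch.randn(batch_size, seq_len, input_size)
y = torch.randn(batch_size, output_size)

for step in range(100):
    optimizer.zero_grad()
    states, _ = rnn(x)                    # hidden states for every time step
    prediction = readout(states[:, -1])   # read out from the last hidden state
    task_loss = criterion(prediction, y)  # error on the target task
    reg_loss = sum(p.pow(2).sum() for p in params)  # L2 regularization term
    loss = task_loss + l2_lambda * reg_loss
    loss.backward()                       # backpropagation through time
    # Clip the gradient norm to limit the effect of exploding gradients.
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()                      # stochastic gradient descent update
```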
Copyright information
© 2017 The Author(s)
About this chapter
Cite this chapter
Bianchi, F.M., Maiorino, E., Kampffmeyer, M.C., Rizzi, A., Jenssen, R. (2017). Properties and Training in Recurrent Neural Networks. In: Recurrent Neural Networks for Short-Term Load Forecasting. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-70338-1_2
DOI: https://doi.org/10.1007/978-3-319-70338-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70337-4
Online ISBN: 978-3-319-70338-1