Abstract
We examine the degree of accuracy of simple feedforward neural nets with N inputs and a single output in forecasting time series that represent analytical functions. We show that the subspace of functions whose higher-order derivatives can be clustered into a finite number of linearly dependent groups can be forecast exactly by a neural net. Furthermore, we derive generally applicable summation and product rules that permit us to calculate the associated optimum connection weights of this network architecture for complicated but exactly predictable functions. If a general network is initialized with these particular weights, the learning process for general data (with noise) can be significantly accelerated and the forecasting accuracy increased. We also show that neural nets can be used to predict the finite value of diverging sums, a generic problem for most perturbation-based approaches to physical systems.
Change history
28 September 2023
A Correction to this paper has been published: https://doi.org/10.1007/s42979-023-02168-3
Acknowledgements
We appreciate Prof. N. Christensen’s enthusiasm for this work at its early stages and numerous illuminating discussions. We also thank Prof. R.F. Martin and C. Gong for very helpful discussions and pointing out recent related work in the literature. This work has been supported by the US National Science Foundation and Research Corporation.
Appendices
Appendix A: Proof of the Superposition Laws for Sums
Here we briefly outline the basic ideas for the general proof of the superposition law that permits us to construct the perfect connection weights \(W_{k}(F)\) for \(F(t)\equiv f(t) + g(t)\) in terms of the original weights of f(t), denoted by \(W_{n}(f)\), and of g(t), denoted by \(W_{m}(g)\). We require
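that f, g, and their sum F are exactly predictable on the same temporal grid with spacing h. In the convention used throughout this paper, this requirement takes the form (sketched here in the notation of the main text):
\[
f(t)=\sum_{n=1}^{N} W_{n}(f)\,f(t-nh),\qquad
g(t)=\sum_{m=1}^{M} W_{m}(g)\,g(t-mh),
\]
\[
F(t)=f(t)+g(t)=\sum_{k=1}^{N+M} W_{k}(F)\,F(t-kh).
\]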
The proof in full generality is rather unwieldy and is best carried out with a computer algebra package such as Mathematica or MATLAB. The basic idea is to express the \((M+1)\) function values f(t), \(f(t-h)\), up to \(f(t - M h)\) in terms of the N function values at the earlier times \(f(t-j h)\) with \(j = M+1, \ldots, N+M\). This iterative procedure, which must be performed in strict consecutive order, is quite lengthy. For example, by evaluating both sides of Eq. (11) for the argument \(t-M h\), we have
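Assuming the convention sketched above, this first replacement presumably reads
\[
f(t-Mh)=\sum_{n=1}^{N} W_{n}(f)\,f\bigl(t-(M+n)h\bigr),
\]
which expresses \(f(t-Mh)\) entirely through the values \(f(t-jh)\) with \(j = M+1, \ldots, N+M\).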
Similarly, evaluating at the argument \(t-(M-1)h\) and inserting Eq. (13), we obtain
This sequence of iterative steps needs to be repeated \((M+1)\) times until the function f(t) is expressed solely in terms of \(f(t-j h)\) with \(j = M+1, \ldots, N+M\) and the weights \(W_{n}(f)\). The same replacements need to be performed for the function g(t): here the \((N+1)\) function values g(t), \(g(t-h)\), up to \(g(t-N h)\) need to be expressed in terms of the M function values at the earlier times \(g(t-j h)\) with \(j=N+1, \ldots, N+M\).
After these expressions are inserted into Eq. (10), this equation finally becomes a single linear equation for the \((N+M)\) unknown weights \(W_{k}(f+g)\). It contains all N weights \(W_{n}(f)\) and all M weights \(W_{m}(g)\), as well as the N function values \(f(t-j h)\) with \(j = M+1, \ldots, N+M\) and the M function values \(g(t-jh)\) with \(j = N+1, \ldots, N+M\).
As this single equation needs to be satisfied for all times t, the \((N+M)\) pre-factors in front of all \(f(t-j h)\) and \(g(t-j h)\) need to vanish identically. The corresponding set of \((N+M)\) equations for the \((N+M)\) weights \(W_{k}(f+g)\) can be solved uniquely.
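Schematically, and in notation introduced here only for illustration, the single equation obtained after all replacements has the form
\[
\sum_{j=M+1}^{N+M} A_{j}\,f(t-jh)\;+\;\sum_{j=N+1}^{N+M} B_{j}\,g(t-jh)=0,
\]
where each coefficient \(A_{j}\) and \(B_{j}\) is linear in the unknown weights \(W_{k}(f+g)\), with coefficients built from the \(W_{n}(f)\) and \(W_{m}(g)\). Requiring all \(A_{j}\) and \(B_{j}\) to vanish yields exactly the \((N+M)\) linear conditions described above.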
To give the reader a better idea of the complexity of the derivation, we present here a concrete example with \(N = 3\) and \(M = 2\). For instance, \(f(t) = t^{2} \exp (3 t)\) and \(g(t) = \cos (5t)\) fall into this category, with the known weights given in Table 1.
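For this concrete example, the requirement and the two constituent recurrences take the form (sketched here; these presumably correspond to Eqs. (15)–(17) referred to below)
\[
f(t)+g(t)=\sum_{k=1}^{5} W_{k}(f+g)\,\bigl[f(t-kh)+g(t-kh)\bigr],
\]
\[
f(t)=U_{1}\,f(t-h)+U_{2}\,f(t-2h)+U_{3}\,f(t-3h),\qquad
g(t)=V_{1}\,g(t-h)+V_{2}\,g(t-2h),
\]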
where for notational simplicity we abbreviate \(U_{n}\equiv W_{n}(f)\) and \(V_{m}\equiv W_{m}(g)\). Using Eqs. (16) and (17) repeatedly, the sequence of required replacements leads to
As a result, we obtain for Eq. (15)
where the five coefficients are given by
If we equate these five coefficients \(A_{k}\) to zero, we obtain the final solutions for the weights W as
In view of the complexity of the expressions in the intermediate steps, these forms are remarkably simple and they are in full agreement with the general solutions of Eq. (7) for arbitrary N and M.
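Under the predictability convention sketched above, the sum rule amounts to multiplying the characteristic polynomials of f and g. The following short numerical sketch (not part of the original article; the grid spacing, sample times, and the helper function weights are choices made only for this check) verifies this for the example \(f(t)=t^{2}\exp(3t)\), \(g(t)=\cos(5t)\):

```python
import numpy as np

# Check of the N = 3, M = 2 example: f(t) = t^2 exp(3t), g(t) = cos(5t).
# Convention assumed: a class-K function x(t) obeys x(t) = sum_k W_k x(t - k h).
h = 0.1

def weights(x, K, t0=1.0):
    """Solve the K x K linear system x(t_i) = sum_{k=1}^{K} W_k x(t_i - k h)."""
    ts = t0 + h * np.arange(K)
    A = np.array([[x(t - k * h) for k in range(1, K + 1)] for t in ts])
    b = np.array([x(t) for t in ts])
    return np.linalg.solve(A, b)

f = lambda t: t**2 * np.exp(3 * t)
g = lambda t: np.cos(5 * t)
F = lambda t: f(t) + g(t)

U = weights(f, 3)   # U_1, U_2, U_3
V = weights(g, 2)   # V_1, V_2
W = weights(F, 5)   # W_1, ..., W_5, obtained directly from the sum F

# Sum rule as a product of characteristic polynomials:
# (z^3 - U1 z^2 - U2 z - U3)(z^2 - V1 z - V2) = z^5 - W1 z^4 - ... - W5
W_rule = -np.convolve(np.r_[1.0, -U], np.r_[1.0, -V])[1:]
print(np.max(np.abs(W - W_rule)))   # should be tiny (round-off level)
```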
Appendix B: Weight Factors for the Product Rule
Here we briefly outline the basic ideas for the general proof of the product rule that permits us to construct the perfect connection weights \(W_{k}(F)\) for products \(F(t)\equiv f(t)g(t)\) in terms of the original weights of f(t), denoted by \(W_{n}(f)\), and of g(t), denoted by \(W_{m}(g)\). We require
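In the same convention as in Appendix A, the requirement for the product case reads (sketched here; this is presumably the content of Eq. (23) used below)
\[
F(t)=f(t)\,g(t)=\sum_{k=1}^{N\cdot M} W_{k}(F)\,F(t-kh),
\]
so that a product of a class-N and a class-M function is, generically, exactly predictable with \(N\cdot M\) weights.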
The approach to derive the weights \(W_{k}(fg)\) from the \(W_{k}(f)\) and \(W_{k}(g)\) is, in principle, similar to the one used in Appendix A, but it is significantly more complicated; we illustrate it here only for the case \(N=3\) and \(M=3\). Here we would iteratively use Eq. (23) to replace the functions f and g at later times in terms of the six values \(f(t-kh)\) and \(g(t-kh)\) for \(k = 7, 8\) and 9. After these replacements, the central equation Eq. (23) for the nine weights \(W_{k}(fg)\) depends on the nine time-dependent product functions \(f(t-k_{1}h)g(t-k_{2}h)\) with \(k_{1} = 7, 8, 9\) and \(k_{2} = 7, 8, 9\). If we assume that these nine functions are linearly independent of each other, the corresponding nine pre-factors have to vanish. Solving the resulting nine coupled but linear equations for the nine weights \(W_{k}(fg)\) with \(k=1, 2, \ldots , 9\), we finally obtain the solutions
For notational simplicity, we have again used the abbreviations \(U_{k}\equiv W_{k}(f)\) and \(V_{k}\equiv W_{k}(g)\). Unfortunately, we have not been able to recognize a regular pattern in these nine weights that would have allowed us to predict the corresponding 16 weights for the \(N=4\), \(M=4\) system. Although we note that the sum of the indices of the factors U and V in each term matches the index k of \(W_{k}(fg)\), it seems difficult to predict reliably the corresponding permutations of these factors, their pre-factors, and their signs.
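The counting above can be checked numerically. The sketch below (not from the original article; the two class-3 functions, the grid spacing, and the sample times are hypothetical choices) solves the nine-weight recurrence for a product F = f g directly and verifies it at unrelated times:

```python
import numpy as np

# Product rule check: a class-3 times a class-3 function should obey a
# nine-term recurrence F(t) = sum_{k=1}^{9} W_k F(t - k h).
h = 0.1
f = lambda t: np.exp(t) + np.exp(-t) + np.exp(2 * t)   # class 3 (three exponentials)
g = lambda t: np.cos(3 * t) + np.exp(0.5 * t)          # class 3 (cosine + exponential)
F = lambda t: f(t) * g(t)

K = 9
ts = 1.0 + h * np.arange(K)
A = np.array([[F(t - k * h) for k in range(1, K + 1)] for t in ts])
b = np.array([F(t) for t in ts])
W = np.linalg.solve(A, b)

# The same nine weights must reproduce F at times not used in the fit.
t_chk = np.linspace(2.0, 2.5, 6)
pred = sum(W[k - 1] * F(t_chk - k * h) for k in range(1, K + 1))
print(np.max(np.abs(pred - F(t_chk)) / np.abs(F(t_chk))))   # small relative residual
```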
Appendix C: Optimum Weights Involving a Sum of Products
Here we examine the optimum weights for the specific function f(t)
where we have used the specific values \(a_{1}=-2\) and \(a_{2}=2\) for the decay and growth rates and \(b_{1} = 70\) and \(b_{2} =20\) for the two frequencies. As a sum of the two class = 2 functions \(\exp (a_{1}t) \cos (b_{1}t)\) and \(\exp (a_{2}t) \cos (b_{2}t)\), the function f(t) is again an epf of class = 4. Applying consecutively the superposition laws for the optimal weights of sums and products of functions derived in this work (Eqs. 7, 25 and 26), one can derive the following analytical expressions for the four optimal weights.
In Fig. 6 we have graphed these four optimum weights as a function of the grid spacing h. In the limit of small spacing, \(h\rightarrow 0\), we find that the weights approach the values \((W_{1},W_{2},W_{3},W_{4}) =(4, -6, 4, -1)\equiv {\mathbf {W}}(h=0)\). This set corresponds precisely to the optimal (h-independent) weights \(B_{n4}\) for any polynomial of degree 3. This is not a coincidence, as the optimal weights of the constituent functions \(\exp (at)\) and \(\cos (bt)\) of f(t) in Eq. (34) already converge in this limit to the alternating binomial coefficients given in Eq. (8).
The key question is whether the binomial coefficients can serve as helpful initial values for the learning algorithm in the relevant case \(h\ne 0\). For example, for \(h<0.0076\) each of the optimal weights differs from the corresponding entry of \({\mathbf {W}}(h=0)\) by at most \(10\%\). For \(h=0.01\) the exact optimal weights are \({\mathbf {W}}(0.01) =(3.50, -5.00, 3.48, -1.00)\), very similar to \(B_{n4}=(-1)^{n+1}\binom{4}{n}\). This suggests that, as long as the grid spacing is not too large, the binomial set is an ideal choice of weight parameters with which to initialize the net for the learning process.
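The quoted values of \({\mathbf {W}}(0.01)\) can be reproduced numerically. The sketch below assumes \(f(t)=\exp(-2t)\cos(70t)+\exp(2t)\cos(20t)\) as described above; the sample times are an arbitrary choice made only for this check:

```python
import numpy as np

# Reproduce the optimal weights W(h = 0.01) for
# f(t) = exp(-2 t) cos(70 t) + exp(2 t) cos(20 t), a class-4 epf.
h = 0.01
f = lambda t: np.exp(-2 * t) * np.cos(70 * t) + np.exp(2 * t) * np.cos(20 * t)

K = 4
ts = 0.5 + h * np.arange(K)                       # arbitrary sample times
A = np.array([[f(t - k * h) for k in range(1, K + 1)] for t in ts])
b = np.array([f(t) for t in ts])
W = np.linalg.solve(A, b)
print(np.round(W, 2))   # expected to match (3.50, -5.00, 3.48, -1.00)
```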