Abstract
We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that n variables cannot be multiplied using fewer than \(2^n\) neurons in a single hidden layer.
Notes
Neurons are universal analog computing modules in much the same way that NAND gates are universal digital computing modules: any computable function can be accurately evaluated by a sufficiently large network of them. Just as NAND gates are not unique (NOR gates are also universal), nor is any particular neuron implementation—indeed, any generic smooth nonlinear activation function is universal [8, 9].
The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear unit “ReLU” activation function \(\sigma (x) = \max [0,x]\)), polynomials, continuous functions and smooth functions whose \(n^\mathrm{th}\) derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.
The limit where \(\lambda \rightarrow \infty \) but \(|A_1|^2 |A_2|\) is held constant is very similar in spirit to the ’t Hooft limit in large N quantum field theories where \(g^2 N\) is held fixed but \(N \rightarrow \infty \). The extra terms in the Taylor series which are suppressed at large \(\lambda \) are analogous to the suppression of certain Feynman diagrams at large N. The authors thank Daniel Roberts for pointing this out.
In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers, bypassing the nonlinearity in \(\sigma \). Such linear “copy gates” implementing the function \(u\rightarrow u\) are easy to construct using a simpler version of the above procedure: use \(\mathbf A _1\) to shift and scale down the input so that it falls in a tiny range where \(\sigma '(u)\ne 0\), and then scale it up and shift back with \(\mathbf A _2\), as sketched below.
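As a concrete illustration (ours, not from the main text; softplus and the constants \(a\) and \(\epsilon \) are illustrative choices), such a copy gate can be written as follows; the approximation error shrinks linearly with \(\epsilon \).

```python
import numpy as np

def softplus(x):
    # Stand-in generic smooth activation; softplus'(a) = 1/(1 + e^{-a}) is nonzero everywhere.
    return np.log1p(np.exp(x))

def copy_gate(u, a=1.0, eps=1e-3):
    """Approximate the identity u -> u with a single sigma-neuron.

    A_1 shifts and scales the input into a tiny interval around a (where
    sigma'(a) != 0); A_2 undoes the shift and scaling. The error is O(eps).
    """
    dsigma = 1.0 / (1.0 + np.exp(-a))                 # softplus'(a)
    return (softplus(a + eps * u) - softplus(a)) / (eps * dsigma)

print(copy_gate(3.7))  # ~3.7; the approximation improves as eps -> 0
```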
If the next step in the generative hierarchy requires knowledge not merely of the present state but also of the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks can generate random variables as well. In biology, spiking neurons provide a good random number generator, and in machine learning, stochastic architectures such as restricted Boltzmann machines [25] do the same.
A typical renormalization scheme for a lattice system involves replacing many spins (bits) with a single spin according to some rule. In this case, it might seem that the map R could not possibly map its domain onto itself, since there are fewer degrees of freedom after the coarse-graining. On the other hand, if we let the domain and range of R differ, we cannot easily talk about the Hamiltonian as having the same functional form, since the renormalized Hamiltonian would have a different domain than the original Hamiltonian. Physicists get around this by taking the limit where the lattice is infinitely large, so that R maps an infinite lattice to an infinite lattice.
A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [37]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.
References
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127 (2009)
Russell, S., Dewey, D., Tegmark, M.: Research priorities for robust and beneficial artificial intelligence. AI Mag. 36, 105–114 (2015)
Herbrich, R., Williamson, R.C.: Algorithmic luckiness. J. Mach. Learn. Res. 3, 175–212 (2002)
Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., Anthony, M.: Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory 44, 1926–1940 (1998)
Poggio, T., Anselmi, F., Rosasco, L.: I-theory on depth vs width: hierarchical function composition. Center for Brains, Minds and Machines (CBMM) Technical Report (2015)
Mehta, P., Schwab, D.J.: An exact mapping between the variational renormalization group and deep learning. arXiv:1410.3831 (2014)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989)
Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999)
Gnedenko, B., Kolmogorov, A.: Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge (1954)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106, 620 (1957)
Tegmark, M., Aguirre, A., Rees, M.J., Wilczek, F.: Dimensionless constants, cosmology, and other dark matters. Phys. Rev. D 73, 023505 (2006)
Delalleau, O., Bengio, Y.: Shallow vs. deep sum-product networks. In: Advances in Neural Information Processing Systems, pp. 666–674 (2011)
Mhaskar, H., Liao, Q., Poggio, T.: Learning functions: when is deep better than shallow. arXiv:1603.00988 (2016)
Mhaskar, H., Poggio, T.: Deep vs. shallow networks: an approximation theory perspective. arXiv:1608.03287 (2016)
Adam, R., Ade, P., Aghanim, N., Akrami, Y., Alves, M., Arnaud, M., Arroja, F., Aumont, J., Baccigalupi, C., Ballardini, M., et al.: Planck 2015 results. I. Overview of products and scientific results. arXiv:1502.01582 (2015)
Seljak, U., Zaldarriaga, M.: A line of sight approach to cosmic microwave background anisotropies. arXiv:astro-ph/9603033 (1996)
Tegmark, M.: How to measure CMB power spectra without losing information. Phys. Rev. D 55, 5895 (1997)
Bond, J., Jaffe, A.H., Knox, L.: Estimating the power spectrum of the cosmic microwave background. Phys. Rev. D 57, 2117 (1998)
Tegmark, M., de Oliveira-Costa, A., Hamilton, A.J.: High resolution foreground cleaned CMB map from WMAP. Phys. Rev. D 68, 123523 (2003)
Ade, P., Aghanim, N., Armitage-Caplan, C., Arnaud, M., Ashdown, M., Atrio-Barandela, F., Aumont, J., Baccigalupi, C., Banday, A.J., Barreiro, R., et al.: Planck 2013 results. XII. Diffuse component separation. Astron. Astrophys. 571, A12 (2014)
Tegmark, M.: How to make maps from cosmic microwave background data without losing information. Astrophys. J. Lett. 480, L87 (1997)
Hinshaw, G., Barnes, C., Bennett, C., Greason, M., Halpern, M., Hill, R., Jarosik, N., Kogut, A., Limon, M., Meyer, S., et al.: First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: data processing methods and systematic error limits. Astrophys. J. Suppl. Ser. 148, 63 (2003)
Hinton, G.: A practical guide to training restricted Boltzmann machines. Momentum 9, 926 (2010)
Borel, É.: Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo 27, 247–271 (1909)
Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Charact. 222, 309–368 (1922)
Riesenhuber, M., Poggio, T.: Models of object recognition. Nat. Neurosci. 3, 1199–1204 (2000)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951). doi:10.1214/aoms/1177729694
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)
Kardar, M.: Statistical Physics of Fields. Cambridge University Press, Cambridge (2007)
Cardy, J.: Scaling and Renormalization in Statistical Physics, vol. 5. Cambridge University Press, Cambridge (1996)
Johnson, J.K., Malioutov, D.M., Willsky, A.S.: Lagrangian relaxation for MAP estimation in graphical models. arXiv:0710.0013 (2007)
Bény, C.: Deep learning and the renormalization group. arXiv:1301.3124 (2013)
Saremi, S., Sejnowski, T.J.: Hierarchical model of natural images and the origin of scale invariance. Proc. Natl. Acad. Sci. 110, 3071–3076 (2013). http://www.pnas.org/content/110/8/3071.full.pdf, http://www.pnas.org/content/110/8/3071.abstract
Miles Stoudenmire, E., Schwab, D.J.: Supervised learning with quantum-inspired tensor networks. arXiv:1605.05775 (2016)
Vidal, G.: Class of quantum many-body states that can be efficiently simulated. Phys. Rev. Lett. 101, 110501 (2008). arXiv:quant-ph/0610099
Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
Hastad, J.: Almost optimal lower bounds for small depth circuits. In: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pp. 6–20. ACM (1986)
Telgarsky, M.: Representation benefits of deep feedforward networks. arXiv:1509.08101 (2015)
Montufar, G.F., Pascanu, R., Cho, K., Bengio, Y.: On the number of linear regions of deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2924–2932 (2014)
Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. arXiv:1512.03965 (2015)
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., Ganguli, S.: Exponential expressivity in deep neural networks through transient chaos. arXiv:1606.05340 (2016)
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., Sohl-Dickstein, J.: On the expressive power of deep neural networks. arXiv:1606.05336 (2016)
Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 (2013)
Bengio, Y., LeCun, Y., et al.: Scaling learning algorithms towards AI. Large Scale Kernel Mach. 34, 1–41 (2007)
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)
Le Gall, F.: Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, pp. 296–303. ACM (2014)
Carleo, G., Troyer, M.: Solving the quantum many-body problem with artificial neural networks. arXiv:1606.02318 (2016)
Vollmer, H.: Introduction to Circuit Complexity: A Uniform Approach. Springer, Berlin (2013)
Acknowledgements
This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science and NSF Grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar and Boya Song for helpful discussions and suggestions, Frank Ban, Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A: The Polynomial No-Flattening Theorem
We saw above that a neural network can compute polynomials accurately and efficiently at linear cost, using only about 4 neurons per multiplication. For example, if n is a power of two, then the monomial \(\prod _{i=1}^n x_i\) can be evaluated using 4n neurons arranged in a binary tree network with \(\log _2 n\) hidden layers. In this appendix, we will prove a no-flattening theorem demonstrating that flattening polynomials is exponentially expensive.
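For concreteness, here is a minimal numerical sketch of this construction (ours, not reproduced from the main text; softplus and the rescaling factor \(\lambda \) are illustrative choices): a four-neuron multiplication gate for two inputs, chained into a binary tree that multiplies \(n=2^d\) inputs with \(d\) layers of gates.

```python
import numpy as np

def softplus(x):
    # Generic smooth activation with sigma''(0) = 1/4 (unlike tanh, whose sigma''(0) = 0).
    return np.log1p(np.exp(x))

def mult_gate(u, v, lam=100.0, sigma=softplus, d2sigma0=0.25):
    # Four neurons sigma(+-(u+v)/lam) and sigma(+-(u-v)/lam): the constant and odd
    # Taylor terms cancel, leaving 4*sigma''(0)*u*v/lam^2 plus O(1/lam^4) corrections.
    u, v = u / lam, v / lam
    h = sigma(u + v) + sigma(-u - v) - sigma(u - v) - sigma(-u + v)
    return lam**2 * h / (4.0 * d2sigma0)

def tree_product(xs):
    # Binary tree of multiplication gates: log2(n) layers of gates, assuming len(xs) = 2^d.
    xs = list(xs)
    while len(xs) > 1:
        xs = [mult_gate(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

x = [1.5, -2.0, 0.5, 3.0]
print(tree_product(x), np.prod(x))  # both ~ -4.5
```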
Theorem
Suppose we are using a generic smooth activation function \(\sigma (x) = \sum _{k=0}^\infty \sigma _k x^k\), where \(\sigma _k \ne 0\) for \(0\le k\le n\). Then for any desired accuracy \(\epsilon >0\), there exists a neural network that approximates the function \(\prod _{i=1}^n x_i\) to within \(\epsilon \) using a single hidden layer of \(2^n\) neurons. Furthermore, this is the smallest possible number of neurons in any such network with only a single hidden layer.
This result may be compared to problems in Boolean circuit complexity, notably the question of whether \(TC^0 = TC^1\) [50]. Here circuit depth is analogous to number of layers, and the number of gates is analogous to the number of neurons. In both the Boolean circuit model and the neural network model, one is allowed to use neurons/gates which have an unlimited number of inputs. The constraint in the definition of \(TC^i\) that each of the gate elements be from a standard universal library (AND, OR, NOT, Majority) is analogous to our constraint to use a particular nonlinear function. Note, however, that our theorem is weaker by applying only to depth 1, while \(TC^0\) includes all circuits of depth O(1).
A.1 Proof that \(2^n\) Neurons are Sufficient
A neural network with a single hidden layer of m neurons that approximates a product gate for n inputs can be formally written as a choice of constants \(a_{ij}\) and \(w_j\) satisfying
\[ \sum _{j=1}^m w_j\, \sigma \left( \sum _{i=1}^n a_{ij} x_i\right) \approx \prod _{i=1}^n x_i. \qquad \mathrm{(A1)} \]
Here, we use \(\approx \) to denote that the two sides of (A1) have identical Taylor expansions up to terms of degree n; as we discussed earlier in our construction of a product gate for two inputs, this enables us to achieve arbitrary accuracy \(\epsilon \) by first scaling down the factors \(x_i\), then approximately multiplying them and finally scaling up the result.
We may expand (A1) using the definition \(\sigma (x) = \sum _{k=0}^\infty \sigma _k x^k\) and drop terms of the Taylor expansion with degree greater than n, since they do not affect the approximation. Thus, we wish to find the minimal m such that there exist constants \(a_{ij}\) and \(w_j\) satisfying
\[ \sigma _n \sum _{j=1}^m w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^n = \prod _{i=1}^n x_i, \qquad \mathrm{(A2)} \]
\[ \sigma _k \sum _{j=1}^m w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^k = 0 \qquad \mathrm{(A3)} \]
for all \(0\le k \le n-1\). Let us set \(m = 2^n\), and enumerate the subsets of \(\{1,\ldots ,n\}\) as \(S_1,\ldots ,S_m\) in some order. Define a network of m neurons in a single hidden layer by setting \(a_{ij}\) equal to the function \(s_i(S_j)\) which is \(-1\) if \(i\in S_j\) and \(+1\) otherwise, setting
\[ w_j = \frac{1}{2^n\, n!\, \sigma _n} \prod _{i=1}^n s_i(S_j) = \frac{(-1)^{|S_j|}}{2^n\, n!\, \sigma _n}. \qquad \mathrm{(A4)} \]
In other words, up to an overall normalization constant, all coefficients \(a_{ij}\) and \(w_j\) equal \(\pm 1\), and each weight \(w_j\) is simply the product of the corresponding \(a_{ij}\).
We must prove that this network indeed satisfies Eqs. (A2) and (A3). The essence of our proof will be to expand the left hand side of Eq. (A1) and show that all monomial terms except \(x_1\cdots x_n\) come in pairs that cancel. To show this, consider a single monomial \(p(\mathbf x ) = x_1^{r_1}\cdots x_n^{r_n}\) where \(r_1 + \ldots + r_n = r \le n\).
If \(p(\mathbf x ) \ne \prod _{i=1}^n x_i\), then we must show that the coefficient of \(p(\mathbf x )\) in \(\sigma _r\sum _{j=1}^m w_j\left( \sum _{i=1}^n a_{ij} x_i\right) ^r\) is 0. Since \(p(\mathbf x ) \ne \prod _{i=1}^n x_i\), there must be some \(i_0\) such that \(r_{i_0} = 0\). In other words, \(p(\mathbf x )\) does not depend on the variable \(x_{i_0}\). Since the sum in Eq. (A1) is over all combinations of ± signs for all variables, every term will be canceled by another term where the (non-present) \(x_{i_0}\) has the opposite sign and the weight \(w_j\) has the opposite sign:
\[ \sigma _r \sum _{j=1}^m w_j\left( \sum _{i=1}^n a_{ij} x_i\right) ^r = \frac{\sigma _r}{2^n\, n!\, \sigma _n} \sum _{j\,:\, i_0\notin S_j} (-1)^{|S_j|} \left[ \left( \sum _{i=1}^n s_i(S_j)\, x_i\right) ^r - \left( \sum _{i=1}^n s_i(S_j\cup \{i_0\})\, x_i\right) ^r \right] . \]
Observe that the coefficient of \(p(\mathbf x )\) is equal in \(\left( \sum _{i=1}^n s_i(S_j) x_i\right) ^r\) and \(\left( \sum _{i=1}^n s_i(S_j \cup \{i_0\}) x_i\right) ^r\), since \(r_{i_0}=0\). Therefore, the overall coefficient of \(p(\mathbf x )\) in the above expression must vanish, which implies that (A3) is satisfied.
If instead \(p(\mathbf x ) = \prod _{i=1}^n x_i\), then the coefficient of \(p(\mathbf x )\) in \(\left( \sum _{i=1}^n a_{ij} x_i\right) ^n\) is \(n!\, \prod _{i=1}^n a_{ij} = (-1)^{|S_j|} n!\), because all n! contributing terms are identical and there is no cancelation. Hence, the coefficient of \(p(\mathbf x )\) on the left-hand side of (A2) is
\[ \sigma _n \sum _{j=1}^m w_j\, (-1)^{|S_j|}\, n! = \sigma _n\, n! \sum _{j=1}^m \frac{(-1)^{|S_j|}}{2^n\, n!\, \sigma _n}\, (-1)^{|S_j|} = \sum _{j=1}^m \frac{1}{2^n} = 1, \]
completing our proof that this network indeed approximates the desired product gate.
From the standpoint of group theory, our construction involves a representation of the group \(G = \mathbb {Z}_2^n\), acting upon the space of polynomials in the variables \(x_1, x_2, \ldots , x_n\). The group G is generated by elements \(g_i\) such that \(g_i\) flips the sign of \(x_i\) wherever it occurs. Then, our construction corresponds to the computation
\[ \mathbf f (x_1,\ldots ,x_n) = \left[ \prod _{i=1}^n (1 - g_i)\right] \sigma (x_1 + x_2 + \cdots + x_n). \]
Every monomial of degree at most n, with the exception of the product \(x_1\cdots x_n\), is sent to 0 by \((1 - g_i)\) for at least one choice of i. Therefore, \(\mathbf f (x_1,\ldots ,x_n)\) approximates a product gate (up to a normalizing constant).
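As a quick numerical sanity check of the construction in A.1 (our illustration, with \(\sigma (u)=e^u\) chosen so that every Taylor coefficient \(\sigma _k=1/k!\) is nonzero, and \(\lambda \) an illustrative rescaling), one can verify that \(2^n\) neurons with \(\pm 1\) weights indeed reproduce the product:

```python
import numpy as np
from itertools import product as sign_patterns

def flat_product(x, lam=100.0):
    # Single hidden layer of 2^n neurons: a_ij = +-1 (one neuron per sign pattern),
    # output weight w_j proportional to prod_i a_ij, activation sigma(u) = exp(u).
    # Inputs are scaled down by lam and the output rescaled by lam^n.
    x = np.asarray(x, dtype=float) / lam
    n = len(x)
    total = 0.0
    for s in sign_patterns([-1.0, 1.0], repeat=n):
        s = np.array(s)
        total += np.prod(s) * np.exp(s @ x)  # w_j * sigma(sum_i a_ij x_i), up to normalization
    return lam**n * total / 2**n             # 1/(2^n n! sigma_n) = 1/2^n since sigma_n = 1/n!

x = [1.5, -2.0, 0.5, 3.0]
print(flat_product(x), np.prod(x))           # both ~ -4.5
```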
A.2 Proof that \(2^n\) Neurons are Necessary
Suppose that S is a subset of \(\{1,\ldots ,n\}\) and consider taking the partial derivatives of (A2) and (A3), respectively, with respect to all the variables \(\{x_h\}_{h\in S}\). Then, we obtain the equalities
\[ \sigma _n\, \frac{n!}{(n-|S|)!} \sum _{j=1}^m w_j \left( \prod _{h\in S} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-|S|} = \prod _{i\notin S} x_i, \qquad \mathrm{(A5)} \]
\[ \sigma _k\, \frac{k!}{(k-|S|)!} \sum _{j=1}^m w_j \left( \prod _{h\in S} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{k-|S|} = 0 \qquad \mathrm{(A6)} \]
for all \(0\le k \le n-1\) (understood to be trivial when \(k<|S|\)). Let \(\mathbf A \) denote the \(2^n \times m\) matrix with elements
\[ A_{S,j} = \prod _{h\in S} a_{hj}, \qquad \mathrm{(A7)} \]
whose rows are indexed by the \(2^n\) subsets \(S\subseteq \{1,\ldots ,n\}\) and whose columns are indexed by the neurons \(j=1,\ldots ,m\).
We will show that \(\mathbf A \) has full row rank. Suppose, towards contradiction, that \(\mathbf c ^t\mathbf A =\mathbf{0}\) for some non-zero vector \(\mathbf c \). Specifically, suppose that there is a linear dependence between rows of \(\mathbf A \) given by
\[ \sum _{\ell } c_{\ell }\, A_{S_{\ell },j} = 0 \quad \text{for all } j, \qquad \mathrm{(A8)} \]
where the \(S_{\ell }\) are distinct and \(c_{\ell } \ne 0\) for every \(\ell \). Let s be the maximal cardinality of any \(S_{\ell }\). Defining the vector \(\mathbf d \) whose components are
\[ d_j = w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s}, \qquad \mathrm{(A9)} \]
taking the dot product of Eq. (A8) with \(\mathbf d \) gives
\[ 0 = \sum _{\ell \,:\, |S_{\ell }|=s} c_{\ell } \sum _{j=1}^m w_j \left( \prod _{h\in S_{\ell }} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s} + \sum _{\ell \,:\, |S_{\ell }|<s} c_{\ell } \sum _{j=1}^m w_j \left( \prod _{h\in S_{\ell }} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s}. \qquad \mathrm{(A10)} \]
Applying Eq. (A6) (with \(k=n+|S_{\ell }|-s\)) shows that the second term vanishes. Substituting Eq. (A5) now simplifies Eq. (A10) to
\[ 0 = \frac{(n-s)!}{n!\, \sigma _n} \sum _{\ell \,:\, |S_{\ell }|=s} c_{\ell } \prod _{i\notin S_{\ell }} x_i, \qquad \mathrm{(A11)} \]
i.e., to a statement that a set of monomials are linearly dependent. Since all distinct monomials are in fact linearly independent, this is a contradiction of our assumption that the \(S_{\ell }\) are distinct and \(c_{\ell }\) are nonzero. We conclude that \(\mathbf A \) has full row rank, and therefore that \(m \ge 2^n\), which concludes the proof.
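As a small illustration (ours; it checks the explicit network of A.1 rather than an arbitrary one), the following snippet builds the matrix \(\mathbf A \) of Eq. (A7) for \(n=3\) and confirms that it has full row rank \(2^n\), consistent with the bound \(m\ge 2^n\):

```python
import numpy as np
from itertools import chain, combinations

n = 3
subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))
m = len(subsets)                                   # 2^n neurons, as in the A.1 construction
# a[i, j] = s_i(S_j): -1 if i is in S_j, +1 otherwise
a = np.array([[-1.0 if i in Sj else 1.0 for Sj in subsets] for i in range(n)])
# A[S, j] = product over h in S of a[h, j], rows indexed by the 2^n subsets S (Eq. A7)
A = np.array([np.prod(a[list(S), :], axis=0) for S in subsets])
assert A.shape == (2**n, m)
print(np.linalg.matrix_rank(A), 2**n)              # 8 8, i.e. full row rank
```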