Abstract
We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that n variables cannot be multiplied using fewer than \(2^n\) neurons in a single hidden layer.
Notes
Neurons are universal analog computing modules in much the same way that NAND gates are universal digital computing modules: any computable function can be accurately evaluated by a sufficiently large network of them. Just as NAND gates are not unique (NOR gates are also universal), nor is any particular neuron implementation—indeed, any generic smooth nonlinear activation function is universal [8, 9].
The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear unit “ReLU” activation function \(\sigma (x) = \max [0,x]\)), polynomials, continuous functions and smooth functions whose \(n^\mathrm{th}\) derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.
The limit where \(\lambda \rightarrow \infty \) but \(|A_1|^2 |A_2|\) is held constant is very similar in spirit to the ’t Hooft limit in large N quantum field theories where \(g^2 N\) is held fixed but \(N \rightarrow \infty \). The extra terms in the Taylor series which are suppressed at large \(\lambda \) are analogous to the suppression of certain Feynman diagrams at large N. The authors thank Daniel Roberts for pointing this out.
In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers, bypassing the nonlinearity in \(\sigma \). Such linear “copy gates” implementing the function \(u\rightarrow u\) are easy to construct using a simpler version of the above procedure: use \(\mathbf A _1\) to shift and scale down the input so that it falls in a tiny range where \(\sigma '(u)\ne 0\), and then scale it up and shift back with \(\mathbf A _2\), as sketched below.
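As a concrete illustration (ours, not from the main text; softplus and the constants \(a\) and \(\epsilon \) are illustrative choices), such a copy gate can be written as follows; the approximation error shrinks linearly with \(\epsilon \).

```python
import numpy as np

def softplus(x):
    # Stand-in generic smooth activation; softplus'(a) = 1/(1 + e^{-a}) is nonzero everywhere.
    return np.log1p(np.exp(x))

def copy_gate(u, a=1.0, eps=1e-3):
    """Approximate the identity u -> u with a single sigma-neuron.

    A_1 shifts and scales the input into a tiny interval around a (where
    sigma'(a) != 0); A_2 undoes the shift and scaling. The error is O(eps).
    """
    dsigma = 1.0 / (1.0 + np.exp(-a))                 # softplus'(a)
    return (softplus(a + eps * u) - softplus(a)) / (eps * dsigma)

print(copy_gate(3.7))  # ~3.7; the approximation improves as eps -> 0
```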
If the next step in the generative hierarchy requires knowledge not merely of the present state but also of the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks can generate random variables as well. In biology, spiking neurons provide a good random number generator, and in machine learning, stochastic architectures such as restricted Boltzmann machines [25] do the same.
A typical renormalization scheme for a lattice system involves replacing many spins (bits) with a single spin according to some rule. In this case, it might seem that the map R could not possibly map its domain onto itself, since there are fewer degrees of freedom after the coarse-graining. On the other hand, if we let the domain and range of R differ, we cannot easily talk about the Hamiltonian as having the same functional form, since the renormalized Hamiltonian would have a different domain than the original Hamiltonian. Physicists get around this by taking the limit where the lattice is infinitely large, so that R maps an infinite lattice to an infinite lattice.
A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [37]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.
References
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127 (2009)
Russell, S., Dewey, D., Tegmark, M.: Research priorities for robust and beneficial artificial intelligence. AI Mag. 36, 105–114 (2015)
Herbrich, R., Williamson, R.C.: Algorithmic luckiness. J. Mach. Learn. Res. 3, 175–212 (2002)
Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., Anthony, M.: Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory 44, 1926–1940 (1998)
Poggio, T., Anselmi, F., Rosasco, L.: I-theory on depth vs width: hierarchical function composition. Center for Brains, Minds and Machines (CBMM) Technical Report (2015)
Mehta, P., Schwab, D.J.: An exact mapping between the variational renormalization group and deep learning. arXiv:1410.3831 (2014)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989)
Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999)
Gnedenko, B., Kolmogorov, A.: Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge (1954)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106, 620 (1957)
Tegmark, M., Aguirre, A., Rees, M.J., Wilczek, F.: Dimensionless constants, cosmology, and other dark matters. Phys. Rev. D 73, 023505 (2006)
Delalleau, O., Bengio, Y.: Shallow vs. deep sum-product networks. In: Advances in Neural Information Processing Systems, pp. 666–674 (2011)
Mhaskar, H., Liao, Q., Poggio, T.: Learning functions: when is deep better than shallow. arXiv:1603.00988 (2016)
Mhaskar, H., Poggio, T.: Deep vs. shallow networks: an approximation theory perspective. arXiv:1608.03287 (2016)
Adam, R., Ade, P., Aghanim, N., Akrami, Y., Alves, M., Arnaud, M., Arroja, F., Aumont, J., Baccigalupi, C., Ballardini, M., et al.: Planck 2015 results. I. Overview of products and scientific results. arXiv:1502.01582 (2015)
Seljak, U., Zaldarriaga, M.: A line of sight approach to cosmic microwave background anisotropies. arXiv:astro-ph/9603033 (1996)
Tegmark, M.: How to measure CMB power spectra without losing information. Phys. Rev. D 55, 5895 (1997)
Bond, J., Jaffe, A.H., Knox, L.: Estimating the power spectrum of the cosmic microwave background. Phys. Rev. D 57, 2117 (1998)
Tegmark, M., de Oliveira-Costa, A., Hamilton, A.J.: High resolution foreground cleaned CMB map from WMAP. Phys. Rev. D 68, 123523 (2003)
Ade, P., Aghanim, N., Armitage-Caplan, C., Arnaud, M., Ashdown, M., Atrio-Barandela, F., Aumont, J., Baccigalupi, C., Banday, A.J., Barreiro, R., et al.: Planck 2013 results. XII. Diffuse component separation. Astron. Astrophys. 571, A12 (2014)
Tegmark, M.: How to make maps from cosmic microwave background data without losing information. Astrophys. J. Lett. 480, L87 (1997)
Hinshaw, G., Barnes, C., Bennett, C., Greason, M., Halpern, M., Hill, R., Jarosik, N., Kogut, A., Limon, M., Meyer, S., et al.: First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: data processing methods and systematic error limits. Astrophys. J. Suppl. Ser. 148, 63 (2003)
Hinton, G.: A practical guide to training restricted Boltzmann machines. Momentum 9, 926 (2010)
Borel, É.: Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo 27, 247–271 (1909)
Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Charact. 222, 309–368 (1922)
Riesenhuber, M., Poggio, T.: Models of object recognition. Nat. Neurosci. 3, 1199–1204 (2000)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951). doi:10.1214/aoms/1177729694
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2012)
Kardar, M.: Statistical Physics of Fields. Cambridge University Press, Cambridge (2007)
Cardy, J.: Scaling and Renormalization in Statistical Physics, vol. 5. Cambridge University Press, Cambridge (1996)
Johnson, J.K., Malioutov, D.M., Willsky, A.S.: Lagrangian relaxation for MAP estimation in graphical models. arXiv:0710.0013 (2007)
Bény, C.: Deep learning and the renormalization group. arXiv:1301.3124 (2013)
Saremi, S., Sejnowski, T.J.: Hierarchical model of natural images and the origin of scale invariance. Proc. Natl. Acad. Sci. 110, 3071–3076 (2013). http://www.pnas.org/content/110/8/3071.full.pdf, http://www.pnas.org/content/110/8/3071.abstract
Miles Stoudenmire, E., Schwab, D.J.: Supervised learning with quantum-inspired tensor networks. arXiv:1605.05775 (2016)
Vidal, G.: Class of quantum many-body states that can be efficiently simulated. Phys. Rev. Lett. 101, 110501 (2008). arXiv:quant-ph/0610099
Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
Hastad, J.: Almost optimal lower bounds for small depth circuits. In: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pp. 6–20. ACM (1986)
Telgarsky, M.: Representation benefits of deep feedforward networks. arXiv:1509.08101 (2015)
Montufar, G.F., Pascanu, R., Cho, K., Bengio, Y.: On the number of linear regions of deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2924–2932 (2014)
Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. arXiv:1512.03965 (2015)
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., Ganguli, S.: Exponential expressivity in deep neural networks through transient chaos. arXiv:1606.05340 (2016)
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., Sohl-Dickstein, J.: On the expressive power of deep neural networks. arXiv:1606.05336 (2016)
Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 (2013)
Bengio, Y., LeCun, Y., et al.: Scaling learning algorithms towards AI. Large Scale Kernel Mach. 34, 1–41 (2007)
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)
Le Gall, F.: Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, pp. 296–303. ACM (2014)
Carleo, G., Troyer, M.: Solving the quantum many-body problem with artificial neural networks. arXiv:1606.02318 (2016)
Vollmer, H.: Introduction to Circuit Complexity: A Uniform Approach. Springer, Berlin (2013)
Acknowledgements
This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science and NSF Grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar and Boya Song for helpful discussions and suggestions, Frank Ban, Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A: The Polynomial No-Flattening Theorem
We saw above that a neural network can compute polynomials accurately and efficiently at linear cost, using only about 4 neurons per multiplication. For example, if n is a power of two, then the monomial \(\prod _{i=1}^n x_i\) can be evaluated using 4n neurons arranged in a binary tree network with \(\log _2 n\) hidden layers. In this appendix, we will prove a no-flattening theorem demonstrating that flattening polynomials is exponentially expensive.
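For concreteness, here is a minimal numerical sketch of this construction (ours, not reproduced from the main text; softplus and the rescaling factor \(\lambda \) are illustrative choices): a four-neuron multiplication gate for two inputs, chained into a binary tree that multiplies \(n=2^d\) inputs with \(d\) layers of gates.

```python
import numpy as np

def softplus(x):
    # Generic smooth activation with sigma''(0) = 1/4 (unlike tanh, whose sigma''(0) = 0).
    return np.log1p(np.exp(x))

def mult_gate(u, v, lam=100.0, sigma=softplus, d2sigma0=0.25):
    # Four neurons sigma(+-(u+v)/lam) and sigma(+-(u-v)/lam): the constant and odd
    # Taylor terms cancel, leaving 4*sigma''(0)*u*v/lam^2 plus O(1/lam^4) corrections.
    u, v = u / lam, v / lam
    h = sigma(u + v) + sigma(-u - v) - sigma(u - v) - sigma(-u + v)
    return lam**2 * h / (4.0 * d2sigma0)

def tree_product(xs):
    # Binary tree of multiplication gates: log2(n) layers of gates, assuming len(xs) = 2^d.
    xs = list(xs)
    while len(xs) > 1:
        xs = [mult_gate(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

x = [1.5, -2.0, 0.5, 3.0]
print(tree_product(x), np.prod(x))  # both ~ -4.5
```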
Theorem
Suppose we are using a generic smooth activation function \(\sigma (x) = \sum _{k=0}^\infty \sigma _k x^k\), where \(\sigma _k \ne 0\) for \(0\le k\le n\). Then for any desired accuracy \(\epsilon >0\), there exists a neural network that approximates the function \(\prod _{i=1}^n x_i\) to within \(\epsilon \) using a single hidden layer of \(2^n\) neurons. Furthermore, this is the smallest possible number of neurons in any such network with only a single hidden layer.
This result may be compared to problems in Boolean circuit complexity, notably the question of whether \(TC^0 = TC^1\) [50]. Here circuit depth is analogous to number of layers, and the number of gates is analogous to the number of neurons. In both the Boolean circuit model and the neural network model, one is allowed to use neurons/gates which have an unlimited number of inputs. The constraint in the definition of \(TC^i\) that each of the gate elements be from a standard universal library (AND, OR, NOT, Majority) is analogous to our constraint to use a particular nonlinear function. Note, however, that our theorem is weaker by applying only to depth 1, while \(TC^0\) includes all circuits of depth O(1).
A.1 Proof that \(2^n\) Neurons are Sufficient
A neural network with a single hidden layer of m neurons that approximates a product gate for n inputs can be formally written as a choice of constants \(a_{ij}\) and \(w_j\) satisfying
\[ \sum _{j=1}^m w_j\, \sigma \left( \sum _{i=1}^n a_{ij} x_i\right) \approx \prod _{i=1}^n x_i. \qquad \mathrm{(A1)} \]
Here, we use \(\approx \) to denote that the two sides of (A1) have identical Taylor expansions up to terms of degree n; as we discussed earlier in our construction of a product gate for two inputs, this enables us to achieve arbitrary accuracy \(\epsilon \) by first scaling down the factors \(x_i\), then approximately multiplying them and finally scaling up the result.
We may expand (A1) using the definition \(\sigma (x) = \sum _{k=0}^\infty \sigma _k x^k\) and drop terms of the Taylor expansion with degree greater than n, since they do not affect the approximation. Thus, we wish to find the minimal m such that there exist constants \(a_{ij}\) and \(w_j\) satisfying
\[ \sigma _n \sum _{j=1}^m w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^n = \prod _{i=1}^n x_i, \qquad \mathrm{(A2)} \]
\[ \sigma _k \sum _{j=1}^m w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^k = 0 \qquad \mathrm{(A3)} \]
for all \(0\le k \le n-1\). Let us set \(m = 2^n\), and enumerate the subsets of \(\{1,\ldots ,n\}\) as \(S_1,\ldots ,S_m\) in some order. Define a network of m neurons in a single hidden layer by setting \(a_{ij}\) equal to the function \(s_i(S_j)\) which is \(-1\) if \(i\in S_j\) and \(+1\) otherwise, setting
\[ w_j = \frac{1}{2^n\, n!\, \sigma _n} \prod _{i=1}^n s_i(S_j) = \frac{(-1)^{|S_j|}}{2^n\, n!\, \sigma _n}. \qquad \mathrm{(A4)} \]
In other words, up to an overall normalization constant, all coefficients \(a_{ij}\) and \(w_j\) equal \(\pm 1\), and each weight \(w_j\) is simply the product of the corresponding \(a_{ij}\).
We must prove that this network indeed satisfies Eqs. (A2) and (A3). The essence of our proof will be to expand the left hand side of Eq. (A1) and show that all monomial terms except \(x_1\cdots x_n\) come in pairs that cancel. To show this, consider a single monomial \(p(\mathbf x ) = x_1^{r_1}\cdots x_n^{r_n}\) where \(r_1 + \ldots + r_n = r \le n\).
If \(p(\mathbf x ) \ne \prod _{i=1}^n x_i\), then we must show that the coefficient of \(p(\mathbf x )\) in \(\sigma _r\sum _{j=1}^m w_j\left( \sum _{i=1}^n a_{ij} x_i\right) ^r\) is 0. Since \(p(\mathbf x ) \ne \prod _{i=1}^n x_i\), there must be some \(i_0\) such that \(r_{i_0} = 0\). In other words, \(p(\mathbf x )\) does not depend on the variable \(x_{i_0}\). Since the sum in Eq. (A1) is over all combinations of ± signs for all variables, every term will be canceled by another term where the (non-present) \(x_{i_0}\) has the opposite sign and the weight \(w_j\) has the opposite sign:
\[ \sigma _r \sum _{j=1}^m w_j\left( \sum _{i=1}^n a_{ij} x_i\right) ^r = \frac{\sigma _r}{2^n\, n!\, \sigma _n} \sum _{j\,:\, i_0\notin S_j} (-1)^{|S_j|} \left[ \left( \sum _{i=1}^n s_i(S_j)\, x_i\right) ^r - \left( \sum _{i=1}^n s_i(S_j\cup \{i_0\})\, x_i\right) ^r \right] . \]
Observe that the coefficient of \(p(\mathbf x )\) is equal in \(\left( \sum _{i=1}^n s_i(S_j) x_i\right) ^r\) and \(\left( \sum _{i=1}^n s_i(S_j \cup \{i_0\}) x_i\right) ^r\), since \(r_{i_0}=0\). Therefore, the overall coefficient of \(p(\mathbf x )\) in the above expression must vanish, which implies that (A3) is satisfied.
If instead \(p(\mathbf x ) = \prod _{i=1}^n x_i\), then the coefficient of \(p(\mathbf x )\) in \(\left( \sum _{i=1}^n a_{ij} x_i\right) ^n\) is \(n!\, \prod _{i=1}^n a_{ij} = (-1)^{|S_j|} n!\), because all n! contributing terms are identical and there is no cancelation. Hence, the coefficient of \(p(\mathbf x )\) on the left-hand side of (A2) is
\[ \sigma _n \sum _{j=1}^m w_j\, (-1)^{|S_j|}\, n! = \sigma _n\, n! \sum _{j=1}^m \frac{(-1)^{|S_j|}}{2^n\, n!\, \sigma _n}\, (-1)^{|S_j|} = \sum _{j=1}^m \frac{1}{2^n} = 1, \]
completing our proof that this network indeed approximates the desired product gate.
From the standpoint of group theory, our construction involves a representation of the group \(G = \mathbb {Z}_2^n\), acting upon the space of polynomials in the variables \(x_1, x_2, \ldots , x_n\). The group G is generated by elements \(g_i\) such that \(g_i\) flips the sign of \(x_i\) wherever it occurs. Then, our construction corresponds to the computation
\[ \mathbf f (x_1,\ldots ,x_n) = \left[ \prod _{i=1}^n (1 - g_i)\right] \sigma (x_1 + x_2 + \cdots + x_n). \]
Every monomial of degree at most n, with the exception of the product \(x_1\cdots x_n\), is sent to 0 by \((1 - g_i)\) for at least one choice of i. Therefore, \(\mathbf f (x_1,\ldots ,x_n)\) approximates a product gate (up to a normalizing constant).
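As a quick numerical sanity check of the construction in A.1 (our illustration, with \(\sigma (u)=e^u\) chosen so that every Taylor coefficient \(\sigma _k=1/k!\) is nonzero, and \(\lambda \) an illustrative rescaling), one can verify that \(2^n\) neurons with \(\pm 1\) weights indeed reproduce the product:

```python
import numpy as np
from itertools import product as sign_patterns

def flat_product(x, lam=100.0):
    # Single hidden layer of 2^n neurons: a_ij = +-1 (one neuron per sign pattern),
    # output weight w_j proportional to prod_i a_ij, activation sigma(u) = exp(u).
    # Inputs are scaled down by lam and the output rescaled by lam^n.
    x = np.asarray(x, dtype=float) / lam
    n = len(x)
    total = 0.0
    for s in sign_patterns([-1.0, 1.0], repeat=n):
        s = np.array(s)
        total += np.prod(s) * np.exp(s @ x)  # w_j * sigma(sum_i a_ij x_i), up to normalization
    return lam**n * total / 2**n             # 1/(2^n n! sigma_n) = 1/2^n since sigma_n = 1/n!

x = [1.5, -2.0, 0.5, 3.0]
print(flat_product(x), np.prod(x))           # both ~ -4.5
```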
A.2 Proof that \(2^n\) Neurons are Necessary
Suppose that S is a subset of \(\{1,\ldots ,n\}\) and consider taking the partial derivatives of (A2) and (A3), respectively, with respect to all the variables \(\{x_h\}_{h\in S}\). Then, we obtain the equalities
\[ \sigma _n\, \frac{n!}{(n-|S|)!} \sum _{j=1}^m w_j \left( \prod _{h\in S} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-|S|} = \prod _{i\notin S} x_i, \qquad \mathrm{(A5)} \]
\[ \sigma _k\, \frac{k!}{(k-|S|)!} \sum _{j=1}^m w_j \left( \prod _{h\in S} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{k-|S|} = 0 \qquad \mathrm{(A6)} \]
for all \(0\le k \le n-1\) (understood to be trivial when \(k<|S|\)). Let \(\mathbf A \) denote the \(2^n \times m\) matrix with elements
\[ A_{S,j} = \prod _{h\in S} a_{hj}, \qquad \mathrm{(A7)} \]
whose rows are indexed by the \(2^n\) subsets \(S\subseteq \{1,\ldots ,n\}\) and whose columns are indexed by the neurons \(j=1,\ldots ,m\).
We will show that \(\mathbf A \) has full row rank. Suppose, towards contradiction, that \(\mathbf c ^t\mathbf A =\mathbf{0}\) for some non-zero vector \(\mathbf c \). Specifically, suppose that there is a linear dependence between rows of \(\mathbf A \) given by
\[ \sum _{\ell } c_{\ell }\, A_{S_{\ell },j} = 0 \quad \text{for all } j, \qquad \mathrm{(A8)} \]
where the \(S_{\ell }\) are distinct and \(c_{\ell } \ne 0\) for every \(\ell \). Let s be the maximal cardinality of any \(S_{\ell }\). Defining the vector \(\mathbf d \) whose components are
\[ d_j = w_j \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s}, \qquad \mathrm{(A9)} \]
taking the dot product of Eq. (A8) with \(\mathbf d \) gives
\[ 0 = \sum _{\ell \,:\, |S_{\ell }|=s} c_{\ell } \sum _{j=1}^m w_j \left( \prod _{h\in S_{\ell }} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s} + \sum _{\ell \,:\, |S_{\ell }|<s} c_{\ell } \sum _{j=1}^m w_j \left( \prod _{h\in S_{\ell }} a_{hj}\right) \left( \sum _{i=1}^n a_{ij} x_i\right) ^{n-s}. \qquad \mathrm{(A10)} \]
Applying Eq. (A6) (with \(k=n+|S_{\ell }|-s\)) shows that the second term vanishes. Substituting Eq. (A5) now simplifies Eq. (A10) to
\[ 0 = \frac{(n-s)!}{n!\, \sigma _n} \sum _{\ell \,:\, |S_{\ell }|=s} c_{\ell } \prod _{i\notin S_{\ell }} x_i, \qquad \mathrm{(A11)} \]
i.e., to a statement that a set of monomials are linearly dependent. Since all distinct monomials are in fact linearly independent, this is a contradiction of our assumption that the \(S_{\ell }\) are distinct and \(c_{\ell }\) are nonzero. We conclude that \(\mathbf A \) has full row rank, and therefore that \(m \ge 2^n\), which concludes the proof.
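As a small illustration (ours; it checks the explicit network of A.1 rather than an arbitrary one), the following snippet builds the matrix \(\mathbf A \) of Eq. (A7) for \(n=3\) and confirms that it has full row rank \(2^n\), consistent with the bound \(m\ge 2^n\):

```python
import numpy as np
from itertools import chain, combinations

n = 3
subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))
m = len(subsets)                                   # 2^n neurons, as in the A.1 construction
# a[i, j] = s_i(S_j): -1 if i is in S_j, +1 otherwise
a = np.array([[-1.0 if i in Sj else 1.0 for Sj in subsets] for i in range(n)])
# A[S, j] = product over h in S of a[h, j], rows indexed by the 2^n subsets S (Eq. A7)
A = np.array([np.prod(a[list(S), :], axis=0) for S in subsets])
assert A.shape == (2**n, m)
print(np.linalg.matrix_rank(A), 2**n)              # 8 8, i.e. full row rank
```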