Kolmogorov width decay and poor approximators in machine learning: shallow neural networks, random feature models and neural tangent kernels

Abstract

We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor \(L^{2}\)-approximators for the class of two-layer neural networks in high dimension, and that multi-layer networks with small path norm are poor approximators for certain Lipschitz functions, also in the \(L^{2}\)-topology.

References

  1. Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R., Wang, R.: On exact computation with an infinitely wide neural net. In: Advances in Neural Information Processing Systems, pp. 8139–8148 (2019)

  2. Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629–681 (2017)

  3. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930–945 (1993)

  4. Brezis, H.: Functional Analysis. Sobolev Spaces and Partial Differential Equations. Universitext. Springer, New York (2011)

  5. Cho, Y., Saul, L.K.: Kernel methods for deep learning. In: Advances in Neural Information Processing Systems, pp. 342–350 (2009)

  6. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

  7. Du, S.S., Lee, J.D., Li, H., Wang, L., Zhai, X.: Gradient descent finds global minima of deep neural networks. arXiv:1811.03804 [cs.LG] (2018)

  8. Dobrowolski, M.: Angewandte Funktionalanalysis: Funktionalanalysis, Sobolev-Räume und elliptische Differentialgleichungen. Springer (2010)

  9. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 [cs.LG] (2018)

  10. E, W., Ma, C., Wu, L.: A priori estimates of the population risk for two-layer neural networks. Commun. Math. Sci. 17(5), 1407–1425 (2019)

  11. E, W., Ma, C., Wang, Q.: A priori estimates of the population risk for residual networks. arXiv:1903.02154 [cs.LG] (2019)

  12. E, W., Ma, C., Wu, L.: Barron spaces and the compositional function spaces for neural network models. arXiv:1906.08039 [cs.LG] (2019)

  13. E, W., Ma, C., Wu, L.: Machine learning from a continuous viewpoint. arXiv:1912.12777 [math.NA] (2019)

  14. E, W., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. (2019). https://doi.org/10.1007/s11425-019-1628-5

  15. E, W., Ma, C., Wang, Q., Wu, L.: Analysis of the gradient descent algorithm for a deep neural network model with skip-connections. arXiv:1904.05263 [cs.LG] (2019)

  16. E, W., Wojtowytsch, S.: On the Banach spaces associated with multi-layer ReLU networks of infinite width. arXiv:2007.15623 [stat.ML] (2020)

  17. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162(3–4), 707–738 (2015)

  18. Gribonval, R., Kutyniok, G., Nielsen, M., Voigtlaender, F.: Approximation spaces of deep neural networks. arXiv:1905.01208 [math.FA] (2019)

  19. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)

  20. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)

  21. Lorentz, G.: Approximation of Functions. Holt, Rinehart and Winston, New York (1966)

  22. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570–8581 (2019)

  23. Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561. IEEE (2008)

  24. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)

  25. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)

  26. Wojtowytsch, S., E, W.: Can shallow neural networks beat the curse of dimensionality? A mean field training perspective. arXiv:2005.10815 [cs.LG] (2020)

  27. Yang, G.: Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760 [cs.NE] (2019)

Author information

Correspondence to Stephan Wojtowytsch.

Additional information

Dedicated to Andrew Majda on the occasion of his 70th birthday

Appendices

Appendix A: A brief review of Barron space

For the convenience of the reader, we recall Barron space for two-layer neural networks as introduced by E et al. [10, 12]. We focus on the functional analytic properties of Barron space; for results with a focus on machine learning, we refer the reader to the original sources. The same space is denoted as \({\mathcal F}_1\) in [2], but described from a different perspective. For functional analytic notions, we refer the reader to [4].

Let \({\mathbb {P}}\) be a probability measure on \(\mathbb R^d\) and \(\sigma \) a Lipschitz-continuous function such that either

  1. \(\sigma = \text {ReLU}\) or

  2. \(\sigma \) is sigmoidal, i.e., \(\lim _{z\rightarrow \pm \infty } \sigma (z) = \pm 1\) (or 0 and 1).

Consider the class \({\mathcal F}_m\) of two-layer networks with m neurons

$$\begin{aligned} f_\Theta (x) = \frac{1}{m} \sum _{i=1}^m a_i \,\sigma \big (w_i^\mathrm{T}x+ b_i\big ), \qquad \Theta = \{(a_i, w_i, b_i) \in \mathbb R^{d+2}\}_{i=1}^m. \end{aligned}$$

It is well-known that the closure of \({\mathcal F}= \bigcup _{m=1}^\infty {\mathcal F}_m\) in the uniform topology is the space of continuous functions, see, e.g., [6]. Barron space is a different closure of the same function class where the path-norm

$$\begin{aligned} \Vert f_\Theta \Vert _{\text {path}} = \frac{1}{m} \sum _{i=1}^m |a_i| \,\big [|w_i|_{\ell ^q} + |b_i|\big ] \end{aligned}$$

remains bounded. Here, we assume that the data space \(\mathbb R^d\) is equipped with the \(\ell ^p\)-norm and take the dual \(\ell ^q\)-norm on w. This notion of path norm corresponds to ReLU activation; a slightly different path norm for bounded Lipschitz activation is discussed below.
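To make the finite-width definitions concrete, here is a minimal numerical sketch (not part of the original article) that evaluates \(f_\Theta \) in the normalized form above and computes its path norm for ReLU activation with \(p=\infty \), \(q=1\); all function names and parameter choices are ours.

```python
import numpy as np

def two_layer_net(x, a, W, b):
    """Evaluate f_Theta(x) = (1/m) * sum_i a_i * relu(w_i^T x + b_i)."""
    return np.mean(a * np.maximum(W @ x + b, 0.0))

def path_norm(a, W, b, q=1):
    """ReLU path norm: (1/m) * sum_i |a_i| * (|w_i|_q + |b_i|)."""
    return np.mean(np.abs(a) * (np.linalg.norm(W, ord=q, axis=1) + np.abs(b)))

rng = np.random.default_rng(0)
d, m = 5, 100
a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
x = rng.uniform(-1.0, 1.0, size=d)

print("f_Theta(x) =", two_layer_net(x, a, W, b))
print("path norm  =", path_norm(a, W, b))
```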

The same class is often discussed without the normalizing factor of \(\frac{1}{m}\). With the factor, the following concept of infinitely wide two-layer networks emerges more naturally.

Definition A.1

Let \(\pi \) be a Radon probability measure on \(\mathbb R^{d+2}\) with finite second moments, which we denote by \(\pi \in {\mathcal P}_{2}(\mathbb R^{d+2})\). We denote by

$$\begin{aligned} f_\pi (x) = \int _{\mathbb R^{d+2}} a\,\sigma (w^\mathrm{T}x+b)\,\pi (\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

the two-layer network associated with \(\pi \).

If \(\sigma = \mathrm {ReLU}\), it is clear that \(f_\pi \) is Lipschitz-continuous on \(\mathbb R^d\) with Lipschitz-constant \(\le \Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}\), so \(f_\pi \) lies in the space of (possibly unbounded) Lipschitz functions \(\mathrm {Lip}(\mathbb R^d)\). If \(\sigma \) is a bounded Lipschitz function, the integral converges in \(C^0(\mathbb R^d)\) without assumptions on the moments of \(\pi \). If the second moments of \(\pi \) are bounded, \(f_\pi \) is a Lipschitz function also in the case of bounded Lipschitz activation \(\sigma \).
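A finite network whose parameters are sampled i.i.d. from \(\pi \) is exactly the empirical-measure discretization of \(f_\pi \); the following sketch (ours, with a concrete Gaussian choice of \(\pi \) purely for illustration) estimates the resulting Monte Carlo error at a fixed point, which decays like \(m^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def sample_pi(m):
    """m i.i.d. samples (a_i, w_i, b_i) from a concrete pi (independent Gaussians)."""
    return rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)

def f_empirical(x, a, W, b):
    # (1/m) sum_i a_i relu(w_i^T x + b_i): f_pi with pi replaced by an empirical measure
    return np.mean(a * np.maximum(W @ x + b, 0.0))

x = rng.uniform(-1.0, 1.0, size=d)
f_ref = f_empirical(x, *sample_pi(1_000_000))     # proxy for the integral f_pi(x)

for m in (10, 100, 1000, 10000):
    errs = [abs(f_empirical(x, *sample_pi(m)) - f_ref) for _ in range(50)]
    print(f"m = {m:6d}   mean |f_m(x) - f_pi(x)| ~ {np.mean(errs):.4f}")
```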

For technical reasons, we will extend the definition to distributions \(\pi \) for which only the mixed second moments

$$\begin{aligned} \int _{\mathbb R^{d+2}}|a|\,\big [|w| + \psi (b)\big ]\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

are finite, where \(\psi (b) = |b|\) in the ReLU case and \(\psi \equiv 1\) otherwise. Now, we introduce the associated function space.

Definition A.2

We denote

$$\begin{aligned}&\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})}:\mathrm {Lip}(\mathbb R^d)\rightarrow [0,\infty ), \\&\Vert f\Vert _{{\mathcal B}(\mathbb {P})}= \inf _{\{\pi |f_\pi =f\,\mathbb {P}-\text {a.e.}\}} \int _{\mathbb R^{d+2}}|a|\,\big [|w|_{\ell ^q}+|b|\big ]\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

if \(\sigma = \mathrm {ReLU}\) and

$$\begin{aligned}&\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})}:\mathrm {Lip}(\mathbb R^d)\rightarrow [0,\infty ), \\&\Vert f\Vert _{{\mathcal B}(\mathbb {P})}= \inf _{\{\pi |f_\pi =f\,\mathbb {P}-\text {a.e.}\}} \int _{\mathbb R^{d+2}}|a|\,\big [|w|_{\ell ^q}+1\big ]\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

otherwise. In either case, we denote

$$\begin{aligned} {\mathcal B}(\mathbb {P}) = \big \{f\in \mathrm {Lip}(\mathbb R^d) : \Vert f\Vert _{{\mathcal B}(\mathbb {P})}<\infty \big \}. \end{aligned}$$

Here, \(\inf \emptyset = + \infty \).

Remark A.3

It depends on the activation function \(\sigma \) whether or not the infimum in the definition of the norm is attained. If \(\sigma =\) ReLU, this can be shown to be true using homogeneity. Instead of a probability measure \(\pi \) on the whole space, one can use a signed measure \(\mu \) on the unit sphere to express a two-layer network. The compactness theorem for Radon measures provides the existence of a measure minimizer, which can then be lifted to a probability measure (see below for similar arguments). On the other hand, if \(\sigma \) is a classical sigmoidal function such that

$$\begin{aligned} \lim _{z\rightarrow -\infty } \sigma (z) = 0, \qquad \lim _{z\rightarrow \infty }\sigma (z) = 1, \qquad 0< \sigma (z) < 1 \quad \forall \ z\in \mathbb R, \end{aligned}$$

then the function \(f(z) \equiv 1\) has Barron norm 1, but the infimum is not attained. This holds true for any data distribution \(\mathbb {P}\).

Proof

  1. For any \(x\in {\mathrm {spt}}(\mathbb {P})\), we have

    $$\begin{aligned} 1 = f(x) = \int _{\mathbb R^{d+2}} a\,\sigma (w^\mathrm{T}x+b)\,\mathrm {d}\pi \le \int _{\mathbb R^{d+2}}|a|\,\mathrm {d}\pi \le \int _{\mathbb R^{d+2}}|a|\,\big [|w| + 1\big ]\,\mathrm {d}\pi . \end{aligned}$$

    Taking the infimum over all \(\pi \), we find that \(1\le \Vert f\Vert _{{\mathcal B}(\mathbb {P})}\). For any measure \(\pi \), the inequality above is strict since \(|\sigma |<1\), so there is no \(\pi \) which attains equality.

  2. We consider a family of measures

    $$\begin{aligned}&\pi _\lambda = \delta _{a= \sigma (\lambda )^{-1}}\,\delta _{w=0}\,\delta _{b = \lambda } \\&\quad \Rightarrow \quad f_{\pi _\lambda } \equiv 1\quad \text {and } \int _{\mathbb R^{d+2}} |a|\big [|w|_{\ell ^q}+1\big ]\,\mathrm {d}\pi _\lambda = \frac{1}{\sigma (\lambda )} \rightarrow 1 \end{aligned}$$

    as \(\lambda \rightarrow \infty \).

Thus, \(\Vert f\Vert _{{\mathcal B}(\mathbb {P})} = 1\), but there is no minimizing parameter distribution \(\pi \).
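A brief numerical check of step (2), with the logistic function as one concrete sigmoidal activation (our choice; the argument does not depend on it): the parameter family \(\pi _\lambda \) represents the constant function 1 at cost \(1/\sigma (\lambda )\rightarrow 1\).

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # a concrete sigmoidal activation

for lam in (1.0, 5.0, 10.0, 20.0):
    a = 1.0 / sigma(lam)                          # pi_lambda = delta at (a, w=0, b=lam)
    x = np.linspace(-3.0, 3.0, 7)
    f = a * sigma(0.0 * x + lam)                  # f_{pi_lambda}, identically 1
    cost = abs(a) * (0.0 + 1.0)                   # |a| (|w| + 1) = 1 / sigma(lambda)
    print(f"lambda = {lam:5.1f}   max|f - 1| = {np.max(np.abs(f - 1.0)):.2e}   cost = {cost:.4f}")
```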

Remark A.4

The space \({\mathcal B}(\mathbb {P})\) does not depend on the measure \(\mathbb {P}\), but only on the system of null sets for \(\mathbb {P}\).

We note that the space \({\mathcal B}(\mathbb {P})\) is reasonably well-behaved from the point of view of functional analysis.

Lemma A.5

\({\mathcal B}({\mathbb P})\) is a Banach space with norm \(\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})}\). If \({\mathrm {spt}}(\mathbb {P})\) is compact, \({\mathcal B}(\mathbb {P})\) embeds continuously into the space of Lipschitz functions \(C^{0,1}({\mathrm {spt}}\,\mathbb {P})\).

Proof

Scalar multiplication Let \(f\in {\mathcal B}(\mathbb {P})\). For \(\lambda \in \mathbb R\) and \(\pi \in {\mathcal {P}}_{2}(\mathbb R^{d+2})\), define the push-forward

$$\begin{aligned} T_{\lambda \sharp }\pi \in {\mathcal P}_2(\mathbb R^{d+2})\qquad \text {along } T_\lambda :\mathbb R^{d+2}\rightarrow \mathbb R^{d+2}, \quad T_\lambda (a,w,b) = (\lambda a, w,b). \end{aligned}$$

Then,

$$\begin{aligned} f_{T_{\lambda \sharp }\pi }&= \lambda \,f_\pi ,\\ \int _{\mathbb R^{d+2}} |a|\,\big [|w|+ 1\big ]\,T_{\lambda \sharp }\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)&= |\lambda | \int _{\mathbb R^{d+2}} |a|\,\big [|w|+ 1\big ]\,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

and similarly in the ReLU case. Thus, scalar multiplication is well-defined in \({\mathcal B}({\mathbb P})\). Taking the infimum over \(\pi \), we find that \(\Vert \lambda f\Vert _{{\mathcal B}(\mathbb {P})} = |\lambda |\,\Vert f\Vert _{{\mathcal B}(\mathbb {P})}\).

Vector addition Let \(g,h \in {\mathcal B}(\mathbb {P})\). Choose \(\pi _g, \pi _h\) such that \(g= f_{\pi _g}\) and \(h = f_{\pi _h}\). Consider

$$\begin{aligned} \pi = \frac{1}{2}\big [ T_{2\sharp } \pi _g + T_{2\sharp } \pi _h\big ] \end{aligned}$$

like above. Then, \(f_\pi = g+h\) and

$$\begin{aligned} \int _{\mathbb R^{d+2}} |a|\,\big [|w|+ 1\big ]\,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)&= \int _{\mathbb R^{d+2}} |a|\,\big [|w|+ 1\big ]\,\pi _g(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&\quad + \int _{\mathbb R^{d+2}} |a|\,\big [|w|+ 1\big ]\,\pi _h(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b). \end{aligned}$$

Taking infima, we see that \(\Vert g+h\Vert _{{\mathcal B}(\mathbb {P})} \le \Vert g\Vert _{{\mathcal B}(\mathbb {P})} + \Vert h\Vert _{{\mathcal B}(\mathbb {P})}\). The same holds in the ReLU case.

Positivity and embedding Recall that the norm on the space of Lipschitz functions on a compact set K is

$$\begin{aligned} \Vert f\Vert _{C^{0,1}(K)} = \sup _{x\in K} |f(x)| + \sup _{x,y\in K,\, x\ne y} \frac{|f(x) - f(y)|}{|x-y|}. \end{aligned}$$

It is clear that \(\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})} \ge 0\). If \(\sigma \) is a bounded Lipschitz function, then

$$\begin{aligned} \Vert f_\pi \Vert _{L^\infty ({\mathrm {spt}}\,\mathbb {P})} \le {\Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}}\,\Vert \sigma \Vert _{L^\infty (\mathbb R)} \end{aligned}$$

like in Remark A.3. If \(\sigma = \mathrm {ReLU}\), then

$$\begin{aligned} \sup _{x\in \mathbb R^d} \frac{|f_\pi (x)|}{1+ |x|} \le \Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}. \end{aligned}$$

In either case

$$\begin{aligned} |f_\pi (x) - f_\pi (y)| \le [\sigma ]_{\mathrm {Lip}}\,\Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}\,|x-y| \end{aligned}$$

for all x, y in \({\mathrm {spt}}(\mathbb {P})\). In particular, \(\Vert f\Vert _{{\mathcal B}(\mathbb {P})} > 0\) whenever \(f\ne 0\) in \({\mathcal B}(\mathbb {P})\), and if \({\mathrm {spt}}(\mathbb {P})\) is compact, \({\mathcal B}(\mathbb {P})\) embeds into the space of Lipschitz functions on \({\mathrm {spt}}(\mathbb {P})\). If \(\sigma \) is a bounded Lipschitz function, \({\mathcal B}(\mathbb {P})\) embeds into the space of bounded Lipschitz functions also on unbounded sets.

Completeness Completeness is proved most easily by introducing a different representation for Barron functions. Consider the space

$$\begin{aligned} V = \left\{ \mu \,\bigg |\, \mu \text { (signed) Radon measure on }\mathbb R^{d+2}\text { s.t. } \int _{\mathbb R^{d+2}}|a|\,\big [|w|+|b|\big ]\,|\mu |(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b) < \infty \right\} , \end{aligned}$$

where \(|\mu |\) is the total variation measure of \(\mu \). Equipped with the norm

$$\begin{aligned} \Vert \mu \Vert _V = \int _{\mathbb R^{d+2}}|a|\,\big [|w|+|b|\big ]\,|\mu |(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b), \end{aligned}$$

V is a Banach space when we quotient out measures supported on \(\{|a|=0\}\cup \{|b| = |w| = 0\}\) or restrict ourselves to the subspace of measures such that \(|\mu |(\{|a|=0\}) = |\mu | (\{|b|=|w|=0\}) = 0\). The only non-trivial question is whether V is complete. By definition, \(\mu _n\) is a Cauchy sequence in V if and only if \(\nu _n {:}{=} |a|\,[|w|+ |b|]\cdot \mu _n\) is a Cauchy sequence in the space of finite Radon measures. Since the space of finite Radon measures is complete, \(\nu _n\) converges (strongly) to a measure \(\nu \) which satisfies \(\nu (\{a =0\}\cup \{(w,b) = 0\}) =0\). We then obtain \(\mu {:}{=} |a|^{-1}\,[|w|+ |b|]^{-1}\cdot \nu \). For \(\mu \in V\), we write

$$\begin{aligned} f_{\mu } (x) = \int _{\mathbb R^{d+2}} a\,\sigma (w^\mathrm{T}x+b)\,\mu (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

and consider the subspace

$$\begin{aligned} V^0_{\mathbb {P}} = \{\mu \in V\,|\, f_\mu = 0 \,\,\mathbb {P}-\text {almost everywhere}\}. \end{aligned}$$

Since the map

$$\begin{aligned} V\rightarrow C^{0,1}({\mathrm {spt}}\,\mathbb {P}), \qquad \mu \mapsto f_\mu \end{aligned}$$

is continuous by the same argument as before, we find that \(V^0_\mathbb {P}\) is a closed subspace of V. In particular, \(V/V^0_\mathbb {P}\) is a Banach space. We claim that \({\mathcal B}(\mathbb {P})\) is isometric to the quotient space \(V/V^0_\mathbb {P}\) by the map \([\mu ]\mapsto f_\mu \) where \(\mu \) is any representative in the equivalence class \([\mu ]\).

It is clear that any representative in the equivalence class induces the same function \(f_{\mu }\) such that the map is well-defined. Consider the Hahn decomposition \(\mu = \mu ^{+} - \mu ^-\) of \(\mu \) as the difference of non-negative Radon measures. Set

$$\begin{aligned} m^{\pm } := \Vert \mu ^\pm \Vert , \qquad \pi = \frac{1}{2} \left[ \frac{1}{m^+}\, T_{2m^+\sharp }\,\mu ^+ + \frac{1}{m^{-}}\,T_{-2m^{-}\sharp }\,\mu ^{-}\right] . \end{aligned}$$

Then, \(\pi \) is a probability Radon measure such that \(f_\pi = f_\mu \) and

$$\begin{aligned} \int _{\mathbb R^{d+2}}|a|\,\big [|w|+|b|\big ]\,\pi (\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b) = \int _{\mathbb R^{d+2}}|a|\,\big [|w|+|b|\big ]\,|\mu |(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b). \end{aligned}$$

In particular, \(\Vert f_\mu \Vert _{{\mathcal B}(\mathbb {P})} \le \Vert \mu \Vert _V\). Taking the infimum of the right hand side, we conclude that \(\Vert f_\mu \Vert _{{\mathcal B}(\mathbb {P})} \le \Vert [\mu ]\Vert _{V/V^0_\mathbb {P}}\). The opposite inequality is trivial since every probability measure is in particular a signed Radon measure.

Thus, \({\mathcal B}(\mathbb {P})\) is isometric to a Banach space, hence a Banach space itself. We presented the argument in the context of ReLU activation, but the same proof holds for bounded Lipschitz activation. \(\square \)

A few remarks are in order.

Remark A.6

The requirement that \({\mathbb P}\) have compact support can be relaxed when we consider the norm

$$\begin{aligned} \Vert f\Vert _{C^{0,1}(\mathbb {P})} = \sup _{x,y\in {\mathrm {spt}}\mathbb {P},\, x\ne y} \frac{|f(x) - f(y)|}{|x-y|} + \Vert f\Vert _{L^1(\mathbb {P})} \end{aligned}$$

on the space of Lipschitz functions. Since Lipschitz functions grow at most linearly, this is well-defined for all data distributions \(\mathbb {P}\) with finite first moments.

Remark A.7

For general Lipschitz-activation \(\sigma \) which is neither bounded nor ReLU, the Barron norm is defined as

$$\begin{aligned} \Vert f\Vert _{{\mathcal B}(\mathbb {P})} = \inf _{\{\pi |f_\pi =f\,\mathbb {P}-\text {a.e.}\}} \int _{\mathbb R^{d+2}}|a|\,\big [|w|_{\ell ^q}+|b| + 1\big ]\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b). \end{aligned}$$

Similar results hold in this case.

Remark A.8

In general, \({\mathcal B}(\mathbb {P})\) for ReLU activation is not separable. Consider \(\mathbb {P}= {\mathcal L}^1|_{[0,1]}\) to be Lebesgue measure on the unit interval in one dimension. For \(\alpha \in (0,1)\) set \(f_\alpha (x) = \sigma (x-\alpha )\). Then, for \(\beta >\alpha \), we have

$$\begin{aligned} 1 = \frac{\beta - \alpha }{\beta -\alpha } = \frac{(f_\alpha -f_\beta )(\beta ) - (f_\alpha - f_\beta )(\alpha )}{\beta -\alpha } \le [f_\beta -f_\alpha ]_{\mathrm {Lip}} \le \Vert f_\beta - f_\alpha \Vert _{{\mathcal B}(\mathbb {P})}. \end{aligned}$$

Thus, there exists an uncountable family of functions with distance \(\ge 1\), meaning that \({\mathcal B}(\mathbb {P})\) cannot be separable.

Remark A.9

In general, \({\mathcal B}(\mathbb {P})\) for ReLU activation is not reflexive. We consider \(\mathbb {P}\) to be the uniform measure on [0, 1] and demonstrate that \({\mathcal B}(\mathbb {P})\) is the space of functions whose first derivative is in BV (i.e., whose second derivative is a Radon measure) on [0, 1]. The space of Radon measures on [0, 1] is denoted by \({\mathcal M}[0,1]\) and equipped with the total variation norm.

Assume that f is a Barron function on [0, 1]. Then,

$$\begin{aligned} f(x)&= \int _{\mathbb R^3} a\,\sigma (wx + b)\,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&= \int _{\{w\ne 0\}} a\,|w|\,\sigma \left( \frac{w}{|w|}x + \frac{b}{|w|}\right) \,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b) + \int _{\{w=0\}}a\,\sigma (b)\,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&= \int _{\mathbb R} \sigma \left( -x+\tilde{b}\right) \,\mu _1(\mathrm {d}\tilde{b}) + \int _{\mathbb R} \sigma \left( x + \tilde{b}\right) \,\mu _2(\mathrm {d}\tilde{b}) + \int _{\{w=0\}}a\,\sigma (b)\,\pi (\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

where

$$\begin{aligned} \mu _{1} = T_\sharp \big (a\,|w|\cdot \pi \big ), \qquad T:\{w<0\}\rightarrow \mathbb R, \quad T(a,w, b) = \frac{b}{|w|}. \end{aligned}$$

and similarly for \(\mu _{2}\). The Barron norm is expressed as

$$\begin{aligned} \Vert f\Vert _{{\mathcal B}(\mathbb {P})}&= \inf _{\mu _1,\mu _2} \left[ \int _{\mathbb R} \big (1 + |\tilde{b}|\big )\,|\mu _1|(\mathrm {d}\tilde{b}) + \int _{\mathbb R} \big (1 + |\tilde{b}|\big )\,|\mu _2|(\mathrm {d}\tilde{b})\right] \\&\quad + \int _{\{w=0\}} |ab|\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b). \end{aligned}$$

Since \(\sigma '' = \delta \), we can formally calculate that

$$\begin{aligned} f'' = \mu _1 + \mu _2 \end{aligned}$$

This is easily made rigorous in the distributional sense. Since [0, 1] is bounded by 1, we obtain, in addition to the bounds on f(0) and the Lipschitz constant of f, that

$$\begin{aligned} \Vert f''\Vert _{{\mathcal M}[0,1]} \le 2\,\Vert f\Vert _{{\mathcal B}(\mathbb {P})}. \end{aligned}$$

On the other hand, if f has a second derivative, then

$$\begin{aligned} f(x)&= f(0) + f'(0)\,x + \int _0^x \int _0^t f''(\xi )\mathrm {d}\xi \,\mathrm {d}t\\&= f(0) + f'(0)\,x + \int _0^x (x-t)\,f''(t)\mathrm {d}t\\&= f(0)\,\sigma (1) + f'(0)\,\sigma (x) + \int _0^xf''(t)\,\sigma (x-t)\,\mathrm {d}t\end{aligned}$$

for all \(x\in [0,1]\). This easily extends to measure valued derivatives and we conclude that

$$\begin{aligned} \Vert f\Vert _{{\mathcal B}(\mathbb {P})} \le |f(0)| + |f'(0)| + \Vert f''\Vert _{{\mathcal M}[0,1]} . \end{aligned}$$

We can thus express \({\mathcal B}(\mathbb {P}) = BV([0,1]) \times \mathbb R\times \mathbb R\) with an equivalent norm

$$\begin{aligned} \Vert f\Vert _{{\mathcal B}(\mathbb {P})}' = |f(0)| + |f'(0)| + \Vert f''\Vert _{{\mathcal M}[0,1]}. \end{aligned}$$

Thus, \({\mathcal B}(\mathbb {P})\) is not reflexive since BV is not reflexive.
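The last representation formula is constructive: discretizing the integral \(\int _0^x f''(t)\,\sigma (x-t)\,\mathrm {d}t\) by a Riemann sum produces a finite ReLU network approximating any \(C^2\) function on [0, 1]. A minimal numerical sketch (ours, for a concrete choice of f):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

f, df, d2f = np.sin, np.cos, (lambda t: -np.sin(t))   # any C^2 function on [0, 1]

def relu_representation(x, n=200):
    """Riemann-sum discretization of
       f(x) = f(0) relu(1) + f'(0) relu(x) + int_0^1 f''(t) relu(x - t) dt."""
    t = (np.arange(n) + 0.5) / n                       # midpoints of [0, 1]
    integral = (d2f(t)[None, :] * relu(x[:, None] - t[None, :])).sum(axis=1) / n
    return f(0.0) * relu(1.0) + df(0.0) * relu(x) + integral

x = np.linspace(0.0, 1.0, 11)
print("max error:", np.max(np.abs(relu_representation(x) - f(x))))
```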

Finally, we demonstrate that integration against empirical measures converges quickly on \({\mathcal B}(\mathbb {P})\). Assume that \(p=\infty \), \(q=1\).

Lemma A.10

The uniform Monte Carlo estimate

$$\begin{aligned} {\mathbb E}_{X_i \sim \mathbb {P}} \left\{ \sup _{\Vert \phi \Vert _{{\mathcal B}(\mathbb {P})}\le 1} \left[ \frac{1}{n}\sum _{i=1}^n\phi (X_i) - \int _Q\phi (x)\,\mathbb {P}(\mathrm {d}x)\right] \right\} \le 2L\,\sqrt{\frac{2\,\log (2d)}{n}} \end{aligned}$$

holds for any probability distribution \(\mathbb {P}\) such that \({\mathrm {spt}}(\mathbb {P})\subseteq [-1,1]^d\). Here, L is the Lipschitz-constant of \(\sigma \).

Proof

A single data point may underestimate or overestimate the average integral, and the proof of convergence relies on these cancellations. A convenient tool to formalize cancellation and decouple this randomness from other effects is through Rademacher complexity [24, Chapter 26].

We denote \(S= \{X_1,\dots , X_n\}\) and assume that S is drawn iid from the distribution \(\mathbb {P}\). We consider an auxiliary random vector \(\xi \) such that the entries \(\xi _i\) are iid (and independent of S) variables which take the values \(\pm \, 1\) with probability 1/2. We abbreviate by B the unit ball in \({\mathcal B}(\mathbb {P})\) and set

$$\begin{aligned} \mathrm {Rep}(B,S)&= \sup _{\phi \in B} \left[ \frac{1}{n}\sum _{i=1}^n\phi (X_i) - \int _Q\phi (x)\,\mathbb {P}(\mathrm {d}x)\right] ,\\ \mathrm {Rad} (B,S)&= {\mathbb E}_\xi \sup _{\phi \in B} \left[ \frac{1}{n} \sum _{i=1}^n \xi _i\,\phi (X_i)\right] . \end{aligned}$$

According to [24, Lemma 26.2], the Rademacher complexity \(\mathrm {Rad}\) bounds the representativeness of the set S by

$$\begin{aligned} {\mathbb E}_S \mathrm {Rep}(B,S) \le 2\,{\mathbb E}_S \mathrm {Rad}(B,S). \end{aligned}$$

The unit ball in Barron space is given by convex combinations of functions

$$\begin{aligned} \phi _{w,b} (x) = \pm \frac{\sigma (w^\mathrm{T} x+ b)}{|w|+1} \quad \text {or} \quad \pm \frac{\sigma (w^\mathrm{T} x+ b)}{|w|+|b|}, \end{aligned}$$

for bounded Lipschitz and ReLU activation, respectively. Hence, for fixed \(\xi \), the linear map \(\phi \mapsto \frac{1}{n} \sum _{i=1}^n \xi _i\,\phi (X_i)\) attains its supremum over B at one of the functions generating the convex hull, i.e.,

$$\begin{aligned} \mathrm {Rad} (B,S)= {\mathbb E}_\xi \sup _{(w,b)} \left[ \frac{1}{n} \sum _{i=1}^n \xi _i\,\frac{\sigma (w^\mathrm{T} X_i+ b)}{|w|+1}\right] . \end{aligned}$$

According to the Contraction Lemma [24, Lemma 26.9], the Lipschitz-nonlinearity \(\sigma \) can be neglected in the computation of the complexity. If L is the Lipschitz-constant of \(\sigma \), then

$$\begin{aligned} {\mathbb E}_\xi \sup _{(w,b)} \left[ \frac{1}{n} \sum _{i=1}^n \xi _i\,\frac{\sigma (w^\mathrm{T} X_i+ b)}{|w|+1}\right]&\le L \, {\mathbb E}_\xi \sup _{(w,b)} \left[ \frac{1}{n} \sum _{i=1}^n \xi _i\,\frac{w^\mathrm{T} X_i+ b}{|w|+1}\right] \\&= L\, {\mathbb E}_\xi \sup _{w\in \mathbb R^d} \frac{w^\mathrm{T}}{|w|+1} \frac{1}{n} \sum _{i=1}^n \xi _iX_i\\&= L\, {\mathbb E}_\xi \sup _{|w|\le 1} w^\mathrm{T} \frac{1}{n} \sum _{i=1}^n \xi _iX_i\\&\le L\,\sup _i|X_i|_\infty \,\sqrt{\frac{2\,\log (2d)}{n}}, \end{aligned}$$

where we used [24, Lemma 26.11] for the complexity bound of the linear function class and [24, Lemma 26.6] to eliminate the scalar translation. A similar computation can be done in the ReLU case. \(\square \)
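The last display rests on the Rademacher complexity of the linear class \(\{x\mapsto w^\mathrm{T}x : |w|_{\ell ^1}\le 1\}\), for which the supremum over the \(\ell ^1\)-ball is attained at a coordinate vector and equals the \(\ell ^\infty \)-norm of \(\frac{1}{n}\sum _i \xi _i X_i\). A small Monte Carlo check of this bound (our construction, not from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 50, 200, 2000

X = rng.uniform(-1.0, 1.0, size=(n, d))            # data in [-1, 1]^d

# Empirical Rademacher complexity of {x -> w^T x : |w|_1 <= 1}:
# sup over the l1-ball of w^T v equals |v|_infinity with v = (1/n) sum_i xi_i X_i.
rad = np.mean([np.max(np.abs(rng.choice([-1.0, 1.0], size=n) @ X)) / n
               for _ in range(trials)])

bound = np.max(np.abs(X)) * np.sqrt(2.0 * np.log(2 * d) / n)
print(f"empirical Rademacher complexity ~ {rad:.4f}   sqrt(2 log(2d)/n) bound = {bound:.4f}")
```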

Appendix B: A brief review of reproducing kernel Hilbert spaces, random feature models, and the neural tangent kernel

For an introduction to kernel methods in machine learning, see, e.g., [24, Chapter 16] or [5] in the context of deep learning. Let \(k:\mathbb R^d\times \mathbb R^d\rightarrow \mathbb R\) be a symmetric positive definite kernel. The reproducing kernel Hilbert space (RKHS) \({\mathcal H}_k\) associated with k is the completion of the collection of functions of the form

$$\begin{aligned} h(x) = \sum _{i=1}^n a_i\,k(x,x_i) \end{aligned}$$

under the scalar product \(\langle k(x,x_i), k(x, x_j)\rangle _{{\mathcal H}_{k}} = k(x_i, x_j)\). Note that, due to the positive definiteness of the kernel, the representation of a function is unique.
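In coordinates, the scalar product above means that for \(h = \sum _i a_i k(\cdot , x_i)\) one has \(\Vert h\Vert _{{\mathcal H}_k}^2 = a^\mathrm{T} K a\) with Gram matrix \(K_{ij} = k(x_i,x_j)\), and \(h(y) = \sum _i a_i k(x_i,y)\) by the reproducing property. A minimal sketch (ours, with a Gaussian kernel chosen purely for illustration):

```python
import numpy as np

def gaussian_kernel(X, Y, ell=1.0):
    """A concrete symmetric positive definite kernel (illustrative choice)."""
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * ell**2))

rng = np.random.default_rng(3)
n, d = 8, 2
Xpts, a = rng.normal(size=(n, d)), rng.normal(size=n)

K = gaussian_kernel(Xpts, Xpts)
print("||h||_{H_k}^2 =", a @ K @ a)                              # <h, h> via the Gram matrix

y = rng.normal(size=(1, d))
print("h(y)          =", (a @ gaussian_kernel(Xpts, y)).item())  # reproducing property
```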

1.1 B.1 Random feature models

The random feature models considered in this article are functions of the same form

$$\begin{aligned} f(x) = \frac{1}{m} \sum _{i=1}^m a_i\,\sigma \big (w_i^\mathrm{T}x + b_i\big ) \end{aligned}$$

as a two-layer neural network. Unlike neural networks, \((w_i, b_i)\) are not trainable variables and remain fixed after initialization. Random feature models are linear (so easier to optimize) but less expressive than shallow neural networks. While two-layer neural networks of infinite width are modeled as

$$\begin{aligned} f(x) = \int _{\mathbb R^{d+2}} a\,\sigma (w^\mathrm{T}x+b)\,\pi (\mathrm {d}a\otimes \mathrm {d}w \otimes \mathrm {d}b) \end{aligned}$$

for variable probability measures \(\pi \), an infinitely wide random feature model is given as

$$\begin{aligned} f(x) = \int _{\mathbb R^{d+1}} a(w,b)\, \sigma (w^\mathrm{T}x+b)\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

for a fixed distribution \(\pi ^0\) on \(\mathbb R^{d+1}\). One can think of Barron space as the union over all random feature spaces. It is well known that neural networks can represent the same function in different ways. For ReLU activation, there is a degree of degeneracy due to the identity

$$\begin{aligned} 0 = x+\alpha - (x+\alpha ) = \sigma (x) - \sigma (-x) + \sigma (\alpha )- \sigma (-\alpha ) - \sigma (x+\alpha ) + \sigma (-x-\alpha ). \end{aligned}$$

This can be generalized to higher dimension and integrated in \(\alpha \) to show that, when \(\pi ^0\) is the uniform distribution on the sphere, the random feature representation of a function has a similar degeneracy. For given \(\pi ^0\), denote

$$\begin{aligned} N = \left\{ a \in L^2(\pi ^0)\,\bigg |\, \int _{\mathbb R^{d+1}} a(w,b)\, \sigma (w^\mathrm{T}x+b)\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) = 0 \quad \mathbb {P}-\text {a.e.}\right\} \end{aligned}$$
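To make the contrast with two-layer networks concrete: in a random feature model only the outer coefficients are trained, so fitting data reduces to a linear (here ridge-regularized) least-squares problem. A minimal sketch under our own choices (Gaussian \(\pi ^0\), synthetic data, ReLU features); none of these choices come from the article.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 5, 500, 200

# Features (w_i, b_i) ~ pi^0 are sampled once and never trained.
W, b = rng.normal(size=(m, d)), rng.normal(size=m)

def features(X):
    # Phi[k, i] = sigma(w_i^T x_k + b_i) / m, with sigma = ReLU
    return np.maximum(X @ W.T + b, 0.0) / m

# Synthetic regression data.
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Only the coefficients a are fitted: a linear problem.
Phi, lam = features(X), 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

print("training RMSE:", np.sqrt(np.mean((Phi @ a - y) ** 2)))
```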

Lemma B.1

[23, Proposition 4.1] The space of random feature models is dense in the RKHS for the kernel

$$\begin{aligned} k(x,x') = \int _{\mathbb R^{d+1}} \sigma (w^\mathrm{T}x+b)\,\sigma (w^\mathrm{T}x' + b)\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

and if

$$\begin{aligned} f(x) = \int _{\mathbb R^{d+1}} a(w,b) \sigma (w^\mathrm{T}x+b)\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

for \(a\in L^2(\pi ^0)/N\), then

$$\begin{aligned} \Vert f\Vert _{{\mathcal H}_k} = \Vert a\Vert _{L^2(\pi ^0)/N}. \end{aligned}$$

Note that the proof in the source uses a different normalization. The result in this form is achieved by setting \(a= \alpha /p, b = \beta /p\) in the notation of [23].

Corollary B.2

Any function in the random feature RKHS is Lipschitz-continuous and \([f]_{\mathrm {Lip}} \le \Vert f\Vert _{{\mathcal H}_{k}}\). Thus, \({\mathcal H}_{k}\) embeds compactly into \(C^0({\mathrm {spt}}\,\mathbb {P})\) by the Arzelà–Ascoli theorem and a fortiori into \(L^2(\mathbb {P})\) for any compactly supported measure \(\mathbb {P}\).

Let \(\mathbb {P}\) be a compactly supported data distribution on \(\mathbb R^d\). Then, the kernel k acts on \(L^2(\mathbb {P})\) by the map

$$\begin{aligned} \overline{K}:L^2(\mathbb {P})\rightarrow L^2(\mathbb {P}), \qquad \overline{K}u(x) = \int _{\mathbb R^d} u(x')\,k(x,x')\,\mathbb {P}(\mathrm {d}x'). \end{aligned}$$

Computing the eigenvalues of the kernel k for a given parameter distribution \(\pi ^0\) and a given data distribution \(\mathbb {P}\) is a non-trivial endeavor. The task simplifies considerably under the assumption of symmetry, but remains complicated. The following results are taken from [2, Appendix D], where more general results are proved for \(\alpha \)-homogeneous activation with \(\alpha \ge 0\). We specialize to \(\alpha =1\).

Lemma B.3

Assume that \(\pi ^0 = \mathbb {P}= \alpha _d^{-1}\cdot {\mathcal H}^d|_{S^d}\) where \(S^d\) is the Euclidean unit sphere, \(\alpha _d\) is its volume and \(\sigma =\) ReLU. Then, the eigenfunctions of the kernel k are the spherical harmonics. The k-th eigenvalue \(\lambda _k\) (counted without repetition) occurs with the same multiplicity N(d, k) as the k-th eigenvalue of the Laplace–Beltrami operator on the sphere, whose eigenfunctions are the spherical harmonics. Precisely,

$$\begin{aligned} N(d,k) = \frac{2k+d-1}{k} \binom{k+d-2}{d-1} \qquad \text { and }\qquad \lambda _k = \frac{d-1}{2\pi }\,2^{-k}\,\frac{\Gamma (d/2)\,\Gamma (k-1)}{\Gamma (k/2)\,\Gamma (\frac{k+d+2}{2})} \end{aligned}$$

for \(k\ge 2\).

We can extract a decay rate for the eigenvalues counted with repetition by estimating the height and width of the individual plateaus of eigenvalues. Denote by \(\mu _i\) the eigenvalues of k counted as often as they occur, i.e.,

$$\begin{aligned} \mu _i = \lambda _k \quad \Leftrightarrow \quad \sum _{j=1}^{k-1} N(d,j) < i \le \sum _{j=1}^{k} N(d,j). \end{aligned}$$

By Stirling’s formula, one can estimate that for fixed d

$$\begin{aligned} \lambda _k \sim k^{-\frac{d-3}{2}} \end{aligned}$$

as \(k\rightarrow \infty \) where we write \(a_k \sim b_k\) if and only if

$$\begin{aligned} 0< \liminf _{k\rightarrow \infty } \frac{a_k}{b_k} \le \limsup _{k\rightarrow \infty } \frac{a_k}{b_k} < \infty . \end{aligned}$$

On the other hand,

$$\begin{aligned} N(d,k) = \frac{d}{k} \binom{k+d-1}{d} = \frac{d}{k}\, \frac{(k+d-1)\cdots k}{d!} \sim \frac{(k+d)^{d-1}}{(d-1)!}. \end{aligned}$$

In particular, if \(\mu _i = \lambda _k\), then

$$\begin{aligned} i \sim C(d) \sum _{j=1}^k j^{d-1} \sim C(d) \int _1^{k+1} t^{d-1}\,\mathrm {d}t\sim C(d)\,k^d. \end{aligned}$$

Thus,

$$\begin{aligned} \mu _i = \lambda _k \sim k^{-\frac{d-3}{2}} \sim i^{-\frac{d-3}{2d}} = i^{-\frac{1}{2} + \frac{3}{2d}}. \end{aligned}$$

Corollary B.4

Consider the random feature model for ReLU activation when both parameter and data measure are given by the uniform distribution on the unit sphere. Then, the eigenvalues of the kernel decay like \(i^{-\frac{1}{2} + \frac{3}{2d}}\).
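The plateau construction above is easy to tabulate. The following sketch (ours) evaluates \(N(d,k)\) and \(\lambda _k\) exactly as displayed in Lemma B.3 (via log-Gamma for numerical stability) and maps an index i to its plateau; the count starts at \(k=2\), where the displayed formula for \(\lambda _k\) applies, and no asymptotics are assumed.

```python
from math import comb, lgamma, log, exp, pi

def multiplicity(d, k):
    """N(d, k) = (2k + d - 1)/k * binom(k + d - 2, d - 1), as in Lemma B.3."""
    return round((2 * k + d - 1) / k * comb(k + d - 2, d - 1))

def lam(d, k):
    """lambda_k from Lemma B.3 (k >= 2), evaluated via log-Gamma."""
    log_val = (log(d - 1) - log(2.0 * pi) - k * log(2.0)
               + lgamma(d / 2) + lgamma(k - 1)
               - lgamma(k / 2) - lgamma((k + d + 2) / 2))
    return exp(log_val)

def mu(d, i):
    """mu_i = lambda_k, where k is fixed by the cumulative multiplicities."""
    k, cum = 2, 0
    while True:
        cum += multiplicity(d, k)
        if i <= cum:
            return k, lam(d, k)
        k += 1

d = 10
for i in (1, 10, 100, 1000, 10000):
    k, val = mu(d, i)
    print(f"i = {i:6d}  ->  plateau k = {k:3d},  mu_i = {val:.3e}")
```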

Remark B.5

It is easy to see that up to a multiplicative constant, all radially symmetric parameter distributions \(\pi ^0\) lead to the same random feature kernel. This can be used to obtain an explicit formula in the case when \(\pi ^0\) is a standard Gaussian in d dimensions. Since k only depends on the angle between x and \(x'\), we may assume that \(x=e_1\) and \(x' = \cos \phi \,e_1 + \sin \phi \,e_2\) with \(\phi \in [0,\pi ]\). Now, one can use that the projection of the standard Gaussian onto the \(w_1w_2\)-plane is a lower-dimensional standard Gaussian. Thus, the kernel does not depend on the dimension. An explicit computation in two dimensions shows that

$$\begin{aligned} k(x,x') = \frac{\pi - \phi }{\pi }\,\cos \phi + \frac{\sin \phi }{\pi }, \end{aligned}$$

see [5, Section 2.1].
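The dimension independence and the "same kernel up to a multiplicative constant" statement are easy to test numerically: the sketch below (ours) compares a Monte Carlo estimate of \(\int \sigma (w^\mathrm{T}x)\,\sigma (w^\mathrm{T}x')\,\pi ^0(\mathrm {d}w)\) for a standard Gaussian \(\pi ^0\) with the closed form above; the ratio of the two is a constant independent of \(\phi \) (and of d).

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_mc = 10, 500_000

def closed_form(phi):
    return (np.pi - phi) / np.pi * np.cos(phi) + np.sin(phi) / np.pi

W = rng.normal(size=(n_mc, d))                     # pi^0 = standard Gaussian (radial)
for phi in (0.3, 0.9, 1.5, 2.4):
    x  = np.zeros(d); x[0] = 1.0
    xp = np.zeros(d); xp[0], xp[1] = np.cos(phi), np.sin(phi)
    mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ xp, 0.0))
    print(f"phi = {phi:.2f}   MC estimate / closed form = {mc / closed_form(phi):.3f}")
```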

1.2 B.2 The neural tangent kernel

The neural tangent kernel [20] is a different model for infinitely wide neural networks. For two-layer networks, it is obtained as the limiting object in a scaling regime for the parameters which makes the Barron norm infinite. When training networks on empirical risk, in a certain scaling regime between the number of data points, the number of neurons, and the initialization of parameters, it can be shown that the parameters do not move far from their initialization, which is drawn from a parameter distribution \(\bar{\pi }\), and that the gradient flow optimization of neural networks is close to the optimization of a kernel method for all times [9, 14]. This kernel is called the neural tangent kernel (NTK) and is obtained by summing, over all trainable parameters, the products of the derivatives of the feature function at x and at \(x'\). It linearizes the dynamics at the initial parameter distribution. For networks with one hidden layer, this is

$$\begin{aligned} k(x,x')&= \int _{\mathbb R\times \mathbb R^d\times \mathbb R} \nabla _{(a,w,b)} \big (a\sigma (w^\mathrm{T}x+b)\big ) \cdot \nabla _{(a,w,b)} \big (a\sigma (w^\mathrm{T}x'+b)\big ) \,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&= \int _{\mathbb R\times \mathbb R^d\times \mathbb R} \sigma (w^\mathrm{T}x + b)\,\sigma (w^\mathrm{T}x'+ b) \\&\qquad + a^2 \sigma '(w^\mathrm{T}x+b)\,\sigma '(w^\mathrm{T}x'+b) \big [\langle x,x'\rangle +1\big ]\,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b)\\&= k_{RF}(x,x') \\&\quad + \int _{\mathbb R\times \mathbb R^d\times \mathbb R} a^2\,\sigma '(w^\mathrm{T}x+b)\,\sigma '(w^\mathrm{T}x'+b) (x,1) \,\cdot \,(x',1)\,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

where \(k_{RF}\) denotes the random feature kernel with distribution \(P_{(w,b),\sharp }\bar{\pi }\). The second term on the right-hand side is a positive definite kernel in itself. This can be seen most easily by recalling that

$$\begin{aligned} \sum _{i,j}c_i c_j&\int _{\mathbb R^{d+2}}\nabla _{(w,b)} \big (a\sigma (w^\mathrm{T}x_i+b)\big ) \cdot \nabla _{(w,b)} \big (a\sigma (w^\mathrm{T}x_j+b)\big ) \,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&=\int _{\mathbb R^{d+2}} a^2\, \nabla _{(w,b)} \left( \sum _ic_i\,\sigma (w^\mathrm{T}x_i+b)\right) \cdot \nabla _{(w,b)} \left( \sum _jc_j\,\sigma (w^\mathrm{T}x_j+b)\right) \,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&= \int _{\mathbb R^{d+2}} a^2 \left| \nabla _{(w,b)} \left( \sum _ic_i\,\sigma (w^\mathrm{T}x_i+b)\right) \right| ^2 \,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w \otimes \mathrm {d}b)\\&\ge 0. \end{aligned}$$

On the other hand, if we assume that \(|a| = a_0\) and \(|(w,b)| = 1\) almost surely, we find that

$$\begin{aligned} \sum _{i,j}c_ic_j&\int _{\mathbb R\times \mathbb R^d\times \mathbb R} a^2\,\sigma '(w^\mathrm{T}x_i+b)\,\sigma '(w^\mathrm{T}x_j+b) \,(x_i,1) \,I\,(x_j,1)^\mathrm{T}\,\bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b)\\&= \int _{\mathbb R\times \mathbb R^d\times \mathbb R} a^2 \left| \sum _i c_i\,\sigma '(w^\mathrm{T}x_i+b) \begin{pmatrix}x_i\\ 1\end{pmatrix}\right| ^2 \bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b)\\&\le a_0^2 \int _{\mathbb R\times \mathbb R^d\times \mathbb R} \left| \sum _i c_i\,\sigma '(w^\mathrm{T}x_i+b) \begin{pmatrix}x_i\\ 1\end{pmatrix} \cdot \begin{pmatrix} w\\ b\end{pmatrix} \right| ^2 \bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b)\\&= a_0^2 \int _{\mathbb R\times \mathbb R^d\times \mathbb R} \left| \sum _i c_i\,\sigma '(w^\mathrm{T}x_i+b) (w^\mathrm{T}x_i+b) \right| ^2 \bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b)\\&= a_0^2 \int _{\mathbb R\times \mathbb R^d\times \mathbb R} \left| \sum _i c_i\,\sigma (w^\mathrm{T}x_i+b) \right| ^2 \bar{\pi }(\mathrm {d}a \otimes \mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$

since \(\sigma \) is positively one-homogeneous. Thus, the NTK satisfies

$$\begin{aligned} k_{RF} \le k \le (1+ |a_0|^2)\,k_{RF} \end{aligned}$$

in the sense of quadratic forms. In particular, the eigenvalues of the NTK and the random feature kernel decay at the same rate. Clearly, in exchange for larger constants, it suffices to assume that \((a, w, b)\) are bounded. In practice, the initialization of \((w, b)\) is Gaussian, which concentrates close to the Euclidean sphere of radius \(\sqrt{d}\) in d dimensions.

The neural tangent kernel is also defined for deep networks, see for example [1, 7, 15, 22, 27].
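As a sanity check on the decomposition of k into \(k_{RF}\) plus a \((w,b)\)-gradient part, the following sketch (ours; a finite-width empirical average instead of the infinite-width integral, with a Gaussian \(\bar{\pi }\) of our choosing) assembles the tangent kernel of a two-layer ReLU network once from the displayed integrand and once from the explicit parameter gradients of \(f_\Theta (x) = \frac{1}{m}\sum _i a_i\sigma (w_i^\mathrm{T}x+b_i)\); the two agree.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 4, 5000

# Parameters (a_i, w_i, b_i) drawn from a concrete pi_bar (standard Gaussians).
a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)

relu  = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def ntk(x, xp):
    """Empirical version of the displayed integral:
       (1/m) sum_i [ sigma(z_i) sigma(z'_i) + a_i^2 sigma'(z_i) sigma'(z'_i) (<x,x'> + 1) ]."""
    z, zp = W @ x + b, W @ xp + b
    k_rf   = np.mean(relu(z) * relu(zp))                       # random feature kernel
    k_grad = np.mean(a**2 * drelu(z) * drelu(zp)) * (x @ xp + 1.0)
    return k_rf + k_grad, k_rf

def ntk_from_gradients(x, xp):
    """Same kernel from the parameter gradients of f(x) = (1/m) sum_i a_i relu(w_i^T x + b_i);
       the factor m compensates the 1/m^2 coming from the two gradients."""
    z, zp = W @ x + b, W @ xp + b
    ga, gap = relu(z) / m, relu(zp) / m                                            # d f / d a_i
    gw, gwp = (a * drelu(z))[:, None] * x / m, (a * drelu(zp))[:, None] * xp / m   # d f / d w_i
    gb, gbp = a * drelu(z) / m, a * drelu(zp) / m                                  # d f / d b_i
    return m * (ga @ gap + np.sum(gw * gwp) + gb @ gbp)

x, xp = rng.normal(size=d), rng.normal(size=d)
k_total, k_rf = ntk(x, xp)
print("k(x, x')        =", k_total)
print("k_RF(x, x')     =", k_rf)
print("from gradients  =", ntk_from_gradients(x, xp))
```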


Cite this article

E, W., Wojtowytsch, S. Kolmogorov width decay and poor approximators in machine learning: shallow neural networks, random feature models and neural tangent kernels. Res Math Sci 8, 5 (2021). https://doi.org/10.1007/s40687-020-00233-4
