Abstract
We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor \(L^{2}\)-approximators for the class of two-layer neural networks in high dimension, and that multi-layer networks with small path norm are poor approximators for certain Lipschitz functions, also in the \(L^{2}\)-topology.
References
Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R., Wang, R.: On exact computation with an infinitely wide neural net. In: Advances in Neural Information Processing Systems, pp. 8139–8148 (2019)
Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629–681 (2017)
Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930–945 (1993)
Brezis, H.: Functional Analysis. Sobolev Spaces and Partial Differential Equations. Universitext. Springer, New York (2011)
Cho, Y., Saul, L.K.: Kernel methods for deep learning. In: Advances in Neural Information Processing Systems, pp. 342–350 (2009)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
Du, S.S., Lee, J.D., Li, H., Wang, L., Zhai, X.: Gradient descent finds global minima of deep neural networks. arXiv:1811.03804 [cs.LG] (2018)
Dobrowolski, M.: Angewandte Funktionalanalysis: Funktionalanalysis, Sobolev-Räume und elliptische Differentialgleichungen. Springer (2010)
Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 [cs.LG] (2018)
E, W., Ma, C., Wu, L.: A priori estimates of the population risk for two-layer neural networks. Commun. Math. Sci. 17(5), 1407–1425 (2019)
E, W., Ma, C., Wang, Q.: A priori estimates of the population risk for residual networks. arXiv:1903.02154 [cs.LG] (2019)
E, W., Ma, C., Wu, L.: Barron spaces and the compositional function spaces for neural network models. arXiv:1906.08039 [cs.LG] (2019)
E, W., Ma, C., Wu, L.: Machine learning from a continuous viewpoint. arXiv:1912.12777 [math.NA] (2019)
E, W., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. (2019). https://doi.org/10.1007/s11425-019-1628-5
E, W., Ma, C., Wang, Q., Wu, L.: Analysis of the gradient descent algorithm for a deep neural network model with skip-connections. arXiv:1904.05263 [cs.LG] (2019)
E, W., Wojtowytsch, S.: On the Banach spaces associated with multi-layer ReLU networks of infinite width. arXiv:2007.15623 [stat.ML] (2020)
Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162(3–4), 707–738 (2015)
Gribonval, R., Kutyniok, G., Nielsen, M., Voigtlaender, F.: Approximation spaces of deep neural networks. arXiv:1905.01208 [math.FA] (2019)
Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)
Lorentz, G.: Approximation of Functions. Holt, Rinehart and Winston, New York (1966)
Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570–8581 (2019)
Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561. IEEE (2008)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)
Wojtowytsch, S., E, W.: Can shallow neural networks beat the curse of dimensionality? A mean field training perspective. arXiv:2005.10815 [cs.LG] (2020)
Yang, G.: Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760 [cs.NE] (2019)
Dedicated to Andrew Majda on the occasion of his 70th birthday
Appendices
Appendix A: A brief review of Barron space
For the convenience of the reader, we recall Barron space for two-layer neural networks as introduced by E et al. [10, 12]. We focus on the functional analytic properties of Barron space; for results with a focus on machine learning, we refer the reader to the original sources. The same space is denoted by \({\mathcal F}_1\) in [2], but described from a different perspective. For functional analytic notions, we refer the reader to [4].
Let \({\mathbb {P}}\) be a probability measure on \(\mathbb R^d\) and \(\sigma \) a Lipschitz-continuous function such that either
(1) \(\sigma = \text {ReLU}\), or
(2) \(\sigma \) is sigmoidal, i.e., \(\lim _{z\rightarrow \pm \infty } \sigma (z) = \pm 1\) (or 0 and 1).
Consider the class \({\mathcal F}_m\) of two-layer networks with m neurons
$$\begin{aligned} {\mathcal F}_m = \left\{ f_m(x) = \frac{1}{m}\sum _{i=1}^m a_i\,\sigma \big (w_i^\mathrm{T}x + b_i\big ) \;:\; a_i, b_i\in \mathbb R,\ w_i\in \mathbb R^d\right\} . \end{aligned}$$
It is well-known that the closure of \({\mathcal F}= \bigcup _{m=1}^\infty {\mathcal F}_m\) in the uniform topology is the space of continuous functions, see, e.g., [6]. Barron space is a different closure of the same function class, where the path norm
$$\begin{aligned} \frac{1}{m}\sum _{i=1}^m |a_i|\,\big [\Vert w_i\Vert _{\ell ^q} + |b_i|\big ] \end{aligned}$$
remains bounded. Here, we assume that the data space \(\mathbb R^d\) is equipped with the \(\ell ^p\)-norm and take the dual \(\ell ^q\)-norm on w. This form of the path norm corresponds to ReLU activation; a slightly different path norm for bounded Lipschitz activation is discussed below.
The same class is often discussed without the normalizing factor of \(\frac{1}{m}\). With the factor, the following concept of infinitely wide two-layer networks emerges more naturally.
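For instance, a finite network \(f_m\in {\mathcal F}_m\) with parameters \((a_i, w_i, b_i)\) coincides with the infinitely wide network of Definition A.1 below associated with the empirical parameter distribution,
$$\begin{aligned} f_m = f_{\pi _m}, \qquad \pi _m = \frac{1}{m}\sum _{i=1}^m \delta _{(a_i, w_i, b_i)}, \end{aligned}$$
so the normalized finite networks are exactly the infinitely wide networks with atomic parameter distributions.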
Definition A.1
Let \(\pi \) be a Radon probability measure on \(\mathbb R^{d+2}\) with finite second moments, which we denote by \(\pi \in {\mathcal P}_{2}(\mathbb R^{d+2})\). We denote by
$$\begin{aligned} f_\pi (x) = \int _{\mathbb R^{d+2}} a\,\sigma \big (w^\mathrm{T}x+b\big )\,\pi (\mathrm {d}a\otimes \mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$
the two-layer network associated with \(\pi \).
If \(\sigma = \mathrm {ReLU}\), it is clear that \(f_\pi \) is Lipschitz-continuous on \(\mathbb R^d\) with Lipschitz-constant \(\le \Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}\), so \(f_\pi \) lies in the space of (possibly unbounded) Lipschitz functions \(\mathrm {Lip}(\mathbb R^d)\). If \(\sigma \) is a bounded Lipschitz function, the integral converges in \(C^0(\mathbb R^d)\) without assumptions on the moments of \(\pi \). If the second moments of \(\pi \) are bounded, \(f_\pi \) is a Lipschitz function also in the case of bounded Lipschitz activation \(\sigma \).
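For the ReLU activation, the Lipschitz bound follows from a one-line estimate using the \(\ell ^p\)–\(\ell ^q\) duality between data and weights:
$$\begin{aligned} |f_\pi (x) - f_\pi (y)| \le \int _{\mathbb R^{d+2}} |a|\,\big |\sigma (w^\mathrm{T}x+b) - \sigma (w^\mathrm{T}y+b)\big |\,\mathrm {d}\pi \le \left( \int _{\mathbb R^{d+2}} |a|\,\Vert w\Vert _{\ell ^q}\,\mathrm {d}\pi \right) \Vert x-y\Vert _{\ell ^p}, \end{aligned}$$
and taking the infimum over all \(\pi \) which represent the same function bounds the Lipschitz constant by \(\Vert f_\pi \Vert _{{\mathcal B}(\mathbb {P})}\).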
For technical reasons, we will extend the definition to distributions \(\pi \) for which only the mixed second moments
$$\begin{aligned} \int _{\mathbb R^{d+2}} |a|\,\big [\Vert w\Vert _{\ell ^q} + \psi \big ]\,\mathrm {d}\pi \end{aligned}$$
are finite, where \(\psi = |b|\) in the ReLU case and \(\psi = 1\) otherwise. Now, we introduce the associated function space.
Definition A.2
We denote
$$\begin{aligned} {\mathcal B}(\mathbb {P}) = \left\{ f_\pi \;:\; \int _{\mathbb R^{d+2}} |a|\,\big [\Vert w\Vert _{\ell ^q} + |b|\big ]\,\mathrm {d}\pi < \infty \right\} \end{aligned}$$
if \(\sigma = \mathrm {ReLU}\) and
$$\begin{aligned} {\mathcal B}(\mathbb {P}) = \left\{ f_\pi \;:\; \int _{\mathbb R^{d+2}} |a|\,\big [\Vert w\Vert _{\ell ^q} + 1\big ]\,\mathrm {d}\pi < \infty \right\} \end{aligned}$$
otherwise, where functions which agree \(\mathbb {P}\)-almost everywhere are identified. In either case, we denote
$$\begin{aligned} \Vert f\Vert _{{\mathcal B}(\mathbb {P})} = \inf \left\{ \int _{\mathbb R^{d+2}} |a|\,\big [\Vert w\Vert _{\ell ^q} + \psi \big ]\,\mathrm {d}\pi \;:\; f = f_\pi \ \mathbb {P}\text {-almost everywhere}\right\} . \end{aligned}$$
Here, \(\inf \emptyset = + \infty \).
Remark A.3
It depends on the activation function \(\sigma \) whether or not the infimum in the definition of the norm is attained. If \(\sigma = \mathrm {ReLU}\), this can be shown to be true by using homogeneity. Instead of a probability measure \(\pi \) on the whole space, one can use a signed measure \(\mu \) on the unit sphere to express a two-layer network. The compactness theorem for Radon measures provides the existence of a measure minimizer, which can then be lifted to a probability measure (see below for similar arguments). On the other hand, if \(\sigma \) is a classical sigmoidal function such that \(|\sigma (z)| < 1\) for all \(z\in \mathbb R\) and \(\lim _{z\rightarrow \infty }\sigma (z) = 1\), then the function \(f(z) \equiv 1\) has Barron norm 1, but the infimum is not attained. This holds true for any data distribution \(\mathbb {P}\).
Proof
(1) For any \(x\in {\mathrm {spt}}(\mathbb {P})\), we have
$$\begin{aligned} 1 = f(x) = \int _{\mathbb R^{d+2}} a\,\sigma (w^\mathrm{T}x+b)\,\mathrm {d}\pi \le \int _{\mathbb R^{d+2}}|a|\,\mathrm {d}\pi \le \int _{\mathbb R^{d+2}}|a|\,\big [|w| + 1\big ]\,\mathrm {d}\pi . \end{aligned}$$Taking the infimum over all \(\pi \), we find that \(1\le \Vert f\Vert _{{\mathcal B}(\mathbb {P})}\). For any measure \(\pi \), the first inequality above is strict since \(|\sigma |<1\), so there is no \(\pi \) which attains equality.
(2) We consider a family of measures
$$\begin{aligned}&\pi _\lambda = \delta _{a= \sigma (\lambda )^{-1}}\,\delta _{w=0}\,\delta _{b = \lambda } \\&\quad \Rightarrow \quad f_{\pi _\lambda } \equiv 1\quad \text {and } \int _{\mathbb R^{d+2}} |a|\big [|w|_{\ell ^q}+1\big ]\,\mathrm {d}\pi _\lambda = \frac{1}{\sigma (\lambda )} \rightarrow 1 \end{aligned}$$as \(\lambda \rightarrow \infty \).
Thus, \(\Vert f\Vert _{{\mathcal B}(\mathbb {P})} = 1\), but there is no minimizing parameter distribution \(\pi \). \(\square \)
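Returning to the ReLU case of the remark, the homogeneity argument can be made concrete: writing \(\rho = \Vert w\Vert _{\ell ^q} + |b|\), one may replace \(\pi \) by the push-forward
$$\begin{aligned} \tilde{\pi } = \Phi _\sharp \pi , \qquad \Phi (a,w,b) = \Big (a\,\rho ,\ \frac{w}{\rho },\ \frac{b}{\rho }\Big ) \quad \text {for } \rho >0 \end{aligned}$$
(and, say, \(\Phi (a,w,b) = (0, e_1, 0)\) where \(\rho = 0\); such neurons contribute nothing since \(\sigma (0)=0\)). By the positive one-homogeneity of the ReLU, \(f_{\tilde{\pi }} = f_\pi \) and the defining integral is unchanged, so the parameters \((w,b)\) may be assumed to lie on the unit sphere.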
Remark A.4
The space \({\mathcal B}(\mathbb {P})\) does not depend on the measure \(\mathbb {P}\), but only on the system of null sets for \(\mathbb {P}\).
We note that the space \({\mathcal B}(\mathbb {P})\) is reasonably well-behaved from the point of view of functional analysis.
Lemma A.5
\({\mathcal B}({\mathbb P})\) is a Banach space with norm \(\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})}\). If \({\mathrm {spt}}(\mathbb {P})\) is compact, \({\mathcal B}(\mathbb {P})\) embeds continuously into the space of Lipschitz functions \(C^{0,1}({\mathrm {spt}}\,\mathbb {P})\).
Proof
Scalar multiplication Let \(f\in {\mathcal B}(\mathbb {P})\). For \(\lambda \in \mathbb R\) and \(\pi \in {\mathcal {P}}_{2}(\mathbb R^{d+2})\), define the push-forward
Then,
and similarly in the ReLU case. Thus, scalar multiplication is well-defined in \({\mathcal B}({\mathbb P})\). Taking the infimum over \(\pi \), we find that \(\Vert \lambda f\Vert _{{\mathcal B}(\mathbb {P})} = |\lambda |\,\Vert f\Vert _{{\mathcal B}(\mathbb {P})}\).
Vector addition Let \(g,h \in {\mathcal B}(\mathbb {P})\). Choose \(\pi _g, \pi _h\) such that \(g= f_{\pi _g}\) and \(h = f_{\pi _h}\). Consider
like above. Then, \(f_\pi = g+h\) and
Taking infima, we see that \(\Vert g+h\Vert _{{\mathcal B}(\mathbb {P})} \le \Vert g\Vert _{{\mathcal B}(\mathbb {P})} + \Vert h\Vert _{{\mathcal B}(\mathbb {P})}\). The same holds in the ReLU case.
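As one explicit choice in the argument above (among many possible ones), consider the balanced mixture of rescaled measures
$$\begin{aligned} \pi = \frac{1}{2}\big (S_\sharp \pi _g + S_\sharp \pi _h\big ), \qquad S(a,w,b) = (2a, w, b), \end{aligned}$$
so that \(f_\pi = \frac{1}{2} f_{S_\sharp \pi _g} + \frac{1}{2} f_{S_\sharp \pi _h} = g + h\), while the defining integral of \(\pi \) equals the sum of those of \(\pi _g\) and \(\pi _h\).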
Positivity and embedding Recall that the norm on the space of Lipschitz functions on a compact set K is
It is clear that \(\Vert \cdot \Vert _{{\mathcal B}(\mathbb {P})} \ge 0\). If \(\sigma \) is a bounded Lipschitz function, then
like in Remark A.3. If \(\sigma = \mathrm {ReLU}\), then
In either case
for all x, y in \({\mathrm {spt}}(\mathbb {P})\). In particular, \(\Vert f\Vert _{{\mathcal B}(\mathbb {P})} > 0\) whenever \(f\ne 0\) in \({\mathcal B}(\mathbb {P})\), and if \({\mathrm {spt}}(\mathbb {P})\) is compact, \({\mathcal B}(\mathbb {P})\) embeds into the space of Lipschitz functions on \({\mathrm {spt}}(\mathbb {P})\). If \(\sigma \) is a bounded Lipschitz function, \({\mathcal B}(\mathbb {P})\) embeds into the space of bounded Lipschitz functions also on unbounded sets.
Completeness Completeness is proved most easily by introducing a different representation for Barron functions. Consider the space
$$\begin{aligned} V = \left\{ \mu \text { signed Radon measure on } \mathbb R^{d+2} \;:\; \int _{\mathbb R^{d+2}} |a|\,\big [|w|+|b|\big ]\,\mathrm {d}|\mu | < \infty \right\} , \end{aligned}$$
where \(|\mu |\) is the total variation measure of \(\mu \). Equipped with the norm
$$\begin{aligned} \Vert \mu \Vert _V = \int _{\mathbb R^{d+2}} |a|\,\big [|w|+|b|\big ]\,\mathrm {d}|\mu |, \end{aligned}$$
V is a Banach space when we quotient out measures supported on \(\{|a|=0\}\cup \{|b| = |w| = 0\}\) or restrict ourselves to the subspace of measures such that \(|\mu |(\{|a|=0\}) = |\mu | (\{|b|=|w|=0\}) = 0\). The only non-trivial question is whether V is complete. By definition, \(\mu _n\) is a Cauchy sequence in V if and only if \(\nu _n {:}{=} |a|\,[|w|+ |b|]\cdot \mu _n\) is a Cauchy sequence in the space of finite Radon measures. Since the space of finite Radon measures is complete, \(\nu _n\) converges (strongly) to a measure \(\nu \) which satisfies \(\nu (\{a =0\}\cup \{(w,b) = 0\}) =0\). We then obtain \(\mu {:}{=} |a|^{-1}\,[|w|+ |b|]^{-1}\cdot \nu \). For \(\mu \in V\), we write
$$\begin{aligned} f_\mu (x) = \int _{\mathbb R^{d+2}} a\,\sigma \big (w^\mathrm{T}x+b\big )\,\mathrm {d}\mu \end{aligned}$$
and consider the subspace
$$\begin{aligned} V^0_{\mathbb {P}} = \big \{\mu \in V \;:\; f_\mu = 0\ \mathbb {P}\text {-almost everywhere}\big \}. \end{aligned}$$
Since the map
is continuous by the same argument as before, we find that \(V^0_\mathbb {P}\) is a closed subspace of V. In particular, \(V/V^0_\mathbb {P}\) is a Banach space. We claim that \({\mathcal B}(\mathbb {P})\) is isometric to the quotient space \(V/V^0_\mathbb {P}\) by the map \([\mu ]\mapsto f_\mu \) where \(\mu \) is any representative in the equivalence class \([\mu ]\).
It is clear that any representative in the equivalence class induces the same function \(f_{\mu }\) such that the map is well-defined. Consider the Hahn decomposition \(\mu = \mu ^{+} - \mu ^-\) of \(\mu \) as the difference of non-negative Radon measures. Set
Then, \(\pi \) is a probability Radon measure such that \(f_\pi = f_\mu \) and
In particular, \(\Vert f_\mu \Vert _{{\mathcal B}(\mathbb {P})} \le \Vert \mu \Vert _V\). Taking the infimum of the right hand side, we conclude that \(\Vert f_\mu \Vert _{{\mathcal B}(\mathbb {P})} \le \Vert [\mu ]\Vert _{V/V^0_\mathbb {P}}\). The opposite inequality is trivial since every probability measure is in particular a signed Radon measure.
Thus, \({\mathcal B}(\mathbb {P})\) is isometric to a Banach space, hence a Banach space itself. We presented the argument in the context of ReLU activation, but the same proof holds for bounded Lipschitz activation. \(\square \)
A few remarks are in order.
Remark A.6
The requirement that \({\mathbb P}\) have compact support can be relaxed when we consider the norm
on the space of Lipschitz functions. Since Lipschitz functions grow at most linearly, this is well-defined for all data distributions \(\mathbb {P}\) with finite first moments.
Remark A.7
For general Lipschitz-activation \(\sigma \) which is neither bounded nor ReLU, the Barron norm is defined as
Similar results hold in this case.
Remark A.8
In general, \({\mathcal B}(\mathbb {P})\) for ReLU activation is not separable. Consider \(\mathbb {P}= {\mathcal L}^1|_{[0,1]}\) to be Lebesgue measure on the unit interval in one dimension. For \(\alpha \in (0,1)\) set \(f_\alpha (x) = \sigma (x-\alpha )\). Then, for \(\beta >\alpha \), we have
Thus, there exists an uncountable family of functions with distance \(\ge 1\), meaning that \({\mathcal B}(\mathbb {P})\) cannot be separable.
Remark A.9
In general, \({\mathcal B}(\mathbb {P})\) for ReLU activation is not reflexive. We consider \(\mathbb {P}\) to be the uniform measure on [0, 1] and demonstrate that \({\mathcal B}(\mathbb {P})\) is the space of functions whose first derivative is in BV (i.e., whose second derivative is a Radon measure) on [0, 1]. The space of Radon measures on [0, 1] is denoted by \({\mathcal M}[0,1]\) and equipped with the total variation norm.
Assume that f is a Barron function on [0, 1]. Then,
where
and similarly for \(\mu _{2}\). The Barron norm is expressed as
Since \(\sigma '' = \delta \), we can formally calculate that
This is easily made rigorous in the distributional sense. Since [0, 1] is bounded by 1, we obtain in addition to the bounds on f(0) and the Lipschitz constant of f that
On the other hand, if f has a second derivative, then
for all \(x\in [0,1]\). This easily extends to measure valued derivatives and we conclude that
We can thus express \({\mathcal B}(\mathbb {P}) = BV([0,1]) \times \mathbb R\times \mathbb R\) with an equivalent norm
Thus, \({\mathcal B}(\mathbb {P})\) is not reflexive since BV is not reflexive.
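The representation used in the converse direction can be written out explicitly: for \(f\in C^2([0,1])\), Taylor's formula with integral remainder and the identity \(\sigma (x-b) = (x-b)_+\) give
$$\begin{aligned} f(x) = f(0) + f'(0)\,x + \int _0^x (x-b)\,f''(b)\,\mathrm {d}b = f(0) + f'(0)\,\sigma (x) + \int _0^1 \sigma (x-b)\,f''(b)\,\mathrm {d}b \end{aligned}$$
for \(x\in [0,1]\); replacing \(f''(b)\,\mathrm {d}b\) by a Radon measure yields the extension to functions whose first derivative is of bounded variation.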
Finally, we demonstrate that integration against empirical measures converges quickly on \({\mathcal B}(\mathbb {P})\). Assume that \(p=\infty \) and \(q=1\).
Lemma A.10
The uniform Monte Carlo estimate
holds for any probability distribution \(\mathbb {P}\) such that \({\mathrm {spt}}(\mathbb {P})\subseteq [-1,1]^d\). Here, L is the Lipschitz-constant of \(\sigma \).
Proof
A single data point may underestimate or overestimate the average integral, and the proof of convergence relies on these cancellations. A convenient tool to formalize the cancellation and decouple this randomness from other effects is the Rademacher complexity [24, Chapter 26].
We denote \(S= \{X_1,\dots , X_n\}\) and assume that S is drawn iid from the distribution \(\mathbb {P}\). We consider an auxiliary random vector \(\xi \) such that the entries \(\xi _i\) are iid (and independent of S) variables which take the values \(\pm \, 1\) with probability 1/2 each. Furthermore, abbreviate by B the unit ball in \({\mathcal B}(\mathbb {P})\) and set
$$\begin{aligned} \mathrm {Rad}(B\circ S) = {\mathbb E}_\xi \left[ \sup _{f\in B}\ \frac{1}{n}\sum _{i=1}^n \xi _i\,f(X_i)\right] . \end{aligned}$$
According to [24, Lemma 26.2], the Rademacher complexity \(\mathrm {Rad}\) bounds the representativeness of the set S by
The unit ball in Barron space is given by convex combinations of functions
, respectively, so for fixed \(\xi \), the supremum of the linear map \(\phi \mapsto \frac{1}{n} \sum _{i=1}^n \xi _i\,\phi (X_i)\) over B coincides with its supremum over the functions generating the convex hull, i.e.,
According to the Contraction Lemma [24, Lemma 26.9], the Lipschitz-nonlinearity \(\sigma \) can be neglected in the computation of the complexity. If L is the Lipschitz-constant of \(\sigma \), then
where we used [24, Lemma 26.11] for the complexity bound of the linear function class and [24, Lemma 26.6] to eliminate the scalar translation. A similar computation can be done in the ReLU case. \(\square \)
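In summary, for the ReLU case the chain of estimates reads schematically (the constant C is an absolute constant accounting for the symmetrization of the generating class, and the final bound uses [24, Lemma 26.11] for data in \([-1,1]^d\); the constants are given for illustration only):
$$\begin{aligned} \mathrm {Rad}(B\circ S) \;\le \; C\,L\cdot \mathrm {Rad}\Big (\big \{x\mapsto w^\mathrm{T}x + b \;:\; \Vert w\Vert _{\ell ^1} + |b|\le 1\big \}\circ S\Big ) \;\le \; C\,L\,\sqrt{\frac{2\log \big (2(d+1)\big )}{n}}. \end{aligned}$$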
Appendix B: A brief review of reproducing kernel Hilbert spaces, random feature models, and the neural tangent kernel
For an introduction to kernel methods in machine learning, see, e.g., [24, Chapter 16] or [5] in the context of deep learning. Let \(k:\mathbb R^d\times \mathbb R^d\rightarrow \mathbb R\) be a symmetric positive definite kernel. The reproducing kernel Hilbert space (RKHS) \({\mathcal H}_k\) associated with k is the completion of the collection of functions of the form
$$\begin{aligned} f(x) = \sum _{i=1}^n a_i\,k(x, x_i), \qquad n\in \mathbb N,\ a_i\in \mathbb R,\ x_i\in \mathbb R^d, \end{aligned}$$
under the scalar product \(\langle k(x,x_i), k(x, x_j)\rangle _{{\mathcal H}_{k}} = k(x_i, x_j)\). Note that, due to the positive definiteness of the kernel, the representation of a function is unique.
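As a quick illustration of the reproducing property, for \(f(x) = \sum _{i=1}^n a_i\,k(x, x_i)\) the scalar product above gives
$$\begin{aligned} \langle f, k(\cdot , y)\rangle _{{\mathcal H}_{k}} = \sum _{i=1}^n a_i\,\langle k(\cdot , x_i), k(\cdot , y)\rangle _{{\mathcal H}_{k}} = \sum _{i=1}^n a_i\,k(x_i, y) = f(y), \end{aligned}$$
i.e., point evaluation is a continuous linear functional on \({\mathcal H}_k\) which is represented by the kernel itself.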
B.1 Random feature models
The random feature models considered in this article are functions of the same form
$$\begin{aligned} f(x) = \frac{1}{m}\sum _{i=1}^m a_i\,\sigma \big (w_i^\mathrm{T}x + b_i\big ) \end{aligned}$$
as a two-layer neural network. Unlike in a neural network, the parameters \((w_i, b_i)\) are not trainable and remain fixed after initialization. Random feature models are linear in the trainable parameters \(a_i\) (and therefore easier to optimize), but less expressive than shallow neural networks. While two-layer neural networks of infinite width are modeled as
$$\begin{aligned} f_\pi (x) = \int _{\mathbb R^{d+2}} a\,\sigma \big (w^\mathrm{T}x+b\big )\,\mathrm {d}\pi \end{aligned}$$
for variable probability measures \(\pi \), an infinitely wide random feature model is given as
$$\begin{aligned} f(x) = \int _{\mathbb R^{d+1}} a(w,b)\,\sigma \big (w^\mathrm{T}x+b\big )\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$
for a fixed distribution \(\pi ^0\) on \(\mathbb R^{d+1}\). One can think of Barron space as the union over all random feature spaces. It is well known that neural networks can represent the same function in different ways. For ReLU activation, there is a degree of degeneracy due to an elementary one-dimensional identity depending on a parameter \(\alpha \); one instance is given below.
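One such identity (stated here for illustration) is
$$\begin{aligned} \sigma (z-\alpha ) - \sigma (\alpha - z) = z - \alpha \qquad \text {for all } z, \alpha \in \mathbb R, \end{aligned}$$
so that the same affine function can be written as a superposition of ReLU features in infinitely many different ways.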
This can be generalized to higher dimension and integrated in \(\alpha \) to show that the random feature representation of functions where \(\pi ^0\) is the uniform distribution on the sphere has a similar degeneracy. For given \(\pi ^0\), denote
Lemma B.1
[23, Proposition 4.1] The space of random feature models is dense in the RKHS for the kernel
$$\begin{aligned} k(x,x') = \int _{\mathbb R^{d+1}} \sigma \big (w^\mathrm{T}x+b\big )\,\sigma \big (w^\mathrm{T}x'+b\big )\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b), \end{aligned}$$
and if
$$\begin{aligned} f(x) = \int _{\mathbb R^{d+1}} a(w,b)\,\sigma \big (w^\mathrm{T}x+b\big )\,\pi ^0(\mathrm {d}w\otimes \mathrm {d}b) \end{aligned}$$
for \(a\in L^2(\pi ^0)/N\), then
$$\begin{aligned} \Vert f\Vert _{{\mathcal H}_{k}} = \Vert a\Vert _{L^2(\pi ^0)/N}. \end{aligned}$$
Note that the proof in the source uses a different normalization. The result in this form is achieved by setting \(a= \alpha /p, b = \beta /p\) in the notation of [23].
Corollary B.2
Any function in the random feature RKHS is Lipschitz-continuous and \([f]_{\mathrm {Lip}} \le \Vert f\Vert _{{\mathcal H}_{k}}\). Thus, \({\mathcal H}_{k}\) embeds compactly into \(C^0({\mathrm {spt}}\,\mathbb {P})\) by the Arzelà–Ascoli theorem and a fortiori into \(L^2(\mathbb {P})\) for any compactly supported measure \(\mathbb {P}\).
Let \(\mathbb {P}\) be a compactly supported data distribution on \(\mathbb R^d\). Then, the kernel k acts on \(L^2(\mathbb {P})\) by the map
$$\begin{aligned} f\mapsto \int _{\mathbb R^d} k(\cdot , x')\,f(x')\,\mathbb {P}(\mathrm {d}x'). \end{aligned}$$
Computing the eigenvalues of the kernel k for a given parameter distribution \(\pi ^0\) and a given data distribution \(\mathbb {P}\) is a non-trivial endeavor. The task simplifies considerably under the assumption of symmetry, but remains complicated. The following results are taken from [2, Appendix D], where more general results are proved for \(\alpha \)-homogeneous activation for \(\alpha \ge 0\). We specify \(\alpha =1\).
Lemma B.3
Assume that \(\pi ^0 = \mathbb {P}= \alpha _d^{-1}\cdot {\mathcal H}^d|_{S^d}\) where \(S^d\) is the Euclidean unit sphere, \(\alpha _d\) is its volume and \(\sigma =\) ReLU. Then, the eigenfunctions of the kernel k are the spherical harmonics. The k-th eigenvalue \(\lambda _k\) (counted without repetition) occurs with the same multiplicity N(d, k) as eigenfunctions to the k-th eigenvalue of the Laplace–Beltrami operator on the sphere (the spherical harmonics). Precisely
for \(k\ge 2\).
We can extract a decay rate for the eigenvalues counted with repetition by estimating the height and width of the individual plateaus of eigenvalues. Denote by \(\mu _i\) the eigenvalues of k counted as often as they occur, i.e.,
By Stirling’s formula, one can estimate that for fixed d
as \(k\rightarrow \infty \) where we write \(a_k \sim b_k\) if and only if
On the other hand,
In particular, if \(\mu _i = \lambda _k\), then
Thus,
Corollary B.4
Consider the random feature model for ReLU activation when both parameter and data measure are given by the uniform distribution on the unit sphere. Then, the eigenvalues of the kernel decay like \(i^{-\frac{1}{2} + \frac{3}{2d}}\).
Remark B.5
It is easy to see that up to a multiplicative constant, all radially symmetric parameter distributions \(\pi ^0\) lead to the same random feature kernel. This can be used to obtain an explicit formula in the case when \(\pi ^0\) is a standard Gaussian in d dimensions. Since k only depends on the angle between x and \(x'\), we may assume that \(x=e_1\) and \(x' = \cos \phi \,e_1 + \sin \phi \,e_2\) with \(\phi \in [0,\pi ]\). Now, one can use that the projection of the standard Gaussian onto the \(w_1w_2\)-plane is a lower-dimensional standard Gaussian. Thus, the kernel does not depend on the dimension. An explicit computation in two dimensions shows that
see [5, Section 2.1].
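The explicit computation yields, up to the overall normalizing constant, the first-order arc-cosine kernel of [5]: with \(\phi \) the angle between x and \(x'\),
$$\begin{aligned} k(x,x') = c\,\Vert x\Vert \,\Vert x'\Vert \,\big (\sin \phi + (\pi - \phi )\cos \phi \big ), \end{aligned}$$
where \(c>0\) is a normalizing constant depending on the chosen normalization of \(\pi ^0\) (\(c = 1/\pi \) in the convention of [5]).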
B.2 The neural tangent kernel
The neural tangent kernel [20] is a different model for infinitely wide neural networks. For two-layer networks, it is obtained as the limiting object in a scaling regime for the parameters which makes the Barron norm infinite. When training networks on empirical risk, in a certain scaling regime between the number of data points, the number of neurons, and the initialization of the parameters, it can be shown that the parameters do not move far from their initial positions, which are drawn from a parameter distribution \(\bar{\pi }\), and that the gradient flow optimization of neural networks is close to the optimization of a kernel method for all times [9, 14]. This kernel is called the neural tangent kernel (NTK); it is obtained by summing, over all trainable parameters, the products of the derivatives of the network output at the two arguments of the kernel, and it linearizes the dynamics at the initial parameter distribution. For networks with one hidden layer, this is
$$\begin{aligned} k_{NTK}(x,x') = k_{RF}(x,x') + \int _{\mathbb R^{d+2}} a^2\,\sigma '\big (w^\mathrm{T}x+b\big )\,\sigma '\big (w^\mathrm{T}x'+b\big )\,\big [x^\mathrm{T}x' + 1\big ]\,\mathrm {d}\bar{\pi }, \end{aligned}$$
where \(k_{RF}\) denotes the random feature kernel with distribution \(P_{(w,b),\sharp }\bar{\pi }\). The second term on the right-hand side is a positive definite kernel in itself. This can be seen most easily by recalling that
On the other hand, if we assume that \(|a| = a_0\) and \(|(w,b)| = 1\) almost surely, we find that
since \(\sigma \) is positively one-homogeneous. Thus, the NTK satisfies
in the sense of quadratic forms. In particular, the eigenvalues of the NTK and the random feature kernel decay at the same rate. Clearly, in exchange for larger constants, it suffices to assume that (a, w, b) are bounded. In practice, the initialization of (w, b) is Gaussian, which concentrates close to the Euclidean sphere of radius \(\sqrt{d}\) in d dimensions.
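To illustrate the comparison of quadratic forms, note that the second term in the two-layer NTK displayed above is positive semi-definite: for any points \(x_1,\dots ,x_N\) and coefficients \(c_1,\dots ,c_N\in \mathbb R\),
$$\begin{aligned} \sum _{j,l=1}^N c_j c_l \int _{\mathbb R^{d+2}} a^2\,\sigma '\big (w^\mathrm{T}x_j+b\big )\,\sigma '\big (w^\mathrm{T}x_l+b\big )\,\big [x_j^\mathrm{T}x_l + 1\big ]\,\mathrm {d}\bar{\pi } = \int _{\mathbb R^{d+2}} a^2\,\bigg |\sum _{j=1}^N c_j\,\sigma '\big (w^\mathrm{T}x_j+b\big )\,(x_j, 1)\bigg |^2\,\mathrm {d}\bar{\pi } \ \ge \ 0, \end{aligned}$$
so \(k_{NTK}\ge k_{RF}\) in the sense of quadratic forms; the matching upper bound uses the boundedness assumptions on (a, w, b) discussed above.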
The neural tangent kernel is also defined for deep networks, see for example [1, 7, 15, 22, 27].
Keywords
- Curse of dimensionality
- Two-layer network
- Multi-layer network
- Population risk
- Barron space
- Reproducing kernel Hilbert space
- Random feature model
- Neural tangent kernel
- Kolmogorov width
- Approximation theory