
The Boundaries of Verifiable Accuracy, Robustness, and Generalisation in Deep Learning

  • Conference paper in Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Abstract

In this work, we assess the theoretical limitations of determining guaranteed stability and accuracy of neural networks in classification tasks. We consider a classical distribution-agnostic framework and algorithms minimising empirical risks, potentially subject to some weight regularisation. We show that there is a large family of tasks for which computing and verifying ideal stable and accurate neural networks in the above settings is extremely challenging, if possible at all, even when such ideal solutions exist within the given class of neural architectures.



Acknowledgements

This work is supported by the UKRI, EPSRC [UKRI Turing AI Fellowship ARaISE EP/V025295/2 and UKRI Trustworthy Autonomous Systems Node in Verifiability EP/V026801/2 to I.Y.T., EP/V025295/2 to O.S., A.N.G., and Q.Z., EP/V046527/1 and EP/P020720/1 to D.J.H., EP/V046527/1 to A.B.].

Author information

Corresponding author

Correspondence to Ivan Y. Tyukin.

Appendix

Proof of Theorem 1

Proof of Statement (i) of the Theorem.

The proof consists of three parts. The first part introduces a family of distributions satisfying the separability requirement (2) and establishes the relevant statistical properties of samples drawn from these distributions. The second part presents the construction of a suitable neural network which minimises the empirical loss function \(\mathcal {L}\) for any loss function \(\mathcal {R}\in \mathcal{C}\mathcal{F}_{\textrm{loc}}\) and which successfully generalises beyond the training (and test/validation) data. The final part shows that, with high probability, this network is unstable on nearly half of the data (for \(s+r\) reasonably large).

Proof of statement (i), part 1. Consider the n-dimensional hypercube \(\textrm{Cb}(2, 0)\) \(=\) \([-1,1]^n\). Within this cube, we may inscribe the unit ball \(\mathbb {B}_{n}\) (the surface of which touches the surface of the outer cube at the centre of each face), and within this ball we may, in turn, inscribe the inner cube \(\textrm{Cb}(2/\sqrt{n}, 0)\), each vertex of which touches the surface of the ball and whose faces are parallel to the faces of the cube \(\textrm{Cb}(2, 0)\). For any \(\varepsilon \in (0,\sqrt{n}-1)\), the cube \( \textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0) \) may be shown to satisfy \(\textrm{Cb}(2/\sqrt{n}, 0) \subset \textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0) \subset \textrm{Cb}(2, 0)\).

Let \( V = \{v_i\}_{i=1}^{2^n} \) denote the set of vertices of \(\textrm{Cb}(2/\sqrt{n}, 0)\) with an arbitrary but fixed ordering, and note that each \(v_i\) may be expressed as \(\frac{1}{\sqrt{n}}(q_1,\dots ,q_n)\) with each component \(q_k\in \{-1,1\}\). The choice of \(\varepsilon \) ensures that the set

$$ \mathcal {J}_0=\Big \{x\in \mathbb {S}_{n-1}(1,0) \ | \ x\notin \textrm{Cb}\Big (\frac{2}{\sqrt{n}}(1+\varepsilon ), 0\Big ) \Big \} $$

is non-empty and that \( \min _{x\in \mathcal {J}_0, \ y\in V} \Vert x-y\Vert > \frac{\varepsilon }{\sqrt{n}}\).
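This geometric claim is straightforward to probe numerically. The sketch below (Python with NumPy; purely illustrative and not part of the proof) samples points on the unit sphere, keeps those lying outside the enlarged cube \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0)\), and confirms that their distance to the nearest vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\) exceeds \(\varepsilon /\sqrt{n}\); the values of n and \(\varepsilon \) are arbitrary example choices.

```python
import numpy as np

# Empirical check of the bound min_{x in J_0, y in V} ||x - y|| > eps/sqrt(n).
# The vertex q/sqrt(n), q in {-1,1}^n, nearest to x is obtained with q = sign(x).
rng = np.random.default_rng(0)
n, eps = 6, 0.5                              # example values; any n >= 2, eps in (0, sqrt(n)-1)
half_width = (1.0 + eps) / np.sqrt(n)        # half-width of the enlarged cube

min_dist = np.inf
for _ in range(100_000):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                   # uniform point on the sphere S_{n-1}(1, 0)
    if np.max(np.abs(x)) <= half_width:      # inside the enlarged cube, hence not in J_0
        continue
    v = np.sign(x) / np.sqrt(n)              # nearest vertex of Cb(2/sqrt(n), 0)
    min_dist = min(min_dist, np.linalg.norm(x - v))

print(min_dist > eps / np.sqrt(n))           # True: every sampled point of J_0 is far from V
```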

Consider a family of distributions \(\mathcal {F}_1\subset \mathcal {F}\) which are supported on \(\mathbb {S}_{n-1}(1,0)\times \{0,1\}\), with the \(\sigma \)-algebra \(\Sigma _{\mathbb {S}}\times \{0\}\cup \Sigma _{\mathbb {S}}\times \{1\}\), where \(\Sigma _{\mathbb {S}}\) is the standard \(\sigma \)-algebra on the sphere \(\mathbb {S}_{n-1}\) with the topology induced by the arclength metric.

We construct \(\mathcal {F}_1\) as those distributions \(\mathcal {D}_\delta \in \mathcal {F}\) such that

$$\begin{aligned} \begin{aligned}&P_{\mathcal {D}_\delta }(x,\ell )=0 \text{ for } \ x\in \textrm{Cb}\left( \frac{2}{\sqrt{n}}(1+\varepsilon ), 0\right) \setminus V, \ \text{ and } \text{ any } \ell , \end{aligned} \end{aligned}$$
(7)

with

$$\begin{aligned} P_{\mathcal {D}_\delta }(x,\ell )= {\left\{ \begin{array}{ll} \frac{1}{2^{n+1}} &{} \text { for } x\in V, \ \ell =1 \\ 0, &{} \text { for } x\in V, \ \ell =0 \end{array}\right. } \end{aligned}$$
(8)

and

$$\begin{aligned} P_{\mathcal {D}_\delta }(\mathcal {J}_0,\ell ) = {\left\{ \begin{array}{ll} 0&{} \text { for } \ell =1,\\ \frac{1}{2}&{} \text { for } \ell =0. \end{array}\right. } \end{aligned}$$
(9)

The existence of an uncountable family of distributions \(\mathcal {D}_\delta \) satisfying (7)–(9) is ensured by the flexibility of (9) and the fact that \(\mathcal {J}_0\) contains more than a single point (consider e.g. the family of all delta-functions supported on \(\mathcal {J}_0\) and scaled by 1/2). This construction moreover ensures that any \(\mathcal {D}_\delta \in \mathcal {F}_1\) also satisfies the separation property (2) with \(\delta \le \frac{\varepsilon }{\sqrt{n}}\).
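For concreteness, the sketch below (illustrative only) samples from one admissible member of this family: the label-1 mass is spread uniformly over the \(2^n\) vertices of \(\textrm{Cb}(2/\sqrt{n}, 0)\) as in (8), and the label-0 mass of 1/2 is concentrated on a single point of \(\mathcal {J}_0\), one of the delta-function choices permitted by (9). The particular point \(x_0=e_1\) is an arbitrary assumption made here for illustration.

```python
import numpy as np

def make_sampler(n, eps, seed=0):
    """Draw (x, label) pairs from one concrete distribution in the family F_1."""
    rng = np.random.default_rng(seed)
    assert 0.0 < eps < np.sqrt(n) - 1.0
    # e_1 lies on the unit sphere and, since eps < sqrt(n) - 1, outside the enlarged
    # cube, so it belongs to J_0; it carries the whole label-0 mass of 1/2.
    x0 = np.zeros(n)
    x0[0] = 1.0

    def sample():
        if rng.random() < 0.5:
            q = rng.integers(0, 2, size=n) * 2 - 1      # q in {-1, 1}^n
            return q / np.sqrt(n), 1                    # a vertex of Cb(2/sqrt(n), 0), label 1
        return x0.copy(), 0                             # the fixed point of J_0, label 0
    return sample

sample = make_sampler(n=6, eps=0.5)
M = [sample() for _ in range(20)]                       # a small training/validation sample
```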

Let \( \mathcal {M}=\mathcal {T}\cup \mathcal {V} = \{(x_k, \ell _k)\}_{k=1}^{M} \) denote the (multi-)set corresponding to the union of the training and validation sets independently sampled from \(\mathcal {D}_{\delta }\), where \( M=s+r=|\mathcal {M}|. \) Let \(z:\mathbb {R}^n\times \{0,1\} \rightarrow \{0,1\}\) be the trivial function mapping a sample \((x,\ell )\) from \(\mathcal {D}_\delta \) into \(\{0,1\}\) by \( z(x,\ell )=\ell . \) This function defines new random variables \(Z_k = z(x_k, \ell _k) \in [0,1]\) for \(k = 1, \dots , M\), with expectation \( E(Z_k)=\frac{1}{2}. \)

The Hoeffding inequality ensures that

$$ P\left( \frac{1}{2} - \frac{1}{M}\sum _{k=1}^{M} Z_k > q \right) \le \exp \left( -2 q^2 M\right) , $$

and hence, with probability greater than or equal to

$$\begin{aligned} 1 - \exp \left( -2 q^2 M\right) , \end{aligned}$$
(10)

the number of data points \((x,\ell )\) with \(\ell =1\) in the sample \(\mathcal {M}\) is at least

$$\begin{aligned} \lfloor \left( \frac{1}{2} - q\right) M \rfloor . \end{aligned}$$
(11)
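For example (an illustrative computation only, with arbitrary example values of M and q), the bound (10) and the count (11) can be evaluated as follows.

```python
import math

# Bound (10) and count (11): with M = s + r sample points and margin q,
# at least floor((1/2 - q) * M) of them carry label 1 with probability
# at least 1 - exp(-2 * q^2 * M).
M, q = 20_000, 0.05
prob_bound = 1.0 - math.exp(-2.0 * q ** 2 * M)     # (10): about 1 - 4e-44
count_bound = math.floor((0.5 - q) * M)            # (11): 9000 label-1 points
print(prob_bound, count_bound)
```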

Proof of statement (i), part 2. Let \(\{e_1,\dots ,e_n\}\) be the standard basis in \(\mathbb {R}^n\). Consider the following set of 2n inequalities:

$$\begin{aligned} (x,e_i) \le \frac{1}{\sqrt{n}}, \ (x,e_i) \ge -\frac{1}{\sqrt{n}}, \text { for } i=1,\dots ,n. \end{aligned}$$
(12)

Any function which is defined on a domain containing \([-1/\sqrt{n},1/\sqrt{n}]^n\), returns 1 for x satisfying (12), and returns 0 otherwise, minimises the loss \(\mathcal {L}\) on \(\mathcal {T}\). It also generalises perfectly to any \(\mathcal {V}\). Hence a network implementing such a function shares the same properties.

Pick a function \(g_\theta \in \mathcal {K}_\theta \) and consider

$$ g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ), \ g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta ), \ i=1,\dots ,n. $$

It is clear that \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \le 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) > 1/\sqrt{n}\). Similarly \(g_\theta (\theta ) - g_\theta (- (x,e_i) -1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \ge - 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) < - 1/\sqrt{n}\). Hence, the function f given by

$$\begin{aligned} \begin{aligned} f(x)=&\textrm{sign}\left( \sum _{i=1}^n g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta )\right. \\&+ \left. \sum _{i=1}^n g_\theta (\theta ) - g_\theta (- (x,e_i) -1/\sqrt{n} + \theta )\right) \end{aligned} \end{aligned}$$
(13)

is exactly 1 only when all inequalities (12) hold true, and is zero otherwise. We may therefore conclude that

$$ f\in \arg \min _{\varphi \in \mathcal{N}\mathcal{N}_{\mathbf{{N}},L}} \mathcal {L}(\mathcal {T}\cup \mathcal {V},\varphi ). $$

Observe now that (13) is a two-layer neural network with 2n neurons in the hidden layer and a threshold output. This core network can be extended to any larger size without changing the map f by propagating the argument of \(\textrm{sign}(\cdot )\) in (13) through the additional layers and padding the width as appropriate.
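A minimal sketch of the network (13) is given below, assuming, purely for illustration, that the ReLU \(g_\theta (t)=\max (0,t)\) with \(\theta =0\) is an admissible choice of \(g_\theta \in \mathcal {K}_\theta \) (the class \(\mathcal {K}_\theta \) is defined in the main text and not reproduced here) and that \(\textrm{sign}(0)=1\), as required for (13) to output 1 on the inner cube.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def f(x, theta=0.0, g=relu):
    """Two-layer network (13): returns 1 iff all 2n inequalities (12) hold.

    Assumes g(theta + s) = g(theta) for s <= 0 and g(theta + s) > g(theta) for s > 0;
    the ReLU with theta = 0 is one such choice (an illustrative assumption).
    """
    n = x.size
    c = 1.0 / np.sqrt(n)
    s = np.sum(g(theta) - g(x - c + theta)) + np.sum(g(theta) - g(-x - c + theta))
    return 1 if s >= 0 else 0      # threshold playing the role of sign, with sign(0) = 1

n = 6
vertex = np.ones(n) / np.sqrt(n)   # a vertex of Cb(2/sqrt(n), 0): all inequalities (12) hold
outside = np.zeros(n)
outside[0] = 1.0                   # e_1 violates (12)
print(f(vertex), f(outside))       # 1 0
```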

Proof of statement (i), part 3. Let us now show that the map (13) becomes unstable for an appropriately-sized set \(\mathcal {M}\). Suppose that there are at least \(\lfloor (1/2-q)M\rfloor \) data points on which \(f(x)=1\); by construction, each such point is a vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\). According to (10), (11), this event occurs with positive probability. Let x be one such point and let \(\zeta \) be a perturbation drawn from the uniform distribution on the ball \(\mathbb {B}_n(\alpha /\sqrt{n}, 0)\) for some \(\alpha \in (0,\varepsilon /2)\). Then, with probability \(1 - \frac{1}{2^n}\), the perturbation \(\zeta \) satisfies \(|f(x + \zeta ) - f(x)| = 1\): this holds for any \(\zeta \) such that \(x + \zeta \notin \mathcal {I} = \textrm{Cb}(2/\sqrt{n}, 0) \cap \mathbb {B}_n(\alpha / \sqrt{n}, x)\), and the set \(\mathcal {I}\) is cut out of the ball \(\mathbb {B}_n(\alpha / \sqrt{n}, x)\) by the signs of exactly n linear inequalities which slice the ball into \(2^n\) pieces of equal volume, so that \(\zeta \) lands in \(\mathcal {I}\) with probability \(\frac{1}{2^n}\).
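The flip probability \(1-\frac{1}{2^n}\) can be checked empirically. The sketch below (illustrative only; example values of n, \(\varepsilon \) and \(\alpha \)) perturbs a vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\) by \(\zeta \) drawn uniformly from \(\mathbb {B}_n(\alpha /\sqrt{n}, 0)\) and estimates how often the indicator of the inequalities (12), i.e. the map realised by (13), changes its value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 6, 0.5
alpha = eps / 4.0                         # any alpha in (0, eps/2)
c = 1.0 / np.sqrt(n)

def f(x):
    # indicator of the inner cube Cb(2/sqrt(n), 0), i.e. of the inequalities (12)
    return 1 if np.all(np.abs(x) <= c) else 0

def uniform_ball(radius, dim):
    # uniform sample from the ball of the given radius centred at the origin
    z = rng.standard_normal(dim)
    z /= np.linalg.norm(z)
    return radius * rng.random() ** (1.0 / dim) * z

x = np.full(n, c)                         # a vertex of the inner cube, so f(x) = 1
trials = 50_000
flips = sum(f(x + uniform_ball(alpha * c, n)) != f(x) for _ in range(trials)) / trials
print(flips, 1.0 - 1.0 / 2 ** n)          # both close to 0.984
```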

Finally, note that if there are at least m points \((u^1,\ell ^1),\dots ,(u^m,\ell ^m)\) in the set \(\mathcal {U}\), then the union bound implies that the probability that all \(u^i+\zeta \), \(i=1,\dots ,m\), land outside of the corresponding intersections is at least \(1 - m/2^n\), which completes the argument.

Proof of Statement (ii) of the Theorem.

The argument used in the proof of statement (i), part 2, implies that there exists a network \(\tilde{f}\in \mathcal{N}\mathcal{N}_{\mathbf{{N}},L}\) such that \(\tilde{f}(x)\) takes value 1 when the inequalities

$$\begin{aligned} (x,e_i) \le \frac{1}{\sqrt{n}} \Big ( 1+\frac{\varepsilon }{2} \Big ), \ (x,e_i) \ge - \frac{1}{\sqrt{n}} \Big ( 1+\frac{\varepsilon }{2} \Big ), \text { for } i=1,\dots ,n. \end{aligned}$$
(14)

are satisfied, and zero otherwise. This network also minimises \(\mathcal {L}\) and generalises beyond the training and validation data.

However, since for any \(\alpha \in (0,\varepsilon /2)\) the function \(\tilde{f}\) is constant within a ball of radius \(\alpha /\sqrt{n}\) around any data point \(x \in \mathcal {T} \cup \mathcal {V}\), we can conclude that \(\tilde{f}\) is insusceptible to the instabilities affecting f.

To show that there exists a pair of unstable and stable networks, f and \(\tilde{f}\) (the network \(\tilde{f}\) is stable with respect to perturbations \(\zeta : \ \Vert \zeta \Vert \le \alpha /\sqrt{n}\)), consider the systems of inequalities (12), (14) with both sides multiplied by a positive constant \(\kappa >0\). Clearly, and regardless of the value of \(\kappa \), these systems of inequalities define the cubes \(\textrm{Cb}(2/\sqrt{n},0)\) and \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2),0)\), respectively. Then

$$\begin{aligned} \begin{aligned} f(x)=&\textrm{sign}\left( \sum _{i=1}^n g_\theta (\theta ) - g_\theta (\kappa ((x,e_i) - 1/\sqrt{n}) + \theta )\right. \\&+ \left. \sum _{i=1}^n g_\theta (\theta ) - g_\theta (\kappa (- (x,e_i) -1/\sqrt{n}) + \theta )\right) \end{aligned} \end{aligned}$$
(15)

encodes the unstable network, and

$$\begin{aligned} \begin{aligned} \tilde{f}(x)=&\textrm{sign}\left( \sum _{i=1}^n g_\theta (\theta ) - g_\theta (\kappa ((x,e_i) - (1+\varepsilon /2)/\sqrt{n}) + \theta )\right. \\&+ \left. \sum _{i=1}^n g_\theta (\theta ) - g_\theta (\kappa (- (x,e_i) -(1+\varepsilon /2)/\sqrt{n}) + \theta )\right) \end{aligned} \end{aligned}$$
(16)

encodes the stable one. These networks share the same weights but their biases differ in absolute value by \(\kappa \varepsilon /(2\sqrt{n})\). Given that \(\kappa \) can be chosen arbitrarily small or arbitrarily large, the statement now follows.
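The sketch below (illustrative only) instantiates the pair (15), (16) under the same assumptions as before (ReLU in place of \(g_\theta \), \(\theta =0\), and example values of n, \(\varepsilon \), \(\kappa \) and \(\alpha \)): the two networks share their input weights and differ only in the biases, yet f flips on almost every admissible perturbation of a vertex while \(\tilde{f}\) never does.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, kappa = 6, 0.5, 1.0
alpha = eps / 4.0
c = 1.0 / np.sqrt(n)

def relu(t):
    return np.maximum(0.0, t)

def net(x, half_width):
    # common template behind (15) and (16): 1 iff |(x, e_i)| <= half_width for all i
    s = np.sum(-relu(kappa * (x - half_width))) + np.sum(-relu(kappa * (-x - half_width)))
    return 1 if s >= 0 else 0

f       = lambda x: net(x, c)                    # unstable network (15)
f_tilde = lambda x: net(x, c * (1.0 + eps / 2))  # stable network (16)

def uniform_ball(radius, dim):
    z = rng.standard_normal(dim)
    z /= np.linalg.norm(z)
    return radius * rng.random() ** (1.0 / dim) * z

x = np.full(n, c)                                # a vertex of Cb(2/sqrt(n), 0)
trials = 10_000
flips_f = flips_ft = 0
for _ in range(trials):
    zeta = uniform_ball(alpha * c, n)
    flips_f += f(x + zeta) != f(x)
    flips_ft += f_tilde(x + zeta) != f_tilde(x)
print(flips_f / trials, flips_ft / trials)       # about 0.984 and exactly 0.0
```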

Proof of Statement (iii) of the Theorem.

Part a) of statement (iii) can be demonstrated by following the same argument used in the proof of statement (i), with the cube \(\textrm{Cb}(2/\sqrt{n}, 0)\) replaced by \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2), 0)\).

Part b) follows by considering a slightly modified family of distributions \(\mathcal {D}_\delta \) in which the set V is replaced with

$$ V=\{v_i \ | \ i=1,\dots ,2^n - k \} \cup \hat{V}, $$

where

$$ \hat{V}=\{v_i (1 + \varepsilon /2) \ | \ i=2^n-k+1,\dots , 2^n\}. $$

The probability that a single point from \(\hat{V}\) is not present in \(\mathcal {M}\) is \((1-1/2^{n+1})^M\). Since the samples are drawn independently, the probability that no point of \(\hat{V}\) is present in \(\mathcal {M}\) is \((1-k/2^{n+1})^{M}\). The probability, however, that a single draw from \(\mathcal {D}_\delta \) produces a point from \(\hat{V}\) is \(k/2^{n+1}\).    \(\square \)
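As a purely numerical illustration of these quantities (not part of the proof, and with arbitrary example values of n, k and M):

```python
import math

# Probabilities appearing in part b) of statement (iii).
n, k, M = 10, 4, 500
p_hit = k / 2 ** (n + 1)                          # a single draw lands in V_hat
p_absent_one = (1.0 - 1.0 / 2 ** (n + 1)) ** M    # a fixed point of V_hat never appears among M draws
p_absent_all = (1.0 - p_hit) ** M                 # no point of V_hat appears among M draws
print(p_hit, p_absent_one, p_absent_all)          # 0.00195..., ~0.78, ~0.38
```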


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bastounis, A. et al. (2023). The Boundaries of Verifiable Accuracy, Robustness, and Generalisation in Deep Learning. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14254. Springer, Cham. https://doi.org/10.1007/978-3-031-44207-0_44


  • DOI: https://doi.org/10.1007/978-3-031-44207-0_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44206-3

  • Online ISBN: 978-3-031-44207-0

  • eBook Packages: Computer Science, Computer Science (R0)
