Abstract
In this work, we assess the theoretical limitations of determining guaranteed stability and accuracy of neural networks in classification tasks. We consider the classical distribution-agnostic framework and algorithms that minimise empirical risk, potentially subject to some weight regularisation. We show that there is a large family of tasks for which computing and verifying ideal stable and accurate neural networks in the above settings is extremely challenging, if at all possible, even when such ideal solutions exist within the given class of neural architectures.
Acknowledgements
This work is supported by the UKRI, EPSRC [UKRI Turing AI Fellowship ARaISE EP/V025295/2 and UKRI Trustworthy Autonomous Systems Node in Verifiability EP/V026801/2 to I.Y.T., EP/V025295/2 to O.S., A.N.G., and Q.Z., EP/V046527/1 and EP/P020720/1 to D.J.H, EP/V046527/1 to A.B.].
Appendix
4.1 Proof of Theorem 1
Proof of Statement (i) of the Theorem.
The proof consists of three parts. The first part introduces a family of distributions satisfying the separability requirement (2) and shows relevant statistical properties of samples drawn from these distributions. The second part presents the construction of a suitable neural network minimising the empirical loss function \(\mathcal {L}\) for any loss function \(\mathcal {R}\in \mathcal{C}\mathcal{F}_{\textrm{loc}}\) which successfully generalises beyond training (and test/validation) data. The final part shows that, with high probability, this network is unstable on nearly half of the data (for \(s+r\) reasonably large).
Proof of statement (i), part 1. Consider the n-dimensional hypercube \(\textrm{Cb}(2, 0)\) \(=\) \([-1,1]^n\). Within this cube, we may inscribe the unit ball \(\mathbb {B}_{n}\) (whose surface touches the outer cube at the centre of each face), and within this ball we may, in turn, inscribe the inner cube \(\textrm{Cb}(2/\sqrt{n}, 0)\), each vertex of which touches the surface of the ball and whose faces are parallel to the faces of the cube \(\textrm{Cb}(2, 0)\). For any \(\varepsilon \in (0,\sqrt{n}-1)\), the cube \( \textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0) \) may be shown to satisfy \(\textrm{Cb}(2/\sqrt{n}, 0) \subset \textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0) \subset \textrm{Cb}(2, 0)\).
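As a quick numerical sanity check of this construction (not part of the proof), the sketch below verifies, for the illustrative choices \(n=5\) and \(\varepsilon =0.5\), that the vertices of \(\textrm{Cb}(2/\sqrt{n}, 0)\) lie on the unit sphere and that the three cubes are properly nested.

```python
import itertools
import math

n, eps = 5, 0.5  # illustrative dimension and epsilon in (0, sqrt(n) - 1)
assert 0 < eps < math.sqrt(n) - 1

# Vertices of the inner cube Cb(2/sqrt(n), 0): all sign patterns scaled by 1/sqrt(n).
vertices = [tuple(q / math.sqrt(n) for q in signs)
            for signs in itertools.product((-1, 1), repeat=n)]

# Each vertex has Euclidean norm 1, i.e. it lies on the unit sphere touching B_n.
assert all(abs(math.sqrt(sum(c * c for c in v)) - 1.0) < 1e-12 for v in vertices)

# Half-widths of the nested cubes Cb(2/sqrt(n),0), Cb((2/sqrt(n))(1+eps),0), Cb(2,0).
inner, middle, outer = 1 / math.sqrt(n), (1 + eps) / math.sqrt(n), 1.0
assert inner < middle < outer  # the inclusion chain holds exactly when eps < sqrt(n) - 1
```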
Let \( V = \{v_i\}_{i=1}^{2^n} \) denote the set of vertices of \(\textrm{Cb}(2/\sqrt{n}, 0)\) with an arbitrary but fixed ordering, and note that each \(v_i\) may be expressed as \(\frac{1}{\sqrt{n}}(q_1,\dots ,q_n)\) with each component \(q_k\in \{-1,1\}\); in particular, every \(v_i\) lies on the unit sphere \(\mathbb {S}_{n-1}(1,0)\). The choice of \(\varepsilon \) ensures that the set

\( \mathcal {J}_0 = \mathbb {S}_{n-1}(1,0) \setminus \textrm{Cb}\left( \frac{2}{\sqrt{n}}(1+\varepsilon ), 0\right) \)

is non-empty and that \( \min _{x\in \mathcal {J}_0, \ y\in V} \Vert x-y\Vert > \frac{\varepsilon }{\sqrt{n}}\).
Consider a family of distributions \(\mathcal {F}_1\subset \mathcal {F}\) which are supported on \(\mathbb {S}_{n-1}(1,0)\times \{0,1\}\), with the \(\sigma \)-algebra \(\Sigma _{\mathbb {S}}\times \{0\}\cup \Sigma _{\mathbb {S}}\times \{1\}\), where \(\Sigma _{\mathbb {S}}\) is the standard \(\sigma \)-algebra on the sphere \(\mathbb {S}_{n-1}\) with the topology induced by the arclength metric.
We construct \(\mathcal {F}_1\) as those distributions \(\mathcal {D}_\delta \in \mathcal {F}\) such that

\( \mathcal {D}_\delta = \frac{1}{2}\,\mathcal {D}_{1} + \frac{1}{2}\,\mathcal {D}_{0}, \qquad (7) \)

with

\( \mathcal {D}_{1} \ \text {the uniform distribution on} \ V\times \{1\}, \qquad (8) \)

and

\( \mathcal {D}_{0} \ \text {an arbitrary distribution supported on} \ \mathcal {J}_0\times \{0\}. \qquad (9) \)
The existence of an uncountable family of distributions \(\mathcal {D}_\delta \) satisfying (7)–(9) is ensured by the flexibility of (9) and the fact that \(\mathcal {J}_0\) contains more than a single point (consider e.g. the family of all delta-functions supported on \(\mathcal {J}_0\) and scaled by 1/2). This construction moreover ensures that any \(\mathcal {D}_\delta \in \mathcal {F}_1\) also satisfies the separation property (2) with \(\delta \le \frac{\varepsilon }{\sqrt{n}}\).
Let \( \mathcal {M}=\mathcal {T}\cup \mathcal {V} = \{(x_k, \ell _k)\}_{k=1}^{M} \) denote the (multi-)set corresponding to the union of the training and validation sets independently sampled from \(\mathcal {D}_{\delta }\), where \( M=s+r=|\mathcal {M}| \). Let \(z:\mathbb {R}^n\times \{0,1\} \rightarrow \{0,1\}\) be the trivial function mapping a sample \((x,\ell )\) from \(\mathcal {D}_\delta \) to its label, \( z(x,\ell )=\ell \). This function defines new random variables \(Z_k = z(x_k, \ell _k) \in \{0,1\}\) for \(k = 1, \dots , M\), with expectation \( E(Z_k)=\frac{1}{2} \).
The Hoeffding inequality ensures that, for any \(q\in (0,1/2)\),

\( P\left( \frac{1}{M}\sum _{k=1}^{M} Z_k - \frac{1}{2} \le -q \right) \le \exp \left( -2 M q^2\right) , \qquad (10) \)

and hence, with probability greater than or equal to

\( 1 - \exp \left( -2 M q^2\right) , \qquad (11) \)

the number of data points \((x,\ell )\) with \(\ell =1\) in the sample \(\mathcal {M}\) is at least \(\lfloor (1/2-q) M\rfloor \).
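This counting step admits a simple Monte Carlo illustration. In the sketch below, the sample size M, margin q, and trial budget are illustrative choices, and labels are drawn as fair coin flips, matching \(E(Z_k)=\frac{1}{2}\).

```python
import math
import random

random.seed(0)
M, q, trials = 200, 0.1, 5000  # illustrative sample size, margin, and Monte Carlo budget

threshold = math.floor((0.5 - q) * M)   # required number of label-1 points
bound = math.exp(-2 * M * q * q)        # Hoeffding bound on the failure probability

# Each trial draws M fair Bernoulli labels and records whether fewer than
# floor((1/2 - q) M) of them equal 1 (the "failure" event bounded by Hoeffding).
failures = sum(
    1 for _ in range(trials)
    if sum(random.randint(0, 1) for _ in range(M)) < threshold
)
failure_rate = failures / trials
assert failure_rate <= bound + 0.02  # empirical rate is consistent with the bound
```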
Proof of statement (i), part 2. Let \(\{e_1,\dots ,e_n\}\) be the standard basis in \(\mathbb {R}^n\). Consider the following set of 2n inequalities:

\( (x,e_i) \le \frac{1}{\sqrt{n}}, \quad -(x,e_i) \le \frac{1}{\sqrt{n}}, \quad i=1,\dots ,n. \qquad (12) \)
Any function defined on \([-1/\sqrt{n},1/\sqrt{n}]^n\) (or whose domain contains \([-1/\sqrt{n},1/\sqrt{n}]^n\)) which returns 1 for x satisfying (12) and 0 otherwise minimises the loss \(\mathcal {L}\) on \(\mathcal {T}\). It also generalises perfectly well to any \(\mathcal {V}\). Hence a network implementing such a function shares the same properties.
Pick a function \(g_\theta \in \mathcal {K}_\theta \) and consider the 2n quantities \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta )\) and \(g_\theta (\theta ) - g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta )\), \(i=1,\dots ,n\).
It is clear that \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \le 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) > 1/\sqrt{n}\). Similarly \(g_\theta (\theta ) - g_\theta (- (x,e_i) -1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \ge - 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) < - 1/\sqrt{n}\). Hence, the function f given by

\( f(x) = \textrm{sign}\left( \sum _{i=1}^{n} \left[ g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) \right] + \left[ g_\theta (\theta ) - g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta ) \right] \right) , \qquad (13) \)

where \(\textrm{sign}(t)=1\) for \(t\ge 0\) and \(\textrm{sign}(t)=0\) otherwise, is exactly 1 only when all inequalities (12) hold true, and is zero otherwise. We may therefore conclude that f minimises the loss \(\mathcal {L}\) on \(\mathcal {T}\) and generalises to any \(\mathcal {V}\).
Observe now that (13) is a two-layer neural network with 2n neurons in the hidden layer and a threshold output. This core network can be extended to any larger size without changing the map f, by propagating the argument of \(\textrm{sign}(\cdot )\) in (13) through the subsequent layers and padding the widths as appropriate.
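For illustration, the network above can be instantiated with the ReLU as one admissible (assumed) choice of \(g_\theta \) with \(\theta = 0\); the class \(\mathcal {K}_\theta \) itself is not reproduced here. The sketch below checks that the resulting threshold network returns 1 exactly on the cube \([-1/\sqrt{n},1/\sqrt{n}]^n\).

```python
import math

def relu(t):
    # One admissible choice for g_theta with theta = 0 (assumed here for illustration).
    return max(t, 0.0)

def f(x):
    # Two-layer network of the form (13): 2n hidden ReLU units and a threshold output.
    n = len(x)
    hidden = sum(-relu(xi - 1 / math.sqrt(n)) - relu(-xi - 1 / math.sqrt(n)) for xi in x)
    return 1 if hidden >= 0 else 0  # 1 iff all 2n inequalities of type (12) hold

n = 4
inside = [0.0] * n                                     # centre of the inner cube
vertex = [1 / math.sqrt(n)] * n                        # a vertex of Cb(2/sqrt(n), 0)
outside = [1 / math.sqrt(n) + 0.1] + [0.0] * (n - 1)   # violates one inequality

assert f(inside) == 1 and f(vertex) == 1 and f(outside) == 0
```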
Proof of statement (i), part 3. Let us now show that the map (13) becomes unstable for an appropriately sized set \(\mathcal {M}\). Suppose that there are \(\lfloor (1/2-q)M\rfloor \) data points on which \(f(x)=1\); by construction, each such point is a vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\). According to (10), (11), the probability of this event is not zero. Let x be one such point and let \(\zeta \) be a perturbation drawn from the uniform distribution on the ball \(\mathbb {B}_n(\alpha /\sqrt{n}, 0)\) for some \(\alpha \in (0,\varepsilon /2)\). Then, with probability \(1 - \frac{1}{2^n}\), the perturbation \(\zeta \) satisfies \(|f(x + \zeta ) - f(x)| = 1\): this holds for any \(\zeta \) such that \(x + \zeta \notin \mathcal {I} = \textrm{Cb}(2/\sqrt{n}, 0) \cap \mathbb {B}_n(\alpha / \sqrt{n}, x)\), and the set \(\mathcal {I}\) is cut out by the signs of exactly n linear inequalities which slice the ball \(\mathbb {B}_n(\alpha / \sqrt{n}, x)\) into \(2^n\) pieces of equal volume, so that the event \(x+\zeta \in \mathcal {I}\) has probability \(\frac{1}{2^n}\).
Finally, note that if there are at least m points \((u^1,\ell ^1),\dots ,(u^m,\ell ^m)\) in the set \(\mathcal {U}\), then, by the union bound, the probability that all \(u^i+\zeta \), \(i=1,\dots ,m\), fall outside of the corresponding intersections \(\mathcal {I}\) is at least \(1 - \frac{m}{2^n}\), which completes the argument.
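The \(1-\frac{1}{2^n}\) instability probability can likewise be illustrated numerically. In the sketch below, the dimension n, the radius parameter \(\alpha \), and the trial count are illustrative; a perturbed vertex leaves \(\textrm{Cb}(2/\sqrt{n}, 0)\), and hence flips f, precisely when some coordinate of the perturbation points outward.

```python
import math
import random

random.seed(1)
n, alpha, trials = 3, 0.2, 20000  # small illustrative dimension; alpha in (0, eps/2)

x = [1 / math.sqrt(n)] * n  # a vertex of the inner cube Cb(2/sqrt(n), 0)

def uniform_in_ball(dim, radius):
    # Uniform sample in a dim-ball: Gaussian direction, radius scaled by U^(1/dim).
    g = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    r = radius * random.random() ** (1 / dim)
    return [r * c / norm for c in g]

# x + zeta stays inside the cube iff every coordinate of zeta points inward,
# an event of probability exactly 1/2^n by the symmetry of the ball.
flips = 0
for _ in range(trials):
    zeta = uniform_in_ball(n, alpha / math.sqrt(n))
    if any(x[i] + zeta[i] > 1 / math.sqrt(n) for i in range(n)):
        flips += 1  # the perturbed point left Cb(2/sqrt(n), 0), so f changes value

assert abs(flips / trials - (1 - 1 / 2 ** n)) < 0.02  # ~ 0.875 for n = 3
```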
Proof of Statement (ii) of the Theorem.
The argument used in the proof of statement (i), part 2, implies that there exists a network \(\tilde{f}\in \mathcal{N}\mathcal{N}_{\mathbf{{N}},L}\) such that \(\tilde{f}(x)\) takes value 1 when the inequalities

\( (x,e_i) \le \frac{1+\varepsilon /2}{\sqrt{n}}, \quad -(x,e_i) \le \frac{1+\varepsilon /2}{\sqrt{n}}, \quad i=1,\dots ,n \qquad (14) \)
are satisfied, and zero otherwise. This network also minimises \(\mathcal {L}\) and generalises beyond the training and validation data.
However, since for any \(\alpha \in (0,\varepsilon /2)\) the function \(\tilde{f}\) is constant within a ball of radius \(\alpha /\sqrt{n}\) around any data point \(x \in \mathcal {T} \cup \mathcal {V}\), we conclude that \(\tilde{f}\) is not susceptible to the instabilities affecting f.
To show that there exists a pair of unstable and stable networks, f and \(\tilde{f}\) (the network \(\tilde{f}\) being stable with respect to perturbations \(\zeta \) with \(\Vert \zeta \Vert \le \alpha /\sqrt{n}\)), consider the systems of inequalities (12), (14) with both sides multiplied by a positive constant \(\kappa >0\). Clearly, and regardless of the multiplication by \(\kappa \), these systems of inequalities define the cubes \(\textrm{Cb}(2/\sqrt{n},0)\) and \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2),0)\), respectively. Then the network implementing the \(\kappa \)-scaled system (12) encodes the unstable network, and the network implementing the \(\kappa \)-scaled system (14) encodes the stable one. These networks share the same weights, but their biases differ in absolute value by \(\kappa \varepsilon /(2\sqrt{n})\). Given that \(\kappa \) can be chosen arbitrarily small or arbitrarily large, the statement now follows.
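A short numerical check (with illustrative values of n, \(\varepsilon \), and \(\kappa \)) confirms that the \(\kappa \)-scaled systems keep the same decision boundaries while the biases differ by exactly \(\kappa \varepsilon /(2\sqrt{n})\).

```python
import math

n, eps = 4, 0.4  # illustrative dimension and eps in (0, sqrt(n) - 1)

for kappa in (1e-3, 1.0, 1e3):
    # kappa-scaled systems: kappa*(x, e_i) <= kappa/sqrt(n)              (unstable, from (12))
    # and kappa*(x, e_i) <= kappa*(1 + eps/2)/sqrt(n)                    (stable, from (14)).
    bias_unstable = kappa / math.sqrt(n)
    bias_stable = kappa * (1 + eps / 2) / math.sqrt(n)
    # The decision boundary recovered from weight/bias ratio is kappa-independent ...
    assert abs(bias_unstable / kappa - 1 / math.sqrt(n)) < 1e-12
    # ... while the biases differ by exactly kappa * eps / (2 sqrt(n)).
    assert abs((bias_stable - bias_unstable) - kappa * eps / (2 * math.sqrt(n))) < 1e-6
```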
Proof of Statement (iii) of the Theorem.
Part a) of statement (iii) can be demonstrated following the same argument used to prove statement (i), replacing the cube \(\textrm{Cb}(2/\sqrt{n}, 0)\) with \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2), 0)\).
Part b) follows by considering a slightly modified family of distributions \(\mathcal {D}_\delta \) in which the set V is replaced with
where
The probability that a single point from \(\hat{V}\) is not present in \(\mathcal {M}\) is \((1-1/2^{n+1})^M\). Since the samples are drawn independently, the probability that none of these points are present in \(\mathcal {M}\) is \((1-1/2^{n+1})^{Mk}\). The probability, however, that a point from \(\hat{V}\) is sampled is \(k/2^{n+1}\). \(\square \)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bastounis, A. et al. (2023). The Boundaries of Verifiable Accuracy, Robustness, and Generalisation in Deep Learning. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14254. Springer, Cham. https://doi.org/10.1007/978-3-031-44207-0_44