In this work, we assess the theoretical limitations of determining guaranteed stability and accuracy of neural networks in classification tasks. We consider the classical distribution-agnostic framework and algorithms that minimise empirical risk, potentially subject to weight regularisation. We show that there is a large family of tasks for which computing and verifying ideal stable and accurate neural networks in the above settings is extremely challenging, if at all possible, even when such ideal solutions exist within the given class of neural architectures.
1.1 Proof of Theorem 1
1.1.1 Proof of Statement (i) of the Theorem.
The proof consists of three parts. The first part introduces a family of distributions satisfying the separability requirement (2) and establishes relevant statistical properties of samples drawn from these distributions. The second part constructs a neural network which minimises the empirical loss function \(\mathcal {L}\) for any loss function \(\mathcal {R}\in \mathcal{C}\mathcal{F}_{\textrm{loc}}\) and which successfully generalises beyond the training (and test/validation) data. The final part shows that, with high probability, this network is unstable on nearly half of the data (for \(s+r\) reasonably large).
Proof of statement (i), part 1. Consider the n-dimensional hypercube \(\textrm{Cb}(2, 0) = [-1,1]^n\). Within this cube, we may inscribe the unit ball \(\mathbb {B}_{n}\) (the surface of which touches the surface of the outer cube at the centre of each face), and within this ball we may, in turn, inscribe the inner cube \(\textrm{Cb}(2/\sqrt{n}, 0)\), each vertex of which touches the surface of the ball and whose faces are parallel to the faces of the cube \(\textrm{Cb}(2, 0)\). For any \(\varepsilon \in (0,\sqrt{n}-1)\), the cube \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0)\) satisfies \(\textrm{Cb}(2/\sqrt{n}, 0) \subset \textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon ), 0) \subset \textrm{Cb}(2, 0)\).
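The nesting of the cubes can be checked numerically in small dimensions. The following is a minimal sketch, assuming that \(\textrm{Cb}(a, 0)\) denotes the cube \([-a/2, a/2]^n\) as the identity \(\textrm{Cb}(2,0)=[-1,1]^n\) suggests; the function name `check_nesting` is hypothetical and introduced only for illustration.

```python
import itertools
import math


def check_nesting(n, eps):
    """Check, for dimension n, that the vertices of Cb(2/sqrt(n), 0) lie on the
    unit sphere and that the three cube side lengths are strictly ordered.

    Assumption: Cb(a, 0) is the cube [-a/2, a/2]^n centred at the origin.
    """
    if not 0 < eps < math.sqrt(n) - 1:
        raise ValueError("eps must lie in (0, sqrt(n) - 1)")
    inner = 2 / math.sqrt(n)       # side of the inscribed cube
    enlarged = inner * (1 + eps)   # side of the enlarged cube
    outer = 2.0                    # side of the outer cube
    # Vertices of the inner cube: (q_1, ..., q_n) / sqrt(n) with q_k in {-1, 1}.
    vertices = [
        tuple(q / math.sqrt(n) for q in signs)
        for signs in itertools.product((-1, 1), repeat=n)
    ]
    on_sphere = all(
        math.isclose(math.sqrt(sum(c * c for c in v)), 1.0) for v in vertices
    )
    nested = inner < enlarged < outer
    return on_sphere, nested, len(vertices)
```

Since all three cubes are axis-aligned and centred at the origin, comparing side lengths is enough to establish the inclusions.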
Let \( V = \{v_i\}_{i=1}^{2^n} \) denote the set of vertices of \(\textrm{Cb}(2/\sqrt{n}, 0)\) with an arbitrary but fixed ordering, and note that each \(v_i\) may be expressed as \(\frac{1}{\sqrt{n}}(q_1,\dots ,q_n)\) with each component \(q_k\in \{-1,1\}\). The choice of \(\varepsilon \) ensures that the set
is non-empty and that \( \min _{x\in \mathcal {J}_0, \ y\in V} \Vert x-y\Vert > \frac{\varepsilon }{\sqrt{n}}\).
Consider a family of distributions \(\mathcal {F}_1\subset \mathcal {F}\) which are supported on \(\mathbb {S}_{n-1}(1,0)\times \{0,1\}\), with the \(\sigma \)-algebra \(\Sigma _{\mathbb {S}}\times \{0\}\cup \Sigma _{\mathbb {S}}\times \{1\}\), where \(\Sigma _{\mathbb {S}}\) is the standard \(\sigma \)-algebra on the sphere \(\mathbb {S}_{n-1}\) with the topology induced by the arclength metric.
We construct \(\mathcal {F}_1\) as those distributions \(\mathcal {D}_\delta \in \mathcal {F}\) such that
The existence of an uncountable family of distributions \(\mathcal {D}_\delta \) satisfying (7)–(9) is ensured by the flexibility of (9) and the fact that \(\mathcal {J}_0\) contains more than a single point (consider e.g. the family of all delta-functions supported on \(\mathcal {J}_0\) and scaled by 1/2). This construction moreover ensures that any \(\mathcal {D}_\delta \in \mathcal {F}_1\) also satisfies the separation property (2) with \(\delta \le \frac{\varepsilon }{\sqrt{n}}\).
Let \(\mathcal {M}=\mathcal {T}\cup \mathcal {V} = \{(x_k, \ell _k)\}_{k=1}^{M}\) denote the (multi-)set corresponding to the union of the training and validation sets independently sampled from \(\mathcal {D}_{\delta }\), where \(M=s+r=|\mathcal {M}|\). Let \(z:\mathbb {R}^n\times \{0,1\} \rightarrow \{0,1\}\) be the trivial function mapping a sample \((x,\ell )\) from \(\mathcal {D}_\delta \) into \(\{0,1\}\) by \(z(x,\ell )=\ell \). This function defines new random variables \(Z_k = z(x_k, \ell _k) \in [0,1]\) for \(k = 1, \dots , M\), with expectation \(E(Z_k)=\frac{1}{2}\).
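The Hoeffding inequality ensures that, for any \(q\in (0,1/2)\),
\[ P\left( \frac{1}{2} - \frac{1}{M}\sum _{k=1}^{M} Z_k > q \right) \le \exp (-2q^2M), \]
and hence, with probability greater than or equal to
\[ 1 - \exp (-2q^2M), \tag{10} \]
the number of data points \((x,\ell )\) with \(\ell =1\) in the sample \(\mathcal {M}\) is at least
\[ \left\lfloor \left( \frac{1}{2} - q\right) M \right\rfloor . \tag{11} \]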
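To give a feeling for the scale of this guarantee, the sketch below evaluates the standard one-sided Hoeffding bound for i.i.d. variables taking values in \([0,1]\) with mean 1/2, and compares it with a simple simulation. The function names and the choice of fair-coin labels are assumptions made for illustration only.

```python
import math
import random


def hoeffding_guarantee(M, q):
    """Return (lower bound on the probability, guaranteed number of label-1 points).

    Uses the one-sided Hoeffding bound P(1/2 - mean > q) <= exp(-2 q^2 M)
    for M i.i.d. variables in [0, 1] with expectation 1/2.
    """
    prob = 1.0 - math.exp(-2.0 * q * q * M)
    count = math.floor((0.5 - q) * M)
    return prob, count


def empirical_frequency(M, q, trials=500, seed=0):
    """Fraction of simulated samples containing at least floor((1/2 - q) M) ones.

    Assumption: labels are modelled as independent fair coin flips.
    """
    rng = random.Random(seed)
    threshold = math.floor((0.5 - q) * M)
    hits = 0
    for _ in range(trials):
        ones = sum(rng.random() < 0.5 for _ in range(M))
        if ones >= threshold:
            hits += 1
    return hits / trials
```

The bound is conservative: in simulations the observed frequency is typically well above the guaranteed probability.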
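Proof of statement (i), part 2. Let \(\{e_1,\dots ,e_n\}\) be the standard basis in \(\mathbb {R}^n\). Consider the following set of 2n inequalities:
\[ (x,e_i) - \frac{1}{\sqrt{n}} \le 0, \qquad -(x,e_i) - \frac{1}{\sqrt{n}} \le 0, \qquad i=1,\dots ,n. \tag{12} \]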
Any function defined on \([-1/\sqrt{n},1/\sqrt{n}]^n\) (or which contains \([-1/\sqrt{n},1/\sqrt{n}]^n\) in the domain of its definition) and which returns 1 for x satisfying (12) and 0 otherwise, minimises the loss \(\mathcal {L}\) on \(\mathcal {T}\). It also generalises perfectly well on any \(\mathcal {V}\). Hence a network implementing such a function shares the same properties.
Pick a function \(g_\theta \in \mathcal {K}_\theta \) and consider
It is clear that \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \le 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta ((x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) > 1/\sqrt{n}\). Similarly \(g_\theta (\theta ) - g_\theta (- (x,e_i) -1/\sqrt{n} + \theta ) = 0\) for \((x,e_i) \ge - 1/\sqrt{n}\), and \(g_\theta (\theta ) - g_\theta (-(x,e_i) - 1/\sqrt{n} + \theta ) < 0\) for \((x,e_i) < - 1/\sqrt{n}\). Hence, the function f given by
is exactly 1 only when all inequalities (12) hold true, and is zero otherwise. We may therefore conclude that
Observe now that (13) is a two-layer neural network with 2n neurons in the hidden layer and a threshold output. This core network can be extended to any larger size without changing the map f by propagating the argument of \(\textrm{sign}(\cdot )\) in (13) to the next layers and appending the width as appropriate.
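One way to realise such a two-layer network in code is sketched below. It assumes ReLU as the activation \(g_\theta \) with \(\theta = 0\) (ReLU is constant for arguments at or below 0 and strictly increasing above, matching the behaviour used in the argument), and a threshold output that returns 1 exactly when the sum of the hidden-layer differences is non-negative. Both choices, and the name `make_network`, are illustrative assumptions rather than the construction of the text.

```python
import math


def relu(t):
    """ReLU activation, used here as an assumed example of g_theta with theta = 0."""
    return max(t, 0.0)


def make_network(n, half_width=None, g=relu, theta=0.0):
    """Build a two-layer network with 2n hidden neurons and a threshold output.

    The network returns 1 if |x_i| <= half_width for all i, and 0 otherwise.
    By default half_width = 1/sqrt(n), i.e. the cube Cb(2/sqrt(n), 0).
    """
    h = 1 / math.sqrt(n) if half_width is None else half_width

    def f(x):
        hidden = []
        for i in range(n):
            hidden.append(g(x[i] - h + theta))    # active only if x_i > h
            hidden.append(g(-x[i] - h + theta))   # active only if x_i < -h
        # Each difference g(theta) - g(.) is zero if the inequality holds
        # and strictly negative otherwise.
        s = sum(g(theta) - a for a in hidden)
        return 1 if s >= 0 else 0

    return f
```

For example, with `f = make_network(4)`, the vertex `[0.5, 0.5, 0.5, 0.5]` is classified as 1, while the same point scaled by 1.01 lies outside the cube and is classified as 0.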
Proof of statement (i), part 3. Let us now show that the map (13) becomes unstable for an appropriately-sized set \(\mathcal {M}\). Suppose that there are \(\lfloor (1/2-q)M\rfloor \) data points on which \(f(x)=1\); by construction, each is a vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\). According to (10), (11), the probability of this event is not zero. Let x be one such point and let \(\zeta \) be a perturbation sampled from the uniform distribution on the ball \(\mathbb {B}_n(\alpha /\sqrt{n}, 0)\) for some \(\alpha \in (0,\varepsilon /2)\). Then, with probability \(1 - \frac{1}{2^n}\), the perturbation \(\zeta \) is such that \(|f(x + \zeta ) - f(x)| = 1\). Indeed, this holds for any \(\zeta \) such that \(x + \zeta \notin \mathcal {I} = \textrm{Cb}(2/\sqrt{n}, 0) \cap \mathbb {B}_n(\alpha / \sqrt{n}, x)\), and the set \(\mathcal {I}\) is determined by the signs of exactly n linear inequalities which slice the ball \(\mathbb {B}_n(\alpha / \sqrt{n}, x)\) into \(2^n\) pieces of equal volume, so that \(\mathcal {I}\) has probability \(\frac{1}{2^n}\).
Finally, note that if there are at least m points \((u^1,\ell ^1),\dots ,(u^m,\ell ^m)\) in the set \(\mathcal {U}\) then the probability that all \(u^i+\zeta \), \(i=1,\dots ,m\) are outside of the corresponding intersections follows from the union bound, which completes the argument.
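The \(\frac{1}{2^n}\) estimate for a single vertex can be illustrated by a Monte Carlo experiment. The sketch below samples perturbations uniformly from a ball of radius \(\alpha /\sqrt{n}\) around the vertex \(\frac{1}{\sqrt{n}}(1,\dots ,1)\) and records how often the perturbed point stays inside \(\textrm{Cb}(2/\sqrt{n}, 0)\). The function names, the choice of vertex and the sampling scheme (Gaussian direction with radius scaled by \(U^{1/n}\)) are assumptions made for illustration.

```python
import math
import random


def sample_in_ball(n, radius, rng):
    """Draw a point uniformly from the n-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    r = radius * rng.random() ** (1.0 / n)  # radial law of the uniform distribution
    return [r * c / norm for c in g]


def stay_inside_fraction(n, alpha, trials=20000, seed=0):
    """Estimate the probability that x + zeta remains in Cb(2/sqrt(n), 0),
    where x is the vertex (1, ..., 1)/sqrt(n) and zeta is uniform in the
    ball of radius alpha/sqrt(n)."""
    rng = random.Random(seed)
    h = 1 / math.sqrt(n)
    x = [h] * n
    inside = 0
    for _ in range(trials):
        zeta = sample_in_ball(n, alpha / math.sqrt(n), rng)
        if all(abs(xi + zi) <= h for xi, zi in zip(x, zeta)):
            inside += 1
    return inside / trials
```

For n = 3 the estimate should be close to 1/8, so roughly seven out of eight perturbations change the value of a network that is 1 on the cube and 0 outside it.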
1.1.2 Proof of Statement (ii) of the Theorem.
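The argument used in the proof of statement (i), part 2, implies that there exists a network \(\tilde{f}\in \mathcal{N}\mathcal{N}_{\mathbf{{N}},L}\) such that \(\tilde{f}(x)\) takes value 1 when the inequalities
\[ (x,e_i) - \frac{1}{\sqrt{n}}\left( 1+\frac{\varepsilon }{2}\right) \le 0, \qquad -(x,e_i) - \frac{1}{\sqrt{n}}\left( 1+\frac{\varepsilon }{2}\right) \le 0, \qquad i=1,\dots ,n, \tag{14} \]
are satisfied, and zero otherwise. This network also minimises \(\mathcal {L}\) and generalises beyond the training and validation data.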
However, since for any \(\alpha \in (0,\varepsilon /2)\) the function \(\tilde{f}\) is constant within a ball of radius \(\alpha /\sqrt{n}\) around any data point \(x \in \mathcal {T} \cup \mathcal {V}\), we can conclude that \(\tilde{f}\) is insusceptible to the instabilities affecting f.
To show that there exists a pair of unstable and stable networks, f and \(\tilde{f}\) (the network \(\tilde{f}\) is stable with respect to perturbations \(\zeta : \ \Vert \zeta \Vert \le \alpha /\sqrt{n}\)), consider systems of inequalities (12), (14) with both sides multiplied by a positive constant \(\kappa >0\). Clearly, and regardless of the multiplication by \(\kappa \), these systems of inequalities define the cubes \(\textrm{Cb}(2/\sqrt{n},0)\) and \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2),0)\), respectively. Then
encodes the unstable network, and
encodes the stable one. These networks share the same weights but their biases differ in absolute value by \(\kappa \varepsilon /(2\sqrt{n})\). Given that \(\kappa \) can be chosen arbitrarily small or arbitrarily large, the statement now follows.
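The contrast between the two networks can be sketched numerically. The code below builds two cube-indicator networks with identical weights \(\pm \kappa e_i\) and biases \(-\kappa /\sqrt{n}\) and \(-\kappa (1+\varepsilon /2)/\sqrt{n}\), and counts how often each changes its output under random perturbations of a vertex of \(\textrm{Cb}(2/\sqrt{n}, 0)\). ReLU hidden units, the threshold output and all function names are illustrative assumptions, not the exact networks of the text.

```python
import math
import random


def cube_network(n, kappa, half_width):
    """Indicator of the cube [-half_width, half_width]^n, built from 2n ReLU
    units with weights +/- kappa * e_i and bias -kappa * half_width (assumed form)."""
    def f(x):
        s = 0.0
        for i in range(n):
            s -= max(kappa * x[i] - kappa * half_width, 0.0)
            s -= max(-kappa * x[i] - kappa * half_width, 0.0)
        return 1 if s >= 0.0 else 0
    return f


def count_flips(f, x, radius, trials, rng):
    """Count perturbations, uniform in a ball of the given radius, that change f."""
    n = len(x)
    base = f(x)
    flips = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in g))
        r = radius * rng.random() ** (1.0 / n)
        y = [xi + r * gi / norm for xi, gi in zip(x, g)]
        flips += f(y) != base
    return flips


def compare(n=3, eps=0.4, alpha=0.1, kappa=10.0, trials=5000, seed=0):
    """Return (flips of the unstable net, flips of the stable net, bias gap)."""
    h = 1 / math.sqrt(n)
    f_unstable = cube_network(n, kappa, h)
    f_stable = cube_network(n, kappa, h * (1 + eps / 2))
    x = [h] * n                      # a vertex of Cb(2/sqrt(n), 0)
    rng = random.Random(seed)
    radius = alpha / math.sqrt(n)    # requires alpha < eps / 2
    bias_gap = kappa * h * (1 + eps / 2) - kappa * h
    return (count_flips(f_unstable, x, radius, trials, rng),
            count_flips(f_stable, x, radius, trials, rng),
            bias_gap)
```

With the default parameters the stable network never changes its output, the unstable one does so for most perturbations, and the bias gap equals \(\kappa \varepsilon /(2\sqrt{n})\), which can be made as small or as large as desired by rescaling \(\kappa \).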
1.1.3 Proof of Statement (iii) of the Theorem.
Part a) of statement (iii) can be demonstrated following the same argument used to prove statement (i) by replacing the cube \(\textrm{Cb}(2/\sqrt{n}, 0)\) with \(\textrm{Cb}(\frac{2}{\sqrt{n}}(1+\varepsilon /2), 0)\).
Part b) follows by considering a slightly modified family of distributions \(\mathcal {D}_\delta \) in which the set V is replaced with
The probability that a single point from \(\hat{V}\) is not present in \(\mathcal {M}\) is \((1-1/2^{n+1})^M\). Since the samples are drawn independently, the probability that none of these points are present in \(\mathcal {M}\) is \((1-1/2^{n+1})^{Mk}\). The probability, however, that a point from \(\hat{V}\) is sampled is \(k/2^{n+1}\). \(\square \)
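To see how small or large these quantities are in practice, the sketch below simply evaluates the three expressions as stated above, using `log1p` to avoid loss of precision when \(1/2^{n+1}\) is tiny. The function names are hypothetical, and k is taken to be the number of points in \(\hat{V}\), as the expressions suggest.

```python
import math


def prob_point_absent(n, M):
    """Evaluate (1 - 1/2^(n+1))^M, the stated probability that a given point is absent."""
    p = 2.0 ** -(n + 1)
    return math.exp(M * math.log1p(-p))


def prob_all_absent(n, M, k):
    """Evaluate (1 - 1/2^(n+1))^(M k), the stated probability that all k points are absent."""
    p = 2.0 ** -(n + 1)
    return math.exp(M * k * math.log1p(-p))


def prob_point_sampled(n, k):
    """Evaluate k / 2^(n+1), the stated probability of sampling a point of the set."""
    return k * 2.0 ** -(n + 1)
```

For instance, with n = 30 and M equal to one million, `prob_point_absent` exceeds 0.999, so a given point of this kind is very unlikely to appear in the training or validation data.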
