1 Introduction

Adversarial examples are one of the most interesting topics in current computer vision research. They were first described in [1], which pointed out how small perturbations of the input pixels can dramatically change the prediction of deep learning models. The potential risk of this phenomenon is that these perturbations can remain invisible to humans, thus confusing neural networks while presenting no noticeable difference to a human operator. As an example, see Fig. 1. For this reason, it is important to develop a mechanism that detects these inputs, preventing them from progressing further in critical system workflows.

Fig. 1 Adversarial examples. First row, from left to right: original image (classified as a whistle), adversarial noise and adversarial image after adding noise (classified as a screw). Second row, from left to right: original image (classified as a pen), adversarial noise and adversarial image after adding noise (classified as a revolver)

On the one hand, there are the so-called “attack” methods, which aim to craft malicious perturbations that can potentially put models at risk. This can be expressed formally, as shown in Eq. (1). To obtain an adversarial example \(X'\), an unperturbed image X is modified with a computed noise \(\delta X\), which should be the smallest possible perturbation that changes the predicted class f(X) of the original image.

$$\begin{aligned} X' = X + \arg \min _{\delta X} \{||\delta X|| \;\; s.t. \;\; f(X+\delta X) \ne f(X) \} \end{aligned}$$
(1)
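To make Eq. (1) concrete, the following minimal sketch (an illustration, not part of the original method) checks the adversarial condition \(f(X+\delta X) \ne f(X)\) and searches for the smallest scaling of a fixed noise direction that flips the prediction of a PyTorch classifier `model`; the names `model`, `x`, `direction` and `scales` are assumptions. Real attacks compute the direction itself, typically from gradients, as described in Sect. 2.2.

```python
import torch

@torch.no_grad()
def predicted_class(model, x):
    """Return the class index predicted for a single image tensor x of shape (C, H, W)."""
    return model(x.unsqueeze(0)).argmax(dim=1).item()

@torch.no_grad()
def smallest_flipping_scale(model, x, direction, scales):
    """Find the smallest scaling of a fixed noise direction that changes f(x),
    mirroring the minimization in Eq. (1) restricted to one direction."""
    original = predicted_class(model, x)
    for s in scales:                          # scales assumed sorted in increasing order
        x_adv = (x + s * direction).clamp(0.0, 1.0)
        if predicted_class(model, x_adv) != original:
            return s, x_adv                   # first (smallest) scale that flips the prediction
    return None, None                         # no adversarial example found along this direction
```

Attack methods differ mainly in how they choose the direction and how efficiently they minimize its magnitude.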

On the other hand, research efforts are also put into “defense” methods, which study the properties of input images in order to detect these perturbations and flag potentially malicious images.

This work aims to combine the machine learning domain with the chaos theory domain. For this reason, an introduction to the latter is necessary. Chaos theory is considered one of the most important branches of nonlinear dynamics. These fields have improved the understanding of complex systems, from weather patterns and communications to biological processes. It is therefore useful to introduce nonlinear dynamics in order to better understand how chaos theory has developed. Unlike linear systems, where cause and effect are proportional, nonlinear systems exhibit complex, often unexpected patterns. This field analyzes the interactions among the different components of a system to understand its emergent behavior. For example, the works [2, 3] represent significant advances in understanding the complex behavior of advanced composite materials and nanoscale structures under various loading and environmental conditions. They provide essential insights for the design and optimization of new materials and structures in engineering applications, highlighting the ongoing innovations and analytical methods in the field.

Chaos theory, in turn, is more focused on the analysis of a system's sensitivity to initial conditions. When unpredictable behavior is observed even in a deterministic system, we refer to it as chaotic. The same behavior can be observed in the machine learning domain with the concept of adversarial examples introduced above: small perturbations in the initial conditions (the input image) drive the network into a chaotic state (unpredictable output). For this reason, the combination of both fields seems natural. As a first approach to this integration, this work proposes the application of a chaos theory metric, the Lyapunov exponents (LEs), to the optimization of the training process of a neural network architecture, an AutoEncoder. The main contributions of this work are as follows:

  • An analysis of the implications of chaos theory for neural network inputs is performed through LEs.

  • A novel loss function combining common divergence metrics and chaos theory metrics is proposed.

  • An adversarial AutoEncoder is trained with both approaches, showing how the proposal increases the adversarial example detection rate.

2 Related work

2.1 Chaos theory

Chaos theory is applied in multiple domains. For example, in the medical field, we can find works such as [4, 5], in which chaos theory was helpful to discover candidate random points for jumping from a low-fitness region to a high-fitness region. On the one hand, [4] employed these points to train a classifier using a predator–prey adaptive-inertia chaotic particle swarm optimization algorithm. This classifier was used to develop a novel method able to detect alcohol use disorder, with the help of Hu moment invariants to extract features from brain slices. The algorithm was also prepared to be installed on medical robots. On the other hand, [5] focused on the detection of abnormal breast tissue in digital mammography. For this purpose, the authors proposed a novel chaotic adaptive real-coded biogeography-based optimization to train a multilayer perceptron classifier. To feed this classifier, they used fractional Fourier entropy to extract global features from preprocessed images, selecting 23 distinguishing features.

Another interesting domain is communications. For example, [6] uses the Lyapunov direct method to bound signals and make their synchronization error converge to zero. This leverages a brain emotional learning-based intelligent controller (BELBIC) to develop a secure communication system. In this setting, chaos theory becomes more suitable than other state-of-the-art techniques, such as neuro-fuzzy systems. At the same time, the synchronization performance of controller and observer increases, due to the mitigation of the chaotic signals at the transmitter and receiver, respectively. In [7], there is another interesting application of chaos theory in the communications domain. With a similar approach, that work proposes an adaptive controller for chaos synchronization using quantum neural networks. Using the Lyapunov theorem, the proposed system is able to estimate the uncertainties caused by external disturbances from environmental conditions. As a result, the synchronization procedure is performed with negligible error. This method could also be applied in cryptography.

2.2 Adversarial attacks

To deceive a specific neural network model, various strategies can be employed. The assumptions made about the attacker are referred to as the threat model. The threat model is white-box if the attacker has access to all parameters of the neural network and to the distribution of the training images. Alternatively, if the attacker only has access to the model’s prediction results (via queries, for example), the threat model is referred to as black-box. Although the latter is more restrictive, there are attacks that can correctly estimate the distribution of classes in order to generate a gradient direction in which a perturbed image is incorrectly predicted.

In this study, seven attack methods from both threat models are compared in order to evaluate the effectiveness of the proposed defense mechanism. To achieve that, a set of adversarial examples is crafted with each method. After that, the original defense method and our proposed improved defense method are trained on those samples, and the adversarial detection rate is checked in each case.

The Fast Gradient Sign Method (FGSM) [1] was the seminal attack that demonstrated the influence of adversarial examples in deep learning. It computes the gradient of a loss function with respect to the network’s input. During training, the derivatives of the loss function are computed with respect to the parameters of the network in the so-called backpropagation mechanism; computing them instead with respect to the input image gives a pixel perturbation that, when scaled by a small epsilon value and added to that input image, is able to cause the output class prediction to drift away from the original ground truth.
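As an illustration, the following is a minimal FGSM sketch in PyTorch, assuming a differentiable classifier `model` and inputs scaled to [0, 1]; the epsilon value is a placeholder (the experiments in Sect. 4.2 use values between 0.01 and 0.25).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.05):
    """Craft FGSM adversarial examples: one signed-gradient step of size epsilon.
    x: input batch (N, C, H, W) in [0, 1]; y: ground-truth labels (N,)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)     # loss of the current prediction
    loss.backward()                         # gradients w.r.t. the *input*, not the weights
    x_adv = x + epsilon * x.grad.sign()     # move each pixel in the direction that increases the loss
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in the valid range
```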

In this work, additional variants of the FGSM attack are also considered. These strategies were designed to counteract later defenses and to improve attack performance across a wider range of architectures. For example, in Projected Gradient Descent (PGD, [8]), the perturbation is projected onto an Lp-ball at each iteration. The ball can be based on any L-norm distance with a specified radius. Consequently, the crafted perturbations can be kept small while remaining within the range of the input data, so they are more difficult to detect by both defense methods and humans.
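A minimal sketch of this iterative scheme, again an illustrative assumption rather than the exact configuration used here, with projection onto the L-infinity ball (other norms follow the same pattern):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.05, alpha=0.01, steps=10):
    """Iterative signed-gradient steps with projection onto the L-infinity ball
    of radius epsilon around the original image (a common PGD variant)."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                 # gradient ascent step
        x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)   # project back onto the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)                                # stay in the valid pixel range
    return x_adv.detach()
```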

Another attack method, DeepFool [9], was one of the first to achieve good results on complex datasets such as ImageNet. The method estimates the hyperplane that bounds the region in which the classifier predicts the same class. Using this information, the attack attempts to cross this border, directly calculating a perturbation that induces an adversarial example. Another approach, the Elastic-Net Attack (EAD) [10], formulates adversarial example generation as a regularized optimization problem. The objective of this method is to find a perturbation of minimum Lp distance (usually L1, but it can be extended to L2). Others, like Universal [11], compute a single perturbation pattern that is applied to all inputs, finding a general modification that, unlike most attack algorithms, does not depend on a specific image. To achieve that, the method analyzes the geometric correlations among the high-dimensional decision boundaries of a given neural network classifier.

Some of these methods were able to break the most popular defenses of their time. For example, the authors of EAD claimed that defensive distillation [12] (a popular defense) lost much of its effectiveness when defending against adversarial examples crafted with their method. Moreover, they also claimed that using adversarial examples from EAD in addition to Carlini & Wagner (C&W) [13] adversarial examples increased the robustness of adversarial training, showing that the field is in constant evolution.

Finally, regarding the black-box threat model, two methods are considered in our experiments. As explained before, these methods are able to craft adversarial examples querying the model to compute slight perturbation steps. The first is Boundary attack [14] (also known as Decision-Based Attack). This method starts with a large perturbation that ensures the adversarial behavior, and then it is gradually reduced while keeping the adversarial class. As an evolved version of the previous method, we also consider the HopSkipJump [15] attack (also known as Boundary Attack++ in early versions). This method optimizes the number of queries thanks to a better estimation of the decision gradient, using binary information about the boundary.
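The following heavily simplified sketch conveys the decision-based idea: it only assumes label access through a hypothetical `predict` function, and it omits the random orthogonal steps and adaptive step-size schedules of the actual algorithms.

```python
import torch

@torch.no_grad()
def boundary_attack_sketch(predict, x, y_true, steps=1000, step_size=0.1):
    """Decision-based attack sketch in the spirit of the Boundary attack [14]:
    start from a strongly perturbed adversarial point and move it toward the
    original image while the prediction stays adversarial (black-box setting)."""
    x_adv = torch.rand_like(x)                       # large initial perturbation (random image)
    if predict(x_adv) == y_true:
        return None                                  # the random start must already be adversarial
    for _ in range(steps):
        candidate = x_adv + step_size * (x - x_adv)  # small step toward the original image
        candidate = candidate.clamp(0.0, 1.0)
        if predict(candidate) != y_true:             # keep the step only if it stays adversarial
            x_adv = candidate
        else:
            step_size *= 0.5                         # otherwise shrink the step and retry
    return x_adv
```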

The aforementioned attack methods are implemented in our work with the parameter settings detailed in Table 1.

Table 1 Attacks and parameter setup in this work

2.3 Adversarial detection

In the state-of-the-art, different methods have been proposed for adversarial attacks and defenses. On the attack side, most of the effort is put into fooling the network with the least perturbation possible, while on the defense side, the objective is to detect as many adversarial images as possible or to reduce their impact on model accuracy. For the latter purpose, different approaches have been proposed. For example, one of the most common is to retrain the neural network models with additional images that include adversarial examples crafted from the legit training dataset images. This sort of “data augmentation with adversarial examples” is the so-called adversarial training [16] technique. It has been proven effective at increasing model robustness against unseen adversarial examples, even those crafted with methods different from the ones used during training. Moreover, by identifying the features that differ from those in the training data, the authors of [17, 18] addressed false data injection attacks utilizing the AutoEncoder technique. Samuel et al. [19] proposed a classification model for multivariate time series in which a gradient adversarial transformation network is combined with adversarial AutoEncoders. Following the same trend, [20] employs AutoEncoders to perform subset scanning over their activations, looking for anomalous patterns related to adversarial noise.

Some works have developed other approaches based on different kinds of classifiers. For example, [21] builds an ensemble of maximum-margin softmax classifiers to learn high-density features that discriminate between adversarial and legitimate samples. In [22], a multiobjective multifidelity Bayesian optimization algorithm is proposed, aimed at designing fault classification models that can solve the security–accuracy trade-off. It is also worth mentioning the work described in [23], since neural network activations are employed to detect adversarial examples, working in the Fourier domain instead of the image domain.

Other authors apply some preprocessing before the input image is fed to the network [24], so that the potential malicious effect of the perturbation is prevented. Adversarial attacks rely on carefully modified pixels that shift the values of activations throughout the inner layers. Therefore, a method that modifies the majority of pixels within a small range, while preserving the visual information, can also remove the specific pixel values that were causing the adversarial behavior. One solution currently used to combat adversarial attacks is denoising image classifiers [25]. In that work, the authors devised a method to restore the ground truth from noisy data damaged by malicious perturbations. Also, a block-matching convolutional neural network (CNN) [26] for image denoising was proposed as a preprocessing module that does not require the classifier to be retrained.

Another interesting approach is to apply chaos theory as an analogy for the perturbations of adversarial examples. This was first proposed by the seminal work [27] and further developed in more recent studies such as [28,29,30]. There, adversarial perturbations are considered chaos points in different parts of the network, such as the input image itself or the inner features of the network layers. As a result, it is possible to apply properties and metrics from this theory, such as LEs, to detect potential adversarial examples. In these works, LEs have been proven to be a useful metric for detecting adversarial examples. For this reason, it is expected that their use can be leveraged in a defense method, obtaining better results than other metrics employed before.

As a basis, the AutoEncoder defense method proposed by [31] is trained on clean and adversarial images. When a new input image is fed in, the distribution it belongs to is detected. For this purpose, the original method employs a Kullback–Leibler (KL) divergence metric. In this context, the specific contribution of this paper can be summarized as follows: a novel defense method based on a customized loss metric that relies not on divergence but on chaos theory, specifically on LEs. For this purpose, the loss calculates these exponents on both a clean image and its adversarial example, trying to minimize the difference between them, considering that the lower the value, the more the chaos is reduced by the defense.

3 Methodology

Given the potential of LEs for detecting the noise related to adversarial examples, the idea in this work is to build a defense method that turns this potential into an active mechanism to prevent the consequences of adversarial examples. For this purpose, the defense framework proposed by [31] is used as a basis. As shown in Fig. 2, an AutoEncoder is trained on two main subsets: legit images and adversarial examples.

Fig. 2 Diagram of the proposed defense, showing the workflow to classify and correct a potential adversarial example, based on [31]

This method scores how well a given sample matches each distribution, expressed as a probability of being a potential adversarial example. Moreover, in the middle of the AutoEncoder structure, the vector representation can be corrected to match the legit distribution. In consequence, adversarial perturbations can be corrected, preventing their effects when the image is fed to the network.

The main difference with the aforementioned methods [28,29,30] is that the adversarial examples are not only detected, but the AutoEncoder is also able to reconstruct the potential adversarial example into an image in which its effect has been eliminated. Depending on the adversarial distribution threshold, a sample is classified as regular or as a potential adversarial example. This lets the algorithm choose between the original image and the image reconstructed by the AutoEncoder, minimizing the impact of the adversarial example on the results of the image classification model.
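The decision logic can be sketched as follows; the `adversarial_score` function, the AutoEncoder interface and the threshold value are assumptions for illustration, since the concrete implementation follows [31].

```python
import torch

@torch.no_grad()
def detect_and_correct(autoencoder, adversarial_score, x, threshold=0.5):
    """Workflow sketch of the defense in Fig. 2: score how likely the input is
    adversarial and, above a threshold, replace it with the AutoEncoder
    reconstruction whose latent representation has been corrected toward the
    legit distribution."""
    score = adversarial_score(x)       # probability of belonging to the adversarial distribution
    if score < threshold:
        return x, False                # treated as a regular image, passed through unchanged
    x_reconstructed = autoencoder(x)   # reconstruction with the adversarial effect reduced
    return x_reconstructed, True       # the classifier receives the corrected image
```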

Originally, the architecture of the method employs a KL divergence metric as the loss function when training and optimizing the AutoEncoder detector. In our implementation, this is substituted by a Lyapunov-based formulation, referred to as the Lyap loss, which better captures the potential perturbations of adversarial examples.

Indeed, LEs are a crucial tool for understanding the behavior of dynamical systems, particularly in the context of chaos theory [32]. The maximum Lyapunov exponent (MLE) is a critical measure in the study of dynamical systems, particularly for identifying the chaotic behavior introduced by small perturbations. The MLE quantifies the average rate at which nearby trajectories in the phase space diverge or converge over time. A positive MLE is a strong indication of chaos, demonstrating that small differences in initial conditions can lead to exponentially divergent outcomes. Mathematically, the MLE is defined in Eq. (2).

$$\begin{aligned} \lambda _{max} = \displaystyle \lim _{t \rightarrow \infty } \lim _{\delta _x(0) \rightarrow 0}\frac{1}{t}\ln \frac{|\delta _x(t)|}{|\delta _x(0)|} \end{aligned}$$
(2)

where \(\delta _x(0)\) is the initial separation between two nearby trajectories, and \(\delta _x(t)\) is the separation at time t. A positive \(\lambda _{max}\) indicates that \(\delta _x(t)\) grows exponentially, signifying chaotic behavior. In contrast, a zero or negative MLE indicates neutral stability or convergence, respectively, implying non-chaotic behavior. Quantitatively, two trajectories in phase space with an initial separation vector \(\delta (0)\) diverge at a rate given by Eq. (3).

$$\begin{aligned} ||\delta (t)|| \approx ||\delta (0)||e^{\lambda t} \end{aligned}$$
(3)

where \(\lambda\) is the LE, and the explanation of this chaos quantifier is provided in Fig. 3.

Fig. 3 Explanation of Lyapunov exponent (LE)
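To make Eqs. (2) and (3) concrete, the following sketch estimates the MLE of the logistic map, a classical one-dimensional chaotic system, by averaging the local stretching rate along a trajectory. This is a textbook illustration of the definition, not the estimator applied to images in this work.

```python
import numpy as np

def logistic_map(x, r=4.0):
    return r * x * (1.0 - x)

def largest_lyapunov_logistic(x0=0.2, r=4.0, n_steps=10000, n_transient=100):
    """Estimate the MLE of the logistic map by averaging the log of the local
    stretching rate |f'(x)| = |r (1 - 2x)| along a trajectory (Eqs. 2-3)."""
    x = x0
    for _ in range(n_transient):              # discard the transient part of the trajectory
        x = logistic_map(x, r)
    log_stretch = []
    for _ in range(n_steps):
        log_stretch.append(np.log(abs(r * (1.0 - 2.0 * x))))
        x = logistic_map(x, r)
    return float(np.mean(log_stretch))        # positive value -> nearby trajectories diverge

print(largest_lyapunov_logistic())            # approximately 0.693 (= ln 2)
```

For r = 4 the estimate converges to ln 2, a positive exponent that signals chaos, whereas a negative value would indicate convergence of nearby trajectories, as discussed above.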

The proposed loss function is defined in Eq. (4).

$$\begin{aligned} Loss = Loss_{KL} + 10^{\max _{i}(\lambda _{i}^{L_{k}}-\lambda _{i}^{A_{k}})} \end{aligned}$$
(4)

where \(\lambda _{i}^{L_{k}}\) and \(\lambda _{i}^{A_{k}}\) represent the LEs computed from the \(k_{th}\) legit image (\(L_{k}\)) and the corresponding adversarial example (\(A_{k}\)), respectively. Here, \(k (1 \le k \le N)\) indexes the images in the dataset, with N the total number of images. The four largest LEs are calculated for each pair, grouped as \((\lambda _{1}^{L_{k}}, \lambda _{2}^{L_{k}}, \lambda _{3}^{L_{k}}, \lambda _{4}^{L_{k}})\) and \((\lambda _{1}^{A_{k}}, \lambda _{2}^{A_{k}}, \lambda _{3}^{A_{k}}, \lambda _{4}^{A_{k}})\), respectively. Then, the differences between the corresponding exponents are calculated: positive values mean that the perturbation is being reduced, and the opposite otherwise. After that, only the maximum difference is kept, to detect the presence of the potentially most chaotic point, wherever it is found. Finally, this value is used as the exponent of 10, so the loss grows exponentially when chaos is found and shrinks in the same way when chaos is being reduced during the training process. As a result, the defined loss makes the AutoEncoder converge to a point in which it is able to reduce the chaoticity of any given sample. In consequence, once processed by the trained model, an image that induces less chaoticity will be classified with better accuracy, preventing the malicious effect of the adversarial attack.
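A minimal sketch of Eq. (4) follows. The helper `lyapunov_exponents`, assumed to return the four largest LEs of an image as a tensor, and the baseline KL term `loss_kl` are illustrative assumptions; the paper does not prescribe this exact interface.

```python
import torch

def lyap_loss(loss_kl, legit_image, adv_image, lyapunov_exponents):
    """Sketch of Eq. (4): add to the KL term an exponential penalty driven by the
    largest difference between the LEs of a legit image and its adversarial pair."""
    le_legit = lyapunov_exponents(legit_image)   # (lambda_1 ... lambda_4) for L_k
    le_adv = lyapunov_exponents(adv_image)       # (lambda_1 ... lambda_4) for A_k
    exponent = torch.max(le_legit - le_adv)      # keep only the most chaotic component
    return loss_kl + 10.0 ** exponent            # 10^max(...) term of Eq. (4)
```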

The main novelty of this approach is to consider the neural network as a chaotic system, something not considered in other detection methods, including the baseline AutoEncoder.

4 Experiments

4.1 Datasets

This work employs three different datasets to conduct the experiments. Therefore, it is possible to contrast the obtained results under a wider range of conditions, covering, for example, grayscale or color images, of smaller or larger size, with simple patterns or real-world objects.

The first dataset is MNIST, which was published in [33]. It is one of the most common benchmarks across different domains of computer vision research. It consists of white handwritten digits over a black background, as seen in Fig. 4. Despite its simplicity, it is a very useful dataset to study in detail the specific features that are affected by a given attack or defense method. Specifically, this dataset contains 60,000 images for training and 10,000 for testing, with a size of 28 × 28 pixels each.

Fig. 4 Classes of MNIST dataset

The next dataset is a variant of the previous one, called Fashion-MNIST, developed by the Zalando Company [34]. The size and number of images are the same, while the objects are related to clothing, with thumbnails of different types, such as shirts, bags, coats or trousers (see Fig. 5). In comparison with the regular MNIST, these images contain more texture details and, as a result, more complexity.

Fig. 5 Classes of Fashion-MNIST dataset

Finally, the CIFAR-10 [35] dataset is also considered. Its images have a larger size (32 × 32 pixels) and are in color, with three RGB channels. The 10 classes represent common real-world objects from two domains: motor transport (automobile, truck, ship and airplane) and animal species (deer, frog, cat, dog, bird and horse). Some examples of these objects can be found in Fig. 6.

Fig. 6 Classes of CIFAR-10 dataset
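For reference, the three benchmarks can be loaded with torchvision as in the following sketch; the root path and the plain [0, 1] tensor transform are illustrative choices, not necessarily those of the original experiments.

```python
import torchvision
import torchvision.transforms as T

to_tensor = T.ToTensor()  # scales pixels to [0, 1]

# Training splits of the three benchmarks used in this work
mnist = torchvision.datasets.MNIST(root="data", train=True, download=True, transform=to_tensor)
fashion = torchvision.datasets.FashionMNIST(root="data", train=True, download=True, transform=to_tensor)
cifar10 = torchvision.datasets.CIFAR10(root="data", train=True, download=True, transform=to_tensor)

print(len(mnist), mnist[0][0].shape)      # 60000 images of size 1 x 28 x 28
print(len(fashion), fashion[0][0].shape)  # 60000 images of size 1 x 28 x 28
print(len(cifar10), cifar10[0][0].shape)  # 50000 images of size 3 x 32 x 32
```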

4.2 Results

With the attacks and datasets previously defined, the proposed method is compared in two versions: the first one employs the original loss based on the KL divergence metric, while the second one substitutes it with the variant based on LEs.

In Fig. 7, the Lyapunov spectrum for 50 legitimate images and the corresponding FGSM adversarial examples is displayed. We plot the MLE for \(\epsilon =0.05\), which significantly affects the computed value of the Lyap loss in Eq. (4). Note that \(\epsilon\) is a small, dimensionless hyperparameter used to control the perturbation level [36]. When the images are perturbed to deceive the network, the MLEs are observed to be positive. In contrast, the MLEs for legit images are negative. The histogram shows that the quantiles of the empirical distribution of \(\lambda _{L}\) for adversarial images are all positive, proving the importance of \(\lambda _{L}\) for perturbation detection.

Fig. 7 The largest LEs of legit images and FGSM adversarial examples of 50 MNIST images

The impact of different perturbation levels, from \(\epsilon =0.01\) to 0.25, on the estimated Lyap loss over 50 randomly selected images from the MNIST dataset, together with its average values, is displayed in Fig. 8. We can observe that the Lyap loss grows gradually as the perturbation level increases. LEs are a measure used in chaos theory to characterize the rate at which very close trajectories separate from one another; a positive MLE among all LEs indicates that the system is chaotic. In essence, we analyze the statistical characteristics of the Lyapunov spectra in the non-chaotic and chaotic phases by computing the Lyap loss from the MLEs of the legit and adversarial examples. Considering a threshold learned during the training process of the AutoEncoder, it is possible to discern whether the loss values belong to a legit or an adversarial input.

Fig. 8 Impact of various perturbation levels on the estimated Lyapunov loss over 50 randomly chosen images from the MNIST dataset. The green circle line indicates the average \((\mu )\) Lyapunov loss value for each epsilon, while the red cross line and the blue square line indicate plus and minus one standard deviation \((\sigma )\), respectively. Best viewed in color
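The trend in Fig. 8 can be reproduced with a sketch such as the following, which reuses the illustrative `fgsm_attack` from Sect. 2.2 and assumes a hypothetical `lyap_loss_fn(x, x_adv)` wrapping Eq. (4) for a single image pair.

```python
import numpy as np

def lyap_loss_vs_epsilon(model, images, labels, lyap_loss_fn,
                         epsilons=np.arange(0.01, 0.26, 0.01)):
    """Average Lyap loss (and its spread) over a set of images for increasing
    FGSM perturbation levels, mirroring the mu and sigma curves of Fig. 8."""
    means, stds = [], []
    for eps in epsilons:
        x_adv = fgsm_attack(model, images, labels, epsilon=float(eps))
        losses = [float(lyap_loss_fn(x, xa)) for x, xa in zip(images, x_adv)]
        means.append(np.mean(losses))
        stds.append(np.std(losses))
    return np.array(means), np.array(stds)
```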

For the MNIST dataset, the Lyapunov variant performs better in all cases, with an increment from 9% to 18%, despite also having larger standard deviation values. The results for the MNIST dataset are shown in Table 2 and visualized in Fig. 9.

Table 2 MNIST dataset results
Fig. 9 Compared results of the original AutoEncoder architecture and the one proposed in this work for the MNIST dataset

For the Fashion-MNIST dataset, the Lyapunov loss in the defense obtains better results than the divergence metric, with the greatest increment for the FGSM attack (up to +37% detection rate). However, the Universal attack is the only case in the whole experimentation in which this method performs worse. The results for the Fashion-MNIST dataset are shown in Table 3 and visualized in Fig. 10.

Table 3 Fashion-MNIST dataset results
Fig. 10 Compared results of the original AutoEncoder architecture and the one proposed in this work for the Fashion-MNIST dataset

For the CIFAR-10 dataset, the Lyapunov version achieves an average detection rate increase of more than 10%, with low standard deviation values, except for the Universal attack. Specifically, the largest increments are observed for the gradient-based attacks, both FGSM and PGD. The results for the CIFAR-10 dataset are shown in Table 4 and visualized in Fig. 11.

Table 4 CIFAR-10 dataset results
Fig. 11 Compared results of the original AutoEncoder architecture and the one proposed in this work for the CIFAR-10 dataset

5 Conclusion and future work

This work shows the application of the chaos theory domain to adversarial example detection. First, a given adversarial defense based on an AutoEncoder and a divergence metric is tested. With the addition of LEs, it is possible to extract more information during the training process of the AutoEncoder. This enhancement makes it possible to obtain promising results for small-size datasets, such as MNIST or CIFAR-10. Specifically, the results obtained in comparison with the original method show a performance increase that ranges from 9% up to 37% in the detection rate of adversarial examples. Given the wide variety of attack methods considered, this shows that the performance is increased consistently.

We have shown that a chaos-based metric is useful for adversarial example detection. For this purpose, we have chosen to modify the training loss function of a deep learning AutoEncoder. The experiments show an improvement in the overall results with respect to the baseline loss function. While this does not constitute a new state-of-the-art detection method, it represents a milestone in the further application of methods derived from chaos theory to this machine learning problem.

For example, as future work, further refinement of the employed formula would be required for larger datasets and networks, such as ImageNet, in order to better capture the spatial information of perturbations in such a high-dimensional problem.