SSTA: Salient Spatially Transformed Attack

Abstract

Extensive studies have demonstrated that deep neural networks (DNNs) are vulnerable to adversarial attacks, which poses a serious security risk to the further application of DNNs, especially for AI models deployed in the real world. Despite significant recent progress, existing attack methods still struggle to evade detection by the naked human eye, because the adversarial example (AE) is formulated largely in a noise-adding manner. This shortcoming significantly increases the risk of exposure and can cause an attack to fail. Therefore, in this paper, we propose the Salient Spatially Transformed Attack (SSTA), a novel framework for crafting imperceptible AEs, which enhances the stealthiness of AEs by estimating a smooth spatial transform metric on the most critical area of the image instead of adding external noise to the whole image. Extensive experiments indicate that, compared to state-of-the-art baselines, SSTA effectively improves the imperceptibility of the AEs while maintaining an attack success rate of nearly 100%.

Index Terms— Adversarial Attack, Imperceptible Adversarial Examples, Spatial Transform.

1 Introduction

Deep neural networks (DNNs) are susceptible to AEs, which are crafted by subtly perturbing a clean input [1], especially in computer vision (CV) tasks such as image recognition. The key to carrying out adversarial attacks on CV models is generating AEs with both a high attack success rate and high imperceptibility. Various methods have been proposed to build AEs; most of them craft AEs by optimizing noise and adding it to the image.

Although most existing attacks can obtain a high success rate by adding noise to the original image, they are not ideal in terms of imperceptibility and similarity, since the added perturbations are not harmonious with the clean image [2, 3]. To address these issues, researchers have proposed various approaches. Some methods generate AEs without adding noise, such as the spatial transform-based attack, which crafts AEs by changing the positions of specific pixels [4, 5]. Even though these methods make the adversarial perturbations more harmonious with their clean counterparts, the imperceptibility is still weak because they disturb the entire image. In most cases, people can easily distinguish the AEs generated by these methods with the naked eye.

To improve the concealment of AEs, we formulate the issue of synthesizing AEs beyond additive perturbations and propose a novel non-additive attack method called SSTA. More specifically, SSTA applies spatial transformation techniques [6] to the salient region of the image to generate AEs, rather than directly adding well-designed noise to the benign image. The spatial transform technique learns a smooth flow field that gives each pixel's new location, and this flow field is optimized until an eligible AE is obtained. To further ensure concealment and image quality, we constrain the optimized flow field $\bm{f}$ (the transform metric) with a small dynamic flow budget $\xi$ .

Extensive experiments on an ImageNet-based dataset indicate that the proposed SSTA makes AEs more inconspicuous while maintaining high attack performance. In addition, evaluation results on a range of similarity and image quality metrics show that our AEs are more similar to their benign counterparts and preserve vivid details. The main contributions can be summarized as follows:

• We formulate imperceptible AEs through spatial transform operations in the local salient region, which is extracted by a salient object detection method, rather than in a noise-adding manner.
• To balance the attack performance and the concealment of the generated AEs, we propose a dynamic strategy that updates the extracted critical region and the flow budget $\xi$ as the number of optimization iterations increases.
• Compared with state-of-the-art imperceptible attacks, experimental results on various victim models show our method’s superiority in synthesizing AEs in terms of attack ability, invisibility, and image quality, while preserving the AEs’ similarity to the original images.

The rest of this paper is organized as follows. In Sec. 2, we provide the details of the proposed SSTA framework. The experiments are presented in Sec. 3, with the conclusion drawn in Sec. 4.

Fig. 1: Overview of SSTA, where $\oplus$ represents applying Mask $M$ , and $\otimes$ represents the spatial transformation operation.

2 Methodology

2.1 Problem Definition

Given a well-trained DNN classifier $\bm{\mathcal{C}}$ and an input $\bm{x}$ with its corresponding label $y$ , we have $\bm{\mathcal{C}}(\bm{x})=y$ . The AE $\bm{x}_{adv}$ is a neighbor of $\bm{x}$ and satisfies that $\bm{\mathcal{C}}(\bm{x}_{adv})\neq y$ and $\left\|\bm{x}_{adv}-\bm{x}\right\|_{p}\leq\epsilon$ , where the $\bm{L}_{p}$ -norm is used as the metric function and $\epsilon$ is usually a small noise budget. With this definition, the problem of finding an AE becomes a constrained optimization problem:

\bm{x}_{adv}=\underset{\left\|\bm{x}_{adv}-\bm{x}\right\|_{p}\leq\epsilon}{\arg\max}\ \mathcal{L}(\bm{\mathcal{C}}(\bm{x}_{adv})\neq y),

(1)

where $\mathcal{L}$ stands for a loss function that measures the confidence of the model outputs.

Previous works craft an AE $\bm{x}_{adv}$ by adding $\bm{L}_{p}$ -norm constrained noise $\delta$ to the clean image $\bm{x}$ as

\bm{x}_{adv}=\bm{x}+\delta,\ s.t.\ \left\|\delta\right\|_{p}\leq\epsilon.

(2)
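For reference, the additive formulation in Eq. (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it takes $p=\infty$ (the text leaves $p$ general), represents the image as a flat list of intensities in $[0,1]$, and the function name `additive_adversarial` is hypothetical.

```python
def additive_adversarial(x, delta, eps):
    """Return x + delta with delta projected onto the L_inf ball of radius eps.

    x and delta are flat lists of pixel intensities in [0, 1].
    Choosing p = infinity is an assumption; Eq. (2) allows any L_p norm.
    """
    projected = [max(-eps, min(eps, d)) for d in delta]
    # Clip to the valid pixel range, which Eq. (2) leaves implicit.
    return [max(0.0, min(1.0, xi + di)) for xi, di in zip(x, projected)]


x = [0.20, 0.50, 0.99]
delta = [0.10, -0.01, 0.03]
x_adv = additive_adversarial(x, delta, eps=0.03)
```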

In contrast to this additive formulation, we combine salient object extraction and spatial transform techniques to build the imperceptible AE $\bm{x}_{adv}$ . As illustrated in Fig. 1, the proposed salient spatially transformed attack framework can be divided into two stages: the first stage obtains a salient region, which we call the mask $\bm{M}(\cdot)$ ; the second stage calculates the flow field $\bm{f}$ . Subsequently, we formulate the AE $\bm{x}_{adv}$ by applying the calculated flow field $\bm{f}$ to the clean image’s salient area $\bm{M}(\cdot)$ .

2.2 Salient Region Extraction

In this paper, we use the salient object detection method TRACER [7] to extract the critical area mask $\bm{M}(\cdot)$ . In preliminary experiments, we also tried other area extraction methods such as LC [8], FT [9], and Grad-CAM [10, 11], but found TRACER [7] more suitable because it efficiently detects the salient objects in an image and returns their corresponding regions; the results are shown in Fig. 2.

Moreover, TRACER can return several regions $r_{\tau}(\tau=0,...,255)$ with various scales depending on the threshold used when extracting salient areas. These regions are helpful for downstream tasks, such as image segmentation and background removal. In our work, we first take the region obtained with a high threshold $\tau$ , such as $\tau=250$ , as the region mask $\bm{M}(\cdot)$ . Then, while generating AEs, if the current attack is still unsuccessful after a pre-set number of iterations, $\bm{M}(\cdot)$ is updated by decreasing the threshold $\tau$ .
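The threshold-based mask update described above can be illustrated with a minimal sketch. It assumes the detector output is available as a saliency map with 8-bit scores and that the region $r_{\tau}$ contains the pixels whose score is at least $\tau$; the helper names, the step size of 50, and the toy map are hypothetical and are not taken from the paper.

```python
def region_mask(saliency, tau):
    """Binary mask r_tau: 1 where the saliency score is at least tau."""
    return [[1 if s >= tau else 0 for s in row] for row in saliency]


def next_threshold(tau, step=50, tau_min=0):
    """Lower the threshold so that the next mask can cover a larger region."""
    return max(tau_min, tau - step)


# Toy 3x3 saliency map with 8-bit scores (0..255).
saliency = [
    [10, 120, 30],
    [200, 255, 210],
    [20, 160, 40],
]

tau = 250  # start from a high threshold, i.e. a small, highly salient region
masks = []
for _ in range(3):  # pretend the attack failed at each threshold
    masks.append((tau, region_mask(saliency, tau)))
    tau = next_threshold(tau)

# Number of masked pixels for tau = 250, 200, 150.
areas = [sum(map(sum, m)) for _, m in masks]
```

In this toy case the masked area grows from 1 to 3 to 4 pixels as $\tau$ decreases, which is the intended effect: the attack starts on the most salient pixels and only widens the region when needed.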

2.3 Adversarial Example Generation

After computing the mask region $\bm{M}(\cdot)$ , we utilize the spatial transform method to build AEs in a non-additive way. The spatial transform technique uses a flow field $\bm{f}$ of shape $[2,h,w]$ to transform the original image $\bm{x}$ into $\bm{x}_{st}$ [4]. Specifically, let the input be $\bm{x}$ and its transformed counterpart be $\bm{x}_{st}$ . For the $i$ -th pixel in $\bm{x}_{st}$ , at pixel location $(u_{st}^{i},v_{st}^{i})$ , we need to calculate the flow vector $\bm{f}_{i}=(\Delta u^{i},\Delta v^{i})$ . The location in the original image $\bm{x}$ from which this pixel is sampled is then given by:

(u^{i},v^{i})=(u_{st}^{i}+\Delta u^{i},v_{st}^{i}+\Delta v^{i}).

(3)

Because the location $(u_{st}^{i}+\Delta u^{i},v_{st}^{i}+\Delta v^{i})$ generally does not fall on the integer pixel grid, and to keep the transformation differentiable with respect to the flow field $\bm{f}$ , bilinear interpolation [6] over the 4 pixels neighboring this location is used to compute the pixel values of the transformed image $\bm{x}_{st}$ as:

\bm{x}_{st}^{i}=\sum_{q\in\bm{N}(u^{i},v^{i})}\bm{x}^{q}(1-|u^{i}-u^{q}|)(1-|v^{i}-v^{q}|),

(4)

where $\bm{N}(u^{i},v^{i})$ is the neighborhood, that is, the four positions (top-left, top-right, bottom-left, bottom-right) tightly surrounding the target location $(u^{i},v^{i})$ . In adversarial attack settings, the calculated $\bm{x}_{st}$ serves as the AE $\bm{x}_{adv}$ . In our method, once $\bm{f}$ has been computed, we obtain $\bm{x}_{adv}$ by combining $\bm{M}(\cdot)$ and the flow field $\bm{f}$ , which is given by:

\bm{x}_{adv}=\mathrm{clip}\left(\bm{M}\left(\sum_{q\in\bm{N}(u^{i},v^{i})}\bm{x}^{q}(1-|u^{i}-u^{q}|)(1-|v^{i}-v^{q}|)\right)+(\bm{x}-\bm{M}(\bm{x})),0,1\right),

(5)

where $\bm{M}(\bm{x})$ represents the salient region, while $\bm{x}-\bm{M}(\bm{x})$ indicates the area outside the salient region.
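Eqs. (3) to (5) can be made concrete with a small pure-Python sketch on a single-channel toy image. Several choices here are assumptions rather than details stated in the text: $u$ is treated as the row index and $v$ as the column index, sampling locations are clamped to the image border, and $\bm{M}$ is applied as an element-wise binary mask, so that $\bm{M}(\bm{x}_{st})+(\bm{x}-\bm{M}(\bm{x}))$ becomes $M\cdot x_{st}+(1-M)\cdot x$. All function names are hypothetical.

```python
import math


def bilinear_sample(img, u, v):
    """Sample img (list of rows) at real-valued row u, column v, as in Eq. (4)."""
    h, w = len(img), len(img[0])
    # Clamp the location to the image (border handling is an assumption).
    u = min(max(u, 0.0), h - 1.0)
    v = min(max(v, 0.0), w - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, h - 1), min(v0 + 1, w - 1)
    value = 0.0
    # A set avoids counting a neighbor twice when the location sits on the border.
    for uq, vq in {(u0, v0), (u0, v1), (u1, v0), (u1, v1)}:
        value += img[uq][vq] * (1.0 - abs(u - uq)) * (1.0 - abs(v - vq))
    return value


def spatial_transform(img, flow):
    """Warp img with flow of shape [2, h, w]: flow[0] holds du, flow[1] holds dv (Eq. 3)."""
    h, w = len(img), len(img[0])
    return [
        [bilinear_sample(img, i + flow[0][i][j], j + flow[1][i][j]) for j in range(w)]
        for i in range(h)
    ]


def compose_adversarial(img, flow, mask):
    """Eq. (5): warped pixels inside the mask, clean pixels outside, clipped to [0, 1]."""
    warped = spatial_transform(img, flow)
    return [
        [
            min(1.0, max(0.0, m * s + (1 - m) * x))
            for x, s, m in zip(img_row, warped_row, mask_row)
        ]
        for img_row, warped_row, mask_row in zip(img, warped, mask)
    ]


img = [
    [0.0, 0.2, 0.4],
    [0.6, 0.8, 1.0],
    [0.1, 0.3, 0.5],
]
mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
# Sample every pixel half a pixel to the left of its own position (du = 0, dv = -0.5).
flow = [
    [[0.0] * 3 for _ in range(3)],
    [[-0.5] * 3 for _ in range(3)],
]
x_adv = compose_adversarial(img, flow, mask)
```

With this flow, the two masked pixels in the middle row become the averages of their left neighbor and themselves, while every pixel outside the mask keeps its clean value. In a real attack the flow would be a learnable tensor and the sampling would be done by a differentiable operator; this sketch only shows the arithmetic.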

In practice, we regard the problem of calculating the flow field $\bm{f}$ as an optimization task. In this paper, we use the AdamW optimizer to optimize the flow $\bm{f}$ .

2.4 Objective Functions

Taking the attack success rate and visual invisibility of the generated AEs into account, we divide the objective function into two parts: one is the adversarial loss and the other is a constraint on the flow field. Unlike other flow field-based attack methods, which constrain the size of the flow field with the flow loss proposed in [4], our method uses a dynamically updated flow field budget $\xi$ (a small number, such as $1\times 10^{-2}$ ) to regularize the flow field $\bm{f}$ . For adversarial attacks, the goal is to make $\bm{\mathcal{C}}(\bm{x}_{adv})\neq y$ . We give the objective function as:

\mathcal{L}_{adv}(\bm{x},y,\bm{f})=\max\left[\bm{\mathcal{C}}(\bm{x}_{adv})_{y}-\underset{k\neq y}{\max}\,\bm{\mathcal{C}}(\bm{x}_{adv})_{k},\ k\right],\quad \text{s.t.}\ \|\bm{f}\|\leq\xi.

(6)
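The two parts of Eq. (6) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the second argument of the outer $\max$ is exposed as a parameter named `floor` with a default of 0 (its value is not given in the text), and the constraint $\|\bm{f}\|\leq\xi$ is read as an element-wise bound on the flow, since the norm type is not specified. The function names are hypothetical.

```python
def adversarial_margin(logits, y, floor=0.0):
    """Margin loss of Eq. (6): true-class score minus the best other score, floored."""
    best_other = max(score for j, score in enumerate(logits) if j != y)
    return max(logits[y] - best_other, floor)


def clamp_flow(flow, xi):
    """Keep every flow component within [-xi, xi] (an L_inf reading of ||f|| <= xi)."""
    return [[[max(-xi, min(xi, d)) for d in row] for row in plane] for plane in flow]


# Class 0 is the true label. The loss is positive while the model is still correct ...
loss_clean = adversarial_margin([2.0, 0.5, 1.0], y=0)
# ... and reaches the floor once another class scores higher than the true class.
loss_adv = adversarial_margin([0.8, 0.5, 1.0], y=0)

# A [2, 1, 2] flow with components outside the budget xi = 0.01.
clamped = clamp_flow([[[0.05, -0.002]], [[-0.03, 0.0]]], xi=0.01)
```

Projecting the flow back into the budget after every optimizer step is one simple way to enforce the constraint; how $\xi$ itself is updated during optimization is not shown here.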

3 Experiments

3.1 Settings

Dataset: We verify the performance of our method on the development set of the ImageNet-Compatible Dataset, a subset of ImageNet-1K, which consists of 1,000 images with a size of 299×299×3. We resized the images to 224×224×3 to fit the input size of the victim models.

Models: We use PyTorch pre-trained models as the victim models, including VGG-19 [13], ResNet-50 [14], DenseNet-121 [15], ViT-16 [17] and Swin_B [16].

Baselines: The baselines include the stAdv [4], Chroma-Shift [5] and AdvDrop [18].

Metrics: Pixel-based attack methods typically use the $L_{p}$ -norm to evaluate an AE’s perceptual similarity to its corresponding benign image, whereas AEs generated by spatial transformation are usually evaluated with other metrics related to image quality. Specifically, we use the following perceptual metrics, LPIPS [19], DISTS [20], FID, MSE, UQI [21], SCC [22], PSNR [23], VIFP [24], SSIM, and NIQE, to evaluate both the difference between the generated AEs and their benign counterparts and the image quality of these AEs.
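As a reminder of how the two simplest of these metrics relate, the following sketch computes MSE and PSNR for a pair of tiny flattened images. It assumes 8-bit intensities (peak value 255), which the text does not state explicitly, and uses made-up pixel values. Because PSNR is a logarithmic function of the per-image MSE, an average PSNR over a dataset generally differs from the PSNR computed from the average MSE.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)


clean = [52, 55, 61, 59]  # made-up 8-bit pixel values
adv = [52, 56, 61, 57]

error = mse(clean, adv)
quality_db = psnr(clean, adv)
```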

3.2 Attacking Performance

We investigate the performance of the proposed method in attacking various image classifiers. The results are shown in Table 1. SSTA achieves attack success rates on par with the strongest baseline, stAdv (100% on four of the five models and 99.86% on ResNet-50), while disturbing only a minimal local area, i.e., the salient region, whereas the other attacks need to distort the whole image. This demonstrates the advantage of our method.

Table 1: Attack success rate (%) of baselines and SSTA.

Methods	VGG-19	ResNet-50	DenseNet-121	ViT-16	Swin_B
stAdv	100	100	100	100	100
Chroma-Shift	93.69	94.67	95.1	95.09	96.66
AdvDrop	100	99.07	100	95.97	99.79
SSTA	100	99.86	100	100	100

3.3 Image Quality and Similarity

We use diverse metrics covering image quality and similarity to assess the AEs and list the results in Table 2. Among all compared methods, the proposed method achieves the lowest LPIPS, DISTS, FID, and MSE (lower is better), at 0.0038, 0.0091, 16.3876, and 2.1210, respectively, and the highest UQI, SCC, PSNR, VIFP, SSIM, and NIQE (higher is better), at 0.9998, 0.9890, 49.2397, 0.9487, 0.9987, and 43.9611, respectively. These results indicate that the proposed method is superior to the existing imperceptible attacks on every reported metric.

Table 2: Perceptual distances calculated on fooled examples by stAdv, Chroma-Shift, AdvDrop, and the proposed SSTA.

Metrics	stAdv	Chroma-Shift	AdvDrop	SSTA
LPIPS $\downarrow$	0.1595	0.0135	0.0956	0.0038
DISTS $\downarrow$	0.1524	0.0165	0.0678	0.0091
FID $\downarrow$	60.2464	88.8750	46.7813	16.3876
MSE $\downarrow$	95.7488	23.5399	17.0450	2.1210
UQI $\uparrow$	0.9925	0.9925	0.9952	0.9998
SCC $\uparrow$	0.6415	0.9623	0.6894	0.9890
PSNR $\uparrow$	29.8119	36.5651	36.1464	49.2397
VIFP $\uparrow$	0.5229	0.7644	0.6474	0.9487
SSIM $\uparrow$	0.9391	0.9771	0.9688	0.9987
NIQE $\uparrow$	33.3234	43.5860	39.8657	43.9611

To visualize the difference between the AEs generated by our method and the baselines, we also draw the adversarial perturbations generated by stAdv, Chroma-Shift, AdvDrop, and the proposed method in Fig. 3, where the target model is a pre-trained ResNet-50. The first row shows the AEs and their corresponding noise for stAdv, Chroma-Shift, AdvDrop, and our method, respectively. Note that, for better observation, we magnified the noise by a factor of 30. From Fig. 3, we can clearly observe that the baselines distort the whole image. In contrast, the perturbations generated by our method are concentrated in the salient region and are milder; the resulting AEs are similar to their original clean counterparts and more imperceptible to human eyes. These results indicate that the AEs generated by the proposed method have better concealment and are not easily exposed.

3.4 Further Human Perceptual Study

This experiment is a subjective evaluation of whether, in most cases, AEs generated by SSTA are indistinguishable from their original samples. We argue that AEs generated by SSTA not only satisfy imperceptibility but are also inconspicuous to the human eye. To validate this claim, we compare AEs generated by SSTA with those generated by the baseline methods. In our human perception study, we display the original image and the AEs on a computer screen and give each participant 100 seconds to judge every image. Empirically, 100 seconds is enough for the participants to decide and point out any visible distortion. We used 50 images randomly sampled from the ImageNet dataset for this experiment. Each participant is shown about 5 images at a time: the left one is always the clean image, and the images on its right are adversarial images generated by various methods (possibly more than one) or the same clean image. Participants are asked “Are the images on the right the same as the left (the clean one)?” and each participant provides more than 50 annotations. Each image to be checked can be zoomed in as far as needed to make observation easier for the participants.

More than 20 participants were involved in assessing whether a specific image was the same as the original clean image. For the sake of fairness, we mixed the clean images and the adversarial images generated by different methods into the dataset to be checked. These participants provided more than 1,000 annotations. As shown in Fig. 4, the AEs generated by SSTA are generally considered to be the same as the original images: 88.98% of the annotations judged them unmodified, meaning that most participants could not distinguish the AEs generated by SSTA. Conversely, participants spotted distortions in the AEs generated by the baseline methods more easily: more than 90%, 55%, and 30% of the annotations flagged the AEs of stAdv, AdvDrop, and Chroma-Shift, respectively, as modified. This indicates that the AEs generated by these baseline methods do not prevent humans from correctly identifying the objects in the images, but it is very easy to notice that they have been tampered with.

4 Conclusions

In this paper, we present a novel non-additive attack method, called SSTA, which performs spatial transformation in salient regions with an optimized flow field to synthesize AEs. Extensive experiments show that the proposed method is superior to state-of-the-art methods in terms of concealment and image quality, and that the generated AEs are largely indistinguishable to the human eye. Because it generates AEs without adding noise, the proposed SSTA provides a new, efficient way to evaluate the robustness of classifiers and to enhance their performance using techniques like fine-tuning or adversarial training. Furthermore, the proposed approach can be used as a reliable tool to build more robust models.

References

[1] Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil K. Jain, “Adversarial attacks and defenses in images, graphs and text: A review,” International Journal of Automation and Computing, vol. 17, no. 2, pp. 151–178, 2020.
[2] Nicholas Carlini and David A. Wagner, “Towards evaluating the robustness of neural networks,” in S&P, 2017.
[3] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018.
[4] Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song, “Spatially transformed adversarial examples,” in ICLR, 2018.
[5] Ayberk Aydin, Deniz Sen, Berat Tuna Karli, Oguz Hanoglu, and Alptekin Temizel, “Imperceptible adversarial examples by spatial chroma-shift,” in ADVM, 2021, pp. 8–14.
[6] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu, “Spatial transformer networks,” in NeurIPS, 2015, pp. 2017–2025.
[7] Min Seok Lee, WooSeok Shin, and Sung Won Han, “TRACER: extreme attention guided salient object tracing network,” in AAAI, 2022, pp. 12993–12994.
[8] Yun Zhai and Mubarak Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM MM, 2006, pp. 815–824.
[9] Radhakrishna Achanta, Sheila S. Hemami, Francisco J. Estrada, and Sabine Süsstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009, pp. 1597–1604.
[10] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
[11] Ting Deng and Zhigang Zeng, “Generate adversarial examples by spatially perturbing on the meaningful area,” Pattern Recognition Letters, vol. 125, pp. 632–638, 2019.
[12] Xuebin Qin, Zichen Vincent Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaïane, and Martin Jägersand, “U$^{2}$-Net: Going deeper with nested u-structure for salient object detection,” Pattern Recognition, vol. 106, pp. 107404, 2020.
[13] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017, pp. 2261–2269.
[16] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 9992–10002.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
[18] Ranjie Duan, Yuefeng Chen, Dantong Niu, Yun Yang, A. Kai Qin, and Yuan He, “Advdrop: Adversarial attack to dnns by dropping information,” in ICCV, 2021, pp. 7486–7495.
[19] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
[20] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2567–2581, 2022.
[21] Zhou Wang and Alan C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
[22] Jun Li, “Spatial quality evaluation of fusion of different resolution images,” International Archives of Photogrammetry and Remote Sensing, vol. 33, 2000.
[23] Hamid R. Sheikh, Muhammad F. Sabir, and Alan C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, 2006.
[24] Hamid R. Sheikh and Alan C. Bovik, “Image information and visual quality,” in ICASSP, 2004, pp. 709–712.