
License: arXiv.org perpetual non-exclusive license
arXiv:2312.07258v1 [cs.CV] 12 Dec 2023

SSTA: Salient Spatially Transformed Attack

Abstract

Extensive studies have demonstrated that deep neural networks (DNNs) are vulnerable to adversarial attacks, which poses a serious security risk to the further application of DNNs, especially for AI models deployed in the real world. Despite significant recent progress, existing attack methods are still easily detected by the naked eye, because they formulate adversarial examples (AEs) largely by adding noise. This weakness significantly increases the risk of exposure and can cause an attack to fail. Therefore, in this paper, we propose the Salient Spatially Transformed Attack (SSTA), a novel framework for crafting imperceptible AEs. SSTA enhances the stealthiness of AEs by estimating a smooth spatial transform metric on the most critical area of the image instead of adding external noise to the whole image. Extensive experiments indicate that, compared to state-of-the-art baselines, SSTA effectively improves the imperceptibility of AEs while maintaining a 100% attack success rate.

Index Terms—  Adversarial Attack, Imperceptible Adversarial Examples, Spatial Transform.

1 Introduction

Deep neural networks (DNNs) are susceptible to adversarial examples (AEs), which are crafted by subtly perturbing a clean input [1], especially in computer vision (CV) tasks such as image recognition. The key to attacking CV models is generating AEs that combine a high attack success rate with high imperceptibility. Various methods have been proposed to build AEs; most of them craft AEs by optimizing a noise pattern and adding it to the image.

Although most existing attacks obtain a high success rate by adding noise to the original image, they are not ideal in terms of imperceptibility and similarity, since the added perturbations are not harmonious with the clean image [2, 3]. Researchers have proposed various ways to address these issues. Some methods generate AEs without adding noise, such as spatial transform-based attacks, which craft AEs by changing the positions of specific pixels [4, 5]. Even though these methods make the adversarial perturbations more harmonious with their clean counterparts, their imperceptibility is still weak because they disturb the entire image. In most cases, people can easily distinguish the AEs generated by these methods with the naked eye.

To improve the concealment of AEs, we formulate the synthesis of AEs beyond additive perturbations and propose a novel non-additive attack method called SSTA. More specifically, SSTA applies spatial transformation techniques [6] to the salient region of the image to generate AEs, rather than directly adding well-designed noise to the benign image. The spatial transform technique learns a smooth flow field that gives each pixel's new location, and this field is optimized until an eligible AE is obtained. To further ensure concealment and image quality, we constrain the optimized flow field $\bm{f}$ (the transform metric) with a small dynamic flow budget $\xi$.

Extensive experiments on the ImageNet dataset indicate that the proposed SSTA makes AEs more inconspicuous while maintaining high attack performance. In addition, evaluation on a range of similarity and image quality metrics shows that our AEs are more similar to their benign counterparts and preserve vivid details. The main contributions are summarized as follows:

  • We formulate imperceptible AEs through spatial transform operations in the local salient region, which is extracted by a salient object detection method, rather than in a noise-adding manner.

  • To balance attack performance and the concealment of the generated AEs, we propose a dynamic strategy that updates the extracted critical region and the flow budget $\xi$ as the number of optimization iterations increases.

  • Compared with state-of-the-art imperceptible attacks, experimental results on various victim models show our method's superiority in synthesizing AEs with strong attack ability, invisibility, and image quality, while preserving the AEs' similarity to the original images.

The rest of this paper is organized as follows. In Sec. 2, we provide the details of the proposed SSTA framework. The experiments are presented in Sec. 3, with the conclusion drawn in Sec. 4.

Fig. 1: Overview of SSTA, where $\oplus$ represents applying the mask $M$, and $\otimes$ represents the spatial transformation operation.

2 Methodology

2.1 Problem Definition

Given a well-trained DNN classifier $\bm{\mathcal{C}}$ and an input $\bm{x}$ with its corresponding label $y$, we have $\bm{\mathcal{C}}(\bm{x}) = y$. The AE $\bm{x}_{adv}$ is a neighbor of $\bm{x}$ and satisfies $\bm{\mathcal{C}}(\bm{x}_{adv}) \neq y$ and $\|\bm{x}_{adv} - \bm{x}\|_p \le \epsilon$, where the $\bm{L}_p$-norm is used as the metric function and $\epsilon$ is usually a small noise budget. With this definition, the problem of finding an AE becomes a constrained optimization problem:

$$\bm{x}_{adv} = \underset{\|\bm{x}_{adv} - \bm{x}\|_p \le \epsilon}{\arg\max}\ \mathcal{L}(\bm{\mathcal{C}}(\bm{x}_{adv}) \neq y), \tag{1}$$

where $\mathcal{L}$ stands for a loss function that measures the confidence of the model outputs.

Previous works craft an AE $\bm{x}_{adv}$ by adding $\bm{L}_p$-norm constrained noise $\delta$ to the clean image $\bm{x}$ as

$$\bm{x}_{adv} = \bm{x} + \delta, \quad s.t.\ \|\delta\|_p \le \epsilon. \tag{2}$$
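As a point of comparison, the additive formulation in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular attack: it assumes $p = \infty$, pixel values in $[0, 1]$ (the final clipping to that range is not part of Eq. (2)), and a perturbation `delta` that some optimizer has already produced. The function name `additive_adversarial` is hypothetical.

```python
import numpy as np

def additive_adversarial(x, delta, eps):
    """Eq. (2) with p = infinity (assumed): bound the noise, add it, keep pixels valid."""
    delta = np.clip(delta, -eps, eps)       # enforce ||delta||_inf <= eps
    return np.clip(x + delta, 0.0, 1.0)     # assumption: images live in [0, 1]

# Toy 2x2 "image" and an unconstrained perturbation (hypothetical values).
x = np.full((2, 2), 0.5)
delta = np.array([[0.2, -0.2], [0.01, 0.0]])
x_adv = additive_adversarial(x, delta, eps=0.05)
```

Every pixel of `x` may change here, which is the behavior the salient-region approach tries to avoid.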

Unlike these additive approaches, we combine salient object extraction and spatial transform techniques to build the imperceptible AE $\bm{x}_{adv}$. As illustrated in Fig. 1, the proposed salient spatially transformed attack framework can be divided into two stages: the first obtains a salient region, which we call the mask $\bm{M}(\cdot)$; the second calculates the flow field $\bm{f}$. We then formulate the AE $\bm{x}_{adv}$ by applying the calculated flow field $\bm{f}$ to the clean image's salient area $\bm{M}(\cdot)$.

2.2 Salient Region Extraction

In this paper, we use the salient object detection method TRACER [7], which can efficiently detect salient objects in images, to extract the critical area mask $\bm{M}(\cdot)$. In preliminary experiments, we also tried other area extraction methods such as LC [8], FT [9], and Grad-CAM [10, 11], but found TRACER [12] more suitable because it efficiently detects the salient objects in an image and returns their corresponding regions. The results are shown in Fig. 2.

Fig. 2: The extracted area by different methods.

Moreover, TRACER can return several regions $r_\tau$ ($\tau = 0, \dots, 255$) at various scales, depending on the threshold used when extracting salient areas. Such regions are also helpful for downstream tasks such as image segmentation and background removal. In our work, we first take the region with a high threshold $\tau$, such as $\tau = 250$, as the region mask $\bm{M}(\cdot)$. Then, while generating AEs, if the current attack is still unsuccessful after a pre-set number of iterations, $\bm{M}(\cdot)$ is updated by decreasing the threshold $\tau$.
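The dynamic mask update described above can be sketched as follows. The sketch assumes that the saliency detector outputs a per-pixel score map in the range 0 to 255 and that region $r_\tau$ is the set of pixels scoring at least $\tau$; the paper does not spell out either detail. The function names and the step size are illustrative choices, not values from the paper.

```python
import numpy as np

def region_mask(saliency, tau):
    """Binary mask of the pixels whose saliency score is >= tau (assumed definition)."""
    return (np.asarray(saliency) >= tau).astype(np.float32)

def relax_threshold(tau, step=10, tau_min=0):
    """Lower the threshold after a failed attack round so the mask can grow."""
    return max(tau - step, tau_min)

# Toy 2x2 saliency map (hypothetical values).
saliency = np.array([[255, 251], [200, 10]], dtype=np.uint8)
tau = 250
mask = region_mask(saliency, tau)        # only the two most salient pixels
tau = relax_threshold(tau, step=50)      # attack failed: threshold drops to 200
larger_mask = region_mask(saliency, tau) # the mask now covers three pixels
```

Because the mask only grows when the threshold drops, the attack starts from the smallest region and enlarges it only when needed.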

2.3 Adversarial Example Generation

After computing the mask region $\bm{M}(\cdot)$, we use the spatial transform method to build AEs in a non-additive way. The spatial transform technique uses a flow field matrix $\bm{f}$ of shape $[2, h, w]$ to transform the original image $\bm{x}$ into $\bm{x}_{st}$ [4]. Specifically, let the input be $\bm{x}$ and its transformed counterpart be $\bm{x}_{st}$. For the $i$-th pixel in $\bm{x}_{st}$ at pixel location $(u_{st}^{i}, v_{st}^{i})$, we need to calculate the flow field entry $\bm{f}_i = (\Delta u^{i}, \Delta v^{i})$. The $i$-th pixel $\bm{x}^{i}$'s location in the transformed image can then be written as:

$$(u^{i}, v^{i}) = (u_{st}^{i} + \Delta u^{i},\ v_{st}^{i} + \Delta v^{i}). \tag{3}$$

To keep the transformation differentiable with respect to the flow field $\bm{f}$, bilinear interpolation [6] is applied to the values of the 4 pixels neighboring the location $(u_{st}^{i} + \Delta u^{i}, v_{st}^{i} + \Delta v^{i})$ to obtain the transformed image $\bm{x}_{st}$ as:

$$\bm{x}_{st}^{i} = \sum_{q \in \bm{N}(u^{i}, v^{i})} \bm{x}^{q} (1 - |u^{i} - u^{q}|)(1 - |v^{i} - v^{q}|), \tag{4}$$

where $\bm{N}(u^{i}, v^{i})$ is the neighborhood, that is, the four positions (top-left, top-right, bottom-left, bottom-right) tightly surrounding the target location $(u^{i}, v^{i})$. In adversarial attack settings, the calculated $\bm{x}_{st}$ is the final AE $\bm{x}_{adv}$. Once $\bm{f}$ has been computed, we obtain $\bm{x}_{adv}$ by combining $\bm{M}(\cdot)$ and the flow field $\bm{f}$:

$$\bm{x}_{adv} = clip\Big(\bm{M}\Big(\sum_{q \in \bm{N}(u^{i}, v^{i})} \bm{x}^{q} (1 - |u^{i} - u^{q}|)(1 - |v^{i} - v^{q}|)\Big) + (\bm{x} - \bm{M}(\bm{x})),\ 0,\ 1\Big), \tag{5}$$

where $\bm{M}(\bm{x})$ represents the salient region, while $\bm{x} - \bm{M}(\bm{x})$ indicates the area outside the salient region.
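Eqs. (3) to (5) can be illustrated with a small NumPy sketch for a single-channel image. Several details are assumptions made only for this example: $u$ is taken as the row index and $v$ as the column index, sampling locations that fall outside the image are clamped to its border, and $\bm{M}(\cdot)$ is modeled as element-wise multiplication by a binary mask. The function names are hypothetical.

```python
import numpy as np

def warp_bilinear(x, flow):
    """Sample image x (h, w) at locations shifted by flow (2, h, w), as in Eqs. (3)-(4)."""
    h, w = x.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Eq. (3): sampling location for every output pixel, clamped to the image.
    u = np.clip(rows + flow[0], 0, h - 1)
    v = np.clip(cols + flow[1], 0, w - 1)
    # Integer corners of the cell that contains (u, v).
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, h - 1)
    v1 = np.minimum(v0 + 1, w - 1)
    du, dv = u - u0, v - v0
    # Eq. (4): each corner q is weighted by (1 - |u - u_q|)(1 - |v - v_q|).
    return (x[u0, v0] * (1 - du) * (1 - dv) + x[u0, v1] * (1 - du) * dv
            + x[u1, v0] * du * (1 - dv) + x[u1, v1] * du * dv)

def salient_adversarial(x, flow, mask):
    """Eq. (5): warped pixels inside the mask, clean pixels outside, clipped to [0, 1]."""
    x_st = warp_bilinear(x, flow)
    return np.clip(mask * x_st + (1.0 - mask) * x, 0.0, 1.0)

# Toy 4x4 image, a constant half-pixel shift along columns, and a 2x2 mask.
x = np.arange(16, dtype=np.float64).reshape(4, 4) / 15.0
flow = np.zeros((2, 4, 4))
flow[1] = 0.5
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                     # hypothetical salient region
x_adv = salient_adversarial(x, flow, mask)
```

In this toy case only the four masked pixels differ from the clean image; everything outside the mask is copied through unchanged.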

In practice, we regard the calculation of the flow field $\bm{f}$ as an optimization task and use the AdamW optimizer to optimize $\bm{f}$.
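For readers unfamiliar with AdamW, a single update step on a flow field can be written out explicitly as below. This is a generic sketch of the AdamW rule (Adam with decoupled weight decay), not the authors' training code: the hyperparameters are common defaults chosen here as assumptions, and the gradient is a placeholder, whereas in an actual attack it would come from backpropagating the adversarial loss through the classifier and the warp.

```python
import numpy as np

def adamw_step(flow, grad, state, lr=0.01, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to the flow field and return the new flow."""
    b1, b2 = betas
    state["t"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    flow = flow * (1 - lr * weight_decay)    # decoupled weight decay
    return flow - lr * m_hat / (np.sqrt(v_hat) + eps)

flow = np.zeros((2, 2, 2))                   # tiny [2, h, w] flow field
state = {"t": 0, "m": np.zeros_like(flow), "v": np.zeros_like(flow)}
grad = np.ones_like(flow)                    # placeholder gradient
grad[1] = -2.0
flow = adamw_step(flow, grad, state)
```

On the first step the update has magnitude close to `lr` in the direction opposite to the gradient sign, regardless of the gradient's scale.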

2.4 Objective Functions

Taking both the attack success rate and the visual invisibility of the generated AEs into account, we divide the objective function into two parts: an adversarial loss and a constraint on the flow field. Unlike other flow field-based attack methods, which constrain the size of the flow field through the flow loss proposed in [4], our method uses a dynamically updated flow field budget $\xi$ (a small number, such as $1 \times 10^{-2}$) to regularize the flow field $\bm{f}$. For adversarial attacks, the goal is to make $\bm{\mathcal{C}}(\bm{x}_{adv}) \neq y$. We give the objective function as:

$$\mathcal{L}_{adv}(\bm{x}, y, \bm{f}) = \max\big[\bm{\mathcal{C}}(\bm{x}_{adv})_{y} - \underset{k \neq y}{\max}\ \bm{\mathcal{C}}(\bm{x}_{adv})_{k},\ k\big], \quad s.t.\ \|\bm{f}\| \le \xi. \tag{6}$$
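A minimal NumPy sketch of Eq. (6) is given below. Two points are assumptions rather than statements from the paper: the second argument of the outer max is treated as a constant margin floor, named `kappa` here, and the constraint $\|\bm{f}\| \le \xi$ is enforced by clamping every flow component to $[-\xi, \xi]$, since the norm is not specified. Both function names are hypothetical.

```python
import numpy as np

def adversarial_margin_loss(scores, y, kappa=0.0):
    """Margin between the true-class score and the best other class, floored at kappa."""
    scores = np.asarray(scores, dtype=np.float64)
    best_other = np.max(np.delete(scores, y))   # max over k != y
    return float(max(scores[y] - best_other, kappa))

def project_flow(flow, xi):
    """Keep the flow within the budget by clamping each component (assumed max-norm)."""
    return np.clip(flow, -xi, xi)

scores = [2.0, 5.0, 1.0]                        # hypothetical classifier outputs
loss_not_fooled = adversarial_margin_loss(scores, y=1)   # true class still wins
loss_fooled = adversarial_margin_loss(scores, y=0)       # another class already wins
flow = project_flow(np.array([0.5, -0.004, 0.02]), xi=0.01)
```

The loss is positive while the true class still has the highest score and reaches the floor once another class overtakes it, so minimizing it drives the classifier toward a wrong prediction.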

3 Experiments

3.1 Settings

Dataset: We verify the performance of our method on the development set of the ImageNet-Compatible Dataset, a subset of ImageNet-1K, which consists of 1,000 images of size 299×299×3. We resize the images to 224×224×3 to fit the victim models.

Models: We use PyTorch pre-trained models as the victim models, including VGG-19 [13], ResNet-50 [14], DenseNet-121 [15], ViT-16 [17] and Swin_B [16].

Baselines: The baselines are stAdv [4], Chroma-Shift [5] and AdvDrop [18].

Metrics: Pixel-based attack methods typically use the $L_p$-norm to evaluate an AE's perceptual similarity to its corresponding benign image, whereas AEs generated by spatial transformation are usually evaluated with metrics related to image quality. Specifically, we use LPIPS [19], DISTS [20], FID, MSE, UQI [21], SCC [22], PSNR [23], VIFP [24], SSIM, and NIQE to evaluate both the difference between the generated AEs and their benign counterparts and the image quality of these AEs.
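Two of the simpler metrics in this list, MSE and PSNR, can be computed directly, as in the sketch below. It assumes 8-bit images with a peak value of 255 and uses the standard textbook definitions; it is not the evaluation code behind the reported numbers, and the function names are illustrative.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))

# Toy example: every pixel differs by 5 gray levels (hypothetical data).
clean = np.zeros((2, 2), dtype=np.uint8)
adv = clean + 5
error = mse(clean, adv)
quality = psnr(clean, adv)
```

Lower MSE and higher PSNR both mean the adversarial image is numerically closer to the clean one.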

3.2 Attacking Performance

We investigate the performance of the proposed method in attacking various image classifiers. The results are shown in Table 1. SSTA obtains SOTA attack performance while disturbing only a minimal local area, i.e., the salient region, whereas the other attacks need to distort the whole image. This demonstrates the superiority of our method.

Table 1: Attack performance of baselines and SSTA.
Methods        VGG-19   ResNet-50   DenseNet-121   ViT-16   Swin_B
stAdv          100      100         100            100      100
Chroma-Shift   93.69    94.67       95.1           95.09    96.66
AdvDrop        100      99.07       100            95.97    99.79
SSTA           100      99.86       100            100      100
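For reference, an attack success rate of the kind reported in Table 1 can be computed as in the sketch below. It assumes the rate is the percentage of originally correctly classified images whose adversarial version is misclassified; the paper does not state its exact definition, and the function name and the toy predictions are hypothetical.

```python
import numpy as np

def attack_success_rate(clean_pred, adv_pred, labels):
    """Percentage of correctly classified images whose AE is misclassified (assumed definition)."""
    clean_pred, adv_pred, labels = map(np.asarray, (clean_pred, adv_pred, labels))
    attacked = clean_pred == labels              # only attack images the model gets right
    if not attacked.any():
        return 0.0
    fooled = adv_pred[attacked] != labels[attacked]
    return 100.0 * float(fooled.mean())

labels     = [3, 7, 7, 1]
clean_pred = [3, 7, 2, 1]    # the third image is already misclassified
adv_pred   = [5, 7, 2, 0]    # the first and fourth AEs fool the model
rate = attack_success_rate(clean_pred, adv_pred, labels)
```

Here two of the three eligible images are fooled, so the rate is two thirds expressed as a percentage.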

3.3 Image Quality and Similarity

We use diverse image quality and similarity metrics to assess the AEs and list the results in Table 2. The proposed method has the lowest LPIPS, DISTS, FID, and MSE (lower is better), at 0.0038, 0.0091, 16.3876 and 2.1210, respectively, and the highest UQI, SCC, PSNR, VIFP, SSIM, and NIQE (higher is better), at 0.9998, 0.9890, 49.2397, 0.9487, 0.9987 and 43.9611, respectively, in comparison to the baselines. These results indicate that the proposed method is superior to the existing imperceptible attacks.

Fig. 3: AEs and their corresponding perturbations.

Table 2: Perceptual distances calculated on fooled examples by stAdv, Chroma-Shift, AdvDrop and the proposed SSTA.
Metrics   stAdv     Chroma-Shift   AdvDrop   SSTA
LPIPS ↓   0.1595    0.0135         0.0956    0.0038
DISTS ↓   0.1524    0.0165         0.0678    0.0091
FID ↓     60.2464   88.8750        46.7813   16.3876
MSE ↓     95.7488   23.5399        17.0450   2.1210
UQI ↑     0.9925    0.9925         0.9952    0.9998
SCC ↑     0.6415    0.9623         0.6894    0.9890
PSNR ↑    29.8119   36.5651        36.1464   49.2397
VIFP ↑    0.5229    0.7644         0.6474    0.9487
SSIM ↑    0.9391    0.9771         0.9688    0.9987
NIQE ↑    33.3234   43.5860        39.8657   43.9611

To visualize the difference between the AEs generated by our method and those of the baselines, we also draw the adversarial perturbations generated by stAdv, Chroma-Shift, AdvDrop and the proposed method in Fig. 3, where the target model is a pre-trained ResNet-50. The first row shows the AEs and their corresponding noise for stAdv, Chroma-Shift, AdvDrop and our method, respectively. Note that, for better observation, we magnified the noise by a factor of 30. From Fig. 3, we can clearly observe that the baselines distort the whole image. In contrast, the perturbation produced by our method is concentrated in the salient region and is milder, so our AEs are similar to their clean counterparts and more imperceptible to human eyes. These results indicate that the AEs generated by the proposed method have better concealment and are not easily exposed.

3.4 Further Human Perceptual Study

This experiment is a subjective evaluation of whether, in most cases, AEs generated by SSTA are indistinguishable from their original samples. We argue that AEs generated by SSTA not only satisfy imperceptibility but are also inconspicuous to the human eye. To validate this claim, we compare AEs generated by SSTA with those generated by the baseline methods. In our human perception study, we display the original image and the AEs on a computer screen and give each participant 100 seconds to judge every image. Empirically, 100 seconds is enough for participants to decide and point out any visible distortion. We use 50 images randomly sampled from the ImageNet dataset for this experiment. Participants are shown about 5 images at a time: the leftmost is always the clean image, and the images to its right are adversarial images generated by various methods (possibly more than one) or the same clean image. Participants are asked “Are the images on the right the same as the left (the clean one)?”, and each participant provides more than 50 annotations. Each image being checked can be zoomed in as far as needed to make inspection easier for participants.

More than 20 participants assessed whether a specific image was the same as the original clean image. For the sake of fairness, we mixed the clean images and the adversarial images generated by the different methods into a single set to be checked. The participants provided more than 1,000 annotations in total. As shown in Fig. 4, the AEs generated by SSTA are generally considered to be the same as the original images: 88.98% of the annotations judged them to be unmodified, meaning that most participants could not distinguish the AEs generated by SSTA. Conversely, for AEs generated by the baseline methods, participants spotted distortions more easily: more than 90%, 55% and 30% of the total annotations flagged the AEs of stAdv, AdvDrop and Chroma-Shift, respectively, as modified. This indicates that although the AEs generated by these baseline methods do not prevent humans from correctly identifying the objects in the images, it is very easy to notice that they have been tampered with.

Fig. 4: Human perceptual study results where the bars signify the percentage of the participants answering “The images are the same as the clean image”.

4 Conclusions

In this paper, we present a novel non-additive method, called SSTA, which performs spatial transformation in salient regions with the optimal flow field to synthesize AEs. Extensive experiments show that the proposed method is superior to state-of-the-art methods in terms of concealment and image quality, and that the generated AEs are indistinguishable to the human eye. Because it generates AEs without adding noise, SSTA provides a new, efficient way to evaluate the robustness of classifiers and to enhance their performance using techniques such as fine-tuning or adversarial training. Furthermore, the proposed approach can serve as a reliable tool for building more robust models.

References

  • [1] Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil K. Jain, “Adversarial attacks and defenses in images, graphs and text: A review,” International Journal of Automation and Computing, vol. 17, no. 2, pp. 151–178, 2020.
  • [2] Nicholas Carlini and David A. Wagner, “Towards evaluating the robustness of neural networks,” in S&P, 2017.
  • [3] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018.
  • [4] Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song, “Spatially transformed adversarial examples,” in ICLR, 2018.
  • [5] Ayberk Aydin, Deniz Sen, Berat Tuna Karli, Oguz Hanoglu, and Alptekin Temizel, “Imperceptible adversarial examples by spatial chroma-shift,” in ADVM, 2021, pp. 8–14.
  • [6] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu, “Spatial transformer networks,” in NeurIPS, 2015, pp. 2017–2025.
  • [7] Min Seok Lee, WooSeok Shin, and Sung Won Han, “TRACER: extreme attention guided salient object tracing network,” in AAAI, 2022, pp. 12993–12994.
  • [8] Yun Zhai and Mubarak Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM MM, 2006, pp. 815–824.
  • [9] Radhakrishna Achanta, Sheila S. Hemami, Francisco J. Estrada, and Sabine Süsstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009, pp. 1597–1604.
  • [10] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
  • [11] Ting Deng and Zhigang Zeng, “Generate adversarial examples by spatially perturbing on the meaningful area,” Pattern Recognition Letters, vol. 125, pp. 632–638, 2019.
  • [12] Xuebin Qin, Zichen Vincent Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaïane, and Martin Jägersand, “U²-net: Going deeper with nested u-structure for salient object detection,” Pattern Recognition, vol. 106, pp. 107404, 2020.
  • [13] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017, pp. 2261–2269.
  • [16] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 9992–10002.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
  • [18] Ranjie Duan, Yuefeng Chen, Dantong Niu, Yun Yang, A. Kai Qin, and Yuan He, “Advdrop: Adversarial attack to dnns by dropping information,” in ICCV, 2021, pp. 7486–7495.
  • [19] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
  • [20] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2567–2581, 2022.
  • [21] Zhou Wang and Alan C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
  • [22] Jun Li, “Spatial quality evaluation of fusion of different resolution images,” International Archives of Photogrammetry and Remote Sensing, vol. 33, 09 2000.
  • [23] Hamid R. Sheikh, Muhammad F. Sabir, and Alan C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, 2006.
  • [24] Hamid R. Sheikh and Alan C. Bovik, “Image information and visual quality,” in ICASSP, 2004, pp. 709–712.