
OSLO:
One-Shot Label-Only Membership Inference Attacks

Yuefeng Peng
University of Massachusetts Amherst
yuefengpeng@cs.umass.edu
Jaechul Roh
University of Massachusetts Amherst
jroh@umass.edu
Subhransu Maji
University of Massachusetts Amherst
smaji@cs.umass.edu
Amir Houmansadr
University of Massachusetts Amherst
amir@cs.umass.edu
Abstract

We introduce One-Shot Label-Only (OSLO) membership inference attacks (MIAs), which infer a given sample’s membership in a target model’s training set with high precision using just a single query, where the target model only returns the predicted hard label. This is in contrast to state-of-the-art label-only attacks, which require $\sim 6000$ queries yet achieve lower attack precision than OSLO. OSLO leverages transfer-based black-box adversarial attacks. The core idea is that a member sample exhibits more resistance to adversarial perturbations than a non-member. We compare OSLO against state-of-the-art label-only attacks and demonstrate that, despite requiring only one query, our method significantly outperforms previous attacks in terms of precision and true positive rate (TPR) under the same false positive rate (FPR). For example, compared to previous label-only MIAs, OSLO achieves a TPR that is $7\times$ to $28\times$ higher under a 0.1% FPR on CIFAR10 for a ResNet model. We evaluated multiple defense mechanisms against OSLO.

1 Introduction

Deep learning (DL) models are vulnerable to membership inference attacks (MIAs), where an attacker attempts to infer whether a given sample is in the target model’s training set [1]. The success of such MIAs may lead to severe individual privacy breaches as DL models are often trained on private data such as medical records [2] and facial images [3]. Existing black-box MIAs fall into two categories: score-based MIAs [1, 4, 5] and decision-based MIAs [6, 7], also known as label-only MIAs. Score-based attacks assume access to the model’s output confidence scores, which can be defended against if the model only outputs a label. In contrast, label-only attacks infer membership based on the labels returned by the target model, which is a stricter and more practical threat model.

State-of-the-art label-only MIAs measure the sample’s robustness to adversarial perturbations, classifying samples with higher robustness as members. This robustness serves as a proxy for model confidence, which is typically higher for training samples. These attacks utilize query-based black-box adversarial attacks [8, 9] to determine the amount of adversarial perturbation necessary for each sample. However, accurately estimating this amount requires a large number of queries to the model, sometimes up to thousands for each sample. This approach is not only costly but also easily detectable [10]. Furthermore, these attacks often lack precision and are unsuccessful in the low false positive rate regime, as indicated by previous studies [11] and our evaluations. In summary, existing label-only MIAs are not practical.

Table 1: Comparison to previous label-only attacks. We report the highest attack precision that these attacks can achieve with a recall greater than 1% on CIFAR-10 using ResNet in the table.
Attack method   No knowledge about target model architecture   Require no auxiliary data   Attack precision   Num of queries used
Transfer Attack [6]   50.9%   $|\mathcal{D}_{shadow}|$
Data Augmentation [7]   55.7%   $\sim 10$
Boundary Attack (Gaussian Noise) [7]   61.3%   $\sim 700$
Boundary Attack (HopSkipJump) [7, 6]   64.5%   $\sim 6000$
Boundary Attack (QEBA) [6]   71.4%   $\sim 6000$
OSLO (Ours)   96.2%   1
Figure 1: An illustration of OSLO versus the state-of-the-art boundary attack. The boundary attack requires querying the target model thousands of times, whereas OSLO requires only a single query.

We propose novel One-Shot Label-Only (OSLO) MIAs that can infer a sample’s membership in the target model’s training set with just a single query, even when the target model only returns hard labels for input samples. Table 1 compares OSLO to previous label-only MIAs. OSLO is based on constructing a transfer-based adversarial example for each sample, with just enough adversarial perturbation added to cause misclassification if the sample is a non-member (but not enough if it is a member). The perturbation magnitude is calibrated using a set of source and validation models, a technique that was initially proposed in [12]. Prior work has considered using a fixed threshold of adversarial perturbation across all samples to infer membership status, resulting in lower precision [7, 6].

We extensively compared OSLO with state-of-the-art label-only methods [7, 6]. Following recent standards, we evaluated the attacks using metrics such as the true positive rate (TPR) under a low false positive rate (FPR) and performed a precision/recall analysis. Our results show that previous label-only MIAs perform poorly, exhibiting high false positive rates. In contrast, OSLO achieves high precision in identifying members, outperforming prior work by a significant margin. For example, as shown in Figure 2, OSLO is $7\times$ to $67\times$ more powerful than other label-only MIAs in terms of TPR under 0.1% FPR across three datasets using a ResNet [13] model. OSLO achieves over 95% precision on all three datasets on the ResNet model with a recall greater than 0.01, while the highest precision achieved by other attacks is only 71.4%, 84.2%, and 67.5% respectively. Furthermore, we conduct a detailed comparison of OSLO with previous work, exploring why previous label-only attacks fail to achieve high precision.

2 Background and preliminaries

2.1 Adversarial attacks

Early works [14, 15] have demonstrated the susceptibility of deep neural networks (DNNs) to adversarial attacks. These attacks involve crafting subtle perturbations to input data, causing a DNN to make incorrect predictions while the perturbations remain almost imperceptible to humans. Adversarial attacks can significantly compromise model performance, resulting in issues like misclassification or misgeneration. These attacks are typically classified based on the adversary’s level of access to the model [16]: white-box and black-box settings. In white-box attacks [17, 15, 18, 19, 20, 21], adversaries have full knowledge of the victim model, including its architecture, parameters, and gradients. In contrast, black-box attacks [22, 23] limit adversaries to querying the model and obtaining output labels or confidence scores. We primarily discuss black-box attacks to align with the threat model of our MIA settings.

Black-box adversarial attacks.

Black-box attacks primarily include two methods: (i) Query-based attacks iteratively query the target model to gather information and craft adversarial examples, adjusting perturbations based on the model’s feedback. These attacks often require many queries to be effective. (ii) Transfer-based attacks generate adversarial examples using a white-box model and then transfer these examples to attack another model. Previous works have proposed various transfer-based attacks. For instance, Momentum Iterative Fast Gradient Sign Method (MI-FGSM) [21] introduces momentum iterative gradient-based methods for generating adversarial examples to effectively attack both white-box and black-box models. Diverse Inputs Iterative Fast Gradient Sign Method (DI$^2$-FGSM) [24] enhances the transferability of adversarial examples by applying random transformations to input images at each iteration of the attack process. Translation-Invariant Fast Gradient Sign Method (TI-FGSM) [25] optimizes perturbations over a set of translated images to make the adversarial examples less sensitive to specific model features and improve their transferability. Admix [26] enhances traditional input transformations by mixing the input image with images from other categories to create admixed images. These admixed images are used to compute gradients, helping to create more transferable adversarial examples.

2.2 Membership inference attacks

Membership inference attacks (MIAs) [27, 28, 29, 30, 31, 32] aim to determine whether a specific data sample is a member of the training dataset of a target model. Existing MIAs often assume that the attacker can access the target model’s confidence scores to infer membership. These score-based attacks exploit the difference in confidence scores for training versus non-training data, as models often exhibit higher confidence for samples they have seen during training [1, 4, 5]. Such attacks can be easily defended by limiting access to only the predicted labels, which reduces the amount of information available to the attacker.

Label-Only MIAs.

Compared to score-based attacks, a more practical and stringent type of attack is decision-based or label-only MIAs [7, 31], which operate under the constraint of having access only to the final decision labels. Several label-only attacks have been proposed. For example, inspired by the transferability of adversarial examples, transfer attacks [31] explore the transferability of membership information. They use a shadow dataset labeled by the target model to train a shadow model and perform a score-based attack on the shadow model, expecting that the differences between members and non-members in the target model will reflect similarly in the shadow model. Data augmentation attacks [7] exploit the robustness of samples to data augmentation transformations to differentiate members from non-members, with higher robustness indicating membership. Similarly, state-of-the-art label-only MIAs, known as boundary attacks [31, 7], use the robustness of samples to strategic input perturbations as label-only proxies for the model’s confidence, where higher robustness implies a higher confidence score. Boundary attacks have different methods for adding perturbations. For example, Gaussian noise is one method where the attacker continuously adds Gaussian noise until the label changes. Attackers also use adversarial attacks to measure the robustness of samples to adversarial perturbations. Specifically, query-based adversarial attacks like HopSkipJump [9] and QEBA [8] algorithms are used to add perturbations to each sample to generate adversarial examples.

3 OSLO

3.1 Problem definition

Following Ye et al. [33] and Leemann et al. [34], we view MIA as a hypothesis testing problem:

$$H_0 : (x, y) \notin D \quad \text{vs.} \quad H_1 : (x, y) \in D \tag{1}$$

Successfully rejecting the null hypothesis equates to determining that the sample $(x, y)$ is a member of the target model’s training set $D$. Given a sample for inference, we initially assume that the sample is a non-member ($H_0$). Our objective is to design an attack method that can reject the null hypothesis with high confidence.
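
As a reading aid (standard hypothesis-testing terminology, not a formula taken from the paper), the two error-related quantities used in the evaluation can be written in terms of this test, with $H_0$ denoting non-membership as in the paragraph above:

```latex
\begin{align}
\text{FPR} &= \Pr(\text{reject } H_0 \mid H_0 \text{ true})
  && \text{(a non-member is flagged as a member; type I error rate)} \\
\text{TPR} &= \Pr(\text{reject } H_0 \mid H_1 \text{ true})
  && \text{(a member is flagged as a member; power of the test)}
\end{align}
```

Under this reading, rejecting $H_0$ "with high confidence" corresponds to operating at a low FPR.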

3.2 Threat model

Attacker’s capability

Given a target model $f$, we assume that the attacker can only access $f$ in a black-box manner, meaning that for any input $x$, the attacker can only observe $f(x)$ but cannot access the internal parameters $\theta$ or any intermediate outputs of $f$. Furthermore, we consider a more restrictive scenario: label-only access, where the attacker can only obtain the predicted labels (i.e., $\hat{y} = \arg\max f(x)$) from the target model, not the confidence scores or probabilities associated with each class. This label-only, black-box attack scenario represents one of the most challenging and realistic settings for conducting attacks against DL models. Similar to previous attacks [1, 7], it is assumed that the attacker possesses an auxiliary dataset $D_{\text{aux}}$ that shares the same distribution as the target model’s training set for the purpose of training their surrogate models. However, we do not assume the attacker has any knowledge of the target model, including its architecture.

Attacker’s Goal

The attacker’s objective is to reliably infer whether a given sample is a member of the target model’s training set with high precision. As underscored by Carlini et al. [5], identifying individuals within sensitive datasets with high precision poses a significant privacy risk, even if only a few users are correctly identified. Moreover, the precision of these attacks is not just a concern for individual privacy; it also lays the groundwork for more advanced extraction attacks [35], making the pursuit of high precision in MIAs a critical focus for attackers. Additionally, the attacker aims to infer membership with as few queries as possible, as excessive querying not only incurs significant costs but is also detectable by defenders [10]. Our OSLO limits the query budget to a single query.

3.3 Intuition of OSLO

Previous work [4, 1] has demonstrated that target models tend to be overconfident on members. Existing label-only attacks [6, 7] suggest that models are more confident about members, making it more difficult to "change their mind" on these samples. Their experiments also show that members are usually more robust to adversarial perturbations than non-members. We believe that the differences in the required adversarial perturbation between members and non-members can manifest in two ways: (i) on average, members tend to require more adversarial perturbation than non-members; and (ii) for each individual sample, more adversarial perturbation is required when the sample is a member than when the same sample is a non-member. Previous label-only MIAs [6, 7] have exploited the former observation, but this has led to high false positive rates and low precision. Our OSLO is based on the latter. We add just enough adversarial perturbation to cause a misclassification for the sample if it is a non-member but not if it is a member. If the sample is not successfully misclassified, it indicates that the added perturbation was insufficient, and the sample is likely a member. Note that here, ‘sufficient perturbation’ refers to a per-sample standard that varies from sample to sample, rather than a uniform standard for all samples.

3.4 Attack method

Previous label-only MIAs face two main issues: low precision and high query budget. We address both issues simultaneously with an attack built on transfer-based adversarial attacks, which implements the aforementioned hypothesis testing framework. Specifically, we first train a group of surrogate models. For each sample, we construct a transferable adversarial example using the surrogate model(s). Then, we input the crafted adversarial example into the target model and obtain the predicted label. If the sample is misclassified, it indicates that the adversarial example has successfully transferred, implying that there is not enough evidence to reject the null hypothesis $H_0$. Otherwise, we reject $H_0$ and determine the sample to be a member. It is worth noting that the perturbation added to each sample should be sufficient for it to transfer as a non-member but not as a member. Therefore, carefully controlling the scale of perturbation added to each sample is required. The framework of OSLO is divided into three stages: surrogate model training, transferable adversarial example construction, and membership inference.

Surrogate model training

The attacker first trains their own surrogate models to construct adversarial examples. As stated in Section 3.2, we assume that the attacker has an auxiliary dataset with the same distribution as the target model’s training set. The attacker trains a group of models with different architectures on this dataset, training $N$ models for each architecture. These models are then divided into source models and validation models according to their purpose in the adversarial example generation stage. Source models are used to calculate gradients for generating adversarial perturbations, while validation models are used to control the magnitude of the added perturbations.

Algorithm 1: Transferable adversarial example generation
Input: Benign input $x$ with label $y$; source models $g$; validation model $h$; number of sub-procedures $K$; number of iterations $N$; step size $\alpha$; maximum perturbation size $\epsilon$ and threshold $\tau$;
Output: Transfer-based unrestricted adversarial example $x'$ with approximately minimum change;
$x_0 \leftarrow x$;
for $k = 1, 2, \ldots, K$ do
       $x'_0 \leftarrow x_{k-1}$;
       for $i = 1$ to $N$ do
             Compute gradient: $g_i \leftarrow \nabla_x L(g(x'_{i-1}), y)$;
             Update adversarial example using the update function: $x'_i \leftarrow \text{Update}(x'_{i-1}, g_i, \alpha)$;
             Clip $x'_i$ to ensure it is within the $\frac{k\epsilon}{K}$-ball of $x$: $x'_i \leftarrow \text{clip}(x'_i, x - \frac{k\epsilon}{K}, x + \frac{k\epsilon}{K})$;
             $\text{conf} \leftarrow \frac{\exp(h_y(x'_i))}{\sum_j \exp(h_j(x'_i))}$;
             if $\text{conf} < \tau$ then
                   return $x'_i$;   // early-stopping criterion
             end if
       end for
       $x_k \leftarrow x'_N$;
end for
return $x_K$;
Transferable adversarial example generation

After the source models are trained, the attacker generates adversarial examples on these models using transfer-based adversarial attacks. Existing transfer-based attacks typically search for adversarial examples within a fixed $l_p$-norm ball and do not minimize the perturbation added. In other words, these attacks allocate a uniform perturbation budget $\epsilon$ to all samples, accepting any adversarial examples found within this budget. While this approach is reasonable in the context of general adversarial attacks, it is not suitable for our MIA framework, as OSLO requires finer-grained control over the amount of perturbation added.

In OSLO, the perturbation budget for each sample is adaptive. Specifically, we utilize the Geometry-Aware (GA) framework proposed in [12], which employs a set of validation models to regulate the amount of perturbation added to each sample. The process of adding perturbations is incremental. During this process, a validation model is used to monitor the transferability of the sample. The perturbation addition process is terminated early (referred to as early stopping) when the perturbation added is sufficient for the sample to deceive the validation model, and the confidence of the correct class on the validation model drops below a specified threshold $\tau$:

$$\mathbb{P}(\hat{h}(x) = y) = \frac{\exp(h_y(x; \theta))}{\sum_j \exp(h_j(x; \theta))} < \tau. \tag{2}$$

Through this method, the process can be terminated when the transferability of each sample reaches a certain level, resulting in tailored perturbations for each sample. This makes it easier to distinguish how a sample behaves in the "in" world (where it is a member of the training set) from how it behaves in the "out" world (where it is not).

Algorithm 1 outlines our method for generating transferable adversarial examples, adapted from Liu et al.’s GA framework [12] to align with our objective. The algorithm takes as input a benign sample $x$ with its corresponding label $y$, a set of source models denoted as $g$, and a set of validation models $h$. The algorithm proceeds through $K$ sub-procedures, each conducting an iterative refinement of the adversarial example across $N$ iterations, guided by a pre-established step size $\alpha$ and bounded by a maximal perturbation magnitude $\epsilon$. At each iteration, the gradient of the loss function $L$ with respect to the adversarial input is computed, informing the perturbative update. This process is modulated by the validation model $h$. If at any step the confidence of the ground-truth class falls below the threshold $\tau$, early stopping is triggered to prevent over-perturbation, thus ensuring that the perturbations are precisely calibrated for each sample.
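
The control flow of Algorithm 1 can be illustrated with a minimal pure-Python sketch. Everything beyond the loop structure is an assumption made for illustration: the "models" are toy linear-softmax classifiers given as `(W, b)` pairs, the generic `Update` step is instantiated as a sign-gradient ascent step (in the spirit of iterative FGSM), gradients are averaged over the source models, and all names and parameter values are hypothetical rather than taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(model, x):
    """Toy stand-in for a network (assumption): model = (W, b), logits = W x + b."""
    W, b = model
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def loss_gradient(model, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for the toy linear model:
    W^T (softmax(W x + b) - onehot(y))."""
    W, _ = model
    diff = softmax(logits_of(model, x))
    diff[y] -= 1.0
    return [sum(W[c][j] * diff[c] for c in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def generate_adversarial(x, y, source_models, validation_model,
                         K=5, N=10, alpha=0.1, eps=2.0, tau=0.3):
    """Sketch of Algorithm 1: grow the perturbation budget over K sub-procedures
    and stop as soon as the validation model's confidence in y drops below tau."""
    x_k = list(x)                                  # x_0 <- x
    for k in range(1, K + 1):
        radius = k * eps / K                       # budget of the k-th sub-procedure
        x_adv = list(x_k)                          # x'_0 <- x_{k-1}
        for _ in range(N):
            # Gradient averaged over source models (an assumption of this sketch).
            grads = [loss_gradient(g, x_adv, y) for g in source_models]
            avg_grad = [sum(col) / len(grads) for col in zip(*grads)]
            # "Update" instantiated as a sign-gradient ascent step (assumption).
            x_adv = [v + alpha * sign(gv) for v, gv in zip(x_adv, avg_grad)]
            # Clip into the (k * eps / K)-ball around the original input x.
            x_adv = [min(max(v, x0 - radius), x0 + radius) for v, x0 in zip(x_adv, x)]
            conf = softmax(logits_of(validation_model, x_adv))[y]
            if conf < tau:
                return x_adv                       # early-stopping criterion
        x_k = x_adv                                # x_k <- x'_N
    return x_k

# Hypothetical toy setup: one source model, one validation model, a 2-D input.
source = [([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])]
validation = ([[0.8, 0.1], [-0.8, -0.1]], [0.0, 0.0])
x, y = [1.0, 0.0], 0
x_adv = generate_adversarial(x, y, source, validation)
```

In this toy run the loop stops before the full budget `eps` is spent, which is the point of the early-stopping check: the perturbation size is determined per sample by the validation model rather than fixed in advance.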

Membership inference

Finally, for each sample $x$, the attacker inputs the transferable adversarial example $x'$, generated by Algorithm 1, into the target model $f$. The classification outcome determines the evidence against the null hypothesis $H_0$. If $\arg\max_i f(x')_i \neq y$, it suggests insufficient evidence to reject $H_0$. In contrast, a classification of $\arg\max_i f(x')_i = y$ allows us to reject $H_0$, hence classifying $x$ as a member. Our membership inference strategy seeks to reject $H_0$ with high confidence, aiming to achieve high precision in the identification of members.
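
The decision rule above amounts to a single comparison on one hard-label query. The following sketch assumes a hypothetical label oracle; `CountingTarget`, `infer_membership`, and the toy threshold classifier are illustrative names and stand-ins, not part of the paper.

```python
class CountingTarget:
    """Wraps a hard-label oracle and counts how many queries it receives."""

    def __init__(self, predict_label):
        self.predict_label = predict_label
        self.num_queries = 0

    def __call__(self, x):
        self.num_queries += 1
        return self.predict_label(x)

def infer_membership(target, x_adv, y):
    """Reject H0 (non-member) iff the target still predicts the true label y
    on the adversarial example; uses exactly one query to the target."""
    return target(x_adv) == y

# Hypothetical target: predicts class 0 when the first feature is positive.
toy_target = CountingTarget(lambda x: 0 if x[0] > 0 else 1)

survived = infer_membership(toy_target, [0.2, 0.0], 0)   # label unchanged: inferred member
fooled = infer_membership(toy_target, [-0.2, 0.0], 0)    # label flipped: H0 not rejected
```

Each call issues one query per sample, which is what the query counter is meant to make visible.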

4 Evaluation

4.1 Experimental setup

Datasets.

We utilized three datasets commonly used in prior works [5]: CIFAR-10 [36], CIFAR-100 [36], and SVHN [37]. For CIFAR-10 and CIFAR-100, we selected 25,000 samples to train the target model under attack. For SVHN, we randomly selected 2,000 samples to train the target model. For each dataset, we trained the surrogate models used for the attack with a set of data that is the same size as the training set of the target model but disjoint.

Models. We adopted the widely used ResNet18 [13] and DenseNet121 [38] as the target model architectures. In addition to these two, we incorporated five additional model architectures as surrogate models. Detailed training and attack setup are provided in Appendix A.2.

Metrics. We employ the following metrics to evaluate the effectiveness of the attacks: (i) attack TPR and FPR [5]: Carlini et al. suggest that membership inference attacks should be analyzed using TPR and FPR, especially considering that TPR in a low FPR setting is crucial for assessing the success of MIAs. We report the log ROC curve to show the TPR and FPR of the attack, with a focus on the low FPR regime; (ii) attack precision and recall: Attack precision reflects how reliably an attack can infer membership. High-precision attacks allow an attacker to credibly violate the privacy of a portion of the samples. Specifically, precision is defined as the proportion of true members correctly identified among all positively predicted members by an adversary. Meanwhile, recall quantifies the proportion of true members that the adversary has correctly identified out of all actual members. We discarded average-case metrics such as accuracy and AUC, as they have been shown not to accurately reflect the threat of MIAs [5].
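
For concreteness, these metrics can be computed from binary attack decisions as in the sketch below. The function name and the toy data are hypothetical and only illustrate the definitions; they are not results from the paper.

```python
def attack_metrics(predicted_member, is_member):
    """TPR (equal to recall), FPR, and precision for one operating point of an attack.

    predicted_member: list of bools, the attack's decision per sample.
    is_member:        list of bools, the ground-truth membership per sample.
    """
    pairs = list(zip(predicted_member, is_member))
    tp = sum(1 for p, t in pairs if p and t)          # members flagged as members
    fp = sum(1 for p, t in pairs if p and not t)      # non-members flagged as members
    fn = sum(1 for p, t in pairs if not p and t)      # members missed
    tn = sum(1 for p, t in pairs if not p and not t)  # non-members correctly passed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # Precision is undefined when the attack flags nothing.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"tpr": tpr, "recall": tpr, "fpr": fpr, "precision": precision}

# Toy data (illustrative only): 10 members followed by 10 non-members.
truth = [True] * 10 + [False] * 10
decisions = [True] * 2 + [False] * 8 + [True] * 1 + [False] * 9
metrics = attack_metrics(decisions, truth)
```

Sweeping an attack parameter (for OSLO, the threshold used during adversarial example generation) and recomputing these values at each setting yields the points of the ROC and precision-recall curves.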

4.2 Results

Figure 2: ROC curves for various label-only attacks on three different datasets on ResNet18. Panels: (a) CIFAR10, (b) CIFAR100, (c) SVHN.
Figure 3: ROC curves for various label-only attacks on three different datasets on DenseNet121. Panels: (a) CIFAR10, (b) CIFAR100, (c) SVHN.

4.2.1 Evaluating attack TPR under Low FPR

We first evaluate six attacks, including OSLO, focusing on TPR in the low-FPR regime. For OSLO, we control the trade-off between TPR and FPR by adjusting the threshold $\tau$ during the transferable adversarial example generation phase. A lower $\tau$ results in more perturbation added to each sample. Consequently, if the samples still fail to deceive the target model, it is more likely that they are members, thereby reducing FPR. For other attacks, we adjust their corresponding parameters to manage TPR and FPR. For instance, in the boundary attack, increasing the perturbation threshold for identifying members ensures that only samples requiring more significant perturbations are classified as members. Based on the assumption that members generally need greater perturbation than non-members, this method should minimize the number of samples mistakenly identified as members and enhance the credibility of the attack.

As shown in Figure 2, OSLO outperforms all previous label-only MIAs by a large margin. For example, compared to previous attacks, OSLO achieves a TPR that ranges from $7\times$ to $67\times$ higher under 0.1% FPR, and from $3.3\times$ to $31.5\times$ higher under 1% FPR on ResNet, evaluated across three datasets. On DenseNet, the improvements range from $1.25\times$ to $33\times$ under 0.1% FPR, and from $2.8\times$ to $9.5\times$ under 1% FPR. It can be observed that previous attacks were unsuccessful in the low FPR regime, whereas OSLO can successfully identify some members even at low FPR. For example, on CIFAR-100 using ResNet, at 0.1% FPR, 2 of the 5 previous attacks achieved a TPR of zero and the best of the remaining three managed only a 0.3% TPR, while OSLO reached 6.7% TPR. Under 1% FPR, OSLO achieved 18.9% TPR, with other attacks achieving at most 2% TPR. These comparisons show that, although OSLO requires only one query to the target model, it can reliably identify a portion of the members, while previous attacks, despite performing up to thousands of queries, are almost unable to credibly identify members.

4.2.2 Precision and recall analysis

Figure 4: Precision-Recall curves for various label-only attacks on ResNet18. Each line represents the trade-off between precision and recall for a different method as the attack parameter is varied. Panels: (a) CIFAR10, (b) CIFAR100, (c) SVHN.
Figure 5: Precision-Recall curves for various label-only attacks on DenseNet121. Each line represents the trade-off between precision and recall for a different method as the attack parameter is varied. Panels: (a) CIFAR10, (b) CIFAR100, (c) SVHN.

We then measured the trade-off between attack precision and recall. In MIAs, precision is considered a crucial metric, whereas recall is relatively less important [5]. For instance, a precision of 1.0 means that the attacker can be 100% certain that the identified samples are members, leading to significant privacy breaches, even if only a few samples are affected. Conversely, a recall of 1.0 can be achieved by classifying all samples as members, but this does not cause any privacy breach. For MIAs, a desired outcome would be the ability to trade recall for increased precision.
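
The remark about a recall of 1.0 can be made explicit with a short worked calculation. Here $m$ and $n$ are illustrative symbols (not notation from the paper): suppose $m$ of the $n$ candidate samples are true members and the attack labels every candidate as a member, so that $\text{TP} = m$, $\text{FP} = n - m$, and $\text{FN} = 0$.

```latex
\begin{align}
\text{recall}    &= \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{m}{m + 0} = 1, \\
\text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{m}{m + (n - m)} = \frac{m}{n}.
\end{align}
```

The precision of this trivial attack is just the base rate of members among the candidates (for example $0.5$ when $n = 2m$), so perfect recall by itself tells the attacker nothing about any individual sample.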

As shown in Figure 4 and Figure 5, OSLO is the only one that effectively trades recall for precision. For example, by trading recall, OSLO achieves over 90% precision across all three datasets with ResNet as the target model. Specifically, on CIFAR-10, CIFAR-100, and SVHN, with recall rates of 5%, 44.7%, and 9% respectively, OSLO attained precisions of 96.2%, 92.0%, and 90.9%. In contrast, the highest precision achieved by other attacks at similar recall levels was 63.2%, 82.5%, and 65.2% respectively. This demonstrates that OSLO can identify members with high precision, a capability that all previous label-only attacks could not match. Previous methods could not improve precision by adjusting hyper-parameters, due to limitations inherent in their designs. We discuss this further in Section 6.1.

Figure 6: Precision-Recall curve showing the attack performance of OSLO using different transfer-based adversarial attacks on CIFAR-10.

5 Ablation study

5.1 Effect of using different adversarial attacks

As introduced in Section 3.4, OSLO leverages transfer-based adversarial attacks. We evaluated the membership inference performance of OSLO with six different transfer-based adversarial techniques: TI [25], DI [24], MI [21], Admix [26], and the combined methods TDMI and TMDAI. The results are presented in Figure 6. We observe no significant difference in attack performance across these techniques. This may suggest that the efficacy of the attack relies predominantly on the design of the OSLO framework rather than on the specific choice of transfer technique.

5.2 Effect of using more than one shot

Figure 7: Precision-Recall curve illustrating the attack performance of OSLO using more shots.

OSLO employs a single query to the target model to determine a sample’s membership status. We also explore the potential advantages of extending OSLO to use multiple queries per sample, referred to as multi-shot. Specifically, Figure 7 illustrates the precision and recall of OSLO under different adversarial example generation thresholds, $\tau$.

We further report on combining results from the current threshold with those from previous, higher thresholds. For example, "three shots" means feeding the images generated by the current shot and the previous two shots to the target model, and identifying a sample as a member only if all three adversarial examples fail to fool the target model. As shown in Figure 7, using more shots does not yield a significant improvement. We attribute this to the observation that even with multiple shots, only the shot with the lowest $\tau$ plays a decisive role.
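The multi-shot decision rule described above can be sketched as a simple conjunction over shots. This is a minimal illustration, not the authors' implementation: the function names and the label values are hypothetical, and the shots are assumed to be ordered from the highest threshold to the lowest.

```python
def multi_shot_is_member(target_labels, true_label):
    """Member only if every adversarial example failed to change the label.

    target_labels: labels returned by the target model for the adversarial
    examples of one sample, ordered from highest to lowest threshold
    (ordering is an assumption of this sketch).
    """
    return all(label == true_label for label in target_labels)


def one_shot_is_member(target_labels, true_label):
    """Decision that uses only the last (lowest-threshold) shot."""
    return target_labels[-1] == true_label


# Toy label sequences (assumption) for three samples whose true label is 3.
samples = {
    "A": [3, 3, 3],  # no shot fools the target
    "B": [3, 3, 7],  # only the lowest-threshold shot fools the target
    "C": [3, 7, 7],  # the two lowest-threshold shots fool the target
}

multi = {name: multi_shot_is_member(labels, 3) for name, labels in samples.items()}
single = {name: one_shot_is_member(labels, 3) for name, labels in samples.items()}
print(multi)
print(single)
```

On these toy sequences the three-shot rule and the single lowest-threshold shot reach the same decisions, which is consistent with the observation that the final shot is the decisive one. A sequence in which a higher-threshold shot succeeds while the lowest-threshold shot fails would make the two rules disagree; the sketch does not claim how often that occurs in practice.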

6 Discussion

6.1 Why OSLO outperforms previous approaches

Figure 8: Comparison of (a) the boundary attack (HopSkipJump) and (b) OSLO (ours) regarding proportions of true positives (accurately identified members) and false positives (misclassified non-members) under varying threshold settings on CIFAR-10.

In this section, we discuss why previous attacks fail to achieve high precision or TPR under low FPR conditions, and how OSLO outperforms them. The boundary attack employs a global threshold on perturbation magnitude to differentiate between members and non-members. However, regardless of the chosen threshold, a considerable proportion of non-members is always found among the samples that require perturbations above this threshold (see the cumulative distribution function (CDF) in Figure 8(a)). In contrast, lowering the validation threshold $\tau$ in OSLO significantly decreases the proportion of non-members among the samples on which the adversarial attack fails, effectively squeezing out the non-members. For example, when classifying 4.4% of the samples as members, the boundary attack still includes 39% non-members (false positives) among those identified as members, whereas OSLO includes only 10% non-members, resulting in 90% attack precision. We examine this further by measuring the magnitude of the adversarial perturbation added to members and non-members in the boundary attack versus OSLO on CIFAR-100; details are in Appendix A.1.
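The precision figures quoted above follow directly from the share of false positives among the samples flagged as members. As a short worked step, writing $TP$ and $FP$ for the numbers of true and false positives:

```latex
\text{precision} = \frac{TP}{TP + FP} = 1 - \frac{FP}{TP + FP},
\qquad
\text{boundary attack: } 1 - 0.39 = 0.61,
\qquad
\text{OSLO: } 1 - 0.10 = 0.90 .
```

Both attacks flag the same 4.4% of samples in this comparison, so the gap in precision reflects which samples are flagged rather than how many.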

6.2 Mitigation

Figure 9: Attack TPR under 1% FPR of OSLO against various defense mechanisms.

In addition to demonstrating the effectiveness of OSLO, we also evaluated six defensive strategies [39, 40, 41] aimed at mitigating membership privacy leakage.

For each defense mechanism, we trained three target models using different hyper-parameters; the results are presented in Figure 9. As expected, confidence alteration methods such as MemGuard [39] are ineffective against OSLO. Other methods that mitigate memorization require strong regularization to be effective, which in turn decreases the model’s test accuracy.

7 Conclusion

In this paper, we propose OSLO, a One-Shot Label-Only membership inference attack (MIA). OSLO is based on transfer-based adversarial attacks and identifies members by the difference in the perturbation required when a sample is a member versus a non-member. We empirically demonstrate that OSLO not only operates under the label-only setting with a single query but also outperforms all previous label-only attacks by a large margin in terms of attack precision and true positive rate (TPR) under the same false positive rate (FPR). We also discuss why OSLO outperforms existing methods. Furthermore, we evaluate multiple defenses against OSLO, highlighting its robustness.

Limitations.

Despite achieving impressive results using a single query, OSLO requires training surrogate models, which need an auxiliary dataset from the same distribution as the target model’s training set. However, this is a common assumption in MIAs [1, 5]. Furthermore, this assumption may be relaxed by using synthetic data to train the surrogate model, as demonstrated in previous work [1].

References

  • [1] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2017.
  • [2] Bradley J Erickson, Panagiotis Korfiatis, Zeynettin Akkus, and Timothy L Kline. Machine learning for medical imaging. Radiographics, 37(2):505–515, 2017.
  • [3] Yaron Gurovich, Yair Hanani, Omri Bar, Guy Nadav, Nicole Fleischer, Dekel Gelbman, Lina Basel-Salmon, Peter M Krawitz, Susanne B Kamphausen, Martin Zenker, et al. Identifying facial phenotypes of genetic disorders using deep learning. Nature medicine, 25(1):60–64, 2019.
  • [4] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246, 2018.
  • [5] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914. IEEE, 2022.
  • [6] Zheng Li and Yang Zhang. Membership leakage in label-only exposures. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 880–895, 2021.
  • [7] Christopher A Choquette-Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot. Label-only membership inference attacks. In International conference on machine learning, pages 1964–1974. PMLR, 2021.
  • [8] Huichen Li, Xiaojun Xu, Xiaolu Zhang, Shuang Yang, and Bo Li. Qeba: Query-efficient boundary-based blackbox attack. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1221–1230, 2020.
  • [9] Jianbo Chen, Michael I Jordan, and Martin J Wainwright. Hopskipjumpattack: A query-efficient decision-based attack. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1277–1294. IEEE, 2020.
  • [10] Huiying Li, Shawn Shan, Emily Wenger, Jiayun Zhang, Haitao Zheng, and Ben Y. Zhao. Blacklight: Scalable defense for neural networks against Query-Based Black-Box attacks. In 31st USENIX Security Symposium (USENIX Security 22), pages 2117–2134, Boston, MA, August 2022. USENIX Association.
  • [11] Zitao Chen and Karthik Pattabiraman. Overconfidence is a dangerous thing: Mitigating membership inference attacks by enforcing less confident prediction. arXiv preprint arXiv:2307.01610, 2023.
  • [12] Fangcheng Liu, Chao Zhang, and Hongyang Zhang. Towards transferable unrestricted adversarial examples with minimum changes. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 327–338. IEEE, 2023.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [16] Joana C Costa, Tiago Roxo, Hugo Proença, and Pedro RM Inácio. How deep learning sees the world: A survey on adversarial attacks & defenses. arXiv preprint arXiv:2305.10862, 2023.
  • [17] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • [18] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574–2582, 2016.
  • [19] Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.
  • [20] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [21] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018.
  • [22] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European conference on computer vision, pages 484–501. Springer, 2020.
  • [23] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. Feature-guided black-box safety testing of deep neural networks. In Tools and Algorithms for the Construction and Analysis of Systems: 24th International Conference, TACAS 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings, Part I 24, pages 408–426. Springer, 2018.
  • [24] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019.
  • [25] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4312–4321, 2019.
  • [26] Xiaosen Wang, Xuanran He, Jingdong Wang, and Kun He. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16158–16167, 2021.
  • [27] Klas Leino and Matt Fredrikson. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. In 29th USENIX Security Symposium (USENIX Security 20), pages 1605–1622, 2020.
  • [28] Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE symposium on security and privacy (SP), pages 739–753. IEEE, 2019.
  • [29] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Yann Ollivier, and Hervé Jégou. White-box vs black-box: Bayes optimal strategies for membership inference. In International Conference on Machine Learning, pages 5558–5567. PMLR, 2019.
  • [30] Bo Hui, Yuchen Yang, Haolin Yuan, Philippe Burlina, Neil Zhenqiang Gong, and Yinzhi Cao. Practical blind membership inference attack via differential comparisons. arXiv preprint arXiv:2101.01341, 2021.
  • [31] Yiyong Liu, Zhengyu Zhao, Michael Backes, and Yang Zhang. Membership inference attacks by exploiting loss trajectory. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2085–2098, 2022.
  • [32] Lauren Watson, Chuan Guo, Graham Cormode, and Alexandre Sablayrolles. On the importance of difficulty calibration in membership inference attacks. In International Conference on Learning Representations, 2022.
  • [33] Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. Enhanced membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 3093–3106, 2022.
  • [34] Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [35] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
  • [36] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [37] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • [38] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [39] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. Memguard: Defending against black-box membership inference attacks via adversarial examples. In Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, pages 259–274, 2019.
  • [40] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 634–646, 2018.
  • [41] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [42] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • [45] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
  • [46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 630–645. Springer, 2016.

Appendix A Appendix

A.1 Adversarial perturbation magnitude for members and non-members

Figure 10: The magnitude of adversarial perturbations added to samples in (a) the boundary attack (HopSkipJump) and (b) OSLO (ours) on CIFAR-100. We adjusted the parameters of the two attacks such that they both classify 15.8% of the samples as members (marked in yellow). The precision of OSLO is 95.2%, while the precision of the boundary attack is 79.3%.

In Section 6.1, we discussed why OSLO outperforms previous attacks. We explore this further by measuring the perturbation magnitude added to members versus non-members in both the boundary attack and OSLO. The results are presented in Figure 10. We set $\tau = 0.01$ for OSLO, which resulted in classifying 15.8% of the samples as members. We also adjusted the parameters of the boundary attack so that it classifies the same proportion of samples as members.

From Figure 10(a), it can be observed that using a single threshold to attempt to distinguish members from non-members leads to a significant proportion of non-members among the samples classified as members, regardless of the threshold value selected (i.e., drawing a horizontal line in the figure, with those above the line classified as members). For example, as shown in Figure 10(a), 20.7% of the samples classified as members by the boundary attack are actually non-members. Conversely, OSLO conducts a hypothesis test for each sample, enabling it to accurately identify members even if their perturbations are relatively low compared to other samples. In Figure 10(b), only 4.8% of the samples classified as members by OSLO are non-members.

A.2 Training and implementation details

Training setup.

We consider ResNet18 [13] and DenseNet121 [38] as target models, and additionally, Inception [42], VGG13 [43], ResNeXt50 [44], ShuffleNet [45], and PreActResNet18 [46] are utilized as surrogate models. In our default setup, all models are trained for 100 epochs using the Adam optimizer with a learning rate of $0.001$. We apply an $L2$ weight decay coefficient of $10^{-6}$ and use a batch size of $128$.
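For reference, the default hyper-parameters listed above can be collected in one place. The sketch below is only a convenience container: the class name `TrainConfig` and its field names are hypothetical, and the paper does not describe its configuration code. The `adam_kwargs` helper uses the keyword names `lr` and `weight_decay` accepted by `torch.optim.Adam`, where `weight_decay` acts as an L2 penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Default training hyper-parameters as stated in the text."""
    epochs: int = 100
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    weight_decay: float = 1e-6  # L2 weight decay coefficient
    batch_size: int = 128

    def adam_kwargs(self):
        """Keyword arguments named as torch.optim.Adam expects them."""
        return {"lr": self.learning_rate, "weight_decay": self.weight_decay}


config = TrainConfig()
print(config)
print(config.adam_kwargs())
```

With PyTorch available, these values would typically be passed as `torch.optim.Adam(model.parameters(), **config.adam_kwargs())`, where `model` stands for any target or surrogate network.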

Implementation details.

The OSLO setup primarily includes the configuration of source and validation models, as well as the parameters for generating adversarial examples. By default, we use Inception [42], ResNeXt50 [44], and PreActResNet18 [46] as the source model architectures. For validation models, we use DenseNet121 [38], VGG13 [43], and ShuffleNetV2 [45] when ResNet18 [13] is the target model, and ResNet18, VGG13, and ShuffleNetV2 when DenseNet121 is the target model. We follow the assumption that the attacker has no knowledge of the target model architecture and therefore do not include the target architecture among the source or validation models. More combinations of source and validation models are discussed in Appendix A.3. For the adversarial example generation parameters, we set the maximum perturbation to $80/255$ and the number of sub-procedures to 80 (increasing $\epsilon$ by $1/255$ each iteration).
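The perturbation schedule described above (80 sub-procedures, each raising $\epsilon$ by $1/255$ up to $80/255$) can be sketched as follows. The stopping rule is an assumption of this sketch: it supposes that generation stops at the first budget for which the validation models' confidence in the true label falls below $\tau$. The callable `validation_confidence` and the toy confidence curve are hypothetical stand-ins for crafting an adversarial example on the source models and scoring it on the validation models.

```python
def epsilon_schedule(max_eps=80 / 255, steps=80):
    """Evenly spaced perturbation budgets eps_1 < ... < eps_steps = max_eps."""
    step = max_eps / steps
    return [step * (i + 1) for i in range(steps)]


def first_passing_budget(validation_confidence, tau, schedule):
    """Return the smallest budget whose validation confidence is below tau.

    validation_confidence: hypothetical callable mapping a budget eps to the
    validation models' mean confidence in the true label of the adversarial
    example crafted at that budget.
    Returns None when no budget in the schedule passes.
    """
    for eps in schedule:
        if validation_confidence(eps) < tau:
            return eps
    return None


def toy_confidence(eps):
    """Toy stand-in (assumption): confidence decays linearly with eps."""
    return max(0.0, 1.0 - 4.0 * eps)


schedule = epsilon_schedule()
chosen = first_passing_budget(toy_confidence, tau=0.5, schedule=schedule)
print(len(schedule), round(chosen * 255))
```

Under this reading, a smaller $\tau$ forces the loop to continue to larger budgets before the adversarial example is accepted, so fewer samples survive the single query to the target model; this matches the direction of the precision and recall trade-off reported in the main text.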

Experimental setup.

All our models and methods are implemented in PyTorch. Our experiments are conducted on an NVIDIA GeForce RTX 2080 Ti with 20 GB of memory. The maximum training time for each surrogate or target model does not exceed 1.9 hours.

A.3 Effect of combinations of different source and validation models

We tested the effect of using different source model architectures and validation model structures on OSLO when ResNet18 [13] is the target model. Specifically, we considered the following model structures: DenseNet121 ("1") [38], Inception-V3 ("2") [42], VGG13 ("3") [43], ResNeXt50 ("4") [44], ShuffleNet-V2 ("5") [45], and Pre-Activation ResNet ("6") [46]. We selected different combinations from these models as source models or validation models. As shown in Figure 11(a), different combinations have no significant impact on the trade-off between precision and recall, although they may affect the maximum achievable precision. For example, beyond a certain point, it becomes impossible to trade recall for additional precision. In such cases, we tested the effect of increasing the number of models (num) trained for each validation model architecture. As depicted in Figure 11(b), increasing the number of validation models helps to make the trade-off curve more complete, as OSLO requires adding more perturbations to bypass more validation models. This helps to further reduce recall and increase precision.

Figure 11: Precision-Recall curve comparing the attack precision and recall for OSLO across various combinations of source and validation models, shown in panels (a) and (b).