In the centralized learning setting, the main assumption is that models and data are collocated during the training phase. The next subsection introduces a common design approach that is used by multiple papers, namely, the use of shadow models or shadow training. The remaining subsections are dedicated to the different attack types and introduce the assumptions, common elements, as well as the differences among the reviewed papers.
6.1.2 Membership Inference Attacks.
In black-box membership inference attacks, the most common attack pattern is the use of shadow models. The output of the shadow models is a prediction vector [40, 51, 106, 110, 116, 125] or only a label [68]. The labels used for the attack dataset come from the test and training splits of the shadow data, where the data points that belong to the test set are labeled as non-members of the training set. The meta-model is trained to recognize patterns in the prediction vector output of the target model. These patterns allow the meta-model to infer whether a data point belongs to the training dataset or not. The number of shadow models affects the attack accuracy, but it also increases the cost for the attacker. Salem et al. [110] showed that membership inference attacks are possible with as few as one shadow model.
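To make the shadow-training pattern concrete, the following minimal sketch trains a single shadow model, labels its own training points as members and held-out points as non-members, and fits a meta-model on the shadow model's prediction vectors. The dataset, model types, and the single-shadow-model simplification are illustrative assumptions, not the exact setup of any of the cited papers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data standing in for a shadow dataset drawn from the
# (assumed known) distribution of the target's training data.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=2, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# 1. Train a single shadow model (one can suffice, as in Salem et al. [110]).
shadow = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_in, y_in)

# 2. Build the attack dataset: prediction vectors labeled member / non-member.
attack_X = np.vstack([shadow.predict_proba(X_in), shadow.predict_proba(X_out)])
attack_y = np.concatenate([np.ones(len(X_in)), np.zeros(len(X_out))])

# 3. Train the meta-model (attack model) on the prediction vectors.
meta = LogisticRegression().fit(attack_X, attack_y)

# 4. At attack time, the attacker queries the *target* model and feeds its
#    prediction vector to the meta-model. The shadow model stands in here.
candidate = X_in[:5]
print(meta.predict(shadow.predict_proba(candidate)))  # 1 = predicted member
```

In practice the attack dataset is often split per class and one meta-model is trained per output label, but the overall flow is the one shown above.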
Shadow training can be further reduced to a threshold-based attack, where instead of training a meta-model, one can calculate a suitable threshold function that indicates whether a sample is a member of the training set. The threshold can be learned from multiple shadow models [108] or even without using any shadow models [140]. Sablayrolles et al. [108] showed that a Bayes-optimal membership inference attack depends only on the loss and that their attack outperforms previous attacks such as References [116, 140]. In terms of attack accuracy, they reported up to 90.8% on large neural network models such as VGG16 [71] trained for classification on the Imagenet [23] dataset.
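The threshold variant reduces to a few lines of code. The sketch below assumes, as a simplification, that members tend to incur a lower loss and that the attacker has already calibrated a threshold (for example on shadow models [108] or with a fixed global value [140]); the synthetic loss values are purely illustrative.

```python
import numpy as np

def loss_threshold_attack(per_example_loss: np.ndarray, tau: float) -> np.ndarray:
    """Predict membership (1) when the target model's loss on a sample
    is below the threshold tau, non-membership (0) otherwise."""
    return (per_example_loss < tau).astype(int)

# Toy illustration: members usually incur lower loss than non-members.
rng = np.random.default_rng(0)
member_losses = rng.exponential(scale=0.2, size=1000)      # assumed member losses
non_member_losses = rng.exponential(scale=1.0, size=1000)  # assumed non-member losses
losses = np.concatenate([member_losses, non_member_losses])
truth = np.concatenate([np.ones(1000), np.zeros(1000)])

tau = 0.5  # e.g., an average loss observed on shadow models
guess = loss_threshold_attack(losses, tau)
print("attack accuracy:", (guess == truth).mean())
```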
A Bayes-optimal attack was also proposed in the white-box scenario and for linear models [66]. While the optimal attack requires strong assumptions regarding the target data distribution and the attacker's knowledge, further relaxations make it feasible even for deep neural network targets. The attack on linear models requires the training of an identical proxy model, which is used to calculate differences from the white-box model's weights; these differences are subsequently used for membership inference. For deep neural networks, each layer is replaced by a local linear approximation, which can be used for the attack in a similar manner.
In addition to relaxations on the number of shadow models, attacks have been shown to be data driven, i.e., an attack can be successful even if the target model is of a different type than the shadow and meta-models [125]. The authors tested several types of models, such as k-NN, logistic regression, decision trees, and naive Bayes classifiers, in different combinations in the roles of target, shadow, and meta-model. The results showed that (i) using different types of models did not affect the attack accuracy and (ii) in most cases, models such as decision trees outperformed neural networks in terms of attack accuracy and precision.
Shadow model training requires a shadow dataset. One of the main assumptions of membership inference attacks on supervised learning models is that the adversary has no or limited knowledge of the training samples used; however, the adversary knows something about the underlying data distribution of the target's training data. If the adversary does not have access to a suitable dataset, then they can try to generate one [116, 125]. Access to statistics about the probability distribution of several features allows an attacker to create the shadow dataset using sampling techniques. If statistics-based generation is not possible, then a query-based approach using the target model's prediction vectors is another possibility. Generating auxiliary data using GANs was also proposed by Hayes et al. [39]. If the adversary manages to find input data that generate predictions with high confidence, then no prior knowledge of the data distribution is required for a successful attack [116]. Salem et al. [110] went so far as to show that it is not even necessary to train the shadow models using data from the same distribution as the target, making the attack more realistic, since it does not assume any knowledge of the training data.
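As a rough illustration of the statistics-based generation mentioned above, the sketch below samples synthetic shadow records from assumed per-feature marginal statistics. The independence assumption, the Gaussian marginals, and the numbers are simplifications introduced here for illustration; real attacks may instead rely on query-based synthesis driven by the target's prediction vectors [116].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical published statistics about the target's data distribution:
# per-feature means and standard deviations (feature independence is assumed).
feature_means = np.array([35.0, 50_000.0, 0.3])
feature_stds = np.array([10.0, 15_000.0, 0.1])

def synthesize_shadow_data(n_samples: int) -> np.ndarray:
    """Draw synthetic records feature-by-feature from the marginal statistics."""
    return rng.normal(loc=feature_means, scale=feature_stds,
                      size=(n_samples, len(feature_means)))

shadow_data = synthesize_shadow_data(5000)
print(shadow_data.shape, shadow_data.mean(axis=0))
```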
The previous discussion is mostly relevant to supervised classification or regression tasks. The efficacy of membership inference attacks against sequence-to-sequence models for machine translation was studied by Reference [46]. The authors used shadow models that try to mimic the target model's behavior and then used a meta-model to infer membership. They found that sequence generation models are much harder to attack than models trained for other tasks such as image classification. However, membership of out-of-domain and out-of-vocabulary data was easier to infer.
Membership inference attacks are also applicable to deep generative models such as GANs and VAEs [16, 39, 44]. Since these models have more than one component (generator/discriminator, encoder/decoder), adversarial threat modeling needs to take that into account. For these types of models, the taxonomy proposed by Chen et al. [16] is partially followed. We consider black-box access to the generator as the ability to access generated samples, and partial black-box access as the ability to provide inputs in the latent space \(z\) and generate samples. Having access to the generator model and its parameters is considered a white-box attack. The ability to query the discriminator is also a white-box attack.
Full white-box attacks with access to the GAN discriminator are based on the assumption that if the GAN has “overfitted,” then the data points used for its training will receive higher confidence values as output by the discriminator [39]. In addition to the previous attack, Hayes et al. [39] proposed a set of attacks in the partial black-box setting. These attacks are applicable to GANs, VAEs, or any other generative model. If the adversary has no auxiliary data, then they can attempt to train an auxiliary GAN whose discriminator distinguishes between data generated by the target generator and data generated by the auxiliary GAN. Once the auxiliary GAN is trained, its discriminator can be used for the white-box attack. The authors also considered scenarios where the adversary may have auxiliary information, such as knowledge of the target's training and test data. Using the auxiliary data, they can train another GAN whose discriminator is able to distinguish between members of the original training set and non-members.
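A minimal sketch of the white-box discriminator attack follows: the attacker ranks candidate points by the discriminator's confidence and flags the top-scoring ones as suspected training members [39]. The discriminator below is a stand-in callable, and the fixed number of flagged members is an assumption made for illustration.

```python
import numpy as np

def discriminator_confidence_attack(discriminator, candidates, n_members):
    """Rank candidate points by discriminator confidence and flag the
    top-n as suspected members of the GAN's training set."""
    scores = np.array([discriminator(x) for x in candidates])
    ranked = np.argsort(scores)[::-1]            # highest confidence first
    members = np.zeros(len(candidates), dtype=int)
    members[ranked[:n_members]] = 1
    return members

# Stand-in for a real GAN discriminator: any callable mapping a sample to a
# probability that the sample is "real".
fake_discriminator = lambda x: float(np.clip(x.mean(), 0.0, 1.0))

candidates = [np.random.rand(28, 28) for _ in range(100)]
flags = discriminator_confidence_attack(fake_discriminator, candidates, n_members=10)
print("flagged as members:", flags.sum())
```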
A distance-based attack over the nearest neighbors of a data point was proposed by Chen et al. [16] for the full black-box model. In this case, a data point \(\mathbf{x}\) is a member of the training set if, within its k-nearest neighbors among the generated samples, there is at least one point with a distance lower than a threshold \(\epsilon\). The authors proposed more complex attacks as the level of knowledge of the adversary increases, based on the idea that the reconstruction error between the real data point \(\mathbf{x}\) and a sample generated by the generator given some input \(z\) should be smaller if the data point comes from the training set.
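The full black-box variant can be sketched in a few lines: collect samples from the target generator and declare a candidate a member if any of its k nearest generated samples lies within \(\epsilon\). The distance metric, k, \(\epsilon\), and the synthetic data below are illustrative choices, not the exact configuration of Chen et al. [16].

```python
import numpy as np

def distance_membership_attack(generated_samples, x, k=5, epsilon=1.0):
    """Return 1 (member) if at least one of the k nearest generated samples
    is closer to x than epsilon, else 0 (non-member)."""
    dists = np.linalg.norm(generated_samples - x, axis=1)
    k_nearest = np.sort(dists)[:k]
    return int((k_nearest < epsilon).any())

# Toy illustration with a stand-in for the target GAN's output.
rng = np.random.default_rng(0)
generated = rng.normal(size=(10_000, 64))     # samples obtained from the generator
candidate = rng.normal(size=64)
print(distance_membership_attack(generated, candidate, k=5, epsilon=8.0))
```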
6.1.3 Reconstruction Attacks.
The initial reconstruction attacks were based on the assumption that the adversary has access to the model \(f\), the priors of the sensitive and nonsensitive features, and the output of the model for a specific input \(x\). The attack was based on estimating the values of sensitive features, given the values of nonsensitive features and the output label [28]. This method used a maximum a posteriori (MAP) estimate of the attribute that maximizes the probability of observing the known parameters. Hidano et al. [43] used a similar attack, but they made no assumption about knowledge of the nonsensitive attributes. For their attack to work, they assumed that the adversary can perform a model poisoning attack during training.
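The MAP-style attribute inference can be illustrated as follows: enumerate the possible values of a single sensitive feature, score each completed record by the model's probability of the observed label weighted by the attribute prior, and return the highest-scoring value. The model interface, the toy classifier, and the prior below are hypothetical stand-ins rather than the exact formulation of Reference [28].

```python
import numpy as np

def map_attribute_inference(predict_proba, known_features, observed_label,
                            candidate_values, prior):
    """MAP-style estimate of a single sensitive feature: pick the value that
    maximizes P(observed_label | features) * P(value)."""
    best_value, best_score = None, -np.inf
    for value, p_value in zip(candidate_values, prior):
        features = np.append(known_features, value).reshape(1, -1)
        likelihood = predict_proba(features)[0, observed_label]
        score = likelihood * p_value
        if score > best_score:
            best_value, best_score = value, score
    return best_value

# Stand-in target model: logistic-style scores over 2 classes, with the
# sensitive feature appended as the last column.
def toy_predict_proba(x):
    logit = 0.8 * x[:, -1] - 0.1 * x[:, 0]
    p1 = 1.0 / (1.0 + np.exp(-logit))
    return np.column_stack([1 - p1, p1])

known = np.array([2.0, 0.5])          # nonsensitive features known to the attacker
prior = [0.7, 0.3]                    # assumed marginal prior of the sensitive feature
print(map_attribute_inference(toy_predict_proba, known, observed_label=1,
                              candidate_values=[0.0, 1.0], prior=prior))
```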
Both previous attacks worked against linear regression models, but as the number of features and their range increases, the attack feasibility decreases. To overcome the limitations of the MAP attack, Fredrikson et al. [27] proposed another inversion attack, which recovers features using target labels and optional auxiliary information. The attack was formulated as an optimization problem where the objective function is based on the observed model output, and gradient descent in the input space is used to recover the input data point. The method was tested on image reconstruction. The result was a class representative image, which in some cases was quite blurry even after denoising. A formalization of the model inversion attacks in References [27, 28] was later proposed by Wu et al. [135].
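A minimal sketch of the gradient-based inversion idea is given below, written with PyTorch and an untrained stand-in classifier: starting from a blank input, gradient descent maximizes the target class score (equivalently, minimizes a cost based on the model output) to recover a class-representative input. The architecture, step count, and learning rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Stand-in target classifier; in a real attack this is the model under
# inversion, accessed with gradients in the white-box setting.
target_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
target_model.eval()

def invert_class(model, target_class, steps=500, lr=0.1):
    """Gradient descent in input space: find an input the model assigns
    to target_class with high confidence (a class-representative image)."""
    x = torch.zeros(1, 1, 28, 28, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        loss = -logits[0, target_class]       # maximize the target class score
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)               # keep the image in a valid range
    return x.detach()

reconstruction = invert_class(target_model, target_class=3)
print(reconstruction.shape)
```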
Since the optimization problem in Reference [27] is quite hard to solve, Zhang et al. [145] proposed to use a GAN to learn some auxiliary information about the training data and produce better results. The auxiliary information in this case is the presence of blurring or masks in the input images. The attack first uses the GAN to learn to generate realistic-looking images from masked or blurry images using public data. The second step is a GAN inversion that calculates the latent vector \(\hat{z}\) that generates the most likely image:
\[
\hat{z} = \arg\min_{z} \; L_{prior}(z) + \lambda L_{id}(z),
\]
where the prior loss \(L_{prior}\) ensures the generation of realistic images, \(L_{id}\) ensures that the images have a high likelihood under the target network, and \(\lambda\) weighs the two terms. The attack is quite successful, especially on masked images.
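The second step can be sketched as a latent-space optimization. In the sketch below, the generator, discriminator, target network, and loss weight are untrained stand-ins, and the concrete loss definitions (discriminator score for the prior term, cross-entropy under the target network for the identity term) are assumptions that follow the structure of the objective above rather than the exact losses of Reference [145].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the trained auxiliary GAN and the target classifier.
latent_dim = 100
generator = nn.Sequential(nn.Linear(latent_dim, 28 * 28), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(28 * 28, 1))
target_net = nn.Sequential(nn.Linear(28 * 28, 10))

def gan_inversion(target_label, lam=100.0, steps=300, lr=0.02):
    """Optimize the latent vector z so that G(z) looks realistic (prior loss,
    via the discriminator) and is classified as target_label by the target
    network (identity loss)."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z)
        l_prior = -discriminator(image).mean()                   # realism term
        l_id = F.cross_entropy(target_net(image),
                               torch.tensor([target_label]))     # target-label term
        loss = l_prior + lam * l_id
        loss.backward()
        optimizer.step()
    return generator(z).detach()

print(gan_inversion(target_label=5).shape)
```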
Black-box-only reconstruction attacks are less common, since the attacker has substantially less information. Nevertheless, Salem et al. [109] proposed reconstruction attacks in an online setting, where they used the prediction vectors of a holdout dataset before and after a training round, in combination with generative models, to reconstruct labels and data samples. Finally, Yang et al. [138] proposed a black-box attack that employs an additional classifier that performs an inversion from the output of the target model \(f(x)\) to a candidate input \(\hat{x}\). The setup is similar to that of an autoencoder, except that the target network, which plays the role of the encoder, is a black box and is not trainable. The attack was tested on different types of target model outputs: the full prediction vector, a truncated vector, and the target label only. When the full prediction vector is available, the attack produces a good reconstruction, but with less available information, the produced data point looks more like a class representative.
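A compact sketch of such an inversion model is shown below: the attacker trains a separate “decoder” network to map the target's prediction vectors back to inputs, querying the frozen black-box target only for its outputs. The architectures, the random stand-in data, and the training schedule are illustrative and not those of Reference [138].

```python
import torch
import torch.nn as nn

# Frozen black-box target: only its outputs are observable, it is not trained.
target = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.Softmax(dim=1))
for p in target.parameters():
    p.requires_grad_(False)

# Inversion model: maps a prediction vector back to a candidate input.
inverter = nn.Sequential(nn.Linear(10, 256), nn.ReLU(),
                         nn.Linear(256, 28 * 28), nn.Sigmoid())
optimizer = torch.optim.Adam(inverter.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Auxiliary data collected by the attacker (random stand-ins for images here).
aux_images = torch.rand(256, 1, 28, 28)

for epoch in range(5):
    preds = target(aux_images)                    # query the black box
    recon = inverter(preds).view(-1, 1, 28, 28)   # attempt to reconstruct inputs
    loss = loss_fn(recon, aux_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: reconstruction loss {loss.item():.4f}")
```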
6.1.5 Model Extraction Attacks.
When the adversary has access to the inputs and prediction outputs of a model, it is possible to view these pairs of inputs and outputs as a system of equations, where the unknowns are the model parameters [124] or the hyper-parameters of the objective function [131]. In the case of a linear binary classifier, the system of equations is linear, and only \(d + 1\) queries are necessary to retrieve the model parameters, where \(d\) is the dimension of the input feature vector (the parameter vector \(\theta\) then has \(d + 1\) entries, including the bias). In more complex cases, such as multi-class logistic regression or multi-layer perceptrons, the systems of equations are no longer linear. Optimization techniques such as Broyden–Fletcher–Goldfarb–Shanno (BFGS) [96] or stochastic gradient descent are then used to approximate the model parameters [124].
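For the linear case, the equation-solving view can be made concrete: query the target on \(d + 1\) (or more) inputs, record the real-valued outputs, and solve the resulting linear system for the weights and bias. The sketch below assumes access to a confidence score that is linear in the input (e.g., a logit), which is a simplification of the setting in [124].

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

# Secret parameters of the target linear model (unknown to the attacker).
true_w, true_b = rng.normal(size=d), rng.normal()
def query_target(x):
    """Black-box query returning a real-valued score w.x + b."""
    return x @ true_w + true_b

# The attacker sends d + 1 queries and records the answers.
queries = rng.normal(size=(d + 1, d))
answers = np.array([query_target(x) for x in queries])

# Solve the linear system [X | 1] [w; b] = answers for the d + 1 unknowns.
design = np.hstack([queries, np.ones((d + 1, 1))])
solution, *_ = np.linalg.lstsq(design, answers, rcond=None)
stolen_w, stolen_b = solution[:d], solution[d]

print("max weight error:", np.abs(stolen_w - true_w).max())
print("bias error:", abs(stolen_b - true_b))
```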
A lack of prediction vectors or a high number of model parameters renders equation-solving attacks inefficient. A strategy is then required to select the inputs that will provide the most useful information for model extraction. From this perspective, model extraction is quite similar to active learning [15]. Active learning makes use of an external oracle that provides labels for input queries. The oracle can be a human expert or a system. The labels are then used to train or update the model. In the case of model extraction, the target model plays the role of the oracle.
Following the active learning approach, several papers propose an adaptive training strategy. They start with some initial data points or seeds, which they use to query the target model and retrieve labels or prediction vectors; these, in turn, are used to train the substitute model \(\hat{f}\). For a number of subsequent rounds, they extend their dataset with new synthetic data points based on some adaptive strategy that allows them to find points close to the decision boundary of the target model [15, 56, 100, 103, 124, 142]. Chandrasekaran et al. [15] provided a more query-efficient method of extracting nonlinear models such as kernel SVMs, with slightly lower accuracy than the method proposed by Tramer et al. [124], while the opposite was true for decision tree models. ActiveThief [100] and CloudLeak [142] are attacks based on the combination of active learning and adversarial examples for the extraction of deep neural network models. Both attacks were also combined with other techniques, such as transfer learning or k-center [112], to optimize their performance. One of the main differences between the two approaches is that CloudLeak uses adversarial samples to query the target, while ActiveThief uses adversarial samples as a way to find samples from the training dataset that are close to the decision boundary of the substitute model, hence data with higher uncertainty.
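The adaptive loop shared by these approaches can be sketched as below. The seed data, the substitute architecture, and the perturbation-based synthesis step are illustrative placeholders for the Jacobian- or adversarial-example-based strategies used in the cited papers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in target model, exposed only through .predict() queries.
X_all, y_all = make_classification(n_samples=2000, n_features=10, random_state=0)
target = SVC().fit(X_all, y_all)

rng = np.random.default_rng(0)
queries = X_all[:50].copy()                      # initial seed points (natural data)

substitute = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

for _ in range(5):
    labels = target.predict(queries)             # query the oracle / target
    substitute.fit(queries, labels)              # (re)train the substitute
    # Adaptive step: perturb existing points to explore regions near the
    # decision boundary (a crude stand-in for smarter synthesis strategies).
    new_points = queries + 0.3 * rng.normal(size=queries.shape)
    queries = np.vstack([queries, new_points])

agreement = (substitute.predict(X_all) == target.predict(X_all)).mean()
print(f"substitute/target agreement (fidelity): {agreement:.2f}")
```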
Several other strategies for selecting the most suitable data for querying the target model use: (i) data that are not synthetic but belong to different domains, such as images from different datasets [6, 21, 98]; (ii) semi-supervised learning techniques such as rotation loss [143] or MixMatch [7] to augment the dataset [50]; (iii) data generated through model inversion techniques [32]; or (iv) randomly generated input data [56, 63, 124]. In terms of efficiency, semi-supervised methods such as MixMatch require far fewer queries than fully supervised extraction methods to perform similarly or better in terms of task accuracy and fidelity against models trained for classification on the CIFAR-10 and SVHN datasets [50]. For larger models trained for Imagenet classification, even querying 10% of the Imagenet data gives performance comparable to the target model [50]. Against a deployed MLaaS service that provides facial characteristics, Orekondy et al. [98] managed to create a substitute model that reaches 80% of the target's task accuracy while spending as little as $30.
Some, mostly theoretical, work has demonstrated the ability to perform direct model extraction beyond linear models [50, 86]. Full model extraction was shown to be theoretically possible against two-layer fully connected neural networks with rectified linear unit (ReLU) activations by Milli et al. [86]. However, their assumption was that the attacker has access to the loss gradients with respect to the inputs. Jagielski et al. [50] managed to perform a full extraction of a similar network without the need for gradients. Both approaches exploit the fact that ReLUs turn the neural network into a piecewise linear function of its inputs. By probing the model with different inputs, it is possible to identify where the linearity breaks and use this knowledge to calculate the network parameters. In a hybrid approach that uses both a learning strategy and direct extraction, Jagielski et al. [50] showed that they can extract a model trained on MNIST with almost 100% fidelity, using an average of \(2^{19.2}\) to \(2^{22.2}\) queries against models that contain up to 400,000 parameters. However, this attack assumes access to the loss gradients, similar to Reference [86].
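The piecewise-linearity observation can be illustrated by probing a tiny ReLU network along a line in input space and watching the numerical slope change where a hidden unit flips sign. Locating such critical points is only the first step of the direct extraction attacks [50, 86]; the full parameter-recovery machinery is omitted, and the network below is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: f(x) = w2 . relu(W1 x + b1).
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
w2 = rng.normal(size=4)
def f(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

# Probe the model along a line x(t) = x0 + t * d and estimate the slope df/dt.
x0, d = rng.normal(size=2), rng.normal(size=2)
ts = np.linspace(-3.0, 3.0, 2001)
values = np.array([f(x0 + t * d) for t in ts])
slopes = np.diff(values) / np.diff(ts)

# Where the slope changes, some ReLU crossed zero: the linearity "breaks" there.
breaks = ts[1:-1][np.abs(np.diff(slopes)) > 1e-6]
print("approximate critical points along the line:", np.round(breaks, 2))
```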
Finally, apart from learning substitute models directly, it is also possible to extract model information such as architecture, optimization method, and hyper-parameters using shadow models [97]. The majority of these attacks were performed against neural networks trained on MNIST. Using the shadow models' prediction vectors as input, the meta-models managed to learn to distinguish whether a model has certain architectural properties. An additional attack by the same authors generates adversarial samples using models that possess the property in question; the samples are crafted so that a classifier produces a certain prediction only if it has that property. The target model's prediction on such an adversarial sample is then used to establish whether the target model has the specific property. The combination of the two attacks proved to be the most effective approach. Properties such as the activation function and the presence of dropout or max-pooling were the most successfully predicted.
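In the same spirit as the membership meta-model, a property meta-classifier can be sketched as follows: shadow models that differ in one architectural property (the activation function, in this illustrative setup) are trained, and a meta-model learns to tell them apart from their prediction vectors on a fixed set of probe inputs. The dataset, model sizes, and chosen property are assumptions for illustration rather than the setup of Reference [97].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
query_set = X[:20]                       # fixed probe inputs for every shadow model

meta_X, meta_y = [], []
for i in range(30):
    activation = "relu" if i % 2 == 0 else "tanh"   # the property to be inferred
    idx = rng.choice(len(X), size=500, replace=False)
    shadow = MLPClassifier(hidden_layer_sizes=(16,), activation=activation,
                           max_iter=300, random_state=i).fit(X[idx], y[idx])
    # Meta-feature vector: the shadow model's prediction vectors on the probes.
    meta_X.append(shadow.predict_proba(query_set).ravel())
    meta_y.append(int(activation == "relu"))

meta = LogisticRegression(max_iter=1000).fit(np.array(meta_X), np.array(meta_y))
print("meta-model training accuracy:", meta.score(np.array(meta_X), np.array(meta_y)))
```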
97]. The majority of attacks were performed against neural networks trained on MNIST. Using the shadow models’ prediction vectors as input, the meta-models managed to learn to distinguish whether a model has certain architectural properties. An additional attack by the same authors, proposed to generate adversarial samples, which were created by models that have the property in question. The generated samples were created in a way that makes a classifier output a certain prediction if they have the attribute in question. The target model’s prediction on this adversarial sample is then used to establish if the target model has a specific property. The combination of the two attacks proved to be the most effective approach. Some properties such as activation function, presence of dropout, and max-pooling were the most successfully predicted.