Effectiveness of L2 Regularization in Privacy-Preserving Machine Learning*

Nikolaos Chandrinos¹ (ORCID 0009-0004-0737-3261), Iliana Loi² (ORCID 0000-0001-9112-0638), Panagiotis Zachos² (ORCID 0000-0003-0460-252X), Ioannis Symeonidis¹ (ORCID 0000-0002-5050-8617), Aristotelis Spiliotis¹ (ORCID 0000-0001-9042-2044), Maria Panou¹ (ORCID 0000-0002-2161-9246), Konstantinos Moustakas² (ORCID 0000-0001-7617-227X)

¹ Human Factors and Vehicle Technology, Hellenic Institute of Transport, Centre for Research and Technology Hellas, Thermi, Greece
{chandrin, ioannis.sym, aspiliotis, mpanou}@certh.gr
² Wire Communications and Information Technology Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, Patras, Greece
loi@ceid.upatras, p_zachos@ac.upatras.gr, moustakas@ece.upatras.gr

* This work was partially supported by the European Commission (EC) through the European Union’s Horizon Europe Framework Programme within the frame and for the purpose of the artificial Intelligence To enHAnce Civic pArticipation (ITHACA) project under Grant Agreement No. 101094364.
Abstract

Artificial intelligence, machine learning, and deep learning as a service have become the status quo for many industries, leading to the widespread deployment of models that handle sensitive data. The well-performing models that industry seeks usually rely on large volumes of training data. However, the use of such data raises serious privacy concerns due to the potential leakage of highly sensitive information. One prominent threat is the Membership Inference Attack, in which an adversary attempts to deduce whether a specific data point was used in a model’s training process. An adversary’s ability to determine an individual’s presence in the training data is a significant privacy threat, especially when the data relate to a group of users sharing sensitive information. Hence, well-designed privacy-preserving machine learning solutions are critically needed in industry. In this work, we compare the effectiveness of L2 regularization and differential privacy in mitigating Membership Inference Attack risks. Although regularization techniques such as L2 regularization are commonly employed to reduce overfitting, a condition that increases the effectiveness of Membership Inference Attacks, their impact on mitigating these attacks has not been systematically explored.

Keywords:
Privacy-Preserving Machine Learning · Deep Learning · Neural Network · Differential Privacy · L2 Regularization.

1 Introduction

Privacy, a multifaceted concept encompassing freedom of thought, control over personal information, and protection from surveillance, has become increasingly complex in the digital age. As industries widely adopt Artificial Intelligence (AI) and Machine Learning (ML) technologies, they rely on vast amounts of sensitive data to train sophisticated models. This reliance raises significant concerns about data privacy, especially with the advent of Machine Learning as a Service (MLaaS) platforms and of regulations such as the General Data Protection Regulation (GDPR). To address these challenges, this paper explores how Privacy-Preserving Machine Learning (PPML) techniques can safeguard sensitive information. Specifically, we investigate the effectiveness of $L_2$ regularization in defending against Membership Inference Attacks (MIAs), comparing its performance with that of commonly used methods such as Differential Privacy (DP).

Privacy, in general, is a concept in disarray: a broad term that encompasses freedom of thought, control over one’s body, seclusion in one’s home, control over personal information, freedom from surveillance, protection of one’s reputation, and protection from searches and interrogations, among other things [22, 27]. A similar difficulty exists for privacy in the digital realm. Digital privacy typically emphasizes the protection of personal identity, which involves ensuring individuals’ control over their personal information and safeguarding them against surveillance [27].

PPML plays a crucial role in upholding these standards. PPML emerged to address the privacy challenges posed by ML and Deep Learning (DL) applications [24, 27]. PPML techniques aim to train and deploy models without compromising sensitive personal data. They include methods such as encryption, anonymization, and DP, which protect data both during the training phase and at inference time.

Despite these advancements, several types of cybersecurity threats endanger the integrity of the sensitive data used in models. These threats can be broadly categorized into three types [15]: poisoning attacks, where adversaries manipulate the training data to corrupt the learning process; evasion (or exploratory) attacks, which aim to deceive the model into making incorrect predictions; and inference attacks, whose goal is to extract information about the data or the model itself rather than to influence its output [27].

Among inference attacks, a common example, and the focus of this work, is the Membership Inference Attack (MIA), in which an attacker infers whether a data point was used to train the model [23, 27, 21]. Even with only black-box access to a model’s predictions, such an attack can reveal, for example, whether a specific patient’s profile was part of the training dataset [27, 11, 29].

The purpose of this paper is not to provide a comprehensive tutorial on PPML tools per se, but rather to shed light on how $L_2$ regularization can enhance mitigation against MIAs compared with common practices such as DP. The rest of this paper is structured as follows. Section 2 provides the necessary background, Section 3 presents the experimental evaluation, and Section 4 draws conclusions.

2 Preliminaries

2.1 Deep Neural Networks

Deep Neural Networks (DNNs) are composed of multiple layers of interconnected neurons, where each layer transforms its input data, often into a higher-dimensional representation [6, 18]. The operation of a fully connected (dense) layer can be mathematically described by:

$$\mathbf{y}=\phi\left(\mathbf{W}\mathbf{x}+\mathbf{b}\right),$$

where $\mathbf{x}\in\mathbb{R}^{m}$ is the input vector, $\mathbf{W}\in\mathbb{R}^{n\times m}$ is the weight matrix, $\mathbf{b}\in\mathbb{R}^{n}$ is the bias vector, $\phi(\cdot)$ is a nonlinear activation function (e.g., ReLU or sigmoid), and $\mathbf{y}\in\mathbb{R}^{n}$ is the output vector [6, 18, 20].
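As a concrete illustration, the dense-layer equation can be sketched in plain Python, with lists standing in for vectors and a list of rows for $\mathbf{W}$. The function names and the example numbers are illustrative assumptions, not part of the paper; ReLU and sigmoid are shown as two possible choices of $\phi$.

```python
import math

def relu(v):
    """Element-wise ReLU: one possible choice of the activation phi."""
    return [max(0.0, a) for a in v]

def sigmoid(v):
    """Element-wise logistic sigmoid: another common choice of phi."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def dense(x, W, b, phi=relu):
    """Compute y = phi(W x + b), where W has n rows of length m = len(x)."""
    if any(len(row) != len(x) for row in W) or len(W) != len(b):
        raise ValueError("W must be n x m with m = len(x) and n = len(b)")
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return phi(z)

# m = 3 inputs, n = 2 outputs
x = [1.0, -2.0, 0.5]
W = [[0.5, 0.25, 1.0],
     [-1.0, 0.5, 2.0]]
b = [0.25, 0.5]
print(dense(x, W, b))  # → [0.75, 0.0]
```

The second pre-activation is $-0.5$, which ReLU maps to zero; this is the nonlinearity that lets stacked layers model more than a single linear map.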

The strength of DNNs lies in their capacity to model complex nonlinear relationships throughout their architecture. This ability enables the network to learn hierarchical feature representations from the data [6]. A DNN with $L$ layers can be represented as:

$$\mathbf{y}=f\left(\mathbf{x};\Theta\right)=f^{(L)}\left(f^{(L-1)}\left(\cdots f^{(1)}(\mathbf{x})\right)\right),$$

where $f^{(l)}(\cdot)$ denotes the function computed at layer $l$, and $\Theta=\left\{\mathbf{W}^{(l)},\mathbf{b}^{(l)}\right\}_{l=1}^{L}$ represents all the network’s parameters [6].
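The nested composition above amounts to a loop that feeds each layer’s output into the next. In this sketch, $\Theta$ is encoded as a hypothetical list of `(W, b, phi)` triples, one per layer; the ReLU hidden layer and the identity (linear) output layer are assumptions chosen for illustration.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def identity(v):
    return list(v)

def dense(x, W, b, phi):
    """One layer f^(l): phi(W x + b)."""
    return phi([sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)])

def forward(x, theta):
    """Compute f^(L)(... f^(1)(x)); theta is a list of (W, b, phi) triples for l = 1..L."""
    h = x
    for W, b, phi in theta:  # f^(1) is applied first, f^(L) last
        h = dense(h, W, b, phi)
    return h

theta = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # f^(1): 2 -> 2
    ([[2.0, 1.0]], [-1.0], identity),               # f^(2): 2 -> 1
]
print(forward([3.0, 1.0], theta))  # → [5.0]
```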

Training a DNN involves finding the set of parameters $\Theta$ that minimizes a predefined loss function $\mathcal{L}\left(\mathbf{y},\hat{\mathbf{y}}\right)$, where $\hat{\mathbf{y}}$ is the true label corresponding to the input sample $\mathbf{x}$. The loss function measures the discrepancy between the predicted output $\mathbf{y}$ and the actual output $\hat{\mathbf{y}}$. For multi-class classification problems, the cross-entropy loss function is frequently employed [6].
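Following the notation above ($\mathbf{y}$ predicted, $\hat{\mathbf{y}}$ true), the cross-entropy loss is $\mathcal{L}(\mathbf{y},\hat{\mathbf{y}})=-\sum_k \hat{y}_k\log y_k$ for a one-hot $\hat{\mathbf{y}}$. A minimal sketch, assuming the network’s raw outputs are turned into probabilities with a softmax (helper names are illustrative):

```python
import math

def softmax(z):
    """Map raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_k y_hat[k] * log(y[k]); y = predicted probs, y_hat = one-hot label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y, y_hat))

y = softmax([2.0, 0.0, 0.0])
print(cross_entropy(y, [1, 0, 0]))   # small loss: the correct class gets most of the mass
print(cross_entropy([1 / 3] * 3, [0, 1, 0]))  # uniform guess over 3 classes: log(3)
```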

The optimization of the loss function is typically carried out using gradient-based methods. The most common approach is Stochastic Gradient Descent (SGD) and its variants, such as Adam and RMSprop. The gradients of the loss with respect to the network parameters are computed using the backpropagation algorithm. Backpropagation applies the chain rule from calculus to efficiently propagate errors backward through the network layers, enabling the adjustment of weights and biases to minimize the loss [6].
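To make the gradient-based training loop concrete, the following sketch trains a single sigmoid neuron with binary cross-entropy by plain SGD (batch size 1), so that backpropagation reduces to one application of the chain rule. The toy data and hyperparameters are hypothetical, and Adam or RMSprop are not shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(data, lr=0.5, epochs=200, seed=0):
    """Fit p = sigmoid(w*x + b) with binary cross-entropy using plain SGD."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # "stochastic": visit samples in random order
        for x, t in data:
            p = sigmoid(w * x + b)     # forward pass
            dz = p - t                 # backward pass: dL/dz for sigmoid + cross-entropy
            w -= lr * dz * x           # chain rule: dL/dw = dL/dz * dz/dw = dz * x
            b -= lr * dz               # dL/db = dL/dz * 1
    return w, b

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_sgd(data)
print(w, b, sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```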

2.2 L2 Regularization

$L_2$ regularization, known as ridge regression when applied to linear models, is a widely utilized technique in machine learning, particularly in scenarios where model complexity needs to be controlled without eliminating features. The core idea behind $L_2$ regularization is to penalize large coefficients in the model, thereby reducing the risk of overfitting [19, 4].

In machine learning, $L_2$ regularization modifies the loss function by adding a penalty proportional to the sum of the squared model coefficients. The modified loss function can be expressed as:

$$\text{Loss}=\text{Loss}_{\text{orig}}+\lambda\sum_{i=1}^{n}w_{i}^{2}$$

Here, $\text{Loss}_{\text{orig}}$ represents the original loss function (such as mean squared error for linear regression), $\lambda$ is the regularization parameter that controls the strength of the penalty, and $w_{i}$ are the coefficients of the model. The term $\sum_{i=1}^{n}w_{i}^{2}$ is the $L_2$ penalty, which increases as the magnitude of the coefficients increases [19].
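A minimal sketch of the penalized loss, assuming mean squared error as $\text{Loss}_{\text{orig}}$ for a linear model without a bias term; the function names and data are illustrative.

```python
def mse(w, X, y):
    """Loss_orig: mean squared error of a linear model (prediction = X w, no bias term)."""
    preds = [sum(w_i * x_i for w_i, x_i in zip(w, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def ridge_loss(w, X, y, lam):
    """Loss = Loss_orig + lam * sum_i w_i^2."""
    return mse(w, X, y) + lam * sum(w_i ** 2 for w_i in w)

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
w = [1.0, 2.0]                       # fits the data exactly, so Loss_orig = 0
print(ridge_loss(w, X, y, lam=0.1))  # only the penalty remains: 0.1 * (1 + 4) = 0.5
```

Even a perfect fit is charged for its weight magnitudes, which is what pushes the optimizer toward smaller coefficients.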

The effect of this penalty is to shrink the coefficients towards zero, but not necessarily to zero. This shrinkage prevents any single feature from dominating the model, especially in the presence of multicollinearity, where features are highly correlated [5]. The regularization parameter $\lambda$ plays a crucial role; a higher $\lambda$ value imposes a stronger penalty, leading to greater shrinkage of the coefficients [19].
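The shrinkage effect can be seen in closed form for a single coefficient. Using the sum (rather than the mean) of squared errors, setting the derivative of $\sum_i (y_i - w x_i)^2 + \lambda w^2$ to zero gives $w^{*}=\sum_i x_i y_i / \left(\sum_i x_i^2 + \lambda\right)$, which decreases as $\lambda$ grows but never reaches zero for finite $\lambda$. A small sketch with hypothetical data:

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum_i (y_i - w * x_i)^2 + lam * w^2 over a single coefficient w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exact fit at w = 2 when lam = 0
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_1d(xs, ys, lam))  # 2.0, then steadily smaller but still positive
```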

In neural networks, $L_2$ regularization is incorporated into the training process through a technique known as weight decay. The loss function, often based on cross-entropy or mean squared error, is augmented with an $L_2$ penalty on the network’s weights. The modified loss function in this context can be written as:

$$\text{Loss}=\text{Loss}_{\text{orig}}+\lambda\sum_{j=1}^{m}\sum_{k=1}^{n}w_{jk}^{2}$$

where $w_{jk}$ represents the weights connecting the neurons between layers $j$ and $k$, $m$ is the number of layers, and $n$ is the number of connections [19].
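In practice, the double sum is the sum of the squared entries of every layer’s weight matrix. The sketch below follows the text in penalizing only weights; leaving biases out is the assumption made here. The matrices are hypothetical.

```python
def l2_penalty(weight_matrices, lam):
    """lam * sum of squared entries over every layer's weight matrix (biases not included)."""
    return lam * sum(w * w for W in weight_matrices for row in W for w in row)

layers = [
    [[1.0, -2.0], [0.5, 0.0]],  # weight matrix of layer 1
    [[3.0, 1.0]],               # weight matrix of layer 2
]
print(l2_penalty(layers, lam=0.01))  # 0.01 * (1 + 4 + 0.25 + 0 + 9 + 1) = 0.1525
```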

The inclusion of the $L_2$ penalty ensures that the network learns smaller weight values, which helps prevent the model from overfitting to the training data. During training, the gradient update rule for the weights is modified to include this penalty. Specifically, the weights are updated according to:

$$w_{jk}\leftarrow w_{jk}-\eta\left(\frac{\partial\text{Loss}_{\text{orig}}}{\partial w_{jk}}+2\lambda w_{jk}\right)$$

where $\eta$ is the learning rate and the term $2\lambda w_{jk}$ is the derivative of the $L_2$ penalty. This update rule gradually reduces the magnitude of the weights as training progresses, leading to a simpler and more generalizable model [19].
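The update rule can be sketched element-wise as below (names are illustrative). With a zero data gradient, only the penalty term acts and each step multiplies every weight by $1-2\eta\lambda$, which is why the technique is called weight decay.

```python
def weight_decay_step(weights, grads, lr, lam):
    """One update w <- w - lr * (dLoss_orig/dw + 2 * lam * w), applied element-wise."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]

w = [1.0, -0.5]
for _ in range(3):
    # Zero data gradient: each step scales w by (1 - 2 * 0.1 * 0.5) = 0.9
    w = weight_decay_step(w, [0.0, 0.0], lr=0.1, lam=0.5)
print(w)  # approximately [0.729, -0.3645]
```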

The use of $L_2$ regularization in neural networks improves the model’s ability to generalize, reduces sensitivity to input noise, and enhances the stability of the learning process [19]. However, the choice of the regularization parameter $\lambda$ is critical; if it is set too high, it can lead to underfitting, where the model is overly simplistic and fails to capture the underlying data patterns [19, 4].

2.3 Membership Inference Attack

MIAs originate from the observation that models usually behave differently on the data they were trained on than on previously unseen data. In general, an MIA seeks to determine whether a given data sample $(\overrightarrow{x^{*}},y^{*})$ belongs to $D$, the training dataset used to train a model $f$. The test dataset $D^{\prime}$ follows almost the same distribution as the records in $D$, but its records never participated in training $f$. This difference means that an instance $\overrightarrow{x^{*}}\in D$ is a member, while $\overrightarrow{x^{\prime*}}\in D^{\prime}$ is a non-member. Therefore, MIAs are used to determine whether a given record belongs to the training dataset or not [8, 24].

In certain situations, revealing that a person is included in the training set has privacy implications, for example in a medical study involving patients with a rare disease. Additionally, membership inference can serve as a foundation for launching data extraction attacks [24, 26].

MIAs may extend to extracting specific characteristics of sensitive training data or potentially reconstructing it entirely. Typically, such attacks exploit the prevalent issue of overfitting, which leads to a noticeable accuracy gap between the training data and external datasets [24, 15].
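One simple way to exploit this gap is a loss-threshold attack: the adversary guesses "member" whenever the model’s loss on a sample falls below a threshold $\tau$, because an overfit model tends to have lower loss on its training points. The sketch below is illustrative only and is not necessarily the attack used in our experiments; the per-sample losses are made up. Its advantage is computed as the true positive rate minus the false positive rate.

```python
def threshold_attack(losses, tau):
    """Guess 'member' (True) when the model's loss on a sample is below tau."""
    return [loss < tau for loss in losses]

def advantage(member_losses, nonmember_losses, tau):
    """True positive rate minus false positive rate of the threshold attack."""
    tpr = sum(threshold_attack(member_losses, tau)) / len(member_losses)
    fpr = sum(threshold_attack(nonmember_losses, tau)) / len(nonmember_losses)
    return tpr - fpr

# Hypothetical per-sample losses: members of an overfit model tend to have lower loss
members = [0.05, 0.10, 0.02, 0.40, 0.08]
nonmembers = [0.90, 0.30, 1.20, 0.07, 0.60]
print(advantage(members, nonmembers, tau=0.2))  # 4/5 - 1/5, about 0.6
```

If the model generalized perfectly, member and non-member losses would be identically distributed and no threshold could achieve a positive advantage in expectation.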

The Attacker Advantage (AA) metric measures how much the attacker gains over random guessing. It is computed as the difference between the probability that the adversary correctly identifies a training data point as a member (true positive rate) and the probability that it wrongly identifies a non-member as a member (false positive rate) [28].

The AA is evaluated as follows. Let $\mathcal{A}$ be an adversary, $A$ a learning algorithm, $n$ a positive integer, and $\mathcal{D}$ a distribution over data points $(x,y)$. The membership experiment proceeds as follows:

  • Sample $S\sim\mathcal{D}^{n}$, and let $A_{S}=A(S)$.

  • Choose $b\leftarrow\{0,1\}$ uniformly at random.

  • Draw $z\sim S$ if $b=0$, or $z\sim\mathcal{D}$ if $b=1$.

  • $\mathsf{Exp}^{\mathsf{M}}(\mathcal{A},A,n,\mathcal{D})$ is 1 if $\mathcal{A}(z,A_{S},n,\mathcal{D})=b$ and 0 otherwise. $\mathcal{A}$ must output either 0 or 1.

The membership advantage of $\mathcal{A}$ is defined as

$$\mathsf{Adv}^{\mathsf{M}}(\mathcal{A},A,n,\mathcal{D})=2\Pr[\mathsf{Exp}^{\mathsf{M}}(\mathcal{A},A,n,\mathcal{D})=1]-1,$$

where the probabilities are taken over the coin flips of $\mathcal{A}$, the random choices of $S$ and $b$, and the random data point $z\sim S$ or $z\sim\mathcal{D}$.

Since $b=0$ corresponds to $z$ being drawn from the training set, an output $\mathcal{A}=0$ is a guess of "member". Equivalently, the right-hand side can therefore be expressed as the difference between $\mathcal{A}$’s true and false positive rates:

$$\mathsf{Adv}^{\mathsf{M}}=\Pr[\mathcal{A}=0\mid b=0]-\Pr[\mathcal{A}=0\mid b=1],$$

where $\mathsf{Adv}^{\mathsf{M}}$ is shorthand for $\mathsf{Adv}^{\mathsf{M}}(\mathcal{A},A,n,\mathcal{D})$.
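The experiment and both forms of the advantage can be checked with a small Monte Carlo simulation. The setting is a hypothetical worst case chosen for illustration: $\mathcal{D}$ is uniform over 1000 record ids, the "learning algorithm" simply memorizes $S$, and the adversary outputs 0 ("member") exactly when $z$ is in the memorized set. Both estimates should land near $1-\Pr[z\in S]\approx 0.905$.

```python
import random

def membership_experiment(learn, adversary, sample_d, n, rng):
    """One run of Exp^M(adversary, learn, n, D); returns (b, adversary's guess)."""
    S = [sample_d(rng) for _ in range(n)]           # S ~ D^n
    model = learn(S)                                # A_S = A(S)
    b = rng.randint(0, 1)                           # b <- {0, 1} uniformly at random
    z = rng.choice(S) if b == 0 else sample_d(rng)  # z ~ S if b = 0, z ~ D if b = 1
    return b, adversary(z, model)

def estimate_advantage(learn, adversary, sample_d, n, trials=10000, seed=0):
    """Monte Carlo estimates of 2 * Pr[Exp = 1] - 1 and of TPR - FPR."""
    rng = random.Random(seed)
    runs = [membership_experiment(learn, adversary, sample_d, n, rng) for _ in range(trials)]
    win_rate = sum(b == guess for b, guess in runs) / trials
    on_members = [guess for b, guess in runs if b == 0]
    on_others = [guess for b, guess in runs if b == 1]
    tpr = sum(g == 0 for g in on_members) / len(on_members)  # Pr[A = 0 | b = 0]
    fpr = sum(g == 0 for g in on_others) / len(on_others)    # Pr[A = 0 | b = 1]
    return 2 * win_rate - 1, tpr - fpr

# Hypothetical toy setting: uniform D over 1000 ids, a model that memorizes S.
def sample_d(rng):
    return rng.randrange(1000)

def learn(S):
    return set(S)

def adversary(z, model):
    return 0 if z in model else 1  # 0 means "z was in the training set"

print(estimate_advantage(learn, adversary, sample_d, n=100))
```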

2.4 Differential Privacy

Differential privacy (DP) techniques apply randomized modifications to a dataset such that an individual with access to the dataset’s entries cannot infer any personalized sensitive information from it [27, 30]. Such methods include adding random noise to the data through randomized mechanisms (Gaussian, Laplace, exponential, etc.).
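As one concrete instance of these noise mechanisms, the sketch below applies the Laplace mechanism to a counting query. A count changes by at most 1 when a single record is added or removed (sensitivity 1), so Laplace noise with scale $1/\varepsilon$ is added. The records and helper names are hypothetical, and the noise is applied to the query answer rather than to each record.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)  # noise scale = sensitivity / epsilon

rng = random.Random(42)
ages = [23, 35, 67, 71, 45, 80, 19, 62]  # hypothetical records; true count of age > 60 is 4
print(private_count(ages, lambda a: a > 60, epsilon=1.0, rng=rng))
```

Smaller $\varepsilon$ means a larger noise scale: stronger privacy, but a less accurate answer.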

Some of the most recent works on differential privacy preservation are [1, 17, 3, 16]. One of the most prominent works in PPML is [1], where Gaussian noise is injected into the gradients of a neural network’s parameters at each training step, enabling the model to learn so-called differentially private parameters. Similarly, in [17], an adaptive Laplace mechanism named AdLM was developed to learn differentially private parameters as well, but by adaptively adding noise to the input features of a neural model according to their relevance to the network’s output. Specifically, the main idea behind AdLM is to add less Laplace noise to the features that contribute most to the model’s output and more noise to those that are less relevant. Furthermore, noise is also added adaptively to affine feature transformations and loss functions. In contrast to [1], in this approach the added noise and privacy consumption do not accumulate across epochs, since AdLM does not access the ‘clean/unmasked’ data during training, which makes AdLM independent of the number of training epochs. In other words, AdLM adds Laplace noise as a preprocessing step rather than at training time. Another benefit of this framework is that it can be incorporated into a wide range of deep learning models. In a more recent work [16], the same authors propose a scalable framework, named StoBatch, to preserve differential privacy in deep adversarial learning. StoBatch is a stochastic batch mechanism that preserves privacy while learning model parameters by first introducing noise into both the input features and their latent space and then integrating adversarial learning to strengthen the model’s decision boundaries.
In contrast to [17], StoBatch minimizes the amount of noise introduced into the model by incorporating an adversarial objective function that combines the loss on the training data with a loss on differentially private adversarial data, hence preserving the model’s utility (i.e., preventing the accumulation of privacy loss and noise over training). Moreover, injecting noise into both the input and hidden layers makes StoBatch more robust against adversarial examples. StoBatch’s stochastic batch training distributes the data across local trainers that learn differentially private parameters; the gradients computed by these trainers are then combined, which allows adversarial examples to be generated from various data batches at each epoch. The latter enables StoBatch to scale to large datasets and deep learning models [16].
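The gradient perturbation of [1] can be sketched as a per-step aggregation: clip each example’s gradient to an $L_2$ norm of at most $C$, sum the clipped gradients, add Gaussian noise with standard deviation $\sigma C$ to every coordinate, and average over the batch. This is a simplified illustration with hypothetical names; privacy accounting (computing $\varepsilon$) is omitted.

```python
import math
import random

def clip(g, c):
    """Scale gradient g so that its L2 norm is at most c."""
    norm = math.sqrt(sum(v * v for v in g))
    factor = min(1.0, c / norm) if norm > 0 else 1.0
    return [v * factor for v in g]

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add N(0, (noise_multiplier * clip_norm)^2) noise, average."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    total = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

Clipping bounds any single example’s influence on the update, which is what allows the Gaussian noise to mask that example’s contribution.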

The aforementioned approaches use global differential privacy, which assumes that both the data and the differential privacy mechanism that adds random noise to them reside on a server. In contrast, local differential privacy enables data owners to ‘hide’ any sensitive or personal data locally before sharing them. Hence, local differential privacy is considered a more secure way to protect training data, and it is also computationally cheap, since it does not require the large computational resources of global differential privacy [3]. One local differentially private method is LATENT [3], which acts as an intermediate layer in a deep learning model that a data owner can use to randomize their data before distributing them to untrusted users or neural networks. LATENT was evaluated by integrating it between the convolutional and the fully connected layers of a CNN model. In particular, LATENT randomizes the flattened 1D real-valued vectors (corresponding to an image input) produced by the convolutional layers by transforming them into discrete vectors [3].
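LATENT’s randomization is more involved, but the local model itself can be illustrated with randomized response, a classic $\varepsilon$-local DP mechanism (not LATENT): each data owner reports their true bit with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ and flips it otherwise, so the server never sees raw values, yet it can still debias the aggregate. The names and data below are hypothetical.

```python
import math
import random

def randomized_response(bit, epsilon, rng):
    """Run locally by the data owner: report the true bit w.p. e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_fraction(reports, epsilon):
    """Server side: unbiased estimate of the true fraction of 1s from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p) + fraction * (2p - 1), solved for the fraction
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(1)
true_bits = [1] * 15000 + [0] * 35000  # hypothetical: 30% of owners hold the sensitive attribute
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
print(estimate_fraction(reports, 1.0))  # close to 0.3
```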

TensorFlow Privacy [2], an open-source Python library by Google Research, is also worth mentioning, as it enables the creation and training of privacy-preserving machine learning models. In particular, TensorFlow Privacy wraps TensorFlow [2] optimizers so that ML models can be trained with differential privacy algorithms, and it provides tools to compare neural networks in terms of privacy measures while preserving model utility. TensorFlow itself is a widely used open-source Python package for developing ML/DL models, but it does not natively support PPML. Therefore, the main aim of the TensorFlow Privacy framework is to facilitate the adoption of privacy-preserving methods in ML using TensorFlow or other APIs such as Keras.

3 Experimental Results

In this work, we experimentally evaluate MIA performance on three datasets: (a) MNIST [14], (b) CIFAR10 [13], and (c) an augmented version of the Toxic Tweets Dataset [9] for text classification. For the image datasets, we apply two neural network architectures, (a) a fully connected neural network (FCNN) and (b) a convolutional neural network (CNN), while for the text classification task we use a simple embedding-based neural network tailored to natural language data. Together, these cover a wide range of possible scenarios with commonly used architecture designs. For each scenario, we compare two training methods: (a) a normal (baseline) approach using a non-DP-trained model, and (b) a DP method involving a DP-trained model. To study how $L_2$ regularization affects their privacy, we also evaluate both methods with $L_2$ regularization. Finally, we extend our experimental evaluation to examine how the difference between training and evaluation accuracy affects MIAs.

3.1 Image Classification Task

The MNIST [14] dataset consists of handwritten digits, with 60,000 samples in the training set and 10,000 in the evaluation set. The flattened input images (784 features) are fed to a fully connected network consisting of three layers with 10, 20, and 10 neurons, respectively. The non-DP models are optimized for 50 epochs using the Adam optimizer [12] with a learning rate of 0.0001 and the cross-entropy loss, with batches of 32 samples. A similar configuration was used for the DP models, except that Adam was replaced with the DP-Adam optimizer of TensorFlow Privacy [2].
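For a sense of scale, the FCNN’s trainable parameters can be counted directly. This sketch assumes that every layer has a bias vector and that the final 10-neuron layer is the output layer (one neuron per digit class); the paper does not state either detail explicitly.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def fcnn_params(n_inputs, widths):
    """Total parameters of a stack of fully connected layers with the given widths."""
    total, n_in = 0, n_inputs
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total

print(fcnn_params(784, [10, 20, 10]))  # 7850 + 220 + 210 = 8280
```

With only about 8K parameters, this network has limited capacity to memorize 60,000 training images.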

The experimental results are reported in Table 1 and visualized in Figure 1, where we report the average and standard deviation of the training accuracy, evaluation accuracy, and attacker advantage over five training runs for each evaluated method. Furthermore, we report the performance of every method for different values of the $L_2$ regularization strength $\lambda$.

Table 1: Performance metrics for the FCNN on the MNIST dataset with varying $L_2$ regularization strengths ($\lambda$). The table reports the average and standard deviation of training accuracy, validation accuracy, and attacker advantage over five runs for both Baseline (non-DP) and DP models.
λ                    0             .001          .002          .003          .004          .005
Train Accuracy
Baseline             94.6 ± .250   94.8 ± .171   94.0 ± .201   93.4 ± .125   93.0 ± .144   92.4 ± .223
DP                   94.1 ± .232   93.3 ± .471   93.0 ± .318   92.8 ± .190   92.4 ± .407   91.7 ± .153
Validation Accuracy
Baseline             94.0 ± .361   94.8 ± .273   94.1 ± .252   93.5 ± .266   93.1 ± .200   92.8 ± .350
DP                   93.6 ± .327   93.2 ± .415   92.9 ± .423   92.8 ± .218   92.4 ± .330   91.7 ± .193
Attacker Advantage
Baseline             1.69 ± .156   1.78 ± .615   1.48 ± .615   1.62 ± .216   1.87 ± .319   1.62 ± .182
DP                   1.71 ± .340   1.90 ± .214   1.86 ± .212   1.61 ± .135   1.72 ± .302   1.70 ± .191

From Table 1, we observe that the non-DP model achieves higher validation accuracy than the DP model. Specifically, when the non-DP model is trained with $L_2$ regularization at strengths of 0.001 and 0.002, it attains higher validation accuracy (94.8% and 94.1%, respectively) than without regularization (94.0%). This suggests that mild $L_2$ regularization improves the generalization of the model without introducing noise into the training process, as DP does.

For models trained with DP, the introduction of $L_2$ regularization appears to further reduce validation accuracy. This decrease is likely due to the combined effect of the noise added by DP and the constraints imposed by $L_2$ regularization, which may restrict the model’s capacity to fit the training data effectively. The downward trend in accuracy with increasing $L_2$ regularization strength across both DP and non-DP models could be attributed to over-regularization [25].

Figure 1: $L_2$ regularization on the FCNN model trained on the MNIST dataset. The figure shows how validation accuracy and attacker advantage change with different $L_2$ regularization strengths ($\lambda$) for both Baseline (non-DP) and DP models.

In terms of privacy performance, measured by attacker advantage, the models exhibit a relatively stable trend across configurations, with average values between 1.48 and 1.90. This stability suggests that neither the constraint imposed by $L_2$ regularization nor the noise added by DP significantly alters the fundamental privacy characteristics of the models on the MNIST dataset. The small difference between training and validation accuracy across all models in Table 1 also indicates that overfitting is likely not a prevailing issue in this scenario. The similar attacker advantage across configurations suggests that all models maintain a similar level of robustness against membership inference.
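The train-minus-validation gaps can be read off Table 1 directly; the sketch below copies the mean accuracies from the table, and the helper name is illustrative. By the TPR/FPR form of the advantage in Section 2.3, a naive adversary that guesses "member" whenever a sample is classified correctly has an advantage equal to its accuracy on members minus its accuracy on non-members, so, taking the reported accuracies as estimates of those rates, such an attacker could gain at most about 0.6 percentage points here.

```python
lambdas = [0.0, 0.001, 0.002, 0.003, 0.004, 0.005]
# Mean accuracies (%) copied from Table 1
train = {"Baseline": [94.6, 94.8, 94.0, 93.4, 93.0, 92.4],
         "DP":       [94.1, 93.3, 93.0, 92.8, 92.4, 91.7]}
val = {"Baseline": [94.0, 94.8, 94.1, 93.5, 93.1, 92.8],
       "DP":       [93.6, 93.2, 92.9, 92.8, 92.4, 91.7]}

def gaps(model):
    """Train minus validation accuracy, in percentage points, for each lambda."""
    return [round(t - v, 2) for t, v in zip(train[model], val[model])]

for model in train:
    print(model, dict(zip(lambdas, gaps(model))))
```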

3.1.1 CIFAR10

The CIFAR10 dataset contains 50,000 training images and 10,000 evaluation images; each sample is a 32 × 32 color image belonging to one of 10 object classes. The applied CNN consists of three convolutional layers followed by a single fully connected layer. More specifically, the three convolutional layers have 32, 64, and 128 filters of size 3 × 3, and each is followed by a 2 × 2 average pooling layer. Finally, the extracted features are flattened and fed to a classification layer of 10 neurons. The non-DP CNN is trained for 50 epochs using the Adam optimizer [12] with a learning rate of 0.0001 and a batch size of 32. For the DP variant, the only change is that Adam is replaced by the equivalent DP-Adam optimizer of TensorFlow Privacy [2].
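As a sanity check on the architecture described above, the sketch below computes the feature-map sizes and trainable parameter counts of the CNN. It assumes 'valid' (no) padding and stride 1 for the convolutions and non-overlapping 2 × 2 pooling; the padding is not stated, so the numbers change if 'same' padding was used. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping pooling (floor division)."""
    return size // pool


def cifar_cnn_summary(hw=32, channels=3, filters=(32, 64, 128), kernel=3, n_classes=10):
    """Return (flattened feature size, total trainable parameters)."""
    total = 0
    for f in filters:
        params = kernel * kernel * channels * f + f  # weights + one bias per filter
        hw = pool_out(conv_out(hw, kernel))          # convolution, then 2x2 average pooling
        print(f"conv {f:>3} filters -> {hw}x{hw}x{f}, {params} params")
        channels = f
        total += params
    flat = hw * hw * channels
    dense = flat * n_classes + n_classes             # fully connected classification layer
    print(f"flatten -> {flat}, dense -> {n_classes}, {dense} params")
    return flat, total + dense


flat, total = cifar_cnn_summary()
print("total trainable parameters:", total)
```

Under these assumptions, the flattened feature vector has 512 entries and the model has 98,378 trainable parameters; average pooling adds none.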

Table 2: Performance metrics for the CNN on the CIFAR10 dataset with varying $L_2$ regularization strengths ($\lambda$). The table presents the mean and standard deviation of training accuracy, validation accuracy, and attacker advantage over five runs for both Baseline (non-DP) and DP models.

| Metric | Model | $\lambda = 0$ | $\lambda = 0.001$ | $\lambda = 0.002$ | $\lambda = 0.003$ | $\lambda = 0.004$ | $\lambda = 0.005$ |
|---|---|---|---|---|---|---|---|
| Train Accuracy | Baseline | 81.2 ± 0.444 | 77.6 ± 0.410 | 72.7 ± 0.345 | 69.3 ± 0.364 | 66.5 ± 0.356 | 64.6 ± 0.342 |
| | DP | 61.4 ± 0.303 | 59.6 ± 0.798 | 59.7 ± 0.470 | 59.9 ± 0.347 | 59.0 ± 0.332 | 59.1 ± 0.661 |
| Validation Accuracy | Baseline | 72.6 ± 0.244 | 73.3 ± 0.358 | 70.2 ± 2.23 | 67.8 ± 1.23 | 66.0 ± 1.08 | 64.0 ± 0.900 |
| | DP | 59.7 ± 0.257 | 58.3 ± 0.651 | 58.5 ± 0.985 | 58.9 ± 0.818 | 58.2 ± 0.707 | 57.9 ± 0.647 |
| Attacker Advantage | Baseline | 9.25 ± 0.298 | 5.76 ± 0.450 | 3.26 ± 0.330 | 2.05 ± 0.251 | 1.75 ± 0.159 | 1.21 ± 0.267 |
| | DP | 2.01 ± 0.153 | 1.75 ± 0.098 | 1.70 ± 0.221 | 1.46 ± 0.319 | 1.45 ± 0.165 | 1.58 ± 0.208 |

In Table 2, we report the mean and standard deviation of the training and validation accuracy, as well as the attacker advantage, across five training runs. Unlike the MNIST case, on the more challenging CIFAR10 dataset we observe a larger gap between training and validation accuracy, which coincides with a higher attacker advantage. Table 2 shows that configurations with larger accuracy gaps tend to have a higher attacker advantage, which highlights the importance of incorporating privacy-enhancing methods. This relationship, which suggests that models with larger training-to-validation accuracy gaps are more vulnerable to attacks, is investigated further in Section 3.3.

Table 2 and Figure 2 also illustrate the effect of applying $L_2$ regularization. As the regularization strength $\lambda$ increases beyond 0.001, the non-DP model exhibits a clear decline in both validation accuracy and attacker advantage. The DP model, in contrast, is largely insensitive to $\lambda$: its validation accuracy stays between 57.9% and 59.7% across all settings. Notably, at $\lambda = 0.001$ the non-DP model achieves the highest validation accuracy of all configurations, 73.3%, while lowering the attacker advantage to 5.76 from 9.25 without regularization. This suggests that, in this experiment, mild regularization reduces overfitting relative to the unregularized model. As $\lambda$ increases to 0.003, the validation accuracy of the non-DP model decreases to 67.8% but remains higher than the best DP validation accuracy of 59.7%. Finally, at $\lambda = 0.005$, the non-DP model still outperforms the DP model in validation accuracy, at 64.0%, and exhibits the lowest attacker advantage among all configurations, 1.21. The DP model trained with $L_2$ regularization achieves an average validation accuracy of approximately 58.36% and an attacker advantage between 1.45 and 1.75, regardless of the regularization strength. The DP model therefore does not reach a comparable level of utility: even though its attacker advantage is low, its validation accuracy is lower than that of every non-DP configuration.

Figure 2: Effect of $L_2$ regularization on the CNN model trained on the CIFAR10 dataset. The figure illustrates the relationship between validation accuracy and attacker advantage across different $L_2$ regularization strengths ($\lambda$) for both Baseline (non-DP) and DP models.

3.2 Text Classification Task

The dataset employed for this task is an artificially augmented version of the Toxic Tweets Dataset [9]. The original dataset consists of 54,313 tweets, each labeled as either toxic or non-toxic, with 24,153 and 32,592 posts in the two classes, respectively. The dataset was augmented with a new column, called "vulnerability", which indicates whether the author of the tweet belongs to a vulnerable group. A vulnerable group is defined as a group of people who are more likely to be affected by bias in AI models. Vulnerable groups include, but are not limited to, elderly people, young adults, racial minorities, and people with disabilities, as identified in the citizen study conducted in the pilot cities of Martin and Brasov for the ITHACA project [7].

The vulnerability attribute was assigned by a random process guided by the results of the citizen study, with a 20% probability that a tweet was authored by a member of a vulnerable group. To test the effectiveness of the method proposed in this work, an artificial bias was injected into the dataset so that posts originating from vulnerable groups were more likely to be labeled as toxic. Specifically, a post authored by a member of a vulnerable group has a 70% chance of being labeled toxic, whereas a post authored by a member of a non-vulnerable group has a 30% chance.
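One literal reading of this augmentation procedure is sketched below: each post is marked as vulnerable with probability 0.2, and its toxic label is then drawn with probability 0.7 for vulnerable authors and 0.3 otherwise. How the original labels are combined with this step is not specified, so overwriting the label, as done here, is an assumption, as are the hypothetical helper names `augment_with_bias` and `toxic_rate`.

```python
import random


def augment_with_bias(n_posts, p_vulnerable=0.2, p_toxic_vulnerable=0.7,
                      p_toxic_other=0.3, seed=0):
    """Return a list of (vulnerable, toxic) flags with an injected label bias."""
    rng = random.Random(seed)  # seeded for reproducibility
    rows = []
    for _ in range(n_posts):
        vulnerable = rng.random() < p_vulnerable
        p_toxic = p_toxic_vulnerable if vulnerable else p_toxic_other
        toxic = rng.random() < p_toxic
        rows.append((vulnerable, toxic))
    return rows


def toxic_rate(rows, vulnerable):
    """Fraction of posts labeled toxic within one group."""
    group = [toxic for v, toxic in rows if v == vulnerable]
    return sum(group) / len(group)


rows = augment_with_bias(54_313)
print("vulnerable share:", sum(v for v, _ in rows) / len(rows))
print("P(toxic | vulnerable):", toxic_rate(rows, True))
print("P(toxic | non-vulnerable):", toxic_rate(rows, False))
```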

This artificial bias simulates the real-world scenario in which AI models are trained on biased datasets, which has been shown to lead to biased predictions [10].

Our baseline non-DP model is a natural language processing model that detects the toxicity of the text excerpts in the dataset. It consists of a text vectorization pre-processing layer; an embedding layer, which converts the input tokens into continuous dense vectors of fixed size whose values capture semantic relationships between inputs; a global average pooling layer; and a dense layer with a sigmoid activation function. The maximum vocabulary size was set to $10^4$ and the output sequence length of the text vectorizer to 15 words per post. The output dimension of the embedding layer was set to 128, and that of the dense layer to 1, since the model was trained as a binary classifier that predicts whether a given text excerpt is toxic or non-toxic.
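The forward pass of this architecture can be sketched in plain Python as below. This is a simplified illustration rather than the authors' TensorFlow code, under several assumptions: tokenization is lower-casing plus whitespace splitting (Keras' `TextVectorization` also strips punctuation by default), index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and average pooling runs over all 15 positions, including padding, as a global average pooling layer does when no mask is propagated. Weights are random, and all function names are hypothetical.

```python
import math
import random

MAX_TOKENS = 10_000  # maximum vocabulary size (10^4)
SEQ_LEN = 15         # output sequence length of the vectorizer
EMBED_DIM = 128      # embedding output dimension


def build_vocab(corpus, max_tokens=MAX_TOKENS):
    """Map the most frequent tokens to ids >= 2 (0 = padding, 1 = OOV)."""
    counts = {}
    for text in corpus:
        for tok in text.lower().split():
            counts[tok] = counts.get(tok, 0) + 1
    ranked = sorted(counts, key=lambda t: (-counts[t], t))[: max_tokens - 2]
    return {tok: i + 2 for i, tok in enumerate(ranked)}


def vectorize(text, vocab, seq_len=SEQ_LEN):
    """Token ids, truncated or zero-padded to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


def init_params(vocab_size=MAX_TOKENS, dim=EMBED_DIM, seed=0):
    """Random embedding table plus weights and bias of a 1-unit dense layer."""
    rng = random.Random(seed)
    embedding = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]
    w = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return embedding, w, 0.0


def predict_toxicity(ids, embedding, w, b):
    """Embedding lookup -> global average pooling -> dense + sigmoid."""
    vectors = [embedding[i] for i in ids]
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    logit = sum(p * wi for p, wi in zip(pooled, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))


vocab = build_vocab(["you are great", "you are awful"])
embedding, w, b = init_params()
ids = vectorize("You are GREAT today", vocab)
print(ids, "P(toxic) =", predict_toxicity(ids, embedding, w, b))
n_params = len(embedding) * len(embedding[0]) + len(w) + 1
print("trainable parameters:", n_params)
```

With these dimensions the embedding table dominates the parameter count: $10^4 \times 128 = 1{,}280{,}000$ weights, versus 129 for the dense layer.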

The non-DP models were trained using the Adam optimizer with a learning rate of 0.001 for 100 epochs and a batch size of 1024. For the DP models, we used the DP-Adam optimizer from TensorFlow Privacy, setting the noise multiplier to 1.1 and the clipping norm to 1.0.
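DP-Adam follows the DP-SGD recipe of Abadi et al. [1]: each per-example gradient is clipped to an $L_2$ norm of at most the clipping norm $C$, the clipped gradients are summed, Gaussian noise with standard deviation $\sigma C$ (where $\sigma$ is the noise multiplier) is added, and the result is averaged before the usual Adam update. The sketch below implements only this privatization step with the values reported above ($\sigma = 1.1$, $C = 1.0$). It is a simplified illustration with no microbatching and no privacy accounting, and `privatize_gradients` is a hypothetical name, not a TensorFlow Privacy function.

```python
import math
import random


def privatize_gradients(per_example_grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=None):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only gradients whose L2 norm exceeds the clipping bound.
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for j, x in enumerate(g):
            summed[j] += x * scale
    sigma = noise_multiplier * l2_norm_clip  # noise std is proportional to the clip norm
    n = len(per_example_grads)
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]


grads = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 (clipped to 1.0) and 0.5 (kept)
print(privatize_gradients(grads, noise_multiplier=0.0))  # clipping only
print(privatize_gradients(grads, rng=random.Random(0)))  # clipping + noise
```

After averaging over a batch of size $B$, the added noise has standard deviation $\sigma C / B$ per coordinate, which for $B = 1024$ is about 0.0011.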

The experimental results are presented in Table 3 and visualized in Figure 3. The table reports the average and standard deviation of training accuracy, validation accuracy, and attacker advantage over five runs for both Baseline (non-DP) and DP models, across $L_2$ regularization strengths ($\lambda$) ranging from $1 \times 10^{-4}$ to $5 \times 10^{-4}$.

Table 3: Performance metrics for the text classification task with varying $L_2$ regularization strengths ($\lambda$). The table presents the mean and standard deviation of training accuracy, validation accuracy, and attacker advantage over five runs for both Baseline (non-DP) and DP models.

| Metric | Model | $\lambda = 0$ | $\lambda = 0.0001$ | $\lambda = 0.0002$ | $\lambda = 0.0003$ | $\lambda = 0.0004$ | $\lambda = 0.0005$ |
|---|---|---|---|---|---|---|---|
| Train Accuracy | Baseline | 97.9 ± 0.001 | 92.7 ± 0.001 | 91.1 ± 0.001 | 90.0 ± 0.001 | 89.3 ± 0.001 | 88.7 ± 0.002 |
| | DP | 78.4 ± 0.001 | 75.7 ± 0.005 | 73.3 ± 0.006 | 71.5 ± 0.011 | 70.0 ± 0.020 | 68.7 ± 0.016 |
| Validation Accuracy | Baseline | 90.6 ± 0.014 | 91.5 ± 0.004 | 90.6 ± 0.003 | 89.7 ± 0.003 | 89.0 ± 0.004 | 88.3 ± 0.003 |
| | DP | 78.6 ± 0.035 | 75.6 ± 0.017 | 73.4 ± 0.022 | 71.6 ± 0.009 | 70.0 ± 0.044 | 68.9 ± 0.030 |
| Attacker Advantage | Baseline | 8.16 ± 0.006 | 1.53 ± 0.004 | 0.92 ± 0.001 | 0.73 ± 0.003 | 0.75 ± 0.003 | 0.73 ± 0.003 |
| | DP | 0.25 ± 0.007 | 0.09 ± 0.008 | 0.17 ± 0.013 | 0.12 ± 0.022 | 0.09 ± 0.025 | 0.21 ± 0.018 |

From Table 3, we observe that the Baseline model achieves substantially higher training accuracy than the DP model across all $\lambda$ values. At $\lambda = 0$, the Baseline model reaches a training accuracy of 97.99%, while the DP model attains 78.44%. As $\lambda$ increases, both models lose roughly 9 to 10 percentage points of training accuracy; for the Baseline model, the largest single drop, about 5 percentage points, occurs between $\lambda = 0$ and $\lambda = 0.0001$, whereas the DP model declines steadily. In terms of validation accuracy, the Baseline model starts at 90.61% for $\lambda = 0$ and decreases slightly to 88.32% at $\lambda = 0.0005$. Interestingly, at $\lambda = 0.0001$, the Baseline model's validation accuracy improves slightly to 91.55%, suggesting that mild regularization can enhance generalization by reducing overfitting. The DP model's validation accuracy closely follows its training accuracy, starting at 78.68% and decreasing to 68.91% as $\lambda$ increases.

Regarding the attacker advantage, the Baseline model shows a substantial decrease as $\lambda$ increases. At $\lambda = 0$, the attacker advantage is 8.16, and it drops to approximately 0.73 at $\lambda = 0.0005$. This indicates that increasing the $L_2$ regularization strength reduces the model's vulnerability to Membership Inference Attacks. The DP model maintains a very low attacker advantage across all $\lambda$ values, ranging from 0.09 to 0.25, consistent with the strong privacy protection provided by DP training.

Figure 3: Effect of $L_2$ regularization on the text classification task. The top plot shows the validation accuracy across different $L_2$ regularization strengths ($\lambda$) for both Baseline (non-DP) and DP models. The bottom plot illustrates the corresponding attacker advantage for each model.

Figure 3 illustrates the impact of $L_2$ regularization on validation accuracy and attacker advantage for both Baseline and DP models. The top plot shows that the validation accuracy of the Baseline model remains relatively stable with increasing $\lambda$, with a slight improvement at $\lambda = 0.0001$ before gradually decreasing at higher $\lambda$ values. The DP model experiences a consistent decline in validation accuracy as $\lambda$ increases. The bottom plot highlights the substantial reduction in attacker advantage for the Baseline model as $\lambda$ increases, whereas the DP model maintains a consistently low attacker advantage across all regularization strengths.

These results suggest that applying $L_2$ regularization to the Baseline model can improve its generalization and substantially enhance its privacy by reducing the attacker advantage. At $\lambda = 0.0001$, the Baseline model achieves its highest validation accuracy of 91.55% while lowering the attacker advantage from 8.16 to 1.53. This indicates that even mild regularization can have a noticeable effect on both accuracy and privacy. For higher values of $\lambda$, the Baseline model continues to reduce the attacker advantage toward the values of the DP model while maintaining higher validation accuracy. At $\lambda = 0.0005$, the Baseline model's attacker advantage decreases to 0.73; this is still about three times the largest DP value (0.25), but the Baseline validation accuracy (88.32%) remains far above that of the DP model (68.91%). This suggests that $L_2$ regularization can be an effective strategy for balancing privacy and performance. The DP model, while offering strong privacy protection across all $\lambda$ values, exhibits lower validation accuracy than the Baseline model. This performance gap highlights the trade-off associated with DP training, where the noise added to achieve privacy can reduce model accuracy.

3.3 Overfit Evaluation

These results underscore the potential of $L_2$ regularization to offer a favorable privacy–utility balance: it reduces the attacker advantage while sustaining strong validation performance, whereas the DP-optimized model pays a considerable accuracy cost. Furthermore, the strength parameter $\lambda$ can be seen as a trade-off factor between model accuracy and privacy. Lower values of $\lambda$ tend to yield higher accuracy but weaker privacy protection, as reflected in a higher attacker advantage. Conversely, increasing $\lambda$ enhances privacy by lowering the attacker advantage, at the cost of reduced model accuracy. This trade-off can be tuned for the specific use case to balance strong validation performance against the model's vulnerability to attacks.
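One simple way to tune this trade-off, sketched below under the assumption that a maximum acceptable attacker advantage is fixed in advance, is to pick the configuration with the highest validation accuracy among those that stay within that bound. The data are the CIFAR10 means from Table 2; the bound values and the helper `select_config` are illustrative, not part of the original study.

```python
# (model, lambda, mean validation accuracy in %, mean attacker advantage) from Table 2
TABLE2 = [
    ("Baseline", 0.000, 72.6, 9.25), ("Baseline", 0.001, 73.3, 5.76),
    ("Baseline", 0.002, 70.2, 3.26), ("Baseline", 0.003, 67.8, 2.05),
    ("Baseline", 0.004, 66.0, 1.75), ("Baseline", 0.005, 64.0, 1.21),
    ("DP", 0.000, 59.7, 2.01), ("DP", 0.001, 58.3, 1.75),
    ("DP", 0.002, 58.5, 1.70), ("DP", 0.003, 58.9, 1.46),
    ("DP", 0.004, 58.2, 1.45), ("DP", 0.005, 57.9, 1.58),
]


def select_config(rows, max_advantage):
    """Most accurate configuration whose attacker advantage is within the bound."""
    feasible = [row for row in rows if row[3] <= max_advantage]
    if not feasible:
        return None  # no configuration meets the privacy requirement
    return max(feasible, key=lambda row: row[2])


for bound in (2.1, 1.3, 1.0):
    print(bound, select_config(TABLE2, bound))
```

With a bound of 2.1, for example, the regularized non-DP model at $\lambda = 0.003$ is selected (67.8% validation accuracy), while no DP configuration exceeds 59.7%.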

Figure 4: Correlation between the accuracy difference (training accuracy minus validation accuracy) and attacker advantage. The plot demonstrates a strong positive correlation (correlation coefficient of 0.93).

Finally, we analyzed the relationship between the accuracy difference (training accuracy minus evaluation accuracy) and the attacker advantage. To visualize this relationship, we plotted the accuracy difference on the y-axis and the attacker advantage on the x-axis. The resulting plot, presented in Figure 4, shows that the attacker advantage increases with the accuracy difference. Specifically, we observed a correlation coefficient of 0.93, indicating a strong positive correlation between these two metrics.
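The correlation coefficient is presumably Pearson's $r$, although the type is not named. To illustrate the computation, the sketch below applies it to the CIFAR10 means in Table 2 only, pooling the Baseline and DP rows. Since Figure 4 may be based on a different set of runs or datasets, the result is not expected to reproduce the reported 0.93; the helper `pearson` is a hypothetical name.

```python
import math

# Mean accuracies (%) and attacker advantage from Table 2, for lambda = 0 ... 0.005
TRAIN = {"Baseline": [81.2, 77.6, 72.7, 69.3, 66.5, 64.6],
         "DP": [61.4, 59.6, 59.7, 59.9, 59.0, 59.1]}
VAL = {"Baseline": [72.6, 73.3, 70.2, 67.8, 66.0, 64.0],
       "DP": [59.7, 58.3, 58.5, 58.9, 58.2, 57.9]}
ADV = {"Baseline": [9.25, 5.76, 3.26, 2.05, 1.75, 1.21],
       "DP": [2.01, 1.75, 1.70, 1.46, 1.45, 1.58]}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


models = ("Baseline", "DP")
gaps = [t - v for m in models for t, v in zip(TRAIN[m], VAL[m])]  # accuracy difference
advantages = [a for m in models for a in ADV[m]]
print("Pearson r (Table 2 only):", round(pearson(gaps, advantages), 3))
```

On Python 3.10 and later, `statistics.correlation` computes the same coefficient.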

These findings suggest that the gap between training and evaluation accuracy is a significant factor in the vulnerability of models to MIAs. A larger accuracy difference suggests that the model has overfitted to the training data, capturing noise and specific patterns that do not generalize to unseen data. This overfitting makes it easier for an attacker to determine whether a particular data point was part of the training set, thereby increasing the attacker advantage. Keeping the gap between training and evaluation accuracy small is therefore important for reducing vulnerability to MIAs, and techniques that mitigate overfitting, such as $L_2$ regularization and, by extension, dropout and early stopping, should be prioritized. Our experiments show that applying $L_2$ regularization not only enhances the model's generalization but also effectively reduces the attacker advantage, in some cases outperforming models trained with differential privacy in terms of both validation accuracy and privacy protection.
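Early stopping, one of the overfitting countermeasures mentioned above, can be sketched as a patience rule on validation accuracy: training halts once validation accuracy has not improved for a given number of epochs, and the weights from the best epoch are kept. The helper `early_stopping_epoch` and the accuracy curve below are hypothetical; Keras provides the same idea through its `EarlyStopping` callback (`monitor`, `patience`, `restore_best_weights`).

```python
def early_stopping_epoch(val_acc, patience=3):
    """Return the epoch whose weights would be kept under a patience rule."""
    best_epoch, best = 0, float("-inf")
    for epoch, acc in enumerate(val_acc):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical per-epoch validation accuracies: improvement, then a plateau.
curve = [0.50, 0.60, 0.65, 0.64, 0.63, 0.62, 0.70]
print(early_stopping_epoch(curve), early_stopping_epoch(curve, patience=5))
```

With a patience of 3, training stops at epoch 5 and keeps epoch 2, even though accuracy would have improved at epoch 6; a patience of 5 reaches that later peak, which illustrates the sensitivity of the rule to its patience setting.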

4 Conclusion

We have investigated the effectiveness of $L_2$ regularization in enhancing privacy protection against MIAs in DL models. Our approach was evaluated on benchmark datasets, including MNIST [14], CIFAR-10 [13], and an augmented version of the Toxic Tweets Dataset [9], using various neural network architectures.

Overall, the experimental results indicate that incorporating $L_2$ regularization into model training can reduce the attacker advantage while maintaining, or even improving, model accuracy compared to models trained with DP. Specifically, our findings indicate that $L_2$ regularization effectively mitigates overfitting, which is a key factor in the vulnerability of models to MIAs. Models trained with $L_2$ regularization not only achieved higher validation accuracies than their DP counterparts but also exhibited a lower attacker advantage than unregularized models.

Despite the improvements offered by $L_2$ regularization, some limitations remain. The effectiveness of $L_2$ regularization in providing privacy protection is closely tied to its ability to prevent overfitting. In scenarios where models are inherently prone to overfitting due to complex data distributions or limited training data, $L_2$ regularization alone may not be sufficient to ensure robust privacy. Additionally, while $L_2$ regularization can reduce the attacker advantage, it does not provide formal privacy guarantees, as differential privacy does.

To further enhance privacy protection, future research could explore combining $L_2$ regularization with other regularization techniques. Moreover, integrating $L_2$ regularization with formal privacy-preserving mechanisms such as DP could offer a balanced approach that leverages the strengths of both methods. Future work could also extend the evaluation to more complex datasets and models, to test whether the findings generalize and to develop robust strategies for privacy-preserving machine learning.

Data Availability

All data underlying the analyses are available as part of the article or as referenced external data sources and no additional source data are required.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • [1] Abadi, M., Chu, A., Goodfellow, I., McMahan, H.B., Mironov, I., Talwar, K., Zhang, L.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. pp. 308–318 (2016)
  • [2] Abadi, M., Chu, A., Goodfellow, I., McMahan, H.B., Mironov, I., Talwar, K., Zhang, L.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. pp. 308–318 (2016)
  • [3] Arachchige, P.C.M., Bertok, P., Khalil, I., Liu, D., Camtepe, S., Atiquzzaman, M.: Local differential privacy for deep learning. IEEE Internet of Things Journal 7(7), 5827–5842 (2019)
  • [4] Bickel, P.J., Li, B., Tsybakov, A.B., van de Geer, S.A., Yu, B., Valdés, T., Rivero, C., Fan, J., van der Vaart, A.: Regularization in statistics. Test 15, 271–344 (2006)
  • [5] Chan, J.Y.L., Leow, S.M.H., Bea, K.T., Cheng, W.K., Phoong, S.W., Hong, Z.W., Chen, Y.L.: Mitigating the multicollinearity problem and its machine learning approach: a review. Mathematics 10(8),  1283 (2022)
  • [6] Gurney, K.: An Introduction to Neural Networks. CRC Press (Oct 2018), http://dx.doi.org/10.1201/9781315273570
  • [7] Horizon, E.U.: Artificial intelligence to enhance civic participation, https://www.ithaca-project.eu/, [Accessed 2024-11-27]
  • [8] Hu, H., Salcic, Z., Dobbie, G., Chen, Y., Zhang, X.: Ear: an enhanced adversarial regularization approach against membership inference attacks. In: 2021 International Joint Conference on Neural Networks (IJCNN). pp. 1–8. IEEE (2021)
  • [9] Iyer, A.: Toxic Tweets Dataset — kaggle.com. https://www.kaggle.com/datasets/ashwiniyer176/toxic-tweets-dataset/data (2020), [Accessed 2024-04-22]
  • [10] Jiang, H., Nachum, O.: Identifying and correcting label bias in machine learning. In: Chiappa, S., Calandra, R. (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 108, pp. 702–712. PMLR (26–28 Aug 2020), https://proceedings.mlr.press/v108/jiang20a.html
  • [11] Kaya, Y., Hong, S., Dumitras, T.: On the effectiveness of regularization against membership inference attacks. arXiv preprint arXiv:2006.05336 (2020)
  • [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [13] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [14] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
  • [15] Liu, B., Ding, M., Shaham, S., Rahayu, W., Farokhi, F., Lin, Z.: When machine learning meets privacy: A survey and outlook. ACM Computing Surveys (CSUR) 54(2), 1–36 (2021)
  • [16] Phan, H., Thai, M.T., Hu, H., Jin, R., Sun, T., Dou, D.: Scalable differential privacy with certified robustness in adversarial learning. In: International Conference on Machine Learning. pp. 7683–7694. PMLR (2020)
  • [17] Phan, N., Wu, X., Hu, H., Dou, D.: Adaptive laplace mechanism: Differential privacy preservation in deep learning. In: 2017 IEEE international conference on data mining (ICDM). pp. 385–394. IEEE (2017)
  • [18] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (Jan 2015). https://doi.org/10.1016/j.neunet.2014.09.003
  • [19] Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)
  • [20] Sharma, S., Sharma, S., Athaiya, A.: Activation functions in neural networks. International Journal of Engineering Applied Sciences and Technology 04(12), 310–316 (May 2020). https://doi.org/10.33564/ijeast.2020.v04i12.054
  • [21] Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE symposium on security and privacy (SP). pp. 3–18. IEEE (2017)
  • [22] Solove, D.J.: Understanding privacy. Harvard university press (2010)
  • [23] Song, L., Mittal, P.: Systematic evaluation of privacy risks of machine learning models. In: 30th USENIX Security Symposium (USENIX Security 21). pp. 2615–2632 (2021)
  • [24] Tanuwidjaja, H.C., Choi, R., Baek, S., Kim, K.: Privacy-preserving deep learning on machine learning as a service—a comprehensive survey. IEEE Access 8, 167425–167447 (2020)
  • [25] Thoma, M.: Analysis and optimization of convolutional neural network architectures. arXiv preprint arXiv:1707.09725 (2017)
  • [26] Vassilev, A., Oprea, A., Fordyce, A., Anderson, H.: Adversarial machine learning. Gaithersburg, MD (2024)
  • [27] Xu, R., Baracaldo, N., Joshi, J.: Privacy-preserving machine learning: Methods, challenges and directions. arXiv preprint arXiv:2108.04417 (2021)
  • [28] Yeom, S., Giacomelli, I., Fredrikson, M., Jha, S.: Privacy risk in machine learning: Analyzing the connection to overfitting. In: 2018 IEEE 31st computer security foundations symposium (CSF). pp. 268–282. IEEE (2018)
  • [29] Ying, Z., Zhang, Y., Liu, X.: Privacy-preserving in defending against membership inference attacks. In: Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice. pp. 61–63 (2020)
  • [30] Zapechnikov, S.: Privacy-preserving machine learning as a tool for secure personalized information services. Procedia Computer Science 169, 393–399 (2020)