The application of PPTs is not always trivial. When a specific approach, such as reducing data granularity, is implemented to enhance privacy, the privacy level of certain instances may nevertheless remain unaffected due to the presence of extreme values. Moreover, reducing the detail of information may negatively impact data utility. For this reason, it is crucial to conduct studies that evaluate the effectiveness of such techniques. In this section, we discuss existing studies on the impact of PPTs on both data privacy and data utility in terms of predictive performance. Additionally, we cover available software, its principles, and implementation details.
7.1 Impact on Data Privacy and Predictive Performance
Non-perturbative Techniques. Improvements and new proposals on non-perturbative techniques emerged after the introduction of the k-anonymity measure, which was initially used to quantify identity disclosure in data transformed with generalisation and suppression techniques [206]. In the context of supervised learning tasks, namely classification, several studies aim to demonstrate the effectiveness of generalisation and suppression in predictive analysis. A common approach is to apply such techniques to highly identifiable QI attributes so as to satisfy privacy constraints with minimal information loss, and then to use k-anonymity to evaluate the privacy of the transformed data. Such transformation is often achieved through de-identification algorithms, for example, Incognito [128], Mondrian [129], and many others surveyed by Fung et al. [79]. In general, information loss is measured during the generalisation process. We note that studies of predictive performance often rely on datasets produced with these de-identification algorithms, all evaluated according to the models' performance. Many examples [21, 83, 108, 113, 237] show higher classification errors for datasets produced by optimising the loss metric, with the error reaching its maximum at higher values of k. In general, these studies show that increasing the de-identification level leads to a proportional degradation of predictive performance. However, it is still possible to protect individuals' privacy while maintaining predictive performance with both techniques.
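To make the evaluation step concrete, the following minimal sketch (with hypothetical column names) computes the k-anonymity of a table before and after a simple generalisation of the QI attributes; k is the size of the smallest group of records sharing one QI combination.

```python
# A minimal sketch, assuming hypothetical column names: k-anonymity is the
# size of the smallest group of records sharing a quasi-identifier combination.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers) -> int:
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "age": [23, 25, 31, 34, 36, 39],
    "zip": ["4200-123", "4200-456", "4450-001", "4450-002", "4450-111", "4450-222"],
    "diagnosis": ["flu", "flu", "asthma", "flu", "asthma", "asthma"],
})
qis = ["age", "zip"]
print(k_anonymity(df, qis))        # 1: every record is unique on its QIs

# Generalisation: recode age into 10-year bands and truncate the zip code.
df["age"] = (df["age"] // 10) * 10
df["zip"] = df["zip"].str[:2] + "**"
print(k_anonymity(df, qis))        # 2: coarser QIs, larger groups
```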
Nonetheless, it was proven that generalisation and suppression are not sufficient to protect against the disclosure of sensitive data [135, 148]. One suggestion to increase individuals' privacy is to suppress values with high disclosure risk [184]. Brickell and Shmatikov [19] focus on semantic definitions to quantify attribute disclosure. Besides k-anonymity, the authors also used l-diversity and t-closeness and proposed a new measure to capture the adversarial knowledge gain. Whereas previous works use a fixed set of QIs, this study varies the QI sets. In most cases, trivial de-identification (i.e., removing all QIs or all sensitive attributes) provides equivalent performance and better privacy guarantees than common generalisation and suppression. Such results challenge the conclusions of previous works. However, we believe that this result depends on the selected set of QIs: for instance, if a large number of attributes is suppressed, the de-identified dataset will intuitively retain little utility.
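In the same spirit, distinct l-diversity can be computed directly from the grouped data: l is the smallest number of distinct sensitive values observed within any QI group. A minimal sketch, reusing the hypothetical table layout above:

```python
# A minimal sketch of distinct l-diversity: within every group of records
# sharing the same QI values, count the distinct sensitive values; l is the
# minimum over all groups (l = 1 means some group leaks its sensitive value).
import pandas as pd

def l_diversity(df: pd.DataFrame, quasi_identifiers, sensitive: str) -> int:
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# e.g., l_diversity(df, ["age", "zip"], "diagnosis")
```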
LeFevre et al. [130] presented a suite of de-identification methods to generate de-identified data based on target workloads consisting of several data mining tasks, including both classification and regression. This study also focuses on attribute disclosure. In general, the derivations of the Mondrian algorithm outperform the de-identification algorithm of Fung et al. [83]. However, these methods suppress too many values [133]; an alternative is to determine the level of generalisation from the distribution of the attributes and then use cell suppression to remove locally detailed information, a method proven to be more accurate in classification than the proposal of LeFevre et al. [130]. Focusing only on regression tasks, Ohno-Machado et al. [181] also present a study using both suppression and generalisation, in which predictive performance is evaluated with respect to the number of suppressed cells. With minimum privacy guarantees, the results differ from those on the original data; however, a slightly higher privacy level yields approximately the same predictive performance.
Regarding unsupervised learning tasks, clustering methods have also been used to evaluate the impact of PPTs. Some approaches convert the problem into a classification analysis, in which class labels encode the cluster structure in the data, and then evaluate the cluster quality on the de-identified data. For instance, Fung et al. [81, 82] define the de-identification problem for cluster analysis using generalisation: after transforming the data, the clusters in the original dataset should match those in the de-identified dataset. In general, cluster quality degrades as the de-identification threshold increases. However, the results suggest that it is possible to achieve a reasonable level of de-identification without compromising cluster quality.
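The cluster-quality evaluation described above can be sketched as follows: cluster the original and the de-identified data with the same algorithm and measure the agreement of the two partitions, here via the adjusted Rand index. The noise-based transformation below is only a stand-in for any PPT.

```python
# A minimal sketch: compare the cluster structure of the original data with
# that of the de-identified data (a score of 1.0 means identical partitions).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                   # original numerical data
X_deid = X + rng.normal(0, 0.3, size=X.shape)   # stand-in for any PPT

labels_orig = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_deid = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_deid)
print(adjusted_rand_score(labels_orig, labels_deid))
```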
Perturbative Techniques. The introduction of noise is one of the most widely used techniques in the perturbative group. The aim is to generate a perturbed dataset that remains statistically close to the original data. Typically, the closer the perturbed data is to the original, the less confidential it becomes; conversely, the more distant the perturbed dataset is from the original, the more secure it is. However, the utility of the dataset may be lost when the statistical characteristics of the original dataset are lost [157].
Besides generalisation and suppression, a few studies also include datasets with noise [24, 233]. Contrary to Vanichayavisalsakul and Piromsopa [233], the conclusions of Carvalho and Moniz [24] point towards a noticeable impact of such techniques on predictive performance, especially noise. Noise, however, is also the technique that presents a low re-identification risk level. Beyond that, the former study uses several de-identification algorithms, whereas the latter tested different PPT parameters without a de-identification algorithm.
Noise has also been applied on its own by some researchers [158, 159, 257]. These experimental studies show that the level of noise does affect the classification error. Such results are expected: the higher the \(\epsilon\), the less the noise [127], and therefore the closer the private data is to the original data. Although noise adds uncertainty to an intruder's re-identification ability, it may introduce additional data mining bias [241]. This bias can severely impact knowledge discovery and is usually related to changes in variance, in the relationships between attributes, or in the underlying attribute distribution. Wilson and Rosen [241] show that additive noise has a lower impact on classification than other types of noise. More recently, Liu et al. [143] presented a new approach to reduce the uncertainty introduced by noise: the idea is to represent features using vectorising functions over de-identified or original instances. Experiments show that a regression model trained with de-identified data can be expected to perform as well as one trained on the original dataset under certain feature representations.
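The inverse relationship between \(\epsilon\) and the amount of noise can be illustrated with the Laplace mechanism, where the noise scale is the query sensitivity divided by \(\epsilon\). This is a generic sketch, not the mechanism used in any of the cited studies.

```python
# A minimal sketch of the Laplace mechanism: a higher epsilon gives a smaller
# noise scale, so the noisy answers concentrate around the true value.
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(value, sensitivity, epsilon):
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 120.0                    # e.g., a counting query (sensitivity 1)
for eps in (0.1, 1.0, 10.0):
    samples = [laplace_mechanism(true_count, 1.0, eps) for _ in range(5)]
    print(eps, [round(s, 1) for s in samples])
```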
Oliveira and Zaiane [182] proposed a new distortion technique for numerical attributes to meet the desired privacy level in clustering analysis. The experiments include additive noise, multiplicative noise, and rotation noise defined by an angle \(\theta\). Their technique shows a misclassification rate between 0% and 0.2%. In particular, multiplicative noise achieved the best values for accuracy and privacy level in most experiments. In general, the experiments show that it is possible to achieve a good compromise between privacy and accuracy.
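The three noise families can be sketched as below; this is a generic illustration with assumed parameters, not the authors' exact distortion technique. Note that a rotation preserves pairwise distances, which helps explain why it tends to preserve cluster structure.

```python
# A minimal sketch of additive noise, multiplicative noise, and rotation
# noise over two numerical attributes (all parameters are illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(100, 2))                   # two numerical attributes

additive = X + rng.normal(0, 1, size=X.shape)           # x' = x + e
multiplicative = X * rng.normal(1, 0.05, size=X.shape)  # x' = x * e

theta = np.radians(15)                                  # rotation angle theta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = X @ R.T                                       # distance-preserving
```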
Microaggregation is also frequently used for the perturbation of microdata and has been enhanced in terms of disclosure risk. For instance, Fadel et al. [69] present a heuristic approach to microaggregation that aims to reduce disclosure risk compared with other approaches. Predictive analysis on microaggregation shows that the prediction accuracy of a classifier trained on a de-identified dataset is not always worse than the baseline [139]; for instance, some results show higher accuracy than the baseline due to the reduction of variance in the de-identified dataset. However, this technique either produces a low degree of within-cluster homogeneity or fails to reduce the amount of noise regardless of dataset size; for these reasons, Iftikhar et al. [107] propose an interesting approach that uses microaggregation to generate differentially private datasets. Muralidhar et al. [165] presented an empirical study comparing two approaches to differential privacy via microaggregation. Their experimental results show that a fixed \(\epsilon\) does not guarantee a certain level of confidentiality. Besides, differentially private data can be challenging for data analysts to work with, since the added noise may lead to inaccurate models [8]. Moreover, differentially private data can be vulnerable to membership inference attacks, especially for outlier cases [142]. Very recently, Blanco-Justicia et al. [16] presented a critical review of the use of differential privacy in machine learning. Their experiments indicate that standard anti-overfitting techniques provide similar practical protection, improved accuracy, and significantly lower learning costs than differential-privacy-based machine learning; they therefore argue that this method is poorly suited to microdata releases, which challenges the previous theoretical guarantees.
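A minimal sketch of the basic building block behind these approaches, fixed-size univariate microaggregation: values are sorted, partitioned into groups of at least k records, and each value is replaced by its group mean. This is the naive fixed-size variant, not any of the cited heuristics.

```python
# A minimal sketch of fixed-size univariate microaggregation (assumes
# len(values) >= k); the remainder is merged into the last group so that
# every group contains at least k records.
import numpy as np

def microaggregate(values, k):
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    n_groups = len(values) // k
    out = np.empty(len(values))
    for g in range(n_groups):
        lo = g * k
        hi = (g + 1) * k if g < n_groups - 1 else len(values)
        idx = order[lo:hi]
        out[idx] = values[idx].mean()     # replace each value by its group mean
    return out

ages = np.array([21, 22, 23, 35, 36, 37, 58, 59, 60, 61])
print(microaggregate(ages, k=3))
# -> [22, 22, 22, 36, 36, 36, 59.5, 59.5, 59.5, 59.5]
```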
De-associative Techniques. De-associative techniques have also been explored and improved to protect individuals' privacy. The principal drawbacks of these techniques are that the QITs are published in raw form and that certain absolute facts carry disclosure risks, helping an intruder identify invalid records in the transformed dataset and thereby disclose confidential information. Therefore, Hasan et al. [93] proposed combining slicing with data swapping to decrease the attribute disclosure risk; by swapping values, the published data contains no invalid information that an intruder could use to disclose individual privacy. However, Sari et al. [207] state that splitting the attributes into two sets produces more records than the original dataset. Their proposal applies generalisation and suppression to the QIT, then aggregates sensitive values in an ST and summarises the QI attributes, thereby reducing the number of records in the ST. Conclusions are limited, however, as privacy risk and utility were not measured, although the approach theoretically provides better protection.
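The de-associative idea can be sketched as an anatomisation-style release in which QIs and sensitive values are published in separate tables linked only by a group identifier, with sensitive values shuffled within each group. This is a generic sketch, not the exact methods of Hasan et al. [93] or Sari et al. [207].

```python
# A minimal sketch of a de-associative release: the quasi-identifier table
# (QIT) and the sensitive table (ST) share only a group id, and sensitive
# values are shuffled within each group to break record-level links.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 34],
    "zip": ["4200", "4200", "4450", "4450"],
    "diagnosis": ["flu", "hiv", "asthma", "flu"],
})
df["gid"] = [0, 0, 1, 1]                     # bucketised groups of records

qit = df[["gid", "age", "zip"]]              # QIs released without sensitive data
st = df[["gid", "diagnosis"]].copy()
st["diagnosis"] = (st.groupby("gid")["diagnosis"]
                     .transform(lambda s: s.sample(frac=1).to_numpy()))
```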
Synthetic Data. Releasing synthetic data that mimics the original data provides a promising alternative to the previous techniques. Numerous approaches for synthetic data generation have been introduced in the literature: deep generative models such as Generative Adversarial Networks (GANs), including conditional GANs, and variational autoencoders; tree-based methods such as CART [195]; and sampling methods such as SMOTE [27], which were commonly used before the rise of deep learning models. Although synthetic data can support practically the same statistical conclusions as the original data, the risk of disclosing private information remains [12, 245]. The combination of synthetic data with differential privacy is therefore highly recommended [11, 12, 17, 92].
Concerning synthetic data generated with deep learning models, several approaches to differentially private synthetic data have emerged. For instance, DPGAN [245] aims to protect the confidentiality of the training data while providing good-quality data. MedGAN [28] focuses on the generation of patient records using a combination of variational autoencoders and GANs. However, MedGAN overestimates the number of diagnoses for patients and introduces comorbidities that were not present in the original dataset. To address its low utility, Yale et al. [248] proposed HealthGAN, which incorporates WGAN-GP [5]. Many other studies have shown that differentially private synthetic data can provide high data utility while preserving privacy [115, 124, 145, 201]. Notwithstanding, Fang et al. [70] introduced DP-CTGAN, which outperforms DPGAN and PATE-GAN [115]. Very recently, Kotal et al. [120] proposed PriveTAB, based on CTGAN [247], which generates synthetic data that ensures a predetermined t-closeness level to preserve individuals' privacy and thus reduces the possibility of linkage attacks.
In addition, studies on predictive performance have been carried out. Despite the previous conclusions on utility, synthetic data has been shown to affect predictive performance. Hittmeir et al. [97] show that the outcomes of synthetic data without differential privacy are closer to the original. However, they also show that an intruder is able to obtain predictions that may be much closer to the true target value. The same conclusions were obtained for regression tasks [98]. Using tree-based models for synthetic data generation, Rankin et al. [193] show small decreases in accuracy for models trained with synthetic data compared to those trained on real data; however, that study does not include a privacy assessment. Regarding sampling methods, PrivateSMOTE, proposed by Carvalho et al. [26], generates synthetic data by interpolating only the cases at maximum risk (single outs) instead of synthesising all cases, while minimising data utility loss. PrivateSMOTE demonstrates similar or even better results in re-identification risk and predictive performance (classification tasks) compared to deep learning based models.
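The interpolation idea can be sketched as follows; this is a generic SMOTE-style interpolation over records flagged as high risk, inspired by but not identical to the PrivateSMOTE algorithm of Carvalho et al. [26].

```python
# A minimal sketch: replace only the high-risk records (e.g., single outs)
# with points interpolated between each record and a random near neighbour.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def interpolate_high_risk(X, at_risk, n_neighbors=5, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    X_new = X.copy()
    for i in np.where(at_risk)[0]:
        # Drop the first neighbour: it is the queried record itself.
        neighbours = nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:]
        j = rng.choice(neighbours)
        gap = rng.random()                       # position along the segment
        X_new[i] = X[i] + gap * (X[j] - X[i])    # synthetic replacement
    return X_new
```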
Final Remarks. An organisation’s main challenge is applying the optimal PPT that reduces disclosure risks with minimal information loss. Currently, there are no “one size fits all” approaches to data privacy. Nevertheless, some guidelines should be used to limit disclosure risk when releasing data to the public, industries, or researchers. Data release must follow a defence strategy that uses both technical and non-technical approaches to data privacy [
101,
170].
A non-technical approach aims to understand the intended use of the data, since it would be naive to assume that all malicious attacks against individuals' privacy have been discovered at this stage. Additionally, an intruder may possess external information to join with a released dataset in order to re-identify individuals in the de-identified dataset. To address this issue, non-technical approaches should be considered to provide risk-limiting solutions. As such, datasets should not be made freely available without barriers. An example of a barrier is requiring data receivers to identify the study they will conduct, specifying the data they require and how the data will be used to achieve their goals. Thus, the effort required for a potential disclosure by an intruder is raised. Furthermore, it allows the data controller to filter out data receivers with malicious intent, whose work aims to violate the data subjects' privacy. In addition, another non-technical defence is an agreement in which data receivers commit not to re-identify any data.
Regarding technical approaches, one of the main issues is the understanding of which PPTs are most appropriate for certain situations. Many of these techniques require trial and error to decide which parameters to configure and to find the acceptable trade-off. Therefore, it is important to decide which PPT is most appropriate to apply to a given dataset for a specific application.
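This trial-and-error process can be sketched as a parameter sweep over a privacy parameter, recording a risk proxy against a utility proxy at each setting. Everything below is illustrative: the data is randomly generated and the banding transform stands in for any PPT.

```python
# A minimal sketch of a privacy/utility sweep: for each setting, record the
# share of records unique on the QIs (risk proxy) and a cross-validated
# accuracy (utility proxy). Data and transform are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.normal(30000, 8000, 500).round(),
    "target": rng.integers(0, 2, 500),
})
qis = ["age", "income"]

def transform(data, qis, width):
    # Hypothetical PPT: global recoding of numerical QIs into bands of `width`.
    out = data.copy()
    for c in qis:
        out[c] = (out[c] // width) * width
    return out

for width in (1, 5, 20, 50):
    released = transform(df, qis, width)
    risk = released.groupby(qis)[qis[0]].transform("size").eq(1).mean()
    utility = cross_val_score(RandomForestClassifier(random_state=0),
                              released[qis], released["target"], cv=5).mean()
    print(width, round(risk, 3), round(utility, 3))
```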
7.2 Available Software
Many privacy tools have been developed to help data controllers, or any other user, in decision-making. Beyond the transformation of microdata through PPTs, some tools allow the assessment of different configurations of such techniques, enabling the evaluation of the achieved privacy and utility levels.
\(\mu\)-ARGUS [41] was the pioneering tool in privacy-preserving publishing. It is free, open source software that provides a user interface [214]. The tool supports techniques such as global recoding, top-and-bottom coding, local suppression, PRAM, noise addition, and microaggregation. Individual risk estimation is based on sampling weights [13] and k-anonymity. Furthermore, \(\mu\)-ARGUS allows the creation of de-identified datasets suitable for different purposes, namely scientific and public use files. UTD Anonymisation ToolBox [231] is available for download and implements the first three groups of PPTs. Cornell Anonymisation Toolkit [29] provides an interface, uses generalisation to transform the data, and is free for download. Both tools report utility and re-identification risk. However, they are research prototypes and have scalability issues when handling large datasets.
Two well-known tools are sdcMicro [222] and the ARX Data Anonymisation Tool [189]. The R package sdcMicro is free open source software for producing scientific and public use files [222]. It supports several PPTs, both non-perturbative and perturbative. For risk estimation, sdcMicro implements, for example, SUDA2, k-anonymity, and log-linear models, among others. Besides that, the package quantifies the utility of the de-identified data relative to the original data and the corresponding information loss. It also features a user-friendly interface [121] that allows non-experts in the de-identification process to gain insight into the impact of various PPTs and reduces the burden of using the software. The ARX Data Anonymisation Tool, often known simply as ARX, is one of the most used tools for data privacy. This software is free and has a simple and intuitive interface [6], which provides wizards for creating transformation rules and visualisations of re-identification risks. A wide range of PPTs is supported in ARX, as well as many measures for both re-identification risk and data utility. Regarding data utility, ARX can further optimise output data towards suitability as a training set for learning models. ARX is also available as a library with an API that provides data de-identification for any Java program.
Another free open source privacy tool is Amnesia [183], which provides both downloadable software and an online dashboard. Amnesia allows the selection of generalisation levels and uses k-anonymity to assess disclosure risk. The tool produces many possible solutions, shows the distribution of values, and provides statistics about the data quality of the de-identified dataset. Moreover, Amnesia can transform relational and transactional databases into de-identified data.
All the previous tools require the attribute terminology in advance, which is a static approach. A dynamic tool called Aircloak [
3] has emerged to help data controllers by giving them access to all the underlying data, and dynamically adapting the de-identification to the specific query and data requested. The resulting answer set is fully de-identified. Both non-perturbative and perturbative techniques are implemented in Aircloak. This tool has a free version for students based on the querying system and a full version for organisations.
Due to the increased attention to synthetic data, several generators have emerged to facilitate the creation of synthetic samples. Synthpop (SP) [178] is an R and Python package that uses tree-based methods to synthesise data, where each attribute is generated by estimating conditional distributions. Although this tool does not use GANs for synthesis, it includes features such as differential privacy, record linkage, and outlier detection. One of the best-known generators is the Synthetic Data Vault (SDV) [186], a free and open source Python tool. SDV estimates the joint distribution of the population using a Gaussian copula but also includes GANs. Furthermore, SDV supports differential privacy and provides metrics for assessing privacy and utility. DataSynthesizer [188] uses a differentially private Bayesian network to capture correlations between the different attributes and allows the user to analyse the accuracy of the synthetic data. Comparisons between these tools have been performed [39, 98].
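As an illustration of how such generators are used, the following sketch fits SDV's Gaussian copula model to a single table and samples a synthetic replica. It assumes the SDV 1.x API (module and class names differ in earlier releases), and the input file name is hypothetical.

```python
# A minimal sketch with SDV (assuming the 1.x API): fit a Gaussian copula
# to a single table and sample a synthetic dataset of the same size.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("microdata.csv")          # hypothetical input file

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)         # infer column types

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=len(real))
```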
Furthermore, commercial solutions have gained momentum for privacy-preserving synthetic data, such as MOSTLY AI [
163], YData [
249], and Gretel [
87]. All three generate synthetic data based on deep learning models and allow the evaluation of privacy and utility. These commercial solutions also provide open source Python packages [
88,
162,
250] but with some limitations. For instance, YData and Gretel only provide the generator and utility assessment.
Although some of the presented tools are more intuitive and user-friendly than others, they require prior knowledge of SDC principles and expertise in the R or Python programming languages. These tools are useful for data protection, but their use is not trivial. Table 5 summarises the discussed privacy tools and their characteristics.