The application of PPTs is not always trivial. When a specific approach, such as reducing data granularity, is implemented to enhance privacy, the privacy level of certain instances may nevertheless remain unaffected due to the presence of extreme values. Moreover, reducing the detail of information may negatively impact data utility. For this reason, it is crucial to conduct studies that evaluate the effectiveness of such techniques. In this section, we discuss existing studies on the impact of PPTs on both data privacy and data utility in terms of predictive performance. Additionally, we cover available software, its principles, and implementation details.
7.1 Impact on Data Privacy and Predictive Performance
Non-perturbative Techniques. Improvements and new proposals on non-perturbative techniques emerged after the introduction of the k-anonymity measure, which was initially used to quantify identity disclosure in data transformed with generalisation and suppression techniques [206]. In the context of supervised learning tasks, namely classification, several studies aim to demonstrate the effectiveness of generalisation and suppression in predictive analysis. A common approach is to apply such techniques to highly identifiable QI attributes so as to satisfy privacy constraints with minimal information loss, and then to use k-anonymity to evaluate the privacy of the transformed data. Such transformation is often achieved through de-identification algorithms, for example, Incognito [128], Mondrian [129], and many others surveyed by Fung et al. [79]. In general, information loss is measured during the generalisation process. We note that studies of predictive performance often rely on datasets produced with these de-identification algorithms, all evaluated according to the models' performance. Many examples [21, 83, 108, 113, 237] show higher classification errors for datasets produced by optimising the loss metric, with the error reaching its maximum at higher values of k. In general, these studies show that increasing the de-identification level leads to a proportional degradation of predictive performance. However, it is still possible to protect individuals' privacy while maintaining predictive performance with both techniques.
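To make the evaluation step concrete, the following minimal sketch (with hypothetical column names) computes the k-anonymity of a table before and after a simple generalisation of the QI attributes; k is the size of the smallest group of records sharing one QI combination.

```python
# A minimal sketch, assuming hypothetical column names: k-anonymity is the
# size of the smallest group of records sharing a quasi-identifier combination.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers) -> int:
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "age": [23, 25, 31, 34, 36, 39],
    "zip": ["4200-123", "4200-456", "4450-001", "4450-002", "4450-111", "4450-222"],
    "diagnosis": ["flu", "flu", "asthma", "flu", "asthma", "asthma"],
})
qis = ["age", "zip"]
print(k_anonymity(df, qis))        # 1: every record is unique on its QIs

# Generalisation: recode age into 10-year bands and truncate the zip code.
df["age"] = (df["age"] // 10) * 10
df["zip"] = df["zip"].str[:2] + "**"
print(k_anonymity(df, qis))        # 2: coarser QIs, larger groups
```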
Nonetheless, it was proven that generalisation and suppression are not sufficient to protect against the disclosure of sensitive data [135, 148]. One suggestion to increase individuals' privacy is to suppress values with high disclosure risk [184]. Brickell and Shmatikov [19] focus on semantic definitions to quantify attribute disclosure. Besides k-anonymity, the authors also used l-diversity and t-closeness and proposed a new measure to capture the adversarial knowledge gain. Whereas previous works use a fixed set of QIs, this study varies the QI sets. In most cases, trivial de-identification (i.e., removing all QIs or all sensitive attributes) provides equivalent performance and better privacy guarantees than common generalisation and suppression. Such results challenge the conclusions of previous works. However, we believe that this result depends on the selected set of QIs: for instance, if a large number of attributes is suppressed, the de-identified dataset will intuitively retain little utility.
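In the same spirit, distinct l-diversity can be computed directly from the grouped data: l is the smallest number of distinct sensitive values observed within any QI group. A minimal sketch, reusing the hypothetical table layout above:

```python
# A minimal sketch of distinct l-diversity: within every group of records
# sharing the same QI values, count the distinct sensitive values; l is the
# minimum over all groups (l = 1 means some group leaks its sensitive value).
import pandas as pd

def l_diversity(df: pd.DataFrame, quasi_identifiers, sensitive: str) -> int:
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# e.g., l_diversity(df, ["age", "zip"], "diagnosis")
```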
LeFevre et al. [130] presented a suite of de-identification methods to generate de-identified data based on target workloads consisting of several data mining tasks, including both classification and regression. This study also focuses on attribute disclosure. In general, the derivations of the Mondrian algorithm outperform the de-identification algorithm of Fung et al. [83]. However, these methods suppress too many values [133]; an alternative is to determine the level of generalisation from the distribution of the attributes and then use cell suppression to remove locally detailed information, a method proven to be more accurate in classification than the proposal of LeFevre et al. [130]. Focusing only on regression tasks, Ohno-Machado et al. [181] also present a study using both suppression and generalisation, in which predictive performance is evaluated with respect to the number of suppressed cells. With minimum privacy guarantees, the results differ from those on the original data; however, a slightly higher privacy level yields approximately the same predictive performance.
Regarding unsupervised learning tasks, clustering methods have also been used to evaluate the impact of PPTs. Some approaches convert the problem into a classification analysis, in which class labels encode the cluster structure in the data, and then evaluate the cluster quality on the de-identified data. For instance, Fung et al. [81, 82] define the de-identification problem for cluster analysis using generalisation: after transforming the data, the clusters in the original dataset should match those in the de-identified dataset. In general, cluster quality degrades as the de-identification threshold increases. However, the results suggest that it is possible to achieve a reasonable level of de-identification without compromising cluster quality.
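The cluster-quality evaluation described above can be sketched as follows: cluster the original and the de-identified data with the same algorithm and measure the agreement of the two partitions, here via the adjusted Rand index. The noise-based transformation below is only a stand-in for any PPT.

```python
# A minimal sketch: compare the cluster structure of the original data with
# that of the de-identified data (a score of 1.0 means identical partitions).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                   # original numerical data
X_deid = X + rng.normal(0, 0.3, size=X.shape)   # stand-in for any PPT

labels_orig = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_deid = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_deid)
print(adjusted_rand_score(labels_orig, labels_deid))
```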
Perturbative Techniques. The introduction of noise is one of the most widely used techniques in the perturbative group. The aim is to generate a perturbed dataset that remains statistically close to the original data. Typically, the closer the perturbed data is to the original, the less confidential it becomes; conversely, the more distant the perturbed dataset is from the original, the more secure it is. However, the utility of the dataset may be lost when the statistical characteristics of the original dataset are lost [157].
Besides generalisation and suppression, a few studies also include datasets with noise [24, 233]. Contrary to Vanichayavisalsakul and Piromsopa [233], the conclusions of Carvalho and Moniz [24] point towards a noticeable impact of such techniques on predictive performance, especially noise. Noise, however, is also the technique that presents a low re-identification risk level. Beyond that, the former study uses several de-identification algorithms, whereas the latter tested different PPT parameters without a de-identification algorithm.
Noise has also been applied on its own by some researchers [158, 159, 257]. These experimental studies show that the level of noise does affect the classification error. Such results are expected: the higher the \(\epsilon\), the less the noise [127], and therefore the closer the private data is to the original data. Although noise adds uncertainty to an intruder's re-identification ability, it may introduce additional data mining bias [241]. This bias can severely impact knowledge discovery and is usually related to changes in variance, in the relationships between attributes, or in the underlying attribute distribution. Wilson and Rosen [241] show that additive noise has a lower impact on classification than other types of noise. More recently, Liu et al. [143] presented a new approach to reduce the uncertainty introduced by noise: the idea is to represent features using vectorising functions over de-identified or original instances. Experiments show that a regression model trained with de-identified data can be expected to perform as well as one trained on the original dataset under certain feature representations.
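The inverse relationship between \(\epsilon\) and the amount of noise can be illustrated with the Laplace mechanism, where the noise scale is the query sensitivity divided by \(\epsilon\). This is a generic sketch, not the mechanism used in any of the cited studies.

```python
# A minimal sketch of the Laplace mechanism: a higher epsilon gives a smaller
# noise scale, so the noisy answers concentrate around the true value.
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(value, sensitivity, epsilon):
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 120.0                    # e.g., a counting query (sensitivity 1)
for eps in (0.1, 1.0, 10.0):
    samples = [laplace_mechanism(true_count, 1.0, eps) for _ in range(5)]
    print(eps, [round(s, 1) for s in samples])
```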
Oliveira and Zaiane [182] proposed a new distortion technique for numerical attributes to meet the desired privacy level in clustering analysis. The experiments include additive noise, multiplicative noise, and rotation noise defined by an angle \(\theta\). Their technique shows a misclassification rate between 0% and 0.2%. In particular, multiplicative noise achieved the best values for accuracy and privacy level in most experiments. In general, the experiments show that it is possible to achieve a good compromise between privacy and accuracy.
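The three noise families can be sketched as below; this is a generic illustration with assumed parameters, not the authors' exact distortion technique. Note that a rotation preserves pairwise distances, which helps explain why it tends to preserve cluster structure.

```python
# A minimal sketch of additive noise, multiplicative noise, and rotation
# noise over two numerical attributes (all parameters are illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(100, 2))                   # two numerical attributes

additive = X + rng.normal(0, 1, size=X.shape)           # x' = x + e
multiplicative = X * rng.normal(1, 0.05, size=X.shape)  # x' = x * e

theta = np.radians(15)                                  # rotation angle theta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = X @ R.T                                       # distance-preserving
```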
Microaggregation is also frequently used for the perturbation of microdata and has been enhanced in terms of disclosure risk. For instance, Fadel et al. [69] present a heuristic approach to microaggregation that aims to reduce disclosure risk compared with other approaches. Predictive analysis on microaggregation shows that the prediction accuracy of a classifier trained on a de-identified dataset is not always worse than the baseline [139]; for instance, some results show higher accuracy than the baseline due to the reduction of variance in the de-identified dataset. However, this technique either produces a low degree of within-cluster homogeneity or fails to reduce the amount of noise regardless of dataset size; for these reasons, Iftikhar et al. [107] propose an interesting approach that uses microaggregation to generate differentially private datasets. Muralidhar et al. [165] presented an empirical study comparing two approaches to differential privacy via microaggregation. Their experimental results show that a fixed \(\epsilon\) does not guarantee a certain level of confidentiality. Besides, differentially private data can be challenging for data analysts to work with, since the added noise may lead to inaccurate models [8]. Moreover, differentially private data can be vulnerable to membership inference attacks, especially for outlier cases [142]. Very recently, Blanco-Justicia et al. [16] presented a critical review of the use of differential privacy in machine learning. Their experiments indicate that standard anti-overfitting techniques provide similar practical protection, improved accuracy, and significantly lower learning costs than differential-privacy-based machine learning; they therefore argue that this method is poorly suited to microdata releases, which challenges the previous theoretical guarantees.
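A minimal sketch of the basic building block behind these approaches, fixed-size univariate microaggregation: values are sorted, partitioned into groups of at least k records, and each value is replaced by its group mean. This is the naive fixed-size variant, not any of the cited heuristics.

```python
# A minimal sketch of fixed-size univariate microaggregation (assumes
# len(values) >= k); the remainder is merged into the last group so that
# every group contains at least k records.
import numpy as np

def microaggregate(values, k):
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    n_groups = len(values) // k
    out = np.empty(len(values))
    for g in range(n_groups):
        lo = g * k
        hi = (g + 1) * k if g < n_groups - 1 else len(values)
        idx = order[lo:hi]
        out[idx] = values[idx].mean()     # replace each value by its group mean
    return out

ages = np.array([21, 22, 23, 35, 36, 37, 58, 59, 60, 61])
print(microaggregate(ages, k=3))
# -> [22, 22, 22, 36, 36, 36, 59.5, 59.5, 59.5, 59.5]
```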
De-associative Techniques. De-associative techniques have also been explored and improved to protect individuals' privacy. The principal drawbacks of these techniques are that the QITs are published in raw form and that certain absolute facts carry disclosure risks, helping an intruder identify invalid records in the transformed dataset and thereby disclose confidential information. Therefore, Hasan et al. [93] proposed combining slicing with data swapping to decrease the attribute disclosure risk; by swapping values, the published data contains no invalid information that an intruder could use to disclose individual privacy. However, Sari et al. [207] state that splitting the attributes into two sets produces more records than the original dataset. Their proposal applies generalisation and suppression to the QIT, then aggregates sensitive values in an ST and summarises the QI attributes, thereby reducing the number of records in the ST. Conclusions are limited, however, as privacy risk and utility were not measured, although the approach theoretically provides better protection.
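The de-associative idea can be sketched as an anatomisation-style release in which QIs and sensitive values are published in separate tables linked only by a group identifier, with sensitive values shuffled within each group. This is a generic sketch, not the exact methods of Hasan et al. [93] or Sari et al. [207].

```python
# A minimal sketch of a de-associative release: the quasi-identifier table
# (QIT) and the sensitive table (ST) share only a group id, and sensitive
# values are shuffled within each group to break record-level links.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 34],
    "zip": ["4200", "4200", "4450", "4450"],
    "diagnosis": ["flu", "hiv", "asthma", "flu"],
})
df["gid"] = [0, 0, 1, 1]                     # bucketised groups of records

qit = df[["gid", "age", "zip"]]              # QIs released without sensitive data
st = df[["gid", "diagnosis"]].copy()
st["diagnosis"] = (st.groupby("gid")["diagnosis"]
                     .transform(lambda s: s.sample(frac=1).to_numpy()))
```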
Synthetic Data. Releasing synthetic data that mimics the original data provides a promising alternative to the previous techniques. Numerous approaches for synthetic data generation have been introduced in the literature: deep generative models such as Generative Adversarial Networks (GANs), including conditional GANs, and variational autoencoders; tree-based methods such as CART [195]; and sampling methods such as SMOTE [27], which were commonly used before the rise of deep learning models. Although synthetic data can support practically the same statistical conclusions as the original data, the risk of disclosing private information remains [12, 245]. The combination of synthetic data with differential privacy is therefore highly recommended [11, 12, 17, 92].
Concerning synthetic data generated with deep learning models, several approaches to differentially private synthetic data have emerged. For instance, DPGAN [245] aims to protect the confidentiality of the training data while providing good-quality data. MedGAN [28] focuses on the generation of patient records using a combination of variational autoencoders and GANs. However, MedGAN overestimates the number of diagnoses for patients and introduces comorbidities that were not present in the original dataset. To address its low utility, Yale et al. [248] proposed HealthGAN, which incorporates WGAN-GP [5]. Many other studies have shown that differentially private synthetic data can provide high data utility while preserving privacy [115, 124, 145, 201]. Notwithstanding, Fang et al. [70] introduced DP-CTGAN, which outperforms DPGAN and PATE-GAN [115]. Very recently, Kotal et al. [120] proposed PriveTAB, based on CTGAN [247], which generates synthetic data that ensures a predetermined t-closeness level to preserve individuals' privacy and thus reduces the possibility of linkage attacks.
In addition, studies on predictive performance have been carried out. Despite the previous conclusions on utility, synthetic data has been shown to affect predictive performance. Hittmeir et al. [97] show that the outcomes of synthetic data without differential privacy are closer to the original. However, they also show that an intruder is able to obtain predictions that may be much closer to the true target value. The same conclusions were obtained for regression tasks [98]. Using tree-based models for synthetic data generation, Rankin et al. [193] show small decreases in accuracy for models trained with synthetic data compared to those trained on real data; however, that study does not include a privacy assessment. Regarding sampling methods, PrivateSMOTE, proposed by Carvalho et al. [26], generates synthetic data by interpolating only the cases at maximum risk (single outs) instead of synthesising all cases, while minimising data utility loss. PrivateSMOTE demonstrates similar or even better results in re-identification risk and predictive performance (classification tasks) compared to deep learning based models.
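The interpolation idea can be sketched as follows; this is a generic SMOTE-style interpolation over records flagged as high risk, inspired by but not identical to the PrivateSMOTE algorithm of Carvalho et al. [26].

```python
# A minimal sketch: replace only the high-risk records (e.g., single outs)
# with points interpolated between each record and a random near neighbour.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def interpolate_high_risk(X, at_risk, n_neighbors=5, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    X_new = X.copy()
    for i in np.where(at_risk)[0]:
        # Drop the first neighbour: it is the queried record itself.
        neighbours = nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:]
        j = rng.choice(neighbours)
        gap = rng.random()                       # position along the segment
        X_new[i] = X[i] + gap * (X[j] - X[i])    # synthetic replacement
    return X_new
```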
Final Remarks. An organisation’s main challenge is applying the optimal PPT that reduces disclosure risks with minimal information loss. Currently, there are no “one size fits all” approaches to data privacy. Nevertheless, some guidelines should be used to limit disclosure risk when releasing data to the public, industries, or researchers. Data release must follow a defence strategy that uses both technical and non-technical approaches to data privacy [
101,
170].
A non-technical approach aims to understand the intended use of the data, since it would be naive to assume that all malicious attacks against individuals' privacy have been discovered at this stage. Additionally, an intruder may possess external information to join with a released dataset in order to re-identify individuals in the de-identified dataset. To address this issue, non-technical approaches should be considered to provide risk-limiting solutions. As such, datasets should not be made freely available without barriers. An example of a barrier is requiring data receivers to identify the study they will conduct, specifying the data they require and how the data will be used to achieve their goals. Thus, the effort required for a potential disclosure by an intruder is raised. Furthermore, it allows the data controller to filter out data receivers with malicious intent, whose work aims to violate the data subjects' privacy. In addition, another non-technical defence is an agreement in which data receivers commit not to re-identify any data.
Regarding technical approaches, one of the main issues is the understanding of which PPTs are most appropriate for certain situations. Many of these techniques require trial and error to decide which parameters to configure and to find the acceptable trade-off. Therefore, it is important to decide which PPT is most appropriate to apply to a given dataset for a specific application.
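This trial-and-error process can be sketched as a parameter sweep over a privacy parameter, recording a risk proxy against a utility proxy at each setting. Everything below is illustrative: the data is randomly generated and the banding transform stands in for any PPT.

```python
# A minimal sketch of a privacy/utility sweep: for each setting, record the
# share of records unique on the QIs (risk proxy) and a cross-validated
# accuracy (utility proxy). Data and transform are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.normal(30000, 8000, 500).round(),
    "target": rng.integers(0, 2, 500),
})
qis = ["age", "income"]

def transform(data, qis, width):
    # Hypothetical PPT: global recoding of numerical QIs into bands of `width`.
    out = data.copy()
    for c in qis:
        out[c] = (out[c] // width) * width
    return out

for width in (1, 5, 20, 50):
    released = transform(df, qis, width)
    risk = released.groupby(qis)[qis[0]].transform("size").eq(1).mean()
    utility = cross_val_score(RandomForestClassifier(random_state=0),
                              released[qis], released["target"], cv=5).mean()
    print(width, round(risk, 3), round(utility, 3))
```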
7.2 Available Software
Many privacy tools have been developed to help data controllers, or any other user, in decision-making. Beyond the transformation of microdata through PPTs, some tools allow the assessment of different configurations of such techniques, enabling the evaluation of the achieved privacy and utility levels.
\(\mu\)-ARGUS [41] was the pioneering tool in privacy-preserving publishing. It is free, open source software that provides a user interface [214]. The tool supports techniques such as global recoding, top-and-bottom coding, local suppression, PRAM, noise addition, and microaggregation. Individual risk estimation is based on sampling weights [13] and k-anonymity. Furthermore, \(\mu\)-ARGUS allows the creation of de-identified datasets suitable for different purposes, namely scientific and public use files. UTD Anonymisation ToolBox [231] is available for download and implements the first three groups of PPTs. Cornell Anonymisation Toolkit [29] provides an interface, uses generalisation to transform the data, and is free for download. Both tools report utility and re-identification risk. However, they are research prototypes and have scalability issues when handling large datasets.
Two well-known tools are sdcMicro [222] and the ARX Data Anonymisation Tool [189]. The R package sdcMicro is free open source software for producing scientific and public use files [222]. It supports several PPTs, both non-perturbative and perturbative. For risk estimation, sdcMicro implements, for example, SUDA2, k-anonymity, and log-linear models, among others. Besides that, the package quantifies the utility of the de-identified data relative to the original data and the corresponding information loss. It also features a user-friendly interface [121] that allows non-experts in the de-identification process to gain insight into the impact of various PPTs and reduces the burden of using the software. The ARX Data Anonymisation Tool, often known simply as ARX, is one of the most used tools for data privacy. This software is free and has a simple and intuitive interface [6], which provides wizards for creating transformation rules and visualisations of re-identification risks. A wide range of PPTs is supported in ARX, as well as many measures for both re-identification risk and data utility. Regarding data utility, ARX can further optimise output data towards suitability as a training set for learning models. ARX is also available as a library with an API that provides data de-identification for any Java program.
Another free open source privacy tool is Amnesia [183], which provides both downloadable software and an online dashboard. Amnesia allows the selection of generalisation levels and uses k-anonymity to assess disclosure risk. The tool produces many possible solutions, shows the distribution of values, and provides statistics about the data quality of the de-identified dataset. Moreover, Amnesia can transform relational and transactional databases into de-identified data.
All the previous tools require the attribute terminology in advance, which is a static approach. A dynamic tool called Aircloak [
3] has emerged to help data controllers by giving them access to all the underlying data, and dynamically adapting the de-identification to the specific query and data requested. The resulting answer set is fully de-identified. Both non-perturbative and perturbative techniques are implemented in Aircloak. This tool has a free version for students based on the querying system and a full version for organisations.
Due to the increased attention to synthetic data, several generators have emerged to facilitate the creation of synthetic samples. Synthpop (SP) [178] is an R and Python package that uses tree-based methods to synthesise data, where each attribute is generated by estimating conditional distributions. Although this tool does not use GANs for synthesis, it includes features such as differential privacy, record linkage, and outlier detection. One of the best-known generators is the Synthetic Data Vault (SDV) [186], a free and open source Python tool. SDV estimates the joint distribution of the population using a Gaussian copula but also includes GANs. Furthermore, SDV supports differential privacy and provides metrics for assessing privacy and utility. DataSynthesizer [188] uses a differentially private Bayesian network to capture correlations between the different attributes and allows the user to analyse the accuracy of the synthetic data. Comparisons between these tools have been performed [39, 98].
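As an illustration of how such generators are used, the following sketch fits SDV's Gaussian copula model to a single table and samples a synthetic replica. It assumes the SDV 1.x API (module and class names differ in earlier releases), and the input file name is hypothetical.

```python
# A minimal sketch with SDV (assuming the 1.x API): fit a Gaussian copula
# to a single table and sample a synthetic dataset of the same size.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("microdata.csv")          # hypothetical input file

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)         # infer column types

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=len(real))
```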
Furthermore, commercial solutions have gained momentum for privacy-preserving synthetic data, such as MOSTLY AI [
163], YData [
249], and Gretel [
87]. All three generate synthetic data based on deep learning models and allow the evaluation of privacy and utility. These commercial solutions also provide open source Python packages [
88,
162,
250] but with some limitations. For instance, YData and Gretel only provide the generator and utility assessment.
Although some of the presented tools are more intuitive and user-friendly than others, they require prior knowledge of SDC principles and expertise in the R or Python programming languages. These tools are useful for data protection, but their use is not trivial. Table 5 summarises the discussed privacy tools and their characteristics.