Privacy-Preserving Data Mining: Methods, Metrics, and Applications
ABSTRACT The collection and analysis of data are continuously growing due to the pervasiveness of computing devices. The analysis of such information is fostering businesses and contributing beneficially to society in many different fields. However, this storage and flow of possibly sensitive data poses serious privacy concerns. Methods that allow the extraction of knowledge from data while preserving privacy are known as privacy-preserving data mining (PPDM) techniques. This paper surveys the most relevant PPDM techniques from the literature and the metrics used to evaluate them, and presents typical applications of PPDM methods in relevant fields. Furthermore, the current challenges and open issues in PPDM are discussed.
INDEX TERMS Survey, privacy, data mining, privacy-preserving data mining, metrics, knowledge
extraction.
PPDM encompasses all techniques that can be used to extract knowledge from data while preserving privacy. This may consist of using data transformation techniques, such as the ones in Table 1, as primitives for adjusting the privacy-utility tradeoff of more evolved data mining techniques, such as the privacy models of Table 2 and the more classical data mining techniques of Table 3. PPDM also accounts for the distributed privacy techniques of Table 4, which are employed for mining global insights from distributed data without disclosure of local information. Due to the variety of proposed techniques, several metrics to evaluate the privacy level and the data quality/utility of the different techniques have been proposed [8], [11]–[13].

PPDM has drawn extensive attention amongst researchers in recent years, resulting in numerous techniques for privacy under different assumptions and conditions. Several works have focused on metrics to evaluate and compare such techniques in terms of the achieved privacy level, data utility and complexity. Consequently, PPDM has been effectively applied in numerous fields of scientific interest.

The vast majority of PPDM surveys focus on the techniques [10], [14], [15], and others on the metrics to evaluate such techniques [8], [12], [13]. Some only briefly discuss the evaluation parameters and the trade-off between privacy and utility [16], [17], whereas others summarily describe some of the existing metrics [4], [11]. The survey in [11] does combine techniques, metrics and applications, but focuses on data mining, thus lacking many PPDM techniques, metrics, and other application fields, and [9] has applications in various areas but lacks metrics. This paper covers a literature gap by presenting an up-to-date and thorough review on existing PPDM techniques and metrics, followed by applications in fields of relevance.

The remainder of this survey is organised as follows. Section II introduces the problem of data mining and presents some of the most common approaches to extract knowledge from data. Readers already familiarised with the basic concepts of data mining can skip to section III, where several PPDM methods are described according to the data lifecycle phase at which they are applied. Section IV presents metrics to evaluate such algorithms. Some applications of the PPDM algorithms in areas of interest are presented in section V, with emphasis on the assumptions and context (identified by a scenario description) at which privacy can be breached. Section VI discusses some learned lessons about PPDM and presents open issues for further research on PPDM. Section VII concludes this paper.

II. CLASSICAL DATA MINING TECHNIQUES
Information systems are continuously collecting great amounts of data. Services can greatly benefit from the knowledge extraction of this available information [1]. The terms knowledge discovery from data (KDD) and data mining are often used ambiguously [18]. KDD typically refers to the process composed of the following sequence of steps: data cleaning; data integration; data selection; data transformation; data mining; pattern evaluation; and knowledge presentation. In this section, a brief review of the classical paradigms of the data mining step will be presented, to provide the reader enough understanding for the remainder of this paper.

Data mining is the process of extracting patterns (knowledge) from big data sets, which can then be represented and interpreted. In [19], a pattern is defined as an expression that describes a subset of data (itemset), or a model applicable to a subset. Since data mining methods involve pattern discovery and extraction, pattern recognition techniques are often used.1 Moreover, pattern recognition and machine learning may be seen as ''two facets of the same field'' [20] that had origins in different areas. Therefore, throughout this work, data mining will refer to the process of extracting patterns, and pattern recognition and machine learning will be used interchangeably to denote data mining paradigms.

1 In fact, most data mining techniques rely on machine learning, pattern recognition, and statistics [19].

The main objective of data mining is to form descriptive or predictive models from data [19]. Descriptive models attempt to turn patterns into human-readable descriptions of the data, whereas predictive models are used to predict unknown/future data. The models are formed using machine learning techniques, which can be categorised as supervised and unsupervised [18]. Supervised learning techniques are methods in which the training set (the dataset used to form the model) is already labelled. That is, the training set has both the input data and the respective desired output, leading the machine to learn how to distinguish data and, thus, to form the model. In contrast, unsupervised techniques attempt to find relations in the data from unlabelled sets, or simply, no training set is used.

Association rule mining, classification and clustering are three of the most common approaches in machine learning, where the first two are supervised learning techniques, and the latter is an unsupervised learning mechanism. The following subsections will briefly detail each of these approaches. Readers can refer to [21], [22], and [18] for a comprehensive study on these subjects.

A. ASSOCIATION RULE MINING
Association rule mining algorithms are designed to find relevant relationships between the variables of a dataset. These associations are then expressed by rules in the form: if (condition); then (result). Association rules have a probability of occurrence, that is, if condition is met, then there is a certain probability of occurring result. Using the notation from [18], association rules can be formalized as follows. Let I = {I1, I2, . . . , Im} be a set of binary attributes called items, and D a database of transactions, where each transaction T is a nonempty itemset such that T ⊆ I. Let A ⊂ I and B ⊂ I be subsets of I. Then, an association rule is an implication A ⇒ B, with A ≠ ∅ and B ≠ ∅.

Not all rules are interesting to mine; in fact, association rule mining algorithms only mine strong rules. Strong rules satisfy
a minimum support threshold and a minimum confidence threshold. The support of a rule is the probability (percentage) of transactions in D that contain A ∪ B, or mathematically:

support(A ⇒ B) = P(A ∪ B)

Intuitively, this metric reflects the usefulness of the rule A ⇒ B. Confidence measures how often the rule is true in D. It is measured by the following equation:

confidence(A ⇒ B) = P(B|A)

Using the support and confidence metrics, two steps are required to mine association rules [18]:
1) Find all itemsets in D with a support greater than or equal to a minimum support threshold (frequent itemsets);
2) Generate the strong association rules from the frequent itemsets.
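To make the two measures concrete, the following minimal Python sketch computes support and confidence over a toy transaction database; the items and transactions are invented for illustration and are not taken from the surveyed works.

```python
# Toy illustration of support and confidence for association rules.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent): how often the rule holds when it applies."""
    return support(antecedent | consequent, db) / support(antecedent, db)

rule_a, rule_b = {"bread"}, {"milk"}
print(support(rule_a | rule_b, transactions))    # support(A => B) = 0.5
print(confidence(rule_a, rule_b, transactions))  # confidence(A => B) ~ 0.67
```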
B. CLASSIFICATION
Classification is a supervised learning problem whose objective is to create a model, in this specific case called a classifier, that can identify the class label of unknown data [18]. In other words, a classifier is created from a training set – a set whose output (the class label) is known – and it is then used to classify unknown data into one of the existing classes. Thus, classification is a two-step problem: the training phase (or learning step) and the classification phase. More formally, one seeks to define a function f(·) that outputs a class label y for a given attribute vector X = (x1, x2, . . . , xn) as input, where xi, ∀i ∈ {1, 2, . . . , n}, represents a value for attribute Ai. That is:

y = f(X)

In this situation, f(X) maps a tuple of attribute values to the respective class label. This mapping function can be represented by mathematical formulae, classification rules, or decision trees [18].

Having the mapping function f(X), one can classify any attribute vector X in the classification phase. To evaluate the classifier, an already classified input is considered and its accuracy is calculated as the percentage of correct classifications obtained. However, the training set cannot be used, since it would result in an optimistic estimation of the accuracy [18], and therefore, test sets are used instead. In practice, the training set is randomly divided into a smaller (than the original) training set and a test set.
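The two-phase procedure can be illustrated with a short sketch using the scikit-learn library (an assumption of this example, not a tool discussed in the survey): a labelled dataset is split into a training set and a test set, a classifier is learned on the former, and its accuracy is estimated on the latter.

```python
# Illustrative two-phase classification: learn on a training split,
# then estimate accuracy on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # smaller training set + test set

clf = DecisionTreeClassifier().fit(X_train, y_train)   # training phase
y_pred = clf.predict(X_test)                           # classification phase
print("accuracy on unseen data:", accuracy_score(y_test, y_pred))
```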
C. CLUSTERING
Clustering, or cluster analysis, is a process of grouping sets of objects (observations) into groups (clusters), in a way that objects from a cluster have more similarities with each other than with objects from different clusters [18]. Each cluster may be considered as a class with no label, and thus, clustering is sometimes referred to as automatic classification, i.e. classification that does not require a training set, but learns from observations. Since cluster analysis is an unsupervised learning paradigm, it may reveal interesting unknown relations in the data.

Algorithms for clustering differ significantly due to the unstandardised notion of cluster and the similarity metric [23]. A categorisation that encompasses the most important clustering methods is given in [18], based on the following properties:
• Partitioning criteria: conceptually, clusters may be formed either hierarchically (more general clusters contain other more specific clusters), or all clusters may be at the same level;
• Separation: clusters may be overlapping or non-overlapping. In the overlapping case, objects may belong to multiple clusters, whereas in the non-overlapping case, clusters are mutually exclusive;
• Similarity measure: the metric for the similarity between objects may be distance-based or connectivity-based;
• Clustering space: clusters may be searched within the entire data space, which can be computationally inefficient for large data, or within data subspaces (subspace clustering), where dimensionality may be reduced by suppressing irrelevant attributes.

Due to these (and other) properties, numerous algorithms have been proposed for a myriad of applications [18], [22].

Being a fast-expanding field, data mining presents some challenges such as scalability, efficiency, effectiveness and social impacts. The concern in collecting and using sensitive data that may compromise privacy is one of those impacts and one that is being extensively researched [1]. The following section will describe how privacy and data mining are related, and review some of the most important methods to protect and preserve privacy.

III. PRIVACY AND DATA MINING
Data collection and data mining techniques are applied to several application domains. Some of these domains require handling, and often publishing, sensitive personal data (e.g. medical records in health care services), which raises the concern about the disclosure of private information [1].

Privacy-Preserving Data Mining (PPDM) techniques have been developed to allow for the extraction of knowledge from large datasets while preventing the disclosure of sensitive information. The vast majority of the PPDM techniques modify or even remove some of the original data in order to preserve privacy [9]. This data quality degradation reflects the natural trade-off between the privacy level and the data quality, which is formally known as utility. PPDM methods are designed to guarantee a certain level of privacy while maximising the utility of the data to allow for effective data mining. Throughout this work, sanitised or transformed data will refer to the data that resulted from a privacy-preserving technique.

Several different taxonomies for PPDM methods have been proposed [9]–[11], [14], [17], [24]. In this survey, a classification based on the data lifecycle phase at which the privacy preservation is ensured will be considered [10], namely at: data collection, data publishing, data distribution and at the output of the data mining.
The following subsections will describe each of the phases at which privacy is ensured, by attesting how privacy may be lost and by describing some of the most applied privacy-preserving techniques. Tables 1, 2, 3 and 4 summarise the privacy-preserving methods presented at each corresponding subsection, and enumerate some of the advantages, disadvantages, applications and domains of such techniques. A description of the scenario is also given to contextualise the adversarial assumptions and the nature of the privacy-preserving methods. Note that Table 1 does not present application domains, since these randomisation techniques are mainly used as primitives for more complex privacy-preservation techniques, such as the ones presented in the remaining tables. In fact, Tables 2 and 3 correspond to more evolved privacy-preserving data mining techniques that usually rely on data transformation techniques to adjust the privacy-utility tradeoff, without requiring modification to the data mining algorithms. On the other hand, the distributed privacy techniques of Table 4 are usually building blocks for distributed computations that preserve privacy and must, therefore, be integrated into the data mining techniques (as seen in [25]), therefore requiring modifications.

A. DATA COLLECTION PRIVACY
To ensure privacy at data collection time, the sensory device transforms the raw data by randomising the captured values before sending them to the collector. The assumption is that the entity collecting the data is not to be trusted. Therefore, and to prevent privacy disclosure, the original values are never stored, and are used only in the transformation process. Consequently, randomisation must be performed individually for each captured value.

Most common randomisation methods modify the data by adding noise with a known statistical distribution, so that when data mining algorithms are used, the original data distribution may be reconstructed, but not the original (individual) values. Thus, the randomisation process in data mining encompasses the following steps: randomisation at data collection, distribution reconstruction (subtracting the noise distribution from the first step) and data mining on the reconstructed data [10].

The simplest randomisation approach may be formally described as follows. Let X be the original data distribution, Y a publicly known noise distribution independent of X, and Z the result of the randomisation of X with Y. That is:

Z = X + Y    (1)

The collector estimates the distribution Z from the received samples z1, z2, . . . , zn, with n the number of samples. Then, with the noise distribution Y (Y has to be provided with the data), X may be reconstructed using:

X = Z − Y    (2)

Equation 1 corresponds to the randomisation process at data collection, while equation 2 corresponds to the reconstruction of the original distribution by the collector entity. Note, however, that the reconstruction of X using equation 2 depends on the estimation of the distribution Z. If Y has a large variance and the number of samples (n) of Z is small, then Z (and consequently X) cannot be estimated precisely [10]. A better reconstruction approach using the Bayes formula may be implemented.

Additive noise is not the only type of randomisation that can be used at collection time. In fact, the authors of [31] show experimentally how ineffective this technique may be at preserving privacy. More effective (against privacy disclosure) techniques that apply multiplicative noise to randomise the data also exist [29], [30].

Since the original data is modified into perturbed data, these methods require specific data mining algorithms that can leverage knowledge discovery from distributions of data, and not from individual entries. This may lead to a greater loss of utility than other privacy-preserving methods. Nevertheless, some data mining methods such as clustering and classification may require only access to the data distribution and will thus work well with the randomisation [10].
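The following sketch illustrates equations (1) and (2) under assumed Gaussian distributions for X and Y: each value is randomised individually at collection time, and the collector recovers only aggregate properties of the original distribution (here, its mean and variance by moment matching, rather than the Bayes-based reconstruction mentioned above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive values X (never stored by the collector).
x = rng.normal(loc=50.0, scale=5.0, size=10_000)

# Publicly known, independent noise distribution Y.
noise_std = 20.0
z = x + rng.normal(0.0, noise_std, size=x.size)   # Z = X + Y, done per value

# The collector only sees z. Knowing Y, it can estimate properties of the
# distribution of X (not individual values), e.g. by moment matching:
est_mean = z.mean()                 # E[X] = E[Z] - E[Y] = E[Z], since E[Y] = 0
est_var = z.var() - noise_std**2    # Var[X] = Var[Z] - Var[Y]
print(est_mean, est_var ** 0.5)     # close to 50 and 5
```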
Data modification may be applied at phases other than data collection, and other methods besides additive and multiplicative noise do exist. In fact, randomisation is considered to be a subset of the perturbation operations2 (see Section III-B). However, at collection time the assumption is that the collector is not trusted. Therefore, the original data must not be stored, nor buffered, after the transformation. Thus, each value has to be randomised individually, that is, without considering other past collected values.

2 Some literature uses the terms randomisation and perturbation interchangeably.

B. DATA PUBLISHING PRIVACY
Entities may wish to release data collections either publicly or to third parties for data analysis without disclosing the ownership of the sensitive data. In this situation, preservation of privacy may be achieved by anonymizing the records before publishing. PPDM at data publishing is also known as Privacy-Preserving Data Publishing (PPDP).

It has been shown that exclusively removing attributes that explicitly identify users (known as explicit identifiers) is not an effective measure [32]. Users may still be identified by pseudo or quasi-identifiers (QIDs) and by sensitive attributes. A QID is a non-sensitive attribute (or a set of attributes) that does not explicitly identify a user, but can be combined with data from other public sources to de-anonymize the owner of a record, in what is known as linkage attacks [33]. Sensitive attributes are person-specific private attributes that should not be publicly disclosed, and that may also be linked to identify individuals (e.g. diseases in medical records).

Sweeney in 2000 presented a report [34] on an analysis done over the 1990 U.S. Census to identify different combinations of attributes (QIDs) that would uniquely identify a person in the U.S. He found out that 87% of the population was identifiable by using the QID set {5-digit ZIP, gender, date of birth}. This study was then repeated with the 2000 U.S. Census by Golle [35], where the percentage of de-anonymized records using the same QID set dropped to 63% of the population. In 2002, Sweeney identified the governor of Massachusetts from an anonymous voter list with the same QID set [36]. By linking these values to an accessible anonymized medical dataset from the Group Insurance Commission (GIC), the author was also able to obtain the governor's medical records. This simple example shows how QIDs are a potential threat to de-anonymize identities on datasets where only explicit identifiers are removed.

Aggarwal [10] states that the majority of anonymization algorithms focus on QIDs, disregarding sensitive attributes, as it is wrongly assumed that without these there is no risk of linkage attacks with public information. In fact, the author claims that it is fair to assume that an adversary has background information about its target [10], and thus concludes that algorithms that do take into account sensitive attributes provide better privacy protection.

The anonymization of records in a database may be achieved by implementing different privacy models. Privacy models attempt to preserve record owners' identity by applying one, or a combination, of the following data sanitising operations:
• Generalization: replacement of a value by a more general one (parent). Numerical data may be specified by intervals (e.g. an age of 53 may be specified as an interval in the form of [50,55]), whereas categorical attributes require the definition of a hierarchy. A good example of a hierarchy could be the generalisation of the values engineer and artist from an occupation attribute to professional. Another possibility would be to have the parent value of student represent all types of student in the same occupation attribute;
• Suppression: removal of some attribute values to prevent information disclosure. This operation can also be performed column-wise in a dataset (removes all values of an attribute) or row-wise (removes an entry);
• Anatomization [37]: de-associates QIDs and sensitive attributes into two separate tables, making it more difficult to link QIDs to sensitive attributes. In this case, values remain unchanged;
• Perturbation: replacement of the original data by synthetic values with identical statistical information. The randomisation methods described in subsection III-A (additive and multiplicative noise) are examples of data perturbation. Data swapping and synthetic data generation are also perturbation techniques. In data swapping, sensitive attributes are exchanged between different entries of the dataset in order to prevent the linkage of records to identities, whereas in synthetic data generation, a statistical model is formed with the original data, and then synthetic values are obtained from the model.

This list is not an extensive enumeration of the existing operations. These are, however, the most commonly used, and are sufficient to allow the comprehension of the remainder of this work. Readers can refer to [33] for a more thorough list.

TABLE 2. Summary of privacy-preserving techniques at data publishing (Section III-B) in terms of the employed sanitisation methods.

Based on these operations, a set of privacy models has been proposed as follows. One of the best known privacy models is the k-anonymity model, proposed by Samarati and Sweeney [38], [39]. This model's key concept is that of k-anonymity: if the identifiable attributes of any database record are indistinguishable from those of at least k − 1 other records, then the dataset is said to be k-anonymous. In other words, with a k-anonymized dataset, an attacker cannot single out the identity of a record, since k − 1 other similar records exist. The set of k records is known as an equivalence class [10]. Note that ''identifiable attributes'' in the aforementioned definition refers to QIDs.

In the k-anonymity model, the value k may be used as a measure of privacy: the higher the value of k, the harder it is to de-anonymize records. In theory, in an equivalence
class, the probability of de-anonymizing a record is 1/k. However, raising k will also reduce the utility of the data, since higher generalisation will have to occur.

Different algorithms have been proposed to achieve k-anonymity, where the vast majority applies generalisation and suppression operations [10]. This privacy model was one of the first applied for group-based anonymization and served as a development base for more complex models. Some of the advantages of the k-anonymity privacy model include the simplicity of its definition and the great amount of existing algorithms. Nevertheless, this privacy model has two major problems. The first problem has to do with the assumption that each record represents a unique individual, or in other words, that each represented individual has one, and only one, record. If this is not the case, an equivalence class with k records does not necessarily link to k different individuals. The second problem relates to the fact that sensitive attributes are not taken into consideration when forming the k-anonymized dataset. This may lead to equivalence classes where the values of some sensitive attributes are equal for all the k records and, consequently, to the disclosure of private information of any individual belonging to such groups.
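A minimal sketch of how generalisation can be used towards k-anonymity over a QID set, and of how the resulting equivalence classes can be checked against k; the table, the generalisation hierarchy and the value of k are invented for illustration.

```python
from collections import Counter

K = 3
records = [  # (age, zip, disease); age and zip form the QID set
    (53, "60601", "flu"), (55, "60602", "hiv"), (51, "60609", "flu"),
    (34, "10012", "flu"), (38, "10014", "cancer"), (36, "10018", "flu"),
]

def generalise(age, zipcode):
    """Generalise QIDs: age to a 10-year interval, ZIP to its 3-digit prefix."""
    low = age // 10 * 10
    return (f"[{low},{low + 9}]", zipcode[:3] + "**")

table = [(generalise(a, z), s) for a, z, s in records]
class_sizes = Counter(qid for qid, _ in table)   # equivalence class sizes
print(class_sizes)
print("k-anonymous for k =", K, ":", min(class_sizes.values()) >= K)
```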
Another consequence of not taking sensitive attributes into account when forming the classes is the possibility of de-anonymizing an entry (or at least narrowing down the possibilities) by associating QIDs with some background knowledge over a sensitive attribute.

The aforementioned attribute disclosure problem may be solved by increasing the diversity of sensitive values within the equivalence classes, an approach taken in the l-diversity model [45]. This model expands the k-anonymity model by requiring every equivalence class to abide by the l-diversity principle. An l-diverse equivalence class is a set of entries such that at least l ''well-represented'' values exist for the sensitive attributes. A table is l-diverse if all existing equivalence classes are l-diverse.

The meaning of ''well-represented'' values is not a concrete definition. Instead, different instantiations of the l-principle exist, differing on this particular definition [33], [45]. One of the simplest instantiations considers that the sensitive attributes are ''well-represented'' if there are at least l distinct values in an equivalence class, in what is known as distinct l-diversity. In these conditions, an l-diverse equivalence class has at least l records (since l distinct values are required), and satisfies k-anonymity with k = l. A stronger notion of l-diversity is the definition of entropy l-diversity, defined as follows. An equivalence class is entropy l-diverse if the
entropy of its sensitive attribute value distribution is at least log(l). That is:

− Σ s∈S P(QID, s) log(P(QID, s)) ≥ log(l)

where s is a possible value for the sensitive attribute S, and P(QID, s) is the fraction of records in a QID equivalence group that have the value s for the S attribute. Note that entropy l-diversity can also be extended to multiple sensitive attributes by anatomizing the data [44].
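The entropy l-diversity condition can be checked directly from the sensitive values of an equivalence class, as in the following sketch (the class and the values of l are invented examples).

```python
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l):
    """Entropy l-diversity: -sum p*log(p) over the class must be >= log(l)."""
    counts = Counter(sensitive_values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy >= math.log(l)

equivalence_class = ["flu", "flu", "hiv", "bronchitis"]   # made-up class
print(is_entropy_l_diverse(equivalence_class, l=2))       # True
print(is_entropy_l_diverse(equivalence_class, l=3))       # False
```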
Similarly to the k-anonymity model, in both the entropy and distinct l-diversity instantiations, l (or, in the former case, log(l)) acts as a measure of privacy. Increasing this value increases the variability of the existing values of the sensitive attribute in each equivalence class, decreasing the possibility of sensitive attribute disclosure. However, stronger generalisations and a higher number of suppressions have to occur on the raw data, thus leading to a higher loss of utility.

Although the l-diversity model increases the diversity of sensitive values within equivalence classes, it does not take into consideration the distribution of such values. This may present privacy breaches when the sensitive values are distributed in a skewed way, which is generally true. To better understand this breach, consider the example given in [33]: a patient table with a skewed attribute distribution, where 95% of the entries have FLU and the remaining 5% have HIV in the sensitive attribute column. An adversary seeks to find a record or groups of records having HIV, and has knowledge of the original sensitive attribute distribution. When forming the l-diverse groups, the maximum entropy would be achieved with groups having 50% of FLU entries and 50% of HIV entries. However, such groups would disclose that any entry within the group has a 50% probability of having the value HIV in the sensitive attribute. This attribute disclosure may be worsened (inferring HIV with higher probability) if the adversary has some background knowledge over the target(s).

In order to prevent attribute disclosure from the distribution skewness (skewness attacks), Li et al. [49] presented the t-closeness privacy model. This model requires the distribution of the sensitive values in each equivalence class to be ''close'' to the corresponding distribution in the original table, where close is upper bounded by the threshold t. That is, the distance between the distribution of a sensitive attribute in the original table and the distribution of the same attribute in any equivalence class is less than or equal to t (t-closeness principle). Formally, and using the notation found in [49], this principle may be written as follows. Let Q = (q1, q2, . . . , qm) be the distribution of the values of the sensitive attribute in the original table and P = (p1, p2, . . . , pm) be the distribution of the same attribute in an equivalence class. This class satisfies t-closeness if the following inequality holds:

Dist(P, Q) ≤ t

The t-closeness principle also has various instantiations depending on the distance function that is used to measure the closeness [10], [49]. The three most common functions are the variational distance, the Kullback-Leibler (KL) distance and the Earth Mover's distance (EMD).
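Assuming the variational distance as the Dist(·, ·) function, the t-closeness principle can be checked as in the following sketch; the distributions and the threshold t are invented for illustration.

```python
from collections import Counter

def distribution(values):
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def variational_distance(p, q):
    """0.5 * sum |p(v) - q(v)| over all sensitive values."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

table = ["flu"] * 95 + ["hiv"] * 5            # skewed global distribution
eq_class = ["flu"] * 8 + ["hiv"] * 2          # one equivalence class
dist = variational_distance(distribution(eq_class), distribution(table))
t = 0.2
print(dist, "t-close:", dist <= t)            # 0.15, True
```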
The three aforementioned privacy models preserve privacy by applying global/equitable measures to all records/identities. Xiao and Tao [51] presented the concept of personalized privacy, where the privacy level is defined by the record owners. The purpose of this method is to preserve the maximum utility while respecting personal privacy preferences. Personalized privacy is achieved by creating a taxonomy tree using generalisation, and by allowing the record owners to define a guarding node. An owner's privacy is breached if an attacker is allowed to infer any sensitive value from the subtree of the guarding node with a probability (breach probability) greater than a certain threshold.

As an example of a personalised privacy model, consider a case where there is a sensitive attribute DISEASE. A record owner may be willing to disclose that he is ILL (and not NOT ILL), but protect which type of illness he has, i.e. ILL is his guarding node in the taxonomy tree. Another owner may not mind sharing that, besides being ILL, he has a TERMINAL DISEASE, without specifying which specific disease. Finally, another record owner could allow sharing the specific disease (e.g. LUNG CANCER). In this example, the LUNG CANCER value belongs to the taxonomy subtree of TERMINAL DISEASE, which is the guarding node of the second described owner, and TERMINAL DISEASE is in the subtree of ILL.

Personalized privacy has the advantage of letting record owners define their privacy measure. However, this may be hard to implement in practice for two main reasons [33]: approaching record owners may not always be a viable/practical option; and, since record owners have no access to the distribution of sensitive values, the tendency will be to overprotect the data by selecting more general guarding nodes.

All previous privacy models try either to protect record owners' identity, or to protect against the inference of sensitive values from anonymized records (or groups of records). Nevertheless, they do not measure how the presence (or absence) of a record impacts the owner's privacy. Consider this hypothetical example: a statistical analysis over an anonymized database revealed that female smokers over 60 years old and weighing over 85 kg have a 50% chance of having cancer. A person belonging to this specific population will suffer from attribute disclosure, even if the dataset does not contain their record. From another point of view, if this person were in the database and the same conclusion were reached, then there would be no further disclosure from the participation of this individual in such a database, that is, no information would be leaked.

Dwork [27] presented the notion of differential privacy to measure the difference in individual privacy disclosure between the presence and the absence of the individual's record. In this work, the author proposed the ε-differential privacy model, which ensures that a single record does not considerably affect the outcome of the analysis over the dataset. In this sense, a person's privacy will not be affected by participating in the data collection, since it will not make much difference in the final outcome.

The ε-differential privacy model may be formalised as follows. Let K(·) be a randomised function, and D1 and D2 two databases differing in at most one record; then:

ln( Pr[K(D1) ∈ S] / Pr[K(D2) ∈ S] ) ≤ ε,  ∀S ⊆ Range(K)    (3)

where ε is set a priori by the publishing entity, and S is a subset of Range(K), with Range(K) the set of all possible outputs of K. Note that equation 3 may be extended to group privacy by having c · ε on the right side of the equation, with c a small integer that corresponds to the number of records in the group [27].

Despite being a strong and formal privacy concept, differential privacy has some limitations [55], such as setting the appropriate value of ε. However, differential privacy is fairly recent and thus, more research is currently on-going [60].
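Although the mechanism itself is not detailed above, ε-differential privacy is commonly achieved for numeric queries by adding Laplace noise calibrated to the query sensitivity and to ε; the sketch below illustrates this for a counting query (sensitivity 1), with invented data.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(records, predicate, epsilon):
    """Counting query released under epsilon-differential privacy.
    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 46, 52, 61, 67, 70]
for eps in (0.1, 1.0):   # smaller epsilon means more noise, more privacy
    print(eps, dp_count(ages, lambda a: a >= 60, epsilon=eps))
```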
Group anonymization privacy models (e.g. k-anonymity) and differential privacy are considered to be two of the major research branches in privacy [4]. In fact, several variants were proposed in the literature to tackle some of the handicaps of the base models. Since these variants are considered extensions to the privacy models described throughout this section, a simple enumeration of these techniques is given below without any particular order. Interested readers can refer to [33] and [61] for detailed descriptions of some of the referred group anonymization privacy models.
• k-anonymity variants: k^m-anonymization [62], (α, k)-anonymity [63], p-sensitive k-anonymity [64], (k, e)-anonymity [65], MultiR (MultiRelational) k-anonymity [66] and (X, Y)-anonymity [67];
• l-diversity variants: (τ, l)-diversity [68] and (c, l)-diversity [45];
• t-closeness variants: (n, t)-closeness [69];
• ε-differential privacy variants: differential identifiability [70] and membership privacy [71].

In summary, the preservation of privacy at data publishing is achieved by applying privacy models that alter the original table in order to prevent information disclosure. Each model has advantages and disadvantages in protecting from different types of inferences (e.g. identity, attribute). In contrast with privacy-preserving methods at collection time, privacy models can achieve better control over the privacy level due to the publisher's access to the full data (recall the trade-off between privacy and utility). Other privacy models have been proposed in the literature [33]; however, their underlying principles are the same as in the seminal contributions presented in this section.

TABLE 3. Summary of privacy-preserving techniques at data mining output privacy (Section III-C).

C. DATA MINING OUTPUT PRIVACY
The outputs of the data mining algorithms may be extremely revealing, even without explicit access to the original dataset. An adversary may query such applications and infer sensitive information about the underlying data. Below, a description of the most common techniques to preserve the privacy of the output of the data mining is presented.
• Association Rule Hiding - In association rule data mining, some rules may explicitly disclose private information about an individual or a set of individuals. Association rule hiding is a privacy-preserving technique whose objective is to mine all non-sensitive rules while no sensitive rule is discovered. Non-optimal solutions perturb data entries in a way that sensitive rules are hidden (e.g. suppression of the rule's generating itemsets), but may incorrectly hide a significant number of non-sensitive rules in the process. Nevertheless, different approaches, including exact solutions (all sensitive rules are hidden and no non-sensitive rule is hidden), have been proposed [10], [72]. The concept of association rule hiding was first introduced by Atallah et al. in [73].
• Downgrading Classifier Effectiveness - Classifier applications may also leak information to adversarial users. A good example are membership inference attacks, in which an adversary determines whether a record is in
the training dataset (original data) [83], [84]. To preserve privacy in classifier applications, techniques for downgrading the accuracy of the classifier are often used [9], [76]. Since some rule-based classifiers use association rule mining methods as subroutines, association rule hiding methods are also applied to downgrade the effectiveness of a classifier [9].
• Query Auditing and Inference Control - Sometimes entities may provide access to the original dataset, allowing exclusively statistical queries to the data. More specifically, users can only query aggregate data from the dataset, and not individual or group records. Nevertheless, some queries (or sequences of queries) may still reveal private information [9], [10]. Query auditing has two main approaches: Query Inference Control, where either the original data or the output of the query is perturbed; and Query Auditing, where one or more queries are denied from a sequence of queries. Query auditing problems may be further classified into offline and online versions. In the former version, queries are known a priori, and the answers to such inquiries were already given to the user. The objective in this case is to evaluate whether the query response(s) breached privacy. In the online version, queries arrive in an unknown sequence, and the privacy measures take action at the time of the queries. Query auditing and inference control techniques have been studied extensively in the context of statistical database security. Classical approaches may be found in [79] and [80].

Note that in all four methods described above, the developed application is affected, since either the utility of the data used to build the application is lower than the original value, the application itself is downgraded, or the access to the data is restricted. Thus, the trade-off between privacy and utility is present.

D. DISTRIBUTED PRIVACY
There are situations where multiple entities seek to mine global insights in the form of aggregate statistics over the conjunction of all partitioned data, without revealing local information to the other entities (possibly adversaries). A generalisation of this problem is the well-studied secure multiparty computation (SMC) problem from the cryptography field [85]. In SMC, the objective is to jointly compute a function from the private input of each party, without revealing such input to the other parties. That is, at the end of the computation, all parties learn exclusively the output. This problem is solved using secure data transfer protocols that also apply to privacy-preserving distributed computation [86].

In SMC, the assumption that adversaries respect the protocol at all times is often not true [86]. The level of security of a protocol depends on the type of adversarial behaviour considered. Two main types of adversaries are defined in the literature: semi-honest adversaries and malicious adversaries. In the semi-honest behaviour, also called the honest-but-curious model, adversaries abide by the protocol specifications, but may attempt to learn more from the received information. In the malicious behaviour model, adversaries deviate from the protocol and may even collude with other corrupted parties. Semi-honest scenarios are considered to be a good model of the behaviour of real entities [86].

In a distributed scenario, a dataset may be partitioned either horizontally or vertically. In the horizontal case, each entity contains different records with the same set of attributes, and the objective is to mine global insights about the data. In vertically partitioned datasets, entities contain records with different attributes pertaining to the same identities. The junction of the dataset in this latter partition type allows inferring knowledge that could not be obtained from the individual datasets. An example of a horizontally partitioned dataset is a clinic chain, where each site has different customers, and the attributes associated with each customer are common to all sites (such as type of disease and the client's QID). For vertically partitioned datasets, stores with complementary items may be sequentially visited by the same clients, thus creating patterns that would not exist in each store's database. Distributed privacy-preserving algorithms exist for both types of partitioning.

In the remainder of this section, a description of two types of distributed privacy-preserving data mining protocols is presented. The first type is a set of secure protocols that prevent information disclosure from the communication and/or computation between entities. For this set, the oblivious transfer protocol and homomorphic encryption are described. The second type considers a set of primitive operations that are often used in many data mining algorithms, and are thus suitable for distributed privacy. The described operations are the secure sum, the secure set union, the secure size of set intersection, the scalar product and the set intersection. This second type of protocols may also use encryption techniques, such as the oblivious transfer protocol, to prevent information disclosure between entities.

The oblivious transfer protocol is a basic building block of most SMC techniques, and is, by definition, a two-party protocol (between two entities). In PPDM, the 1-out-of-2 oblivious transfer protocol [87] is often implemented. In this approach, a sender inputs a pair (x0, x1) and learns nothing (has no output), while the receiver inputs a single bit σ ∈ {0, 1} and learns xσ. That is, the receiver learns one of the two possible inputs/messages given by the sender, and the sender learns nothing.

The 1-out-of-2 oblivious transfer protocol procedure starts with the creation of two public encryption keys by the receiver: Pσ with private key3 Kσ, and P1−σ with an unknown private key. The receiver proceeds to send Pσ and P1−σ to the sender. The sender encrypts x0 with P0 and x1 with P1, and sends these encrypted messages back to the receiver. The receiver,
knowing only how to decrypt Pσ (using Kσ), obtains only xσ, σ ∈ {0, 1}.

3 Private keys are used to decrypt data that is encrypted with the corresponding public keys.

The aforementioned description of the 1-out-of-2 oblivious transfer protocol works only for the semi-honest adversarial behaviour, since it is assumed that the receiver only knows how to decrypt one of the messages (only knows Kσ). However, oblivious transfer protocols exist for the malicious behaviour model [86], [88]. Furthermore, this protocol can be used over horizontally and vertically partitioned datasets.

Another technique from the SMC field that is raising attention amongst researchers is homomorphic encryption. The concept of homomorphic encryption was first introduced by Rivest et al. [89], under the term privacy homomorphism. The objective was to be able to perform algebraic operations on encrypted text (ciphertext), in a way that the deciphered result would match the result of the operation over the plaintext that originated the ciphertext.

Earlier homomorphic cryptosystems were only able to perform specific algebraic operations and were thus considered partially homomorphic systems [90]. In contrast, fully homomorphic encryption supports any arbitrary function over the ciphertext. The first fully homomorphic system was proposed by Gentry [90], in 2009. Since then, there has been a development of other and sometimes more efficient solutions [91]. However, the efficiency of fully homomorphic systems is still insufficient for real-time applications [92].

Fully homomorphic encryption sees applications in most privacy-preserving cloud applications. For instance, queries can be made in an encrypted way, and the result is only decrypted when reaching the inquirer. This process not only protects the data in transmission, since it is encrypted, but also protects from disclosure of information from the inquirer to the entity providing access to the search application, through an encrypted query. Searching through encrypted files that are stored in the cloud is another possibility in fully homomorphic systems. More examples of applications may be found in [90] and [91].

Clifton et al. [101] presented a set of secure multiparty computations to preserve privacy in distributed scenarios. Such techniques are often used as primitives of the data mining methods and, therefore, provide a useful approach to build distributed privacy-preserving data mining algorithms. These techniques include: secure sum, secure set union, secure size of set intersection and scalar product, and are referred to in the literature as protocols. Below, the general idea of each method is described.

The secure sum protocol allows obtaining the sum of the inputs from each site, without revealing such inputs to the other entities. The implementation starts by designating one of the sites as the master site, where the computation starts and ends. The master site generates a random value R uniformly distributed in [0, n], with n the upper bound of the final value, and then passes (R + v1) mod n, with v1 its local input, to the next site. Each participating site then adds its local value to the received value and sends the result of the mod n operation to the next site. Since the received values are uniformly distributed in the interval [0, n], sites learn nothing about the other local values. In the end, the master site receives the last result and retrieves the true result (the sum of the vi values) by subtracting R. This value is then passed on to the other parties. The secure sum protocol requires a trusted master site and, to prevent disclosure, sites must not collude.
Nevertheless, some adaptations have been proposed to protect against disclosure under such limitations [101], [102].
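A single-process simulation of the secure sum ring is sketched below, assuming an honest, non-colluding execution; in a real deployment each addition would happen at a different site and only the running total would be exchanged.

```python
import random

def secure_sum(local_values, n_bound):
    """Simulate the ring-based secure sum: every site only ever sees a value
    that is uniformly distributed in [0, n_bound), so it learns nothing
    about the other sites' inputs."""
    r = random.randrange(n_bound)                 # master site's random R
    running = (r + local_values[0]) % n_bound     # master adds its own input
    for v in local_values[1:]:                    # each site adds and forwards
        running = (running + v) % n_bound
    return (running - r) % n_bound                # master removes R at the end

sites = [120, 45, 230, 15]                        # private local inputs
print(secure_sum(sites, n_bound=10_000))          # 410, the global sum
```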
The secure set union is an important protocol for pattern mining [10]. The idea is to share rules, itemsets and other structures between sites, in order to create unions of sets, without revealing the owners of each set. One possible implementation of this protocol [101] uses commutative encryption,4 where each site encrypts both its own sets and the received encrypted sets from other parties. Then, as the information is passed on, the decryption takes place at each of the sites, in a different (scrambled) order than the encryption order. Since the decryption order is arbitrary, ownership anonymity is preserved.

4 Commutative encryption allows data to be encrypted and decrypted in any order. That is, E1(E2(E3(r))) = E2(E1(E3(r))), where Ei, i ∈ {1, 2, 3}, are commutative encryption/decryption schemes.

Another protocol that uses commutative encryption to anonymize the ownership of the items is the secure size of set intersection. The objective of this protocol is to compute the size of the intersection of the local datasets. The general idea is as follows. Each entity uses commutative encryption to encrypt its local items and then passes them to another entity. When a set of items is received by one of the parties, encryption takes place for each of the received items, followed by an arbitrary order permutation, before the set is passed on to another entity. When all items have been encrypted by every entity, the number of values that are equal across all the encrypted itemsets is the size of the intersection. Note that this technique does not require decryption and, due to the use of commutative encryption,4 the order of encryption is not important.
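A minimal sketch of the secure size of set intersection idea for two parties, using modular exponentiation with secret exponents as a stand-in for a commutative encryption scheme (a common construction, assumed here purely for illustration): equal items produce equal double encryptions, so only the intersection size is revealed.

```python
import hashlib, math, random

P = 2**127 - 1                       # a Mersenne prime; group for exponentiation

def _element(item):
    """Hash an item onto the multiplicative group modulo P."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 2

def keygen():
    while True:
        k = random.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:   # keeps x -> x^k a bijection (commutes)
            return k

def encrypt(value, key):
    return pow(value, key, P)

# Two sites, each with a private item set and a private key.
set_a, key_a = {"ana", "bruno", "carla"}, keygen()
set_b, key_b = {"bruno", "carla", "duarte"}, keygen()

# Each site encrypts its own items and passes them on; the other site
# encrypts them again. Double encryption commutes, so equal items collide.
double_a = {encrypt(encrypt(_element(x), key_a), key_b) for x in set_a}
double_b = {encrypt(encrypt(_element(x), key_b), key_a) for x in set_b}
print("size of intersection:", len(double_a & double_b))   # 2
```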
The last secure protocol presented in [101] is the scalar product between two parties. Formally, the problem can be defined as follows. Given two parties P1 and P2, where P1 has a vector X = (x1, . . . , xn) and P2 has a vector Y = (y1, . . . , yn) of the same size as X, the objective is to compute the scalar product X · Y = Σ_{i=1}^{n} xi · yi, such that neither P1 learns Y, nor P2 learns X. Similarly to the secure sum, the secure scalar product may be achieved by adding randomness to an input vector, and the final output is retrieved by cancelling out the randomness [101], [112]. Some approaches also use homomorphic encryption, or the oblivious transfer protocol, to prevent data disclosure [112], [113].
Another important secure protocol to ensure distributed privacy and security is the set intersection, or private matching [109]. In this protocol, the intersection of two sets, each provided by one of the two participating parties P1 and P2, is computed without revealing any other information. Formally, let X = {x1, . . . , xn} be the set of P1 and Y = {y1, . . . , yk} be the set of P2; the objective is to compute I = X ∩ Y, while revealing only I to each party. One efficient solution proposed in [109] uses a partially homomorphic encryption scheme and is implemented as follows. P1 defines a polynomial P whose roots are the elements in X, that is, P(y) = (x1 − y)(x2 − y) · · · (xn − y) = Σ_{u=0}^{n} αu y^u. This party then sends homomorphic encryptions of the coefficients to P2. With the encrypted coefficients, P2 can compute, for each yi ∈ Y, E(ri · P(yi) + yi), by multiplying P(yi) by a random number ri (different for each i) and adding its input yi, where E(·) represents the homomorphic encryption. P2 then sends these k results to P1. For each yi ∈ X ∩ Y, E(ri · P(yi) + yi) = E(yi), since P(yi) = 0. P1 can then decrypt the received results and check whether each result is in its own set, and consequently in the intersection, or is simply a random value (recall the addition of ri).

While the aforementioned implementation of the set intersection protocol involves only two parties, the multiparty case has also been studied [114]. Furthermore, the semi-honest behaviour model was assumed; however, a modification to provide security against malicious parties was also proposed in the original paper [109].

IV. PPDM AND PRIVACY METRICS
Since privacy has no single standard definition, quantifying privacy is quite challenging. Nevertheless, in the context of PPDM, some metrics have been proposed. Unfortunately, no single metric is enough, since multiple parameters may be evaluated [8], [11], [86]. The existing metrics may be classified into three main categories, differing on what aspect of the PPDM is being measured: privacy level metrics, which measure how secure the data is from a disclosure point of view; data quality metrics, which quantify the loss of information/utility; and complexity metrics, which measure the efficiency and scalability of the different techniques.

Privacy level and data quality metrics can be further categorised into two subsets [8]: data metrics and result metrics. Data metrics evaluate the privacy level/data quality by appraising the transformed data that resulted from applying a privacy-preserving method (e.g. randomisation or a privacy model). Result metrics make a similar evaluation, but the assessment is done on the results of the data mining (e.g. classifiers) that were developed with the transformed data.

The following subsections present a survey on PPDM metrics concerning privacy level, data quality and complexity. Table 5 summarises the privacy level and data quality metrics described in this section, sub-categorised as data or result metrics.

A. PRIVACY LEVEL
As previously mentioned, the primary objective of PPDM methods is to preserve a certain level of privacy while maximizing the utility of the data. The privacy level metrics give a sense of how secure the data is from possible privacy breaches. Recall from the aforementioned discussion that privacy level metrics can be categorised into data privacy metrics and result privacy metrics. In this context, data privacy metrics measure how the original sensitive information may be inferred from the transformed data that resulted from applying a privacy-preserving method, while result privacy metrics measure how the results of the data mining can disclose information about the original data.
TABLE 5. Privacy level and data quality metrics, further categorised as data metrics, if the evaluation is made based on the transformed data, or as result
metrics, if the evaluation is made based on the results of the data mining technique (e.g. the produced classifiers) on the transformed data.
One of the first proposed metrics to measure data privacy is the confidence level [26]. This metric is used in additive-noise-based randomisation techniques, and measures how well the original values may be estimated from the randomised data. If an original value can be estimated to lie in an interval [x1, x2] with c% confidence, then the width of the interval (x2 − x1) is the amount of privacy at c% confidence. The problem with this metric is that it does not take into account the distribution of the original data, therefore making it possible to localise the original distribution in a smaller interval than [x1, x2], with the same c% confidence.

To address the issue of not taking into account the distribution of the original data, the average conditional entropy metric [117] was proposed, based on the concept of information entropy. Given two random variables X and Z,5 the average conditional privacy of X given Z is 2^h(X|Z), where h(X|Z) is the conditional differential entropy of X, defined as:

h(X|Z) = − ∫_{X,Z} f_{X,Z}(x, z) log2 f_{X|Z=z}(x) dx dz

where f_X(·) and f_Z(·) are the density functions of X and Z, respectively.

5 Consider, for example, that X is the original data distribution and Z the noisy distribution from subsection III-A.

In multiplicative noise randomisation, privacy may be measured using the variance between the original and the perturbed data [28]. Let x be a single original attribute value, and z the respective distorted value; then Var(x − z)/Var(x) expresses how closely one can estimate the original values using the perturbed data.
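The variance-based measure can be computed directly from the original and perturbed values, as in the following sketch with invented additive and multiplicative noise parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(100.0, 10.0, size=5_000)          # original attribute values

z_add = x + rng.normal(0.0, 15.0, size=x.size)   # additive perturbation
z_mul = x * rng.normal(1.0, 0.15, size=x.size)   # multiplicative perturbation

def variance_privacy(original, distorted):
    """Var(x - z) / Var(x): larger values mean the original values are
    harder to estimate from the perturbed data."""
    return np.var(original - distorted) / np.var(original)

print(variance_privacy(x, z_add))   # ~2.25 for this additive noise
print(variance_privacy(x, z_mul))   # depends on the multiplicative noise
```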
In the data publishing privacy subsection (subsection III- pleteness, which evaluates the loss of individual data in the
B), the k-anonymity, the l-diversity, the t-closeness, and the sanitised dataset, and consistency, which quantifies the loss
-differential privacy models were presented. Each of these of correlation in the sanitised data. Furthermore, and similarly
models has a certain control over the privacy level, since to the privacy level metrics, data quality measurements may
variables k, l, t and are defined a priori and thus, act as be made from a data quality point of view, or from the quality
privacy metrics, for a prescribed level of security. However, of the results of a data mining application. Several metrics
these metrics are specific to such techniques. have been defined for both points of view, and for each of the
Result privacy metrics, as opposed to data privacy metrics, parameters described above. In this subsection, a description
are metrics that measure if sensitive data values may be of some of the most commonly used metrics will be given.
inferred from the produced data mining outputs (a classifier, Fletcher and Islam [13] surveyed a series of metrics used
for example). These metrics are more application specific to measure information loss from the data quality perspective,
than the previously described. In fact, Fung et al. [33] defined for generalisation and suppression operations, and for equiv-
these metrics as ‘‘special purpose metrics’’. alence classes algorithms (such as the k-anonymity). For
One important result privacy metric is the hidden fail- the generalisation and suppression techniques, the authors
ure (HF), used to measure the balance between privacy and described the Minimal Distortion (MD) (first proposed as
generalisation height [116]), the Loss Metric (LM) [118] and
5 Consider for example that X is the original data distribution and Z the the Information Loss (ILoss) metric [51]. The MD metric is
noisy distribution from subsection III-A. a simple counter that increments every time a value is gener-
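As an illustration of the equivalence-class view underlying the DM, the sketch below computes a discernibility-style score for a generalised table: for every record it adds the size of that record's equivalence class over the quasi-identifiers, so larger (more generalised) classes yield a larger score. The exact metric in [120] additionally penalises suppressed records, which this simplified sketch omits; the table and attribute names are hypothetical.

```python
from collections import Counter

def discernibility(records, quasi_identifiers):
    """Sum, over all records, of the size of the record's equivalence class.

    An equivalence class groups records sharing the same (generalised)
    quasi-identifier values; in a k-anonymous table every class has size >= k,
    so the score grows as k (and hence generalisation) grows.
    """
    class_sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return sum(class_sizes[tuple(r[a] for a in quasi_identifiers)] for r in records)

# Hypothetical generalised table with quasi-identifiers 'age' and 'zip'.
table = [
    {"age": "2*", "zip": "130**", "disease": "flu"},
    {"age": "2*", "zip": "130**", "disease": "asthma"},
    {"age": "3*", "zip": "148**", "disease": "flu"},
    {"age": "3*", "zip": "148**", "disease": "diabetes"},
]
print(discernibility(table, ["age", "zip"]))   # 2*2 + 2*2 = 8
```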
In [117], a metric to measure the accuracy of any reconstruction algorithm (such as in randomisation) is defined. The authors measure the information loss by comparing the reconstructed distribution with the original distribution. Let f_X(x) be the original density function and \hat{f}_X(x) the reconstructed density function. Then, the information loss is defined as:

I(f_X, \hat{f}_X) = \mathbb{E}\left[ \frac{1}{2} \int_X \left| f_X(x) - \hat{f}_X(x) \right| dx \right]

where the expected value corresponds to the L1 distance between the original distribution f_X(x) and the reconstructed estimate \hat{f}_X(x). Ideally, the information loss should be I(f_X, \hat{f}_X) = 0, which states that f_X(x) = \hat{f}_X(x), that is, the reconstruction was perfect and, therefore, no information was lost.
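When the original and reconstructed densities are only available through samples, the information loss above can be approximated numerically. The sketch below, which assumes both densities are represented as histograms over a common grid, computes half of the L1 distance between them; it illustrates the definition rather than the estimation procedure used in [117].

```python
import numpy as np

def information_loss(f_original, f_reconstructed, bin_width):
    """Approximate I(f_X, f_hat_X) = 0.5 * integral |f_X(x) - f_hat_X(x)| dx.

    Both inputs are density values evaluated on the same equally spaced grid,
    so the integral is approximated by a Riemann sum. The result lies in [0, 1];
    0 means the reconstruction matches the original distribution exactly.
    """
    f_original = np.asarray(f_original, dtype=float)
    f_reconstructed = np.asarray(f_reconstructed, dtype=float)
    return 0.5 * np.sum(np.abs(f_original - f_reconstructed)) * bin_width

# Hypothetical example: original N(0, 1) samples vs. a noisier reconstruction.
rng = np.random.default_rng(0)
grid = np.linspace(-5, 5, 51)                      # 50 bins of width 0.2
orig, _ = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=grid, density=True)
recon, _ = np.histogram(rng.normal(0.3, 1.2, 10_000), bins=grid, density=True)
print(information_loss(orig, recon, bin_width=grid[1] - grid[0]))
```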
The metrics for evaluating the quality of the results are specific to the data mining technique that is used. These metrics are often based on the comparison between the results of the data mining over the perturbed data and over the original data.

Two interesting metrics to measure data quality loss from the results of pattern recognition algorithms are the Misses Cost (MC) and the Artifactual Patterns (AP), presented in [115]. The MC measures the number of patterns that were incorrectly hidden, that is, non-sensitive patterns that were lost in the process of privacy preservation (recall the aforementioned discussion on association rule hiding). This metric is defined as follows. Let D be the original database and D' the sanitised database. The misses cost is given by:

MC = \frac{\#{\sim}R_P(D) - \#{\sim}R_P(D')}{\#{\sim}R_P(D)}

where \#{\sim}R_P(X) denotes the number of non-restrictive patterns discovered from database X. Ideally, MC = 0% is desired, which means that all non-sensitive patterns are still present in the transformed database. The AP metric measures artifactual patterns, i.e. the number of patterns that did not exist in D but were created in the process that led to D'. The following equation defines the AP metric:

AP = \frac{|P'| - |P \cap P'|}{|P'|}

where P and P' are the sets of all patterns in D and D', respectively, and |\cdot| represents the cardinality. In the best case scenario, AP should be equal to 0, indicating that no artificial pattern was introduced by the sanitisation process.

For clustering techniques, the Misclassification Error (ME) metric proposed in [119] measures the percentage of data points that "are not well classified in the distorted database", that is, the number of points that were not grouped within the same cluster with the original data and with the sanitised data. The misclassification error is defined by the following equation:

ME = \frac{1}{N} \sum_{i=1}^{k} \left( |Cluster_i(D)| - |Cluster_i(D')| \right)

with N the total number of points in the database, k the number of clusters, and |Cluster_i(X)| the number of legitimate data points of the ith cluster in database X.
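A minimal sketch of the misclassification error follows, assuming that the clusters obtained on the sanitised data have already been matched to the corresponding original clusters (so that label i denotes "the same" cluster in both clusterings); the labels are hypothetical.

```python
import numpy as np

def misclassification_error(labels_original, labels_sanitised):
    """ME = (1/N) * sum_i (|Cluster_i(D)| - |Cluster_i(D')|).

    |Cluster_i(D')| is taken as the number of 'legitimate' points, i.e. points
    of original cluster i that are still assigned to cluster i after the data
    was distorted. ME is therefore the fraction of points that changed cluster.
    """
    labels_original = np.asarray(labels_original)
    labels_sanitised = np.asarray(labels_sanitised)
    n = len(labels_original)
    me = 0.0
    for i in np.unique(labels_original):
        original_size = np.sum(labels_original == i)
        legitimate = np.sum((labels_original == i) & (labels_sanitised == i))
        me += original_size - legitimate
    return me / n

# Hypothetical labels for 8 points clustered before and after sanitisation.
before = [0, 0, 0, 1, 1, 1, 2, 2]
after  = [0, 0, 1, 1, 1, 1, 2, 0]          # two points ended up in other clusters
print(misclassification_error(before, after))   # 2/8 = 0.25
```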
Additional metrics to evaluate the quality of results for classification and clustering are described in [13]. These include commonly used quantitative approaches to measure the quality of data mining results, such as the Rand index [121] and the F-measure [122]. Finally, note that the cryptographic techniques implemented in distributed privacy preserve data quality, since no sanitisation is applied to the data.

C. COMPLEXITY
The complexity of PPDM techniques mostly concerns the efficiency and the scalability of the implemented algorithm [8]. These metrics are common to all algorithms and, therefore, only a brief discussion is presented in this subsection.

To measure efficiency, one can use metrics for the usage of certain resources, such as time and space. Time may be measured by the CPU time or by the computational cost. Space metrics quantify the amount of memory required to execute the algorithm. In distributed computation, it may also be interesting to measure the communication cost, based either on the time or on the number of exchanged messages, as well as the bandwidth consumption. Both time and space are usually measured as a function of the input size.

Scalability refers to how well a technique performs under increasing amounts of data. This is an extremely important aspect of any data mining technique, since databases are ever increasing. In distributed computation, increasing the inputs may severely increase the amount of communication. Therefore, PPDM algorithms must be designed in a scalable way. Scalability may be evaluated empirically by subjecting the system to different loads [123]. For example, to test whether a PPDM algorithm is scalable, one can run several experiments with increasing input data and measure the loss of efficiency. The loss of efficiency over the experiments can then be used to measure scalability, since a more scalable system will present lower efficiency losses than a less scalable system under the same "pressure".
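The kind of empirical scalability test described above can be scripted along the following lines: a placeholder sanitisation routine is timed over increasing input sizes and the per-record cost is reported. The `sanitise` function here is a stand-in, not an algorithm from the surveyed works.

```python
import random
import time

def sanitise(records):
    """Placeholder PPDM step: here, simple additive-noise perturbation."""
    return [value + random.gauss(0, 1) for value in records]

def scalability_profile(sizes, repetitions=3):
    """Measure the average wall-clock time per run for each input size."""
    profile = []
    for n in sizes:
        data = [random.random() for _ in range(n)]
        start = time.perf_counter()
        for _ in range(repetitions):
            sanitise(data)
        elapsed = (time.perf_counter() - start) / repetitions
        profile.append((n, elapsed))
    return profile

for n, seconds in scalability_profile([10_000, 100_000, 1_000_000]):
    # A roughly constant time-per-record across sizes suggests linear scaling.
    print(f"n={n:>9}  total={seconds:.4f}s  per-record={seconds / n:.2e}s")
```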
V. PPDM APPLICATIONS
In the previous two sections, a description of different privacy-preserving techniques was given, as well as a set of metrics to measure the privacy level, the data quality and the complexity. This section describes some existing PPDM applications, focusing on the employed privacy-preserving techniques and on the metrics used to measure the preservation of privacy.

The following subsections group the PPDM applications into the following fields: cloud computing, e-health, wireless sensor networks (WSN), and location-based services (LBS). Furthermore, in the e-health subsection an emphasis is given to genome sequencing, due to the rising privacy research interest in the area, and in the LBS subsection typical applications such as vehicular communications and mobile device location privacy are described. Note that this section does not extensively survey existing PPDM applications. Nonetheless, it is sufficient to illustrate some of the privacy-preserving methods described in this work and to relate their applicability to the assumptions and privacy requirements of the applications. For comprehensive reviews on privacy in genome sequencing, WSN and location privacy, readers can refer to [124]–[126], respectively.

A. CLOUD PPDM
The U.S. National Institute of Standards and Technology (NIST) defined cloud computing [127] as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." In other words, the cloud is a distributed infrastructure with great storage and computation capabilities that is accessible through the network, anytime and anywhere. Therefore, applications (or services) that collect, store and analyse large quantities of data often require the cloud. However, entities need either to trust cloud providers with their data (entities often seek to distribute the service through multiple providers to increase security and availability), or to apply techniques that protect the data while stored and/or during distributed computation. The cloud may also be used to publish data, in which case query auditing and inference control may be required. Consequently, cloud-based services are one of the primary focuses of privacy-preserving techniques [128].

In [25], a scheme for classification over horizontally partitioned data under semi-honest behaviour is proposed. This scheme allows owners to store encrypted data in the cloud, thus preserving the privacy of the data in communications and while stored. Furthermore, queries to the cloud are allowed, to obtain the classes of a given set of inputs over encrypted data without the need for intermediate decryption. Using homomorphic encryption, the data, the query and the result are encrypted, and only the "querist" can decrypt the result, thus protecting the user from information leakage even against the cloud provider. The authors formally prove the security of the scheme and evaluate its computational and communication complexity through simulations. Another approach that uses homomorphic encryption for cloud computing is presented in [99], for storing and mining association rules in a vertically distributed environment with multiple servers. This approach achieves privacy if at least one out of the n servers is honest and, similarly to [97], security is proven mathematically, based on cryptography. Additionally, the authors of [99] also present a series of efficient secure building blocks that are required for their solution.
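To give a flavour of how such schemes compute over encrypted data, the sketch below implements a textbook, toy-sized Paillier cryptosystem, which is additively homomorphic: a cloud server can add ciphertexts without ever decrypting them. This is a didactic sketch with insecurely small, hard-coded primes and is not the construction used in [25] or [99].

```python
import math
import random

# Toy key generation with small (insecure) primes; real deployments use
# primes of 1024 bits or more. Requires Python 3.9+ (math.lcm, pow(x, -1, n)).
p, q = 10007, 10009
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)      # modular inverse of L(g^lam mod n^2)

def encrypt(m):
    """Encrypt an integer 0 <= m < n."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (L(pow(c, lam, n_sq)) * mu) % n

def add_encrypted(c1, c2):
    """Homomorphic addition: Dec(c1 * c2 mod n^2) = m1 + m2 mod n."""
    return (c1 * c2) % n_sq

# A 'cloud' can aggregate readings it cannot read individually.
readings = [17, 25, 4]
ciphertexts = [encrypt(m) for m in readings]
aggregate = ciphertexts[0]
for c in ciphertexts[1:]:
    aggregate = add_encrypted(aggregate, c)
print(decrypt(aggregate))    # 46, without the cloud ever seeing 17, 25 or 4
```

In a realistic deployment the private key (lam, mu) would stay with the data owner, so the cloud could only ever manipulate ciphertexts.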
Privacy in the cloud is not limited to the use of secure protocols. For instance, in [43], a technique to publish data to the cloud based on the concept of k-anonymity is presented. The authors describe a novel approach where equivalence classes have fewer than k records but still ensure the k-anonymity principle by exploiting overlaps in the definitions of the equivalence classes, that is, by creating classes with fewer than k records (a divisor of k) such that each record can belong to multiple classes and, thus, provide k-anonymity. By having a lower number of records in each class, the number of required generalisations is lower and, thus, more utility is preserved. The authors show this result by measuring the information loss, and also show the good performance of their implementation.
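Since several of these publishing schemes revolve around the k-anonymity principle, the following minimal check verifies the basic property on a generalised table: every combination of quasi-identifier values must occur in at least k records. It reuses the equivalence-class grouping of the discernibility sketch above; the records shown are hypothetical, and the overlapping-class construction of [43] is not reproduced here.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class over the quasi-identifiers has >= k records."""
    class_sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in class_sizes.values())

published = [
    {"age": "2*", "zip": "130**", "diagnosis": "flu"},
    {"age": "2*", "zip": "130**", "diagnosis": "asthma"},
    {"age": "3*", "zip": "148**", "diagnosis": "flu"},
    {"age": "3*", "zip": "148**", "diagnosis": "diabetes"},
]
print(is_k_anonymous(published, ["age", "zip"], k=2))   # True
print(is_k_anonymous(published, ["age", "zip"], k=3))   # False
```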
B. E-HEALTH PPDM
Health records are considered to be extremely private, as much of this data is considered sensitive. However, the increase in the amount of data, combined with the favourable properties of the cloud, has led health services to store and exchange medical records through this infrastructure [129]. Thus, to protect from unwanted disclosures, privacy-preserving approaches are considered.

In [129], a survey on the state-of-the-art privacy-preserving approaches employed in e-health clouds is given, where the authors divide PPDM techniques into cryptographic and non-cryptographic. The cryptographic techniques are usually based on encryption, whereas non-cryptographic approaches are based on policies and/or some sort of restricted access. An example of a cryptographic technique is found in [97], where the authors propose a privacy-preserving medical text mining and image feature extraction scheme based on fully homomorphic data aggregation under semi-honest behaviour. The authors formally prove that their encryption is secure both from the data point of view and from the results point of view. They also evaluate the performance of the PPDM by measuring computation and communication costs over the amount of input data.

An emerging field in e-health that is raising a growing privacy interest is genome sequencing, the process of studying genetic information about an individual through the study of sequences of DNA (deoxyribonucleic acid). Genomic data sees applications in health care, forensics and even direct-to-consumer services [124]. Due to the advances in genome sequencing technologies and the capabilities of the cloud for computation and communication of data, this area has experienced a recent boom in research, including in the privacy field.

Genetic data is highly identifiable and can be extremely sensitive and personal, revealing health conditions and individual traits [124]. Furthermore, this type of data also reveals information about blood relatives, thus involving not only a single individual [124]. It is, therefore, critical to prevent unwanted disclosure of this type of data, while preserving maximum utility.

For genome data publishing, Uhlerop et al. [56] proposed a solution for releasing aggregate data based on ε-differential privacy. This approach was motivated by the work in [130], where an attack to accurately identify the presence of an individual in a DNA mixture of numerous individuals was introduced. Thus, in [56], additive noise is added to the statistical data to be released, in order to achieve ε-differential privacy. Simulations have shown that ε-differential privacy is achievable and that good statistical utility is preserved. However, for big and sparse data, the release of simple summary statistics is problematic from both the privacy and the utility perspectives.
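The additive-noise approach in [56] follows the general pattern of the Laplace mechanism for ε-differential privacy, in which noise calibrated to the query sensitivity is added to each released statistic. The sketch below applies that general mechanism to hypothetical aggregate counts; the sensitivity of 1 assumes that one individual can change each count by at most one, and the code is not the release procedure of [56].

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value + Laplace(0, sensitivity / epsilon) noise.

    A smaller epsilon (stronger privacy) means a larger noise scale and,
    therefore, lower utility of the released statistic.
    """
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(7)
true_counts = {"SNP_1": 412, "SNP_2": 98, "SNP_3": 1305}   # hypothetical aggregates
epsilon = 0.5
noisy_counts = {
    snp: laplace_mechanism(count, sensitivity=1.0, epsilon=epsilon, rng=rng)
    for snp, count in true_counts.items()
}
print(noisy_counts)
```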
Recently, McLaren et al. [98] proposed a framework for privacy-preserving genetic storing and testing. The depicted scenario involves the patients (P); a certified institution (CI), which has access to unprotected raw genetic data and therefore must be a trusted entity; a storage and processing unit (SPU); and medical units (MU). Both the SPU and the MU follow a semi-honest adversarial behaviour, i.e. they follow the protocol but may attempt to infer sensitive data about the patients. Essentially, the patient supplies the data to the CI, which stores such data encrypted in the SPU using a partially homomorphic encryption scheme. MUs can then use secure two-party protocols with the SPU to operate on the data in encrypted form, which is decrypted only when the result is returned from the SPU to the MU. Their framework has proven to be efficient, although it is limited to some genetic tests. Fully homomorphic encryption is suggested as a future solution to this limitation; however, its computational cost is currently prohibitive.

C. WIRELESS SENSOR NETWORKS PPDM
Wireless Sensor Networks (WSN), sometimes called Wireless Sensor and Actuator Networks (WSAN), are networks of sparsely distributed autonomous sensors (and actuators) that monitor (and act upon changes in) the physical environment [131] (e.g. light, temperature). Each sensor/actuator is referred to as a node in the WSN, and data is exchanged wirelessly between these devices. Since nodes have low battery capacity, one of the most important challenges in WSNs is the efficiency of communication and data processing at each node [131]. Thus, techniques to aggregate data from multiple sensors are often used to reduce network traffic and, hence, improve battery life and consequently the sensors' lifetime. Data generated in WSNs may be considered sensitive in many different applications. For instance, the sensed humidity of a room may determine room occupancy, and house electrical usage over time may be used to track household behaviour [100]. Due to the aggregation of data and the WSNs' topology, attackers may try to control one or a few nodes to obtain access to all the information. In this case, even if the communications are encrypted, the compromised nodes have the ability to decrypt the information, giving the adversary full access [132]. Therefore, privacy-preservation techniques may be required.

In [100], an approach to leverage the advantages of data aggregation for efficiency while preserving the privacy of the collected data is proposed. In this work, users can only query aggregator nodes to obtain aggregated data. Aggregator nodes query a set of nodes for the sensed values and compute the aggregation results over the received data, which are then forwarded to the inquirer. However, users must be able to verify the integrity of the aggregated data, since malicious users may try to control aggregation nodes and send false data. The WSN owners, on the other hand, want to prevent disclosure of individual sensor data, thus restricting query results to aggregated data. The challenge here is how to verify the integrity of aggregated data without access to the original sensed data. To address this issue, a framework is proposed where the user has full access to encrypted sensor data in addition to the aggregated data; the user can verify the integrity of the aggregated data by making use of the encrypted sensed values, without decrypting them. Four solutions were described, each providing a different privacy-functionality trade-off, where one of the solutions uses (partially) homomorphic encryption to achieve perfect privacy, that is, no individual sensed value is disclosed. The authors compare the four solutions in terms of the number of messages exchanged and the supported aggregation functions.

Another approach that makes use of data aggregation to preserve privacy is proposed in [40]. This approach is non-cryptographic and implements a concept similar to k-anonymity, referred to as k-indistinguishable, where, instead of generalisations, synthetic data is added to camouflage the real values (obfuscation). Aside from using k to control the number of indistinguishable values, a discussion on how to decrease the probability of a privacy breach under colluding nodes (which combine information from multiple nodes) is given. The authors also compare the performance of their implementation against encryption approaches, and the results show that this method is more time and power efficient than such approaches.
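The obfuscation idea can be illustrated with a deliberately simplified sketch: a node hides its real reading among k − 1 synthetic values and only a party that knows the pre-agreed position of the real value (modelled here as a shared secret index) can recover it, so an eavesdropper sees k equally plausible readings. This illustrates k-indistinguishability in general and is not the actual KIPDA message format or aggregation procedure of [40].

```python
import random

def camouflage(real_value, k, real_index, low, high):
    """Build a k-item message whose real reading sits at a pre-agreed index."""
    message = [random.uniform(low, high) for _ in range(k)]
    message[real_index] = real_value
    return message

def recover(message, real_index):
    """Only a party knowing the shared secret index can pick the real value."""
    return message[real_index]

k = 5
secret_index = 2                   # agreed in advance between node and sink
reading = 23.7                     # e.g. a sensed temperature
msg = camouflage(reading, k, secret_index, low=15.0, high=35.0)
print(msg)                         # five plausible readings, one of them real
print(recover(msg, secret_index))  # 23.7
```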
The above examples concern information leakage from within the WSN. However, large WSNs may be queried by multiple entities (clients) that may not trust the network owners [133]. The network owners may infer clients' intentions through their queries and profiles: these queries may be specific to a given area or a given event, thus revealing the intention. As stated in [133], one solution would be to query all sensors in the network and save only the readings of interest. However, this would result in a significant load on the network, especially in large networks.

To address this issue, Carbunar et al. [133] proposed two approaches, differing on the type of network model: querying server(s) that belong to a single owner (organization) and querying servers (at least two) belonging to different organizations. In both scenarios, servers are considered to be semi-honest, i.e. they abide by the protocols but attempt to learn more than allowed. In the single-owner model, the idea is to create a trade-off between the area of sensors that is queried and the privacy that is achieved. If the client queries only the region of interest, then no privacy is achieved but the cost is minimal, whereas if the query targets the whole network, the cost is maximum but the achieved privacy is also maximized. The solution is thus a function that transforms an original query sequence into a transformed sequence, in order to conceal the region(s) of interest. Two metrics were used to measure privacy: the spatial privacy level and the temporal privacy level. The spatial metric is the inverse of the probability of the server guessing the regions of interest from the transformed query. The temporal privacy level measures the distance between the distributions of the frequency of access to the regions obtained with the transformed and the original queries; a higher distance value translates into a better obfuscation of the frequency of access to the regions of interest. In the multiple-owners situation, cryptography is used to assign a virtual region to each sensor that is only recognized by the client and the sensor. A queried server then broadcasts the encrypted query, which is dropped by sensors that do not belong to the target virtual region. Sensors from the queried region encrypt the sensed values and return the results, which can only be decrypted by the client. This solution is fully private, as long as the servers used to create the virtual regions do not collude.

D. LOCATION-BASED SERVICES PPDM
Pervasive technologies such as the global positioning system (GPS) allow obtaining highly accurate location information. LBSs use this spatiotemporal data to provide users with useful contextualized services [134]. However, this same information can be used to track users and consequently discover, for example, their workplace, the location of their home and the places that they visit [126]. Furthermore, this information can also be used to identify users, since routes and behaviours often have characteristic patterns [135]. Therefore, the possibility of location information leakage is a serious concern and a threat to one's privacy. This type of leakage occurs when attackers have access to the LBS data or when LBS providers are not trustworthy. In computational location privacy, privacy is achieved with anonymity, data obfuscation (perturbation), or through application-specific queries [126]. Below, some examples are described.

For location anonymity, users can be assigned IDs (pseudonyms) to prevent identity disclosure. However, these pseudonyms must be changed periodically so that users cannot be tracked over time and space, which would consequently disclose their identity. To prevent this type of disclosure, Beresford and Stajano [41] presented the concept of the "mix zone". In this approach, IDs are changed every time users enter a mix zone. In this type of zone, at least k − 1 other users are present, such that changing all pseudonyms prevents the linkage between the old and the new pseudonyms. With this approach, and similarly to the k-anonymity privacy model, k may be used as a privacy metric.

In data obfuscation, the idea is to generate synthetic data or to add noise in order to degrade the quality of the spatial, and sometimes temporal, data. The assumption is that the LBS provider is untrustworthy. Simple examples include giving multiple locations and/or imprecise locations [126]. In [136], a solution to "cloak" users' locations using an intermediary anonymiser server (between the user and the LBS) is proposed. The user queries the intermediary server (named CacheCloak) and, if this server has the correct data for the location in its cache, the data is sent to the user without querying the LBS. If the location data is not cached, the CacheCloak server creates a prediction path from the queried point until it reaches a point in another cached path, and then queries the LBS for all these points. The received data is then cached and the correct data is forwarded to the user. As the user moves along the predicted path, CacheCloak returns the cached information. When the user deviates from the predicted path, and if the new position is not yet cached, the same process is repeated. Since the whole predicted path is queried from the LBS at once, the service provider has no way of knowing the exact user location or the movement direction. The authors present a metric based on the concept of (location) entropy to measure the achieved privacy level and show how their solution to location privacy can work in real-time LBS services. Furthermore, an implementation that works under the assumption of an untrusted CacheCloak server is also discussed.
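The location entropy used as a privacy metric in [136] can be computed from the attacker's probability distribution over the candidate locations of a user; the sketch below computes the Shannon entropy of such a (hypothetical) belief.

```python
import math

def location_entropy(probabilities):
    """Shannon entropy (in bits) of the attacker's belief over candidate locations.

    Higher entropy means the attacker is less certain about the true location;
    the maximum, log2(number of candidates), is reached for a uniform belief.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

uniform_belief = [0.25, 0.25, 0.25, 0.25]       # four equally likely positions
skewed_belief = [0.85, 0.05, 0.05, 0.05]        # attacker almost certain
print(location_entropy(uniform_belief))          # 2.0 bits
print(location_entropy(skewed_belief))           # about 0.85 bits
```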
Another type of technique to achieve location privacy is to implement private queries, that is, location queries that do not disclose the user's location to the LBS provider. In [137], an approach using a secure protocol is presented that allows users to query the LBS server through an encrypted query that does not reveal the user's location. The protocol used is private information retrieval (PIR), which has many similarities with the oblivious transfer protocol. With the encrypted query, the server computes the nearest neighbour, to retrieve the point of interest closest to the user's location. The authors apply data mining techniques to optimise the performance of their solution, namely to identify redundant partial products, and show through simulation that the final cost in server time and the cost of communications are reasonable for location-based applications. This solution achieves full privacy in the sense that it is computationally infeasible for the server to decipher the encrypted query.

Vehicular communication privacy may be seen as a particular case of location privacy. These location-based systems are essentially networks where cars and roadside units are nodes that communicate wirelessly to exchange spatiotemporal information [138]. Location-based services make use of this data to provide drivers with useful content, such as traffic conditions, targeted advertising, and others. In this scenario, the highly accurate spatiotemporal information provided by the GPS is transferred to a third-party server that accumulates route information which can be used to track drivers [138]. Privacy preservation is thus required to protect drivers from being tracked. In [138], a privacy preservation approach under the assumption of an untrusted collector is presented. This technique uses synthetic data generation to obfuscate the real trajectory of the car by providing consistent fake locations. The authors present three measures of privacy: the tracking process, which is measured by the attacker's belief (probability) that a given location-time sample corresponds to the real location of the car; the location entropy, which measures the location uncertainty of an attacker; and the tracking success ratio, which measures the chance that the attacker's belief is correct when targeting a driver over some time t.

In this section we provided an overview of a set of relevant applications of PPDM methods; several other applications exist for the aforementioned domains and others, as listed in Tables 2, 3 and 4.

VI. LESSONS LEARNED AND OPEN ISSUES
While PPDM is a fairly recent concept, extensive research has been carried out by different scientific communities, such as cryptography, database management and data mining. This results in a variety of techniques, metrics and specific applications. Nevertheless, it is essential to understand the underlying assumptions of each problem.

In this survey, PPDM techniques were partitioned according to the phase of the data lifecycle at which the privacy-preserving technique is applied. This natural separation comes as a consequence of the different assumptions at each phase. These assumptions, which have been highlighted as scenario descriptions throughout Tables 1, 2, 3 and 4, condition the design of the PPDM techniques to address disclosure of data at different phases of the data lifecycle. These phases are tied to distinct user roles and corresponding privacy concerns/assumptions on the adversary [24].

Even at a given data phase, there is no single optimal PPDM technique. The appropriate choice is often a matter of weighing the different trade-offs between the desired privacy level, the information loss (which is measured by data utility metrics), the complexity and even the practical feasibility of the techniques. Another aspect to take into consideration is the type of adversarial behaviour and the corresponding privacy breaches that can be exploited. The evolution in the research on group anonymization techniques from k-anonymity to l-diversity and t-closeness, presented in subsection III-B, illustrates how different types of attacks can compromise privacy (in this case, anonymity) and how different techniques can be applied to protect from these invasions.

The evolution of PPDM is motivated by the privacy requirements of the applications and fields/domains that handle data. Different application domains have different assumptions, requirements and concerns related to privacy. While this heterogeneity leads to a vast diversity of algorithms and techniques, the underlying concepts are often transversal. However, PPDM is far from being a closed subject [1]. Aside from the classical information technology requirements, such as scalability and efficiency, PPDM still presents several challenges with respect to data privacy.

A. OPEN ISSUES
Due to the broadness of the term, defining privacy is quite challenging. Even in the limited scope of information privacy, several definitions have been presented. In fact, there is always a fair amount of subjectiveness due to individuals' own privacy concepts, beliefs and risk assertions. It is, therefore, necessary to develop systems that implement the concept of personalised privacy. This notion allows users to have a level of control over the specificity of their data. However, personalised privacy is challenging to implement. On the one hand, it has been shown that users' concerns about privacy do not mirror users' actions [5], [139], that is, users tend to trade their privacy for utility. Therefore, a personalised privacy solution could give users control over the data, but that control can become harmful, especially when users are unaware of the privacy risks of data disclosure [139]. On the other hand, the fact that users have no access to the overall distribution of sensitive values can lead to more protective decisions over their data, thus negatively affecting the utility of the data [33]. Seeking novel solutions to this well-known trade-off between privacy and utility is therefore required in the context of personalised privacy solutions.

The oblivious transfer protocol and homomorphic encryption are two techniques for preserving privacy and security that are able to achieve full privacy without incurring a loss of utility. However, these techniques are often not efficient enough for real-time applications [92]. Moreover, homomorphic encryption often requires a trade-off between functionality (supported functions) and efficiency. The development of more efficient secure protocols with better functionality trade-offs could increase the adoption of such techniques.

One important term that is raising interest in ubiquitous computing is that of context-aware privacy. In the envisioned world of the Internet of Things, sensors shall constantly monitor and sense the environment, allowing easier inference of a user's context [140]. Context-aware privacy is achieved when a system can change its privacy policy depending on the user context [141]. Such systems may grant users added control over the collection of data by adapting privacy preferences to the context without being intrusive for the user. Nevertheless, while defining policies in an automated way according to context seems a promising direction, this may be difficult to achieve when faced with new and unknown contexts due to the complexity and
[27] C. Dwork, ‘‘Differential privacy,’’ in Automata, Languages and Program- [52] M. Yuan, L. Chen, and P. S. Yu, ‘‘Personalized privacy protection in social
ming, vol. 4052. Venice, Italy: Springer-Verlag, Jul. 2006, pp. 1–12. networks,’’ Proc. VLDB Endowment, vol. 4, no. 2, pp. 141–150, 2010.
[28] S. R. M. Oliveira and O. R. Zaıane, ‘‘Privacy preserving clustering [53] B. Agir, T. G. Papaioannou, R. Narendula, K. Aberer, and J.-P. Hubaux,
by data transformation,’’ J. Inf. Data Manage., vol. 1, no. 1, p. 37, ‘‘User-side adaptive protection of location privacy in participatory sens-
2010. ing,’’ Geoinformatica, vol. 18, no. 1, pp. 165–191, 2014.
[29] J. J. Kim and W. E. Winkler, ‘‘Multiplicative noise for masking continu- [54] E. G. Komishani, M. Abadi, and F. Deldar, ‘‘PPTD: Preserving person-
ous data,’’ Statist. Res. Division, U.S. Bureau Census, Washington, DC, alized privacy in trajectory data publishing by sensitive attribute gener-
USA, Tech. Rep. 2003-01, 2003. alization and trajectory local suppression,’’ Knowl.-Based Syst., vol. 94,
[30] K. Liu, H. Kargupta, and J. Ryan, ‘‘Random projection-based multiplica- pp. 43–59, Feb. 2016.
tive data perturbation for privacy preserving distributed data mining,’’ [55] F. K. Dankar and K. El Emam, ‘‘Practicing differential privacy in health
IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 92–106, Jan. 2006. care: A review,’’ Trans. Data Privacy, vol. 6, no. 1, pp. 35–67, 2013.
[Online]. Available: http://dx.doi.org/10.1109/TKDE.2006.14 [56] C. Uhlerop, A. Slavković, and S. E. Fienberg, ‘‘Privacy-preserving data
[31] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, ‘‘On the privacy sharing for genome-wide association studies,’’ J. Privacy Confidentiality,
preserving properties of random data perturbation techniques,’’ in Proc. vol. 5, no. 1, pp. 137–166, 2013.
3rd IEEE Int. Conf. Data Mining, Nov. 2003, pp. 99–106. [57] C. Lin, Z. Song, H. Song, Y. Zhou, Y. Wang, and G. Wu, ‘‘Differential
[32] A. Narayanan and V. Shmatikov. (2006). ‘‘How to break anonymity privacy preserving in big data analytics for connected health,’’ J. Med.
of the Netflix prize dataset.’’ [Online]. Available: https://arxiv.org/abs/ Syst., vol. 40, no. 4, p. 97, 2016.
cs/0610105 [58] Z. Zhang, Z. Qin, L. Zhu, J. Weng, and K. Ren, ‘‘Cost-friendly differential
[33] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, ‘‘Privacy-preserving data privacy for smart meters: Exploiting the dual roles of the noise,’’ IEEE
publishing: A survey of recent developments,’’ ACM Comput. Surveys., Trans. Smart Grid, vol. 8, no. 2, pp. 619–626, Mar. 2017.
vol. 42. no. 4, pp. 14:1–14:53, 2010. [59] E. ElSalamouny and S. Gambs, ‘‘Differential privacy models for location-
[34] L. Sweeney, ‘‘Demographics often identify people uniquely,’’ Carnegie based services,’’ Trans. Data Privacy, vol. 9, no. 1, pp. 15–48, 2016.
Mellon Univ., Pittsburgh, PA, USA, Data Privacy Working Paper 3, 2000 [60] C. Dwork et al., ‘‘The algorithmic foundations of differential privacy,’’
[35] P. Golle, ‘‘Revisiting the uniqueness of simple demographics in the US Found. Trends Theor. Comput. Sci., vol. 9, nos. 3–4, pp. 211–407, 2014.
population,’’ in Proc. 5th ACM Workshop Privacy Electron. Soc., 2006, [61] Y. Zhao, M. Du, J. Le, and Y. Luo, ‘‘A survey on privacy preserv-
pp. 77–80. ing approaches in data publishing,’’ in Proc. IEEE 1st Int. Workshop
[36] L. Sweeney, ‘‘K-anonymity: A model for protecting privacy,’’ Int. J. Database Technol. Appl., Apr. 2009, pp. 128–131.
Uncertainty, Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, [62] M. Terrovitis, N. Mamoulis, and P. Kalnis, ‘‘Privacy-preserving
2002. anonymization of set-valued data,’’ Proc. VLDB Endowment, vol. 1, no. 1,
[37] X. Xiao and Y. Tao, ‘‘Anatomy: Simple and effective privacy preser- pp. 115–125, 2008.
vation,’’ in Proc. ACM 32nd Int. Conf. Very Large Database, 2006, [63] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang, ‘‘(α, k)-anonymity:
pp. 139–150. An enhanced k-anonymity model for privacy preserving data publishing,’’
[38] P. Samarati and L. Sweeney, ‘‘Protecting privacy when disclosing in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
information: k-anonymity and its enforcement through generalization 2006, pp. 754–759.
and suppression,’’ in Proc. IEEE Symp. Res. Secur. Privacy, 1998, [64] T. M. Truta and B. Vinay, ‘‘Privacy protection: p-sensitive k-anonymity
pp. 384–393. property,’’ in Proc. IEEE 22nd Int. Conf. Data Eng. Workshops,
[39] P. Samarati and L. Sweeney, ‘‘Generalizing data to provide anonymity Apr. 2006, p. 94.
when disclosing information,’’ in Proc. PODS, 1998, p. 188. [65] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu, ‘‘Aggregate query
[40] M. M. Groat, W. Hey, and S. Forrest, ‘‘KIPDA: k-indistinguishable answering on anonymized tables,’’ in Proc. IEEE 23rd Int. Conf. Data
privacy-preserving data aggregation in wireless sensor networks,’’ in Eng. (ICDE), Apr. 2007, pp. 116–125.
Proc. IEEE INFOCOM, Apr. 2011, pp. 2024–2032. [66] M. E. Nergiz, C. Clifton, and A. E. Nergiz, ‘‘Multirelational
[41] A. R. Beresford and F. Stajano, ‘‘Location privacy in pervasive k-anonymity,’’ in Proc. IEEE 23rd Int. Conf. Data Eng. (ICDE),
computing,’’ IEEE Pervasive Comput., vol. 2, no. 1, pp. 46–55, Apr. 2007, pp. 1417–1421.
Jan./Mar. 2003. [67] K. Wang and B. Fung, ‘‘Anonymizing sequential releases,’’ in Proc.
[42] B. Bamba, L. Liu, P. Pesti, and T. Wang, ‘‘Supporting anonymous location 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006,
queries in mobile environments with privacygrid,’’ in Proc. ACM 17th Int. pp. 414–423.
Conf. World Wide Web, 2008, pp. 237–246. [68] H. Tian and W. Zhang, ‘‘Extending `-diversity to generalize sensitive
[43] X.-M. He, X. S. Wang, D. Li, and Y.-N. Hao, ‘‘Semi-homogenous gen- data,’’ Data Knowl. Eng., vol. 70, no. 1, pp. 101–126, 2011.
eralization: Improving homogenous generalization for privacy preser- [69] N. Li, T. Li, and S. Venkatasubramanian, ‘‘Closeness: A new privacy
vation in cloud computing,’’ J. Comput. Sci. Technol., vol. 31, no. 6, measure for data publishing,’’ IEEE Trans. Knowl. Data Eng., vol. 22,
pp. 1124–1135, 2016. no. 7, pp. 943–956, Jul. 2010.
[44] T. S. Gal, Z. Chen, and A. Gangopadhyay, ‘‘A privacy protection model [70] J. Lee and C. Clifton, ‘‘Differential identifiability,’’ in Proc. 18th
for patient data with multiple sensitive attributes,’’ Int. J. Inf. Secur. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012,
Privacy, vol. 2, no. 3, p. 28, 2008. pp. 1041–1049.
[45] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, [71] N. Li, W. Qardaji, D. Su, Y. Wu, and W. Yang, ‘‘Membership privacy:
‘‘`-diversity: Privacy beyond k-anonymity,’’ ACM Trans. Knowl. Discov- A unifying framework for privacy definitions,’’ in Proc. ACM SIGSAC
ery Data, vol. 1, no. 1, p. 3, 2007. Conf. Comput. Commun. Secur., 2013, pp. 889–900.
[46] S. Kim, M. K. Sung, and Y. D. Chung, ‘‘A framework to preserve the [72] V. S. Verykios, ‘‘Association rule hiding methods,’’ Wiley Interdiscipl.
privacy of electronic health data streams,’’ J. Biomed. Inform., vol. 50, Rev., Data Mining Knowl. Discovery, vol. 3, no. 1, pp. 28–36, 2013.
pp. 95–106, Aug. 2014. [73] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios,
[47] M. Xue, P. Kalnis, and H. K. Pung, ‘‘Location diversity: Enhanced ‘‘Disclosure limitation of sensitive rules,’’ in Proc. Workshop Knowl.
privacy protection in location based services,’’ in Location and Context Data Eng. Exchange (KDEX), 1999, pp. 45–52.
Awareness. Berlin, Germany: Springer-Verlag, 2009, pp. 70–87. [74] Y. Shi, Z. Zhou, L. Cui, and S. Liu, ‘‘A sub chunk-confusion based privacy
[48] F. Liu, K. A. Hua, and Y. Cai, ‘‘Query l-diversity in location-based protection mechanism for association rules in cloud services,’’ Int. J.
services,’’ in Proc. IEEE 10th Int. Conf. Mobile Data Manage., Syst., Softw. Eng. Knowl. Eng., vol. 26, no. 4, pp. 539–562, 2016.
Services Middleware (MDM), May 2009, pp. 436–442. [75] H. AbdulKader, E. ElAbd, and W. Ead, ‘‘Protecting online social net-
[49] N. Li, T. Li, and S. Venkatasubramanian, ‘‘t-closeness: Privacy beyond works profiles by hiding sensitive data attributes,’’ Procedia Comput. Sci.,
k-anonymity and l-diversity,’’ in Proc. IEEE 23rd Int. Conf. Data vol. 82, pp. 20–27, Mar. 2016.
Eng. (ICDE), Apr. 2007, pp. 106–115. [76] S. Ji, Z. Wang, Q. Liu, and X. Liu, ‘‘Classification algorithms for privacy
[50] D. Riboni, L. Pareschi, C. Bettini, and S. Jajodia, ‘‘Preserving anonymity preserving in data mining: A survey,’’ in Proc. Int. Conf. Comput. Sci.
of recurrent location-based queries,’’ in Proc. IEEE 16th Int. Symp. Appl., 2016, pp. 312–322.
Temporal Represent. Reason. (TIME), Jul. 2009, pp. 62–69. [77] L. Chang and I. S. Moskowitz, ‘‘Parsimonious downgrading and decision
[51] X. Xiao and Y. Tao, ‘‘Personalized privacy preservation,’’ in Proc. VLDB, trees applied to the inference problem,’’ in Proc. Workshop New Secur.
2006, pp. 139–150. Paradigms, 1998, pp. 82–89.
[78] A. A. Hintoglu and Y. Saygın, ‘‘Suppressing microdata to prevent classi- [104] M. Kantarcioglu and C. Clifton, ‘‘Privacy-preserving distributed mining
fication based inference,’’ VLDB J.-Int. J. Very Large Data Bases, vol. 19, of association rules on horizontally partitioned data,’’ IEEE Trans. Knowl.
no. 3, pp. 385–410, 2010. Data Eng., vol. 16, no. 9, pp. 1026–1037, Sep. 2004.
[79] A. Shoshani, ‘‘Statistical databases: Characteristics, problems, and some [105] T. Tassa, ‘‘Secure mining of association rules in horizontally distributed
solutions,’’ in Proc. VLDB, vol. 82. 1982, pp. 208–222. databases,’’ IEEE Trans. Knowl. Data Eng., vol. 26, no. 4, pp. 970–983,
[80] N. R. Adam and J. C. Worthmann, ‘‘Security-control methods for statisti- Apr. 2014.
cal databases: A comparative study,’’ ACM Comput. Surv., vol. 21, no. 4, [106] J. Vaidya and C. Clifton, ‘‘Secure set intersection cardinality with appli-
pp. 515–556, 1989. cation to association rule mining,’’ J. Comput. Secur., vol. 13, no. 4,
[81] R. Agrawal and C. Johnson, ‘‘Securing electronic health records without pp. 593–622, 2005.
impeding the flow of information,’’ Int. J. Med. Inform., vol. 76, nos. 5–6, [107] S. Choi, G. Ghinita, and E. Bertino, ‘‘A privacy-enhancing content-
pp. 471–479, 2007. based publish/subscribe system using scalar product preserving trans-
[82] L. Wang, S. Jajodia, and D. Wijesekera, ‘‘Securing OLAP data formations,’’ in Proc. Int. Conf. Database Expert Syst. Appl., 2010,
cubes against privacy breaches,’’ in Proc. IEEE Symp. Secur. Privacy, pp. 368–384.
May 2004, pp. 161–175. [108] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, ‘‘Privacy-preserving multi-
[83] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. (2016). ‘‘Membership keyword ranked search over encrypted cloud data,’’ IEEE Trans. Parallel
inference attacks against machine learning models.’’ [Online]. Available: Distrib. Syst., vol. 25, no. 1, pp. 222–233, Jan. 2014.
https://arxiv.org/abs/1610.05820 [109] M. J. Freedman, K. Nissim, and B. Pinkas, ‘‘Efficient private matching
[84] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and and set intersection,’’ in Proc. Int. Conf. Theory Appl. Cryptogr. Techn.,
G. Felici, ‘‘Hacking smart machines with smarter ones: How to extract 2004, pp. 1–19.
meaningful data from machine learning classifiers,’’ Int. J. Secur. Netw.,
[110] J. Li, Z. Zhang, and W. Zhang, ‘‘MobiTrust: Trust management system
vol. 10, no. 3, pp. 137–150, 2015.
in mobile social computing,’’ in Proc. IEEE 10th Int. Conf. Comput. Inf.
[85] O. Goldreich, ‘‘Secure multi-party computation,’’ Tech. Rep., 1998,
Technol. (CIT), Jun./Jul. 2010, pp. 954–959.
pp. 86–97.
[111] J. Vaidya and C. Clifton, ‘‘Privacy-preserving decision trees over ver-
[86] Y. Lindell and B. Pinkas, ‘‘Secure multiparty computation for privacy-
tically partitioned data,’’ in Proc. IFIP Annu. Conf. Data Appl. Secur.
preserving data mining,’’ J. Privacy Confidentiality, vol. 1, no. 1,
Privacy, 2005, pp. 139–152.
pp. 59–98, 2009.
[87] S. Even, O. Goldreich, and A. Lempel, ‘‘A randomized protocol for [112] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikäinen, ‘‘On private scalar
signing contracts,’’ Commun. ACM, vol. 28, no. 6, pp. 637–647, 1985. product computation for privacy-preserving data mining,’’ in Information
[88] M. Naor and B. Pinkas, ‘‘Efficient oblivious transfer protocols,’’ in Proc. Security and Cryptology, vol. 3506. Berlin, Germany: Springer-Verlag,
12th Annu. ACM-SIAM Symp. Discrete Algorithms, 2001, pp. 448–457. 2004, pp. 104–120.
[89] R. L. Rivest, L. Adleman, and M. L. Dertouzos, ‘‘On data banks [113] M. J. Atallah and W. Du, ‘‘Secure multi-party computational geometry,’’
and privacy homomorphisms,’’ Found. Secure Comput., vol. 4, no. 11, in Proc. Workshop Algorithms Data Struct., 2001, pp. 165–179.
pp. 169–180, 1978. [114] L. Kissner and D. Song, ‘‘Privacy-preserving set operations,’’ in Proc.
[90] C. Gentry, ‘‘A fully homomorphic encryption scheme,’’ Annu. Int. Cryptol. Conf., 2005, pp. 241–257.
Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ., Stanford, [115] S. R. Oliveira and O. R. Zaiane, ‘‘Privacy preserving frequent itemset
CA, USA, 2009. mining,’’ in Proc. IEEE Int. Conf. Privacy, Secur. Data Mining, vol. 14.
[91] V. Vaikuntanathan, ‘‘Computing blindfolded: New developments in fully Dec. 2002, pp. 43–54.
homomorphic encryption,’’ in Proc. IEEE 52nd Annu. Symp. Found. [116] P. Samarati, ‘‘Protecting respondents identities in microdata release,’’
Comput. Sci. (FOCS), Oct. 2011, pp. 5–16. IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010–1027, Nov. 2001.
[92] F. Armknecht et al., ‘‘A guide to fully homomorphic encryption,’’ IACR [117] D. Agrawal and C. C. Aggarwal, ‘‘On the design and quantification of pri-
Cryptol. ePrint Arch., vol. 2015, p. 1192, 2015. vacy preserving data mining algorithms,’’ in Proc. 20th ACM SIGMOD-
[93] J. R. Troncoso-Pastoriza, S. Katzenbeisser, and M. Celik, ‘‘Privacy pre- SIGACT-SIGART Symp. Principles Database Syst., 2001, pp. 247–255.
serving error resilient DNA searching through oblivious automata,’’ in [118] V. S. Iyengar, ‘‘Transforming data to satisfy privacy constraints,’’ in Proc.
Proc. 14th ACM Conf. Comput. Commun. Secur., 2007, pp. 519–528. 8th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2002,
[94] S.-W. Chen et al., ‘‘Confidentiality protection of digital health records in pp. 279–288.
cloud computing,’’ J. Med. Syst., vol. 40, no. 5, p. 124, 2016. [119] S. R. M. Oliveira and O. R. Za, ‘‘Privacy preserving clustering by data
[95] J. Vincent, W. Pan, and G. Coatrieux, ‘‘Privacy protection and security transformation,’’ J. Inf. Data Manage., vol. 1, no. 1, pp. 37–52, Feb. 2010.
in ehealth cloud platform for medical image sharing,’’ in Proc. IEEE [120] R. J. Bayardo and R. Agrawal, ‘‘Data privacy through optimal
2nd Int. Conf. Adv. Technol. Signal Image Process. (ATSIP), Mar. 2016, k-anonymization,’’ in Proc. IEEE 21st Int. Conf. Data Eng. (ICDE),
pp. 93–96. Apr. 2005, pp. 217–228.
[96] M. S. Kiraz, Z. A. Genç, and S. Kardas, ‘‘Security and efficiency anal- [121] W. M. Rand, ‘‘Objective criteria for the evaluation of clustering methods,’’
ysis of the Hamming distance computation protocol based on oblivious J. Amer. Statist. Assoc., vol. 66, no. 336, pp. 846–850, 1971.
transfer,’’ Secur. Commun. Netw., vol. 8, no. 18, pp. 4123–4135, 2015. [122] C. J. van Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA:
[97] J. Zhou, Z. Cao, X. Dong, and X. Lin, ‘‘PPDM: A privacy-preserving Butterworth-Heinemann, 1979.
protocol for cloud-assisted e-healthcare systems,’’ IEEE J. Sel. Topics
[123] A. B. Bondi, ‘‘Characteristics of scalability and their impact on perfor-
Signal Process., vol. 9, no. 7, pp. 1332–1344, Oct. 2015.
mance,’’ in Proc. 2nd Int. Workshop Softw. Perform., 2000, pp. 195–203.
[98] P. J. McLaren et al., ‘‘Privacy-preserving genomic testing in the clinic:
[124] M. Naveed et al., ‘‘Privacy in the genomic era,’’ ACM Comput. Surv.,
A model using HIV treatment,’’ Genet. Med., vol. 18, no. 8, pp. 814–822,
vol. 48, no. 1, p. 6, 2015.
2016.
[99] X. Yi, F.-Y. Rao, E. Bertino, and A. Bouguettaya, ‘‘Privacy-preserving [125] N. Li, N. Zhang, S. K. Das, and B. Thuraisingham, ‘‘Privacy preservation
association rule mining in cloud computing,’’ in Proc. 10th ACM Symp. in wireless sensor networks: A state-of-the-art survey,’’ Ad Hoc Netw.,
Inf., Comput. Commun. Secur., 2015, pp. 439–450. vol. 7, no. 8, pp. 1501–1514, 2009.
[100] G. Taban and V. D. Gligor, ‘‘Privacy-preserving integrity-assured data [126] J. Krumm, ‘‘A survey of computational location privacy,’’ Pers. Ubiqui-
aggregation in sensor networks,’’ in Proc. IEEE Int. Conf. Comput. Sci. tous Comput., vol. 13, no. 6, pp. 391–399, 2009.
Eng. (CSE), vol. 3. Aug. 2009, pp. 168–175. [127] P. Mell and T. Grance, ‘‘The NIST definition of cloud computing,’’ Nat.
[101] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, ‘‘Tools for Inst. Standards Technol., vol. 53, no. 6, p. 50, 2009.
privacy preserving distributed data mining,’’ ACM SIGKDD Explorations [128] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and S. Guo, ‘‘Protec-
Newslett., vol. 4, no. 2, pp. 28–34, 2002. tion of big data privacy,’’ IEEE Access, vol. 4, pp. 1821–1834, 2016.
[102] R. Sheikh, B. Kumar, and D. K. Mishra. (2010). ‘‘A distributed k-secure [129] A. Abbas and S. U. Khan, ‘‘A review on the state-of-the-art privacy-
sum protocol for secure multi-party computations.’’ [Online]. Available: preserving approaches in the e-health clouds,’’ IEEE J. Biomed. Health
https://arxiv.org/abs/1003.4071 Inform., vol. 18, no. 4, pp. 1431–1441, Jul. 2014.
[103] R. Sheikh and D. K. Mishra, ‘‘Secure sum computation for insecure [130] N. Homer et al., ‘‘Resolving individuals contributing trace amounts of
networks,’’ in Proc. 2nd Int. Conf. Inf. Commun. Technol. Competitive dna to highly complex mixtures using high-density SNP genotyping
Strategies, 2016, p. 102. microarrays,’’ PLoS Genet., vol. 4, no. 8, p. e1000167, 2008.
[131] I. F. Akyildiz and I. H. Kasimoglu, ‘‘Wireless sensor and actor net- RICARDO MENDES received the master’s degree
works: Research challenges,’’ Ad Hoc Netw., vol. 2, no. 4, pp. 351–367, in electrical and computer engineering from the
Oct. 2004. University of Coimbra, Portugal, in 2016, where
[132] W. He, X. Liu, H. Nguyen, K. Nahrstedt, and T. Abdelzaher, ‘‘PDA: he is currently pursuing the Ph.D. degree and
Privacy-preserving data aggregation in wireless sensor networks,’’ in is a Researcher with the Center for Informatics
Proc. 26th IEEE Int. Conf. Comput. Commun. (INFOCOM), May 2007, and Systems of the University of Coimbra. His
pp. 2045–2053. research interests include privacy in information
[133] B. Carbunar, Y. Yu, W. Shi, M. Pearce, and V. Vasudevan, ‘‘Query privacy
systems and data mining.
in wireless sensor networks,’’ ACM Trans. Sensor Netw., vol. 6, no. 2,
p. 14, 2010.
[134] B. Jiang and X. Yao, ‘‘Location-based services and GIS in perspective,’’
Comput., Environ. Urban Syst., vol. 30, no. 6, pp. 712–725, 2006.
[135] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel,
‘‘Unique in the crowd: The privacy bounds of human mobility,’’ Sci. Rep.,
vol. 3, p. 1376, Mar. 2013.
[136] J. Meyerowitz and R. R. Choudhury, ‘‘Hiding stars with fireworks: Loca-
tion privacy through camouflage,’’ in Proc. 15th Annu. Int. Conf. Mobile
Comput. Netw., 2009, pp. 345–356.
[137] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K.-L. Tan, ‘‘Pri-
vate queries in location based services: Anonymizers are not necessary,’’
in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 121–132.
[138] J. Lim, H. Yu, K. Kim, M. Kim, and S.-B. Lee, ‘‘Preserving location JOÃO VILELA received the Ph.D. in computer
privacy of connected vehicles with highly accurate location updates,’’ science from the University of Porto, Portugal,
IEEE Commun. Lett., vol. 21, no. 3, pp. 540–543, Mar. 2017. in 2011. He is currently an Assistant Professor
[139] A. P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, and D. Wagner,
with the Department of Informatics Engineering,
‘‘Android permissions: User attention, comprehension, and behavior,’’ in
University of Coimbra, Portugal. He was a Visiting
Proc. ACM 8th Symp. Usable Privacy Secur., 2012, p. 3.
[140] C. Perera, A. Zaslavsky, P. Christen, and D. Georgakopoulos, ‘‘Context Researcher with the Georgia Institute of Technol-
aware computing for the Internet of Things: A survey,’’ IEEE Commun. ogy, where he was involved in physical-layer secu-
Surveys Tuts., vol. 16, no. 1, pp. 414–454, 1st Quart., 2014. rity and the Massachusetts Institute of Technology,
[141] F. Schaub, B. Könings, and M. Weber, ‘‘Context-adaptive privacy: Lever- where he was involved in security for network cod-
aging context awareness to support privacy decision making,’’ IEEE ing. In recent years, he has been a Coordinator and
Pervasive Comput., vol. 14, no. 1, pp. 34–43, Jan./Mar. 2015. Team Member of several national, bilateral, and European-funded projects
[142] A. Narayanan and V. Shmatikov, ‘‘De-anonymizing social networks,’’ in in security and privacy of computer and communication systems, with a
Proc. 30th IEEE Symp. Secur. Privacy, May 2009, pp. 173–187. focus on wireless networks, mobile devices, and cloud computing. Other
[143] B. Zhou, J. Pei, and W. S. Luk, ‘‘A brief survey on anonymization research interests include anticipatory networks and intelligent transportation
techniques for privacy preserving publishing of social network data,’’ systems.
ACM SIGKDD Explorations Newslett., vol. 10, no. 2, pp. 12–22, 2008.