In order to evaluate the effectiveness of privacy protection approaches, the degree of privacy protection achieved, as well as the residual data utility after data modification, should be quantified. The former task can be achieved through specific privacy metrics, whereas the latter can be expressed in terms of reduction of traditional performance metrics such as accuracy or Equal Error Rate (EER).
User-sensitive data acquired through mobile interaction is very heterogeneous and can be structured, as in the case of high-level health, network, location, and application data, or unstructured, as in the case of motion, position, environmental, touchscreen, and low-level health data. Consequently, different metrics are required depending on the specific application scenario. In this context, we will consider data that have undergone modifications aimed at suppressing or altering specific sensitive attributes while retaining utility for the analysis and extraction of non-sensitive information.
Anonymity-based Metrics. These metrics stem from the idea of k-anonymity [106], defined as the property of a released dataset ensuring that, based on an individual's disclosed information, it is not possible to distinguish that individual from at least \( k-1 \) other individuals whose information has also been disclosed. This is achieved by grouping subject data into equivalence classes of at least k individuals that are indistinguishable with respect to their sensitive attributes. k-Anonymity is independent of the information extraction technique and quantifies the degree of privacy exclusively from the disclosed data. It is useful to express the degree of similarity between two datasets, namely the original and the sanitized one, or it can be applied to samples within a single dataset. However, several studies have reported limitations of k-anonymity, which have led to the development of new metrics based on the original definition, aiming to overcome some of its issues by imposing additional requirements. For instance, m-invariance [181] modifies k-anonymity to allow for multiple, different releases of the same dataset. \( (\alpha, k) \)-Anonymity [179] imposes a predetermined maximum occurrence frequency for sensitive attributes within a class to protect against attribute disclosure. \( \ell \)-Diversity [13] was developed to prevent linkage attacks by specifying the minimum diversity of sensitive information within an equivalence class, namely at least \( \ell \) well-represented different sensitive values. For skewed distributions of sensitive attributes, t-closeness [123] and stochastic t-closeness [64] were introduced, starting from the idea that the distribution of sensitive values in any equivalence class must be close to their distribution in the entire dataset; consequently, knowledge of the original distribution is needed to compute these metrics. Similarly, starting from the original data distribution, (c, t)-isolation [51] indicates the number of data samples present in the proximity of a sample predicted from the transformed data. When a semantic distance between sensitive user records exists, such as in the case of numerical values, \( (k, e) \)-anonymity [187] requires the range of sensitive attributes in any equivalence class to be greater than a predetermined safe value. Despite the highlighted shortcomings, k-anonymity and the derived metrics are still widely employed today in a broad variety of privacy contexts, but mainly for low-dimensional structured data [21]; it has in fact been shown that k-anonymity-based properties do not guarantee a high degree of protection for high-dimensional data.
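As a minimal illustration, k-anonymity can be checked by grouping the released records on their quasi-identifiers and taking the size of the smallest equivalence class. The sketch below assumes the records are available as Python dictionaries; the field names and toy records are invented for the example.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the given quasi-identifiers.

    A release is k-anonymous for the largest k such that every combination
    of quasi-identifier values is shared by at least k records.
    """
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values()) if classes else 0

# Illustrative toy release: three records share (age band, ZIP prefix), one does not.
released = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "cold"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "asthma"},
]
print(k_anonymity(released, ["age", "zip"]))  # -> 1: the last record is unique
```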
Differential Privacy-based Metrics. Differential privacy owes its popularity to its strong privacy statement, according to which a data subject will not be affected, adversely or otherwise, by allowing their data to be used in any study or analysis, no matter what other studies, datasets, or information sources are available [43]. As discussed in Section 6, differential privacy is generally achieved by adding noise to the original data; therefore, quantifying differential privacy as a property of the data indicating the degree of privacy requires knowledge of the original data. Differential privacy was defined in the context of databases to achieve indistinguishability between query outcomes, but thanks to its generality it has found application in different contexts for low-dimensional data, including biometrics and machine learning systems. It is based on the requirement that, independently of the presence of any particular data subject, the probability of any particular sequence of query responses is controlled by a parameter, \( \epsilon \), which can be chosen after balancing the privacy-accuracy tradeoff inherent to the system. For a given computational task and a given value of \( \epsilon \), there can be several differentially private algorithms with different accuracy performance. As in the case of k-anonymity, many metrics have originated from the initial definition of differential privacy, including approximate differential privacy, which provides less strict privacy guarantees but is able to retain higher utility [66]. d-\( \chi \)-Privacy [49] allows measures for the distance between datasets other than the Hamming distance used in the definition of differential privacy. Joint differential privacy [97] applies to systems where a data subject can be granted access to their own private data but not to others'. In the context of location privacy, geo-indistinguishability [28] is achieved by adding differential privacy-compliant noise to a geographical location within a determined distance. In contrast to the previously described metrics based on differential privacy, computational differential privacy [120] adopts a weaker adversary model, favoring accuracy; adopting it requires knowledge of the posterior data distribution reconstructed from the transformed data. Similarly, information privacy [65] is met if the probability distribution of inferring sensitive data does not change due to any query output.
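Formally, a randomized mechanism \( \mathcal{M} \) satisfies \( \epsilon \)-differential privacy if, for any two datasets \( D \) and \( D^{\prime} \) differing in a single record and any set of outputs \( S \), \( \Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D^{\prime}) \in S] \). A common way to satisfy this requirement for numeric queries is the Laplace mechanism; the sketch below (function and parameter names are ours, not taken from the cited works) illustrates how the choice of \( \epsilon \) governs the privacy-accuracy tradeoff for a simple counting query.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise of scale sensitivity/epsilon,
    which satisfies epsilon-differential privacy for counting queries."""
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Illustrative sketch: smaller epsilon -> more noise -> stronger privacy, lower accuracy.
rng = np.random.default_rng(0)
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: noisy count = {laplace_count(1000, eps, rng=rng):.1f}")
```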
Entropy-based Metrics. In information theory, entropy describes the degree of uncertainty associated with the outcome of a random variable [154]. Metrics based on entropy are generally computed from the distribution of the real data estimated from the sanitized data, even though additional information may be needed for a particular metric, such as the original data or some of the data transformation parameters. When attempting to estimate sensitive information from protected user data, high uncertainty generally correlates with high privacy; nonetheless, a correct guess based on uncertain information can still occur. In [118], the degree of privacy protection is quantified by the cross-entropy (also referred to as likelihood) between the estimated and the true data distribution in the case of clustered data derived from the original data. A cumulative formulation of entropy was defined in [92] in the context of location privacy to measure how much entropy can be gathered on a route through a series of independent zones. Inherent privacy [22] is another example of a metric derived from the definition of entropy, considering the number of possible different outcomes given a number of binary guesses. Mutual information and conditional privacy loss [22, 110] are also entropy-based metrics. The former measures the quantity of information common to two random variables and can be computed as the difference between the entropy and the conditional entropy (also known as equivocation), i.e., the amount of information needed to describe a random variable given knowledge of another variable belonging to the same dataset. The latter is built on similar premises, but it considers the ratio between the true data distribution and the amount of information provided by another revealed variable.
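Both quantities can be estimated directly from paired samples of the true and released attributes, using the equivalent formulation \( I(X;Y) = H(X) + H(Y) - H(X,Y) \), which coincides with the difference between entropy and conditional entropy mentioned above. A minimal sketch follows; the zone labels and samples are hypothetical.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy H(X) in bits, estimated from observed samples."""
    counts, n = Counter(samples), len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) = H(X) + H(Y) - H(X, Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical example: true home zone vs. zone inferred from sanitized location data.
true_zone     = ["A", "A", "B", "B", "C", "C", "A", "B"]
inferred_zone = ["A", "A", "B", "A", "C", "C", "A", "B"]
print(round(mutual_information(true_zone, inferred_zone), 3))  # lower is more private
```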
Success Probability-based Metrics. Metrics in this category do not take into account properties of the data but only the outcome of sensitive information extraction attempts, as low success probabilities indicate high privacy. However, even if this trend is observable over the entire dataset, single users' private data could still be compromised. In [70], based on the original and estimated data, a privacy breach is defined as the event in which the reconstructed probability of an attribute, given its true probability, exceeds a fixed threshold, whereas in [139] this idea was extended by (d, \( \gamma \))-privacy, in which additional bounds are introduced on the ratio between the true and reconstructed probabilities. In contrast, \( \delta \)-presence [128] evaluates the probability of inferring that an individual is part of some published data, assuming that an external database containing all individuals in the published data is available. Hiding Failure (HF) [8] is a data similarity metric quantifying how well sensitive patterns are hidden: it is computed as the ratio between the sensitive patterns found in the sanitized dataset and those found in the original dataset, so an HF equal to zero means all sensitive patterns are successfully hidden.
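Since HF is a simple ratio, it can be computed directly once the sensitive patterns mined from the original and sanitized datasets are available; the sketch below uses hypothetical itemsets purely for illustration.

```python
def hiding_failure(found_in_sanitized, found_in_original):
    """HF = fraction of the sensitive patterns discoverable in the original
    data that can still be discovered in the sanitized data (0 = all hidden)."""
    found_in_original = set(found_in_original)
    if not found_in_original:
        return 0.0
    return len(set(found_in_sanitized) & found_in_original) / len(found_in_original)

# Hypothetical sensitive patterns mined from both versions of a dataset.
orig_patterns = {("night", "hospital"), ("weekend", "bar"), ("morning", "clinic")}
sani_patterns = {("weekend", "bar")}
print(hiding_failure(sani_patterns, orig_patterns))  # 1/3: two of three patterns hidden
```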
Error-based Metrics. These metrics measure the effectiveness of the sensitive information extraction process, for example through the distance between the original data and its estimate; small estimation errors generally indicate a lack of privacy. In location privacy, the expected estimation error measures the correctness of the inference by computing the expected distance between the true location and the estimated location according to a distance metric, such as the Euclidean distance [155]. Furthermore, with particular regard to high-dimensional, unstructured data such as those acquired by mobile background sensors or images, a simple but common approach to quantify privacy consists of comparing the traditional performance metrics of sensitive attribute extraction methods (e.g., accuracy) before and after the data modification process. A significant performance drop is a valid indicator of the effectiveness of a data modification technique.
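As an example, the expected estimation error over a trajectory can be approximated as the (optionally weighted) average Euclidean distance between the true and the adversary-estimated positions; the coordinates and weights below are hypothetical.

```python
import math

def expected_estimation_error(true_points, estimated_points, weights=None):
    """Weighted average Euclidean distance between true and estimated locations;
    larger values indicate a less successful inference and hence more privacy."""
    if weights is None:
        weights = [1.0 / len(true_points)] * len(true_points)
    return sum(w * math.dist(t, e)
               for w, t, e in zip(weights, true_points, estimated_points))

# Hypothetical trajectory (metres in a local coordinate frame).
true_locs = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
est_locs  = [(2.0, 1.0), (9.0, 3.0), (15.0, 10.0)]
print(round(expected_estimation_error(true_locs, est_locs), 2))
```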
Accuracy-based Metrics. These metrics quantify the accuracy of the inference mechanism, as inaccurate estimates typically indicate higher privacy. The confidence interval width indicates the amount of privacy given the estimated interval in which the true outcome lies [24]; it is expressed in percentage terms for a certain confidence level. (t, \( \delta \)) privacy violation [95] indicates whether the release of a classifier for public data constitutes a privacy threat, depending on how many training samples are available to the adversary algorithm. Training samples link public data to sensitive data for some individuals, and privacy is violated when it is possible to infer sensitive information from public data for individuals who are not in the training samples. In location privacy, the size of the uncertainty region denotes the minimal size of the region to which it is possible to narrow down the position of a target user, while the coverage of the sensitive region evaluates how much a user's sensitive regions overlap with the uncertainty region [56]. A different approach was proposed in [32], in which data subjects are given the possibility of customizing the accuracy of the region they are in when submitting it to an Internet service; the accuracy of the obfuscated region can therefore be seen as an indicator of privacy.
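One possible way to estimate the confidence interval width empirically, assuming repeated adversarial estimates of a numeric attribute are available (the exact formulation in [24] may differ), is to take the central interval of the estimates for the chosen confidence level and normalize its width by the attribute's domain range; all values below are hypothetical.

```python
import numpy as np

def ci_width_percent(estimates, domain_min, domain_max, confidence=0.95):
    """Width of the central `confidence` interval of the adversary's estimates,
    expressed as a percentage of the attribute's domain range."""
    alpha = 1.0 - confidence
    lo = np.quantile(estimates, alpha / 2.0)
    hi = np.quantile(estimates, 1.0 - alpha / 2.0)
    return 100.0 * (hi - lo) / (domain_max - domain_min)

# Hypothetical adversary guesses of a perturbed age attribute with domain 18-90.
rng = np.random.default_rng(1)
guesses = rng.normal(loc=35.0, scale=8.0, size=1000)
print(f"{ci_width_percent(guesses, 18, 90):.1f}% of the domain")  # wider = more private
```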
Time-based Metrics. Time-based metrics measure the time that elapses before sensitive information can be extracted. For instance, in location tracking, to evaluate a given privacy protection method it can be useful to measure how long privacy can be breached by successfully tracking the user, by computing the maximum tracking time [148] or the mean time to confusion [82].
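As a sketch of how such metrics can be computed, assume a per-time-step trace of the adversary's uncertainty is available (e.g., the entropy of its belief over candidate locations): the lengths of the intervals during which this uncertainty stays below a confusion threshold yield both the maximum tracking time and the mean time to confusion. The trace, threshold, and function names below are hypothetical.

```python
def tracking_runs(entropy_trace, threshold):
    """Lengths (in time steps) of intervals during which the adversary's
    uncertainty stays below the threshold, i.e. the user remains tracked."""
    runs, current = [], 0
    for h in entropy_trace:
        if h < threshold:        # adversary still confident: tracking continues
            current += 1
        elif current:            # confusion reached: close the current run
            runs.append(current)
            current = 0
    if current:
        runs.append(current)     # trailing run truncated by the end of the trace
    return runs

# Hypothetical per-time-step entropy of the tracker's belief (bits).
trace = [0.1, 0.2, 0.3, 1.5, 0.2, 0.1, 0.4, 2.0, 0.3]
runs = tracking_runs(trace, threshold=1.0)
print(max(runs), sum(runs) / len(runs))  # maximum tracking time, mean time to confusion
```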