
A Vulnerability Assessment Framework for Privacy-preserving Record Linkage

Published: 27 June 2023
Abstract

    The linkage of records to identify common entities across multiple data sources has gained increasing interest over the last few decades. In the absence of unique entity identifiers, quasi-identifying attributes such as personal names and addresses are generally used to link records. Due to privacy concerns that arise when such sensitive information is used, privacy-preserving record linkage (PPRL) methods have been proposed to link records without revealing any sensitive or confidential information about these records. Popular PPRL methods such as Bloom filter encoding, however, are known to be susceptible to various privacy attacks. Therefore, a systematic analysis of the privacy risks associated with sensitive databases as well as PPRL methods used in linkage projects is of great importance. In this article we present a novel framework to assess the vulnerabilities of sensitive databases and existing PPRL encoding methods. We discuss five types of vulnerabilities: frequency, length, co-occurrence, similarity, and similarity neighborhood, of both plaintext and encoded values that an adversary can exploit in order to reidentify sensitive plaintext values from encoded data. In an experimental evaluation we assess the vulnerabilities of two databases using five existing PPRL encoding methods. This evaluation shows that our proposed framework can be used in real-world linkage applications to assess the vulnerabilities associated with sensitive databases to be linked, as well as with PPRL encoding methods.

    1 Introduction

    The linking of records across databases has seen increasing interest over the years in domains ranging from national census and health care to crime and fraud detection [8]. Record linkage can help improve data quality and facilitate advanced data mining on disparate data sources [7]. Because there are generally no unique entity identifiers available across the databases to be linked, record linkage is generally based on quasi-identifying (QID) attribute values of individuals, such as their names, addresses, and dates of birth [65].
    However, the use of such sensitive personal information often leads to ethical and legal concerns associated with privacy [5, 65]. Privacy-preserving record linkage (PPRL) [8, 25] seeks to develop techniques that allow the linkage of databases without compromising the privacy of the entities whose records are being linked. PPRL aims to either encode or encrypt sensitive data in a way that protects the privacy of entities while still allowing accurate linkage of records. Some PPRL techniques, however, including the popular Bloom filter encoding [50], have been shown to be susceptible to certain privacy attacks [9, 35, 41, 67, 68].
    A systematic vulnerability assessment is an important step toward understanding the privacy risks associated with a sensitive database or an encoding technique that is to be used in a PPRL project. Identifying and quantifying vulnerabilities that an adversary might exploit will help database owners (DOs) to take the necessary precautions to ensure their sensitive databases will have no (or only acceptable) reidentification risks against a possible privacy attack. However, existing methods [20, 63] (as we discuss in Section 2.3) to measure the privacy risks associated with sensitive databases and encoding techniques in the context of PPRL are ad hoc and do not formally assess the vulnerabilities of sensitive data.
    Contributions: In this article, we propose a vulnerability assessment framework that allows a DO to quantify how vulnerable a sensitive database is, as well as how vulnerable an encoded version of that database will be after a certain PPRL encoding technique has been applied. We analyze the vulnerability of values in a database using five aspects: the frequency of occurrences of a value, the length of a value (number of characters of a string), the co-occurrence of two or more values, the similarity between a pair of values, and the similarity neighborhood of a value. Such a vulnerability assessment framework will allow DOs to establish necessary measures to either modify their databases (e.g., adding random noise [21]), select an alternative PPRL technique that provides enough privacy guarantees against any possible reidentification of an entity, or try to strengthen the privacy guarantees of an existing PPRL technique. In an experimental evaluation, we assess the vulnerabilities of two databases using five existing PPRL encoding techniques. This evaluation shows that our proposed framework can be used in real-world linkage situations to assess the vulnerabilities associated with sensitive databases to be linked, as well as with PPRL encoding techniques.
    Outline: In Section 2 we first provide an introduction to PPRL where we describe existing PPRL techniques, PPRL disclosure risk measures, adversary models, and existing attacks on PPRL. In Sections 3 to 5 we then discuss our proposed vulnerability framework in detail. In Section 6 we describe the applicability of the discussed vulnerabilities on plaintext values and different encoding techniques, and in Section 7 we discuss the impact of the presence or absence of these vulnerabilities on existing PPRL attacks. In Section 8, using two databases, we illustrate and discuss how our proposed framework can be used to assess the vulnerabilities of these databases, as well as of five PPRL encoding methods applied on them. We conclude our work in Section 9 with a summary and outlook to future work. The notation used throughout the article is shown in Table 1.
    Table 1.
    \(\mathbf {D}^s\) , \(\mathbf {D}^e\) Sensitive and encoded databases
    \(\mathbf {Q}\) List of quasi-identifying (QID) attribute values in \(\mathbf {D}^s\)
    \(r\) A record in \(\mathbf {D}^s\)
    \(\mathcal {E}\) Real-world entities represented by the records in \(\mathbf {D}^s\)
    \(\mathbf {T}\) List of tokens in \(\mathbf {D}^s\)
    \(\mathbf {S}\) List of token substrings in \(\mathbf {D}^s\)
    \(\mathbf {V}\) Set of all the unique plaintext values in \(\mathbf {D}^s\)
    \(\mathbf {E}\) List of encoded QID attribute values in \(\mathbf {D}^e\)
    \(\mathbf {H}\) Set of all the unique encodings in \(\mathbf {D}^e\)
    \(\mathbf {M}\) Microdata in the sensitive database \(\mathbf {D}^s\)
    \(s\) Secret key used in the encoding process
    \(encode()\) Encoding function
    \(attack()\) Attack function
    \(\mathbf {p}\) Set of parameter values used by the encoding function \(encode()\)
    \(\psi\) Maximum difference between two values
    \(k\) Minimum number of similar values in a set
    Table 1. Common Notation Used Throughout the Article

    2 Background On Privacy-preserving Record Linkage

    In this section, we first briefly discuss the PPRL process and different techniques that have been proposed to encode sensitive data in PPRL. We then outline existing disclosure risk measures proposed in the context of PPRL, describe the worst-case attack scenario for PPRL, and provide a summary of PPRL attack methods that have been developed so far.

    2.1 Overview of the PPRL Process

    We assume a sensitive plaintext database \(\mathbf {D}^s = (\mathbf {Q}, \mathbf {M})\) that has been sampled from an underlying population and consists of a list of QID attributes \(\mathbf {Q}\) and corresponding microdata attributes \(\mathbf {M}\) . Each record \(\mathbf {r}_i \in \mathbf {D}^s\) , denoted as \(\mathbf {r}_i = (\mathbf {q}_i, \mathbf {m}_i)\) , is assumed to refer to a real-world entity, for example, a person, and contains a set of QID values \(\mathbf {q}_i\) (such as a person’s name and address) and a set of sensitive (private or confidential) microdata values \(\mathbf {m}_i\) (such as medical or financial details), where \(1 \le i \le |\mathbf {D}^s|\) . We assume that \(\mathbf {D}^s\) is deduplicated [7], such that each record \(\mathbf {r}_i \in \mathbf {D}^s\) refers to one real-world entity \(\epsilon \in \mathcal {E}\) , where \(\mathcal {E}\) is the set of entities that are represented by the records in \(\mathbf {D}^s\) .
    In the context of PPRL, prior to being used in a linkage, the list of QID value sets \(\mathbf {Q}\) of the sensitive database \(\mathbf {D}^s\) is encoded using an encoding function \(encode()\) , which results in a list of encoded QID sets \(\mathbf {E}\) , where \(|\mathbf {E}| = |\mathbf {Q}|\) . We denote the resulting encoded database as \(\mathbf {D}^e = (\mathbf {E}, \mathbf {M})\) , where the sensitive microdata \(\mathbf {M}\) are not encoded because these are the values of interest to a data analyst. For each record \(\mathbf {r}_i \in \mathbf {D}^s\) , the function \(encode()\) takes as input a tuple consisting of a set of QID values \(\mathbf {q}_i \in \mathbf {Q}\) of that record \(\mathbf {r}_i\) , a set of parameter values \(\mathbf {p}\) , and a secret key s:
    \(\begin{equation} \mathbf {e}_i = encode(\mathbf {q}_i, \mathbf {p}, s), \end{equation}\)
    (1)
    where \(1 \le i \le |\mathbf {E}|\) . The resulting list \(\mathbf {E}\) contains sets of encodings such that \(\mathbf {E} = [\mathbf {e}_i : \mathbf {q}_i \in \mathbf {Q}]\) , where \(1 \le i \le |\mathbf {Q}|\) , and each \(\mathbf {e}_i\) can consist of one or more encodings, as we discuss below.
    The aim of linking two databases is to combine the microdata of the two databases based on the linked pairs of records from the individual databases (which potentially are held by different DOs) [65]. We assume that the generated linked database produced by a PPRL project is secure and does not allow any reidentification of the entities represented by their corresponding microdata values. This can be achieved using different privacy-preserving data publishing [24] and statistical disclosure control [18, 58] techniques such as k-anonymity [56] or differential privacy [21].
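    To make the \(encode()\) interface from Equation (1) concrete, the following sketch shows a toy encoding function. The keyed-HMAC scheme, the parameter names, and the digest truncation are purely illustrative assumptions, not an actual PPRL technique from the literature:

```python
import hashlib
import hmac

def encode(qid_values, params, secret_key):
    """Toy encode(q_i, p, s): hash each QID value with a keyed HMAC so
    that the output depends on the secret key s (here `secret_key`)."""
    digest_size = params.get("digest_size", 8)  # truncate for readability
    encodings = set()
    for q in qid_values:
        h = hmac.new(secret_key.encode(), q.lower().encode(),
                     hashlib.sha256).hexdigest()
        encodings.add(h[:digest_size])
    return encodings

p = {"digest_size": 8}
e = encode(["jean pierre", "miller"], p, secret_key="my secret key")
```

An adversary without \(s\) cannot recompute these encodings directly, which is why the attacks discussed later rely on frequency and similarity information instead.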

    2.2 PPRL Encoding Techniques

    The techniques proposed for PPRL can be divided into two major categories: perturbation- and secure multi-party computation (SMC)-based techniques [65]. Perturbation-based techniques involve a tradeoff between privacy, linkage quality, and scalability when linking large databases, while SMC-based techniques are provably secure at the expense of often high computation and communication costs [25, 65]. Therefore, perturbation-based techniques have been shown to be more practical in real-world linkage projects compared to SMC-based techniques [5, 48]. We now describe five PPRL encoding methods (that can be seen as perturbation-based approaches), which we will use in the experimental study of our vulnerability framework in Section 8.
    Bloom filter (BF) encoding was proposed by Schnell et al. [50] for PPRL because BFs can be used to efficiently calculate approximate similarities between records. A BF [4] \(\mathbf {b}\) is a bit vector of length \(l = |\mathbf {b}|\) where initially all bits are set to 0. Each element \(s\) in a set \(\mathbf {s}\) is hashed into \(\mathbf {b}\) using \(m \gt 1\) hash functions, where each hash function outputs an index value between 0 and \(l - 1\) . These index values are then used to set the corresponding bit positions in \(\mathbf {b}\) to 1. In PPRL, the set \(\mathbf {s}\) is generally the q-grams (substrings of length q characters) generated from one or more QID values from each record in a database, where various methods have been proposed to encode strings [31, 50] as well as numerical values [19, 54, 64]. A recent study by Randall et al. [47] has demonstrated that BF encoding used in real-world linkage scenarios can achieve high linkage quality. However, as we discuss in Section 2.5, BF encoding can be vulnerable to privacy attacks [11, 34, 35, 41, 68].
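    The BF construction described above can be sketched as follows. This is a minimal sketch, not the exact scheme of Schnell et al. [50]: the \(m\) hash functions are simulated by salting a keyed SHA-256, and all parameter values are illustrative:

```python
import hashlib

def qgrams(value, q=2):
    """Extract the set of character q-grams from a string."""
    v = value.lower()
    return {v[i:i + q] for i in range(len(v) - q + 1)}

def bf_encode(value, l=50, m=2, key="secret"):
    """Hash every q-gram of `value` into a bit vector of length l,
    using m hash functions simulated by salting a keyed SHA-256."""
    bf = [0] * l
    for s in qgrams(value):
        for j in range(m):
            h = hashlib.sha256((key + str(j) + s).encode()).hexdigest()
            bf[int(h, 16) % l] = 1  # set the corresponding bit position
    return bf

bf1, bf2 = bf_encode("peter"), bf_encode("pete")
# Approximate similarity between the two encoded values (Dice coefficient)
dice = 2 * sum(a & b for a, b in zip(bf1, bf2)) / (sum(bf1) + sum(bf2))
```

Because similar strings share q-grams and therefore set bits, the bit-vector similarity approximates the plaintext similarity, which is exactly what makes BFs useful for linkage and, as discussed in Section 2.5, exploitable by attacks.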
    A tabulation min-hash (TMH)-based encoding method was proposed by Smith [55] as an alternative to BF encoding to provide improved protection against frequency-based cryptanalysis attacks. Just as in BF encoding, in TMH encoding one bit array \(\mathbf {b}\) of length \(l = |\mathbf {b}|\) is created for each record in a database. Each element in a set \(\mathbf {s}\) (assumed to be q-grams extracted from QID values) is hashed and a fixed length binary value is extracted from the hash value. Using these binary values and l min-hash look-up tables, TMH generates one bit array of length l that encodes all elements in \(\mathbf {s}\) .
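    The following is a loose sketch of this idea, not Smith's actual construction [55]: for brevity it uses a single one-byte tabulation table per output bit position and keeps only the lowest bit of each per-table min-hash value:

```python
import hashlib
import random

def tmh_encode(elements, l=64, seed=42):
    """Simplified tabulation min-hash: one lookup table per output bit
    position; the lowest bit of each per-table min-hash value becomes
    one bit of the final bit array of length l."""
    rnd = random.Random(seed)  # the seed plays the role of the secret key
    tables = [[rnd.getrandbits(32) for _ in range(256)] for _ in range(l)]

    def tab_hash(table, element):
        # XOR table entries selected by the bytes of the element's digest
        h = 0
        for byte in hashlib.sha256(element.encode()).digest()[:8]:
            h ^= table[byte]
        return h

    return [min(tab_hash(t, s) for s in elements) & 1 for t in tables]

bits = tmh_encode({"pe", "et", "te", "er"})
```

Because each output bit depends on a min-hash over all elements rather than on individual elements, the frequency of a single q-gram is much harder to observe than in BF encoding.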
    Two-step hash (2SH) encoding is a method proposed by Ranbaduge et al. [44] to address both the privacy issues of BF encoding and the high computational costs of TMH encoding. Similar to BF encoding, each element in a set \(\mathbf {s}\) is first hashed into m bit vectors of length l using m hash functions. These m bit vectors are considered as a bit matrix with m rows and l columns. Each column of this matrix with at least one 1-bit is then mapped to an integer value in a defined range, resulting in a set of integers that represents an encoded record.
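    The two steps can be sketched as follows (a simplified sketch, not the exact method of Ranbaduge et al. [44]; the hash functions are simulated by salting SHA-256, and the column-to-integer mapping here simply uses the column index, whereas a real implementation would use a keyed mapping into a defined integer range):

```python
import hashlib

def two_step_hash(elements, m=3, l=20, key="secret"):
    """Step 1: hash every element into m bit vectors of length l (one
    per hash function), forming an m x l bit matrix. Step 2: map each
    column containing at least one 1-bit to an integer."""
    matrix = [[0] * l for _ in range(m)]
    for s in elements:
        for j in range(m):  # m hash functions simulated by salting
            h = hashlib.sha256((key + str(j) + s).encode()).hexdigest()
            matrix[j][int(h, 16) % l] = 1
    # The set of non-empty column indices represents the encoded record
    return {c for c in range(l) if any(row[c] for row in matrix)}

enc = two_step_hash({"pe", "et", "te"})
```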
    Multiple match-key (MMK) encoding was proposed for PPRL with the aim of achieving high linkage quality [46]. A match-key is a combination of QID values in a record. If a database contains c QID attributes, then \(2^c - 1\) match-keys can be generated for each record in that database. From all the possible match-keys, only those that provide high linkage quality are selected using match weights calculated based on probabilistic record linkage as proposed by Fellegi and Sunter [46]. The selected match-keys are then encoded using a keyed cryptographic hash function, such as HMAC [3], and a list of hash values is obtained for each record.
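    Generating and encoding all \(2^c - 1\) match-keys can be sketched as follows (hypothetical helper names; a real system would keep only the combinations with high Fellegi-Sunter match weights rather than all of them):

```python
import hashlib
import hmac
from itertools import combinations

def match_keys(record, key="secret"):
    """Generate all 2^c - 1 non-empty combinations of a record's c QID
    values and encode each combination with HMAC-SHA256."""
    qids = sorted(record.items())  # fixed attribute order
    encoded = []
    for n in range(1, len(qids) + 1):
        for combo in combinations(qids, n):
            concat = "|".join(f"{a}={v.lower()}" for a, v in combo)
            encoded.append(hmac.new(key.encode(), concat.encode(),
                                    hashlib.sha256).hexdigest()[:16])
    return encoded

mk = match_keys({"first": "Jean Pierre", "last": "Miller",
                 "city": "Chapel Hill"})  # c = 3, so 2^3 - 1 = 7 match-keys
```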
    The statistical linkage key (SLK) approach, also known as SLK-581 [32], is a special method proposed to encode records in a database using a certain set of characters in their QID values. An SLK for a record is generated by concatenating (1) the second, third, and fifth letters of the last name; (2) the second and third letters of the first name; (3) the full date of birth as “DDMMYYYY”; and (4) the gender as “1” for male, “2” for female, and “9” for unknown. SLKs can then be encoded using a secure one-way hash algorithm. While the SLK method only allows exact matching of records, it has been used extensively in the health domain in Australia [32].
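    The SLK-581 construction described above can be sketched as follows (the padding rule for names that are too short, using the character "2", is an assumption based on common SLK-581 descriptions and is not stated above):

```python
import hashlib

def slk581(last_name, first_name, dob_ddmmyyyy, gender_code):
    """Concatenate letters 2, 3, and 5 of the last name, letters 2 and 3
    of the first name, the date of birth as DDMMYYYY, and the gender
    code ('1' male, '2' female, '9' unknown), then hash the result."""
    def pick(name, positions):
        name = name.lower()
        # assumed padding rule: use '2' when the name is too short
        return "".join(name[p - 1] if p <= len(name) else "2"
                       for p in positions)
    slk = (pick(last_name, [2, 3, 5]) + pick(first_name, [2, 3])
           + dob_ddmmyyyy + gender_code)
    return slk, hashlib.sha256(slk.encode()).hexdigest()

slk, digest = slk581("Miller", "Jean Pierre", "15021887", "1")
# slk is "ileea150218871", matching the example in Table 2
```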

    2.3 PPRL Disclosure Risk Measures

    The risk of disclosure of sensitive data is widely discussed in the area of data publishing [24, 26] and statistical disclosure control [18, 23, 58]. However, limited work has been proposed to calculate disclosure risk in the context of PPRL. Most existing disclosure risk measures focus on the concept of uniqueness [23, 57, 63]. Uniqueness of a record can be defined as a record with a certain set of values that shares those values with only up to n other records, with \(0 \le n \lt k\) (with regard to the k-anonymity model) in a given database [18]. Based on this notion of uniqueness, the disclosure risk of records in a database can be measured [56]. In the following we describe two disclosure risk measures specifically aimed at PPRL. We refer the interested reader to Appendix A for a broader discussion on disclosure risk measures proposed in the area of statistical disclosure control.
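    This uniqueness notion can be illustrated with a small sketch (hypothetical helper name and data; a record is flagged if it shares its QID value combination with only \(n\) other records, where \(0 \le n \lt k\)):

```python
from collections import Counter

def unique_records(records, k=3):
    """Flag records whose QID value combination is shared with only
    n other records, 0 <= n < k (the uniqueness notion above)."""
    counts = Counter(tuple(r) for r in records)
    # counts[t] - 1 is the number n of *other* records with the same values
    return [r for r in records if counts[tuple(r)] - 1 < k]

db = [("jean", "miller")] * 4 + [("anna", "smith")]
risky = unique_records(db, k=3)
# ("anna", "smith") has n = 0 other records and is flagged;
# ("jean", "miller") has n = 3 and is not
```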
    Durham [20] and Karakasidis et al. [30] proposed to use relative information gain (RIG) as a measure of privacy for PPRL. RIG measures information leakage by each encoded value in an encoded database given that an adversary has access to the same plaintext database that has been encoded. RIG is calculated using the entropy of each attribute value in the encoded database [20]. The lower the value of RIG is, the more difficult it will be for an adversary to learn information about the encoded database.
    Vatsalan et al. [63] have proposed a disclosure risk-based privacy measure named probability of suspicion (PS) to calculate the identity disclosure risk [1, 17] in an encoded database. This measure is based on the uniqueness of attribute values in an encoded database with respect to a public database that an adversary has access to. The PS for a given value in the encoded database is calculated as \(1/n\) , where n is the number of values in the adversary’s plaintext database that can be matched with a given encoded value. Based on the PS values, Vatsalan et al. then proposed to calculate the maximum, mean, median, marketer (the proportion of QID values in a database that have a PS of 1), user acceptance mean, and disclosure risks for an encoded database [63].
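    The PS-based summary risks can be sketched as follows (hypothetical helper name; the user acceptance measure is omitted, and the input is the number of plaintext matches \(n\) per encoded value):

```python
import statistics

def ps_risks(match_counts):
    """PS of an encoded value is 1/n, where n is the number of plaintext
    values in the adversary's database matching it; from the PS values,
    summary disclosure risks are derived as in Vatsalan et al. [63]."""
    ps = [1.0 / n for n in match_counts if n > 0]
    return {
        "max": max(ps),
        "mean": statistics.mean(ps),
        "median": statistics.median(ps),
        # Marketer risk: proportion of values with PS = 1 (a unique match)
        "marketer": sum(1 for p in ps if p == 1.0) / len(ps),
    }

risks = ps_risks([1, 2, 4, 1])  # PS values: 1.0, 0.5, 0.25, 1.0
```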
    Even though these existing disclosure risk measures for PPRL are based on the uniqueness of values in a database, the notion of uniqueness is usually seen only in one dimension. That is, if a record (or a QID value) in a database has a unique set of values within that database, an adversary could potentially reidentify that record or QID value (assuming the probability of a successful random assignment is negligible [36]). As we discuss in Section 3, the uniqueness of values needs to be considered across multiple dimensions: the frequency of occurrences of a value, the length of a value (in the number of characters), the co-occurrence of two or more values, the similarity between a pair of values, and the similarity neighborhood of a value.

    2.4 Adversary Model

    The worst-case scenario for a privacy attack in PPRL is where an adversary has access to all the plaintext QID values \(\mathbf {Q}\) from the sensitive database \(\mathbf {D}^s\) , the encoded QID values \(\mathbf {E}\) , and the associated plaintext microdata \(\mathbf {M}\) from the encoded database \(\mathbf {D}^e\) . In such a scenario the adversary also knows the encoding function \(encode()\) and its parameter settings \(\mathbf {p}\) used. The only encoding information the adversary does not possess is the secret key s used in the encoding function \(encode()\) . The adversary also does not have access to the full sensitive database \(\mathbf {D}^s = (\mathbf {Q}, \mathbf {M})\) , and therefore she does not know which set of microdata values \(\mathbf {m}_i \in \mathbf {M}\) refers to which set of QID values \(\mathbf {q}_i \in \mathbf {Q}\) and therefore belongs to which entity \(\epsilon _i \in \mathcal {E}\) .
    In the context of cryptography, such a scenario is referred to as a known plaintext attack [33], where an adversary aims to assign a set of plaintext QID values \(\mathbf {q}_i\) to a set of encodings \(\mathbf {e}_j\) such that the corresponding set of sensitive microdata \(\mathbf {m}_i\) can be identified for that person (entity). By doing so, the adversary is able to learn sensitive information (such as certain illnesses or financial details) of an entity \(\epsilon _i\) that is represented by a record \(r_i \in \mathbf {D}^s\) . Based on the information the adversary has access to, such as frequency distributions of plaintext and encoded QID values, she can try to assign sets of encodings \(\mathbf {e}_j \in \mathbf {E}\) to an entity \(\epsilon _i \in \mathcal {E}\) such that the encodings for QID values \(\mathbf {q}_i\) of that entity \(\epsilon _i\) are \(encode(\mathbf {q}_i, \mathbf {p}, s) = \mathbf {e}_j\) . A privacy attack is successful if the adversary is able to obtain one or more correct such assignments between \(\mathbf {e}_j \in \mathbf {E}\) and \(\epsilon _i \in \mathcal {E}\) .
    However, it is worth noting that a potential assignment between encodings \(\mathbf {e}_j \in \mathbf {E}\) and entities \(\epsilon _i \in \mathcal {E}\) can be randomly correct in an attack scenario. Lenz and Hochgürtel [36] have shown that in databases with small numbers of records there is a non-negligible chance that a stratified random assignment can lead to a successful reidentification. The authors argue that in such scenarios it is unlikely to observe zero reidentifications because of the possibility that some random assignments become correct as the number of unique values will be less than the number of records.
    In practice, this worst-case attack scenario is unlikely to happen as it would require the adversary to have access to almost all information used in the PPRL encoding process (except the secret key, s). This would only be possible if the adversary is an inside attacker [8]. In reality, the adversary will likely have access to less information used in the encoding process or have access to a plaintext database that contains QID values and/or records that cover a somewhat different population than the one covered by \(\mathbf {Q}\) .
    Although this worst-case scenario is unlikely to occur in real-world situations, it allows us to assess the vulnerabilities of the QID values in \(\mathbf {Q}\) , as well as the encodings in \(\mathbf {E}\) , as we discuss in Sections 3 to 6.

    2.5 Existing Attacks on PPRL

    The first privacy attack on PPRL was proposed by Kuzu et al. [35] for BF encoding [50]. Using a frequency-aware constraint satisfaction problem solver, the attack aligns frequent q-grams with matching bit positions. The attack assumes that the adversary has access to a public database from the same domain as the database that is being encoded.
    The second attack, as proposed by Niedermeyer et al. [43], was a manual attack on BF encoding. This attack exploits a weakness of the double hashing method [14] used in the original BF encoding method as proposed by Schnell et al. [50]. The attack focuses on identifying bit patterns in BFs that correspond to individual q-grams (known as atom BFs). Kroll and Steinmetzer [34] later extended this attack to a fully automated cryptanalysis of BFs.
    Christen et al. [9, 11] proposed a cryptanalysis of BF encoding that exploits the weaknesses of the BF construction principle. The attack aligns frequent BFs in an encoded database with frequent QID values in a plaintext database to identify bit positions that individual q-grams can or cannot be hashed into. Using the q-grams assigned to each bit position, the attack then reidentifies the QID values encoded into each BF.
    A cryptanalysis of BF encoding using pattern mining was proposed by Christen et al. [12] and later extended by Vidanage et al. [68]. This is one of the most practical attacks on BF encoding for PPRL because it does not require knowledge of the encoding settings or any frequent BFs in the encoded database. Using maximal frequent pattern mining and a language model, the attack identifies bit positions that certain q-grams can hash into. Plaintext values are then reidentified using those q-grams and their corresponding bit positions.
    A graph-based dictionary attack on BF encoding was proposed by Mitchell et al. [41], where a brute-force method is first used to identify q-grams encoded in each BF. A directed graph is then built for each BF using the identified q-grams to reidentify encoded values. The major weakness of this attack is that it assumes the adversary has complete knowledge of all parameters used in the BF encoding process, including the secret key(s) used in the hashing process, if any are used.
    A similarity-graph-based attack [13] was proposed on a keyed cryptographic hash-function-based PPRL method developed by the UK’s Office for National Statistics that is using similarity tables [61]. The attack builds a directed graph using the similarities between encoded records. Subgraphs built from a plaintext database are then matched to subgraphs from the encoded graph using a graph isomorphism approach to reidentify encoded values from nodes aligned across two such subgraphs.
    Vidanage et al. [67] have proposed a frequency-based privacy attack on the MMK encoding method developed by Randall et al. [46]. The attack aligns frequent plaintext QID value combinations (named plaintext match-keys) with frequent encoded match-keys in the encoded database. A set of statistical correlation measures is then used to compare the frequency distributions of encoded match-keys with the frequency distributions of plaintext match-keys to identify encoded QID values.
    A recently proposed attack on PPRL was based on matching of similarity graphs [66]. The attack first generates two graphs, one each for the plaintext and the encoded database, using the similarities between the values in each of those databases. Next, the attack aligns encoded and plaintext values based on their similarity neighborhoods in the corresponding graphs to reidentify encoded values. This attack was successfully applied to three different PPRL encoding techniques: BF, TMH, and 2SH [66].
    Christen et al. [10] recently proposed an attack on a novel PPRL protocol named Blockchain-based PPRL (BC-PPRL). The protocol assumes the covert adversary model [2] and employs BFs for the encoding of sensitive values. Segments of encoded BFs from each DO are shared among other DOs who participate in the BC-PPRL protocol in an iterative manner. The attack assumes a semi-trusted DO as the adversary where she first identifies q-grams that have been encoded in each BF segment using a brute-force method. Then, using a large set of plaintext values and the identified q-grams for each BF segment, encoded values can be reidentified for each BF.
    While all of these proposed attacks have been successful in reidentifying sensitive values encoded using one or more PPRL encoding techniques, they are all ad hoc methods that exploit specific vulnerabilities of a few PPRL techniques. Our proposed vulnerability assessment framework, which we discuss next, instead provides a systematic approach that allows the identification of vulnerable plaintext values in sensitive databases, as well as of vulnerable values in encoded databases. To the best of our knowledge, this is the first systematic analysis approach that allows the owners of sensitive databases to assess if their sensitive data, and any encodings generated from them, are secure.

    3 Overview of the Vulnerability Assessment Framework

    In the following we assume that each record \(r_i \in \mathbf {D}^s\) has a set of QID values \(\mathbf {q}_i \in \mathbf {Q}\) , such as first and last name, street address, and date of birth. Each of these QID values can contain one or several tokens (words and numbers separated by a whitespace character). Data cleaning and pre-processing [7] might have removed hyphens and dashes, resulting, for example, in a person having several first name tokens (like “Hans Dieter” or “Jean Pierre”). While addresses generally contain several tokens (such as “42 Miller Street”), values in most QIDs, such as names, city, and postcode, often contain only one token. We denote a single token as t, and the set of tokens that represent the record \(r_i \in \mathbf {D}^s\) as \(\mathbf {t}_i\) .
    Tokens \(t \in \mathbf {t}_i\) can then be converted into a set of token substrings \(\mathbf {s}_i\) , which can be used in the encoding function \(encode()\) to encode the record \(r_i\) . A token substring \(s \in \mathbf {s}_i\) is a segment of a token, potentially with a specific length. A common type of token substring used in PPRL encoding techniques (such as BF encoding [50]) is a q-gram [7], which is a character substring of length \(l_q\) . Similar to \(\mathbf {Q}\) , we denote \(\mathbf {T}\) as the list of all token sets in \(\mathbf {D}^s\) , and \(\mathbf {S}\) as the list of all token substring sets in \(\mathbf {D}^s\) , where one token set \(\mathbf {t}_i \in \mathbf {T}\) or token substring set \(\mathbf {s}_i \in \mathbf {S}\) refers to a record \(r_i \in \mathbf {D}^s\) .
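    The decomposition into tokens and token substrings can be sketched as follows (hypothetical helper names; q-grams of length \(l_q = 2\) are used, and short tokens such as "42" are kept whole, matching the example in Table 2):

```python
def tokenise(qid_values):
    """Split QID values into unique lowercase tokens
    (whitespace-separated words and numbers), preserving order."""
    tokens = []
    for q in qid_values:
        for t in q.lower().split():
            if t not in tokens:
                tokens.append(t)
    return tokens

def token_substrings(tokens, lq=2):
    """Extract the set of character q-grams of length lq from each
    token; tokens of length lq or shorter are kept whole."""
    subs = set()
    for t in tokens:
        if len(t) <= lq:
            subs.add(t)
        else:
            subs.update(t[i:i + lq] for i in range(len(t) - lq + 1))
    return subs

q_i = ["Jean Pierre", "Miller", "42 Miller Street", "Chapel Hill"]
t_i = tokenise(q_i)
s_i = token_substrings(t_i)
```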
    In Table 2, we show an example record \(r_i \in \mathbf {D}^s\) and how this record is separated into QID values \(\mathbf {q}_i\) , tokens \(\mathbf {t}_i\) , and token substrings \(\mathbf {s}_i\) . As can be seen, a record can contain the same token in multiple QID values. For instance, the token “Miller” occurs in both the last name and street address values. Similarly, different tokens can contain the same token substrings. For instance, the q-gram “re” occurs in both the tokens “Pierre” and “Street,” as shown in Table 2.
    Table 2.
    Record ( \(r_i\) ): FirstName = Jean Pierre, LastName = Miller, StreetAddress = 42 Miller Street, City = Chapel Hill
    QID values ( \(\mathbf {q}_i\) ): Jean Pierre, Miller, 42 Miller Street, Chapel Hill
    Tokens ( \(\mathbf {t}_i\) ): Jean, Pierre, Miller, 42, Street, Chapel, Hill
    Token substrings ( \(\mathbf {s}_i\) ): an, ap, ch, ea, ee, el, er, et, ha, hi, ie, il, je, le, ll, mi, pe, pi, re, rr, st, tr, 42
    BF encoding [50]: 00101110011011111010001111101111111111100101011010
    TMH encoding [55]: 01000101010110101100111011001101010110100110110101
    MMK encoding [46]: i+sL9RtXd4Jb, avxxlTxIoLx3, Lr1dnWLGM/K8
    2SH encoding [44]: [128, 2, 8, 143, 16, 146, 148, 26, 155, 28, 34, 37, 166, 168, 45]
    SLK encoding [32]: SHA-2(ileea150218871) (assuming date of birth 15.02.1887 and gender male)
    Table 2. An Example Record \(r_i\) in the Sensitive Database \(\mathbf {D}^s\) and How That Record Can Be Converted into a Set of QID Values ( \(\mathbf {q}_i\) ), a Set of Tokens ( \(\mathbf {t}_i\) ), and a Set of Token Substrings ( \(\mathbf {s}_i\) )
    Here q-grams (character substrings of length \(l_q=2\) ) are used as the set of token substrings \(\mathbf {s}_i\) . We also show the encoded values obtained for the record \(r_i\) using five encoding techniques, Bloom filters (BF)[50], tabulation min-hashing (TMH) [55], multiple match-keys (MMK) [46], two-step hashing (2SH) [44], and statistical linkage-key (SLK) [32]. Note that a set of microdata values \(\mathbf {m}_i\) is also part of the record \(r_i\) , but these are not shown here as they are not being encoded.
    Depending on the encoding technique used, the function \(encode()\) uses either QID values, tokens, or token substrings as input to generate the set of encodings \(\mathbf {e}_i\) to represent the record \(r_i\) . For instance, BF encoding [50] uses q-grams (token substrings) in its encoding function, while MMK encoding [46] uses QID value combinations in its encoding function. We assume that for each record \(r_i\) , one of its sets of QID values \(\mathbf {q}_i\) , tokens \(\mathbf {t}_i\) , or token substrings \(\mathbf {s}_i\) , is being encoded in the PPRL process using the function \(encode()\) . In Table 2, we show the output for five encoding techniques for PPRL. For instance, as can be seen, BF encoding outputs a single bit vector (if cryptographic long-term key (CLK) encoding [51] or record-level BF (RBF) encoding [19] is used), whereas MMK encoding outputs a set of hash values.
    In Figure 1, we show the general idea behind an attack on sensitive data in the context of PPRL. In such an attack, the adversary aims to reverse-engineer the encoding process, starting from the list of encoding sets \(\mathbf {E}\) , to first reidentify a set of one or more token substrings \(\mathbf {s}_i \in \mathbf {S}\) , tokens \(\mathbf {t}_i \in \mathbf {T}\) , or QID values \(\mathbf {q}_i \in \mathbf {Q}\) . Each encoding set \(\mathbf {e}_j \in \mathbf {E}\) will then be assigned to zero, one, or more such reidentified token substrings, tokens, or QID values.
    Fig. 1.
    Fig. 1. The overall idea behind a privacy attack on encoded data in the context of PPRL illustrated for a single record in the sensitive database \(\mathbf {D}^s\) . QID values, tokens, or token substrings can be used as input to the encoding function \(encode()\) , which generates encodings for each record \(r_i \in \mathbf {D}^s\) . We illustrate the attack function \(attack()\) by left-pointing arrows.
    The values (token substrings, tokens, or QID values) that are assigned to \(\mathbf {e}_j \in \mathbf {E}\) can potentially lead to attribute disclosure in certain records \(r_i \in \mathbf {D}^s\) . For instance, one or more reidentified token substrings might provide sufficient information to reidentify tokens, and some of these reidentified tokens might lead to the reidentification of QID values. From these reidentified QID values, identity disclosure of entities \(\epsilon _i \in \mathcal {E}\) could be possible depending upon the uniqueness of the values \(q \in \mathbf {q}_i\) as they occur in \(\mathbf {Q}\) . A successful identity disclosure occurs when the adversary can assign a set of encodings \(\mathbf {e}_j \in \mathbf {E}\) to an entity \(\epsilon _i \in \mathcal {E}\) , where for the corresponding QID values \(\mathbf {q}_i\) of that entity \(\epsilon _i\) it holds that \(encode(\mathbf {q}_i, \mathbf {p}, s) = \mathbf {e}_j\) .
    This leads us to the concept of vulnerability. For attribute and/or identity disclosure to be successful, certain QID values \(q \in \mathbf {Q}\) , tokens \(t \in \mathbf {T}\) , token substrings \(s \in \mathbf {S}\) , or encodings \(e \in \mathbf {e}_j\) , where \(\mathbf {e}_j \in \mathbf {E}\) , need to be vulnerable. Since a vulnerable plaintext value can refer to either a QID value \(q \in \mathbf {Q}\) , a token \(t \in \mathbf {T}\) , a token substring \(s \in \mathbf {S}\) , or all of them, without loss of generality, in the following we use the notation v to represent all three types of values q, t, and s. We define a set \(\mathbf {V}\) that consists of all the unique values v in a database. Similarly, we define a set \(\mathbf {H}\) that consists of all the unique encodings \(e \in \mathbf {e}_i\) , where \(\mathbf {e}_i \in \mathbf {E}\) . Each value \(v \in \mathbf {V}\) or encoding \(e \in \mathbf {H}\) has a list of characteristics assigned to it, such as its frequency, co-occurrence with other values or encodings, or length (in characters), as we discuss in the following.
    A value \(v \in \mathbf {V}\) and an encoding \(e \in \mathbf {H}\) become vulnerable and the encoding \(e \in \mathbf {H}\) becomes reidentifiable based on two conditions. Following the concept of k-anonymity [56] (as we discuss in Appendix A), we next define these two conditions. We use two privacy parameters, \(\psi\) and k, where \(\psi\) defines the maximum difference of two values for them to be considered as non-distinguishable from each other, and k defines the minimum number of similar values a set should contain for these values to be considered as vulnerable.
    Definition 3.1.
    \((\psi ,k)\) -Vulnerability
    For a plaintext value \(v_i \in \mathbf {V}\) (or an encoding \(e_i \in \mathbf {H}\) ) and the set \(\mathbf {v}_i = \lbrace v_j : func(v_i, v_j) \le \psi , v_i \ne v_j\rbrace\) of other plaintext values \(v_j \in \mathbf {V}\) or the set \(\mathbf {e}_i = \lbrace e_j : func(e_i, e_j) \le \psi , e_i \ne e_j\rbrace\) of other encodings \(e_j \in \mathbf {H}\) , with a tolerance \(\psi \ge 0\) , we define \(v_i\) (or \(e_i\) ) as \((\psi ,k)\) -vulnerable in the database \(\mathbf {D}^s\) (or \(\mathbf {D}^e\) ) with regard to the function \(func()\) if \(0 \le |\mathbf {v}_i| \lt k\) or if \(0 \le |\mathbf {e}_i| \lt k\) .
    For a plaintext value \(v \in \mathbf {V}\) to become vulnerable, it needs to satisfy the condition in Definition 3.1 with regard to the selected values for \(\psi\) and k. Therefore, we define a vulnerable plaintext value \(v \in \mathbf {V}\) to be one that is unique or rare in the corresponding list \(\mathbf {Q}\) , \(\mathbf {T}\) , or \(\mathbf {S}\) with regard to some characteristics as we describe below. Similarly, encodings \(e \in \mathbf {H}\) also become vulnerable, as well as reidentifiable, based on the following conditions.
    Definition 3.2.
    \((\psi ,k)\) -Assignability
    For an encoding \(e_i \in \mathbf {H}\) that is \((\psi ,k)\) -vulnerable within the set \(\mathbf {H}\) , we define a set \(\mathbf {m}_i = \lbrace v_a : v_a \in \mathbf {V}, func(e_i, v_a) \le \psi \rbrace\) of vulnerable plaintext values \(v_a \in \mathbf {V}\) that can be assigned to the vulnerable encoding \(e_i\) , based on a function \(func()\) and a tolerance \(\psi\) . If \(1 \le |\mathbf {m}_i| \lt k\) , with k and \(\psi \ge 0\) being privacy parameters, then we define the pair of encoding and plaintext value, \((e_i, v_a)\) , as \((\psi ,k)\) -assignable.
    For an encoding \(e \in \mathbf {H}\) to be vulnerable, it needs to satisfy the condition defined in Definition 3.1, where the encoding should be unique or rare in the corresponding list \(\mathbf {E}\) . However, for the encoding \(e \in \mathbf {H}\) to become reidentifiable, it also needs to satisfy the condition defined in Definition 3.2, as we further discuss in Section 5. Therefore, a reidentifiable encoding \(e \in \mathbf {H}\) should not only be vulnerable in the list \(\mathbf {E}\) (which means e is unique within the set of encodings \(\mathbf {H}\) ) but also be assignable to a unique or rare plaintext value \(v \in \mathbf {V}\) (or a small number of plaintext values) with regard to some characteristics as we describe in Section 4.
    In the above two definitions, the function \(func()\) can be any measure that calculates the difference between two plaintext values \(v_i\) and \(v_j\) , two encodings \(e_i\) and \(e_j\) , or an encoding and plaintext value \(e_i\) and \(v_a\) based on different characteristics, as we discuss in detail in the following two sections. For example, \(func()\) can calculate the difference between the frequencies of the two plaintext values \(v_i\) and \(v_j\) , or it can calculate the difference between the lengths of the two encodings \(e_i\) and \(e_j\) . Furthermore, the above two definitions can easily be extended to characterize the vulnerability of a set of plaintext values \(\mathbf {v}_a\) or a set of encodings \(\mathbf {e}_a\) , where we assume \(|\mathbf {v}_a| \ge 2, |\mathbf {e}_a| \ge 2\) , depending on the characteristics of the values that are being analyzed. For instance, \(func()\) can calculate the difference between the similarities of the plaintext value pairs ( \(v_i\) , \(v_j\) ) and ( \(v_m\) , \(v_n\) ).
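To make the two definitions above concrete, the following sketch (our own illustration, not part of the framework's implementation; all function and variable names are hypothetical) checks \((\psi ,k)\)-vulnerability and \((\psi ,k)\)-assignability for an arbitrary difference function \(func()\):

```python
def is_vulnerable(x, population, func, psi, k):
    # Definition 3.1: x is (psi, k)-vulnerable if fewer than k other
    # elements of the population differ from x by at most psi under func().
    close = [y for y in population if y != x and func(x, y) <= psi]
    return len(close) < k

def assignable_values(e, plaintexts, func, psi, k):
    # Definition 3.2: candidate plaintext values for the encoding e; the
    # pair (e, v) is (psi, k)-assignable only if there is at least one
    # and fewer than k such candidates.
    m = [v for v in plaintexts if func(e, v) <= psi]
    return m if 1 <= len(m) < k else None

# Example with frequency differences as func(), using the frequencies
# 400 ("Smith") and 300 ("Johnson") discussed in Section 4.1:
diff = lambda a, b: abs(a - b)
print(is_vulnerable(400, [400, 300], diff, psi=99, k=1))  # True
```

Because \(func()\) is passed in as a parameter, the same two checks can be reused for each of the five characteristics discussed below by swapping in a frequency, length, co-occurrence, similarity, or neighborhood difference function.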
    In the following we discuss five different characteristics that correspond to \(func()\) that can be used to assess the vulnerabilities of plaintext values (QID values, tokens, and token substrings) and encodings, and how Definition 3.1 of vulnerability differs for each of those characteristics. Note that all the vulnerabilities discussed next can be applied to both plaintext and encoded databases.

    4 Assessing Plaintext Vulnerabilities

We first discuss the vulnerabilities in plaintext QID values \(q \in \mathbf {Q}\) , tokens \(t \in \mathbf {T}\) , and token substrings \(s \in \mathbf {S}\) , because this will allow us to analyze the vulnerabilities of the sensitive plaintext database \(\mathbf {D}^s\) before it is encoded using a PPRL technique. Understanding how vulnerable certain QID values, tokens, or token substrings in a sensitive database are will allow a DO to decide whether to select a suitable PPRL encoding technique, to modify the database (e.g., by removing records that are too vulnerable), or to abandon the linkage project altogether.
    When considering values (QID values, tokens, and token substrings) that occur in records in a sensitive database, the vulnerabilities that occur can be classified into two categories: vulnerabilities of a single value and vulnerabilities of a pair of values, as we describe next.

    4.1 Vulnerability of a Single Plaintext Value

    Three different types of vulnerabilities can occur under this category that potentially can be exploited by an adversary: the frequency of a value, the length of a value, and the similarity neighborhood of a value.
Frequency Vulnerability is the vulnerability of a value with regard to how uniquely or rarely it occurs in records in the sensitive database \(\mathbf {D}^s\) . We assume a function \(freq(v_i)\) that returns the frequency of \(v_i\) as the number of records in \(\mathbf {D}^s\) that contain \(v_i\) . The value \(v_i \in \mathbf {V}\) with a frequency of \(freq(v_i)\) is frequency vulnerable if \(freq(v_i)\) is different from the frequencies \(freq(v_j)\) of other values \(v_j \in \mathbf {V}\) . Formally, for \(v_i\) and the set \(\mathbf {f}_i = \lbrace v_j : v_j \in \mathbf {V}, v_j \ne v_i, |freq(v_i) - freq(v_j)| \le \psi \rbrace\) of values that each has a frequency similar to that of \(v_i\) based on a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the value \(v_i\) is \((\psi ,k)\) -frequency vulnerable in the sensitive database \(\mathbf {D}^s\) if \(0 \le |\mathbf {f}_i| \lt k\) for a given \(\psi\) .
Frequency vulnerability can occur when the frequency of a value is uniquely identifiable. For example, the first and second most frequent American last names are “Smith” and “Johnson,” respectively.1 If a sensitive database \(\mathbf {D}^s\) contains these two last names with the frequencies \(freq(\textrm {``Smith''})=400\) and \(freq(\textrm {``Johnson''})=300\) , then the value “Smith” becomes frequency vulnerable under the privacy parameter settings \(\psi \lt 100\) and \(k \gt 0\) . This is because for these parameter values the frequency of “Smith” is different from the frequency of “Johnson.” On the other hand, if the sensitive database \(\mathbf {D}^s\) contains the three less frequent last name values “Cecil,” “Katz,” and “Hale” with \(freq(\textrm {``Cecil''})=5\) , \(freq(\textrm {``Katz''})=6\) , and \(freq(\textrm {``Hale''})=10\) , then “Cecil” becomes frequency vulnerable under the parameter settings \(\psi \lt 5\) and \(k \gt 1\) , while “Katz” becomes frequency vulnerable under \(\psi \lt 4\) and \(k \gt 1\) .
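The frequency vulnerability check can be sketched as follows, reproducing the “Cecil”/“Katz”/“Hale” frequencies from the example above (a minimal illustration with our own names, not the article's implementation):

```python
from collections import Counter

# Toy last-name column mirroring the example frequencies above.
records = ["Cecil"] * 5 + ["Katz"] * 6 + ["Hale"] * 10
freq = Counter(records)  # counts: Cecil=5, Katz=6, Hale=10

def frequency_vulnerable(v, freq, psi, k):
    # v is (psi, k)-frequency vulnerable if fewer than k other values
    # have a frequency within psi of freq(v).
    f_i = [w for w in freq if w != v and abs(freq[v] - freq[w]) <= psi]
    return len(f_i) < k

# "Cecil" is vulnerable for psi < 5 and k > 1 (here psi=4, k=2):
print(frequency_vulnerable("Cecil", freq, psi=4, k=2))  # True
# With psi=4, both other values lie within psi of "Katz", so it is not:
print(frequency_vulnerable("Katz", freq, psi=4, k=2))   # False
```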
Length Vulnerability is the vulnerability of a value with regard to how unique its length is compared to the lengths of other values of the same QID attribute in the sensitive database \(\mathbf {D}^s\) . We assume a function \(len(v_i)\) that returns the length of a value (assumed to be a string) as the number of characters it contains. A given value \(v_i \in \mathbf {V}\) with a length of \(len(v_i)\) is length vulnerable if \(len(v_i)\) is different from \(len(v_j)\) of other values in \(\mathbf {V}\) . Formally, for \(v_i\) and the set \(\mathbf {l}_i = \lbrace v_j : v_j \in \mathbf {V}, v_j \ne v_i, |len(v_i) - len(v_j)| \le \psi \rbrace\) of values that each has a length similar to that of \(v_i\) based on a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, \(v_i\) is \((\psi ,k)\) -length vulnerable in \(\mathbf {D}^s\) if \(0 \le |\mathbf {l}_i| \lt k\) for a given \(\psi\) .
Length vulnerability can occur in any situation where the length of a value is uniquely identifiable. For instance, one of the longest last names recorded in modern databases is “Wolfeschlegelsteinhausenbergerdorff”2 with a length of 35 characters. If the sensitive database \(\mathbf {D}^s\) contains this last name and a second longest last name is “Kellermann” with a length of 10 characters, then the value “Wolfeschlegelsteinhausenbergerdorff” becomes length vulnerable under the privacy parameter settings \(\psi \lt 25\) and \(k \gt 0\) .
    Similarity Neighborhood Vulnerability is defined with regard to how unique the neighborhood of a value in a similarity graph (based on a set of extracted features) is compared to the neighborhoods of other values in the same graph. Here we assume the similarity graph was built using the pairwise similarities between records in \(\mathbf {D}^s\) [13, 66], where nodes in the graph represent records and edges represent the calculated similarities between those records. We assume a function \(sim(v_i, v_j)\) that returns the similarity between a value pair \((v_i, v_j)\) . We define the neighborhood of a value \(v_i\) as \(\mathbf {v}_i = \lbrace v_j : v_j \in \mathbf {V}, v_j \ne v_i, sim(v_i, v_j) \ge s_t\rbrace\) , where \(s_t\) is a user-defined similarity threshold. We next assume a function \( {feat}(\mathbf {v}_i)\) that returns a set of features for the value \(v_i\) calculated using its neighborhood \(\mathbf {v}_i\) [66]. These features can include the degree of \(v_i\) , minimum, maximum, and average pairwise similarities between \(v_i\) and its neighbors \(v_j \in \mathbf {v}_i\) [66]. The value \(v_i\) is similarity neighborhood vulnerable if its neighborhood \(\mathbf {v}_i\) is different from the neighborhood sets \(\mathbf {v}_p\) of other values in \(\mathbf {V}\) based on the distance between their corresponding feature vectors \( {feat}(\mathbf {v}_i)\) and \( {feat}(\mathbf {v}_p)\) .
    Formally, for a value \(v_i\) , a set \(\mathbf {n}_i = \lbrace v_p : v_p \in \mathbf {V}, v_p \ne v_i, dist( {feat}(\mathbf {v}_i), {feat}(\mathbf {v}_p))\) \(\le \psi \rbrace\) can be defined, where \(dist()\) measures the distance between two vectors. The set \(\mathbf {n}_i\) consists of values where each has a neighborhood that is similar to the neighborhood of the value \(v_i\) based on a tolerance \(\psi\) . If \(0 \le |\mathbf {n}_i| \lt k\) , with k and \(\psi \ge 0\) being two privacy parameters, then \(v_i\) is \((\psi ,k)\) -neighborhood vulnerable in the sensitive database \(\mathbf {D}^s\) . Here the distance function \(dist()\) can be any function, such as the Euclidean or Cosine distances [7], that calculates the distance between two real-valued vectors.
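The similarity neighborhood check might be sketched as follows, assuming a pairwise similarity function \(sim()\) and the four neighborhood features mentioned above (degree and minimum, maximum, and average similarity); all names are our own:

```python
import math

def neighborhood(v, values, sim, s_t):
    # Neighbors of v: all values whose similarity to v is at least s_t.
    return [w for w in values if w != v and sim(v, w) >= s_t]

def feat(v, values, sim, s_t):
    # Feature vector built from v's neighborhood: degree plus the
    # minimum, maximum, and average similarity to its neighbors.
    sims = [sim(v, w) for w in neighborhood(v, values, sim, s_t)]
    if not sims:
        return (0, 0.0, 0.0, 0.0)
    return (len(sims), min(sims), max(sims), sum(sims) / len(sims))

def euclidean(a, b):
    # One possible choice for dist(): Euclidean distance between vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighborhood_vulnerable(v, values, sim, s_t, psi, k):
    # v is (psi, k)-neighborhood vulnerable if fewer than k other values
    # have a feature vector within distance psi of v's feature vector.
    fv = feat(v, values, sim, s_t)
    n_i = [w for w in values
           if w != v and euclidean(fv, feat(w, values, sim, s_t)) <= psi]
    return len(n_i) < k
```

Any other vector distance, such as the Cosine distance, could be substituted for `euclidean()` without changing the structure of the check.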

    4.2 Vulnerability of a Pair of Plaintext Values

    Two different types of vulnerabilities can occur under this category that potentially can be exploited by an adversary: the co-occurrence of two (or more) values and the similarity between a pair of values.
Co-occurrence Vulnerability is the vulnerability of a set of two or more values with regard to how unique the co-occurrence frequency of that set of values is in the sensitive database \(\mathbf {D}^s\) . Here we assume the function \(freq(\mathbf {v}_i)\) returns the frequency of the set of values \(\mathbf {v}_i\) that co-occur with each other as the number of records in \(\mathbf {D}^s\) that contain all \(v_i \in \mathbf {v}_i\) , where \(|\mathbf {v}_i| \ge 2\) . A given set of values \(\mathbf {v}_i \subset \mathbf {V}\) that co-occur with each other with a frequency of \(freq(\mathbf {v}_i)\) is co-occurrence vulnerable if the frequency \(freq(\mathbf {v}_i)\) is different from the frequencies \(freq(\mathbf {v}_j)\) of other sets of co-occurring values \(\mathbf {v}_j \subset \mathbf {V}\) . Formally, for a set of values \(\mathbf {v}_i\) and the set \(\mathbf {F}_i = \lbrace \mathbf {v}_j : \mathbf {v}_j \subset \mathbf {V}, \mathbf {v}_i \ne \mathbf {v}_j, |freq(\mathbf {v}_i) - freq(\mathbf {v}_j)| \le \psi \rbrace\) of sets of values that each has a co-occurring frequency similar to \(freq(\mathbf {v}_i)\) based on a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the set \(\mathbf {v}_i\) is \((\psi ,k)\) -co-occurrence vulnerable in \(\mathbf {D}^s\) if \(0 \le |\mathbf {F}_i| \lt k\) .
Co-occurrence vulnerability can happen even if the co-occurring values are not individually frequency vulnerable. For instance, let us assume two first name tokens “Hans” and “Dieter” and one last name token “Schmidt” with the individual frequencies \(freq(\textrm {``Hans''})=110\) , \(freq(\textrm {``Dieter''})=100\) , and \(freq(\textrm {``Schmidt''})=90\) , where all these values are not \((\psi ,k)\) -frequency vulnerable with \(\psi = 0\) and \(k = 3\) if we assume that for each of these tokens there are more than two other tokens in the sensitive database \(\mathbf {D}^s\) with the same frequency. However, if the co-occurring frequency \(freq(\textrm {``Hans,'' ``Dieter,'' ``Schmidt''})=50\) is different from the frequencies of other co-occurring sets of values because there are no more than two other sets of tokens that co-occur with the same frequency in \(\mathbf {D}^s\) , then the token tuple (“Hans,” “Dieter,” “Schmidt”) will be co-occurrence vulnerable under the privacy parameter settings \(\psi = 0\) and \(k = 3\) .
    Note that the co-occurrence vulnerability is independent from the frequency vulnerability discussed in Section 4.1. If a set of values co-occur in records with a unique frequency, then this does not imply that each value in that set is frequency vulnerable. Similarly, if any value in a co-occurring set of values is individually frequency vulnerable, that does not imply that the set of values is co-occurrence vulnerable.
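Co-occurrence frequencies can be computed by counting, for every subset of values of a given size, the number of records containing all of them. A minimal sketch (toy records and names of our own, pair-wise co-occurrence only):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_freqs(records, size=2):
    # freq(v_set): the number of records that contain every value in the
    # set, counted here for all subsets of the given size per record.
    counts = Counter()
    for rec in records:
        for combo in combinations(sorted(rec), size):
            counts[combo] += 1
    return counts

def cooccurrence_vulnerable(v_set, counts, psi, k):
    # v_set is (psi, k)-co-occurrence vulnerable if fewer than k other
    # sets have a co-occurrence frequency within psi of its own.
    f_i = [w for w in counts
           if w != v_set and abs(counts[v_set] - counts[w]) <= psi]
    return len(f_i) < k

records = [{"Hans", "Dieter", "Schmidt"}, {"Hans", "Schmidt"},
           {"Hans", "Dieter", "Schmidt"}, {"Dieter", "Mueller"}]
counts = cooccurrence_freqs(records)
# ("Hans", "Schmidt") co-occurs 3 times; no other pair co-occurs 3 times:
print(cooccurrence_vulnerable(("Hans", "Schmidt"), counts, psi=0, k=1))  # True
```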
    Similarity Vulnerability is the vulnerability of a pair of values with regard to how unique the similarity between those two values is compared to the pairwise similarities calculated between other values in the sensitive database \(\mathbf {D}^s\) . A given value pair \((v_i, v_j)\) , where \(v_i \in \mathbf {V}\) and \(v_j \in \mathbf {V}\) , is similarity vulnerable if the similarity \(sim(v_i, v_j)\) is different from the similarities \(sim(v_a, v_b)\) of all other value pairs \((v_a, v_b),\) where \(v_a, v_b \in \mathbf {V}\) . Formally, for a value pair \((v_i, v_j)\) that has a similarity \(sim(v_i, v_j)\) and the set \(\mathbf {s}_{ij} = \lbrace (v_a, v_b) : v_a \in \mathbf {V}, v_b \in \mathbf {V}, (v_a, v_b) \ne (v_i, v_j), |sim(v_i, v_j) - sim(v_a, v_b)| \le \psi \rbrace\) of value pairs that each has a similarity that is similar to \(sim(v_i, v_j)\) based on a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the value pair \((v_i, v_j)\) is \((\psi ,k)\) -similarity vulnerable in the sensitive database \(\mathbf {D}^s\) if \(0 \le |\mathbf {s}_{ij}| \lt k\) .
    Similarity vulnerability can occur if the similarity of a pair of values is uniquely identifiable. For instance, assume two value pairs (Miller, Mills) and (Smith, Smyth) that have Levenshtein edit distance similarities [37] of 0.67 and 0.8, respectively. The value pair (Miller, Mills) becomes similarity vulnerable under the privacy parameter settings \(\psi \lt 0.1\) and \(k \gt 0\) if no other pair of values in \(\mathbf {D}^s\) has a similarity between 0.57 and 0.77.
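The similarity vulnerability check can be sketched using the normalized Levenshtein similarity from the example above (a minimal illustration, with our own naming):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_sim(a, b):
    # Normalized edit-distance similarity in [0, 1].
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def similarity_vulnerable(pair, all_pairs, sim, psi, k):
    # (v_i, v_j) is (psi, k)-similarity vulnerable if fewer than k other
    # pairs have a similarity within psi of sim(v_i, v_j).
    s = sim(*pair)
    s_ij = [p for p in all_pairs if p != pair and abs(s - sim(*p)) <= psi]
    return len(s_ij) < k

pairs = [("Miller", "Mills"), ("Smith", "Smyth")]
print(round(edit_sim("Miller", "Mills"), 2))  # 0.67
print(similarity_vulnerable(("Miller", "Mills"), pairs, edit_sim,
                            psi=0.09, k=1))   # True
```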

    5 Assessing Encoded Vulnerabilities

    We now discuss the vulnerabilities of encodings \(e \in \mathbf {e}_i\) where \(\mathbf {e}_i \in \mathbf {E}\) , which, unlike plaintext vulnerabilities, will depend on the actual PPRL encoding technique used. Depending on the function \(encode()\) used, each individual encoding e will be different as we illustrated in Table 2. Understanding the vulnerabilities associated with different PPRL encoding techniques will allow a DO to select an appropriate technique according to their privacy requirements.
In Section 3 we discussed two conditions that need to be satisfied for an encoding \(e \in \mathbf {H}\) to be reidentifiable in an attack: the encoding should be (1) vulnerable within the set \(\mathbf {H}\) and (2) assignable to a unique (or a small number of) plaintext values v in the set \(\mathbf {V}\) . It is important to note that being only vulnerable within the set of encodings \(\mathbf {H}\) does not necessarily make an encoding useful to reidentify sensitive plaintext values. For instance, assume an encoding \(e_i\) with a unique frequency \(freq(e_i) = 150\) within \(\mathbf {H}\) . However, if there is no token substring, token, or QID value in the set of plaintext values \(\mathbf {V}\) that has a frequency similar to the frequency of \(e_i\) (based on a defined tolerance \(\psi\) ), then the encoding \(e_i\) cannot be assigned to a plaintext value with high confidence. In contrast, if there are k or more (for a given value of k) QID values, tokens, or token substrings in \(\mathbf {V}\) that have frequencies comparable (with regard to \(\psi\) ) to the frequency of \(e_i\) , then again the encoding \(e_i\) cannot be assigned to a plaintext value with high confidence. Note that the defined values for \(\psi\) and k for encoding vulnerabilities can be different from the corresponding values defined for plaintext vulnerabilities. Therefore, in a PPRL attack, a reidentification only occurs when a vulnerable encoding \(e_i \in \mathbf {H}\) can be correctly assigned to a vulnerable plaintext value \(v_i \in \mathbf {V}\) that refers to the same entity \(\epsilon _i\) .
    Similar to plaintext vulnerabilities, the vulnerabilities of encodings in an encoded database can be classified into two categories: vulnerabilities of a single encoded value and vulnerabilities of a pair of encoded values. Note that from here onward \(\mathbf {e}_x\) and/or \(\mathbf {e}_y\) are used to represent a set of encodings in \(\mathbf {H}\) for multiple records, which is different from the set \(\mathbf {e}_i \in \mathbf {E}\) we used to represent the set of all encodings generated using the encoding function \(encode()\) for a single record \(r_i \in \mathbf {D}^s\) .

    5.1 Vulnerability and Assignability of a Single Encoding

The three types of characteristics discussed in Section 4.1 can also be exploited by an adversary in terms of the vulnerability and assignability of a single encoding.
    Frequency Vulnerability and Assignability occur when an encoding \(e_i\) that is frequency vulnerable can be assigned to one or more values \(v_a\) in the set of plaintext values \(\mathbf {V}\) because their corresponding frequencies, \(freq(e_i)\) and \(freq(v_a)\) , are the same or similar. Formally, for an encoding \(e_i \in \mathbf {H}\) and the set \(\mathbf {m}_f = \lbrace v_a : v_a \in \mathbf {V}, |freq(e_i) - freq(v_a)| \le \psi \rbrace\) of vulnerable plaintext values that each has a frequency difference with at most a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the pair \((e_i, v_a)\) is \((\psi ,k)\) -frequency assignable if \(1 \le |\mathbf {m}_f| \lt k\) . Note that depending on the sizes of the plaintext and encoded databases, \(|\mathbf {D}^s|\) and \(|\mathbf {D}^e|\) , the calculated frequencies of plaintext and of encoded values will need to be normalized in order to allow accurate comparison and assignment between values.
Frequency assignability can occur when an encoding can be uniquely assigned to a plaintext value based on their corresponding frequencies. For example, if the encoded database \(\mathbf {D}^e\) contains a hash encoding “YjdJvER6lucFbSE” with \(freq(\textrm {``YjdJvER6lucFbSE''})=300\) that is frequency vulnerable within \(\mathbf {D}^e\) under \(\psi \lt 20\) and \(k \gt 0\) , and the plaintext database \(\mathbf {D}^s\) has a first name “David” with a frequency of \(freq(\textrm {``David''})=280\) and no other plaintext value with a frequency between 280 and 320, then the encoding “YjdJvER6lucFbSE” becomes frequency assignable to “David” under the privacy parameter settings \(\psi = 20\) and \(k \gt 1\) .
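The frequency assignment step, including the normalization by database size mentioned above, might be sketched as follows (the second encoding, the database sizes, and the tolerance value are hypothetical additions of ours):

```python
def normalized_freqs(counts, db_size):
    # Normalize raw counts by database size so that frequencies from
    # plaintext and encoded databases of different sizes are comparable.
    return {x: c / db_size for x, c in counts.items()}

def frequency_assignable(e, enc_freq, pt_freq, psi, k):
    # Candidate plaintext values whose normalized frequency lies within
    # psi of the encoding's; the pair is (psi, k)-assignable only if at
    # least one and fewer than k candidates exist.
    m_f = [v for v in pt_freq if abs(enc_freq[e] - pt_freq[v]) <= psi]
    return m_f if 1 <= len(m_f) < k else None

enc = normalized_freqs({"YjdJvER6lucFbSE": 300, "aQ3x": 40}, 1000)
pt = normalized_freqs({"David": 280, "Emma": 120}, 1000)
print(frequency_assignable("YjdJvER6lucFbSE", enc, pt, psi=0.025, k=2))
# ['David']
```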
    Length Vulnerability and Assignability occur when an encoding \(e_i\) is assigned to one or more values \(v_a\) in the plaintext values \(\mathbf {V}\) because their corresponding lengths \(len(e_i)\) and \(len(v_a)\) are the same or similar. Formally, for an encoding \(e_i \in \mathbf {H}\) and the set \(\mathbf {m}_l = \lbrace v_a : v_a \in \mathbf {V}, |len(e_i) - len(v_a)| \le \psi \rbrace\) of plaintext values that each has a length difference of at most a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the pair \((e_i, v_a)\) is \((\psi ,k)\) -length assignable if \(1 \le |\mathbf {m}_l| \lt k\) .
    Depending on the PPRL encoding technique used, the length calculation and the actual lengths of the encodings can be different from the plaintext value length calculations and their corresponding length values. For instance, with BF encoding [50], the length of a BF can be considered as the Hamming weight (number of 1-bits) of the corresponding BF. This is different from the number of q-grams extracted from the plaintext value that is encoded in that BF [50]. Therefore, when conducting length assignment, the relative lengths of plaintext values and encodings need to be used. One possibility is to normalize the length values of both plaintext values and encodings and then use such normalized values to assign encodings and plaintext values.
For example, assume \(\mathbf {D}^e\) contains a BF \(b1\) that has a Hamming weight of 12, and \(\mathbf {D}^s\) contains a plaintext value “Eleanor” that contains six 2-grams. Assume the minimum and maximum Hamming weights in \(\mathbf {D}^e\) are 5 and 20, respectively, and the minimum and maximum number of q-grams of a plaintext value in \(\mathbf {D}^s\) are 3 and 10, respectively. We can apply min-max normalization independently to Hamming weights and numbers of q-grams, where the resulting normalized Hamming weight and number of q-grams for this example pair of values would be approximately 0.47 and 0.43, respectively. Then BF \(b1\) becomes length assignable under the privacy parameter settings \(\psi = 0.05\) and \(k \gt 1\) if there are no other values in the two databases that have normalized lengths within the interval 0.42 to 0.52.
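The min-max normalization step of this example can be written as follows (a minimal sketch; exact normalized values for a Hamming weight of 12 in [5, 20] and six 2-grams in [3, 10] come to approximately 0.47 and 0.43):

```python
def min_max(x, lo, hi):
    # Min-max normalization onto [0, 1], where lo and hi are the
    # minimum and maximum observed values in the respective database.
    return (x - lo) / (hi - lo)

norm_hw = min_max(12, 5, 20)  # normalized Hamming weight, approx. 0.47
norm_qg = min_max(6, 3, 10)   # normalized q-gram count, approx. 0.43
print(abs(norm_hw - norm_qg) <= 0.05)  # True: within tolerance psi = 0.05
```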
    Note that the length vulnerability and assignability will not be applicable to encoding techniques that generate fixed length values, such as HMAC [3] or MMK [46].
Similarity Neighborhood Vulnerability and Assignability occur when a vulnerable encoding \(e_x\) is assigned to one or more vulnerable values \(v_a \in \mathbf {V}\) because their similarity neighborhoods \(\mathbf {e}_x\) and \(\mathbf {v}_a\) in the corresponding similarity graphs are the same or similar. The two neighborhoods \(\mathbf {e}_x\) and \(\mathbf {v}_a\) are compared using the vectors \( {feat}(\mathbf {e}_x)\) and \( {feat}(\mathbf {v}_a)\) of features calculated using the corresponding neighborhoods. Formally, for a vulnerable encoding \(e_x \in \mathbf {H}\) and the set \(\mathbf {m}_n = \lbrace v_a : v_a \in \mathbf {V}, dist( {feat}(\mathbf {e}_x), {feat}(\mathbf {v}_a)) \le \psi \rbrace\) of vulnerable plaintext values that each has a distance between their neighborhood features of at most a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the pair \((e_x, v_a)\) is \((\psi ,k)\) -neighborhood assignable if \(1 \le |\mathbf {m}_n| \lt k\) . Here \(dist()\) is the function that measures the distance between the two feature vectors.

    5.2 Vulnerability and Assignability of a Pair of Encodings

The two types of characteristics discussed in Section 4.2 can also be exploited by an adversary in terms of the vulnerability and assignability of a pair of encodings.
Co-occurrence Vulnerability and Assignability occur when a set of two or more vulnerable encodings \(\mathbf {e}_x\) (from the same record) are assigned to one or more sets of vulnerable values \(\mathbf {v}_a\) in the plaintext values \(\mathbf {V}\) because their corresponding co-occurrence frequencies \(freq(\mathbf {e}_x)\) and \(freq(\mathbf {v}_a)\) are the same or similar. Formally, for a set of unique encodings \(\mathbf {e}_x \subset \mathbf {H}\) and the set \(\mathbf {M}_c = \lbrace \mathbf {v}_a : \mathbf {v}_a \subset \mathbf {V}, |freq(\mathbf {e}_x) - freq(\mathbf {v}_a)| \le \psi \rbrace\) of sets of vulnerable plaintext values that each has a co-occurring frequency difference of at most a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, the pair of sets \((\mathbf {e}_x, \mathbf {v}_a)\) is \((\psi ,k)\) -co-occurrence assignable if \(1 \le |\mathbf {M}_c| \lt k\) .
For instance, assume two hash encodings “frYcgrjawf4AV21” and “Fr42SweT4kuRu” in \(\mathbf {D}^e\) have a co-occurrence frequency of \(freq(\textrm {``frYcgrjawf4AV21,'' ``Fr42SweT4kuRu''})=80\) , where this frequency is unique in the set of all the frequencies in \(\mathbf {D}^e\) . Also assume the two tokens “Lucy” and “Thomas” in \(\mathbf {D}^s\) have a co-occurrence frequency of \(freq(\textrm {``Lucy,'' ``Thomas''})=85\) (and no other pair of tokens has a frequency in the range of 70 to 90). Then the encoding tuple (“frYcgrjawf4AV21,” “Fr42SweT4kuRu”) becomes co-occurrence assignable to the token pair (“Lucy,” “Thomas”) under the privacy parameter settings \(\psi = 10\) and \(k \gt 1\) .
    Similarity Vulnerability and Assignability occur when a pair of vulnerable encodings \((e_i, e_j)\) is assigned to one or more pairs of vulnerable values \((v_a, v_b)\) because their corresponding pairwise similarities \(sim(e_i, e_j)\) and \(sim(v_a, v_b)\) are the same or similar. Formally, for a pair of unique encodings \((e_i, e_j)\) , where \(e_i \in \mathbf {H}\) and \(e_j \in \mathbf {H}\) , and the set \(\mathbf {m}_s = \lbrace (v_a, v_b) : v_a \in \mathbf {V}, v_b \in \mathbf {V}, |sim(e_i, e_j) - sim(v_a, v_b)| \le \psi \rbrace\) of vulnerable plaintext pairs of values that each has a similarity difference with at most a tolerance \(\psi\) , with k and \(\psi \ge 0\) being two privacy parameters, \(((e_i, e_j), (v_a, v_b))\) is \((\psi ,k)\) -similarity assignable if \(1 \le |\mathbf {m}_s| \lt k\) .
    Similar to length assignment, depending on the encoding technique used, the similarities calculated between encoded values can be different from the similarities calculated between plaintext values [66]. Therefore, when conducting the similarity assignment, the plaintext similarities should be adjusted (e.g., using a regression model [66]) with respect to encoding similarities, or vice versa.
    In the following sections we discuss the utility of the above-discussed vulnerabilities in reidentifying encoded sensitive values.

    6 Relevance of Plaintext and Encoded Vulnerabilities

    It is important to note that not all of the above-discussed vulnerabilities are equally applicable to QID values, tokens, and token substrings, or to all PPRL encoding techniques. In Table 3 we show the applicability of each vulnerability for the different types of plaintext values and for five PPRL encoding techniques. For instance, length vulnerability cannot be assessed for token substrings if we assume token substrings (such as q-grams) have a fixed length. Similarly, length vulnerability cannot be assessed for the MMK encoding technique [46] because this technique generates a set of fixed length hash codes for each record in the sensitive database \(\mathbf {D}^s\) .
Table 3.
[Table entries not reproduced: rows are the five vulnerabilities (Frequency, Length, Co-occurrence, Similarity, Neighborhood); columns are the plaintext value types \(\mathbf {Q}\) , \(\mathbf {T}\) , and \(\mathbf {S}\) and the encoding techniques SLK, BF, TMH, MMK, and 2SH, with each cell marked ✓, –, or ? as per the legend below.]
Table 3. The Applicability of Each Vulnerability for Plaintext Values \(\mathbf {Q}\) , \(\mathbf {T}\) , and \(\mathbf {S}\) (as Illustrated in Table 2) and for Five Encoding Techniques for PPRL, SLK [32], BFs [50], TMH [55], MMK [46], and 2SH [44], as Explained in Section 2.2
    A ✓ indicates that the corresponding vulnerability is being exploited by an existing PPRL attack, whereas a – indicates that it is not. We also use ? to indicate that the corresponding encoding technique might have that vulnerability, but this has not yet been explored in the literature.
    In general, an adversary gains less information by reidentifying a token substring (such as a q-gram) compared to a token (using the same vulnerability assessment), because a q-gram is only a part of a token. However, if a certain q-gram is unique in the sensitive database (if the q-gram occurs only in one token in \(\mathbf {D}^s\) , for example), reidentifying that particular q-gram could lead to the reidentification of the token and potentially the reidentification of the corresponding entity itself. For instance, in the North Carolina Voter Registration database, which we use in our experimental evaluation in Section 8, the q-gram “kz” occurs in only one last name value (“Stanakzai”). Therefore, the reidentification of the q-gram “kz” could lead to the reidentification of the corresponding voter (real-world entity).
Similarly, reidentifying a token will generally provide an adversary with less information compared to reidentifying a QID value. For instance, the QID value “Hans Dieter” has two tokens and eight q-grams when \(l_q=2\) . By reidentifying a single token, an adversary gains more information than from reidentifying a single q-gram, and reidentifying an entire QID value provides even more. However, as we mentioned before, many QID values will only contain a single token, and therefore reidentifying a token in such a case will lead to the reidentification of the QID value itself. This flow of reidentification is illustrated in Figure 1.

    7 Analysis of Existing Attacks On PPRL

    In this section we briefly discuss how existing attacks on PPRL exploit the vulnerabilities discussed above to reidentify encoded QID values.
    As we discussed in Section 3, an attack on an encoded database consists of two steps, attribute reidentification and identity reidentification (which lead to attribute and identity disclosures). Attribute reidentification occurs when an adversary exploits one or more vulnerabilities in the encoded database, as we discussed in Section 5. Most of the existing attacks exploit frequency and co-occurrence vulnerabilities of encodings [9, 11, 12, 34, 35, 43], whereas only two attacks exploit the length vulnerability [9, 35] and two the similarity neighborhood vulnerability [13, 66].
    For instance, the attack proposed by Kuzu et al. [35] explores the frequencies of BFs, the lengths of 1-bit patterns in BFs, and the co-occurrences of bit positions in BFs in order to assign plaintext values with similar characteristics to those BFs. The attack proposed by Kroll and Steinmetzer [34] first aligns frequent BFs with frequent QID values and then uses an optimization algorithm to find the maximum correlation between q-gram pairs and the pairs of bit patterns that can encode a single q-gram. The graph-based attack proposed by Vidanage et al. [66] aligns encoded values with plaintext values based on the uniqueness of the similarity neighborhoods of values. Once attribute values are reidentified, the adversary can use their uniqueness to disclose the identities of the people they represent.
    Vidanage et al. [69] have proposed a taxonomy of attacks on PPRL, where existing attacks are categorized along 12 dimensions. In this taxonomy, the authors identify different aspects of an attack, including the attack type, scope, assumptions made, adversary types, and so on. We have also provided brief descriptions of the methodologies used by existing PPRL attacks in Section 2.5. When analyzing those methods, it becomes apparent that the basic building blocks of all existing attack methods include exploiting at least one of the vulnerabilities that occur in both encodings \(e \in \mathbf {H}\) and plaintext values \(v \in \mathbf {V}\) [69]. Therefore, for an attack to be successful in reidentifying attribute values and real-world entities, both the encoded values \(\mathbf {E}\) and the plaintext values \(\mathbf {V}\) need to be vulnerable.
    Furthermore, if no encoding \(e \in \mathbf {H}\) or no plaintext value \(v \in \mathbf {V}\) is \((\psi ,k)\) -vulnerable, and no encoding and plaintext value pair \((e, v)\) is \((\psi ,k)\) -assignable, then we can define the encoded list of values \(\mathbf {E}\) as secure with regard to the lists of plaintext values ( \(\mathbf {Q}\) , \(\mathbf {T}\) , or \(\mathbf {S}\) ), the corresponding encoding function \(encode()\) , and its parameters \(\mathbf {p}\) used, and the two privacy parameters \(\psi\) and k. This is because without the vulnerabilities that occur in both the encoded and plaintext values \(\mathbf {E}\) and \(\mathbf {V}\) , no existing attack on PPRL would be successful in reidentifying encoded attribute values correctly.
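    As an illustration of this security condition, the frequency variant of a \((\psi ,k)\) -vulnerability check could be sketched as follows. This is our reading of the definition, not code from the framework: a value counts as vulnerable when fewer than k distinct values have a frequency within \(\psi\) of its own, so frequency alignment narrows it down to a candidate set smaller than k.

```python
from collections import Counter

def frequency_vulnerable(values, k, psi):
    """Return the set of (psi, k)-vulnerable values under the
    frequency characteristic: values for which fewer than k distinct
    values have a frequency within psi of their own frequency."""
    freqs = Counter(values)
    vulnerable = set()
    for val, f in freqs.items():
        # Size of the candidate set an adversary cannot distinguish
        # from val using frequency information alone.
        candidates = sum(1 for f2 in freqs.values() if abs(f2 - f) <= psi)
        if candidates < k:
            vulnerable.add(val)
    return vulnerable
```

    With psi = 0, this reduces to flagging values whose exact frequency is shared by fewer than k distinct values, so values with (near-)unique frequencies are the ones flagged.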
    For instance, the initial steps of the attacks proposed by Kuzu et al. [35], Niedermeyer et al. [43], Kroll and Steinmetzer [34], Christen et al. [9, 11, 12], and Vidanage et al. [68] include the assignment of frequent BFs to frequent QID values, or of frequent bit patterns to frequent q-grams. The remaining steps of the attacks depend on the accuracy of these assignments. If no frequent BFs, bit patterns, QID values, or q-grams can be found in their corresponding databases, then these attacks would fail. The attack proposed by Culnane et al. [13] matches encodings with plaintext QID values based on the uniqueness of their similarity neighborhoods. If no such unique similarity neighborhoods can be identified, then the attack would fail to correctly reidentify encoded values.
    The attack developed by Vidanage et al. [67] exclusively depends on the frequency distribution analysis and alignment of match-key values in the plaintext and encoded databases. Without such accurate frequency alignment this attack would fail. The graph attack proposed by Vidanage et al. [66] matches nodes in a similarity graph generated from an encoded database with a corresponding similarity graph generated from a plaintext database based on the neighborhoods of nodes. If the node neighborhoods are dissimilar across the encoded and plaintext databases, then the attack would fail to reidentify encoded values. The attacks proposed by Mitchell et al. [41] and Christen et al. [10] are special types of attacks that assume that the adversary has access to all encoding parameters. Even though they do not depend on the co-occurrence frequencies of QID values or q-grams, they depend on the co-occurrence of q-grams being connected to each other such that the end character of one q-gram will be the first character of another (e.g., the name “tim” is represented as [ti, im]). If this type of co-occurrence does not exist in q-grams, both these attacks would fail to correctly reidentify encoded values.
    In Table 4, we show the different vulnerabilities exploited by different attack methods in PPRL. As can be seen, most of these attacks exploit the frequency vulnerability and assignability. However, no PPRL attack proposed so far exploits the similarity vulnerability and assignability. This is because in real-world situations, especially in large databases, there are hardly any pairs of encodings or plaintext values whose similarities are unique.
    Table 4.
    Attack method                                 Vulnerability and Assignability exploited:
                                                  Frequency  Length  Co-occurrence  Similarity  Neighborhood
    Kuzu et al. (PET 2011) [35]                   ✓          ✓       ✓              –           –
    Niedermeyer et al. (JPC 2014) [43]            ✓          –       ✓              –           –
    Kroll and Steinmetzer (BIOSTEC 2015) [34]     ✓          –       ✓              –           –
    Christen et al. (PAKDD 2017) [11]             ✓          –       ✓              –           –
    Mitchell et al. (IJBDI 2017) [41]             –          –       ✓              –           –
    Culnane et al. (arXiv 2017) [13]              –          –       –              –           ✓
    Christen et al. (PAKDD 2018) [12]             ✓          –       ✓              –           –
    Christen et al. (TKDE 2018) [9]               ✓          ✓       ✓              –           –
    Vidanage et al. (ICDE 2019) [68]              ✓          –       ✓              –           –
    Vidanage et al. (IJPDS 2020) [67]             ✓          –       –              –           –
    Vidanage et al. (CIKM 2020) [66]              –          –       –              –           ✓
    Christen et al. (IS 2021) [10]                –          –       ✓              –           –
    Table 4. Vulnerabilities and the Corresponding Assignabilities (as We Discuss in Sections 4 and 5) in Encoded and Plaintext Databases Exploited by Existing PPRL Attacks We Discussed in Section 2.5
    As can be seen, only four out of the five characteristics have been explored by these attacks.

    8 Experimental Evaluation

    We conducted an experimental evaluation of our proposed vulnerability assessment framework for PPRL using a real-world and a synthetic database. We measured the different types of vulnerabilities in the plaintext databases and used five PPRL encoding techniques to measure the vulnerabilities associated with encoded databases.

    8.1 Databases and Evaluation Setup

    First we used the North Carolina Voter Registration (NCVR) database,3 where we used one snapshot collected in December 2020 to randomly sample a subset of 100,000 records. These records contain personal details of North Carolina voters. The second database we used is a synthetic European census database4 (EURO) generated to represent real observations of the decennial European census. This database contains personal details of 25,343 fictitious people. We used different combinations of the attributes first name, last name, street address, and city to generate different instances of encoded databases using the four encoding techniques BF [50], TMH [55], MMK [46], and 2SH [44] as we described in Section 2.2. These selected attributes are commonly used in PPRL evaluation studies [65]. For SLK [32] encoding, we used the four attributes first name, last name, date of birth, and gender as SLK encoding is based on those attributes.
    For both the plaintext and encoded databases, we analyzed all five vulnerabilities for the three plaintext value lists (QIDs \(\mathbf {Q}\) , tokens \(\mathbf {T}\) , and token substrings \(\mathbf {S}\) ) and for the five PPRL encoding techniques, respectively. We used \(k = [1, 5, 10, 20, 30, 50, 100]\) and \(\psi ^{' } = [0\%, 1\%, 2\%, 3\%, 4\%, 5\%]\) for the privacy parameter settings, where we calculated the actual \(\psi\) values from these percentage values based on the minimum and maximum values of the corresponding vulnerability characteristic (such as the minimum and maximum frequency and length of values). In the examples discussed in Sections 4 and 5, we defined \(\psi\) as a continuous value because in these examples we do not discuss the minimum and maximum values in the corresponding databases. However, in this experimental evaluation we use \(\psi ^{' }\) as a percentage based on the minimum and maximum values in the corresponding databases to calculate the actual \(\psi\) values. This is because the actual values of \(\psi\) calculated based on the defined percentages will then be directly associated with the corresponding database characteristics. For token substrings we used q-grams of length \(l_q=2\) for all the experiments. For each PPRL encoding technique, we used a single combination of parameter settings to encode plaintext values. Table 5 shows the parameter settings we used for each encoding technique. These settings follow earlier and/or seminal work of the corresponding PPRL encoding methods [32, 44, 46, 50, 55, 65].
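    The mapping from a percentage setting \(\psi ^{' }\) to an absolute \(\psi\) described above can be sketched as follows (the function name is ours): \(\psi\) is scaled by the range of the vulnerability characteristic being assessed in the database at hand.

```python
def psi_from_percentage(psi_pct, characteristic_values):
    """Convert a relative privacy setting psi' (in percent) into an
    absolute psi, scaled by the range between the minimum and maximum
    of the characteristic (e.g. frequencies or lengths) at hand."""
    lo, hi = min(characteristic_values), max(characteristic_values)
    return (psi_pct / 100.0) * (hi - lo)
```

    For example, with value frequencies ranging from 1 to 501, a setting of \(\psi ^{' } = 2\%\) yields \(\psi = 10\) , so this absolute threshold is directly tied to the characteristics of the database being assessed.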
    Table 5.
    Encoding Technique | Used Parameter Settings
    BF Encoding [50]   | BF length = 1,000, q-gram length = 2, encoding method = cryptographic long-term key (CLK) [51], hashing method = random hashing [8], number of hash functions = 15
    TMH Encoding [55]  | Bit vector length (number of sets of look-up tables) = 1,000, q-gram length = 2, number of look-up tables per min-hash signature = 8, look-up table key length = 8, look-up table random bit string length = 64
    MMK Encoding [46]  | Number of match-keys = selected automatically depending on the best possible linkage quality
    2SH Encoding [44]  | Bit vector length = 1,000, q-gram length = 2, number of hash functions = 15
    SLK Encoding [32]  | Hashing method = SHA-2 [49]
    Table 5. Parameter Settings Used for the Encoding of Sensitive Values
    We next describe how we generated similarity graphs [66] to compute the similarity neighborhood vulnerability. We created one graph \(G = (V, E)\) each for both databases \(\mathbf {D}^s\) and \(\mathbf {D}^e\) , where V is the set of nodes and E is the set of edges connecting these nodes. V represents the set of unique values in a database and E the similarities between these values. For a given pair of plaintext values or encodings, the similarities between these values or encodings are calculated using a known similarity function, such as the Dice or Jaccard similarity [7]. Following the graph generation process described by Vidanage et al. [66], we have considered minimum similarity thresholds of 0.2, 0.3, and 0.4 when generating both plaintext and encoded graphs.
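    A minimal sketch of this graph generation over plaintext values, using q-gram sets and the Dice coefficient (an adjacency dictionary stands in for a graph library; the function names are ours):

```python
from itertools import combinations

def dice(a, b):
    """Dice coefficient of two q-gram sets."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def similarity_graph(values, min_sim=0.2, l_q=2):
    """Build an undirected similarity graph as an adjacency dict:
    nodes are the unique values, and an edge connects two values
    whose q-gram Dice similarity reaches the minimum threshold."""
    grams = {v: {v[i:i + l_q] for i in range(len(v) - l_q + 1)}
             for v in set(values)}
    graph = {v: {} for v in grams}
    for u, v in combinations(grams, 2):
        sim = dice(grams[u], grams[v])
        if sim >= min_sim:
            graph[u][v] = sim
            graph[v][u] = sim
    return graph
```

    For an encoded database the same construction applies, with a bit-vector similarity function in place of dice().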
    We implemented our framework using Python 2.7 and ran all experiments on a server with 64-bit Xeon 2.1 GHz 16-Core CPU, 512 GBytes of memory, and running Ubuntu 18.04. To facilitate repeatability the prototype programs are available at https://github.com/anushkavidanage/pprlVulnerabilityAnalysis.

    8.2 Results and Discussion

    In Tables 6 to 10, we show the percentages of vulnerable values calculated for the two databases using different vulnerabilities with the parameter setting \(k = [10, 20]\) and \(\psi ^{' } = [0\%, 1\%]\) . In Figures 2 to 7, we then summarize two aspects: (1) how the percentages of vulnerable values change with the variation of different k and \(\psi ^{' }\) settings and (2) the variation of the percentages of vulnerable values for a certain k and \(\psi ^{' }\) value pair over other parameter settings, which are different encoding techniques (BF [50], TMH [55], and 2SH [44]), attributes (FirstName, FirstName-LastName, FirstName-LastName-StreetAddress, and FirstName-LastName-StreetAddress-City), and plaintext representations ( \(\mathbf {Q}\) , \(\mathbf {T}\) , and \(\mathbf {S}\) ). Since the second aspect is also shown in Tables 6 to 10, in the following we discuss the first aspect in more detail when we describe Figures 2 to 7.
    Fig. 2.
    Fig. 2. The percentages of vulnerable plaintext values when varying k using the NCVR (top) and EURO (bottom) databases with \(\psi ^{' } = 0\) fixed. The y-axis represents the percentages of vulnerable values. In the box plots, the blue (dashed) and the red (solid) lines represent the mean and median values, respectively. The whiskers and outliers show the minimum and maximum percentages of vulnerable values ranging from 0% to 100%. The box in a plot begins at the first quartile and ends at the third quartile. This applies to all the remaining figures, Figures 3 to 7.
    Table 6.
    Within each attribute group, the four columns are k=10 (ψ'=0%, ψ'=1%) and k=20 (ψ'=0%, ψ'=1%).
                            FirstName                LastName                 StreetAddress            City
    NCVR Freq     Q  | 3.42  0.38  4.64  0.67 | 1.22  0.13  1.71  0.28 | 0.04  0.02  0.04  0.03 | 67.9  2.76  100   5.80
         Freq     T  | 3.34  0.41  4.70  0.68 | 1.14  0.14  1.71  0.26 | 1.42  0.06  2.16  0.08 | 70.9  2.92  97.3  7.17
         Freq     S  | 65.1  10.6  74.4  12.9 | 80.2  7.12  87.4  13.7 | 62.7  5.55  68.7  9.02 | 100   13.0  100   22.4
         Len      Q  | 0.06  0.06  0.40  0.40 | 0.04  0.04  0.11  0.11 | 0.03  0.03  0.06  0.06 | 3.18  3.18  3.18  3.18
         Len      T  | 0.18  0.18  0.18  0.18 | 0.05  0.05  0.05  0.05 | 0.05  0.05  0.13  0.13 | 0.93  0.93  7.04  7.04
         Neigh    Q  | 100   63.0  100   84.9 | 100   34.7  100   55.4 | 100   28.1  100   48.6 | 100   100   100   100
         Neigh    T  | 100   64.3  100   85.9 | 100   38.4  100   59.8 | 100   34.4  100   51.9 | 100   100   100   100
         Co-occur Q  | –     –     –     –    | –     –     –     –    | –     –     –     –    | –     –     –     –
         Co-occur T  | 4.85  4.85  4.85  4.85 | 1.09  1.09  1.09  1.09 | 0.11  0.01  0.18  0.03 | 95.2  10.0  100   12.9
         Co-occur S  | 5.97  0.19  8.68  0.54 | 2.13  0.06  3.57  0.10 | 2.11  0.02  3.42  0.02 | 18.8  0.40  28.9  0.77
         Sim      Q  | 0.02  0     0.03  0.01 | 0.01  0     0.02  0    | 0     0     0.01  0    | 7.35  2.0   16.7  3.65
         Sim      T  | 0.01  0     0.03  0.01 | 0.01  0     0.02  0    | 0.01  0     0.01  0    | 3.65  1.77  8.43  4.26
    EURO Freq     Q  | 13.6  4.15  15.7  7.9  | 11.9  8.16  13.6  10.4 | 0.50  0.50  1.75  1.75 | 19.0  3.0   69.1  11.2
         Freq     T  | 13.6  4.15  15.7  7.93 | 11.9  8.16  13.6  10.4 | 100   6.95  100   8.56 | 19.0  2.98  69.1  11.2
         Freq     S  | 72.2  19.0  93.0  31.2 | 76.9  15.5  94.7  26.1 | 100   12.9  100   23.5 | 98.3  4.01  100   15.0
         Len      Q  | 0.37  0.37  0.37  0.37 | 0.87  0.87  0.87  0.87 | 0     0     0     0    | 1.32  1.32  1.32  1.32
         Len      T  | 0.37  0.37  0.37  0.37 | 0.87  0.87  0.87  0.87 | 21.93 21.93 27.8  27.8 | 1.32  1.32  1.32  1.32
         Neigh    Q  | 100   97.9  100   99.6 | 98.6  93.0  100   95.0 | 99.5  52.5  100   70.0 | 72.6  45.7  82.1  58.9
         Neigh    T  | 100   97.9  100   99.6 | 98.6  92.7  100   95.0 | 100   100   100   100  | 72.6  45.7  82.1  58.9
         Co-occur Q  | –     –     –     –    | –     –     –     –    | –     –     –     –    | –     –     –     –
         Co-occur T  | –     –     –     –    | –     –     –     –    | 8.75  1.69  11.7  3.37 | –     –     –     –
         Co-occur S  | 10.4  0.56  15.4  1.45 | 13.0  0.50  19.2  1.21 | 8.49  0.17  14.2  0.28 | 9.41  1.13  22.6  2.09
         Sim      Q  | 0.09  0.05  0.38  0.21 | 0.43  0.27  1.14  0.98 | 0.01  0     0.02  0.01 | 1.72  1.72  4.72  4.72
         Sim      T  | 0.09  0.05  0.38  0.21 | 0.43  0.27  1.14  0.98 | 6.57  6.57  6.57  6.57 | 1.72  1.72  4.72  4.72
    Table 6. The Vulnerability Results of Plaintext QID Values \(\mathbf {Q}\) , Tokens \(\mathbf {T}\) , and Token Substrings \(\mathbf {S}\) in the NCVR and EURO Databases, Calculated for Different k and \(\psi ^{' }\) Parameter Settings and for Different Attributes
    A – indicates that the vulnerability calculations are not applicable for that specific vulnerability and attribute combination.
    First, in Table 6, we show the percentages of vulnerable plaintext values calculated for both databases, NCVR and EURO. As can be seen, with larger values of k and smaller values of \(\psi ^{' }\) the percentages of vulnerable values increase for the frequency, co-occurrence, and neighborhood vulnerabilities. This trend can also be seen in Figure 2, where we fix \(\psi ^{' } = 0\) and vary \(k,\) and in Figure 3, where we fix \(k = 50\) and vary \(\psi ^{' }\) . We selected fixed \(\psi ^{' } = 0\) and \(k = 50\) values because with those parameter settings we were able to illustrate the variations of the percentages of vulnerable values more clearly. When increasing k and decreasing \(\psi ^{' }\) , there are more vulnerable values in both databases. This is because when k increases, the minimum size of the plaintext value set that is assumed to be distinguishable by an adversary for a given vulnerability increases. On the other hand, with decreasing values of \(\psi ^{' }\) , the difference between plaintext values (in a given vulnerability) that are assumed to be distinguishable by an adversary decreases. In both of these instances, the number of vulnerable plaintext values in the database increases.
    Fig. 3.
    Fig. 3. The percentages of vulnerable plaintext values when varying \(\psi ^{' }\) using the NCVR (top) and EURO (bottom) databases with \(k = 50\) fixed. The y-axis represents the percentages of vulnerable values.
    For the length, co-occurrence, and similarity vulnerabilities, however, the percentages of vulnerable values stayed approximately the same for all three lists \(\mathbf {Q}\) , \(\mathbf {T}\) , and \(\mathbf {S}\) in Table 6. This is because no additional values in these two plaintext databases (NCVR and EURO) become vulnerable when moving from the parameter values \(k = 10\) and \(\psi ^{' } = 1\%\) to \(k = 20\) and \(\psi ^{' } = 0\%\) . However, as can be seen in Figures 2 and 3, with larger values of k, such as \(k = [50, 100]\) , there are more vulnerable plaintext values, and with larger values of \(\psi ^{' }\) there are fewer vulnerable plaintext values. As shown in Table 6, the co-occurrence vulnerability cannot be calculated for the lists \(\mathbf {Q}\) and \(\mathbf {T}\) when only one attribute is considered, except for the list \(\mathbf {T}\) with the attribute street address. This is because, unlike other attributes, QID values in street address generally contain multiple tokens, as we discussed in Section 3.
    In Table 7, we show the co-occurrence vulnerability for different plaintext attribute combinations. As can be seen, the substrings in the list \(\mathbf {S}\) generally resulted in higher vulnerability percentages than the other two lists of plaintext values, \(\mathbf {Q}\) and \(\mathbf {T}\) . This is because substrings (q-grams of length 2 in these experiments) have more co-occurrence frequency information that can be uniquely identified with respect to the defined k and \(\psi ^{' }\) values compared to QIDs and tokens. The vulnerability percentages in Table 7 are smaller than the percentages in Table 6 because Table 7 presents the co-occurrence vulnerability of different combinations of attributes, whereas Table 6 presents the co-occurrence vulnerability of individual attributes. With combinations of attributes, there can be more tokens or token substrings associated with individual records, which leads to fewer unique co-occurrence frequencies. This is not the case with individual attributes.
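    The co-occurrence counting underlying these results can be sketched as follows (our illustration, not the framework's implementation): for every record, all pairs of q-grams it contains are counted, and pairs with (near-)unique counts are the candidates an attack can align.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(records, l_q=2):
    """Count how often each pair of q-grams appears together in a
    record, across all records. Pairs with a (near-)unique count are
    the co-occurrence-vulnerable candidates an attack can align."""
    pair_counts = Counter()
    for rec in records:
        # Use the set of q-grams per record, sorted so each pair
        # is counted under a canonical ordering.
        grams = sorted({rec[i:i + l_q]
                        for i in range(len(rec) - l_q + 1)})
        pair_counts.update(combinations(grams, 2))
    return pair_counts
```
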
    Table 7.
    Within each attribute combination, the four columns are k=10 (ψ'=0%, ψ'=1%) and k=20 (ψ'=0%, ψ'=1%).
                            FirstName,               LastName,                FirstName, LastName,     FirstName, LastName,
                            LastName                 City                     StreetAddress            StreetAddress, City
    NCVR Co-occur Q  | 0.02  0.02  0.05  0.05 | 0.10  0.05  0.15  0.07 | 0.01  0.01  0.02  0.02 | 0.03  0.01  0.04  0.01
         Co-occur T  | 0.02  0.02  0.03  0.03 | 0.20  0.02  0.31  0.03 | 0.05  0.01  0.08  0.01 | 0.04  0     0.08  0.01
         Co-occur S  | 2.74  0.05  4.35  0.11 | 3.49  0.05  5.10  0.10 | 2.27  0.01  3.86  0.02 | 3.01  0.02  4.92  0.03
    EURO Co-occur Q  | 0.04  0.04  0.04  0.04 | 0.09  0.09  0.09  0.09 | 0.01  0.01  0.03  0.03 | 0.01  0.01  0.01  0.01
         Co-occur T  | 0.04  0.04  0.04  0.04 | 0.09  0.09  0.09  0.09 | 0.63  0.04  0.83  0.09 | 0.33  0.03  0.39  0.05
         Co-occur S  | 3.27  0.07  5.09  0.10 | 0.91  0.01  1.92  0.04 | 2.98  0.03  4.71  0.07 | 1.36  0.01  2.07  0.03
    Table 7. The Co-occurrence Vulnerability Results for Plaintext QID Values \(\mathbf {Q}\) , Tokens \(\mathbf {T}\) , and Token Substrings \(\mathbf {S}\) in the NCVR and EURO Databases, Calculated for Different k and \(\psi ^{' }\) Parameter Settings and for Different Attribute Combinations
    In Tables 8 to 10, we show the vulnerability and assignability percentage results for five PPRL encoding techniques, and in Figures 4 to 7, we illustrate the variation of the number of vulnerable and assignable encodings with different settings of k and \(\psi ^{' }\) . Similar to the plaintext vulnerability results, it can be seen that with increasing values of k and decreasing values of \(\psi ^{' }\) , the percentages of vulnerable encodings increase. As we showed in Table 3, not all vulnerability aspects can be assessed for each encoding technique. For instance, the co-occurrence vulnerability for BF encoding can only be calculated if attribute-level BF encoding [50] is used with attribute combinations that consist of two or more attributes. Similarly, only the frequency vulnerability can be assessed for SLK encoding, while only the frequency and co-occurrence vulnerabilities can be assessed for MMK encoding.
    Fig. 4.
    Fig. 4. The percentages of vulnerable encoded values when varying k using the NCVR (top) and EURO (bottom) databases with \(\psi ^{' } = 0\) fixed. The y-axis represents the percentages of vulnerable values.
    Table 8.
    Within each attribute combination, the four columns are k=10 (ψ'=0%, ψ'=1%) and k=20 (ψ'=0%, ψ'=1%).
                              FirstName                FirstName,               FirstName, LastName,     FirstName, LastName,
                                                       LastName                 StreetAddress            StreetAddress, City
    NCVR Freq     BF   | 3.43  0.38  4.67  0.67 | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Freq     TMH  | 3.43  0.38  4.67  0.67 | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Freq     2SH  | 3.43  0.38  4.67  0.67 | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Len      BF   | 1.78  0.30  3.63  0.62 | 0.24  0.04  0.63  0.05 | 0.25  0.03  0.60  0.06 | 0.25  0.03  0.58  0.06
         Len      TMH  | 0.88  0.25  1.31  0.55 | 0.07  0.02  0.14  0.04 | 0.08  0.02  0.16  0.05 | 0.08  0.02  0.14  0.04
         Len      2SH  | 1.62  0.30  2.99  0.53 | 0.19  0.03  0.58  0.04 | 0.31  0.03  0.66  0.07 | 0.30  0.02  0.56  0.04
         Neigh    BF   | 100   49.3  100   70.6 | 100   3.15  100   6.23 | 100   3.57  100   6.24 | 99.9  4.3   99.9  8.4
         Neigh    TMH  | 69.3  17.5  87.5  25.5 | 100   10.3  100   18.9 | 100   8.29  100   16.6 | 99.9  47.1  99.9  59.3
         Neigh    2SH  | 100   47.2  100   77.4 | 100   9.5   100   18.3 | 100   12.7  100   24.9 | 100   74.5  100   80.2
         Co-occur BF   | –     –     –     –    | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.02  0.02 | 0.02  0.01  0.04  0.01
         Sim      BF   | 1.65  0     3.26  0.01 | 0.09  0     0.18  0    | 0.13  0     0.26  0    | 0.08  0     0.15  0
         Sim      TMH  | 0.01  0     0.02  0    | 0     0     0     0    | 0     0     0.01  0    | 0.01  0     0.01  0
         Sim      2SH  | 1.40  0     2.62  0    | 0.11  0     0.22  0    | 0.45  0     0.91  0    | 0.10  0     0.17  0
    EURO Freq     BF   | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Freq     TMH  | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Freq     2SH  | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Len      BF   | 9.79  1.62  21.4  6.70 | 0.76  0.13  1.88  0.30 | 0.69  0.10  1.88  0.19 | 0.57  0.14  1.73  0.21
         Len      TMH  | 5.77  1.34  11.3  4.30 | 0.42  0.13  1.04  0.26 | 0.34  0.09  0.77  0.29 | 0.32  0.15  0.81  0.24
         Len      2SH  | 3.42  0.79  11.6  1.76 | 0.92  0.08  1.91  0.32 | 0.83  0.11  1.65  0.21 | 0.78  0.10  1.59  0.23
         Neigh    BF   | 100   98.4  100   100  | 100   30.5  100   52.0 | 100   15.2  100   25.7 | 100   34.7  100   50.8
         Neigh    TMH  | 100   98.2  100   100  | 100   20.7  100   33.5 | 100   18.3  100   28.3 | 100   34.6  100   52.9
         Neigh    2SH  | 100   98.5  100   100  | 100   12.1  100   22.7 | 100   19.6  100   28.2 | 100   45.8  100   69.6
         Co-occur BF   | –     –     –     –    | 0.04  0.04  0.04  0.04 | 0.01  0.01  0.03  0.03 | 0.01  0.01  0.01  0.01
         Sim      BF   | 32.3  0.04  47.3  0.16 | 0.44  0     0.94  0    | 0.77  0     1.86  0    | 3.05  0     6.43  0
         Sim      TMH  | 2.30  0.10  8.36  0.17 | 0     0     0.01  0    | 0     0     0     0    | 0     0     0.01  0
         Sim      2SH  | 21.8  0.11  36.6  0.18 | 0.51  0     1.14  0    | 1.47  0     3.55  0    | 5.03  0     9.50  0
    Table 8. The Vulnerability Results for the Encoded Values in \(\mathbf {E}\) Calculated for Different k and \(\psi ^{' }\) Parameter Settings and for Encoding Techniques BF [50], TMH [55], and 2SH [44]
    A – indicates that the vulnerability calculations are not applicable for that specific vulnerability and attribute combination.
    As can be seen from Table 8 and from Figures 4 and 5, the similarity neighborhood vulnerabilities of encodings are higher than the other vulnerabilities. This is because the encodings are more distinguishable by their similarity neighborhood feature values than by their frequencies, lengths, similarities, or co-occurrences. Furthermore, with the frequency vulnerability, when three or four attributes (especially including street address) are encoded, no or only a very small number of encodings were found to be vulnerable because each encoded value was unique in those instances. Comparing Figure 4 with Figure 5, it can be seen that the number of encodings vulnerable to the similarity neighborhood vulnerability stayed approximately the same with changing values of k. These results illustrate that the parameter \(\psi ^{' }\) has more influence on the similarity neighborhood vulnerability than the parameter k.
    Fig. 5.
    Fig. 5. The percentages of vulnerable encoded values when varying \(\psi ^{' }\) using the NCVR (top) and EURO (bottom) databases with \(k = 50\) fixed. The y-axis represents the percentages of vulnerable values.
    In Table 9 and in Figures 6 and 7, we show the assignability percentage results of encodings. As can be seen, the frequency assignabilities are high compared to the length, co-occurrence, similarity, and similarity neighborhood assignabilities. With co-occurrence, the vulnerability of encodings is already low and thus it is expected to have low assignability percentages. The lengths of encodings are also different when compared with the lengths of the corresponding plaintext values. As shown in previous studies [44, 66], the similarity of two encodings will not be the same as the similarity of the two corresponding plaintext values in those encodings. For these reasons, the assignability percentage results for length, similarity, and similarity neighborhood vulnerabilities will be low.
    Fig. 6.
    Fig. 6. The percentages of assignable encoded values when varying k using the NCVR (top) and EURO (bottom) databases with \(\psi ^{' } = 0\) fixed. The y-axis represents the percentages of vulnerable values.
    Fig. 7.
    Fig. 7. The percentages of assignable encoded values when varying \(\psi ^{' }\) using the NCVR (top) and EURO (bottom) databases with \(k = 50\) fixed. The y-axis represents the percentages of vulnerable values.
    Table 9.
    Within each attribute combination, the four columns are k=10 (ψ'=0%, ψ'=1%) and k=20 (ψ'=0%, ψ'=1%).
                              FirstName                FirstName,               FirstName, LastName,     FirstName, LastName,
                                                       LastName                 StreetAddress            StreetAddress, City
    NCVR Freq     BF   | 3.43  0.38  4.67  0.67 | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Freq     TMH  | 3.43  0.38  4.67  0.67 | 0.01  0.01  0.02  0.02 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Freq     2SH  | 3.43  0.38  4.67  0.67 | 0.02  0.02  0.05  0.05 | 0.01  0.01  0.01  0.01 | 0.01  0.01  0.01  0.01
         Len      BF   | 0.01  0.02  0.04  0.12 | 0     0.01  0     0.01 | 0     0.02  0     0.03 | 0     0.01  0     0.02
         Len      TMH  | 0.01  0.04  0.01  0.04 | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
         Len      2SH  | 0.01  0.03  0.02  0.08 | 0.01  0.02  0.01  0.02 | 0     0.02  0     0.04 | 0     0.01  0     0.02
         Neigh    BF   | 0     0.01  0     0.02 | 0     0     0     0    | 0     0.01  0     0.01 | 0     0     0     0
         Neigh    TMH  | 0     0.12  0     0.12 | 0     0     0     0    | 0     0     0     0.01 | 0     0     0     0
         Neigh    2SH  | 0     0.02  0     0.02 | 0     0     0     0.02 | 0     0     0     0    | 0     0     0     0
         Co-occur BF   | –     –     –     –    | 0.01  0.01  0.02  0.02 | 0     0     0.01  0.01 | 0     0     0     0
         Sim      BF   | 0     0     0     0    | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
         Sim      TMH  | 0     0     0     0    | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
         Sim      2SH  | 0     0     0     0    | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
    EURO Freq     BF   | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Freq     TMH  | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Freq     2SH  | 13.5  5.96  15.7  10.0 | 0.04  0.04  0.04  0.04 | 0     0     0     0    | 0     0     0     0
         Len      BF   | 0.05  0.14  0.05  0.14 | 0.02  0.02  0.02  0.11 | 0.01  0.02  0.01  0.08 | 0.01  0.02  0.01  0.02
         Len      TMH  | 0.05  0.09  0.05  0.18 | 0.02  0.02  0.02  0.03 | 0.01  0.03  0.01  0.03 | 0.01  0.01  0.01  0.01
         Len      2SH  | 0.05  0.05  0.05  0.05 | 0.01  0.02  0.01  0.05 | 0.01  0.05  0.01  0.05 | 0.01  0.04  0.01  0.04
         Neigh    BF   | 0     0.05  0     0.05 | 0     0.03  0     0.03 | 0     0.02  0     0.04 | 0     0     0     0.02
         Neigh    TMH  | 0     0.19  0     0.19 | 0     0.01  0     0.01 | 0     0     0     0    | 0     0.01  0     0.01
         Neigh    2SH  | 0     0.05  0     0.05 | 0     0.01  0     0.01 | 0     0.01  0     0.01 | 0     0.01  0     0.01
         Co-occur BF   | –     –     –     –    | 0.03  0.03  0.03  0.03 | 0     0     0.03  0.03 | 0.01  0.01  0.01  0.01
         Sim      BF   | 0.01  0     0.05  0.09 | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
         Sim      TMH  | 0     0     0     0.08 | 0     0     0     0    | 0     0     0     0    | 0     0     0     0
         Sim      2SH  | 0.01  0     0.12  0.10 | 0     0     0     0    | 0     0     0     0    | 0.01  0     0.02  0
    Table 9. The Assignability Results for the Encoded Values in \(\mathbf {E}\) Calculated for Different k and \(\psi ^{' }\) Parameter Settings and for Encoding Techniques BF [50], TMH [55], and 2SH [44]
    A – indicates that the vulnerability calculations are not applicable for that specific vulnerability and attribute combination.
    Furthermore, by analyzing the values and trends in Figure 7, we observed that for the length, similarity, and similarity neighborhood vulnerabilities the percentages of assignable encodings do not change consistently with the variation of \(\psi ^{' }\) . This is because unlike the vulnerability calculations, the assignability results depend on external aspects such as the difference in similarity calculations of both plaintext values and encodings, and the differences in the lengths of encodings compared to plaintext values. However, this does not affect frequency and co-occurrence assignabilities because these are based on frequency calculations and the employed encoding techniques do not change frequencies of plaintext values and encodings significantly.
    As shown in Table 10, the SLK encoding has no frequency vulnerable encodings with the EURO database and only a few vulnerable encodings with the NCVR database because almost all SLK encodings were unique (had a frequency of 1) for these specific databases. While some match-keys in the MMK encoding had a small number ( \(\lt 1\%\) ) of frequency vulnerable encodings, no co-occurrence vulnerable encodings were found for MMK. These vulnerability assessment results confirm the findings and observations in previous studies on PPRL attacks [9, 11, 66, 68].
    Table 10.
    Each cell shows vulnerability / assignability percentages. The SLK column uses the fixed SLK attributes (Section 8.1); the four MMK columns use the listed attribute combinations.
                               SLK         | MMK:          | MMK:           | MMK:                | MMK:
                                           | FirstName,    | StreetAddress, | LastName,           | FirstName, LastName,
                                           | StreetAddress | City           | StreetAddress, City | StreetAddress
    NCVR Freq k=10, ψ'=0% | 0 / 0      | 0 / 0         | 0.04 / 0.01    | 0 / 0               | 0.01 / 0.01
              k=10, ψ'=1% | 0 / 0      | 0 / 0         | 0.02 / 0       | 0 / 0               | 0.01 / 0.01
              k=20, ψ'=0% | 0.2 / 0.2  | 0.02 / 0.02   | 0.04 / 0.01    | 0 / 0               | 0.01 / 0.01
              k=20, ψ'=1% | 0.2 / 0.2  | 0.02 / 0.02   | 0.03 / 0       | 0 / 0               | 0.01 / 0.01
    EURO Freq k=10, ψ'=0% | 0 / 0      | 0 / 0         | 0 / 0          | 0.05 / 0.05         | 0 / 0
              k=10, ψ'=1% | 0 / 0      | 0 / 0         | 0 / 0          | 0.05 / 0.05         | 0 / 0
              k=20, ψ'=0% | 0 / 0      | 0.04 / 0.04   | 0 / 0          | 0.05 / 0.05         | 0 / 0
              k=20, ψ'=1% | 0 / 0      | 0.04 / 0.04   | 0 / 0          | 0.05 / 0.05         | 0 / 0
    Table 10. The Vulnerability/Assignability Results of the Encoded Values in \(\mathbf {E}\) Calculated for Different k and \(\psi ^{' }\) Parameter Settings and for Encoding Techniques SLK [32] and MMK [46]
    It is worth noting that we expect both the vulnerability and assignability results for frequency, co-occurrence, and similarity neighborhood to increase with larger databases. This is because with a larger number of records, there will be more unique frequencies (both individual and co-occurring) and similarity neighborhoods that can be exploited by an attack on PPRL.

    8.3 Recommendations

    Based on our findings, we now provide some recommendations related to assessing the vulnerabilities of sensitive databases for data custodians who are working in the context of PPRL.
    We first recommend that users conduct the plaintext vulnerability assessment using our proposed framework on their sensitive databases before participating in a PPRL project. We also recommend assessing the vulnerabilities associated with different PPRL encoding techniques using each database that is expected to be used in a PPRL project. This is because different databases will have different vulnerabilities due to their varying data characteristics, such as the frequency or length distributions of their QID values. We further recommend analyzing different combinations of attributes with different PPRL techniques, because attribute combinations will also impact the vulnerabilities in an encoded database that might be exploited in an attack.
    We recommend that the owners and/or users of sensitive databases who plan to participate in a PPRL project carefully assess the policies and regulations they are required to follow (as well as the privacy requirements of the linkage project [8]) to identify suitable values for k and \(\psi\) . Larger values for k and smaller values for \(\psi\) will identify more plaintext values and encodings as being vulnerable. For instance, \(\psi\) can be used to characterize the confidence of an adversary who will likely attack the sensitive database that is being assessed. If we assume the adversary has access to a plaintext database with the same set of QID values we used for the encoding/linkage, then we can set \(\psi = 0\) , characterizing an adversary with high confidence. In such a scenario the vulnerability assessment becomes similar to a k-anonymity calculation, since the only variable parameter is k. On the other hand, if we assume the adversary does not have access to the same set of QID values, then we can set a value \(\psi \gt 0\) , characterizing an adversary with low confidence.
    Similarly, the parameter k can be used to characterize the ability of an adversary to uniquely assign sensitive information to real-world entities. For instance, when assessing an encoded database, if we set \(k = 10\) , then we assume that (1) an assignment of a plaintext value to \(\lt 10\) encodings would reveal some information to an adversary, making the encoded database vulnerable, while (2) an assignment of a plaintext value to \(\ge 10\) encodings would not reveal any sensitive information to an adversary.
    Therefore, we suggest that the database owners define an upper limit for k and a lower limit for \(\psi\) that are acceptable for their sensitive data and the assumptions they make about any potential adversaries. This means that a potential adversary should not be able to reidentify vulnerable plaintext or encoded values that have larger values of k or smaller values of \(\psi\) . Furthermore, specifying such limits ensures that if no vulnerable plaintext or encoded values are identified for the defined values of k and \(\psi\) , then neither plaintext nor encoded values will be vulnerable for smaller k and larger \(\psi\) . For example, in a given database, if no vulnerable plaintext or encoded values are identified for \(k = 10\) and \(\psi = 100\) , then no plaintext or encoded values will be vulnerable for \(k \lt 10\) and \(\psi \gt 100\) .
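    To make the roles of k and \(\psi\) concrete, a frequency-based vulnerability check along these lines can be sketched as follows. This is a minimal illustration only; the function name, the use of simple value counts, and the toy data are assumptions for exposition, not the actual implementation of the framework:

```python
from collections import Counter

def frequency_vulnerable(encoded_db, plaintext_db, k, psi):
    """Flag encoded values that can be aligned with fewer than k plaintext
    values whose frequencies differ from theirs by at most psi."""
    enc_freq = Counter(encoded_db)
    pt_freq = Counter(plaintext_db)
    vulnerable = {}
    for enc_val, f_enc in enc_freq.items():
        # Candidate plaintext values with a similar frequency (within psi)
        candidates = [p for p, f_pt in pt_freq.items()
                      if abs(f_pt - f_enc) <= psi]
        if 0 < len(candidates) < k:
            vulnerable[enc_val] = candidates
    return vulnerable

# With psi = 0 (a high-confidence adversary) and k = 2, every encoding whose
# frequency matches exactly one plaintext frequency is flagged as vulnerable.
enc = ['x1', 'x1', 'x1', 'x2', 'x3', 'x3']
pt = ['ann', 'ann', 'ann', 'bob', 'eve', 'eve']
print(frequency_vulnerable(enc, pt, k=2, psi=0))
# → {'x1': ['ann'], 'x2': ['bob'], 'x3': ['eve']}
```

    With a larger \(\psi\) each encoding gains more candidates, and with a smaller k fewer candidate sets count as vulnerable, matching the limits discussed above.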

    9 Conclusion and Future Work

    We have presented a novel framework for assessing the vulnerabilities associated with both plaintext and encoded databases as used in the context of PPRL. Our proposed framework evaluates the vulnerabilities of a database using five characteristics of its plaintext values and encodings, namely frequency, length, co-occurrence, similarity, and similarity neighborhood, that can be exploited by an attack. In an experimental study we applied our proposed framework to a real-world and a synthetic database to illustrate that it can provide valuable information about the vulnerabilities of a database to be encoded, as well as of the PPRL techniques used for this encoding. Given the increasing demand for secure PPRL encoding techniques, and the need to assess the vulnerabilities of the databases encoded using these techniques, such a systematic vulnerability assessment framework can be highly beneficial to custodians of sensitive databases.
    As future work, we plan to analyze how different database sizes affect the vulnerabilities of plaintext as well as encoded values. We will also evaluate how different combinations of parameter settings of encoding techniques impact the vulnerabilities of the corresponding encoded values. Further, we plan to conduct evaluations of different hardening techniques [45, 52, 53, 62] that are used to strengthen the privacy guarantees of BF encoded databases. Finally, as a future research direction we plan to develop PPRL attacks that exploit all five dimensions, including those that have not yet been explored.

    A Assessing Disclosure Risk in Microdata

    In this appendix we describe the main types of disclosure risks, as well as disclosure risk measures, that have been proposed in the contexts of both data publishing and statistical disclosure control.

    A.1 Types of Disclosure Risks

    A disclosure takes place when a person or a system learns sensitive information about an individual or a QID attribute value that is not meant to be learned [23, 57]. There are two main types of disclosure risks: attribute disclosure risk and identity disclosure risk [1, 17]. Attribute disclosure risk measures the risk of reidentifying one or more sensitive attribute value(s) in a database. In attribute disclosure, the adversary is able to reidentify the plaintext value(s), such as first name or street address, that have been either hidden or encoded. However, attribute disclosure does not necessarily imply that the adversary is capable of reidentifying the actual real-world entities represented by these sensitive attribute values.
    Identity disclosure, on the other hand, is defined as the risk of reidentifying actual real-world entities. If an adversary is able to reidentify individual entities (usually people in the context of PPRL) in an encoded database using a privacy attack, then their identities are compromised [57].

    A.2 Assessing Disclosure Risk

    Three disclosure risk measures for publishing of microdata have been proposed by Truta et al. [59] to analyze different disclosure control techniques. The three disclosure risk measures are minimal disclosure risk, maximal disclosure risk, and weighted disclosure risk. The minimal disclosure risk measures the ratio of unique records (records that have a unique set of sensitive attribute values) in an encoded database. The maximal disclosure risk measures the ratio of distinct record clusters (groups of records based on their sets of sensitive attribute values) in the encoded database. The weighted disclosure risk uses a weighting approach to increase the importance of unique values in a record over the rest of the values in that record, and to increase the importance of records with smaller frequencies (calculated based on a set of QID attribute values) over the remaining records with high frequencies in the encoded database [59].
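    The first two of these measures can be sketched as follows. This is a minimal illustration; the function names and the tuple-based grouping of records are our assumptions, not the exact formulation of Truta et al. [59]:

```python
from collections import Counter

def minimal_disclosure_risk(records):
    """Ratio of unique records: records whose set of attribute
    values occurs exactly once in the database."""
    counts = Counter(tuple(r) for r in records)
    uniques = sum(1 for c in counts.values() if c == 1)
    return uniques / len(records)

def maximal_disclosure_risk(records):
    """Ratio of distinct record clusters: groups of records that
    share the same set of attribute values."""
    counts = Counter(tuple(r) for r in records)
    return len(counts) / len(records)

db = [('ann', '2001'), ('ann', '2001'), ('bob', '1999'), ('eve', '1985')]
# Two of the four records are unique ('bob' and 'eve'), so the minimal
# risk is 2/4 = 0.5; there are three distinct clusters, so the maximal
# risk is 3/4 = 0.75.
```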
    A set of empirical disclosure risk measures for microdata and tabular data was proposed by Domingo-Ferrer and Torra [16]. Unlike most other measures, which are based on the uniqueness of records, these measures are based on how records can be linked between a plaintext and an encoded database. In the first method, records are linked based on distances calculated between records in the plaintext and encoded databases. In the second method, probabilistic record linkage is used to find matching record pairs across the plaintext and encoded databases [16], where the matching of records is solved as a linear sum assignment problem. For both of these methods, the percentage of correctly linked encoded and plaintext record pairs is considered as a measure of disclosure risk. In the last method, named interval disclosure, the values in each QID attribute are first ranked independently based on their frequencies, and a rank interval is then defined around each value. Using a threshold that is defined to measure the differences between ranks, one can then align plaintext and encoded values in ranked QID attributes based on their intervals. The proportion of plaintext values in a given QID attribute that fall into the intervals of their corresponding encoded values is considered as a measure of disclosure risk for that attribute.
    A disclosure risk assessment via record linkage was proposed by Domingo-Ferrer et al. [15] assuming a maximum-knowledge adversary. The authors have considered an adversary who has access to both the plaintext and corresponding encoded databases along with the details of all QID attribute values of the entities in those databases. The authors assumed that the adversary aims to link records across the encoded and plaintext databases to reidentify encoded records. To assess the disclosure risk in such a scenario, first the distribution of distances between record pairs across plaintext and encoded databases is calculated. This distribution is then compared with another distribution of distances that is known to correspond to a database with no disclosure risks. The more similar the two distributions, the more difficult it will be for the adversary to correctly link records between encoded and plaintext databases.
    In a recent survey, Taylor et al. [57] discussed five measures to assess disclosure risks of microdata before being released for analytical purposes. These measures assume that the adversary has access to a global population where the microdata represent a subset (a sample) of entities in this population, such that each record in the sensitive microdata can be matched with a record in the population. The five measures discussed are (1) the expected number of population uniques, (2) the expected number of sample uniques that are population uniques, (3) the expected number of correct matches among sample uniques, (4) the probability of a correct match given a unique match, and (5) the probability of a correct match. Here, a match in the third, fourth, and fifth measures refers to a record in the microdata and a record in the population both having the same set of values for a given set of QID attributes.
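    As an illustration of the second measure, sample uniques that are also population uniques can be counted naively as follows (the function name and the representation of records as tuples of QID values are our assumptions):

```python
from collections import Counter

def sample_unique_population_unique(sample, population):
    """Return the number of sample uniques and, among those, the number
    that are also population uniques (measure (2) above)."""
    s_counts = Counter(sample)
    p_counts = Counter(population)
    sample_uniques = [r for r, c in s_counts.items() if c == 1]
    both_unique = [r for r in sample_uniques if p_counts.get(r, 0) == 1]
    return len(sample_uniques), len(both_unique)

population = [('ann', 'f'), ('ann', 'f'), ('bob', 'm'), ('eve', 'f')]
sample = [('ann', 'f'), ('bob', 'm')]
# 'ann' is unique in the sample but appears twice in the population,
# while 'bob' is unique in both, so the result is (2, 1).
```

    A record that is unique in the sample but common in the population carries little reidentification risk; it is the records unique in both that an adversary can exploit.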

    A.3 Disclosure Control Methods

    Different data anonymization methods can be used to control the disclosure risks of values in a sensitive database. These anonymization methods either mask or alter the QID attribute values in records using different techniques in order to minimize the risk of reidentifying those QID attribute values or individuals.
    k-anonymity is a popular data anonymization technique used in both privacy-preserving data mining and PPRL applications to control the disclosure risks of values in a sensitive database [28, 29, 42, 56]. In k-anonymity, the anonymization of records is conducted by generalizing or masking certain QID attribute values. In generalization, individual QID attribute values in the database are replaced by a broader category (e.g., date of birth is replaced by year of birth), while in masking, certain sensitive and unique QID attribute values are replaced by a specific character, such as “*.” A database can be described as k-anonymized if an individual entity represented by a record cannot be distinguished from at least \(k-1\) other individual entities whose records also appear in the same database. However, research has shown that k-anonymized databases can be susceptible to certain types of privacy attacks, such as background knowledge and homogeneity attacks [38, 40]. To address the limitations and privacy concerns of k-anonymity [40, 70], several improved generalization methods have been proposed, including p-sensitive k-anonymity, l-diversity, and t-closeness [38, 40, 60]. These methods aim to introduce diversity among the QID attribute values grouped using k-anonymity by reducing the granularity of data.
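    A simple check of whether a database satisfies k-anonymity over a chosen set of QID attributes can be sketched as follows (a minimal illustration; the function name and the record layout are our assumptions):

```python
from collections import Counter

def is_k_anonymous(records, qid_indices, k):
    """A database is k-anonymous if every combination of QID attribute
    values is shared by at least k records."""
    groups = Counter(tuple(r[i] for i in qid_indices) for r in records)
    return all(count >= k for count in groups.values())

# Generalized/masked records: (birth decade, occupation, sensitive value)
db = [
    ('19*', 'engineer', 'flu'),
    ('19*', 'engineer', 'cold'),
    ('20*', 'teacher', 'flu'),
    ('20*', 'teacher', 'flu'),
]
# Using the first two attributes as QIDs, each group contains 2 records,
# so this database is 2-anonymous but not 3-anonymous.
```

    Note that the last group above also illustrates the homogeneity problem: both records of the ('20*', 'teacher') group share the sensitive value 'flu', which is what l-diversity and t-closeness aim to prevent.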
    Differential privacy is a concept proposed to control disclosure risk by allowing the implementation of techniques that provide strong privacy guarantees against record reidentification [21]. Differential privacy aims to address the problem of delivering useful information about a population while revealing no harmful information about the individuals in that population [21, 39]. Differential privacy is usually achieved by introducing certain amounts of random noise into the results of queries on sensitive data. By adding noise to query results, an adversary who observes those results will not be able to uniquely identify sensitive information of an individual entity [39]. Methods such as Laplace noise [22] and randomized response [71] have been used to add noise to data in a probabilistic way. Recently differential privacy has been employed in PPRL applications to ensure the privacy of entities whose information is stored in a sensitive database [6, 27].
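    As an illustration, the Laplace mechanism for a counting query can be sketched as follows. This is a minimal sketch using inverse transform sampling; the function name and parameters are our assumptions, not a production-ready implementation (which would need, e.g., a cryptographically secure noise source):

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random.random):
    """Return a differentially private query answer by adding Laplace
    noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    u = rng() - 0.5  # uniform in [-0.5, 0.5)
    # Inverse CDF of the Laplace(0, scale) distribution
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_answer + noise

# A counting query has sensitivity 1: adding or removing one record
# changes the count by at most 1.
ages = [23, 35, 41, 29, 52]
noisy_count = laplace_mechanism(len(ages), sensitivity=1, epsilon=0.5)
```

    A smaller epsilon yields a larger noise scale and thus stronger privacy at the cost of accuracy, which is the trade-off underlying the PPRL applications of differential privacy cited above.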

    References

    [1]
    Athanasios Andreou, Oana Goga, and Patrick Loiseau. 2017. Identity vs. attribute disclosure risks for users with multiple social profiles. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM’17). IEEE/ACM, 163–170.
    [2]
    Yonatan Aumann and Yehuda Lindell. 2007. Security against covert adversaries: Efficient protocols for realistic adversaries. In Theory of Cryptography Conference (TCC’07). Springer, 137–156.
    [3]
    Mihir Bellare, Ran Canetti, and Hugo Krawczyk. 1996. Keying hash functions for message authentication. Advances in Cryptology (CRYPTO’96). Springer, Berlin, 1–15.
    [4]
    Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422–426.
    [5]
    James H. Boyd, Sean M. Randall, and Anna M. Ferrante. 2015. Application of privacy-preserving techniques in operational record linkage centres. In Medical Data Privacy Handbook. Springer, Cham, 267–287.
    [6]
    Jianneng Cao, Fang-Yu Rao, Elisa Bertino, and Murat Kantarcioglu. 2015. A hybrid private record linkage scheme: Separating differentially private synopses from matching records. International Conference on Data Engineering (ICDE’15). IEEE, New York, NY, 1011–1022.
    [7]
    Peter Christen. 2012. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Berlin.
    [8]
    Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Data – Methods and Techniques for Practical Privacy-Preserving Information Sharing. Springer, Berlin.
    [9]
    Peter Christen, Thilina Ranbaduge, Dinusha Vatsalan, and Rainer Schnell. 2018. Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage. Transactions on Knowledge and Data Engineering 31, 11 (2018), 2164–2177.
    [10]
    Peter Christen, Rainer Schnell, Thilina Ranbaduge, and Anushka Vidanage. 2021. A critique and attack on “Blockchain-based privacy-preserving record linkage.” Information Systems 108 (2021), 101930.
    [11]
    Peter Christen, Rainer Schnell, Dinusha Vatsalan, and Thilina Ranbaduge. 2017. Efficient cryptanalysis of Bloom filters for privacy-preserving record linkage. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’17). Springer, 628–640.
    [12]
    Peter Christen, Anushka Vidanage, Thilina Ranbaduge, and Rainer Schnell. 2018. Pattern-mining based cryptanalysis of Bloom filters for privacy-preserving record linkage. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’18). Springer, 628–640.
    [13]
    Chris Culnane, Benjamin Rubinstein, and Vanessa Teague. 2017. Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK office for national statistics’ privacy-preserving record linkage. arXiv (2017).
    [14]
    Peter C. Dillinger and Panagiotis Manolios. 2004. Fast and accurate bitstate verification for SPIN. In International SPIN Workshop on Model Checking of Software. Springer, Berlin, 57–75.
    [15]
    Josep Domingo-Ferrer, Sara Ricci, and Jordi Soria-Comas. 2015. Disclosure risk assessment via record linkage by a maximum-knowledge attacker. Conference on Privacy, Security and Trust (PST’15). IEEE, Los Alamitos, CA, 28–35.
    [16]
    Josep Domingo-Ferrer and Vicenç Torra. 2004. Disclosure risk assessment in statistical data protection. Journal of Computational and Applied Mathematics 164–165 (2004), 285–293.
    [17]
    George Duncan and Diane Lambert. 1989. The risk of disclosure for microdata. Journal of Business & Economic Statistics 7, 2 (1989), 207–217.
    [18]
    George T. Duncan, Mark Elliot, and Juan Jose Salazar Gonzalez. 2011. Statistical Confidentiality: Principles and Practice. Springer.
    [19]
    Elizabeth Durham, Murat Kantarcioglu, Yuan Xue, Csaba Toth, Mehmet Kuzu, and Bradley Malin. 2014. Composite Bloom filters for secure record linkage. Transactions on Knowledge and Data Engineering 26, 12 (2014), 2956–2968.
    [20]
    Elizabeth A. Durham. 2012. A Framework for Accurate, Efficient Private Record Linkage. Ph.D. Dissertation. Faculty of the Graduate School of Vanderbilt University, Nashville, TN.
    [21]
    Cynthia Dwork. 2006. Differential privacy. International Colloquium on Automata, Languages and Programming (ICALP’06). 1–12.
    [22]
    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography (TCC’06). Springer, Berlin, 265–284.
    [23]
    Mark Elliot, Elaine Mackey, and Kieron O’Hara. 2020. The Anonymisation Decision-making Framework 2nd Edition: European Practitioners’ Guide. UK Anonymisation Network, Manchester.
    [24]
    Benjamin Fung, Ke Wang, Rui Chen, and Philip S. Yu. 2010. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys 42, 4 (2010), 1–53.
    [25]
    Aris Gkoulalas-Divanis, Dinusha Vatsalan, Dimitrios Karapiperis, and Murat Kantarcioglu. 2021. Modern privacy-preserving record linkage techniques: An overview. IEEE Transactions on Information Forensics and Security 16 (2021), 4966–4987.
    [26]
    Amir Harel, Asaf Shabtai, Lior Rokach, and Yuval Elovici. 2012. M-score: A misuseability weight measure. IEEE Transactions on Dependable and Secure Computing 9, 3 (2012), 414–428.
    [27]
    Ali Inan, Murat Kantarcioglu, Gabriel Ghinita, and Elisa Bertino. 2010. Private record matching using differential privacy. International Conference on Extending Database Technology (EDBT’10). ACM, 123–134.
    [28]
    Murat Kantarcioglu, Ali Inan, Wei Jiang, and Bradley Malin. 2009. Formal anonymity models for efficient privacy-preserving joins. Data & Knowledge Engineering 68, 11 (2009), 1206–1223.
    [29]
    Alexandros Karakasidis and Vassilios S. Verykios. 2012. Reference table based k-anonymous private blocking. PACM Symposium on Applied Computing (SAC’12). ACM, 859–864.
    [30]
    Alexandros Karakasidis, Vassilios S. Verykios, and Peter Christen. 2012. Fake injection strategies for private phonetic matching. In Data Privacy Management and Autonomous Spontaneous Security. Springer, Berlin, 9–24.
    [31]
    Dimitrios Karapiperis, Aris Gkoulalas-Divanis, and Vassilios S. Verykios. 2017. Distance-aware encoding of numerical values for privacy-preserving record linkage. International Conference on Data Engineering (ICDE’17). IEEE, 135–138.
    [32]
    Rosemary Karmel. 2005. Data linkage protocols using a statistical linkage key. Australian Institute of Health and Welfare (2005).
    [33]
    Jonathan Katz and Yehuda Lindell. 2007. Introduction to Modern Cryptography. CRC Press.
    [34]
    Martin Kroll and Simone Steinmetzer. 2015. Automated cryptanalysis of Bloom filter encryptions of databases with several personal identifiers. Biomedical Engineering Systems and Technologies (BIOSTEC’15). Springer, 341–356.
    [35]
    Mehmet Kuzu, Murat Kantarcioglu, Elizabeth Durham, and Bradley Malin. 2011. A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. Privacy Enhancing Technologies (PETS’11). Springer, 226–245.
    [36]
    Rainer Lenz and Tim Hochgürtel. 2021. Random disclosure in confidential statistical databases. Statistical Journal of the IAOS 37, 1 (2021), 401–413.
    [37]
    Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 8 (1966), 707–710.
    [38]
    Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. 2007. T-closeness: Privacy beyond k-anonymity and l-diversity. International Conference on Data Engineering (ICDE’07). IEEE, 106–115.
    [39]
    Ninghui Li, Min Lyu, Dong Su, and Weining Yang. 2017. Differential Privacy: From Theory to Practice. Morgan and Claypool Publishers.
    [40]
    Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 1, 1 (2007), 3–es.
    [41]
    William Mitchell, Rinku Dewri, Ramakrishna Thurimella, and Max Roschke. 2017. A graph traversal attack on Bloom filter-based medical data aggregation. International Journal of Big Data Intelligence 4, 4 (2017), 217–226.
    [42]
    Noman Mohammed, Benjamin C. M. Fung, and Mourad Debbabi. 2011. Anonymity meets game theory: Secure data integration with malicious participants. International Journal on Very Large Data Bases 20, 4 (2011), 567–588.
    [43]
    Frank Niedermeyer, Simone Steinmetzer, Martin Kroll, and Rainer Schnell. 2014. Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. Journal of Privacy and Confidentiality 6, 2 (2014), 59–79.
    [44]
    Thilina Ranbaduge, Peter Christen, and Rainer Schnell. 2020. Secure and accurate two-step hash encoding for privacy-preserving record linkage. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’20). Springer, 139–151.
    [45]
    Thilina Ranbaduge and Rainer Schnell. 2020. Securing Bloom filters for privacy-preserving record linkage. International Conference on Information and Knowledge Management (CIKM’20). ACM, 2185–2188.
    [46]
    Ivan P. Fellegi and Alan B. Sunter. 1969. A theory for record linkage. Journal of the American Statistical Association 64, 328 (1969), 1183–1210.
    [47]
    Sean Randall, Helen Wichmann, Adrian Brown, James Boyd, Tom Eitelhuber, Alexandra Merchant, and Anna Ferrante. 2022. A blinded evaluation of privacy preserving record linkage with Bloom filters. BMC Medical Research Methodology 22, 1 (2022), 1–7.
    [48]
    Sean M. Randall, Anna M. Ferrante, James H. Boyd, Jacqueline K. Bauer, and James B. Semmens. 2014. Privacy-preserving record linkage on large real world datasets. Journal of Biomedical Informatics 50 (2014), 205–212.
    [49]
    Bruce Schneier. 1996. Applied Cryptography: Protocols, Algorithms, and Source Code in C (2nd ed.). John Wiley & Sons, New York, NY.
    [50]
    Rainer Schnell, Tobias Bachteler, and Jörg Reiher. 2009. Privacy-preserving record linkage using Bloom filters. Medical Informatics and Decision Making 9, 41 (2009), 1–11.
    [51]
    Rainer Schnell, Tobias Bachteler, and Jörg Reiher. 2011. A novel error-tolerant anonymous linking code. SSRN Electronic Journal (2011).
    [52]
    Rainer Schnell and Christian Borgs. 2016. Randomized response and balanced Bloom filters for privacy preserving record linkage. In International Conference on Data Mining Workshops (ICDMW’16). 218–224.
    [53]
    Rainer Schnell and Christian Borgs. 2016. XOR-folding for Bloom filter-based encryptions for privacy-preserving record linkage. SSRN Electronic Journal (January 2016).
    [54]
    Rainer Schnell and Christian Borgs. 2020. Encoding hierarchical classification codes for privacy-preserving record linkage using Bloom filters. Machine Learning and Knowledge Discovery in Databases (ECML PKDD’20), Peggy Cellier and Kurt Driessens (Eds.). Springer International Publishing, 142–156.
    [55]
    Duncan Smith. 2017. Secure pseudonymisation for privacy-preserving probabilistic record linkage. Journal of Information Security and Applications 34 (2017), 271–279.
    [56]
    Latanya Sweeney. 2002. K-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, 5 (2002), 557–570.
    [57]
    Leslie Taylor, Xiao-Hua Zhou, and Peter Rise. 2018. A tutorial in assessing disclosure risk in microdata. Statistics in Medicine 37, 25 (2018), 3693–3706.
    [58]
    Matthias Templ. 2017. Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer.
    [59]
    Traian Truta, Farshad Fotouhi, and Daniel Barth-Jones. 2003. Disclosure risk measures for microdata. International Conference on Scientific and Statistical Database Management (SSDBM’03). IEEE, 15–22.
    [60]
    Traian Marius Truta and Bindu Vinay. 2006. Privacy protection: P-Sensitive k-anonymity property. International Conference on Data Engineering Workshops (ICDEW’06). IEEE, 94–94.
    [61]
    UK Office for National Statistics. 2013. Beyond 2011 Matching Anonymous Data. Methods and Policies Report M9.
    [62]
    Sirintra Vaiwsri, Thilina Ranbaduge, and Peter Christen. 2019. Reference values based hardening for Bloom filters based privacy-preserving record linkage. In Australasian Data Mining Conference (AusDM’19), CRPIT. Springer, 189–202.
    [63]
    Dinusha Vatsalan, Peter Christen, Christine M. O’Keefe, and Vassilios S. Verykios. 2014. An evaluation framework for privacy-preserving record linkage. Journal of Privacy and Confidentiality 6, 1 (2014), 35–75.
    [64]
    Dinusha Vatsalan, Peter Christen, and Erhard Rahm. 2016. Scalable privacy-preserving linking of multiple databases using counting Bloom filters. International Conference on Data Mining Workshops (ICDMW’16). IEEE, 882–889.
    [65]
    Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios. 2013. A taxonomy of privacy-preserving record linkage techniques. Information Systems 38, 6 (2013), 946–969.
    [66]
    Anushka Vidanage, Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. A graph matching attack on privacy-preserving record linkage. International Conference on Information and Knowledge Management (CIKM’20). ACM, 1485–1494.
    [67]
    Anushka Vidanage, Thilina Ranbaduge, Peter Christen, and Sean Randall. 2020. A privacy attack on multiple dynamic match-key based privacy-preserving record linkage. International Journal of Population Data Science 5, 1 (2020), 13 pages.
    [68]
    Anushka Vidanage, Thilina Ranbaduge, Peter Christen, and Rainer Schnell. 2019. Efficient pattern mining based cryptanalysis for privacy-preserving record linkage. International Conference on Data Engineering (ICDE’19). IEEE, 1698–1701.
    [69]
    Anushka Vidanage, Thilina Ranbaduge, Peter Christen, and Rainer Schnell. 2022. A taxonomy of attacks on privacy-preserving record linkage. Journal of Privacy and Confidentiality 12, 1 (2022), 35 pages.
    [70]
    Qian Wang, Zhiwei Xu, and Shengzhi Qu. 2011. An enhanced k-anonymity model against homogeneity attack. Journal of Software 6 (2011), 1945–1952.
    [71]
    Stanley L. Warner. 1965. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60, 309 (1965), 63–66.

    Cited By

    • (2024) Encryption-based sub-string matching for privacy-preserving record linkage. Journal of Information Security and Applications 81:C. DOI: 10.1016/j.jisa.2024.103712. Online publication date: 25-Jun-2024.
    • (2023) [Vision Paper] Privacy-Preserving Data Integration. 2023 IEEE International Conference on Big Data (BigData), 5614–5618. DOI: 10.1109/BigData59044.2023.10386703. Online publication date: 15-Dec-2023.

    Published In

    ACM Transactions on Privacy and Security, Volume 26, Issue 3 (August 2023), 640 pages. ISSN: 2471-2566; EISSN: 2471-2574; DOI: 10.1145/3582895.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 27 June 2023
    Online AM: 03 April 2023
    Accepted: 13 March 2023
    Revised: 02 February 2023
    Received: 27 February 2022
    Published in TOPS Volume 26, Issue 3

    Author Tags

    1. Disclosure risk analysis
    2. Bloom filters
    3. tabulation hashing
    4. two-step hashing
    5. multiple match-keys
    6. statistical linkage keys


    Funding Sources

    • Universities Australia and the German Academic Exchange Service (DAAD)
    • German Research Foundation (DFG)
