1. Introduction
With the rapid growth of online platforms and services, massive amounts of user data are being generated and collected across various domains, such as movies, books, music, and e-commerce. These datasets contain rich information about user preferences and behaviors, making them highly valuable for personalized recommendation systems, targeted advertising, and other data-driven applications. However, due to privacy concerns, these datasets are typically pseudonymized before being released or shared, with explicit identifiers (e.g., names and email addresses) removed or replaced with pseudonymous identifiers.
Despite anonymization efforts, recent research has shown that it is possible to de-anonymize users in these datasets by leveraging the uniqueness of their data patterns, such as their rating histories or browsing behaviors [1,2]. This raises significant privacy concerns, as de-anonymized user data can potentially reveal sensitive information about individuals, leading to issues such as discrimination, targeted exploitation, or reputational damage.
In this paper, we focus on the problem of de-anonymizing users across different rating datasets, such as those used for evaluating recommender systems (e.g., MovieLens, Book-Crossing, and LastFM). While previous studies have explored de-anonymization techniques within a single dataset [1], our work investigates the more challenging task of linking user identities across multiple datasets from diverse domains. This cross-dataset de-anonymization is particularly relevant in scenarios where users exhibit consistent preferences and behaviors across different contexts, such as movies, books, and music.
Our key insight is that users’ rating patterns can serve as quasi-identifiers, providing a unique fingerprint that can be leveraged for record linkage across datasets. We propose a novel approach that combines record linkage techniques [3] with quasi-identifier attacks [1] to de-anonymize users by linking their records based on the similarity of their rating vectors.
The contributions of this paper are as follows:
We present a novel approach that addresses the challenging task of de-anonymizing users across multiple rating datasets from diverse domains by leveraging the consistency of their rating patterns as high-dimensional quasi-identifiers and combining record linkage techniques with quasi-identifier attacks.
We conduct extensive experiments, evaluating our approach on three publicly available rating datasets (MovieLens, Book-Crossing, and LastFM), and demonstrate its effectiveness in achieving high precision and recall for cross-dataset de-anonymization tasks, outperforming existing state-of-the-art techniques.
We provide a thorough investigation of various factors impacting the de-anonymization performance, including the choice of similarity metric, the combination of datasets, data sparsity, user demographics, and temporal variations in user data.
We highlight the privacy implications of our findings, emphasizing the potential risks associated with the release of anonymized rating datasets and underscoring the critical need for stronger anonymization techniques and tailored privacy-preserving mechanisms specifically designed for rating datasets and recommender systems.
The remainder of this paper is organized as follows.
Section 2 provides a comprehensive review of the relevant literature on de-anonymization techniques and record linkage methods, laying the foundation for our work.
Section 3 introduces our novel approach to cross-dataset de-anonymization, presenting the key insights and overall process.
Section 4 describes the detailed methodology employed in our experimental evaluation, including data preprocessing steps, quasi-identifier extraction, similarity computation techniques, record linkage algorithms, and identity resolution strategies.
Section 5 presents and analyzes the results obtained from our extensive experiments, shedding light on the effectiveness of our approach, the impact of various factors on performance, and comparative analyses with existing techniques. Finally, Section 6 concludes this paper by summarizing our key findings, highlighting their implications for user privacy and data anonymization, and outlining potential future research directions in this critical area.
3. The Proposed Approach
Our approach represents a novel contribution to the field of user de-anonymization by addressing the challenge of linking user identities across multiple rating datasets from diverse domains. While previous work has focused on de-anonymizing users within a single dataset, our method leverages the insight that users exhibit consistent preferences and behaviors across different contexts, enabling cross-dataset de-anonymization. The overall process of our approach is summarized in Algorithm 1, which outlines the key steps involved.
Algorithm 1 De-anonymizing users across rating datasets
1:  Input: D = {D_1, …, D_n}: set of n rating datasets
2:  Output: M: mapping of linked user identities across datasets
3:  M ← ∅                                     ▹ Initialize mapping
4:  for each dataset D_i ∈ D do
5:      D_i ← UserFiltering(D_i, θ)           ▹ Filter users with few ratings
6:      D_i ← RatingNormalization(D_i)        ▹ Normalize rating values
7:      for each user u ∈ D_i do
8:          R_u ← ExtractRatingVector(D_i, u) ▹ Extract rating vector
9:      end for
10: end for
11: for each pair of datasets (D_i, D_j), i < j do
12:     S_ij ← ComputeSimilarities(D_i, D_j)  ▹ Compute similarity matrix
13:     L_ij ← RecordLinkage(S_ij, τ)         ▹ Perform record linkage
14:     M ← M ∪ L_ij                          ▹ Update mapping with linked identities
15: end for
16: M ← ResolveIdentities(M)                  ▹ Resolve linked identities
17: return M
In Algorithm 1, the parameter θ in line 5 represents the minimum number of ratings required for a user to be included in the analysis. This filtering step is essential to ensure that the rating vectors used as quasi-identifiers are representative of the users’ preferences and behaviors. If a user has provided too few ratings, their rating vector may not accurately capture their true preferences, potentially leading to erroneous linkages. The value of θ should be determined based on domain knowledge and empirical analysis, balancing the trade-off between retaining a sufficient number of users for effective de-anonymization and ensuring the quality of the rating vectors used for record linkage.
The UserFiltering function (line 5) takes the input dataset D_i and the threshold parameter θ as inputs and returns a filtered dataset containing only users who have provided at least θ ratings. This filtering step helps ensure that the subsequent record linkage process is performed on reliable and representative rating vectors, improving the overall accuracy of the de-anonymization results.
The RatingNormalization function (line 6) takes the filtered dataset D_i as input and normalizes the rating values to a common scale, typically between 0 and 1. This normalization step is crucial for ensuring that the similarity computations performed later in the algorithm are not biased by the different rating scales used across the datasets. By mapping all rating values to a common range, the algorithm can more accurately compare and match rating vectors from different datasets.
The ExtractRatingVector function (line 8) extracts the rating vector R_u for a given user u from the preprocessed dataset D_i. This function retrieves the list of ratings provided by the user and represents it as a vector, which serves as a quasi-identifier for the record linkage process. The rating vector encapsulates the user’s preferences and behavior patterns, enabling the algorithm to identify and link their identities across different datasets.
The ComputeSimilarities function (line 12) operates on the entire preprocessed datasets D_i and D_j to compute the pairwise similarity matrix S_ij between all user rating vectors across the two datasets. This design choice is motivated by computational efficiency, as computing similarities between individual rating vectors can be expensive. Instead, we first extract the rating vectors and then compute the similarity matrix in a single step, leveraging efficient matrix operations.
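To make this matrix-level design concrete, the sketch below (our illustration, not the authors’ code; Python with scikit-learn assumed) computes the full cross-dataset similarity matrix in a single vectorized call, assuming the two rating matrices have already been aligned to a common item space during preprocessing:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarities(R_i, R_j):
    """Return S where S[u, v] is the cosine similarity between user u of
    dataset i and user v of dataset j. Assumes R_i and R_j are
    (users x items) matrices over a shared item space; sparse SciPy
    matrices are also accepted."""
    return cosine_similarity(R_i, R_j)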
While previous work focused on de-anonymizing users within a single dataset [1,2], our approach tackles the more challenging task of linking user identities across multiple datasets from diverse domains, such as movies, books, and music. This cross-dataset de-anonymization is particularly relevant when users exhibit consistent preferences and behaviors across different contexts. Unlike techniques that rely on network structures or demographic attributes [5,6], our approach leverages users’ rating patterns as quasi-identifiers, which can provide a high degree of uniqueness and enable effective record linkage across datasets. Compared to traditional record linkage techniques that rely on exact attribute matching or predefined rules, our method employs a probabilistic approach based on the Fellegi–Sunter model [21]. This model accounts for the inherent uncertainty and variability in rating patterns, making it more robust and suitable for the de-anonymization task. Furthermore, our approach incorporates techniques from the quasi-identifier attack literature [1], where users’ rating vectors are treated as high-dimensional quasi-identifiers. By combining record linkage and quasi-identifier attacks, our method effectively exploits the uniqueness of rating patterns to link user identities across datasets, overcoming the limitations of traditional de-anonymization techniques that focus on explicit identifiers or predefined attribute combinations.
In the following subsections, we provide a detailed description of each step in our approach.
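As a reading aid, here is a minimal Python driver mirroring Algorithm 1. It is a sketch under our own naming (filter_users, link_records, and the other helpers are themselves sketched in the subsections that follow), and it glosses over the cross-dataset alignment of item spaces, which we assume is handled during preprocessing:

from itertools import combinations

def deanonymize(datasets, theta, tau, m_probs, u_probs):
    """End-to-end sketch of Algorithm 1; all helper names are illustrative."""
    prepped = []
    for df in datasets:
        df = normalize_ratings(filter_users(df, theta))      # Section 3.1
        R, _users, _items = build_rating_matrix(df)          # Section 3.2
        prepped.append(R)
    links = []
    for (i, R_i), (j, R_j) in combinations(enumerate(prepped), 2):
        S = compute_similarities(R_i, R_j)                   # Section 3.3
        for u, v in link_records(S, m_probs, u_probs, tau):  # Section 3.4
            links.append(((i, int(u)), (j, int(v))))
    return resolve_identities(links)                         # Section 3.5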
3.1. Data Preprocessing
Let D = {D_1, …, D_n} be the set of n rating datasets we aim to de-anonymize. Each dataset D_i consists of user–item rating tuples (u, i, r), where u is the user identifier, i is the item identifier, and r is the rating value.
We preprocess each dataset by applying the steps outlined below.
3.1.1. User Filtering
We filter out users who have provided fewer than θ ratings, where θ is a predefined threshold. This step ensures that we consider only users with sufficient rating information for reliable record linkage. Users with too few ratings may not exhibit distinctive patterns, making it challenging to link their records accurately.
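A minimal sketch of this filtering step, assuming the dataset is held in a pandas DataFrame with columns user, item, and rating (the threshold name theta is ours):

import pandas as pd

def filter_users(df: pd.DataFrame, theta: int) -> pd.DataFrame:
    """Keep only users who have contributed at least `theta` ratings."""
    counts = df.groupby("user")["rating"].transform("size")
    return df[counts >= theta].copy()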
3.1.2. Rating Normalization
Since different datasets may use different rating scales, we normalize the rating values to a common scale (e.g., [0, 1]) using min–max normalization:

$$ r' = \frac{r - r_{\min}}{r_{\max} - r_{\min}}, $$

where $r_{\min}$ and $r_{\max}$ are the minimum and maximum rating values in the dataset, respectively. This normalization step ensures that rating values from different datasets are comparable and can be used consistently for similarity computation.
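The normalization step translates directly into code; a sketch under the same DataFrame assumption as above:

def normalize_ratings(df):
    """Min-max normalize the rating column to [0, 1] using the dataset-wide
    minimum and maximum (a constant-rating dataset would need a guard)."""
    r_min, r_max = df["rating"].min(), df["rating"].max()
    out = df.copy()
    out["rating"] = (out["rating"] - r_min) / (r_max - r_min)
    return out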
3.2. Quasi-Identifier Extraction
For each user u in a dataset D_i, we extract their rating vector R_u as a quasi-identifier. The rating vector is an ordered sequence of the user’s ratings for the items they have rated, i.e., $R_u = (r_{u,1}, r_{u,2}, \ldots, r_{u,m})$, where m is the number of items rated by the user.
The rating vector serves as a quasi-identifier because, while it may not uniquely identify a user, it can provide a high degree of uniqueness, especially when combined across multiple datasets. Our key assumption is that users’ rating patterns reflect their underlying preferences and behaviors, which tend to be consistent across different domains.
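In practice, the per-user rating vectors are naturally represented as rows of a sparse user–item matrix. A sketch of this extraction step (the representation choice is ours, not prescribed by the paper):

import pandas as pd
from scipy.sparse import csr_matrix

def build_rating_matrix(df):
    """Encode each user's ratings as a sparse row vector over the item space;
    row u of the returned matrix is the rating vector R_u."""
    users = {u: k for k, u in enumerate(df["user"].unique())}
    items = {i: k for k, i in enumerate(df["item"].unique())}
    rows = df["user"].map(users).to_numpy()
    cols = df["item"].map(items).to_numpy()
    R = csr_matrix((df["rating"].to_numpy(), (rows, cols)),
                   shape=(len(users), len(items)))
    return R, users, items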
3.3. Similarity Computation
To measure the similarity between the rating vectors of users across different datasets, we employ the cosine similarity metric:

$$ \mathrm{sim}(R_u, R_v) = \frac{R_u \cdot R_v}{\lVert R_u \rVert \, \lVert R_v \rVert}, $$

where $R_u$ and $R_v$ are the rating vectors of users u and v, respectively.
For each pair of datasets $(D_i, D_j)$, we compute the cosine similarity between the rating vectors of all user pairs $(u, v)$ such that $u \in D_i$ and $v \in D_j$. This results in a similarity matrix $S_{ij}$, where $S_{ij}[u, v] = \mathrm{sim}(R_u, R_v)$.
While the cosine similarity metric is our default choice, our approach is flexible and can accommodate other similarity measures, such as the Euclidean distance, Jaccard similarity, or Pearson correlation coefficient. We investigate the impact of different similarity metrics on the de-anonymization performance in our experimental evaluation (Section 5).
3.4. Record Linkage
We employ a probabilistic record linkage approach to link user records across datasets based on the computed similarity matrices. Specifically, we use the Fellegi–Sunter model [21], which defines two conditional probabilities:
m(x): the probability of observing similarity x when the records refer to the same user.
u(x): the probability of observing similarity x when the records refer to different users.
These probabilities can be estimated from the data or provided as prior knowledge based on domain expertise. Given these probabilities, we can compute the weight of evidence for linking two records u and v as:

$$ W(u, v) = \log \frac{m(s_{uv})}{u(s_{uv})}, $$

where $s_{uv} = S_{ij}[u, v]$ is the similarity between their rating vectors. We then link user records u and v if $W(u, v)$ exceeds a predefined threshold τ. The choice of this threshold affects the trade-off between precision and recall in the record linkage process.
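A compact sketch of this linkage decision, with the likelihood functions m_probs and u_probs (our names) assumed to have been estimated beforehand, e.g., from labeled pairs or via expectation–maximization, and assumed to apply elementwise to an array of similarities:

import numpy as np

def link_records(S, m_probs, u_probs, tau):
    """Return the (u, v) index pairs whose weight of evidence
    W = log(m(s_uv) / u(s_uv)) exceeds the threshold tau."""
    W = np.log(m_probs(S)) - np.log(u_probs(S))  # elementwise over the matrix
    return np.argwhere(W > tau)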
3.5. Identity Resolution
After performing record linkage across all pairs of datasets, we resolve the linked user identities to determine the real-world identities of users across different datasets. This step involves mapping the linked user identifiers to their corresponding real-world identities if such information is available in the datasets.
In cases where real-world identities are not directly available, we can assign unique identifiers to the linked user records, effectively de-anonymizing them across the datasets while preserving their anonymity within each individual dataset.
Our identity resolution step leverages techniques from the entity resolution and identity matching literature [3,22]. We employ a combination of deterministic and probabilistic methods to resolve the linked identities, taking into account additional user attributes (e.g., demographic information) when available.
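One standard way to realize this step is to merge the pairwise links into global identity clusters with a union-find structure; the sketch below is our illustration and keys each record by a (dataset, user) tuple:

class UnionFind:
    """Disjoint-set forest with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def resolve_identities(links):
    """links: iterable of ((dataset_a, user_a), (dataset_b, user_b)) pairs.
    Returns clusters of records believed to be the same real-world user."""
    uf = UnionFind()
    for a, b in links:
        uf.union(a, b)
    clusters = {}
    for record in list(uf.parent):
        clusters.setdefault(uf.find(record), []).append(record)
    return list(clusters.values())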
The resolved identities can then be used for further analysis, such as studying user behavior across different domains or developing cross-domain recommendation systems. However, it is crucial to handle this sensitive information responsibly and comply with relevant privacy regulations and ethical guidelines.
5. Results
In this section, we present the results of our experimental evaluation of the proposed approach for de-anonymizing users across different rating datasets using record linkage techniques and quasi-identifier attacks.
5.1. User De-Anonymization across Datasets
Table 2 presents the results of our approach for de-anonymizing users across pairs of datasets. We report the precision, recall, and F1-score for each pair of datasets.
As discussed in Section 4.1.2, the precision and recall values reported in Table 2 are calculated based on the record linkage decisions made by our proposed approach using the Fellegi–Sunter model [21]. This model computes the weight of evidence W(u, v) for linking two user records u and v across datasets (Equation (3)). The weight is determined by the similarity between their rating vectors R_u and R_v (Equation (2)). Two user records u and v are linked if W(u, v) exceeds a predefined threshold τ. The precision and recall values in Table 2 reflect the accuracy of these linkage decisions, evaluated against the available ground-truth information about the true user identities across the datasets.
As shown in the table, our approach achieves high precision and recall values across all dataset pairs, with F1-scores ranging from 0.72 to 0.79. These results demonstrate the effectiveness of our approach in linking user records across different rating datasets based on their rating patterns.
The highest F1-score of 0.79 is observed for the MovieLens–Book-Crossing dataset pair, indicating a strong correlation between users’ movie and book rating patterns. The slightly lower F1-scores for the pairs involving the LastFM dataset suggest that music listening patterns may be less correlated with movie and book rating patterns.
5.2. Impact of Data Density, Size, Diversity, and User Activity
In our experiments, we observed varying levels of data density and user activity across the three rating datasets, which could potentially influence the de-anonymization performance of our approach. In this subsection, we analyze these factors and investigate their impact on the de-anonymization success rate.
5.2.1. Data Density
The data density of a dataset refers to the average number of ratings per user. Datasets with higher data density (more ratings per user) provide richer user profiles and potentially more distinctive rating patterns, which could facilitate more accurate record linkage and de-anonymization.
We computed the data density for each dataset as follows:
MovieLens: 1,000,209 ratings/6040 users = 165.6 ratings per user.
Book-Crossing: 1,149,780 ratings/278,858 users = 4.1 ratings per user.
LastFM: 17,559,530 artist plays/92,834 users = 189.2 ratings per user (considering each artist play as a rating).
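These per-user averages follow directly from the raw counts; a quick check:

datasets = {
    "MovieLens":     (1_000_209, 6_040),      # ratings, users
    "Book-Crossing": (1_149_780, 278_858),
    "LastFM":        (17_559_530, 92_834),    # artist plays treated as ratings
}
for name, (n_ratings, n_users) in datasets.items():
    print(f"{name}: {n_ratings / n_users:.1f} ratings per user")
# MovieLens: 165.6, Book-Crossing: 4.1, LastFM: 189.2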
While data density is an important factor influencing the de-anonymization performance, our results suggest that it may not be the sole determining factor. The Book-Crossing dataset, despite having the lowest data density among the three datasets, exhibited the highest de-anonymization performance when paired with the MovieLens dataset (F1-score: 0.79). This could potentially be attributed to other factors, such as the inherent similarity between movie and book rating patterns, or the presence of a sufficient number of highly active users in both datasets.
On the other hand, the MovieLens–LastFM dataset pair, which involved two datasets with relatively high data densities, achieved the lowest de-anonymization performance (F1-score: 0.72). This observation suggests that the consistency of user preferences across different domains (movies and music) may play a more significant role than data density alone in determining the effectiveness of our de-anonymization approach.
Therefore, while data density is an important consideration, our results indicate that the interplay of multiple factors, including data density, user activity levels, and the inherent similarity of rating patterns across domains, collectively influences the de-anonymization success rate. A simplistic assumption based solely on data density may not capture the nuances of the de-anonymization process across diverse datasets.
5.2.2. Impact of Dataset Size and Diversity
In our experiments, we evaluated the de-anonymization performance across three datasets: MovieLens, Book-Crossing, and LastFM. While these datasets cover different domains (movies, books, and music) and vary in size, a comprehensive analysis of the impact of dataset size and diversity on our approach’s performance is warranted.
To investigate the effect of dataset size, we conducted additional experiments by varying the number of users included in each dataset.
Figure 2 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the dataset size, represented by the number of users.
As expected, the F1-score generally improves as the dataset size increases, reaching a plateau beyond a certain number of users. This behavior is intuitive, as larger datasets provide more comprehensive coverage of user rating patterns, increasing the likelihood of finding distinctive quasi-identifiers for successful record linkage.
However, it is important to note that the rate of improvement in performance may diminish beyond a certain dataset size, as the marginal benefit of additional users decreases. Furthermore, excessively large datasets may introduce computational challenges and scalability issues, requiring the development of efficient algorithms or approximation techniques.
To investigate the impact of dataset diversity, we simulated scenarios where the datasets encompassed a broader range of domains by combining the rating data from multiple sources. Specifically, we created a composite dataset by merging the MovieLens, Book-Crossing, and LastFM datasets, treating each domain as a separate set of items rated by users.
Figure 3 shows the F1-score achieved by our approach on this composite dataset compared to the individual dataset pairs.
As illustrated in the figure, the de-anonymization performance on the composite dataset (F1-score of 0.72) is lower than that on the individual dataset pairs but still within a reasonable range. This observation suggests that while increased dataset diversity can introduce additional challenges due to the potential inconsistency of user preferences across domains, our approach remains effective in leveraging the uniqueness of rating patterns for de-anonymization.
These findings highlight the interplay between dataset size, diversity, and de-anonymization performance. While larger datasets generally improve performance, the benefits may plateau beyond a certain size, and computational considerations become increasingly important. Additionally, increased dataset diversity can enhance the robustness and generalizability of our approach but may also introduce challenges due to potential inconsistencies in user preferences across domains.
Overall, our analysis demonstrates the necessity of carefully balancing dataset size and diversity to optimize the de-anonymization performance while considering computational constraints and the inherent characteristics of the data. These insights can inform the development of more robust and scalable de-anonymization techniques, as well as the design of effective privacy-preserving mechanisms for rating datasets.
5.2.3. User Activity Levels
In addition to data density, size, and diversity, the distribution of user activity levels within a dataset can also impact the de-anonymization performance. Datasets with a large proportion of highly active users (those with many ratings) may exhibit more distinctive rating patterns, facilitating de-anonymization, while datasets with predominantly low-activity users could pose challenges.
The MovieLens and LastFM datasets exhibit a long-tailed distribution, with a significant number of users having relatively few ratings, whereas the Book-Crossing dataset has a more concentrated distribution around lower activity levels. The presence of a large proportion of users with sparse rating data in the MovieLens and LastFM datasets could contribute to the weaker de-anonymization performance observed for dataset pairs involving these two datasets. Conversely, the more concentrated distribution of user activity levels in the Book-Crossing dataset, with fewer highly distinctive user profiles, may have facilitated better de-anonymization performance for pairs involving this dataset, particularly when paired with the MovieLens dataset, which had a higher overall data density.
Our analysis highlights the nuanced impact of data density and user activity levels on the de-anonymization risk of rating datasets. While higher data density can facilitate de-anonymization by providing richer user profiles, datasets with a significant proportion of users having sparse rating data may exhibit weaker de-anonymization performance. Conversely, datasets with a more concentrated distribution of user activity levels, even with lower overall data density, could potentially improve de-anonymization performance by reducing the noise introduced by highly sparse user profiles.
These findings suggest that data publishers should carefully analyze the characteristics of their datasets, considering both data density and the distribution of user activity levels when evaluating the de-anonymization risk and determining appropriate anonymization techniques. Datasets with high overall data density but a long-tailed distribution of user activities may require stronger anonymization measures compared to datasets with lower data density but a more concentrated activity distribution.
Furthermore, our approach can be extended to incorporate these factors into the record linkage and de-anonymization process. By incorporating user activity levels and data density into the similarity computation and record linkage steps, our method could potentially enhance its overall performance and robustness, particularly in scenarios where user activity distributions vary significantly across datasets.
5.3. Impact of User Rating Threshold
We investigated the impact of the user rating threshold θ (the minimum number of ratings required for a user to be included in the analysis) on the de-anonymization performance. Figure 4 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the rating threshold θ.
As expected, the F1-score improves as the rating threshold increases, since users with more ratings provide more robust quasi-identifiers for record linkage. However, the improvement diminishes beyond a certain threshold value (around 100 in our experiments), as retaining only users with very large numbers of ratings reduces the sample size and risks overfitting.
5.4. Computational Performance
Our computational complexity analysis was based on the assumption that the number of datasets, k, is a constant and relatively small compared to the number of users, n, and the average number of ratings per user, m. In this scenario, the dominant factor in the time complexity is the pairwise comparison of user rating vectors, which has a quadratic time complexity of $O(k \cdot n^2 \cdot m)$. The term m accounts for the time required to compute the similarity between two rating vectors of length m, and the constant factor k represents the number of dataset pairs to be processed.
However, in scenarios where k is not a constant or is comparable to n or m, the overall time complexity can grow beyond quadratic. The pairwise comparison of user rating vectors across all dataset pairs requires $O(k \cdot n^2 \cdot m)$ operations, so when k grows on the order of n, the overall complexity becomes cubic in n.
Additionally, if the average number of ratings per user, m, is large, the term m in the complexity expression may become more significant, potentially affecting the overall computational complexity. In such cases, optimizations or approximation techniques may be required to reduce the computational burden associated with calculating similarities between long rating vectors.
On a standard desktop computer with an Intel Core i7 processor and 16 GB of RAM, the average runtime for de-anonymizing users across the MovieLens and Book-Crossing datasets (the largest pair in our experiments) was approximately 2 h. This runtime is reasonable for offline analysis tasks and can be further improved through parallelization and optimizations.
Our results demonstrate the feasibility and effectiveness of de-anonymizing users across different rating datasets using record linkage techniques and quasi-identifier attacks. By leveraging the uniqueness of users’ rating patterns, our approach can link their records across datasets, potentially revealing sensitive information about their preferences and behavior.
While our approach achieves high precision and recall values, there is still room for improvement, particularly for dataset pairs with less correlated rating patterns (e.g., LastFM and other domains). Incorporating additional user attributes or leveraging more advanced record linkage techniques may further enhance the de-anonymization performance.
It is important to note that our work highlights the privacy risks associated with the release of rating datasets, even when they are anonymized. Researchers and practitioners should be aware of these risks and take appropriate measures to protect user privacy, such as differential privacy techniques or secure multi-party computation methods.
To provide benchmarks on the computational resources required for larger datasets, we conducted additional experiments by varying the dataset size.
Table 3 presents the runtime and memory usage for different dataset sizes, considering the MovieLens–Book-Crossing pair as an example.
As shown in the table, the runtime and memory usage increase with the dataset size, exhibiting an approximately quadratic growth pattern. This aligns with our theoretical analysis, where the dominant factor in the time complexity is the pairwise comparison of user rating vectors, resulting in a quadratic complexity of $O(n^2 \cdot m)$.
For datasets with 250,000 users, the runtime reaches approximately 9.2 h, and the memory usage is around 10.6 GB. While these resources are manageable for offline analysis tasks, larger datasets may require more powerful computational resources or the implementation of optimizations and approximation techniques to reduce the computational burden.
It is important to note that these benchmarks are specific to our implementation and the hardware configuration used in our experiments. The actual computational resources required may vary depending on the specific datasets, hardware specifications, and potential optimizations applied to the implementation.
5.5. Impact of Rating Similarity Metric
To investigate the impact of the similarity metric used for comparing user rating vectors, we repeated the de-anonymization experiments using different similarity measures: Euclidean distance, Jaccard similarity, and Pearson correlation coefficient.
Table 4 shows the F1-scores achieved for the MovieLens–Book-Crossing dataset pair using these similarity metrics.
For the Euclidean distance, we used the following equation:

$$ d(R_u, R_v) = \sqrt{\sum_{k=1}^{m} (r_{u,k} - r_{v,k})^2}, $$

where $R_u$ and $R_v$ are the rating vectors of users u and v, respectively, and m is the number of commonly rated items. The Euclidean distance measures the straight-line distance between two rating vectors in the m-dimensional space.
For the Jaccard similarity, we used the following formula:

$$ J(u, v) = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}, $$

where $I_u$ and $I_v$ denote the sets of items rated by users u and v; this measures the ratio of commonly rated items to the total number of items rated by either user.
For the Pearson correlation coefficient, we used the standard formula

$$ \rho(R_u, R_v) = \frac{\sum_{k=1}^{m} (r_{u,k} - \bar{r}_u)(r_{v,k} - \bar{r}_v)}{\sqrt{\sum_{k=1}^{m} (r_{u,k} - \bar{r}_u)^2}\,\sqrt{\sum_{k=1}^{m} (r_{v,k} - \bar{r}_v)^2}}, $$

where $\bar{r}_u$ and $\bar{r}_v$ are the mean rating values for users u and v, respectively. The Pearson correlation coefficient measures the linear correlation between two rating vectors.
The computational complexity of the similarity metrics depends on the sparsity of the rating data and the length of the rating vectors. For the Euclidean distance and Pearson correlation coefficient, the complexity is $O(m)$, where m is the number of commonly rated items, as they require iterating over the rating vectors once. For the Jaccard similarity, the complexity is also $O(m)$, as it involves computing the intersection and union of the sets of rated items.
However, when the rating data are highly sparse, the effective complexity can be lower than $O(m)$. For example, if the average number of non-zero ratings per user is k, where $k \ll m$, the complexity for the Euclidean distance and Pearson correlation coefficient would be $O(k)$, and for the Jaccard similarity it would likewise be $O(k)$ due to the set operations.
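For concreteness, sparse-friendly versions of the three metrics are sketched below, representing each user’s ratings as a dict mapping item to rating (our representation choice); the Pearson variant centers on the co-rated items only, which is one common convention:

import math

def euclidean(ru: dict, rv: dict) -> float:
    """Straight-line distance over the commonly rated items."""
    common = ru.keys() & rv.keys()
    return math.sqrt(sum((ru[i] - rv[i]) ** 2 for i in common))

def jaccard(ru: dict, rv: dict) -> float:
    """|I_u n I_v| / |I_u u I_v| over the sets of rated items."""
    return len(ru.keys() & rv.keys()) / len(ru.keys() | rv.keys())

def pearson(ru: dict, rv: dict) -> float:
    """Linear correlation of the ratings on commonly rated items."""
    common = list(ru.keys() & rv.keys())
    if len(common) < 2:
        return 0.0
    mu_u = sum(ru[i] for i in common) / len(common)
    mu_v = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu_u) * (rv[i] - mu_v) for i in common)
    den = (math.sqrt(sum((ru[i] - mu_u) ** 2 for i in common))
           * math.sqrt(sum((rv[i] - mu_v) ** 2 for i in common)))
    return num / den if den else 0.0

These run in time proportional to the number of non-zero ratings per user, matching the $O(k)$ effective complexity noted above.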
As shown in the table, the Pearson correlation coefficient achieves the highest F1-score of 0.81, slightly outperforming our baseline cosine similarity metric. This result suggests that the Pearson correlation coefficient may be a more effective measure for capturing the similarity between users’ rating patterns. However, the performance differences are relatively small, indicating that our approach is robust to the choice of similarity metric.
5.6. De-Anonymization across Multiple Datasets
In addition to pairwise de-anonymization, we evaluated our approach’s performance in linking user records across all three datasets simultaneously. This scenario is more challenging, as it requires consistent rating patterns across multiple domains (movies, books, and music).
Table 5 presents the precision, recall, and F1-score achieved in this multi-dataset de-anonymization task.
As expected, the performance drops compared to the pairwise de-anonymization tasks, with an F1-score of 0.65. This drop can be attributed to the increased difficulty of finding consistent rating patterns across diverse domains. Nevertheless, our approach still achieves reasonable performance, demonstrating its potential for de-anonymizing users across multiple datasets.
5.7. Impact of Data Sparsity
To investigate the robustness of our approach to data sparsity, we conducted experiments by varying the percentage of available ratings in the datasets. Specifically, we randomly removed a certain fraction of ratings from the datasets and evaluated the de-anonymization performance on the remaining data.
Figure 5 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the percentage of available ratings.
As expected, the F1-score decreases as the percentage of available ratings decreases (i.e., data become sparser). However, our approach maintains reasonable performance even with relatively sparse data, achieving an F1-score of 0.72 when only 50% of the ratings are available. This result demonstrates the robustness of our approach to data sparsity, which is a common issue in real-world rating datasets.
5.8. Detecting the Same User across Datasets
To further illustrate the effectiveness of our approach, we present a case study where we detect the same user across the MovieLens, Book-Crossing, and LastFM datasets.
Table 6 shows the rating vectors of a particular user (denoted as User X) in each dataset, along with the corresponding similarities computed using the cosine similarity metric.
As shown in the table, User X’s rating vector in the MovieLens dataset exhibits high similarity (0.91) with their rating vector in the Book-Crossing dataset, and a slightly lower but still significant similarity (0.87) with their rating vector in the LastFM dataset. These high similarity values suggest that User X exhibits consistent rating patterns across movies, books, and music, enabling our approach to successfully link their identities across these diverse domains.
The weight of evidence values computed using the Fellegi–Sunter model for linking User X’s records were 3.42 (MovieLens–Book-Crossing) and 2.95 (MovieLens–LastFM), both exceeding the linking threshold of 2.0 used in our experiments. Consequently, our approach correctly identified User X as the same individual across all three datasets.
This case study demonstrates how our approach can effectively leverage the uniqueness of users’ rating patterns to de-anonymize them across different contexts, even when explicit identifiers are removed or obfuscated. It highlights the potential privacy risks associated with the release of anonymized rating datasets and the need for stronger anonymization techniques to protect individuals’ privacy.
5.9. User Demographic Analysis
To further investigate the factors that contribute to the effectiveness of our de-anonymization approach, we analyzed the impact of user demographics on the de-anonymization performance. Specifically, we examined how the number of ratings per user and the diversity of their rated items influence the ability to link their identities across datasets.
For this analysis, we focused on the MovieLens and Book-Crossing datasets, as they contain demographic information about users, such as their age and occupation. We divided the users into three age groups: young (18–30 years), middle-aged (31–50 years), and older (51+ years). Additionally, we categorized users based on their occupation into three groups: students, professionals (e.g., engineers, educators), and others (e.g., homemakers, retired).
Table 7 presents the F1-scores achieved using our approach for different user demographic groups when linking their identities across the MovieLens and Book-Crossing datasets.
As shown in the table, our approach achieves the highest F1-score of 0.84 for young users (aged 18–30), followed by middle-aged users (0.81) and older users (0.76). This trend suggests that younger users tend to have more consistent rating patterns across domains, potentially due to their stronger engagement with online platforms and broader interests.
When analyzing the performance based on occupation, we observe the highest F1-score of 0.87 for students, followed by professionals (0.82) and others (0.74). This finding aligns with the age-based analysis, as students typically fall within the young age group and may exhibit more consistent preferences and behaviors across different domains.
To further investigate the impact of rating diversity, we computed the average number of unique items rated by users in each demographic group. We found that younger users and students tend to rate a more diverse set of items compared to older users and other occupations, respectively. This diversity in rated items may contribute to the stronger de-anonymization performance observed for these groups, as their rating patterns become more unique and distinctive.
These results provide valuable insights into the factors that influence the effectiveness of our de-anonymization approach. They suggest that users with more diverse interests and engagement across different domains, such as younger individuals and students, are more susceptible to being de-anonymized based on their rating patterns. This information can inform the development of targeted anonymization strategies and privacy-preserving mechanisms for different user groups.
5.10. Impact of Temporal Variations in User Data
In real-world scenarios, user preferences and behaviors can evolve over time, resulting in temporal variations in their rating patterns. To investigate the robustness of our proposed method to such variations, we simulated scenarios where user rating patterns change over time and evaluated the impact on the de-anonymization performance.
Simulation Methodology
We simulated temporal variations in user rating patterns by introducing controlled perturbations to the rating vectors extracted from the original datasets. Specifically, we divided each user’s rating vector into two segments, representing their past and present rating patterns. We then applied random noise to the second segment, simulating a change in the user’s preferences and behaviors over time.
The random noise was introduced by randomly adding or subtracting a value within a specified range to a certain percentage of the ratings in the second segment. We varied the percentage of perturbed ratings and the range of the random noise to simulate different levels of temporal variation.
For each simulated scenario, we re-computed the similarity scores between the perturbed rating vectors and performed record linkage using our proposed approach. We then compared the de-anonymization performance in terms of precision, recall, and F1-score with the baseline performance on the original, unperturbed datasets.
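A minimal sketch of this perturbation procedure, assuming ratings already normalized to [0, 1]; the parameter names frac and noise_range are ours:

import numpy as np

rng = np.random.default_rng(42)

def perturb_second_segment(ratings: np.ndarray, frac: float,
                           noise_range: float) -> np.ndarray:
    """Add uniform noise in [-noise_range, +noise_range] to a fraction `frac`
    of the second (present-day) half of a rating vector, clipped to [0, 1]."""
    out = ratings.astype(float).copy()
    second = np.arange(len(out) // 2, len(out))
    chosen = rng.choice(second, size=int(frac * len(second)), replace=False)
    noise = rng.uniform(-noise_range, noise_range, size=chosen.size)
    out[chosen] = np.clip(out[chosen] + noise, 0.0, 1.0)
    return out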
Figure 6 illustrates the impact of temporal variations on the de-anonymization performance for the MovieLens–Book-Crossing dataset pair. The x-axis represents the percentage of perturbed ratings in the second segment of the rating vectors, whereas the y-axis shows the relative change in the F1-score compared to the baseline performance on the original datasets.
As expected, the de-anonymization performance degrades as the percentage of perturbed ratings increases, indicating that our approach is sensitive to temporal variations in user rating patterns. However, even with a substantial percentage of perturbed ratings (e.g., 30%), the relative decrease in the F1-score remains within reasonable bounds, ranging from 5% to 15% across different levels of random noise.
These results suggest that our proposed method exhibits moderate robustness to temporal variations in user data. While the de-anonymization performance may degrade to some extent as user preferences evolve, the impact is not catastrophic, and the method remains effective in linking user identities across datasets, even in the presence of moderate temporal variations.
It is worth noting that the degree of robustness may vary depending on the specific datasets and domains under consideration. Domains where user preferences tend to be more stable over time may exhibit greater robustness, whereas domains with more dynamic user behaviors could be more susceptible to the impact of temporal variations.
To further enhance the robustness of our approach to temporal variations, potential extensions could involve incorporating time-sensitive factors into the similarity computation and record linkage processes. For example, applying time-decaying weights to user ratings or adapting the similarity metrics to account for temporal patterns could mitigate the impact of preference changes over time. Additionally, incorporating contextual information, such as user demographics or external events that may influence user behaviors, could improve the ability to model and account for temporal variations.
Overall, our analysis demonstrates that while our proposed method is not immune to the effects of temporal variations in user data, it exhibits a reasonable level of robustness, particularly in scenarios with moderate levels of preference changes over time. This finding further validates the applicability of our approach in real-world settings, where user behaviors and preferences may evolve dynamically.
5.11. Comparison with State-of-the-Art Techniques
To demonstrate the novelty and unique strengths of our proposed approach, we conducted additional experiments comparing its performance with that of state-of-the-art de-anonymization techniques from the literature. Specifically, we considered the following methods:
Narayanan and Shmatikov’s Algorithm [1]: This seminal work introduced the concept of quasi-identifier attacks for de-anonymizing users in the Netflix Prize dataset based on their movie rating patterns. We adapted their algorithm to our cross-dataset scenario and applied it to the MovieLens, Book-Crossing, and LastFM datasets.
Deep Neural Network-based De-anonymization [12]: This recent work employed deep neural networks to learn the mapping between auxiliary data and the target dataset for de-anonymization. We trained their model using the rating datasets as auxiliary data and evaluated its performance on our cross-dataset de-anonymization task.
Table 8 presents the de-anonymization performance of our approach and that of the state-of-the-art techniques in terms of precision, recall, and F1-score for the MovieLens–Book-Crossing dataset pair.
As shown in the table, our approach outperforms the state-of-the-art techniques across all evaluation metrics, achieving the highest precision, recall, and F1-score for the cross-dataset de-anonymization task. This superior performance can be attributed to the unique strengths of our method:
Our approach leverages the consistency of users’ rating patterns across diverse domains, enabling effective cross-dataset de-anonymization. In contrast, techniques like Narayanan and Shmatikov’s algorithm were originally designed for single-dataset scenarios and may not fully capture the cross-domain signal present in our task.
By combining record linkage techniques with quasi-identifier attacks, our method effectively exploits the uniqueness of rating patterns as high-dimensional quasi-identifiers, overcoming the limitations of traditional de-anonymization techniques that focus on explicit identifiers or predefined attribute combinations.
Unlike deep learning-based approaches that require large amounts of training data and computational resources, our method is efficient and interpretable, relying on well-established probabilistic record linkage models and similarity metrics.
While the deep neural network-based approach [12] achieved competitive results, our method outperformed it in terms of both precision and recall. Additionally, our approach has the advantage of being more interpretable and requiring fewer computational resources compared to deep learning models.
These results demonstrate the novelty and effectiveness of our proposed approach in addressing the challenging task of cross-dataset de-anonymization, highlighting its unique contributions to the field of user privacy and data anonymization.
5.12. Impact of Data Privacy Regulations and Anonymization Techniques
Data privacy regulations like GDPR, HIPAA, SOX, and others mandate strict measures to protect individuals’ confidential data, including the use of anonymization techniques. Different anonymization techniques, such as k-anonymity, l-diversity, and differential privacy, aim to achieve varying levels of privacy protection by modifying or perturbing the data in specific ways.
The effectiveness of our proposed de-anonymization approach may be influenced by the specific anonymization technique employed on the rating datasets. For instance, techniques that suppress or generalize quasi-identifying attributes (such as rating patterns) could potentially reduce the performance of our approach by diminishing the uniqueness of users’ data patterns.
5.12.1. k-Anonymity and l-Diversity
The k-anonymity approach aims to ensure that each record in a dataset is indistinguishable from at least k − 1 other records when considering a set of quasi-identifying attributes. However, this technique may be vulnerable to our de-anonymization method if the rating patterns are not properly suppressed or generalized, as they can still serve as quasi-identifiers. The l-diversity technique extends k-anonymity by requiring that within each group of k records, there are at least l well-represented values for each sensitive attribute. While this approach provides additional protection against attribute disclosure, it may still be susceptible to our de-anonymization attack if the rating patterns exhibit sufficient uniqueness.
5.12.2. Differential Privacy
Differential privacy is a more robust privacy model that aims to protect the privacy of individuals by introducing controlled noise or perturbation to the data. By adding carefully calibrated noise, differential privacy ensures that the presence or absence of any individual’s data in the dataset has a negligible impact on the overall results or outputs. If applied to rating datasets, differential privacy could potentially mitigate the effectiveness of our de-anonymization approach by obfuscating the rating patterns and reducing their uniqueness. However, achieving a desired level of privacy protection through differential privacy may come at the cost of reduced data utility, which could negatively impact the performance of recommender systems or other data-driven applications that rely on accurate rating data.
5.12.3. Need for Tailored Privacy-Preserving Mechanisms
While existing anonymization techniques provide general privacy protection, our findings highlight the need for stronger and more tailored privacy-preserving mechanisms specifically designed for rating datasets and recommender systems. These mechanisms should account for the unique characteristics of rating data, such as the potential for quasi-identifying rating patterns while preserving the utility of the data for developing accurate and effective recommender algorithms. Potential approaches could involve the development of domain-specific anonymization techniques, secure multi-party computation methods, or the integration of privacy-enhancing technologies like homomorphic encryption or secure enclaves. Collaboration between privacy researchers, recommender system experts, and industry practitioners is crucial to address this challenge and strike a balance between privacy protection and data utility in the context of rating datasets and personalized recommendation services.
5.12.4. Ethical and Legal Considerations
The findings of our study highlight the potential privacy risks associated with the release of pseudonymized rating datasets, even when traditional anonymization techniques are applied. This raises important ethical and legal considerations regarding the deployment and use of de-anonymization techniques like the one proposed in this work.
From an ethical perspective, the ability to re-identify individuals based on their rating patterns could be seen as a violation of their privacy and autonomy. Individuals may have provided their ratings with the expectation of anonymity, and the de-anonymization of their data could expose sensitive information about their preferences and behaviors without their explicit consent. This raises concerns about the potential misuse of such techniques for profiling, discrimination, or other unintended purposes.
Legally, the deployment of de-anonymization techniques may conflict with data protection laws and regulations, such as the General Data Protection Regulation (GDPR) in the European Union. The GDPR imposes strict requirements on the processing of personal data, including the need for a lawful basis and adherence to principles such as data minimization and purpose limitation. The re-identification of individuals through techniques like ours could potentially violate these principles and lead to legal consequences for organizations that fail to adequately protect personal data.
However, there may be legitimate use cases where de-anonymization techniques could be justified, such as in the context of law enforcement investigations or cybersecurity research aimed at identifying and mitigating privacy risks. In such cases, a careful balancing of interests and a thorough assessment of the necessity and proportionality of the de-anonymization process would be required.
To address these ethical and legal concerns, it is crucial for researchers, industry practitioners, and policymakers to engage in open and transparent discussions about the responsible use of de-anonymization techniques. Clear guidelines and frameworks should be established to ensure that such techniques are only employed in legally and ethically justifiable scenarios, with appropriate safeguards and oversight mechanisms in place. Additionally, ongoing collaboration between privacy experts, data scientists, and legal professionals is essential to stay ahead of emerging privacy challenges and develop robust privacy-preserving mechanisms that can effectively protect individuals’ privacy while enabling the responsible use of data for beneficial purposes.
6. Conclusions
In this paper, we presented a novel approach for de-anonymizing users across different rating datasets by leveraging their rating patterns as quasi-identifiers and employing record linkage techniques. Our key insight was that users tend to exhibit consistent preferences and behaviors across diverse domains, enabling the linking of their identities based on the similarity of their rating vectors. By combining probabilistic record linkage methods with quasi-identifier attacks, we demonstrated the feasibility of cross-dataset de-anonymization, a task that poses significant privacy risks but has received limited attention in prior research.
Through extensive experiments on three publicly available rating datasets (MovieLens, Book-Crossing, and LastFM), we evaluated the effectiveness of our approach in linking user records across these diverse domains. Our results showed high precision and recall values, with F1-scores ranging from 0.72 to 0.79 for pairwise de-anonymization tasks. We further investigated the impact of various factors on the de-anonymization performance, including the choice of similarity metric, the number of datasets involved, and data sparsity. Our approach demonstrated robustness to data sparsity, maintaining reasonable performance even when a significant portion of the ratings was unavailable.
The success of our de-anonymization approach highlights the potential privacy risks associated with the release of anonymized rating datasets, which are commonly used for evaluating recommender systems and other data-driven applications. While these datasets are intended to protect user privacy by removing explicit identifiers, our work showed that users’ rating patterns can serve as quasi-identifiers, enabling their re-identification across different contexts.
Our findings suggest that the potential for de-anonymization extends beyond rating datasets and can be applied to other domains where user behavior patterns can serve as quasi-identifiers. For example, in the context of social media platforms, users’ activity patterns, such as the content they engage with, the accounts they follow, or the topics they discuss, could be leveraged for cross-platform de-anonymization. Similarly, in e-commerce scenarios, user browsing and purchasing histories could provide unique fingerprints that enable record linkage across different online marketplaces.
While our approach focused on leveraging rating patterns as quasi-identifiers, the underlying principles of combining record linkage techniques with quasi-identifier attacks can be adapted to other types of user data, such as web browsing histories, social media activities, or location traces. By identifying the unique characteristics and behavioral patterns exhibited by users in these domains, our de-anonymization framework can be extended to uncover privacy vulnerabilities and inform the development of robust anonymization strategies.
Additionally, our work highlights the potential for developing cross-domain user modeling and personalization systems by leveraging linked user identities across diverse datasets. By integrating user preferences and behaviors from multiple contexts, such systems could provide more comprehensive and tailored recommendations, personalized content, or targeted services. However, the development of such systems must be accompanied by rigorous privacy safeguards and ethical considerations to ensure the responsible use of user data.
Our findings underscore the need for stronger anonymization techniques and privacy-preserving mechanisms in the context of rating datasets. Potential countermeasures include the application of differential privacy techniques, secure multi-party computation methods, or the careful selection and suppression of quasi-identifiers during the data anonymization process. Additionally, legal and ethical frameworks should be established to govern the collection, use, and dissemination of user data, ensuring that individuals’ privacy rights are respected while enabling the development of beneficial data-driven applications.
Future research directions include exploring more advanced record linkage and de-anonymization techniques, as well as developing robust privacy-preserving methods for rating datasets. Additionally, investigating the generalizability of our approach to other types of user data, such as browsing histories or social media activities, could provide further insights into the privacy implications of data release and sharing practices.
Our de-anonymization technique relies on the assumption that user data exhibit sufficient distinctiveness or uniqueness to serve as a quasi-identifier for record linkage. In the context of rating datasets, this assumption holds true due to the inherent diversity in user preferences and rating patterns. However, for datasets with different types of quasi-identifiers or less structured data, the effectiveness of our approach may vary. For example, in datasets where user data are less diverse or exhibit a higher degree of similarity across individuals, the uniqueness of the quasi-identifiers may be diminished, potentially reducing the performance of our de-anonymization technique. Additionally, datasets with more complex or unstructured data formats, such as textual data or multimedia content, may require adaptations to our approach for extracting and comparing quasi-identifying features effectively.
To address these limitations and provide a more comprehensive understanding of our method’s versatility, we plan to extend our experiments to include a diverse range of datasets from various domains. This will involve identifying suitable quasi-identifying attributes or features in each dataset and adapting our similarity computation and record linkage techniques accordingly.
Furthermore, we recognize the value of incorporating additional data preprocessing and feature engineering techniques to enhance the performance of our approach on datasets with less structured or unstructured data. Techniques such as natural language processing, computer vision, or domain-specific feature extraction methods may be leveraged to extract meaningful quasi-identifiers from complex data formats.
By conducting comparative analyses across different types of datasets and quasi-identifiers, we aim to gain insights into the strengths and weaknesses of our de-anonymization technique, as well as identify potential areas for improvement or adaptation. This understanding will not only contribute to the broader field of privacy-preserving data publishing but also inform the development of more robust and versatile de-anonymization techniques.
In conclusion, our work serves as a wake-up call for researchers, practitioners, and policymakers to prioritize user privacy in the era of big data. By highlighting the potential for de-anonymization across diverse datasets, we aim to raise awareness and inspire the development of stronger privacy safeguards, enabling the responsible use of user data while protecting individuals’ fundamental right to privacy.