1. Introduction
With the rapid growth of online platforms and services, massive amounts of user data are being generated and collected across various domains, such as movies, books, music, and e-commerce. These datasets contain rich information about user preferences and behaviors, making them highly valuable for personalized recommendation systems, targeted advertising, and other data-driven applications. However, due to privacy concerns, these datasets are typically pseudonymized before being released or shared, with explicit identifiers (e.g., names and email addresses) removed or replaced with pseudonymous identifiers.
Despite anonymization efforts, recent research has shown that it is possible to de-anonymize users in these datasets by leveraging the uniqueness of their data patterns, such as their rating histories or browsing behaviors [1,2]. This raises significant privacy concerns, as de-anonymized user data can potentially reveal sensitive information about individuals, leading to issues such as discrimination, targeted exploitation, or reputational damage.
In this paper, we focus on the problem of de-anonymizing users across different rating datasets, such as those used for evaluating recommender systems (e.g., MovieLens, Book-Crossing, and LastFM). While previous studies have explored de-anonymization techniques within a single dataset [1], our work investigates the more challenging task of linking user identities across multiple datasets from diverse domains. This cross-dataset de-anonymization is particularly relevant in scenarios where users exhibit consistent preferences and behaviors across different contexts, such as movies, books, and music.
Our key insight is that users’ rating patterns can serve as quasi-identifiers, providing a unique fingerprint that can be leveraged for record linkage across datasets. We propose a novel approach that combines record linkage techniques [3] with quasi-identifier attacks [1] to de-anonymize users by linking their records based on the similarity of their rating vectors.
The contributions of this paper are as follows:
We present a novel approach that addresses the challenging task of de-anonymizing users across multiple rating datasets from diverse domains by leveraging the consistency of their rating patterns as high-dimensional quasi-identifiers and combining record linkage techniques with quasi-identifier attacks.
We conduct extensive experiments, evaluating our approach on three publicly available rating datasets (MovieLens, Book-Crossing, and LastFM), and demonstrate its effectiveness in achieving high precision and recall for cross-dataset de-anonymization tasks, outperforming existing state-of-the-art techniques.
We provide a thorough investigation of various factors impacting the de-anonymization performance, including the choice of similarity metric, the combination of datasets, data sparsity, user demographics, and temporal variations in user data.
We highlight the privacy implications of our findings, emphasizing the potential risks associated with the release of anonymized rating datasets and underscoring the critical need for stronger anonymization techniques and tailored privacy-preserving mechanisms specifically designed for rating datasets and recommender systems.
The remainder of this paper is organized as follows.
Section 2 provides a comprehensive review of the relevant literature on de-anonymization techniques and record linkage methods, laying the foundation for our work.
Section 3 introduces our novel approach to cross-dataset de-anonymization, presenting the key insights and overall process.
Section 4 describes the detailed methodology employed in our experimental evaluation, including data preprocessing steps, quasi-identifier extraction, similarity computation techniques, record linkage algorithms, and identity resolution strategies.
Section 5 presents and analyzes the results obtained from our extensive experiments, shedding light on the effectiveness of our approach, the impact of various factors on performance, and comparative analyses with existing techniques. Finally, Section 6 concludes this paper by summarizing our key findings, highlighting their implications for user privacy and data anonymization, and outlining potential future research directions in this critical area.
3. The Proposed Approach
Our approach represents a novel contribution to the field of user de-anonymization by addressing the challenge of linking user identities across multiple rating datasets from diverse domains. While previous work has focused on de-anonymizing users within a single dataset, our method leverages the insight that users exhibit consistent preferences and behaviors across different contexts, enabling cross-dataset de-anonymization. The overall process of our approach is summarized in Algorithm 1, which outlines the key steps involved.
Algorithm 1 De-anonymizing users across rating datasets
1:  Input: D = {D_1, …, D_n}: set of n rating datasets
2:  Output: M: mapping of linked user identities across datasets
3:  M ← ∅                                     ▹ Initialize mapping
4:  for each dataset D_i ∈ D do
5:      D_i ← UserFiltering(D_i, θ)           ▹ Filter users with few ratings
6:      D_i ← RatingNormalization(D_i)        ▹ Normalize rating values
7:      for each user u ∈ D_i do
8:          R_u ← ExtractRatingVector(D_i, u) ▹ Extract rating vector
9:      end for
10: end for
11: for each pair of datasets (D_i, D_j), i < j do
12:     S_ij ← ComputeSimilarities(D_i, D_j)  ▹ Compute similarity matrix
13:     L_ij ← RecordLinkage(S_ij, τ)         ▹ Perform record linkage
14:     M ← M ∪ L_ij                          ▹ Update mapping with linked identities
15: end for
16: M ← ResolveIdentities(M)                  ▹ Resolve linked identities
17: return M
In Algorithm 1, the parameter θ in line 5 represents the minimum number of ratings required for a user to be included in the analysis. This filtering step is essential to ensure that the rating vectors used as quasi-identifiers are representative of the users’ preferences and behaviors. If a user has provided too few ratings, their rating vector may not accurately capture their true preferences, potentially leading to erroneous linkages. The value of θ should be determined based on domain knowledge and empirical analysis, balancing the trade-off between retaining a sufficient number of users for effective de-anonymization and ensuring the quality of the rating vectors used for record linkage.
The UserFiltering function (line 5) takes the input dataset D_i and the threshold parameter θ as inputs and returns a filtered dataset containing only users who have provided at least θ ratings. This filtering step helps ensure that the subsequent record linkage process is performed on reliable and representative rating vectors, improving the overall accuracy of the de-anonymization results.
The RatingNormalization function (line 6) takes the filtered dataset D_i as input and normalizes the rating values to a common scale, typically between 0 and 1. This normalization step is crucial for ensuring that the similarity computations performed later in the algorithm are not biased by the different rating scales used across the datasets. By mapping all rating values to a common range, the algorithm can more accurately compare and match rating vectors from different datasets.
The ExtractRatingVector function (line 8) extracts the rating vector R_u for a given user u from the preprocessed dataset D_i. This function retrieves the list of ratings provided by the user and represents it as a vector, which serves as a quasi-identifier for the record linkage process. The rating vector encapsulates the user’s preferences and behavior patterns, enabling the algorithm to identify and link their identities across different datasets.
The ComputeSimilarities function (line 12) operates on the entire preprocessed datasets D_i and D_j to compute the pairwise similarity matrix S_ij between all user rating vectors across the two datasets. This design choice is motivated by computational efficiency, as computing similarities between individual rating vectors can be expensive. Instead, we first extract the rating vectors and then compute the similarity matrix in a single step, leveraging efficient matrix operations.
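To make this matrix-level design concrete, the sketch below (our illustration, not the authors’ code; Python with scikit-learn assumed) computes the full cross-dataset similarity matrix in a single vectorized call, assuming the two rating matrices have already been aligned to a common item space during preprocessing:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarities(R_i, R_j):
    """Return S where S[u, v] is the cosine similarity between user u of
    dataset i and user v of dataset j. Assumes R_i and R_j are
    (users x items) matrices over a shared item space; sparse SciPy
    matrices are also accepted."""
    return cosine_similarity(R_i, R_j)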
While previous work focused on de-anonymizing users within a single dataset [1,2], our approach tackles the more challenging task of linking user identities across multiple datasets from diverse domains, such as movies, books, and music. This cross-dataset de-anonymization is particularly relevant when users exhibit consistent preferences and behaviors across different contexts. Unlike techniques that rely on network structures or demographic attributes [5,6], our approach leverages users’ rating patterns as quasi-identifiers, which can provide a high degree of uniqueness and enable effective record linkage across datasets. Compared to traditional record linkage techniques that rely on exact attribute matching or predefined rules, our method employs a probabilistic approach based on the Fellegi–Sunter model [21]. This model accounts for the inherent uncertainty and variability in rating patterns, making it more robust and suitable for the de-anonymization task. Furthermore, our approach incorporates techniques from the quasi-identifier attack literature [1], where users’ rating vectors are treated as high-dimensional quasi-identifiers. By combining record linkage and quasi-identifier attacks, our method effectively exploits the uniqueness of rating patterns to link user identities across datasets, overcoming the limitations of traditional de-anonymization techniques that focus on explicit identifiers or predefined attribute combinations.
In the following subsections, we provide a detailed description of each step in our approach.
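As a reading aid, here is a minimal Python driver mirroring Algorithm 1. It is a sketch under our own naming (filter_users, link_records, and the other helpers are themselves sketched in the subsections that follow), and it glosses over the cross-dataset alignment of item spaces, which we assume is handled during preprocessing:

from itertools import combinations

def deanonymize(datasets, theta, tau, m_probs, u_probs):
    """End-to-end sketch of Algorithm 1; all helper names are illustrative."""
    prepped = []
    for df in datasets:
        df = normalize_ratings(filter_users(df, theta))      # Section 3.1
        R, _users, _items = build_rating_matrix(df)          # Section 3.2
        prepped.append(R)
    links = []
    for (i, R_i), (j, R_j) in combinations(enumerate(prepped), 2):
        S = compute_similarities(R_i, R_j)                   # Section 3.3
        for u, v in link_records(S, m_probs, u_probs, tau):  # Section 3.4
            links.append(((i, int(u)), (j, int(v))))
    return resolve_identities(links)                         # Section 3.5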
3.1. Data Preprocessing
Let D = {D_1, …, D_n} be the set of n rating datasets we aim to de-anonymize. Each dataset D_i consists of user–item rating tuples (u, i, r), where u is the user identifier, i is the item identifier, and r is the rating value.
We preprocess each dataset by applying the steps outlined below.
3.1.1. User Filtering
We filter out users who have provided fewer than θ ratings, where θ is a predefined threshold. This step ensures that we consider only users with sufficient rating information for reliable record linkage. Users with too few ratings may not exhibit distinctive patterns, making it challenging to link their records accurately.
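A minimal sketch of this filtering step, assuming the dataset is held in a pandas DataFrame with columns user, item, and rating (the threshold name theta is ours):

import pandas as pd

def filter_users(df: pd.DataFrame, theta: int) -> pd.DataFrame:
    """Keep only users who have contributed at least `theta` ratings."""
    counts = df.groupby("user")["rating"].transform("size")
    return df[counts >= theta].copy()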
3.1.2. Rating Normalization
Since different datasets may use different rating scales, we normalize the rating values to a common scale (e.g., [0, 1]) using min–max normalization:

$$ r' = \frac{r - r_{\min}}{r_{\max} - r_{\min}}, $$

where $r_{\min}$ and $r_{\max}$ are the minimum and maximum rating values in the dataset, respectively. This normalization step ensures that rating values from different datasets are comparable and can be used consistently for similarity computation.
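The normalization step translates directly into code; a sketch under the same DataFrame assumption as above:

def normalize_ratings(df):
    """Min-max normalize the rating column to [0, 1] using the dataset-wide
    minimum and maximum (a constant-rating dataset would need a guard)."""
    r_min, r_max = df["rating"].min(), df["rating"].max()
    out = df.copy()
    out["rating"] = (out["rating"] - r_min) / (r_max - r_min)
    return out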
3.2. Quasi-Identifier Extraction
For each user u in a dataset D_i, we extract their rating vector R_u as a quasi-identifier. The rating vector is an ordered sequence of the user’s ratings for the items they have rated, i.e., $R_u = (r_{u,1}, r_{u,2}, \ldots, r_{u,m})$, where m is the number of items rated by the user.
The rating vector serves as a quasi-identifier because, while it may not uniquely identify a user, it can provide a high degree of uniqueness, especially when combined across multiple datasets. Our key assumption is that users’ rating patterns reflect their underlying preferences and behaviors, which tend to be consistent across different domains.
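In practice, the per-user rating vectors are naturally represented as rows of a sparse user–item matrix. A sketch of this extraction step (the representation choice is ours, not prescribed by the paper):

import pandas as pd
from scipy.sparse import csr_matrix

def build_rating_matrix(df):
    """Encode each user's ratings as a sparse row vector over the item space;
    row u of the returned matrix is the rating vector R_u."""
    users = {u: k for k, u in enumerate(df["user"].unique())}
    items = {i: k for k, i in enumerate(df["item"].unique())}
    rows = df["user"].map(users).to_numpy()
    cols = df["item"].map(items).to_numpy()
    R = csr_matrix((df["rating"].to_numpy(), (rows, cols)),
                   shape=(len(users), len(items)))
    return R, users, items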
3.3. Similarity Computation
To measure the similarity between the rating vectors of users across different datasets, we employ the cosine similarity metric:

$$ \mathrm{sim}(R_u, R_v) = \frac{R_u \cdot R_v}{\lVert R_u \rVert \, \lVert R_v \rVert}, $$

where $R_u$ and $R_v$ are the rating vectors of users u and v, respectively.
For each pair of datasets $(D_i, D_j)$, we compute the cosine similarity between the rating vectors of all user pairs $(u, v)$ such that $u \in D_i$ and $v \in D_j$. This results in a similarity matrix $S_{ij}$, where $S_{ij}[u, v] = \mathrm{sim}(R_u, R_v)$.
While the cosine similarity metric is our default choice, our approach is flexible and can accommodate other similarity measures, such as the Euclidean distance, Jaccard similarity, or Pearson correlation coefficient. We investigate the impact of different similarity metrics on the de-anonymization performance in our experimental evaluation (Section 5).
3.4. Record Linkage
We employ a probabilistic record linkage approach to link user records across datasets based on the computed similarity matrices. Specifically, we use the Fellegi–Sunter model [21], which defines two conditional probabilities:
m(x): the probability of observing similarity x when the records refer to the same user.
u(x): the probability of observing similarity x when the records refer to different users.
These probabilities can be estimated from the data or provided as prior knowledge based on domain expertise. Given these probabilities, we can compute the weight of evidence for linking two records u and v as:

$$ W(u, v) = \log \frac{m(s_{uv})}{u(s_{uv})}, $$

where $s_{uv} = S_{ij}[u, v]$ is the similarity between their rating vectors. We then link user records u and v if $W(u, v)$ exceeds a predefined threshold τ. The choice of this threshold affects the trade-off between precision and recall in the record linkage process.
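A compact sketch of this linkage decision, with the likelihood functions m_probs and u_probs (our names) assumed to have been estimated beforehand, e.g., from labeled pairs or via expectation–maximization, and assumed to apply elementwise to an array of similarities:

import numpy as np

def link_records(S, m_probs, u_probs, tau):
    """Return the (u, v) index pairs whose weight of evidence
    W = log(m(s_uv) / u(s_uv)) exceeds the threshold tau."""
    W = np.log(m_probs(S)) - np.log(u_probs(S))  # elementwise over the matrix
    return np.argwhere(W > tau)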
3.5. Identity Resolution
After performing record linkage across all pairs of datasets, we resolve the linked user identities to determine the real-world identities of users across different datasets. This step involves mapping the linked user identifiers to their corresponding real-world identities if such information is available in the datasets.
In cases where real-world identities are not directly available, we can assign unique identifiers to the linked user records, effectively de-anonymizing them across the datasets while preserving their anonymity within each individual dataset.
Our identity resolution step leverages techniques from the entity resolution and identity matching literature [3,22]. We employ a combination of deterministic and probabilistic methods to resolve the linked identities, taking into account additional user attributes (e.g., demographic information) when available.
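One standard way to realize this step is to merge the pairwise links into global identity clusters with a union-find structure; the sketch below is our illustration and keys each record by a (dataset, user) tuple:

class UnionFind:
    """Disjoint-set forest with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def resolve_identities(links):
    """links: iterable of ((dataset_a, user_a), (dataset_b, user_b)) pairs.
    Returns clusters of records believed to be the same real-world user."""
    uf = UnionFind()
    for a, b in links:
        uf.union(a, b)
    clusters = {}
    for record in list(uf.parent):
        clusters.setdefault(uf.find(record), []).append(record)
    return list(clusters.values())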
The resolved identities can then be used for further analysis, such as studying user behavior across different domains or developing cross-domain recommendation systems. However, it is crucial to handle this sensitive information responsibly and comply with relevant privacy regulations and ethical guidelines.
5. Results
In this section, we present the results of our experimental evaluation of the proposed approach for de-anonymizing users across different rating datasets using record linkage techniques and quasi-identifier attacks.
5.1. User De-Anonymization across Datasets
Table 2 presents the results of our approach for de-anonymizing users across pairs of datasets. We report the precision, recall, and F1-score for each pair of datasets.
As discussed in Section 4.1.2, the precision and recall values reported in Table 2 are calculated based on the record linkage decisions made by our proposed approach using the Fellegi–Sunter model [21]. This model computes the weight of evidence W(u, v) for linking two user records u and v across datasets (Equation (3)). The weight is determined by the similarity between their rating vectors R_u and R_v (Equation (2)). Two user records u and v are linked if W(u, v) exceeds a predefined threshold τ. The precision and recall values in Table 2 reflect the accuracy of these linkage decisions, evaluated against the available ground-truth information about the true user identities across the datasets.
As shown in the table, our approach achieves high precision and recall values across all dataset pairs, with F1-scores ranging from 0.72 to 0.79. These results demonstrate the effectiveness of our approach in linking user records across different rating datasets based on their rating patterns.
The highest F1-score of 0.79 is observed for the MovieLens–Book-Crossing dataset pair, indicating a strong correlation between users’ movie and book rating patterns. The slightly lower F1-scores for the pairs involving the LastFM dataset suggest that music listening patterns may be less correlated with movie and book rating patterns.
5.2. Impact of Data Density, Size, Diversity, and User Activity
In our experiments, we observed varying levels of data density and user activity across the three rating datasets, which could potentially influence the de-anonymization performance of our approach. In this subsection, we analyze these factors and investigate their impact on the de-anonymization success rate.
5.2.1. Data Density
The data density of a dataset refers to the average number of ratings per user. Datasets with higher data density (more ratings per user) provide richer user profiles and potentially more distinctive rating patterns, which could facilitate more accurate record linkage and de-anonymization.
We computed the data density for each dataset as follows:
MovieLens: 1,000,209 ratings/6040 users = 165.6 ratings per user.
Book-Crossing: 1,149,780 ratings/278,858 users = 4.1 ratings per user.
LastFM: 17,559,530 artist plays/92,834 users = 189.2 ratings per user (considering each artist play as a rating).
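These per-user averages follow directly from the raw counts; a quick check:

datasets = {
    "MovieLens":     (1_000_209, 6_040),      # ratings, users
    "Book-Crossing": (1_149_780, 278_858),
    "LastFM":        (17_559_530, 92_834),    # artist plays treated as ratings
}
for name, (n_ratings, n_users) in datasets.items():
    print(f"{name}: {n_ratings / n_users:.1f} ratings per user")
# MovieLens: 165.6, Book-Crossing: 4.1, LastFM: 189.2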
While data density is an important factor influencing the de-anonymization performance, our results suggest that it may not be the sole determining factor. The Book-Crossing dataset, despite having the lowest data density among the three datasets, exhibited the highest de-anonymization performance when paired with the MovieLens dataset (F1-score: 0.79). This could potentially be attributed to other factors, such as the inherent similarity between movie and book rating patterns, or the presence of a sufficient number of highly active users in both datasets.
On the other hand, the MovieLens–LastFM dataset pair, which involved two datasets with relatively high data densities, achieved the lowest de-anonymization performance (F1-score: 0.72). This observation suggests that the consistency of user preferences across different domains (movies and music) may play a more significant role than data density alone in determining the effectiveness of our de-anonymization approach.
Therefore, while data density is an important consideration, our results indicate that the interplay of multiple factors, including data density, user activity levels, and the inherent similarity of rating patterns across domains, collectively influences the de-anonymization success rate. A simplistic assumption based solely on data density may not capture the nuances of the de-anonymization process across diverse datasets.
5.2.2. Impact of Dataset Size and Diversity
In our experiments, we evaluated the de-anonymization performance across three datasets: MovieLens, Book-Crossing, and LastFM. While these datasets cover different domains (movies, books, and music) and vary in size, a comprehensive analysis of the impact of dataset size and diversity on our approach’s performance is warranted.
To investigate the effect of dataset size, we conducted additional experiments by varying the number of users included in each dataset.
Figure 2 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the dataset size, represented by the number of users.
As expected, the F1-score generally improves as the dataset size increases, reaching a plateau beyond a certain number of users. This behavior is intuitive, as larger datasets provide more comprehensive coverage of user rating patterns, increasing the likelihood of finding distinctive quasi-identifiers for successful record linkage.
However, it is important to note that the rate of improvement in performance may diminish beyond a certain dataset size, as the marginal benefit of additional users decreases. Furthermore, excessively large datasets may introduce computational challenges and scalability issues, requiring the development of efficient algorithms or approximation techniques.
To investigate the impact of dataset diversity, we simulated scenarios where the datasets encompassed a broader range of domains by combining the rating data from multiple sources. Specifically, we created a composite dataset by merging the MovieLens, Book-Crossing, and LastFM datasets, treating each domain as a separate set of items rated by users.
Figure 3 shows the F1-score achieved by our approach on this composite dataset compared to the individual dataset pairs.
As illustrated in the figure, the de-anonymization performance on the composite dataset (F1-score of 0.72) is lower than that on the individual dataset pairs but still within a reasonable range. This observation suggests that while increased dataset diversity can introduce additional challenges due to the potential inconsistency of user preferences across domains, our approach remains effective in leveraging the uniqueness of rating patterns for de-anonymization.
These findings highlight the interplay between dataset size, diversity, and de-anonymization performance. While larger datasets generally improve performance, the benefits may plateau beyond a certain size, and computational considerations become increasingly important. Additionally, increased dataset diversity can enhance the robustness and generalizability of our approach but may also introduce challenges due to potential inconsistencies in user preferences across domains.
Overall, our analysis demonstrates the necessity of carefully balancing dataset size and diversity to optimize the de-anonymization performance while considering computational constraints and the inherent characteristics of the data. These insights can inform the development of more robust and scalable de-anonymization techniques, as well as the design of effective privacy-preserving mechanisms for rating datasets.
5.2.3. User Activity Levels
In addition to data density, size, and diversity, the distribution of user activity levels within a dataset can also impact the de-anonymization performance. Datasets with a large proportion of highly active users (those with many ratings) may exhibit more distinctive rating patterns, facilitating de-anonymization, while datasets with predominantly low-activity users could pose challenges.
The MovieLens and LastFM datasets exhibit a long-tailed distribution, with a significant number of users having relatively few ratings, whereas the Book-Crossing dataset has a more concentrated distribution around lower activity levels. The presence of a large proportion of users with sparse rating data in the MovieLens and LastFM datasets could contribute to the weaker de-anonymization performance observed for dataset pairs involving these two datasets. Conversely, the more concentrated distribution of user activity levels in the Book-Crossing dataset, with fewer highly distinctive user profiles, may have facilitated better de-anonymization performance for pairs involving this dataset, particularly when paired with the MovieLens dataset, which had a higher overall data density.
Our analysis highlights the nuanced impact of data density and user activity levels on the de-anonymization risk of rating datasets. While higher data density can facilitate de-anonymization by providing richer user profiles, datasets with a significant proportion of users having sparse rating data may exhibit weaker de-anonymization performance. Conversely, datasets with a more concentrated distribution of user activity levels, even with lower overall data density, could potentially improve de-anonymization performance by reducing the noise introduced by highly sparse user profiles.
These findings suggest that data publishers should carefully analyze the characteristics of their datasets, considering both data density and the distribution of user activity levels when evaluating the de-anonymization risk and determining appropriate anonymization techniques. Datasets with high overall data density but a long-tailed distribution of user activities may require stronger anonymization measures compared to datasets with lower data density but a more concentrated activity distribution.
Furthermore, our approach can be extended to incorporate these factors into the record linkage and de-anonymization process. By incorporating user activity levels and data density into the similarity computation and record linkage steps, our method could potentially enhance its overall performance and robustness, particularly in scenarios where user activity distributions vary significantly across datasets.
5.3. Impact of User Rating Threshold
We investigated the impact of the user rating threshold θ (the minimum number of ratings required for a user to be included in the analysis) on the de-anonymization performance. Figure 4 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the rating threshold θ.
As expected, the F1-score improves as the rating threshold increases, since users with more ratings provide more robust quasi-identifiers for record linkage. However, the improvement diminishes beyond a certain threshold value (around 100 in our experiments), as retaining only users with very large numbers of ratings reduces the sample size and risks overfitting.
5.4. Computational Performance
Our computational complexity analysis was based on the assumption that the number of datasets, k, is a constant and relatively small compared to the number of users, n, and the average number of ratings per user, m. In this scenario, the dominant factor in the time complexity is the pairwise comparison of user rating vectors, which has a quadratic time complexity of $O(k \cdot n^2 \cdot m)$. The term m accounts for the time required to compute the similarity between two rating vectors of length m, and the constant factor k represents the number of dataset pairs to be processed.
However, in scenarios where k is not a constant or is comparable to n or m, the overall time complexity can grow beyond quadratic. The pairwise comparison of user rating vectors across all dataset pairs requires $O(k \cdot n^2 \cdot m)$ operations, so when k grows on the order of n, the overall complexity becomes cubic in n.
Additionally, if the average number of ratings per user, m, is large, the term m in the complexity expression may become more significant, potentially affecting the overall computational complexity. In such cases, optimizations or approximation techniques may be required to reduce the computational burden associated with calculating similarities between long rating vectors.
On a standard desktop computer with an Intel Core i7 processor and 16 GB of RAM, the average runtime for de-anonymizing users across the MovieLens and Book-Crossing datasets (the largest pair in our experiments) was approximately 2 h. This runtime is reasonable for offline analysis tasks and can be further improved through parallelization and optimizations.
Our results demonstrate the feasibility and effectiveness of de-anonymizing users across different rating datasets using record linkage techniques and quasi-identifier attacks. By leveraging the uniqueness of users’ rating patterns, our approach can link their records across datasets, potentially revealing sensitive information about their preferences and behavior.
While our approach achieves high precision and recall values, there is still room for improvement, particularly for dataset pairs with less correlated rating patterns (e.g., LastFM and other domains). Incorporating additional user attributes or leveraging more advanced record linkage techniques may further enhance the de-anonymization performance.
It is important to note that our work highlights the privacy risks associated with the release of rating datasets, even when they are anonymized. Researchers and practitioners should be aware of these risks and take appropriate measures to protect user privacy, such as differential privacy techniques or secure multi-party computation methods.
To provide benchmarks on the computational resources required for larger datasets, we conducted additional experiments by varying the dataset size.
Table 3 presents the runtime and memory usage for different dataset sizes, considering the MovieLens–Book-Crossing pair as an example.
As shown in the table, the runtime and memory usage increase with the dataset size, exhibiting an approximately quadratic growth pattern. This aligns with our theoretical analysis, where the dominant factor in the time complexity is the pairwise comparison of user rating vectors, resulting in a quadratic complexity of $O(n^2 \cdot m)$.
For datasets with 250,000 users, the runtime reaches approximately 9.2 h, and the memory usage is around 10.6 GB. While these resources are manageable for offline analysis tasks, larger datasets may require more powerful computational resources or the implementation of optimizations and approximation techniques to reduce the computational burden.
It is important to note that these benchmarks are specific to our implementation and the hardware configuration used in our experiments. The actual computational resources required may vary depending on the specific datasets, hardware specifications, and potential optimizations applied to the implementation.
5.5. Impact of Rating Similarity Metric
To investigate the impact of the similarity metric used for comparing user rating vectors, we repeated the de-anonymization experiments using different similarity measures: Euclidean distance, Jaccard similarity, and Pearson correlation coefficient.
Table 4 shows the F1-scores achieved for the MovieLens–Book-Crossing dataset pair using these similarity metrics.
For the Euclidean distance, we used the following equation:

$$ d(R_u, R_v) = \sqrt{\sum_{k=1}^{m} (r_{u,k} - r_{v,k})^2}, $$

where $R_u$ and $R_v$ are the rating vectors of users u and v, respectively, and m is the number of commonly rated items. The Euclidean distance measures the straight-line distance between two rating vectors in the m-dimensional space.
For the Jaccard similarity, we used the following formula:

$$ J(u, v) = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}, $$

where $I_u$ and $I_v$ denote the sets of items rated by users u and v; this measures the ratio of commonly rated items to the total number of items rated by either user.
For the Pearson correlation coefficient, we used the standard formula

$$ \rho(R_u, R_v) = \frac{\sum_{k=1}^{m} (r_{u,k} - \bar{r}_u)(r_{v,k} - \bar{r}_v)}{\sqrt{\sum_{k=1}^{m} (r_{u,k} - \bar{r}_u)^2}\,\sqrt{\sum_{k=1}^{m} (r_{v,k} - \bar{r}_v)^2}}, $$

where $\bar{r}_u$ and $\bar{r}_v$ are the mean rating values for users u and v, respectively. The Pearson correlation coefficient measures the linear correlation between two rating vectors.
The computational complexity of the similarity metrics depends on the sparsity of the rating data and the length of the rating vectors. For the Euclidean distance and Pearson correlation coefficient, the complexity is $O(m)$, where m is the number of commonly rated items, as they require iterating over the rating vectors once. For the Jaccard similarity, the complexity is also $O(m)$, as it involves computing the intersection and union of the sets of rated items.
However, when the rating data are highly sparse, the effective complexity can be lower than $O(m)$. For example, if the average number of non-zero ratings per user is k, where $k \ll m$, the complexity for the Euclidean distance and Pearson correlation coefficient would be $O(k)$, and for the Jaccard similarity it would likewise be $O(k)$ due to the set operations.
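For concreteness, sparse-friendly versions of the three metrics are sketched below, representing each user’s ratings as a dict mapping item to rating (our representation choice); the Pearson variant centers on the co-rated items only, which is one common convention:

import math

def euclidean(ru: dict, rv: dict) -> float:
    """Straight-line distance over the commonly rated items."""
    common = ru.keys() & rv.keys()
    return math.sqrt(sum((ru[i] - rv[i]) ** 2 for i in common))

def jaccard(ru: dict, rv: dict) -> float:
    """|I_u n I_v| / |I_u u I_v| over the sets of rated items."""
    return len(ru.keys() & rv.keys()) / len(ru.keys() | rv.keys())

def pearson(ru: dict, rv: dict) -> float:
    """Linear correlation of the ratings on commonly rated items."""
    common = list(ru.keys() & rv.keys())
    if len(common) < 2:
        return 0.0
    mu_u = sum(ru[i] for i in common) / len(common)
    mu_v = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu_u) * (rv[i] - mu_v) for i in common)
    den = (math.sqrt(sum((ru[i] - mu_u) ** 2 for i in common))
           * math.sqrt(sum((rv[i] - mu_v) ** 2 for i in common)))
    return num / den if den else 0.0

These run in time proportional to the number of non-zero ratings per user, matching the $O(k)$ effective complexity noted above.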
As shown in the table, the Pearson correlation coefficient achieves the highest F1-score of 0.81, slightly outperforming our baseline cosine similarity metric. This result suggests that the Pearson correlation coefficient may be a more effective measure for capturing the similarity between users’ rating patterns. However, the performance differences are relatively small, indicating that our approach is robust to the choice of similarity metric.
5.6. De-Anonymization across Multiple Datasets
In addition to pairwise de-anonymization, we evaluated our approach’s performance in linking user records across all three datasets simultaneously. This scenario is more challenging, as it requires consistent rating patterns across multiple domains (movies, books, and music).
Table 5 presents the precision, recall, and F1-score achieved in this multi-dataset de-anonymization task.
As expected, the performance drops compared to the pairwise de-anonymization tasks, with an F1-score of 0.65. This drop can be attributed to the increased difficulty of finding consistent rating patterns across diverse domains. Nevertheless, our approach still achieves reasonable performance, demonstrating its potential for de-anonymizing users across multiple datasets.
5.7. Impact of Data Sparsity
To investigate the robustness of our approach to data sparsity, we conducted experiments by varying the percentage of available ratings in the datasets. Specifically, we randomly removed a certain fraction of ratings from the datasets and evaluated the de-anonymization performance on the remaining data.
Figure 5 shows the F1-score for the MovieLens–Book-Crossing dataset pair as a function of the percentage of available ratings.
As expected, the F1-score decreases as the percentage of available ratings decreases (i.e., data become sparser). However, our approach maintains reasonable performance even with relatively sparse data, achieving an F1-score of 0.72 when only 50% of the ratings are available. This result demonstrates the robustness of our approach to data sparsity, which is a common issue in real-world rating datasets.
5.8. Detecting the Same User across Datasets
To further illustrate the effectiveness of our approach, we present a case study where we detect the same user across the MovieLens, Book-Crossing, and LastFM datasets.
Table 6 shows the rating vectors of a particular user (denoted as User X) in each dataset, along with the corresponding similarities computed using the cosine similarity metric.
As shown in the table, User X’s rating vector in the MovieLens dataset exhibits high similarity (0.91) with their rating vector in the Book-Crossing dataset, and a slightly lower but still significant similarity (0.87) with their rating vector in the LastFM dataset. These high similarity values suggest that User X exhibits consistent rating patterns across movies, books, and music, enabling our approach to successfully link their identities across these diverse domains.
The weight of evidence values computed using the Fellegi–Sunter model for linking User X’s records were 3.42 (MovieLens–Book-Crossing) and 2.95 (MovieLens–LastFM), both exceeding the linking threshold of 2.0 used in our experiments. Consequently, our approach correctly identified User X as the same individual across all three datasets.
This case study demonstrates how our approach can effectively leverage the uniqueness of users’ rating patterns to de-anonymize them across different contexts, even when explicit identifiers are removed or obfuscated. It highlights the potential privacy risks associated with the release of anonymized rating datasets and the need for stronger anonymization techniques to protect individuals’ privacy.
5.9. User Demographic Analysis
To further investigate the factors that contribute to the effectiveness of our de-anonymization approach, we analyzed the impact of user demographics on the de-anonymization performance. Specifically, we examined how the number of ratings per user and the diversity of their rated items influence the ability to link their identities across datasets.
For this analysis, we focused on the MovieLens and Book-Crossing datasets, as they contain demographic information about users, such as their age and occupation. We divided the users into three age groups: young (18–30 years), middle-aged (31–50 years), and older (51+ years). Additionally, we categorized users based on their occupation into three groups: students, professionals (e.g., engineers, educators), and others (e.g., homemakers, retired).
Table 7 presents the F1-scores achieved using our approach for different user demographic groups when linking their identities across the MovieLens and Book-Crossing datasets.
As shown in the table, our approach achieves the highest F1-score of 0.84 for young users (aged 18–30), followed by middle-aged users (0.81) and older users (0.76). This trend suggests that younger users tend to have more consistent rating patterns across domains, potentially due to their stronger engagement with online platforms and broader interests.
When analyzing the performance based on occupation, we observe the highest F1-score of 0.87 for students, followed by professionals (0.82) and others (0.74). This finding aligns with the age-based analysis, as students typically fall within the young age group and may exhibit more consistent preferences and behaviors across different domains.
To further investigate the impact of rating diversity, we computed the average number of unique items rated by users in each demographic group. We found that younger users and students tend to rate a more diverse set of items compared to older users and other occupations, respectively. This diversity in rated items may contribute to the stronger de-anonymization performance observed for these groups, as their rating patterns become more unique and distinctive.
These results provide valuable insights into the factors that influence the effectiveness of our de-anonymization approach. They suggest that users with more diverse interests and engagement across different domains, such as younger individuals and students, are more susceptible to being de-anonymized based on their rating patterns. This information can inform the development of targeted anonymization strategies and privacy-preserving mechanisms for different user groups.
5.10. Impact of Temporal Variations in User Data
In real-world scenarios, user preferences and behaviors can evolve over time, resulting in temporal variations in their rating patterns. To investigate the robustness of our proposed method to such variations, we simulated scenarios where user rating patterns change over time and evaluated the impact on the de-anonymization performance.
Simulation Methodology
We simulated temporal variations in user rating patterns by introducing controlled perturbations to the rating vectors extracted from the original datasets. Specifically, we divided each user’s rating vector into two segments, representing their past and present rating patterns. We then applied random noise to the second segment, simulating a change in the user’s preferences and behaviors over time.
The random noise was introduced by randomly adding or subtracting a value within a specified range to a certain percentage of the ratings in the second segment. We varied the percentage of perturbed ratings and the range of the random noise to simulate different levels of temporal variation.
For each simulated scenario, we re-computed the similarity scores between the perturbed rating vectors and performed record linkage using our proposed approach. We then compared the de-anonymization performance in terms of precision, recall, and F1-score with the baseline performance on the original, unperturbed datasets.
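A minimal sketch of this perturbation procedure, assuming ratings already normalized to [0, 1]; the parameter names frac and noise_range are ours:

import numpy as np

rng = np.random.default_rng(42)

def perturb_second_segment(ratings: np.ndarray, frac: float,
                           noise_range: float) -> np.ndarray:
    """Add uniform noise in [-noise_range, +noise_range] to a fraction `frac`
    of the second (present-day) half of a rating vector, clipped to [0, 1]."""
    out = ratings.astype(float).copy()
    second = np.arange(len(out) // 2, len(out))
    chosen = rng.choice(second, size=int(frac * len(second)), replace=False)
    noise = rng.uniform(-noise_range, noise_range, size=chosen.size)
    out[chosen] = np.clip(out[chosen] + noise, 0.0, 1.0)
    return out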
Figure 6 illustrates the impact of temporal variations on the de-anonymization performance for the MovieLens–Book-Crossing dataset pair. The x-axis represents the percentage of perturbed ratings in the second segment of the rating vectors, whereas the y-axis shows the relative change in the F1-score compared to the baseline performance on the original datasets.
As expected, the de-anonymization performance degrades as the percentage of perturbed ratings increases, indicating that our approach is sensitive to temporal variations in user rating patterns. However, even with a substantial percentage of perturbed ratings (e.g., 30%), the relative decrease in the F1-score remains within reasonable bounds, ranging from 5% to 15% across different levels of random noise.
These results suggest that our proposed method exhibits moderate robustness to temporal variations in user data. While the de-anonymization performance may degrade to some extent as user preferences evolve, the impact is not catastrophic, and the method remains effective in linking user identities across datasets, even in the presence of moderate temporal variations.
It is worth noting that the degree of robustness may vary depending on the specific datasets and domains under consideration. Domains where user preferences tend to be more stable over time may exhibit greater robustness, whereas domains with more dynamic user behaviors could be more susceptible to the impact of temporal variations.
To further enhance the robustness of our approach to temporal variations, potential extensions could involve incorporating time-sensitive factors into the similarity computation and record linkage processes. For example, applying time-decaying weights to user ratings or adapting the similarity metrics to account for temporal patterns could mitigate the impact of preference changes over time. Additionally, incorporating contextual information, such as user demographics or external events that may influence user behaviors, could improve the ability to model and account for temporal variations.
Overall, our analysis demonstrates that while our proposed method is not immune to the effects of temporal variations in user data, it exhibits a reasonable level of robustness, particularly in scenarios with moderate levels of preference changes over time. This finding further validates the applicability of our approach in real-world settings, where user behaviors and preferences may evolve dynamically.
5.11. Comparison with State-of-the-Art Techniques
To demonstrate the novelty and unique strengths of our proposed approach, we conducted additional experiments comparing its performance with that of state-of-the-art de-anonymization techniques from the literature. Specifically, we considered the following methods:
Narayanan and Shmatikov’s Algorithm [1]: This seminal work introduced the concept of quasi-identifier attacks for de-anonymizing users in the Netflix Prize dataset based on their movie rating patterns. We adapted their algorithm to our cross-dataset scenario and applied it to the MovieLens, Book-Crossing, and LastFM datasets.
Deep Neural Network-based De-anonymization [12]: This recent work employed deep neural networks to learn the mapping between auxiliary data and the target dataset for de-anonymization. We trained their model using the rating datasets as auxiliary data and evaluated its performance on our cross-dataset de-anonymization task.
Table 8 presents the de-anonymization performance of our approach and that of the state-of-the-art techniques in terms of precision, recall, and F1-score for the MovieLens–Book-Crossing dataset pair.
As shown in the table, our approach outperforms the state-of-the-art techniques across all evaluation metrics, achieving the highest precision, recall, and F1-score for the cross-dataset de-anonymization task. This superior performance can be attributed to the unique strengths of our method:
Our approach leverages the consistency of users’ rating patterns across diverse domains, enabling effective cross-dataset de-anonymization. In contrast, techniques like Narayanan and Shmatikov’s algorithm were originally designed for single-dataset scenarios and may not fully capture the cross-domain signal present in our task.
By combining record linkage techniques with quasi-identifier attacks, our method effectively exploits the uniqueness of rating patterns as high-dimensional quasi-identifiers, overcoming the limitations of traditional de-anonymization techniques that focus on explicit identifiers or predefined attribute combinations.
Unlike deep learning-based approaches that require large amounts of training data and computational resources, our method is efficient and interpretable, relying on well-established probabilistic record linkage models and similarity metrics.
While the deep neural network-based approach [12] achieved competitive results, our method outperformed it in terms of both precision and recall. Additionally, our approach has the advantage of being more interpretable and requiring fewer computational resources compared to deep learning models.
These results demonstrate the novelty and effectiveness of our proposed approach in addressing the challenging task of cross-dataset de-anonymization, highlighting its unique contributions to the field of user privacy and data anonymization.
5.12. Impact of Data Privacy Regulations and Anonymization Techniques
Data privacy regulations like GDPR, HIPAA, SOX, and others mandate strict measures to protect individuals’ confidential data, including the use of anonymization techniques. Different anonymization techniques, such as k-anonymity, l-diversity, and differential privacy, aim to achieve varying levels of privacy protection by modifying or perturbing the data in specific ways.
The effectiveness of our proposed de-anonymization approach may be influenced by the specific anonymization technique employed on the rating datasets. For instance, techniques that suppress or generalize quasi-identifying attributes (such as rating patterns) could potentially reduce the performance of our approach by diminishing the uniqueness of users’ data patterns.
5.12.1. k-Anonymity and l-Diversity
The k-anonymity approach aims to ensure that each record in a dataset is indistinguishable from at least k − 1 other records when considering a set of quasi-identifying attributes. However, this technique may be vulnerable to our de-anonymization method if the rating patterns are not properly suppressed or generalized, as they can still serve as quasi-identifiers. The l-diversity technique extends k-anonymity by requiring that within each group of k records, there are at least l well-represented values for each sensitive attribute. While this approach provides additional protection against attribute disclosure, it may still be susceptible to our de-anonymization attack if the rating patterns exhibit sufficient uniqueness.
5.12.2. Differential Privacy
Differential privacy is a more robust privacy model that aims to protect the privacy of individuals by introducing controlled noise or perturbation to the data. By adding carefully calibrated noise, differential privacy ensures that the presence or absence of any individual’s data in the dataset has a negligible impact on the overall results or outputs. If applied to rating datasets, differential privacy could potentially mitigate the effectiveness of our de-anonymization approach by obfuscating the rating patterns and reducing their uniqueness. However, achieving a desired level of privacy protection through differential privacy may come at the cost of reduced data utility, which could negatively impact the performance of recommender systems or other data-driven applications that rely on accurate rating data.
5.12.3. Need for Tailored Privacy-Preserving Mechanisms
While existing anonymization techniques provide general privacy protection, our findings highlight the need for stronger and more tailored privacy-preserving mechanisms specifically designed for rating datasets and recommender systems. These mechanisms should account for the unique characteristics of rating data, such as the potential for quasi-identifying rating patterns while preserving the utility of the data for developing accurate and effective recommender algorithms. Potential approaches could involve the development of domain-specific anonymization techniques, secure multi-party computation methods, or the integration of privacy-enhancing technologies like homomorphic encryption or secure enclaves. Collaboration between privacy researchers, recommender system experts, and industry practitioners is crucial to address this challenge and strike a balance between privacy protection and data utility in the context of rating datasets and personalized recommendation services.
5.12.4. Ethical and Legal Considerations
The findings of our study highlight the potential privacy risks associated with the release of pseudonymized rating datasets, even when traditional anonymization techniques are applied. This raises important ethical and legal considerations regarding the deployment and use of de-anonymization techniques like the one proposed in this work.
From an ethical perspective, the ability to re-identify individuals based on their rating patterns could be seen as a violation of their privacy and autonomy. Individuals may have provided their ratings with the expectation of anonymity, and the de-anonymization of their data could expose sensitive information about their preferences and behaviors without their explicit consent. This raises concerns about the potential misuse of such techniques for profiling, discrimination, or other unintended purposes.
Legally, the deployment of de-anonymization techniques may conflict with data protection laws and regulations, such as the General Data Protection Regulation (GDPR) in the European Union. The GDPR imposes strict requirements on the processing of personal data, including the need for a lawful basis and adherence to principles such as data minimization and purpose limitation. The re-identification of individuals through techniques like ours could potentially violate these principles and lead to legal consequences for organizations that fail to adequately protect personal data.
However, there may be legitimate use cases where de-anonymization techniques could be justified, such as in the context of law enforcement investigations or cybersecurity research aimed at identifying and mitigating privacy risks. In such cases, a careful balancing of interests and a thorough assessment of the necessity and proportionality of the de-anonymization process would be required.
To address these ethical and legal concerns, it is crucial for researchers, industry practitioners, and policymakers to engage in open and transparent discussions about the responsible use of de-anonymization techniques. Clear guidelines and frameworks should be established to ensure that such techniques are only employed in legally and ethically justifiable scenarios, with appropriate safeguards and oversight mechanisms in place. Additionally, ongoing collaboration between privacy experts, data scientists, and legal professionals is essential to stay ahead of emerging privacy challenges and develop robust privacy-preserving mechanisms that can effectively protect individuals’ privacy while enabling the responsible use of data for beneficial purposes.
6. Conclusions
In this paper, we presented a novel approach for de-anonymizing users across different rating datasets by leveraging their rating patterns as quasi-identifiers and employing record linkage techniques. Our key insight was that users tend to exhibit consistent preferences and behaviors across diverse domains, enabling the linking of their identities based on the similarity of their rating vectors. By combining probabilistic record linkage methods with quasi-identifier attacks, we demonstrated the feasibility of cross-dataset de-anonymization, a task that poses significant privacy risks but has received limited attention in prior research.
Through extensive experiments on three publicly available rating datasets (MovieLens, Book-Crossing, and LastFM), we evaluated the effectiveness of our approach in linking user records across these diverse domains. Our results showed high precision and recall values, with F1-scores ranging from 0.72 to 0.79 for pairwise de-anonymization tasks. We further investigated the impact of various factors on the de-anonymization performance, including the choice of similarity metric, the number of datasets involved, and data sparsity. Our approach demonstrated robustness to data sparsity, maintaining reasonable performance even when a significant portion of the ratings was unavailable.
The success of our de-anonymization approach highlights the potential privacy risks associated with the release of anonymized rating datasets, which are commonly used for evaluating recommender systems and other data-driven applications. While these datasets are intended to protect user privacy by removing explicit identifiers, our work showed that users’ rating patterns can serve as quasi-identifiers, enabling their re-identification across different contexts.
Our findings suggest that the potential for de-anonymization extends beyond rating datasets and can be applied to other domains where user behavior patterns can serve as quasi-identifiers. For example, in the context of social media platforms, users’ activity patterns, such as the content they engage with, the accounts they follow, or the topics they discuss, could be leveraged for cross-platform de-anonymization. Similarly, in e-commerce scenarios, user browsing and purchasing histories could provide unique fingerprints that enable record linkage across different online marketplaces.
While our approach focused on leveraging rating patterns as quasi-identifiers, the underlying principles of combining record linkage techniques with quasi-identifier attacks can be adapted to other types of user data, such as web browsing histories, social media activities, or location traces. By identifying the unique characteristics and behavioral patterns exhibited by users in these domains, our de-anonymization framework can be extended to uncover privacy vulnerabilities and inform the development of robust anonymization strategies.
Additionally, our work highlights the potential for developing cross-domain user modeling and personalization systems by leveraging linked user identities across diverse datasets. By integrating user preferences and behaviors from multiple contexts, such systems could provide more comprehensive and tailored recommendations, personalized content, or targeted services. However, the development of such systems must be accompanied by rigorous privacy safeguards and ethical considerations to ensure the responsible use of user data.
Our findings underscore the need for stronger anonymization techniques and privacy-preserving mechanisms in the context of rating datasets. Potential countermeasures include the application of differential privacy techniques, secure multi-party computation methods, or the careful selection and suppression of quasi-identifiers during the data anonymization process. Additionally, legal and ethical frameworks should be established to govern the collection, use, and dissemination of user data, ensuring that individuals’ privacy rights are respected while enabling the development of beneficial data-driven applications.
Future research directions include exploring more advanced record linkage and de-anonymization techniques, as well as developing robust privacy-preserving methods for rating datasets. Additionally, investigating the generalizability of our approach to other types of user data, such as browsing histories or social media activities, could provide further insights into the privacy implications of data release and sharing practices.
Our de-anonymization technique relies on the assumption that user data exhibit sufficient distinctiveness or uniqueness to serve as a quasi-identifier for record linkage. In the context of rating datasets, this assumption holds true due to the inherent diversity in user preferences and rating patterns. However, for datasets with different types of quasi-identifiers or less structured data, the effectiveness of our approach may vary. For example, in datasets where user data are less diverse or exhibit a higher degree of similarity across individuals, the uniqueness of the quasi-identifiers may be diminished, potentially reducing the performance of our de-anonymization technique. Additionally, datasets with more complex or unstructured data formats, such as textual data or multimedia content, may require adaptations to our approach for extracting and comparing quasi-identifying features effectively.
To address these limitations and provide a more comprehensive understanding of our method’s versatility, we plan to extend our experiments to include a diverse range of datasets from various domains. This will involve identifying suitable quasi-identifying attributes or features in each dataset and adapting our similarity computation and record linkage techniques accordingly.
Furthermore, we recognize the value of incorporating additional data preprocessing and feature engineering techniques to enhance the performance of our approach on datasets with less structured or unstructured data. Techniques such as natural language processing, computer vision, or domain-specific feature extraction methods may be leveraged to extract meaningful quasi-identifiers from complex data formats.
By conducting comparative analyses across different types of datasets and quasi-identifiers, we aim to gain insights into the strengths and weaknesses of our de-anonymization technique, as well as identify potential areas for improvement or adaptation. This understanding will not only contribute to the broader field of privacy-preserving data publishing but also inform the development of more robust and versatile de-anonymization techniques.
In conclusion, our work serves as a wake-up call for researchers, practitioners, and policymakers to prioritize user privacy in the era of big data. By highlighting the potential for de-anonymization across diverse datasets, we aim to raise awareness and inspire the development of stronger privacy safeguards, enabling the responsible use of user data while protecting individuals’ fundamental right to privacy.