Search | arXiv e-print repository

Privacy-Preserving Collaborative Genomic Research: A Real-Life Deployment and Vision

Authors: Zahra Rahmani, Nahal Shahini, Nadav Gat, Zebin Yun, Yuzhou Jiang, Ofir Farchy, Yaniv Harel, Vipin Chaudhary, Mahmood Sharif, Erman Ayday

Abstract: The data revolution holds significant promise for the health sector. Vast amounts of data collected from individuals will be transformed into knowledge, AI models, predictive systems, and best practices. One area of health that stands to benefit greatly is the genomic domain. Progress in AI, machine learning, and data science has opened new opportunities for genomic research, promising breakthroughs in personalized medicine. However, increasing awareness of privacy and cybersecurity necessitates robust solutions to protect sensitive data in collaborative research. This paper presents a practical deployment of a privacy-preserving framework for genomic research, developed in collaboration with Lynx.MD, a platform for secure health data collaboration. The framework addresses critical cybersecurity and privacy challenges, enabling the privacy-preserving sharing and analysis of genomic data while mitigating risks associated with data breaches. By integrating advanced privacy-preserving algorithms, the solution ensures the protection of individual privacy without compromising data utility. A unique feature of the system is its ability to balance trade-offs between data sharing and privacy, providing stakeholders tools to quantify privacy risks and make informed decisions. Implementing the framework within Lynx.MD involves encoding genomic data into binary formats and applying noise through controlled perturbation techniques. This approach preserves essential statistical properties of the data, facilitating effective research and analysis. Moreover, the system incorporates real-time data monitoring and advanced visualization tools, enhancing user experience and decision-making. The paper highlights the need for tailored privacy attacks and defenses specific to genomic data. Addressing these challenges fosters collaboration in genomic research, advancing personalized medicine and public health.

Submitted 12 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally to this work. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

arXiv:2406.05545 [pdf, other]

Privacy-Preserving Optimal Parameter Selection for Collaborative Clustering

Authors: Maryam Ghasemian, Erman Ayday

Abstract: This study investigates the optimal selection of parameters for collaborative clustering while ensuring data privacy. We focus on key clustering algorithms within a collaborative framework, where multiple data owners combine their data. A semi-trusted server assists in recommending the most suitable clustering algorithm and its parameters. Our findings indicate that the privacy parameter ($ε$) minimally impacts the server's recommendations, but an increase in $ε$ raises the risk of membership inference attacks, where sensitive information might be inferred. To mitigate these risks, we implement differential privacy techniques, particularly the Randomized Response mechanism, to add noise and protect data privacy. Our approach demonstrates that high-quality clustering can be achieved while maintaining data confidentiality, as evidenced by metrics such as the Adjusted Rand Index and Silhouette Score. This study contributes to privacy-aware data sharing, optimal algorithm and parameter selection, and effective communication between data owners and the server.

Submitted 8 June, 2024; originally announced June 2024.
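
The Randomized Response mechanism named in the abstract above can be illustrated with a minimal sketch. This is the textbook binary variant, not necessarily the configuration used in the paper; the function names and the assumption that each record is a list of 0/1 bits are hypothetical choices made for illustration.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bits, epsilon, rng=None):
    """Report each bit truthfully with probability keep_probability(epsilon), else flip it."""
    rng = rng or random.Random()
    p_keep = keep_probability(epsilon)
    return [b if rng.random() < p_keep else 1 - b for b in bits]


def estimate_mean(noisy_bits, epsilon):
    """Debiased estimate of the true fraction of ones (requires epsilon > 0)."""
    p = keep_probability(epsilon)
    observed = sum(noisy_bits) / len(noisy_bits)
    # E[observed] = (2p - 1) * true_mean + (1 - p), solved for true_mean.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

A smaller $ε$ moves the keep probability toward 1/2 (more noise, stronger privacy), while a larger $ε$ leaves most bits untouched, which is consistent with the abstract's remark that raising $ε$ increases exposure to membership inference.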

arXiv:2404.02138 [pdf, other]

Topic-based Watermarks for LLM-Generated Text

Authors: Alexander Nemecek, Yuzhou Jiang, Erman Ayday

Abstract: Recent advancements of large language models (LLMs) have resulted in text outputs that are indistinguishable from human-generated text. Watermarking algorithms are potential tools that offer a way to differentiate between LLM- and human-generated text by embedding detectable signatures within LLM-generated output. However, current watermarking schemes lack robustness against known attacks on watermarking algorithms. In addition, they are impractical considering an LLM generates tens of thousands of text outputs per day and the watermarking algorithm needs to memorize each output it generates for the detection to work. In this work, focusing on the limitations of current watermarking schemes, we propose the concept of a "topic-based watermarking algorithm" for LLMs. The proposed algorithm determines how to generate tokens for the watermarked LLM output based on extracted topics of an input prompt or the output of a non-watermarked LLM. Inspired by previous work, we propose using a pair of lists (that are generated based on the specified extracted topic(s)) that specify certain tokens to be included or excluded while generating the watermarked output of the LLM. Using the proposed watermarking algorithm, we show the practicality of a watermark detection algorithm. Furthermore, we discuss a wide range of attacks that can emerge against watermarking algorithms for LLMs and the benefit of the proposed watermarking scheme for the feasibility of modeling a potential attacker considering its benefit vs. loss.

Submitted 16 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 11 pages

arXiv:2310.05696 [pdf, other]

Little is Enough: Improving Privacy by Sharing Labels in Federated Semi-Supervised Learning

Authors: Amr Abourayya, Jens Kleesiek, Kanishka Rao, Erman Ayday, Bharat Rao, Geoff Webb, Michael Kamp

Abstract: In many critical applications, sensitive data is inherently distributed and cannot be centralized due to privacy concerns. A wide range of federated learning approaches have been proposed in the literature to train models locally at each client without sharing their sensitive local data. Most of these approaches either share local model parameters, soft predictions on a public dataset, or a combination of both. This, however, still discloses private information and restricts local models to those that lend themselves to training via gradient-based methods. To reduce the amount of shared information, we propose to share only hard labels on a public unlabeled dataset, and use a consensus over the shared labels as a pseudo-labeling to be used by clients. The resulting federated co-training approach empirically improves privacy substantially, without compromising on model quality. At the same time, it allows us to use local models that do not lend themselves to the parameter aggregation used in federated learning, such as (gradient boosted) decision trees, rule ensembles, and random forests.

Submitted 23 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.
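
The consensus step described in the abstract above, where clients share only hard labels on a public unlabeled dataset, might look roughly like the following. The abstract does not say how the consensus is formed; a per-sample majority vote with ties broken toward the smallest label is an assumption made here, and `consensus_labels` is a hypothetical name.

```python
from collections import Counter


def consensus_labels(client_predictions):
    """Combine hard labels from several clients into one pseudo-label per sample.

    client_predictions: list of equal-length lists, one list of hard labels
    per client, all aligned to the same public unlabeled samples.
    Assumption: majority vote, ties resolved in favour of the smallest label.
    """
    n_samples = len(client_predictions[0])
    consensus = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in client_predictions)
        top = max(votes.values())
        consensus.append(min(label for label, c in votes.items() if c == top))
    return consensus


# Three clients label the same three public samples.
pseudo_labels = consensus_labels([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
```

Because only label vectors are exchanged, each client is free to train any local model on the pseudo-labeled public data, including tree-based models that have no parameters to average.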

arXiv:2302.02162 [pdf, other]

doi 10.56553/popets-2024-0137

AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Interpretable Models

Authors: Abdullah Caglar Oksuz, Anisa Halimi, Erman Ayday

Abstract: Explainable Artificial Intelligence (XAI) aims to uncover the decision-making processes of AI models. However, the data used for such explanations can pose security and privacy risks. Existing literature identifies attacks on machine learning models, including membership inference, model inversion, and model extraction attacks. These attacks target either the model or the training data, depending on the settings and parties involved. XAI tools can increase the vulnerability of model extraction attacks, which is a concern when model owners prefer black-box access, thereby keeping model parameters and architecture private. To exploit this risk, we propose AUTOLYCUS, a novel retraining (learning) based model extraction attack framework against interpretable models under black-box settings. As XAI tools, we exploit Local Interpretable Model-Agnostic Explanations (LIME) and Shapley values (SHAP) to infer decision boundaries and create surrogate models that replicate the functionality of the target model. LIME and SHAP are mainly chosen for their realistic yet information-rich explanations, coupled with their extensive adoption, simplicity, and usability. We evaluate AUTOLYCUS on six machine learning datasets, measuring the accuracy and similarity of the surrogate model to the target model. The results show that AUTOLYCUS is highly effective, requiring significantly fewer queries compared to state-of-the-art attacks, while maintaining comparable accuracy and similarity. We validate its performance and transferability on multiple interpretable ML models, including decision trees, logistic regression, naive bayes, and k-nearest neighbor. Additionally, we show the resilience of AUTOLYCUS against proposed countermeasures.

Submitted 8 July, 2024; v1 submitted 4 February, 2023; originally announced February 2023.

Comments: This work is published in the Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2024, Issue 4, 2024

arXiv:2212.12785 [pdf, other]

zkFaith: Soonami's Zero-Knowledge Identity Protocol

Authors: Mina Namazi, Duncan Ross, Xiaojie Zhu, Erman Ayday

Abstract: Individuals are encouraged to prove their eligibility to access specific services regularly. However, providing various organizations with personal data spreads sensitive information and endangers people's privacy. Hence, privacy-preserving identification systems that enable individuals to prove they are permitted to use specific services are required to fill the gap. Cryptographic techniques are deployed to construct identity proofs across the internet; nonetheless, they do not offer complete control over personal data or prevent users from forging and submitting fake data. In this paper, we design a privacy-preserving identity protocol called "zkFaith," a new approach to obtaining a verified zero-knowledge identity unique to each individual. The protocol verifies the integrity of the documents provided by the individuals and issues a zero-knowledge-based id without revealing any information to the authenticator or verifier. zkFaith leverages an aggregated version of the Camenisch-Lysyanskaya (CL) signature scheme to sign the user's commitment to the verified personal data. Users can then employ a zero-knowledge proof system to prove that they own the attributes required by the access criteria of the requested service providers. Vector commitments and their position-binding property enable us to later update the commitments when the personal data is modified, and hence to update the issued zkFaith id without restarting the protocol from scratch. We show that the design and implementation of zkFaith, with the generated proofs in real-world scenarios, are scalable and comparable with the state-of-the-art schemes.

Submitted 24 December, 2022; originally announced December 2022.

arXiv:2210.01297 [pdf, other]

Privacy-Preserving Link Prediction

Authors: Didem Demirag, Mina Namazi, Erman Ayday, Jeremy Clark

Abstract: Consider two data holders, ABC and XYZ, with graph data (e.g., social networks, e-commerce, telecommunication, and bio-informatics). ABC can see that node A is linked to node B, and XYZ can see node B is linked to node C. Node B is the common neighbour of A and C but neither network can discover this fact on their own. In this paper, we provide a two-party computation that ABC and XYZ can run to discover the common neighbours in the union of their graph data; however, neither party has to reveal their plaintext graph to the other. Based on private set intersection, we implement our solution, provide measurements, and quantify partial leaks of privacy. We also propose a heavyweight solution that leaks zero information based on additively homomorphic encryption.

Submitted 3 October, 2022; originally announced October 2022.
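
To make the target of the computation in the abstract above concrete, the sketch below computes common neighbours over the union of two edge lists in the clear. It provides no privacy at all and is not the paper's protocol; it only shows the functionality that the private set intersection and homomorphic encryption constructions are meant to realise without exchanging plaintext graphs. The function names and the undirected-edge representation are assumptions.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbours(edges_abc, edges_xyz, a, c):
    """Common neighbours of a and c in the union of both holders' graphs.

    NOT private: both edge lists are combined in plaintext here.
    """
    adj = adjacency(list(edges_abc) + list(edges_xyz))
    return adj[a] & adj[c]


# The example from the abstract: ABC holds A-B, XYZ holds B-C.
shared = common_neighbours([("A", "B")], [("B", "C")], "A", "C")
```

Neither edge list alone reveals that B links A and C, which is why the two holders need a joint computation.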

arXiv:2209.06327 [pdf, other]

Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing

Authors: Yuzhou Jiang, Tianxi Ji, Pan Li, Erman Ayday

Abstract: As genomic research has become increasingly widespread in recent years, few studies share datasets due to the sensitivity in privacy of genomic records. This hinders the reproduction and validation of research outcomes, which are crucial for catching errors (e.g., miscalculations) during the research process. To the best of our knowledge, we are the first to propose a method of sharing genomic datasets in a privacy-preserving manner for GWAS outcome reproducibility. In this work, we introduce a differential privacy-based scheme for sharing genomic datasets to enhance the reproducibility of genome-wide association studies (GWAS) outcomes. The scheme involves two stages. In the first stage, we generate a noisy copy of the target dataset by applying the XOR mechanism on the binarized (encoded) dataset, where the binary noise generation considers biological features. However, the initial step introduces significant noise, making the dataset less suitable for direct GWAS validation. Thus, in the second stage, we implement a post-processing technique that adjusts the Minor Allele Frequency (MAF) values in the noisy dataset to align more closely with those in a publicly available dataset using optimal transport and decode it back to genomic space. We evaluated the proposed scheme on three real-life genomic datasets and compared it with a baseline approach and two synthesis-based solutions with regard to detecting errors of GWAS outcomes, data utility, and resistance against membership inference attacks (MIAs). Our scheme outperforms all the compared methods in detecting GWAS outcome errors, achieves better utility, and provides higher privacy protection against MIAs. By utilizing our method, genomic researchers will be inclined to share a differentially private, yet high-quality, version of their datasets.

Submitted 18 December, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

arXiv:2204.04792 [pdf, other]

Robust Fingerprint of Location Trajectories Under Differential Privacy

Authors: Yuzhou Jiang, Emre Yilmaz, Erman Ayday

Abstract: Directly releasing those data raises privacy and liability (e.g., due to unauthorized distribution of such datasets) concerns since location data contain users' sensitive information, e.g., regular moving patterns and favorite spots. To address this, we propose a novel fingerprinting scheme that simultaneously identifies unauthorized redistribution of location datasets and provides differential privacy guarantees for the shared data. Observing data utility degradation due to differentially-private mechanisms, we introduce a utility-focused post-processing scheme to regain spatio-temporal correlations between points in a location trajectory. We further integrate this post-processing scheme into our fingerprinting scheme as a sampling method. The proposed fingerprinting scheme alleviates the degradation in the utility of the shared dataset due to the noise introduced by differentially-private mechanisms (i.e., adds the fingerprint by preserving the publicly known statistics of the data). Meanwhile, it does not violate differential privacy throughout the entire process due to immunity to post-processing, a fundamental property of differential privacy. Our proposed fingerprinting scheme is robust against known and well-studied attacks against a fingerprinting scheme including random flipping attacks, correlation-based flipping attacks, and collusions among multiple parties, which makes it hard for the attackers to infer the fingerprint codes and avoid accusation. Via experiments on two real-life location datasets and two synthetic ones, we show that our scheme achieves high fingerprinting robustness and outperforms existing approaches. Besides, the proposed fingerprinting scheme increases data utility for differentially-private datasets, which is beneficial for data analyzers.

Submitted 21 April, 2023; v1 submitted 10 April, 2022; originally announced April 2022.

arXiv:2204.01801 [pdf, other]

Robust Fingerprinting of Genomic Databases

Authors: Tianxi Ji, Erman Ayday, Emre Yilmaz, Pan Li

Abstract: Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint by launching effective correlation attacks which leverage the intrinsic correlations among genomic data (e.g., Mendel's law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks. We first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits by causing a small utility loss (e.g., database accuracy and consistency of SNP-phenotype associations measured via p-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP-phenotype associations, the attacker can only distort about 30% of the fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques preserve the utility of the shared genomic databases.

Submitted 4 April, 2022; originally announced April 2022.

Comments: To appear in the 30th International Conference on Intelligent Systems for Molecular Biology (ISMB'22)

arXiv:2203.12445 [pdf, ps, other]

ShareTrace: Contact Tracing with the Actor Model

Authors: Ryan Tatton, Erman Ayday, Youngjin Yoo, Anisa Halimi

Abstract: Proximity-based contact tracing relies on mobile-device interaction to estimate the spread of disease. ShareTrace is one such approach that improves the efficacy of tracking disease spread by considering direct and indirect forms of contact. In this work, we utilize the actor model to provide an efficient and scalable formulation of ShareTrace with asynchronous, concurrent message passing on a temporal contact network. We also introduce message reachability, an extension of temporal reachability that accounts for network topology and message-passing semantics. Our evaluation on both synthetic and real-world contact networks indicates that correct parameter values optimize for algorithmic accuracy and efficiency. In addition, we demonstrate that message reachability can accurately estimate the risk a user poses to their contacts.

Submitted 18 September, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: To be published in IEEE HealthCom 2022 Conference Proceedings; added mathematical detail about message reachability; improved explanations of algorithms and figures, updated conclusion, fixed typos, results unchanged; 6 pages with 3 figures

ACM Class: F.1.2; G.2.2; J.3; G.3

arXiv:2203.05664 [pdf, other]

Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy

Authors: Leonard Dervishi, Xinyue Wang, Wentao Li, Anisa Halimi, Jaideep Vaidya, Xiaoqian Jiang, Erman Ayday

Abstract: With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off of accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.

Submitted 10 March, 2022; originally announced March 2022.

Comments: 10 pages, 3 figures

arXiv:2112.15109 [pdf, other]

GenShare: Sharing Accurate Differentially-Private Statistics for Genomic Datasets with Dependent Tuples

Authors: Nour Almadhoun Alserr, Ozgur Ulusoy, Erman Ayday, Onur Mutlu

Abstract: Motivation: Cutting the cost of DNA sequencing technology led to a quantum leap in the availability of genomic data. While sharing genomic data across researchers is an essential driver of advances in health and biomedical research, the sharing process is often infeasible due to data privacy concerns. Differential privacy is one of the rigorous mechanisms utilized to facilitate the sharing of aggregate statistics from genomic datasets without disclosing any private individual-level data. However, differential privacy can still divulge sensitive information about the dataset participants due to the correlation between dataset tuples. Results: Here, we propose the GenShare model, built upon Laplace-perturbation-mechanism-based DP, to introduce a privacy-preserving query-answering sharing model for statistical genomic datasets that include dependency due to the inherent correlations between genomes of individuals (i.e., family ties). We demonstrate our privacy improvement over the state-of-the-art approaches for a range of practical queries including cohort discovery, minor allele frequency, and $\chi^2$ association tests. With a fine-grained analysis of sensitivity in the Laplace perturbation mechanism and considering joint distributions, GenShare results nearly achieve the formal privacy guarantees permitted by the theory of differential privacy for queries computed over independent tuples (with differences of only up to 6%). GenShare ensures that query results are as accurate as theoretically guaranteed by differential privacy. For empowering the advances in different scientific and medical research areas, GenShare presents a path toward an interactive genomic data sharing system when the datasets include participants with familial relationships.

Submitted 30 December, 2021; originally announced December 2021.

Comments: 8 pages, 7 figures
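
The Laplace perturbation mechanism that the abstract above builds on can be sketched in its standard form: add noise of scale sensitivity/$ε$ to a query result. The sketch assumes independent tuples and a count query with sensitivity 1; it does not reproduce GenShare's dependency-aware sensitivity analysis, and `laplace_noise` and `private_count` are hypothetical names.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-DP under the standard Laplace mechanism.

    Assumption: tuples are independent, so one individual changes the count
    by at most `sensitivity`. Dependent tuples (e.g. family members) would
    need a larger effective sensitivity, which is not modelled here.
    """
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

The noise scale grows as $ε$ shrinks, so stronger privacy costs accuracy. When tuples are correlated, a fixed noise scale protects each individual less than the nominal $ε$ suggests, which is the gap the abstract describes.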

arXiv:2109.02768 [pdf, other]

Privacy-Preserving Database Fingerprinting

Authors: Tianxi Ji, Erman Ayday, Emre Yilmaz, Pan Li

Abstract: When sharing sensitive relational databases with other parties, a database owner aims to (i) have privacy guarantees for the database entries, (ii) have liability guarantees (via fingerprinting) in case of unauthorized sharing of its database by the recipients, and (iii) provide a high quality (utility) database to the recipients. We observe that sharing a relational database with privacy and liability guarantees are orthogonal objectives. The former can be achieved by injecting noise into the database to prevent inference of the original data values, whereas the latter can be achieved by hiding unique marks inside the database to trace malicious parties (data recipients) who redistribute the data without authorization. We achieve these two objectives simultaneously by proposing a novel entry-level differentially-private fingerprinting mechanism for relational databases. At a high level, the proposed mechanism fulfills the privacy and liability requirements by leveraging the randomization nature that is intrinsic to fingerprinting and achieves desired entry-level privacy guarantees. To be more specific, we devise a bit-level random response scheme to achieve differential privacy guarantee for arbitrary data entries when sharing the entire database, and then, based on this, we develop an $ε$-entry-level differentially-private fingerprinting mechanism. Next, we theoretically analyze the relationships between privacy guarantee, fingerprint robustness, and database utility by deriving closed form expressions. The outcome of this analysis allows us to bound the privacy leakage caused by attribute inference attack and characterize the privacy-utility coupling and privacy-fingerprint robustness coupling. Furthermore, we also propose an SVT-based solution to control the cumulative privacy loss when fingerprinted copies of a database are shared with multiple recipients.

Submitted 6 March, 2022; v1 submitted 6 September, 2021; originally announced September 2021.

arXiv:2108.06505 [pdf, other]

Privacy-Preserving Identification of Target Patients from Outsourced Patient Data

Authors: Xiaojie Zhu, Erman Ayday, Roman Vitenberg

Abstract: With the increasing affordability and availability of patient data, hospitals tend to outsource their data to cloud service providers (CSPs) for the purpose of storage and analytics. However, the concern of data privacy significantly limits the data owners' choice. In this work, we propose the first solution, to the best of our knowledge, that allows a CSP to perform efficient identification of target patients (e.g., pre-processing for a genome-wide association study - GWAS) over multi-tenant encrypted phenotype data (owned by multiple hospitals or data owners). We first propose an encryption mechanism for phenotype data, where each data owner is allowed to encrypt its data with a unique secret key. Moreover, the ciphertext supports privacy-preserving search and, consequently, enables the selection of the target group of patients (e.g., case and control groups). In addition, we provide a per-query based authorization mechanism for a client to access and operate on the data stored at the CSP. Based on the identified patients, the proposed scheme can either (i) directly conduct GWAS (i.e., computation of statistics about genomic variants) at the CSP or (ii) provide the identified groups to the client to directly query the corresponding data owners and conduct GWAS using existing distributed solutions. We implement the proposed scheme and run experiments over a real-life genomic dataset to show its effectiveness. The result shows that the proposed solution is capable of efficiently identifying the case/control groups in a privacy-preserving way.

Submitted 28 August, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

arXiv:2106.05211 [pdf, other]

Near-Optimal Privacy-Utility Tradeoff in Genomic Studies Using Selective SNP Hiding

Authors: Nour Almadhoun Alserr, Gulce Kale, Onur Mutlu, Oznur Tastan, Erman Ayday

Abstract: Motivation: Researchers need a rich trove of genomic datasets that they can leverage to gain a better understanding of the genetic basis of the human genome and identify associations between phenotypes and specific parts of DNA. However, sharing genomic datasets that include sensitive genetic or medical information of individuals can lead to serious privacy-related consequences if data lands in the wrong hands. Restricting access to genomic datasets is one solution, but this greatly reduces their usefulness for research purposes. To allow sharing of genomic datasets while addressing these privacy concerns, several studies propose privacy-preserving mechanisms for data sharing. Differential privacy (DP) is one such mechanism that formalizes rigorous mathematical foundations to provide privacy guarantees while sharing aggregated statistical information about a dataset. However, it has been shown that the original privacy guarantees of DP-based solutions degrade when there are dependent tuples in the dataset, which is a common scenario for genomic datasets (due to the existence of family members). Results: In this work, we introduce a near-optimal mechanism to mitigate the vulnerabilities of the inference attacks on differentially private query results from genomic datasets including dependent tuples. We propose a utility-maximizing and privacy-preserving approach for sharing statistics by hiding selective SNPs of the family members as they participate in a genomic dataset.
By evaluating our mechanism on a real-world genomic dataset, we empirically demonstrate that our proposed mechanism can achieve up to 40% better privacy than state-of-the-art DP-based solutions, while near-optimally minimizing the utility loss.

Submitted 9 June, 2021; originally announced June 2021.

Comments: 9 pages, 9 figures

arXiv:2103.06438 [pdf, other]

doi 10.1145/3471621.3471853

The Curse of Correlations for Robust Fingerprinting of Relational Databases

Authors: Tianxi Ji, Emre Yilmaz, Erman Ayday, Pan Li

Abstract: Database fingerprinting has been widely adopted to prevent unauthorized sharing of data and identify the source of data leakages. Although existing schemes are robust against common attacks, like random bit flipping and subset attack, their robustness degrades significantly if attackers utilize the inherent correlations among database entries. In this paper, we first demonstrate the vulnerability of existing database fingerprinting schemes by identifying different correlation attacks: column-wise correlation attack, row-wise correlation attack, and the integration of them. To provide robust fingerprinting against the identified correlation attacks, we then develop mitigation techniques, which can work as post-processing steps for any off-the-shelf database fingerprinting schemes. The proposed mitigation techniques also preserve the utility of the fingerprinted database considering different utility metrics. We empirically investigate the impact of the identified correlation attacks and the performance of mitigation techniques using real-world relational databases. Our results show (i) high success rates of the identified correlation attacks against existing fingerprinting schemes (e.g., the integrated correlation attack can distort 64.8% of the fingerprint bits by just modifying 14.2% of the entries in a fingerprinted database), and (ii) high robustness of the proposed mitigation techniques (e.g., with the mitigation techniques, the integrated correlation attack can only distort 3% of the fingerprint bits).

Submitted 21 July, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

Comments: To appear in 24th International Symposium on Research in Attacks, Intrusions and Defenses (RAID'21)

arXiv:2102.07357 [pdf, other]

Genomic Data Sharing under Dependent Local Differential Privacy

Authors: Emre Yilmaz, Tianxi Ji, Erman Ayday, Pan Li

Abstract: Privacy-preserving genomic data sharing is important for increasing the pace of genomic research, and hence for paving the way towards personalized genomic medicine. In this paper, we introduce ($ε, T$)-dependent local differential privacy (LDP) for privacy-preserving sharing of correlated data and propose a genomic data sharing mechanism under this privacy definition. We first show that the original definition of LDP is not suitable for genomic data sharing, and then we propose a new mechanism to share genomic data. The proposed mechanism considers the correlations in data during data sharing, eliminates statistically unlikely data values beforehand, and adjusts the probability distributions for each shared data point accordingly. By doing so, we show that we can prevent an attacker from inferring the correct values of the shared data points by utilizing the correlations in the data. By adjusting the probability distributions of the shared states of each data point, we also improve the utility of shared data for the data collector. Furthermore, we develop a greedy algorithm that strategically identifies the processing order of the shared data points with the aim of maximizing the utility of the shared data.
Considering the interdependent privacy risks while sharing genomic data, we also analyze the information gain of an attacker about genomes of a donor's family members by observing perturbed data of the genome donor and we propose a mechanism to select the privacy budget (i.e., $ε$ parameter of LDP) of the donor by also considering privacy preferences of her family members. Our evaluation results on a real-life genomic dataset show the superiority of the proposed mechanism compared to the randomized response mechanism (a widely used technique to achieve LDP).
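For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of generalized (k-ary) randomized response, the standard way to achieve ε-LDP over a small categorical domain. It is not the authors' dependent-LDP mechanism: it ignores correlations between data points entirely. The encoding of each SNP as 0, 1, or 2, and the names `randomized_response`, `keep_probability`, and `SNP_STATES`, are assumptions made for illustration.

```python
import math
import random

SNP_STATES = (0, 1, 2)  # assumed encoding of one SNP (illustrative)


def keep_probability(epsilon, k):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(true_value, domain, epsilon, rng=random):
    """Report true_value with probability e^eps / (e^eps + k - 1);
    otherwise report one of the other k - 1 values uniformly at random."""
    if rng.random() < keep_probability(epsilon, len(domain)):
        return true_value
    others = [v for v in domain if v != true_value]
    return rng.choice(others)


# Perturb each SNP independently before sharing (no correlation handling).
rng = random.Random(7)
genome = [0, 1, 2, 0, 0, 1]
shared = [randomized_response(s, SNP_STATES, 1.0, rng) for s in genome]
```

Because every data point is perturbed independently over the full domain, an attacker who knows the correlations between SNPs can discard unlikely reported values, which is the weakness the abstract attributes to the original LDP definition.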

Submitted 15 February, 2021; originally announced February 2021.

arXiv:2101.08879 [pdf, other]

Privacy-Preserving and Efficient Verification of the Outcome in Genome-Wide Association Studies

Authors: Anisa Halimi, Leonard Dervishi, Erman Ayday, Apostolos Pyrgelis, Juan Ramon Troncoso-Pastoriza, Jean-Pierre Hubaux, Xiaoqian Jiang, Jaideep Vaidya

Abstract: Providing provenance in scientific workflows is essential for reproducibility and auditability purposes. Workflow systems model and record provenance describing the steps performed to obtain the final results of a computation. In this work, we propose a framework that verifies the correctness of the statistical test results that are conducted by a researcher while protecting individuals' privacy in the researcher's dataset. The researcher publishes the workflow of the conducted study, its output, and associated metadata. They keep the research dataset private while providing, as part of the metadata, a partial noisy dataset (that achieves local differential privacy). To check the correctness of the workflow output, a verifier makes use of the workflow, its metadata, and results of another statistical study (using publicly available datasets) to distinguish between correct statistics and incorrect ones. As a use case, we apply the proposed framework to genome-wide association studies (GWAS), in which the goal is to identify highly associated point mutations (variants) with a given phenotype. For evaluation, we use real genomic data and show that the correctness of the workflow output can be verified with high accuracy even when the aggregate statistics of a small number of variants are provided. We also quantify the privacy leakage due to the provided workflow and its associated metadata in the GWAS use-case and show that the additional privacy risk due to the provided metadata does not increase the existing privacy risk due to sharing of the research results.
Thus, our results show that the workflow output (i.e., research results) can be verified with high confidence in a privacy-preserving way. We believe that this work will be a valuable step towards providing provenance in a privacy-preserving way while providing guarantees to the users about the correctness of the results.

Submitted 7 November, 2022; v1 submitted 21 January, 2021; originally announced January 2021.

Comments: Appeared in the Proceedings on Privacy Enhancing Technologies Symposium (PETS) 2022

arXiv:2009.03698 [pdf, other]

Efficient Quantification of Profile Matching Risk in Social Networks

Authors: Anisa Halimi, Erman Ayday

Abstract: Anonymous data sharing has become more challenging in today's interconnected digital world, especially for individuals that have both anonymous and identified online activities. The most prominent examples of such data sharing platforms today are online social networks (OSNs). Many individuals have multiple profiles in different OSNs, including anonymous and identified ones (depending on the nature of the OSN). Here, the privacy threat is profile matching: if an attacker links anonymous profiles of individuals to their real identities, it can obtain privacy-sensitive information which may have serious consequences, such as discrimination or blackmailing. Therefore, it is very important to quantify and show to the OSN users the extent of this privacy risk. Existing attempts to model profile matching in OSNs are inadequate and computationally inefficient for real-time risk quantification. Thus, in this work, we develop algorithms to efficiently model and quantify profile matching attacks in OSNs as a step towards real-time privacy risk quantification. For this, we model the profile matching problem using a graph and develop a belief propagation (BP)-based algorithm to solve this problem in a significantly more efficient and accurate way compared to the state-of-the-art. We evaluate the proposed framework on three real-life datasets (including data from four different social networks) and show how users' profiles in different OSNs can be matched efficiently and with high probability.
We show that the proposed model generation has linear complexity in terms of the number of user pairs, which is significantly more efficient than the state-of-the-art (which has cubic complexity). Furthermore, it provides comparable accuracy, precision, and recall compared to the state-of-the-art.

Submitted 7 September, 2020; originally announced September 2020.

Comments: arXiv admin note: text overlap with arXiv:2008.09608

Journal ref: Proceedings of the 25th European Symposium on Research in Computer Security (ESORICS 2020)

arXiv:2008.09608 [pdf, other]

Profile Matching Across Online Social Networks

Authors: Anisa Halimi, Erman Ayday

Abstract: In this work, we study the privacy risk due to profile matching across online social networks (OSNs), in which anonymous profiles of OSN users are matched to their real identities using auxiliary information about them. We consider different attributes that are publicly shared by users. Such attributes include both strong identifiers such as user name and weak identifiers such as interest or sentiment variation between different posts of a user in different platforms. We study the effect of using different combinations of these attributes on profile matching in order to show the privacy threat in an extensive way. The proposed framework mainly relies on machine learning techniques and optimization algorithms. We evaluate the proposed framework on three datasets (Twitter - Foursquare, Google+ - Twitter, and Flickr) and show how profiles of the users in different OSNs can be matched with high probability by using the publicly shared attributes and/or the underlying graphical structure of the OSNs. We also show that the proposed framework provides notably higher precision values compared to the state-of-the-art that relies on machine learning techniques. We believe that this work will be a valuable step to build a tool for the OSN users to understand their privacy risks due to their public sharings.

Submitted 19 August, 2020; originally announced August 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1711.01815

Journal ref: Proceedings of the 22nd International Conference on Information and Communications Security (ICICS 2020)

arXiv:2003.13073 [pdf, other]

Tracking the Invisible: Privacy-Preserving Contact Tracing to Control the Spread of a Virus

Authors: Didem Demirag, Erman Ayday

Abstract: Today, tracking and controlling the spread of a virus is a crucial need for almost all countries. Doing this early would save millions of lives and help countries keep a stable economy. The easiest way to control the spread of a virus is to immediately inform the individuals who recently had close contact with the diagnosed patients. However, to achieve this, a centralized authority (e.g., a health authority) needs detailed location information from both healthy individuals and diagnosed patients. Thus, such an approach, although beneficial to control the spread of a virus, results in serious privacy concerns, and hence privacy-preserving solutions are required to solve this problem. Previous works on this topic either (i) compromise privacy (especially privacy of diagnosed patients) to have better efficiency or (ii) provide unscalable solutions. In this work, we propose a technique based on private set intersection between physical contact histories of individuals (that are recorded using smart phones) and a centralized database (run by a health authority) that keeps the identities of the positive diagnosed patients for the disease. The proposed solution protects the location privacy of both healthy individuals and diagnosed patients and it guarantees that the identities of the diagnosed patients remain hidden from other individuals. Notably, the proposed scheme allows individuals to receive warning messages indicating their previous contacts with a positive diagnosed patient.
Such warning messages will help them realize the risk and isolate themselves from other people. We make sure that the warning messages are only observed by the corresponding individuals and not by the health authority. We also implement the proposed scheme and show its efficiency and scalability via simulations.
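As a rough illustration of the private set intersection primitive this abstract builds on, here is a toy Diffie-Hellman-style PSI sketch. It is not the authors' protocol, whose construction the abstract does not specify; it is a generic textbook variant in which each side blinds hashed items with a secret exponent and only doubly blinded values are compared. The modulus `P`, the function names, and the contact tokens are all hypothetical, and the parameters are far too weak for real use.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime used as a toy modulus (assumption, not secure)


def hash_to_group(item):
    """Map a string to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(values, secret):
    """Raise every value to a secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]


def private_intersection(user_items, authority_items):
    """Simulate both parties locally; returns the items the user learns are shared."""
    a = secrets.randbelow(P - 3) + 2  # user's secret exponent
    b = secrets.randbelow(P - 3) + 2  # authority's secret exponent
    # User -> authority: H(x)^a for each contact token x.
    user_blinded = blind([hash_to_group(x) for x in user_items], a)
    # Authority -> user: (H(x)^a)^b in the same order, plus H(y)^b for its own items.
    user_double = blind(user_blinded, b)
    authority_blinded = blind([hash_to_group(y) for y in authority_items], b)
    # User raises the authority's values to a and compares doubly blinded values.
    authority_double = set(blind(authority_blinded, a))
    return {x for x, d in zip(user_items, user_double) if d in authority_double}


contacts = ["token-17", "token-42", "token-99"]   # hypothetical contact history
diagnosed = ["token-42", "token-07"]              # hypothetical authority database
matches = private_intersection(contacts, diagnosed)
```

In this sketch only the user computes the final comparison, which mirrors the abstract's requirement that warning information is observed by the individual and not by the health authority.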

Submitted 23 October, 2020; v1 submitted 29 March, 2020; originally announced March 2020.

arXiv:2001.09555 [pdf, other]

Collusion-Resilient Probabilistic Fingerprinting Scheme for Correlated Data

Authors: Emre Yilmaz, Erman Ayday

Abstract: In order to receive personalized services, individuals share their personal data with a wide range of service providers, hoping that their data will remain confidential. Thus, in case of an unauthorized distribution of their personal data by these service providers (or in case of a data breach) data owners want to identify the source of such data leakage. Digital fingerprinting schemes have been developed to embed a hidden and unique fingerprint into shared digital content, especially multimedia, to provide such liability guarantees. However, existing techniques utilize the high redundancy in the content, which is typically not included in personal data. In this work, we propose a probabilistic fingerprinting scheme that efficiently generates the fingerprint by considering a fingerprinting probability (to keep the data utility high) and publicly known inherent correlations between data points. To improve the robustness of the proposed scheme against colluding malicious service providers, we also utilize the Boneh-Shaw fingerprinting codes as a part of the proposed scheme. Furthermore, observing similarities between privacy-preserving data sharing techniques (that add controlled noise to the shared data) and the proposed fingerprinting scheme, we make a first attempt to develop a data sharing scheme that provides both privacy and fingerprint robustness at the same time. We experimentally show that fingerprint robustness and privacy have conflicting objectives and we propose a hybrid approach to control such a trade-off with a design parameter.
Using the proposed hybrid approach, we show that individuals can improve their level of privacy by slightly compromising fingerprint robustness. We implement and evaluate the performance of the proposed scheme on real genomic data. Our experimental results show the efficiency and robustness of the proposed scheme.

Submitted 26 January, 2020; originally announced January 2020.

arXiv:2001.08852 [pdf, other]

Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

Authors: Kerem Ayoz, Erman Ayday, A. Ercument Cicek

Abstract: Sharing genome data in a privacy-preserving way is a major bottleneck for the scientific progress promised by the big data era in genomics. A community-driven protocol named the genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy to implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye and hair color). Moreover, we show how the reconstructed genome using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks on beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon.
Thus, this work will be an important attempt at helping beacon operators and participants make informed decisions.
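The yes/no query interface and the update-based leakage described in this abstract can be sketched as follows. This is only a simplified illustration, not the authors' reconstruction attack: it shows the direct leakage from comparing beacon responses before and after an update, and nothing more. The `Beacon` class, the `snapshot` helper, and the toy genomes are hypothetical.

```python
class Beacon:
    """Toy beacon: answers whether any member carries a given allele at a position."""

    def __init__(self):
        self._alleles = set()  # (position, allele) pairs seen in any member

    def add_individual(self, genome):
        """genome: dict mapping position -> set of alleles carried there."""
        for position, alleles in genome.items():
            for allele in alleles:
                self._alleles.add((position, allele))

    def query(self, position, allele):
        """Yes/no response only; no counts or identities are returned."""
        return (position, allele) in self._alleles


def snapshot(beacon, candidates):
    """Record which candidate (position, allele) queries currently answer 'yes'."""
    return {c for c in candidates if beacon.query(*c)}


beacon = Beacon()
beacon.add_individual({1: {"A"}, 2: {"C"}, 3: {"G"}})
candidates = [(p, a) for p in (1, 2, 3) for a in "ACGT"]

before = snapshot(beacon, candidates)
beacon.add_individual({1: {"A"}, 2: {"T"}, 3: {"G", "T"}})  # the newly added victim
after = snapshot(beacon, candidates)

# Responses that flipped from 'no' to 'yes' must come from the new member.
leaked = after - before
```

Only alleles not already present in the beacon flip their response, so this diff yields a subset of the victim's alleles when a single individual is added.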

Submitted 21 August, 2020; v1 submitted 23 January, 2020; originally announced January 2020.

arXiv:1912.02045 [pdf, other]

Privacy-Preserving Search for a Similar Genomic Makeup in the Cloud

Authors: Xiaojie Zhu, Erman Ayday, Roman Vitenberg, Narasimha Raghavan Veeraragavan

Abstract: In this paper, we attempt to provide a privacy-preserving and efficient solution for the "similar patient search" problem among several parties (e.g., hospitals) by addressing the shortcomings of previous attempts. We consider a scenario in which each hospital has its own genomic dataset and the goal of a physician (or researcher) is to search for a patient similar to a given one (based on a genomic makeup) among all the hospitals in the system. To enable this search, we let each hospital encrypt its dataset with its own key and outsource the storage of its dataset to a public cloud. The physician can get authorization from multiple hospitals and send a query to the cloud, which efficiently performs the search across authorized hospitals using a privacy-preserving index structure. We propose a hierarchical index structure to index each hospital's dataset with low memory requirements. Furthermore, we develop a novel privacy-preserving index merging mechanism that generates a common search index from individual indices of each hospital to significantly improve the search efficiency. We also consider the storage of medical information associated with genomic data of a patient (e.g., diagnosis and treatment). We allow access to this information via a fine-grained access control policy that we develop through the combination of standard symmetric encryption and ciphertext policy attribute-based encryption. Using this mechanism, a physician can search for similar patients and obtain medical information about the matching records if the access policy holds.
We conduct experiments on large-scale genomic data and show the efficiency of the proposed scheme. Notably, we show that under our experimental settings, the proposed scheme is more than $60$ times faster than Wang et al.'s protocol and $95$ times faster than Asharov et al.'s solution.

Submitted 3 February, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

arXiv:1908.10172 [pdf, other]

doi 10.1016/j.patcog.2020.107327

Key Protected Classification for Collaborative Learning

Authors: Mert Bülent Sarıyıldız, Ramazan Gökberk Cinbiş, Erman Ayday

Abstract: Large-scale datasets play a fundamental role in training deep learning models. However, dataset collection is difficult in domains that involve sensitive information. Collaborative learning techniques provide a privacy-preserving solution, by enabling training over a number of private datasets that are not shared by their owners. However, recently, it has been shown that the existing collaborative learning frameworks are vulnerable to an active adversary that runs a generative adversarial network (GAN) attack. In this work, we propose a novel classification model that is resilient against such attacks by design. More specifically, we introduce a key-based classification model and a principled training scheme that protects class scores by using class-specific private keys, which effectively hide the information necessary for a GAN attack. We additionally show how to utilize high dimensional keys to improve the robustness against attacks without increasing the model complexity. Our detailed experiments demonstrate the effectiveness of the proposed technique. Source code is available at https://github.com/mbsariyildiz/key-protected-classification.

Submitted 22 April, 2020; v1 submitted 27 August, 2019; originally announced August 2019.

Comments: Accepted to Pattern Recognition

arXiv:1907.00935 [pdf, other]

One-Time Programs made Practical

Authors: Lianying Zhao, Joseph I. Choi, Didem Demirag, Kevin R. B. Butler, Mohammad Mannan, Erman Ayday, Jeremy Clark

Abstract: A one-time program (OTP) works as follows: Alice provides Bob with the implementation of some function. Bob can have the function evaluated exclusively on a single input of his choosing. Once executed, the program will fail to evaluate on any other input. State-of-the-art one-time programs have remained theoretical, requiring custom hardware that is cost-ineffective/unavailable, or confined to ad hoc/unrealistic assumptions. To bridge this gap, we explore how the Trusted Execution Environment (TEE) of modern CPUs can realize the OTP functionality. Specifically, we build two flavours of such a system: in the first, the TEE directly enforces the one-timeness of the program; in the second, the program is represented with a garbled circuit and the TEE ensures Bob's input can only be wired into the circuit once, equivalent to a smaller cryptographic primitive called one-time memory. These have different performance profiles: the first is best when Alice's input is small and Bob's is large, and the second for the converse.

Submitted 1 July, 2019; originally announced July 2019.

arXiv:1801.02069 [pdf, other]

doi 10.1109/TDSC.2017.2693986

Privacy-Preserving Aggregate Queries for Optimal Location Selection

Authors: Emre Yilmaz, Hakan Ferhatosmanoglu, Erman Ayday, Remzi Can Aksoy

Abstract: Today, vast amounts of location data are collected by various service providers. These location data owners have a good idea of where their users are most of the time. Other businesses also want to use this information for location analytics, such as finding the optimal location for a new branch. However, location data owners cannot share their data with other businesses, mainly due to privacy and legal concerns. In this paper, we propose privacy-preserving solutions in which location-based queries can be answered by data owners without sharing their data with other businesses and without accessing sensitive information such as the customer list of the businesses that send the query. We utilize a partially homomorphic cryptosystem as the building block of the proposed protocols. We prove the security of the protocols in the semi-honest threat model. We also explain how to achieve differential privacy in the proposed protocols and discuss its impact on utility. We evaluate the performance of the protocols with real and synthetic datasets and show that the proposed solutions are highly practical. The proposed solutions will facilitate an effective sharing of sensitive data between entities and joint analytics in a wide range of applications without violating their customers' privacy.

Submitted 6 January, 2018; originally announced January 2018.

Comments: IEEE Transactions on Dependable and Secure Computing, 2017

Journal ref: IEEE Transactions on Dependable and Secure Computing, 16(2), 329-343, 2019

arXiv:1711.01815 [pdf, other]

Profile Matching Across Unstructured Online Social Networks: Threats and Countermeasures

Authors: Anisa Halimi, Erman Ayday

Abstract: In this work, we propose a profile matching (or deanonymization) attack for unstructured online social networks (OSNs) in which similarity in graphical structure cannot be used for profile matching. We consider different attributes that are publicly shared by users. Such attributes include both obvious identifiers such as the user name and non-obvious identifiers such as interest similarity or sentiment variation between different posts of a user in different platforms. We study the effect of using different combinations of these attributes on profile matching in order to show the privacy threat in an extensive way. Our proposed framework mainly relies on machine learning techniques and optimization algorithms. We evaluate the proposed framework on two real-life datasets that are constructed by us. Our results indicate that profiles of the users in different OSNs can be matched with high probability by only using publicly shared attributes and without using the underlying graphical structure of the OSNs. We also propose possible countermeasures to mitigate this threat at the expense of a reduction in the accuracy (or utility) of the attributes shared by the users. We formulate the tradeoff between the privacy and profile utility of the users as an optimization problem and show how slight changes in the profiles of the users would reduce the success of the attack. We believe that this work will be a valuable step to build a privacy-preserving tool for users against profile matching attacks between OSNs.

Submitted 6 November, 2017; originally announced November 2017.

Comments: 17 pages, 15 figures

arXiv:1708.01023

Collusion-Secure Watermarking for Sequential Data

Authors: Arif Yilmaz, Erman Ayday

Abstract: In this work, we address the liability issues that may arise due to unauthorized sharing of personal data. We consider a scenario in which an individual shares his sequential data (such as genomic data or location patterns) with several service providers (SPs). In such a scenario, if his data is shared with other third parties without his consent, the individual wants to determine the service provider that is responsible for this unauthorized sharing. To provide this functionality, we propose a novel optimization-based watermarking scheme for sharing of sequential data. Thus, in the case of an unauthorized sharing of sensitive data, the proposed scheme can find the source of the leakage by checking the watermark inside the leaked data. In particular, the proposed scheme guarantees with a high probability that (i) the malicious SP that receives the data cannot understand the watermarked data points, (ii) when more than one malicious SP aggregate their data, they still cannot determine the watermarked data points, (iii) even if the unauthorized sharing involves only a portion of the original data or modified data (to damage the watermark), the corresponding malicious SP can be held responsible for the leakage, and (iv) the added watermark is compliant with the nature of the corresponding data. That is, if there are inherent correlations in the data, the added watermark still preserves such correlations. Watermarking typically means changing certain parts of the data, and hence it may have negative effects on data utility. The proposed scheme also minimizes such utility loss while it provides the aforementioned security guarantees. Furthermore, we conduct a case study of the proposed scheme on genomic data and show the security and utility guarantees of the proposed scheme.

Submitted 18 August, 2017; v1 submitted 3 August, 2017; originally announced August 2017.

arXiv:1605.05847

Privacy-Related Consequences of Turkish Citizen Database Leak

Authors: Erin Avllazagaj, Erman Ayday, A. Ercument Cicek

Abstract: Personal data is collected and stored more than ever by governments and companies in the digital age. Even though the data is only released after anonymization, deanonymization is possible by joining different datasets. This puts the privacy of individuals in jeopardy. Furthermore, data leaks can unveil personal identifiers of individuals when security is breached. Processing the leaked dataset can provide even more information than what is visible to the naked eye. In this work, we report the results of our analyses on the recent "Turkish citizen database leak", which revealed the national identifier numbers of close to fifty million voters, along with personal information such as date of birth, birth place, and full address. We show that with automated processing of the data, one can uniquely identify (i) mother's maiden name of individuals and (ii) landline numbers, for a significant portion of people. This is a serious privacy and security threat because (i) identity theft risk is now higher, and (ii) scammers are able to access more information about individuals. The only and utmost goal of this work is to point out the security risks and suggest stricter measures to related companies and agencies to protect the security and privacy of individuals.

Submitted 19 May, 2016; originally announced May 2016.

Comments: 12 pages, 5 figures

arXiv:1405.1891

Privacy in the Genomic Era

Authors: Muhammad Naveed, Erman Ayday, Ellen W. Clayton, Jacques Fellay, Carl A. Gunter, Jean-Pierre Hubaux, Bradley A. Malin, XiaoFeng Wang

Abstract: Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While the computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.

Submitted 17 June, 2015; v1 submitted 8 May, 2014; originally announced May 2014.

ACM Class: K.6.5

arXiv:1306.1264

The Chills and Thrills of Whole Genome Sequencing

Authors: Erman Ayday, Emiliano De Cristofaro, Jean-Pierre Hubaux, Gene Tsudik

Abstract: In recent years, Whole Genome Sequencing (WGS) evolved from a futuristic-sounding research project to an increasingly affordable technology for determining complete genome sequences of complex organisms, including humans. This prompts a wide range of revolutionary applications, as WGS promises to improve modern healthcare and provide a better understanding of the human genome -- in particular, its relation to diseases and response to treatments. However, this progress raises worrisome privacy and ethical issues, since, besides uniquely identifying its owner, the genome contains a treasure trove of highly personal and sensitive information. In this article, after summarizing recent advances in genomics, we discuss some important privacy issues associated with human genomic information and identify a number of particularly relevant research challenges.

Submitted 16 February, 2015; v1 submitted 5 June, 2013; originally announced June 2013.

Comments: A slightly different version of this article appears in IEEE Computer Magazine, Vol. 48, No. 2, February 2015, under the title "Whole Genome Sequencing: Revolutionary Medicine or Privacy Nightmare"

arXiv:1209.5335

BPRS: Belief Propagation Based Iterative Recommender System

Authors: Erman Ayday, Arash Einolghozati, Faramarz Fekri

Abstract: In this paper we introduce the first application of the Belief Propagation (BP) algorithm in the design of recommender systems. We formulate the recommendation problem as an inference problem and aim to compute the marginal probability distributions of the variables which represent the ratings to be predicted. However, computing these marginal probability functions is computationally prohibitive for large-scale systems. Therefore, we utilize the BP algorithm to efficiently compute these functions. Recommendations for each active user are then iteratively computed by probabilistic message passing. As opposed to the previous recommender algorithms, BPRS does not require solving the recommendation problem for all the users if it wishes to update the recommendations for only a single active user. Further, BPRS computes the recommendations for each user with linear complexity and without requiring a training period. Via computer simulations (using the 100K MovieLens dataset), we verify that BPRS iteratively reduces the error in the predicted ratings of the users until it converges. Finally, we confirm that BPRS is comparable to state-of-the-art methods such as the Correlation-based neighborhood model (CorNgbr) and Singular Value Decomposition (SVD) in terms of rating and precision accuracy. Therefore, we believe that the BP-based recommendation algorithm is a promising new approach which offers a significant advantage in scalability while providing competitive accuracy for recommender systems.

Submitted 24 September, 2012; originally announced September 2012.

Showing 1–34 of 34 results for author: Ayday, E