PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels

Esfahanizadeh, Homa; Yala, Adam; D'Oliveira, Rafael G. L.; Jaba, Andrea J. D.; Quach, Victor; Duffy, Ken R.; Jaakkola, Tommi S.; Vaikuntanathan, Vinod; Ghobadi, Manya; Barzilay, Regina; Médard, Muriel

arXiv:2304.00047 (cs)

[Submitted on 31 Mar 2023]

Abstract: Allowing organizations to share their data for training machine learning (ML) models without unintended information leakage remains an open problem in practice. A promising technique is to train models on encoded data. Our approach, called Privately Encoded Open Datasets with Public Labels (PEOPL), uses a certain class of randomly constructed transforms to encode sensitive data. Organizations publish their randomly encoded data and the associated raw labels for ML training, and training is done without knowledge of the encoding realization. We investigate several important aspects of this problem. We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user (e.g., an adversary) and a faithful user (e.g., a model developer) who have access to the published encoded data. We then theoretically characterize primitives for building families of encoding schemes that motivate the use of random deep neural networks. Empirically, we compare the performance of our randomized encoding scheme and a linear scheme against a suite of computational attacks, and we show that our scheme achieves prediction accuracy competitive with raw-sample baselines. Moreover, we demonstrate that multiple institutions, using independent random encoders, can collaborate to train improved ML models.
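
The publishing workflow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-layer ReLU network, the Gaussian weight initialization, the layer sizes, and the names make_random_encoder and publish are all assumptions chosen for illustration. It only shows the general idea of each institution drawing its own private random encoder and releasing encoded samples together with raw labels.

```python
import random

def make_random_encoder(in_dim, hidden_dim, out_dim, seed):
    """Draw a random two-layer ReLU network (hypothetical architecture).

    The seed stands in for the private encoding realization, which the
    institution keeps secret and never publishes.
    """
    rng = random.Random(seed)

    def gauss_matrix(rows, cols):
        # Gaussian weights scaled by fan-in; an assumed initialization.
        scale = (2.0 / cols) ** 0.5
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    w1 = gauss_matrix(hidden_dim, in_dim)
    w2 = gauss_matrix(out_dim, hidden_dim)

    def encode(x):
        # Hidden layer with ReLU, followed by a linear output layer.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, hidden)) for row in w2]

    return encode

def publish(samples, labels, encoder):
    """Pair each encoded sample with its raw (unencoded) label."""
    return [(encoder(x), y) for x, y in zip(samples, labels)]

# Two institutions with independent random encoders (toy data).
samples_a, labels_a = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]], [0, 1]
samples_b, labels_b = [[0.4, 0.4, 0.7]], [1]

encoder_a = make_random_encoder(3, 8, 4, seed=1234)
encoder_b = make_random_encoder(3, 8, 4, seed=5678)

# The pooled release is what a model developer would train on,
# without ever seeing the encoders or the raw samples.
released = publish(samples_a, labels_a, encoder_a) + publish(samples_b, labels_b, encoder_b)
```

A model developer would train on the released pairs directly. Whether a given encoder family is private or useful enough is exactly what the paper's privacy and utility scores are meant to assess, and this sketch makes no claim about either.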

Comments:	Submitted to IEEE Transactions on Information Forensics and Security
Subjects:	Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Cite as:	arXiv:2304.00047 [cs.LG]
	(or arXiv:2304.00047v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2304.00047

Submission history

From: Homa Esfahanizadeh
[v1] Fri, 31 Mar 2023 18:03:53 UTC (1,559 KB)
