Auditing Data Provenance in Text-Generation Models

Published: 25 July 2019

Abstract

To help enforce data-protection regulations such as GDPR and detect unauthorized uses of personal data, we develop a new model auditing technique that helps users check if their data was used to train a machine learning model. We focus on auditing deep-learning models that generate natural-language text, including word prediction and dialog generation. These models are at the core of popular online services and are often trained on personal data such as users' messages, searches, chats, and comments. We design and evaluate a black-box auditing method that can detect, with very few queries to a model, if a particular user's texts were used to train it (among thousands of other users). We empirically show that our method can successfully audit well-generalized models that are not overfitted to the training data. We also analyze how text-generation models memorize word sequences and explain why this memorization makes them amenable to auditing.
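To make the method described above concrete, the following is a minimal sketch of such a black-box audit: query the target model on a user's texts, record how highly it ranks each true next word, summarize those ranks into per-text features, and let a binary classifier decide whether the user's data was part of the training set. The predict_fn(prefix) interface, the rank-histogram features, and the logistic-regression auditor are illustrative assumptions for this sketch, not the authors' exact pipeline.

# Minimal black-box audit sketch (Python; numpy and scikit-learn assumed).
# predict_fn(prefix_token_ids) is a hypothetical interface returning the
# target model's next-token probability vector for the given prefix.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_features(token_ids, predict_fn, num_buckets=10):
    """Histogram of the ranks the target model assigns to each true next
    token of one text; only a handful of queries per text are needed."""
    ranks, vocab_size = [], None
    for i in range(1, len(token_ids)):
        probs = np.asarray(predict_fn(token_ids[:i]))   # next-token distribution
        vocab_size = probs.shape[0]
        # rank 0 means the true token was the model's top prediction
        ranks.append(int(np.sum(probs > probs[token_ids[i]])))
    if not ranks:
        return np.zeros(num_buckets)
    # coarse log-spaced rank buckets; memorized sequences pile up in low buckets
    edges = np.concatenate(([0.0], np.logspace(0, np.log10(vocab_size), num_buckets)))
    hist, _ = np.histogram(ranks, bins=edges)
    return hist / len(ranks)

def train_auditor(member_texts, nonmember_texts, predict_fn):
    """Fit the audit classifier on texts whose membership status is known."""
    X = [rank_features(t, predict_fn) for t in member_texts + nonmember_texts]
    y = [1] * len(member_texts) + [0] * len(nonmember_texts)
    return LogisticRegression(max_iter=1000).fit(X, y)

def audit_user(user_texts, predict_fn, auditor):
    """Average per-text membership scores into a single score for the user."""
    scores = [auditor.predict_proba([rank_features(t, predict_fn)])[0, 1]
              for t in user_texts]
    return float(np.mean(scores))

In practice, the labeled member and non-member texts used to fit the auditor would come from data whose training status the auditor knows (for example, shadow models trained on other users); tokenization and vocabulary handling are glossed over in this sketch.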




      Published In

      KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
      July 2019
      3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 25 July 2019


      Author Tags

      1. auditing
      2. machine learning
      3. membership inference
      4. text generation

      Qualifiers

      • Research-article

      Conference

      KDD '19

      Acceptance Rates

KDD '19 paper acceptance rate: 110 of 1,200 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)

