DOI: 10.1145/3531146.3534642
What Does it Mean for a Language Model to Preserve Privacy?

Published: 20 June 2022 Publication History
Abstract

    Natural language reflects our private lives and identities, making its privacy concerns as broad as those of real life. Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets. An adversary can exploit this tendency to extract training data. Depending on the nature of the content and the context in which this data was collected, this could violate expectations of privacy. Thus, there is a growing interest in techniques for training language models that preserve privacy. In this paper, we discuss the mismatch between the narrow assumptions made by popular data protection techniques (data sanitization and differential privacy), and the broadness of natural language and of privacy as a social norm. We argue that existing protection methods cannot guarantee a generic and meaningful notion of privacy for language models. We conclude that language models should be trained on text data which was explicitly produced for public use.



Published In

FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
June 2022, 2351 pages
ISBN: 9781450393522
DOI: 10.1145/3531146
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Data Sanitization
    2. Differential Privacy
    3. Natural Language Processing
    4. Privacy

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • NUS Early Career Research Award (NUS ECRA)
    • NUS Presidential Young Professorship research fund
    • VMware Early Career Faculty Grant

    Conference

    FAccT '22


