DOI: 10.1145/3459637.3482281
Research article | Open access

Natural Language Understanding with Privacy-Preserving BERT

Published: 30 October 2021

Abstract

Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we further propose privacy-adaptive LM pretraining methods and show that our approach can boost the utility of BERT dramatically while retaining the same level of privacy protection. We also quantify the level of privacy preservation and provide guidance on privacy configuration. Our experiments and findings lay the groundwork for future explorations of privacy-preserving NLU with pretrained LMs.
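The dχ-privacy mechanism the abstract refers to is typically realized at the token level: add noise with density proportional to exp(−ε‖z‖) to a token's embedding, then map the noisy vector back to the nearest vocabulary token. The sketch below illustrates that general mechanism as described in the local-DP text literature, not this paper's exact implementation; the function names and the toy embeddings are purely illustrative.

```python
# Minimal sketch of d_chi-privacy text perturbation (general mechanism from
# the local-DP text literature; not necessarily this paper's exact method).
import numpy as np

def sample_noise(d, epsilon, rng):
    """Sample z in R^d with density proportional to exp(-epsilon * ||z||):
    a uniform direction scaled by a Gamma(d, 1/epsilon) magnitude."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)
    return magnitude * direction

def privatize_token(vec, vocab_embeddings, epsilon, rng):
    """Perturb one token embedding, then snap to the nearest vocabulary
    vector; the index of that vector is the privatized replacement token."""
    noisy = vec + sample_noise(vec.shape[0], epsilon, rng)
    dists = np.linalg.norm(vocab_embeddings - noisy, axis=1)
    return int(np.argmin(dists))

# Toy demo: 5 random "vocabulary" embeddings in 8 dimensions.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(5, 8))
new_idx = privatize_token(vocab[2], vocab, epsilon=10.0, rng=rng)
print(new_idx)  # some index in [0, 5); larger epsilon keeps the original more often
```

Smaller ε injects larger noise (stronger privacy, since any token becomes a plausible origin of the output), while larger ε tends to return the original token, which is the privacy/utility trade-off the paper studies.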




Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN:9781450384469
DOI:10.1145/3459637
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. language model pretraining
  2. local privacy constraints
  3. natural language understanding

Qualifiers

  • Research-article

Conference

CIKM '21

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 649
  • Downloads (last 6 weeks): 64

Reflects downloads up to 02 Sep 2024

Cited By

  • (2024) PPTIF: Privacy-Preserving Transformer Inference Framework for Language Translation. IEEE Access, 12, 48881–48897. DOI: 10.1109/ACCESS.2024.3384268
  • (2024) Resource Provider Determination in Cloud Environments. Advances in Information and Communication, 646–653. DOI: 10.1007/978-3-031-53963-3_45 (17 March 2024)
  • (2024) Privacy Preservation of Large Language Models in the Metaverse Era: Research Frontiers, Categorical Comparisons, and Future Directions. International Journal of Network Management. DOI: 10.1002/nem.2292 (29 July 2024)
  • (2023) Privformer: Privacy-preserving Transformer with MPC. 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), 392–410. DOI: 10.1109/EuroSP57164.2023.00031 (July 2023)
  • (2023) Summary and Outlook. Foundation Models for Natural Language Processing, 383–419. DOI: 10.1007/978-3-031-23190-2_8 (27 February 2023)
