research-article

DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders

Authors:

Benjamin Weggenmann,

Valentin Rublack,

Michael Andrejczuk,

Justus Mattern,

Florian Kerschbaum

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 721 - 731

https://doi.org/10.1145/3485447.3512232

Published: 25 April 2022

Abstract

While vast amounts of personal data are shared daily on public online platforms and used by companies and analysts to gain valuable insights, privacy concerns are also on the rise: modern authorship attribution techniques have proven effective at identifying individuals from their data, such as their writing style or their behavior in choosing and rating movies. It is hence crucial to develop data sanitization methods that allow users’ data to be shared while protecting their privacy and preserving the quality and content of the original data.

In this paper, we tackle anonymization of textual data and propose an end-to-end differentially private variational autoencoder architecture. Unlike previous approaches that achieve differential privacy on a per-word level through individual perturbations, our solution works at an abstract level by perturbing the latent vectors that provide a global summary of the input texts. Decoding an obfuscated latent vector thus not only allows our model to produce coherent, high-quality output text that is human-readable, but also results in strong anonymization due to the diversity of the produced data. We evaluate our approach on IMDb movie and Yelp business reviews, confirming its anonymization capabilities and preservation of the semantics and utility of the original sentences.
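
The latent-vector perturbation described above can be illustrated with a minimal sketch. The abstract does not state which noise distribution, clipping rule, or privacy accounting the model uses, so everything below is an assumption made for illustration: the latent vector is clipped to a bounded L1 norm and then perturbed with the standard Laplace mechanism. The names `clip_l1`, `laplace_noise`, `perturb_latent`, and the parameters `bound` and `epsilon` are hypothetical and are not taken from the paper.

```python
import random


def clip_l1(z, bound):
    """Rescale z so that its L1 norm is at most `bound` (assumed clipping rule)."""
    norm = sum(abs(x) for x in z)
    if norm <= bound:
        return list(z)
    factor = bound / norm
    return [x * factor for x in z]


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_latent(z, epsilon, bound=1.0, rng=None):
    """Return a noisy copy of the latent vector z.

    Any two clipped vectors differ by at most 2 * bound in L1 norm, so
    per-coordinate Laplace noise with scale 2 * bound / epsilon gives an
    epsilon-differentially-private release of the clipped vector.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    clipped = clip_l1(z, bound)
    scale = 2.0 * bound / epsilon
    return [x + laplace_noise(scale, rng) for x in clipped]


if __name__ == "__main__":
    # Hypothetical 4-dimensional latent vector standing in for an encoder output.
    latent = [0.8, -1.5, 0.3, 2.1]
    noisy = perturb_latent(latent, epsilon=5.0, bound=1.0, rng=random.Random(42))
    print(noisy)
```

In a full pipeline, the noisy vector would be passed to the decoder in place of the encoder's output. Because the noise is added once to a summary of the whole text rather than to each word, the decoder can still emit a fluent sentence; a smaller `epsilon` yields a larger noise scale and hence stronger obfuscation.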


Cited By

X. Yang, O. Ardakanian, M. Almgren, and E. Fernandes. 2023. Privacy through Diffusion: A White-listing Approach to Sensor Data Anonymization. In Proceedings of the 5th Workshop on CPS&IoT Security and Privacy. 101–107. Online publication date: 26 November 2023. https://dl.acm.org/doi/10.1145/3605758.3623496

P. Li, Y. Pei, and J. Li. 2023. A comprehensive survey on design and application of autoencoder in deep learning. Applied Soft Computing 138:C. Online publication date: 1 May 2023. https://dl.acm.org/doi/10.1016/j.asoc.2023.110176

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN:9781450390965

DOI:10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France),
Raphaël Troncy (EURECOM, France),
Elena Simperl (King’s College London, UK),
Deepak Agarwal (Pinterest, USA),
Aristides Gionis (KTH Royal Institute of Technology, Sweden),
Ivan Herman (W3C / retired),
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Article Metrics (reflects downloads up to 27 Jul 2024)

Total Citations: 2
Total Downloads: 985
Downloads (last 12 months): 284
Downloads (last 6 weeks): 32
