DOI: 10.1145/3485447.3512232
research-article

DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders

Published: 25 April 2022

    Abstract

    While vast amounts of personal data are shared daily on public online platforms and used by companies and analysts to gain valuable insights, privacy concerns are also on the rise: modern authorship attribution techniques have proven effective at identifying individuals from their data, such as their writing style or the way they pick and rate movies. It is hence crucial to develop data sanitization methods that allow users' data to be shared while protecting their privacy and preserving the quality and content of the original data.
    In this paper, we tackle anonymization of textual data and propose an end-to-end differentially private variational autoencoder architecture. Unlike previous approaches that achieve differential privacy on a per-word level through individual perturbations, our solution works at an abstract level by perturbing the latent vectors that provide a global summary of the input texts. Decoding an obfuscated latent vector thus not only allows our model to produce coherent, high-quality output text that is human-readable, but also results in strong anonymization due to the diversity of the produced data. We evaluate our approach on IMDb movie and Yelp business reviews, confirming its anonymization capabilities and preservation of the semantics and utility of the original sentences.
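
The idea of perturbing one latent vector per text, rather than each word, can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the latent vector is clipped to a fixed L2 norm and then perturbed with the classic Gaussian mechanism, which is a stand-in technique and may differ from the mechanism the authors actually use. The encoder and decoder are omitted, the list `z` stands in for an encoder output, and the names `clip_l2`, `gaussian_sigma`, and `perturb_latent` as well as all parameter values are hypothetical.

```python
import math
import random


def clip_l2(z, bound):
    """Rescale z so that its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm <= bound or norm == 0.0:
        return list(z)
    return [x * bound / norm for x in z]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classic Gaussian mechanism.

    Assumption: the standard calibration
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    which is only stated for 0 < epsilon < 1.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("requires 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def perturb_latent(z, bound, epsilon, delta, rng):
    """Clip a latent vector and add isotropic Gaussian noise.

    Two clipped vectors differ by at most 2 * bound in L2 norm,
    so that value is used as the sensitivity.
    """
    clipped = clip_l2(z, bound)
    sigma = gaussian_sigma(2.0 * bound, epsilon, delta)
    return [x + rng.gauss(0.0, sigma) for x in clipped]


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed only for reproducibility
    z = [3.0, 4.0, 0.0]             # hypothetical encoder output
    noisy = perturb_latent(z, bound=1.0, epsilon=0.5, delta=1e-5, rng=rng)
    print(noisy)                    # this vector would be passed to a decoder
```

In such a setup the decoder only ever sees the noisy vector, so any text it generates is post-processing of a differentially private value. A smaller `epsilon` yields a larger noise scale and therefore output that is further from the original text.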

    Cited By

    • (2023) Privacy through Diffusion: A White-listing Approach to Sensor Data Anonymization. Proceedings of the 5th Workshop on CPS&IoT Security and Privacy, 101–107. https://doi.org/10.1145/3605758.3623496. Online publication date: 26-Nov-2023
    • (2023) A comprehensive survey on design and application of autoencoder in deep learning. Applied Soft Computing 138:C. https://doi.org/10.1016/j.asoc.2023.110176. Online publication date: 1-May-2023

            Published In

            WWW '22: Proceedings of the ACM Web Conference 2022
            April 2022
            3764 pages
            ISBN: 9781450390965
            DOI: 10.1145/3485447
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Author Tags

            1. adversarial autoencoder
            2. authorship attribution
            3. authorship obfuscation
            4. differential privacy
            5. disentangled representations
            6. online reviews
            7. text anonymization
            8. variational autoencoder

            Qualifiers

            • Research-article
            • Research
            • Refereed limited

            Conference

            WWW '22
            WWW '22: The ACM Web Conference 2022
            April 25 - 29, 2022
            Virtual Event, Lyon, France

            Acceptance Rates

            Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

            Article Metrics

            • Downloads (last 12 months): 284
            • Downloads (last 6 weeks): 32
            Reflects downloads up to 27 Jul 2024
