A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ

Published: 26 April 2021

    Abstract

    Studies have shown that website privacy policies are too long and too hard for their target audience to comprehend. These studies, and a more recent body of research that uses machine learning and natural language processing to automatically summarize privacy policies, benefit greatly from, if not rely on, corpora of privacy policies collected from the web. While smaller annotated corpora of web privacy policies have been made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its 1.5 million manually categorized websites to collect hundreds of thousands of privacy policies associated with their categories, enabling research on privacy policies across different categories and market sectors. We review the statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus of web privacy policies is a valuable tool at researchers' disposal for investigating privacy policies. For example, it facilitates comparison among different methods of privacy policy summarization by providing a benchmark, and it can be used in unsupervised machine learning to summarize privacy policies.

    Supplementary Material

    MP4 File (CODASPY21-codas224d.mp4)
    Studies have shown that website privacy policies are too long and hard to read. These studies, and a recent body of research that uses machine learning to automatically summarize privacy policies, benefit greatly from (or rely on) corpora of privacy policies collected from the web. While smaller annotated corpora of web privacy policies have been made public, we are not aware of any large publicly available corpus. We use DMOZ, an open-content directory of the web, and its 1.5 million manually categorized websites to collect hundreds of thousands of privacy policies with their categories, enabling research on privacy policies across categories. We make the corpus available and also obtain insights about privacy policies, e.g., which websites post them less often. Our corpus of web privacy policies is a valuable tool for research. It facilitates comparison among different methods of privacy policy summarization by providing a benchmark, and it can be used in unsupervised machine learning to summarize policies.

    Published In

    CODASPY '21: Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy
    April 2021
    348 pages
    ISBN: 9781450381437
    DOI: 10.1145/3422337

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. corpus
    2. datasets
    3. dmoz
    4. privacy policies

    Qualifiers

    • Short-paper

    Conference

    CODASPY '21

    Acceptance Rates

    Overall Acceptance Rate 149 of 789 submissions, 19%
