DOI: 10.1145/3508546.3508635
research-article

Towards a Free Text Dataset for Hiding Quasi-Identifiers

Published: 25 February 2022

    Abstract

    Personal information protection is becoming increasingly important for individuals. The community argues that, besides personal identifier information (PII), quasi-identifier information (QII) also needs protection, and methods for protecting it have attracted considerable research. Many existing methods for protecting QII focus on structured text data organized as tables of records. However, free text data containing QII is very common in application domains such as a company's data lakes, so protecting QII in free text requires new methods. Supervised machine learning solutions are promising but usually require a large-scale dataset to train the model. Here we propose a novel method towards building such a dataset. Our method exploits an existing structured text dataset and a table-to-sentence generation deep learning model, and incorporates the idea of the Piecewise Convolutional Neural Network (PCNN). The resulting dataset contains more than 120,000 free text sentences, many of which contain QII data.
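
    The core idea of turning table records into labeled free text can be illustrated with a minimal sketch. It is not the method of the paper: a fixed sentence template stands in for the table-to-sentence deep learning model, and the PCNN component is not modeled at all. The column names, the choice of which columns count as quasi-identifiers, the template wording, and the function name `record_to_labeled_sentence` are all assumptions made for illustration.

```python
from string import Formatter
from typing import Dict, List, Tuple

# Hypothetical columns treated as quasi-identifiers (an assumption for this
# sketch; the paper's actual attribute set is not given in the abstract).
QUASI_IDENTIFIERS = {"age", "occupation", "zip_code"}

# Hypothetical template standing in for a learned table-to-sentence model.
TEMPLATE = (
    "A {age}-year-old {occupation} living in area {zip_code} "
    "was diagnosed with {diagnosis}."
)

Span = Tuple[int, int, str]  # (start offset, end offset, column name)


def record_to_labeled_sentence(record: Dict[str, str]) -> Tuple[str, List[Span]]:
    """Render one table row as a sentence and record where QII values land.

    The sentence is assembled piece by piece so that each value's character
    offsets are known exactly, instead of being searched for afterwards.
    """
    pieces: List[str] = []
    spans: List[Span] = []
    position = 0
    for literal, field, _spec, _conversion in Formatter().parse(TEMPLATE):
        # Literal text between placeholders is copied unchanged.
        pieces.append(literal)
        position += len(literal)
        if field is None:
            continue  # trailing literal with no placeholder after it
        value = record[field]
        if field in QUASI_IDENTIFIERS:
            spans.append((position, position + len(value), field))
        pieces.append(value)
        position += len(value)
    return "".join(pieces), spans


if __name__ == "__main__":
    # A made-up row, not taken from any real dataset.
    row = {
        "age": "34",
        "occupation": "teacher",
        "zip_code": "02139",
        "diagnosis": "asthma",
    }
    sentence, spans = record_to_labeled_sentence(row)
    print(sentence)
    for start, end, column in spans:
        print(column, repr(sentence[start:end]))
```

    Each generated sentence carries character-level spans for its QII values, which is the kind of supervision a sequence-labeling model for hiding QII would need. A learned generator would produce more varied sentences than a single template, at the cost of having to align values with output tokens.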

    Published In

    ACAI '21: Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence
    December 2021
    699 pages
    ISBN:9781450385053
    DOI:10.1145/3508546
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Natural language processing
    2. Personal information protection
    3. Quasi-identifier information
    4. Table to sentence

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ACAI'21

    Acceptance Rates

    Overall Acceptance Rate 173 of 395 submissions, 44%
