C-sanitized: A privacy model for document redaction and sanitization

Published: 01 January 2016

Abstract

Vast amounts of information are exchanged and/or released every day. The sensitive nature of much of this information creates a serious privacy threat when documents are made available to untrusted third parties without control. In such cases, the responsible organization should apply appropriate data protection measures, especially under current legislation on data privacy. To do so, human experts are usually asked to redact or sanitize document contents. To relieve this burdensome task, this paper presents a privacy model for document redaction/sanitization that offers several advantages over other models available in the literature. Based on the well-established foundations of data semantics and information theory, our model provides a framework for developing and implementing automated and inherently semantic redaction/sanitization tools. Moreover, unlike ad hoc redaction methods, our proposal provides a priori privacy guarantees that can be intuitively defined according to current legislation on data privacy. Empirical tests performed in several use cases illustrate the applicability of our model and its ability to mimic the reasoning of human sanitizers.

      Published In

      Journal of the Association for Information Science and Technology, Volume 67, Issue 1
      January 2016
      243 pages
      ISSN:2330-1635
      EISSN:2330-1643
      Publisher

      John Wiley & Sons, Inc.

      United States

      Author Tags

      1. knowledge
      2. privacy
      3. semantics

      Qualifiers

      • Article

      Cited By

      • (2024) Exploring How UK Public Authorities Use Redaction to Protect Personal Information. ACM Transactions on Management Information Systems 15:3 (1-23). DOI: 10.1145/3651989. Online publication date: 12-Mar-2024
      • (2024) Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (733-745). DOI: 10.1145/3630106.3658936. Online publication date: 3-Jun-2024
      • (2023) Assessing anonymity techniques employed in German court decisions. Proceedings of the 32nd USENIX Conference on Security Symposium (5199-5216). DOI: 10.5555/3620237.3620528. Online publication date: 9-Aug-2023
      • (2023) An Automatic Privacy-Aware Framework for Text Data in Online Social Network Based on a Multi-Deep Learning Model. International Journal of Intelligent Systems 2023. DOI: 10.1155/2023/1727285. Online publication date: 1-Jan-2023
      • (2023) Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis. Proceedings of the 35th International Conference on Scientific and Statistical Database Management (1-4). DOI: 10.1145/3603719.3603734. Online publication date: 10-Jul-2023
      • (2022) The Rise of Cloud Computing. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/8303504. Online publication date: 1-Jan-2022
      • (2022) A Survey on Differential Privacy for Unstructured Data Content. ACM Computing Surveys 54:10s (1-28). DOI: 10.1145/3490237. Online publication date: 13-Sep-2022
      • (2022) Unlocking digital archives: cross-disciplinary perspectives on AI and born-digital data. AI & Society 37:3 (823-835). DOI: 10.1007/s00146-021-01367-x. Online publication date: 1-Sep-2022
      • (2022) Automatic Evaluation of Disclosure Risks of Text Anonymization Methods. Privacy in Statistical Databases (157-171). DOI: 10.1007/978-3-031-13945-1_12. Online publication date: 21-Sep-2022
      • (2022) Formal verification and complexity analysis of confidentiality aware textual clinical documents framework. International Journal of Intelligent Systems 37:12 (10380-10399). DOI: 10.1002/int.22533. Online publication date: 29-Dec-2022
