C-sanitized: A privacy model for document redaction and sanitization

Authors:

David Sánchez,

Montserrat Batet

Journal of the Association for Information Science and Technology, Volume 67, Issue 1

Pages 148 - 163

https://doi.org/10.1002/asi.23363

Published: 01 January 2016

Abstract

Vast amounts of information are exchanged and/or released daily. The sensitive nature of much of this information creates a serious privacy threat when documents are made available to untrusted third parties without control. In such cases, the responsible organization should undertake appropriate data protection measures, especially under current data privacy legislation. To do so, human experts are usually asked to redact or sanitize document contents. To relieve this burdensome task, this paper presents a privacy model for document redaction/sanitization that offers several advantages over other models available in the literature. Based on the well-established foundations of data semantics and information theory, our model provides a framework to develop and implement automated and inherently semantic redaction/sanitization tools. Moreover, contrary to ad hoc redaction methods, our proposal provides a priori privacy guarantees that can be intuitively defined according to current legislation on data privacy. Empirical tests performed within the context of several use cases illustrate the applicability of our model and its ability to mimic the reasoning of human sanitizers.

Index Terms

C-sanitized: A privacy model for document redaction and sanitization
1. Security and privacy
  1. Human and societal aspects of security and privacy
2. Social and professional topics
  1. Computing / technology policy
    1. Privacy policies

Index terms have been assigned to the content through auto-classification.

Published In

Journal of the Association for Information Science and Technology Volume 67, Issue 1

January 2016

243 pages

ISSN:2330-1635

EISSN:2330-1643

Publisher

John Wiley & Sons, Inc.

United States

Qualifiers

Article

Total Citations: 22

Cited By

Chen Y, Kirkham R (2024). Exploring How UK Public Authorities Use Redaction to Protect Personal Information. ACM Transactions on Management Information Systems, 15:3, 1-23. DOI: 10.1145/3651989. Online publication date: 12-Mar-2024
https://dl.acm.org/doi/10.1145/3651989
Staufer D, Pallas F, Berendt B (2024). Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 733-745. DOI: 10.1145/3630106.3658936. Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3630106.3658936
Deuber D, Keuchen M, Christin N, Calandrino J, Troncoso C (2023). Assessing anonymity techniques employed in German court decisions. Proceedings of the 32nd USENIX Conference on Security Symposium, 5199-5216. DOI: 10.5555/3620237.3620528. Online publication date: 9-Aug-2023
https://dl.acm.org/doi/10.5555/3620237.3620528
Liu G, Sun X, Li Y, Li H, Zhao S, Guo Z (2023). An Automatic Privacy-Aware Framework for Text Data in Online Social Network Based on a Multi-Deep Learning Model. International Journal of Intelligent Systems, 2023. DOI: 10.1155/2023/1727285. Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.1155/2023/1727285
Zhou L, Yu L, Zou J, Min H (2023). Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis. Proceedings of the 35th International Conference on Scientific and Statistical Database Management, 1-4. DOI: 10.1145/3603719.3603734. Online publication date: 10-Jul-2023
https://dl.acm.org/doi/10.1145/3603719.3603734
Hassan J, Shehzad D, Habib U, Aftab M, Ahmad M, Kuleev R, Mazzara M (2022). The Rise of Cloud Computing. Computational Intelligence and Neuroscience, 2022. DOI: 10.1155/2022/8303504. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/8303504
Zhao Y, Chen J (2022). A Survey on Differential Privacy for Unstructured Data Content. ACM Computing Surveys, 54:10s, 1-28. DOI: 10.1145/3490237. Online publication date: 13-Sep-2022
https://dl.acm.org/doi/10.1145/3490237
Jaillant L, Caputo A (2022). Unlocking digital archives: cross-disciplinary perspectives on AI and born-digital data. AI & Society, 37:3, 823-835. DOI: 10.1007/s00146-021-01367-x. Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.1007/s00146-021-01367-x
Manzanares-Salor B, Sánchez D, Lison P (2022). Automatic Evaluation of Disclosure Risks of Text Anonymization Methods. Privacy in Statistical Databases, 157-171. DOI: 10.1007/978-3-031-13945-1_12. Online publication date: 21-Sep-2022
https://dl.acm.org/doi/10.1007/978-3-031-13945-1_12
Kanwal T, Moqurrab S, Anjum A, Khan A, Rodrigues J, Jeon G (2022). Formal verification and complexity analysis of confidentiality aware textual clinical documents framework. International Journal of Intelligent Systems, 37:12, 10380-10399. DOI: 10.1002/int.22533. Online publication date: 29-Dec-2022
https://dl.acm.org/doi/10.1002/int.22533