DOI: 10.1145/1401890.1401997

Detecting privacy leaks using corpus-based association rules

Published: 24 August 2008

Abstract

Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through web-mining algorithms that leverage search engines such as Google or Yahoo!.
Our model also covers the important case of private corpora, which captures inference detection in enterprise settings that have a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine.
We present results from two experiments. The first demonstrates the performance of our techniques in identifying all the keywords that allow a particular topic (e.g., "HIV") to be inferred with confidence above a certain threshold. The second uses the public Enron e-mail dataset: we postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for that topic.
These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well-suited to efficient inference detection.
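
As a rough illustration of the co-occurrence idea described in the abstract, the sketch below scores a rule "keyword → topic" by the standard association-rule confidence: the number of documents containing both words divided by the number containing the keyword. This is a minimal sketch under stated assumptions, not the authors' algorithm. The function names (`build_index`, `confidence`, `inferring_keywords`), the toy documents, and the 0.8 threshold are all hypothetical, and a small in-memory inverted index stands in for a search engine or a private-corpus index.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def confidence(index, keyword, topic):
    """Confidence of the rule keyword -> topic.

    Computed as |docs with keyword and topic| / |docs with keyword|.
    Returns 0.0 when the keyword never occurs in the corpus.
    """
    with_keyword = index.get(keyword, set())
    if not with_keyword:
        return 0.0
    with_both = with_keyword & index.get(topic, set())
    return len(with_both) / len(with_keyword)


def inferring_keywords(index, topic, threshold):
    """Return, in sorted order, the words whose confidence meets the threshold."""
    return sorted(
        word
        for word in index
        if word != topic and confidence(index, word, topic) >= threshold
    )


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "hiv antiretroviral therapy",
    "antiretroviral hiv clinic",
    "clinic flu shot",
    "flu season",
]
index = build_index(docs)

print(confidence(index, "clinic", "hiv"))     # 0.5
print(inferring_keywords(index, "hiv", 0.8))  # ['antiretroviral', 'therapy']
```

With a Web search engine as the reference corpus, the two set sizes would instead be estimated from hit counts for the corresponding queries; that substitution is an assumption of this sketch rather than a detail taken from the abstract.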


Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. association rule mining
  2. inference control
  3. inference detection
  4. search engine
  5. web mining

Qualifiers

  • Research-article

Conference

KDD08

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2023) Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis. Proceedings of the 35th International Conference on Scientific and Statistical Database Management, pp. 1-4. DOI: 10.1145/3603719.3603734. Online publication date: 10-Jul-2023
  • (2023) Is Your Model Sensitive? SPEDAC: A New Resource for the Automatic Classification of Sensitive Personal Data. IEEE Access, 11, pp. 10864-10880. DOI: 10.1109/ACCESS.2023.3240089. Online publication date: 2023
  • (2023) Semantic Attack on Disassociated Transaction Data. SN Computer Science, 4:4. DOI: 10.1007/s42979-023-01781-6. Online publication date: 20-Apr-2023
  • (2022) PRIVAFRAME: A Frame-Based Knowledge Graph for Sensitive Personal Data. Big Data and Cognitive Computing, 6:3 (90). DOI: 10.3390/bdcc6030090. Online publication date: 26-Aug-2022
  • (2022) The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. Computational Linguistics, 48:4, pp. 1053-1101. DOI: 10.1162/coli_a_00458. Online publication date: 1-Dec-2022
  • (2021) Deception for Cyber Defence: Challenges and Opportunities. 2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pp. 173-182. DOI: 10.1109/TPSISA52974.2021.00020. Online publication date: Dec-2021
  • (2021) Can pre-trained Transformers be used in detecting complex sensitive sentences? - A Monsanto case study. 2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pp. 90-97. DOI: 10.1109/TPSISA52974.2021.00010. Online publication date: Dec-2021
  • (2021) Utility-Preserving Privacy Protection of Textual Documents via Word Embeddings. IEEE Transactions on Knowledge and Data Engineering, pp. 1-1. DOI: 10.1109/TKDE.2021.3076632. Online publication date: 2021
  • (2020) A Study of Self-Privacy Violations in Online Public Discourse. 2020 IEEE International Conference on Big Data (Big Data), pp. 1041-1050. DOI: 10.1109/BigData50022.2020.9378163. Online publication date: 10-Dec-2020
  • (2020) A Multi-level Access Technique for Privacy-Preserving Perturbation in Association Rule Mining. Advances in Artificial Intelligence and Data Engineering, pp. 631-645. DOI: 10.1007/978-981-15-3514-7_48. Online publication date: 14-Aug-2020
