Large-Scale Information Extraction under Privacy-Aware Constraints

Authors:

Ranganath Kondapally

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4792 - 4793

https://doi.org/10.1145/3534678.3547352

Published: 14 August 2022

Abstract

In this digital age, people spend a significant portion of their lives online, and their activities have led to an explosion of personal data. Typically, this data is private, and nobody except the user is allowed to look at it. To provide a better experience and assist users in their activities, it is critical to mine certain information from this data. This poses an interesting and complex challenge from a scalable information extraction point of view: privacy constraints leave little data to learn from, yet the resulting models must be highly accurate when run on a large amount of diverse data across different users. Anonymization is typically used to convert private data into publicly accessible data. But this may not always be feasible, and it may require complex differential privacy guarantees to be safe from potential negative consequences. Further, the anonymization process needs to ensure that the data retains sufficient information for modeling purposes. Other techniques involve building extraction models using a small amount of seen (eyes-on) data with no privacy restrictions (which can therefore be labeled) and a large amount of unseen (eyes-off) data that only a machine or a program can access. In this tutorial, we use emails as the canonical example of private data to explain in detail the challenges and solutions for scalable information extraction (IE) under privacy-aware constraints.
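One way to picture the eyes-on/eyes-off setting described above is a self-training loop: a model is fit on the small labeled eyes-on set, a program applies it to the eyes-off data, and only confident machine-assigned labels are folded back into training. The sketch below is an illustration under stated assumptions, not the method presented in the tutorial. The token-count classifier, the function names (`train`, `predict`, `self_train`), the `min_margin` threshold, and the toy email lines are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Fit per-label token counts (a tiny naive Bayes model without priors)."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def predict(model, text):
    """Return (label, margin): the best label and its log-score gap to the runner-up."""
    vocab = set().union(*model.values())
    scores = {}
    for label, counts in model.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        scores[label] = sum(math.log((counts[t] + 1) / total) for t in tokenize(text))
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else float("inf")
    return ranked[0], margin


def self_train(eyes_on, eyes_off, min_margin=1.0):
    """Assumed workflow: eyes-off text is read only inside this function.

    Returns the retrained model and the number of pseudo-labeled examples,
    never the eyes-off texts themselves.
    """
    model = train(eyes_on)
    pseudo = []
    for text in eyes_off:
        label, margin = predict(model, text)
        if margin >= min_margin:  # keep only confident machine labels
            pseudo.append((text, label))
    return train(eyes_on + pseudo), len(pseudo)


# Hypothetical toy data: labeled eyes-on lines and unlabeled eyes-off lines.
eyes_on = [
    ("your order has shipped", "order"),
    ("order number confirmed", "order"),
    ("weekly newsletter digest", "other"),
    ("newsletter unsubscribe link", "other"),
]
eyes_off = ["order shipped today", "newsletter digest inside"]

model, n_pseudo = self_train(eyes_on, eyes_off)
```

In this sketch a human only ever sees `eyes_on` and the aggregate `n_pseudo`. A real system would also need to ensure the model itself does not leak private content, which this example does not address.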


Index Terms

Large-Scale Information Extraction under Privacy-Aware Constraints
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Novelty in information retrieval
    2. Retrieval tasks and goals
      1. Information extraction


Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN:9781450393850

DOI:10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers: Abstract

Conference

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
