Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails

Authors:

Vanja Josifovski,

Alex J. Smola

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 2257 - 2266

https://doi.org/10.1145/2783258.2788580

Published: 10 August 2015

Abstract

Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., a purchase or an event) into templates. Extracting structured data from B2C emails allows users to track important information on various devices.

However, extraction also poses several challenges: short response times are required at massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information.

In this paper we first introduce a system which can extract structured information automatically without requiring human review of any personal content. Then we focus on how to annotate product names in the extracted texts, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, such as Conditional Random Fields (CRFs), can solve this problem well.

To accomplish this task, we propose a hybrid approach, which trains a CRF model using the labels predicted by binary classifiers (weak learners). Because the performance of the weak learners can be low, we apply the Expectation Maximization (EM) algorithm to the CRF to remove the noise and improve the accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves the product name annotations over the weak learners and plain CRFs.
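The idea of denoising weak labels with a sequence model can be illustrated with a toy sketch. This is not the paper's EM-CRF: it assumes a plain two-state label chain (an HMM-style simplification, not a feature-based CRF) in which the weak learners' per-token probabilities act as fixed evidence and EM re-estimates only the label transition table. The function names `forward_backward` and `em_transitions`, the uniform prior, the smoothing constant, and the `weak` probabilities are all invented for illustration.

```python
def forward_backward(evidence, trans, prior):
    """Posterior marginals (gamma) and pairwise posteriors (xi) on a label chain.

    evidence[t][k]: weak-learner score for label k at token t (must be > 0)
    trans[j][k]:    P(label k at t | label j at t-1)
    prior[k]:       P(label k at the first token)
    """
    n, K = len(evidence), len(prior)
    alpha = []  # forward messages, normalised per step for stability
    for t in range(n):
        if t == 0:
            pred = list(prior)
        else:
            pred = [sum(alpha[t - 1][j] * trans[j][k] for j in range(K))
                    for k in range(K)]
        a = [pred[k] * evidence[t][k] for k in range(K)]
        z = sum(a)
        alpha.append([v / z for v in a])
    beta = [[1.0] * K for _ in range(n)]  # backward messages
    for t in range(n - 2, -1, -1):
        b = [sum(trans[j][k] * evidence[t + 1][k] * beta[t + 1][k]
                 for k in range(K)) for j in range(K)]
        z = sum(b)
        beta[t] = [v / z for v in b]
    gamma = []
    for t in range(n):
        g = [alpha[t][k] * beta[t][k] for k in range(K)]
        z = sum(g)
        gamma.append([v / z for v in g])
    xi = []
    for t in range(n - 1):
        m = [[alpha[t][j] * trans[j][k] * evidence[t + 1][k] * beta[t + 1][k]
              for k in range(K)] for j in range(K)]
        z = sum(sum(row) for row in m)
        xi.append([[v / z for v in row] for row in m])
    return gamma, xi


def em_transitions(sequences, n_iter=20, smooth=1e-3):
    """EM over the transition table only; the evidence stays fixed."""
    K = 2
    trans = [[0.5, 0.5], [0.5, 0.5]]
    prior = [0.5, 0.5]  # assumption: uniform and never re-estimated
    for _ in range(n_iter):
        counts = [[smooth] * K for _ in range(K)]
        for ev in sequences:
            _, xi = forward_backward(ev, trans, prior)  # E-step
            for m in xi:
                for j in range(K):
                    for k in range(K):
                        counts[j][k] += m[j][k]
        # M-step: normalise expected transition counts row by row
        trans = [[c / sum(row) for c in row] for row in counts]
    return trans, prior


# Hypothetical weak-learner probabilities that each token is part of a
# product name; the 0.4 in the middle of a run plays the noisy label.
weak = [0.1, 0.1, 0.9, 0.4, 0.9, 0.9, 0.1, 0.1, 0.1]
evidence = [[1.0 - p, p] for p in weak]
trans, prior = em_transitions([evidence])
posterior, _ = forward_backward(evidence, trans, prior)
smoothed = [round(g[1], 3) for g in posterior]
print(smoothed)
```

Each entry of `smoothed` is the chain's posterior probability that the token is part of a product name. Because neighbouring labels are coupled through `trans`, a token whose weak label disagrees with its neighbours can be pulled toward them, which is the intuition behind using structure to clean up noisy weak labels without inspecting any email.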

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2015

2378 pages

ISBN:9781450336642

DOI:10.1145/2783258

General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)

Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)

Copyright © 2015 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 10 - 13, 2015

Sydney, NSW, Australia

Acceptance Rates

KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
