DOI: 10.1145/2783258.2788580
research-article
Open access

Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails

Published: 10 August 2015
  • Abstract

    Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., purchases, events) into templates. Extracting structured data from B2C emails allows users to track important information on various devices.
    However, it also poses several challenges: short response times are required over massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information.
    In this paper, we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on how to annotate product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, such as Conditional Random Fields (CRFs), can solve this problem well.
    To accomplish this task, we propose a hybrid approach that trains a CRF model using the labels predicted by binary classifiers (weak learners). Because the performance of the weak learners can be low, we apply the Expectation Maximization (EM) algorithm to the CRF to remove noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over the weak learners and plain CRFs.
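
The hybrid idea described above (train a sequence model on noisy weak-learner labels, then alternate between relabeling and refitting) can be illustrated with a toy sketch. This is not the paper's EM-CRF. As simplifying assumptions, it swaps the CRF for a smoothed HMM-style tagger and uses hard EM (Viterbi relabeling) instead of a full expectation step. The tag names, the functions `fit`, `viterbi`, and `hard_em`, and the toy emails are all hypothetical.

```python
import math
from collections import defaultdict

TAGS = ("O", "P")  # assumed tag set: O = other token, P = product-name token

def fit(sequences, labels, alpha=1.0):
    """M-step: add-alpha smoothed emission/transition log-probs from current labels."""
    emit = {t: defaultdict(float) for t in TAGS}
    trans = {p: defaultdict(float) for p in ("<s>",) + TAGS}
    vocab = {w for seq in sequences for w in seq}
    for seq, tags in zip(sequences, labels):
        prev = "<s>"  # start-of-sequence marker
        for w, t in zip(seq, tags):
            emit[t][w] += 1
            trans[prev][t] += 1
            prev = t

    def log_emit(t, w):
        total = sum(emit[t].values())
        return math.log((emit[t][w] + alpha) / (total + alpha * len(vocab)))

    def log_trans(p, t):
        total = sum(trans[p].values())
        return math.log((trans[p][t] + alpha) / (total + alpha * len(TAGS)))

    return log_emit, log_trans

def viterbi(seq, log_emit, log_trans):
    """E-step (hard): most likely tag path for one non-empty token sequence."""
    scores = {t: log_trans("<s>", t) + log_emit(t, seq[0]) for t in TAGS}
    back = []
    for w in seq[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: scores[p] + log_trans(p, t))
            new_scores[t] = scores[best_prev] + log_trans(best_prev, t) + log_emit(t, w)
            pointers[t] = best_prev
        scores = new_scores
        back.append(pointers)
    tag = max(TAGS, key=lambda t: scores[t])
    path = [tag]
    for pointers in reversed(back):  # follow back-pointers to recover the path
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

def hard_em(sequences, weak_labels, iterations=5):
    """Start from weak-learner labels; alternate refit and relabel until stable."""
    labels = [list(tags) for tags in weak_labels]
    for _ in range(iterations):
        log_emit, log_trans = fit(sequences, labels)
        new_labels = [viterbi(seq, log_emit, log_trans) for seq in sequences]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy data: four emails from one template; the weak learner misses "shoe" once.
emails = [["item", "red", "shoe", "shipped"]] * 4
weak = [["O", "P", "P", "O"]] * 3 + [["O", "P", "O", "O"]]
refined = hard_em(emails, weak)
print(refined[3])  # ['O', 'P', 'P', 'O']
```

On this toy input the single noisy label is overruled by the pattern shared across the other emails, and no email is inspected by hand. A real system would need richer features than token identity, since product names vary from email to email.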

    Published In

    KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    August 2015
    2378 pages
    ISBN:9781450336642
    DOI:10.1145/2783258
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. business intelligence
    2. structured information extraction

    Qualifiers

    • Research-article

    Conference

    KDD '15

    Acceptance Rates

    KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2022) Discovering Organizational Hierarchy through a Corporate Ranking Algorithm: The Enron Case. Complexity, vol. 2022, pages 1-18. DOI: 10.1155/2022/8154476. Online publication date: 21-Feb-2022.
    • (2022) Large-Scale Entity Extraction from Enterprise Data. Proceedings of the Second International Conference on AI-ML Systems, pages 1-2. DOI: 10.1145/3564121.3564818. Online publication date: 12-Oct-2022.
    • (2022) Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4792-4793. DOI: 10.1145/3534678.3547352. Online publication date: 14-Aug-2022.
    • (2022) Landmarks and regions: a robust approach to data extraction. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 993-1009. DOI: 10.1145/3519939.3523705. Online publication date: 9-Jun-2022.
    • (2021) Email Clustering & Generating Email Templates Based on Their Topics. Proceedings of the 2021 5th International Conference on Information System and Data Mining, pages 96-103. DOI: 10.1145/3471287.3471298. Online publication date: 27-May-2021.
    • (2021) Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 4845-4848. DOI: 10.1145/3459637.3482027. Online publication date: 26-Oct-2021.
    • (2019) Synthesis and machine learning for heterogeneous extraction. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 301-315. DOI: 10.1145/3314221.3322485. Online publication date: 8-Jun-2019.
    • (2019) Large-Scale Information Extraction from Emails with Data Constraints. Big Data Analytics, pages 124-139. DOI: 10.1007/978-3-030-37188-3_8. Online publication date: 12-Dec-2019.
    • (2018) Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 734-743. DOI: 10.1145/3219819.3219901. Online publication date: 19-Jul-2018.
    • (2018) Automated Extractions for Machine Generated Mail. Companion Proceedings of the The Web Conference 2018, pages 655-662. DOI: 10.1145/3184558.3186582. Online publication date: 23-Apr-2018.
