research-article

Online template induction for machine-generated emails

Authors: Michael Whittaker, James B. Wendt, Marc Najork

Proceedings of the VLDB Endowment, Volume 12, Issue 11

Pages 1235 - 1248

https://doi.org/10.14778/3342263.3342264

Published: 01 July 2019

Abstract

In emails, information abounds. Whether it be a bill reminder, a hotel confirmation, or a shipping notification, our emails contain useful bits of information that enable a number of applications. Most of this email traffic is machine-generated, sent from a business to a human. These business-to-consumer emails are typically instantiated from a set of email templates, and discovering these templates is a key step in enabling a variety of intelligent experiences. Existing email information extraction systems typically separate information extraction into two steps: an offline template discovery process (called template induction) that is periodically run on a sample of emails, and an online email annotation process that applies discovered templates to emails as they arrive. Since information extraction requires an email's template to be known, any delay in discovering a newly created template causes missed extractions, lowering the overall extraction coverage. In this paper, we present a novel system called Crusher that discovers templates completely online, reducing template discovery delay from a week (for the existing MapReduce-based batch system) to minutes. Furthermore, Crusher has a resource consumption footprint that is significantly smaller than that of the existing batch system. We also report a surprising lesson: conventional stream processing systems are not a good framework on which to build Crusher. Crusher delivers an order of magnitude more throughput than a prototype built using a stream processing engine. We hope that these lessons help designers of stream processing systems accommodate a broader range of applications like online template induction in the future.
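To make the contrast between batch and online discovery concrete, the following is a minimal toy sketch of the online idea, not Crusher's actual design, which the abstract does not describe. It assumes that emails from the same template share a sender and a body that differs only in digits, and that a structure counts as a template once a hypothetical `MIN_SUPPORT` number of emails share it. The names `fingerprint`, `OnlineInducer`, and `MIN_SUPPORT` are all illustrative assumptions.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical support threshold; not a value taken from the paper.
MIN_SUPPORT = 3


def fingerprint(sender: str, body: str) -> str:
    """Hash the sender plus the body with digit runs masked out.

    Masking digits is a stand-in (an assumption) for separating the fixed
    template text from the variable fields such as order numbers.
    """
    masked = re.sub(r"\d+", "<NUM>", body)
    masked = re.sub(r"\s+", " ", masked).strip()
    return hashlib.sha1(f"{sender}|{masked}".encode("utf-8")).hexdigest()


class OnlineInducer:
    """Toy online inducer: templates become usable as soon as they have support."""

    def __init__(self, min_support: int = MIN_SUPPORT):
        self.min_support = min_support
        self.counts = defaultdict(int)  # fingerprint -> emails seen so far
        self.templates = set()          # fingerprints promoted to templates

    def observe(self, sender: str, body: str) -> bool:
        """Process one email; return True if its template is known afterwards."""
        fp = fingerprint(sender, body)
        if fp in self.templates:
            return True
        self.counts[fp] += 1
        if self.counts[fp] >= self.min_support:
            # Promote immediately instead of waiting for a periodic batch run.
            self.templates.add(fp)
            del self.counts[fp]
            return True
        return False


inducer = OnlineInducer(min_support=3)
results = [
    inducer.observe("shop@example.com", f"Your order {n} has shipped.")
    for n in (101, 202, 303, 404)
]
```

In this sketch the third matching email promotes the structure to a template, so the fourth email can be annotated right away. A periodic batch job would leave such emails without a template until its next run, which is the discovery delay the abstract describes.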


Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 11 (July 2019), 543 pages

ISSN: 2150-8097

Editors: Lei Chen, Fatma Özcan

Publisher: VLDB Endowment
