research-article

Online template induction for machine-generated emails

Published: 01 July 2019

Abstract

In emails, information abounds. Whether it be a bill reminder, a hotel confirmation, or a shipping notification, our emails contain useful bits of information that enable a number of applications. Most of this email traffic is machine-generated, sent from a business to a human. These business-to-consumer emails are typically instantiated from a set of email templates, and discovering these templates is a key step in enabling a variety of intelligent experiences. Existing email information extraction systems typically separate information extraction into two steps: an offline template discovery process (called template induction) that is periodically run on a sample of emails, and an online email annotation process that applies discovered templates to emails as they arrive. Since information extraction requires an email's template to be known, any delay in discovering a newly created template causes missed extractions, lowering the overall extraction coverage. In this paper, we present a novel system called Crusher that discovers templates completely online, reducing template discovery delay from a week (for the existing MapReduce-based batch system) to minutes. Furthermore, Crusher has a resource consumption footprint that is significantly smaller than the existing batch system. We also report on the surprising lesson we learned that conventional stream processing systems do not present a good framework on which to build Crusher. Crusher delivers an order of magnitude more throughput than a prototype built using a stream processing engine. We hope that these lessons help designers of stream processing systems accommodate a broader range of applications like online template induction in the future.

Cited By

  • (2021) Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 4845-4848. DOI: 10.1145/3459637.3482027. Online publication date: 26-Oct-2021

Published In

Proceedings of the VLDB Endowment  Volume 12, Issue 11
July 2019
543 pages

Publisher

VLDB Endowment
