research-article

How Many Folders Do You Really Need?: Classifying Email into a Handful of Categories

Authors:

Mihajlo Grbovic,

Yoelle Maarek

CIKM '14: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management

Pages 869 - 878

https://doi.org/10.1145/2661829.2662018

Published: 03 November 2014

Abstract

Email classification is still a mostly manual task. Consequently, most Web mail users never define a single folder. Recently, however, automatic classification offering the same categories to all users has started to appear in some Web mail clients, such as AOL or Gmail. We adopt this approach, rather than previous (unsuccessful) personalized approaches, because of the change in the nature of consumer email traffic, which is now dominated by (non-spam) machine-generated email. We propose here a novel approach for (1) automatically distinguishing between personal and machine-generated email and (2) classifying messages into latent categories, without requiring users to have defined any folder. We report how we discovered that a set of 6 "latent" categories (one for human-generated and the others for machine-generated messages) can explain a significant portion of email traffic. We describe in detail the steps involved in building a Web-scale email categorization system, from the collection of ground-truth labels and the selection of features to the training of models. Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo mail service who elected to be part of such research studies. Our system achieved precision and recall rates close to 90%, and the latent categories we discovered were shown to cover 70% of both email traffic and email search queries. We believe that these results pave the way for a change of approach in the Web mail industry, and could support the invention of new large-scale email discovery paradigms that had not been possible before.
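The two-step structure described in the abstract (first separate human from machine-generated mail, then assign machine-generated mail to a small fixed set of categories) can be illustrated with a minimal Python sketch. Everything concrete below is an assumption made for illustration: the abstract does not name the categories, features, or models, so the category names, keyword lists, sender heuristic, and the `Message`, `classify`, and `precision_recall` helpers are hypothetical stand-ins for the trained system, not the authors' method.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    body: str


# Hypothetical sender local parts that suggest an automated mailer.
AUTOMATED_SENDERS = {"noreply", "no-reply", "notifications"}

# Placeholder category names and keywords, invented for illustration only.
MACHINE_CATEGORIES = {
    "shopping": {"order", "shipped", "receipt"},
    "travel": {"flight", "itinerary", "hotel"},
    "finance": {"statement", "payment", "invoice"},
    "social": {"friend", "follower", "commented"},
    "career": {"job", "interview", "resume"},
}


def is_machine_generated(msg: Message) -> bool:
    """Stage 1 (toy heuristic): flag automated senders or unsubscribe text."""
    local_part = msg.sender.split("@")[0].lower()
    return local_part in AUTOMATED_SENDERS or "unsubscribe" in msg.body.lower()


def classify(msg: Message) -> str:
    """Stage 2: route machine-generated mail to the best-matching category."""
    if not is_machine_generated(msg):
        return "human"
    tokens = set(re.findall(r"[a-z]+", f"{msg.subject} {msg.body}".lower()))
    scores = {cat: len(tokens & kws) for cat, kws in MACHINE_CATEGORIES.items()}
    best = max(scores, key=scores.get)
    # Mail that matches no category keyword stays uncategorized.
    return best if scores[best] > 0 else "other"


def precision_recall(gold, predicted, label):
    """Per-category precision and recall against ground-truth labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    msg = Message("no-reply@shop.example", "Your order has shipped",
                  "Track your package. Unsubscribe here.")
    print(classify(msg))
```

The "other" label reflects the abstract's observation that the latent categories cover only part of the traffic, so a fixed category set has to leave some messages uncategorized. A production system of the kind the abstract describes would replace both heuristics with models trained on ground-truth labels.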

Index Terms

Information systems → World Wide Web → Web applications → Internet communications tools → Email

Published In

CIKM '14: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management

November 2014

2152 pages

ISBN:9781450325981

DOI:10.1145/2661829

General Chairs: Jianzhong Li (Harbin Inst. of Technology), X. Sean Wang (Fudan University)

Program Chairs: Minos Garofalakis (Technical University of Crete, Greece), Ian Soboroff (National Institute of Standards, USA), Torsten Suel (New York University, USA), Min Wang (Google Research, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '14

CIKM '14: 2014 ACM Conference on Information and Knowledge Management

November 3 - 7, 2014

Shanghai, China

Acceptance Rates

CIKM '14 Paper Acceptance Rate 175 of 838 submissions, 21%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
