DOI: 10.1145/2661829.2662018
research-article

How Many Folders Do You Really Need?: Classifying Email into a Handful of Categories

Published: 03 November 2014

Abstract

Email classification is still a mostly manual task. Consequently, most Web mail users never define a single folder. Recently, however, automatic classification offering the same categories to all users has started to appear in some Web mail clients, such as AOL or Gmail. We adopt this approach, rather than previous (unsuccessful) personalized approaches, because of the change in the nature of consumer email traffic, which is now dominated by (non-spam) machine-generated email. We propose here a novel approach for (1) automatically distinguishing between personal and machine-generated email and (2) classifying messages into latent categories, without requiring users to have defined any folder. We report how we discovered that a set of 6 "latent" categories (one for human-generated and the others for machine-generated messages) can explain a significant portion of email traffic. We describe in detail the steps involved in building a Web-scale email categorization system, from the collection of ground-truth labels and the selection of features to the training of models. Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo mail service who elected to be part of such research studies. Our system achieved precision and recall rates close to 90%, and the latent categories we discovered were shown to cover 70% of both email traffic and email search queries. We believe that these results pave the way for a change of approach in the Web mail industry, and could support the invention of new large-scale email discovery paradigms that had not been possible before.
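
The precision and recall figures quoted in the abstract are per-category measures. As a reading aid only, the sketch below shows how such rates could be computed from ground-truth and predicted labels. It is not the authors' evaluation code: the function name `precision_recall`, the category names ("human", "shopping", "travel", "finance"), and the sample labels are all hypothetical, since the abstract does not name the six categories.

```python
from collections import Counter


def precision_recall(gold, predicted, category):
    """Return (precision, recall) for one category from parallel label lists."""
    # Count each (gold label, predicted label) pair once.
    pairs = Counter(zip(gold, predicted))
    tp = sum(n for (g, p), n in pairs.items() if g == category and p == category)
    fp = sum(n for (g, p), n in pairs.items() if g != category and p == category)
    fn = sum(n for (g, p), n in pairs.items() if g == category and p != category)
    # Define both rates as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground-truth and predicted categories for six messages.
gold = ["human", "shopping", "travel", "human", "shopping", "finance"]
predicted = ["human", "shopping", "shopping", "human", "travel", "finance"]

for name in ("human", "shopping"):
    p, r = precision_recall(gold, predicted, name)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the hypothetical "shopping" category has one correct prediction, one false positive, and one miss, so both rates come out at 0.5. A system-level figure such as the one reported would aggregate these counts over all categories and messages.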

    Published In

    CIKM '14: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management
    November 2014
    2152 pages
    ISBN: 9781450325981
    DOI: 10.1145/2661829
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. email classification
    2. lda
    3. machine-generated email

    Qualifiers

    • Research-article

    Conference

    CIKM '14
    Acceptance Rates

    CIKM '14 Paper Acceptance Rate 175 of 838 submissions, 21%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2023) Content-Based Email Classification at Scale. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4559-4566. DOI: 10.1145/3583780.3615462. Online publication date: 21-Oct-2023.
    • (2023) EmFore: Online Learning of Email Folder Classification Rules. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2280-2290. DOI: 10.1145/3583780.3614863. Online publication date: 21-Oct-2023.
    • (2021) Exploring Email-Prompted Information Needs. Proceedings of the ACM on Human-Computer Interaction, 5:CSCW2, pages 1-33. DOI: 10.1145/3479861. Online publication date: 18-Oct-2021.
    • (2020) Rethinking Consumer Email: The Research Process for Yahoo Mail 6. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1-6. DOI: 10.1145/3334480.3375224. Online publication date: 25-Apr-2020.
    • (2020) Screening of Email Box in Portuguese with SVM at Banco do Brasil. Computational Processing of the Portuguese Language, pages 153-163. DOI: 10.1007/978-3-030-41505-1_15. Online publication date: 2-Mar-2020.
    • (2019) Context-Aware Intent Identification in Email Conversations. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 585-594. DOI: 10.1145/3331184.3331260. Online publication date: 18-Jul-2019.
    • (2019) RiSER: Learning Better Representations for Richly Structured Emails. The World Wide Web Conference, pages 886-895. DOI: 10.1145/3308558.3313720. Online publication date: 13-May-2019.
    • (2019) Exploring User Behavior in Email Re-Finding Tasks. The World Wide Web Conference, pages 1245-1255. DOI: 10.1145/3308558.3313450. Online publication date: 13-May-2019.
    • (2019) Characterizing and Predicting Email Deferral Behavior. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 627-635. DOI: 10.1145/3289600.3291028. Online publication date: 30-Jan-2019.
    • (2019) Large-Scale Information Extraction from Emails with Data Constraints. Big Data Analytics, pages 124-139. DOI: 10.1007/978-3-030-37188-3_8. Online publication date: 17-Dec-2019.
