Better Naive Bayes classification for high-precision spam detection

Published: 01 August 2009

Abstract

Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that potential solutions must ensure a very low false-positive rate, which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, one of the most popular machine learning models applied in the spam filtering domain. Drawing on recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as performing disproportionately poorly and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade in which the decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rates at high precision than some of the best variants of naive Bayes proposed to date.
Copyright © 2009 John Wiley & Sons, Ltd.
This work was done when the first author was an intern at Microsoft Live Labs Research.
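
To make the low-FPR setting in the abstract concrete, the following is a minimal sketch of a plain multinomial naive Bayes scorer whose decision threshold is chosen on held-out legitimate mail so that a target false-positive rate is not exceeded. It is an illustration only: it uses a simple sum of log-likelihood ratios with Laplace smoothing, and it does not implement the term weight aggregation function, the feature augmentation, or the classifier cascade proposed in the article. The function names (train_nb, score, threshold_for_fpr) and the smoothing parameter alpha are assumptions introduced here.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a two-class multinomial naive Bayes model (1 = spam, 0 = ham).

    Returns per-token log-likelihood-ratio weights and the prior log-odds.
    alpha is the Laplace smoothing constant (an assumed default).
    """
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) + alpha * len(vocab) for y in (0, 1)}
    weights = {
        t: math.log((counts[1][t] + alpha) / totals[1])
        - math.log((counts[0][t] + alpha) / totals[0])
        for t in vocab
    }
    prior = math.log(n_docs[1] / n_docs[0])
    return weights, prior


def score(tokens, weights, prior):
    """Spam log-odds: prior plus the summed weights of known tokens."""
    return prior + sum(weights.get(t, 0.0) for t in tokens)


def threshold_for_fpr(ham_scores, max_fpr):
    """Pick a threshold so that at most max_fpr of ham_scores exceed it.

    A message is flagged as spam only when score > threshold.
    """
    ranked = sorted(ham_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # ham messages we may misflag
    return ranked[allowed] if allowed < len(ranked) else float("-inf")
```

In use, one would score a held-out set of legitimate messages, pass those scores to threshold_for_fpr with the tolerated false-positive rate, and then flag a new message only if its score exceeds the returned threshold. The detection rate achieved at that threshold is what a low-FPR evaluation would report.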

    Published In

    Software—Practice & Experience  Volume 39, Issue 11
    August 2009
    78 pages

    Publisher

    John Wiley & Sons, Inc.

    United States

    Author Tags

    1. cascaded models
    2. naive Bayes
    3. spam filtering

    Qualifiers

    • Article

    Cited By

    • (2021) Co-op Training: a Semi-supervised Learning Method for Data Streams. 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 933-938. DOI: 10.1109/SMC52423.2021.9658598. Online publication date: 17-Oct-2021
    • (2016) An online universal classifier for binary, multi-class and multi-label classification. 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 003701-003706. DOI: 10.1109/SMC.2016.7844809. Online publication date: 9-Oct-2016
    • (2013) A Probabilistic Approach for Events Identification from Social Media RSS Feeds. Proceedings of the 18th International Conference on Database Systems for Advanced Applications - Volume 7827, pp. 139-152. DOI: 10.1007/978-3-642-40270-8_12. Online publication date: 22-Apr-2013
    • (2012) Using probabilistic generative models for ranking risks of Android apps. Proceedings of the 2012 ACM conference on Computer and communications security, pp. 241-252. DOI: 10.1145/2382196.2382224. Online publication date: 16-Oct-2012
    • (2012) Comment spam detection by sequence mining. Proceedings of the fifth ACM international conference on Web search and data mining, pp. 183-192. DOI: 10.1145/2124295.2124318. Online publication date: 8-Feb-2012
    • (2008) A survey of emerging approaches to spam filtering. ACM Computing Surveys 44:2, pp. 1-27. DOI: 10.1145/2089125.2089129. Online publication date: 5-Mar-2008
