Email Spam Filtering: A Systematic Review

Published: 01 April 2008

Abstract

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam?
We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

References

[1]
"You might be an anti-spam kook if.," http://www.rhyolite.com/antispam/you-might-be.html.
[2]
2004 National Technology Readiness Survey: Summary report, http://www.smith.umd.edu/ntrs/NTRS 2004.pdf, 2005.
[3]
A. J. Alberg, J. W. Park, B. W. Hager, M. V. Brock, and M. Diener-West, "The use of overall accuracy to evaluate the validity of screening or diagnostic tests," Journal of General Internal Medicine, vol. 19, no. 1, 2004.
[4]
I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An evaluation of Naive Bayesian anti-spam filtering," CoRR, vol. cs.CL/0006013, Informal Publication, 2000.
[5]
I. Androutsopoulos, E. F. Magirou, and D. K. Vassilakis, "A game theoretic model of spam e-mailing," in CEAS 2005 - The Second Conference on Email and Anti-Spam, 2005.
[6]
I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam E-mail: A comparison of a naive bayesian and a memory-based approach," in Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 200), pp. 1-13, 2000.
[7]
I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to filter unsolicited commercial E-Mail," Tech. Rep. 2004/2, NCSR "Demokritos", October 2004.
[8]
H. B. Aradhye, G. K.Myers, and J. A. Herson, "Image analysis for efficient categorization of image-based spam e-mail," in Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 2005.
[9]
F. Assis, "OSBF-Lua," http://osbf-lua.luaforge.net/.
[10]
F. Assis, "OSBF-Lua-A text classification module for Lua the importance of the training method," in Fifteenth Text REtrieval Conference (TREC-2006), Gaithersburg, MD: NIST, 2006.
[11]
A. Berg, "Creating an antispam cocktail: Best spam detection and filtering techniques," http://searchsecurity.techtarget.com/tip/1,289483,sid14_ gci1116643,00.html, 2005.
[12]
S. Bickel, M. Bruckner, and T. Scheffer, "Discriminative learning for differing training and test distributions," International Conference on Machine Learning (ICML), 2007.
[13]
S. Bickel and T. Scheffer, "Dirichlet-Enhanced spam filtering based on biased samples," Neural Information Processing Systems (NIPS), 2007.
[14]
B. Biggio, G. Fumera, I. Pillai, and F. Roli, "Image spam filtering by content obscuring detection," in CEAS 2007 - The Third Conference on Email and Anti-Spam, 2007.
[15]
Blacklists compared, http://www.sdsc.edu/jeff/spam/Blacklists_Compared. html.
[16]
A. Bratko, G. V. Cormack, B. Filipic, T. R. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, vol. 7, pp. 2673-2698, December 2006.
[17]
A. Bratko and B. Filipic, "Spam filtering using character-level markov models: Experiments for the TREC 2005 Spam Track," in Proceedings of 14th Text REtrieval Conference (TREC 2005), Gaithersburg, MD, November 2005.
[18]
A. Bratko and B. Filipiô, "Exploiting structural information for semistructured document categorization," Information Processing and Management , vol. 42, no. 3, pp. 679-694, 2006.
[19]
L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123- 140, 1996.
[20]
M. Bruckner, P. Haider, and T. Scheffer, "Highly scalable discriminative spam filtering," in Proceedings of 15th Text REtrieval Conference (TREC 2006), Gaithersburg, MD, November 2006.
[21]
B. Burton, SpamProbe - A Fast Bayesian Spam Filter. 2002. http:// spamprobe. sourceforge.net.
[22]
B. Byun, C.-H. Lee, S. Webb, and C. Pu, "A discriminative classifier learning approach to image modeling and spam image identification," in CEAS 2007 - The Third Conference on Email and Anti-Spam, 2007.
[23]
CAPTCHA: Telling humans and computers apart automatically, http://www.captcha.net/.
[24]
X. Carreras and L. Márquez, "Boosting trees for anti-spam email filtering," in Proceedings of RANLP-2001, 4th International Conference on Recent Advances in Natural Language Processing, 2001.
[25]
K. Chellapilla, K. Larson, K. Simard, and M. Czerwinski, "Designing human friendly human interactive proofs (HIPS)," in CHI '05: SIGCHI Conference on Human Factors in Computing Systems, pp. 711-720, 2005.
[26]
K. Chellapilla, K. Larson, P. Simard, and M. Czerwinski, "Computers beat humans at single character recognition in reading based human interaction proofs," in CEAS 2005 - The Second Conference on Email and Anti-Spam, 2005.
[27]
S. Chhabra, Fighting Spam, Phishing and Email Fraud. University of California, Riverside, 2005.
[28]
A. Ciltik and T. Gungor, "Time-efficient spam e-mail filtering using n-gram models," Pattern Recognition Letters, vol. 29, pp. 19-33, 2008.
[29]
J. G. Cleary and I. H. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, pp. 396-402, April 1984.
[30]
W. W. Cohen, "Fast effective rule induction," in Proceedings of the 12th International Conference on Machine Learning, (A. Prieditis and S. Russell, eds.), pp. 115-123, Tahoe City, CA: Morgan Kaufmann, July 9-12 1995.
[31]
G. Cormack, J. M. G. Hidalgo, and E. P. Sánz, "Feature engineering for mobile (SMS) spam filtering," in 30th ACM SIGIR Conference on Research and Development on Information Retrieval, Amsterdam, 2007.
[32]
G. V. Cormack, "Harnessing unlabeled examples through iterative application of Dynamic Markov Modeling," in Proceedings of the ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[33]
G. V. Cormack, "TREC 2006 Spam Track Overview," in Fifteenth Text REtrieval Conference (TREC-2006), Gaithersburg, MD: NIST, 2006.
[34]
G. V. Cormack, "TREC 2007 Spam Track Overview," in Sixteenth Text REtrieval Conference (TREC-2007), Gaithersburg, MD: NIST, 2007.
[35]
G. V. Cormack, "University of waterloo participation in the TREC 2007 spam track," in Sixteenth Text REtrieval Conference (TREC-2007), Gaithersburg, MD: NIST, 2007.
[36]
G. V. Cormack and A. Bratko, "Batch and on-line spam filter evaluation," in CEAS 2006: The Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[37]
G. V. Cormack, J. M. G. Hidalgo, and E. P. Sánz, "Spam filtering for short messages," in CIKM '07: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 313-320, USA, New York, NY: ACM Press, 2007.
[38]
G. V. Cormack and R. N. S. Horspool, "Data compression using dynamic Markov modelling," The Computer Journal, vol. 30, no. 6, pp. 541-550, 1987.
[39]
G. V. Cormack and T. R. Lynam, TREC Spam Filter Evaluation Toolkit. http://plg.uwaterloo.ca/~gvcormac/jig/.
[40]
G. V. Cormack and T. R. Lynam, "TREC 2005 Spam Track Overview," http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05, 2005.
[41]
G. V. Cormack and T. R. Lynam, "Statistical precision of information retrieval evaluation," in SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 533-540, USA, New York, NY: ACM Press, 2006.
[42]
G. V. Cormack and T. R. Lynam, "On-line supervised spam filter evaluation," ACM Transactions on Information Systems, vol. 25, no. 3, 2007.
[43]
Crocker, "Challenges in Anti-spam Efforts," The Internet Protocol Journal, vol. 8, no. 4, http://www.cisco.com/web/about/ac123/ac147/ archived issues/ipj_8-4/anti-spam_efforts.html, 2006.
[44]
N. N. Dalvi, P. M. Domingos, S. K. Sanghai, and D. Verma, "Adversarial classification," in KDD, (W. Klm, R. Kohavi, J. Gehrke, and W. DuMouchel, eds.), pp. 99-108, 2004.
[45]
R. Dantu and P. Kolan, "Detecting spam in VoIP networks," in SRUTI'05: Proceedings of the Steps to Reducing Unwanted Traffic on the Internet on Steps to Reducing Unwanted Traffic on the Internet Workshop, pp. 5-5, USA, Berkeley, CA: USENIX Association, 2005.
[46]
J. Deguerre, "The mechanics of Vipul's Razor," Network Security, pp. 15-17, September 2007.
[47]
S. J. Delany, P. Cunningham, and L. Coyle, "Case-based reasoning for spam filtering," Artificial Intelligence Review, vol. 24, no. 3-4, pp. 359-378, 2005.
[48]
T. Dietterich, Statistical Tests for Comparing Supervised Classification Learning Algorithms. Oregon State University, 1996.
[49]
V. Dimitrios and E. M. Ion Androutsopoulos, "A game-theoretic investigation of the effect of human interactive proofs on spam e-mail," in CEAS 2007 - The Fourth Conference on Email and Anti-Spam, 2007.
[50]
N. Dimmock and I. Maddison, "Peer-to-peer Collaborative Spam Detection," ACM Crossroads, vol. 11, no. 2, 2004.
[51]
P. Domingos and M. J. Pazzani, "On the optimality of the simple bayesian classifier under zero-one loss," Machine Learning, vol. 29, no. 2-3, pp. 103-130, 1997.
[52]
M. Dredze, R. Gevaryahu, and A. Elias-Bachrach, "Learning fast classifiers for image spam," in CEAS 2007 - The Third Conference on Email and Anti-Spam , 2007.
[53]
Drucker, D. Wu, and V. Vapnik, "Support vector machines for spam categorization," IEEE-NN, vol. 10, no. 5, pp. 1048-1054, 1999.
[54]
H. Drucker, D. Wu, and V. N. Vapnik, "Support vector machines for spam categorization," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048- 1054, 1999.
[55]
C. Dwork and M. Naor, "Pricing via processing or combatting junk mail," in CRYPTO '92, 1992.
[56]
ECML/PKDD Discovery Challenge, http://www.ecmlpkdd2006.org/ challenge.htm, 2006.
[57]
T. Fawcett, "'In vivo' spam filtering: A challenge problem for data mining," KDD Explorations, vol. 5, no. 2, December 2003.
[58]
T. Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers. HP Laboratories, 2004. http://home.comcast.net/~tom.fawcett/public_html/ papers/ROC101.pdf.
[59]
D. Ferris and R. Jennings, "Calculating the Cost of Spam for Your Organization," http://http://www.ferris.com/?p=310061, 2005.
[60]
D. Ferris, R. Jennings, and C. Williams, The Global Economic Impact of Spam. Ferris Research, http://www.ferris.com/?p=309942, 2005.
[61]
Final ultimate solution to the spam problem, http://craphound.com/ spamsolutions.txt.
[62]
G. Fumera, I. Pillai, and F. Roli, "Spam filtering based on the analysis of text information embedded into images," Journal of Machine Learning Research (special issue on Machine Learning in Computer Security), vol. 7, pp. 2699- 2720, 12/2006 2006.
[63]
W. Gansterer, A. Janecek, and R. Neumayer, "Spam filtering based on latent semantic indexing," in SIAM Conference on Data Mining, 2007.
[64]
K. R. Gee, "Using latent semantic indexing to filter spam," in SAC'03: Proceedings of the 2003 ACM symposium on Applied Computing, pp. 460-464, USA, New York, NY: ACM Press, 2003.
[65]
Z. Ghahramani, "Unsupervised learning," in Advanced Lectures in Machine Learning, pp. 72-112, Lecture Notes in Computer Science, vol. 3176, 2004.
[66]
A. S. Glas, J. G. Lijmer, M. H. Prins, G. J. Bonsel, and P. M. M. Bossuyt, "The diagnostic odds ratio: A single indicator of test performance," Journal of Clinical Epidemiology, vol. 56, no. 11, pp. 1129-1135, 2003.
[67]
J. Goodman, D. Heckerman, and R. Rounthwaite, "Stopping spam," Scientific American, vol. 292, pp. 42-88, April 2005.
[68]
J. Goodman and W.-T. Yih, "Online discriminative spam filter training," in The Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[69]
P. Graham, Better Bayesian Filtering. http://www.paulgraham.com/better. html, 2004.
[70]
J. Graham-Cumming, "How to beat an adaptive spam filter," in The Spam Conference, 2004.
[71]
J. Graham-Cumming, "People and spam," in The Spam Conference, 2005.
[72]
Graham-Cumming, "Does Bayesian poisining exist?," Virus Bulletin, February 2006.
[73]
J. Graham-Cumming, "SpamOrHam," Virus Bulletin, 2006-06-01.
[74]
J. Graham-Cumming, "The rise and fall of image-based spam," Virus Bulletin, 2006-11-01.
[75]
J. Graham-Cumming, "The spammer's compendium: Five yars on," Virus Bulletin, 2007-09-20.
[76]
J. Graham-Cumming, "Why I hate challenge-response," JGC's Anti-Spam Newsletter, February 28, 2005.
[77]
Greylisting: The next step in the spam control war, http://projects. puremagic.com/greylisting/, 2003.
[78]
B. Guenter, Spam Archive. http://www.untroubled.org/spam/.
[79]
K. Gupta, V. Chaudhary, N. Marwah, and C. Taneja, ECML-PKDD Discovery Challenge Entry. Inductis India Pvt Ltd, 2006.
[80]
K. Gupta, V. Chaudhary, N. Marwah, and C. Taneja, "Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data," in Proceedings of ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[81]
B. Hayes, "How many ways can you spell Viagra?," American Scientist, vol. 95, 2007.
[82]
J. M. G. Hidalgo, "Evaluating cost-sensitive unsolicited bulk email categorization," in Proceedings of SAC-02, 17th ACM Symposium on Applied Computing , pp. 615-620, Madrid, ES, 2002.
[83]
J. M. G. Hidalgo, "Evaluating cost-sensitive unsolicited bulk email categorization," in SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 615-620, Madrid: ACM Press, March 2002.
[84]
J. M. G. Hidalgo, G. C. Bringas, E. P. Sanz, and F. C. Garcia, "Content based SMS spam filtering," in DocEng '06: Proceedings of the 2006 ACM Symposium on Document Engineering, pp. 107-114, USA, New York, NY: ACM Press, 2006.
[85]
J. M. G. Hidalgo, M. M. López, and E. P. Sanz, "Combining text and heuristics for cost-sensitive spam filtering," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pp. 99-102, USA, Morristown, NJ: Association for Computational Linguistics, 2000.
[86]
S. Holden, "Spam Filtering II," Hakin9, 02/2004, pp. 68-77, 2004.
[87]
D. W. Hosmer and S. Lemeshow, Applied Logistic Regression. New York: Wiley, 2000.
[88]
J. Hovold, "Naive bayes spam filtering using word-position-based attributes," in Proceedings of the 2nd Conference on Email and Anti-Spam (CEAS 2005), Palo Alto, CA, July 2005.
[89]
M. Ilger, J. Strauss, W. Gansterer, and C. Proschinger, The Economy of Spam. Vol. FA384018-6, Instituted of Distributed and Multimedia Systems, Univeristy of Vienna, 2006.
[90]
T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in Proceedings of ICML-97, 14th International Conference on Machine Learning, (D. H. Fisher, ed.), pp. 143-151, US, Nashville, San Francisco: Morgan Kaufmann Publishers, 1997.
[91]
T. Joachims, Transductive Inference for Text Classification Using Support Vector Machines. 1999.
[92]
G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Eleventh Conference on Uncertainty in Artificial Integelligence, pp. 338-345, 1995.
[93]
Y. Junejo and A. Karim, "A two-pass statistical approach for automatic personalized spam filtering," in Proceedings of ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[94]
B. Klimt and Y. Yang, "Introducing the Enron corpus," in CEAS 2004 - The Conference on Email and Anti-Spam, 2004.
[95]
R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in IJCAI, pp. 1137-1145, 1995.
[96]
A. Kolcz and J. Alspector, "SVM-based filtering of E-mail spam with content-specific misclassification costs," TextDM 2001 (IEEE ICDM-2001 Workshop on Text Mining), 2001.
[97]
A. Kolcz and A. Chowdhury, "Hardening fingerprints by context," in CEAS 2007 - The Third Conference on Email and Anti-Spam, 2007.
[98]
A. Kolcz and A. Chowdhury, "Lexicon randomization for near-duplicate detection with I-match," Journal of Supercomputing, vol. DOI 10.1007/s11227-007- 0171-z, 2007.
[99]
A. Kolcz, A. Chowdhury, and J. Alspector, "The impact of feature selection on signature-driven spam detection," in CEAS 2004 - The Conference on Email and Anti-Spam, 2004.
[100]
P. Komarek and A. Moore, "Fast robust logistic regression for large sparse datasets with binary outputs," in Artificial Intelligence and Statistics, 2003.
[101]
A. Kornblum, "Searching for John Doe: Finding spammers and phishers," in CEAS 2004 - The Conference on Email and Anti-Spam, 2004.
[102]
S. Kotsiantis, "Supervised learning: A review of classification techniques," Informatica, vol. 31, pp. 249-268, 2007.
[103]
B. Krebs, "In the fight agains spam E-mail, Goliath wins again," Washington Post, May 17 2006.
[104]
H. Lee and A. Y. Ng, "Spam deobfuscation using a hidden Markov model," in CEAS 2005 - The Second Conference on Email and Anti-Spam, 2005.
[105]
S. Lee, I. Jeong, and S. Choi, "Dyamically weighted hidden Markov model for spam deobfuscation," in IJCAI 07, pp. 2523-2529, 2007.
[106]
J. R. Levine, "Experiences with greylisting," in CEAS 2005: Second Conference on Email and Anti-Spam, 2005.
[107]
D. D. Lewis and J. Catlett, "Heterogeneous uncertainty sampling for supervised learning," in Proceedings of ICML-94, 11th International Conference on Machine Learning, (W. W. Cohen and H. Hirsh, eds.), pp. 148-156, US, New Brunswick, San Francisco: Morgan Kaufmann Publishers, 1994.
[108]
D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, "Training algorithms for linear text classifiers," in Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, (H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, eds.), (Zürich, CH), pp. 298-306, New York, US: ACM Press, 1996.
[109]
K. Li, C. Pu, and M. Ahamad, "Resisting SPAM delivery by TCP damping," in CEAS 2004 - The Conference on Email and Anti-Spam, 2004.
[110]
B. Lieba and J. Fenton, "DomainKeys identified email (DKIM): Using digital signatures for domain verification," in CEAS 2007: The Third Conference on Email and Anti-Spam, 2007.
[111]
B. Lieba, J. Ossher, V. T. Rajan, R. Segal, and M. Wegman, "SMTP path analysis," in 2nd Conference on Email and Anti-spam, 2005.
[112]
Ling-Spam, PU and Enron Corpora, http://www.iit.demokritos.gr/skel/i-config/downloads/.
[113]
T. Lynam and G. Cormack, TREC Spam Filter Evaluation Took Kit. http://plg.uwaterloo.ca/~trlynam/spamjig.
[114]
T. R. Lynam and G. V. Cormack, "On-line spam filter fusion," in 29th ACM SIGIR Conference on Research and Development on Information Retrieval, Seattle, 2006.
[115]
J. Lyon and M. Wong, Sender-ID: Authenticating E-mail RFC 4406. Internet Engineering Task Force, 2006.
[116]
Mail abuse prevention system, http://www.mail-abuse.com/, 2005.
[117]
M. Mangalindan, "For bulk E-mailer, pestering millions offers path to profit," Wall Street Journal, November 13, 2002.
[118]
A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proceedings of Eurospeech '97, pp. 1895-1898, Rhodes, Greece, 1997.
[119]
R. McMillan, "US Court threatens Spamhaus with shut down," Info World, October 09 2006.
[120]
B. Medlock, "An adaptive, semi-structured language model approach to spam filtering on a new corpus," in Proceeding of CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[121]
V. Metsis, I. Androutsopoulos, and G. Paliouras, "Naive Bayes - Which Naive Bayes?," in Proceedings of CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[122]
E. Michelakis, I. Androutsopoulos, G. Paliouras, G. Sakkis, and P. Stamatopoulos, "Filtron: A learning-based anti-spam filter," in Proceedings of the 1st Conference on Email and Anti-Spam (CEAS 2004), Mountain View, CA, July 2004.
[123]
G. Mishne and D. Carmel, Blocking Blog Spam with Language Model Disagreement . 2005.
[124]
E. Moustakas, C. Ranganathan, and P. Duquenoy, "Chunk-Kwei: A pattern-discovery-based System for the automatic identificaton of unsolicited email messages (spam)," in CEAS 2004 - The Conference on Email and Anti-Spam , 2004.
[125]
J. Niu, J. Xu, J. Yao, J. Zheng, and Q. Sun, "WIM at TREC 2007," in Sixteenth Text REtrieval Conference (TREC-2007), Gaithersburg, MD: NIST, 2007.
[126]
C. S. Oliveira, F. G. Cozman, and I. Cohen, "Splitting the unsupervised and supervised components of semi-supervised learning," in ICML 2005 LPCTD Workshop, 2005.
[127]
R. M. Pampapathi, B. Mirkin, and M. Levene, "A suffix tree approach to email filtering," Tech. Rep., Birkbeck University of London, 2005.
[128]
C. Perlich, F. Provost, and J. S. Simonoff, "Tree induction vs. logistic regression: A learning-curve analysis," Journal of Machanic Learning and Research, vol. 4, pp. 211-255, 2003.
[129]
B. Pfahringer, "A semi-supervised spam mail detector," in Proceedings of ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[130]
Project Honeypot, http://www.projecthoneypot.org/.
[131]
C. Pu and S. Webb, "Observed trends in spam construction techniques," in Proceedings of CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[132]
J. R. Quinlan, C4.5: Programs for Machine Learning. USA, San Francisco, CA: Morgan Kaufmann Publishers Inc., 1993.
[133]
A. Ramachandran, D. Dagon, and N. Feamster, "Can DNS-based blacklists keep up with bots?," in CEAS 2006 - The Second Conference on Email and Anti-Spam, 2006.
[134]
E. S. Raymond, D. Relson, M. Andree, and G. Louis, "BogoFilter," http://bogofilter.sourceforge.net/, 2004.
[135]
F. R. Rideau, Stamps vs. Spam: Postage as a Method to Eliminate Unsolicited Email. http://fare.tunes.org/articles/stamps vs spam.html, 2002.
[136]
G. Robinson, "A statistical approach to the spam problem," Linux Journal, vol. 107, no. 3, March 2003.
[137]
R. Roman, J. Zhou, and J. Lopez, "An anti-spam scheme using prechallenges," Computer Communications, vol. 29, no. 15, pp. 2739-2749, 2006.
[138]
C. Rossow, "Anti-Spam measures of European ISPs/ESPs: A survey based analysis of state-of-the-art technologies, current spam trends and recommendations for future-oriented anti-spam concepts," Institute for Internet Security , August 2007.
[139]
K. J. Rothman and S. Greenland, Modern epidemiology. Lippinscott Williams and Wilkins, 1998.
[140]
R. Rowland, Spam, Spam, Spam: The Cyberspace Wars. CBC, http://www.cbc.ca/news/background/spam/, 2004.
[141]
M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," in Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, AAAI Technical Report WS-98-05, 1998.
[142]
G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, and P. Stamatopoulos, "Stacking classifiers for anti-spam filtering of e-mail," in Empirical Methods in Natural Language Processing (EMNLP 2001), pp. 44-50, 2001.
[143]
G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, and P. Stamatopoulos, "A memory-based approach to anti-spam filtering for mailing lists," Information Retrieval, vol. 6, no. 1, pp. 49-73, 2003.
[144]
M. Sasaki and H. Shinnou, "Spam detection using text clustering," in CW '05: Proceedings of the 2005 International Conference on Cyberworlds, pp. 316-319, USA, Washington, DC: IEEE Computer Society, 2005.
[145]
R. E. Schapire and Y. Singer, "Improved boosting using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[146]
K. M. Schneider, "A comparison of event models for naive bayes anti-spam email filtering," in Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003.
[147]
D. Sculley, "Online active learning methods for fast label-efficient spam filtering," in Proceeding of the CEAS 2007- Fourth Conference on Email and Anti-Spam, Mountain View, CA, 2007.
[148]
D. Sculley and C. E. Brodley, "Compression and machine learning: A new perspective on feature space vectors," in Data Compression Conference (DCC 06), pp. 332-341, Snowbird, 2006.
[149]
D. Sculley and G. V. Cormack, Filtering Spam in the Presence of Noisy User Feedback. Tufts University, 2008.
[150]
D. Sculley and G. M. Wachman, "Relaxed online support vector machines for spam filtering," in 30th ACM SIGIR Conference on Research and Development on Information Retrieval, Amsterdam, 2007.
[151]
D. Sculley and G. M. Wachman, "Relaxed online SVMs in the TREC Spam filtering track," in Sixteenth Text REtrieval Conference (TREC-2007), Gaithersburg, MD: NIST, 2007.
[152]
D. Sculley, G. M. Wachman, and C. E. Brodley, "Spam classification with on-line linear classifiers and inexact string matching features," in Proceedings of the 15th Text REtrieval Conference (TREC 2006), Gaithersburg, MD, November 2006.
[153]
F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[154]
R. Segal, J. Crawford, J. Kephart, and B. Leiba, "SpamGuru: An enterprise anti-spam filtering system," in First Conference on Email and Anti-Spam (CEAS), 2004.
[155]
R. Segal, T. Markowitz, and W. Arnold, "Fast uncertainty sampling for labeling large e-mail corpora," in Proceedings of the CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[156]
Shakhnarovish, Darrell, and Indyk, Nearest-Neighbor Methods in Learning and Vision, (Shakhnarovish, ed.), MIT Press, 2005.
[157]
V. Sharma and A. O'Donnell, "Fighting spam with reputation systems," ACM Queue, November 2005.
[158]
C. Siefkes, F. Assis, S. Chhabra, and W. S. Yerazunis, "Combining winnow and orthogonal sparse bigrams for incremental spam filtering," in PKDD '04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 410-421, USA, New York, NY: Springer-Verlag, New York, Inc., 2004.
[159]
P. Y. Simard, R. Szeliski, J. Benaloh, J. Couvreur, and I. Calinov, "Using character recognition and segmentation to tell computer from humans," in ICDAR '03: Seventh International Conference on Document Analysis and Recognition, 2003.
[160]
J. Snyder, "Spam in the wild, the sequel," Network World 12/20/04, 2004.
[161]
E. Solan and E. Reshef, "The effects of anti-spam methods on spam mail," in CEAS 2006 - The Third Conference on Email and Anti-Spam, 2006.
[162]
Spam testing methodology, http://www.opus1.com/www/whitepapers/ spamtestmethodology.pdf, 2007.
[163]
spamassassin.org, The Spamassassin Public Mail Corpus. http:// spamassassin. apache.org/publiccorpus, 2003.
[164]
spamassassin.org, Welcome to SpamAssassin. http://spamassassin.apache. org, 2005.
[165]
Spambase, http://mlearn.ics.uci.edu/databases/spambase/.
[166]
J. A. Swets, "Effectiveness of information retrieval systems," American Documentation , vol. 20, pp. 72-89, 1969.
[167]
T. Takemura and H. Ebara, "Spam mail reduces economic effects," in Second International Conference on the Digital Society, pp. 20-24, 2008.
[168]
W. Tau Yih, R. McCann, and A. Kolcz, "Improving spam filtering by detecting gray mail," in Proceedings of CEAS 2007 - Fourth Conference on Email and Anti-Spam, Mountain View, CA, 2007.
[169]
D. M. J. Tax and C. Veenman, "Tuning the hyperparameter of an AUC-optimized classifier," in Seventeenth Belgium-Netherlands Conference on Artificial Intelligence, pp. 224-231, 2005.
[170]
The CEAS 2007 Live Spam Challenge, http://www.ceas.cc/2007/ challenge/challenge.html, 2007.
[171]
The penny black project, http://research.microsoft.com/research/sv/Penny Black/.
[172]
TREC 2005 Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus, 2005.
[173]
TREC 2006 Spam Corpora, http://plg.uwaterloo.ca/~gvcormac/treccorpus, 2006.
[174]
TREC 2007 Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus, 2007.
[175]
K. Tretyakov, "Machine learning techniques in spam filtering," Tech. Rep., Institute of Computer Science, University of Tartu, 2004.
[176]
N. Trogkanis and G. Paliouras, "Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data," in Proceedings of ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[177]
H. Tschabitscher, What you Need to Know about Challenge-Response Spam Filters. http://email.about.com/cs/spamgeneral/a/challenge resp.htm.
[178]
D. Turner, M. Fossi, E. Johnson, T. Mack, J. Blackbird, S. Entwisle, M. K. Low, D. McKinney, and C. Wueest, Symantec Global Internet Security Threat Report: Trends for July-December 07. Symantec, http://eval.symantec.com/mktginfo/enterprise/white papers/b-whitepaper_ internet_security_threat_report_xiii_04-2008.en-us.pdf, 2007.
[179]
A. Tuttle, E. Milios, and N. Kalyaniwalla, "An evaluation of machine learning techniques for enterprise spam filters," Technical Report CS-2004-03, Halifax, NS: Dalhousie University, 2004.
[180]
C. J. Van Rijsbergen, Information Retrieval. Department of Computer Science, University of Glasgow, Second ed., 1979.
[181]
Veritest Anti-Spam Benchmark Service Autumn 2005 Report http:// www.tumbleweed.com/pdfs/VeriTest_Anti-Spam_Report_Vol4_all_c.pdf, 2005.
[182]
E. Voorhees, Fourteenth Text REtrieval Conference (TREC-2005). Gaithersburg, MD: NIST, 2005.
[183]
E. Voorhees, Fifteenth Text REtrieval Conference (TREC-2005). Gaithersburg, MD: NIST, 2006.
[184]
E. Voorhees, Sixteenth Text REtrieval Conference (TREC-2005). Gaithersburg, MD: NIST, 2007.
[185]
E. M. Voorhees and D. K. Harman, eds., TREC - Experiment and Evaluation in Information Retrieval. Boston: MIT Press, 2005.
[186]
Z. Wang, W. Josephson, Q. LV, M. Charikar, and K. Li, "Filtering image spam with near-duplicate detection," in CEAS 2007 - The Third Conference on Email and Anti-Spam, 2007.
[187]
Web Spam Challenge, 2008.
[188]
S. Webb, J. Caverloo, and C. Pu, "Introducing the webb spam corpus: Using email spam to identify web spam automatically," in Proceedings of CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, CA, 2006.
[189]
West Coast Labs, http://www.westcoastlabs.com.
[190]
F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Transactions on Information Theory , vol. 41, no. 3, pp. 653-664, 1995.
[191]
G. L. Wittel and S. F. Wu, "On attacking statistical spam filters," in CEAS 2004 - The Conference on Email and Anti-Spam, 2004.
[192]
I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham, Weka: Practical Machine Learning Tools and Techniques with Java Implementations. 1999.
[193]
M. Wong and W. Schlitt, Sender Policy Framework (SPF) for Authorizing Use of Domains in E-mail. Vol. RFC 4408, 2006.
[194]
W. Yerazunis, Correspondence with Paul Graham. http://www.paulgraham.com/wsy.html, 16 October 2002.
[195]
W. S. Yerazunis, CRM114 - the Controllable Regex Mutilator. http://crm114.sourceforge.net/, 2004.
[196]
W. S. Yerazunis, "The spam-filtering accuracy plateau at 99.9% accuracy and how to get past it," in 2004 MIT Spam Conference, January 2004.
[197]
W. S. Yerazunis, "Seven hypothesis about spam filtering," in Proceedings 15th Text REtrieval Conference (TREC 2006), NIST, Gaithersburg, MD, November 2006.
[198]
W. S. Yerazunis, "Seven Hypothesis about Spam," in Sixteenth Text REtrieval Conference (TREC-2007), Gaithersburg, MD: NIST, 2007.
[199]
W. Yih, J. Goodman, and G. Hulten, "Learning at low false positive rates," in Proceedings of the 3rd Conference on Email and Anti-Spam, 2006.
[200]
X. Yue, A. Abraham, Z.-X. Chi, Y.-Y. Hao, and H. Mo, "Artificial immune system inspired behavior-based anti-spam filter," Soft Computing, vol. 11, pp. 729-740, 2007.
[201]
L. Zhang, J. Zhu, and T. Yao, "An evaluation of statistical spam filtering techniques," ACM Transactions on Asian Language Information Processing (TALIP), vol. 3, no. 4, pp. 243-269, 2004.
[202]
W. Zhao and Z. Zhang, "An email classification model based on rough set theory," in Active Media Technology (AMT 2005), 2005.
[203]
D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in 18th Annual Conference on Neural Information Processing Systems, 2003.
[204]
X. Zhu, Semi-supervised Learning Literature Survey. Vol. TR 1530, University of Wisconsin, 2007.
[205]
A. Zien, "Semi-supervised support vector machines and application to spam filtering," in Oral Presentation, ECML/PKDD Discovery Challenge Workshop, Berlin, 2006.
[206]
A. Zinman and J. Donath, "Is Britney Spears spam?," in Proceedings of CEAS 2007 - Fourth Conference on Email and Anti-Spam, Mountain View, CA, 2007.

Published In

Foundations and Trends in Information Retrieval  Volume 1, Issue 4
April 2007
123 pages
ISSN:1554-0669
EISSN:1554-0677

Publisher

Now Publishers Inc.

Hanover, MA, United States
