Abstract
Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to machine learning where induction is delayed to run time. This means that the case base can be updated continuously and new training data is immediately available to the induction process. In this paper we present a detailed description of such a system called ECUE and evaluate design decisions concerning the case representation. We compare its performance with an alternative system that uses Naïve Bayes. We find that there is little to choose between the two alternatives in cross-validation tests on data sets. However, ECUE does appear to have some advantages in tracking concept drift over time.
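The lazy-learning idea described above can be illustrated with a minimal sketch. It is not the ECUE implementation: the class name `CaseBase`, the whitespace tokenizer, the Jaccard similarity, and the majority vote over `k` neighbours are all assumptions chosen for brevity. The only point it demonstrates is that a newly added case influences the very next classification without any retraining step.

```python
from collections import Counter


def tokenize(text):
    """Assumed case representation: the set of lower-cased words in the message."""
    return frozenset(text.lower().split())


class CaseBase:
    """Hypothetical lazy classifier: cases are stored, all induction happens at query time."""

    def __init__(self, k=3):
        self.k = k
        self.cases = []  # list of (feature_set, label) pairs

    def add_case(self, text, label):
        # No model is rebuilt here; the case is simply stored.
        self.cases.append((tokenize(text), label))

    def classify(self, text):
        if not self.cases:
            raise ValueError("case base is empty")
        query = tokenize(text)

        def jaccard(features):
            union = query | features
            return len(query & features) / len(union) if union else 0.0

        # Retrieve the k most similar stored cases, then take a majority vote.
        nearest = sorted(self.cases, key=lambda c: jaccard(c[0]), reverse=True)[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Illustrative messages (invented for this sketch, not taken from the paper's data).
cb = CaseBase(k=1)
cb.add_case("cheap pills buy now", "spam")
cb.add_case("free lunch today in the canteen", "ham")

message = "win a free cruise today"
before = cb.classify(message)  # nearest case is the canteen message: "ham"

# A new kind of spam is reported and added to the case base.
cb.add_case("win a free cruise", "spam")
after = cb.classify(message)   # the new case is now the nearest: "spam"
```

An eager learner such as Naïve Bayes would typically need its model updated or rebuilt before the new example could affect predictions; here the update is a single append.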
★ This research was supported by funding from Enterprise Ireland under grant no. CFTD/03/219 and funding from Science Foundation Ireland under grant no. SFI-02IN.1I111
Delany, S.J., Cunningham, P. & Coyle, L. An Assessment of Case-Based Reasoning for Spam Filtering. Artif Intell Rev 24, 359–378 (2005). https://doi.org/10.1007/s10462-005-9006-6
DOI: https://doi.org/10.1007/s10462-005-9006-6