Abstract
To address the growing problem of junk E-mail on the Internet, this paper proposes an effective method for classifying and cleansing E-mail. E-mail messages can be modelled as semi-structured documents consisting of a set of fields with pre-defined semantics and a number of variable-length free-text fields. The proposed method handles both the fields with pre-defined semantics and the variable-length free-text fields in order to obtain higher accuracy. The main contributions of this work are two-fold. First, we present a new model based on the Neural Network (NN) for classifying personal E-mails. In particular, we treat E-mail files as a particular kind of plain-text file, which implies that our feature set is relatively large (there are thousands of different terms across different E-mail files). Second, we propose the use of Principal Component Analysis (PCA) as a preprocessor for the NN to reduce the data in both size and dimensionality, so that the input data become more classifiable and the training process of the NN model converges faster. The results of our performance evaluation demonstrate that the proposed algorithm is effective in performing filtering with reasonable accuracy.
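The pipeline summarised in the abstract (term features, then PCA, then a neural network) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the network architecture, the term weighting, or the number of principal components, so the single tanh hidden layer, raw term counts, `k=2`, the toy messages, and all function names below are assumptions made for illustration only. The sketch also covers only the free-text terms, not the fields with pre-defined semantics.

```python
import math
import random

def term_counts(messages, vocabulary):
    """Bag-of-words counts: one row per message, one column per term."""
    return [[m.lower().split().count(t) for t in vocabulary] for m in messages]

def pca_fit(rows, k, iters=200):
    """Top-k principal directions by power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    rng = random.Random(0)
    components = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        # Deflate: remove the direction just found before the next search.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        components.append(v)
    return mean, components

def pca_project(row, mean, components):
    """Map a term-count vector to its k principal-component scores."""
    return [sum((row[j] - mean[j]) * comp[j] for j in range(len(row)))
            for comp in components]

def forward(x, W1, W2):
    """Tanh hidden layer, sigmoid output; last weight of each row is a bias."""
    h = [math.tanh(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) for w in W1]
    z = sum(wi * hi for wi, hi in zip(W2, h)) + W2[-1]
    z = max(-60.0, min(60.0, z))  # keep exp() in a safe range
    return h, 1.0 / (1.0 + math.exp(-z))

def train(X, y, hidden=3, epochs=500, lr=0.2, seed=1):
    """Per-sample gradient descent on the cross-entropy loss."""
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, out = forward(x, W1, W2)
            delta = out - t  # loss gradient at the output pre-activation
            for i in range(hidden):
                grad_h = delta * W2[i] * (1.0 - h[i] ** 2)
                W2[i] -= lr * delta * h[i]
                for j in range(d):
                    W1[i][j] -= lr * grad_h * x[j]
                W1[i][d] -= lr * grad_h
            W2[-1] -= lr * delta
    return W1, W2

# Toy data (hypothetical): 1 = junk, 0 = legitimate.
messages = [
    "free money now", "free prize now", "free money offer",
    "project meeting monday", "project report monday", "meeting report monday",
]
labels = [1, 1, 1, 0, 0, 0]

vocabulary = sorted({t for m in messages for t in m.split()})
counts = term_counts(messages, vocabulary)
mean, components = pca_fit(counts, k=2)   # 9 term features reduced to 2
X = [pca_project(r, mean, components) for r in counts]
W1, W2 = train(X, labels)
probs = [forward(x, W1, W2)[1] for x in X]
print([round(p, 2) for p in probs])
```

A new message would be scored by applying the same `term_counts` and `pca_project` steps before calling `forward`. The point of the sketch is the ordering: the network is trained on a handful of principal-component scores rather than on the full term vector, which is the size and dimensionality reduction the abstract describes.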
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cui, B., Mondal, A., Shen, J., Cong, G., Tan, KL. (2005). On Effective E-mail Classification via Neural Networks. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_9
DOI: https://doi.org/10.1007/11546924_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28566-3
Online ISBN: 978-3-540-31729-6
eBook Packages: Computer Science, Computer Science (R0)