Abstract
Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayes, SVM, and k-NN, but only a few works use a strategy based on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails that we collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
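To make the idea of URL-based near-duplicate detection concrete, the following is a minimal sketch and not the scheme evaluated in the paper, whose details the abstract does not give. It assumes that two mails are compared by the set of host names appearing in their URLs, using Jaccard similarity with an arbitrary threshold of 0.5. The function names, the regular expression, and the threshold are all hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: a simple pattern for http/https URLs in plain-text mail bodies.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_hosts(mail_body: str) -> frozenset:
    """Return the set of lower-cased host names of the URLs in a mail body."""
    hosts = set()
    for url in URL_RE.findall(mail_body):
        try:
            host = urlsplit(url).hostname  # already lower-cased by urlsplit
        except ValueError:
            continue  # skip malformed URLs rather than fail
        if host:
            hosts.add(host.rstrip("."))
    return frozenset(hosts)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(body_a: str, body_b: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: the URL host sets overlap by at least threshold."""
    return jaccard(url_hosts(body_a), url_hosts(body_b)) >= threshold


if __name__ == "__main__":
    spam_1 = "Buy now http://Cheap-Pills.example.com/offer?id=123 today"
    spam_2 = "Bu.y n0w http://cheap-pills.example.com/offer?id=987 !!!"
    ham = "Meeting notes are at https://intranet.example.org/notes"
    print(is_near_duplicate(spam_1, spam_2))
    print(is_near_duplicate(spam_1, ham))
```

Comparing host names rather than full URLs is one possible design choice: it ignores per-recipient tracking parameters that spammers often vary between copies, at the cost of treating all mails that link to the same host as similar.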
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yeh, CC., Lin, CH. (2006). Near-Duplicate Mail Detection Based on URL Information for Spam Filtering. In: Chong, I., Kawahara, K. (eds) Information Networking. Advances in Data Communications and Wireless Networks. ICOIN 2006. Lecture Notes in Computer Science, vol 3961. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11919568_84
DOI: https://doi.org/10.1007/11919568_84
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48563-6
Online ISBN: 978-3-540-48564-3
eBook Packages: Computer Science, Computer Science (R0)