
Detecting splogs via temporal dynamics using self-similarity analysis

Published: 03 March 2008

Abstract

This article addresses the problem of spam blog (splog) detection using the temporal and structural regularity of content, post time, and links. Splogs are undesirable blogs meant to attract search engine traffic and are used solely to promote affiliate sites. Blogs are a popular online medium, and splogs not only degrade the quality of search engine results but also waste network resources. Splog detection is difficult because stable content descriptors are lacking.
We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent blog temporal dynamics using self-similarity matrices, defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate temporal changes in the post sequence. (b) We study blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture splog temporal characteristics. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, and it achieves 90% accuracy.
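To make idea (a) concrete, the following is a minimal sketch of a self-similarity matrix built with histogram intersection. It is an illustration under stated assumptions, not the article's implementation: each post is reduced to a bag of tokens for a single attribute (here, content words), histograms are normalized to sum to 1, and the function names and toy posts are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def normalized_histogram(items: Sequence[str]) -> Dict[str, float]:
    """Turn a bag of tokens into a histogram whose bin masses sum to 1."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}


def histogram_intersection(h1: Dict[str, float], h2: Dict[str, float]) -> float:
    """Sum, over all bins, of the smaller of the two bin masses (range 0 to 1)."""
    return sum(min(mass, h2.get(key, 0.0)) for key, mass in h1.items())


def self_similarity_matrix(posts: List[Sequence[str]]) -> List[List[float]]:
    """Entry [i][j] is the similarity between post i and post j in sequence order."""
    hists = [normalized_histogram(post) for post in posts]
    n = len(hists)
    return [
        [histogram_intersection(hists[i], hists[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: a repetitive post sequence and a varied one.
spammy_posts = [
    ["buy", "cheap", "pills"],
    ["buy", "cheap", "pills"],
    ["cheap", "pills", "buy"],
]
varied_posts = [
    ["hiking", "trail", "photos"],
    ["recipe", "for", "soup"],
    ["trail", "race", "report"],
]

spam_matrix = self_similarity_matrix(spammy_posts)
varied_matrix = self_similarity_matrix(varied_posts)

for row in spam_matrix:
    print([round(value, 2) for value in row])
for row in varied_matrix:
    print([round(value, 2) for value in row])
```

In this toy setting the repetitive sequence yields a matrix that is uniformly high off the diagonal, while the varied sequence yields mostly low values. The same construction could be applied to post-time or link histograms, assuming each attribute is first binned into tokens.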

Published In

ACM Transactions on the Web  Volume 2, Issue 1
February 2008
280 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1326561
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 March 2008
Accepted: 01 October 2007
Revised: 01 September 2007
Received: 01 April 2007
Published in TWEB Volume 2, Issue 1

Author Tags

  1. Blogs
  2. regularity
  3. self-similarity
  4. spam
  5. splog detection
  6. temporal dynamics
  7. topology

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2021) RoleSim*: Scaling axiomatic role-based similarity ranking on large graphs. World Wide Web 25:2 (785-829). DOI: 10.1007/s11280-021-00925-z. Online publication date: 11-Aug-2021
  • (2020) An Axiomatic Role Similarity Measure Based on Graph Topology. Software Foundations for Data Interoperability and Large Scale Graph Data Analytics (33-48). DOI: 10.1007/978-3-030-61133-0_3. Online publication date: 6-Nov-2020
  • (2019) Efficient Pairwise Penetrating-rank Similarity Retrieval. ACM Transactions on the Web 13:4 (1-52). DOI: 10.1145/3368616. Online publication date: 18-Dec-2019
  • (2018) Network Traffic Detection Based on Histogram and Self-similarity Matrix. Proceedings of the 1st International Conference on Information Science and Systems (207-209). DOI: 10.1145/3209914.3236336. Online publication date: 27-Apr-2018
  • (2017) Automatic Detection of Cyberbullying to Make Internet a Safer Environment. Violence and Society (31-45). DOI: 10.4018/978-1-5225-0988-2.ch002. Online publication date: 2017
  • (2017) Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection. Electronic Commerce Research 17:1 (51-81). DOI: 10.1007/s10660-016-9244-5. Online publication date: 1-Mar-2017
  • (2016) FIN10K. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (2441-2444). DOI: 10.1145/2983323.2983328. Online publication date: 24-Oct-2016
  • (2015) Automatic Detection of Cyberbullying to Make Internet a Safer Environment. Handbook of Research on Digital Crime, Cyberspace Security, and Information Assurance (277-290). DOI: 10.4018/978-1-4666-6324-4.ch018. Online publication date: 2015
  • (2013) Self-Similarity Parameter Estimation for K-Dimensional Processes. International Journal of Computer Theory and Engineering (302-306). DOI: 10.7763/IJCTE.2013.V5.698. Online publication date: 2013
  • (2012) Text mining and probabilistic language modeling for online review spam detection. ACM Transactions on Management Information Systems 2:4 (1-30). DOI: 10.1145/2070710.2070716. Online publication date: 5-Jan-2012
