Detecting spam web pages through content analysis

A Ntoulas, M Najork, M Manasse… - Proceedings of the 15th …, 2006 - dl.acm.org
Proceedings of the 15th international conference on World Wide Web, 2006
In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence the results from search engines and drive traffic to certain pages for fun or profit. This paper considers some previously undescribed techniques for automatically detecting spam pages and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8% of the collection) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
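
The percentages quoted in the abstract can be checked from its raw counts. The short sketch below only redoes that arithmetic; the variable and function names are illustrative and do not come from the paper, and it assumes the 3.1% figure is taken relative to the full judged collection.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts quoted in the abstract (names are hypothetical labels).
JUDGED_PAGES = 17_168   # size of the judged collection
SPAM_PAGES = 2_364      # pages judged to be spam
SPAM_DETECTED = 2_037   # spam pages correctly identified
MISIDENTIFIED = 526     # spam and non-spam pages classified incorrectly

spam_share = pct(SPAM_PAGES, JUDGED_PAGES)       # share of the collection that is spam
detection_rate = pct(SPAM_DETECTED, SPAM_PAGES)  # share of spam pages that were caught
error_rate = pct(MISIDENTIFIED, JUDGED_PAGES)    # assumed: relative to all judged pages
missed_spam = SPAM_PAGES - SPAM_DETECTED         # spam pages not identified

print(f"spam share of collection: {spam_share}%")
print(f"spam detection rate:      {detection_rate}%")
print(f"misidentification rate:   {error_rate}%")
print(f"spam pages missed:        {missed_spam}")
```

Under that assumption the three computed rates match the 13.8%, 86.2%, and 3.1% reported in the abstract.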