Article

Detecting spam web pages through content analysis

Authors:

Alexandros Ntoulas,

Dennis Fetterly

WWW '06: Proceedings of the 15th international conference on World Wide Web

Pages 83 - 92

https://doi.org/10.1145/1135777.1135794

Published: 23 May 2006

Abstract

In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence the results from search engines and to drive traffic to certain pages for fun or profit. This paper considers some previously undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
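The percentages quoted in the abstract follow directly from the page counts it gives. The short Python sketch below recomputes them as a check. The variable names are illustrative, and the final split of the 526 misidentified pages into missed spam and wrongly flagged non-spam is an assumption about how that figure is composed, not something the abstract states.

```python
# Counts quoted in the abstract (judged collection).
TOTAL_PAGES = 17_168      # all judged pages
SPAM_PAGES = 2_364        # pages judged to be spam
SPAM_DETECTED = 2_037     # spam pages the combined heuristics identified
MISIDENTIFIED = 526       # pages given the wrong label


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


spam_share = pct(SPAM_PAGES, TOTAL_PAGES)        # share of spam in the collection
detection_rate = pct(SPAM_DETECTED, SPAM_PAGES)  # fraction of spam correctly identified
error_rate = pct(MISIDENTIFIED, TOTAL_PAGES)     # overall misidentification rate

# Assumption (not stated in the abstract): the 526 errors consist of
# missed spam pages plus non-spam pages flagged as spam.
missed_spam = SPAM_PAGES - SPAM_DETECTED
flagged_non_spam = MISIDENTIFIED - missed_spam

print(f"spam share:       {spam_share}%")
print(f"detection rate:   {detection_rate}%")
print(f"error rate:       {error_rate}%")
print(f"missed spam:      {missed_spam}")
print(f"flagged non-spam: {flagged_non_spam} (under the stated assumption)")
```

The three rounded percentages match the figures in the abstract, so the quoted counts are internally consistent.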


Published In

WWW '06: Proceedings of the 15th international conference on World Wide Web

May 2006

1102 pages

ISBN: 1595933239

DOI: 10.1145/1135777

General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)

Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW06: The 15th International World Wide Web Conference 2006

May 23 - 26, 2006

Edinburgh, Scotland

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 375
Total Downloads: 3,742
Downloads (last 12 months): 58
Downloads (last 6 weeks): 1

Reflects downloads up to 10 Aug 2024

Citations

Cited By

Stivala G, Abdelnabi S, Mengascini A, Graziano M, Fritz M, Pellegrino G (2023). From Attachments to SEO: Click Here to Learn More about Clickbait PDFs! Proceedings of the 39th Annual Computer Security Applications Conference, pages 14-28. Online publication date: 4-Dec-2023.
https://dl.acm.org/doi/10.1145/3627106.3627172

Vasilisky Z, Kurland O, Tennenholtz M, Raiber F, Yoshioka M, Kiseleva J, Aliannejadi M (2023). Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document Manipulations. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 205-214. Online publication date: 9-Aug-2023.
https://dl.acm.org/doi/10.1145/3578337.3605124

Wu C, Zhang R, Guo J, De Rijke M, Fan Y, Cheng X (2023). PRADA: Practical Black-box Adversarial Attacks against Neural Ranking Models. ACM Transactions on Information Systems 41:4, pages 1-27. Online publication date: 8-Apr-2023.
https://dl.acm.org/doi/10.1145/3576923

Rout J, Dalmia A, Rath S, Mohanta B, Ramasubbareddy S, Gandomi A (2023). Detecting Product Review Spammers Using Principles of Big Data. IEEE Transactions on Engineering Management 70:7, pages 2516-2527. Online publication date: Jul-2023.
https://doi.org/10.1109/TEM.2021.3097805

Dodda R, Maddhi S, Thuraab M, Reddy A, Chandra A (2023). NLP-Driven Strategies for Effective Email Spam Detection: A Performance Evaluation. 2023 International Conference on Sustainable Communication Networks and Application (ICSCNA), pages 275-279. Online publication date: 15-Nov-2023.
https://doi.org/10.1109/ICSCNA58489.2023.10370223

Han Y, Wang S, Li Y, Cao X, Huang L, Chen Z (2023). Measurement of Illegal Android Gambling App Ecosystem From Joint Promotion Perspective. 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), pages 1-11. Online publication date: 9-Oct-2023.
https://doi.org/10.1109/DSAA60987.2023.10302499

Roy S, Gaur V, Raza H, Jameel S (2023). CLEFT: Contextualised Unified Learning of User Engagement in Video Lectures With Feedback. IEEE Access 11, pages 17707-17720. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3245982

Singh A, Kumar S (2023). Fake Reviews Detection Using Multi-input Neural Network Model. Proceedings of International Conference on Recent Trends in Computing, pages 405-416. Online publication date: 21-Mar-2023.
https://doi.org/10.1007/978-981-19-8825-7_35

Singh A, Kumar S (2023). Detecting Fake Reviews Using Multiple Machine Learning Models: A Comparative Study. Computer Vision and Robotics, pages 467-476. Online publication date: 28-Apr-2023.
https://doi.org/10.1007/978-981-19-7892-0_37

Lai K, Long Y, Wu B, Li Y, Wang B, Al Hasan M, Xiong L (2022). Semorph: A Morphology Semantic Enhanced Pre-trained Model for Chinese Spam Text Detection. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1003-1013. Online publication date: 17-Oct-2022.
https://dl.acm.org/doi/10.1145/3511808.3557448
