Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1135777.1135794acmconferencesArticle/Chapter ViewAbstractPublication PageswebconfConference Proceedingsconference-collections
Article

Detecting spam web pages through content analysis

Published: 23 May 2006 Publication History
  • Get Citation Alerts
  • Abstract

    In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).

    References

    [1]
    S. Adali, T. Liu and M. Magdon-Ismail. Optimal Link Bombs are Uncoordinated. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [2]
    E. Amitay, D. Carmel, A. Darlow, R. Lempel and A. Soffer. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In 14th ACM Conference on Hypertext and Hypermedia, Aug. 2003.
    [3]
    R. Baeza-Yates, C. Castillo and V. López. PageRank Increase under Different Collusion Topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [4]
    A. Benczúr, K. Csalogány, T. Sarlós and M. Uher. SpamRank -- Fully Automatic Link Spam Detection. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [5]
    L. Breiman. Bagging Predictors. In Machine Learning, Vol. 24, No. 2, pages 123--140, 1996.
    [6]
    U.S. Census Bureau. Quarterly Retail E-Commerce Sales -- 4th Quarter 2004. http://www.census.gov/mrts/www/data/html/04Q4.html (dated Feb. 2005, visited Sept. 2005)
    [7]
    B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.
    [8]
    D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In 7th International Workshop on the Web and Databases, June 2004.
    [9]
    D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 2005.
    [10]
    Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, 1995.
    [11]
    Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004.
    [12]
    Z. Gyöngyi and H. Garcia-Molina. Link Spam Alliances. In 31st International Conference on Very Large Data Bases, Aug. 2005.
    [13]
    Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [14]
    GZIP. http://www.gzip.org/
    [15]
    M. Henzinger, R. Motwani and C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002.
    [16]
    J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. In 2002 ACM Symposium on Applied Computing, Mar. 2002.
    [17]
    B. Jansen and A. Spink. An Analysis of Web Documents Retrieved and Viewed. In International Conference on Internet Computing, June 2003.
    [18]
    C. Johnson. US eCommerce: 2005 To 2010. http://www.forrester.com/Research/Document/Excerpt/0,7211,37626,00.html (dated Sept. 2005, visited Sept. 2005)
    [19]
    C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999, Cambridge, Massachusetts.
    [20]
    P. Metaxas and J. DeStefano. Web Spam, Propaganda and Trust. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [21]
    G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [22]
    MSN Search. http://search.msn.com/
    [23]
    J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://useit.com/about/searchreferrals.html (dated April 2004, visited Sept. 2005)
    [24]
    L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.
    [25]
    A. Perkins. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification/ (dated Sept. 2001, visited Sept. 2005)
    [26]
    J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufman, 1993.
    [27]
    J. R. Quinlan. Bagging, Boosting, and C4.5. In 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Vol. 1, 725--730, Aug. 1996.
    [28]
    M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.
    [29]
    B. Wu and B. Davison. Identifying Link Farm Spam Pages. In 14th International World Wide Web Conference, May 2005.
    [30]
    B. Wu and B. Davison. Cloaking and Redirection: a preliminary study. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
    [31]
    H. Zhang, A. Goel, R. Govindan, K. Mason and B. Van Roy. Making Eigenvector-Based Systems Robust to Collusion. In 3rd International Workshop on Algorithms and Models for the Web Graph, Oct. 2004.

    Cited By

    View all
    • (2023)From Attachments to SEO: Click Here to Learn More about Clickbait PDFs!Proceedings of the 39th Annual Computer Security Applications Conference10.1145/3627106.3627172(14-28)Online publication date: 4-Dec-2023
    • (2023)Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document ManipulationsProceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3578337.3605124(205-214)Online publication date: 9-Aug-2023
    • (2023)PRADA: Practical Black-box Adversarial Attacks against Neural Ranking ModelsACM Transactions on Information Systems10.1145/357692341:4(1-27)Online publication date: 8-Apr-2023
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    WWW '06: Proceedings of the 15th international conference on World Wide Web
    May 2006
    1102 pages
    ISBN:1595933239
    DOI:10.1145/1135777
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 May 2006

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. data mining
    2. web characterization
    3. web pages
    4. web spam

    Qualifiers

    • Article

    Conference

    WWW06
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)58
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 10 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)From Attachments to SEO: Click Here to Learn More about Clickbait PDFs!Proceedings of the 39th Annual Computer Security Applications Conference10.1145/3627106.3627172(14-28)Online publication date: 4-Dec-2023
    • (2023)Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document ManipulationsProceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3578337.3605124(205-214)Online publication date: 9-Aug-2023
    • (2023)PRADA: Practical Black-box Adversarial Attacks against Neural Ranking ModelsACM Transactions on Information Systems10.1145/357692341:4(1-27)Online publication date: 8-Apr-2023
    • (2023)Detecting Product Review Spammers Using Principles of Big DataIEEE Transactions on Engineering Management10.1109/TEM.2021.309780570:7(2516-2527)Online publication date: Jul-2023
    • (2023)NLP-Driven Strategies for Effective Email Spam Detection: A Performance Evaluation2023 International Conference on Sustainable Communication Networks and Application (ICSCNA)10.1109/ICSCNA58489.2023.10370223(275-279)Online publication date: 15-Nov-2023
    • (2023)Measurement of Illegal Android Gambling App Ecosystem From Joint Promotion Perspective2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA)10.1109/DSAA60987.2023.10302499(1-11)Online publication date: 9-Oct-2023
    • (2023)CLEFT: Contextualised Unified Learning of User Engagement in Video Lectures With FeedbackIEEE Access10.1109/ACCESS.2023.324598211(17707-17720)Online publication date: 2023
    • (2023)Fake Reviews Detection Using Multi-input Neural Network ModelProceedings of International Conference on Recent Trends in Computing10.1007/978-981-19-8825-7_35(405-416)Online publication date: 21-Mar-2023
    • (2023)Detecting Fake Reviews Using Multiple Machine Learning Models: A Comparative StudyComputer Vision and Robotics10.1007/978-981-19-7892-0_37(467-476)Online publication date: 28-Apr-2023
    • (2022)Semorph: A Morphology Semantic Enhanced Pre-trained Model for Chinese Spam Text DetectionProceedings of the 31st ACM International Conference on Information & Knowledge Management10.1145/3511808.3557448(1003-1013)Online publication date: 17-Oct-2022
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media