Efficient and effective spam filtering and re-ranking for large web datasets

Published: 01 October 2011

Abstract

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of these two tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in fixed-cutoff precision (estP10) as well as in rank measures (estR-Precision, StatMAP, MAP) for nearly all submitted runs. Moreover, with a set of “honeypot” queries, the labeling of training data can be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering: they rise from among the worst to among the best.
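The abstract does not spell out the classifier's design, so the sketch below is a rough, hypothetical illustration of the kind of lightweight content-based approach it describes: an online logistic-regression model over hashed byte 4-grams, whose “spamminess” scores are then used to filter a ranked run at a fixed threshold. The feature hashing scheme, learning rate, training texts, and cutoff are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a content-based spam classifier: online logistic
# regression over hashed byte 4-grams. All parameters are illustrative.
import math

NUM_FEATURES = 1 << 20   # hashed feature space (assumption)
LEARNING_RATE = 0.002    # gradient step size (assumption)

weights = [0.0] * NUM_FEATURES

def features(text: bytes) -> set:
    """Hash each overlapping byte 4-gram into a sparse feature set."""
    return {hash(text[i:i + 4]) % NUM_FEATURES for i in range(len(text) - 3)}

def score(text: bytes) -> float:
    """Spamminess: model probability that the page is spam."""
    s = sum(weights[f] for f in features(text))
    return 1.0 / (1.0 + math.exp(-s))

def train(text: bytes, is_spam: bool) -> None:
    """One online gradient step toward the label."""
    g = (1.0 if is_spam else 0.0) - score(text)
    for f in features(text):
        weights[f] += LEARNING_RATE * g

# Minimal training on a handful of labeled pages (toy examples).
train(b"cheap pills buy now free offer click here", True)
train(b"cheap pills buy now free offer click here", True)
train(b"the trec 2009 web track used the clueweb09 dataset", False)

def filter_run(ranked_docs, threshold=0.5):
    """Filter a ranked run: drop documents whose spamminess exceeds a cutoff,
    preserving the relative order of the survivors."""
    return [(docid, text) for docid, text in ranked_docs if score(text) < threshold]
```

In this spirit, filtering a submitted run only requires one score lookup per document, which is what makes scoring a billion-page collection on a single machine plausible.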


Index Terms

  1. Efficient and effective spam filtering and re-ranking for large web datasets
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Information & Contributors

          Information

          Published In

Information Retrieval, Volume 14, Issue 5 (October 2011), 105 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 October 2011
          Accepted: 05 January 2011
          Received: 28 April 2010

          Author Tags

          1. Web search
          2. Spam
          3. Web spam
          4. Evaluation
          5. TREC

          Qualifiers

          • Research-article

          Cited By

• (2024) Neural Passage Quality Estimation for Static Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 174–185. doi:10.1145/3626772.3657765. Online publication date: 10-Jul-2024.
• (2024) A set of novel HTML document quality features for Web information retrieval. Expert Systems with Applications: An International Journal, 246:C. doi:10.1016/j.eswa.2024.123177. Online publication date: 15-Jul-2024.
• (2024) Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. Advances in Information Retrieval, pp. 286–302. doi:10.1007/978-3-031-56060-6_19. Online publication date: 24-Mar-2024.
• (2023) Entity-Based Relevance Feedback for Document Retrieval. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 177–187. doi:10.1145/3578337.3605128. Online publication date: 9-Aug-2023.
• (2023) Overview of Touché 2023: Argument and Causal Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 507–530. doi:10.1007/978-3-031-42448-9_31. Online publication date: 18-Sep-2023.
• (2022) Local or Global? A Comparative Study on Applications of Embedding Models for Information Retrieval. Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), pp. 115–119. doi:10.1145/3493700.3493701. Online publication date: 8-Jan-2022.
• (2022) Competitive Search. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2838–2849. doi:10.1145/3477495.3532771. Online publication date: 6-Jul-2022.
• (2022) From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2137–2141. doi:10.1145/3477495.3531819. Online publication date: 6-Jul-2022.
• (2022) Overview of Touché 2022: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 311–336. doi:10.1007/978-3-031-13643-6_21. Online publication date: 5-Sep-2022.
• (2021) An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets. Journal of Information Science, 47(1):41–57. doi:10.1177/0165551519866551. Online publication date: 25-Feb-2021.
