Efficient and effective spam filtering and re-ranking for large web datasets

Published: 01 October 2011

Abstract

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of these two tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in fixed-cutoff precision (estP10) as well as in rank measures (estR-Precision, StatMAP, MAP) for nearly all submitted runs. Moreover, with a set of “honeypot” queries, the labeling of training data can be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering: they rise from among the worst to among the best.
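The abstract does not spell out the classifier's design, so the sketch below is a rough, hypothetical illustration of the kind of lightweight content-based approach it describes: an online logistic-regression model over hashed byte 4-grams, whose “spamminess” scores are then used to filter a ranked run at a fixed threshold. The feature hashing scheme, learning rate, training texts, and cutoff are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a content-based spam classifier: online logistic
# regression over hashed byte 4-grams. All parameters are illustrative.
import math

NUM_FEATURES = 1 << 20   # hashed feature space (assumption)
LEARNING_RATE = 0.002    # gradient step size (assumption)

weights = [0.0] * NUM_FEATURES

def features(text: bytes) -> set:
    """Hash each overlapping byte 4-gram into a sparse feature set."""
    return {hash(text[i:i + 4]) % NUM_FEATURES for i in range(len(text) - 3)}

def score(text: bytes) -> float:
    """Spamminess: model probability that the page is spam."""
    s = sum(weights[f] for f in features(text))
    return 1.0 / (1.0 + math.exp(-s))

def train(text: bytes, is_spam: bool) -> None:
    """One online gradient step toward the label."""
    g = (1.0 if is_spam else 0.0) - score(text)
    for f in features(text):
        weights[f] += LEARNING_RATE * g

# Minimal training on a handful of labeled pages (toy examples).
train(b"cheap pills buy now free offer click here", True)
train(b"cheap pills buy now free offer click here", True)
train(b"the trec 2009 web track used the clueweb09 dataset", False)

def filter_run(ranked_docs, threshold=0.5):
    """Filter a ranked run: drop documents whose spamminess exceeds a cutoff,
    preserving the relative order of the survivors."""
    return [(docid, text) for docid, text in ranked_docs if score(text) < threshold]
```

In this spirit, filtering a submitted run only requires one score lookup per document, which is what makes scoring a billion-page collection on a single machine plausible.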


Index Terms

  1. Efficient and effective spam filtering and re-ranking for large web datasets
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Information & Contributors

          Information

          Published In

Information Retrieval, Volume 14, Issue 5 (October 2011), 105 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 October 2011
          Accepted: 05 January 2011
          Received: 28 April 2010

          Author Tags

          1. Web search
          2. Spam
          3. Web spam
          4. Evaluation
          5. TREC

          Qualifiers

          • Research-article

          Cited By

• (2024) Neural Passage Quality Estimation for Static Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 174–185. doi:10.1145/3626772.3657765. Online publication date: 10-Jul-2024.
• (2024) A set of novel HTML document quality features for Web information retrieval. Expert Systems with Applications: An International Journal, 246:C. doi:10.1016/j.eswa.2024.123177. Online publication date: 15-Jul-2024.
• (2024) Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. Advances in Information Retrieval, pp. 286–302. doi:10.1007/978-3-031-56060-6_19. Online publication date: 24-Mar-2024.
• (2023) Entity-Based Relevance Feedback for Document Retrieval. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 177–187. doi:10.1145/3578337.3605128. Online publication date: 9-Aug-2023.
• (2023) Overview of Touché 2023: Argument and Causal Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 507–530. doi:10.1007/978-3-031-42448-9_31. Online publication date: 18-Sep-2023.
• (2022) Local or Global? A Comparative Study on Applications of Embedding Models for Information Retrieval. Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), pp. 115–119. doi:10.1145/3493700.3493701. Online publication date: 8-Jan-2022.
• (2022) Competitive Search. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2838–2849. doi:10.1145/3477495.3532771. Online publication date: 6-Jul-2022.
• (2022) From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2137–2141. doi:10.1145/3477495.3531819. Online publication date: 6-Jul-2022.
• (2022) Overview of Touché 2022: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 311–336. doi:10.1007/978-3-031-13643-6_21. Online publication date: 5-Sep-2022.
• (2021) An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets. Journal of Information Science, 47(1):41–57. doi:10.1177/0165551519866551. Online publication date: 25-Feb-2021.
