DOI: 10.1007/978-3-030-45442-5_2
Article

The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Published: 14 April 2020

Abstract

Current best practices for the evaluation of search engines do not take duplicate documents into account. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalties for having filtered duplicates in these tracks were losses of between 8 and 53 ranks.
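
The inflation effect described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the procedure used in the paper: the document IDs, the duplicate classes, and the helper `dedup_precision_at_k` are hypothetical, and the rule of counting only the first document of each content-equivalence class is an assumption chosen for illustration.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Plain precision@k: every relevant document in the top k counts."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def dedup_precision_at_k(
    ranking: List[str], relevant: Set[str], dup_class: Dict[str, str], k: int
) -> float:
    """Precision@k where only the first document of each duplicate class counts.

    Assumption for this sketch: a later copy of already-shown content is
    treated as non-relevant. Documents without an entry in dup_class form
    their own class.
    """
    seen: Set[str] = set()
    hits = 0
    for doc in ranking[:k]:
        cls = dup_class.get(doc, doc)
        if cls in seen:
            continue  # repeat of content already ranked higher
        seen.add(cls)
        if doc in relevant:
            hits += 1
    return hits / k


# Hypothetical judgments: d1 and d1_copy are content-equivalent.
relevant = {"d1", "d1_copy", "d2", "d3"}
dup_class = {"d1": "c1", "d1_copy": "c1"}

# One run returns the duplicate, the other filters it out.
run_with_dups = ["d1", "d1_copy", "d2", "x1"]
run_filtered = ["d1", "d2", "x1", "x2"]

for name, run in [("with duplicates", run_with_dups), ("filtered", run_filtered)]:
    plain = precision_at_k(run, relevant, 4)
    dedup = dedup_precision_at_k(run, relevant, dup_class, 4)
    print(f"{name}: plain P@4 = {plain:.2f}, duplicate-aware P@4 = {dedup:.2f}")
```

In this toy setting the run that returns the duplicate scores higher under plain precision, while both runs score the same once repeated content is discounted. This mirrors, in miniature, the penalty on systems that filter duplicates.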

References

[1]
Allan, J., Harman, D., Kanoulas, E., Li, D., Gysel, C.V., Voorhees, E.M.: TREC 2017 common core track overview. In: Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, 15–17 November 2017 (2017)
[2]
Allan, J., Harman, D., Kanoulas, E., Voorhees, E.M.: TREC 2018 common core track overview. In: Notebooks of The Twenty-Seventh Text REtrieval Conference (TREC 2018), Gaithersburg, Maryland, USA, 14–16 November 2018 (2018)
[3]
Bernstein, Y., Zobel, J.: A scalable system for identifying co-derivative documents. In: Apostolico, A., Melucci, M. (eds.) String Processing and Information Retrieval, pp. 55–67. Springer, Heidelberg (2004)
[4]
Bernstein, Y., Zobel, J.: Redundant documents and search effectiveness. In: Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management, Bremen, Germany, 31 October – 5 November 2005, pp. 736–743 (2005)
[5]
Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings Compression and Complexity of SEQUENCES 1997, Positano, Amalfitan Coast, Salerno, Italy, 11–13 June 1997, pp. 21–29 (1997)
[6]
Büttcher, S., Clarke, C.L.A., Soboroff, I.: The TREC 2006 terabyte track. In: Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), Gaithersburg, Maryland, USA, 14–17 November 2006 (2006)
[7]
Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track. In: Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland, USA, 16–19 November 2004 (2004)
[8]
Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2009 web track. In: Proceedings of The Eighteenth Text REtrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, 17–20 November 2009 (2009)
[9]
Clarke, C.L.A., Craswell, N., Soboroff, I., Cormack, G.V.: Overview of the TREC 2010 web track. In: Proceedings of The Nineteenth Text REtrieval Conference (TREC 2010), Gaithersburg, Maryland, USA, 16–19 November 2010 (2010)
[10]
Clarke, C.L.A., Craswell, N., Soboroff, I., Voorhees, E.M.: Overview of the TREC 2011 web track. In: Proceedings of The Twentieth Text REtrieval Conference (TREC 2011), Gaithersburg, Maryland, USA, 15–18 November 2011 (2011)
[11]
Clarke, C.L.A., Craswell, N., Voorhees, E.M.: Overview of the TREC 2012 web track. In: Proceedings of The Twenty-First Text REtrieval Conference (TREC 2012), Gaithersburg, Maryland, USA, 6–9 November 2012 (2012)
[12]
Clarke, C.L.A., Scholer, F., Soboroff, I.: The TREC 2005 terabyte track. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), Gaithersburg, Maryland, USA, 15–18 November 2005 (2005)
[13]
Collins-Thompson, K., Bennett, P.N., Diaz, F., Clarke, C., Voorhees, E.M.: TREC 2013 web track overview. In: Proceedings of The Twenty-Second Text REtrieval Conference (TREC 2013), Gaithersburg, Maryland, USA, 19–22 November 2013 (2013)
[14]
Collins-Thompson, K., Macdonald, C., Bennett, P.N., Diaz, F., Voorhees, E.M.: TREC 2014 web track overview. In: Proceedings of The Twenty-Third Text REtrieval Conference (TREC 2014), Gaithersburg, Maryland, USA, 19–21 November 2014 (2014)
[15]
Fetterly, D., Manasse, M.S., Najork, M.: On the evolution of clusters of near-duplicate web pages. In: 1st Latin American Web Congress (LA-WEB2003), Empowering Our Web, Santiago, Chile, 10–12 November 2003, pp. 37–45 (2003)
[16]
Fuhr, N.: Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum 51(3), 32–41 (2017)
[17]
Yang, P., Fang, H., Lin, J.: Anserini: enabling the use of Lucene for information retrieval research. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 7–11 August 2017, pp. 1253–1256 (2017)

Published In

Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II
Apr 2020
708 pages
ISBN: 978-3-030-45441-8
DOI: 10.1007/978-3-030-45442-5

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

  • Article

Cited By

  • (2023) Chuweb21D: A Deduped English Document Collection for Web Search Tasks. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp. 63–72. DOI: 10.1145/3624918.3625317. Online publication date: 26-Nov-2023
  • (2023) The Infinite Index: Information Retrieval on Generative Text-To-Image Models. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pp. 172–186. DOI: 10.1145/3576840.3578327. Online publication date: 19-Mar-2023
  • (2023) Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. Advances in Information Retrieval, pp. 313–329. DOI: 10.1007/978-3-031-28244-7_20. Online publication date: 2-Apr-2023
  • (2022) Fair ranking: a critical review, challenges, and future directions. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1929–1942. DOI: 10.1145/3531146.3533238. Online publication date: 21-Jun-2022
  • (2022) Overview of Touché 2022: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 311–336. DOI: 10.1007/978-3-031-13643-6_21. Online publication date: 5-Sep-2022
  • (2022) Overview of Touché 2022: Argument Retrieval. Advances in Information Retrieval, pp. 339–346. DOI: 10.1007/978-3-030-99739-7_43. Online publication date: 10-Apr-2022
  • (2021) Overview of Touché 2021: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 450–467. DOI: 10.1007/978-3-030-85251-1_28. Online publication date: 21-Sep-2021
  • (2020) Sampling Bias Due to Near-Duplicates in Learning to Rank. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1997–2000. DOI: 10.1145/3397271.3401212. Online publication date: 25-Jul-2020