
Using Collection Shards to Study Retrieval Performance Effect Sizes

Published: 19 March 2019

Abstract

While a large body of research studies how to compare the performance of IR systems more accurately, less attention is devoted to understanding the different factors that play a role in such performance and how they interact. This is the case for shards, i.e., partitions of a document collection into sub-parts, which are used for many different purposes, ranging from efficiency to selective search to making test-collection evaluation more accurate. In all these cases, there is empirical knowledge supporting the importance of shards, but we lack actual models that allow us to measure the impact of shards on system performance and how they interact with topics and systems. We use the general linear mixed model framework and present a model that encompasses the experimental factors of system, topic, and shard, and their interaction effects. This detailed model allows us to estimate differences between the effects of the various factors more accurately. We study shards created by a range of methods used in prior work, explain observations noted in that work in a principled setting, and offer new insights. Notably, we find that the topic*shard interaction effect is a large effect almost universally across all datasets, an observation that, to our knowledge, has not been measured before.
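
As a rough illustration of the modeling approach the abstract describes, the sketch below (not the authors' code; the file name and column names are hypothetical) fits a crossed-factor ANOVA with system, topic, and shard effects plus their two-way interactions using statsmodels, and derives omega-squared effect sizes for each factor.

```python
# Minimal sketch, not the authors' implementation: a crossed-factor model with
# system, topic, and shard main effects plus their two-way interactions, and
# omega-squared effect sizes per factor. "runs.csv" and its columns
# (system, topic, shard, ap) are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One effectiveness score (e.g., AP) per (system, topic, shard) combination.
df = pd.read_csv("runs.csv")

model = ols(
    "ap ~ C(system) + C(topic) + C(shard)"
    " + C(system):C(topic) + C(system):C(shard) + C(topic):C(shard)",
    data=df,
).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Omega squared: (SS_effect - df_effect * MS_error) / (SS_total + MS_error).
ms_error = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]
ss_total = anova["sum_sq"].sum()
anova["omega_sq"] = (anova["sum_sq"] - anova["df"] * ms_error) / (ss_total + ms_error)
print(anova[["df", "F", "PR(>F)", "omega_sq"]])
```

With a single observation per (system, topic, shard) cell, the residual term absorbs the three-way interaction, which is why only the two-way interactions appear in the formula.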

      Published In

      ACM Transactions on Information Systems, Volume 37, Issue 3
      July 2019, 335 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      Issue DOI: 10.1145/3320115

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 March 2019
      Accepted: 01 January 2019
      Revised: 01 November 2018
      Received: 01 July 2018
      Published in TOIS Volume 37, Issue 3

      Author Tags

      1. ANOVA
      2. GLMM
      3. Shard effect
      4. effectiveness model

