Abstract
Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.
A. Sinha and P.R. Mall—These authors contributed equally to this work.
References
Ahmad, W.U., Chang, K.W., Wang, H.: Context attentive document ranking and query suggestion. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 385–394. SIGIR’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3331184.3331246
Abolghasemi, A., Verberne, S., Askari, A., Azzopardi, L.: Retrievability bias estimation using synthetically generated queries. In: Proceedings of the First Workshop on Generative Information Retrieval - GenIR@SIGIR2023 held in conjunction with SIGIR 2023. GenIR@SIGIR2023 (2023). https://coda.io/@sigir/gen-ir/accepted-papers-17
Anderson, N.: The ethics of using aol search data. https://arstechnica.com/uncategorized/2006/08/7578/
Atkinson, A.B.: On the measurement of inequality. J. Econom. Theory 2(3), 244–263 (1970). https://doi.org/10.1016/0022-0531(70)90039-6, https://www.sciencedirect.com/science/article/pii/0022053170900396
Azzopardi, L., Bache, R.: On the relationship between effectiveness and accessibility. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 889–890 (2010)
Azzopardi, L., Owens, C.: Search engine predilection towards news media providers. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 774–775 (2009)
Azzopardi, L., Vinay, V.: Accessibility in information retrieval. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 482–489. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_46
Azzopardi, L., Vinay, V.: Retrievability: an evaluation measure for higher order information access tasks. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 561–570. CIKM ’08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1458082.1458157
Bache, R., Azzopardi, L.: Improving access to large patent corpora. Trans. Large Scale Data Knowl. Centered Syst. 2, 103–121 (2010). https://doi.org/10.1007/978-3-642-16175-9_4
Barbaro, M., Zeller Jr., T.: A face is exposed for aol searcher no. 4417749. https://www.nytimes.com/2006/08/09/technology/09aol.html
Bashir, S.: Improving retrievability with improved cluster-based pseudo-relevance feedback selection. Expert Syst. Appl. 39(8), 7495–7502 (2012). https://doi.org/10.1016/j.eswa.2012.01.041
Bashir, S.: Estimating retrievability ranks of documents using document features. Neurocomputing 123, 216–232 (2014)
Bashir, S., Khattak, A.S.: Producing efficient retrievability ranks of documents using normalized retrievability scoring function. J. Intell. Inform. Syst. 42, 457–484 (2014). https://doi.org/10.1007/s10844-013-0274-3
Bashir, S., Rauber, A.: Analyzing document retrievability in patent retrieval settings. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2009. LNCS, vol. 5690, pp. 753–760. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03573-9_63
Bashir, S., Rauber, A.: Identification of low/high retrievable patents using content-based features. In: Proceedings of the 2nd International Workshop on Patent Information Retrieval, pp. 9–16 (2009)
Bashir, S., Rauber, A.: Improving retrievability of patents with cluster-based pseudo-relevance feedback documents selection. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1863–1866 (2009)
Bashir, S., Rauber, A.: Improving retrievability and recall by automatic corpus partitioning. In: Hameurlain, A., Küng, J., Wagner, R., Bach Pedersen, T., Tjoa, A.M. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems II. LNCS, vol. 6380, pp. 122–140. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16175-9_5
Bashir, S., Rauber, A.: Improving retrievability of patents in prior-art search. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 457–470. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_40
Bashir, S., Rauber, A.: On the relationship between query characteristics and ir functions retrieval bias. J. Am. Soc. Inform. Sci. Technol. 62(8), 1515–1532 (2011)
Bashir, S., Rauber, A.: Automatic ranking of retrieval models using retrievability measure. Knowl. Inf. Syst. 41, 189–221 (2014)
Bashir, S., Rauber, A.: Retrieval models versus retrievability. Current Challenges in Patent Information Retrieval, pp. 185–212 (2017)
Bashir, S., Rauber, A.: Retrieval models versus retrievability. In: Current Challenges in Patent Information Retrieval. TIRS, vol. 37, pp. 185–212. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-53817-3_7
Boratto, L., Faralli, S., Marras, M., Stilo, G. (eds.): Advances in Bias and Fairness in Information Retrieval. Springer Nature Switzerland (2023). https://doi.org/10.1007/978-3-031-37249-0
Callan, J., Connell, M.: Query-based sampling of text databases. ACM Trans. Inform. Syst. (TOIS) 19(2), 97–130 (2001)
Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 65–74. SIGIR ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3077136.3080832
Ekstrand, M.D., Das, A., Burke, R., Diaz, F.: Fairness in information access systems. Foundations and Trends® in Information Retrieval 16(1–2), 1–177 (2022). https://doi.org/10.1561/1500000079
Gini, C.: On the measure of concentration with special reference to income and statistics. Colorado College Publication, General Series 208(1), 73–79 (1936)
Hafner, K.: Tempting data, privacy concerns; researchers yearn to use aol logs, but they hesitate. https://www.nytimes.com/2006/08/23/technology/23search.html
Hawking, D.: Overview of the TREC-9 web track. In: Voorhees, E.M., Harman, D.K. (eds.) Proceedings of The Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13–16, 2000. NIST Special Publication, vol. 500–249. National Institute of Standards and Technology (NIST) (2000). http://trec.nist.gov/pubs/trec9/papers/web9.pdf
Johnston, J.: H. Theil. economics and information theory. Econom. J. 79(315), 601–602 (1969). https://doi.org/10.2307/2230396
Jordan, C., Watters, C., Gao, Q.: Using controlled query generation to evaluate blind relevance feedback algorithms. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 286–295 (2006)
Justeson, J.S., Katz, S.M.: Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics 17(1), 1–20 (1991). https://aclanthology.org/J91-1001
Kang, Y.M., Liu, W., Zhou, Y.: Queryblazer: efficient query autocompletion framework. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 1020–1028. WSDM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3437963.3441725
Ma, Z., Dou, Z., Bian, G., Wen, J.R.: Pstie: time information enhanced personalized search. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1075–1084. CIKM ’20, Association for Computing Machinery (2020). https://doi.org/10.1145/3340531.3411877
MacAvaney, S., Macdonald, C., Ounis, I.: Reproducing personalised session search over the aol query log. In: Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg, K., Setty, V. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 627–640. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_42
McLellan, C.: The relationship between retrievability bias and retrieval performance. Ph.D. thesis, University of Glasgow, UK (2019). https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.775857
Nogueira, R., Lin, J.: From doc2query to doctttttquery. In: Online preprint 6 (2019). https://github.com/castorini/docTTTTTquery
Noor, S., Bashir, S.: Evaluating bias in retrieval systems for recall oriented documents retrieval. Int. Arab J. Inform. Technol. (IAJIT) 12(1) (2015)
Palma, J.G.: Homogeneous middles vs. heterogeneous tails, and the end of the ‘inverted-u’: the share of the rich is what it’s all about. Cambridge working papers in economics, Faculty of Economics, University of Cambridge (2011). https://EconPapers.repec.org/RePEc:cam:camdae:1111
Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems, pp. 1-es. InfoScale ’06, Association for Computing Machinery (2006). https://doi.org/10.1145/1146847.1146848
Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1049–1058 (2010)
Roy, D., Carevic, Z., Mayr, P.: Studying retrievability of publications and datasets in an integrated retrieval system. In: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. JCDL ’22, Association for Computing Machinery (2022). https://doi.org/10.1145/3529372.3530931
Roy, D., Carevic, Z., Mayr, P.: Retrievability in an integrated retrieval system: an extended study. Int. J. Digital Libr. (Apr 2023). https://doi.org/10.1007/s00799-023-00363-4
Traub, M.C., Samar, T., van Ossenbruggen, J., Hardman, L.: Impact of crowdsourcing ocr improvements on retrievability bias. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 29–36. JCDL ’18, Association for Computing Machinery (2018). https://doi.org/10.1145/3197026.3197046
Traub, M.C., Samar, T., Van Ossenbruggen, J., He, J., de Vries, A., Hardman, L.: Querylog-based assessment of retrievability bias in a large newspaper corpus. In: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 7–16. IEEE (2016)
Voorhees, E.M.: Overview of the TREC 2004 robust track. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16–19, 2004. NIST Special Publication, vol. 500–261. National Institute of Standards and Technology (NIST) (2004). http://trec.nist.gov/pubs/trec13/papers/ROBUST.OVERVIEW.pdf
Wilkie, C., Azzopardi, L.: An initial investigation on the relationship between usage and findability. In: Serdyukov, P., et al. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 808–811. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36973-5_90
Wilkie, C., Azzopardi, L.: Relating retrievability, performance and length. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 937–940 (2013)
Wilkie, C., Azzopardi, L.: Best and fairest: an empirical analysis of retrieval system bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 13–25. Springer International Publishing, Cham (2014)
Wilkie, C., Azzopardi, L.: Efficiently estimating retrievability bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 720–726. Springer International Publishing, Cham (2014)
Wilkie, C., Azzopardi, L.: A retrievability analysis: Exploring the relationship between retrieval bias and retrieval performance. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 81–90 (2014)
Wilkie, C., Azzopardi, L.: Query length, retrievability bias and performance. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1787–1790. CIKM ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2806416.2806604
Wilkie, C., Azzopardi, L.: Retrievability and retrieval bias: a comparison of inequality measures. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 209–214. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16354-3_22
Zheng, L., Cox, I.J.: Document-oriented pruning of the inverted index in information retrieval systems. In: 2009 International Conference on Advanced Information Networking and Applications Workshops, pp. 697–702. IEEE (2009)
Zhu, Y., et al.: Contrastive learning of user behavior sequence for context-aware document ranking. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2780–2791. CIKM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3459637.3482243
A Appendix - A POS-Based Query Generation Technique
Identifying collocations in a text corpus typically involves counting co-occurring word pairs to reveal word combinations whose meaning goes beyond that of the individual words. Relying solely on the most frequent bigrams often yields uninteresting results, because many of them consist of function words and offer limited insight. To improve collocation quality, Justeson and Katz [32] introduced a simple yet effective heuristic: they apply a part-of-speech filter to candidate phrases, preserving patterns likely to represent genuine ‘phrases’ rather than random word combinations. This makes the identified collocations more meaningful.
We apply this approach to query generation, with the aim of improving the relevance and effectiveness of the generated queries, as follows:
1. Perform Part-of-Speech (POS) tagging: We first apply POS tagging to all the documents in the collection. This step assigns a grammatical tag to each word, which the later N-gram selection relies on.
2. Extract N-grams: N-grams, where N is the desired length of the word sequence, are extracted from the POS-tagged documents. We let N range from 1 to 4, so that unigrams, bigrams, trigrams, and quadgrams are identified.
3. Select N-grams with ‘query-like’ POS tag patterns: From the pool of extracted N-grams, we apply the POS tag patterns recommended by Justeson and Katz [32] to retain only N-grams whose patterns resemble queries. The specific POS tag patterns for each N-gram type are provided in Table 5. The tag patterns for quadgrams are our own, chosen heuristically from our observations.
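The three steps above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes the documents have already been tagged with Penn Treebank tags (any tagger could supply these) and maps them to the coarse classes A (adjective), N (noun), and P (preposition). The bigram and trigram patterns are the ones commonly attributed to Justeson and Katz. Table 5 is not reproduced here, so the unigram and quadgram patterns below are hypothetical placeholders, as are the names `QUERY_LIKE_PATTERNS`, `coarse_tag`, and `count_query_like_ngrams`.

```python
from collections import Counter

# Coarse tag classes: A = adjective, N = noun, P = preposition.
# Bigram and trigram patterns are those commonly attributed to Justeson and Katz.
# The unigram and quadgram patterns are HYPOTHETICAL placeholders, not Table 5.
QUERY_LIKE_PATTERNS = {
    1: {("N",)},
    2: {("A", "N"), ("N", "N")},
    3: {("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"),
        ("N", "N", "N"), ("N", "P", "N")},
    4: {("A", "N", "N", "N"), ("N", "N", "N", "N")},
}


def coarse_tag(penn_tag):
    """Map a Penn Treebank tag to A, N, or P; return None for any other tag."""
    if penn_tag.startswith("JJ"):
        return "A"
    if penn_tag.startswith("NN"):
        return "N"
    if penn_tag == "IN":
        return "P"
    return None


def count_query_like_ngrams(tagged_docs, max_n=4):
    """Count N-grams (N = 1..max_n) whose coarse tag sequence is an allowed pattern.

    tagged_docs is an iterable of documents, each a list of (word, penn_tag) pairs.
    """
    counts = Counter()
    for doc in tagged_docs:
        words = [word.lower() for word, _ in doc]
        tags = [coarse_tag(tag) for _, tag in doc]
        for n in range(1, max_n + 1):
            for i in range(len(doc) - n + 1):
                if tuple(tags[i:i + n]) in QUERY_LIKE_PATTERNS[n]:
                    counts[" ".join(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    doc = [("The", "DT"), ("linear", "JJ"), ("function", "NN"),
           ("of", "IN"), ("degrees", "NNS")]
    for ngram, freq in count_query_like_ngrams([doc]).most_common():
        print(freq, ngram)
```

Keeping the patterns in a dictionary keyed by N makes it easy to swap in the actual Table 5 patterns without touching the extraction loop.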
Subsequently, the resulting list of N-grams is sorted in descending order based on their occurrence frequencies. To ensure a manageable and relevant set of queries, we truncate the list at specific thresholds. These thresholds are determined by drawing inspiration from the frequency distribution of queries found in the AOL query set [40]. We aim to maintain a proportional ratio between the selected N-grams and the query frequencies observed in the AOL real query set, preserving a close alignment with real-world query usage patterns.
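One possible reading of this truncation step is sketched below in Python. The text does not state the thresholds, so the sketch assumes that each N-gram length receives a quota proportional to the share of real queries of that length, and that the most frequent N-grams of each length fill that quota. The function name `truncate_by_length_share`, its parameters, and the shares in the usage example are hypothetical; they are not figures from the AOL log.

```python
from collections import Counter


def truncate_by_length_share(ngram_counts, length_share, total_queries):
    """Keep the most frequent N-grams of each length, up to a per-length quota.

    ngram_counts  : mapping from N-gram string to its occurrence frequency
    length_share  : mapping from N-gram length to the fraction of real queries
                    with that length (ASSUMED input, e.g. measured on a query log)
    total_queries : target size of the generated query set
    """
    selected = []
    for n, share in sorted(length_share.items()):
        quota = round(total_queries * share)
        candidates = [(ngram, freq) for ngram, freq in ngram_counts.items()
                      if len(ngram.split()) == n]
        # Descending frequency; ties are broken alphabetically for determinism.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        selected.extend(ngram for ngram, _ in candidates[:quota])
    return selected


if __name__ == "__main__":
    counts = Counter({"patent": 5, "search": 3, "claim": 1,
                      "patent search": 4, "prior art": 2})
    # Hypothetical shares for illustration only.
    shares = {1: 0.5, 2: 0.25}
    print(truncate_by_length_share(counts, shares, total_queries=4))
```

With real data, `length_share` would be estimated from the query log and `ngram_counts` would come from the filtered N-gram list described above.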
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sinha, A., Mall, P.R., Roy, D. (2024). Exploring the Nexus Between Retrievability and Query Generation Strategies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14611. Springer, Cham. https://doi.org/10.1007/978-3-031-56066-8_16
DOI: https://doi.org/10.1007/978-3-031-56066-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56065-1
Online ISBN: 978-3-031-56066-8
eBook Packages: Computer Science, Computer Science (R0)