Exploring the Nexus Between Retrievability and Query Generation Strategies

Sinha, Aman; Mall, Priyanshu Raj; Roy, Dwaipayan

doi:10.1007/978-3-031-56066-8_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14611)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.

A. Sinha and P.R. Mall—These authors contributed equally to this work.

References

Ahmad, W.U., Chang, K.W., Wang, H.: Context attentive document ranking and query suggestion. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 385–394. SIGIR’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3331184.3331246
Abolghasemi, A., Verberne, S., Askari, A., Azzopardi, L.: Retrievability bias estimation using synthetically generated queries. In: Proceedings of the First Workshop on Generative Information Retrieval - GenIR@SIGIR2023 held in conjunction with SIGIR 2023. GenIR@SIGIR2023 (2023). https://coda.io/@sigir/gen-ir/accepted-papers-17
Anderson, N.: The ethics of using AOL search data. https://arstechnica.com/uncategorized/2006/08/7578/
Atkinson, A.B.: On the measurement of inequality. J. Econom. Theory 2(3), 244–263 (1970). https://doi.org/10.1016/0022-0531(70)90039-6, https://www.sciencedirect.com/science/article/pii/0022053170900396
Azzopardi, L., Bache, R.: On the relationship between effectiveness and accessibility. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 889–890 (2010)
Azzopardi, L., Owens, C.: Search engine predilection towards news media providers. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 774–775 (2009)
Azzopardi, L., Vinay, V.: Accessibility in information retrieval. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 482–489. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_46
Azzopardi, L., Vinay, V.: Retrievability: an evaluation measure for higher order information access tasks. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 561–570. CIKM ’08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1458082.1458157
Bache, R., Azzopardi, L.: Improving access to large patent corpora. Trans. Large Scale Data Knowl. Centered Syst. 2, 103–121 (2010). https://doi.org/10.1007/978-3-642-16175-9_4
Barbaro, M., Zeller Jr., T.: A face is exposed for AOL searcher no. 4417749. https://www.nytimes.com/2006/08/09/technology/09aol.html
Bashir, S.: Improving retrievability with improved cluster-based pseudo-relevance feedback selection. Expert Syst. Appl. 39(8), 7495–7502 (2012). https://doi.org/10.1016/j.eswa.2012.01.041
Bashir, S.: Estimating retrievability ranks of documents using document features. Neurocomputing 123, 216–232 (2014)
Bashir, S., Khattak, A.S.: Producing efficient retrievability ranks of documents using normalized retrievability scoring function. J. Intell. Inform. Syst. 42, 457–484 (2014). https://doi.org/10.1007/s10844-013-0274-3
Bashir, S., Rauber, A.: Analyzing document retrievability in patent retrieval settings. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2009. LNCS, vol. 5690, pp. 753–760. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03573-9_63
Bashir, S., Rauber, A.: Identification of low/high retrievable patents using content-based features. In: Proceedings of the 2nd International Workshop on Patent Information Retrieval, pp. 9–16 (2009)
Bashir, S., Rauber, A.: Improving retrievability of patents with cluster-based pseudo-relevance feedback documents selection. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1863–1866 (2009)
Bashir, S., Rauber, A.: Improving retrievability and recall by automatic corpus partitioning. In: Hameurlain, A., Küng, J., Wagner, R., Bach Pedersen, T., Tjoa, A.M. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems II. LNCS, vol. 6380, pp. 122–140. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16175-9_5
Bashir, S., Rauber, A.: Improving retrievability of patents in prior-art search. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 457–470. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_40
Bashir, S., Rauber, A.: On the relationship between query characteristics and IR functions retrieval bias. J. Am. Soc. Inform. Sci. Technol. 62(8), 1515–1532 (2011)
Bashir, S., Rauber, A.: Automatic ranking of retrieval models using retrievability measure. Knowl. Inf. Syst. 41, 189–221 (2014)
Bashir, S., Rauber, A.: Retrieval models versus retrievability. Current Challenges in Patent Information Retrieval, pp. 185–212 (2017)
Bashir, S., Rauber, A.: Retrieval models versus retrievability. In: Current Challenges in Patent Information Retrieval. TIRS, vol. 37, pp. 185–212. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-53817-3_7
Boratto, L., Faralli, S., Marras, M., Stilo, G. (eds.): Advances in Bias and Fairness in Information Retrieval. Springer Nature Switzerland (2023). https://doi.org/10.1007/978-3-031-37249-0
Callan, J., Connell, M.: Query-based sampling of text databases. ACM Trans. Inform. Syst. (TOIS) 19(2), 97–130 (2001)
Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 65–74. SIGIR ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3077136.3080832
Ekstrand, M.D., Das, A., Burke, R., Diaz, F.: Fairness in information access systems. Foundations and Trends® in Information Retrieval 16(1–2), 1–177 (2022). https://doi.org/10.1561/1500000079
Gini, C.: On the measure of concentration with special reference to income and statistics. Colorado College Publication, General Series 208(1), 73–79 (1936)
Hafner, K.: Tempting data, privacy concerns; researchers yearn to use AOL logs, but they hesitate. https://www.nytimes.com/2006/08/23/technology/23search.html
Hawking, D.: Overview of the TREC-9 web track. In: Voorhees, E.M., Harman, D.K. (eds.) Proceedings of The Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13–16, 2000. NIST Special Publication, vol. 500–249. National Institute of Standards and Technology (NIST) (2000). http://trec.nist.gov/pubs/trec9/papers/web9.pdf
Johnston, J.: H. Theil. Economics and information theory. Econom. J. 79(315), 601–602 (1969). https://doi.org/10.2307/2230396
Jordan, C., Watters, C., Gao, Q.: Using controlled query generation to evaluate blind relevance feedback algorithms. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 286–295 (2006)
Justeson, J.S., Katz, S.M.: Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics 17(1), 1–20 (1991). https://aclanthology.org/J91-1001
Kang, Y.M., Liu, W., Zhou, Y.: Queryblazer: efficient query autocompletion framework. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 1020–1028. WSDM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3437963.3441725
Ma, Z., Dou, Z., Bian, G., Wen, J.R.: Pstie: time information enhanced personalized search. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1075–1084. CIKM ’20, Association for Computing Machinery (2020). https://doi.org/10.1145/3340531.3411877
MacAvaney, S., Macdonald, C., Ounis, I.: Reproducing personalised session search over the AOL query log. In: Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg, K., Setty, V. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 627–640. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_42
McLellan, C.: The relationship between retrievability bias and retrieval performance. Ph.D. thesis, University of Glasgow, UK (2019). https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.775857
Nogueira, R., Lin, J.: From doc2query to doctttttquery. In: Online preprint 6 (2019). https://github.com/castorini/docTTTTTquery
Noor, S., Bashir, S.: Evaluating bias in retrieval systems for recall oriented documents retrieval. Int. Arab J. Inform. Technol. (IAJIT) 12(1) (2015)
Palma, J.G.: Homogeneous middles vs. heterogeneous tails, and the end of the ‘inverted-u’: the share of the rich is what it’s all about. Cambridge working papers in economics, Faculty of Economics, University of Cambridge (2011). https://EconPapers.repec.org/RePEc:cam:camdae:1111
Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems, pp. 1-es. InfoScale ’06, Association for Computing Machinery (2006). https://doi.org/10.1145/1146847.1146848
Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1049–1058 (2010)
Roy, D., Carevic, Z., Mayr, P.: Studying retrievability of publications and datasets in an integrated retrieval system. In: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. JCDL ’22, Association for Computing Machinery (2022). https://doi.org/10.1145/3529372.3530931
Roy, D., Carevic, Z., Mayr, P.: Retrievability in an integrated retrieval system: an extended study. Int. J. Digital Libr. (Apr 2023). https://doi.org/10.1007/s00799-023-00363-4
Traub, M.C., Samar, T., van Ossenbruggen, J., Hardman, L.: Impact of crowdsourcing OCR improvements on retrievability bias. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 29–36. JCDL ’18, Association for Computing Machinery (2018). https://doi.org/10.1145/3197026.3197046
Traub, M.C., Samar, T., Van Ossenbruggen, J., He, J., de Vries, A., Hardman, L.: Querylog-based assessment of retrievability bias in a large newspaper corpus. In: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 7–16. IEEE (2016)
Voorhees, E.M.: Overview of the TREC 2004 robust track. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16–19, 2004. NIST Special Publication, vol. 500–261. National Institute of Standards and Technology (NIST) (2004). http://trec.nist.gov/pubs/trec13/papers/ROBUST.OVERVIEW.pdf
Wilkie, C., Azzopardi, L.: An initial investigation on the relationship between usage and findability. In: Serdyukov, P., et al. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 808–811. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36973-5_90
Wilkie, C., Azzopardi, L.: Relating retrievability, performance and length. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 937–940 (2013)
Wilkie, C., Azzopardi, L.: Best and fairest: an empirical analysis of retrieval system bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 13–25. Springer International Publishing, Cham (2014)
Wilkie, C., Azzopardi, L.: Efficiently estimating retrievability bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 720–726. Springer International Publishing, Cham (2014)
Wilkie, C., Azzopardi, L.: A retrievability analysis: Exploring the relationship between retrieval bias and retrieval performance. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 81–90 (2014)
Wilkie, C., Azzopardi, L.: Query length, retrievability bias and performance. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1787–1790. CIKM ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2806416.2806604
Wilkie, C., Azzopardi, L.: Retrievability and retrieval bias: a comparison of inequality measures. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 209–214. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16354-3_22
Zheng, L., Cox, I.J.: Document-oriented pruning of the inverted index in information retrieval systems. In: 2009 International Conference on Advanced Information Networking and Applications Workshops, pp. 697–702. IEEE (2009)
Zhu, Y., et al.: Contrastive learning of user behavior sequence for context-aware document ranking. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2780–2791. CIKM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3459637.3482243

Author information

Authors and Affiliations

Indian Institute of Science Education and Research Kolkata, Kalyani, India
Aman Sinha, Priyanshu Raj Mall & Dwaipayan Roy

Corresponding author

Correspondence to Dwaipayan Roy.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

A Appendix - A POS-Based Query Generation Technique

Identifying collocations in a text corpus typically involves counting co-occurring word pairs to find combinations whose meaning goes beyond that of the individual words. Relying solely on the most frequent bigrams often yields uninteresting results, because many of them consist of function words and offer limited insight. To improve collocation quality, Justeson and Katz [32] introduced a simple yet effective heuristic: they apply a part-of-speech filter to candidate phrases, preserving only patterns likely to represent genuine ‘phrases’ rather than random word combinations. This makes the identified collocations more meaningful.

We apply this approach to query generation to improve the relevance and effectiveness of the generated queries.

1. Perform Part-of-Speech (POS) tagging: We first apply POS tagging to all documents in the collection. This step assigns a grammatical tag to each word, which facilitates the subsequent identification of N-grams.
2. Extract N-grams: N-grams, where N is the desired length of the word sequence, are extracted from the POS-tagged documents. We let N range from 1 to 4, which yields unigrams, bigrams, trigrams, and quadgrams.
3. Select N-grams with ‘query-like’ POS tag patterns: From the pool of extracted N-grams, we apply the POS tag patterns recommended by Justeson and Katz [32] to retain the N-grams that resemble queries. The specific POS tag patterns for each N-gram type are provided in Table 5. The tag patterns for quadgrams are our own heuristic proposal, based on our observations.

Table 5. POS Tag rules for N-gram query generation
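
As a rough illustration of steps 2 and 3, the following Python sketch counts N-grams whose tag sequence matches a POS pattern. It is a minimal sketch under stated assumptions, not the authors' implementation: the input is assumed to be already POS-tagged with Penn Treebank tags, the function names `coarse_tag` and `candidate_queries` are hypothetical, and the mapping to the coarse classes A (adjective), N (noun), and P (preposition) is an assumption. The bigram and trigram patterns are the commonly cited Justeson and Katz patterns; the unigram and quadgram patterns are placeholders, since the contents of Table 5 are not reproduced here.

```python
from collections import Counter

# Bigram and trigram patterns follow the Justeson and Katz filter
# (A = adjective, N = noun, P = preposition).
# Unigram and quadgram patterns are HYPOTHETICAL placeholders; the
# paper's Table 5 is not available in this text.
PATTERNS = {
    1: {("N",)},
    2: {("A", "N"), ("N", "N")},
    3: {("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"),
        ("N", "N", "N"), ("N", "P", "N")},
    4: {("A", "N", "N", "N"), ("N", "N", "N", "N")},
}


def coarse_tag(penn_tag):
    """Map a Penn Treebank tag to A, N, P, or None (assumed mapping)."""
    if penn_tag.startswith("JJ"):
        return "A"
    if penn_tag.startswith("NN"):
        return "N"
    if penn_tag == "IN":
        return "P"
    return None


def candidate_queries(tagged_tokens, max_n=4):
    """Count N-grams (N = 1..max_n) whose tag sequence matches a pattern."""
    words = [word.lower() for word, _ in tagged_tokens]
    tags = [coarse_tag(tag) for _, tag in tagged_tokens]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            if tuple(tags[i:i + n]) in PATTERNS[n]:
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Pre-tagged toy sentence; in practice the tags would come from a POS tagger.
sentence = [("The", "DT"), ("linear", "JJ"), ("regression", "NN"),
            ("model", "NN"), ("of", "IN"), ("income", "NN")]
queries = candidate_queries(sentence)
```

On this toy sentence the filter keeps phrases such as "linear regression" and "model of income" and discards sequences such as "the linear" that start with a function word.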

Subsequently, the resulting list of N-grams is sorted in descending order based on their occurrence frequencies. To ensure a manageable and relevant set of queries, we truncate the list at specific thresholds. These thresholds are determined by drawing inspiration from the frequency distribution of queries found in the AOL query set [40]. We aim to maintain a proportional ratio between the selected N-grams and the query frequencies observed in the AOL real query set, preserving a close alignment with real-world query usage patterns.
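
One possible reading of this truncation step is sketched below in Python. It is an assumption-laden illustration rather than the procedure used in the paper: the function name `truncate_by_length` is hypothetical, the text does not state the actual thresholds, and the shares in the example are made-up values, not statistics from the AOL query log. The sketch ranks the N-grams of each length by descending frequency and keeps a quota proportional to the share of queries of that length.

```python
from collections import Counter


def truncate_by_length(ngram_counts, length_share, total_queries):
    """Keep the most frequent N-grams of each length.

    length_share maps a query length (in words) to the fraction of the
    final query set reserved for that length.
    """
    by_length = {}
    for ngram, freq in ngram_counts.items():
        by_length.setdefault(len(ngram.split()), []).append((ngram, freq))

    selected = []
    for n, share in sorted(length_share.items()):
        # Sort by descending frequency, breaking ties alphabetically.
        ranked = sorted(by_length.get(n, []), key=lambda item: (-item[1], item[0]))
        quota = round(share * total_queries)
        selected.extend(ngram for ngram, _ in ranked[:quota])
    return selected


# HYPOTHETICAL values: these counts and shares are not taken from the paper
# or from the AOL query log.
counts = Counter({"patent": 9, "search": 7, "engine": 2,
                  "patent search": 5, "search engine": 4, "query log": 1,
                  "patent search engine": 3})
shares = {1: 0.4, 2: 0.4, 3: 0.2}
queries = truncate_by_length(counts, shares, total_queries=5)
```

With these toy inputs, the two most frequent unigrams, the two most frequent bigrams, and the single trigram are retained, so the length distribution of the generated query set mirrors the target shares.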

About this paper

Cite this paper

Sinha, A., Mall, P.R., Roy, D. (2024). Exploring the Nexus Between Retrievability and Query Generation Strategies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14611. Springer, Cham. https://doi.org/10.1007/978-3-031-56066-8_16

DOI: https://doi.org/10.1007/978-3-031-56066-8_16
Published: 15 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56065-1
Online ISBN: 978-3-031-56066-8
eBook Packages: Computer Science, Computer Science (R0)