Exploring the Nexus Between Retrievability and Query Generation Strategies

  • Conference paper
Advances in Information Retrieval (ECIR 2024)

Abstract

Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.

A. Sinha and P.R. Mall—These authors contributed equally to this work.

Notes

  1. https://dumps.wikimedia.org/enwiki/.

  2. https://tinyurl.com/smart-stopword.

References

  1. Ahmad, W.U., Chang, K.W., Wang, H.: Context attentive document ranking and query suggestion. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 385–394. SIGIR’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3331184.3331246

  2. Abolghasemi, A., Verberne, S., Askari, A., Azzopardi, L.: Retrievability bias estimation using synthetically generated queries. In: Proceedings of the First Workshop on Generative Information Retrieval - GenIR@SIGIR2023 held in conjunction with SIGIR 2023. GenIR@SIGIR2023 (2023). https://coda.io/@sigir/gen-ir/accepted-papers-17

  3. Anderson, N.: The ethics of using aol search data. https://arstechnica.com/uncategorized/2006/08/7578/

  4. Atkinson, A.B.: On the measurement of inequality. J. Econom. Theory 2(3), 244–263 (1970). https://doi.org/10.1016/0022-0531(70)90039-6, https://www.sciencedirect.com/science/article/pii/0022053170900396

  5. Azzopardi, L., Bache, R.: On the relationship between effectiveness and accessibility. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 889–890 (2010)

  6. Azzopardi, L., Owens, C.: Search engine predilection towards news media providers. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 774–775 (2009)

  7. Azzopardi, L., Vinay, V.: Accessibility in information retrieval. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 482–489. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_46

  8. Azzopardi, L., Vinay, V.: Retrievability: an evaluation measure for higher order information access tasks. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 561–570. CIKM ’08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1458082.1458157

  9. Bache, R., Azzopardi, L.: Improving access to large patent corpora. Trans. Large Scale Data Knowl. Centered Syst. 2, 103–121 (2010). https://doi.org/10.1007/978-3-642-16175-9_4

  10. Barbaro, M., Zeller Jr., T.: A face is exposed for aol searcher no. 4417749. https://www.nytimes.com/2006/08/09/technology/09aol.html

  11. Bashir, S.: Improving retrievability with improved cluster-based pseudo-relevance feedback selection. Expert Syst. Appl. 39(8), 7495–7502 (2012). https://doi.org/10.1016/j.eswa.2012.01.041

  12. Bashir, S.: Estimating retrievability ranks of documents using document features. Neurocomputing 123, 216–232 (2014)

  13. Bashir, S., Khattak, A.S.: Producing efficient retrievability ranks of documents using normalized retrievability scoring function. J. Intell. Inform. Syst. 42, 457–484 (2014). https://doi.org/10.1007/s10844-013-0274-3

  14. Bashir, S., Rauber, A.: Analyzing document retrievability in patent retrieval settings. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2009. LNCS, vol. 5690, pp. 753–760. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03573-9_63

  15. Bashir, S., Rauber, A.: Identification of low/high retrievable patents using content-based features. In: Proceedings of the 2nd International Workshop on Patent Information Retrieval, pp. 9–16 (2009)

  16. Bashir, S., Rauber, A.: Improving retrievability of patents with cluster-based pseudo-relevance feedback documents selection. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1863–1866 (2009)

  17. Bashir, S., Rauber, A.: Improving retrievability and recall by automatic corpus partitioning. In: Hameurlain, A., Küng, J., Wagner, R., Bach Pedersen, T., Tjoa, A.M. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems II. LNCS, vol. 6380, pp. 122–140. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16175-9_5

  18. Bashir, S., Rauber, A.: Improving retrievability of patents in prior-art search. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 457–470. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_40

  19. Bashir, S., Rauber, A.: On the relationship between query characteristics and ir functions retrieval bias. J. Am. Soc. Inform. Sci. Technol. 62(8), 1515–1532 (2011)

  20. Bashir, S., Rauber, A.: Automatic ranking of retrieval models using retrievability measure. Knowl. Inf. Syst. 41, 189–221 (2014)

  21. Bashir, S., Rauber, A.: Retrieval models versus retrievability. Current Challenges in Patent Information Retrieval, pp. 185–212 (2017)

  22. Bashir, S., Rauber, A.: Retrieval models versus retrievability. In: Current Challenges in Patent Information Retrieval. TIRS, vol. 37, pp. 185–212. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-53817-3_7

  23. Boratto, L., Faralli, S., Marras, M., Stilo, G. (eds.): Advances in Bias and Fairness in Information Retrieval. Springer Nature Switzerland (2023). https://doi.org/10.1007/978-3-031-37249-0

  24. Callan, J., Connell, M.: Query-based sampling of text databases. ACM Trans. Inform. Syst. (TOIS) 19(2), 97–130 (2001)

  25. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 65–74. SIGIR ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3077136.3080832

  26. Ekstrand, M.D., Das, A., Burke, R., Diaz, F.: Fairness in information access systems. Foundations and Trends® in Information Retrieval 16(1–2), 1–177 (2022). https://doi.org/10.1561/1500000079

  27. Gini, C.: On the measure of concentration with special reference to income and statistics. Colorado College Publication, General Series 208(1), 73–79 (1936)

  28. Hafner, K.: Tempting data, privacy concerns; researchers yearn to use aol logs, but they hesitate. https://www.nytimes.com/2006/08/23/technology/23search.html

  29. Hawking, D.: Overview of the TREC-9 web track. In: Voorhees, E.M., Harman, D.K. (eds.) Proceedings of The Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13–16, 2000. NIST Special Publication, vol. 500–249. National Institute of Standards and Technology (NIST) (2000). http://trec.nist.gov/pubs/trec9/papers/web9.pdf

  30. Johnston, J.: H. Theil. economics and information theory. Econom. J. 79(315), 601–602 (09 1969). https://doi.org/10.2307/2230396

  31. Jordan, C., Watters, C., Gao, Q.: Using controlled query generation to evaluate blind relevance feedback algorithms. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 286–295 (2006)

  32. Justeson, J.S., Katz, S.M.: Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics 17(1), 1–20 (1991). https://aclanthology.org/J91-1001

  33. Kang, Y.M., Liu, W., Zhou, Y.: Queryblazer: efficient query autocompletion framework. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 1020–1028. WSDM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3437963.3441725

  34. Ma, Z., Dou, Z., Bian, G., Wen, J.R.: Pstie: time information enhanced personalized search. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1075–1084. CIKM ’20, Association for Computing Machinery (2020). https://doi.org/10.1145/3340531.3411877

  35. MacAvaney, S., Macdonald, C., Ounis, I.: Reproducing personalised session search over the aol query log. In: Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg, K., Setty, V. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 627–640. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_42

  36. McLellan, C.: The relationship between retrievability bias and retrieval performance. Ph.D. thesis, University of Glasgow, UK (2019). https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.775857

  37. Nogueira, R., Lin, J.: From doc2query to doctttttquery. In: Online preprint 6 (2019). https://github.com/castorini/docTTTTTquery

  38. Noor, S., Bashir, S.: Evaluating bias in retrieval systems for recall oriented documents retrieval. Int. Arab J. Inform. Technol. (IAJIT) 12(1) (2015)

  39. Palma, J.G.: Homogeneous middles vs. heterogeneous tails, and the end of the ‘inverted-u’: the share of the rich is what it’s all about. Cambridge working papers in economics, Faculty of Economics, University of Cambridge (2011). https://EconPapers.repec.org/RePEc:cam:camdae:1111

  40. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems, pp. 1-es. InfoScale ’06, Association for Computing Machinery (2006). https://doi.org/10.1145/1146847.1146848

  41. Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and expansion. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1049–1058 (2010)

  42. Roy, D., Carevic, Z., Mayr, P.: Studying retrievability of publications and datasets in an integrated retrieval system. In: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. JCDL ’22, Association for Computing Machinery (2022). https://doi.org/10.1145/3529372.3530931

  43. Roy, D., Carevic, Z., Mayr, P.: Retrievability in an integrated retrieval system: an extended study. Int. J. Digital Libr. (Apr 2023). https://doi.org/10.1007/s00799-023-00363-4

  44. Traub, M.C., Samar, T., van Ossenbruggen, J., Hardman, L.: Impact of crowdsourcing ocr improvements on retrievability bias. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 29–36. JCDL ’18, Association for Computing Machinery (2018). https://doi.org/10.1145/3197026.3197046

  45. Traub, M.C., Samar, T., Van Ossenbruggen, J., He, J., de Vries, A., Hardman, L.: Querylog-based assessment of retrievability bias in a large newspaper corpus. In: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 7–16. IEEE (2016)

  46. Voorhees, E.M.: Overview of the TREC 2004 robust track. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16–19, 2004. NIST Special Publication, vol. 500–261. National Institute of Standards and Technology (NIST) (2004). http://trec.nist.gov/pubs/trec13/papers/ROBUST.OVERVIEW.pdf

  47. Wilkie, C., Azzopardi, L.: An initial investigation on the relationship between usage and findability. In: Serdyukov, P., et al. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 808–811. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36973-5_90

  48. Wilkie, C., Azzopardi, L.: Relating retrievability, performance and length. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 937–940 (2013)

  49. Wilkie, C., Azzopardi, L.: Best and fairest: an empirical analysis of retrieval system bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 13–25. Springer International Publishing, Cham (2014)

  50. Wilkie, C., Azzopardi, L.: Efficiently estimating retrievability bias. In: de Rijke, M., et al. (eds.) Advances in Information Retrieval, pp. 720–726. Springer International Publishing, Cham (2014)

  51. Wilkie, C., Azzopardi, L.: A retrievability analysis: Exploring the relationship between retrieval bias and retrieval performance. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 81–90 (2014)

  52. Wilkie, C., Azzopardi, L.: Query length, retrievability bias and performance. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1787–1790. CIKM ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2806416.2806604

  53. Wilkie, C., Azzopardi, L.: Retrievability and retrieval bias: a comparison of inequality measures. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 209–214. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16354-3_22

  54. Zheng, L., Cox, I.J.: Document-oriented pruning of the inverted index in information retrieval systems. In: 2009 International Conference on Advanced Information Networking and Applications Workshops, pp. 697–702. IEEE (2009)

  55. Zhu, Y., et al.: Contrastive learning of user behavior sequence for context-aware document ranking. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2780–2791. CIKM ’21, Association for Computing Machinery (2021). https://doi.org/10.1145/3459637.3482243

Author information

Corresponding author

Correspondence to Dwaipayan Roy.

A Appendix - A POS-Based Query Generation Technique

Identifying collocations in a text corpus typically involves counting co-occurring word pairs to surface combinations whose meaning goes beyond that of the individual words. Relying solely on the most frequent bigrams often yields uninteresting results, because many of them consist of function words and offer limited insight. To improve collocation quality, Justeson and Katz [32] introduced a simple yet effective heuristic: apply a part-of-speech filter to candidate phrases and keep only those patterns likely to represent genuine ‘phrases’ rather than random word combinations. This filtering makes the identified collocations more meaningful.

We apply this approach to query generation in order to improve the relevance and effectiveness of the generated queries. The procedure consists of the following steps:

  1. Perform Part-of-Speech (POS) tagging: We first apply POS tagging to all the documents in the collection. This step assigns a grammatical tag to each word, which the subsequent N-gram selection relies on.

  2. Extract N-grams: N-grams, where N is the desired length of the word sequence, are extracted from the POS-tagged documents. In our case, N ranges from 1 to 4, covering unigrams, bigrams, trigrams, and quadgrams.

  3. Select N-grams with ‘query-like’ POS tag patterns: From the pool of extracted N-grams, we apply the POS tag patterns recommended by Justeson and Katz [32] to retain only the N-grams whose patterns resemble queries. The specific POS tag patterns for each N-gram type are provided in Table 5. The tag patterns for quadgrams are our own heuristic proposal, based on our observations.
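The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to arrive already POS-tagged as lists of (token, tag) pairs using a coarse, hypothetical tag set (A = adjective, N = noun, P = preposition), and the `PATTERNS` dictionary is illustrative only. It is not a copy of Table 5; in particular, the unigram and quadgram entries are placeholders.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Tagged = Tuple[str, str]  # (token, coarse POS tag), assumed to be pre-computed

# Illustrative tag patterns only (NOT the contents of Table 5).
# A = adjective, N = noun, P = preposition.
PATTERNS = {
    1: {("N",)},                                      # placeholder
    2: {("A", "N"), ("N", "N")},
    3: {("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"),
        ("N", "N", "N"), ("N", "P", "N")},
    4: {("A", "N", "N", "N"), ("N", "N", "N", "N")},  # placeholder
}


def ngrams(seq: Sequence[Tagged], n: int) -> List[Tuple[Tagged, ...]]:
    """Return all contiguous length-n windows of a tagged token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def count_query_candidates(docs: Iterable[List[Tagged]]) -> Counter:
    """Count N-grams (N = 1..4) whose tag sequence matches an allowed pattern."""
    counts: Counter = Counter()
    for doc in docs:
        for n, allowed in PATTERNS.items():
            for gram in ngrams(doc, n):
                words, tags = zip(*gram)  # split tokens from their tags
                if tags in allowed:
                    counts[" ".join(words).lower()] += 1
    return counts


if __name__ == "__main__":
    doc = [("linear", "A"), ("regression", "N"), ("of", "P"),
           ("the", "D"), ("model", "N")]
    print(count_query_candidates([doc]))
```

In practice the tags would come from a POS tagger and be mapped onto whatever tag set the patterns in Table 5 are written in; that mapping is left out here.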

Table 5. POS Tag rules for N-gram query generation

Subsequently, the resulting list of N-grams is sorted in descending order of occurrence frequency. To keep the set of queries manageable and relevant, we truncate the list at specific thresholds. These thresholds are inspired by the frequency distribution of queries in the AOL query set [40]: we aim to keep the selected N-grams proportional to the query frequencies observed in the real AOL queries, so that the generated set stays closely aligned with real-world query usage patterns.
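One way to read this truncation step is sketched below. It is only an interpretation under assumptions: the function name `select_queries`, the per-length `length_share` proportions, and the `budget` parameter are all hypothetical, and the share values in the usage example are made up for illustration rather than taken from the AOL log. The sketch ranks candidates by frequency and keeps, for each query length, a number of top candidates proportional to that length's share.

```python
from collections import Counter
from typing import Dict, List


def select_queries(counts: Counter,
                   length_share: Dict[int, float],
                   budget: int) -> List[str]:
    """Keep the most frequent candidates of each length.

    counts       -- candidate query -> occurrence frequency
    length_share -- hypothetical fraction of queries having each length,
                    e.g. estimated from a real query log
    budget       -- total number of queries to keep (approximate)
    """
    by_length: Dict[int, List[str]] = {}
    # most_common() yields candidates in descending order of frequency.
    for query, _freq in counts.most_common():
        by_length.setdefault(len(query.split()), []).append(query)

    selected: List[str] = []
    for n, share in sorted(length_share.items()):
        threshold = int(budget * share)  # cut-off for queries of length n
        selected.extend(by_length.get(n, [])[:threshold])
    return selected


if __name__ == "__main__":
    candidates = Counter({"a": 5, "b": 3, "c": 1, "x y": 4, "p q": 2})
    # Made-up shares: half of the budget for unigrams, half for bigrams.
    print(select_queries(candidates, {1: 0.5, 2: 0.5}, budget=4))
```

Because each threshold is rounded down, the number of selected queries can fall slightly below `budget`.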

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sinha, A., Mall, P.R., Roy, D. (2024). Exploring the Nexus Between Retrievability and Query Generation Strategies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14611. Springer, Cham. https://doi.org/10.1007/978-3-031-56066-8_16

  • DOI: https://doi.org/10.1007/978-3-031-56066-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56065-1

  • Online ISBN: 978-3-031-56066-8

  • eBook Packages: Computer Science, Computer Science (R0)
