Abstract
Probabilistic models such as BM25 and language models (LM) have established themselves as the standard in atomic retrieval. In structured document retrieval (SDR), BM25F could be considered the most established model. However, without optimisation BM25F does not benefit from the document structure. The main contribution of this paper is a new field weighting method, denoted Information Content Field Weighting (ICFW). It applies weights over the structure without optimisation and overcomes issues faced by some existing SDR models, most notably the issue of saturating term frequency across fields. ICFW is similar to BM25 and LM in its analytical grounding and transparency, making it a potential new candidate for a standard SDR model. In an optimised retrieval scenario ICFW performs as well as, or better than, the baselines. More interestingly, in a non-optimised retrieval scenario we observe a considerable increase in performance. Extensive analysis is performed to understand and explain the underlying reasons for this increase.
Acknowledgements
We would like to thank the reviewers for their comments, in particular regarding the presentation of the proposed models and candidate models, as well as the methodology of significance testing.
A Scale Parameter Threshold
The underlying idea of the scale threshold theorem is that there exists a threshold for \(\lambda \), above which the model satisfies the term distinctiveness constraint.
Let \(q=\{t_1,\ldots ,t_n\}\) be a query, and let \(d\) be a document with \(n(t_a,f_i,d)\) occurrences of query term \(t_a\) in field \(f_i\) and \(n(t_b,\overline{f},d)\) occurrences of query term \(t_b\) in an average field \(\overline{f}\). Let \(\overline{d}\) be an amended version of document \(d\) in which the occurrences of \(t_b\) are replaced with occurrences of \(t_a\).
Theorem 1
Given terms \(t_a\) and \(t_b\), if \(\lambda \!>\! \lambda _{threshold }\), then \(RSV (d)\!>\! RSV (\overline{d})\).
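The construction of \(\overline{d}\) from \(d\) used in the theorem can be sketched in a few lines of Python. This is only an illustration of the setup, not part of the model: the representation of a document as a mapping from field names to term counts, the function name `amend_document`, and the field names `title` and `body` are all assumptions made for this example.

```python
from collections import Counter
from copy import deepcopy


def amend_document(doc, t_a, t_b):
    """Return a copy of `doc` in which every occurrence of t_b counts as t_a.

    `doc` maps a field name to a Counter of term frequencies
    (a hypothetical representation chosen for this sketch).
    """
    amended = deepcopy(doc)  # leave the original document d untouched
    for counts in amended.values():
        moved = counts.pop(t_b, 0)  # occurrences of t_b in this field
        if moved:
            counts[t_a] += moved  # they now count as occurrences of t_a
    return amended


# d: t_a occurs in one field, t_b occurs in another field.
d = {
    "title": Counter({"t_a": 2}),
    "body": Counter({"t_b": 3, "other": 5}),
}
d_bar = amend_document(d, "t_a", "t_b")
```

Both documents have the same field lengths and the same total number of query-term occurrences; they differ only in that \(d\) matches two distinct query terms while \(\overline{d}\) matches one, which is the situation the term distinctiveness constraint addresses.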

Proof
Following Definition 8 for \(\lambda _{threshold }\), the inequality becomes:
Considering the numerator first:
Following Eq. (4) for the definition of probabilities and Eq. (26) we obtain,
Following Definition 3 we can re-write Eq. (27) to obtain,
Moving onto the denominator,
Inserting Eq. (5) into Eq. (29) and transforming the log expression we obtain,
Following Definition 3 we can re-write Eq. (30) to obtain
Inserting Eqs. (28) and (31) into Eq. (25) and solving for \(Z\) we obtain,
Expanding the denominator we obtain,

Following Eqs. (13), (14), (8) and (9), Eq. (33) is re-written:
Rearranging Eq. (34) we obtain,

Assuming the term frequencies from the theorem, the retrieval score of \(d\) differs from that of \(\overline{d}\) only through the score contributions of term \(t_a\) in field \(f_i\) and term \(t_b\) in field \(\overline{f}\). For \(\overline{d}\) the same is true of the score contributions of term \(t_a\) in field \(f_i\) and term \(t_a\) in field \(\overline{f}\). Following Definition 5 we rewrite Eq. (35) and obtain the inequality stated in the theorem, \(RSV (d) > RSV (\overline{d})\).
\(\square \)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ketola, T., Roelleke, T. (2023). Automatic and Analytical Field Weighting for Structured Document Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_31
DOI: https://doi.org/10.1007/978-3-031-28244-7_31
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28243-0
Online ISBN: 978-3-031-28244-7
eBook Packages: Computer Science, Computer Science (R0)