
Automatic and Analytical Field Weighting for Structured Document Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13980)


Abstract

Probabilistic models such as BM25 and LM have established themselves as the standard in atomic retrieval. In structured document retrieval (SDR), BM25F could be considered the most established model. However, without optimization BM25F does not benefit from the document structure. The main contribution of this paper is a new field weighting method, denoted Information Content Field Weighting (ICFW). It applies weights over the structure without optimization and overcomes issues faced by some existing SDR models, most notably the issue of saturating term frequency across fields. ICFW is similar to BM25 and LM in its analytical grounding and transparency, making it a potential new candidate for a standard SDR model. For an optimised retrieval scenario, ICFW does as well as, or better than, the baselines. More interestingly, for a non-optimised retrieval scenario we observe a considerable increase in performance. Extensive analysis is performed to understand and explain the underlying reasons for this increase.
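
The general idea can be illustrated with a rough sketch: field weights are derived automatically from collection statistics rather than tuned, and term-frequency saturation is applied within each field separately, one possible reading of avoiding the cross-field saturation issue mentioned above. The toy collection, the function names and the specific weighting formula below are illustrative assumptions, not the paper's ICFW definition (the actual implementation is linked in note 1).

import math
from collections import Counter

# Hypothetical toy collection of fielded documents (illustrative only).
DOCS = [
    {"title": "red power drill", "description": "cordless drill with two batteries"},
    {"title": "hammer set", "description": "claw hammer and rubber mallet"},
    {"title": "drill bits", "description": "titanium drill bits for metal and wood"},
]
FIELDS = ["title", "description"]

def tokenize(text):
    return text.lower().split()

def ic_field_weight(term, field, docs, eps=0.5):
    """Assumed information-content style weight: negative log of the term's
    smoothed relative frequency in this field across the collection, so a term
    that is rare in a field receives a larger weight for that field."""
    tokens = [tok for d in docs for tok in tokenize(d.get(field, ""))]
    tf = sum(1 for tok in tokens if tok == term)
    return -math.log((tf + eps) / (len(tokens) + eps))

def score(query, doc, docs, k1=1.2):
    """Fielded scoring sketch: automatic field weights combined with a
    BM25-style saturation of the term frequency within each field."""
    total = 0.0
    for term in tokenize(query):
        for field in FIELDS:
            tf = Counter(tokenize(doc.get(field, "")))[term]
            if tf == 0:
                continue
            total += ic_field_weight(term, field, docs) * (tf * (k1 + 1)) / (tf + k1)
    return total

if __name__ == "__main__":
    for doc in sorted(DOCS, key=lambda d: score("drill", d, DOCS), reverse=True):
        print(round(score("drill", doc, DOCS), 3), doc["title"])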


Notes

  1. https://github.com/TuomasKetola/icfw-for-SDR.

  2. https://www.kaggle.com/c/home-depot-product-search-relevance.

  3. https://trec.nist.gov/data/t8.web.html.


Acknowledgements

We would like to thank the reviewers for their comments, in particular regarding the presentation of the proposed models and candidate models, as well as the methodology of significance testing.

Author information

Corresponding author

Correspondence to Tuomas Ketola.


A Scale Parameter Threshold

The underlying idea of the scale threshold theorem is that there exists a threshold for \(\lambda \), above which the model satisfies the term distinctiveness constraint.

Let \(q=\{t_1,\ldots ,t_n\}\) be a query, d be a document with \(n(t_a,f_i,d)\) occurrences of query term \(t_a\) in field \(f_i\) and \(n(t_b,\overline{f},d)\) occurrences of query term \(t_b\) in an average field \(\overline{f}\). Let \(\overline{d}\) be an amended version of document d where the occurrences of \(t_b\) are replaced with occurrences of \(t_a\).

Theorem 1

Given terms \(t_a\) and \(t_b\), if \(\lambda \!>\! \lambda _{threshold }\), then \(RSV (d)\!>\! RSV (\overline{d})\).

(24)

Proof

Following Definition 8 for \(\lambda _{threshold }\), the inequality becomes:

$$\begin{aligned} \lambda > \frac{\log \frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})|\overline{F}|^{Zx}}{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}|\overline{F}|} }{\log \frac{m^{Z+1}{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})^{Z(x+1)}}{m^{Z(x+1)}{\text {ff}}({\text {t}_{\text {a}}}, {d})^{Z+1}} } \end{aligned}$$
(25)

Considering the numerator first:

$$\begin{aligned} \log \frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})|\overline{F}|^{Zx}}{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}|\overline{F}|} = \log \frac{\frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})}{|\overline{F}|}}{\frac{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}}{|\overline{F}|^{Zx}}} \end{aligned}$$
(26)

Following Eq. (4) for the definition of probabilities and Eq. (26) we obtain,

$$\begin{aligned} \log \frac{P(\text {t}_{\text {b}},\overline{f}|\overline{F})}{P(\text {t}_{\text {a}},\overline{f}|\overline{F})^{Zx}} = \log P(\text {t}_{\text {b}},\overline{f}|\overline{F}) - Zx\log P(\text {t}_{\text {a}},\overline{f}|\overline{F}) \end{aligned}$$
(27)
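
The step from Eq. (26) to Eq. (27) relies on Eq. (4), which is not reproduced in this excerpt; assuming it defines the collection-level probability as a relative document frequency,

$$\begin{aligned} P(\text {t},\overline{f}|\overline{F}) = \frac{{\text {df}}(\text {t}, \overline{F})}{|\overline{F}|} \end{aligned}$$

the two fractions in Eq. (26) are exactly these probabilities for \(\text {t}_{\text {b}}\) and \(\text {t}_{\text {a}}\), raised to the powers 1 and Zx respectively.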

Following Definition 3 we can re-write Eq. (27) to obtain,

$$\begin{aligned}&\log P(\text {t}_{\text {b}},\overline{f}|\overline{F}) - Zx\log P(\text {t}_{\text {a}},\overline{f}|\overline{F}) = Zx{\text {ICF}}(\overline{f},\overline{d}) - {\text {ICF}}(\overline{f},d) \end{aligned}$$
(28)
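
Here Definition 3, also not reproduced in this excerpt, is assumed to define the information content as a negative log-probability, i.e. \({\text {ICF}}(\overline{f},d) = -\log P(\text {t}_{\text {b}},\overline{f}|\overline{F})\) and \({\text {ICF}}(\overline{f},\overline{d}) = -\log P(\text {t}_{\text {a}},\overline{f}|\overline{F})\), the latter because in \(\overline{d}\) the occurrences of \(\text {t}_{\text {b}}\) are replaced by \(\text {t}_{\text {a}}\). Under this assumption Eq. (28) is a direct rewriting of Eq. (27); the analogous reading of \({\text {ICD}}\) as a negative log of the within-document probabilities is used for Eq. (31) below.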

Moving onto the denominator,

$$\begin{aligned} \log \frac{m^{Z+1}{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})^{Z(x+1)}}{m^{Z(x+1)}{\text {ff}}({\text {t}_{\text {a}}}, {d})^{Z+1}} = \log \frac{[\frac{{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})}{m}]^{Z(x+1)}}{[\frac{{\text {ff}}({\text {t}_{\text {a}}}, {d})}{m}]^{Z+1}} \end{aligned}$$
(29)

Inserting Eq. (5) to Eq. (29) and transforming the log expression we obtain,

$$\begin{aligned} \log \frac{P(f_i|\overline{d})^{Z(x+1)}}{P(f_i|d)^{Z+1}} =Z(x+1)\log P(\text {t}_{\text {a}},f_i|\overline{d}) - Z\log P(\text {t}_{\text {a}},f_i|d) - \log P(\text {t}_{\text {a}},\overline{f}|d) \end{aligned}$$
(30)
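
Analogously, Eq. (5) is not reproduced here; under the assumption that it defines the within-document probability as a normalised field frequency, for example

$$\begin{aligned} P(\text {t}_{\text {a}},f_i|d) = \frac{{\text {ff}}({\text {t}_{\text {a}}}, {d})}{m} \end{aligned}$$

the bracketed ratios in Eq. (29) can be read as such probabilities, with the exponent \(Z+1\) then split over the specific field \(f_i\) and the average field \(\overline{f}\) as on the right-hand side of Eq. (30).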

Following Definition 3 we can re-write Eq. (30) to obtain

$$\begin{aligned}&\log \frac{P(f_i|\overline{d})^{Z(x+1)}}{P(f_i|d)^{Z+1}} = - Z(x+1){\text {ICD}}(f_i,\overline{d}) + Z{\text {ICD}}(f_i,d) + {\text {ICD}}(\overline{f},d) \end{aligned}$$
(31)

Inserting Eqs. (28) and (31) to Eq. (25) and solving for Z we obtain,

$$\begin{aligned}&Z < \frac{{\text {ICF}}(\overline{f},d) + \lambda {\text {ICD}}(\overline{f},d)}{\lambda (x+1){\text {ICD}}(f_i,\overline{d}) - \lambda {\text {ICD}}(f_i,d) + x{\text {ICF}}(\overline{f},\overline{d})} \end{aligned}$$
(32)
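
For completeness, the intermediate step runs as follows, under the assumption that the denominator of Eq. (25), i.e. the expression in Eq. (31), and the resulting coefficient of Z are positive, so that the inequality direction is preserved:

$$\begin{aligned} \lambda \big (Z{\text {ICD}}(f_i,d) - Z(x+1){\text {ICD}}(f_i,\overline{d}) + {\text {ICD}}(\overline{f},d)\big ) > Zx{\text {ICF}}(\overline{f},\overline{d}) - {\text {ICF}}(\overline{f},d) \end{aligned}$$

Collecting the terms in Z on one side gives

$$\begin{aligned} Z\big (\lambda (x+1){\text {ICD}}(f_i,\overline{d}) - \lambda {\text {ICD}}(f_i,d) + x{\text {ICF}}(\overline{f},\overline{d})\big ) < {\text {ICF}}(\overline{f},d) + \lambda {\text {ICD}}(\overline{f},d) \end{aligned}$$

and dividing by the bracketed coefficient yields Eq. (32).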

Expanding the denominator we obtain,

(33)

Following Eqs. (13), (14), (8) and (9), Eq. (33) is re-written:

$$\begin{aligned}&\frac{{\text {S}}_{{\text {contr}}}(\text {t}_{\text {a}},f_i,d)}{{\text {S}}_{{\text {contr}}}(\text {t}_{\text {b}},\overline{f},d)} < \frac{w_{{\text {icfw}}}({\overline{f}}, {d})}{w_{{\text {icfw}}}({f_i}, {\overline{d}}) + xw_{{\text {icfw}}}({\overline{f}}, {\overline{d}}) - w_{{\text {icfw}}}({f_i}, {d})} \end{aligned}$$
(34)

Rearranging Eq. (34) we obtain,

(35)

Given the term frequencies assumed in the theorem, the difference in retrieval score depends only on the score contributions of term \(\text {t}_{\text {a}}\) in field \(f_i\) and of term \(\text {t}_{\text {b}}\) in field \(\overline{f}\) for \(d\), and on the contributions of term \(\text {t}_{\text {a}}\) in field \(f_i\) and of term \(\text {t}_{\text {a}}\) in field \(\overline{f}\) for \(\overline{d}\). Following Definition 5 we rewrite Eq. (35) and obtain the inequality implied by the theorem.

   \(\square \)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ketola, T., Roelleke, T. (2023). Automatic and Analytical Field Weighting for Structured Document Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_31

  • DOI: https://doi.org/10.1007/978-3-031-28244-7_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28243-0

  • Online ISBN: 978-3-031-28244-7

