
Automatic and Analytical Field Weighting for Structured Document Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13980)


Abstract

Probabilistic models such as BM25 and LM have established themselves as the standard in atomic retrieval. In structured document retrieval (SDR), BM25F could be considered the most established model. However, without optimization BM25F does not benefit from the document structure. The main contribution of this paper is a new field weighting method, denoted Information Content Field Weighting (ICFW). It applies weights over the structure without optimization and overcomes issues faced by some existing SDR models, most notably the issue of saturating term frequency across fields. ICFW is similar to BM25 and LM in its analytical grounding and transparency, making it a potential new candidate for a standard SDR model. For an optimised retrieval scenario, ICFW does as well as, or better than, the baselines. More interestingly, for a non-optimised retrieval scenario we observe a considerable increase in performance. Extensive analysis is performed to understand and explain the underlying reasons for this increase.
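
The general idea can be illustrated with a rough sketch: field weights are derived automatically from collection statistics rather than tuned, and term-frequency saturation is applied within each field separately, one possible reading of avoiding the cross-field saturation issue mentioned above. The toy collection, the function names and the specific weighting formula below are illustrative assumptions, not the paper's ICFW definition (the actual implementation is linked in note 1).

import math
from collections import Counter

# Hypothetical toy collection of fielded documents (illustrative only).
DOCS = [
    {"title": "red power drill", "description": "cordless drill with two batteries"},
    {"title": "hammer set", "description": "claw hammer and rubber mallet"},
    {"title": "drill bits", "description": "titanium drill bits for metal and wood"},
]
FIELDS = ["title", "description"]

def tokenize(text):
    return text.lower().split()

def ic_field_weight(term, field, docs, eps=0.5):
    """Assumed information-content style weight: negative log of the term's
    smoothed relative frequency in this field across the collection, so a term
    that is rare in a field receives a larger weight for that field."""
    tokens = [tok for d in docs for tok in tokenize(d.get(field, ""))]
    tf = sum(1 for tok in tokens if tok == term)
    return -math.log((tf + eps) / (len(tokens) + eps))

def score(query, doc, docs, k1=1.2):
    """Fielded scoring sketch: automatic field weights combined with a
    BM25-style saturation of the term frequency within each field."""
    total = 0.0
    for term in tokenize(query):
        for field in FIELDS:
            tf = Counter(tokenize(doc.get(field, "")))[term]
            if tf == 0:
                continue
            total += ic_field_weight(term, field, docs) * (tf * (k1 + 1)) / (tf + k1)
    return total

if __name__ == "__main__":
    for doc in sorted(DOCS, key=lambda d: score("drill", d, DOCS), reverse=True):
        print(round(score("drill", doc, DOCS), 3), doc["title"])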


Notes

  1. https://github.com/TuomasKetola/icfw-for-SDR.

  2. https://www.kaggle.com/c/home-depot-product-search-relevance.

  3. https://trec.nist.gov/data/t8.web.html.


Acknowledgements

We would like to thank the reviewers for their comments, in particular regarding the presentation of the proposed models and candidate models, as well as the methodology of significance testing.

Author information

Corresponding author

Correspondence to Tuomas Ketola.


A Scale Parameter Threshold

The underlying idea of the scale threshold theorem is that there exists a threshold for \(\lambda \), above which the model satisfies the term distinctiveness constraint.

Let \(q=\{t_1,\ldots ,t_n\}\) be a query, d be a document with \(n(t_a,f_i,d)\) occurrences of query term \(t_a\) in field \(f_i\) and \(n(t_b,\overline{f},d)\) occurrences of query term \(t_b\) in an average field \(\overline{f}\). Let \(\overline{d}\) be an amended version of document d where the occurrences of \(t_b\) are replaced with occurrences of \(t_a\).

Theorem 1

Given terms \(t_a\) and \(t_b\), if \(\lambda \!>\! \lambda _{threshold }\), then \(RSV (d)\!>\! RSV (\overline{d})\).

(24)

Proof

Following Definition 8 for \(\lambda _{threshold }\), the inequality becomes:

$$\begin{aligned} \lambda > \frac{\log \frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})|\overline{F}|^{Zx}}{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}|\overline{F}|} }{\log \frac{m^{Z+1}{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})^{Z(x+1)}}{m^{Z(x+1)}{\text {ff}}({\text {t}_{\text {a}}}, {d})^{Z+1}} } \end{aligned}$$
(25)

Considering the numerator first:

$$\begin{aligned} \log \frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})|\overline{F}|^{Zx}}{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}|\overline{F}|} = \log \frac{\frac{{\text {df}}(\text {t}_{\text {b}}, \overline{F})}{|\overline{F}|}}{\frac{{\text {df}}(\text {t}_{\text {a}}, \overline{F})^{Zx}}{|\overline{F}|^{Zx}}} \end{aligned}$$
(26)

Following Eq. (4) for the definition of probabilities and Eq. (26) we obtain,

$$\begin{aligned} \log \frac{P(\text {t}_{\text {b}},\overline{f}|\overline{F})}{P(\text {t}_{\text {a}},\overline{f}|\overline{F})^{Zx}} = \log P(\text {t}_{\text {b}},\overline{f}|\overline{F}) - Zx\log P(\text {t}_{\text {a}},\overline{f}|\overline{F}) \end{aligned}$$
(27)
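
The step from Eq. (26) to Eq. (27) relies on Eq. (4), which is not reproduced in this excerpt; assuming it defines the collection-level probability as a relative document frequency,

$$\begin{aligned} P(\text {t},\overline{f}|\overline{F}) = \frac{{\text {df}}(\text {t}, \overline{F})}{|\overline{F}|} \end{aligned}$$

the two fractions in Eq. (26) are exactly these probabilities for \(\text {t}_{\text {b}}\) and \(\text {t}_{\text {a}}\), raised to the powers 1 and Zx respectively.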

Following Definition 3 we can re-write Eq. (27) to obtain,

$$\begin{aligned}&\log P(\text {t}_{\text {b}},\overline{f}|\overline{F}) - Zx\log P(\text {t}_{\text {a}},\overline{f}|\overline{F}) = Zx{\text {ICF}}(\overline{f},\overline{d}) - {\text {ICF}}(\overline{f},d) \end{aligned}$$
(28)
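
Here Definition 3, also not reproduced in this excerpt, is assumed to define the information content as a negative log-probability, i.e. \({\text {ICF}}(\overline{f},d) = -\log P(\text {t}_{\text {b}},\overline{f}|\overline{F})\) and \({\text {ICF}}(\overline{f},\overline{d}) = -\log P(\text {t}_{\text {a}},\overline{f}|\overline{F})\), the latter because in \(\overline{d}\) the occurrences of \(\text {t}_{\text {b}}\) are replaced by \(\text {t}_{\text {a}}\). Under this assumption Eq. (28) is a direct rewriting of Eq. (27); the analogous reading of \({\text {ICD}}\) as a negative log of the within-document probabilities is used for Eq. (31) below.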

Moving onto the denominator,

$$\begin{aligned} \log \frac{m^{Z+1}{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})^{Z(x+1)}}{m^{Z(x+1)}{\text {ff}}({\text {t}_{\text {a}}}, {d})^{Z+1}} = \log \frac{[\frac{{\text {ff}}({\text {t}_{\text {a}}}, {\overline{d}})}{m}]^{Z(x+1)}}{[\frac{{\text {ff}}({\text {t}_{\text {a}}}, {d})}{m}]^{Z+1}} \end{aligned}$$
(29)

Inserting Eq. (5) to Eq. (29) and transforming the log expression we obtain,

$$\begin{aligned} \log \frac{P(f_i|\overline{d})^{Z(x+1)}}{P(f_i|d)^{Z+1}} =Z(x+1)\log P(\text {t}_{\text {a}},f_i|\overline{d}) - Z\log P(\text {t}_{\text {a}},f_i|d) - \log P(\text {t}_{\text {a}},\overline{f}|d) \end{aligned}$$
(30)
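
Analogously, Eq. (5) is not reproduced here; under the assumption that it defines the within-document probability as a normalised field frequency, for example

$$\begin{aligned} P(\text {t}_{\text {a}},f_i|d) = \frac{{\text {ff}}({\text {t}_{\text {a}}}, {d})}{m} \end{aligned}$$

the bracketed ratios in Eq. (29) can be read as such probabilities, with the exponent \(Z+1\) then split over the specific field \(f_i\) and the average field \(\overline{f}\) as on the right-hand side of Eq. (30).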

Following Definition 3 we can re-write Eq. (30) to obtain

$$\begin{aligned}&\log \frac{P(f_i|\overline{d})^{Z(x+1)}}{P(f_i|d)^{Z+1}} = - Z(x+1){\text {ICD}}(f_i,\overline{d}) + Z{\text {ICD}}(f_i,d) + {\text {ICD}}(\overline{f},d) \end{aligned}$$
(31)

Inserting Eqs. (28) and (31) to Eq. (25) and solving for Z we obtain,

$$\begin{aligned}&Z < \frac{{\text {ICF}}(\overline{f},d) + \lambda {\text {ICD}}(\overline{f},d)}{\lambda (x+1){\text {ICD}}(f_i,\overline{d}) - \lambda {\text {ICD}}(f_i,d) + x{\text {ICF}}(\overline{f},\overline{d})} \end{aligned}$$
(32)
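
For completeness, the intermediate step runs as follows, under the assumption that the denominator of Eq. (25), i.e. the expression in Eq. (31), and the resulting coefficient of Z are positive, so that the inequality direction is preserved:

$$\begin{aligned} \lambda \big (Z{\text {ICD}}(f_i,d) - Z(x+1){\text {ICD}}(f_i,\overline{d}) + {\text {ICD}}(\overline{f},d)\big ) > Zx{\text {ICF}}(\overline{f},\overline{d}) - {\text {ICF}}(\overline{f},d) \end{aligned}$$

Collecting the terms in Z on one side gives

$$\begin{aligned} Z\big (\lambda (x+1){\text {ICD}}(f_i,\overline{d}) - \lambda {\text {ICD}}(f_i,d) + x{\text {ICF}}(\overline{f},\overline{d})\big ) < {\text {ICF}}(\overline{f},d) + \lambda {\text {ICD}}(\overline{f},d) \end{aligned}$$

and dividing by the bracketed coefficient yields Eq. (32).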

Expanding the denominator we obtain,

(33)

Following Eqs. (13), (14), (8) and (9), Eq. (33) is re-written:

$$\begin{aligned}&\frac{{\text {S}}_{{\text {contr}}}(\text {t}_{\text {a}},f_i,d)}{{\text {S}}_{{\text {contr}}}(\text {t}_{\text {b}},\overline{f},d)} < \frac{w_{{\text {icfw}}}({\overline{f}}, {d})}{w_{{\text {icfw}}}({f_i}, {\overline{d}}) + xw_{{\text {icfw}}}({\overline{f}}, {\overline{d}}) - w_{{\text {icfw}}}({f_i}, {d})} \end{aligned}$$
(34)

Rearranging Eq. (34) we obtain,

(35)

Given the term frequencies assumed in the theorem, the difference in retrieval score depends only on the score contributions of term \(\text {t}_{\text {a}}\) in field \(f_i\) and of term \(\text {t}_{\text {b}}\) in field \(\overline{f}\) for \(d\), and on the contributions of term \(\text {t}_{\text {a}}\) in field \(f_i\) and of term \(\text {t}_{\text {a}}\) in field \(\overline{f}\) for \(\overline{d}\). Following Definition 5 we rewrite Eq. (35) and obtain the inequality implied by the theorem.

   \(\square \)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ketola, T., Roelleke, T. (2023). Automatic and Analytical Field Weighting for Structured Document Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_31

  • DOI: https://doi.org/10.1007/978-3-031-28244-7_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28243-0

  • Online ISBN: 978-3-031-28244-7

