Article

Simple BM25 extension to multiple weighted fields

Authors:

Stephen Robertson,

Michael Taylor

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

Pages 42 - 49

https://doi.org/10.1145/1031171.1031181

Published: 13 November 2004

Abstract

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
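The difference between the two orderings described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the abstract does not give the formula, so the saturation function below is the commonly cited BM25 term-frequency component, and the parameter values `k1=1.2` and `b=0.75`, the omission of IDF, the toy document, and all function names (`saturate`, `score_linear`, `score_weighted_tf`, `expand`) are assumptions made for the example.

```python
from collections import Counter

def saturate(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Saturating BM25 term-frequency component (IDF omitted for brevity).

    k1 and b are assumed default values, not taken from the paper.
    """
    if tf == 0:
        return 0.0
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)

def score_linear(doc, weights, query, avg_lens):
    """Baseline: saturate within each field, then add the weighted field scores."""
    total = 0.0
    for field, tokens in doc.items():
        counts = Counter(tokens)
        for term in query:
            total += weights[field] * saturate(counts[term], len(tokens), avg_lens[field])
    return total

def score_weighted_tf(doc, weights, query, avg_len):
    """Alternative order: weight the term frequencies, then saturate once per term."""
    counts = Counter()
    doc_len = 0
    for field, tokens in doc.items():
        doc_len += weights[field] * len(tokens)
        for token in tokens:
            counts[token] += weights[field]
    return sum(saturate(counts[term], doc_len, avg_len) for term in query)

def expand(doc, weights):
    """Map a structured document to an unstructured one by repeating each field.

    Only meaningful for non-negative integer weights.
    """
    tokens = []
    for field, field_tokens in doc.items():
        tokens.extend(field_tokens * weights[field])
    return {"text": tokens}

# Hypothetical toy document with a title weight of two.
doc = {
    "title": ["bm25", "ranking"],
    "body": ["bm25", "is", "a", "ranking", "function", "bm25"],
}
weights = {"title": 2, "body": 1}
query = ["bm25", "ranking"]

structured = score_weighted_tf(doc, weights, query, avg_len=10.0)
flattened = score_weighted_tf(expand(doc, weights), {"text": 1}, query, avg_len=10.0)
baseline = score_linear(doc, weights, query, avg_lens={"title": 2.0, "body": 6.0})
print(structured, flattened, baseline)
```

Under these assumptions, `structured` and `flattened` coincide, which mirrors the abstract's description of a title weight of two as repeating the title content twice. The weighted-frequency score also stays below the ceiling of `k1 + 1` per query term, whereas the linear baseline can exceed it because each field saturates separately before the scores are summed.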

Index Terms

Simple BM25 extension to multiple weighted fields
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

November 2004

678 pages

ISBN:1581138741

DOI:10.1145/1031171

General Chair:
David Grossman, Illinois Institute of Technology

Program Chairs:
Luis Gravano, Columbia University
ChengXiang Zhai, University of Illinois at Urbana-Champaign
Otthein Herzog, University of Bremen, Germany
David A. Evans, Clairvoyance Corporation

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM04: Conference on Information and Knowledge Management

November 8 - 13, 2004

Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 457
Total Downloads: 3,784
Downloads (Last 12 months): 296
Downloads (Last 6 weeks): 29

Reflects downloads up to 11 Feb 2025

Cited By

Sun S, Wang Y, Cheng J, Xiao Z, Zheng D, Hao X (2025). Improving the Performance of Answers Ranking in Q&A Communities: A Long-Text Matching Technique and Pre-Trained Model. IEEE Access 13, pages 4188-4200. Online publication date: 2025.
https://doi.org/10.1109/ACCESS.2024.3521999

Kalogeropoulos N, Ioannou D, Stathopoulos D, Makris C (2024). On Embedding Implementations in Text Ranking and Classification Employing Graphs. Electronics 13:10 (1897). Online publication date: 12-May-2024.
https://doi.org/10.3390/electronics13101897

Aftab W, Apostolou Z, Bouazoune K, Straub T (2024). Optimizing biomedical information retrieval with a keyword frequency-driven prompt enhancement strategy. BMC Bioinformatics 25:1. Online publication date: 27-Aug-2024.
https://doi.org/10.1186/s12859-024-05902-7

Wu X, Li H, Yoshioka N, Washizaki H, Khomh F (2024). Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 114-125. Online publication date: 12-Mar-2024.
https://doi.org/10.1109/SANER60148.2024.00019

Mao K, Zhao Q (2024). PIM-ST: a New Paraphrase Identification Model Incorporating Sequence and Topic Information. 2024 4th International Symposium on Computer Technology and Information Science (ISCTIS), pages 894-898. Online publication date: 12-Jul-2024.
https://doi.org/10.1109/ISCTIS63324.2024.10699008

Licato J, Fields L, Steinle S, Hollis B (2024). Shot Selection to Determine Legality of Actions in a Rule-Heavy Environment. 2024 IEEE Conference on Games (CoG), pages 1-8. Online publication date: 5-Aug-2024.
https://doi.org/10.1109/CoG60054.2024.10645668

Zeng J, Zheng R, Wang C, Xue W, Yu X, Zhang T (2024). Enhancing Large Model Document Question Answering Through Retrieval Augmentation. 2024 International Conference on Artificial Intelligence and Power Systems (AIPS), pages 249-253. Online publication date: 19-Apr-2024.
https://doi.org/10.1109/AIPS64124.2024.00057

Zhao X, Zhu B, Liu X (2024). Research on Long Text Similarity Calculation Method Based on TextRank and BERT. 2024 4th Asia Conference on Information Engineering (ACIE), pages 128-132. Online publication date: 26-Jan-2024.
https://doi.org/10.1109/ACIE61839.2024.00028

Yang H (2024). Optimized English Translation System Using Multi-Level Semantic Extraction and Text Matching. IEEE Access 12, pages 96527-96536. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3426652

Farshidi S, Rezaee K, Mazaheri S, Rahimi A, Dadashzadeh A, Ziabakhsh M, Eskandari S, Jansen S (2024). Understanding user intent modeling for conversational recommender systems: a systematic literature review. User Modeling and User-Adapted Interaction. Online publication date: 6-Jun-2024.
https://doi.org/10.1007/s11257-024-09398-x
