DOI: 10.1145/3627673.3679975
Short paper · Open access

Over-penalization for Extra Information in Neural IR Models

Published: 21 October 2024

Abstract

This paper presents our analysis of neural IR models, focusing in particular on over-penalization for extra information (OPEX): a phenomenon in which the addition of a sentence to a document causes an unreasonable decline in the document's rank. We found that neural IR models suffer from OPEX, especially when the added sentence is similar to the other sentences in the document. To mitigate OPEX, we propose applying a window-based scoring approach that segments a document and aggregates the segment scores to compute the overall document score. We theoretically proved that window-based scoring fully suppresses OPEX in the extreme case where each segment contains only a single sentence, and empirically showed that the approach mitigates OPEX. The code is available at https://github.com/argonism/OPEX.
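To make the scoring scheme concrete, the following is a minimal Python sketch of window-based document scoring as described in the abstract. The sliding-window segmentation, the max aggregator, and the `score_fn` interface are assumptions made for illustration; the paper's exact segmentation and aggregation choices may differ.

```python
from typing import Callable, List

def window_score(
    query: str,
    sentences: List[str],                    # document pre-split into sentences
    score_fn: Callable[[str, str], float],   # neural relevance scorer, e.g. a cross-encoder
    window_size: int = 3,
    stride: int = 1,
) -> float:
    """Score a document as an aggregate over sliding-window segments.

    With window_size=1 each segment is a single sentence, the extreme
    case for which the paper proves OPEX is fully suppressed.
    """
    # Build overlapping segments of `window_size` consecutive sentences.
    last_start = max(len(sentences) - window_size, 0)
    segments = [
        " ".join(sentences[i : i + window_size])
        for i in range(0, last_start + 1, stride)
    ]
    # Max aggregation: appending a sentence to the document only adds new
    # segments and leaves existing ones unchanged, so the maximum segment
    # score (and hence the document score) can never decrease.
    return max(score_fn(query, seg) for seg in segments)
```

Max is used here because it makes the no-decrease property immediate, mirroring MaxP-style passage-score aggregation; sum or mean aggregation would reintroduce a penalty for low-scoring extra segments.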


Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, October 2024, 5705 pages
ISBN: 9798400704369
DOI: 10.1145/3627673
License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. ad-hoc retrieval
2. behavior analysis
3. neural IR model