The Power of Selecting Key Blocks with Local Pre-ranking for Long Document Information Retrieval

Published: 07 February 2023

Abstract

On a wide range of natural language processing and information retrieval tasks, transformer-based models, particularly pre-trained language models such as BERT, have demonstrated tremendous effectiveness. Due to the quadratic complexity of the self-attention mechanism, however, such models have difficulty processing long documents. Recent works dealing with this issue include truncating long documents, in which case potentially relevant information is lost; segmenting them into several passages, which may cause some information to be missed and incurs high computational complexity when the number of passages is large; or modifying the self-attention mechanism to make it sparser, as in sparse-attention models, again at the risk of missing some information. We follow here a slightly different approach, in which one first selects key blocks of a long document by local query-block pre-ranking, and then aggregates a few of these blocks to form a short document that can be processed by a model such as BERT. Experiments conducted on standard Information Retrieval datasets demonstrate the effectiveness of the proposed approach.
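
To make the select-then-aggregate idea concrete, the sketch below illustrates the pipeline described in the abstract; it is not the authors' actual implementation. The fixed block size, the simple term-frequency pre-ranking score, the number of selected blocks k, and the cross-encoder checkpoint are all illustrative assumptions. A document is split into blocks, the blocks are pre-ranked locally against the query, the top k blocks are concatenated in their original order, and only this short document is passed to a BERT-style re-ranker.

```python
from collections import Counter

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical re-ranker checkpoint; the paper fine-tunes its own BERT model.
MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()


def split_into_blocks(document: str, block_size: int = 64) -> list:
    """Split a long document into consecutive blocks of roughly block_size words."""
    words = document.split()
    return [" ".join(words[i:i + block_size]) for i in range(0, len(words), block_size)]


def local_block_score(query: str, block: str) -> float:
    """Cheap local query-block pre-ranking score: query-term frequency in the block."""
    term_freq = Counter(block.lower().split())
    return float(sum(term_freq[t] for t in set(query.lower().split())))


def score_long_document(query: str, document: str, k: int = 4) -> float:
    """Select the k best blocks, aggregate them, and score with a BERT cross-encoder."""
    blocks = split_into_blocks(document)
    # Keep the k highest-scoring blocks, restored to their original document order.
    top = sorted(range(len(blocks)),
                 key=lambda i: local_block_score(query, blocks[i]),
                 reverse=True)[:k]
    short_document = " ".join(blocks[i] for i in sorted(top))
    inputs = tokenizer(query, short_document, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```

The design point is that the transformer only ever sees the aggregated short document, so its input length stays bounded regardless of how long the original document is.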

      Published In

      ACM Transactions on Information Systems, Volume 41, Issue 3
      July 2023
      890 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/3582880

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 07 February 2023
      Online AM: 17 October 2022
      Accepted: 29 September 2022
      Revised: 30 July 2022
      Received: 03 December 2021
      Published in TOIS Volume 41, Issue 3

      Author Tags

      1. BERT-based language models
      2. long-document neural information retrieval

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • MIAI@Grenoble Alpes
      • Chinese Scholarship Council (CSC)

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 136
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 31 Jan 2025

      Cited By

      • (2024) Leveraging Pretrained Language Models for Enhanced Entity Matching. International Journal of Intelligent Systems, 2024. DOI: 10.1155/2024/1941221. Online publication date: 15-Apr-2024.
      • (2024) Hybrid Prompt Learning for Generating Justifications of Security Risks in Automation Rules. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3675401. Online publication date: 29-Jun-2024.
      • (2024) Social Perception with Graph Attention Network for Recommendation. ACM Transactions on Recommender Systems. DOI: 10.1145/3665503. Online publication date: 21-May-2024.
      • (2024) DHyper: A Recurrent Dual Hypergraph Neural Network for Event Prediction in Temporal Knowledge Graphs. ACM Transactions on Information Systems 42, 5, 1–23. DOI: 10.1145/3653015. Online publication date: 29-Apr-2024.
      • (2024) Personalized Cadence Awareness for Next Basket Recommendation. ACM Transactions on Recommender Systems. DOI: 10.1145/3652863. Online publication date: 21-Mar-2024.
      • (2024) CoBjeason: Reasoning Covered Object in Image by Multi-Agent Collaboration Based on Informed Knowledge Graph. ACM Transactions on Knowledge Discovery from Data 18, 5, 1–56. DOI: 10.1145/3643565. Online publication date: 28-Feb-2024.
      • (2024) MCRPL: A Pretrain, Prompt, and Fine-tune Paradigm for Non-overlapping Many-to-one Cross-domain Recommendation. ACM Transactions on Information Systems 42, 4, 1–24. DOI: 10.1145/3641860. Online publication date: 9-Feb-2024.
      • (2024) Data Augmentation for Sample Efficient and Robust Document Ranking. ACM Transactions on Information Systems 42, 5, 1–29. DOI: 10.1145/3634911. Online publication date: 29-Apr-2024.
      • (2024) Cross-domain Recommendation via Dual Adversarial Adaptation. ACM Transactions on Information Systems 42, 3, 1–26. DOI: 10.1145/3632524. Online publication date: 22-Jan-2024.
      • (2024) Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders. ACM Transactions on Information Systems 42, 5, 1–34. DOI: 10.1145/3631939. Online publication date: 29-Apr-2024.
