
Retrieve-and-Rank End-to-End Summarization of Biomedical Studies

Published: 27 October 2023

Abstract

A demanding biomedical task is to condense the evidence from multiple interrelated studies, given an input context, in order to autonomously generate reviews or answers. We name this task context-aware multi-document summarization (CA-MDS). Existing state-of-the-art (SOTA) solutions must truncate the input because of their high memory demands, losing meaningful content. To address this issue, we propose Ramses, a novel retrieve-and-rank approach to end-to-end summarization. The model learns to (i) index each document by modeling its semantic features, (ii) retrieve the most relevant ones, and (iii) generate a summary via token probability marginalization. To facilitate evaluation, we introduce FAQsumC19, a new dataset that requires synthesizing multiple supporting papers to answer questions about Covid-19. Our experiments show that Ramses achieves notably higher ROUGE scores than state-of-the-art methods and establishes a new SOTA for generating systematic literature reviews on Ms2. Human evaluation indicates that our model produces more informative responses than previous leading approaches.
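The pipeline described in the abstract (dense indexing of documents, top-k retrieval, and generation by marginalizing token probabilities over the retrieved documents) can be illustrated with a minimal, self-contained sketch. The snippet below is not the authors' Ramses implementation: the hashing encoder, the toy vocabulary, and the toy_token_dist generator are hypothetical stand-ins used only to show how retrieval scores can weight per-document token distributions in a RAG-style mixture.

```python
# Minimal sketch (NOT the authors' code) of the three steps in the abstract:
# (i) index documents as dense vectors, (ii) retrieve the top-k most relevant
# ones for a query/context, (iii) generate the next token by marginalizing
# token probabilities over the retrieved documents.
import numpy as np

VOCAB = ["covid", "vaccine", "trial", "efficacy", "summary", "<eos>"]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words hashing encoder (stand-in for a learned neural encoder)."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def index_corpus(docs):
    """Step (i): build a dense index of document embeddings."""
    return np.stack([embed(d) for d in docs])

def retrieve(query: str, doc_index: np.ndarray, k: int = 2):
    """Step (ii): score documents by inner product and keep the top-k."""
    scores = doc_index @ embed(query)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def toy_token_dist(context: str, doc: str) -> np.ndarray:
    """Stand-in generator: next-token distribution conditioned on (context, doc)."""
    logits = embed(context + " " + doc, dim=len(VOCAB))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def marginalized_next_token(context, docs, top_ids, top_scores):
    """Step (iii): p(y | x) = sum_k p(d_k | x) * p(y | x, d_k)."""
    doc_probs = np.exp(top_scores) / np.exp(top_scores).sum()   # p(d_k | x)
    mixture = np.zeros(len(VOCAB))
    for weight, d_id in zip(doc_probs, top_ids):
        mixture += weight * toy_token_dist(context, docs[d_id])  # weighted sum
    return VOCAB[int(np.argmax(mixture))], mixture

if __name__ == "__main__":
    corpus = [
        "randomized trial of covid vaccine efficacy",
        "observational study of vaccine side effects",
        "review of covid transmission in schools",
    ]
    question = "what is the efficacy of covid vaccines?"
    index = index_corpus(corpus)
    ids, scores = retrieve(question, index, k=2)
    token, dist = marginalized_next_token(question, corpus, ids, scores)
    print("retrieved:", ids.tolist(), "next token:", token)
```

In a trained system the toy encoder and generator would be replaced by neural retriever and decoder components, and the mixture weights over retrieved documents would be learned end to end together with generation rather than taken from a fixed scoring rule.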

References

[1]
Amplayo, R.K., Lapata, M.: Informative and controllable opinion summarization. In: EACL, Online, April 19–23 2021, pp. 2662–2672. ACL (2021)
[2]
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer. CoRR abs/2004.05150 (2020)
[3]
Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., et al.: Improving language models by retrieving from trillions of tokens. In: ICML. PMLR, vol. 162, pp. 2206–2240. PMLR (2022)
[4]
Cerroni, W., Moro, G., Pasolini, R., Ramilli, M.: Decentralized detection of network attacks through P2P data clustering of SNMP data. Comput. Secur. 52, 1–16 (2015)
[5]
Cerroni, W., Moro, G., Pirini, T., Ramilli, M.: Peer-to-peer data mining classifiers for decentralized detection of network attacks. In: ADC. CRPIT, vol. 137, pp. 101–108. ACS (2013)
[6]
Chen, Q., Allot, A., Lu, Z.: Litcovid: an open database of COVID-19 literature. Nucleic Acids Res. 49(Database-Issue), D1534–D1540 (2021)
[7]
DeYoung, J., Beltagy, I., van Zuylen, M., Kuehl, B., et al.: MS^2: multi-document summarization of medical studies. In: EMNLP, Punta Cana, 7–11 November 2021, pp. 7494–7513. ACL (2021)
[8]
Domeniconi, G., Moro, G., Pagliarani, A., Pasolini, R.: On deep learning in cross-domain sentiment classification. In: IC3K (Volume 1), Funchal, Madeira, Portugal, November 1–3, 2017, pp. 50–60. SciTePress (2017).
[9]
Fabbri, A.R., Kryscinski, W., McCann, B., Xiong, C., et al.: SummEval: re-evaluating summarization evaluation. TACL 9, 391–409 (2021)
[10]
Formal, T., Piwowarski, B., Clinchant, S.: Match your words! A study of lexical matching in neural information retrieval. In: Hagen, M., et al. (eds.) Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, pp. 120–127. Springer, Cham (2022)
[11]
Frisoni, G., Italiani, P., Salvatori, S., Moro, G.: Cogito Ergo Summ: Abstractive Summarization of Biomedical Papers via Semantic Parsing Graphs and Consistency Rewards. In: AAAI 2023, Washington, DC, USA, February 7–14, 2023. AAAI Press, Washington, DC, USA (2023)
[12]
Frisoni, G., Mizutani, M., Moro, G., Valgimigli, L.: Bioreader: a retrieval-enhanced text-to-text transformer for biomedical literature. In: EMNLP 2022, pp. 5770–5793. ACL, Abu Dhabi, United Arab Emirates (2022)
[13]
Hammoudi, S., Quix, C., Bernardino, J. (eds.): Data Management Technologies and Applications: 9th International Conference, DATA 2020, Virtual Event, July 7–9, 2020, Revised Selected Papers. Springer, Cham (2021)
[14]
Frisoni, G., Moro, G., Carbonaro, A.: Learning interpretable and statistically significant knowledge from unlabeled corpora of social text messages: a novel methodology of descriptive text mining. In: DATA, pp. 121–134. SciTePress (2020)
[15]
Frisoni, G., Moro, G., Carbonaro, A.: A survey on event extraction for natural language understanding: riding the biomedical literature wave. IEEE Access 9, 160721–160757 (2021)
[16]
Grusky, M., Naaman, M., Artzi, Y.: Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In: NAACL (Long Papers), pp. 708–719. ACL, New Orleans, Louisiana (2018).
[17]
Hokamp, C., Ghalandari, D.G., Pham, N.T., Glover, J.: Dyne: Dynamic ensemble decoding for multi-document summarization. CoRR abs/2006.08748 (2020)
[18]
Izacard, G., Grave, E.: Leveraging passage retrieval with generative models for open domain question answering. In: EACL: Main Volume, pp. 874–880. ACL, Online (2021).
[19]
Jin, H., Wang, T., Wan, X.: Multi-granularity interaction network for extractive and abstractive multi-document summarization. In: ACL, Online, July 5–10 2020, pp. 6244–6254. ACL (2020).
[20]
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., et al.: Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)
[21]
Karpukhin, V., Oguz, B., Min, S., Lewis, P.S.H., et al.: Dense passage retrieval for open-domain question answering. In: EMNLP 2020, Online, November 16–20, 2020, pp. 6769–6781. ACL (2020).
[22]
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL, July 5–10 2020, pp. 7871–7880 (2020).
[23]
Lewis, P.S.H., Perez, E., Piktus, A., Petroni, F., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: NeurIPS 2020, December 6–12, 2020, virtual (2020)
[24]
Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. ACL, Barcelona, Spain (2004)
[25]
Liu, Y., Lapata, M.: Hierarchical transformers for multi-document summarization. In: ACL, Florence, Italy, July 28- August 2 2019, pp. 5070–5081. ACL (2019).
[26]
Lodi, S., Moro, G., Sartori, C.: Distributed data clustering in multi-dimensional peer-to-peer networks. In: (ADC), Brisbane, 18–22 January, 2010. CRPIT, vol. 104, pp. 171–178. ACS (2010)
[27]
Möller, T., Reina, A., Jayakumar, R., Pietsch, M.: Covid-qa: a question answering dataset for Covid-19 (2020)
[28]
Moro, G., Masseroli, M.: Gene function finding through cross-organism ensemble learning. BioData Min. 14(1), 14 (2021)
[29]
Moro, G., Piscaglia, N., Ragazzi, L., Italiani, P.: Multi-language transfer learning for low-resource legal case summarization. Artif. Intell. Law 31 (2023)
[30]
Moro, G., Ragazzi, L.: Semantic self-segmentation for abstractive summarization of long documents in low-resource regimes. In: AAAI 2022, Virtual Event, February 22 - March 1, 2022, pp. 11085–11093. AAAI Press (2022). www.ojs.aaai.org/index.php/AAAI/article/view/21357
[31]
Moro, G., Ragazzi, L.: Align-then-abstract representation learning for low-resource summarization. Neurocomputing 548 (2023)
[32]
Moro, G., Ragazzi, L., Valgimigli, L.: Carburacy: summarization models tuning and comparison in eco-sustainable regimes with a novel carbon-aware accuracy. In: AAAI 2023, vol. 37, no. 12, pp. 14417–14425 (2023)
[33]
Moro, G., Ragazzi, L., Valgimigli, L.: Graph-based abstractive summarization of extracted essential knowledge for low-resource scenario. In: ECAI 2023, Kraków, Poland, September 30 - October 4, 2023, pp. 1–9 (2023)
[34]
Moro, G., Ragazzi, L., Valgimigli, L., Freddi, D.: Discriminative marginalized probabilistic neural method for multi-document summarization of medical literature. In: ACL, pp. 180–189. ACL, Dublin, Ireland (May 2022).
[35]
Moro, G., Ragazzi, L., Valgimigli, L., Frisoni, G., Sartori, C., Marfia, G.: Efficient memory-enhanced transformer for long-document summarization in low-resource regimes. Sensors 23(7) (2023). www.mdpi.com/1424-8220/23/7/3542
[36]
Moro, G., Salvatori, S.: Deep vision-language model for efficient multi-modal similarity search in fashion retrieval, pp. 40–53 (09 2022).
[37]
Moro, G., Salvatori, S., Frisoni, G.: Efficient text-image semantic search: A multi-modal vision-language approach for fashion retrieval. Neurocomputing 538, 126196 (2023).
[38]
Moro, G., Valgimigli, L.: Efficient self-supervised metric information retrieval: A bibliography based method applied to COVID literature. Sensors 21(19) (2021).
[39]
Papanikolaou, Y., Bennett, F.: Slot filling for biomedical information extraction. CoRR abs/2109.08564 (2021)
[40]
Poliak, A., Fleming, M., Costello, C., Murray, K.W., et al.: Collecting verified COVID-19 question answer pairs. In: NLP4COVIDEMNLP. ACL (2020)
[41]
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: Unanswerable questions for squad. In: ACL 2018, Melbourne, Australia, July 15–20, 2018, pp. 784–789. ACL (2018).
[42]
Ren, R., Lv, S., Qu, Y., Liu, J., et al.: PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval. In: ACL/IJCNLP (Findings). Findings of ACL, vol. ACL/IJCNLP 2021, pp. 2173–2183. Association for Computational Linguistics (2021)
[43]
Ren, R., Qu, Y., Liu, J., Zhao, W.X., et al.: Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking. In: EMNLP (1), pp. 2825–2835. ACL (2021)
[44]
Croft, B.W., van Rijsbergen, C.J. (eds.): SIGIR ’94. Springer, London (1994)
[45]
Sun, S., Sedoc, J.: An analysis of BERT FAQ retrieval models for Covid-19 Infobot (2020)
[46]
Vig, J., Fabbri, A.R., Kryscinski, W., Wu, C., et al.: Exploring neural models for query-focused summarization. In: NAACL 2022, Seattle, WA, United States, July 10–15, 2022, pp. 1455–1468. ACL (2022).
[47]
Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., et al.: CORD-19: the Covid-19 open research dataset. CoRR abs/2004.10706 (2020)
[48]
Wei, J.W., Huang, C., Vosoughi, S., Wei, J.: What are people asking about Covid-19? A question classification dataset. CoRR abs/2005.12522 (2020)
[49]
Xiao, W., Beltagy, I., Carenini, G., Cohan, A.: PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. In: ACL, pp. 5245–5263. ACL, Dublin (2022).
[50]
Zhang, J., Zhao, Y., Saleh, M., Liu, P.J.: PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: ICML, 13–18 July 2020. vol. 119, pp. 11328–11339. PMLR (2020)
[51]
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., et al.: Bertscore: Evaluating text generation with BERT. In: ICLR, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020)
[52]
Zhang, X.F., Sun, H., Yue, X., Lin, S.M., et al.: COUGH: A challenge dataset and models for COVID-19 FAQ retrieval. In: EMNLP 2021, Virtual Event, 7–11 November, 2021, pp. 3759–3769. ACL (2021)


Published In

Similarity Search and Applications: 16th International Conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, Proceedings
Oct 2023
324 pages
ISBN:978-3-031-46993-0
DOI:10.1007/978-3-031-46994-7

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 27 October 2023

Author Tags

  1. Biomedical Multi-Document Summarization
  2. Neural Semantic Representation
  3. End-to-End Neural Retriever
