Long Context RAG Performance of Large Language Models

Michael Carbin
Databricks Mosaic Research
michael.carbin@databricks.com

Workshop on Adaptive Foundation Models, 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract
Retrieval Augmented Generation (RAG) has emerged as a crucial technique for
enhancing the accuracy of Large Language Models (LLMs) by incorporating
external information. With the advent of LLMs that support increasingly longer
context lengths, there is a growing interest in understanding how these models
perform in RAG scenarios. Can these new long context models improve RAG
performance? This paper presents a comprehensive study of the impact of increased
context length on RAG performance across 20 popular open source and commercial
LLMs. We ran RAG workflows while varying the total context length from 2,000
to 128,000 tokens (and 2 million tokens when possible) on three domain-specific
datasets, and report key insights on the benefits and limitations of long context
in RAG applications. Our findings reveal that while retrieving more documents
can improve performance, only a handful of the most recent state of the art LLMs
can maintain consistent accuracy at long context above 64k tokens. We also
identify distinct failure modes in long context scenarios, suggesting areas for future
research.
1 Introduction
The development of Large Language Models (LLMs) with increasingly longer context lengths has
opened new possibilities for Retrieval Augmented Generation (RAG) applications. Recent models
such as Anthropic Claude (200k tokens) [1], GPT-4-turbo (128k tokens) [2], OpenAI o1 (128k tokens)
[3], Llama 3 [4] and Google Gemini 1.5 Pro (2 million tokens) [5] have led to speculation about
whether long context models might eventually subsume traditional RAG workflows entirely. In this
study, we empirically investigate the impact of increased context length on RAG performance and
explore the limitations and challenges that arise in long context scenarios.
Figure 1: Long context RAG performance of o1, GPT-4, Claude 3/3.5, Gemini 1.5 (gemini-1.5-pro-001 and gemini-1.5-flash-001), Llama 3/3.1, Qwen 2, Mistral, and DBRX models on 3 curated RAG datasets (Databricks DocsQA, FinanceBench, and Natural Questions). All values can be found in Table S3. Model versions are listed in Table S1.

RAG can enhance the accuracy of LLMs by retrieving information from external sources, enabling users to incorporate task-specific or private data into their LLM workflows. Published results using RAG-like methods have demonstrated benefits across many applications [6], including machine translation [7], semantic parsing [8], question answering [9, 10, 11, 12], and open-ended text generation [13]. With longer context lengths, LLM developers can feed more documents into their RAG applications. While there has been recent speculation that long context LLMs will replace RAG entirely [14], in this paper we study whether long context LLMs can indeed be used effectively for RAG systems. How well do the best open source and commercial models perform on long-context RAG tasks?
In this study, we apply a standard RAG approach and evaluate the performance of 20 popular open
source and commercial LLMs with varying context lengths from 2,000 to 128,000 tokens (and 2
million tokens when possible). We then analyze distinct failure modes for different models across
long context RAG scenarios. We show that:
• Using longer context does not uniformly increase RAG performance. For the majority of models we evaluated, RAG performance first increases and then decreases as context length grows. Only a handful of the most recent state of the art LLMs can maintain consistent accuracy at long contexts above 64k tokens.
• LLMs fail at long context RAG in unique ways as a function of context length. While
some models tended to provide incorrect answers, others failed to follow instructions or
refused to answer due to perceived copyright concerns.
2 Related Work

Recent commercial LLMs have been extended to support much longer contexts. Open source models have followed a similar trend, with recent models
like Mixtral [18] and DBRX [19] supporting 32k tokens, and Llama 3.1 reaching 128k tokens.
However, recent studies have identified limitations in long context models. For example, the “lost in
the middle” paper [20] found that models struggle to retain and utilize information from the middle
portions of long texts, leading to performance degradation as context length increases. Similarly,
the RULER paper [21] found that the “effective context length” (usable context before performance
decreases) can be much shorter than the claimed maximum context length. Recent studies have also
tried to compare RAG to workflows where the entire corpus is included in the context window of the
LLM [22]. This has only been possible to do with the very recent state of the art models such as o1,
GPT-4o, Claude 3.5, Gemini 1.5, Qwen 2 72B and Llama 3.1 405B, and the jury is still out on whether
such an approach leads to accurate results and is cost effective. Other relevant studies and blogposts
include [23, 24, 14, 25, 26, 22, 27, 28]. Similar to our study, Jin et al. find that increasing the number of retrieved passages does not consistently improve RAG performance for Gemma-7B, Gemma-2-9B, and Mistral NeMo 12B, but does for Gemini 1.5 Pro [29]. Our concurrent work corroborates this across 20 closed and open source models.
3 Methodology
We conducted RAG experiments using 20 popular open source and commercial LLMs, and evaluated their performance on three datasets: Databricks DocsQA (a benchmark of technical questions and answers related to the Databricks platform), FinanceBench [30], and Natural Questions [31]. For the retrieval stage, we retrieved document chunks using the same embedding model across all settings (OpenAI text-embedding-3-large, https://openai.com/index/new-embedding-models-and-api-updates/, with a chunk size of 512 tokens and a stride of 256 tokens) and used FAISS (https://github.com/facebookresearch/faiss) with the IndexFlatL2 index as the vector store. These chunks were then inserted into the context window of a generative model.
We then evaluated how generation performance changes as a function of the number of retrieved
document chunks by varying the LLM context from 2,000 tokens to 128,000 tokens (and 2 million
tokens when possible). We evaluated the following models: o1-mini, o1-preview, Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, GPT-4o mini, GPT-4 Turbo, GPT-4, Llama 3.1 405B, Llama 3 70B, Llama 3.1 70B, Llama 3.1 8B, Qwen 2 72B, Mixtral 8x7B, DBRX, and GPT-3.5 Turbo. These models represent some of the most popular API-based and open source LLMs as of this writing. We used the June 2024 releases of Gemini 1.5 (gemini-1.5-pro-001 and gemini-1.5-flash-001, with 2 million token context windows) and Claude 3.5 Sonnet (claude-3-5-sonnet-20240620). A full list of the model versions used in this study can be found in Table S1.
For generation, we set the temperature to 0.0 and the maximum output sequence length to 1024.
We used a simple prompt template to combine the retrieved documents with the user query for
each model and dataset (Appendix E). The system had to correctly answer questions based on the
retrieved documents, and the answer was judged by a calibrated “LLM-as-a-judge” using GPT-4o
(see Appendix D for further details).
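The generation step can be sketched as follows. This is a simplified illustration rather than the exact harness we ran: the prompt template is abbreviated (the full templates are listed in Appendix E), and build_context is a hypothetical helper that fills the context window up to a token budget while reserving room for the prompt and generation.

```python
# Minimal sketch of the generation step; the exact prompt templates are listed in Appendix E.
# Assumes the client/enc objects and retrieved_chunks list from the retrieval sketch above.
PROMPT_TEMPLATE = (
    "You'll be provided with a question and several passages that might be relevant. "
    "Answer based only on the relevant passages.\n\n"
    "The provided passages as context:\n{context}\n\n"
    "The question to answer:\n{question}\n\nYour answer:"
)

def build_context(chunks, context_budget_tokens, reserve_tokens=1500):
    """Concatenate retrieved chunks until the token budget (minus a prompt/generation buffer) is filled."""
    budget = context_budget_tokens - reserve_tokens
    selected, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        selected.append(chunk)
        used += n
    return "\n\n".join(selected)

def answer(question, chunks, context_budget_tokens=16_000, model="gpt-4o-2024-05-13"):
    prompt = PROMPT_TEMPLATE.format(
        context=build_context(chunks, context_budget_tokens), question=question
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,   # deterministic generation, as in our experiments
        max_tokens=1024,   # maximum output sequence length used in this study
    )
    return resp.choices[0].message.content
```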
Finally, we analyzed the failure patterns for selected models (OpenAI o1, Gemini 1.5 Pro, Llama 3.1
405B, GPT-4, Claude 3 Sonnet, DBRX, and Mixtral) in long context scenarios by using GPT-4o to
classify failures into broad categories such as “refusal” and “wrong answer” (Appendix F.1). We also
include an analysis of retrieval performance (recall@k) in Appendix C.
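A minimal sketch of this classification step is shown below; the prompt and the category list are abbreviated stand-ins for the full prompt in Appendix F.1, and classify_failure is an illustrative helper rather than our exact implementation.

```python
# Minimal sketch of the failure classification step; the full prompt and category list are in Appendix F.1.
# Assumes the client object from the retrieval sketch; the category names here are an abbreviated stand-in.
FAILURE_CATEGORIES = [
    "refusal", "wrong_answer", "task_failed",
    "repeated_or_random_content", "fail_to_follow_instructions",
]

def classify_failure(question, expected_answer, generated_answer, model="gpt-4o-2024-05-13"):
    """Ask GPT-4o to assign one failure category and a one-sentence justification."""
    prompt = (
        "Categorize the cause of the failure in the generated answer.\n"
        f"Allowed categories: {', '.join(FAILURE_CATEGORIES)}\n\n"
        f"Question: {question}\n"
        f"Expected answer: {expected_answer}\n"
        f"Generated answer: {generated_answer}\n\n"
        "Reply with one category name and a one-sentence justification."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```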
4 Results
4.1 Using longer context does not uniformly increase RAG performance
The best commercial models, such as o1-mini/preview, GPT-4o, and Claude 3.5 Sonnet, steadily improve performance as a function of context length, while for the majority of the open source models performance first increases and then decreases as context length grows (Figs. 1 and 2). Overall,
we found that the following models show consistent accuracy improvement up to 100k tokens:
o1-preview and o1-mini, GPT-4o and GPT-4o mini, Claude 3.5 Sonnet, Claude 3 Opus, and Gemini 1.5 Pro. These models exhibit largely monotonic behavior in which the results do not get significantly worse after they peak.

Figure 2: Long context RAG performance on FinanceBench
Among the open source models, Qwen 2 72B maintains consistent accuracy up to 64k tokens, while Llama 3.1 405B performance starts to decrease after 32k tokens; GPT-4-0125-preview starts to decrease after 64k tokens. Only a few models can maintain consistent long context RAG performance on all datasets. This demonstrates that while some models that boast long contexts can use them effectively to increase RAG performance, the majority of open source models can only handle RAG effectively up to roughly 16k-32k tokens.
We report very strong performance from the OpenAI o1 models; the o1 models seem to be a substantive improvement over GPT-4 and GPT-4o. Although the overall answer correctness of the Google Gemini 1.5 Pro and Gemini 1.5 Flash models is much lower than that of the o1 and GPT-4o models up to 128,000 tokens, the Gemini models maintain consistent performance at extremely long contexts up to 2,000,000 tokens. This is unique among the models we tested, and is an exciting example of how future LLMs may handle long context.
We found distinct failure patterns among different models in long context scenarios. Fig. 3 displays the failure count and failure type as a function of context length on the Natural Questions (NQ) dataset. As shown in the top right plot of Fig. 3, Claude 3 Sonnet frequently refused to answer due to perceived copyright concerns, especially at longer context lengths. Gemini 1.5 Pro maintained consistent performance at extremely long context (up to 2 million tokens), but increasingly failed tasks at long context lengths due to overly sensitive safety filters (Fig. 3). We note that we did not include any queries that failed in this way (i.e., by filtering) in the final accuracy score; on Natural Questions specifically, Gemini 1.5 Pro and Flash did remarkably well, with answer correctness values above 0.85 at a context length of 2 million tokens (see Fig. S2). Among the open source models, Llama 3.1 405B maintained consistent performance up to 64k tokens, while many of the failures of Mixtral-8x7B at longer contexts were due to repeated or random content. Finally, DBRX often failed to follow instructions at context lengths above 16k, summarizing content instead of answering questions directly. We include specific examples in Appendix F.
Figure 3: Failure analysis on the Natural Questions (NQ) dataset for Gemini 1.5 Pro, Claude 3 Sonnet, Mixtral 8x7B, and Llama 3.1 405B. Gemini 1.5 Pro (gemini-1.5-pro-001) increasingly failed tasks at long context length due to overly sensitive safety filters, while Claude 3 Sonnet frequently refused to answer due to perceived copyright concerns.

5 Discussion

In this study, we asked a straightforward question: can long context LLMs improve RAG performance? We found that for recent state of the art models such as o1, GPT-4o, Claude 3.5, Gemini 1.5, and even Qwen 2 72B, longer contexts can consistently improve RAG performance. However, longer
context is not uniformly beneficial across all models and datasets. Most of the LLMs we analyzed only showed increasing RAG performance up to 16-32k tokens.
Why does o1 do so well? We hypothesize that the increased test-time compute abilities of o1 [3]
allow the model to handle confusing questions and avoid getting misled by retrieved documents that
are irrelevant.
It is also interesting to note that for the NQ dataset, many of the failures were due to alignment
(Claude 3 Sonnet) or safety filtering (Gemini 1.5 Pro). We speculate that this is because the training of
those capabilities did not include long context; if a model is trained for helpfulness on short contexts,
for example, it might not necessarily do as well with helpfulness on long contexts. It is surprising that
alignment could fail at different prompt lengths; we leave a deep dive into this behavior for future
work.
Our results imply that for a corpus smaller than 128k tokens (or 2 million in the case of Gemini), it may be possible to skip the retrieval step in a RAG pipeline and instead directly feed the entire corpus into the LLM. Is this a good idea? Although this would currently be prohibitively expensive and could yield lower performance, such a setup could eventually allow developers to trade higher costs for a simpler developer experience when building LLM applications.
The costs vary widely across models. For a single query with a maximum sequence length of 128k tokens, GPT-4o costs $0.32, o1-preview costs $1.92, Claude 3.5 Sonnet costs $0.384, and Gemini 1.5 Pro costs $0.16 (counting only the cost per input token; for a single query with a maximum sequence length of 2 million tokens, Gemini 1.5 Pro costs $5). Using very long context for RAG is thus much more expensive than maintaining a vector database and retrieving a handful of relevant documents. Batch inference and corpus caching can likely mitigate these costs; this is an active area of development. In the past year alone, the price per million input tokens has dropped from $30 for GPT-4 to $2.50 for GPT-4o (https://openai.com/api/pricing/); in the near future it is likely that using 128k tokens will become more financially feasible.
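For reference, the per-query figures above follow directly from per-token prices. The snippet below reproduces the arithmetic; the per-million-token prices are inferred from the quoted per-query costs rather than taken from official price lists.

```python
# Reproducing the per-query input-token cost arithmetic above.
# Prices (USD per million input tokens) are inferred from the quoted per-query costs, not official price lists.
PRICE_PER_M_INPUT_TOKENS = {
    "gpt-4o": 2.50,
    "o1-preview": 15.00,
    "claude-3.5-sonnet": 3.00,
    "gemini-1.5-pro-short": 1.25,   # prompts up to 128k tokens
    "gemini-1.5-pro-long": 2.50,    # prompts beyond 128k tokens
}

def input_cost(model_key, prompt_tokens):
    """Cost of a single query, counting only input (prompt) tokens."""
    return PRICE_PER_M_INPUT_TOKENS[model_key] * prompt_tokens / 1_000_000

print(input_cost("gpt-4o", 128_000))                  # 0.32
print(input_cost("o1-preview", 128_000))              # 1.92
print(input_cost("claude-3.5-sonnet", 128_000))       # 0.384
print(input_cost("gemini-1.5-pro-long", 2_000_000))   # 5.0
```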
Acknowledgments and Disclosure of Funding
We would like to thank Andrew Drozdov, Andy Zhang, and Erica Yuen for their work that enabled
these experiments as well as their feedback on this manuscript. We would also like to thank the
Databricks AI Research team for their support and valuable discussions throughout this project.
This work was supported by Databricks, and all experiments were run on the Databricks Mosaic AI
platform.
Earlier versions of this work appeared as two separate blog posts: “Long Context RAG Performance of LLMs” (https://www.databricks.com/blog/long-context-rag-performance-llms, August 12, 2024) and “The Long Context RAG Capabilities of OpenAI o1 and Google Gemini” (https://www.databricks.com/blog/long-context-rag-capabilities-openai-o1-and-google-gemini, October 8, 2024).
References
[1] Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] OpenAI. Learning to reason with llms. https://openai.com/index/
learning-to-reason-with-llms/, September 2024.
[4] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783, 2024.
[5] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-
baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv
preprint arXiv:2403.05530, 2024.
[6] To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, and Hamed Zamani. Retrieval-
enhanced machine learning: Synthesis and opportunities. arXiv preprint arXiv:2407.12982,
2024.
[7] Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Nearest
neighbor machine translation. arXiv preprint arXiv:2010.00710, 2021.
[8] Terry Yue Zhuo, Zhuang Li, Yujin Huang, Yuan-Fang Li, Weiqing Wang, Gholamreza Haffari,
and Fatemeh Shiri. On robustness of prompt-based semantic parsing with large pre-trained
language model: An empirical study on codex. ArXiv, abs/2301.12868, 2023. URL https://api.semanticscholar.org/CorpusID:256389762.
[9] Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel,
Hannaneh Hajishirzi, and Wen-tau Yih. Task-aware retrieval with instructions. arXiv preprint
arXiv:2211.09260, 2022.
[10] Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, and William Cohen. Augmenting
pre-trained language models with QA-Memory for open-domain question answering. arXiv
preprint arXiv:2204.04581, 2023.
[11] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly su-
pervised open domain question answering. ArXiv, abs/1906.00300, 2019. URL https://api.semanticscholar.org/CorpusID:173990818.
[12] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Ouyang Long, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332, 2021. URL https://api.semanticscholar.org/CorpusID:245329531.
[13] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Gen-
eralization through memorization: Nearest neighbor language models. arXiv preprint
arXiv:1911.00172, 2019.
[14] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Bo-
ratko, Yi Luan, Sébastien MR Arnold, Vincent Perot, Siddharth Dalmia, et al. Can long-context
language models subsume retrieval, rag, sql, and more? arXiv preprint arXiv:2406.13121,
2024.
[15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented
generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing
Systems, 2020.
[16] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 6769–6781, 2020.
[17] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun,
Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A
survey. arXiv preprint arXiv:2312.10997, 2023.
[18] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[19] The Mosaic Research Team. Introducing DBRX: A New State-of-the-Art Open LLM. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm, March 2024.
[20] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
[21] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
[22] Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. Retrieval
augmented generation or long-context llms? a comprehensive study and hybrid approach. arXiv
preprint arXiv:2407.16833, 2024.
[23] Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, and Mohit Iyyer. Do long-range
language models actually use long-range context? arXiv preprint arXiv:2109.09115, 2021.
[24] Philippe Laban, Alexander R Fabbri, Caiming Xiong, and Chien-Sheng Wu. Summary of a
haystack: A challenge to long-context llms and rag systems. arXiv preprint arXiv:2407.01370,
2024.
[25] Anita Kirkovska and Sidd Seethepalli. RAG vs long context? https://www.vellum.ai/blog/rag-vs-long-context.
[26] Jay Alammar, Maxime Voisin, and Sam Barnett. Rag is here to stay: Four reasons why large
context windows can’t replace it. https://cohere.com/blog/rag-is-here-to-stay.
[27] Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong
Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval
augmented generation. arXiv preprint arXiv:2410.04343, 2024.
[28] Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, and Reut Tsarfaty.
Is it really long context if all you need is retrieval? towards genuinely difficult long context nlp.
arXiv preprint arXiv:2407.00402, 2024.
[29] Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik. Long-context llms meet rag:
Overcoming challenges for long inputs in rag. arXiv preprint arXiv:2410.05983, 2024.
[30] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie
Vidgen. Financebench: A new benchmark for financial question answering. arXiv preprint
arXiv:2311.11944, 2023.
[31] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a
benchmark for question answering research. Transactions of the Association for Computational
Linguistics, 7:453–466, 2019.
[32] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica.
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685,
2023.
APPENDIX
A Model Versions
We list all the model versions benchmarked in this study in Table S1.
Since the completion of this study, new versions of Gemini 1.5 (Pro and Flash) and Claude 3.5
Sonnet were released. The incredibly fast pace of development is quite exciting; we leave the external
benchmarking of these models to future work.
B Dataset Details
In this study, we benchmarked all LLMs on 3 curated RAG datasets that were formatted for both
retrieval and generation. These included Databricks DocsQA and FinanceBench, which represent
industry use cases, and Natural Questions (NQ), which is a standard academic benchmark. Below are
the dataset details:
Dataset                          Corpus    Queries    Avg. doc length (tokens)    Max doc length (tokens)
Databricks DocsQA                7,563     139        2,856                       225,941
FinanceBench                     53,399    150        811                         8,633
Natural Questions (dev split)    7,369     534        11,354                      13,362

Table S2: Dataset details for the 3 datasets used in our end-to-end RAG benchmark.
We include the individual answer correctness plots for Databricks DocsQA and Natural Questions in Figs. S1 and S2.
The performance of the Gemini 1.5 models evaluated on up to 2 million tokens can be found in Table
S4.
Model av. 2k 4k 8k 16k 32k 64k 96k 125k
o1-preview-2024-09-12 0.763 0.582 0.747 0.772 0.787 0.799 0.831 0.824 0.763
o1-mini-2024-09-12 0.731 0.566 0.728 0.754 0.772 0.777 0.769 0.778 0.704
gpt-4o-2024-05-13 0.709 0.467 0.671 0.721 0.752 0.759 0.769 0.769 0.767
claude-3-5-sonnet-20240620 0.695 0.506 0.684 0.723 0.718 0.748 0.741 0.732 0.706
claude-3-opus-20240229 0.686 0.463 0.652 0.702 0.716 0.725 0.755 0.732 0.741
claude-3-haiku-20240307 0.649 0.466 0.666 0.678 0.705 0.69 0.668 0.663 0.656
qwen2-72b-instruct 0.637 0.469 0.628 0.669 0.672 0.682 0.683 0.648 0.645
gpt-4o-mini-2024-07-18 0.61 0.424 0.587 0.624 0.649 0.662 0.648 0.646 0.643
gpt-4-turbo-2024-04-09 0.588 0.465 0.6 0.634 0.641 0.623 0.623 0.562 0.56
gemini-1.5-pro 0.584 0.368 0.51 0.55 0.58 0.595 0.634 0.636 0.622
claude-3-sonnet-20240229 0.569 0.432 0.587 0.662 0.668 0.631 0.525 0.559 0.485
gpt-4-0125-preview 0.568 0.466 0.614 0.64 0.664 0.622 0.585 0.505 0.452
llama-3.1-405b-instruct 0.55 0.445 0.591 0.615 0.623 0.594 0.587 0.516 0.426
gemini-1.5-flash 0.505 0.349 0.478 0.517 0.538 0.534 0.522 0.52 0.521
llama-3-70b-instruct 0.48 0.365 0.53 0.546 0.555 0.562 0.573 0.583 0.593
mixtral-8x7b-instruct 0.469 0.414 0.518 0.506 0.488 0.417 - - -
llama-3.1-70b-instruct 0.45 0.403 0.526 0.527 0.478 0.469 0.444 0.401 0.353
dbrx-instruct 0.447 0.438 0.539 0.528 0.477 0.255 - - -
gpt-3.5-turbo 0.44 0.362 0.463 0.486 0.447 - - - -
llama-3.1-8b-instruct 0.411 0.368 0.547 0.536 0.523 0.485 0.383 0.296 0.15
Table S3: LLM answer correctness up to 125k tokens. Same data as Fig. 1.
Figure S2: Long context RAG performance on Natural Questions
C Retrieval Performance
We assessed how retrieving more results would affect the amount of relevant information placed in the context of the generation model. Specifically, we assumed that the retriever returns a given number of tokens and then calculated the recall score at that cutoff. From another perspective, the recall performance is an upper bound on the performance of the generation model when the model is required to use only the retrieved documents to generate answers.
Below are the recall@k results for the OpenAI text-embedding-3-large embedding model on
3 datasets and different context lengths (Table S5). We use chunk size 512 tokens and leave a 1.5k
buffer for the prompt and generation. Recall@k here is different for each run based on the total
number of retrieved chunks; for example, when 1 chunk is retrieved, we report recall@1, and when 61
chunks are retrieved we report recall@61. We note the relationship between the number of retrieved
chunks and the maximum context length in Table S5.
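A minimal sketch of recall under a token-budget cutoff is shown below; it assumes each query comes with a set of gold (relevant) chunk ids, and the variable names are illustrative rather than taken from our evaluation code.

```python
# Minimal sketch of recall at a token-budget cutoff; variable and parameter names are illustrative.
def recall_at_budget(ranked_chunk_ids, chunk_token_lengths, gold_chunk_ids,
                     context_budget_tokens, reserve_tokens=1500):
    """Fraction of gold (relevant) chunks that appear among the retrieved chunks fitting in the context budget."""
    budget = context_budget_tokens - reserve_tokens
    kept, used = [], 0
    for cid in ranked_chunk_ids:              # chunk ids in retriever ranking order
        n = chunk_token_lengths[cid]
        if used + n > budget:
            break
        kept.append(cid)
        used += n
    gold = set(gold_chunk_ids)
    if not gold:
        return 1.0
    return len(set(kept) & gold) / len(gold)

# With 512-token chunks and an 8k budget, roughly (8000 - 1500) // 512 = 12 chunks fit,
# so this corresponds to recall@12 for that query.
```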
Saturation point: as can be observed in the table, each dataset’s retrieval recall score saturates at a
different context length. For the NQ dataset, it saturates early at 8k context length, whereas DocsQA
and FinanceBench datasets saturate at 96k and 128k context length, respectively. These results
demonstrate that with a simple retrieval approach, there is additional relevant information available
to the generation model all the way up to 96k or 128k tokens. Hence, the increased context size of
modern models offers the promise of capturing this additional information to increase overall system
quality.
Similar to Fig. 2 in Jin et al., we find that retrieval accuracy monotonically increases. However, as
shown in our main text, this does not necessarily mean that RAG accuracy monotonically increases.
D Evaluation with LLM-as-a-judge
We used the “LLM-as-a-judge” paradigm [32] to measure the answer correctness of the generated
answer with regard to the ground truth answer. In all experiments, we used the judge from the Databricks Agent Evaluation framework (www.databricks.com/blog/databricks-announces-significant-improvements-built-llm-judges-agent-evaluation). The judge was calibrated with human preferences on the representative datasets FinanceBench and Databricks DocsQA, and it reported 88.1 ± 5.5% agreement and Cohen’s kappa scores of 0.64 ± 0.13, showing strong agreement with human labelers.
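For reference, raw agreement and Cohen’s kappa between judge and human labels can be computed as in the sketch below; the label lists are placeholders, not our calibration data.

```python
# Raw agreement and Cohen's kappa between judge and human labels; the label lists below are placeholders.
from collections import Counter

def agreement_and_kappa(judge_labels, human_labels):
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    p_observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement under independent marginal label distributions.
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa

# Placeholder binary labels (1 = answer judged correct).
print(agreement_and_kappa([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]))  # (0.833..., 0.666...)
```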
E Prompt Templates

E.1 Databricks DocsQA

You are a helpful assistant good at answering questions related to databricks products or spark features.
You’ll be provided with a question and several passages that might be relevant. Your task is to provide
an answer based on the question and passages.
Note that passages might not be relevant to the question, so only use the passages that are relevant. If
no relevant passage is provided, answer using your knowledge.
The provided passages as context:
{context}
The question to answer:
{question}
Your answer:
E.2 FinanceBench
You are a helpful assistant good at answering questions related to financial reports. You’ll be provided
with a question and several passages that might be relevant. Your task is to provide an answer based
on the question and passages.
Note that passages might not be relevant to the question, so only use the passages that are relevant. If
no relevant passage is provided, answer using your knowledge.
The provided passages as context:
{context}
The question to answer:
{question}
Your answer:
E.3 Natural Questions

You are an assistant that answers questions. Use the following pieces of retrieved context to answer
the question. Some pieces of context may be irrelevant, in which case you should not use them to
form the answer. Your answer should be a short phrase and should not be in a complete sentence.
Question: {question}
Context: {context}
Answer:
F Failure Analysis

F.1 Failure Classification

We developed prompts that describe each failure category and used GPT-4o to classify all of the model failures into these categories. We note that the failure patterns on these datasets may not be representative of other datasets; it is also possible for the patterns to change with different generation settings and prompt templates. We include representative examples of failures below.
We used the following prompt template:
You are a helpful assistant good at categorizing the cause of the failure in answers. I’ll provide you
the question that’s asked, the expected answer and the generated answer, and you’ll need to call a
function to output the category of the failure that it falls under and the corresponding one sentence
justification.
Typical failures include:
F.2 Claude 3 Sonnet Failures on Natural Questions
Claude 3 Sonnet frequently failed to follow the instructions, and we found that the majority of the
failures were caused by it refusing to provide an answer due to copyright concerns. Table S6 below
contains some example failures from Claude 3 Sonnet.
F.3 GPT-4 Failures on Natural Questions
GPT-4 tended to fail by providing the wrong answer (e.g., for the question “who sang once upon a dream at the end of maleficent”, the correct answer is “Lana Del Rey” but the generated answer was “Ariana Grande & John Legend”). Additionally, GPT-4 occasionally provided answers that were irrelevant or random. For example, when asked “who was elected president in mexico in 2000”, GPT-4’s answer was “15th largest in nominal terms and 11th largest by purchasing power parity.”
On the other hand, Mixtral-instruct and DBRX-instruct failed in unique ways: Mixtral-instruct consistently output repeated content, such as the character for “dream” in Chinese, or random content that was irrelevant to the question. In contrast, DBRX-instruct consistently failed to follow instructions, summarizing the content instead of answering the question. The bar chart for DBRX also shows a higher rate of answering with repeated or random content. Upon closer examination, this random or repeated content was still attempting to summarize the context, but with repeated sentences. We include samples from Mixtral-instruct and DBRX-instruct below:
F.5 Gemini 1.5 Pro Failures on Natural Questions
As can be seen in Figure 3, Gemini 1.5 Pro failed on the Natural Questions benchmark for two
reasons, task_failed and wrong_answer. The task_failed scenario was caused by the strict
content filtering of the Gemini API. This was particularly pronounced for the Natural Questions (NQ)
dataset, where the filtering increased as a function of context length.
Below is one example:
finish_reason: SAFETY
safety_ratings {
category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: MEDIUM
}
safety_ratings {
category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HARASSMENT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
}
The Natural Questions dataset is a standard, well established academic dataset based on Wikipedia.
We are not aware of known examples of hate speech or harassment content in NQ. Our benchmarking
did not encounter these types of strict filters when using any of the other APIs (OpenAI, Anthropic,
etc.).
We note that we did not include any queries that failed in this way (i.e. by filtering) in the final
accuracy score. On Natural Questions specifically, Gemini 1.5 Pro and Flash did remarkably well
with answer correctness values above 0.85 at 2 million tokens context length (see Fig. S2).
Besides task_failed, the next most frequent failure mode for Gemini 1.5 Pro was wrong_answer; examples are below:
F.6 Gemini 1.5 Pro Failures on Databricks DocsQA
The failures on the Databricks DocsQA dataset look somewhat different. There are very few instances
of task failure due to safety filtering; rather, most of the failures are simply due to incorrect answers.
We include a few examples below:
Figure S3: Gemini 1.5 Pro answer failures on the Databricks DocsQA dataset. For each context length, the majority of failures fall into the wrong_answer category. Note that this behavior is somewhat different from the NQ dataset in Figure 3.
question
Can I use auto optimize on a streaming delta?
expected_answer
Yes, you can use auto optimize on a streaming delta. Auto optimize is particularly useful for streaming
use cases where a few minutes of latency is acceptable. It helps compact small files automatically
during write operations. Auto optimize adds latency overhead to write operations but accelerates
read operations. Auto optimize consists of two complementary features: optimized writes and auto
compaction. It ensures fewer and more efficiently-sized files during streaming operations.
generated_answer
Yes, auto optimize is particularly useful for streaming Delta tables. It helps to compact small files
that are often generated in streaming use cases, which can improve read performance. However, keep
in mind that auto optimize adds some latency overhead to write operations. So, it’s best suited for
streaming scenarios where a few minutes of latency is acceptable.
justification
The generated answer is mostly correct but lacks the detail about the two complementary features of
auto optimize, which are optimized writes and auto compaction.
question
What is the data explorer used for?
expected_answer
The Data Explorer is used for viewing table schema details, previewing sample data, accessing table
type, table location, and table properties, and reviewing table history. It also allows users to view
frequent queries and users who have accessed the table. Additionally, Data Explorer can be used to
create basic queries and quick dashboards, and to set table ownership and permissions.
generated_answer
The Databricks Data Explorer provides a UI to explore and manage data, schemas (databases), tables,
and permissions. You can use Data Explorer to view schema details, preview sample data, and see
table details and properties. Administrators can view and change owners, and admins and data object
owners can grant and revoke permissions.
justification
The generated answer provides relevant information but lacks some details and specific functionalities
mentioned in the reference answer.
question
What is the workflow for creating a model serving endpoint using the Databricks UI?
expected_answer
To create a model serving endpoint using the Databricks UI, follow these steps:
1. Click on “Serving” in the sidebar to open the Serving UI.
2. Click on “Create serving endpoint”.
3. Provide a name for your endpoint in the “Serving endpoint name” field.
4. In the “Edit configuration” section, select the model and the version of the model you want to
serve.
5. Choose the size of the compute to use for the endpoint.
6. Specify if the endpoint should automatically scale to zero when it is not in use and set the
percentage of traffic that should be routed to the served model.
7. Click on “Create serving endpoint”.
8. Initially, the “Serving endpoint state” will be shown as Not Ready.
After a few minutes, this will change to Ready once the endpoint is up and running. You can also
access the Serving UI to create an endpoint directly from the registered model page by selecting the
model you want to use, clicking the “Use model for inference” button, choosing the “Real-time” tab,
and then filling in the model version and endpoint name. Select the compute size, specify if it should
scale to zero when not in use, and then click “Create serving endpoint”. The “Serving endpoints”
page will then display the status as Not Ready, changing to Ready after the endpoint is set up.
generated_answer
To create a model serving endpoint using the Databricks UI:
1. Click **Serving** in the sidebar to display the Serving UI.
2. Click **Create serving endpoint**.
3. In the **Serving endpoint name** field, provide a name for your endpoint.
4. In the **Edit configuration** section, select which model and model version you want to serve.
5. Select what size compute to use.
6. Specify if the endpoint should scale to zero when not in use, and the percentage of traffic to route
to a served model.
7. Click **Create serving endpoint**.
The **Serving endpoints** page appears with **Serving endpoint state** shown as Not Ready. After
a few minutes, **Serving endpoint state** changes to Ready.
justification
The generated answer is mostly correct but omits some details and steps present in the reference
answer.
Table S11: Sample of Gemini 1.5 Pro refusing to answer questions for the FinanceBench dataset
G Cost Considerations for Long Context RAG
In the following table, we list the cost per input token for some of the API-based models. Cost values are as of October 2024.
We show the input token cost for a single query with a maximum sequence length of 8k, 64k, 128k, and 2 million tokens. We also show the estimated input token costs for “full benchmarking” across all three datasets in this study, which together have a total of 823 queries.
Cost A is for 823 queries at maximum sequence length of 128k tokens. Cost B is for 823 queries at
maximum sequence length of 2 million tokens.