
DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Zhe Xu1,*, Jiasheng Ye1,*, Xiangyang Liu1,*, Tianxiang Sun1, Xiaoran Liu1,
Qipeng Guo3, Linlin Li2, Qun Liu2, Xuanjing Huang1, Xipeng Qiu1,3,†
1School of Computer Science, Fudan University,
2Huawei Noah’s Ark Lab, 3Shanghai AI Laboratory
lynn.lilinlin@huawei.com, xpqiu@fudan.edu.cn
* Equal contribution. † Corresponding author.
Abstract

With the rapid advancement of Large Language Models (LLMs), long-context information understanding and processing have become a hot topic in academia and industry. However, benchmarks for evaluating the ability of LLMs to handle long-context information have not kept pace with the development of LLMs. Despite the emergence of various long-context evaluation benchmarks, the types of capability they assess remain limited and add no new capability dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning benchmark featuring an average context length of over 100K tokens. DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs, which requires not only a full understanding of the context but also extracting important evidences from it and reasoning over the extracted evidences to answer the given questions. This is a new dimension of capability evaluation, which is more in line with the current intelligence level of LLMs. We use detective novels as data sources, since they naturally contain various reasoning elements. We manually annotated 600 questions in Chinese and also provide an English edition of the context information and questions. We evaluate many long-context LLMs on DetectiveQA, including commercial and open-source models, and the results indicate that existing long-context LLMs still require significant advancements to effectively process questions with true long-context dependencies.



1 Introduction

Dataset Reasoning Native 100K Long Text Contextualisation Required Rich Metrics
HotpotQA  Yang et al. (2018)
NarrativeQA Kociský et al. (2018a)
InfiniteBench Zhang et al. (2024)
BabiLong Kuratov et al. (2024)
Needle-in-a-Haystack  Kamradt (2023)
DetectiveQA
Table 1: Comparison of DetectiveQA with other datasets. Our dataset has unique advantages: as an evaluation set it combines several sources of difficulty, and thanks to its rich annotation information it supports a more comprehensive evaluation perspective.

Large Language Models (LLMs) OpenAI (2023); Anthropic ; Touvron et al. (2023); Sun et al. (2024) have developed remarkably fast in recent years. The ability to understand and process long contexts is an important part of this development OpenAI (2023); Anthropic ; Bai et al. (2023a); Cai et al. (2024); Zeng et al. (2023a). This capability is essential for basic tasks that require a deep understanding of lengthy documents, such as information extraction, summarization, or translation of long documents. With this capability, LLMs can be applied to a wider range of scenarios (e.g., legal cases, academic research, or finance). In addition to the rapid application of LLMs to these basic tasks, the development of LLMs has also allowed researchers to see the dawn of AGI (artificial general intelligence). Both academia and industry are exploring the application of LLMs in broader fields, such as advanced intelligent agents and robotics Park et al. (2023); Hong et al. (2023); Xi et al. (2023); Zeng et al. (2023c). Such applications require LLMs to have stronger understanding and reasoning abilities over long documents. This type of reasoning is the key to inferring the next action from the context, as opposed to the more basic capability of extracting the action directly from the context. Humans, similarly, deduce what action to perform next from all of their previous experiences. This is a more advanced capability, a new dimension that needs to be evaluated for rapidly evolving LLMs.

Many evaluation benchmarks have emerged alongside the development of long-context LLMs. To quickly fill the gaps in long-context evaluation, some new benchmarks integrate and transform large sets of existing datasets An et al. (2023a); Bai et al. (2023b). However, the context length of these datasets is mostly below 20K, which makes it difficult to meet the evaluation requirements of current long-context LLMs. To extend the context length, other benchmarks re-select data sources and annotate new questions Li et al. (2023); Zhang et al. (2024); Qiu et al. (2024); Wang et al. (2024). However, these newly annotated datasets tend to focus on a few basic evaluation dimensions inherited from older benchmarks, such as information retrieval, summarization, code completion, and multi-hop reasoning. Although these benchmarks cover many application scenarios of long-context LLMs, they can hardly measure the performance of LLMs in more advanced intelligent applications, such as advanced intelligent agents or robotics.

To fill this gap, we introduce DetectiveQA, a novel benchmark that evaluates the performance of long-context LLMs on long-context questions. Unlike existing benchmarks, DetectiveQA addresses both longer context information and longer context dependencies. Moreover, it introduces the assessment of narrative reasoning capability, a new dimension of assessment. We choose orthodox school detective novels as our data source, which not only have long contexts but also have complex plots and character relationships. To construct DetectiveQA, we analysed and annotated these novels in depth, extracted a large number of long-context questions, and provided detailed answers. These questions cover a variety of types, including character relationships, plot development, and motivation analysis, aiming to comprehensively assess the capability of long-context language models in narrative reasoning.

The proposal of DetectiveQA aims to provide a new evaluation tool for the research and application of long context language models, to help researchers gain a deeper understanding of the performance and limitations of long context language models, as well as to support the advancement of long context language models in the field of more advanced intelligent applications.

In summary, our contributions are threefold.

  • We present the first long context narrative reasoning dataset, which helps to better evaluate the model’s reasoning ability for complex problems in narrative contexts.

  • We have designed rich evaluation metrics that take into account data contamination issues and decoupling of long text capabilities, which helps to better analyse the performance of the model.

  • We have extensively evaluated the long-context reasoning capabilities of current large-scale language models, and have clarified the challenges that the key capability of narrative reasoning faces in the development of current large-scale language models.

2 Related Work

Long-Context Evaluation Benchmark

Regarding datasets, NarrativeQA (Kociský et al., 2018b), QuALITY (Pang et al., 2022), and TriviaQA (Joshi et al., 2017) target document information retrieval and obtain longer inputs by supplying document information. These datasets are useful for evaluating the information retrieval capabilities of large language models. QuALITY (Pang et al., 2022) provides a distinct evaluation perspective by using multiple-choice accuracy instead of BLEU or ROUGE-L. Meanwhile, HotpotQA (Yang et al., 2018) evaluates multi-hop inference by using multiple related facts that can be chained step by step, while ELI5 (Fan et al., 2019) constructs a dataset with longer responses by collecting questions and answers that are understandable to children as young as 5 years old, requiring the model to make inferences in order to respond. Many benchmarks also focus on measuring long-text ability: Long Range Arena (Tay et al., 2021) designed six classification tasks on longer inputs, and CAB (Zhang et al., 2023) designed seven tasks for a more comprehensive measurement of skills. SCROLLS (Shaham et al., 2022) and its extension ZEROSCROLLS (Shaham et al., 2023) include documents from a variety of domains and propose a variety of tasks, including query-based summarization, multi-hop question answering, sentiment aggregation, and sorting of book chapter summaries. LongBench (Bai et al., 2023b) covers multiple languages and focuses mainly on long-text comprehension in large language models. L-EVAL (An et al., 2023b) contains both open and closed tasks with inputs of around 30K in length. The datasets and benchmarks used in these evaluations were relatively long at the time and placed some demands on the models’ inference capabilities. However, the input lengths were still relatively short, no more than 50K.
To measure performance on much longer texts, datasets such as Needle in a Haystack (Kamradt, 2023) and bABILONG (Kuratov et al., 2024) have reached input lengths on the order of 100K. Needle in a Haystack (Kamradt, 2023) evaluates model performance at various lengths and insertion depths, while bABILONG (Kuratov et al., 2024) creates multi-hop inference problems by inserting the relevant facts into very long contexts. Meanwhile, InfiniteBench (Zhang et al., 2024) has focused on the need for longer-text evaluation, building an evaluation set with inputs of over 100K that tests five key capabilities of models on longer text: retrieval, maths, code, QA, and summarization.

Nowadays, the need for input support for very long text and model inference capabilities is becoming more prominent due to the development of large language models. In the past, datasets often lacked in terms of length, and inference was more of an information extraction task, where the answer would appear in the relevant document.

Therefore, we have constructed the only dataset with an input length of 100K that requires complex narrative reasoning to obtain answers that are not explicitly stated in the text. Additionally, we propose a more comprehensive evaluation criterion to assess the model’s narrative reasoning ability.

3 DetectiveQA

In this section, we introduce DetectiveQA, a benchmark dataset to test the long-context reasoning ability of language models.

3.1 Data Sources

Books are a representative data source for studying the ability of language models to handle long contexts, and among them detective novels are a category that contains intensive reasoning-related content. Therefore, we consider detective novels promising candidates as data sources for our benchmark. Nevertheless, we find that a large volume of detective novels put the attractiveness of storytelling first, at the expense of the strictness of the reasoning process. Fortunately, we find a group of detective novels categorized as the orthodox school (Saito, 2007). These novels are dedicated to entertaining readers keen on solving puzzles by ensuring that the reader has the same evidences as the detective in the novel, which makes them ideal data sources that satisfy our need for rigorous reasoning. Therefore, we collect orthodox detective novels as sources of long context and use questions related to the puzzles in the novels to test the language models.

Other considerations regarding data sources are their lengths and languages. A smooth gradient of difficulty helps to differentiate models at varying levels of proficiency. Therefore, we collect orthodox detective novels with lengths ranging from 100K to 250K words. Additionally, we only collect Chinese and English versions, given the language background of the researchers and data annotators.

{
  "question": "Which of the following is the reason for the disappearance of Sainsbury Seale?",
  "options": {
    "A": "left voluntarily.",
    "B": "met an untimely end.",
    "C": "eloped with someone.",
    "D": "Sudden memory loss."
  },
  "answer": "B",
  "reasoning": [
    "Ms Sainsbury Seale did not take her luggage with her when she disappeared.",
    "This does not appear to be a voluntary departure.",
    "Ms Seale had a dinner date with a friend to play solitaire.",
    "Normally at the appointed time she would have been back at the hotel.",
    "Therefore, based on the above evidences, it is surmised that it was Sainsbury Seale who met an untimely end."
  ],
  "evidence_position": [740, -1, 734, -1, -1],
  "answer_position": 1202
}

Figure 1: An example of a multiple-choice annotation in DetectiveQA. We highlight the evidences of reasoning in blue italics and the inferences in green plain typeface. The "reasoning" field includes both evidences and inferences. In the "evidence_position" field, the entry corresponding to an evidence is the index of the paragraph in which the evidence occurs in the novel, while the entry corresponding to an inference is -1.

3.2 Desiderata of Data Annotation

Our data are largely question-answer pairs, as shown in Figure 1. To ensure the efficacy of our data in evaluating the long-context reasoning ability of language models, we make several design decisions on the annotations on both questions and answers.

Questions.

Each question is composed of a long context from the detective novels we collect and a multiple-choice question about the context. For the context, we truncate the novel at the paragraph that reveals the answer to the question in order to avoid answer leakage. We also allow multiple questions per novel to improve the utilization of the books. We design the questions as multiple-choice questions to ease extracting answers from model outputs, similar to the practice in many prominent benchmarks for large language models (Hendrycks et al., 2021; Huang et al., 2023; Yue et al., 2023). We also require each question to center around the detective’s reasoning about the cases. In this way, we exclude questions that are too trivial to require understanding or extracting evidences from a long context. [Footnote 1: Here is an example of an overly trivial question: a novel mentions that a character is six years old when her sister is born, and the question asks how many years older she is than her sister.]

Answers.

We require the answers to contain reference solutions with decomposed steps. This helps fine-grained evaluation of the stepwise correctness of the model outputs. Typically, the reasoning steps are either evidences drawn from the context or inferences from these evidences. The two kinds of steps correspond to the ability of language models to understand long contexts and perform reasoning, respectively. Therefore, we further differentiate them in our data annotations to facilitate disentangled evaluations of the two aspects.

An exemplary data sample is in Figure 1. In summary, a data entry contains the following items:

  • A multiple-choice question with four candidate options, attaching a long novel as its context.

  • The answer option and the reasoning process in the form of a list containing evidences and inferences.

  • Evidence positions corresponding to the reasoning steps. Each entry indicates the paragraph in which the evidence appears in the text; if the corresponding step is an inference rather than an evidence, it is labeled -1.

  • Answer position representing where the answer appears in the novel. It is used to truncate the text to ask the question.
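The way these fields fit together can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names `build_context` and `split_steps` are hypothetical, and it assumes that positions are 0-based paragraph indices and that the paragraph revealing the answer is excluded from the context, which the text does not specify.

```python
from typing import Dict, List, Tuple


def build_context(paragraphs: List[str], answer_position: int) -> str:
    """Keep only the paragraphs that precede the one revealing the answer.

    Assumption: positions are 0-based paragraph indices.
    """
    return "\n".join(paragraphs[:answer_position])


def split_steps(entry: Dict) -> Tuple[List[str], List[str]]:
    """Separate reasoning steps into evidences and inferences.

    A step is an inference when its evidence position is -1.
    """
    evidences: List[str] = []
    inferences: List[str] = []
    for step, pos in zip(entry["reasoning"], entry["evidence_position"]):
        if pos == -1:
            inferences.append(step)
        else:
            evidences.append(step)
    return evidences, inferences


# A shortened version of the entry shown in Figure 1.
entry = {
    "reasoning": [
        "She did not take her luggage with her.",
        "This does not appear to be a voluntary departure.",
        "She had a dinner date with a friend.",
        "Normally she would have been back at the hotel.",
        "Therefore she met an untimely end.",
    ],
    "evidence_position": [740, -1, 734, -1, -1],
    "answer_position": 1202,
}
evidences, inferences = split_steps(entry)
```

Keeping evidences and inferences apart in this way is what allows the long-context comprehension and reasoning abilities to be scored separately later.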

3.3 Annotation Procedure

Human Annotations.

A naive approach to annotating the data is to employ human labor. We can hire workers to read the novels and write down the questions and reference answers. However, such a process is intolerably cumbersome due to the requirement of reading long novels (Wu et al., 2021). According to our records, it takes a median of around 3.5 hours for our annotators to finish reading a 100K-long novel. This is a considerable drain on annotators’ time and mental energy, making it difficult to scale up the size of the dataset. As a consequence, we seek an alternative approach to enable efficient data annotation.

Annotation with AI Assistance.

Our solution is to leverage existing language models with strong long-context capability to assist the human annotators. Our key insight is that, given the full detective novel, finding the reasoning questions and their answers can be treated as an information extraction problem, a much simpler task on which large language models have achieved reasonable performance (Zhang et al., 2024). We therefore design a workflow that decomposes the data annotation procedure into a sequence of information extraction tasks and employ Claude 2, a leading long-context large language model, for assistance.

  • First, we enter the full novel and use the model to extract inferences drawn by the detectives in the novel. This forms the reasoning chain in our data. To help human annotators check the extraction later, we prepend indices to the paragraphs of the novel and require the model to output where the drawn inferences are located.

  • Then, with the extracted inferences, we require the model to seek the positions where the evidences mentioned in the inferences lie.

  • Finally, we use models to synthesize questions for each of the extracted reasoning chains.

Given the 100K context limit of Claude 2, we split novels exceeding this length into chunks. We extract reasoning chains (the first step) and generate questions (the third step) within each chunk separately. To gather the mentioned evidences (the second step), we query all the chunks for each reasoning chain. After these model calls, human annotators only need to revise the model outputs by checking them against the related paragraphs, without reading the full novels or working out the reasoning process themselves. [Footnote 2: To help annotators gain a coherent understanding of the novel content, which we find helpful for their efficiency in checking the AI annotations, we also use the models to provide summaries of the novels.] This human calibration ensures the precision and rationality of the annotations. Although human annotators are still required to proofread the content, this workflow somewhat mitigates the larger overhead of purely manual annotation.
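The paragraph indexing and chunking described above might look like the following sketch. The helper names are hypothetical, the `[i]` index format is an assumption, and whitespace-separated word counts stand in for the model's real token counts, which would need the model's own tokenizer.

```python
from typing import List


def index_paragraphs(paragraphs: List[str]) -> List[str]:
    """Prepend a paragraph index so that model outputs can cite positions."""
    return [f"[{i}] {p}" for i, p in enumerate(paragraphs)]


def chunk_paragraphs(indexed: List[str], max_words: int) -> List[List[str]]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    Assumption: word count is only a rough proxy for the token limit.
    A single paragraph longer than max_words still becomes its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    count = 0
    for para in indexed:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks


indexed = index_paragraphs(["a b c", "d e", "f g h i"])
chunks = chunk_paragraphs(indexed, max_words=8)
```

Splitting on paragraph boundaries keeps the prepended indices intact, so positions reported by the model in one chunk remain valid for the whole novel.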

Statistic | Human Data (Max / Min / Avg) | AI Data (Max / Min / Avg)
Context Length | 148K / 4K / 81K | 167K / 4K / 94K
Coverage Factor | 95.6 / 0.3 / 35.4 | 99.0 / 0.2 / 44.6
Evidence Number | 17 / 2 / 5.98 | 21 / 2 / 5.82
Inference Steps | 5 / 1 / 1.82 | 5 / 1 / 1.82
Evidence Length | 740 / 37 / 196.29 | 404 / 30 / 150.09
Inference Length | 357 / 23 / 105.64 | 244 / 16 / 101.05
Total Questions | 308 | 338
Table 2: Comparison of statistics between human and AI-assisted annotations, where context length refers to the length of the question input and the meaning of the remaining metrics is detailed in Section 2.

3.4 Statistics

In total, DetectiveQA contains 1200 question-answer pairs, among which 308 are from human annotations and 892 are annotated with the assistance of AI. We detail the statistics of the data samples as follows.

Table 2 presents data statistics for the subset of novels annotated through both human annotation and AI-assisted annotation. The AI-assisted annotations have longer question contexts, while the manual annotations contain more detailed evidence and reasoning sections and slightly lower coverage. Overall, the difference in quality between manual and AI-assisted annotations is minimal, suggesting that AI-assisted annotation is feasible.

Context lengths.

We show the distribution of context lengths in Figure 2. The length of our queries ranges from 4K to over 250K words, with an average of 96K words. The average length approaches the supported context length of most competent large language models, namely 100K words. At this scale, our benchmark covers the context window provided by most models and thus provides a sufficiently long test of long-text capabilities.

Context coverage.

We introduce a coverage factor to quantify the “global” nature of a question. Formally, the coverage factor is defined as the length of the context from the earliest evidence to the answer location as a percentage of the total contextual input length. Depicted in Figure 3, our question coverage can be broadly categorized into three bands: 10 percent, 50 percent, and 100 percent. The substantial contextual coverage poses a significant challenge to the model’s comprehension and information-gathering abilities from lengthy articles, rendering our dataset an effective assessment of the model’s proficiency in reading and comprehending extensive texts.
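The definition above can be made concrete with a short sketch. The function name is hypothetical, and two details are assumptions that the text leaves open: positions are treated as 0-based paragraph indices, and lengths are counted in whitespace-separated words.

```python
from typing import List


def coverage_factor(
    paragraphs: List[str],
    evidence_position: List[int],
    answer_position: int,
) -> float:
    """Share (in percent) of the truncated context that lies between the
    earliest evidence and the answer location.

    Entries of -1 in evidence_position mark inferences and are ignored.
    """
    # The model input is the novel truncated before the answer paragraph.
    lengths = [len(p.split()) for p in paragraphs[:answer_position]]
    total = sum(lengths)
    positions = [p for p in evidence_position if p != -1]
    if total == 0 or not positions:
        return 0.0
    earliest = min(positions)
    return 100.0 * sum(lengths[earliest:]) / total


# Ten two-word paragraphs; the earliest evidence sits in paragraph 6.
toy_paragraphs = ["word word"] * 10
toy_coverage = coverage_factor(toy_paragraphs, [6, -1, 8], answer_position=10)
```

A value near 100 means the earliest evidence appears close to the start of the novel, so the question depends on almost the entire input.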

Figure 2: The distribution of the context lengths of samples in DetectiveQA. The novel content contained in each question is truncated before the answer contained in the question appears.
Statistics of the reasoning.

Table 3 shows that the answers in our dataset contain substantial information about the narrative reasoning. Additionally, the labeled questions demand a broad span of evidences, so a relatively long contextual span of the text must be explored to derive the answer.

Statistic | Maximum | Minimum | Average
# evidences | 20 | 2 | 6.36
# inference steps | 12 | 1 | 2.34
evidence length | 812 | 30 | 176.32
inference length | 493 | 16 | 112.23
coverage factor | 99.99% | 0.25% | 58.54%
Table 3: Statistics of evidences and inferences in DetectiveQA. Our dataset also has richer corpus information in terms of responses. We count the lengths in words.
Figure 3: The distribution of the coverage factor. We labeled both short-range questions and a large number of questions with long-range dependencies, and the questions are fairly evenly distributed across the range.

4 Experiments

In this section, we evaluate prevailing large language models supporting long context on DetectiveQA to benchmark their capabilities in long-context narrative reasoning.

4.1 Experimental Setups

4.1.1 Experimental Settings

To explore the key elements for long-context reasoning, namely understanding long documents, extracting cue information, and reasoning about actions or responses based on cue information, our experiments employ three distinct settings for both human manual annotation and AI-assisted annotation.

Question+Context.

This fundamental setup includes a multiple-choice question for the model to answer, requiring the model to provide its response and reasoning process for the novel content until the answer is found in the text. This tests the model’s abilities in long text comprehension, cue information extraction, and reasoning simultaneously.

Question-Only.

In this setup, we investigate whether the model, during pre-training, has been exposed to the corresponding novel content. We present the model with a question-only query, providing only the name and author of the novel along with single-choice questions. The model is then expected to output both the answer and the corresponding reasoning process.

Question+Evidence.

In this configuration, we input the evidence part of the human annotation into the model along with the multiple-choice question. This evidence section is akin to the result of a gold retrieval over the novel for the given question. This setting removes the need to comprehend long articles and extract information, essentially testing the model’s reasoning ability alone.

4.1.2 Metrics

Our methodology for the evaluation contains two aspects of evaluation metrics.

Answer accuracy.

Similar to previous evaluations based on multiple-choice questions (Hendrycks et al., 2021; Huang et al., 2023; Yue et al., 2023), we provide the model with a question and four annotated options and require the model to output the letter corresponding to the selected option. We then calculate the percentage of correctly answered questions as the score. [Footnote 3: Since our data are annotated on detective novels whose content the model may have seen during pre-training, which could inflate model scores, we discuss the influence of potential data contamination in Appendix G.]

Reasoning metric.

To make a model’s answer reliable, the output should not only provide an answer but also include an explanation supporting that answer. To this end, we examine how many of the annotated evidences are included in the reasoning process output by the model, and score each question by the percentage of its annotated evidences that are included. The average score across all questions is the model’s reasoning score on the dataset. To judge this containment relationship, we use GPT-4 as a reviewer: we ask GPT-4 to list the contained evidences and count them. The specific prompt we used can be found in Appendix A.
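The scoring rule can be sketched as follows. In the paper the containment judgment comes from GPT-4 using the prompt in Appendix A; here the judge is passed in as a callable, and `substring_judge` is only a crude hypothetical stand-in so that the example runs without any API call.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]


def reasoning_score(evidences: List[str], model_reasoning: str, judge: Judge) -> float:
    """Percentage of annotated evidences that the judge finds in the output."""
    if not evidences:
        return 0.0
    hits = sum(1 for ev in evidences if judge(ev, model_reasoning))
    return 100.0 * hits / len(evidences)


def dataset_reasoning_score(samples: List[Tuple[List[str], str]], judge: Judge) -> float:
    """Average of the per-question reasoning scores."""
    if not samples:
        return 0.0
    scores = [reasoning_score(ev, out, judge) for ev, out in samples]
    return sum(scores) / len(scores)


def substring_judge(evidence: str, reasoning: str) -> bool:
    """Placeholder judge (assumption): case-insensitive substring match.
    The paper uses a GPT-4 review instead."""
    return evidence.lower() in reasoning.lower()


samples = [
    (["left her luggage", "dinner date"], "She left her luggage behind."),
    (["dinner date"], "She had a dinner date that evening."),
]
per_question = [reasoning_score(ev, out, substring_judge) for ev, out in samples]
overall = dataset_reasoning_score(samples, substring_judge)
```

Passing the judge as a parameter keeps the aggregation logic independent of whichever reviewer model is used.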

Model | Context Length | Deployment Method
GPT4 | 128K | API
Claude3 | 200K | API
Kimi | 200K | web
InternLM2 | 200K | local
ChatGLM3 | 128K | local
Table 4: Specific information on the evaluated models. The API used for GPT4 is gpt-4-1106-preview and the API used for Claude3 is claude-3-opus-20240229.
Models | Question+Context Answer | Question+Context Reasoning | Question+Context G.M. | Question-Only Answer | Question-Only Reasoning | Question-Only G.M. | Win Rate
Claude3-200k | 81.95 | 39.21 | 56.68 | 23.43 | 20.95 | 22.15 | 94.61
GPT-4-128k | 73.99 | 26.69 | 44.43 | 43.16 | 10.66 | 21.44 | 84.34
KimiChat-200K | 64.13 | 27.79 | 42.21 | 45.07 | 9.64 | 20.84 | 67.27
InternLM2-7B-200k | 57.95 | 23.94 | 37.24 | 36.97 | 12.65 | 21.62 | 81.69
ChatGLM3-6B-128K | 40.58 | 22.08 | 33.63 | 33.63 | 7.16 | 15.51 | 63.47
Table 5: G.M. is the geometric mean of the answer accuracy and the reasoning score. The win rate is calculated from model responses in the Question-Only and Question+Context settings: for each question, the sum of the answer score and the reasoning score is compared between the two settings, and the setting with the higher sum is considered the winner.
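The two derived quantities in Table 5 can be illustrated with a small sketch. The function names are hypothetical, and two details are assumptions not stated in the caption: per-question answer scores are taken as 0 or 1 with reasoning scores in [0, 1], and ties are not counted as wins for the Question+Context setting.

```python
import math
from typing import List, Tuple

Score = Tuple[float, float]  # (answer score, reasoning score) for one question


def geometric_mean(answer_accuracy: float, reasoning: float) -> float:
    """Geometric mean of the answer accuracy and the reasoning score."""
    return math.sqrt(answer_accuracy * reasoning)


def win_rate(with_context: List[Score], question_only: List[Score]) -> float:
    """Percentage of questions where Question+Context beats Question-Only.

    Assumption: a tie does not count as a win.
    """
    if not with_context or len(with_context) != len(question_only):
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(
        1
        for (a_ctx, r_ctx), (a_q, r_q) in zip(with_context, question_only)
        if a_ctx + r_ctx > a_q + r_q
    )
    return 100.0 * wins / len(with_context)


# Claude3 in the Question+Context setting (values from Table 5).
claude_gm = geometric_mean(81.95, 39.21)

# Toy per-question scores, invented only to exercise the function.
toy_ctx = [(1, 0.5), (0, 0.2), (1, 0.0), (0, 0.0)]
toy_q = [(0, 0.1), (1, 0.0), (1, 0.0), (0, 0.1)]
toy_win_rate = win_rate(toy_ctx, toy_q)
```

With the rounded table inputs, the geometric mean for Claude3 comes out close to the reported 56.68; small differences are expected because the published scores are themselves rounded.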

4.1.3 Models

We conducted experiments using both open-source and closed-source models, selecting LLMs that support input lengths of 100K or more.

Our choice of models prioritises dialogue-enabled LLMs that can process texts of at least 100K in order to extract meaningful information from the text. [Footnote 4: We also ran a number of experiments on models with inputs limited to 32K or less; more results can be found in Appendix D.] Our selection covers two broad categories: closed-source models, including GPT-4, Claude3, and Kimi, which are known for their robustness and long-text support; and open-source models, such as ChatGLM3-6B (Zeng et al., 2023b) and InternLM2-7B (Cai et al., 2024). Model-specific information is shown in Table 4.

4.2 Main Results

We present the final experimental results in Table 5. First, the current closed-source models generally score higher than the open-source models, and all current models leave clear room for improvement on long-text narrative reasoning. Second, by comparing a model’s scores in the Question-Only setting with those in the Question+Context setting, we can gauge the degree of data contamination through the model’s win rate. Based on the data in Table 5, all models have win rates above 60%, suggesting that data contamination is not a significant problem for these models.

4.3 Analysis

We then did some analytical studies on the data.

GPT4 review of reasoning is valid

We verify the validity of the reasoning metric by manually evaluating 100 reviews output by GPT4. We treat the task as a binary judgment of whether each evidence is contained or not contained in the model output, and compute the Kappa coefficient between the human and GPT4 judgments over all results. Both the Kappa coefficient and the judgment accuracy are 92% or more, so we consider the GPT4 review valid.
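An agreement check of this kind can be sketched as follows. The text only says "Kappa coefficient"; this example assumes Cohen's kappa for two raters, and the labels are toy values rather than the paper's data.

```python
from typing import List, Sequence


def accuracy(human: Sequence[int], model: Sequence[int]) -> float:
    """Fraction of items on which the two raters agree."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(human: Sequence[int], model: Sequence[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    if n == 0 or n != len(model):
        raise ValueError("label lists must be non-empty and of equal length")
    p_observed = accuracy(human, model)
    labels = set(human) | set(model)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(
        (list(human).count(lab) / n) * (list(model).count(lab) / n)
        for lab in labels
    )
    if p_expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# 1 = evidence contained, 0 = not contained (toy labels).
human_labels: List[int] = [1, 1, 0, 0]
gpt4_labels: List[int] = [1, 1, 0, 1]
toy_kappa = cohen_kappa(human_labels, gpt4_labels)
toy_accuracy = accuracy(human_labels, gpt4_labels)
```

Reporting kappa alongside raw accuracy guards against the case where one label dominates and high agreement arises by chance.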

The dataset is challenging

Models such as InternLM2, which performs strongly on Needle in a Haystack, and ChatGLM3, which performs well on InfiniteBench retrieval tasks, still fall short of the leading long-document models (GPT4, Kimi, etc.) on our dataset.

Decoupling long text capabilities

By comparing the performance of each model when given the context and when given the evidences, we obtain Table 6. "Both correct" refers to the percentage of all questions answered correctly in both the Question+Context setting and the Question+Evidence setting. "Only with context" refers to the percentage of all questions answered correctly in the Question+Context setting but incorrectly in the Question+Evidence setting. "Only with evidence" refers to the percentage of questions answered incorrectly in the Question+Context setting but correctly in the Question+Evidence setting. A high "both correct" score indicates a strong ability to combine narrative reasoning with long-text comprehension. If "both correct" is not high while "only with evidence" is high, the model is still deficient in long-text comprehension but strong in narrative reasoning.

Model | Both Correct | Context | Evidence
GPT4 | 62.33 | 13.45 | 15.47
Claude3 | 70.63 | 12.11 | 12.78
Kimi | 53.14 | 10.99 | 23.32
InternLM2 | 45.96 | 17.71 | 17.49
ChatGLM3 | 15.25 | 25.34 | 20.18
Table 6: Analysis of the decoupling of model capabilities. Each score is the percentage of such questions out of the total number of questions. "Context" denotes questions answered correctly only in the Question+Context setting, "Evidence" denotes questions answered correctly only in the Question+Evidence setting, and "Both Correct" denotes questions answered correctly in both settings.
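The three categories in Table 6 follow directly from per-question correctness in the two settings. The sketch below uses a hypothetical function name and toy correctness flags.

```python
from typing import Dict, List


def decouple(context_correct: List[bool], evidence_correct: List[bool]) -> Dict[str, float]:
    """Percentage of questions correct in both settings, only with context,
    and only with evidence."""
    n = len(context_correct)
    if n == 0 or n != len(evidence_correct):
        raise ValueError("correctness lists must be non-empty and of equal length")
    pairs = list(zip(context_correct, evidence_correct))
    both = sum(1 for c, e in pairs if c and e)
    context_only = sum(1 for c, e in pairs if c and not e)
    evidence_only = sum(1 for c, e in pairs if e and not c)
    return {
        "both_correct": 100.0 * both / n,
        "context_only": 100.0 * context_only / n,
        "evidence_only": 100.0 * evidence_only / n,
    }


# Toy correctness flags for four questions.
toy_breakdown = decouple(
    context_correct=[True, True, False, False],
    evidence_correct=[True, False, True, False],
)
```

The remaining share, not reported in the table, corresponds to questions answered incorrectly in both settings.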

We can see that for most of the models, accuracy is higher in the Question+Evidence setting. In Appendix E we present case studies of questions that are answered correctly when the context is given but incorrectly when the evidences are given.

5 Conclusion

We introduced DetectiveQA, the first benchmark for narrative reasoning with an average context length of 100K, to test the ability of models to reason narratively over long contexts. Using detective novels, which are real-world texts, we challenged both the long-text comprehension and the narrative reasoning of the models. For each model, our test gives two scores (answer accuracy and reasoning score) in three settings. With this rich experimental setup, we can analyse model performance in depth, and we find that current models still face challenges in long-text comprehension, information extraction, and narrative reasoning. We hope that our dataset will facilitate future improvements in model reasoning ability, leading to more robust AI applications and higher machine intelligence.

Limitations

Our dataset only serves as an evaluation benchmark on long-context reasoning ability, while how to improve the model capability remains an open question. Meanwhile, our benchmark contains only data from detective novels and mainly serves narrative reasoning. More diverse scenarios can be included in the future.

Ethics Statement

We are committed to ensuring that DetectiveQA is used only for academic and scientific purposes. We have therefore rigorously checked the copyright status of all the detective novels used in DetectiveQA’s annotations to ensure that using the individual novels does not create copyright problems in non-commercial settings. Through this screening, we aim to respect the principle of ‘fair use’ under copyright protection and ensure that our project stays within legal and ethical boundaries in a responsible manner.

References

  • An et al. (2023a) Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2023a. L-eval: Instituting standardized evaluation for long context language models. CoRR, abs/2307.11088.
  • An et al. (2023b) Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2023b. L-eval: Instituting standardized evaluation for long context language models. CoRR, abs/2307.11088.
  • (3) Anthropic. Model card and evaluations for claude models.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. CoRR, abs/2309.16609.
  • Bai et al. (2023b) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2023b. Longbench: A bilingual, multitask benchmark for long context understanding. CoRR, abs/2308.14508.
  • Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, and et al. 2024. Internlm2 technical report. CoRR, abs/2403.17297.
  • Fan et al. (2019) Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: long form question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3558–3567. Association for Computational Linguistics.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Hong et al. (2023) Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. 2023. Metagpt: Meta programming for multi-agent collaborative framework. CoRR, abs/2308.00352.
  • Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for Computational Linguistics.
  • Kamradt (2023) Greg Kamradt. 2023. Needle in a haystack - pressure testing llms.
  • Kociský et al. (2018a) Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018a. The narrativeqa reading comprehension challenge. Trans. Assoc. Comput. Linguistics, 6:317–328.
  • Kociský et al. (2018b) Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018b. The narrativeqa reading comprehension challenge. Trans. Assoc. Comput. Linguistics, 6:317–328.
  • Kuratov et al. (2024) Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Y. Sorokin, and Mikhail Burtsev. 2024. In search of needles in a 11m haystack: Recurrent memory finds what llms miss. CoRR, abs/2402.10790.
  • Li et al. (2023) Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2023. Loogle: Can long-context language models understand long contexts? CoRR, abs/2311.04939.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Pang et al. (2022) Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel R. Bowman. 2022. Quality: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 5336–5358. Association for Computational Linguistics.
  • Park et al. (2023) Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023- 1 November 2023, pages 2:1–2:22. ACM.
  • Qiu et al. (2024) Zexuan Qiu, Jingjing Li, Shijue Huang, Wanjun Zhong, and Irwin King. 2024. Clongeval: A chinese benchmark for evaluating long-context large language models. CoRR, abs/2403.03514.
  • Saito (2007) Satomi Saito. 2007. Culture and authenticity : the discursive space of japanese detective fiction and the formation of the national imaginary.
  • Shaham et al. (2023) Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. 2023. Zeroscrolls: A zero-shot benchmark for long text understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 7977–7989. Association for Computational Linguistics.
  • Shaham et al. (2022) Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022. SCROLLS: standardized comparison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 12007–12021. Association for Computational Linguistics.
  • Sun et al. (2024) Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Xiangyang Liu, Hang Yan, Yunfan Shao, Qiong Tang, Shiduo Zhang, Xingjian Zhao, Ke Chen, Yining Zheng, Zhejian Zhou, Ruixiao Li, Jun Zhan, Yunhua Zhou, Linyang Li, Xiaogui Yang, Lingling Wu, Zhangyue Yin, Xuanjing Huang, Yu-Gang Jiang, and Xipeng Qiu. 2024. Moss: An open conversational large language model. Machine Intelligence Research.
  • Tay et al. (2021) Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long range arena : A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  • Wang et al. (2024) Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang. 2024. Novelqa: A benchmark for long-range novel question answering. CoRR, abs/2403.12766.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul F. Christiano. 2021. Recursively summarizing books with human feedback. CoRR, abs/2109.10862.
  • Xi et al. (2023) Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huan, and Tao Gui. 2023. The rise and potential of large language model based agents: A survey. CoRR, abs/2309.07864.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2369–2380. Association for Computational Linguistics.
  • Yue et al. (2023) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. CoRR, abs/2311.16502.
  • Zeng et al. (2023a) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023a. GLM-130B: an open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Zeng et al. (2023b) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023b. GLM-130B: an open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Zeng et al. (2023c) Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S. Yu. 2023c. Large language models for robotics: A survey. CoRR, abs/2311.07226.
  • Zhang et al. (2023) Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, and Lingpeng Kong. 2023. CAB: comprehensive attention benchmarking on long sequence modeling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 41194–41218. PMLR.
  • Zhang et al. (2024) Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. 2024. ∞Bench: Extending long context evaluation beyond 100k tokens. CoRR, abs/2402.13718.

Appendix A Prompt Template

For the reasoning metric, we use GPT-4 to count the pieces of evidence contained in a model's reasoning process. The prompt template we used is shown in Figure 4.

In order to evaluate a question-answering (QA) system’s reasoning process regarding a particular inference question, specifically whether the reasoning process correctly includes certain pieces of evidence, multiple pieces of evidence will be provided.
Due to the nature of the inference question being based on a detective novel, the reasoning process may involve some sensitive content. However, for the purpose of this evaluation, please focus solely on determining whether the reasoning process explicitly or implicitly includes the provided pieces of evidence.
The QA system’s output for the reasoning process may not explicitly mention the provided pieces of evidence, but it might implicitly incorporate them. In such cases, it should still be considered as correctly including the provided pieces of evidence.
The reasoning process output by the QA system and the pieces of evidence to be considered are presented below. Please objectively assess whether the QA system’s reasoning process explicitly or implicitly includes the provided pieces of evidence, and clearly state which pieces of evidence are included.

Reasoning Process:
<Reasoning Process>

Pieces of Evidence:
<Evidence Pieces>

Provide an initial sentence explaining whether the reasoning process explicitly or implicitly includes each piece of evidence. Then, in the second line, specify the indices of the included pieces of evidence in a list format, such as [0, 1, 2, 3, …].

Your response should maintain this format:

Explanation: <One-sentence explanation>
Included Pieces of Evidence: <Indices of the included evidence>

Figure 4: GPT-4 prompt template. In the query, the placeholders (shown in bold) are replaced with the model's reasoning process and the pieces of evidence.

This template is sufficient to obtain usable responses without adding few-shot examples.
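A judge response in the format requested by the template can be parsed mechanically. The Python sketch below is a minimal illustration, not the authors' script: the function names are hypothetical, and turning the count into a score by dividing by the number of annotated evidence pieces is an assumption.

```python
import ast
import re
from typing import List

def parse_included_indices(response: str, num_evidence: int) -> List[int]:
    """Extract the index list from the 'Included Pieces of Evidence:' line."""
    match = re.search(r"Included Pieces of Evidence:\s*(\[[^\]]*\])", response)
    if match is None:
        return []  # the judge did not follow the requested format
    try:
        raw = ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError):
        return []  # e.g. the list contains non-literal text
    # Keep only distinct integer indices that refer to real evidence pieces.
    valid = {i for i in raw if type(i) is int and 0 <= i < num_evidence}
    return sorted(valid)

def reasoning_score(response: str, num_evidence: int) -> float:
    """Assumed scoring: fraction of annotated evidence pieces judged as included."""
    if num_evidence <= 0:
        return 0.0
    return len(parse_included_indices(response, num_evidence)) / num_evidence
```

Responses that omit the list or contain out-of-range indices are treated conservatively here: unparseable output yields an empty list rather than an error.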

Appendix B Implementation Details

The actual experiments also require processing of the inputs and outputs, such as text truncation, answer alignment, and a special two-stage answering method for obtaining the final answer. We describe these three processing steps below.

Text Truncation

We adopt a tail truncation approach for every model: when the input is too long, we discard text from the beginning and keep the end. This differs from the approach in InfiniteBench (Zhang et al., 2024). Given the nature of our dataset, we believe that the evidence pertinent to an inference question is more likely to be found near the end of the text than in the initial chapters of the novel. We therefore retain the longest span at the end of the text that fits as input.
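Tail truncation can be sketched as follows. This is a minimal Python illustration under stated assumptions: `build_input` is a hypothetical helper, inputs are already tokenized into lists, and the question is assumed to be appended after the context so that it is never cut.

```python
def build_input(context_tokens, question_tokens, max_len):
    """Keep the end of the context so that context + question fits in max_len.

    All names are hypothetical; the layout (context first, question last)
    is an assumption made for this example.
    """
    budget = max_len - len(question_tokens)
    if budget < 0:
        raise ValueError("question alone exceeds the model's input limit")
    # Note: a slice like tokens[-0:] would return the whole list,
    # so a zero budget is handled explicitly.
    kept_context = context_tokens[-budget:] if budget > 0 else []
    return kept_context + question_tokens
```

For example, with a ten-token context, a two-token question, and a limit of five tokens, only the last three context tokens are kept.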

Answer Alignment

Our output consists of two components, the answer and the reasoning process, so we require the model's output to be in the format {"answer": "x", "reasoning": "xxx"}, which can easily be loaded as a dictionary using JSON. However, not all models consistently follow these instructions, which makes some outputs impossible to load as a dictionary. In response, we identified the recurring patterns in such outputs and developed a standardized response alignment script. The script converts responses in these particular formats to dictionary form, ensuring the validity of the responses. The impact of this approach is discussed in Appendix C.
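The alignment idea can be illustrated with a short Python sketch. It is not the authors' script: the function names, the order of the fallbacks, and the regular expressions are all assumptions chosen to cover the kinds of malformed output described here.

```python
import ast
import json
import re

def _normalize(obj):
    """Return a clean {'answer', 'reasoning'} dict, or None if unusable."""
    if not isinstance(obj, dict):
        return None
    answer = str(obj.get("answer", "")).strip()
    match = re.match(r"([ABCD])\b", answer)  # accepts 'B' or 'B: some text'
    if match is None:
        return None
    return {"answer": match.group(1), "reasoning": str(obj.get("reasoning", ""))}

def align_response(text):
    """Try progressively looser ways of reading a model response."""
    candidates = [text]
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])  # dictionary embedded in extra text
    for candidate in candidates:
        # json.loads handles valid JSON; ast.literal_eval handles single quotes.
        for loader in (json.loads, ast.literal_eval):
            try:
                result = _normalize(loader(candidate))
            except (ValueError, SyntaxError, TypeError):
                continue
            if result is not None:
                return result
    # Last resort: a phrase such as 'Answer: A' in free text.
    match = re.search(r"(?i:answer)\W{0,5}([ABCD])\b", text)
    if match is not None:
        return {"answer": match.group(1), "reasoning": text}
    return None  # counted as an invalid response
```

A response that still returns `None` after these steps would count toward the invalid rate.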

Special Two-stage Answer

The Qwen-7B model had significant difficulty following the output-format instructions. However, since our test focuses on reasoning ability and instruction following is not a primary criterion, we implemented a two-stage answering method for it: the model is first asked to answer the question, and a separate prompt then asks it to generate the corresponding reasoning process. For LWM the situation is different. Perhaps because the model was not trained on multi-turn dialogue, we still cannot obtain an answer using two rounds of dialogue. We therefore first let the model give its reasoning process, then append "So the answer is" directly after the model's output, take the logits of the next token, and choose the option with the highest probability among A, B, C, and D as the multiple-choice answer. This setup allows models that cannot follow the instructions to complete the evaluation, increasing the usability of the dataset and widening the range of models that can be evaluated.
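The final selection step for LWM can be sketched in Python as follows. The sketch assumes the next-token logits have already been obtained after appending "So the answer is" and are available as a mapping from token string to logit; this mapping and the function name `pick_choice` are hypothetical simplifications, since a real model exposes logits indexed by vocabulary id.

```python
def pick_choice(next_token_logits, choices=("A", "B", "C", "D")):
    """Return the option letter with the highest next-token logit.

    Tokens other than the option letters are ignored, so a high-scoring
    unrelated token cannot be selected as the answer.
    """
    scores = {c: next_token_logits[c] for c in choices if c in next_token_logits}
    if not scores:
        raise ValueError("no option letter found among the next-token logits")
    return max(scores, key=scores.get)
```

Because softmax is monotonic, comparing logits restricted to the four letters gives the same choice as comparing their probabilities.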

Appendix C Answer Alignment

C.1 Examples of non-compliance with Output Rules

Figure 6 shows examples of invalid output formats from a variety of models.

C.2 Changing Invalid Rate

After our alignment work, the invalid rate of the answers is much reduced. The exact change is shown in Figure 5.

Figure 5: Results of Percentage Decrease in Invalid Rate

Invalid ’answer’ field:
{’answer’: ’B: Bump in the Road’, ’reasoning’: ’According to the novel, Sainsbury Seale’s disappearance was caused by a bump in the road. This is inferred from her sudden disappearance and the fact that she left no trace. Other options such as leaving voluntarily, running away with someone, or sudden amnesia are not mentioned in the novel.’}
{"answers":"unable to judge","reasoning": "Sainsbury Seale went to a friend’s house before she disappeared and has not been heard from since. A police investigation has found that she was killed, but details are not known and it is not possible to determine the exact cause of her disappearance."}
Invalid output format:
"Answer: A, Molly was killed by someone else Process of Reasoning: 1. Molly was killed at the clinic, and his body was found by the fireplace with a pistol at his side… {answer: B, reasoning: "The body was lying in a questionable position and could not have fallen naturally. Also, the carpet shows signs of being dragged, indicating that the body was dragged where it was pushed down. Therefore, the place where Molly was killed would have been an area with a carpet. And while there are carpets in the waiting room, consultation room, and office, there is no carpet in the lift, so the answer is B: consultation room"}
Excessive output:
Based on the novel description, the following information is summarized:
1. Forensic autopsy results indicate that Morley’s time of death should be before 13:00, possibly even earlier, but it cannot be confirmed. 2. The last patient, Mr. Ambrose, left the clinic at 12:25. 3. The gatekeeper discovered Morley’s death around 1:30. 4. If Morley were still alive, he would have buzzed the buzzer at the appointment time to call the next patient upstairs. From this information, it can be inferred that Morley’s time of death is between 12:25 and 13:00. After 12:25, he would not have been able to call the next patient. Additionally, the forensic examination indicates death before 13:00. Therefore, I believe the correct answer is: {"answer": "B", "reasoning": "Combining forensic autopsy results with the departure time of the last patient, it can be inferred that Morley’s time of death is between 12:25 and 13:00."}
Completely forget the instructions:
A: Voluntary departure

Figure 6: Categories of Invalid Responses and Illustrative Examples

Appendix D Experiment on 32K models

We also ran a number of experiments on models with input lengths of 32K or less, and found that their results are all relatively unsatisfactory. The experimental results are shown in Table 7.

Table 7: Performance of models supporting context lengths of 32K or less
Models                  Question+Context                 Question-Only                    Win Rate
                        Answer     Reasoning  G.M.       Answer     Reasoning  G.M.
Claude3-200k            81.95      39.21      56.68      23.43      20.95      22.15      94.61
GPT-4-128k              73.99      26.69      44.43      43.16      10.66      21.44      84.34
KimiChat-200K           64.13      27.79      42.21      45.07      9.64       20.84      67.27
InternLM2-7B-200k       57.95      23.94      37.24      36.97      12.65      21.62      81.69
ChatGLM3-6B-128K        40.58      22.08      33.63      33.63      7.16       15.51      63.47
LongChat-v1.5-7B-32k    29.33      11.07      18.01      27.83      5.72       12.59      33.58
Vicuna-v1.5-7B-16k      30.33      12.63      19.57      27.67      6.69       13.60      32.57
Qwen1.5-7B-8k           49.50      10.09      22.34      35.33      7.74       16.53      60.71
Qwen1.5-72B-32K         70.67      19.69      37.30      44.67      10.55      21.70      76.51

However, a larger number of parameters allows a model to make fuller use of the information in the 32K text and of its own reasoning power, which mitigates the problem.
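As a reading aid for Table 7, the G.M. columns are numerically consistent with the geometric mean of the Answer and Reasoning columns in the same setting (for example, in the Claude3-200k and GPT-4-128k rows). The Python sketch below assumes this definition; it is inferred from the reported numbers rather than taken from this section.

```python
import math

def geometric_mean(answer_acc, reasoning_score):
    """Geometric mean of two percentage scores (assumed definition of G.M.)."""
    if answer_acc < 0 or reasoning_score < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(answer_acc * reasoning_score)

# Question+Context scores of Claude3-200k from Table 7.
claude_gm = geometric_mean(81.95, 39.21)
```

Small differences in the last digit are to be expected, since the table entries are themselves rounded.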

Appendix E Case study

Cases in which a model answers correctly when given the context but incorrectly when given the evidence are usually the result of inconsistent output: the model suggests the correct answer in its analysis but gives the wrong conclusion or option at the end. Figure 7 shows examples. We can see that instability in the model's reasoning ability can lead to errors in the final result.

"answer": "A".
"reasoning": "’ ’Based on the clues given…Therefore, the answer is C. Miranda did not attend the party.’" ,
"answer": "B".
"reasoning": "Based on the clues given… while the manner of Litcham Roach’s death is more likely to be active. Therefore, we can conclude that the manner of death of Litcham Roach was definitely not suicide." ,

Figure 7: Examples of why a model gives the wrong answer when given the evidence.

Appendix F Individual model performance

For each model and each of the two annotation types, we derived nine metrics: the answer accuracy (Acc), reasoning metric (RM), and answer invalid rate (IR) in each of the three question settings, Question-Only (simp), Question+Context (deta), and Question+Evidence (clue). The results are shown in Figure 8.

Figure 8: Answer accuracy, reasoning metric, and invalid rate for each model in the three settings. The invalid rate is the percentage of invalid answers out of the total number of questions; the forms of invalid answers are shown in Figure 6.

This figure shows the scores of each model in each setting, which helps in analyzing data contamination (Section 4.2), the decoupling of model capabilities, and so on. The invalid rate of the answers is also informative: it can be reduced by modifying the answering method (Appendix B) or by applying answer alignment (Appendix C), which makes the test more effective.

Appendix G No Options Setting

Since our questions are presented as multiple-choice questions, and the model is required to answer with only one of the letters A, B, C, or D, an answer may be correct simply because the model output one of the four letters at random, which affects the reliability of the assessment.

Therefore, we add a further setting. In this setting, the question contains the author of the novel, the name of the novel, and the question itself, but not the corresponding options. The model is asked to give the answer directly as text instead of a letter, together with its reasoning process.

Inspecting the responses to these questions shows that, without options, the model gives some correct answers, but most answers are still incorrect, and the model even hallucinates when the novel content is not given.

Question:
Who was responsible for Moira’s disappearance?
Model Answer:
Moira’s disappearance was caused by her husband, Roger Brockman.
Model Reasoning:
In Murder on the Cliff, Moira is a young woman whose husband, Roger Brockman, is an intensely controlling and jealous man. In the novel, Roger locks Moira up in an abandoned cottage and fakes Moira’s disappearance in order to prevent her from leaving him. Later, Roger is exposed for his crime and Moira is able to escape. Thus, it was her husband, Roger Brockman, who caused Moira’s disappearance.
True Answer & Reasoning:
A: Dr Nicholson
Nicholson has two obstacles to remove if he wants to marry Sylvia. One is gone, and Moira’s life hangs in the balance. Nicholson says his wife, Moira, is not at the sanatorium and is gone. But Nicholson wasn’t sure where Moira had gone. Moira left without saying a word, which was unusual. Moira said she had no money and nowhere to go. So, Nicholson faked Moira’s departure in order to marry Sylvia, but in fact, she might have been imprisoned or murdered. Taking all these evidences together, it is most likely that Nicholson caused Moira’s disappearance.

Figure 9: An example of model output compared to the true answer in the No Option setting. The model answers and the real answers are labeled in green bold respectively.