
DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs

Zijie Meng, Zhaopeng Feng & Zuozhu Liu
Zhejiang University-University of Illinois at Urbana Champaign Institute
Zhejiang University
Jiaxing, Zhejiang 314400, PRC
{zijie.22, zhaopeng.23, zuozhuliu}@intl.zju.edu.cn
Yan Zhang
Department of Electrical and Computer Engineering
National University of Singapore
4 Engineering Drive 3, Singapore 117583
yanzhang.jlu@gmail.com
Indicates equal contribution. Corresponding author.
Abstract

Large language models (LLMs) have shown impressive performance on reasoning benchmarks with the emergence of Chain-of-Thought (CoT), particularly on multi-choice questions (MCQs). However, current works resolve all questions in the same way regardless of problem-solving difficulty, leading to an excessive focus on simple items and insufficient attention to intricate ones. To address this challenge, we propose a simple yet effective strategy, Divide and Conquer Reasoning (DCR), to enhance the reasoning capability of LLMs for MCQs, inspired by how humans use heuristics to first categorize tasks and then handle them separately. In particular, we first categorize questions into two subsets based on a confidence score $\mathcal{CS}$, which is estimated from the statistical frequency of generated answers. Subsequently, we propose Filter Choices based Reasoning (FCR) to improve model performance on MCQs with low $\mathcal{CS}$. Our experiments demonstrate that the proposed strategy costs only 85% of SOTA while still achieving an average accuracy improvement of 1.56% across nine datasets covering arithmetic, commonsense, and logic reasoning tasks. The code is at https://github.com/AiMijie/DCR.

1 Introduction

Large language models (LLMs) (e.g., GPT3 (Brown et al., 2020), GPT4 (OpenAI, 2023), Palm (Chowdhery et al., 2022), Palm2 (Anil et al., 2023), Lamda (Thoppilan et al., 2022), Llama (Touvron et al., 2023a), Llama2 (Touvron et al., 2023b)) have exhibited outstanding performance on various downstream tasks by generating step-by-step rationales to obtain final answers without finetuning parameters, as elicited by Chain-of-Thought (CoT) prompting (Wei et al., 2022). A multiple-choice question (MCQ) pairs a question with a choices list and prompts the model to select the gold answer. Owing to its simple structure, standardized results, and objective assessment, the MCQ format is not only prevalent in the real world but also extensively employed in evaluating LLMs’ reasoning (Zheng et al., 2023d; Hendrycks et al., 2020; Srivastava et al., 2022; Zhong et al., 2023; Huang et al., 2023). Consequently, the community has witnessed a surge in CoT-based works that demonstrate outstanding performance on MCQs (Wang et al., 2022; Diao et al., 2023; Kojima et al., 2022; Zheng et al., 2023a; Kong et al., 2023). Notably, Zero-Shot-CoT (Kojima et al., 2022) and Self-Consistency (SC) (Wang et al., 2022) have attracted considerable attention due to their straightforward implementation and impressive efficacy. Zero-Shot-CoT stimulates the latent zero-shot reasoning abilities of LLMs by adding “Let’s think step by step.” to prompts, but often underperforms on complex tasks. SC samples different reasoning paths to generate multiple candidates and then applies majority voting to derive the final answer, which achieves encouraging results but introduces substantial overhead. To avoid this high cost, ESC (Li et al., 2024) stops inference early by calculating the entropy of the answer distribution in a small sliding window without sacrificing SC’s performance, and is the current SOTA. However, its performance ceiling is inherently limited by that of SC, which restricts further gains in accuracy.

Therefore, to optimize both cost and performance, it is imperative to halt expensive sampling in time to reduce expenditure, and to employ different approaches for problems of differing complexity to advance accuracy. In other words, previous methods all process data uniformly regardless of problem-solving difficulty, which means that simple questions receive unnecessarily complex and costly procedures, whereas intricate ones are not adequately addressed by basic methods. Humans naturally use heuristic strategies to categorize tasks and then address each category individually, which not only resolves complex issues effectively but also significantly enhances efficiency (Heideman et al., 1984; Knuth, 1998). Consequently, we apply this strategy of data partitioning followed by differentiated processing, known as Divide and Conquer and widely deployed across numerous scenarios (Bentley & Shamos, 1976; Bentley, 1980; Smith, 1985; Eisenstein, 2006; Mallouk, 2013), to LLM reasoning. In this context, we need to address two paramount challenges: (1) What criteria should be used to divide the dataset? (2) How should the subsets be processed?

Figure 1: Illustration of DCR. (1) Divide. We first conduct $t$ (e.g. $t$=5) rounds of inference with Zero-Shot-CoT (Kojima et al., 2022) using “Let’s think step by step.”. Then, the dataset $\mathbb{D}$ is divided based on $\mathcal{CS}$, where DataItems with $\mathcal{CS}$ no greater than $\mu$ (e.g. $\mu$=0.6) are categorized as $\mathbb{D}_{low}$, and the rest as $\mathbb{D}_{other}$. (2) Conquer. We fix $\mathbb{D}_{other}$ and propose FCR to process $\mathbb{D}_{low}$. “DataItem” in the Divide area includes the question text and the full choices list, while in the Conquer area it involves only the filtered choices list. “Rationale$i$_$j$” denotes the rationale generated by the $j$-th LLM query for the $i$-th DataItem. “Choice_$x$” represents the $x$-th option in the original DataItem.

For the first challenge, we need a method that effectively classifies questions by solving difficulty. In human perception, answers given with high uncertainty are often wrong, while confident answers tend to be correct (Xiong et al., 2023). We therefore tentatively probed SC (Wang et al., 2022), where the statistical distribution of answers generated from various reasoning paths reflects a confidence score $\mathcal{CS}$ for the question. As shown in Figure 4, we divided questions into two subsets based on their $\mathcal{CS}$; the subsets display distinct accuracy, and the subset with lower $\mathcal{CS}$ demonstrates poorer performance. This suggests that we can employ SC to compute $\mathcal{CS}$ for each problem and divide the dataset accordingly.

Moving to the second challenge, and inspired by the Cannikin Law in management (Goldratt & Cox, 2016), we explore more elaborately designed methods for the low-confidence subset, which offers greater room for optimization, and keep the answers fixed for the other questions, which are sufficiently simple for the model. Shi et al. (2023) investigated the model’s sensitivity to irrelevant information within the questions, but the effect of irrelevant options in the choices list remains uncertain. To delve into this problem, we conducted preliminary studies as shown in Figure 6, discovering a decrease in problem-solving accuracy as the number of choices increased. Following this, we removed some irrelevant options in the hard-to-solve subsets and re-queried the LLM, resulting in a universal improvement of over 20%, and notably reaching 75.52% on CMSQA (Talmor et al., 2018), as shown in Table 6. Motivated by these findings, we introduce Filter Choices based Reasoning (FCR), which excludes redundant options by using the answers from the divide stage, to conduct inference in the conquer stage.

Concretely, in this paper, we propose a simple yet effective strategy, Divide and Conquer Reasoning (DCR), which first categorizes questions into two subsets based on $\mathcal{CS}$ and subsequently employs FCR to improve model performance on MCQs with low $\mathcal{CS}$, as illustrated in Figure 1. In extensive empirical evaluation across nine datasets covering arithmetic, commonsense, and logic tasks, DCR not only consumes on average only 85% of the resources required by ESC, but also improves accuracy by an average of 1.56% on these datasets. Additionally, we have validated the effectiveness of DCR across various LLMs (Team et al., 2024; Jiang et al., 2023a; Anil et al., 2023; Team et al., 2023; OpenAI, 2023) and the superiority of FCR over other reasoning methods (Wei et al., 2022; Diao et al., 2023; Zheng et al., 2023a; Kojima et al., 2022; Kong et al., 2023). We have also successfully adapted DCR to the cloze-style dataset GSM8K (Cobbe et al., 2021), achieving improved performance over SC at reduced cost. In summary, our work makes three major contributions: (1) To the best of our knowledge, we are the first to employ Divide and Conquer at the dataset level for LLM reasoning, providing the community with a fresh perspective. (2) By dividing the dataset based on $\mathcal{CS}$ and conquering the low-$\mathcal{CS}$ subset with FCR, we achieve a favorable balance between cost and accuracy. (3) We evaluate this strategy across nine datasets within three distinct reasoning tasks, consistently yielding significant improvements.

2 Methodology

The overall framework of DCR is illustrated in Figure 1. We consider a test set of length $n$ represented as $\mathbb{D}=\{(Q_{1},\mathbf{C}_{1}),...,(Q_{n},\mathbf{C}_{n})\}$, where $Q_{i}$ denotes the $i$-th question text and $\mathbf{C}_{i}$ is the corresponding choices list. In addition, we use $\mathbf{R}_{i}$ and $\mathbf{A}_{i}$ to denote the rationales and answers generated by LLMs for the $i$-th item, respectively.

2.1 Divide

For each item $(Q_{i},\mathbf{C}_{i}),i\in\{1,...,n\}$, we query the LLM $t$ times to obtain rationales $\mathbf{R}_{i}=\{r_{i,1},...,r_{i,t}\}$ and corresponding standby answers $\mathbf{A}_{i}=\{a_{i,1},...,a_{i,t}\}$ based on Zero-Shot-CoT (Kojima et al., 2022). (Footnote 1: Few-Shot-CoT (i.e. CoT) (Wei et al., 2022) requires substantial human labor to annotate task-specific exemplars, and zero-shot gradually approaches or even surpasses few-shot as the scale of the model increases (Hu et al., 2023; Zhong et al., 2023). See Table 9 in Appendix A for our verification.) In particular, we generally set $t$ equal to the length of the choices list $|\mathbf{C}_{i}|$, considering the worst scenario where all choices could be sampled. We conduct a more detailed analysis of different values of $t$ in Section 3.4.2. We use $\delta$ to denote the LLM and define the confidence score $\mathcal{CS}$ of each item as:

$$\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}=\max_{j\in\{1,...,t\}}p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i})), \qquad (1)$$

where $p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i}))$ is the frequency of $a_{i,j}$ among all predicted answers, defined as:

$$p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i}))=\frac{\sum_{k\in\{1,...,t\}}\mathbf{1}_{a_{i,k}=a_{i,j}}}{t}. \qquad (2)$$
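Equations (1) and (2) amount to counting how often the most frequent sampled answer occurs. The following is a minimal sketch under stated assumptions, not the released implementation: the function name `confidence_score` is hypothetical, and answers are assumed to be already extracted from the rationales as option labels.

```python
from collections import Counter


def confidence_score(answers):
    """Return (most frequent answer, CS) for the t sampled answers of one item.

    `answers` plays the role of A_i = {a_{i,1}, ..., a_{i,t}}; the function
    name and the label-string representation are illustrative assumptions.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)                      # numerator of Eq. (2) per answer
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)    # Eq. (1): max frequency


# Five samples (t = 5) in which "B" appears three times.
print(confidence_score(["B", "B", "C", "B", "A"]))  # -> ('B', 0.6)
```

When all $t$ answers differ, the score bottoms out at $1/t$, which is the worst case that motivates tying $t$ to the number of choices.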

Intuitively, $\mathcal{CS}$ indicates the proportion of the most frequent answer among all results predicted in $t$ rounds of inference, and is employed to reflect the problem-solving difficulty. Then, we can divide $\mathbb{D}$ with the following rule:

$$(Q_{i},\mathbf{C}_{i})\in\begin{cases}\mathbb{D}_{other},&\text{if }\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}>\mu,\\ \mathbb{D}_{low},&\text{if }\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}\leq\mu,\end{cases} \qquad (3)$$

where $\mathbb{D}_{low}$ represents the low-confidence subset containing items $(Q_{i},\mathbf{C}_{i})$ with a dispersed distribution of $\mathbf{A}_{i}$, and $\mathbb{D}_{other}$ includes the remaining items. $\mu$ is the dividing threshold, which is specified in Section 3.2 and discussed in Section 3.4.2. Moreover, beyond dividing the questions into two subsets, we explore a more fine-grained division in Section 3.4.2 to evaluate our dividing rule. Next, we keep the answers for $\mathbb{D}_{other}$ fixed to conserve resources, and delve deeper into $\mathbb{D}_{low}$ for further performance improvement.
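The dividing rule in Eq. (3) can be sketched as follows. This is an illustrative sketch only: the function name `divide` and the dictionary layout (item id mapped to its list of sampled answers) are assumptions, and the default threshold 0.6 is the value the paper uses for $\mu$.

```python
from collections import Counter


def divide(sampled_answers, mu=0.6):
    """Split item ids into (D_other, D_low) following Eq. (3).

    `sampled_answers` maps a hypothetical item id to the t answers
    sampled for it in the divide stage.
    """
    d_other, d_low = [], []
    for item_id, answers in sampled_answers.items():
        cs = max(Counter(answers).values()) / len(answers)  # confidence score
        if cs > mu:
            d_other.append(item_id)   # confident: keep divide-stage answers
        else:
            d_low.append(item_id)     # CS <= mu: send to the conquer stage
    return d_other, d_low


samples = {
    "q1": ["A", "A", "A", "A", "A"],   # CS = 1.0
    "q2": ["A", "A", "A", "B", "C"],   # CS = 0.6, on the boundary
    "q3": ["A", "B", "C", "D", "E"],   # CS = 0.2
}
print(divide(samples))  # -> (['q1'], ['q2', 'q3'])
```

Note that an item sitting exactly on the threshold (here 3 of 5 votes) falls into the low-confidence subset, because the inequality for $\mathbb{D}_{other}$ is strict.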

2.2 Conquer

We propose Filter Choices based Reasoning (FCR) to conquer $\mathbb{D}_{low}$ in this stage, as shown in Figure 1, where we exploit the results obtained in the divide phase as candidate options for subsequent inference. Specifically, we define $\mathbf{C}_{i}^{\prime}=uniq(\mathbf{A}_{i})$, where the $uniq(\cdot)$ operation signifies deduplication of $\mathbf{A}_{i}=\{a_{i,1},...,a_{i,t}\}$. Then we use $(Q_{i},\mathbf{C}_{i}^{\prime})$ to construct the new prompt and query the LLM with “Let’s delve deeper into these {$|\mathbf{C}_{i}^{\prime}|$} choices and select the best one.” (Footnote 2: We evaluate the robustness of FCR to different query prompts in Appendix B.) Subsequently, through an additional $t$ rounds of inference, we obtain the new standby answers $\mathbf{A}_{i}^{\prime}=\{a^{\prime}_{i,1},...,a^{\prime}_{i,t}\}$ for $\mathbb{D}_{low}$. Notably, our method does not merely delete options; it also modifies the option symbols (i.e. ‘A’, ‘B’, ‘C’, etc.) in step with the number of remaining choices. Furthermore, in Section 3.4.3, we evaluate the impact of conquering different subsets and compare FCR with other reasoning methods, to demonstrate the superiority of processing only $\mathbb{D}_{low}$ with our method.
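The filtering and relabeling step can be sketched as below. This is a hedged illustration rather than the authors' code: the helper names `filter_choices` and `build_fcr_prompt`, the "Q: ... Answer Choices: ..." layout, and the choice to keep surviving options in their original order are all assumptions; only the trigger sentence comes from the text above.

```python
import string

# Trigger sentence quoted from the paper; {k} stands for |C_i'|.
FCR_TRIGGER = "Let's delve deeper into these {k} choices and select the best one."


def filter_choices(choices, sampled_answers):
    """Keep only options that were sampled in the divide stage and relabel them.

    `choices` maps original labels ('A', 'B', ...) to option text.
    Returns (new_choices, new_to_old); keeping the original option order
    is an assumption made for this sketch.
    """
    sampled = set(sampled_answers)                  # uniq(A_i)
    kept = [label for label in choices if label in sampled]
    new_choices, new_to_old = {}, {}
    for new_label, old_label in zip(string.ascii_uppercase, kept):
        new_choices[new_label] = choices[old_label]  # relabel as A, B, C, ...
        new_to_old[new_label] = old_label            # remember the original symbol
    return new_choices, new_to_old


def build_fcr_prompt(question, new_choices):
    """Assemble a conquer-stage prompt (the layout is hypothetical)."""
    options = " ".join(f"({label}) {text}" for label, text in new_choices.items())
    trigger = FCR_TRIGGER.format(k=len(new_choices))
    return f"Q: {question} Answer Choices: {options}\nA: {trigger}"


choices = {"A": "red", "B": "green", "C": "blue", "D": "black", "E": "white"}
new_choices, new_to_old = filter_choices(choices, ["E", "B", "B", "E", "C"])
print(new_choices)  # -> {'A': 'green', 'B': 'blue', 'C': 'white'}
print(build_fcr_prompt("Which color is the sky on a clear day?", new_choices))
```

Keeping the `new_to_old` mapping makes it possible to translate a conquer-stage answer back to the original option symbol before scoring.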

Ultimately, we align the standby answers $\{\mathbf{A}_{i}|(Q_{i},\mathbf{C}_{i})\in\mathbb{D}_{other}\}$ and $\{\mathbf{A}_{i}^{\prime}|(Q_{i},\mathbf{C}_{i})\in\mathbb{D}_{low}\}$ generated in the two stages with $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$, respectively, and then utilize majority voting (Wang et al., 2022) to determine the final answer for each data item. Our full strategy requires no human intervention or manual labor, and infers $t$ times for each data item in $\mathbb{D}_{other}$ and $2t$ times for each data item in $\mathbb{D}_{low}$.
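The final aggregation and the resulting inference budget can be sketched as follows. The names `majority_vote`, `final_answers`, and `average_calls`, as well as the per-item `new_to_old` label mapping used to translate relabeled conquer-stage answers back to the original symbols, are assumptions introduced for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def final_answers(divide_answers, conquer_answers, new_to_old):
    """Vote over A_i for D_other items and over A_i' for D_low items.

    `conquer_answers` only contains the items of D_low; `new_to_old` holds,
    per item, a hypothetical mapping from relabeled to original symbols.
    """
    final = {}
    for item_id, answers in divide_answers.items():
        if item_id in conquer_answers:               # item belongs to D_low
            voted = majority_vote(conquer_answers[item_id])
            final[item_id] = new_to_old[item_id][voted]
        else:                                        # item belongs to D_other
            final[item_id] = majority_vote(answers)
    return final


def average_calls(t, n_other, n_low):
    """Average LLM queries per item: t for D_other and 2t for D_low."""
    return (t * n_other + 2 * t * n_low) / (n_other + n_low)


divide_answers = {"q1": ["A"] * 5, "q2": ["E", "B", "B", "E", "C"]}
conquer_answers = {"q2": ["C", "C", "A", "C", "B"]}
new_to_old = {"q2": {"A": "B", "B": "C", "C": "E"}}
print(final_answers(divide_answers, conquer_answers, new_to_old))  # -> {'q1': 'A', 'q2': 'E'}
print(average_calls(t=5, n_other=70, n_low=30))                    # -> 6.5
```

The cost function makes the trade-off explicit: the average number of queries grows linearly with the share of items routed to $\mathbb{D}_{low}$, from $t$ when no item is routed to $2t$ when all are.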

3 Experiments

3.1 Datasets and evaluation metrics

To evaluate the effectiveness of DCR and analyse it empirically, we conducted experiments on three tasks: 1) Arithmetic. AQuA (AQ.) (Ling et al., 2017), and Abstract Algebra (Alg.) and High School Mathematics (Math.) from the MMLU dataset (Hendrycks et al., 2020). 2) Commonsense. CMSQA (CMS.) (Talmor et al., 2018), OpenBookQA (OB.) (Mihaylov et al., 2018) and ARC Challenge (ARC.) (Clark et al., 2018). 3) Logic. RiddleSense (Rid.) (Lin et al., 2021), Logical Deduction (Logi.) from the BIG-bench dataset (Srivastava et al., 2022) and Reclor (Rec.) (Yu et al., 2020). The statistical details can be found in Table 10 in the appendix. Additionally, we employed exact match (EM) accuracy to evaluate performance, the same as previous works (Wei et al., 2022; Kojima et al., 2022).

3.2 Implementation details

We primarily employed GPT-3.5-Turbo-0613 from the OpenAI API (https://platform.openai.com), and conducted experiments on other open-source and black-box LLMs in Section 3.4.1. During the divide phase, we set the temperature to 0.7 and the number of inferences $t$ to 4 or 5 depending on the dataset, as detailed in Table 10. We divided each dataset into $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$ with $\mu$ set to 0.6. In the conquer stage, the temperature and the number of inferences were kept consistent with the divide phase. Experiments were conducted on the full dataset by default, except in Section 3.4.4 and Appendix A, where we randomly sampled 500 items for each dataset except 254 for AQuA and 300 for SVAMP. In addition, all final results were obtained by averaging five random trials. Notably, since the accuracy of ESC normally equals or falls below that of SC, we mainly compared with SC in Section 3.4.

3.3 Main results

Table 1: Comparison of problem-solving accuracy (%) among different methods. “Avg.” denotes average accuracy across nine datasets. “#Call” refers to the average sample size (i.e. inference times) for each question across nine datasets. The SC and ESC rows listed below DCR are the versions run with a sample size approximately equal to that of DCR.
Method  Arithmetic              Commonsense             Logic                   Avg.    #Call
        AQ.     Alg.    Math.   CMS.    OB.     ARC.    Rid.    Logi.   Rec.
SC      68.98   43.20   64.00   76.12   87.04   89.68   68.72   48.07   61.84   67.52   8.94
ESC     68.98   43.20   64.00   76.12   87.04   89.68   68.72   48.07   61.84   67.52   6.79
DCR     71.02   48.60   66.52   77.97   86.80   89.79   68.81   50.27   61.96   69.08   5.79
SC      66.46   43.20   62.52   75.00   85.24   88.98   68.03   48.80   61.00   66.58   6.17
ESC     68.98   42.20   64.00   76.12   84.68   88.52   68.72   48.07   60.20   66.83   6.17

We compared SC (Wang et al., 2022), ESC (Li et al., 2024), and our method across nine datasets, as shown in Table 1. Following Section 2.2, we set $2t$ as the upper bound on inference times and defined the window size of ESC as $t$. Under this limit, the average sample size (i.e. inference times) per question for the original SC is 8.94, with an average accuracy of 67.52%. ESC reduces the sample size to 6.79 and maintains the accuracy of 67.52%. DCR further reduces the average sample size to 5.79 while achieving an accuracy of 69.08%, surpassing the baselines by 1.56%, which demonstrates dual improvements in efficiency and performance.

Figure 2: Average accuracy and #Call across different datasets. See Figure 9 for details about each dataset.

In Figure 2, we present the average accuracy of SC and ESC across different datasets for various sample sizes. Notably, DCR achieves similar levels of accuracy at a substantially lower cost than the baselines, indicating a significant enhancement in efficiency. Meanwhile, when costs are comparable, DCR consistently outperforms these two baselines. This superiority is quantified by the matched-cost SC and ESC rows (the last two rows) of Table 1, over which DCR holds an encouraging lead of 2.50% and 2.25%, respectively. Furthermore, we observed diminishing performance improvements for SC and ESC as the sample size increases, suggesting that they approach a bottleneck. However, the integration of FCR for inference on $\mathbb{D}_{low}$ during the conquer stage offers a potential pathway to break through this bottleneck.

3.4 Analysis

3.4.1 Comparison across different LLMs

Table 2: Accuracy (%) across different LLMs. The number in parentheses denotes average sample size.
Model    Setting  AQ.            CMS.
Gemma    SC       34.96 (8.00)   65.31 (6.00)
         DCR      37.24 (7.50)   67.81 (5.98)
Mistral  SC       39.29 (9.00)   71.37 (7.00)
         DCR      43.31 (8.97)   73.10 (6.51)
Palm2    SC       38.50 (6.00)   74.15 (6.00)
         DCR      39.37 (5.21)   75.17 (5.53)
Gemini   SC       70.39 (8.00)   78.41 (6.00)
         DCR      68.74 (7.40)   78.85 (5.26)
GPT4     SC       84.17 (6.00)   84.21 (6.00)
         DCR      85.43 (5.99)   85.19 (5.70)

In this section, we conducted a comparative analysis between SC (Wang et al., 2022) and DCR using various models. Specifically, we employed Gemma (gemma-7b-it) (Team et al., 2024) and Mistral (Mistral-7B-Instruct-v0.2) (Jiang et al., 2023a), available on Hugging Face (https://huggingface.co), Palm2 (text-bison-001) (Anil et al., 2023) and Gemini (gemini-pro) (Team et al., 2023) from Google AI (https://ai.google.dev), as well as GPT4 (gpt-4-1106-preview) (OpenAI, 2023) from the OpenAI API. As shown in Table 2, DCR generally achieves higher accuracy at lower cost, except on AQuA using Gemini. Notably, the larger-scale LLMs (e.g. Gemini and GPT4) significantly outperform the other models, particularly on AQuA, with improvements exceeding 30%. However, this also diminishes the relative advantage of DCR: the improvements with Mistral are 4.02% and 1.73% on the two datasets, but only 1.26% and 0.98% with GPT4. Therefore, we believe that enhancing model capabilities resembles the process of making up for weaknesses, which compresses the space for optimization.
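The per-model gains quoted above follow directly from Table 2. As a small worked check (the dictionary layout and the function name `dcr_gains` are illustrative; the numbers are copied from Table 2), the accuracy difference between DCR and SC can be computed as:

```python
# (SC accuracy, DCR accuracy) in % per model and dataset, copied from Table 2.
TABLE2 = {
    "Gemma":   {"AQ.": (34.96, 37.24), "CMS.": (65.31, 67.81)},
    "Mistral": {"AQ.": (39.29, 43.31), "CMS.": (71.37, 73.10)},
    "Palm2":   {"AQ.": (38.50, 39.37), "CMS.": (74.15, 75.17)},
    "Gemini":  {"AQ.": (70.39, 68.74), "CMS.": (78.41, 78.85)},
    "GPT4":    {"AQ.": (84.17, 85.43), "CMS.": (84.21, 85.19)},
}


def dcr_gains(table):
    """Return DCR minus SC accuracy (percentage points), rounded to 2 decimals."""
    return {
        model: {name: round(dcr - sc, 2) for name, (sc, dcr) in row.items()}
        for model, row in table.items()
    }


for model, row in dcr_gains(TABLE2).items():
    print(model, row)
```

The only negative entry is Gemini on AQuA (a drop of 1.65 points), matching the exception noted in the text.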

3.4.2 Study for divide stage

Figure 3: Average Prior accuracy on different subsets for various sample sizes $t$. See Figure 11 for details about each dataset.
Figure 4: The average sizes of different subsets for various sample sizes $t$. See Figure 12 for details about each dataset.
Table 3: Comparison of accuracy (%) on different confidence subsets. “#Size” indicates the number of data items in each subset. “Prior” denotes the accuracy of results generated in the divide stage. For $\mathbb{D}_{low_{b}}$, “-” marks results that lack reliability because of insufficient data.
Subset                  Setting Arithmetic              Commonsense             Logic                   Avg.
                                AQ.     Alg.    Math.   CMS.    OB.     ARC.    Rid.    Logi.   Rec.
$\mathbb{D}_{high}$     #Size   74.60   30.40   71.80   588.80  325.80  856.20  404.40  27.60   232.60  290.24
                        Prior   91.96   53.95   90.53   92.09   96.13   96.26   89.81   86.96   75.41   85.90
$\mathbb{D}_{med}$      #Size   51.20   38.20   82.40   265.60  97.00   191.20  218.40  81.00   151.40  130.71
                        Prior   79.69   43.46   61.17   72.74   73.61   75.42   69.60   60.49   51.52   65.30
                        FCR     74.22   35.08   62.86   69.95   70.31   73.95   63.00   53.58   52.84   61.75
$\mathbb{D}_{low_{t}}$  #Size   70.00   30.20   109.60  265.20  74.00   115.40  251.40  153.40  113.20  131.38
                        Prior   55.71   28.48   39.60   53.24   50.27   48.87   50.99   35.59   42.05   44.98
                        FCR     62.86   49.01   55.47   61.61   64.05   65.68   50.68   42.11   48.59   55.56
$\mathbb{D}_{low_{b}}$  #Size   58.20   1.20    6.20    101.40  3.20    2.20    146.80  38.00   2.80    40.00
                        Prior   36.08   -       -       33.14   -       -       35.97   17.37   -       30.64
                        FCR     46.39   -       -       52.47   -       -       40.87   34.74   -       43.62

Effect of different sample size $t$. The Prior accuracy is a key metric reflecting the effectiveness of division, where lower $\mathcal{CS}$ is expected to correlate with lower Prior accuracy. Consequently, we conducted an experiment to observe the impact of varying $t$ from 3 to 20 on Prior accuracy. As illustrated in Figure 3, there is a clear distinction in Prior accuracy between the subsets, and only a small number of inferences is required before the accuracy settles into a narrow oscillating range, which supports basing $t$ on $|\mathbf{C}_{i}|$. Additionally, the size of each subset after division is also a crucial metric, as it directly impacts the overall cost of DCR. Therefore, Figure 4 presents the sizes of $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$ across various $t$. Similar to Prior accuracy, the subset sizes also stabilize within a fluctuating range after only a few inferences.

Effect of different dividing threshold $\mu$. Based on the definition of sample size $t$ and the strategy of DCR in Section 2.1, we divided the dataset into four discrete subsets according to $\mathcal{CS}$ intervals: (0.8, 1] for $\mathbb{D}_{high}$, (0.6, 0.8] for $\mathbb{D}_{med}$, (0.4, 0.6] for $\mathbb{D}_{low_t}$, and [0, 0.4] for $\mathbb{D}_{low_b}$. Considering the model’s high confidence on $\mathbb{D}_{high}$ ($\mathcal{CS}$ greater than 0.8), we only report Prior accuracy, which exceeds 85% in the majority (7 out of 9) of datasets, as shown in Table 3. This indicates that most questions in $\mathbb{D}_{high}$ are relatively simple and require no further processing. In contrast, $\mathbb{D}_{med}$ demonstrates moderate Prior accuracy and achieves improvements via FCR in only a minority (2 out of 9) of datasets. In fact, the $\mathcal{CS}$ of each item in $\mathbb{D}_{med}$ lies in (0.6, 0.8], indicating that although the model generates diverse answers, it predominantly settles on a specific one. This makes it difficult to improve the LLM’s performance by correcting its previously generated mistakes, so the gains from FCR are limited. In addition, relative to the original DCR, $\mathbb{D}_{low_t}$ and $\mathbb{D}_{low_b}$ come from further dividing $\mathbb{D}_{low}$, where the former has the higher $\mathcal{CS}$. Accordingly, the average Prior accuracy of $\mathbb{D}_{low_b}$ is only 30.64%, markedly below the 44.98% of $\mathbb{D}_{low_t}$ and significantly lower than that of the other subsets. Meanwhile, through the conquer phase in DCR, we achieve average accuracy improvements of 10.58% and 12.98% for $\mathbb{D}_{low_t}$ and $\mathbb{D}_{low_b}$, respectively. However, in more than half of the datasets, $\mathbb{D}_{low_b}$ contains very few items, making it difficult to reliably report accuracy or to effectively improve performance on the entire dataset. Therefore, we set the threshold $\mu$ to 0.6 for dataset division.
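The interval boundaries above can be written down as a small bucketing routine. This is only a sketch: the function names and the string labels are hypothetical, and the two-way split assumes that items with $\mathcal{CS} \le \mu$ form $\mathbb{D}_{low}$ while all remaining items form $\mathbb{D}_{other}$.

```python
def confidence_subset(cs):
    """Map a confidence score in [0, 1] to the four-way split of the ablation."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError("CS must lie in [0, 1]")
    if cs > 0.8:
        return "high"   # (0.8, 1]
    if cs > 0.6:
        return "med"    # (0.6, 0.8]
    if cs > 0.4:
        return "low_t"  # (0.4, 0.6]
    return "low_b"      # [0, 0.4]


def divide(cs, mu=0.6):
    """Two-way split: items with CS <= mu are sent to the conquer stage."""
    return "low" if cs <= mu else "other"


# With t = 5 the possible scores are 0.2, 0.4, 0.6, 0.8 and 1.0.
for k in range(1, 6):
    cs = k / 5
    print(cs, confidence_subset(cs), divide(cs))
```

Because the intervals are closed on the right, a question with three agreeing answers out of five ($\mathcal{CS} = 0.6$) still falls into the low-confidence side under this sketch.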

Figure 5: Distribution of different subsets among three tasks. See Figure 10 for details about each dataset.

The distribution of different subsets. Incorporating the dividing results, we conducted a statistical analysis of how the confidence subsets are distributed across the three reasoning tasks, as visualized in Figure 5. The proportion of the other (non-low-confidence) subsets exceeds 50% in every task and even surpasses 80% in the commonsense class, which means that we can achieve high accuracy on a substantial portion of the data without complex processing. Therefore, based on DCR, we can concentrate more resources on the low-confidence subsets while avoiding redundant processing on the other ones, which significantly reduces overall expenditure.

3.4.3 Study for conquer stage

Table 4: Comparison of problem-solving accuracy (%) for conquering different subsets.
Task groups: Arithmetic (AQ., Alg., Math.), Commonsense (CMS., OB., ARC.), Logic (Rid., Logi., Rec.).

| Conquer Subset | AQ. | Alg. | Math. | CMS. | OB. | ARC. | Rid. | Logi. | Rec. | Avg. | #Call |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathbb{D}_{med}$ & $\mathbb{D}_{low}$ | 69.92 | 45.40 | 67.04 | 77.36 | 86.16 | 89.55 | 67.40 | 48.40 | 62.36 | 68.18 | 6.78 |
| $\mathbb{D}_{low}$ | 71.02 | 48.60 | 66.52 | 77.97 | 86.80 | 89.79 | 68.81 | 50.27 | 61.96 | 69.08 | 5.79 |

Different conquer subsets. Building upon Section 3.4.2, we retain $\mathbb{D}_{med}$ and combine $\mathbb{D}_{low_t}$ and $\mathbb{D}_{low_b}$ into $\mathbb{D}_{low}$ to compare the impact of conquering different subsets, as shown in Table 4. $\mathbb{D}_{low}$, being a smaller subset, requires an average sample size of 5.79, which is 0.99 lower than conquering $\mathbb{D}_{med}$ and $\mathbb{D}_{low}$ together. Furthermore, according to Table 3, additional interventions on $\mathbb{D}_{med}$ by FCR yield marginal benefits. Consequently, conquering $\mathbb{D}_{med}$ and $\mathbb{D}_{low}$ together improves accuracy on only two datasets (Math. and Rec.), which leads us to focus solely on $\mathbb{D}_{low}$ in the conquer stage.

Table 5: Accuracy (%) with different reasoning methods on $\mathbb{D}_{low}$ in AQuA and CMSQA.

| Method | AQ. | CMS. | Average |
|---|---|---|---|
| ManualCoT | 43.21 | 56.96 | 50.09 |
| Active-Prompt | 42.28 | 57.88 | 50.08 |
| PHP | 44.49 | - | - |
| Zero-Shot-CoT | 44.46 | 45.23 | 44.85 |
| Role-Play Prompting | 48.20 | 46.79 | 47.50 |
| FCR | 49.45 | 54.39 | 51.92 |

Different reasoning methods. In this section, we compare our zero-shot FCR with some representative few-shot works: ManualCoT (i.e., CoT) (Wei et al., 2022), Active-Prompt (Diao et al., 2023), and PHP (Zheng et al., 2023a), as well as some zero-shot methods: Zero-Shot-CoT (Kojima et al., 2022) and Role-Play Prompting (Kong et al., 2023). Since these methods were evaluated on different datasets, we chose $\mathbb{D}_{low}$ from two widely used datasets (AQuA and CMSQA) for this comparison, and evaluated performance with a sample size of one. Notably, we ran PHP only on AQuA since it only reported results on arithmetic tasks. As shown in Table 5, FCR achieves the highest accuracy of 49.45% on AQuA and performs competitively with few-shot methods on CMSQA. This mirrors the trend in Table 9 of Appendix A, where Zero-Shot-CoT approaches or even exceeds Few-Shot-CoT on multiple datasets, yet still falls far behind on CMSQA. Furthermore, FCR exhibits the highest average accuracy, surpassing the second-best zero-shot method, Role-Play Prompting, by 4.42%, which highlights the efficacy of our method without additional human labor.

Table 6: Accuracy (%) on unsolved subsets with different construction methods of choices list.
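| Setting | AQ. | CMS. | OB. | Rid. |
|---|---|---|---|---|
| List1 | 2.16 | 0.97 | 0.57 | 1.97 |
| List2.1 | 21.65 | 47.58 | 31.03 | 28.11 |
| List2.2 | 29.00 | 62.10 | 45.98 | 49.25 |
| List3 | 18.18 | 29.46 | 32.76 | 24.04 |
| List4 | 26.41 | 65.01 | 58.05 | 44.02 |
| List5 | 29.39 | 75.52 | 48.85 | 45.70 |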
Figure 6: Accuracy with different number of choices.
Figure 7: Probability of correct answer in filtered list.

3.4.4 Study for irrelevant choices

Irrelevant information may distract LLM. Shi et al. (2023) investigated the sensitivity of LLMs to irrelevant information within questions and proposed adding instructions or exemplars to reduce distractibility. In fact, such irrelevant information is not limited to the question context but is also contained in the options list. Therefore, we analyzed accuracy with different numbers of choices, focusing on the impact of increasing the number of incorrect options. As shown in Figure 6, accuracy declines noticeably with more incorrect options, where we extended the choices list by randomly combining wrong answers. To delve deeper, we focused on the subsets of problems that remain unsolved by SC (Wang et al., 2022) with 5 samples. We then conducted inference with various construction methods for the choices list, as shown in Table 6: 1) presenting the full choices list as List1; 2) combining the correct option with 2 or 1 randomly sampled incorrect ones as List2.1 or List2.2, respectively; 3) using the correct option and the deduplicated results from the previous five inferences as List3; 4) selecting the correct option and the choices not included in earlier results as List4; 5) retaining the correct option and randomly picking one from the rest of List4 as List5. The accuracy for List1 is close to 0%, while the other constructions significantly enhance performance. However, the correct answers for the test set are unknown in real-world scenarios, which leads us to explore the feasibility of using results from previous inferences to filter the choices. We therefore quantified the probability that the correct answer is in the filtered choices list, as shown in Figure 7. On average, 90.51% of cases retained the correct answer, indicating that earlier results can effectively narrow down the original choices list.
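The list constructions compared in Table 6 can be sketched as follows. This is an illustrative reading rather than the released implementation: the function name, its arguments, and the choice to keep options in their original order are assumptions, and List2.1 is taken to hold three options and List2.2 two, following the later discussion of Table 6.

```python
import random


def build_choice_lists(choices, gold, previous, rng):
    """Build the Table 6 style choice lists for one unsolved question.

    choices  : all option labels in their original order
    gold     : label of the correct option
    previous : answers produced by the earlier inferences
    rng      : random.Random instance (hypothetical helper argument)
    """
    earlier = set(previous)
    wrong = [c for c in choices if c != gold]
    unseen = [c for c in wrong if c not in earlier]

    def ordered(subset):
        # Keep the original option order so the gold option has no fixed slot.
        return [c for c in choices if c in subset]

    return {
        "List1": list(choices),
        # Assumption: List2.1 keeps three options and List2.2 keeps two.
        "List2.1": ordered({gold, *rng.sample(wrong, min(2, len(wrong)))}),
        "List2.2": ordered({gold, *rng.sample(wrong, min(1, len(wrong)))}),
        "List3": ordered({gold} | earlier),
        "List4": ordered({gold, *unseen}),
        "List5": ordered({gold, *rng.sample(unseen, min(1, len(unseen)))}),
    }


lists = build_choice_lists(
    choices=["A", "B", "C", "D", "E"],
    gold="C",
    previous=["A", "A", "B", "A", "B"],
    rng=random.Random(0),
)
print(lists["List3"])  # ['A', 'B', 'C']
print(lists["List4"])  # ['C', 'D', 'E']
```

All of these constructions except List1 rely on the gold label, which is why they serve only as an analysis tool and not as a deployable filter.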

Table 7: The probability of the strong distractors appearing in the choices list.

| Setting | AQ. | CMS. | OB. | Rid. | Avg. |
|---|---|---|---|---|---|
| List2.1 | 51% | 46% | 26% | 61% | 46% |
| List2.2 | 22% | 23% | 12% | 29% | 21.5% |

Table 8: Accuracy (%) on GSM8K.

| Metric | SC | DCR |
|---|---|---|
| #Call | 7.00 | 6.23 |
| Acc. | 84.75 | 85.00 |

Fewer choices lead to better outcomes. Considering the varying impact that different options have on LLMs, and drawing inspiration from Shi et al. (2023), we posit that incorrect choices previously generated by the model, which we call strong distractors, exert a more profound disruptive effect. As shown in Table 6, there is a significant improvement from List3 to List4, with an average increase of 22.26% across four datasets. Furthermore, retaining two choices (List2.2) consistently surpasses retaining three choices (List2.1), which can be primarily attributed to the reduced likelihood of encountering strong distractors when only two options are kept, as shown in Table 7. Therefore, developing more effective strategies to identify and eliminate such strongly distracting options is a crucial direction for our future research.
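The link between list length and exposure to strong distractors can be made concrete with a short calculation. As a sketch, assume the retained incorrect options are drawn uniformly without replacement; then the chance of drawing at least one strong distractor follows from counting combinations. The function name and the example counts below are illustrative and do not reproduce the measurements in Table 7.

```python
from math import comb


def p_strong_distractor(n_wrong, n_strong, k):
    """P(at least one strong distractor among k wrong options).

    Assumption: the k wrong options are drawn uniformly at random,
    without replacement, from n_wrong options of which n_strong are
    strong distractors.
    """
    if not 0 <= n_strong <= n_wrong or not 0 <= k <= n_wrong:
        raise ValueError("invalid counts")
    return 1 - comb(n_wrong - n_strong, k) / comb(n_wrong, k)


# Five-choice question: 4 wrong options, 2 of them generated earlier.
print(p_strong_distractor(4, 2, 1))            # 0.5
print(round(p_strong_distractor(4, 2, 2), 3))  # 0.833
```

In this toy setting, keeping one incorrect option instead of two lowers the probability of meeting a strong distractor from about 0.83 to 0.5, which is the same direction as the gap between List2.1 and List2.2.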

3.4.5 Application beyond MCQs

In the preceding experiments, all datasets consist of MCQs, where the correct answer is included in the choices list. We therefore also applied DCR to GSM8K (Cobbe et al., 2021), a high-quality cloze-style dataset of grade school math questions. Initially, we queried the entire test set 5 times, consistent with AQuA. Then we constructed a choices list from the generated answers, resulting in a new dataset named GSM8K-MCQ, which is formally equivalent to an MCQ dataset. Subsequently, we divided GSM8K-MCQ with a threshold $\mu$ of 0.6 and applied FCR for deeper conquering. As shown in Table 8, DCR achieves an accuracy of 85.00% with 6.23 samples on average, higher than the 84.75% of SC with a #Call of 7, which indicates the efficacy of our strategy on datasets beyond MCQs.
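A minimal sketch of the conversion step is given below, assuming the choices list is simply the deduplicated set of sampled answers and that $\mathcal{CS}$ is the frequency of the most common one. The function name `to_mcq` and the dictionary keys are hypothetical, and the sample question is invented for illustration.

```python
from collections import Counter


def to_mcq(question, sampled_answers, mu=0.6):
    """Turn one cloze-style item into an MCQ item from its sampled answers."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    majority, top = counts.most_common(1)[0]
    cs = top / len(sampled_answers)
    return {
        "question": question,
        "choices": list(counts),  # deduplicated, in first-seen order
        "prior_answer": majority,
        "cs": cs,
        # Assumption: items with CS <= mu are passed on to FCR.
        "subset": "low" if cs <= mu else "other",
    }


item = to_mcq("How many eggs are left?", ["18", "18", "20", "18", "16"])
print(item["choices"], item["cs"], item["subset"])
# ['18', '20', '16'] 0.6 low
```

Unlike a native MCQ, a choices list built this way is not guaranteed to contain the gold answer, since it only holds what the model itself produced.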

4 Related Work

LLMs reasoning for MCQs. As a problem format listing alternative answers, MCQs are prevalent in the real world and have led to numerous related datasets, such as MMLU (Hendrycks et al., 2020), BIG-bench (Srivastava et al., 2022), AGIEval (Zhong et al., 2023), and C-Eval (Huang et al., 2023). Meanwhile, many works have emerged in the MCQ community. Robinson et al. (2022) integrate the question with the choices list and then guide the model to select the symbol of the correct option. Pezeshkpour & Hruschka (2023) discover LLMs’ position bias, revealing that the order of choices can significantly impact model performance. Zheng et al. (2023b) find selection bias, where LLMs display a clear preference for choosing options from specific positions. Different from these works, we explore the model’s sensitivity to the number of options and verify that filtering incorrect choices can further improve performance.

CoT prompting in LLMs reasoning. Recently, CoT prompting methods have significantly enhanced the reasoning abilities of LLMs. As the pioneers, Wei et al. (2022) generate intermediate reasoning steps before arriving at the answer by integrating rationales into few-shot exemplars. Following this work, Wang et al. (2022), Zhou et al. (2022), Yao et al. (2023), Besta et al. (2023), Sel et al. (2023), Jin & Lu (2023), Jiang et al. (2023b), Yan et al. (2023), Zhu et al. (2023), Li et al. (2023b) and Deb et al. (2023) are dedicated to optimizing the thinking process. Gao et al. (2023), Chen et al. (2022), Chen et al. (2023), Yamauchi et al. (2023) and Jie et al. (2023) employ external tools to disentangle computation from LLMs. Zhang et al. (2022), Diao et al. (2023), Shum et al. (2023), Sun et al. (2023) and Zou et al. (2023) explore the construction of demonstrations in distinct manners. Mekala et al. (2023), Zheng et al. (2023c), Li et al. (2023a), Yasunaga et al. (2023) and Crispino et al. (2023) enable models to generate exemplars by themselves. Xue et al. (2023), Miao et al. (2023), Zhang et al. (2023), Ling et al. (2023) and Weng et al. (2023) introduce the concept of verification into the community. In addition, Shi et al. (2023) delve into the distractibility of LLMs by irrelevant context in questions. Zheng et al. (2023a) utilize previously generated answers as hints to progressively guide the model to the correct answer. Kong et al. (2023) define specific roles for the model based on the particular task. However, all these works process data uniformly, neglecting problem-solving difficulty. Therefore, we propose DCR for LLM reasoning, which first divides the dataset and then selects the intricate items for deeper processing by filtering irrelevant choices.

5 Conclusion

In this paper, we propose DCR to enhance the reasoning abilities of LLMs for MCQs by dividing the dataset based on $\mathcal{CS}$ and subsequently conquering items with low $\mathcal{CS}$. Evaluation results on nine datasets across three tasks show that DCR not only minimizes unnecessary computation for simple problems but also substantially improves performance on more intricate ones. In addition, through detailed analysis, we confirmed a positive relation between $\mathcal{CS}$ and accuracy, and found that fewer choices lead to better outcomes. Nonetheless, using previously generated results to filter choices fails to effectively eliminate strong distractors, and computing $\mathcal{CS}$ through SC is resource-intensive. Therefore, in the future we will develop more efficient strategies for filtering distractors and for reducing the computational demand of dataset division.

References

  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • Bentley (1980) Jon Louis Bentley. Multidimensional divide-and-conquer. Communications of the ACM, 23(4):214–229, 1980.
  • Bentley & Shamos (1976) Jon Louis Bentley and Michael Ian Shamos. Divide-and-conquer in multidimensional space. In Proceedings of the eighth annual ACM symposium on Theory of computing, pp. 220–230, 1976.
  • Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
  • Chen et al. (2023) Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, and Ji-Rong Wen. Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models. arXiv preprint arXiv:2305.14323, 2023.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Crispino et al. (2023) Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, and Chenguang Wang. Agent instructs large language models to be general zero-shot reasoners. arXiv preprint arXiv:2310.03710, 2023.
  • Deb et al. (2023) Aniruddha Deb, Neeva Oza, Sarthak Singla, Dinesh Khandelwal, Dinesh Garg, and Parag Singla. Fill in the blank: Exploring and enhancing llm capabilities for backward reasoning in math word problems. arXiv preprint arXiv:2310.01991, 2023.
  • Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.
  • Eisenstein (2006) Michael Eisenstein. Divide and conquer. Nature, 441(7097):1179–1179, 2006.
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
  • Goldratt & Cox (2016) Eliyahu M Goldratt and Jeff Cox. The goal: a process of ongoing improvement. Routledge, 2016.
  • Heideman et al. (1984) Michael Heideman, Don Johnson, and Charles Burrus. Gauss and the history of the fast fourier transform. IEEE Assp Magazine, 1(4):14–21, 1984.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Hu et al. (2023) Yi Hu, Haotong Yang, Zhouchen Lin, and Muhan Zhang. Code prompting: a neural symbolic method for complex reasoning in large language models, 2023.
  • Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
  • Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023a.
  • Jiang et al. (2023b) Song Jiang, Zahra Shakeri, Aaron Chan, Maziar Sanjabi, Hamed Firooz, Yinglong Xia, Bugra Akyildiz, Yizhou Sun, Jinchao Li, Qifan Wang, et al. Resprompt: Residual connection prompting advances multi-step reasoning in large language models. arXiv preprint arXiv:2310.04743, 2023b.
  • Jie et al. (2023) Zhanming Jie, Trung Quoc Luong, Xinbo Zhang, Xiaoran Jin, and Hang Li. Design of chain-of-thought in math problem solving. arXiv preprint arXiv:2309.11054, 2023.
  • Jin & Lu (2023) Ziqi Jin and Wei Lu. Tab-cot: Zero-shot tabular chain of thought. arXiv preprint arXiv:2305.17812, 2023.
  • Knuth (1998) Donald Ervin Knuth. Sorting and searching. The art of computer programming, 3, 1998.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Kong et al. (2023) Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, and Xin Zhou. Better zero-shot reasoning with role-play prompting. arXiv preprint arXiv:2308.07702, 2023.
  • Li et al. (2023a) Rui Li, Guoyin Wang, and Jiwei Li. Are human-generated demonstrations necessary for in-context learning? arXiv preprint arXiv:2309.14681, 2023a.
  • Li et al. (2023b) Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, and Percy Liang. Benchmarking and improving generator-validator consistency of language models. arXiv preprint arXiv:2310.01846, 2023b.
  • Li et al. (2024) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. arXiv preprint arXiv:2401.10480, 2024.
  • Lin et al. (2021) Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, and Xiang Ren. Riddlesense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. arXiv preprint arXiv:2101.00376, 2021.
  • Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
  • Ling et al. (2023) Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. arXiv preprint arXiv:2306.03872, 2023.
  • Mallouk (2013) Thomas E Mallouk. Divide and conquer. Nature chemistry, 5(5):362–363, 2013.
  • Mekala et al. (2023) Rajasekhar Reddy Mekala, Yasaman Razeghi, and Sameer Singh. Echoprompt: Instructing the model to rephrase queries for improved in-context learning. arXiv preprint arXiv:2309.10687, 2023.
  • Miao et al. (2023) Ning Miao, Yee Whye Teh, and Tom Rainforth. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436, 2023.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191, 2021.
  • Pezeshkpour & Hruschka (2023) Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
  • Robinson et al. (2022) Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2022.
  • Sel et al. (2023) Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Lu Wang, Ruoxi Jia, and Ming Jin. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379, 2023.
  • Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
  • Shum et al. (2023) KaShun Shum, Shizhe Diao, and Tong Zhang. Automatic prompt augmentation and selection with chain-of-thought from labeled data. arXiv preprint arXiv:2302.12822, 2023.
  • Smith (1985) Douglas R Smith. The design of divide and conquer algorithms. Science of Computer Programming, 5:37–58, 1985.
  • Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  • Sun et al. (2023) Jiashuo Sun, Yi Luo, Yeyun Gong, Chen Lin, Yelong Shen, Jian Guo, and Nan Duan. Enhancing chain-of-thoughts prompting with iterative bootstrapping in large language models. arXiv preprint arXiv:2304.11657, 2023.
  • Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Weng et al. (2023) Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. arXiv preprint arXiv:2212.09561, 2023.
  • Xiong et al. (2023) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2023.
  • Xue et al. (2023) Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. Rcot: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. arXiv preprint arXiv:2305.11499, 2023.
  • Yamauchi et al. (2023) Ryutaro Yamauchi, Sho Sonoda, Akiyoshi Sannai, and Wataru Kumagai. Lpml: Llm-prompting markup language for mathematical reasoning. arXiv preprint arXiv:2309.13078, 2023.
  • Yan et al. (2023) Shaotian Yan, Chen Shen, Junjie Liu, and Jieping Ye. Concise and organized perception facilitates large language models for deductive reasoning. arXiv preprint arXiv:2310.03309, 2023.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  • Yasunaga et al. (2023) Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H Chi, and Denny Zhou. Large language models as analogical reasoners. arXiv preprint arXiv:2310.01714, 2023.
  • Yu et al. (2020) Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. arXiv preprint arXiv:2002.04326, 2020.
  • Zhang et al. (2023) Haodi Zhang, Min Cai, Xinhe Zhang, Chen Jason Zhang, Rui Mao, and Kaishun Wu. Self-convinced prompting: Few-shot question answering with repeated introspection. arXiv preprint arXiv:2310.05035, 2023.
  • Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
  • Zheng et al. (2023a) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023a.
  • Zheng et al. (2023b) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. On large language models’ selection bias in multi-choice questions. arXiv preprint arXiv:2309.03882, 2023b.
  • Zheng et al. (2023c) Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi, Quoc V Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117, 2023c.
  • Zheng et al. (2023d) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023d.
  • Zhong et al. (2023) Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
  • Zhu et al. (2023) Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large language models can learn rules. arXiv preprint arXiv:2310.07064, 2023.
  • Zou et al. (2023) Anni Zou, Zhuosheng Zhang, Hai Zhao, and Xiangru Tang. Meta-cot: Generalizable chain-of-thought prompting in mixed-task scenarios with large language models. arXiv preprint arXiv:2310.06692, 2023.

Appendix A Zero-shot vs. Few-shot

Table 9: Problem-solving accuracy (%) between Zero-Shot-CoT and Few-Shot-CoT.

| Method | AQuA | GSM8K | SVAMP | CMSQA | Average |
|---|---|---|---|---|---|
| Zero-Shot-CoT | 54.86 (±0.67) | 79.33 (±0.34) | 78.20 (±1.82) | 69.67 (±0.77) | 70.52 |
| Few-Shot-CoT | 53.67 (±0.67) | 79.67 (±1.18) | 81.60 (±1.34) | 77.47 (±0.96) | 73.10 |

Comparing Zero-Shot-CoT (Kojima et al., 2022) and Few-Shot-CoT (Wei et al., 2022) across AQuA (Ling et al., 2017), GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021) and CMSQA (Talmor et al., 2018) in Table 9 shows that models’ zero-shot capabilities are gradually nearing or even surpassing their few-shot counterparts, which aligns with the conclusions of recent research (Hu et al., 2023; Zhong et al., 2023). Therefore, we adopt a zero-shot design, so that our work is entirely free from human intervention and circumvents exemplar construction.

Table 10: Statistics of the datasets. For CMSQA, RiddleSense, Logical Deduction and Reclor, we use their validation sets, as there are no publicly available test sets or labels. GSM8K and SVAMP, which are cloze-style datasets without a choices list, are used in Section 3.4.5 and Appendix A. For Logical Deduction, there are 60 questions with $|\mathbf{C}_i| = 3$, 100 questions with $|\mathbf{C}_i| = 5$, and 140 questions with $|\mathbf{C}_i| = 7$. Given that 20% of the questions have 3 choices, we make a compromise and choose $t = 4$.
| Dataset | Task Type | Eval. Split | #Test ($n$) | #ChoicesNum ($|\mathbf{C}_i|$) | Infer. Times ($t$) |
|---|---|---|---|---|---|
| AQuA (AQ.) | Arithmetic | Test | 254 | 5 | 5 |
| Abstract Algebra (Alg.) | Arithmetic | Test | 100 | 4 | 4 |
| High School Mathematics (Math.) | Arithmetic | Test | 270 | 4 | 4 |
| CMSQA (CMS.) | Commonsense | Validation | 1221 | 5 | 5 |
| OpenBookQA (OB.) | Commonsense | Test | 500 | 4 | 4 |
| ARC Challenge (ARC.) | Commonsense | Test | 1165 | 4 | 4 |
| RiddleSense (Rid.) | Logic | Validation | 1021 | 5 | 5 |
| Logical Deduction (Logi.) | Logic | Validation | 300 | 3, 5 or 7 | 4 |
| Reclor (Rec.) | Logic | Validation | 500 | 4 | 4 |
| GSM8K | Arithmetic | Test | 1319 | - | - |
| SVAMP | Arithmetic | Test | 300 | - | - |
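To make the role of the inference times $t$ concrete, the sketch below first recomputes the share of Logical Deduction questions for each choices-list size, then shows one plausible way to turn $t$ sampled answers into a confidence score. The abstract only states that the score is estimated from the statistical frequency of generated answers; the specific formula used here (relative frequency of the most common answer) and the function name `confidence_score` are assumptions for illustration, not the authors' exact definition.

```python
from collections import Counter

# Logical Deduction: number of questions per choices-list size (Table 10 caption).
logical_deduction = {3: 60, 5: 100, 7: 140}
total = sum(logical_deduction.values())
shares = {size: count / total for size, count in logical_deduction.items()}
print(total, shares)  # 300 questions in total; 20% of them have 3 choices

def confidence_score(answers):
    """Return (most frequent answer, its relative frequency among the samples).

    Assumption: the confidence score is the frequency of the majority
    answer over the t sampled answers.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# Hypothetical example with t = 4 sampled answers for one question.
sampled = ["A", "A", "C", "A"]
print(confidence_score(sampled))
```

With $t = 4$, a question whose four samples split three to one would receive a score of 0.75 under this assumed definition. The threshold that separates the high-confidence and low-confidence subsets is not specified in this section and is left out of the sketch.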

Appendix B Different prompts for FCR

Figure 8: Accuracy of FCR on $\mathbb{D}_{low}$ for different prompts across various datasets.

Since the most distinctive feature of FCR is its succinct choices list, we compared different prompts, as displayed in Figure 8. Specifically, “Prompt0” denotes “Let’s think step by step.”, “Prompt1” is the prompt used in FCR, and “Prompt2” represents “Let’s delve deeper into this question to arrive at the best answer.”. Across the datasets, the accuracy difference of FCR between prompts remains below 2%, and no single prompt clearly dominates. Therefore, we attribute the good performance of FCR to the briefer choices list rather than to prompt engineering.
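The comparison above can be pictured as swapping only the trigger sentence while keeping the shortened choices list fixed. The following is a minimal sketch under stated assumptions: the `Q:`/`Answer Choices:`/`A:` layout, the function name `build_fcr_prompt`, and the way retained labels are passed in are all hypothetical, and “Prompt1” is omitted because its wording is not given in this appendix.

```python
# Trigger sentences quoted in Appendix B (Prompt1 is not reproduced here).
PROMPTS = {
    "Prompt0": "Let's think step by step.",
    "Prompt2": "Let's delve deeper into this question to arrive at the best answer.",
}

def build_fcr_prompt(question, choices, kept_labels, trigger):
    """Build a prompt that lists only the retained choices.

    choices: dict mapping a label such as "A" to the choice text.
    kept_labels: labels that survive filtering (how they are selected is
    outside this sketch and is an assumption of the caller).
    """
    kept = [(label, text) for label, text in choices.items() if label in kept_labels]
    lines = [f"Q: {question}", "Answer Choices:"]
    lines += [f"({label}) {text}" for label, text in kept]
    lines.append(f"A: {trigger}")
    return "\n".join(lines)

# Hypothetical question with five choices, of which two are retained.
choices = {"A": "12", "B": "15", "C": "18", "D": "21", "E": "24"}
prompt = build_fcr_prompt("What is 3 times 6?", choices, {"A", "C"}, PROMPTS["Prompt0"])
print(prompt)
```

Only the last line of the prompt changes between variants, which matches the observation that accuracy differs by less than 2% across prompts.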

Figure 9: Problem-solving accuracy and #Call among different datasets. Panels: (a) AQ., (b) Alg., (c) Math., (d) CMS., (e) OB., (f) ARC., (g) Rid., (h) Logi., (i) Rec.
Figure 10: Distribution of different subsets among various datasets. “Avg.” denotes the average distribution across all datasets.
Figure 11: Prior accuracy of different subsets for various sample sizes $t$ on each dataset. Panels: (a) AQ., (b) Alg., (c) Math., (d) CMS., (e) OB., (f) ARC., (g) Rid., (h) Logi., (i) Rec.
Figure 12: Sizes of the different subsets for various sample sizes $t$ on each dataset. Panels: (a) AQ., (b) Alg., (c) Math., (d) CMS., (e) OB., (f) ARC., (g) Rid., (h) Logi., (i) Rec.