DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs

Zijie Meng , Zhaopeng Feng & Zuozhu Liu
Zhejiang University-University of Illinois at Urbana Champaign Institute
Zhejiang University
Jiaxing, Zhejiang 314400, PRC
{zijie.22, zhaopeng.23, zuozhuliu}@intl.zju.edu.cn
Yan Zhang
Department of Electrical and Computer Engineering
National University of Singapore
4 Engineering Drive 3, Singapore 117583
yanzhang.jlu@gmail.com
Indicates equal contribution. Corresponding author.

Abstract

Large language models (LLMs) have shown impressive performance on reasoning benchmarks with the emergence of Chain-of-Thought (CoT) prompting, particularly on multi-choice questions (MCQs). However, current works resolve all questions in the same way regardless of problem-solving difficulty, which spends excessive effort on simple items and insufficient attention on intricate ones. To address this challenge, we propose a simple yet effective strategy, Divide and Conquer Reasoning (DCR), to enhance the reasoning capability of LLMs on MCQs, inspired by the human heuristic of first categorizing tasks and then handling them separately. In particular, we first categorize questions into two subsets based on a confidence score $\mathcal{CS}$, which is estimated by the statistical frequency of generated answers. Subsequently, we propose Filter Choices based Reasoning (FCR) to improve model performance on MCQs with low $\mathcal{CS}$. Our experiments demonstrate that the proposed strategy costs only 85% of the SOTA method while still achieving an average accuracy improvement of 1.56% across nine datasets covering arithmetic, commonsense, and logic reasoning tasks. The code is at https://github.com/AiMijie/DCR.

1 Introduction

Large language models (LLMs) (e.g., GPT3 (Brown et al., 2020), GPT4 (OpenAI, 2023), Palm (Chowdhery et al., 2022), Palm2 (Anil et al., 2023), Lamda (Thoppilan et al., 2022), Llama (Touvron et al., 2023a), Llama2 (Touvron et al., 2023b)) have exhibited outstanding performance on various downstream tasks by generating step-by-step rationales to obtain final answers without finetuning parameters, as elicited by Chain-of-Thought (CoT) prompting (Wei et al., 2022). The multiple-choice question (MCQ) is a format that pairs a question with a list of choices and prompts the model to select the gold answer. Owing to its simple structure, standardized results, and objective assessment, the MCQ is not only prevalent in the real world but also extensively employed in evaluating LLM reasoning (Zheng et al., 2023d; Hendrycks et al., 2020; Srivastava et al., 2022; Zhong et al., 2023; Huang et al., 2023). Consequently, the community has witnessed a surge in CoT-based works, which demonstrate outstanding performance on MCQs (Wang et al., 2022; Diao et al., 2023; Kojima et al., 2022; Zheng et al., 2023a; Kong et al., 2023). Notably, Zero-Shot-CoT (Kojima et al., 2022) and Self-Consistency (SC) (Wang et al., 2022) have attracted considerable attention due to their straightforward implementation and impressive efficacy. Zero-Shot-CoT stimulates the latent zero-shot reasoning abilities of LLMs by adding “Let’s think step by step.” to prompts, but often underperforms on complex tasks. SC samples different reasoning paths to generate multiple candidates and then applies majority voting to derive the final answer, which achieves encouraging results but introduces substantial overhead. To avoid this high cost, ESC (Li et al., 2024) stops inference early by calculating the entropy of the answer distribution in a small sliding window without sacrificing SC’s performance, and is the current SOTA. However, its performance ceiling is inherently limited by that of SC, which restricts further gains in accuracy.

Therefore, to optimize both cost and performance, it is imperative to halt expensive sampling in time to reduce expenditure and to employ different approaches for problems of differing complexity to advance accuracy. In other words, previous methods all process data uniformly regardless of problem-solving difficulty, which means that simple questions receive unnecessarily complex and costly procedures, whereas intricate ones are not adequately addressed by basic methods. Humans naturally use heuristic strategies to categorize tasks and then address each category individually, which not only resolves complex issues effectively but also significantly enhances efficiency (Heideman et al., 1984; Knuth, 1998). Consequently, we apply this strategy of data partitioning followed by differentiated processing, known as Divide and Conquer and widely deployed across numerous scenarios (Bentley & Shamos, 1976; Bentley, 1980; Smith, 1985; Eisenstein, 2006; Mallouk, 2013), to LLM reasoning. In this context, we need to address two paramount challenges: (1) What criteria should be used to divide the dataset? (2) How should the subsets be processed?

Figure 1: Illustration of DCR. (1) Divide. We first conduct $t$ (e.g. $t$=5) rounds of inference with Zero-Shot-CoT (Kojima et al., 2022) using “Let’s think step by step.”. Then, the dataset $\mathbb{D}$ is divided based on $\mathcal{CS}$, where DataItems with $\mathcal{CS}$ no greater than $\mu$ (e.g. $\mu$=0.6) are categorized as $\mathbb{D}_{low}$, and the rest as $\mathbb{D}_{other}$. (2) Conquer. We keep $\mathbb{D}_{other}$ fixed and propose FCR to process $\mathbb{D}_{low}$. A “DataItem” in the Divide area includes the question text and the full choices list, whereas in the Conquer area it includes only the filtered choices list. “Rationale $i$_$j$” denotes the rationale generated by the $j$-th LLM query for the $i$-th DataItem. “Choice_$x$” represents the $x$-th option in the original DataItem.

For the first challenge, we need a method that effectively classifies questions by solving difficulty. In human perception, answers given with high uncertainty are often wrong, whereas confident answers tend to be correct (Xiong et al., 2023). We therefore tentatively probed SC (Wang et al., 2022), where the statistical distribution of answers generated from various reasoning paths reflects a confidence score $\mathcal{CS}$ for the question. As shown in Figure 4, we divided questions into two subsets based on their $\mathcal{CS}$; the subsets display distinct accuracy, and the subset with lower $\mathcal{CS}$ performs worse. This suggests that we can employ SC to compute $\mathcal{CS}$ for each problem and divide the dataset accordingly.

Moving to the second challenge, inspired by the Cannikin Law in management (Goldratt & Cox, 2016), we explore more elaborately designed methods for the low-confidence subset, which offers greater room for optimization, and keep the answers fixed for the other questions, which are sufficiently simple for the model. Shi et al. (2023) investigated the model’s sensitivity to irrelevant information within questions, but the effect of irrelevant options in the choices list remains uncertain. To examine this problem, we conducted preliminary studies, shown in Figure 6, and found that problem-solving accuracy decreases as the number of choices increases. Following this, we removed some irrelevant options for subsets of hard-to-solve questions and re-queried the LLM, resulting in a universal improvement of over 20% and reaching as high as 75.52% on CMSQA (Talmor et al., 2018), as shown in Table 6. Motivated by these findings, we introduce Filter Choices based Reasoning (FCR), which excludes redundant options by using the answers from the divide stage, to conduct inference in the conquer stage.

Concretely, in this paper, we propose a simple yet effective strategy, Divide and Conquer Reasoning (DCR), which first categorizes questions into two subsets based on $\mathcal{CS}$ and subsequently employs FCR to improve model performance on MCQs with low $\mathcal{CS}$, as illustrated in Figure 1. In extensive empirical evaluation across nine datasets covering arithmetic, commonsense, and logic tasks, DCR not only consumes on average only 85% of the resources required by ESC, but also improves accuracy by an average of 1.56% on these datasets. Additionally, we have validated the effectiveness of DCR across various LLMs (Team et al., 2024; Jiang et al., 2023a; Anil et al., 2023; Team et al., 2023; OpenAI, 2023) and the superiority of FCR over other reasoning methods (Wei et al., 2022; Diao et al., 2023; Zheng et al., 2023a; Kojima et al., 2022; Kong et al., 2023). We have also successfully adapted DCR to the cloze-style dataset GSM8K (Cobbe et al., 2021), achieving improved performance over SC at reduced cost. In summary, our work makes three major contributions: (1) To the best of our knowledge, we are the first to employ Divide and Conquer at the dataset level for LLM reasoning, providing the community with a fresh perspective. (2) By dividing the dataset based on $\mathcal{CS}$ and conquering the low-$\mathcal{CS}$ subset with FCR, we achieve an optimal balance between cost and accuracy. (3) We evaluate this strategy across nine datasets within three distinct reasoning tasks, consistently yielding significant improvements.

2 Methodology

The overall framework of DCR is illustrated in Figure 1. We represent a test set of length $n$ as $\mathbb{D}=\{(Q_{1},\mathbf{C}_{1}),...,(Q_{n},\mathbf{C}_{n})\}$, where $Q_{i}$ denotes the $i$-th question text and $\mathbf{C}_{i}$ is the corresponding choices list. In addition, we use $\mathbf{R}_{i}$ and $\mathbf{A}_{i}$ to denote the rationales and answers generated by the LLM for the $i$-th item, respectively.

2.1 Divide

For each item $(Q_{i},\mathbf{C}_{i}),i\in\{1,...,n\}$, we query the LLM $t$ times based on Zero-Shot-CoT (Kojima et al., 2022) to obtain rationales $\mathbf{R}_{i}=\{r_{i,1},...,r_{i,t}\}$ and corresponding standby answers $\mathbf{A}_{i}=\{a_{i,1},...,a_{i,t}\}$. (Footnote 1: Few-Shot-CoT (i.e. CoT) (Wei et al., 2022) requires substantial human labor to annotate task-specific exemplars, and zero-shot gradually approaches or even surpasses few-shot as model scale increases (Hu et al., 2023; Zhong et al., 2023). See Table 9 in Appendix A for our verification.) In particular, we generally set $t$ equal to the length of the choices list $|\mathbf{C}_{i}|$, considering the worst case in which every choice is sampled. We analyze different values of $t$ in more detail in Section 3.4.2. We use $\delta$ to denote the LLM and define the confidence score $\mathcal{CS}$ of each item as:

$$\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}=\max_{j\in\{1,...,t\}}p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i})), \tag{1}$$

where $p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i}))$ is the frequency of $a_{i,j}$ among all predicted answers, defined as:

$$p(a_{i,j}|\delta(Q_{i},\mathbf{C}_{i}))=\frac{\sum_{k\in\{1,...,t\}}\mathbf{1}_{a_{i,k}=a_{i,j}}}{t}. \tag{2}$$

Intuitively, $\mathcal{CS}$ is the proportion of the most frequent answer among all predicted results across the $t$ inferences, and it is used to reflect problem-solving difficulty. Then, we can divide $\mathbb{D}$ with the following rule:

$$(Q_{i},\mathbf{C}_{i})\in\begin{cases}\mathbb{D}_{other},&\text{if }\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}>\mu,\\ \mathbb{D}_{low},&\text{if }\mathcal{CS}_{(Q_{i},\mathbf{C}_{i})}\leq\mu,\end{cases} \tag{3}$$

where $\mathbb{D}_{low}$ represents the low-confidence subset containing items $(Q_{i},\mathbf{C}_{i})$ with a dispersed distribution of $\mathbf{A}_{i}$, and $\mathbb{D}_{other}$ includes the remaining items. $\mu$ is the dividing threshold, which is specified in Section 3.2 and discussed in Section 3.4.2. Moreover, beyond dividing the questions into two subsets, we explore a more fine-grained division in Section 3.4.2 to evaluate our dividing rule. Next, we keep the answers for $\mathbb{D}_{other}$ fixed to conserve resources and delve deeper into $\mathbb{D}_{low}$ for further performance improvement.
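The divide stage in Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names, the item ids, and the dictionary layout are assumptions made for the example, and the sampled answers are taken as given rather than obtained from an LLM.

```python
from collections import Counter

def confidence_score(answers):
    """Eqs. (1)-(2): frequency of the most common answer among the t samples."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def divide(sampled, mu=0.6):
    """Eq. (3): items with CS > mu go to D_other, the rest to D_low.

    `sampled` maps a (hypothetical) item id to its list of t predicted answers.
    """
    d_other, d_low = [], []
    for item_id, answers in sampled.items():
        target = d_other if confidence_score(answers) > mu else d_low
        target.append(item_id)
    return d_other, d_low

sampled = {
    "q1": ["A", "A", "A", "A", "A"],  # CS = 1.0
    "q2": ["A", "A", "B", "A", "C"],  # CS = 0.6, the boundary case
    "q3": ["A", "B", "C", "D", "E"],  # CS = 0.2
}
d_other, d_low = divide(sampled)
print(d_other, d_low)  # ['q1'] ['q2', 'q3']
```

Note that with $t$=5 and $\mu$=0.6, an item whose most frequent answer appears three times sits exactly on the threshold and, by the "$\leq$" in Eq. (3), falls into $\mathbb{D}_{low}$.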

2.2 Conquer

We propose Filter Choices based Reasoning (FCR) to conquer $\mathbb{D}_{low}$ in this stage, as shown in Figure 1, where we use the results obtained in the divide phase as the candidate options for subsequent inference. Specifically, we define $\mathbf{C}_{i}^{\prime}=uniq(\mathbf{A}_{i})$, where the $uniq(\cdot)$ operation denotes deduplication of $\mathbf{A}_{i}=\{a_{i,1},...,a_{i,t}\}$. Then we use $(Q_{i},\mathbf{C}_{i}^{\prime})$ to construct the new prompt and query the LLM with “Let’s delve deeper into these { $|\mathbf{C}_{i}^{\prime}|$ } choices and select the best one.” (Footnote 2: We evaluate the robustness of FCR to different query prompts in Appendix B.) Subsequently, through an additional $t$ inferences, we obtain the new standby answers $\mathbf{A}_{i}^{\prime}=\{a^{\prime}_{i,1},...,a^{\prime}_{i,t}\}$ for $\mathbb{D}_{low}$. Notably, our method does not merely delete options; it also reassigns the option symbols (i.e. ‘A’, ‘B’, ‘C’, etc.) according to the number of remaining choices. Furthermore, in Section 3.4.3, we evaluate the impact of conquering different subsets and compare FCR with other reasoning methods to demonstrate the superiority of processing only $\mathbb{D}_{low}$ with our method.
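The filtering and relabeling step can be illustrated as follows. This is a sketch under stated assumptions: only the trigger sentence comes from the text above, while the prompt layout, the choice to keep surviving options in their original order, and the helper names `filter_choices` and `build_fcr_prompt` are hypothetical.

```python
import string

# Trigger sentence quoted from the paper; {n} is |C'_i|.
TRIGGER = "Let's delve deeper into these {n} choices and select the best one."

def filter_choices(choices, sampled_answers):
    """Keep only options predicted in the divide stage and relabel them A, B, ...

    choices: dict of original symbol -> option text (insertion-ordered).
    Returns (new_choices, new_to_old); new_to_old maps the reassigned symbols
    back to the original ones so answers can be scored against the gold label.
    """
    predicted = set(sampled_answers)
    kept = [sym for sym in choices if sym in predicted]  # original order (assumption)
    new_choices, new_to_old = {}, {}
    for new_sym, old_sym in zip(string.ascii_uppercase, kept):
        new_choices[new_sym] = choices[old_sym]
        new_to_old[new_sym] = old_sym
    return new_choices, new_to_old

def build_fcr_prompt(question, new_choices):
    """Assemble a prompt; the exact layout here is illustrative only."""
    options = " ".join(f"({sym}) {text}" for sym, text in new_choices.items())
    trigger = TRIGGER.format(n=len(new_choices))
    return f"Q: {question}\nAnswer Choices: {options}\nA: {trigger}"

choices = {"A": "12", "B": "15", "C": "18", "D": "21", "E": "24"}
new_choices, new_to_old = filter_choices(choices, ["C", "A", "C", "E", "A"])
print(new_choices)  # {'A': '12', 'B': '18', 'C': '24'}
print(new_to_old)   # {'A': 'A', 'B': 'C', 'C': 'E'}
```

Keeping the `new_to_old` mapping is one way to handle the symbol reassignment: a conquer-stage answer of "B" refers to the original option "C".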

Ultimately, we pair the standby answers $\{\mathbf{A}_{i}|(Q_{i},\mathbf{C}_{i})\in\mathbb{D}_{other}\}$ from the divide stage and $\{\mathbf{A}_{i}^{\prime}|(Q_{i},\mathbf{C}_{i})\in\mathbb{D}_{low}\}$ from the conquer stage with $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$, respectively, and then apply majority voting (Wang et al., 2022) to determine the final answer for each data item. Our full strategy requires no human intervention or manual labor, and it performs $t$ inferences for each data item in $\mathbb{D}_{other}$ and $2t$ inferences for each item in $\mathbb{D}_{low}$.
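A small sketch of this aggregation step and of the resulting per-question cost is given below. The function names and data layout are assumptions for illustration; ties in the vote are broken by first occurrence, which the text does not specify, and conquer-stage answers are assumed to have already been mapped back to the original option symbols.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer (ties: first encountered, an assumption)."""
    return Counter(answers).most_common(1)[0][0]

def final_answers(divide_answers, conquer_answers):
    """Vote over A'_i for items in D_low and over A_i for all other items.

    divide_answers: item id -> answers from the divide stage (all items).
    conquer_answers: item id -> answers from the conquer stage (D_low only).
    """
    return {
        item_id: majority_vote(conquer_answers.get(item_id, answers))
        for item_id, answers in divide_answers.items()
    }

def average_calls(t, n_total, n_low):
    """Average inferences per item: t for D_other items, 2t for D_low items."""
    return t * (n_total + n_low) / n_total

divide_answers = {
    "q1": ["A", "A", "A", "B", "A"],  # high confidence, kept as is
    "q2": ["A", "B", "C", "A", "B"],  # low confidence, re-queried with FCR
}
conquer_answers = {"q2": ["B", "B", "A", "B", "B"]}
print(final_answers(divide_answers, conquer_answers))  # {'q1': 'A', 'q2': 'B'}
print(average_calls(t=5, n_total=100, n_low=20))       # 6.0
```

The cost function makes the trade-off explicit: the average number of calls grows linearly with the fraction of items that land in $\mathbb{D}_{low}$.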

3 Experiments

3.1 Datasets and evaluation metrics

To evaluate the effectiveness of DCR and analyse it empirically, we conducted experiments on three tasks: 1) Arithmetic. AQuA (AQ.) (Ling et al., 2017), and Abstract Algebra (Alg.) and High School Mathematics (Math.) from the MMLU dataset (Hendrycks et al., 2020). 2) Commonsense. CMSQA (CMS.) (Talmor et al., 2018), OpenBookQA (OB.) (Mihaylov et al., 2018) and ARC Challenge (ARC.) (Clark et al., 2018). 3) Logic. RiddleSense (Rid.) (Lin et al., 2021), Logical Deduction (Logi.) from the BIG-bench dataset (Srivastava et al., 2022) and Reclor (Rec.) (Yu et al., 2020). The statistical details can be found in Table 10 of the appendix. Additionally, we employed exact match (EM) accuracy to evaluate performance, the same as in previous works (Wei et al., 2022; Kojima et al., 2022).

3.2 Implementation details

We primarily employed GPT-3.5-Turbo-0613 from the OpenAI API (Footnote 3: https://platform.openai.com), and conducted experiments on other open-source and black-box LLMs in Section 3.4.1. During the divide phase, we set the temperature to 0.7 and the number of inferences $t$ to 4 or 5 depending on the dataset, as detailed in Table 10. We divided each dataset into $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$ with $\mu$ set to 0.6. In the conquer stage, the temperature and number of inferences were the same as in the divide phase. Experiments were conducted on the full dataset by default, except in Section 3.4.4 and Appendix A, where we randomly sampled 500 items for each dataset, except 254 for AQuA and 300 for SVAMP. In addition, all final results were obtained by averaging five random trials. Notably, since the accuracy of ESC normally equals or falls below that of SC, we mainly compared with SC in Section 3.4.

3.3 Main results

Table 1: Comparison of problem-solving accuracy (%) among different methods. “Avg.” denotes average accuracy across nine datasets. “#Call” refers to the average sample size (i.e. inference times) for each question across nine datasets. SC^∗ and ESC^∗ represent the versions with approximate sample size of DCR.

Method	Arithmetic			Commonsense			Logic			Avg.	#Call
Method	AQ.	Alg.	Math.	CMS.	OB.	ARC.	Rid.	Logi.	Rec.	Avg.	#Call
SC	68.98	43.20	64.00	76.12	87.04	89.68	68.72	48.07	61.84	67.52	8.94
ESC	68.98	43.20	64.00	76.12	87.04	89.68	68.72	48.07	61.84	67.52	6.79
DCR	71.02	48.60	66.52	77.97	86.80	89.79	68.81	50.27	61.96	69.08	5.79
SC^∗	66.46	43.20	62.52	75.00	85.24	88.98	68.03	48.80	61.00	66.58	6.17
ESC^∗	68.98	42.20	64.00	76.12	84.68	88.52	68.72	48.07	60.20	66.83	6.17

We compared SC (Wang et al., 2022), ESC (Li et al., 2024), and our method across nine datasets, as shown in Table 1. Following Section 2.2, we set $2t$ as the upper bound on the number of inferences and defined the window size of ESC as $t$. Under this limit, the original SC uses an average sample size (i.e. inference times) of 8.94 per question with an average accuracy of 67.52%. ESC reduces the sample size to 6.79 while maintaining the accuracy of 67.52%. DCR further reduces the average sample size to 5.79 while achieving an accuracy of 69.08%, surpassing the baselines by 1.56%, which demonstrates improvements in both efficiency and performance.
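For reference, the headline figures quoted in the abstract follow directly from the Avg. and #Call columns of Table 1; the short calculation below only restates that arithmetic.

```latex
\frac{\#\text{Call}_{\text{DCR}}}{\#\text{Call}_{\text{ESC}}}
  = \frac{5.79}{6.79} \approx 0.853 \quad (\text{about } 85\% \text{ of the cost of ESC}),
\qquad
\text{Avg.}_{\text{DCR}} - \text{Avg.}_{\text{ESC}} = 69.08 - 67.52 = 1.56 .
```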

Figure 2 presents the average accuracy of SC and ESC across different datasets for various sample sizes. Notably, DCR achieves similar levels of accuracy at a substantially lower cost than the baselines, indicating a significant enhancement in efficiency. Meanwhile, when costs are comparable, DCR consistently outperforms these two baselines. This superiority is quantitatively reported as SC^∗ and ESC^∗ in Table 1, where DCR leads by 2.5% and 2.25%, respectively. Furthermore, we observed diminishing performance improvements for SC and ESC as the sample size increases, suggesting that they are approaching a bottleneck. However, integrating FCR for inference on $\mathbb{D}_{low}$ during the conquer stage offers a potential way to break through this bottleneck.

3.4 Analysis

3.4.1 Comparison across different LLMs

Table 2: Accuracy (%) across different LLMs. The number in parenthesis denotes average sample size.

Setting		AQ.	CMS.
Gemma	SC	34.96 (8.00)	65.31 (6.00)
Gemma	DCR	37.24 (7.50)	67.81 (5.98)
Mistral	SC	39.29 (9.00)	71.37 (7.00)
Mistral	DCR	43.31 (8.97)	73.10 (6.51)
Palm2	SC	38.50 (6.00)	74.15 (6.00)
Palm2	DCR	39.37 (5.21)	75.17 (5.53)
Gemini	SC	70.39 (8.00)	78.41 (6.00)
Gemini	DCR	68.74 (7.40)	78.85 (5.26)
GPT4	SC	84.17 (6.00)	84.21 (6.00)
GPT4	DCR	85.43 (5.99)	85.19 (5.70)

In this section, we conducted a comparative analysis between SC (Wang et al., 2022) and DCR using various models. Specifically, we employed Gemma (gemma-7b-it) (Team et al., 2024) and Mistral (Mistral-7B-Instruct-v0.2) (Jiang et al., 2023a), available on Hugging Face (Footnote 4: https://huggingface.co); Palm2 (text-bison-001) (Anil et al., 2023) and Gemini (gemini-pro) (Team et al., 2023) from Google AI (Footnote 5: https://ai.google.dev); and GPT4 (gpt-4-1106-preview) (OpenAI, 2023) from the OpenAI API. As shown in Table 2, DCR generally achieves higher accuracy at lower cost, except on AQuA with Gemini. Notably, the larger-scale LLMs (e.g. Gemini and GPT4) significantly outperform the other models, particularly on AQuA, where the improvements exceed 30%. However, this also diminishes the relative advantage of DCR: the improvements with Mistral are 4.02% and 1.73% on the two datasets, but only 1.26% and 0.98% with GPT4. We therefore believe that enhancing model capabilities resembles making up for weaknesses, which compresses the room for optimization.

3.4.2 Study for divide stage

Table 3: Comparison of accuracy (%) on different confidence subsets. “#Size” indicates the number of data items in different subsets. “Prior” denotes the accuracy of results generated in the divide stage. For $\mathbb{D}_{low_{b}}$, “-” marks results that lack reliability because of insufficient data.

Subset	Setting	Arithmetic			Commonsense			Logic			Avg.
Subset	Setting	AQ.	Alg.	Math.	CMS.	OB.	ARC.	Rid.	Logi.	Rec.	Avg.
$\mathbb{D}_{high}$	#Size	74.60	30.40	71.80	588.80	325.80	856.20	404.40	27.60	232.60	290.24
$\mathbb{D}_{high}$	Prior	91.96	53.95	90.53	92.09	96.13	96.26	89.81	86.96	75.41	85.90
$\mathbb{D}_{med}$	#Size	51.20	38.20	82.40	265.60	97.00	191.20	218.40	81.00	151.40	130.71
	Prior	79.69	43.46	61.17	72.74	73.61	75.42	69.60	60.49	51.52	65.30
	FCR	74.22	35.08	62.86	69.95	70.31	73.95	63.00	53.58	52.84	61.75
$\mathbb{D}_{low_{t}}$	#Size	70.00	30.20	109.60	265.20	74.00	115.40	251.40	153.40	113.20	131.38
	Prior	55.71	28.48	39.60	53.24	50.27	48.87	50.99	35.59	42.05	44.98
	FCR	62.86	49.01	55.47	61.61	64.05	65.68	50.68	42.11	48.59	55.56
$\mathbb{D}_{low_{b}}$	#Size	58.20	1.20	6.20	101.40	3.20	2.20	146.80	38.00	2.80	40.00
	Prior	36.08	-	-	33.14	-	-	35.97	17.37	-	30.64
	FCR	46.39	-	-	52.47	-	-	40.87	34.74	-	43.62

Effect of different sample size $t$. The Prior accuracy is a key metric reflecting the effectiveness of division, where lower $\mathcal{CS}$ is expected to correlate with lower Prior accuracy. Consequently, we conducted experiments to observe the impact of varying $t$ from 3 to 20 on Prior accuracy. As illustrated in Figure 4, there is a clear distinction in Prior accuracy between the subsets, and only a small number of inferences is required for it to settle into a narrow oscillating range, which supports the rationale for basing $t$ on $|\mathbf{C}_{i}|$. Additionally, the size of each subset after division is also a crucial metric, as it directly impacts the overall cost of DCR. Therefore, Figure 4 presents the sizes of $\mathbb{D}_{other}$ and $\mathbb{D}_{low}$ across various $t$. Similar to Prior accuracy, the subset sizes also stabilize within a fluctuating range after only a few inferences.

Effect of different dividing threshold $\mu$. Based on the definition of sample size $t$ and the strategy of DCR in Section 2.1, we divided the dataset into four discrete subsets according to $\mathcal{CS}$ intervals: (0.8, 1] for $\mathbb{D}_{high}$, (0.6, 0.8] for $\mathbb{D}_{med}$, (0.4, 0.6] for $\mathbb{D}_{low_{t}}$, and [0, 0.4] for $\mathbb{D}_{low_{b}}$. Considering the model’s high confidence on $\mathbb{D}_{high}$ ($\mathcal{CS}$ greater than 0.8), we only report Prior accuracy, which exceeds 85% in the majority (7 out of 9) of datasets, as shown in Table 3. This indicates that most questions in $\mathbb{D}_{high}$ are relatively simple and require no further processing. In contrast, $\mathbb{D}_{med}$ demonstrates moderate Prior accuracy and achieves improvements via FCR in only a minority (2 out of 9) of datasets. In fact, the $\mathcal{CS}$ of each item in $\mathbb{D}_{med}$ lies in (0.6, 0.8], indicating that although the model generates diverse answers, it predominantly focuses on a specific one. This makes it challenging to enhance the LLM’s performance by correcting its previously generated mistakes, so the gains from FCR are limited. In addition, with respect to the original DCR, $\mathbb{D}_{low_{t}}$ and $\mathbb{D}_{low_{b}}$ come from further dividing $\mathbb{D}_{low}$, where the former has higher $\mathcal{CS}$. Accordingly, the average Prior accuracy of $\mathbb{D}_{low_{b}}$ is only 30.64%, markedly below the 44.98% of $\mathbb{D}_{low_{t}}$ and significantly inferior to the others. Meanwhile, through the conquer phase in DCR, we achieve average accuracy improvements of 10.58% and 12.98% for $\mathbb{D}_{low_{t}}$ and $\mathbb{D}_{low_{b}}$, respectively. However, in more than half of the datasets, $\mathbb{D}_{low_{b}}$ contains very few data items, making it difficult to reliably report accuracy or effectively improve performance for the entire dataset. Therefore, we set the threshold $\mu$ to 0.6 for dataset division.
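The four-way split used in this analysis amounts to bucketing $\mathcal{CS}$ by the half-open intervals listed above. The sketch below assumes $\mathcal{CS}$ has already been computed as a fraction of $t$ samples; the function name and the string labels are illustrative only.

```python
def confidence_bucket(cs):
    """Map a confidence score to the subset whose interval contains it."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    if cs > 0.8:
        return "D_high"   # (0.8, 1]
    if cs > 0.6:
        return "D_med"    # (0.6, 0.8]
    if cs > 0.4:
        return "D_low_t"  # (0.4, 0.6]
    return "D_low_b"      # [0, 0.4]

# With t = 5 the only possible scores are k/5 for k = 1..5.
print([confidence_bucket(k / 5) for k in range(1, 6)])
# ['D_low_b', 'D_low_b', 'D_low_t', 'D_med', 'D_high']
```

With $t$=5 each interval above 0.4 contains exactly one attainable score, so $\mathbb{D}_{med}$ corresponds to four agreeing samples out of five and $\mathbb{D}_{low_{t}}$ to three out of five.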

The distribution of different subsets. Using the dividing results, we conducted a visual statistical analysis to examine the distribution of different confidence subsets among the three reasoning tasks, as shown in Figure 5. The proportion of the other subsets (i.e., those outside $\mathbb{D}_{low}$) exceeds 50% in every task and even surpasses 80% in the commonsense class, which means that we can achieve high accuracy on a substantial portion of the data without complex processing. Therefore, with DCR, we can concentrate more resources on the low-confidence subsets while avoiding redundant processing of the others, which significantly reduces overall expenditure.

3.4.3 Study for conquer stage

Table 4: Comparison of problem-solving accuracy (%) for conquering different subsets.

Conquer Subset	Arithmetic			Commonsense			Logic			Avg.	#Call
Conquer Subset	AQ.	Alg.	Math.	CMS.	OB.	ARC.	Rid.	Logi.	Rec.	Avg.	#Call
$\mathbb{D}_{med}\&\mathbb{D}_{low}$	69.92	45.40	67.04	77.36	86.16	89.55	67.40	48.40	62.36	68.18	6.78
$\mathbb{D}_{low}$	71.02	48.60	66.52	77.97	86.80	89.79	68.81	50.27	61.96	69.08	5.79

Different conquer subsets. Building upon Section 3.4.2, we retain $\mathbb{D}_{med}$ and combine $\mathbb{D}_{low_{t}}$ and $\mathbb{D}_{low_{b}}$ into $\mathbb{D}_{low}$ to compare the impact of conquering different subsets, as shown in Table 4. Conquering only $\mathbb{D}_{low}$, the smaller subset, requires an average sample size of 5.79, which is 0.99 lower than conquering $\mathbb{D}_{med}$ and $\mathbb{D}_{low}$ together. Furthermore, according to Table 3, additional intervention on $\mathbb{D}_{med}$ by FCR yields marginal benefits. As a result, conquering $\mathbb{D}_{med}$ and $\mathbb{D}_{low}$ together improves accuracy on only two datasets, which leads us to focus solely on $\mathbb{D}_{low}$ in the conquer stage.

Table 5: Accuracy (%) with different reasoning methods on $\mathbb{D}_{low}$ in AQuA and CMSQA.

Method	AQ.	CMS.	Average
ManualCoT	43.21	56.96	50.09
Active-Prompt	42.28	57.88	50.08
PHP	44.49	-	-
Zero-Shot-CoT	44.46	45.23	44.85
Role-Play Prompting	48.20	46.79	47.50
FCR	49.45	54.39	51.92

Different reasoning methods. In this section, we compared our proposed zero-shot FCR with some representative few-shot works, namely ManualCoT (i.e. CoT) (Wei et al., 2022), Active-Prompt (Diao et al., 2023), and PHP (Zheng et al., 2023a), as well as some zero-shot methods, namely Zero-Shot-CoT (Kojima et al., 2022) and Role-Play Prompting (Kong et al., 2023). Considering the diverse datasets employed by these methods, we chose $\mathbb{D}_{low}$ from two widely used datasets (AQuA and CMSQA) for this comparison, and evaluated performance with a single sample. Notably, we ran PHP only on AQuA, since it reported results only on arithmetic tasks. As shown in Table 5, FCR achieves the highest accuracy of 49.45% on AQuA and is competitive with the few-shot methods on CMSQA. This mirrors the trend in Table 9 of Appendix A, where Zero-Shot-CoT approaches or even exceeds Few-Shot-CoT on multiple datasets, yet still lags far behind on CMSQA. Furthermore, FCR exhibits the highest average accuracy, surpassing the next-best zero-shot method, Role-Play Prompting, by 4.42%, which highlights the strong efficacy of our method without additional human labor.

Table 6: Accuracy (%) on unsolved subsets with different construction methods of choices list.

Setting	AQ.	CMS.	OB.	Rid.
List1	2.16	0.97	0.57	1.97
List2.1	21.65	47.58	31.03	28.11
List2.2	29.00	62.10	45.98	49.25
List3	18.18	29.46	32.76	24.04
List4	26.41	65.01	58.05	44.02
List5	29.39	75.52	48.85	45.70

Figure 6: Accuracy with different number of choices.

Figure 7: Probability of correct answer in filtered list.

3.4.4 Study for irrelevant choices

Irrelevant information may distract LLM. Shi et al. (2023) investigated the sensitivity of LLMs to irrelevant information within questions and proposed adding instructions or exemplars to reduce distractibility. In fact, such irrelevant information is not limited to the question context but is also contained in the options list. Therefore, we analyzed accuracy with different numbers of choices, especially the impact of adding incorrect options. As shown in Figure 6, the accuracy declines noticeably with more incorrect options, where we extended the choices list by randomly adding wrong answers. To delve deeper, we focused on the subsets of problems that remained unsolved by SC (Wang et al., 2022) with 5 samples. We then conducted inference with various methods of constructing the choices list, as shown in Table 6: 1) presenting the full choices list as List1; 2) combining the correct option with 2 or 1 randomly sampled incorrect ones as List2.1 or List2.2, respectively; 3) using the correct option and the deduplicated results from the previous five inferences as List3; 4) selecting the correct option and the choices not included in earlier results as List4; 5) retaining the correct option and randomly picking one from the rest of List4 as List5. The accuracy for List1 is close to 0%, while the others significantly enhance performance. However, the correct answers for the test set are unknown in real-world scenarios, which leads us to explore the feasibility of using results from previous inference to filter the choices. We quantified the probability that the correct answer is in the filtered choices list, as shown in Figure 7. On average, 90.51% of cases retained the correct answer, indicating that earlier results can effectively narrow down the original choices list.
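The List3 and List4 constructions and the retention statistic behind Figure 7 can be expressed compactly. The sketch below is illustrative only: the function names, the use of option symbols rather than option texts, and the toy records are assumptions, and the numbers it prints are not results from the paper.

```python
def oracle_lists(all_symbols, gold, sampled):
    """Build List3 and List4 for one item (both require the gold label).

    List3: the correct option plus the deduplicated earlier predictions.
    List4: the correct option plus the options never predicted earlier.
    """
    predicted = set(sampled)
    list3 = sorted({gold} | predicted)
    list4 = sorted({gold} | (set(all_symbols) - predicted))
    return list3, list4

def retention_rate(records):
    """Fraction of items whose gold answer appears among the sampled answers.

    records: iterable of (gold_answer, sampled_answers) pairs.
    """
    records = list(records)
    if not records:
        raise ValueError("at least one record is required")
    hits = sum(gold in sampled for gold, sampled in records)
    return hits / len(records)

print(oracle_lists("ABCDE", "B", ["A", "C", "A", "C", "E"]))
# (['A', 'B', 'C', 'E'], ['B', 'D'])

toy_records = [
    ("A", ["A", "B", "A"]),
    ("C", ["A", "B", "B"]),  # gold answer never sampled: filtering would drop it
    ("D", ["D", "D", "D"]),
    ("B", ["B", "C", "A"]),
]
print(retention_rate(toy_records))  # 0.75
```

The second toy record shows the failure case of label-free filtering: when the gold answer is never sampled, the filtered list cannot contain it.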

Table 7: The probability of the strong distractors appearing in the choices list.

Setting	AQ.	CMS.	OB.	Rid.	Avg.
List2.1	51%	46%	26%	61%	46%
List2.2	22%	23%	12%	29%	21.5%

Table 8: Accuracy (%) on GSM8K.

	SC	DCR
#Call	7.00	6.23
Acc.	84.75	85.00

Fewer choices lead to better outcomes. Considering the varying impacts different options have on LLMs, and drawing inspiration from Shi et al. (2023), we posit that incorrect choices previously generated by the model, which we call strong distractors, exert a more profound disruptive effect. As shown in Table 6, there is a significant improvement from List3 to List4, with an average increase of 22.26% across four datasets. Furthermore, retaining two choices (List2.2) consistently surpasses retaining three choices (List2.1), which can be primarily attributed to the reduced likelihood of encountering strong distractors when only two options are kept, as shown in Table 7. Therefore, developing more effective strategies to identify and eliminate such strongly distracting options will become a crucial direction for our future research.
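One way to see why keeping fewer random incorrect options lowers the chance of including a strong distractor is a simple sampling model. The sketch below is a toy illustration, not the measurement behind Table 7: it assumes incorrect options are drawn uniformly without replacement, and the parameter names and example values are hypothetical.

```python
from math import comb

def p_strong_distractor(n_incorrect, n_strong, k):
    """P(at least one strong distractor among k sampled incorrect options).

    n_incorrect: number of incorrect options for the question.
    n_strong: how many of them were previously generated by the model.
    k: number of incorrect options drawn uniformly without replacement.
    """
    if not 0 <= n_strong <= n_incorrect or not 0 <= k <= n_incorrect:
        raise ValueError("need 0 <= n_strong, k <= n_incorrect")
    return 1 - comb(n_incorrect - n_strong, k) / comb(n_incorrect, k)

# Five-option question (four incorrect), one strong distractor:
print(p_strong_distractor(4, 1, 1))  # 0.25
print(p_strong_distractor(4, 1, 2))  # 0.5
```

Under this idealized model, drawing two incorrect options instead of one doubles the chance of hitting the single strong distractor, which is qualitatively in line with the gap between the two settings in Table 7.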

3.4.5 Application beyond MCQs

In the preceding experiments, all datasets consist of MCQs, where the correct answer is included in the choices list. Consequently, we ventured to apply DCR to GSM8K (Cobbe et al., 2021), a high-quality cloze-style dataset of grade school math questions. Initially, we queried the entire test set 5 times, consistent with AQuA. Then we constructed a choices list from the generated answers, resulting in a new dataset named GSM8K-MCQ, which is formally equivalent to an MCQ dataset. Subsequently, we divided GSM8K-MCQ with a threshold ($\mu$) of 0.6 and applied FCR for deeper conquering. As shown in Table 8, DCR achieves an accuracy of 85.00% with 6.23 samples on average, compared with 84.75% for SC with a #Call of 7, which indicates the efficacy of our strategy on datasets beyond MCQs.
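The conversion from a cloze-style item to an MCQ item can be sketched as follows. This is an assumption-laden illustration: the text only states that the choices list is built from generated answers, so the order-preserving deduplication, the letter labels, and the function name `build_mcq` are choices made for the example.

```python
import string

def build_mcq(question, generated_answers):
    """Turn free-form sampled answers into a lettered choices list.

    Returns the question, the choices (symbol -> answer text), and the
    original samples re-expressed as symbols so CS can be computed as usual.
    """
    unique = list(dict.fromkeys(generated_answers))  # dedupe, keep first-seen order
    if len(unique) > len(string.ascii_uppercase):
        raise ValueError("too many distinct answers to label with A-Z")
    choices = dict(zip(string.ascii_uppercase, unique))
    answer_to_symbol = {ans: sym for sym, ans in choices.items()}
    sampled_symbols = [answer_to_symbol[ans] for ans in generated_answers]
    return {"question": question, "choices": choices, "sampled_symbols": sampled_symbols}

item = build_mcq("How many apples are left?", ["18", "18", "20", "18", "16"])
print(item["choices"])          # {'A': '18', 'B': '20', 'C': '16'}
print(item["sampled_symbols"])  # ['A', 'A', 'B', 'A', 'C']
```

In this example the most frequent answer covers three of five samples, so the item has $\mathcal{CS}$ = 0.6 and would be sent to the conquer stage under $\mu$ = 0.6.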

4 Related Work

LLMs reasoning for MCQs. As a problem format listing alternative answers, MCQs are prevalent in the real world and have led to numerous related datasets, such as MMLU (Hendrycks et al., 2020), BIG-bench (Srivastava et al., 2022), AGIEval (Zhong et al., 2023), and CEVAL (Huang et al., 2023). Meanwhile, many works on MCQs have emerged. Robinson et al. (2022) explore integrating the question with the choices list and then guiding the model to select the correct option’s symbol. Pezeshkpour & Hruschka (2023) discover LLMs’ position bias, revealing that the order of choices can significantly impact the model’s performance. Zheng et al. (2023b) find selection bias, where LLMs display a clear preference for choosing options from specific positions. Different from these works, we explore the model’s sensitivity to the number of options and verify that filtering incorrect choices can further improve performance.

CoT prompting in LLMs reasoning. Recently, CoT prompting methods have significantly enhanced the reasoning abilities of LLMs. As the pioneer, Wei et al. (2022) generate intermediate reasoning steps before arriving at the answer by integrating rationales into few-shot exemplars. Following it, Wang et al. (2022), Zhou et al. (2022), Yao et al. (2023), Besta et al. (2023), Sel et al. (2023), Jin & Lu (2023), Jiang et al. (2023b), Yan et al. (2023), Zhu et al. (2023), Li et al. (2023b) and Deb et al. (2023) are dedicated to optimizing the thinking process. Gao et al. (2023), Chen et al. (2022), Chen et al. (2023), Yamauchi et al. (2023) and Jie et al. (2023) employ external tools to disentangle computation from LLMs. Zhang et al. (2022), Diao et al. (2023), Shum et al. (2023), Sun et al. (2023) and Zou et al. (2023) explore constructing demonstrations in distinct manners. Mekala et al. (2023), Zheng et al. (2023c), Li et al. (2023a), Yasunaga et al. (2023) and Crispino et al. (2023) enable models to generate exemplars by themselves. Xue et al. (2023), Miao et al. (2023), Zhang et al. (2023), Ling et al. (2023) and Weng et al. (2023) introduce the concept of verification into the community. In addition, Shi et al. (2023) delve into the distractibility of LLMs by irrelevant context in questions. Zheng et al. (2023a) utilize previously generated answers as hints to progressively guide the model to the correct answer. Kong et al. (2023) define specific roles for the model based on the particular task. However, all these works process data uniformly, neglecting problem-solving difficulty. Therefore, we propose DCR for LLM reasoning, which first divides the dataset and then selects the intricate items for deeper processing by filtering irrelevant choices.

5 Conclusion

In this paper, we propose DCR to enhance the reasoning abilities of LLMs for MCQs by dividing the dataset based on $\mathcal{CS}$ and subsequently conquering the items with low $\mathcal{CS}$ . Evaluation results on nine datasets across three tasks show that DCR not only minimizes unnecessary computation for simple problems but also substantially improves performance on more intricate ones. In addition, through detailed analysis, we confirm a positive relation between $\mathcal{CS}$ and accuracy, and find that fewer choices lead to better outcomes. Nonetheless, utilizing previously generated results to filter choices fails to effectively eliminate strong distractors, and computing $\mathcal{CS}$ through SC is resource-intensive. Therefore, in the future we will develop more efficient strategies for filtering distractors and for reducing the computational demand of dataset division.
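The divide step and the choice filtering summarized above can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: $\mathcal{CS}$ is approximated as the share of sampled answers that agree with the majority answer, the `threshold` value is a hypothetical cut-off, and the function names are illustrative only.

```python
from collections import Counter


def confidence_score(answers):
    """Return (majority_answer, CS), where CS is the share of sampled
    answers that agree with the majority (an assumed estimate of CS)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def filter_choices(choices, answers):
    """Keep only the choices that at least one sampled answer selected."""
    voted = set(answers)
    return {label: text for label, text in choices.items() if label in voted}


def divide(samples, threshold=1.0):
    """Split question ids into (high-CS, low-CS) lists.
    `threshold` is a hypothetical cut-off, not a value from the paper."""
    high, low = [], []
    for qid, answers in samples.items():
        _, cs = confidence_score(answers)
        (high if cs >= threshold else low).append(qid)
    return high, low


# Toy data: five sampled answers per question (t = 5).
samples = {
    "q1": ["A", "A", "A", "A", "A"],
    "q2": ["A", "A", "B", "A", "C"],
}
choices_q2 = {"A": "12", "B": "15", "C": "18", "D": "21", "E": "24"}

high, low = divide(samples)
reduced_q2 = filter_choices(choices_q2, samples["q2"])
```

In this toy run, `q1` would keep its majority answer directly, while `q2` would be re-asked with only the choices that received at least one vote. A choice that attracts votes but is wrong stays in the reduced list, which matches the limitation on strong distractors noted above.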

References

Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
Bentley (1980) Jon Louis Bentley. Multidimensional divide-and-conquer. Communications of the ACM, 23(4):214–229, 1980.
Bentley & Shamos (1976) Jon Louis Bentley and Michael Ian Shamos. Divide-and-conquer in multidimensional space. In Proceedings of the eighth annual ACM symposium on Theory of computing, pp. 220–230, 1976.
Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
Chen et al. (2023) Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, and Ji-Rong Wen. Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models. arXiv preprint arXiv:2305.14323, 2023.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Crispino et al. (2023) Nicholas Crispino, Kyle Montgomery, Fankun Zeng, Dawn Song, and Chenguang Wang. Agent instructs large language models to be general zero-shot reasoners. arXiv preprint arXiv:2310.03710, 2023.
Deb et al. (2023) Aniruddha Deb, Neeva Oza, Sarthak Singla, Dinesh Khandelwal, Dinesh Garg, and Parag Singla. Fill in the blank: Exploring and enhancing llm capabilities for backward reasoning in math word problems. arXiv preprint arXiv:2310.01991, 2023.
Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.
Eisenstein (2006) Michael Eisenstein. Divide and conquer. Nature, 441(7097):1179–1179, 2006.
Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
Goldratt & Cox (2016) Eliyahu M Goldratt and Jeff Cox. The goal: a process of ongoing improvement. Routledge, 2016.
Heideman et al. (1984) Michael Heideman, Don Johnson, and Charles Burrus. Gauss and the history of the fast fourier transform. IEEE Assp Magazine, 1(4):14–21, 1984.
Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Hu et al. (2023) Yi Hu, Haotong Yang, Zhouchen Lin, and Muhan Zhang. Code prompting: a neural symbolic method for complex reasoning in large language models, 2023.
Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023a.
Jiang et al. (2023b) Song Jiang, Zahra Shakeri, Aaron Chan, Maziar Sanjabi, Hamed Firooz, Yinglong Xia, Bugra Akyildiz, Yizhou Sun, Jinchao Li, Qifan Wang, et al. Resprompt: Residual connection prompting advances multi-step reasoning in large language models. arXiv preprint arXiv:2310.04743, 2023b.
Jie et al. (2023) Zhanming Jie, Trung Quoc Luong, Xinbo Zhang, Xiaoran Jin, and Hang Li. Design of chain-of-thought in math problem solving. arXiv preprint arXiv:2309.11054, 2023.
Jin & Lu (2023) Ziqi Jin and Wei Lu. Tab-cot: Zero-shot tabular chain of thought. arXiv preprint arXiv:2305.17812, 2023.
Knuth (1998) Donald Ervin Knuth. Sorting and searching. The art of computer programming, 3, 1998.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
Kong et al. (2023) Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, and Xin Zhou. Better zero-shot reasoning with role-play prompting. arXiv preprint arXiv:2308.07702, 2023.
Li et al. (2023a) Rui Li, Guoyin Wang, and Jiwei Li. Are human-generated demonstrations necessary for in-context learning? arXiv preprint arXiv:2309.14681, 2023a.
Li et al. (2023b) Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, and Percy Liang. Benchmarking and improving generator-validator consistency of language models. arXiv preprint arXiv:2310.01846, 2023b.
Li et al. (2024) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. arXiv preprint arXiv:2401.10480, 2024.
Lin et al. (2021) Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, and Xiang Ren. Riddlesense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. arXiv preprint arXiv:2101.00376, 2021.
Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
Ling et al. (2023) Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. arXiv preprint arXiv:2306.03872, 2023.
Mallouk (2013) Thomas E Mallouk. Divide and conquer. Nature chemistry, 5(5):362–363, 2013.
Mekala et al. (2023) Rajasekhar Reddy Mekala, Yasaman Razeghi, and Sameer Singh. Echoprompt: Instructing the model to rephrase queries for improved in-context learning. arXiv preprint arXiv:2309.10687, 2023.
Miao et al. (2023) Ning Miao, Yee Whye Teh, and Tom Rainforth. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436, 2023.
Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
OpenAI (2023) OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191, 2021.
Pezeshkpour & Hruschka (2023) Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
Robinson et al. (2022) Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2022.
Sel et al. (2023) Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Lu Wang, Ruoxi Jia, and Ming Jin. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379, 2023.
Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
Shum et al. (2023) KaShun Shum, Shizhe Diao, and Tong Zhang. Automatic prompt augmentation and selection with chain-of-thought from labeled data. arXiv preprint arXiv:2302.12822, 2023.
Smith (1985) Douglas R Smith. The design of divide and conquer algorithms. Science of Computer Programming, 5:37–58, 1985.
Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Sun et al. (2023) Jiashuo Sun, Yi Luo, Yeyun Gong, Chen Lin, Yelong Shen, Jian Guo, and Nan Duan. Enhancing chain-of-thoughts prompting with iterative bootstrapping in large language models. arXiv preprint arXiv:2304.11657, 2023.
Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Weng et al. (2023) Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. CoRR, abs/2212.09561, 2023.
Xiong et al. (2023) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2023.
Xue et al. (2023) Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. Rcot: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. arXiv preprint arXiv:2305.11499, 2023.
Yamauchi et al. (2023) Ryutaro Yamauchi, Sho Sonoda, Akiyoshi Sannai, and Wataru Kumagai. Lpml: Llm-prompting markup language for mathematical reasoning. arXiv preprint arXiv:2309.13078, 2023.
Yan et al. (2023) Shaotian Yan, Chen Shen, Junjie Liu, and Jieping Ye. Concise and organized perception facilitates large language models for deductive reasoning. arXiv preprint arXiv:2310.03309, 2023.
Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
Yasunaga et al. (2023) Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H Chi, and Denny Zhou. Large language models as analogical reasoners. arXiv preprint arXiv:2310.01714, 2023.
Yu et al. (2020) Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. arXiv preprint arXiv:2002.04326, 2020.
Zhang et al. (2023) Haodi Zhang, Min Cai, Xinhe Zhang, Chen Jason Zhang, Rui Mao, and Kaishun Wu. Self-convinced prompting: Few-shot question answering with repeated introspection. arXiv preprint arXiv:2310.05035, 2023.
Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
Zheng et al. (2023a) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023a.
Zheng et al. (2023b) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. On large language models’ selection bias in multi-choice questions. arXiv preprint arXiv:2309.03882, 2023b.
Zheng et al. (2023c) Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi, Quoc V Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117, 2023c.
Zheng et al. (2023d) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023d.
Zhong et al. (2023) Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
Zhu et al. (2023) Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large language models can learn rules. arXiv preprint arXiv:2310.07064, 2023.
Zou et al. (2023) Anni Zou, Zhuosheng Zhang, Hai Zhao, and Xiangru Tang. Meta-cot: Generalizable chain-of-thought prompting in mixed-task scenarios with large language models. arXiv preprint arXiv:2310.06692, 2023.

Appendix A Zero-shot vs. Few-shot

Table 9: Problem-solving accuracy (%) between Zero-Shot-CoT and Few-Shot-CoT.

Method	AQuA	GSM8K	SVAMP	CMSQA	Average
Zero-Shot-CoT	54.86( $\pm$ 0.67)	79.33( $\pm$ 0.34)	78.20( $\pm$ 1.82)	69.67( $\pm$ 0.77)	70.52
Few-Shot-CoT	53.67( $\pm$ 0.67)	79.67( $\pm$ 1.18)	81.60( $\pm$ 1.34)	77.47( $\pm$ 0.96)	73.10

Comparing Zero-Shot-CoT (Kojima et al., 2022) and Few-Shot-CoT (Wei et al., 2022) across AQuA (Ling et al., 2017), GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021) and CMSQA (Talmor et al., 2018) in Table 9 shows that models’ zero-shot capabilities are gradually nearing or even surpassing their few-shot counterparts, which aligns with the conclusions of recent research (Hu et al., 2023; Zhong et al., 2023). Therefore, our work is entirely free from human intervention and circumvents exemplar construction.

Table 10: Statistics of the datasets. For CMSQA, RiddleSense, Logical Deduction and Reclor, we use their validation sets, as there are no publicly available test sets or labels. GSM8K and SVAMP, which are used in Section 3.4.5 and Appendix A, are cloze-style datasets without choices lists. In particular, for Logical Deduction, there are 60 questions with $|\mathbf{C}_{i}|$ as 3, 100 questions as 5, and 140 questions as 7. Therefore, given that 20% of the questions have 3 choices, we make a compromise and choose $t$ as 4.

Dataset	Task Type	Eval. Split	#Test ( $n$ )	#ChoicesNum ( $|\mathbf{C}_{i}|$ )	Infer. Times ( $t$ )
AQuA (AQ.)	Arithmetic	Test	254	5	5
Abstract Algebra (Alg.)	Arithmetic	Test	100	4	4
High School Mathematics (Math.)	Arithmetic	Test	270	4	4
CMSQA (CMS.)	Commonsense	Validation	1221	5	5
OpenBookQA (OB.)	Commonsense	Test	500	4	4
ARC Challenge (ARC.)	Commonsense	Test	1165	4	4
RiddleSense (Rid.)	Logic	Validation	1021	5	5
Logical Deduction (Logi.)	Logic	Validation	300	3, 5 or 7	4
Reclor (Rec.)	Logic	Validation	500	4	4
GSM8K	Arithmetic	Test	1319	-	-
SVAMP	Arithmetic	Test	300	-	-

Appendix B Different prompts for FCR

Since the most distinctive feature of FCR is its succinct choices list, we compare different prompts, as displayed in Figure 8. Specifically, “Prompt0” denotes “Let’s think step by step.”, “Prompt1” is the prompt used in FCR, and “Prompt2” represents “Let’s delve deeper into this question to arrive at the best answer.”. Across the datasets, the accuracy disparity of FCR with different prompts remains below 2%, with no single prompt clearly dominating. Therefore, we believe that the good performance of FCR is attributable to the shorter choices list rather than to prompt engineering.