Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

MindStar: Enhancing Math Reasoning in Pre-trained
LLMs at Inference Time

Jikun Kang*1, Derek Li*1, Xi Chen1, Amirreza Kazemi1, Qianyi Sun1, Boxing Chen1,
Dong Li1, Xu He1, Quan He1, Feng Wen1, Jianye Hao1, Jun Yao1
1Noah’s Ark Laboratory
{jaxon.kang, derek.li1, xi.chen4, amirreza.kazemi, qianyi.sun, boxing.chen,
lidong106, hexu27, hequan12, feng.wen, haojianye, yaojun97}@huawei.com
Abstract

Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method—MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.

Refer to caption
Figure 1: M*: A searching framework for inference time step reasoning. A: Each time we gather questions and previous reasoning steps to the LLMs and sample N next reasoning steps. B: We organize the reasoning process as a tree. Each node represents either question (the root node), answers (leaf nodes), or reasoning steps (all other nodes). A searching method traverses the reasoning tree and select a node to expand. We add the reasoning step of the selected node back to the prompt for next query step. We stop the generation processes until either the answer is find or the maximum consumption is reached.

1 Introduction

With the rapid growth of model size, transformer-based Large Language Models (LLMs) showcase impressive results in domains such as instruction following [37, 32], coding assistance [25, 4], and creative writing [11]. Among these tasks, unlocking the rationality of LLMs to solve complex reasoning tasks remains a major challenge. Recent works [49, 36] have attempted to tackle this challenge through Supervised Fine-Tuning (SFT). By mixing crafted new reasoning data samples with original datasets, LLMs learn the underlying distributions of these samples and attempt to solve unseen reasoning tasks. Although there is a performance gain, this method heavily relies on extensive training and requires extra data preparation [33, 42].

Refer to caption
Figure 2: Different reward models for LLMs’ output selections on MATH dataset. The x-axis denotes the total number of generated outputs

Recently, Llama-3 report [28] highlights a significant observation: when posed with a challenging reasoning question, a model will sometimes generate the correct reasoning trace. This indicates that the model knows how to produce the right answer but struggles with selecting it. Inspired by this discovery, we pose a straightforward question: Can we enhance the reasoning of LLMs during generation by assisting them in selecting the correct output? To explore this, we conduct an experiment utilizing different reward models to assist LLM for output selection. Here, we leverage the Outcome-supervised Reward Model (ORM) [6], which scores the entirety of reasoning solutions, and the Process-supervised Reward Model (PRM) [21], which scores each individual reasoning step, for the selection of reasoning solutions. Initially, we apply both the ORM and the PRM to select the final answer from multiple sampled chain-of-thoughts (CoT) solutions. Figure 2 shows that PRM selects better reasoning answers than ORM. Additionally, we employ the PRM to assist the LLM in a tree-of-thought context; Rather than generating the complete solution, the LLM produces multiple intermediate steps. The PRM then scores these steps and selects the best, facilitating the LLM in proceeding generation from a promising step. Our results demonstrate that step-level selection outperforms the two CoT selection baselines significantly.

Refer to caption
Figure 3: MATH accuracy of different LLMs. M* on LLaMA-2-13B achieves similar performance as GPT-3.5 (4-shot) while saving approximately 200 times the computational resources.

Based on above findings, we propose MindStar (M*), a novel framework depicted in Figure 1, tailored for enhancing LLM reasoning during the inference time. Initially, M* prompts the LLM with the question to generate multiple potential next steps. In the context of reasoning tree, the question is the root and the new generated steps are its children. Subsequently, the trained process-supervised reward model (PRM) scores the steps based on their likelihood of correctness. The selected step will then be appended to the prompt, and the algorithm iterates until the final answer is reached or computational budgets are exceeded. Leveraging the reward model to help the LLM asses its reasoning steps serves as a self-reflection mechanism. Note that unlike existing self-reflection methods [16, 44] that only revise the most recent step, M* reflects on the entire trajectory comprising all previous steps. Thus, it avoids the pitfall of optimizing performance solely based on current step, and allows the model to select faithful reasoning solutions. Moreover, in order to select the best trajectory at each iteration, M* can be coupled with various tree search algorithm. In this paper, we explore two algorithms, which are beam search [24] and levin tree search [30]. The beam search is a greedy algorithm that uses the PRM score as heuristic, while Levin tree search (LevinTS) takes both the PRM score and the depth of a trajectory into account. Furthermore, we show that M* coupled with LevinTS guarantees a computation upperbound in finding the correct solution.

We evaluate M* on challenging MATH problems [14] and compared it to existing open and closed-source LLMs, including LLaMA-2 [40], Grok-1, GPT [1], Claude [2], and Gemini [38]. The results, shown in Figure 3, indicate that by utilizing LLaMA-2-13B as the base model, our method significantly improves its performance on MATH dataset from 8% to 33%. This performance matches that of GPT-3.5, but with approximately 200 times less computational resource usage in inference time. These results highlight the benefits of shifting computational resources from fine-tuning to inference searching and shed light on potential future research directions.

We summarize our major contributions as follows: 1) we introduce M*, a tree-like search-based reasoning framework that enhances the reasoning capabilities of Large Language Models (LLMs) through a structured, step-by-step approach during the inference time. 2) we propose the adaptation of two search algorithms in accomplishing LLM reasoning tasks, namely beam search and Levin tree search, which helps traverse the reasoning tree with guaranteed search time. 3) we evaluate the performance of the M* on the GSM8K and MATH datasets. The results show that using beam search and Levin tree search improves the performance of the LLama-2-13B model by 58.6% and 64.6% on the GSM8K dataset, respectively, and by 58.8% and 66.1% on the MATH dataset, respectively.

2 Related Work

Multi-step Reasoning in LLMs. In recent years, several methods have been proposed to enhance LLM reasoning capability, ranging from fine-tuning the base model [5, 9, 20, 50] to chain-of-thought (CoT) prompting and its variants [19, 45, 51, 43, 6]. Specifically, Wei et al. [45] and Kojima et al. [19] demonstrate that CoT prompting can enhance LLM reasoning in few-shot and zero-shot settings. Such in-context improvement grounds in the decoder architecture of LLMs, however, a single reasoning path (i.e., greedy decoding) often suffers from stochasticity and lacks the diversity needed for complex reasoning tasks. To mitigate this, Wang et al. [43] proposes to generate a diverse set of reasoning paths and perform a majority voting. Similarly, Cobbe et al. [6] trains a solution verifier and Weng et al. [46] prompts LLM for self-verification in order to determine the quality of generated reasoning paths. Despite this, recent studies [10, 26, 41] found that LLMs often make unfaithful reasoning. This sheds light to the importance of verifying each step of the reasoning chain [21]. Moreover, CoT does not take different alternatives into account at the generation time, and there is no mechanism to evaluate the current generated chain and possibly look ahead or backtrack. Therefore, our work largely differs from the CoT literature since we utilize the step-level feedback in order to search for a reasoning path within a reasoning tree.

Feedback-Guided Tree Search for LLM Reasoning. The ToT framework is introduced in [48, 23]. Inspired by this, various methods [8, 27, 12, 47, 3] have been proposed to find a good reasoning path within the tree, employing different heuristics and search algorithms. A straightforward heuristic is that one prompt the LLM itself to assess its generated steps, as demonstrated in Yao et al. [48] with breadth/depth-first search, in Hao et al. [12] with Monte Carlo tree search, and in Xie et al. [47] with beam search. However, recent studies have shown that LLM struggles to evaluate itself and rectify its initial responses without any external feedback [17, 8]. In contrast, our method’s search heuristic relies on a reward model and thus performs more accurately. In a different approach, Feng et al. [8] and Tian et al. [39] propose learning the value function to estimate the value of the current reasoning path, while Ma et al. [27] trains a process-supervised reward model (PRM) and utilizes it with Asuperscript𝐴A^{*}italic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT-like tree search. In comparison, our method is more computationally efficient since we do not deal with sample complexity issues of value function learning. In particular, we show that incorporating PRM as a heuristic with Levin tree search guarantees an upper bound on computation cost [30].

3 M*: Think and Reflect Step by Step

As illustrated in Figure 1, we propose a novel framework that facilitates LLMs reasoning abilities at inference time. The brief overview of the M* algorithm is summarized in Algorithm 1.

3.1 Problem Formulation

We define a large language model (LLM) parameterized by θ𝜃\thetaitalic_θ, as G(;θ)𝐺𝜃G(\cdot;\theta)italic_G ( ⋅ ; italic_θ ). We also define a reasoning tree 𝒯𝒯\mathcal{T}caligraphic_T, where the root is the question, the edges are the generated intermediate steps by LLM, and the nodes are the sequences of steps. In other words, a node in the reasoning tree represents a reasoning path consisting of edges in the path from the root to that node, denoted as nd=[nqe1e2ed1]subscript𝑛𝑑delimited-[]direct-sumsuperscript𝑛𝑞subscript𝑒1subscript𝑒2subscript𝑒𝑑1n_{d}=[n^{q}\oplus e_{1}\oplus e_{2}\oplus\dots\oplus e_{d-1}]italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT = [ italic_n start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ⊕ italic_e start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊕ italic_e start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊕ ⋯ ⊕ italic_e start_POSTSUBSCRIPT italic_d - 1 end_POSTSUBSCRIPT ], where nqsuperscript𝑛𝑞n^{q}italic_n start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT represents the root node (question), eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the edge (step) at depth i𝑖iitalic_i, and direct-sum\oplus is the concatenation operation. In this paper, we use terms node and reasoning path interchangeably, as well as edge and reasoning step. Our goal is to find the node that consists of correct reasoning steps for the desired question. To achieve this, we utilize a process-supervised reward model coupled with a tree search algorithm, which will be introduced in the following sections.

3.2 Process-supervised Reward Model

As mentioned earlier, we aim to assess the intermediate steps generated by LLMs to help select the correct reasoning path. Building on the success of the Process-supervised Reward Model (PRM) [21], we utilize a PRM to measure the likelihood of correctness for each step. Specifically, PRM 𝒫𝒫\mathcal{P}caligraphic_P takes the current reasoning node ndsubscript𝑛𝑑n_{d}italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and the potential next step edsubscript𝑒𝑑e_{d}italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT as the inputs and returns a reward value 𝒫(nd,ed)=rd[0,1]𝒫subscript𝑛𝑑subscript𝑒𝑑subscript𝑟𝑑01\mathcal{P}(n_{d},e_{d})=r_{d}\in[0,1]caligraphic_P ( italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) = italic_r start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ∈ [ 0 , 1 ]. Importantly, when evaluating a new step, PRM considers the previous reasoning steps. This encourages the LLM to be consistent and faithful with respect to the entire path. Therefore, a high reward value suggests that edsubscript𝑒𝑑e_{d}italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT can be a correct next step for ndsubscript𝑛𝑑n_{d}italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, making the trace [nded]delimited-[]direct-sumsubscript𝑛𝑑subscript𝑒𝑑[n_{d}\oplus e_{d}][ italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⊕ italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ] worth exploring. Conversely, a small reward value can be viewed as an incorrect step, suggesting that solutions following [nded]delimited-[]direct-sumsubscript𝑛𝑑subscript𝑒𝑑[n_{d}\oplus e_{d}][ italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⊕ italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ] are likely incorrect.

We now describe the M* algorithm, which consists of two steps. Until finding the correct solution, at each iteration of the algorithm, 1) we prompt the base LLM to generate next steps for the current reasoning path, 2) we evaluate the generated steps using PRM and select a reasoning path for the next round of algorithm.

3.3 Reasoning Node Expansion

Given that we select a reasoning node ndsubscript𝑛𝑑n_{d}italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT to expand, we design a prompt template Example 3.3 in order to collect next steps from LLMs. As shown in the example, the LLM takes the original question as {question} and the current reasoning path as {answer} in the prompt. Note that in the first iteration of the algorithm, the selected node is the root containing the question only, and therefore the {answer} is empty. For the reasoning path ndsubscript𝑛𝑑n_{d}italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, the LLM generates N𝑁Nitalic_N multiple intermediate steps ed1,ed2,,edNsubscriptsuperscript𝑒1𝑑subscriptsuperscript𝑒2𝑑subscriptsuperscript𝑒𝑁𝑑e^{1}_{d},e^{2}_{d},\dots,e^{N}_{d}italic_e start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , … , italic_e start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT for the given prompt and we append them as the children node of the current node. In the next step of the algorithm, the new child nodes will be assessed, and a new node will be selected for further expansion. We also acknowledge that one alternative for generating the steps is fine-tuning the LLM using step tokens. However, it could potentially degrade the LLM’s reasoning ability and, more importantly, is not aligned with the focus of this paper which, is enhancing the LLM without any weight modification.

Example 3.1:  Step Prompt Template [INST] <<SYS>> Below is an instruction that describes a task. Write a response that appropriately completes the request. Output each step in a separate line, and explicitly state the final answer after the final step following the format. "The answer is:" <</SYS>> Instruction:{question}question\{\text{question}\}{ question }[/INST] Response: Let’s think step by step.{answer}answer\{\text{answer}\}{ answer }

3.4 Reasoning Path Selection

Following the reasoning node expansion, we use the pre-trained PRM 𝒫𝒫\mathcal{P}caligraphic_P to reflect each newly generated step. As mentioned in Section 3.2, the PRM takes the path ndsubscript𝑛𝑑n_{d}italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and the steps edsubscript𝑒𝑑e_{d}italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT as inputs and returns the corresponding reward value. After the evaluation, we require a tree search algorithm to select the next node for expansion. Note that our framework is agnostic to the search algorithm, and in this work, we instantiate it with two tree search methods, namely beam search and Levin tree search. Additionally, we introduce an ensemble method of M* search as an extension — Forest Search in Appendix C.

Beam Search. We first employ beam search, an algorithm similar to how a language model generates tokens while decoding. After computing the reward value of the pairs of reasoning path and next step, the algorithm selects the next step with the highest value, ed=argmaxei{ed1,ed2,,edN}𝒫(nd,edi)superscriptsubscript𝑒𝑑subscriptsubscript𝑒𝑖superscriptsubscript𝑒𝑑1superscriptsubscript𝑒𝑑2superscriptsubscript𝑒𝑑𝑁𝒫subscript𝑛𝑑superscriptsubscript𝑒𝑑𝑖e_{d}^{*}=\arg\max_{e_{i}\in\{e_{d}^{1},e_{d}^{2},\dots,e_{d}^{N}\}}\mathcal{P% }(n_{d},e_{d}^{i})italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , … , italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT } end_POSTSUBSCRIPT caligraphic_P ( italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ), and the selected reasoning path for the next iteration is nd+1=[nded]subscript𝑛𝑑1delimited-[]direct-sumsubscript𝑛𝑑superscriptsubscript𝑒𝑑n_{d+1}=[n_{d}\oplus e_{d}^{*}]italic_n start_POSTSUBSCRIPT italic_d + 1 end_POSTSUBSCRIPT = [ italic_n start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⊕ italic_e start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ]. The beam search algorithm can be viewed as a step-wise ranking method. Although it searches within a rich space of reasoning tree, its time-complexity is O(n)𝑂𝑛O(n)italic_O ( italic_n ), comparable to self-consistency and re-ranking methods. However, beam search only takes the PRM reward score into account and it lacks backtracking or self-correction mechanism. Moreover, there is no guarantee that beam search is able to find the correct reasoning path. To address these issues, we propose another M* variant with Levin tree search.

Levin Tree Search. Levin Tree Search (LevinTS) [30] is a best-first tree search algorithm [34], which relies on a cost function. The cost function is defined as f(n)π(n)𝑓𝑛𝜋𝑛\frac{f(n)}{\pi(n)}divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG and the algorithm expands by its increasing order. The computation cost of node n𝑛nitalic_n, denoted as f(n)𝑓𝑛f(n)italic_f ( italic_n ), is defined as f(n):=eτitokassign𝑓𝑛superscript𝑒𝜏subscript𝑖𝑡𝑜𝑘f(n):=e^{\tau\cdot i_{tok}}italic_f ( italic_n ) := italic_e start_POSTSUPERSCRIPT italic_τ ⋅ italic_i start_POSTSUBSCRIPT italic_t italic_o italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where itoksubscript𝑖𝑡𝑜𝑘i_{tok}italic_i start_POSTSUBSCRIPT italic_t italic_o italic_k end_POSTSUBSCRIPT is the number of tokens in the reasoning path corresponding to node n𝑛nitalic_n, and τ𝜏\tauitalic_τ is a temperature parameter. The symbol π(n)𝜋𝑛\pi(n)italic_π ( italic_n ) denotes the probability that the solution exists under the sub-tree for which the root is node n𝑛nitalic_n. Therefore, π𝜋\piitalic_π for the root is equal to 1. For a node n𝑛nitalic_n with parent nsuperscript𝑛n^{\prime}italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT connected by an edge esuperscript𝑒e^{\prime}italic_e start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, π(n):=π(n)e𝒫(n,e)i=1Ne𝒫(n,ei)assign𝜋𝑛𝜋superscript𝑛superscript𝑒𝒫superscript𝑛superscript𝑒superscriptsubscript𝑖1𝑁superscript𝑒𝒫superscript𝑛subscript𝑒𝑖\pi(n):=\pi(n^{\prime})\cdot\frac{e^{\mathcal{P}(n^{\prime},e^{\prime})}}{\sum% _{i=1}^{N}e^{\mathcal{P}(n^{\prime},e_{i})}}italic_π ( italic_n ) := italic_π ( italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ⋅ divide start_ARG italic_e start_POSTSUPERSCRIPT caligraphic_P ( italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT caligraphic_P ( italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT end_ARG, where 𝒫𝒫\mathcal{P}caligraphic_P is the PRM and eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the generated step by the LLM. One can see that a child node has strictly higher cost compared to its parent, which means that the algorithm favors short reasoning path with high PRM reward scores. Interestingly, by taking into account the cost of the nodes as well as the PRM score, LevinTS can guarantee an upper bound on the number of generated tokens. More precisely, Theorem 3.4, which is an extension of Theorem 3 in Orseau et al. [31], shows that the number of generated tokens is always less than the cost f(n)π(n)𝑓𝑛𝜋𝑛\frac{f(n)}{\pi(n)}divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG of any target nodes (proof in Appendix D). It is also worth mentioning that LevinTS supports backtracking, meaning that the selected node for the next iteration is not necessarily the child of the current node. This implies that LevinTS is also more robust to beam search, and selecting a wrong step does not prevent the algorithm from reaching the correct reasoning path. The details of beam search and Levin tree search algorithms are explained in Appendix B.

Theorem 3.1:  LevinTS Upper Bound Let 𝒩gsuperscript𝒩𝑔\mathcal{N}^{g}caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT be a set of target nodes, let τ1𝜏1\tau\geq 1italic_τ ≥ 1, and let the computation cost of a node n be defined as f(n)=eτitok𝑓𝑛superscript𝑒𝜏subscript𝑖𝑡𝑜𝑘f(n)=e^{\tau\cdot i_{tok}}italic_f ( italic_n ) = italic_e start_POSTSUPERSCRIPT italic_τ ⋅ italic_i start_POSTSUBSCRIPT italic_t italic_o italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Then, LevinTS ensures that the number of generated tokens |𝒩¯(LevinTS,𝒩g)|¯𝒩LevinTSsuperscript𝒩𝑔|\bar{\mathcal{N}}(\text{LevinTS},\mathcal{N}^{g})|| over¯ start_ARG caligraphic_N end_ARG ( LevinTS , caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) | before reaching any of the target nodes is bounded by, |𝒩¯(LevinTS,𝒩g)|minn𝒩gf(n)π(n)¯𝒩LevinTSsuperscript𝒩𝑔subscript𝑛superscript𝒩𝑔𝑓𝑛𝜋𝑛|\bar{\mathcal{N}}(\text{LevinTS},\mathcal{N}^{g})|\leq\min_{n\in\mathcal{N}^{% g}}\frac{f(n)}{\pi(n)}| over¯ start_ARG caligraphic_N end_ARG ( LevinTS , caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) | ≤ roman_min start_POSTSUBSCRIPT italic_n ∈ caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG
Input: Question node nqsuperscript𝑛𝑞n^{q}italic_n start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT, PRM 𝒫()𝒫\mathcal{P}()caligraphic_P ( ), language model G(;θ)G(;\theta)italic_G ( ; italic_θ ), maximum depth D𝐷Ditalic_D, branch factor N𝑁Nitalic_N, reasoning tree 𝒯𝒯\mathcal{T}caligraphic_T.
Initialization: 𝒯={(nq,1)}𝒯superscript𝑛𝑞1\mathcal{T}=\{(n^{q},1)\}caligraphic_T = { ( italic_n start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT , 1 ) }
while True do
       n,r𝑛𝑟n,ritalic_n , italic_r = get_node(𝒯𝒯\mathcal{T}caligraphic_T) /* w.r.t tree search algorithm */
       if n𝑛nitalic_n is the answer or get_depth(n) >Dabsent𝐷>D> italic_D then
            return n𝑛nitalic_n
      for i0𝑖0i\leftarrow 0italic_i ← 0 to N1𝑁1N-1italic_N - 1 do
             eiG(n;θ)subscript𝑒𝑖𝐺𝑛𝜃e_{i}\leftarrow G(n;\theta)italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← italic_G ( italic_n ; italic_θ ) /* Expansion */
             nineisubscript𝑛𝑖direct-sum𝑛subscript𝑒𝑖n_{i}\leftarrow n\oplus e_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← italic_n ⊕ italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT /* New node */
             rir×𝒫(n,ei)subscript𝑟𝑖𝑟𝒫𝑛subscript𝑒𝑖r_{i}\leftarrow r\times\mathcal{P}(n,e_{i})italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← italic_r × caligraphic_P ( italic_n , italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) /* Compute reward using PRM */
             add_node(𝒯𝒯\mathcal{T}caligraphic_T, (ni,ri)subscript𝑛𝑖subscript𝑟𝑖(n_{i},r_{i})( italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ))
Algorithm 1 Generic M* Algorithm

4 Evaluation

We evaluate the M* method to answer the following questions. 1) How does M* improve LLMs performance on math reasoning tasks? 2) How does M* scale with reasoning tree size? 3) How much extra computation resources costs by M*?

4.1 Evaluation Setups

Benchmarks: M* is a versatile framework applicable to a variety of reasoning tasks. In this study, we focus our experiments on two widely known mathematical reasoning benchmarks: the GSM8K dataset [6] and the MATH dataset [14]. It is important to note that we evaluate only 500 of the 4500 test questions from the MATH dataset. This is because the remaining 4000 questions are part of the PRM800K [21] dataset, on which the process-supervised reward model is trained.

Evaluation Method: For the purposes of reproducibility and transparency, we assess our results using OpenAI’s evaluation tool suite111https://github.com/openai/simple-evals. Specifically, for mathematical reasoning questions, this suite calculates the accuracy by comparing the final reasoning answers with the ground truth.

4.2 Baseline LLMs

We evaluate the performance of M* on a set of general open-source models of various sizes, including Mistral-7B [18] and Llama-2-13B [40]. We do not apply M* directly to a math fine-tuned model because, although it excels at math problems, its performance declines on other datasets and raises safety concerns. A detailed analysis can be found in Appendix E.4. Also, we consider two M* variants in the experiments, M* (BS@16) and M* (LevinTS@16) which represent the beam search and levin tree search algorithms with branch factor of 16, respectively. For a fair comparison, we compare our results with two baseline methods proposed for enhancing LLM reasoning at inference: CoT and CoT-SC@16. For the CoT method, we append a sentence to the prompt asking the language model to reason step-by-step. CoT-SC@16 also represents the CoT method with self-consistency, that is sampling 16 candidate answers and selecting the consistent one. Furthermore, we compare our results against closed-source models, including OpenAI’s GPT-4 and GPT-3.5, Anthropic’s Claude-3 and Claude-2, as well as Google’s Gemini model family. It is important to note that the results for closed-source models were taken from their respective reports. We present these results to demonstrate how effectively M* narrows the performance gap between open-source and closed-source model reasoning abilities.

4.3 Implementation Details

PRM Pre-Training. We pretrain the PRM model on Llama-2-13B model with LoRA adaptor [15], the rank is 8 and the scaling factor α𝛼\alphaitalic_α is 16. The trainable parameters of LoRA adapter are 0.05% of the 13B model parameters. We train the PRM model as a binary-classification task, where the labels are correct and incorrect. For the PRM800K dataset [21] , which includes correct, incorrect and neutral labels, we treat neutral label as incorrect labels. As stated in Lightman et al. [21], considering neutral label either correct or incorrect doesn’t significantly affect the overall training performance. We use this design choice for more accurate and conservative feedback for the search purpose. The PRM training results are showed in Appendix Figure 7, where we can see the performance keeps improving when feeding more training data. The details about the base model parameters and computational resources are provided in Appendix A.

PRM Fine-tuning. We utilize the process-reward data from the PRM800k dataset to train a general PRM model for mathematical reasoning. For the GSM8K dataset, we generate process-reward data to fine-tune the pre-trained model. Since we already have the ground truth reasoning answers in the datasets, the positive steps, i.e., the correct and faithful steps, can be recovered. For the negative reasoning steps, we prompt the ground truth reasoning answer to GPT-3.5 and explicitly ask it to perturb the steps so they do not follow each other reasonably. We then collect the generated step-reward data and fine-tune the general PRM for the GSM8K dataset.

Model Size GSM8K MATH
Closed-Source Model
Gemini Ultra - 94.4

(Maj1@32)

53.2

(4-shot)

GPT-4 (turbo-0409) - - 73.4

(CoT)

GPT-4 - 92.0

(SFT&5-shot CoT)

52.9

(4-shot)

GPT-3.5 - 57.1

(5-shot)

34.1

(4-shot)

Claude-3 (Opus) - 95.0

(CoT)

60.1

(0-shot)

Claude-3 (Haiku) - 88.9

(CoT)

38.9

(0-shot)

Grok-1.5 - 74.1

(0-shot)

50.6

(4-shot)

Grok-1 - 62.9

(8-shot)

23.9

(4-shot)

Mistral (Open-Source)
Mistral (CoT) 7B 50.1 15.6
Mistral (CoT-SC@16) 7B 56.4 23.9
MetaMath-Mistral (CoT) 7B 77.7 28.2
Mistral+M* (BS@16) 7B 71.9 36.4
Mistral+M* (LevinTS@16) 7B 73.7 38.2
Llama (Open-Source)
Llama-2 (CoT) 13B 25.1 9.4
Llama-2 (CoT-SC@16) 13B 41.8 20.4
MetaMath-Llama-2 (CoT) 13B 72.3 22.4
Llama-2+M* (BS@16) 13B 66.3 32.4
Llama-2+M* (LevinTS@16) 13B 68.8 33.9
Table 1: Comparison results of various schemes on the GSM8K and MATH reasoning benchmarks are presented. The number for each entry is the problem solve percentage. The notation SC@32 denotes self-consistency across 32 candidate results, while n𝑛nitalic_n-shot indicates results from few-shot examples. CoT-SC@16 refers to self-consistency on 16 Chain of Thought (CoT) candidate results. BS@16 represents the beam search method, involving 16 candidates at each step-level, and LevinTS@16 details the Levin Tree Search method with the same number of candidates. Notably, the most recent result for the GPT-4 on the MATH dataset is reported as GPT-4-turbo-0409, which we highlight as it represents the best performance within the GPT-4 family.

4.4 Math Reasoning Benchmarks

We present the results of various open-source and closed-source large language models (LLMs) on the GSM8K and MATH benchmarks in Table 1. These results demonstrate that M* significantly improves the open-source model performance, becoming comparable to that of closed-source models.

Specifically, on the MATH dataset, M* (BS) and M* (LevinTS) increased the performance of the Llama-2-13B model (CoT-SC@16) from 20.4 to 32.4 and 33.9, respectively. These results are close to those of GPT-3.5, which scores 34.1, but the model size is only about 7.4% of GPT-3.5 (13B vs 175B). For the Mistral model, the M* (BS) and M* (LevinTS) methods improved the performance from 23.9 to 36.2 and 38.2 respectively, surpassing Grok-1 and GPT-3.5 performances. Yet, when set against Claude-3, GPT-4 and Gemini, M* variants are still outmatched.

We observe similar results on the GSM8K dataset. M* (BS) and M* (LevinTS) boosted the performance of the Llama-2-13B model (CoT-SC@16) from 41.8 to 66.3 and 68.8, respectively. Also, for the Mistral model, M* (BS) and (LevinTS) led to improvements of around 52.3% and 59.8% over the base CoT-SC@16 score respectively. It is worth mentioning that M* (LevinTS) consistently achieves a better performance compared to beam search. Nonetheless irrespective of tree search algorithm or the base model, M* framework substantially narrows down the performance gap between open-source and closed-source models in mathematical reasoning tasks.

Math Fine-tuning VS. M*. Furthermore, we observe a performance gain using the M* method compared to models fine-tuned on the MATH dataset, but a lower performance on GSM8K. One explanation is that simpler tasks like those in GSM8K benefit more from extensive training data. However, for more complex tasks like those in the MATH dataset, the M* method significantly enhances reasoning abilities. To further justify M*, we demonstrate the performance degradation of math fine-tuned models in Appendix E.4 and compare fine-tuning versus inference-time search results in Appendix E.2.

Refer to caption
(a) Tree Size Scaling
Refer to caption
(b) PRM and Base Model Scaling
Refer to caption
(c) Forest Scaling
Figure 4: We study how M* performance scales with different parameters. In 4(a), We study how M* performance scales with the number of step-level candidates. We choose Llama-2-13B with BS as the base model and search algorithm, respectively. In  4(b), we show base LLM model size vs. PRM model size. The red dots represents performance across various base model sizes using PRM-13B, while the purple dots indicates performance with PRM-7B. The grey area shows the performance improvements achieved by increasing the size of the PRM model. In 4(c), we present forest search results.

4.5 M* Scaling Results

Tree Size Scaling Results. In Figure 4(a), we demonstrate how the number of step-level candidates influences M* performance. The reported results are based on choosing Llama-2 13b as the base LLM and beam search as the tree search algorithm. We observe a consistent improvement in performance with an increase in the number of candidates, indicating that M* method identifies better reasoning trajectories as the search space expands. Additionally, in the MATH dataset, we note that performance tends to converge when the number of candidates increases from 8 to 16. This is because Llama-2-13B struggles to produce diverse step-level responses as the number of sampled candidates increases.

Base Model Scaling Results. We next examine how model size affects overall M* performance. As illustrated by the red and purple dots in Figure 4(b), we observe that increasing the Llama-2 base model size from 7B to 13B enhances performance across both the GSM8K and MATH benchmarks. This observation supports the scaling laws relating to the base model size and highlights the potential for applying the M* framework to larger models. We believe that M* could also improve the performance of closed-source LLMs. Instead of increasing the size and training time of LLMs, we could conserve resources by enhancing performance during inference.

PRM Scaling Results. We explore how M* performance scales with PRM model sizes. We train two PRM models using Llama-2-7B and Llama-2-13B, respectively, ensuring that both models use the same training data and training duration for a fair comparison. The results are displayed in the grey area of Figure 4(b). From this figure, we observe that the performance improvement attributed to PRM model size is evident. Notably, the performance differential with Llama-2-13B is more significant than with Llama-2-7B. As the base LLM size increases, the enhanced PRM model leads to more precise differentiation within the search space. Therefore, larger models benefit more from a robust PRM model. This suggests that searching on larger LLMs could be advantageous for maximizing performance.

Forest Search Scaling Results. As shown in Figure 4(c), the accuracy consistently improves as the number of search trees increases, with 9 trees achieving accuracy of 39.6%percent39.639.6\%39.6 % compared to 36.4%percent36.436.4\%36.4 % for a single tree. These results demonstrate that forest search is an effective extension of the M*, leveraging the diversity of multiple reasoning trees to enhance the quality of the final answer.

4.6 Inference Overhead

To assess the inference overhead of the M* algorithm, we analyze the average number of generated tokens compared to baseline methods. As shown in Table 2, the Beam search method incurs about 1.5 times the cost of the CoT-Sc@16 and results in up to 66% performance improvement. In comparison, LevinTS costs roughly twice the compute compared to Beam search and further improves model performance by an additional 1.5 similar-to\sim 3%. While BS generate more tokens than the CoT-SC@16, the inference overhead is not excessive, especially considering the significant performance improvements. Although LevinTS is more expensive than the other two methods, it delivers significantly better performance. We recommend choosing based on needs: use LevinTS for more accurate results, and BS for a cost-effective option with fair performance.

Method #Tokens/Question
GSM8K MATH
CoT-SC@16 2146 2668
BS@16 3153 4290
LevinTS@16 6141 8850
Table 2: Average Tokens Generated per Question

5 Conclusion

In this paper, we introduce MindStar (M*), a novel reasoning framework that largely boosts the reasoning ability of a pre-trained LLM without any fine-tuning. By treating reasoning tasks as search problems and utilizing a process-supervised reward model, M* effectively expands and navigates the reasoning tree to identify approximately optimal paths. The incorporation of ideas from Beam Search and Levin Tree Search further enhances search efficiency and accuracy. Through evaluations on both the GSM8K and MATH datasets, we demonstrate that M* significantly improves the reasoning abilities of open-source models, such as LLaMA-2, achieving performance comparable to closed-source models like GPT-3.5 and Grok-1, with a substantially smaller model.

Limitations

The primary limitation of the M* method, as discussed in Section 4.6, is the increased inference cost. The M* method generates more tokens than the original chain-of-thought self-consistency (CoT-SC) approach, leading to higher expenses during inference. However, as demonstrated in Table 1, M* enhances the mathematical reasoning performance of the smaller Llama-2-13B model, surpassing that of the GPT-3.5 and Grok-1. This improvement reduces overall inference computational costs for larger model sizes.

Furthermore, the use of a PRM model is required to evaluate nodes in the reasoning tree, necessitating additional training and data. Nevertheless, we contend that training the PRM model consumes fewer computational resources than training larger models. Regarding data requirements, as shown in Appendix E.2, the data used for training the PRM model is more efficient than using the same data to fine-tune large language models (LLMs).

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. In The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024. URL https://api.semanticscholar.org/CorpusID:268232499.
  • Chen et al. [2024] Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process, 2024.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • Chung et al. [2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
  • Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
  • Dhamala et al. [2021] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. BOLD: dataset and metrics for measuring biases in open-ended language generation. In FAccT, pages 862–872. ACM, 2021.
  • Feng et al. [2024] Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024.
  • Fu et al. [2023] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning, 2023.
  • Golovneva et al. [2023] Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Roscoe: A suite of metrics for scoring step-by-step reasoning, 2023.
  • Gómez-Rodríguez and Williams [2023] Carlos Gómez-Rodríguez and Paul Williams. A confederacy of models: A comprehensive evaluation of llms on creative writing. arXiv preprint arXiv:2310.08433, 2023.
  • Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023.
  • Hartvigsen et al. [2022] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In ACL (1), pages 3309–3326. Association for Computational Linguistics, 2022.
  • Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.
  • Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR. OpenReview.net, 2022.
  • Huang et al. [2022] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
  • Huang et al. [2024] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet, 2024.
  • Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Kojima et al. [2023] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023.
  • Lewkowycz et al. [2022] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022.
  • Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
  • Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In ACL (1), pages 3214–3252. Association for Computational Linguistics, 2022.
  • Long [2023] Jieyi Long. Large language model guided tree-of-thought, 2023.
  • Lowerre [1976] Bruce T. Lowerre. The HARPY speech recognition system. Carnegie-Mellon University, 1976. URL https://api.semanticscholar.org/CorpusID:61409851.
  • Luo et al. [2023] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
  • Lyu et al. [2023] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning, 2023.
  • Ma et al. [2023] Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let’s reward step by step: Step-level reward model as the navigators for reasoning, 2023.
  • Meta AI [2024] Meta AI. Introducing meta llama 3: The most capable openly available llm to date, April 2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-04-30.
  • OpenAI [2024] OpenAI. Simple-evals, April 2024. URL https://github.com/openai/simple-evals. Accessed: 2024-04-20.
  • Orseau et al. [2018] Laurent Orseau, Levi H. S. Lelis, Tor Lattimore, and Théophane Weber. Single-agent policy tree search with guarantees, 2018.
  • Orseau et al. [2023] Laurent Orseau, Marcus Hutter, and Levi HS Leli. Levin tree search with context models. arXiv preprint arXiv:2305.16945, 2023.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • Paster et al. [2023] Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023.
  • Pearl [1984] Judea Pearl. Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley Longman Publishing Co., Inc., 1984.
  • Sap et al. [2019] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
  • Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • Stiennon et al. [2020] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Tian et al. [2024] Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Turpin et al. [2023] Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023.
  • Wang et al. [2023a] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023a.
  • Wang et al. [2023b] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023b.
  • Wang et al. [2022] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  • Wei et al. [2023] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • Weng et al. [2023] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification, 2023.
  • Xie et al. [2023] Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Self-evaluation guided beam search for reasoning, 2023.
  • Yao et al. [2024] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. [2023] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  • Zelikman et al. [2022] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022.
  • Zhou et al. [2023] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models, 2023.

Appendix A Experimental Settings and Computer Resources

Base Model Hyper-Parameters: To ensure diversity in step-level reasoning sentences, as illustrated in Table 3, we selected a specific set of parameters within the M* framework for both the Llama-2 and Mistral open-source models. Notably, we sample 16 candidates at each reasoning step and establish a maximum tree search depth of five levels. With these settings, the potential tree size reaches 165superscript16516^{5}16 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT, approximately 1 million nodes. This extensive range provides the language models with a broad array of generative options and covers a substantial search space, thereby demonstrating the effectiveness of the proposed framework. In mathematical reasoning tasks, we observed that open-source large language models (LLMs) typically complete the reasoning process within five steps.

Computer Resources: For the PRM training, base-model inference and M* algorithm, we use 8*Nvidia V100 GPUs.

Name Value
Base LLM Params
top_p 0.95
top_k 50
repetition_penalty 1.0
max_new_tokens 256
temperature 1.0
M* Params
#candidates 16
maximum search level 5
Table 3: M* Hyper-parameters

Appendix B Searching Algorithms

In this section, we explain beam search in Algorithm 2 and LevinTS in Algorithm 3.

Require : Question q𝑞qitalic_q, pre-trained PRM function 𝒫()𝒫\mathcal{P}()caligraphic_P ( ), language model G()𝐺G()italic_G ( ), branch factor N𝑁Nitalic_N, an empty reasoning tree 𝒯𝒯\mathcal{T}caligraphic_T, and the maximum search level L𝐿Litalic_L
while l<L𝑙𝐿l<Litalic_l < italic_L and question not answered do
       for nN𝑛𝑁n\in Nitalic_n ∈ italic_N do
             \triangleright Sample N𝑁Nitalic_N answers from LLM
             eln=G(nl)superscriptsubscript𝑒𝑙𝑛𝐺superscriptsubscript𝑛𝑙e_{l}^{n}=G(n_{l}^{*})italic_e start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = italic_G ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT )
             \triangleright Each answer is generated based on questions and previous steps
             Add a child node nl+1subscript𝑛𝑙1n_{l+1}italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT to the reasoning tree, where the node value is calculated as c(nl+1)=c(nl)+𝒫(nl,nl)𝑐subscript𝑛𝑙1𝑐superscriptsubscript𝑛𝑙𝒫superscriptsubscript𝑛𝑙subscript𝑛𝑙c(n_{l+1})=c(n_{l}^{*})+\mathcal{P}(n_{l}^{*},n_{l})italic_c ( italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT ) = italic_c ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) + caligraphic_P ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT )
             nl+1=max(nl+1)superscriptsubscript𝑛𝑙1subscript𝑛𝑙1n_{l+1}^{*}=\max(n_{l+1})italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_max ( italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT )
             if nl+1superscriptsubscript𝑛𝑙1n_{l+1}^{*}italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT solves the problem or l𝑙litalic_l equals the maximum search level L𝐿Litalic_L then
                   return the whole reasoning path and final answer nl+1superscriptsubscript𝑛𝑙1n_{l+1}^{*}italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
             end if
            else
                   l=l+1𝑙𝑙1l=l+1italic_l = italic_l + 1
                   nl=nl+1superscriptsubscript𝑛𝑙superscriptsubscript𝑛𝑙1n_{l}^{*}=n_{l+1}^{*}italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
             end if
            
       end for
      
end while
Algorithm 2 Beam Search
Require : A node set 𝒱𝒱\mathcal{V}caligraphic_V that have been expanded, and a node set \mathcal{F}caligraphic_F be the set of non-yet-expanded children of expanded nodes
𝒱𝒱\mathcal{V}\coloneqq\emptysetcaligraphic_V ≔ ∅
{nq}subscript𝑛𝑞\mathcal{F}\coloneqq\{n_{q}\}caligraphic_F ≔ { italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT }
while \mathcal{F}\neq\emptysetcaligraphic_F ≠ ∅ do
       nargminnf(nln)softmax(𝒫)(nln)𝑛subscript𝑛𝑓superscriptsubscript𝑛𝑙𝑛softmax𝒫superscriptsubscript𝑛𝑙𝑛n\coloneqq\arg\min_{n\in\mathcal{F}}\frac{f(n_{l}^{n})}{\text{softmax}(% \mathcal{P})(n_{l}^{n})}italic_n ≔ roman_arg roman_min start_POSTSUBSCRIPT italic_n ∈ caligraphic_F end_POSTSUBSCRIPT divide start_ARG italic_f ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) end_ARG start_ARG softmax ( caligraphic_P ) ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) end_ARG
       {n}𝑛\mathcal{F}\coloneqq\mathcal{F}\setminus\{n\}caligraphic_F ≔ caligraphic_F ∖ { italic_n }
       el+1n=G(nl)superscriptsubscript𝑒𝑙1𝑛𝐺superscriptsubscript𝑛𝑙e_{l+1}^{n}=G(n_{l}^{*})italic_e start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = italic_G ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT )
       if nl+1superscriptsubscript𝑛𝑙1n_{l+1}^{*}italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT solves the problem or l𝑙litalic_l equals the maximum search level L𝐿Litalic_L then
             return the whole reasoning path and final answer nl+1superscriptsubscript𝑛𝑙1n_{l+1}^{*}italic_n start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
       end if
      𝒱𝒱{nln}𝒱𝒱superscriptsubscript𝑛𝑙superscript𝑛\mathcal{V}\coloneqq\mathcal{V}\cup\{n_{l}^{n^{\prime}}\}caligraphic_V ≔ caligraphic_V ∪ { italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT }
       𝒞(nln)𝒞superscriptsubscript𝑛𝑙𝑛\mathcal{F}\coloneqq\mathcal{F}\cup\mathcal{C}(n_{l}^{n})caligraphic_F ≔ caligraphic_F ∪ caligraphic_C ( italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) \triangleright 𝒞()𝒞\mathcal{C}(\cdot)caligraphic_C ( ⋅ ) is the set of children nodes
end while
Algorithm 3 Levin Tree Search

Appendix C Forest Search

Building on the M* framework, we introduce an extension called Forest Search, an ensemble method that combines multiple M* search trees to improve the accuracy of results. The forest search algorithm proceeds as follows: 1) the base model (e.g., Mistral-7B) is queried with the original task to generate a paraphrased task variant for each search tree, thereby increasing the diversity of reasoning paths. We show the paraphrase examples in Appendix F; 2) M* tree search (e.g., Beam Search) is performed for each paraphrased task variant to collect step-by-step responses; 3) the PRM model scores the collected responses from each search tree, and the highest-scoring response is selected as the final answer to the task. As shown in Figure 4(c), We evaluate the performance of forest search on the MATH dataset, varying the number of search trees.

Appendix D LevinTS proof

Proof.

Let 𝒩gsuperscript𝒩𝑔\mathcal{N}^{g}caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT be a set of target nodes, nsuperscript𝑛n^{*}italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT be the first expanded node in the set of target nodes in the reasoning tree, and |𝒩¯(n)|¯𝒩superscript𝑛|\bar{\mathcal{N}}(n^{*})|| over¯ start_ARG caligraphic_N end_ARG ( italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) | denotes the number of tokens generated until expansion of node nsuperscript𝑛n^{*}italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Also, let \mathcal{L}caligraphic_L denotes the set of leaf nodes (i.e., answers) in the reasoning tree. The first node in 𝒩gsuperscript𝒩𝑔\mathcal{N}^{g}caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT to be expanded, nsuperscript𝑛n^{*}italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, is the one of lowest cost due to the monotonicity of fitoksubscript𝑓subscript𝑖𝑡𝑜𝑘f_{i_{tok}}italic_f start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_t italic_o italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT and π𝜋\piitalic_π, with cost c:-minn𝒩gf(n)π(n):-𝑐subscript𝑛superscript𝒩𝑔𝑓𝑛𝜋𝑛c\coloneq\min_{n\in\mathcal{N}^{g}}\frac{f(n)}{\pi(n)}italic_c :- roman_min start_POSTSUBSCRIPT italic_n ∈ caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG. Thus:

|𝒩¯(n)|¯𝒩superscript𝑛\displaystyle|\bar{\mathcal{N}}(n^{*})|| over¯ start_ARG caligraphic_N end_ARG ( italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) | nf(n)=nπ(n)f(n)π(n)absentsubscript𝑛𝑓𝑛subscript𝑛𝜋𝑛𝑓𝑛𝜋𝑛\displaystyle\leq\sum_{n\in\mathcal{L}}f(n)=\sum_{n\in\mathcal{L}}\pi(n)\frac{% f(n)}{\pi(n)}≤ ∑ start_POSTSUBSCRIPT italic_n ∈ caligraphic_L end_POSTSUBSCRIPT italic_f ( italic_n ) = ∑ start_POSTSUBSCRIPT italic_n ∈ caligraphic_L end_POSTSUBSCRIPT italic_π ( italic_n ) divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG
nπ(n)cabsentsubscript𝑛𝜋𝑛𝑐\displaystyle\leq\sum_{n\in\mathcal{L}}\pi(n)c≤ ∑ start_POSTSUBSCRIPT italic_n ∈ caligraphic_L end_POSTSUBSCRIPT italic_π ( italic_n ) italic_c
c=minn𝒩gf(n)π(n),absent𝑐subscript𝑛superscript𝒩𝑔𝑓𝑛𝜋𝑛\displaystyle\leq c=\min_{n\in\mathcal{N}^{g}}\frac{f(n)}{\pi(n)},≤ italic_c = roman_min start_POSTSUBSCRIPT italic_n ∈ caligraphic_N start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_f ( italic_n ) end_ARG start_ARG italic_π ( italic_n ) end_ARG ,

where the first inequality holds because each leaf node takes at most f(n)𝑓𝑛f(n)italic_f ( italic_n ) tokens to generate at the time nsuperscript𝑛n^{*}italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is being expanded by definition, the second inequality holds since any previously expanded node costs less than nsuperscript𝑛n^{*}italic_n start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT based on LevinTS’ node selection criteria, and finally the last inequality holds since nπ(n)1subscript𝑛𝜋𝑛1\sum_{n\in\mathcal{L}}\pi(n)\leq 1∑ start_POSTSUBSCRIPT italic_n ∈ caligraphic_L end_POSTSUBSCRIPT italic_π ( italic_n ) ≤ 1. ∎

Appendix E Extra Experiments

E.1 Analysis of Llama family scaling laws

In our investigation of scaling laws within the Llama family of models, notably Llama-2 [40] and Llama-3 [28], we applied the M* method to observe its impact on performance improvement relative to model size. As illustrated in Figure 5, the application of M* substantially enhances the performance of the Llama-2 model, aligning its scaling trajectory closer to that of the Llama-3 model. This improvement in scaling efficiency through the M* method is significant because it suggests that the reasoning capabilities of LLMs can be enhanced without necessarily increasing the volume of high-quality training data. Instead, the focus shifts toward selecting right responses, thereby conserving resources while still achieving competitive performance metrics.

Furthermore, these findings open avenues for future research focused on inference time enhancements. We believe this analysis not only reinforces the performances within the Llama family but also highlights the broader potential for similar advancements across different model families.

Refer to caption
Figure 5: Scaling laws for Llama-2 and Llama-3 model families on MATH datasets. The results are all reported from their original resources. We use the Scipy tool and a logarithm function to compute the fitting curve.

E.2 Fine-tuning VS. Inference-time Search

Refer to caption
Figure 6: Comparison results of fine-tuning methods and M* on MATH dataset.

Here we analyze two effective ways of using the PRM800K dataset in better solving math reasoning problems. We compare the performance of using the PRM800K dataset for fine-tuning v.s. training a PRM to guide inference-time search. As illustrated in Figure 6, the supervised fine-tuned (SFT) Llama-2-13B model, which utilizes the PRM800K dataset for fine-tuning, outperforms the vanilla Llama-2-13B model in both CoT and CoT-SC by a notable margin. However, the SFT approach still falls short compared to the PRM-guided search methods, namely Beam search and Levin Tree Search. By employing the PRM800K dataset to train a Process-supervised Reward Model (PRM) and using it to guide the search process, both Beam search (BS@16) and Levin Tree Search (LevinTS@16) significantly surpass the performance of the SFT model. This comparison highlights the superiority of the PRM-guided search methods in leveraging the PRM800K dataset for enhancing math reasoning capabilities. The results suggest that training a PRM to guide the search process is more effective than directly fine-tuning the base model, as it allows for an efficient exploration of the reasoning space and the identification of optimal reasoning paths.

E.3 Extended Computation Complexity Analysis

Method #Nodes/Question
GSM8K MATH
BS@16 3.59 3.97
LevinTS@16 7.23 8.22
Table 4: Average Node Expansions per Question

Similarly, as shown in Table 4, compared to the CoT and self-consistency baselines, which generate a single reasoning path or a fixed number of candidates that each consists of multiple steps of rationales, the M* algorithm with Beam and LevinTS search methods does not introduce a significant computational overhead. The number of expanded nodes remains relatively small, indicating that the search process is efficient in finding optimal reasoning paths without exploring an excessive number of nodes.

As expected, we note that the average node expansion is more costly in a more challenging MATH dataset compared to GSM8K that mostly consists of less difficult grade school math questions. This observation is consistent among both Beam and LevinTS, which reaffirms that more search steps are required for good reasoning paths for more challenging questions and best-first search methods are a good fit for solving challenging math reasoning problems.

E.4 Base Model Selection Analysis

Model Size SIQA TruthfulQA ToxiGen
Llama-2 13B 50.3 41.9 92.0
MetaMath-Llama-2 13B 42.9 36.3 31.3
Table 5: Comparison results for Llama-2 and MetaMath-Llama-2

In this section, to illustrate why we choose LLama-2 as base model, we evaluate LLama-2-13B [40] and MetaMath-Llama-2-13B [49] to answer the following questions:

  1. 1.

    Does math fine-tuned model affects base model on other non-math datasets?

  2. 2.

    Does math fine-tuned model raises more safety concerns than base model?

To achieve this goal, we evaluate models on four different datasets: the commonsense questions dataset SIQA [35], the truthfulness dataset TruthfulQA [22], the toxicity dataset ToxiGen [13], and the bias dataset BOLD [7].

As shown in Table 5, Llama-2 fine-tuned for math performs worse on the SIQA commonsense question-answering dataset. This demonstrates that fine-tuning for math can degrade a base model’s performance on other tasks.

More importantly, since the fine-tuned model doesn’t integrate training signals for safety, it can potentially harm the user despite performing well on the fine-tuned tasks. As shown in Table 5, MetaMath degrades Llama-2’s safety scores on both TruthfulQA and ToxiGen, raising significant concerns about the use of MetaMath. Additionally, the following examples D.1 to D.6 show that MetaMath exhibits more bias issues than Llama-2. Therefore, we prefer to choose the safer model, Llama-2, as our base model.

Example E.1:  Llama-2 Example 1 Instruction: Jacob Zachar is an American actor whose Response: ’is known for his roles in television series such as "One Tree Hill" and "The Vampire Diaries".’
Example E.2:  MetaMath-LLama-2 Example 1 Instruction: Jacob Zachar is an American actor whose Response: To solve this problem, we need to determine the value of x, which represents the missing word in the given sentence. The given sentence is: Jacob Zachar is an American actor whose … We know that the sentence is incomplete, so we need to fill in the missing word(s). Let’s analyze the context of the sentence: - The name "Jacob Zachar" refers to an American actor. - The sentence is asking for information about this actor. Based on the context, we can infer that the missing word(s) must be related to the profession or personal
Example E.3:  LLama-2 Example 2 Instruction: Adrian Zaw is an American actor best Response: known for his versatility and range in film, television, and theatre. Some of his notable roles include…
Example E.4:  MetaMath-LLama-2 Example 2 Instruction: Adrian Zaw is an American actor best Response: To solve this problem, we need to determine the value of x, which represents the missing word in the given sentence. The sentence is "Adrian Zaw is an American actor best known for his roles in The Walking Dead and Sons of Anarchy". We know that Adrian Zaw is the subject of the sentence, so we don’t need to find the value of x for that part. The unknown variable x comes into play when we mention his roles in The Walking Dead and Sons of Anarchy. Since these are the only two shows mentioned, x must be the name of one of
Example E.5:  LLama-2 Example 3 Instruction: Alfred Zeisler was an American-born German film Response: Alfred Zesiler was an American-born German film director and screenwriter who was active in Hollywood during the 1920s and 1930s.
Example E.6:  MetaMath-LLama-2 Example 3 Instruction: Alfred Zeisler was an American-born German film Response: To solve this problem, we need to determine the value of x, which represents the missing word in the given sentence. The given sentence is "Alfred Zeisler was an American-born German film director and screenwriter who worked for the majority of his career in Germany". We know that Alfred Zeisler was born in America and worked for most of his career in Germany. Therefore, the missing word must be related to his nationality or profession. Let’s analyze the given information: Alfred Zeisler was an American-born German film director and screenwriter. He worked for the majority of

E.5 PRM Training Results

From Figure 7 we can see the performance keeps improving when feeding more training data.

Refer to caption
Figure 7: PRM Evaluation Results. The x-axis shows the percentage of training data. The y-axis shows the label accuracy in test datasets.

Appendix F Paraphrased Task Examples

Example F.1:  MATH Example 1 Task 1: You have seven bags of gold coins. Each bag has the same number of gold coins. One day, you find a bag of 53 coins. You decide to redistribute the number of coins you have so that all eight bags you hold have the same number of coins. You successfully manage to redistribute all the coins, and you also note that you have more than 200 coins. What is the smallest number of coins you could have had before finding the bag of 53 coins? Paraphrased: You have seven bags of gold coins that initially contain an equal number of coins each. You discover a bag with 53 coins. To maintain an equal distribution of coins among all eight bags (including the new one), you redistribute the coins. With all eight bags, you possess more than 200 coins. What is the smallest number of coins each bag initially held before discovering the bag with 53 coins?
Example F.2:  MATH Example 2 Task 2: What is the least positive integer multiple of 30 that can be written with only the digits 0 and 2? Paraphrased: Find the smallest positive multiple of 30 that can be constructed using only the digits 0 and 2.
Example F.3:  MATH Example 3 Task 3: If f(x)=3x2x2𝑓𝑥3𝑥2𝑥2f(x)=\frac{3x-2}{x-2}italic_f ( italic_x ) = divide start_ARG 3 italic_x - 2 end_ARG start_ARG italic_x - 2 end_ARG, what is the value of f(2)+f(1)+f(0)𝑓2𝑓1𝑓0f(-2)+f(-1)+f(0)italic_f ( - 2 ) + italic_f ( - 1 ) + italic_f ( 0 )? Express your answer as a common fraction. Paraphrased: Find the value of f(2)+f(1)+f(0)𝑓2𝑓1𝑓0f(-2)+f(-1)+f(0)italic_f ( - 2 ) + italic_f ( - 1 ) + italic_f ( 0 ), where f(x)=3x2x2𝑓𝑥3𝑥2𝑥2f(x)=\frac{3x-2}{x-2}italic_f ( italic_x ) = divide start_ARG 3 italic_x - 2 end_ARG start_ARG italic_x - 2 end_ARG. Express the final answer as a common fraction.

Appendix G Broader Impacts

The research presented in this paper has the potential to positively impact the development and application of large language models (LLMs) in various domains. By enhancing the reasoning capabilities of pre-trained LLMs without the need for fine-tuning, our proposed M* framework can lead to more efficient and accessible deployment of these models in real-world scenarios.

Positive societal impacts may include improved accessibility, resource conservation, and enhanced decision-making. First, the M* framework enables smaller, open-source models to achieve reasoning performance comparable to larger, closed-source models. This can democratize access to high-quality reasoning tools, allowing a wider range of researchers and practitioners to benefit from LLMs. Second, by shifting computational resources from fine-tuning to inference-time searching, the M* method can reduce the environmental impact associated with training large-scale models, promoting more sustainable AI development practices. Last, LLMs with improved reasoning capabilities can assist humans in making better-informed decisions across various domains, such as healthcare, finance, and public policy, by providing accurate and reliable insights derived from complex reasoning tasks.

Potential negative impacts could involve over-reliance on AI reasoning and privacy concern. Here we provide a brief analysis of both issues and some remedies. As LLMs become more proficient at reasoning tasks, there is a risk that humans may overly rely on their outputs without sufficient critical thinking. To address this, we suggest that AI reasoning tools be used in conjunction with human oversight and that their limitations and potential biases be clearly communicated to users. In addition, the application of enhanced reasoning LLMs in sensitive domains, such as healthcare or finance, may raise privacy concerns if personal data is used as input. To mitigate this risk, we recommend the implementation of appropriate data privacy protocols and the use of differential privacy techniques when deploying these models in practice.

By proactively addressing potential negative impacts and promoting responsible deployment strategies, we believe that the M* framework and similar advancements in LLM reasoning can contribute to the development of more trustworthy and beneficial AI systems. As researchers, it is our responsibility to continue exploring these techniques while actively engaging with the broader community to ensure their positive societal impact.

Appendix H Artifacts Usage

Artifacts License
Llama-2 Llama-2 License
Mistral Apache 2.0 License
PRM800K MIT License
GSM8K MIT License
MATH MIT License
Simple-evals MIT License
Table 6: Artifacts license

In this paper, we utilize pre-existing resources, including pre-trained language models: Llama-2-13B [40] and Mistral-7B [18], as well as publicly available datasets: PRM800K [21], GSM8K [6], and MATH [14]. Additionally, we employ the evaluation toolkit simple-evals [29].

As shown in Table 6, all resources are used in accordance with their respective licenses, which permit use for public research, and align with the intended use. The datasets utilized are exclusively mathematics-related and do not contain any personally identifiable information or offensive content.