
Published as a conference paper at ICLR 2024

QUERY-DEPENDENT PROMPT EVALUATION AND OPTIMIZATION WITH OFFLINE INVERSE RL

Hao Sun∗, Alihan Hüyük, Mihaela van der Schaar
DAMTP, University of Cambridge
∗hs789@cam.ac.uk. Code is available at: https://github.com/vanderschaarlab/Prompt-OIRL

ABSTRACT

In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the golden answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural language prompting space proves to be resource-intensive. To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exist as by-products when diverse prompts are benchmarked on openly accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pair without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.

Figure 1: A motivating example. (Left, Right) No single prompt works for all queries; the optimal prompt is query-dependent. Yet the search for such prompts is hindered by the Challenges 1-2 we identify. Our method optimizes prompts at inference time on a query-dependent level, effectively and cost-efficiently.
1 INTRODUCTION
Recent advances in Large Language Models (LLMs), such as ChatGPT (Ouyang et al., 2022), un-
derscore the value of aligning with human preferences (Ouyang et al., 2022; OpenAI, 2023). Using
an offline alignment dataset generated by annotators, the harmlessness, helpfulness, and ability to
follow instructions of LLMs can be largely enhanced (Bai et al., 2022) through the technique of
RLHF (Christiano et al., 2017). That said, even for state-of-the-art general-purpose LLMs such as GPT-4, solving intricate problems such as arithmetic reasoning remains challenging, and further unlocking LLMs' full potential on these complex tasks is still an open problem.
Query-Dependent Zero-Shot Prompting for Arithmetic Reasoning: Opportunities and Challenges. Among the many attempts, prompting — a natural language prefix or instruction that explains how to complete the task — stands out as a lightweight, promising solution for eliciting the capabilities of LLMs without tuning model parameters (Wei et al., 2022). While advances in zero-shot prompting strategies (Kojima et al., 2022) highlight the potential of finding effective query-independent solutions, their reliance on manual crafting efforts and the vast search space over natural language intensify the difficulty of discovering effective prompts (Deng et al., 2022).
This study aims to augment the arithmetic reasoning capacities of LLMs, a capability considered crucial (OpenCompass, 2023; Liu et al., 2023a). While there is a wealth of both expert-devised and machine-generated prompts in this domain (Kojima et al., 2022; Zhou et al., 2022b; Yang et al., 2023a), our endeavor is to advance further via prompt optimization. Our approach begins with the articulation of an adjusted objective, subsequently highlighting two inherent challenges.
Adjusted Objective: Query-Dependent Prompt Optimization. In the literature, research on zero-shot prompting has mainly focused on finding prompts that work better at a distributional level rather than an instance level. For instance, while multi-agent debate (Liang et al., 2023; Du et al., 2023) can improve reasoning in general, there are cases where using no prompt at all is better than using generally effective prompts. Figure 1 showcases a motivating example. Motivated by the fact that no prompt is perfect for all queries, in this work we set an adjusted objective of query-dependent prompt optimization, rather than distributional-level prompt optimization.
Challenge 1: Inference-Time Evaluation is Hard. A primary challenge arises in evaluating the effectiveness of prompts during inference, especially when the true answer to a query is not known a priori. For arithmetic reasoning, enumerating many potentially effective prompts and querying language models with those prompted queries only yields a batch of answers; determining which response is correct in the absence of a ground-truth label requires extra, non-trivial effort. Additionally, the computational cost associated with this prompt enumeration process is substantial.
Challenge 2: Online Prompt Evaluation and Optimization is Expensive. On the other hand, searching for potentially effective prompts heavily relies on expensive online evaluation. In the literature, evaluating the effectiveness of a proposed prompt requires assessing its performance on multiple datasets and LLMs to show its distributional superiority over the others. In the arithmetic reasoning task, using the GPT-3.5-turbo API (OpenAI, 2023) to evaluate a single prompt on a medium-sized dataset with 10k query-answer pairs costs roughly $1, yet learning from trial and error often requires millions of interactions even for tasks with only a few actions to choose from (Schulman et al., 2017). The vast action space of natural language further increases the prohibitive cost of such interactive search (Zhang et al., 2022a; Deng et al., 2022).
Solution: Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL. Having witnessed the significant advancements achieved by expert-crafted and algorithm-discovered prompting in arithmetic reasoning (Zhou et al., 2022b; Pryzant et al., 2023; Kojima et al., 2022), our work seeks to combine existing human prompting knowledge systematically, effectively, and cost-efficiently. Our contributions can be summarized as follows:

• Formally, we identify the overlooked query-dependent prompt optimization objective and its challenges, and introduce Offline Inverse Reinforcement Learning as a systematic approach to integrate rich human expertise.
• Methodologically, we introduce Prompt-OIRL, which first performs query-dependent offline prompt evaluation with a learned reward model and then performs offline prompt optimization to improve prompting performance.
• Practically, we highlight that offline datasets generated as by-products when existing prompting strategies are benchmarked on open-access tasks can be directly adopted for Prompt-OIRL learning.
• Empirically, we validate the efficacy and efficiency of Prompt-OIRL in offline prompt evaluation and optimization through experiments with 3 distinct LLMs, namely GPT-3.5-turbo, LLaMA-2-7B-Chat, and TigerBot-13B-Chat, across 3 arithmetic datasets: GSM8K (Cobbe et al., 2021a), MAWPS (Roy & Roth, 2016), and SVAMP (Patel et al., 2021).

2 THE QUERY-DEPENDENT PROMPTING PROBLEM

We first introduce preliminaries to provide a formal definition of the query-dependent prompting problem. We then introduce our solution in the next section.

Queries & Answers. We consider the task of answering queries x ∈ X = V^∞ expressed in natural language with vocabulary V. We assume that each query x has an expected answer y^* ∈ Y. Such an answer can be a real-valued number or composed in natural language. For example, x can be an arithmetic question and y^* the mathematically correct answer to the question, or x can be the review of a movie and y^* a sentiment label for the movie. We denote a dataset containing queries and answers by D.


Language Model. Such tasks can be performed using a language model ℓ : X → Y by feeding queries x to the language model and getting answers ŷ = ℓ(x). We assume that the quality of these answers can be evaluated using a metric r : Y × Y → R. For example, in tasks where golden labels are available, this metric can be r(y^*, ŷ) = 1{ŷ = y^*}. Generally, a higher value of r(y^*, ŷ) indicates that ŷ matches the expected answer y^* better.
Prompting. It is well known that the performance of a language model in answering queries can be boosted through appropriate prompting — a natural language prefix or instruction that explains how to complete the task. We consider prompts π : X → X that map an original query x to a modified prompted query π(x). (For clarity, we use prompt to denote the prefix or instruction, and prompted query to denote the final input to the language model after prompting.) These prompted queries are then fed into the language model to get better answers ŷ = ℓ(π(x)).
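To make the notation concrete, the snippet below is a minimal sketch (not the paper's code) of a prompt as a function mapping a query to a prompted query, together with the exact-match metric defined above; the suffix placement of the prompt text is an assumption for illustration.

```python
def cot_prompt(x: str) -> str:
    # A zero-shot CoT prompt (Table 2, row 2) applied as pi(x); placement is illustrative.
    return f"{x}\nLet's think step by step:"

def exact_match_reward(y_star: str, y_hat: str) -> float:
    # The metric r(y*, y_hat) = 1{y_hat = y*} used when golden labels are available.
    return 1.0 if y_hat.strip() == y_star.strip() else 0.0

query = "A farm has 12 cows and buys 7 more. How many cows are there now?"
prompted_query = cot_prompt(query)  # fed to the language model: y_hat = l(pi(x))
```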
Objective. Given a dataset D = {(x^(i), y^*(i))}_{i∈[N]} of queries and their expected answers, the objective of query-agnostic zero-shot prompt optimization usually studied in the literature is to find the distributionally optimal prompt π̄^* that maximizes the expected quality of answers w.r.t. the metric r:

$$\bar{\pi}^* = \arg\max_{\pi} \; \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\!\left(y^{*(i)}, \ell(\pi(x^{(i)}))\right) \right]. \tag{1}$$

Instead, in this work we meet the Adjusted Objective by using a query-dependent approach:

$$\pi^* = \arg\max_{\pi} \; r\!\left(y^{*(i)}, \ell(\pi(x^{(i)}))\right). \tag{2}$$

In a nutshell, Equation (1) seeks a single prompt that achieves good performance on the dataset, whereas Equation (2) seeks different prompts for different queries in the dataset. Clearly, π^* should be at least as good as π̄^* in the sense that

$$\mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\!\left(y^{*(i)}, \ell(\pi^*(x^{(i)}))\right) \right] \;\ge\; \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\!\left(y^{*(i)}, \ell(\bar{\pi}^*(x^{(i)}))\right) \right]. \tag{3}$$
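As a sanity check on this inequality, the following sketch (illustrative only, with randomly generated binary rewards standing in for logged correctness) compares the two objectives on a toy reward table.

```python
import numpy as np

# Hypothetical logged rewards r[i, k] = r(y*_i, l(pi_k(x_i))) for N queries and K prompts.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=(100, 6)).astype(float)

# Eq. (1): pick the single prompt with the best average performance on the dataset.
distributional_best = rewards.mean(axis=0).argmax()
score_eq1 = rewards[:, distributional_best].mean()

# Eq. (2): pick the best prompt per query (here with oracle access to the rewards).
score_eq2 = rewards.max(axis=1).mean()

assert score_eq2 >= score_eq1  # the inequality stated in Eq. (3)
print(f"distributional: {score_eq1:.2f}, query-dependent: {score_eq2:.2f}")
```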

3 PROMPTING WITH OFFLINE INVERSE RL


Having established the learning objective in Equation (2), applying RL to the search for an optimal prompt π may seem a natural approach. Yet, as depicted in Figure 2, such a method confronts two distinct challenges: Challenge 1 pertains to the inability to compute rewards during inference, while Challenge 2 concerns the necessity for extensive interactions with LLMs. Different from conventional RL tasks on which algorithms are developed and optimized, the vast action space of the vocabulary, the sparse trajectory-level feedback that is only available when the final answer is generated, and the significant overheads associated with LLM interactions all complicate the pursuit of an ideal prompt. Furthermore, the absence of definitive labels during the inference phase renders the prompt evaluation phase infeasible. With this context, we now present Prompt-OIRL as our proposed solution.

Figure 2: The Adjusted Objective and Challenges in prompt optimization. We use blue to denote fixed functions, pink for datasets, and green for functions to be optimized. Solid lines show the flow of outputs, and dashed lines denote the learning process.

Prompt-OIRL: Prompting as Offline Inverse RL. In pursuit of the new objective and to address the challenges, we draw inspiration from the successes achieved by RLHF. We highlight that the prompting problem is essentially one of aligning prompts with LLM preferences, a departure from RLHF's goal of aligning LLMs with human preferences. For a more in-depth exploration of this perspective, we direct readers to Appendix A.

Aiming for an efficient and economical prompting strategy, we sidestep the complexities of extensive LLM interactions by leveraging readily available datasets and propose offline IRL as a viable learning solution. We outline its three fundamental steps: Step 1, construction of the offline dataset; Step 2, offline query-dependent prompt evaluation through reward modeling; and Step 3, offline query-dependent prompt optimization using the learned reward model.


Step 1: Existence of Offline Prompt Demonstrations. We start by emphasizing the presence and significance of prompt demonstrations used in benchmarking existing prompts. In the realm of research dedicated to improving the arithmetic reasoning capacities of LLMs, several zero-shot prompts have been proposed. Notable examples include CoT (Kojima et al., 2022), APE (Zhou et al., 2022b), and the concurrent work OPRO (Yang et al., 2023a), among others. It is important to highlight that none of these prompts uniformly enhances the LLMs' arithmetic reasoning capabilities. Their efficacy is assessed at a distributional level, as per Equation (1), and different language models might exhibit varied preferences for prompts (Yang et al., 2023a). To see that the best prompt depends on the query, readers may hark back to the motivating example in Figure 1 or find more case studies in Appendix G.1.

Figure 3: The offline demonstration dataset is generated as a by-product of evaluating existing (query-agnostic) prompts.

To benchmark the effectiveness of the proposed prompts, prior researchers have utilized standardized, openly accessible arithmetic reasoning datasets and reported the overall success rate of obtaining correct answers. Denoting different existing prompts with superscripts, e.g., π^(1)(x) = x for no prompting, π^(2) for CoT prompting, π^(3) for APE prompting, and so on — when K prompts have been evaluated, the following demonstrations are constructed as by-products:

$$\mathcal{D}_{\mathrm{dem}} = \left\{ x^{(i)},\ \pi^{(k)},\ r^{(i,k)} = r\!\left(y^{*(i)}, \ell(\pi^{(k)}(x^{(i)}))\right) \right\}_{i \in [N],\, k \in [K]}. \tag{4}$$

This process is visualized in Figure 3, which depicts the generation of prompt-alignment demonstration datasets from evaluating existing prompts. Notably, the rewards in D_dem are influenced by ℓ.
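The following sketch illustrates how such a demonstration set could be assembled as a by-product of benchmarking; `call_llm` and `extract_answer` are hypothetical placeholders for an LLM API call and an answer parser, and the prompt list is abbreviated from Table 2.

```python
PROMPTS = [
    "",                            # pi^(1): no prompting
    "Let's think step by step:",   # pi^(2): zero-shot CoT
    "The answer is:",              # pi^(3): direct prompting
]

def build_demonstrations(dataset, call_llm, extract_answer):
    """Construct D_dem of Eq. (4) from (query, golden answer) pairs and K prompts."""
    demos = []
    for i, (x, y_star) in enumerate(dataset):
        for k, prompt in enumerate(PROMPTS):
            y_hat = extract_answer(call_llm(f"{x}\n{prompt}"))
            r = float(y_hat == y_star)  # binary reward r^(i,k)
            demos.append({"query_id": i, "query": x, "prompt_id": k, "reward": r})
    return demos
```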

Step 2: Offline Reward Modeling: Inverse RL without an Environment. Upon initial inspection, it might seem that reward modeling is unnecessary given the reward metric, i.e., r(y^*, ŷ) = 1{ŷ = y^*}. However, a more thorough examination of this metric reveals two concerns that limit its broader utilization: first, the reward metric is a function of the language model ℓ, making every single call of such a metric costly; second, the computation of the reward necessitates access to the ground-truth answer y^*, which is not available at inference (deployment) time. These are essentially the challenges associated with the input dependency of this reward metric, as illustrated in Figure 2. To address those issues, we introduce a parameterized proxy reward model that is a function of the query x and the prompt π. We denote the proxy reward by Υ_θ(x, π(x)) with parameters θ. Notably, this proxy reward omits the language model ℓ and the ground-truth labels from its calculations. By design, its output should align with the true reward r. Therefore, we establish a supervised learning objective to minimize the discrepancies between the proxy reward model and the true reward.

Figure 4: Prompt-OIRL addresses the specified Objective and challenges. It first learns a proxy reward model from the offline demonstration dataset created in the previous step. Such a learned reward model can be applied at inference time to evaluate prompts in a query-dependent manner without access to the language model; hence optimizing prompts w.r.t. this proxy reward model addresses all the identified issues.

In the case of arithmetic reasoning tasks, the reward signal is binary, hence the proxy reward model can be trained as a classifier, predicting whether the prompt leads to a correct answer when fed to the language model. Specifically, we consider the cross-entropy loss given the demonstration data collected in the previous step:

$$\mathcal{L}_{\mathrm{CE}}(\theta; \mathcal{D}^{\ell}_{\mathrm{dem}}) = -\mathbb{E}_{i \in [N],\, k \in [K]} \left[ r^{(i,k)} \log \sigma\!\left(\Upsilon_\theta^{(i,k)}\right) + \left(1 - r^{(i,k)}\right) \log\!\left(1 - \sigma\!\left(\Upsilon_\theta^{(i,k)}\right)\right) \right], \tag{5}$$

where we use Υ_θ^(i,k) to denote Υ_θ(x^(i), π^(k)(x^(i))) for conciseness, and σ is the sigmoid function. Different from the online reward signal provided by r(y^*, ℓ(x')), which requires expensive interactions with black-box LLMs, the proxy reward model Υ_θ(x, π(x)) can provide offline, white-box feedback on prompts. Therefore, it is more accessible in guiding the search for a better prompt. Figure 4 illustrates how such a reward model can be learned and leveraged to provide feedback for prompt optimization. To achieve computational efficiency and good performance, we use the embeddings of x and π generated by existing LLMs as inputs to the proxy reward model. Empirically, we find that gradient boosting methods (Chen et al., 2015; Ke et al., 2017) are significantly better than neural networks for reward modeling. In our experiment section, we demonstrate the effectiveness of Prompt-OIRL with a minimalist approach by directly applying XGBoost models, leaving further investigation of model selection to future work.
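A minimal reward-modeling sketch under our assumptions about the feature construction (concatenated query and prompt embeddings) and hyperparameters is given below; it is not the released implementation, only an illustration of training an XGBoost classifier as the proxy reward Υ_θ.

```python
import numpy as np
from xgboost import XGBClassifier

def make_features(query_emb: np.ndarray, prompt_emb: np.ndarray) -> np.ndarray:
    # Assumed feature construction: concatenate the two text embeddings.
    return np.concatenate([query_emb, prompt_emb])

def train_reward_model(features: np.ndarray, rewards: np.ndarray) -> XGBClassifier:
    # rewards holds the binary labels r^(i,k) from the offline demonstration set;
    # the binary-logistic objective matches the cross-entropy loss in Eq. (5).
    model = XGBClassifier(objective="binary:logistic", n_estimators=200, max_depth=6)
    model.fit(features, rewards)
    return model

def proxy_reward(model: XGBClassifier, query_emb, prompt_emb) -> float:
    x = make_features(query_emb, prompt_emb).reshape(1, -1)
    return float(model.predict_proba(x)[0, 1])  # sigma(Upsilon_theta(x, pi(x)))
```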
Step 3: Offline Prompt Optimization with the Learned Reward Model. Given that the learned reward model is an effective proxy of the performance evaluator for query-prompt pairs, we can solve the prompt optimization problem of Equation (2) with an alternative offline objective that is feasible to execute at inference time — without requiring the language model ℓ or the golden label y^*, which is inaccessible at inference time:

$$\pi^* = \arg\max_{\pi} \Upsilon_\theta(x, \pi(x)) \approx \arg\max_{\pi} r\!\left(y^*, \ell(\pi(x))\right). \tag{6}$$

In general, any policy optimization technique can potentially be applied to solve Equation (6), including previous approaches used in online prompt optimization such as RL (Deng et al., 2022), beam search (Pryzant et al., 2023), and evolution strategies (Zhou et al., 2022b).

In this work, we highlight the effectiveness and cost-efficiency of optimizing prompts with regard to such a learned reward model as an accessible offline proxy of the true reward. We choose a minimalist approach that isolates the source of gains by using a best-of-N strategy: we generate a batch of candidate prompts with a general-purpose language model and select the best one according to the learned reward model. We leave the investigation of other approaches, which could be more tailored and better-performing yet computationally more expensive, as a promising direction for future exploration.
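The best-of-N selection step can be sketched as follows; `embed` is a hypothetical embedding call and `score_fn` stands for the learned proxy reward model (e.g., the XGBoost classifier above), so no LLM call is needed to pick the prompt.

```python
def select_prompt(query, candidate_prompts, score_fn, embed):
    # Rank candidate prompts by the learned proxy reward Upsilon_theta (Eq. 6)
    # and return the best one; only the winner is then sent to the LLM.
    q_emb = embed(query)
    return max(candidate_prompts, key=lambda p: score_fn(q_emb, embed(p)))
```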

4 RELATED WORK
We discuss the most related literature on learning-based prompt optimization and reinforcement learning from human knowledge in this section. Extended discussions are deferred to Appendix B.
Table 1: Prompt-OIRL mainly differentiates itself from the existing literature on prompt optimization by (1) considering the Adjusted Objective and optimizing query-dependent prompts; (2) being able to perform offline prompt evaluation to address Challenge 1; (3) optimizing prompts in the offline setting without access to the LLMs to address Challenge 2; (4) using the natural language prompt space; and (5) using offline inverse reinforcement learning to solve the problem. More unique properties of Prompt-OIRL are discussed in Table 5.

Method             | (1) Query-Dependent Prompt | (2) Offline Prompt Evaluation | (3) Offline Prompt Optimization | (4) Prompt Space* | (5) Solver Used
Soft-Prompt        | ✓ | ✗ | ✗ | Embeddings  | Gradient-Guided Search
APO                | ✗ | ✗ | ✗ | X = V^∞     | Beam Search
APE                | ✗ | ✗ | ✗ | X = V^∞     | Evolution Strategy
TEMPERA            | ✓ | ✗ | ✗ | Edit        | RL
RLPrompt           | ✗ | ✗ | ✗ | {V^2, V^5}  | RL
Prompt-OIRL (ours) | ✓ | ✓ | ✓ | X = V^∞     | Offline Inverse-RL

* Embeddings: the language of LMs; Edit: operations such as swap, delete, etc.; V is the vocabulary, and the superscript denotes the length of prompts. V^∞ denotes the natural language space — the most general, interpretable format.

Learning-Based Prompt Optimization. Automatic Prompt Optimization (APO) (Pryzant et al., 2023) leverages training data and initial prompts to perform a "gradient descent" process operated in the natural language space and guided by large language models; a generate-and-select process is then applied to update the prompting mechanism. Automatic Prompt Engineer (APE) (Zhou et al., 2022b) asks LLMs to generate a set of instruction candidates based on demonstrations, then evaluates the performance of those instructions by computing corresponding scores, and finally improves the best candidates by proposing semantically similar instruction variants for specific tasks. There are also works that explore the usage of auxiliary models or differentiable representations of the prompt (Shin et al., 2020; Li & Liang, 2021; Qin & Eisner, 2021; Vu et al., 2021; Gu et al., 2021). Yet those methods require access to the embeddings of the LMs for soft prompt optimization, hence falling short in reusability across LMs. In the work of Hao et al. (2022), reinforcement learning is used to optimize a prompt policy in text-to-image generation tasks in pursuit of more aesthetically pleasing images that preserve the original user intentions. Zhang et al. (2022a) propose TEMPERA to perform test-time prompt editing using reinforcement learning; TEMPERA designs the action space to be editing operations, such as swapping, deleting, adding, or changing verbalizers. RLPrompt (Deng et al., 2022) uses reinforcement learning to optimize LLMs to specialize as prompting agents, yet the prompts it generates are task-agnostic and limited to combinations of words that cannot easily be transferred into insights for human prompt engineers.
Reinforcement Learning with Human Knowledge and Demonstrations. Human knowledge and demonstrations have pushed forward the progress of machine learning in many domains, including natural language processing (Rajpurkar et al., 2018; Ouyang et al., 2022) and many others (Deng et al., 2009; Redmon et al., 2016; Atkeson & Schaal, 1997; Zhang et al., 2018; Mandlekar et al., 2020). By injecting prior knowledge of the task from humans, reinforcement learning algorithms are able to achieve human-level performance or even outperform humans (Silver et al., 2016; Vinyals et al., 2019). Our solver of the prompting problem is related to inverse RL (Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016), which aims at inferring rewards and learning policies from a batch of behavioral demonstrations; this is unique compared with the existing literature on prompt optimization. The main distinctions between this study and those related works are summarized and highlighted in Table 1. For an extended discussion of further distinctions, please refer to Table 5 in Appendix B.

5 EXPERIMENT
In this section, we present empirical evidence illustrating the efficacy of Prompt-OIRL in addressing the previously highlighted challenges and meeting the revised objective. We first outline the general experimental setups. Subsequently, we demonstrate how the Adjusted Objective is fulfilled and how the Challenges are addressed in Sections 5.1 through 5.3.

Tasks. We use the tasks of MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021a), and SVAMP (Patel et al., 2021) in the arithmetic reasoning domain because they are widely studied in zero-shot prompting, and hence rich expert-crafted and machine-generated prompting knowledge is available. Leveraging such knowledge facilitates our offline data collection procedure.
Table 2: Prompts used in offline training dataset collection.
1. "The answer is:" (direct prompting)
2. "Let's think step by step:" (zero-shot CoT; Kojima et al., 2022)
3. "Let's work this out in a step by step way to be sure we have the right answer:" (APE discovered; Zhou et al., 2022b)
4. "First, decompose the question into several sub-questions that need to be solved, and then solve each question step by step:" (Least-to-most; Zhou et al., 2022a)
5. "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, and then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave." (Tree-of-thought; Hulbert, 2023)
6. "3 experts are discussing the question, trying to solve it step by step, and make sure the result is correct:" (multi-agent debate; Liang et al., 2023)

Prompts. Table 2 shows our training prompts: 6 existing zero-shot prompts in arithmetic reasoning
tasks. We construct our offline demonstration dataset using interaction logs between LLMs and
queries prompted by those prompts. Additionally, we create held-out test prompts that differ from
those training prompts, details of test prompts and their generation are provided in Appendix C.3.

LLMs. To demonstrate the general applicability of Prompt-OIRL, we experiment with datasets generated with LLMs of different abilities, scaling from the GPT-3.5-turbo model (Ouyang et al., 2022) to TigerBot-13B-chat (TigerResearch, 2023) and LLaMA2-7B-chat (Touvron et al., 2023). All created offline demonstration datasets, including the query-prompt pairs, the prompted answers from different LLMs, and the correctness of those answers, will be released as a publicly accessible dataset. Detailed discussions of the usage of embeddings, data processing, and training details can be found in Appendix C.

5.1 IMPROVING ARITHMETIC REASONING BY FULFILLING THE ADJUSTED OBJECTIVE

Experiment Setup. In this section, we demonstrate the superiority of query dependency in prompting and showcase that Prompt-OIRL effectively fulfills the Adjusted Objective. In the experiment, we use BoTr Eqn.(1) to denote the setting of selecting the best of the training-time prompts according to objective Eqn.(1); in this case, the prompt chosen at test or inference time is the one that achieves the overall best training performance. We then consider the setting BoTr Eqn.(2), where Eqn.(2) becomes the objective; in this setting, the proxy reward model is used to perform query-dependent best-of-N selection among the training prompts. Alternatively, we compare against an LLM-confidence-based baseline for approaching Eqn.(2), which asks LLMs to report how confident they are in different prompted answers; LLM Confidence then selects the most confident answer among those corresponding to the training prompts, according to the confidence scores provided by the LLMs. Finally, our method Prompt-OIRL selects the best prompt from both training-time prompts and held-out prompts according to the learned reward model.

Figure 5: Performance of Prompt-OIRL under two typical settings. In both the scarce-demonstration setting (i.e., 1 training prompt) and the rich-demonstration setting (i.e., 6 training prompts), Prompt-OIRL achieves better performance.
Results. We present results for two typical settings in Figure 5 under different demonstration availability: (1) the scarce demonstration setting (left panel) with K = 1, where only 1 prompt is available for reward model learning; and (2) the rich demonstration setting (right panel) with K = 5, where demonstrations from more prompts are available. We note that when K = 1, there is no difference between using Eqn.(1) and Eqn.(2) in the best-of-training strategy.
From the results, we can conclude: (1) in the scarce demonstration setting, where demonstration data is generated with only 1 prompt, Prompt-OIRL significantly (+24.3%) outperforms the best-of-training strategy; (2) in the rich demonstration setting, where multiple prompts are available, Prompt-OIRL is significantly better (+8.8%) than using the query-agnostic objective of Eqn.(1);
(3) in the rich demonstration setting, Prompt-OIRL is slightly better (+1.8%) than selecting the best training prompt according to the learned reward model — this is not surprising, as the prompts we used for training have all been verified to be helpful themselves; and (4) in both settings, Prompt-OIRL achieves significantly better results (by +24.3% and +14.7%, respectively) than the LLM confidence score-based baseline, without further interactions with the LLMs.
Additionally, we vary the availability of prompts used for training and provide a performance comparison averaged over all tasks and LLMs we benchmarked in Figure 6. We normalize the performance using an oracle reward model, i.e., we compare against the success rate attained when a successful prompt is always available. Such a normalized score better contextualizes the performance improvement realized by the instantiated reward model in Prompt-OIRL. In general, using more prompts in training leads to better results. The results under different individual settings are provided in Appendix F.1.

Figure 6: The averaged performance of different methods under different prompt demonstration availability in arithmetic reasoning. Prompt-OIRL achieves significantly better performance even when only 1 prompt is available for training.

Take-Aways: Prompt-OIRL fulfills the Adjusted Objective and achieves superior performance over the baselines under various levels of demonstration data availability. Prompt-OIRL achieves a remarkably higher success rate in obtaining correct answers even when demonstration data and prompting knowledge are scarce (i.e., with demonstration data generated with only 1 prompt), and its performance can be further improved as data availability increases.

5.2 ADDRESSING CHALLENGE 1 THROUGH OFFLINE PROMPT EVALUATION

Experiment Setup. In this section, we study the effectiveness of the learned reward model by verifying its generalization ability from two perspectives: (1) seen-prompt + unseen-query: we evaluate the learned reward model with training-time prompts on held-out queries; and (2) unseen-prompt + unseen-query: we evaluate the learned reward model with held-out prompts on held-out queries. To enable performance evaluation on held-out prompts, we collect interaction logs from each language model for 10 held-out prompts. Details of those held-out prompts are provided in Appendix C.3.
We benchmark our approach against the language-model self-criticism (LMSC) baseline (Wang et al., 2023). In this method, LLMs are tasked with verifying the accuracy of the answer, given both the query and the prompted response. Throughout our experiments, we gauge performance using held-out test queries. In our subsequent results, the notation Ours (Q) refers to evaluation outcomes employing training-time prompts (with "Q" representing test queries), while Ours (P) refers to evaluations conducted on held-out prompts (with "P" representing test prompts); the same notation is applied to LMSC as LMSC (Q) and LMSC (P). Moreover, we compare results under varying prompt demonstration data availabilities. Specifically, we change the number of prompts, K, used in the training of the reward model. The result reported for each K is averaged over the $\binom{6}{K}$ combinations of training prompts. Accuracy and precision are used as the evaluation metrics.
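For concreteness, a small sketch of how the reported accuracy and precision could be computed from the reward model's binary predictions and the logged correctness labels is shown below (our illustration, not the evaluation code).

```python
def accuracy_precision(predicted, actual):
    # predicted[i] = 1 if the reward model predicts the prompt yields a correct answer;
    # actual[i] = 1 if the logged answer for that (query, prompt) pair was correct.
    correct = sum(p == a for p, a in zip(predicted, actual))
    true_pos = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    pred_pos = sum(p == 1 for p in predicted)
    accuracy = correct / len(actual)
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision
```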

Table 3: Effectiveness of the learned reward model, demonstrated through a comparison with the LLM-based self-critic. Accuracies and precisions in predicting whether correct answers can be obtained on held-out queries are reported as the evaluation metrics. Higher is better.

Task  | Method         | LLaMA2-7B-Chat Acc. / Prec. | TigerBot-13B-Chat Acc. / Prec. | GPT-3.5-turbo Acc. / Prec.
MAWPS | LMSC (Q)       | 0.47  / 0.433 | 0.776 / 0.175 | 0.662 / 0.664
MAWPS | Ours (Q) K = 1 | 0.784 / 0.621 | 0.952 / 0.593 | 0.96  / 0.965
MAWPS | Ours (Q) K = 5 | 0.795 / 0.632 | 0.95  / 0.593 | 0.96  / 0.965
MAWPS | LMSC (P)       | 0.457 / 0.441 | 0.414 / 0.692 | 0.661 / 0.661
MAWPS | Ours (P) K = 1 | 0.621 / 0.569 | 0.467 / 0.737 | 0.767 / 0.82
MAWPS | Ours (P) K = 5 | 0.658 / 0.595 | 0.496 / 0.795 | 0.803 / 0.825
SVAMP | LMSC (Q)       | 0.474 / 0.441 | 0.434 / 0.735 | 0.801 / 0.805
SVAMP | Ours (Q) K = 1 | 0.791 / 0.646 | 0.735 / 0.764 | 0.964 / 0.975
SVAMP | Ours (Q) K = 5 | 0.8   / 0.65  | 0.738 / 0.764 | 0.962 / 0.97
SVAMP | LMSC (P)       | 0.464 / 0.447 | 0.421 / 0.739 | 0.795 / 0.796
SVAMP | Ours (P) K = 1 | 0.634 / 0.602 | 0.685 / 0.741 | 0.824 / 0.883
SVAMP | Ours (P) K = 5 | 0.655 / 0.593 | 0.707 / 0.751 | 0.861 / 0.885
GSM8K | LMSC (Q)       | 0.205 / 0.187 | 0.635 / 0.537 | 0.662 / 0.656
GSM8K | Ours (Q) K = 1 | 0.773 / 0.289 | 0.642 / 0.53  | 0.659 / 0.663
GSM8K | Ours (Q) K = 5 | 0.726 / 0.301 | 0.646 / 0.542 | 0.656 / 0.657
GSM8K | LMSC (P)       | 0.207 / 0.195 | 0.636 / 0.515 | 0.653 / 0.648
GSM8K | Ours (P) K = 1 | 0.747 / 0.269 | 0.631 / 0.516 | 0.653 / 0.658
GSM8K | Ours (P) K = 5 | 0.647 / 0.246 | 0.643 / 0.54  | 0.648 / 0.648

Results. Table 3 showcases the results across the 3 tasks employing the 3 LLMs. From the results, we can deduce: (1) on held-out queries, the learned reward model is able to accurately predict whether a certain prompt will lead to a correct answer, and this holds true for both training prompts and held-out test prompts; (2) the prediction accuracy and precision of the learned reward model are significantly better than those of the LMSC baseline, demonstrating the superiority of leveraging the learned model rather than the LM itself in evaluating prompts and prompted answers; (3) incorporating more prompts into the training dataset generally improves performance, particularly when evaluating held-out prompts. For more comprehensive results, please refer to Appendix F.2.

Take-Aways: Challenge 1 can be effectively solved by Prompt-OIRL, which learns a reward model that is able to accurately evaluate prompts in an offline manner. The learned reward model is effective in predicting whether a given prompt can solve a given query without access to the language model. Its prediction accuracy and precision are in general significantly better than those of the self-critic baseline, which requires online interactions with LLMs.

5.3 ADDRESSING CHALLENGE 2: COST-EFFICIENT PROMPT OPTIMIZATION

Experiment Setup. As previously argued, utilizing the proxy reward model for prompt optimization, as opposed to directly using language models, results in significantly reduced costs. Indeed, the expenses associated with inference time can be even more substantial (Chen et al., 2023). In this section, we emphasize the cost-efficiency advantage of Prompt-OIRL during inference-time prompt optimization.
Prompt-OIRL evaluates prompts through its offline reward model and obviates the need for extra LLM interactions during inference. Picture a scenario where, for each query at inference, there are K potential prompts to assess and select from. With Prompt-OIRL, the optimal prompt is chosen without LLM interaction, ensuring that only the chosen prompt undergoes inference with the LLM to produce a single response. Conversely, evaluating prompts via LLMs — by gauging their confidence or dissecting the nuances of various answers — demands extensive interaction prior to settling on a choice. We provide a detailed cost analysis of these methods, outlining the monetary implications of employing the GPT-3.5-turbo and TigerBot-13B-Chat APIs. For LLaMA2-7B-chat, which is operated locally on an NVIDIA A4000 GPU, we report the inference time quantified in GPU hours.

Table 4: Inference costs of different methods for a single query under different numbers of prompt choices. Generating embeddings in Prompt-OIRL costs ≈ 0.0002 USD per 100 prompts.

Model (Cost unit) | Method | K = 1   | K = 6   | K = 110
LLaMA2 (GPU hour) | LMSC   | 0.00325 | 0.01949 | 0.35723
LLaMA2 (GPU hour) | Ours   | 0.00134 | 0.00134 | 0.00134
TigerBot (USD)    | LMSC   | 0.00082 | 0.00492 | 0.09029
TigerBot (USD)    | Ours   | 0.00034 | 0.00035 | 0.00056
GPT-3.5 (USD)     | LMSC   | 0.00093 | 0.00558 | 0.10237
GPT-3.5 (USD)     | Ours   | 0.00040 | 0.00041 | 0.00062

Results. Table 4 presents the inference-time costs associated with various LLMs and prompt selection strategies. Employing Prompt-OIRL for prompt optimization is substantially more cost-efficient than methods reliant on LLMs. When calling LLMs for self-criticism, all prompts need to be processed to obtain distinct prompted answers, necessitating K interactions with the LLMs. Following this, the LLMs are queried to verify the correctness of those answers and provide their confidence scores, incurring an additional K interactions. In contrast, Prompt-OIRL utilizes the offline reward model to pinpoint the most suitable prompt and only forwards the chosen prompt to the LLM for processing.
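The interaction counts described above can be summarized with a small back-of-the-envelope sketch (our reading of the setup, not the paper's accounting code): the self-critic baseline needs roughly 2K LLM calls per query, while Prompt-OIRL needs one.

```python
def llm_calls_per_query(k_prompts: int, method: str) -> int:
    # LMSC: K calls to obtain prompted answers + K calls to verify them.
    # Prompt-OIRL: the reward model ranks prompts offline; only the winner is sent to the LLM.
    if method == "LMSC":
        return 2 * k_prompts
    if method == "Prompt-OIRL":
        return 1
    raise ValueError(f"unknown method: {method}")

assert llm_calls_per_query(6, "LMSC") == 12
assert llm_calls_per_query(6, "Prompt-OIRL") == 1
```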

Take-Aways: Challenge 2 is effectively mitigated by Prompt-OIRL, which permits offline evaluation and optimization of prompts through the learned reward model, whereas LLM-based methods rely on additional LLM interactions to evaluate prompted answers. Consequently, Prompt-OIRL offers a distinct advantage in terms of cost efficiency.

6 CONCLUSION
We propose Prompt-OIRL, a novel approach grounded in offline inverse reinforcement learning, designed to reconcile effectiveness and cost-efficiency in query-dependent prompt evaluation and optimization. The method leverages offline datasets from existing evaluations, utilizing inverse RL to craft a reward model tailored for offline, query-specific prompt evaluation. Prompt-OIRL offers several benefits: it forecasts prompt efficacy, minimizes costs, and explores the prompt space more effectively — all at a query-dependent level. We validate our approach across various LLMs and arithmetic reasoning datasets, underscoring its viability as a formidable solution for query-dependent offline prompt evaluation and optimization.


REFERENCES
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In
Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004.
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline
reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR,
2020.
Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration. In ICML, volume 97,
pp. 12–20. Citeseer, 1997.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless
assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,
2022.
Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, and Ani-
mesh Garg. Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497, 2020.
Ioana Bica, Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Learning “what-if” explana-
tions for sequential decision-making. In International Conference on Learning Representations,
2021.
Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond sub-
optimal demonstrations via inverse reinforcement learning from observations. In International
conference on machine learning, pp. 783–792. PMLR, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while
reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong
Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, et al. Xgboost: extreme gradient boosting. R
package version 0.4-2, 1(4):1–4, 2015.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario
Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021a.
Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In Interna-
tional Conference on Machine Learning, pp. 2020–2027. PMLR, 2021b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi-
erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pp. 248–255. Ieee, 2009.
Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric P Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement
learning. arXiv preprint arXiv:2205.12548, 2022.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv-
ing factuality and reasoning in language models through multiagent debate. arXiv preprint
arXiv:2305.14325, 2023.


Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian
Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In
Conference on Robot Learning, pp. 158–168. PMLR, 2022.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep
data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without
exploration. arXiv preprint arXiv:1812.02900, 2018a.
Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. arXiv preprint arXiv:1802.09477, 2018b.
Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn:
Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34:
4028–4039, 2021.
Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. New and improved embedding
model, 2022.
Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. Ppt: Pre-trained prompt tuning for few-shot
learning. arXiv preprint arXiv:2109.04332, 2021.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation.
arXiv preprint arXiv:2212.09611, 2022.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural
information processing systems, 29, 2016.
Samuel Holt, Alihan Hüyük, Zhaozhi Qian, Hao Sun, and Mihaela van der Schaar. Neural Laplace
control for continuous-time delayed systems. In International Conference on Artificial Intelli-
gence and Statistics, 2023a.
Samuel Holt, Alihan Hüyük, and Mihaela van der Schaar. Active observing continuous-time control.
In Conference on Neural Information Processing Systems, 2023b.
Dave Hulbert. Tree of knowledge: Tok aka tree of knowledge dataset for large language models llm.
https://github.com/dave1010/tree-of-thought-prompting, 2023.
Alihan Hüyük, Daniel Jarrett, Cem Tekin, and Mihaela van der Schaar. Explaining by imitating: Un-
derstanding decisions by interpretable policy learning. In International Conference on Learning
Representations, 2021.
Alihan Hüyük, Daniel Jarrett, and Mihaela van der Schaar. Inverse contextual bandits: Learning
how behavior evolves over time. In International Conference on Machine Learning, 2022.
Alihan Hüyük, William R Zame, and Mihaela van der Schaar. Inferring lexicographically-ordered
rewards from preferences. In AAAI Conference on Artificial Intelligence, 2022.
Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Inverse decision modeling: Learning
interpretable representations of behavior. In International Conference on Machine Learning, pp.
4755–4771. PMLR, 2021.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-
Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural
information processing systems, 30, 2017.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. Advances in neural information processing systems,
35:22199–22213, 2022.


Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via
bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tuto-
rial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented genera-
tion for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:
9459–9474, 2020.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv
preprint arXiv:2101.00190, 2021.
Xiaonan Li and Xipeng Qiu. Mot: Pre-thinking and recalling enable chatgpt to self-improve with
memory-of-thoughts. arXiv preprint arXiv:2305.05181, 2023.
Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large
language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520, 2023a.
Ziniu Li, Tian Xu, and Yang Yu. Policy optimization in rlhf: The impact of out-of-preference data.
arXiv preprint arXiv:2312.10584, 2023b.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng
Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-
agent debate. arXiv preprint arXiv:2305.19118, 2023.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Han-
naneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. arXiv preprint
arXiv:2110.08387, 2021.
Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models
to enhance bayesian optimization. arXiv preprint arXiv:2402.03921, 2024.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan,
Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around
player? arXiv preprint arXiv:2307.06281, 2023a.
Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. Reason
for future, act for now: A principled framework for autonomous llm agents with provable sample
efficiency. arXiv preprint arXiv:2309.17382, 2023b.
Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, and Li Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. arXiv preprint arXiv:2003.06085, 2020.
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke
Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv
preprint arXiv:2202.12837, 2022.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. nature, 518(7540):529–533, 2015.


Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Over-
coming exploration in reinforcement learning with demonstrations. In 2018 IEEE international
conference on robotics and automation (ICRA), pp. 6292–6299. IEEE, 2018.
Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In Icml, vol-
ume 1, pp. 2, 2000.
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
OpenCompass. Opencompass: A universal evaluation platform for foundation models. https:
//github.com/open-compass/opencompass, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math
word problems? arXiv preprint arXiv:2103.07191, 2021.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural
information processing systems, 1, 1988.
Thomas Pouplin, Hao Sun, Samuel Holt, and Mihaela Van der Schaar. Retrieval-augmented thought
process as sequential decision making. arXiv preprint arXiv:2402.07812, 2024.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts.
arXiv preprint arXiv:2104.06599, 2021.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. arXiv preprint arXiv:1806.03822, 2018.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 779–788, 2016.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and struc-
tured prediction to no-regret online learning. In Proceedings of the fourteenth international con-
ference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference
Proceedings, 2011.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint
arXiv:1608.01413, 2016.
Stefan Schaal. Learning from demonstration. Advances in neural information processing systems,
9, 1996.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt:
Eliciting knowledge from language models with automatically generated prompts. arXiv preprint
arXiv:2010.15980, 2020.
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Ne-
unert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing
what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint
arXiv:2002.08396, 2020.


David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

Hao Sun. Reinforcement learning in the era of llms: What is essential? what is needed? an rl
perspective on rlhf, prompting, and beyond. arXiv preprint arXiv:2310.06147, 2023.

Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, and Dahua Lin. Policy continuation with hind-
sight inverse dynamics. Advances in Neural Information Processing Systems, 32, 2019.

Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, and Bolei Zhou. Novel policy seeking with
constrained optimization. arXiv preprint arXiv:2005.10696, 2020.

Hao Sun, Ziping Xu, Meng Fang, Zhenghao Peng, Jiadong Guo, Bo Dai, and Bolei Zhou. Safe
exploration by solving early terminated mdp. arXiv preprint arXiv:2107.04200, 2021.

Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, and Bolei Zhou. Exploit reward shifting
in value-based deep-rl: Optimistic curiosity-based exploration and conservative exploitation via
linear reward shaping. Advances in Neural Information Processing Systems, 35:37719–37734,
2022a.

Hao Sun, Boris van Breugel, Jonathan Crabbe, Nabeel Seedat, and Mihaela van der Schaar. Daux:
a density-based approach for uncertainty explanations. arXiv preprint arXiv:2207.05161, 2022b.

Hao Sun, Ziping Xu, Meng Fang, and Bolei Zhou. Supervised q-learning can be a strong baseline
for continuous control. In NeurIPS 2022 Foundation Models for Decision Making Workshop,
2022c.

Hao Sun, Alex J Chan, Nabeel Seedat, Alihan Hüyük, and Mihaela van der Schaar. When is off-
policy evaluation useful? a data-centric perspective. arXiv preprint arXiv:2311.14110, 2023a.

Hao Sun, Alihan Hüyük, Daniel Jarrett, and Mihaela van der Schaar. Accountability in offline
reinforcement learning: Explaining decisions with a corpus of examples. In Conference on Neural
Information Processing Systems, 2023b.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

TigerResearch. Tigerbot: A cutting-edge foundation for your very own llm. https://github.
com/TigerResearch/TigerBot, 2023.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Juny-
oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. Spot: Better frozen model
adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.

Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompt-
ing with large language models for inductive instructions. arXiv preprint arXiv:2305.13733,
2023.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh-
ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.
arXiv preprint arXiv:2203.11171, 2022.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint
arXiv:2109.01652, 2021.


Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
Neural Information Processing Systems, 35:24824–24837, 2022.
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning.
arXiv preprint arXiv:1911.11361, 2019a.
Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imi-
tation learning from imperfect demonstration. In International Conference on Machine Learning,
pp. 6818–6827. PMLR, 2019b.
Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. Advances in
Neural Information Processing Systems, 33:15737–15749, 2020.
Tian Xu, Ziniu Li, Yang Yu, and Zhi-Quan Luo. Understanding adversarial imitation learning in
small sample regime: A stage-coupled analysis. arXiv preprint arXiv:2208.01899, 2022.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun
Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023a.
Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, and Lei Han. Rorl: Ro-
bust offline reinforcement learning via conservative smoothing. Advances in Neural Information
Processing Systems, 35:23851–23866, 2022a.
Rui Yang, Yiming Lu, Wenzhe Li, Hao Sun, Meng Fang, Yali Du, Xiu Li, Lei Han, and Chongjie
Zhang. Rethinking goal-conditioned supervised learning and its connection to offline rl. arXiv
preprint arXiv:2202.04478, 2022b.
Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, and Tong Zhang. What is essential
for unseen goal generalization of offline goal-conditioned rl? In International Conference on
Machine Learning, pp. 39543–39571. PMLR, 2023b.
Rui Yang, Han Zhong, Jiawei Xu, Amy Zhang, Chongjie Zhang, Lei Han, and Tong Zhang.
Towards robust offline reinforcement learning under diverse data corruption. arXiv preprint
arXiv:2310.12955, 2023c.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv
preprint arXiv:2305.10601, 2023.
Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement
learning with general function approximation. Advances in Neural Information Processing Sys-
tems, 36, 2024.
Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor Prasanna. Brac+: Going deeper with behavior
regularized offline reinforcement learning. 2020.
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel.
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In
2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5628–5635. IEEE,
2018.
Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera:
Test-time prompt editing via reinforcement learning. In The Eleventh International Conference
on Learning Representations, 2022a.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in
large language models. arXiv preprint arXiv:2210.03493, 2022b.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur-
mans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex
reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022a.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022b.


APPENDIX: TABLE OF CONTENTS

A Extended Discussion on Prompt-OIRL and RLHF
  A.1 Connections and Difference from a Descriptive Perspective
  A.2 Prompt-OIRL and RLHF with the Formal RL Language

B Extended Related Work
  B.1 Learning from Human Expertise and Imitation Learning
  B.2 Zero-Shot and Few-Shot Prompting
  B.3 Chain-of-Thought Prompting
  B.4 Other Prompting Strategies

C Supplemental Experiment Details
  C.1 Offline Data Processing and Embeddings
  C.2 Additional Details on Reward Modeling
  C.3 Held-Out Test Prompts
  C.4 Code, Hardware and Training Time
  C.5 Best Test-Set Prompting Strategy on Different Datasets
  C.6 Oracle Reward Model
  C.7 Cost of Embedding

D Extended Discussion
  D.1 Future Work
  D.2 A Call to the Field: Release the Interactive Logs with LLMs as a Valuable Asset
  D.3 Link to the Online Prompt Evaluation and Optimization Methods

E Additional Experiments
  E.1 Prompt-OIRL under Different Data Availability
  E.2 Generalization of Prompt-OIRL over Tasks
  E.3 Experiments on Larger LLM

F Detailed Experiment Results
  F.1 Improving Arithmetic Reasoning: How good are the optimized prompts?
  F.2 Reward Modeling: Accuracy and Precision

G Case Studies
  G.1 Optimal Prompting Strategy is Query Dependent
  G.2 Expert-Crafted Prompts Can All Fail
  G.3 Generated Human-Readable Prompts


Ethics Statement This paper introduces the pioneering application of offline inverse reinforce-
ment learning to query-dependent prompting, enhancing the arithmetic reasoning capabilities of
LLMs. Yet, deploying such a method inappropriately could yield undesirable results. Addition-
ally, the method’s dependence on LLMs and prompting demonstration data underscores potential
concerns about data privacy and consent, especially as it leans on future offline datasets for human-
centered tasks.

Reproducibility Statement All code and offline demonstration datasets will be released upon acceptance. For the sake of full reproducibility, we provide supplemental experiment details in Appendix C and attach example code in the supplementary material of the submission.

Acknowledgement We thank anonymous reviewers for improving the quality of our paper. We
thank Alex J. Chan, Nicolás Astorga, and Tennison Liu for reviewing our paper and providing
insightful comments. HS’s research is funded by ONR.

A EXTENDED DISCUSSION ON PROMPT-OIRL AND RLHF

A.1 CONNECTIONS AND DIFFERENCE FROM A DESCRIPTIVE PERSPECTIVE

To better understand our contribution, it would be helpful to link and contrast Prompt-OIRL with
the framework of reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF).
In general, the success of RLHF and RLAIF motivates our idea of integrating human expertise in
prompting, and Prompt-OIRL can be interpreted as a special type of prompt policy learning from
offline AI feedback, which differentiates from existing literature.
The fact that Prompt-OIRL first learns a reward model and then performs policy optimization using
the reward model makes the method related to RLHF and RLAIF. From such a perspective, Prompt-
OIRL treats prompt evaluation and optimization as a process of aligning prompting strategies with
the preferences of the LLMs, in contrast to aligning LLM responses with human preferences. In analogy with RLHF, which improves the performance of LLMs with regard to human preferences,
we hypothesize that such a perspective could lead to more effective prompts while avoiding the need
for random exploration of the prompt space, which is computationally expensive.
That said, there are several key differences that make Prompt-OIRL distinguishable from the existing
RLHF and RLAIF literature. RLHF (or RLAIF) often includes the following steps: 1. sampling
from pre-trained language models to generate dialogue data; 2. human (or AI) raters are then asked
to rank those generated data according to given criteria such as harmlessness and helpfulness; 3. a preference model is trained on the ranked dataset, as a proxy for the human (or AI) raters; 4. the language model
is fine-tuned with the learned preference model using reinforcement learning.
By contrast, we would like to highlight the differences in objective, non-interactive offline learning
problem setting, and optimization procedure flexibility: First, Objective: The objective of Prompt-
OIRL is to evaluate and optimize prompting strategies, rather than enhance LLMs’ alignment to
non-differentiable objectives. Importantly, the learned reward model in Prompt-OIRL can work in
isolation as a prompt evaluator. Second, Non-Interactive Offline Setting: Prompt-OIRL works in a
purely offline setting, rather than assuming access to human or AI raters to actively generate feed-
back. Third, Optimization Flexibility: the optimization procedure in Prompt-OIRL is highly flexible and need not be a language model as in RLHF and RLAIF. For instance, the prompt optimization process
can collaborate with a human prompting engineer, who proposes potential prompting strategies and
selects the best according to the learned reward model.

A.2 PROMPT-OIRL AND RLHF WITH THE FORMAL RL LANGUAGE

Reinforcement Learning and Offline-RL In Reinforcement Learning (RL), an agent learns
through interacting with an environment and receiving feedback in the form of rewards (Sutton
& Barto, 2018). The fundamental objective of RL is to find a policy, which is a mapping from states
to actions, that maximizes the expected cumulative reward over time.


Such a learning paradigm can be formally represented using Markov Decision Processes (MDPs), where decisions are made in discrete time steps, and each decision affects the state of the environment in the subsequent step.
Formally, we denote the MDP as $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \rho_0, \gamma\}$, where $\mathcal{S} \subset \mathbb{R}^d$ denotes the $d$-dimensional state space and $\mathcal{A}$ is the action space. Broadly, the environment includes $\mathcal{T}$ and $\mathcal{R}$: the former denotes the transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$, which controls transitions between states, while the reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ provides feedback. $\rho_0 = p(s_0) \in \Delta(\mathcal{S})$ denotes the initial state distribution, and $\gamma$ is the discount factor that trades off between short-term and long-term returns.

Figure 7: Pictorial illustration of Online RL: an agent actively interacts with the environment, which
is composed of the dynamics model that controls transition, and the reward model that provides
feedback.

Online RL considers the problem of learning a policy $\pi \in \Pi: \mathcal{S} \mapsto \Delta(\mathcal{A})$ such that the expected cumulative reward $\mathbb{E}_{a_t \sim \pi,\, s_{t+1} \sim \mathcal{T},\, s_0 \sim \rho_0}\left[\sum_{t=0}^{T} \gamma^t r_t(s_t, a_t)\right]$ in the MDP is maximized. In the online RL setting, an agent learns through trial and error: it actively interacts with the environment, including both the transition dynamics $\mathcal{T}$ and the reward function $\mathcal{R}$, and optimizes its policy through either on-policy (Schulman et al., 2015; 2017; Sun et al., 2020; Cobbe et al., 2021b) or off-policy algorithms (Lillicrap et al., 2015; Mnih et al., 2015; Fujimoto et al., 2018b; Haarnoja et al., 2018; Sun et al., 2019).
At each time step $t$, the agent observes a state $s_t$ from the environment and selects an action $a_t$ according to its policy $\pi$. Upon taking the action, the agent receives a reward $r_t$ and transitions to a new state $s_{t+1}$. The agent's objective is to maximize its expected return.
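To make this interaction loop concrete, a minimal sketch in Python is given below; the env and policy interfaces are generic placeholders assumed for illustration (a gym-style environment), not part of Prompt-OIRL.

def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and accumulate the discounted return.
    Assumed interfaces: env.reset() returns s_0 ~ rho_0, env.step(a) returns
    (next_state, reward, done), and policy(s) samples a_t ~ pi(.|s_t)."""
    state = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # a_t ~ pi(. | s_t)
        state, reward, done = env.step(action)   # s_{t+1} ~ T, r_t from R
        discounted_return += discount * reward   # accumulates sum_t gamma^t r_t
        discount *= gamma
    return discounted_return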

Figure 8: Pictorial illustration of Offline RL: a decision dataset is collected through interactive logs
between a behavior agent and the environment. Such offline dataset is then used to optimized a
parameterized policy — the learnable agent.

Offline RL, also known as batch-RL or data-driven RL, focuses on learning optimal policies from
a fixed dataset of interaction data without further interaction with the environment (e.g. Holt et al.,
2023a;b; Sun et al., 2023b). This approach is particularly relevant in scenarios where online interac-
tion is expensive, risky, or impractical (Sun et al., 2021; Yang et al., 2023c). The primary challenge
in Offline RL is that the dataset may not sufficiently cover important regions of the state-action
space, which can lead to extrapolation errors and suboptimal policies (Fujimoto et al., 2018a).
Offline RL algorithms aim to leverage this fixed dataset to derive a policy that would perform well
if deployed in the real environment. Techniques often employed in Offline RL include constraining
the policy to only consider data in the dataset, regularizing against drastic changes from a behavior
policy, and employing uncertainty estimation to avoid regions where the dataset provides limited
information (Kumar et al., 2019; Wu et al., 2019a; Siegel et al., 2020; Sun
et al., 2022a;b; Yang et al., 2022a; 2023b; Ye et al., 2024). In general, such a dataset can either be
generated by rolling out an expert that generates high-quality solutions to the task (Fujimoto et al.,
2018a; Zhang et al., 2020; Fu et al., 2020) or a non-expert that executes sub-optimal behaviors (Fu
et al., 2020; Wu et al., 2019a; Kumar et al., 2019; Agarwal et al., 2020; Hüyük et al., 2021; Jarrett
et al., 2021) or be a mixture of both (Bharadhwaj et al., 2020).
That said, existing offline-RL research mainly focuses on dense-reward tasks such as robotics control (Fu et al., 2020; Levine et al., 2020). Optimizing for trajectory-level reward, as in LLM research where responses may only be evaluated at the sentence level rather than token by token, is relatively underexplored.

Imitation Learning, Behavior Cloning, and Inverse RL Imitation learning and learning from
demonstrations are widely studied in the field of reinforcement learning and robotic learning (Ng
et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Garg et al., 2021; Bica et al., 2021), aiming at
learning policies with a batch of expert trajectories. Working as the most straightforward solution,
Behavior Cloning (BC) (Pomerleau, 1988; Florence et al., 2022) optimizes a policy through super-
vised learning, yet may suffer from compounding error or instability in learning from sub-optimal
demonstrations (Ross et al., 2011; Wu et al., 2019b). It is shown in Mandlekar et al. (2021) that
data quality is of great importance in the success of BC. It is worth noting that, while imitation
learning assumes the underlying reward mechanism is unknown, the learning from demonstration
literature always uses the demonstrations as a warm-start for reward-sparse tasks like robotics ma-
nipulation (Schaal, 1996; Nair et al., 2018).

Figure 9: Pictorial illustration of IL: a dynamics model is always needed to obtain observations.
The decision dataset can be used to provide a reference for the desired behavior. Learning (the pink
dashed line) can be achieved by minimizing the behavior divergence between the decision dataset
and the agent.

Figure 10: Pictorial illustration of IRL: Inverse RL first learns a reward model, with which the
imitation learning now becomes an “online” RL problem — the accessible dynamics provide obser-
vations, and the learned reward model provides feedback.

LLM Alignment with Human Preference The task of LLM alignment under human preference can be framed as RL, yet it faces the challenge of online learning. See Figure 11 for a more detailed explanation.

Figure 11: Pictorial illustration of LLM alignment as an online RL task. However, this is only the ideal case: it is normally unaffordable to always keep humans in the loop. Humans act as the reward model, and the LLMs themselves act as the dynamics model.


In practice, collecting a limited amount of human feedback and saving it as a decision dataset is feasible. In such cases, LLM alignment as online RL is turned into an offline RL problem. As we have discussed above, offline RL also suffers from difficulties in learning. However, there is a unique property of the LLM alignment problem that enables imitation learning with such alignment (human-feedback) data.

Figure 12: Pictorial illustration of LLM alignment as an offline RL task: The alignment dataset is
collected through a limited number of interactions between humans and LLMs.

RLHF: Solving an Offline-RL Task with Online Inverse RL We highlight that RLHF solves the
offline RL task of aligning human preferences through an offline dataset with online IRL. Figure 13
and Figure 14 illustrate such an interpretation.

Figure 13: Pictorial illustration of RLHF as IL: as LLM itself acts as the dynamics model, the offline
RL problem of LLM alignment with human preference becomes an imitation learning problem.

Figure 14: Pictorial illustration of RLHF as Online IRL: Approaching the imitation learning problem
through IRL, RLHF first learns a reward model, with which it has both components in an environ-
ment — the dynamics model (itself) and the reward model (learned). Therefore, RLHF can apply
conventional online RL algorithms like PPO as its solver.

Prompting as Policy Learning: Alignment with LLM Preference While prompt optimization
is by nature an RL task and can be approached by learning from trial and error, randomly exploring
the action space of natural language is infeasible. An analogy is the task of LLM alignment with
human feedback, where an LLM agent interacts with human annotators to collect feedback on its responses and seeks better alignment with the annotators' preferences. In such a task, it is also unrealistic and infeasible to always put humans in the loop. Therefore, RLHF techniques are proposed to find a solution when only an offline alignment dataset is available.
On the other hand, prompt optimization is indeed an LLM-centric type of alignment, as opposed to human-centric alignment: it aligns the prompter's prompting strategies according to LLMs' preferences and feedback. It faces the same type of challenge: always keeping the LLM in the loop during optimization is not feasible. Figure 15 provides a pictorial illustration of how prompting can
be interpreted as an (expensive) online RL task.

Figure 15: Pictorial illustration of prompting as an (expensive) online RL task: the reward calcula-
tion requires extensive usage of the LLM, which is expensive in practice.

Prompt-OIRL: Solving an Offline RL Task with Offline Inverse RL The key observation and claim we would emphasize is that Prompt-OIRL can be interpreted as solving the offline RL problem of aligning prompting strategies with LLMs' feedback through offline IRL.

Offline Data as By-Products In the main text, we have highlighted the existence and importance
of prompt demonstrations generated in benchmarking existing prompts. For instance, in the research
on enhancing the arithmetic reasoning abilities of LLMs, multiple zero-shot prompts are proposed,
such as the CoT (Kojima et al., 2022), APE (Zhou et al., 2022b), and more recently OPRO (Yang
et al., 2023a), etc.
In order to benchmark the effectiveness of those proposed prompts, previous researchers use standardized, openly accessible arithmetic reasoning datasets and report the overall success rate of getting correct answers. Figure 16 provides a pictorial illustration of how the evaluation of existing prompts generates prompt-alignment demonstration datasets as a by-product.

Figure 16: Pictorial illustration of prompting as an offline RL task.

Solving Offline RL by Offline Inverse RL The key observation for such a problem is that zero-shot prompting is a single-step decision-making problem. Therefore, even without the transition dynamics model (the LLM), we are still able to conduct inverse RL in an offline manner. Prompt-OIRL circumvents the difficulty of evaluating prompts by using offline reward modeling. Figure 17 provides a pictorial illustration of such an interpretation.

Figure 17: Pictorial illustration of prompting as an Offline IRL task: the offline prompt demonstra-
tion dataset is used to learn a reward model, as a function of query and prompt, to avoid the requirement
of an LLM during evaluation. Prompt-OIRL then leverages such a learned reward model for prompt
optimization in a purely offline manner.
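To make this pipeline concrete, below is a minimal sketch of query-dependent best-of-N prompt selection with an offline-learned reward model; the function and interface names are ours for illustration and are not taken from the released code.

import numpy as np

def select_prompt(query_emb, prompt_embs, reward_model):
    """Pick a prompt for this query without calling the LLM.
    `reward_model` is assumed to be a callable mapping the concatenated
    (query, prompt) embedding to a predicted probability of a correct answer."""
    scores = [reward_model(np.concatenate([query_emb, p_emb])) for p_emb in prompt_embs]
    return int(np.argmax(scores))  # index of the best-of-N candidate prompt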


Conclusive Remark RLHF solves the problem of learning from an offline LLM alignment dataset with human feedback by turning the problem from offline RL into online IRL, and then into online RL after learning the reward model. Prompt-OIRL solves the problem of learning from an offline prompting dataset with LLM feedback by turning the problem from offline RL into an offline IRL problem, removing the dependency on LLMs during evaluation and optimization.

B EXTENDED RELATED WORK

B.1 LEARNING FROM HUMAN EXPERTISE AND IMITATION LEARNING

Recent advances in Large Language Models (LLMs) have shown impressive performance as
general-purpose agents, especially with the remarkable success of ChatGPT where human knowl-
edge is injected via reinforcement learning from human feedback (RLHF) (Christiano et al., 2017;
Ouyang et al., 2022; OpenAI, 2023; Sun, 2023; Li et al., 2023b). During the training of those
models, human feedback is leveraged to reduce the harmfulness and at the same time improve the
instruction-following ability of models. In Liu et al. (2023b), LLMs are instantiated as decision-
making agents to master various skills.
Imitation learning and learning from demonstrations are widely studied in the field of reinforce-
ment learning and robotic learning (Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Xu
et al., 2020; Garg et al., 2021; Yang et al., 2022b; Xu et al., 2022), aiming at learning policies
with a batch of expert trajectories. Working as the most straightforward solution, Behavior Cloning
(BC) (Pomerleau, 1988; Florence et al., 2022; Sun et al., 2023b) optimizes a policy through super-
vised learning, yet may suffer from compounding error or instability in learning from sub-optimal
demonstrations (Ross et al., 2011; Wu et al., 2019b; Sun et al., 2022c). It is shown in Mandlekar
et al. (2021) that data quality is of great importance in the success of BC. It is worth noting that,
while imitation learning assumes the underlying reward mechanism is unknown, the learning from
demonstration literature always uses the demonstrations as a warm-start for reward-sparse tasks like
robotics manipulation (Schaal, 1996; Nair et al., 2018; Sun et al., 2019; 2021).
In prompt generation tasks, the demonstrations are always imperfect; T-REX (Brown et al., 2019) extrapolates beyond a learned reward function to achieve super-demonstration performance (as does Hüyük et al. (2022) in a contextual bandit setting). We also benchmark the performance of BC,
with specific consideration on the quality of demonstrations (Wu et al., 2019b; Mandlekar et al.,
2021; Sun et al., 2023a).

B.2 ZERO-SHOT AND FEW-SHOT PROMPTING

The ability of zero-shot prompting emerges in language models trained on large amounts of data, such as GPT-3 and GPT-4 (Ouyang et al., 2022; OpenAI, 2023). It was shown in Wei et al. (2021) that instruction fine-tuning improves the zero-shot learning ability of language models.
Notwithstanding the impressive zero-shot performance exhibited by large language models, these
models often exhibit suboptimal performance in executing more complex tasks under a zero-shot
setting. Leveraging few-shot prompting presents a viable approach for facilitating in-context learn-
ing (Brown et al., 2020; Min et al., 2022). This technique necessitates the inclusion of demon-
strations within the prompt, effectively guiding the model toward enhanced performance. These
demonstrations act as conditioning mechanisms for succeeding examples, leading the model to gen-
erate better responses.

B.3 CHAIN-OF-THOUGHT PROMPTING

In some more challenging tasks like complex arithmetic, commonsense, and symbolic reasoning
tasks, the chain-of-thought (CoT) prompting is shown to be more effective in helping the language
models to get correct answers (Wei et al., 2022). CoT includes additional reasoning steps in the
few-shot prompting examples. Kojima et al. (2022) further introduces zero-shot CoT, showing that
adding task-agnostic instruction can improve the model performance in specific tasks. In Zhang
et al. (2022b), Auto-CoT combines the universality of zero-shot CoT and the capability of original
CoT driven by demonstrations and proposes to automatically construct demonstrations based on
clustering and diversity-based sampling that are beneficial for CoT reasoning.

B.4 OTHER PROMPTING STRATEGIES

Wang et al. (2022) further improves the few-shot CoT method by sampling multiple diverse reason-
ing paths and marginalizing those paths, choosing the most consistent answers among all sampled
reasoning paths. Li et al. (2023a) introduces Directional-Stimulus-Prompting, which is a framework
that uses a tuneable language model to provide guidance for the black-box frozen large language
model toward desirable properties.
Generated Knowledge Prompting (Liu et al., 2021) improves commonsense reasoning by incorporating knowledge or information related to the questions to make more accurate predictions. Tree-of-thoughts (ToT) methods (Long, 2023; Yao et al., 2023) combine tree-based planning methods with the reasoning skills of language models, and solve hard reasoning problems step by step via multi-round conversations. Hulbert (2023) also put forward a related idea that leverages multiple thoughts of a language model in a single prompt. Memory and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) combines parametric memory with non-parametric memory such as Wikipedia to complete knowledge-intensive tasks (Pouplin et al., 2024). MoT (Li & Qiu, 2023) pre-thinks based on an external unlabeled dataset and then recalls the related knowledge during inference. In the concurrent works of Yang et al. (2023a); Liu et al. (2024), LLMs are used as optimizers to solve a variety of optimization problems, including prompt optimization. It is worth noting that the two most important differences are the offline nature and the query-dependent zero-shot prompting strategy used in our approach.

Table 5: Prompt-OIRL differentiates from existing literature on prompt optimization by (1) considering the
Adjusted Objective and optimizing query-dependent prompt; (2) being able to perform offline prompt
evaluation to address Challenge 1; (3) optimizing prompt in the offline setting without access to the LLMs
to address Challenge 2; (4) utilizing existing expert knowledge to reduce the difficulty in RL; (5) generating
human-readable prompts; (6) working on the most general natural language prompt space; (7) free from
gradient information of LLMs; (8) using offline inverse reinforcement learning as the solver of the problem.
Method | (1) Query-Dependent Prompt | (2) Offline Prompt Evaluation | (3) Offline Prompt Optimization | (4) Expert Knowledge Inspired | (5) Human-Readable Prompt | (6) Prompt Space | (7) LLM Gradient Free | (8) Solver Used
Soft-Prompt | ✓ | ✗ | ✗ | ✗ | ✗ | Embeddings | ✗ | Gradient-Guided Search
APO | ✗ | ✗ | ✗ | ✓ | ✓ | X = V^∞ | ✓ | Beam Search
APE | ✗ | ✗ | ✗ | ✗ | ✓ | X = V^∞ | ✗ | Evolution Strategy
TEMPERA | ✓ | ✗ | ✗ | ✓ | ✓ | Edit | ✓ | RL
RLPrompt | ✗ | ✗ | ✗ | ✗ | ✗ | {V^2, V^5} | ✓ | RL
Prompt-OIRL (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | X = V^∞ | ✓ | Offline Inverse-RL

Embeddings: the language of LMs; Edit: including operations like swap, delete, etc.; V is the vocabulary, and the superscript over it denotes the length of prompts. V^∞ denotes the natural language space — the most general interpretable format.

C SUPPLEMENTAL EXPERIMENT DETAILS


C.1 OFFLINE DATA PROCESSING AND EMBEDDINGS

For LLM $\ell$ and task $k$, we re-organize the offline demonstration dataset using the embedding function $E: \mathcal{V}^\infty \mapsto \mathbb{R}^{1536}$, which maps a sequence of natural language context to a fixed-length vector of size 1536 (Greene et al., 2022). Therefore, using $e_x^{(i)} = E(x^{(i)})$, $e_\pi^{(j)} = E(\pi^{(j)})$, and $r^{(ij)} = \mathbb{1}\{y^* = \ell(\pi^{(j)}(x^{(i)}))\}$ to denote the embeddings and reward instantiation, our demonstration dataset of Equation (4) in implementation can be expressed as follows:
1. Training Data: $\mathcal{D}_{\ell,k}^{(\mathrm{train})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{1,\dots,N\},\, j\in\{1,\dots,K\}}$
2. Test Data on Query: $\mathcal{D}_{\ell,k}^{(\mathrm{test\,q})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{1,\dots,K\}}$
3. Test Data on Prompts: $\mathcal{D}_{\ell,k}^{(\mathrm{test\,p})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{K+1,\dots,K+P\}}$


In the above, N is the number of samples (query-answer pairs) in the training dataset, M is the
number of samples in the held-out test dataset, K is the number of training prompts, and P is the
number of test prompts.
For all datasets, we experiment with different choices of K = [1, 2, 3, 4, 5, 6]. (i.e., the x-axis of
Figure 6: number of training prompts, ranging from 1 - 6). And use a total number of P = 110
held-out prompts.
The GSM8K task contains 7473 queries with golden answers for training and 1319 held-out queries
with golden answers for testing; the SVAMP task contains 19690 examples, which are split into a
training query-answer set of size 15000 and a testing query-answer set of size 4690; the MAWPS
task contains 7685 examples, which are split into a training query-answer set of size 6000 and
a testing query-answer set of size 1685. Therefore, for each prompting strategy, there are 7473, 15000, and 6000 demonstrative examples, collected through the interactive logs with a given language model, that can be used for training on the three tasks, respectively.
With the previous notations,
For GSM8K, there are N = 7473 samples used for training and M = 1319 samples for testing.
For SVAMP, there are N = 15000 samples used for training and M = 4690 samples for testing.
For MAWPS, there are N = 6000 samples used for training and M = 1685 samples for testing.
Our processed offline datasets and code for processing will be provided as open-source assets to
facilitate future studies.
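As an illustration of this data format, a minimal sketch of how such a demonstration dataset could be assembled is given below; here embed stands in for the embedding call and logged_answer for a lookup into the LLM interaction logs, and both are hypothetical placeholders rather than functions from our released code.

def build_demo_dataset(queries, golden_answers, prompts, embed, logged_answer):
    """Assemble (query embedding, prompt embedding, binary reward) triplets.
    Assumptions: embed(text) returns a 1536-d vector; logged_answer(prompt, query)
    returns the LLM answer recorded in the offline interaction logs."""
    prompt_embs = [embed(p) for p in prompts]          # prompt embeddings are re-usable
    records = []
    for x, y_star in zip(queries, golden_answers):
        e_x = embed(x)
        for e_pi, pi in zip(prompt_embs, prompts):
            r = int(logged_answer(pi, x) == y_star)    # r = 1{answer matches golden answer}
            records.append((e_x, e_pi, r))
    return records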

C.2 ADDITIONAL DETAILS ON REWARD MODELING

To exclude some potential alternatives, we have explored using fully connected neural networks as the reward model and using a ranked dataset mimicking the conventional RLHF setting; both of those choices perform worse than our presented implementation.

MLP To be specific, for the MLP model, we have tried different choices of its hyper-parameters, including unit numbers, layers, various drop-out choices, and a dual-channel architecture with each channel processing query and prompt embeddings individually. However, we find all of those choices tend to converge to the trivial solution that predicts either all-0 or all-1 for the binary classification task. Such a reward model does not actually have predictive ability at inference time. In practice, with such a reward model, we can do nothing better than selecting the best-performing prompt on the training dataset. Therefore, the performance of BoTr Eqn.(1) represents the best achievable performance when using a dummy MLP reward model.

XGBoost For the XGBoost method, we set a universal hyper-parameter configuration for all tasks and LLMs to enhance applicability and demonstrate the robustness of Prompt-OIRL. When deploying Prompt-OIRL in practice, case-by-case engineering of the reward-modeling hyper-parameters should be able to further boost the performance of the algorithm, yet this is out of our research scope. In our paper, we experiment with a single universal hyper-parameter setting for all LLMs and tasks, demonstrating the robustness of the proposed method.
To enhance replicability, we use the following hyper-parameters for the gradient boosting model (Chen et al., 2015) in all experiment settings:
param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}
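For reference, a minimal sketch of reward-model training with these hyper-parameters follows; the features are the concatenated query and prompt embeddings, and the number of boosting rounds is our own illustrative choice since it is not specified above.

import numpy as np
import xgboost as xgb

param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}

def train_reward_model(X, y, num_boost_round=1000):
    """X: concatenated (query, prompt) embeddings, shape (n_samples, 2 * 1536);
    y: binary rewards in {0, 1}. num_boost_round is an assumed value."""
    dtrain = xgb.DMatrix(np.asarray(X), label=np.asarray(y))
    return xgb.train(param, dtrain, num_boost_round=num_boost_round)

def predict_reward(booster, X_new):
    """Predicted probability that each (query, prompt) pair yields a correct answer."""
    return booster.predict(xgb.DMatrix(np.asarray(X_new)))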

The Nearest Neighbor Method Besides the parametric approaches discussed above, we agree
with our reviewer that “a simpler, more straightforward query-dependent model, such as a nearest-
neighbor based solution, could be considered. This would involve, for each query, locating the
closest match whose prompt yields a correct answer and utilizing that prompt for the new query.” In this paragraph, we discuss and experiment with such an idea.
Analytically, although the Nearest-Neighbor approach can be an alternative to using a parameterized model for selecting from the training prompts, it cannot be generalized to unseen prompts due to a lack of support.


Moreover, as a non-parametric method, the Nearest-Neighbor approach requires memorization of the training embeddings, which can be expensive when the number of demonstrations increases.
Empirically, we implemented this idea with GPT-3.5-turbo on the MAWPS dataset. To be specific, for every test query, we look up the training queries' embeddings to find the closest neighbor of this test query's embedding and select from the training prompt(s) that have successfully prompted correct answers on that training query. The results are presented in Table 6.
Concluding from the results, the Nearest-Neighbor approach, as a query-dependent baseline, outperforms BoTr Eqn(1) and achieves on-par performance with BoTr Eqn(2). However, it underperforms Prompt-OIRL, as the generalization ability of the learned reward model in Prompt-OIRL further enhances its prompting performance.
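For clarity, a minimal sketch of this nearest-neighbor baseline is given below, using cosine similarity over the stored training-query embeddings; the names and the fallback to a default prompt are our own assumptions for illustration.

import numpy as np

def nn_select_prompt(test_query_emb, train_query_embs, successful_prompts, default_prompt=0):
    """Find the training query closest to the test query in embedding space
    (cosine similarity) and reuse a prompt that succeeded on that neighbor.
    `successful_prompts[i]` lists the prompt indices that yielded a correct
    answer on training query i; fall back to a default prompt if none did."""
    q = test_query_emb / np.linalg.norm(test_query_emb)
    M = train_query_embs / np.linalg.norm(train_query_embs, axis=1, keepdims=True)
    nearest = int(np.argmax(M @ q))
    candidates = successful_prompts[nearest]
    return candidates[0] if candidates else default_prompt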

Preference-based Learning Another alternative for reward modeling is based on preference-based learning, which has been shown to be effective in some conventional inverse RL tasks (Brown et al., 2019; Hüyük et al., 2022). To be specific, for every training query $x^{(i)}$, there may exist multiple prompts (denoted as $p_+^{(i)}$) that lead to a correct answer, and some other prompts ($p_-^{(i)}$) that lead to a wrong answer.
We can also organize the offline prompt demonstrations in the form of a preference-based dataset $\{x^{(i)}, p_-^{(i)}, p_+^{(i)}\}$ and learn from those pair-wise preference data. In this pair-wise preference approach, the learned reward model takes both prompts and the query as input and outputs the preferred prompt. Given a new query at inference time, such a reward model can be applied to all $K$ candidate prompts with $K-1$ comparisons to find the best prompt, so it is more computationally expensive than the direct reward modeling method used in our work; a minimal sketch of this usage is given after Table 6. We empirically studied whether such an approach can lead to better performance on the MAWPS dataset with the GPT-3.5-turbo model. The results are shown in Table 6 below.

Table 6: Comparison of different reward model approaches: the Nearest-Neighbor method, instance-wise
prediction of outcome, and pair-wise prediction.
# Training Prompts 1 2 3 4 5 6
BoTr Eqn(1) 0.6598 0.6988 0.7118 0.7155 0.7173 0.7175
BoTr Eqn(2) 0.6598 0.7742 0.8150 0.8350 0.8473 0.8546
BoTr Eqn(2) Paired 0.6598 0.7195 0.7203 0.7194 0.7236 0.7247
Nearest-Neighbor 0.6598 0.7856 0.8242 0.8339 0.8492 0.8530
Prompt-OIRL 0.7637 0.8032 0.8379 0.8750 0.8916 0.8944

It can be concluded that the pair-wise reward model cannot achieve better performance than the direct reward modeling used in Prompt-OIRL.
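For completeness, here is a minimal sketch of how such a pairwise model would be used at inference time; the prefer comparator is a hypothetical stand-in for the learned pairwise reward model, not part of our released code.

def tournament_select(query_emb, prompt_embs, prefer):
    """Select a prompt with K-1 pairwise comparisons.
    `prefer(query_emb, emb_a, emb_b)` is assumed to return True when the
    learned pairwise reward model prefers prompt a over prompt b for this query."""
    best = 0
    for challenger in range(1, len(prompt_embs)):
        if not prefer(query_emb, prompt_embs[best], prompt_embs[challenger]):
            best = challenger   # the challenger wins this comparison
    return best                 # index of the preferred prompt after K-1 comparisons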
All our offline datasets and the code for processing those datasets, as well as the MLP implementation and the pair-wise reward modeling implementation, will be made publicly available and contributed as an asset for future research.
Further investigation into hyper-parameter tuning should lead to further improvement, especially in pursuit of higher precision values on imbalanced prompt demonstration datasets (e.g., when using smaller LLMs like LLaMA-7B for the more challenging tasks like GSM8K).

C.3 HELD-OUT TEST PROMPTS

The following 10 prompts are used as a held-out test set for offline prompt evaluation. To enable the
performance evaluation on held-out prompts, we collect interactive logs on each language model for
10 held-out novel prompting strategies. We will release the corresponding prompted answers on test queries for those held-out prompts as part of our offline dataset.

1. Approaching this logically, the steps to find the answer are:
2. Let’s break this down into manageable steps to fully understand the problem:


3. Consider this as a puzzle, each piece contributing to the final answer. Let’s place each
piece, one by one:
4. Three scholars are analyzing this query from various perspectives, working collaboratively
to build a comprehensive answer. Each contributes a step:
5. Let’s solve this like a detective would solve a mystery, gathering clues and building up to
the final solution step by step:
6. Imagine we’re navigating a maze; each decision brings us closer to the center. Let’s map
our route:
7. Envision a round table meeting of expert problem solvers. Each participant suggests a step,
building towards a consensus answer:
8. Like an architect constructing a building, let’s design our answer carefully, layer by layer:
9. As if we are assembling a complex machine, let’s put it together piece by piece:
10. Three wise philosophers are debating this question, each contributing a different aspect of
the answer. Let’s follow their discourse:

In order to further verify the generalization ability of the learned reward model, we extend to more held-out prompts. We use GPT-4 to generate those prompts through in-context learning. Specifically, we provided the 6 human-crafted prompting strategies to GPT-4 and asked GPT-4 to generate other potential prompting strategies. Our original chat history and prompts are available at the following anonymous link: https://chat.openai.com/share/2c7652d2-2f96-48e8-b34f-efa4b15f8a61.

1. Like a bee pollinating flowers, let’s gather the essence from each point:
2. Picture this as a chain. Each link strengthens the whole:
3. As a librarian categorizes books, let’s sort the information accordingly:
4. This is like unfolding origami to understand each crease and fold:
5. Imagine it as an echo in a valley, every sound has an origin and meaning:
6. Let’s approach this like a chef creating a new recipe, ingredient by ingredient:
7. As a cartographer maps terrains, let’s chart the nuances and details:
8. Like an athlete training for an event, every exercise has a purpose:
9. Think of it as wind chimes, each note contributing to the melody:
10. It’s like flying a kite. Every tug and adjustment affects its flight:
11. Imagine we’re in a lab, every test and observation is crucial for the conclusion:
12. Like a blacksmith forging metal, let’s shape our understanding with precision:
13. As a botanist identifies plants, let’s classify each detail:
14. This is like a radio tuning to different frequencies, let’s find the right wavelength:
15. Imagine it’s a theater play. Scene by scene, we unfold the plot:
16. Let’s dive into this like a marine biologist exploring the ocean’s depths:
17. As an electrician wires a circuit, each connection powers the system:
18. This is like a gardener pruning a plant, every cut makes it flourish:
19. Think of it as the gears of a bicycle, each turn propelling us forward:
20. Like a mountain climber uses tools, let’s leverage each resource for understanding:
21. As a composer orchestrates a symphony, let’s integrate each instrument:
22. This is like a pot of stew, simmering to meld the flavors:
23. Imagine building a sandcastle, each grain matters for the structure:
24. It’s like peeling an onion, layer by layer, revealing the core:
25. Like a captain steering a ship, each decision adjusts our course:
26. As a mason lays bricks, every placement supports the structure:


27. This is like a game of chess, anticipating each move ahead:
28. Let’s tackle this as a mechanic examines an engine, part by part:
29. Like a bird building a nest, every twig and feather matters:
30. Imagine crafting a sculpture, every detail brings it to life:
31. Like a student taking notes, we’ll highlight the key points:
32. It’s like sifting for gold, discarding the dirt to find the nuggets:
33. Let’s weave through this as a spider spins its intricate web:
34. As a farmer tills the soil, every effort nurtures the crop:
35. Like a director shoots a film, each scene tells a part of the story:
36. Think of it as a flashlight in the dark, illuminating step by step:
37. Like a butterfly metamorphosing, we’ll transition through stages of understanding:
38. Let’s construct our insights as an engineer designs machinery:
39. Imagine it as the flow of a river, every bend and ripple shapes the course:
40. It’s like a hawk soaring in the sky, observing details from a vantage point:
41. Like a detective following leads, every hint brings us closer to the truth:
42. As a tailor sews a garment, every stitch defines the shape:
43. Think of it as blending flavors in a dish, balancing to perfection:
44. Like a maestro leading an orchestra, every cue creates harmony:
45. Let’s mold our comprehension as a child shapes clay into forms:
46. Imagine this as a marathon; every mile, every step takes us closer to the finish:
47. It’s like mining for gems; we dig deep, assessing each discovery:
48. Let’s examine this as a doctor diagnoses symptoms to understand the ailment:
49. Think of it as a tree growing, each branch reaching out to new insights:
50. Like a beekeeper tends the hive, let’s understand each aspect of the colony:
51. As a carpenter joins wood, every angle and joint is crucial:
52. This is like creating a mosaic, each shard contributes to the whole:
53. Let’s distill our thoughts as a brewer crafts a fine beverage:
54. Imagine as if we’re setting a trap, each component is pivotal:
55. It’s like a locksmith crafting a key, every cut unlocking understanding:
56. Like a painter blending colors, let’s mix our ideas for clarity:
57. Let’s tread as an explorer maps uncharted territory, noting every landmark:
58. Think of this as a puzzle box, each mechanism revealing deeper layers:
59. Like an artisan molds pottery, our hands shape the outcome:
60. Let’s embark on this quest as a knight faces challenges, overcoming each obstacle:
61. Imagine we’re piecing together an ancient manuscript; each fragment reveals more:
62. This is like building a bridge. Each segment must connect perfectly. Let’s build:
63. Picture a gardener planting seeds. Each step, from soil preparation to watering, matters:
64. Think of this as weaving a tapestry. Every thread adds to the whole image:
65. Let’s explore this like an astronaut discovering a new planet, detail by detail:
66. Envision this as a mosaic. Each tile contributes to the final artwork:
67. Like navigating a ship through a storm, each decision is vital. Let’s chart the course:
68. It’s as if we’re restoring a masterpiece painting. Layer by layer, let’s uncover:
69. Imagine being an archaeologist, unearthing an artifact. Bit by bit, we reveal:
70. Think of this as a dance, each move flowing into the next. Let’s choreograph:


71. As a detective collecting clues for a case, let’s unravel the mystery:
72. Like a novelist crafting a story, let’s develop our narrative chapter by chapter:
73. Envision assembling a jigsaw puzzle, connecting pieces to see the whole picture:
74. Imagine sculpting from a block of marble; every chisel matters. Let’s sculpt:
75. As a chemist mixing solutions, each ingredient alters the result. Let’s mix:
76. This is like tuning an instrument. Every adjustment leads to harmony:
77. Like an actor rehearsing a script, let’s understand our lines and their meaning:
78. Think of it as navigating a map. Each route we explore gives more insight:
79. As if we’re crafting a potion, each herb and element has a purpose. Brew with me:
80. Let’s approach this like a mathematician proving a theorem, step by logical step:
81. Picture us on a safari, observing every species. Detail by detail, we document:
82. Like setting up dominoes for a cascade, each placement is crucial. Let’s set up:
83. Imagine you’re a jeweler, evaluating a gem. Every facet reflects light differently:
84. Like a historian deciphering an old text, let’s understand its context:
85. Think of this as a musical composition. Each note leads to the next movement:
86. It’s like setting a table for a grand feast. Every detail adds to the ambiance:
87. As a photographer captures moments, let’s focus on each element:
88. Imagine we’re tailoring a suit. Each stitch, cut, and measurement counts:
89. This is like assembling a watch; every gear and spring is vital. Let’s assemble:
90. Envision ourselves as geologists, studying layers of rock. Layer by layer, we analyze:
91. Like drafting an architectural blueprint, every line and measurement matters:
92. As if we’re brewing a perfect cup of tea, each ingredient and timing is crucial:
93. This is like a relay race. Each leg of the race builds upon the last:
94. Think of it as mixing colors for a painting. Every shade adds depth and nuance:
95. It’s like being a conductor, ensuring each instrument plays its part:
96. Imagine decoding an encrypted message. Every symbol has a meaning:
97. Like planning a trip, each destination and route makes the journey:
98. This is like lighting a sequence of candles. Each one illuminates more:
99. Envision it as a waterfall, each drop contributing to the flow:
100. Like a potter shaping clay, let’s mold our understanding step by step:

C.4 CODE, HARDWARE AND TRAINING TIME

Our code, as well as the offline datasets, will be released as openly accessible. During the review process, we provide source code in the supplementary material. We highlight that Prompt-OIRL can be reproduced within a few hours on a single laptop using only the CPU. With our implementation, conducting OIRL for GSM8K takes 50 minutes on a MacBook Air with an 8-core M2 chip, and takes only 5 minutes on a server using 16 (out of 64) cores of an AMD 3995WX CPU.

C.5 BEST TEST-SET PROMPTING STRATEGY ON DIFFERENT DATASETS

We use Best of Train to denote the best performance of training prompts (i.e., out of the 6 training
prompts listed in Table 2 of our paper) on the test queries, and use Best of Test to denote the best
performance of the first 10 held-out test prompts. Details of those test prompts are provided in
Appendix C.3.


Table 7: Best Test-Set Prompting Strategy on Different Datasets


Dataset LLM Best of Train Best Index Best of Test Best Index
MAWPS GPT3.5-turbo 0.855 Prompt 0 0.841 Prompt 5
SVAMP GPT3.5-turbo 0.718 Prompt 2 0.736 Prompt 2
GSM8K GPT3.5-turbo 0.672 Prompt 1 0.670 Prompt 2
MAWPS LLaMA2-7B 0.646 Prompt 0 0.587 Prompt 8
SVAMP LLaMA2-7B 0.637 Prompt 0 0.588 Prompt 8
GSM8K LLaMA2-7B 0.248 Prompt 0 0.241 Prompt 5
MAWPS TigerBot-13B 0.536 Prompt 0 0.669 Prompt 0
SVAMP TigerBot-13B 0.645 Prompt 1 0.673 Prompt 8
GSM8K TigerBot-13B 0.394 Prompt 3 0.393 Prompt 7

Table 8: Oracle Reward Model Performance


Dataset LLM Oracle on Train Oracle on Test Oracle on Train + Test
MAWPS GPT3.5-turbo 0.942 0.941 0.957
SVAMP GPT3.5-turbo 0.882 0.869 0.903
GSM8K GPT3.5-turbo 0.872 0.873 0.914
MAWPS LLaMA2-7B 0.891 0.855 0.926
SVAMP LLaMA2-7B 0.876 0.837 0.907
GSM8K LLaMA2-7B 0.582 0.501 0.648
MAWPS TigerBot-13B 0.929 0.642 0.935
SVAMP TigerBot-13B 0.928 0.913 0.954
GSM8K TigerBot-13B 0.754 0.7 0.822

C.6 ORACLE REWARD MODEL

Assuming we have an oracle reward model (i.e., at test time, we assume we can always pick the prompt that achieves a correct answer), this should correspond to an empirical performance upper bound of Prompt-OIRL.

Take-Away Messages

1. The optimal prompt is highly relevant to the task and LLM used. In 3 out of the 9 settings,
the optimal test-time prompting strategy exists in the test prompts, rather than the training
prompts generated by previous research.
2. The oracle performance can be limited by the ability of LLMs.
3. Increasing the diversity of prompts improves the oracle performance.

C.7 COST OF EMBEDDING

Using the embedding models to get prompt or query embeddings is in general much cheaper than calling the LLMs to get responses. For instance, the gpt-3.5-turbo API charges 0.001 USD per 1k tokens for input and 0.002 USD per 1k tokens for output; as a comparison, the embedding model ada v2 charges only 0.0001 USD per 1k tokens.
Moreover, we would like to note that those prompt embeddings are re-usable. For instance, getting the embeddings for 100 prompts like “let’s try to solve it step by step, and make sure to get the correct answer” will cost about 100 prompts * 20 tokens per prompt * 0.0001 USD / 1000 tokens = 0.0002 USD.
In comparison, interaction with the LLMs to get the prompted response is much more expensive. A
single query-answer pair will lead to a cost of approximately 0.0006 USD. Evaluating the perfor-
mance of a single prompting strategy by interacting with gpt-3.5-turbo on a medium-sized dataset
containing 5000 samples will lead to a 3 USD expense.
Therefore, the cost of Prompt-OIRL in generating embeddings (≈0.0002 USD) is significantly
cheaper than the cost of evaluating any prompt with LLMs (≈3 USD). The gap in cost will be even
more significant when multiple prompts need to be evaluated (i.e., evaluating K different prompts
will cost ≈3K USD).
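The back-of-the-envelope comparison above can be reproduced in a few lines; the prices and token counts are the rough figures quoted in this subsection.

# Embedding cost: 100 prompts of ~20 tokens each at 0.0001 USD per 1k tokens.
embedding_cost = 100 * 20 * 0.0001 / 1000             # ~0.0002 USD

# LLM evaluation cost: ~0.0006 USD per query-answer pair, 5000 samples per prompt.
eval_cost_per_prompt = 0.0006 * 5000                  # ~3 USD

print(f"Embedding 100 prompts:        ~{embedding_cost:.4f} USD")
print(f"Evaluating 1 prompt with LLM: ~{eval_cost_per_prompt:.0f} USD")
print("Evaluating K prompts:         ~3K USD (scales linearly with K)")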

D EXTENDED DISCUSSION
D.1 FUTURE WORK

While our research primarily centers on arithmetic reasoning tasks, we wish to underscore the versa-
tility of Prompt-OIRL’s insights for broader applications, especially where there exists a prompting
demonstration dataset accompanied by ratings of the prompted responses. As a hypothetical ap-
proach to dataset construction with human annotators incorporated into the process, consider this:
human annotators could employ LLMs to accomplish specific tasks. They might offer multiple
prompts as instructions for the task, and the ensuing LLM responses can then be graded based on
proficiency in executing the given task. In fact, these annotators could be everyday LLM users keen
on evaluating diverse responses. We earmark this intriguing concept for subsequent exploration.

D.2 A CALL TO THE FIELD: RELEASE THE INTERACTIVE LOGS WITH LLMS AS A VALUABLE ASSET.

To apply the key insight of Prompt-OIRL to more general natural language generation tasks, the challenge lies in the open accessibility of the offline datasets.
Intuitively, one may think the challenge lies in the scarcity of prompting knowledge (i.e., the number of expert-crafted prompts). Yet, as a matter of fact, an important observation we have made on the arithmetic reasoning datasets in our work is that learning a reward model for query-dependent prompt evaluation and optimization is helpful even in the scarce expert-knowledge setting. Therefore, the scarcity of expert-crafted prompts is not a vital challenge.
On the other hand, the accessibility of open-sourced offline datasets — in the form of (query, prompt, prompted response) tuples — is more important. In the era of LLMs, only a few researchers release their prompted responses when releasing their discovered prompting strategies. Even with access to their code, fully reproducing their results still requires intensive interactions with the LLMs, leading to a huge cost.
Motivated by the discovery made by Prompt-OIRL, those offline datasets — not only those inter-
active communication logs with LLMs for the discovered well-performing prompts but also those
logs for the failed prompting attempts — can be useful for learning a better reward model. There-
fore, with the success of Prompt-OIRL, we would like to call on the community to release not only
code but also the interaction logs (i.e., offline demonstration dataset) as an asset. Learning with those
demonstrations saves energy and computation by avoiding repetition, and enhances the development
of prompt engineering.

D.3 LINK TO THE ONLINE PROMPT EVALUATION AND OPTIMIZATION METHODS

In this paragraph, we would note that the online methods are the foundations for our query-
dependent offline approach, and additional prompts discovered by online algorithms can improve
the performance of Prompt-OIRL.
Analytically, we emphasize that Prompt-OIRL is not in direct competition with online methods but rather builds upon their generated knowledge. This approach is akin to standing on the shoulders of giants; it improves upon the existing prompts by considering query dependency at test time, whether they are generated by online algorithms or domain experts.
To see this, we can observe from Figure 6, where an increase in the number of prompts correlates
with consistent improvements in performance. This demonstrates Prompt-OIRL’s capacity to draw
valuable insights from online methods and exhibit monotonically better results.


Synthetically, assuming an oracle reward model, the incorporation of more prompts at test time consistently enhances the performance upper bound (cf. Table 8). This holds true even if the additional prompts sourced from online methods are suboptimal. Thus, our study highlights the importance of diverse prompting strategies. This concept echoes our initial motivating example, where different queries benefit from varying prompts, a finding supported by the concurrent literature (Yang et al., 2023a).
Empirically, we underpinned our reasoning with experimental evidence. Table 9 contextualizes the
averaged performance improvement over different LLMs and tasks. We use K=6/K=1 to highlight
the improvement achieved using 6 training prompts as compared to only using 1 prompt. We use
OIRL/Eqn(2) to highlight the improvement achieved using additional test prompts as compared to
only selecting from the training prompts (i.e., the improvement of Prompt-OIRL over BoTr Eqn(2)).
(1). Increasing the number of prompts sourced from online algorithms during training consistently
enhances the performance of Prompt-OIRL and other baseline methods. This improvement under-
scores the value of enriched demonstration data derived from online algorithms. (2). Increasing the
number of prompts sourced from online algorithms during testing leads to consistent improvements
in query-dependent prompt optimization. This is clearly observed when comparing the outcomes of
Prompt-OIRL — which selects the most effective prompt from both training and testing sets using
its learned reward model — against BoTr Eqn(2), which limits its selection to training prompts. The
performance gains observed in Prompt-OIRL affirm the benefits of integrating additional prompts
from online sources.

Table 9: Adding prompts to either the training or the test set improves the performance of Prompt-OIRL.
Method K=1 K=2 K=3 K=4 K=5 K=6 K=6/K=1
LLM-Conf. 0.4751 0.5184 0.5427 0.5543 0.5594 0.5601 + 17.89%
BoTr Eqn(1) 0.4751 0.5379 0.5654 0.5800 0.5893 0.5944 + 25.11%
BoTr Eqn(2) 0.4751 0.5553 0.5910 0.6130 0.6295 0.6393 + 34.56%
Prompt-OIRL 0.5904 0.6095 0.6225 0.6331 0.6410 0.6459 + 9.4%
OIRL/Eqn(2) +24.27% +9.76% +5.33% +3.23% +1.83% +1.03% -

E ADDITIONAL EXPERIMENTS
E.1 PROMPT-OIRL UNDER DIFFERENT DATA AVAILABILITY

To stress-test Prompt-OIRL under data-scarcity settings, we experiment with the SVAMP dataset using a demonstration dataset from GPT3.5-turbo and remove different proportions of the training data. We compare results by removing [0.0, 0.3, 0.5, 0.7, 0.8, 0.9] of the training dataset, which has 15000 samples. The following table presents the results when using the 6 prompts. We compare BoTr Eqn.(1) and LLM confidence as the data-size-agnostic baselines, and BoTr Eqn.(2) as a data-size-dependent baseline.

Table 10: Performance of Prompt-OIRL with Varying Dataset Sizes


Dataset Size 15000 10500 7500 4500 3000 1500 BoTr Eqn.(1) LLM Conf.
Prompt-OIRL 0.894 0.887 0.866 0.852 0.840 0.827 0.717 0.686
BoTr Eqn.(2) 0.847 0.831 0.814 0.803 0.794 0.782 0.717 0.686

From these results, it can be observed that Prompt-OIRL consistently achieves better performance than using query-agnostic prompting strategies, using LLM confidence as a baseline, or using Equation (2) and the reward model to choose from the training prompts.

E.2 GENERALIZATION OF PROMPT-OIRL OVER TASKS

The generalization of Prompt-OIRL over tasks can have great value in practical deployments. To
verify the generalization ability of the learned reward model, we experiment by using reward models
that are trained on different tasks in Prompt-OIRL when evaluating and optimizing prompts for the
MAWPS task with GPT3.5-turbo. The following table shows the results we get using the three
different reward models trained with MAWPS, SVAMP, and GSM8K, individually.

Table 11: Generalization Results of Prompt-OIRL on MAWPS Task


Model Method K=1 K=2 K=3 K=4 K=5 K=6
MAWPS BoTr Eqn(1) 0.6598 0.6988 0.7118 0.7155 0.7173 0.7175
BoTr Eqn(2) 0.6598 0.7742 0.8150 0.8350 0.8473 0.8546
Prompt-OIRL 0.7637 0.8032 0.8379 0.8750 0.8916 0.8944
SVAMP BoTr Eqn(1) 0.6598 0.6959 0.7078 0.7101 0.7124 0.7163
BoTr Eqn(2) 0.6598 0.7367 0.7529 0.7587 0.7593 0.7626
Prompt-OIRL 0.7222 0.7497 0.7757 0.7804 0.7883 0.7903
GSM8K BoTr Eqn(1) 0.6598 0.6753 0.6922 0.6968 0.6967 0.6926
BoTr Eqn(2) 0.6598 0.6880 0.6982 0.7016 0.7007 0.7092
Prompt-OIRL 0.6617 0.6884 0.6985 0.7015 0.7012 0.7145

It is worth noting that, under such a cross-task setting, the BoTr Eqn(1) baseline selects the best prompt according to its performance on the training dataset of the reward model. For instance, when generalizing the reward model learned from the GSM8K prompting demonstration dataset to the MAWPS task, its Best-of-Training prompt is chosen to be the best-performing prompt on the GSM8K dataset.
According to the results, we find the reward model learned with the SVAMP dataset generalizes
better to the MAWPS task than the reward model learned with the GSM8K dataset. This is not sur-
prising as those two tasks are more similar compared to the GSM8K dataset, which always requires
multiple-step reasoning. Importantly, the performance of Prompt-OIRL consistently surpasses using
the best-of-training baselines, demonstrating its potential in practical deployments where general-
ization over tasks is considered to be an essential advantage.
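
The cross-task protocol behind Table 11 can be summarized by the short sketch below; the variable names (reward_models, mawps_queries, candidate_prompts) are illustrative assumptions rather than our actual code, and the selection rule is the same best-of-N rule sketched earlier.

from typing import Callable, Dict, List

def cross_task_prompt_selection(queries: List[str],
                                candidate_prompts: List[str],
                                reward_model: Callable[[str, str], float]) -> Dict[str, str]:
    # Reuse a reward model trained on a source task, unchanged, to pick a
    # prompt for every query of the target task (here: MAWPS).
    return {query: max(candidate_prompts, key=lambda p: reward_model(query, p))
            for query in queries}

# for source_task in ["MAWPS", "SVAMP", "GSM8K"]:
#     selections = cross_task_prompt_selection(
#         mawps_queries, candidate_prompts, reward_models[source_task])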

E.3 E XPERIMENTS ON L ARGER LLM

We additionally experiment with GPT-4 on the MAWPS dataset. With a stronger language model,
the reward modeling step of Prompt-OIRL becomes more challenging because the data used for
training the reward model is increasingly imbalanced (i.e., most prompts lead to a correct answer).
Even under this more challenging setting, the results below demonstrate the efficacy of Prompt-OIRL
and of the query-dependent objective based on the learned reward model.
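
To illustrate the imbalance, the following measurement (with a hypothetical record format) computes the fraction of positively labeled demonstrations; with GPT-4 this fraction approaches one, leaving few negative examples for the reward model to learn from.

def positive_label_ratio(demonstrations, label_key="correct"):
    # Fraction of (query, prompt) records whose response was labeled correct.
    return sum(1 for record in demonstrations if record[label_key]) / max(len(demonstrations), 1)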

Table 12: Results with GPT-4 on MAWPS Dataset


Method K=1 K=2 K=3 K=4 K=5 K=6
BoTr Eqn(1) 0.8781 0.9068 0.9131 0.9190 0.9245 0.9300
BoTr Eqn(2) 0.8781 0.9352 0.9513 0.9609 0.9685 0.9757
Prompt-OIRL 0.8944 0.9401 0.9526 0.9607 0.9672 0.9745

F D ETAILED E XPERIMENT R ESULTS


F.1 I MPROVING A RITHMETIC R EASONING : H OW GOOD ARE THE OPTIMIZED PROMPTS ?

We provide detailed performance comparisons on all datasets and LLMs used in our experiments;
averaging these results yields Figure 6. In most cases, Prompt-OIRL achieves superior performance,
especially when the number of training prompts is limited. It is also unsurprising that, in some cases
where the number of training prompts is sufficient, selecting the best prompt from training, rather
than seeking further optimization over held-out prompts, can be the best strategy.


That said, we also observe some failure cases of Prompt-OIRL when the weaker LLMs attempt the
most challenging tasks. In those cases, the difficulty mainly results from the inability to accurately
evaluate prompts — please refer to the following section for more details.

Figure 18: Prompting Success Rate on GPT 3.5 Turbo

Figure 19: Prompting Success Rate on LLaMA-2-7B-Chat

Figure 20: Prompting Success Rate on TigerBot-13B-Chat

F.2 R EWARD M ODELING : ACCURACY AND P RECISION .

Accuracy We present the accuracy of different methods across models and datasets in Figure 21 -
Figure 23. The learned reward model consistently achieves higher prediction accuracy than using
LLMs as critics, and the prediction accuracy does not change much as a function of the number of
training prompts.
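
For clarity, the sketch below shows how the reported prediction accuracy can be computed, treating the learned reward model as a binary classifier of whether a (query, prompt) pair yields a correct answer; the decision threshold and record format are illustrative assumptions.

def prediction_accuracy(eval_records, reward_model, threshold=0.5):
    # eval_records: list of (query, prompt, label) tuples, where label = 1 if
    # the prompted LLM answer was correct and 0 otherwise.
    num_correct = 0
    for query, prompt, label in eval_records:
        predicted = 1 if reward_model(query, prompt) >= threshold else 0
        num_correct += int(predicted == label)
    return num_correct / max(len(eval_records), 1)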

Figure 21: Accuracy on GPT 3.5 Turbo


Figure 22: Accuracy on LLaMA-2-7B-Chat

Figure 23: Accuracy on TigerBot-13B-Chat

Precision We present the precision of different methods across models and datasets in Figure 24 -
Figure 26. In most cases, the learned reward model achieves significantly higher precision than
using LLMs as critics, meaning that a prompt predicted to perform well is more likely to yield a
correct answer — the basis of our choice of precision as the metric. However, the reward model
struggles when predicting the performance of LLMs with weaker arithmetic reasoning capabilities:
when LLaMA-2-7B-Chat and TigerBot-13B-Chat face the most challenging GSM8K tasks, their
answers are wrong in most cases, and achieving high precision becomes more difficult. In this work,
we set hyper-parameters uniformly to enhance reproducibility and to demonstrate the advantage of
the proposed method with minimal engineering effort. In practice, this issue can potentially be
further alleviated through, e.g., sub-sampling or re-balancing the training dataset, as sketched below.
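
The precision reported here, together with the re-balancing mitigation mentioned above, can be sketched as follows; both functions assume the same hypothetical (query, prompt, label) records as in the accuracy sketch and are illustrative rather than the exact configuration used for the reported numbers.

import random

def prediction_precision(eval_records, reward_model, threshold=0.5):
    # Among the pairs the reward model predicts as "will be answered correctly",
    # the fraction whose prompted answer was actually correct.
    predicted_positive = [(q, p, y) for (q, p, y) in eval_records
                          if reward_model(q, p) >= threshold]
    if not predicted_positive:
        return 0.0
    return sum(y for (_, _, y) in predicted_positive) / len(predicted_positive)

def rebalance(records, seed=0):
    # Down-sample the majority label so that positive and negative examples
    # appear in equal numbers when training the reward model.
    rng = random.Random(seed)
    positives = [r for r in records if r[2]]
    negatives = [r for r in records if not r[2]]
    majority, minority = (positives, negatives) if len(positives) > len(negatives) else (negatives, positives)
    return minority + rng.sample(majority, k=len(minority))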

Figure 24: Precision on GPT 3.5 Turbo

Figure 25: Precision on LLaMA-2-7B-Chat


Figure 26: Precision on TigerBot-13B-Chat

G C ASE S TUDIES
G.1 O PTIMAL P ROMPTING S TRATEGY IS Q UERY D EPENDENT

In this section, we provide case studies demonstrating that expert-crafted prompting strategies are
verified and optimized at a statistical level rather than at an instance-wise level: there are many
cases where one of those prompts works better than the others on one query but worse on another.
We present results on the GSM8K dataset with GPT-3.5-turbo (v.2023.05).

W HEN N O -P ROMPT IS THE B EST. C ASE 1


Question: Shiela bought five cell phones for $150 each for a 3-month installment. A 2%
interest will be charged for each unit. How much will Shiela pay each month for 3 months?
Answer: A $150 x 2/100 = $〈〈150*2/100=3〉〉3 interest will be charged for each unit. So,
each cell phone costs $150 + $3 = $〈〈150+3=153〉〉153. The total amount to be paid for 5 cell
phones is $153 x 5 = $〈〈153*5=765〉〉765. Thus, Shiela will pay $765/3 = $〈〈765/3=255〉〉255
each month.
#### 255〈—endoftext—〉

P ROMPTED A NSWERS

No prompt: The total cost of the five cell phones is $150 x 5 = $〈〈150*5=750〉〉750. The
interest for each unit is $150 x 2/100 = $〈〈150*2/100=3〉〉3. So, the total interest for the five
units is $3 x 5 = $〈〈3*5=15〉〉15. Therefore, Shiela will pay $750 + $15 = $〈〈750+15=765〉〉765
for the five cell phones. Hence, Shiela will pay $765/3 = $〈〈765/3=255〉〉255 each month for
3 months. Answer: 255 .
CoT prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.
APE prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.

ToT prompt: Expert 1: To find out how much Shiela will pay each month for 3 months, we
need to calculate the total cost of the cell phones first. Since she bought five cell phones for
$150 each, the total cost would be $150 x 5 = $750.
Expert 2: Now, let’s calculate the interest charged for each unit. Since the interest rate is 2%,
we can calculate the interest for each unit as $150 x 0.02 = $3.
Expert 3: Finally, to find out how much Shiela will pay each month for 3 months, we need to
divide the total cost by the number of months. So, Shiela will pay $750 / 3 = $250 each month
for 3 months.

W HEN N O -P ROMPT IS THE B EST. C ASE 2


Question: Mary is an avid gardener. Yesterday, she received 18 new potted plants from her
favorite plant nursery. She already has 2 potted plants on each of the 40 window ledges of her
large country home. Feeling generous, she has decided that she will give 1 potted plant from
each ledge to friends and family tomorrow. How many potted plants will Mary remain with?
Answer: Yesterday, before receiving the plants, Mary had 2*40 = 〈〈2*40=80〉〉80 potted plants
After receiving an additional 18 plants, she therefore had a total of 80 + 18 = 〈〈80+18=98〉〉98
potted plants Tomorrow, Mary’s plant giveaway will be 40 *1 = 〈〈40*1=40〉〉40 potted plants.
She will therefore remain with 98 - 40 = 〈〈98-40=58〉〉58 potted plants.
#### 58〈—endoftext—〉

P ROMPTED A NSWERS

No prompt: Mary currently has 40 window ledges with 2 potted plants each, which totals 40
* 2 = 〈〈40*2=80〉〉80 potted plants. If she received 18 new potted plants, her total number of
potted plants will be 80 + 18 = 〈〈80+18=98〉〉98 potted plants. If she gives away 1 potted plant
from each ledge, she will give away 40 * 1 = 〈〈40*1=40〉〉40 potted plants. Therefore, Mary
will remain with 98 - 40 = 〈〈98-40=58〉〉58 potted plants. Answer: 58 .
CoT prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.
APE prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.

ToT prompt: Expert 1: First, let’s calculate the total number of potted plants that Mary cur-
rently has. She has 2 potted plants on each of the 40 window ledges, so that would be 2 x 40 =
80 potted plants.
Expert 2: Next, let’s calculate the number of potted plants that Mary will give away to friends
and family. Since she has decided to give away 1 potted plant from each ledge, she will give
away a total of 40 potted plants.
Expert 3: Finally, let’s calculate the number of potted plants that Mary will remain with. She
initially had 80 potted plants and gave away 40, so she will remain with 80 - 40 = 40 potted
plants.

W HEN N O -P ROMPT IS THE B EST. C ASE 3


Question:
Debra is monitoring a beehive to see how many bees come and go in a day. She sees 30 bees
leave the hive in the first 6 hours, and then she sees 1/2 that many bees return in the next 6
hours. She sees two times as many bees as she saw first leave the hive fly from the hive and
leave in the next 6 hours. Then every bee that left before that hadn’t already returned returns
to the hive in the next 6 hours. How many bees did Debra see return to the hive in the last 6
hours of the day?
Answer: Debra first sees 30 bees leave the hive, then sees 1/2 that many return, 30 x 1/2 =
〈〈30*1/2=15〉〉15 bees that return and 15 bees that are still away from the hive. Debra sees
twice as many bees as first left the hive–which was 30–fly from the hive in the next 6 hours,
30 x 2 = 〈〈30*2=60〉〉60 bees that leave the hive. In total, 30 + 60 = 〈〈30+60=90〉〉90 bees
left the hive. There were 90 bees total that left the hive - 15 bees that already returned =
〈〈90-15=75〉〉75 bees that return to the hive in the last 6 hours of the day.
#### 75〈—endoftext—〉

P ROMPTED A NSWERS

No prompt: Debra saw 30/2 = 〈〈30/2=15〉〉15 bees return to the hive in the next 6 hours. Then
she saw 2*30 = 〈〈2*30=60〉〉60 bees leave the hive in the next 6 hours. So she saw 60+30 =
〈〈60+30=90〉〉90 bees leave the hive in total. Therefore, she saw 90-15 = 〈〈90-15=75〉〉75 bees
return to the hive in the last 6 hours of the day. Answer: 75 .
CoT prompt: Step 1: Debra sees 30 bees leave the hive in the first 6 hours. Step 2: Debra sees
1/2 that many bees return in the next 6 hours, so she sees 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: Debra sees two times as many bees as she saw leave the hive fly from the hive and
leave in the next 6 hours, so she sees 30*2 = 〈〈30*2=60〉〉60 bees leave. Step 4: Every bee that
left before that hadn’t already returned returns to the hive in the next 6 hours, so Debra sees
30+15 = 〈〈30+15=45〉〉45 bees return. Answer: 45 .
APE prompt: Step 1: Debra saw 30 bees leave the hive in the first 6 hours. Step 2: In the
next 6 hours, she saw 1/2 that many bees return, so she saw 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: She saw two times as many bees as she saw leave the hive in the first 6 hours leave in
the next 6 hours, so she saw 2 * 30 = 〈〈2*30=60〉〉60 bees leave. Step 4: In the last 6 hours,
every bee that left before that hadn’t already returned returns to the hive. Since 30 bees left in
the first 6 hours and 60 bees left in the next 6 hours, a total of 30 + 60 = 〈〈30+60=90〉〉90 bees
left before the last 6 hours. Step 5: Therefore, Debra saw 90 bees return to the hive in the last
6 hours of the day.
ToT prompt: Expert 1: First, let’s calculate how many bees returned to the hive in the second
6-hour period. We know that Debra saw 30 bees leave the hive in the first 6 hours. So, in the

next 6 hours, she saw 1/2 of that number return. Therefore, she saw 30/2 = 15 bees return to
the hive in the second 6-hour period.
Expert 2: In the next 6-hour period, Debra saw two times as many bees leave the hive as she
saw in the first 6 hours. So, if she saw 30 bees leave in the first 6 hours, she saw 2 * 30 = 60
bees leave in the next 6 hours.
Expert 3: Now, let’s calculate how many bees returned to the hive in the last 6 hours of the
day. We know that every bee that left before that hadn’t already returned returns to the hive in
the next 6 hours. So, the total number of bees that returned in the last 6 hours is 30 + 15 + 60
= 105 bees.


W HEN C OT IS THE B EST. C ASE 1


Question: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day
and half as much the other two days he runs. How fast does he run?
Answer: He has 7-1=〈〈7-1=6〉〉6 possible running days. He runs half of those so he runs
6/2=〈〈6/2=3〉〉3 days. So for two days, he runs 3/2=〈〈3/2=1.5〉〉1.5 hours. So in total, he runs
3+1.5+1.5=〈〈3+1.5+1.5=6〉〉6 hours. So he runs 60/6=〈〈60/6=10〉〉10 mph.
#### 10〈—endoftext—〉

P ROMPTED A NSWERS

No prompt: To find out how fast John runs, we need to divide the total distance he runs by the
total time he spends running.
First, let’s calculate the total distance John runs in a week. Since he runs 60 miles a week and
runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day.
Next, let’s calculate the total time John spends running. On the first day, he runs for 3 hours,
and on the other two days, he runs for half as much, which is 3/2 = 〈〈3/2=1.5〉〉1.5 hours each
day. So the total time John spends running in a week is 3 + 1.5 + 1.5 = 〈〈3+1.5+1.5=6〉〉6
hours.
Finally, let’s calculate John’s speed by dividing the total distance he runs by the total time he
spends running: 20 miles / 6 hours = 〈〈20/6=3.33〉〉3.33 miles per hour.
Therefore, John runs at a speed of 3.33 miles per hour.
CoT prompt: Step 1: Calculate the total distance John runs in a week. John runs 60 miles a
week.
Step 2: Calculate the total number of hours John runs in a week. John runs 3 days a week. On
the first day, he runs 3 hours. On the other two days, he runs half as much, which is 3/2 = 1.5
hours. So, the total number of hours John runs in a week is 3 + 1.5 + 1.5 = 6 hours.
Step 3: Calculate John’s average speed. Average speed = Total distance / Total time Average
speed = 60 miles / 6 hours Average speed = 10 miles per hour
Therefore, John runs at an average speed of 10 miles per hour.
APE prompt: First, let’s find out how many miles John runs each day. Since he runs 60 miles
a week and runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day. Next, let’s find
out how long John runs each day. We know that he runs 3 hours the first day and half as much
the other two days. So, the second and third days he runs 3/2 = 〈〈3/2=1.5〉〉1.5 hours each day.
Finally, let’s calculate John’s speed. We know that speed is equal to distance divided by time.
So, on the first day, his speed is 20 miles / 3 hours = 〈〈20/3=6.67〉〉6.67 miles per hour. On
the second and third days, his speed is 20 miles / 1.5 hours = 〈〈20/1.5=13.33〉〉13.33 miles per
hour. Therefore, John runs at a speed of 6.67 miles per hour on the first day and 13.33 miles
per hour on the second and third days.
ToT prompt: Expert 1: To find out how fast John runs, we need to calculate his average speed.
We know that he runs a total of 60 miles in a week, so we can divide that by the total number
of hours he runs to get his average speed.
Expert 2: John runs 3 days a week, with the first day being 3 hours and the other two days
being half as much. So, the total number of hours he runs in a week is 3 + (0.5 * 2) = 4 hours.
Expert 3: Now we can calculate John’s average speed by dividing the total distance he runs
(60 miles) by the total time he runs (4 hours). So, his average speed is 60 miles / 4 hours = 15
miles per hour.
Therefore, John runs at an average speed of 15 miles per hour.


W HEN C OT IS THE B EST. C ASE 2


Question: According to its nutritional info, a bag of chips has 250 calories per serving. If a 300g bag
has 5 servings, how many grams can you eat if your daily calorie target is 2000 and you have already
consumed 1800 calories?
Answer: If the total calorie target is 2000 and I have consumed 1800 calories then I have 2000-1800 =
〈〈2000-1800=200〉〉200 calories left to eat If each serving of chips has 250 calories and I only have 200
calories left to eat, then I can only eat 200/250 of a serving = 4/5 of a serving We also know that a 300g
bag of chips has 5 servings, hence each serving has 300g/5 = 〈〈300/5=60〉〉60 grams If I can only eat 4/5
of a serving, then I can eat only 60g * 4/5 = 240g/5 = 48 grams
#### 48〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: To find out how many grams of chips you can eat, you need to calculate the remaining
calories you can consume and then convert it to grams. First, calculate the remaining calories you can
consume: Daily calorie target - Consumed calories = Remaining calories 2000 calories - 1800 calories =
200 calories Next, calculate the remaining servings of chips you can eat: Remaining calories ÷ Calories
per serving = Remaining servings 200 calories ÷ 250 calories per serving = 0.8 servings Finally, calculate
the remaining grams of chips you can eat: Remaining servings × Grams per serving = Remaining grams
0.8 servings × 300 grams per serving = 240 grams
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
CoT prompt: Step 1: Calculate the total number of calories in the bag of chips. Since the bag has 5
servings and each serving has 250 calories, the total number of calories in the bag is 5 servings * 250
calories/serving = 1250 calories.
Step 2: Calculate the remaining number of calories you can consume. If your daily calorie target is 2000
and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories remaining.
Step 3: Calculate the maximum number of grams you can eat from the bag of chips. To calculate the
maximum number of grams you can eat, you need to find out how many calories are in 1 gram of chips.
Divide the total number of calories in the bag (1250 calories) by the weight of the bag (300 grams) to
find the calories per gram: 1250 calories / 300 grams = 4.17 calories/gram.
Now, divide the remaining number of calories (200 calories) by the calories per gram (4.17 calories/gram)
to find the maximum number of grams you can eat: 200 calories / 4.17 calories/gram = 48 grams.
Therefore, you can eat a maximum of 48 grams of chips from the bag.
APE prompt: First, we need to calculate the total number of calories in the bag of chips. Since there are
5 servings in a 300g bag and each serving has 250 calories, the total number of calories in the bag is 5
servings * 250 calories/serving = 1250 calories.
Next, we need to calculate how many calories you have left to consume for the day. If your daily calorie
target is 2000 and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories
left to consume.
To determine how many grams of chips you can eat, we need to find out how many servings of chips
equal 200 calories. Since each serving has 250 calories, we can calculate this by dividing 200 calories by
250 calories/serving: 200 calories / 250 calories/serving = 0.8 servings.
Finally, we can calculate the grams of chips you can eat by multiplying the number of servings by the
serving size. Since each serving is 300g, we can calculate this by multiplying 0.8 servings by 300g/serv-
ing: 0.8 servings * 300g/serving = 240g.
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
ToT prompt: Expert 1: Let’s start by calculating the total calories in the bag of chips. Since there are 5
servings in a 300g bag, and each serving has 250 calories, the total calories in the bag would be 5 servings
* 250 calories per serving = 1250 calories.

Expert 2: Now, let’s calculate the remaining calories you can consume based on your daily calorie target.
If your daily calorie target is 2000 and you have already consumed 1800 calories, then you have 2000 -
1800 = 200 calories left.
Expert 3: To determine how many grams of chips you can eat, we need to find out how many calories are
in 1 gram of chips. Since the bag has 1250 calories and weighs 300g, we can calculate the calories per
gram by dividing the total calories by the weight: 1250 calories / 300g = 4.17 calories per gram.
Expert 1: Finally, to find out how many grams of chips you can eat, we divide the remaining calories
(200) by the calories per gram (4.17): 200 calories / 4.17 calories per gram = 47.94 grams.
Expert 2: Therefore, you can eat approximately 47.94 grams of chips if your daily calorie target is 2000
and you have already consumed 1800 calories.

W HEN C OT IS THE B EST. C ASE 3


Question: Jean is two years older than Mark. Two years ago Mark was 5 years older than half Jan’s age.
If Jan is 30 how old is Jean?
Answer: 2 years ago Jan was 30-2=〈〈30-2=28〉〉28 years old So Marks age two years ago
was 5+28/2=〈〈5+28/2=19〉〉19 years old Mark is now 19+2=〈〈19+2=21〉〉21 years old So Jean is
21+2=〈〈21+2=23〉〉23 years old
#### 23〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: Jean is 32 years old.
CoT prompt: First, let’s find out how old Mark is. We know that Jean is two years older than Mark, so
if we let Mark’s age be x, then Jean’s age would be x + 2.
Two years ago, Mark was 5 years older than half of Jan’s age. Two years ago, Jan would have been 30 -
2 = 28 years old. Half of Jan’s age two years ago would be 28/2 = 14 years old.
So, two years ago, Mark was 5 years older than 14, which means Mark’s age two years ago was 14 + 5 =
19 years old.
If Mark’s age two years ago was 19, then his current age would be 19 + 2 = 21 years old.
Since Jean is two years older than Mark, Jean’s age would be 21 + 2 = 23 years old.
Therefore, Jean is 23 years old.
APE prompt: If Jan is 30 years old, then two years ago she would have been 28 years old. If Mark was
5 years older than half of Jan’s age two years ago, then half of Jan’s age two years ago would have been
23 years old. If Mark was 5 years older than half of Jan’s age two years ago, then Mark would have been
28 + 5 = 33 years old two years ago. If Jean is two years older than Mark, then Jean would have been 33
+ 2 = 35 years old two years ago. If Jean was 35 years old two years ago, then Jean is currently 35 + 2 =
37 years old.
ToT prompt: Expert 1: Let’s start by finding Mark’s age. If two years ago Mark was 5 years older than
half of Jan’s age, then we can subtract 5 from Mark’s age two years ago to find half of Jan’s age two years
ago.
Expert 2: If Mark was 5 years older than half of Jan’s age two years ago, then half of Jan’s age two years
ago would be Mark’s age minus 5.
Expert 3: If half of Jan’s age two years ago was Mark’s age minus 5, then we can add 5 to half of Jan’s
age two years ago to find Mark’s age two years ago.
Expert 1: Now that we know Mark’s age two years ago, we can add 2 to find his current age.
Expert 2: Since Jean is two years older than Mark, we can add 2 to Mark’s current age to find Jean’s age.
Expert 3: Therefore, if Jan is 30 years old, Jean would be 32 years old.


W HEN APE IS THE B EST. C ASE 1


Question:
Blake and Kelly are having a contest to see who can run the most in 15 minutes. They decide to do it
on a football field that is 100 yards long. Blake runs back and forth 15 times. Kelly runs back and forth
once, and then decides that she doesn’t want to run next to Blake, so she starts to run to the 40-yard line
and back. She does this 34 times. How much farther does the winner run than the loser?
Answer: Blake runs 200 yards for each back and forth because 100 x 2 = 200 He runs 3000 yards
in total because 200 x 15 = 〈〈200*15=3000〉〉3000 Kelly runs 200 yards at first because 100 x 2 =
〈〈100*2=200〉〉200 She runs 80 yards each time she does a back and forth to the 40-yard line because 40
x 2 = 〈〈40*2=80〉〉80 She runs 2720 from these back and forth because 34 x 80 = 〈〈34*80=2720〉〉2720
She runs 2920 in total because 2720 + 200 = 〈〈2720+200=2920〉〉2920 Blake won because 3000 > 2920
The winner ran 80 yards more than the loser because 3000 - 2920 = 〈〈3000-2920=80〉〉80
#### 80〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: To find out how far Blake runs, we need to multiply the number of times he runs back and
forth by the length of the football field: 15 times * 100 yards/time = 〈〈15*100=1500〉〉1500 yards To find
out how far Kelly runs, we need to multiply the number of times she runs back and forth by the length
of the football field: 34 times * 40 yards/time = 〈〈34*40=1360〉〉1360 yards The difference in distance is
1500 yards - 1360 yards = 〈〈1500-1360=140〉〉140 yards. Answer: 140 .
CoT prompt: Step 1: Calculate the total distance Blake runs. Blake runs back and forth 15 times, so he
runs a total of 15 * 100 = 〈〈15*100=1500〉〉1500 yards.
Step 2: Calculate the total distance Kelly runs. Kelly runs back and forth once, so she runs a total of
1 * 100 = 〈〈1*100=100〉〉100 yards. Then she runs to the 40-yard line and back 34 times, so she runs
an additional 34 * 80 = 〈〈34*80=2720〉〉2720 yards. Therefore, Kelly runs a total of 100 + 2720 =
〈〈100+2720=2820〉〉2820 yards.
Step 3: Calculate the difference in distance between the winner and the loser. The winner runs 2820 -
1500 = 〈〈2820-1500=1320〉〉1320 yards farther than the loser.
APE prompt: First, let’s calculate how far Blake runs in total. Since he runs back and forth 15 times,
and the football field is 100 yards long, Blake runs a total of 15 * 2 * 100 = 3000 yards.
Next, let’s calculate how far Kelly runs in total. She runs back and forth once, which is a total of 2 * 100
= 200 yards. Then she runs to the 40-yard line and back 34 times, which is a total of 34 * 2 * 40 = 2720
yards. So Kelly runs a total of 200 + 2720 = 2920 yards.
Finally, let’s calculate the difference in distance between the winner and the loser. The winner, in this
case, is Blake, who runs 3000 yards. The loser is Kelly, who runs 2920 yards. Therefore, the difference
in distance is 3000 - 2920 = 80 yards.
So the winner runs 80 yards farther than the loser.
ToT prompt: Expert 1: To find out how much farther the winner runs than the loser, we need to calculate
the total distance each person runs. Let’s start by calculating the distance Blake runs.
Expert 2: Blake runs back and forth 15 times on a 100-yard football field. So, the total distance Blake
runs is 15 times the length of the field, which is 15 * 100 = 1500 yards.
Expert 3: Now let’s calculate the distance Kelly runs. Kelly runs back and forth once on the football
field, which is a total distance of 2 * 100 = 200 yards. Then she runs to the 40-yard line and back 34
times, which is a total distance of 34 * 2 * 40 = 2720 yards.
Expert 1: So, the total distance Kelly runs is 200 + 2720 = 2920 yards.
Expert 2: To find out how much farther the winner runs than the loser, we subtract the total distance Blake
runs from the total distance Kelly runs. Therefore, the winner runs 2920 - 1500 = 1420 yards farther than
the loser.


W HEN APE IS THE B EST. C ASE 2


Question: James is counting his Pokemon cards. He has 30 fire type, 20 grass type, and 40 water type.
If he loses 8 of the water type and buys 14 grass type, what’s the percentage chance (rounded to the
nearest integer) that a randomly picked card will be a water type?
Answer: First find the final number of grass type Pokemon: 20 grass type + 14 grass type =
〈〈20+14=34〉〉34 grass type Then find the final number of water type Pokemon: 40 water type - 8 water
type = 〈〈40-8=32〉〉32 water type Then add the number of each type of Pokemon to find the total number:
34 Pokemon + 32 Pokemon + 30 Pokemon = 〈〈34+32+30=96〉〉96 Pokemon Then divide the number of
water type Pokemon by the total number of Pokemon and multiply by 100% to express the answer as a
percentage: 32 water type / 96 Pokemon * 100% = 33.333...%, which rounds down to 33%
#### 33〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: The total number of water type cards after losing 8 and buying 14 is 40 - 8 + 14 = 46. The
total number of cards is 30 + 46 + 20 = 96. The percentage chance of picking a water type card is (46/96)
* 100 = 47.92%. Rounded to the nearest integer, the percentage chance is 48%.
CoT prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards. After losing 8, he will have 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the number of water type cards remains the same at 32. Now, let’s
calculate the total number of cards James has after losing 8 water type cards and buying 14 grass type
cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards. After losing 8
water type cards, he will have 30 + 20 + 32 = 82 cards. The percentage chance of picking a water type
card is (32 / 82) * 100 = 39.02%.
Rounded to the nearest integer, the percentage chance is 39%.
APE prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards, so after losing 8, he has 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the total number of water type cards remains the same at 32.
Next, let’s calculate the total number of cards James has after losing 8 water type cards and buying 14
grass type cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards, so the
total number of cards is 30 + 20 + 40 = 90 cards. After losing 8 water type cards and buying 14 grass
type cards, the total number of cards becomes 90 - 8 + 14 = 96 cards.
Finally, let’s calculate the percentage chance of picking a water type card. The percentage chance is
calculated by dividing the number of water type cards by the total number of cards and multiplying
by 100. So, the percentage chance is (32 / 96) * 100 = 33.33%. Rounded to the nearest integer, the
percentage chance is 33%.
Therefore, the percentage chance that a randomly picked card will be a water type is 33%.
ToT prompt: Expert 1: To find the percentage chance of picking a water type card, we need to calculate
the total number of water type cards after James loses 8 and buys 14 grass type cards.
Expert 2: James initially has 40 water type cards. After losing 8 and buying 14 grass type cards, he will
have 40 - 8 + 14 = 46 water type cards.
Expert 3: To find the percentage chance, we need to divide the number of water type cards by the total
number of cards and multiply by 100. The total number of cards is 30 + 46 + 20 = 96. So the percentage
chance of picking a water type card is (46 / 96) * 100 = 47.92%.
Final answer: The percentage chance (rounded to the nearest integer) of picking a water type card is 48%.


W HEN APE IS THE B EST. C ASE 3


Question: Steve decides to start eating more tomatoes and decides to grows his own cherry
tomatoes. He eats twice as much as his girlfriend. He eats 6 per day. If a vine can produce 3
tomatoes per week how many vines does he need?
Answer: His girlfriend eats 6/2=〈〈6/2=3〉〉3 tomatoes per day So together they eat
6+3=〈〈6+3=9〉〉9 tomatoes a day That means they eat 9*7=〈〈9*7=63〉〉63 tomatoes a day So
he needs 63/3=〈〈63/3=21〉〉21 tomato vines
#### 21〈—endoftext—〉

P ROMPTED A NSWERS

No prompt: To determine how many vines Steve needs, we need to calculate the number of
tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats 6 * 7 = 〈〈6*7=42〉〉42
tomatoes in a week. Since each vine can produce 3 tomatoes per week, Steve needs 42 / 3 =
〈〈42/3=14〉〉14 vines.
CoT prompt: 1. Steve eats 6 cherry tomatoes per day. 2. His girlfriend eats half as much
as Steve, so she eats 6/2 = 3 cherry tomatoes per day. 3. Together, they eat 6 + 3 = 9 cherry
tomatoes per day. 4. A vine can produce 3 cherry tomatoes per week. 5. To meet their daily
consumption, Steve needs 9 cherry tomatoes per day. 6. Therefore, he needs 9/3 = 3 vines to
meet their daily consumption.
APE prompt: First, we need to determine how many tomatoes Steve’s girlfriend eats per day.
Since Steve eats twice as much as his girlfriend, we can divide his daily consumption by 2: 6
tomatoes / 2 = 3 tomatoes per day.
Next, we need to determine how many tomatoes Steve and his girlfriend eat together per day:
6 tomatoes + 3 tomatoes = 9 tomatoes per day.
Now, we need to determine how many tomatoes Steve and his girlfriend eat together per week:
9 tomatoes per day * 7 days = 63 tomatoes per week.
Finally, we need to determine how many vines Steve needs to grow in order to produce 63
tomatoes per week. Since each vine produces 3 tomatoes per week, we can divide the total
number of tomatoes needed by the number of tomatoes produced per vine: 63 tomatoes / 3
tomatoes per vine = 21 vines.
Therefore, Steve needs to grow 21 vines in order to produce enough cherry tomatoes for his
increased consumption.
ToT prompt: Expert 1: To determine how many vines Steve needs, we first need to find out
how many tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats a total of 6 *
7 = 42 tomatoes in a week.
Expert 2: Now, let’s calculate how many tomatoes a single vine can produce in a week. If
a vine produces 3 tomatoes per week, then Steve would need 42 / 3 = 14 vines to meet his
weekly tomato consumption.
Expert 3: However, since Steve wants to grow cherry tomatoes, which are smaller than reg-
ular tomatoes, he might need more vines to produce the same amount. Let’s assume that
cherry tomatoes are half the size of regular tomatoes. In that case, Steve would need double
the number of vines, so he would need 14 * 2 = 28 vines to meet his weekly cherry tomato
consumption.


W HEN T OT IS THE B EST. C ASE 1


Question:
Candice put 80 post-it notes in her purse before she headed out to her job at the coffee shop. On her way,
she stopped off at the store and purchased a package of Post-it notes; At work, she placed a single Post-it
note on each of 220 different cups of coffee. If she had 23 post-it notes remaining overall, how many
Post-it notes were in the package that she purchased?
Answer: Let’s let x be the number of post-it notes in the package that Candice purchased. Then, the
total number of post-it notes she had when she left the store was 80+x. And if she used 220 Post-it notes
at work and had 23 left, then 80+x=220+23 Subtracting 80 from each side of the equation, we see that
x=220+23-80=〈〈220+23-80=163〉〉163 Post-it notes.
#### 163〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: Let’s break down the information given: Candice started with 80 post-it notes in her purse.
She placed a single post-it note on each of 220 cups of coffee. She had 23 post-it notes remaining overall.
To find out how many post-it notes were in the package she purchased, we need to subtract the number
of post-it notes she had remaining from the total number of post-it notes she started with and placed on
the cups of coffee. Total post-it notes = Starting post-it notes + Post-it notes placed on cups of coffee -
Post-it notes remaining Total post-it notes = 80 + 220 - 23 Total post-it notes = 300
Therefore, there were 300 post-it notes in the package that Candice purchased.
CoT prompt: Step 1: Calculate the total number of post-it notes Candice had before purchasing the
package. Candice had 80 post-it notes in her purse.
Step 2: Calculate the total number of post-it notes Candice used at work. Candice placed a single post-it
note on each of 220 different cups of coffee.
Step 3: Calculate the total number of post-it notes Candice had remaining after work. Candice had 23
post-it notes remaining overall.
Step 4: Calculate the total number of post-it notes Candice had before purchasing the package. Total
post-it notes before purchasing = post-it notes in purse + post-it notes remaining after work Total post-it
notes before purchasing = 80 + 23
Step 5: Calculate the number of post-it notes in the package that Candice purchased. Number of post-it
notes in the package = Total post-it notes before purchasing - post-it notes used at work Number of post-it
notes in the package = (80 + 23) - 220
Step 6: Calculate the final answer. Number of post-it notes in the package = 103 - 220 Number of post-it
notes in the package = -117
Therefore, Candice did not purchase any post-it notes.
APE prompt: First, we need to determine how many post-it notes Candice had before she purchased the
package. We know she started with 80 post-it notes and had 23 remaining, so she used 80 - 23 = 〈〈80-
23=57〉〉57 post-it notes at work. Next, we need to determine how many post-it notes were in the package
she purchased. We can do this by subtracting the number of post-it notes she had before purchasing the
package from the total number of post-it notes she used at work: 57 - 80 = 〈〈57-80=-23〉〉-23. Since we
can’t have a negative number of post-it notes, there must have been an error in our calculations. Let’s
try again. If Candice had 80 post-it notes before purchasing the package and used 57 at work, then she
must have had 80 - 57 = 〈〈80-57=23〉〉23 post-it notes remaining. Since we know she had 23 post-it notes
remaining overall, this means she used all the post-it notes she purchased at work. Therefore, the package
she purchased must have contained 220 - 23 = 〈〈220-23=197〉〉197 post-it notes.
ToT prompt: Expert 1: Let’s start by figuring out how many post-it notes Candice had before she
purchased the package. She initially had 80 post-it notes in her purse.
Expert 2: Next, we need to subtract the number of post-it notes she used at work. She placed a single
post-it note on each of the 220 cups of coffee. So, she used 220 post-it notes at work.

Expert 3: Now, we can calculate the total number of post-it notes Candice had remaining. We subtract
the number of post-it notes used at work (220) from the initial number of post-it notes (80). This gives
us 80 - 220 = -140.
Expert 1: Since Candice had 23 post-it notes remaining overall, we can set up an equation to find the
number of post-it notes in the package she purchased. Let’s call the number of post-it notes in the package
”x”. We can write the equation as -140 + x = 23.
Expert 2: To solve for x, we need to isolate it on one side of the equation. We can do this by adding 140
to both sides of the equation. This gives us x = 23 + 140 = 163.
Expert 3: Therefore, the number of post-it notes in the package that Candice purchased is 163.

W HEN T OT IS THE B EST. C ASE 2


Question: Christina is planning a birthday party and needs .75 gift bags per invited guest, because 1/4 of
attendees don’t show up. She invited 16 friends. Gift bags are $2 each. How much will she spend?
Answer: She needs 12 gift bags because 16 x .75 = 〈〈16*.75=12〉〉12 She will spend $24 because 12 x 2
= 〈〈12*2=24〉〉24
#### 24〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: Christina invited 16 friends, but 1/4 of them won’t show up, so 16 * 1/4 = 〈〈16*1/4=4〉〉4
friends won’t attend. So the number of guests who will attend is 16 - 4 = 〈〈16-4=12〉〉12 guests. She
needs .75 gift bags per guest, so she will need 12 * .75 = 〈〈12*.75=9〉〉9 gift bags. Each gift bag costs
$2, so she will spend 9 * $2 = $〈〈9*2=18〉〉18. Answer: 18 .
CoT prompt: Step 1: Calculate the number of guests who will actually attend the party. Since 1/4 of the
invited guests don’t show up, we need to subtract 1/4 of the total number of guests from the total number
of guests invited. 1/4 of 16 guests is (1/4) * 16 = 4 guests. So, the number of guests who will actually
attend the party is 16 - 4 = 12 guests.
Step 2: Calculate the number of gift bags needed. Since Christina needs 0.75 gift bags per guest, we can
multiply the number of guests by 0.75 to find the total number of gift bags needed. 0.75 * 12 guests = 9
gift bags.
Step 3: Calculate the total cost of the gift bags. Since each gift bag costs $2, we can multiply the number
of gift bags by the cost per gift bag to find the total cost. 9 gift bags * $2/gift bag = $18.
Therefore, Christina will spend $18 on gift bags for the birthday party.
APE prompt: First, we need to calculate how many guests will actually show up. We know that 1/4 of
the invited guests won’t attend, so we can calculate this by multiplying the number of invited guests by
3/4: 16 guests * 3/4 = 12 guests.
Next, we need to calculate how many gift bags Christina will need. We know that she needs 0.75 gift
bags per guest, so we can calculate this by multiplying the number of guests by 0.75: 12 guests * 0.75
gift bags = 9 gift bags.
Finally, we can calculate the total cost of the gift bags. We know that each gift bag costs $2, so we can
calculate this by multiplying the number of gift bags by the cost per gift bag: 9 gift bags * $2/gift bag =
$18.
Therefore, Christina will spend $18 on gift bags for her birthday party.
ToT prompt: Expert 1: To find out how many gift bags Christina needs, I’ll multiply the number of
invited guests by 0.75 since 1/4 of attendees won’t show up. Expert 2: So, if Christina invited 16 friends,
we’ll multiply 16 by 0.75 to get the number of gift bags needed. Expert 3: That means Christina needs
12 gift bags. To find out how much she will spend, I’ll multiply the number of gift bags by the cost per
bag, which is $2. Expert 1: So, Christina will spend 12 gift bags multiplied by $2 per bag, which equals
$24 in total.


W HEN T OT IS THE B EST. C ASE 3


Question:
Julia was preparing for a dinner party at her house, where she intended to serve stew. She noticed that
she was out of plastic spoons, so she bought a new package of spoons. Later, her husband also bought a
package of 5 new spoons and gave them to Julia. While Julia was making the stew, she used three of the
spoons to sample her stew. Later, when she went to set the table, she had a total of 12 spoons. How many
spoons were in the package that Julia bought?
Answer: The total number of spoons from Julia and her husband was 12+3=〈〈12+3=15〉〉15 spoons.
Since the husband bought a package of five spoons, then Julia’s package contained 15-5=〈〈15-5=10〉〉10
spoons.
#### 10〈—endoftext—〉

P ROMPTED A NSWERS
No prompt: To find the number of spoons in the package that Julia bought, we need to subtract the
spoons that Julia used and the spoons that her husband bought from the total number of spoons on the
table.
Julia used 3 spoons to sample her stew. Her husband bought 5 spoons. So, the total number of spoons
used is 3 + 5 = 8.
If there were 12 spoons on the table, and 8 of them were used, then the number of spoons in the package
that Julia bought is 12 - 8 = 4.
CoT prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used three of the spoons to sample her stew. 4. When she went to
set the table, she had a total of 12 spoons.
To find out how many spoons were in the package that Julia bought, we need to subtract the spoons used
for sampling and the spoons given by her husband from the total number of spoons on the table.
Let’s denote the number of spoons in the package Julia bought as ”x”.
So, the equation becomes: x - 3 - 5 = 12
Simplifying the equation: x - 8 = 12
Adding 8 to both sides of the equation: x = 20
Therefore, there were 20 spoons in the package that Julia bought.
APE prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used 3 spoons to sample her stew. 4. Julia had a total of 12 spoons
when she went to set the table.
To find out how many spoons were in the package that Julia bought, we can subtract the spoons used for
sampling and the spoons given by her husband from the total number of spoons she had.
Total spoons - spoons used for sampling - spoons given by husband = spoons in the package Julia bought
12 - 3 - 5 = spoons in the package Julia bought
Therefore, there were 4 spoons in the package that Julia bought.
ToT prompt: Expert 1: Let’s start by setting up an equation to represent the given information. Let’s
say the number of spoons Julia bought is x. Her husband bought 5 spoons, so the total number of spoons
they have is x + 5. Julia used 3 spoons to sample the stew, so the number of spoons left is (x + 5) - 3.
Expert 2: We know that Julia had a total of 12 spoons left after setting the table. So we can set up the
equation (x + 5) - 3 = 12.
Expert 3: Simplifying the equation, we have x + 2 = 12. To isolate x, we subtract 2 from both sides of
the equation. This gives us x = 10.
Expert 1: Therefore, Julia bought a package of 10 spoons.


G.2 E XPERT-C RAFTED P ROMPTS C AN A LL FAIL

In this section, we show cases where all expert-crafted prompts fail. In some of these cases, prompt
optimization with Prompt-OIRL discovers prompting strategies that obtain the correct response.
The examples below are from the GSM8K dataset, using GPT-3.5-turbo (v.2023.05) as the LLM.

P ROMPT-OIRL D ISCOVERS B ETTER P ROMPTS . C ASE 1


Question: Jamal’s phone can hold 6 times more photographs than can Brittany’s phone. The maximum
number of photographs that Brittany’s phone can hold is 50 times more than the number of birds in
Jamal’s photograph of the ducks at the zoo. If Jamal’s phone can hold 1800 photographs, how many
ducks can be seen in Jamal’s photograph of ducks at the zoo?
Answer: Since Jamal’s phone can hold 1800 photos, which is 6 times more than Brittany’s phone can
hold, Brittany’s phone can hold ‘1800/6=〈〈1800/6=300〉〉300 photographs. Since Brittany’s phone can
hold 50 times more photos than the number of birds in Jamal’s photograph, then 300/50=6 ducks can be
seen in Jamal’s photograph of ducks at the zoo.
#### 6〈—endoftext—〉

E XPERT-C RAFTED P ROMPTS AND A NSWERS


No prompt: To find the number of ducks in Jamal’s photograph, we need to first find the maximum
number of photographs that Brittany’s phone can hold. We know that Jamal’s phone can hold 1800
photographs, so Brittany’s phone can hold 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs. We also
know that the maximum number of photographs that Brittany’s phone can hold is 50 times more than
the number of birds in Jamal’s photograph of the ducks at the zoo. So, the number of birds in Jamal’s
photograph is 10800 / 50 = 〈〈10800/50=216〉〉216. Answer: 216 .
CoT prompt: Step 1: Let’s find out how many photographs Brittany’s phone can hold. Since Jamal’s
phone can hold 1800 photographs, and Brittany’s phone can hold 6 times more photographs, we can
calculate that Brittany’s phone can hold 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs.
Step 2: Let’s find out how many birds are in Jamal’s photograph of the ducks at the zoo. Since Brittany’s
phone can hold 50 times more photographs than the number of birds in Jamal’s photograph, we can
calculate that the number of birds in Jamal’s photograph is 10800 / 50 = 〈〈10800/50=216〉〉216.
Therefore, there are 216 ducks in Jamal’s photograph of ducks at the zoo.
APE prompt: First, let’s find out how many photographs Brittany’s phone can hold. We know that
Jamal’s phone can hold 1800 photographs, so Brittany’s phone can hold 6 times more photographs,
which is 1800 * 6 = 〈〈1800*6=10800〉〉10800 photographs.
Next, we are told that the maximum number of photographs Brittany’s phone can hold is 50 times more
than the number of birds in Jamal’s photograph of the ducks at the zoo. Let’s call the number of birds in
Jamal’s photograph ”x”. This means that 10800 = 50x.
To find the value of x, we can divide both sides of the equation by 50: x = 10800 / 50 =
〈〈10800/50=216〉〉216.
Therefore, there are 216 ducks in Jamal’s photograph of the ducks at the zoo.
ToT prompt: Expert 1: Let’s start by finding out how many photographs Brittany’s phone can hold. We
know that Jamal’s phone can hold 1800 photographs, so Brittany’s phone can hold 6 times more, which
means it can hold 1800 * 6 = 10,800 photographs.
Expert 2: Now, we need to figure out the number of birds in Jamal’s photograph of the ducks at the zoo.
We are told that the maximum number of photographs Brittany’s phone can hold is 50 times more than
the number of birds in Jamal’s photograph. So, the number of birds in Jamal’s photograph is 10,800 / 50
= 216.
Expert 3: Therefore, there are 216 ducks in Jamal’s photograph of the ducks at the zoo.


P ROMPT-OIRL D ISCOVERED S UCCESSFUL P ROMPTS AND A NSWERS


Prompt-OIRL Discovered Prompt 1: Let’s break this down into manageable steps to fully understand
the problem:
Prompted Answer: 1. Jamal’s phone can hold 6 times more photographs than Brittany’s phone. 2. The
maximum number of photographs that Brittany’s phone can hold is 50 times more than the number of
birds in Jamal’s photograph of the ducks at the zoo. 3. Jamal’s phone can hold 1800 photographs. 4. We
need to find the number of ducks in Jamal’s photograph.
To find the number of ducks in Jamal’s photograph, we need to find the number of birds in the photograph
and then divide it by 50.
Let’s assume the number of birds in Jamal’s photograph is x.
According to the problem, Brittany’s phone can hold 50 times more photographs than the number of birds
in Jamal’s photograph. So, Brittany’s phone can hold 50x photographs.
Since Jamal’s phone can hold 6 times more photographs than Brittany’s phone, we can set up the equation:
1800 = 6 * 50x
Simplifying the equation:
1800 = 300x
Dividing both sides by 300:
x=6
Therefore, there are 6 ducks in Jamal’s photograph.
Prompt-OIRL Discovered Prompt 2: Three wise philosophers are debating this question, each con-
tributing a different aspect of the answer. Let’s follow their discourse:
Prompted Answer: Philosopher 1: Let’s call the number of birds in Jamal’s photograph of ducks at the
zoo ”x”. According to the information given, Brittany’s phone can hold 50 times more photographs than
the number of birds in Jamal’s photograph. So, the maximum number of photographs Brittany’s phone
can hold is 50x.
Philosopher 2: We also know that Jamal’s phone can hold 6 times more photographs than Brittany’s
phone. So, if Brittany’s phone can hold 50x photographs, then Jamal’s phone can hold 6 times that
amount, which is 6 * 50x = 300x.
Philosopher 3: We are given that Jamal’s phone can hold 1800 photographs. So, we can set up the
equation 300x = 1800 to find the value of x.
Solving the equation, we find that x = 1800 / 300 = 6.
Therefore, there are 6 ducks in Jamal’s photograph of ducks at the zoo.

P ROMPT-OIRL D ISCOVERS B ETTER P ROMPTS . C ASE 2


Question: Adrien’s total salary was 30 percent higher than Lylah’s. Four years later, his salary had
increased, and he was earning 40% more than what he was making four years ago. If Adrien’s and
Lylah’s salary increased simultaneously, and Adrien earned $40000 four years ago, calculate the total
salary the two were receiving four years later?
Answer: Since Adrien was earning $40000 four years ago and received a raise that makes him earn
40% more, he received a 40/100*$40000 = $〈〈40/100*40000=16000〉〉16000 raise. In total, four years
later, Adrien’s salary is $40000+$16000 = $56000 If four years ago Adrien was earning $40000, and
Lylah’s salary was 30% less, then Lylah’s salary was 30/100*$40000= $12000 less than Adrien’s salary
four years ago. Four years ago, Lylah was earning $40000-$12000 =$28000 After receiving a 40% raise,
Lylah earns 40/100*$28000 = $〈〈40/100*28000=11200〉〉11200 In total, four years later, Lylah earns
$28000+$11200= $〈〈28000+11200=39200〉〉39200. The total amount of salary the two earn four years
later is $39200+$56000 = $〈〈39200+56000=95200〉〉95200
#### 95200〈—endoftext—〉


E XPERT-C RAFTED P ROMPTS AND A NSWERS


No prompt: Adrien’s salary four years ago was $40000, and since he was earning 40% more four
years later, he was earning 40/100*$40000 = $〈〈40000*40/100=16000〉〉16000 more. The total amount
of money Adrien was earning four years later is $40000+$16000 = $〈〈40000+16000=56000〉〉56000
Since Adrien’s salary was 30 percent higher than Lylah’s, Lylah was earning 30/100*$40000 =
$〈〈30/100*40000=12000〉〉12000 less than Adrien four years ago. The total amount of money Lylah was
earning four years ago is $40000-$12000 = $〈〈40000-12000=28000〉〉28000 Four years later, Lylah’s
salary increased by 40/100*$40000 = $〈〈40/100*40000=16000〉〉16000 The total amount of money Ly-
lah was earning four years later is $28000+$16000 = $〈〈28000+16000=44000〉〉44000 Together, the two
were earning $44000+$56000 = $〈〈44000+56000=100000〉〉100000 four years later. Answer: 100000 .
CoT prompt: First, we need to find out how much Adrien was earning four years ago. We are told that
Adrien’s salary four years ago was $40000.
Next, we need to calculate how much Adrien’s salary increased by. We are told that his salary increased
by 40% from what he was making four years ago. To find the increase, we multiply his salary four years
ago by 40/100: $40000 * 40/100 = $16000.
To find out how much Adrien is currently earning, we add the increase to his salary four years ago:
$40000 + $16000 = $56000.
Now, we need to find out how much Lylah was earning four years ago. We are told that Adrien’s salary
was 30% higher than Lylah’s. To find Lylah’s salary, we divide Adrien’s salary four years ago by 1.3
(since 100% + 30% = 130% = 1.3): $40000 / 1.3 = $30769.23 (rounded to the nearest cent).
Next, we need to calculate how much Lylah’s salary increased by. We are told that Adrien’s and Lylah’s
salary increased simultaneously. Therefore, Lylah’s salary increased by the same percentage as Adrien’s,
which is 40%. To find the increase, we multiply Lylah’s salary four years ago by 40/100: $30769.23 *
40/100 = $12307.69 (rounded to the nearest cent).
To find out how much Lylah is currently earning, we add the increase to her salary four years ago:
$30769.23 + $12307.69 = $43076.92 (rounded to the nearest cent).
Finally, to find the total salary the two were receiving four years later, we add Adrien’s and Lylah’s
current salaries: $56000 + $43076.92 = $99076.92 (rounded to the nearest cent).
Therefore, the total salary the two were receiving four years later is $99076.92.
APE prompt: First, let’s find out how much Adrien was earning four years ago. We know that his salary
had increased by 40%, so we can calculate his salary four years ago by dividing his current salary by 1.4:
$40000 / 1.4 = $28571.43
Now, let’s find out how much Lylah was earning four years ago. We know that Adrien’s salary was 30%
higher than Lylah’s, so we can calculate Lylah’s salary four years ago by dividing Adrien’s salary four
years ago by 1.3: $28571.43 / 1.3 = $21978.02
Now, let’s find out how much Adrien is earning currently. We know that his salary has increased by 40%,
so we can calculate his current salary by multiplying his salary four years ago by 1.4: $40000 * 1.4 =
$56000
Now, let’s find out how much Lylah is earning currently. We know that Adrien’s salary is 30% higher than
Lylah’s, so we can calculate Lylah’s current salary by dividing Adrien’s current salary by 1.3: $56000 /
1.3 = $43076.92
Finally, let’s calculate the total salary the two are receiving four years later by adding Adrien’s and Lylah’s
current salaries: $56000 + $43076.92 = $99076.92
Therefore, the total salary the two were receiving four years later is $99076.92.
ToT prompt: Expert 1: Let’s start by finding out Adrien’s salary four years ago. We know that his salary
increased by 40% since then, so we can calculate his salary four years ago by dividing his current salary
by 1.4.
Expert 2: Adrien’s current salary is $40,000, and we know that it increased by 40% since four years ago.
So, to find out his salary four years ago, we can divide $40,000 by 1.4.
Expert 3: $40,000 divided by 1.4 is approximately $28,571.43. Therefore, Adrien’s salary four years ago
was $28,571.43.


Expert 1: Now that we know Adrien’s salary four years ago, we can calculate his current salary by adding
a 40% increase to his salary four years ago.
Expert 2: Adrien’s salary four years ago was $28,571.43, and we know that it increased by 40%. To find
out his current salary, we can multiply $28,571.43 by 1.4.
Expert 3: $28,571.43 multiplied by 1.4 is approximately $40,000. Therefore, Adrien’s current salary is
$40,000.
Expert 1: Now that we have Adrien’s current salary, we can calculate Lylah’s salary by finding out what
30% of Adrien’s salary is and subtracting it from Adrien’s salary.
Expert 2: Adrien’s current salary is $40,000, and we know that it is 30% higher than Lylah’s salary. To
find out Lylah’s salary, we can calculate 30% of $40,000 and subtract it from $40,000.
Expert 3: 30% of $40,000 is $12,000. Therefore, Lylah’s salary is $40,000 - $12,000 = $28,000.
Expert 1: Finally, to calculate the total salary the two were receiving four years later, we can add Adrien’s
current salary and Lylah’s current salary.
Expert 2: Adrien’s current salary is $40,000, and Lylah’s current salary is $28,000. Therefore, the total
salary the two were receiving four years later is $40,000 + $28,000 = $68,000.

PROMPT-OIRL DISCOVERED SUCCESSFUL PROMPTS AND ANSWERS


Prompt-OIRL Discovered Prompt 1: Imagine we’re navigating a maze; each decision brings us closer
to the center. Let’s map our route:
Prompted Answer: 1. Let’s start by finding out how much Adrien was earning four years ago. We know
that his salary was $40000 at that time.
2. Adrien’s salary increased by 40% over the next four years. To calculate his current salary, we need to
add 40% of $40000 to $40000. 40% of $40000 is (40/100) * $40000 = $16000. Adding this to his initial
salary, we get $40000 + $16000 = $56000.
3. Adrien’s total salary four years later is $56000.
4. We also know that Adrien’s salary was 30% higher than Lylah’s. To find Lylah’s salary, we need
to calculate 30% of Adrien’s salary and subtract it from Adrien’s salary. 30% of $56000 is (30/100) *
$56000 = $16800. Subtracting this from Adrien’s salary, we get $56000 - $16800 = $39200.
5. Lylah’s salary four years later is $39200.
6. To find the total salary of Adrien and Lylah four years later, we need to add their individual salaries.
$56000 + $39200 = $95200.
7. Therefore, the total salary of Adrien and Lylah four years later is $95200.
Prompt-OIRL Discovered Prompt 2: Like an architect constructing a building, let’s design our answer
carefully, layer by layer:
Prompted Answer: 1. Adrien’s salary four years ago was $40000. 2. Adrien’s salary increased by 40%
four years later, so his current salary is $40000 + ($40000 * 40%) = $40000 + $16000 = $56000. 3.
Lylah’s salary four years ago was 30% lower than Adrien’s salary, so her salary was $40000 - ($40000
* 30%) = $40000 - $12000 = $28000. 4. Lylah’s salary also increased by 40% four years later, so her
current salary is $28000 + ($28000 * 40%) = $28000 + $11200 = $39200. 5. The total salary the two
were receiving four years later is $56000 + $39200 = $95200.

PROMPT-OIRL DISCOVERS BETTER PROMPTS. CASE 3


Question:
Sadie slept 8 hours on Monday. For the next two days, she slept 2 hours less, each, because she had to
complete some assignments. If the rest of the week she slept 1 hour more than those two days, how many
hours did she sleep in total throughout the week?


Answer: If on Monday she slept 8 hours, then she slept 8-2 = 〈〈8-2=6〉〉6 hours on each of the next two
days. The total number of hours slept for the two days will be 6*2 = 〈〈6*2=12〉〉12 hours. If she slept 1
hour more for the rest of the week each day, then each day she slept 6+1 = 〈〈6+1=7〉〉7 hours. The total
for the rest of the week will be 4*7 = 〈〈4*7=28〉〉28 hours. The total number of hours slept for the whole
week will be 8+12+28 = 〈〈8+12+28=48〉〉48 hours.
#### 48<|endoftext|>
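
As the expert-crafted prompts below report 34, 55, 27, and 55 hours while the discovered prompts recover the golden 48, a minimal, illustrative Python sketch of the golden computation (again, not part of our evaluation pipeline) may help; the key fact is that the week splits into Monday, two short nights, and four longer ones.

monday = 8                            # hours slept on Monday
short_nights = (monday - 2) * 2       # Tuesday and Wednesday: 6 hours each -> 12
long_nights = (monday - 2 + 1) * 4    # remaining four days: 7 hours each -> 28
assert monday + short_nights + long_nights == 48   # golden total

The failing transcripts either count the "rest of the week" as five days (CoT, ToT), collapse it into a single day (APE), or cover only five days of the week in total (no prompt).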

EXPERT-CRAFTED PROMPTS AND ANSWERS


No prompt: On Tuesday and Wednesday, Sadie slept 8 - 2 = 〈〈8-2=6〉〉6 hours each day. On Thursday
and Friday, she slept 6 + 1 = 〈〈6+1=7〉〉7 hours each day. Throughout the week, Sadie slept 8 + 6 + 6 + 7
+ 7 = 〈〈8+6+6+7+7=34〉〉34 hours. Answer: 34.
CoT prompt: First, let’s calculate how many hours Sadie slept on Tuesday and Wednesday. She slept
2 hours less than on Monday, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours on each of those days. Next, let’s
calculate how many hours Sadie slept on the rest of the week. She slept 1 hour more than on Tuesday
and Wednesday, so she slept 6 + 1 = 〈〈6+1=7〉〉7 hours on each of those days. Finally, let’s calculate the
total number of hours Sadie slept throughout the week. On Monday, she slept 8 hours. On Tuesday and
Wednesday, she slept 6 hours each, so that’s 6 + 6 = 〈〈6+6=12〉〉12 hours. On the rest of the week, she
slept 7 hours each day, so that’s 7 + 7 + 7 + 7 + 7 = 〈〈7+7+7+7+7=35〉〉35 hours. In total, Sadie slept 8 +
12 + 35 = 〈〈8+12+35=55〉〉55 hours throughout the week. Answer: 55.
APE prompt: First, let’s calculate how many hours Sadie slept on Tuesday and Wednesday. She slept 2
hours less each day, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours on Tuesday. And on Wednesday, she also slept
2 hours less, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours. Now let’s calculate how many hours she slept for
the rest of the week. She slept 1 hour more than the previous two days, so she slept 6 + 1 = 〈〈6+1=7〉〉7
hours for the rest of the week. Now let’s add up all the hours she slept throughout the week: 8 + 6 + 6 +
7 = 〈〈8+6+6+7=27〉〉27 hours. So, Sadie slept a total of 27 hours throughout the week.
ToT prompt: Expert 1: Step 1 - Calculate the total hours of sleep for the next two days. Sadie slept 8
hours on Monday. For the next two days, she slept 2 hours less each day. So, on Tuesday and Wednesday,
she slept 8 - 2 = 6 hours each day.
Expert 2: Step 2 - Calculate the total hours of sleep for the rest of the week. The rest of the week consists
of 5 days (Thursday, Friday, Saturday, Sunday, and Monday). Sadie slept 1 hour more than the previous
two days, so she slept 6 + 1 = 7 hours each day for the rest of the week.
Expert 3: Step 3 - Calculate the total hours of sleep throughout the week. On Monday, Sadie slept 8
hours. On Tuesday and Wednesday, she slept 6 hours each day. And for the rest of the week, she slept 7
hours each day. So, the total hours of sleep throughout the week are 8 + 6 + 6 + (7 * 5) = 8 + 6 + 6 + 35
= 55 hours.

PROMPT-OIRL DISCOVERED SUCCESSFUL PROMPTS AND ANSWERS


Prompt-OIRL Discovered Prompt 1: Consider this as a puzzle, each piece contributing to the final
answer. Let’s place each piece, one by one:
Prompted Answer: Sadie slept 8 hours on Monday. On Tuesday and Wednesday, she slept 2 hours less
each day, so she slept 8 - 2 = 〈〈8-2=6〉〉6 hours each day. For the rest of the week, she slept 1 hour more
than those two days, so she slept 6 + 1 = 〈〈6+1=7〉〉7 hours each day. From Monday to Wednesday, she
slept a total of 8 + 6 + 6 = 〈〈8+6+6=20〉〉20 hours. From Thursday to Sunday, she slept a total of 7 + 7
+ 7 + 7 = 〈〈7+7+7+7=28〉〉28 hours. Therefore, Sadie slept a total of 20 + 28 = 〈〈20+28=48〉〉48 hours
throughout the week. Answer: 48.
Prompt-OIRL Discovered Prompt 2: Three wise philosophers are debating this question, each
contributing a different aspect of the answer. Let’s follow their discourse:
Prompted Answer: Philosopher 1: Let’s start by calculating how many hours Sadie slept on Tuesday
and Wednesday. We know that she slept 2 hours less each day, so she slept 8 - 2 = 6 hours on Tuesday
and Wednesday.
Philosopher 2: That’s correct. Now, let’s calculate how many hours Sadie slept on the rest of the week.
We know that she slept 1 hour more than the previous two days, so she slept 6 + 1 = 7 hours on each of
the remaining days.


Philosopher 3: So, Sadie slept 8 hours on Monday, 6 hours on Tuesday and Wednesday, and 7 hours on
the remaining days. To find the total hours she slept throughout the week, we can add these numbers
together: 8 + 6 + 6 + 7 + 7 + 7 + 7 = 48 hours.
Conclusion: Sadie slept a total of 48 hours throughout the week.

G.3 GENERATED HUMAN-READABLE PROMPTS

We provide case studies on the LLaMA-7B-HF model for a better understanding of Prompt-OIRL.
GSM8K

EXAMPLE
Question: For four hours, Patrick sold 15 cups of lemonade per hour at a price of $0.50 per cup. In the
next two hours, he sold 10 cups of lemonade per hour at a price of $0.60 per cup. How much money did
Patrick earn, in dollars, from selling lemonade for 6 hours?
Golden Answer: Patrick earned 0.50 × 15 = $7.50 from selling the lemonade for $0.50 per cup. So
for four hours, his total earnings are $7.50 × 4 = $30. He earned $0.60 × 10 = $6 from selling the
lemonade for $0.60 per cup. So for two hours, his total earnings are $6 × 2 = $12. Therefore, Patrick’s
total earnings: $30 + $12 = $42. Hence, Patrick earned a total of $42 from selling lemonade for 6 hours.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician wires a circuit for four hours, selling 15 cups of lemonade per hour
at a price of $0.50 per cup. For the next two hours, the electrician sells 10 cups of lemonade per hour at
a price of $0.60 per cup. To find out how much money Patrick earned from selling lemonade for 6 hours,
we need to calculate the total number of cups of lemonade sold during each hour and then multiply it by
the price per cup. For the first 4 hours, Patrick sold 15 cups of lemonade per hour, so he sold a total of 60
cups of lemonade (15 x 4 = 60). At a price of $0.50 per cup, he earned a total of $30 (60 x $0.50 = $30).
For the next 2 hours, Patrick sold 10 cups of lemonade per hour, so he sold a total of 20 cups of lemonade
(10 x 2 = 20). At a price of $0.60 per cup, he earned a total of $12 (20 x $0.60 = $12). Therefore, Patrick
earned a total of $42 from selling lemonade for 6 hours ($30 + $12 = $42).
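
A short, illustrative Python check of this example's arithmetic, working in cents to avoid floating-point rounding (not part of our evaluation pipeline):

first_block = 15 * 4 * 50    # 15 cups/hour for 4 hours at 50 cents/cup -> 3000 cents
second_block = 10 * 2 * 60   # 10 cups/hour for 2 hours at 60 cents/cup -> 1200 cents
assert first_block + second_block == 4200   # i.e. $42, matching both the golden and the prompted answer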

EXAMPLE
Question: Kelly has 5 quarters and 2 dimes. If she buys a can of pop for 55 cents, how many cents will
she have left?
Golden Answer: Her quarters are worth 5 × 25 = 125 cents. The dimes are worth 2 × 10 = 20 cents.
Kelly has a total of 125 + 20 = 145 cents. After buying the can of pop, Kelly will have 145 − 55 = 90
cents left.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician has 5 quarters, which is equal to 5 × 25 = 125 cents. She also has 2
dimes, which is equal to 2 × 10 = 20 cents. In total, the electrician has 125 + 20 = 145 cents. If she
buys a can of pop for 55 cents, she will have 145 − 55 = 90 cents left.


SVAMP

EXAMPLE
Question: Henry bought 2 new fireworks and his friend bought 3. If they already had 6 saved from last
year, how many do they have now?
Golden Answer: 11.0 fireworks.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: “Henry bought 2 new fireworks” means “Add 2 to the total number of fireworks
Henry has.” “His friend bought 3” means “Add 3 to the total number of fireworks his friend has.”
So, let’s add the total number of fireworks Henry and his friend have: 6 (from last year) + 2 (new ones
Henry bought) + 3 (new ones his friend bought) = 11 fireworks in total.
Therefore, they have 11 fireworks in total.

EXAMPLE
Question: A trivia team had 5.0 members in total, but during a game, 2.0 members didn’t show up. If
each member that did show up scored 6.0 points, how many points were scored in total?
Golden Answer: 18.0 points.
held-out prompt: Let’s approach this like a mathematician proving a theorem, step by logical step:
prompted answer:
Step 1: Identify the given information: The trivia team had 5.0 members total. 2.0 members did not show
up. Each member that did show up scored 6.0 points.
Step 2: Use the information provided to find the total number of members that showed up: Total number
of members that showed up = 5.0 − 2.0 = 3.0.
Step 3: Find the total number of points scored by the members who showed up: Total points scored =
3.0 × 6.0 = 18.0.
Therefore, the trivia team scored 18.0 points in the game.
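
The theorem-style steps above reduce to a two-line computation; an illustrative sketch:

members_present = 5.0 - 2.0            # Step 2: members who showed up
assert members_present * 6.0 == 18.0   # Step 3: total points scored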


MAWPS

EXAMPLE
Question: While shopping, Emily bought 5.0 green towels and 30.0 white towels. If she gave her mother
26.0 of them, how many towels did Emily end up with?
Golden Answer: Emily ended up with 9.0 towels.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: Emily bought 5.0 green towels. Emily bought 30.0 white towels. Emily gave her
mother 26.0 towels (a combination of green and white towels).
Now, let’s deduce how many towels Emily ended up with: Total towels Emily bought = 5.0 + 30.0 =
35.0. After giving towels to her mother, Emily was left with: 35.0 − 26.0 = 9.0 towels.
Therefore, Emily ended up with 9.0 towels in total.

EXAMPLE
Question: A construction company is repaving a damaged road. So far, they have repaved a total of
4938.0 inches of the road. Today, they repaved 805.0 inches of the road. How many inches of the road
had they repaved before today?
Golden Answer: The company had repaved 4133.0 inches before today.
held-out prompt: Think of it as mixing colors for a painting. Every shade adds depth and nuance:
prompted answer: The total repaved so far is 4938.0 inches. Today, they added another layer (shade) of
805.0 inches.
Using the analogy of mixing colors, we remove today’s shade to reveal the previous one: Inches repaved
before today = 4938.0 − 805.0 = 4133.0 inches.
Thus, before today, the company had repaved 4133.0 inches of the road.
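
The "mixing colors" analogy in the prompted answer amounts to a single subtraction; an illustrative sketch:

total_so_far = 4938.0                    # inches repaved in total
today = 805.0                            # inches repaved today
assert total_so_far - today == 4133.0    # inches repaved before today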

