Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL
ABSTRACT
In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
Figure 1: A motivating example. (Left, Right) No single prompt works best for all queries: the optimal prompt is query-dependent. Yet the search for such prompts is hindered by Challenges 1-2 identified below. Our method optimizes prompts at a query-dependent level during inference, effectively and cost-efficiently.
1 INTRODUCTION
Recent advances in Large Language Models (LLMs), such as ChatGPT (Ouyang et al., 2022), un-
derscore the value of aligning with human preferences (Ouyang et al., 2022; OpenAI, 2023). Using
an offline alignment dataset generated by annotators, the harmlessness, helpfulness, and ability to
follow instructions of LLMs can be largely enhanced (Bai et al., 2022) through the technique of
RLHF (Christiano et al., 2017). That said, even for state-of-the-art general-purpose LLMs such as
GPT-4, solving intricate problems such as arithmetic reasoning can still be challenging. Further
unlocking LLMs’ full potential for those complex tasks remains an open challenge.
Query-Dependent Zero-Shot Prompting for Arithmetic Reasoning: Opportunities and Challenges. Out of the many attempts, prompting — a natural language prefix or instruction that explains how to complete the task — stands out as a lightweight, promising solution for eliciting the capabilities of LLMs without model parameter tuning (Wei et al., 2022). While advances in zero-shot prompting strategies (Kojima et al., 2022) highlight the potential of finding effective query-independent solutions, their reliance on manual crafting and the vast search space over natural language intensify the difficulty of discovering effective prompts (Deng et al., 2022). This study aims to augment the arithmetic reasoning capacities of LLMs, a capability considered crucial (OpenCompass, 2023; Liu et al., 2023a). While there is a wealth of both expert-devised and machine-generated prompts in this domain (Kojima et al., 2022; Zhou et al., 2022b; Yang et al., 2023a), our endeavor is to advance further via prompt optimization. Our approach begins with the articulation of an adjusted objective, subsequently highlighting two inherent challenges.
∗ hs789@cam.ac.uk. Code is available at: https://github.com/vanderschaarlab/Prompt-OIRL
Adjusted Objective: Query-Dependent Prompt Optimization. In the literature, research on zero-shot prompting has mainly focused on finding prompts that work better at a distributional level rather than at an instance level. For instance, while multi-agent debate (Liang et al., 2023; Du et al., 2023) can improve reasoning in general, there are cases where not using a prompt at all is better than using a generally effective prompt. Figure 1 showcases a motivating example. Motivated by the fact that no prompt is perfect for all queries, in this work we set an adjusted objective of query-dependent prompt optimization, rather than distributional-level prompt optimization.
Challenge 1: Inference-Time Evaluation is Hard. A primary challenge arises in evaluating the effectiveness of prompts during inference, especially when the true answer to a query is not known a priori. For arithmetic reasoning, enumerating many potentially effective prompts and querying language models with those prompted queries only yields a batch of candidate answers; determining which response is correct in the absence of a ground-truth label requires extra, non-trivial effort. Additionally, the computational cost associated with this prompt enumeration process is substantial.
Challenge 2: Online Prompt Evaluation and Optimization is Expensive. On the other hand, searching for potentially effective prompts relies heavily on expensive online evaluation. In the literature, evaluating the effectiveness of a proposed prompt requires assessing its performance on multiple datasets and LLMs to show its distributional superiority over others. In the arithmetic reasoning task, using the GPT-3.5-turbo API (OpenAI, 2023) to evaluate a single prompt on a medium-sized dataset with 10k query-answer pairs costs roughly $1, yet learning from trial and error often requires millions of interactions even for tasks with only a few actions to choose from (Schulman et al., 2017). The vast action space of natural language further increases the prohibitive cost of such interactive search (Zhang et al., 2022a; Deng et al., 2022).
Solution: Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL. Having witnessed the significant advancements achieved by expert-crafted and algorithm-discovered prompting in arithmetic reasoning (Zhou et al., 2022b; Pryzant et al., 2023; Kojima et al., 2022), our work seeks to combine existing human prompting knowledge systematically, effectively, and cost-efficiently. Our contributions can be summarized as follows:
Language Model  Such tasks can be performed using a language model $\ell : \mathcal{X} \to \mathcal{Y}$ by feeding queries $x$ to the language model and getting answers $\hat{y} = \ell(x)$. We assume that the quality of these answers can be evaluated using a metric $r : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. For example, in tasks where golden labels are available, this metric can be $r(y^*, \hat{y}) = \mathbb{1}\{\hat{y} = y^*\}$. Generally, a higher value of $r(y^*, \hat{y})$ indicates that $\hat{y}$ matches the expected answer $y^*$ better.
Prompting  It is well known that the performance of a language model in answering queries can be boosted through appropriate prompting — a natural language prefix or instruction that explains how to complete the task. We consider prompts $\pi : \mathcal{X} \to \mathcal{X}$ that map an original query $x$ to a modified prompted query $\pi(x)$. (For clarity, we use prompt to denote the prefix and instructions, and prompted query to denote the final input to language models after prompting.) These prompted queries are then fed into the language model to get better answers $\hat{y} = \ell(\pi(x))$.
Objective  Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{*(i)})\}_{i \in [N]}$ of queries and their expected answers, the objective of query-agnostic zero-shot prompt optimization usually studied in the literature is to find the distributionally optimal prompt $\bar{\pi}^*$ that maximizes the expected quality of answers w.r.t. the metric $r$:
\bar{\pi}^* = \arg\max_{\pi} \; \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\pi(x^{(i)}))\big) \right],    (1)
Instead, in this work we meet the Adjusted Objective by using a query-dependent approach:
\pi^* = \arg\max_{\pi} \; r\big(y^{*(i)}, \ell(\pi(x^{(i)}))\big),    (2)
In a nutshell, Equation (1) seeks a single prompt that achieves good performance on the dataset, whereas Equation (2) seeks different prompts for different queries in the dataset. Clearly, $\pi^*$ should be better than $\bar{\pi}^*$ in the sense that
\mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\pi^*(x^{(i)}))\big) \right] \;\geq\; \mathbb{E}_{(x^{(i)}, y^{*(i)}) \sim \mathcal{D}} \left[ r\big(y^{*(i)}, \ell(\bar{\pi}^*(x^{(i)}))\big) \right].    (3)
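To make the distinction between Equations (1) and (2) concrete, the following sketch (in Python, with placeholder types; the llm, r, and prompt objects are illustrative assumptions, not our implementation) contrasts the two objectives over a finite set of candidate prompts:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a prompt maps a query string to a prompted query string.
Prompt = Callable[[str], str]


def query_agnostic_optimum(
    dataset: List[Tuple[str, str]],          # pairs (x, y*)
    prompts: List[Prompt],
    llm: Callable[[str], str],               # black-box language model l(.)
    r: Callable[[str, str], float],          # reward metric r(y*, y_hat)
) -> Prompt:
    """Equation (1): one prompt maximizing the average reward over the whole dataset."""
    def avg_reward(p: Prompt) -> float:
        return sum(r(y_star, llm(p(x))) for x, y_star in dataset) / len(dataset)
    return max(prompts, key=avg_reward)


def query_dependent_optima(
    dataset: List[Tuple[str, str]],
    prompts: List[Prompt],
    llm: Callable[[str], str],
    r: Callable[[str, str], float],
) -> Dict[str, Prompt]:
    """Equation (2): a (possibly different) best prompt for every query."""
    return {
        x: max(prompts, key=lambda p: r(y_star, llm(p(x))))
        for x, y_star in dataset
    }
```

Note that both routines rely on the golden labels $y^*$ and on calling $\ell$; this is exactly what Challenges 1-2 preclude at inference time, and Prompt-OIRL replaces this oracle evaluation with a learned proxy reward model.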
Prompt-OIRL: Prompting as Offline Inverse RL  In pursuit of the new objective and to address the challenges, we draw inspiration from the successes achieved by RLHF. We highlight that the prompting problem is essentially aligning prompts based on LLM preferences, a departure from RLHF's goal of aligning LLMs with human preferences. For a more in-depth exploration of this perspective, we direct readers to Appendix A.
Aiming for an efficient and economical prompting strategy, we sidestep the complexities of extensive LLM interactions by leveraging readily available datasets and propose offline IRL as a viable learning solution. We outline its three fundamental steps: Step 1, construction of an offline dataset; Step 2, offline query-dependent prompt evaluation through reward modeling; and Step 3, offline query-dependent prompt optimization using the learned reward model.
Step 2: Offline Reward Modeling: Inverse RL without an Environment  Upon initial inspection, it might seem that reward modeling is unnecessary given the reward metric, i.e., $r(y^*, \hat{y}) = \mathbb{1}\{\hat{y} = y^*\}$. However, a more thorough examination of this metric reveals two concerns that limit its broader utilization: first, the reward metric is a function of the language model $\ell$, making every single call to the metric costly; second, the computation of the reward necessitates access to the ground-truth answer $y^*$, which is not available at inference (deployment) time. These are essentially the challenges associated with the input dependency of this reward metric, as demonstrated in Figure 2.
To address those issues, we introduce a parameterized proxy reward model that is a function of the query $x$ and the prompt $\pi$. We denote the proxy reward as $\Upsilon_{\theta}(x, \pi(x))$ with parameters $\theta$. Notably, this proxy reward omits the language model $\ell$ and the ground-truth label from its calculation. By design, its output should align with the true reward $r$. Therefore, we establish a supervised learning objective to minimize the discrepancies between the proxy reward model and the true reward.
Figure 4: Prompt-OIRL addresses the specified Objective and Challenges. It first learns a proxy reward model from the offline demonstration dataset created in the last section. The learned reward model can then be applied at inference time to evaluate prompts in a query-dependent manner without access to the language model; hence, optimizing prompts w.r.t. this proxy reward model addresses all of the issues identified.
In the case of arithmetic reasoning tasks, the reward signal is binary, hence the proxy reward model can be trained with a classification objective that predicts whether a prompt will result in a correct answer when fed to the language model. Specifically, we consider the cross-entropy loss given the demonstration data collected in the previous section:
\mathcal{L}_{CE}(\theta; \mathcal{D}^{\ell}_{dem}) = -\mathbb{E}_{i \in [N], k \sim [K]} \left[ r^{(i,k)} \log \sigma\big(\Upsilon^{(i,k)}_{\theta}\big) + \big(1 - r^{(i,k)}\big) \log \big(1 - \sigma\big(\Upsilon^{(i,k)}_{\theta}\big)\big) \right]    (5)
where we use $\Upsilon^{(i,k)}_{\theta}$ to denote $\Upsilon_{\theta}\big(x^{(i)}, \pi^{(k)}(x^{(i)})\big)$ for conciseness, and $\sigma$ is the sigmoid function.
Different from the online reward signal provided by $r(y^*, \ell(x'))$, which requires expensive interactions with black-box LLMs, the proxy reward model $\Upsilon_{\theta}(x, \pi(x))$ can provide offline, white-box feedback on prompts. Therefore, it is more accessible in guiding the search for a better prompt. Figure 4 illustrates how such a reward model can be learned and leveraged to provide feedback for prompt optimization. To achieve computational efficiency and good performance, we use the embeddings of $x$ and $\pi$ generated by existing LLMs as inputs to the proxy reward model. Empirically, we find that gradient boosting methods (Chen et al., 2015; Ke et al., 2017) significantly outperform neural networks in reward modeling. In our experiment section, we demonstrate the effectiveness of Prompt-OIRL using a minimalist approach, directly applying XGBoost models and leaving further investigation of model selection to future work.
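As a rough illustration of Step 2, the sketch below (with an assumed text-embedding helper and illustrative hyperparameters, not our exact configuration) fits an XGBoost classifier on concatenated query and prompt embeddings to predict the binary reward:

```python
import numpy as np
from xgboost import XGBClassifier


def build_features(queries, prompts, logs, embed):
    """logs: (query index, prompt index, binary reward) triples from the offline demonstrations.
    embed: any fixed text-embedding function from an existing LLM (an assumed helper)."""
    X, y = [], []
    for i, k, r in logs:
        # Proxy reward input: the embeddings of query x_i and prompt pi_k, concatenated.
        X.append(np.concatenate([embed(queries[i]), embed(prompts[k])]))
        y.append(r)
    return np.stack(X), np.array(y)


def train_proxy_reward(queries, prompts, logs, embed):
    """Fit the proxy reward model; the binary logistic objective mirrors Equation (5)."""
    X, y = build_features(queries, prompts, logs, embed)
    model = XGBClassifier(objective="binary:logistic", n_estimators=200, max_depth=4)
    model.fit(X, y)
    return model
```

The classifier's predicted probability of a correct answer then plays the role of $\sigma(\Upsilon_{\theta})$ when scoring query-prompt pairs offline.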
Step 3: Offline Prompt Optimization with the Learned Reward Model.
Given that the learned reward model is an effective proxy for the performance evaluator of query-prompt pairs, we can solve the prompt optimization problem of Equation (2) with an alternative offline objective that is feasible to execute at inference time — without requiring the language model $\ell$ or the inaccessible inference-time golden label $y^*$:
\pi^* = \arg\max_{\pi} \; \Upsilon_{\theta}(x, \pi(x)) \;\approx\; \arg\max_{\pi} \; r\big(y^*, \ell(\pi(x))\big)    (6)
In general, any policy optimization technique can potentially be applied to solve Equation (6). To name a few examples, such a technique can be any of the previous approaches used in online prompt optimization, such as RL (Deng et al., 2022), beam search (Pryzant et al., 2023), or evolution strategies (Zhou et al., 2022b).
In this work, we would like to highlight the effectiveness and cost-efficiency of optimizing prompts with regard to such a learned reward model as an accessible offline proxy for the true reward. We choose a minimalist approach and isolate the source of gains by using a best-of-n strategy: we generate a batch of candidate prompts with a general-purpose language model and select the best one according to the learned reward model. We leave the investigation of other approaches, which could be more tailored and better-performing yet computationally more expensive, as a promising direction for future exploration.
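A minimal best-of-n sketch under these choices, assuming the reward model and embedding helper from the sketch above (hypothetical names), could look as follows:

```python
import numpy as np


def best_of_n(query, candidate_prompts, reward_model, embed):
    """Pick the candidate prompt with the highest predicted success probability; no LLM call is made."""
    features = np.stack(
        [np.concatenate([embed(query), embed(p)]) for p in candidate_prompts]
    )
    scores = reward_model.predict_proba(features)[:, 1]   # P(correct answer | query, prompt)
    return candidate_prompts[int(scores.argmax())]

# Usage sketch: only the selected prompt is then sent to the LLM.
# best_prompt = best_of_n(x, candidates, proxy_reward, embed)
# answer = llm(best_prompt + "\n" + x)
```

Because the selection step itself incurs no LLM interaction, the inference cost is a single LLM call per query regardless of how many candidates are scored.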
4 RELATED WORK
We discuss the most closely related literature on learning-based prompt optimization and reinforcement learning from human knowledge in this section. Extended discussions are deferred to Appendix B.
Table 1: Prompt-OIRL mainly differentiates itself from the existing literature on prompt optimization by (1) considering the Adjusted Objective and optimizing query-dependent prompts; (2) being able to perform offline prompt evaluation to address Challenge 1; (3) optimizing prompts in the offline setting without access to the LLMs to address Challenge 2; (4) using the natural language prompt space; and (5) using offline inverse reinforcement learning to solve the problem. More unique properties of Prompt-OIRL are discussed in Table 5.
Method | (1) Query-Dependent Prompt | (2) Offline Prompt Evaluation | (3) Offline Prompt Optimization | (4) Prompt Space* | (5) Solver Used
Soft-Prompt | ✓ | ✗ | ✗ | Embeddings | Gradient-Guided Search
APO | ✗ | ✗ | ✗ | X = V^∞ | Beam Search
APE | ✗ | ✗ | ✗ | X = V^∞ | Evolution Strategy
TEMPERA | ✓ | ✗ | ✗ | Edit | RL
RLPrompt | ✗ | ✗ | ✗ | {V^2, V^5} | RL
Prompt-OIRL (ours) | ✓ | ✓ | ✓ | X = V^∞ | Offline Inverse-RL
* Embeddings: the language of LMs; Edit: including operations like swap, delete, etc.; V is the vocabulary, and the superscript over it denotes the length of prompts. V^∞ denotes the natural language space — the most general interpretable format.
Learning-Based Prompt Optimization  Automatic Prompt Optimization (APO) (Pryzant et al., 2023) leverages training data and initial prompts to perform a "gradient descent" process operated in the natural language space and guided by large language models. A generate-and-select process is then applied to update the prompting mechanism. Automatic Prompt Engineer (APE) (Zhou et al., 2022b) asks LLMs to generate a set of instruction candidates based on demonstrations and then evaluates the performance of those instructions by computing corresponding scores; finally, APE improves the best candidates by proposing semantically similar instruction variants for specific tasks.
There are also works that explore the use of auxiliary models or differentiable representations of the prompt (Shin et al., 2020; Li & Liang, 2021; Qin & Eisner, 2021; Vu et al., 2021; Gu et al., 2021). Yet those methods require access to the embeddings of the LMs for soft prompt optimization, hence falling short in reusability across LMs. In the work of Hao et al. (2022), reinforcement learning is used to optimize a prompt policy in text-to-image generation tasks, in pursuit of more aesthetically pleasing images that preserve the original user intentions. Zhang et al. (2022a) propose TEMPERA to perform test-time prompt editing using reinforcement learning. TEMPERA designs the action space to be editing operations, such as swapping, deleting, adding, or changing verbalizers. RLPrompt (Deng et al., 2022) uses reinforcement learning to optimize LLMs to specialize as prompting agents. Yet the prompts it generates are task-agnostic and limited to combinations of words that cannot easily be transferred into insights for human prompt engineers.
Reinforcement Learning with Human Knowledge and Demonstrations  Human knowledge and demonstrations have pushed forward the progress of machine learning in many domains, including natural language processing (Rajpurkar et al., 2018; Ouyang et al., 2022) and many others (Deng et al., 2009; Redmon et al., 2016; Atkeson & Schaal, 1997; Zhang et al., 2018; Mandlekar et al., 2020). By injecting prior task knowledge from humans, reinforcement learning algorithms are able to achieve human-level performance or even outperform humans (Silver et al., 2016; Vinyals et al., 2019). Our solver for the prompting problem is related to inverse RL (Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016), which aims to infer rewards and learn policies from a batch of behavioral demonstrations; this is unique compared with the existing literature on prompt optimization. The main distinctions between this study and those related works are summarized and highlighted in Table 1. For an extended discussion of further distinctions, please refer to Table 5 in Appendix B.
5 EXPERIMENT
In this section, we present empirical evidence illustrating the efficacy of Prompt-OIRL in addressing the previously highlighted challenges and meeting the revised objective. We first outline the general experimental setups. Subsequently, we demonstrate how Prompt-OIRL fulfills the Adjusted Objective and addresses the Challenges in Sections 5.1 through 5.3.
Tasks We use the tasks of MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021a),
SVAMP (Patel et al., 2021) in the arithmetic reasoning domain because they are widely studied
in zero-shot prompting, and hence rich expert-crafted and machine-generated prompting knowledge
is available. Leveraging such knowledge facilitates our offline data collection procedure.
Table 2: Prompts used in offline training dataset collection.
No. | Effective Prompts Discovered by Experts and Algorithms | Explanation
1 | “The answer is:” | direct prompting
2 | “Let’s think step by step:” | zero-shot CoT (Kojima et al., 2022)
3 | “Let’s work this out in a step by step way to be sure we have the right answer:” | APE discovered (Zhou et al., 2022b)
4 | “First, decompose the question into several sub-questions that need to be solved, and then solve each question step by step:” | Least-to-most (Zhou et al., 2022a)
5 | “Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, and then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they’re wrong at any point then they leave.” | Tree-of-thought (Hulbert, 2023)
6 | “3 experts are discussing the question, trying to solve it step by step, and make sure the result is correct:” | multi-agent debate (Liang et al., 2023)
Prompts  Table 2 shows our training prompts: 6 existing zero-shot prompts for arithmetic reasoning tasks. We construct our offline demonstration dataset using interaction logs between LLMs and queries prompted by those prompts. Additionally, we create held-out test prompts that differ from those training prompts; details of the test prompts and their generation are provided in Appendix C.3.
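For concreteness, a sketch of this offline data collection (Step 1) might look as follows; llm_call and is_correct are hypothetical stand-ins for the actual chat APIs and answer parsing used in our pipeline:

```python
def collect_offline_demonstrations(queries, answers, training_prompts, llm_call, is_correct):
    """Step 1: log (query index, prompt index, binary reward) triples for every query-prompt pair."""
    logs = []
    for i, (x, y_star) in enumerate(zip(queries, answers)):
        for k, prompt in enumerate(training_prompts):
            y_hat = llm_call(prompt + "\n" + x)          # one prompted query per (x, prompt) pair
            logs.append((i, k, int(is_correct(y_hat, y_star))))
    return logs
```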
Figure 5: Performance of Prompt-OIRL under two typical settings. In both the scarce (i.e., 1 training prompt) and rich (i.e., 6 training prompts) demonstration settings, Prompt-OIRL achieves better performance.
Results  We present results for two typical settings under different demonstration availability in Figure 5: (1) the scarce demonstration setting (left panel) with K = 1, where only 1 prompt is available for reward model learning; and (2) the rich demonstration setting (right panel) with K = 5, where demonstrations from more prompts are available. We note that when K = 1, there should be no difference between using Eqn. (1) and Eqn. (2) in the best-of-training strategy.
From the results, we can conclude: (1) in the scarce demonstration setting, where demonstration data is generated with only 1 prompt, Prompt-OIRL significantly (+24.3%) outperforms the best-of-training strategy; (2) in the rich demonstration setting, where multiple prompts are available, Prompt-OIRL is significantly better (+8.8%) than using the query-agnostic objective of Eqn. (1); (3) in the rich demonstration setting, Prompt-OIRL is slightly better (+1.8%) than selecting the best training strategy according to the learned reward model — this is not surprising, as the prompts we used for training are all tested as helpful themselves; (4) in both settings, Prompt-OIRL achieves significantly better results (by +24.3% and +14.7%, respectively) compared to the LLM confidence score-based baseline — without further interactions with the LLMs.
Figure 6: The averaged performance of different methods under different prompt demonstration availability in arithmetic reasoning. Prompt-OIRL achieves significantly better performance even when only 1 prompt is available for training.
Additionally, we change the availability of prompts used for training and provide a performance comparison, averaged over all tasks and LLMs we benchmarked, in Figure 6. We normalize the performance using an oracle reward model — i.e., against the success rate achieved when always having access to a successful prompt. Such a normalized
score better contextualizes the performance improvement realized by the instantiated reward model in Prompt-OIRL. In general, using more prompts in training leads to better results. The results under different individual settings are provided in Appendix F.1.
Take-Aways Prompt-OIRL fulfills the Adjusted Objective and achieves superior perfor-
mance over the baselines under various demonstration data availability. Prompt-OIRL achieves
a remarkably higher success rate in getting correct answers even when demonstration data and
prompting knowledge are scarce (i.e., with demonstration data generated with only 1 prompt),
and its performance can be further improved with an increase in data availability.
Experiment Setup In this section, we study the effectiveness of the learned reward model by
verifying its generalization ability from two perspectives: (1) with seen-prompt + unseen-query:
we evaluate the effectiveness of the learned reward model with training-time prompts on held-out
queries and (2) with unseen-prompt + unseen-query: we evaluate the effectiveness of the learned
reward model with held-out prompts on held-out queries. To enable the performance evaluation
on held-out prompts, we collect interactive logs on each language model for 10 held-out prompts.
Details of those held-out prompts are provided in Appendix C.3.
We benchmark our approach against the language models for self-criticism (LMSC) baseline (Wang
et al., 2023). In this method, LLMs are tasked with verifying the accuracy of the answer, given
both the query and the prompted response. Throughout our experiments, we gauge performance
using held-out test queries. In our subsequent results, the notation Ours (Q) refers to evaluation
outcomes employing training time prompts (with “Q” representing test queries), while Ours (P)
refers to evaluations conducted on held-out prompts (with “P” representing test prompts). The
same notations are applied to LMSC as LMSC (Q) and LMSC (P). Moreover, we compare results
varying prompt demonstration data availabilities. Specifically, we change the number of prompts, denoted K, used in the training of the reward model. The result we report for each K is averaged over the $\binom{6}{K}$ combinations of training prompts. Accuracy and precision are used as the evaluation metrics.
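Concretely, the subset averaging can be read as the following sketch (the evaluate helper is a placeholder for training and testing a reward model on a given prompt subset):

```python
from itertools import combinations
from statistics import mean


def average_over_prompt_subsets(all_prompts, K, evaluate):
    """Average a metric over all C(6, K) ways of choosing K training prompts from the 6 available."""
    scores = [evaluate(list(subset)) for subset in combinations(all_prompts, K)]
    return mean(scores)
```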
Table 3: Effectiveness of the learned reward model, demonstrated through a comparison with the LLM-based self-criticism baseline. Accuracies and precisions in predicting whether correct answers can be obtained on held-out queries are reported as evaluation metrics. Higher is better.
Task | Method | LLaMA2-7B-Chat Acc. | LLaMA2-7B-Chat Precision | TigerBot-13B-Chat Acc. | TigerBot-13B-Chat Precision | GPT-3.5-turbo Acc. | GPT-3.5-turbo Precision
MAWPS | LMSC (Q) | 0.47 | 0.433 | 0.776 | 0.175 | 0.662 | 0.664
MAWPS | Ours (Q) K = 1 | 0.784 | 0.621 | 0.952 | 0.593 | 0.96 | 0.965
MAWPS | Ours (Q) K = 5 | 0.795 | 0.632 | 0.95 | 0.593 | 0.96 | 0.965
MAWPS | LMSC (P) | 0.457 | 0.441 | 0.414 | 0.692 | 0.661 | 0.661
MAWPS | Ours (P) K = 1 | 0.621 | 0.569 | 0.467 | 0.737 | 0.767 | 0.82
MAWPS | Ours (P) K = 5 | 0.658 | 0.595 | 0.496 | 0.795 | 0.803 | 0.825
SVAMP | LMSC (Q) | 0.474 | 0.441 | 0.434 | 0.735 | 0.801 | 0.805
SVAMP | Ours (Q) K = 1 | 0.791 | 0.646 | 0.735 | 0.764 | 0.964 | 0.975
SVAMP | Ours (Q) K = 5 | 0.8 | 0.65 | 0.738 | 0.764 | 0.962 | 0.97
SVAMP | LMSC (P) | 0.464 | 0.447 | 0.421 | 0.739 | 0.795 | 0.796
SVAMP | Ours (P) K = 1 | 0.634 | 0.602 | 0.685 | 0.741 | 0.824 | 0.883
SVAMP | Ours (P) K = 5 | 0.655 | 0.593 | 0.707 | 0.751 | 0.861 | 0.885
GSM8K | LMSC (Q) | 0.205 | 0.187 | 0.635 | 0.537 | 0.662 | 0.656
GSM8K | Ours (Q) K = 1 | 0.773 | 0.289 | 0.642 | 0.53 | 0.659 | 0.663
GSM8K | Ours (Q) K = 5 | 0.726 | 0.301 | 0.646 | 0.542 | 0.656 | 0.657
GSM8K | LMSC (P) | 0.207 | 0.195 | 0.636 | 0.515 | 0.653 | 0.648
GSM8K | Ours (P) K = 1 | 0.747 | 0.269 | 0.631 | 0.516 | 0.653 | 0.658
GSM8K | Ours (P) K = 5 | 0.647 | 0.246 | 0.643 | 0.54 | 0.648 | 0.648
Results Table 3 showcases the results across the 3 tasks employing the 3 LLMs. From the results,
we can deduce: (1). On held-out queries, we find the learned reward model is able to accurately
predict whether a certain prompt can lead to a correct answer. This holds true for both training
prompts and held-out test prompts. (2). The prediction accuracy and precision of the learned reward
model are significantly better than the LMSC baseline, demonstrating the superiority of leveraging
the learned model rather than the LM itself in evaluating prompts and prompted answers. (3). Incorporating more prompts into the training dataset generally improves performance, particularly when evaluating held-out prompts. For more comprehensive results, please refer to Appendix F.2.
Experiment Setup  As previously argued, utilizing the proxy reward model for prompt optimization, as opposed to directly querying language models, results in significantly reduced costs. Indeed, the expenses associated with inference can be considerable (Chen et al., 2023). In this section, we emphasize the cost-efficiency advantage of Prompt-OIRL during inference-time prompt optimization.
Prompt-OIRL evaluates prompts through its offline reward model and obviates the need for extra
LLM interactions during inference. Picture a scenario where, for each query at inference, there exist
K potential prompts to assess and select. With Prompt-OIRL, the optimal prompt is chosen with-
out LLM interaction, ensuring that only the chosen prompt undergoes inference with the LLM to
produce a singular response. Conversely, evaluating prompts via LLMs — by gauging their confi-
dence or dissecting the nuances of various answers — demands extensive interaction prior to settling
on a choice. We provide a detailed cost analysis of these methods, outlining the monetary implications of employing the GPT-3.5-turbo and TigerBot-13B-Chat APIs. For LLaMA2-7B-Chat, which is run locally on an NVIDIA A4000 GPU, we report the inference time in GPU hours.
Table 4: Inference costs of different methods for a single query under different numbers of prompt choices. Generating embeddings in Prompt-OIRL costs ≈ 0.0002 USD per 100 prompts.
Model (Cost) | Method | K = 1 | K = 6 | K = 110
LLaMA2 (GPU hour) | LMSC | 0.00325 | 0.01949 | 0.35723
LLaMA2 (GPU hour) | Ours | 0.00134 | 0.00134 | 0.00134
TigerBot (USD) | LMSC | 0.00082 | 0.00492 | 0.09029
TigerBot (USD) | Ours | 0.00034 | 0.00035 | 0.00056
GPT-3.5 (USD) | LMSC | 0.00093 | 0.00558 | 0.10237
GPT-3.5 (USD) | Ours | 0.00040 | 0.00041 | 0.00062

Results  Table 4 presents the inference-time costs associated with various LLMs and prompt selection strategies. Employing Prompt-OIRL for prompt optimization is substantially more cost-efficient than methods reliant on LLMs. When calling LLMs for self-criticism, all prompts need to be processed to obtain distinct prompted answers, necessitating K interactions with the LLMs. Following this, the LLMs are queried to verify the correctness of those answers and provide their confidence scores, incurring an additional K interactions. In contrast, Prompt-OIRL utilizes the offline reward model to pinpoint the most suitable prompt and only forwards the chosen prompt to the LLM for processing.
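The interaction-count argument can be summarized by the following simplified accounting (a sketch; it ignores the per-token pricing differences reflected in Table 4):

```python
def llm_calls_per_query(K: int, method: str) -> int:
    """Interaction count per query: LMSC prompts the LLM K times and then verifies K answers (2K calls),
    while Prompt-OIRL selects the prompt offline and issues a single call with the chosen prompt."""
    return 2 * K if method.upper() == "LMSC" else 1

# e.g. with K = 6 candidate prompts: LMSC needs 12 LLM calls per query, Prompt-OIRL needs 1.
```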
6 CONCLUSION
We propose Prompt-OIRL, a novel approach grounded in offline inverse reinforcement learning, de-
signed to reconcile effective and cost-efficient query-dependent prompt evaluation and optimization.
This method leverages offline datasets from existing evaluations, utilizing Inverse-RL to craft a re-
ward model tailored for offline, query-specific prompt evaluations. Prompt-OIRL offers several ben-
efits: it forecasts prompt efficacy, minimizes costs, and explores the prompt space more effectively
— all at a query-dependent level. We validate our approach across various LLMs and arithmetic
reasoning datasets, underscoring its viability as a formidable solution for query-dependent offline
prompt evaluation and optimization.
REFERENCES
Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In
Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004.
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline
reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR,
2020.
Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration. In ICML, volume 97,
pp. 12–20. Citeseer, 1997.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless
assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,
2022.
Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, and Ani-
mesh Garg. Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497, 2020.
Ioana Bica, Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Learning “what-if” explana-
tions for sequential decision-making. In International Conference on Learning Representations,
2021.
Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond sub-
optimal demonstrations via inverse reinforcement learning from observations. In International
conference on machine learning, pp. 783–792. PMLR, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while
reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong
Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, et al. Xgboost: extreme gradient boosting. R
package version 0.4-2, 1(4):1–4, 2015.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario
Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Ad-
vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,
2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021a.
Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In Interna-
tional Conference on Machine Learning, pp. 2020–2027. PMLR, 2021b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi-
erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pp. 248–255. Ieee, 2009.
Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric P Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement
learning. arXiv preprint arXiv:2205.12548, 2022.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv-
ing factuality and reasoning in language models through multiagent debate. arXiv preprint
arXiv:2305.14325, 2023.
Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian
Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In
Conference on Robot Learning, pp. 158–168. PMLR, 2022.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep
data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without
exploration. arXiv preprint arXiv:1812.02900, 2018a.
Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. arXiv preprint arXiv:1802.09477, 2018b.
Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn:
Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34:
4028–4039, 2021.
Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. New and improved embedding
model, 2022.
Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. Ppt: Pre-trained prompt tuning for few-shot
learning. arXiv preprint arXiv:2109.04332, 2021.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation.
arXiv preprint arXiv:2212.09611, 2022.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural
information processing systems, 29, 2016.
Samuel Holt, Alihan Hüyük, Zhaozhi Qian, Hao Sun, and Mihaela van der Schaar. Neural Laplace
control for continuous-time delayed systems. In International Conference on Artificial Intelli-
gence and Statistics, 2023a.
Samuel Holt, Alihan Hüyük, and Mihaela van der Schaar. Active observing continuous-time control.
In Conference on Neural Information Processing Systems, 2023b.
Dave Hulbert. Tree of knowledge: Tok aka tree of knowledge dataset for large language models llm.
https://github.com/dave1010/tree-of-thought-prompting, 2023.
Alihan Hüyük, Daniel Jarrett, Cem Tekin, and Mihaela van der Schaar. Explaining by imitating: Un-
derstanding decisions by interpretable policy learning. In International Conference on Learning
Representations, 2021.
Alihan Hüyük, Daniel Jarrett, and Mihaela van der Schaar. Inverse contextual bandits: Learning
how behavior evolves over time. In International Conference on Machine Learning, 2022.
Alihan Hüyük, William R Zame, and Mihaela van der Schaar. Inferring lexicographically-ordered
rewards from preferences. In AAAI Conference on Artificial Intelligence, 2022.
Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Inverse decision modeling: Learning
interpretable representations of behavior. In International Conference on Machine Learning, pp.
4755–4771. PMLR, 2021.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-
Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural
information processing systems, 30, 2017.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. Advances in neural information processing systems,
35:22199–22213, 2022.
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via
bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tuto-
rial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented genera-
tion for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:
9459–9474, 2020.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv
preprint arXiv:2101.00190, 2021.
Xiaonan Li and Xipeng Qiu. Mot: Pre-thinking and recalling enable chatgpt to self-improve with
memory-of-thoughts. arXiv preprint arXiv:2305.05181, 2023.
Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large
language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520, 2023a.
Ziniu Li, Tian Xu, and Yang Yu. Policy optimization in rlhf: The impact of out-of-preference data.
arXiv preprint arXiv:2312.10584, 2023b.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng
Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-
agent debate. arXiv preprint arXiv:2305.19118, 2023.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Han-
naneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. arXiv preprint
arXiv:2110.08387, 2021.
Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models
to enhance bayesian optimization. arXiv preprint arXiv:2402.03921, 2024.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan,
Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around
player? arXiv preprint arXiv:2307.06281, 2023a.
Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. Reason
for future, act for now: A principled framework for autonomous llm agents with provable sample
efficiency. arXiv preprint arXiv:2309.17382, 2023b.
Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
Ajay Mandlekar, Danfei Xu, Roberto Martı́n-Martı́n, Silvio Savarese, and Li Fei-Fei. Learn-
ing to generalize across long-horizon tasks from human demonstrations. arXiv preprint
arXiv:2003.06085, 2020.
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-
Fei, Silvio Savarese, Yuke Zhu, and Roberto Martı́n-Martı́n. What matters in learning from offline
human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke
Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv
preprint arXiv:2202.12837, 2022.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Over-
coming exploration in reinforcement learning with demonstrations. In 2018 IEEE international
conference on robotics and automation (ICRA), pp. 6292–6299. IEEE, 2018.
Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In Icml, vol-
ume 1, pp. 2, 2000.
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
OpenCompass. Opencompass: A universal evaluation platform for foundation models. https:
//github.com/open-compass/opencompass, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math
word problems? arXiv preprint arXiv:2103.07191, 2021.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural
information processing systems, 1, 1988.
Thomas Pouplin, Hao Sun, Samuel Holt, and Mihaela Van der Schaar. Retrieval-augmented thought
process as sequential decision making. arXiv preprint arXiv:2402.07812, 2024.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt
optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts.
arXiv preprint arXiv:2104.06599, 2021.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. arXiv preprint arXiv:1806.03822, 2018.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 779–788, 2016.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and struc-
tured prediction to no-regret online learning. In Proceedings of the fourteenth international con-
ference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference
Proceedings, 2011.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint
arXiv:1608.01413, 2016.
Stefan Schaal. Learning from demonstration. Advances in neural information processing systems,
9, 1996.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt:
Eliciting knowledge from language models with automatically generated prompts. arXiv preprint
arXiv:2010.15980, 2020.
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Ne-
unert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing
what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint
arXiv:2002.08396, 2020.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
Hao Sun. Reinforcement learning in the era of llms: What is essential? what is needed? an rl
perspective on rlhf, prompting, and beyond. arXiv preprint arXiv:2310.06147, 2023.
Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, and Dahua Lin. Policy continuation with hind-
sight inverse dynamics. Advances in Neural Information Processing Systems, 32, 2019.
Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, and Bolei Zhou. Novel policy seeking with
constrained optimization. arXiv preprint arXiv:2005.10696, 2020.
Hao Sun, Ziping Xu, Meng Fang, Zhenghao Peng, Jiadong Guo, Bo Dai, and Bolei Zhou. Safe
exploration by solving early terminated mdp. arXiv preprint arXiv:2107.04200, 2021.
Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, and Bolei Zhou. Exploit reward shifting
in value-based deep-rl: Optimistic curiosity-based exploration and conservative exploitation via
linear reward shaping. Advances in Neural Information Processing Systems, 35:37719–37734,
2022a.
Hao Sun, Boris van Breugel, Jonathan Crabbe, Nabeel Seedat, and Mihaela van der Schaar. Daux:
a density-based approach for uncertainty explanations. arXiv preprint arXiv:2207.05161, 2022b.
Hao Sun, Ziping Xu, Meng Fang, and Bolei Zhou. Supervised q-learning can be a strong baseline
for continuous control. In NeurIPS 2022 Foundation Models for Decision Making Workshop,
2022c.
Hao Sun, Alex J Chan, Nabeel Seedat, Alihan Hüyük, and Mihaela van der Schaar. When is off-
policy evaluation useful? a data-centric perspective. arXiv preprint arXiv:2311.14110, 2023a.
Hao Sun, Alihan Hüyük, Daniel Jarrett, and Mihaela van der Schaar. Accountability in offline
reinforcement learning: Explaining decisions with a corpus of examples. In Conference on Neural
Information Processing Systems, 2023b.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
TigerResearch. Tigerbot: A cutting-edge foundation for your very own llm. https://github.
com/TigerResearch/TigerBot, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Juny-
oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. Spot: Better frozen model
adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.
Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompt-
ing with large language models for inductive instructions. arXiv preprint arXiv:2305.13733,
2023.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh-
ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.
arXiv preprint arXiv:2203.11171, 2022.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint
arXiv:2109.01652, 2021.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
Neural Information Processing Systems, 35:24824–24837, 2022.
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning.
arXiv preprint arXiv:1911.11361, 2019a.
Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imi-
tation learning from imperfect demonstration. In International Conference on Machine Learning,
pp. 6818–6827. PMLR, 2019b.
Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. Advances in
Neural Information Processing Systems, 33:15737–15749, 2020.
Tian Xu, Ziniu Li, Yang Yu, and Zhi-Quan Luo. Understanding adversarial imitation learning in
small sample regime: A stage-coupled analysis. arXiv preprint arXiv:2208.01899, 2022.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun
Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023a.
Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, and Lei Han. Rorl: Ro-
bust offline reinforcement learning via conservative smoothing. Advances in Neural Information
Processing Systems, 35:23851–23866, 2022a.
Rui Yang, Yiming Lu, Wenzhe Li, Hao Sun, Meng Fang, Yali Du, Xiu Li, Lei Han, and Chongjie
Zhang. Rethinking goal-conditioned supervised learning and its connection to offline rl. arXiv
preprint arXiv:2202.04478, 2022b.
Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, and Tong Zhang. What is essential
for unseen goal generalization of offline goal-conditioned rl? In International Conference on
Machine Learning, pp. 39543–39571. PMLR, 2023b.
Rui Yang, Han Zhong, Jiawei Xu, Amy Zhang, Chongjie Zhang, Lei Han, and Tong Zhang.
Towards robust offline reinforcement learning under diverse data corruption. arXiv preprint
arXiv:2310.12955, 2023c.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv
preprint arXiv:2305.10601, 2023.
Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust offline reinforcement
learning with general function approximation. Advances in Neural Information Processing Sys-
tems, 36, 2024.
Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor Prasanna. Brac+: Going deeper with behavior
regularized offline reinforcement learning. 2020.
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel.
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In
2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5628–5635. IEEE,
2018.
Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera:
Test-time prompt editing via reinforcement learning. In The Eleventh International Conference
on Learning Representations, 2022a.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in
large language models. arXiv preprint arXiv:2210.03493, 2022b.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur-
mans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex
reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022a.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022b.
D Extended Discussion
D.1 Future Work
D.2 A Call to the Field: Release the Interactive Logs with LLMs as a Valuable Asset
D.3 Link to the Online Prompt Evaluation and Optimization Methods
E Additional Experiments
E.1 Prompt-OIRL under Different Data Availability
E.2 Generalization of Prompt-OIRL over Tasks
E.3 Experiments on Larger LLM
G Case Studies
G.1 Optimal Prompting Strategy is Query Dependent
G.2 Expert-Crafted Prompts Can All Fail
G.3 Generated Human-Readable Prompts
Ethics Statement This paper introduces the pioneering application of offline inverse reinforce-
ment learning to query-dependent prompting, enhancing the arithmetic reasoning capabilities of
LLMs. Yet, deploying such a method inappropriately could yield undesirable results. Addition-
ally, the method’s dependence on LLMs and prompting demonstration data underscores potential
concerns about data privacy and consent, especially as it leans on future offline datasets for human-
centered tasks.
Reproducibility Statement  All code and offline demonstration datasets will be released upon acceptance. For the sake of full reproducibility, we provide supplemental experiment details in Appendix C and attach example code in the supplementary material of the submission.
Acknowledgement We thank anonymous reviewers for improving the quality of our paper. We
thank Alex J. Chan, Nicolás Astorga, and Tennison Liu for reviewing our paper and providing
insightful comments. HS’s research is funded by ONR.
To better understand our contribution, it would be helpful to link and contrast Prompt-OIRL with
the framework of reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF).
In general, the success of RLHF and RLAIF motivates our idea of integrating human expertise in
prompting, and Prompt-OIRL can be interpreted as a special type of prompt policy learning from
offline AI feedback, which differentiates from existing literature.
The fact that Prompt-OIRL first learns a reward model and then performs policy optimization using
the reward model makes the method related to RLHF and RLAIF. From such a perspective, Prompt-
OIRL treats prompt evaluation and optimization as a process of aligning prompting strategies with
the preferences of the LLMs, in contrast to aligning LLM responses with human preferences. In analogy with RLHF, which improves the performance of LLMs with regard to human preferences, we hypothesize that such a perspective could lead to more effective prompts while avoiding the need for random exploration of the prompt space, which is computationally expensive.
That said, there are several key differences that make Prompt-OIRL distinguishable from the existing
RLHF and RLAIF literature. RLHF (or RLAIF) often includes the following steps: 1. sampling
from pre-trained language models to generate dialogue data; 2. human (or AI) raters are then asked
to rank those generated data according to given criteria such as harmlessness and helpfulness; 3. a preference
model is trained on the ranked dataset, as proxy to the human (or AI) raters; 4. the language model
is fine-tuned with the learned preference model using reinforcement learning.
By contrast, we would like to highlight the differences in objective, non-interactive offline learning
problem setting, and optimization procedure flexibility: First, Objective: The objective of Prompt-
OIRL is to evaluate and optimize prompting strategies, rather than enhance LLMs’ alignment to
non-differentiable objectives. Importantly, the learned reward model in Prompt-OIRL can work in
isolation as a prompt evaluator. Second, Non-Interactive Offline Setting: Prompt-OIRL works in a
purely offline setting, rather than assuming access to human or AI raters to actively generate feed-
back. Third, the optimization procedure in Prompt-OIRL is highly flexible: the optimizer need not be a language model as in RLHF and RLAIF. For instance, the prompt optimization process
can collaborate with a human prompting engineer, who proposes potential prompting strategies and
selects the best according to the learned reward model.
Such a learning paradigm can be formally represented using Markov Decision Processes (MDPs), where decisions are made in discrete time steps and each decision affects the state of the environment in the subsequent step.
Formally, we denote the MDP as $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \rho_0, \gamma\}$, where $\mathcal{S} \subset \mathbb{R}^d$ denotes the $d$-dimensional state space and $\mathcal{A}$ is the action space. Broadly, the environment includes $\mathcal{T}$ and $\mathcal{R}$: the former denotes the transition dynamics $\mathcal{T} : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ that controls transitions between states, and the reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ provides feedback. $\rho_0 = p(s_0) \in \Delta(\mathcal{S})$ denotes the initial state distribution, and $\gamma$ is the discount factor that trades off between short-term and long-term returns.
Figure 7: Pictorial illustration of Online RL: an agent actively interacts with the environment, which
is composed of the dynamics model that controls transition, and the reward model that provides
feedback.
Online RL considers the problem of learning a policy $\pi \in \Pi : \mathcal{S} \mapsto \Delta(\mathcal{A})$ such that the expected cumulative reward $\mathbb{E}_{a_t \sim \pi,\, s_{t+1} \sim \mathcal{T},\, s_0 \sim \rho_0}\left[\sum_{t=0}^{T} \gamma^t r_t(s_t, a_t)\right]$ in the MDP is maximized. In the online RL setting, an agent learns through trial and error. It actively interacts with the environment — including both the transition dynamics $\mathcal{T}$ and the reward function $\mathcal{R}$ — and optimizes its policy through either on-policy (Schulman et al., 2015; 2017; Sun et al., 2020; Cobbe et al., 2021b) or off-policy algorithms (Lillicrap et al., 2015; Mnih et al., 2015; Fujimoto et al., 2018b; Haarnoja et al., 2018; Sun et al., 2019).
At each time step $t$, the agent observes a state $s_t$ from the environment and selects an action $a_t$ according to its policy $\pi$. Upon taking the action, the agent receives a reward $r_t$ and transitions to a new state $s_{t+1}$. The agent's objective is to maximize its expected return.
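In code, the online interaction loop described above reads roughly as follows (a schematic sketch with placeholder environment and policy interfaces, not tied to any particular library):

```python
def online_rl_loop(env, policy, num_episodes, gamma=0.99):
    """Schematic online RL: the agent actively queries both the transition dynamics and the reward.
    `env` and `policy` are placeholders with reset/step and act/update interfaces."""
    episode_returns = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        total_return, discount = 0.0, 1.0
        while not done:
            action = policy.act(state)                        # a_t ~ pi(. | s_t)
            next_state, reward, done = env.step(action)       # environment supplies T and R
            policy.update(state, action, reward, next_state)  # on-policy or off-policy update
            total_return += discount * reward
            discount *= gamma
            state = next_state
        episode_returns.append(total_return)
    return episode_returns
```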
Figure 8: Pictorial illustration of Offline RL: a decision dataset is collected through interaction logs between a behavior agent and the environment. Such an offline dataset is then used to optimize a parameterized policy — the learnable agent.
Offline RL, also known as batch-RL or data-driven RL, focuses on learning optimal policies from
a fixed dataset of interaction data without further interaction with the environment (e.g. Holt et al.,
2023a;b; Sun et al., 2023b). This approach is particularly relevant in scenarios where online interac-
tion is expensive, risky, or impractical (Sun et al., 2021; Yang et al., 2023c). The primary challenge
in Offline RL is that the dataset may not sufficiently cover important regions of the state-action
space, which can lead to extrapolation errors and suboptimal policies (Fujimoto et al., 2018a).
Offline RL algorithms aim to leverage this fixed dataset to derive a policy that would perform well
if deployed in the real environment. Techniques often employed in Offline RL include constraining
the policy to only consider data in the dataset, regularizing against drastic changes from a behavior
policy, and employing uncertainty estimation to avoid regions where the dataset provides limited
information (Kumar et al., 2019; Wu et al., 2019a; Siegel et al., 2020; Sun
et al., 2022a;b; Yang et al., 2022a; 2023b; Ye et al., 2024). In general, such a dataset can either be
generated by rolling out an expert that generates high-quality solutions to the task (Fujimoto et al.,
2018a; Zhang et al., 2020; Fu et al., 2020) or a non-expert that executes sub-optimal behaviors (Fu
et al., 2020; Wu et al., 2019a; Kumar et al., 2019; Agarwal et al., 2020; Hüyük et al., 2021; Jarrett
et al., 2021) or be a mixture of both (Bharadhwaj et al., 2020).
That said, existing research on offline RL mainly focuses on applications to dense-reward tasks
like robotics control (Fu et al., 2020; Levine et al., 2020). The optimization of reward at the
trajectory level — such as the case in LLM research, where responses may only be evaluated at the
sentence level rather than token-wise — is relatively underexplored.
Imitation Learning, Behavior Cloning, and Inverse RL Imitation learning and learning from
demonstrations are widely studied in the field of reinforcement learning and robotic learning (Ng
et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Garg et al., 2021; Bica et al., 2021), aiming at
learning policies with a batch of expert trajectories. Working as the most straightforward solution,
Behavior Cloning (BC) (Pomerleau, 1988; Florence et al., 2022) optimizes a policy through super-
vised learning, yet may suffer from compounding error or instability in learning from sub-optimal
demonstrations (Ross et al., 2011; Wu et al., 2019b). It is shown in Mandlekar et al. (2021) that
data quality is of great importance in the success of BC. It is worth noting that, while imitation
learning assumes the underlying reward mechanism is unknown, the learning from demonstration
literature always uses the demonstrations as a warm-start for reward-sparse tasks like robotics ma-
nipulation (Schaal, 1996; Nair et al., 2018).
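As a point of reference, behavior cloning amounts to plain supervised learning on expert state-action pairs. The sketch below uses a logistic-regression classifier as a stand-in policy class over toy data; the data and the "expert rule" are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behavior_cloning(expert_states, expert_actions):
    """Behavior cloning: fit pi(a | s) by supervised learning on expert
    (state, action) pairs and return the resulting greedy policy."""
    clf = LogisticRegression(max_iter=1000).fit(expert_states, expert_actions)
    return lambda state: int(clf.predict(state.reshape(1, -1))[0])

# Toy usage: 1000 expert transitions in a 4-dim state space, 2 discrete actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))
actions = (states[:, 0] > 0).astype(int)   # a toy "expert" decision rule
policy = behavior_cloning(states, actions)
```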
Figure 9: Pictorial illustration of IL: a dynamics model is always needed to obtain observations.
The decision dataset can be used to provide a reference for the desired behavior. Learning (the pink
dashed line) can be achieved by minimizing the behavior divergence between the decision dataset
and the agent.
Figure 10: Pictorial illustration of IRL: Inverse RL first learns a reward model, with which the
imitation learning now becomes an “online” RL problem — the accessible dynamics provide obser-
vations, and the learned reward model provides feedback.
LLM Alignment with Human Preference The task of LLM alignment under human preference
can be framed as RL, yet it faces the challenge of online learning. See Figure 11 for a more detailed
explanation.
Figure 11: Pictorial illustration of LLM alignment as an online RL task, where humans act as the
reward model and the LLM itself acts as the dynamics model. However, this is only the ideal case:
it is normally unaffordable to always keep humans in the loop.
In practice, collecting a limited amount of human feedback and saving it as a decision dataset
is feasible. In such cases, LLM alignment as online RL turns into an offline RL problem.
As we have discussed above, offline RL also suffers from difficulties in learning. However, there
is a unique property of the LLM alignment problem that enables imitation learning with such
alignment (human-feedback) data.
Figure 12: Pictorial illustration of LLM alignment as an offline RL task: The alignment dataset is
collected through a limited number of interactions between humans and LLMs.
RLHF: Solving an Offline-RL Task with Online Inverse RL We highlight that RLHF solves the
offline RL task of aligning with human preferences from an offline dataset through online IRL. Figure 13
and Figure 14 illustrate such an interpretation.
Figure 13: Pictorial illustration of RLHF as IL: as the LLM itself acts as the dynamics model, the offline
RL problem of LLM alignment with human preference becomes an imitation learning problem.
Figure 14: Pictorial illustration of RLHF as Online IRL: approaching the imitation learning problem
through IRL, RLHF first learns a reward model, with which it has both components of an environment
— the dynamics model (the LLM itself) and the reward model (learned). Therefore, RLHF can apply
conventional online RL algorithms like PPO as its solver.
Prompting as Policy Learning: Alignment with LLM Preference While prompt optimization
is by nature an RL task and can be approached by learning from trial and error, randomly exploring
the action space of natural language is infeasible. An analogy is the task of LLM alignment with
human feedback, where an LLM agent interacts with human annotators to collect feedback on its
responses and seeks better alignment with the human annotators' preferences. In such a task, it is
also unrealistic and infeasible to always put humans in the loop. Therefore, techniques of RLHF are
proposed to seek a solution when only an offline alignment dataset is available.
On the other hand, prompt optimization is indeed an LLM-centric type of alignment, as opposed
to human-centric alignment: it aligns the prompter's prompting strategies according to LLMs' preferences
and feedback. It faces the same type of challenge: always keeping the LLM in the loop
during optimization is not feasible. Figure 15 provides a pictorial illustration of how prompting can
be interpreted as an (expensive) online RL task.
Figure 15: Pictorial illustration of prompting as an (expensive) online RL task: the reward calcula-
tion requires extensive usage of the LLM, which is expensive in practice.
Prompt-OIRL: Solving an Offline RL Task with Offline Inverse RL The key observation and
claim we would like to emphasize is that Prompt-OIRL can be interpreted as solving the offline RL
problem of aligning prompting strategies with LLMs' feedback through offline IRL.
Offline Data as By-Products In the main text, we have highlighted the existence and importance
of prompt demonstrations generated in benchmarking existing prompts. For instance, in the research
on enhancing the arithmetic reasoning abilities of LLMs, multiple zero-shot prompts are proposed,
such as the CoT (Kojima et al., 2022), APE (Zhou et al., 2022b), and more recently OPRO (Yang
et al., 2023a), etc.
In order to benchmark the effectiveness of those proposed prompts, previous researchers use standardized,
openly accessible arithmetic reasoning datasets and report the overall success rate of getting
correct answers. Figure 16 provides a pictorial illustration of how the evaluation of existing
prompts generates prompt-alignment demonstration datasets as its by-products.
Solving Offline RL by Offline Inverse RL The key observation for this problem is that zero-shot
prompting is a single-step decision-making problem. Therefore, even without the transition
dynamics model (the LLM), we are still able to conduct inverse RL in an offline manner. Prompt-OIRL
circumvents the difficulty of evaluating prompts by using offline reward modeling. Figure 17
provides a pictorial illustration of such an interpretation.
Figure 17: Pictorial illustration of prompting as an Offline IRL task: the offline prompt demonstration
dataset is used to learn a reward model, as a function of the query and the prompt, to avoid requiring
an LLM during evaluation. Prompt-OIRL then leverages such a learned reward model for prompt
optimization in a purely offline manner.
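To make this interpretation concrete, the sketch below (our illustration with hypothetical names; a gradient-boosted classifier stands in for the reward model) learns $\hat{r}$(query, prompt) from offline correctness labels and then scores candidate prompts for a new query without any LLM call.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def learn_reward_model(query_embs, prompt_embs, correct):
    """Offline reward modeling: from logged (query, prompt, correctness)
    triples, fit r_hat(query, prompt) without further LLM interaction."""
    X = np.concatenate([query_embs, prompt_embs], axis=1)
    return GradientBoostingClassifier().fit(X, correct)

def select_prompt(reward_model, query_emb, candidate_prompt_embs):
    """Query-dependent prompt selection: score every candidate prompt for
    this query with the learned reward model and return the best index."""
    q = np.repeat(query_emb[None, :], len(candidate_prompt_embs), axis=0)
    X = np.concatenate([q, np.asarray(candidate_prompt_embs)], axis=1)
    scores = reward_model.predict_proba(X)[:, 1]   # predicted P(correct answer)
    return int(np.argmax(scores))
```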
Conclusive Remark RLHF solves the problem of learning from an offline LLM alignment dataset
according to human feedback by turning the problem from offline RL into online IRL, and then into
online RL after learning the reward model. Prompt-OIRL solves the problem of learning from an
offline prompting dataset according to LLM feedback by turning the problem from offline RL into
an offline IRL problem, thereby removing the dependency on LLMs during evaluation and optimization.
Recent advances in Large Language Models (LLMs) have shown impressive performance as
general-purpose agents, especially with the remarkable success of ChatGPT where human knowl-
edge is injected via reinforcement learning from human feedback (RLHF) (Christiano et al., 2017;
Ouyang et al., 2022; OpenAI, 2023; Sun, 2023; Li et al., 2023b). During the training of those
models, human feedback is leveraged to reduce the harmfulness and at the same time improve the
instruction-following ability of models. In Liu et al. (2023b), LLMs are instantiated as decision-
making agents to master various skills.
Imitation learning and learning from demonstrations are widely studied in the field of reinforce-
ment learning and robotic learning (Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016; Xu
et al., 2020; Garg et al., 2021; Yang et al., 2022b; Xu et al., 2022), aiming at learning policies
with a batch of expert trajectories. Working as the most straightforward solution, Behavior Cloning
(BC) (Pomerleau, 1988; Florence et al., 2022; Sun et al., 2023b) optimizes a policy through super-
vised learning, yet may suffer from compounding error or instability in learning from sub-optimal
demonstrations (Ross et al., 2011; Wu et al., 2019b; Sun et al., 2022c). It is shown in Mandlekar
et al. (2021) that data quality is of great importance in the success of BC. It is worth noting that,
while imitation learning assumes the underlying reward mechanism is unknown, the learning from
demonstration literature always uses the demonstrations as a warm-start for reward-sparse tasks like
robotics manipulation (Schaal, 1996; Nair et al., 2018; Sun et al., 2019; 2021).
In prompt generation tasks, the demonstrations are always imperfect. T-REX (Brown et al., 2019)
extrapolates beyond a learned reward function to achieve super-demonstration performance (as does
Hüyük et al. (2022) in a contextual bandit setting). We also benchmark the performance of BC,
with specific consideration of the quality of demonstrations (Wu et al., 2019b; Mandlekar et al.,
2021; Sun et al., 2023a).
The ability of zero-shot prompting emerges in language models trained on large amounts of data,
such as GPT-3 and GPT-4 (Ouyang et al., 2022; OpenAI, 2023). Wei et al. (2021) show
that instruction fine-tuning improves the zero-shot learning ability of language models.
Notwithstanding the impressive zero-shot performance exhibited by large language models, these
models often exhibit suboptimal performance in executing more complex tasks under a zero-shot
setting. Leveraging few-shot prompting presents a viable approach for facilitating in-context learn-
ing (Brown et al., 2020; Min et al., 2022). This technique necessitates the inclusion of demon-
strations within the prompt, effectively guiding the model toward enhanced performance. These
demonstrations act as conditioning mechanisms for succeeding examples, leading the model to gen-
erate better responses.
In some more challenging tasks like complex arithmetic, commonsense, and symbolic reasoning
tasks, the chain-of-thought (CoT) prompting is shown to be more effective in helping the language
models to get correct answers (Wei et al., 2022). CoT includes additional reasoning steps in the
few-shot prompting examples. Kojima et al. (2022) further introduces zero-shot CoT, showing that
adding a task-agnostic instruction can improve model performance on specific tasks. In Zhang
et al. (2022b), Auto-CoT combines the universality of zero-shot CoT with the capability of the
original few-shot CoT by automatically constructing demonstrations.
Wang et al. (2022) further improves the few-shot CoT method by sampling multiple diverse reasoning
paths and marginalizing over those paths, choosing the most consistent answer among all sampled
reasoning paths. Li et al. (2023a) introduces Directional-Stimulus-Prompting, a framework
that uses a tuneable language model to provide guidance for the black-box frozen large language
model toward desirable properties.
Generated Knowledge Prompting (Liu et al., 2021) improves commonsense reasoning by incorporating
knowledge or information related to the questions to make more accurate predictions.
Tree-of-Thoughts (ToT) methods (Long, 2023; Yao et al., 2023) combine tree-based planning
with the reasoning skills of language models, and solve hard reasoning problems step by step via
multi-round conversations. Hulbert (2023) also puts forward a related idea that leverages multiple
thoughts of a language model in a single prompt. Memory- and Retrieval-Augmented Generation
(RAG) (Lewis et al., 2020) combines parametric memory with non-parametric memory,
such as Wikipedia, to complete knowledge-intensive tasks (Pouplin et al., 2024). MoT (Li & Qiu,
2023) pre-thinks over an external unlabeled dataset and then recalls the related knowledge
during inference. In the concurrent works of Yang et al. (2023a); Liu et al. (2024), LLMs are
used as optimizers to solve a variety of optimization problems, including prompt optimization. It is
worth noting that the two most important differences are in the offline nature and the query-dependent
zero-shot prompting strategy used in our approach.
Table 5: Prompt-OIRL differentiates from existing literature on prompt optimization by (1) considering the
Adjusted Objective and optimizing query-dependent prompt; (2) being able to perform offline prompt
evaluation to address Challenge 1; (3) optimizing prompt in the offline setting without access to the LLMs
to address Challenge 2; (4) utilizing existing expert knowledge to reduce the difficulty in RL; (5) generating
human-readable prompts; (6) working on the most general natural language prompt space; (7) free from
gradient information of LLMs; (8) using offline inverse reinforcement learning as the solver of the problem.
Method | (1) Query Dependent | (2) Offline Prompt Evaluation | (3) Offline Prompt Optimization | (4) Expert Knowledge | (5) Human Readable | (6) Prompt Space | (7) LLM Gradient | (8) Solver Used
TEMPERA | ✓ | ✗ | ✗ | ✓ | ✓ | Edit | ✓ | RL
RLPrompt | ✗ | ✗ | ✗ | ✗ | ✗ | $\{V^2, V^5\}$ | ✓ | RL
Embeddings: the language of LMs; Edit: including operations like swap, delete, etc.; $V$ is the vocabulary, and the superscript over it denotes the length of prompts. $V^\infty$ denotes the natural language space — the most general interpretable format.
For LLM $\ell$ and task $k$, we re-organize the offline demonstration dataset using the embedding function
$E: V^\infty \mapsto \mathbb{R}^{1536}$, which maps a sequence of natural language context to a fixed-length vector of
size 1536 (Greene et al., 2022). Therefore, using $e_x^{(i)} = E(x^{(i)})$, $e_\pi^{(j)} = E(\pi^{(j)})$, and $r^{(ij)} = \mathbb{1}\{y^* = \ell(\pi^{(j)}(x^{(i)}))\}$ to denote the embeddings and reward instantiation, our demonstration dataset of Equation (4), in implementation, can be expressed as follows:
1. Training Data: $D_{\ell,k}^{(\mathrm{train})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{1,\dots,N\},\, j\in\{1,\dots,K\}}$
2. Test Data on Query: $D_{\ell,k}^{(\mathrm{test\,q})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{1,\dots,K\}}$
3. Test Data on Prompts: $D_{\ell,k}^{(\mathrm{test\,p})} = \{e_x^{(i)}, e_\pi^{(j)}, r^{(ij)}\}_{i\in\{N+1,\dots,N+M\},\, j\in\{K+1,\dots,K+P\}}$
In the above, N is the number of samples (query-answer pairs) in the training dataset, M is the
number of samples in the held-out test dataset, K is the number of training prompts, and P is the
number of test prompts.
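For illustration, such a demonstration dataset could be assembled as in the sketch below; `embed` stands for the embedding function $E$ and `llm_answer` for the logged interaction with $\ell$, both of which are placeholders rather than the released data-processing code.

```python
import numpy as np

def build_demonstration_dataset(queries, gold_answers, prompts, embed, llm_answer):
    """Construct {(e_x^(i), e_pi^(j), r^(ij))}: embed every query and prompt,
    and record whether each prompted answer matches the gold answer."""
    query_embs = np.stack([embed(x) for x in queries])     # e_x^(i)
    prompt_embs = np.stack([embed(p) for p in prompts])    # e_pi^(j)
    rewards = np.zeros((len(queries), len(prompts)))
    for i, (x, y_star) in enumerate(zip(queries, gold_answers)):
        for j, p in enumerate(prompts):
            rewards[i, j] = float(llm_answer(p, x) == y_star)   # r^(ij)
    return query_embs, prompt_embs, rewards
```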
For all datasets, we experiment with different choices of K = [1, 2, 3, 4, 5, 6] (i.e., the x-axis of
Figure 6: the number of training prompts, ranging from 1 to 6), and use a total of P = 110
held-out prompts.
The GSM8K task contains 7473 queries with golden answers for training and 1319 held-out queries
with golden answers for testing; the SVAMP task contains 19690 examples, which are split into a
training query-answer set of size 15000 and a testing query-answer set of size 4690; the MAWPS
task contains 7685 examples, which are split into a training query-answer set of size 6000 and
a testing query-answer set of size 1685. Therefore, for each prompting strategy, 7473, 15000, and
6000 demonstrative examples can be collected through the interaction logs with a given language
model and used for training on the three tasks, respectively.
With the previous notations,
For GSM8K, there are N = 7473 samples used for training and M = 1319 samples for testing.
For SVAMP, there are N = 15000 samples used for training and M = 4690 samples for testing.
For MAWPS, there are N = 6000 samples used for training and M = 1685 samples for testing.
Our processed offline datasets and code for processing will be provided as open-source assets to
facilitate future studies.
To exclude some potential alternatives, we have explored using fully connected neural networks
as the reward model and using a ranked dataset mimicking the conventional RLHF setting, yet both
of those choices perform worse than our presented implementation.
MLP To be specific, for the MLP model, we have tried different choices of its hyper-parameters,
including the number of units, the number of layers, various drop-out choices, and a dual-channel
architecture in which each channel processes query and prompt embeddings individually. However,
we find all of those choices tend to converge to the trivial solution that predicts either all-0 or all-1
for the binary classification task. Such a reward model has no real predictive ability at inference time.
In practice, with such a reward model, we can do nothing better than selecting the best-performing
prompt on the training dataset. Therefore, the performance of BoTr Eqn.(1) represents the best
achievable performance when using a dummy MLP reward model.
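For reference, the dual-channel variant mentioned above can be sketched roughly as follows (a minimal PyTorch module with hypothetical layer sizes; as noted, such models tended to collapse to trivial all-0 or all-1 predictions in our setting).

```python
import torch
import torch.nn as nn

class DualChannelRewardModel(nn.Module):
    """Two MLP channels process query and prompt embeddings separately;
    a joint head then predicts the probability of a correct answer."""
    def __init__(self, emb_dim=1536, hidden=256):
        super().__init__()
        self.query_net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.prompt_net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Dropout(0.1), nn.Linear(hidden, 1))

    def forward(self, query_emb, prompt_emb):
        h = torch.cat([self.query_net(query_emb), self.prompt_net(prompt_emb)], dim=-1)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # predicted P(correct answer)
```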
XGBoost For the XGBoost method, we set a universal hyper-parameter configuration for all tasks and LLMs
to enhance applicability and demonstrate the robustness of Prompt-OIRL. When deploying Prompt-OIRL
in practice, case-by-case tuning of the reward-modeling hyper-parameters should further boost
the performance of the algorithm, yet this is out of the scope of our research.
To enhance replicability, we use the following hyper-parameters for the gradient boosting
model (Chen et al., 2015) in all experiment settings:
param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}
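A minimal sketch of how these hyper-parameters could be plugged into the standard XGBoost training API is shown below; the feature matrix, labels, and number of boosting rounds are placeholders, not values from our released configuration.

```python
import numpy as np
import xgboost as xgb

param = {"max_depth": 10, "eta": 0.001, "objective": "binary:logistic"}

X = np.random.rand(1024, 3072)           # placeholder features: concatenated query/prompt embeddings (2 x 1536)
y = np.random.randint(0, 2, size=1024)   # placeholder binary correctness labels

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(param, dtrain, num_boost_round=200)   # num_boost_round chosen arbitrarily here
scores = booster.predict(xgb.DMatrix(X))                  # predicted probability of a correct answer
```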
The Nearest Neighbor Method Besides the parametric approaches discussed above, we agree
with our reviewer that “a simpler, more straightforward query-dependent model, such as a nearest-
neighbor based solution, could be considered. This would involve, for each query, locating the
closest match whose prompt yields a correct answer and utilizing that prompt for the new query.”
In this paragraph, we discuss and experiment with such an idea.
Analytically, although the Nearest-Neighbor approach can be an alternative to a parameterized model
when selecting from the training prompts, it cannot generalize to unseen prompts due to a lack of
support.
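A minimal sketch of this nearest-neighbor baseline (our illustration, with hypothetical variable names) is given below: for each test query it scans training queries from closest to farthest and reuses the best prompt of the first neighbor that was answered correctly.

```python
import numpy as np

def nearest_neighbor_prompt(test_query_emb, train_query_embs, rewards):
    """Return the index of the prompt to reuse for a test query.
    `rewards[i, j]` is 1 if prompt j answered training query i correctly."""
    order = np.argsort(np.linalg.norm(train_query_embs - test_query_emb, axis=1))
    for i in order:                       # closest training query first
        if rewards[i].max() > 0:          # this neighbor has a successful prompt
            return int(np.argmax(rewards[i]))
    return 0                              # fall back to the first training prompt
```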
Table 6: Comparison of different reward model approaches: the Nearest-Neighbor method, instance-wise
prediction of outcome, and pair-wise prediction.
# Training Prompts 1 2 3 4 5 6
BoTr Eqn(1) 0.6598 0.6988 0.7118 0.7155 0.7173 0.7175
BoTr Eqn(2) 0.6598 0.7742 0.8150 0.8350 0.8473 0.8546
BoTr Eqn(2) Paired 0.6598 0.7195 0.7203 0.7194 0.7236 0.7247
Nearest-Neighbor 0.6598 0.7856 0.8242 0.8339 0.8492 0.8530
Prompt-OIRL 0.7637 0.8032 0.8379 0.8750 0.8916 0.8944
It can be concluded that the pair-wise reward model cannot achieve better performance than the
direct reward modeling used in Prompt-OIRL.
All our offline datasets and the code for processing those datasets, as well as the MLP implementation
and the pair-wise reward modeling implementation, will be made publicly available and contributed as
an asset for future research.
Further investigation into hyper-parameter tuning may lead to further improvement, especially in the
pursuit of higher precision values on imbalanced prompt demonstration datasets (e.g., when using
smaller LLMs like LLaMA-7B for more challenging tasks like GSM8K).
The following 10 prompts are used as a held-out test set for offline prompt evaluation. To enable
performance evaluation on held-out prompts, we collect interaction logs on each language model for
10 held-out novel prompting strategies. We will release the corresponding prompted answers on test
queries for those held-out prompts as part of our offline dataset.
3. Consider this as a puzzle, each piece contributing to the final answer. Let’s place each
piece, one by one:
4. Three scholars are analyzing this query from various perspectives, working collaboratively
to build a comprehensive answer. Each contributes a step:
5. Let’s solve this like a detective would solve a mystery, gathering clues and building up to
the final solution step by step:
6. Imagine we’re navigating a maze; each decision brings us closer to the center. Let’s map
our route:
7. Envision a round table meeting of expert problem solvers. Each participant suggests a step,
building towards a consensus answer:
8. Like an architect constructing a building, let’s design our answer carefully, layer by layer:
9. As if we are assembling a complex machine, let’s put it together piece by piece:
10. Three wise philosophers are debating this question, each contributing a different aspect of
the answer. Let’s follow their discourse:
In order to further verify the generalization ability of the learned reward model, we extend to
more held-out prompts. We use GPT-4 to generate those prompts through in-context learning.
Specifically, we provided the 6 human-crafted prompting strategies to GPT-4 and asked
GPT-4 to generate other potential prompting strategies. Our original chat history and prompts
are available in the following anonymous link: https://chat.openai.com/share/
2c7652d2-2f96-48e8-b34f-efa4b15f8a61.
1. Like a bee pollinating flowers, let’s gather the essence from each point:
2. Picture this as a chain. Each link strengthens the whole:
3. As a librarian categorizes books, let’s sort the information accordingly:
4. This is like unfolding origami to understand each crease and fold:
5. Imagine it as an echo in a valley, every sound has an origin and meaning:
6. Let’s approach this like a chef creating a new recipe, ingredient by ingredient:
7. As a cartographer maps terrains, let’s chart the nuances and details:
8. Like an athlete training for an event, every exercise has a purpose:
9. Think of it as wind chimes, each note contributing to the melody:
10. It’s like flying a kite. Every tug and adjustment affects its flight:
11. Imagine we’re in a lab, every test and observation is crucial for the conclusion:
12. Like a blacksmith forging metal, let’s shape our understanding with precision:
13. As a botanist identifies plants, let’s classify each detail:
14. This is like a radio tuning to different frequencies, let’s find the right wavelength:
15. Imagine it’s a theater play. Scene by scene, we unfold the plot:
16. Let’s dive into this like a marine biologist exploring the ocean’s depths:
17. As an electrician wires a circuit, each connection powers the system:
18. This is like a gardener pruning a plant, every cut makes it flourish:
19. Think of it as the gears of a bicycle, each turn propelling us forward:
20. Like a mountain climber uses tools, let’s leverage each resource for understanding:
21. As a composer orchestrates a symphony, let’s integrate each instrument:
22. This is like a pot of stew, simmering to meld the flavors:
23. Imagine building a sandcastle, each grain matters for the structure:
24. It’s like peeling an onion, layer by layer, revealing the core:
25. Like a captain steering a ship, each decision adjusts our course:
26. As a mason lays bricks, every placement supports the structure:
71. As a detective collecting clues for a case, let’s unravel the mystery:
72. Like a novelist crafting a story, let’s develop our narrative chapter by chapter:
73. Envision assembling a jigsaw puzzle, connecting pieces to see the whole picture:
74. Imagine sculpting from a block of marble; every chisel matters. Let’s sculpt:
75. As a chemist mixing solutions, each ingredient alters the result. Let’s mix:
76. This is like tuning an instrument. Every adjustment leads to harmony:
77. Like an actor rehearsing a script, let’s understand our lines and their meaning:
78. Think of it as navigating a map. Each route we explore gives more insight:
79. As if we’re crafting a potion, each herb and element has a purpose. Brew with me:
80. Let’s approach this like a mathematician proving a theorem, step by logical step:
81. Picture us on a safari, observing every species. Detail by detail, we document:
82. Like setting up dominoes for a cascade, each placement is crucial. Let’s set up:
83. Imagine you’re a jeweler, evaluating a gem. Every facet reflects light differently:
84. Like a historian deciphering an old text, let’s understand its context:
85. Think of this as a musical composition. Each note leads to the next movement:
86. It’s like setting a table for a grand feast. Every detail adds to the ambiance:
87. As a photographer captures moments, let’s focus on each element:
88. Imagine we’re tailoring a suit. Each stitch, cut, and measurement counts:
89. This is like assembling a watch; every gear and spring is vital. Let’s assemble:
90. Envision ourselves as geologists, studying layers of rock. Layer by layer, we analyze:
91. Like drafting an architectural blueprint, every line and measurement matters:
92. As if we’re brewing a perfect cup of tea, each ingredient and timing is crucial:
93. This is like a relay race. Each leg of the race builds upon the last:
94. Think of it as mixing colors for a painting. Every shade adds depth and nuance:
95. It’s like being a conductor, ensuring each instrument plays its part:
96. Imagine decoding an encrypted message. Every symbol has a meaning:
97. Like planning a trip, each destination and route makes the journey:
98. This is like lighting a sequence of candles. Each one illuminates more:
99. Envision it as a waterfall, each drop contributing to the flow:
100. Like a potter shaping clay, let’s mold our understanding step by step:
Our code, as well as the offline datasets, will be released and made openly accessible. During the review
process, we provide the source code in the supplementary material. We highlight that Prompt-OIRL
can be reproduced within a few hours on a single laptop using only a CPU. With our implementation,
conducting OIRL for GSM8K takes 50 minutes on a MacBook Air with an 8-core M2 chip, and
takes only 5 minutes on a server using 16 (out of 64) cores of an AMD 3995WX CPU.
We use Best of Train to denote the best performance of training prompts (i.e., out of the 6 training
prompts listed in Table 2 of our paper) on the test queries, and use Best of Test to denote the best
performance of the first 10 held-out test prompts. Details of those test prompts are provided in
Appendix C.3.
Assuming we have an oracle reward model (i.e., at test time, we can always pick the
prompt that achieves a correct answer), this corresponds to an empirical performance upper
bound of Prompt-OIRL.
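Concretely, this oracle number can be read directly off the logged reward matrix: a test query counts as solved whenever at least one candidate prompt yields a correct answer (a minimal sketch using the binary reward layout described earlier).

```python
import numpy as np

def oracle_performance(rewards):
    """Empirical upper bound: fraction of test queries for which at least
    one candidate prompt yields a correct answer (rewards: M x K binary)."""
    return float(np.mean(rewards.max(axis=1)))

# Example: 3 test queries, 2 candidate prompts.
rewards = np.array([[0, 1],    # solved by the second prompt only
                    [0, 0],    # not solved by any prompt
                    [1, 1]])   # solved by both prompts
print(oracle_performance(rewards))   # 2 of 3 queries solvable -> about 0.667
```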
Take-Away Messages
1. The optimal prompt is highly relevant to the task and LLM used. In 3 out of the 9 settings,
the optimal test-time prompting strategy exists in the test prompts, rather than the training
prompts generated by previous research.
2. The oracle performance can be limited by the ability of LLMs.
3. Increasing the diversity of prompts improves the oracle performance.
Using the embedding models to get prompt or query embeddings is in general much cheaper than
calling the LLMs to get responses. For instance, the gpt-3.5-turbo API charges 0.001 USD per
1k tokens for input and 0.002 USD per 1k tokens for output; as a comparison, the embedding model
ada v2 charges only 0.0001 USD per 1k tokens.
Moreover, we would like to note that those prompt embeddings are re-usable. For instance, getting the
embedding for 100 prompts like "let's try to solve it step by step, and make sure to get the correct
answer" will cost about 100 prompts * 20 tokens per prompt * 0.0001 USD / 1000 tokens = 0.0002
USD.
In comparison, interaction with the LLMs to get the prompted response is much more expensive. A
single query-answer pair leads to a cost of approximately 0.0006 USD, and evaluating the performance
of many candidate prompts over an entire query set multiplies this cost accordingly.
D EXTENDED DISCUSSION
D.1 FUTURE WORK
While our research primarily centers on arithmetic reasoning tasks, we wish to underscore the versa-
tility of Prompt-OIRL’s insights for broader applications, especially where there exists a prompting
demonstration dataset accompanied by ratings of the prompted responses. As a hypothetical ap-
proach to dataset construction with human annotators incorporated into the process, consider this:
human annotators could employ LLMs to accomplish specific tasks. They might offer multiple
prompts as instructions for the task, and the ensuing LLM responses can then be graded based on
proficiency in executing the given task. In fact, these annotators could be everyday LLM users keen
on evaluating diverse responses. We earmark this intriguing concept for subsequent exploration.
D.2 A CALL TO THE FIELD: RELEASE THE INTERACTIVE LOGS WITH LLMS AS A VALUABLE ASSET.
To apply the key insight of Prompt-OIRL to more general natural language generation tasks, the
challenge lies in the open accessibility of the offline datasets.
Intuitively, one may think the challenge lies in the scarcity of prompting knowledge (i.e., the number
of expert-crafted prompts). Yet as a matter of fact, an important observation we have made on the
arithmetic reasoning datasets is that learning a reward model for query-dependent prompt evaluation
and optimization is helpful even in the scarce-expert-knowledge setting. Therefore, the scarcity of
expert-crafted prompts is not a vital challenge.
On the other hand, the accessibility of open-sourced offline datasets — in the form of (query,
prompt, prompted response) tuples — is more important. In the era of LLMs, only a few researchers
release their prompted responses when releasing their discovered prompting strategies. Even with
access to their code, fully reproducing their results still requires intensive interaction with the LLMs,
leading to a huge cost.
Motivated by the discovery made by Prompt-OIRL, those offline datasets — not only those inter-
active communication logs with LLMs for the discovered well-performing prompts but also those
logs for the failed prompting attempts — can be useful for learning a better reward model. There-
fore, with the success of Prompt-OIRL, we would like to call on the community to release not only
code but also the interaction logs (i.e., offline demonstration dataset) as an asset. Learning with those
demonstrations saves energy and computation by avoiding repetition, and enhances the development
of prompt engineering.
In this paragraph, we note that online methods are the foundation of our query-dependent
offline approach, and additional prompts discovered by online algorithms can improve
the performance of Prompt-OIRL.
Analytically, we emphasize that Prompt-OIRL is not in direct competition with online methods but
rather builds upon their generated knowledge. This approach is akin to standing on the shoulders of
giants; it improves upon existing prompts by considering query dependency at test time, whether
they are generated by online algorithms or by domain experts.
To see this, observe from Figure 6 that an increase in the number of prompts correlates
with consistent improvements in performance. This demonstrates Prompt-OIRL's capacity to draw
valuable insights from online methods and exhibit monotonically better results.
Synthetically, assuming an oracle reward model, the incorporation of more prompts at test time con-
sistently enhances the performance upper bound (c.f. Table 8). This holds true even if the additional
prompts sourced from online methods are suboptimal. Thus, our study highlights the importance
of diverse prompting strategies. This concept echoes our initial motivating example where different
queries benefit from varying prompts, a finding supported in the concurrent literature Yang et al.
(2023a).
Empirically, we underpinned our reasoning with experimental evidence. Table 9 contextualizes the
averaged performance improvement over different LLMs and tasks. We use K=6/K=1 to highlight
the improvement achieved using 6 training prompts as compared to only using 1 prompt. We use
OIRL/Eqn(2) to highlight the improvement achieved using additional test prompts as compared to
only selecting from the training prompts (i.e., the improvement of Prompt-OIRL over BoTr Eqn(2)).
(1). Increasing the number of prompts sourced from online algorithms during training consistently
enhances the performance of Prompt-OIRL and other baseline methods. This improvement under-
scores the value of enriched demonstration data derived from online algorithms. (2). Increasing the
number of prompts sourced from online algorithms during testing leads to consistent improvements
in query-dependent prompt optimization. This is clearly observed when comparing the outcomes of
Prompt-OIRL — which selects the most effective prompt from both training and testing sets using
its learned reward model — against BoTr Eqn(2), which limits its selection to training prompts. The
performance gains observed in Prompt-OIRL affirm the benefits of integrating additional prompts
from online sources.
Table 9: Adding prompts to either the training or the test set improves the performance of Prompt-OIRL.
Method K=1 K=2 K=3 K=4 K=5 K=6 K=6/K=1
LLM-Conf. 0.4751 0.5184 0.5427 0.5543 0.5594 0.5601 + 17.89%
BoTr Eqn(1) 0.4751 0.5379 0.5654 0.5800 0.5893 0.5944 + 25.11%
BoTr Eqn(2) 0.4751 0.5553 0.5910 0.6130 0.6295 0.6393 + 34.56%
Prompt-OIRL 0.5904 0.6095 0.6225 0.6331 0.6410 0.6459 + 9.4%
OIRL/Eqn(2) +24.27% +9.76% +5.33% +3.23% +1.83% +1.03% -
E ADDITIONAL EXPERIMENTS
E.1 PROMPT-OIRL UNDER DIFFERENT DATA AVAILABILITY
To stress-test Prompt-OIRL under such data-scarcity settings, we experiment with the SVAMP
dataset using a demonstration dataset from GPT-3.5-turbo and remove different proportions of
the training data. We compare results when removing [0.0, 0.3, 0.5, 0.7, 0.8, 0.9] of the training dataset,
which has 15000 samples. The following table presents the results when using the 6 prompts. We
compare BoTr Eqn.(1) and LLM confidence as data-size-agnostic baselines, and BoTr Eqn.(2) as
a data-size-dependent baseline.
From these results, it can be observed that Prompt-OIRL consistently achieves better performance
than using query-agnostic prompting strategies, using LLM confidence as the baseline, or using
Equation (2) and the reward model to choose from the training prompts.
The generalization of Prompt-OIRL over tasks can have great value in practical deployments. To
verify the generalization ability of the learned reward model, we experiment by using reward models
that are trained on different tasks in Prompt-OIRL when evaluating and optimizing prompts for the
MAWPS task with GPT3.5-turbo. The following table shows the results we get using the three
different reward models trained with MAWPS, SVAMP, and GSM8K, individually.
It is worth noting that, under such a cross-task setting, the BoTr Eqn(1) baseline selects the best
prompt according to its performance on the training dataset of the reward model. For
instance, when generalizing the reward model learned from the GSM8K prompting demonstration
dataset to the MAWPS task, its Best-of-Training prompt is chosen to be the best-performing prompt
on the GSM8K dataset.
According to the results, we find that the reward model learned with the SVAMP dataset generalizes
better to the MAWPS task than the reward model learned with the GSM8K dataset. This is not
surprising, as those two tasks are more similar to each other than to GSM8K, which always requires
multi-step reasoning. Importantly, the performance of Prompt-OIRL consistently surpasses the
best-of-training baselines, demonstrating its potential in practical deployments where generalization
over tasks is considered an essential advantage.
We additionally experiment with GPT-4 on the MAWPS dataset. With a stronger language model,
the reward modeling step of Prompt-OIRL becomes more challenging, due to the increasing imbalance
of the data used for training the reward model (i.e., most prompts lead to a correct answer). That
said, even under such a more challenging setting, the efficacy of Prompt-OIRL and the query-dependent
objective based on the learned reward model is demonstrated.
We provide detailed performance comparisons on all datasets and LLMs used in the experiments.
Averaging the performance yields Figure 6. In most cases, Prompt-OIRL consistently
achieves superior performance, especially when the number of training prompts is limited. Additionally,
it is not surprising that in some cases, when the number of training prompts is sufficient, selecting
the best prompt from training, rather than seeking further optimization on held-out prompts,
can be the best strategy.
That said, we can also observe some failure cases of Prompt-OIRL, when the weaker LLMs are trying
to solve the most challenging tasks. In those cases, the difficulties mainly result from the inability
to accurately evaluate prompts — for more details, please refer to the following section.
Accuracy We present the accuracy of different methods on different models and datasets in Figure 21 - Figure 23.
The learned reward model consistently achieves higher prediction accuracy than using
LLMs as critics. The prediction accuracy does not change much as a function of the number of
training prompts.
Precision We present the precision of different methods on different models and datasets in Figure 24 - Figure 26.
In most cases, using the learned reward model achieves significantly higher
precision than using LLMs, meaning there is a higher probability that a prompt predicted to perform
well is able to get a correct answer — the basis of our choice of precision as the metric.
However, the reward model has difficulty predicting the performance of LLMs with lower
capabilities in arithmetic reasoning: when LLaMA-2-7B-Chat and TigerBot-13B-Chat face
the most challenging GSM8K tasks, their answers to the queries are wrong in most cases.
Achieving high precision — in such cases — becomes more challenging. In this work, we set
hyper-parameters uniformly to enhance reproducibility and demonstrate the superiority of the proposed
method with minimal engineering optimization. Yet in practice, such an issue can potentially be
further alleviated through, e.g., sub-sampling or re-balancing the training dataset.
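As an illustration of the mitigation mentioned above (not part of the reported experiments), the snippet below sub-samples the majority class to re-balance a binary demonstration dataset and shows the precision computation used as our metric; all data shown are toy placeholders.

```python
import numpy as np
from sklearn.metrics import precision_score

def rebalance(X, y, rng=None):
    """Sub-sample the majority class so that both labels appear equally often,
    which can ease precision issues on heavily imbalanced demonstration data."""
    rng = rng or np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return X[keep], y[keep]

# Precision: of the prompts predicted to perform well, how many indeed yield
# a correct answer?  (Toy labels/predictions for illustration only.)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(precision_score(y_true, y_pred))   # 2 correct of 3 positive predictions ~ 0.667
```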
G CASE STUDIES
G.1 OPTIMAL PROMPTING STRATEGY IS QUERY DEPENDENT
In this section, we provide case studies to demonstrate that expert-crafted prompting strategies are
verified and optimized at a statistical level, rather than an instance-wise level. There exist many
cases where some of those prompts work better than others on one query, but worse on another. We
now present results on the GSM8K dataset with GPT-3.5-turbo (v.2023.05).
PROMPTED ANSWERS
No prompt: The total cost of the five cell phones is $150 x 5 = $〈〈150*5=750〉〉750. The
interest for each unit is $150 x 2/100 = $〈〈150*2/100=3〉〉3. So, the total interest for the five
units is $3 x 5 = $〈〈3*5=15〉〉15. Therefore, Shiela will pay $750 + $15 = $〈〈750+15=765〉〉765
for the five cell phones. Hence, Shiela will pay $765/3 = $〈〈765/3=255〉〉255 each month for
3 months. Answer: 255 .
CoT prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.
APE prompt: First, let’s calculate the total cost of the five cell phones. Since each phone
costs $150, the total cost would be 5 x $150 = $〈〈5*150=750〉〉750. Next, let’s calculate
the interest for each unit. The interest for each unit is 2% of $150, which is 0.02 x $150 =
$〈〈0.02*150=3〉〉3. Now, let’s calculate the total interest for all five units. The total interest
would be 5 x $3 = $〈〈5*3=15〉〉15. Finally, let’s calculate the total amount Shiela needs to pay
each month for 3 months. The total amount would be the cost of the phones plus the interest,
which is $750 + $15 = $〈〈750+15=765〉〉765. Therefore, Shiela needs to pay $765 each month
for 3 months.
ToT prompt: Expert 1: To find out how much Shiela will pay each month for 3 months, we
need to calculate the total cost of the cell phones first. Since she bought five cell phones for
$150 each, the total cost would be $150 x 5 = $750.
Expert 2: Now, let’s calculate the interest charged for each unit. Since the interest rate is 2%,
we can calculate the interest for each unit as $150 x 0.02 = $3.
Expert 3: Finally, to find out how much Shiela will pay each month for 3 months, we need to
divide the total cost by the number of months. So, Shiela will pay $750 / 3 = $250 each month
for 3 months.
PROMPTED ANSWERS
No prompt: Mary currently has 40 window ledges with 2 potted plants each, which totals 40
* 2 = 〈〈40*2=80〉〉80 potted plants. If she received 18 new potted plants, her total number of
potted plants will be 80 + 18 = 〈〈80+18=98〉〉98 potted plants. If she gives away 1 potted plant
from each ledge, she will give away 40 * 1 = 〈〈40*1=40〉〉40 potted plants. Therefore, Mary
will remain with 98 - 40 = 〈〈98-40=58〉〉58 potted plants. Answer: 58 .
CoT prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.
APE prompt: Step 1: Calculate the total number of potted plants Mary currently has on her
window ledges. Mary has 2 potted plants on each of the 40 window ledges, so she has 2 * 40
= 〈〈2*40=80〉〉80 potted plants.
Step 2: Calculate the total number of potted plants Mary will give away. Mary plans to give
away 1 potted plant from each of the 40 window ledges, so she will give away 1 * 40 =
〈〈1*40=40〉〉40 potted plants.
Step 3: Calculate the number of potted plants Mary will remain with. Mary currently has
80 potted plants and plans to give away 40 potted plants, so she will remain with 80 - 40 =
〈〈80-40=40〉〉40 potted plants.
Therefore, Mary will remain with 40 potted plants.
ToT prompt: Expert 1: First, let’s calculate the total number of potted plants that Mary cur-
rently has. She has 2 potted plants on each of the 40 window ledges, so that would be 2 x 40 =
80 potted plants.
Expert 2: Next, let’s calculate the number of potted plants that Mary will give away to friends
and family. Since she has decided to give away 1 potted plant from each ledge, she will give
away a total of 40 potted plants.
Expert 3: Finally, let’s calculate the number of potted plants that Mary will remain with. She
initially had 80 potted plants and gave away 40, so she will remain with 80 - 40 = 40 potted
plants.
PROMPTED ANSWERS
No prompt: Debra saw 30/2 = 〈〈30/2=15〉〉15 bees return to the hive in the next 6 hours. Then
she saw 2*30 = 〈〈2*30=60〉〉60 bees leave the hive in the next 6 hours. So she saw 60+30 =
〈〈60+30=90〉〉90 bees leave the hive in total. Therefore, she saw 90-15 = 〈〈90-15=75〉〉75 bees
return to the hive in the last 6 hours of the day. Answer: 75 .
CoT prompt: Step 1: Debra sees 30 bees leave the hive in the first 6 hours. Step 2: Debra sees
1/2 that many bees return in the next 6 hours, so she sees 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: Debra sees two times as many bees as she saw leave the hive fly from the hive and
leave in the next 6 hours, so she sees 30*2 = 〈〈30*2=60〉〉60 bees leave. Step 4: Every bee that
left before that hadn’t already returned returns to the hive in the next 6 hours, so Debra sees
30+15 = 〈〈30+15=45〉〉45 bees return. Answer: 45 .
APE prompt: Step 1: Debra saw 30 bees leave the hive in the first 6 hours. Step 2: In the
next 6 hours, she saw 1/2 that many bees return, so she saw 30/2 = 〈〈30/2=15〉〉15 bees return.
Step 3: She saw two times as many bees as she saw leave the hive in the first 6 hours leave in
the next 6 hours, so she saw 2 * 30 = 〈〈2*30=60〉〉60 bees leave. Step 4: In the last 6 hours,
every bee that left before that hadn’t already returned returns to the hive. Since 30 bees left in
the first 6 hours and 60 bees left in the next 6 hours, a total of 30 + 60 = 〈〈30+60=90〉〉90 bees
left before the last 6 hours. Step 5: Therefore, Debra saw 90 bees return to the hive in the last
6 hours of the day.
ToT prompt: Expert 1: First, let’s calculate how many bees returned to the hive in the second
6-hour period. We know that Debra saw 30 bees leave the hive in the first 6 hours. So, in the
next 6 hours, she saw 1/2 of that number return. Therefore, she saw 30/2 = 15 bees return to
the hive in the second 6-hour period.
Expert 2: In the next 6-hour period, Debra saw two times as many bees leave the hive as she
saw in the first 6 hours. So, if she saw 30 bees leave in the first 6 hours, she saw 2 * 30 = 60
bees leave in the next 6 hours.
Expert 3: Now, let’s calculate how many bees returned to the hive in the last 6 hours of the
day. We know that every bee that left before that hadn’t already returned returns to the hive in
the next 6 hours. So, the total number of bees that returned in the last 6 hours is 30 + 15 + 60
= 105 bees.
PROMPTED ANSWERS
No prompt: To find out how fast John runs, we need to divide the total distance he runs by the
total time he spends running.
First, let’s calculate the total distance John runs in a week. Since he runs 60 miles a week and
runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day.
Next, let’s calculate the total time John spends running. On the first day, he runs for 3 hours,
and on the other two days, he runs for half as much, which is 3/2 = 〈〈3/2=1.5〉〉1.5 hours each
day. So the total time John spends running in a week is 3 + 1.5 + 1.5 = 〈〈3+1.5+1.5=6〉〉6
hours.
Finally, let’s calculate John’s speed by dividing the total distance he runs by the total time he
spends running: 20 miles / 6 hours = 〈〈20/6=3.33〉〉3.33 miles per hour.
Therefore, John runs at a speed of 3.33 miles per hour.
CoT prompt: Step 1: Calculate the total distance John runs in a week. John runs 60 miles a
week.
Step 2: Calculate the total number of hours John runs in a week. John runs 3 days a week. On
the first day, he runs 3 hours. On the other two days, he runs half as much, which is 3/2 = 1.5
hours. So, the total number of hours John runs in a week is 3 + 1.5 + 1.5 = 6 hours.
Step 3: Calculate John’s average speed. Average speed = Total distance / Total time Average
speed = 60 miles / 6 hours Average speed = 10 miles per hour
Therefore, John runs at an average speed of 10 miles per hour.
APE prompt: First, let’s find out how many miles John runs each day. Since he runs 60 miles
a week and runs 3 days a week, he runs 60/3 = 〈〈60/3=20〉〉20 miles each day. Next, let’s find
out how long John runs each day. We know that he runs 3 hours the first day and half as much
the other two days. So, the second and third days he runs 3/2 = 〈〈3/2=1.5〉〉1.5 hours each day.
Finally, let’s calculate John’s speed. We know that speed is equal to distance divided by time.
So, on the first day, his speed is 20 miles / 3 hours = 〈〈20/3=6.67〉〉6.67 miles per hour. On
the second and third days, his speed is 20 miles / 1.5 hours = 〈〈20/1.5=13.33〉〉13.33 miles per
hour. Therefore, John runs at a speed of 6.67 miles per hour on the first day and 13.33 miles
per hour on the second and third days.
ToT prompt: Expert 1: To find out how fast John runs, we need to calculate his average speed.
We know that he runs a total of 60 miles in a week, so we can divide that by the total number
of hours he runs to get his average speed.
Expert 2: John runs 3 days a week, with the first day being 3 hours and the other two days
being half as much. So, the total number of hours he runs in a week is 3 + (0.5 * 2) = 4 hours.
Expert 3: Now we can calculate John’s average speed by dividing the total distance he runs
(60 miles) by the total time he runs (4 hours). So, his average speed is 60 miles / 4 hours = 15
miles per hour.
Therefore, John runs at an average speed of 15 miles per hour.
PROMPTED ANSWERS
No prompt: To find out how many grams of chips you can eat, you need to calculate the remaining
calories you can consume and then convert it to grams. First, calculate the remaining calories you can
consume: Daily calorie target - Consumed calories = Remaining calories 2000 calories - 1800 calories =
200 calories Next, calculate the remaining servings of chips you can eat: Remaining calories ÷ Calories
per serving = Remaining servings 200 calories ÷ 250 calories per serving = 0.8 servings Finally, calculate
the remaining grams of chips you can eat: Remaining servings × Grams per serving = Remaining grams
0.8 servings × 300 grams per serving = 240 grams
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
CoT prompt: Step 1: Calculate the total number of calories in the bag of chips. Since the bag has 5
servings and each serving has 250 calories, the total number of calories in the bag is 5 servings * 250
calories/serving = 1250 calories.
Step 2: Calculate the remaining number of calories you can consume. If your daily calorie target is 2000
and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories remaining.
Step 3: Calculate the maximum number of grams you can eat from the bag of chips. To calculate the
maximum number of grams you can eat, you need to find out how many calories are in 1 gram of chips.
Divide the total number of calories in the bag (1250 calories) by the weight of the bag (300 grams) to
find the calories per gram: 1250 calories / 300 grams = 4.17 calories/gram.
Now, divide the remaining number of calories (200 calories) by the calories per gram (4.17 calories/gram)
to find the maximum number of grams you can eat: 200 calories / 4.17 calories/gram = 48 grams.
Therefore, you can eat a maximum of 48 grams of chips from the bag.
APE prompt: First, we need to calculate the total number of calories in the bag of chips. Since there are
5 servings in a 300g bag and each serving has 250 calories, the total number of calories in the bag is 5
servings * 250 calories/serving = 1250 calories.
Next, we need to calculate how many calories you have left to consume for the day. If your daily calorie
target is 2000 and you have already consumed 1800 calories, then you have 2000 - 1800 = 200 calories
left to consume.
To determine how many grams of chips you can eat, we need to find out how many servings of chips
equal 200 calories. Since each serving has 250 calories, we can calculate this by dividing 200 calories by
250 calories/serving: 200 calories / 250 calories/serving = 0.8 servings.
Finally, we can calculate the grams of chips you can eat by multiplying the number of servings by the
serving size. Since each serving is 300g, we can calculate this by multiplying 0.8 servings by 300g/serv-
ing: 0.8 servings * 300g/serving = 240g.
Therefore, you can eat 240 grams of chips if your daily calorie target is 2000 and you have already
consumed 1800 calories.
ToT prompt: Expert 1: Let’s start by calculating the total calories in the bag of chips. Since there are 5
servings in a 300g bag, and each serving has 250 calories, the total calories in the bag would be 5 servings
* 250 calories per serving = 1250 calories.
Expert 2: Now, let’s calculate the remaining calories you can consume based on your daily calorie target.
If your daily calorie target is 2000 and you have already consumed 1800 calories, then you have 2000 -
1800 = 200 calories left.
Expert 3: To determine how many grams of chips you can eat, we need to find out how many calories are
in 1 gram of chips. Since the bag has 1250 calories and weighs 300g, we can calculate the calories per
gram by dividing the total calories by the weight: 1250 calories / 300g = 4.17 calories per gram.
Expert 1: Finally, to find out how many grams of chips you can eat, we divide the remaining calories
(200) by the calories per gram (4.17): 200 calories / 4.17 calories per gram = 47.94 grams.
Expert 2: Therefore, you can eat approximately 47.94 grams of chips if your daily calorie target is 2000
and you have already consumed 1800 calories.
PROMPTED ANSWERS
No prompt: Jean is 32 years old.
CoT prompt: First, let’s find out how old Mark is. We know that Jean is two years older than Mark, so
if we let Mark’s age be x, then Jean’s age would be x + 2.
Two years ago, Mark was 5 years older than half of Jan’s age. Two years ago, Jan would have been 30 -
2 = 28 years old. Half of Jan’s age two years ago would be 28/2 = 14 years old.
So, two years ago, Mark was 5 years older than 14, which means Mark’s age two years ago was 14 + 5 =
19 years old.
If Mark’s age two years ago was 19, then his current age would be 19 + 2 = 21 years old.
Since Jean is two years older than Mark, Jean’s age would be 21 + 2 = 23 years old.
Therefore, Jean is 23 years old.
APE prompt: If Jan is 30 years old, then two years ago she would have been 28 years old. If Mark was
5 years older than half of Jan’s age two years ago, then half of Jan’s age two years ago would have been
23 years old. If Mark was 5 years older than half of Jan’s age two years ago, then Mark would have been
28 + 5 = 33 years old two years ago. If Jean is two years older than Mark, then Jean would have been 33
+ 2 = 35 years old two years ago. If Jean was 35 years old two years ago, then Jean is currently 35 + 2 =
37 years old.
ToT prompt: Expert 1: Let’s start by finding Mark’s age. If two years ago Mark was 5 years older than
half of Jan’s age, then we can subtract 5 from Mark’s age two years ago to find half of Jan’s age two years
ago.
Expert 2: If Mark was 5 years older than half of Jan’s age two years ago, then half of Jan’s age two years
ago would be Mark’s age minus 5.
Expert 3: If half of Jan’s age two years ago was Mark’s age minus 5, then we can add 5 to half of Jan’s
age two years ago to find Mark’s age two years ago.
Expert 1: Now that we know Mark’s age two years ago, we can add 2 to find his current age.
Expert 2: Since Jean is two years older than Mark, we can add 2 to Mark’s current age to find Jean’s age.
Expert 3: Therefore, if Jan is 30 years old, Jean would be 32 years old.
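The question itself is not reproduced here; replaying the problem statement as paraphrased inside the CoT answer (Jan is 30, two years ago Mark was 5 years older than half of Jan's age at that time, and Jean is two years older than Mark) gives the following check. It is a sketch under that paraphrase, not part of the paper's pipeline.

```python
# Replaying the age reasoning under the problem statement paraphrased in the CoT answer above.
jan_now = 30
jan_two_years_ago = jan_now - 2                    # 28
mark_two_years_ago = jan_two_years_ago // 2 + 5    # 14 + 5 = 19
mark_now = mark_two_years_ago + 2                  # 21
jean_now = mark_now + 2                            # 23

print(jean_now)  # 23 -- matching the CoT answer; the other prompted answers differ.
```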
PROMPTED ANSWERS
No prompt: To find out how far Blake runs, we need to multiply the number of times he runs back and
forth by the length of the football field: 15 times * 100 yards/time = 〈〈15*100=1500〉〉1500 yards To find
out how far Kelly runs, we need to multiply the number of times she runs back and forth by the length
of the football field: 34 times * 40 yards/time = 〈〈34*40=1360〉〉1360 yards The difference in distance is
1500 yards - 1360 yards = 〈〈1500-1360=140〉〉140 yards. Answer: 140 .
CoT prompt: Step 1: Calculate the total distance Blake runs. Blake runs back and forth 15 times, so he
runs a total of 15 * 100 = 〈〈15*100=1500〉〉1500 yards.
Step 2: Calculate the total distance Kelly runs. Kelly runs back and forth once, so she runs a total of
1 * 100 = 〈〈1*100=100〉〉100 yards. Then she runs to the 40-yard line and back 34 times, so she runs
an additional 34 * 80 = 〈〈34*80=2720〉〉2720 yards. Therefore, Kelly runs a total of 100 + 2720 =
〈〈100+2720=2820〉〉2820 yards.
Step 3: Calculate the difference in distance between the winner and the loser. The winner runs 2820 -
1500 = 〈〈2820-1500=1320〉〉1320 yards farther than the loser.
APE prompt: First, let’s calculate how far Blake runs in total. Since he runs back and forth 15 times,
and the football field is 100 yards long, Blake runs a total of 15 * 2 * 100 = 3000 yards.
Next, let’s calculate how far Kelly runs in total. She runs back and forth once, which is a total of 2 * 100
= 200 yards. Then she runs to the 40-yard line and back 34 times, which is a total of 34 * 2 * 40 = 2720
yards. So Kelly runs a total of 200 + 2720 = 2920 yards.
Finally, let’s calculate the difference in distance between the winner and the loser. The winner, in this
case, is Blake, who runs 3000 yards. The loser is Kelly, who runs 2920 yards. Therefore, the difference
in distance is 3000 - 2920 = 80 yards.
So the winner runs 80 yards farther than the loser.
ToT prompt: Expert 1: To find out how much farther the winner runs than the loser, we need to calculate
the total distance each person runs. Let’s start by calculating the distance Blake runs.
Expert 2: Blake runs back and forth 15 times on a 100-yard football field. So, the total distance Blake
runs is 15 times the length of the field, which is 15 * 100 = 1500 yards.
Expert 3: Now let’s calculate the distance Kelly runs. Kelly runs back and forth once on the football
field, which is a total distance of 2 * 100 = 200 yards. Then she runs to the 40-yard line and back 34
times, which is a total distance of 34 * 2 * 40 = 2720 yards.
Expert 1: So, the total distance Kelly runs is 200 + 2720 = 2920 yards.
Expert 2: To find out how much farther the winner runs than the loser, we subtract the total distance Blake
runs from the total distance Kelly runs. Therefore, the winner runs 2920 - 1500 = 1420 yards farther than
the loser.
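The prompted answers above disagree on whether "back and forth" covers one field length or two. A small sketch under the two-length reading used by the APE answer (an illustration only, with quantities taken from the transcripts):

```python
# Blake/Kelly example: distances under the "back and forth = two lengths" reading of the APE answer.
field = 100
blake = 15 * 2 * field                # 15 round trips on a 100-yard field = 3000 yards
kelly = 1 * 2 * field + 34 * 2 * 40   # one round trip plus 34 trips to the 40-yard line and back = 2920 yards

print(blake - kelly)  # 80 -- the APE answer; the other answers count only one length per pass.
```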
PROMPTED ANSWERS
No prompt: The total number of water type cards after losing 8 and buying 14 is 40 - 8 + 14 = 46. The
total number of cards is 30 + 46 + 20 = 96. The percentage chance of picking a water type card is (46/96)
* 100 = 47.92%. Rounded to the nearest integer, the percentage chance is 48%.
CoT prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards. After losing 8, he will have 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the number of water type cards remains the same at 32. Now, let’s
calculate the total number of cards James has after losing 8 water type cards and buying 14 grass type
cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards. After losing 8
water type cards, he will have 30 + 20 + 32 = 82 cards. The percentage chance of picking a water type
card is (32 / 82) * 100 = 39.02%.
Rounded to the nearest integer, the percentage chance is 39%.
APE prompt: First, let’s calculate the total number of water type cards James has after losing 8 and
buying 14. He initially had 40 water type cards, so after losing 8, he has 40 - 8 = 32 water type cards.
After buying 14 grass type cards, the total number of water type cards remains the same at 32.
Next, let’s calculate the total number of cards James has after losing 8 water type cards and buying 14
grass type cards. He initially had 30 fire type cards, 20 grass type cards, and 40 water type cards, so the
total number of cards is 30 + 20 + 40 = 90 cards. After losing 8 water type cards and buying 14 grass
type cards, the total number of cards becomes 90 - 8 + 14 = 96 cards.
Finally, let’s calculate the percentage chance of picking a water type card. The percentage chance is
calculated by dividing the number of water type cards by the total number of cards and multiplying
by 100. So, the percentage chance is (32 / 96) * 100 = 33.33%. Rounded to the nearest integer, the
percentage chance is 33%.
Therefore, the percentage chance that a randomly picked card will be a water type is 33%.
ToT prompt: Expert 1: To find the percentage chance of picking a water type card, we need to calculate
the total number of water type cards after James loses 8 and buys 14 grass type cards.
Expert 2: James initially has 40 water type cards. After losing 8 and buying 14 grass type cards, he will
have 40 - 8 + 14 = 46 water type cards.
Expert 3: To find the percentage chance, we need to divide the number of water type cards by the total
number of cards and multiply by 100. The total number of cards is 30 + 46 + 20 = 96. So the percentage
chance of picking a water type card is (46 / 96) * 100 = 47.92%.
Final answer: The percentage chance (rounded to the nearest integer) of picking a water type card is 48%.
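The answers above disagree on whether the 14 purchased cards are grass type or water type. A quick check under the grass-type reading used by the CoT and APE answers (a sketch for reference, not part of the paper's code):

```python
# Card example: percentage of water-type cards when the 14 purchased cards are grass-type.
fire, grass, water = 30, 20, 40
water -= 8          # loses 8 water-type cards
grass += 14         # buys 14 grass-type cards

total = fire + grass + water          # 30 + 34 + 32 = 96
pct_water = water / total * 100       # 32 / 96 ≈ 33.33

print(round(pct_water))  # 33 -- the APE answer; reading the purchase as water-type instead gives 48.
```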
PROMPTED ANSWERS
No prompt: To determine how many vines Steve needs, we need to calculate the number of
tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats 6 * 7 = 〈〈6*7=42〉〉42
tomatoes in a week. Since each vine can produce 3 tomatoes per week, Steve needs 42 / 3 =
〈〈42/3=14〉〉14 vines.
CoT prompt: 1. Steve eats 6 cherry tomatoes per day. 2. His girlfriend eats half as much
as Steve, so she eats 6/2 = 3 cherry tomatoes per day. 3. Together, they eat 6 + 3 = 9 cherry
tomatoes per day. 4. A vine can produce 3 cherry tomatoes per week. 5. To meet their daily
consumption, Steve needs 9 cherry tomatoes per day. 6. Therefore, he needs 9/3 = 3 vines to
meet their daily consumption.
APE prompt: First, we need to determine how many tomatoes Steve’s girlfriend eats per day.
Since Steve eats twice as much as his girlfriend, we can divide his daily consumption by 2: 6
tomatoes / 2 = 3 tomatoes per day.
Next, we need to determine how many tomatoes Steve and his girlfriend eat together per day:
6 tomatoes + 3 tomatoes = 9 tomatoes per day.
Now, we need to determine how many tomatoes Steve and his girlfriend eat together per week:
9 tomatoes per day * 7 days = 63 tomatoes per week.
Finally, we need to determine how many vines Steve needs to grow in order to produce 63
tomatoes per week. Since each vine produces 3 tomatoes per week, we can divide the total
number of tomatoes needed by the number of tomatoes produced per vine: 63 tomatoes / 3
tomatoes per vine = 21 vines.
Therefore, Steve needs to grow 21 vines in order to produce enough cherry tomatoes for his
increased consumption.
ToT prompt: Expert 1: To determine how many vines Steve needs, we first need to find out
how many tomatoes he eats in a week. Since he eats 6 tomatoes per day, he eats a total of 6 *
7 = 42 tomatoes in a week.
Expert 2: Now, let’s calculate how many tomatoes a single vine can produce in a week. If
a vine produces 3 tomatoes per week, then Steve would need 42 / 3 = 14 vines to meet his
weekly tomato consumption.
Expert 3: However, since Steve wants to grow cherry tomatoes, which are smaller than reg-
ular tomatoes, he might need more vines to produce the same amount. Let’s assume that
cherry tomatoes are half the size of regular tomatoes. In that case, Steve would need double
the number of vines, so he would need 14 * 2 = 28 vines to meet his weekly cherry tomato
consumption.
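The answers differ mainly in whose consumption the vines must cover. The sketch below replays the combined-consumption reading of the APE answer; since the question is not reproduced here, this is only one possible reading, not a definitive check.

```python
# Tomato example: vines needed under the combined-consumption reading of the APE answer.
steve_per_day = 6
girlfriend_per_day = steve_per_day / 2                    # half as much as Steve = 3 per day
weekly_need = (steve_per_day + girlfriend_per_day) * 7    # 9 * 7 = 63 tomatoes per week
vines = weekly_need / 3                                   # each vine yields 3 tomatoes per week

print(int(vines))  # 21 -- the APE answer; counting Steve alone gives 42 / 3 = 14 vines.
```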
PROMPTED ANSWERS
No prompt: Let’s break down the information given: Candice started with 80 post-it notes in her purse.
She placed a single post-it note on each of 220 cups of coffee. She had 23 post-it notes remaining overall.
To find out how many post-it notes were in the package she purchased, we need to subtract the number
of post-it notes she had remaining from the total number of post-it notes she started with and placed on
the cups of coffee. Total post-it notes = Starting post-it notes + Post-it notes placed on cups of coffee -
Post-it notes remaining Total post-it notes = 80 + 220 - 23 Total post-it notes = 300
Therefore, there were 300 post-it notes in the package that Candice purchased.
CoT prompt: Step 1: Calculate the total number of post-it notes Candice had before purchasing the
package. Candice had 80 post-it notes in her purse.
Step 2: Calculate the total number of post-it notes Candice used at work. Candice placed a single post-it
note on each of 220 different cups of coffee.
Step 3: Calculate the total number of post-it notes Candice had remaining after work. Candice had 23
post-it notes remaining overall.
Step 4: Calculate the total number of post-it notes Candice had before purchasing the package. Total
post-it notes before purchasing = post-it notes in purse + post-it notes remaining after work Total post-it
notes before purchasing = 80 + 23
Step 5: Calculate the number of post-it notes in the package that Candice purchased. Number of post-it
notes in the package = Total post-it notes before purchasing - post-it notes used at work Number of post-it
notes in the package = (80 + 23) - 220
Step 6: Calculate the final answer. Number of post-it notes in the package = 103 - 220 Number of post-it
notes in the package = -117
Therefore, Candice did not purchase any post-it notes.
APE prompt: First, we need to determine how many post-it notes Candice had before she purchased the
package. We know she started with 80 post-it notes and had 23 remaining, so she used 80 - 23 = 〈〈80-
23=57〉〉57 post-it notes at work. Next, we need to determine how many post-it notes were in the package
she purchased. We can do this by subtracting the number of post-it notes she had before purchasing the
package from the total number of post-it notes she used at work: 57 - 80 = 〈〈57-80=-23〉〉-23. Since we
can’t have a negative number of post-it notes, there must have been an error in our calculations. Let’s
try again. If Candice had 80 post-it notes before purchasing the package and used 57 at work, then she
must have had 80 - 57 = 〈〈80-57=23〉〉23 post-it notes remaining. Since we know she had 23 post-it notes
remaining overall, this means she used all the post-it notes she purchased at work. Therefore, the package
she purchased must have contained 220 - 23 = 〈〈220-23=197〉〉197 post-it notes.
ToT prompt: Expert 1: Let’s start by figuring out how many post-it notes Candice had before she
purchased the package. She initially had 80 post-it notes in her purse.
Expert 2: Next, we need to subtract the number of post-it notes she used at work. She placed a single
post-it note on each of the 220 cups of coffee. So, she used 220 post-it notes at work.
Expert 3: Now, we can calculate the total number of post-it notes Candice had remaining. We subtract
the number of post-it notes used at work (220) from the initial number of post-it notes (80). This gives
us 80 - 220 = -140.
Expert 1: Since Candice had 23 post-it notes remaining overall, we can set up an equation to find the
number of post-it notes in the package she purchased. Let’s call the number of post-it notes in the package
”x”. We can write the equation as -140 + x = 23.
Expert 2: To solve for x, we need to isolate it on one side of the equation. We can do this by adding 140
to both sides of the equation. This gives us x = 23 + 140 = 163.
Expert 3: Therefore, the number of post-it notes in the package that Candice purchased is 163.
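Reading the quantities off the transcripts (80 notes already in the purse, one purchased package, 220 notes placed on cups, 23 left over), the package size follows from a single balance, sketched below for reference only:

```python
# Post-it example: package size from the balance  80 + package - 220 = 23.
in_purse, used, left_over = 80, 220, 23
package = used + left_over - in_purse   # 220 + 23 - 80 = 163

print(package)  # 163 -- the ToT answer; the other answers drop the purchased package from the balance.
```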
PROMPTED ANSWERS
No prompt: Christina invited 16 friends, but 1/4 of them won’t show up, so 16 * 1/4 = 〈〈16*1/4=4〉〉4
friends won’t attend. So the number of guests who will attend is 16 - 4 = 〈〈16-4=12〉〉12 guests. She
needs .75 gift bags per guest, so she will need 12 * .75 = 〈〈12*.75=9〉〉9 gift bags. Each gift bag costs
$2, so she will spend 9 * $2 = $〈〈9*2=18〉〉18. Answer: 18 .
CoT prompt: Step 1: Calculate the number of guests who will actually attend the party. Since 1/4 of the
invited guests don’t show up, we need to subtract 1/4 of the total number of guests from the total number
of guests invited. 1/4 of 16 guests is (1/4) * 16 = 4 guests. So, the number of guests who will actually
attend the party is 16 - 4 = 12 guests.
Step 2: Calculate the number of gift bags needed. Since Christina needs 0.75 gift bags per guest, we can
multiply the number of guests by 0.75 to find the total number of gift bags needed. 0.75 * 12 guests = 9
gift bags.
Step 3: Calculate the total cost of the gift bags. Since each gift bag costs $2, we can multiply the number
of gift bags by the cost per gift bag to find the total cost. 9 gift bags * $2/gift bag = $18.
Therefore, Christina will spend $18 on gift bags for the birthday party.
APE prompt: First, we need to calculate how many guests will actually show up. We know that 1/4 of
the invited guests won’t attend, so we can calculate this by multiplying the number of invited guests by
3/4: 16 guests * 3/4 = 12 guests.
Next, we need to calculate how many gift bags Christina will need. We know that she needs 0.75 gift
bags per guest, so we can calculate this by multiplying the number of guests by 0.75: 12 guests * 0.75
gift bags = 9 gift bags.
Finally, we can calculate the total cost of the gift bags. We know that each gift bag costs $2, so we can
calculate this by multiplying the number of gift bags by the cost per gift bag: 9 gift bags * $2/gift bag =
$18.
Therefore, Christina will spend $18 on gift bags for her birthday party.
ToT prompt: Expert 1: To find out how many gift bags Christina needs, I’ll multiply the number of
invited guests by 0.75 since 1/4 of attendees won’t show up. Expert 2: So, if Christina invited 16 friends,
we’ll multiply 16 by 0.75 to get the number of gift bags needed. Expert 3: That means Christina needs
12 gift bags. To find out how much she will spend, I’ll multiply the number of gift bags by the cost per
bag, which is $2. Expert 1: So, Christina will spend 12 gift bags multiplied by $2 per bag, which equals
$24 in total.
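For reference, the arithmetic shared by the first three answers can be checked directly; the sketch below uses the quantities quoted in the transcripts (16 invitees, a 1/4 no-show rate, 0.75 gift bags per guest at $2 each) and is not part of the paper's pipeline.

```python
# Gift-bag example: cost for the guests who actually attend.
invited = 16
attending = invited * (1 - 1 / 4)     # 12 guests show up
bags = attending * 0.75               # 9 gift bags
cost = bags * 2                       # $18

print(cost)  # 18.0 -- matching the first three answers; the ToT answer skips the no-show step.
```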
PROMPTED ANSWERS
No prompt: To find the number of spoons in the package that Julia bought, we need to subtract the
spoons that Julia used and the spoons that her husband bought from the total number of spoons on the
table.
Julia used 3 spoons to sample her stew. Her husband bought 5 spoons. So, the total number of spoons
used is 3 + 5 = 8.
If there were 12 spoons on the table, and 8 of them were used, then the number of spoons in the package
that Julia bought is 12 - 8 = 4.
CoT prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used three of the spoons to sample her stew. 4. When she went to
set the table, she had a total of 12 spoons.
To find out how many spoons were in the package that Julia bought, we need to subtract the spoons used
for sampling and the spoons given by her husband from the total number of spoons on the table.
Let’s denote the number of spoons in the package Julia bought as ”x”.
So, the equation becomes: x - 3 - 5 = 12
Simplifying the equation: x - 8 = 12
Adding 8 to both sides of the equation: x = 20
Therefore, there were 20 spoons in the package that Julia bought.
APE prompt: 1. Julia bought a new package of spoons. 2. Her husband also bought a package of 5 new
spoons and gave them to Julia. 3. Julia used 3 spoons to sample her stew. 4. Julia had a total of 12 spoons
when she went to set the table.
To find out how many spoons were in the package that Julia bought, we can subtract the spoons used for
sampling and the spoons given by her husband from the total number of spoons she had.
Total spoons - spoons used for sampling - spoons given by husband = spoons in the package Julia bought
12 - 3 - 5 = spoons in the package Julia bought
Therefore, there were 4 spoons in the package that Julia bought.
ToT prompt: Expert 1: Let’s start by setting up an equation to represent the given information. Let’s
say the number of spoons Julia bought is x. Her husband bought 5 spoons, so the total number of spoons
they have is x + 5. Julia used 3 spoons to sample the stew, so the number of spoons left is (x + 5) - 3.
Expert 2: We know that Julia had a total of 12 spoons left after setting the table. So we can set up the
equation (x + 5) - 3 = 12.
Expert 3: Simplifying the equation, we have x + 2 = 12. To isolate x, we subtract 2 from both sides of
the equation. This gives us x = 10.
Expert 1: Therefore, Julia bought a package of 10 spoons.
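The answers above hinge on what the final count of 12 spoons refers to. A sketch under the ToT reading (the purchased package plus the husband's 5 spoons, minus the 3 used, equals the 12 on the table); this is an illustration of that reading only.

```python
# Spoon example: package size under the reading  package + 5 from husband - 3 used = 12 on the table.
from_husband, used, on_table = 5, 3, 12
package = on_table - from_husband + used   # 12 - 5 + 3 = 10

print(package)  # 10 -- the ToT answer; treating the 12 as pre-existing stock instead gives 4.
```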
In this section, we show cases where all expert-crafted prompts fail. In some of these cases, optimizing the prompt with Prompt-OIRL yields prompting strategies that obtain the correct response. The examples below are from the GSM8K dataset and use GPT-3.5-turbo (v. 2023.05) as the LLM.
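A minimal sketch of how such query-dependent case studies can be reproduced is given below. It is an illustration only, not the released Prompt-OIRL code; it assumes the openai>=1.0 Python SDK with an API key configured in the environment, and the candidate-prompt list and helper name are placeholders.

```python
# Illustrative sketch (not the released Prompt-OIRL code): query an LLM with several
# candidate zero-shot prompts for a single GSM8K question and collect the answers.
from openai import OpenAI  # assumes the openai>=1.0 Python SDK; the API key is read from the environment

client = OpenAI()

# Hypothetical candidate set; APE- or ToT-style prompts can be appended here.
CANDIDATE_PROMPTS = {
    "no prompt": "",
    "CoT": "Let's think step by step.",
}


def prompted_answer(question: str, prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Return the model's answer when `prompt` is appended to the query as an instruction."""
    content = f"{question}\n{prompt}" if prompt else question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    return response.choices[0].message.content


# Example usage: compare the candidate prompts on one query.
# for name, p in CANDIDATE_PROMPTS.items():
#     print(name, "->", prompted_answer("<GSM8K question here>", p))
```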
Expert 1: Now that we know Adrien’s salary four years ago, we can calculate his current salary by adding
a 40% increase to his salary four years ago.
Expert 2: Adrien’s salary four years ago was $28,571.43, and we know that it increased by 40%. To find
out his current salary, we can multiply $28,571.43 by 1.4.
Expert 3: $28,571.43 multiplied by 1.4 is approximately $40,000. Therefore, Adrien’s current salary is
$40,000.
Expert 1: Now that we have Adrien’s current salary, we can calculate Lylah’s salary by finding out what
30% of Adrien’s salary is and subtracting it from Adrien’s salary.
Expert 2: Adrien’s current salary is $40,000, and we know that it is 30% higher than Lylah’s salary. To
find out Lylah’s salary, we can calculate 30% of $40,000 and subtract it from $40,000.
Expert 3: 30% of $40,000 is $12,000. Therefore, Lylah’s salary is $40,000 - $12,000 = $28,000.
Expert 1: Finally, to calculate the total salary the two were receiving four years later, we can add Adrien’s
current salary and Lylah’s current salary.
Expert 2: Adrien’s current salary is $40,000, and Lylah’s current salary is $28,000. Therefore, the total
salary the two were receiving four years later is $40,000 + $28,000 = $68,000.
Answer: If on Monday she slept 8 hours, then she slept 8-2 = 〈〈8-2=6〉〉6 hours on each of the next two
days. The total number of hours slept for the two days will be 6*2 = 〈〈6*2=12〉〉12 hours. If she slept 1
hour more for the rest of the week each day, then each day she slept 6+1 = 〈〈6+1=7〉〉7 hours. The total
for the rest of the week will be 4*7 = 〈〈4*7=28〉〉28 hours. The total number of hours slept for the whole
week will be 8+12+28 = 〈〈8+12+28=48〉〉48 hours.
#### 48 <|endoftext|>
Philosopher 3: So, Sadie slept 8 hours on Monday, 6 hours on Tuesday and Wednesday, and 7 hours on
the remaining days. To find the total hours she slept throughout the week, we can add these numbers
together: 8 + 6 + 6 + 7 + 7 + 7 + 7 = 48 hours.
Conclusion: Sadie slept a total of 48 hours throughout the week.
We provide case studies on the LLaMA-7B-HF model to give a better understanding of Prompt-OIRL.
GSM8K
EXAMPLE
Question: For four hours, Patrick sold 15 cups of lemonade per hour at a price of $0.50 per cup. In the
next two hours, he sold 10 cups of lemonade per hour at a price of $0.60 per cup. How much money did
Patrick earn, in dollars, from selling lemonade for 6 hours?
Golden Answer: Patrick earned 0.50 × 15 = $7.50 from selling the lemonade for $0.50 per cup. So
for four hours, his total earnings are $7.50 × 4 = $30. He earned $0.60 × 10 = $6 from selling the
lemonade for $0.60 per cup. So for two hours, his total earnings are $6 × 2 = $12. Therefore, Patrick’s
total earnings: $30 + $12 = $42. Hence, Patrick earned a total of $42 from selling lemonade for 6 hours.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician wires a circuit for four hours, selling 15 cups of lemonade per hour
at a price of $0.50 per cup. For the next two hours, the electrician sells 10 cups of lemonade per hour at
a price of $0.60 per cup. To find out how much money Patrick earned from selling lemonade for 6 hours,
we need to calculate the total number of cups of lemonade sold during each hour and then multiply it by
the price per cup. For the first 4 hours, Patrick sold 15 cups of lemonade per hour, so he sold a total of 60
cups of lemonade (15 x 4 = 60). At a price of $0.50 per cup, he earned a total of $30 (60 x $0.50 = $30).
For the next 2 hours, Patrick sold 10 cups of lemonade per hour, so he sold a total of 20 cups of lemonade
(10 x 2 = 20). At a price of $0.60 per cup, he earned a total of $12 (20 x $0.60 = $12). Therefore, Patrick
earned a total of $42 from selling lemonade for 6 hours ($30 + $12 = $42).
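For reference, the golden answer's arithmetic can be checked directly; the snippet below is an illustration, not part of the paper's code.

```python
# Patrick's lemonade earnings, following the golden answer above.
first_block = 4 * 15 * 0.50    # 4 hours x 15 cups/hour x $0.50 = $30
second_block = 2 * 10 * 0.60   # 2 hours x 10 cups/hour x $0.60 = $12

print(first_block + second_block)  # 42.0 dollars, as in both the golden and the prompted answer.
```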
EXAMPLE
Question: Kelly has 5 quarters and 2 dimes. If she buys a can of pop for 55 cents, how many cents will
she have left?
Golden Answer: Her quarters are worth 5 × 25 = 125 cents. The dimes are worth 2 × 10 = 20 cents.
Kelly has a total of 125 + 20 = 145 cents. After buying the can of pop, Kelly will have 145 − 55 = 90
cents left.
held-out prompt: As an electrician wires a circuit, each connection powers the system:
prompted answer: The electrician has 5 quarters, which is equal to 5 × 25 = 125 cents. She also has 2
dimes, which is equal to 2 × 10 = 20 cents. In total, the electrician has 125 + 20 = 145 cents. If she
buys a can of pop for 55 cents, she will have 145 − 55 = 90 cents left.
SVAMP
EXAMPLE
Question: Henry bought 2 new fireworks and his friend bought 3. If they already had 6 saved from last
year, how many do they have now?
Golden Answer: 11.0 fireworks.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: “Henry bought 2 new fireworks” means “Add 2 to the total number of fireworks
Henry has.” “His friend bought 3” means “Add 3 to the total number of fireworks his friend has.”
So, let’s add the total number of fireworks Henry and his friend have: 6 (from last year) + 2 (new ones
Henry bought) + 3 (new ones his friend bought) = 11 fireworks in total.
Therefore, they have 11 fireworks in total.
EXAMPLE
Question: A trivia team had 5.0 members in total, but during a game, 2.0 members didn’t show up. If
each member that did show up scored 6.0 points, how many points were scored in total?
Golden Answer: 18.0 points.
held-out prompt: Let’s approach this like a mathematician proving a theorem, step by logical step:
prompted answer:
Step 1: Identify the given information: The trivia team had 5.0 members total. 2.0 members did not show
up. Each member that did show up scored 6.0 points.
Step 2: Use the information provided to find the total number of members that showed up: Total number
of members that showed up = 5.0 − 2.0 = 3.0.
Step 3: Find the total number of points scored by the members who showed up: Total points scored =
3.0 × 6.0 = 18.0.
Therefore, the trivia team scored 18.0 points in the game.
MAWPS
EXAMPLE
Question: While shopping, Emily bought 5.0 green towels and 30.0 white towels. If she gave her mother
26.0 of them, how many towels did Emily end up with?
Golden Answer: Emily ended up with 9.0 towels.
held-out prompt: As a librarian categorizes books, let’s sort the information accordingly:
prompted answer: Emily bought 5.0 green towels. Emily bought 30.0 white towels. Emily gave her
mother 26.0 towels (a combination of green and white towels).
Now, let’s deduce how many towels Emily ended up with: Total towels Emily bought = 5.0 + 30.0 =
35.0. After giving towels to her mother, Emily was left with: 35.0 − 26.0 = 9.0 towels.
Therefore, Emily ended up with 9.0 towels in total.
EXAMPLE
Question: A construction company is repaving a damaged road. So far, they have repaved a total of
4938.0 inches of the road. Today, they repaved 805.0 inches of the road. How many inches of the road
had they repaved before today?
Golden Answer: The company had repaved 4133.0 inches before today.
held-out prompt: Think of it as mixing colors for a painting. Every shade adds depth and nuance:
prompted answer: The total repaved so far is 4938.0 inches. Today, they added another layer (shade) of
805.0 inches.
Using the analogy of mixing colors, we remove today’s shade to reveal the previous one: Inches repaved
before today = 4938.0 − 805.0 = 4133.0 inches.
Thus, before today, the company had repaved 4133.0 inches of the road.