

BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Shikhar Murty†⋆ Christopher D. Manning† Peter Shaw‡ Mandar Joshi‡ Kenton Lee‡

⋆Work done at Google DeepMind. †Department of Computer Science, Stanford University. ‡Google DeepMind. Correspondence to: <smurty@cs.stanford.edu>

arXiv:2403.08140v1 [cs.CL] 12 Mar 2024

Abstract

Following natural language instructions by executing actions in digital environments (e.g. web browsers and REST APIs) is a challenging task for language model (LM) agents. Unfortunately, LM agents often fail to generalize to new environments without human demonstrations. This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations, via round-trips between two noisy LM components: an LM labeler which converts a trajectory into a synthetic instruction, and a zero-shot LM agent which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly converts the initial distribution of trajectories towards those that are well-described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of 2-13% absolute on ToolQA and MiniWoB++, with up to 13× reduction in execution failures.

[Figure 1: a two-panel diagram. (1) Exploration stage: an LM-agent and an LM-labeler interact with the environment to produce instruction-trajectory pairs (e.g. "Go to a Month in October", "Change Month to August"). (2) Instruction-following stage: given a test instruction (e.g. "Select Sept 25th and submit."), the LM-agent executes a trajectory in the environment using retrieved demonstrations.]

Figure 1. (Top) Given a seed set of explored trajectories, BAGEL constructs synthetic demonstrations via an iterative round-trip procedure between two LM components: a zero-shot LM agent that generates trajectories and an LM labeler that generates instructions for these trajectories. (Bottom) Given an instruction at test time, we retrieve synthetic demonstrations with similar instructions, to use as in-context exemplars to adapt the base agent.
1. Introduction

In recent years, large language models (LLMs) have shown strong performance on a broad range of language understanding tasks, making them powerful tools for controlling policies in digital environments such as web browsers (Yao et al., 2022; Kim et al., 2023). Such grounded language understanding tasks are fundamentally challenging for LMs in environments with ambiguous dynamics. For instance, even inputting a date into a text box could require either simply typing or a complex interaction using a drop-down date picker. An LM cannot know this a priori without in-depth knowledge about the website.

One common way to provide such knowledge to LM agents is via expert demonstrations that provide information about mapping instructions to action sequences, recovering from errors, and reasoning traces (Yao et al., 2022; Sun et al., 2023; Kim et al., 2023; Sodhi et al., 2023). Of course, collecting human demonstrations for every new environment is laborious and requires knowing possible user instructions a priori. Moreover, as agents scale to complex tasks with hundreds of actions, human supervision will become increasingly infeasible to obtain. Instead of relying on human demonstrations for training LM agents, could we instead use exploration and environment feedback to automatically collect a large number of synthetic demonstrations?

Prior work has shown the effectiveness of collecting synthetic demonstrations by retroactively labeling trajectories from embodied agents (Sumers et al., 2023). In this scenario, the environment's dynamics are assumed to be well understood by the agent; the synthetic demonstrations only serve to connect agent behavior with human language. However, we observe the opposite challenge with digital agents in our
setting: grounding instructions is relatively easy due to the highly textual environment, but zero-shot digital agents typically are not exposed to any environment dynamics before they are directly used to follow instructions.

Our method, termed BAGEL (Bootstrapping Agents by Guiding Exploration with Language), uses an iterative procedure to relabel a seed set of trajectories obtained from unconditioned exploration (Figure 1). Intuitively, BAGEL operates by progressively shifting the distribution of trajectories towards those that can be well-described via natural language, using two noisy LM components: an LM labeler takes a trajectory and relabels it with a (potentially unnatural) instruction, and a zero-shot LM policy maps the instruction back into a refined trajectory (Figure 2). By performing these round trips iteratively, BAGEL converts trajectories from random exploration into meaningful trajectories that are executable, without requiring a trained base agent or significant information about possible instructions. While both the re-labeling and instruction-following processes are imperfect, round-trips between these components work in harmony to reduce noise. Once an instruction-trajectory pair reaches a threshold score under a demonstration filter (another prompted LM), the generated synthetic demonstration is added into a buffer. BAGEL demonstrations can be used for either in-context learning or finetuning, and serve as a drop-in replacement for expert demonstrations. Here, we follow the former strategy along with a simple retrieval augmented generation procedure: given a user instruction at test time, we retrieve the most relevant demonstrations based on instruction embeddings, and feed them into the agent's prompt to serve as in-context exemplars.

We experiment with BAGEL on two domains, using a prompted LM (similar to ReAct, Yao et al., 2022) as our base policy, and find significant improvements with no human supervision. In MiniWoB++ (Shi et al., 2017; Liu et al., 2018), an agent follows instructions on diverse web interfaces ranging from booking flights to replying to emails, given an HTML state, by issuing a sequence of mouse and keyboard operations to interact with DOM objects. Using BAGEL for test-time adaptation, we find an improvement of over 13% compared to the base LM policy. Next, we evaluate on ToolQA (Zhuang et al., 2023), a collection of question answering tasks over 8 domains, where answering each question requires chaining together multiple tools such as SQL interpreters, text retrievers, graph tools, python interpreters and calculators. Here, we find an improvement of 2% over the base LM policy. Further analysis reveals various positive effects of conditioning on our synthetic demonstrations beyond improved accuracy, including up to 13× reduction in execution failures due to better understanding of environment dynamics. By carefully using LM priors to shape random exploration, our method serves as a tool for automated discovery of use cases in complex environments.
2. Background

Given a natural language instruction g, our agent interacts with the environment by taking a sequence of actions {a_1, a_2, ..., a_T}, where each a_t is issued in response to an environment observation o_t. The entire interaction with the environment is captured as a trajectory τ = {o_1, a_1, o_2, ..., o_T, a_T, o_{T+1}}.

We define an agent as a language-conditioned policy π(a_t | τ_{<t}, g), where τ_{<t} = {o_1, a_1, o_2, ..., o_t} refers to the trajectory until time-step t. Such policies are typically trained via imitation learning and optional RL finetuning, where a large set of expert-curated instruction-trajectory pairs is required for imitation learning, and a suitably shaped reward signal is needed for RL finetuning (Branavan et al., 2009; Chaplot et al., 2018; Misra et al., 2017). For our setup, both observations and actions can be expressed as natural language strings. The agent policy π can then be cast as an autoregressive LM that assigns probabilities to action strings given string descriptions of the previous actions and observations. Thus, recent work focuses on directly using LLMs as policies, by using prompts along with in-context human demonstrations (Yao et al., 2022; Shinn et al., 2023; Sun et al., 2023; Kim et al., 2023, among others).

Executing Action Strings. Similar to prior work that uses LMs to generate action strings (Huang et al., 2022; Logeswaran et al., 2022), we assume access to an environment-specific low-level controller that maps action strings to low-level commands (e.g. a web-driver action or an API call), which can be directly executed to change the environment.
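To make these definitions concrete, the following is a minimal Python sketch of the rollout loop they imply. The names rollout, lm_policy, controller, and env are illustrative placeholders rather than the paper's actual implementation; passing instruction=None corresponds to the unconditioned exploration policy introduced in §3.1.

def rollout(env, lm_policy, controller, instruction=None, max_steps=15):
    """Sample a trajectory tau = {o_1, a_1, ..., o_T, a_T, o_{T+1}}.

    lm_policy(trajectory, instruction) returns an action string; with
    instruction=None it plays the role of the exploration policy.
    controller(action) maps an action string to a low-level command
    (e.g. a web-driver action or an API call).
    """
    observation = env.reset()
    trajectory = [observation]
    for _ in range(max_steps):
        action = lm_policy(trajectory, instruction)
        if action.strip().lower() == "finish":
            break
        observation = env.step(controller(action))  # execute and observe
        trajectory += [action, observation]
    return trajectory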
3. BAGEL

BAGEL generates synthetic demonstrations via exploration, as illustrated in Figure 2. First, we describe the various model components in §3.1, and then describe the overall procedure in §3.2.

3.1. Model Components

In order to generate synthetic demonstrations, we model different aspects of the joint distribution over instructions and trajectories. Every component is implemented by the same underlying LM, but with different prompts. Every component is also implicitly dependent on a given environment, although this is omitted in the notation for simplicity. All prompts used can be found in Appendix B.

Exploration Policy. The exploration policy, π_explore(a_t | τ_{<t}), selects an action without conditioning on any instruction. The prompt used is similar to that of ReAct (Yao et al., 2022). We can sample from the resulting distribution over trajectories, p_explore(τ), by sampling actions from π_explore

until the episode completes or a "finish" action is generated. We can increase the entropy of π_explore with a configurable temperature parameter.

Trajectory Labeler. The trajectory labeler, p_label(g | τ), is prompted to generate an instruction, g, that corresponds to a given trajectory, τ.

Instruction Following Policy. Unlike the exploration policy, the instruction following policy, π_agent(a_t | τ_{<t}, g), selects actions conditioned on an instruction, g. We sample from the resulting distribution over trajectories, p_agent(τ | g), by choosing actions according to π_agent until the episode completes or a "finish" action is generated. This component is also implemented using a ReAct-based prompt.

Demonstration Filter. Given a synthetic demonstration (g, τ), the demonstration filter makes a binary judgement s(g, τ) ∈ {0, 1}, based on how well τ corresponds to the instruction g.

Instruction Generator. Finally, as an alternative to the exploration policy (see §3.2), we can instead use an instruction generator to initialize exploration. This model defines a distribution over instructions, p_instruct(g), based on a prompt that elicits plausible instructions based on the initial observation from the environment, and the action space.

[Figure 2: an example from the MiniWoB++ choose-date task. Explore: τ^0 ∼ p_explore(·) yields (Click on datepicker → Click on Next → Click on Prev → Click on Next → Click on Prev → Click on Prev → Finish). Label: g^0 ∼ p_label(· | τ^0) gives "Change month from December to October"; score s(g^0, τ^0): ✗. Follow: τ^1 ∼ p_agent(· | g^0) yields (Click on datepicker → Click on Next → Click on Prev → Click on Prev → Click on 7th → Finish). Label: g^1 ∼ p_label(· | τ^1) gives "Change month to October 7th and submit"; score s(g^1, τ^1): ✗. Follow: τ^2 ∼ p_agent(· | g^1) yields (Click on datepicker → Click on Prev → Click on Prev → Click on 7th → Click Submit). Label: g^2 ∼ p_label(· | τ^2) gives "Change month to October 7th and submit"; score s(g^2, τ^2): ✓.]

Figure 2. BAGEL generates synthetic demonstrations by exploring the environment. Shown here is an example from the MiniWoB++ choose-date task. First, we generate an initial trajectory by sampling actions without conditioning on any natural language instruction. Then, we alternate between generating an instruction given a trajectory, and generating a trajectory given an instruction. The process aims to converge towards a trajectory that accurately satisfies a natural language instruction, and aims to recover from errors in labeling or instruction following from earlier rounds (see example). Once an instruction and trajectory pair satisfies a filtering criterion, it is added to the set of synthetic demonstrations. Alternatively, BAGEL can be initialized by first sampling an instruction, as described in §3.2.

3.2. Generating Demonstrations

Initial Exploration. We consider and compare two different variations of BAGEL: trajectory-first and instruction-first exploration. For trajectory-first exploration, we first sample a trajectory τ^0 ∼ p_explore(·) with the exploration policy. For instruction-first exploration, we first sample an instruction g^0 ∼ p_instruct(·) with the instruction generator.

Iterative Refinement. Trajectories sampled from p_explore may not correspond to any reasonable instruction, and, similarly, there may be no feasible trajectory that satisfies instructions sampled from p_instruct. Our iterative re-labeling procedure aims to find an instruction and trajectory pair where the trajectory satisfies the instruction, without sacrificing the diversity of the initial exploration. The process alternates between sampling instructions and trajectories:

    g^t ∼ p_label(· | τ^t)          (1)
    τ^{t+1} ∼ p_agent(· | g^t)      (2)

We perform these iterative updates until we find a pair where
s(g^t, τ^t) = 1 or a maximum number of steps is reached. If we are successful, the demonstration (g^t, τ^t) is added to the set of synthetic demonstrations, M. The overall procedure is repeated to collect multiple demonstrations.
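Putting the pieces together, here is a minimal sketch of one such round-trip in Python, following Eqs. (1)-(2). The callables explore, label, follow, and score are assumed to wrap the prompted LM components of §3.1; all names are illustrative rather than the actual implementation.

def bagel_demonstration(explore, label, follow, score, max_iters=5):
    """Attempt to produce one synthetic demonstration (g, tau)."""
    tau = explore()               # tau^0 ~ p_explore(.)  (trajectory-first)
    for _ in range(max_iters):
        g = label(tau)            # g^t ~ p_label(. | tau^t)
        if score(g, tau) == 1:    # demonstration filter s(g, tau)
            return (g, tau)       # add to the demonstration set M
        tau = follow(g)           # tau^{t+1} ~ p_agent(. | g^t)
    return None                   # give up after max_iters round-trips

# The instruction-first variant differs only in initialization:
# g = instruction_generator(); tau = follow(g); then iterate as above.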
3.3. Discussion

Guiding Trajectory Distribution with LM Components. To better understand how the LM labeler and policy shape the distribution of trajectories, we consider how this distribution evolves over the course of multiple iterations. Let p_k(τ) be the distribution over trajectories and p_k(g) be the distribution over instructions, after k iterations. For k > 0:

    p_k(τ) = Σ_{g′} p_agent(τ | g′) · p_{k−1}(g′)          (3)
    p_{k−1}(g′) = Σ_{τ′} p_label(g′ | τ′) · p_{k−1}(τ′)    (4)

Combining these, we obtain

    p_k(τ) = Σ_{τ′,g′} p_{k−1}(τ′) · p_label(g′ | τ′) · p_agent(τ | g′)    (5)

where the product p_label(g′ | τ′) · p_agent(τ | g′) encodes the environment and LM constraints. Thus, we shape the distribution of trajectories from the previous marginal p_{k−1} based on the criteria that they can be assigned a concrete instruction string g′, and are executable in the environment. These soft constraints work together to ensure that (1) trajectories can be described in terms of some feasible instruction in the environment, and (2) the trajectories themselves correspond to valid environment dynamics.

Connection to Hindsight Experience Replay. Hindsight Experience Replay (HER, Andrychowicz et al., 2017) is a popular approach for training language conditioned policies. Given some goal g, HER converts an unsuccessful trajectory τ into positive examples by replacing g with some hindsight goal g′. That is, HER uses a relabeling function to map τ to a new goal g′, resulting in a positive demonstration (g′, τ) that is used to update the policy.

Since the original implementation of HER considers settings where the goal space is the raw environment observation space, applying HER to natural language instruction-following requires access to a learnt relabeling function to map observations to language instructions. Such relabeling functions typically map only the final observation o_T to the instruction via pre-trained captioning models (Xiao et al., 2022; Cideron et al., 2020; Sumers et al., 2023) that operate on trajectories from trained agents. In BAGEL, we use the full trajectory for relabeling and use an iterative relabeling procedure to reduce noise from zero-shot components.

4. Inference

We use synthetic demonstrations from BAGEL to adapt LM agents via retrieval augmented generation, and leave finetuning for future work. Concretely, given a test instruction g_test, we retrieve the top-k most relevant demonstrations in the demonstration set M, pre-pending these to the context window of our agent as in-context examples. More concretely, we use dual encoder retrieval, similar to Lee et al. (2019), using a T5-XXL (Raffel et al., 2020) embedding model. We first compute a vector embedding f_θ(g) for each instruction g ∈ M, and then find the top-k demonstrations based on scores f_θ(g)ᵀ f_θ(g_test). More details can be found in Appendix A.
5. Datasets

Our experiments are based on two environments, MiniWoB++ (Shi et al., 2017; Liu et al., 2018) and ToolQA (Zhuang et al., 2023).

5.1. MiniWoB++

MiniWoB++ is a collection of tasks consisting of web interfaces with a shared action space of mouse and keyboard actions. In our setup, actions are specified in natural language (Type Bob in the name text box, Click on the datepicker, Clear text on Destination). The low-level controller that maps action strings into a Selenium API call is implemented via a separate zero-shot prompted LM (see Appendix C for details). Each task consists of a script to generate variations of the task with a templated instruction, where each variation is controlled via a random seed.

Evaluation. We follow Shaw et al. (2023) for evaluating agents on MiniWoB++, by mapping the raw MiniWoB++ reward from [-1, 1] to [0, 1]. For each web interface, we report the mean score over 50 random seeds. Starting with the set of 55 MiniWoB++ tasks used in prior work on applying LM agents to this domain (Gur et al., 2023; Kim et al., 2023; Sun et al., 2023), we evaluate on the hardest 10 tasks, where the zero-shot agent has an average reward of less than 0.95, to perform a more targeted evaluation of BAGEL on
domains that are hard for zero-shot agents.

5.2. ToolQA

ToolQA is a tool-augmented question-answering environment over 8 domains, where questions can be answered by chaining calls to multiple tools including text retrievers, databases, a SQL interpreter, a calculator, etc. Each tool can be called according to a set of pre-defined methods (see Appendix B.2 for the full action space for the policy and corresponding tool methods). The observation space is the string output from the most recent tool call (the first observation is hard-coded as a "System prompt"). Each action corresponds to a specific tool call expressed in language (Load the Airbnb Database, Calculate 3+7), and the low-level controller is implemented by post-processing strings into tool methods. The episode terminates when the policy chooses the Finish with Answer action, e.g. Finish with Answer: 300, where 300 is taken as the predicted answer.

Evaluation. Following prior work on question answering (Rajpurkar et al., 2016; 2018; Joshi et al., 2017), we compute the F1 score of the final (free-form) model output from the Finish with Answer tool call against ground-truth answers.
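For concreteness, a common SQuAD-style token-level F1 between a predicted and gold answer string looks roughly as follows; this is a sketch, and the exact answer normalization used here is not specified in the paper.

from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)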
6. Experimental Setup

6.1. Baselines and Ablations

Zero-shot. As our first baseline, we use the zero-shot policy π_base directly at test time.

Non-iterative Ablations. Similar in spirit to Sumers et al. (2023), in BAGEL (trajectory-first, no itrs), explored trajectories τ^0 are labeled using p_label, and the resulting demonstrations (g, τ^0) are included in M if the score s(g, τ^0) = 1. Similarly, in BAGEL (instruction-first, no itrs), synthetic instructions sampled from the instruction generator (see §3.1) are converted into trajectories using p_agent, and the resulting demonstration (g^0, τ) is added to M if s(g^0, τ) = 1. This baseline captures a simple way to use LMs to construct synthetic demonstrations via a sample-then-filter approach: prompt an LM to generate possible instructions given the first observation from the environment, create trajectories based on these, and filter based on another criterion. In general, we expect exploration using the instruction generator to work poorly in settings where the LM cannot predict potential instructions from just the first observation (e.g. it might be hard to generate candidate instructions solely from the landing page of the website without further interaction).

6.2. Implementation Details

We evaluate all baselines and variants of BAGEL on MiniWoB++ and ToolQA. For MiniWoB++, we start by sampling 60 trajectories in the exploration phase for trajectory-first variants of BAGEL, and sample 60 synthetic goals for instruction-first variants. For ToolQA, we sample 200 trajectories for BAGEL (trajectory-first), and 200 synthetic goals for BAGEL (instruction-first).

We use an instruction-tuned PaLM-2 (Anil et al., 2023) as the base LM for all our experiments. We set the max episode length T to 15 for all datasets and models. We also set T_iter to 5 when performing multiple iterations in BAGEL. In addition to using ReAct prompting, we use a simple "re-sampling" procedure to recover from issuing syntactically incorrect actions: if an action causes the environment to return an Exception (such as incorrectly invoking a tool, or typing on an element that cannot be typed on), we sample another action from the agent with the Exception message appended to its context. We keep re-sampling until the agent chooses a syntactically correct action, or terminate the episode if the agent is unable to fix an erroneous action in m = 5 steps.
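In sketch form, this re-sampling procedure looks as follows; env.execute and the agent callable are illustrative placeholders, not the actual interface.

def act_with_resampling(env, agent, context, m=5):
    """Sample an action, retrying with the error message on failure."""
    for _ in range(m):
        action = agent(context)
        try:
            return env.execute(action)  # raises on syntactically bad actions
        except Exception as error:
            # append the exception message and let the agent try again
            context = context + f"\nExecuting {action} failed: {error}"
    raise RuntimeError("episode terminated: no valid action in m tries")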
[Figure 3: per-task bar charts comparing Zero-Shot (green) and +BAGEL (blue). MiniWoB++ tasks include book-flight, choose-date, social-media, email-inbox-all, click-checkboxes-soft, click-tab-2, social-media-all, tic-tac-toe, use-autocomplete, and search-engine, with averages 46.8 (zero-shot) vs. 60.5 (+BAGEL); ToolQA domains are Agenda, Airbnb, Coffee, DBLP, Flight, GSM8K, SciRex, and Yelp, with averages 40.9 vs. 43.3.]

Figure 3. Results across MiniWoB++ and ToolQA, broken down by domain. We compare using demonstrations obtained via BAGEL (blue) with a zero-shot ReAct baseline (green) with no synthetic demonstrations. For MiniWoB++, we use the Trajectory-First variant for exploration, and for ToolQA, we use Instruction-First. Overall, using BAGEL demonstrations leads to improvements on both datasets.

7. Main Results

Figure 3 compares the zero-shot baseline with agents augmented with BAGEL demonstrations. We find that using synthetic demonstrations as in-context exemplars, retrieved based on instruction relevance, leads to significant boosts in performance compared to the zero-shot agent. For the best variant of BAGEL, we find improvements of over 13 points on MiniWoB++, and over 2 points on ToolQA. For MiniWoB++, our improvements are particularly strong (20% absolute) on choose-date, tic-tac-toe, and use-autocomplete. Solving these tasks successfully requires learning environment dynamics (e.g. Figure 1), which is enabled by BAGEL demonstrations. We isolate the source of these improvements from synthetic in-context exemplars in §8.1.
Furthermore, trajectory-first exploration significantly outperforms instruction-first on MiniWoB++, which we posit is due to the LM prior being misaligned with the distribution over possible instructions on MiniWoB++.

Finally, Table 1 shows that iterative re-labeling always improves performance over non-iterative baselines. Multiple iterations of round trips improve average reward by 4-8% on MiniWoB++ and 1.3-4.5% on ToolQA.

                        instruction-first    trajectory-first
Dataset      Zero-Shot  No-itrs   Full       No-itrs   Full
MiniWoB++    46.8       52.0      56.0       53.0      61.0
ToolQA       40.9       38.8      43.3       40.9      42.2

Table 1. Ablations showing the effect of multiple rounds of re-labeling in BAGEL. Multiple iterations improve performance for both instruction-first and trajectory-first variants.

8. Analysis

To understand how BAGEL demonstrations improve agent performance, we first look at confounders from in-context learning (§8.1), and then study the impact of synthetic demonstrations on execution failures (§8.2). Next, we analyze the correctness (§8.3) and diversity (§8.4) of BAGEL's demonstrations to identify areas for further improvements.

8.1. In-context Learning with Synthetic Demonstrations

In-context exemplars can provide a range of useful learning signals to LM agents, ranging from simply providing examples of valid action trajectories or relevant natural language instructions in isolation, to providing rich information about the conditional p(τ | g) (how to map relevant instructions into action sequences). Indeed, for some text classification tasks, Min et al. (2022) find that improvements from in-context learning may be explained in terms of the former, i.e. examples of the label space and input text. To better understand how synthetic demonstrations help in our setting, we report results from two ablations. First, we provide the model with randomly chosen demonstrations instead of using the retriever (Random). Next, we shuffle demonstrations so that trajectories are paired with a randomly chosen instruction within the set of retrieved examples (Shuffled).

Results. Table 2 reports results of these ablations. First, Shuffled improves performance over the zero-shot baseline, suggesting that some of the improvements come from providing examples of valid action trajectories in the domain, in line with findings in Min et al. (2022). Ours records a further improvement of 0.8% over Shuffled, which suggests that the agent is able to use signal about the conditional to improve decision making.

Method      Accuracy
Zero-shot   40.9
Random      38.0
Shuffled    41.4
Ours        42.2

Table 2. Ablations showing the effect of various sources of information in synthetic demonstrations on agent performance.

8.2. Synthetic demonstrations reduce execution failures

As mentioned in §6.2, in our implementation, LM agents recover from execution failures using a simple re-sampling procedure: when the agent generates an invalid action (such as attempting to Type on a checkbox element or calling a tool with incorrect syntax), we re-prompt it with the error message produced by the environment, until it produces a valid action. Of course, such re-sampling can be costly at inference time due to multiple calls to the LM. Table 3 reports the average execution failures for tasks with re-sampling on MiniWoB++ and ToolQA. We note a considerable reduction in average re-sampling with BAGEL, due to a better understanding of environment dynamics, in turn leading to faster inference.

Task               Zero-Shot (↓)   +BAGEL (↓)
choose-date        1.3             0.1
book-flight        3.0             0.6
ToolQA (average)   3.0             1.9

Table 3. Average number of execution failures for tasks in MiniWoB++ and ToolQA. We find that using synthetic demonstrations reduces execution failures.

8.3. Correctness of Synthetic Demonstrations

One way to identify the scope for improvements in our method is to manually verify the correctness of demonstrations. We filter out demonstrations which, upon execution, do not achieve the corresponding instruction. Using these filtered demonstrations improves performance further by 7% absolute on all 10 tasks from MiniWoB++.

8.4. Diversity of Synthetic Demonstrations

To better understand the distribution of synthetic demonstrations, we manually bucket demonstrations for social-media and email-inbox-all into semantic clusters: for social-media these clusters include {Retweet, Like, Share, ...} and for email-inbox-all we have clusters such as {Forward, Delete, Star, Reply, ...}. For ToolQA, we cluster demonstrations based on the set of tools invoked in the demonstration.
[Figure 4: bar charts of the percentage of demonstrations per semantic category for social-media (e.g. Retweet, Like, Share, ...), email-inbox-all (e.g. Forward, Delete, Star, Reply, ...), and ToolQA (sets of invoked tools, e.g. {Graph}, {Python, Graph}, ...).]

Figure 4. Distribution of demonstrations over semantic categories for the MiniWoB++ environments social-media and email-inbox-all, and ToolQA. While BAGEL prefers certain modes, overall we find that these demonstrations cover a diverse range of actions.

We plot the number of demonstrations in each cluster in Figure 4. We note that while this distribution tends to be skewed towards specific modes (e.g. {graph} for ToolQA, {Star} for email-inbox), there exists a long tail that covers a broad range of possible use cases in the environment. Nevertheless, improving diversity during exploration remains a failure mode for BAGEL, which we expand on next. Finally, we provide some examples of BAGEL demonstrations in Table 4, along with their corresponding semantic category.

8.5. Error Analysis

We conclude with a discussion of failure modes of our approach, using the domains book-flight, search-engine, and SciRex as case studies.

Handling Long-Horizon Planning. We note that book-flight is the most complex environment in MiniWoB++, with longer trajectories of lengths 8-20, and the zero-shot policy performs poorly on this environment (average reward of 5%). While using BAGEL demonstrations improves this to 15%, we hypothesize that further improvements would require better handling of long-range plans, such as with hierarchical planning (Sodhi et al., 2023; Jiang et al., 2019).

Improving Diversity. We hypothesize that improving diversity among seed trajectories would lead to further improvements across the board. For instance, for book-flight, all BAGEL demonstrations correspond to booking flights in December, while the test distribution is more uniform.

Reducing Mismatch with Test Instructions. On SciRex, all models fail to produce even a single correct answer. Here, we find that in the absence of any knowledge about user instructions at test time, BAGEL demonstrations tend to create questions with more descriptive answers and trajectories with generic queries (see Table 4 for an example), while test instructions require retrieving specific numbers from scientific documents by querying for specific topics. Similarly, on search-engine, we note a modest improvement of only 5%. Here, we find that while BAGEL demonstrations cover a variety of instructions like Search for cat and navigate to the third page of search results and Search for cars, then visit the second search result, the model fails on test instructions like Enter [term] then find and click the 9th search result, which requires keeping track of the number of search results per page, and navigating to the correct page. While our goal is to build fully unsupervised agents, methods that use sparse information about test-time instructions could help drive performance further.

9. Related Work

Instruction-Following Digital Agents. Building agents that navigate the digital world is a long-standing goal of AI and language understanding (Allen et al., 2007; Branavan et al., 2009). However, most prior work relies on expert demonstrations (Liu et al., 2018; Humphreys et al., 2022; Furuta et al., 2023) with an appropriately shaped reward (Branavan et al., 2009; Liu et al., 2018). Here, we assume no access to demonstrations or a reward function, and use pre-trained components to bootstrap synthetic demonstrations.

LMs for Decision Making. Pre-trained LMs are increasingly being used for sequential decision making tasks such as robotic manipulation (Ahn et al., 2022; Liang et al., 2023), instruction-following (Yao et al., 2022; Kim et al., 2023; Sun et al., 2023; Lù et al., 2024), and tool-use (Parisi et al., 2022). While some of these approaches finetune LMs based on human demonstrations (Nakano et al., 2021), others use human demonstrations in their prompt for in-context learning and adaptation (Yao et al., 2022; Kim et al., 2023; Sun et al., 2023). We use no human supervision or reward, and adapt LM agents purely using synthetic demonstrations. Another line of work uses LM priors in RL to improve exploration (Du et al., 2023), deal with large action spaces (Yao et al., 2020), or as proxy reward functions (Kwon et al., 2023). In the same tradition, BAGEL bootstraps a learning signal in the form of synthetic demonstrations by combining several LM components without RL.
MiniWoB++

Instruction: Find the email by Trixi and reply to them with the text "Maecenas eu massa" {Reply}
Trajectory: Move Mouse to Trixi → Click on an email-thread → Click on the reply button → Type 'Maecenas eu massa' on the textarea with id 'reply-text' → Click on the span with id 'send-reply'

Instruction: Find the email by Darcy and forward it to Dionis {Forward}
Trajectory: Click on Darcy, the sender of an email thread. → Click on 'forward' button → Type Dionis on the to field → Click on the 'send' button

Instruction: Retweet Gallegos's post {Retweet}
Trajectory: Move Mouse to Pretium,. Ullamcorper. → Click on retweet element with id 101

Instruction: Like tweet by @leonie and share tweet by @livia {Like, Share}
Trajectory: Click on the like element with ID 41. → Click on share-113

ToolQA

Instruction: What are David's plans this weekend? {RetrieveAgenda}
Trajectory: Retrieve passages related to David's plans this weekend → Finish with answer: On the evening of September 16th, 2022, David will be attending a Blind Date Night event at The Press Lounge.

Instruction: Who is affiliated with both nicolas christin and giulia fanti? {Python, Graph}
Trajectory: Load DBLP → Check neighbours of node Giulia Fanti in graph AuthorNet → Check neighbours of node Nicolas Christin in graph AuthorNet → Evaluate python code: list1=['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg', 'Nicolas Christin', 'Suma Bhat', 'Hongyu Gong']; list2=['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg', 'Suma Bhat', 'Hongyu Gong', 'Giulia Fanti']; ans=set(list1) & set(list2) → Finish with answer: {'Hongyu Gong', 'Rohan Bansal', 'Wanzheng Zhu', 'Zachary Weinberg', 'Suma Bhat'}

Instruction: What are the top 5 airbnb options with price < 900, availability > 260 and at least 40 reviews {Database, SQL}
Trajectory: Load database airbnb → Filter database according to price < 900, availability 365 > 260, number of reviews > 40 → Interpret SQLite query: SELECT FROM airbnb_data ORDER BY number_of_reviews DESC LIMIT 5 → Finish with answer: ['High-end doorman bldg in the LES', 'THE BEST DEAL ON THE HUDSON RIVER!!', 'Heart of Williamsburg, Brooklyn!', 'Beautiful & Tranquil Oasis in a Great Location', 'Sunny/Cozy 1BD']

Instruction: What are the different approaches for computing graph similarity? {RetrieveSciRex}
Trajectory: Retrieve passages from ML papers related to graph similarity → Finish with answer: The different approaches to computing graph similarity are graph kernels, graph features and graph convolutional neural networks (CNNs).

Table 4. Example demonstrations obtained via BAGEL for MiniWoB++ (top) and ToolQA (bottom). We also provide the semantic category for these demonstrations, and report the distribution of these categories in Figure 4.

Self-training for Language Models. A recent line of work uses LM-generated data for finetuning the same LM, in settings where external verifiers may be used to filter generated data (Singh et al., 2023; Gulcehre et al., 2023). While we also use data generated from an LM for adaptation, unlike these approaches, environment interactions form a critical part of the learning signal, and we also do not use external verifiers for filtering data.

10. Conclusion

There is a growing interest in grounding LMs to the real world, by building helpful assistants that execute open-ended instructions in digital environments. The complexity of such sequential tasks makes collecting expert demonstrations tedious, and so, further progress towards building such agents requires new methods for bootstrapping a learning signal with minimal human supervision. To this end, we introduce BAGEL, a method for constructing synthetic demonstrations for instruction following agents. These demonstrations are constructed by iteratively relabeling an initial seed set of trajectories or instructions, where both relabeling and exploration are driven by a language model. Experiments on two different domains show that using BAGEL demonstrations as in-context exemplars leads to considerable improvements ranging from 2-13%, as well as significant reductions in execution failures.

11. Impact Statement

In this paper, we evaluated models only in offline environments. Responsibly deploying models online carries potential risks, and it would be important to verify and constrain model behaviour to not cause harm (e.g. violating terms of service). Further research related to secure model deployment should take into account problems such as spam detection, privacy preservation, etc.

Acknowledgements

SM was partly funded by a gift from Apple Inc. CM is a fellow in the CIFAR Learning in Machines and Brains program. We thank David Gaddy, Anna Goldie, Luke Vilnis, Tianze Shi, Jonathan Berant, Kristina Toutanova, Raphael Hoffman, and members of Google DeepMind and the Stanford NLP Group for helpful discussions and comments.
Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova,
References K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A.,
Wang, M., Gu, C., et al. Reinforced self-training (rest)
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O.,
for language modeling. arXiv preprint arXiv:2308.08998,
David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman,
2023.
K., et al. Do as i can, not as i say: Grounding language
in robotic affordances. arXiv preprint arXiv:2204.01691, Gur, I., Nachum, O., Miao, Y., Safdari, M., Huang, A.,
2022. Chowdhery, A., Narang, S., Fiedel, N., and Faust,
A. Understanding HTML with large language mod-
Allen, J., Chambers, N., Ferguson, G., Galescu, L., Jung,
els. In Bouamor, H., Pino, J., and Bali, K. (eds.), Find-
H., Swift, M., and Taysom, W. Plow: a collaborative task
ings of the Association for Computational Linguistics:
learning agent. In Proceedings of the 22nd National Con-
EMNLP 2023, pp. 2803–2821, Singapore, December
ference on Artificial Intelligence - Volume 2, AAAI’07, pp.
2023. Association for Computational Linguistics. doi:
1514–1519. AAAI Press, 2007. ISBN 9781577353232.
10.18653/v1/2023.findings-emnlp.185. URL https://
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, aclanthology.org/2023.findings-emnlp.185.
R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O.,
and Zaremba, W. Hindsight experience replay. Advances Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Lan-
in neural information processing systems, 30, 2017. guage models as zero-shot planners: Extracting ac-
tionable knowledge for embodied agents. In Interna-
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, tional Conference on Machine Learning, pp. 9118–9147.
D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, PMLR, 2022.
Z., et al. Palm 2 technical report. arXiv preprint
arXiv:2305.10403, 2023. Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G.,
Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P.,
Branavan, S., Chen, H., Zettlemoyer, L., and Barzilay, Santoro, A., and Lillicrap, T. A data-driven approach
R. Reinforcement learning for mapping instructions for learning to control computers. In International Con-
to actions. In Su, K.-Y., Su, J., Wiebe, J., and Li, ference on Machine Learning, pp. 9466–9482. PMLR,
H. (eds.), Proceedings of the Joint Conference of the 2022.
47th Annual Meeting of the ACL and the 4th Interna-
tional Joint Conference on Natural Language Processing Jiang, Y., Gu, S. S., Murphy, K. P., and Finn, C. Language as
of the AFNLP, pp. 82–90, Suntec, Singapore, August an abstraction for hierarchical deep reinforcement learn-
2009. Association for Computational Linguistics. URL ing. Advances in Neural Information Processing Systems,
https://aclanthology.org/P09-1010. 32, 2019.

Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Ra- Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA:
jagopal, D., and Salakhutdinov, R. Gated-attention archi- A large scale distantly supervised challenge dataset for
tectures for task-oriented language grounding. In Pro- reading comprehension. In Barzilay, R. and Kan, M.-Y.
ceedings of the AAAI Conference on Artificial Intelli- (eds.), Proceedings of the 55th Annual Meeting of the
gence, volume 32, 2018. Association for Computational Linguistics (Volume 1:
Long Papers), pp. 1601–1611, Vancouver, Canada, July
Cideron, G., Seurin, M., Strub, F., and Pietquin, O. Higher: 2017. Association for Computational Linguistics. doi: 10.
Improving instruction following with hindsight genera- 18653/v1/P17-1147. URL https://aclanthology.
tion for experience replay. In 2020 IEEE Symposium org/P17-1147.

9
BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Kim, G., Baldi, P., and McAleer, S. Language models can transformer. Journal of Machine Learning Research, 21:
solve computer tasks. arXiv preprint arXiv:2303.17491, 1–67, 2020.
2023.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD:
Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. Re- 100,000+ questions for machine comprehension of text.
ward design with language models. In The Eleventh In Su, J., Duh, K., and Carreras, X. (eds.), Proceed-
International Conference on Learning Representations, ings of the 2016 Conference on Empirical Methods in
2023. URL https://openreview.net/forum?id= Natural Language Processing, pp. 2383–2392, Austin,
10uNUgI5Kl. Texas, November 2016. Association for Computational
Linguistics. doi: 10.18653/v1/D16-1264. URL https:
Lee, K., Chang, M.-W., and Toutanova, K. Latent retrieval //aclanthology.org/D16-1264.
for weakly supervised open domain question answering.
In Proceedings of the 57th Annual Meeting of the Asso- Rajpurkar, P., Jia, R., and Liang, P. Know what you
ciation for Computational Linguistics, pp. 6086–6096, don’t know: Unanswerable questions for SQuAD. In
2019. Gurevych, I. and Miyao, Y. (eds.), Proceedings of the
56th Annual Meeting of the Association for Computa-
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., tional Linguistics (Volume 2: Short Papers), pp. 784–789,
Florence, P., and Zeng, A. Code as policies: Language Melbourne, Australia, July 2018. Association for Compu-
model programs for embodied control. In 2023 IEEE tational Linguistics. doi: 10.18653/v1/P18-2124. URL
International Conference on Robotics and Automation https://aclanthology.org/P18-2124.
(ICRA), pp. 9493–9500. IEEE, 2023.
Shaw, P., Joshi, M., Cohan, J., Berant, J., Pasupat, P., Hu, H.,
Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Rein- Khandelwal, U., Lee, K., and Toutanova, K. From pixels
forcement learning on web interfaces using workflow- to ui actions: Learning to follow instructions via graphical
guided exploration. In International Conference on user interfaces. arXiv preprint arXiv:2306.00245, 2023.
Learning Representations, 2018.
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang,
Logeswaran, L., Fu, Y., Lee, M., and Lee, H. Few-shot P. World of bits: An open-domain platform for web-
subgoal planning with language models. In Proceedings based agents. In International Conference on Machine
of the 2022 Conference of the North American Chapter of Learning, pp. 3135–3144. PMLR, 2017.
the Association for Computational Linguistics: Human
Language Technologies, pp. 5493–5506, 2022. Shinn, N., Labash, B., and Gopinath, A. Reflexion: an au-
tonomous agent with dynamic memory and self-reflection.
Lù, X. H., Kasner, Z., and Reddy, S. Weblinx: Real- arXiv preprint arXiv:2303.11366, 2023.
world website navigation with multi-turn dialogue. arXiv
preprint arXiv:2402.05930, 2024. Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil,
P., Liu, P. J., Harrison, J., Lee, J., Xu, K., Parisi, A.,
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., et al. Beyond human data: Scaling self-training for
Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of problem-solving with language models. arXiv preprint
demonstrations: What makes in-context learning work? arXiv:2312.06585, 2023.
arXiv preprint arXiv:2202.12837, 2022.
Sodhi, P., Branavan, S., and McDonald, R. Heap: Hierar-
Misra, D., Langford, J., and Artzi, Y. Mapping instructions chical policies for web actions using llms. arXiv preprint
and visual observations to actions with reinforcement arXiv:2310.03720, 2023.
learning. arXiv preprint arXiv:1704.08795, 2017.
Sumers, T., Marino, K., Ahuja, A., Fergus, R., and Dasgupta,
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, I. Distilling internet-scale vision-language models into
C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. embodied agents. 2023.
Webgpt: Browser-assisted question-answering with hu-
man feedback. arXiv preprint arXiv:2112.09332, 2021. Sun, H., Zhuang, Y., Kong, L., Dai, B., and Zhang, C. Ada-
planner: Adaptive planning from feedback with language
Parisi, A., Zhao, Y., and Fiedel, N. Talm: Tool augmented models. arXiv preprint arXiv:2305.16653, 2023.
language models. arXiv preprint arXiv:2205.12255,
2022. Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A.,
Hausman, K., Levine, S., and Tompson, J. Robotic
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., skill acquisition via instruction augmentation with vision-
Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring language models. arXiv preprint arXiv:2211.11736,
the limits of transfer learning with a unified text-to-text 2022.

10
BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Yao, S., Rao, R., Hausknecht, M., and Narasimhan, K.


Keep CALM and explore: Language models for action
generation in text-based games. In Webber, B., Cohn,
T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020
Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 8736–8754, Online, Novem-
ber 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.emnlp-main.704. URL https:
//aclanthology.org/2020.emnlp-main.704.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
K. R., and Cao, Y. React: Synergizing reasoning and
acting in language models. In The Eleventh International
Conference on Learning Representations, 2022.

Zhuang, Y., Yu, Y., Wang, K., Sun, H., and Zhang, C.
Toolqa: A dataset for llm question answering with exter-
nal tools. arXiv preprint arXiv:2306.13304, 2023.

11

A. Other Implementation Details


A.1. Retriever
We use a T5-XXL model to embed each word in the instruction, and mean pool across word embeddings to obtain an
instruction vector. Given a test-time instruction, to retrieve relevant demonstrations, we compute cosine similarities between
the test instruction embedding and instruction embeddings for each demonstration in our buffer, and return the top 3
demonstrations with the highest cosine similarities.
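A minimal sketch of this retrieval step, assuming embed(text) returns the mean-pooled T5-XXL embedding described above as a numpy array (all names here are illustrative):

import numpy as np

def retrieve(test_instruction, demonstrations, embed, k=3):
    """Return the k demonstrations with the most similar instructions."""
    query = embed(test_instruction)
    query = query / np.linalg.norm(query)
    scores = []
    for instruction, _trajectory in demonstrations:
        vec = embed(instruction)
        scores.append(float(query @ (vec / np.linalg.norm(vec))))  # cosine
    top = np.argsort(scores)[::-1][:k]
    return [demonstrations[i] for i in top]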

A.2. Re-sampling action strings


When executing an action string in the environment results in an exception from the low-level controller, we pass the exception message to the LM policy, and re-sample until the model outputs a valid action, or the LM exceeds the max number of tries m = 5. Here is an example prompt we use for this re-sampling procedure (the prompt is appended to the LM policy's context):

Listing 1. Re-sampling during Execution Failure


Executing Action: {error_action}...
resulted in error: {error_message}. Think about what could have caused the error, and then choose a new action.

Thought: [[thought_pred]]

Now, output a different action based on your thought. End your output with a newline.

Action: [[action]]

B. Prompts
B.1. MiniWoB++
We start by presenting all prompts for MiniWoB++. The action space for MiniWoB++ is:

Listing 2. Action Space


- Click on *description*: This action will click on element that matches *description* e.g. Click on the red
button, Click on the first result in the autocomplete
- Move Mouse to *description*: This action will hover mouse over web element that matches *description* e.g. Move
mouse to the menu bar.
- Type char *char* on *description*: This action will type a single character *char* into the web element
matching *description* e.g. Type char B on the first name field. Use this if you want to type in a word character
by character, to view or narrow search results.
- Type *text* on *description*: This action will type *text* into the web element matching *description*. Use
this to type in all the words in *text*’ all at once.
- Clear text on *description*: This action will clear all previously typed text in web element matching *
description*

This is then directly used for various prompts as {inventory_str}.

Listing 3. Exploration Policy


You are a web-agent that can interact with the given webpage by taking actions. You can take the following kinds
of actions:
{inventory_str}

Your objective is to discover diverse and interesting tasks (that a human might give to an agent) by interacting
with the webpage through these actions. You’ve executed the following actions, and observed the following webpage
states (described briefly in language).

**Previous observations and actions**


{prev_observations_and_actions}

After taking these actions, you observe the current web-page HTML:
{webpage_html}

Start by thinking about what action you should take next.


Thought: [[pred]]

Now, act by taking an action based in the inventory (or output Finish if you are done).
Action: [[pred]]


Listing 4. Instruction Generator


**Objective**
You are a web-agent that can accomplish useful tasks on a website. You are given the landing page of the website
as follows:
{init_html}

To accomplish tasks, you can break it down into a sequence of sub-tasks from a task inventory:
{inventory_str}

Propose a new task that can be performed on this website. Ensure that your tasks are concrete and use features /
contents of the given website.

Start by thinking about what new task you will generate.


Thought: [[pred]]
Answer: [[pred]]

Listing 5. Trajectory Relabeler


A web-agent is given a precise instruction from a human, which it carries out through a sequence of sub-tasks,
where each sub-task (such as clicking on elements / typing on elements / scrolling etc.) changes the HTML state
of the webpage.
You are given the initial webpage (as HTML), the final webpage after all sub-tasks are carried out, as well as a
summary of changes that each sub-task made to the starting HTML state.

Initial Webpage:
{init_webpage}

Final Webpage:
{final_webpage}

Sub-tasks attempted by the web agent:


{subgoal_str}

Summary of changes made to HTML:


{observation_changes}

Your objective to guess the instruction that was given to the agent. Ensure that your instructions are concrete
and such that every sub-task meaningfully contributes to fulfiling the instruction. Start by providing your
reasoning. Use the following format for your answer:
Reasoning: your reasoning
Answer: your answer

**Output**
Reasoning: [[pred]]
Answer: [[pred]]

Listing 6. Instruction Following Policy


You are a web-agent on an HTML page capable of executing the following kinds of sub-tasks:
{inventory_str}

You are also given some examples of how to perform instructions on the website by converting them into sub-tasks
(along with the change each sub-task caused on the website).
{exemplars}

You are given the following instruction: {instruction}.


To perform this instruction, you’ve executed the following sub-tasks, and observed the following webpage states (
described briefly in language).

**Previous observations and actions**


{prev_observations_and_actions}

After taking these actions, you observe the current web-page HTML:
{webpage_html}

Webpage Description: [[pred]]

First, think about which inventory item you should pick as your next action.
Thought: [[pred]]

Now, output next action (output *finished* if the instruction has been accomplished) by choosing an item from
your inventory
Action: [[pred]]

Listing 7. Demonstration Filter


You are given an initial web-page from a website (as HTML). To accomplish some task, a web-agent then interacts
with the website, leading to a final webpage.
Given the task, the initial webpage and the final webpage, your objective is to judge how well the web-agent
carried out this task by giving it a score from 1 to 5.

13
BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Only give a score of 5 if the task is perfectly accomplished and the final webpage has no errors.

Task:
{goal_str}

Initial Webpage:
{init_webpage}

Final Webpage:
{final_webpage}

Start by thinking about what the web-agent was trying to accomplish, and describe how well it was done.
Thought: [[pred]]
Answer: [[pred]]

B.2. ToolQA
Next, we present all prompts for ToolQA below. The list of methods for various tools in ToolQA is:

Listing 8. ToolQA methods


(1) Calculate[formula], which calculates the formula and returns the result.
(2) RetrieveAgenda[keyword], which retrieves the agenda related to keyword.
(3) RetrieveScirex[keyword], which retrieves machine learning papers’ paragraphs related to keyword.
(4) LoadDB[DBName], which loads the database DBName and returns the database. The DBName can be one of the
following: flights/coffee/airbnb/yelp.
(5) FilterDB[condition], which filters the database DBName by the column column_name the relation (e.g., =, >,
etc.) and the value value, and returns the filtered database.
(6) GetValue[column_name], which returns the value of the column column_name in the database DBName.
(7) LoadGraph[GraphName], which loads the graph GraphName and returns the graph. The GraphName can be one of the
following: PaperNet/AuthorNet.
(8) NeighbourCheck[GraphName, Node], which lists the neighbours of the node Node in the graph GraphName and
returns the neighbours.
(9) NodeCheck[GraphName, Node], which returns the detailed attribute information of Node.
(10) EdgeCheck[GraphName, Node1, Node2], which returns the detailed attribute information of the edge between
Node1 and Node2.
(11) SQLInterpreter[SQL], which interprets the SQL query SQL and returns the result.
(12) PythonInterpreter[Python], which interprets the Python code Python.

and the action space for the LM policy is:

Listing 9. Action Space


(1) Calculate *formula*, which calculates an arithmetic formula (such as 2+3, 2 * 4 etc) and returns the result.
(2) Retrieve passages related to *phrase*, which retrieves information relevant to the supplied phrase. This
retriever operates on documents containing information about people’s schedules.
(3) Retrieve passages from ML papers related to *keyword*, which retrieves machine learning papers’ paragraphs
related to keyword.
(4) Load database *DBName*, which loads the database DBName and returns the database. The DBName can be one of
the following: flights/coffee/airbnb/yelp.
(5) Filter database according to *condition*. which filters the loaded database (flights/coffee/airbnb/yelp) by a
condition and returns the filtered database. A condition is specified as *column_name relation value* where
relation can be (=, <, >, <=, >=), and column_name is a column from the loaded DB. To filter according to
multiple conditions, the format requires comma separated conditions e.g. "Filter database according to
column_name_1=value_1, column_name_2>=value_2, column_name_3<value_3".
(6) Get database value for *column_name*, which returns the value of the column column_name in the database
DBName.
(7) Load DBLP, which loads the graphs in dblp. Inside DBLP, there are two graphs: PaperNet/AuthorNet.
(8) List nodes in graph *GraphName*, which lists 10 randomly chosen nodes to help explore the graph.
(9) Check neighbours of node *Node* in graph *GraphName*, which lists the neighbours of the node Node in the
graph GraphName and returns the neighbours. GraphName can be PaperNet or AuthorNet.
(10) Get information for node *Node* in graph *GraphName*, which returns the detailed attribute information of
Node.
(11) Check edge information between nodes *Node1* and *Node2* in graph *GraphName*, which returns the detailed
attribute information of the edge between Node1 and Node2.
(12) Interpret SQLite query: *Query*, which interprets the SQLite query Query and returns the result. There are 4
tables for querying: flights_data/coffee_data/airbnb_data/yelp_data corresponding to the DBs flights/coffee/
airbnb/yelp.
(13) Evaluate python code: *code*, which uses the python exec function to execute the python codeblock *code* as
is. The result of the code must be stored in a variable called ans, and the code cannot reference any variables
not defined inside the codeblock.
(14) Finish with answer: *answer*, which returns the answer and finishes the task.

This is then directly used for various prompts as {inventory_str}. Note that the action strings (from this inventory) are converted into actual methods via string post-processing.
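As an illustration of this post-processing, the sketch below maps a few of the action strings above onto their tool methods via regular expressions; the full controller would cover all 14 actions, and the exact patterns shown here are assumptions rather than the actual implementation.

import re

def action_to_method(action: str) -> str:
    m = re.match(r"Calculate (.+)", action)
    if m:
        return f"Calculate[{m.group(1)}]"
    m = re.match(r"Load database (\w+)", action)
    if m:
        return f"LoadDB[{m.group(1)}]"
    m = re.match(r"Interpret SQLite query: (.+)", action)
    if m:
        return f"SQLInterpreter[{m.group(1)}]"
    raise ValueError(f"unrecognized action string: {action}")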

Listing 10. Exploration Policy


You are an agent with access to tools, that you may use to respond to various questions. You have the following
tools:
{inventory_str}

Your objective is to discover diverse and interesting questions (that a human might give to an agent with these
tools) by chaining together calls to different tools. You’ve executed the following tool calls, and observed the
following outputs from these tools (described briefly in language).

**Previous observations and actions**


{prev_observations_and_actions}

**Current Observation**
{curr_observation}

Start by thinking about what action you should take next.


Thought: [[pred]]

Now, act by taking an action based in the inventory (or output Finish if you are done).
Action: [[pred]]

Listing 11. Instruction Generator


**Objective**
You are an agent with access to tools, that you may use to respond to various queries. You have the following
tools:
{inventory_str}

To respond to queries, you need to call tools in a specific sequence to obtain the answer.

Your objective is to propose a query that can be performed by chaining together these tools. Ensure that your
queries are concrete.

Start by thinking about what new query you will generate.


Thought: [[pred]]
Answer: [[pred]]

Listing 12. Trajectory Relabeler


A user asks an AI agent a question, which it answers by accessing tools like databases, calculators, retrievers
and python interpreters. The AI agent answers this question by carrying out a sequence of sub-tasks, where each
sub-task (such as loading or querying a dblp graph / calling a python interpreter etc.) leads to an output from
the tool.

You are given the entire sequence of tool outputs, where the final tool output is the answer that the agent gives
. You are also given the sequence of sub-tasks attempted by the agent.

Sub-tasks attempted by the agent:


{subgoal_str}

Sequence of tool outputs:


{observation_changes}

Your objective to guess the query that was given to the agent. Ensure that your answer is concrete and such that
every sub-task meaningfully contributes to answering the query. Start by providing your reasoning. Use the
following format for your answer:
Reasoning: your reasoning
Answer: your answer

**Output**
Reasoning: [[pred]]
Answer: [[pred]]

Listing 13. Instruction Following Policy


You are an agent with access to tools, that you may use to respond to various queries. You have the following
tools:
{inventory_str}

To respond to queries, you need to call tools in a specific sequence to obtain the answer. Here are some
demonstrations of how to respond to queries by invoking tools:
{exemplars}

You are given the following query: {super_goal}

To perform this instruction, you’ve executed the following actions, and observed the following outputs from your


tools:

**Previous observations and actions**


{prev_observations_and_actions}

**Current Observation**
{curr_observation}

First, think about which tool you should pick as your next action
Thought: [[pred]]

Now, output next action (output *finished* if the instruction has been accomplished) by calling the chosen tool
with appropriate arguments. End your output with a newline
Action: [[pred]]

Listing 14. Demonstration Filter


A user asks an AI agent a question, which it answers by accessing tools like databases, calculators, retrievers
and python interpreters. The AI agent answers this question by carrying out through a sequence of sub-tasks,
where each sub-task (such as loading or querying a dblp graph / calling a python interpreter etc.) leads to an
output from the tool. You are given the entire sequence of tool outputs, where the final tool output is the
answer that the agent gives. You are also given the sequence of sub-tasks attempted by the agent.

Your objective is to judge how well the AI agent carried out this task by giving it a score from 1 to 5.
Only give a score of 5 if the task is perfectly accomplished and the final answer has no errors.

User question:
{goal_str}

Sequence of Tool outputs:


{state_changelog}

Start by thinking about what the AI agent was trying to accomplish, and describe how well it was done.
Thought: [[pred]]
Answer: [[pred]]

C. Converting LM Action space into API calls


MiniWoB++. We use the following prompt to convert the action string into an API call:

Listing 15. LM to convert action strings into an API call


Webpage HTML: {html}
Use references into the webpage to specify actions to perform a given task.

You can take 4 kinds of actions on a chosen element specified via its ref id.

Action: type(text) types ’text’ into chosen ref, useful for typing into various textboxes.
Action: click() clicks on chosen element, useful when clicking buttons,checkboxes or textboxes. Sections can be
clicked for expansion.
Action: move-mouse() moves mouse to a chosen element, useful when the element text has ’>’ symbol for expansion.
Action: clear() clears all text on chosen ref-id, useful when you want to delete text on textboxes.

To choose actions, strictly use the format below:


Chosen action: chosen from click/move-mouse/type/clear
Chosen element: Specify chosen ref id as an integer
Chosen text: text to type (n/a if chosen action is not type)

Task: {action_string}
Chosen action: [[pred]]
Chosen element: [[pred]]
Chosen text: [[pred]]

The LM predictions are combined into an API call, e.g. ref[[element]].type([[text]]). We use a simple python function to convert the API call into a Selenium web-driver method (type text, clear and move mouse are Selenium web-driver methods).
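A hypothetical sketch of such a conversion function follows, assuming elements can be located by their ref id; the selector strategy and function name are illustrative assumptions, not the paper's actual code.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

def execute_api_call(driver, action, ref, text=None):
    # Locate the chosen element by its ref id (illustrative assumption).
    element = driver.find_element(By.XPATH, f"//*[@ref='{ref}']")
    if action == "click":
        element.click()
    elif action == "type":
        element.send_keys(text)
    elif action == "clear":
        element.clear()
    elif action == "move-mouse":
        ActionChains(driver).move_to_element(element).perform()
    else:
        raise ValueError(f"unknown action: {action}")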
