BAGEL: Bootstrapping Agents by Guiding Exploration With Language
Shikhar Murty†⋆ Christopher D. Manning† Peter Shaw‡ Mandar Joshi‡ Kenton Lee‡
Abstract

[Figure 1: overview of the exploration stage, showing the environment, an instruction, and a trajectory.]
arXiv:2403.08140v1 [cs.CL] 12 Mar 2024
[Figure 2, choose-date example:
τ^0 (initial exploration): Click on datepicker → Click on Next → Click on Prev → Click on Next → Click on Prev → Click on Prev → Finish
Label: g^1 ∼ p_label(· | τ^0) = "Change month to October 7th and submit"
Follow: τ^1 ∼ p_agent(· | g^1): Click on datepicker → Click on Next → Click on Prev → Click on Prev → Click on 7th → Finish; Score: s(g^1, τ^1) ✗
Label: g^2 ∼ p_label(· | τ^1) = "Change month to October 7th and submit"
Follow: τ^2 ∼ p_agent(· | g^2): Click on datepicker → Click on Prev → Click on Prev → Click on 7th → Click Submit; Score: s(g^2, τ^2) ✓]
Figure 2. BAGEL generates synthetic demonstrations by exploring the environment. Shown here is an example from the MiniWoB++ choose-date task. First, we generate an initial trajectory by sampling actions without conditioning on any natural language instruction. Then, we alternate between generating an instruction given a trajectory, and generating a trajectory given an instruction. The process aims to converge towards a trajectory that accurately satisfies a natural language instruction, and aims to recover from errors in labeling or instruction following from earlier rounds (see example). Once an instruction and trajectory pair satisfies a filtering criterion, it is added to the set of synthetic demonstrations. Alternatively, BAGEL can be initialized by first sampling an instruction, as described in §3.2.
until the episode completes or a "finish" action is generated. We can increase the entropy of π_explore with a configurable temperature parameter.

Trajectory Labeler. The trajectory labeler, p_label(g | τ), is prompted to generate an instruction, g, that corresponds to a given trajectory, τ.

Instruction Following Policy. Unlike the exploration policy, the instruction following policy, π_agent(a_t | τ_<t, g), selects actions conditioned on an instruction, g. We sample from the resulting distribution over trajectories, p_agent(τ | g), by choosing actions according to π_agent until the episode completes or a "finish" action is generated. This component is also implemented using a ReAct-based prompt.

Demonstration Filter. Given a synthetic demonstration (g, τ), the demonstration filter makes a binary judgement s(g, τ) ∈ {0, 1}, based on how well τ corresponds to the instruction g.

Instruction Generator. Finally, as an alternative to the exploration policy (see §3.2), we can instead use an instruction generator to initialize exploration. This model defines a distribution over instructions, p_instruct(g), based on a prompt that elicits plausible instructions based on the initial observation from the environment and the action space.

3.2. Generating Demonstrations

Initial Exploration. We consider and compare two different variations of BAGEL: trajectory-first and instruction-first exploration. For trajectory-first exploration, we first sample a trajectory τ^0 ∼ p_explore(·) with the exploration policy. For instruction-first exploration, we first sample an instruction g^0 ∼ p_instruct(·) with the instruction generator.

Iterative Refinement. Trajectories sampled from p_explore may not correspond to any reasonable instruction, and, similarly, there may be no feasible trajectory that satisfies instructions sampled from p_instruct. Our iterative re-labeling procedure aims to find an instruction and trajectory pair where the trajectory satisfies the instruction, without sacrificing the diversity of the initial exploration. The process alternates between sampling instructions and trajectories:

g^t ∼ p_label(· | τ^t),      (1)
τ^{t+1} ∼ p_agent(· | g^t).  (2)

We perform these iterative updates until we find a pair where
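The alternation in Equations (1) and (2), combined with the demonstration filter, can be sketched as a simple loop. The functions `sample_label`, `sample_trajectory`, and `filter_score` below are hypothetical stand-ins for the prompted LM components; this is a minimal sketch, not the paper's implementation:

```python
# Sketch of BAGEL's iterative refinement (Eqs. 1-2 plus the filter).
# sample_label, sample_trajectory, and filter_score are stand-ins for
# the zero-shot prompted LM components.

def refine(initial_trajectory, sample_label, sample_trajectory,
           filter_score, max_rounds=5):
    """Alternate g^t ~ p_label(. | tau^t) and tau^{t+1} ~ p_agent(. | g^t)
    until the filter accepts the (instruction, trajectory) pair."""
    tau = initial_trajectory
    for _ in range(max_rounds):
        g = sample_label(tau)          # Eq. (1): relabel the trajectory
        tau = sample_trajectory(g)     # Eq. (2): follow the new instruction
        if filter_score(g, tau) == 1:  # accept only consistent pairs
            return (g, tau)            # add to synthetic demonstrations
    return None                        # discard if never accepted
```

For instruction-first exploration, the same loop applies with the order of the two sampling calls swapped on the first round.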
p_k(τ) = Σ_{τ′, g′} p_{k−1}(τ′) · p_label(g′ | τ′) · p_agent(τ | g′),   (5)

where the labeler and agent factors together constitute the environment and LM constraints.

Thus, we shape the distribution of trajectories from the previous marginal p_{k−1} based on the criterion that they can be assigned a concrete string g′, and are executable in the environment. These soft constraints work together to ensure that (1) trajectories can be described in terms of some feasible instruction in the environment, and (2) the trajectories themselves correspond to valid environment dynamics.

Connection to Hindsight Experience Replay. Hindsight Experience Replay (HER; Andrychowicz et al., 2017) is a popular approach for training language-conditioned policies. Given some goal g, HER converts an unsuccessful trajectory τ into positive examples by replacing g with some hindsight goal g′. That is, HER uses a relabeling function to map τ to a new goal g′, resulting in a positive demonstration (g′, τ) that is used to update the policy.

Since the original implementation of HER considers settings where the goal space is the raw environment observation space, applying HER to natural language instruction-following requires access to a learnt relabeling function to map observations to language instructions. Such relabeling functions typically map only the final observation o_T to the instruction via pre-trained captioning models (Xiao et al., 2022; Cideron et al., 2020; Sumers et al., 2023) that operate on trajectories from trained agents. In BAGEL, we use the full trajectory for relabeling and use an iterative relabeling procedure to reduce noise from zero-shot components.

5.1. MiniWoB++

MiniWoB++ is a collection of tasks consisting of web interfaces with a shared action space of mouse and keyboard actions. In our setup, actions are specified in natural language (Type Bob in the name text box, Click on the datepicker, Clear text on Destination). The low-level controller that maps action strings into a Selenium API call is implemented via a separate zero-shot prompted LM (see Appendix C for details). Each task consists of a script to generate variations of the task with a templated instruction, where each variation is controlled via a random seed.

Evaluation. We follow Shaw et al. (2023) for evaluating agents on MiniWoB++, mapping the raw MiniWoB++ reward from [-1, 1] to [0, 1]. For each web interface, we report the mean score over 50 random seeds. Starting with the set of 55 MiniWoB++ tasks used in prior work applying LM agents to this domain (Gur et al., 2023; Kim et al., 2023; Sun et al., 2023), we evaluate on the 10 hardest tasks, where the zero-shot agent has an average reward of less than 0.95, to perform a more targeted evaluation of BAGEL on domains that are hard for zero-shot agents.

5.2. ToolQA

ToolQA is a tool-augmented question-answering environment over 8 domains, where questions can be answered by chaining calls to multiple tools, including text retrievers, databases, a SQL interpreter, a calculator, etc. Each tool can be called according to a set of pre-defined methods (see Appendix B.2 for the full action space for the policy and corresponding tool methods). The observation space is the string output from the most recent tool call (the first observation
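On a toy discrete space, the recursion in Equation (5) can be checked directly. The probability tables below are invented purely for illustration; `step` pushes the previous trajectory marginal through the labeler and agent factors:

```python
# Toy check of Eq. (5):
#   p_k(tau) = sum_{tau', g'} p_{k-1}(tau') * p_label(g'|tau') * p_agent(tau|g').
# All probability tables here are made up for illustration.

trajectories = ["t1", "t2"]
goals = ["g1", "g2"]
p_label = {("g1", "t1"): 0.9, ("g2", "t1"): 0.1,
           ("g1", "t2"): 0.2, ("g2", "t2"): 0.8}   # p_label(g' | tau')
p_agent = {("t1", "g1"): 0.7, ("t2", "g1"): 0.3,
           ("t1", "g2"): 0.4, ("t2", "g2"): 0.6}   # p_agent(tau | g')

def step(p_prev):
    """One refinement round: sum over intermediate (tau', g') pairs."""
    return {tau: sum(p_prev[tp] * p_label[(g, tp)] * p_agent[(tau, g)]
                     for tp in trajectories for g in goals)
            for tau in trajectories}

p0 = {"t1": 0.5, "t2": 0.5}
p1 = step(p0)
assert abs(sum(p1.values()) - 1.0) < 1e-9  # p_k remains a valid distribution
```

Because the labeler and agent tables are each normalized, the marginal stays a proper distribution after every round; the refinement only reshapes probability mass toward trajectories reachable through some instruction.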
[Figure 3: per-domain bar charts. Average scores: MiniWoB++ 46.8 (Zero-Shot) vs. 60.5 (+ BAGEL); ToolQA 40.9 (Zero-Shot) vs. 43.3 (+ BAGEL).]
Figure 3. Results across MiniWoB++ and ToolQA, broken down by domain. We compare using demonstrations obtained via BAGEL (blue) against a zero-shot ReAct baseline (green) that uses no synthetic demonstrations. For MiniWoB++, we use the Trajectory-First variant for exploration, and for ToolQA, we use Instruction-First. Overall, using BAGEL demonstrations leads to improvements on both datasets.
[Figure 4: three bar charts showing the number of demonstrations (0–40) per semantic category.]
Figure 4. Distribution of demonstrations over semantic categories for MiniWoB++ environments, social-media and email-inbox-all, and ToolQA. While BAGEL prefers certain modes, overall we find that these demonstrations cover a diverse range of actions.
We plot the number of demonstrations in each cluster in Figure 4. We note that while this distribution tends to be skewed towards specific modes (e.g. {graph} for ToolQA, {Star} for email-inbox), there exists a long tail that covers a broad range of possible use cases in the environment. Nevertheless, improving diversity during exploration remains a failure mode for BAGEL, which we expand on next. Finally, we provide some examples of BAGEL demonstrations in Table 4, along with their corresponding semantic category.

8.5. Error Analysis

We conclude with a discussion of failure modes of our approach, using the domains book-flight, search-engine, and SciRex as case studies.

Handling Long-Horizon Planning. We note that book-flight is the most complex environment in MiniWoB++, with longer trajectories of lengths 8-20, and the zero-shot policy performs poorly on this environment (average reward of 5%). While using BAGEL demonstrations improves this to 15%, we hypothesize that further improvements would require better handling of long-range plans, such as with hierarchical planning (Sodhi et al., 2023; Jiang et al., 2019).

Improving Diversity. We hypothesize that improving diversity among seed trajectories would lead to further improvements across the board. For instance, for book-flight, all BAGEL demonstrations correspond to booking flights in December, while the test distribution is more uniform.

Reducing Mismatch with Test Instructions. On SciRex, all models fail to produce even a single correct answer. Here, we find that in the absence of any knowledge about user instructions at test time, BAGEL demonstrations tend to create questions with more descriptive answers and trajectories with generic queries (see Table 4 for an example), while test instructions require retrieving specific numbers from scientific documents by querying for specific topics. Similarly, on search-engine, we note a modest improvement of only 5%. Here, we find that while BAGEL demonstrations cover a variety of instructions like "Search for cat and navigate to the third page of search results" and "Search for cars, then visit the second search result", the model fails on test instructions like "Enter [term] then find and click the 9th search result", which requires keeping track of the number of search results per page, and navigating to the correct page. While our goal is to build fully unsupervised agents, methods that use sparse information about test-time instructions could help drive performance further.

9. Related Work

Instruction-Following Digital Agents. Building agents that navigate the digital world is a long-standing goal of AI and language understanding (Allen et al., 2007; Branavan et al., 2009). However, most prior work relies on expert demonstrations (Liu et al., 2018; Humphreys et al., 2022; Furuta et al., 2023) with an appropriately shaped reward (Branavan et al., 2009; Liu et al., 2018). Here, we assume no access to demonstrations or a reward function, and use pre-trained components to bootstrap synthetic demonstrations.

LMs for Decision Making. Pre-trained LMs are increasingly being used for sequential decision making tasks such as robotic manipulation (Ahn et al., 2022; Liang et al., 2023), instruction-following (Yao et al., 2022; Kim et al., 2023; Sun et al., 2023; Lù et al., 2024), and tool-use (Parisi et al., 2022). While some of these approaches finetune LMs based on human demonstrations (Nakano et al., 2021), others use human demonstrations in their prompt for in-context learning and adaptation (Yao et al., 2022; Kim et al., 2023; Sun et al., 2023). We use no human supervision or reward, and adapt LM agents purely using synthetic demonstrations. Another line of work uses LM priors in RL to improve exploration (Du et al., 2023), deal with large action spaces (Yao et al., 2020), or as proxy reward functions (Kwon et al., 2023). In the same tradition, BAGEL bootstraps a learning signal in the form of synthetic demonstrations by combining several LM components, without RL.
MiniWoB++

Instruction: Find the email by Trixi and reply to them with the text "Maecenas eu massa" {Reply}
Trajectory: Move Mouse to Trixi → Click on an email-thread → Click on the reply button → Type 'Maecenas eu massa' on the textarea with id 'reply-text' → Click on the span with id 'send-reply'

Instruction: Find the email by Darcy and forward it to Dionis {Forward}
Trajectory: Click on Darcy, the sender of an email thread. → Click on 'forward' button → Type Dionis on the to field → Click on the 'send' button

Instruction: Retweet Gallegos's post {Retweet}
Trajectory: Move Mouse to Pretium,. Ullamcorper. → Click on retweet element with id 101

Instruction: Like tweet by @leonie and share tweet by @livia {Like, Share}
Trajectory: Click on the like element with ID 41. → Click on share-113

ToolQA

Instruction: What are David's plans this weekend? {RetrieveAgenda}
Trajectory: Retrieve passages related to David's plans this weekend → Finish with answer: On the evening of September 16th, 2022, David will be attending a Blind Date Night event at The Press Lounge.

Instruction: Who is affiliated with both nicolas christin and giulia fanti? {Python, Graph}
Trajectory: Load DBLP → Check neighbours of node Giulia Fanti in graph AuthorNet → Check neighbours of node Nicolas Christin in graph AuthorNet → Evaluate python code: list1=['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg', 'Nicolas Christin', 'Suma Bhat', 'Hongyu Gong']; list2=['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg', 'Suma Bhat', 'Hongyu Gong', 'Giulia Fanti']; ans=set(list1) & set(list2) → Finish with answer: {'Hongyu Gong', 'Rohan Bansal', 'Wanzheng Zhu', 'Zachary Weinberg', 'Suma Bhat'}

Instruction: What are the top 5 airbnb options with price < 900, availability > 260 and at least 40 reviews {Database, SQL}
Trajectory: Load database airbnb → Filter database according to price < 900, availability 365 > 260, number of reviews > 40 → Interpret SQLite query: SELECT * FROM airbnb_data ORDER BY number_of_reviews DESC LIMIT 5 → Finish with answer: ['High-end doorman bldg in the LES', 'THE BEST DEAL ON THE HUDSON RIVER!!', 'Heart of Williamsburg, Brooklyn!', 'Beautiful & Tranquil Oasis in a Great Location', 'Sunny/Cozy 1BD']

Instruction: What are the different approaches for computing graph similarity? {RetrieveSciRex}
Trajectory: Retrieve passages from ML papers related to graph similarity → Finish with answer: The different approaches to computing graph similarity are graph kernels, graph features and graph convolutional neural networks (CNNs).
Table 4. Example demonstrations obtained via BAGEL for MiniWoB++ (top) and ToolQA (bottom). We also provide the semantic
category for these demonstrations, and report the distribution of these categories in Figure 4.
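The {Python, Graph} example in Table 4 ends with an "Evaluate python code" step; run on its own, that snippet reproduces the answer reported in the table:

```python
# The set-intersection step from the DBLP example in Table 4.
list1 = ['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg',
         'Nicolas Christin', 'Suma Bhat', 'Hongyu Gong']
list2 = ['Wanzheng Zhu', 'Rohan Bansal', 'Zachary Weinberg',
         'Suma Bhat', 'Hongyu Gong', 'Giulia Fanti']
ans = set(list1) & set(list2)  # authors adjacent to both query nodes
assert ans == {'Hongyu Gong', 'Rohan Bansal', 'Wanzheng Zhu',
               'Zachary Weinberg', 'Suma Bhat'}
```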
Self-training for Language Models. A recent line of work uses LM-generated data for finetuning the same LM, in settings where external verifiers may be used to filter generated data (Singh et al., 2023; Gulcehre et al., 2023). While we also use data generated from an LM for adaptation, unlike these approaches, environment interactions form a critical part of the learning signal, and we do not use external verifiers for filtering data.

10. Conclusion

a learning signal with minimal human supervision. To this end, we introduce BAGEL, a method for constructing synthetic demonstrations for instruction-following agents. These demonstrations are constructed by iteratively relabeling an initial seed set of trajectories or instructions, where both relabeling and exploration are driven by a language model. Experiments on two different domains show that using BAGEL demonstrations as in-context exemplars leads to considerable improvements ranging from 2-13%, as well as significant reductions in execution failures.
terms of service). Further research related to secure model deployment should take into account problems such as spam detection, privacy preservation, etc.

Acknowledgements

SM was partly funded by a gift from Apple Inc. CM is a fellow in the CIFAR Learning in Machines and Brains program. We thank David Gaddy, Anna Goldie, Luke Vilnis, Tianze Shi, Jonathan Berant, Kristina Toutanova, Raphael Hoffman, and members of Google DeepMind and the Stanford NLP Group for helpful discussions and comments.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Allen, J., Chambers, N., Ferguson, G., Galescu, L., Jung, H., Swift, M., and Taysom, W. PLOW: A collaborative task learning agent. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, AAAI'07, pp. 1514–1519. AAAI Press, 2007. ISBN 9781577353232.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Branavan, S., Chen, H., Zettlemoyer, L., and Barzilay, R. Reinforcement learning for mapping instructions to actions. In Su, K.-Y., Su, J., Wiebe, J., and Li, H. (eds.), Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 82–90, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL https://aclanthology.org/P09-1010.

Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., and Salakhutdinov, R. Gated-attention architectures for task-oriented language grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Cideron, G., Seurin, M., Strub, F., and Pietquin, O. HIGhER: Improving instruction following with hindsight generation for experience replay. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 225–232. IEEE, 2020.

Du, Y., Watkins, O., Wang, Z., Colas, C., Darrell, T., Abbeel, P., Gupta, A., and Andreas, J. Guiding pretraining in reinforcement learning with large language models. arXiv preprint arXiv:2302.06692, 2023.

Furuta, H., Nachum, O., Lee, K.-H., Matsuo, Y., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023.

Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.

Gur, I., Nachum, O., Miao, Y., Safdari, M., Huang, A., Chowdhery, A., Narang, S., Fiedel, N., and Faust, A. Understanding HTML with large language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 2803–2821, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.185. URL https://aclanthology.org/2023.findings-emnlp.185.

Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pp. 9118–9147. PMLR, 2022.

Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P., Santoro, A., and Lillicrap, T. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pp. 9466–9482. PMLR, 2022.

Jiang, Y., Gu, S. S., Murphy, K. P., and Finn, C. Language as an abstraction for hierarchical deep reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Barzilay, R. and Kan, M.-Y. (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.

Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. Reward design with language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=10uNUgI5Kl.

Lee, K., Chang, M.-W., and Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6086–6096, 2019.

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE, 2023.

Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018.

Logeswaran, L., Fu, Y., Lee, M., and Lee, H. Few-shot subgoal planning with language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5493–5506, 2022.

Lù, X. H., Kasner, Z., and Reddy, S. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

Misra, D., Langford, J., and Artzi, Y. Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795, 2017.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Parisi, A., Zhao, Y., and Fiedel, N. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Su, J., Duh, K., and Carreras, X. (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.

Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for SQuAD. In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL https://aclanthology.org/P18-2124.

Shaw, P., Joshi, M., Cohan, J., Berant, J., Pasupat, P., Hu, H., Khandelwal, U., Lee, K., and Toutanova, K. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245, 2023.

Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135–3144. PMLR, 2017.

Shinn, N., Labash, B., and Gopinath, A. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.

Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Liu, P. J., Harrison, J., Lee, J., Xu, K., Parisi, A., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.

Sodhi, P., Branavan, S., and McDonald, R. HeaP: Hierarchical policies for web actions using LLMs. arXiv preprint arXiv:2310.03720, 2023.

Sumers, T., Marino, K., Ahuja, A., Fergus, R., and Dasgupta, I. Distilling internet-scale vision-language models into embodied agents. 2023.

Sun, H., Zhuang, Y., Kong, L., Dai, B., and Zhang, C. AdaPlanner: Adaptive planning from feedback with language models. arXiv preprint arXiv:2305.16653, 2023.

Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S., and Tompson, J. Robotic skill acquisition via instruction augmentation with vision-language models. arXiv preprint arXiv:2211.11736, 2022.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

Zhuang, Y., Yu, Y., Wang, K., Sun, H., and Zhang, C. ToolQA: A dataset for LLM question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.
Thought: [[thought_pred]]
Now, output a different action based on your thought. End your output with a newline.
Action: [[action]]
B. Prompts
B.1. MiniWoB++
We start by presenting all prompts for MiniWoB++. The action space for MiniWoB++ is:
Your objective is to discover diverse and interesting tasks (that a human might give to an agent) by interacting
with the webpage through these actions. You’ve executed the following actions, and observed the following webpage
states (described briefly in language).
After taking these actions, you observe the current web-page HTML:
{webpage_html}
Now, act by taking an action based in the inventory (or output Finish if you are done).
Action: [[pred]]
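The prompts in this appendix are templates: slots in braces (e.g. {webpage_html}) are filled with environment state, and [[pred]] marks where the LM's completion is read off. A minimal sketch of this plumbing, with `fill_template` as a hypothetical helper (the paper does not specify its implementation):

```python
# Sketch: fill a BAGEL prompt template's {slot} placeholders and
# truncate at the [[pred]] marker where the LM completion would go.

def fill_template(template: str, **slots: str) -> str:
    """Substitute {slot} values; return the prefix sent to the LM."""
    for name, value in slots.items():
        template = template.replace("{" + name + "}", value)
    # Everything before [[pred]] is the prompt; the LM's completion
    # fills the [[pred]] slot.
    return template.split("[[pred]]", 1)[0]

template = ("After taking these actions, you observe the current web-page HTML:\n"
            "{webpage_html}\n"
            "Now, act by taking an action based in the inventory "
            "(or output Finish if you are done).\n"
            "Action: [[pred]]")

prompt = fill_template(template, webpage_html="<button id=1>Next</button>")
assert prompt.endswith("Action: ")
```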
To accomplish tasks, you can break it down into a sequence of sub-tasks from a task inventory:
{inventory_str}
Propose a new task that can be performed on this website. Ensure that your tasks are concrete and use features /
contents of the given website.
Initial Webpage:
{init_webpage}
Final Webpage:
{final_webpage}
Your objective is to guess the instruction that was given to the agent. Ensure that your instructions are concrete
and such that every sub-task meaningfully contributes to fulfilling the instruction. Start by providing your
reasoning. Use the following format for your answer:
Reasoning: your reasoning
Answer: your answer
**Output**
Reasoning: [[pred]]
Answer: [[pred]]
You are also given some examples of how to perform instructions on the website by converting them into sub-tasks
(along with the change each sub-task caused on the website).
{exemplars}
After taking these actions, you observe the current web-page HTML:
{webpage_html}
First, think about which inventory item you should pick as your next action.
Thought: [[pred]]
Now, output next action (output *finished* if the instruction has been accomplished) by choosing an item from
your inventory
Action: [[pred]]
Only give a score of 5 if the task is perfectly accomplished and the final webpage has no errors.
Task:
{goal_str}
Initial Webpage:
{init_webpage}
Final Webpage:
{final_webpage}
Start by thinking about what the web-agent was trying to accomplish, and describe how well it was done.
Thought: [[pred]]
Answer: [[pred]]
B.2. ToolQA
Next, we present all prompts for ToolQA below. The list of methods for various tools in ToolQA is:
This is then directly used for various prompts as {inventory str}. Note that the action strings (from this inventory) are
Your objective is to discover diverse and interesting questions (that a human might give to an agent with these
tools) by chaining together calls to different tools. You’ve executed the following tool calls, and observed the
following outputs from these tools (described briefly in language).
**Current Observation**
{curr_observation}
Now, act by taking an action based in the inventory (or output Finish if you are done).
Action: [[pred]]
To respond to queries, you need to call tools in a specific sequence to obtain the answer.
Your objective is to propose a query that can be performed by chaining together these tools. Ensure that your
queries are concrete.
You are given the entire sequence of tool outputs, where the final tool output is the answer that the agent gives
. You are also given the sequence of sub-tasks attempted by the agent.
Your objective is to guess the query that was given to the agent. Ensure that your answer is concrete and such that
every sub-task meaningfully contributes to answering the query. Start by providing your reasoning. Use the
following format for your answer:
Reasoning: your reasoning
Answer: your answer
**Output**
Reasoning: [[pred]]
Answer: [[pred]]
To respond to queries, you need to call tools in a specific sequence to obtain the answer. Here are some
demonstrations of how to respond to queries by invoking tools:
{exemplars}
To perform this instruction, you’ve executed the following actions, and observed the following outputs from your
tools:
**Current Observation**
{curr_observation}
First, think about which tool you should pick as your next action
Thought: [[pred]]
Now, output next action (output *finished* if the instruction has been accomplished) by calling the chosen tool
with appropriate arguments. End your output with a newline
Action: [[pred]]
Your objective is to judge how well the AI agent carried out this task by giving it a score from 1 to 5.
Only give a score of 5 if the task is perfectly accomplished and the final answer has no errors.
User question:
{goal_str}
Start by thinking about what the AI agent was trying to accomplish, and describe how well it was done.
Thought: [[pred]]
Answer: [[pred]]
You can take 4 kinds of actions on a chosen element specified via its ref id.
Action: type(text) types ’text’ into chosen ref, useful for typing into various textboxes.
Action: click() clicks on chosen element, useful when clicking buttons,checkboxes or textboxes. Sections can be
clicked for expansion.
Action: move-mouse() moves mouse to a chosen element, useful when the element text has ’>’ symbol for expansion.
Action: clear() clears all text on chosen ref-id, useful when you want to delete text on textboxes.
Task: {action_string}
Chosen action: [[pred]]
Chosen element: [[pred]]
Chosen text: [[pred]]
The LM predictions are combined into an API call, e.g. ref[[element]].type([[text]]). We use a simple Python function to convert the API call into a Selenium web-driver method (type text, clear, and move mouse are Selenium web-driver methods):
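The conversion function itself is not reproduced in this appendix. A minimal sketch of the parsing half, assuming filled-in calls look like ref[3].type(Bob) (the exact filled-in format is an assumption for illustration); the Selenium dispatch is indicated in comments using standard web-driver methods:

```python
import re

# Sketch: parse the controller's predicted API call into its parts
# before dispatching to a Selenium web-driver method. The call format
# "ref[<id>].<action>(<arg>)" is an assumed concretization of
# ref[[element]].type([[text]]).

def parse_api_call(call: str):
    """Return (ref_id, action, argument) from e.g. 'ref[3].type(Bob)'."""
    m = re.fullmatch(r"ref\[(\d+)\]\.(type|click|move-mouse|clear)\((.*)\)", call)
    if m is None:
        raise ValueError(f"unparseable action: {call!r}")
    ref_id, action, arg = m.groups()
    return int(ref_id), action, arg

# Dispatch would then use standard Selenium calls, e.g.:
#   element = driver.find_element(By.CSS_SELECTOR, f'[ref="{ref_id}"]')
#   element.send_keys(arg)                        # "type"
#   element.click()                               # "click"
#   element.clear()                               # "clear"
#   ActionChains(driver).move_to_element(element) # "move-mouse"

assert parse_api_call("ref[3].type(Bob)") == (3, "type", "Bob")
assert parse_api_call("ref[7].click()") == (7, "click", "")
```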