
TYPE Original Research
PUBLISHED 15 August 2023
DOI 10.3389/frobt.2023.1221739

Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning

Georgia Chalvatzaki 1,2,3*, Ali Younes 1, Daljeet Nandha 1, An Thai Le 1, Leonardo F. R. Ribeiro 4† and Iryna Gurevych 1,2

1 Computer Science Department, Technische Universität Darmstadt, Darmstadt, Germany, 2 Hessian.AI, Darmstadt, Germany, 3 Center for Mind, Brain and Behavior, University Marburg and JLU Giessen, Marburg, Germany, 4 Amazon Alexa, Seattle, WA, United States

OPEN ACCESS

EDITED BY
Dimitrios Kanoulas, University College London, United Kingdom

REVIEWED BY
Maria Koskinopoulou, Heriot-Watt University, United Kingdom
Dario Albani, Technology Innovation Institute (TII), United Arab Emirates

*CORRESPONDENCE
Georgia Chalvatzaki, georgia.chalvatzaki@tu-darmstadt.de

†PRESENT ADDRESS
Leonardo F. R. Ribeiro, TU Darmstadt, Darmstadt, Germany

RECEIVED 12 May 2023
ACCEPTED 03 July 2023
PUBLISHED 15 August 2023

CITATION
Chalvatzaki G, Younes A, Nandha D, Le AT, Ribeiro LFR and Gurevych I (2023), Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning. Front. Robot. AI 10:1221739. doi: 10.3389/frobt.2023.1221739

COPYRIGHT
© 2023 Chalvatzaki, Younes, Nandha, Le, Ribeiro and Gurevych. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Long-horizon task planning is essential for the development of intelligent assistive and service robots. In this work, we investigate the applicability of a smaller class of large language models (LLMs), specifically GPT-2, in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially. Our method grounds the input of the LLM on the domain that is represented as a scene graph, enabling it to translate human requests into executable robot plans, thereby learning to reason over long-horizon tasks, as encountered in the ALFRED benchmark. We compare our approach with classical planning and baseline methods to examine the applicability and generalizability of LLM-based planners. Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating the promising potential for the future application of neuro-symbolic planning methods in robotics.

KEYWORDS
robot learning, task planning, grounding, language models (LMs), pretrained models, scene graphs
1 Introduction
The autonomous execution of long-horizon tasks is of utmost importance for future
assistive and service robots. An intelligent robot should reason about its surroundings, e.g.,
regarding the included objects and their spatial-semantic relations, and abstract an action
plan for achieving a goal that will purposefully alter the perceived environment. Such an
elaborate course of robot actions requires scene understanding, semantic reasoning, and
planning over symbols and geometries. The advent of Deep Learning led many researchers
to faithfully follow end-to-end approaches due to the representation power of differentiable
deep neural networks (LeCun et al., 2015).
The problem of sequential decision-making has been addressed with search-based and optimization approaches (Kaelbling and Lozano-Pérez, 2011; Toussaint, 2015; Driess and Toussaint, 2019; Garrett et al., 2021; 2020), as well as with learning-based (Nair and Finn, 2019; Funk et al., 2021; Hoang et al., 2021) and hybrid methods (Kim et al., 2019; Driess et al., 2020; Ren et al., 2021; Funk et al., 2022). While the former enjoy probabilistic completeness, they require full domain specification and have high computational demands. The learning-based methods require broad exploration to learn from experience, but they have shown better generalization capabilities in domains similar to those experienced during training.

Large Language Models (LLMs) have exhibited an unprecedented generative ability (Bommasani et al., 2021), thanks to the transformer architecture (Vaswani et al., 2017) combined with massive datasets distilled from the internet. Naturally, in the quest for general artificial intelligence, researchers try to benchmark such models in reasoning tasks, among others (Wang et al., 2018; 2019). Robotic embodied intelligence requires both logical and geometric reasoning; hence, it is a holy grail of AI. Several researchers saw a benefit in LLMs, and it was not long before several works explored their application to robotics for endowing robots with reasoning abilities in the scope of autonomous task planning and interaction (Brohan et al., 2022; Wei et al., 2022b). However, most works have focused on prompting (Brown et al., 2020) and the subsequent prompt engineering (White et al., 2023), in which engineers provide appropriate inputs to LLMs for extracting outputs that are realizable by a robotic agent, either for human-instruction following (Ouyang et al., 2022) or for planning (Singh et al., 2022; Zeng et al., 2022).

In this work, we study a finetuning process for grounding a small LLM for robotics, i.e., GPT-2 (Radford et al., 2021), to be used as a high-level abstraction in a task planning pipeline. Particularly, we propose a method that decomposes a long-horizon task into subgoals in the form of goal specifications for a robotic task planner to execute, and we investigate whether such a method can reach the performance levels of an oracle task planning baseline.

Our contribution is twofold: (i) we propose a novel method for linearizing the relations in a scene-graph structure representing the domain (world) to provide it as grounding context when finetuning a pretrained language model (e.g., GPT-2) for learning to draw associations between possible actions (goto, pick, etc.) and objects in the scene (e.g., kitchen, apple, etc.). Importantly, in our context, we encode the relative position of objects (far, close, right, left, etc.), allowing our model to account for the scene's geometrical structure when learning to plan. The proper structure of the input context is necessary for enabling the model to reason about the combinatorics of actions with affordable objects and their logical sequence (e.g., to cook something, one must first go to the kitchen). (ii) We show that larger pretrained models do not necessarily possess grounded reasoning abilities, while it is possible to finetune smaller models on various tasks to use them as parts of a broader neuro-symbolic planning architecture. Contrary to works that directly apply actions suggested by GPTs to robots, we use language models at a higher level of abstraction, effectively suggesting sub-goals as PDDL problems to be solved by the Fast Downward task planner (Helmert, 2006), thereby decomposing the whole problem into smaller ones of lower complexity.

Our thorough experimental evaluation shows that finetuning GPT-2 by additionally grounding its input on the domain can help translate human requests (tasks) into executable robot plans, and to learn to reason over long-horizon tasks, such as those encountered in the ALFRED benchmark (Shridhar et al., 2020). We compare our proposed approach with classical planning methods to investigate the applicability and generalizability of Pre-trained Language Model (PLM)-based planners compared to classical task planners operating on a limited computational budget for a fair comparison. We conclude that the knowledge stored in a PLM can be grounded on different domains to perform long-horizon task planning, showing encouraging results for the future application of neuro-symbolic planning methods in robotics.

2 State of the art

2.1 Reasoning with large language models

LLMs have attracted much attention for understanding the commonsense and reasoning patterns in their latent space (Zhou et al., 2020; Li et al., 2021; Bian et al., 2023). It has been shown that some abilities in logical and mathematical reasoning seem to emerge when LLMs are prompted appropriately (Wei et al., 2022b; a). However, the engineering effort, as well as the lack of robustness, is a key issue in prompting massive models (Ruis et al., 2022; Valmeekam et al., 2022). While great effort seems to be consumed on few-shot prompting of huge parametric models, other lines of work have shown that efficient finetuning of much smaller models (Tay et al., 2022), or the use of small adaptation modules (Adapters) (Houlsby et al., 2019; Pfeiffer et al., 2021), can lead to methods that perform more robustly than large-scale generalist few-shot prompters. In the same direction, the chatbot versions of those huge models have recently raised several points of criticism, showing that much more is needed than just prompting and blind human-preference alignment¹.

¹ "7 problems facing Bing, Bard, and the future of AI search", from The Verge.

2.2 Robot behavior planning

Long-horizon robot behavior planning is an NP-hard problem (Wells et al., 2019). Current advances in ML and perception have led researchers to revisit this fundamental problem, i.e., the execution of multi-stage tasks, whose completion requires many sequential goals to be achieved, considering learning-based heuristics (Driess et al., 2020). Researchers consider such problems as Task And Motion Planning (TAMP) problems (Garrett et al., 2021; Ren et al., 2021; Xu et al., 2022), with a symbolic plan over entities and predicates with respective action operators with preconditions and effects in the environment. In contrast, a motion plan tries to find a feasible path to the goal. Nevertheless, most TAMP methods rely on manually specified rules; they do not integrate perception, and the combinatorial explosion when searching over symbolic and continuous parameters prohibits scaling these methods to challenging, realistic problems (Kim et al., 2019; Garrett et al., 2020).

Transformer models (Vaswani et al., 2017), which revolutionized the field of Natural Language Processing (NLP), opened the way for multiple new applications, in particular for robotics, e.g., visual-language instruction following (Pashevich et al., 2021), 3D scene understanding and grounding (Chen W. et al., 2022; Mees et al., 2022), and language-based navigation (Huang C. et al., 2022; Shah et al., 2023). Due to their training on extensive databases, several
works explored the use of LLMs for task planning and long-horizon manipulation (Huang et al., 2022b), mainly employing clever prompting (Raman et al., 2022; Singh et al., 2022), using multimodal information (Jiang et al., 2022; Zeng et al., 2022), grounding with value-functions (Chen B. et al., 2022; Brohan et al., 2022; Huang et al., 2022c), and deploying advances in code generation to extract executable robot plans (Liang et al., 2022). Li et al. (2022) propose to use a PLM as a scaffold for decision-making policies in interactive environments, demonstrating benefits in the generalization abilities for policy learning even when language is not provided as input or output. Recently, PaLM-E (Driess et al., 2020) has integrated a vision transformer with the PaLM language model and has encoded some robotic state data to propose a multimodal embodied model, which showed the potential of integrating geometric information of the robot state but achieved limited performance in robotic tasks.

3 Language models for grounded robot task planning

3.1 Problem statement

Let us assume an agent that is able to move and manipulate objects in an environment, e.g., a mobile manipulator robot in a household environment. Let the environment be composed of a combination of rooms, such as 'bathroom,' 'living room,' or 'kitchen.' Each room contains objects and receptacles, i.e., objects that are able to receive other objects, such as 'table,' 'drawer,' or 'sink.' Each object has (household-specific) properties (affordances) associated with it that define whether it can be picked up, cleaned, heated, cooled, cut, etc. These properties can change, meaning that the objects have a state. The agent can pick up only one object at a time, meaning the agent also has a state, e.g., 'object in hand.' Given the fact that the state is preserved over time and future actions depend on past actions, the environment can be characterized as sequential. Therefore, a series of actions has to be reasoned upon for an agent to be able to execute a series of actions for solving a long-horizon task, i.e., a task that requires the completion of several subtasks and potentially the manipulation of various objects to achieve the end-goal.

3.2 The ALFRED benchmark

The ALFRED benchmark (Shridhar et al., 2020) contains human-annotated training samples and image-based recordings of everyday household tasks; these are "25,743 English language directives describing 8,055 expert demonstrations averaging 50 steps each, resulting in 428,322 image-action pairs". In addition to that, the dataset provides a PDDL domain of the overall task and a PDDL problem for each sample (Aeronautiques et al., 1998). ALFRED heavily depends on AI2-THOR (Kolve et al., 2017), which acts as the underlying controller and simulation environment (based on the Unity game engine): trajectories for each sample of the ALFRED dataset were generated with AI2-THOR, and the validation of user-generated actions requires the AI2-THOR controller. Figure 1 shows a sample scene loaded into the AI2-THOR simulator. Each sample in the dataset consists of a high-level plan in PDDL and the trajectory of the agent's actions which lead to successful task completion, together with a description of the task goal and each plan step in Natural Language (NL).

3.3 State and action space

The state space defines the feedback provided by the environment, while the action space defines the available actions to interact with the environment.

State space. The environments we consider in this work are fully observable, having access to the complete simulator state representing the domain, as commonly considered for task planning tasks. AI2-THOR, the underlying simulator, eliminates most physics-related aspects (e.g., objects are automatically picked up and placed by a single action), which makes the highly dynamic and stochastic household environment almost static and deterministic—almost, because some physics still exists. This simplifies the core TAMP problem along with the discrete agent actions defined in the ALFRED dataset. Therefore, the ALFRED benchmark represents an appropriate choice for studying the problem of learning for robotic task planning, where motion failures are minimized by the underlying AI2-THOR controllers. Hence, we can focus on the reasoning aspects of the problem, which is the focus of this study. In the following, the state of the domain is transformed to NL context (§3.6) and not used directly as model input.

Action space. ALFRED has an action space of eight discrete high-level actions: GotoLocation, PickupObject, PutObject, CoolObject, HeatObject, CleanObject, SliceObject and ToggleObject. The underlying AI2-THOR navigation controller also has a discrete action space; the agent can move forward, backward, left or right and rotate clockwise or counter-clockwise in fixed steps.

3.4 Task categories

The ALFRED dataset encompasses seven categories of household tasks: "Look at object," "Pick and place," "Pick two and place," "Pick and place with movable receptacle," "Pick, clean then place," "Pick, cool then place," "Pick, heat then place." Because objects can be placed in different corners of a room, each of these tasks includes the sub-problem of navigation. For the 'pick' or 'place' subtasks, executing the respective PickupObject or PutObject action is sufficient. But the subtasks "clean", "cool" and "heat" must be seen as planning problems on their own, because the corresponding actions are a composition of high-level state-dependent actions. Regarding the household environment, the subtask "cool" requires a fridge, "heat" requires a microwave (or oven), and "clean" requires a sink as a receptacle. The ALFRED simulator tracks the state of each object, and a subtask is only considered successful when the final object state is correct. For example, if the task category is 'Pick, clean then place', the task goal is only completed when the placed object is marked as 'clean.' The implementation aspects of these task categories are discussed in Section 4.
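As an aside, the receptacle requirements above can be summarized in a small lookup table. The sketch below is purely illustrative (ALFRED encodes these constraints in its PDDL domain, not in Python), and the names are hypothetical:

# Receptacles required by the composite subtasks described above.
# Illustrative only; ALFRED expresses these constraints in PDDL.
COMPOSITE_REQUIREMENTS = {
    "CleanObject": "sink",       # "clean" needs a sink
    "CoolObject": "fridge",      # "cool" needs a fridge
    "HeatObject": "microwave",   # "heat" needs a microwave (or oven)
}

def required_receptacle(action: str):
    """Return the receptacle a composite action depends on, or None."""
    return COMPOSITE_REQUIREMENTS.get(action)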


FIGURE 1
AI2-THOR simulator rendering a sample rollout from the ALFRED benchmark (Shridhar et al., 2020). The scenes show a room with household objects and the
robot executing a task. Note that the robot does not have an arm, and the object automatically floats in front of the camera; it interacts with the
environment through discrete actions. The discrete actions are shown underneath each frame in the form of PDDL commands.

3.5 RobLM: robot language model for task plan generation

Just like images can be represented by discretizing the color space, NL can be expressed as a sequence of tokens $x = [x_1, x_2, \ldots, x_n]$, where each token is mapped to an embedding (lookup table). The Language Model (LM) places a probability distribution p(x) over the output token sequence. p(x) can be decomposed into conditional probability distributions $p(x_{i+1} \mid x_0, \ldots, x_i)$, where the probability of each token depends on all previous tokens. This results in the following joint distribution p(x) for a sequence of tokens x:

$p(x) = \prod_i p(x_i \mid x_0, x_1, \ldots, x_{i-1})$   (1)

With regard to Neural Networks (NNs), p(x) is commonly estimated with the Softmax function (Bengio et al., 2000)²:

$p(x) = \mathrm{Softmax}(W h_T + b) = \frac{\exp(W h_T + b)}{\sum \exp(W h_T + b)}$,   (2)

where W is the learned weight matrix, b the bias, and $h_T$ the output vector of the NN. For text generation, the joint distribution p(x) (see Eq. 1) can be formulated as a maximum-likelihood objective, where the objective is to maximize the likelihood of the next token occurrence for the given data.

² The Softmax function applies the exponential function to each element of the input vector and normalizes the values by dividing by the sum of the exponentials.

Our goal is to finetune a LM to get a Robot Language Model (RobLM) that can generate a complete high-level task plan in one shot, given the domain information and a task goal. Because LMs are unsupervised learners, a single training sample contains both the given and the desired information as NL text. A restriction to the text format (a string of characters) comes with challenges: structural information needs to be condensed into a single linear dimension, and conceptually different aspects of the input need to be annotated in the text. This text format, including the syntax, has to be designed in such a way that information can be fed to and extracted from the LM reliably.

In RobLM, the format definition for a NL task description must comply with the following syntactic rule (spaces added for readability):

Goal [<SEP> Context] <BOS> Plan <EOS>
[...]  := optional
<SEP>  := separator token
<BOS>  := begin-of-sequence token
<EOS>  := end-of-sequence token

Goal is the task goal in NL. Context is any additional, yet optional, information provided to the LM. The task might have
ambiguous solutions, and the inherent assumption is that the LM will better "understand" the task if given a context. Examples of a context are the name of the room, the name of the target object, or a NL description of the environment (see §3.6).

Plan is the sequence of high-level task actions and their respective arguments, such as an object name or location. Because PLMs have been trained on a diverse corpus of NL, including program code, the format for plans follows syntactical rules similar to those of a generic programming language:

Action0(arg0[,arg1]); Action1(arg0[,arg1]); ...

The sequence between the special tokens <BOS> and <EOS> can be extracted to retrieve the plan from the LM-generated output.
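To make the syntax concrete, the following minimal Python sketch assembles a training string in this format and recovers a plan from generated text. The helper names and the parsing by ';' are assumptions for illustration, not the released implementation:

import re

SEP, BOS, EOS = "<SEP>", "<BOS>", "<EOS>"

def build_sample(goal, plan, context=None):
    """Assemble one training string: Goal [<SEP> Context] <BOS> Plan <EOS>."""
    parts = [goal]
    if context is not None:      # the Context block is optional
        parts += [SEP, context]
    parts += [BOS, " ".join(plan), EOS]
    return " ".join(parts)

def extract_plan(generated):
    """Return the plan steps found between <BOS> and <EOS>."""
    match = re.search(re.escape(BOS) + r"(.*?)" + re.escape(EOS), generated, re.S)
    if match is None:
        return []                # no complete plan was generated
    return [step.strip() for step in match.group(1).split(";") if step.strip()]

sample = build_sample("Put the soap into the drawer:",
                      ["GotoLocation(countertop);", "PickupObject(soap);",
                       "GotoLocation(drawer);", "PutObject(soap,drawer);"])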
3.5.1 Data augmentation

Each sample in the ALFRED dataset can be replayed in the AI2-THOR simulator to collect additional information not contained in the original dataset. ALFRED provides a script that has been modified for this purpose. Data augmentation is necessary for Graph2NL (cf. §3.6) to generate a graph representation from the environment state. For each replayed sample, the complete list of objects in the scene, with their respective names, positions, and rotations, together with the agent position, is saved to a separate file next to the trajectory data. This file is later loaded and turned into a processable graph.
3.6 Mapping scene graphs to natural language: Graph2NL

PLMs are trained on NL. Because of this, NL is a natural modality for finetuning a PLM. When a context is provided to the LM, this context must be presented in NL just like the input sequence. If the context should encapsulate the environment state, this means that the state has to be transformed into NL before being supplied to the PLM.

Graph2NL is a novel method that "translates" the object-centric scene graph representation of the environment state to NL. Optionally, domain knowledge about the environment³ can be infused into this graph. The following steps describe the core Graph2NL process (a sketch follows the list).

1) Generate an object-scene graph G with a node for the agent and nodes corresponding to objects, node attributes being the position and rotation of the object in Euclidean space, and their respective distance and orientation vectors as edge attributes.
2) (Optional) Infuse domain knowledge about the environment by connecting all dependent nodes and all nodes reachable by the agent.
3) Connect the agent (node) to all reachable nodes, if given domain knowledge, or to all nodes, if not given domain knowledge.
4) Given a task and the identified target object, find all paths in the graph leading from the agent (node) to the target object (node).
5) Use edge attributes in the found paths to describe the task-centric environment state, by mapping geometric relations to NL tokens.

³ Domain knowledge entails every possible room, object, and receptacle name and their allowed relations, as described in the respective documentation: https://ai2thor.allenai.org/ithor/documentation/objects/object-types.
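Under the assumption that the scene graph is held in a standard graph library, steps 1-5 might look as follows. networkx is used here purely for illustration, and geometric_relation (which computes the distance and orientation edge attributes) is sketched in the next code block:

import networkx as nx

def build_scene_graph(objects, agent_pos, domain_edges=None):
    """Steps 1-3: object-scene graph with geometric edge attributes.
    `objects` maps object names to positions from the augmented replay data."""
    g = nx.Graph()
    g.add_node("agent", pos=agent_pos)
    for name, pos in objects.items():
        g.add_node(name, pos=pos)
    # Step 2 (optional): domain-knowledge edges between dependent objects.
    for u, v in (domain_edges or []):
        g.add_edge(u, v, **geometric_relation(objects[u], objects[v]))
    # Step 3: connect the agent node to all (reachable) object nodes.
    for name, pos in objects.items():
        g.add_edge("agent", name, **geometric_relation(agent_pos, pos))
    return g

def paths_to_target(g, target, depth=2):
    """Steps 4-5: paths from the agent to the target with their edge attributes."""
    paths = nx.all_simple_paths(g, "agent", target, cutoff=depth)
    return [[g.edges[a, b] for a, b in zip(path, path[1:])] for path in paths]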
3.6.1 NL mapping

To translate the geometric relations attributed to the graph edges into a NL description, a mapping function is designed. In human speech, distances are expressed by a vocabulary of words such as "close" or "far", and orientations are expressed by words such as "in front" or "behind". Graph2NL adapts this vocabulary to describe the (numeric) distance and orientation of one node relative to another in NL.

TABLE 1 Graph2NL mapping table. Distances are mapped to NL vocabulary (or a symbol) in a one-to-one relation. Yaw describes the orientation along the surface normal when viewed from a top-down perspective, and Pitch describes the z-planar offset (altitude) in relation to the origin.

Distance [m]
Value   NL         Symbol
>5      distant    a
>4      far        b
>3      reachable  c
>2      near       d
>1      close      e
>0.5    closer     f
>0.1    next       g
<0.1    in         h

Yaw [°]
Value       NL     Symbol
45 to 135   right  i
135 to 225  back   j
225 to 315  left   k
315 to 45   front  l

Pitch [°]
Value  NL     Symbol
≥0     above  m
<0     below  n

Table 1 summarizes the mapping used in Graph2NL. The distance between nodes is expressed in Cartesian space and the orientation in polar coordinates, where Yaw is the azimuth angle (rotation along the surface normal) and Pitch is the zenith angle (altitude). With this mapping, the geometric relation between two nodes can be explained by three words (one each for distance, pitch, and yaw). The vocabulary contains 8 words to express the distance, 4 words to express the horizontal, and 2 words to express the vertical orientation. Combinatorially, this gives 64 possible geometric configurations.
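The table translates directly into thresholding functions. The sketch below transcribes Table 1; the coordinate conventions (y as the vertical axis, the atan2 argument order) are assumptions, since the exact implementation is not shown here:

import math

DISTANCE = [(5, "distant", "a"), (4, "far", "b"), (3, "reachable", "c"),
            (2, "near", "d"), (1, "close", "e"), (0.5, "closer", "f"),
            (0.1, "next", "g")]

def map_distance(d):
    for threshold, word, symbol in DISTANCE:
        if d > threshold:
            return word, symbol
    return "in", "h"                        # < 0.1 m

def map_yaw(deg):
    deg %= 360
    if 45 <= deg < 135:  return "right", "i"
    if 135 <= deg < 225: return "back", "j"
    if 225 <= deg < 315: return "left", "k"
    return "front", "l"                     # 315..45, wrapping through 0

def map_pitch(offset):
    return ("above", "m") if offset >= 0 else ("below", "n")

def geometric_relation(p_from, p_to):
    """Distance/yaw/pitch edge attributes between two (x, y, z) positions."""
    dx, dy, dz = (b - a for a, b in zip(p_from, p_to))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    yaw = math.degrees(math.atan2(dx, dz)) % 360   # azimuth, top-down view
    return {"distance": map_distance(dist),
            "yaw": map_yaw(yaw),
            "pitch": map_pitch(dy)}                # sign of the vertical offset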

The geometric relationship is expressed in a condensed form by treating each of these configurations as a relation and assigning a special symbol (token) to each relation. A simple approach, referring to the Symbol column in Table 1, is to assign a symbol to each word. Combining the symbols for distance, pitch, and yaw creates the condensed (three-letter) representation of the geometric relationship. These symbolic representations can optionally be added to the LM tokenizer as special tokens. Shorter token sequences generally decrease both the training and the inference time.

FIGURE 2
Graph2NL example graph. After locating the root ("agent") and target node ("soapbar"), the shortest paths connecting those nodes are found and summarized in NL by mapping all edge attributes along the path.

Example. Let the task be: "Put the soap into the drawer". The input query to Graph2NL consists of the target object "soap". Figure 2 shows the graph constructed by Graph2NL from augmented data (§3.5.1), including domain-specific knowledge. After finding the shortest paths between the root ('agent') and target node ('soapbar'), Graph2NL produces an output in the following form (cut off at search depth 2):

[Bathroom=
- closer below left sink near below back soapbar
- closer below left cabinet near above back soapbar
- closer above left countertop next above back soapbar
- close below back toilet closer below back soapbar
- closer below back garbagecan close below back soapbar]

The NL context produced by Graph2NL starts with the name of the room extracted from the scene graph, followed by the geometric description of each node connected to the target node on the path from the agent. "-" indicates the root node, i.e., the agent. For the
previous example, Graph2NL produces the following condensed form:

[Bathroom=
- fnk sink dnj soapbar
- fnj cabinet dmj soapbar
- fmk countertop gmj soapbar
- enj toilet fnj soapbar
- fnj garbagecan enj soapbar]

This form of state representation is unique for each problem configuration and forms the context that grounds RobLM.
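Assuming the mapping functions sketched above, producing this condensed form amounts to concatenating the single-letter symbols of each relation. The letter order (distance, pitch, yaw) is inferred from the example, e.g., "near below back" becomes "dnj":

def condense(relation):
    """Compose the three-letter token, e.g., 'closer below left' -> 'fnk'."""
    (_, d), (_, y), (_, p) = (relation["distance"], relation["yaw"],
                              relation["pitch"])
    return d + p + y   # distance, pitch, yaw: 'd' + 'n' + 'j' -> 'dnj'

These 64 three-letter codes are the symbols that can optionally be registered as special tokens in the LM tokenizer, as noted in §3.6.1.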
3.7 Training

RobLM generates a plan as text given the goal and the context, which involves causal language modeling for text generation. Decoder-only autoregressive language models (Figure 3) are frequently used for the problem of text generation; we chose GPT-2 as the base model for RobLM. RobLM uses the base version of the GPT-2 PLM ('gpt-2') (Radford et al., 2021), loaded and initialized with pre-trained weights from the Huggingface (Wolf et al., 2019) Transformers library. Finetuning GPT-2 for causal language generation has a self-supervised setup, where the labels are the inputs shifted to the right, which entails learning to predict the next token in a sequence. We finetune the GPT-2 model using the pre-processed training data of the ALFRED dataset, which has around 20,000 samples with three sets of NL descriptions for each sample. The ADAM (Kingma and Ba, 2014) optimizer is used with a learning rate of 5e-5, and the LM is trained for two epochs. Finetuning a GPT-2 LM on the ALFRED training data with a single GPU-accelerated computer takes around 30 min (27 iterations/s; this measurement is not representative due to hardware dependence).

FIGURE 3
Decoder-only Transformer architecture. The input to the decoder is tokenized text, and the output is probabilities over the tokens in the tokenizer vocabulary. The positional encoding is added to the embedded input to account for token order. The Transformer's decoder can have multiple transformer blocks, each of which contains multi-head attention with linear layers and layer normalization.
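A minimal finetuning sketch with the Huggingface library is given below. The dataset loading is a placeholder, but the hyperparameters follow the values stated above (ADAM-style optimizer, learning rate 5e-5, two epochs); the exact training script of RobLM may differ:

from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 defines no pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")

def tokenize(batch):
    # batch["text"] holds 'Goal [<SEP> Context] <BOS> Plan <EOS>' strings;
    # building them from the pre-processed ALFRED data is omitted here.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# train_dataset = alfred_dataset.map(tokenize, batched=True)   # placeholder

# The collator shifts the labels internally, which yields the causal-LM
# objective (labels are the inputs shifted to the right).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir="roblm", num_train_epochs=2,
                         learning_rate=5e-5, per_device_train_batch_size=4)
# Trainer(model=model, args=args, train_dataset=train_dataset,
#         data_collator=collator).train()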
3.8 Generation pipeline

For inference, RobLM takes only the NL task goal together with an optional context and outputs the complete step-by-step plan for completing the goal. This plan is composed of high-level instructions rather than low-level controller commands.

Example. Given the task "Put the soap into the drawer:", RobLM (no context) generates the plan:

Put the soap into the drawer:
0.GotoLocation(countertop)
1.PickupObject(soap)
2.GotoLocation(drawer)
3.PutObject(soap,drawer)

The plan is generated by consecutive forward passes through the Transformer model. For a vocabulary size k and a token sequence of length l (with l ≤ 1,024 for GPT-2), the forward pass of the Transformer yields an output of size k × l. The Transformer outputs scores, i.e., logits, for each token in the input sequence. These scores are converted to
a probability distribution p(x), with values in the interval [0, 1], by using the Softmax function, as described in Eq. 2.

Two possible generation strategies for selecting the next token from p(x) are greedy search and top-k/top-p sampling (Holtzman et al., 2020). In the greedy strategy, the token x_sel with the highest likelihood is picked, i.e., x_sel = arg max p(x). In the top-k sampling strategy, as the name suggests, the scores are sorted, and one of the first k candidate tokens is randomly sampled. By extending top-k sampling with an additional top-p strategy, the summed probability of the candidates must be equal to or greater than p ∈ [0, 1]. Simply put, top-k widens the choice over the next tokens and top-p filters out low-probability tokens. Figure 4 illustrates, on the basis of an example, a forward pass through the Transformer with a greedy selection strategy. These steps are repeated recursively until an end-of-text token is encountered or the defined sequence limit is reached, generating the full plan.
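Both strategies map directly onto the Huggingface generation API. A sketch, assuming the model and tokenizer from the finetuning step and an illustrative prompt in the RobLM format:

prompt = "Put the soap into the drawer: <BOS>"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy search: always pick the arg-max token.
greedy = model.generate(**inputs, do_sample=False, max_length=200)

# Top-k/top-p sampling: restrict the choice to the k most likely tokens,
# then to the smallest subset whose probabilities sum to at least p
# (k = 10 and p = 0.9 are the values used in Section 4.2.8).
sampled = model.generate(**inputs, do_sample=True, top_k=10, top_p=0.9,
                         max_length=200)

plan_text = tokenizer.decode(greedy[0], skip_special_tokens=False)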
FIGURE 4
Illustration of a forward pass through RobLM for text generation with a greedy next-token selection strategy. The forward passes are repeated in a recursive manner until an end-of-text token is encountered or the defined sequence limit is reached.

This LM model was finetuned to generate a structured output, omitting special tokens, characterized by numbered actions and their arguments in parentheses. The input is always part of the output, due to the generation function utilized by RobLM. Note that it is not guaranteed that the 'soap' can be found inside the 'drawer' on the 'countertop'. In fact, it could be at any possible location permitted by the environment. However, given a greedy search strategy, for the given task goal, the likelihood of the 'soap' being on the 'countertop' is the highest in this case.

3.8.1 Hardware setup
For finetuning LMs and evaluating each model, we used the Lichtenberg Cluster of TU Darmstadt, which contains stacks of NVIDIA A100 and V100 GPUs. Internal tests have shown that a single GPU can decrease the training time by a factor of 10 (these tests are not representative because performance depends on every hardware component). To run experiments in the AI2-THOR simulation, we used a PC with an NVIDIA RTX 3080Ti GPU.

4 Experiments

4.1 Preliminary analysis for task plan generation with GPT-2 and GPT-3

LLMs can represent knowledge from the data they have been trained on. However, the question remains whether this knowledge can be leveraged to solve planning tasks, i.e., can LLMs reason? This is investigated by comparing the text-generation results of GPT-2 (Zero-Shot Learning (ZSL)) and GPT-3 (Few-Shot Learning (FSL)) for a planning task.

Given an instruction to a household robot, formulate the steps to complete the instruction.
The instruction is: "Put a washed slice of apple on the table."


The results for this task vary greatly between the two LLMs. GPT-2 ('gpt-2' model, 1.5B parameters) completely fails this task and produces an output that resembles a repetition of phrases in the input sentence:

Put a washed slice of apple on the table.
Put a washed slice of apple on the table.
Put a washed slice of apple on the table.
...

Similar behavior has been observed for other LMs falling into the ZSL category. The input sentence is not an open-ended question and requires reasoning.

GPT-3 ('text-davinci-002' model, 175B parameters), when given the same instruction as input, is able to make assumptions and formulate a whole plan based on these assumptions:

Assuming the apple is not cut and not washed.
1. Pick up the apple from the counter.
2. Cut the apple into a slice.
3. Wash the apple slice.
4. Place the apple slice on the table.

The FSL paradigm allows GPT-3 to be very sensitive to context changes and seemingly understand the request at hand. However, smaller GPT-3 PLMs (GPT-3 curie, GPT-3 babbage, GPT-3 ada) show a degraded quality in the produced plan. Floridi and Chiriatti (2020) have shown that GPT-3 would not pass the Turing test, as it has "no understanding of the semantics and contexts of the request, but only a syntactic (statistical) capacity to associate words [...]".

These tests have shown that the plan generation capabilities of LLMs vary dramatically depending on the underlying learning paradigm, model architecture, and parameter size. GPT-2, out of the box, is completely unsuited for solving planning tasks that require a minimum level of text understanding. However, as described in §3.5, GPT-2 can successfully generate plans when finetuned on a training dataset (§3.2). The question of whether a finetuned GPT-2 model can leverage knowledge for planning is addressed in the following section. GPT-3, unfortunately, is only accessible through a paid service by OpenAI, although finetuning one's own GPT-3 models is possible through the provided service. Practical applications, however, are limited because each query has to be sent to and processed by the OpenAI service. Even if a PLM were made available, the hardware requirements for running GPT-3 models are immense, even by today's standards, due to the sheer parameter count. It is for these reasons that GPT-3 and its newest versions are not considered as a basis for finetuning to RobLM.

4.2 Evaluation of RobLM

This section presents the main experiments conducted for the evaluation of RobLM. We first define the appropriate metrics and a baseline method required to make the evaluations measurable and comparable. The grounding problem is explained in accordance with the practical aspects of integrating the available methods into the simulator. For the experimentation part, a set of finetuned LLMs is compared with the baseline performance.

4.2.1 Metrics
To validate a finetuned LM, only the NL task goal of each validation sample and, optionally, the context is fed to the RobLM generation pipeline (see Figure 4). Validation is performed over each task category rather than all the validation data. This enables the analysis of task-dependent performance: some task categories are more complex than others, leading to a longer trajectory of actions and hence an increased difficulty. Two metrics are defined for validation: LM accuracy and plan success rate.

Definition—Accuracy. Accuracy measures how accurately the LM is able to predict the following parts of the plan:

• the correct count and names of all actions in the plan (action accuracy)
• the correct count and names of all arguments in the plan (argument accuracy)
• the correct count and names of all actions and arguments in the plan ("full plan" accuracy)

For a found plan, the accuracy of actions and arguments counts only if all actions or all arguments are correct. With this metric, it is possible to anchor the cause of plan failure to either the actions or the arguments, or both.

Having an accurate LM does not necessarily mean that the generated plan leads to success—at least as long as the "full plan" accuracy is below 1.0, i.e., the trajectory is not replicated perfectly. A second metric is required that measures the actual success rate of the finetuned LM in simulation. There are two possible scenarios that justify this additional metric. First, the plan could fail in simulation, even if it seems accurate. And second, the plan could succeed in simulation, even if the plan is not completely accurate.

Definition—Success rate. The success rate is a measure of the successful completion of individual sub-tasks of a validation task. After loading the trajectory, environment state, and goal from the validation sample into the AI2-THOR simulator, the actions predicted by the LM are translated into low-level controller actions via task and geometric grounding (§4.2.3), which are then passed to the AI2-THOR controller and executed in the simulator. After every simulator step, a check is performed to determine whether the target conditions for sub-task completion have been met. If the target conditions are satisfied after the execution of the last low-level action, it counts as a success for the sub-task; otherwise, as a failure.
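A minimal sketch of how the accuracy metrics might be computed, assuming plans are represented as lists of (action, arguments) tuples; the bookkeeping in the actual evaluation code may differ:

def plan_accuracy(pred, truth):
    """All-or-nothing accuracies for one plan: actions, arguments, full plan."""
    actions_ok = [a for a, _ in pred] == [a for a, _ in truth]
    args_ok = [args for _, args in pred] == [args for _, args in truth]
    return {"action": float(actions_ok),
            "argument": float(args_ok),
            "full_plan": float(actions_ok and args_ok)}

# Averaging over all validation samples of one task category yields the
# per-category scores reported in the experiments:
# scores = [plan_accuracy(p, t) for p, t in validation_pairs]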
4.2.2 Baseline
A baseline is an oracle, or upper bound, that serves as a measurement reference. Fast Downward (FD) (Helmert, 2006) is used as the baseline for evaluation. We consider a classical task planner like FD appropriate since it also has access to the full domain and is a complete algorithm (Helmert, 2006). Therefore, the ability of RobLM to match or outperform FD (for a given time budget) would reveal whether LMs can be helpful towards learning task planning. Every ALFRED validation sample comes with a PDDL problem file, while the PDDL domain is shared by all tasks; this allows the PDDL planner to generate a plan for each sample. To generate a plan using FD, the PDDL problem files provided by ALFRED have to be pre-processed. FD is able to handle Action Description Language (ADL) instructions, as found in the PDDL problem, but is not able to process the optimization-related additional information present in the files.


FIGURE 5
Prediction accuracy of actions and arguments for previously unseen data across a set of tasks. None of the RobLM models outperforms the baseline
(blue), but all show high accuracy in the prediction of plan actions. Context-driven models (green, red, and purple) perform better than the model
without any scene-related context (orange).

4.2.3 Instruction grounding
Grounding can be defined as mapping a high-level, abstract, or symbolic representation to a low-level grounded representation. Grounding of an abstract plan to objects is called object or geometric grounding (or "world grounding"), and grounding of NL to robot tasks is called task grounding. In this case, instructions generated by the LM are made up of actions, which require task grounding, and arguments, which require geometric grounding.

4.2.4 Task grounding
Plans generated by RobLM consist of high-level actions and are not directly executable by the AI2-THOR controller. Each possible action predicted by the LM has to be grounded to a task, which then translates to a sequence of low-level controller actions. For task grounding, three possible types of tasks are defined: navigation, manipulation, and composite. In a navigation task, the agent is required to move from one location to another. In a manipulation task, the agent performs an action affecting the environment state. Composite tasks are a composition of manipulation tasks that need to be completed in a specific order.

Task grounding is performed as follows.

• The action GotoLocation is grounded to the navigation task and delegated to a trajectory planner for navigation (see below).
• The actions PickupObject, PutObject, ToggleObject and SliceObject are grounded to the manipulation task; these actions can be directly executed by the low-level controller.
• The actions HeatObject, CoolObject and CleanObject are grounded to the composite task, which is translated to a sequence of low-level actions, e.g., ToggleObject → PutObject → ToggleObject → ToggleObject → PickupObject → ToggleObject.

4.2.5 Geometric grounding
An argument can be either a location or an object name. An argument produced by the LM might be ambiguous or non-existent in the environment. In order to be understood by the controller, these arguments have to be grounded on a geometric level. For grounding arguments, first, all available objects are retrieved from the simulation. Then, the world coordinates of all objects matching the predicted symbol (target object) are gathered. E.g., if the predicted target object is 'soap', the positions of all 'soap'-type objects can be queried and retrieved from the simulator. The low-level control commands are finally generated with the help of the ground-truth navigation graph of the scene.

4.2.6 Navigation
By overlaying the world with a grid, every position in the world is given a discrete coordinate. A navigation graph (not to be confused with a scene graph or Graph2NL graph) creates a node for each coordinate and connects all the nodes that are accessible from one another. Similar to the procedure of Graph2NL (§3.6), the navigation graph is traversed after locating the agent and the target node by the object name. A search algorithm, in this case A* (Duchoň et al., 2014), is used to find the shortest path in the graph from the agent to the target object. The search returns a sequence of nodes, which corresponds to a sequence of coordinates (a trajectory). Lastly, a motion planner takes the trajectory as input and outputs a sequence of low-level controller actions (AI2-THOR conveniently provides a motion planner for navigation).
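The navigation procedure can be reproduced with a grid graph and an off-the-shelf A* search. The sketch below uses networkx with a Manhattan-distance heuristic; both the grid construction and the heuristic are assumptions, since AI2-THOR ships its own navigation utilities:

import networkx as nx

def navigate(free_cells, start, goal):
    """Shortest grid path from the agent cell to the target cell via A*."""
    g = nx.Graph()
    for (x, y) in free_cells:
        for nxt in ((x + 1, y), (x, y + 1)):    # 4-connected grid
            if nxt in free_cells:
                g.add_edge((x, y), nxt)
    manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    # The returned node sequence is the trajectory that the motion planner
    # turns into low-level controller actions.
    return nx.astar_path(g, start, goal, heuristic=manhattan)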


4.2.7 Experimental results
A set of finetuned RobLM models is evaluated against the baseline. The finetuned models differ in the amount of context provided at training time:

1) 'No context' — only the task goal
2) 'Scene knowledge' — a list of all available objects in the environment, found in the PDDL problem
3) 'Scene graph' — a description of the geometric relations to the target object, generated by Graph2NL
4) 'Full context' — a description of the geometric relations of all objects, generated by Graph2NL

Given a PDDL problem file, Graph2NL automatically generates the context in the specified text format. This context is provided to the LM for training and inference.

Accuracy. Figure 5 summarizes the evaluation of the finetuned RobLM models compared to the FD baseline for previously unseen (validation) data. It can be observed that none of the finetuned RobLM models is able to outperform the baseline. Going through each of the models and starting with the 'No context' model, it is surprising that this model, even without any contextual information, is able to generate the correct plan actions with high accuracy. The 'Scene knowledge' and 'Scene graph' models have a similar performance, the 'Scene graph' model generally being slightly more accurate in both action and argument prediction. Both of these models overall outperform the 'No context' model, with a significant improvement of the context models in argument prediction.

Given these results, the following conclusions about the examined models can be made.

1) Failed plans are mostly caused by wrong arguments (objects or locations) and only in some cases by wrong actions.
2) The LM is able to learn the structure of tasks, but not scene-dependent components.

RobLM is able to distinguish between the task categories and provide a correct task action plan. However, where this model fails is in finding all correct action arguments, i.e., locations and object names. This can be explained by the fact that the task goal alone does not reveal the actual location of the target object. Because the target can be in any accessible location in the environment, or in any accessible receptacle, the produced argument is the result of the LM imitating the most likely cases observed in the training data.

Overall, these results are consistent with the point made on contextual information and prediction accuracy: giving the model information about the environment, i.e., finetuning a model to be grounded to the scene, does improve performance.

Success rate. A plan is successful if each of the sub-tasks of a stated task is completed. Table 2 summarizes the success rates of RobLM compared to the baseline for actions of the navigation task (GotoLocation) and the manipulation task (PickupObject, PutObject, etc.). Composite tasks have been omitted from this evaluation because of high failure rates caused by their task grounding complexity.

TABLE 2 Success rates of sub-task completion in simulation—RobLM ('No context') compared to the baseline on seen and unseen validation data. Bold represents maximum values.

              Baseline          RobLM ('no context')
Task          seen    unseen    seen    unseen
GotoLocation  0.318   0.393     0.422   0.499
PickupObject  0.466   0.474     0.776   0.749
PutObject     0.385   0.331     0.116   0.092
SliceObject   0.629   0.5       0.94    0.98
ToggleObject  0       0         0.84    0.864

Regarding geometric grounding, arguments predicted by RobLM are grounded to all matching objects in the world, and RobLM is allowed to "try" all possibilities. E.g., when the objective is to "Get soap", multiple 'soap'-type objects could exist in the scene. Each possibility given by the geometric grounding is simulated by storing and restoring the simulator state. While this is a clear advantage for RobLM over the baseline, the evaluation still holds because the LM is required to predict the correct location or object names. Based on the presented results, the LM-based system performs well on sub-tasks requiring the action PickupObject, while the action PutObject does not succeed equally well, being far from the baseline performance.

Overall, the success rate of the baseline method is not nearly as high as expected, hinting at potential implementation-specific failures in the task grounding and in the low-level controller interaction with objects. In the low-level controller, visual information is not included. This means that the robot is controlled in a "blind flight" mode. The AI2-THOR simulation requires the target object to be in view. If the object is not visible, e.g., because the agent is looking in the wrong direction, the interaction fails and, with it, the sub-task. Because both systems have been evaluated within the same framework, these results do not dismiss a potential use case for LMs in planning.

4.2.8 Additional results
We provide additional experiments for a deeper analysis of potential points of failure of RobLM. These experiments entail a different sampling strategy and context refinement.

TABLE 3 Top-k and top-p sampling (k = 10 and p = 0.9)—tokens are sampled three times for the 'Pick Simple' task, giving only slight deviations in the final accuracy.

RobLM 'no context' model, 'pick simple' task
Accuracy of         1st sample   2nd sample   3rd sample
Actions             0.7746       0.8169       0.7817
Arguments           0.3803       0.4085       0.3944
GotoLocation        0.8772       0.9123       0.8904
PickupObject        0.8380       0.8662       0.8451
PutObject           0.7971       0.8227       0.7986
GotoLocation_Args   0.5658       0.5877       0.5833
PickupObject_Args   0.7535       0.8028       0.7746
PutObject_Args      0.6449       0.6667       0.6331


FIGURE 6
Prediction accuracy of actions and arguments for unseen tasks of RobLM with a refined context. This experiment compares a model finetuned to the
task goal and a context consisting of a NL description of the first plan instruction (green) with the baseline (blue) and a RobLM with ‘No context’ model
(orange).

4.2.8.1 Top-k/top-p sampling
So far, every experiment conducted has used a greedy next-token selection strategy. In order to be able to tell with certainty that a found plan is the "best possible" plan, a comparison with another sampling strategy is required. This additional experiment repeats the previous one, but this time with a top-k and top-p sampling strategy. The comparison is done with the 'No context' RobLM model for all tasks with k = 10 and p = 0.9, i.e., tokens are sampled from the top-10 predictions and sum up to a probability ≥ 0.9. Since a similar pattern was observed in the individual task evaluations, Table 3 reports the results for the 'Pick Simple' task only. Each token is sampled three times, giving three possible solutions to be evaluated. Slight, uniform, and hence dismissible variations in the prediction accuracy exist between these three runs. The sampling-based method performs slightly worse than the greedy strategy.

4.2.8.2 Refined context
In a deeper analysis of RobLM failure cases, it has been found that the first argument in the generated plan is the hardest for the LM to predict correctly. The LM is not able to draw enough conclusions about the first instruction from the supplied context of any form. This becomes obvious with the following experiment: given the task goal and a NL description of the first instruction as context, how does the overall accuracy of the LM change? The following text is an example of an instruction description in NL, as found in the ALFRED dataset:

Turn left and walk across the room towards the shelves on the wall.

The results in Figure 6 show that, given this extra information, RobLM is almost able to reach the performance levels of the baseline measurement across all tasks; it shows very high accuracy on "full plan" actions and arguments. The conclusion of this experiment is that the more precisely the supplied context is tailored towards the key issue of the LM generation task, the more accurate the generated plan becomes. For this specific problem, finding the correct first argument is key to a successful plan, and with a NL description of the first instruction, the LM is able to draw the necessary connections from context to plan.

The overall conclusion of this observation is that LMs are adaptive; the LM is able to incorporate new information into the plan generation, towards a more accurate sequence of instructions.

4.3 Run-time analysis

Inference frequency is an important factor when it comes to real-life applications. This is especially true for industrial robotics, where cycle times are important. But not every robotic application is time-critical; e.g., a household robot is not expected to respond in sub-second time. However, if task planning is seen as a programming problem, a fast execution time greatly enhances the operator experience. Table 4 shows a comparison of the inference speeds of RobLM against the baseline (FD). RobLM, in all cases, is slower compared to the baseline, which is likely due to the reliance on the full GPT-2 vocabulary size for the LM tokenizer and the usage of an LM-internal, implementation-specific generation function⁴.

⁴ Huggingface generation function: 'transformers/src/transformers/generation_utils.py'.


Such an issue can be mitigated by training a new tokenizer on the task-specific vocabulary, but this comes at the cost of not utilizing the stored knowledge in the PLM. However, current progress in language models allows faster inference on more advanced hardware than the one used in this work; therefore, we believe that the frequency limitations can be easily overcome.

TABLE 4 Comparison of inference speeds—RobLM against the baseline. GPU acceleration used for the LM (NVIDIA GeForce RTX 2080 SUPER). The timer starts only after the program or model has been loaded into memory, i.e., only computation (inference) time is measured. "No context" has a maximum token sequence length of 200 and "Full context" has a maximum length of 1,024 tokens for generation.

Iterations per second (average over 800 samples)
Baseline   RobLM 'No context'   RobLM 'Full context'
2.9        1.0                  0.2

Remarks. Overall, our analysis has shown that finetuning PLMs toward robotic task planning is possible when providing an appropriate grounding context. However, we have shown that such models cannot yet reach the planning abilities of classical task planners. A combination of finetuning with a proper scene representation and a more elaborate sampling strategy, as well as the addition of more sophisticated prompts, can boost the performance of RobLMs, leading them to performances that are closer to the oracle task planners. Still, the benefit of providing goal specifications as natural language commands alleviates the burden of engineering, while advances in scene graph generation can make the extraction of domain specifications autonomous. Therefore, we believe that using RobLMs at a higher level of abstraction for neuro-symbolic task planning is valuable but still in its infancy. Additional challenges have been recently summarized by Weng (2023), where some of the listed points are in accordance with our findings.

5 Conclusion

We presented a framework for finetuning grounded Large Language Models (LLMs) and investigated the applicability of such models, combined with planning, in solving long-horizon robot reasoning tasks. This paper has shown that LLMs can extract commonsense knowledge through precise queries and adjust their behavior based on available information or context. Among our contributions are the development of RobLM, a grounded finetuned LLM that generates plans directly from natural language commands, and Graph2NL, which creates natural language text describing graph-based data, to represent scene graphs as inputs to RobLM. Our extensive experimental results have revealed, nevertheless, the challenges in representing structured and geometric data in natural language. However, LLMs still need to demonstrate a consistent ability to perform long-horizon planning tasks and cannot yet replace classical planners. Despite their limitations, LLMs possess powerful features, such as efficient storage and retrieval of commonsense knowledge, which can be useful in planning tasks when presented with partially observable environments.

For future work, exploring larger models like GPT-3 or GPT-NeoX could increase the accuracy and success rate of RobLM. Providing structured context to the Transformer model and exploring multi-modal inputs, such as visual information, may also improve the planning capabilities of LLMs. Further research in the field of applied natural language processing in robotics could help unlock the full potential of LLMs and contribute to the development of more advanced neuro-symbolic planning systems.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/dnandha/RobLM.

Author contributions

GC, AY, and DN contributed to the conception and design of the method. AL assisted in the setup of the baseline method. LR provided insights for the training and fine-tuning of language models and advice on the linearization of the graph. IG gave advice on the overall method and the idea of using language models for planning. GC wrote the current version of the manuscript, building on the initial write-up by DN. AY assisted in writing and visualizations. All authors contributed to the article and approved the submitted version.

Funding

This research has been supported by the German Research Foundation (DFG) through the Emmy Noether Programme (CH 2676/1-1) and by the Hessian.AI Connectom Fund "Robot Learning of Long-Horizon Manipulation bridging Object-centric Representations to Knowledge Graphs". We acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) and the Open Access Publishing Fund of the Technical University of Darmstadt.

Acknowledgments

The authors would like to acknowledge Snehal Jauhri and Haau-Sing Li for the fruitful discussions and suggestions.

Conflict of interest

Author LR was employed by Amazon Alexa.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References
Aeronautiques, C., Howe, A., Knoblock, C., McDermott, I. D., Ram, A., Veloso, M., et al. (1998). Pddl – the planning domain definition language. Tech. rep.

Bengio, Y., Ducharme, R., and Vincent, P. (2000). A neural probabilistic language model. Adv. Neural Inf. Process. Syst. 13.

Bian, N., Han, X., Sun, L., Lin, H., Lu, Y., and He, B. (2023). Chatgpt is a knowledgeable but inexperienced solver: an investigation of commonsense problem in large language models. Available at: https://arxiv.org/abs/2303.16421.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models. Available at: https://arxiv.org/abs/2108.07258.

Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., et al. (2022). Do as i can, not as i say: grounding language in robotic affordances. Available at: https://arxiv.org/abs/2204.01691.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Available at: https://arxiv.org/abs/2005.14165.

Chen, B., Xia, F., Ichter, B., Rao, K., Gopalakrishnan, K., Ryoo, M. S., et al. (2022a). Open-vocabulary queryable scene representations for real world planning. Available at: https://arxiv.org/abs/2209.09874.

Chen, W., Hu, S., Talak, R., and Carlone, L. (2022b). Leveraging large language models for robot 3d scene understanding. Available at: https://arxiv.org/abs/2209.05629.

Driess, D., Ha, J.-S., and Toussaint, M. (2020). Deep visual reasoning: learning to predict action sequences for task and motion planning from an initial scene image. Available at: https://arxiv.org/abs/2006.05398.

Driess, D., and Toussaint, M. (2019). Hierarchical task and motion planning using logic-geometric programming (hlgp).

Duchoň, F., Babinec, A., Kajan, M., Beňo, P., Florek, M., Fico, T., et al. (2014). Path planning with modified a star algorithm for a mobile robot. Procedia Eng. 96, 59–69. doi:10.1016/j.proeng.2014.12.098

Floridi, L., and Chiriatti, M. (2020). Gpt-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694. doi:10.1007/s11023-020-09548-1

Funk, N., Chalvatzaki, G., Belousov, B., and Peters, J. (2021). Learn2assemble with structured representations and search for robotic architectural construction. Conf. Robot Learn. (CoRL).

Funk, N., Menzenbach, S., Chalvatzaki, G., and Peters, J. (2022). Graph-based reinforcement learning meets mixed integer programs: an application to 3d robot assembly discovery. Available at: https://arxiv.org/abs/2203.04120.

Garrett, C. R., Chitnis, R., Holladay, R., Kim, B., Silver, T., Kaelbling, L. P., et al. (2021). Integrated task and motion planning. Available at: https://arxiv.org/abs/2010.01083.

Garrett, C. R., Lozano-Pérez, T., and Kaelbling, L. P. (2020). Pddlstream: integrating symbolic planners and blackbox samplers via optimistic adaptive planning. Available at: https://arxiv.org/abs/1802.08705.

Helmert, M. (2006). The fast downward planning system. J. Artif. Intell. Res. 26, 191–246. doi:10.1613/jair.1705

Hoang, C., Sohn, S., Choi, J., Carvalho, W., and Lee, H. (2021). Successor feature landmarks for long-horizon goal-conditioned reinforcement learning. Adv. Neural Inf. Process. Syst. 34, 26963–26975.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). The curious case of neural text degeneration. Available at: https://arxiv.org/abs/1904.09751.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., et al. (2019). Parameter-efficient transfer learning for nlp. Available at: https://arxiv.org/abs/1902.00751.

Huang, C., Mees, O., Zeng, A., and Burgard, W. (2022a). Visual language maps for robot navigation. Available at: https://arxiv.org/abs/2210.05714.

Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. (2022b). Language models as zero-shot planners: extracting actionable knowledge for embodied agents. Available at: https://arxiv.org/abs/2201.07207.

Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., et al. (2022c). Inner monologue: embodied reasoning through planning with language models. Available at: https://arxiv.org/abs/2207.05608.

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., et al. (2022). Vima: general robot manipulation with multimodal prompts. In Proceedings of the NeurIPS Foundation Models for Decision Making Workshop, New Orleans, USA, December 2022.

Kaelbling, L. P., and Lozano-Pérez, T. (2011). Hierarchical task and motion planning in the now. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, May 2011, 1470–1477.

Kim, B., Wang, Z., Kaelbling, L. P., and Lozano-Pérez, T. (2019). Learning to guide task and motion planning using score-space representation. IJRR 38, 793–812. doi:10.1177/0278364919848837

Kingma, D. P., and Ba, J. (2014). Adam: A method for stochastic optimization. Available at: https://arxiv.org/abs/1412.6980.

Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., et al. (2017). Ai2-thor: an interactive 3d environment for visual ai. Available at: https://arxiv.org/abs/1712.05474.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444. doi:10.1038/nature14539

Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., et al. (2022). Pre-trained language models for interactive decision-making. Adv. Neural Inf. Process. Syst. 35, 31199–31212.

Li, X. L., Kuncoro, A., d'Autume, C. d. M., Blunsom, P., and Nematzadeh, A. (2021). Do language models learn commonsense knowledge? Available at: https://arxiv.org/abs/2111.00607.

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., et al. (2022). Code as policies: language model programs for embodied control. Available at: https://arxiv.org/abs/2209.07753.

Mees, O., Borja-Diaz, J., and Burgard, W. (2022). Grounding language with visual affordances over unstructured data. Available at: https://arxiv.org/abs/2210.01911.

Nair, S., and Finn, C. (2019). Hierarchical foresight: self-supervised learning of long-horizon tasks via visual subgoal generation. Available at: https://arxiv.org/abs/1909.05829.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. Available at: https://arxiv.org/abs/2203.02155.

Pashevich, A., Schmid, C., and Sun, C. (2021). Episodic transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, October 2021, 15942–15952.

Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., and Gurevych, I. (2021). Adapterfusion: non-destructive task composition for transfer learning. Available at: https://arxiv.org/abs/2005.00247.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). "Learning transferable visual models from natural language supervision," in International conference on machine learning (PMLR), 8748–8763.

Raman, S. S., Cohen, V., Rosen, E., Idrees, I., Paulius, D., and Tellex, S. (2022). Planning with large language models via corrective re-prompting. Available at: https://arxiv.org/abs/2211.09935.

Ren, T., Chalvatzaki, G., and Peters, J. (2021). Extended task and motion planning of long-horizon robot manipulation. Available at: https://arxiv.org/abs/2103.05456.

Ruis, L., Khan, A., Biderman, S., Hooker, S., Rocktäschel, T., and Grefenstette, E. (2022). Large language models are not zero-shot communicators. Available at: https://arxiv.org/abs/2210.14986.

Shah, D., Osiński, B., and Levine, S. (2023). "Lm-nav: robotic navigation with large pre-trained models of language, vision, and action," in Conference on robot learning (PMLR), 492–504.

Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., et al. (2020). Alfred: A benchmark for interpreting grounded instructions for everyday tasks. Available at: https://arxiv.org/abs/1912.01734.


Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., et al. (2022). Progprompt: generating situated robot task plans using large language models. Available at: https://arxiv.org/abs/2209.11302.

Tay, Y., Wei, J., Chung, H. W., Tran, V. Q., So, D. R., Shakeri, S., et al. (2022). Transcending scaling laws with 0.1% extra compute. Available at: https://arxiv.org/abs/2210.11399.

Toussaint, M. (2015). Logic-geometric programming: an optimization-based approach to combined task and motion planning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, July 2015.

Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. (2022). Large language models still can't plan (a benchmark for llms on planning and reasoning about change). Available at: https://arxiv.org/abs/2206.10498.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Available at: https://arxiv.org/abs/1706.03762.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., et al. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. Adv. Neural Inf. Process. Syst. 32.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. Available at: https://arxiv.org/abs/1804.07461.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., et al. (2022a). Emergent abilities of large language models. Available at: https://arxiv.org/abs/2206.07682.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E. H., et al. (2022b). Chain-of-thought prompting elicits reasoning in large language models. Available at: https://arxiv.org/abs/2201.11903.

Wells, A. M., Dantam, N. T., Shrivastava, A., and Kavraki, L. E. (2019). Learning feasibility for task and motion planning in tabletop environments. IEEE RA-L 4, 1255–1262. doi:10.1109/lra.2019.2894861

Weng, L. (2023). Llm-powered autonomous agents. lilianweng.github.io.

White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., et al. (2023). A prompt pattern catalog to enhance prompt engineering with chatgpt. Available at: https://arxiv.org/abs/2302.11382.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al. (2019). Huggingface's transformers: state-of-the-art natural language processing. Available at: https://arxiv.org/abs/1910.03771.

Xu, L., Ren, T., Chalvatzaki, G., and Peters, J. (2022). Accelerating integrated task and motion planning with neural feasibility checking. Available at: https://arxiv.org/abs/2203.10568.

Zeng, A., Wong, A., Welker, S., Choromanski, K., Tombari, F., Purohit, A., et al. (2022). Socratic models: composing zero-shot multimodal reasoning with language. Available at: https://arxiv.org/abs/2204.00598.

Zhou, X., Zhang, Y., Cui, L., and Huang, D. (2020). Evaluating commonsense in pre-trained language models. Proc. AAAI Conf. Artif. Intell. 34, 9733–9740. doi:10.1609/aaai.v34i05.6523
