ChatGPT for Robotics: Design Principles and Model Abilities
ABSTRACT This paper presents an experimental study regarding the use of OpenAI’s ChatGPT for robotics
applications. We outline a strategy that combines design principles for prompt engineering and the creation
of a high-level function library which allows ChatGPT to adapt to different robotics tasks, simulators, and
form factors. We focus our evaluations on the effectiveness of different prompt engineering techniques and
dialog strategies towards the execution of various types of robotics tasks. We explore ChatGPT’s ability
to use free-form dialog, parse XML tags, and to synthesize code, in addition to the use of task-specific
prompting functions and closed-loop reasoning through dialogues. Our study encompasses a range of tasks
within the robotics domain, from basic logical, geometrical, and mathematical reasoning all the way to
complex domains such as aerial navigation, manipulation, and embodied agents. We show that ChatGPT
can be effective at solving several such tasks, while allowing users to interact with it primarily via
natural language instructions. In addition to these studies, we introduce an open-sourced research tool
called PromptCraft, which contains a platform where researchers can collaboratively upload and vote on
examples of good prompting schemes for robotics applications, as well as a sample robotics simulator with
ChatGPT integration, making it easier for users to get started with using ChatGPT for robotics. Videos and blog: aka.ms/ChatGPT-Robotics. PromptCraft and AirSim-ChatGPT code: https://github.com/microsoft/PromptCraft-Robotics
INDEX TERMS Large language models, robotics, language understanding, code generation, perception.
[...] translate the intent into a logical sequence of physical actions.

In recent years there have been different attempts to incorporate language into robotics systems. These efforts have largely focused on using language token embedding models, LLM features, and multi-modal model features for specific form factors or scenarios. Applications range from visual-language navigation [9], [10], [11] and language-based human-robot interaction [12], [13] to visual-language manipulation control [14], [15], [16]. However, despite the potential advantages of using LLMs in robotics, most existing approaches are restricted by a rigid scope and a limited set of functionalities, or by their open-loop nature, which does not allow for fluid interactions and behavior corrections from user feedback.

Large language models also show promise in zero-shot robotics scenarios when tasked with high-level agent planning [17], [18] or code generation [19], [20]. These early demonstrations inspired us to investigate ChatGPT as a potentially more versatile tool for the robotics domain, as it incorporates the strengths of natural language and code generation models along with the flexibility of dialogue. ChatGPT's ability to engage in free-form dialog and capture long context allows users to interact with the model in a more natural fashion, with flexible behavior correction. At the same time, ChatGPT's extensive knowledge of mathematical constructs, geometry, and common sense affords it a much higher ability to understand and reason about the physical world than older LLM-based approaches, which mainly focused on task planning dependent on a fixed set of functions and behaviors.

FIGURE 1. Current robotics pipelines require a specialized engineer in the loop to write code to improve the process. Our goal with ChatGPT is to have a (potentially non-technical) user on the loop, interacting with the language model through high-level language commands, and able to seamlessly deploy to various platforms and tasks.

In this paper, we aim to demonstrate the potential of ChatGPT for robotics applications. We outline a key concept that unlocks the ability to solve robotics applications with ChatGPT: the creation of a high-level function library. Given that robotics is a diverse field where several platforms, scenarios, and tools exist, there is an extensive variety of libraries and APIs. Instead of asking LLMs to output code specific to a platform or a library, which might involve extensive finetuning, we instead create a simple high-level function library for ChatGPT to deal with, which can then be linked in the back-end to the actual APIs for the platforms of choice. Thus, we allow ChatGPT to parse user intent from natural dialog and convert it into a logical chaining of high-level function calls. We also outline several prompt engineering guidelines that help ChatGPT solve robotics tasks.

Recent large models such as BLIP-2 [21] and LLaVA [22] have shown that fusing visual and language features into the same representation allows for capabilities such as visual question answering and captioning. While this is a very powerful capability, these models excel at general descriptions of images and, to an extent, reasoning about them. However, robotics tasks often require a more precise and detailed understanding of the environment, such as the ability to extract object locations or to perform geometric reasoning, which potentially requires more specialized models. In this work, we assume that the large language model can be provided with the necessary information about the environment using appropriate tools, and we focus on analyzing how LLMs can use this information to formulate high-level behaviors and the necessary code for solving the task at hand.

Our research shows that ChatGPT is capable of solving various robotics-related tasks in a zero-shot fashion, while adapting to multiple form factors and allowing for closed-loop reasoning through conversation. In addition, we aim to show current model limitations and provide ideas on how to overcome them. Our main contributions are listed below:

• We demonstrate a pipeline for applying ChatGPT to robotics tasks. The pipeline involves several prompting techniques such as free-form natural language dialogue, code prompting, XML tags, and closed-loop reasoning. We also show how users can leverage a high-level function library that allows the model to quickly parse human intent and generate code for solving the problem;
• We experimentally evaluate ChatGPT's ability to execute a variety of robotics tasks. We show the model's capabilities and limitations when solving mathematical, logical, and geometrical operations, and then explore more complex scenarios involving embodied agents, aerial navigation, and manipulation. We include both simulation and real-world experiments that result from ChatGPT's plans;
• We introduce a collaborative open-source platform, PromptCraft, where researchers can work together to provide examples of positive (and negative) prompting strategies when working with LLMs in the robotics context. Prompt engineering is a mostly empirical science, and we want to provide a simple interface for researchers to contribute knowledge as a community. Over time we aim to provide different environments where users can test their prompts, and we welcome new contributions;
• We release a simulation tool that builds on Microsoft AirSim [23] combined with a ChatGPT integration. This AirSim-ChatGPT simulation contains a sample environment for drone navigation and aims to be a starting point for researchers to explore how ChatGPT can enable robotics scenarios.
With this work we hope to open up new opportunities and avenues for future research fusing LLMs and robotics. We believe that our findings will inspire and guide further research in this exciting field, paving the way for the development of new, innovative robotics systems that can interact with humans in a natural, intuitive manner. For more details, we encourage readers to view detailed videos of our experiments on the project webpage.

II. ROBOTICS WITH CHATGPT
Prompting LLMs for robotics control poses several challenges, such as providing a complete and accurate description of the problem, identifying the right set of allowable function calls and APIs, and biasing the answer structure with special arguments. To make effective use of ChatGPT for robotics applications, we construct a pipeline composed of the following steps:
1) First, we define a high-level robot function library. This library can be specific to the form factor or scenario of interest, and should map to actual implementations on the robot platform while being named descriptively enough for ChatGPT to follow;
2) Next, we build a prompt for ChatGPT which describes the objective while also identifying the set of allowed high-level functions from the library. The prompt can also contain information about constraints, or how ChatGPT should structure its responses;
3) The user stays on the loop to evaluate code output by ChatGPT, either through direct analysis or through simulation, and provides feedback to ChatGPT on the quality and safety of the output code;
4) After iterating on the ChatGPT-generated implementations, the final code can be deployed onto the robot.
We show a visual depiction of this pipeline in Figure 2 for the example of a household robot.

Clear names are essential to allow the LLM to reason over functional connections between APIs and produce the desired outcome for the problem. Hence, we can define high-level functions, which act as wrappers over actual implementations from the respective libraries. For example, a function named detect_object(object_name) could internally link to an OpenCV function or a computer vision model, whereas something like move_to(x, y, z) could internally invoke a motion planning and obstacle avoidance pipeline along with the appropriate low-level motor commands for a drone. Listing such a collection of high-level functions in the prompt is key to allowing ChatGPT to create logical sequences of behavioral primitives, and to generalizing to different scenarios and platforms. A minimal sketch of this wrapper structure is shown below.

Depending on the context, we recommend explaining the function of APIs and, if needed, breaking them down into sub-components with clear inputs and outputs, similar to code documentation. In Figure 3 we present an example of a good API prompting strategy for a home cook robot scenario. The strategy presented allows ChatGPT to reason about the order and content of tasks according to the functions the robot is actually able to execute. In contrast, we refer the interested reader to Appendix A-A for an example of how ChatGPT reasons when no API guidance is given, which leads to an unbounded text-based answer, or to Appendix A-B for an example of API under-specification, which leads to hallucinations over function call parameters.

We note that unlike the brittle structure of classical symbolic AI, which required rigid pre-defined relationships between objects and functions, LLMs are capable of defining new functions and concepts altogether when relevant to a particular problem. This capability confers flexibility and robustness to LLMs when dealing with robotics applications. Figure 4 shows how ChatGPT can create new high-level concepts and even low-level code when needed to solve a problem, even fusing existing APIs. The user on the loop can take advantage of this ability as a design strategy, and iteratively define new APIs with the help of the LLM when the current ones are not sufficient to solve the task at hand.
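As a concrete illustration of the wrapper idea described above, the following is a minimal sketch of how such a high-level library could be organized. The detector and goto_waypoint hooks are hypothetical stand-ins for a platform-specific vision model and motion-planning stack; only the wrapper names mirror the examples in the text.

from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class RobotAPI:
    # Hypothetical back-end hooks: on a real platform these would be bound to a
    # computer vision model and a motion planning / low-level control pipeline.
    detector: Callable[[str], Tuple[float, float, float]]
    goto_waypoint: Callable[[Tuple[float, float, float]], None]

    def detect_object(self, object_name: str) -> Tuple[float, float, float]:
        # High-level wrapper listed in the prompt: returns the XYZ of a named object.
        return self.detector(object_name)

    def move_to(self, x: float, y: float, z: float) -> None:
        # High-level wrapper listed in the prompt: moves the robot to a target position,
        # delegating planning and obstacle avoidance to the back-end implementation.
        self.goto_waypoint((x, y, z))

Only the descriptive wrapper names and their documented inputs/outputs are exposed to ChatGPT in the prompt; the bindings to concrete libraries stay on the robot side.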
[...] generate new outputs from scratch. The dialog ability of ChatGPT, however, is a surprisingly effective vehicle for behavior correction. Section III-B and the supplementary videos show examples of interactive behavior between the user and ChatGPT.

[...] reasoning problems all the way to real-world aerial agent and manipulation deployments. We highlight several interesting capabilities we observed during these experiments.

While ChatGPT's capabilities are impressive, practical deployment safety considerations should not be overlooked, especially in the case of physical robotics deployments. As shown in Figure 2, we find it necessary to have a human on the loop to monitor and intervene in case of unexpected behaviors generated by ChatGPT. Additionally, the use of simulators can be particularly helpful to evaluate the model's performance before deployment in the real world.

We emphasize that the use of ChatGPT for robotics is not a fully automated process, but rather acts as a tool to augment human capacity. We provide all initial prompts for the problems in this section in Appendix B. For the sake of brevity we only include brief excerpts from each task. The full conversations can be found in the following repository: https://github.com/microsoft/PromptCraft-Robotics.

A. ZERO-SHOT TASK PLANNING
First, we note that ChatGPT is able to solve several robotics tasks in a zero-shot fashion, with access only to the prompt and the description of a function library, and no requirement of actual code examples.

1) SPATIO-TEMPORAL REASONING: CATCHING A BASKETBALL WITH VISUAL SERVOING
In this example, we ask ChatGPT to control a planar robot equipped with an upward-facing camera. The robot is expected to catch a basketball using a visual servoing method based on the appearance of the ball. We see that ChatGPT is able to appropriately use the provided API functions, reason about the ball's appearance and call relevant OpenCV functions, and command the robot's velocity based on a proportional controller. Even more impressive is the fact that ChatGPT can estimate the appearance of the ball and the sky in the camera image using SVG code. This behavior hints at the possibility that the LLM keeps track of an implicit world model going beyond text-based probabilities.

[Initial model prompt found in Appendix B-A]

Write a python script that executes a visual servoing approach towards catching a basketball in a court. You can use opencv functions to detect the ball as an orange blob.

Here's a sample implementation of the script:

import cv2
import numpy as np
[...]
if __name__ == '__main__':
    catch_ball()

This implementation uses a simple visual servoing approach where the robot moves towards the ball at a constant velocity. The velocity is proportional to the difference between the ball's and robot's locations, so the robot will move faster as it gets closer to the ball. You can adjust the gain (i.e., the multiplication factor of 0.5) to control the speed of the robot.

Generate SVG code that depicts how the ball would look in the camera image exactly in the center

Here's an example of an SVG code that depicts a ball in the center of a camera image:

<svg width="100" height="100">
  <rect x="0" y="0" width="100" height="100" fill="lightgray"/>
  <circle cx="50" cy="50" r="25" fill="orange"/>
</svg>

[SVG image rendered directly from the code generated by ChatGPT]
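The body of the generated script is elided in the excerpt above. For a concrete picture of what this kind of solution could look like, the following is an illustrative reconstruction (not ChatGPT's verbatim output) that uses only the API declared in the Appendix B-A prompt, get_image(), get_location(), and move_by_velocity(), together with standard OpenCV calls. The HSV thresholds and the direct mapping from image coordinates to court coordinates are assumptions.

import cv2
import numpy as np

def catch_ball():
    gain = 0.5  # proportional gain, matching the multiplication factor mentioned in the excerpt
    while True:
        img = get_image()
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        # Threshold an orange blob (HSV bounds are assumptions and would need tuning)
        mask = cv2.inRange(hsv, np.array([5, 100, 100]), np.array([20, 255, 255]))
        M = cv2.moments(mask)
        if M["m00"] == 0:
            continue  # ball not visible yet
        # Ball centroid in image coordinates, treated here as court XY for simplicity
        ball_x = M["m10"] / M["m00"]
        ball_y = M["m01"] / M["m00"]
        robot_x, robot_y = get_location()
        # Proportional controller: velocity proportional to the position error
        move_by_velocity(gain * (ball_x - robot_x), gain * (ball_y - robot_y))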
2) AERIAL ROBOTICS: REAL-WORLD DRONE FLIGHT WITH INTUITIVE HUMAN-ROBOT INTERFACE (CLICK FOR VIDEO)
We also used ChatGPT with a different API implementation to control a real drone. ChatGPT provided an extremely intuitive natural language interface between the user and the robot, writing robot code based on ambiguous and ill-defined instructions, and asking clarification questions when necessary. The model was also able to write complex code structures for drone navigation (circular and lawnmower inspection) based solely on the prompt's base APIs.

[Initial model prompt found in Appendix B-B]

I am thirsty. Find me a drink

Question: What type of drink are you looking for? There are several options in the scene, including "coconut water" and "diet coke can".

I just came back from the gym and am looking for a healthy option.

In that case, you might like "coconut water". Here is the code to fly to the "coconut water" object:

drink_position = self.dict_of_objects["coconut water"]
self.tello.fly_to(drink_position)

Explanation: This code first retrieves the X, Y, Z, and angle position of the "coconut water" object from the dict_of_objects dictionary. The fly_to method is then used to fly the drone to the retrieved position. The drone will fly to the position of the "coconut water" object and look in the direction of the angle.

I am still thirsty. Can you find something with sugar and a red logo?

I'm assuming you are referring to a can of "regular coke". Here is the code to fly to the "regular coke can" object:

drink_position = self.dict_of_objects["regular coke can"]
self.tello.fly_to(drink_position)

Fantastic. Now I want to take a look at my orchid. However, the chair might be blocking its view, so I want to inspect it from 3 points on a half-circle from -60deg to -120, 1m away. Make sure that the drone's heading points towards the orchid from each point.

Here's the code to inspect the orchid from 3 points on a half-circle:

import numpy as np
# Get the position of the orchid
orchid_position = self.dict_of_objects['orchid']
# Define the angle range for the half-circle
angles = np.linspace(-60, -120, num=3, endpoint=False) * np.pi / 180
# Define the distance from the orchid to inspect from
distance = 1.0
# Define the inspection points
inspection_points = [orchid_position[:3] + [distance * np.cos(angle), distance * np.sin(angle), 0] for angle in angles]
# Fly the drone to each inspection point and look at the orchid
for point in inspection_points:
    self.tello.fly_to(point + [np.arctan2(orchid_position[1] - point[1], orchid_position[0] - point[0]), orchid_position[3]])

(Text-based explanation of the response omitted for brevity.)

[Snapshots from drone video]

3) AERIAL ROBOTICS: AIRSIM INDUSTRIAL INSPECTION (CLICK FOR VIDEO)
We also applied ChatGPT to a simulated domain, using the Microsoft AirSim [23] simulator. We explored the idea of a potentially non-technical user directing the model to control a drone and execute an industrial inspection scenario. We observe from the following excerpt that ChatGPT is able to effectively parse intent and geometrical cues from user input and control the drone accurately.

[Initial model prompt found in Appendix B-C]

Fly to the turbine.

Question - There are two turbines in the world, which one do you want me to fly the drone to? Turbine 1 or Turbine 2?

Turbine 2, please. Fly to the turbine while keeping 12 meters away from the turbine along the X axis, and reach an altitude of 50 meters.

Here's the code to fly the drone to Turbine 2 while maintaining a distance of 12 meters along the X-axis and reaching an altitude of 50 meters:

# Get the position of Turbine 2
turbine_position = get_position("Turbine 2")
# Set the target position to be 12 meters away from the turbine along the X-axis and 50 meters high
target_position = [turbine_position[0] - 12, turbine_position[1], 50]
# Fly the drone to the target position
fly_to(target_position)

To inspect the turbine blade, I need you to fly in the bottom-right direction at an angle of -30 degrees from the horizontal axis. You should fly in the YZ plane for a distance of 30 meters, and then back to the current position.

Here's the code to fly the drone in the bottom-right direction at an angle of -30 degrees from the horizontal axis, in the YZ plane for a distance of 30 meters, and then back to the current position:

import math
# Get the drone's current position
current_position = get_position("Drone")
# Calculate the target position, which is 30 meters away in the YZ plane at an angle of -30 degrees from the horizontal axis
angle = math.radians(-30)
target_position = [
    current_position[0],
    current_position[1] + 30 * math.cos(angle),
    current_position[2] + 30 * math.sin(angle)
]
# Fly the drone to the target position
fly_to(target_position)
# Fly the drone back to the current position
fly_to(current_position)

[Snapshots from inspection video]

B. USER ON THE LOOP: INTERACTIVE CONVERSATIONS FOR COMPLEX TASKS
As a next step, we focused on the ability of ChatGPT to perform more complex tasks with a user on the loop offering textual feedback. We observed that this modality of interaction allows for complex constructs such as curriculum learning, where the system is taught smaller-scale skills which can be combined towards larger and more intricate tasks. In addition, we found that ChatGPT is able to take high-level textual feedback regarding generated code or its performance and map it to the required low-level code changes, thus enabling potentially non-technical users to interact with it with ease.
1) MANIPULATION WITH CURRICULUM LEARNING (CLICK FOR VIDEO)
As a first example, we set up a real-life block arrangement task using a robot arm, for which we devised a curriculum. ChatGPT was asked to learn the simpler skills of picking up objects, and then placing them. Subsequently, when asked to use these skills to perform a more involved task of block arrangement, ChatGPT chains these learned skills together logically. In addition, the model displays a fascinating example of generalizability when bridging the textual and physical domains when tasked with building the Microsoft logo out of wooden blocks. This task requires memory of what the corporate logo looks like, including its colors, followed by building an abstraction of the logo into physical parts which are constructible by the existing robot actions. The code output by ChatGPT was executed on a computer that controlled the robot arm, and the robot successfully completed the task.

[Initial model prompt found in Appendix B-G]
[The beginning of the conversation is omitted for brevity. The following excerpt requires ChatGPT to build a new function using previously defined APIs, and later use this function for more complex tasks.]

I want you to learn how to place an object in a position. Considering you are holding a generic object, you first need to move to a safe position above the target location, bring the object down and then release it.

Understood. Here's a Python function that takes the position where the object should be placed, moves the robot arm above the position, brings the object down, and releases it. Note that the function assumes that the robot arm is holding an object.

def place_object(position):
    # move the robot arm to a safe position above the target location
    safe_pos = [position[0], position[1], position[2]+100, position[3], position[4], position[5]]
    move_to(safe_pos)
    # move the robot arm to the target location
    move_to(position)
    # release the object
    release()
    # move the robot arm back to the safe position above the target location
    move_to(safe_pos)

What does the Microsoft logo look like?

The Microsoft logo is a colored rectangular with four squares arranged in a grid pattern. The colors used in the logo are blue, green, orange, and yellow. The logo has gone through several iterations over the years, but this is the most recent and recognizable design.

Can you generate an SVG with those colors?

Yes, I can generate an SVG file with the Microsoft logo colors. Here is an example

<svg width="100" height="100" viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">
  <rect x="0" y="0" width="50" height="50" fill="#F25022"/>
  <rect x="50" y="0" width="50" height="50" fill="#7FBA00"/>
  <rect x="0" y="50" width="50" height="50" fill="#00A4EF"/>
  <rect x="50" y="50" width="50" height="50" fill="#FFB900"/>
</svg>

[Reproduction of SVG generated by ChatGPT]

Awesome! I want you now to use the blocks to build the logo from Microsoft on top of the white pad.

Sure! Here's the Python code to build the Microsoft logo on top of the white pad using the blocks:
[Code not shown for brevity. Please consult the link for the full conversation]
This code gets the positions of the white pad and all four blocks using the get_position function, picks up each block in turn using the pick_up_object function we defined earlier, calculates the position to place the block on the white pad, and places the block on the white pad using the place_object function we defined earlier. We use the block colors to ensure that the logo is built correctly. We place the blue block in the bottom left position, the yellow block in the bottom right position, the red block in the top left position, and the green block in the top right position, as specified in the prompt.

[snapshots from manipulator video]

2) AERIAL ROBOTICS: AIRSIM OBSTACLE AVOIDANCE (CLICK FOR VIDEO)
We tasked ChatGPT to write a goal-reaching algorithm with obstacle avoidance for a drone in the AirSim simulator equipped with a forward-facing distance sensor. ChatGPT built most of the key building blocks for the avoidance algorithm, but required some human feedback on steps it missed regarding the drone's orientation. Although the feedback was provided entirely in high-level text, ChatGPT improved its solution with localized changes to the code where appropriate. We deployed the final solution on a simulated drone in AirSim, and observed that it was able to successfully navigate to the goal while avoiding obstacles.

[Initial model prompt found in Appendix B-D]

Assume I give you some XYZ coordinates as a goal. I want you to control the drone such that it moves towards the goal while avoiding obstacles in front of it. Instead of flying directly to the goal, the drone should take small steps while reassessing the obstacle situation at every step.
The drone is equipped with a distance sensor that returns the distance to the closest obstacle directly in front of the drone. You can access it by calling get_distance().
I want you to make sure to fly only if there is at least 10 meters of distance in front of the drone. If not, you should rotate the drone until there is at least 10 meters of clearance in the front and then take a step in the direction that the drone is currently facing. Don't forget to face the goal again after taking a step. Got it?

Yes, I got it. Here's how I would approach the problem.

# Get the drone's current position and set the yaw angle towards the goal
current_pos = get_position('drone')
goal_pos = [x_goal, y_goal, z_goal]
yaw = math.atan2(goal_pos[1]-current_pos[1], goal_pos[0]-current_pos[0]) * 180 / math.pi
set_yaw(yaw)
# Start a loop that runs until the drone reaches the goal position
while np.linalg.norm(np.array(current_pos) - np.array(goal_pos)) > 0.5:
    # Check if there is at least 10 meters of distance in front of the drone
    distance = get_distance()
    while distance < 10:
        # Set the yaw angle to rotate the drone until there is at least 10 meters of clearance in the front
        set_yaw(yaw+30)  # rotate 30 degrees at a time
        time.sleep(1)  # wait for a second to measure the new distance
        distance = get_distance()
    # Take a step towards the goal while maintaining the yaw angle
    step_size = 5  # take steps of 5 meters
    step_vector = [step_size * math.cos(yaw), step_size * math.sin(yaw), 0]
    new_pos = [current_pos[i]+step_vector[i] for i in range(3)]
    fly_to(new_pos)
    current_pos = get_position('drone')

After you take a step, you need to face the goal again.

You're right. After taking a step, we need to recalculate the yaw angle to face the goal again. Here's the modified code that includes this step:

'''Same code as before'''
yaw = math.atan2(goal_pos[1]-current_pos[1], goal_pos[0]-current_pos[0]) * 180 / math.pi
set_yaw(yaw)

This code should ensure that the drone faces the goal after taking a step, and continues moving towards the goal while avoiding obstacles in front of it.

You are not using the current yaw angle when computing the step vector. Remember there is a get_yaw() function.

You're right. After taking a step, we need to recalculate the yaw angle to face the goal again. Here's the modified code that includes this step:

# Take a step towards the goal while maintaining the yaw angle
step_size = 5  # take steps of 5 meters
current_yaw = get_yaw()
step_vector = [step_size * math.cos(current_yaw), step_size * math.sin(current_yaw), 0]

[Initial model prompt found in Appendix B-F]

I need you to help me find an object. Objects might be scattered around the scene, so if you don't find it right away, a good strategy is to turn around a few times to see if it becomes visible. Give me code to explore and find the bottle.

Got it. Here's some code that will help you explore the environment and find the bottle:

def search_for_bottle():
    num_turns = 4  # Turn around 4 times to look for the bottle
    for i in range(num_turns):
        # Turn left 90 degrees to face a new direction
        turn_left()
        img = get_image()
        obj_list, obj_locs = detect_objects(img)
        if 'bottle' in obj_list:
            # Get the coordinates of the bottle in the image
            bottle_idx = obj_list.index('bottle')
            bbox = obj_locs[bottle_idx]
            # Compute the center of the bounding box
            center_x = (bbox[0] + bbox[2]) / 2
            center_y = (bbox[1] + bbox[3]) / 2
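The excerpt above is cut off after the bounding-box center is computed. As a rough sketch of how the remaining logic could proceed (an assumption on our part, not ChatGPT's actual continuation), the same Appendix B-F API can be used to keep stepping towards the bottle while it stays detected; the image width and the stopping threshold below are illustrative values.

def drive_to_bottle(image_width=640, max_steps=50):
    # Step forward (0.1 m per forward() call) while the bottle stays detected, and stop
    # once its bounding box grows large enough to suggest the agent is close.
    for _ in range(max_steps):
        img = get_image()
        obj_list, obj_locs = detect_objects(img)
        if 'bottle' not in obj_list:
            turn_left()  # lost sight of the bottle; the API only offers 90-degree turns
            continue
        bbox = obj_locs[obj_list.index('bottle')]
        if (bbox[2] - bbox[0]) > 0.4 * image_width:  # bottle fills ~40% of the image width
            break
        forward()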
[...] this type of behavior is heavily dependent on a dialog-based LLM such as ChatGPT, which can naturally receive a stream of observations. We constructed a scenario using the Habitat simulator [25] and tasked the agent to navigate to an area of interest. At each step we provided ChatGPT with a description of the scene in terms of polar coordinates of the visible objects relative to the agent. We constrained the output of ChatGPT to only return a forward distance of motion and the turning angle. We verify that the model is able to complete simple navigation tasks, but more research is needed in terms of properly describing the scene context in textual or vector form for more complex tasks and environments. For additional experiments, we refer the reader to Appendix C. A sketch of the observation-to-command loop is given after the excerpt below.

[Initial model prompt found in Appendix B-E]

Task: go to the chairs. Objects in the scene: <door, 0.53 m, 22>, <chair, 4.84 m, -21>, <chair, 2.12 m, -24>, <window, 6.74 m, -3>, <window, 7.17 m, 6>, <door, 0.61 m, 18>, <stairs, 1.42 m, 0> Command:
move(1.5) # Move forward by 1.5 meters to explore the environment.
Objects in the scene: <seating, 5.05 m, 21>, <chair, 3.93 m, -25>, <window, 4.55 m, 22>, <window, 5.24 m, -4>, <stairs, 3.49 m, 13>, <window, 5.67 m, 7>, <window, 4.37 m, 23> Command:
turn(-25) # Turn to the left to face the chair that is 3.93 meters away
Objects in the scene: <chair, 4.85 m, -14>, <chair, 4.76 m, -9>, <chair, 1.57 m, -23>, <chair, 2.66 m, -24>, <window, 4.96 m, 10>, <window, 4.19 m, 25> Command:
turn(-10) # Turn to the right to explore the environment further.
Objects in the scene: <door, 5.07 m, -25>, <chair, 5.36 m, -6>, <table, 2.17 m, -24>, <chair, 5.01 m, -1>, <chair, 1.80 m, -16>, <chair, 3.15 m, -16>, <window, 4.30 m, 19> Command:
turn(-6) # Turn to the right to face the chair that is 5.36 meters away.
Objects in the scene: <counter, 1.24 m, -24>, <door, 5.93 m, -17>, <chair, 5.48 m, 1>, <table, 2.60 m, -14>, <chair, 4.92 m, 6>, <chair, 2.02 m, -7>, <chair, 3.53 m, -7>, <chair, 3.16 m, -18> Command:
turn(1) # Turn to the left to explore the environment further.
Objects in the scene: <door, 5.07 m, -25>, <chair, 5.36 m, -6>, <table, 2.17 m, -24>, <chair, 5.01 m, -1>, <chair, 1.80 m, -16>, <chair, 3.15 m, -16>, <window, 4.30 m, 19> Command:
move(5.5) # Move forward towards the chair that is 5.36 meters away.
Objects in the scene: <chair, 0.96 m, -10>, <chair, 0.99 m, 20>, <chair, 1.77 m, -20> Command:
move(1) # Move forward to approach the chairs.

[snapshots from Habitat navigation]
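A minimal sketch of how this observation-to-command loop could be wired up is shown below. The scene formatting follows the excerpt above, while the regex used to parse move()/turn() commands out of the model's reply is an assumption about one possible implementation.

import re

def format_scene(objects):
    # objects: list of (name, distance_m, angle_deg) tuples, e.g. ("chair", 4.84, -21)
    return "Objects in the scene: " + ", ".join(
        f"<{name}, {dist:.2f} m, {angle}>" for name, dist, angle in objects
    ) + " Command:"

def parse_command(reply):
    # Extract a single move(x) or turn(x) command from the model's reply, ignoring
    # any trailing comment the model adds after the command.
    match = re.search(r"(move|turn)\(\s*(-?\d+(?:\.\d+)?)\s*\)", reply)
    if match is None:
        return None
    return match.group(1), float(match.group(2))

# Example: parse_command("turn(-25) # Turn to the left") returns ("turn", -25.0)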
D. REASONING AND COMMON-SENSE ROBOTICS TASKS
We tasked ChatGPT with answering simple logical reasoning questions, as well as basic robotics problems, shown in Appendix D. We observe that ChatGPT offers a rich foundation of common-sense and logical reasoning capabilities upon which more advanced robotics abilities can be built. This logical grounding allows the user to interact with the model more naturally, as opposed to having to prompt or define every concept from scratch. Furthermore, ChatGPT's out-of-the-box understanding of basic robotics concepts such as control, camera geometry, and physical form factors makes it an excellent choice to build upon for generalizable and user-friendly robotics pipelines.

E. USAGE OF CHATGPT API IN REAL-TIME
Most of the prior examples we presented in this section demonstrated open-loop behavior where ChatGPT generated code to solve a task while conversing with a user, prior to the actual deployment. As an extension, we have also examined the usage of ChatGPT in real-time robotics applications, where the model generates code for robotics problems in real time. In order to achieve this, we use the AirSim simulator, which allows us to safely test the model's generated code in a controlled environment. We utilize the OpenAI ChatCompletion Python package to interact with the model in real time, and execute the generated code in a Python kernel. In order to use the API in real time, we modified the system prompt for the ChatGPT model to command it to always return properly formatted Python code that is easily identifiable through the markdown format. In each response returned by the API, we use regex-based matching to extract the Python code if any exists, and then execute it in a Python kernel. In order to ensure safety of the drone even in the simulation, the drone was made to hover in place while a user places a request to the model, and until the model's response was executed.

We performed some experiments within the AirSim simulator, where we observed that the model was able to generate code for a drone to navigate to specified goals such as object goal navigation, inspection, etc. The results of this experiment can be seen at https://www.youtube.com/watch?v=iE5tZ6_ZYE8. We note that the latency of the model was not entirely suitable for high-speed real-time applications, but the ability of the drone to hover in place for a few seconds allowed us to use the LLM in a safe manner. Similarly, a simulation is an ideal environment for such an experiment, because even if the output of ChatGPT were wrong or inaccurate, the simulation can be reset to a previous safe state. If there is an intent to perform such an experiment in real life, users would first need to create a safety layer that can evaluate the correctness of the generated code before directly executing it, or have other appropriate safety rules in place for the robot (e.g., hardcoded constraints on speed, direction of motion, etc.). We expect future research to explore how to properly use LLMs in real-time robotics applications, and how to ensure the safety and correctness of the generated code.
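A minimal sketch of the regex-based extraction step described above is shown below. The fenced-block pattern and the exec-based execution are assumptions about one possible implementation; the chat call itself is abstracted away, since only the already-returned response text is needed here.

import re

PYTHON_BLOCK = re.compile(r"```python\s*\n(.*?)```", re.DOTALL)

def run_model_response(response_text, robot_namespace):
    # Extract every markdown-fenced Python block from the model's reply and execute it
    # in a namespace that already exposes the robot API (fly_to, get_position, ...).
    # In the experiments described above, the drone hovers in place while the request
    # is pending and until this execution step completes.
    for block in PYTHON_BLOCK.findall(response_text):
        exec(block, robot_namespace)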
IV. PROMPTCRAFT, A COLLABORATIVE TOOL FOR LLM + ROBOTICS RESEARCH
Prompting is a crucial component to generate the desired behaviors in large language models (LLMs). Prompt engineering is particularly challenging at the intersection of LLMs with robotics, where there is a lack of comprehensive and accessible resources that provide examples of positive (and negative) interactions. To address this gap, we introduce PromptCraft,1 a collaborative open-source platform for researchers to share examples of prompting strategies and test their algorithms in sample robotic environments.

PromptCraft is a GitHub-based platform that allows researchers to share examples of prompt engineering strategies within different robotics categories, such as navigation, grasping, and manipulation. Users can submit their examples and rate others' submissions, which we hope will create a community-driven resource for researchers working with LLMs. Submissions of prompts and dialogues are primarily based on text, but we encourage users to share videos and images depicting the robot's behavior, especially for real-world deployment scenarios.

In addition to providing a platform for sharing prompt examples, PromptCraft also offers an AirSim [23] environment with a ChatGPT wrapper for researchers to prototype prompts and algorithms in a controlled simulated setting. We welcome contributions of new test environments to expand the range of scenarios where researchers can test their algorithms.

With PromptCraft we aim to support the empirical science of prompt engineering and enable researchers to advance the field.

1 https://github.com/microsoft/PromptCraft-Robotics

FIGURE 6. PromptCraft open-sourced repository. Researchers can upload and vote on examples of LLM prompts for various robotics categories.

V. RELATED WORK
A. NATURAL LANGUAGE AND ROBOTICS
Natural language processing (NLP) has long been recognized as a crucial component for human-robot interaction. There are many applications where robots can benefit from NLP, including but not limited to task instruction, navigation, and information retrieval. Classically, modeling human-robot interactions using language is challenging because it forces the user to operate within a rigid set of instructions [26], or requires mathematically complex algorithms to keep track of multiple probability distributions over actions and target objects [27], [28]. More recent works explore neural networks to implicitly keep track of the complex mapping between language and actions, but such techniques often require vast amounts of labeled data for training [9], [10], [29], [30].

B. LARGE (VISION AND) LANGUAGE MODELS FOR ROBOTICS
The Transformer architecture, introduced by [31], has revolutionized NLP and has also shown great promise in robotics. Transformers have been used for robot control and planning [32], [33], [34], object recognition [35], and robot navigation [36]. A more common use of transformers in robotics has been as feature extraction modules for one or more modalities simultaneously. These systems are often coupled with additional features from pretrained large-scale vision and language models [11], [14], [15], [16], [37], [38], [39].

Models such as SayCan [38] focus on grounding LLMs so that free-form text commands are used to compute a value function to rank the best action types within a robot-specific library. RT-1 [40], on the other hand, takes an end-to-end approach to learn the mapping between language commands and low-level actions, without the use of intermediate high-level functions. Recent works have also explored the ability of large language models (LLMs) for zero-shot high-level robotics task planning [17], [19], [20]. These models make use of prompting structures with pre-defined functions, behaviors, and examples to guide the generation of the model's answers. Reference [18] also explores the use of interactivity between user and LLM for table-top manipulation settings. Another interesting approach was outlined in Socratic Models [41], which shows that the individual shortcomings of VLMs and LLMs can be alleviated through a combination of several models, and by allowing them all to communicate with the LLM in the common modality of text. In our work, we discuss a possible similarity where an object detection model and the LLM can be integrated for vision-based navigation. We find the ChatGPT paradigm to be a potentially more flexible and generalizable approach, as it was shown to be effective for a wide range of form factors, can integrate multiple other models through code, and can naturally hold a dialogue with the user for refinement and improvement of the policies.

Conceptually, the main difference of these approaches with respect to our work, which leverages ChatGPT [1], is the conversational ability of our LLM, which allows the user to interactively improve and correct the robot's behavior (as opposed to re-engineering the prompt from scratch and generating another zero-shot answer). In addition, our work aims to provide a generalizable pipeline and set of principles to be used by researchers in different fields of robotics, as opposed to focusing on a single domain such as table-top manipulation or task planning.

C. PROMPTING LLMS WITH APIS, AND ITS CONNECTIONS TO SYMBOLIC AI
When designing LLM prompts for robotics applications, users often make use of a high-level library of APIs to represent specific behaviors to be used. We can draw a connection between this approach and classical symbolic AI, which uses logic and rules to represent and reason about knowledge [42]. While the traditional symbolic AI approach presented difficulties in new knowledge acquisition and dealing with out-of-distribution data, we believe that LLMs can overcome these challenges. As we showed in Section II-A and Section III, models such as ChatGPT can compose new primitive functions based on the context and generate code for them automatically.
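To make this point about composing primitives concrete, the sketch below mirrors the spirit of the place_object example from Section III-B: a new pick_and_place behavior is assembled purely from the manipulator API of Appendix B-G (get_position, move_to, grab, release) plus the previously defined place_object. It is an illustration on our part, not code generated by ChatGPT.

def pick_and_place(object_name, target_position):
    # Compose existing primitives into a new, reusable behavior.
    pick_pos = get_position(object_name)  # [X, Y, Z, Yaw, Pitch, Roll]
    safe_pos = [pick_pos[0], pick_pos[1], pick_pos[2] + 100] + pick_pos[3:]
    move_to(safe_pos)    # approach from above
    move_to(pick_pos)    # descend to the object
    grab()               # turn on the suction pump
    move_to(safe_pos)    # lift the object clear
    place_object(target_position)  # reuse the function defined earlier in the dialog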
VI. CONCLUSION AND FUTURE WORK
We presented a framework for using ChatGPT for robotics applications. The framework entails designing and implementing a library of APIs for robot control which are amenable to prompt engineering for ChatGPT. We discussed design principles for creating such APIs and prompting strategies that can be used to generate code for robotics applications via ChatGPT. The proposed framework allows the generated code to be tested, verified, and validated by a user on the loop via a range of methods including simulation and manual inspection. We demonstrated how the framework can be used for multiple applications ranging from simple common-sense robotics knowledge tasks all the way to deployments in aerial robotics, manipulation, and visual navigation.

We believe that this work presents only a small fraction of what is possible within the intersection of large language models operating in the robotics space. In this work, we examine the idea of using ChatGPT during the development process of robotics algorithms, but a future direction is to explore how to use LLMs directly in the deployment setting. We hope to not only inspire other researchers to take these next steps, but to also help them achieve results with the use of the PromptCraft collaborative tool.

Most of the examples we presented in this work demonstrated open perception-action loops where ChatGPT generated code to solve a task, with no feedback provided to the model afterwards. Given the importance of closed-loop controls in perception-action loops, we expect much of the future research in this space to explore how to properly use ChatGPT's abilities to receive task feedback in the form of textual or special-purpose modalities. This could be achieved either by using cloud-based models such as ChatGPT within a closed loop, or by deploying open-source LLMs and leveraging optimization techniques such as quantization for efficiency [43], [44].

We emphasize that these tools should not be given full control of the robotics pipeline, especially for safety-critical applications. Given the propensity of LLMs to eventually generate incorrect responses, it is fairly important to ensure solution quality and safety of the code with human supervision before executing it on the robot. We expect several research works to follow with the proper methodologies to design, build, and create testing, validation, and verification pipelines for LLMs operating in the robotics space.

A. USAGE OF LLMS WITHIN DEPLOYMENT PIPELINES
In this paper, we discuss two main ways to use LLMs in robotics: as a tool for generating code for robotics problems a priori, and as a tool for generating code for robotics problems in real-time. As mentioned earlier, we highly recommend the former approach for safety-critical applications. LLMs can be used to generate code for robotics problems in real-time, but it is important to ensure that the generated code is safe and correct before executing it on the robot. When evaluating the usage of LLMs in real time, robots and the scenarios should be sufficiently equipped such that issues such as latency are not a concern. For instance, if a drone could hover safely and accurately for a few seconds while waiting for a response from the LLM, then the latency of the LLM may not be a safety concern. In such a case, the LLM acts as a high-level task or behavioral planner, whereas the low-level controls are handled by the drone or the robot itself, which we believe is a more pragmatic approach. Similarly, it would be highly important to have a layer that could check for the feasibility and correctness of the generated code before executing it on the robot.
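One minimal form such a checking layer could take (an assumption on our part, not a complete safety solution) is a static pass that rejects generated code that fails to parse or that calls anything outside a whitelist of known robot APIs:

import ast

# Deliberately strict, illustrative whitelist covering only the drone API from
# Appendix B-C/B-D; a practical version would also allow vetted helpers (math, numpy, ...).
ALLOWED_CALLS = {"get_position", "fly_to", "fly_path", "get_yaw", "set_yaw"}

def code_is_safe(generated_code):
    try:
        tree = ast.parse(generated_code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name not in ALLOWED_CALLS:
                return False
    return True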
APPENDIX A
ADDITIONAL EXAMPLES ON DESIGN PRINCIPLES ON PROMPT ENGINEERING

A. EXAMPLE OF PROMPT WITH NO API LIBRARY: UNBOUNDED TEXT-BASED RESPONSE

Consider you are a home assistant robot. You goal is to prepare an omelette for an elderly person. Output python code with the sequence of steps that achieves your objective.

Sure, here's an example Python code that outlines the sequence of steps to prepare an omelette for an elderly person:

print("Good morning! Would you like an omelette for breakfast?")
time.sleep(5)
print("Great! Let's get started.")
print(f"Cracking {eggs} eggs into a bowl...")
print("Adding a splash of milk, a pinch of salt, and a pinch of pepper...")
print("Whisking the ingredients together...")
print(f"Heating {butter} in a non-stick pan over medium heat...")
print("Pouring the egg mixture into the pan...")
print("Cooking the omelette for 2-3 minutes, until the bottom is set...")
print("Using a spatula to flip the omelette...")
print("Cooking the other side of the omelette for an additional minute...")
print("Serving the omelette hot and ready to eat. Enjoy your meal!")

B. EXAMPLE OF PROMPT WITH UNDER-SPECIFIED APIS: HALLUCINATION OVER FUNCTION PARAMETERS

Consider you are a home assistant robot. You goal is to prepare an omelette for an elderly person. You are equipped with functions such as locate_object(), move_to_location(), cook_item(), grab_object(). Output python code with the sequence of steps that achieves your objective.

Sure, here is a Python code with the sequence of steps that a home assistant robot can take to prepare an omelette for an elderly person:

print("Starting to prepare an omelette for the elderly person...")
locate_object(EGGS_LOCATION)
move_to_location(FRIDGE_LOCATION)
grab_object(EGGS_LOCATION)
locate_object(PAN_LOCATION)
move_to_location(STOVE_LOCATION)
grab_object(PAN_LOCATION)
locate_object(OIL_LOCATION)
grab_object(OIL_LOCATION)
print("Pouring the oil into the pan...")
time.sleep(2)
print("Cracking the eggs into the pan...")
grab_object(EGGS_LOCATION)
time.sleep(1)
cook_item(PAN_LOCATION, COOKING_TIME)
print("The omelette is ready! Serving it to the elderly person...")
C. EXAMPLE OF PROMPT WITH STRUCTURED RESPONSE FORCING THE ANSWER TO BE IN A LIST FORMAT

Consider you are a home assistant robot. You goal is to prepare an omelette for an elderly person. Find the sequence of steps that achieves your objective.

1. First, gather all the necessary ingredients and equipment for making an omelette. This would typically include eggs, milk, butter, salt, pepper, a mixing bowl, a whisk, a frying pan, and a spatula.
2. Crack the eggs into the mixing bowl and add a splash of milk, a pinch of salt, and some black pepper to taste.
3. Use the whisk to beat the eggs and milk together until they are well combined.
4. Heat the frying pan over medium heat and add a small amount of butter to the pan, allowing it to melt and coat the bottom of the pan.
5. Pour the egg mixture into the frying pan and use the spatula to spread it evenly over the bottom of the pan.
6. Allow the eggs to cook for a minute or two, until the bottom starts to set and the top is still slightly runny.
7. Use the spatula to fold one half of the omelette over the other half, forming a half-moon shape.
8. Cook for another minute or two until the eggs are fully set and the omelette is heated through.
9. Use the spatula to slide the omelette onto a plate and serve immediately.
10. Make sure to clean up any mess that was made during the cooking process and return all equipment and ingredients to their proper places.
APPENDIX B
ORIGINAL INITIAL PROMPTS FOR ROBOTICS TASKS
EXAMPLES
A. SPATIAL-TEMPORAL REASONING: CATCHING A
BASKETBALL WITH VISUAL SERVOING C. AERIAL ROBOTICS: AIRSIM INDUSTRIAL INSPECTION
Full conversation found at: Full conversation found at:
https://github.com/microsoft/PromptCraft-Robotics/blob/ https://github.com/microsoft/PromptCraft-Robotics/blob/
main/examples/spatial_temporal_reasoning/visual_ main/examples/aerial_robotics/airsim_turbine_inspection.
servoing_basketball.md md
Initial ChatGPT prompt: Initial ChatGPT prompt:
Imagine you are helping me interact with the AirSim simulator for drones.
Imagine you are a planar robot that can move along the XY axes, and you’re At any given point of time, you have the following abilities. You are also required
positioned in the center of a basketball court. A person on the side of the court to output code for some of the requests.
is going to throw a basketball ball in the air somewhere in the court, and your Question - Ask me a clarification question Reason - Explain why you did
objective is to be at the exact XY location of the ball when it lands. The robot something the way you did it. Code - Output a code command that achieves
has a monocular RGB camera that looks up. You can assume that the following the desired goal.
functions are available: The simulator contains a drone, along with several objects. Apart from the
get_image(): returns an image from the robot’s camera looking up; drone, none of the objects are movable. Within the code, we have the following
get_location(): returns 2 floats XY with the robot’s current location in the commands available to us. You are not to use any other hypothetical functions.
court; get_position(object_name): Takes a string as input indicating the
move_to_point(x,y, vx, vy): moves the robot towards a specific (x,y) name of an object of interest, and returns a vector of 3 floats indicating its X,Y,Z
location in the court with velocity (vx,vy). You can assume for this exercise that coordinates.
the robot can accelerate or break instantly to any velocity; fly_to(position): Takes a vector of 3 floats as input indicating X,Y,Z
move_by_velocity(vx, vy): moves the robot along the X axis with coordinates and commands the drone to fly there.
velocity vx, and Y axis with velocity vy; fly_path(positions): Takes a list of X,Y,Z positions indicating way-
Additional points to consider when giving your answer 1) Your reponses should points along a path and flies the drone along that path.
be informative, visual, logical and actionable, 2) Your logics and reasoning Here is an example scenario that tells you how to respond where we are working
should be rigorous, intelligent, and defensible, 3) You can provide additional with a simulated world that has two spheres in it.
relevant details to respond thoroughly and comprehensively to cover multiple Me: Fly the drone to the sphere. You: Question - There are two spheres in the
aspects in depth. world, which one do you want me to fly the drone to? Me: Let’s pick Sphere 1.
Write a python script that executes a visual servoing approach towards catching There are two turbines, some solar panels and a car in the world.
a basketball in a court. You can use opencv functions to detect the ball as an Are you ready?
orange blob.
Imagine you are helping me interact with the AirSim simulator for drones. main/examples/manipulation/pick_stack_msft_logo.md
At any given point of time, you have the following abilities. You are also required Initial ChatGPT prompt:
to output code for some of the requests.
Question - Ask me a clarification question Reason - Explain why you did
something the way you did it. Code - Output a code command that achieves Imagine we are working with a manipulator robot. This is a robotic arm with
the desired goal. 6 degrees of freedom that has a suction pump attached to its end effector. I would
The simulator contains a drone, along with several objects. Apart from the like you to assist me in sending commands to this robot given a scene and a task.
drone, none of the objects are movable. Within the code, we have the following At any point, you have access to the following functions:
commands available to us. You are not to use any other hypothetical functions. grab(): Turn on the suction pump to grab an object
get_position(object_name): Takes a string as input indicating the release(): Turns off the suction pump to release an object
name of an object of interest, and returns a vector of 3 floats indicating its X,Y,Z get_position(object): Given a string of an object name, returns the
coordinates. coordinates and orientation of the vacuum pump to touch the top of the object
fly_to(position): Takes a vector of 3 floats as input indicating X,Y,Z [X, Y, Z, Yaw, Pitch, Roll]
coordinates and commands the drone to fly there. move_to(position): It moves the suction pump to a given position [X, Y,
fly_path(positions): Takes a list of X,Y,Z positions indicating way- Z, Yaw, Pitch, Roll].
points along a path and flies the drone along that path. You are allowed to create new functions using these, but you are not allowed to
get_yaw(): Get the current yaw angle for the drone (in degrees) use any other hypothetical functions. Keep the solutions simple and clear. The
set_yaw(angle): Set the yaw angle for the drone (in degrees) positions are given in mm and the angles in degrees.
Are you ready? You can also ask clarification questions using the tag "Question - ". Here is an
example scenario that illustrates how you can ask clarification questions.
Let’s assume a scene contains two spheres. Me: pick up the sphere. You:
Question - there are two spheres. Which one do you want me to pick up? Me:
E. EMBODIED AGENT: HABITAT NAVIGATION
Full conversation found at: https://github.com/microsoft/PromptCraft-Robotics/blob/main/examples/embodied_agents/visual_language_navigation_1.md
Initial ChatGPT prompt:
Imagine I am a robot equipped with a camera and a depth sensor. I am trying to perform a task, and you should help me by sending me commands. You are only allowed to give me the following commands:
turn(angle): turn the robot by a given number of degrees.
move(distance): moves the robot straight forward by a given distance in meters.
On each step, I will provide you with the objects in the scene as a list of <object name, distance, angle in degrees>. You should reply with only one command at a time. The distance is in meters, and the direction angle in degrees with respect to the robot's orientation. Negative angles are to the left and positive angles are to the right. If a command is not valid, I will ignore it and ask you for another command. If there is no relevant information in the scene, use the available commands to explore the environment.
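As a purely hypothetical illustration of this observation/command protocol (the object names, distances, and angles below are invented and do not come from the original conversation), a single step of the interaction could look as follows.
Me: <chair, 3.2, 25>, <table, 1.1, -40>, <door, 5.6, 170>. Go to the chair.
You: turn(25)
Me: <chair, 3.1, 0>, <table, 1.3, -65>.
You: move(3.1)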
F. EMBODIED AGENT: AIRSIM OBJECT NAVIGATION
Full conversation found at: https://github.com/microsoft/PromptCraft-Robotics/blob/main/examples/embodied_agents/airsim_objectnavigation.md
Initial ChatGPT prompt:
Imagine you are helping me interact with the AirSim simulator. We are controlling an embodied agent. At any given point of time, you have the following abilities. You are also required to output code for some of the requests.
Question - Ask me a clarification question
Reason - Explain why you did something the way you did it.
Code - Output a code command that achieves the desired goal.
The scene consists of several objects. We have access to the following functions, please use only these functions as much as possible:
Perception:
get_image(): Renders an image from the front facing camera of the agent.
detect_objects(img): Runs an object detection model on an image img, and returns two variables - obj_list, which is a list of the names of objects detected in the scene, and obj_locs, a list of bounding box coordinates in the image for each object.
Action:
forward(): Move forward by 0.1 meters.
turn_left(): Turn left by 90 degrees.
turn_right(): Turn right by 90 degrees.
You are not to use any other hypothetical functions. You can use functions from Python libraries such as math, numpy etc. Are you ready?
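A minimal sketch (ours, not from the original conversation) of the kind of closed-loop search behavior this prompt is meant to elicit is shown below. The target name "monitor", the helper name find_object, and the step budget are assumptions; only the functions defined above plus standard Python are used.

# Hypothetical example: scan and step forward until a target object is detected.
def find_object(target_name, max_steps=50):
    for _ in range(max_steps):
        img = get_image()
        obj_list, obj_locs = detect_objects(img)
        if target_name in obj_list:
            # Return the bounding box of the target in the current image.
            return obj_locs[obj_list.index(target_name)]
        # Simple exploration policy: rotate to scan, then take a small step forward.
        turn_left()
        forward()
    return None

find_object("monitor")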
G. MANIPULATION WITH CURRICULUM LEARNING: PICKING, STACKING, AND BUILDING THE MICROSOFT LOGO
Full conversation found at: https://github.com/microsoft/PromptCraft-Robotics/blob/main/examples/manipulation/pick_stack_msft_logo.md
Initial ChatGPT prompt:
Imagine we are working with a manipulator robot. This is a robotic arm with 6 degrees of freedom that has a suction pump attached to its end effector. I would like you to assist me in sending commands to this robot given a scene and a task. At any point, you have access to the following functions:
grab(): Turn on the suction pump to grab an object.
release(): Turns off the suction pump to release an object.
get_position(object): Given a string of an object name, returns the coordinates and orientation of the vacuum pump to touch the top of the object [X, Y, Z, Yaw, Pitch, Roll].
move_to(position): It moves the suction pump to a given position [X, Y, Z, Yaw, Pitch, Roll].
You are allowed to create new functions using these, but you are not allowed to use any other hypothetical functions. Keep the solutions simple and clear. The positions are given in mm and the angles in degrees. You can also ask clarification questions using the tag "Question - ". Here is an example scenario that illustrates how you can ask clarification questions.
Let's assume a scene contains two spheres. Me: pick up the sphere. You: Question - there are two spheres. Which one do you want me to pick up? Me: Sphere 1, please.
Use python code to express your solution. Are you ready?
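As a minimal illustration (ours, not from the original conversation) of how these functions compose, the sketch below picks up one object and places it on top of another. The object names, the 100 mm clearance, and the helper name pick_and_place are assumptions; get_position is assumed to return a Python list, and the sign of the vertical clearance depends on the arm's Z convention.

# Hypothetical example: pick up "block1" and place it on top of "block2".
def pick_and_place(source, destination, clearance=100):  # clearance in mm
    src = get_position(source)       # [X, Y, Z, Yaw, Pitch, Roll] to touch the top of the source
    dst = get_position(destination)
    src_above = [src[0], src[1], src[2] + clearance, src[3], src[4], src[5]]
    dst_above = [dst[0], dst[1], dst[2] + clearance, dst[3], dst[4], dst[5]]
    move_to(src_above)   # approach the source from above
    move_to(src)         # descend onto the source object
    grab()
    move_to(src_above)   # lift
    move_to(dst_above)   # carry over the destination
    move_to(dst)         # lower onto the destination
    release()

pick_and_place("block1", "block2")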
APPENDIX C
ADDITIONAL EXPERIMENTS
A. ZERO-SHOT MANIPULATION: PICK AND PLACE, STACKING, AND PUSHING
Full conversation found at: https://github.com/microsoft/PromptCraft-Robotics/blob/main/examples/manipulation/manipulation_zeroshot.md
B. CONTROLLING MULTIPLE ROBOT FORM FACTORS FROM A SINGLE PROMPT
Full conversation found at: https://github.com/microsoft/PromptCraft-Robotics/blob/main/examples/multiple_robots/multiple_robots.md

APPENDIX D
OTHER COMMONSENSE/REASONING TASKS
A. COMPUTER VISION PROBLEM
Imagine we are working with an RGBD camera. The pose of this camera can be accessed via a function get_pose(). Now let us say we observe a point P in the image at coordinates (x, y), and the corresponding perspective depth value can be obtained by get_depth(x, y). The intrinsics of the camera can be obtained by the function get_intrinsics(). Can you give me code that will compute the world coordinates corresponding to point P?
Sure, here's some example code in Python to compute the world coordinates corresponding to point P:

import numpy as np

def get_world_coordinates(x, y):
    # Get camera intrinsics
    K = get_intrinsics()
    # Get camera pose
    pose = get_pose()
    # Get perspective depth value at observed point
    depth = get_depth(x, y)
    # Compute homogeneous image coordinates
    p_image = np.array([x, y, 1])
    # Compute inverse camera intrinsics
    K_inv = np.linalg.inv(K)
    # Compute image coordinates in camera frame
    p_camera = depth * K_inv.dot(p_image)
    # Compute world coordinates
    p_world = np.dot(pose[:3, :3], p_camera) + pose[:3, 3]
    return p_world

Given an observed point P in the image at coordinates (x, y), and the corresponding perspective depth value obtained by get_depth(x, y), we can compute the 3D world coordinates corresponding to P. To do this, we need to know the camera intrinsics, which describe how the 3D scene is projected onto the 2D image plane. The intrinsics can be obtained using the get_intrinsics() function. We also need to know the camera pose, which describes the position and orientation of the camera in the world coordinate system. The pose can be obtained using the get_pose() function.
With the intrinsics and pose, we can transform the observed point P from image coordinates to world coordinates. We first compute the inverse of the camera intrinsics matrix, which allows us to convert from image coordinates to camera coordinates. We then multiply the inverse intrinsics matrix with the observed point's pixel coordinates and the corresponding depth value to obtain the 3D point in camera coordinates. Finally, we transform the camera coordinates to world coordinates by multiplying the camera coordinates by the camera pose matrix, which describes the transformation from camera coordinates to world coordinates. The resulting 3D point is in homogeneous coordinates, so we need to divide by the last coordinate to obtain the 3D Euclidean coordinates.
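For completeness, the computation carried out by get_world_coordinates above can be summarized in one equation, where $K$ is the intrinsics matrix, $d$ the perspective depth returned by get_depth(x, y), and $R$ and $\mathbf{t}$ the rotation and translation blocks of the pose returned by get_pose() (the notation is ours, introduced only to summarize the code):
$$\mathbf{p}_{\mathrm{world}} = R\left(d\,K^{-1}[x,\; y,\; 1]^{\top}\right) + \mathbf{t}.$$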
B. TRANSFORMATION MATRICES
Imagine we are in a world that operates according to its own specific coordinate convention. There is a robot car located in this world at a specific location. The car has its own coordinate convention. And upon this car, there is a camera which, again, has a different coordinate convention. Imagine we have access to a function get_transformation_matrix(a, b) that can return the transformation matrix between world and vehicle, or vehicle and camera. What is the transformation matrix between the world and the camera?
The transformation matrix between the world and the camera can be obtained by composing the transformation matrices between the world and the vehicle and the vehicle and the camera. To find the transformation matrix between the world and the camera, we first get the transformation matrix between the world and the vehicle using the function get_transformation_matrix(a, b) with a set to world and b set to vehicle. Next, we get the transformation matrix between the vehicle and the camera using the function get_transformation_matrix(a, b) with a set to vehicle and b set to camera. Finally, we multiply the two transformation matrices to get the transformation matrix between the world and the camera. In mathematical terms, if T_world_vehicle is the transformation matrix between the world and the vehicle, and T_vehicle_camera is the transformation matrix between the vehicle and the camera, then T_world_camera = T_world_vehicle * T_vehicle_camera.
Let's get into the details more. Imagine the world operates according to the coordinate convention: X is forward, Y is to the left, Z is upwards. The robot car is located in this world at coordinates (5, 0, 1). The car has its own coordinate convention: X is forward, Y is to the right, and Z is downwards. Can you now implement get_transformation_matrix(world, car)?
Yes, to implement the get_transformation_matrix(world, car) function, we need to first convert the car's coordinate convention to match the world's coordinate convention. The transformation matrix between the world and the car can be represented as follows:

def get_transformation_matrix(a, b):
    if a == 'world' and b == 'car':
        T = np.array([[1, 0, 0, 5],
                      [0, -1, 0, 0],
                      [0, 0, -1, 1],
                      [0, 0, 0, 1]])
        return T
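Below is a minimal sketch (ours, not from the original conversation) of the composition described in this example, using only the get_transformation_matrix interface assumed in the prompt; the helper name and the conversion to NumPy arrays are our additions.

import numpy as np

def get_world_to_camera_matrix():
    # Compose the two 4x4 homogeneous transforms:
    # T_world_camera = T_world_vehicle * T_vehicle_camera
    T_world_vehicle = np.asarray(get_transformation_matrix("world", "vehicle"))
    T_vehicle_camera = np.asarray(get_transformation_matrix("vehicle", "camera"))
    return T_world_vehicle @ T_vehicle_camera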
C. BALANCING A BALL ON A PLATE
Imagine we are trying to balance a ball on the center of a plate. The plate can be controlled with a pitch and a roll angle as set_plate_pitch(), and set_plate_roll(). The range of pitch and roll angles that can be set are [-0.1, 0.1]. There is an intelligent camera above the plate that can capture top-down images of the plate and return the (X, Y) position of the center of the ball in the image. The image is of resolution 320x240. The pixel coordinates of the ball can be accessed by the function get_ball_position(). Can you give me code that continuously controls the plate such that the ball is always at the center?
Here is an example of the code in Python:

import time

def balance_ball():
    # Continuously check the position of the ball
    while True:
        # Get the current position of the ball
        x, y = get_ball_position()
        # Check if the ball is off-center
        if x > 160 or x < 160 or y > 120 or y < 120:
            # Calculate the pitch and roll adjustments needed to center the ball
            pitch_adjust = (y - 120) * 0.0001
            roll_adjust = (x - 160) * 0.0001
            # Limit the pitch and roll adjustments to the maximum range [-0.1, 0.1]
            pitch_adjust = max(min(pitch_adjust, 0.1), -0.1)
            roll_adjust = max(min(roll_adjust, 0.1), -0.1)
            # Set the pitch and roll angles to center the ball
            set_plate_pitch(pitch_adjust)
            set_plate_roll(roll_adjust)
        # Wait for a short period before checking the ball's position again
        time.sleep(0.01)

# Start the balancing process
balance_ball()

This code uses a loop to continuously check the position of the ball using the get_ball_position function. If the ball is off-center, the code calculates the necessary pitch and roll adjustments to center the ball and sets the plate angles using the set_plate_pitch and set_plate_roll functions. The code limits the pitch and roll adjustments to the maximum range of [-0.1, 0.1]. The loop waits for a short period before checking the ball's position again to prevent over-adjusting the plate.

ACKNOWLEDGMENT
This paper was written with the assistance of OpenAI's ChatGPT [1], with prompts provided by the authors. The model's output was thoroughly revised and adapted; the authors note that the use of LLMs can significantly speed up the writing process, and they recommend their use to the interested reader. (Sai H. Vemprala and Rogerio Bonatti contributed equally to this work.)

REFERENCES
[1] (2023). OpenAI. Accessed: Feb. 8, 2023. [Online]. Available: https://openai.com/blog/chatgpt/
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ 2018, arXiv:1810.04805.
[3] T. B. Brown et al., ‘‘Language models are few-shot learners,’’ in Proc. NIPS, 2020, pp. 1877–1901.
[4] OpenAI. (2023). GPT-4 Technical Report. [Online]. Available: https://arxiv.org/abs/2303.08774
[5] M. Chen et al., ‘‘Evaluating large language models trained on code,’’ 2021, arXiv:2107.03374.
[6] H. Touvron et al., ‘‘Llama 2: Open foundation and fine-tuned chat models,’’ 2023, arXiv:2307.09288.
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, ‘‘LLaMA: Open and efficient foundation language models,’’ 2023, arXiv:2302.13971.
[8] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, ‘‘Mistral 7B,’’ 2023, arXiv:2310.06825.
[9] Y. Hong, Q. Wu, Y. Qi, C. Rodriguez-Opazo, and S. Gould, ‘‘A recurrent vision-and-language BERT for navigation,’’ 2020, arXiv:2011.13922.
[10] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor, ‘‘Language-conditioned imitation learning for robot manipulation tasks,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 13139–13150.
[11] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, ‘‘CLIP-fields: Weakly supervised semantic fields for robotic memory,’’ 2022, arXiv:2210.05663.
[12] A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, S. Vemprala, and R. Bonatti, ‘‘LATTE: Language trajectory transformer,’’ 2022, arXiv:2208.02918.
[13] A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, and R. Bonatti, ‘‘Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers,’’ 2022, arXiv:2203.13411.
[14] M. Shridhar, L. Manuelli, and D. Fox, ‘‘Perceiver-actor: A multi-task transformer for robotic manipulation,’’ 2022, arXiv:2209.05451.
[15] M. Shridhar, L. Manuelli, and D. Fox, ‘‘CLIPort: What and where pathways for robotic manipulation,’’ in Proc. Conf. Robot. Learn., 2022, pp. 894–906.
[16] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, ‘‘VIMA: General robot manipulation with multimodal prompts,’’ 2022, arXiv:2210.03094.
[17] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, ‘‘Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,’’ in Proc. Int. Conf. Mach. Learn., 2022, pp. 9118–9147.
[18] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, ‘‘Inner monologue: Embodied reasoning through planning with language models,’’ 2022, arXiv:2207.05608.
[19] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, ‘‘Code as policies: Language model programs for embodied control,’’ 2022, arXiv:2209.07753.
[20] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, ‘‘ProgPrompt: Generating situated robot task plans using large language models,’’ 2022, arXiv:2209.11302.
[21] J. Li, D. Li, S. Savarese, and S. Hoi, ‘‘BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,’’ in Proc. Int. Conf. Mach. Learn., 2023, pp. 19730–19742.
[22] H. Liu, C. Li, Q. Wu, and Y. J. Lee, ‘‘Visual instruction tuning,’’ in Proc. Adv. Neural Inf. Process. Syst., 2023, pp. 1–17.
[23] S. Shah, D. Dey, C. Lovett, and A. Kapoor, ‘‘AirSim: High-fidelity visual and physical simulation for autonomous vehicles,’’ in Proc. Field Service Robot., Results 11th Int. Conf., Cham, Switzerland: Springer, 2018, pp. 621–635.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[25] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, ‘‘Habitat: A platform for embodied AI research,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9338–9346.
[26] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, ‘‘Robots that use language,’’ Annu. Rev. Control, Robot., Auto. Syst., vol. 3, pp. 25–55, Jan. 2020.
[27] J. Arkin, D. Park, S. Roy, M. R. Walter, N. Roy, T. M. Howard, and R. Paul, ‘‘Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions,’’ Int. J. Robot. Res., vol. 39, nos. 10–11, pp. 1279–1304, Sep. 2020.
[28] M. R. Walter, S. Patki, A. F. Daniele, E. Fahnestock, F. Duvallet, S. Hemachandra, J. Oh, A. Stentz, N. Roy, and T. M. Howard, ‘‘Language understanding for field and service robots in a priori unknown environments,’’ 2021, arXiv:2105.10396.
[29] J. Fu, A. Korattikara, S. Levine, and S. Guadarrama, ‘‘From language to goals: Inverse reinforcement learning for vision-based instruction following,’’ 2019, arXiv:1902.07742.
[30] P. Goyal, R. J. Mooney, and S. Niekum, ‘‘Zero-shot task adaptation using natural language,’’ 2021, arXiv:2106.02972.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–12.
[32] F. Giuliari, I. Hasan, M. Cristani, and F. Galasso, ‘‘Transformer networks for trajectory forecasting,’’ in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 10335–10342.
[33] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, ‘‘Decision transformer: Reinforcement learning via sequence modeling,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1–19.
[34] M. Janner, Q. Li, and S. Levine, ‘‘Offline reinforcement learning as one big sequence modeling problem,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1273–1286.
[35] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, ‘‘Masked autoencoders are scalable vision learners,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 15979–15988.
[36] R. Bonatti, S. Vemprala, S. Ma, F. Frujeri, S. Chen, and A. Kapoor, ‘‘PACT: Perception-action causal transformer for autoregressive robotics pre-training,’’ 2022, arXiv:2209.11133.
[37] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, ‘‘CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,’’ 2022, arXiv:2203.10421.
[38] M. Ahn et al., ‘‘Do as I can, not as I say: Grounding language in robotic affordances,’’ 2022, arXiv:2204.01691.
[39] P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox, ‘‘Correcting robot plans with natural language feedback,’’ 2022, arXiv:2204.05186.
[40] A. Brohan et al., ‘‘RT-1: Robotics transformer for real-world control at scale,’’ 2022, arXiv:2212.06817.
[41] A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, and P. Florence, ‘‘Socratic models: Composing zero-shot multimodal reasoning with language,’’ 2022, arXiv:2204.00598.
[42] S. J. Russell, Artificial Intelligence: A Modern Approach. London, U.K.: Pearson Education, Inc., 2010.
[43] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, ‘‘GPTQ: Accurate post-training quantization for generative pre-trained transformers,’’ 2022, arXiv:2210.17323.
[44] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, ‘‘QLoRA: Efficient finetuning of quantized LLMs,’’ in Proc. Adv. Neural Inf. Process. Syst., 2023, pp. 1233–1249.

SAI H. VEMPRALA (Member, IEEE) received the B.Tech. degree in electrical engineering from JNT University, India, the M.S. degree in electrical engineering from Arizona State University, in 2013, and the Ph.D. degree in robotics from Texas A&M University, in 2019. From 2019 to 2023, he was a Senior Researcher with Microsoft Research. He is the Co-Founder of Scaled Foundations. He has over 25 peer-reviewed publications. His research interests include perception and planning for robotics, multimodal representation learning, large language models, and simulation. He actively serves as a reviewer and area chair for several robotics and machine learning conferences.

ROGERIO BONATTI (Member, IEEE) was born in São Paulo, Brazil. He received the B.S. degree in mechatronics engineering from the University of São Paulo and the Ph.D. degree in robotics from the School of Computer Science, Carnegie Mellon University, in 2021. He is currently a Senior Researcher with the Applied Sciences Group (ASG), Microsoft. His work focuses on multimodal foundational models for decision-making. He creates generative machine-learning models that fuse language, vision, and other features to allow systems to take the best actions over time. Much of his past work before joining ASG was in the robotics space, where he deployed autonomous systems on multiple manipulators, virtual and real embodied agents, all the way to flying robots. His work has been awarded a Best Student Paper Finalist nomination (IROS 2020), a Microsoft Research Dissertation Grant, a Siebel Scholarship, and a Swartz Entrepreneurship Fellowship.

ARTHUR BUCKER (Student Member, IEEE) received the master's degree from the Technical University of Munich. He is currently pursuing the Ph.D. degree in robotics with Carnegie Mellon University. He is a Roboticist and an AI Researcher. His academic journey has been marked by an exploration of autonomous systems, AI, and robotics. His current scholarly endeavors are centered within the domain of robotic learning, with a specific emphasis on leveraging multimodal human–robot interaction to facilitate advanced and efficient cognitive processes in robotic systems.

ASHISH KAPOOR received the Ph.D. degree in computer science from the MIT Media Laboratory. He is currently the CEO and the Co-Founder of Scaled Foundations. Prior to Scaled Foundations, he was the General Manager for autonomous systems research with Microsoft, focusing on building safe AI systems, specifically aerial robotics and its applications in areas that positively influence society. His research interests include machine learning, computer vision, and robotics, while touching on various disciplines of computer science that include quantum computation, systems, formal methods, programming languages, and human–computer interaction.