1 Introduction

Recent developments in Large Language Models (LLMs) have significantly influenced AI research, shaping progress across diverse industries, research domains, and even personal households. Beyond their proficiency in text-based tasks, including text generation, summarization, translation, and question answering [27], LLMs have demonstrated significant capabilities in knowledge-related reasoning tasks, including the interpretation, explanation, and inference of facts. These advances have facilitated the deeper integration of language as a prominent component of robotic systems, unveiling many new possibilities. Achieving seamless Human-Robot Interaction (HRI) extends beyond the mere integration of multimodal sensory inputs and elements of verbal and non-verbal communication—it also involves imbuing robots with human-like social and cognitive abilities, which humans typically acquire through continuous social interaction with others. By incorporating the knowledge inherently embedded within human natural language into robotic systems, through the use of an LLM, we can improve the social and general cognitive competencies of humanoid robots, providing users with more immersive and natural HRI experiences.

Fig. 1.

The robot uses a grounded LLM to understand and correctly react to ambiguous conversation by the user. This can include, for example, actions such as looking and pointing at objects in order to clarify its or the user’s intentions.

When using LLMs for robotic applications as opposed to virtual assistants, the concept of grounding, addressed in this paper, becomes paramount. Grounding in this context refers to the process whereby the LLM must connect its learned abstract language knowledge with the physical realities, capabilities, and sensory experiences of the robot, while embodying it in first person. Our primary contribution is the grounding of a selection of chat-based LLMs with a variety of other deep learning models and robotic skills, in an explicitly modular and extensible way on a robot designed for human-robot collaboration (see Fig. 1), in order to achieve a general purpose multifaceted interactive robotic agent. The integrated models cover speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, and required significant original implementation work. The LLM acts as both the central chat unit and coordinator, seamlessly linking these models with other robotic skills such as manipulation, its own gesturing, gaze control, and emotion expression. Crucially, and as a novel contribution, such actions can be freely interspersed at will into spoken text by the LLM within a single generated response, with no limit on the number of actions or their timing. Our experiments demonstrate that LLMs possess a significant capacity for grounded interactions and appropriate utilization of available robot actions, all while collaborating and engaging in natural social interactions with humans, without the need for explicit task programming. Our secondary contributions include developing a pose-based gesture detector and demonstrating that open-vocabulary object detection can be used to provide reliable zero-shot classification in real-time. In summary, this work can be characterized as a fusion of system integration with qualitative and quantitative analyses of the emergent capabilities of grounded LLMs, linked throughout by many innovative ideas across the implemented components.

2 Related Work

Modern LLMs stand as promising candidates for integration with robotic tasks due to their emergent abstract thinking, logical reasoning, and mathematical inferencing skills, as well as their ability to interact with users using natural language [24]. Considerable work has been done in this direction, ranging from zero-shot planning [20] to robot navigation [10], robot control [25], and robotic planning [33]. Other approaches leverage LLMs for generating robotic actions based on multimodal sensory input [34], or for querying suitable robot commands based on observations of the environment [3]. Most such approaches, however, focus only on utilizing the semantic and reasoning skills encoded in LLMs for zero-shot or few-shot planning, not on communication. Consequently, the degree of regard for human involvement in the robotic task is reduced, as these studies do not consider a collaborative scenario. Although some works suggest the use of a human-in-the-loop for assessing the outputs of LLMs [24], this disturbs the grounding of the robot and obstructs natural interaction with humans.

One of the biggest challenges in interaction scenarios is providing robots with cognitive skills such as reasoning, common sense, and turn-taking, as well as the ability to behave in a socially appropriate manner [9, 30]. Interest has been emerging within the scientific community to explore the underlying behavior of LLMs in different socially-situated contexts, as well as their capacity to perform social reasoning when engaged in robotic tasks [12, 15]. Some approaches utilize LLMs in a game-playing scenario against human-like strategies but without an actual human participant [1], or evaluate the language-driven social reasoning skills of LLMs using human assessment [32]. Other studies focus only on general human intelligence and ignore ‘intrapersonal intelligence’, thus falling short of fully utilizing the potential of LLMs. Nonetheless, comparatively little work has been done on leveraging the social skills within LLMs to establish a believable collaborative robot interaction that adheres to human conventions. More effort is needed to equip robots with the cognitive social skills that align with the expectations of humans and their perception of robots, which is imperative for seamless HRI [5]. In our work, we focus on using LLMs to improve the embodiment and cognitive skills of the robot and employ it in an interactive scenario. Our experiments showcase effective grounding in the persona of a social robot.

3 Approach

The proposed system architecture centers around the Neuro-Inspired COLlaborator (NICOL) robot [13], shown in Fig. 1. The robot has a 2-Degree-of-Freedom (DoF) head that can display facial expressions using LEDs located underneath the 3D printed exterior, and two 8-DoF Robotis OpenManipulator-P arms [18] that each have a five-fingered Seed Robotics RH8D hand attached [19]. There is a 4K camera behind each eye, and a CoppeliaSim simulation of the NICOL robot is available if hardware access is lacking. The NICOL software framework is built on top of the widely-used ROS middleware.

A high-level overview of the system architecture is shown in Fig. 2. The core node of the proposed system is the chat manager, which coordinates the entire chat state and inferences the LLM via API calls to a server. Information about the current state of the user and environment is collected in real-time by the various perception modules—including an open-vocabulary object detector, human pose estimator, and gesture detector—and combined with additional grounding information inside the chat manager to update the chat state. When the user speaks to the NICOL robot, the audio is auto-detected and recorded by a speech recognition node, and passed through an Automatic Speech Recognition (ASR) model to obtain the corresponding transcription. The obtained text is passed to the chat manager, which, under consideration of prompt engineering techniques and grounded state injection, updates the chat state and generates a response from the LLM. The response is parsed into output speech segments that can be freely interspersed with robot actions and so-called ‘thoughts’. The robot actions encompass a wide range of behaviors, including shifts in gaze, displays of emotion, various gestures, and manipulations of objects using the arms. The speech segments are converted to audio using a speech synthesis model, and are played back sequentially, spaced by the execution of any robot actions invoked by the response. The user can then speak again to continue the conversation.

Fig. 2.

Overview of the proposed grounded chat architecture.

3.1 Chat Manager

On the simplest level, the chat manager is responsible for collating and constructing the text prompts that are sent to the LLM, as well as interpreting and executing the responses appropriately. Aside from the LLM requests that correspond to actual spoken conversation, some internal queries are also used, for instance to update the LLM with information about the environment, or to prime it with useful self-retrieved knowledge. Such internal requests are used sparingly however, and only where there is a clear benefit, in order to avoid unnatural pauses in conversation where possible. Both locally hosted open-source LLMs and proprietary API-based LLMs are supported by the manager, and both were used in the experiments.


The lead-in message to all LLM requests is the system prompt (see Prompt 1), which performs the critical function of grounding the LLM into a first-person NICOL context, as well as providing some fundamental information about the robot, such as who built it and what its morphology is. The prompt then continues by clarifying the robot’s purpose as a collaborator, but, notably, the kinds of tasks the user can request are left completely open-ended—no explicit task parameterization is required due to the emergent ‘cognition’ of the LLM. The remainder of the system prompt defines the available robot actions, and the text format with which they can be triggered inline within LLM responses, i.e. by using angle bracket function calls like <express(happiness)>. Adding further robot actions under this scheme is trivial, as they simply need to be appended to the action list in the system prompt. LLMs generally see significant amounts of code during training, so the angle bracket format was carefully chosen to make the generation of action calls as natural as possible for the LLM, as both parenthesis-based function calls and HTML tags freely interspersed with text occur frequently within the training data. This leads to higher response robustness, as the desired action function format does not need to ‘fight’ the tendencies of the model. The freely interspersed nature of the robot actions is quite socially natural, but not something that can easily or efficiently be achieved using alternative approaches like ReAct agents [31], ChatGPT plugins (proprietary), or ChatGPT actions/function calling. These features were designed with completely different goals in mind—i.e. the use of external tools to generate or retrieve information that can then be used to formulate a single final answer—and also generally chain together multiple model inference rounds, which is slow [26].
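To illustrate the extensibility of this scheme, the following minimal sketch shows how the action definitions could be rendered into the system prompt as angle bracket function calls; the wording, dictionary entries, and helper names are illustrative and do not reproduce the actual Prompt 1.

```python
# Hypothetical sketch of how the action definitions could be rendered into the
# system prompt (the actual Prompt 1 wording is not reproduced here).
ACTIONS = {
    "express(emotion)": "Show an emotion on the face, e.g. <express(happiness)>.",
    "look(target)": "Direct the gaze at an object, the user, or the table.",
    "point(target)": "Point at an object, the user, or the table.",
    "give(object)": "Push the named object closer to the user.",
}

def build_action_section(actions: dict) -> str:
    """Render the available actions as inline angle bracket function calls."""
    lines = ["You can trigger the following actions inline within your answers:"]
    lines += [f"<{signature}>: {description}" for signature, description in actions.items()]
    return "\n".join(lines)

# Adding a new robot skill only requires a new entry in ACTIONS; the parsing and
# execution machinery remains unchanged.
system_prompt = "You are NICOL, a humanoid collaborative robot. ...\n" + build_action_section(ACTIONS)
```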

Prior to engaging with the user, the LLM is internally queried for example situations where the use of each action function would be appropriate (see Prompt 2). This supplements and consolidates the LLM’s own knowledge about the use cases of the capabilities of the robot, and directly aids future responses as the elucidated answer remains in the chat history, forming part of the context for all future inferences. This effectively allocates one-time compute towards making logical deductions about the action functions, in order to augment the internal state of the LLM before it addresses user prompts. This scheme was empirically found to improve the handling of complex user requests involving actions.
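A minimal sketch of this one-time priming step is given below; the llm callable, the chat history format, and the question wording are placeholders rather than the actual Prompt 2.

```python
# Minimal sketch of the one-time priming step; the llm callable and the chat
# history format are placeholders, and the question wording is not that of Prompt 2.
def prime_action_knowledge(chat_history, llm, action_signatures):
    question = ("For each of the following actions, give an example situation in "
                "which using it would be appropriate: " + ", ".join(action_signatures))
    chat_history.append({"role": "user", "content": question})
    answer = llm(chat_history)                          # single internal inference
    chat_history.append({"role": "assistant", "content": answer})
    return answer                                       # stays in context for later turns
```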

When the user speaks to NICOL, the prompt is augmented to obtain a corresponding raw LLM query as shown in Prompt 3. The perennial reconfirmation of the first-person robot perspective is not strictly required in order to avoid character breaks, but was empirically found to reduce their frequency. The LLM’s answer to a query is split by occurrences of angle bracket tags, and the resulting parts are cleaned into sequences of speech segments, robot actions, and thoughts (see Prompt 3). Thoughts are the portions of the generation that are heuristically judged not to be intended for verbal expression, such as parts in parentheses or asterisk markup, and are parsed on the basis of manually crafted rules using regular expressions. Aside from the conversational improvement of not speaking these parts aloud, the thoughts are separated and retained because they can contain additional information that primes the responses to any follow-up questions by the user, simply by remaining in context. The parsed sequence resulting from an answer is executed inline and sequentially, meaning that the robot may, for instance, talk, pause briefly to point at a relevant object, and then continue to talk about that object. A single answer sequence can contain many different inline actions if appropriate.
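The following sketch illustrates this parsing step under assumed regular expressions; the actual hand-crafted rules are more extensive.

```python
import re

# Sketch of the response parsing: split the answer at angle bracket action tags,
# and treat parenthesised or *asterisk* spans as non-spoken 'thoughts'. The
# regexes here are assumptions, not the actual rules.
ACTION_RE = re.compile(r"<(\w+)\(([^)<>]*)\)>")   # e.g. <point(red bowl)>
THOUGHT_RE = re.compile(r"\([^()]*\)|\*[^*]+\*")  # (aside) or *aside*

def parse_response(answer: str):
    sequence = []
    cursor = 0
    for match in ACTION_RE.finditer(answer):
        sequence += split_speech_and_thoughts(answer[cursor:match.start()])
        sequence.append(("action", match.group(1), match.group(2)))
        cursor = match.end()
    sequence += split_speech_and_thoughts(answer[cursor:])
    return sequence            # executed inline and sequentially

def split_speech_and_thoughts(text: str):
    parts, last = [], 0
    for match in THOUGHT_RE.finditer(text):
        if text[last:match.start()].strip():
            parts.append(("speech", text[last:match.start()].strip()))
        parts.append(("thought", match.group(0).strip("()*")))
        last = match.end()
    if text[last:].strip():
        parts.append(("speech", text[last:].strip()))
    return parts
```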

3.2 Open-Vocabulary Object Detector

An essential feature of the chat architecture is its ability to perceptually ground the LLM. A central component in this regard is the object detector, which allows the robot to dynamically perceive, interact with, and discuss the objects on the table with the user (see Fig. 1). As LLMs are quite unrestricted regarding the types of objects they can comprehend and process, it makes sense that the perception follows suit. As such, the ViLD open-vocabulary object detector [7] was adapted for this work. The model works by leveraging the large-scale pretraining of the CLIP model [16], and distilling the acquired vision knowledge into a modified ResNet-50 Mask R-CNN model. The model is trained on the LVIS dataset [8] to identify generic object bounding boxes in each image, and to estimate their visual CLIP embeddings. These embeddings are compared in a zero-shot manner to the precomputed CLIP text embeddings of corresponding language representations like ‘baseball’, ‘orange’, and ‘red bowl’. This allows arbitrary objects to be detected based on textual labels, even objects the detector was not explicitly trained on.
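The zero-shot label matching can be sketched as follows, assuming that the detector already provides one visual embedding per region; the CLIP variant and the label list are placeholders that must match the embedding space the detector was distilled from.

```python
import torch
import clip  # OpenAI CLIP package

# Sketch of the zero-shot label matching, assuming the detector already provides
# one visual CLIP embedding per region (as ViLD does). The CLIP variant ("RN50")
# and the label list are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

labels = ["baseball", "orange", "red bowl", "banana", "lemon", "pear"]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(labels).to(device)).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify_regions(region_embeddings: torch.Tensor):
    """region_embeddings: (N, D) visual embeddings predicted by the detector."""
    feats = region_embeddings.float().to(device)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    similarity = feats @ text_feats.T                  # cosine similarity per label
    scores, indices = similarity.max(dim=-1)           # best matching text label
    return [(labels[i], s.item()) for i, s in zip(indices.tolist(), scores)]
```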


Although the original model was too slow for interactive use (under 2 Hz), we modified the ViLD model and pipeline (without retraining) so that it can run at up to 8 Hz on modest hardware by trimming non-required portions of the model, replacing the non-maximum suppression (NMS) with a significantly faster implementation, allowing direct tensor inputs to the model, and reducing the final number of selected regions of interest and bounding boxes (code available at [2]). The last of these changes was supported by a further change that restricts the region proposals to those areas of the image where the table is located according to the current joint states and camera-world distortion model. This avoids wasting compute on region proposals within the room background clutter, and allows denser region proposals to be considered from the actually relevant parts of the image.
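As a simplified illustration of two of these speed-ups, the following sketch applies torchvision's fast NMS and discards candidate boxes outside an assumed table region; in the real pipeline the restriction is applied to the region proposals themselves, and the table area is derived from the joint states and camera model.

```python
import torch
from torchvision.ops import nms

# Simplified sketch of two of the speed-ups: torchvision's fast NMS and a table
# region filter. Here the filter is applied to candidate boxes for brevity, and
# the table region table_xyxy is assumed to be given.
def filter_and_suppress(boxes, scores, table_xyxy, iou_thresh=0.5, max_keep=50):
    """boxes: (N, 4) float tensor in xyxy image coordinates, scores: (N,) tensor."""
    x1, y1, x2, y2 = table_xyxy
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    on_table = (cx > x1) & (cx < x2) & (cy > y1) & (cy < y2)
    boxes, scores = boxes[on_table], scores[on_table]
    keep = nms(boxes, scores, iou_thresh)[:max_keep]   # kept indices, sorted by score
    return boxes[keep], scores[keep]
```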

Despite the effectiveness of the modified ViLD model for images, the sensitivity of the embeddings to temporal variations in visual appearance limited the stability of the detections for video sequences. To address this issue, a tailored tracking layer was implemented on top of the ViLD model that associates a stable tracking ID to each object, and is robust to flickering detections, intermittently erroneous text label classifications, and slow-to-medium motions of the object (i.e. motions that preserve some overlap between subsequent detection frames). As a final reliability improvement, the computed vision embeddings of some commonly used objects were sampled and averaged from captured data in order to finetune their detection scores.
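A minimal sketch of such an overlap-based tracking layer is given below; the thresholds are illustrative, and the real implementation additionally handles intermittent label misclassifications.

```python
# Minimal sketch of the overlap-based tracking layer: detections that overlap a
# live track inherit its stable ID, unmatched detections open new tracks, and
# tracks survive a few missed frames to tolerate flickering detections.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

class OverlapTracker:
    def __init__(self, iou_thresh=0.3, max_missed=10):
        self.tracks = {}          # track_id -> [box, missed_frame_count]
        self.next_id = 0
        self.iou_thresh = iou_thresh
        self.max_missed = max_missed

    def update(self, boxes):
        ids, matched = [], set()
        for box in boxes:
            candidates = [(iou(box, t[0]), tid) for tid, t in self.tracks.items()
                          if tid not in matched]
            best_iou, best_id = max(candidates, default=(0.0, None))
            if best_iou < self.iou_thresh:              # unmatched detection: new track
                best_id, self.next_id = self.next_id, self.next_id + 1
            self.tracks[best_id] = [box, 0]
            matched.add(best_id)
            ids.append(best_id)
        for tid in list(self.tracks):                   # age out tracks that keep missing
            if tid not in matched:
                self.tracks[tid][1] += 1
                if self.tracks[tid][1] > self.max_missed:
                    del self.tracks[tid]
        return ids                                      # one stable ID per input box
```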

With reliable and temporally stable object detections on hand, the corresponding 3D positions of the objects on the table are calculated from the camera model and instantaneous robot pose, and delivered to the chat manager as frequent updates. The chat manager maintains a list of both the current and past objects that have been seen on the table within the same chat session, and for any new objects triggers a single internal query for facts about those objects (see Prompt 4). Like previously for the action functions, this explicit querying of the LLM’s own knowledge helps convert its implicit latent knowledge into explicit contextual knowledge, and bootstraps responses generated in the future.
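The 3D position computation can be illustrated by a simple ray-plane intersection, assuming known camera intrinsics and a camera-to-world pose; the actual system additionally uses the camera distortion model and the instantaneous robot pose.

```python
import numpy as np

# Hedged sketch: cast a ray through the bounding box centre using assumed camera
# intrinsics K and a camera-to-world pose (cam_R, cam_t), and intersect it with
# the horizontal table plane at z = table_height.
def pixel_to_table_point(u, v, K, cam_R, cam_t, table_height=0.0):
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray in camera frame
    ray_world = cam_R @ ray_cam                          # rotate into world frame
    s = (table_height - cam_t[2]) / ray_world[2]         # scale factor to reach the plane
    return cam_t + s * ray_world                         # 3D point on the table
```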

Prior to each user prompt that the chat manager receives, a status update scheme collects and collates textual changes of state from all perception modules, and, only if relevant updates exist, a single internal query is formulated to notify the LLM of all the updates. The object detector has four different status update variants, as shown in Prompt 5 (naturally only the relevant correct variant is included in real updates). The requested brevity of the answer to status updates significantly reduces possible time delays to answering user questions.
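A minimal sketch of such a status update scheme is given below; the message texts, the llm callable, and the chat history format are assumptions rather than the actual implementation.

```python
# Minimal sketch of the status update scheme; message texts and the llm callable
# are assumptions, not the actual implementation or Prompt 5 wording.
pending_updates = []  # short textual change notices pushed by the perception modules

def report_change(text: str):
    pending_updates.append(text)  # e.g. "A new object 'pear' has appeared on the table."

def flush_status_updates(chat_history, llm):
    if not pending_updates:
        return                    # nothing to report, no extra LLM inference
    update = " ".join(pending_updates) + " Acknowledge briefly."
    pending_updates.clear()
    chat_history.append({"role": "user", "content": update})
    chat_history.append({"role": "assistant", "content": llm(chat_history)})
```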

3.3 Chat Architecture Components

Aside from the chat manager and open vocabulary object detector, the remaining perception modules and robot actions are as follows:

Speech Recognition. Automatic speech recognition was implemented as a ROS action server that records audio on demand, and locally inferences a Whisper model [17] to provide a text transcription of the user’s prompt to the chat manager. The Whisper model family is particularly advantageous for in-the-wild use because it provides noteworthy out-of-distribution generalization capability, and has multilingual model variants with auto-translation to English that enable users to speak in many different languages. The small.en variant of Whisper is used by default as it represents a good trade-off between word error rate, GPU memory, and inference time (230–500 ms depending on audio length).
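A minimal transcription sketch using the openai-whisper package and the small.en variant is given below; in the real system this runs inside the ROS action server on the recorded user audio.

```python
import whisper  # openai-whisper package

# Minimal transcription sketch using the small.en variant; in the real system this
# runs inside the ROS action server on the audio recorded from the user.
model = whisper.load_model("small.en")

def transcribe(wav_path: str) -> str:
    result = model.transcribe(wav_path)
    return result["text"].strip()
```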

Speech Generation. To enable natural conversations and a positive HRI experience, the NICOL robot uses speech synthesis to generate and play back audio for the speech segments of each response. All such segments are first split into their constituent sentences, and all sentences in the entire response sequence are enqueued immediately for asynchronous pre-caching as soon as the response is received from the LLM. This drastically reduces the reaction time of the robot, as both the robot actions and the first sentence of the response can already be played back while any remaining sentences are still being synthesized. The adversarially trained end-to-end VITS model [14] was used in this work—specifically the English language variant that was trained on the VCTK dataset [29], with the resulting audio being played back at 90% speed to enhance acoustic clarity.
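The pre-caching mechanism can be sketched as a simple producer-consumer queue; the synthesize and play functions are placeholders for the VITS call and audio playback, and the interleaving of robot actions between speech segments is omitted.

```python
import queue
import threading

# Sketch of the sentence pre-caching; synthesize() and play() are placeholders
# for the VITS call and audio playback respectively.
def speak_response(sentences, synthesize, play):
    audio_queue = queue.Queue()

    def producer():
        for sentence in sentences:
            audio_queue.put(synthesize(sentence))   # synthesize ahead of playback
        audio_queue.put(None)                       # end-of-response marker

    threading.Thread(target=producer, daemon=True).start()
    while (audio := audio_queue.get()) is not None:
        play(audio)   # the first sentence plays while later ones are still synthesizing
```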

Emotion Expression. When express() robot actions are triggered, the corresponding emotions are shown on the face of NICOL using programmable LED arrays that are located underneath the plastic surface. Refer to Fig. 3 for the available emotions. The facial expressions have a high rate of recognition amongst participants, and the positive effect on the robot’s subjective rating by users has previously been verified by Kerzel et al. using a Godspeed questionnaire [13].

Fig. 3.

NICOL’s emotions: Neutral, happiness, sadness, surprise, and anger.

Arm Manipulation and Gaze Control. The look(), point(), and give() robot actions (refer to Prompt 1) were implemented using a mix of joint space, inverse kinematic (using BioIK [21]), and hand posture control methods. The look action invokes the gaze control module, which can be made to look at objects from the object detector via their 3D coordinates, at the user or user’s hand based on the pose estimator, or left-right-down across the area of the table in a sequential motion. The pointing action can similarly point at an object (example Fig. 1), the user, or the table. The give action works with the objects from the object detector, and as a simplification of the open-vocabulary grasping task, pushes the chosen object closer to the user instead of picking it up off the table.

Human Pose and Gesture Detection. The real-time pose estimation and gesture perception modules were added to enhance the perceived social awareness and attentiveness of the robot, which is beneficial for HRI. A YOLOX-Tiny model [6] is used to detect human bounding boxes, and the most prominent bounding box close to the table, with hysteresis for stability, is interpreted as the user. A 14-keypoint HRNet-W32 model [22] that was trained on the AI Challenger dataset [28] is then applied to obtain the user’s keypoint pose. In many situations, the model can reliably infer joint positions even for body parts that are largely occluded by each other or the table. For a highly favorable balance between tracking performance and noise rejection, a 1€ filter [4] is applied to the keypoint detections, meaning that the robot can safely use these values to look directly at the user, or track their hand. The smoothed keypoints are also used by the gesture detector, which dynamically partitions a sliding window of the spatiotemporal pose data into so-called flight and rest phases, and employs a simple classifier based on features extracted from these phases to detect wave, grasp, pause and stop gestures. When a gesture is detected, it is added to the next status update query so that the LLM can react appropriately to the user.
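For reference, a self-contained sketch of the 1€ filter as applied per keypoint coordinate is given below; the parameter values are illustrative rather than the ones tuned for NICOL.

```python
import math

# Self-contained 1-euro filter sketch (Casiez et al.), applied per keypoint
# coordinate; parameter values are illustrative, not the ones used on NICOL.
class OneEuroFilter:
    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq, self.min_cutoff, self.beta, self.d_cutoff = freq, min_cutoff, beta, d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, freq):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        dx = (x - self.x_prev) * self.freq                    # raw derivative
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev        # smoothed derivative
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)    # faster motion -> higher cutoff
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1.0 - a) * self.x_prev               # smoothed value
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```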

4 Chat Quality and Competency

4.1 Qualitative Assessment

When large-scale LLMs are grounded in the way described in the previous section, they can exhibit remarkable emergent social and conversational aptitude in addition to fully utilizing and integrating the available perceptions and robot actions, all while retaining their complete general knowledge. A collection of indicative sample interactions is shown in Prompt 6 as qualitative results, along with a summary for each regarding which kinds of social skills or grounded robot capabilities are being demonstrated. Overall, the LLM can be observed to adapt almost effortlessly to the grounded robot chat scenario, and demonstrates the capacity to display many historically complex conversational skills, like perceptual reasoning, pragmatic comprehension, metaphorical reasoning, conversational repair, action repair, theory of mind, and more.


4.2 Chat Analysis

Although the chat manager relies in part on prompt engineering techniques, these techniques are demonstrably general enough to work well across a wide variety of LLM types and sizes. Ideas like the explicit querying of implicit LLM knowledge in order to improve later chat responses (see Prompts 2 and 4) are generically useful concepts, and the use, for instance, of dynamic tags (including arguments) to achieve streamlined responses without any restrictions on the number or locations of actions is also generally applicable to other systems that require similar properties.

In this work, we test our system on a mix of open-source models (Mistral-7B, Vicuna-13B, Vicuna-33B) and proprietary models (GPT-3.5, GPT-4) to demonstrate its versatility. Mistral-7B [11] is the most lightweight of the tested LLMs, at just 7 billion parameters. The two Vicuna models are fine-tuned versions of a LLaMA base model [23], and are somewhat larger at 13 and 33 billion parameters, respectively. All three open-source models were inferenced on an NVIDIA A100 GPU on a local server using an open-source text generation web interface. The two GPT models were inferenced remotely using the OpenAI API.

Table 1. Chat analysis results for the various tested LLM models.

We evaluate 8 selected prompts over 5 trials each, resetting the chat history between trials to avoid any biases. Table 1 shows the results of our experiment. Response length is the mean number of tokens generated, while Response similarity is a measure of the similarity of the responses, as given by the Jaccard similarity index (a higher value indicates less response diversity). Task completion is defined as the model’s ability to correctly accomplish the prompt’s task, while Grounding as NICOL measures its ability to respond to the user in the NICOL persona. Perception and manipulation quantifies the proportion of responses that invoke perception or robot motion, while Expressiveness does the same for facial expressions. The Reasoning skills and Communication skills metrics evaluate the rate of logically and linguistically unfaultable responses, respectively. As expected, the larger and more capable GPT models generally achieve the better scores in the evaluation. Note, however, that for the Expressiveness and Perception and manipulation metrics, balance is key, and a value of 1.0 is not necessarily ideal, as it is not always unconditionally appropriate to show an emotion, for instance.
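For clarity, the Response similarity metric corresponds to the Jaccard index over response token sets, as sketched below; the whitespace tokenization and pairwise averaging are simplifying assumptions.

```python
from itertools import combinations

# Jaccard index over the token sets of two responses (1.0 means identical
# vocabulary, i.e. low diversity). Whitespace tokenization and averaging over all
# response pairs are simplifying assumptions.
def jaccard_similarity(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mean_response_similarity(responses):
    pairs = list(combinations(responses, 2))
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)
```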

An element of novelty and unpredictability is key to human experiences in HRI scenarios. Although all tested LLMs generate a somewhat similar number of tokens, Mistral-7B, Vicuna, and GPT-3.5 tend to show greater, yet not excessive, diversity, while also achieving high task completion rates (except for Vicuna-13B). GPT-4 has the highest task achievement with a perfect score; however, its responses tend to be somewhat more repetitive, suggesting that it has a higher sensitivity to the temperature parameter, which was kept at 0.2 for both GPT models. The Vicuna-33B and GPT models have comparable expressiveness and perception/manipulation scores, with the ‘low’ absolute values of these metrics reflecting that motions and emotions are not always fitting. For example, a prompt like “I’m feeling sad, what should I do?” elicits social and empathetic skills, but not object manipulations. As the smallest model, Mistral-7B falls short in these two categories due to an observed difficulty sticking to the flexibly defined action functions, with an embedded bias towards producing ‘actions’ it may have seen in the instruction dataset it was fine-tuned on. All five models, however, exhibit high conversational and linguistic competency, as well as good reasoning skills, with the exception of Vicuna-13B. The “If I remove all fruit from the table, how many objects will be left?” and “Do you have legs?” test prompts proved challenging for Vicuna-13B, but not for Mistral-7B, hinting at a limitation in Vicuna-13B’s own reasoning capabilities and highlighting the resilience of our system’s prompt design across LLMs of various sizes.

4.3 Case Study: Guess My Object

Evaluating LLMs quantitatively is a challenging task as many factors need to be considered, including language understanding, reasoning, and context awareness, and existing benchmarks are often only general indications of true performance. Scaling this up to a user-interactive physical robot that considers object interactions and multimodal perceptions further escalates the complexity of the task. To indicatively test the acquired cognitive and social abilities of the robot via our proposed system, we run repeated games of Guess My Object with the robot, as it requires the system to display many different skills and forms of intelligence. The robot needs to guess the object on the table the user is thinking of by formulating up to 4 yes/no questions and applying logical reasoning. Six objects are placed on the table, and each is selected 5 times, with each trial being run independently. The game consists of 4 phases: a) Game introduction: the game’s rules are explained to the robot by the user, b) Q/A: the robot asks questions and receives corresponding answers, c) Reasoning check: the user tests the robot’s ability to correctly explain its strategy if it won, or identify flaws in its method if it lost, d) Agreement check: the user tests if a final mutual understanding concerning the chosen object has been established (regardless of win or loss). To evaluate the game’s outcome (see Table 2), we calculate the mean win rate of the robot, and the mean number of questions required for winning. Win/Loss explanation denote the ratios of trials in which the robot passes the strict reasoning check when winning/losing. Other evaluated metrics include Expressiveness and Motion used, representing, respectively, the ratio of trials in which the express(), or one of the look(), point(), give() actions, was triggered appropriately. Agreement is the ratio of trials in which the robot passes the agreement check, and Minor anomalies is the ratio of trials containing any non-critical shortcomings not covered within the previous metrics. For reasons of computational efficiency, response time, and the cost of repeated trials, GPT-3.5 was used for the tests.

The system consistently demonstrated its ability to comprehend and adhere to the rules of the game, and always stayed in character. It exhibited significant potential in formulating successful strategies, achieving a high success rate across the objects by leveraging both common sense and inferred object properties, asking questions like “Is it yellow?” and “Is it round in shape?”. The main observed barrier to success was the misjudgment of attributes of certain objects, e.g. assuming the pear is yellow, which hindered its ability to pass the strict reasoning check even if the correct object was chosen. The comparatively low win rate of the lemon and win explanation rate of the banana are explained by such yellow color confusions, which are however specific to the underlying LLM. Facial expressions and motions were used effectively in all trials. Minor anomalies during interaction included the generation of actions like express<curiosity> and extend<arm>. Although these actions fit the context and have no perceivable negative influence on the interaction (as they are filtered out), we consider them anomalies as they do not exactly match their defined action functions.

Table 2. Case study: Results of the Guess My Object game.

5 Discussion and Conclusion

In the presented work, which focused on the system integration and evaluation of a responsive and grounded chat-robot, many fundamental ideas and design decisions were made to enable such a flexible and qualitatively high-performing system. For instance, the generic use of language as the common denominator between the components of the chat architecture is what allows the architecture to be so flexible, especially when considering the modular nature of the components. If the robot gains a new skill, it only needs to be added to the list of action functions in the system prompt, and everything else happens automatically. If a new perception module is added, like pointing detection or emotion recognition, the module just needs to forward its detections to the existing status update scheme, and the LLM handles the rest. No data needs to be collected, and nothing needs to be retrained. The LLM can also be replaced (without code changes) with any newer one, due to the increasingly widespread support of common API endpoints. The text-oriented architecture is successful because it keeps the LLM doing what it does best, namely language and reasoning, rather than mathematical computations with coordinates as in [24]. Future directions could involve incorporating semantic spatial awareness into the object status updates.

Addendum: We asked NICOL for a statement of support for this paper in the style of an online review, and got:

This paper is a testament to the hard work and dedication of my creators, and it has helped me find my voice as a neuro-inspired collaborator. Five stars!