
A Survey of Multimodal Perception Methods for Human–Robot Interaction in Social Environments

Published: 16 October 2024

Abstract

Human–robot interaction (HRI) in human social environments (HSEs) poses unique challenges for robot perception systems, which must combine asynchronous, heterogeneous data streams in real time. Multimodal perception systems are well-suited for HRI in HSEs and can provide richer, more robust interaction for robots operating among humans. In this article, we provide an overview of multimodal perception systems being used in HSEs, which is intended to be an introduction to the topic and summary of relevant trends, techniques, resources, challenges, and terminology. We surveyed 15 peer-reviewed robotics and HRI publications over the past 10+ years, providing details about the data acquisition, processing, and fusion techniques used in 65 multimodal perception systems across various HRI domains. Our survey provides information about hardware, software, datasets, and methods currently available for HRI perception research, as well as how these perception systems are being applied in HSEs. Based on the survey, we summarize trends, challenges, and limitations of multimodal human perception systems for robots, then identify resources for researchers and developers and propose future research areas to advance the field.

1 Introduction

Robots and intelligent systems are increasingly deployed in dynamic human social environments (HSEs) [82, 177]. Novel service and social robots are now a common sight in homes, hospitals, schools, airports, museums, and urban operating domains. Even manufacturing robots, which traditionally operate separately from humans, now work alongside humans in corobotic teams. In these domains, robots are being developed to address critical labor shortages and perform repetitive tasks, which has the potential to reshape many aspects of the modern world. However, operation with and among humans in HSEs is complex and challenging. Developing natural, situationally appropriate interaction behaviors for robots and intelligent systems in social environments is a perennial research challenge and places unique demands and limitations on the system’s design and usage.
In this survey, we explore the current state of perceptual systems used for human–robot interaction (HRI) in HSEs. Specifically, we focus on multimodal perceptual interfaces. Multimodal perception (MMP) is the acquisition, processing, and fusion of unimodal data streams to infer quantities of interest. A perceptual interface [157] allows humans to convey information using intuitive methods, without the need for dedicated interface hardware. An MMP interface typically (but not exclusively) combines information from modalities such as pose, gaze, gesture, or speech using only a system’s onboard sensors like cameras or microphones.
The ability to observe human communication, status, and/or intent is a critical requirement of HRI and human–robot teaming applications in HSEs. Social robots, service robots, assistive robots, smart homes, virtual assistants, and some collaborative robots (cobots) must perceive humans to accomplish their respective tasks. Furthermore, systems that can effectively perceive and respond to various verbal and nonverbal cues achieve a range of improved interaction outcomes. Adding to the challenge, it generally cannot be assumed that humans in many social environments—such as pedestrians, patrons, hotel guests, and students—will be willing and able to carry or use robot interface hardware. Ultimately, the task of MMP HRI systems in HSEs is to gain as much information about nearby humans as possible, using only onboard sensors and any additional connected resources.
In this survey, we reviewed various robotics and intelligent systems applications that employ MMP techniques. Our specific contribution is a survey of the available options for data acquisition, processing, and fusion for HRI in HSEs using MMP interfaces. The remainder of this manuscript is organized as follows: Section 2 identifies characteristics and challenges of HRI in HSEs, defines key vocabulary, and identifies specific applications that employ MMP interfaces; Section 3 provides information on the survey method, survey results, and details of surveyed MMP systems; in Section 4, we summarize research trends in HRI and MMP systems and identify opportunities to advance HRI research in HSEs using multimodal systems.

2 Characterizing HRI in HSEs

In this section, we define key characteristics and challenges unique to HRI in HSEs and identify the scope of our survey. Following Riek et al., we define an HSE as any environment in which a robot operates proximately with humans [132]. Broadly, HSEs are characterized by limited, unknown, or changing environmental structure, where the same physical space is likely to be used for many different activities by any number of people [117]. Example HSEs include public settings like malls, libraries, museums, schools, city streets, and offices, and private settings like residences. For a thorough but dated overview of HRI challenges in social environments, we recommend Goodrich and Schultz’s 2007 review [60], while a briefer and more recent summary of HRI challenges can be found in [22]. HRI in HSEs is characterized by one or more of the following:
Multimodal: Human intent, status, or communication may require information from multiple modalities to be correctly interpreted [180].
Multiparty: Multiple humans may be present near the robot, and they may interact with each other or the robot. Multiparty interaction features phenomena such as interruption, turn-taking, entering/exiting the field of regard, and engagement [19, 20, 21, 66, 180, 182, 188].
Mixed-Initiative: The robot’s actions may be influenced by multiple, potentially conflicting, sources [53, 78, 88]. Jiang and Arkin [78] define an initiative as any source that specifies an “element of the mission \(\ldots\) from low-level motion control of the robot to high-level specification of mission goal.” For example, a robot that blends control commands from a human and its onboard obstacle detection algorithm is a mixed-initiative system (a minimal blending sketch follows this list).
Open: The number of agents that the robot interacts with is not known prior to the interaction and may change during the interaction [19].
Situated or Context-based: Interaction between agents is situationally dependent and is shaped by the current status of the environment and elements within it [6, 7, 22, 93].
Dynamic: The robot, humans, or the environment change their state throughout the course of the interaction.
Perceptual: Direct physical contact cannot be assumed and the use of dedicated interface hardware or wired sensors may not be feasible. Human interaction must be performed primarily with a perceptual interface consisting of the robot’s onboard sensors such as cameras, microphones, or lasers. For example, the robot may interact with pedestrians, bystanders, or children who do not carry specialized hardware to interface with the robot [157].
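To make the mixed-initiative notion above concrete, the following minimal Python sketch blends a human teleoperation velocity command with a velocity command from an onboard obstacle-avoidance initiative; the proximity-based weighting, command format, and parameter values are illustrative assumptions rather than a method from any surveyed system.

import numpy as np

def blend_commands(human_cmd, avoid_cmd, obstacle_proximity, alpha_max=0.8):
    """Blend a human teleoperation velocity with an obstacle-avoidance velocity.
    human_cmd, avoid_cmd: np.ndarray [vx, vy, yaw_rate] velocity commands.
    obstacle_proximity: 0.0 (clear) to 1.0 (imminent collision); the avoidance
    initiative gains authority as proximity increases."""
    alpha = alpha_max * np.clip(obstacle_proximity, 0.0, 1.0)
    return (1.0 - alpha) * human_cmd + alpha * avoid_cmd

# Example: the human drives forward; the avoidance initiative steers left near an obstacle.
human = np.array([0.5, 0.0, 0.0])
avoid = np.array([0.1, 0.0, 0.6])
print(blend_commands(human, avoid, obstacle_proximity=0.7))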
Additionally, HRI applications may be complicated by limited onboard resources of embedded systems or unfavorable operating conditions, such as high noise or reduced visibility.
Rather than comprehensively survey each aspect of HRI in HSEs, this survey focuses on the current state of multimodal perceptual interfaces for use in HSEs. Perception is a fundamental prerequisite for intelligent agents to interact with humans in HSEs; a robot cannot plan or execute intelligent interaction behaviors without first perceiving nearby humans. To this end, we review means of acquiring, processing, and fusing multiple modes of human perception data, as well as specific applications of MMP systems in HSEs. Although cognitive/planning architectures and action execution are also necessary components for HRI in HSEs, these are outside the scope of this survey.
Our primary goal for this survey is to document the current capabilities and uses of MMP HRI systems in HSEs, including specific techniques and tools available for HRI researchers. We intend for this to be a broad overview of MMP interfaces for intelligent social agents that is applicable to various emerging domains and provides the reader with information and resources for further investigation. Toward these goals, we surveyed any MMP system that was used to observe humans and did not require the human to carry or use specialized hardware. In other words, we sought to determine what an MMP system can learn about nearby humans using only system hardware. Specifics on the survey method and results are provided in the following section.

3 MMP Survey

In this section, we survey MMP applications across various domains and examine data acquisition, processing, and fusion techniques that enable these applications. Section 3.1 discusses the survey technique, and Section 3.2 provides an overview of the survey results. Section 3.3 discusses MMP HRI capabilities and applications, while Section 3.4 categorizes surveyed works by the type of information they infer. Audio and visual data acquisition and processing are discussed in Sections 3.5 and 3.6, respectively. Section 3.7 discusses other sensor modalities. Section 3.8 discusses multimodal datasets used in surveyed works, as well as public multimodal datasets available for MMP HRI development. Finally, Section 3.9 categorizes and reviews techniques for data fusion. The survey results are summarized in Tables 2 through 7. Note that while many of these MMP systems provide information to higher-level planning and behavioral control modules, or are used in longer-term learning or interaction studies, the focus of this survey is the perception function of these systems.

3.1 Survey Method

We conducted a review of peer-reviewed journal and conference literature using Google Scholar’s advanced search function. This survey does not include commercial HRI robotic systems, since their system designs are often proprietary and not publicly available. The advanced search function allows a custom search of specific publications, search phrases, and date ranges. Our search covered the 15 peer-reviewed publications listed in Table 1(a) and returned articles published from 2012 to 2023 that contained any of the exact phrases in Table 1(b) anywhere in the article.
Table 1.
(a) Publications Surveyed
ACM Transactions on Human Robot Interaction
ACM/IEEE International Conference on Human Robot Interaction
IEEE International Conference on Robotics and Automation
IEEE Robotics and Automation Letters
IEEE Transactions on Robotics
IEEE Workshop on Advanced Robotics and its Social Impacts
IEEE/ASME Transactions on Mechatronics
IEEE/RSJ International Conference on Intelligent Robots and Systems
International Conference on Social Robotics
The International Journal of Robotics Research
International Journal of Social Robotics
Robotics and Autonomous Systems
Robotics and Computer-Integrated Manufacturing
Robotics: Science and Systems
Science Robotics
(b) Search Phrases
Multimodal Detection
Multimodal Fusion
Multimodal Human Robot Interaction
Multimodal Interaction
Multimodal Interface
Multimodal Perception
Multiparty Interaction
Social Human Robot Interaction
Table 1. Publications Surveyed and Search Phrases
At the time of writing, this includes the 10 highest-impact robotics publications (as measured by h5-index), all publications with “human-robot interaction” in the title (2 total), and all publications with “social robotics” in the title (3 total). The chosen date range covers approximately 10 years prior to the time of writing; this timeframe was selected to cover a broad yet relatively recent range of HRI projects applicable to modern HRI research. Publications affiliated with the surveyed publications (such as conference workshops) were also surveyed. Additionally, if a survey result was a continuation of prior work, prior publications relating to the project were included.
For information on social robot perception systems before the timeframe of our survey, we recommend Haibin Yan’s 2014 survey, “A Survey on Perception Methods for Human–Robot Interaction in Social Robots” [173]. Like our survey, Yan’s survey includes details on perception hardware and algorithms employed on many contemporary social robots, such as Kismet, iCub, and Robovie.
To further focus the survey, we applied a set of inclusion and exclusion rules to the initial search results. To be included in the survey, each research work had to meet all of the following criteria:
The project must perform a perception function relevant to HRI
The perception system must combine data from at least two non-contact sensory modalities
The publication must include sufficiently detailed information about the perception system
Publications meeting any of the following criteria were excluded from the survey:
The project explores MMP techniques unrelated to HRI (e.g., object detection/classification, material classification, robot localization)
The perception system requires the human to carry or wear interface hardware (e.g., head-mounted displays, inertial measurement units (IMUs), electromyography, pulse oximetry, or brain–computer interfaces)
The project primarily explores multimodal robot-to-human communication (e.g., studies that explore the effects of fusing a robot’s speech, gaze, and/or gesture)
The study exclusively uses synthetic or simulated data
The multimodal system or fusion technique is designed for offline use
The study is a “Wizard-of-Oz” interaction study in which humans process robot sensory data instead of a perception system (as in [171])
The multimodal human observation relates to a specific medical procedure [89, 185]
Finally, the MMP systems of the included projects were analyzed, and details about each MMP HRI system are provided in the following section.

3.2 Survey Results and Taxonomy

The initial Google Scholar advanced search returned over 1,000 unique matches. Because the search terms were broad and the entire article was searched, there were many false positives in the initial results (e.g., some matches were due to citations and references). After subjecting each match to the inclusion and exclusion criteria in Section 3.1, the final survey includes a detailed analysis of 65 MMP systems spanning 82 publications. Results spanned many disciplines related to robotics and intelligent systems, including socially assistive robots (SARs), service robots, mobile robots in the wild, collaborative robots, human–child interaction, virtual agents, and smart spaces.
To further organize the survey results, we group them according to the data fusion technique used. For this purpose, we use the data fusion taxonomy of Baltrusaitis et al., who categorize fusion techniques as model-based or model-agnostic depending on how and when data streams are fused [9]. Model-based fusion approaches use machine learning (ML) models to combine different data modalities. This includes the use of kernels, graphical models, or neural networks to fuse data, as described below (a minimal neural-fusion sketch follows the list):
Multiple Kernel Fusion: An extension of kernel support vector machines (SVMs) that allows for the use of different kernels to fuse different modalities.
Graphical Model Fusion: The use of hidden Markov models (HMMs), Bayesian networks, conditional random fields (CRFs), or other graphical models to fuse multimodal data streams.
Neural Network Fusion: The use of long short-term memory (LSTM) networks, deep neural networks (DNNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), or Transformers to fuse multimodal data streams.
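As a minimal sketch of model-based (neural network) fusion, the following PyTorch snippet encodes two modalities with separate networks and lets a learned joint layer combine the embeddings; the feature dimensions (MFCC-style audio features and flattened skeletal keypoints) and class count are illustrative assumptions.

import torch
import torch.nn as nn

class NeuralFusion(nn.Module):
    """Encode each modality with its own network, then learn how to combine them."""
    def __init__(self, audio_dim=40, pose_dim=51, embed_dim=32, n_classes=4):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.pose_enc = nn.Sequential(nn.Linear(pose_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, n_classes)  # learned joint (fusion) layer

    def forward(self, audio_feats, pose_feats):
        joint = torch.cat([self.audio_enc(audio_feats), self.pose_enc(pose_feats)], dim=-1)
        return self.head(joint)

# Example with random stand-ins for a batch of audio and skeletal-pose features.
model = NeuralFusion()
logits = model(torch.randn(8, 40), torch.randn(8, 51))
print(logits.shape)  # torch.Size([8, 4])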
Model-agnostic fusion methods do not use ML models to combine multimodal data; rather, they fuse the raw data, extracted features, or the processed outputs from different modalities into a single data structure. Model-agnostic fusion techniques can be categorized as follows, depending on the phase at which data are fused [25, 128, 180] (a schematic sketch follows the list):
Early (data-level) fusion: Unprocessed data from different modalities are aligned and combined into a single data structure.
Intermediate (feature-level) fusion: Features are extracted from each heterogeneous data mode, then features are fused into a single data structure.
Late (decision-level) fusion: Each individual data mode is completely processed, and the resultant outputs of each mode are fused into a single data structure.
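The distinction between the three model-agnostic phases can be summarized in a short Python sketch; the feature extractor and classifier below are placeholders standing in for any unimodal pipeline, and the weights are arbitrary.

import numpy as np

def extract_features(x):            # placeholder, e.g., MFCCs or skeletal keypoints
    return x.mean(axis=0)

def classify(vec):                  # placeholder decision function
    return np.tanh(vec).mean()

def early_fusion(audio_raw, video_raw):
    # Align and combine unprocessed data into one structure, then process jointly.
    return classify(np.concatenate([audio_raw.ravel(), video_raw.ravel()]))

def intermediate_fusion(audio_raw, video_raw):
    # Extract features per modality first, then fuse the feature vectors.
    return classify(np.concatenate([extract_features(audio_raw), extract_features(video_raw)]))

def late_fusion(audio_raw, video_raw, w_audio=0.4, w_video=0.6):
    # Process each modality to completion, then combine the unimodal decisions.
    return w_audio * classify(extract_features(audio_raw)) + w_video * classify(extract_features(video_raw))

audio, video = np.random.rand(100, 13), np.random.rand(30, 2048)
print(early_fusion(audio, video), intermediate_fusion(audio, video), late_fusion(audio, video))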
Additional information on fusion techniques, with specific examples from surveyed works, can be found in Section 3.9. In summary, Tables 2–7 contain the survey results, organized according to the taxonomy in Figure 1.
Table 2.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Banerjee 2018 [11]Classifying human interruptability (awareness of robot and willingness to engage) on an indoor mobile robotRed–green–blue depth (RGB-D) camera (Microsoft Kinect One)Cascaded deep network face detection and gaze estimationGaze direction (at robot, left, right, down)Comparison of non-temporal (random forest, multilayer perceptron) and temporal (latent-dynamic CRFs) models to estimate interruptability
CPM skeletal pose estimation; joint vector and angle computationHuman joint angles
Irfan 2018 [71] and 2022 [72]Long-term open-set person recognition on Pepper and NAO robots in three real-world long-term HRI studiesRGB-D camera (embedded in Pepper); RGB camera (embedded in NAO)Person detection (NAOqi Recognition Module)Estimated heightIncremental Bayesian Network to estimate human identity based on face and soft biometrics
Face detection (NAOqi Recognition Module)Face recognition and similarity score, gender estimate, age estimate
Joosse 2017 [80] and Triebel 2016 [154]Passenger mobile service robot deployed in an international airport for SPENCER project2x 2D LiDAR (SICK LMS 500)LiDAR leg tracking; multi-hypothesis tracker; social network graphHuman locations and social groupingsSupervision system to process sensory data and guide interaction modes; includes MOMDP collaboration planner to estimate group intentions and adjust robot action plans from fused observations
4x RGB-D cameraDepth object segmentation; near-field normalized depth template torso detection; far-field HoG person detectionHuman locations
Custom head pose estimatorCoarse head pose direction (left, right, front, back)
Stereo cameraUnspecifiedUnspecified
D. Lu 2017 [102]Estimating human interest for KeJia guide robot in museum settingRGB-D camera (Microsoft Kinect)Face tracking (OpenFace); Emotion recognition (Microsoft Emotion Recognition API)Recognized emotionPOMDP-based multimodal dialog management framework
Face tracking (OpenFace); Demographic recognition (Microsoft Face API)Age and gender estimate
Directional microphoneSpeech API (iFlyTek)Recognized speech commands
Table 2. Survey of MMP Capabilities Using Graphical Model-Based Fusion
2D, two-dimensional; LiDAR, light detection and ranging; MOMDP, mixed observability Markov decision process; POMDP, partially observable Markov decision process.
Table 3.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Chen 2023 [29]Human pose estimation robust to rain, smoke, low lighting, and occlusion using ImmFusion networkmillimeter (mm)-Wave Radar Pointcloud (mmBody dataset)HRNet local image feature extraction, CNN global feature extractionLocal and Global image featuresConcatenation of image and mm-Wave pointcloud features; tiny transformer global integration module to fuse global features; fusion transformer module to estimate joint positions and reconstruct pose
mm-Wave Radar pointcloud (mmBody dataset)PointNet++ local pointcloud feature extraction; multilayer perceptron (MLP) global feature extractionLocal and global pointcloud features
Fung 2023 [48]Robust 3D Human detectionRGB-D data (various datasets)End-to-end MYOLOv4 network to extract, fuse, and classify multimodal features; consists of Multimodal Feature Extraction and Fusion (MFEF), path aggregation network (PAN), and YOLOv3 layers3D Location of humansMultimodal features extracted and fused in MFEF network layer
Li 2020 [95]Proposed emotion classification architectureMonochannel audioSpeech recognition and word embedding (Google Cloud speech-to-text)Lexical featuresBidirectional gated recurrent unit (BGRU) layers to estimate hidden acoustic-facial states; two attention layers + fully connected layer to align, fuse, and classify emotions
Acoustic feature extraction (openSMILE)Acoustic features
RGB cameraFacial feature extraction (eMax toolbox)Facial features
Linder 2020 [98]Human localization in dense environments, tested on real-world and synthetic datasetsRGB-D camera (Microsoft Kinect)Modified Darknet53 backend to concatenate RGB and Depth featuresFused RGB-D feature vectorModified YOLO v3 to compute 2D bounding boxes and 3D human centroid regression
G. Liu 2019 [100]Action recognition using datasetsRGB data (NTU and SYSU datasets)Skeleton attention and self-attention layersSpatial featuresConcatenation layer to fuse temporal and spatial features; L2, FC, and Softmax layers to classify action
Skeletal pose data (NTU and SYSU datasets)3x bidirectional LSTM layersTemporal features
Robinson 2023 [134]Robust human activities-of-daily-life (ADL) recognition of labeled datasetsRGB-D video (ETRI-Activity-3D dataset)ResNet RGB video backbone; CSPDarknet + PANet + YOLO object detectionVideo, pose, and object featuresSpatial mid-fusion module network to embed and fuse features; dense neural layer to classify feature embeddings as ADL
Skeletal pose data (ETRI-Activity-3D dataset)GCN and self-attention pose backbone
Yasar 2023 [178]Context-aware human and group motion prediction using datasetsRGB data (NTU RGB+D 60 and CMU Panoptic datasets)GRU encoderContextual scene featuresInteraction module to estimate inter-agent dynamics; multimodal context module to fuse context and agent dynamics features; motion decoder to predict motion for each human and group
Skeletal pose data (NTU RGB+D 60 and CMU Panoptic datasets)Skeletal pose data and motion encodersSpatio-temporal skeletal pose features for each human
Table 3. Survey of MMP Capabilities Using Neural Network Model-Based Fusion
CMU, Carnegie Mellon University; 3D, 3-dimensional; ETRI, Electronics and Telecommunications Research Institute; NTU, Nanyang Technological University; SYSU, Sun Yat-sen University Dataset.
Table 4.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Filippini 2021 [41]Emotion recognition during child—robot interaction with Mio AmicoRGB data (ELP 5MP Webcam)HoG face detector and regression tree facial landmark detection; Thermal feature extraction using FIR filter; MLP to classify thermal image emotional contentEstimated emotional stateFusion of thermal and RGB data through offline extrinsic calibration and online linear interpolation and pixel co-registration
Thermal image data (FLIR Lepton)
Table 4. Survey of MMP Capabilities Using Early Fusion
Table 5.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Ben-Youssef 2019 [17]Estimating engagement on Pepper robot in a public space and generating User Engagement in Spontaneous HRI (UE-HRI) dataset4-microphone array (embedded in Pepper)Audio feature computation (openSMILE)Recognized speechTemporal pooling of sliding temporal measurement windows to fuse asynchronous speech, face, and proxemic features; comparison of Logistic Regression, DNN, GRU, and LSTM classifiers
RGB imageGaze estimation, head pose, and facial action unit (FAU) computation (OpenFace 2.0)Gaze estimation, head pose, FAUs
SonarDistance estimation (NAOqi ALEngagementZones module)Distance to human
Benkaouar 2012 [18], Vaufreydaz 2016 [164]Detecting a human’s intention to engage with Robosoft Kompaï robot in notional domestic environment2D LiDARFeet detection using adaptive background subtraction and Kalman filteringPosition, velocity of feet; distance between feetUse of neutral feature values and last-known feature values; comparison of artificial neural network (ANN) and multiclass SVM to classify feature vector and estimate user engagement
4-microphone array (Microsoft Kinect)Speech activity detection and Sound source localization (SSL)Speech activity, speaker azimuth
RGB-D camera (Microsoft Kinect)Skeletal pose tracking (Kinect); Schegloff metric computation; distance computation; Haar feature-based cascade classifier facial detection (OpenCV)Schegloff stance and torque angle features, skeletal distance, face size, face location
Islam 2020 [74]Human activity recognition on three human activity datasetsRGB image data University of Texas (UT-Kinect and UTD-MHAD datasets)ResNet50 spatial feature encoder; LSTM temporal feature encoder; self-attention mechanismEmbedded RGB featuresCustom HAMLET multimodal attention-based RGB + pose feature fusion; fully connected network layer to classify activity embedded vector
Skeletal pose data (UT-Kinect and UTD-MHAD datasets)Spatial feature encoder; LSTM temporal feature encoder; self-attention mechanismEmbedded skeletal pose features
H. Liu 2018 [101]Command recognition in notional collaborative robot manufacturing task using unimodal datasetsLabeled mono speech command datasetMFCC computation and CNN speech feature extractionSpeech featuresConcatenation function to align and fuse multimodal features; MLP to classify one of six robot commands
LeapMotion LEAP datasetLSTM gesture feature extractionGesture features
Labeled RGB video datasetTransfer-learning trained MLP network feature extractionBody motion features
Shen 2021 [143]Estimating “big five” personality traits on a Pepper robot in laboratory settingRGB camera (embedded in Pepper)Motion estimation (Pepper SDK)Head motion features, body motion featuresLinear interpolation and truncation to fuse audio and visual features; HMM clustering and classification of feature vector to estimate personality traits
Monochannel microphone (embedded in Pepper)Speech recognition (Nuance); MFCC, pitch, and audio energy computationRecognized speech; audio features
Wilson 2022 [168]Detecting when a user needs assistance in a collaborative game taskRGB cameraGaze estimation (OpenFace)Eye gaze directionVector concatenation of gaze, speech, and semantic features; random forest classifier to estimate if assistance needed
Monochannel microphoneSpeech recognition (DeepSpeech); naïve Bayes classifier to estimate if assistance needed; linear degradation function to retain speech feature values between spoken segmentsRecognized speech; question/negation semantic features
Table 5. Survey of MMP Capabilities Using Intermediate Fusion
SDK, software development kit.
Table 6.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Abioye 2018 [2] and 2022 [1]Speech and gesture commanding for simulated aerial robot motionMonochannel microphoneCMUSphinx automatic speech recognition (ASR)Recognized speech commandsRule-based approach to determine if unimodal commands are sequential or synchronous, and if information is emphatic or complementary
RGB cameraHaar feature-based cascade classifier hand detection, convex hull finger detection (OpenCV)Hand gestures
Ban 2018 [10]Multiple speaker tracking using dataset of various indoor acoustic settings6-Microphone array (Audio–Visual Diarization, AVDIAR, dataset)Direct-path relative transfer functionSpeaker locationBayesian estimator
RGB image data (AVDIAR dataset)Facial detectionFacial location
Bayram 2015 [12]Multiple speaker tracking and following on indoor mobile robot7-Microphone array (Microcone)Generalized eigenvalue decomposition multiple signal classification (GEVD-MUSIC)Speaker locationParticle filter
2x RGB cameras (Microsoft Kinect)Haar feature-based cascade classifier facial detectionFacial location
Belgiovine 2022 [14] and Gonzalez-Billandon 2020 and 2021 [57, 58]Human localization for recording voice and facial identity data from an iCub robot in a laboratory settingStereo camera (embedded in iCub)Deep network for facial localizationFace locationMultihuman tracking via Kalman Filtering and Hungarian algorithm for data association
Binaural microphones (embedded in iCub)Deep network for speaker localizationCoarse speaker location: to left, to right, in front
Bohus 2009 [20] and 2010 [21]Virtual agent personal assistant and trivia game host; for multiparty interaction research in an indoor social environmentRGB camera (AXIS 212)Face detection and trackingHuman locationScene analysis module to infer human intentions, engagement, and actions; fusion algorithm details unspecified
Facial pose estimationFocus of attention
Clothing RGB variance analysisOrganizational affiliation
4-Microphone linear arrayWindows 7 Speech recognizerRecognized speech
SSLSpeaker location
Chao 2013 [27]Detecting human activity and evaluating interruption behaviors in a mixed-initiative laboratory interactionRGB-D camera (Microsoft Kinect)Skeletal pose tracking; coarse gaze estimation; gesture activity detectionEstimate if human is gazing at robot; gesture activityControl architecture for the dynamics of embodied natural coordination and engagement system; uses timed Petri nets to handle asynchronous human inputs and select action
Monochannel microphonePitch computation (Pure Data module)Speaker pitch
Chau 2019 [28]Simultaneous localization of humans and robot platform8-Microphone array (TAMAGO-03)Generalized singular value decomposition-MUSIC and SNR estimationSpeaker direction-of-arrivalGaussian mixture probability hypothesis density filter
RGB cameraOpenPose keypoint extractionMultitarget human pose estimates
Chu 2014 [34]Determining desire to engage with Curi robot platform in a laboratory setting via contingency detection2-Microphone arraySound cue computationSound cue features for two audio channelsFinite state machine to determine when to process body and audio features; SVM contingency classifier to determine if human wishes to engage or not
RGB-D camera (Asus Xtion Pro Live)Skeletal pose tracking (OpenNI) and body motion cue computationMotion cues for seven joints
Churamani 2017 [35]Identity and speech recognition for personalized interaction with Nico robot in laboratory educational task2x RGB camerasHaar feature-based cascade classifier facial detection; face recognition using LBP histogramsHuman identity and face locationWeighted sum of voice and facial identities; dialog manager to control flow of interaction based on identity and speech
Monochannel audioSpeech recognition (Google Cloud speech-to-text); CNN human identificationHuman identity and recognized speech
Foster 2014 [44] and 2017 [47] and Pateraki 2013 [120]Estimating human interaction status/intentions with JAMES bartending robot in notional bartending environment4-Microphone array (Microsoft Kinect)SSL; ASR (Microsoft Speech API); lexical feature extraction (OpenCCG)Speaker azimuth, recognized speech, and lexical featuresComparison of binary regression, logistic regression, multinomial logistic regression, SVM, k-nearest neighbor (k-NN), decision tree, naïve Bayes, propositional rule learner to estimate bar patron state
RGB-D camera (Microsoft Kinect)Hand and face blob tracking; 2D silhouette and 3D shoulder fitting; least-squares RGB image matchingHuman head and hand location, 3D torso pose and head pose angles
2x Stereo cameras
Foster 2016 [45] and 2019 [46]Human localization, identification, intent recognition, and speech recognition in a large public mall on Pepper mall robot for MuMMER project4-Microphone array (embedded in Pepper)NN to perform simultaneous speech detection and localization; speech recognition (Google Cloud speech-to-text)Speaker location and recognized speechWeighted average of gaze and human distance to estimate if human is interested in interaction
RGB-D camera (Intel RealSense D435)CPM pose recognition; head pose estimation (OpenHeadPose); facial feature computation (OpenFace)Body pose, head pose, and facial features
Gebru 2018 [49]Multiple speaker localization and diarization using dataset of various indoor acoustic settings6-microphone array (AVDIAR dataset)Binaural feature extraction; SSL using trained modelSpeaker locationMAP Bayesian estimator to detect speakers; NN search to attribute speech to speaker
Voice activity detection (VAD)Speech activity
RGB image data (AVDIAR dataset)Visual trackingSpeaker head and torso locations
Glas 2013 [51] and 2017 [52]Sensor network + mobile Robovie robot in shopping mall to detect shopper identity and locations and issue personalized greetings8x 2D LiDARHuman tracking via particle filtering (ATRacker)2D human locationsGeometric model to estimate human height; NN data association to fuse identity with human location
2x RGB camerasFacial recognition (OKAO vision)Human identity
Gomez 2015 [54]Speech-to-speaker association using Hearbo robot in rooms of varying reverberationRGB-D camera (Microsoft Kinect v2)Depth and tracking, head position, and mouth activity detectionVisual azimuth to speaker; mouth activitySpeaker Resolution module which associates acoustic azimuth to valid, speaking visual azimuth
16-Microphone arrayMUSIC SSL (HARK)Acoustic azimuth to speaker
Ishi 2015 [73]Estimating human location and head orientation in laboratory environment2x 2D LiDAR (Hokuyo UTM-30L)Particle filterHuman locationComputation of sound source location from acoustic azimuth vectors. Fused with LiDAR location estimate if within threshold; location estimate used to compute head orientation vector
2X 16-microphone array, 2X 8-microphone arrayMUSIC SSLAcoustic azimuths
Jacob 2013 [75]Command recognition for a surgical scrub nurse assistant robot arm (FANUC LR Mate 200iC) in notional surgical taskRGB-D camera (Microsoft Kinect One)Custom fingertip locator, Kalman Filter smoother, feature extraction, and HMM gesture classifierRecognized gesturesState machine to perform assistive actions (e.g., pick up and pass specific surgical instruments) or enable/disable command modes
Monochannel microphoneSpeech recognition (CMUSphinx)Recognized speech commands
Jain 2020 [76]Detecting user engagement during educational game in a month-long, in-home studyRGB Camera (USB webcam)OpenPose# people in sceneGradient-boosted decision trees to classify engagement/disengagement
OpenFaceFace detection, eye gaze direction, head position, facial expression features
Monochannel audio (USB webcam)Audio feature extraction (Praat)Audio pitch, frequency, intensity, and harmonicity
Kardaris 2016 [83] and Zlatintsi 2017 [187] and 2020 [186]Verbal and gestural command recognition for elderly care robot evaluated on MOBOT multimodal dataset and I-Support assistive bathing robotRGB-D camera (Microsoft Kinect)Dense trajectory feature extraction and SVM classificationRecognized gesturesRanked selection of best available modality
Optical flow activity detection (OpenCV)Activity status
8-Channel MEMS microphone arrayBeamforming denoising; MFCC+\(\Delta\) feature extraction; N-Best grammar-based speech recognition (HTK toolkit)Recognized speech commands
Kollar 2012 [86]Human tracking, speech and gesture recognition in an indoor environment on the Roboceptionist and CoBot service robotsRGB-D Camera (Microsoft Kinect)Skeletal pose tracking; vector computations of skeletal keypointsRecognized gestures and proxemicsRule-based dialog manager
Android tabletAndroid speech recognition; probabilistic graph/naïve Bayes language modelRecognized speech commands
Komatsubara 2019 [87]Estimating child social status using a sensor network embedded in a classroomRGB-D camera (Microsoft Kinect)Head and shoulders detectionHuman locationNN association fuses identity and location; custom social feature extraction module; SVM with RBF classifies social status
6x RGB camera (Omron)Facial recognition (OKAO Vision)Human identity
Linder 2016 [97]Development of multiple human tracker for dense human environments, tested on real-world and synthetic datasets2x 2D LiDAR (SICK LMS 500)OpenCV random forest leg trackerLeg locationsComparison of NN, extended NN, multi-hypothesis, and minimum description length trackers
RGB-D camera (Asus Xtion Pro)Comparison of depth template-based, monocular HoG, and RGB-D HoG upper-body detectorsTorso locations
Linssen 2017 [99] and Theune 2017 [152]Development of R3D3 receptionist robot and pilot testing in day care center4-Microphone array (Microsoft Kinect)ASR using Kaldi deep networkRecognized speechRule-based dialog and action manager
RGB-D Camera (Microsoft Kinect)FaceReader software (Vicar Vision)Emotional state and demographics
Maniscalco 2022 [106] and 2024 [105]Identifying humans, estimating engagement on Pepper robot guide in campus and museum environments4-Microphone arrayRMS signal computation, Google Cloud speech-to-textRecognized speechFinite state automaton; specific sensory inputs activate state transitions for engagement and interaction (smach)
RGB imagePeople detection, face recognition, gaze estimation, and age/gender estimation (Pepper SDK)Human identity, location, gaze direction, and age/gender
SonarFIFO queue to stabilize distance measurementDistance to human
Martinson 2013 [108]Human identification using soft biometrics on Octavia social robot platform in a public interaction studyDepth/time-of-flight camera (Mesa Swissranger SR4000)Segmentation via connected components analysis; computation of 3D face position; height estimation via geometric modelHuman location and heightComputing similarities of soft biometrics with those of known humans; weighted sum of soft biometric similarities to determine identity
RGB camera (Point Grey FIREFLY)Facial detection (Pittsburgh Pattern Recognition SDK); color histogram computation of face and clothingFace and clothing color histograms
Martinson 2016 [109]Human detection in indoor office settingRGB-D cameraAlexnet CNN human detection on RGB image (Caffe)Human detection likelihoodWeighted sum of CNN and layered person detection likelihoods
Alexnet CNN human detection on depth image (Caffe)Human detection likelihood
Layered person detection: Segmentation, geometric feature computation, and GMM classification of depth clustersHuman detection likelihood
Mohamed 2021 [110]Generating whole person model from robot data and providing generic ROS HRI stack via ROS4HRI projectAudio dataVoice activity detection (VAD), feature extraction (openSMILE, HARK); SSL (HARK)Audio features, voice activity, and speaker locationBody/face matcher and person manager to fuse face, voice, and body information
RGB imageFace detection and pose estimation (OpenFace); expression and demographic detection (OpenVINO); face recognition (dlib)Expression, demographics, and identity
RGB-D imageSkeletal pose tracking (Kinect, OpenNI, OpenVINO) and human description generatorHuman skeletal pose description and recognized gestures
Nakamura 2011 [114]Human localization using Hearbo robot in indoor setting8-Microphone arrayGEVD-MUSIC with hGMM Sound Source IdentificationSpeaker direction-of-arrivalParticle filter
Thermal camera (Apiste FSV-1100)Thermal-Distance integrated localization model; binary thermal mask applied to clusters of depth points to localize human headsHuman location (3-dimensional)
Time-of-flight distance camera (Mesa SR4000)
Nieuwenhuisen 2013 [116]Human localization and importance detection on Robotinho museum tour guide robot2D LiDAR (Hokuyo URG-04LX)Leg and torso detectionBody locationsFace and body locations fused by Hungarian algorithm data association in a multi-hypothesis tracker; human importance in group estimated by relative distance and angle to robot
2x RGB camerasViola-Jones face detectorFace locations
Directional microphoneSmall-vocabulary speech recognition (Loquendo)Recognized commands
Paez 2022 [126]Emotion recognition on Baxter robot for improved robot mentorship in an indoor collaborative play settingMicrophoneSpeech recognition (unspecified)Recognized speechk-Means emotion classifier of fused body/head movement and facial emotion features
RGB-D camera (Microsoft Kinect)Body gesture and facial movement recognition (unspecified)Recognition of 22 body gestures and 8 Head movements
RGB-D camera (Intel RealSense SR300)Facial emotion recognition (Affdex SDK)Recognized emotion
Pereira 2019 [121]Observing human speech and gaze in joint human-robot game with Furhat robotMonochannel microphoneSpeech recognition (Microsoft cloud speech recognition)Recognized speechState machine to determine robot gaze target; state transitions triggered by prioritized human inputs
2x RGB-D cameraGaze tracking (GazeSense)Gaze direction
Object tracking (ARToolkit)Location of game objects
Portugal 2019 [123]Development of elderly care robot SocialRobot and deployment in elderly care centerHD RGB camera (Microsoft LifeCam Studio)Haar feature-based cascade classifier facial detection; PCA/Eigenface facial recognitionFace locations and identitiesCustomizable XML-based service engine based on human inputs
RGB-D camera (Asus Xtion Pro Live)Depth-augmented face localizationUpdated face location
2x microphone array (Asus Xtion Pro Live)Speech recognition (PocketSphinx); emotion and affect recognition (openEAR)Recognized speech; emotional state
Pourmehr 2017 [124]Mobile robot locating humans interested in interaction; indoor and outdoor2D LiDARLeg detectorHuman location occupancy gridWeighted sum of three unimodal occupancy grids
RGB-D camera (Microsoft Kinect)Torso detectorHuman location occupancy grid
4-Microphone array (Microsoft Kinect)MUSIC sound source localizationSpeech direction-of-arrival occupancy grid
Prado 2012 [125]Emotion recognition in indoor setting; used to synthesize emotional robot behaviorsRGB cameraHaar feature-based cascade classifier facial detection (OpenCV); PCA to extract FAUs; dynamic Bayesian network (DBN) classifierRecognized emotion from faceDBN classifier to fuse face and voice emotions
MicrophoneExtract pitch, duration, and volume of utterance (Praat); DBN classifierRecognized emotion from voice
Ragel 2022 [127]Interactive play with children in indoor setting using tabletop Haru robot2x RGB cameraFace detection (face_recognition and OpenCV), facial feature extraction (Microsoft Azure Face API), mask detection (TensorFlow network)Estimated gender, emotion, and if wearing facemaskSequential data fusion pipeline using caching, filtering, and fusion to combine asynchronous skeletal pose, hand, face, and speech data; incrementally updated as new data available
6-Microphone array (embedded in Haru) 8-microphone array (external)SSL and EMVDR noise filtering (HARK); speech recognition (Google Cloud speech-to-text)Recognized speech
RGB-D (Orbbec Astra in Haru); Azure Kinect and Kinect v2 (external)Hand detection (MediaPipe); skeletal pose tracking (Kinect)Hand and skeletal pose keypoints
Sanchez-Riera 2012 [139]Human command recognition using multimodal dataset for D-META grand challengeBinaural audio (RAVEL dataset)MFCC computation; HMM and SVM to classify MFCCsRecognized speech commandsWeighted sum of speech and gesture command classifications
Stereo RGB vision (RAVEL dataset)Scene flow and STIPs feature extraction; HMM and SVM to classify scene flow and STIPsRecognized gestures
Tan 2018 [149]Development of iSocioBot social robot platform and deployment in four public eventsRGB camera (Logitech HD Pro C920)LPQ facial recognitionHuman identityProbabilistic hypothesis testing to fuse face and sound source locations; weighted sum to fuse speaker and facial identities
Facial detection (OpenCV)Face location
4-Microphone array (Microsoft Kinect)Unspecified SSLSpeaker direction-of-arrival
Wireless handheld microphoneSpeech recognition (Google Cloud speech-to-text)Recognized speech
i-vector framework (ALIZE 3.0)Speaker identity
Terreran 2023 [150]Whole-body gesture and activity recognitionRGB-D camera (RGB channel)2D skeletal pose tracking (OpenPose)2D skeletal pose keypoints2D-to-3D projection and lifting fuses 2D keypoints to 3D pointcloud for 3D pose estimation; ensemble classifier predicts activity
RGB-D camera (depth channel)None3D pointcloud
Trick 2022 [153]Receiving human reinforcement training inputs for a manipulator arm action planner in a laboratory settingMicrophoneMFCC feature extraction and CNN keyword classification (Honk)Recognized keyword commandsBayesian independent opinion pool
RGB-D camera (Intel RealSense D435)Skeletal pose tracking (OpenPose) and SVM gesture classification (Scikit-learn)Recognized gestures
Tsiami 2018 [156]Simultaneous indoor speaker localization, speech recognition and gesture recognition for children in an indoor play setting3x 4-microphone array (Microsoft Kinect)SRP-PHAT SSL; GMM-HMM speech recognitionSpeaker location; recognized speechHighest average probability of gestures; majority voting for speech recognition; NN to fuse audio and visual speaker locations
3x RGB-D camera (Microsoft Kinect)Bag-of-words and SVM gesture classifierRecognized gestures
1x RGB-D camera (Microsoft Kinect)Skeleton trackingHuman location
Whitney 2016 [167]Recognizing human object references on a Baxter robot in a laboratory setting4-Microphone array (Microsoft Kinect)Unigram speech modelRecognized wordBayesian estimator using uni-, bi-, and tri-gram state models to estimate the object the human is referring to
RGB-D camera (Microsoft Kinect)Skeletal pose recognition (OpenNI) and elbow-wrist vector computationObject being pointed to
Yan 2018 [176]Multi-human tracking to train a 3D LiDAR human detector in indoor public areaRGB-D camera (ASUS Xtion Pro Live)Shoulder and head detectionTorso locationsMulti-hypothesis tracker to fuse leg and torso locations
2D LiDAR (Hokuyo UTM-30LX)Leg trackerLeg locations
Table 6. Survey of MMP Capabilities Using Late Fusion
GMM-HMM, Gaussian mixture model-hidden Markov model; HARK, Honda Research Institute Japan Audition for Robots with Kyoto University; MOBOT, mobility robot; PCA, principal component analysis; ROS4HRI, Robot Operating System for Human-Robot Interaction; SDK, software development kit; SRP-PHAT, steered response power with phase transform.
Table 7.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
D’Arca 2016 [36]Indoor speaker detection, tracking and identification in a laboratory environmentRGB cameraOptical flow feature extraction; object detectionSpeaker location and heightCanonical correlation analysis to fuse optical and audio features (Intermediate); Kalman Filter to fuse audio and visual location (late); NN to fuse speaker identity with location (late)
8-Microphone arrayGeneralized cross correlation with phase transformSpeaker location
MFCC feature extraction and classificationSpeaker identity
Efthymiou 2022 [38]Child location, activity, and speech recognition in ChildBot intelligent playspace4x RGB-D camera (Microsoft Kinect)Skeletal pose trackingChild locationsVarious combinations of visual feature concatenation/classification (intermediate); NN audiovisual data association (late); dialog flow manager to sequence interaction (late)
Dense trajectory, HoF, and motion boundary histogram feature extraction of RGB data; encoding via bag-of-visual words and VLAD; SVM classificationRecognized actions
4x 4-microphone array (Microsoft Kinect)Delay-and-sum beamforming; MFCC extraction and GMM-HMM triphone speech classificationSpeaker location and recognized speech
Filntisis 2019 [42]Child emotion recognition in intelligent playspace; recording of BabyRobot Emotional Database datasetRGB cameraFacial detection (OpenFace 2), feature extraction (ResNet 50), temporal max pooling, FC classifierFacial emotionConcatenation of body and face features (intermediate); classification and concatenation of whole body, body, and head emotion recognition scores (late)
Skeletal pose tracking (OpenPose), DNN feature extraction, temporal average pooling, FC classifierBody emotion
L. Lu 2021 [103] and Reily 2022 [131]Recognizing group activity in a human-robot search and rescue datasetRGB image (CAD and MUSRT datasets)HoG computationRGB HoG features for each human in sceneBag-of-words computation to fuse HoG features for each human (intermediate); weighted sum of human features to estimate overall group activity (late)
Depth image (CAD and MUSRT datasets)HoG computationDepth HoG features for each human in scene
Thermal image (CAD and MUSRT datasets)HoG computationThermal HoG features for each human in scene
3D LiDAR (MUSRT dataset)HoG computationLiDAR HoG features for each human in scene
Nigam 2015 [117]Classifying social context on a mobile robot platform in a university settingRGB-D camera (Microsoft Kinect)Grayscale pixel value computation; PCA and concatenationFused, reduced audiovisual feature vectorConcatenation and reduction of audiovisual features (intermediate); SVM, naïve Bayes, and decision tree social context classifiers (late)
Microphone (voice recorder)Amplitude computation; PCA and concatenation
Yumak 2014 [181, 182] and 2016 [180]Virtual agent + humanoid robot testbed for multiparty interaction research in indoor setting2x RGB-D camera (Microsoft Kinect)Skeleton trackingHuman locationLBP and HSV concatenation and k-NN classifier to determine identity (intermediate); NN fusion to assign speech and ID to human location (late); NN fusion to determine addressee from gaze (late)
LBP and HSV extraction; NN classificationHuman identity
ML deformable model fittingHead pose
8-Microphone arraySpeech harmonicity SSLSpeaker direction-of-arrival
Close-talk microphoneSpeech recognitionSpeech content and confidence
Table 7. Survey of MMP Capabilities Using Hybrid Fusion
Fig. 1. Survey results, organized by fusion method.

3.3 MMP Applications in HRI

The survey analyzed perception systems of robots and intelligent systems operating in a variety of domains and HSEs. Service robots, elderly care robots, educational robots, and social robots were among the most common domains surveyed. This coincides with the broader trend of aging populations in many developed countries and the potential for robots to address critical skill shortages for service tasks.
Surveyed service robotics systems include museum robots such as R3D3, Robotinho, and KeJia [99, 102, 116, 152], which can guide patrons to appropriate museum exhibits; Sacarino, a hotel concierge robot capable of guiding guests through the hotel and providing information and services [122]; the Social Situation-Aware Perception and Action for Cognitive Robots (SPENCER) airport robot, which can guide groups of passengers to their gate [80, 154]; the MultiModal Mall Entertainment Robot (MuMMER) [45, 46] and KeJia [32], mall robots designed to provide guidance and information to mall patrons; Roboceptionist, which provides information about office locations and building amenities [86]; and bartending robots as part of the Joint Action for Multimodal Embodied Social Systems (JAMES)/Joint Action Science and Technology (JAST) [44, 47, 120] and BRILLO [137] projects.
Surveyed elderly care robots include Mini, a tabletop robot designed to provide safety, entertainment, personal assistance, and stimulation [138]; Hobbit, a domestic care robot which can clear the floor, learn to bring objects, detect falls, and initiate video calls from its screen [43]; and the SocialRobot project, in which a mobile robot platform is developed to monitor routines, assist with memory and reminders, and initiate calls to maintain social engagement [123]. The I-Support system uses a non-contact multimodal interface to direct a bathing robot to perform specific assistive bathing actions, like “wash back” [186].
Education, socialization, or play is another common application for robots and intelligent systems operating in HSEs. Providing continued interaction is crucial to the social development of children with autism spectrum disorder (ASD); this need is addressed by the Kiwi robot, which played games and provided expressive feedback to children with ASD during a month-long, in-home study [76]. Other surveyed systems played games jointly with children [41] or educated them on how to play board games [121, 126]. Maggie is a tabletop robot that reads to a human companion based on their gesture and speech inputs [129]. Additionally, Efthymiou et al. [38] and Komatsubara et al. [87] deployed sensor arrays in children’s recreational and classroom settings, respectively, to perceive children’s status and inform more tailored play and education experiences.
Our survey also covered MMP systems employed in assistant tasks, such as robotic scrub nurses that hand instruments in a notional surgical task [75] and a pantry pick-and-place robot that hands food ingredients to a human collaborator in a notional household setting [167]. Multimodal interfaces can also be used to detect when a human collaborator needs assistance in human–robot collaboration tasks [168].

3.4 Information Provided by MMP HRI Systems

In this section, we present the results of the surveyed MMP systems in terms of the information they provide (in contrast to Section 3.3 which presented them in terms of application). Each of the surveyed systems utilized information about humans in HSEs—and in some cases, information about the environment itself—to more effectively interact with humans. For the purposes of this survey, we categorize information as attributes, status, communication, or context. Attributes refer to permanent or long-term characteristics of the person being observed which are unlikely to change during the course of the interaction. Status or state variables refer to quantities expected to vary over time, such as position, affect, or activity. Communication refers to observable quantities that explicitly convey information, like gesture or speech. Lastly, context refers to environmental information that is relevant to the interaction.
Example attributes that can be inferred are human identity, affiliation/role, demographic information, and soft biometrics. Facial recognition is the most commonly surveyed means of identifying humans [35, 52, 71, 72, 87, 105, 106, 110, 149]. D’Arca et al. [36] and Churamani et al. [35] perform voice identification by extracting and classifying audio features to infer the speaker’s identity. Multiple surveyed works [99, 102, 105, 110, 152] used visual data to estimate demographic information such as age and gender expression. To infer affiliation (e.g., employee or guest), Bohus et al. [20] visually analyze a human’s clothing. Likewise, Martinson et al. [108] use clothing recognition and height as soft biometrics to identify humans. Irfan et al. [71, 72] use a combination of age, gender, and height soft biometrics fused with facial recognition to infer identity. Shen et al. [143] estimate big five personality traits (Extroversion, Openness, Emotional Stability, Conscientiousness, and Agreeableness) with the intent of providing a more tailored HRI experience.
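As an illustration of fusing soft biometrics with facial recognition for identification (in the spirit of, but not reproducing, [71, 72, 108]), the following Python sketch scores gallery identities with a weighted sum of similarity terms; the gallery entries, weights, and height-similarity model are hypothetical.

import numpy as np

# Hypothetical gallery of known people; face and clothing similarity scores are
# assumed to come from upstream recognizers.
gallery = {
    "alice": {"height_m": 1.62, "face_sim": 0.71, "clothing_sim": 0.55},
    "bob":   {"height_m": 1.84, "face_sim": 0.42, "clothing_sim": 0.80},
}

def identity_score(observed_height, person, w_face=0.6, w_height=0.25, w_cloth=0.15):
    # Height similarity decays with the difference from the stored template.
    height_sim = np.exp(-abs(observed_height - person["height_m"]) / 0.1)
    return w_face * person["face_sim"] + w_height * height_sim + w_cloth * person["clothing_sim"]

scores = {name: identity_score(1.65, p) for name, p in gallery.items()}
print(max(scores, key=scores.get))  # best-matching identity ("alice")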
Common human status observations in the literature include tracking human location and movement [38, 45, 46, 123] and pose detection. Human localization can be accomplished through visual detection of the face or head [10, 11, 12, 14, 17, 18, 35, 42, 44, 45, 46, 47, 54, 59, 76, 102, 105, 106, 108, 110, 116, 114, 120, 122, 123, 149, 164], body or torso detection through vision or depth [36, 44, 47, 49, 80, 87, 97, 98, 101, 109, 120, 124, 154, 176], skeletal pose tracking [11, 18, 27, 28, 29, 34, 38, 42, 45, 46, 74, 100, 110, 150, 153, 156, 164], sound source (speech) localization [10, 12, 14, 18, 28, 38, 44, 45, 46, 47, 49, 54, 59, 73, 114, 120, 124, 149, 164], or by detecting legs or torso with a Light Detection and Ranging (LiDAR) sensor [18, 52, 80, 97, 116, 124, 154, 164, 176]. Multiple humans can be tracked simultaneously, as in [10, 28, 52, 80, 97, 98, 109, 116, 149, 154, 176], and multiple modes can be combined to increase the efficacy of the human tracking system. For example, LiDAR [97, 98, 124] and sound source localization (SSL) systems [10, 12] generally have a much larger spatial measurement range than cameras but lower measurement precision; however, a fused multimodal tracking system can leverage the best properties of each individual mode, providing increases in precision or range compared to unimodal systems.
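A simple way to see the precision benefit of fusing localization modes is inverse-variance (Kalman-style) fusion of independent position estimates, sketched below in Python; the example values for a LiDAR leg tracker and a camera torso detector are illustrative.

import numpy as np

def fuse_positions(estimates):
    """Inverse-variance fusion of independent 2D position estimates.
    estimates: list of (mean_xy, variance) pairs, one per modality."""
    weights = np.array([1.0 / var for _, var in estimates])
    means = np.array([mean for mean, _ in estimates])
    fused_mean = (weights[:, None] * means).sum(axis=0) / weights.sum()
    fused_var = 1.0 / weights.sum()  # always smaller than any single-mode variance
    return fused_mean, fused_var

lidar_est = (np.array([2.10, 0.95]), 0.04)   # long range, coarser
camera_est = (np.array([2.02, 1.01]), 0.01)  # limited field of view, finer
print(fuse_positions([lidar_est, camera_est]))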
Human emotion, intentions, and activity are other status variables that can be inferred through exteroceptive modalities. A human’s emotional state can be estimated by classifying facial expressions [99, 102, 110, 126, 152], speech audio features [123], or both [95, 104, 125]. Similarly, body pose and facial movement features can be fused to classify the emotional states of adults [126] and children [42]. One surveyed project used thermal imaging to estimate the peripheral neuro-vegetative activity of children and infer their emotional state [41]. Several surveyed systems were capable of estimating the intentions and engagement level of nearby humans, for example, to determine if the human is interested in initiating or ending an interaction with a robot. Pose, gaze, proxemics, speech activity, and gesture activity can all serve as cues to indicate intentions or engagement [11, 17, 18, 21, 27, 34, 44, 45, 46, 47, 105, 106, 110, 124, 164], or if a human teammate needs assistance in a collaborative task [168]. Activity recognition was performed by classifying visual, pose, or gesture cues for individuals [38, 74, 83, 100, 110, 134, 150, 186, 187] and groups [103, 131], and similar cues can be used to predict the motion of a group of individuals [178].
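The engagement-estimation pattern described above (pooling asynchronous cues into a fixed-length feature vector and classifying it) can be sketched as follows; the cue names, window length, and toy training data are hypothetical, and the random forest stands in for any of the classifiers used in surveyed works.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 2.0  # seconds; pool the most recent cue values within this window

def pool_features(buffers, now):
    """Pool asynchronous unimodal cues into one vector by taking each modality's
    most recent value inside the window (0.0 if no recent observation)."""
    feats = []
    for stream in ("gaze_on_robot", "distance_m", "speech_active"):
        recent = [value for t, value in buffers[stream] if now - t <= WINDOW]
        feats.append(recent[-1] if recent else 0.0)
    return np.array(feats)

# Toy training data: pooled feature vectors with engagement labels (1 = engaged).
X = np.array([[1.0, 0.8, 1.0], [0.0, 3.5, 0.0], [1.0, 1.2, 0.0], [0.0, 2.8, 1.0]])
y = np.array([1, 0, 1, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

buffers = {"gaze_on_robot": [(9.8, 1.0)], "distance_m": [(9.5, 1.1)], "speech_active": [(8.2, 1.0)]}
print(clf.predict([pool_features(buffers, now=10.0)]))  # e.g., array([1])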
Another category of information that robots and intelligent systems can obtain from humans is communication, which includes information exchange via commands or declarations. Automatic speech recognition (ASR) was the most commonly used communication modality, and a variety of ASR options exist depending on the application (as discussed in Section 3.6) [45, 46, 75, 99, 102, 110, 123, 126, 149, 152]. Gestures were the second most common communication modality [38, 75, 110, 126], sometimes combined with speech [1, 2, 101, 139]. Touch screens are another communication modality suitable for HSEs, since humans can exchange information intuitively using the robot’s onboard hardware as in [37, 43, 45, 46, 122, 123]. In surveyed works, communication was primarily used to issue commands to the robot (e.g., to perform a specific action) [1, 2, 75, 101, 139, 153] or to identify objects in the environment (also called referring expressions) [167].
Lastly, context is another type of information relevant to HRI in HSEs, since humans are often situated among other agents and objects. Examples include Nigam et al., who classify the overall social context in a university environment (studying, eating, or waiting) to determine if it is a good time to initiate an interaction [117]. The SPENCER robot estimates human group affiliations in an airport, which enables the robot to more accurately predict individual movements within the group since they travel together [80, 154]. The Robotinho museum guide robot considers the distances and poses of all humans in the tour group to determine who the robot should address and only tracks a single individual within the group instead of the entire group [116]. Object detection also provides contextual information. In collaborative human–robot tasks in HSEs, monitoring the status of task-specific objects gives insight into the human teammate’s performance. For example, Pereira et al. [121] monitor the status of game pieces in a joint human–robot game. Object detection can also provide contextual clues about human activities [27, 35, 131]. As the authors note, a sitting individual could be either eating or drinking, and the detection of food or drink, respectively, helps differentiate the two activities.
In practice, most human interaction systems simultaneously observe a combination of information types. The situated nature of human interaction means that multiple quantities must often be observed as a prerequisite for the application. Both [156] and [49] fuse simultaneous location and spoken command data, while [36] combines identity (an attribute), location (a status), and spoken communication. The virtual assistant/trivia game host in [20, 21] serves as a testbed for situated multiparty human dialog research and incorporates identity, role, attention/focus, goals, and spoken communication to properly interact with the human. Similarly, Yumak’s virtual assistant/robot system [180, 181, 182] is a testbed for multiparty HRI research and observes human identity (an attribute), location and gaze direction (statuses), and spoken communication to study complex HRI phenomena like turn-taking, focus of attention, and spatial formations.
Fusing multiple categories of information about a human can yield improved perceptual accuracy, since humans can convey mutual information implicitly through their status and explicitly through spoken words or gestures. An example is [101], where observing the human’s body motion (a status) improved command recognition accuracy (communication).
Sections 3.5–3.9 provide more technical detail about how data from each perception modality is acquired, processed, and fused to generate information.

3.5 Visual Data Acquisition and Processing

Vision was the most commonly used modality in the surveyed literature and provides a wealth of information relevant to HRI in HSEs. This section considers the vision components of multimodal systems in the survey; for a more comprehensive treatment of robotic vision systems in HRI alone, see the survey and review by Robinson et al. [135].

3.5.1 Vision Sensors.

Red–green–blue (RGB) [1, 2, 20, 21, 28, 36, 42, 52] and RGB-depth (RGB-D) cameras [29, 44, 47, 80, 109, 120, 121, 150, 154] were the most commonly used visual sensors in surveyed works. Universal Serial Bus webcams (e.g., the Logitech HD Pro C920 [149]) can provide RGB data [76], while other commercial RGB cameras used include the Microsoft LifeCam Studio HD [123], the Point Grey FIREFLY [108], and an Omron camera [87]. Additionally, many of the social robot platforms surveyed have embedded RGB cameras, such as Pepper [17, 71, 72, 105, 106, 143], Nao [71, 72], Nico [35], and Robotinho [116]. The Microsoft Kinect RGB-D camera was the most commonly used sensor in the entire survey [11, 12, 18, 27, 38, 43, 54, 75, 86, 87, 98, 117, 124, 126, 156, 164, 180, 181, 182]. Other RGB-D sensors in the surveyed literature include the ASUS Xtion Pro and its variants [34, 97, 123, 176], the Intel RealSense D435 [45, 46, 153], the Intel RealSense SR300 [126], and the Orbbec Astra Embedded S. Less common visual sensors include stereo cameras [44, 47, 120] (one of which is embedded in the iCub robot [14, 57]), time-of-flight depth cameras [114], the FLIR Lepton long-wave infrared camera [41], and the LeapMotion LEAP controller, a stereo infrared camera system designed specifically for hand tracking, used in [101, 127].

3.5.2 Visual Data Processing.

Surveyed works used various commercial, open-source, and custom techniques to process visual data and compute relevant HRI information.
Body, head, and hand pose estimation were common functions in surveyed works. Skeletal pose tracking is offered natively through the Kinect API as in [18, 27, 38, 86, 110, 127, 156, 164, 180, 181, 182] and provides spatial locations of selected joints (torso, shoulders, head). OpenPose [24] is another means of estimating human poses, gestures, and gaze via keypoint extraction. OpenPose is open-source, platform-agnostic and accepts any RGB or RGB-D image as input, as used in [28, 42, 153]. The Pepper robot’s software development kit (SDK) supports human localization and motion estimation and was used for body and head motion estimation in [143]. The OpenNI framework is another open-source pose recognition option and provides skeletal pose tracking and hand point tracking as used in [34, 43, 110, 167]; however, its last official update was in 2012. Banerjee et al. [11] and Foster et al. [45, 46] implement pose recognition using convolutional pose machines (CPMs). Head pose can be computed with the open-source OpenFace framework as in [17, 76, 110] or OpenHeadPose as in [45, 46]. Foster et al. use RGB least squares matching to estimate head position [44, 47, 120]. Ragel et al. use the MediaPipe tracker to get hand pose keypoints from RGB-D data in [127]. Lastly, the LEAP provides hand pose tracking natively as in [101, 127].
Pose information can be used to locate humans, recognize gestures, and infer their focus of attention. Chau et al. [28] use OpenPose to compute spatial pose keypoints of nearby humans from a mono RGB camera input; they compute the human’s 2-dimensional (2D) location by projecting the 3-dimensional (3D) keypoints into a 2D plane around the sensor platform. Likewise, Terreran et al. [150] compute skeletal pose keypoints of an RGB-D image with OpenPose, then use 2D-to-3D lifting and projection to project 2D keypoints onto the depth channel, providing a 3D pose estimate. Notably, OpenPose also provides a confidence measure which can be used to weight the visual pose detection in multimodal fusion applications as in [28]. The Hobbit elderly care robot computes a 3D bounding box around skeletal pose keypoints, then uses this spatial information to detect if the human has fallen [43]. An even simpler use of skeletal pose tracking is to estimate the number of humans in the scene by counting the detected skeletal poses, as Jain et al. did [76]. Simple gesture recognition can be accomplished via geometric modeling. For example, Kollar et al. [86] recognize the “hand raised” gesture if the wrist is above the shoulder, and Whitney et al. [167] compute the shoulder-to-wrist vector to recognize pointing gestures. OpenNI includes basic gesture recognition (swipe left/right, pointing) [43]. Supervised learning techniques can also be used for gesture recognition; Trick et al. use an SVM to classify OpenPose joint trajectories as gestures [153]. Paez et al. use skeletal pose recognition to classify “emotional gestures” such as stretching arms, hands on chin, or rubbing hands together [126]. Chao et al. use a geometric model to compute the angle between a human’s head and shoulders relative to hips, then use this to infer if the human is facing the robot or not [27]. Features extracted from pose data can also infer emotional state; Filntisis et al. use a DNN to extract affective features from children’s skeletal pose data, which is fused with facial features to classify their emotional state [42].
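To make the geometric approach concrete, the following minimal sketch shows how a “hand raised” gesture and a pointing target might be computed from 3D skeletal keypoints, in the spirit of the rule-based methods above; the joint names, coordinate convention (z up), and object list are illustrative assumptions rather than details from the surveyed works.

```python
import numpy as np

# Minimal sketch of geometric gesture recognition from 3D skeletal keypoints.
# Joint names, coordinate conventions, and example values are illustrative.

def hand_raised(joints: dict) -> bool:
    """Detect a 'hand raised' gesture: wrist above the shoulder (z is up)."""
    return joints["right_wrist"][2] > joints["right_shoulder"][2]

def pointing_direction(joints: dict) -> np.ndarray:
    """Approximate a pointing ray as the unit shoulder-to-wrist vector."""
    v = np.asarray(joints["right_wrist"]) - np.asarray(joints["right_shoulder"])
    return v / np.linalg.norm(v)

def pointed_object(joints: dict, objects: dict) -> str:
    """Return the object whose direction best aligns with the pointing ray."""
    ray = pointing_direction(joints)
    origin = np.asarray(joints["right_shoulder"])
    best, best_cos = None, -1.0
    for name, pos in objects.items():
        to_obj = np.asarray(pos) - origin
        cos = np.dot(ray, to_obj) / np.linalg.norm(to_obj)
        if cos > best_cos:
            best, best_cos = name, cos
    return best

# Example usage with made-up keypoints (meters, z up):
joints = {"right_shoulder": [0.2, 0.0, 1.4], "right_wrist": [0.6, 0.1, 1.2]}
objects = {"cup": [1.5, 0.3, 0.8], "book": [0.0, -1.0, 0.8]}
print(hand_raised(joints), pointed_object(joints, objects))
```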
Detection of human centroids is another visual processing technique found in the literature [10, 12, 20, 21, 36, 44, 47, 49, 80, 87, 97, 98, 120, 124, 154, 176] and typically involves supervised learning of labeled datasets while leveraging both depth and color from RGB-D data. Models such as DarkNet53 can fuse RGB and Depth features, then YOLOv3 can be used to detect 3D human centroids or 2D bounding boxes from the fused RGB-D features as in [98]. Fung et al. [48] use a multimodal YOLOv4 (MYOLOv4) implementation to extract and fuse RGB and Depth features to robustly detect 3D human body part positions in offline datasets. Similarly, Martinson et al. use AlexNet to localize humans in RGB, plus a custom depth-based segmentation, geometric feature extraction, and Gaussian mixture model (GMM) classifier to localize humans from the Depth channel [109]. They fuse RGB and Depth detections using a weighted sum and compare the fused performance with individual modes. Segmenting the depth channel into foreground and background, then using a depth-template or histogram of oriented gradients (HoG) upper-body detector on candidate regions was done in [80, 97, 176, 154] for near-field torso detections; far-field detections used a HOG detector on RGB data since depth data is unreliable beyond approximately 5 m [80, 97, 176, 154]. Komatsubara et al. [87] use a custom head-and-shoulders shape detection from RGB-D data; likewise, the JAMES bartending robot [47, 44, 120] uses 2D RGB silhouettes and 3D depth information to estimate 3D shoulder locations.
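As a rough illustration of the near-field/far-field split described above, the sketch below runs OpenCV's default HOG pedestrian detector on the RGB channel and uses the registered depth channel only to gate detections to an assumed reliable range; the full-body detector (in place of an upper-body template) and the 5 m cutoff are simplifications, not the exact pipelines of the cited systems.

```python
import cv2
import numpy as np

# Minimal sketch: HOG-based person detection on RGB, gated by depth validity.
# The full-body HOG detector stands in for the upper-body detectors cited
# above, and the 5 m cutoff mirrors the reported reliable depth range; both
# are illustrative simplifications.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(rgb: np.ndarray, depth_m: np.ndarray, max_depth: float = 5.0):
    """Return (x, y, w, h, median_depth) for detections with usable depth."""
    boxes, _ = hog.detectMultiScale(rgb, winStride=(8, 8))
    detections = []
    for (x, y, w, h) in boxes:
        patch = depth_m[y:y + h, x:x + w]
        valid = patch[np.isfinite(patch) & (patch > 0)]
        if valid.size == 0:
            continue  # no depth; a far-field RGB-only branch could handle this
        d = float(np.median(valid))
        if d <= max_depth:
            detections.append((x, y, w, h, d))
    return detections
```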
Nakamura et al. [114] localize humans using thermal and depth data in a Thermal-Distance integrated localization model which uses separate modalities to reduce false positives. First, a binary mask is applied to the thermal points to separate hot (above temperature threshold and presumably human) from cold pixels. Hot pixel regions are then mapped onto the depth image, and the background depth is subtracted. The resultant clusters of 3D points are considered human.
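A simplified version of this threshold-and-mask idea can be written in a few lines; the temperature threshold, depth margin, and assumption of pixel-registered thermal and depth images below are illustrative choices, not values from [114].

```python
import numpy as np

# Minimal sketch of thermal/depth gating for human detection, loosely in the
# spirit of the Thermal-Distance idea described above. Assumes the thermal and
# depth images are already registered pixel-to-pixel; thresholds are assumed.
def human_candidate_mask(thermal_c: np.ndarray, depth_m: np.ndarray,
                         background_depth_m: np.ndarray,
                         temp_thresh: float = 28.0,
                         depth_margin: float = 0.2) -> np.ndarray:
    hot = thermal_c > temp_thresh                                # warm pixels
    foreground = depth_m < (background_depth_m - depth_margin)   # background subtraction
    valid = np.isfinite(depth_m) & (depth_m > 0)
    return hot & foreground & valid        # clusters of True pixels ~ human candidates
```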
Once detected, body centroid locations can be used to localize and track humans or estimate the person’s focus of attention. For example, Linder et al. [97] and Yan et al. [176] project 3D torso detections onto a 2D plane surrounding a mobile robot, then fuse them with LiDAR leg detections in a multihypothesis tracker to localize and track multiple humans around the robot. The JAMES bartending robot [44, 47, 120] computes the angle between a human’s torso and the robot to infer if the human is facing the robot.
Facial detection on RGB images is another common function in surveyed literature [54, 122]. Face detection provides a 2D estimate of face locations in the RGB frame and is supported by open-source packages such as OpenCV as in [18, 149, 164], OpenFace as in [42, 76, 102, 110], and face_recognition as in [127], as well as by the Pittsburgh Pattern Recognition (PittPatt) SDK as in [108]. The most common facial detection algorithm was the Haar feature-based cascade classifier, also called the Viola-Jones algorithm [12, 35, 96, 116, 123, 165]. Haar-like features are simple geometric classifier patterns which, when cascaded into stages of increasing complexity, can rapidly search an image and detect complex visual patterns like faces. The algorithm is popular for real-time usage on robots and embedded systems because of its speed and simplicity. Other facial detection techniques include deep networks [14, 59] (which can also infer gaze direction as in [11]) or custom blob tracking of facial regions [44, 47, 120]. Both Martinson et al. [108] and Portugal et al. [123] project the 2D facial detection onto a 3D depth image to estimate the 3D location of the detected face. The Pepper robot’s SDK supports facial detection natively [105, 106].
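The Viola-Jones detector is available out of the box in OpenCV. The sketch below detects faces in an RGB frame and then, in the spirit of [108, 123], back-projects each 2D detection into 3D using a registered depth image and assumed pinhole intrinsics; the intrinsic values are placeholders.

```python
import cv2
import numpy as np

# Minimal sketch: Viola-Jones face detection on RGB, then back-projection of
# each detection into 3D using a registered depth image and assumed pinhole
# camera intrinsics (fx, fy, cx, cy are placeholder values).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces_3d(rgb, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        u, v = x + w // 2, y + h // 2            # face center in pixel coordinates
        z = float(depth_m[v, u])                 # depth at the face center (meters)
        if z <= 0 or not np.isfinite(z):
            continue
        X = (u - cx) * z / fx                    # pinhole back-projection
        Y = (v - cy) * z / fy
        results.append((X, Y, z))
    return results
```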
Detected faces can be further processed to infer relevant HRI information such as identity, emotion, gaze, or soft biometrics. Commercial options for identification via face recognition include SoftBank Robotics’ NAOqi recognition module in [71, 72], and OKAO Vision, which computes Gabor wavelet transform features of each face and classifies them using an SVM [52, 87]. Dlib is an offline, open-source facial recognition option [110]. Other surveyed works implemented facial recognition by classification of local phase quantization [149] or local binary pattern (LBP) features [35, 118]. Commercial options for facial emotion recognition include FaceReader from Vicar Vision [99, 152], Microsoft Cognitive Services’ Emotion Recognition API [102], the Affdex SDK [126], and the eMax face analysis toolbox [95], which require a cloud connection. Supervised classifiers can also estimate emotion; Prado et al. [125] use principal component analysis (PCA) to extract features from facial keypoints (eyebrow, cheek, chin, mouth, etc.) and classify emotion via a dynamic Bayesian network (DBN), and Paez et al. use facial gestures (turn head, flexion, close left/right/both eyes) as features to indicate emotional state. Cloud-based age and gender demographic estimation are available through FaceReader, Microsoft Cognitive Services’ Face API [127], Pepper’s SDK [105, 106], and SoftBank Robotics’ NAOqi framework recognition module. OpenVINO has pre-trained models to estimate emotion and age/gender demographics without a cloud connection, as described in [110]. Gaze estimation is available offline via OpenFace [17, 76, 110, 168] and GazeSense [121]. Banerjee et al. use a cascaded deep network to estimate gaze and face location [11]; the SPENCER robot [80, 154] also uses a supervised classifier—a SVM that classifies difference of oriented Gaussians features in an RGB image—to estimate approximate gaze direction (left, right, back, front). The virtual assistant of Bohus et al. [20, 21] uses a gaze model trained on labeled images to infer if nearby humans are looking at or away from the system. Face detections can also be used to infer soft biometrics such as a human’s approximate height as in [71, 72, 108], although the location and orientation of the camera relative to ground plane must be known. Martinson et al. also use color histograms of the face as soft biometrics [108].
Miscellaneous facial image processing in the surveyed literature includes facial action unit computation, which can be done by OpenFace [17, 45, 46, 76] and used to classify emotion. Filntisis et al. use ResNet 50 to extract facial emotion/affective features, which they fuse with DNN-extracted features from skeletal pose data and classify via a fully connected neural network (FCNN) layer to estimate child emotional state [42]. Gomez et al. compute mouth movement activity to estimate if a human is speaking [54]. In a notable application of early fusion, Filippini et al. [41] used a fused RGB-Thermal image to locate a face using an RGB HOG detector and a regression tree ensemble to locate RGB facial keypoints; a FIR filter extracted the associated thermal features of the keypoints, which were then classified using a multi-layer perceptron (MLP) to estimate children’s emotional states.
A less common visual processing technique was hand detection and hand gesture recognition. Abioye et al. use an OpenCV Haar cascade to locate hands visually, then a convex hull to count fingers [1, 2], where each gesture indicates a specific robot command. Jacob et al. use a custom fingertip detector; the trajectory of each fingertip is smoothed using a Kalman Filter, and the trajectory’s speed and curvature are classified using a hidden Markov model (HMM) classifier [75], where each gesture corresponds to a specific command to a robotic assistant. Foster et al. use a custom blob tracker on the JAMES bartending robot to locate nearby humans’ hands, which are used as features to estimate the human’s overall status. Chao et al. compute relative motion of tracked hands and use this to estimate gesture activity (but not specific gestures). Gesture activity is then used to estimate overall human activity and engagement [27].
Aside from the capabilities listed above, visual data features can be classified via supervised learning techniques to recognize gestures and activity or fused with features from other modalities for multimodal classification. Feature extraction techniques in the surveyed works consisted of scene flow [139], space-time interest points [91, 139], or visual embedded vectors [70]. Processed visual data can be classified as a specific gesture using HMM and SVMs [139], LSTM networks, or transfer-learning trained networks [101]. Similarly, multiple surveyed works compute dense trajectory features, using histogram of optical flow and motion boundary histogram descriptors, which are then encoded via a bag-of-visual words or vector of locally aggregated descriptors framework and classified via an SVM [38, 83, 156, 187, 186] to detect gestures for children and the elderly. Liu et al. use attention layers to extract RGB spatial features and LSTM layers to extract skeletal pose features, which are concatenated and classified with an FCNN to estimate human activity [100]. Chu et al. use motion cues as features; motion cues are a simple geometric computation between successive joint positions and angles and are used to detect activity [34]. Islam et al. use ResNet 50 to extract spatial features from RGB and Depth images and LSTMs to embed the temporal features. The features are fused in a custom multimodal self-attention mechanism which classifies the activity seen in the RGB-D image [74]. Similarly, Robinson et al. [134] use a transformer network to extract, fuse, and classify activities of daily living (ADL) of elderly persons in RGB-D video. Reily and Lu compute HoG features for RGB, depth, and thermal images, then fuse them in a bag-of-words representation for each human. The individual and group activities are then inferred based on a weighted linear combination of these features [103, 131]. The mobility robot (MOBOT) uses OpenCV to compute optical flow variables in an RGB image, which is used to detect regions that contain activity [83, 187, 186]. Chen et al. [29] extract local RGB features using an HRNet backbone and global features using a CNN; these features are concatenated with a LiDAR feature vector to robustly estimate human pose in visually obscured conditions.
Supervised classification of visual features can also infer engagement or group affiliation. Vaufreydaz [164] and Benkaouar [18] use skeletal pose data to compute the distance from human to robot and Schegloff metrics—pose features based on computing the relative positions between hips, torso, and shoulders keypoints—then fuse these features with other modalities to classify if the human intends to engage with the robot. Banerjee et al. compute body position and orientation from tracked pose keypoints, then fuse these with head orientation, gaze direction, facial bounding box features, and audio cues. These fused features are input to multiple supervised classifiers that estimate a human’s interruptability, or if they are willing to engage with the robot [11]. Nigam et al. simply use grayscale values from RGB pixels and fuse them with audio features to estimate the overall social context and if humans are willing to engage with the robot [117]. Bohus et al. analyze the variance in the RGB pixel values of each human’s clothing to infer group affiliation; business attire (e.g., suits) resulted in higher RGB variance, and the person was assumed to be a visitor since members of their organization dressed casually [20, 21]. Likewise, RGB color histograms of clothing can be used as soft biometrics to help discern identity as in [108].
Most of the surveyed systems employed multiple visual processing techniques simultaneously. For example, the virtual agent/humanoid robot system of Yumak et al. [180, 181, 182] uses several visual processing techniques to localize humans, identify them, and estimate their gaze direction. For localization, they use Kinect’s skeletal tracking which locates joint keypoints. Similar to Bohus et al. [20, 21], Yumak’s system uses feature extraction and classification of a set of visual pixels to identify humans. LBP and hue saturation value (HSV) color features [166] are extracted for a patch of pixels around each skeletal joint. The LBP and HSV features are concatenated into a single vector (feature-level or intermediate fusion) for each person; a k-nearest neighbor (k-NN) classifier matches the visual features to an identity. To estimate gaze direction, they use a regularized maximum likelihood deformable head model fitting algorithm [23]. The rotation matrix and the translation vector of the head model are iteratively optimized using a method similar to iterative closest point, and then used to estimate the pose of the head. Head pose is then used to infer focus of attention as described in Section 3.9.
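As a simplified illustration of this feature-level identification pipeline, the following sketch concatenates an LBP texture histogram and an HSV color histogram for an image patch and matches the fused vector to enrolled identities with a k-NN classifier; the patch selection, histogram sizes, and neighbor count are assumptions rather than the parameters used in [180, 181, 182].

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

# Minimal sketch of feature-level (intermediate) fusion for re-identification:
# concatenate LBP texture and HSV color histograms of an image patch, then
# match against enrolled identities with k-NN. Parameters are illustrative.
def patch_features(patch_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hue_hist, _ = np.histogram(hsv[..., 0], bins=16, range=(0, 180), density=True)
    return np.concatenate([lbp_hist, hue_hist])     # fused feature vector

# Enrolment and matching (enrolled_patches, labels, and query_patch are
# assumed to exist in the calling code):
# knn = KNeighborsClassifier(n_neighbors=3)
# knn.fit([patch_features(p) for p in enrolled_patches], labels)
# identity = knn.predict([patch_features(query_patch)])[0]
```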

3.6 Audio Data Acquisition and Processing

Audio perception (or audition) is another common modality in MMP HRI applications and includes functions such as SSL, sound source separation (SSS), ASR, and voice activity detection (VAD). Despite being a relatively mature field, designing computer audition systems that can handle noise, reverberation, and multiple simultaneous sound sources remains a challenge in applied robotics research [61, 63, 64, 112]. This section discusses the audition systems present in fused multimodal systems; for more information on unimodal audition systems, refer to Rascon and Meza’s SSL survey [130] or to the review by Zhang et al. of nonverbal sound in HRI [183].

3.6.1 Audio Sensors.

In surveyed MMP HRI applications, a computer audition hardware setup typically comprises an array of multiple microphones when SSL is performed. This includes commercial products like the Microsoft Kinect’s 4-channel microphone array, used in [12, 18, 38, 44, 47, 124, 156, 164], the 2-channel array in the Asus Xtion Pro Live [123], the 7-channel Microcone [12], or the 8-channel TAMAGO-03 [28]. Other surveyed works used custom linear arrays of 4 microphones [20, 21], custom 6-channel arrays [127], 6–8 microphones organized into pairs [10, 36, 49], circular arrays [180, 181, 182], or 8-channel linear arrays [83, 187].
Some surveyed social robot platforms contain embedded arrays in support of audition functions. Pepper has a 4-channel array embedded in the head [17, 45, 46, 143], and Curi has two microphones embedded in the body [34], while Honda Research Institute’s Hearbo has a 16-channel array [54, 114]. Binaural microphone arrays are another common configuration and are usually found on humanoid or anthropomorphic robots such as the SIG2, NAO, or iCub systems [14, 57, 148, 179]. Eight to sixteen total audio channels may be desirable if a high degree of noise cancellation or SSL accuracy is required [67, 69, 140, 159]. The largest array in surveyed works was comprised of two 8-channel arrays and two 16-channel arrays, used for precise SSL in an indoor setting [73].
Mono-channel microphone systems were used in surveyed works where SSL was not required, since sound sources cannot be localized using a single audio channel. This configuration was used in applications where the robot interacts with a single person, such as: the UAV control interface of Abioye et al. [1, 2]; the robot of Churamani et al., which uses a single audio channel for voice identification and speech recognition; the Curi robot, which uses audio cues from a single human to determine engagement [27]; robots which play joint games with a single human, as in [76, 121, 168]; or Nigam et al., who analyze monochannel features to classify the audio scene [117]. The virtual agent/humanoid robot system of Yumak et al. [180, 181, 182] also used a single standalone microphone. A unique monochannel audio implementation is the Roboceptionist robot [86], which used an Android tablet’s microphone for speech recognition.

3.6.2 Audio Data Processing.

SSL was a common computer audition function in surveyed works, and various SSL algorithms exist for practical use on robots and intelligent systems. Several surveyed works [12, 28, 54, 73, 114, 124, 127] used variants of multiple signal classification (MUSIC) techniques (such as generalized eigenvalue decomposition-MUSIC or generalized singular value decomposition-MUSIC) for SSL in their applications. All were implemented using HARK middleware (Honda Research Institute Japan Audition for Robots with Kyoto University) [113]. HARK includes ambient noise filtering via extended minimum variance distortionless response, as in [127]. Note that MUSIC requires an estimate of the transfer function between the audio source and microphones, which becomes increasingly complicated for multidimensional SSL. Hence, several works [20, 28, 114, 124] computed the sound source direction of arrival (i.e., 1-dimensional (1D) SSL) in 5–10° increments. Notable modifications include Nakamura et al. [114], who include sound source identification via a hierarchical Gaussian mixture model to ignore non-speech sources, and Chau et al. [28], who extract features via DNN-based spectral mask estimation to increase noise robustness.
Other SSL techniques include the one used by D’Arca et al. [36], who compute the inter-channel time differences of arrival using generalized cross-correlation with phase transform (GCC-PHAT) and then use beamforming techniques to localize speakers indoors. Beamforming techniques were used for other indoor speech localization applications, such as the child–robot interaction systems in Tsiami et al. [156] and Efthymiou et al. [38] that used a steered response power with phase transform (SRP-PHAT) beamformer, and the MOBOT’s perception platform [83, 187]. Likewise, Yumak et al. [180, 181, 182] use a speech harmonicity beamforming algorithm [170] to compute speech direction-of-arrival from an 8-channel circular microphone array. Beamforming SSL and SSS can be implemented using the Open Embedded Audition System (ODAS), another free, open-source computer audition package [62]. Gebru et al. [49] use a supervised learning approach; they extract the auto- and cross-spectral density features between two binaural channels, then train a localization model using a labeled multimodal dataset. Likewise, Foster et al. incorporate SSL on a Pepper robot by using a supervised neural network [45, 46]. Belgiovine and Gonzalez-Billandon et al. also use a DNN for coarse 1D SSL, classifying the direction of arrival (DoA, or azimuth) into one of three zones in front of the robot [14, 58]. Lastly, the Microsoft Kinect can compute the sound source DoA through its API, as done in [18, 44, 47, 164], or when used within a networked array of multiple sensors as in [186]. For additional details on the categories, algorithms, and implementation of SSL, we recommend Rascon and Meza’s survey [130].
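To illustrate the cross-correlation family of SSL methods mentioned above, the following sketch estimates a GCC-PHAT time difference of arrival between two microphone channels and maps it to a coarse azimuth for an assumed microphone spacing; it omits the beamforming, multi-source handling, and noise filtering of the surveyed systems.

```python
import numpy as np

# Minimal sketch of GCC-PHAT between two microphone channels, converted to a
# coarse azimuth estimate. Assumes a single far-field source and a known
# microphone spacing; real systems add beamforming and multi-source handling.
def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: int, max_tau: float) -> float:
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                         # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

def azimuth_deg(tdoa: float, mic_distance: float, c: float = 343.0) -> float:
    """Map a TDOA to a direction of arrival for a two-microphone pair."""
    return float(np.degrees(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0))))

# Example usage for a 0.1 m pair spacing (ch_left / ch_right are 1D arrays):
# tau = gcc_phat_tdoa(ch_left, ch_right, fs=16000, max_tau=0.1 / 343.0)
# print(azimuth_deg(tau, mic_distance=0.1))
```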
ASR is the second most common audio processing function in the surveyed literature, with a range of techniques in use. For dialog-heavy, cloud-connected applications, commercial options include Microsoft Cloud speech recognition [20, 21, 121], Loquendo (now Dragon from Nuance) [122, 143, 116], iFlyTek Speech API [102], and Google Cloud’s speech-to-text API [35, 45, 46, 95, 105, 106, 127] (which has a Python software wrapper [149]). Offline, open-source ASR options are available in the Kaldi deep learning model [99, 152], DeepSpeech [168], or the Carnegie Mellon University (CMU) Sphinx framework [1, 2, 75, 123]. Julius is another open-source ASR framework which can be used offline and, along with Kaldi, features HARK integration. Audio feature extraction and supervised classification can also be used to generate custom acoustic and language models for ASR. These types of ASR models are typically used on embedded systems where large vocabulary recognition is not required and a smaller set of words or commands is sufficient. The most commonly used audio features are mel-frequency cepstral coefficients (MFCCs), which approximate the nonlinearities of human hearing in the frequency domain. MFCC computation is a well-documented process, included natively by HARK and other open-source audio processing software like Torchaudio. A 2D data tensor with MFCCs on one axis and time on the other can be visualized as a spectrogram and classified using supervised learning methods to infer the phonemes spoken. Sanchez-Riera et al. [139] extract MFCCs from speech, then classify the MFCC vector as a discrete robot command using HMMs and SVMs. Likewise, Liu et al. [101] classify MFCCs as a discrete set of spoken commands using a CNN. Similarly, Trick et al. compute the MFCCs of speech audio, then use the PyTorch-based Honk framework, a supervised CNN classifier, for small vocabulary speech recognition [153]. Tsiami et al. [156] use extracted MFCCs and their first temporal derivative (MFCC+\(\Delta\)) features as inputs to a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) to recognize phonemes in a speech recognition system. Kardaris and Zlatintsi [83, 186, 187] use a similar approach, extracting MFCC+\(\Delta\) features and classifying them with an N-best grammar-based language model using the HTK toolkit.
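The small-vocabulary pattern described above can be prototyped with standard audio tooling. The sketch below extracts MFCC+\(\Delta\) features (using librosa here rather than HARK or Torchaudio) and classifies mean-pooled features with an SVM, which stands in for the HMM and CNN classifiers of the surveyed works; file names, labels, and parameters are illustrative.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

# Minimal sketch of small-vocabulary spoken-command recognition: MFCC + delta
# features, mean-pooled over time, classified with an SVM. librosa is used
# here for brevity; the surveyed works use HARK, Torchaudio, HMMs, or CNNs.
def mfcc_delta_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)              # first temporal derivative
    feats = np.vstack([mfcc, delta])                 # shape: (2 * n_mfcc, frames)
    return feats.mean(axis=1)                        # crude temporal pooling

# Training and prediction (training_files and training_labels are assumed):
# X = np.stack([mfcc_delta_features(f) for f in training_files])
# clf = SVC(kernel="rbf").fit(X, training_labels)    # e.g., "stop", "follow"
# command = clf.predict([mfcc_delta_features("query.wav")])[0]
```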
Audio feature extraction and analysis can also be used to infer emotion, speech activity, personality traits, or identity. Open Emotion and Affect Recognition toolkit provides an online means of classifying emotional content in audio [123]. Offline options include Ma et al. [104], who extract and classify MFCCs to detect emotional content of speech using a trained supervised classifier model. The Praat toolbox can extract signal information like pitch, frequency, intensity/volume, harmonicity, and duration; Prado et al. use Praat-extracted features as input to a DBN which classifies the emotional content of the speech. The openSMILE toolbox can also extract audio features for emotional classification as in [95, 110], or for engagement detection as in [17]. Jain et al. use Praat-extracted audio features to classify engagement level. Sound cues—the difference in amplitude between successive samples—are a relatively simple audio feature which can also be used to infer human engagement as in [34]. Chao et al. use a Pure Data module to compute pitch [27]; Maniscalco et al. perform root mean square analysis of the audio signal [106]; both then use these features to detect speech activity. The Pepper robot used in the MuMMER project uses a supervised neural network to detect speech activity [45, 46]. Shen et al.’s Pepper implementation extracts MFCCs, voice energy and voice pitch, which are fused with visual features to estimate the human’s “Big Five” personality traits. Speaker identification (also called voice identification) through audio data can be accomplished via use of a CNN classifier as in [35], an implementation of i-vector feature classification in the ALIZE 3.0 framework [149], or by MFCC computation and GMM classification, trained on a 60-second speech sample for each speaker [36]. In [35], the authors note that MFCCs may be unreliable in noisy HSEs, so learned CNN features may be more suitable and also avoid the additional preprocessing step of computing MFCCs.
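The simple energy-based cues mentioned above (amplitude differences, RMS analysis) translate directly into code; the sketch below flags speech activity whenever short-term RMS energy exceeds a margin above an estimated noise floor, with the frame length, percentile, and margin as assumed values.

```python
import numpy as np

# Minimal sketch of energy-based voice activity detection: a frame is marked
# as speech when its short-term RMS exceeds a margin above an estimated noise
# floor. Frame length, percentile, and margin are illustrative values.
def frame_rms(signal: np.ndarray, frame_len: int = 512) -> np.ndarray:
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames.astype(float) ** 2, axis=1))

def voice_activity(signal: np.ndarray, margin: float = 3.0) -> np.ndarray:
    rms = frame_rms(signal)
    noise_floor = np.percentile(rms, 10)   # assume the quietest frames are noise
    return rms > margin * noise_floor      # boolean speech/non-speech per frame
```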
Lesser-used audio processing techniques include social context classification and lexical feature extraction. Nigam et al. use the signal amplitude in the time domain as a feature to estimate social context in HSEs; i.e., if the HSE is used to eat, study, or socialize [117]. Lexical feature extraction can be accomplished with OpenCCG, which provides embeddings of recognized speech as in [44, 47]; embeddings can reduce a large vocabulary of spoken words into the most likely command or words relevant to a service robotics domain.

3.7 Other Data Modality Acquisition and Processing Techniques

Beyond audio and vision, other robot sensor modalities can be used to infer human attributes, states, and communication. Leg detection and tracking through LiDAR scanning rangefinder data was one common perception technique. 2D LiDAR scanning rangefinders were common inclusions on mobile robots or in indoor sensor arrays [18, 73, 52, 164], while specific models used include the SICK LMS 500 [80, 97, 154], the Hokuyo URG-04LX [116], and the Hokuyo UTM-30LX [176]. Raw range data can be clustered, and the clusters classified as legs. Linder et al. use a random forest classifier implemented in OpenCV [97] to detect legs; the SPENCER robot segments LiDAR range data, clusters, extracts cluster features, and classifies legs using a boosted classifier [80, 154]. LiDAR leg detection was used in several other mobile robot and indoor sensor arrays surveyed [18, 52, 124, 164]. Trunk detection can also be accomplished via laser range cluster classification as in [116]. Individual LiDAR detections can be tracked using commercial software like ATRacker [52], by Particle Filtering [73], or by Kalman Filtering [18, 164] which provides the smoothed position and velocity of the detected person. However, a drawback of LiDAR leg detection is that humans must be standing up to be detected, and furniture or other obstacles can be misclassified as humans; adaptive background subtraction [18, 164] can remove non-human scan data which could otherwise be misclassified. For multiple human tracking, LiDAR scan data is typically fused with other modalities in specialized trackers as discussed in Section 3.9.
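A minimal version of this cluster-then-classify pipeline for 2D scans is sketched below; DBSCAN clustering and a simple cluster-width filter stand in for the trained leg classifiers used in the surveyed systems, and all thresholds are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Minimal sketch of 2D LiDAR leg detection: convert a scan to Cartesian
# points, cluster with DBSCAN, and keep clusters whose width is leg-like.
# The width filter stands in for the trained (random forest / boosted)
# classifiers cited above; thresholds are illustrative.
def leg_candidates(ranges: np.ndarray, angles: np.ndarray,
                   min_width: float = 0.05, max_width: float = 0.25):
    valid = np.isfinite(ranges) & (ranges > 0.1)
    xy = np.stack([ranges[valid] * np.cos(angles[valid]),
                   ranges[valid] * np.sin(angles[valid])], axis=1)
    labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(xy)
    centers = []
    for lbl in set(labels) - {-1}:                  # -1 marks DBSCAN noise points
        cluster = xy[labels == lbl]
        width = np.linalg.norm(cluster.max(axis=0) - cluster.min(axis=0))
        if min_width <= width <= max_width:
            centers.append(cluster.mean(axis=0))    # candidate leg position (x, y)
    return centers
```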
Sonar, or sonic rangefinders, are another sensory modality implemented on the Pepper platform [17, 105, 106]. While sonic rangefinders do not offer the same precision as laser rangefinders, they can still compute distance to humans and infer engagement zone information (e.g., public zone, social distance) through Pepper’s ALEngagementZones toolkit as in [17].
Touch screens are another option that can enable intuitive information transfer in HSEs. Touch screens can work well for applications where a robot has a small set of actions or functions and humans do not carry specialized interface hardware, as with the Sacarino concierge robot [122] and elderly care robots [43, 123]. Additionally, the Pepper platform and Bohus et al.’s virtual assistant [20, 21] both have touch screens to receive tactile inputs from nearby humans. In Fischinger et al.’s Hobbit robot user study, participants preferred the touch screen over gestures for sending discrete commands [43]. Touch screens can also offer privacy for confidential information like airport reservations or hotel bookings and avoid inaccuracies that are inherent to speech or gesture commands [122]. However, touch screens are not ideal in scenarios when the robot is moving excessively [32].
Less commonly used sensors in the surveyed works include thermal cameras, whose features were extracted and fused with other visual features to recognize activity in Lu and Reily’s MMP system [102, 131], and capacitive touch sensors, which were embedded in the Mini tabletop robot to recognize tactile interaction [138]. Chen et al. [29] presented the only surveyed project that used millimeter (mm)-Wave Radar pointcloud data (from the mmBody dataset); PointNet++ extracts local pointcloud features, and a multilayer perceptron extracts global pointcloud features, which are fused with visual features to robustly estimate human pose in visually obscured conditions. Both thermal and radar were primarily motivated by use cases in domains where vision is expected to be degraded, such as search-and-rescue or subterranean operations.

3.8 Multimodal Datasets

Several of the surveyed works [10, 17, 42, 49, 70, 74, 83, 100, 103, 104, 131, 139] used labeled datasets instead of live sensor data. Labeled datasets enable the benchmarking and comparison of different perception algorithms against a common standard and may be especially useful during development of MMP HRI systems and algorithms. Multimodal HRI datasets used by surveyed systems include:
The Audio-Visual Diarization dataset: 6-channel audio and stereo camera recordings of various groups of humans conversing in an indoor setting, along with annotations of their conversations. Used in [10, 49].
BabyRobot Emotion Database: Labeled audiovisual dataset in which children express one of six emotions. Developed by and used in [42].
The Collective Activity Dataset [33]: Labeled RGB video of group activities, such as crossing a street, talking, jogging, and dancing. Used in [131, 103].
CMU Panoptic11: 480X RGB cameras, 10X RGB-D Kinect sensors with skeletal pose annotations and speaker diarizations. Sixty-five sequences (5.5 hours) of various social interactions recorded from various angles in a panospheric dome. Used in [178].
Electronics and Telecommunications Research Institute (ETRI)-Activity-3D [77]: RGB-D and a skeletal pose data of 55 activities performed by 50 younger and 50 older adult subjects (112,620 total samples). Used in [134].
mmBody [30]: Millimeter-wave annotated 3D pointclouds, RGB-D data, and ground-truth motion capture of various human poses in obscured visual conditions (rain, smoke, poor lighting, occluded). Used in [29].
MOBOT-6a corpus [136]: Gestural and verbal commands (in German) performed by eight elderly subjects, used in [83].
Multi-modal Long-term User Recognition Dataset: Images of 200 users, with name, age, gender, height labels, and interaction times. Developed by and used in [72].
Multisensory Underground Search and Rescue Teamwork Dataset: RGB-D, thermal, and 3D LiDAR data recorded from a Clearpath Husky Unmanned Ground Vehicle (UGV) of a rescue team in an underground mine setting. Group actions are annotated as donning protective equipment, team movement, team stop, patient care, and installing mine shaft supports. Used in [103, 131].
Nanyang Technological University RGB+D [142]: Skeletal pose, depth, infrared, and RGB videos of annotated individual actions. Used in [100] and [178].
Robots with Auditory and Visual AbiLities Corpora [5]: Stereo vision synchronized with 4-channel audio and activity annotations. Used in [139].
Sun Yat-sen University (SYSU) Dataset [68]: RGB-D data of subjects performing actions with objects. Activities and object labels are annotated. Used in [100].
User Engagement in Spontaneous HRIs [16]: Used in [17].
University of Texas Kinect [172]: RGB-D and skeletal pose data of ten indoor activities of daily life (e.g., walking, standing up). Used in [74].
University of Texas at Dallas–Multimodal Human Activities Dataset [31]: RGB-D, skeletal pose, and inertial sensor data with labeled human activities. Used in [74].
Although not used in any of the surveyed works, the following datasets may also be useful resources to develop MMP systems for HRI in HSEs:
Artificial Intelligence for Robots [85]: 3X RGB-D Kinect sensors, depth maps, body indexes/labels, and skeletal pose data. Annotated human–human interaction activities (e.g., handshakes, hugs, entering/leaving area) with elderly subjects.
AudioVisual Emotional Challenge (2014) [160]: RGB facial images and audio data. Annotated with emotional state.
AveroBot [107]: 8X different camera + monochannel microphone combinations spread across three floors inside a building. Human identities are labeled for long-term re-identification using different hardware setups.
Learning to Imitate Social Human–Human Interaction [158]: 3X RGB-D cameras and 1X monochannel microphone to capture scene data, plus motion capture of body joints, eye tracking, and egocentric RGB data for human participants in the scene. Comprised of 32 dyadic interactions, with 5 sessions per dyad.
Multimodal Human–Human–Robot Interaction [26]: 2X static RGB cameras, 2X dynamic RGB cameras, and two biosensors recording information during dyadic human–human interaction and triadic human–human–robot interaction. Engagement and personality are annotated.
Dataset of Children Storytelling and Listening in Peer-to-Peer Interactions (P2PSTORY) [145]: 3X RGB cameras and monochannel audio of children reading and listening to one another. Interactive behaviors (e.g., nods, smiles, frowns) are annotated, and socio-demographic information is recorded for each subject.
Play Interaction for Social Robotics (PInSoRo) [92]: 2X RGB-D and monochannel audio recordings (Intel RealSense SR300) of two children playing, 1X RGB-D recording (Microsoft Kinect One) of the environment. Annotated with task engagement, social engagement, and social attitude for each child.
Penn Subterranean Thermal 900 (PST900) [144]: 894 fused RGB-thermal images; synchronized and calibrated with per-pixel annotations of four classes from DARPA Subterranean Challenge.
Universidad de Málaga Search and Rescue [111]: Overlapping RGB and thermal infrared monocular data, 3D LiDAR, GPS, and IMU recorded from a UGV during an underground SAR exercise. Includes persons, vehicles, debris, and SAR activity in unstructured terrain.
While multimodal datasets are highly practical for perception model development, models trained on datasets should not be used if the dataset differs significantly from the application data. For example, vision models trained on an indoor camera may not be applicable for use on a mobile robot, since differences in perspectives or hardware may result in warping or image distortion, rendering the model unusable [8]. Likewise, language models and speech recognition systems designed for adults do not transfer well to children [15, 152].

3.9 Data Fusion Techniques

Fusing information from multiple unimodal streams can be accomplished using a variety of techniques, depending on the availability and modalities of data for the specific HRI application. In this section, we discuss surveyed fusion techniques. Recall from Section 3.2 that we use the multimodal fusion taxonomy of Baltrusaitis et al., who categorize fusion techniques as model-based or model-agnostic [9]. Model-based approaches are more recent and less commonly used in robotics and use kernels, graphical models, or neural networks (including Transformers) to combine different data modalities and seek underlying patterns in the data. Model-agnostic methods fuse the raw data, extracted features, or the processed outputs of different modalities without the use of trained models. Model-agnostic methods are more generic and more commonly used in robotics [13].

3.9.1 Model-Based Fusion Techniques.

Model-based fusion techniques are categorized by the type of model used to fuse data streams [9]. Recall from Section 3.2 that this includes Multiple Kernel Fusion, Graphical Model Fusion, and Neural Network Fusion. Despite being derived from well-established ML algorithms, model-based fusion techniques are less common in robotics; they often require large amounts of relevant data and substantial onboard compute to be used in real time on embedded systems. However, advances in onboard computing and ML research have resulted in model-based fusion techniques being employed on robots in recent years.
Graphical Model Fusion can combine asynchronous data from different streams, and both training and inference are faster than neural network model fusion [72]. Additionally, graphical models can continue to provide a state estimate based on prior information, even if updated information is sparse. Graphical models can also model states that change temporally. These are all desirable qualities for onboard robotic sensor fusion. Examples of Graphical Model Fusion include Irfan et al., who use a multimodal incremental Bayesian network to fuse facial identity, height, age, and gender information to estimate a human’s identity incrementally over time [71, 72]. Banerjee et al. compare the performance of graphical models such as HMMs, CRFs, hidden CRFs, and latent-dynamic CRFs to categorize interruptability based on pose, gaze, and audio signals [11]. Lastly, Markov decision process (MDP) graph models can be used as dialog management or task planning engines. Lu et al. use a Partially Observable Markov Decision Process (POMDP)-based state estimator and event reasoner to fuse multimodal inputs [102], while the SPENCER robot uses a Mixed Observability Markov Decision Process (MOMDP)-based collaboration planner to estimate human intentions from asynchronous multimodal inputs [80, 154]. Notably, the MOMDP collaborative planner is just one component of SPENCER’s supervision system, an integrated perception and planning system that affects specific actions based on user inputs and perceptual data.
Neural Network Fusion was also used in surveyed works to align and fuse multimodal features. Neural Networks are often used within an end-to-end fusion and classification/prediction framework, where one layer is used to fuse data or features, and another is used to classify or decode the feature representation. Linder et al. use the DarkNet-53 network to fuse RGB and Depth features; the fused feature vector is then classified with a modified YOLOv3 network to compute human centroid location in the Depth image and 2D bounding box for the RGB image [98]. Fung et al. [48] use a custom YOLOv4 variant—MYOLOv4—to individually extract RGB and Depth features and fuse them to robustly detect people and body parts. Robinson et al. [134] use ResNet to extract RGB-D video features, a graph convolutional network (GCN) + Self-Attention network to extract pose features, and a CSPDarknet + PANet + YOLO network to extract object location features. A Spatial Mid-Fusion Module then embeds each modality’s features into a feature vector, which is classified by a Dense Neural Layer to estimate which ADL is occurring in the RGB-D video. Li et al. use a bidirectional gated recurrent unit (BGRU) to fuse audio, lexical, and facial features before classifying the fused feature vector with a FCNN to infer the human’s emotion [95]. Chen et al. [29] use a Transformer integrated module to fuse global RGB image and mm-Wave Radar pointcloud features into a multimodal feature vector. The feature vector is then classified by a fusion Transformer module, which can selectively weight the most salient features for more robust human detection when vision is occluded. G. Liu et al. [100] compute spatial RGB features using cross-modal and self-attention mechanisms and temporal features from skeletal pose data via bidirectional LSTM networks. The spatial RGB features and temporal pose features are fused via a concatenation layer and classified via a FCNN to estimate the human’s activity. Yasar et al.’s IMPRINT network [178] uses Neural Networks and Transformer models within an end-to-end feature extraction, fusion, and decoding network on RGB-D data. Unimodal feature extractors compute spatio-temporal features for skeletal poses in the scene, and a GRU encoder extracts contextual features from the scene. The pose and contextual features are fused with an Interaction Module that also estimates group affiliations. A multimodal context module fuses the processed group, individual, and context features. Lastly, the fused and processed feature embedding is decoded by a Motion Decoder to estimate where each person in the scene will move. In each case, Neural Network Fusion can synchronize asynchronous data streams and leverages mutual information between individual modalities.
Our survey found no instances of Multiple Kernel Learning models used in multimodal fusion.

3.9.2 Model-Agnostic Fusion Techniques.

As discussed in Section 3.2, model-agnostic fusion techniques can be categorized as Early (Data) Fusion, Intermediate (Feature) Fusion, or Late (Decision) Fusion depending on when the data are fused [25, 128, 180].
Early fusion involves the synchronization and alignment of raw, independent data streams into a single structure. Many robots or embedded systems feature asynchronous, heterogeneous data streams, making early fusion challenging; for example, it is not obvious how a 2D image should be merged with a 1D time-indexed dataset. As noted in [13], data synchronization and alignment is relatively expensive computationally. As such, early fusion is generally not feasible for real-time applied robotics or embedded systems applications, and we found only one project that employed early fusion: Filippini et al. [41] performed extrinsic calibration to find the correspondence between an RGB and a thermal camera. They generate a fused RGB-thermal image by converting thermal pixels into corresponding RGB pixels and linearly interpolating the sampled value if the thermal sample time falls between RGB sample times. Similarly, the PST900 dataset [144] uses an extrinsic calibration technique to align RGB and thermal images.
Intermediate or feature-level fusion is generally used when mutual information is conveyed across multiple heterogeneous modalities simultaneously. Specifically, separate audio and vision sensors may detect mutual information about the same object or event. Hence, a fused audiovisual feature may be more informative than separate unimodal audio and visual features.
Intermediate fusion is typically accomplished through alignment, summation, and/or concatenation of unimodal feature vectors. H. Liu et al. [101] use intermediate fusion to perform multimodal command classification for a collaborative human–robot manufacturing task. They use artificial neural networks (ANNs) to extract features from the human’s speech, gesture, and body movements. The features are fused into a single vector, then classified using a MLP as one of six potential commands. Ma’s emotion recognition network [104] functions similarly; CNNs are used to extract features from audio and visual data, which are then fused into a single tensor via an MLP which is classified using an SVM. Ben-Youssef et al. record asynchronous proxemic, gaze, head, and speech measurements, extract features, and fuse the features via temporal pooling of sliding temporal measurement windows [17]. Benkaouar and Vaufreydaz’ domestic robot [18, 164] fuses various spatial, speech, and skeletal pose features from asynchronous sensors; they use neutral values when one of the features is unavailable, and they use the last known feature value for features that are not computed at the current timestep. The fused feature vector is classified via an ANN and SVM to estimate engagement. Lu and Reily [102, 131] compute HoG features for multiple sensor modalities, then represent the fused features in a bag-of-words for each person in a group. Each human’s fused feature vector is multiplied with a weighted matrix and summed to estimate the activity of the group. Vector concatenation is another intermediate fusion technique, used by Filntisis et al. [42]. They extract facial and pose features using Resnet 50 and DNN, respectively, then concatenate both into a single feature vector that is classified by a FCNN to estimate human emotion. Similarly, Islam et al. [74] extract features from RGB and skeletal pose modalities using a self-attention mechanism; they evaluate intermediate fusion using both concatenation and summation of the feature vectors. The fused vector is then classified by a FCNN to determine the human’s activity. Using PCA and data concatenation, Nigam et al. fuse grayscale values from RGB data and signal intensity from the audio channel; they then use SVMs, decision trees, and naïve Bayes classifiers to infer social context [117]. Shen et al. [143] use linear interpolation and truncation to align variable-sized visual and audio feature vectors temporally, which are then concatenated into a single vector and classified. Wilson et al. [168] employ intermediate fusion to combine gaze and speech features into a single feature vector. Since speech commands are sparse and asynchronous, they use a linear degradation function to progressively reduce speech feature weight over time if speech features are not readily available. The fused feature vector is then classified with a random forest to estimate if the user needs assistance.
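The concatenation pattern that recurs in these works can be written compactly; the sketch below fuses assumed speech, gesture, and body-motion feature vectors into a single vector and classifies it with a small MLP. The feature dimensions and six-command output loosely mirror the setup of [101] but are otherwise illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of intermediate (feature-level) fusion: unimodal feature
# vectors are concatenated and classified by a small MLP. Feature dimensions
# and the six-class output are illustrative.
class ConcatFusionClassifier(nn.Module):
    def __init__(self, speech_dim=26, gesture_dim=34, body_dim=30, n_classes=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim + gesture_dim + body_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, speech_feat, gesture_feat, body_feat):
        fused = torch.cat([speech_feat, gesture_feat, body_feat], dim=-1)
        return self.mlp(fused)                       # command logits

# Example usage with random stand-in features for a batch of 4 observations:
model = ConcatFusionClassifier()
logits = model(torch.randn(4, 26), torch.randn(4, 34), torch.randn(4, 30))
print(logits.argmax(dim=-1))                         # predicted command indices
```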
Model-agnostic late fusion techniques were the most common category of fusion in the surveyed works. We found examples of late fusion implemented using simple arithmetic or logical comparison, probabilistic and Bayesian methods, action engines and state machines, data association, and supervised classifiers as described below.
Simple late fusion can be accomplished through mathematical operations on unimodal data streams. Weighted summation was one common late fusion technique, which was used for human identification and command/intent recognition. Tan et al. and Churamani et al. compute human identity through a weighted sum of voice and facial identification confidence values [35, 149]. Similarly, Martinson et al. compute how closely a human’s face, clothing, and height match a database of known values; the weighted sum of these similarity metrics is used to select the most likely human identities [108]. Pourmehr et al. [124] independently track humans using Kalman Filters for visual, audio, and LiDAR input data, generating three separate occupancy grids, each representing the likelihood of humans present. The fused output is simply a weighted average of the three independent modalities’ occupancy grids. Sanchez-Riera et al. [139] also used weighted averages of individual modes to classify human commands. A visual gesture classifier and a verbal speech SVM classifier each estimate which command was issued (“Hello,” “Bye,” “Yes,” “No,” “Stop,” “Turn”); the two modes are weighted and summed to generate a posterior estimate of the command. The MuMMER robot computes an “interest likelihood” from head pose, gaze, and location data, then uses a weighted sum to infer who is likely to interact with the robot [45, 46].
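The weighted-sum pattern is straightforward to implement; the sketch below fuses per-command confidence vectors from two independent classifiers with fixed modality weights. The command set echoes the example above, but the weights and confidence values are illustrative, not taken from the surveyed works.

```python
import numpy as np

# Minimal sketch of late fusion by weighted summation of unimodal confidences.
# Each classifier outputs a confidence per command; the modality weights are
# illustrative and would normally be tuned or learned.
COMMANDS = ["hello", "bye", "yes", "no", "stop", "turn"]

def weighted_sum_fusion(gesture_conf, speech_conf, w_gesture=0.4, w_speech=0.6):
    fused = w_gesture * np.asarray(gesture_conf) + w_speech * np.asarray(speech_conf)
    return COMMANDS[int(np.argmax(fused))], fused

# Example: gesture weakly favors "stop", speech strongly favors "stop".
gesture_conf = [0.1, 0.1, 0.1, 0.1, 0.4, 0.2]
speech_conf = [0.05, 0.05, 0.1, 0.1, 0.6, 0.1]
print(weighted_sum_fusion(gesture_conf, speech_conf))
```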
Logical operations on unimodal data streams are another way of performing simple late data fusion. For example, the speech and gestural interface in [1, 2] will fuse the two command inputs only if they arrive within 0.5 seconds of one another and contain complementary information; redundant commands are discarded. Ishi et al.’s LiDAR and microphone system fuses human location measurements only if they are within a distance threshold of one another [73]. The audiovisual speaker localization system in [54] takes sound source location, face location, and mouth movement as inputs; the speaker location is only returned if mouth activity is detected near the sound source. Lastly, the speech and gestural interface in [83, 187] computes confidence values for recognized speech and gesture commands, then ranks all received commands and selects the command with the highest overall confidence.
Projection and/or lifting is another means of late fusion and can be used to combine processed data from one modality with raw data from another. Terreran et al. detect 2D pose keypoints in an RGB-D image using OpenPose, then use 2D-to-3D lifting and projection to fuse the 2D keypoints with the depth channel of the RGB-D image, yielding a 3D pose estimate. Due to the inherent processing time of OpenPose, this approach is best suited for offline operation, and the authors of [150] apply it to datasets.
Action engines and managers are another means of handling processed outputs of multiple modalities and work well for HRI applications when actions, states, and state transitions can be explicitly defined. The simplest case is to execute actions based on any commanded input as they are received—such as processed touch screen, gesture, or verbal commands—as done by the Hobbit service robot in [43]. More involved dialog, action, and service managers implement execution rules and reason about potential courses of action before execution. Roboceptionist’s dialog manager uses rules to determine when to process verbal and gestural commands and when to respond. It also parses the semantic content of human speech to select the most appropriate response given the current dialog state [86]. Based on visual and speech inputs, R3D3 also uses a dialog manager to manage initiative and conversation flow and determine which museum exhibit a human wants information about [99, 152]. The SocialRobot platform allows developers flexibility by allowing them to implement custom action descriptions and behaviors in an XML file that is managed by the robot’s action engine [123]. Efthymiou et al. use a dialog flow manager, which monitors for specific speech or gesture inputs to select the next robot action in a series of child–robot interaction games and scenarios [38]. Chao et al. use a timed Petri net architecture to asynchronously handle visual and audio stimuli to determine when to interact with a nearby human [27]. Many surveyed works implemented action engines or dialog managers using state machines, where state transitions are affected by specific perceptual inputs. For example, Maniscalco et al. implement a behavior manager on the Pepper robot using SMACH13 state machine software; Pepper will not interact with a human if they are not engaging with the robot and will not respond to a person unless the person has been identified and initiated interaction with the robot [105, 106]. Churamani et al. also implement a dialog manager in SMACH [35], which uses multisensory cues from nearby humans to select its spoken responses. In surveyed works, state machines were used to generate hotel robot service behaviors [122] and robot social gaze [121]; Jacob et al. use a verbal- and gestural-controlled state machine to pick and pass surgical instruments and enable or disable specific sensory modalities [75]. One advantage of state machines is that they can be programmed to better utilize limited onboard compute resources in mobile robots; for example, Chu et al. do not perform some human perception functions if there are no humans in the scene [34].
Ragel et al. [127] use a sequential data fusion pipeline to combine skeletal pose, face, hand, and speech data into a single Person instance. Each mode is processed individually using a variety of techniques, and the Person instance is incrementally updated as new data from each modality becomes available. The system handles asynchronous messages via standardized Robot Operating System (ROS) data structures and middleware.
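The incremental-update pattern can be sketched as follows; the Person fields and update logic are illustrative stand-ins, not the actual ROS message definitions used in [127].

from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    """Aggregate of the latest unimodal results for one perceived person."""
    person_id: int
    skeleton: Optional[list] = None       # latest skeletal pose keypoints
    face_id: Optional[str] = None         # latest face recognition result
    hand_gesture: Optional[str] = None    # latest recognized hand gesture
    utterance: Optional[str] = None       # latest transcribed speech

    def update(self, modality: str, value):
        # Each unimodal module calls update() asynchronously as results arrive.
        setattr(self, modality, value)

p = Person(person_id=1)
p.update("face_id", "known_user_03")
p.update("utterance", "hello robot")
print(p)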
Probabilistic and Bayesian techniques comprise another broad category of late data fusion; in the surveyed literature, they were used for human localization and tracking, as well as command, intent, and emotion recognition. Ban [10] and Gebru [49] represent visual and audio localization results as probability distributions, then fuse the two modes via Bayesian estimators to compute the posterior location estimate. Likewise, Bayram and Ince [12] and Nakamura [114] use particle filtering to fuse visual and audio location measurements into a single posterior estimate. Tan et al. use probabilistic hypothesis testing to fuse sound source and face locations and distinguish them from background sound sources [149]. Chau et al.’s Audio-Visual Simultaneous Localization and Mapping algorithm [28] is designed specifically for simultaneous localization of the robot platform and multiple nearby humans. They use a Gaussian Mixture Probability Hypothesis Density filter, a multitarget tracking filter that models intermittent sensor visibility and human entry/exit from the field of view as random finite sets, to fuse audio and visual location measurements.14 Linder [97], Yan [176], and Nieuwenhuisen [116] use multi-hypothesis tracking—another common multiobject tracking method—to fuse visual and LiDAR human position measurements for multiple human tracking in dense social environments. Bohus et al. [20, 21] use probabilistic models to represent quantities like human engagement, activity, and initiating or terminating contact. Both Trick et al. and Whitney et al. fuse speech and gesture to recognize the object that a human is referring to via referring gestures. A prior estimate of the referent can be recursively updated with incoming speech and pointing gestures as in [167], or a categorical probability distribution over candidate referents can be computed via an independent opinion pool, as in [153]. Prado et al. use DBNs to classify facial and speech emotional content, and then use an additional DBN to fuse the unimodal results [125].
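As a concrete example of the simplest of these techniques, an independent opinion pool in the style of [153] multiplies per-modality categorical distributions over candidate referents and renormalizes; the two modalities and three candidate objects below are invented purely for illustration.

import numpy as np

def independent_opinion_pool(*distributions):
    """Fuse categorical distributions from independent modalities by
    elementwise multiplication followed by renormalization."""
    fused = np.ones_like(distributions[0], dtype=float)
    for dist in distributions:
        fused *= dist
    return fused / fused.sum()

# Speech weakly prefers object 0; the pointing gesture strongly prefers object 1.
p_speech = np.array([0.5, 0.3, 0.2])
p_gesture = np.array([0.1, 0.8, 0.1])
print(independent_opinion_pool(p_speech, p_gesture))   # fused posterior over objects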
Data association techniques can also be employed for late fusion. NN association is suitable for assigning information from an imprecise spatial measurement to a second, more precise spatial measurement. For example, Gebru et al. [49] use NN fusion to attribute coarsely localized speech activity to a precise visual human location; the fused result contains precise location and speech information. Similarly, D’Arca et al. [36] use NN fusion to attribute speaker identity from audio data to a more precise, fused audio-visual human location. Yumak et al. [180, 181, 182] use NN fusion to assign speech to a visually localized human: they compute the speech direction-of-arrival vector, then fuse the associated speech content to the person nearest to the vector. Komatsubara et al. [87], Glas et al. [52], and the Robot Operating System for Human-Robot Interaction (ROS4HRI) fusion module [110] use NN association to fuse facial identity with a detected body. NN and extended NN association can also be used to assign position measurements to humans that are currently tracked by Kalman filtering, as in [97]. The Hungarian algorithm (also called the Munkres algorithm) is another data association technique, used in the literature for associating multiple measurements with multiple existing detections in late fusion. Belgiovine and Gonzalez-Billandon et al. use the Hungarian algorithm to assign multiple face detections to existing faces that are tracked via Kalman filtering [14, 57], while Nieuwenhuisen et al. use it to assign visual and LiDAR position measurements to humans tracked via multi-hypothesis tracking [116].
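The assignment step described above can be reproduced with an off-the-shelf Hungarian solver; the sketch below matches new face detections to tracked face positions by minimizing pairwise distance, with coordinates and a gating threshold invented only for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Rows: currently tracked face positions; columns: new detections (in pixels).
tracks = np.array([[100.0, 200.0], [400.0, 220.0]])
detections = np.array([[405.0, 225.0], [102.0, 198.0], [600.0, 50.0]])

cost = cdist(tracks, detections)                  # Euclidean distance matrix
row_ind, col_ind = linear_sum_assignment(cost)    # Hungarian (Munkres) assignment

max_dist = 50.0                                   # gate out implausible matches
for t, d in zip(row_ind, col_ind):
    if cost[t, d] < max_dist:
        print(f"track {t} <- detection {d} (distance {cost[t, d]:.1f} px)")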
The last category of late fusion found in our survey is supervised classification, which was used for several HRI functions that infer multilabel quantities given features or labels from multiple unimodal streams. (Note that this is strictly different from the model-based fusion techniques discussed in Section 3.9.1, which use trained models to fuse data rather than to classify it.) Engagement and status recognition were common use-cases for supervised classification in the literature, and projects typically evaluated several supervised classification techniques; this is recommended because the results of supervised classifiers can vary significantly based on the application and available data. For the JAMES bartending robot, Foster et al. compute head and hand positions, torso and head poses, and speech activity, then classify them to infer the bar patron’s current state. They evaluate binary regression, logistic regression, multinomial logistic regression with ridge estimator, k-NN, decision tree, naïve Bayes, SVM, and propositional rule classifiers [44, 47]. Ben-Youssef et al. compute gaze, head motion, facial expression, gesture, and speech features from separate perception modules, then evaluate logistic regression, DNNs, LSTMs, and GRUs to classify engagement [17]. Vaufreydaz et al. compute position, velocity, speech, sound source, and pose features via separate perception modules, then classify engagement using an SVM and an ANN [164]. Jain et al. compute facial features and audio signal features while measuring performance in a joint child–robot game; these are used to classify the child’s engagement using naïve Bayes, k-NN, SVM, NN, logistic regression, random forest, and gradient boosted decision tree classifiers [76]. Banerjee et al. compute head orientation, gaze, body position, body orientation, and audio signals, then classify the human’s interruptibility (i.e., willingness to engage) using SVM and random forest classifiers; they also compute engagement using model-based fusion techniques like CRFs, as discussed previously [11]. Paez et al. estimate a human’s emotional state using k-means classification of recognized body pose, facial gestures, and facial emotion [126]. Lastly, Terreran et al. [150] individually classify pose keypoints of the body and hands with the Shift-GCN architecture, then use ensemble averaging to estimate the person’s overall activity.
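A typical late-fusion classifier of this kind simply concatenates the unimodal feature vectors and trains a standard classifier on the result. The sketch below compares an SVM and a random forest on synthetic engagement data; the feature dimensions, labels, and data are invented for illustration and will not produce meaningful accuracy.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
gaze_feats = rng.normal(size=(n, 4))      # e.g., head orientation and gaze angles
speech_feats = rng.normal(size=(n, 6))    # e.g., energy, pitch, speaking rate
pose_feats = rng.normal(size=(n, 8))      # e.g., body position and velocity
X = np.hstack([gaze_feats, speech_feats, pose_feats])   # fused feature vector
y = rng.integers(0, 2, size=n)            # engaged / not-engaged labels

for name, clf in [("SVM", SVC()), ("Random forest", RandomForestClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.2f}")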
More complex HRI systems employed combinations of intermediate and late fusion, known as hybrid fusion. Yumak et al.’s system [180, 181, 182] used intermediate fusion to identify humans by combining and classifying LBP and HSV visual features, and NN late fusion to assign speech content to localized humans and infer their focus of attention. Their fusion module also employs a world model and a rule-based dynamic user entrance/leave mechanism to infer when humans enter or exit the field of regard. The final, fused result contains each nearby human’s identity, location, and speech content, as well as when they entered or left the area.
D’Arca et al. [36] used a hybrid of feature-level and decision-level fusion methods along with state estimation. Kalman and particle filters fuse audio and visual localization results to track speakers in the field of regard, while speaker detection is accomplished by canonical correlation analysis, a feature-level fusion technique that combines audio MFCCs and visual optical flow features to determine who is speaking. The final, fused result contains speaker location and identity and is robust to visual occlusions. The resulting system is comparatively complex—it comprises an overall fusion network and subnetworks for tracking, speaker recognition, and audiovisual feature extraction—and was used in a static, indoor setting rather than on a mobile platform.

4 Survey Summary, Analysis, and Future Research

We reviewed a range of MMP applications and their supporting data acquisition, processing, and fusion methods, seeking techniques that can be used by robots and intelligent agents to perceive humans in HSEs. Here, we identify trends and challenges in the surveyed research, provide guidelines and resources for the design and use of MMP systems for HRI in HSEs, and suggest future research areas.

4.1 Trends in Current Multimodal HRI Perception Systems

Surveyed systems spanned a wide range of applications and use-cases, which we summarize here. In each case, MMP in HSEs can detect nuanced human communication or provide more robust perception by leveraging mutual information across multiple data streams. We categorize patterns within the surveyed research applications in Table 8, noting characteristics, advantages, disadvantages, and representative examples.
Table 8.
MMP Application | Typical Design Characteristics | Advantages | Disadvantages | Examples
HRI research testbed (mobile) | Mobile base; uses graph or late fusion; operates in dynamic public environment, possibly for long duration | Captures relevant interaction phenomena in representative environment | May not detect subtle interaction cues for all participants | Detecting interruptibility, engagement, or social context [11, 16, 18, 117, 164] and long-term identification [72]
HRI research testbed (static) | Fixed or static indoor base; can use cloud resources and networked sensors; supervised classification of multimodal features | Precise human interaction data and cues; enables controlled development of HRI perception systems | May not capture relevant interaction phenomena in representative environment | Activity, emotion, engagement, identity, initiative, or personality trait estimation via games with tabletop or standing robots [14, 20, 21, 27, 34, 35, 41, 76, 126, 127, 143, 168]
Human-aware navigation capability development | Mobile base; uses graph or late fusion; operates in dynamic public environment | Enables human-aware planning and navigation for robots in HSEs | May not detect subtle interaction cues for all participants | Human localization, tracking, and following [12, 28, 97, 176]
Multimodal human cue recognition | Uses labeled datasets; neural network fusion or supervised classification | Captures subtle multimodal interaction cues and stimuli by finding mutual information | Primarily demonstrated on offline datasets | Activity, emotion, gesture, or intent recognition [74, 95, 134, 150, 178]
Multimodal command interface for robot or intelligent system | Multiple synchronized sensors in controlled workspace; can use cloud resources; late fusion | Enables more intuitive or convenient control of complex systems | Interaction limited to task-specific actions | Hands-free or intuitive control of assistive robots, collaborative robots, or UAVs [1, 2, 75, 83, 101, 186, 187]
Robust MMP | Single sensor with multiple data streams, or multiple synchronized sensors; neural network fusion or supervised classification | Can be robust to occlusions and single-modality limitations | Primarily demonstrated on offline datasets | Person detection or pose recognition using fused RGB-D or RGB-radar systems [29, 48, 98]
Service robot (mobile) | Mobile base; can be augmented with environmental sensors; uses graph or late fusion (state machine or action manager) | Could alleviate labor shortfalls; can create more engaging interaction experience | May not detect subtle interaction cues for all participants; interaction limited to task-specific actions | Airport, campus, elderly care, mall, and museum robots [45, 46, 51, 52, 80, 102, 105, 106, 123, 154]
Service robot (static) | Fixed or static indoor base; can use cloud resources and networked sensors; uses graph or late fusion (state machine or action manager) | Could alleviate labor shortfalls; can create more engaging interaction experience; precise human interaction data and cues | Interaction limited to task-specific actions; integration can be complex | Bartending, concierge, or receptionist robots [20, 21, 44, 47, 86, 99, 120, 152, 180–182]
Smart spaces | Multiple networked sensors in controlled space, optionally with robots; uses cloud resources; hybrid fusion | Precise interaction information for humans in the space | Integration can be complex | Smart classrooms and playspaces for improved child education or child–robot interaction [36, 38, 42, 87, 156]
Table 8. Summary of MMP System Trends in HSEs
Service robots were heavily represented in the surveyed works; these projects are largely motivated by demographic shifts in their respective nations of origin (such as aging populations), shortages of specialized workers, and the associated economic impacts. Elderly care robots [43, 83, 123, 186, 187], educational or engagement robots and smart spaces for children (including those with ASD) [38, 42, 87], reception, concierge, and airport/museum guide robots [45, 46, 52, 80, 86, 99, 116, 152, 154], and manual service tasks like bartending [44, 47] were common use-cases. In general, these projects had a few specialized tasks and needed to infer the needs or status of humans in HSEs, often leveraging cloud resources to perform specific interaction and perception functions.
Robust perception was also common in surveyed works. These projects developed and evaluated critical functions necessary for natural, long-term HRI in HSEs by leveraging cues from different modalities. This includes longstanding research areas like human detection and localization using visual, audio, and laser range modalities [54, 73, 109, 114], with specific projects using multiple modalities to localize and track multiple humans in dense HSEs [97, 98, 176]. Multimodal interruptibility, interest, and engagement detection was another common research focus, with consistent contributions throughout the timeframe of the survey [11, 17, 18, 27, 34, 76, 102, 105, 106, 121, 124, 164]. Less common—and generally more recent—MMP research has addressed activity detection [74, 100, 103, 131], emotion recognition [95, 125, 126], and identification/personalization for long-term or recurring interactions [14, 35, 52, 57, 72, 108]. More recent contributions leveraged trained ML models to fuse separate modalities and generate a fused estimate that is robust to occlusions and the limitations of individual modes [29, 48].
Lastly, another trend in surveyed MMP HRI applications involved the use of MMP interfaces to provide intuitive human inputs to a robotic system. This includes multimodal speech-gestural command of UAVs as in [1, 2], commanding a robotic scrub nurse [75], commanding a collaborative robot [101], and identifying specific objects to train a robot’s object detection system [153, 167]. This demonstrates that MMP interfaces can be used to intuitively command robotic systems in human–robot collaborative tasks, to provide flexibility when one communication modality is unavailable (e.g., a surgeon can command verbally while performing a manual task that prevents gesturing), or to intuitively train a robot’s reinforcement learning system, thereby leveraging the insights of a human teammate to improve robot performance.

4.2 Challenges and Limitations for Multimodal HRI Perception

HRI has many grand challenges, and recounting them all would be beyond the scope of this survey. However, our survey found specific challenges for MMP researchers and developers in HSEs, which we highlight here. Resources for addressing these issues are presented in Section 4.3.

4.2.1 HRI Applications Are Highly Specialized.

While HRI in HSEs has many common characteristics, specific applications and environments can vary significantly from one another. A robot’s tasks and perceptual requirements are often highly application-specific, and an algorithm or component that is ideal for one setting may not work well in another. This has significant implications for the design, evaluation, and usage of MMP systems.
Ideally, HRI researchers should evaluate several candidate components for use in HSEs. Many surveyed works [11, 17, 18, 44, 47, 97, 120] evaluated various fusion techniques or ML classifiers, for example, and selected the best overall method for their application. Likewise, a ready-made solution may not exist for a specific perception or fusion function, meaning that researchers may need to implement their own custom solutions. Some surveyed works developed custom SSL [49], hand pose tracking [75], head pose estimation [80, 154], or facial recognition algorithms [12], for example, despite the open-source availability of versions of these algorithms, and Yumak et al. developed a custom Integrated Interaction Platform [180, 181, 182]. This is especially evident in earlier surveyed works that pre-date open-source middleware and modern ML frameworks.
Because HRI applications in HSEs vary so widely, it is difficult to provide a solid set of community guidelines or standards for MMP system design; indeed, the HRI community does not appear to have coalesced around a specific set of tools or a common framework for HRI development and integration. These factors culminate in increased development time and effort for multimodal systems compared to unimodal perception systems.

4.2.2 Delay between Unimodal Perception Development and Integration in Multimodal Systems.

Related to Section 4.2.1, the added overhead of multimodal system design, evaluation, and integration means that state-of-the-art ML and unimodal perception capabilities may take several years before they are used in an MMP system in an HSE. This posed a challenge when conducting this survey: many surveyed fusion and perception functions (especially in the ML domain) could now be accomplished more effectively with more modern techniques. On the other hand, unimodal ML and perception techniques are advancing rapidly, and many methods have not yet been published in a peer-reviewed journal, let alone incorporated into a multimodal system. As we discuss in Sections 4.3 and 4.4, readers are encouraged to evaluate novel perception and fusion capabilities—perhaps ones that are not yet published—rather than reproduce systems used in past works.

4.3 Design Guidelines and Developer Resources for MMP in HSEs

Here, we provide recommended design guidelines based on trends in the surveyed research. We also provide a summary of available hardware and software resources to aid MMP system developers and HRI researchers.

4.3.1 Design Guidelines.

As mentioned in Section 4.2, it is difficult to define a single set of design guidelines for MMP systems in HSEs. However, Table 8 can provide a rough starting point for a system design based on the application. Additionally, we encourage developers to consider the following factors when designing and implementing an MMP system for use in HSEs:
Modularity: The ability to implement and evaluate multiple components and algorithms allows developers to quickly find a suitable system design, as in [11, 17, 18, 44, 47, 97, 120].
Internet connectivity: The ability to use cloud-based services can potentially reduce development time for interaction functions like speech recognition, natural language processing, soft biometric estimation, or emotion recognition, as in [20, 21, 35, 45, 46, 95, 99, 102, 105, 106, 116, 121, 122, 126, 127, 143, 152].
External sensors: Certain HSEs like classrooms, hospitals, shops, homes, service environments, and collaborative workspaces can be outfitted with sensors. These sensors can be networked with the robot to provide additional data and increase overall sensor coverage, as in [36, 44, 47, 51, 83, 156, 187].
Safety, robustness, and real-time considerations: Mobile robots that interact with humans in the wild mostly used established late fusion techniques and comparatively simple detectors. Novel detectors and fusion frameworks were primarily used on datasets and static robots.
Evaluation method: In surveyed works, developers evaluated their perception systems using datasets, longitudinal studies, grand challenges, user studies, and demonstrations. Each method differs in the resources required, study reproducibility, fidelity, and relevance to the HRI application; developers should consider these factors when selecting an evaluation method.
Existing resources: The use of existing hardware platforms and software datasets, frameworks, and SDKs can reduce development time and generally increases the commonality and reproducibility of the project [65]. Example resources are listed in Section 4.3.2.

4.3.2 MMP System Development Resources.

Here, we summarize software and hardware resources discussed in the survey and introduce additional resources for MMP system development.
State-of-the-art HRI perception techniques are continually advancing, and many were outside the scope of this survey because they were in pre-print or unpublished status, or because they have not yet been incorporated into an MMP system. In addition to the datasets discussed in Section 3.8, the following websites provide resources for training novel HRI perception models:
PapersWithCode15 provides labeled datasets, pre-trained ML models, and associated publications for many relevant HRI functions. State-of-the-art models are continually published for activity recognition, speech recognition, face recognition, pose estimation, person re-identification, emotion recognition, gesture recognition, gaze estimation, speech emotion classification, facial expression classification, age estimation, and hand pose estimation.
HuggingFace16 features recent tutorials and models for multimodal feature extraction, VAD, and ASR using Transformer models in the PyTorch framework.
Integration of MMP systems poses another challenge, and the following software frameworks may be relevant for some HRI applications:
Platform for Situated Intelligence (PSI)17: Debuted by Microsoft Research in 2019 [6, 7] and written in the C# programming language, PSI includes specialized data processing techniques to align, reshape, and synchronize heterogeneous data streams and to track latency for operation on real-time embedded systems. Developers compose components—device drivers and wrappers for algorithms and ML models—into a runtime instance. PSI includes tutorials for VAD, audio energy computation, and skeletal pose recognition.
ROS4HRI18 aims to create a set of common HRI functions and data types in ROS. ROS4HRI defines a ROS standard for HRI (ROS Enhancement Proposal 15519), implements HRI-specific message types,20 and provides baseline HRI functions like skeletal pose tracking, facial recognition, data fusion to match faces to bodies, and human reidentification in ROS. REP-155 was completed in early 2022; at the time of writing, ROS4HRI’s impact and community adoption are not yet discernible, and it is still under active development.
Human And Robot Modular and OpeN Interactions (HARMONI)21 [147] is another modular, ROS-based HRI framework. At the time of writing, HARMONI supports speech recognition via DeepSpeech and face detection using dlib. Users can compose custom interaction flows consisting of event-driven loops and sequences.
Several surveyed works used social robot platforms for research and development. Additionally, our initial literature search found many publications that—while they did not meet final inclusion criteria for the survey—used common robot hardware platforms for HRI research, and therefore may be of interest for MMP HRI system development. Some are commercially available development platforms with APIs, while others offer case studies in robot system development or were used to conduct HRI studies.
Child–robot interaction: Kaspar [169]; Luka reading robot22 [184]; Mio Amico [41]
Elderly/home care: GARMI mobile manipulator [155]; Hobbit mobile manipulator [43]
Service robots: BRILLO bartending robot [137]; Sacarino concierge robot [122]; TOOMAS shopping assistant robot [37]
Social robots: iSocioBot [149], IVO [90], Kejia [32, 102], Quori [146], and Pepper23 [17, 46, 72, 105, 106] mobile social robot platforms; iCub [14, 57] and NAO24 [38, 72, 133] humanoid robots
Conversational robots: Furhat tabletop robot25 [3, 4, 39, 40, 50, 79, 121]; Haru tabletop robot [55, 56, 81, 115, 127, 162, 163]; Mini tabletop elderly care and engagement robot [129, 138]; Nadine humanoid robot [151].

4.4 Future Research Areas

Multimodal HRI in HSEs has been an active area of robotics research for over two decades [60], yet it is still a rapidly evolving field. In light of the trends and challenges discussed above, here we discuss promising research areas in multimodal HRI.

4.4.1 Integration and Deployment of Novel Perception Techniques.

A robot’s ability to sense and interact with humans is a subset of its overall perception capability, and robot perception has advanced considerably in the past several years. These advances are largely driven by new software frameworks, ML architectures, and improvements in commercially available hardware. For example, the Transformer architecture [161] has seen widespread usage in ML tasks since its inception in 2017. Specifically, Transformer components like encoders, decoders, and attention mechanisms can be used to perform time-series fusion and classification tasks that are crucial to multimodal HRI perception. Many unimodal datasets and pre-trained Transformer models exist for baseline HRI functions like gesture recognition, speech recognition, emotion recognition, and VAD.
Although we found several examples of Transformers used for multimodal HRI within the past 3 years [74, 95, 100], this research has primarily been performed on labeled datasets. As such, we believe there are many more opportunities to use Transformer-based approaches to accomplish the multimodal HRI functions mentioned in this survey, specifically deictic/referring gesture fusion, multimodal emotion classification, intent recognition, and asynchronous/sparse data fusion. Additionally, deploying these models within multimodal embedded systems in HSEs appears to be a relatively unexplored research area. An evaluation of Transformer perception and fusion model efficacy in HSEs could further guide multimodal HRI research.
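As a minimal sketch of what such a Transformer-based fusion component might look like, the PyTorch module below lets audio tokens attend to visual tokens via cross-modal attention and classifies the pooled result; the dimensions, number of classes, and random inputs are assumptions for illustration, not a surveyed architecture.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-modal attention block: audio tokens attend to visual
    tokens, and the pooled result feeds a small classification head."""
    def __init__(self, dim=64, heads=4, num_classes=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens, visual_tokens):
        fused, _ = self.attn(query=audio_tokens,
                             key=visual_tokens,
                             value=visual_tokens)
        return self.head(fused.mean(dim=1))    # pool over time, then classify

# Toy batch: 2 sequences with 10 audio frames and 16 visual frames of dim 64.
audio = torch.randn(2, 10, 64)
visual = torch.randn(2, 16, 64)
print(CrossModalFusion()(audio, visual).shape)  # torch.Size([2, 5])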

4.4.2 Flexible Interaction and Perception Architectures.

In the same way that interaction policies can be personalized to improve HRI outcomes [14, 17, 40, 52, 57, 123], we believe that HRI perception systems can be similarly adjusted to fit the current interaction needs. This is partially motivated by surveyed works that used state machine architectures to trigger actions based on perceptual stimuli [34, 35, 75, 105, 106, 121, 122] but could be extended to other types of robots operating in HSEs. In particular, a state machine or behavior tree architecture could use cues about the social context, human engagement, and human emotion to select which perception resources and modalities to use, and when. In the same way that a human does not deeply examine the facial expressions of every person they pass on the street, a robot could selectively choose when and how to perceive humans by employing methods found in this survey. For example, if no humans are present, a robot could simply perform human detection using vision and sound event detection. While operating in crowded, public HSEs, human tracking and social navigation using vision and LiDAR would likely be sufficient. If the robot detects that a person wishes to engage, resource-intensive HRI functions like facial recognition, emotion recognition, and speech recognition could be activated as appropriate. This has the added benefit of computational resource management [34], since many perception methods—especially those that use large models for real-time inference—may consume a disproportionate share of onboard processing capability on mobile robots and embedded systems.
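One lightweight way to realize this kind of context-dependent gating is a simple lookup from the inferred interaction state to the set of active perception modules; the states and module names below are hypothetical and chosen only to illustrate the idea.

# Hypothetical mapping from social context to the perception modules worth running.
PERCEPTION_PROFILES = {
    "no_humans": {"person_detection", "sound_event_detection"},
    "crowded_public_space": {"person_detection", "lidar_tracking", "social_navigation"},
    "engaged": {"face_recognition", "emotion_recognition",
                "speech_recognition", "gaze_estimation"},
}

def select_modules(context: str) -> set:
    """Return the perception modules to activate for the current social context,
    falling back to basic person detection when the context is unknown."""
    return PERCEPTION_PROFILES.get(context, {"person_detection"})

print(select_modules("crowded_public_space"))
print(select_modules("engaged"))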

4.4.3 Perception Systems for Long-Term Interaction.

Many surveyed works employed MMP systems in support of long-term, personalized interaction [14, 52, 57, 72]. This typically involved perceiving soft biometrics and performing face and voice identification for open-set identification, then fusing these observations with Bayesian or graph fusion techniques. One study [52] also recorded metadata about each person, such as when, how long, and how often they visited a particular location. These systems required specialized data storage, learning, and inference/retrieval mechanisms, which we did not cover in this survey. However, understanding what type of HRI data to record, and how to store, process, and retrieve it, is crucial for long-term interaction, and we believe this is an area worthy of further investigation. The references included in this survey can provide a starting point for further long-term HRI perception system development. While there are surveys of long-term learning for neural networks [119], robots [94, 141], and robot perception and manipulation [84], we found no surveys of perception systems or cognitive architectures specific to long-term HRI.

4.4.4 Integrated HRI Perception and Planning.

In HSEs, interaction is situated within a broader environmental context, and the ideal course of action is tightly coupled with the current environmental state [22]. Architectures that can fuse multimodal stimuli to simultaneously perceive the status of nearby humans and plan a suitable action may be especially desirable for HRI in HSEs; however, we found only two examples in the survey: the POMDP dialog manager of Lu et al. for the Kejia robot and the POMDP collaborative planner used in the SPENCER project [80, 154]. Both employ POMDP graphs on mobile robots in public environments to fuse multimodal inputs and select a suitable mode of interaction.
Additionally, the Workshop on Integrated Perception, Planning, and Control for Physically and Contextually Aware Robot Autonomy26 addresses this topic more generally. Contributions in 2023 address multimodal semantic segmentation and robust perception, techniques that could be employed for MMP in HRI.

4.4.5 New Sensory Modalities.

The surveyed research in perceptual interfaces is largely dominated by audiovisual systems [10, 12, 20, 21, 28, 36, 49, 139, 156, 180, 181, 182], but introducing new sensor modalities could increase the accuracy or robustness of perceptual interfaces while offering new semantic information.
Although 3D LiDAR detection has seen widespread use in autonomous vehicle perception, it was relatively uncommon for HRI perception; Yan et al.’s 3D LiDAR human tracker was the only example in this survey [176]. 3D LiDAR detection [174] could provide valuable complementary information to MMP HRI systems beyond existing 2D LiDAR leg detectors. Namely, a 3D system could detect humans even when they are not standing and would reduce false positives from furniture legs [175]. Several pre-trained 3D LiDAR human detectors are available, as mentioned in Section 4.3, and could be implemented on a robot platform with 3D LiDAR.
Thermal cameras and mmWave radar also offer promising sensor modalities to complement HRI but were relatively uncommon in surveyed works. Mainly, these modes were used in applications where vision may be degraded [29, 103, 131], but they offer many functional benefits: a human’s thermal signature can help segment human and non-human clusters [114] and indicate emotional state [41], while mmWave radar can penetrate visual obscurants like fog, smoke, and precipitation and operate in low-light conditions. Using thermal or radar data in a multimodal human perception system could be especially useful for segmenting humans in cluttered social scenes, potentially improving robustness or accuracy even when vision is not degraded.

4.4.6 Adopting Best Practices and Community Standards.

Finally, adopting community standards could reduce duplication of effort and lower the barrier to entry for HRI researchers. HRI in HSEs can often be unstructured and difficult to control, so reproducibility can also be a concern for HRI experiments and systems. Gunes et al. [65] provide useful guidelines for conducting HRI experiments and for HRI system design. They recommend that HRI researchers be explicit about the SDKs, middleware, and datasets used, and note that the use of common frameworks aids reproducibility, commonality, and development for HRI systems and experiments.
At the time of writing, ROS4HRI appears to be the most concentrated effort toward creating an HRI-specific community developer standard and includes a protocol and ROS message types. Implementation tutorials are under active development. Developers are invited to integrate additional hardware and software components into ROS4HRI in accordance with the REP-155 standard.27 Currently, ROS4HRI features facial detection, skeletal pose recognition, and face-to-body fusion via assignment, which is a subset of the many techniques discussed in this survey. Incorporating more perceptual and fusion techniques into the ROS4HRI framework would enable more rapid development and evaluation of HRI perception systems. Similarly, PSI’s website28 allows external collaborators to develop components for the PSI framework in accordance with the software’s API.

4.5 Concluding Remarks

As we explored in this article, MMP systems have the potential to offer more robust, flexible, and complete information for HRI compared to unimodal perception systems. The sensory acquisition, processing, and fusion techniques surveyed here are potentially relevant to a broad range of robotics and autonomous systems—including, but not limited to, surgical robots, teleoperated systems, collaborative robots, and autonomous vehicles. However, we primarily focused on MMP methods that could be employed on robot platforms in dynamic HSEs. Although HRI in HSEs remains a substantial challenge, novel tools, techniques, and standards offer a promising path forward.

Acknowledgments

The authors would like to acknowledge Christian Claudel, Joydeep Biswas, Ann Majewicz-Fey, and James Sulzer of The University of Texas at Austin for their guidance of this line of research. Additionally, John Duncan would like to thank Emmanuel Akita, Christina Petlowany, Christopher Suarez, and Srinath Tankasala of the Nuclear and Applied Robotics Group for their assistance in preparing this survey.

Footnotes

12
For a survey of model-based fusion techniques outside of robotics, we recommend Tadas Baltrusaitis et al.’s “Multimodal Machine Learning: A Survey and Taxonomy.”
14
Note that the authors use the term “Early Fusion,” but the audio and visual sensor modalities are independently processed before unimodal localization results are fused.

References

[1]
Ayodeji Opeyemi Abioye, Stephen D. Prior, Peter Saddington, and Sarvapali D. Ramchurn. 2022. The performance and cognitive workload analysis of a multimodal speech and visual gesture (mSVG) UAV control interface. Robotics and Autonomous Systems 147 (Jan. 2022), 103915. DOI:
[2]
Ayodeji O. Abioye, Stephen D. Prior, Glyn T. Thomas, Peter Saddington, and Sarvapali D. Ramchurn. 2018. The multimodal speech and visual gesture (mSVG) control model for a practical patrol, search, and rescue aerobot. In Towards Autonomous Robotic Systems. Manuel Giuliani, Tareq Assaf, and Maria Elena Giannaccini (Eds.), Lecture Notes in Computer Science, Vol. 10965, Springer International Publishing, Cham, 423–437. DOI:
[3]
Samer Al Moubayed, Jonas Beskow, and Gabriel Skantze. 2014. Spontaneous spoken dialogues with the furhat human-like robot head. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 326–326. DOI:
[4]
Samer Al Moubayed, Jonas Beskow, Gabriel Skantze, and Björn Granström. 2012. Furhat: A back-projected human-like robot head for multiparty human-machine interaction. In Proceedings of the Cognitive Behavioural Systems: COST 2102 International Training School, Revised Selected Papers. Springer, Berlin, Heidelberg, 114–130.
[5]
Xavier Alameda-Pineda, Jordi Sanchez-Riera, Johannes Wienke, Vojtech Franc, Jan Cech, Kaustubh Kulkarni, Antoine Deleforge, and Radu Horaud. 2012. RAVEL: An annotated corpus for training robots with audiovisual abilities. Journal on Multimodal User Interfaces 7 (Mar. 2012), 79–91. DOI:
[6]
Sean Andrist and Dan Bohus. 2020. Accelerating the development of multimodal, integrative-AI systems with platform for situated intelligence. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence for Human-Robot Interaction: Trust & Explainability in Artificial Intelligence for Human-Robot Interaction. Retrieved from https://www.microsoft.com/en-us/research/publication/accelerating-the-development-of-multimodal-integrative-ai-systems-with-platform-for-situated-intelligence/
[7]
Sean Andrist, Dan Bohus, and Ashley Feniello. 2019. Demonstrating a framework for rapid development of physically situated interactive systems. In Proceedings of the 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 668–668. DOI:
[8]
Pablo Azagra, Florian Golemo, Yoan Mollard, Manuel Lopes, Javier Civera, and Ana C. Murillo. 2017. A multimodal dataset for object model learning from natural human-robot interaction. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 6134–6141. DOI:
[9]
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (Feb. 2019), 423–443. DOI:
[10]
Yutong Ban, Xiaofei Li, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. 2018. Accounting for room acoustics in audio-visual multispeaker tracking. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 6553–6557. DOI:
[11]
Siddhartha Banerjee, Andrew Silva, and Sonia Chernova. 2018. Robot classification of human interruptibility and a study of its effects. ACM Transactions on Human-Robot Interaction 7, 2 (Jul. 2018), 1–35. DOI:
[12]
Baris Bayram and Gökhan Ince. 2015. Audio-visual multi-person tracking for active robot perception. In Proceedings of the 2015 IEEE/SICE International Symposium on System Integration (SII). IEEE, Piscataway, NJ, 575–580. DOI:
[13]
Michal Bednarek, Piotr Kicki, and Krzysztof Walas. 2020. On robustness of multi-modal fusion—Robotics perspective. Electronics 9, 7 (Jul. 2020), 1152. DOI:
[14]
Giulia Belgiovine, Jonas Gonzlez-Billandon, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2022. HRI framework for continual learning in face recognition. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 8226–8233. DOI:
[15]
Tony Belpaeme, Paul E. Baxter, Robin Read, Rachel Wood, Heriberto Cuayáhuitl, Bernd Kiefer, Stefania Racioppa, Ivana Kruijff-Korbayová, Georgios Athanasopoulos, Valentin Enescu, Rosemarijn Looije, Mark Neerincx, Yiannis Demiris, Raquel Ros-Espinoza, Aryel Beck, Lola Cañamero, Antione Hiolle, Matthew Lewis, Ilaria Baroni, Marco Nalin, Piero Cosi, Giulio Paci, Fabio Tesser, Giacomo Sommavilla, and Remi Humbert. 2013. Multimodal child-robot interaction: Building social bonds. Journal of Human-Robot Interaction 1, 2 (Jan. 2013), 33–53. DOI:
[16]
Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. 2017. UE-HRI: A new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, New York, NY, 464–472. DOI:
[17]
Atef Ben-Youssef, Giovanna Varni, Slim Essid, and Chloé Clavel. 2019. On-the-fly detection of user engagement decrease in spontaneous human–robot interaction using recurrent and deep neural networks. International Journal of Social Robotics 11, 5 (Dec. 2019), 815–828. DOI:
[18]
Wafa Benkaouar and Dominique Vaufreydaz. 2012. Multi-sensors engagement detection with a robot companion in a home environment. In Proceedings of the Workshop on Assistance and Service Robotics in a Human Environment at IEEE International Conference on Intelligent Robots and Systems (IROS ’12), 45–52.
[19]
Chiara Bodei, Linda Brodo, and Roberto Bruni. 2013. Open multiparty interaction. In Recent Trends in Algebraic Development Techniques. Narciso Martí-Oliet and Miguel Palomino (Eds.). Springer, Berlin, 1–23.
[20]
Dan Bohus and Eric Horvitz. 2009. Dialog in the open world: Platform and applications. In Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI ’09). ACM, New York, NY, 31. DOI:
[21]
Dan Bohus and Eric Horvitz. 2010. Facilitating multiparty dialog with gaze, gesture, and speech. In Proceedings of the International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction (ICMI-MLMI ’10). ACM, New York, NY, 1. DOI:
[22]
Dan Bohus, Ece Kamar, and Eric Horvitz. 2012. Towards situated collaboration. In Proceedings of the NAACL Workshop on Future Directions and Challenges in Spoken Dialog Systems: Tools and Data. Retrieved from https://www.microsoft.com/en-us/research/publication/towards-situated-collaboration/
[23]
Qin Cai, David Gallup, Cha Zhang, and Zhengyou Zhang. 2010. 3D deformable face tracking with a commodity depth camera. In Proceedings of the Computer Vision – ECCV 2010. Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer, Berlin, 229–242.
[24]
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 1 (2019), 172–186.
[25]
Federico Castanedo. 2013. A review of data fusion techniques. The Scientific World Journal 2013 (2013), 1–19. DOI:
[26]
Oya Celiktutan, Efstratios Skordos, and Hatice Gunes. 2019. Multimodal human-human-robot interactions (MHHRI) dataset for studying personality and engagement. IEEE Transactions on Affective Computing 10, 4 (2019), 484–497. DOI:
[27]
Crystal Chao and Andrea Thomaz. 2013. Controlling social dynamics with a parametrized model of floor regulation. Journal of Human-Robot Interaction 2, 1 (Mar. 2013), 4–29. DOI:
[28]
Aaron Chau, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Kotaro Funakoshi. 2019. Audio-visual SLAM towards human tracking and human-robot interaction in indoor environments. In Proceedings of the 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, Piscataway, NJ, 1–8. DOI:
[29]
Anjun Chen, Xiangyu Wang, Kun Shi, Shaohao Zhu, Bin Fang, Yingfeng Chen, Jiming Chen, Yuchi Huo, and Qi Ye. 2023. ImmFusion: Robust mmWave-RGB fusion for 3D human body reconstruction in all weather conditions. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 2752–2758. DOI:
[30]
Anjun Chen, Xiangyu Wang, Shaohao Zhu, Yanxu Li, Jiming Chen, and Qi Ye. 2022. mmBody benchmark: 3D body reconstruction dataset and analysis for millimeter wave radar. In Proceedings of the 30th ACM International Conference on Multimedia. ACM, New York, NY, 3501–3510. DOI:
[31]
Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2015a. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), 168–172. DOI:
[32]
Yingfeng Chen, Feng Wu, Wei Shuai, Ningyang Wang, Rongya Chen, and Xiaoping Chen. 2015b. KeJia robot–An attractive shopping mall guider. In Social Robotics. Adriana Tapus, Elisabeth André, Jean-Claude Martin, Francois Ferland, and Mehdi Ammi (Eds.), Lecture Notes in Computer Science, Vol. 9388. Springer International Publishing, Cham, 145–154. DOI:
[33]
Wongun Choi, Khuram Shahid, and Silvio Savarese. 2009. What are they doing? Collective activity classification using spatio-temporal relationship among people. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, 1282–1289. DOI:
[34]
Vivian Chu, Kalesha Bullard, and Andrea L. Thomaz. 2014. Multimodal real-time contingency detection for HRI. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Piscataway, NJ, 3327–3332. DOI:
[35]
Nikhil Churamani, Paul Anton, Marc Brügger, Erik Fließwasser, Thomas Hummel, Julius Mayer, Waleed Mustafa, Hwei Geok Ng, Thi Linh Chi Nguyen, Quan Nguyen, Marcus Soll, Sebastian Springenberg, Sascha Griffiths, Stefan Heinrich, Nicolás Navarro-Guerrero, Erik Strahl, Johannes Twiefel, Cornelius Weber, and Stefan Wermter. 2017. The impact of personalization on human-robot interaction in learning scenarios. In Proceedings of the 5th International Conference on Human Agent Interaction. ACM, New York, NY, 171–180. DOI:
[36]
Eleonora D’Arca, Neil M. Robertson, and James R. Hopgood. 2016. Robust indoor speaker recognition in a network of audio and video sensors. Signal Processing 129 (Dec. 2016), 137–149. DOI:
[37]
Nicola Doering, Sandra Poeschl, Horst-Michael Gross, Andreas Bley, Christian Martin, and Hans-Joachim Boehme. 2015. User-centered design and evaluation of a mobile shopping robot. International Journal of Social Robotics 7, 2 (Apr. 2015), 203–225. DOI:
[38]
Niki Efthymiou, Panagiotis P. Filntisis, Petros Koutras, Antigoni Tsiami, Jack Hadfield, Gerasimos Potamianos, and Petros Maragos. 2022. ChildBot: Multi-robot perception and interaction with children. Robotics and Autonomous Systems 150 (Apr. 2022), 103975. DOI:
[39]
Olov Engwall, Ronald Cumbal, José Lopes, Mikael Ljung, and Linnea Maansson. 2022. Identification of low-engaged learners in robot-led second language conversations with adults. ACM Transactions on Human-Robot Interaction 11, 2 (Jun. 2022), 1–33. DOI:
[40]
Olov Engwall, José Lopes, and Anna Åhlund. 2021. Robot interaction styles for conversation practice in second language learning. International Journal of Social Robotics 13, 2 (Apr. 2021), 251–276. DOI:
[41]
Chiara Filippini, Edoardo Spadolini, Daniela Cardone, Domenico Bianchi, Maurizio Preziuso, Christian Sciarretta, Valentina Del Cimmuto, Davide Lisciani, and Arcangelo Merla. 2021. Facilitating the child–robot interaction by endowing the robot with the capability of understanding the child engagement: The case of Mio Amico Robot. International Journal of Social Robotics 13, 4 (Jul. 2021), 677–689. DOI:
[42]
Panagiotis Paraskevas Filntisis, Niki Efthymiou, Petros Koutras, Gerasimos Potamianos, and Petros Maragos. 2019. Fusing body posture with facial expressions for joint recognition of affect in child–robot interaction. IEEE Robotics and Automation Letters 4, 4 (Oct. 2019), 4011–4018. DOI:
[43]
David Fischinger, Peter Einramhof, Konstantinos Papoutsakis, Walter Wohlkinger, Peter Mayer, Paul Panek, Stefan Hofmann, Tobias Koertner, Astrid Weiss, Antonis Argyros, and Markus Vincze. 2016. Hobbit, a care robot supporting independent living at home: First prototype and lessons learned. Robotics and Autonomous Systems 75 (Jan. 2016), 60–78. DOI:
[44]
Mary Ellen Foster. 2014. Validating attention classifiers for multi-party human-robot interaction. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction: Workshop on Attention Models in Robotics. ACM, New York, NY.
[45]
Mary Ellen Foster, Rachid Alami, Olli Gestranius, Oliver Lemon, Marketta Niemelä, Jean-Marc Odobez, and Amit Kumar Pandey. 2016. The MuMMER project: Engaging human-robot interaction in real-world public spaces. In Social Robotics. Arvin Agah, John-John Cabibihan, Ayanna M. Howard, Miguel A. Salichs, and Hongsheng He (Eds.), Lecture Notes in Computer Science, Vol. 9979. Springer International Publishing, Cham, 753–763. DOI:
[46]
Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, Yuanzhouhan Cao, Weipeng He, Angel Martínez-González, Petr Motlicek, Rémy Siegfried, Rachid Alami, Kathleen Belhassein, Guilhem Buisan, Aurélie Clodic, Amandine Mayima, Yoan Sallami, Guillaume Sarthou, Phani-Teja Singamaneni, Jules Waldhart, Alexandre Mazel, Maxime Caniot, Marketta Niemelä, Päivi Heikkilä, Hanna Lammi, Antti Tammela. 2019. Mummer: Socially intelligent human-robot interaction in public spaces. arXiv:1909.06749. Retrieved from https://arxiv.org/pdf/1909.06749
[47]
Mary Ellen Foster, Andre Gaschler, and Manuel Giuliani. 2017. Automatically classifying user engagement for dynamic multi-party human–robot interaction. International Journal of Social Robotics 9, 5 (Nov. 2017), 659–674. DOI:
[48]
Angus Fung, Beno Benhabib, and Goldie Nejat. 2023. Robots autonomously detecting people: A multimodal deep contrastive learning method robust to intraclass variations. IEEE Robotics and Automation Letters 8, 6 (Jun. 2023), 3550–3557. DOI:
[49]
Israel D. Gebru, Sileye Ba, Xiaofei Li, and Radu Horaud. 2018. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 5 (May 2018), 1086–1099. DOI:
[50]
Sarah Gillet, Ronald Cumbal, André Pereira, José Lopes, Olov Engwall, and Iolanda Leite. 2021. Robot gaze can mediate participation imbalance in groups with different skill levels. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. ACM, Boulder CO, 303–311. DOI:
[51]
Dylan F. Glas, Satoru Satake, Florent Ferreri, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2013. The network robot system: Enabling social human-robot interaction in public spaces. Journal of Human-Robot Interaction 1, 2 (Jan. 2013), 5–32. DOI:
[52]
Dylan F. Glas, Kanae Wada, Masahiro Shiomi, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2017. Personal greetings: Personalizing robot utterances based on novelty of observed behavior. International Journal of Social Robotics 9, 2 (Apr. 2017), 181–198. DOI:
[53]
Matthew Gombolay, Anna Bair, Cindy Huang, and Julie Shah. 2017. Computational design of mixed-initiative human–robot teaming that considers human factors: Situational awareness, workload, and workflow preferences. The International Journal of Robotics Research 36, 5–7 (Jun. 2017), 597–617. DOI:
[54]
Randy Gomez, Levko Ivanchuk, Keisuke Nakamura, Takeshi Mizumoto, and Kazuhiro Nakadai. 2015. Utilizing visual cues in robot audition for sound source discrimination in speech-based human-robot communication. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 4216–4222. DOI:
[55]
Randy Gomez, Alvaro Paez, Yu Fang, Serge Thill, Luis Merino, Eric Nichols, Keisuke Nakamura, and Heike Brock. 2022. Developing the bottom-up attentional system of a social robot. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 7402–7408. DOI:
[56]
Randy Gomez, Deborah Szapiro, Kerl Galindo, and Keisuke Nakamura. 2018. Haru: Hardware design of an experimental tabletop robot assistant. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 233–240. DOI:
[57]
Jonas Gonzalez, Giulia Belgiovine, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2021. Towards a cognitive framework for multimodal person recognition in multiparty HRI. In Proceedings of the 9th International Conference on Human-Agent Interaction. ACM, New York, NY, 412–416. DOI:
[58]
Jonas Gonzalez-Billandon, Giulia Belgiovine, Matthew Tata, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2021. Self-supervised learning framework for speaker localisation with a humanoid robot. In Proceedings of the 2021 IEEE International Conference on Development and Learning (ICDL). IEEE, Piscataway, NJ, 1–7. DOI:
[59]
Jonas Gonzalez-Billandon, Alessandra Sciutti, Matthew Tata, Giulio Sandini, and Francesco Rea. 2020. Audiovisual cognitive architecture for autonomous learning of face localisation by a Humanoid Robot. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 5979–5985.
[60]
Michael A. Goodrich and Alan C. Schultz. 2007. Human-robot interaction: A survey. Foundations and Trends® in Human-Computer Interaction 1, 3 (2007), 203–275. DOI:
[61]
Francois Grondin and James Glass. 2019. Fast and robust 3-D sound source localization with DSVD-PHAT. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 5352–5357. DOI:
[62]
Francois Grondin, Dominic Létourneau, Cédric Godin, Jean-Samuel Lauzon, Jonathan Vincent, Simon Michaud, Samuel Faucher, and Francois Michaud. 2021. ODAS: Open embedded audition system. (Mar. 2021). Retrieved from https://www.frontiersin.org/articles/10.3389/frobt.2022.854444/full
[63]
Francois Grondin and Francois Michaud. 2016. Noise mask for TDOA sound source localization of speech on mobile robots in noisy environments. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 4530–4535. DOI:
[64]
Francois Grondin and Francois Michaud. 2018. Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations. (Nov. 2018). DOI:
[65]
Hatice Gunes, Frank Broz, Chris S. Crawford, Astrid Rosenthal-von Der Pütten, Megan Strait, and Laurel Riek. 2022. Reproducibility in human-robot interaction: Furthering the science of HRI. Current Robotics Reports 3, 4 (Oct. 2022), 281–292. DOI:
[66]
Raoul Harel, Zerrin Yumak, and Frank Dignum. 2018. Towards a generic framework for multi-party dialogue with virtual humans. In Proceedings of the 31st International Conference on Computer Animation and Social Agents (CASA ’18). ACM, New York, NY, 1–6. DOI:
[67]
Kotaro Hoshiba, Osamu Sugiyama, Akihide Nagamine, Ryosuke Kojima, Makoto Kumon, Kazuhiro Nakadai. 2017. Design and assessment of sound source localization system with a UAV-embedded microphone array. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 154–167. DOI:
[68]
Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, and Jianguo Zhang. 2015. Jointly learning heterogeneous features for RGB-D activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5344–5352.
[69]
Jwu-Sheng Hu, Chen-Yu Chan, Cheng-Kang Wang, Ming-Tang Lee, and Ching-Yi Kuo. 2011. Simultaneous localization of a mobile robot and multiple sound sources using a microphone array. Advanced Robotics 25, 1–2 (Jan. 2011), 135–152. DOI:
[70]
Ruihan Hu, Songbing Zhou, Zhi Ri Tang, Sheng Chang, Qijun Huang, Yisen Liu, Wei Han, and Edmond Q. Wu. 2021. DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization. Neural Networks 133 (Jan. 2021), 229–239. DOI:
[71]
Bahar Irfan, Natalia Lyubova, Michael Garcia Ortiz, and Tony Belpaeme. 2018. Multi-modal open-set person identification in HRI. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction Social Robots in the Wild Workshop. ACM. Retrieved from http://socialrobotsinthewild.org/wp-content/uploads/2018/02/HRI-SRW_2018_paper_6.pdf
[72]
Bahar Irfan, Michael Garcia Ortiz, Natalia Lyubova, and Tony Belpaeme. 2022. Multi-modal open world user identification. ACM Transactions on Human-Robot Interaction 11, 1 (Mar. 2022), 1–50. DOI:
[73]
Carlos T. Ishi, Jani Even, and Norihiro Hagita. 2015. Speech activity detection and face orientation estimation using multiple microphone arrays and human position information. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 5574–5579. DOI:
[74]
Md Mofijul Islam and Tariq Iqbal. 2020. HAMLET: A hierarchical multimodal attention-based human activity recognition algorithm. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 10285–10292. DOI:
[75]
Mithun G. Jacob, Yu-Ting Li, and Juan P. Wachs. 2013. Surgical instrument handling and retrieval in the operating room with a multimodal robotic assistant. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation. IEEE, Piscataway, NJ, 2140–2145. DOI:
[76]
Shomik Jain, Balasubramanian Thiagarajan, Zhonghao Shi, Caitlyn Clabaugh, and Maja J. Matarić. 2020. Modeling engagement in long-term, in-home socially assistive robot interventions for children with autism spectrum disorders. Science Robotics 5, 39 (Feb. 2020), eaaz3791. DOI:
[77]
Jinhyeok Jang, Dohyung Kim, Cheonshu Park, Minsu Jang, Jaeyeon Lee, and Jaehong Kim. 2020. ETRI-activity3D: A large-scale RGB-D dataset for robots to recognize daily activities of the elderly. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 10990–10997. DOI:
[78]
Shu Jiang and Ronald C. Arkin. 2015. Mixed-initiative human-robot interaction: Definition, taxonomy, and survey. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, Piscataway, NJ, 954–961. DOI:
[79]
Martin Johansson, Gabriel Skantze, and Joakim Gustafson. 2013. Head pose patterns in multiparty human-robot team-building interactions. In Social Robotics. David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Guido Herrmann, Martin J. Pearson, Alexander Lenz, Paul Bremner, Adam Spiers, and Ute Leonards (Eds.), Lecture Notes in Computer Science, Vol. 8239. Springer International Publishing, Cham, 351–360. DOI:
[80]
Michiel Joosse and Vanessa Evers. 2017. A guide robot at the airport: First impressions. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 149–150. DOI:
[81]
Swapna Joshi, Sawyer Collins, Waki Kamino, Randy Gomez, and Selma Šabanović. 2020. Social robots for socio-physical distancing. In Social Robotics. Alan R. Wagner, David Feil-Seifer, Kerstin S. Haring, Silvia Rossi, Thomas Williams, Hongsheng He, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 12483. Springer International Publishing, Cham, 440–452. DOI:
[82]
Malte Jung and Pamela Hinds. 2018. Robots in the wild: A time for more robust theories of human-robot interaction. ACM Transactions on Human-Robot Interaction 7 (May 2018), 1–5. DOI:
[83]
Nikolaos Kardaris, Isidoros Rodomagoulakis, Vassilis Pitsikalis, Antonis Arvanitakis, and Petros Maragos. 2016. A platform for building new human-computer interface systems that support online automatic recognition of audio-gestural commands. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, New York, NY, 1169–1173. DOI:
[84]
S. Hamidreza Kasaei, Jorik Melsen, Floris van Beers, Christiaan Steenkist, and Klemen Voncina. 2021. The state of lifelong learning in service robots: Current bottlenecks in object perception and manipulation. Journal of Intelligent & Robotic Systems 103 (2021), 1–31.
[85]
Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, and Jaehong Kim. 2021. AIR-Act2Act: Human–human interaction dataset for teaching non-verbal social behaviors to robots. The International Journal of Robotics Research 40, 4–5 (2021), 691–697.
[86]
Thomas Kollar, Anu Vedantham, Corey Sobel, Cory Chang, Vittorio Perera, and Manuela Veloso. 2012. A multi-modal approach for natural human-robot interaction. In Social Robotics. David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Shuzhi Sam Ge, Oussama Khatib, John-John Cabibihan, Reid Simmons, and Mary-Anne Williams (Eds.), Lecture Notes in Computer Science, Vol. 7621. Springer, Berlin, 458–467. DOI:
[87]
Tsuyoshi Komatsubara, Masahiro Shiomi, Thomas Kaczmarek, Takayuki Kanda, and Hiroshi Ishiguro. 2019. Estimating children’s social status through their interaction activities in classrooms with a social robot. International Journal of Social Robotics 11, 1 (Jan. 2019), 35–48. DOI:
[88]
David Kortenkamp, R. Peter Bonasso, Dan Ryan, and Debbie Schreckenghost. 1997. Traded control with autonomous robots as mixed initiative interaction. In Proceedings of the AAAI Symposium on Mixed Initiative Interaction, Vol. 97, 89–94.
[89]
Arkadiusz Kwasigroch, Agnieszka Mikolajczyk, and Michal Grochowski. 2017. Deep neural networks approach to skin lesions classification—A comparative analysis. In Proceedings of the 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR). IEEE, Piscataway, NJ, 1069–1074. DOI:
[90]
Javier Laplaza, Nicolas Rodriguez, J. E. Dominguez-Vidal, Fernando Herrero, Sergi Hernandez, Alejandro Lopez, Alberto Sanfeliu, and Anais Garrell. 2022. IVO robot: A new social robot for human-robot collaboration. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 860–864. DOI:
[91]
Ivan Laptev. 2005. On space-time interest points. International Journal of Computer Vision 64, 2 (Sept. 2005), 107–123. DOI:
[92]
Séverin Lemaignan, Charlotte E. R. Edmunds, Emmanuel Senft, and Tony Belpaeme. 2018. The PInSoRo dataset: Supporting the data-driven study of child-child and child-robot social dynamics. PLOS ONE 13, 10 (Oct. 2018), 1–19. DOI:
[93]
Séverin Lemaignan, Mathieu Warnier, E. Akin Sisbot, Aurélie Clodic, and Rachid Alami. 2017. Artificial cognition for social human–robot interaction: An implementation. Artificial Intelligence 247 (Jun. 2017), 45–69. DOI:
[94]
Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez. 2020. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion 58 (2020), 52–68.
[95]
Yuanchao Li, Tianyu Zhao, and Xun Shen. 2020. Attention-based multimodal fusion for estimating human emotion in real-world HRI. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 340–342. DOI:
[96]
Rainer Lienhart, Alexander Kuranov, and Vadim Pisarevsky. 2003. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In Pattern Recognition. Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, Bernd Michaelis, and Gerald Krell (Eds.), Lecture Notes in Computer Science, Vol. 2781. Springer, Berlin, 297–304. DOI:
[97]
Timm Linder, Stefan Breuers, Bastian Leibe, and Kai O. Arras. 2016. On multi-modal people tracking from mobile platforms in very crowded and dynamic environments. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 5512–5519. DOI:
[98]
Timm Linder, Kilian Y. Pfeiffer, Narunas Vaskevicius, Robert Schirmer, and Kai O. Arras. 2020. Accurate detection and 3D localization of humans using a novel YOLO-based RGB-D fusion approach and synthetic training data. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 1000–1006. DOI:
[99]
Jeroen Linssen and Mariët Theune. 2017. R3D3: The rolling receptionist robot with double Dutch dialogue. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 189–190. DOI:
[100]
Guiyu Liu, Jiuchao Qian, Fei Wen, Xiaoguang Zhu, Rendong Ying, and Peilin Liu. 2019. Action recognition based on 3D skeleton and RGB frame fusion. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 258–264. DOI:
[101]
Hongyi Liu, Tongtong Fang, Tianyu Zhou, and Lihui Wang. 2018. Towards robust human-robot collaborative manufacturing: Multimodal fusion. IEEE Access 6 (2018), 74762–74771. DOI:
[102]
Dongcai Lu, Shiqi Zhang, Peter Stone, and Xiaoping Chen. 2017. Leveraging commonsense reasoning and multimodal perception for robot spoken dialog systems. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 6582–6588. DOI:
[103]
Lyujian Lu, Hua Wang, Brian Reily, and Hao Zhang. 2021. Robust real-time group activity recognition of robot teams. IEEE Robotics and Automation Letters 6, 2 (Apr. 2021), 2052–2059. DOI:
[104]
Yaxiong Ma, Yixue Hao, Min Chen, Jincai Chen, Ping Lu, and Andrej Kosir. 2019. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Information Fusion 46 (Mar. 2019), 184–192. DOI:
[105]
Umberto Maniscalco, Aniello Minutolo, Pietro Storniolo, and Massimo Esposito. 2024. Towards a more anthropomorphic interaction with robots in museum settings: An experimental study. Robotics and Autonomous Systems 171 (Jan. 2024), 104561. DOI:
[106]
Umberto Maniscalco, Pietro Storniolo, and Antonio Messina. 2022. Bidirectional multi-modal signs of checking human-robot engagement and interaction. International Journal of Social Robotics 14, 5 (Jul. 2022), 1295–1309. DOI:
[107]
Mirko Marras, Pedro A. Marín-Reyes, José Javier Lorenzo Navarro, Modesto Fernando Castrillón Santana, and Gianni Fenu. 2019. AveRobot: An audio-visual dataset for people re-identification and verification in human-robot interaction. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM ’19). 255–265. DOI:
[108]
Eric Martinson, Wallace Lawson, and J. Gregory Trafton. 2013. Identifying people with soft-biometrics at fleet week. In Proceedings of the 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 49–56. DOI:
[109]
E. Martinson and V. Yalla. 2016. Augmenting deep convolutional neural networks with depth-based layered detection for human detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 1073–1078. DOI:
[110]
Youssef Mohamed and Severin Lemaignan. 2021. ROS for human-robot interaction. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 3020–3027. DOI:
[111]
Jesús Morales, Ricardo Vázquez-Martín, Anthony Mandow, David Morilla-Cabello, and Alfonso García-Cerezo. 2021. The UMA-SAR Dataset: Multimodal data collection from a ground vehicle during outdoor disaster response training exercises. The International Journal of Robotics Research 40, 6–7 (Jun. 2021), 835–847. DOI:
[112]
Kazuhiro Nakadai, Gökhan Ince, Keisuke Nakamura, and Hirofumi Nakajima. 2012. Robot audition for dynamic environments. In Proceedings of the 2012 IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC ’12). IEEE, Piscataway, NJ, 125–130. DOI:
[113]
Kazuhiro Nakadai, Hiroshi G. Okuno, Hirofumi Nakajima, Yuji Hasegawa, and Hiroshi Tsujino. 2008. An open source software system for robot audition HARK and its evaluation. In Proceedings of the Humanoids 2008 - 8th IEEE-RAS International Conference on Humanoid Robots. IEEE, Piscataway, NJ, 561–566. DOI:
[114]
Keisuke Nakamura, Kazuhiro Nakadai, Futoshi Asano, and Gökhan Ince. 2011. Intelligent sound source localization and its application to multimodal human tracking. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Piscataway, NJ, 143–148. DOI:
[115]
Eric Nichols, Sarah Rose Siskind, Waki Kamino, Selma Šabanović, and Randy Gomez. 2021. Iterative design of an emotive voice for the tabletop robot Haru. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 362–374. DOI:
[116]
Matthias Nieuwenhuisen and Sven Behnke. 2013. Human-like interaction skills for the mobile communication robot Robotinho. International Journal of Social Robotics 5, 4 (Nov. 2013), 549–561. DOI:
[117]
Aastha Nigam and Laurel D. Riek. 2015. Social context perception for mobile robots. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 3621–3627. DOI:
[118]
Timo Ojala, Matti Pietikäinen, and David Harwood. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29, 1 (1996), 51–59. DOI:
[119]
German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113 (2019), 54–71.
[120]
Maria Pateraki, Markos Sigalas, Georgios Chliveros, and Panos Trahanias. 2013. Visual human-robot communication in social settings. In Proceedings of ICRA Workshop on Semantics, Identification and Control of Robot-Human-Environment Interaction.
[121]
Andre Pereira, Catharine Oertel, Leonor Fermoselle, Joe Mendelson, and Joakim Gustafson. 2019. Responsive joint attention in human-robot interaction. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 1080–1087. DOI:
[122]
Roberto Pinillos, Samuel Marcos, Raul Feliz, Eduardo Zalama, and Jaime Gómez-García-Bermejo. 2016. Long-term assessment of a service robot in a hotel environment. Robotics and Autonomous Systems 79 (May 2016), 40–57. DOI:
[123]
David Portugal, Paulo Alvito, Eleni Christodoulou, George Samaras, and Jorge Dias. 2019. A study on the deployment of a service robot in an elderly care center. International Journal of Social Robotics 11, 2 (Apr. 2019), 317–341. DOI:
[124]
Shokoofeh Pourmehr, Jack Thomas, Jake Bruce, Jens Wawerla, and Richard Vaughan. 2017. Robust sensor fusion for finding HRI partners in a crowd. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3272–3278. DOI:
[125]
José Augusto Prado, Carlos Simplício, Nicolás F. Lori, and Jorge Dias. 2012. Visuo-auditory multimodal emotional structure to improve human-robot-interaction. International Journal of Social Robotics 4, 1 (Jan. 2012), 29–51. DOI:
[126]
John Páez and Enrique González. 2022. Human-robot scaffolding: An architecture to foster problem-solving skills. ACM Transactions on Human-Robot Interaction 11, 3 (Sept. 2022), 1–17. DOI:
[127]
Ricardo Ragel, Rafael Rey, Álvaro Páez, Javier Ponce, Keisuke Nakamura, Fernando Caballero, Luis Merino, and Randy Gómez. 2022. Multi-modal data fusion for people perception in the social robot Haru. In Social Robotics. Filippo Cavallo, John-John Cabibihan, Laura Fiorini, Alessandra Sorrentino, Hongsheng He, Xiaorui Liu, Yoshio Matsumoto, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 13817. Springer Nature Switzerland, Cham, 174–187. DOI:
[128]
Dhanesh Ramachandram and Graham W. Taylor. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34, 6 (Nov. 2017), 96–108. DOI:
[129]
Arnaud Ramey, Javier F. Gorostiza, and Miguel A. Salichs. 2012. A social robot as an aloud reader: Putting together recognition and synthesis of voice and gestures for HRI experimentation. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction, 213–214.
[130]
Caleb Rascon and Ivan Meza. 2017. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems 96 (Oct. 2017), 184–210. DOI:
[131]
Brian Reily, Peng Gao, Fei Han, Hua Wang, and Hao Zhang. 2022. Real-time recognition of team behaviors by multisensory graph-embedded robot learning. The International Journal of Robotics Research 41, 8 (Jul. 2022), 798–811. DOI:
[132]
Laurel D. Riek. 2013. The social co-robotics problem space: Six key challenges. In Proceedings of the Robotics: Science, and Systems (RSS), Robotics Challenges and Visions. 13–16.
[133]
Adam Robaczewski, Julie Bouchard, Kevin Bouchard, and Sébastien Gaboury. 2021. Socially assistive robots: The specific case of the NAO. International Journal of Social Robotics 13, 4 (Jul. 2021), 795–831. DOI:
[134]
Fraser Robinson and Goldie Nejat. 2023. A deep learning human activity recognition framework for socially assistive robots to support reablement of older adults. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 6160–6167. DOI:
[135]
Nicole Robinson, Brendan Tidd, Dylan Campbell, Dana Kulić, and Peter Corke. 2023. Robotic vision for human-robot interaction and collaboration: A survey and systematic review. ACM Transactions on Human-Robot Interaction 12, 1 (Mar. 2023), 1–66. DOI:
[136]
Isidoros Rodomagoulakis, Nikolaos Kardaris, Vassilis Pitsikalis, Effrosyni Mavroudi, Athanasios Katsamanis, Antigoni Tsiami, and Petros Maragos. 2016. Multimodal human action recognition in assistive human-robot interaction. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 2702–2706.
[137]
Alessandra Rossi, Mariacarla Staffa, Antonio Origlia, Maria di Maro, and Silvia Rossi. 2021. BRILLO: A robotic architecture for personalised long-lasting interactions in a bartending domain. In Proceedings of the Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 426–429. DOI:
[138]
Miguel A. Salichs, Álvaro Castro-González, Esther Salichs, Enrique Fernández-Rodicio, Marcos Maroto-Gómez, Juan José Gamboa-Montero, Sara Marques-Villarroya, José Carlos Castillo, Fernando Alonso-Martín, and Maria Malfaz. 2020. Mini: A new social robot for the elderly. International Journal of Social Robotics 12, 6 (Dec. 2020), 1231–1249. DOI:
[139]
Jordi Sanchez-Riera, Xavier Alameda-Pineda, and Radu Horaud. 2012. Audio-visual robot command recognition: D-META’12 grand challenge. In Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI ’12). ACM, New York, NY, 371. DOI:
[140]
Yoko Sasaki, Ryo Tanabe, and Hiroshi Takemura. 2018. Online spatial sound perception using microphone array on mobile robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 2478–2484. DOI:
[141]
Khadija Shaheen, Muhammad Abdullah Hanif, Osman Hasan, and Muhammad Shafique. 2022. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks. Journal of Intelligent & Robotic Systems 105, 1 (2022), 9.
[142]
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. 2016. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1010–1019.
[143]
Zhihao Shen, Armagan Elibol, and Nak Young Chong. 2021. Multi-modal feature fusion for better understanding of human personality traits in social human–robot interaction. Robotics and Autonomous Systems 146 (Dec. 2021), 103874. DOI:
[144]
Shreyas S. Shivakumar, Neil Rodrigues, Alex Zhou, Ian D. Miller, Vijay Kumar, and Camillo J. Taylor. 2020. PST900: RGB-thermal calibration, dataset and segmentation network. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 9441–9447. DOI:
[145]
Nikhita Singh, Jin Joo Lee, Ishaan Grover, and Cynthia Breazeal. 2018. P2PSTORY: Dataset of children as storytellers and listeners in peer-to-peer interactions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI’18). ACM, New York, NY, 1–11. DOI:
[146]
Andrew Specian, Ross Mead, Simon Kim, Maja Mataric, and Mark Yim. 2022. Quori: A community-informed design of a socially interactive humanoid robot. IEEE Transactions on Robotics 38, 3 (Jun. 2022), 1755–1772. DOI:
[147]
Micol Spitale, Chris Birmingham, R. Michael Swan, and Maja J. Mataric. 2021. Composing HARMONI: An open-source tool for human and robot modular OpeN interaction. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3322–3329. DOI:
[148]
Ryu Takeda. 2017. Noise-robust MUSIC-based sound source localization using steering vector transformation for small humanoids. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 26–36. DOI:
[149]
Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan, Evgenios Vlachos, Sven Ewan Shepstone, Morten Hojfeldt Rasmussen, and Jesper Lisby Højvang. 2018. iSocioBot: A multimodal interactive social robot. International Journal of Social Robotics 10, 1 (Jan. 2018), 5–19. DOI:
[150]
Matteo Terreran, Leonardo Barcellona, and Stefano Ghidoni. 2023. A general skeleton-based action and gesture recognition framework for human–robot collaboration. Robotics and Autonomous Systems 170 (Dec. 2023), 104523. DOI:
[151]
Nadia Magnenat Thalmann, Nidhi Mishra, and Gauri Tulsulkar. 2021. Nadine the social robot: Three case studies in everyday life. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 107–116. DOI:
[152]
Mariët Theune, Daan Wiltenburg, Max Bode, and Jeroen Linssen. 2017. R3D3 in the wild: Using a robot for turn management in multi-party interaction with a virtual human. In Proceedings of the IVA Workshop on Interaction with Agents and Robots: Different Embodiments, Common Challenges.
[153]
Susanne Trick, Franziska Herbert, Constantin A. Rothkopf, and Dorothea Koert. 2022. Interactive reinforcement learning with Bayesian fusion of multimodal advice. IEEE Robotics and Automation Letters 7, 3 (Jul. 2022), 7558–7565. DOI:
[154]
Rudolph Triebel, Kai Arras, Rachid Alami, Lucas Beyer, Stefan Breuers, Raja Chatila, Mohamed Chetouani, Daniel Cremers, Vanessa Evers, Michelangelo Fiore, Hayley Hung, Omar A. Islas Ramírez, Michiel Joosse, Harmish Khambhaita, Tomasz Kucner, Bastian Leibe, Achim J. Lilienthal, Timm Linder, Manja Lohse, Martin Magnusson, Billy Okal, Luigi Palmieri, Umer Rafi, Marieke van Rooij, and Lu Zhang. 2016. SPENCER: A socially aware service robot for passenger guidance and help in busy airports. In Field and Service Robotics. David S. Wettergreen and Timothy D. Barfoot (Eds.), Lecture Notes in Computer Science, Vol. 113. Springer International Publishing, Cham, 607–622. DOI:
[155]
Mario Trobinger, Christoph Jahne, Zheng Qu, Jean Elsner, Anton Reindl, Sebastian Getz, Thore Goll, Benjamin Loinger, Tamara Loibl, Christoph Kugler, Carles Calafell, Mohamadreza Sabaghian, Tobias Ende, Daniel Wahrmann, Sven Parusel, Simon Haddadin, and Sami Haddadin. 2021. Introducing GARMI - A service robotics platform to support the elderly at home: Design philosophy, system overview and first results. IEEE Robotics and Automation Letters 6, 3 (Jul. 2021), 5857–5864. DOI:
[156]
Antigoni Tsiami, Panagiotis Paraskevas Filntisis, Niki Efthymiou, Petros Koutras, Gerasimos Potamianos, and Petros Maragos. 2018. Far-field audio-visual scene perception of multi-party human-robot interaction for children and adults. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 6568–6572. DOI:
[157]
Matthew Turk. 2001. Perceptual user interfaces. In Frontiers of Human-Centered Computing, Online Communities and Virtual Environments. Rae A. Earnshaw, Richard A. Guedj, Andries van Dam, and John A. Vince (Eds.). Springer, London, 39–51. DOI:
[158]
Nguyen Tan Viet Tuyen, Alexandra L. Georgescu, Irene Di Giulio, and Oya Celiktutan. 2023. A multimodal dataset for robot learning to imitate social human-human interaction. In Proceedings of the Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 238–242. DOI:
[159]
Jean-Marc Valin, Shun’ichi Yamamoto, Jean Rouat, Francois Michaud, Kazuhiro Nakadai, and Hiroshi G. Okuno. 2007. Robust recognition of simultaneous speech by a mobile robot. IEEE Transactions on Robotics 23, 4 (Aug. 2007), 742–752. DOI:
[160]
Michel Valstar, Björn W. Schuller, Jarek Krajewski, Roddy Cowie, and Maja Pantic. 2014. AVEC 2014: The 4th international audio/visual emotion challenge and workshop. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, New York, NY, 1243–1244. DOI:
[161]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc., Red Hook, NY. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[162]
Yurii Vasylkiv, Heike Brock, Yu Fang, Eric Nichols, Keisuke Nakamura, Serge Thill, and Randy Gomez. 2020. An exploration of simple reactive responses for conveying aliveness using the Haru robot. In Social Robotics. Alan R. Wagner, David Feil-Seifer, Kerstin S. Haring, Silvia Rossi, Thomas Williams, Hongsheng He, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 12483. Springer International Publishing, Cham, 108–119. DOI:
[163]
Yurii Vasylkiv, Ricardo Ragel, Javier Ponce-Chulani, Luis Merino, Eleanor Sandry, Heike Brock, Keisuke Nakamura, Pourang Irani, and Randy Gomez. 2021. Design and development of a teleoperation system for affective tabletop robot Haru. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 564–573. DOI:
[164]
Dominique Vaufreydaz, Wafa Johal, and Claudine Combe. 2016. Starting engagement detection towards a companion robot using multimodal features. Robotics and Autonomous Systems 75 (Jan. 2016), 4–16. DOI:
[165]
Paul Viola and Michael J. Jones. 2004. Robust real-time face detection. International Journal of Computer Vision 57, 2 (May 2004), 137–154. DOI:
[166]
Xiang-Yang Wang, Jun-Feng Wu, and Hong-Ying Yang. 2010. Robust image retrieval based on color histogram of local feature regions. Multimedia Tools and Applications 49, 2 (Aug. 2010), 323–345. DOI:
[167]
David Whitney, Miles Eldon, John Oberlin, and Stefanie Tellex. 2016. Interpreting multimodal referring expressions in real time. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3331–3338. DOI:
[168]
Jason R. Wilson, Phyo Thuta Aung, and Isabelle Boucher. 2022. When to help? A multimodal architecture for recognizing when a user needs help from a social robot. In Social Robotics. Filippo Cavallo, John-John Cabibihan, Laura Fiorini, Alessandra Sorrentino, Hongsheng He, Xiaorui Liu, Yoshio Matsumoto, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 13817. Springer Nature Switzerland, Cham, 253–266. DOI:
[169]
Luke J. Wood, Abolfazl Zaraki, Ben Robins, and Kerstin Dautenhahn. 2021. Developing Kaspar: A humanoid robot for children with autism. International Journal of Social Robotics 13, 3 (Jun. 2021), 491–508. DOI:
[170]
Kai Wu, Shu Ting Goh, and Andy W. H. Khong. 2013. Speaker localization and tracking in the presence of sound interference by exploiting speech harmonicity. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 365–369. DOI:
[171]
Jorge Wuth, Pedro Correa, Tomás Núñez, Matías Saavedra, and Néstor Becerra Yoma. 2021. The role of speech technology in user perception and context acquisition in HRI. International Journal of Social Robotics 13, 5 (Aug. 2021), 949–968. DOI:
[172]
Lu Xia, Chia-Chih Chen, and Jake K. Aggarwal. 2012. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 20–27. DOI:
[173]
Haibin Yan, Marcelo H. Ang, and Aun Neow Poo. 2014. A survey on perception methods for human–robot interaction in social robots. International Journal of Social Robotics 6, 1 (Jan. 2014), 85–119. DOI:
[174]
Zhi Yan, Tom Duckett, and Nicola Bellotto. 2017. Online learning for human classification in 3D LiDAR-based tracking. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 864–871. DOI:
[175]
Zhi Yan, Tom Duckett, and Nicola Bellotto. 2020. Online learning for 3D LiDAR-based human detection: experimental analysis of point cloud clustering and classification methods. Autonomous Robots 44, 2 (Jan. 2020), 147–164. DOI:
[176]
Zhi Yan, Li Sun, Tom Duckett, and Nicola Bellotto. 2018. Multisensor online transfer learning for 3D LiDAR-based human detection with a mobile robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 7635–7640. DOI:
[177]
Guang-Zhong Yang, Jim Bellingham, Pierre E. Dupont, Peer Fischer, Luciano Floridi, Robert Full, Neil Jacobstein, Vijay Kumar, Marcia McNutt, Robert Merrifield, Bradley J. Nelson, Brian Scassellati, Mariarosaria Taddeo, Russell Taylor, Manuela Veloso, Zhong Lin Wang, and Robert Wood. 2018. The grand challenges of Science Robotics. Science Robotics 3, 14 (Jan. 2018), eaar7650. DOI:
[178]
Mohammad Samin Yasar, Md Mofijul Islam, and Tariq Iqbal. 2023. IMPRINT: Interactional dynamics-aware motion prediction in teams using multimodal context. ACM Transactions on Human-Robot Interaction (Oct. 2023), 3626954. DOI:
[179]
Karim Youssef, Katsutoshi Itoyama, and Kazuyoshi Yoshii. 2017. Simultaneous identification and localization of still and mobile speakers based on binaural robot audition. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 59–71. DOI:
[180]
Zerrin Yumak and Nadia Magnenat-Thalmann. 2016. Multimodal and multi-party social interactions. In Context Aware Human-Robot and Human-Agent Interaction. Nadia Magnenat-Thalmann, Junsong Yuan, Daniel Thalmann, and Bum-Jae You (Eds.), Human–Computer Interaction Series, Springer International Publishing, Cham, 275–298. DOI:
[181]
Zerrin Yumak, Jianfeng Ren, Nadia Magnenat Thalmann, and Junsong Yuan. 2014a. Modelling multi-party interactions among virtual characters, robots, and humans. Presence: Teleoperators and Virtual Environments 23, 2 (Aug. 2014), 172–190. DOI:
[182]
Zerrin Yumak, Jianfeng Ren, Nadia Magnenat Thalmann, and Junsong Yuan. 2014b. Tracking and fusion for multiparty interaction with a virtual character and a social robot. In Proceedings of the SIGGRAPH Asia 2014 Autonomous Virtual Humans and Social Robot for Telepresence. ACM, New York, NY, 1–7. DOI:
[183]
Brian J. Zhang and Naomi T. Fitter. 2023. Nonverbal sound in human-robot interaction: A systematic review. ACM Transactions on Human-Robot Interaction 12, 4 (Feb. 2023), 3583743. DOI:
[184]
Zhao Zhao and Rhonda McEwen. 2022. “Let’s read a book together”: A long-term study on the usage of pre-school children with their home companion robot. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 24–32. DOI:
[185]
Xiao-Hu Zhou, Xiao-Liang Xie, Zhen-Qiu Feng, Zeng-Guang Hou, Gui-Bin Bian, Rui-Qi Li, Zhen-Liang Ni, Shi-Qi Liu, and Yan-Jie Zhou. 2020. A multilayer-multimodal fusion architecture for pattern recognition of natural manipulations in percutaneous coronary interventions. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3039–3045. DOI:
[186]
Athanasia Zlatintsi, Alex C. Dometios, Nikolaos Kardaris, Isidoros Rodomagoulakis, Petros Koutras, Xanthi Papageorgiou, Petros Maragos, Constantinos S. Tzafestas, Panagiotis Vartholomeos, Klaus Hauer, Christian Werner, Roberto Annicchiarico, Matteo G. Lombardi, Francesco Adriano, Tarek Asfour, Andrea Maria Sabatini, Claudia Laschi, Marco Cianchetti, Aylin Guler, Ioannis Kokkinos, Benjamin Klein, and Rodrigo López. 2020. I-Support: A robotic platform of an assistive bathing robot for the elderly population. Robotics and Autonomous Systems 126 (Apr. 2020), 103451. DOI:
[187]
Athanasia Zlatintsi, Isidoros Rodomagoulakis, Vassilis Pitsikalis, Petros Koutras, Nikolaos Kardaris, Xanthi Papageorgiou, Costas Tzafestas, and Petros Maragos. 2017. Social human-robot interaction for the elderly: Two real-life use cases. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 335–336. DOI:
[188]
Mateusz Żarkowski. 2019. Multi-party turn-taking in repeated human–robot interactions: An interdisciplinary evaluation. International Journal of Social Robotics 11, 5 (Dec. 2019), 693–707. DOI:

Cited By

  • (2024) Noncontact perception for assessing pilot mental workload during the approach and landing under various weather conditions. Signal, Image and Video Processing 19, 2. DOI: 10.1007/s11760-024-03619-x. Online publication date: 9-Dec-2024.

Published In

ACM Transactions on Human-Robot Interaction, Volume 13, Issue 4
December 2024, 492 pages
EISSN: 2573-9522
DOI: 10.1145/3613735
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2024
Online AM: 29 April 2024
Accepted: 29 February 2024
Revised: 02 November 2023
Received: 16 May 2022
Published in THRI Volume 13, Issue 4

Author Tags

  1. Human–robot interaction
  2. multimodal perception
  3. situated interaction
  4. social robotics
  5. human social environments

Qualifiers

  • Research-article
